Wang, Penghao; Wilson, Susan R
2013-01-01
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach first infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then uses these sequences in a database search to see whether a close peptide match can be found. However, the current implementations of this integrative approach have several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, in these methods the integrated de novo sequencing makes only a limited contribution to the scoring model, which is still largely based on database searching. We have developed a new integrative protein identification method that integrates de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
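The tag-then-search idea described above can be illustrated compactly. The following is a minimal, hypothetical sketch (not the authors' implementation): a de novo tag is looked up across a toy protein database while tolerating a limited number of substitutions, i.e., the kind of error-tolerant matching the abstract argues BLAST-like integrations lack. The tag, the two protein entries and the mismatch threshold are invented for illustration.

```python
# Hypothetical sketch of error-tolerant tag lookup; data and threshold are invented examples.
def find_tag(tag, proteins, max_mismatches=1):
    """Return (protein_id, position) pairs where the tag matches within max_mismatches."""
    hits = []
    for pid, seq in proteins.items():
        for i in range(len(seq) - len(tag) + 1):
            window = seq[i:i + len(tag)]
            if sum(a != b for a, b in zip(tag, window)) <= max_mismatches:
                hits.append((pid, i))
    return hits

proteins = {"P1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "P2": "MSERGITIDISLWKFETAKYYVT"}
print(find_tag("KSHFSR", proteins))   # exact hit in P1 at position 15
print(find_tag("KSHFAR", proteins))   # still recovered despite one de novo sequencing error
```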
Reliable Detection of Herpes Simplex Virus Sequence Variation by High-Throughput Resequencing.
Morse, Alison M; Calabro, Kaitlyn R; Fear, Justin M; Bloom, David C; McIntyre, Lauren M
2017-08-16
High-throughput sequencing (HTS) has resulted in data for a number of herpes simplex virus (HSV) laboratory strains and clinical isolates. The knowledge of these sequences has been critical for investigating viral pathogenicity. However, the assembly of complete herpesviral genomes, including HSV, is complicated by the existence of large repeat regions and arrays of smaller reiterated sequences that are commonly found in these genomes. In addition, the inherent genetic variation in populations of isolates for viruses and other microorganisms presents an additional challenge to many existing HTS sequence assembly pipelines. Here, we evaluate two approaches for the identification of genetic variants in HSV1 strains using Illumina short read sequencing data. The first, a reference-based approach, identifies variants from reads aligned to a reference sequence, and the second, a de novo assembly approach, identifies variants from reads aligned to de novo assembled consensus sequences. Of critical importance for both approaches is the reduction in the number of low complexity regions through the construction of a non-redundant reference genome. We compared the variants identified by the two approaches. Our results indicate that approximately 85% of variants are identified regardless of the approach. The reference-based approach to variant discovery captures an additional 15%, representing variants divergent from the HSV1 reference, possibly due to viral passage. Reference-based approaches are significantly less labor-intensive and identify variants across the genome, whereas de novo assembly-based approaches are limited to regions where contigs have been successfully assembled. In addition, regions of poor quality assembly can lead to false variant identification in de novo consensus sequences. For viruses with a well-assembled reference genome, a reference-based approach is recommended.
Data compression of discrete sequence: A tree based approach using dynamic programming
NASA Technical Reports Server (NTRS)
Shivaram, Gurusrasad; Seetharaman, Guna; Rao, T. R. N.
1994-01-01
A dynamic programming based approach for data compression of a 1D sequence is presented. The compression of an input sequence of size N to a smaller size k is achieved by dividing the input sequence into k subsequences and replacing each subsequence by its average value. The partitioning of the input sequence is carried out so as to minimize the mean squared error in the reconstructed sequence. The complexity of finding the partition that yields this optimal compressed sequence is reduced by the dynamic programming approach presented here.
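The partitioning step lends itself to a short worked example. Below is a minimal sketch, assuming a squared-error segment cost and the standard O(k·N^2) dynamic program over segment end points; the paper's tree-based formulation may organize the computation differently.

```python
import numpy as np

def compress_1d(x, k):
    """Partition x into k contiguous segments, replace each by its mean, and minimize the
    total squared reconstruction error via dynamic programming (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # prefix sums allow O(1) evaluation of each segment's squared error
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def cost(i, j):  # squared error of replacing x[i:j] by its mean
        m = j - i
        seg_sum = s1[j] - s1[i]
        return (s2[j] - s2[i]) - seg_sum * seg_sum / m

    INF = float("inf")
    dp = np.full((k + 1, n + 1), INF)        # dp[c][j]: best error for x[:j] using c segments
    cut = np.zeros((k + 1, n + 1), dtype=int)
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                e = dp[c - 1][i] + cost(i, j)
                if e < dp[c][j]:
                    dp[c][j], cut[c][j] = e, i
    bounds, j = [n], n                        # backtrack the optimal segment boundaries
    for c in range(k, 0, -1):
        j = cut[c][j]
        bounds.append(j)
    bounds = bounds[::-1]
    recon = np.concatenate([np.full(b - a, x[a:b].mean()) for a, b in zip(bounds, bounds[1:])])
    return bounds, recon

print(compress_1d([1, 1, 2, 8, 9, 9, 4, 4], k=3))   # boundaries and reconstructed sequence
```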
PLAN-IT: Knowledge-Based Mission Sequencing
NASA Astrophysics Data System (ADS)
Biefeld, Eric W.
1987-02-01
Mission sequencing consumes a large amount of time and manpower during a space exploration effort. Plan-It is a knowledge-based approach to assist in mission sequencing. Plan-It uses a combined frame and blackboard architecture. This paper reports on the approach implemented by Plan-It and the current applications of Plan-It for sequencing at NASA.
Ishikawa, Sohta A; Inagaki, Yuji; Hashimoto, Tetsuo
2012-01-01
In phylogenetic analyses of nucleotide sequences, 'homogeneous' substitution models, which assume the stationarity of base composition across a tree, are widely used, even though individual sequences may have distinctive base frequencies. In the worst-case scenario, a homogeneous model-based analysis can yield an artifactual union of two distantly related sequences that achieved similar base frequencies in parallel. This potential difficulty can be countered by two approaches, 'RY-coding' and 'non-homogeneous' models. The former approach converts the four bases into purines and pyrimidines to normalize base frequencies across a tree, while the latter explicitly incorporates the heterogeneity in base frequencies. The two approaches have been applied to real-world sequence data; however, their basic properties have not been fully examined in earlier simulation studies. Here, we assessed the performance of maximum-likelihood analyses incorporating RY-coding and a non-homogeneous model (RY-coding and non-homogeneous analyses) on simulated data with parallel convergence to similar base composition. Both RY-coding and non-homogeneous analyses showed superior performance compared with homogeneous model-based analyses. Curiously, the performance of the RY-coding analysis appeared to be affected significantly more by the setting of the substitution process used for sequence simulation than was the non-homogeneous analysis. The performance of the non-homogeneous analysis was also validated by analyzing a real-world sequence data set with significant base heterogeneity.
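The RY-coding step itself is a simple recoding that can be shown in a few lines; the sketch below assumes the standard purine/pyrimidine assignments and leaves gaps and ambiguity codes untouched (it illustrates only the recoding, not the likelihood analyses).

```python
# Minimal sketch of RY-coding: collapse the four bases into purine (R) / pyrimidine (Y)
# so that base-composition differences across taxa are normalized before tree inference.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}

def ry_code(seq):
    """Return the RY-coded version of a nucleotide sequence; gaps and ambiguity codes pass through."""
    return "".join(RY.get(base, base) for base in seq.upper())

print(ry_code("ATGCCGTA-N"))  # -> "RYRYYRYR-N"
```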
García-Remesal, Miguel; Maojo, Victor; Crespo, José
2010-01-01
In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The candidates are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. On this set, we achieved 87.76% precision and 97.70% recall. This method can facilitate different research tasks, including text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
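As a rough illustration of the first, recognizer stage (a regular expression compiles to a finite state machine), the sketch below pulls candidate nucleotide stretches out of article text; the minimum length and character set are assumptions, and the knowledge-based filtering of false positives is not reproduced here.

```python
import re

# Assumed pattern: runs of nucleotide symbols (and gaps) of at least 15 characters.
CANDIDATE = re.compile(r"[ACGTUN\-]{15,}", re.IGNORECASE)

def extract_candidates(text):
    """Return all candidate DNA/RNA stretches found in free text."""
    return CANDIDATE.findall(text)

text = "as shown for the fragment ATGGCGTACCTTGAACGTTAGC (Fig. 2), sequencing was repeated"
print(extract_candidates(text))   # ['ATGGCGTACCTTGAACGTTAGC']
```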
Shin, Jeong Hong; Jung, Soobin; Ramakrishna, Suresh; Kim, Hyongbum Henry; Lee, Junwon
2018-07-07
Genome editing technology using programmable nucleases has rapidly evolved in recent years. The primary mechanism to achieve precise integration of a transgene is mainly based on homology-directed repair (HDR). However, HDR-based genome editing is less efficient than non-homologous end-joining (NHEJ). Recently, a microhomology-mediated end-joining (MMEJ)-based transgene integration approach was developed, showing feasibility both in vitro and in vivo. We expanded this method to achieve targeted sequence substitution (TSS) of mutated sequences with normal sequences, using double-guide RNAs (gRNAs) and a donor template flanking the microhomologies and target sequence of the gRNAs, in vitro and in vivo. Our method achieved more efficient sequence substitution than the HDR-based method in vitro in a reporter cell line, and led to the survival of a hereditary tyrosinemia mouse model in vivo. The proposed MMEJ-based TSS approach could provide a novel therapeutic strategy, in addition to HDR, to achieve gene correction from a mutated sequence to a normal sequence.
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fleischmann, R.D.; Adams, M.D.; White, O.
1995-07-28
An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.
Zhang, Yiming; Jin, Quan; Wang, Shuting; Ren, Ren
2011-05-01
The mobile behavior of 1481 peptides in ion mobility spectrometry (IMS), generated by protease digestion of the Drosophila melanogaster proteome, is modeled and predicted using two different types of characterization, i.e., a sequence-based approach and a structure-based approach. The sequence-based approach considers both the amino acid composition of a peptide and the local environment profile of each amino acid in the peptide; the structure-based approach is performed with the CODESSA protocol, which regards a peptide as a common organic compound and generates more than 200 statistically significant variables to characterize the whole structure profile of a peptide molecule. Subsequently, nonlinear support vector machine (SVM) and Gaussian process (GP) models, as well as linear partial least squares (PLS) regression, are employed to correlate the structural parameters of the characterizations with the IMS drift times of these peptides. The obtained quantitative structure-spectrum relationship (QSSR) models are evaluated rigorously and investigated systematically via both one-deep and two-deep cross-validations as well as rigorous Monte Carlo cross-validation (MCCV). We also give a comprehensive comparison of the resulting statistics arising from the different combinations of variable types with modeling methods and find that the sequence-based approach gives QSSR models with better fitting ability and predictive power, but worse interpretability, than the structure-based approach. In addition, because QSSR modeling using the sequence-based approach does not require the preparation of energy-minimized peptide structures before modeling, it is considerably more efficient than modeling using the structure-based approach.
The rapid evolution of molecular genetic diagnostics in neuromuscular diseases.
Volk, Alexander E; Kubisch, Christian
2017-10-01
The development of massively parallel sequencing (MPS) has revolutionized molecular genetic diagnostics in monogenic disorders. The present review gives a brief overview of different MPS-based approaches used in clinical diagnostics of neuromuscular disorders (NMDs) and highlights their advantages and limitations. MPS-based approaches such as gene panel sequencing, (whole) exome sequencing, (whole) genome sequencing, and RNA sequencing have been used to identify the genetic cause in NMDs. Although gene panel sequencing has evolved into a standard test for heterogeneous diseases, it is still debated, mainly because of financial issues and unsolved problems of variant interpretation, whether genome sequencing (and to a lesser extent exome sequencing) of single patients can already be regarded as routine diagnostics. However, it has been shown that the inclusion of parents and additional family members often leads to a substantial increase in the diagnostic yield of exome-wide/genome-wide MPS approaches. In addition, MPS-based RNA sequencing is just entering the research and diagnostic scene. Next-generation sequencing increasingly enables the detection of the genetic cause of highly heterogeneous diseases like NMDs in an efficient and affordable way. Gene panel sequencing and family-based exome sequencing have proven to be potent and cost-efficient diagnostic tools. Although clinical validation and interpretation of genome sequencing are still challenging, diagnostic RNA sequencing represents a promising tool to bypass some hurdles of diagnostics using genomic DNA.
Sequence-similar, structure-dissimilar protein pairs in the PDB.
Kosloff, Mickey; Kolodny, Rachel
2008-05-01
It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which "redundant" structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such protein pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular, sequence-based structural superposition identifies a larger number of structurally dissimilar pairs than geometry-based structural alignment. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superposition provides a meaningful measure of structural differences. This approach and geometry-based structural alignment reveal somewhat different information, and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets culled from the PDB based on sequence may mask important structural and functional information. We have established a database of sequence-similar, structurally dissimilar protein pairs that will help address this problem (http://luna.bioc.columbia.edu/rachel/seqsimstrdiff.htm).
Exploring Dance Movement Data Using Sequence Alignment Methods
Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico
2015-01-01
Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers.
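The second step, scoring similarity between symbol sequences, can be illustrated with any standard alignment routine. The sketch below uses a plain Needleman-Wunsch global alignment over QTC state strings; the scoring values and the example state strings are assumptions, not the authors' settings.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two QTC symbol sequences (illustrative parameters)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# two dancers' movement interactions encoded as strings of qualitative QTC states
print(align_score("--0+0++0", "--00+++0"))
```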
Holovachov, Oleksandr
2016-01-01
Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. It therefore requires a high-quality reference dataset. The resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.
K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics.
Lin, Jie; Adjeroh, Donald A; Jiang, Bing-Hua; Jiang, Yue
2018-05-15
Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. We propose a new non-parametric alignment-free sequence comparison method, called K2, based on Kendall statistics. Compared to other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which is able to determine the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or increasing dataset sizes. The K2 and K2* approaches are implemented as an R package and are freely available (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz).
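The abstract does not spell out the statistic, so the following is only an assumed, toy reading of "based on Kendall statistics": compare the k-mer count profiles of two sequences with Kendall's tau (via scipy). The real K2/K2* statistics, normalizations and parameter choices are defined in the paper and will differ.

```python
from itertools import product
from scipy.stats import kendalltau   # assumes scipy is installed

def kmer_counts(seq, k=3, alphabet="ACGT"):
    """Count k-mers in a fixed lexicographic order so two profiles are directly comparable."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {w: 0 for w in kmers}
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in counts:
            counts[w] += 1
    return [counts[w] for w in kmers]

def k2_like_similarity(s1, s2, k=3):
    """Rank correlation (Kendall's tau) between two k-mer profiles; a toy stand-in for K2."""
    tau, _ = kendalltau(kmer_counts(s1, k), kmer_counts(s2, k))
    return tau

print(k2_like_similarity("ACGTACGTGGCATTACG", "ACGTACGAGGCATTTCG", k=2))
```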
Promoter Sequences Prediction Using Relational Association Rule Mining
Czibula, Gabriela; Bocicor, Maria-Iuliana; Czibula, Istvan Gergely
2012-01-01
In this paper we approach, from a computational perspective, the problem of promoter sequence prediction, an important problem within the field of bioinformatics. As the conditions for a DNA sequence to function as a promoter are not known, machine learning based classification models are still being developed to approach the problem of promoter identification in DNA. We propose a classification model based on relational association rule mining. Relational association rules are a particular type of association rules and describe numerical orderings between attributes that commonly occur over a data set. Our classifier is based on the discovery of relational association rules for predicting whether or not a DNA sequence contains a promoter region. An experimental evaluation of the proposed model and a comparison with similar existing approaches are provided. The obtained results show that our classifier outperforms the existing techniques for identifying promoter sequences, confirming the potential of our proposal.
Simulation-Based Evaluation of Learning Sequences for Instructional Technologies
ERIC Educational Resources Information Center
McEneaney, John E.
2016-01-01
Instructional technologies critically depend on systematic design, and learning hierarchies are a commonly advocated tool for designing instructional sequences. But hierarchies routinely allow numerous sequences and choosing an optimal sequence remains an unsolved problem. This study explores a simulation-based approach to modeling learning…
Hajibabaei, Mehrdad; Shokralla, Shadi; Zhou, Xin; Singer, Gregory A. C.; Baird, Donald J.
2011-01-01
Timely and accurate biodiversity analysis poses an ongoing challenge for the success of biomonitoring programs. Morphology-based identification of bioindicator taxa is time-consuming and rarely supports species-level resolution, especially for immature life stages. Much work has been done in the past decade to develop alternative approaches for biodiversity analysis using DNA sequence-based approaches such as molecular phylogenetics and DNA barcoding. The ongoing assembly of DNA barcode reference libraries will provide the basis for a DNA-based identification system. The use of recently introduced next-generation sequencing (NGS) approaches in biodiversity science has the potential to further extend the application of DNA information for routine biomonitoring applications to an unprecedented scale. Here we demonstrate the feasibility of using 454 massively parallel pyrosequencing for species-level analysis of freshwater benthic macroinvertebrate taxa commonly used for biomonitoring. We designed our experiments in order to directly compare morphology-based identification, Sanger sequencing DNA barcoding, and next-generation environmental barcoding approaches. Our results show the ability of 454 pyrosequencing of mini-barcodes to accurately identify all species with more than 1% abundance in the pooled mixture. Although the approach failed to identify 6 rare species in the mixture, the presence of sequences from 9 species that were not represented by individuals in the mixture provides evidence that DNA-based analysis may yet provide a valuable approach for finding rare species in bulk environmental samples. We further demonstrate the application of the environmental barcoding approach by comparing benthic macroinvertebrates from an urban region to those obtained from a conservation area. Although considerable effort will be required to robustly optimize NGS tools to identify species from bulk environmental samples, our results indicate the potential of an environmental barcoding approach for biomonitoring programs.
Design of nucleic acid sequences for DNA computing based on a thermodynamic approach
Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma
2005-01-01
We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between each sequence and its complement and that do not hybridize non-specifically with each other, based on the minimum free energy (ΔGmin). Sequences that satisfy these constraints can be utilized in computations, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate ΔGmin. This effectively excludes inappropriate sequences before ΔGmin is calculated, thereby reducing computation time drastically compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the Hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (ΔGexp) of 126 sequences correlated better with ΔGmin (|R| = 0.90) than with the Hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java.
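A stripped-down generate-and-test loop in the spirit of the abstract is sketched below. The Wallace-rule melting temperature and the complementary-run filter are deliberate simplifications assumed for illustration; the published algorithm evaluates nearest-neighbor free energies (ΔGmin) with a greedy search, which is not reproduced here.

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def wallace_tm(seq):
    """Crude melting temperature estimate: Tm ~ 2(A+T) + 4(G+C) (Wallace rule)."""
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

def has_complementary_run(a, b, k=6):
    """Reject candidates sharing a perfect complementary stretch of length >= k (toy filter)."""
    rc_b = b.translate(COMP)[::-1]
    return any(rc_b[i:i + k] in a for i in range(len(rc_b) - k + 1))

def design(n_seqs=10, length=20, tm_target=60, tm_tol=2, max_tries=100000):
    accepted = []
    for _ in range(max_tries):
        cand = "".join(random.choice("ACGT") for _ in range(length))
        if abs(wallace_tm(cand) - tm_target) > tm_tol:        # uniform-Tm constraint
            continue
        if any(has_complementary_run(cand, s) or has_complementary_run(s, cand)
               for s in accepted):                            # cross-hybridization filter
            continue
        accepted.append(cand)
        if len(accepted) == n_seqs:
            break
    return accepted

print(design())
```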
Fast alignment-free sequence comparison using spaced-word frequencies.
Leimeister, Chris-Andre; Boden, Marcus; Horwege, Sebastian; Lindner, Sebastian; Morgenstern, Burkhard
2014-07-15
Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Our program is freely available at http://spaced.gobics.de/.
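The spaced-word idea itself is easy to make concrete. In the sketch below, a binary pattern marks match (1) and don't-care (0) positions, only the match positions of each window form the word that is counted, and two sequences are compared by a simple distance between their spaced-word frequency vectors; the pattern, the distance and the example sequences are assumptions (the paper uses multiple patterns and its own statistics).

```python
from collections import Counter

def spaced_word_frequencies(seq, pattern="1101001"):
    """Count spaced words defined by the given match/don't-care pattern."""
    counts = Counter()
    w = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        counts["".join(window[j] for j in keep)] += 1
    return counts

def spaced_word_distance(seq_a, seq_b, pattern="1101001"):
    """Euclidean distance between spaced-word frequency vectors (toy comparison)."""
    fa, fb = spaced_word_frequencies(seq_a, pattern), spaced_word_frequencies(seq_b, pattern)
    return sum((fa[w] - fb[w]) ** 2 for w in set(fa) | set(fb)) ** 0.5

print(spaced_word_distance("ACGTACGTACGGTTAC", "ACGTACCTACGGTAAC"))
```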
Oligo/Polynucleotide-Based Gene Modification: Strategies and Therapeutic Potential
Sargent, R. Geoffrey; Kim, Soya
2011-01-01
Oligonucleotide- and polynucleotide-based gene modification strategies were developed as an alternative to transgene-based and classical gene targeting-based gene therapy approaches for treatment of genetic disorders. Unlike the transgene-based strategies, oligo/polynucleotide gene targeting approaches maintain gene integrity and the relationship between the protein coding and gene-specific regulatory sequences. Oligo/polynucleotide-based gene modification also has several advantages over classical vector-based homologous recombination approaches. These include essentially complete homology to the target sequence and the potential to rapidly engineer patient-specific oligo/polynucleotide gene modification reagents. Several oligo/polynucleotide-based approaches have been shown to successfully mediate sequence-specific modification of genomic DNA in mammalian cells. The strategies involve the use of polynucleotide small DNA fragments, triplex-forming oligonucleotides, and single-stranded oligodeoxynucleotides to mediate homologous exchange. The primary focus of this review will be on the mechanistic aspects of the small fragment homologous replacement, triplex-forming oligonucleotide-mediated, and single-stranded oligodeoxynucleotide-mediated gene modification strategies as it relates to their therapeutic potential.
NASA Astrophysics Data System (ADS)
Nie, Xiaokai; Coca, Daniel
2018-01-01
The paper introduces a matrix-based approach to estimate the unique one-dimensional discrete-time dynamical system that generated a given sequence of probability density functions whilst subjected to an additive stochastic perturbation with known density.
Shahin, Arwa; Smulders, Marinus J. M.; van Tuyl, Jaap M.; Arens, Paul; Bakker, Freek T.
2014-01-01
Next Generation Sequencing (NGS) may enable estimating relationships among genotypes using allelic variation of multiple nuclear genes simultaneously. We explored the potential and caveats of this strategy in four genetically distant Lilium cultivars, estimating their genetic divergence from transcriptome sequences using three approaches: POFAD (Phylogeny of Organisms from Allelic Data, which uses the allelic information in sequence data), RAxML (Randomized Accelerated Maximum Likelihood, tree building based on concatenated consensus sequences) and Consensus Network (constructing a network summarizing conflicts among gene trees). Twenty-six gene contigs were chosen based on the presence of orthologous sequences in all cultivars, seven of which also had an orthologous sequence in Tulipa, used as outgroup. The three approaches generated the same topology. Although the resolution offered by these approaches is high, in this case there was no extra benefit in using allelic information. We conclude that these 26 genes can be widely applied to construct a species tree for the genus Lilium.
SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics.
Will, Sebastian; Otto, Christina; Miladi, Milad; Möhl, Mathias; Backofen, Rolf
2015-08-01
RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis, based on simultaneous alignment and folding, suffers from extreme time complexity of O(n^6). Subsequently, numerous faster 'Sankoff-style' approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (quartic time, O(n^4)). Breaking this barrier, we introduce the novel Sankoff-style algorithm 'sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)', which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff's original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics.
Sequence specificity of single-stranded DNA-binding proteins: a novel DNA microarray approach
Morgan, Hugh P.; Estibeiro, Peter; Wear, Martin A.; Max, Klaas E.A.; Heinemann, Udo; Cubeddu, Liza; Gallagher, Maurice P.; Sadler, Peter J.; Walkinshaw, Malcolm D.
2007-01-01
We have developed a novel DNA microarray-based approach for identification of the sequence specificity of single-stranded nucleic-acid-binding proteins (SNABPs). For verification, we have shown that the major cold shock protein (CspB) from Bacillus subtilis binds with high affinity to pyrimidine-rich sequences, with a binding preference for the consensus sequence 5′-GTCTTTG/T-3′. The sequence was modelled onto the known structure of CspB and a cytosine-binding pocket was identified, which explains the strong preference for a cytosine base at position 3. This microarray method offers a rapid, high-throughput approach for determining the specificity and strength of ssDNA–protein interactions. Further screening of this newly emerging family of transcription factors will help provide insight into their cellular function.
An improved and validated RNA HLA class I SBT approach for obtaining full length coding sequences.
Gerritsen, K E H; Olieslagers, T I; Groeneweg, M; Voorter, C E M; Tilanus, M G J
2014-11-01
The functional relevance of human leukocyte antigen (HLA) class I allele polymorphism beyond exons 2 and 3 is difficult to address because more than 70% of HLA class I alleles are defined by exon 2 and 3 sequences only. For routine application to clinical samples, we improved and validated the HLA sequence-based typing (SBT) approach based on RNA templates, using either a single locus-specific or two overlapping group-specific polymerase chain reaction (PCR) amplifications, with three forward and three reverse sequencing reactions for full-length sequencing. Locus-specific HLA typing with RNA SBT of a reference panel, representing the major antigen groups, showed results identical to DNA SBT typing. Alleles with exons recorded as unknown in the IMGT/HLA database, and three samples, two with Null and one with a Low expressed allele, have been addressed by the group-specific RNA SBT approach to obtain full-length coding sequences. This RNA SBT approach has proven its value in our routine full-length definition of alleles.
Characterization of GM events by insert knowledge adapted re-sequencing approaches
Yang, Litao; Wang, Congmao; Holst-Jensen, Arne; Morisset, Dany; Lin, Yongjun; Zhang, Dabing
2013-01-01
Detection methods and data from the molecular characterization of genetically modified (GM) events are needed by stakeholders, including public risk assessors and regulators. Generally, the molecular characteristics of GM events are incompletely revealed by current approaches, which are biased towards detecting transformation-vector-derived sequences. GM events are classified based on available knowledge of the sequences of vectors and inserts (insert knowledge). Herein we present three insert knowledge-adapted approaches for the characterization of GM events (TT51-1 and T1c-19 rice as examples) based on paired-end re-sequencing, with the advantages of comprehensiveness, accuracy, and automation. The comprehensive molecular characteristics of the two rice events were revealed, with additional unintended insertions compared with the results from PCR and Southern blotting. Comprehensive transgene characterization of TT51-1 and T1c-19 is shown to be independent of a priori knowledge of the insert and vector sequences employing the developed approaches. This provides an opportunity to identify and characterize also unknown GM events.
Kocot, Kevin M; Citarella, Mathew R; Moroz, Leonid L; Halanych, Kenneth M
2013-01-01
Molecular phylogenetics relies on accurate identification of orthologous sequences among the taxa of interest. Most orthology inference programs available for use in phylogenomics rely on small sets of pre-defined orthologs from model organisms or phenetic approaches such as all-versus-all sequence comparisons followed by Markov graph-based clustering. Such approaches have high sensitivity but may erroneously include paralogous sequences. We developed PhyloTreePruner, a software utility that uses a phylogenetic approach to refine orthology inferences made using phenetic methods. PhyloTreePruner checks single-gene trees for evidence of paralogy and generates a new alignment for each group containing only sequences inferred to be orthologs. Importantly, PhyloTreePruner takes into account support values on the tree and avoids unnecessarily deleting sequences in cases where a weakly supported tree topology incorrectly indicates paralogy. A test of PhyloTreePruner on a dataset generated from 11 completely sequenced arthropod genomes identified 2,027 orthologous groups sampled for all taxa. Phylogenetic analysis of the concatenated supermatrix yielded a generally well-supported topology that was consistent with the current understanding of arthropod phylogeny. PhyloTreePruner is freely available from http://sourceforge.net/projects/phylotreepruner/.
Pereira, Rui P A; Peplies, Jörg; Brettar, Ingrid; Höfle, Manfred G
2017-03-31
Next Generation Sequencing (NGS) has revolutionized the analysis of natural and man-made microbial communities by using universal primers for bacteria in a PCR-based approach targeting the 16S rRNA gene. In our study we narrowed primer specificity to a single, monophyletic genus, because for many questions in microbiology only a specific part of the whole microbiome is of interest. We chose the genus Legionella, comprising more than 20 pathogenic species, due to its high relevance for water-based respiratory infections. A new NGS-based approach was designed by sequencing 16S rRNA gene amplicons specific for the genus Legionella using the Illumina MiSeq technology. This approach was validated and applied to a set of representative freshwater samples. Our results revealed that the generated libraries presented a low average raw error rate per base (<0.5%) and substantiated the use of high-fidelity enzymes, such as KAPA HiFi, for increased sequence accuracy and quality. The approach also showed high in situ specificity (>95%) and very good repeatability. More than 1% non-Legionella sequences were observed only in samples in which the gammaproteobacterial clade SAR86 was present. Next-generation sequencing read counts did not reveal considerable amplification/sequencing biases and showed a sensitive as well as precise quantification of L. pneumophila along a dilution range using a spiked-in, certified genome standard. The genome standard and a mock community consisting of six different Legionella species demonstrated that the developed NGS approach was quantitative and specific at the level of individual species, including L. pneumophila. The sensitivity of our genus-specific approach was at least one order of magnitude higher than that of the universal NGS approach. Comparison with quantification by real-time PCR showed consistency with the NGS data. Overall, our NGS approach can determine the quantitative abundances of Legionella species, i.e., the complete Legionella microbiome, without the need for species-specific primers. The developed NGS approach provides a new molecular surveillance tool to monitor all Legionella species in qualitative and quantitative terms if a spiked-in genome standard is used to calibrate the method. Overall, the genus-specific NGS approach opens up a new avenue to massively parallel diagnostics in a quantitative, specific and sensitive way.
Uribe-Convers, Simon; Duke, Justin R.; Moore, Michael J.; Tank, David C.
2014-01-01
• Premise of the study: We present an alternative approach for molecular systematic studies that combines long PCR and next-generation sequencing. Our approach can be used to generate templates from any DNA source for next-generation sequencing. Here we test our approach by amplifying complete chloroplast genomes, and we present a set of 58 potentially universal primers for angiosperms to do so. Additionally, this approach is likely to be particularly useful for nuclear and mitochondrial regions. • Methods and Results: Chloroplast genomes of 30 species across angiosperms were amplified to test our approach. Amplification success varied depending on whether PCR conditions were optimized for a given taxon. To further test our approach, some amplicons were sequenced on an Illumina HiSeq 2000. • Conclusions: Although here we tested this approach by sequencing plastomes, long PCR amplicons could be generated using DNA from any genome, expanding the possibilities of this approach for molecular systematic studies.
Reproducibility and quantitation of amplicon sequencing-based detection
Zhou, Jizhong; Wu, Liyou; Deng, Ye; Zhi, Xiaoyang; Jiang, Yi-Huei; Tu, Qichao; Xie, Jianping; Van Nostrand, Joy D; He, Zhili; Yang, Yunfeng
2011-01-01
To determine the reproducibility and quantitation of the amplicon sequencing-based detection approach for analyzing microbial community structure, a total of 24 microbial communities from a long-term global change experimental site were examined. Genomic DNA obtained from each community was used to amplify 16S rRNA genes with two or three barcode tags as technical replicates in the presence of a small quantity (0.1% wt/wt) of genomic DNA from Shewanella oneidensis MR-1 as the control. The technical reproducibility of the amplicon sequencing-based detection approach is quite low, with an average operational taxonomic unit (OTU) overlap of 17.2%±2.3% between two technical replicates, and 8.2%±2.3% among three technical replicates, which is most likely due to problems associated with random sampling processes. Such variations in technical replicates could have substantial effects on estimating β-diversity but less so on α-diversity. A high variation was also observed in the control across different samples (for example, 66.7-fold for the forward primer), suggesting that the amplicon sequencing-based detection approach could not be quantitative. In addition, various strategies were examined to improve the comparability of amplicon sequencing data, such as increasing biological replicates, and removing singleton sequences and less-representative OTUs across biological replicates. Finally, as expected, various statistical analyses with preprocessed experimental data revealed clear differences in the composition and structure of microbial communities between warming and non-warming, or between clipping and non-clipping. Taken together, these results suggest that amplicon sequencing-based detection is useful in analyzing microbial community structure even though it is not reproducible or quantitative. However, great caution should be taken in experimental design and data interpretation when the amplicon sequencing-based detection approach is used for quantitative analysis of the β-diversity of microbial communities.
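For readers unfamiliar with the overlap statistic discussed above, the toy sketch below computes the percentage of OTUs shared between two technical replicates; the OTU sets are invented, and the paper's exact definition of overlap may differ (e.g., it may be normalized per replicate rather than over the union).

```python
def otu_overlap(rep1, rep2):
    """Percentage of OTUs shared between two replicates, relative to all OTUs observed."""
    return 100.0 * len(rep1 & rep2) / len(rep1 | rep2)

rep1 = {"OTU1", "OTU2", "OTU3", "OTU7", "OTU9"}
rep2 = {"OTU2", "OTU3", "OTU5", "OTU9", "OTU11", "OTU12"}
print(f"{otu_overlap(rep1, rep2):.1f}% OTU overlap between the technical replicates")
```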
The advantages of SMRT sequencing.
Roberts, Richard J; Carneiro, Mauricio O; Schatz, Michael C
2013-07-03
Of the current next-generation sequencing technologies, SMRT sequencing is sometimes overlooked. However, attributes such as long reads, modified base detection and high accuracy make SMRT a useful technology and an ideal approach to the complete sequencing of small genomes.
An efficient approach to BAC based assembly of complex genomes.
Visendi, Paul; Berkman, Paul J; Hayashi, Satomi; Golicz, Agnieszka A; Bayer, Philipp E; Ruperao, Pradeep; Hurgobin, Bhavna; Montenegro, Juan; Chan, Chon-Kit Kenneth; Staňková, Helena; Batley, Jacqueline; Šimková, Hana; Doležel, Jaroslav; Edwards, David
2016-01-01
There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data, which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, with variable success. We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from the bread wheat and sugarcane genomes. We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
Motomura, Kenta; Nakamura, Morikazu; Otaki, Joji M.
2013-01-01
Protein structure and function information is coded in amino acid sequences. However, the relationship between primary sequences and three-dimensional structures and functions remains enigmatic. Our approach to this fundamental biochemistry problem is based on the frequencies of short constituent sequences (SCSs) or words. A protein amino acid sequence is considered analogous to an English sentence, where SCSs are equivalent to words. Availability scores, which are defined as real SCS frequencies in the non-redundant amino acid database relative to their probabilistically expected frequencies, demonstrate the biological usage bias of SCSs. As a result, this frequency-based linguistic approach is expected to have diverse applications, such as secondary structure specifications by structure-specific SCSs and immunological adjuvants with rare or non-existent SCSs. Linguistic similarities (e.g., wide ranges of scale-free distributions) and dissimilarities (e.g., behaviors of low-rank samples) between proteins and the natural English language have been revealed in the rank-frequency relationships of SCSs or words. We have developed a web server, the SCS Package, which contains five applications for analyzing protein sequences based on the linguistic concept. These tools have the potential to assist researchers in deciphering structurally and functionally important protein sites, species-specific sequences, and functional relationships between SCSs. The SCS Package also provides researchers with a tool to construct amino acid sequences de novo based on the idiomatic usage of SCSs.
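The availability score described above (observed SCS frequency relative to its expected frequency) can be sketched directly; the tiny "database", the word length and the residue-independence assumption for the expected frequency are illustrative simplifications, not the server's exact computation.

```python
from collections import Counter

def availability(scs, sequences):
    """Observed / expected frequency of a short constituent sequence (SCS) in a sequence set."""
    k = len(scs)
    aa_counts = Counter("".join(sequences))
    total_aa = sum(aa_counts.values())
    observed = sum(seq[i:i + k] == scs
                   for seq in sequences for i in range(len(seq) - k + 1))
    windows = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    expected = windows
    for aa in scs:                          # expected count under residue independence
        expected *= aa_counts[aa] / total_aa
    return observed / expected if expected else float("nan")

db = ["MKTAYIAKQRQISFVKSHFSRQ", "MSERGITIDISLWKFETAKYYV", "MKKTAIAIAVALAGFATVAQA"]
print(availability("AK", db))   # >1 means the word is used more often than expected by chance
```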
Krawitz, Peter M; Schiska, Daniela; Krüger, Ulrike; Appelt, Sandra; Heinrich, Verena; Parkhomchuk, Dmitri; Timmermann, Bernd; Millan, Jose M; Robinson, Peter N; Mundlos, Stefan; Hecht, Jochen; Gross, Manfred
2014-01-01
Usher syndrome is an autosomal recessive disorder characterized by both deafness and blindness. For the three clinical subtypes of Usher syndrome, causal mutations in altogether 12 genes and a modifier gene have been identified. Due to the genetic heterogeneity of Usher syndrome, the molecular analysis is predestined for a comprehensive and parallelized analysis of all known genes by next-generation sequencing (NGS) approaches. We describe here the targeted enrichment and deep sequencing of the exons of Usher genes and compare the costs and workload of this approach with those of Sanger sequencing. We also present a bioinformatics analysis pipeline that allows us to detect single-nucleotide variants, short insertions and deletions, as well as copy number variations of one or more exons, from the same sequence data. Additionally, we present a flexible in silico gene panel for the analysis of sequence variants, in which newly identified genes can easily be included. We applied this approach to a cohort of 44 Usher patients and detected biallelic pathogenic mutations in 35 individuals and monoallelic mutations in eight individuals of our cohort. Thirty-nine of the sequence variants, including two heterozygous deletions comprising several exons of USH2A, have not been reported so far. Our NGS-based approach allowed us to assess single-nucleotide variants, small indels, and whole exon deletions in a single test. The described diagnostic approach is fast and cost-effective, with a high molecular diagnostic yield.
SPHINX--an algorithm for taxonomic binning of metagenomic sequences.
Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Singh, Nitin Kumar; Mande, Sharmila S
2011-01-01
Compared with composition-based binning algorithms, the binning accuracy and specificity of alignment-based binning algorithms are significantly higher. However, being alignment-based, the latter class of algorithms requires an enormous amount of time and computing resources for binning huge metagenomic datasets. The motivation was to develop a binning approach that can analyze metagenomic datasets as rapidly as composition-based approaches, but nevertheless has the accuracy and specificity of alignment-based algorithms. This article describes a hybrid binning approach (SPHINX) that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. Validation results with simulated sequence datasets indicate that SPHINX is able to analyze metagenomic sequences as rapidly as composition-based algorithms. Furthermore, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX is observed to be comparable with results obtained using alignment-based algorithms. A web server for the SPHINX algorithm is available at http://metagenomics.atc.tcs.com/SPHINX/.
GFam: a platform for automatic annotation of gene families.
Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto
2012-10-01
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels, and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from the InterPro annotation provided by its 12 member resources, followed by a sequence-based connected-component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing the annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
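The final family-building step can be pictured as grouping sequences by their consensus domain architecture. The sketch below assumes each sequence already carries a list of (start, end, domain) assignments and simply groups identical ordered architectures; the greedy domain-chaining and the connected-component analysis of un-annotated regions are not reproduced, and the identifiers are invented examples.

```python
from collections import defaultdict

def group_by_architecture(domain_annotations):
    """domain_annotations: {sequence_id: [(start, end, domain_id), ...]} -> families."""
    families = defaultdict(list)
    for seq_id, domains in domain_annotations.items():
        architecture = tuple(d for _, _, d in sorted(domains))   # domains ordered along the sequence
        families[architecture].append(seq_id)
    return dict(families)

annotations = {
    "seq1": [(5, 110, "PF00069"), (150, 260, "PF07714")],
    "seq2": [(2, 100, "PF00069"), (140, 250, "PF07714")],
    "seq3": [(10, 95, "PF00012")],
}
print(group_by_architecture(annotations))
# {('PF00069', 'PF07714'): ['seq1', 'seq2'], ('PF00012',): ['seq3']}
```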
Sanz, Yolanda
2017-01-01
Abstract The miniaturized and portable DNA sequencer MinION™ has demonstrated great potential in different analyses such as genome-wide sequencing, pathogen outbreak detection and surveillance, human genome variability, and microbial diversity. In this study, we tested the ability of the MinION™ platform to perform long amplicon sequencing in order to design new approaches to study microbial diversity using a multi-locus approach. After compiling a robust database by parsing and extracting the rrn bacterial region from more than 67,000 complete or draft bacterial genomes, we demonstrated that the data obtained during sequencing of the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study two mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities. Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, we present a novel approach consisting of multi-locus and long amplicon sequencing using the MinION™ MkIb DNA sequencer and R9 and R9.4 chemistries that helps to overcome the main disadvantage of this portable sequencing platform. Furthermore, the nanopore sequencing library, constructed with the latest releases of pore chemistry (R9.4) and sequencing kit (SQK-LSK108), permitted retrieval of 1D reads of sufficient accuracy to characterize the microbial species present in each mock community analysed. Improvements in nanopore chemistry, such as minimizing base-calling errors, and new library protocols able to produce rapid 1D libraries will provide more reliable information in the near future. Such data will be useful for more comprehensive and faster specific detection of microbial species and strains in complex ecosystems. PMID:28605506
Identifying active foraminifera in the Sea of Japan using metatranscriptomic approach
NASA Astrophysics Data System (ADS)
Lejzerowicz, Franck; Voltsky, Ivan; Pawlowski, Jan
2013-02-01
Metagenetics represents an efficient and rapid tool to describe environmental diversity patterns of microbial eukaryotes based on ribosomal DNA sequences. However, the results of metagenetic studies are often biased by the presence of extracellular DNA molecules that are persistent in the environment, especially in deep-sea sediment. As an alternative, short-lived RNA molecules constitute a good proxy for the detection of active species. Here, we used a metatranscriptomic approach based on RNA-derived (cDNA) sequences to study the diversity of deep-sea benthic foraminifera and compared it to the metagenetic approach. We analyzed 257 ribosomal DNA and cDNA sequences obtained from seven sediment samples collected in the Sea of Japan at depths ranging from 486 to 3665 m. The DNA- and RNA-based approaches gave a similar view of the taxonomic composition of the foraminiferal assemblage, but differed in some important points. First, the cDNA dataset was dominated by sequences of rotaliids and robertiniids, suggesting that these calcareous species, some of which have been observed in Rose Bengal stained samples, are the most active component of the foraminiferal community. Second, the richness of monothalamous (single-chambered) foraminifera was particularly high in DNA extracts from the deepest samples, confirming that this group of foraminifera is abundant but not necessarily very active in deep-sea sediments. Finally, the high divergence of undetermined sequences in the cDNA dataset indicates the limits of our database and the lack of knowledge about some active but possibly rare species. Our study demonstrates the capability of the metatranscriptomic approach to detect active foraminiferal species and prompts its use in future high-throughput sequencing-based environmental surveys.
SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics
Will, Sebastian; Otto, Christina; Miladi, Milad; Möhl, Mathias; Backofen, Rolf
2015-01-01
Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis, based on simultaneous alignment and folding, suffers from extreme time complexity of O(n^6). Subsequently, numerous faster ‘Sankoff-style’ approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics have been limited to high complexity (≥ quartic time). Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm ‘sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)’, which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Whereas all existing lightweight Sankoff-style methods restrict Sankoff’s original model by disallowing loop deletions and insertions, SPARSE is the first to transfer the Sankoff algorithm to the lightweight energy model in full. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurately than RAF, which uses sequence-based heuristics. Availability and implementation: SPARSE is freely available at http://www.bioinf.uni-freiburg.de/Software/SPARSE. Contact: backofen@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25838465
Is multiple-sequence alignment required for accurate inference of phylogeny?
Höhl, Michael; Ragan, Mark A
2007-04-01
The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
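The word-based (k-mer) family of alignment-free methods evaluated here reduces, in its simplest form, to comparing normalized word-frequency vectors; a generic sketch of that distance computation (not the decaf+py package) is shown below.

```python
from collections import Counter
from itertools import combinations, product
from math import dist

def kmer_vector(seq, k=5, alphabet="ACGT"):
    """Normalized frequency of every possible k-mer (word) in one sequence."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in words]

def distance_matrix(seqs, k=5):
    """seqs: {name: sequence}. Returns {(a, b): Euclidean distance} for all pairs;
    the resulting matrix can be fed to any distance-based tree builder."""
    vecs = {name: kmer_vector(s, k) for name, s in seqs.items()}
    return {(a, b): dist(vecs[a], vecs[b]) for a, b in combinations(seqs, 2)}
```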
Validation of Pooled Whole-Genome Re-Sequencing in Arabidopsis lyrata.
Fracassetti, Marco; Griffin, Philippa C; Willi, Yvonne
2015-01-01
Sequencing pooled DNA of multiple individuals from a population instead of sequencing individuals separately has become popular due to its cost-effectiveness and simple wet-lab protocol, although some criticism of this approach remains. Here we validated a protocol for pooled whole-genome re-sequencing (Pool-seq) of Arabidopsis lyrata libraries prepared with low amounts of DNA (1.6 ng per individual). The validation was based on comparing single nucleotide polymorphism (SNP) frequencies obtained by pooling with those obtained by individual-based Genotyping By Sequencing (GBS). Furthermore, we investigated the effect of sample number, sequencing depth per individual and variant caller on population SNP frequency estimates. For Pool-seq data, we compared frequency estimates from two SNP callers, VarScan and Snape; the former employs a frequentist SNP calling approach while the latter uses a Bayesian approach. Results revealed concordance correlation coefficients well above 0.8, confirming that Pool-seq is a valid method for acquiring population-level SNP frequency data. Higher accuracy was achieved by pooling more samples (25 compared to 14) and working with higher sequencing depth (4.1× per individual compared to 1.4× per individual), which increased the concordance correlation coefficient to 0.955. The Bayesian-based SNP caller produced somewhat higher concordance correlation coefficients, particularly at low sequencing depth. We recommend pooling at least 25 individuals combined with sequencing at a depth of 100× to produce satisfactory frequency estimates for common SNPs (minor allele frequency above 0.05).
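The agreement statistic used above, Lin's concordance correlation coefficient, can be computed directly from paired allele-frequency estimates; a minimal numpy sketch under the standard definition (assumed to match the statistic reported) follows. The variable names in the usage comment are hypothetical.

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of paired
    SNP frequency estimates (e.g. Pool-seq vs. individual-based GBS)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# e.g. concordance_ccc(poolseq_freqs, gbs_freqs) should exceed ~0.8 for a valid protocol
```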
Computational approaches to define a human milk metaglycome
Agravat, Sanjay B.; Song, Xuezheng; Rojsajjakul, Teerapat; Cummings, Richard D.; Smith, David F.
2016-01-01
Motivation: The goal of deciphering the human glycome has been hindered by the lack of high-throughput sequencing methods for glycans. Although mass spectrometry (MS) is a key technology in glycan sequencing, MS alone provides limited information about the identification of monosaccharide constituents, their anomericity and their linkages. These features of individual, purified glycans can be partly identified using well-defined glycan-binding proteins, such as lectins and antibodies that recognize specific determinants within glycan structures. Results: We present a novel computational approach to automate the sequencing of glycans using metadata-assisted glycan sequencing, which combines MS analyses with glycan structural information from glycan microarray technology. Success in this approach was aided by the generation of a ‘virtual glycome’ to represent all potential glycan structures that might exist within a metaglycome, based on a set of biosynthetic assumptions using known structural information. We exploited this approach to deduce the structures of soluble glycans within the human milk glycome by matching predicted structures based on experimental data against the virtual glycome. This represents the first metaglycome to be defined using this method, and we provide a publicly available web-based application to aid in sequencing milk glycans. Availability and implementation: http://glycomeseq.emory.edu Contact: sagravat@bidmc.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26803164
An evolution based biosensor receptor DNA sequence generation algorithm.
Kim, Eungyeong; Lee, Malrey; Gatton, Thomas M; Lee, Jaewan; Zang, Yupeng
2010-01-01
A biosensor is composed of a bioreceptor, an associated recognition molecule, and a signal transducer that can selectively detect target substances for analysis. DNA-based biosensors utilize receptor molecules that allow hybridization with the target analyte. However, most DNA biosensor research uses oligonucleotides as the target analytes and does not address the potential problems of real samples. The identification of recognition molecules suitable for real target analyte samples is an important step towards further development of DNA biosensors. This study examines the characteristics of DNA used as bioreceptors and proposes a hybrid evolution-based DNA sequence generating algorithm, based on DNA computing, to identify suitable DNA bioreceptor recognition molecules for stable hybridization with real target substances. The Traveling Salesman Problem (TSP) approach is applied in the proposed algorithm to evaluate the safety and fitness of the generated DNA sequences. The approach improves the efficiency and stability of DNA sequence generation and can be extended to generate variable-length DNA sequences with diverse receptor recognition requirements.
Structator: fast index-based search for RNA sequence-structure patterns
2011-01-01
Background The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results We present a novel method and readily applicable software for time-efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This makes it possible to exploit base-pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns, and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator. PMID:21619640
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.
Mehmood, Tahir; Bohlin, Jon; Snipen, Lars
2015-01-01
The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and start sites in genomic DNA. Motivated by a recently conducted study, in which a multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for classifying true prokaryotic upstream sequences against background upstream sequences. The upstream sequences of coding genes conserved across genomes were considered in the analysis, where the conserved coding genes were found by applying a pan-genomics concept to each prokaryotic species considered. PLS uses a position-specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS-based method were compared with the Gini importance of random forest (RF) and with a support vector machine (SVM), which is a widely used method for sequence classification. The upstream sequence classification performance was evaluated by cross-validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
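A minimal sketch of a PLS-based two-class sequence classifier in this spirit, using scikit-learn's PLSRegression on position-specific one-hot features as a stand-in for the PSSM-derived predictors; the encoding, decision threshold and names are assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

BASES = "ACGT"

def encode(seq):
    """Position-specific one-hot encoding of a fixed-length upstream sequence."""
    return [1.0 if b == c else 0.0 for c in seq for b in BASES]

def fit_pls_classifier(true_upstream, background, n_components=5):
    X = np.array([encode(s) for s in true_upstream + background])
    y = np.array([1.0] * len(true_upstream) + [0.0] * len(background))
    return PLSRegression(n_components=n_components).fit(X, y)

def predict(pls, seqs):
    scores = pls.predict(np.array([encode(s) for s in seqs])).ravel()
    return scores > 0.5   # class 1 = true upstream region
```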
Application of genetic algorithm in integrated setup planning and operation sequencing
NASA Astrophysics Data System (ADS)
Kafashi, Sajad; Shakeri, Mohsen
2011-01-01
Process planning is an essential component for linking design and manufacturing processes. Setup planning and operation sequencing are two main tasks in process planning. Much research has addressed these two problems separately. Considering the fact that the two functions are complementary, it is necessary to integrate them more tightly so that the performance of a manufacturing system can be improved economically and competitively. This paper presents a generative system and a genetic algorithm (GA) approach to process planning for a given part. The proposed approach and optimization methodology analyse the tool approach directions (TADs), tolerance relations between features, and feature precedence relations to generate all possible setups and operations using a workshop resource database. Based on these technological constraints, the GA approach, which adopts a feature-based representation, optimizes the setup plan and the sequence of operations using cost indices. A case study shows that the developed system can generate satisfactory results, optimizing setup planning and operation sequencing simultaneously under feasible conditions.
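A compact, generic GA sketch for the operation-sequencing part alone: permutation chromosomes are repaired to respect feature precedence relations (assumed acyclic) and the fitness penalizes tool-approach-direction changes as a proxy for setup cost. The encoding, repair step and cost term are illustrative assumptions, not the authors' system.

```python
import random

def repair(order, precedence):
    """Reorder a permutation so that every (before, after) pair in `precedence` holds."""
    fixed, remaining = [], list(order)
    while remaining:
        for op in remaining:
            if all(b in fixed for (b, a) in precedence if a == op):
                fixed.append(op)
                remaining.remove(op)
                break
        else:
            raise ValueError("precedence constraints are cyclic")
    return fixed

def cost(order, tad):
    """Number of setup changes = transitions between different tool approach directions."""
    return sum(1 for a, b in zip(order, order[1:]) if tad[a] != tad[b])

def ga_sequence(ops, precedence, tad, pop=40, gens=200):
    population = [repair(random.sample(ops, len(ops)), precedence) for _ in range(pop)]
    for _ in range(gens):
        parent = min(random.sample(population, 3), key=lambda o: cost(o, tad))  # tournament
        child = parent[:]
        i, j = sorted(random.sample(range(len(ops)), 2))
        child[i], child[j] = child[j], child[i]          # swap mutation
        child = repair(child, precedence)
        population.remove(max(population, key=lambda o: cost(o, tad)))
        population.append(child)
    return min(population, key=lambda o: cost(o, tad))
```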
A new strategy for genome assembly using short sequence reads and reduced representation libraries.
Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H
2010-02-01
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assemble the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the currently available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
Large-Scale Biomonitoring of Remote and Threatened Ecosystems via High-Throughput Sequencing
Gibson, Joel F.; Shokralla, Shadi; Curry, Colin; Baird, Donald J.; Monk, Wendy A.; King, Ian; Hajibabaei, Mehrdad
2015-01-01
Biodiversity metrics are critical for assessment and monitoring of ecosystems threatened by anthropogenic stressors. Existing sorting and identification methods are too expensive and labour-intensive to be scaled up to meet management needs. Alternately, a high-throughput DNA sequencing approach could be used to determine biodiversity metrics from bulk environmental samples collected as part of a large-scale biomonitoring program. Here we show that both morphological and DNA sequence-based analyses are suitable for recovery of individual taxonomic richness, estimation of proportional abundance, and calculation of biodiversity metrics using a set of 24 benthic samples collected in the Peace-Athabasca Delta region of Canada. The high-throughput sequencing approach was able to recover all metrics with a higher degree of taxonomic resolution than morphological analysis. The reduced cost and increased capacity of DNA sequence-based approaches will finally allow environmental monitoring programs to operate at the geographical and temporal scale required by industrial and regulatory end-users. PMID:26488407
TotalReCaller: improved accuracy and performance via integrated alignment and base-calling.
Menges, Fabian; Narzisi, Giuseppe; Mishra, Bud
2011-09-01
Currently, re-sequencing approaches use multiple modules serially to interpret raw sequencing data from next-generation sequencing platforms, while remaining oblivious to the genomic information until the final alignment step. Such approaches fail to exploit the full information from both raw sequencing data and the reference genome that can yield better quality sequence reads, SNP calls, variant detection, as well as an alignment at the best possible location in the reference genome. Thus, there is a need for novel reference-guided bioinformatics algorithms for interpreting analog signals representing sequences of the bases ({A, C, G, T}), while simultaneously aligning possible sequence reads to a source reference genome whenever available. Here, we propose a new base-calling algorithm, TotalReCaller, to achieve improved performance. A linear error model for the raw intensity data and Burrows-Wheeler transform (BWT) based alignment are combined utilizing a Bayesian score function, which is then globally optimized over all possible genomic locations using an efficient branch-and-bound approach. The algorithm has been implemented in software and in hardware (field-programmable gate array, FPGA) to achieve real-time performance. Empirical results on real high-throughput Illumina data were used to evaluate TotalReCaller's performance relative to its peers (Bustard, BayesCall, Ibis and Rolexa) based on several criteria, particularly those important in clinical and scientific applications. Namely, it was evaluated for (i) its base-calling speed and throughput, (ii) its read accuracy and (iii) its specificity and sensitivity in variant calling. A software implementation of TotalReCaller, as well as additional information, is available at http://bioinformatics.nyu.edu/wordpress/projects/totalrecaller/. Contact: fabian.menges@nyu.edu.
PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data
NASA Astrophysics Data System (ADS)
Deneke, Carlus; Rentzsch, Robert; Renard, Bernhard Y.
2017-01-01
The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotypes. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with a human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction, whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.
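The general idea can be sketched as a random forest trained on per-read k-mer frequencies with pathogenic/non-pathogenic labels; the feature set and names below are simplifying assumptions, not PaPrBaG's actual feature definitions.

```python
from collections import Counter
from itertools import product
from sklearn.ensemble import RandomForestClassifier

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def read_features(read):
    """Normalized 4-mer frequencies of one sequencing read."""
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in KMERS]

def train_pathogenicity_model(reads, labels):
    """reads: list of DNA strings; labels: 1 = pathogenic source, 0 = non-pathogenic."""
    X = [read_features(r) for r in reads]
    return RandomForestClassifier(n_estimators=200, n_jobs=-1).fit(X, labels)

# A prediction is always returned for every read, even without a close reference:
# model.predict_proba([read_features(new_read)])
```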
KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation.
Wang, Dapeng; Xu, Jiayue; Yu, Jun
2015-09-16
The K-mer approach, treating genomic sequences as simple characters and counting the relative abundance of each string for a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison. To meet increasing demands for comparing large genome sequences and to promote the use of the K-mer approach, we have developed a versatile database, KGCAK (http://kgcak.big.ac.cn/KGCAK/), containing ~8,000 genomes that include genome sequences of diverse life forms (viruses, prokaryotes, protists, animals, and plants) and cellular organelles of eukaryotic lineages. It builds phylogenies based on genomic elements in an alignment-free fashion and provides in-depth data processing, enabling users to compare the complexity of genome sequences based on K-mer distributions. We hope that KGCAK becomes a powerful tool for exploring relationships within and among groups of species in a tree of life based on genomic data.
Sequence comparison alignment-free approach based on suffix tree and L-words frequency.
Soares, Inês; Goios, Ana; Amorim, António
2012-01-01
The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts with the computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words of a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-word frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix that can be used to generate a neighbor-joining dendrogram or a multidimensional scaling graph. We present an improvement to word-counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word-counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in the Python language and is freely available on the web.
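The final visualization step can be reproduced generically: given the symmetric L-word distance matrix, a multidimensional scaling embedding can be obtained with scikit-learn's MDS on a precomputed dissimilarity. A sketch under that assumption, not the authors' code:

```python
import numpy as np
from sklearn.manifold import MDS

def mds_plot_coords(distance_matrix, seed=0):
    """distance_matrix: symmetric (n x n) array of pairwise L-word Euclidean distances.
    Returns 2-D coordinates suitable for a scatter plot of the sequences."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(np.asarray(distance_matrix))
```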
Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E
2013-08-15
Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
Model compilation: An approach to automated model derivation
NASA Technical Reports Server (NTRS)
Keller, Richard M.; Baudin, Catherine; Iwasaki, Yumi; Nayak, Pandurang; Tanaka, Kazuo
1990-01-01
An approach is introduced to automated model derivation for knowledge based systems. The approach, model compilation, involves procedurally generating the set of domain models used by a knowledge based system. An implemented example illustrates how this approach can be used to derive models of different precision and abstraction, tailored to different tasks, from a given set of base domain models. In particular, two implemented model compilers are described, each of which takes as input a base model that describes the structure and behavior of a simple electromechanical device, the Reaction Wheel Assembly of NASA's Hubble Space Telescope. The compilers transform this relatively general base model into simple task-specific models for troubleshooting and redesign, respectively, by applying a sequence of model transformations. Each transformation in this sequence produces an increasingly more specialized model. The compilation approach lessens the burden of updating and maintaining consistency among models by enabling their automatic regeneration.
Eastman, Alexander W.; Yuan, Ze-Chun
2015-01-01
Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing, despite the large numbers of published genomes. The shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, all located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangements generated during the initial in silico assembly of sequencing reads. Our approach reduces the effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provide a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing projects. PMID:25653642
Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, A R
2016-11-05
DNA barcoding is a molecular diagnostic method that allows automated and accurate identification of species based on a short and standardized fragment of DNA. To this end, an attempt has been made in this study to develop a computational approach for identifying a species by comparing its barcode with the barcode sequences of known species present in a reference library. Each barcode sequence was first mapped onto a numeric feature vector based on k-mer frequencies, and Random Forest methodology was then employed on the transformed dataset for species identification. The proposed approach outperformed similarity-based, tree-based and diagnostic-based approaches and was found comparable with existing supervised-learning-based approaches in terms of species identification success rate when compared using real and simulated datasets. Based on the proposed approach, an online web interface, SPIDBAR, has also been developed and made freely available at http://cabgrid.res.in:8080/spidbar/ for species identification by taxonomists. Copyright © 2016 Elsevier B.V. All rights reserved.
Guiding principles for peptide nanotechnology through directed discovery.
Lampel, A; Ulijn, R V; Tuttle, T
2018-05-21
Life's diverse molecular functions are largely based on only a small number of highly conserved building blocks - the twenty canonical amino acids. These building blocks are chemically simple, but when they are organized in three-dimensional structures of tremendous complexity, new properties emerge. This review explores recent efforts in the directed discovery of functional nanoscale systems and materials based on these same amino acids, but that are not guided by copying or editing biological systems. The review summarises insights obtained using three complementary approaches of searching the sequence space to explore sequence-structure relationships for assembly, reactivity and complexation, namely: (i) strategic editing of short peptide sequences; (ii) computational approaches to predicting and comparing assembly behaviours; (iii) dynamic peptide libraries that explore the free energy landscape. These approaches give rise to guiding principles on controlling order/disorder, complexation and reactivity by peptide sequence design.
FASH: A web application for nucleotides sequence search.
Veksler-Lublinksy, Isana; Barash, Danny; Avisar, Chai; Troim, Einav; Chew, Paul; Kedem, Klara
2008-05-27
FASH (Fourier Alignment Sequence Heuristics) is a web application, based on the Fast Fourier Transform, for finding remote homologs within a long nucleic acid sequence. Given a query sequence and a long text sequence (e.g., the human genome), FASH detects subsequences within the text that are remotely similar to the query. FASH offers an alternative approach to Blast/Fasta for querying long RNA/DNA sequences. FASH differs from these other approaches in that it does not depend on the existence of contiguous seed sequences in its initial detection phase. The FASH web server is user friendly and very easy to operate. FASH can be accessed at https://fash.bgu.ac.il:8443/fash/default.jsp (secured website).
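The core FFT trick can be illustrated in a few lines: per-nucleotide indicator vectors are correlated via FFT so that the number of matching positions is obtained for every alignment offset at once, with no reliance on contiguous seeds. A generic numpy sketch, not the FASH implementation:

```python
import numpy as np

def match_counts(text, query):
    """Number of matching positions for every alignment offset of `query` along `text`
    (assumes len(text) >= len(query)), computed with one FFT correlation per nucleotide
    instead of an explicit sliding comparison."""
    n, m = len(text), len(query)
    size = 1 << (n + m - 1).bit_length()                    # FFT length (power of two)
    total = np.zeros(n - m + 1)
    for base in "ACGT":
        t = np.array([c == base for c in text], float)
        q = np.array([c == base for c in query], float)[::-1]   # reverse -> correlation
        conv = np.fft.irfft(np.fft.rfft(t, size) * np.fft.rfft(q, size), size)
        total += conv[m - 1 : n]                            # one value per valid offset
    return np.rint(total).astype(int)
```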
Recognizing human actions by learning and matching shape-motion prototype trees.
Jiang, Zhuolin; Lin, Zhe; Davis, Larry S
2012-03-01
A shape-motion prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, an action prototype tree is learned in a joint shape and motion space via hierarchical K-means clustering and each training sequence is represented as a labeled prototype sequence; then a look-up table of prototype-to-prototype distances is generated. During testing, based on a joint probability model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint probability, which is efficiently performed by searching the learned prototype tree; then actions are recognized using dynamic prototype sequence matching. Distance measures used for sequence matching are rapidly obtained by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in challenging situations (such as moving cameras, dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 92.86 percent on a large gesture data set (with dynamic backgrounds), 100 percent on the Weizmann action data set, 95.77 percent on the KTH action data set, 88 percent on the UCF sports data set, and 87.27 percent on the CMU action data set.
Jiménez-Hernández, Hugo; González-Barbosa, Jose-Joel; Garcia-Ramírez, Teresa
2010-01-01
This investigation demonstrates an unsupervised approach for modeling traffic flow and detecting abnormal vehicle behaviors at intersections. In the first stage, the approach reveals and records the different states of the system. These states are the result of coding and grouping the historical motion of vehicles as long binary strings. In the second stage, using sequences of the recorded states, a stochastic graph model based on a Markovian approach is built. A behavior is labeled abnormal when current motion pattern cannot be recognized as any state of the system or a particular sequence of states cannot be parsed with the stochastic model. The approach is tested with several sequences of images acquired from a vehicular intersection where the traffic flow and duration used in connection with the traffic lights are continuously changed throughout the day. Finally, the low complexity and the flexibility of the approach make it reliable for use in real time systems. PMID:22163616
Zhou, Mu; Zhang, Qiao; Xu, Kunjie; Tian, Zengshan; Wang, Yanmeng; He, Wei
2015-01-01
Due to the wide deployment of wireless local area networks (WLAN), received signal strength (RSS)-based indoor WLAN localization has attracted considerable attention in both academia and industry. In this paper, we propose a novel page rank-based indoor mapping and localization (PRIMAL) by using the gene-sequenced unlabeled WLAN RSS for simultaneous localization and mapping (SLAM). Specifically, first of all, based on the observation of the motion patterns of the people in the target environment, we use the Allen logic to construct the mobility graph to characterize the connectivity among different areas of interest. Second, the concept of gene sequencing is utilized to assemble the sporadically-collected RSS sequences into a signal graph based on the transition relations among different RSS sequences. Third, we apply the graph drawing approach to exhibit both the mobility graph and signal graph in a more readable manner. Finally, the page rank (PR) algorithm is proposed to construct the mapping from the signal graph into the mobility graph. The experimental results show that the proposed approach achieves satisfactory localization accuracy and meanwhile avoids the intensive time and labor cost involved in the conventional location fingerprinting-based indoor WLAN localization. PMID:26404274
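The page-rank step itself is standard power iteration; a minimal sketch is given below, assuming that the earlier stages (the Allen-logic mobility graph and the gene-sequencing-style assembly of RSS sequences) have already produced an adjacency matrix between signal-graph nodes.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank over a directed graph given as an (n x n)
    adjacency matrix (e.g. transition counts between RSS-sequence nodes)."""
    A = np.asarray(adjacency, float)
    n = A.shape[0]
    row_sums = A.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes get a uniform row.
    P = np.divide(A, row_sums, out=np.full_like(A, 1.0 / n), where=row_sums > 0)
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = (1 - damping) / n + damping * rank @ P
    return rank
```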
Chen, Jin-Jin; Zhao, Qing-Sheng; Liu, Yi-Lan; Zha, Sheng-Hua; Zhao, Bing
2015-09-01
Maca (Lepidium meyenii) is an herbaceous plant that grows in high plateaus and has been used as both food and folk medicine for centuries because of its benefits to human health. In the present study, ITS (internal transcribed spacer) sequences of forty-three maca samples, collected from different regions or vendors, were amplified and analyzed. The ITS sequences of nineteen potential adulterants of maca were also collected and analyzed. The results indicated that the ITS sequence of maca was consistent in all samples and unique when compared with its adulterants. Therefore, this DNA-barcoding approach based on the ITS sequence can be used for the molecular identification of maca and its adulterants. Copyright © 2015 China Pharmaceutical University. Published by Elsevier B.V. All rights reserved.
Patterns and Sequences: Interactive Exploration of Clickstreams to Understand Common Visitor Paths.
Liu, Zhicheng; Wang, Yang; Dontcheva, Mira; Hoffman, Matthew; Walker, Seth; Wilson, Alan
2017-01-01
Modern web clickstream data consists of long, high-dimensional sequences of multivariate events, making it difficult to analyze. Following the overarching principle that the visual interface should provide information about the dataset at multiple levels of granularity and allow users to easily navigate across these levels, we identify four levels of granularity in clickstream analysis: patterns, segments, sequences and events. We present an analytic pipeline consisting of three stages: pattern mining, pattern pruning and coordinated exploration between patterns and sequences. Based on this approach, we discuss properties of maximal sequential patterns, propose methods to reduce the number of patterns and describe design considerations for visualizing the extracted sequential patterns and the corresponding raw sequences. We demonstrate the viability of our approach through an analysis scenario and discuss the strengths and limitations of the methods based on user feedback.
Biological intuition in alignment-free methods: response to Posada.
Ragan, Mark A; Chan, Cheong Xin
2013-08-01
A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
Yu, Xiaoyu; Reva, Oleg N
2018-01-01
Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the inappropriateness of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models suited to alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches, including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enhanced by providing the program with alignments of marker protein sequences, e.g., GyrA. PMID:29511354
Ries, David; Holtgräwe, Daniela; Viehöver, Prisca; Weisshaar, Bernd
2016-03-15
The combination of bulk segregant analysis (BSA) and next generation sequencing (NGS), also known as mapping by sequencing (MBS), has been shown to significantly accelerate the identification of causal mutations for species with a reference genome sequence. The usual approach is to cross homozygous parents that differ in the monogenic trait of interest, to perform deep sequencing of DNA from F2 plants pooled according to their phenotype, and subsequently to analyze the allele frequency distribution based on a marker table for the parents studied. The method has been successfully applied to EMS-induced mutations as well as natural variation. Here, we show that pooling genetically diverse breeding lines according to a contrasting phenotype also allows high resolution mapping of the causal gene in a crop species. The test case was the monogenic locus causing red vs. green hypocotyl color in Beta vulgaris (R locus). We determined the allele frequencies of polymorphic sequences using sequence data from two diverging phenotypic pools of 180 B. vulgaris accessions each. A single interval of about 31 kbp among the nine chromosomes was identified which indeed contained the causative mutation. By applying a variation of the mapping by sequencing approach, we demonstrated that phenotype-based pooling of diverse accessions from breeding panels and subsequent direct determination of the allele frequency distribution can be successfully applied for gene identification in a crop species. Our approach made it possible to identify a small interval around the causative gene. Sequencing of parents or individual lines was not necessary. Whenever the appropriate plant material is available, the approach described saves time compared to the generation of an F2 population. In addition, we provide clues for planning similar experiments with regard to pool size and the sequencing depth required.
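The downstream analysis amounts to scanning allele-frequency differences between the two phenotypic pools along each chromosome and reporting the window with the largest difference; a minimal pandas sketch under an assumed marker-table layout (the column names are illustrative) follows.

```python
import pandas as pd

def candidate_interval(markers, window=2_000_000, step=100_000):
    """markers: DataFrame with columns chrom, pos, freq_red_pool, freq_green_pool
    (frequencies of the same SNP allele in the two phenotypic pools).
    Returns (chrom, start, end, mean_delta) for the window with the largest
    mean absolute allele-frequency difference."""
    markers = markers.assign(delta=(markers.freq_red_pool - markers.freq_green_pool).abs())
    best = None
    for chrom, chunk in markers.groupby("chrom"):
        for start in range(0, int(chunk.pos.max()), step):
            win = chunk[(chunk.pos >= start) & (chunk.pos < start + window)]
            if len(win) and (best is None or win.delta.mean() > best[3]):
                best = (chrom, start, start + window, win.delta.mean())
    return best
```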
A statistical method for the detection of variants from next-generation resequencing of DNA pools.
Bansal, Vikas
2010-06-15
Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
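The first of CRISP's two tests, comparing allele counts across pools with a contingency table, can be sketched with scipy; strand checks, pool sizes and the sequencing-error probability component that CRISP combines with it are omitted here.

```python
from scipy.stats import chi2_contingency

def cross_pool_test(ref_counts, alt_counts):
    """ref_counts/alt_counts: per-pool reference and alternate read counts at one site.
    A small p-value indicates the alternate allele is unevenly distributed across
    pools, i.e. unlikely to be explained by a uniform sequencing-error rate alone."""
    table = [list(ref_counts), list(alt_counts)]   # 2 x n_pools contingency table
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value

# e.g. cross_pool_test([95, 98, 40, 97], [5, 2, 60, 3]) -> very small p-value
```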
Mans, Ben J; Pienaar, Ronel; Ratabane, John; Pule, Boitumelo; Latif, Abdalla A
2016-07-01
Molecular classification and systematics of the Theileria is based on the analysis of the 18S rRNA gene. Reverse line blot and conventional sequencing approaches have disadvantages for the study of 18S rRNA diversity, so a next-generation 454 sequencing approach was investigated. The 18S rRNA gene was amplified using RLB primers coupled to 96 unique sequence identifiers (MIDs). Theileria-positive samples from African buffalo (672) and cattle (480) from southern Africa were combined in batches of 96 and sequenced using the GS Junior 454 sequencer to produce 825,711 informative sequences. Sequences were extracted based on MIDs and analysed to identify Theileria genotypes. Genotypes observed in buffalo and cattle were confirmed in the current study, while no new genotypes were discovered. Genotypes showed specific geographic distributions, most probably linked with vector distributions. Host specificity of buffalo- and cattle-specific genotypes was confirmed, and prevalence data as well as relative parasitemia trends indicate preference for different hosts. Mixed infections are common, with African buffalo carrying more genotypes than cattle. Associative or exclusion co-infection profiles were observed between genotypes, which may have implications for speciation and systematics: specifically, more Theileria species may exist in cattle and buffalo than currently recognized. Analysis of primers used for Theileria parva diagnostics indicates that no new genotypes will be amplified by the current primer sets, confirming their specificity. T. parva SNP variants that occur in the 18S rRNA hypervariable region were confirmed. A next-generation sequencing approach is useful for obtaining comprehensive knowledge regarding 18S rRNA diversity and prevalence for the Theileria, allowing for the assessment of systematics and diagnostic assays based on the 18S gene. Copyright © 2016 Elsevier GmbH. All rights reserved.
PMS2 gene mutational analysis: direct cDNA sequencing to circumvent pseudogene interference.
Wimmer, Katharina; Wernstedt, Annekatrin
2014-01-01
The presence of highly homologous pseudocopies can compromise the mutation analysis of a gene of interest. In particular, when using PCR-based strategies, pseudogene co-amplification has to be effectively prevented. This is often achieved by using primers designed to be parental-gene specific according to the reference sequence and by applying stringent PCR conditions. However, there are cases in which this approach is of limited utility. For example, it has been shown that the PMS2 gene exchanges sequences with one of its pseudogenes, named PMS2CL. This results in functional PMS2 alleles containing pseudogene-derived sequences at their 3'-end and in nonfunctional PMS2CL pseudogene alleles that contain gene-derived sequences. Hence, the paralogues cannot be distinguished according to the reference sequence. This shortcoming can be effectively circumvented by using direct cDNA sequencing. The approach is based on the selective amplification of PMS2 transcripts in two overlapping 1.6-kb RT-PCR products. In addition to avoiding pseudogene co-amplification and allele dropout, this method also has the advantage of effectively identifying deletions, splice mutations, and de novo retrotransposon insertions that escape detection by most DNA-based mutation analysis protocols.
A novel and efficient technique for identification and classification of GPCRs.
Gupta, Ravi; Mittal, Ankush; Singh, Kuldip
2008-07-01
G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as regulation of growth, death, and metabolism of cells. GPCRs are the focus of a significant amount of current pharmaceutical research since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify GPCRs. However, this approach has two major disadvantages. First, the dimension of the dipeptide-based feature vector is 400; this large dimension makes the classification task inefficient in both computation and memory. Second, it does not consider the biological properties of the protein sequence for identification and classification of GPCRs. In this paper, we present a novel feature-based SVM classification technique. The novel features are derived by applying a wavelet-based time series analysis approach to protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of amino acids in a protein sequence. In addition, the dimension of the feature vector for the proposed technique is 35. Experiments were performed on GPCR protein sequences available at the GPCRs Database. Our approach achieves an accuracy of 99.9%, 98.06%, 97.78%, and 94.08% for GPCR superfamily, families, subfamilies, and sub-subfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, an accuracy of 99.8%, 97.26%, and 97.84% was obtained when evaluated on unseen or recall datasets of GPCR superfamily, families, and subfamilies, respectively. Comparison with the dipeptide-based SVM technique shows the effectiveness of our approach.
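A minimal sketch of the feature construction described, assuming the PyWavelets package and using a single illustrative property scale (Kyte-Doolittle hydrophobicity) instead of the seven properties used in the paper: per-level variances of the wavelet decomposition of the property signal feed an SVM.

```python
import numpy as np
import pywt
from sklearn.svm import SVC

# Kyte-Doolittle hydrophobicity as one illustrative amino-acid property scale
HYDRO = dict(zip("AILMFWVCYHTSPGNDQEKR",
                 [1.8, 4.5, 3.8, 1.9, 2.8, -0.9, 4.2, 2.5, -1.3, -3.2,
                  -0.7, -0.8, -1.6, -0.4, -3.5, -3.5, -3.5, -3.5, -3.9, -4.5]))

def wavelet_features(protein, levels=4):
    """Variance of the wavelet coefficients at each decomposition level of the
    hydrophobicity signal -> fixed-length feature vector per sequence."""
    signal = np.array([HYDRO.get(a, 0.0) for a in protein])
    coeffs = pywt.wavedec(signal, "db1", level=levels)
    return [float(np.var(c)) for c in coeffs]

def train_gpcr_classifier(sequences, labels):
    X = [wavelet_features(s) for s in sequences]
    return SVC(kernel="rbf").fit(X, labels)
```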
Discriminative Prediction of A-To-I RNA Editing Events from DNA Sequence
Sun, Jiangming; Singh, Pratibha; Bagge, Annika; Valtat, Bérengère; Vikman, Petter; Spégel, Peter; Mulder, Hindrik
2016-01-01
RNA editing is a post-transcriptional alteration of RNA sequences that, via insertions, deletions or base substitutions, can affect protein structure as well as RNA and protein expression. Recently, it has been suggested that RNA editing may be more frequent than previously thought. A great impediment, however, to a deeper understanding of this process is the enormous sequencing effort that needs to be undertaken to identify RNA editing events. Here, we describe an in silico approach, based on machine learning, that ameliorates this problem. Using 41-nucleotide-long DNA sequences, we show that novel A-to-I RNA editing events can be predicted from known A-to-I RNA editing events intra- and interspecies. The validity of the proposed method was verified in an independent experimental dataset. Using our approach, 203,202 putative A-to-I RNA editing events were predicted in the whole human genome. Of these, 9% were previously reported. The remaining sites require further validation, e.g., by targeted deep sequencing. In conclusion, the approach described here is a useful tool to identify potential A-to-I RNA editing events without the requirement of extensive RNA sequencing. PMID:27764195
A long-term target detection approach in infrared image sequence
NASA Astrophysics Data System (ADS)
Li, Hang; Zhang, Qi; Wang, Xin; Hu, Chao
2016-10-01
An automatic target detection method for use in long-term infrared (IR) image sequences from a moving platform is proposed. Firstly, based on POME (the principle of maximum entropy), target candidates are iteratively segmented. Then the real target is captured via two different selection approaches. At the beginning of the image sequence, the genuine target with little texture is discriminated from other candidates by using a contrast-based confidence measure. On the other hand, when the target becomes larger, we apply an online EM method to estimate and update the distributions of the target's size and position based on the prior detection results, and then recognize the genuine one which satisfies both the size and position constraints. Experimental results demonstrate that the presented method is accurate, robust and efficient.
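A minimal sketch of maximum-entropy (Kapur-style) threshold selection on an image histogram, which is one common way to realize POME-based segmentation; the authors' iterative candidate segmentation is not specified in the abstract, so this is only a conceptual stand-in.

```python
import numpy as np

def max_entropy_threshold(image, nbins=256):
    """Return the gray level that maximizes background + foreground entropy (Kapur's criterion)."""
    hist, _ = np.histogram(image.ravel(), bins=nbins, range=(0, nbins))
    p = hist.astype(float) / hist.sum()
    best_t, best_h = 0, -np.inf
    for t in range(1, nbins - 1):
        p_b, p_f = p[:t].sum(), p[t:].sum()
        if p_b <= 0 or p_f <= 0:
            continue
        pb = p[:t][p[:t] > 0] / p_b
        pf = p[t:][p[t:] > 0] / p_f
        h = -(pb * np.log(pb)).sum() - (pf * np.log(pf)).sum()
        if h > best_h:
            best_t, best_h = t, h
    return best_t

frame = (np.random.rand(64, 64) * 255).astype(np.uint8)   # stand-in IR frame
print(max_entropy_threshold(frame))
```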
Regularized rare variant enrichment analysis for case-control exome sequencing data.
Larson, Nicholas B; Schaid, Daniel J
2014-02-01
Rare variants have recently garnered an immense amount of attention in genetic association analysis. However, unlike methods traditionally used for single marker analysis in GWAS, rare variant analysis often requires some method of aggregation, since single marker approaches are poorly powered for typical sequencing study sample sizes. Advancements in sequencing technologies have rendered next-generation sequencing platforms a realistic alternative to traditional genotyping arrays. Exome sequencing in particular not only provides base-level resolution of genetic coding regions, but also a natural paradigm for aggregation via genes and exons. Here, we propose the use of penalized regression in combination with variant aggregation measures to identify rare variant enrichment in exome sequencing data. In contrast to marginal gene-level testing, we simultaneously evaluate the effects of rare variants in multiple genes, focusing on gene-based least absolute shrinkage and selection operator (LASSO) and exon-based sparse group LASSO models. By using gene membership as a grouping variable, the sparse group LASSO can be used as a gene-centric analysis of rare variants while also providing a penalized approach toward identifying specific regions of interest. We apply extensive simulations to evaluate the performance of these approaches with respect to specificity and sensitivity, comparing these results to multiple competing marginal testing methods. Finally, we discuss our findings and outline future research. © 2013 WILEY PERIODICALS, INC.
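A minimal gene-level sketch of the idea above: collapse rare variants into per-gene burden scores and fit an L1-penalized logistic regression over all genes jointly, so genes are evaluated simultaneously rather than by marginal tests. The published gene-based LASSO and exon-based sparse group LASSO models are richer than this, and the simulated data below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subjects, n_genes, variants_per_gene = 200, 50, 10

# Rare variant genotype counts (0/1/2), grouped by gene.
genotypes = rng.binomial(2, 0.01, size=(n_subjects, n_genes, variants_per_gene))
burden = genotypes.sum(axis=2)                 # per-gene burden score per subject

# Simulated case-control phenotype: gene 0 carries risk variants.
logit = -1.0 + 1.5 * burden[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Joint, penalized evaluation of all genes (L1 = LASSO-type shrinkage).
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(burden, y)
selected = np.flatnonzero(model.coef_[0] != 0)
print("genes with nonzero effects:", selected)
```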
NASA Astrophysics Data System (ADS)
Jiao, Yong; Wakakuwa, Eyuri; Ogawa, Tomohiro
2018-02-01
We consider asymptotic convertibility of an arbitrary sequence of bipartite pure states into another by local operations and classical communication (LOCC). We adopt an information-spectrum approach to address cases where each element of the sequences is not necessarily a tensor power of a bipartite pure state. We derive necessary and sufficient conditions for the LOCC convertibility of one sequence to another in terms of spectral entropy rates of entanglement of the sequences. Based on these results, we also provide simple proofs for previously known results on the optimal rates of entanglement concentration and dilution of general sequences of bipartite pure states.
Zhu, X Q; Gasser, R B
1998-06-01
In this study, we assessed single-strand conformation polymorphism (SSCP)-based approaches for their capacity to fingerprint sequence variation in ribosomal DNA (rDNA) of ascaridoid nematodes of veterinary and/or human health significance. The second internal transcribed spacer region (ITS-2) of rDNA was utilised as the target region because it is known to provide species-specific markers for this group of parasites. ITS-2 was amplified by PCR from genomic DNA derived from individual parasites and subjected to analysis. Direct SSCP analysis of amplicons from seven taxa (Toxocara vitulorum, Toxocara cati, Toxocara canis, Toxascaris leonina, Baylisascaris procyonis, Ascaris suum and Parascaris equorum) showed that the single-strand (ss) ITS-2 patterns produced allowed their unequivocal identification to species. While no variation in SSCP patterns was detected in the ITS-2 within four species for which multiple samples were available, the method allowed the direct display of four distinct sequence types of ITS-2 among individual worms of T. cati. Comparison of SSCP/sequencing with the methods of dideoxy fingerprinting (ddF) and restriction endonuclease fingerprinting (REF) revealed that ddF also allowed the definition of the four sequence types, whereas REF displayed three of the four. The findings indicate the usefulness of the SSCP-based approaches for the identification of ascaridoid nematodes to species, the direct display of sequence variation in rDNA and the detection of population variation. The ability to fingerprint microheterogeneity in ITS-2 rDNA using such approaches also has implications for studying fundamental aspects relating to mutational change in rDNA.
Osmylated DNA, a novel concept for sequencing DNA using nanopores
NASA Astrophysics Data System (ADS)
Kanavarioti, Anastassia
2015-03-01
Sanger sequencing has led the advances in molecular biology, while faster and cheaper next generation technologies are urgently needed. A newer approach exploits nanopores, natural or solid-state, set in an electrical field, and obtains base sequence information from current variations due to the passage of a ssDNA molecule through the pore. A hurdle in this approach is the fact that the four bases are chemically comparable to each other, which leads to small differences in current obstruction. ‘Base calling’ becomes even more challenging because most nanopores sense a short sequence and not individual bases. Perhaps sequencing DNA via nanopores would be more manageable if only the bases were two, and chemically very different from each other; a sequence of 1s and 0s comes to mind. Osmylated DNA comes close to such a sequence of 1s and 0s. Osmylation is the addition of osmium tetroxide bipyridine across the C5-C6 double bond of the pyrimidines. Osmylation adds almost 400% mass to the reactive base and creates a sterically and electronically notably different molecule, labeled 1, compared to the unreactive purines, labeled 0. If osmylated DNA were successfully sequenced, the result would be a sequence of osmylated pyrimidines (1) and purines (0), and not of the actual nucleobases. To solve this problem we studied the osmylation reaction with short oligos and with M13mp18, a long ssDNA, developed a UV-vis assay to measure extent of osmylation, and designed two protocols. Protocol A uses mild conditions and yields osmylated thymidines (1), while leaving the other three bases (0) practically intact. Protocol B uses harsher conditions and effectively osmylates both pyrimidines, but not the purines. Applying these two protocols also to the complementary strand of the target polynucleotide yields a total of four osmylated strands that collectively could define the actual base sequence of the target DNA.
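The 1/0 labeling scheme described above can be made concrete with a short sketch: under Protocol A only thymidines are marked, under Protocol B both pyrimidines are marked, and applying both protocols to the target and its complementary strand gives four binary strands that jointly constrain the base sequence. The example sequence is invented.

```python
def osmylation_labels(seq, protocol="A"):
    """Encode a DNA sequence as 1s (osmylated base) and 0s (unreactive base).

    Protocol A: only thymidines are osmylated.
    Protocol B: both pyrimidines (C and T) are osmylated.
    """
    reactive = {"A": set("T"), "B": set("CT")}[protocol]
    return "".join("1" if b in reactive else "0" for b in seq.upper())

target = "ATGCCTTAGC"
print(osmylation_labels(target, "A"))   # 0100011000
print(osmylation_labels(target, "B"))   # 0101111001

# The same two protocols applied to the complementary strand give two more binary reads.
complement = target.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
print(osmylation_labels(complement, "A"), osmylation_labels(complement, "B"))
```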
Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C
2003-01-01
Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences and direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626
Ng'endo, R.N.; Osiemo, Z.B.; Brandl, R.
2013-01-01
DNA sequencing is increasingly being used to assist in species identification in order to overcome the taxonomic impediment. However, few studies attempt to compare the results of these molecular studies with a more traditional species delineation approach based on morphological characters. A 636-base-pair fragment of the mitochondrial cytochrome oxidase subunit 1 (CO1) gene was sequenced from 47 ants of the genus Pheidole (Formicidae: Myrmicinae) collected in the Brazilian Atlantic Forest to test whether the morphology-based assignment of individuals into species is supported by DNA-based species delimitation. Twenty morphospecies were identified, whereas the barcoding analysis identified 19 Molecular Operational Taxonomic Units (MOTUs). Fifteen of the 19 DNA-based clusters, delimited using sequence divergence thresholds of 2% and 3%, matched morphospecies. Both thresholds yielded the same number of MOTUs. Only one MOTU was successfully identified to species level using the CO1 sequences of Pheidole species already in GenBank. The average pairwise sequence divergence for all 47 sequences was 19%, ranging from 0% to 25%. In some cases, however, morphology- and molecular-based methods differed in their assignment of individuals to morphospecies or MOTUs. The occurrence of distinct mitochondrial lineages within morphological species highlights groups for further detailed genetic and morphological studies, and therefore a pluralistic approach using several methods to understand the taxonomy of difficult lineages is advocated. PMID:23902257
Binladen, Jonas; Gilbert, M Thomas P; Bollback, Jonathan P; Panitz, Frank; Bendixen, Christian; Nielsen, Rasmus; Willerslev, Eske
2007-02-14
The invention of the Genome Sequence 20 DNA Sequencing System (454 parallel sequencing platform) has enabled the rapid and high-volume production of sequence data. Until now, however, individual emulsion PCR (emPCR) reactions and subsequent sequencing runs have been unable to combine template DNA from multiple individuals, as homologous sequences cannot be subsequently assigned to their original sources. We use conventional PCR with 5'-nucleotide tagged primers to generate homologous DNA amplification products from multiple specimens, followed by sequencing through the high-throughput Genome Sequence 20 DNA Sequencing System (GS20, Roche/454 Life Sciences). Each DNA sequence is subsequently traced back to its individual source through 5'-tag analysis. We demonstrate that this new approach enables the assignment of virtually all the generated DNA sequences to the correct source once sequencing anomalies are accounted for (misassignment rate < 0.4%). Therefore, the method enables accurate sequencing and assignment of homologous DNA sequences from multiple sources in a single high-throughput GS20 run. We observe a bias in the distribution of the differently tagged primers that is dependent on the 5' nucleotide of the tag. In particular, primers 5'-labelled with a cytosine are heavily overrepresented among the final sequences, while those 5'-labelled with a thymine are strongly underrepresented. A weaker bias also exists with regard to the distribution of the sequences as sorted by the second nucleotide of the dinucleotide tags. As the results are based on a single GS20 run, the general applicability of the approach requires confirmation. However, our experiments demonstrate that 5'-primer tagging is a useful method in which the sequencing power of the GS20 can be applied to PCR-based assays of multiple homologous PCR products. The new approach will be of value to a broad range of research areas, such as those of comparative genomics, complete mitochondrial analyses, population genetics, and phylogenetics.
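A minimal sketch of the post-sequencing sorting step: each read is assigned back to its source specimen by matching the 5' tag that precedes the gene-specific primer. The tag-to-specimen table, the dinucleotide tag length, and the toy reads are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical dinucleotide tags prepended to the 5' end of the forward primer.
TAG_TO_SPECIMEN = {"CA": "specimen_1", "CT": "specimen_2", "TA": "specimen_3"}
TAG_LEN = 2

def demultiplex(reads):
    """Group reads by their 5' tag; reads with unknown tags are collected separately."""
    bins = defaultdict(list)
    for read in reads:
        specimen = TAG_TO_SPECIMEN.get(read[:TAG_LEN].upper(), "unassigned")
        bins[specimen].append(read[TAG_LEN:])   # strip the tag before downstream analysis
    return bins

reads = ["CAGGATTAGATACCC", "CTGGATTAGATACCC", "GGGGATTAGATACCC"]
for specimen, seqs in demultiplex(reads).items():
    print(specimen, len(seqs))
```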
Inferring Short-Range Linkage Information from Sequencing Chromatograms
Beggel, Bastian; Neumann-Fraune, Maria; Kaiser, Rolf; Verheyen, Jens; Lengauer, Thomas
2013-01-01
Direct Sanger sequencing of viral genome populations yields multiple ambiguous sequence positions. It is not straightforward to derive linkage information from sequencing chromatograms, which in turn hampers the correct interpretation of the sequence data. We present a method for determining the variants existing in a viral quasispecies in the case of two nearby ambiguous sequence positions by exploiting the effect of sequence context-dependent incorporation of dideoxynucleotides. The computational model was trained on data from sequencing chromatograms of clonal variants and was evaluated on two test sets of in vitro mixtures. The approach achieved high accuracies in identifying the mixture components of 97.4% on a test set in which the positions to be analyzed are only one base apart from each other, and of 84.5% on a test set in which the ambiguous positions are separated by three bases. In silico experiments suggest two major limitations of our approach in terms of accuracy. First, due to a basic limitation of Sanger sequencing, it is not possible to reliably detect minor variants with a relative frequency of no more than 10%. Second, the model cannot distinguish between mixtures of two or four clonal variants, if one of two sets of linear constraints is fulfilled. Furthermore, the approach requires repetitive sequencing of all variants that might be present in the mixture to be analyzed. Nevertheless, the effectiveness of our method on the two in vitro test sets shows that short-range linkage information of two ambiguous sequence positions can be inferred from Sanger sequencing chromatograms without any further assumptions on the mixture composition. Additionally, our model provides new insights into the established and widely used Sanger sequencing technology. The source code of our method is made available at http://bioinf.mpi-inf.mpg.de/publications/beggel/linkageinformation.zip. PMID:24376502
Elman RNN based classification of proteins sequences on account of their mutual information.
Mishra, Pooja; Nath Pandey, Paras
2012-10-21
In the present work we have employed the method of estimating residue correlation within protein sequences by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long-range correlation between nonadjacent residues is captured by constructing a mutual information vector (MIV) for a single protein sequence; in this way, each protein sequence is associated with its corresponding MIVs. These MIVs are given to an Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment-free classification of protein sequences. We also conclude that MIVs based on sequence structural and solvent accessibility properties are better predictors. Copyright © 2012 Elsevier Ltd. All rights reserved.
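A minimal sketch of the adjacent-residue mutual information underlying the MIV idea: residues are mapped to coarse property classes, and MI is estimated from the joint distribution of classes at positions i and i+1. The three-class grouping below is an illustrative assumption, not the property encoding used by the authors.

```python
import math
from collections import Counter

# Illustrative three-class grouping of amino acids (hydrophobic / polar / charged).
CLASS = {}
for aa in "AVLIMFWC": CLASS[aa] = "H"
for aa in "GSTYNQP":  CLASS[aa] = "P"
for aa in "DEKRH":    CLASS[aa] = "C"

def adjacent_mi(seq):
    """Mutual information between property classes of adjacent residues."""
    pairs = [(CLASS[a], CLASS[b]) for a, b in zip(seq, seq[1:])
             if a in CLASS and b in CLASS]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

print(adjacent_mi("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY"))
```

Repeating such an estimate for several property encodings yields the components of an MIV, which is then used as the fixed-length input to the recurrent network.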
Assembly and diploid architecture of an individual human genome via single-molecule technologies
Pendleton, Matthew; Sebra, Robert; Pang, Andy Wing Chun; Ummat, Ajay; Franzen, Oscar; Rausch, Tobias; Stütz, Adrian M; Stedman, William; Anantharaman, Thomas; Hastie, Alex; Dai, Heng; Fritz, Markus Hsi-Yang; Cao, Han; Cohain, Ariella; Deikus, Gintaras; Durrett, Russell E; Blanchard, Scott C; Altman, Roger; Chin, Chen-Shan; Guo, Yan; Paxinos, Ellen E; Korbel, Jan O; Darnell, Robert B; McCombie, W Richard; Kwok, Pui-Yan; Mason, Christopher E; Schadt, Eric E; Bashir, Ali
2015-01-01
We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality. PMID:26121404
Garrido-Martín, Diego; Pazos, Florencio
2018-02-27
The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this has never been assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as the only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. In this work we evaluate how the growth and change over time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Malhis, Nawar; Butterfield, Yaron S N; Ester, Martin; Jones, Steven J M
2009-01-01
A plethora of alignment tools have been created that are designed to best fit different types of alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none of them fully utilizes its probability (prb) output. In this article, we introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files. Compared with other aligners, Slider has higher alignment accuracy and efficiency. In addition, given that Slider matches bases with probabilities other than the most probable, it significantly reduces the percentage of base mismatches. The result is that its SNP predictions are more accurate than other SNP prediction approaches used today that start from the most probable sequence, including those using base quality.
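A minimal sketch of scoring a candidate placement with per-base probabilities rather than a single called base: the score of placing a read at a reference offset is the product (here, sum of logs) of the probabilities the sequencer assigned to the reference base at each cycle. The prb layout (one row per cycle, columns A, C, G, T) follows the classic Illumina format, but the toy values and reference are invented, and Slider's actual index-based search is far more elaborate.

```python
import math

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def placement_log_score(prb, reference, offset):
    """Log-probability of the read arising from reference[offset:offset+len(prb)]."""
    score = 0.0
    for cycle, probs in enumerate(prb):
        ref_base = reference[offset + cycle]
        score += math.log(max(probs[BASE_INDEX[ref_base]], 1e-6))
    return score

# Toy 4-cycle read: the most probable call is "ACGT", but cycle 3 is ambiguous.
prb = [
    [0.90, 0.03, 0.04, 0.03],
    [0.05, 0.80, 0.10, 0.05],
    [0.40, 0.05, 0.45, 0.10],   # G barely beats A
    [0.02, 0.03, 0.05, 0.90],
]
reference = "TTACGTACATTT"
best = max(range(len(reference) - len(prb) + 1),
           key=lambda o: placement_log_score(prb, reference, o))
print(best, reference[best:best + len(prb)])   # offset 2, "ACGT"
```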
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences is a common operation frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
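For reference, here is a minimal scalar Smith-Waterman with a linear gap penalty: the same cell recurrence whose updates the cluster-, thread- and vector-level scheme above parallelizes. The match/mismatch/gap scores are illustrative; production kernels use substitution matrices and affine gaps.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))
```

The "GCUPS" figure quoted above counts exactly these cell updates (giga cell updates per second), which is why vectorizing the inner loop and distributing database chunks across nodes pays off directly.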
USDA-ARS?s Scientific Manuscript database
A reassociation kinetics-based approach was used to reduce the complexity of genomic DNA from the Deutsch laboratory strain of the cattle tick, Rhipicephalus microplus, to facilitate genome sequencing. Selected genomic DNA (Cot value = 660) was sequenced using 454 GS FLX technology, resulting in 356...
Jiang, Rui ; Yang, Hua ; Zhou, Linqi ; Kuo, C.-C. Jay ; Sun, Fengzhu ; Chen, Ting
2007-01-01
The increasing demand for the identification of genetic variation responsible for common diseases has translated into a need for sophisticated methods for effectively prioritizing mutations occurring in disease-associated genetic regions. In this article, we prioritize candidate nonsynonymous single-nucleotide polymorphisms (nsSNPs) through a bioinformatics approach that takes advantage of a set of improved numeric features derived from protein-sequence information and a new statistical learning model called “multiple selection rule voting” (MSRV). The sequence-based features can maximize the scope of applications of our approach, and the MSRV model can capture subtle characteristics of individual mutations. Systematic validation of the approach demonstrates that it is capable of prioritizing causal mutations for both simple monogenic diseases and complex polygenic diseases. Further studies of familial Alzheimer disease and diabetes show that the approach can enrich mutations underlying these polygenic diseases among the top candidate mutations. Application of this approach to unclassified mutations suggests that 10 suspicious mutations are likely to cause disease, and there is strong support for this in the literature. PMID:17668383
Pre-Service Teachers' Approaches to a Historical Problem in Mechanics
ERIC Educational Resources Information Center
Malgieri, Massimiliano; Onorato, Pasquale; Mascheretti, Paolo; De Ambrosis, Anna
2014-01-01
In this paper we report on an activity sequence with a group of 29 pre-service physics teachers based on the reconstruction and analysis of a thought experiment that was crucial for Huygens' derivation of the formula for the centre of oscillation of a physical pendulum. The sequence starts with student teachers approaching the historical…
Superresolution restoration of an image sequence: adaptive filtering approach.
Elad, M; Feuer, A
1999-01-01
This paper presents a new method based on adaptive filtering theory for superresolution restoration of continuous image sequences. The proposed methodology suggests least squares (LS) estimators which adapt in time, based on adaptive filters: least mean squares (LMS) or recursive least squares (RLS). The adaptation enables the treatment of linear space- and time-variant blurring and arbitrary motion, both of them assumed known. The proposed new approach is shown to have relatively low computational requirements. Simulations demonstrating the superresolution restoration algorithms are presented.
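A minimal sketch of the generic LMS recursion on which such estimators build, shown here identifying an unknown FIR system from streaming data rather than performing superresolution itself; the filter taps, step size, and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_filter = np.array([0.5, -0.3, 0.2])      # unknown system to identify
w = np.zeros(3)                               # adaptive filter weights
mu = 0.05                                     # LMS step size

for _ in range(5000):
    x = rng.standard_normal(3)                               # current input window
    d = true_filter @ x + 0.01 * rng.standard_normal()       # desired (observed) output
    e = d - w @ x                                            # instantaneous error
    w += mu * e * x                                          # LMS update: w <- w + mu * e * x

print(np.round(w, 3))                         # approaches [0.5, -0.3, 0.2]
```

In the superresolution setting the same gradient-style update is applied to the high-resolution image estimate, with the known blur, motion and downsampling playing the role of the regressor.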
ERIC Educational Resources Information Center
Blanco-López, Ángel; Franco-Mariscal, Antonio Joaquín; España-Ramos, Enrique
2016-01-01
We present a case study to illustrate the design and implementation of a teaching sequence about oral and dental health and hygiene. This teaching sequence was aimed at year 10 students (age 15-16) and sought to develop their scientific competences. In line with the PISA assessment framework for science and the tenets of a context-based approach…
Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data.
Hu, Bo; Ji, Yuan; Xu, Yaomin; Ting, Angela H
2013-05-01
Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach.
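A minimal sketch of the multiple-comparison step described above: given each SNP's posterior probability of ASM, SNPs are reported in decreasing order of probability as long as the running mean of (1 − probability), an estimate of the posterior false discovery rate of the reported set, stays below the target. The posterior probabilities here are invented; in the paper they come from the Bayesian read-count model.

```python
import numpy as np

def flag_by_posterior_fdr(post_prob, alpha=0.05):
    """Return indices of SNPs reported at estimated posterior FDR <= alpha."""
    post_prob = np.asarray(post_prob, dtype=float)
    order = np.argsort(-post_prob)                      # most probable ASM first
    local_fdr = 1.0 - post_prob[order]
    running_fdr = np.cumsum(local_fdr) / np.arange(1, len(order) + 1)
    keep = running_fdr <= alpha
    return order[: keep.sum()] if keep.any() else np.array([], dtype=int)

posteriors = [0.99, 0.97, 0.95, 0.80, 0.40, 0.10]
print(flag_by_posterior_fdr(posteriors, alpha=0.05))    # first three SNPs pass
```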
Optimized Probe Masking for Comparative Transcriptomics of Closely Related Species
Poeschl, Yvonne; Delker, Carolin; Trenner, Jana; Ullrich, Kristian Karsten; Quint, Marcel; Grosse, Ivo
2013-01-01
Microarrays are commonly applied to study the transcriptome of specific species. However, many available microarrays are restricted to model organisms, and the design of custom microarrays for other species is often not feasible. Hence, transcriptomics approaches for non-model organisms, as well as comparative transcriptomics studies among two or more species, often make use of cost-intensive RNAseq studies or, alternatively, hybridize transcripts of a query species to a microarray of a closely related species. When analyzing these cross-species microarray expression data, differences in the transcriptome of the query species can cause problems, such as the following: (i) lower hybridization accuracy of probes due to mismatches or deletions, (ii) probes binding multiple transcripts of different genes, and (iii) probes binding transcripts of non-orthologous genes. So far, methods for (i) exist, but these neglect (ii) and (iii). Here, we propose an approach for comparative transcriptomics addressing problems (i) to (iii), which retains only transcript-specific probes binding transcripts of orthologous genes. We apply this approach to an Arabidopsis lyrata expression data set measured on a microarray designed for Arabidopsis thaliana, and compare it to two alternative approaches, a sequence-based approach and a genomic DNA hybridization-based approach. We investigate the number of retained probe sets, and we validate the resulting expression responses by qRT-PCR. We find that the proposed approach combines the benefit of sequence-based stringency and accuracy while allowing the expression analysis of many more genes than the alternative sequence-based approach. As an added benefit, the proposed approach requires probes to detect transcripts of orthologous genes only, which provides a superior basis for biological interpretation of the measured expression responses. PMID:24260119
Stratification of co-evolving genomic groups using ranked phylogenetic profiles
Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A
2009-01-01
Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884
SNP-based genotyping in lentil: linking sequence information with phenotypes
USDA-ARS?s Scientific Manuscript database
Lentil (Lens culinaris) has been late to enter the world of high throughput molecular analysis due to a general lack of genomic resources. Using a 454 sequencing-based approach, SNPs have been identified in genes across the lentil genome. Several hundred have been turned into single SNP KASP assay...
Vander Lugt correlation of DNA sequence data
NASA Astrophysics Data System (ADS)
Christens-Barry, William A.; Hawk, James F.; Martin, James C.
1990-12-01
DNA, the molecule containing the genetic code of an organism, is a linear chain of subunits. It is the sequence of subunits, of which there are four kinds, that constitutes the unique blueprint of an individual. This sequence is the focus of a large number of analyses performed by an army of geneticists, biologists, and computer scientists. Most of these analyses entail searches for specific subsequences within the larger set of sequence data. Thus, most analyses are essentially pattern recognition or correlation tasks. Yet, there are special features to such analysis that influence the strategy and methods of an optical pattern recognition approach. While the serial processing employed in digital electronic computers remains the main engine of sequence analyses, there is no fundamental reason that more efficient parallel methods cannot be used. We describe an approach using optical pattern recognition (OPR) techniques based on matched spatial filtering. This allows parallel comparison of large blocks of sequence data. In this study we have simulated a Vander Lugt [1] architecture implementing our approach. Searches for specific target sequence strings within a block of DNA sequence from the ColE1 plasmid [2] are performed.
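Matched spatial filtering amounts to correlating an encoded input against an encoded target, so a minimal digital analogue is FFT-based cross-correlation of a numerically encoded DNA block with a target subsequence; peaks mark candidate matches. The complex per-base encoding and the example sequences below are arbitrary illustrative choices, not the optical encoding used in the study.

```python
import numpy as np

ENCODE = {"A": 1 + 0j, "C": -1 + 0j, "G": 0 + 1j, "T": 0 - 1j}   # illustrative encoding

def correlate_target(block, target):
    """FFT-based circular cross-correlation; peaks mark candidate occurrences of target."""
    x = np.array([ENCODE[b] for b in block])
    t = np.zeros(len(x), dtype=complex)
    t[: len(target)] = [ENCODE[b] for b in target]
    corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(t)))
    return corr.real

block = "GGCATTACGGAATTACGG"
target = "TTACG"
corr = correlate_target(block, target)
peaks = sorted(int(i) for i in np.argsort(-corr)[:2])
print(peaks)   # the two true occurrences, offsets 4 and 12
```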
Transient effects in π-pulse sequences in MAS solid-state NMR
NASA Astrophysics Data System (ADS)
Hellwagner, Johannes; Wili, Nino; Ibáñez, Luis Fábregas; Wittmann, Johannes J.; Meier, Beat H.; Ernst, Matthias
2018-02-01
Dipolar recoupling techniques that use isolated rotor-synchronized π pulses are commonly used in solid-state NMR spectroscopy to gain insight into the structure of biological molecules. These sequences excel through their simplicity, stability towards radio-frequency (rf) inhomogeneity, and low rf requirements. For a theoretical understanding of such sequences, we present a Floquet treatment based on an interaction-frame transformation including the chemical-shift offset dependence. This approach is applied to the homonuclear dipolar-recoupling sequence Radio-Frequency Driven Recoupling (RFDR) and the heteronuclear recoupling sequence Rotational Echo Double Resonance (REDOR). Based on the Floquet approach, we show the influence of effective fields caused by pulse transients and discuss the advantages of pulse-transient compensation. We demonstrate experimentally that the transfer efficiency for homonuclear recoupling can be doubled in some cases in model compounds as well as in simple peptides if pulse-transient compensation is applied to the π pulses. Additionally, we discuss the influence of various phase cycles on the recoupling efficiency in order to reduce the magnitude of effective fields. Based on the findings from RFDR, we are able to explain why the REDOR sequence does not suffer in the recoupling efficiency despite the presence of effective fields.
Establishing homologies in protein sequences
NASA Technical Reports Server (NTRS)
Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.
1983-01-01
Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
Frickmann, Hagen; Zautner, Andreas E.
2014-01-01
Atypical and multidrug resistance, especially in ESBL- and carbapenemase-expressing Enterobacteriaceae, is spreading globally. Therefore, it becomes increasingly difficult to achieve therapeutic success by calculated antibiotic therapy. Consequently, rapid antibiotic resistance testing is essential. Various molecular and mass spectrometry-based approaches have been introduced in diagnostic microbiology to speed up the provision of reliable resistance data. PCR- and sequencing-based approaches are the most expensive but the most frequently applied modes of testing, suitable for the detection of resistance genes even from primary material. Next generation sequencing, based either on assessment of allelic single nucleotide polymorphisms or on the detection of nonubiquitous resistance mechanisms, might in the long term allow for sequence-based bacterial resistance testing comparable to viral resistance testing. Fluorescence in situ hybridization (FISH), based on specific binding of fluorescence-labeled oligonucleotide probes, provides a less expensive molecular bridging technique. It is particularly useful for detection of resistance mechanisms based on mutations in ribosomal RNA. Approaches based on MALDI-TOF-MS, alone or in combination with molecular techniques like PCR/electrospray ionization MS or minisequencing, provide the fastest resistance results from pure colonies or even primary samples, with a growing number of protocols. This review details the various approaches to rapid resistance testing, their pros and cons, and their potential use in the diagnostic laboratory. PMID:25343142
Lu, Fu-Hao; McKenzie, Neil; Kettleborough, George; Heavens, Darren; Clark, Matthew D; Bevan, Michael W
2018-05-01
The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics. Here we use a Fosill 38-kb jumping library to assess medium and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values. Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.
Saingam, Prakit; Li, Bo; Yan, Tao
2018-06-01
DNA-based molecular detection of microbial pathogens in complex environments is still plagued by sensitivity, specificity and robustness issues. We propose to address these issues by viewing them as inadvertent consequences of requiring specific and adequate amplification (SAA) of target DNA molecules by current PCR methods. Using the invA gene of Salmonella as the model system, we investigated whether next generation sequencing (NGS) can be used to directly detect target sequences in false-negative PCR reactions (PCR-NGS) in order to remove the SAA requirement from PCR. False-negative PCR and qPCR reactions were first created using serial dilutions of laboratory-prepared Salmonella genomic DNA and then analyzed directly by NGS. Target invA sequences were detected in all false-negative PCR and qPCR reactions, which lowered the method detection limits to near the theoretical minimum of single gene copy detection. The capability of the PCR-NGS approach in correcting false negativity was further tested and confirmed under more environmentally relevant conditions using Salmonella-spiked stream water and sediment samples. Finally, the PCR-NGS approach was applied to ten urban stream water samples and detected invA sequences in eight samples that would otherwise be deemed Salmonella negative. Analysis of the non-target sequences in the false-negative reactions helped to identify primer dimer-like short sequences as the main cause of the false negativity. Together, the results demonstrated that the PCR-NGS approach can significantly improve method sensitivity, correct false-negative detections, and enable sequence-based analysis for failure diagnostics in complex environmental samples. Copyright © 2018 Elsevier B.V. All rights reserved.
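A minimal sketch of the sequence-level detection step implied above: reads from a sequenced PCR reaction are scanned for k-mers of the target amplicon, counting reads that support the target even when no visible amplification occurred. The target fragment and reads are stand-ins, not the real invA sequence or real sequencer output.

```python
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reads_supporting_target(reads, target, k=21, min_shared=1):
    """Count reads sharing at least min_shared k-mers with the target amplicon."""
    target_kmers = kmers(target.upper(), k)
    return sum(1 for r in reads if len(kmers(r.upper(), k) & target_kmers) >= min_shared)

# Stand-in target fragment and reads (illustrative only).
target = "ATGGCGTTTCTGGTGGTGCGAACGGCGGCGGCGGCAATTTGACCAGTCTCT"
reads = [
    target[5:45],                                          # read overlapping the target
    "TTTTTTTTTTGGGGGGGGGGCCCCCCCCCCAAAAAAAAAA",            # unrelated read
]
print(reads_supporting_target(reads, target))              # 1
```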
Stranges, P. Benjamin; Palla, Mirkó; Kalachikov, Sergey; Nivala, Jeff; Dorwart, Michael; Trans, Andrew; Kumar, Shiv; Porel, Mintu; Chien, Minchen; Tao, Chuanjuan; Morozova, Irina; Li, Zengmin; Shi, Shundi; Aberra, Aman; Arnold, Cleoma; Yang, Alexander; Aguirre, Anne; Harada, Eric T.; Korenblum, Daniel; Pollard, James; Bhat, Ashwini; Gremyachinskiy, Dmitriy; Bibillo, Arek; Chen, Roger; Davis, Randy; Russo, James J.; Fuller, Carl W.; Roever, Stefan; Ju, Jingyue; Church, George M.
2016-01-01
Scalable, high-throughput DNA sequencing is a prerequisite for precision medicine and biomedical research. Recently, we presented a nanopore-based sequencing-by-synthesis (Nanopore-SBS) approach, which used a set of nucleotides with polymer tags that allow discrimination of the nucleotides in a biological nanopore. Here, we designed and covalently coupled a DNA polymerase to an α-hemolysin (αHL) heptamer using the SpyCatcher/SpyTag conjugation approach. These porin–polymerase conjugates were inserted into lipid bilayers on a complementary metal oxide semiconductor (CMOS)-based electrode array for high-throughput electrical recording of DNA synthesis. The designed nanopore construct successfully detected the capture of tagged nucleotides complementary to a DNA base on a provided template. We measured over 200 tagged-nucleotide signals for each of the four bases and developed a classification method to uniquely distinguish them from each other and background signals. The probability of falsely identifying a background event as a true capture event was less than 1.2%. In the presence of all four tagged nucleotides, we observed sequential additions in real time during polymerase-catalyzed DNA synthesis. Single-polymerase coupling to a nanopore, in combination with the Nanopore-SBS approach, can provide the foundation for a low-cost, single-molecule, electronic DNA-sequencing platform. PMID:27729524
Scaling up discovery of hidden diversity in fungi: impacts of barcoding approaches.
Yahr, Rebecca; Schoch, Conrad L; Dentinger, Bryn T M
2016-09-05
The fungal kingdom is a hyperdiverse group of multicellular eukaryotes with profound impacts on human society and ecosystem function. The challenge of documenting and describing fungal diversity is exacerbated by their typically cryptic nature, their ability to produce seemingly unrelated morphologies from a single individual and their similarity in appearance to distantly related taxa. This multiplicity of hurdles resulted in the early adoption of DNA-based comparisons to study fungal diversity, including linking curated DNA sequence data to expertly identified voucher specimens. DNA-barcoding approaches in fungi were first applied in specimen-based studies for identification and discovery of taxonomic diversity, but are now widely deployed for community characterization based on sequencing of environmental samples. Collectively, fungal barcoding approaches have yielded important advances across biological scales and research applications, from taxonomic, ecological, industrial and health perspectives. A major outstanding issue is the growing problem of 'sequences without names' that are somewhat uncoupled from the traditional framework of fungal classification based on morphology and preserved specimens. This review summarizes some of the most significant impacts of fungal barcoding, its limitations, and progress towards the challenge of effective utilization of the exponentially growing volume of data gathered from high-throughput sequencing technologies.This article is part of the themed issue 'From DNA barcodes to biomes'. © 2016 The Authors.
Network-based function prediction and interactomics: the case for metabolic enzymes.
Janga, S C; Díaz-Mejía, J Javier; Moreno-Hagelsieb, G
2011-01-01
As sequencing technologies increase in power, determining the functions of unknown proteins encoded by the DNA sequences so produced becomes a major challenge. Functional annotation is commonly done on the basis of amino-acid sequence similarity alone. Long after sequence similarity becomes undetectable by pair-wise comparison, profile-based identification of homologs can often succeed due to the conservation of position-specific patterns, important for a protein's three-dimensional folding and function. Nevertheless, prediction of protein function from homology-driven approaches is not without problems. Homologous proteins might evolve different functions, and the power of homology detection has already started to reach its maximum. Computational methods for inferring protein function, which exploit the context of a protein in cellular networks, have come to be built on top of homology-based approaches. These network-based functional inference techniques provide both a first-hand hint at a protein's functional role and complementary insights to traditional methods for understanding the function of uncharacterized proteins. Most recent network-based approaches aim to integrate diverse kinds of functional interactions to boost both coverage and confidence level. These techniques not only promise to resolve the moonlighting aspect of proteins by annotating proteins with multiple functions, but also increase our understanding of the interplay between different functional classes in a cell. In this article we review the state of the art in network-based function prediction and describe some of the underlying difficulties and successes. Given the volume of high-throughput data that is being reported, the time is ripe to employ these network-based approaches, which can be used to unravel the functions of the uncharacterized proteins accumulating in genomic databases. © 2010 Elsevier Inc. All rights reserved.
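A minimal sketch of the simplest network-based inference, guilt by association: an uncharacterized protein is assigned the functions most common among its annotated interaction partners. The toy network and annotations are invented; reviewed methods extend this idea with integrated evidence and probabilistic propagation.

```python
from collections import Counter

# Toy functional interaction network (undirected) and partial annotations.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4"), ("P4", "P5")]
annotations = {"P1": {"glycolysis"}, "P2": {"glycolysis"},
               "P3": {"TCA cycle"}, "P5": {"glycolysis"}}

neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def predict_functions(protein, top=2):
    """Rank candidate functions by how many annotated neighbors carry them."""
    votes = Counter()
    for nb in neighbors.get(protein, ()):
        for func in annotations.get(nb, ()):
            votes[func] += 1
    return votes.most_common(top)

print(predict_functions("P4"))   # [('glycolysis', 2), ('TCA cycle', 1)]
```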
ERIC Educational Resources Information Center
Fogarty, Ian; Geelan, David
2013-01-01
Students in 4 Canadian high school physics classes completed instructional sequences in two key physics topics related to motion--Straight Line Motion and Newton's First Law. Different sequences of laboratory investigation, teacher explanation (lecture) and the use of computer-based scientific visualizations (animations and simulations) were…
Han, Yuepeng; Chagné, David; Gasic, Ksenija; Rikkerink, Erik H A; Beever, Jonathan E; Gardiner, Susan E; Korban, Schuyler S
2009-03-01
A genome-wide BAC physical map of the apple, Malus x domestica Borkh., has been recently developed. Here, we report on integrating the physical and genetic maps of the apple using a SNP-based approach in conjunction with bin mapping. Briefly, BAC clones located at ends of BAC contigs were selected, and sequenced at both ends. The BAC end sequences (BESs) were used to identify candidate SNPs. Subsequently, these candidate SNPs were genetically mapped using a bin mapping strategy for the purpose of mapping the physical onto the genetic map. Using this approach, 52 (23%) out of 228 BESs tested were successfully exploited to develop SNPs. These SNPs anchored 51 contigs, spanning approximately 37 Mb in cumulative physical length, onto 14 linkage groups. The reliability of the integration of the physical and genetic maps using this SNP-based strategy is described, and the results confirm the feasibility of this approach to construct an integrated physical and genetic maps for apple.
Rodriguez-Rivas, Juan; Marsili, Simone; Juan, David; Valencia, Alfonso
2016-01-01
Protein–protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein–protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein–protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein–protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach. PMID:27965389
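Residue coevolution between two interacting families can be illustrated with the simplest possible statistic, mutual information between columns of a paired (concatenated) alignment; columns from the two families with high MI are candidate interface contacts. The published analyses use far more powerful global statistical models, so this is only a conceptual sketch, and the toy alignment is invented.

```python
import math
from collections import Counter

# Toy paired alignment: each row concatenates one sequence from family A with its partner from family B.
paired_msa = [
    "AKLE" + "DRF",
    "AKIE" + "DRF",
    "GKLE" + "ERF",
    "GRLE" + "EKF",
    "ARLD" + "DKF",
]
len_a = 4   # columns 0..3 belong to family A, the rest to family B

def column_mi(msa, i, j):
    """Mutual information between alignment columns i and j."""
    n = len(msa)
    pairs = Counter((s[i], s[j]) for s in msa)
    ci = Counter(s[i] for s in msa)
    cj = Counter(s[j] for s in msa)
    return sum((c / n) * math.log2((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in pairs.items())

# Score every inter-family column pair; high MI flags a candidate interface contact.
scores = {(i, j): column_mi(paired_msa, i, j)
          for i in range(len_a) for j in range(len_a, len(paired_msa[0]))}
print(max(scores, key=scores.get))
```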
Shafiee, Mohammad Javad; Chung, Audrey G; Khalvati, Farzad; Haider, Masoom A; Wong, Alexander
2017-10-01
While lung cancer is the second most diagnosed form of cancer in men and women, a sufficiently early diagnosis can be pivotal in patient survival rates. Imaging-based, or radiomics-driven, detection methods have been developed to aid diagnosticians, but largely rely on hand-crafted features that may not fully encapsulate the differences between cancerous and healthy tissue. Recently, the concept of discovery radiomics was introduced, where custom abstract features are discovered from readily available imaging data. We propose an evolutionary deep radiomic sequencer discovery approach based on evolutionary deep intelligence. Motivated by patient privacy concerns and the idea of operational artificial intelligence, the evolutionary deep radiomic sequencer discovery approach organically evolves increasingly more efficient deep radiomic sequencers that produce significantly more compact yet similarly descriptive radiomic sequences over multiple generations. As a result, this framework improves operational efficiency and enables diagnosis to be run locally at the radiologist's computer while maintaining detection accuracy. We evaluated the evolved deep radiomic sequencer (EDRS) discovered via the proposed evolutionary deep radiomic sequencer discovery framework against state-of-the-art radiomics-driven and discovery radiomics methods using clinical lung CT data with pathologically proven diagnostic data from the LIDC-IDRI dataset. The EDRS shows improved sensitivity (93.42%), specificity (82.39%), and diagnostic accuracy (88.78%) relative to previous radiomics approaches.
Herbold, Craig W.; Pelikan, Claus; Kuzyk, Orest; Hausmann, Bela; Angel, Roey; Berry, David; Loy, Alexander
2015-01-01
High throughput sequencing of phylogenetic and functional gene amplicons provides tremendous insight into the structure and functional potential of complex microbial communities. Here, we introduce a highly adaptable and economical PCR approach to barcoding and pooling libraries of numerous target genes. In this approach, we replace gene- and sequencing platform-specific fusion primers with general, interchangeable barcoding primers, enabling nearly limitless customized barcode-primer combinations. Compared to barcoding with long fusion primers, our multiple-target gene approach is more economical because it requires fewer primers overall and is based on short primers with generally lower synthesis and purification costs. To highlight our approach, we pooled over 900 different small-subunit rRNA and functional gene amplicon libraries obtained from various environmental or host-associated microbial community samples into a single, paired-end Illumina MiSeq run. Although the amplicon regions ranged in size from approximately 290 to 720 bp, we found no significant systematic sequencing bias related to amplicon length or gene target. Our results indicate that this flexible multiplexing approach produces large, diverse, and high quality sets of amplicon sequence data for modern studies in microbial ecology. PMID:26236305
Thomas, W. Kelley; Vida, J. T.; Frisse, Linda M.; Mundo, Manuel; Baldwin, James G.
1997-01-01
To effectively integrate DNA sequence analysis and classical nematode taxonomy, we must be able to obtain DNA sequences from formalin-fixed specimens. Microdissected sections of nematodes were removed from specimens fixed in formalin, using standard protocols and without destroying morphological features. The fixed sections provided sufficient template for multiple polymerase chain reaction-based DNA sequence analyses. PMID:19274156
Wright, Imogen A.; Travers, Simon A.
2014-01-01
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance. PMID:24861618
A chemogenomic analysis of the human proteome: application to enzyme families.
Bernasconi, Paul; Chen, Min; Galasinski, Scott; Popa-Burke, Ioana; Bobasheva, Anna; Coudurier, Louis; Birkos, Steve; Hallam, Rhonda; Janzen, William P
2007-10-01
Sequence-based phylogenies (SBP) are well-established tools for describing relationships between proteins. They have been used extensively to predict the behavior and sensitivity toward inhibitors of enzymes within a family. The utility of this approach diminishes when comparing proteins with little sequence homology. Even within an enzyme family, SBPs must be complemented by an orthogonal method that is independent of sequence to better predict enzymatic behavior. A chemogenomic approach is demonstrated here that uses the inhibition profile of a 130,000 diverse molecule library to uncover relationships within a set of enzymes. The profile is used to construct a semimetric additive distance matrix. This matrix, in turn, defines a sequence-independent phylogeny (SIP). The method was applied to 97 enzymes (kinases, proteases, and phosphatases). SIP does not use structural information from the molecules used for establishing the profile, thus providing a more heuristic method than the current approaches, which require knowledge of the specific inhibitor's structure. Within enzyme families, SIP shows a good overall correlation with SBP. More interestingly, SIP uncovers distances within families that are not recognizable by sequence-based methods. In addition, SIP allows the determination of distance between enzymes with no sequence homology, thus uncovering novel relationships not predicted by SBP. This chemogenomic approach, used in conjunction with SBP, should prove to be a powerful tool for choosing target combinations for drug discovery programs as well as for guiding the selection of profiling and liability targets.
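The core computation described above, turning compound-inhibition profiles into a sequence-independent tree, can be illustrated with a minimal sketch. The correlation distance and average-linkage clustering below are stand-ins chosen for illustration, not the paper's semimetric additive distance, and all data are synthetic.

```python
# Minimal sketch (not the authors' exact distance): build a sequence-independent
# "phylogeny" by clustering enzymes on their inhibition profiles alone.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical data: rows = enzymes, columns = library compounds,
# entries = fraction inhibition observed in the screen.
rng = np.random.default_rng(0)
profiles = rng.random((6, 1000))            # 6 enzymes x 1000 compounds (toy size)
enzymes = [f"enzyme_{i}" for i in range(6)]

# Distance between two enzymes = 1 - Pearson correlation of their profiles
# (a stand-in for the semimetric additive distance used in the paper).
dist_condensed = pdist(profiles, metric="correlation")
print(np.round(squareform(dist_condensed), 2))

# Hierarchical clustering gives a tree that plays the role of the SIP.
tree = linkage(dist_condensed, method="average")
print([enzymes[i] for i in leaves_list(tree)])  # leaf order of the resulting tree
```

Any other profile-based distance could be substituted at the pdist call without changing the rest of the pipeline.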
Anonymization of electronic medical records for validating genome-wide association studies
Loukides, Grigorios; Gkoulalas-Divanis, Aris; Malin, Bradley
2010-01-01
Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them so that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from Vanderbilt University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks. PMID:20385806
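As a rough illustration of the linkage-prevention idea, the sketch below greedily suppresses clinical codes until every record's remaining code combination is shared by at least k patients, while leaving a protected, GWAS-relevant code set untouched. This is a simplified stand-in, not the authors' algorithm; the code values and k are hypothetical.

```python
# Toy sketch of the linkage-prevention idea (not the authors' algorithm):
# greedily suppress diagnosis codes until every patient's remaining code set
# is shared by at least k patients, never touching GWAS-relevant codes.
from collections import Counter

def anonymize(records, k=2, protected=frozenset()):
    records = [set(r) for r in records]
    while True:
        signature = Counter(frozenset(r) for r in records)
        rare = [i for i, r in enumerate(records) if signature[frozenset(r)] < k]
        if not rare:
            return records
        # Suppress the least frequent non-protected code among the rare records.
        code_counts = Counter(c for i in rare for c in records[i] if c not in protected)
        if not code_counts:
            return records            # nothing left that we are allowed to suppress
        victim = min(code_counts, key=code_counts.get)
        for i in rare:
            records[i].discard(victim)

patients = [{"250.0", "401.1", "714.0"},   # hypothetical ICD-9-style code sets
            {"250.0", "401.1"},
            {"250.0", "401.1", "585.9"},
            {"250.0"}]
print(anonymize(patients, k=2, protected={"250.0"}))
```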
Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals
Brown, Robert; Pasaniuc, Bogdan
2014-01-01
Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are very computationally demanding, requiring long runtimes even for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to widely used computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs. PMID:24743331
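A minimal sketch of the underlying idea, assigning window-level ancestry by majority vote over the continent-specific variants carried by an individual, is shown below. The window size, evidence threshold, and variant labels are hypothetical, and the method's joint CSV calling against public panels is not reproduced.

```python
# Toy sketch of CSV-based local ancestry assignment: slide a window along a
# chromosome and call the ancestry whose continent-specific variants dominate.
# Variant positions and ancestry labels below are hypothetical.
from collections import Counter

def call_ancestry(csv_hits, chrom_length, window=1_000_000, min_hits=3):
    """csv_hits: list of (position, ancestry_label) for CSV alleles carried by
    the individual. Returns one ancestry call (or None) per window."""
    calls = []
    for start in range(0, chrom_length, window):
        in_window = [anc for pos, anc in csv_hits if start <= pos < start + window]
        counts = Counter(in_window)
        if counts and counts.most_common(1)[0][1] >= min_hits:
            calls.append(counts.most_common(1)[0][0])
        else:
            calls.append(None)   # too little evidence in this window
    return calls

hits = [(120_000, "EUR"), (410_000, "EUR"), (900_000, "EUR"), (980_000, "EUR"),
        (1_150_000, "AFR"), (1_600_000, "AFR"), (1_730_000, "AFR"), (1_890_000, "AFR")]
print(call_ancestry(hits, chrom_length=2_000_000))   # e.g. ['EUR', 'AFR']
```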
Size-based molecular diagnostics using plasma DNA for noninvasive prenatal testing.
Yu, Stephanie C Y; Chan, K C Allen; Zheng, Yama W L; Jiang, Peiyong; Liao, Gary J W; Sun, Hao; Akolekar, Ranjit; Leung, Tak Y; Go, Attie T J I; van Vugt, John M G; Minekawa, Ryoko; Oudejans, Cees B M; Nicolaides, Kypros H; Chiu, Rossa W K; Lo, Y M Dennis
2014-06-10
Noninvasive prenatal testing using fetal DNA in maternal plasma is an actively researched area. The current generation of tests using massively parallel sequencing is based on counting plasma DNA sequences originating from different genomic regions. In this study, we explored a different approach that is based on the use of DNA fragment size as a diagnostic parameter. This approach is dependent on the fact that circulating fetal DNA molecules are generally shorter than the corresponding maternal DNA molecules. First, we performed plasma DNA size analysis using paired-end massively parallel sequencing and microchip-based capillary electrophoresis. We demonstrated that the fetal DNA fraction in maternal plasma could be deduced from the overall size distribution of maternal plasma DNA. The fetal DNA fraction is a critical parameter affecting the accuracy of noninvasive prenatal testing using maternal plasma DNA. Second, we showed that fetal chromosomal aneuploidy could be detected by observing an aberrant proportion of short fragments from an aneuploid chromosome in the paired-end sequencing data. Using this approach, we detected fetal trisomy 21 and trisomy 18 with 100% sensitivity (T21: 36/36; T18: 27/27) and 100% specificity (non-T21: 88/88; non-T18: 97/97). For trisomy 13, the sensitivity and specificity were 95.2% (20/21) and 99% (102/103), respectively. For monosomy X, the sensitivity and specificity were both 100% (10/10 and 8/8). Thus, this study establishes the principle of size-based molecular diagnostics using plasma DNA. This approach has potential applications beyond noninvasive prenatal testing to areas such as oncology and transplantation monitoring.
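The size-based counting idea can be sketched as follows: compare the fraction of short fragments mapped to the chromosome of interest against reference chromosomes using a z-score. The 150 bp cutoff, the z-score form, and the simulated fragment sizes are illustrative only, not the exact statistic used in the study.

```python
# Illustrative sketch of a size-based test: is the fraction of short plasma DNA
# fragments on chr21 unusually high compared with reference chromosomes?
import numpy as np

def short_fraction(sizes, cutoff=150):
    return np.mean(np.asarray(sizes) < cutoff)

def size_zscore(test_sizes, reference_sizes_per_chrom, cutoff=150):
    ref = np.array([short_fraction(s, cutoff) for s in reference_sizes_per_chrom])
    return (short_fraction(test_sizes, cutoff) - ref.mean()) / ref.std(ddof=1)

rng = np.random.default_rng(1)
# Hypothetical fragment sizes: reference chromosomes centred near 166 bp, chr21
# slightly enriched in short (fetal-derived) fragments in a trisomy 21 pregnancy.
reference = [rng.normal(166, 20, 20_000) for _ in range(10)]
chr21 = np.concatenate([rng.normal(166, 20, 19_000), rng.normal(145, 20, 1_000)])
print(round(size_zscore(chr21, reference), 2))   # a large positive z suggests T21
```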
Impact of exome sequencing in inflammatory bowel disease
Cardinale, Christopher J; Kelsen, Judith R; Baldassano, Robert N; Hakonarson, Hakon
2013-01-01
Approaches to understanding the genetic contribution to inflammatory bowel disease (IBD) have continuously evolved from family- and population-based epidemiology, to linkage analysis, and most recently, to genome-wide association studies (GWAS). The next stage in this evolution seems to be the sequencing of the exome, that is, the regions of the human genome which encode proteins. The GWAS approach has been very fruitful in identifying at least 163 loci as being associated with IBD, and now, exome sequencing promises to take our genetic understanding to the next level. In this review we will discuss the possible contributions that can be made by an exome sequencing approach both at the individual patient level to aid with disease diagnosis and future therapies, as well as in advancing knowledge of the pathogenesis of IBD. PMID:24187447
Yang, Lei; Naylor, Gavin J P
2016-01-01
We determined the complete mitochondrial genome sequence (16,760 bp) of the peacock skate Pavoraja nitida using a long-PCR based next generation sequencing method. It has 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes, and 1 control region in the typical vertebrate arrangement. Primers, protocols, and procedures used to obtain this mitogenome are provided. We anticipate that this approach will facilitate rapid collection of mitogenome sequences for studies on phylogenetic relationships, population genetics, and conservation of cartilaginous fishes.
Multiplex De Novo Sequencing of Peptide Antibiotics
NASA Astrophysics Data System (ADS)
Mohimani, Hosein; Liu, Wei-Ting; Yang, Yu-Liang; Gaudêncio, Susana P.; Fenical, William; Dorrestein, Pieter C.; Pevzner, Pavel A.
Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. Currently, some of the most effective antibiotics (e.g., Vancomycin and Daptomycin) are cyclic peptides produced by non-ribosomal biosynthetic pathways. The isolation and sequencing of cyclic peptide antibiotics, unlike that of linear peptides, is time-consuming and error-prone. The dominant technique for sequencing cyclic peptides is NMR-based and requires large amounts (milligrams) of purified materials that, for most compounds, are not possible to obtain. Given these facts, there is a need for new tools to sequence cyclic NRPs using picograms of material. Since nearly all cyclic NRPs are produced along with related analogs, we develop a mass spectrometry approach for sequencing all related peptides at once (in contrast to the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them simultaneously using tandem mass spectrometry. We illustrate applications of this approach by sequencing new variants of cyclic peptide antibiotics from Bacillus brevis, as well as sequencing a previously unknown family of cyclic NRPs produced by marine bacteria.
Vujaklija, Ivan; Bielen, Ana; Paradžik, Tina; Biđin, Siniša; Goldstein, Pavle; Vujaklija, Dušica
2016-02-18
The massive accumulation of protein sequences arising from the rapid development of high-throughput sequencing, coupled with automatic annotation, results in high levels of incorrect annotations. In this study, we describe an approach to decrease annotation errors of protein families characterized by low overall sequence similarity. The GDSL lipolytic family comprises proteins with multifunctional properties and high potential for pharmaceutical and industrial applications. The number of proteins assigned to this family has increased rapidly over the last few years. In particular, the natural abundance of GDSL enzymes reported recently in plants indicates that they could be a good source of novel GDSL enzymes. We noticed that a significant proportion of annotated sequences lack specific GDSL motif(s) or catalytic residue(s). Here, we applied motif-based sequence analyses to identify enzymes possessing conserved GDSL motifs in selected proteomes across the plant kingdom. Motif-based HMM scanning (Viterbi decoding-VD and posterior decoding-PD) and the here described PD/VD protocol were successfully applied on 12 selected plant proteomes to identify sequences with GDSL motifs. A significant number of identified GDSL sequences were novel. Moreover, our scanning approach successfully detected protein sequences lacking at least one of the essential motifs (171/820) annotated by Pfam profile search (PfamA) as GDSL. Based on these analyses we provide a curated list of GDSL enzymes from the selected plants. CLANS clustering and phylogenetic analysis helped us to gain a better insight into the evolutionary relationship of all identified GDSL sequences. Three novel GDSL subfamilies as well as unreported variations in GDSL motifs were discovered in this study. In addition, analyses of selected proteomes showed a remarkable expansion of GDSL enzymes in the lycophyte, Selaginella moellendorffii. Finally, we provide a general motif-HMM scanner which is easily accessible through the graphical user interface ( http://compbio.math.hr/ ). Our results show that scanning with a carefully parameterized motif-HMM is an effective approach for annotation of protein families with low sequence similarity and conserved motifs. The results of this study expand current knowledge and provide new insights into the evolution of the large GDSL-lipase family in land plants.
DNA sequence analysis with droplet-based microfluidics
Abate, Adam R.; Hung, Tony; Sperling, Ralph A.; Mary, Pascaline; Rotem, Assaf; Agresti, Jeremy J.; Weiner, Michael A.; Weitz, David A.
2014-01-01
Droplet-based microfluidic techniques can form and process micrometer scale droplets at thousands per second. Each droplet can house an individual biochemical reaction, allowing millions of reactions to be performed in minutes with small amounts of total reagent. This versatile approach has been used for engineering enzymes, quantifying concentrations of DNA in solution, and screening protein crystallization conditions. Here, we use it to read the sequences of DNA molecules with a FRET-based assay. Using probes of different sequences, we interrogate a target DNA molecule for polymorphisms. With a larger probe set, additional polymorphisms can be interrogated as well as targets of arbitrary sequence. PMID:24185402
ERIC Educational Resources Information Center
Braguglia, Kay H.; Jackson, Kanata A.
2012-01-01
This article presents a reflective analysis of teaching research methodology through a three course sequence using a project-based approach. The authors reflect critically on their experiences in teaching research methods courses in an undergraduate business management program. The introduction of a range of specific techniques including student…
Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data
Hu, Bo; Xu, Yaomin
2013-01-01
Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach. PMID:23710259
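A minimal sketch of the central Bayesian comparison, reduced to a single heterozygous SNP, is shown below: a balanced binomial model is weighed against an imbalanced beta-binomial alternative with a uniform prior, yielding a posterior probability of ASM. The prior probability and the reduction to one SNP in one sample are assumptions; the paper's full model conditions on genotypes across multiple subjects.

```python
# Minimal sketch of the core Bayesian comparison (not the authors' full model):
# at a heterozygous SNP with methylated-allele read count k out of n, compare a
# balanced model p = 0.5 against an "ASM" model with p ~ Beta(1, 1).
from math import comb

def posterior_asm(k, n, prior_asm=0.05):
    like_h0 = comb(n, k) * 0.5 ** n   # balanced alleles
    like_h1 = 1.0 / (n + 1)           # beta-binomial marginal under a uniform prior on p
    num = prior_asm * like_h1
    return num / (num + (1 - prior_asm) * like_h0)

for k, n in [(15, 30), (27, 30), (30, 30)]:
    print(k, n, round(posterior_asm(k, n), 3))   # skewed counts give high ASM posteriors
```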
Molecular Identification of Ectomycorrhizal Mycelium in Soil Horizons
Landeweert, Renske; Leeflang, Paula; Kuyper, Thom W.; Hoffland, Ellis; Rosling, Anna; Wernars, Karel; Smit, Eric
2003-01-01
Molecular identification techniques based on total DNA extraction provide a unique tool for identification of mycelium in soil. Using molecular identification techniques, the ectomycorrhizal (EM) fungal community under coniferous vegetation was analyzed. Soil samples were taken at different depths from four horizons of a podzol profile. A basidiomycete-specific primer pair (ITS1F-ITS4B) was used to amplify fungal internal transcribed spacer (ITS) sequences from total DNA extracts of the soil horizons. Amplified basidiomycete DNA was cloned and sequenced, and a selection of the obtained clones was analyzed phylogenetically. Based on sequence similarity, the fungal clone sequences were sorted into 25 different fungal groups, or operational taxonomic units (OTUs). Out of 25 basidiomycete OTUs, 7 OTUs showed high nucleotide homology (≥99%) with known EM fungal sequences and 16 were found exclusively in the mineral soil. The taxonomic positions of six OTUs remained unclear. OTU sequences were compared to sequences from morphotyped EM root tips collected from the same sites. Of the 25 OTUs, 10 OTUs had ≥98% sequence similarity with these EM root tip sequences. The present study demonstrates the use of molecular techniques to identify EM hyphae in various soil types. This approach differs from the conventional method of EM root tip identification and provides a novel approach to examine EM fungal communities in soil. PMID:12514012
A long-term target detection approach in infrared image sequence
NASA Astrophysics Data System (ADS)
Li, Hang; Zhang, Qi; Li, Yuanyuan; Wang, Liqiang
2015-12-01
An automatic target detection method for use in long-term infrared (IR) image sequences from a moving platform is proposed. Firstly, based on non-linear histogram equalization, target candidates are coarse-to-fine segmented by using two self-adaptive thresholds generated in the intensity space. Then the real target is captured via two different selection approaches. At the beginning of the image sequence, the genuine target, which has little texture, is discriminated from other candidates by using a contrast-based confidence measure. On the other hand, when the target becomes larger, we apply an online EM method to iteratively estimate and update the distributions of the target's size and position based on the prior detection results, and then recognize the genuine one which satisfies both the constraints of size and position. Experimental results demonstrate that the presented method is accurate, robust and efficient.
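The later-stage selection step can be illustrated with a simplified stand-in: instead of the paper's online EM, the sketch keeps running Gaussian estimates of target size and (one-dimensional) position via Welford updates and accepts only candidates that fall inside a gate around both. All numbers are hypothetical.

```python
# Simplified stand-in for the online EM step: keep running Gaussian estimates of
# target size and position (Welford updates) and accept the candidate that falls
# inside an n-sigma gate around both.
class RunningGaussian:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else float("inf")

def pick_target(candidates, size_model, pos_model, n_sigma=3.0):
    """candidates: list of (size, position); returns the accepted candidate or None."""
    for size, pos in candidates:
        ok_size = abs(size - size_model.mean) <= n_sigma * size_model.std
        ok_pos = abs(pos - pos_model.mean) <= n_sigma * pos_model.std
        if ok_size and ok_pos:
            size_model.update(size)
            pos_model.update(pos)
            return (size, pos)
    return None

size_model, pos_model = RunningGaussian(), RunningGaussian()
for size, pos in [(12, 100), (13, 104), (14, 109)]:   # hypothetical earlier detections
    size_model.update(size); pos_model.update(pos)
print(pick_target([(60, 400), (15, 114)], size_model, pos_model))   # -> (15, 114)
```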
Whiley, David M; Jacob, Kevin; Nakos, Jennifer; Bletchly, Cheryl; Nimmo, Graeme R; Nissen, Michael D; Sloots, Theo P
2012-06-01
Numerous real-time PCR assays have been described for detection of the influenza A H275Y alteration. However, the performance of these methods can be undermined by sequence variation in the regions flanking the codon of interest. This is a problem encountered more broadly in microbial diagnostics. In this study, we developed a modification of hybridization probe-based melting curve analysis, whereby primers are used to mask proximal mutations in the sequence targets of hybridization probes, so as to limit the potential for sequence variation to interfere with typing. The approach was applied to the H275Y alteration of the influenza A (H1N1) 2009 strain, as well as a Neisseria gonorrhoeae mutation associated with antimicrobial resistance. Assay performances were assessed using influenza A and N. gonorrhoeae strains characterized by DNA sequencing. The modified hybridization probe-based approach proved successful in limiting the effects of proximal mutations, with the results of melting curve analyses being 100% consistent with the results of DNA sequencing for all influenza A and N. gonorrhoeae strains tested. Notably, these included influenza A and N. gonorrhoeae strains exhibiting additional mutations in hybridization probe targets. Of particular interest was that the H275Y assay correctly typed influenza A strains harbouring a T822C nucleotide substitution, previously shown to interfere with H275Y typing methods. Overall our modified hybridization probe-based approach provides a simple means of circumventing problems caused by sequence variation, and offers improved detection of the influenza A H275Y alteration and potentially other resistance mechanisms.
COI (cytochrome oxidase-I) sequence based studies of Carangid fishes from Kakinada coast, India.
Persis, M; Chandra Sekhar Reddy, A; Rao, L M; Khedkar, G D; Ravinder, K; Nasruddin, K
2009-09-01
Mitochondrial DNA, cytochrome oxidase-1 gene sequences were analyzed for species identification and phylogenetic relationship among the very high food value and commercially important Indian carangid fish species. Sequence analysis of COI gene very clearly indicated that all the 28 fish species fell into five distinct groups, which are genetically distant from each other and exhibited identical phylogenetic reservation. All the COI gene sequences from 28 fishes provide sufficient phylogenetic information and evolutionary relationship to distinguish the carangid species unambiguously. This study proves the utility of mtDNA COI gene sequence based approach in identifying fish species at a faster pace.
Emerman, Amy B; Bowman, Sarah K; Barry, Andrew; Henig, Noa; Patel, Kruti M; Gardner, Andrew F; Hendrickson, Cynthia L
2017-07-05
Next-generation sequencing (NGS) is a powerful tool for genomic studies, translational research, and clinical diagnostics that enables the detection of single nucleotide polymorphisms, insertions and deletions, copy number variations, and other genetic variations. Target enrichment technologies improve the efficiency of NGS by only sequencing regions of interest, which reduces sequencing costs while increasing coverage of the selected targets. Here we present NEBNext Direct®, a hybridization-based, target-enrichment approach that addresses many of the shortcomings of traditional target-enrichment methods. This approach features a simple, 7-hr workflow that uses enzymatic removal of off-target sequences to achieve a high specificity for regions of interest. Additionally, unique molecular identifiers are incorporated for the identification and filtering of PCR duplicates. The same protocol can be used across a wide range of input amounts, input types, and panel sizes, enabling NEBNext Direct to be broadly applicable across a wide variety of research and diagnostic needs. © 2017 John Wiley & Sons, Inc.
Sparse reconstruction of liver cirrhosis from monocular mini-laparoscopic sequences
NASA Astrophysics Data System (ADS)
Marcinczak, Jan Marek; Painer, Sven; Grigat, Rolf-Rainer
2015-03-01
Mini-laparoscopy is a technique which is used by clinicians to inspect the liver surface with ultra-thin laparoscopes. However, so far no quantitative measures based on mini-laparoscopic sequences are possible. This paper presents a Structure from Motion (SfM) based methodology to do 3D reconstruction of liver cirrhosis from mini-laparoscopic videos. The approach combines state-of-the-art tracking, pose estimation, outlier rejection and global optimization to obtain a sparse reconstruction of the cirrhotic liver surface. Specular reflection segmentation is included into the reconstruction framework to increase the robustness of the reconstruction. The presented approach is evaluated on 15 endoscopic sequences using three cirrhotic liver phantoms. The median reconstruction accuracy ranges from 0.3 mm to 1 mm.
Toma, Tudor; Bosman, Robert-Jan; Siebes, Arno; Peek, Niels; Abu-Hanna, Ameen
2010-08-01
An important problem in Intensive Care is how to predict on a given day of stay the eventual hospital mortality for a specific patient. A recent approach to solve this problem suggested the use of frequent temporal sequences (FTSs) as predictors. Methods following this approach were evaluated in the past by inducing a model from a training set and validating the prognostic performance on an independent test set. Although this evaluative approach addresses the validity of the specific models induced in an experiment, it falls short of evaluating the inductive method itself. To achieve this, one must account for the inherent sources of variation in the experimental design. The main aim of this work is to demonstrate a procedure based on bootstrapping, specifically the .632 bootstrap procedure, for evaluating inductive methods that discover patterns, such as FTSs. A second aim is to apply this approach to find out whether a recently suggested inductive method that discovers FTSs of organ functioning status is superior to a traditional method that does not use temporal sequences when compared on each successive day of stay at the Intensive Care Unit. The use of bootstrapping with logistic regression using pre-specified covariates is known in the statistical literature. Using inductive methods for prognostic models based on temporal sequence discovery within the bootstrap procedure is, however, novel, at least for predictive models in Intensive Care. Our results of applying the bootstrap-based evaluative procedure demonstrate the superiority of the FTS-based inductive method over the traditional method in terms of discrimination as well as accuracy. In addition, we illustrate the insights gained by the analyst into the discovered FTSs from the bootstrap samples. Copyright 2010 Elsevier Inc. All rights reserved.
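The .632 bootstrap itself is straightforward to sketch: the error estimate is a weighted combination of the apparent (training) error and the out-of-bag error, err.632 = 0.368 * err_apparent + 0.632 * err_oob. The inducer and loss below are toy placeholders standing in for FTS discovery plus model fitting.

```python
# Minimal sketch of the .632 bootstrap estimate of prediction error for an
# arbitrary "induce" function (e.g., one that discovers frequent temporal
# sequences and fits a model on them). Inducer and loss here are placeholders.
import random

def bootstrap_632(data, induce, loss, n_boot=100, seed=0):
    rng = random.Random(seed)
    n = len(data)
    model_full = induce(data)
    apparent = sum(loss(model_full, d) for d in data) / n        # training ("apparent") error
    oob_losses = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]               # bootstrap sample (indices)
        model = induce([data[i] for i in idx])
        out_of_bag = [data[i] for i in range(n) if i not in set(idx)]
        oob_losses.extend(loss(model, d) for d in out_of_bag)
    err_oob = sum(oob_losses) / len(oob_losses)                  # out-of-bag error
    return 0.368 * apparent + 0.632 * err_oob

# Toy stand-ins: "induce" returns the majority label, "loss" is 0/1 misclassification.
data = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
induce = lambda d: round(sum(d) / len(d))
loss = lambda model, d: int(model != d)
print(round(bootstrap_632(data, induce, loss), 3))
```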
Genome alignment with graph data structures: a comparison
2014-01-01
Background Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment. Results We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures. Conclusion We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools. PMID:24712884
ERIC Educational Resources Information Center
Besson, Ugo; Borghi, Lidia; De Ambrosis, Anna; Mascheretti, Paolo
2010-01-01
We have developed a teaching-learning sequence (TLS) on friction based on a preliminary study involving three dimensions: an analysis of didactic research on the topic, an overview of usual approaches, and a critical analysis of the subject, considered also in its historical development. We found that mostly the usual presentations do not take…
Cycle-time determination and process control of sequencing batch membrane bioreactors.
Krampe, J
2013-01-01
In this paper a method to determine the cycle time for sequencing batch membrane bioreactors (SBMBRs) is introduced. One of the advantages of SBMBRs is the simplicity of adapting them to varying wastewater composition. The benefit of this flexibility can only be fully utilised if the cycle times are optimised for the specific inlet load conditions. This requires either proactive and ongoing operator adjustment or active predictive instrument-based control. Determination of the cycle times for conventional sequencing batch reactor (SBR) plants is usually based on experience. Due to the higher mixed liquor suspended solids concentrations in SBMBRs and the limited experience with their application, a new approach to calculate the cycle time had to be developed. Based on results from a semi-technical pilot plant, the paper presents an approach for calculating the cycle time in relation to the influent concentration according to the Activated Sludge Model No. 1 and the German HSG (Hochschulgruppe) Approach. The approach presented in this paper considers the increased solid contents in the reactor and the resultant shortened reaction times. This allows for an exact calculation of the nitrification and denitrification cycles with a tolerance of only a few minutes. Ultimately the same approach can be used for a predictive control strategy and for conventional SBR plants.
Clustering evolving proteins into homologous families.
Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A
2013-04-08
Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches. Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment. Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
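The greedy-heuristic side of the comparison can be illustrated with a toy UCLUST-like clusterer: sequences are processed longest-first and join the first centroid they match above a similarity threshold, otherwise they found a new cluster. The k-mer Jaccard similarity used here is a crude stand-in for real alignment identity, and the sequences are made up.

```python
# Toy greedy centroid clustering in the spirit of UCLUST (not the real tool).
def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)   # crude proxy for alignment identity

def greedy_cluster(seqs, threshold=0.5):
    clusters = []                        # list of (centroid, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for centroid, members in clusters:
            if similarity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))   # no centroid matched: found a new cluster
    return clusters

proteins = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSR",
            "MSVLTPLLLRGLTGSARRLPVPRAK", "MKTAYIAKQRQLSFVKSHFSRQ"]
for centroid, members in greedy_cluster(proteins):
    print(centroid[:10], len(members))
```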
Phase-based motion magnification video for monitoring of vital signals using the Hermite transform
NASA Astrophysics Data System (ADS)
Brieva, Jorge; Moya-Albor, Ernesto
2017-11-01
In this paper we present a new Eulerian phase-based motion magnification technique using the Hermite Transform (HT) decomposition, which is inspired by the Human Visual System (HVS). We test our method on a sequence of a newborn baby breathing and on a video sequence that shows the heartbeat at the wrist. We detect and magnify the heart pulse by applying our technique. Our motion magnification approach is compared to the Laplacian phase-based approach by means of quantitative metrics (based on the RMS error and the Fourier transform) to measure the quality of both reconstruction and magnification. In addition, a noise robustness analysis is performed for the two methods.
Toward the 1,000 dollars human genome.
Bennett, Simon T; Barnes, Colin; Cox, Anthony; Davies, Lisa; Brown, Clive
2005-06-01
Revolutionary new technologies, capable of transforming the economics of sequencing, are providing an unparalleled opportunity to analyze human genetic variation comprehensively at the whole-genome level within a realistic timeframe and at affordable costs. Current estimates suggest that it would cost somewhere in the region of 30 million US dollars to sequence an entire human genome using Sanger-based sequencing, and on one machine it would take about 60 years. Solexa is widely regarded as a company with the necessary disruptive technology to be the first to achieve the ultimate goal of the so-called 1,000 dollars human genome - the conceptual cost-point needed for routine analysis of individual genomes. Solexa's technology is based on completely novel sequencing chemistry capable of sequencing billions of individual DNA molecules simultaneously, a base at a time, to enable highly accurate, low cost analysis of an entire human genome in a single experiment. When applied over a large enough genomic region, these new approaches to resequencing will enable the simultaneous detection and typing of known, as well as unknown, polymorphisms, and will also offer information about patterns of linkage disequilibrium in the population being studied. Technological progress, leading to the advent of single-molecule-based approaches, is beginning to dramatically drive down costs and increase throughput to unprecedented levels, each being several orders of magnitude better than that which is currently available. A new sequencing paradigm based on single molecules will be faster, cheaper and more sensitive, and will permit routine analysis at the whole-genome level.
Description of Drinking Water Bacterial Communities Using 16S rRNA Gene Sequence Analyses
Descriptions of bacterial communities inhabiting water distribution systems (WDS) have mainly been accomplished using culture-based approaches. Due to the inherent selective nature of culture-based approaches, the majority of bacteria inhabiting WDS remain uncharacterized. The go...
Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu
2016-04-01
Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html. Contact: fsun@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
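The basic quantity behind these statistics, an order-k transition matrix estimated by pooling (k+1)-mer counts across reads rather than from one long sequence, can be sketched as follows; the order-selection statistics themselves are not reproduced here.

```python
# Sketch: estimate an order-k Markov transition matrix by pooling (k+1)-mer
# counts across NGS reads (each read contributes its own overlapping words,
# so no assembly is needed).
from collections import defaultdict

def transition_matrix(reads, k=2, alphabet="ACGT"):
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k):
            context, nxt = read[i:i + k], read[i + k]
            if nxt in alphabet and all(c in alphabet for c in context):
                counts[context][nxt] += 1
    probs = {}
    for context, row in counts.items():
        total = sum(row.values())
        probs[context] = {b: row[b] / total for b in alphabet if row[b]}
    return probs

reads = ["ACGTACGTAC", "CGTACGTTAC", "GTACGTACGA"]   # toy reads
for context, row in sorted(transition_matrix(reads, k=2).items()):
    print(context, row)
```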
Fraley, Stephanie I; Hardick, Justin; Masek, Billie J; Jo Masek, Billie; Athamanolap, Pornpat; Rothman, Richard E; Gaydos, Charlotte A; Carroll, Karen C; Wakefield, Teresa; Wang, Tza-Huei; Yang, Samuel
2013-10-01
Comprehensive profiling of nucleic acids in genetically heterogeneous samples is important for clinical and basic research applications. Universal digital high-resolution melt (U-dHRM) is a new approach to broad-based PCR diagnostics and profiling technologies that can overcome issues of poor sensitivity due to contaminating nucleic acids and poor specificity due to primer or probe hybridization inaccuracies for single nucleotide variations. The U-dHRM approach uses broad-based primers or ligated adapter sequences to universally amplify all nucleic acid molecules in a heterogeneous sample, which have been partitioned, as in digital PCR. Extensive assay optimization enables direct sequence identification by algorithm-based matching of melt curve shape and Tm to a database of known sequence-specific melt curves. We show that single-molecule detection and single nucleotide sensitivity is possible. The feasibility and utility of U-dHRM is demonstrated through detection of bacteria associated with polymicrobial blood infection and microRNAs (miRNAs) associated with host response to infection. U-dHRM using broad-based 16S rRNA gene primers demonstrates universal single cell detection of bacterial pathogens, even in the presence of larger amounts of contaminating bacteria; U-dHRM using universally adapted Lethal-7 miRNAs in a heterogeneous mixture showcases the single copy sensitivity and single nucleotide specificity of this approach.
Prediction of Protein Structure by Template-Based Modeling Combined with the UNRES Force Field.
Krupa, Paweł; Mozolewska, Magdalena A; Joo, Keehyoung; Lee, Jooyoung; Czaplewski, Cezary; Liwo, Adam
2015-06-22
A new approach to the prediction of protein structures that uses distance and backbone virtual-bond dihedral angle restraints derived from template-based models and simulations with the united residue (UNRES) force field is proposed. The approach combines the accuracy and reliability of template-based methods for the segments of the target sequence with high similarity to those having known structures with the ability of UNRES to pack the domains correctly. Multiplexed replica-exchange molecular dynamics with restraints derived from template-based models of a given target, in which each restraint is weighted according to the accuracy of the prediction of the corresponding section of the molecule, is used to search the conformational space, and the weighted histogram analysis method and cluster analysis are applied to determine the families of the most probable conformations, from which candidate predictions are selected. To test the capability of the method to recover template-based models from restraints, five single-domain proteins with structures that have been well-predicted by template-based methods were used; it was found that the resulting structures were of the same quality as the best of the original models. To assess whether the new approach can improve template-based predictions with incorrectly predicted domain packing, four such targets were selected from the CASP10 targets; for three of them the new approach resulted in significantly better predictions compared with the original template-based models. The new approach can be used to predict the structures of proteins for which good templates can be found for sections of the sequence or an overall good template can be found for the entire sequence but the prediction quality is remarkably weaker in putative domain-linker regions.
GASP: Gapped Ancestral Sequence Prediction for proteins
Edwards, Richard J; Shields, Denis C
2004-01-01
Background The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments. Results Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy. Conclusions GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike. PMID:15350199
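A much-simplified sketch of the first stage, deciding gap versus non-gap at internal nodes for one alignment column, is given below using plain Fitch parsimony on a binary character; GASP's subsequent likelihood-based amino acid assignment with substitution matrices is not reproduced.

```python
# Simplified sketch of the gap-assignment idea for a single alignment column:
# Fitch parsimony on a binary gap / non-gap character over a rooted binary tree.
def fitch_down(node, leaf_states, assignments):
    """node: leaf name (str) or (left, right) tuple. Fills `assignments` bottom-up."""
    if isinstance(node, str):
        return {leaf_states[node]}
    left = fitch_down(node[0], leaf_states, assignments)
    right = fitch_down(node[1], leaf_states, assignments)
    state = (left & right) or (left | right)   # Fitch rule: intersection if possible
    assignments[node] = state
    return state

def fitch_up(node, assignments, resolved, parent_state=None):
    if isinstance(node, str):
        return
    options = assignments[node]
    resolved[node] = parent_state if parent_state in options else sorted(options)[0]
    for child in node:
        fitch_up(child, assignments, resolved, resolved[node])

leaves = {"seqA": "-", "seqB": "-", "seqC": "X", "seqD": "X"}   # one alignment column
tree = (("seqA", "seqB"), ("seqC", "seqD"))
assignments, resolved = {}, {}
fitch_down(tree, leaves, assignments)
fitch_up(tree, assignments, resolved)
for node, state in resolved.items():
    print(node, "->", state)   # gap / non-gap state chosen at each internal node
```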
2010-01-01
Background Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this important information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. Results We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. Conclusions We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch. PMID:20682041
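Phase 2 of the pipeline can be approximated with a very small stand-in for the finite-state-machine recognizers: a regular expression that flags runs of IUPAC nucleotide codes of primer-like length, optionally wrapped in 5'-/-3' decorations. The length bounds are illustrative.

```python
# Minimal stand-in for the recognizer phase: flag runs of IUPAC nucleotide codes
# of primer-like length in free text. Length bounds (15-40 nt) are illustrative.
import re

IUPAC = "ACGTURYSWKMBDHVN"
PRIMER_RE = re.compile(rf"\b(?:5'-)?([{IUPAC}]{{15,40}})(?:-3')?\b", re.IGNORECASE)

def candidate_primers(text):
    return [m.group(1).upper() for m in PRIMER_RE.finditer(text)]

paragraph = ("PCR was performed with the forward primer 5'-GTGCCAGCMGCCGCGGTAA-3' "
             "and reverse primer GGACTACHVGGGTWTCTAAT at 55 degrees C.")
print(candidate_primers(paragraph))
```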
VECTOR: A Hands-On Approach that Makes Electromagnetics Relevant to Students
ERIC Educational Resources Information Center
Bunting, C. F.; Cheville, R. A.
2009-01-01
A two-course sequence in electromagnetics (EM) was developed in order to address a perceived lack of student learning and engagement observed in a traditional, lecture-based EM course. The two-course sequence is named VECTOR: Vitalizing Electromagnetic Concepts To Obtain Relevance. This paper reports on the first course of the sequence. VECTOR…
NASA Astrophysics Data System (ADS)
Ivanov, Mark V.; Lobas, Anna A.; Levitsky, Lev I.; Moshkovskii, Sergei A.; Gorshkov, Mikhail V.
2018-02-01
In a proteogenomic approach based on tandem mass spectrometry analysis of proteolytic peptide mixtures, customized exome or RNA-seq databases are employed for identifying protein sequence variants. However, the problem of variant peptide identification without personalized genomic data is important for a variety of applications. Following the recent proposal by Chick et al. (Nat. Biotechnol. 33, 743-749, 2015) on the feasibility of such variant peptide search, we evaluated two available approaches based on the previously suggested "open" search and the "brute-force" strategy. To improve the efficiency of these approaches, we propose an algorithm for exclusion of false variant identifications from the search results involving analysis of modifications mimicking single amino acid substitutions. Also, we propose a de novo based scoring scheme for assessment of identified point mutations. In the scheme, the search engine analyzes y-type fragment ions in MS/MS spectra to confirm the location of the mutation in the variant peptide sequence.
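One ingredient of such filtering, checking whether an observed open-search mass shift can be explained by a single amino acid substitution (and might therefore also be mimicked by a modification), can be sketched directly from standard monoisotopic residue masses; the tolerance below is illustrative.

```python
# Sketch of one check used when validating open-search hits: does an observed
# precursor mass shift match a single amino acid substitution delta?
# Monoisotopic residue masses in Da; the 0.01 Da tolerance is illustrative.
from itertools import permutations

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def matching_substitutions(mass_shift, tol=0.01):
    """Return all (from_aa, to_aa) pairs whose mass difference explains the shift."""
    return [(a, b) for a, b in permutations(RESIDUE_MASS, 2)
            if abs((RESIDUE_MASS[b] - RESIDUE_MASS[a]) - mass_shift) <= tol]

print(matching_substitutions(0.98402))    # e.g. N->D or Q->E (deamidation-like shift)
print(matching_substitutions(-18.0106))   # no single substitution: likely a modification
```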
The interplay of representations and patterns of classroom discourse in science teaching sequences
NASA Astrophysics Data System (ADS)
Tang, Kok-Sing
2016-09-01
The purpose of this study is to examine the relationship between the communicative approach of classroom talk and the modes of representation used by science teachers. Based on video data from two physics classrooms in Singapore, a recurring pattern in the relationship was observed as the teaching sequence of a lesson unfolded. It was found that as the mode of representation shifted from enactive (action based) to iconic (image based) to symbolic (language based), there was a concurrent and coordinated shift in the classroom communicative approach from interactive-dialogic to interactive-authoritative to non-interactive-authoritative. Specifically, the shift from enactive to iconic to symbolic representations occurred mainly within the interactive-dialogic approach while the shift towards the interactive-authoritative and non-interactive-authoritative approaches occurred when symbolic modes of representation were used. This concurrent and coordinated shift has implications for how we conceive of the use of representations in conjunction with the co-occurring classroom discourse, both theoretically and pedagogically.
[Current applications of high-throughput DNA sequencing technology in antibody drug research].
Yu, Xin; Liu, Qi-Gang; Wang, Ming-Rong
2012-03-01
Since the publication in 2005 of a high-throughput DNA sequencing technology based on PCR carried out in oil emulsions, high-throughput DNA sequencing platforms have evolved into a robust technology for sequencing genomes and diverse DNA libraries. Antibody libraries with vast numbers of members currently serve as a foundation for discovering novel antibody drugs, and high-throughput DNA sequencing technology makes it possible to rapidly identify functional antibody variants with desired properties. Herein we present a review of current applications of high-throughput DNA sequencing technology in the analysis of antibody library diversity, sequencing of CDR3 regions, identification of potent antibodies based on sequence frequency, discovery of functional genes, and combination with various display technologies, so as to provide an alternative approach to the discovery and development of antibody drugs.
NASA Astrophysics Data System (ADS)
Debenjak, Andrej; Boškoski, Pavle; Musizza, Bojan; Petrovčič, Janko; Juričić, Đani
2014-05-01
This paper proposes an approach to the estimation of PEM fuel cell impedance by utilizing pseudo-random binary sequence as a perturbation signal and continuous wavelet transform with Morlet mother wavelet. With the approach, the impedance characteristic in the frequency band from 0.1 Hz to 500 Hz is identified in 60 seconds, approximately five times faster compared to the conventional single-sine approach. The proposed approach was experimentally evaluated on a single PEM fuel cell of a larger fuel cell stack. The quality of the results remains at the same level compared to the single-sine approach.
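A rough, self-contained sketch of the measurement principle is given below: a binary perturbation current drives a simple Randles-like test circuit (standing in for the fuel cell), and the impedance at each frequency is estimated from the ratio of Morlet wavelet coefficients of voltage and current. The circuit values, wavelet parameters, and the use of a plain random binary sequence instead of a true PRBS are all simplifying assumptions.

```python
import numpy as np

fs = 500.0                               # sampling rate, Hz (illustrative)
t = np.arange(0, 60, 1 / fs)             # ~60 s record, matching the measurement time above
rng = np.random.default_rng(0)
current = np.repeat(rng.choice([-1.0, 1.0], size=t.size // 5), 5)   # binary excitation, A

# Randles-like test circuit standing in for the cell: R0 in series with (R1 || C1).
R0, R1, C1 = 0.010, 0.020, 1.0           # ohm, ohm, farad (illustrative values)
vc = np.zeros_like(current)
for n in range(1, current.size):         # forward-Euler integration of the RC branch
    vc[n] = vc[n - 1] + (current[n - 1] - vc[n - 1] / R1) / C1 / fs
voltage = R0 * current + vc

def morlet_coefficients(signal, freq, fs, w0=6.0):
    """Continuous wavelet transform of `signal` at one frequency (complex Morlet)."""
    scale = w0 * fs / (2 * np.pi * freq)
    k = np.arange(-4 * scale, 4 * scale + 1)
    wavelet = np.exp(1j * w0 * k / scale) * np.exp(-0.5 * (k / scale) ** 2)
    return np.convolve(signal, wavelet, mode="same")

mid = slice(t.size // 4, -t.size // 4)   # drop edges affected by wavelet end effects
for f in [0.5, 5.0, 50.0]:
    Wi = morlet_coefficients(current, f, fs)[mid]
    Wv = morlet_coefficients(voltage, f, fs)[mid]
    Z = np.mean(Wv * np.conj(Wi)) / np.mean(np.abs(Wi) ** 2)   # least-squares ratio
    print(f"{f:5.1f} Hz   |Z| ~ {1000 * abs(Z):.1f} mOhm")
```

Because the same wavelet filters both signals, their complex ratio approximates the transfer function at the wavelet's centre frequency, which is what makes one broadband record sufficient for the whole impedance spectrum.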
NASA Astrophysics Data System (ADS)
Seto, Donald
The convergence and wealth of informatics, bioinformatics and genomics methods and associated resources allow a comprehensive and rapid approach for the surveillance and detection of bacterial and viral organisms. Coupled with the continuing race for the fastest, most cost-efficient and highest-quality DNA sequencing technology, that is, "next generation sequencing", the detection of biological threat agents by "cheaper and faster" means is possible. With the application of improved bioinformatic tools for the understanding of these genomes and for parsing unique pathogen genome signatures, along with "state-of-the-art" informatics which include faster computational methods, equipment and databases, it is feasible to apply new algorithms to biothreat agent detection. Two such methods are high-throughput DNA sequencing-based and resequencing microarray-based identification. These are illustrated and validated by two examples involving human adenoviruses, both from real-world test beds.
Sequencing small genomic targets with high efficiency and extreme accuracy
Schmitt, Michael W.; Fox, Edward J.; Prindle, Marc J.; Reid-Bayliss, Kate S.; True, Lawrence D.; Radich, Jerald P.; Loeb, Lawrence A.
2015-01-01
The detection of minority variants in mixed samples demands methods for enrichment and accurate sequencing of small genomic intervals. We describe an efficient approach based on sequential rounds of hybridization with biotinylated oligonucleotides, enabling more than one-million fold enrichment of genomic regions of interest. In conjunction with error correcting double-stranded molecular tags, our approach enables the quantification of mutations in individual DNA molecules. PMID:25849638
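The tag-based error-correction idea can be sketched generically: reads sharing a molecular tag are grouped into a family and a base is kept only where the family agrees by a clear majority. This ignores the double-stranded aspect of the authors' tags, and the family-size and majority thresholds are illustrative.

```python
# Generic sketch of tag-based error correction (not the authors' exact
# double-stranded tag scheme): group reads by their molecular tag and keep a
# base call only where the family agrees by a clear majority.
from collections import Counter, defaultdict

def tag_consensus(tagged_reads, min_family=3, min_fraction=0.9):
    families = defaultdict(list)
    for tag, read in tagged_reads:
        families[tag].append(read)
    consensus = {}
    for tag, reads in families.items():
        if len(reads) < min_family:
            continue                                   # too few copies to error-correct
        calls = []
        for column in zip(*reads):
            base, count = Counter(column).most_common(1)[0]
            calls.append(base if count / len(column) >= min_fraction else "N")
        consensus[tag] = "".join(calls)
    return consensus

reads = [("TAG1", "ACGTACGT"), ("TAG1", "ACGTACGT"), ("TAG1", "ACGAACGT"),   # one sequencing error
         ("TAG2", "ACGTTCGT"), ("TAG2", "ACGTTCGT"), ("TAG2", "ACGTTCGT")]   # a true variant
print(tag_consensus(reads))
```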
DNA enrichment approaches to identify unauthorized genetically modified organisms (GMOs).
Arulandhu, Alfred J; van Dijk, Jeroen P; Dobnik, David; Holst-Jensen, Arne; Shi, Jianxin; Zel, Jana; Kok, Esther J
2016-07-01
With the increased global production of different genetically modified (GM) plant varieties, chances increase that unauthorized GM organisms (UGMOs) may enter the food chain. At the same time, the detection of UGMOs is a challenging task because of the limited sequence information that will generally be available. PCR-based methods are available to detect and quantify known UGMOs in specific cases. If this approach is not feasible, DNA enrichment of the unknown adjacent sequences of known GMO elements is one way to detect the presence of UGMOs in a food or feed product. These enrichment approaches are also known as chromosome walking or gene walking (GW). In recent years, enrichment approaches have been coupled with next generation sequencing (NGS) analysis and implemented in, amongst others, the medical and microbiological fields. The present review will provide an overview of these approaches and an evaluation of their applicability in the identification of UGMOs in complex food or feed samples.
Sequence-specific bias correction for RNA-seq data using recurrent neural networks.
Zhang, Yao-Zhong; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru
2017-01-25
The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bear on a host of bioinformatics problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data using Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training an RNN-based nucleotide sequence model is efficient and RNN-based bias correction methods compare well with the state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. RNNs provide an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures.
Managing complex processing of medical image sequences by program supervision techniques
NASA Astrophysics Data System (ADS)
Crubezy, Monica; Aubry, Florent; Moisan, Sabine; Chameroy, Virginie; Thonnat, Monique; Di Paola, Robert
1997-05-01
Our objective is to offer clinicians wider access to evolving medical image processing (MIP) techniques, which are crucial for improving the assessment and quantification of physiological processes but difficult for non-specialists in MIP to handle. Based on artificial intelligence techniques, our approach consists of the development of a knowledge-based program supervision system that automates the management of MIP libraries. It comprises a library of programs, a knowledge base capturing the expertise about programs and data, and a supervision engine. It selects, organizes and executes the appropriate MIP programs given a goal to achieve and a data set, with dynamic feedback based on the results obtained. It also advises users in the development of new procedures chaining MIP programs. We have experimented with the approach for an application of factor analysis of medical image sequences as a means of predicting the response of osteosarcoma to chemotherapy, with both MRI and NM dynamic image sequences. As a result, our program supervision system frees clinical end-users from performing tasks outside their competence, permitting them to concentrate on clinical issues. Therefore our approach enables a better exploitation of the possibilities offered by MIP and higher-quality results, in terms of both robustness and reliability.
Highly multiplexed targeted DNA sequencing from single nuclei.
Leung, Marco L; Wang, Yong; Kim, Charissa; Gao, Ruli; Jiang, Jerry; Sei, Emi; Navin, Nicholas E
2016-02-01
Single-cell DNA sequencing methods are challenged by poor physical coverage, high technical error rates and low throughput. To address these issues, we developed a single-cell DNA sequencing protocol that combines flow-sorting of single nuclei, time-limited multiple-displacement amplification (MDA), low-input library preparation, DNA barcoding, targeted capture and next-generation sequencing (NGS). This approach represents a major improvement over our previous single nucleus sequencing (SNS) Nature Protocols paper in terms of generating higher-coverage data (>90%), thereby enabling the detection of genome-wide variants in single mammalian cells at base-pair resolution. Furthermore, by pooling 48-96 single-cell libraries together for targeted capture, this approach can be used to sequence many single-cell libraries in parallel in a single reaction. This protocol greatly reduces the cost of single-cell DNA sequencing, and it can be completed in 5-6 d by advanced users. This single-cell DNA sequencing protocol has broad applications for studying rare cells and complex populations in diverse fields of biological research and medicine.
Chen, Xianfeng; Johnson, Stephen; Jeraldo, Patricio; Wang, Junwen; Chia, Nicholas; Kocher, Jean-Pierre A; Chen, Jun
2018-03-01
Illumina paired-end sequencing has become increasingly popular for 16S rRNA gene-based microbiota profiling. It provides higher phylogenetic resolution than single-end reads due to a longer read length. However, the reverse read (R2) often has significantly lower base quality, and a large proportion of R2s will be discarded after quality control, resulting in a mixture of paired-end and single-end reads. A typical 16S analysis pipeline usually processes either paired-end or single-end reads but not a mixture. Thus, quantification accuracy and statistical power are reduced due to the loss of a large number of reads. As a result, rare taxa may not be detectable with a paired-end approach, while a single-end approach yields lower taxonomic resolution. To obtain both the higher phylogenetic resolution provided by paired-end reads and the higher sequence coverage provided by single-end reads, we propose a novel OTU-picking pipeline, hybrid-denovo, that can process a hybrid of single-end and paired-end reads. Using high-quality paired-end reads as a gold standard, we show that hybrid-denovo achieved the highest correlation with the gold standard and performed better than the approaches based on paired-end or single-end reads in terms of quantifying the microbial diversity and taxonomic abundances. By applying our method to a rheumatoid arthritis (RA) data set, we demonstrated that hybrid-denovo captured more microbial diversity and identified more RA-associated taxa than a paired-end or single-end approach. Hybrid-denovo utilizes both paired-end and single-end 16S sequencing reads and is recommended for 16S rRNA gene targeted paired-end sequencing data.
2013-01-01
Background With high quantity and quality data production and low cost, next generation sequencing has the potential to provide new opportunities for plant phylogeographic studies on single and multiple species. Here we present an approach for in silico chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value and quality, we compared the short-read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer-read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and to confirm low diversity across its distribution. Conclusions Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no available closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery and therefore a useful approach for landscape-level studies of evolutionary ecology. PMID:23497206
Mutation Detection with Next-Generation Resequencing through a Mediator Genome
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wurtzel, Omri; Dori-Bachash, Mally; Pietrokovski, Shmuel
2010-12-31
The affordability of next generation sequencing (NGS) is transforming the field of mutation analysis in bacteria. The genetic basis for phenotype alteration can be identified directly by sequencing the entire genome of the mutant and comparing it to the wild-type (WT) genome, thus identifying acquired mutations. A major limitation for this approach is the need for an a priori sequenced reference genome for the WT organism, as the short reads of most current NGS approaches usually prohibit de novo genome assembly. To overcome this limitation we propose a general framework that utilizes the genomes of related organisms as mediators for comparing WT and mutant bacteria. Under this framework, both mutant and WT genomes are sequenced with NGS, and the short sequencing reads are mapped to the mediator genome. Variations between the mutant and the mediator that recur in the WT are ignored, thus pinpointing the differences between the mutant and the WT. To validate this approach we sequenced the genome of Bdellovibrio bacteriovorus 109J, an obligatory bacterial predator, and its prey-independent mutant, and compared both to the mediator species Bdellovibrio bacteriovorus HD100. Although the mutant and the mediator sequences differed in more than 28,000 nucleotide positions, our approach enabled pinpointing the single causative mutation. Experimental validation in 53 additional mutants further established the implicated gene. Our approach extends the applicability of NGS-based mutant analyses beyond the domain of available reference genomes.
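The filtering logic at the heart of this framework can be sketched in a few lines; the snippet below is an illustrative stand-in (toy coordinates, not the published pipeline), assuming variant calls against the mediator are already available for both strains.

```python
# Minimal sketch (not the published pipeline): given variant calls of the
# mutant and the wild type (WT) against a mediator reference, keep only the
# differences that are unique to the mutant.
def mutant_specific_variants(mutant_calls, wt_calls):
    """Each call set is an iterable of (position, alt_base) tuples obtained by
    mapping reads to the mediator genome. Variants shared with the WT reflect
    mediator/WT divergence and are discarded."""
    return sorted(set(mutant_calls) - set(wt_calls))

wt = [(101, "A"), (2045, "T"), (77213, "G")]           # toy WT-vs-mediator calls
mut = [(101, "A"), (2045, "T"), (53110, "C"), (77213, "G")]
print(mutant_specific_variants(mut, wt))               # -> [(53110, 'C')]
```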
Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.
2015-01-01
We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild-type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences. PMID:26723608
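For reference, the Boltzmann inversion step mentioned above is conventionally written as follows (textbook form; the CAMELOT-specific refinement via non-linear regression and Gaussian process optimization is not shown):

$$ U^{(0)}(r) = -\,k_{\mathrm{B}} T \,\ln g(r), $$

where $g(r)$ is a site-site distribution function measured in the converged all-atom simulations and $U^{(0)}(r)$ is the initial coarse-grained pair potential that the subsequent optimization refines.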
Modular protein domains: an engineering approach toward functional biomaterials.
Lin, Charng-Yu; Liu, Julie C
2016-08-01
Protein domains and peptide sequences are a powerful tool for conferring specific functions to engineered biomaterials. Protein sequences with a wide variety of functionalities, including structure, bioactivity, protein-protein interactions, and stimuli responsiveness, have been identified, and advances in molecular biology continue to pinpoint new sequences. Protein domains can be combined to make recombinant proteins with multiple functionalities. The high fidelity of the protein translation machinery results in exquisite control over the sequence of recombinant proteins and the resulting properties of protein-based materials. In this review, we discuss protein domains and peptide sequences in the context of functional protein-based materials, composite materials, and their biological applications. Copyright © 2016 Elsevier Ltd. All rights reserved.
Salvi, Daniele; Macali, Armando; Mariottini, Paolo
2014-01-01
The bivalve family Ostreidae has a worldwide distribution and includes species of high economic importance. Phylogenetics and systematics of oysters based on morphology have proved difficult because of their high phenotypic plasticity. In this study we explore the phylogenetic information of the DNA sequence and secondary structure of the nuclear, fast-evolving ITS2 rRNA and the mitochondrial 16S rRNA genes from the Ostreidae, and we implement a multi-locus framework based on four loci for oyster phylogenetics and systematics. Sequence-structure rRNA models aided sequence alignment and improved accuracy and nodal support of phylogenetic trees. In agreement with previous molecular studies, our phylogenetic results indicate that none of the currently recognized subfamilies, Crassostreinae, Ostreinae, and Lophinae, is monophyletic. Single-gene trees based on Maximum likelihood (ML) and Bayesian (BA) methods and on sequence-structure ML were congruent with multilocus trees based on concatenated (ML and BA) and coalescent-based (BA) approaches and consistently supported three main clades: (i) Crassostrea, (ii) Saccostrea, and (iii) an Ostreinae-Lophinae lineage. Therefore, the subfamily Crassostreinae (including Crassostrea), Saccostreinae subfam. nov. (including Saccostrea and tentatively Striostrea) and Ostreinae (including Ostreinae and Lophinae taxa) are recognized. Based on phylogenetic and biogeographical evidence the Asian species of Crassostrea from the Pacific Ocean are assigned to Magallana gen. nov., whereas an integrative taxonomic revision is required for the genera Ostrea and Dendostrea. This study pointed out the suitability of the ITS2 marker for DNA barcoding of oysters and the relevance of using sequence-structure rRNA models and features of the ITS2 folding in molecular phylogenetics and taxonomy. The multilocus approach allowed inferring a robust phylogeny of Ostreidae, providing a broad molecular perspective on their systematics. PMID:25250663
Breaking Lander-Waterman’s Coverage Bound
Nashta-ali, Damoun; Motahari, Seyed Abolfazl; Hosseinkhalaj, Babak
2016-01-01
Lander-Waterman’s coverage bound establishes the total number of reads required to cover the whole genome of size G bases. In fact, their bound is a direct consequence of the well-known solution to the coupon collector’s problem, which proves that for such a genome the total number of bases to be sequenced should be O(G ln G). Although the result leads to a tight bound, it is based on a tacit assumption that the set of reads is first collected through a sequencing process and then processed through a computation process, i.e., there are two different machines: one for sequencing and one for processing. In this paper, we present a significant improvement compared to Lander-Waterman’s result and prove that by combining the sequencing and computing processes, one can re-sequence the whole genome with as low as O(G) sequenced bases in total. Our approach also dramatically reduces the required computational power for the combined process. Simulation results are performed on real genomes with different sequencing error rates. The results support our theory predicting the log G improvement on the coverage bound and the corresponding reduction in the total number of bases required to be sequenced. PMID:27806058
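For orientation, the standard coverage argument behind the classical bound (textbook reasoning, not the paper's refinement) runs as follows. With $N$ reads of length $L$ drawn uniformly from a genome of $G$ bases,

$$ \mathbb{E}[\text{uncovered bases}] \;\approx\; G\left(1-\frac{L}{G}\right)^{N} \;\approx\; G\,e^{-NL/G}, $$

and requiring this expectation to drop below one gives $NL \gtrsim G\ln G$, i.e., the $O(G\ln G)$ total of sequenced bases that the combined sequencing-and-computing scheme described above reduces to $O(G)$.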
Signature of genetic associations in oral cancer.
Sharma, Vishwas; Nandan, Amrita; Sharma, Amitesh Kumar; Singh, Harpreet; Bharadwaj, Mausumi; Sinha, Dhirendra Narain; Mehrotra, Ravi
2017-10-01
Oral cancer etiology is complex and controlled by multi-factorial events including genetic events. Candidate gene studies, genome-wide association studies, and next-generation sequencing identified various chromosomal loci to be associated with oral cancer. There is no available review that could give us the comprehensive picture of genetic loci identified to be associated with oral cancer by candidate gene studies-based, genome-wide association studies-based, and next-generation sequencing-based approaches. A systematic literature search was performed in the PubMed database to identify the loci associated with oral cancer by exclusive candidate gene studies-based, genome-wide association studies-based, and next-generation sequencing-based study approaches. The information of loci associated with oral cancer is made online through the resource "ORNATE." Next, screening of the loci validated by candidate gene studies and next-generation sequencing approach or by two independent studies within candidate gene studies or next-generation sequencing approaches were performed. A total of 264 loci were identified to be associated with oral cancer by candidate gene studies, genome-wide association studies, and next-generation sequencing approaches. In total, 28 loci, that is, 14q32.33 (AKT1), 5q22.2 (APC), 11q22.3 (ATM), 2q33.1 (CASP8), 11q13.3 (CCND1), 16q22.1 (CDH1), 9p21.3 (CDKN2A), 1q31.1 (COX-2), 7p11.2 (EGFR), 22q13.2 (EP300), 4q35.2 (FAT1), 4q31.3 (FBXW7), 4p16.3 (FGFR3), 1p13.3 (GSTM1-GSTT1), 11q13.2 (GSTP1), 11p15.5 (H-RAS), 3p25.3 (hOGG1), 1q32.1 (IL-10), 4q13.3 (IL-8), 12p12.1 (KRAS), 12q15 (MDM2), 12q13.12 (MLL2), 9q34.3 (NOTCH1), 17p13.1 (p53), 3q26.32 (PIK3CA), 10q23.31 (PTEN), 13q14.2 (RB1), and 5q14.2 (XRCC4), were validated to be associated with oral cancer. "ORNATE" gives a snapshot of genetic loci associated with oral cancer. All 28 loci were validated to be linked to oral cancer for which further fine-mapping followed by gene-by-gene and gene-environment interaction studies is needed to confirm their involvement in modifying oral cancer.
A statistical physics perspective on alignment-independent protein sequence comparison.
Chattopadhyay, Amit K; Nasiev, Diar; Flower, Darren R
2015-08-01
Within bioinformatics, the textual alignment of amino acid sequences has long dominated the determination of similarity between proteins, with all that implies for shared structure, function and evolutionary descent. Despite the relative success of modern-day sequence alignment algorithms, so-called alignment-free approaches offer a complementary means of determining and expressing similarity, with potential benefits in certain key applications, such as regression analysis of protein structure-function studies, where alignment-based similarity has performed poorly. Here, we offer a fresh, statistical physics-based perspective focusing on the question of alignment-free comparison, in the process adapting results from 'first passage probability distribution' to summarize statistics of ensemble-averaged amino acid propensity values. In this article, we introduce and elaborate this approach. © The Author 2015. Published by Oxford University Press.
Retrosynthetic Reaction Prediction Using Neural Sequence-to-Sequence Models
2017-01-01
We describe a fully data driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder–decoder architecture that consists of two recurrent neural networks, which has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step toward solving the challenging problem of computational retrosynthetic analysis. PMID:29104927
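The architecture described above can be sketched compactly; the code below is a minimal illustration (assuming PyTorch and a hypothetical token vocabulary), showing the encoder-decoder wiring and one teacher-forced training step rather than the paper's trained model.

```python
# Minimal sketch (assumptions: PyTorch; SMILES-like token vocabulary) of an
# encoder-decoder recurrent model of the kind described above. This is the
# architecture plus one teacher-forced step, not the paper's trained model.
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2
VOCAB_SIZE = 40          # hypothetical token vocabulary size

class Seq2Seq(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=PAD)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.embed(src))      # h: final encoder state
        dec, _ = self.decoder(self.embed(tgt_in), h)
        return self.out(dec)                      # logits per target position

model = Seq2Seq()
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: product tokens -> reactant tokens (teacher forcing shifts target).
src = torch.randint(3, VOCAB_SIZE, (8, 20))
tgt = torch.randint(3, VOCAB_SIZE, (8, 22))
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]

logits = model(src, tgt_in)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), tgt_out.reshape(-1))
loss.backward()
opt.step()
```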
Kück, Patrick; Meusemann, Karen; Dambach, Johannes; Thormann, Birthe; von Reumont, Björn M; Wägele, Johann W; Misof, Bernhard
2010-03-31
Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS), which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold, a choice that is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and is therefore more objective. ALISCORE was successfully extended to amino acids using a proportional model and empirical substitution matrices to score randomness in multiple sequence alignments. A complex bootstrap resampling leads to an even distribution of scores of randomly similar sequences to assess randomness of the observed sequence similarity. Testing performance on real data, both masking methods, GBLOCKS and ALISCORE, helped to improve tree resolution. The sliding window approach was less sensitive to different alignments of identical data sets and performed equally well on all data sets. Concurrently, ALISCORE is capable of dealing with different substitution patterns and heterogeneous base composition. ALISCORE and the most relaxed GBLOCKS gap parameter setting performed best on all data sets. Correspondingly, Neighbor-Net analyses showed the greatest decrease in conflict. Alignment masking improves the signal-to-noise ratio in multiple sequence alignments prior to phylogenetic reconstruction. Given the robust performance of alignment profiling, alignment masking should routinely be used to improve tree reconstructions. Parametric methods of alignment profiling can be easily extended to more complex likelihood-based models of sequence evolution, which opens the possibility of further improvements.
Sun, Beili; Zhou, Dongrui; Tu, Jing; Lu, Zuhong
2017-01-01
The characteristics of tongue coating are very important symbols for disease diagnosis in traditional Chinese medicine (TCM) theory. As a habitat of the oral microbiota, bacteria on the tongue dorsum have been shown to cause many oral diseases. High-throughput next-generation sequencing (NGS) platforms have been widely applied in the analysis of the bacterial 16S rRNA gene. We developed a methodology based on genus-specific multiprimer amplification and ligation-based sequencing for microbiota analysis. In order to validate the efficiency of the approach, we thoroughly analyzed six tongue coating samples from lung cancer patients with different TCM types, and more than 600 genera of bacteria were detected by this platform. The results showed that ligation-based parallel sequencing combined with enzyme digestion and multiprimer amplification could expand the effective length of sequencing reads and could be applied in microbiota analysis.
Inquiry-based experiments for large-scale introduction to PCR and restriction enzyme digests.
Johanson, Kelly E; Watt, Terry J
2015-01-01
Polymerase chain reaction and restriction endonuclease digest are important techniques that should be included in all Biochemistry and Molecular Biology laboratory curriculums. These techniques are frequently taught at an advanced level, requiring many hours of student and faculty time. Here we present two inquiry-based experiments that are designed for introductory laboratory courses and combine both techniques. In both approaches, students must determine the identity of an unknown DNA sequence, either a gene sequence or a primer sequence, based on a combination of PCR product size and restriction digest pattern. The experimental design is flexible, and can be adapted based on available instructor preparation time and resources, and both approaches can accommodate large numbers of students. We implemented these experiments in our courses with a combined total of 584 students and have an 85% success rate. Overall, students demonstrated an increase in their understanding of the experimental topics, ability to interpret the resulting data, and proficiency in general laboratory skills. © 2015 The International Union of Biochemistry and Molecular Biology.
Comparing K-mer based methods for improved classification of 16S sequences.
Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars
2015-07-01
The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length. The differences in classification error among the methods seemed small, but they were stable across both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau. We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data are needed to improve classification from here. Classification errors occur most frequently for genera with few sequences present. For improving the taxonomy and testing new classification methods, a better, more universal and robust training data set is crucial.
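To make the K-mer idea concrete, here is a minimal sketch (toy sequences, scikit-learn; not the RDP classifier itself) of K-mer counting followed by a multinomial naïve Bayes classifier of the kind compared in the study.

```python
# Minimal sketch (not the RDP classifier): K-mer counting plus a multinomial
# naive Bayes model, using scikit-learn. Sequences and labels are toy
# placeholders standing in for full-length 16S training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

K = 8  # K-mer length; 16S classifiers typically use K around 8

train_seqs = ["ACGTACGTACGTACGTACGT", "TTGCAAGGTTGCAAGGTTGC",
              "ACGTACGAACGTACGAACGT", "TTGCAAGCTTGCAAGCTTGC"]
train_taxa = ["GenusA", "GenusB", "GenusA", "GenusB"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(K, K), lowercase=False),
    MultinomialNB(alpha=1.0),
)
clf.fit(train_seqs, train_taxa)
print(clf.predict(["ACGTACGTACGTACGAACGT"]))   # classify a read-length fragment
```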
The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity
Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.
2015-01-01
Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, ncCADD and ncGWAVA, and find both scores are significantly predictive of human dosage sensitive genes and appear to carry information beyond conservation, as assessed by ncGERP. These results highlight that the intolerance of noncoding sequence stretches in the human genome can provide a critical complementary tool to other genome annotation approaches to help identify the parts of the human genome increasingly likely to harbor mutations that influence risk of disease. PMID:26332131
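A simplified stand-in for the ncRVIS construction (not the published pipeline) is sketched below: regress the count of common regulatory variants on the total count of regulatory variants per gene and use the standardized residuals as the intolerance score.

```python
# Minimal sketch of an RVIS-style intolerance score (a simplified stand-in for
# ncRVIS): regress the number of common regulatory variants (Y) on the total
# number of regulatory variants (X) per gene and standardize the residuals;
# strongly negative residuals suggest intolerance to regulatory variation.
import numpy as np

def rvis_like_scores(total_variants, common_variants):
    X = np.asarray(total_variants, dtype=float)
    Y = np.asarray(common_variants, dtype=float)
    slope, intercept = np.polyfit(X, Y, 1)        # ordinary least squares fit
    resid = Y - (slope * X + intercept)
    return (resid - resid.mean()) / resid.std(ddof=1)

# Toy example: gene 3 has many variants overall but almost no common ones.
total = [120, 80, 200, 60, 150]
common = [10, 7, 2, 5, 12]
print(np.round(rvis_like_scores(total, common), 2))
```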
Ding, Xiuhua; Su, Shaoyong; Nandakumar, Kannabiran; Wang, Xiaoling; Fardo, David W
2014-01-01
Large-scale genetic studies are often composed of related participants, and utilizing familial relationships can be cumbersome and computationally challenging. We present an approach to efficiently handle sequencing data from complex pedigrees that incorporates information from rare variants as well as common variants. Our method employs a 2-step procedure that sequentially regresses out correlation from familial relatedness and then uses the resulting phenotypic residuals in a penalized regression framework to test for associations with variants within genetic units. The operating characteristics of this approach are detailed using simulation data based on a large, multigenerational cohort.
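A minimal sketch of this two-step procedure, under the assumption that familial relatedness is summarized by a few covariates (e.g., kinship principal components) and using scikit-learn as a stand-in for the authors' implementation:

```python
# Minimal sketch (simulated data, not the authors' code) of the two-step idea:
# (1) regress the phenotype on family-structure covariates and keep the
# residuals, (2) fit a penalized (lasso) regression of those residuals on the
# variants of one genetic unit.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
n = 200
family_covs = rng.normal(size=(n, 4))      # e.g., kinship PCs / family indicators
genotypes = rng.binomial(2, 0.05, size=(n, 30)).astype(float)   # rare variants
pheno = family_covs @ np.array([1.0, -0.5, 0.3, 0.0]) \
        + genotypes[:, 0] * 1.2 + rng.normal(size=n)

# Step 1: remove familial correlation from the phenotype.
resid = pheno - LinearRegression().fit(family_covs, pheno).predict(family_covs)

# Step 2: penalized regression of residuals on the unit's variants.
fit = Lasso(alpha=0.05).fit(genotypes, resid)
print(np.nonzero(fit.coef_)[0])            # variants retained by the penalty
```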
ERIC Educational Resources Information Center
Karas, Timothy
2017-01-01
Through a case study approach of a cohort of community college students at a single community college, the impact on success rates in composition courses was analyzed based on the sequence of completing an information literacy course. Two student cohorts were sampled based on completing an information literacy course prior to, or concurrently with…
A Protein Domain and Family Based Approach to Rare Variant Association Analysis.
Richardson, Tom G; Shihab, Hashem A; Rivas, Manuel A; McCarthy, Mark I; Campbell, Colin; Timpson, Nicholas J; Gaunt, Tom R
2016-01-01
It has become common practice to analyse large-scale sequencing data with statistical approaches based around the aggregation of rare variants within the same gene. We applied a novel approach to rare variant analysis by collapsing variants together using protein domain and family coordinates, regarded as a more discrete definition of a biologically functional unit. Using Pfam definitions, we collapsed rare variants (Minor Allele Frequency ≤ 1%) together in three different ways: 1) variants within single genomic regions which map to individual protein domains, 2) variants within two individual protein domain regions which are predicted to be responsible for a protein-protein interaction, and 3) all variants within combined regions from multiple genes responsible for coding the same protein domain (i.e. protein families). A conventional collapsing analysis using gene coordinates was also undertaken for comparison. We used UK10K sequence data and investigated associations between regions of variants and lipid traits using the sequence kernel association test (SKAT). We observed no strong evidence of association between regions of variants based on Pfam domain definitions and lipid traits. Quantile-Quantile plots illustrated that the overall distributions of p-values from the protein domain analyses were comparable to that of a conventional gene-based approach. Deviations from this distribution suggested that collapsing by either protein domain or gene definitions may be favourable depending on the trait analysed. We have collapsed rare variants together using protein domain and family coordinates to present an alternative approach over collapsing across conventionally used gene-based regions. Although no strong evidence of association was detected in these analyses, future studies may still find value in adopting these approaches to detect previously unidentified association signals.
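The collapsing step can be illustrated with a small sketch (hypothetical domain coordinates and simulated data; a simple burden regression stands in for SKAT):

```python
# Minimal sketch (illustrative coordinates, not UK10K data) of collapsing rare
# variants by protein-domain regions rather than whole genes: variants falling
# inside a domain's genomic interval are summed into a per-person burden score,
# which is then tested against the trait with a simple linear regression
# (a stand-in for SKAT).
import numpy as np
from scipy import stats

DOMAIN_REGIONS = {"Pkinase@GENE1": (1_000_200, 1_000_800)}   # hypothetical domain
variant_positions = np.array([1_000_150, 1_000_350, 1_000_640, 1_002_000])
genotypes = np.random.default_rng(2).binomial(1, 0.02, size=(500, 4))  # rare alleles
trait = np.random.default_rng(3).normal(size=500)

for name, (start, end) in DOMAIN_REGIONS.items():
    in_domain = (variant_positions >= start) & (variant_positions <= end)
    burden = genotypes[:, in_domain].sum(axis=1)        # collapsed burden score
    slope, _, _, p, _ = stats.linregress(burden, trait)
    print(name, "beta=%.3f p=%.3f" % (slope, p))
```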
Kang, Hahk-Soo
2017-02-01
Genomics-based methods are now commonplace in natural products research. A phylogeny-guided mining approach provides a means to quickly screen a large number of microbial genomes or metagenomes in search of new biosynthetic gene clusters of interest. In this approach, biosynthetic genes serve as molecular markers, and phylogenetic trees built with known and unknown marker gene sequences are used to quickly prioritize biosynthetic gene clusters for characterization of their metabolites. Use of this approach has increased over the last few years along with the emergence of low-cost sequencing technologies. The aim of this review is to discuss the basic concept of a phylogeny-guided mining approach, and also to provide examples in which this approach was successfully applied to discover new natural products from microbial genomes and metagenomes. I believe that the phylogeny-guided mining approach will continue to play an important role in genomics-based natural products research.
2011-01-01
Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, together with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif-finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidate genes. Conclusions The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR. Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi). PMID:21668950
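As an illustration of the motif-based discriminant step, the sketch below (toy sequences, scikit-learn; not the study's gene sets or Galaxy tools) represents sequences by short k-mer counts and fits a linear discriminant classifier to separate the two classes.

```python
# Minimal sketch (hypothetical data): represent sequences by short k-mer
# counts and train a linear discriminant analysis classifier to separate two
# gene classes, in the spirit of the motif-based LDA step described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

seqs = ["ACGTGGGCCATGACGT", "ACGTGGGCCTTGACGT",   # toy "class 1" sequences
        "TTTTACGCAGGATTTT", "TTTTACGCAGCATTTT"]   # toy "class 0" sequences
labels = [1, 1, 0, 0]

vec = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
X = vec.fit_transform(seqs).toarray()             # LDA needs a dense matrix
lda = LinearDiscriminantAnalysis().fit(X, labels)
print(lda.predict(vec.transform(["ACGTGGGCCATGACGA"]).toarray()))
```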
A topological approach for protein classification
Cang, Zixuan; Mu, Lin; Wu, Kedi; ...
2015-11-04
A protein's function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.
ERIC Educational Resources Information Center
Soto, Julio G.; Everhart, Jerry
2016-01-01
Biology faculty at San José State University developed, piloted, implemented, and assessed a freshmen course sequence based on a macro-to-micro teaching approach that was team-taught and organized around unifying themes. Content learning assessment drove the conceptual framework of our course sequence. Content student learning increased…
Chakraborty, Mohua; Dhar, Bishal; Ghosh, Sankar Kumar
2017-11-01
DNA barcodes are generally interpreted using distance-based and character-based methods. The former uses clustering of comparable groups, based on the relative genetic distance, while the latter is based on the presence or absence of discrete nucleotide substitutions. The distance-based approach has a limitation in defining a universal species boundary across taxa, as the rate of mtDNA evolution is not constant throughout the taxa. However, the character-based approach more accurately defines this boundary using a unique set of nucleotide characters. The character-based analysis of the full-length barcode has some inherent limitations, like sequencing of the full-length barcode, use of a sparse data matrix and lack of a uniform diagnostic position for each group. A short continuous stretch of a fragment can be used to resolve these limitations. Here, we observe that a 154-bp fragment from the transversion-rich domain of 1367 COI barcode sequences can successfully delimit species in the three most diverse orders of freshwater fishes. This fragment is used to design species-specific barcode motifs for 109 species by the character-based method, which successfully identifies the correct species using a pattern-matching program. The motifs also correctly identify geographically isolated populations of the Cypriniformes species. Further, this region is validated as a species-specific mini-barcode for freshwater fishes by successful PCR amplification and sequencing of the motif (154 bp) using the designed primers. We anticipate that use of such motifs will enhance the diagnostic power of the DNA barcode, and the mini-barcode approach will greatly benefit the field-based system of rapid species identification. © 2017 John Wiley & Sons Ltd.
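The pattern-matching idea can be sketched as follows (hypothetical species, positions, and diagnostic nucleotides; not the published motifs):

```python
# Minimal sketch (hypothetical motifs and positions) of character-based
# matching against species-specific barcode motifs from a 154-bp fragment.
SPECIES_MOTIFS = {
    # species: {position within the 154-bp fragment: diagnostic nucleotide}
    "Labeo rohita":   {12: "A", 47: "T", 103: "G"},
    "Channa striata": {12: "G", 47: "C", 103: "G"},
}

def match_species(fragment):
    """Return species whose diagnostic characters all match the fragment."""
    hits = []
    for sp, motif in SPECIES_MOTIFS.items():
        if all(pos < len(fragment) and fragment[pos] == base
               for pos, base in motif.items()):
            hits.append(sp)
    return hits

query = "N" * 154
query = query[:12] + "A" + query[13:47] + "T" + query[48:103] + "G" + query[104:]
print(match_species(query))   # -> ['Labeo rohita']
```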
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karpinets, Tatiana V; Park, Byung; Syed, Mustafa H
2010-01-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire non-redundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains (DUF) and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit (CAT), and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
Park, Byung H; Karpinets, Tatiana V; Syed, Mustafa H; Leuze, Michael R; Uberbacher, Edward C
2010-12-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
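The second, link-based annotation route can be illustrated with a small sketch (a toy Pfam-to-CAZy mapping, not CAT's curated association rules):

```python
# Minimal sketch (toy mapping, not CAT's learned rules) of the link-based
# annotation route described above: assign CAZy families to a protein based on
# correspondences between Pfam domains and CAZy families.
PFAM_TO_CAZY = {                  # illustrative association rules: Pfam -> CAZy
    "PF00150": {"GH5"},
    "PF01341": {"GH6"},
    "PF00553": {"CBM2"},
}

def annotate(protein_domains):
    """protein_domains: list of Pfam accessions detected on one protein."""
    families = set()
    for pf in protein_domains:
        families |= PFAM_TO_CAZY.get(pf, set())
    return sorted(families)

print(annotate(["PF00553", "PF00150"]))   # -> ['CBM2', 'GH5']
```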
Reordering Histology to Enhance Engagement
ERIC Educational Resources Information Center
Amerongen, Helen
2011-01-01
In redesigning the preclinical curriculum and shifting from a discipline-based approach to an organ system-based approach, faculty at the University of Arizona College of Medicine in Tucson took the opportunity to restructure the sequence of introductory histology content to make it more engaging and relevant. In this article, the author describes…
An Activation-Based Model of Routine Sequence Errors
2015-04-01
part of the ACT-R framework (e.g., Anderson, 1983), we adopt a newer, richer notion of priming as part of our approach (Harrison & Trafton, 2010...2014). Other models of routine sequence errors, such as the interactive activation network (IAN) model (Cooper & Shallice, 2006) and the simple...error patterns that results from an interface layout shift. The ideas behind our expanded priming approach, however, could apply to IAN, which uses
Shpynov, S; Pozdnichenko, N; Gumenuk, A
2015-01-01
Genome sequences of 36 Rickettsia and Orientia were analyzed using Formal Order Analysis (FOA). This approach takes into account the arrangement of nucleotides in each sequence. A numerical characteristic, the average distance (remoteness) "g", was used to compare genomes. Our results corroborated the previous separation of three groups within the genus Rickettsia, including the typhus group, the classic spotted fever group, and the ancestral group, and of Orientia as a separate genus. Rickettsia felis URRWXCal2 and R. akari Hartford were not in the same group based on FOA; therefore, the designation of a so-called transitional Rickettsia group could not be confirmed with this approach. Copyright © 2015 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved.
A Window Into Clinical Next-Generation Sequencing-Based Oncology Testing Practices.
Nagarajan, Rakesh; Bartley, Angela N; Bridge, Julia A; Jennings, Lawrence J; Kamel-Reid, Suzanne; Kim, Annette; Lazar, Alexander J; Lindeman, Neal I; Moncur, Joel; Rai, Alex J; Routbort, Mark J; Vasalos, Patricia; Merker, Jason D
2017-12-01
- Detection of acquired variants in cancer is a paradigm of precision medicine, yet little has been reported about clinical laboratory practices across a broad range of laboratories. - To use College of American Pathologists proficiency testing survey results to report on the results from surveys on next-generation sequencing-based oncology testing practices. - College of American Pathologists proficiency testing survey results from more than 250 laboratories currently performing molecular oncology testing were used to determine laboratory trends in next-generation sequencing-based oncology testing. - These presented data provide key information about the number of laboratories that currently offer or are planning to offer next-generation sequencing-based oncology testing. Furthermore, we present data from 60 laboratories performing next-generation sequencing-based oncology testing regarding specimen requirements and assay characteristics. The findings indicate that most laboratories are performing tumor-only targeted sequencing to detect single-nucleotide variants and small insertions and deletions, using desktop sequencers and predesigned commercial kits. Despite these trends, a diversity of approaches to testing exists. - This information should be useful to further inform a variety of topics, including national discussions involving clinical laboratory quality systems, regulation and oversight of next-generation sequencing-based oncology testing, and precision oncology efforts in a data-driven manner.
Talkowski, Michael E; Ernst, Carl; Heilbut, Adrian; Chiang, Colby; Hanscom, Carrie; Lindgren, Amelia; Kirby, Andrew; Liu, Shangtao; Muddukrishna, Bhavana; Ohsumi, Toshiro K; Shen, Yiping; Borowsky, Mark; Daly, Mark J; Morton, Cynthia C; Gusella, James F
2011-04-08
The contribution of balanced chromosomal rearrangements to complex disorders remains unclear because they are not detected routinely by genome-wide microarrays and clinical localization is imprecise. Failure to consider these events bypasses a potentially powerful complement to single nucleotide polymorphism and copy-number association approaches to complex disorders, where much of the heritability remains unexplained. To capitalize on this genetic resource, we have applied optimized sequencing and analysis strategies to test whether these potentially high-impact variants can be mapped at reasonable cost and throughput. By using a whole-genome multiplexing strategy, rearrangement breakpoints could be delineated at a fraction of the cost of standard sequencing. For rearrangements already mapped regionally by karyotyping and fluorescence in situ hybridization, a targeted approach enabled capture and sequencing of multiple breakpoints simultaneously. Importantly, this strategy permitted capture and unique alignment of up to 97% of repeat-masked sequences in the targeted regions. Genome-wide analyses estimate that only 3.7% of bases should be routinely omitted from genomic DNA capture experiments. Illustrating the power of these approaches, the rearrangement breakpoints were rapidly defined to base pair resolution and revealed unexpected sequence complexity, such as co-occurrence of inversion and translocation as an underlying feature of karyotypically balanced alterations. These findings have implications ranging from genome annotation to de novo assemblies and could enable sequencing screens for structural variations at a cost comparable to that of microarrays in standard clinical practice. Copyright © 2011 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
SOMKE: kernel density estimation over data streams by sequences of self-organizing maps.
Cao, Yuan; He, Haibo; Man, Hong
2012-08-01
In this paper, we propose a novel method, SOMKE, for kernel density estimation (KDE) over data streams based on sequences of self-organizing maps (SOMs). In many stream data mining applications, the traditional KDE methods are infeasible because of the high computational cost, processing time, and memory requirement. To reduce the time and space complexity, we propose a SOM structure in this paper to obtain well-defined data clusters to estimate the underlying probability distributions of incoming data streams. The main idea of this paper is to build a series of SOMs over the data streams via two operations, that is, creating and merging the SOM sequences. The creation phase produces the SOM sequence entries for windows of the data, which obtains clustering information of the incoming data streams. The size of the SOM sequences can be further reduced by combining the consecutive entries in the sequence based on the measure of Kullback-Leibler divergence. Finally, the probability density functions over arbitrary time periods along the data streams can be estimated using such SOM sequences. We compare SOMKE with two other KDE methods for data streams, the M-kernel approach and the cluster kernel approach, in terms of accuracy and processing time for various stationary data streams. Furthermore, we also investigate the use of SOMKE over nonstationary (evolving) data streams, including a synthetic nonstationary data stream, a real-world financial data stream and a group of network traffic data streams. The simulation results illustrate the effectiveness and efficiency of the proposed approach.
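The core of the approach can be illustrated with a simplified sketch (one window of a one-dimensional stream, a tiny hand-rolled SOM, and a Gaussian kernel; not the paper's full creation/merging algorithm):

```python
# Minimal sketch of the core SOMKE idea (a simplified stand-in): summarize one
# window of a 1-D data stream by a small self-organizing map, then estimate the
# density from the SOM prototypes weighted by how many points each absorbed.
import numpy as np

def train_som(window, m=10, epochs=20, lr=0.3, sigma=2.0):
    w = np.linspace(window.min(), window.max(), m)       # 1-D prototype chain
    idx = np.arange(m)
    for _ in range(epochs):
        for x in np.random.permutation(window):
            bmu = np.argmin(np.abs(w - x))               # best matching unit
            h = np.exp(-(idx - bmu) ** 2 / (2 * sigma ** 2))
            w += lr * h * (x - w)                        # neighborhood update
    hits = np.bincount(np.argmin(np.abs(window[:, None] - w[None, :]), axis=1),
                       minlength=m)
    return w, hits

def kde_from_prototypes(grid, prototypes, hits, bandwidth=0.3):
    weights = hits / hits.sum()
    kern = np.exp(-0.5 * ((grid[:, None] - prototypes[None, :]) / bandwidth) ** 2)
    return (kern * weights).sum(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

stream_window = np.random.default_rng(1).normal(0, 1, 500)   # one window of data
protos, hits = train_som(stream_window)
grid = np.linspace(-4, 4, 9)
print(np.round(kde_from_prototypes(grid, protos, hits), 3))
```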
Sinha, Rituparna; Samaddar, Sandip; De, Rajat K
2015-01-01
Copy number variation (CNV) is a form of structural alteration in the mammalian DNA sequence that is associated with many complex neurological diseases as well as cancer. The development of next generation sequencing (NGS) technology provides a new dimension towards detection of genomic locations with copy number variations. Here we develop an algorithm for detecting CNVs, which is based on depth of coverage data generated by NGS technology. In this work, we have used a novel way to represent the read count data as a two-dimensional geometrical point. A key aspect of detecting the regions with CNVs is to devise a proper segmentation algorithm that will distinguish the genomic locations having a significant difference in read count data. We have designed a new segmentation approach in this context, using a convex hull algorithm on the geometrical representation of read count data. To our knowledge, most algorithms have used a single distribution model of read count data, but in our approach, we have considered the read count data to follow two different distribution models independently, which adds to the robustness of detection of CNVs. In addition, our algorithm calls CNVs based on the multiple sample analysis approach, resulting in a low false discovery rate with high precision.
Comparison of phasing strategies for whole human genomes
Kirkness, Ewen; Schork, Nicholas J.
2018-01-01
Humans are a diploid species that inherit one set of chromosomes paternally and one homologous set of chromosomes maternally. Unfortunately, most human sequencing initiatives ignore this fact in that they do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes individuals possess (i.e., they do not ‘phase’ the genome) often because of the costs and complexities of doing so. We compared 11 different widely-used approaches to phasing human genomes using the publicly available ‘Genome-In-A-Bottle’ (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing as well as purely computational approaches that seek to reconstruct phase information from general sequencing reads and constructs or population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid or combined approach that leverages: 1. population-based phasing using the SHAPEIT software suite, 2. either genome-wide sequencing read data or parental genotypes, and 3. a large reference panel of variant and haplotype frequencies, provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. We found that for population-based approaches, phasing performance is enhanced with the addition of genome-wide read data; e.g., whole genome shotgun and/or RNA sequencing reads. Further, we found that the inclusion of parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors. We also considered a majority voting scheme for the construction of a consensus haplotype combining multiple predictions for enhanced performance and site coverage. Finally, we also identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or SNV density. PMID:29621242
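One of the metrics used in the comparison, the switch error rate, can be computed with a few lines; the sketch below uses its standard definition with illustrative inputs rather than the study's data.

```python
# Minimal sketch of the switch-error metric used to compare phasing methods
# (standard definition; variable names are illustrative): at heterozygous
# sites, count how often consecutive sites flip orientation relative to the
# gold-standard haplotypes.
def switch_error_rate(truth_hap1, test_hap1):
    """Both arguments are strings of alleles for haplotype 1 at shared
    heterozygous sites; haplotype 2 is the complement at each site."""
    # Orientation at each site: does the test haplotype 1 carry the same
    # allele as the truth haplotype 1 (True) or the opposite one (False)?
    orient = [t == p for t, p in zip(truth_hap1, test_hap1)]
    switches = sum(1 for a, b in zip(orient, orient[1:]) if a != b)
    return switches / max(len(orient) - 1, 1)

truth = "ACGTA"      # truth haplotype-1 alleles at five het sites
test  = "ACCAA"      # phased calls with two switch points
print(switch_error_rate(truth, test))   # -> 0.5 (2 switches / 4 junctions)
```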
Statistical and linguistic features of DNA sequences
NASA Technical Reports Server (NTRS)
Havlin, S.; Buldyrev, S. V.; Goldberger, A. L.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.
1995-01-01
We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range--indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationary" feature of the sequence of base pairs by applying a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to all eukaryotic DNA sequences (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model to account for the presence of long-range power-law correlations which is based upon a generalization of the classic Levy walk. Finally, we describe briefly some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in plants and invertebrates may display a smaller entropy and larger redundancy than coding regions, further supporting the possibility that noncoding regions of DNA may carry biological information.
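A minimal sketch of the DFA computation on a nucleotide walk (standard textbook implementation, not the authors' original code) is given below.

```python
# Minimal sketch of Detrended Fluctuation Analysis (DFA) for a numeric series
# derived from a DNA sequence (e.g., a purine/pyrimidine +1/-1 walk).
import numpy as np

def dfa(series, scales=(4, 8, 16, 32, 64)):
    x = np.asarray(series, dtype=float)
    y = np.cumsum(x - x.mean())                        # integrated profile
    F = []
    for n in scales:
        n_seg = len(y) // n
        sq = []
        for i in range(n_seg):
            seg = y[i * n:(i + 1) * n]
            t = np.arange(n)
            trend = np.polyval(np.polyfit(t, seg, 1), t)   # local linear detrend
            sq.append(np.mean((seg - trend) ** 2))
        F.append(np.sqrt(np.mean(sq)))
    # Scaling exponent alpha: slope of log F(n) vs log n (alpha ~ 0.5 means
    # uncorrelated; alpha > 0.5 indicates long-range correlations).
    alpha = np.polyfit(np.log(scales), np.log(F), 1)[0]
    return alpha

seq = "".join(np.random.default_rng(0).choice(list("ACGT"), 2000))
walk = [1 if b in "AG" else -1 for b in seq]           # purine/pyrimidine walk
print(round(dfa(walk), 2))                             # ~0.5 for a random sequence
```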
NASA Astrophysics Data System (ADS)
Song, Yang; Laskay, Ünige A.; Vilcins, Inger-Marie E.; Barbour, Alan G.; Wysocki, Vicki H.
2015-11-01
Ticks are vectors for disease transmission because they are indiscriminate in their feeding on multiple vertebrate hosts, transmitting pathogens between their hosts. Identifying the hosts on which ticks have fed is important for disease prevention and intervention. We have previously shown that hemoglobin (Hb) remnants from a host on which a tick fed can be used to reveal the host's identity. For the present research, blood was collected from 33 bird species that are common in the U.S. as hosts for ticks but that have unknown Hb sequences. A top-down-assisted bottom-up mass spectrometry approach with a customized search database, based on variability in known bird hemoglobin sequences, has been devised to facilitate fast and complete sequencing of hemoglobin from birds with unknown sequences. These hemoglobin sequences will be added to a hemoglobin database and used for tick host identification. The general approach has the potential to rapidly and completely sequence any set of homologous proteins.
Assessing the utility of the Oxford Nanopore MinION for snake venom gland cDNA sequencing.
Hargreaves, Adam D; Mulley, John F
2015-01-01
Portable DNA sequencers such as the Oxford Nanopore MinION device have the potential to be truly disruptive technologies, facilitating new approaches and analyses and, in some cases, taking sequencing out of the lab and into the field. However, the capabilities of these technologies are still being revealed. Here we show that single-molecule cDNA sequencing using the MinION accurately characterises venom toxin-encoding genes in the painted saw-scaled viper, Echis coloratus. We find the raw sequencing error rate to be around 12%, improved to 0-2% with hybrid error correction and 3% with de novo error correction. Our corrected data provides full coding sequences and 5' and 3' UTRs for 29 of 33 candidate venom toxins detected, far superior to Illumina data (13/40 complete) and Sanger-based ESTs (15/29). We suggest that, should the current pace of improvement continue, the MinION will become the default approach for cDNA sequencing in a variety of species.
Synthesis of Speaker Facial Movement to Match Selected Speech Sequences
NASA Technical Reports Server (NTRS)
Scott, K. C.; Kagels, D. S.; Watson, S. H.; Rom, H.; Wright, J. R.; Lee, M.; Hussey, K. J.
1994-01-01
A system is described which allows for the synthesis of a video sequence of a realistic-appearing talking human head. A phonic based approach is used to describe facial motion; image processing rather than physical modeling techniques are used to create video frames.
Using "Arabidopsis" Genetic Sequences to Teach Bioinformatics
ERIC Educational Resources Information Center
Zhang, Xiaorong
2009-01-01
This article describes a new approach to teaching bioinformatics using "Arabidopsis" genetic sequences. Several open-ended and inquiry-based laboratory exercises have been designed to help students grasp key concepts and gain practical skills in bioinformatics, using "Arabidopsis" leucine-rich repeat receptor-like kinase (LRR…
Guzzi, Pietro Hiram; Milenkovic, Tijana
2018-05-01
Analogous to genomic sequence alignment that allows for across-species transfer of biological knowledge between conserved sequence regions, biological network alignment can be used to guide the knowledge transfer between conserved regions of molecular networks of different species. Hence, biological network alignment can be used to redefine the traditional notion of a sequence-based homology to a new notion of network-based homology. Analogous to genomic sequence alignment, there exist local and global biological network alignments. Here, we survey prominent and recent computational approaches of each network alignment type and discuss their (dis)advantages. Then, as it was recently shown that the two approach types are complementary, in the sense that they capture different slices of cellular functioning, we discuss the need to reconcile the two network alignment types and present a recent first step in this direction. We conclude with some open research problems on this topic and comment on the usefulness of network alignment in other domains besides computational biology.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ecale Zhou, C L; Zemla, A T; Roe, D
2005-01-29
Specific and sensitive ligand-based protein detection assays that employ antibodies or small molecules such as peptides and aptamers require that the corresponding surface region of the protein be accessible and that there be minimal cross-reactivity with non-target proteins. To reduce the time and cost of laboratory screening efforts for diagnostic reagents, we developed new methods for evaluating and selecting protein surface regions for ligand targeting. We devised combined structure- and sequence-based methods for identifying 3D epitopes and binding pockets on the surface of the A chain of ricin that are conserved with respect to a set of ricin A chains and unique with respect to other proteins. We (1) used structure alignment software to detect structural deviations and extracted from this analysis the residue-residue correspondence, (2) devised a method to compare corresponding residues across sets of ricin structures and structures of closely related proteins, (3) devised a sequence-based approach to determine residue infrequency in local sequence context, and (4) modified a pocket-finding algorithm to identify surface crevices in close proximity to residues determined to be conserved/unique based on our structure- and sequence-based methods. In applying this combined informatics approach to ricin A, we identified a conserved/unique pocket in close proximity to (but not overlapping) the active site that is suitable for bi-dentate ligand development. These methods are generally applicable to the identification of surface epitopes and binding pockets for the development of diagnostic reagents, therapeutics, and vaccines.
Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes.
Oyola, Samuel O; Otto, Thomas D; Gu, Yong; Maslen, Gareth; Manske, Magnus; Campino, Susana; Turner, Daniel J; Macinnis, Bronwyn; Kwiatkowski, Dominic P; Swerdlow, Harold P; Quail, Michael A
2012-01-03
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) have rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage, particularly across AT- and GC-rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences. We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC neutral templates. We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of either extreme of base composition. This development will greatly benefit sequencing clinical samples that often require amplification due to low mass of DNA starting material.
Structure-Based Phylogenetic Analysis of the Lipocalin Superfamily.
Lakshmi, Balasubramanian; Mishra, Madhulika; Srinivasan, Narayanaswamy; Archunan, Govindaraju
2015-01-01
Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However, the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand the functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study with 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with the functional diversity. The detailed analysis of each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster based on their mode of ligand binding, even though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single member families. The structure-based phylogenetic approach has also provided pointers for assigning putative functions to domains of unknown function in the lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families where members show poor sequence similarity but high structural similarity.
Counting Patterns in Degenerated Sequences
NASA Astrophysics Data System (ADS)
Nuel, Grégory
Biological sequences, such as DNA or proteins, are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is: how should degenerated positions be dealt with? Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, Proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking into account these positions might lead to erroneous observations, further demonstrating the value of our approach.
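To make the issue concrete, the toy sketch below weights each candidate occurrence of a pattern by the fraction of compatible letters at every degenerate IUPAC position, a crude uniform-probability stand-in for the Forward-Backward marginals used in the paper; it only illustrates why discarding degenerated positions can bias the observed count.

    # Expected count of a pattern in a degenerate (IUPAC) sequence, weighting each
    # position by the fraction of compatible letters. Toy, uniform-weight stand-in
    # for the marginal distributions described in the paper.
    IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
             "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
             "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

    def expected_pattern_count(seq, pattern):
        total = 0.0
        for i in range(len(seq) - len(pattern) + 1):
            w = 1.0
            for s, p in zip(seq[i:i + len(pattern)], pattern):
                letters = IUPAC[s]
                w *= letters.count(p) / len(letters)   # P(this position emits p)
                if w == 0.0:
                    break
            total += w
        return total

    print(expected_pattern_count("ACGN", "GT"))   # 0.25: the N matches T a quarter of the time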
Secondary structural entropy in RNA switch (Riboswitch) identification.
Manzourolajdad, Amirhossein; Arnold, Jonathan
2015-04-28
RNA regulatory elements play a significant role in gene regulation. Riboswitches, a widespread group of regulatory RNAs, are vital components of many bacterial genomes. These regulatory elements generally function by forming a ligand-induced alternative fold that controls access to ribosome binding sites or other regulatory sites in RNA. Riboswitch-mediated mechanisms are ubiquitous across bacterial genomes. Each class of riboswitch typically has its own unique structural and biological complexity, making de novo riboswitch identification a formidable task. Traditionally, riboswitches have been identified through comparative genomics based on sequence and structural homology. The limitations of structural-homology-based approaches, coupled with the assumption that there is a great diversity of undiscovered riboswitches, suggest the need for alternative methods for riboswitch identification, possibly based on features intrinsic to their structure. As of yet, no such reliable method has been proposed. We used structural entropy of riboswitch sequences as a measure of their secondary structural dynamics. Entropy values of a diverse set of riboswitches were compared to those of their mutants, their dinucleotide shuffles, and their reverse complement sequences under different stochastic context-free grammar folding models. The significance of our results was evaluated by comparison to other approaches, such as base-pairing entropy and energy landscape dynamics. Classifiers based on structural entropy optimized via sequence and structural features were devised as riboswitch identifiers and tested on Bacillus subtilis, Escherichia coli, and Synechococcus elongatus as an exploration of structural-entropy-based approaches. The unusually long untranslated region of the cotH gene in Bacillus subtilis, as well as upstream regions of certain genes, such as the sucC genes, were associated with significant structural entropy values in genome-wide examinations. Various tests show that there is in fact a relationship between higher structural entropy and the potential for the RNA sequence to have alternative structures, within the limitations of our methodology. This relationship, though modest, is consistent across various tests. Understanding the behavior of structural entropy as a fairly new feature for RNA conformational dynamics, however, may require extensive exploratory investigation both across RNA sequences and folding models.
QRS complex detection based on continuous density hidden Markov models using univariate observations
NASA Astrophysics Data System (ADS)
Sotelo, S.; Arenas, W.; Altuve, M.
2018-04-01
In the electrocardiogram (ECG), the detection of QRS complexes is a fundamental step in the ECG signal processing chain since it allows the determination of other characteristic waves of the ECG and provides information about heart rate variability. In this work, an automatic QRS complex detector based on continuous density hidden Markov models (HMM) is proposed. HMMs were trained using univariate observation sequences taken either from QRS complexes or their derivatives. The detection approach is based on the log-likelihood comparison of the observation sequence with a fixed threshold. A sliding window was used to obtain the observation sequence to be evaluated by the model. The threshold was optimized by receiver operating characteristic curves. Sensitivity (Sen), specificity (Spc) and F1 score were used to evaluate the detection performance. The approach was validated using ECG recordings from the MIT-BIH Arrhythmia database. A 6-fold cross-validation shows that the best detection performance was achieved with a 2-state HMM trained with QRS complex sequences (Sen = 0.668, Spc = 0.360 and F1 = 0.309). We concluded that these univariate sequences provide enough information to characterize QRS complex dynamics with HMMs. Future work is directed toward the use of multivariate observations to increase the detection performance.
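A minimal sketch of this detection scheme is shown below using the hmmlearn package (an assumed dependency; the paper does not name its implementation): a 2-state Gaussian HMM is fitted to univariate QRS training windows, and a sliding window is flagged whenever its log-likelihood under the model exceeds a threshold. The window size, step, and threshold values here are placeholders rather than the ROC-optimized settings of the study.

    import numpy as np
    from hmmlearn import hmm   # assumed dependency; not named in the paper

    def train_qrs_hmm(qrs_windows, n_states=2):
        """Fit a 2-state Gaussian HMM on concatenated QRS training windows."""
        X = np.concatenate(qrs_windows).reshape(-1, 1)    # univariate observations
        lengths = [len(w) for w in qrs_windows]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    def detect(model, ecg, win=40, step=5, threshold=-60.0):
        """Return start indices of sliding windows whose log-likelihood exceeds the threshold."""
        hits = []
        for start in range(0, len(ecg) - win + 1, step):
            ll = model.score(ecg[start:start + win].reshape(-1, 1))
            if ll > threshold:          # placeholder; the paper tunes this via ROC curves
                hits.append(start)
        return hits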
NASA Astrophysics Data System (ADS)
Wang, Guanxi; Tie, Yun; Qi, Lin
2017-07-01
In this paper, we propose a novel approach based on Depth Maps and compute Multi-Scale Histograms of Oriented Gradient (MSHOG) from sequences of depth maps to recognize actions. Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through a depth video sequence to form a Depth Map, called a Depth Motion Trail Image (DMTI). The MSHOG is then computed from the Depth Maps for the representation of an action. In addition, we apply L2-Regularized Collaborative Representation (L2-CRC) to classify actions. We evaluate the proposed approach on the MSR Action3D and MSRGesture3D datasets. Promising experimental results demonstrate the effectiveness of our proposed method.
Babben, Steve; Perovic, Dragan; Koch, Michael; Ordon, Frank
2015-01-01
Recent declines in cost have accelerated the sequencing of many species with large genomes, including hexaploid wheat (Triticum aestivum L.). Although the draft sequence of bread wheat is known, it remains a major challenge to develop locus-specific primers suitable for use in marker-assisted selection procedures, due to the high homology of the three genomes. In this study we describe an efficient approach for the development of locus-specific primers comprising four steps, i.e. (i) identification of genomic and coding sequences (CDS) of candidate genes, (ii) intron- and exon-structure reconstruction, (iii) identification of wheat A, B and D sub-genome sequences and primer development based on sequence differences between the three sub-genomes, and (iv) testing of primers for functionality, correct size and localisation. This approach was applied to single, low and high copy genes involved in frost tolerance in wheat. In summary, for 27 of these genes for which sequences were derived from Triticum aestivum, Triticum monococcum and Hordeum vulgare, a set of 119 primer pairs was developed and after testing on Nulli-tetrasomic (NT) lines, a set of 65 primer pairs (54.6%), corresponding to 19 candidate genes, turned out to be specific. Out of these, a set of 35 fragments was selected for validation via Sanger amplicon re-sequencing. All fragments, with the exception of one, could be assigned to the original reference sequence. The approach presented here showed a much higher specificity in primer development in comparison to techniques used so far in bread wheat and can be applied to other polyploid species with a known draft sequence. PMID:26565976
Complete Genome Sequence of a Putative Densovirus of the Asian Citrus Psyllid, Diaphorina citri.
Nigg, Jared C; Nouri, Shahideh; Falk, Bryce W
2016-07-28
Here, we report the complete genome sequence of a putative densovirus of the Asian citrus psyllid, Diaphorina citri. Diaphorina citri densovirus (DcDNV) was originally identified through metagenomics, and here we obtained the complete nucleotide sequence using PCR-based approaches. Phylogenetic analysis places DcDNV between viruses of the Ambidensovirus and Iteradensovirus genera. Copyright © 2016 Nigg et al.
Libbrecht, Maxwell W; Bilmes, Jeffrey A; Noble, William Stafford
2018-04-01
Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions. © 2018 Wiley Periodicals, Inc.
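The greedy algorithm for monotone submodular maximization, which carries the classic (1 - 1/e) guarantee behind the "best possible in polynomial time" remark, can be sketched as below for a facility-location-style objective over a precomputed similarity matrix. This is an illustration only; the paper's objective functions, similarity measures, and tooling differ.

    import numpy as np

    def greedy_representatives(sim, k):
        """Greedy maximization of a facility-location objective
        f(S) = sum_i max_{j in S} sim[i, j]; sim is an (n x n) similarity
        matrix and the function returns indices of k representatives."""
        n = sim.shape[0]
        selected, best_cover = [], np.zeros(n)
        for _ in range(k):
            # marginal gain of adding each candidate column j
            gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
            j = int(np.argmax(gains))
            selected.append(j)
            best_cover = np.maximum(best_cover, sim[:, j])
        return selected

    rng = np.random.default_rng(0)
    S = rng.random((20, 20)); S = (S + S.T) / 2     # toy symmetric similarities
    print(greedy_representatives(S, 3))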
Ceccarelli, Marcello; Galluzzi, Luca; Diotallevi, Aurora; Andreoni, Francesca; Fowler, Hailie; Petersen, Christine; Vitale, Fabrizio; Magnani, Mauro
2017-05-16
Leishmaniasis is a neglected disease caused by many Leishmania species, belonging to subgenera Leishmania (Leishmania) and Leishmania (Viannia). Several qPCR-based molecular diagnostic approaches have been reported for detection and quantification of Leishmania species. Many of these approaches use the kinetoplast DNA (kDNA) minicircles as the target sequence. These assays carry a risk of cross-species amplification, due to sequence similarity between Leishmania species. Previous works demonstrated discrimination between L. (Leishmania) and L. (Viannia) by SYBR green-based qPCR assays designed on kDNA, followed by melting or high-resolution melt (HRM) analysis. Importantly, these approaches cannot fully distinguish L. (L.) infantum from L. (L.) amazonensis, which can coexist in the same geographical area. DNA from 18 strains/isolates of L. (L.) infantum, L. (L.) amazonensis, L. (V.) braziliensis, L. (V.) panamensis, L. (V.) guyanensis, and 62 clinical samples from L. (L.) infantum-infected dogs were amplified by a previously developed qPCR (qPCR-ML) and subjected to HRM analysis; selected PCR products were sequenced using an ABI PRISM 310 Genetic Analyzer. Based on the obtained sequences, a new SYBR-green qPCR assay (qPCR-ama) intended to amplify a minicircle subclass more abundant in L. (L.) amazonensis was designed. The qPCR-ML followed by HRM analysis did not allow discrimination between L. (L.) amazonensis and L. (L.) infantum in 53.4% of cases. Hence, the novel SYBR green-based qPCR (qPCR-ama) has been tested. This assay achieved a detection limit of 0.1 pg of parasite DNA in samples spiked with host DNA and did not show cross amplification with Trypanosoma cruzi or host DNA. Although the qPCR-ama also amplified L. (L.) infantum strains, the Cq values were dramatically increased compared to qPCR-ML. Therefore, the combined analysis of Cq values from qPCR-ML and qPCR-ama allowed us to distinguish L. (L.) infantum and L. (L.) amazonensis in 100% of tested samples. A new and affordable SYBR-green qPCR-based approach to distinguish between L. (L.) infantum and L. (L.) amazonensis was developed exploiting the greater abundance of a minicircle sequence rather than targeting a hypothetical species-specific sequence. The fast and accurate discrimination between these species can be useful to provide adequate prognosis and treatment.
A Model for Applying Lexical Approach in Teaching Russian Grammar.
ERIC Educational Resources Information Center
Gettys, Serafima
The lexical approach to teaching Russian grammar is explained, an instructional sequence is outlined, and a classroom study testing the effectiveness of the approach is reported. The lexical approach draws on research on cognitive psychology, second language acquisition theory, and research on learner language. Its bases in research and its…
Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.
Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming
2016-07-01
Research in the past decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clusterings of these enzymes based on sequence, structure, and function information are consistent with one another. For proteases, Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
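One common way to obtain a Jaccard coefficient between two classifications of the same proteins is pair counting: J = a / (a + b + c), where a is the number of protein pairs placed together in both classifications and b and c are the pairs placed together in only one. The sketch below is illustrative; the paper does not publish the exact computation it used.

    from itertools import combinations

    def pair_jaccard(labels1, labels2):
        """Pair-counting Jaccard index between two clusterings of the same items."""
        together1 = {(i, j) for i, j in combinations(range(len(labels1)), 2)
                     if labels1[i] == labels1[j]}
        together2 = {(i, j) for i, j in combinations(range(len(labels2)), 2)
                     if labels2[i] == labels2[j]}
        inter = len(together1 & together2)
        union = len(together1 | together2)
        return inter / union if union else 1.0

    seq_clusters    = [0, 0, 1, 1, 2, 2]    # toy sequence-based classification
    struct_clusters = [0, 0, 1, 2, 2, 2]    # toy structure-based classification
    print(pair_jaccard(seq_clusters, struct_clusters))   # 0.4 for this toy example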
USDA-ARS?s Scientific Manuscript database
Background: Vertebrate immune systems generate diverse repertoires of antibodies capable of mediating response to a variety of antigens. Next generation sequencing methods provide unique approaches to a number of immuno-based research areas including antibody discovery and engineering, disease surve...
NASA Astrophysics Data System (ADS)
Upton, Brianna; Evans, John; Morrow, Cherilynn; Thoms, Brian
2009-11-01
Previous studies have shown that many students have misconceptions about basic concepts in physics. Moreover, it has been concluded that one of the challenges lies in the teaching methodology. To address this, Georgia State University has begun teaching studio algebra-based physics. Although many institutions have implemented studio physics, most have done so in calculus-based sequences. The effectiveness of the studio approach in an algebra-based introductory physics course needs further investigation. A 3-semester study assessing the effectiveness of studio physics in an algebra-based physics sequence has been performed. This study compares the results of student pre- and post-tests using the Force Concept Inventory. Using the results from this assessment tool, we will discuss the effectiveness of the studio approach to teaching physics at GSU.
Computational functional genomics-based approaches in analgesic drug discovery and repurposing.
Lippmann, Catharina; Kringel, Dario; Ultsch, Alfred; Lötsch, Jörn
2018-06-01
Persistent pain is a major healthcare problem affecting a fifth of adults worldwide with still limited treatment options. The search for new analgesics increasingly includes the novel research area of functional genomics, which combines data derived from various processes related to DNA sequence, gene expression or protein function and uses advanced methods of data mining and knowledge discovery with the goal of understanding the relationship between the genome and the phenotype. Its use in drug discovery and repurposing for analgesic indications has so far been performed using knowledge discovery in gene function and drug target-related databases; next-generation sequencing; and functional proteomics-based approaches. Here, we discuss recent efforts in functional genomics-based approaches to analgesic drug discovery and repurposing and highlight the potential of computational functional genomics in this field including a demonstration of the workflow using a novel R library 'dbtORA'.
ERIC Educational Resources Information Center
Attwood, Paul V.
1997-01-01
Describes a self-instructional assignment approach to the teaching of advanced enzymology. Presents an assignment that offers a means of teaching enzymology to students that exposes them to modern computer-based techniques of analyzing protein structure and relates structure to enzyme function. (JRH)
The Electric Circuit as a System: A New Approach.
ERIC Educational Resources Information Center
Hartel, H.
1982-01-01
Traditionally, the terms "current," "voltage," and "resistance" are introduced in a linear sequence according to the structure of the discipline and based on measurement operations. A new approach is discussed, the terms being introduced simultaneously in a qualitative way, using the system aspect of the electric circuit as an integrative base.…
ERIC Educational Resources Information Center
Winarno, Sri; Muthu, Kalaiarasi Sonai; Ling, Lew Sook
2018-01-01
Direct instruction approach has been widely used in higher education. Many studies revealed that direct instruction improved students' knowledge. The characteristics of direct instruction include the subject delivered through face-to-face interaction with the lecturers and materials that sequenced deliberately and taught explicitly. However,…
AmpliVar: mutation detection in high-throughput sequence from amplicon-based libraries.
Hsu, Arthur L; Kondrashova, Olga; Lunke, Sebastian; Love, Clare J; Meldrum, Cliff; Marquis-Nicholson, Renate; Corboy, Greg; Pham, Kym; Wakefield, Matthew; Waring, Paul M; Taylor, Graham R
2015-04-01
Conventional means of identifying variants in high-throughput sequencing align each read against a reference sequence, and then call variants at each position. Here, we demonstrate an orthogonal means of identifying sequence variation by grouping the reads as amplicons prior to any alignment. We used AmpliVar to make key-value hashes of sequence reads and group reads as individual amplicons using a table of flanking sequences. Low-abundance reads were removed according to a selectable threshold, and reads above this threshold were aligned as groups, rather than as individual reads, permitting the use of sensitive alignment tools. We show that this approach is more sensitive, more specific, and more computationally efficient than comparable methods for the analysis of amplicon-based high-throughput sequencing data. The method can be extended to enable alignment-free confirmation of variants seen in hybridization capture target-enrichment data. © 2015 WILEY PERIODICALS, INC.
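The key-value grouping step can be pictured with a few lines of Python: each read is assigned to an amplicon by its flanking sequences, identical reads are hash-counted, and groups below an abundance threshold are discarded before any alignment. This is a schematic of the idea rather than AmpliVar itself, and the flanking-sequence table is hypothetical.

    from collections import Counter, defaultdict

    # hypothetical flanking-sequence table: amplicon name -> (5' flank, 3' flank)
    FLANKS = {"ampliconA": ("ACGTAC", "TTGCCA"),
              "ampliconB": ("GGATCC", "AAGCTT")}

    def group_reads(reads, min_abundance=2):
        """Hash-count identical reads per amplicon and drop low-abundance groups."""
        groups = defaultdict(Counter)
        for read in reads:
            for name, (left, right) in FLANKS.items():
                if read.startswith(left) and read.endswith(right):
                    groups[name][read] += 1          # the read itself is the hash key
                    break
        return {name: {seq: n for seq, n in counts.items() if n >= min_abundance}
                for name, counts in groups.items()}
    # surviving groups would then be aligned as whole amplicons, not as single reads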
Competitive Genomic Screens of Barcoded Yeast Libraries
Urbanus, Malene; Proctor, Michael; Heisler, Lawrence E.; Giaever, Guri; Nislow, Corey
2011-01-01
By virtue of advances in next generation sequencing technologies, we have access to new genome sequences almost daily. The tempo of these advances is accelerating, promising greater depth and breadth. In light of these extraordinary advances, the need for fast, parallel methods to define gene function becomes ever more important. Collections of genome-wide deletion mutants in yeasts and E. coli have served as workhorses for the characterization of gene function, but this approach is not scalable: current gene-deletion approaches require each of the thousands of genes that comprise a genome to be deleted and verified. Only after this work is complete can we pursue high-throughput phenotyping. Over the past decade, our laboratory has refined a portfolio of competitive, miniaturized, high-throughput genome-wide assays that can be performed in parallel. This parallelization is possible because of the inclusion of DNA 'tags', or 'barcodes,' into each mutant, with the barcode serving as a proxy for the mutation, so that barcode abundance can be measured to assess mutant fitness. In this study, we seek to fill the gap between DNA sequence and barcoded mutant collections. To accomplish this, we introduce a combined transposon disruption-barcoding approach that opens up parallel barcode assays to newly sequenced, but poorly characterized microbes. To illustrate this approach, we present a new Candida albicans barcoded disruption collection and describe how both microarray-based and next generation sequencing-based platforms can be used to collect 10,000 - 1,000,000 gene-gene and drug-gene interactions in a single experiment. PMID:21860376
NASA Technical Reports Server (NTRS)
Adeleye, Sanya; Chung, Christopher
2006-01-01
Commercial aircraft undergo a significant number of maintenance and logistical activities during the turnaround operation at the departure gate. By analyzing the sequencing of these activities, more effective turnaround contingency plans may be developed for logistical and maintenance disruptions. Turnaround contingency plans are particularly important as any kind of delay in a hub-based system may cascade into further delays with subsequent connections. The contingency sequencing of the maintenance and logistical turnaround activities was analyzed using a combined network and computer simulation modeling approach. Experimental analysis of both current and alternative policies provides a framework to aid in more effective tactical decision making.
Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.
Lun, Aaron T L; Bach, Karsten; Marioni, John C
2016-04-27
Normalization of single-cell RNA sequencing data is necessary to eliminate cell-specific biases prior to downstream analyses. However, this is not straightforward for noisy single-cell data where many counts are zero. We present a novel approach where expression values are summed across pools of cells, and the summed values are used for normalization. Pool-based size factors are then deconvolved to yield cell-based factors. Our deconvolution approach outperforms existing methods for accurate normalization of cell-specific biases in simulated data. Similar behavior is observed in real data, where deconvolution improves the relevance of results of downstream analyses.
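The deconvolution idea can be viewed as a linear system: the size factor of a pool is approximately the sum of the size factors of its member cells, so pool-level factors estimated against an average reference can be solved by least squares for per-cell factors. The numpy sketch below is a drastically simplified illustration of that idea, not the published implementation; the pooling scheme and the simulated counts are invented for the example.

    import numpy as np

    def deconvolve_size_factors(counts, pools):
        """counts: genes x cells matrix; pools: list of cell-index lists.
        Estimate a size factor per pool against the average cell profile,
        then solve A s = b for per-cell factors by least squares (sketch)."""
        ref = counts.mean(axis=1)                       # pseudo-reference cell
        A = np.zeros((len(pools), counts.shape[1]))
        b = np.zeros(len(pools))
        for r, pool in enumerate(pools):
            pooled = counts[:, pool].sum(axis=1)
            keep = ref > 0
            b[r] = np.median(pooled[keep] / ref[keep])  # pool-level size factor
            A[r, pool] = 1.0                            # pool factor ~ sum of cell factors
        s, *_ = np.linalg.lstsq(A, b, rcond=None)
        return s / np.mean(s)                           # center to unit mean

    rng = np.random.default_rng(1)
    n_cells = 30
    true = rng.uniform(0.5, 2.0, n_cells)               # simulated cell-specific biases
    X = rng.poisson(true * rng.gamma(2.0, 1.0, (200, 1)))
    pools = [[(i + j) % n_cells for j in range(w)]
             for w in (5, 10, 15) for i in range(n_cells)]   # overlapping ring pools
    est = deconvolve_size_factors(X, pools)
    print(np.corrcoef(est, true)[0, 1])                 # agreement with the simulated factors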
NASA Astrophysics Data System (ADS)
Carranza, N.; Cristóbal, G.; Sroubek, F.; Ledesma-Carbayo, M. J.; Santos, A.
2006-08-01
Myocardial motion analysis and quantification is of utmost importance for analyzing contractile heart abnormalities, which can be a symptom of coronary artery disease. A fundamental problem in processing sequences of images is the computation of the optical flow, which is an approximation to the real image motion. This paper presents a new algorithm for optical flow estimation based on a spatiotemporal-frequency (STF) approach, more specifically on the computation of the Wigner-Ville distribution (WVD) and the Hough Transform (HT) of the motion sequences. The latter is a well-known line and shape detection method that is very robust against incomplete data and noise. The rationale for using the HT in this context is that it provides a value of the displacement field from the STF representation. In addition, a probabilistic approach based on Gaussian mixtures has been implemented in order to improve the accuracy of the motion detection. Experimental results with synthetic sequences are compared against an implementation of the variational technique for local and global motion estimation, where it is shown that the results obtained here are accurate and robust to noise degradations. Real cardiac magnetic resonance images have been tested and evaluated with the current method.
Random whole metagenomic sequencing for forensic discrimination of soils.
Khodakova, Anastasia S; Smith, Renee J; Burgoyne, Leigh; Abarno, Damien; Linacre, Adrian
2014-01-01
Here we assess the ability of random whole metagenomic sequencing approaches to discriminate between similar soils from two geographically distinct urban sites for application in forensic science. Repeat samples from two parklands in residential areas separated by approximately 3 km were collected and the DNA was extracted. Shotgun, whole genome amplification (WGA) and single arbitrarily primed DNA amplification (AP-PCR) based sequencing techniques were then used to generate soil metagenomic profiles. Full and subsampled metagenomic datasets were then annotated against M5NR/M5RNA (taxonomic classification) and SEED Subsystems (metabolic classification) databases. Further comparative analyses were performed using a number of statistical tools including: hierarchical agglomerative clustering (CLUSTER); similarity profile analysis (SIMPROF); non-metric multidimensional scaling (NMDS); and canonical analysis of principal coordinates (CAP) at all major levels of taxonomic and metabolic classification. Our data showed that shotgun and WGA-based approaches generated highly similar metagenomic profiles for the soil samples such that the soil samples could not be distinguished accurately. An AP-PCR based approach was shown to be successful at obtaining reproducible site-specific metagenomic DNA profiles, which in turn were employed for successful discrimination of visually similar soil samples collected from two different locations.
Noronha, Jyothi M; Liu, Mengya; Squires, R Burke; Pickett, Brett E; Hale, Benjamin G; Air, Gillian M; Galloway, Summer E; Takimoto, Toru; Schmolke, Mirco; Hunt, Victoria; Klem, Edward; García-Sastre, Adolfo; McGee, Monnie; Scheuermann, Richard H
2012-05-01
Genetic drift of influenza virus genomic sequences occurs through the combined effects of sequence alterations introduced by a low-fidelity polymerase and the varying selective pressures experienced as the virus migrates through different host environments. While traditional phylogenetic analysis is useful in tracking the evolutionary heritage of these viruses, the specific genetic determinants that dictate important phenotypic characteristics are often difficult to discern within the complex genetic background arising through evolution. Here we describe a novel influenza virus sequence feature variant type (Flu-SFVT) approach, made available through the public Influenza Research Database resource (www.fludb.org), in which variant types (VTs) identified in defined influenza virus protein sequence features (SFs) are used for genotype-phenotype association studies. Since SFs have been defined for all influenza virus proteins based on known structural, functional, and immune epitope recognition properties, the Flu-SFVT approach allows the rapid identification of the molecular genetic determinants of important influenza virus characteristics and their connection to underlying biological functions. We demonstrate the use of the SFVT approach to obtain statistical evidence for effects of NS1 protein sequence variations in dictating influenza virus host range restriction.
Heuristics for multiobjective multiple sequence alignment.
Abbasi, Maryam; Paquete, Luís; Pereira, Francisco B
2016-07-15
Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameters setting, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may provide an undesirable bias in further steps of the analysis and give too simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives to the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments. Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.
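The hypervolume indicator used for evaluation measures the area (in the bi-objective case) dominated by a set of score vectors relative to a reference point that is worse in every objective; larger values indicate a better trade-off front. Below is a compact sketch for two maximization objectives, with the alignment scores and reference point invented for illustration; it is not the authors' evaluation code.

    def hypervolume_2d(points, ref):
        """Hypervolume (dominated area) of a set of 2-D points under maximization,
        relative to a reference point ref that is no better in either objective."""
        # keep only nondominated points
        front = [p for p in points
                 if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)]
        front.sort(key=lambda p: -p[0])          # decreasing first objective
        area, prev_y = 0.0, ref[1]
        for x, y in front:                       # sweep: each point adds a rectangle strip
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
        return area

    # toy trade-off front: (substitution score, -number of gaps)
    alignments = [(120, -30), (150, -45), (135, -38)]
    print(hypervolume_2d(alignments, ref=(0, -60)))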
You, Ronghui; Huang, Xiaodi; Zhu, Shanfeng
2018-06-06
As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. The previous studies conclude that sequence homology based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. Other than sequences, alternative information sources such as text, however, may be useful for AFP as well. Instead of using BOW (bag of words) representation in traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on the benchmark dataset extracted from UniProt/SwissProt have demonstrated that DeepText2GO significantly outperformed both text-based and sequence-based methods, validating its superiority. Copyright © 2018 Elsevier Inc. All rights reserved.
Approach to numerical safety guidelines based on a core melt criterion. [PWR; BWR
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azarm, M.A.; Hall, R.E.
1982-01-01
A plausible approach is proposed for translating a single-level criterion into a set of numerical guidelines. The criterion for core melt probability is used to set numerical guidelines for various core melt sequences, systems and component unavailabilities. These guidelines can be used as a means for making decisions regarding the necessity for replacing a component or improving part of a safety system. This approach is applied to estimate a set of numerical guidelines for various sequences of core melts that are analyzed in the Reactor Safety Study for the Peach Bottom Nuclear Power Plant.
Understanding protein evolution: from protein physics to Darwinian selection.
Zeldovich, Konstantin B; Shakhnovich, Eugene I
2008-01-01
Efforts in whole-genome sequencing and structural proteomics start to provide a global view of the protein universe, the set of existing protein structures and sequences. However, approaches based on the selection of individual sequences have not been entirely successful at the quantitative description of the distribution of structures and sequences in the protein universe because evolutionary pressure acts on the entire organism, rather than on a particular molecule. In parallel to this line of study, studies in population genetics and phenomenological molecular evolution established a mathematical framework to describe the changes in genome sequences in populations of organisms over time. Here, we review both microscopic (physics-based) and macroscopic (organism-level) models of protein-sequence evolution and demonstrate that bridging the two scales provides the most complete description of the protein universe starting from clearly defined, testable, and physiologically relevant assumptions.
Deciphering the distance to antibiotic resistance for the pneumococcus using genome sequencing data
Mobegi, Fredrick M.; Cremers, Amelieke J. H.; de Jonge, Marien I.; Bentley, Stephen D.; van Hijum, Sacha A. F. T.; Zomer, Aldert
2017-01-01
Advances in genome sequencing technologies and genome-wide association studies (GWAS) have provided unprecedented insights into the molecular basis of microbial phenotypes and enabled the identification of the underlying genetic variants in real populations. However, utilization of genome sequencing in clinical phenotyping of bacteria is challenging due to the lack of reliable and accurate approaches. Here, we report a method for predicting microbial resistance patterns using genome sequencing data. We analyzed whole genome sequences of 1,680 Streptococcus pneumoniae isolates from four independent populations using GWAS and identified probable hotspots of genetic variation which correlate with phenotypes of resistance to essential classes of antibiotics. With the premise that accumulation of putative resistance-conferring SNPs, potentially in combination with specific resistance genes, precedes full resistance, we retrogressively surveyed the hotspot loci and quantified the number of SNPs and/or genes, which if accumulated would confer full resistance to an otherwise susceptible strain. We name this approach the ‘distance to resistance’. It can be used to identify the creep towards complete antibiotics resistance in bacteria using genome sequencing. This approach serves as a basis for the development of future sequencing-based methods for predicting resistance profiles of bacterial strains in hospital microbiology and public health settings. PMID:28205635
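The 'distance to resistance' reduces to a simple count: for a susceptible isolate, how many of the GWAS-identified resistance-associated SNPs and genes are still missing. The toy sketch below illustrates the bookkeeping; the marker names and isolates are invented, not taken from the study.

    # hypothetical resistance-associated markers from a GWAS hotspot survey
    RESISTANCE_MARKERS = {"pbp2x:T338A", "pbp2b:A619G", "murM_present", "pbp1a:S462A"}

    def distance_to_resistance(isolate_markers):
        """Number of resistance-associated markers an isolate still lacks;
        0 means the isolate already carries the full resistance complement."""
        return len(RESISTANCE_MARKERS - set(isolate_markers))

    isolates = {
        "strain_01": {"pbp2x:T338A"},                                  # distance 3
        "strain_02": {"pbp2x:T338A", "pbp2b:A619G", "murM_present"},   # distance 1
    }
    for name, markers in isolates.items():
        print(name, distance_to_resistance(markers))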
Illeghems, Koen; De Vuyst, Luc; Papalexandratou, Zoi; Weckx, Stefan
2012-01-01
This is the first report on the phylogenetic analysis of the community diversity of a single spontaneous cocoa bean box fermentation sample through a metagenomic approach involving 454 pyrosequencing. Several sequence-based and composition-based taxonomic profiling tools were used and evaluated to avoid software-dependent results and their outcome was validated by comparison with previously obtained culture-dependent and culture-independent data. Overall, this approach revealed a wider bacterial (mainly γ-Proteobacteria) and fungal diversity than previously found. Further, the use of a combination of different classification methods, in a software-independent way, helped to understand the actual composition of the microbial ecosystem under study. In addition, bacteriophage-related sequences were found. The bacterial diversity depended partially on the methods used, as composition-based methods predicted a wider diversity than sequence-based methods, and as classification methods based solely on phylogenetic marker genes predicted a more restricted diversity compared with methods that took all reads into account. The metagenomic sequencing analysis identified Hanseniaspora uvarum, Hanseniaspora opuntiae, Saccharomyces cerevisiae, Lactobacillus fermentum, and Acetobacter pasteurianus as the prevailing species. Also, the presence of occasional members of the cocoa bean fermentation process was revealed (such as Erwinia tasmaniensis, Lactobacillus brevis, Lactobacillus casei, Lactobacillus rhamnosus, Lactococcus lactis, Leuconostoc mesenteroides, and Oenococcus oeni). Furthermore, the sequence reads associated with viral communities were of a restricted diversity, dominated by Myoviridae and Siphoviridae, and reflecting Lactobacillus as the dominant host. To conclude, an accurate overview of all members of a cocoa bean fermentation process sample was revealed, indicating the superiority of metagenomic sequencing over previously used techniques.
Gruel, Jérémy; LeBorgne, Michel; LeMeur, Nolwenn; Théret, Nathalie
2011-09-12
Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptionally high number of SSM is taken to indicate a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally, we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.
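The counting step can be approximated with shared k-mers: enumerate all subsequences of a fixed length in the conserved promoter sequence of each gene and count how many are common to a pair, then ask whether that count is exceptional relative to chance. The sketch below is a simplified stand-in for the paper's SSM definition, which also allows a similarity tolerance; the promoter strings are invented.

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def shared_motif_count(promoter1, promoter2, k=8):
        """Number of exact length-k motifs shared by two promoter sequences
        (a simplified stand-in for Simple Shared Motifs)."""
        return len(kmers(promoter1, k) & kmers(promoter2, k))

    p1 = "TATAAAGGCCGGCGTACGTACGATCGATCGGCTAGCTAGGCTTATAAA"
    p2 = "CGGCGTACGTACGATCGTTTTAGGCTAGCTAGGCTTATAAACCCCGGG"
    print(shared_motif_count(p1, p2, k=8))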
Single molecule sequencing-guided scaffolding and correction of draft assemblies.
Zhu, Shenglong; Chen, Danny Z; Emrich, Scott J
2017-12-06
Although single molecule sequencing is still improving, the lengths of the generated sequences are a clear advantage in genome assembly. Prior work that utilizes long reads to conduct genome assembly has mostly focused on correcting sequencing errors and improving contiguity of de novo assemblies. We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and solve it by a 2-approximation algorithm. Our experimental results show that our approach can improve the structural correctness of target assemblies at the cost of some contiguity, even with smaller amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
You, Yanqin; Sun, Yan; Li, Xuchao; Li, Yali; Wei, Xiaoming; Chen, Fang; Ge, Huijuan; Lan, Zhangzhang; Zhu, Qian; Tang, Ying; Wang, Shujuan; Gao, Ya; Jiang, Fuman; Song, Jiaping; Shi, Quan; Zhu, Xuan; Mu, Feng; Dong, Wei; Gao, Vince; Jiang, Hui; Yi, Xin; Wang, Wei; Gao, Zhiying
2014-08-01
This article demonstrates a prominent noninvasive prenatal approach to assist the clinical diagnosis of a single-gene disorder disease, maple syrup urine disease, using targeted sequencing knowledge from the affected family. The method reported here combines novel mutant discovery in known genes by targeted massively parallel sequencing with noninvasive prenatal testing. By applying this new strategy, we successfully revealed novel mutations in the gene BCKDHA (Ex2_4dup and c.392A>G) in this Chinese family and developed a prenatal haplotype-assisted approach to noninvasively detect the genotype of the fetus (transmitted from both parents). This is the first report of integration of targeted sequencing and noninvasive prenatal testing into clinical practice. Our study has demonstrated that this massively parallel sequencing-based strategy can potentially be used for single-gene disorder diagnosis in the future.
Madi, Nada; Al-Nakib, Widad; Mustafa, Abu Salim; Habibi, Nazima
2018-03-01
A metagenomic approach based on target-independent next-generation sequencing has become an established method for the detection of both known and novel viruses in clinical samples. This study aimed to use the metagenomic sequencing approach to characterize the viral diversity in respiratory samples from patients with respiratory tract infections. We have investigated 86 respiratory samples received from various hospitals in Kuwait between 2015 and 2016 for the diagnosis of respiratory tract infections. A metagenomic approach using a next-generation sequencer was used to characterize viruses. According to the metagenomic analysis, an average of 145,019 reads was identified, and 2% of these reads were of viral origin. Also, metagenomic analysis of the viral sequences revealed many known respiratory viruses, which were detected in 30.2% of the clinical samples. In addition, sequences of non-respiratory viruses were detected in 14% of the clinical samples, while sequences of non-human viruses were detected in 55.8% of the clinical samples. The average genome coverage of the viruses was 12% with the highest genome coverage of 99.2% for respiratory syncytial virus, and the lowest was 1% for torque teno midi virus 2. Our results showed 47.7% agreement between multiplex Real-Time PCR and metagenomics sequencing in the detection of respiratory viruses in the clinical samples. Though there are some difficulties in applying this method to clinical samples, such as specimen quality, these observations are indicative of the promising utility of the metagenomic sequencing approach for the identification of respiratory viruses in patients with respiratory tract infections. © 2017 Wiley Periodicals, Inc.
Copy number variation of individual cattle genomes using next-generation sequencing
USDA-ARS?s Scientific Manuscript database
Copy number variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often intractable. Using a read depth approach based on next-generation sequencing, we examined genome-wide copy number differences among five taurine (three Angus, one ...
Individualized cattle copy number and segmental duplication maps using next generation sequencing
USDA-ARS?s Scientific Manuscript database
Copy Number Variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often intractable. Using a read depth approach based on next generation sequencing, we examined genome-wide copy number differences among five taurine (three Angus, one ...
Getting Limits off the Ground via Sequences
ERIC Educational Resources Information Center
Gass, Frederick
2006-01-01
Most beginning calculus courses spend little or no time on a technical definition of the limit concept. In most of the remaining courses, the definition presented is the traditional epsilon-delta definition. An alternative approach that bases the definition on infinite sequences has occasionally appeared in commercial textbooks but has not yet…
Early detection of non-native fishes using next-generation DNA sequencing of fish larvae
Our objective was to evaluate the use of fish larvae for early detection of non-native fishes, comparing traditional and molecular taxonomy based on next-generation DNA sequencing to investigate potential efficiencies. Our approach was to intensively sample a Great Lakes non-nati...
A tale of two sequences: microRNA-target chimeric reads.
Broughton, James P; Pasquinelli, Amy E
2016-04-04
In animals, a functional interaction between a microRNA (miRNA) and its target RNA requires only partial base pairing. The limited number of base pair interactions required for miRNA targeting provides miRNAs with broad regulatory potential and also makes target prediction challenging. Computational approaches to target prediction have focused on identifying miRNA target sites based on known sequence features that are important for canonical targeting and may miss non-canonical targets. Current state-of-the-art experimental approaches, such as CLIP-seq (cross-linking immunoprecipitation with sequencing), PAR-CLIP (photoactivatable-ribonucleoside-enhanced CLIP), and iCLIP (individual-nucleotide resolution CLIP), require inference of which miRNA is bound at each site. Recently, the development of methods to ligate miRNAs to their target RNAs during the preparation of sequencing libraries has provided a new tool for the identification of miRNA target sites. The chimeric, or hybrid, miRNA-target reads that are produced by these methods unambiguously identify the miRNA bound at a specific target site. The information provided by these chimeric reads has revealed extensive non-canonical interactions between miRNAs and their target mRNAs, and identified many novel interactions between miRNAs and noncoding RNAs.
Schürch, A C; Arredondo-Alonso, S; Willems, R J L; Goering, R V
2018-04-01
Whole genome sequence (WGS)-based strain typing finds increasing use in the epidemiologic analysis of bacterial pathogens in both public health and more localized infection control settings. This minireview describes methodologic approaches that have been explored for WGS-based epidemiologic analysis and considers the challenges and pitfalls of data interpretation. Personal collection of relevant publications. When applying WGS to study the molecular epidemiology of bacterial pathogens, genomic variability between strains is translated into measures of distance by determining single nucleotide polymorphisms in core genome alignments or by indexing allelic variation in hundreds to thousands of core genes, assigning types to unique allelic profiles. Interpreting isolate relatedness from these distances is highly organism specific, and attempts to establish species-specific cutoffs are unlikely to be generally applicable. In cases where single nucleotide polymorphism or core gene typing does not provide the resolution necessary for accurate assessment of the epidemiology of bacterial pathogens, inclusion of accessory gene or plasmid sequences may provide the additional required discrimination. As with all epidemiologic analysis, realizing the full potential of the revolutionary advances in WGS-based approaches requires understanding and dealing with issues related to the fundamental steps of data generation and interpretation. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
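To make the distance step above concrete, here is a minimal, hedged Python sketch of pairwise SNP distance computation from a core-genome alignment. The toy alignment, isolate names and gap handling are illustrative assumptions, not part of the review; real WGS typing pipelines add quality filtering, recombination masking and organism-specific thresholds.

```python
from itertools import combinations

def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned core-genome sequences differ.

    Gaps and ambiguous bases ('-' or 'N') are ignored, a common but
    simplified convention; real pipelines apply stricter filters.
    """
    ignore = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )

# Hypothetical toy core-genome alignment (all sequences the same length).
alignment = {
    "isolate_1": "ATGCGTACGT",
    "isolate_2": "ATGCGTACCT",
    "isolate_3": "ATGAGTACCT",
}

# Pairwise SNP distance matrix, the raw material for clustering isolates.
for name_a, name_b in combinations(alignment, 2):
    d = snp_distance(alignment[name_a], alignment[name_b])
    print(f"{name_a} vs {name_b}: {d} SNPs")
```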
Targeted Capture in Evolutionary and Ecological Genomics
Jones, Matthew R.; Good, Jeffrey M.
2016-01-01
The rapid expansion of next-generation sequencing has yielded a powerful array of tools to address fundamental biological questions at a scale that was inconceivable just a few years ago. Various genome partitioning strategies to sequence select subsets of the genome have emerged as powerful alternatives to whole genome sequencing in ecological and evolutionary genomic studies. High throughput targeted capture is one such strategy that involves the parallel enrichment of pre-selected genomic regions of interest. The growing use of targeted capture demonstrates its potential power to address a range of research questions, yet these approaches have yet to expand broadly across labs focused on evolutionary and ecological genomics. In part, the use of targeted capture has been hindered by the logistics of capture design and implementation in species without established reference genomes. Here we aim to 1) increase the accessibility of targeted capture to researchers working in non-model taxa by discussing capture methods that circumvent the need of a reference genome, 2) highlight the evolutionary and ecological applications where this approach is emerging as a powerful sequencing strategy, and 3) discuss the future of targeted capture and other genome partitioning approaches in light of the increasing accessibility of whole genome sequencing. Given the practical advantages and increasing feasibility of high-throughput targeted capture, we anticipate an ongoing expansion of capture-based approaches in evolutionary and ecological research, synergistic with an expansion of whole genome sequencing. PMID:26137993
Li, Jun; Tibshirani, Robert
2015-01-01
We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or ‘sequencing depths’. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by ‘outliers’ in the data. We introduce a simple, nonparametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods. PMID:22127579
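The following sketch illustrates the general idea of handling unequal sequencing depths by resampling and then applying a rank-based test per gene. It is not the authors' exact statistic; the count matrix, group labels and the use of multinomial down-sampling plus a Mann-Whitney test are simplifying assumptions for illustration only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def downsample(counts: np.ndarray, target_depth: int) -> np.ndarray:
    """Resample a sample's gene counts to a common total depth
    (multinomial thinning), one simple way to remove depth differences."""
    probs = counts / counts.sum()
    return rng.multinomial(target_depth, probs)

# Hypothetical count matrix: rows = genes, columns = samples.
counts = np.array([
    [100, 120, 400, 380],   # gene 0, relatively higher in the last two samples
    [ 50,  55,  60,  52],   # gene 1, relatively flat
])
groups = np.array([0, 0, 1, 1])         # two-class outcome
depth = counts.sum(axis=0).min()        # common target depth

normalized = np.column_stack(
    [downsample(counts[:, j], depth) for j in range(counts.shape[1])]
)

# Rank-based two-class test per gene (a stand-in for the paper's statistic).
for g in range(normalized.shape[0]):
    stat, p = mannwhitneyu(
        normalized[g, groups == 0], normalized[g, groups == 1]
    )
    print(f"gene {g}: p = {p:.3f}")
```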
Haplotag: Software for Haplotype-Based Genotyping-by-Sequencing Analysis
Tinker, Nicholas A.; Bekele, Wubishet A.; Hattori, Jiro
2016-01-01
Genotyping-by-sequencing (GBS), and related methods, are based on high-throughput short-read sequencing of genomic complexity reductions followed by discovery of single nucleotide polymorphisms (SNPs) within sequence tags. This provides a powerful and economical approach to whole-genome genotyping, facilitating applications in genomics, diversity analysis, and molecular breeding. However, due to the complexity of analyzing large data sets, applications of GBS may require substantial time, expertise, and computational resources. Haplotag, the novel GBS software described here, is freely available, and operates with minimal user-investment on widely available computer platforms. Haplotag is unique in fulfilling the following set of criteria: (1) operates without a reference genome; (2) can be used in a polyploid species; (3) provides a discovery mode, and a production mode; (4) discovers polymorphisms based on a model of tag-level haplotypes within sequenced tags; (5) reports SNPs as well as haplotype-based genotypes; and (6) provides an intuitive visual “passport” for each inferred locus. Haplotag is optimized for use in a self-pollinating plant species. PMID:26818073
Xu, Weijia; Ozer, Stuart; Gutell, Robin R.
2010-01-01
With an increasingly large amount of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large scale alignment and less effective with the sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolutional rates among pairs of nucleotide positions using phylogenetic and evolutionary relationships of the organisms of aligned sequences. With a novel data schema to manage relevant information within a relational database, our method, implemented with a Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times bigger and 50% better sensitivity than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure. PMID:20502534
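As a simplified stand-in for the covariation analysis described above, the sketch below scores pairs of alignment columns by plain mutual information. The toy alignment is hypothetical, and the phylogenetic weighting and relational-database machinery of the actual method are deliberately omitted.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_i, col_j) -> float:
    """Mutual information between two alignment columns; high values
    suggest covariation consistent with base pairing."""
    n = len(col_i)
    pi = Counter(col_i)
    pj = Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Hypothetical aligned RNA fragment (rows = sequences).
alignment = [
    "GCAUGC",
    "GGAUCC",
    "GUAUAC",
    "GAAUUC",
]
columns = list(zip(*alignment))
for i, j in combinations(range(len(columns)), 2):
    mi = mutual_information(columns[i], columns[j])
    print(f"columns {i}-{j}: MI = {mi:.2f}")
```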
Retterer, Kyle; Scuffins, Julie; Schmidt, Daniel; Lewis, Rachel; Pineda-Alvarez, Daniel; Stafford, Amanda; Schmidt, Lindsay; Warren, Stephanie; Gibellini, Federica; Kondakova, Anastasia; Blair, Amanda; Bale, Sherri; Matyakhina, Ludmila; Meck, Jeanne; Aradhya, Swaroop; Haverfield, Eden
2015-08-01
Detection of copy-number variation (CNV) is important for investigating many genetic disorders. Testing a large clinical cohort by array comparative genomic hybridization provides a deep perspective on the spectrum of pathogenic CNV. In this context, we describe a bioinformatics approach to extract CNV information from whole-exome sequencing and demonstrate its utility in clinical testing. Exon-focused arrays and whole-genome chromosomal microarray analysis were used to test 14,228 and 14,000 individuals, respectively. Based on these results, we developed an algorithm to detect deletions/duplications in whole-exome sequencing data and a novel whole-exome array. In the exon array cohort, we observed a positive detection rate of 2.4% (25 duplications, 318 deletions), of which 39% involved one or two exons. Chromosomal microarray analysis identified 3,345 CNVs affecting single genes (18%). We demonstrate that our whole-exome sequencing algorithm resolves CNVs of three or more exons. These results demonstrate the clinical utility of single-exon resolution in CNV assays. Our whole-exome sequencing algorithm approaches this resolution but is complemented by a whole-exome array to unambiguously identify intragenic CNVs and single-exon changes. These data illustrate the next advancements in CNV analysis through whole-exome sequencing and whole-exome array. Genet Med 17(8), 623-629.
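A hedged sketch of the generic read-depth idea behind exome-based CNV detection follows: per-exon coverage is normalized, compared against a reference panel, and outlying log2 ratios are flagged. The coverage values, exon names and thresholds are invented for illustration and do not reproduce the laboratory's algorithm.

```python
import numpy as np

# Hypothetical per-exon read depth: test sample vs. a small reference panel.
exons = ["GENE1_ex1", "GENE1_ex2", "GENE1_ex3", "GENE2_ex1"]
sample_depth = np.array([110.0, 55.0, 230.0, 100.0])
panel_depth = np.array([
    [100.0, 105.0,  98.0, 102.0],   # exon x panel-sample matrix
    [102.0,  99.0, 101.0,  97.0],
    [101.0, 103.0,  99.0, 100.0],
    [ 98.0, 100.0, 102.0, 101.0],
])

# Normalize for overall sequencing depth, then compare to the panel median.
sample_norm = sample_depth / sample_depth.sum()
panel_norm = panel_depth / panel_depth.sum(axis=0)
panel_median = np.median(panel_norm, axis=1)

log2_ratio = np.log2(sample_norm / panel_median)
for exon, ratio in zip(exons, log2_ratio):
    if ratio <= -0.8:
        call = "possible deletion"
    elif ratio >= 0.58:          # roughly log2(3/2), one extra copy
        call = "possible duplication"
    else:
        call = "normal"
    print(f"{exon}: log2 ratio {ratio:+.2f} -> {call}")
```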
Oligomerization of G protein-coupled receptors: computational methods.
Selent, J; Kaczor, A A
2011-01-01
Recent research has unveiled the complexity of mechanisms involved in G protein-coupled receptor (GPCR) functioning, in which receptor dimerization/oligomerization may play an important role. Although the first high-resolution X-ray structure for a likely functional chemokine receptor dimer has been deposited in the Protein Data Bank, the interactions and mechanisms of dimer formation are not yet fully understood. In this respect, computational methods play a key role in predicting accurate GPCR complexes. This review outlines computational approaches focusing on sequence- and structure-based methodologies and discusses their advantages and limitations. Sequence-based approaches that search for possible protein-protein interfaces in GPCR complexes have been applied with success in several studies, but did not always yield consistent results. Structure-based methodologies are a potent complement to sequence-based approaches. For instance, protein-protein docking is a valuable method, especially when guided by experimental constraints. Some disadvantages, such as limited receptor flexibility and the lack of an explicit membrane environment, have to be taken into account. Molecular dynamics simulation can overcome these drawbacks, giving a detailed description of conformational changes in a native-like membrane. Successful prediction of GPCR complexes using computational approaches combined with experimental efforts may help to understand the role of dimeric/oligomeric GPCR complexes in fine-tuning receptor signaling. Moreover, since such GPCR complexes have attracted interest as potential drug targets for diverse diseases, unveiling the molecular determinants of dimerization/oligomerization can provide important implications for drug discovery.
DECIPHER, a Search-Based Approach to Chimera Identification for 16S rRNA Sequences
Wright, Erik S.; Yilmaz, L. Safak
2012-01-01
DECIPHER is a new method for finding 16S rRNA chimeric sequences by the use of a search-based approach. The method is based upon detecting short fragments that are uncommon in the phylogenetic group where a query sequence is classified but frequently found in another phylogenetic group. The algorithm was calibrated for full sequences (fs_DECIPHER) and short sequences (ss_DECIPHER) and benchmarked against WigeoN (Pintail), ChimeraSlayer, and Uchime using artificially generated chimeras. Overall, ss_DECIPHER and Uchime provided the highest chimera detection for sequences 100 to 600 nucleotides long (79% and 81%, respectively), but Uchime's performance deteriorated for longer sequences, while ss_DECIPHER maintained a high detection rate (89%). Both methods had low false-positive rates (1.3% and 1.6%). The more conservative fs_DECIPHER, benchmarked only for sequences longer than 600 nucleotides, had an overall detection rate lower than that of ss_DECIPHER (75%) but higher than those of the other programs. In addition, fs_DECIPHER had the lowest false-positive rate among all the benchmarked programs (<0.20%). DECIPHER was outperformed only by ChimeraSlayer and Uchime when chimeras were formed from closely related parents (less than 10% divergence). Given the differences in the programs, it was possible to detect over 89% of all chimeras with just the combination of ss_DECIPHER and Uchime. Using fs_DECIPHER, we detected between 1% and 2% additional chimeras in the RDP, SILVA, and Greengenes databases from which chimeras had already been removed with Pintail or Bellerophon. DECIPHER was implemented in the R programming language and is directly accessible through a webpage or by downloading the program as an R package (http://DECIPHER.cee.wisc.edu). PMID:22101057
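To illustrate the search-based intuition behind chimera detection, the following minimal sketch counts short query fragments that are absent from the query's assigned phylogenetic group but present in another group. The k-mer length, reference sequences and scoring are assumptions for demonstration and are much cruder than DECIPHER itself.

```python
from typing import Set

def kmers(seq: str, k: int = 8) -> list:
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def chimera_score(query: str, own_group: Set[str],
                  other_group: Set[str], k: int = 8) -> float:
    """Fraction of query fragments absent from the query's own phylogenetic
    group but present in another group -- a crude stand-in for a
    search-based chimera signal."""
    frags = kmers(query, k)
    suspicious = sum(1 for f in frags if f not in own_group and f in other_group)
    return suspicious / len(frags)

# Hypothetical reference fragment sets for two phylogenetic groups.
group_a = set(kmers("ACGTACGTACGGCATGCATGCA"))
group_b = set(kmers("TTGACCGGTTAACCGGTTAACC"))

# A query classified into group A, but whose 3' half comes from group B.
chimeric_query = "ACGTACGTACG" + "GGTTAACCGGTT"
print(f"chimera score: {chimera_score(chimeric_query, group_a, group_b):.2f}")
```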
Accelerating Information Retrieval from Profile Hidden Markov Model Databases.
Tamimi, Ahmad; Ashhab, Yaqoub; Tamimi, Hashem
2016-01-01
Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is an increasing interest to improve the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have been focusing on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for using batch query searching approach, are strong motivations that call for further enhancement of information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than focusing on the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities. A representative for each cluster was assigned. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41%, and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.
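The sketch below illustrates the cluster-based search-space reduction idea: cluster the database, compare a query to cluster representatives first, then search only the best-matching clusters (with some overlap for sensitivity). Plain feature vectors and Euclidean distance stand in for profile-HMMs and their alignment scores, which is a strong simplifying assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-in for a profile database: each row is a feature vector summarizing
# one profile (real profile-HMMs would need an HMM-aware distance instead).
profiles = rng.normal(size=(400, 32))
n_clusters = 8

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(profiles)
representatives = kmeans.cluster_centers_

def two_stage_search(query: np.ndarray, top_clusters: int = 2):
    """Search cluster representatives first, then only the members of the
    best-matching clusters -- the search-space reduction idea."""
    d_rep = np.linalg.norm(representatives - query, axis=1)
    chosen = np.argsort(d_rep)[:top_clusters]          # allow some overlap
    candidates = np.flatnonzero(np.isin(kmeans.labels_, chosen))
    d_cand = np.linalg.norm(profiles[candidates] - query, axis=1)
    return candidates[np.argmin(d_cand)], len(candidates)

query = profiles[123] + rng.normal(scale=0.05, size=32)
best, examined = two_stage_search(query)
print(f"best match: profile {best}, examined {examined} of {len(profiles)} profiles")
```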
Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel
2016-01-01
The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identity one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material.
Graph pyramids for protein function prediction.
Sandhan, Tushar; Yoo, Youngjun; Choi, Jin; Kim, Sun
2015-01-01
Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well the protein classification accuracy. Quantitatively, among 14,086 test sequences, on an average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data.
Lagkouvardos, Ilias; Joseph, Divya; Kapfhammer, Martin; Giritli, Sabahattin; Horn, Matthias; Haller, Dirk; Clavel, Thomas
2016-09-23
The SRA (Sequence Read Archive) serves as the primary repository for massive amounts of Next Generation Sequencing data, and currently hosts over 100,000 16S rRNA gene amplicon-based microbial profiles from various host habitats and environments. This number is increasing rapidly and there is a dire need for approaches to utilize this pool of knowledge. Here we created IMNGS (Integrated Microbial Next Generation Sequencing), an innovative platform that uniformly and systematically screens for and processes all prokaryotic 16S rRNA gene amplicon datasets available in SRA and uses them to build sample-specific sequence databases and OTU-based profiles. Via a web interface, this integrative sequence resource can easily be queried by users. We show examples of how the approach allows testing the ecological importance of specific microorganisms in different hosts or ecosystems, and performing targeted diversity studies for selected taxonomic groups. The platform also offers a complete workflow for de novo analysis of users' own raw 16S rRNA gene amplicon datasets for the sake of comparison with existing data. IMNGS can be accessed at www.imngs.org.
Fokkema, Ivo F A C; den Dunnen, Johan T; Taschner, Peter E M
2005-08-01
The completion of the human genome project has initiated, as well as provided the basis for, the collection and study of all sequence variation between individuals. Direct access to up-to-date information on sequence variation is currently provided most efficiently through web-based, gene-centered, locus-specific databases (LSDBs). We have developed the Leiden Open (source) Variation Database (LOVD) software approaching the "LSDB-in-a-Box" idea for the easy creation and maintenance of a fully web-based gene sequence variation database. LOVD is platform-independent and uses PHP and MySQL open source software only. The basic gene-centered and modular design of the database follows the recommendations of the Human Genome Variation Society (HGVS) and focuses on the collection and display of DNA sequence variations. With minimal effort, the LOVD platform is extendable with clinical data. The open set-up should both facilitate and promote functional extension with scripts written by the community. The LOVD software is freely available from the Leiden Muscular Dystrophy pages (www.DMD.nl/LOVD/). To promote the use of LOVD, we currently offer curators the possibility to set up an LSDB on our Leiden server. (c) 2005 Wiley-Liss, Inc.
Inferring the mode of origin of polyploid species from next-generation sequence data.
Roux, Camille; Pannell, John R
2015-03-01
Many eukaryote organisms are polyploid. However, despite their importance, evolutionary inference of polyploid origins and modes of inheritance has been limited by a need for analyses of allele segregation at multiple loci using crosses. The increasing availability of sequence data for nonmodel species now allows the application of established approaches for the analysis of genomic data in polyploids. Here, we ask whether approximate Bayesian computation (ABC), applied to realistic traditional and next-generation sequence data, allows correct inference of the evolutionary and demographic history of polyploids. Using simulations, we evaluate the robustness of evolutionary inference by ABC for tetraploid species as a function of the number of individuals and loci sampled, and the presence or absence of an outgroup. We find that ABC adequately retrieves the recent evolutionary history of polyploid species on the basis of both old and new sequencing technologies. The application of ABC to sequence data from diploid and polyploid species of the plant genus Capsella confirms its utility. Our analysis strongly supports an allopolyploid origin of C. bursa-pastoris about 80 000 years ago. This conclusion runs contrary to previous findings based on the same data set but using an alternative approach and is in agreement with recent findings based on whole-genome sequencing. Our results indicate that ABC is a promising and powerful method for revealing the evolution of polyploid species, without the need to attribute alleles to a homeologous chromosome pair. The approach can readily be extended to more complex scenarios involving higher ploidy levels. © 2015 John Wiley & Sons Ltd.
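As a toy illustration of rejection ABC, the sketch below infers a single parameter by drawing from a prior, simulating a summary statistic and keeping draws that land close to the observed value. The binomial "simulator", prior range and tolerance are stand-in assumptions; the study's actual analysis simulates coalescent histories of tetraploids and their putative parental lineages.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "observed" data: number of polymorphic sites in a sample of loci.
n_sites = 5000
observed_polymorphic = 310

def simulate(theta: float) -> int:
    """Crude stand-in simulator: each site is polymorphic independently with
    probability theta (real applications simulate full genealogies)."""
    return rng.binomial(n_sites, theta)

# Rejection ABC: draw from the prior, keep draws whose simulated summary
# statistic lands close to the observed one.
n_draws, tolerance = 100_000, 15
prior_draws = rng.uniform(0.0, 0.2, size=n_draws)
accepted = [
    theta for theta in prior_draws
    if abs(simulate(theta) - observed_polymorphic) <= tolerance
]

posterior = np.array(accepted)
print(f"accepted {posterior.size} draws; "
      f"posterior mean = {posterior.mean():.4f}, "
      f"95% interval = ({np.quantile(posterior, 0.025):.4f}, "
      f"{np.quantile(posterior, 0.975):.4f})")
```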
Precone, Vincenza; Del Monaco, Valentina; Esposito, Maria Valeria; De Palma, Fatima Domenica Elisa; Ruocco, Anna; D'Argenio, Valeria
2015-01-01
Next-generation sequencing (NGS) technologies have greatly impacted on every field of molecular research mainly because they reduce costs and increase throughput of DNA sequencing. These features, together with the technology's flexibility, have opened the way to a variety of applications including the study of the molecular basis of human diseases. Several analytical approaches have been developed to selectively enrich regions of interest from the whole genome in order to identify germinal and/or somatic sequence variants and to study DNA methylation. These approaches are now widely used in research, and they are already being used in routine molecular diagnostics. However, some issues are still controversial, namely, standardization of methods, data analysis and storage, and ethical aspects. Besides providing an overview of the NGS-based approaches most frequently used to study the molecular basis of human diseases at DNA level, we discuss the principal challenges and applications of NGS in the field of human genomics. PMID:26665001
NASA Astrophysics Data System (ADS)
Eugster, H.; Huber, F.; Nebiker, S.; Gisi, A.
2012-07-01
Stereovision-based mobile mapping systems enable the efficient capture of directly georeferenced stereo pairs. With today's camera and onboard storage technologies, imagery can be captured at high data rates, resulting in dense stereo sequences. These georeferenced stereo sequences provide a highly detailed and accurate digital representation of the roadside environment, which builds the foundation for a wide range of 3D mapping applications and image-based geo web-services. Georeferenced stereo images are ideally suited for the 3D mapping of street furniture and visible infrastructure objects, pavement inspection, asset management tasks or image-based change detection. As in most mobile mapping systems, the georeferencing of the mapping sensors and observations - in our case the imaging sensors - normally relies on direct georeferencing based on INS/GNSS navigation sensors. However, in urban canyons the achievable direct georeferencing accuracy of the dynamically captured stereo image sequences is often insufficient or at least degraded. Furthermore, many of the mentioned application scenarios require homogeneous georeferencing accuracy within a local reference frame over the entire mapping perimeter. To meet these demands, georeferencing approaches are presented and cost-efficient workflows are discussed that allow the INS/GNSS-based trajectory to be validated and updated with independently estimated positions during prolonged GNSS signal outages, increasing the georeferencing accuracy to the level required by the project.
Ghosh, Jayadri Sekhar; Bhattacharya, Samik; Pal, Amita
2017-06-01
The unavailability of reproductive structures and the unpredictability of vegetative characters for the identification and phylogenetic study of bamboo prompted the application of molecular techniques for greater resolution and consensus. We first employed internal transcribed spacer (ITS1, 5.8S rRNA and ITS2) sequences to construct the phylogenetic tree of 21 tropical bamboo species. While the sequences alone could grossly reconstruct the traditional phylogeny amongst the 21 tropical species studied, some anomalies were encountered that prompted a further refinement of the phylogenetic analyses. Therefore, we integrated the secondary structure of the ITS sequences to derive an individual sequence-structure matrix and gain more resolution in the phylogenetic reconstruction. The results showed that ITS sequence-structure is a reliable alternative to the conventional phenotypic method for the identification of bamboo species. The best-fit topology obtained from the sequence-structure-based phylogeny, compared with the sequence-only one, shows closer clustering of all the studied Bambusa species (sub-tribe Bambusinae), while Melocanna baccifera, which belongs to sub-tribe Melocanneae, clustered separately as an outgroup within the consensus phylogenetic tree. In this study, we demonstrated the dependability of the combined (ITS sequence + structure) approach over the sequence-only analysis for assessing phylogenetic relationships in bamboo.
Liu, Lian; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia
2017-08-31
As a newly emerged research area, RNA epigenetics has drawn increasing attention recently for the participation of RNA methylation and other modifications in a number of crucial biological processes. Thanks to high-throughput sequencing techniques such as MeRIP-Seq, transcriptome-wide RNA methylation profiles are now available in the form of count-based data, with which it is often of interest to study the dynamics at the epitranscriptomic layer. However, the sample size of an RNA methylation experiment is usually very small due to its cost; additionally, there is usually a large number of genes whose methylation level cannot be accurately estimated due to their low expression level, making differential RNA methylation analysis a difficult task. We present QNB, a statistical approach for differential RNA methylation analysis with count-based small-sample sequencing data. Compared with previous approaches such as the DRME model, which is based on a statistical test covering only the IP samples with two negative binomial distributions, QNB is based on four independent negative binomial distributions with their variances and means linked by local regressions, and in this way, the input control samples are also properly taken care of. In addition, different from the DRME approach, which relies on the input control sample alone for estimating the background, QNB uses a more robust estimator for gene expression by combining information from both input and IP samples, which could largely improve the testing performance for very lowly expressed genes. QNB showed improved performance on both simulated and real MeRIP-Seq datasets when compared with competing algorithms. The QNB model is also applicable to other datasets related to RNA modifications, including but not limited to RNA bisulfite sequencing, m1A-Seq, PAR-CLIP, RIP-Seq, etc.
Kraková, Lucia; Šoltys, Katarína; Budiš, Jaroslav; Grivalský, Tomáš; Ďuriš, František; Pangallo, Domenico; Szemes, Tomáš
2016-09-01
Different protocols based on Illumina high-throughput DNA sequencing and denaturing gradient gel electrophoresis (DGGE)-cloning were developed and applied to the investigation of hot-spring-related samples. The study was focused on three target genes: archaeal and bacterial 16S rRNA and mcrA of methanogenic microflora. The shorter read lengths of the currently most popular Illumina sequencing technology do not allow analysis of the complete 16S rRNA region, or of longer gene fragments, as was the case with Sanger sequencing. Here, we demonstrate that there is no need for special indexed or tailed primer sets dedicated to short variable regions of 16S rRNA, since the presented approach allows the analysis of complete bacterial 16S rRNA amplicons (V1-V9) and longer archaeal 16S rRNA and mcrA sequences. A sample augmented with transposons is represented by a set of approximately 300 bp fragments that can be easily sequenced by Illumina MiSeq. Furthermore, a low proportion of chimeric sequences was observed. DGGE-cloning based strategies were performed combining semi-nested PCR, DGGE and clone library construction. When the two investigation methods were compared, a certain degree of complementarity was observed, confirming that the DGGE-cloning approach is not obsolete. Novel protocols were created for several types of laboratories, utilizing either the traditional DGGE technique or the most modern Illumina sequencing.
Su, Jiao; Zhang, Haijie; Jiang, Bingying; Zheng, Huzhi; Chai, Yaqin; Yuan, Ruo; Xiang, Yun
2011-11-15
We report an ultrasensitive electrochemical approach for the detection of a uropathogen sequence-specific DNA target. The sensing strategy involves a dual signal amplification process, which combines the signal enhancement by the enzymatic target recycling technique with the sensitivity improvement by the quantum dot (QD) layer-by-layer (LBL) assembled labels. The enzyme-based catalytic target DNA recycling process results in each target DNA sequence being used multiple times and leads to direct amplification of the analytical signal. Moreover, the LBL assembled QD labels can further enhance the sensitivity of the sensing system. The coupling of these two effective signal amplification strategies thus leads to low femtomolar (5 fM) detection of the target DNA sequences. The proposed strategy also shows excellent discrimination between the target DNA and single-base mismatch sequences. The advantageous intrinsic sequence-independent property of exonuclease III over other sequence-dependent enzymes makes our new dual signal amplification system a general sensing platform for monitoring ultralow levels of various types of target DNA sequences. Copyright © 2011 Elsevier B.V. All rights reserved.
Petri net modeling of high-order genetic systems using grammatical evolution.
Moore, Jason H; Hahn, Lance W
2003-11-01
Understanding how DNA sequence variations impact human health through a hierarchy of biochemical and physiological systems is expected to improve the diagnosis, prevention, and treatment of common, complex human diseases. We have previously developed a hierarchical dynamic systems approach based on Petri nets for generating biochemical network models that are consistent with genetic models of disease susceptibility. This modeling approach uses an evolutionary computation approach called grammatical evolution as a search strategy for optimal Petri net models. We have previously demonstrated that this approach routinely identifies biochemical network models that are consistent with a variety of genetic models in which disease susceptibility is determined by nonlinear interactions between two DNA sequence variations. In the present study, we evaluate whether the Petri net approach is capable of identifying biochemical networks that are consistent with disease susceptibility due to higher order nonlinear interactions between three DNA sequence variations. The results indicate that our model-building approach is capable of routinely identifying good, but not perfect, Petri net models. Ideas for improving the algorithm for this high-dimensional problem are presented.
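A minimal Petri net interpreter helps make the modeling substrate concrete: places hold tokens, and a transition fires when its input places hold enough tokens, consuming and producing tokens accordingly. The toy two-step "biochemical" net below is hypothetical, and the grammatical-evolution search over candidate nets described in the abstract is not shown.

```python
from typing import Dict, Tuple

# A minimal Petri net: places hold tokens, transitions consume tokens from
# input places and produce tokens in output places when enabled.
Marking = Dict[str, int]
Transition = Tuple[Dict[str, int], Dict[str, int]]   # (inputs, outputs)

def enabled(marking: Marking, transition: Transition) -> bool:
    inputs, _ = transition
    return all(marking.get(p, 0) >= w for p, w in inputs.items())

def fire(marking: Marking, transition: Transition) -> Marking:
    inputs, outputs = transition
    new = dict(marking)
    for p, w in inputs.items():
        new[p] -= w
    for p, w in outputs.items():
        new[p] = new.get(p, 0) + w
    return new

# Hypothetical toy net: substrate -> intermediate -> product.
marking = {"substrate": 3, "intermediate": 0, "product": 0}
transitions = {
    "t1": ({"substrate": 1}, {"intermediate": 1}),
    "t2": ({"intermediate": 1}, {"product": 1}),
}

# Fire the first enabled transition at each step and print the marking.
for step in range(6):
    for name, t in transitions.items():
        if enabled(marking, t):
            marking = fire(marking, t)
            print(f"step {step}: fired {name} -> {marking}")
            break
```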
NASA Astrophysics Data System (ADS)
Zhang, Wenshuai; Zeng, Xiaoyan; Zhang, Li; Peng, Haiyan; Jiao, Yongjun; Zeng, Jun; Treutlein, Herbert R.
2013-06-01
In this work, we have developed a new approach to predict the epitopes of antigens that are recognized by a specific antibody. Our method is based on the "multiple copy simultaneous search" (MCSS) approach, which identifies optimal locations of small chemical functional groups on the surface of the antibody, and then identifies sequence patterns of peptides that can bind to that surface. The identified sequence patterns are then used to search the amino-acid sequence of the antigen protein. The approach was validated by reproducing the binding epitope of the HIV gp120 envelope glycoprotein for the human neutralizing antibody, as revealed in the available crystal structure. Our method was then applied to predict the epitopes of two glycoproteins of a newly discovered bunyavirus recognized by an antibody named MAb 4-5. These predicted epitopes can be verified by experimental methods. We also discuss the involvement of different amino acids in antigen-antibody recognition based on the distributions of MCSS minima of different functional groups.
Current and future molecular diagnostics for ocular infectious diseases.
Doan, Thuy; Pinsky, Benjamin A
2016-11-01
Confirmation of ocular infections can pose great challenges to the clinician. A fundamental limitation is the small amounts of specimen that can be obtained from the eye. Molecular diagnostics can circumvent this limitation and have been shown to be more sensitive than conventional culture. The purpose of this review is to describe new molecular methods and to discuss the applications of next-generation sequencing-based approaches in the diagnosis of ocular infections. Efforts have focused on improving the sensitivity of pathogen detection using molecular methods. This review describes a new molecular target for Toxoplasma gondii-directed polymerase chain reaction assays. Molecular diagnostics for Chlamydia trachomatis and Acanthamoeba species are also discussed. Finally, we describe a hypothesis-free approach, metagenomic deep sequencing, which can detect DNA and RNA pathogens from a single specimen in one test. In some cases, this method can provide the geographic location and timing of the infection. Pathogen-directed PCRs have been powerful tools in the diagnosis of ocular infections for over 20 years. The use of next-generation sequencing-based approaches, when available, will further improve sensitivity of detection with the potential to improve patient care.
The fungal composition of natural biofinishes on oil-treated wood.
van Nieuwenhuijzen, Elke J; Houbraken, Jos A M P; Punt, Peter J; Roeselers, Guus; Adan, Olaf C G; Samson, Robert A
2017-01-01
Biofinished wood is considered to be a decorative and protective material for outdoor constructions, showing advantages compared to traditional treated wood in terms of sustainability and self-repair. Natural dark wood staining fungi are essential to biofinish formation on wood. Although all sorts of outdoor situated timber are subjected to fungal staining, the homogenous dark staining called biofinish has only been detected on specific vegetable oil-treated substrates. Revealing the fungal composition of various natural biofinishes on wood is a first step to understand and control biofinish formation for industrial application. A culture-based survey of fungi in natural biofinishes on oil-treated wood samples showed the common wood stain fungus Aureobasidium and the recently described genus Superstratomyces to be predominant constituents. A culture-independent approach, based on amplification of the internal transcribed spacer regions, cloning and Sanger sequencing, resulted in clone libraries of two types of biofinishes. Aureobasidium was present in both biofinish types, but was only predominant in biofinishes on pine sapwood treated with raw linseed oil. Most cloned sequences of the other biofinish type (pine sapwood treated with olive oil) could not be identified. In addition, a more in-depth overview of the fungal composition of biofinishes was obtained with Illumina amplicon sequencing that targeted the internal transcribed spacer region 1. All investigated samples, that varied in wood species, (oil) treatments and exposure times, contained Aureobasidium and this genus was predominant in the biofinishes on pine sapwood treated with raw linseed oil. Lapidomyces was the predominant genus in most of the other biofinishes and present in all other samples. Surprisingly, Superstratomyces, which was predominantly detected by the cultivation-based approach, could not be found with the Illumina sequencing approach, while Lapidomyces was not detected in the culture-based approach. Overall, the culture-based approach and two culture-independent methods that were used in this study revealed that natural biofinishes were composed of multiple fungal genera always containing the common wood staining mould Aureobasidium. Besides Aureobasidium, the use of other fungal genera for the production of biofinished wood has to be considered.
Unlocking hidden genomic sequence
Keith, Jonathan M.; Cochran, Duncan A. E.; Lala, Gita H.; Adams, Peter; Bryant, Darryn; Mitchelson, Keith R.
2004-01-01
Despite the success of conventional Sanger sequencing, significant regions of many genomes still present major obstacles to sequencing. Here we propose a novel approach with the potential to alleviate a wide range of sequencing difficulties. The technique involves extracting target DNA sequence from variants generated by introduction of random mutations. The introduction of mutations does not destroy original sequence information, but distributes it amongst multiple variants. Some of these variants lack problematic features of the target and are more amenable to conventional sequencing. The technique has been successfully demonstrated with mutation levels up to an average 18% base substitution and has been used to read previously intractable poly(A), AT-rich and GC-rich motifs. PMID:14973330
ElGokhy, Sherin M; ElHefnawi, Mahmoud; Shoukry, Amin
2014-05-06
MicroRNAs (miRNAs) are endogenous ∼22 nt RNAs that have been identified in many species as powerful regulators of gene expression. Experimental identification of miRNAs is still slow since miRNAs are difficult to isolate by cloning due to their low expression, low stability, tissue specificity and the high cost of the cloning procedure. Thus, computational identification of miRNAs from genomic sequences provides a valuable complement to cloning. Different approaches for the identification of miRNAs have been proposed based on homology, thermodynamic parameters, and cross-species comparisons. The present paper focuses on the integration of miRNA classifiers in a meta-classifier and the identification of miRNAs from metagenomic sequences collected from different environments. An ensemble of classifiers is proposed for miRNA hairpin prediction based on four well-known classifiers (Triplet SVM, Mipred, Virgo and EumiR) with non-identical features, which have been trained on different data. Their decisions are combined using a single hidden layer neural network to increase the accuracy of the predictions. Our ensemble classifier achieved 89.3% accuracy, 82.2% f-measure, 74% sensitivity, 97% specificity, 92.5% precision and 88.2% negative predictive value when tested on real miRNA and pseudo-sequence data. The area under the receiver operating characteristic curve of our classifier is 0.9, which represents a high performance index. The proposed classifier yields a significant performance improvement relative to Triplet-SVM, Virgo and EumiR and a minor refinement over MiPred. The developed ensemble classifier was used for miRNA prediction in mine drainage, groundwater and marine metagenomic sequences downloaded from the NCBI Sequence Read Archive. By consulting the miRBase repository, 179 miRNAs were identified as highly probable miRNAs. Our new approach could thus be used for mining metagenomic sequences and finding new and homologous miRNAs. The paper investigates a computational tool for miRNA prediction in genomic or metagenomic data. It has been applied to three metagenomic samples from different environments (mine drainage, groundwater and marine metagenomic sequences). The prediction results provide a set of highly probable miRNA hairpin candidates for cloning-based validation. Among the results obtained by the ensemble prediction are pre-miRNA candidates that have been validated using miRBase but were not recognized by some of the base classifiers.
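The sketch below shows the combination step in isolation: scores from several base classifiers are fed into a single-hidden-layer neural network that learns the final decision. The base-classifier scores here are synthetic stand-ins; in the actual ensemble they would come from Triplet-SVM, MiPred, Virgo and EumiR.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Hypothetical stand-in for the scores of four base miRNA classifiers on a
# set of candidate hairpins (real inputs would come from the tools themselves).
n = 600
labels = rng.integers(0, 2, size=n)                     # 1 = real pre-miRNA
base_scores = np.column_stack([
    labels + rng.normal(scale=s, size=n) for s in (0.4, 0.5, 0.6, 0.8)
])

X_train, X_test, y_train, y_test = train_test_split(
    base_scores, labels, test_size=0.3, random_state=0
)

# A single hidden layer combines the four base decisions into one prediction.
meta = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
meta.fit(X_train, y_train)
print(f"meta-classifier accuracy: {meta.score(X_test, y_test):.3f}")
```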
Song, Jiangning; Burrage, Kevin; Yuan, Zheng; Huber, Thomas
2006-03-09
The majority of peptide bonds in proteins are found to occur in the trans conformation. However, for proline residues, a considerable fraction of prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function. In this paper, we propose a new approach to predict proline cis/trans isomerization in proteins using a support vector machine (SVM). The preliminary results indicated that using Radial Basis Function (RBF) kernels could lead to better prediction performance than that of polynomial and linear kernel functions. We used single-sequence information with different local window sizes, amino acid compositions of different local sequences, multiple sequence alignments obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes in order to investigate their effects on the prediction performance. The training and testing of this approach were performed on a newly enlarged dataset of 2424 non-homologous proteins determined by the X-ray diffraction method, using 5-fold cross-validation. Selecting a window size of 11 provided the best performance for determining proline cis/trans isomerization based on the single amino acid sequence. It was found that using multiple sequence alignments in the form of PSI-BLAST profiles could significantly improve the prediction performance: the prediction accuracy increased from 62.8% with the single sequence to 69.8%, and the Matthews Correlation Coefficient (MCC) improved from 0.26 with the single local sequence to 0.40. Furthermore, when coupled with the secondary structure information predicted by PSIPRED, our method yielded a prediction accuracy of 71.5% and an MCC of 0.43, 9% and 0.17 higher, respectively, than the values achieved using single-sequence information. A new method has been developed to predict proline cis/trans isomerization in proteins based on a support vector machine, which used the single amino acid sequence with different local window sizes, the amino acid compositions of local sequences flanking centered proline residues, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of the SVM approach in this study reinforces the view that SVM is a powerful tool for predicting proline cis/trans isomerization in proteins and for biological sequence analysis.
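A hedged sketch of the basic setup follows: residues in a window centered on a proline are one-hot encoded and an RBF-kernel SVM is cross-validated on the labels. The windows and the toy labelling rule are fabricated so the example runs end to end; the real model additionally uses local composition, PSI-BLAST profiles and PSIPRED secondary structure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(window: str) -> np.ndarray:
    """One-hot encoding of the residues in a window centered on a proline
    (the simplest of the sequence encoding schemes discussed above)."""
    vec = np.zeros(len(window) * 20)
    for i, aa in enumerate(window):
        vec[i * 20 + AMINO_ACIDS.index(aa)] = 1.0
    return vec

def random_window() -> str:
    """Hypothetical size-11 window with the proline at the centre."""
    flank = "".join(rng.choice(list(AMINO_ACIDS), size=10))
    return flank[:5] + "P" + flank[5:]

windows = [random_window() for _ in range(300)]
# Toy labelling rule so the example has learnable signal: an aromatic residue
# immediately before the proline slightly favours the cis label (1).
labels = np.array([1 if w[4] in "FWY" else rng.integers(0, 2) for w in windows])

X = np.array([one_hot_window(w) for w in windows])
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, labels, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```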
Sturm, Marc; Quinten, Sascha; Huber, Christian G.; Kohlbacher, Oliver
2007-01-01
We propose a new model for predicting the retention time of oligonucleotides. The model is based on ν support vector regression using features derived from base sequence and predicted secondary structure of oligonucleotides. Because of the secondary structure information, the model is applicable even at relatively low temperatures where the secondary structure is not suppressed by thermal denaturing. This makes the prediction of oligonucleotide retention time for arbitrary temperatures possible, provided that the target temperature lies within the temperature range of the training data. We describe different possibilities of feature calculation from base sequence and secondary structure, present the results and compare our model to existing models. PMID:17567619
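The following sketch shows a nu-support-vector-regression model trained on simple base-composition and length features of oligonucleotides. The training data and the synthetic retention-time formula are invented for illustration, and the secondary-structure features that distinguish the published model are omitted.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
BASES = "ACGT"

def features(oligo: str) -> list:
    """Base counts and length; the published model additionally uses
    predicted secondary structure, which is omitted here."""
    return [oligo.count(b) for b in BASES] + [len(oligo)]

# Hypothetical training set: random oligonucleotides with a synthetic
# retention time that grows with length and with G/T content.
oligos = ["".join(rng.choice(list(BASES), size=rng.integers(10, 30)))
          for _ in range(400)]
y = np.array([0.5 * len(o) + 0.3 * o.count("G") + 0.2 * o.count("T")
              + rng.normal(scale=0.5) for o in oligos])
X = np.array([features(o) for o in oligos], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = NuSVR(nu=0.5, C=10.0, kernel="rbf", gamma="scale")
model.fit(X_train, y_train)
print(f"R^2 on held-out oligos: {model.score(X_test, y_test):.3f}")
```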
Mahajan, Gaurang; Mande, Shekhar C
2017-04-04
A comprehensive map of the human-M. tuberculosis (MTB) protein interactome would help fill the gaps in our understanding of the disease, and computational prediction can aid and complement experimental studies towards this end. Several sequence-based in silico approaches tap the existing data on experimentally validated protein-protein interactions (PPIs); these PPIs serve as templates from which novel interactions between pathogen and host are inferred. Such comparative approaches typically make use of local sequence alignment, which, in the absence of structural details about the interfaces mediating the template interactions, could lead to incorrect inferences, particularly when multi-domain proteins are involved. We propose leveraging the domain-domain interaction (DDI) information in PDB complexes to score and prioritize candidate PPIs between the host and pathogen proteomes based on targeted sequence-level comparisons. Our method picks out a small set of human-MTB protein pairs as candidates for physical interactions, and the use of functional meta-data suggests that some of them could contribute to the in vivo molecular cross-talk between pathogen and host that regulates the course of the infection. Further, we present numerical data for Pfam domain families that highlight interaction specificity at the domain level. Not every occurrence of a domain pair for which interaction evidence has been found in a few instances (i.e. structures) is likely to interact functionally. Our sorting approach scores candidates according to how "distant" they are in sequence space from known examples of DDIs (templates). Thus, it provides a natural way to deal with the heterogeneity in domain-level interactions. Our method represents a more informed application of local alignment to the sequence-based search for potential human-microbial interactions that uses available PPI data as a prior. Our approach is somewhat limited in its sensitivity by the restricted size and diversity of the template dataset, but, given the rapid accumulation of solved protein complex structures, its scope and utility are expected to keep steadily improving.
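A minimal sketch of DDI-template-based prioritization is given below: candidate human-MTB protein pairs are scored by the best match of any of their domain pairs to a known interacting domain pair. The Pfam-style accessions, domain assignments and template scores are hypothetical placeholders, and the targeted sequence-level comparisons of the actual method are reduced here to a simple lookup.

```python
from itertools import product

# Hypothetical domain annotations (Pfam-style accessions as placeholders).
human_domains = {
    "HUMAN_P1": {"PF00069"},
    "HUMAN_P2": {"PF00018", "PF07714"},
}
mtb_domains = {
    "MTB_Q1": {"PF00071"},
    "MTB_Q2": {"PF00017"},
}

# Hypothetical interacting domain pairs observed in structures, with an
# illustrative score reflecting closeness to the template interface
# (1.0 = identical to the template, lower = more distant).
ddi_templates = {
    frozenset({"PF00069", "PF00071"}): 0.82,
    frozenset({"PF00018", "PF00017"}): 0.64,
}

# Score every human-MTB protein pair by its best supporting DDI template.
candidates = []
for (h, h_doms), (m, m_doms) in product(human_domains.items(), mtb_domains.items()):
    best = max(
        (ddi_templates.get(frozenset({dh, dm}), 0.0)
         for dh, dm in product(h_doms, m_doms)),
        default=0.0,
    )
    if best > 0:
        candidates.append((best, h, m))

for score, h, m in sorted(candidates, reverse=True):
    print(f"{h} - {m}: template support {score:.2f}")
```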
An unsupervised approach for measuring myocardial perfusion in MR image sequences
NASA Astrophysics Data System (ADS)
Discher, Antoine; Rougon, Nicolas; Preteux, Francoise
2005-08-01
Quantitatively assessing myocardial perfusion is a key issue for the diagnosis, therapeutic planning and patient follow-up of cardio-vascular diseases. To this end, perfusion MRI (p-MRI) has emerged as a valuable clinical investigation tool thanks to its ability to dynamically image the first pass of a contrast bolus in the framework of stress/rest exams. However, reliable techniques for automatically computing regional first-pass curves from 2D short-axis cardiac p-MRI sequences remain to be elaborated. We address this problem and develop an unsupervised four-step approach comprising: (i) a coarse spatio-temporal segmentation step, which automatically detects a region of interest for the heart over the whole sequence and selects a reference frame with maximal myocardium contrast; (ii) a model-based variational segmentation step of the reference frame, yielding a bi-ventricular partition of the heart into left ventricle, right ventricle and myocardium components; (iii) a respiratory/cardiac motion artifact compensation step using a novel region-driven intensity-based non-rigid registration technique, which elastically propagates the reference bi-ventricular segmentation over the whole sequence; (iv) a measurement step, delivering first-pass curves over each region of a segmental model of the myocardium. The performance of this approach is assessed on a database of 15 normal and pathological subjects, and compared with perfusion measurements delivered by an MRI manufacturer software package based on manual delineations by a medical expert.
2012-01-01
Background Members of the phylum Proteobacteria are most prominent among bacteria causing plant diseases that result in a diminution of the quantity and quality of food produced by agriculture. To ameliorate these losses, there is a need to identify infections in early stages. Recent developments in next generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem but none have emerged as the best protocol. Here we attempt a systematic way to determine organismal origins of peptides by using a machine learning algorithm. The algorithm that we implement is a Support Vector Machine (SVM). Result The amino acid compositions of proteobacterial proteins were found to be different from those of plant proteins. We developed an SVM model based on amino acid and dipeptide compositions to distinguish between a proteobacterial protein and a plant protein. The amino acid composition (AAC)-based SVM model had an accuracy of 92.44% and a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC)-based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on unseen or untrained datasets to assess their validity. Conclusion The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences. PMID:23046503
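A minimal sketch of the feature construction described above, assuming scikit-learn is available: amino acid composition (20 features), dipeptide composition (400 features), and their hybrid are computed from a protein sequence and fed to an RBF SVM. The toy sequences and labels are placeholders, not the authors' training data.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aac(seq):
    # Amino acid composition: 20 relative frequencies.
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dc(seq):
    # Dipeptide composition: 400 relative frequencies of overlapping pairs.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = max(len(pairs), 1)
    return [pairs.count(d) / n for d in DIPEPTIDES]

def hybrid(seq):
    return aac(seq) + dc(seq)

# Toy training data; real work would use labelled proteobacterial (1)
# and plant (0) protein sequences.
seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MASSMLSSATMV"]
labels = [1, 0]

X = np.array([hybrid(s) for s in seqs])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X, labels)
print(model.predict(np.array([hybrid("MKTAYIAKQR")])))
```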
Efficient Mining of Interesting Patterns in Large Biological Sequences
Rashid, Md. Mamunur; Karim, Md. Rezaul; Jeong, Byeong-Soo
2012-01-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time. PMID:23105928
Efficient mining of interesting patterns in large biological sequences.
Rashid, Md Mamunur; Karim, Md Rezaul; Jeong, Byeong-Soo; Choi, Ho-Jin
2012-03-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.
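The measure proposed in the two records above contrasts a pattern's actual support with its prior expectation. A minimal sketch of that idea under an assumed i.i.d. background model follows; the authors' exact measure and index structure may differ.

```python
from math import prod

def expected_support(pattern, background, seq_len):
    # Expected number of occurrences of `pattern` in a sequence of length
    # `seq_len` under an i.i.d. background model of base frequencies.
    positions = seq_len - len(pattern) + 1
    p = prod(background[b] for b in pattern)
    return positions * p

def observed_support(pattern, seq):
    return sum(1 for i in range(len(seq) - len(pattern) + 1)
               if seq[i:i + len(pattern)] == pattern)

seq = "ACGTACGTTTACGTACGA"
background = {b: seq.count(b) / len(seq) for b in "ACGT"}
pattern = "ACGT"

obs = observed_support(pattern, seq)
exp = expected_support(pattern, seq, len(seq))
# Surprise ratio: a pattern is "interesting" when obs greatly exceeds exp.
print(obs, exp, obs / exp)
```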
Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes
Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.
2012-01-01
Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300
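A minimal sketch of a naïve Bayesian sequence classifier in the spirit of the RDP-style tool described above, using k-mer counts with add-one smoothing; the word size, taxa, and training sequences here are illustrative only.

```python
from collections import Counter
from itertools import product
from math import log

K = 4  # word size; the RDP-style classifier uses 8-mers on longer reads

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(training_data):
    # training_data: list of (taxon, sequence). Returns per-taxon k-mer
    # log-probabilities with add-one smoothing.
    vocab = ["".join(p) for p in product("ACGT", repeat=K)]
    counts_by_taxon = {}
    for taxon, seq in training_data:
        counts_by_taxon.setdefault(taxon, Counter()).update(kmers(seq))
    logprobs = {}
    for taxon, counts in counts_by_taxon.items():
        total = sum(counts.values()) + len(vocab)
        logprobs[taxon] = {w: log((counts[w] + 1) / total) for w in vocab}
    return logprobs

def classify(seq, logprobs):
    scores = {t: sum(lp[w] for w in kmers(seq) if w in lp)
              for t, lp in logprobs.items()}
    return max(scores, key=scores.get)

model = train([("Ascomycota", "ACGTGCATACGTGGGTAC"),
               ("Basidiomycota", "TTGACCTTGAAGGGTTTT")])
print(classify("ACGTGCATAC", model))
```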
Registration of retinal sequences from new video-ophthalmoscopic camera.
Kolar, Radim; Tornow, Ralf P; Odstrcilik, Jan; Liberdova, Ivana
2016-05-20
Analysis of fast temporal changes on retinas has become an important part of diagnostic video-ophthalmology. It enables investigation of the hemodynamic processes in retinal tissue, e.g. blood-vessel diameter changes as a result of blood-pressure variation, spontaneous venous pulsation influenced by intracranial-intraocular pressure difference, blood-volume changes as a result of changes in light reflection from retinal tissue, and blood flow using laser speckle contrast imaging. For such applications, image registration of the recorded sequence must be performed. Here we use a new non-mydriatic video-ophthalmoscope for simple and fast acquisition of low SNR retinal sequences. We introduce a novel, two-step approach for fast image registration. The phase correlation in the first stage removes large eye movements. Lucas-Kanade tracking in the second stage removes small eye movements. We propose robust adaptive selection of the tracking points, which is the most important part of tracking-based approaches. We also describe a method for quantitative evaluation of the registration results, based on vascular tree intensity profiles. The achieved registration error evaluated on 23 sequences (5840 frames) is 0.78 ± 0.67 pixels inside the optic disc and 1.39 ± 0.63 pixels outside the optic disc. We compared the results with the commonly used approaches based on Lucas-Kanade tracking and scale-invariant feature transform, which achieved worse results. The proposed method can efficiently correct particular frames of retinal sequences for shift and rotation. The registration results for each frame (shift in X and Y direction and eye rotation) can also be used for eye-movement evaluation during single-spot fixation tasks.
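A compact sketch of the two-step registration idea, assuming OpenCV and grayscale uint8 frames; the paper's point-selection and evaluation procedures are not reproduced, and the sign convention of the phase-correlation shift should be checked against the data.

```python
import cv2
import numpy as np

def register_frame(reference, frame):
    """Two-step registration sketch: global shift by phase correlation,
    then residual motion from Lucas-Kanade tracking of corner points."""
    ref32 = np.float32(reference)
    frm32 = np.float32(frame)

    # Step 1: phase correlation removes the large translational component.
    (dx, dy), _ = cv2.phaseCorrelate(ref32, frm32)
    M = np.float32([[1, 0, -dx], [0, 1, -dy]])  # translate back by the detected shift
    shifted = cv2.warpAffine(frame, M, (frame.shape[1], frame.shape[0]))

    # Step 2: Lucas-Kanade tracking of automatically selected points
    # estimates the remaining small shift and rotation.
    pts = cv2.goodFeaturesToTrack(reference, maxCorners=100,
                                  qualityLevel=0.01, minDistance=10)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(reference, shifted, pts, None)
    good_old = pts[status.ravel() == 1]
    good_new = new_pts[status.ravel() == 1]
    # Fit a similarity transform (shift + rotation) to the tracked points.
    T, _ = cv2.estimateAffinePartial2D(good_new, good_old)
    return cv2.warpAffine(shifted, T, (frame.shape[1], frame.shape[0]))
```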
Fast and accurate phylogeny reconstruction using filtered spaced-word matches
Sohrabi-Jahromi, Salma; Morgenstern, Burkhard
2017-01-01
Abstract Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. Availability and Implementation: The program source code for FSWM including documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ Contact: chris.leimeister@stud.uni-goettingen.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28073754
Fast and accurate phylogeny reconstruction using filtered spaced-word matches.
Leimeister, Chris-André; Sohrabi-Jahromi, Salma; Morgenstern, Burkhard
2017-04-01
Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, is freely available at http://fswm.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
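A minimal sketch of spaced-word matching with a filtering step, assuming a toy binary pattern and a simple ±1 score over the don't-care positions; FSWM's actual pattern length, scoring and distance estimation are more involved.

```python
from collections import defaultdict

PATTERN = "1100101"  # '1' = match position, '0' = don't-care position

def spaced_words(seq, pattern=PATTERN):
    """Index each window by its spaced word: the characters at the match
    positions only. Don't-care positions are compared later."""
    index = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        word = "".join(c for c, p in zip(window, pattern) if p == "1")
        index[word].append(i)
    return index

def spaced_word_matches(seq1, seq2, pattern=PATTERN, min_score=0):
    idx1, idx2 = spaced_words(seq1, pattern), spaced_words(seq2, pattern)
    dc_positions = [k for k, p in enumerate(pattern) if p == "0"]
    matches = []
    for word in idx1.keys() & idx2.keys():
        for i in idx1[word]:
            for j in idx2[word]:
                # Filtering step: score the don't-care positions and discard
                # low-scoring (likely spurious) matches.
                score = sum(1 if seq1[i + k] == seq2[j + k] else -1
                            for k in dc_positions)
                if score >= min_score:
                    matches.append((i, j, score))
    return matches

print(spaced_word_matches("ACGTACGTGA", "ACGTTCGTGA"))
```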
Unified Deep Learning Architecture for Modeling Biology Sequence.
Wu, Hongjie; Cao, Chengyuan; Xia, Xiaoyan; Lu, Qiang
2017-10-09
Prediction of the spatial structure or function of biological macromolecules based on their sequence remains an important challenge in bioinformatics. When modeling biological sequences with traditional sequence models, characteristics such as long-range interactions between basic units, the complicated and variable output of labeled structures, and the variable length of biological sequences usually lead to different solutions on a case-by-case basis. This study proposed bidirectional recurrent neural networks based on long short-term memory or gated recurrent units to capture long-range interactions, an optional reshape operator to adapt to the diversity of the output labels, and a training algorithm that supports sequence models capable of processing variable-length sequences. Additionally, merge and pooling operators enhanced the ability to capture short-range interactions between basic units of biological sequences. The proposed deep-learning model and its training algorithm might be capable of solving currently known biological sequence-modeling problems through the use of a unified framework. We validated our model on one of the most difficult biological sequence-modeling problems currently known, with our results indicating that the model obtains predictions of protein residue interactions that exceed the accuracy of current popular approaches by 10% across multiple benchmarks.
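A minimal sketch of the kind of architecture described above, assuming PyTorch: a bidirectional LSTM over packed variable-length sequences with a per-position output layer. Dimensions, vocabulary, and the paper's reshape/merge/pooling operators are not modeled here.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTMTagger(nn.Module):
    """Minimal bidirectional LSTM for variable-length biological sequences,
    emitting one label per residue (e.g. a per-position structural state)."""
    def __init__(self, vocab_size=25, embed_dim=16, hidden=32, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, tokens, lengths):
        x = self.embed(tokens)
        packed = pack_padded_sequence(x, lengths, batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
        h, _ = pad_packed_sequence(packed_out, batch_first=True)
        return self.out(h)  # (batch, max_len, n_labels)

# Two toy sequences of different lengths, padded with 0.
tokens = torch.tensor([[1, 2, 3, 4, 0], [5, 6, 7, 8, 9]])
lengths = torch.tensor([4, 5])
logits = BiLSTMTagger()(tokens, lengths)
print(logits.shape)  # torch.Size([2, 5, 3])
```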
Di Pierro, Erica A; Gianfranceschi, Luca; Di Guardo, Mario; Koehorst-van Putten, Herma Jj; Kruisselbrink, Johannes W; Longhi, Sara; Troggio, Michela; Bianco, Luca; Muranty, Hélène; Pagliarani, Giulia; Tartarini, Stefano; Letschka, Thomas; Lozano Luis, Lidia; Garkava-Gustavsson, Larisa; Micheletti, Diego; Bink, Marco Cam; Voorrips, Roeland E; Aziz, Ebrahimi; Velasco, Riccardo; Laurens, François; van de Weg, W Eric
2016-01-01
Quantitative trait loci (QTL) mapping approaches rely on the correct ordering of molecular markers along the chromosomes, which can be obtained from genetic linkage maps or a reference genome sequence. For apple (Malus domestica Borkh), the genome sequences v1 and v2 could not meet this need; therefore, a novel approach was devised to develop a dense genetic linkage map, providing the most reliable marker-loci order for the highest possible number of markers. The approach was based on four strategies: (i) the use of multiple full-sib families, (ii) the reduction of missing information through the use of HaploBlocks and alternative calling procedures for single-nucleotide polymorphism (SNP) markers, (iii) the construction of a single backcross-type data set including all families, and (iv) a two-step map generation procedure based on the sequential inclusion of markers. The map comprises 15 417 SNP markers, clustered in 3 K HaploBlock markers spanning 1 267 cM, with an average distance between adjacent markers of 0.37 cM and a maximum distance of 3.29 cM. Moreover, chromosome 5 was oriented according to its homoeologous chromosome 10. This map was useful to improve the apple genome sequence, design the Axiom Apple 480 K SNP array and perform multifamily-based QTL studies. Its collinearity with the genome sequences v1 and v3 is reported. To our knowledge, this is the shortest published SNP map in apple, while including the largest number of markers, families and individuals. This result validates our methodology, proving its value for the construction of integrated linkage maps for any outbreeding species.
Di Pierro, Erica A; Gianfranceschi, Luca; Di Guardo, Mario; Koehorst-van Putten, Herma JJ; Kruisselbrink, Johannes W; Longhi, Sara; Troggio, Michela; Bianco, Luca; Muranty, Hélène; Pagliarani, Giulia; Tartarini, Stefano; Letschka, Thomas; Lozano Luis, Lidia; Garkava-Gustavsson, Larisa; Micheletti, Diego; Bink, Marco CAM; Voorrips, Roeland E; Aziz, Ebrahimi; Velasco, Riccardo; Laurens, François; van de Weg, W Eric
2016-01-01
Quantitative trait loci (QTL) mapping approaches rely on the correct ordering of molecular markers along the chromosomes, which can be obtained from genetic linkage maps or a reference genome sequence. For apple (Malus domestica Borkh), the genome sequences v1 and v2 could not meet this need; therefore, a novel approach was devised to develop a dense genetic linkage map, providing the most reliable marker-loci order for the highest possible number of markers. The approach was based on four strategies: (i) the use of multiple full-sib families, (ii) the reduction of missing information through the use of HaploBlocks and alternative calling procedures for single-nucleotide polymorphism (SNP) markers, (iii) the construction of a single backcross-type data set including all families, and (iv) a two-step map generation procedure based on the sequential inclusion of markers. The map comprises 15 417 SNP markers, clustered in 3 K HaploBlock markers spanning 1 267 cM, with an average distance between adjacent markers of 0.37 cM and a maximum distance of 3.29 cM. Moreover, chromosome 5 was oriented according to its homoeologous chromosome 10. This map was useful to improve the apple genome sequence, design the Axiom Apple 480 K SNP array and perform multifamily-based QTL studies. Its collinearity with the genome sequences v1 and v3 is reported. To our knowledge, this is the shortest published SNP map in apple, while including the largest number of markers, families and individuals. This result validates our methodology, proving its value for the construction of integrated linkage maps for any outbreeding species. PMID:27917289
Biomarker Discovery and Mechanistic Studies of Prostate Cancer using Targeted Proteomic Approaches
2012-07-01
basigin in Drosophila) tightly regulates cytoskeleton rearrangement in Drosophila melanogaster [23]. Based on the present results and the existing...from OligoEngine according to the manufacturer's instructions. Plasmids were amplified in DH5a cells and confirmed by sequencing. Subconfluent cell...electrophoresis and the results are shown in Figure 1 (Panel C). The RT-PCR products were cloned and subjected to DNA sequencing. The sequencing
Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns
2007-01-01
We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882
TaxI: a software tool for DNA barcoding using distance methods
Steinke, Dirk; Vences, Miguel; Salzburger, Walter; Meyer, Axel
2005-01-01
DNA barcoding is a promising approach to the diagnosis of biological diversity in which DNA sequences serve as the primary key for information retrieval. Most existing software for evolutionary analysis of DNA sequences was designed for phylogenetic analyses and, hence, those algorithms do not offer appropriate solutions for the rapid, but precise analyses needed for DNA barcoding, and are also unable to process the often large comparative datasets. We developed a flexible software tool for DNA taxonomy, named TaxI. This program calculates sequence divergences between a query sequence (taxon to be barcoded) and each sequence of a dataset of reference sequences defined by the user. Because the analysis is based on separate pairwise alignments this software is also able to work with sequences characterized by multiple insertions and deletions that are difficult to align in large sequence sets (i.e. thousands of sequences) by multiple alignment algorithms because of computational restrictions. Here, we demonstrate the utility of this approach with two datasets of fish larvae and juveniles from Lake Constance and juvenile land snails under different models of sequence evolution. Sets of ribosomal 16S rRNA sequences, characterized by multiple indels, performed as good as or better than cox1 sequence sets in assigning sequences to species, demonstrating the suitability of rRNA genes for DNA barcoding. PMID:16214755
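A minimal sketch of the distance-based assignment idea, assuming pre-aligned query and reference sequences and an uncorrected p-distance; TaxI itself performs separate pairwise alignments and supports additional models of sequence evolution. The reference entries below are illustrative placeholders.

```python
def p_distance(a, b):
    # Uncorrected pairwise distance over an existing pairwise alignment,
    # counting only positions where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)

def assign(query, references):
    # references: {species_name: aligned_sequence}; returns the species of
    # the closest reference together with its distance to the query.
    best = min(references, key=lambda sp: p_distance(query, references[sp]))
    return best, p_distance(query, references[best])

refs = {"Coregonus lavaretus": "ACGT-ACGTTGCA",   # illustrative entries
        "Perca fluviatilis":   "ACGTTACGATGCA"}
print(assign("ACGT-ACGTTGCG", refs))
```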
Wang, Anqi; Wang, Zhanyu; Li, Zheng; Li, Lei M
2018-06-15
It is highly desirable to assemble genomes of high continuity and consistency at low cost. The current bottleneck of draft genome continuity using second generation sequencing (SGS) reads is primarily caused by uncertainty among repetitive sequences. Even though single-molecule real-time sequencing technology is very promising for overcoming the uncertainty issue, its relatively high cost and error rate add to the budget and computational burden. Many long-read assemblers take the overlap-layout-consensus (OLC) paradigm, which is less sensitive to sequencing errors, heterozygosity and variability of coverage. However, current assemblers of SGS data do not sufficiently take advantage of the OLC approach. Aiming to minimize uncertainty, the proposed method, BAUM, breaks the whole genome into regions by adaptive unique mapping; local OLC is then used to assemble each region in parallel. BAUM can (i) perform reference-assisted assembly based on the genome of a closely related species or (ii) improve the results of existing assemblies obtained from short or long sequencing reads. Tests on two eukaryote genomes, the wild rice Oryza longistaminata and the parrot Melopsittacus undulatus, show that BAUM achieved substantial improvements in genome size and continuity. In addition, BAUM reconstructed a considerable number of repetitive regions that existing short-read assemblers failed to assemble. We also propose statistical approaches to control the uncertainty in different steps of BAUM. http://www.zhanyuwang.xin/wordpress/index.php/2017/07/21/baum. Supplementary data are available at Bioinformatics online.
Progress of targeted genome modification approaches in higher plants.
Cardi, Teodoro; Neal Stewart, C
2016-07-01
Transgene integration in plants is based on illegitimate recombination between non-homologous sequences. The low control of integration site and number of (trans/cis)gene copies might have negative consequences on the expression of transferred genes and their insertion within endogenous coding sequences. The first experiments conducted to use precise homologous recombination for gene integration commenced soon after the first demonstration that transgenic plants could be produced. Modern transgene targeting categories used in plant biology are: (a) homologous recombination-dependent gene targeting; (b) recombinase-mediated site-specific gene integration; (c) oligonucleotide-directed mutagenesis; (d) nuclease-mediated site-specific genome modifications. New tools enable precise gene replacement or stacking with exogenous sequences and targeted mutagenesis of endogeneous sequences. The possibility to engineer chimeric designer nucleases, which are able to target virtually any genomic site, and use them for inducing double-strand breaks in host DNA create new opportunities for both applied plant breeding and functional genomics. CRISPR is the most recent technology available for precise genome editing. Its rapid adoption in biological research is based on its inherent simplicity and efficacy. Its utilization, however, depends on available sequence information, especially for genome-wide analysis. We will review the approaches used for genome modification, specifically those for affecting gene integration and modification in higher plants. For each approach, the advantages and limitations will be noted. We also will speculate on how their actual commercial development and implementation in plant breeding will be affected by governmental regulations.
Elements of Mathematics, Book O: Intuitive Background. Chapter 1, Operational Systems.
ERIC Educational Resources Information Center
Exner, Robert; And Others
The sixteen chapters of this book provide the core material for the Elements of Mathematics Program, a secondary sequence developed for highly motivated students with strong verbal abilities. The sequence is based on a functional-relational approach to mathematics teaching, and emphasizes teaching by analysis of real-life situations. This text is…
Comparison between genotyping by sequencing and SNP-chip genotyping in QTL mapping in wheat
USDA-ARS?s Scientific Manuscript database
Array- or chip-based single nucleotide polymorphism (SNP) markers are widely used in genomic studies because of their abundance in a genome and their lower cost per data point compared to older marker technologies. Genotyping by sequencing (GBS), a relatively newer genotyping approach, suggests equal or...
Detection of distorted frames in retinal video-sequences via machine learning
NASA Astrophysics Data System (ADS)
Kolar, Radim; Liberdova, Ivana; Odstrcilik, Jan; Hracho, Michal; Tornow, Ralf P.
2017-07-01
This paper describes the detection of distorted frames in retinal sequences based on a set of global features extracted from each frame. The feature vector is subsequently used in a classification step, in which three types of classifiers are tested. The best classification accuracy of 96% was achieved with the support vector machine approach.
Elements of Mathematics, Book O: Intuitive Background. Chapter 5, Mappings.
ERIC Educational Resources Information Center
Exner, Robert; And Others
The sixteen chapters of this book provide the core material for the Elements of Mathematics Program, a secondary sequence developed for highly motivated students with strong verbal abilities. The sequence is based on a functional-relational approach to mathematics teaching, and emphasizes teaching by analysis of real-life situations. This text is…
USDA-ARS?s Scientific Manuscript database
In recent years, next generation sequencing (NGS) based bulked segregant analysis (BSA) has become a powerful approach for allele discovery in non-model plant species. However, challenges remain, particularly for out-crossing species with complex genomes. Here, the genetic control of a weeping bran...
Stage Sequences of Adolescent Substance Use: A Prospective Longitudinal Approach.
ERIC Educational Resources Information Center
Collins, Linda M; And Others
Past research based primarily on cross-sectional data, has suggested a Guttman scale of substance use onset where marijuana use is preceded by alcohol use, and where sometimes tobacco is placed between alcohol and marijuana. Prospective longitudinal data is needed to determine whether the stages form a temporal sequence. A new statistical…
A Teaching--Learning Sequence on Free Fall Motion
ERIC Educational Resources Information Center
Borghi, L.; De Ambrosis, A.; Lamberti, N.; Mascheretti, P.
2005-01-01
A teaching--learning sequence is presented that is designed to help high school pupils gain awareness about the independence of the vertical and horizontal components of free fall motion. The approach we propose is based on the use of experimental activities and computer simulations designed specifically to help pupils reflect on the experiments…
Elements of Mathematics, Book O: Intuitive Background. Chapter 2, The Integers.
ERIC Educational Resources Information Center
Exner, Robert; And Others
The sixteen chapters of this book provide the core materials for the Elements of Mathematics Program, a secondary sequence developed for highly motivated students with strong verbal abilities. The sequence is based on a functional-relational approach to mathematics teaching, and emphasizes teaching by analysis of real-life situations. This text is…
Teaching the Process of Molecular Phylogeny and Systematics: A Multi-Part Inquiry-Based Exercise
ERIC Educational Resources Information Center
Lents, Nathan H.; Cifuentes, Oscar E.; Carpi, Anthony
2010-01-01
Three approaches to molecular phylogenetics are demonstrated to biology students as they explore molecular data from "Homo sapiens" and four related primates. By analyzing DNA sequences, protein sequences, and chromosomal maps, students are repeatedly challenged to develop hypotheses regarding the ancestry of the five species. Although…
Detection of Splice Sites Using Support Vector Machine
NASA Astrophysics Data System (ADS)
Varadwaj, Pritish; Purohit, Neetesh; Arora, Bhumika
Automatic identification and annotation of the exon and intron regions of genes from DNA sequences has been an important research area in the field of computational biology. Several approaches, viz. Hidden Markov Models (HMM), Artificial Intelligence (AI)-based machine learning, and Digital Signal Processing (DSP) techniques, have been used extensively and independently by various researchers to address this challenging task. In this work, we propose a Support Vector Machine (SVM)-based kernel learning approach for the detection of splice sites (exon-intron boundaries) in a gene. Electron-Ion Interaction Potential (EIIP) values of nucleotides are used to map character sequences to corresponding numeric sequences. A Radial Basis Function (RBF) SVM kernel is trained on the EIIP numeric sequences. The trained model was then tested on a gene dataset by shifting a 12-residue window to detect splice sites. The window size and other important SVM kernel parameters were tuned for better accuracy. Receiver Operating Characteristic (ROC) curves were used to display the sensitivity of the classifier, and the results showed 94.82% accuracy for splice site detection on the test dataset.
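A minimal sketch of the described pipeline, assuming scikit-learn: nucleotides are mapped to EIIP values, a 12-residue window is encoded numerically, an RBF SVM is trained on toy labelled windows, and the window is slid along a test sequence. The training windows here are placeholders, not real annotated splice sites.

```python
import numpy as np
from sklearn.svm import SVC

# Electron-Ion Interaction Potential values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
WINDOW = 12  # residues per candidate window, as in the described setup

def encode(seq):
    return [EIIP[b] for b in seq.upper()]

# Toy training windows labelled 1 (putative splice site) or 0 (non-site);
# real training data would come from annotated gene sequences.
windows = ["ACGGTAAGTACG", "TTTTCCCCAAAA", "CAGGTAAGTCCT", "GGGGGGGGGGGG"]
labels = [1, 0, 1, 0]

X = np.array([encode(w) for w in windows])
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, labels)

# Slide a 12-residue window along a test sequence and flag predicted sites.
test = "ATACGGTAAGTACGTTT"
for i in range(len(test) - WINDOW + 1):
    w = test[i:i + WINDOW]
    if clf.predict(np.array([encode(w)]))[0] == 1:
        print(i, w)
```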
Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S
2015-09-01
The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on the edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding on an appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. © 2015 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S
2015-01-01
The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on the edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding on an appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. PMID:26073648
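A minimal sketch of building similarity-based subnetworks at a chosen edge threshold, assuming networkx; the edge weights would in practice come from BLAST, TM-Align, or DASP active-site profile comparisons, and the protein names here are placeholders.

```python
import networkx as nx

def subnetworks(edges, threshold):
    """Build a protein similarity network keeping only edges whose
    similarity meets the threshold, then return its connected components
    as clusters. `edges` is a list of (protein_a, protein_b, similarity)."""
    g = nx.Graph()
    for a, b, sim in edges:
        g.add_node(a)
        g.add_node(b)
        if sim >= threshold:
            g.add_edge(a, b, weight=sim)
    return [sorted(c) for c in nx.connected_components(g)]

# Toy edge list with placeholder protein names and similarity scores.
edges = [("enoA", "enoB", 0.92), ("enoB", "enoC", 0.88), ("prxA", "enoA", 0.35)]
for thr in (0.3, 0.8):
    print(thr, subnetworks(edges, thr))
```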
Pérez-Porro, A R; Navarro-Gómez, D; Uriz, M J; Giribet, G
2013-05-01
Sponges can be dominant organisms in many marine and freshwater habitats where they play essential ecological roles. They also represent a key group to address important questions in early metazoan evolution. Recent approaches for improving knowledge on sponge biological and ecological functions as well as on animal evolution have focused on the genetic toolkits involved in ecological responses to environmental changes (biotic and abiotic), development and reproduction. These approaches are possible thanks to newly available, massive sequencing technologies, such as the Illumina platform, which facilitate genome and transcriptome sequencing in a cost-effective manner. Here we present the first NGS (next-generation sequencing) approach to understanding the life cycle of an encrusting marine sponge. For this we sequenced libraries of three different life cycle stages of the Mediterranean sponge Crella elegans and generated de novo transcriptome assemblies. Three assemblies were based on sponge tissue of a particular life cycle stage, including non-reproductive tissue, tissue with sperm cysts and tissue with larvae. The fourth assembly pooled the data from all three stages. By aggregating data from all the different life cycle stages we obtained a higher total number of contigs, contigs with a BLAST hit and annotated contigs than from the single-stage assemblies. In that multi-stage assembly we obtained a larger number of the developmental regulatory genes known for metazoans than in any other assembly. We also advance the differential expression of selected genes in the three life cycle stages to explore the potential of RNA-seq for improving knowledge on functional processes along the sponge life cycle. © 2013 Blackwell Publishing Ltd.
Thiel, William H.; Bair, Thomas; Peek, Andrew S.; Liu, Xiuying; Dassie, Justin; Stockdale, Katie R.; Behlke, Mark A.; Miller, Francis J.; Giangrande, Paloma H.
2012-01-01
Background The broad applicability of RNA aptamers as cell-specific delivery tools for therapeutic reagents depends on the ability to identify aptamer sequences that selectively access the cytoplasm of distinct cell types. Towards this end, we have developed a novel approach that combines a cell-based selection method (cell-internalization SELEX) with high-throughput sequencing (HTS) and bioinformatics analyses to rapidly identify cell-specific, internalization-competent RNA aptamers. Methodology/Principal Findings We demonstrate the utility of this approach by enriching for RNA aptamers capable of selective internalization into vascular smooth muscle cells (VSMCs). Several rounds of positive (VSMCs) and negative (endothelial cells; ECs) selection were performed to enrich for aptamer sequences that preferentially internalize into VSMCs. To identify candidate RNA aptamer sequences, HTS data from each round of selection were analyzed using bioinformatics methods: (1) metrics of selection enrichment; and (2) pairwise comparisons of sequence and structural similarity, termed edit and tree distance, respectively. Correlation analyses of experimentally validated aptamers or rounds revealed that the best cell-specific, internalizing aptamers are enriched as a result of the negative selection step performed against ECs. Conclusions and Significance We describe a novel approach that combines cell-internalization SELEX with HTS and bioinformatics analysis to identify cell-specific, cell-internalizing RNA aptamers. Our data highlight the importance of performing a pre-clear step against a non-target cell in order to select for cell-specific aptamers. We expect the extended use of this approach to enable the identification of aptamers to a multitude of different cell types, thereby facilitating the broad development of targeted cell therapies. PMID:22962591
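A minimal sketch of two of the bioinformatics metrics mentioned above: per-sequence enrichment between selection rounds and pairwise edit distance. The pseudocount and the toy counts are illustrative, and tree distance (structural similarity) is not implemented here.

```python
def enrichment(counts_round_n, counts_round_prev):
    # Fold-change enrichment of each aptamer between selection rounds
    # (pseudocount of 1 avoids division by zero for newly seen sequences).
    total_n = sum(counts_round_n.values())
    total_p = sum(counts_round_prev.values())
    return {s: ((counts_round_n[s] / total_n) /
                ((counts_round_prev.get(s, 0) + 1) / (total_p + 1)))
            for s in counts_round_n}

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance ("edit distance"
    # in the text), computed row by row with O(len(b)) memory.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

round5 = {"GGAUCCGUAC": 120, "GGAUCCGUAA": 30}   # toy read counts
round4 = {"GGAUCCGUAC": 15, "GGAUCCGUAA": 25}
print(enrichment(round5, round4))
print(edit_distance("GGAUCCGUAC", "GGAUCCGUAA"))
```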
Requirements-Based Conformance Testing of ARINC 653 Real-Time Operating Systems
NASA Astrophysics Data System (ADS)
Maksimov, Andrey
2010-08-01
Requirements-based testing is emphasized in avionics certification documents because this strategy has been found to be the most effective at revealing errors. This paper describes a unified requirements-based approach to the creation of conformance test suites for mission-critical systems. The approach uses formal machine-readable specifications of requirements and a finite state machine model for on-the-fly test sequence generation. The paper also presents a test system, built on this approach, for automated test generation for ARINC 653 services. Possible application of the presented approach to various areas of avionics embedded systems testing is discussed.
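A minimal sketch of generating conformance test sequences from a finite state machine by transition coverage; the states and operation labels below are illustrative and do not reproduce the actual ARINC 653 process state model or the paper's on-the-fly generator.

```python
from collections import deque

# A toy finite state machine for an operating-system service: states and
# labelled transitions (the labels stand in for API calls under test).
FSM = {
    "DORMANT": {"CREATE_PROCESS": "READY"},
    "READY": {"START": "RUNNING", "SUSPEND": "SUSPENDED"},
    "RUNNING": {"SUSPEND": "SUSPENDED", "STOP": "DORMANT"},
    "SUSPENDED": {"RESUME": "READY"},
}

def transition_tours(fsm, start):
    """Generate one test sequence per transition: the shortest call
    sequence from the start state that exercises that transition."""
    # Breadth-first search gives shortest call prefixes to every state.
    prefix = {start: []}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for label, t in fsm[s].items():
            if t not in prefix:
                prefix[t] = prefix[s] + [label]
                queue.append(t)
    tests = []
    for s, edges in fsm.items():
        for label in edges:
            tests.append(prefix[s] + [label])
    return tests

for seq in transition_tours(FSM, "DORMANT"):
    print(" -> ".join(seq))
```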
NASA Technical Reports Server (NTRS)
Wheeler, Ward C.
2003-01-01
The problem of determining the minimum cost hypothetical ancestral sequences for a given cladogram is known to be NP-complete (Wang and Jiang, 1994). Traditionally, point estimations of hypothetical ancestral sequences have been used to gain heuristic, upper bounds on cladogram cost. These include procedures with such diverse approaches as non-additive optimization of multiple sequence alignment, direct optimization (Wheeler, 1996), and fixed-state character optimization (Wheeler, 1999). A method is proposed here which, by extending fixed-state character optimization, replaces the estimation process with a search. This form of optimization examines a diversity of potential state solutions for cost-efficient hypothetical ancestral sequences and can result in greatly more parsimonious cladograms. Additionally, such an approach can be applied to other NP-complete phylogenetic optimization problems such as genomic break-point analysis. © 2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
Defining the wheat gluten peptide fingerprint via a discovery and targeted proteomics approach.
Martínez-Esteso, María José; Nørgaard, Jørgen; Brohée, Marcel; Haraszi, Reka; Maquet, Alain; O'Connor, Gavin
2016-09-16
Accurate, reliable and sensitive detection methods for gluten are required to support current EU regulations. The enforcement of legislative levels requires that measurement results are comparable over time and between methods. This is not a trivial task for gluten which comprises a large number of protein targets. This paper describes a strategy for defining a set of specific analytical targets for wheat gluten. A comprehensive proteomic approach was applied by fractionating wheat gluten using RP-HPLC (reversed phase high performance liquid chromatography) followed by a multi-enzymatic digestion (LysC, trypsin and chymotrypsin) with subsequent mass spectrometric analysis. This approach identified 434 peptide sequences from gluten. Peptides were grouped based on two criteria: unique to a single gluten protein sequence; contained known immunogenic and toxic sequences in the context of coeliac disease. An LC-MS/MS method based on selected reaction monitoring (SRM) was developed on a triple quadrupole mass spectrometer for the specific detection of the target peptides. The SRM based screening approach was applied to gluten containing cereals (wheat, rye, barley and oats) and non-gluten containing flours (corn, soy and rice). A unique set of wheat gluten marker peptides were identified and are proposed as wheat specific markers. The measurement of gluten in processed food products in support of regulatory limits is performed routinely. Mass spectrometry is emerging as a viable alternative to ELISA based methods. Here we outline a set of peptide markers that are representative of gluten and consider the end user's needs in protecting those with coeliac disease. The approach taken has been applied to wheat but can be easily extended to include other species potentially enabling the MS quantification of different gluten containing species from the identified markers. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare . However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable. Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare. However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable. Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes. PMID:29250096
Tank, David C.
2016-01-01
Advances in high-throughput sequencing (HTS) have allowed researchers to obtain large amounts of biological sequence information at speeds and costs unimaginable only a decade ago. Phylogenetics, and the study of evolution in general, is quickly migrating towards using HTS to generate larger and more complex molecular datasets. In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses. The approach uses the Fluidigm Access Array System (Fluidigm, San Francisco, CA, USA) and two sets of PCR primers to simultaneously amplify 48 target regions across 48 samples, incorporating sample-specific barcodes and HTS adapters (2,304 unique amplicons per Access Array). The final product is a pooled set of amplicons ready to be sequenced, and thus, there is no need to construct separate, costly genomic libraries for each sample. Further, we present a bioinformatics pipeline to process the raw HTS reads to either generate consensus sequences (with or without ambiguities) for every locus in every sample or—more importantly—recover the separate alleles from heterozygous target regions in each sample. This is important because it adds allelic information that is well suited for coalescent-based phylogenetic analyses that are becoming very common in conservation and evolutionary biology. To test our approach and bioinformatics pipeline, we sequenced 576 samples across 96 target regions belonging to the South American clade of the genus Bartsia L. in the plant family Orobanchaceae. After sequencing cleanup and alignment, the experiment resulted in ~25,300bp across 486 samples for a set of 48 primer pairs targeting the plastome, and ~13,500bp for 363 samples for a set of primers targeting regions in the nuclear genome. Finally, we constructed a combined concatenated matrix from all 96 primer combinations, resulting in a combined aligned length of ~40,500bp for 349 samples. PMID:26828929
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cao Daliang; Earl, Matthew A.; Luan, Shuang
2006-04-15
A new leaf-sequencing approach has been developed that is designed to reduce the number of required beam segments for step-and-shoot intensity modulated radiation therapy (IMRT). This approach to leaf sequencing is called continuous-intensity-map-optimization (CIMO). Using a simulated annealing algorithm, CIMO seeks to minimize differences between the optimized and sequenced intensity maps. Two distinguishing features of the CIMO algorithm are (1) CIMO does not require that each optimized intensity map be clustered into discrete levels and (2) CIMO is not rule-based but rather simultaneously optimizes both the aperture shapes and weights. To test the CIMO algorithm, ten IMRT patient cases were selected (four head-and-neck, two pancreas, two prostate, one brain, and one pelvis). For each case, the optimized intensity maps were extracted from the Pinnacle³ treatment planning system. The CIMO algorithm was applied, and the optimized aperture shapes and weights were loaded back into Pinnacle. A final dose calculation was performed using Pinnacle's convolution/superposition based dose calculation. On average, the CIMO algorithm provided a 54% reduction in the number of beam segments as compared with Pinnacle's leaf sequencer. The plans sequenced using the CIMO algorithm also provided improved target dose uniformity and a reduced discrepancy between the optimized and sequenced intensity maps. For ten clinical intensity maps, comparisons were performed between the CIMO algorithm and the power-of-two reduction algorithm of Xia and Verhey [Med. Phys. 25(8), 1424-1434 (1998)]. When the constraints of a Varian Millennium multileaf collimator were applied, the CIMO algorithm resulted in a 26% reduction in the number of segments. For an Elekta multileaf collimator, the CIMO algorithm resulted in a 67% reduction in the number of segments. An average leaf sequencing time of less than one minute per beam was observed.
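A toy sketch of the CIMO idea, simultaneously perturbing aperture shapes and weights with simulated annealing to minimize the squared difference between the delivered and optimized intensity maps; multileaf-collimator deliverability constraints and the authors' cooling schedule are omitted, so this is illustrative only.

```python
import random
import numpy as np

def sequence_map(target, n_apertures=5, iters=20000, t0=1.0, cooling=0.9995):
    """Simulated annealing over aperture shapes (binary beamlet masks) and
    weights so the delivered map approaches the optimized (target) map.
    MLC deliverability constraints are deliberately omitted here."""
    rows, cols = target.shape
    apertures = np.random.randint(0, 2, size=(n_apertures, rows, cols))
    weights = np.random.rand(n_apertures) * target.max()

    def cost():
        delivered = np.tensordot(weights, apertures, axes=1)
        return float(np.sum((delivered - target) ** 2))

    current, temp = cost(), t0
    for _ in range(iters):
        k = random.randrange(n_apertures)
        flip = random.random() < 0.5
        if flip:                                   # toggle one beamlet
            i, j = random.randrange(rows), random.randrange(cols)
            apertures[k, i, j] ^= 1
        else:                                      # perturb one weight
            old_w = weights[k]
            weights[k] = max(0.0, old_w + random.gauss(0, 0.1))
        new = cost()
        accept = new < current or random.random() < np.exp((current - new) / temp)
        if accept:
            current = new
        elif flip:                                 # undo a rejected move
            apertures[k, i, j] ^= 1
        else:
            weights[k] = old_w
        temp *= cooling
    return apertures, weights, current

target = np.outer([1, 2, 3, 2, 1], [1, 2, 2, 1]).astype(float)
print(sequence_map(target)[2])  # residual squared error after annealing
```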
The Neandertal genome and ancient DNA authenticity
Green, Richard E; Briggs, Adrian W; Krause, Johannes; Prüfer, Kay; Burbano, Hernán A; Siebauer, Michael; Lachmann, Michael; Pääbo, Svante
2009-01-01
Recent advances in high-throughput DNA sequencing have made genome-scale analyses of genomes of extinct organisms possible. With these new opportunities come new difficulties in assessing the authenticity of the DNA sequences retrieved. We discuss how these difficulties can be addressed, particularly with regard to analyses of the Neandertal genome. We argue that only direct assays of DNA sequence positions in which Neandertals differ from all contemporary humans can serve as a reliable means to estimate human contamination. Indirect measures, such as the extent of DNA fragmentation, nucleotide misincorporations, or comparison of derived allele frequencies in different fragment size classes, are unreliable. Fortunately, interim approaches based on mtDNA differences between Neandertals and current humans, detection of male contamination through Y chromosomal sequences, and repeated sequencing from the same fossil to detect autosomal contamination allow initial large-scale sequencing of Neandertal genomes. This will result in the discovery of fixed differences in the nuclear genome between Neandertals and current humans that can serve as future direct assays for contamination. For analyses of other fossil hominins, which may become possible in the future, we suggest a similar ‘boot-strap’ approach in which interim approaches are applied until sufficient data for more definitive direct assays are acquired. PMID:19661919
2011-01-01
Background BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. Results This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Conclusions Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed. PMID:21794110
Feltus, Frank A; Saski, Christopher A; Mockaitis, Keithanne; Haiminen, Niina; Parida, Laxmi; Smith, Zachary; Ford, James; Staton, Margaret E; Ficklin, Stephen P; Blackmon, Barbara P; Cheng, Chun-Huai; Schnell, Raymond J; Kuhn, David N; Motamayor, Juan-Carlos
2011-07-27
BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed.
Song, Dandan; Li, Ning; Liao, Lejian
2015-01-01
By generating enormous amounts of data at lower cost and in shorter time, whole-exome sequencing technologies provide dramatic opportunities for identifying disease genes implicated in Mendelian disorders. Since thousands of genomic variants can be identified in each exome, it is challenging to filter out pathogenic variants in protein-coding regions while keeping the number of missed true variants low. Therefore, an automatic and efficient pipeline for finding disease variants in Mendelian disorders is designed that exploits a combination of variant-filtering steps to analyze family-based exome sequencing data. Recent studies on Freeman-Sheldon disease are revisited, showing that the proposed method outperforms other existing candidate gene identification methods.
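A minimal sketch of the kind of family-based filtering such a pipeline combines, with illustrative thresholds, hypothetical gene names, and simplified trio genotype logic (de novo or recessive); it is not the authors' pipeline.

```python
def candidate_variants(variants, inheritance="recessive"):
    """Filter exome variants in a parents/proband trio. Each variant is a
    dict with genotypes coded as the number of alternate alleles (0/1/2)
    plus simple annotations; the thresholds are illustrative only."""
    kept = []
    for v in variants:
        if v["population_freq"] > 0.01 or v["consequence"] == "synonymous":
            continue  # common or silent changes are unlikely to be causal
        child, mom, dad = v["child"], v["mother"], v["father"]
        if inheritance == "de_novo":
            ok = child >= 1 and mom == 0 and dad == 0
        else:  # recessive model: child homozygous, both parents carriers
            ok = child == 2 and mom == 1 and dad == 1
        if ok:
            kept.append(v)
    return kept

# Hypothetical gene names and annotations, for illustration only.
variants = [
    {"gene": "GENE_A", "consequence": "missense", "population_freq": 0.0001,
     "child": 2, "mother": 1, "father": 1},
    {"gene": "GENE_B", "consequence": "synonymous", "population_freq": 0.2,
     "child": 1, "mother": 1, "father": 0},
]
print([v["gene"] for v in candidate_variants(variants)])
```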
Numerical classification of coding sequences
NASA Technical Reports Server (NTRS)
Collins, D. W.; Liu, C. C.; Jukes, T. H.
1992-01-01
DNA sequences coding for protein may be represented by counts of nucleotides or codons. A complete reading frame may be abbreviated by its base count, e.g. A76C158G121T74, or with the corresponding codon table, e.g. (AAA)0(AAC)1(AAG)9 ... (TTT)0. We propose that these numerical designations be used to augment current methods of sequence annotation. Because base counts and codon tables do not require revision as knowledge of function evolves, they are well-suited to act as cross-references, for example to identify redundant GenBank entries. These descriptors may be compared, in place of DNA sequences, to extract homologous genes from large databases. This approach permits rapid searching with good selectivity.
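A minimal sketch of the proposed descriptors: the base-count abbreviation and the codon table computed directly from a reading frame.

```python
from collections import Counter

def base_count(seq):
    # Abbreviate a reading frame by its base count, e.g. A76C158G121T74.
    c = Counter(seq.upper())
    return "".join(f"{b}{c[b]}" for b in "ACGT")

def codon_table(seq):
    # Count codons in the reading frame, e.g. {"AAA": 0, "AAC": 1, ...}.
    codons = [seq[i:i + 3].upper() for i in range(0, len(seq) - 2, 3)]
    return Counter(codons)

cds = "ATGAAACCCGGGTTTTAA"
print(base_count(cds))         # A6C3G4T5 for this toy reading frame
print(dict(codon_table(cds)))  # {'ATG': 1, 'AAA': 1, 'CCC': 1, ...}
```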
Task planning with uncertainty for robotic systems. Thesis
NASA Technical Reports Server (NTRS)
Cao, Tiehua
1993-01-01
In a practical robotic system, it is important to represent and plan sequences of operations and to be able to choose an efficient sequence from them for a specific task. During the generation and execution of task plans, different kinds of uncertainty may occur and erroneous states need to be handled to ensure the efficiency and reliability of the system. An approach to task representation, planning, and error recovery for robotic systems is demonstrated. Our approach to task planning is based on an AND/OR net representation, which is then mapped to a Petri net representation of all feasible geometric states and associated feasibility criteria for net transitions. Task decomposition of robotic assembly plans based on this representation is performed on the Petri net for robotic assembly tasks, and the inheritance of the properties of liveness, safeness, and reversibility at all levels of decomposition is explored. This approach provides a framework for robust execution of tasks through the properties of traceability and viability. Uncertainty in robotic systems is modeled by local fuzzy variables, fuzzy marking variables, and global fuzzy variables, which are incorporated into fuzzy Petri nets. Analysis of properties and reasoning about uncertainty are investigated using fuzzy reasoning structures built into the net. Two applications of fuzzy Petri nets, robot task sequence planning and sensor-based error recovery, are explored. In the first application, the search space for feasible and complete task sequences with correct precedence relationships is reduced via the use of global fuzzy variables in reasoning about subgoals. In the second application, sensory verification operations are modeled by mutually exclusive transitions to reason about local and global fuzzy variables on-line and automatically select a retry or an alternative error recovery sequence when errors occur. Task sequencing and task execution with error recovery capability for one and multiple soft components in robotic systems are investigated.
Template-Based Modeling of Protein-RNA Interactions.
Zheng, Jinfang; Kundrotas, Petras J; Vakser, Ilya A; Liu, Shiyong
2016-09-01
Protein-RNA complexes formed by specific recognition between RNA and RNA-binding proteins play an important role in biological processes. More than a thousand such proteins in humans have been curated, and many novel RNA-binding proteins remain to be discovered. Due to limitations of experimental approaches, computational techniques are needed for characterization of protein-RNA interactions. Although much progress has been made, adequate methodologies reliably providing atomic-resolution structural details are still lacking. Although protein-RNA free docking approaches have proved useful, in general the template-based approaches provide higher-quality predictions. Templates are key to building a high-quality model. Sequence/structure relationships were studied based on a representative set of binary protein-RNA complexes from the PDB. Several approaches were tested for pairwise target/template alignment. The analysis revealed a transition point between random and correct binding modes. The results showed that structural alignment is better than sequence alignment in identifying good templates, suitable for generating protein-RNA complexes close to the native structure, and outperforms free docking, successfully predicting complexes where free docking fails, including cases of significant conformational change upon binding. A template-based protein-RNA interaction modeling protocol, PRIME, was developed and benchmarked on a representative set of complexes.
Scheirlinck, Ilse; Van der Meulen, Roel; Van Schoor, Ann; Vancanneyt, Marc; De Vuyst, Luc; Vandamme, Peter; Huys, Geert
2007-01-01
A culture-based approach was used to investigate the diversity of lactic acid bacteria (LAB) in Belgian traditional sourdoughs and to assess the influence of flour type, bakery environment, geographical origin, and technological characteristics on the taxonomic composition of these LAB communities. For this purpose, a total of 714 LAB from 21 sourdoughs sampled at 11 artisan bakeries throughout Belgium were subjected to a polyphasic identification approach. The microbial composition of the traditional sourdoughs was characterized by bacteriological culture in combination with genotypic identification methods, including repetitive element sequence-based PCR fingerprinting and phenylalanyl-tRNA synthase (pheS) gene sequence analysis. LAB from Belgian sourdoughs belonged to the genera Lactobacillus, Pediococcus, Leuconostoc, Weissella, and Enterococcus, with the heterofermentative species Lactobacillus paralimentarius, Lactobacillus sanfranciscensis, Lactobacillus plantarum, and Lactobacillus pontis as the most frequently isolated taxa. Statistical analysis of the identification data indicated that the microbial composition of the sourdoughs is mainly affected by the bakery environment rather than the flour type (wheat, rye, spelt, or a mixture of these) used. In conclusion, the polyphasic approach, based on rapid genotypic screening and high-resolution, sequence-dependent identification, proved to be a powerful tool for studying the LAB diversity in traditional fermented foods such as sourdough. PMID:17675431
Current strategies for mobilome research.
Jørgensen, Tue S; Kiil, Anne S; Hansen, Martin A; Sørensen, Søren J; Hansen, Lars H
2014-01-01
Mobile genetic elements (MGEs) are pivotal for bacterial evolution and adaptation, allowing shuffling of genes even between distantly related bacterial species. The study of these elements is biologically interesting, as the mode of genetic propagation is kaleidoscopic, and important, as MGEs are the main vehicles of the increasing bacterial antibiotic resistance that causes thousands of human deaths each year. The study of MGEs has previously focused on plasmids from individual isolates, but the revolution in sequencing technology has allowed the study of mobile genomic elements of entire communities using metagenomic approaches. The problem in using metagenomic sequencing for the study of MGEs is that plasmids and other mobile elements comprise only a small fraction of the total genetic content and are difficult to separate from chromosomal DNA based on sequence alone. The distinction between plasmid and chromosome is important, as the mobility and regulation of genes largely depend on their genetic context. Several different approaches have been proposed that specifically enrich plasmid DNA from community samples. Here, we review recent approaches used to study entire plasmid pools from complex environments, and point out possible future developments for and pitfalls of these approaches. Further, we discuss the use of the PacBio long-read sequencing technology for MGE discovery.
Detecting false positive sequence homology: a machine learning approach.
Fujimoto, M Stanley; Suvorov, Anton; Jensen, Nicholas O; Clement, Mark J; Bybee, Seth M
2016-02-24
Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches, that are used to identify homologous genes and group them into two fundamentally distinct classes: orthologs and paralogs. Because they rely only on heuristic filtering based on significance score cutoffs and lack cluster post-processing tools, these methods can often produce clusters containing unrelated (non-homologous) sequences. Consequently, sequence data extracted from incomplete genome/transcriptome assemblies, originating from low-coverage sequencing or assembled de novo without a reference genome, are susceptible to high false-positive rates of homology detection. In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in the context of guided experimentation to verify false-positive outcomes. We demonstrate that our machine learning method, trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully determines apparent false positives inferred by heuristic algorithms, especially among proteomes recovered from low-coverage RNA-seq data. Approximately 42% and 25% of the putative homologies predicted by InParanoid and HaMStR, respectively, were classified as false positives on the experimental data set. Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low-quality clusters of putative homologous genes recovered by heuristic-based approaches.
Manku, H K; Dhanoa, J K; Kaur, S; Arora, J S; Mukhopadhyay, C S
2017-10-01
MicroRNAs (miRNAs) are small (19-25 bases long), non-coding RNAs that regulate post-transcriptional gene expression by cleaving targeted mRNAs in several eukaryotes. The miRNAs play vital roles in multiple biological and metabolic processes, including developmental timing, signal transduction, cell maintenance and differentiation, diseases and cancers. Experimental identification of microRNAs is expensive and lab-intensive. Alternatively, computational approaches for predicting putative miRNAs from genomic or exomic sequences rely on features of miRNAs, viz. secondary structures, sequence conservation, minimum free energy index (MFEI), etc. To date, not a single miRNA has been identified in buffalo (Bubalus bubalis), an economically important livestock species. The present study aims at predicting putative miRNAs of buffalo using a comparative computational approach applied to buffalo whole-genome shotgun sequencing data (INSDC: AWWX00000000.1). The sequences were blasted against known mammalian miRNAs. The obtained miRNAs were then passed through a series of filtration criteria to obtain the set of predicted (putative and novel) bubaline miRNAs. Eight miRNAs were selected based on lowest E-value and validated by real-time PCR (SYBR green chemistry) using RNU6 as endogenous control. The results from different trials of real-time PCR show that, of the 8 selected miRNAs, only 2 (hsa-miR-1277-5p; bta-miR-2285b) are not expressed in bubaline PBMCs. The potential target genes were then predicted, based on sequence complementarity, using miRanda. This work is the first report on prediction of bubaline miRNAs from whole-genome sequencing data followed by experimental validation. The findings could pave the way for future studies of economically important traits in buffalo.
Free energy minimization to predict RNA secondary structures and computational RNA design.
Churkin, Alexander; Weinbrand, Lina; Barash, Danny
2015-01-01
Determining the RNA secondary structure from sequence data by computational predictions is a long-standing problem. Its solution has been approached in two distinctive ways. If a multiple sequence alignment of a collection of homologous sequences is available, the comparative method uses phylogeny to determine conserved base pairs that are more likely to form as a result of billions of years of evolution than by chance. In the case of single sequences, recursive algorithms that compute free energy structures by using empirically derived energy parameters have been developed. This latter approach of RNA folding prediction by energy minimization is widely used to predict RNA secondary structure from sequence. For a significant number of RNA molecules, the secondary structure of the RNA molecule is indicative of its function and its computational prediction by minimizing its free energy is important for its functional analysis. A general method for free energy minimization to predict RNA secondary structures is dynamic programming, although other optimization methods have been developed as well along with empirically derived energy parameters. In this chapter, we introduce and illustrate by examples the approach of free energy minimization to predict RNA secondary structures.
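The recursive idea behind these energy-minimization algorithms can be illustrated with the classic Nussinov recursion, which maximizes base pairs instead of minimizing empirically parameterized free energies. The sketch below is that simplified stand-in, not the Turner-parameter algorithms used by production tools such as Mfold or RNAfold.

```python
# Simplified illustration of the dynamic-programming recursion used in RNA
# secondary structure prediction. Real free energy minimization uses empirically
# derived nearest-neighbour energies; here we maximize base pairs
# (Nussinov-style) to keep the recursion transparent.

def can_pair(a, b):
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def nussinov(seq, min_loop=3):
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                   # j left unpaired
            for k in range(i, j - min_loop):      # j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + dp[k + 1][j - 1] + 1)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAAUCC"))  # maximum number of base pairs for the toy sequence
```

Replacing the "+1" pair reward with stacking and loop energies, and taking a minimum instead of a maximum, turns this recursion into the free-energy-minimization scheme described above.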
Sorenson, Laurie; Santini, Francesco
2013-01-01
Ray-finned fishes constitute the dominant radiation of vertebrates with over 32,000 species. Although molecular phylogenetics has begun to disentangle major evolutionary relationships within this vast section of the Tree of Life, there is no widely available approach for efficiently collecting phylogenomic data within fishes, leaving much of the enormous potential of massively parallel sequencing technologies for resolving major radiations in ray-finned fishes unrealized. Here, we provide a genomic perspective on longstanding questions regarding the diversification of major groups of ray-finned fishes through targeted enrichment of ultraconserved nuclear DNA elements (UCEs) and their flanking sequence. Our workflow efficiently and economically generates data sets that are orders of magnitude larger than those produced by traditional approaches and is well-suited to working with museum specimens. Analysis of the UCE data set recovers a well-supported phylogeny at both shallow and deep time-scales that supports a monophyletic relationship between Amia and Lepisosteus (Holostei) and reveals elopomorphs and then osteoglossomorphs to be the earliest diverging teleost lineages. Our approach additionally reveals that sequence capture of UCE regions and their flanking sequence offers enormous potential for resolving phylogenetic relationships within ray-finned fishes. PMID:23824177
ESTimating plant phylogeny: lessons from partitioning
de la Torre, Jose EB; Egan, Mary G; Katari, Manpreet S; Brenner, Eric D; Stevenson, Dennis W; Coruzzi, Gloria M; DeSalle, Rob
2006-01-01
Background While Expressed Sequence Tags (ESTs) have proven a viable and efficient way to sample genomes, particularly those for which whole-genome sequencing is impractical, phylogenetic analysis using ESTs remains difficult. Sequencing errors and orthology determination are the major problems when using ESTs as a source of characters for systematics. Here we develop methods to incorporate EST sequence information in a simultaneous analysis framework to address controversial phylogenetic questions regarding the relationships among the major groups of seed plants. We use an automated, phylogenetically derived approach to orthology determination called OrthologID to generate a phylogeny based on 43 process partitions, many of which are derived from ESTs, and examine several measures of support to assess the utility of EST data for phylogenies. Results A maximum parsimony (MP) analysis resulted in a single tree with relatively high support at all nodes in the tree despite rampant conflict among trees generated from the separate analysis of individual partitions. In a comparison of broader-scale groupings based on cellular compartment (i.e., chloroplast, mitochondrial or nuclear) or function, only the nuclear partition tree (based largely on EST data) was found to be topologically identical to the tree based on the simultaneous analysis of all data. Despite topological conflict among the broader-scale groupings examined, only the tree based on morphological data showed statistically significant differences. Conclusion Based on the amount of character support contributed by EST data, which make up a majority of the nuclear data set, and the lack of conflict of the nuclear data set with the simultaneous analysis tree, we conclude that the inclusion of EST data does provide a viable and efficient approach to address phylogenetic questions within a parsimony framework on a genomic scale, if problems of orthology determination and potential sequencing errors can be overcome. In addition, approaches that examine conflict and support in a simultaneous analysis framework allow for a more precise understanding of the evolutionary history of individual process partitions and may be a novel way to understand functional aspects of different kinds of cellular classes of gene products. PMID:16776834
High Class-Imbalance in pre-miRNA Prediction: A Novel Approach Based on deepSOM.
Stegmayer, Georgina; Yones, Cristian; Kamenetzky, Laura; Milone, Diego H
2017-01-01
The computational prediction of novel microRNA within a full genome involves identifying sequences having the highest chance of being a miRNA precursor (pre-miRNA). These sequences are usually named candidates to miRNA. The well-known pre-miRNAs are usually only a few in comparison to the hundreds of thousands of potential candidates to miRNA that have to be analyzed, which makes this task a high class-imbalance classification problem. The classical way of approaching it has been training a binary classifier in a supervised manner, using well-known pre-miRNAs as positive class and artificially defining the negative class. However, although the selection of positive labeled examples is straightforward, it is very difficult to build a set of negative examples in order to obtain a good set of training samples for a supervised method. In this work, we propose a novel and effective way of approaching this problem using machine learning, without the definition of negative examples. The proposal is based on clustering unlabeled sequences of a genome together with well-known miRNA precursors for the organism under study, which allows for the quick identification of the best candidates to miRNA as those sequences clustered with known precursors. Furthermore, we propose a deep model to overcome the problem of having very few positive class labels. They are always maintained in the deep levels as positive class while less likely pre-miRNA sequences are filtered level after level. Our approach has been compared with other methods for pre-miRNA prediction in several species, showing effective predictivity of novel miRNAs. Additionally, we will show that our approach has a lower training time and allows for better graphical navigability and interpretation of the results. A web-demo interface to try deepSOM is available at http://fich.unl.edu.ar/sinc/web-demo/deepsom/.
Chao, Michael C.; Pritchard, Justin R.; Zhang, Yanjia J.; Rubin, Eric J.; Livny, Jonathan; Davis, Brigid M.; Waldor, Matthew K.
2013-01-01
The coupling of high-density transposon mutagenesis to high-throughput DNA sequencing (transposon-insertion sequencing) enables simultaneous and genome-wide assessment of the contributions of individual loci to bacterial growth and survival. We have refined analysis of transposon-insertion sequencing data by normalizing for the effect of DNA replication on sequencing output and using a hidden Markov model (HMM)-based filter to exploit heretofore unappreciated information inherent in all transposon-insertion sequencing data sets. The HMM can smooth variations in read abundance and thereby reduce the effects of read noise, as well as permit fine scale mapping that is independent of genomic annotation and enable classification of loci into several functional categories (e.g. essential, domain essential or ‘sick’). We generated a high-resolution map of genomic loci (encompassing both intra- and intergenic sequences) that are required or beneficial for in vitro growth of the cholera pathogen, Vibrio cholerae. This work uncovered new metabolic and physiologic requirements for V. cholerae survival, and by combining transposon-insertion sequencing and transcriptomic data sets, we also identified several novel noncoding RNA species that contribute to V. cholerae growth. Our findings suggest that HMM-based approaches will enhance extraction of biological meaning from transposon-insertion sequencing genomic data. PMID:23901011
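To make the smoothing idea concrete, here is a rough, generic sketch of segmenting per-site insertion read counts with a two-state Gaussian HMM on log-transformed counts. It uses the hmmlearn package and toy Poisson data, and is not the authors' actual model or parameterization.

```python
# Rough sketch of HMM-based smoothing of per-site insertion read counts,
# in the spirit of the approach described above (not the authors' exact model).
# Assumes the hmmlearn package; state labels are assigned post hoc by mean count.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Toy data: a stretch of low-count ("essential") sites flanked by permissive sites.
counts = np.concatenate([rng.poisson(50, 200), rng.poisson(1, 60), rng.poisson(50, 200)])
obs = np.log1p(counts).reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
model.fit(obs)
states = model.predict(obs)

# The state with the lower mean read count corresponds to essential/"sick" loci.
essential_state = int(np.argmin(model.means_.ravel()))
print("sites called essential:", int((states == essential_state).sum()))
```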
Poincaré recurrences of DNA sequences
NASA Astrophysics Data System (ADS)
Frahm, K. M.; Shepelyansky, D. L.
2012-01-01
We analyze the statistical properties of Poincaré recurrences of Homo sapiens, mammalian, and other DNA sequences taken from the Ensembl Genome database with up to 15 billion base pairs. We show that the probability of Poincaré recurrences decays in an algebraic way with the Poincaré exponent β≈4 even if the oscillatory dependence is well pronounced. The correlations between recurrences decay with an exponent ν≈0.6 that leads to an anomalous superdiffusive walk. However, for Homo sapiens sequences, with the largest available statistics, the diffusion coefficient converges to a finite value on distances larger than one million base pairs. We argue that the approach based on Poincaré recurrences determines new proximity features between different species and sheds new light on their evolutionary history.
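A minimal way to tabulate recurrence-time statistics of this kind is to record the gaps between successive occurrences of a short nucleotide word and inspect how their probability falls off. The sketch below does that bookkeeping on a random sequence; it illustrates only the counting, not the dynamical-systems analysis of the paper.

```python
# Illustrative sketch: tabulate recurrence times of a short word along a DNA
# sequence and report their empirical probabilities. A simplified stand-in for
# the Poincaré-recurrence analysis described above.
from collections import Counter
import random

def recurrence_times(seq, word):
    hits = [i for i in range(len(seq) - len(word) + 1) if seq[i:i + len(word)] == word]
    return [b - a for a, b in zip(hits, hits[1:])]

random.seed(1)
seq = "".join(random.choice("ACGT") for _ in range(100_000))  # toy random "genome"
times = recurrence_times(seq, "ACG")

hist = Counter(times)
total = len(times)
for t in sorted(hist)[:5]:
    print(f"P(recurrence time = {t}) = {hist[t] / total:.4f}")
```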
Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P
2007-03-15
Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
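The core of a pattern-growth search can be sketched in a few lines: start from the empty pattern and extend it one residue at a time, keeping only extensions that still meet a minimum support across the unaligned input sequences. This sketch handles only exact substring patterns, whereas the published algorithms also mine wildcard and flexible-gap (type II/III) patterns; the thresholds and sequences are illustrative.

```python
# Minimal pattern-growth sketch: grow exact substring patterns residue by
# residue, keeping only patterns present in at least `min_support` sequences.

def support(pattern, sequences):
    return sum(pattern in s for s in sequences)

def grow_patterns(sequences, alphabet, min_support=2, max_len=4):
    frequent, frontier = [], [""]
    while frontier:
        next_frontier = []
        for p in frontier:
            for a in alphabet:
                q = p + a
                if support(q, sequences) >= min_support:
                    frequent.append(q)
                    if len(q) < max_len:
                        next_frontier.append(q)
        frontier = next_frontier
    return frequent

seqs = ["DRYMAIV", "DRYLAIV", "NPXXYMAIV"]
print(grow_patterns(seqs, alphabet=set("".join(seqs)), min_support=2, max_len=3))
```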
2014-01-01
Background Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many tools for analyzing metagenomic sequencing data have emerged recently; however, these approaches often suffer from issues of specificity and efficiency, and typically do not include a complete metagenomic analysis framework. Results We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps, including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. Conclusions The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/. PMID:25225611
Mahmood, Shahid; Freitag, Thomas E; Prosser, James I
2006-06-01
PCR-based techniques are commonly used to characterize microbial communities, but are subject to bias that is difficult to assess. This study aimed to evaluate bias of several PCR primer-based strategies used to study diversity of autotrophic ammonia oxidizers. 16S rRNA genes from soil- or sediment-DNA were amplified using primers considered either selective or specific for betaproteobacterial ammonia oxidizers. Five approaches were assessed: (a) amplification with primers betaAMO143f-betaAMO1315r; (b) amplification with primers CTO189f-CTO654r; (c) nested amplification with betaAMO143f-betaAMO1315r followed by CTO189f-CTO654r primers; (d) nested amplification with betaAMO143f-betaAMO1315r and CTO189f-Pf1053r primers; (e) nested amplification with 27f-1492r and CTO189f-CTO654r primers. Amplification products were characterized by denaturing gradient gel electrophoresis (DGGE) analysis after further amplification with 357f-GC-518r primers. DGGE profiles of soil communities were heterogeneous and depended on the approach followed. Ammonia oxidizer diversity was higher using approaches (b), (c) and (e) than using (a) and (d), where sequences of the most prominent bands showed similarities to nonammonia oxidizers. Profiles from marine sediments were more consistent, regardless of the approach adopted, and sequence analysis of excised bands indicated that these consisted of ammonia oxidizers only. The study demonstrates the importance of choice of primer, of screening for sequences of nontarget organisms and use of several approaches when characterizing microbial communities in natural environments.
Hurst, Sarah J; Han, Min Su; Lytton-Jean, Abigail K R; Mirkin, Chad A
2007-09-15
We have developed a novel competition assay that uses a gold nanoparticle (Au NP)-based, high-throughput colorimetric approach to screen the sequence selectivity of DNA-binding molecules. This assay hinges on the observation that the melting behavior of DNA-functionalized Au NP aggregates is sensitive to the concentration of the DNA-binding molecule in solution. When short, oligomeric hairpin DNA sequences were added to a reaction solution consisting of DNA-functionalized Au NP aggregates and DNA-binding molecules, these molecules may either bind to the Au NP aggregate interconnects or the hairpin stems based on their relative affinity for each. This relative affinity can be measured as a change in the melting temperature (Tm) of the DNA-modified Au NP aggregates in solution. As a proof of concept, we evaluated the selectivity of 4',6-diamidino-2-phenylindole (an AT-specific binder), ethidium bromide (a nonspecific binder), and chromomycin A (a GC-specific binder) for six sequences of hairpin DNA having different numbers of AT pairs in a five-base pair variable stem region. Our assay accurately and easily confirmed the known trends in selectivity for the DNA binders in question without the use of complicated instrumentation. This novel assay will be useful in assessing large libraries of potential drug candidates that work by binding DNA to form a drug/DNA complex.
Chen, Xin; Wu, Qiong; Sun, Ruimin; Zhang, Louxin
2012-01-01
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach. In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as SNP-MSP, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as SNP-MSQ, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the SNP-MSP problem and also show that the SNP-MSQ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem. We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
Wallace, Lindsay M; Saad, Nizar Y; Pyne, Nettie K; Fowler, Allison M; Eidahl, Jocelyn O; Domire, Jacqueline S; Griffin, Danielle A; Herman, Adam C; Sahenk, Zarife; Rodino-Klapac, Louise R; Harper, Scott Q
2018-03-16
RNAi emerged as a prospective molecular therapy nearly 15 years ago. Since then, two major RNAi platforms have been under development: oligonucleotides and gene therapy. Oligonucleotide-based approaches have seen more advancement, with some promising therapies that may soon reach market. In contrast, vector-based approaches for RNAi therapy have remained largely in the pre-clinical realm, with limited clinical safety and efficacy data to date. We are developing a gene therapy approach to treat the autosomal-dominant disorder facioscapulohumeral muscular dystrophy. Our strategy involves silencing the myotoxic gene DUX4 using adeno-associated viral vectors to deliver targeted microRNA expression cassettes (miDUX4s). We previously demonstrated proof of concept for this approach in mice, and we are now taking additional steps here to assess safety issues related to miDUX4 overexpression and sequence-specific off-target silencing. In this study, we describe improvements in vector design and expansion of our miDUX4 sequence repertoire and report differential toxicity elicited by two miDUX4 sequences, of which one was toxic and the other was not. This study provides important data to help advance our goal of translating RNAi gene therapy for facioscapulohumeral muscular dystrophy.
Laine, Elodie; Carbone, Alessandra
2015-01-01
Protein-protein interactions (PPIs) are essential to all biological processes and they represent increasingly important therapeutic targets. Here, we present a new method for accurately predicting protein-protein interfaces, understanding their properties, origins and binding to multiple partners. Contrary to machine learning approaches, our method combines in a rational and very straightforward way three sequence- and structure-based descriptors of protein residues: evolutionary conservation, physico-chemical properties and local geometry. The implemented strategy yields very precise predictions for a wide range of protein-protein interfaces and discriminates them from small-molecule binding sites. Beyond its predictive power, the approach permits to dissect interaction surfaces and unravel their complexity. We show how the analysis of the predicted patches can foster new strategies for PPIs modulation and interaction surface redesign. The approach is implemented in JET2, an automated tool based on the Joint Evolutionary Trees (JET) method for sequence-based protein interface prediction. JET2 is freely available at www.lcqb.upmc.fr/JET2. PMID:26690684
Genomic Insights into Geothermal Spring Community Members using a 16S Agnostic Single-Cell Approach
NASA Astrophysics Data System (ADS)
Bowers, R. M.
2016-12-01
DOE Joint Genome Institute, Walnut Creek, CA, USA; Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA; Department of Biological Sciences, University of Calgary, Calgary, Alberta, Canada. With recent advances in DNA sequencing, rapid and affordable screening of single-cell genomes has become a reality. Single-cell sequencing is a multi-step process that takes advantage of any number of single-cell sorting techniques, whole genome amplification (WGA), and 16S rRNA gene-based PCR screening to identify the microbes of interest prior to shotgun sequencing. However, the 16S PCR-based screening step is costly and may lead to unanticipated losses of microbial diversity, as cells that do not produce a clean 16S amplicon are typically omitted from downstream shotgun sequencing. While many of the sorted cells that fail the 16S PCR step likely originate from poor-quality amplified DNA, some of the cells with good WGA kinetics may instead represent bacteria or archaea with 16S genes that fail to amplify due to primer mismatches or the presence of intervening sequences. Using cell material from Dewar Creek, a hot spring in British Columbia, we sequenced all sorted cells with good WGA kinetics irrespective of their 16S amplification success. We show that this high-throughput approach to single-cell sequencing (i) can reduce the overall cost of single-cell genome production and (ii) may lead to the discovery of previously unknown branches on the microbial tree of life.
Cankar, Katarina; Chauvensy-Ancel, Valérie; Fortabat, Marie-Noelle; Gruden, Kristina; Kobilinsky, André; Zel, Jana; Bertheau, Yves
2008-05-15
Detection of nonauthorized genetically modified organisms (GMOs) has always presented an analytical challenge because the complete sequence data needed to detect them are generally unavailable, although sequence similarity to known GMOs can be expected. A new approach, differential quantitative polymerase chain reaction (PCR), for detection of nonauthorized GMOs is presented here. This method is based on the presence of several common elements (e.g., promoter, genes of interest) in different GMOs. A statistical model was developed to study the difference between the number of molecules of such a common sequence and the number of molecules identifying the approved GMO (as determined by border-fragment-based PCR) and the donor organism of the common sequence. When this difference differs statistically from zero, the presence of a nonauthorized GMO can be inferred. The interest and scope of such an approach were tested on a case study of different proportions of genetically modified maize events, with the P35S promoter as the Cauliflower Mosaic Virus common sequence. The presence of a nonauthorized GMO was successfully detected in the mixtures analyzed and in the presence of Cauliflower Mosaic Virus (the donor organism of the P35S promoter). This method could be easily transposed to other common GMO sequences and other species and is applicable to other detection areas such as microbiology.
Sanger sequencing as a first-line approach for molecular diagnosis of Andersen-Tawil syndrome.
Totomoch-Serra, Armando; Marquez, Manlio F; Cervantes-Barragán, David E
2017-01-01
In 1977, Frederick Sanger developed a new method for DNA sequencing based on the chain termination method, now known as the Sanger sequencing method (SSM). Recently, massive parallel sequencing, better known as next-generation sequencing (NGS), is replacing the SSM for detecting mutations in cardiovascular diseases with a genetic background. The present opinion article aims to emphasize that "targeted" SSM is still effective as a first-line approach for the molecular diagnosis of some specific conditions, as is the case for Andersen-Tawil syndrome (ATS). ATS is described as a rare multisystemic autosomal dominant channelopathy syndrome caused mainly by a heterozygous mutation in the KCNJ2 gene. KCNJ2 has particular characteristics that make it attractive for "directed" SSM. KCNJ2 has a sequence of 17,510 base pairs (bp), and a short coding region with two exons (exon 1=166 bp and exon 2=5220 bp), half of the mutations are located in the C-terminal cytosolic domain, a mutational hotspot has been described in residue Arg218, and this gene explains the phenotype in 60% of ATS cases that fulfill all the clinical criteria of the disease. In order to increase the diagnosis of ATS we urge cardiologists to search for facial and muscular abnormalities in subjects with frequent ventricular arrhythmias (especially bigeminy) and prominent U waves on the electrocardiogram.
Suyama, Yoshihisa; Matsuki, Yu
2015-01-01
Restriction-enzyme (RE)-based next-generation sequencing methods have revolutionized marker-assisted genetic studies; however, the use of REs has limited their widespread adoption, especially in field samples with low-quality DNA and/or small quantities of DNA. Here, we developed a PCR-based procedure to construct reduced representation libraries without RE digestion steps, representing de novo single-nucleotide polymorphism discovery, and its genotyping using next-generation sequencing. Using multiplexed inter-simple sequence repeat (ISSR) primers, thousands of genome-wide regions were amplified effectively from a wide variety of genomes, without prior genetic information. We demonstrated: 1) Mendelian gametic segregation of the discovered variants; 2) reproducibility of genotyping by checking its applicability for individual identification; and 3) applicability in a wide variety of species by checking standard population genetic analysis. This approach, called multiplexed ISSR genotyping by sequencing, should be applicable to many marker-assisted genetic studies with a wide range of DNA qualities and quantities. PMID:26593239
Lázaro-Muñoz, Gabriel; Conley, John M.; Davis, Arlene M.; Van Riper, Marcia; Walker, Rebecca L.; Juengst, Eric T.
2015-01-01
Advances in genomics have led to calls for developing population-based preventive genomic sequencing (PGS) programs with the goal of identifying genetic health risks in adults without known risk factors. One critical issue for minimizing the harms and maximizing the benefits of PGS is determining the kind and degree of control individuals should have over the generation, use, and handling of their genomic information. In this article we examine whether PGS programs should offer individuals the opportunity to selectively opt-out of the sequencing or analysis of specific genomic conditions (the menu approach) or whether PGS should be implemented using an all-or-nothing panel approach. We conclude that any responsible scale up of PGS will require a menu approach that may seem impractical to some, but which draws its justification from a rich mix of normative, legal, and practical considerations. PMID:26147254
Validation of two ribosomal RNA removal methods for microbial metatranscriptomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
He, Shaomei; Wurtzel, Omri; Singh, Kanwar
2010-10-01
The predominance of rRNAs in the transcriptome is a major technical challenge in sequence-based analysis of cDNAs from microbial isolates and communities. Several approaches have been applied to deplete rRNAs from (meta)transcriptomes, but no systematic investigation of potential biases introduced by any of these approaches has been reported. Here we validated the effectiveness and fidelity of the two most commonly used approaches, subtractive hybridization and exonuclease digestion, as well as combinations of these treatments, on two synthetic five-microorganism metatranscriptomes using massively parallel sequencing. We found that the effectiveness of rRNA removal was a function of community composition and RNA integrity for these treatments. Subtractive hybridization alone introduced the least bias in relative transcript abundance, whereas exonuclease and in particular combined treatments greatly compromised mRNA abundance fidelity. Illumina sequencing itself also can compromise quantitative data analysis by introducing a G+C bias between runs.
NASA Astrophysics Data System (ADS)
Liu, Qiong; Wang, Wen-xi; Zhu, Ke-ren; Zhang, Chao-yong; Rao, Yun-qing
2014-11-01
Mixed-model assembly line sequencing is significant in reducing the production time and overall cost of production. To improve production efficiency, a mathematical model aiming simultaneously to minimize overtime, idle time and total set-up costs is developed. To obtain high-quality and stable solutions, an advanced scatter search approach is proposed. In the proposed algorithm, a new diversification generation method based on a genetic algorithm is presented to generate a set of potentially diverse and high-quality initial solutions. Many methods, including reference set update, subset generation, solution combination and improvement methods, are designed to maintain the diversification of populations and to obtain high-quality ideal solutions. The proposed model and algorithm are applied and validated in a case company. The results indicate that the proposed advanced scatter search approach is significant for mixed-model assembly line sequencing in this company.
Dhir, Somdutta; Pacurar, Mircea; Franklin, Dino; Gáspári, Zoltán; Kertész-Farkas, Attila; Kocsor, András; Eisenhaber, Frank; Pongor, Sándor
2010-11-01
SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5:39-42, 1992; Pongor et al., Nucl. Acids Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences - the SBASE domain library - and standard similarity search algorithms, followed by postprocessing based on simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within the project.
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
Bonham-Carter, Oliver; Steele, Joe; Bastola, Dhundy
2014-11-01
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.
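As a concrete example of the word-frequency approach, the sketch below builds k-mer count profiles for two sequences and evaluates the basic D2 statistic, which is simply the inner product of the two count vectors. The value of k and the toy sequences are arbitrary choices, and the refined variants of D2 discussed in such reviews (centred or normalized statistics) are not shown.

```python
# Word-frequency (alignment-free) comparison: k-mer count profiles and the
# basic D2 statistic, i.e. the inner product of the two k-mer count vectors.
from collections import Counter

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq1, seq2, k):
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

a = "ACGTACGTGGCT"
b = "ACGTTTGGCTAC"
print(kmer_counts(a, 3))
print("D2(k=3) =", d2(a, b, 3))
```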
Yan, Qiongqiong; Fanning, Séamus
2015-01-01
Cronobacter species are emerging opportunistic food-borne pathogens, comprising seven species: C. sakazakii, C. malonaticus, C. muytjensii, C. turicensis, C. dublinensis, C. universalis, and C. condimenti. The organism can cause severe clinical infections, including necrotizing enterocolitis, septicemia, and meningitis, predominantly among neonates <4 weeks of age. Cronobacter species can be isolated from various foods and their surrounding environments; however, powdered infant formula (PIF) is the most frequently implicated food source linked with Cronobacter infection. This review aims to provide a summary of laboratory-based strategies that can be used to identify and trace Cronobacter species. The identification of Cronobacter species using conventional culture methods and immuno-based detection protocols is presented first. The molecular detection and identification at genus and species level, along with molecular-based serogrouping approaches, are also described, followed by the molecular sub-typing methods, in particular pulsed-field gel electrophoresis and multi-locus sequence typing. Next-generation sequencing approaches, including whole genome sequencing, DNA microarray, and high-throughput whole-transcriptome sequencing, are also highlighted. Appropriate application of these strategies would contribute to reducing the risk of Cronobacter contamination in PIF and production environments, thereby improving food safety and protecting public health. PMID:26000266
Pembleton, Luke W; Drayton, Michelle C; Bain, Melissa; Baillie, Rebecca C; Inch, Courtney; Spangenberg, German C; Wang, Junping; Forster, John W; Cogan, Noel O I
2016-05-01
A targeted amplicon-based genotyping-by-sequencing approach has permitted cost-effective and accurate discrimination between ryegrass species (perennial, Italian and inter-species hybrid), and identification of cultivars based on bulked samples. Perennial ryegrass and Italian ryegrass are the most important temperate forage species for global agriculture, and are represented in the commercial pasture seed market by numerous cultivars each composed of multiple highly heterozygous individuals. Previous studies have identified difficulties in the use of morphophysiological criteria to discriminate between these two closely related taxa. Recently, a highly multiplexed single nucleotide polymorphism (SNP)-based genotyping assay has been developed that permits accurate differentiation between both species and cultivars of ryegrasses at the genetic level. This assay has since been further developed into an amplicon-based genotyping-by-sequencing (GBS) approach implemented on a second-generation sequencing platform, allowing accelerated throughput and ca. sixfold reduction in cost. Using the GBS approach, 63 cultivars of perennial, Italian and interspecific hybrid ryegrasses, as well as intergeneric Festulolium hybrids, were genotyped. The genetic relationships between cultivars were interpreted in terms of known breeding histories and indistinct species boundaries within the Lolium genus, as well as suitability of current cultivar registration methodologies. An example of applicability to quality assurance and control (QA/QC) of seed purity is also described. Rapid, low-cost genotypic assays provide new opportunities for breeders to more fully explore genetic diversity within breeding programs, allowing the combination of novel unique genetic backgrounds. Such tools also offer the potential to more accurately define cultivar identities, allowing protection of varieties in the commercial market and supporting processes of cultivar accreditation and quality assurance.
Skin microbiome: genomics-based insights into the diversity and role of skin microbes
Kong, Heidi H.
2011-01-01
Recent advances in DNA sequencing methodology have enabled studies of human skin microbes that circumvent difficulties in isolating and characterizing fastidious microbes. Sequence-based approaches have identified greater diversity of cutaneous bacteria than studies using traditional cultivation techniques. However, improved sequencing technologies and analytical methods are needed to study all skin microbes, including bacteria, archaea, fungi, viruses, and mites, and how they interact with each other and their human hosts. This review discusses current skin microbiome research, with a primary focus on bacteria, and the challenges facing investigators striving to understand how skin micro-organisms contribute to health and disease. PMID:21376666
Comparison of Metabolic Pathways in Escherichia coli by Using Genetic Algorithms.
Ortegon, Patricia; Poot-Hernández, Augusto C; Perez-Rueda, Ernesto; Rodriguez-Vazquez, Katya
2015-01-01
In order to understand how cellular metabolism has taken its modern form, the conservation and variations between metabolic pathways were evaluated by using a genetic algorithm (GA). The GA approach considered information on the complete metabolism of the bacterium Escherichia coli K-12, as deposited in the KEGG database, and the enzymes belonging to a particular pathway were transformed into enzymatic step sequences by using the breadth-first search algorithm. These sequences represent contiguous enzymes linked to each other, based on their catalytic activities as they are encoded in the Enzyme Commission numbers. In a subsequent step, these sequences were compared using a GA in an all-against-all (pairwise comparisons) approach. Individual reactions were chosen based on their measure of fitness to act as parents of offspring, which constitute the new generation. The sequences compared were used to construct a similarity matrix (of fitness values) that was then clustered using a k-medoids algorithm. A total of 34 clusters of conserved reactions were obtained, and their sequences were finally aligned with a multiple-sequence alignment GA optimized to align all the reaction sequences included in each group or cluster. From these comparisons, maps associated with the metabolism of similar compounds also contained similar enzymatic step sequences, reinforcing the Patchwork Model for the evolution of metabolism in E. coli K-12, an observation that can be expanded to other organisms for which there is metabolism information. Finally, our mapping of these reactions is discussed, with illustrations from a particular case.
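The clustering step applied to that pairwise fitness matrix can be sketched with a plain k-medoids update over a precomputed distance matrix (a distance can be derived from the GA fitness, e.g. d = f_max - f). The implementation below is a generic illustration on toy Euclidean distances, not the code used in the study.

```python
# Compact k-medoids sketch operating on a precomputed distance matrix.
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)       # assign to closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]  # most central member
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy example: two well-separated groups of points.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```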
Song, Jiangning; Yuan, Zheng; Tan, Hao; Huber, Thomas; Burrage, Kevin
2007-12-01
Disulfide bonds are primary covalent crosslinks between two cysteine residues in proteins that play critical roles in stabilizing the protein structures and are commonly found in extracytoplasmic or secreted proteins. In protein folding prediction, the localization of disulfide bonds can greatly reduce the search in conformational space. Therefore, there is a great need to develop computational methods capable of accurately predicting disulfide connectivity patterns in proteins, which could have potentially important applications. We have developed a novel method to predict disulfide connectivity patterns from protein primary sequence, using a support vector regression (SVR) approach based on multiple sequence feature vectors and secondary structure predicted by the PSIPRED program. The results indicate that our method could achieve a prediction accuracy of 74.4% and 77.9%, measured at the protein and cysteine-pair levels, respectively, when averaged on proteins with two to five disulfide bridges using 4-fold cross-validation on a well-defined non-homologous dataset. We assessed the effects of different sequence encoding schemes on the prediction performance of disulfide connectivity. It has been shown that the sequence encoding scheme based on multiple sequence feature vectors coupled with predicted secondary structure can significantly improve the prediction accuracy, thus enabling our method to outperform most other currently available predictors. Our work provides a complementary approach to the current algorithms that should be useful in computationally assigning disulfide connectivity patterns and helps in the annotation of protein sequences generated by large-scale whole-genome projects. The prediction web server and Supplementary Material are accessible at http://foo.maths.uq.edu.au/~huber/disulfide
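The regression step can be illustrated with scikit-learn's SVR: each candidate cysteine pair is described by a feature vector and the regressor outputs a connectivity score. The features, labels, and dimensions below are random placeholders standing in for the PSSM- and PSIPRED-derived descriptors used in the published method, and the final assembly of scores into a consistent connectivity pattern (e.g. by maximum-weight matching) is omitted.

```python
# Toy sketch of the regression step: score candidate cysteine pairs with a
# support vector regressor trained on per-pair sequence feature vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 40))      # one placeholder feature vector per cysteine pair
y_train = rng.uniform(0, 1, size=200)     # 1.0 ~ bonded, 0.0 ~ not bonded (placeholder labels)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

X_candidates = rng.normal(size=(6, 40))   # all cysteine pairs of one protein
scores = model.predict(X_candidates)
print("highest-scoring candidate pair:", int(np.argmax(scores)))
```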
Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.
Yehdego, Daniel T; Zhang, Boyu; Kodimala, Vikram K R; Johnson, Kyle L; Taufer, Michela; Leung, Ming-Ying
2013-05-01
Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structures of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and the optimized methods. Each step of searching for inversions, chunking, and prediction can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are possible computationally. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
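The inversion search that drives the segmentation step can be sketched as follows: find a stem of a chosen length whose reverse complement reappears within a bounded gap, the signature of a potential stem-loop. The stem length and gap bound below are arbitrary illustrative values, not the ones explored in the paper.

```python
# Sketch of an inversion (inverted-repeat) search over an RNA sequence.
COMP = str.maketrans("ACGU", "UGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def find_inversions(seq, stem=6, max_gap=30):
    hits = []
    for i in range(len(seq) - 2 * stem):
        left = seq[i:i + stem]
        window = seq[i + stem:i + stem + max_gap + stem]
        j = window.find(revcomp(left))
        if 0 <= j <= max_gap:
            hits.append((i, i + stem + j))   # start of stem, start of its complement
    return hits

rna = "GGGAAACCCUUUAGGGUUUCCC"
print(find_inversions(rna, stem=3, max_gap=10))
```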
A gradient-boosting approach for filtering de novo mutations in parent-offspring trios.
Liu, Yongzhuang; Li, Bingshan; Tan, Renjie; Zhu, Xiaolin; Wang, Yadong
2014-07-01
Whole-genome and -exome sequencing on parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. In this article, we curated 59 sequence features in whole-genome and whole-exome alignment contexts that are considered relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set using experimentally validated true and false de novo mutations as well as false de novo mutations collected from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated the relative importance of different sequence features on the classification accuracy. Finally, we applied DNMFilter on our in-house whole-exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filtering approach to significantly reduce the false discovery rate without sacrificing sensitivity. The software DNMFilter implemented using a combination of Java and R is freely available from the website at http://humangenome.duke.edu/software. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
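A hedged sketch of the filtering idea (not DNMFilter itself): train a gradient boosting classifier on per-candidate alignment features and keep candidates whose predicted probability of being a true de novo mutation clears a threshold. Feature values, labels and the 0.5 cutoff are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 59))     # 59 features, as in the abstract
y_train = rng.integers(0, 2, size=500)   # 1 = validated true DNM, 0 = artifact

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05).fit(X_train, y_train)

X_candidates = rng.normal(size=(20, 59))
prob_true = clf.predict_proba(X_candidates)[:, 1]
kept = np.where(prob_true > 0.5)[0]      # illustrative probability threshold
print(f"{len(kept)} of {len(prob_true)} candidates pass the filter")
```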
Exploring the Convergence of Sequences in the Embodied World Using GeoGebra
ERIC Educational Resources Information Center
de Moura Fonseca, Daila Silva Seabra; de Oliveira Lino Franchi, Regina Helena
2016-01-01
This study addresses the embodied approach of convergence of numerical sequences using the GeoGebra software. We discuss activities that were applied in regular calculus classes, as a part of a research which used a qualitative methodology and aimed to identify contributions of the development of activities based on the embodiment of concepts,…
ERIC Educational Resources Information Center
Exner, Robert; And Others
The sixteen chapters of this book provide the core material for the Elements of Mathematics Program, a secondary sequence developed for highly motivated students with strong verbal abilities. The sequence is based on a functional-relational approach to mathematics teaching, and emphasizes teaching by analysis of real-life situations. This text is…
Real-time UAV trajectory generation using feature points matching between video image sequences
NASA Astrophysics Data System (ADS)
Byun, Younggi; Song, Jeongheon; Han, Dongyeob
2017-09-01
Unmanned aerial vehicles (UAVs), equipped with navigation systems and video capability, are currently being deployed for intelligence, reconnaissance and surveillance missions. In this paper, we present a systematic approach for the generation of UAV trajectory using a video image matching system based on SURF (Speeded Up Robust Feature) and Preemptive RANSAC (Random Sample Consensus). Video image matching to find corresponding points is one of the most important steps for the accurate generation of a UAV trajectory (a sequence of poses in 3D space). We used the SURF algorithm to find the matching points between video image sequences, and removed mismatches by using Preemptive RANSAC, which divides all matching points into outliers and inliers. Only the inliers are used to determine the epipolar geometry for estimating the relative pose (rotation and translation) between image sequences. Experimental results from simulated video image sequences showed that our approach has good potential for application to the automatic geo-localization of UAV systems.
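A minimal OpenCV sketch of the matching-and-relative-pose step, under stated substitutions: ORB is used in place of SURF (SURF ships only with opencv-contrib builds), and standard RANSAC via findEssentialMat stands in for the preemptive variant; the camera intrinsics matrix K and the two frames are assumed to be available.

```python
import cv2
import numpy as np

def relative_pose(frame1, frame2, K):
    # detect and describe features in both frames
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(frame1, None)
    k2, d2 = orb.detectAndCompute(frame2, None)

    # brute-force matching with cross-checking
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

    # RANSAC separates inliers from mismatches while estimating the essential matrix
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t  # rotation and unit-scale translation between the two frames
```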
Automated design evolution of stereochemically randomized protein foldamers
NASA Astrophysics Data System (ADS)
Ranbhor, Ranjit; Kumar, Anil; Patel, Kirti; Ramakrishnan, Vibin; Durani, Susheel
2018-05-01
Diversification of chain stereochemistry opens up an 'in principle' increase in the design space of proteins. This huge increase in sequence and consequent structural variation is aimed at the generation of smart materials. To diversify protein structure stereochemically, we introduced L- and D-α-amino acids as the design alphabet. With a sequence design algorithm, we explored the use of specific variables such as chirality and the sequence of this alphabet in independent steps. With molecular dynamics, we folded stereochemically diverse homopolypeptides and evaluated their 'fitness' for possible design as protein-like foldamers. We propose a fitness function to select the most optimal fold among 1000 structures simulated with an automated repetitive simulated annealing molecular dynamics (AR-SAMD) approach. The highest-scoring poly-leucine folds, with sequence lengths of 24 and 30 amino acids, were later sequence-optimized using a combined Dead-End Elimination and Monte Carlo optimization tool. This paper demonstrates a novel approach for the de novo design of protein-like foldamers.
Protein sequencing via nanopore based devices: a nanofluidics perspective
NASA Astrophysics Data System (ADS)
Chinappi, Mauro; Cecconi, Fabio
2018-05-01
Proteins perform a huge number of central functions in living organisms, so new techniques allowing their precise, fast and accurate characterization at the single-molecule level represent a major advance for proteomics, with important biomedical impact. In this review, we describe recent progress in the development of nanopore-based devices for protein sequencing. We start with a critical analysis of the main technical requirements for nanopore protein sequencing, summarizing some ideas and methodologies that have recently appeared in the literature. In the last sections, we focus on the physical modelling of the transport phenomena occurring in nanopore-based devices. The multiscale nature of the problem is discussed and, in this respect, some of the main possible computational approaches are illustrated.
Collection and Extraction of Occupational Air Samples for Analysis of Fungal DNA.
Lemons, Angela R; Lindsley, William G; Green, Brett J
2018-05-02
Traditional methods of identifying fungal exposures in occupational environments, such as culture and microscopy-based approaches, have several limitations that have resulted in the exclusion of many species. Advances in the field over the last two decades have led occupational health researchers to turn to molecular-based approaches for identifying fungal hazards. These methods have resulted in the detection of many species within indoor and occupational environments that have not been detected using traditional methods. This protocol details an approach for determining fungal diversity within air samples through genomic DNA extraction, amplification, sequencing, and taxonomic identification of fungal internal transcribed spacer (ITS) regions. ITS sequencing results in the detection of many fungal species that are either not detected or difficult to identify to species level using culture or microscopy. While these methods do not provide quantitative measures of fungal burden, they offer a new approach to hazard identification and can be used to determine overall species richness and diversity within an occupational environment.
Long-range correlations and charge transport properties of DNA sequences
NASA Astrophysics Data System (ADS)
Liu, Xiao-liang; Ren, Yi; Xie, Qiong-tao; Deng, Chao-sheng; Xu, Hui
2010-04-01
Using Hurst analysis and a transfer approach, the rescaled range functions and Hurst exponents of human chromosome 22 and enterobacteria phage lambda DNA sequences are investigated, and the transmission coefficients, Landauer resistances and Lyapunov coefficients of finite segments based on the above genomic DNA sequences are calculated. In a comparison with quasiperiodic and random artificial DNA sequences, we find that λ-DNA exhibits anticorrelation behavior characterized by a Hurst exponent below 0.5.
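A hedged sketch of rescaled range (R/S) analysis applied to a base-indicator series derived from a DNA sequence; an estimated Hurst exponent below 0.5 indicates anticorrelation, above 0.5 long-range persistence. This is a generic implementation with illustrative window sizes, not the paper's code.

```python
import numpy as np

def hurst_rs(x, window_sizes=(16, 32, 64, 128, 256)):
    """Estimate the Hurst exponent of series x via rescaled range analysis."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):
            w = x[start:start + n]
            dev = np.cumsum(w - w.mean())   # cumulative deviation from the window mean
            r = dev.max() - dev.min()       # range
            s = w.std()                     # standard deviation
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(rs_vals)))
    return np.polyfit(log_n, log_rs, 1)[0]  # slope of log(R/S) vs log(n)

seq = "ATCG" * 2000                          # toy sequence
indicator = [1 if b == "A" else 0 for b in seq]
print(round(hurst_rs(indicator), 2))
```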
ERIC Educational Resources Information Center
De Grez, Luc; Valcke, Martin; Roozen, Irene
2014-01-01
The present study focuses on the design and evaluation of an innovative instructional approach to developing oral presentation skills. The intervention builds on the observational learning theoretical perspective. This perspective is contrasted with the traditional training and practice approach. Two sequencing approaches--learners starting with…
van den Oever, Jessica M E; van Minderhout, Ivonne J H M; Harteveld, Cornelis L; den Hollander, Nicolette S; Bakker, Egbert; van der Stoep, Nienke; Boon, Elles M J
2015-09-01
The challenge in noninvasive prenatal diagnosis for monogenic disorders lies in the detection of low levels of fetal variants in the excess of maternal cell-free plasma DNA. Next-generation sequencing, which is the main method used for noninvasive prenatal testing and diagnosis, can overcome this challenge. However, this method may not be accessible to all genetic laboratories. Moreover, shotgun next-generation sequencing as, for instance, currently applied for noninvasive fetal trisomy screening may not be suitable for the detection of inherited mutations. We have developed a sensitive, mutation-specific, and fast alternative for next-generation sequencing-mediated noninvasive prenatal diagnosis using a PCR-based method. For this proof-of-principle study, noninvasive fetal paternally inherited mutation detection was performed using cell-free DNA from maternal plasma. Preferential amplification of the paternally inherited allele was accomplished through a personalized approach using a blocking probe against maternal sequences in a high-resolution melting curve analysis-based assay. Enhanced detection of the fetal paternally inherited mutation was obtained for both an autosomal dominant and a recessive monogenic disorder by blocking the amplification of maternal sequences in maternal plasma. Copyright © 2015 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Sumner, Jeremy G; Taylor, Amelia; Holland, Barbara R; Jarvis, Peter D
2017-12-01
Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well-understood transformation properties (in the case of Markov invariants). While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees. In this paper, by focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework. To motivate the discussion, we present three desirable statistical properties that we argue any invariant-based phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) explicit dependence on the assumption of a continuous-time process. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistically bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference. The binary case is of particular theoretical interest as, in this case only, the Markov invariants can be expressed as linear combinations of the phylogenetic invariants. A wider implication of this is that, for models with more than two states (for example, DNA sequence alignments with four-state models), we find that methods which rely on phylogenetic invariants are incapable of satisfying all three of the stated statistical properties. This is because in these cases the relevant Markov invariants belong to a class of polynomials independent from the phylogenetic invariants.
NASA Astrophysics Data System (ADS)
Lu, Xiaoguang; Xue, Hui; Jolly, Marie-Pierre; Guetter, Christoph; Kellman, Peter; Hsu, Li-Yueh; Arai, Andrew; Zuehlsdorff, Sven; Littmann, Arne; Georgescu, Bogdan; Guehring, Jens
2011-03-01
Cardiac perfusion magnetic resonance imaging (MRI) has proven clinical significance in the diagnosis of heart diseases. However, analysis of perfusion data is time-consuming, and automatic detection of anatomic landmarks and key frames from perfusion MR sequences is helpful for anchoring structures and functional analysis of the heart, leading toward fully automated perfusion analysis. Learning-based object detection methods have demonstrated their capability to handle large variations of the object by exploring a local region, i.e., context. Conventional 2D approaches take into account spatial context only. Temporal signals in perfusion data present a strong cue for anchoring. We propose a joint context model to encode both spatial and temporal evidence. In addition, our spatial context is constructed not only based on the landmark of interest, but also on the landmarks that are correlated in the neighboring anatomies. A discriminative model is learned through a probabilistic boosting tree. A marginal space learning strategy is applied to efficiently learn and search in a high-dimensional parameter space. A fully automatic system is developed to simultaneously detect anatomic landmarks and key frames in both the RV and LV from perfusion sequences. The proposed approach was evaluated on a database of 373 cardiac perfusion MRI sequences from 77 patients. Experimental results of a 4-fold cross-validation show that the proposed joint spatial-temporal approach achieves superior landmark detection accuracy compared with the 2D approach based on spatial context only. The key-frame identification results are promising.
Zemali, El-Amine; Boukra, Abdelmadjid
2015-08-01
The multiple sequence alignment (MSA) problem is one of the most challenging in bioinformatics; it involves discovering similarity between a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography, named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we have introduced a new concept allowing better exploration of the search space: manipulating multiple populations, each with its own parameters. These parameters are used to build up progressive alignments, allowing more diversity. At each iteration, the best found solution is injected into each population. Moreover, to improve solution quality, six operators are defined. These operators are selected with a dynamic probability which changes according to the operators' efficiency. To test the performance of the proposed approach, we considered a set of datasets from BAliBASE 2.0 and compared it with several recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves a better average score than the previously cited methods.
On the design of Henon and logistic map-based random number generator
NASA Astrophysics Data System (ADS)
Magfirawaty; Suryadi, M. T.; Ramli, Kalamullah
2017-10-01
The key sequence is one of the main elements in a cryptosystem. The True Random Number Generator (TRNG) method is one of the approaches to generating the key sequence. The randomness sources of TRNGs are divided into three main groups: electrical noise based, jitter based and chaos based. The chaos-based approach utilizes a non-linear dynamic system (continuous time or discrete time) as an entropy source. In this study, a new design of TRNG based on a discrete-time chaotic system is proposed, which is then simulated in LabVIEW. The principle of the design consists of combining 2D and 1D chaotic systems. A mathematical model is implemented for numerical simulations. We used a comparator process as a harvester method to obtain the series of random bits. Without any post-processing, the proposed design generated a random bit sequence with a high entropy value and passed all NIST SP 800-22 statistical tests.
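A hedged sketch of the combining idea: iterate a 2D Henon map and a 1D logistic map, then harvest one bit per step by comparing the two outputs (a simple comparator). The map parameters, normalization, and comparison rule below are illustrative and are not claimed to reproduce the statistical quality of the published design.

```python
def chaotic_bits(n, x=0.0, y=0.0, z=0.7, a=1.4, b=0.3, r=3.99):
    bits = []
    for _ in range(n):
        x, y = 1 - a * x * x + y, b * x              # Henon map step
        z = r * z * (1 - z)                          # logistic map step
        bits.append(1 if (x + 1.5) / 3.0 > z else 0)  # comparator on normalized Henon x
    return bits

bits = chaotic_bits(10_000)
print(f"fraction of ones: {sum(bits) / len(bits):.3f}")
```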
BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC
Satija, Rahul; Novák, Ádám; Miklós, István; Lyngsø, Rune; Hein, Jotun
2009-01-01
Background We have previously combined statistical alignment and phylogenetic footprinting to detect conserved functional elements without assuming a fixed alignment. Considering a probability-weighted distribution of alignments removes sensitivity to alignment errors, properly accommodates regions of alignment uncertainty, and increases the accuracy of functional element prediction. Our method utilized standard dynamic programming hidden Markov model algorithms to analyze up to four sequences. Results We present a novel approach, implemented in the software package BigFoot, for performing phylogenetic footprinting on greater numbers of sequences. We have developed a Markov chain Monte Carlo (MCMC) approach which samples both sequence alignments and locations of slowly evolving regions. We implement our method as an extension of the existing StatAlign software package and test it on well-annotated regions controlling the expression of the even-skipped gene in Drosophila and the α-globin gene in vertebrates. The results exhibit how adding additional sequences to the analysis has the potential to improve the accuracy of functional predictions, and demonstrate how BigFoot outperforms existing alignment-based phylogenetic footprinting techniques. Conclusion BigFoot extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. Our approach is robust to alignment error and uncertainty and can be applied to a variety of biological datasets. The source code and documentation are publicly available for download from http://www.stats.ox.ac.uk/~satija/BigFoot/. PMID:19715598
BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC.
Satija, Rahul; Novák, Ádám; Miklós, István; Lyngsø, Rune; Hein, Jotun
2009-08-28
We have previously combined statistical alignment and phylogenetic footprinting to detect conserved functional elements without assuming a fixed alignment. Considering a probability-weighted distribution of alignments removes sensitivity to alignment errors, properly accommodates regions of alignment uncertainty, and increases the accuracy of functional element prediction. Our method utilized standard dynamic programming hidden Markov model algorithms to analyze up to four sequences. We present a novel approach, implemented in the software package BigFoot, for performing phylogenetic footprinting on greater numbers of sequences. We have developed a Markov chain Monte Carlo (MCMC) approach which samples both sequence alignments and locations of slowly evolving regions. We implement our method as an extension of the existing StatAlign software package and test it on well-annotated regions controlling the expression of the even-skipped gene in Drosophila and the alpha-globin gene in vertebrates. The results exhibit how adding additional sequences to the analysis has the potential to improve the accuracy of functional predictions, and demonstrate how BigFoot outperforms existing alignment-based phylogenetic footprinting techniques. BigFoot extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. Our approach is robust to alignment error and uncertainty and can be applied to a variety of biological datasets. The source code and documentation are publicly available for download from http://www.stats.ox.ac.uk/~satija/BigFoot/
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Ferro, Myriam; Tardif, Marianne; Reguer, Erwan; Cahuzac, Romain; Bruley, Christophe; Vermat, Thierry; Nugues, Estelle; Vigouroux, Marielle; Vandenbrouck, Yves; Garin, Jérôme; Viari, Alain
2008-05-01
PepLine is a fully automated software tool which maps MS/MS fragmentation spectra of tryptic peptides to genomic DNA sequences. The approach is based on Peptide Sequence Tags (PSTs) obtained from partial interpretation of QTOF MS/MS spectra (first module). PSTs are then mapped onto the six-frame translations of genomic sequences (second module), giving hits. Hits are then clustered to detect potential coding regions (third module). Our work aimed at optimizing the algorithms of each component to allow the whole pipeline to proceed in a fully automated manner using raw nucleic acid sequences (i.e., genomes that have not been "reduced" to a database of ORFs or putative exon sequences). The whole pipeline was tested on controlled MS/MS spectra sets from standard proteins and from Arabidopsis thaliana chloroplast envelope samples. Our results demonstrate that PepLine competed with protein database searching software and was fast enough to potentially tackle large data sets and/or large genomes. We also illustrate the potential of this approach for the detection of the intron/exon structure of genes.
Gulati, Ashima; Somlo, Stefan
2018-05-01
The genesis of whole exome sequencing as a powerful tool for detailing the protein-coding sequence of the human genome was conceptualized based on the availability of next-generation sequencing technology and knowledge of the human reference genome. The field of pediatric nephrology, rich in molecularly unsolved phenotypes, is allowing the clinical and research application of whole exome sequencing to enable novel gene discovery and provide amendment of phenotypic misclassification. Recent studies in the field have informed us that newer high-throughput sequencing techniques are likely to be of high yield when applied in conjunction with conventional genomic approaches such as linkage analysis and other strategies used to focus subsequent analysis. They have also emphasized the need for the validation of novel genetic findings in large collaborative cohorts and the production of robust corroborative biological data. The well-structured application of comprehensive genomic testing in clinical and research arenas will hopefully continue to advance patient care and precision medicine, but does call for attention to be paid to its integrated challenges.
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: a freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.
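A hedged sketch of the selection criterion described above: compute the average mutual information over pairs of MSA columns and keep the candidate alignment that maximizes it. This is a simplified illustration on a toy alignment, not the SAMMI implementation (which, among other things, works on much larger sequence sets).

```python
import math
from collections import Counter
from itertools import combinations

def column_mi(col_a, col_b):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((pa[a] / n) * (pb[b] / n)))
    return mi

def msa_score(msa):
    """Average pairwise column MI; the selected MSA is the argmax over candidates."""
    cols = list(zip(*msa))                 # msa: list of equal-length aligned sequences
    pairs = list(combinations(range(len(cols)), 2))
    return sum(column_mi(cols[i], cols[j]) for i, j in pairs) / len(pairs)

toy_msa = ["MKTAYIAKQR", "MKTAHIAKQR", "MRTAYIAKNR"]
print(round(msa_score(toy_msa), 3))
```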
Using cellular automata to generate image representation for biological sequences.
Xiao, X; Shao, S; Ding, Y; Huang, Z; Chen, X; Chou, K-C
2005-02-01
A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a class of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into digital codes, and using some optimal space-time evolution rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With the number of biological sequences entering databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
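A hedged sketch of the general idea: map bases to binary codes and evolve an elementary cellular automaton so that each row of the resulting image is one time step. The base encoding and rule number below are illustrative assumptions, not the published choices.

```python
import numpy as np

BASE_CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}  # assumed encoding

def ca_image(seq, rule=84, steps=64):
    # initial row: the digital code of the sequence
    row = np.array([b for base in seq for b in BASE_CODE[base]], dtype=np.uint8)
    rule_bits = [(rule >> i) & 1 for i in range(8)]   # Wolfram rule lookup table
    image = [row]
    for _ in range(steps - 1):
        left, right = np.roll(row, 1), np.roll(row, -1)
        neighborhood = 4 * left + 2 * row + right     # value 0..7 per cell
        row = np.array([rule_bits[v] for v in neighborhood], dtype=np.uint8)
        image.append(row)
    return np.vstack(image)                           # steps x (2 * len(seq)) image

img = ca_image("ACGTTGCA" * 8)
print(img.shape)
```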
Blank-Landeshammer, Bernhard; Kollipara, Laxmikanth; Biß, Karsten; Pfenninger, Markus; Malchow, Sebastian; Shuvaev, Konstantin; Zahedi, René P; Sickmann, Albert
2017-09-01
Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, is oftentimes the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as a lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms which, in conjunction with minor tweaks in data acquisition, yields a lower empirical FDR compared to analysis using single algorithms. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11,120 PSMs (combined) instead of 3,476 PSMs (alone) in triplicate 2 h LC-MS runs of a tryptic HeLa digest.
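A hedged sketch of one simple way to combine de novo results (not necessarily the authors' exact rule): keep a peptide-spectrum match only when at least two algorithms propose the same peptide for the same spectrum. The input dictionaries and identifiers are placeholders.

```python
from collections import defaultdict

def consensus_psms(results_per_algorithm, min_agree=2):
    """Keep peptides proposed by at least `min_agree` algorithms per spectrum."""
    votes = defaultdict(lambda: defaultdict(int))   # spectrum -> peptide -> votes
    for algo_results in results_per_algorithm:
        for spectrum_id, peptide in algo_results.items():
            votes[spectrum_id][peptide] += 1
    return {s: max(p, key=p.get) for s, p in votes.items() if max(p.values()) >= min_agree}

algo_a = {"scan_001": "PEPTIDER", "scan_002": "LVNELTEFAK"}
algo_b = {"scan_001": "PEPTIDER", "scan_002": "LVNEITEFAK"}   # I/L ambiguity
algo_c = {"scan_001": "PEPTLDER", "scan_002": "LVNELTEFAK"}
print(consensus_psms([algo_a, algo_b, algo_c]))
```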
Team-based learning in therapeutics workshop sessions.
Beatty, Stuart J; Kelley, Katherine A; Metzger, Anne H; Bellebaum, Katherine L; McAuley, James W
2009-10-01
To implement team-based learning in the workshop portion of a pathophysiology and therapeutics sequence of courses to promote integration of concepts across the pharmacy curriculum, provide a consistent problem-solving approach to patient care, and determine the impact on student perceptions of professionalism and teamwork. Team-based learning was incorporated into the workshop portion of 3 of 6 pathophysiology and therapeutics courses. Assignments that promoted team-building and application of key concepts were created. Readiness assurance tests were used to assess individual and team understanding of course materials. Students consistently scored 20% higher on team assessments compared with individual assessments. Mean professionalism and teamwork scores were significantly higher after implementation of team-based learning; however, this improvement was not considered educationally significant. Approximately 91% of students felt team-based learning improved understanding of course materials and 93% of students felt teamwork should continue in workshops. Team-based learning is an effective teaching method to ensure a consistent approach to problem-solving and curriculum integration in workshop sessions for a pathophysiology and therapeutics course sequence.
Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding.
Lan, Freeman; Demaree, Benjamin; Ahmed, Noorsher; Abate, Adam R
2017-07-01
The application of single-cell genome sequencing to large cell populations has been hindered by technical challenges in isolating single cells during genome preparation. Here we present single-cell genomic sequencing (SiC-seq), which uses droplet microfluidics to isolate, fragment, and barcode the genomes of single cells, followed by Illumina sequencing of pooled DNA. We demonstrate ultra-high-throughput sequencing of >50,000 cells per run in a synthetic community of Gram-negative and Gram-positive bacteria and fungi. The sequenced genomes can be sorted in silico based on characteristic sequences. We use this approach to analyze the distributions of antibiotic-resistance genes, virulence factors, and phage sequences in microbial communities from an environmental sample. The ability to routinely sequence large populations of single cells will enable the de-convolution of genetic heterogeneity in diverse cell populations.
Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads
Gautier, Laurent; Lund, Ole
2013-01-01
Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture, we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a Python script. Both are able to handle a large number of sequencing reads, run on portable devices (the browser-based client running on a tablet), perform their task within seconds, and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could develop in the future, allowing fully automated processing of sequencing data and routine instant quality checks of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc. PMID:24391826
Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads.
Gautier, Laurent; Lund, Ole
2013-01-01
Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture, we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a Python script. Both are able to handle a large number of sequencing reads, run on portable devices (the browser-based client running on a tablet), perform their task within seconds, and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could develop in the future, allowing fully automated processing of sequencing data and routine instant quality checks of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc.
RAD tag sequencing as a source of SNP markers in Cynara cardunculus L
2012-01-01
Background The globe artichoke (Cynara cardunculus L. var. scolymus) genome is relatively poorly explored, especially compared to those of the other major Asteraceae crops, sunflower and lettuce. No SNP markers are in the public domain. We have combined the recently developed restriction-site associated DNA (RAD) approach with the Illumina DNA sequencing platform to effect the rapid and mass discovery of SNP markers for C. cardunculus. Results RAD tags were sequenced from the genomic DNA of three C. cardunculus mapping population parents, generating 9.7 million reads, corresponding to ~1 Gbp of sequence. An assembly based on paired ends produced ~6.0 Mbp of genomic sequence, separated into ~19,000 contigs (mean length 312 bp), of which ~21% were fragments of putative coding sequence. The shared sequences allowed for the discovery of ~34,000 SNPs and nearly 800 indels, equivalent to a SNP frequency of 5.6 per 1,000 nt and an indel frequency of 0.2 per 1,000 nt. A sample of heterozygous SNP loci was mapped by CAPS assays, and this exercise provided validation of our mining criteria. The repetitive fraction of the genome had a high representation of retrotransposon sequence, followed by simple repeats, AT-low-complexity regions and mobile DNA elements. The genomic k-mer distribution and CpG rate of C. cardunculus, compared with data derived from three whole-genome-sequenced dicot species, provided further evidence of the random representation of the C. cardunculus genome generated by RAD sampling. Conclusion The RAD tag sequencing approach is a cost-effective and rapid method to develop SNP markers in a highly heterozygous species. Our approach permitted the generation of a large and robust SNP dataset through the adoption of optimized filtering criteria. PMID:22214349
Mutation detection using automated fluorescence-based sequencing.
Montgomery, Kate T; Iartchouck, Oleg; Li, Li; Perera, Anoja; Yassin, Yosuf; Tamburino, Alex; Loomis, Stephanie; Kucherlapati, Raju
2008-04-01
The development of high-throughput DNA sequencing techniques has made direct DNA sequencing of PCR-amplified genomic DNA a rapid and economical approach to the identification of polymorphisms that may play a role in disease. Point mutations as well as small insertions or deletions are readily identified by DNA sequencing. The mutations may be heterozygous (occurring in one allele while the other allele retains the normal sequence) or homozygous (occurring in both alleles). Sequencing alone cannot discriminate between true homozygosity and apparent homozygosity caused by the loss of one allele through a large deletion. In this unit, strategies are presented for using PCR amplification and automated fluorescence-based sequencing to identify sequence variation. The size of the project and laboratory preference and experience will dictate how the data are managed and which software tools are used for analysis. A high-throughput protocol is given that has been used to search for mutations in over 200 different genes at the Harvard Medical School - Partners Center for Genetics and Genomics (HPCGG, http://www.hpcgg.org/). Copyright 2008 by John Wiley & Sons, Inc.
Rapid and Easy Protocol for Quantification of Next-Generation Sequencing Libraries.
Hawkins, Steve F C; Guest, Paul C
2018-01-01
The emergence of next-generation sequencing (NGS) over the last 10 years has increased the efficiency of DNA sequencing in terms of speed, ease, and price. However, exact quantification of an NGS library is crucial in order to obtain good data on the sequencing platforms developed by the current market leader, Illumina. Different approaches for DNA quantification are currently available, and the most commonly used are based on analysis of the physical properties of the DNA through spectrophotometric or fluorometric methods. Although these methods are technically simple, they do not allow quantification as exact as that achieved using a real-time quantitative PCR (qPCR) approach. A qPCR protocol for DNA quantification with applications in NGS library preparation studies is presented here. This can be applied in various fields of study, such as medical disorders resulting from nutritional programming disturbances.
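A hedged sketch of the qPCR quantification arithmetic: fit a standard curve of Cq versus log10 concentration from dilution standards, then interpolate an unknown library. All numeric values below are invented for illustration and a dilution factor would still need to be applied.

```python
import numpy as np

std_conc_pm = np.array([20.0, 2.0, 0.2, 0.02])   # standard dilutions, in pM
std_cq = np.array([12.1, 15.5, 18.9, 22.3])      # measured Cq values

# linear fit: Cq = slope * log10(conc) + intercept
slope, intercept = np.polyfit(np.log10(std_conc_pm), std_cq, 1)
efficiency = 10 ** (-1 / slope) - 1               # ~1.0 means 100% efficient PCR

unknown_cq = 17.2
unknown_conc = 10 ** ((unknown_cq - intercept) / slope)
print(f"efficiency ~ {efficiency:.2f}, library ~ {unknown_conc:.2f} pM (before dilution factor)")
```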
Telesign: a videophone system for sign language distant communication
NASA Astrophysics Data System (ADS)
Mozelle, Gerard; Preteux, Francoise J.; Viallet, Jean-Emmanuel
1998-09-01
This paper presents a low bit rate videophone system for deaf people communicating by means of sign language. Classic video conferencing systems have focused on head-and-shoulders sequences, which are not well suited for sign language video transmission since hearing-impaired people also use their hands and arms to communicate. To address the above-mentioned functionality, we have developed a two-step content-based video coding system based on: (1) a segmentation step, in which four or five video objects (VOs) are extracted using a cooperative approach between color-based and morphological segmentation; and (2) a coding step, in which the VOs are encoded using a standardized MPEG-4 video toolbox. Results on encoded sign language video sequences, presented for three target bit rates (32 kbits/s, 48 kbits/s and 64 kbits/s), demonstrate the efficiency of the approach presented in this paper.
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T.; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J.; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A.; Lempicki, Richard A.; Huang, Da Wei
2013-01-01
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-niche sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by the first- and second-generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons were sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads have already been error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results. PMID:24179701
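A hedged sketch of SVM-based read QC: train a support vector classifier on per-read QC parameters (e.g., read length, number of passes, mean base quality) labeled by whether the aligned read fell below an error threshold, then keep only reads predicted to be low-error. Features, labels and the feature count are placeholders, not the study's parameter set.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train = rng.normal(size=(300, 3))        # per-read QC parameters (placeholder)
y_train = rng.integers(0, 2, size=300)     # 1 = low-error read, 0 = high-error read

qc_model = SVC(kernel="rbf").fit(X_train, y_train)

X_new = rng.normal(size=(5, 3))            # QC parameters of new CCS reads
keep = qc_model.predict(X_new) == 1
print(f"kept {keep.sum()} of {len(keep)} CCS reads")
```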
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS.
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A; Lempicki, Richard A; Huang, Da Wei
2013-07-31
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-niche sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by the first- and second-generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons were sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads have already been error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results.
Porter, Teresita M.; Golding, G. Brian
2012-01-01
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, also in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross-validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short-read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215
Choe, Se-Eun; Nguyen, Thuy Thi-Dieu; Kang, Tae-Gyu; Kweon, Chang-Hee; Kang, Seung-Won
2011-09-01
The nuclear ribosomal DNA sequence of the second internal transcribed spacer (ITS-2) has been used efficiently to identify liver fluke species collected from different hosts and various geographic regions. ITS-2 sequences of 19 Fasciola samples collected from Korean native cattle were determined and compared. Sequence comparison, including the ITS-2 sequences of isolates from this study and reference sequences from Fasciola hepatica, Fasciola gigantica and intermediate Fasciola in GenBank, revealed seven variable sites among the investigated isolates. Among the 19 samples, 12 individuals had ITS-2 sequences completely identical to that of pure F. hepatica, five possessed sequences identical to the F. gigantica type, whereas two shared the sequences of both F. hepatica and F. gigantica. No variations in length or nucleotide composition of the ITS-2 sequence were observed within isolates that belonged to F. hepatica or F. gigantica. At position 218, five Fasciola isolates containing a single-base substitution (C>T) formed a distinct branch inside the F. gigantica-type group, similar to Asian-origin isolates. The phylogenetic tree of Fasciola spp. based on complete ITS-2 sequences from this study and other representative isolates from different locations clearly showed that pure F. hepatica, the F. gigantica type and intermediate Fasciola were observed. The result also provides additional genetic evidence for the existence of three forms of Fasciola isolated from native cattle in Korea by a genetic approach using ITS-2 sequences.
Numerical study on the sequential Bayesian approach for radioactive materials detection
NASA Astrophysics Data System (ADS)
Qingpei, Xiang; Dongfeng, Tian; Jianyu, Zhu; Fanhua, Hao; Ge, Ding; Jun, Zeng
2013-01-01
A new detection method, based on the sequential Bayesian approach proposed by Candy et al., offers new horizons for research on radioactive material detection. Compared with commonly adopted detection methods based on statistical theory, the sequential Bayesian approach offers the advantage of shorter verification times during the analysis of spectra that contain low total counts, especially with complex radionuclide components. In this paper, a simulation experiment platform implementing the methodology of the sequential Bayesian approach was developed. Event sequences of γ-rays associated with the true parameters of a LaBr3(Ce) detector were obtained from an event-sequence generator based on Monte Carlo sampling theory to study the performance of the sequential Bayesian approach. The numerical experimental results are in accordance with those of Candy. Moreover, the relationship between the detection model and the event generator, represented respectively by the expected detection rate (Am) and the tested detection rate (Gm) parameters, is investigated. To achieve optimal performance for this processor, the interval of the tested detection rate as a function of the expected detection rate is also presented.
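A hedged sketch of sequential Bayesian detection on event inter-arrival times: two Poisson hypotheses (background only vs. background plus source) are compared, and the posterior probability of "source present" is updated after every event. The rates and the simulated event stream are illustrative and do not reproduce the paper's detector model.

```python
import math
import random

def sequential_posterior(inter_arrival_times, rate_bkg=5.0, rate_src=8.0, prior=0.5):
    """Posterior P(source present) after each event, via a log-odds update."""
    log_odds = math.log(prior / (1 - prior))
    posteriors = []
    for dt in inter_arrival_times:
        # exponential inter-arrival likelihoods under each hypothesis
        log_like_src = math.log(rate_src) - rate_src * dt
        log_like_bkg = math.log(rate_bkg) - rate_bkg * dt
        log_odds += log_like_src - log_like_bkg
        posteriors.append(1 / (1 + math.exp(-log_odds)))
    return posteriors

random.seed(0)
events = [random.expovariate(8.0) for _ in range(200)]   # simulate a true source
print(f"posterior after 200 events: {sequential_posterior(events)[-1]:.3f}")
```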
Standish, Kristopher A; Carland, Tristan M; Lockwood, Glenn K; Pfeiffer, Wayne; Tatineni, Mahidhar; Huang, C Chris; Lamberth, Sarah; Cherkas, Yauheniya; Brodmerkel, Carrie; Jaeger, Ed; Smith, Lance; Rajagopal, Gunaretnam; Curran, Mark E; Schork, Nicholas J
2015-09-22
Next-generation sequencing (NGS) technologies have become much more efficient, allowing whole human genomes to be sequenced faster and cheaper than ever before. However, processing the raw sequence reads associated with NGS technologies requires care and sophistication in order to draw compelling inferences about phenotypic consequences of variation in human genomes. It has been shown that different approaches to variant calling from NGS data can lead to different conclusions. Ensuring appropriate accuracy and quality in variant calling can come at a computational cost. We describe our experience implementing and evaluating a group-based approach to calling variants on large numbers of whole human genomes. We explore the influence of many factors that may impact the accuracy and efficiency of group-based variant calling, including group size, the biogeographical backgrounds of the individuals who have been sequenced, and the computing environment used. We make efficient use of the Gordon supercomputer cluster at the San Diego Supercomputer Center by incorporating job-packing and parallelization considerations into our workflow while calling variants on 437 whole human genomes generated as part of a large association study. We ultimately find that our workflow resulted in high-quality variant calls in a computationally efficient manner. We argue that studies like ours should motivate further investigations combining hardware-oriented advances in computing systems with algorithmic developments to tackle emerging 'big data' problems in biomedical research brought on by the expansion of NGS technologies.
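The job-packing idea can be illustrated with a generic, hedged sketch (not the authors' pipeline): many per-genome tasks are packed onto a fixed pool of worker processes so each node stays saturated; call_variants is a placeholder for the real per-genome variant-calling command, and the worker and chunk counts are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def call_variants(sample_id):
    # placeholder for the real variant-calling step for one genome
    return f"{sample_id}: done"

def main():
    samples = [f"genome_{i:03d}" for i in range(437)]   # 437 genomes, as in the abstract
    # pack many small jobs onto a fixed number of workers per node
    with ProcessPoolExecutor(max_workers=16) as pool:
        for result in pool.map(call_variants, samples, chunksize=8):
            pass

if __name__ == "__main__":
    main()
```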
The evolution processes of DNA sequences, languages and carols
NASA Astrophysics Data System (ADS)
Hauck, Jürgen; Henkel, Dorothea; Mika, Klaus
2001-04-01
The sequences of bases A, T, C and G of about 100 enolase, secA and cytochrome DNA sequences were analyzed for attractive or repulsive interactions via the numbers T1, T2, T3 of nearest, next-nearest and third-neighbor bases of the same kind, and the concentration r = (other bases)/(analyzed base). The area of possible (T1, T2) values is limited by the linear borders T2 = 2T1 - 2, T2 = 0 or T1 = 0 for clustering, attractive or repulsive interactions, and by the border T2 = -2T1 + 2(2 - r) for a variation from repulsive to attractive interactions at r ≤ 2. Clustering is preferred by most bases in sequences of enolases and secA's. Major deviations, with repulsive interactions of some bases, are observed for archaea in secA and for highly developed animals and the human species in enolase sequences. The borders of the structure map for enthalpy-stabilized structures with maximum interactions are approached in a few cases. Most letters of the natural languages and some music notes are at the borders of the structure map.
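A hedged sketch of the neighbor statistics described above: for a chosen base, count how often the positions one, two and three steps downstream hold the same base, together with the ratio r. The normalization used here (per occurrence of the analyzed base) is an assumption for illustration and may differ from the paper's definition of T1, T2, T3.

```python
def neighbor_stats(seq, base="A"):
    """Per-occurrence same-base counts at lags 1..3 and the ratio r for one base."""
    positions = [i for i, b in enumerate(seq) if b == base]
    n = len(positions)
    stats = {}
    for k in (1, 2, 3):
        same = sum(1 for i in positions if i + k < len(seq) and seq[i + k] == base)
        stats[f"T{k}"] = same / n
    stats["r"] = (len(seq) - n) / n       # other bases per analyzed base
    return stats

print(neighbor_stats("ATCGATAAGCTTAA" * 10, base="A"))
```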
Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)
Mascher, Martin; Muehlbauer, Gary J; Rokhsar, Daniel S; Chapman, Jarrod; Schmutz, Jeremy; Barry, Kerrie; Muñoz-Amatriaín, María; Close, Timothy J; Wise, Roger P; Schulman, Alan H; Himmelbach, Axel; Mayer, Klaus FX; Scholz, Uwe; Poland, Jesse A; Stein, Nils; Waugh, Robbie
2013-01-01
Next-generation whole-genome shotgun assemblies of complex genomes are highly useful, but fail to link nearby sequence contigs with each other or provide a linear order of contigs along individual chromosomes. Here, we introduce a strategy based on sequencing progeny of a segregating population that allows de novo production of a genetically anchored linear assembly of the gene space of an organism. We demonstrate the power of the approach by reconstructing the chromosomal organization of the gene space of barley, a large, complex and highly repetitive 5.1 Gb genome. We evaluate the robustness of the new assembly by comparison to a recently released physical and genetic framework of the barley genome, and to various genetically ordered sequence-based genotypic datasets. The method is independent of the need for any prior sequence resources, and will enable rapid and cost-efficient establishment of powerful genomic information for many species. PMID:23998490
Length-independent structural similarities enrich the antibody CDR canonical class model.
Nowak, Jaroslaw; Baker, Terry; Georges, Guy; Kelm, Sebastian; Klostermann, Stefan; Shi, Jiye; Sridharan, Sudharsan; Deane, Charlotte M
2016-01-01
Complementarity-determining regions (CDRs) are antibody loops that make up the antigen binding site. Here, we show that all CDR types have structurally similar loops of different lengths. Based on these findings, we created length-independent canonical classes for the non-H3 CDRs. Our length variable structural clusters show strong sequence patterns suggesting either that they evolved from the same original structure or result from some form of convergence. We find that our length-independent method not only clusters a larger number of CDRs, but also predicts canonical class from sequence better than the standard length-dependent approach. To demonstrate the usefulness of our findings, we predicted cluster membership of CDR-L3 sequences from 3 next-generation sequencing datasets of the antibody repertoire (over 1,000,000 sequences). Using the length-independent clusters, we can structurally classify an additional 135,000 sequences, which represents a ∼20% improvement over the standard approach. This suggests that our length-independent canonical classes might be a highly prevalent feature of antibody space, and could substantially improve our ability to accurately predict the structure of novel CDRs identified by next-generation sequencing.
Functional interrogation of non-coding DNA through CRISPR genome editing
Canver, Matthew C.; Bauer, Daniel E.; Orkin, Stuart H.
2017-01-01
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. PMID:28288828
Real-Time Culture Change Improves Lean Success: Sequenced Culture Change Gets Failing Grades.
Kusy, Mitchell; Diamond, Marty; Vrchota, Scott
2015-01-01
Success with the Lean management system is rooted in a culture of stakeholder engagement and commitment. Unfortunately, many leaders view Lean as an "add-on" tool instead of one that requires a new way of thinking and approaching culture. This article addresses the "why, how, and what" of promoting a Lean culture that works. We present a five-phased approach grounded in evidence-based practices of real-time culture change. We further help healthcare leaders understand the differences between traditional "sequenced" approaches to culture change and "real-time" methods, and why these real-time practices are more sustainable and ultimately more successful than traditional culture change methods.
Yeast One-Hybrid Gγ Recruitment System for Identification of Protein Lipidation Motifs
Fukuda, Nobuo; Doi, Motomichi; Honda, Shinya
2013-01-01
Fatty acids and isoprenoids can be covalently attached to a variety of proteins. These lipid modifications regulate protein structure, localization and function. Here, we describe a yeast one-hybrid approach based on the Gγ recruitment system that is useful for identifying sequence motifs that influence lipid modification to recruit proteins to the plasma membrane. Our approach facilitates the isolation of yeast cells expressing lipid-modified proteins via a simple and easy growth selection assay utilizing G-protein signaling that induces diploid formation. In the current study, we selected the N-terminal sequence of Gα subunits as a model case to investigate dual lipid modification, i.e., myristoylation and palmitoylation, a modification that is widely conserved from yeast to higher eukaryotes. Our results suggest that both lipid modifications are required for restoration of G-protein signaling. Although we could not differentiate between myristoylation and palmitoylation, N-terminal positions 7 and 8 play a critical role. Moreover, we tested the preference for specific amino-acid residues at positions 7 and 8 using library-based screening. This new approach will be useful to explore protein-lipid associations and to determine the corresponding sequence motifs. PMID:23922919
Soldà, Giulia; Merlino, Giuseppe; Fina, Emanuela; Brini, Elena; Moles, Anna; Cappelletti, Vera; Daidone, Maria Grazia
2016-01-01
Numerous studies have reported the existence of tumor-promoting cells (TPC) with self-renewal potential and a relevant role in drug resistance. However, pathways and modifications involved in the maintenance of such tumor subpopulations are still only partially understood. Sequencing-based approaches offer the opportunity for a detailed study of TPC including their transcriptome modulation. Using microarrays and RNA sequencing approaches, we compared the transcriptional profiles of parental MCF7 breast cancer cells with MCF7-derived TPC (i.e. MCFS). Data were explored using different bioinformatic approaches, and major findings were experimentally validated. The different analytical pipelines (Lifescope and Cufflinks based) yielded similar although not identical results. RNA sequencing data partially overlapped microarray results and displayed a higher dynamic range, although overall the two approaches concordantly predicted pathway modifications. Several biological functions were altered in TPC, ranging from production of inflammatory cytokines (i.e., IL-8 and MCP-1) to proliferation and response to steroid hormones. More than 300 non-coding RNAs were defined as differentially expressed, and 2,471 potential splicing events were identified. A consensus signature of genes up-regulated in TPC was derived and was found to be significantly associated with insensitivity to fulvestrant in a public breast cancer patient dataset. Overall, we obtained a detailed portrait of the transcriptome of a breast cancer TPC line, highlighted the role of non-coding RNAs and differential splicing, and identified a gene signature with a potential as a context-specific biomarker in patients receiving endocrine treatment. PMID:26556871
Khodakov, Dmitriy; Wang, Chunyan; Zhang, David Yu
2016-10-01
Nucleic acid sequence variations have been implicated in many diseases, and reliable detection and quantitation of DNA/RNA biomarkers can inform effective therapeutic action, enabling precision medicine. Nucleic acid analysis technologies being translated into the clinic can broadly be classified into hybridization, PCR, and sequencing, as well as their combinations. Here we review the molecular mechanisms of popular commercial assays, and their progress in translation into in vitro diagnostics. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
BATTLE: Biomarker-Based Approaches of Targeted Therapy for Lung Cancer Elimination
2008-04-01
… a grade 3 neutropenia was dose-limiting in one … substrate of the CYP3A4 isoenzyme and P-gp; its metabolism is sensitive to … stratification in clinical … Molecular pathway biomarkers and type of analysis: EGFR, EGFR mutation (exons 18 to 21), DNA sequencing; EGFR, increased copy number (polysomy/amplification), DNA FISH; K-Ras/B-Raf, K-RAS mutation (codons 12, 13, 61), DNA sequencing; K-Ras/B-Raf, B-RAF mutations (exons 11 and 15), DNA sequencing.
Object tracking using plenoptic image sequences
NASA Astrophysics Data System (ADS)
Kim, Jae Woo; Bae, Seong-Joon; Park, Seongjin; Kim, Do Hyung
2017-05-01
Object tracking is a very important problem in computer vision research. Among the difficulties of object tracking, the partial occlusion problem is one of the most serious and challenging. To address this problem, we propose novel approaches to object tracking on plenoptic image sequences. Our approaches take advantage of the refocusing capability that plenoptic images provide, and take as input sequences of focal stacks constructed from plenoptic image sequences. The proposed image selection algorithms select, from the sequence of focal stacks, the sequence of optimal images that maximizes tracking accuracy. A focus measure approach and a confidence measure approach were proposed for image selection, and both were validated in experiments using thirteen plenoptic image sequences that include heavily occluded target objects. The experimental results showed that the proposed approaches performed satisfactorily compared to conventional 2D object tracking algorithms.
Mapping copy number variation by population-scale genome sequencing.
Mills, Ryan E; Walter, Klaudia; Stewart, Chip; Handsaker, Robert E; Chen, Ken; Alkan, Can; Abyzov, Alexej; Yoon, Seungtai Chris; Ye, Kai; Cheetham, R Keira; Chinwalla, Asif; Conrad, Donald F; Fu, Yutao; Grubert, Fabian; Hajirasouliha, Iman; Hormozdiari, Fereydoun; Iakoucheva, Lilia M; Iqbal, Zamin; Kang, Shuli; Kidd, Jeffrey M; Konkel, Miriam K; Korn, Joshua; Khurana, Ekta; Kural, Deniz; Lam, Hugo Y K; Leng, Jing; Li, Ruiqiang; Li, Yingrui; Lin, Chang-Yun; Luo, Ruibang; Mu, Xinmeng Jasmine; Nemesh, James; Peckham, Heather E; Rausch, Tobias; Scally, Aylwyn; Shi, Xinghua; Stromberg, Michael P; Stütz, Adrian M; Urban, Alexander Eckehart; Walker, Jerilyn A; Wu, Jiantao; Zhang, Yujun; Zhang, Zhengdong D; Batzer, Mark A; Ding, Li; Marth, Gabor T; McVean, Gil; Sebat, Jonathan; Snyder, Michael; Wang, Jun; Ye, Kenny; Eichler, Evan E; Gerstein, Mark B; Hurles, Matthew E; Lee, Charles; McCarroll, Steven A; Korbel, Jan O
2011-02-03
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
Christiansen, Anders; Kringelum, Jens V; Hansen, Christian S; Bøgh, Katrine L; Sullivan, Eric; Patel, Jigar; Rigby, Neil M; Eiwegger, Thomas; Szépfalusi, Zsolt; de Masi, Federico; Nielsen, Morten; Lund, Ole; Dufva, Martin
2015-08-06
Phage display is a prominent screening technique with a multitude of applications including therapeutic antibody development and mapping of antigen epitopes. In this study, phages were selected based on their interaction with patient serum and exhaustively characterised by high-throughput sequencing. A bioinformatics approach was developed in order to identify peptide motifs of interest based on clustering and contrasting to control samples. Comparison of patient and control samples confirmed a major issue in phage display, namely the selection of unspecific peptides. The potential of the bioinformatic approach was demonstrated by identifying epitopes of a prominent peanut allergen, Ara h 1, in sera from patients with severe peanut allergy. The identified epitopes were confirmed by high-density peptide micro-arrays. The present study demonstrates that high-throughput sequencing can empower phage display by (i) enabling the analysis of complex biological samples, (ii) circumventing the traditional laborious picking and functional testing of individual phage clones and (iii) reducing the number of selection rounds.
Lorén, J. Gaspar; Farfán, Maribel; Fusté, M. Carmen
2014-01-01
Several approaches have been developed to estimate both the relative and absolute rates of speciation and extinction within clades based on molecular phylogenetic reconstructions of evolutionary relationships, according to an underlying model of diversification. However, the macroevolutionary models established for eukaryotes have scarcely been used with prokaryotes. We have investigated the rate and pattern of cladogenesis in the genus Aeromonas (γ-Proteobacteria, Proteobacteria, Bacteria) using the sequences of five housekeeping genes and an uncorrelated relaxed-clock approach. To our knowledge, until now this analysis has never been applied to all the species described in a bacterial genus and thus opens up the possibility of establishing models of speciation from sequence data commonly used in phylogenetic studies of prokaryotes. Our results suggest that the genus Aeromonas began to diverge between 248 and 266 million years ago, exhibiting a constant divergence rate through the Phanerozoic, which could be described as a pure birth process. PMID:24586399
Hahn, Lars; Leimeister, Chris-André; Ounit, Rachid; Lonardi, Stefano; Morgenstern, Burkhard
2016-10-01
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
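As a concrete illustration of the pattern-based word matching described above, the following Python sketch extracts spaced words from two DNA sequences under a binary match/don't-care pattern and counts the matching word pairs. The pattern and sequences are arbitrary examples; this is not code from rasbhari, whose contribution is the optimization of whole pattern sets rather than the counting itself.

```python
# Illustrative sketch (not rasbhari code): extract "spaced words" from two DNA
# sequences using a binary pattern of match (1) and don't-care (0) positions,
# then count the number of spaced-word matches between them.

def spaced_words(seq, pattern):
    """Return a dict of spaced words obtained by sliding the pattern over seq."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    words = {}
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        words[word] = words.get(word, 0) + 1
    return words

def spaced_match_count(seq1, seq2, pattern):
    """Number of position pairs whose spaced words are identical."""
    w1, w2 = spaced_words(seq1, pattern), spaced_words(seq2, pattern)
    return sum(n * w2.get(word, 0) for word, n in w1.items())

if __name__ == "__main__":
    pattern = "1100101"          # 4 match positions, 3 don't-care positions
    s1 = "ACGTACGTGGCATTAGC"
    s2 = "ACGAACGTGGCTTTAGC"
    print(spaced_match_count(s1, s2, pattern))
```

The variance of this match count across evolutionarily related sequence pairs is the quantity that, according to the abstract, depends on the chosen pattern set and can be minimized by rasbhari.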
Identification of Functionally Related Enzymes by Learning-to-Rank Methods.
Stock, Michiel; Fober, Thomas; Hüllermeier, Eyke; Glinca, Serghei; Klebe, Gerhard; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem
2014-01-01
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work, we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
Mizas, Ch; Sirakoulis, G Ch; Mardiris, V; Karafyllidis, I; Glykos, N; Sandaltzopoulos, R
2008-04-01
Change of DNA sequence that fuels evolution is, to a certain extent, a deterministic process because mutagenesis does not occur in an absolutely random manner. So far, it has not been possible to decipher the rules that govern DNA sequence evolution due to the extreme complexity of the entire process. In our attempt to approach this issue we focus solely on the mechanisms of mutagenesis and deliberately disregard the role of natural selection. Hence, in this analysis, evolution refers to the accumulation of genetic alterations that originate from mutations and are transmitted through generations without being subjected to natural selection. We have developed a software tool that allows modelling of a DNA sequence as a one-dimensional cellular automaton (CA) with four states per cell which correspond to the four DNA bases, i.e. A, C, T and G. The four states are represented by numbers of the quaternary number system. Moreover, we have developed genetic algorithms (GAs) in order to determine the rules of CA evolution that simulate the DNA evolution process. Linear evolution rules were considered and square matrices were used to represent them. If DNA sequences of different evolution steps are available, our approach allows the determination of the underlying evolution rule(s). Conversely, once the evolution rules are deciphered, our tool may reconstruct the DNA sequence in any previous evolution step for which the exact sequence information was unknown. The developed tool may be used to test various parameters that could influence evolution. We describe a paradigm relying on the assumption that mutagenesis is governed by a near-neighbour-dependent mechanism. Based on the satisfactory performance of our system in the deliberately simplified example, we propose that our approach could offer a starting point for future attempts to understand the mechanisms that govern evolution. The developed software is open-source and has a user-friendly graphical input interface.
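The core representation described above, a DNA sequence as a one-dimensional four-state cellular automaton updated by near-neighbour-dependent rules, can be sketched as follows. The rule used here is a made-up linear rule over the quaternary digits, purely for illustration; the authors' software instead searches for the actual rules (represented as square matrices) using genetic algorithms.

```python
# Minimal illustration (not the authors' software): a DNA sequence as a
# one-dimensional cellular automaton with four states (A, C, T, G mapped to
# the quaternary digits 0-3), evolved by a hypothetical near-neighbour rule.

BASE_TO_DIGIT = {"A": 0, "C": 1, "T": 2, "G": 3}
DIGIT_TO_BASE = "ACTG"

def step(states, rule):
    """One CA step: each cell's next state depends on (left, self, right)."""
    n = len(states)
    return [rule(states[(i - 1) % n], states[i], states[(i + 1) % n])
            for i in range(n)]

def toy_rule(left, centre, right):
    # Hypothetical linear rule over the quaternary digits, for illustration only.
    return (left + 2 * centre + 3 * right) % 4

def evolve(dna, steps):
    states = [BASE_TO_DIGIT[b] for b in dna]
    for _ in range(steps):
        states = step(states, toy_rule)
    return "".join(DIGIT_TO_BASE[s] for s in states)

print(evolve("ACGTTGCAACGT", steps=3))
```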
Gardiner, Laura-Jayne; Gawroński, Piotr; Olohan, Lisa; Schnurbusch, Thorsten; Hall, Neil; Hall, Anthony
2014-12-01
Mapping-by-sequencing analyses have largely required a complete reference sequence and employed whole genome re-sequencing. In species such as wheat, no finished genome reference sequence is available. Additionally, because of its large genome size (17 Gb), re-sequencing at sufficient depth of coverage is not practical. Here, we extend the utility of mapping by sequencing, developing a bespoke pipeline and algorithm to map an early-flowering locus in einkorn wheat (Triticum monococcum L.), which is closely related to the A genome progenitor of bread wheat. We have developed a genomic enrichment approach using the gene-rich regions of hexaploid bread wheat to design a 110-Mbp NimbleGen SeqCap EZ in solution capture probe set, representing the majority of genes in wheat. Here, we use the capture probe set to enrich and sequence an F2 mapping population of the mutant. The mutant locus was identified in T. monococcum, which lacks a complete genome reference sequence, by mapping the enriched data set onto pseudo-chromosomes derived from the capture probe target sequence, with a long-range order of genes based on synteny of wheat with Brachypodium distachyon. Using this approach we are able to map the region and identify a set of deleted genes within the interval. © 2014 The Authors. The Plant Journal published by Society for Experimental Biology and John Wiley & Sons Ltd.
FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.
El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant
2016-01-01
A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile, the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, and the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence-based predictors of protein-protein and protein-DNA interfaces.
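A minimal sketch of the random-sampling step described above, assuming a local FASTA copy of UniRef100 and an output path that are placeholders: stream the file and keep each record with probability 0.01 to obtain roughly a 1% uniform sample for PSSM generation.

```python
# Illustrative sketch (not the FastRNABindR pipeline): stream a large FASTA
# file and keep each record with probability 0.01, i.e. roughly a uniform 1%
# sample, to build a reduced reference database for PSSM generation.
# File names are placeholders.
import random

def sample_fasta(path_in, path_out, fraction=0.01, seed=42):
    random.seed(seed)
    keep = False
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            if line.startswith(">"):          # new record: decide whether to keep it
                keep = random.random() < fraction
            if keep:
                fout.write(line)

sample_fasta("uniref100.fasta", "uniref100_1pct.fasta")
```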
Robust k-mer frequency estimation using gapped k-mers
Ghandi, Mahmoud; Mohammad-Noori, Morteza
2013-01-01
Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome. PMID:23861010
Robust k-mer frequency estimation using gapped k-mers.
Ghandi, Mahmoud; Mohammad-Noori, Morteza; Beer, Michael A
2014-08-01
Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
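The minimum-norm reconstruction described in the abstract can be illustrated with a toy example: build the linear map from ungapped k-mer counts to gapped k-mer counts for one gap pattern, then recover the ungapped counts from the gapped ones with a pseudo-inverse. Alphabet, k and the mask are kept tiny for clarity; this is an illustration of the idea, not the authors' gkm implementation.

```python
# Toy numpy sketch: observe gapped k-mer counts and recover a minimum-norm
# estimate of the ungapped k-mer frequencies via the pseudo-inverse.
import itertools
import numpy as np

ALPHABET = "ACGT"
K = 3
MASK = (0, 2)  # keep positions 0 and 2, treat position 1 as a gap

kmers = ["".join(p) for p in itertools.product(ALPHABET, repeat=K)]
gapped = sorted({km[MASK[0]] + km[MASK[1]] for km in kmers})

# A[i, j] = 1 if ungapped k-mer j maps to gapped k-mer i under the mask.
A = np.zeros((len(gapped), len(kmers)))
for j, km in enumerate(kmers):
    A[gapped.index(km[MASK[0]] + km[MASK[1]]), j] = 1.0

def kmer_counts(seq):
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - K + 1):
        counts[kmers.index(seq[i:i + K])] += 1
    return counts

seq = "ACGTGCACGTTACGAC"
f_true = kmer_counts(seq)
g_obs = A @ f_true                      # gapped k-mer counts actually observed
f_hat = np.linalg.pinv(A) @ g_obs       # minimum-norm estimate of ungapped counts

print(np.round(f_hat[f_true > 0], 2))
```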
Fuselli, S; Baptista, R P; Panziera, A; Magi, A; Guglielmi, S; Tonin, R; Benazzo, A; Bauzer, L G; Mazzoni, C J; Bertorelle, G
2018-03-24
The major histocompatibility complex (MHC) acts as an interface between the immune system and infectious diseases. Accurate characterization and genotyping of the extremely variable MHC loci are challenging especially without a reference sequence. We designed a combination of long-range PCR, Illumina short-reads, and Oxford Nanopore MinION long-reads approaches to capture the genetic variation of the MHC II DRB locus in an Italian population of the Alpine chamois (Rupicapra rupicapra). We utilized long-range PCR to generate a 9 Kb fragment of the DRB locus. Amplicons from six different individuals were fragmented, tagged, and simultaneously sequenced with Illumina MiSeq. One of these amplicons was sequenced with the MinION device, which produced long reads covering the entire amplified fragment. A pipeline that combines short and long reads resolved several short tandem repeats and homopolymers and produced a de novo reference, which was then used to map and genotype the short reads from all individuals. The assembled DRB locus showed a high level of polymorphism and the presence of a recombination breakpoint. Our results suggest that an amplicon-based NGS approach coupled with single-molecule MinION nanopore sequencing can efficiently achieve both the assembly and the genotyping of complex genomic regions in multiple individuals in the absence of a reference sequence.
Sul, Woo Jun; Cole, James R.; Jesus, Ederson da C.; Wang, Qiong; Farris, Ryan J.; Fish, Jordan A.; Tiedje, James M.
2011-01-01
High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but now even higher-throughput methods at the Illumina scale allow the creation of much larger datasets with more samples and orders-of-magnitude more sequences that swamp current analytic methods. We developed a method capable of handling these larger datasets on the basis of assignment of sequences into an existing taxonomy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clustering approach based on sequence similarity (taxonomy-unsupervised analysis). We sampled 211 different bacterial communities from various habitats and obtained ∼1.3 million 16S rRNA sequences spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions in that β-diversity measures calculated by using these two types of matrices were significantly correlated to each other, as were the ordination configurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were also highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-supervised analysis has the advantages that it is not limited by the exhaustive computation required for the alignment and clustering necessary for the taxonomy-unsupervised analysis, is more tolerant of sequencing errors, and allows comparisons when sequences are from different regions of the 16S rRNA gene. With the tremendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples. PMID:21873204
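A schematic sketch of the taxonomy-supervised summary described above, assuming classifier assignments are already available (the sample names, taxa and counts below are invented): each sample is reduced to a vector of counts over a fixed taxonomy, and samples are then compared with a β-diversity measure such as Bray-Curtis.

```python
# Sketch of a taxonomy-supervised summary (illustrative, not the authors' code):
# each sample becomes a vector of counts over a fixed set of taxa, assigned by a
# supervised classifier (here given as precomputed assignments), and samples are
# then compared with a beta-diversity measure such as Bray-Curtis.

def taxon_matrix(assignments, taxa):
    """assignments: {sample: [taxon label per sequence]}; returns count vectors."""
    return {s: [labels.count(t) for t in taxa] for s, labels in assignments.items()}

def bray_curtis(u, v):
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 1.0 - 2.0 * shared / (sum(u) + sum(v))

taxa = ["Bacteroidetes", "Firmicutes", "Proteobacteria"]
assignments = {
    "soil_1": ["Firmicutes"] * 40 + ["Proteobacteria"] * 60,
    "soil_2": ["Firmicutes"] * 55 + ["Bacteroidetes"] * 45,
}
matrix = taxon_matrix(assignments, taxa)
print(bray_curtis(matrix["soil_1"], matrix["soil_2"]))
```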
ERIC Educational Resources Information Center
Coxhead, Averil
2018-01-01
Research into the formulaic nature of language has grown in size and scale in the last 20 years or more, much of it based in corpus studies and involving the identification and categorisation of formulas. Research suggests that there are benefits for second and foreign language learners recognising formulaic sequences when listening and reading,…
The Interplay of Representations and Patterns of Classroom Discourse in Science Teaching Sequences
ERIC Educational Resources Information Center
Tang, Kok-Sing
2016-01-01
The purpose of this study is to examine the relationship between the communicative approach of classroom talk and the modes of representations used by science teachers. Based on video data from two physics classrooms in Singapore, a recurring pattern in the relationship was observed as the teaching sequence of a lesson unfolded. It was found that…
Motion magnification using the Hermite transform
NASA Astrophysics Data System (ADS)
Brieva, Jorge; Moya-Albor, Ernesto; Gomez-Coronel, Sandra L.; Escalante-Ramírez, Boris; Ponce, Hiram; Mora Esquivel, Juan I.
2015-12-01
We present an Eulerian motion magnification technique with a spatial decomposition based on the Hermite Transform (HT). We compare our results to the approach presented in [1]. We test our method on one sequence of the breathing of a newborn baby and on an MRI left ventricle sequence. Methods are compared using quantitative and qualitative metrics after the application of the motion magnification algorithm.
ERIC Educational Resources Information Center
Watt, Sarah Jean
2013-01-01
Research to identify validated instructional approaches to teach math to students with LD and those at-risk for failure in both core and supplemental instructional settings is necessary to assist teachers in closing the achievement gaps that exist across the country. The concrete-to-representational-to-abstract instructional sequence (CRA) has…
ERIC Educational Resources Information Center
Hernández, María Isabel; Couso, Digna; Pintó, Roser
2015-01-01
The study we have carried out aims to characterize 15-to 16-year-old students' learning progressions throughout the implementation of a teaching-learning sequence on the acoustic properties of materials. Our purpose is to better understand students' modeling processes about this topic and to identify how the instructional design and actual…
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis
Steele, Joe; Bastola, Dhundy
2014-01-01
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base–base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel–Ziv techniques from data compression. PMID:23904502
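As a small worked example of comparison by word frequencies, in the spirit of the feature frequency profiles mentioned above, the sketch below builds normalized k-word profiles for two sequences and compares them with the Jensen-Shannon divergence; it is not taken from any specific tool reviewed here.

```python
# Illustrative sketch of an alignment-free comparison by word frequencies:
# build normalized k-word frequency profiles for two sequences and compare
# them with the Jensen-Shannon divergence.
import math
from collections import Counter

def profile(seq, k):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def jensen_shannon(p, q):
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in words}
    def kl(a, b):
        return sum(a.get(w, 0) * math.log2(a.get(w, 0) / b[w])
                   for w in words if a.get(w, 0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p1 = profile("ACGTACGTAGCTAGCTACGGT", k=3)
p2 = profile("ACGTTGCAAGCTAGCAACGGT", k=3)
print(jensen_shannon(p1, p2))
```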
Time-optimal excitation of maximum quantum coherence: Physical limits and pulse sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Köcher, S. S.; Institute of Energy and Climate Research; Heydenreich, T.
Here we study the optimum efficiency of the excitation of maximum quantum (MaxQ) coherence using analytical and numerical methods based on optimal control theory. The theoretical limit of the achievable MaxQ amplitude and the minimum time to achieve this limit are explored for a set of model systems consisting of up to five coupled spins. In addition to arbitrary pulse shapes, two simple pulse sequence families of practical interest are considered in the optimizations. Compared to conventional approaches, substantial gains were found both in terms of the achieved MaxQ amplitude and in pulse sequence durations. For a model system, theoretically predicted gains of a factor of three compared to the conventional pulse sequence were experimentally demonstrated. Motivated by the numerical results, two novel analytical transfer schemes were also found: Compared to conventional approaches based on non-selective pulses and delays, double-quantum coherence in two-spin systems can be created twice as fast using isotropic mixing and hard spin-selective pulses. It is also proved that in a chain of three weakly coupled spins with the same coupling constants, triple-quantum coherence can be created in a time-optimal fashion using so-called geodesic pulses.
Nilsson, R. Henrik; Kristiansson, Erik; Ryberg, Martin; Hallenberg, Nils; Larsson, Karl-Henrik
2008-01-01
The internal transcribed spacer (ITS) region of the nuclear ribosomal repeat unit is the most popular locus for species identification and subgeneric phylogenetic inference in sequence-based mycological research. The region is known to show certain variability even within species, although its intraspecific variability is often held to be limited and clearly separated from interspecific variability. The existence of such a divide between intra- and interspecific variability is implicitly assumed by automated approaches to species identification, but whether intraspecific variability indeed is negligible within the fungal kingdom remains contentious. The present study estimates the intraspecific ITS variability in all fungi presently available to the mycological community through the international sequence databases. Substantial differences were found within the kingdom, and the results are not easily correlated to the taxonomic affiliation or nutritional mode of the taxa considered. No single unifying yet stringent upper limit for intraspecific variability, such as the canonical 3% threshold, appears to be applicable with the desired outcome throughout the fungi. Our results caution against simplified approaches to automated ITS-based species delimitation and reiterate the need for taxonomic expertise in the translation of sequence data into species names. PMID:19204817
Scaling features of noncoding DNA
NASA Technical Reports Server (NTRS)
Stanley, H. E.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Peng, C. K.; Simons, M.
1999-01-01
We review evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range--indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene, and utilize this fact to build a Coding Sequence Finder Algorithm, which uses statistical ideas to locate the coding regions of an unknown DNA sequence. Finally, we describe briefly some recent work adapting to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function, and reporting that noncoding regions in eukaryotes display a larger redundancy than coding regions. Specifically, we consider the possibility that this result is solely a consequence of nucleotide concentration differences as first noted by Bonhoeffer and his collaborators. We find that cytosine-guanine (CG) concentration does have a strong "background" effect on redundancy. However, we find that for the purine-pyrimidine binary mapping rule, which is not affected by the difference in CG concentration, the Shannon redundancy for the set of analyzed sequences is larger for noncoding regions compared to coding regions.
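The purine-pyrimidine binary mapping and an entropy-based redundancy estimate can be sketched as follows; the block length, test sequence and the exact redundancy definition used here are illustrative rather than the authors' procedure.

```python
# Illustrative sketch: map a DNA sequence to the purine/pyrimidine binary
# alphabet (A,G -> R; C,T -> Y) and estimate the Shannon entropy of n-blocks.
# Redundancy is then expressed relative to the maximum of n bits per block.
import math
from collections import Counter

PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def block_entropy(seq, n):
    binary = "".join(PURINE_PYRIMIDINE[b] for b in seq)
    blocks = Counter(binary[i:i + n] for i in range(len(binary) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())

seq = "ATGGCATTTCGATAGGCCATAGGCTTAACGT"
n = 3
h = block_entropy(seq, n)
print(f"H_{n} = {h:.3f} bits, redundancy = {1 - h / n:.3f}")
```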
Application of the MIDAS approach for analysis of lysine acetylation sites.
Evans, Caroline A; Griffiths, John R; Unwin, Richard D; Whetton, Anthony D; Corfe, Bernard M
2013-01-01
Multiple Reaction Monitoring Initiated Detection and Sequencing (MIDAS™) is a mass spectrometry-based technique for the detection and characterization of specific post-translational modifications (Unwin et al. 4:1134-1144, 2005), for example acetylated lysine residues (Griffiths et al. 18:1423-1428, 2007). The MIDAS™ technique has application for discovery and analysis of acetylation sites. It is a hypothesis-driven approach that requires a priori knowledge of the primary sequence of the target protein and a proteolytic digest of this protein. MIDAS essentially performs a targeted search for the presence of modified, for example acetylated, peptides. The detection is based on the combination of the predicted molecular weight (measured as a mass-to-charge ratio) of the acetylated proteolytic peptide and a diagnostic fragment (product ion of m/z 126.1), which is generated by specific fragmentation of acetylated peptides during collision-induced dissociation performed in tandem mass spectrometry (MS) analysis. Sequence information is subsequently obtained which enables acetylation site assignment. The technique of MIDAS was later trademarked by ABSciex for targeted protein analysis where an MRM scan is combined with a full MS/MS product ion scan to enable sequence confirmation.
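The two MIDAS-style filters described above can be sketched in a few lines: the predicted m/z of an acetylated peptide at a given charge state, and a check for the diagnostic product ion near m/z 126.1 in an MS/MS fragment list. The monoisotopic masses are standard values; the mass tolerance, example peptide and example spectrum are invented for illustration.

```python
# Sketch of two MIDAS-style filters (illustration only):
# (1) predicted m/z of an acetylated peptide at a given charge state and
# (2) presence of the diagnostic acetyl-lysine product ion near m/z 126.1.

RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
           "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
           "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER, PROTON, ACETYL = 18.01056, 1.00728, 42.01057

def acetyl_peptide_mz(sequence, charge, n_acetyl=1):
    mass = sum(RESIDUE[aa] for aa in sequence) + WATER + n_acetyl * ACETYL
    return (mass + charge * PROTON) / charge

def has_diagnostic_ion(fragment_mzs, target=126.1, tol=0.2):
    return any(abs(mz - target) <= tol for mz in fragment_mzs)

print(round(acetyl_peptide_mz("KVLSAADK", charge=2), 3))   # hypothetical peptide
print(has_diagnostic_ion([102.1, 126.09, 175.1, 258.2]))   # invented spectrum
```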
Petruzziello, Filomena; Fouillen, Laetitia; Wadensten, Henrik; Kretz, Robert; Andren, Per E; Rainer, Gregor; Zhang, Xiaozhe
2012-02-03
Neuropeptidomics is used to characterize endogenous peptides in the brain of tree shrews (Tupaia belangeri). Tree shrews are small animals similar to rodents in size but close relatives of primates, and are excellent models for brain research. Currently, no complete proteome information is available for tree shrews on which a direct database search for neuropeptide identification could be based. To improve the identification of neuropeptides in tree shrews, we developed an integrated mass spectrometry (MS)-based approach that combines methods including data-dependent, directed, and targeted liquid chromatography (LC)-Fourier transform (FT)-tandem MS (MS/MS) analysis, database construction, de novo sequencing, precursor protein search, and homology analysis. Using this integrated approach, we identified 107 endogenous peptides that have sequences identical or similar to those from other mammalian species. High-accuracy MS and tandem MS information, together with BLAST analysis and chromatographic characteristics, were used to confirm the sequences of all the identified peptides. Interestingly, further sequence homology analysis demonstrated that tree shrew peptides have a significantly higher degree of homology to equivalent sequences in humans than to those in mice or rats, consistent with the close phylogenetic relationship between tree shrews and primates. Our results provide the first extensive characterization of the peptidome in tree shrews, which now permits characterization of their function in the nervous and endocrine systems. As the approach fully exploits the evolutionary conservation of neuropeptides and the advantages of high-accuracy MS, it can be ported to the identification of neuropeptides in other species for which fully sequenced genomes or proteomes are not available.
Park, Doori; Park, Su-Hyun; Ban, Yong Wook; Kim, Youn Shic; Park, Kyoung-Cheul; Kim, Nam-Soo; Kim, Ju-Kon; Choi, Ik-Young
2017-08-15
Genetically modified crops (GM crops) have been developed to improve the agricultural traits of modern crop cultivars. Safety assessments of GM crops are of paramount importance in research at developmental stages and before releasing transgenic plants into the marketplace. Sequencing technology is developing rapidly, with higher output and labor efficiencies, and will eventually replace existing methods for the molecular characterization of genetically modified organisms. To detect the transgenic insertion locations in the three GM rice genomes, Illumina sequencing reads are mapped to the rice genome and to the plasmid sequence; reads that map to both are then used to characterize the junction sites between plant and transgene sequences by sequence alignment. Herein, we present a next generation sequencing (NGS)-based molecular characterization method, using transgenic rice plants SNU-Bt9-5, SNU-Bt9-30, and SNU-Bt9-109. Specifically, using bioinformatics tools, we detected the precise insertion locations and copy numbers of transfer DNA, genetic rearrangements, and the absence of backbone sequences, which were equivalent to results obtained from Southern blot analyses. NGS methods have been suggested as an effective means of characterizing and detecting transgenic insertion locations in genomes. Our results demonstrate the use of a combination of NGS technology and bioinformatics approaches that offers cost- and time-effective methods for assessing the safety of transgenic plants.
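One simple way to flag candidate junction reads of the kind used above is to look for reads whose alignments are split between the host genome and the transformation plasmid. The sketch below illustrates this on plain (read_id, reference) tuples rather than real BAM files; the reference names, including the plasmid name, are placeholders, and this is not the authors' pipeline.

```python
# Illustrative sketch: flag candidate junction reads as those whose alignments
# are split between the rice genome and the transgene plasmid. Real data would
# come from BAM files; (read_id, reference) tuples stand in for alignments.

def candidate_junction_reads(alignments, genome_refs, plasmid_refs):
    hits_genome, hits_plasmid = set(), set()
    for read_id, reference in alignments:
        if reference in genome_refs:
            hits_genome.add(read_id)
        elif reference in plasmid_refs:
            hits_plasmid.add(read_id)
    return hits_genome & hits_plasmid   # reads supporting a plant/transgene junction

alignments = [("read1", "chr1"), ("read1", "pBt9_plasmid"),   # split read
              ("read2", "chr1"), ("read3", "pBt9_plasmid")]
print(candidate_junction_reads(alignments, {"chr1"}, {"pBt9_plasmid"}))
```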
Using single cell sequencing data to model the evolutionary history of a tumor.
Kim, Kyung In; Simon, Richard
2014-01-24
The introduction of next-generation sequencing (NGS) technology has made it possible to detect genomic alterations within tumor cells on a large scale. However, most applications of NGS show the genetic content of mixtures of cells. Recently developed single cell sequencing technology can identify variation within a single cell. Characterization of multiple samples from a tumor using single cell sequencing can potentially provide information on the evolutionary history of that tumor. This may facilitate understanding how key mutations accumulate and evolve in lineages to form a heterogeneous tumor. We provide a computational method to infer an evolutionary mutation tree based on single cell sequencing data. Our approach differs from traditional phylogenetic tree approaches in that our mutation tree directly describes temporal order relationships among mutation sites. Our method also accommodates sequencing errors. Furthermore, we provide a method for estimating the proportion of time from the earliest mutation event of the sample to the most recent common ancestor of the sample of cells. Finally, we discuss current limitations on modeling with single cell sequencing data and possible improvements under those limitations. Inferring the temporal ordering of mutational sites using current single cell sequencing data is a challenge. Our proposed method may help elucidate relationships among key mutations and their role in tumor progression.
Transportation Planning with Immune System Derived Approach
NASA Astrophysics Data System (ADS)
Sugiyama, Kenji; Yaji, Yasuhito; Ootsuki, John Takuya; Fujimoto, Yasutaka; Sekiguchi, Takashi
This paper presents an immune system derived approach for planning transportation of materials between manufacturing processes in the factory. Transportation operations are modeled by Petri Net, and divided into submodels. Transportation orders are derived from the firing sequences of those submodels through convergence calculation by the immune system derived excitation and suppression operations. Basic evaluation of this approach is conducted by simulation-based investigation.
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme increase in next-generation sequencing has resulted in a shortage of efficient approaches for aligning ultra-large sets of biological sequences of different types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Experiments on large-scale DNA and protein data sets (files larger than 1 GB) showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.
Grégory, Dubourg; Chaudet, Hervé; Lagier, Jean-Christophe; Raoult, Didier
2018-03-01
Describing the human gut microbiota is one of the most exciting challenges of the 21st century. Currently, high-throughput sequencing methods are considered the gold standard for this purpose; however, they suffer from several drawbacks, including their inability to detect minority populations. The advent of mass-spectrometric (MS) approaches to identify cultured bacteria in clinical microbiology enabled the creation of the culturomics approach, which aims to establish a comprehensive repertoire of cultured prokaryotes from human specimens using extensive culture conditions. Areas covered: This review first underlines how mass spectrometric approaches have revolutionized clinical microbiology. It then highlights the contribution of MS-based methods to culturomics studies, paying particular attention to the extension of the human gut microbiota repertoire through the discovery of new bacterial species. Expert commentary: MS-based approaches have enabled cultivation methods to be resuscitated to study the human gut microbiota and thus to fill in the blanks left by high-throughput sequencing methods in terms of culturing minority populations. Continued efforts to recover new taxa using culture methods, combined with their rapid implementation in genomic databases, would allow for an exhaustive analysis of the gut microbiota through the use of a comprehensive approach.
Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning.
Teng, Haotian; Cao, Minh Duc; Hall, Michael B; Duarte, Tania; Wang, Sheng; Coin, Lachlan J M
2018-05-01
Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology that offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling and directly translate the raw signal to DNA sequence without the error-prone segmentation step. Trained with only a small set of 4,000 reads, we show that our model provides state-of-the-art basecalling accuracy, even on previously unseen species. Chiron achieves basecalling speeds of more than 2,000 bases per second using desktop computer graphics processing units.
Biophysical and structural considerations for protein sequence evolution
2011-01-01
Background Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field. Results Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites. Conclusions Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model. PMID:22171550
Chen, Shun-Li; Wu, Shiaw-Lin; Huang, Li-Juan; Huang, Jia-Bao; Chen, Shu-Hui
2013-06-01
Liquid chromatography-tandem mass spectrometry-based proteomics for peptide mapping and sequencing was used to characterize the marketed monoclonal antibody trastuzumab and compare it with two biosimilar products, mAb A containing D359E and L361M variations at the Fc site and mAb B without variants. Complete sequence coverage (100%) including disulfide linkages, glycosylations and other commonly occurring modifications (i.e., deamidation, oxidation, dehydration and K-clipping) were identified using maps generated from multi-enzyme digestions. In addition to the targeted comparison for the relative populations of targeted modification forms, a non-targeted approach was used to globally compare ion intensities in tryptic maps. The non-targeted comparison provided an extra-dimensional view to examine any possible differences related to variants or modifications. A peptide containing the two variants in mAb A, D359E and L361M, was revealed using the non-targeted comparison of the tryptic maps. In contrast, no significant differences were observed when trastuzumab was self-compared or compared with mAb B. These results were consistent with the data derived from peptide sequencing via collision induced dissociation/electron transfer dissociation. Thus, combined targeted and non-targeted approaches using powerful mass spectrometry-based proteomic tools hold great promise for the structural characterization of biosimilar products. Copyright © 2013 Elsevier B.V. All rights reserved.
Baptista, Rodrigo P; Reis-Cunha, Joao Luis; DeBarry, Jeremy D; Chiari, Egler; Kissinger, Jessica C; Bartholomeu, Daniella C; Macedo, Andrea M
2018-02-14
Next-generation sequencing (NGS) methods are low-cost high-throughput technologies that produce thousands to millions of sequence reads. Despite the high number of raw sequence reads, their short length, relative to Sanger, PacBio or Nanopore reads, complicates the assembly of genomic repeats. Many genome tools are available, but the assembly of highly repetitive genome sequences using only NGS short reads remains challenging. Genome assembly of organisms responsible for important neglected diseases such as Trypanosoma cruzi, the aetiological agent of Chagas disease, is known to be challenging because of their repetitive nature. Only three of six recognized discrete typing units (DTUs) of the parasite have their draft genomes published and therefore genome evolution analyses in the taxon are limited. In this study, we developed a computational workflow to assemble highly repetitive genomes via a combination of de novo and reference-based assembly strategies to better overcome the intrinsic limitations of each, based on Illumina reads. The highly repetitive genome of the human-infecting parasite T. cruzi 231 strain was used as a test subject. The combined-assembly approach shown in this study benefits from the ability of reference-based assembly to resolve highly repetitive sequences and from the capacity of de novo assembly to reconstruct genome-specific regions, improving the quality of the assembly. Analysis of our results showed acceptable confidence, indicating that our combined approach is an attractive option for assembling highly repetitive genomes with NGS short reads. Phylogenomic analysis including the 231 strain, the first representative of DTU III whose genome was sequenced, was also performed and provides new insights into T. cruzi genome evolution.
Manigart, Olivier; Boeras, Debrah I; Karita, Etienne; Hawkins, Paulina A; Vwalika, Cheswa; Makombe, Nathan; Mulenga, Joseph; Derdeyn, Cynthia A; Allen, Susan; Hunter, Eric
2012-12-01
A critical step in HIV-1 transmission studies is the rapid and accurate identification of epidemiologically linked transmission pairs. To date, this has been accomplished by comparison of polymerase chain reaction (PCR)-amplified nucleotide sequences from potential transmission pairs, which can be cost-prohibitive for use in resource-limited settings. Here we describe a rapid, cost-effective approach to determine transmission linkage based on the heteroduplex mobility assay (HMA), and validate this approach by comparison to nucleotide sequencing. A total of 102 HIV-1-infected Zambian and Rwandan couples, with known linkage, were analyzed by gp41-HMA. A 400-base pair fragment within the envelope gp41 region of the HIV proviral genome was PCR amplified and HMA was applied to both partners' amplicons separately (autologous) and as a mixture (heterologous). If the diversity between gp41 sequences was low (<5%), a homoduplex was observed upon gel electrophoresis and the transmission was characterized as having occurred between partners (linked). If a new heteroduplex formed, within the heterologous migration, the transmission was determined to be unlinked. Initial blind validation of gp41-HMA demonstrated 90% concordance between HMA and sequencing with 100% concordance in the case of linked transmissions. Following validation, 25 newly infected partners in Kigali and 12 in Lusaka were evaluated prospectively using both HMA and nucleotide sequences. Concordant results were obtained in all but one case (97.3%). The gp41-HMA technique is a reliable and feasible tool to detect linked transmissions in the field. All identified unlinked results should be confirmed by sequence analyses.
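The sequence-based linkage criterion implied above (divergence below 5% between partners' gp41 sequences) can be expressed directly; the sketch below assumes pre-aligned sequences of equal length and does not model gaps or the HMA gel readout itself. The example sequences are toy data.

```python
# Sketch of the sequence-based linkage criterion (illustration only): percent
# nucleotide difference between aligned gp41 sequences of the two partners,
# with < 5% divergence called "linked".

def percent_divergence(aln1, aln2):
    assert len(aln1) == len(aln2), "sequences must be pre-aligned"
    diffs = sum(1 for a, b in zip(aln1, aln2) if a != b)
    return 100.0 * diffs / len(aln1)

def classify_pair(aln1, aln2, threshold=5.0):
    return "linked" if percent_divergence(aln1, aln2) < threshold else "unlinked"

partner_a = "ATGGCAGTCTAGCAGAAGAAGAGGTAGTAATTAGATCT"
partner_b = "ATGGCAGTCTAGCAGAAGAAGAAGTAGTAATTAGATCT"
print(classify_pair(partner_a, partner_b))
```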
ERIC Educational Resources Information Center
Malgieri, Massimiliano; Onorato, Pasquale; De Ambrosis, Anna
2017-01-01
In this paper we present the results of a research-based teaching-learning sequence on introductory quantum physics based on Feynman's sum over paths approach in the Italian high school. Our study focuses on students' understanding of two founding ideas of quantum physics, wave particle duality and the uncertainty principle. In view of recent…
Current strategies for mobilome research
Jørgensen, Tue S.; Kiil, Anne S.; Hansen, Martin A.; Sørensen, Søren J.; Hansen, Lars H.
2015-01-01
Mobile genetic elements (MGEs) are pivotal for bacterial evolution and adaptation, allowing shuffling of genes even between distantly related bacterial species. The study of these elements is biologically interesting, as the mode of genetic propagation is kaleidoscopic, and important, as MGEs are the main vehicles of the increasing bacterial antibiotic resistance that causes thousands of human deaths each year. The study of MGEs has previously focused on plasmids from individual isolates, but the revolution in sequencing technology has allowed the study of mobile genomic elements of entire communities using metagenomic approaches. The problem in using metagenomic sequencing for the study of MGEs is that plasmids and other mobile elements comprise only a small fraction of the total genetic content and are difficult to separate from chromosomal DNA based on sequence alone. The distinction between plasmid and chromosome is important as the mobility and regulation of genes largely depend on their genetic context. Several different approaches have been proposed that specifically enrich plasmid DNA from community samples. Here, we review recent approaches used to study entire plasmid pools from complex environments, and point out possible future developments for and pitfalls of these approaches. Further, we discuss the use of the PacBio long-read sequencing technology for MGE discovery. PMID:25657641
A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Model.
Bruneau, Marine; Mottet, Thierry; Moulin, Serge; Kerbiriou, Maël; Chouly, Franz; Chretien, Stéphane; Guyeux, Christophe
2018-02-01
In this article, a new Python package for nucleotide sequences clustering is proposed. This package, freely available on-line, implements a Laplacian eigenmap embedding and a Gaussian Mixture Model for DNA clustering. It takes nucleotide sequences as input, and produces the optimal number of clusters along with a relevant visualization. Despite the fact that we did not optimise the computational speed, our method still performs reasonably well in practice. Our focus was mainly on data analytics and accuracy and as a result, our approach outperforms the state of the art, even in the case of divergent sequences. Furthermore, an a priori knowledge on the number of clusters is not required here. For the sake of illustration, this method is applied on a set of 100 DNA sequences taken from the mitochondrially encoded NADH dehydrogenase 3 (ND3) gene, extracted from a collection of Platyhelminthes and Nematoda species. The resulting clusters are tightly consistent with the phylogenetic tree computed using a maximum likelihood approach on gene alignment. They are coherent too with the NCBI taxonomy. Further test results based on synthesized data are then provided, showing that the proposed approach is better able to recover the clusters than the most widely used software, namely Cd-hit-est and BLASTClust. Copyright © 2017 Elsevier Ltd. All rights reserved.
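A hedged scikit-learn sketch of the pipeline described above: k-mer frequency vectors, a Laplacian eigenmap (spectral) embedding, and a Gaussian mixture with the number of components chosen by BIC. The package's actual feature construction and parameter choices likely differ; the sequences, k and the cluster range below are invented.

```python
# Hedged sketch of a Laplacian-eigenmap + Gaussian-mixture clustering pipeline.
import itertools
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.mixture import GaussianMixture

def kmer_vector(seq, k=3):
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        counts[index[seq[i:i + k]]] += 1
    return counts / max(counts.sum(), 1.0)

sequences = ["ACGTACGTACGT" * 5, "ACGAACGTACGT" * 5, "ACGTACGAACGT" * 5,
             "ACGTACGTACGA" * 5, "TTGCGGCCTTGC" * 5, "TTGCGGCCTAGC" * 5,
             "TAGCGGCCTTGC" * 5, "TTGCGGACTTGC" * 5]
X = np.array([kmer_vector(s) for s in sequences])

# Laplacian eigenmap embedding of the k-mer vectors into two dimensions.
embedding = SpectralEmbedding(n_components=2, affinity="rbf").fit_transform(X)

# Gaussian mixture; the number of clusters is chosen by minimizing the BIC.
best = min((GaussianMixture(n_components=k, random_state=0).fit(embedding)
            for k in range(1, 4)),
           key=lambda gm: gm.bic(embedding))
print(best.n_components, best.predict(embedding))
```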
Retrieval of Sentence Sequences for an Image Stream via Coherence Recurrent Convolutional Networks.
Park, Cesc Chunseong; Kim, Youngjin; Kim, Gunhee
2018-04-01
We propose an approach for retrieving a sequence of natural sentences for an image stream. Since general users often take a series of pictures of their experiences, much online visual information exists in the form of image streams, for which it is better to take the whole image stream into consideration when producing natural language descriptions. While almost all previous studies have dealt with the relation between a single image and a single natural sentence, our work extends both the input and output dimensions to a sequence of images and a sequence of sentences. For retrieving a coherent flow of multiple sentences for a photo stream, we propose a multimodal neural architecture called coherence recurrent convolutional network (CRCN), which consists of convolutional neural networks, bidirectional long short-term memory (LSTM) networks, and an entity-based local coherence model. Our approach directly learns from the vast user-generated resource of blog posts as text-image parallel training data. We collect more than 22 K unique blog posts with 170 K associated images for the travel topics of NYC, Disneyland, Australia, and Hawaii. We demonstrate that our approach outperforms other state-of-the-art image captioning methods for text sequence generation, using both quantitative measures and user studies via Amazon Mechanical Turk.
Template-Based Modeling of Protein-RNA Interactions
Zheng, Jinfang; Kundrotas, Petras J.; Vakser, Ilya A.
2016-01-01
Protein-RNA complexes formed by specific recognition between RNA and RNA-binding proteins play an important role in biological processes. More than a thousand such proteins have been curated in human, and many novel RNA-binding proteins remain to be discovered. Due to limitations of experimental approaches, computational techniques are needed to characterize protein-RNA interactions. Although much progress has been made, adequate methodologies reliably providing atomic-resolution structural details are still lacking. Protein-RNA free docking approaches have proved useful, but in general the template-based approaches provide higher-quality predictions. Templates are key to building a high-quality model. Sequence/structure relationships were studied based on a representative set of binary protein-RNA complexes from the PDB. Several approaches were tested for pairwise target/template alignment. The analysis revealed a transition point between random and correct binding modes. The results showed that structural alignment is better than sequence alignment at identifying good templates, suitable for generating protein-RNA complexes close to the native structure, and that it outperforms free docking, successfully predicting complexes where free docking fails, including cases of significant conformational change upon binding. A template-based protein-RNA interaction modeling protocol, PRIME, was developed and benchmarked on a representative set of complexes. PMID:27662342
Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Chadaram, Sudha; Mande, Sharmila S
2011-11-30
Obtaining accurate estimates of microbial diversity using rDNA profiling is the first step in most metagenomics projects. Consequently, most metagenomic projects spend considerable amounts of time, money and manpower for experimentally cloning, amplifying and sequencing the rDNA content in a metagenomic sample. In the second step, the entire genomic content of the metagenome is extracted, sequenced and analyzed. Since DNA sequences obtained in this second step also contain rDNA fragments, rapid in silico identification of these rDNA fragments would drastically reduce the cost, time and effort of current metagenomic projects by entirely bypassing the experimental steps of primer based rDNA amplification, cloning and sequencing. In this study, we present an algorithm called i-rDNA that can facilitate the rapid detection of 16S rDNA fragments from amongst millions of sequences in metagenomic data sets with high detection sensitivity. Performance evaluation with data sets/database variants simulating typical metagenomic scenarios indicates the significantly high detection sensitivity of i-rDNA. Moreover, i-rDNA can process a million sequences in less than an hour on a simple desktop with modest hardware specifications. In addition to the speed of execution, high sensitivity and low false positive rate, the utility of the algorithmic approach discussed in this paper is immense given that it would help in bypassing the entire experimental step of primer-based rDNA amplification, cloning and sequencing. Application of this algorithmic approach would thus drastically reduce the cost, time and human efforts invested in all metagenomic projects. A web-server for the i-rDNA algorithm is available at http://metagenomics.atc.tcs.com/i-rDNA/
Pairwise Sequence Alignment Library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jeff Daily, PNNL
2015-05-20
Vector extensions, such as SSE, have been part of the x86 CPU since the 1990s, with applications in graphics, signal processing, and scientific applications. Although many algorithms and applications can naturally benefit from automatic vectorization techniques, there are still many that are difficult to vectorize due to their dependence on irregular data structures, dense branch operations, or data dependencies. Sequence alignment, one of the most widely used operations in bioinformatics workflows, has a computational footprint that features complex data dependencies. The trend of widening vector registers adversely affects the state-of-the-art sequence alignment algorithm based on striped data layouts. Therefore, a novel SIMD implementation of a parallel scan-based sequence alignment algorithm that can better exploit wider SIMD units was implemented as part of the Parallel Sequence Alignment Library (parasail). Parasail features: reference implementations of all known vectorized sequence alignment approaches; implementations of Smith-Waterman (SW), semi-global (SG), and Needleman-Wunsch (NW) sequence alignment algorithms; implementations across all modern CPU instruction sets, including AVX2 and KNC; and language interfaces for C/C++ and Python.
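As a usage illustration, the short example below scores a pair of sequences through the parasail Python bindings mentioned above. The function names follow parasail's algorithm/implementation/width naming convention as we understand it and should be treated as an assumption; consult the library documentation for the exact API of a given release.

```python
# Usage sketch (assumed API): striped 16-bit Smith-Waterman and Needleman-Wunsch
# with the BLOSUM62 matrix and gap open/extend penalties of 10/1.
import parasail

query, ref = "MKTAYIAKQR", "MKTAYLAKQR"

local = parasail.sw_striped_16(query, ref, 10, 1, parasail.blosum62)
global_ = parasail.nw_striped_16(query, ref, 10, 1, parasail.blosum62)

print("local score:", local.score)
print("global score:", global_.score)
```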
McDermott, Jason E.; Bruillard, Paul; Overall, Christopher C.; ...
2015-03-09
There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group in that the general function, transport, can be easily inferred from the sequence, but the substrate specificity can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir.
Gini, Beatrice; Mischel, Paul S
2014-08-01
Single-cell sequencing approaches are needed to characterize the genomic diversity of complex tumors, shedding light on their evolutionary paths and potentially suggesting more effective therapies. In this issue of Cancer Discovery, Francis and colleagues develop a novel integrative approach to identify distinct tumor subpopulations based on joint detection of clonal and subclonal events from bulk tumor and single-nucleus whole-genome sequencing, allowing them to infer a subclonal architecture. Surprisingly, the authors identify convergent evolution of multiple, mutually exclusive, independent EGFR gain-of-function variants in a single tumor. This study demonstrates the value of integrative single-cell genomics and highlights the biologic primacy of EGFR as an actionable target in glioblastoma. ©2014 American Association for Cancer Research.
2010-01-01
Background Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). Results One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Conclusions Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article. PMID:21118572
Flot, Jean-François; Couloux, Arnaud; Tillier, Simon
2010-11-30
Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article.
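The core of the haploweb construction can be conveyed with a small sketch. The code below (an assumed simplification, not the authors' workflow) treats haplotypes as graph nodes, links haplotypes that differ by at most one mutation (a crude stand-in for a proper haplotype network) or that co-occur in a heterozygous individual, and reports connected components as candidate sl-FFRs. The phased haplotype pairs are illustrative.

```python
# Minimal sketch of the haploweb idea: connected components of the combined
# mutation/co-occurrence graph approximate single-locus fields for recombination.
import networkx as nx

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def sl_ffrs(individuals):
    """individuals: list of (hap1, hap2) phased sequences of equal length."""
    G = nx.Graph()
    haplotypes = {h for pair in individuals for h in pair}
    G.add_nodes_from(haplotypes)
    haps = sorted(haplotypes)
    for i, h1 in enumerate(haps):              # haplotype-network edges (simplified)
        for h2 in haps[i + 1:]:
            if hamming(h1, h2) <= 1:
                G.add_edge(h1, h2)
    for h1, h2 in individuals:                 # co-occurrence in heterozygotes
        if h1 != h2:
            G.add_edge(h1, h2)
    return list(nx.connected_components(G))

print(sl_ffrs([("ACGT", "ACGA"), ("ACGA", "ACGA"), ("TTTT", "TTTA")]))
```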
Leonard, Susan R.; Mammel, Mark K.; Lacher, David W.
2015-01-01
Culture-independent diagnostics reduce the reliance on traditional (and slower) culture-based methodologies. Here we capitalize on advances in next-generation sequencing (NGS) to apply this approach to food pathogen detection utilizing NGS as an analytical tool. In this study, spiking spinach with Shiga toxin-producing Escherichia coli (STEC) following an established FDA culture-based protocol was used in conjunction with shotgun metagenomic sequencing to determine the limits of detection, sensitivity, and specificity levels and to obtain information on the microbiology of the protocol. We show that an expected level of contamination (∼10 CFU/100 g) could be adequately detected (including key virulence determinants and strain-level specificity) within 8 h of enrichment at a sequencing depth of 10,000,000 reads. We also rationalize the relative benefit of static versus shaking culture conditions and the addition of selected antimicrobial agents, thereby validating the long-standing culture-based parameters behind such protocols. Moreover, the shotgun metagenomic approach was informative regarding the dynamics of microbial communities during the enrichment process, including initial surveys of the microbial loads associated with bagged spinach; the microbes found included key genera such as Pseudomonas, Pantoea, and Exiguobacterium. Collectively, our metagenomic study highlights and considers various parameters required for transitioning to such sequencing-based diagnostics for food safety and the potential to develop better enrichment processes in a high-throughput manner not previously possible. Future studies will investigate new species-specific DNA signature target regimens, rational design of medium components in concert with judicious use of additives, such as antibiotics, and alterations in the sample processing protocol to enhance detection. PMID:26386062
Late-onset Bartter syndrome type II.
Gollasch, Benjamin; Anistan, Yoland-Marie; Canaan-Kühl, Sima; Gollasch, Maik
2017-10-01
Mutations in the ROMK1 potassium channel gene (KCNJ1) cause antenatal/neonatal Bartter syndrome type II (aBS II), a renal disorder that begins in utero, accounting for the polyhydramnios and premature delivery typical of affected infants, who develop massive renal salt wasting, hypokalaemic metabolic alkalosis, secondary hyperreninaemic hyperaldosteronism, hypercalciuria and nephrocalcinosis. This type of BS has been considered a disorder of infancy, not of adulthood. We herein describe a female patient with a remarkably late-onset and mild clinical manifestation of BS II carrying compound heterozygous KCNJ1 missense mutations, consisting of a novel c.197T>A (p.I66N) and a previously reported c.875G>A (p.R292Q) KCNJ1 mutation. We implemented and evaluated the performance of two different bioinformatics-based approaches to targeted massively parallel sequencing [next-generation sequencing (NGS)] in establishing the molecular diagnosis. Our results demonstrate that aBS II may be suspected in patients with a late-onset phenotype. Our experimental approach of NGS-based mutation screening combined with Sanger sequencing proved to be a reliable molecular approach for establishing the clinical diagnosis in our patient, and has important differential diagnostic and therapeutic implications for patients with BS. Our results could have a significant impact on the diagnosis and methodological approaches to genetic testing in other patients with clinically unclassified phenotypes of nephrocalcinosis and congenital renal electrolyte abnormalities.
Zheng, Xiasheng; Zhang, Peng; Liao, Baosheng; Li, Jing; Liu, Xingyun; Shi, Yuhua; Cheng, Jinle; Lai, Zhitian; Xu, Jiang; Chen, Shilin
2017-01-01
Herbal medicine is a major component of complementary and alternative medicine, contributing significantly to the health of many people and communities. Quality control of herbal medicine is crucial to ensure that it is safe and sound for use. Here, we investigated a comprehensive quality evaluation system for a classic herbal medicine, Danggui Buxue Formula, by applying genetic-based and analytical chemistry approaches to authenticate and evaluate the quality of its samples. For authenticity, we successfully applied two novel technologies, third-generation sequencing and PCR-DGGE (denaturing gradient gel electrophoresis), to analyze the ingredient composition of the tested samples. For quality evaluation, we used high-performance liquid chromatography assays to determine the content of chemical markers and thereby estimate the dosage relationship between its two raw materials, the plant roots of Huangqi and Danggui. A series of surveys was then conducted against several exogenous contaminations, aiming to further assess the efficacy and safety of the samples. In conclusion, the quality evaluation system demonstrated here can potentially address the authenticity, quality, and safety of herbal medicines, thus providing novel insight for enhancing their overall quality control. Highlight: We established a comprehensive quality evaluation system for herbal medicine by combining two genetic-based approaches, third-generation sequencing and DGGE (denaturing gradient gel electrophoresis), with analytical chemistry approaches to achieve authentication and quality assessment of the samples. PMID:28955365
Zheng, Xiasheng; Zhang, Peng; Liao, Baosheng; Li, Jing; Liu, Xingyun; Shi, Yuhua; Cheng, Jinle; Lai, Zhitian; Xu, Jiang; Chen, Shilin
2017-01-01
Herbal medicine is a major component of complementary and alternative medicine, contributing significantly to the health of many people and communities. Quality control of herbal medicine is crucial to ensure that it is safe and sound for use. Here, we investigated a comprehensive quality evaluation system for a classic herbal medicine, Danggui Buxue Formula, by applying genetic-based and analytical chemistry approaches to authenticate and evaluate the quality of its samples. For authenticity, we successfully applied two novel technologies, third-generation sequencing and PCR-DGGE (denaturing gradient gel electrophoresis), to analyze the ingredient composition of the tested samples. For quality evaluation, we used high-performance liquid chromatography assays to determine the content of chemical markers and thereby estimate the dosage relationship between its two raw materials, the plant roots of Huangqi and Danggui. A series of surveys was then conducted against several exogenous contaminations, aiming to further assess the efficacy and safety of the samples. In conclusion, the quality evaluation system demonstrated here can potentially address the authenticity, quality, and safety of herbal medicines, thus providing novel insight for enhancing their overall quality control. Highlight: We established a comprehensive quality evaluation system for herbal medicine by combining two genetic-based approaches, third-generation sequencing and DGGE (denaturing gradient gel electrophoresis), with analytical chemistry approaches to achieve authentication and quality assessment of the samples.
Ensuring privacy in the study of pathogen genetics
Mehta, Sanjay R.; Vinterbo, Staal A.; Little, Susan J.
2014-01-01
Rapid growth in the genetic sequencing of pathogens in recent years has led to the creation of large sequence databases. This aggregated sequence data can be very useful for tracking and predicting epidemics of infectious diseases. However, the balance between the potential public health benefit and the risk to personal privacy for individuals whose genetic data (personal or pathogen) are included in such work has been difficult to delineate, because neither the true benefit nor the actual risk to participants has been adequately defined. Existing approaches to minimise the risk of privacy loss to participants are based on de-identification of data by removal of a predefined set of identifiers. These approaches neither guarantee privacy nor protect the usefulness of the data. We propose a new approach to privacy protection that will quantify the risk to participants, while still maximising the usefulness of the data to researchers. This emerging standard in privacy protection and disclosure control, which is known as differential privacy, uses a process-driven rather than data-centred approach to protecting privacy. PMID:24721230
Ensuring privacy in the study of pathogen genetics.
Mehta, Sanjay R; Vinterbo, Staal A; Little, Susan J
2014-08-01
Rapid growth in the genetic sequencing of pathogens in recent years has led to the creation of large sequence databases. This aggregated sequence data can be very useful for tracking and predicting epidemics of infectious diseases. However, the balance between the potential public health benefit and the risk to personal privacy for individuals whose genetic data (personal or pathogen) are included in such work has been difficult to delineate, because neither the true benefit nor the actual risk to participants has been adequately defined. Existing approaches to minimise the risk of privacy loss to participants are based on de-identification of data by removal of a predefined set of identifiers. These approaches neither guarantee privacy nor protect the usefulness of the data. We propose a new approach to privacy protection that will quantify the risk to participants, while still maximising the usefulness of the data to researchers. This emerging standard in privacy protection and disclosure control, which is known as differential privacy, uses a process-driven rather than data-centred approach to protecting privacy. Copyright © 2014 Elsevier Ltd. All rights reserved.
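Since both records above point to differential privacy as the proposed standard, a minimal sketch of its canonical building block, the Laplace mechanism, may help: noise scaled to the query's sensitivity divided by the privacy budget epsilon is added to a count, bounding how much any single participant can shift the released value. The data and parameter values below are illustrative only.

```python
# Minimal sketch of the Laplace mechanism for a differentially private count query.
import numpy as np

def dp_count(values, predicate, epsilon=0.5, sensitivity=1.0):
    """Return a differentially private count of items satisfying predicate."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [34, 29, 41, 52, 47]                       # illustrative participant data
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))
```

Smaller epsilon means stronger privacy and noisier answers; a practical system must also track the cumulative budget spent across queries.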
NASA Technical Reports Server (NTRS)
Havelund, Klaus
2014-01-01
We present a form of automaton, referred to as data automata, suited for monitoring sequences of data-carrying events, for example those emitted by an executing software system. This form of automaton allows states to be parameterized with data, forming named records, which are stored in an efficiently indexed data structure, a form of database. This very explicit approach differs from other automaton-based monitoring approaches. Data automata are also characterized by allowing transition conditions to refer to other parameterized states, and by allowing transition sequences. The presented automaton concept is inspired by rule-based systems, especially the Rete algorithm, which is one of the well-established algorithms for executing rule-based systems. We present an optimized external DSL for data automata, as well as a comparable unoptimized internal DSL (API) in the Scala programming language, in order to compare the two solutions. An evaluation compares these two solutions to several other monitoring systems.
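A toy Python monitor (not the Scala DSL or the external DSL described above) can illustrate the parameterized-state idea: facts are stored in a structure indexed by their data so that each incoming event can efficiently look up the related state. The open/close property and event names are hypothetical.

```python
# Minimal sketch of a data-parameterized runtime monitor:
# checks that every ('open', f) is eventually matched by ('close', f).
class OpenCloseMonitor:
    def __init__(self):
        self.open_files = {}            # indexed store: file -> event position

    def step(self, index, event):
        name, f = event
        if name == "open":
            self.open_files[f] = index
        elif name == "close":
            if f not in self.open_files:
                print(f"error at {index}: close({f}) without open")
            else:
                del self.open_files[f]

    def end(self):
        for f, i in self.open_files.items():
            print(f"error: open({f}) at {i} never closed")

m = OpenCloseMonitor()
for i, ev in enumerate([("open", "a"), ("open", "b"), ("close", "a")]):
    m.step(i, ev)
m.end()
```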
Scala, Giovanni; Affinito, Ornella; Palumbo, Domenico; Florio, Ermanno; Monticelli, Antonella; Miele, Gennaro; Chiariotti, Lorenzo; Cocozza, Sergio
2016-11-25
CpG sites in an individual molecule may exist in a binary state (methylated or unmethylated), and each individual DNA molecule, containing a certain number of CpGs, is a combination of these states defining an epihaplotype. Classic quantification-based approaches to studying DNA methylation are intrinsically unable to fully represent the complexity of the underlying methylation substrate. Epihaplotype-based approaches, on the other hand, allow the methylation profiles of cell populations to be studied at the single-molecule level. For such investigations, next-generation sequencing techniques can be used, both for quantitative and for epihaplotype analysis. Currently available tools for methylation analysis lack output formats that explicitly report CpG methylation profiles at the single-molecule level, as well as statistical tools suited to their interpretation. Here we present ampliMethProfiler, a Python-based pipeline for the extraction and statistical epihaplotype analysis of amplicons from targeted deep bisulfite sequencing of multiple DNA regions. The ampliMethProfiler tool provides an easy and user-friendly way to extract and analyze the epihaplotype composition of reads from targeted bisulfite sequencing experiments. ampliMethProfiler is written in the Python language and requires a local installation of BLAST and (optionally) QIIME tools. It can be run on Linux and OS X platforms. The software is open source and freely available at http://amplimethprofiler.sourceforge.net.
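To make the epihaplotype notion concrete, the following sketch (not ampliMethProfiler itself) reduces each bisulfite read to a binary methylation string over assumed CpG positions and tallies the resulting profiles; the read strings, positions, and the C-versus-T convention are illustrative.

```python
# Minimal sketch of epihaplotype counting from aligned bisulfite reads.
from collections import Counter

def epihaplotypes(reads, cpg_positions):
    """'C' at a CpG position = methylated, 'T' = unmethylated (assumed convention)."""
    profiles = []
    for read in reads:
        profile = "".join("1" if read[p] == "C" else "0" for p in cpg_positions)
        profiles.append(profile)
    return Counter(profiles)

reads = ["ACGTCGTTCG", "ATGTCGTTTG", "ACGTTGTTCG"]   # toy aligned reads
print(epihaplotypes(reads, cpg_positions=[1, 4, 8]))
```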
Moore, Jason H; Boczko, Erik M; Summar, Marshall L
2005-02-01
Understanding how DNA sequence variations impact human health through a hierarchy of biochemical and physiological systems is expected to improve the diagnosis, prevention, and treatment of common, complex human diseases. We have previously developed a hierarchical dynamic systems approach based on Petri nets for generating biochemical network models that are consistent with genetic models of disease susceptibility. This modeling approach uses an evolutionary computation approach called grammatical evolution as a search strategy for optimal Petri net models. We have previously demonstrated that this approach routinely identifies biochemical network models that are consistent with a variety of genetic models in which disease susceptibility is determined by nonlinear interactions between two or more DNA sequence variations. We review here this approach and then discuss how it can be used to model biochemical and metabolic data in the context of genetic studies of human disease susceptibility.
GESFIDE-PROPELLER approach for simultaneous R2 and R2* measurements in the abdomen.
Jin, Ning; Guo, Yang; Zhang, Zhuoli; Zhang, Longjiang; Lu, Guangming; Larson, Andrew C
2013-12-01
To investigate the feasibility of combining GESFIDE with PROPELLER sampling approaches for simultaneous abdominal R2 and R2* mapping. R2 and R2* measurements were performed in 9 healthy volunteers and phantoms using the GESFIDE-PROPELLER and the conventional Cartesian-sampling GESFIDE approaches. Images acquired with the GESFIDE-PROPELLER sequence effectively mitigated the respiratory motion artifacts, which were clearly evident in the images acquired using the conventional GESFIDE approach. There was no significant difference between GESFIDE-PROPELLER and reference MGRE R2* measurements (p=0.162) whereas the Cartesian-sampling based GESFIDE methods significantly overestimated R2* values compared to MGRE measurements (p<0.001). The GESFIDE-PROPELLER sequence provided high quality images and accurate abdominal R2 and R2* maps while avoiding the motion artifacts common to the conventional Cartesian-sampling GESFIDE approaches. © 2013 Elsevier Inc. All rights reserved.
GESFIDE-PROPELLER Approach for Simultaneous R2 and R2* Measurements in the Abdomen
Jin, Ning; Guo, Yang; Zhang, Zhuoli; Zhang, Longjiang; Lu, Guangming; Larson, Andrew C.
2013-01-01
Purpose To investigate the feasibility of combining GESFIDE with PROPELLER sampling approaches for simultaneous abdominal R2 and R2* mapping. Materials and Methods R2 and R2* measurements were performed in 9 healthy volunteers and phantoms using the GESFIDE-PROPELLER and the conventional Cartesian-sampling GESFIDE approaches. Results Images acquired with the GESFIDE-PROPELLER sequence effectively mitigated the respiratory motion artifacts, which were clearly evident in the images acquired using the conventional GESFIDE approach. There was no significant difference between GESFIDE-PROPELLER and reference MGRE R2* measurements (p = 0.162), whereas the Cartesian-sampling based GESFIDE methods significantly overestimated R2* values compared to MGRE measurements (p < 0.001). Conclusion The GESFIDE-PROPELLER sequence provided high quality images and accurate abdominal R2 and R2* maps while avoiding the motion artifacts common to the conventional Cartesian-sampling GESFIDE approaches. PMID:24041478
Combining genomic and proteomic approaches for epigenetics research
Han, Yumiao; Garcia, Benjamin A
2014-01-01
Epigenetics is the study of changes in gene expression or cellular phenotype that do not change the DNA sequence. In this review, current methods, both genomic and proteomic, associated with epigenetics research are discussed. Among them, chromatin immunoprecipitation (ChIP) followed by sequencing and other ChIP-based techniques are powerful techniques for genome-wide profiling of DNA-binding proteins, histone post-translational modifications or nucleosome positions. However, mass spectrometry-based proteomics is increasingly being used in functional biological studies and has proved to be an indispensable tool to characterize histone modifications, as well as DNA–protein and protein–protein interactions. With the development of genomic and proteomic approaches, combination of ChIP and mass spectrometry has the potential to expand our knowledge of epigenetics research to a higher level. PMID:23895656
Functional interrogation of non-coding DNA through CRISPR genome editing.
Canver, Matthew C; Bauer, Daniel E; Orkin, Stuart H
2017-05-15
Methodologies to interrogate non-coding regions have lagged behind coding regions despite comprising the vast majority of the genome. However, the rapid evolution of clustered regularly interspaced short palindromic repeats (CRISPR)-based genome editing has provided a multitude of novel techniques for laboratory investigation including significant contributions to the toolbox for studying non-coding DNA. CRISPR-mediated loss-of-function strategies rely on direct disruption of the underlying sequence or repression of transcription without modifying the targeted DNA sequence. CRISPR-mediated gain-of-function approaches similarly benefit from methods to alter the targeted sequence through integration of customized sequence into the genome as well as methods to activate transcription. Here we review CRISPR-based loss- and gain-of-function techniques for the interrogation of non-coding DNA. Copyright © 2017 Elsevier Inc. All rights reserved.
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
2012-01-01
Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026
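A much-simplified, hypothetical version of such a clustering index is sketched below: conservation is scored per alignment column by Shannon entropy, and the spatial clustering of the most conserved residues is summarized as the ratio of their mean nearest-neighbour distance to that of all residues (lower means tighter clustering). This is not the spatial-statistics index of the study, only an illustration of combining sequence and structure information; the alignment and coordinates are placeholders.

```python
# Minimal sketch: entropy-based conservation plus a crude 3-D clustering ratio.
import numpy as np
from scipy.spatial.distance import cdist

def column_entropy(column):
    counts = np.array([column.count(a) for a in set(column)], dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def clustering_ratio(alignment, ca_coords, top_fraction=0.2):
    """alignment: equal-length sequences; ca_coords: (n_columns, 3) C-alpha coordinates."""
    entropies = np.array([column_entropy([s[i] for s in alignment])
                          for i in range(len(alignment[0]))])
    n_top = max(2, int(top_fraction * len(entropies)))
    conserved = np.argsort(entropies)[:n_top]          # lowest entropy = most conserved

    def mean_nn_dist(coords):
        d = cdist(coords, coords)
        np.fill_diagonal(d, np.inf)
        return d.min(axis=1).mean()

    return mean_nn_dist(ca_coords[conserved]) / mean_nn_dist(ca_coords)

aln = ["MKTAYIAK", "MKTAYLAK", "MKSVYIAR"]
coords = np.random.default_rng(0).normal(size=(8, 3))  # placeholder coordinates
print(clustering_ratio(aln, coords))
```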
Portable and Error-Free DNA-Based Data Storage.
Yazdi, S M Hossein Tabatabaei; Gabrys, Ryan; Milenkovic, Olgica
2017-07-10
DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.
PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
2012-01-01
Background With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide the reproducible and comparable results needed for global-scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and is frequently poorly annotated, since no user-friendly tool exists to analyze and explore it. Results PHYLOViZ is platform-independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole-genome sequencing approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net. PMID:22568821
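The minimum-spanning-tree view used by goeBURST-style analyses can be illustrated in a few lines of code (not PHYLOViZ itself, which adds specific tie-breaking rules and visualization): isolates are allelic profiles, edge weights count differing loci, and the MST summarizes putative relationships. The profiles below are hypothetical.

```python
# Minimal sketch of an MST over MLST-like allelic profiles.
import networkx as nx

profiles = {                 # hypothetical allelic profiles (one allele number per locus)
    "ST1": (1, 3, 1, 1, 4, 1, 3),
    "ST2": (1, 3, 1, 2, 4, 1, 3),
    "ST3": (1, 3, 2, 2, 4, 1, 3),
    "ST4": (5, 6, 2, 2, 7, 1, 3),
}

G = nx.Graph()
names = list(profiles)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        diff = sum(x != y for x, y in zip(profiles[a], profiles[b]))
        G.add_edge(a, b, weight=diff)

mst = nx.minimum_spanning_tree(G)
for u, v, d in sorted(mst.edges(data=True), key=lambda e: e[2]["weight"]):
    print(u, "-", v, "locus differences:", d["weight"])
```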
Pilotte, Nils; Papaiakovou, Marina; Grant, Jessica R; Bierwert, Lou Ann; Llewellyn, Stacey; McCarthy, James S; Williams, Steven A
2016-03-01
The soil transmitted helminths are a group of parasitic worms responsible for extensive morbidity in many of the world's most economically depressed locations. With growing emphasis on disease mapping and eradication, the availability of accurate and cost-effective diagnostic measures is of paramount importance to global control and elimination efforts. While real-time PCR-based molecular detection assays have shown great promise, to date, these assays have utilized sub-optimal targets. By performing next-generation sequencing-based repeat analyses, we have identified high copy-number, non-coding DNA sequences from a series of soil transmitted pathogens. We have used these repetitive DNA elements as targets in the development of novel, multi-parallel, PCR-based diagnostic assays. Utilizing next-generation sequencing and the Galaxy-based RepeatExplorer web server, we performed repeat DNA analysis on five species of soil transmitted helminths (Necator americanus, Ancylostoma duodenale, Trichuris trichiura, Ascaris lumbricoides, and Strongyloides stercoralis). Employing high copy-number, non-coding repeat DNA sequences as targets, novel real-time PCR assays were designed, and assays were tested against established molecular detection methods. Each assay provided consistent detection of genomic DNA at quantities of 2 fg or less, demonstrated species-specificity, and showed an improved limit of detection over the existing, proven PCR-based assay. The utilization of next-generation sequencing-based repeat DNA analysis methodologies for the identification of molecular diagnostic targets has the ability to improve assay species-specificity and limits of detection. By exploiting such high copy-number repeat sequences, the assays described here will facilitate soil transmitted helminth diagnostic efforts. We recommend similar analyses when designing PCR-based diagnostic tests for the detection of other eukaryotic pathogens.
Combining Rosetta with molecular dynamics (MD): A benchmark of the MD-based ensemble protein design.
Ludwiczak, Jan; Jarmula, Adam; Dunin-Horkawicz, Stanislaw
2018-07-01
Computational protein design is a set of procedures for computing amino acid sequences that will fold into a specified structure. Rosetta Design, commonly used software for protein design, allows for the effective identification of sequences compatible with a given backbone structure, while molecular dynamics (MD) simulations can thoroughly sample near-native conformations. We benchmarked a procedure in which Rosetta Design is started on MD-derived structural ensembles and showed that such a combined approach generates 20-30% more diverse sequences than currently available methods, with only a slight increase in computation time. Importantly, the increase in diversity is achieved without a loss in the quality of the designed sequences, as assessed by their resemblance to natural sequences. We demonstrate that the MD-based procedure is also applicable to de novo design tasks started from backbone structures without any sequence information. In addition, we implemented a protocol that can be used to assess the stability of designed models and to select the best candidates for experimental validation. In sum, our results demonstrate that MD ensemble-based flexible-backbone design can be a viable method for protein design, especially for tasks that require a large pool of diverse sequences. Copyright © 2018 Elsevier Inc. All rights reserved.
MPN estimation of qPCR target sequence recoveries from whole cell calibrator samples.
Sivaganesan, Mano; Siefring, Shawn; Varma, Manju; Haugland, Richard A
2011-12-01
DNA extracts from enumerated target organism cells (calibrator samples) have been used for estimating Enterococcus cell equivalent densities in surface waters by a comparative cycle threshold (Ct) qPCR analysis method. To compare surface water Enterococcus density estimates from different studies by this approach, either a consistent source of calibrator cells must be used or the estimates must account for any differences in target sequence recoveries from different sources of calibrator cells. In this report we describe two methods for estimating target sequence recoveries from whole cell calibrator samples based on qPCR analyses of their serially diluted DNA extracts and most probable number (MPN) calculation. The first method employed a traditional MPN calculation approach. The second method employed a Bayesian hierarchical statistical modeling approach and a Monte Carlo Markov Chain (MCMC) simulation method to account for the uncertainty in these estimates associated with different individual samples of the cell preparations, different dilutions of the DNA extracts and different qPCR analytical runs. The two methods were applied to estimate mean target sequence recoveries per cell from two different lots of a commercially available source of enumerated Enterococcus cell preparations. The mean target sequence recovery estimates (and standard errors) per cell from Lot A and B cell preparations by the Bayesian method were 22.73 (3.4) and 11.76 (2.4), respectively, when the data were adjusted for potential false positive results. Means were similar for the traditional MPN approach which cannot comparably assess uncertainty in the estimates. Cell numbers and estimates of recoverable target sequences in calibrator samples prepared from the two cell sources were also used to estimate cell equivalent and target sequence quantities recovered from surface water samples in a comparative Ct method. Our results illustrate the utility of the Bayesian method in accounting for uncertainty, the high degree of precision attainable by the MPN approach and the need to account for the differences in target sequence recoveries from different calibrator sample cell sources when they are used in the comparative Ct method. Published by Elsevier B.V.
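For orientation, the classical maximum-likelihood calculation behind the "traditional" MPN estimate can be written compactly: each dilution contributes a binomial term whose positive-tube probability is 1 - exp(-lambda * volume), and lambda is chosen to maximize the joint likelihood. The sketch below uses an illustrative dilution design, not the calibrator data of the study, and omits the Bayesian hierarchical layer described above.

```python
# Minimal sketch of a maximum-likelihood MPN estimate from a dilution series.
import numpy as np
from scipy.optimize import minimize_scalar

def mpn(volumes, n_tubes, n_positive):
    """Most probable number per unit volume; volumes = sample amount per tube."""
    volumes = np.asarray(volumes, float)
    n_tubes = np.asarray(n_tubes, float)
    n_positive = np.asarray(n_positive, float)

    def neg_log_lik(log_lam):
        lam = np.exp(log_lam)
        p_pos = 1.0 - np.exp(-lam * volumes)          # P(tube is positive)
        p_pos = np.clip(p_pos, 1e-12, 1 - 1e-12)
        ll = (n_positive * np.log(p_pos)
              + (n_tubes - n_positive) * np.log(1.0 - p_pos))
        return -ll.sum()

    res = minimize_scalar(neg_log_lik, bounds=(-10, 10), method="bounded")
    return np.exp(res.x)

# 10-fold dilutions, 5 tubes per dilution, positives = [5, 3, 1] (illustrative)
print(mpn(volumes=[0.1, 0.01, 0.001], n_tubes=[5, 5, 5], n_positive=[5, 3, 1]))
```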
Understanding the mechanisms of protein-DNA interactions
NASA Astrophysics Data System (ADS)
Lavery, Richard
2004-03-01
Structural, biochemical and thermodynamic data on protein-DNA interactions show that specific recognition cannot be reduced to a simple set of binary interactions between the partners (such as hydrogen bonds, ion pairs or steric contacts). The mechanical properties of the partners also play a role and, in the case of DNA, variations in both conformation and flexibility as a function of base sequence can be a significant factor in guiding a protein to the correct binding site. All-atom molecular modeling offers a means of analyzing the role of different binding mechanisms within protein-DNA complexes of known structure. This however requires estimating the binding strengths for the full range of sequences with which a given protein can interact. Since this number grows exponentially with the length of the binding site it is necessary to find a method to accelerate the calculations. We have achieved this by using a multi-copy approach (ADAPT) which allows us to build a DNA fragment with a variable base sequence. The results obtained with this method correlate well with experimental consensus binding sequences. They enable us to show that indirect recognition mechanisms involving the sequence dependent properties of DNA play a significant role in many complexes. This approach also offers a means of predicting protein binding sites on the basis of binding energies, which is complementary to conventional lexical techniques.
Broadband excitation in nuclear magnetic resonance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tycko, Robert
1984-10-01
Theoretical methods for designing sequences of radio frequency (rf) radiation pulses for broadband excitation of spin systems in nuclear magnetic resonance (NMR) are described. The sequences excite spins uniformly over large ranges of resonant frequencies arising from static magnetic field inhomogeneity, chemical shift differences, or spin couplings, or over large ranges of rf field amplitudes. Specific sequences for creating a population inversion or transverse magnetization are derived and demonstrated experimentally in liquid and solid state NMR. One approach to broadband excitation is based on principles of coherent averaging theory. A general formalism for deriving pulse sequences is given, along with computational methods for specific cases. This approach leads to sequences that produce strictly constant transformations of a spin system. The importance of this feature in NMR applications is discussed. A second approach to broadband excitation makes use of iterative schemes, i.e. sets of operations that are applied repetitively to a given initial pulse sequence, generating a series of increasingly complex sequences with increasingly desirable properties. A general mathematical framework for analyzing iterative schemes is developed. An iterative scheme is treated as a function that acts on a space of operators corresponding to the transformations produced by all possible pulse sequences. The fixed points of the function and the stability of the fixed points are shown to determine the essential behavior of the scheme. Iterative schemes for broadband population inversion are treated in detail. Algebraic and numerical methods for performing the mathematical analysis are presented. Two additional topics are treated. The first is the construction of sequences for uniform excitation of double-quantum coherence and for uniform polarization transfer over a range of spin couplings. Double-quantum excitation sequences are demonstrated in a liquid crystal system. The second additional topic is the construction of iterative schemes for narrowband population inversion. The use of sequences that invert spin populations only over a narrow range of rf field amplitudes to spatially localize NMR signals in an rf field gradient is discussed.
Level statistics of words: Finding keywords in literary texts and symbolic sequences
NASA Astrophysics Data System (ADS)
Carpena, P.; Bernaola-Galván, P.; Hackenberg, M.; Coronado, A. V.; Oliver, J. L.
2009-03-01
Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach able to automatically extract keywords from literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they self-attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works on generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
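A stripped-down stand-in for the clustering measure is easy to state in code: for each sufficiently frequent word, compute the coefficient of variation of the gaps between its occurrences; values well above the level expected for random placement flag self-attracting, potentially relevant words. This is a simplification of the authors' level-statistics analysis, and the sample text is synthetic.

```python
# Minimal sketch: rank words by the coefficient of variation of their
# inter-occurrence gaps (higher = more clustered).
import re
from collections import defaultdict
import numpy as np

def keyword_scores(text, min_count=5):
    words = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    scores = {}
    for w, pos in positions.items():
        if len(pos) < min_count:
            continue
        gaps = np.diff(pos)
        scores[w] = gaps.std() / gaps.mean()   # CV of gaps between occurrences
    return sorted(scores.items(), key=lambda kv: -kv[1])

sample = ("the whale surfaced near the ship " * 20
          + "the crew slept while the sea was calm " * 20
          + "the whale returned and the whale breached near the whale pod " * 5)
print(keyword_scores(sample, min_count=5)[:5])
```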
Buckley, Mike
2016-03-24
Collagen is one of the most ubiquitous proteins in the animal kingdom and the dominant protein in extracellular tissues such as bone, skin and other connective tissues, in which it acts primarily as a supporting scaffold. It has been widely investigated scientifically, not only as a biomedical material for regenerative medicine, but also for its role as a food source for both humans and livestock. Due to the long-term stability of collagen, as well as its abundance in bone, it has been proposed as a source of biomarkers for species identification, not only for heat- and pressure-rendered animal feed but also in ancient archaeological and palaeontological specimens, typically carried out by peptide mass fingerprinting (PMF) as well as in-depth liquid chromatography (LC)-based tandem mass spectrometric methods. Through the analysis of the three most common domesticated species, cow, sheep, and pig, this research examines the advantages of each approach over the other, investigating sites of sequence variation in relation to known functional properties of the collagen molecule. Results indicate that the species biomarkers previously identified through PMF analysis are not among the most variable type 1 collagen peptides present in these tissues, the latter of which can be detected by LC-based methods. However, it is clear that the highly repetitive sequence motif of collagen throughout the molecule, combined with the variability of the sites and relative abundance levels of hydroxylation, can result in high-scoring false positive peptide matches using these LC-based methods. Additionally, the greater sequence variation in the alpha 2(I) chain, in comparison to the alpha 1(I) chain, did not appear to be specific to any particular functional properties, implying that intra-chain functional constraints on sequence variation are not as great as inter-chain constraints. However, although some of the most variable peptides were only observed with LC-based methods, until the range of publicly available collagen sequences improves, the simplicity of the PMF approach and the suitable range of peptide sequence variation observed make it the ideal method for initial taxonomic identification, with further analysis by LC-based methods only when required.
Tan, Joon Liang; Khang, Tsung Fei; Ngeow, Yun Fong; Choo, Siew Woh
2013-12-13
Mycobacterium abscessus is a rapidly growing mycobacterium that is often associated with human infections. The taxonomy of this species has undergone several revisions and is still being debated. In this study, we sequenced the genomes of 12 M. abscessus strains and used phylogenomic analysis to perform subspecies classification. A data mining approach was used to rank and select informative genes based on the relative entropy metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, single nucleotide polymorphism (SNP)-based classification, and strain clustering demonstrated by a variable number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that the concatenation of a minimal set of three median-ranked genes, DNA polymerase III subunit alpha (polC), 4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (ftsZ), is sufficient to recover the same tree topology. PCR assays designed specifically for these genes showed that all three genes could be amplified in the reference strain of M. abscessus ATCC 19977T. This study provides proof of concept that a whole-genome sequence-based data mining approach can provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be applied to species or subspecies classification of other bacteria.
Quantifying the Relationships among Drug Classes
Hert, Jérôme; Keiser, Michael J.; Irwin, John J.; Oprea, Tudor I.; Shoichet, Brian K.
2009-01-01
The similarity of drug targets is typically measured using sequence or structural information. Here, we consider chemo-centric approaches that measure target similarity on the basis of their ligands, asking how chemoinformatics similarities differ from those derived bioinformatically, how stable the ligand networks are to changes in chemoinformatics metrics, and which network is the most reliable for prediction of pharmacology. We calculated the similarities between hundreds of drug targets and their ligands and mapped the relationship between them in a formal network. Bioinformatics networks were based on the BLAST similarity between sequences, while chemoinformatics networks were based on the ligand-set similarities calculated with either the Similarity Ensemble Approach (SEA) or a method derived from Bayesian statistics. By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity. In contrast, the chemoinformatics networks were stable to the method used to calculate the ligand-set similarities and to the chemical representation of the ligands. Also, the chemoinformatics networks were more natural and more organized, by network theory, than their bioinformatics counterparts: ligand-based networks were found to be small-world and broad-scale. PMID:18335977
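The raw ingredient of such chemo-centric comparisons can be sketched briefly: the similarity of two targets' ligand sets is taken here as the mean pairwise Tanimoto similarity between Morgan fingerprints computed with RDKit. This omits the statistical model that SEA or the Bayesian method layer on top of the raw scores, and the SMILES strings are illustrative.

```python
# Minimal sketch of a raw ligand-set similarity between two targets.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprints(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

def ligand_set_similarity(set_a, set_b):
    fps_a, fps_b = fingerprints(set_a), fingerprints(set_b)
    sims = [DataStructs.TanimotoSimilarity(a, b) for a in fps_a for b in fps_b]
    return sum(sims) / len(sims)

target1 = ["CCO", "CCN", "CCOC"]          # hypothetical ligands of target 1
target2 = ["c1ccccc1O", "c1ccccc1N"]      # hypothetical ligands of target 2
print(ligand_set_similarity(target1, target2))
```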
LipidSeq: a next-generation clinical resequencing panel for monogenic dyslipidemias.
Johansen, Christopher T; Dubé, Joseph B; Loyzer, Melissa N; MacDonald, Austin; Carter, David E; McIntyre, Adam D; Cao, Henian; Wang, Jian; Robinson, John F; Hegele, Robert A
2014-04-01
We report the design of a targeted resequencing panel for monogenic dyslipidemias, LipidSeq, for the purpose of replacing Sanger sequencing in the clinical detection of dyslipidemia-causing variants. We also evaluate the performance of the LipidSeq approach versus Sanger sequencing in 84 patients with a range of phenotypes including extreme blood lipid concentrations as well as additional dyslipidemias and related metabolic disorders. The panel performs well, with high concordance (95.2%) in samples with known mutations based on Sanger sequencing and a high detection rate (57.9%) of mutations likely to be causative for disease in samples not previously sequenced. Clinical implementation of LipidSeq has the potential to aid in the molecular diagnosis of patients with monogenic dyslipidemias with a high degree of speed and accuracy and at lower cost than either Sanger sequencing or whole exome sequencing. Furthermore, LipidSeq will help to provide a more focused picture of monogenic and polygenic contributors that underlie dyslipidemia while excluding the discovery of incidental pathogenic clinically actionable variants in nonmetabolism-related genes, such as oncogenes, that would otherwise be identified by a whole exome approach, thus minimizing potential ethical issues.
LipidSeq: a next-generation clinical resequencing panel for monogenic dyslipidemias[S
Johansen, Christopher T.; Dubé, Joseph B.; Loyzer, Melissa N.; MacDonald, Austin; Carter, David E.; McIntyre, Adam D.; Cao, Henian; Wang, Jian; Robinson, John F.; Hegele, Robert A.
2014-01-01
We report the design of a targeted resequencing panel for monogenic dyslipidemias, LipidSeq, for the purpose of replacing Sanger sequencing in the clinical detection of dyslipidemia-causing variants. We also evaluate the performance of the LipidSeq approach versus Sanger sequencing in 84 patients with a range of phenotypes including extreme blood lipid concentrations as well as additional dyslipidemias and related metabolic disorders. The panel performs well, with high concordance (95.2%) in samples with known mutations based on Sanger sequencing and a high detection rate (57.9%) of mutations likely to be causative for disease in samples not previously sequenced. Clinical implementation of LipidSeq has the potential to aid in the molecular diagnosis of patients with monogenic dyslipidemias with a high degree of speed and accuracy and at lower cost than either Sanger sequencing or whole exome sequencing. Furthermore, LipidSeq will help to provide a more focused picture of monogenic and polygenic contributors that underlie dyslipidemia while excluding the discovery of incidental pathogenic clinically actionable variants in nonmetabolism-related genes, such as oncogenes, that would otherwise be identified by a whole exome approach, thus minimizing potential ethical issues. PMID:24503134
Computational Tools and Algorithms for Designing Customized Synthetic Genes
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
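The simplest first-generation objective mentioned above, codon-usage optimization, can be illustrated in a few lines: back-translate a protein with the most frequent codon per amino acid from a (hypothetical, truncated) preference table and check that no unwanted restriction site has been introduced. Multi-objective designers extend this with many additional, often conflicting, constraints.

```python
# Minimal sketch of single-objective codon optimization with a restriction-site check.
PREFERRED = {  # illustrative E. coli-style codon preferences for a few residues
    "M": "ATG", "K": "AAA", "T": "ACC", "A": "GCG",
    "Y": "TAT", "I": "ATT", "L": "CTG", "*": "TAA",
}

def codon_optimize(protein, forbidden=("GAATTC",)):   # e.g. an EcoRI site
    dna = "".join(PREFERRED[aa] for aa in protein)
    for site in forbidden:
        if site in dna:
            raise ValueError(f"forbidden site {site} introduced; redesign needed")
    return dna

print(codon_optimize("MKTAYIAL*"))
```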
Genetic mutations in human rectal cancers detected by targeted sequencing.
Bai, Jun; Gao, Jinglong; Mao, Zhijun; Wang, Jianhua; Li, Jianhui; Li, Wensheng; Lei, Yu; Li, Shuaishuai; Wu, Zhuo; Tang, Chuanning; Jones, Lindsey; Ye, Hua; Lou, Feng; Liu, Zhiyuan; Dong, Zhishou; Guo, Baishuai; Huang, Xue F; Chen, Si-Yi; Zhang, Enke
2015-10-01
Colorectal cancer (CRC) is widespread with significant mortality. Both inherited and sporadic mutations in various signaling pathways influence the development and progression of the cancer. Identifying genetic mutations in CRC is important for optimal patient treatment and many approaches currently exist to uncover these mutations, including next-generation sequencing (NGS) and commercially available kits. In the present study, we used a semiconductor-based targeted DNA-sequencing approach to sequence and identify genetic mutations in 91 human rectal cancer samples. Analysis revealed frequent mutations in KRAS (58.2%), TP53 (28.6%), APC (16.5%), FBXW7 (9.9%) and PIK3CA (9.9%), and additional mutations in BRAF, CTNNB1, ERBB2 and SMAD4 were also detected at lesser frequencies. Thirty-eight samples (41.8%) also contained two or more mutations, with common combination mutations occurring between KRAS and TP53 (42.1%), and KRAS and APC (31.6%). DNA sequencing for individual cancers is of clinical importance for targeted drug therapy and the advantages of such targeted gene sequencing over other NGS platforms or commercially available kits in sensitivity, cost and time effectiveness may aid clinicians in treating CRC patients in the near future.
Liu, Siyang; Huang, Shujia; Rao, Junhua; Ye, Weijian; Krogh, Anders; Wang, Jun
2015-01-01
Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) as well as large deletions. However, these approaches consistently display a substantial bias against the recovery of complex structural variants and novel sequence in individual genomes and do not provide interpretation information such as the annotation of ancestral state and formation mechanism. We present a novel approach implemented in a single software package, AsmVar, to discover, genotype and characterize different forms of structural variation and novel sequence from population-scale de novo genome assemblies up to nucleotide resolution. Application of AsmVar to several human de novo genome assemblies captures a wide spectrum of structural variants and novel sequences present in the human population with high sensitivity and specificity. Our method provides a direct solution for investigating structural variants and novel sequences from de novo genome assemblies, facilitating the construction of population-scale pan-genomes. Our study also highlights the usefulness of the de novo assembly strategy for defining genome structure.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aw, Tiong Gim; Howe, Adina; Rose, Joan B.
2014-12-01
Genomic-based molecular techniques are emerging as powerful tools that allow a comprehensive characterization of water and wastewater microbiomes. Most recently, next generation sequencing (NGS) technologies which produce large amounts of sequence data are beginning to impact the field of environmental virology. In this study, NGS and bioinformatics have been employed for the direct detection and characterization of viruses in wastewater and of viruses isolated after cell culture. Viral particles were concentrated and purified from sewage samples by polyethylene glycol precipitation. Viral nucleic acid was extracted and randomly amplified prior to sequencing using Illumina technology, yielding a total of 18 million sequence reads. Most of the viral sequences detected could not be characterized, indicating the great viral diversity that is yet to be discovered. This sewage virome was dominated by bacteriophages and contained sequences related to known human pathogenic viruses such as adenoviruses (species B, C and F), polyomaviruses JC and BK and enteroviruses (type B). An array of other animal viruses was also found, suggesting unknown zoonotic viruses. This study demonstrated the feasibility of metagenomic approaches to characterize viruses in complex environmental water samples.
Evaluation and integration of existing methods for computational prediction of allergens.
Wang, Jing; Yu, Yabin; Zhao, Yunan; Zhang, Dabing; Li, Jing
2013-01-01
Allergy involves a series of complex reactions and factors that contribute to the development of the disease and the triggering of its symptoms, including rhinitis, asthma, atopic eczema, skin sensitivity, and even acute and fatal anaphylactic shock. Prediction and evaluation of potential allergenicity is important for the safety evaluation of foods and other environmental factors. Although several computational approaches for assessing the potential allergenicity of proteins have been developed, their performance and relative merits and shortcomings have not been compared systematically. To evaluate and improve the existing methods for allergen prediction, we collected an up-to-date definitive dataset consisting of 989 known allergens and a large set of putative non-allergens. The three most widely used computational allergen prediction approaches, namely sequence-, motif- and SVM-based (Support Vector Machine) methods, were systematically compared using the defined parameters, and we found that the SVM-based method outperformed the other two with higher accuracy and specificity. The sequence-based method with the criteria defined by FAO/WHO (FAO: Food and Agriculture Organization of the United Nations; WHO: World Health Organization) has a high sensitivity of over 98%, but low specificity. The advantage of the motif-based method is its ability to visualize the key motif within the allergen. Notably, the performance of the FAO/WHO sequence-based criteria and the motif-eliciting strategy could be improved by optimization of their parameters. To facilitate allergen prediction, we integrated these three methods in a web-based application, proAP, which provides a global search of known allergens and a powerful tool for allergen prediction. Flexible parameter setting and batch prediction were also implemented. proAP can be accessed at http://gmobl.sjtu.edu.cn/proAP/main.html. This study comprehensively evaluated sequence-, motif- and SVM-based computational prediction approaches for allergens and optimized their parameters to obtain better performance. These findings may provide helpful guidance for researchers in allergen prediction. Furthermore, integrating these methods into the proAP web application greatly facilitates customizable allergen search and prediction.
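For readers unfamiliar with the FAO/WHO sequence criteria mentioned above, the sketch below implements a simplified version of the two commonly cited rules (at least 35% identity over an 80-residue window, or a shared identical stretch of six contiguous residues). It compares ungapped windows rather than local alignments, so it illustrates the logic only; it is not the proAP implementation, and real pipelines use alignment tools for the window criterion.

    def identity(a: str, b: str) -> float:
        """Fraction of identical positions between two equal-length strings."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def fao_who_flag(query: str, allergen: str,
                     window: int = 80, id_cutoff: float = 0.35,
                     exact_len: int = 6) -> bool:
        # Criterion (b): shared exact peptide of length >= exact_len
        query_kmers = {query[i:i + exact_len]
                       for i in range(len(query) - exact_len + 1)}
        for j in range(len(allergen) - exact_len + 1):
            if allergen[j:j + exact_len] in query_kmers:
                return True
        # Criterion (a): ungapped windows of up to 80 residues, >= 35% identity
        for i in range(max(1, len(query) - window + 1)):
            q_win = query[i:i + window]
            for j in range(max(1, len(allergen) - len(q_win) + 1)):
                a_win = allergen[j:j + len(q_win)]
                if len(a_win) == len(q_win) and identity(q_win, a_win) >= id_cutoff:
                    return True
        return False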
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence against a large database of biological protein sequences plays an important role in developing competitive pharmacological products. Conventional methods, which compare an unseen sequence with all identified protein sequences and return the category of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward networks (SLFNs). The recent and efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimally pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensemble members. For each ensemble, the same training algorithm is adopted, and the final category index is derived by majority voting. Two approaches, namely the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
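As an illustration of the training scheme described above (not the authors' implementation), the NumPy sketch below fits a basic ELM, whose hidden-layer weights are random and whose output weights are obtained analytically, and combines several such SLFNs by majority voting. The protein sequences are assumed to have already been encoded as fixed-length numeric feature vectors; the toy data, hidden-layer size, and ensemble size are arbitrary.

    import numpy as np

    def train_elm(X, y, n_hidden, n_classes, rng):
        W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
        b = rng.normal(size=n_hidden)                 # random biases
        H = np.tanh(X @ W + b)                        # hidden-layer activations
        T = np.eye(n_classes)[y]                      # one-hot targets
        beta = np.linalg.pinv(H) @ T                  # analytic output weights
        return W, b, beta

    def predict_elm(model, X):
        W, b, beta = model
        return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

    def ensemble_predict(models, X):
        votes = np.stack([predict_elm(m, X) for m in models])  # one row per SLFN
        # majority vote over the ensemble members for each sample
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(300, 40)), rng.integers(0, 4, 300)  # toy features, 4 classes
    models = [train_elm(X, y, n_hidden=50, n_classes=4, rng=rng) for _ in range(5)]
    print(ensemble_predict(models, X[:10]))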
Li, Liqi; Luo, Qifa; Xiao, Weidong; Li, Jinhui; Zhou, Shiwen; Li, Yongsheng; Zheng, Xiaoqi; Yang, Hua
2017-02-01
Palmitoylation is the covalent attachment of lipids to amino acid residues in proteins. As an important form of protein posttranslational modification, it increases the hydrophobicity of proteins, which contributes to protein transport, organelle localization, and function, and it therefore plays an important role in a variety of cell biological processes. Identification of palmitoylation sites is necessary for understanding protein-protein interactions, protein stability, and activity. Since conventional experimental techniques to determine palmitoylation sites in proteins are both labor-intensive and costly, a fast and accurate computational approach to predict palmitoylation sites from protein sequences is urgently needed. In this study, a support vector machine (SVM)-based method was proposed that integrates the PSI-BLAST profile, physicochemical properties, [Formula: see text]-mer amino acid compositions (AACs), and [Formula: see text]-mer pseudo AACs into the principal feature vector. A recursive feature selection scheme was subsequently implemented to single out the most discriminative features. Finally, an SVM was trained to predict palmitoylation sites in proteins based on the optimal features. The proposed method achieved an accuracy of 99.41% and a Matthews Correlation Coefficient of 0.9773 on a benchmark dataset. The result indicates the efficiency and accuracy of our method in predicting palmitoylation sites from protein sequences.
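A hedged sketch of the general workflow, an SVM with recursive feature elimination over a precomputed feature matrix, is given below. The feature extraction step (PSI-BLAST profiles, AACs, pseudo AACs) is assumed to have been done already and is represented by a random placeholder array; the feature counts, fold number, and kernel choice are illustrative, not the published settings.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 120))          # placeholder feature matrix
    y = rng.integers(0, 2, size=200)         # 1 = palmitoylated site, 0 = not

    svm = SVC(kernel="linear")               # linear kernel exposes coef_ for RFE
    selector = RFE(svm, n_features_to_select=40, step=5).fit(X, y)
    X_opt = selector.transform(X)            # keep the most discriminative features

    acc = cross_val_score(SVC(kernel="linear"), X_opt, y, cv=5).mean()
    print(f"cross-validated accuracy on toy data: {acc:.3f}")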
Sequence-invariant state machines
NASA Technical Reports Server (NTRS)
Whitaker, Sterling R.; Manjunath, Shamanna K.; Maki, Gary K.
1991-01-01
A synthesis method and an MOS VLSI architecture are presented to realize sequential circuits that have the ability to implement any state machine having N states and m inputs, regardless of the actual sequence specified in the flow table. The design method utilizes binary tree structured (BTS) logic to implement regular and dense circuits. The desired state sequence can be hardwired with power supply connections or can be dynamically reallocated if stored in a register. This allows programmable VLSI controllers to be designed with a compact size and performance approaching that of dedicated logic. Results of ICV implementations are reported and an example sequence-invariant state machine is contrasted with implementations based on traditional methods.
Moving object detection and tracking in videos through turbulent medium
NASA Astrophysics Data System (ADS)
Halder, Kalyan Kumar; Tahtali, Murat; Anavatti, Sreenatha G.
2016-06-01
This paper addresses the problem of identifying and tracking moving objects in a video sequence having a time-varying background. This is a fundamental task in many computer vision applications, though a very challenging one because of turbulence that causes blurring and spatiotemporal movements of the background images. Our proposed approach involves two major steps. First, a moving object detection algorithm is used that detects real motions by separating out turbulence-induced motions with a two-level thresholding technique. In the second step, a feature-based generalized regression neural network is applied to track the detected objects throughout the frames of the video sequence. The proposed approach uses the centroid and area features of the moving objects and creates the reference regions instantly by selecting the objects within a circle. Simulation experiments are carried out on several turbulence-degraded video sequences, and comparisons with an earlier method confirm that the proposed approach provides more effective tracking of the targets.
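The following is an illustrative sketch only, not the authors' algorithm: a generic two-level thresholding step on a frame-difference image that keeps connected regions which are both large enough and contain strong changes, one simple way to suppress weak turbulence-induced fluctuations while retaining genuine object motion. The threshold values and the area cutoff are assumptions; the returned centroids and region masks are the kind of features an abstract like this describes feeding into the tracking stage.

    import numpy as np
    from scipy import ndimage

    def detect_moving_objects(frame, background, t_low=10.0, t_high=30.0, min_area=50):
        diff = np.abs(frame.astype(float) - background.astype(float))
        candidate = diff > t_low                  # everything that changed at all
        strong = diff > t_high                    # strong changes only
        labels, n = ndimage.label(candidate)      # connected candidate regions
        accepted = [r for r in range(1, n + 1)
                    if (labels == r).sum() >= min_area and strong[labels == r].any()]
        mask = np.isin(labels, accepted)          # large regions containing strong motion
        centroids = ndimage.center_of_mass(diff, labels, accepted)
        return mask, centroids                    # centroid/area features feed the tracker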
Bose, Jeffrey L
2016-01-01
The ability to create mutations is an important step towards understanding bacterial physiology and virulence. While targeted approaches are invaluable, the ability to produce genome-wide random mutations can lead to crucial discoveries. Transposon mutagenesis is a useful approach, but many interesting mutations can be missed because these insertions interrupt coding and noncoding sequences through the integration of an entire transposon. Chemical mutagenesis and UV-based random mutagenesis are alternative approaches for isolating mutations of interest, with the potential to introduce only single-nucleotide changes. Although once a standard method, this technique declined in popularity because of the difficulty of identifying mutation sites. However, thanks to the recent emergence of economical whole-genome sequencing, this approach to making mutations can once again become a viable option. Therefore, this chapter provides an overview protocol for random mutagenesis using UV light or DNA-damaging chemicals.
Quaglio, Pietro; Yegenoglu, Alper; Torre, Emiliano; Endres, Dominik M.; Grün, Sonja
2017-01-01
Repeated, precise sequences of spikes are largely considered a signature of activation of cell assemblies. These repeated sequences are commonly known under the name of spatio-temporal patterns (STPs). STPs are hypothesized to play a role in the communication of information in the computational process operated by the cerebral cortex. A variety of statistical methods for the detection of STPs have been developed and applied to electrophysiological recordings, but such methods scale poorly with the current size of available parallel spike train recordings (more than 100 neurons). In this work, we introduce a novel method capable of overcoming the computational and statistical limits of existing analysis techniques in detecting repeating STPs within massively parallel spike trains (MPST). We employ advanced data mining techniques to efficiently extract repeating sequences of spikes from the data. Then, we introduce and compare two alternative approaches to distinguish statistically significant patterns from chance sequences. The first approach uses a measure known as conceptual stability, of which we investigate a computationally cheap approximation for applications to such large data sets. The second approach is based on the evaluation of pattern statistical significance. In particular, we provide an extension to STPs of a method we recently introduced for the evaluation of statistical significance of synchronous spike patterns. The performance of the two approaches is evaluated in terms of computational load and statistical power on a variety of artificial data sets that replicate specific features of experimental data. Both methods provide an effective and robust procedure for detection of STPs in MPST data. The method based on significance evaluation shows the best overall performance, although at a higher computational cost. We name the novel procedure the spatio-temporal Spike PAttern Detection and Evaluation (SPADE) analysis. PMID:28596729
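To make the two-stage logic (mine repeating sequences, then assess them against chance) concrete, here is a conceptual sketch that is far simpler than SPADE: it counts ordered spike sequences of a fixed length across trials and compares each count against counts from time-jittered surrogate data. The pattern length, jitter width, and number of surrogates are illustrative assumptions, not parameters from the paper.

    import numpy as np

    def sequence_counts(spike_trains, order=3):
        """spike_trains: one list of (neuron_id, spike_time) pairs per trial."""
        counts = {}
        for trial in spike_trains:
            ordered = [n for n, _ in sorted(trial, key=lambda p: p[1])]
            for i in range(len(ordered) - order + 1):
                pat = tuple(ordered[i:i + order])
                counts[pat] = counts.get(pat, 0) + 1
        return counts

    def surrogate(trial, jitter, rng):
        # destroy precise timing while keeping firing rates roughly intact
        return [(n, t + rng.uniform(-jitter, jitter)) for n, t in trial]

    def significant_patterns(spike_trains, order=3, jitter=5.0, n_surr=100,
                             alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        observed = sequence_counts(spike_trains, order)
        exceed = {pat: 0 for pat in observed}
        for _ in range(n_surr):
            surr_counts = sequence_counts(
                [surrogate(tr, jitter, rng) for tr in spike_trains], order)
            for pat, c in observed.items():
                if surr_counts.get(pat, 0) >= c:
                    exceed[pat] += 1
        # keep patterns whose observed count is rarely matched by chance
        return {pat: c for pat, c in observed.items()
                if exceed[pat] / n_surr < alpha}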
Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.
Alkhateeb, Abedalrhman; Rueda, Luis
2017-08-01
Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts, so preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting the number of unique k-mers it contains as its score, and also takes into account other factors such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. The Zseq algorithm provides a better mapping rate and reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads was conducted by aligning the reads and assembling the transcripts using the reference genome as well as by de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, transcripts assembled de novo from the reads filtered by Zseq are longer than those from the other tested methods. A method for estimating the cutoff threshold using labeling rules is also introduced, with promising results.
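A minimal sketch of the scoring-and-filtering idea described above (not the published Zseq implementation): each read is scored by its number of unique k-mers with a penalty for ambiguous bases, scores are converted to z-scores across the library, and reads below a user-defined cutoff are discarded. The value of k, the penalty weight, and the cutoff are illustrative assumptions.

    import numpy as np

    def read_score(seq, k=8, ambiguous_penalty=2.0):
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}   # unique k-mers
        n_ambiguous = sum(seq.count(b) for b in "NRYSWKM")        # IUPAC ambiguity codes
        return len(kmers) - ambiguous_penalty * n_ambiguous

    def zseq_filter(reads, k=8, z_cutoff=-1.0):
        scores = np.array([read_score(r, k) for r in reads], dtype=float)
        std = scores.std() or 1.0                                 # avoid division by zero
        z = (scores - scores.mean()) / std
        return [r for r, zi in zip(reads, z) if zi >= z_cutoff]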
Recursive Directional Ligation Approach for Cloning Recombinant Spider Silks.
Dinjaski, Nina; Huang, Wenwen; Kaplan, David L
2018-01-01
Recent advances in genetic engineering have provided a route to produce various types of recombinant spider silks. Different cloning strategies have been applied to achieve this goal (e.g., concatemerization, step-by-step ligation, recursive directional ligation). Here we describe recursive directional ligation as an approach that allows for facile modularity and control over the size of the genetic cassettes. This approach is based on sequential ligation of genetic cassettes (monomers) where the junctions between them are formed without interrupting key gene sequences with additional base pairs.
Bouard, Charlotte; Terreux, Raphael; Honorat, Mylène; Manship, Brigitte; Ansieau, Stéphane; Vigneron, Arnaud M.; Puisieux, Alain; Payen, Léa
2016-01-01
The TWIST1 bHLH transcription factor controls embryonic development and cancer processes. Although molecular and genetic analyses have provided a wealth of data on the role of bHLH transcription factors, very little is known about the molecular mechanisms underlying their binding affinity to the E-box sequence of the promoter. Here, we used an in silico model of the TWIST1/E12 (TE) heterocomplex and performed molecular dynamics (MD) simulations of its binding to specific (TE-box) and modified E-box sequences. We focused on (i) active and inactive E-box sequences, (ii) modified active E-box sequences, and (iii) two box sequences with modified adjacent bases, the AT- and TA-boxes. Our in silico models were supported by functional in vitro binding assays. This exploration highlighted the predominant role of protein side-chain residues close to the heart of the complex in anchoring the dimer to DNA sequences, and unveiled a shift towards adjacent ((-1) and (-1*)) bases and conserved bases of modified E-box sequences. In conclusion, our study provides proof of the predictive value of these MD simulations, which may contribute to the characterization of specific inhibitors by docking approaches, and their use in pharmacological therapies by blocking the tumoral TWIST1/E12 function in cancers. PMID:27151200
Analysis of simulated image sequences from sensors for restricted-visibility operations
NASA Technical Reports Server (NTRS)
Kasturi, Rangachar
1991-01-01
A real time model of the visible output from a 94 GHz sensor, based on a radiometric simulation of the sensor, was developed. A sequence of images as seen from an aircraft as it approaches for landing was simulated using this model. Thirty frames from this sequence of 200 x 200 pixel images were analyzed to identify and track objects in the image using the Cantata image processing package within the visual programming environment provided by the Khoros software system. The image analysis operations are described.
ERIC Educational Resources Information Center
Brown, Bryan A.
2008-01-01
Overview: This research explores a teaching approach that attempts to balance test preparation with creating "teachable moments" for students. This approach involves the use of a sequence of assessments to introduce topics through formative assessment in order to identify students' understanding, and beginning instruction based on…
Posterior Predictive Bayesian Phylogenetic Model Selection
Lewis, Paul O.; Xie, Wangang; Chen, Ming-Hui; Fan, Yu; Kuo, Lynn
2014-01-01
We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using examples from green algal protein-coding cpDNA sequences and flowering plant rDNA sequences. The Gelfand–Ghosh (GG) approach allows dissection of an overall measure of model fit into components due to posterior predictive variance (GGp) and goodness-of-fit (GGg), which distinguishes this method from the posterior predictive P-value approach. The conditional predictive ordinate (CPO) method provides a site-specific measure of model fit useful for exploratory analyses and can be combined over sites yielding the log pseudomarginal likelihood (LPML) which is useful as an overall measure of model fit. CPO provides a useful cross-validation approach that is computationally efficient, requiring only a sample from the posterior distribution (no additional simulation is required). Both GG and CPO add new perspectives to Bayesian phylogenetic model selection based on the predictive abilities of models and complement the perspective provided by the marginal likelihood (including Bayes Factor comparisons) based solely on the fit of competing models to observed data. [Bayesian; conditional predictive ordinate; CPO; L-measure; LPML; model selection; phylogenetics; posterior predictive.] PMID:24193892
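For reference, the conditional predictive ordinate for site i and the LPML are commonly estimated from M posterior draws \theta^{(m)} with the standard harmonic-mean-style estimator below, which shows why no simulation beyond the posterior sample is needed; this is the textbook estimator, not a formula reproduced from the paper itself.

    \mathrm{CPO}_i \;=\; p\!\left(y_i \mid \mathbf{y}_{-i}\right) \;\approx\;
    \left[\frac{1}{M}\sum_{m=1}^{M}\frac{1}{p\!\left(y_i \mid \theta^{(m)}\right)}\right]^{-1},
    \qquad
    \mathrm{LPML} \;=\; \sum_{i}\log \mathrm{CPO}_i .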
Phylogenetic tree construction based on 2D graphical representation
NASA Astrophysics Data System (ADS)
Liao, Bo; Shan, Xinzhou; Zhu, Wen; Li, Renfa
2006-04-01
A new approach based on the two-dimensional (2D) graphical representation of the whole genome sequence [Bo Liao, Chem. Phys. Lett. 401 (2005) 196] is proposed to analyze the phylogenetic relationships of genomes. The evolutionary distances are obtained by measuring the differences among the 2D curves. Fuzzy theory is used to construct the phylogenetic tree. The phylogenetic relationships of H5N1 avian influenza viruses illustrate the utility of our approach.
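The sketch below illustrates the general flavor of such a pipeline: each sequence is mapped to a cumulative 2D curve, pairwise curve differences give a distance matrix, and a tree is built from the distances. The base-to-vector assignment, the curve distance, and the use of ordinary hierarchical clustering in place of the fuzzy-theory step are all simplifying assumptions for illustration, not Liao's published formulation.

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    STEP = {"A": (1, 1), "G": (1, -1), "C": (-1, 1), "T": (-1, -1)}   # assumed mapping

    def curve(seq, n_points=100):
        pts = np.cumsum([STEP[b] for b in seq if b in STEP], axis=0)
        idx = np.linspace(0, len(pts) - 1, n_points).astype(int)
        return pts[idx] / len(pts)            # resampled, length-normalized 2D curve

    def distance_matrix(seqs):
        curves = [curve(s) for s in seqs]
        n = len(curves)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d[i, j] = d[j, i] = np.linalg.norm(curves[i] - curves[j])
        return d

    seqs = ["ATGGCGT" * 10, "ATGGCTT" * 10, "TTGGCAA" * 10]   # toy sequences
    tree = linkage(squareform(distance_matrix(seqs)), method="average")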
USDA-ARS?s Scientific Manuscript database
A Tobacco rattle virus (TRV) based virus-induced gene silencing (VIGS) assay was employed as a reverse genetic approach to study gene function in cotton (Gossypium hirsutum). This approach was used to investigate the function of Enoyl-CoA reductase (GhECR) in pathogen defense. Amino acid sequence al...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gardiner, K.
Tremendous progress has been made in the construction of physical and genetic maps of the human chromosomes. The next step in the solving of disease related problems, and in understanding the human genome as a whole, is the systematic isolation of transcribed sequences. Many investigators have already embarked upon comprehensive gene searches, and many more are considering the best strategies for undertaking such searches. Because these are likely to be costly and time-consuming endeavors, it is important to determine the most efficient approaches. As a result, it is critical that investigators involved in the construction of transcriptional maps have the opportunity to discuss their experiences and their successes with both old and new technologies. This document contains the proceedings of the Fourth Annual Workshop on the Identification of Transcribed Sequences, held in Montreal, Quebec, October 16-18, 1994. Included are the workshop notebook, containing the agenda, abstracts presented and list of attendees. Topics included: Progress in the application of the hybridization based approaches and exon trapping; Progress in transcriptional map construction of selected genomic regions; Computer assisted analysis of genomic and protein coding sequences and additional new approaches; and, Sequencing and mapping of random cDNAs.
Multi-Temporal Land Cover Classification with Sequential Recurrent Encoders
NASA Astrophysics Data System (ADS)
Rußwurm, Marc; Körner, Marco
2018-03-01
Earth observation (EO) sensors deliver data with daily or weekly temporal resolution. Most land use and land cover (LULC) approaches, however, expect cloud-free and mono-temporal observations. The increasing temporal capabilities of today's sensors enable the use of temporal features alongside spectral and spatial ones. Domains such as speech recognition and neural machine translation work with inherently temporal data and today achieve impressive results using sequential encoder-decoder structures. Inspired by these sequence-to-sequence models, we adapt an encoder structure with convolutional recurrent layers in order to approximate a phenological model for vegetation classes based on a temporal sequence of Sentinel 2 (S2) images. In our experiments, we visualize internal activations over a sequence of cloudy and non-cloudy images and find several recurrent cells that reduce the input activity for cloudy observations. Hence, we assume that our network has learned cloud-filtering schemes solely from input data, which could alleviate the need for tedious cloud filtering as a preprocessing step in many EO approaches. Moreover, using unfiltered time series of top-of-atmosphere (TOA) reflectance data, we achieved state-of-the-art classification accuracies in our experiments on a large number of crop classes with minimal preprocessing compared with other classification approaches.
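As a rough, simplified sketch of a sequential encoder for satellite image time series (not the authors' convolutional recurrent architecture), the PyTorch model below encodes each observation with a small CNN, summarizes the sequence with an LSTM, and predicts a class from the final hidden state. The layer sizes, the 13 spectral bands, and the 17 crop classes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TemporalEncoder(nn.Module):
        def __init__(self, n_bands=13, n_classes=17, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(n_bands, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),                  # one vector per time step
            )
            self.rnn = nn.LSTM(32, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                             # x: (batch, time, bands, H, W)
            b, t = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).flatten(1)  # (batch*time, 32)
            feats = feats.view(b, t, -1)
            _, (h, _) = self.rnn(feats)                   # final hidden state
            return self.head(h[-1])                       # class logits

    logits = TemporalEncoder()(torch.randn(2, 10, 13, 24, 24))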
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.
Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H
2017-04-15
Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids.