efficient sequence analysis: Topics by Science.gov

Sample records for efficient sequence analysis

Memory Efficient Sequence Analysis Using Compressed Data Structures (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

ScienceCinema

Simpson, Jared

2018-01-24

Wellcome Trust Sanger Institute's Jared Simpson on Memory efficient sequence analysis using compressed data structures at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.
An Optimal Bahadur-Efficient Method in Detection of Sparse Signals with Applications to Pathway Analysis in Sequencing Association Studies.

PubMed

Dai, Hongying; Wu, Guodong; Wu, Michael; Zhi, Degui

2016-01-01

Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency,[Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε →0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
DELIMINATE--a fast and efficient method for loss-less compression of genomic sequences: sequence analysis.

PubMed

Mohammed, Monzoorul Haque; Dutta, Anirban; Bose, Tungadri; Chadaram, Sudha; Mande, Sharmila S

2012-10-01

An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data. We present a novel compression algorithm (DELIMINATE) that can rapidly compress genomic sequence data in a loss-less fashion. Validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general purpose compression algorithms, namely, gzip, bzip2 and lzma. Linux, Windows and Mac implementations (both 32 and 64-bit) of DELIMINATE are freely available for download at: http://metagenomics.atc.tcs.com/compression/DELIMINATE. sharmila@atc.tcs.com Supplementary data are available at Bioinformatics online.
A streamlined method for analysing genome-wide DNA methylation patterns from low amounts of FFPE DNA.

PubMed

Ludgate, Jackie L; Wright, James; Stockwell, Peter A; Morison, Ian M; Eccles, Michael R; Chatterjee, Aniruddha

2017-08-31

Formalin fixed paraffin embedded (FFPE) tumor samples are a major source of DNA from patients in cancer research. However, FFPE is a challenging material to work with due to macromolecular fragmentation and nucleic acid crosslinking. FFPE tissue particularly possesses challenges for methylation analysis and for preparing sequencing-based libraries relying on bisulfite conversion. Successful bisulfite conversion is a key requirement for sequencing-based methylation analysis. Here we describe a complete and streamlined workflow for preparing next generation sequencing libraries for methylation analysis from FFPE tissues. This includes, counting cells from FFPE blocks and extracting DNA from FFPE slides, testing bisulfite conversion efficiency with a polymerase chain reaction (PCR) based test, preparing reduced representation bisulfite sequencing libraries and massively parallel sequencing. The main features and advantages of this protocol are: An optimized method for extracting good quality DNA from FFPE tissues. An efficient bisulfite conversion and next generation sequencing library preparation protocol that uses 50 ng DNA from FFPE tissue. Incorporation of a PCR-based test to assess bisulfite conversion efficiency prior to sequencing. We provide a complete workflow and an integrated protocol for performing DNA methylation analysis at the genome-scale and we believe this will facilitate clinical epigenetic research that involves the use of FFPE tissue.
High throughput sequencing analysis of RNA libraries reveals the influences of initial library and PCR methods on SELEX efficiency.

PubMed

Takahashi, Mayumi; Wu, Xiwei; Ho, Michelle; Chomchan, Pritsana; Rossi, John J; Burnett, John C; Zhou, Jiehua

2016-09-22

The systemic evolution of ligands by exponential enrichment (SELEX) technique is a powerful and effective aptamer-selection procedure. However, modifications to the process can dramatically improve selection efficiency and aptamer performance. For example, droplet digital PCR (ddPCR) has been recently incorporated into SELEX selection protocols to putatively reduce the propagation of byproducts and avoid selection bias that result from differences in PCR efficiency of sequences within the random library. However, a detailed, parallel comparison of the efficacy of conventional solution PCR versus the ddPCR modification in the RNA aptamer-selection process is needed to understand effects on overall SELEX performance. In the present study, we took advantage of powerful high throughput sequencing technology and bioinformatics analysis coupled with SELEX (HT-SELEX) to thoroughly investigate the effects of initial library and PCR methods in the RNA aptamer identification. Our analysis revealed that distinct "biased sequences" and nucleotide composition existed in the initial, unselected libraries purchased from two different manufacturers and that the fate of the "biased sequences" was target-dependent during selection. Our comparison of solution PCR- and ddPCR-driven HT-SELEX demonstrated that PCR method affected not only the nucleotide composition of the enriched sequences, but also the overall SELEX efficiency and aptamer efficacy.
Recurrence time statistics: versatile tools for genomic DNA sequence analysis.

PubMed

Cao, Yinhe; Tung, Wen-Wen; Gao, J B

2004-01-01

With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cervisivae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

PubMed

Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

2010-07-02

The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
Streaming fragment assignment for real-time analysis of sequencing experiments

PubMed Central

Roberts, Adam; Pachter, Lior

2013-01-01

We present eXpress, a software package for highly efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time, and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data, showing greater efficiency than other quantification methods. PMID:23160280
Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo) genome assembly and analysis

USDA-ARS?s Scientific Manuscript database

Next-generation sequencing technologies were used to rapidly and efficiently sequence the genome of the domestic turkey (Meleagris gallopavo). The current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to chromosomes. Innate heterozygosity of the sequenced bird allowed discovery of...
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

PubMed Central

2010-01-01

Background The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
SNP discovery through de novo deep sequencing using the next generation of DNA sequencers

USDA-ARS?s Scientific Manuscript database

The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....
pYEMF, a pUC18-derived XcmI T-vector for efficient cloning of PCR products.

PubMed

Gu, Jingsong; Ye, Chunjiang

2011-03-01

A 1330-bp DNA sequence with two XcmI cassettes was inserted into pUC18 to construct an efficient XcmI T-vector parent plasmid, pYEMF. The large size of the inserted DNA fragment improved T-vector cleavage efficiency, and guaranteed good separation of the molecular components after restriction digestion. The pYEMF-T-vector generated from parent plasmid pYEMF permits blue/white colony screening; cloning efficiency analysis showed that most white colonies (>75%) were putative transformants which carried the cloning product. The sequence analysis and design approach presented here will facilitate applications in the fields of molecular biology and genetic engineering.
Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison.

PubMed

Dai, Qi; Yang, Yanchun; Wang, Tianming

2008-10-15

Many proposed statistical measures can efficiently compare biological sequences to further infer their structures, functions and evolutionary information. They are related in spirit because all the ideas for sequence comparison try to use the information on the k-word distributions, Markov model or both. Motivated by adding k-word distributions to Markov model directly, we investigated two novel statistical measures for sequence comparison, called wre.k.r and S2.k.r. The proposed measures were tested by similarity search, evaluation on functionally related regulatory sequences and phylogenetic analysis. This offers the systematic and quantitative experimental assessment of our measures. Moreover, we compared our achievements with these based on alignment or alignment-free. We grouped our experiments into two sets. The first one, performed via ROC (receiver operating curve) analysis, aims at assessing the intrinsic ability of our statistical measures to search for similar sequences from a database and discriminate functionally related regulatory sequences from unrelated sequences. The second one aims at assessing how well our statistical measure is used for phylogenetic analysis. The experimental assessment demonstrates that our similarity measures intending to incorporate k-word distributions into Markov model are more efficient.
High throughput sequencing analysis of RNA libraries reveals the influences of initial library and PCR methods on SELEX efficiency

PubMed Central

Takahashi, Mayumi; Wu, Xiwei; Ho, Michelle; Chomchan, Pritsana; Rossi, John J.; Burnett, John C.; Zhou, Jiehua

2016-01-01

The systemic evolution of ligands by exponential enrichment (SELEX) technique is a powerful and effective aptamer-selection procedure. However, modifications to the process can dramatically improve selection efficiency and aptamer performance. For example, droplet digital PCR (ddPCR) has been recently incorporated into SELEX selection protocols to putatively reduce the propagation of byproducts and avoid selection bias that result from differences in PCR efficiency of sequences within the random library. However, a detailed, parallel comparison of the efficacy of conventional solution PCR versus the ddPCR modification in the RNA aptamer-selection process is needed to understand effects on overall SELEX performance. In the present study, we took advantage of powerful high throughput sequencing technology and bioinformatics analysis coupled with SELEX (HT-SELEX) to thoroughly investigate the effects of initial library and PCR methods in the RNA aptamer identification. Our analysis revealed that distinct “biased sequences” and nucleotide composition existed in the initial, unselected libraries purchased from two different manufacturers and that the fate of the “biased sequences” was target-dependent during selection. Our comparison of solution PCR- and ddPCR-driven HT-SELEX demonstrated that PCR method affected not only the nucleotide composition of the enriched sequences, but also the overall SELEX efficiency and aptamer efficacy. PMID:27652575
Efficient error correction for next-generation sequencing of viral amplicons

PubMed Central

2012-01-01

Background Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. Results In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Conclusions Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm PMID:22759430
Efficient error correction for next-generation sequencing of viral amplicons.

PubMed

Skums, Pavel; Dimitrova, Zoya; Campo, David S; Vaughan, Gilberto; Rossi, Livia; Forbi, Joseph C; Yokosawa, Jonny; Zelikovsky, Alex; Khudyakov, Yury

2012-06-25

Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses.The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.
An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data.

PubMed

Jun, Goo; Wing, Mary Kate; Abecasis, Gonçalo R; Kang, Hyun Min

2015-06-01

The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies. © 2015 Jun et al.; Published by Cold Spring Harbor Laboratory Press.
Metagenomic analysis of a desulphurisation system used to treat biogas from vinasse methanisation.

PubMed

Dias, Marcela França; Colturato, Luis Felipe; de Oliveira, João Paulo; Leite, Laura Rabelo; Oliveira, Guilherme; Chernicharo, Carlos Augusto; de Araújo, Juliana Calabria

2016-04-01

We investigated the response of microbial community to changes in H2S loading rate in a microaerated desulphurisation system treating biogas from vinasse methanisation. H2S removal efficiency was high, and both COD and DO seemed to be important parameters to biomass activity. DGGE analysis retrieved sequences of sulphide-oxidising bacteria (SOB), such as Thioalkalimicrobium sp. Deep sequencing analysis revealed that the microbial community was complex and remained constant throughout the experiment. Most sequences belonged to Firmicutes and Proteobacteria, and, to a lesser extent, Bacteroidetes, Chloroflexi, and Synergistetes. Despite the high sulphide removal efficiency, the abundance of the taxa of SOB was low, and was negatively affected by the high sulphide loading rate. Copyright © 2016 Elsevier Ltd. All rights reserved.
Joint analysis of bacterial DNA methylation, predicted promoter and regulation motifs for biological significance

USDA-ARS?s Scientific Manuscript database

Advances in long-read, single molecule real-time sequencing technology and analysis software over the last two years has enabled the efficient production of closed bacterial genome sequences. However, consistent annotation of these genomes has lagged behind the ability to create them, while the avai...
PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples

PubMed Central

2014-01-01

Background Recent innovations in sequencing technologies have provided researchers with the ability to rapidly characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These approaches are producing a wealth of information that is providing novel insights into the microbial ecology of the environment and human health. However, these sequencing-based approaches produce large and complex datasets that require efficient and sensitive computational analysis workflows. Many recent tools for analyzing metagenomic-sequencing data have emerged, however, these approaches often suffer from issues of specificity, efficiency, and typically do not include a complete metagenomic analysis framework. Results We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying the proportions of reads from individual microbial strains present in metagenomic sequencing data from environmental or clinical samples. The pipeline performs all necessary computational analysis steps; including reference genome library extraction and indexing, read quality control and alignment, strain identification, and summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data from the 2011 outbreak of Shiga-toxigenic Escherichia coli O104:H4. Conclusions The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/. PMID:25225611

An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads

PubMed Central

2013-01-01

Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
Improving transmission efficiency of large sequence alignment/map (SAM) files.

PubMed

Sakib, Muhammad Nazmus; Tang, Jijun; Zheng, W Jim; Huang, Chin-Tser

2011-01-01

Research in bioinformatics primarily involves collection and analysis of a large volume of genomic data. Naturally, it demands efficient storage and transfer of this huge amount of data. In recent years, some research has been done to find efficient compression algorithms to reduce the size of various sequencing data. One way to improve the transmission time of large files is to apply a maximum lossless compression on them. In this paper, we present SAMZIP, a specialized encoding scheme, for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves the compression ratio of existing compression tools available. In order to achieve this, we exploit the prior knowledge of the file format and specifications. Our experimental results show that our encoding scheme improves compression ratio, thereby reducing overall transmission time significantly.
ViennaNGS: A toolbox for building efficient next- generation sequencing analysis pipelines

PubMed Central

Wolfinger, Michael T.; Fallmann, Jörg; Eggenhofer, Florian; Amman, Fabian

2015-01-01

Recent achievements in next-generation sequencing (NGS) technologies lead to a high demand for reuseable software components to easily compile customized analysis workflows for big genomics data. We present ViennaNGS, an integrated collection of Perl modules focused on building efficient pipelines for NGS data processing. It comes with functionality for extracting and converting features from common NGS file formats, computation and evaluation of read mapping statistics, as well as normalization of RNA abundance. Moreover, ViennaNGS provides software components for identification and characterization of splice junctions from RNA-seq data, parsing and condensing sequence motif data, automated construction of Assembly and Track Hubs for the UCSC genome browser, as well as wrapper routines for a set of commonly used NGS command line tools. PMID:26236465
Ub-ISAP: a streamlined UNIX pipeline for mining unique viral vector integration sites from next generation sequencing data.

PubMed

Kamboj, Atul; Hallwirth, Claus V; Alexander, Ian E; McCowage, Geoffrey B; Kramer, Belinda

2017-06-17

The analysis of viral vector genomic integration sites is an important component in assessing the safety and efficiency of patient treatment using gene therapy. Alongside this clinical application, integration site identification is a key step in the genetic mapping of viral elements in mutagenesis screens that aim to elucidate gene function. We have developed a UNIX-based vector integration site analysis pipeline (Ub-ISAP) that utilises a UNIX-based workflow for automated integration site identification and annotation of both single and paired-end sequencing reads. Reads that contain viral sequences of interest are selected and aligned to the host genome, and unique integration sites are then classified as transcription start site-proximal, intragenic or intergenic. Ub-ISAP provides a reliable and efficient pipeline to generate large datasets for assessing the safety and efficiency of integrating vectors in clinical settings, with broader applications in cancer research. Ub-ISAP is available as an open source software package at https://sourceforge.net/projects/ub-isap/ .
Simian virus 40 major late promoter: an upstream DNA sequence required for efficient in vitro transcription.

PubMed Central

Brady, J; Radonovich, M; Thoren, M; Das, G; Salzman, N P

1984-01-01

We have previously identified an 11-base DNA sequence, 5'-G-G-T-A-C-C-T-A-A-C-C-3' (simian virus 40 [SV40] map position 294 to 304), which is important in the control of SV40 late RNA expression in vitro and in vivo (Brady et al., Cell 31:625-633, 1982). We report here the identification of another domain of the SV40 late promoter. A series of mutants with deletions extending from SV40 map position 0 to 300 was prepared by nuclease BAL 31 treatment. The cloned templates were then analyzed for efficiency and accuracy of late SV40 RNA expression in the Manley in vitro transcription system. Our studies showed that, in addition to the promoter domain near map position 300, there are essential DNA sequences between nucleotide positions 74 and 95 that are required for efficient expression of late SV40 RNA. Included in this SV40 DNA sequence were two of the six GGGCGG SV40 repeat sequences and an 11-nucleotide segment which showed strong homology with the upstream sequences required for the efficient in vitro and in vivo expression of the histone H2A gene. This upstream promoter sequence supported transcription with the same efficiency even when it was moved 72 nucleotides closer to the major late cap site. In vitro promoter competition analysis demonstrated that the upstream promoter sequence, independent of the 294 to 304 promoter element, is capable of binding polymerase-transcription factors required for SV40 late gene transcription. Finally, we show that DNA sequences which control the specificity of RNA initiation at nucleotide 325 lie downstream of map position 294. Images PMID:6321950
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment

PubMed Central

2013-01-01

Background Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. Results In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Conclusion Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA. PMID:24564200
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.

PubMed

Nagar, Anurag; Hahsler, Michael

2013-01-01

Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
Post-staining electroblotting for efficient and reliable peptide blotting.

PubMed

Lee, Der-Yen; Chang, Geen-Dong

2015-01-01

Post-staining electroblotting has been previously described to transfer Coomassie blue-stained proteins from polyacrylamide gel onto polyvinylidene difluoride (PVDF) membranes. Actually, stained peptides can also be efficiently and reliably transferred. Because of selective staining procedures for peptides and increased retention of stained peptides on the membrane, even peptides with molecular masses less than 2 kDa such as bacitracin and granuliberin R are transferred with satisfactory results. For comparison, post-staining electroblotting is about 16-fold more sensitive than the conventional electroblotting for visualization of insulin on the membrane. Therefore, the peptide blots become practicable and more accessible to further applications, e.g., blot overlay detection or immunoblotting analysis. In addition, the efficiency of peptide transfer is favorable for N-terminal sequence analysis. With this method, peptide blotting can be normalized for further analysis such as blot overlay assay, immunoblotting, and N-terminal sequencing for identification of peptide in crude or partially purified samples.
CoCoNUT: an efficient system for the comparison and analysis of genomes

PubMed Central

2008-01-01

Background Comparative genomics is the analysis and comparison of genomes from different species. This area of research is driven by the large number of sequenced genomes and heavily relies on efficient algorithms and software to perform pairwise and multiple genome comparisons. Results Most of the software tools available are tailored for one specific task. In contrast, we have developed a novel system CoCoNUT (Computational Comparative geNomics Utility Toolkit) that allows solving several different tasks in a unified framework: (1) finding regions of high similarity among multiple genomic sequences and aligning them, (2) comparing two draft or multi-chromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and (4) mapping cDNA/EST to genomic sequences. Conclusion CoCoNUT is competitive with other software tools w.r.t. the quality of the results. The use of state of the art algorithms and data structures allows CoCoNUT to solve comparative genomics tasks more efficiently than previous tools. With the improved user interface (including an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use software tool for large scale studies in comparative genomics. PMID:19014477
Are quantitative trait-dependent sampling designs cost-effective for analysis of rare and common variants?

PubMed

Yilmaz, Yildiz E; Bull, Shelley B

2011-11-29

Use of trait-dependent sampling designs in whole-genome association studies of sequence data can reduce total sequencing costs with modest losses of statistical efficiency. In a quantitative trait (QT) analysis of data from the Genetic Analysis Workshop 17 mini-exome for unrelated individuals in the Asian subpopulation, we investigate alternative designs that sequence only 50% of the entire cohort. In addition to a simple random sampling design, we consider extreme-phenotype designs that are of increasing interest in genetic association analysis of QTs, especially in studies concerned with the detection of rare genetic variants. We also evaluate a novel sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability. We take differential sampling of individuals with informative trait values into account by inverse probability weighting using standard survey methods which thus generalizes to the source population. In replicate 1 data, we applied the designs in association analysis of Q1 with both rare and common variants in the FLT1 gene, based on knowledge of the generating model. Using all 200 replicate data sets, we similarly analyzed Q1 and Q4 (which is known to be free of association with FLT1) to evaluate relative efficiency, type I error, and power. Simulation study results suggest that the QT-dependent selection designs generally yield greater than 50% relative efficiency compared to using the entire cohort, implying cost-effectiveness of 50% sample selection and worthwhile reduction of sequencing costs.
Efficient alignment-free DNA barcode analytics.

PubMed

Kuksa, Pavel; Pavlovic, Vladimir

2009-11-10

In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens possibility for accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.) On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding.
KONAGAbase: a genomic and transcriptomic database for the diamondback moth, Plutella xylostella.

PubMed

Jouraku, Akiya; Yamamoto, Kimiko; Kuwazaki, Seigo; Urio, Masahiro; Suetsugu, Yoshitaka; Narukawa, Junko; Miyamoto, Kazuhisa; Kurita, Kanako; Kanamori, Hiroyuki; Katayose, Yuichi; Matsumoto, Takashi; Noda, Hiroaki

2013-07-09

The diamondback moth (DBM), Plutella xylostella, is one of the most harmful insect pests for crucifer crops worldwide. DBM has rapidly evolved high resistance to most conventional insecticides such as pyrethroids, organophosphates, fipronil, spinosad, Bacillus thuringiensis, and diamides. Therefore, it is important to develop genomic and transcriptomic DBM resources for analysis of genes related to insecticide resistance, both to clarify the mechanism of resistance of DBM and to facilitate the development of insecticides with a novel mode of action for more effective and environmentally less harmful insecticide rotation. To contribute to this goal, we developed KONAGAbase, a genomic and transcriptomic database for DBM (KONAGA is the Japanese word for DBM). KONAGAbase provides (1) transcriptomic sequences of 37,340 ESTs/mRNAs and 147,370 RNA-seq contigs which were clustered and assembled into 84,570 unigenes (30,695 contigs, 50,548 pseudo singletons, and 3,327 singletons); and (2) genomic sequences of 88,530 WGS contigs with 246,244 degenerate contigs and 106,455 singletons from which 6,310 de novo identified repeat sequences and 34,890 predicted gene-coding sequences were extracted. The unigenes and predicted gene-coding sequences were clustered and 32,800 representative sequences were extracted as a comprehensive putative gene set. These sequences were annotated with BLAST descriptions, Gene Ontology (GO) terms, and Pfam descriptions, respectively. KONAGAbase contains rich graphical user interface (GUI)-based web interfaces for easy and efficient searching, browsing, and downloading sequences and annotation data. Five useful search interfaces consisting of BLAST search, keyword search, BLAST result-based search, GO tree-based search, and genome browser are provided. KONAGAbase is publicly available from our website (http://dbm.dna.affrc.go.jp/px/) through standard web browsers. KONAGAbase provides DBM comprehensive transcriptomic and draft genomic sequences with useful annotation information with easy-to-use web interfaces, which helps researchers to efficiently search for target sequences such as insect resistance-related genes. KONAGAbase will be continuously updated and additional genomic/transcriptomic resources and analysis tools will be provided for further efficient analysis of the mechanism of insecticide resistance and the development of effective insecticides with a novel mode of action for DBM.
Transcriptome analysis of blueberry using 454 EST sequencing

USDA-ARS?s Scientific Manuscript database

Blueberry (Vaccinium corymbosum) is a major berry crop in the United States, and one that has great nutritional and economical value. Next generation sequencing methodologies, such as 454, have been demonstrated to be successful and efficient in producing a snap-shot of transcriptional activities du...
High efficiency family shuffling based on multi-step PCR and in vivo DNA recombination in yeast: statistical and functional analysis of a combinatorial library between human cytochrome P450 1A1 and 1A2.

PubMed

Abécassis, V; Pompon, D; Truan, G

2000-10-15

The design of a family shuffling strategy (CLERY: Combinatorial Libraries Enhanced by Recombination in Yeast) associating PCR-based and in vivo recombination and expression in yeast is described. This strategy was tested using human cytochrome P450 CYP1A1 and CYP1A2 as templates, which share 74% nucleotide sequence identity. Construction of highly shuffled libraries of mosaic structures and reduction of parental gene contamination were two major goals. Library characterization involved multiprobe hybridization on DNA macro-arrays. The statistical analysis of randomly selected clones revealed a high proportion of chimeric genes (86%) and a homogeneous representation of the parental contribution among the sequences (55.8 +/- 2.5% for parental sequence 1A2). A microtiter plate screening system was designed to achieve colorimetric detection of polycyclic hydrocarbon hydroxylation by transformed yeast cells. Full sequences of five randomly picked and five functionally selected clones were analyzed. Results confirmed the shuffling efficiency and allowed calculation of the average length of sequence exchange and mutation rates. The efficient and statistically representative generation of mosaic structures by this type of family shuffling in a yeast expression system constitutes a novel and promising tool for structure-function studies and tuning enzymatic activities of multicomponent eucaryote complexes involving non-soluble enzymes.
Analyzing Immunoglobulin Repertoires

PubMed Central

Chaudhary, Neha; Wesemann, Duane R.

2018-01-01

Somatic assembly of T cell receptor and B cell receptor (BCR) genes produces a vast diversity of lymphocyte antigen recognition capacity. The advent of efficient high-throughput sequencing of lymphocyte antigen receptor genes has recently generated unprecedented opportunities for exploration of adaptive immune responses. With these opportunities have come significant challenges in understanding the analysis techniques that most accurately reflect underlying biological phenomena. In this regard, sample preparation and sequence analysis techniques, which have largely been borrowed and adapted from other fields, continue to evolve. Here, we review current methods and challenges of library preparation, sequencing and statistical analysis of lymphocyte receptor repertoire studies. We discuss the general steps in the process of immune repertoire generation including sample preparation, platforms available for sequencing, processing of sequencing data, measurable features of the immune repertoire, and the statistical tools that can be used for analysis and interpretation of the data. Because BCR analysis harbors additional complexities, such as immunoglobulin (Ig) (i.e., antibody) gene somatic hypermutation and class switch recombination, the emphasis of this review is on Ig/BCR sequence analysis. PMID:29593723
Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.

PubMed

Pollen, Alex A; Nowakowski, Tomasz J; Shuga, Joe; Wang, Xiaohui; Leyrat, Anne A; Lui, Jan H; Li, Nianzhen; Szpankowski, Lukasz; Fowler, Brian; Chen, Peilin; Ramalingam, Naveen; Sun, Gang; Thu, Myo; Norris, Michael; Lebofsky, Ronald; Toppani, Dominique; Kemp, Darnell W; Wong, Michael; Clerkson, Barry; Jones, Brittnee N; Wu, Shiquan; Knutsson, Lawrence; Alvarado, Beatriz; Wang, Jing; Weaver, Lesley S; May, Andrew P; Jones, Robert C; Unger, Marc A; Kriegstein, Arnold R; West, Jay A A

2014-10-01

Large-scale surveys of single-cell gene expression have the potential to reveal rare cell populations and lineage relationships but require efficient methods for cell capture and mRNA sequencing. Although cellular barcoding strategies allow parallel sequencing of single cells at ultra-low depths, the limitations of shallow sequencing have not been investigated directly. By capturing 301 single cells from 11 populations using microfluidics and analyzing single-cell transcriptomes across downsampled sequencing depths, we demonstrate that shallow single-cell mRNA sequencing (~50,000 reads per cell) is sufficient for unbiased cell-type classification and biomarker identification. In the developing cortex, we identify diverse cell types, including multiple progenitor and neuronal subtypes, and we identify EGR1 and FOS as previously unreported candidate targets of Notch signaling in human but not mouse radial glia. Our strategy establishes an efficient method for unbiased analysis and comparison of cell populations from heterogeneous tissue by microfluidic single-cell capture and low-coverage sequencing of many cells.
Single nucleotide polymorphisms from Theobroma cacao expressed sequence tags associated with witches' broom disease in cacao.

PubMed

Lima, L S; Gramacho, K P; Carels, N; Novais, R; Gaiotto, F A; Lopes, U V; Gesteira, A S; Zaidan, H A; Cascardo, J C M; Pires, J L; Micheli, F

2009-07-14

In order to increase the efficiency of cacao tree resistance to witches' broom disease, which is caused by Moniliophthora perniciosa (Tricholomataceae), we looked for molecular markers that could help in the selection of resistant cacao genotypes. Among the different markers useful for developing marker-assisted selection, single nucleotide polymorphisms (SNPs) constitute the most common type of sequence difference between alleles and can be easily detected by in silico analysis from expressed sequence tag libraries. We report the first detection and analysis of SNPs from cacao-M. perniciosa interaction expressed sequence tags, using bioinformatics. Selection based on analysis of these SNPs should be useful for developing cacao varieties resistant to this devastating disease.
String Mining in Bioinformatics

NASA Astrophysics Data System (ADS)

Abouelhoda, Mohamed; Ghanem, Moustafa

Sequence analysis is a major area in bioinformatics encompassing the methods and techniques for studying the biological sequences, DNA, RNA, and proteins, on the linear structure level. The focus of this area is generally on the identification of intra- and inter-molecular similarities. Identifying intra-molecular similarities boils down to detecting repeated segments within a given sequence, while identifying inter-molecular similarities amounts to spotting common segments among two or multiple sequences. From a data mining point of view, sequence analysis is nothing but string- or pattern mining specific to biological strings. For a long time, this point of view, however, has not been explicitly embraced neither in the data mining nor in the sequence analysis text books, which may be attributed to the co-evolution of the two apparently independent fields. In other words, although the word "data-mining" is almost missing in the sequence analysis literature, its basic concepts have been implicitly applied. Interestingly, recent research in biological sequence analysis introduced efficient solutions to many problems in data mining, such as querying and analyzing time series [49,53], extracting information from web pages [20], fighting spam mails [50], detecting plagiarism [22], and spotting duplications in software systems [14].
String Mining in Bioinformatics

NASA Astrophysics Data System (ADS)

Abouelhoda, Mohamed; Ghanem, Moustafa

Sequence analysis is a major area in bioinformatics encompassing the methods and techniques for studying the biological sequences, DNA, RNA, and proteins, on the linear structure level. The focus of this area is generally on the identification of intra- and inter-molecular similarities. Identifying intra-molecular similarities boils down to detecting repeated segments within a given sequence, while identifying inter-molecular similarities amounts to spotting common segments among two or multiple sequences. From a data mining point of view, sequence analysis is nothing but string- or pattern mining specific to biological strings. For a long time, this point of view, however, has not been explicitly embraced neither in the data mining nor in the sequence analysis text books, which may be attributed to the co-evolution of the two apparently independent fields. In other words, although the word “data-mining” is almost missing in the sequence analysis literature, its basic concepts have been implicitly applied. Interestingly, recent research in biological sequence analysis introduced efficient solutions to many problems in data mining, such as querying and analyzing time series [49,53], extracting information from web pages [20], fighting spam mails [50], detecting plagiarism [22], and spotting duplications in software systems [14].
What is a melody? On the relationship between pitch and brightness of timbre.

PubMed

Cousineau, Marion; Carcagno, Samuele; Demany, Laurent; Pressnitzer, Daniel

2013-01-01

Previous studies showed that the perceptual processing of sound sequences is more efficient when the sounds vary in pitch than when they vary in loudness. We show here that sequences of sounds varying in brightness of timbre are processed with the same efficiency as pitch sequences. The sounds used consisted of two simultaneous pure tones one octave apart, and the listeners' task was to make same/different judgments on pairs of sequences varying in length (one, two, or four sounds). In one condition, brightness of timbre was varied within the sequences by changing the relative level of the two pure tones. In other conditions, pitch was varied by changing fundamental frequency, or loudness was varied by changing the overall level. In all conditions, only two possible sounds could be used in a given sequence, and these two sounds were equally discriminable. When sequence length increased from one to four, discrimination performance decreased substantially for loudness sequences, but to a smaller extent for brightness sequences and pitch sequences. In the latter two conditions, sequence length had a similar effect on performance. These results suggest that the processes dedicated to pitch and brightness analysis, when probed with a sequence-discrimination task, share unexpected similarities.

A parallel and sensitive software tool for methylation analysis on multicore platforms.

PubMed

Tárraga, Joaquín; Pérez, Mariano; Orduña, Juan M; Duato, José; Medina, Ignacio; Dopazo, Joaquín

2015-10-01

DNA methylation analysis suffers from very long processing time, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently neither with the size of the dataset nor with the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms in both execution time and sensitivity state-of-the-art software such as Bismark, BS-Seeker or BSMAP, particularly for long bisulphite reads. Software in the form of C libraries and functions, together with instructions to compile and execute this software. Available by sftp to anonymous@clariano.uv.es (password 'anonymous'). juan.orduna@uv.es or jdopazo@cipf.es. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
ZOOM Lite: next-generation sequencing data mapping and visualization software

PubMed Central

Zhang, Zefeng; Lin, Hao; Ma, Bin

2010-01-01

High-throughput next-generation sequencing technologies pose increasing demands on the efficiency, accuracy and usability of data analysis software. In this article, we present ZOOM Lite, a software for efficient reads mapping and result visualization. With a kernel capable of mapping tens of millions of Illumina or AB SOLiD sequencing reads efficiently and accurately, and an intuitive graphical user interface, ZOOM Lite integrates reads mapping and result visualization into a easy to use pipeline on desktop PC. The software handles both single-end and paired-end reads, and can output both the unique mapping result or the top N mapping results for each read. Additionally, the software takes a variety of input file formats and outputs to several commonly used result formats. The software is freely available at http://bioinfor.com/zoom/lite/. PMID:20530531
Analysis and Visualization of ChIP-Seq and RNA-Seq Sequence Alignments Using ngs.plot.

PubMed

Loh, Yong-Hwee Eddie; Shen, Li

2016-01-01

The continual maturation and increasing applications of next-generation sequencing technology in scientific research have yielded ever-increasing amounts of data that need to be effectively and efficiently analyzed and innovatively mined for new biological insights. We have developed ngs.plot-a quick and easy-to-use bioinformatics tool that performs visualizations of the spatial relationships between sequencing alignment enrichment and specific genomic features or regions. More importantly, ngs.plot is customizable beyond the use of standard genomic feature databases to allow the analysis and visualization of user-specified regions of interest generated by the user's own hypotheses. In this protocol, we demonstrate and explain the use of ngs.plot using command line executions, as well as a web-based workflow on the Galaxy framework. We replicate the underlying commands used in the analysis of a true biological dataset that we had reported and published earlier and demonstrate how ngs.plot can easily generate publication-ready figures. With ngs.plot, users would be able to efficiently and innovatively mine their own datasets without having to be involved in the technical aspects of sequence coverage calculations and genomic databases.
DROMPA: easy-to-handle peak calling and visualization software for the computational analysis and validation of ChIP-seq data.

PubMed

Nakato, Ryuichiro; Itoh, Tahehiko; Shirahige, Katsuhiko

2013-07-01

Chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) can identify genomic regions that bind proteins involved in various chromosomal functions. Although the development of next-generation sequencers offers the technology needed to identify these protein-binding sites, the analysis can be computationally challenging because sequencing data sometimes consist of >100 million reads/sample. Herein, we describe a cost-effective and time-efficient protocol that is generally applicable to ChIP-seq analysis; this protocol uses a novel peak-calling program termed DROMPA to identify peaks and an additional program, parse2wig, to preprocess read-map files. This two-step procedure drastically reduces computational time and memory requirements compared with other programs. DROMPA enables the identification of protein localization sites in repetitive sequences and efficiently identifies both broad and sharp protein localization peaks. Specifically, DROMPA outputs a protein-binding profile map in pdf or png format, which can be easily manipulated by users who have a limited background in bioinformatics. © 2013 The Authors Genes to Cells © 2013 by the Molecular Biology Society of Japan and Wiley Publishing Asia Pty Ltd.
Computational and experimental analysis of DNA shuffling

PubMed Central

Maheshri, Narendra; Schaffer, David V.

2003-01-01

We describe a computational model of DNA shuffling based on the thermodynamics and kinetics of this process. The model independently tracks a representative ensemble of DNA molecules and records their states at every stage of a shuffling reaction. These data can subsequently be analyzed to yield information on any relevant metric, including reassembly efficiency, crossover number, type and distribution, and DNA sequence length distributions. The predictive ability of the model was validated by comparison to three independent sets of experimental data, and analysis of the simulation results led to several unique insights into the DNA shuffling process. We examine a tradeoff between crossover frequency and reassembly efficiency and illustrate the effects of experimental parameters on this relationship. Furthermore, we discuss conditions that promote the formation of useless “junk” DNA sequences or multimeric sequences containing multiple copies of the reassembled product. This model will therefore aid in the design of optimal shuffling reaction conditions. PMID:12626764
Single-cell isolation by a modular single-cell pipette for RNA-sequencing.

PubMed

Zhang, Kai; Gao, Min; Chong, Zechen; Li, Ying; Han, Xin; Chen, Rui; Qin, Lidong

2016-11-29

Single-cell transcriptome sequencing highly requires a convenient and reliable method to rapidly isolate a live cell into a specific container such as a PCR tube. Here, we report a modular single-cell pipette (mSCP) consisting of three modular components, a SCP-Tip, an air-displacement pipette (ADP), and ADP-Tips, that can be easily assembled, disassembled, and reassembled. By assembling the SCP-Tip containing a hydrodynamic trap, the mSCP can isolate single cells from 5-10 cells per μL of cell suspension. The mSCP is compatible with microscopic identification of captured single cells to finally achieve 100% single-cell isolation efficiency. The isolated live single cells are in submicroliter volumes and well suitable for single-cell PCR analysis and RNA-sequencing. The mSCP possesses merits of convenience, rapidness, and high efficiency, making it a powerful tool to isolate single cells for transcriptome analysis.
BiQ Analyzer HT: locus-specific analysis of DNA methylation by high-throughput bisulfite sequencing

PubMed Central

Lutsik, Pavlo; Feuerbach, Lars; Arand, Julia; Lengauer, Thomas; Walter, Jörn; Bock, Christoph

2011-01-01

Bisulfite sequencing is a widely used method for measuring DNA methylation in eukaryotic genomes. The assay provides single-base pair resolution and, given sufficient sequencing depth, its quantitative accuracy is excellent. High-throughput sequencing of bisulfite-converted DNA can be applied either genome wide or targeted to a defined set of genomic loci (e.g. using locus-specific PCR primers or DNA capture probes). Here, we describe BiQ Analyzer HT (http://biq-analyzer-ht.bioinf.mpi-inf.mpg.de/), a user-friendly software tool that supports locus-specific analysis and visualization of high-throughput bisulfite sequencing data. The software facilitates the shift from time-consuming clonal bisulfite sequencing to the more quantitative and cost-efficient use of high-throughput sequencing for studying locus-specific DNA methylation patterns. In addition, it is useful for locus-specific visualization of genome-wide bisulfite sequencing data. PMID:21565797
Whole genome analysis of CRISPR Cas9 sgRNA off-target homologies via an efficient computational algorithm.

PubMed

Zhou, Hong; Zhou, Michael; Li, Daisy; Manthey, Joseph; Lioutikova, Ekaterina; Wang, Hong; Zeng, Xiao

2017-11-17

The beauty and power of the genome editing mechanism, CRISPR Cas9 endonuclease system, lies in the fact that it is RNA-programmable such that Cas9 can be guided to any genomic loci complementary to a 20-nt RNA, single guide RNA (sgRNA), to cleave double stranded DNA, allowing the introduction of wanted mutations. Unfortunately, it has been reported repeatedly that the sgRNA can also guide Cas9 to off-target sites where the DNA sequence is homologous to sgRNA. Using human genome and Streptococcus pyogenes Cas9 (SpCas9) as an example, this article mathematically analyzed the probabilities of off-target homologies of sgRNAs and discovered that for large genome size such as human genome, potential off-target homologies are inevitable for sgRNA selection. A highly efficient computationl algorithm was developed for whole genome sgRNA design and off-target homology searches. By means of a dynamically constructed sequence-indexed database and a simplified sequence alignment method, this algorithm achieves very high efficiency while guaranteeing the identification of all existing potential off-target homologies. Via this algorithm, 1,876,775 sgRNAs were designed for the 19,153 human mRNA genes and only two sgRNAs were found to be free of off-target homology. By means of the novel and efficient sgRNA homology search algorithm introduced in this article, genome wide sgRNA design and off-target analysis were conducted and the results confirmed the mathematical analysis that for a sgRNA sequence, it is almost impossible to escape potential off-target homologies. Future innovations on the CRISPR Cas9 gene editing technology need to focus on how to eliminate the Cas9 off-target activity.
PET-Tool: a software suite for comprehensive processing and managing of Paired-End diTag (PET) sequence data.

PubMed

Chiu, Kuo Ping; Wong, Chee-Hong; Chen, Qiongyu; Ariyaratne, Pramila; Ooi, Hong Sain; Wei, Chia-Lin; Sung, Wing-Kin Ken; Ruan, Yijun

2006-08-25

We recently developed the Paired End diTag (PET) strategy for efficient characterization of mammalian transcriptomes and genomes. The paired end nature of short PET sequences derived from long DNA fragments raised a new set of bioinformatics challenges, including how to extract PETs from raw sequence reads, and correctly yet efficiently map PETs to reference genome sequences. To accommodate and streamline data analysis of the large volume PET sequences generated from each PET experiment, an automated PET data process pipeline is desirable. We designed an integrated computation program package, PET-Tool, to automatically process PET sequences and map them to the genome sequences. The Tool was implemented as a web-based application composed of four modules: the Extractor module for PET extraction; the Examiner module for analytic evaluation of PET sequence quality; the Mapper module for locating PET sequences in the genome sequences; and the Project Manager module for data organization. The performance of PET-Tool was evaluated through the analyses of 2.7 million PET sequences. It was demonstrated that PET-Tool is accurate and efficient in extracting PET sequences and removing artifacts from large volume dataset. Using optimized mapping criteria, over 70% of quality PET sequences were mapped specifically to the genome sequences. With a 2.4 GHz LINUX machine, it takes approximately six hours to process one million PETs from extraction to mapping. The speed, accuracy, and comprehensiveness have proved that PET-Tool is an important and useful component in PET experiments, and can be extended to accommodate other related analyses of paired-end sequences. The Tool also provides user-friendly functions for data quality check and system for multi-layer data management.
Using information content and base frequencies to distinguish mutations from genetic polymorphisms in splice junction recognition sites.

PubMed

Rogan, P K; Schneider, T D

1995-01-01

Predicting the effects of nucleotide substitutions in human splice sites has been based on analysis of consensus sequences. We used a graphic representation of sequence conservation and base frequency, the sequence logo, to demonstrate that a change in a splice acceptor of hMSH2 (a gene associated with familial nonpolyposis colon cancer) probably does not reduce splicing efficiency. This confirms a population genetic study that suggested that this substitution is a genetic polymorphism. The information theory-based sequence logo is quantitative and more sensitive than the corresponding splice acceptor consensus sequence for detection of true mutations. Information analysis may potentially be used to distinguish polymorphisms from mutations in other types of transcriptional, translational, or protein-coding motifs.
Efficient alignment-free DNA barcode analytics

PubMed Central

Kuksa, Pavel; Pavlovic, Vladimir

2009-01-01

Background In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens possibility for accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. Results New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.) On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Conclusion Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding. PMID:19900305
SSR_pipeline: a bioinformatic infrastructure for identifying microsatellites from paired-end Illumina high-throughput DNA sequencing data

USGS Publications Warehouse

Miller, Mark P.; Knaus, Brian J.; Mullins, Thomas D.; Haig, Susan M.

2013-01-01

SSR_pipeline is a flexible set of programs designed to efficiently identify simple sequence repeats (e.g., microsatellites) from paired-end high-throughput Illumina DNA sequencing data. The program suite contains 3 analysis modules along with a fourth control module that can automate analyses of large volumes of data. The modules are used to 1) identify the subset of paired-end sequences that pass Illumina quality standards, 2) align paired-end reads into a single composite DNA sequence, and 3) identify sequences that possess microsatellites (both simple and compound) conforming to user-specified parameters. The microsatellite search algorithm is extremely efficient, and we have used it to identify repeats with motifs from 2 to 25bp in length. Each of the 3 analysis modules can also be used independently to provide greater flexibility or to work with FASTQ or FASTA files generated from other sequencing platforms (Roche 454, Ion Torrent, etc.). We demonstrate use of the program with data from the brine fly Ephydra packardi (Diptera: Ephydridae) and provide empirical timing benchmarks to illustrate program performance on a common desktop computer environment. We further show that the Illumina platform is capable of identifying large numbers of microsatellites, even when using unenriched sample libraries and a very small percentage of the sequencing capacity from a single DNA sequencing run. All modules from SSR_pipeline are implemented in the Python programming language and can therefore be used from nearly any computer operating system (Linux, Macintosh, and Windows).
SSR_pipeline: a bioinformatic infrastructure for identifying microsatellites from paired-end Illumina high-throughput DNA sequencing data.

PubMed

Miller, Mark P; Knaus, Brian J; Mullins, Thomas D; Haig, Susan M

2013-01-01

SSR_pipeline is a flexible set of programs designed to efficiently identify simple sequence repeats (e.g., microsatellites) from paired-end high-throughput Illumina DNA sequencing data. The program suite contains 3 analysis modules along with a fourth control module that can automate analyses of large volumes of data. The modules are used to 1) identify the subset of paired-end sequences that pass Illumina quality standards, 2) align paired-end reads into a single composite DNA sequence, and 3) identify sequences that possess microsatellites (both simple and compound) conforming to user-specified parameters. The microsatellite search algorithm is extremely efficient, and we have used it to identify repeats with motifs from 2 to 25 bp in length. Each of the 3 analysis modules can also be used independently to provide greater flexibility or to work with FASTQ or FASTA files generated from other sequencing platforms (Roche 454, Ion Torrent, etc.). We demonstrate use of the program with data from the brine fly Ephydra packardi (Diptera: Ephydridae) and provide empirical timing benchmarks to illustrate program performance on a common desktop computer environment. We further show that the Illumina platform is capable of identifying large numbers of microsatellites, even when using unenriched sample libraries and a very small percentage of the sequencing capacity from a single DNA sequencing run. All modules from SSR_pipeline are implemented in the Python programming language and can therefore be used from nearly any computer operating system (Linux, Macintosh, and Windows).
An analysis by metabolic labelling of the encephalomyocarditis virus ribosomal frameshifting efficiency and stimulators.

PubMed

Ling, Roger; Firth, Andrew E

2017-08-01

Programmed -1 ribosomal frameshifting is a mechanism of gene expression whereby specific signals within messenger RNAs direct a proportion of ribosomes to shift -1 nt and continue translating in the new reading frame. Such frameshifting normally depends on an RNA structure stimulator 3'-adjacent to a 'slippery' heptanucleotide shift site sequence. Recently we identified an unusual frameshifting mechanism in encephalomyocarditis virus, where the stimulator involves a trans-acting virus protein. Thus, in contrast to other examples of -1 frameshifting, the efficiency of frameshifting in encephalomyocarditis virus is best studied in the context of virus infection. Here we use metabolic labelling to analyse the frameshifting efficiency of wild-type and mutant viruses. Confirming previous results, frameshifting depends on a G_GUU_UUU shift site sequence and a 3'-adjacent stem-loop structure, but is not appreciably affected by the 'StopGo' sequence present ~30 nt upstream. At late timepoints, frameshifting was estimated to be 46-76 % efficient.
DUK - A Fast and Efficient Kmer Based Sequence Matching Tool

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Mingkun; Copeland, Alex; Han, James

2011-03-21

A new tool, DUK, is developed to perform matching task. Matching is to find whether a query sequence partially or totally matches given reference sequences or not. Matching is similar to alignment. Indeed many traditional analysis tasks like contaminant removal use alignment tools. But for matching, there is no need to know which bases of a query sequence matches which position of a reference sequence, it only need know whether there exists a match or not. This subtle difference can make matching task much faster than alignment. DUK is accurate, versatile, fast, and has efficient memory usage. It uses Kmermore » hashing method to index reference sequences and Poisson model to calculate p-value. DUK is carefully implemented in C++ in object oriented design. The resulted classes can also be used to develop other tools quickly. DUK have been widely used in JGI for a wide range of applications such as contaminant removal, organelle genome separation, and assembly refinement. Many real applications and simulated dataset demonstrate its power.« less
In vivo evaluation of the effect of stimulus distribution on FIR statistical efficiency in event-related fMRI

PubMed Central

Jansma, J Martijn; de Zwart, Jacco A; van Gelderen, Peter; Duyn, Jeff H; Drevets, Wayne C; Furey, Maura L

2013-01-01

Technical developments in MRI have improved signal to noise, allowing use of analysis methods such as Finite impulse response (FIR) of rapid event related functional MRI (er-fMRI). FIR is one of the most informative analysis methods as it determines onset and full shape of the hemodynamic response function (HRF) without any a-priori assumptions. FIR is however vulnerable to multicollinearity, which is directly related to the distribution of stimuli over time. Efficiency can be optimized by simplifying a design, and restricting stimuli distribution to specific sequences, while more design flexibility necessarily reduces efficiency. However, the actual effect of efficiency on fMRI results has never been tested in vivo. Thus, it is currently difficult to make an informed choice between protocol flexibility and statistical efficiency. The main goal of this study was to assign concrete fMRI signal to noise values to the abstract scale of FIR statistical efficiency. Ten subjects repeated a perception task with five random and m-sequence based protocol, with varying but, according to literature, acceptable levels of multicollinearity. Results indicated substantial differences in signal standard deviation, while the level was a function of multicollinearity. Experiment protocols varied up to 55.4% in standard deviation. Results confirm that quality of fMRI in an FIR analysis can significantly and substantially vary with statistical efficiency. Our in vivo measurements can be used to aid in making an informed decision between freedom in protocol design and statistical efficiency. PMID:23473798
Efficient genome-wide detection and cataloging of EMS-induced mutations using exome capture and next-generation sequencing

USDA-ARS?s Scientific Manuscript database

Chemical mutagenesis efficiently generates phenotypic variation in otherwise homogeneous genetic backgrounds, enabling functional analysis of genes. Advances in mutation detection have brought the utility of induced mutant populations on par with those produced by insertional mutagenesis, but system...
Evaluation of the Bacterial Diversity in the Human Tongue Coating Based on Genus-Specific Primers for 16S rRNA Sequencing.

PubMed

Sun, Beili; Zhou, Dongrui; Tu, Jing; Lu, Zuhong

2017-01-01

The characteristics of tongue coating are very important symbols for disease diagnosis in traditional Chinese medicine (TCM) theory. As a habitat of oral microbiota, bacteria on the tongue dorsum have been proved to be the cause of many oral diseases. The high-throughput next-generation sequencing (NGS) platforms have been widely applied in the analysis of bacterial 16S rRNA gene. We developed a methodology based on genus-specific multiprimer amplification and ligation-based sequencing for microbiota analysis. In order to validate the efficiency of the approach, we thoroughly analyzed six tongue coating samples from lung cancer patients with different TCM types, and more than 600 genera of bacteria were detected by this platform. The results showed that ligation-based parallel sequencing combined with enzyme digestion and multiamplification could expand the effective length of sequencing reads and could be applied in the microbiota analysis.
What is a melody? On the relationship between pitch and brightness of timbre

PubMed Central

Cousineau, Marion; Carcagno, Samuele; Demany, Laurent; Pressnitzer, Daniel

2014-01-01

Previous studies showed that the perceptual processing of sound sequences is more efficient when the sounds vary in pitch than when they vary in loudness. We show here that sequences of sounds varying in brightness of timbre are processed with the same efficiency as pitch sequences. The sounds used consisted of two simultaneous pure tones one octave apart, and the listeners’ task was to make same/different judgments on pairs of sequences varying in length (one, two, or four sounds). In one condition, brightness of timbre was varied within the sequences by changing the relative level of the two pure tones. In other conditions, pitch was varied by changing fundamental frequency, or loudness was varied by changing the overall level. In all conditions, only two possible sounds could be used in a given sequence, and these two sounds were equally discriminable. When sequence length increased from one to four, discrimination performance decreased substantially for loudness sequences, but to a smaller extent for brightness sequences and pitch sequences. In the latter two conditions, sequence length had a similar effect on performance. These results suggest that the processes dedicated to pitch and brightness analysis, when probed with a sequence-discrimination task, share unexpected similarities. PMID:24478638
Homonuclear Hartmann-Hahn transfer with reduced relaxation losses by use of the MOCCA-XY16 multiple pulse sequence

NASA Astrophysics Data System (ADS)

Furrer, Julien; Kramer, Frank; Marino, John P.; Glaser, Steffen J.; Luy, Burkhard

2004-01-01

Homonuclear Hartmann-Hahn transfer is one of the most important building blocks in modern high-resolution NMR. It constitutes a very efficient transfer element for the assignment of proteins, nucleic acids, and oligosaccharides. Nevertheless, in macromolecules exceeding ˜10 kDa TOCSY-experiments can show decreasing sensitivity due to fast transverse relaxation processes that are active during the mixing periods. In this article we propose the MOCCA-XY16 multiple pulse sequence, originally developed for efficient TOCSY transfer through residual dipolar couplings, as a homonuclear Hartmann-Hahn sequence with improved relaxation properties. A theoretical analysis of the coherence transfer via scalar couplings and its relaxation behavior as well as experimental transfer curves for MOCCA-XY16 relative to the well-characterized DIPSI-2 multiple pulse sequence are given.

Homonuclear Hartmann-Hahn transfer with reduced relaxation losses by use of the MOCCA-XY16 multiple pulse sequence.

PubMed

Furrer, Julien; Kramer, Frank; Marino, John P; Glaser, Steffen J; Luy, Burkhard

2004-01-01

Homonuclear Hartmann-Hahn transfer is one of the most important building blocks in modern high-resolution NMR. It constitutes a very efficient transfer element for the assignment of proteins, nucleic acids, and oligosaccharides. Nevertheless, in macromolecules exceeding approximately 10 kDa TOCSY-experiments can show decreasing sensitivity due to fast transverse relaxation processes that are active during the mixing periods. In this article we propose the MOCCA-XY16 multiple pulse sequence, originally developed for efficient TOCSY transfer through residual dipolar couplings, as a homonuclear Hartmann-Hahn sequence with improved relaxation properties. A theoretical analysis of the coherence transfer via scalar couplings and its relaxation behavior as well as experimental transfer curves for MOCCA-XY16 relative to the well-characterized DIPSI-2 multiple pulse sequence are given.
Uterus segmentation in dynamic MRI using LBP texture descriptors

NASA Astrophysics Data System (ADS)

Namias, R.; Bellemare, M.-E.; Rahim, M.; Pirró, N.

2014-03-01

Pelvic floor disorders cover pathologies of which physiopathology is not well understood. However cases get prevalent with an ageing population. Within the context of a project aiming at modelization of the dynamics of pelvic organs, we have developed an efficient segmentation process. It aims at alleviating the radiologist with a tedious one by one image analysis. From a first contour delineating the uterus-vagina set, the organ border is tracked along a dynamic mri sequence. The process combines movement prediction, local intensity and texture analysis and active contour geometry control. Movement prediction allows a contour intitialization for next image in the sequence. Intensity analysis provides image-based local contour detection enhanced by local binary pattern (lbp) texture descriptors. Geometry control prohibits self intersections and smoothes the contour. Results show the efficiency of the method with images produced in clinical routine.
QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation.

PubMed

Wagle, Prerana; Nikolić, Miloš; Frommolt, Peter

2015-07-01

Next-Generation Sequencing (NGS) has emerged as a widely used tool in molecular biology. While time and cost for the sequencing itself are decreasing, the analysis of the massive amounts of data remains challenging. Since multiple algorithmic approaches for the basic data analysis have been developed, there is now an increasing need to efficiently use these tools to obtain results in reasonable time. We have developed QuickNGS, a new workflow system for laboratories with the need to analyze data from multiple NGS projects at a time. QuickNGS takes advantage of parallel computing resources, a comprehensive back-end database, and a careful selection of previously published algorithmic approaches to build fully automated data analysis workflows. We demonstrate the efficiency of our new software by a comprehensive analysis of 10 RNA-Seq samples which we can finish in only a few minutes of hands-on time. The approach we have taken is suitable to process even much larger numbers of samples and multiple projects at a time. Our approach considerably reduces the barriers that still limit the usability of the powerful NGS technology and finally decreases the time to be spent before proceeding to further downstream analysis and interpretation of the data.
An Interactive Multiobjective Programming Approach to Combinatorial Data Analysis.

ERIC Educational Resources Information Center

Brusco, Michael J.; Stahl, Stephanie

2001-01-01

Describes an interactive procedure for multiobjective asymmetric unidimensional seriation problems that uses a dynamic-programming algorithm to generate partially the efficient set of sequences for small to medium-sized problems and a multioperational heuristic to estimate the efficient set for larger problems. Applies the procedure to an…
Parallel-Connected Photovoltaic Inverters: Zero Frequency Sequence Harmonic Analysis and Solution

NASA Astrophysics Data System (ADS)

Carmeli, Maria Stefania; Mauri, Marco; Frosio, Luisa; Bezzolato, Alberto; Marchegiani, Gabriele

2013-05-01

High-power photovoltaic (PV) plants are usually constituted of the connection of different PV subfields, each of them with its interface transformer. Different solutions have been studied to improve the efficiency of the whole generation system. In particular, transformerless configurations are the more attractive one from efficiency and costs point of view. This paper focuses on transformerless PV configurations characterised by the parallel connection of interface inverters. The problem of zero sequence current due to both the parallel connection and the presence of undesirable parasitic earth capacitances is considered and a solution, which consists of the synchronisation of pulse-width modulation triangular carrier, is proposed and theoretically analysed. The theoretical analysis has been validated through simulation and experimental results.
Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data.

PubMed

Althammer, Sonja; González-Vallinas, Juan; Ballaré, Cecilia; Beato, Miguel; Eyras, Eduardo

2011-12-15

High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein-DNA and protein-RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing. We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq, either punctuated or broad signals, CLIP-Seq and RNA-Seq. We prove the effectiveness of Pyicos to select for significant signals and show that its accuracy is comparable and sometimes superior to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS datatypes through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics. Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy eduardo.eyras@upf.edu Supplementary data are available at Bioinformatics online.
TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data

PubMed Central

Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie

2018-01-01

Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. PMID:29106630
BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU.

PubMed

Luo, Ruibang; Wong, Yiu-Lun; Law, Wai-Chun; Lee, Lap-Kei; Cheung, Jeanno; Liu, Chi-Man; Lam, Tak-Wah

2014-01-01

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.
Hybridization-based antibody cDNA recovery for the production of recombinant antibodies identified by repertoire sequencing.

PubMed

Valdés-Alemán, Javier; Téllez-Sosa, Juan; Ovilla-Muñoz, Marbella; Godoy-Lozano, Elizabeth; Velázquez-Ramírez, Daniel; Valdovinos-Torres, Humberto; Gómez-Barreto, Rosa E; Martinez-Barnetche, Jesús

2014-01-01

High-throughput sequencing of the antibody repertoire is enabling a thorough analysis of B cell diversity and clonal selection, which may improve the novel antibody discovery process. Theoretically, an adequate bioinformatic analysis could allow identification of candidate antigen-specific antibodies, requiring their recombinant production for experimental validation of their specificity. Gene synthesis is commonly used for the generation of recombinant antibodies identified in silico. Novel strategies that bypass gene synthesis could offer more accessible antibody identification and validation alternatives. We developed a hybridization-based recovery strategy that targets the complementarity-determining region 3 (CDRH3) for the enrichment of cDNA of candidate antigen-specific antibody sequences. Ten clonal groups of interest were identified through bioinformatic analysis of the heavy chain antibody repertoire of mice immunized with hen egg white lysozyme (HEL). cDNA from eight of the targeted clonal groups was recovered efficiently, leading to the generation of recombinant antibodies. One representative heavy chain sequence from each clonal group recovered was paired with previously reported anti-HEL light chains to generate full antibodies, later tested for HEL-binding capacity. The recovery process proposed represents a simple and scalable molecular strategy that could enhance antibody identification and specificity assessment, enabling a more cost-efficient generation of recombinant antibodies.
A strategic stakeholder approach for addressing further analysis requests in whole genome sequencing research.

PubMed

Thornock, Bradley Steven O

2016-01-01

Whole genome sequencing (WGS) can be a cost-effective and efficient means of diagnosis for some children, but it also raises a number of ethical concerns. One such concern is how researchers derive and communicate results from WGS, including future requests for further analysis of stored sequences. The purpose of this paper is to think about what is at stake, and for whom, in any solution that is developed to deal with such requests. To accomplish this task, this paper will utilize stakeholder theory, a common method used in business ethics. Several scenarios that connect stakeholder concerns and WGS will also posited and analyzed. This paper concludes by developing criteria composed of a series of questions that researchers can answer in order to more effectively address requests for further analysis of stored sequences.
Draft Genome Sequence of the Efficient Bioflocculant-Producing Bacterium Paenibacillus sp. Strain A9

PubMed Central

Liu, Jin-liang; Hu, Xiao-min

2013-01-01

Paenibacillus sp. strain A9 is an important bioflocculant-producing bacterium, isolated from a soil sample, and is pale pink-pigmented, aerobic, and Gram-positive. Here, we report the draft genome sequence and the initial findings from a preliminary analysis of strain A9, which is a novel species of Paenibacillus. PMID:23618713
Adenine specific DNA chemical sequencing reaction.

PubMed Central

Iverson, B L; Dervan, P B

1987-01-01

Reaction of DNA with K2PdCl4 at pH 2.0 followed by a piperidine workup produces specific cleavage at adenine (A) residues. Product analysis revealed the K2PdCl4 reaction involves selective depurination at adenine, affording an excision reaction analogous to the other chemical DNA sequencing reactions. Adenine residues methylated at the exocyclic amine (N6) react with lower efficiency than unmethylated adenine in an identical sequence. This simple protocol specific for A may be a useful addition to current chemical sequencing reactions. Images PMID:3671067
ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysis.

PubMed

He, W; Zhao, S; Liu, X; Dong, S; Lv, J; Liu, D; Wang, J; Meng, Z

2013-12-04

Large-scale next-generation sequencing (NGS)-based resequencing detects sequence variations, constructs evolutionary histories, and identifies phenotype-related genotypes. However, NGS-based resequencing studies generate extraordinarily large amounts of data, making computations difficult. Effective use and analysis of these data for NGS-based resequencing studies remains a difficult task for individual researchers. Here, we introduce ReSeqTools, a full-featured toolkit for NGS (Illumina sequencing)-based resequencing analysis, which processes raw data, interprets mapping results, and identifies and annotates sequence variations. ReSeqTools provides abundant scalable functions for routine resequencing analysis in different modules to facilitate customization of the analysis pipeline. ReSeqTools is designed to use compressed data files as input or output to save storage space and facilitates faster and more computationally efficient large-scale resequencing studies in a user-friendly manner. It offers abundant practical functions and generates useful statistics during the analysis pipeline, which significantly simplifies resequencing analysis. Its integrated algorithms and abundant sub-functions provide a solid foundation for special demands in resequencing projects. Users can combine these functions to construct their own pipelines for other purposes.
High-throughput sequencing: a failure mode analysis.

PubMed

Yang, George S; Stott, Jeffery M; Smailus, Duane; Barber, Sarah A; Balasundaram, Miruna; Marra, Marco A; Holt, Robert A

2005-01-04

Basic manufacturing principles are becoming increasingly important in high-throughput sequencing facilities where there is a constant drive to increase quality, increase efficiency, and decrease operating costs. While high-throughput centres report failure rates typically on the order of 10%, the causes of sporadic sequencing failures are seldom analyzed in detail and have not, in the past, been formally reported. Here we report the results of a failure mode analysis of our production sequencing facility based on detailed evaluation of 9,216 ESTs generated from two cDNA libraries. Two categories of failures are described; process-related failures (failures due to equipment or sample handling) and template-related failures (failures that are revealed by close inspection of electropherograms and are likely due to properties of the template DNA sequence itself). Preventative action based on a detailed understanding of failure modes is likely to improve the performance of other production sequencing pipelines.
Open Reading Frame Phylogenetic Analysis on the Cloud

PubMed Central

2013-01-01

Phylogenetic analysis has become essential in researching the evolutionary relationships between viruses. These relationships are depicted on phylogenetic trees, in which viruses are grouped based on sequence similarity. Viral evolutionary relationships are identified from open reading frames rather than from complete sequences. Recently, cloud computing has become popular for developing internet-based bioinformatics tools. Biocloud is an efficient, scalable, and robust bioinformatics computing service. In this paper, we propose a cloud-based open reading frame phylogenetic analysis service. The proposed service integrates the Hadoop framework, virtualization technology, and phylogenetic analysis methods to provide a high-availability, large-scale bioservice. In a case study, we analyze the phylogenetic relationships among Norovirus. Evolutionary relationships are elucidated by aligning different open reading frame sequences. The proposed platform correctly identifies the evolutionary relationships between members of Norovirus. PMID:23671843
Introduction on Using the FastPCR Software and the Related Java Web Tools for PCR and Oligonucleotide Assembly and Analysis.

PubMed

Kalendar, Ruslan; Tselykh, Timofey V; Khassenov, Bekbolat; Ramanculov, Erlan M

2017-01-01

This chapter introduces the FastPCR software as an integrated tool environment for PCR primer and probe design, which predicts properties of oligonucleotides based on experimental studies of the PCR efficiency. The software provides comprehensive facilities for designing primers for most PCR applications and their combinations. These include the standard PCR as well as the multiplex, long-distance, inverse, real-time, group-specific, unique, overlap extension PCR for multi-fragments assembling cloning and loop-mediated isothermal amplification (LAMP). It also contains a built-in program to design oligonucleotide sets both for long sequence assembly by ligase chain reaction and for design of amplicons that tile across a region(s) of interest. The software calculates the melting temperature for the standard and degenerate oligonucleotides including locked nucleic acid (LNA) and other modifications. It also provides analyses for a set of primers with the prediction of oligonucleotide properties, dimer and G/C-quadruplex detection, linguistic complexity as well as a primer dilution and resuspension calculator. The program consists of various bioinformatical tools for analysis of sequences with the GC or AT skew, CG% and GA% content, and the purine-pyrimidine skew. It also analyzes the linguistic sequence complexity and performs generation of random DNA sequence as well as restriction endonucleases analysis. The program allows to find or create restriction enzyme recognition sites for coding sequences and supports the clustering of sequences. It performs efficient and complete detection of various repeat types with visual display. The FastPCR software allows the sequence file batch processing that is essential for automation. The program is available for download at http://primerdigital.com/fastpcr.html , and its online version is located at http://primerdigital.com/tools/pcr.html .
Direct detection of RNA in vitro and in situ by target-primed RCA: The impact of E. coli RNase III on the detection efficiency of RNA sequences distanced far from the 3'-end.

PubMed

Merkiene, Egle; Gaidamaviciute, Edita; Riauba, Laurynas; Janulaitis, Arvydas; Lagunavicius, Arunas

2010-08-01

We improved the target RNA-primed RCA technique for direct detection and analysis of RNA in vitro and in situ. Previously we showed that the 3' --> 5' single-stranded RNA exonucleolytic activity of Phi29 DNA polymerase converts the target RNA into a primer and uses it for RCA initiation. However, in some cases, the single-stranded RNA exoribonucleolytic activity of the polymerase is hindered by strong double-stranded structures at the 3'-end of target RNAs. We demonstrate that in such hampered cases, the double-stranded RNA-specific Escherichia coli RNase III efficiently assists Phi29 DNA polymerase in converting the target RNA into a primer. These observations extend the target RNA-primed RCA possibilities to test RNA sequences distanced far from the 3'-end and customize this technique for the inner RNA sequence analysis.
IVisTMSA: Interactive Visual Tools for Multiple Sequence Alignments.

PubMed

Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Naeem; Naveed, Nasir; Ahmad, Sarfraz; Muhammad, Shah; Qadri, Salman; Shahid, Muhammad; Hussain, Tanveer; Javed, Maryam

2015-01-01

IVisTMSA is a software package of seven graphical tools for multiple sequence alignments. MSApad is an editing and analysis tool. It can load 409% more data than Jalview, STRAP, CINEMA, and Base-by-Base. MSA comparator allows the user to visualize consistent and inconsistent regions of reference and test alignments of more than 21-MB size in less than 12 seconds. MSA comparator is 5,200% efficient and more than 40% efficient as compared to BALiBASE c program and FastSP, respectively. MSA reconstruction tool provides graphical user interfaces for four popular aligners and allows the user to load several sequence files at a time. FASTA generator converts seven formats of alignments of unlimited size into FASTA format in a few seconds. MSA ID calculator calculates identity matrix of more than 11,000 sequences with a sequence length of 2,696 base pairs in less than 100 seconds. Tree and Distance Matrix calculation tools generate phylogenetic tree and distance matrix, respectively, using neighbor joining% identity and BLOSUM 62 matrix.
Single-Cell Semiconductor Sequencing

PubMed Central

Kohn, Andrea B.; Moroz, Tatiana P.; Barnes, Jeffrey P.; Netherton, Mandy; Moroz, Leonid L.

2014-01-01

RNA-seq or transcriptome analysis of individual cells and small-cell populations is essential for virtually any biomedical field. It is especially critical for developmental, aging, and cancer biology as well as neuroscience where the enormous heterogeneity of cells present a significant methodological and conceptual challenge. Here we present two methods that allow for fast and cost-efficient transcriptome sequencing from ultra-small amounts of tissue or even from individual cells using semiconductor sequencing technology (Ion Torrent, Life Technologies). The first method is a reduced representation sequencing which maximizes capture of RNAs and preserves transcripts’ directionality. The second, a template-switch protocol, is designed for small mammalian neurons. Both protocols, from cell/tissue isolation to final sequence data, take up to 4 days. The efficiency of these protocols has been validated with single hippocampal neurons and various invertebrate tissues including individually identified neurons within a simpler memory-forming circuit of Aplysia californica and early (1-, 2-, 4-, 8-cells) embryonic and developmental stages from basal metazoans. PMID:23929110
In vivo evaluation of the effect of stimulus distribution on FIR statistical efficiency in event-related fMRI.

PubMed

Jansma, J Martijn; de Zwart, Jacco A; van Gelderen, Peter; Duyn, Jeff H; Drevets, Wayne C; Furey, Maura L

2013-05-15

Technical developments in MRI have improved signal to noise, allowing use of analysis methods such as Finite impulse response (FIR) of rapid event related functional MRI (er-fMRI). FIR is one of the most informative analysis methods as it determines onset and full shape of the hemodynamic response function (HRF) without any a priori assumptions. FIR is however vulnerable to multicollinearity, which is directly related to the distribution of stimuli over time. Efficiency can be optimized by simplifying a design, and restricting stimuli distribution to specific sequences, while more design flexibility necessarily reduces efficiency. However, the actual effect of efficiency on fMRI results has never been tested in vivo. Thus, it is currently difficult to make an informed choice between protocol flexibility and statistical efficiency. The main goal of this study was to assign concrete fMRI signal to noise values to the abstract scale of FIR statistical efficiency. Ten subjects repeated a perception task with five random and m-sequence based protocol, with varying but, according to literature, acceptable levels of multicollinearity. Results indicated substantial differences in signal standard deviation, while the level was a function of multicollinearity. Experiment protocols varied up to 55.4% in standard deviation. Results confirm that quality of fMRI in an FIR analysis can significantly and substantially vary with statistical efficiency. Our in vivo measurements can be used to aid in making an informed decision between freedom in protocol design and statistical efficiency. Published by Elsevier B.V.

Genomics of crop wild relatives: expanding the gene pool for crop improvement.

PubMed

Brozynska, Marta; Furtado, Agnelo; Henry, Robert J

2016-04-01

Plant breeders require access to new genetic diversity to satisfy the demands of a growing human population for more food that can be produced in a variable or changing climate and to deliver the high-quality food with nutritional and health benefits demanded by consumers. The close relatives of domesticated plants, crop wild relatives (CWRs), represent a practical gene pool for use by plant breeders. Genomics of CWR generates data that support the use of CWR to expand the genetic diversity of crop plants. Advances in DNA sequencing technology are enabling the efficient sequencing of CWR and their increased use in crop improvement. As the sequencing of genomes of major crop species is completed, attention has shifted to analysis of the wider gene pool of major crops including CWR. A combination of de novo sequencing and resequencing is required to efficiently explore useful genetic variation in CWR. Analysis of the nuclear genome, transcriptome and maternal (chloroplast and mitochondrial) genome of CWR is facilitating their use in crop improvement. Genome analysis results in discovery of useful alleles in CWR and identification of regions of the genome in which diversity has been lost in domestication bottlenecks. Targeting of high priority CWR for sequencing will maximize the contribution of genome sequencing of CWR. Coordination of global efforts to apply genomics has the potential to accelerate access to and conservation of the biodiversity essential to the sustainability of agriculture and food production. © 2015 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd.
The Draft Genome Sequence of a Novel High-Efficient Butanol-Producing Bacterium Clostridium Diolis Strain WST.

PubMed

Chen, Chaoyang; Sun, Chongran; Wu, Yi-Rui

2018-03-21

A wild-type solventogenic strain Clostridium diolis WST, isolated from mangrove sediments, was characterized to produce high amount of butanol and acetone with negligible level of ethanol and acids from glucose via a unique acetone-butanol (AB) fermentation pathway. Through the genomic sequencing, the assembled draft genome of strain WST is calculated to be 5.85 Mb with a GC content of 29.69% and contains 5263 genes that contribute to the annotation of 5049 protein-coding sequences. Within these annotated genes, the butanol dehydrogenase gene (bdh) was determined to be in a higher amount from strain WST compared to other Clostridial strains, which is positively related to its high-efficient production of butanol. Therefore, we present a draft genome sequence analysis of strain WST in this article that should facilitate to further understand the solventogenic mechanism of this special microorganism.
Sequence analysis of cultivated strawberry (Fragaria × ananassa Duch.) using microdissected single somatic chromosomes.

PubMed

Yanagi, Tomohiro; Shirasawa, Kenta; Terachi, Mayuko; Isobe, Sachiko

2017-01-01

Cultivated strawberry ( Fragaria × ananassa Duch.) has homoeologous chromosomes because of allo-octoploidy. For example, two homoeologous chromosomes that belong to different sub-genome of allopolyploids have similar base sequences. Thus, when conducting de novo assembly of DNA sequences, it is difficult to determine whether these sequences are derived from the same chromosome. To avoid the difficulties associated with homoeologous chromosomes and demonstrate the possibility of sequencing allopolyploids using single chromosomes, we conducted sequence analysis using microdissected single somatic chromosomes of cultivated strawberry. Three hundred and ten somatic chromosomes of the Japanese octoploid strawberry 'Reiko' were individually selected under a light microscope using a microdissection system. DNA from 288 of the dissected chromosomes was successfully amplified using a DNA amplification kit. Using next-generation sequencing, we decoded the base sequences of the amplified DNA segments, and on the basis of mapping, we identified DNA sequences from 144 samples that were best matched to the reference genomes of the octoploid strawberry, F. × ananassa , and the diploid strawberry, F. vesca . The 144 samples were classified into seven pseudo-molecules of F. vesca . The coverage rates of the DNA sequences from the single chromosome onto all pseudo-molecular sequences varied from 3 to 29.9%. We demonstrated an efficient method for sequence analysis of allopolyploid plants using microdissected single chromosomes. On the basis of our results, we believe that whole-genome analysis of allopolyploid plants can be enhanced using methodology that employs microdissected single chromosomes.
Identification of multiple mRNA and DNA sequences from small tissue samples isolated by laser-assisted microdissection.

PubMed

Bernsen, M R; Dijkman, H B; de Vries, E; Figdor, C G; Ruiter, D J; Adema, G J; van Muijen, G N

1998-10-01

Molecular analysis of small tissue samples has become increasingly important in biomedical studies. Using a laser dissection microscope and modified nucleic acid isolation protocols, we demonstrate that multiple mRNA as well as DNA sequences can be identified from a single-cell sample. In addition, we show that the specificity of procurement of tissue samples is not compromised by smear contamination resulting from scraping of the microtome knife during sectioning of lesions. The procedures described herein thus allow for efficient RT-PCR or PCR analysis of multiple nucleic acid sequences from small tissue samples obtained by laser-assisted microdissection.
Construction and Evaluation of Normalized cDNA Libraries Enriched with Full-Length Sequences for Rapid Discovery of New Genes from Sisal (Agave sisalana Perr.) Different Developmental Stages

PubMed Central

Zhou, Wen-Zhao; Zhang, Yan-Mei; Lu, Jun-Ying; Li, Jun-Feng

2012-01-01

To provide a resource of sisal-specific expressed sequence data and facilitate this powerful approach in new gene research, the preparation of normalized cDNA libraries enriched with full-length sequences is necessary. Four libraries were produced with RNA pooled from Agave sisalana multiple tissues to increase efficiency of normalization and maximize the number of independent genes by SMART™ method and the duplex-specific nuclease (DSN). This procedure kept the proportion of full-length cDNAs in the subtracted/normalized libraries and dramatically enhanced the discovery of new genes. Sequencing of 3875 cDNA clones of libraries revealed 3320 unigenes with an average insert length about 1.2 kb, indicating that the non-redundancy of libraries was about 85.7%. These unigene functions were predicted by comparing their sequences to functional domain databases and extensively annotated with Gene Ontology (GO) terms. Comparative analysis of sisal unigenes and other plant genomes revealed that four putative MADS-box genes and knotted-like homeobox (knox) gene were obtained from a total of 1162 full-length transcripts. Furthermore, real-time PCR showed that the characteristics of their transcripts mainly depended on the tight expression regulation of a number of genes during the leaf and flower development. Analysis of individual library sequence data indicated that the pooled-tissue approach was highly effective in discovering new genes and preparing libraries for efficient deep sequencing. PMID:23202944
Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics.

PubMed

Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter

2015-01-20

While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Adaptive efficient compression of genomes

PubMed Central

2012-01-01

Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, memory requirements of the current algorithms are high and run times often are slow. In this paper, we propose an adaptive, parallel and highly efficient referential sequence compression method which allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is for human genomes on-par with the best previous algorithms in terms of compression ratio (400:1) and compression speed. In contrast, it compresses a complete human genome in just 11 seconds when provided with 9 GB of main memory, which is almost three times faster than the best competitor while using less main memory. PMID:23146997
PGen: large-scale genomic variations analysis workflow and browser in SoyKB.

PubMed

Liu, Yang; Khan, Saad M; Wang, Juexin; Rynge, Mats; Zhang, Yuanxun; Zeng, Shuai; Chen, Shiyuan; Maldonado Dos Santos, Joao V; Valliyodan, Babu; Calyam, Prasad P; Merchant, Nirav; Nguyen, Henry T; Xu, Dong; Joshi, Trupti

2016-10-06

With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. We have developed both a Linux version in GitHub ( https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow ) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), ( http://soykb.org/Pegasus/index.php ). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser ( http://soykb.org/NGS_Resequence/NGS_index.php ) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species.
Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences.

PubMed

Defrance, Matthieu; Janky, Rekin's; Sand, Olivier; van Helden, Jacques

2008-01-01

This protocol explains how to discover functional signals in genomic sequences by detecting over- or under-represented oligonucleotides (words) or spaced pairs thereof (dyads) with the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Two typical applications are presented: (i) predicting transcription factor-binding motifs in promoters of coregulated genes and (ii) discovering phylogenetic footprints in promoters of orthologous genes. The steps of this protocol include purging genomic sequences to discard redundant fragments, discovering over-represented patterns and assembling them to obtain degenerate motifs, scanning sequences and drawing feature maps. The main strength of the method is its statistical ground: the binomial significance provides an efficient control on the rate of false positives. In contrast with optimization-based pattern discovery algorithms, the method supports the detection of under- as well as over-represented motifs. Computation times vary from seconds (gene clusters) to minutes (whole genomes). The execution of the whole protocol should take approximately 1 h.
Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

PubMed

Mackey, Aaron J; Pearson, William R

2004-10-01

Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.
Meta-analysis of RNA-Seq data across cohorts in a multi-season feed efficiency study of crossbred beef steers accounts for biological and technical variability within season

USDA-ARS?s Scientific Manuscript database

High-throughput sequencing is often used for studies of the transcriptome, particularly for comparisons between experimental conditions. Due to sequencing costs, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential ex...
Single-Exome sequencing identified a novel RP2 mutation in a child with X-linked retinitis pigmentosa.

PubMed

Lim, Hassol; Park, Young-Mi; Lee, Jong-Keuk; Taek Lim, Hyun

2016-10-01

To present an efficient and successful application of a single-exome sequencing study in a family clinically diagnosed with X-linked retinitis pigmentosa. Exome sequencing study based on clinical examination data. An 8-year-old proband and his family. The proband and his family members underwent comprehensive ophthalmologic examinations. Exome sequencing was undertaken in the proband using Agilent SureSelect Human All Exon Kit and Illumina HiSeq 2000 platform. Bioinformatic analysis used Illumina pipeline with Burrows-Wheeler Aligner-Genome Analysis Toolkit (BWA-GATK), followed by ANNOVAR to perform variant functional annotation. All variants passing filter criteria were validated by Sanger sequencing to confirm familial segregation. Analysis of exome sequence data identified a novel frameshift mutation in RP2 gene resulting in a premature stop codon (c.665delC, p.Pro222fsTer237). Sanger sequencing revealed this mutation co-segregated with the disease phenotype in the child's family. We identified a novel causative mutation in RP2 from a single proband's exome sequence data analysis. This study highlights the effectiveness of the whole-exome sequencing in the genetic diagnosis of X-linked retinitis pigmentosa, over the conventional sequencing methods. Even using a single exome, exome sequencing technology would be able to pinpoint pathogenic variant(s) for X-linked retinitis pigmentosa, when properly applied with aid of adequate variant filtering strategy. Copyright © 2016 Canadian Ophthalmological Society. Published by Elsevier Inc. All rights reserved.
Multiplexed resequencing analysis to identify rare variants in pooled DNA with barcode indexing using next-generation sequencer.

PubMed

Mitsui, Jun; Fukuda, Yoko; Azuma, Kyo; Tozaki, Hirokazu; Ishiura, Hiroyuki; Takahashi, Yuji; Goto, Jun; Tsuji, Shoji

2010-07-01

We have recently found that multiple rare variants of the glucocerebrosidase gene (GBA) confer a robust risk for Parkinson disease, supporting the 'common disease-multiple rare variants' hypothesis. To develop an efficient method of identifying rare variants in a large number of samples, we applied multiplexed resequencing using a next-generation sequencer to identification of rare variants of GBA. Sixteen sets of pooled DNAs from six pooled DNA samples were prepared. Each set of pooled DNAs was subjected to polymerase chain reaction to amplify the target gene (GBA) covering 6.5 kb, pooled into one tube with barcode indexing, and then subjected to extensive sequence analysis using the SOLiD System. Individual samples were also subjected to direct nucleotide sequence analysis. With the optimization of data processing, we were able to extract all the variants from 96 samples with acceptable rates of false-positive single-nucleotide variants.
An improved model for whole genome phylogenetic analysis by Fourier transform.

PubMed

Yin, Changchuan; Yau, Stephen S-T

2015-10-07

DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences undergone rearrangements as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor and the 2D mapping reduces the nucleotide composition bias in distance measure, and thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra are in the same dimensionality of the Fourier frequency space, then the Euclidean distances of full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with increased computational performance by 2D numerical representation, can be applicable to any DNA sequences of different length ranges. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied on phylogenetic analysis for individual genes and large whole bacterial genomes. Copyright © 2015 Elsevier Ltd. All rights reserved.
'DNA Strider': a 'C' program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers.

PubMed Central

Marck, C

1988-01-01

DNA Strider is a new integrated DNA and Protein sequence analysis program written with the C language for the Macintosh Plus, SE and II computers. It has been designed as an easy to learn and use program as well as a fast and efficient tool for the day-to-day sequence analysis work. The program consists of a multi-window sequence editor and of various DNA and Protein analysis functions. The editor may use 4 different types of sequences (DNA, degenerate DNA, RNA and one-letter coded protein) and can handle simultaneously 6 sequences of any type up to 32.5 kB each. Negative numbering of the bases is allowed for DNA sequences. All classical restriction and translation analysis functions are present and can be performed in any order on any open sequence or part of a sequence. The main feature of the program is that the same analysis function can be repeated several times on different sequences, thus generating multiple windows on the screen. Many graphic capabilities have been incorporated such as graphic restriction map, hydrophobicity profile and the CAI plot- codon adaptation index according to Sharp and Li. The restriction sites search uses a newly designed fast hexamer look-ahead algorithm. Typical runtime for the search of all sites with a library of 130 restriction endonucleases is 1 second per 10,000 bases. The circular graphic restriction map of the pBR322 plasmid can be therefore computed from its sequence and displayed on the Macintosh Plus screen within 2 seconds and its multiline restriction map obtained in a scrolling window within 5 seconds. PMID:2832831
Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

PubMed

Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

2015-12-01

The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant feature for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. Copyright © 2015 Elsevier Ltd. All rights reserved.
Efficient generation of transgenic cattle using the DNA transposon and their analysis by next-generation sequencing

PubMed Central

Yum, Soo-Young; Lee, Song-Jeon; Kim, Hyun-Min; Choi, Woo-Jae; Park, Ji-Hyun; Lee, Won-Wu; Kim, Hee-Soo; Kim, Hyeong-Jong; Bae, Seong-Hun; Lee, Je-Hyeong; Moon, Joo-Yeong; Lee, Ji-Hyun; Lee, Choong-Il; Son, Bong-Jun; Song, Sang-Hoon; Ji, Su-Min; Kim, Seong-Jin; Jang, Goo

2016-01-01

Here, we efficiently generated transgenic cattle using two transposon systems (Sleeping Beauty and Piggybac) and their genomes were analyzed by next-generation sequencing (NGS). Blastocysts derived from microinjection of DNA transposons were selected and transferred into recipient cows. Nine transgenic cattle have been generated and grown-up to date without any health issues except two. Some of them expressed strong fluorescence and the transgene in the oocytes from a superovulating one were detected by PCR and sequencing. To investigate genomic variants by the transgene transposition, whole genomic DNA were analyzed by NGS. We found that preferred transposable integration (TA or TTAA) was identified in their genome. Even though multi-copies (i.e. fifteen) were confirmed, there was no significant difference in genome instabilities. In conclusion, we demonstrated that transgenic cattle using the DNA transposon system could be efficiently generated, and all those animals could be a valuable resource for agriculture and veterinary science. PMID:27324781
ParticleCall: A particle filter for base calling in next-generation sequencing systems

PubMed Central

2012-01-01

Background Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. Accuracy and lengths of their reads, however, are yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data. Results In this paper, we consider Illumina’s sequencing-by-synthesis platform which relies on reversible terminator chemistry and describe the acquired signal by reformulating its mathematical model as a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on a data set obtained by sequencing phiX174 bacteriophage using Illumina’s Genome Analyzer II. The results show that the developed base calling scheme is significantly more computationally efficient than the best performing unsupervised method currently available, while achieving the same accuracy. Conclusions The proposed ParticleCall provides more accurate calls than the Illumina’s base calling algorithm, Bustard. At the same time, ParticleCall is significantly more computationally efficient than other recent schemes with similar performance, rendering it more feasible for high-throughput sequencing data analysis. Improvement of base calling accuracy will have immediate beneficial effects on the performance of downstream applications such as SNP and genotype calling. ParticleCall is freely available at https://sourceforge.net/projects/particlecall. PMID:22776067
TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data.

PubMed

Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong

2018-01-04

Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
CAFE: aCcelerated Alignment-FrEe sequence analysis.

PubMed

Lu, Yang Young; Tang, Kujin; Ren, Jie; Fuhrman, Jed A; Waterman, Michael S; Sun, Fengzhu

2017-07-03

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

Use of plan quality degradation to evaluate tradeoffs in delivery efficiency and clinical plan metrics arising from IMRT optimizer and sequencer compromises

PubMed Central

Wilkie, Joel R.; Matuszak, Martha M.; Feng, Mary; Moran, Jean M.; Fraass, Benedick A.

2013-01-01

Purpose: Plan degradation resulting from compromises made to enhance delivery efficiency is an important consideration for intensity modulated radiation therapy (IMRT) treatment plans. IMRT optimization and/or multileaf collimator (MLC) sequencing schemes can be modified to generate more efficient treatment delivery, but the effect those modifications have on plan quality is often difficult to quantify. In this work, the authors present a method for quantitative assessment of overall plan quality degradation due to tradeoffs between delivery efficiency and treatment plan quality, illustrated using comparisons between plans developed allowing different numbers of intensity levels in IMRT optimization and/or MLC sequencing for static segmental MLC IMRT plans. Methods: A plan quality degradation method to evaluate delivery efficiency and plan quality tradeoffs was developed and used to assess planning for 14 prostate and 12 head and neck patients treated with static IMRT. Plan quality was evaluated using a physician's predetermined “quality degradation” factors for relevant clinical plan metrics associated with the plan optimization strategy. Delivery efficiency and plan quality were assessed for a range of optimization and sequencing limitations. The “optimal” (baseline) plan for each case was derived using a clinical cost function with an unlimited number of intensity levels. These plans were sequenced with a clinical MLC leaf sequencer which uses >100 segments, assuring delivered intensities to be within 1% of the optimized intensity pattern. Each patient's optimal plan was also sequenced limiting the number of intensity levels (20, 10, and 5), and then separately optimized with these same numbers of intensity levels. Delivery time was measured for all plans, and direct evaluation of the tradeoffs between delivery time and plan degradation was performed. Results: When considering tradeoffs, the optimal number of intensity levels depends on the treatment site and on the stage in the process at which the levels are limited. The cost of improved delivery efficiency, in terms of plan quality degradation, increased as the number of intensity levels in the sequencer or optimizer decreased. The degradation was more substantial for the head and neck cases relative to the prostate cases, particularly when fewer than 20 intensity levels were used. Plan quality degradation was less severe when the number of intensity levels was limited in the optimizer rather than the sequencer. Conclusions: Analysis of plan quality degradation allows for a quantitative assessment of the compromises in clinical plan quality as delivery efficiency is improved, in order to determine the optimal delivery settings. The technique is based on physician-determined quality degradation factors and can be extended to other clinical situations where investigation of various tradeoffs is warranted. PMID:23822412
Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

PubMed

Yin, Zekun; Lan, Haidong; Tan, Guangming; Lu, Mian; Vasilakos, Athanasios V; Liu, Weiguo

2017-01-01

The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the biological data amount is becoming so great that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in life sciences. As a result, both biologists and computer scientists are facing the challenge of gaining a profound insight into the deepest biological functions from big biological data. This in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are highly needed as well as efficient and scalable algorithms that can take advantage of these platforms. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of the way they have been mapped onto various computing platforms. After that, we present a case study to compare the efficiency of different computing platforms for handling the classical biological sequence alignment problem. At last we discuss the open issues in big biological data analytics.
Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data

PubMed Central

Althammer, Sonja; González-Vallinas, Juan; Ballaré, Cecilia; Beato, Miguel; Eyras, Eduardo

2011-01-01

Motivation: High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein–DNA and protein–RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing. Results: We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq, either punctuated or broad signals, CLIP-Seq and RNA-Seq. We prove the effectiveness of Pyicos to select for significant signals and show that its accuracy is comparable and sometimes superior to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS datatypes through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics. Availability: Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy Contact: eduardo.eyras@upf.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:21994224
Laser mass spectrometry for DNA sequencing, disease diagnosis, and fingerprinting

NASA Astrophysics Data System (ADS)

Chen, C. H. Winston; Taranenko, N. I.; Zhu, Y. F.; Chung, C. N.; Allman, S. L.

1997-05-01

Since laser mass spectrometry has the potential for achieving very fast DNA analysis, we recently applied it to DNA sequencing, DNA typing for fingerprinting, and DNA screening for disease diagnosis. Two different approaches for sequencing DNA have been successfully demonstrated. One is to sequence DNA with DNA ladders produced from Sanger's enzymatic method. The other is to do direct sequencing without DNA ladders. The need for quick DNA typing for identification purposes is critical for forensic application. Our preliminary results indicate laser mass spectrometry can possible be used for rapid DNA fingerprinting applications at a much lower cost than gel electrophoresis. Population screening for certain genetic disease can be a very efficient step to reducing medical costs through prevention. Since laser mass spectrometry can provide very fast DNA analysis, we applied laser mass spectrometry to disease diagnosis. Clinical samples with both base deletion and point mutation have been tested with complete success.
Genetic diversity assessment of anoxygenic photosynthetic bacteria by distance-based grouping analysis of pufM sequences.

PubMed

Zeng, Y H; Chen, X H; Jiao, N Z

2007-12-01

To assess how completely the diversity of anoxygenic phototrophic bacteria (APB) was sampled in natural environments. All nucleotide sequences of the APB marker gene pufM from cultures and environmental clones were retrieved from the GenBank database. A set of cutoff values (sequence distances 0.06, 0.15 and 0.48 for species, genus, and (sub)phylum levels, respectively) was established using a distance-based grouping program. Analysis of the environmental clones revealed that current efforts on APB isolation and sampling in natural environments are largely inadequate. Analysis of the average distance between each identified genus and an uncultured environmental pufM sequence indicated that the majority of cultured APB genera lack environmental representatives. The distance-based grouping method is fast and efficient for bulk functional gene sequences analysis. The results clearly show that we are at a relatively early stage in sampling the global richness of APB species. Periodical assessment will undoubtedly facilitate in-depth analysis of potential biogeographical distribution pattern of APB. This is the first attempt to assess the present understanding of APB diversity in natural environments. The method used is also useful for assessing the diversity of other functional genes.
Mobile Genome Express (MGE): A comprehensive automatic genetic analyses pipeline with a mobile device.

PubMed

Yoon, Jun-Hee; Kim, Thomas W; Mendez, Pedro; Jablons, David M; Kim, Il-Jin

2017-01-01

The development of next-generation sequencing (NGS) technology allows to sequence whole exomes or genome. However, data analysis is still the biggest bottleneck for its wide implementation. Most laboratories still depend on manual procedures for data handling and analyses, which translates into a delay and decreased efficiency in the delivery of NGS results to doctors and patients. Thus, there is high demand for developing an automatic and an easy-to-use NGS data analyses system. We developed comprehensive, automatic genetic analyses controller named Mobile Genome Express (MGE) that works in smartphones or other mobile devices. MGE can handle all the steps for genetic analyses, such as: sample information submission, sequencing run quality check from the sequencer, secured data transfer and results review. We sequenced an Actrometrix control DNA containing multiple proven human mutations using a targeted sequencing panel, and the whole analysis was managed by MGE, and its data reviewing program called ELECTRO. All steps were processed automatically except for the final sequencing review procedure with ELECTRO to confirm mutations. The data analysis process was completed within several hours. We confirmed the mutations that we have identified were consistent with our previous results obtained by using multi-step, manual pipelines.
Identification of sequence motifs significantly associated with antisense activity.

PubMed

McQuisten, Kyle A; Peek, Andrew S

2007-06-07

Predicting the suppression activity of antisense oligonucleotide sequences is the main goal of the rational design of nucleic acids. To create an effective predictive model, it is important to know what properties of an oligonucleotide sequence associate significantly with antisense activity. Also, for the model to be efficient we must know what properties do not associate significantly and can be omitted from the model. This paper will discuss the results of a randomization procedure to find motifs that associate significantly with either high or low antisense suppression activity, analysis of their properties, as well as the results of support vector machine modelling using these significant motifs as features. We discovered 155 motifs that associate significantly with high antisense suppression activity and 202 motifs that associate significantly with low suppression activity. The motifs range in length from 2 to 5 bases, contain several motifs that have been previously discovered as associating highly with antisense activity, and have thermodynamic properties consistent with previous work associating thermodynamic properties of sequences with their antisense activity. Statistical analysis revealed no correlation between a motif's position within an antisense sequence and that sequences antisense activity. Also, many significant motifs existed as subwords of other significant motifs. Support vector regression experiments indicated that the feature set of significant motifs increased correlation compared to all possible motifs as well as several subsets of the significant motifs. The thermodynamic properties of the significantly associated motifs support existing data correlating the thermodynamic properties of the antisense oligonucleotide with antisense efficiency, reinforcing our hypothesis that antisense suppression is strongly associated with probe/target thermodynamics, as there are no enzymatic mediators to speed the process along like the RNA Induced Silencing Complex (RISC) in RNAi. The independence of motif position and antisense activity also allows us to bypass consideration of this feature in the modelling process, promoting model efficiency and reducing the chance of overfitting when predicting antisense activity. The increase in SVR correlation with significant features compared to nearest-neighbour features indicates that thermodynamics alone is likely not the only factor in determining antisense efficiency.
SeqHBase: a big data toolset for family based sequencing data analysis.

PubMed

He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai

2015-04-01

Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Structurally detailed coarse-grained model for Sec-facilitated co-translational protein translocation and membrane integration

PubMed Central

Miller, Thomas F.

2017-01-01

We present a coarse-grained simulation model that is capable of simulating the minute-timescale dynamics of protein translocation and membrane integration via the Sec translocon, while retaining sufficient chemical and structural detail to capture many of the sequence-specific interactions that drive these processes. The model includes accurate geometric representations of the ribosome and Sec translocon, obtained directly from experimental structures, and interactions parameterized from nearly 200 μs of residue-based coarse-grained molecular dynamics simulations. A protocol for mapping amino-acid sequences to coarse-grained beads enables the direct simulation of trajectories for the co-translational insertion of arbitrary polypeptide sequences into the Sec translocon. The model reproduces experimentally observed features of membrane protein integration, including the efficiency with which polypeptide domains integrate into the membrane, the variation in integration efficiency upon single amino-acid mutations, and the orientation of transmembrane domains. The central advantage of the model is that it connects sequence-level protein features to biological observables and timescales, enabling direct simulation for the mechanistic analysis of co-translational integration and for the engineering of membrane proteins with enhanced membrane integration efficiency. PMID:28328943
Analysis of delay reducing and fuel saving sequencing and spacing algorithms for arrival traffic

NASA Technical Reports Server (NTRS)

Neuman, Frank; Erzberger, Heinz

1991-01-01

The air traffic control subsystem that performs sequencing and spacing is discussed. The function of the sequencing and spacing algorithms is to automatically plan the most efficient landing order and to assign optimally spaced landing times to all arrivals. Several algorithms are described and their statistical performance is examined. Sequencing brings order to an arrival sequence for aircraft. First-come-first-served sequencing (FCFS) establishes a fair order, based on estimated times of arrival, and determines proper separations. Because of the randomness of the arriving traffic, gaps will remain in the sequence of aircraft. Delays are reduced by time-advancing the leading aircraft of each group while still preserving the FCFS order. Tightly spaced groups of aircraft remain with a mix of heavy and large aircraft. Spacing requirements differ for different types of aircraft trailing each other. Traffic is reordered slightly to take advantage of this spacing criterion, thus shortening the groups and reducing average delays. For heavy traffic, delays for different traffic samples vary widely, even when the same set of statistical parameters is used to produce each sample. This report supersedes NASA TM-102795 on the same subject. It includes a new method of time-advance as well as an efficient method of sequencing and spacing for two dependent runways.
Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation.

PubMed

Ralph, Duncan K; Matsen, Frederick A

2016-01-01

VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM "factorization" strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM.
Identification of sequence motifs in oligonucleotides whose presence is correlated with antisense activity

PubMed Central

Matveeva, O. V.; Tsodikov, A. D.; Giddings, M.; Freier, S. M.; Wyatt, J. R.; Spiridonov, A. N.; Shabalina, S. A.; Gesteland, R. F.; Atkins, J. F.

2000-01-01

Design of antisense oligonucleotides targeting any mRNA can be much more efficient when several activity-enhancing motifs are included and activity-decreasing motifs are avoided. This conclusion was made after statistical analysis of data collected from >1000 experiments with phosphorothioate-modified oligonucleotides. Highly significant positive correlation between the presence of motifs CCAC, TCCC, ACTC, GCCA and CTCT in the oligonucleotide and its antisense efficiency was demonstrated. In addition, negative correlation was revealed for the motifs GGGG, ACTG, AAA and TAA. It was found that the likelihood of activity of an oligonucleotide against a desired mRNA target is sequence motif content dependent. PMID:10908347
Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners

PubMed Central

Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea

2014-01-01

In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code. PMID:24663061
Maximizing mutagenesis with solubilized CRISPR-Cas9 ribonucleoprotein complexes.

PubMed

Burger, Alexa; Lindsay, Helen; Felker, Anastasia; Hess, Christopher; Anders, Carolin; Chiavacci, Elena; Zaugg, Jonas; Weber, Lukas M; Catena, Raul; Jinek, Martin; Robinson, Mark D; Mosimann, Christian

2016-06-01

CRISPR-Cas9 enables efficient sequence-specific mutagenesis for creating somatic or germline mutants of model organisms. Key constraints in vivo remain the expression and delivery of active Cas9-sgRNA ribonucleoprotein complexes (RNPs) with minimal toxicity, variable mutagenesis efficiencies depending on targeting sequence, and high mutation mosaicism. Here, we apply in vitro assembled, fluorescent Cas9-sgRNA RNPs in solubilizing salt solution to achieve maximal mutagenesis efficiency in zebrafish embryos. MiSeq-based sequence analysis of targeted loci in individual embryos using CrispRVariants, a customized software tool for mutagenesis quantification and visualization, reveals efficient bi-allelic mutagenesis that reaches saturation at several tested gene loci. Such virtually complete mutagenesis exposes loss-of-function phenotypes for candidate genes in somatic mutant embryos for subsequent generation of stable germline mutants. We further show that targeting of non-coding elements in gene regulatory regions using saturating mutagenesis uncovers functional control elements in transgenic reporters and endogenous genes in injected embryos. Our results establish that optimally solubilized, in vitro assembled fluorescent Cas9-sgRNA RNPs provide a reproducible reagent for direct and scalable loss-of-function studies and applications beyond zebrafish experiments that require maximal DNA cutting efficiency in vivo. © 2016. Published by The Company of Biologists Ltd.
Phase stability analysis of chirp evoked auditory brainstem responses by Gabor frame operators.

PubMed

Corona-Strauss, Farah I; Delb, Wolfgang; Schick, Bernhard; Strauss, Daniel J

2009-12-01

We have recently shown that click evoked auditory brainstem responses (ABRs) can be efficiently processed using a novelty detection paradigm. Here, ABRs as a large-scale reflection of a stimulus locked neuronal group synchronization at the brainstem level are detected as novel instance-novel as compared to the spontaneous activity which does not exhibit a regular stimulus locked synchronization. In this paper we propose for the first time Gabor frame operators as an efficient feature extraction technique for ABR single sweep sequences that is in line with this paradigm. In particular, we use this decomposition technique to derive the Gabor frame phase stability (GFPS) of sweep sequences of click and chirp evoked ABRs. We show that the GFPS of chirp evoked ABRs provides a stable discrimination of the spontaneous activity from stimulations above the hearing threshold with a small number of sweeps, even at low stimulation intensities. It is concluded that the GFPS analysis represents a robust feature extraction method for ABR single sweep sequences. Further studies are necessary to evaluate the value of the presented approach for clinical applications.
Individual sequences in large sets of gene sequences may be distinguished efficiently by combinations of shared sub-sequences

PubMed Central

Gibbs, Mark J; Armstrong, John S; Gibbs, Adrian J

2005-01-01

Background Most current DNA diagnostic tests for identifying organisms use specific oligonucleotide probes that are complementary in sequence to, and hence only hybridise with the DNA of one target species. By contrast, in traditional taxonomy, specimens are usually identified by 'dichotomous keys' that use combinations of characters shared by different members of the target set. Using one specific character for each target is the least efficient strategy for identification. Using combinations of shared bisectionally-distributed characters is much more efficient, and this strategy is most efficient when they separate the targets in a progressively binary way. Results We have developed a practical method for finding minimal sets of sub-sequences that identify individual sequences, and could be targeted by combinations of probes, so that the efficient strategy of traditional taxonomic identification could be used in DNA diagnosis. The sizes of minimal sub-sequence sets depended mostly on sequence diversity and sub-sequence length and interactions between these parameters. We found that 201 distinct cytochrome oxidase subunit-1 (CO1) genes from moths (Lepidoptera) were distinguished using only 15 sub-sequences 20 nucleotides long, whereas only 8–10 sub-sequences 6–10 nucleotides long were required to distinguish the CO1 genes of 92 species from the 9 largest orders of insects. Conclusion The presence/absence of sub-sequences in a set of gene sequences can be used like the questions in a traditional dichotomous taxonomic key; hybridisation probes complementary to such sub-sequences should provide a very efficient means for identifying individual species, subtypes or genotypes. Sequence diversity and sub-sequence length are the major factors that determine the numbers of distinguishing sub-sequences in any set of sequences. PMID:15817134
Optimizing symmetry-based recoupling sequences in solid-state NMR by pulse-transient compensation and asynchronous implementation

NASA Astrophysics Data System (ADS)

Hellwagner, Johannes; Sharma, Kshama; Tan, Kong Ooi; Wittmann, Johannes J.; Meier, Beat H.; Madhu, P. K.; Ernst, Matthias

2017-06-01

Pulse imperfections like pulse transients and radio-frequency field maladjustment or inhomogeneity are the main sources of performance degradation and limited reproducibility in solid-state nuclear magnetic resonance experiments. We quantitatively analyze the influence of such imperfections on the performance of symmetry-based pulse sequences and describe how they can be compensated. Based on a triple-mode Floquet analysis, we develop a theoretical description of symmetry-based dipolar recoupling sequences, in particular, R2 6411, calculating first- and second-order effective Hamiltonians using real pulse shapes. We discuss the various origins of effective fields, namely, pulse transients, deviation from the ideal flip angle, and fictitious fields, and develop strategies to counteract them for the restoration of full transfer efficiency. We compare experimental applications of transient-compensated pulses and an asynchronous implementation of the sequence to a supercycle, SR26, which is known to be efficient in compensating higher-order error terms. We are able to show the superiority of R26 compared to the supercycle, SR26, given the ability to reduce experimental error on the pulse sequence by pulse-transient compensation and a complete theoretical understanding of the sequence.
Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services

PubMed Central

Madduri, Ravi K.; Sulakhe, Dinanath; Lacinski, Lukasz; Liu, Bo; Rodriguez, Alex; Chard, Kyle; Dave, Utpal J.; Foster, Ian T.

2014-01-01

We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads. PMID:25342933
Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services.

PubMed

Madduri, Ravi K; Sulakhe, Dinanath; Lacinski, Lukasz; Liu, Bo; Rodriguez, Alex; Chard, Kyle; Dave, Utpal J; Foster, Ian T

2014-09-10

We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.
Efficient and universal quantum key distribution based on chaos and middleware

NASA Astrophysics Data System (ADS)

Jiang, Dong; Chen, Yuanyuan; Gu, Xuemei; Xie, Ling; Chen, Lijun

2017-01-01

Quantum key distribution (QKD) promises unconditionally secure communications, however, the low bit rate of QKD cannot meet the requirements of high-speed applications. Despite the many solutions that have been proposed in recent years, they are neither efficient to generate the secret keys nor compatible with other QKD systems. This paper, based on chaotic cryptography and middleware technology, proposes an efficient and universal QKD protocol that can be directly deployed on top of any existing QKD system without modifying the underlying QKD protocol and optical platform. It initially takes the bit string generated by the QKD system as input, periodically updates the chaotic system, and efficiently outputs the bit sequences. Theoretical analysis and simulation results demonstrate that our protocol can efficiently increase the bit rate of the QKD system as well as securely generate bit sequences with perfect statistical properties. Compared with the existing methods, our protocol is more efficient and universal, it can be rapidly deployed on the QKD system to increase the bit rate when the QKD system becomes the bottleneck of its communication system.

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

PubMed

Wan, Shixiang; Zou, Quan

2017-01-01

Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. Extreme increase in next-generation sequencing results in shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. The experiments in the DNA and protein large scale data sets, which are more than 1GB files, showed that HAlign II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resource. THAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.
Shotgun metagenomic data streams: surfing without fear

DOE Office of Scientific and Technical Information (OSTI.GOV)

Berendzen, Joel R

2010-12-06

Timely information about bio-threat prevalence, consequence, propagation, attribution, and mitigation is needed to support decision-making, both routinely and in a crisis. One DNA sequencer can stream 25 Gbp of information per day, but sampling strategies and analysis techniques are needed to turn raw sequencing power into actionable knowledge. Shotgun metagenomics can enable biosurveillance at the level of a single city, hospital, or airplane. Metagenomics characterizes viruses and bacteria from complex environments such as soil, air filters, or sewage. Unlike targeted-primer-based sequencing, shotgun methods are not blind to sequences that are truly novel, and they can measure absolute prevalence. Shotgun metagenomicmore » sampling can be non-invasive, efficient, and inexpensive while being informative. We have developed analysis techniques for shotgun metagenomic sequencing that rely upon phylogenetic signature patterns. They work by indexing local sequence patterns in a manner similar to web search engines. Our methods are laptop-fast and favorable scaling properties ensure they will be sustainable as sequencing methods grow. We show examples of application to soil metagenomic samples.« less
Sarment: Python modules for HMM analysis and partitioning of sequences.

PubMed

Guéguen, Laurent

2005-08-15

Sarment is a package of Python modules for easy building and manipulation of sequence segmentations. It provides efficient implementation of usual algorithms for hidden Markov Model computation, as well as for maximal predictive partitioning. Owing to its very large variety of criteria for computing segmentations, Sarment can handle many kinds of models. Because of object-oriented programming, the results of the segmentation are very easy tomanipulate.
India Allele Finder: a web-based annotation tool for identifying common alleles in next-generation sequencing data of Indian origin.

PubMed

Zhang, Jimmy F; James, Francis; Shukla, Anju; Girisha, Katta M; Paciorkowski, Alex R

2017-06-27

We built India Allele Finder, an online searchable database and command line tool, that gives researchers access to variant frequencies of Indian Telugu individuals, using publicly available fastq data from the 1000 Genomes Project. Access to appropriate population-based genomic variant annotation can accelerate the interpretation of genomic sequencing data. In particular, exome analysis of individuals of Indian descent will identify population variants not reflected in European exomes, complicating genomic analysis for such individuals. India Allele Finder offers improved ease-of-use to investigators seeking to identify and annotate sequencing data from Indian populations. We describe the use of India Allele Finder to identify common population variants in a disease quartet whole exome dataset, reducing the number of candidate single nucleotide variants from 84 to 7. India Allele Finder is freely available to investigators to annotate genomic sequencing data from Indian populations. Use of India Allele Finder allows efficient identification of population variants in genomic sequencing data, and is an example of a population-specific annotation tool that simplifies analysis and encourages international collaboration in genomics research.
Computational analysis of stochastic heterogeneity in PCR amplification efficiency revealed by single molecule barcoding

PubMed Central

Best, Katharine; Oakes, Theres; Heather, James M.; Shawe-Taylor, John; Chain, Benny

2015-01-01

The polymerase chain reaction (PCR) is one of the most widely used techniques in molecular biology. In combination with High Throughput Sequencing (HTS), PCR is widely used to quantify transcript abundance for RNA-seq, and in the context of analysis of T and B cell receptor repertoires. In this study, we combine DNA barcoding with HTS to quantify PCR output from individual target molecules. We develop computational tools that simulate both the PCR branching process itself, and the subsequent subsampling which typically occurs during HTS sequencing. We explore the influence of different types of heterogeneity on sequencing output, and compare them to experimental results where the efficiency of amplification is measured by barcodes uniquely identifying each molecule of starting template. Our results demonstrate that the PCR process introduces substantial amplification heterogeneity, independent of primer sequence and bulk experimental conditions. This heterogeneity can be attributed both to inherited differences between different template DNA molecules, and the inherent stochasticity of the PCR process. The results demonstrate that PCR heterogeneity arises even when reaction and substrate conditions are kept as constant as possible, and therefore single molecule barcoding is essential in order to derive reproducible quantitative results from any protocol combining PCR with HTS. PMID:26459131
GAMES identifies and annotates mutations in next-generation sequencing projects.

PubMed

Sana, Maria Elena; Iascone, Maria; Marchetti, Daniela; Palatini, Jeff; Galasso, Marco; Volinia, Stefano

2011-01-01

Next-generation sequencing (NGS) methods have the potential for changing the landscape of biomedical science, but at the same time pose several problems in analysis and interpretation. Currently, there are many commercial and public software packages that analyze NGS data. However, the limitations of these applications include output which is insufficiently annotated and of difficult functional comprehension to end users. We developed GAMES (Genomic Analysis of Mutations Extracted by Sequencing), a pipeline aiming to serve as an efficient middleman between data deluge and investigators. GAMES attains multiple levels of filtering and annotation, such as aligning the reads to a reference genome, performing quality control and mutational analysis, integrating results with genome annotations and sorting each mismatch/deletion according to a range of parameters. Variations are matched to known polymorphisms. The prediction of functional mutations is achieved by using different approaches. Overall GAMES enables an effective complexity reduction in large-scale DNA-sequencing projects. GAMES is available free of charge to academic users and may be obtained from http://aqua.unife.it/GAMES.
[Hydrophidae identification through analysis on Cyt b gene barcode].

PubMed

Liao, Li-xi; Zeng, Ke-wu; Tu, Peng-fei

2015-08-01

Hydrophidae, one of the precious traditional Chinese medicines, is generally drily preserved to prevent corruption, but it is hard to identify the species of Hydrophidae through the appearance because of the change due to the drying process. The identification through analysis on gene barcode, a new technique in species identification, can avoid the problem. The gene barcodes of the 6 species of Hydrophidae like Lapemis hardwickii were aquired through DNA extraction and gene sequencing. These barcodes were then in sequence alignment and test the identification efficency by BLAST. Our results revealed that the barcode sequences performed high identification efficiency, and had obvious difference between intra- and inter-species. These all indicated that Cyt b DNA barcoding can confirm the Hydrophidae identification.
Application of resequencing to rice genomics, functional genomics and evolutionary analysis

PubMed Central

2014-01-01

Rice is a model system used for crop genomics studies. The completion of the rice genome draft sequences in 2002 not only accelerated functional genome studies, but also initiated a new era of resequencing rice genomes. Based on the reference genome in rice, next-generation sequencing (NGS) using the high-throughput sequencing system can efficiently accomplish whole genome resequencing of various genetic populations and diverse germplasm resources. Resequencing technology has been effectively utilized in evolutionary analysis, rice genomics and functional genomics studies. This technique is beneficial for both bridging the knowledge gap between genotype and phenotype and facilitating molecular breeding via gene design in rice. Here, we also discuss the limitation, application and future prospects of rice resequencing. PMID:25006357
High throughput SNP discovery and genotyping in grapevine (Vitis vinifera L.) by combining a re-sequencing approach and SNPlex technology

PubMed Central

Lijavetzky, Diego; Cabezas, José Antonio; Ibáñez, Ana; Rodríguez, Virginia; Martínez-Zapater, José M

2007-01-01

Background Single-nucleotide polymorphisms (SNPs) are the most abundant type of DNA sequence polymorphisms. Their higher availability and stability when compared to simple sequence repeats (SSRs) provide enhanced possibilities for genetic and breeding applications such as cultivar identification, construction of genetic maps, the assessment of genetic diversity, the detection of genotype/phenotype associations, or marker-assisted breeding. In addition, the efficiency of these activities can be improved thanks to the ease with which SNP genotyping can be automated. Expressed sequence tags (EST) sequencing projects in grapevine are allowing for the in silico detection of multiple putative sequence polymorphisms within and among a reduced number of cultivars. In parallel, the sequence of the grapevine cultivar Pinot Noir is also providing thousands of polymorphisms present in this highly heterozygous genome. Still the general application of those SNPs requires further validation since their use could be restricted to those specific genotypes. Results In order to develop a large SNP set of wide application in grapevine we followed a systematic re-sequencing approach in a group of 11 grape genotypes corresponding to ancient unrelated cultivars as well as wild plants. Using this approach, we have sequenced 230 gene fragments, what represents the analysis of over 1 Mb of grape DNA sequence. This analysis has allowed the discovery of 1573 SNPs with an average of one SNP every 64 bp (one SNP every 47 bp in non-coding regions and every 69 bp in coding regions). Nucleotide diversity in grape (π = 0.0051) was found to be similar to values observed in highly polymorphic plant species such as maize. The average number of haplotypes per gene sequence was estimated as six, with three haplotypes representing over 83% of the analyzed sequences. Short-range linkage disequilibrium (LD) studies within the analyzed sequences indicate the existence of a rapid decay of LD within the selected grapevine genotypes. To validate the use of the detected polymorphisms in genetic mapping, cultivar identification and genetic diversity studies we have used the SNPlex™ genotyping technology in a sample of grapevine genotypes and segregating progenies. Conclusion These results provide accurate values for nucleotide diversity in coding sequences and a first estimate of short-range LD in grapevine. Using SNPlex™ genotyping we have shown the application of a set of discovered SNPs as molecular markers for cultivar identification, linkage mapping and genetic diversity studies. Thus, the combination a highly efficient re-sequencing approach and the SNPlex™ high throughput genotyping technology provide a powerful tool for grapevine genetic analysis. PMID:18021442
What can we learn about lyssavirus genomes using 454 sequencing?

PubMed

Höper, Dirk; Finke, Stefan; Freuling, Conrad M; Hoffmann, Bernd; Beer, Martin

2012-01-01

The main task of the individual project number four"Whole genome sequencing, virus-host adaptation, and molecular epidemiological analyses of lyssaviruses "within the network" Lyssaviruses--a potential re-emerging public health threat" is to provide high quality complete genome sequences from lyssaviruses. These sequences are analysed in-depth with regard to the diversity of the viral populations as to both quasi-species and so-called defective interfering RNAs. Moreover, the sequence data will facilitate further epidemiological analyses, will provide insight into the evolution of lyssaviruses and will be the basis for the design of novel nucleic acid based diagnostics. The first results presented here indicate that not only high quality full-length lyssavirus genome sequences can be generated, but indeed efficient analysis of the viral population gets feasible.
Typing of canine parvovirus isolates using mini-sequencing based single nucleotide polymorphism analysis.

PubMed

Naidu, Hariprasad; Subramanian, B Mohana; Chinchkar, Shankar Ramchandra; Sriraman, Rajan; Rana, Samir Kumar; Srinivasan, V A

2012-05-01

The antigenic types of canine parvovirus (CPV) are defined based on differences in the amino acids of the major capsid protein VP2. Type specificity is conferred by a limited number of amino acid changes and in particular by few nucleotide substitutions. PCR based methods are not particularly suitable for typing circulating variants which differ in a few specific nucleotide substitutions. Assays for determining SNPs can detect efficiently nucleotide substitutions and can thus be adapted to identify CPV types. In the present study, CPV typing was performed by single nucleotide extension using the mini-sequencing technique. A mini-sequencing signature was established for all the four CPV types (CPV2, 2a, 2b and 2c) and feline panleukopenia virus. The CPV typing using the mini-sequencing reaction was performed for 13 CPV field isolates and the two vaccine strains available in our repository. All the isolates had been typed earlier by full-length sequencing of the VP2 gene. The typing results obtained from mini-sequencing matched completely with that of sequencing. Typing could be achieved with less than 100 copies of standard plasmid DNA constructs or ≤10¹ FAID₅₀ of virus by mini-sequencing technique. The technique was also efficient for detecting multiple types in mixed infections. Copyright © 2012 Elsevier B.V. All rights reserved.
SeqCompress: an algorithm for biological sequence compression.

PubMed

Sardaraz, Muhammad; Tahir, Muhammad; Ikram, Ataul Aziz; Bajwa, Hassan

2014-10-01

The growth of Next Generation Sequencing technologies presents significant research challenges, specifically to design bioinformatics tools that handle massive amount of data efficiently. Biological sequence data storage cost has become a noticeable proportion of total cost in the generation and analysis. Particularly increase in DNA sequencing rate is significantly outstripping the rate of increase in disk storage capacity, which may go beyond the limit of storage capacity. It is essential to develop algorithms that handle large data sets via better memory management. This article presents a DNA sequence compression algorithm SeqCompress that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses statistical model as well as arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that proposed algorithm has better compression gain as compared to other existing algorithms. Copyright © 2014 Elsevier Inc. All rights reserved.
N-Terminal Amino Acid Sequence Determination of Proteins by N-Terminal Dimethyl Labeling: Pitfalls and Advantages When Compared with Edman Degradation Sequence Analysis.

PubMed

Chang, Elizabeth; Pourmal, Sergei; Zhou, Chun; Kumar, Rupesh; Teplova, Marianna; Pavletich, Nikola P; Marians, Kenneth J; Erdjument-Bromage, Hediye

2016-07-01

In recent history, alternative approaches to Edman sequencing have been investigated, and to this end, the Association of Biomolecular Resource Facilities (ABRF) Protein Sequencing Research Group (PSRG) initiated studies in 2014 and 2015, looking into bottom-up and top-down N-terminal (Nt) dimethyl derivatization of standard quantities of intact proteins with the aim to determine Nt sequence information. We have expanded this initiative and used low picomole amounts of myoglobin to determine the efficiency of Nt-dimethylation. Application of this approach on protein domains, generated by limited proteolysis of overexpressed proteins, confirms that it is a universal labeling technique and is very sensitive when compared with Edman sequencing. Finally, we compared Edman sequencing and Nt-dimethylation of the same polypeptide fragments; results confirm that there is agreement in the identity of the Nt amino acid sequence between these 2 methods.
Adaptation of the Haloarcula hispanica CRISPR-Cas system to a purified virus strictly requires a priming process

PubMed Central

Li, Ming; Wang, Rui; Zhao, Dahe; Xiang, Hua

2014-01-01

The clustered regularly interspaced short palindromic repeat (CRISPR)-Cas system mediates adaptive immunity against foreign nucleic acids in prokaryotes. However, efficient adaptation of a native CRISPR to purified viruses has only been observed for the type II-A system from a Streptococcus thermophilus industry strain, and rarely reported for laboratory strains. Here, we provide a second native system showing efficient adaptation. Infected by a newly isolated virus HHPV-2, Haloarcula hispanica type I-B CRISPR system acquired spacers discriminatively from viral sequences. Unexpectedly, in addition to Cas1, Cas2 and Cas4, this process also requires Cas3 and at least partial Cascade proteins, which are involved in interference and/or CRISPR RNA maturation. Intriguingly, a preexisting spacer partially matching a viral sequence is also required, and spacer acquisition from upstream and downstream sequences of its target sequence (i.e. priming protospacer) shows different strand bias. These evidences strongly indicate that adaptation in this system strictly requires a priming process. This requirement, if validated also true for other CRISPR systems as implied by our bioinformatic analysis, may help to explain failures to observe efficient adaptation to purified viruses in many laboratory strains, and the discrimination mechanism at the adaptation level that has confused scientists for years. PMID:24265226
ESTimating plant phylogeny: lessons from partitioning

PubMed Central

de la Torre, Jose EB; Egan, Mary G; Katari, Manpreet S; Brenner, Eric D; Stevenson, Dennis W; Coruzzi, Gloria M; DeSalle, Rob

2006-01-01

Background While Expressed Sequence Tags (ESTs) have proven a viable and efficient way to sample genomes, particularly those for which whole-genome sequencing is impractical, phylogenetic analysis using ESTs remains difficult. Sequencing errors and orthology determination are the major problems when using ESTs as a source of characters for systematics. Here we develop methods to incorporate EST sequence information in a simultaneous analysis framework to address controversial phylogenetic questions regarding the relationships among the major groups of seed plants. We use an automated, phylogenetically derived approach to orthology determination called OrthologID generate a phylogeny based on 43 process partitions, many of which are derived from ESTs, and examine several measures of support to assess the utility of EST data for phylogenies. Results A maximum parsimony (MP) analysis resulted in a single tree with relatively high support at all nodes in the tree despite rampant conflict among trees generated from the separate analysis of individual partitions. In a comparison of broader-scale groupings based on cellular compartment (ie: chloroplast, mitochondrial or nuclear) or function, only the nuclear partition tree (based largely on EST data) was found to be topologically identical to the tree based on the simultaneous analysis of all data. Despite topological conflict among the broader-scale groupings examined, only the tree based on morphological data showed statistically significant differences. Conclusion Based on the amount of character support contributed by EST data which make up a majority of the nuclear data set, and the lack of conflict of the nuclear data set with the simultaneous analysis tree, we conclude that the inclusion of EST data does provide a viable and efficient approach to address phylogenetic questions within a parsimony framework on a genomic scale, if problems of orthology determination and potential sequencing errors can be overcome. In addition, approaches that examine conflict and support in a simultaneous analysis framework allow for a more precise understanding of the evolutionary history of individual process partitions and may be a novel way to understand functional aspects of different kinds of cellular classes of gene products. PMID:16776834
A generalized analysis of hydrophobic and loop clusters within globular protein sequences

PubMed Central

Eudes, Richard; Le Tuan, Khanh; Delettré, Jean; Mornon, Jean-Paul; Callebaut, Isabelle

2007-01-01

Background Hydrophobic Cluster Analysis (HCA) is an efficient way to compare highly divergent sequences through the implicit secondary structure information directly derived from hydrophobic clusters. However, its efficiency and application are currently limited by the need of user expertise. In order to help the analysis of HCA plots, we report here the structural preferences of hydrophobic cluster species, which are frequently encountered in globular domains of proteins. These species are characterized only by their hydrophobic/non-hydrophobic dichotomy. This analysis has been extended to loop-forming clusters, using an appropriate loop alphabet. Results The structural behavior of hydrophobic cluster species, which are typical of protein globular domains, was investigated within banks of experimental structures, considered at different levels of sequence redundancy. The 294 more frequent hydrophobic cluster species were analyzed with regard to their association with the different secondary structures (frequencies of association with secondary structures and secondary structure propensities). Hydrophobic cluster species are predominantly associated with regular secondary structures, and a large part (60 %) reveals preferences for α-helices or β-strands. Moreover, the analysis of the hydrophobic cluster amino acid composition generally allows for finer prediction of the regular secondary structure associated with the considered cluster within a cluster species. We also investigated the behavior of loop forming clusters, using a "PGDNS" alphabet. These loop clusters do not overlap with hydrophobic clusters and are highly associated with coils. Finally, the structural information contained in the hydrophobic structural words, as deduced from experimental structures, was compared to the PSI-PRED predictions, revealing that β-strands and especially α-helices are generally over-predicted within the limits of typical β and α hydrophobic clusters. Conclusion The dictionary of hydrophobic clusters described here can help the HCA user to interpret and compare the HCA plots of globular protein sequences, as well as provides an original fundamental insight into the structural bricks of protein folds. Moreover, the novel loop cluster analysis brings additional information for secondary structure prediction on the whole sequence through a generalized cluster analysis (GCA), and not only on regular secondary structures. Such information lays the foundations for developing a new and original tool for secondary structure prediction. PMID:17210072
Top-down analysis of protein samples by de novo sequencing techniques

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.

MOTIVATION: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. RESULTS: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. Themore » former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.« less
Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software.

PubMed

Nakano, Shogo; Asano, Yasuhisa

2015-02-03

Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.
Protein evolution analysis of S-hydroxynitrile lyase by complete sequence design utilizing the INTMSAlign software

NASA Astrophysics Data System (ADS)

Nakano, Shogo; Asano, Yasuhisa

2015-02-01

Development of software and methods for design of complete sequences of functional proteins could contribute to studies of protein engineering and protein evolution. To this end, we developed the INTMSAlign software, and used it to design functional proteins and evaluate their usefulness. The software could assign both consensus and correlation residues of target proteins. We generated three protein sequences with S-selective hydroxynitrile lyase (S-HNL) activity, which we call designed S-HNLs; these proteins folded as efficiently as the native S-HNL. Sequence and biochemical analysis of the designed S-HNLs suggested that accumulation of neutral mutations occurs during the process of S-HNLs evolution from a low-activity form to a high-activity (native) form. Taken together, our results demonstrate that our software and the associated methods could be applied not only to design of complete sequences, but also to predictions of protein evolution, especially within families such as esterases and S-HNLs.
Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

PubMed Central

Kingsford, Carl

2017-01-01

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively. PMID:28968408

Robust Data Detection for the Photon-Counting Free-Space Optical System With Implicit CSI Acquisition and Background Radiation Compensation

NASA Astrophysics Data System (ADS)

Song, Tianyu; Kam, Pooi-Yuen

2016-02-01

Since atmospheric turbulence and pointing errors cause signal intensity fluctuations and the background radiation surrounding the free-space optical (FSO) receiver contributes an undesired noisy component, the receiver requires accurate channel state information (CSI) and background information to adjust the detection threshold. In most previous studies, for CSI acquisition, pilot symbols were employed, which leads to a reduction of spectral and energy efficiency; and an impractical assumption that the background radiation component is perfectly known was made. In this paper, we develop an efficient and robust sequence receiver, which acquires the CSI and the background information implicitly and requires no knowledge about the channel model information. It is robust since it can automatically estimate the CSI and background component and detect the data sequence accordingly. Its decision metric has a simple form and involves no integrals, and thus can be easily evaluated. A Viterbi-type trellis-search algorithm is adopted to improve the search efficiency, and a selective-store strategy is adopted to overcome a potential error floor problem as well as to increase the memory efficiency. To further simplify the receiver, a decision-feedback symbol-by-symbol receiver is proposed as an approximation of the sequence receiver. By simulations and theoretical analysis, we show that the performance of both the sequence receiver and the symbol-by-symbol receiver, approach that of detection with perfect knowledge of the CSI and background radiation, as the length of the window for forming the decision metric increases.
Quantification of Functionalised Gold Nanoparticle-Targeted Knockdown of Gene Expression in HeLa Cells

PubMed Central

Jiwaji, Meesbah; Sandison, Mairi E.; Reboud, Julien; Stevenson, Ross; Daly, Rónán; Barkess, Gráinne; Faulds, Karen; Kolch, Walter; Graham, Duncan; Girolami, Mark A.; Cooper, Jonathan M.; Pitt, Andrew R.

2014-01-01

Introduction Gene therapy continues to grow as an important area of research, primarily because of its potential in the treatment of disease. One significant area where there is a need for better understanding is in improving the efficiency of oligonucleotide delivery to the cell and indeed, following delivery, the characterization of the effects on the cell. Methods In this report, we compare different transfection reagents as delivery vehicles for gold nanoparticles functionalized with DNA oligonucleotides, and quantify their relative transfection efficiencies. The inhibitory properties of small interfering RNA (siRNA), single-stranded RNA (ssRNA) and single-stranded DNA (ssDNA) sequences targeted to human metallothionein hMT-IIa are also quantified in HeLa cells. Techniques used in this study include fluorescence and confocal microscopy, qPCR and Western analysis. Findings We show that the use of transfection reagents does significantly increase nanoparticle transfection efficiencies. Furthermore, siRNA, ssRNA and ssDNA sequences all have comparable inhibitory properties to ssDNA sequences immobilized onto gold nanoparticles. We also show that functionalized gold nanoparticles can co-localize with autophagosomes and illustrate other factors that can affect data collection and interpretation when performing studies with functionalized nanoparticles. Conclusions The desired outcome for biological knockdown studies is the efficient reduction of a specific target; which we demonstrate by using ssDNA inhibitory sequences targeted to human metallothionein IIa gene transcripts that result in the knockdown of both the mRNA transcript and the target protein. PMID:24926959
Effects of a Non-Conservative Sequence on the Properties of β-glucuronidase from Aspergillus terreus Li-20

PubMed Central

Liu, Yanli; Huangfu, Jie; Qi, Feng; Kaleem, Imdad; E, Wenwen; Li, Chun

2012-01-01

We cloned the β-glucuronidase gene (AtGUS) from Aspergillus terreus Li-20 encoding 657 amino acids (aa), which can transform glycyrrhizin into glycyrrhetinic acid monoglucuronide (GAMG) and glycyrrhetinic acid (GA). Based on sequence alignment, the C-terminal non-conservative sequence showed low identity with those of other species; thus, the partial sequence AtGUS(-3t) (1–592 aa) was amplified to determine the effects of the non-conservative sequence on the enzymatic properties. AtGUS and AtGUS(-3t) were expressed in E. coli BL21, producing AtGUS-E and AtGUS(-3t)-E, respectively. At the similar optimum temperature (55°C) and pH (AtGUS-E, 6.6; AtGUS(-3t)-E, 7.0) conditions, the thermal stability of AtGUS(-3t)-E was enhanced at 65°C, and the metal ions Co2+, Ca2+ and Ni2+ showed opposite effects on AtGUS-E and AtGUS(-3t)-E, respectively. Furthermore, Km of AtGUS(-3t)-E (1.95 mM) was just nearly one-seventh that of AtGUS-E (12.9 mM), whereas the catalytic efficiency of AtGUS(-3t)-E was 3.2 fold higher than that of AtGUS-E (7.16 vs. 2.24 mM s−1), revealing that the truncation of non-conservative sequence can significantly improve the catalytic efficiency of AtGUS. Conformational analysis illustrated significant difference in the secondary structure between AtGUS-E and AtGUS(-3t)-E by circular dichroism (CD). The results showed that the truncation of the non-conservative sequence could preferably alter and influence the stability and catalytic efficiency of enzyme. PMID:22347419
DAMe: a toolkit for the initial processing of datasets with PCR replicates of double-tagged amplicons for DNA metabarcoding analyses.

PubMed

Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P

2016-05-03

DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
US stock market efficiency over weekly, monthly, quarterly and yearly time scales

NASA Astrophysics Data System (ADS)

Rodriguez, E.; Aguilar-Cornejo, M.; Femat, R.; Alvarez-Ramirez, J.

2014-11-01

In financial markets, the weak form of the efficient market hypothesis implies that price returns are serially uncorrelated sequences. In other words, prices should follow a random walk behavior. Recent developments in evolutionary economic theory (Lo, 2004) have tailored the concept of adaptive market hypothesis (AMH) by proposing that market efficiency is not an all-or-none concept, but rather market efficiency is a characteristic that varies continuously over time and across markets. Within the AMH framework, this work considers the Dow Jones Index Average (DJIA) for studying the deviations from the random walk behavior over time. It is found that the market efficiency also varies over different time scales, from weeks to years. The well-known detrended fluctuation analysis was used for the characterization of the serial correlations of the return sequences. The results from the empirical showed that interday and intraday returns are more serially correlated than overnight returns. Also, some insights in the presence of business cycles (e.g., Juglar and Kuznets) are provided in terms of time variations of the scaling exponent.
Using SQL Databases for Sequence Similarity Searching and Analysis.

PubMed

Pearson, William R; Mackey, Aaron J

2017-09-13

Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.
Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma.

PubMed

Wrzeszczynski, Kazimierz O; Frank, Mayu O; Koyama, Takahiko; Rhrissorrakrai, Kahn; Robine, Nicolas; Utro, Filippo; Emde, Anne-Katrin; Chen, Bo-Juen; Arora, Kanika; Shah, Minita; Vacic, Vladimir; Norel, Raquel; Bilal, Erhan; Bergmann, Ewa A; Moore Vogel, Julia L; Bruce, Jeffrey N; Lassman, Andrew B; Canoll, Peter; Grommes, Christian; Harvey, Steve; Parida, Laxmi; Michelini, Vanessa V; Zody, Michael C; Jobanputra, Vaidehi; Royyuru, Ajay K; Darnell, Robert B

2017-08-01

To analyze a glioblastoma tumor specimen with 3 different platforms and compare potentially actionable calls from each. Tumor DNA was analyzed by a commercial targeted panel. In addition, tumor-normal DNA was analyzed by whole-genome sequencing (WGS) and tumor RNA was analyzed by RNA sequencing (RNA-seq). The WGS and RNA-seq data were analyzed by a team of bioinformaticians and cancer oncologists, and separately by IBM Watson Genomic Analytics (WGA), an automated system for prioritizing somatic variants and identifying drugs. More variants were identified by WGS/RNA analysis than by targeted panels. WGA completed a comparable analysis in a fraction of the time required by the human analysts. The development of an effective human-machine interface in the analysis of deep cancer genomic datasets may provide potentially clinically actionable calls for individual patients in a more timely and efficient manner than currently possible. NCT02725684.
LoRTE: Detecting transposon-induced genomic variants using low coverage PacBio long read sequences.

PubMed

Disdero, Eric; Filée, Jonathan

2017-01-01

Population genomic analysis of transposable elements has greatly benefited from recent advances of sequencing technologies. However, the short size of the reads and the propensity of transposable elements to nest in highly repeated regions of genomes limits the efficiency of bioinformatic tools when Illumina or 454 technologies are used. Fortunately, long read sequencing technologies generating read length that may span the entire length of full transposons are now available. However, existing TE population genomic softwares were not designed to handle long reads and the development of new dedicated tools is needed. LoRTE is the first tool able to use PacBio long read sequences to identify transposon deletions and insertions between a reference genome and genomes of different strains or populations. Tested against simulated and genuine Drosophila melanogaster PacBio datasets, LoRTE appears to be a reliable and broadly applicable tool to study the dynamic and evolutionary impact of transposable elements using low coverage, long read sequences. LoRTE is an efficient and accurate tool to identify structural genomic variants caused by TE insertion or deletion. LoRTE is available for download at http://www.egce.cnrs-gif.fr/?p=6422.
A Phylogenomic Perspective on the Radiation of Ray-Finned Fishes Based upon Targeted Sequencing of Ultraconserved Elements (UCEs)

PubMed Central

Sorenson, Laurie; Santini, Francesco

2013-01-01

Ray-finned fishes constitute the dominant radiation of vertebrates with over 32,000 species. Although molecular phylogenetics has begun to disentangle major evolutionary relationships within this vast section of the Tree of Life, there is no widely available approach for efficiently collecting phylogenomic data within fishes, leaving much of the enormous potential of massively parallel sequencing technologies for resolving major radiations in ray-finned fishes unrealized. Here, we provide a genomic perspective on longstanding questions regarding the diversification of major groups of ray-finned fishes through targeted enrichment of ultraconserved nuclear DNA elements (UCEs) and their flanking sequence. Our workflow efficiently and economically generates data sets that are orders of magnitude larger than those produced by traditional approaches and is well-suited to working with museum specimens. Analysis of the UCE data set recovers a well-supported phylogeny at both shallow and deep time-scales that supports a monophyletic relationship between Amia and Lepisosteus (Holostei) and reveals elopomorphs and then osteoglossomorphs to be the earliest diverging teleost lineages. Our approach additionally reveals that sequence capture of UCE regions and their flanking sequence offers enormous potential for resolving phylogenetic relationships within ray-finned fishes. PMID:23824177
The siRNA Non-seed Region and Its Target Sequences Are Auxiliary Determinants of Off-Target Effects.

PubMed

Kamola, Piotr J; Nakano, Yuko; Takahashi, Tomoko; Wilson, Paul A; Ui-Tei, Kumiko

2015-12-01

RNA interference (RNAi) is a powerful tool for post-transcriptional gene silencing. However, the siRNA guide strand may bind unintended off-target transcripts via partial sequence complementarity by a mechanism closely mirroring micro RNA (miRNA) silencing. To better understand these off-target effects, we investigated the correlation between sequence features within various subsections of siRNA guide strands, and its corresponding target sequences, with off-target activities. Our results confirm previous reports that strength of base-pairing in the siRNA seed region is the primary factor determining the efficiency of off-target silencing. However, the degree of downregulation of off-target transcripts with shared seed sequence is not necessarily similar, suggesting that there are additional auxiliary factors that influence the silencing potential. Here, we demonstrate that both the melting temperature (Tm) in a subsection of siRNA non-seed region, and the GC contents of its corresponding target sequences, are negatively correlated with the efficiency of off-target effect. Analysis of experimentally validated miRNA targets demonstrated a similar trend, indicating a putative conserved mechanistic feature of seed region-dependent targeting mechanism. These observations may prove useful as parameters for off-target prediction algorithms and improve siRNA 'specificity' design rules.
A Bioinformatic Pipeline for Monitoring of the Mutational Stability of Viral Drug Targets with Deep-Sequencing Technology.

PubMed

Kravatsky, Yuri; Chechetkin, Vladimir; Fedoseeva, Daria; Gorbacheva, Maria; Kravatskaya, Galina; Kretova, Olga; Tchurikov, Nickolai

2017-11-23

The efficient development of antiviral drugs, including efficient antiviral small interfering RNAs (siRNAs), requires continuous monitoring of the strict correspondence between a drug and the related highly variable viral DNA/RNA target(s). Deep sequencing is able to provide an assessment of both the general target conservation and the frequency of particular mutations in the different target sites. The aim of this study was to develop a reliable bioinformatic pipeline for the analysis of millions of short, deep sequencing reads corresponding to selected highly variable viral sequences that are drug target(s). The suggested bioinformatic pipeline combines the available programs and the ad hoc scripts based on an original algorithm of the search for the conserved targets in the deep sequencing data. We also present the statistical criteria for the threshold of reliable mutation detection and for the assessment of variations between corresponding data sets. These criteria are robust against the possible sequencing errors in the reads. As an example, the bioinformatic pipeline is applied to the study of the conservation of RNA interference (RNAi) targets in human immunodeficiency virus 1 (HIV-1) subtype A. The developed pipeline is freely available to download at the website http://virmut.eimb.ru/. Brief comments and comparisons between VirMut and other pipelines are also presented.
Whole genome sequencing: an efficient approach to ensuring food safety

NASA Astrophysics Data System (ADS)

Lakicevic, B.; Nastasijevic, I.; Dimitrijevic, M.

2017-09-01

Whole genome sequencing is an effective, powerful tool that can be applied to a wide range of public health and food safety applications. A major difference between WGS and the traditional typing techniques is that WGS allows all genes to be included in the analysis, instead of a well-defined subset of genes or variable intergenic regions. Also, the use of WGS can facilitate the understanding of contamination/colonization routes of foodborne pathogens within the food production environment, and can also afford efficient tracking of pathogens’ entry routes and distribution from farm-to-consumer. Tracking foodborne pathogens in the food processing-distribution-retail-consumer continuum is of the utmost importance for facilitation of outbreak investigations and rapid action in controlling/preventing foodborne outbreaks. Therefore, WGS likely will replace most of the numerous workflows used in public health laboratories to characterize foodborne pathogens into one consolidated, efficient workflow.
PANGEA: pipeline for analysis of next generation amplicons

PubMed Central

Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz FW; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W

2010-01-01

High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including preprocessing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the χ2 step, are joined into one program called the ‘backbone’. PMID:20182525
PANGEA: pipeline for analysis of next generation amplicons.

PubMed

Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz F W; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W

2010-07-01

High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including pre-processing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the chi(2) step, are joined into one program called the 'backbone'.
AmpliVar: mutation detection in high-throughput sequence from amplicon-based libraries.

PubMed

Hsu, Arthur L; Kondrashova, Olga; Lunke, Sebastian; Love, Clare J; Meldrum, Cliff; Marquis-Nicholson, Renate; Corboy, Greg; Pham, Kym; Wakefield, Matthew; Waring, Paul M; Taylor, Graham R

2015-04-01

Conventional means of identifying variants in high-throughput sequencing align each read against a reference sequence, and then call variants at each position. Here, we demonstrate an orthogonal means of identifying sequence variation by grouping the reads as amplicons prior to any alignment. We used AmpliVar to make key-value hashes of sequence reads and group reads as individual amplicons using a table of flanking sequences. Low-abundance reads were removed according to a selectable threshold, and reads above this threshold were aligned as groups, rather than as individual reads, permitting the use of sensitive alignment tools. We show that this approach is more sensitive, more specific, and more computationally efficient than comparable methods for the analysis of amplicon-based high-throughput sequencing data. The method can be extended to enable alignment-free confirmation of variants seen in hybridization capture target-enrichment data. © 2015 WILEY PERIODICALS, INC.
Differential evolution-simulated annealing for multiple sequence alignment

NASA Astrophysics Data System (ADS)

Addawe, R. C.; Addawe, J. M.; Sueño, M. R. K.; Magadia, J. C.

2017-10-01

Multiple sequence alignments (MSA) are used in the analysis of molecular evolution and sequence structure relationships. In this paper, a hybrid algorithm, Differential Evolution - Simulated Annealing (DESA) is applied in optimizing multiple sequence alignments (MSAs) based on structural information, non-gaps percentage and totally conserved columns. DESA is a robust algorithm characterized by self-organization, mutation, crossover, and SA-like selection scheme of the strategy parameters. Here, the MSA problem is treated as a multi-objective optimization problem of the hybrid evolutionary algorithm, DESA. Thus, we name the algorithm as DESA-MSA. Simulated sequences and alignments were generated to evaluate the accuracy and efficiency of DESA-MSA using different indel sizes, sequence lengths, deletion rates and insertion rates. The proposed hybrid algorithm obtained acceptable solutions particularly for the MSA problem evaluated based on the three objectives.
TEA: the epigenome platform for Arabidopsis methylome study.

PubMed

Su, Sheng-Yao; Chen, Shu-Hwa; Lu, I-Hsuan; Chiang, Yih-Shien; Wang, Yu-Bin; Chen, Pao-Yang; Lin, Chung-Yen

2016-12-22

Bisulfite sequencing (BS-seq) has become a standard technology to profile genome-wide DNA methylation at single-base resolution. It allows researchers to conduct genome-wise cytosine methylation analyses on issues about genomic imprinting, transcriptional regulation, cellular development and differentiation. One single data from a BS-Seq experiment is resolved into many features according to the sequence contexts, making methylome data analysis and data visualization a complex task. We developed a streamlined platform, TEA, for analyzing and visualizing data from whole-genome BS-Seq (WGBS) experiments conducted in the model plant Arabidopsis thaliana. To capture the essence of the genome methylation level and to meet the efficiency for running online, we introduce a straightforward method for measuring genome methylation in each sequence context by gene. The method is scripted in Java to process BS-Seq mapping results. Through a simple data uploading process, the TEA server deploys a web-based platform for deep analysis by linking data to an updated Arabidopsis annotation database and toolkits. TEA is an intuitive and efficient online platform for analyzing the Arabidopsis genomic DNA methylation landscape. It provides several ways to help users exploit WGBS data. TEA is freely accessible for academic users at: http://tea.iis.sinica.edu.tw .
Impact of cultivation on characterisation of species composition of soil bacterial communities.

PubMed

McCaig, A E.; Grayston, S J.; Prosser, J I.; Glover, L A.

2001-03-01

The species composition of culturable bacteria in Scottish grassland soils was investigated using a combination of Biolog and 16S rDNA analysis for characterisation of isolates. The inclusion of a molecular approach allowed direct comparison of sequences from culturable bacteria with sequences obtained during analysis of DNA extracted directly from the same soil samples. Bacterial strains were isolated on Pseudomonas isolation agar (PIA), a selective medium, and on tryptone soya agar (TSA), a general laboratory medium. In total, 12 and 21 morphologically different bacterial cultures were isolated on PIA and TSA, respectively. Biolog and sequencing placed PIA isolates in the same taxonomic groups, the majority of cultures belonging to the Pseudomonas (sensu stricto) group. However, analysis of 16S rDNA sequences proved more efficient than Biolog for characterising TSA isolates due to limitations of the Microlog database for identifying environmental bacteria. In general, 16S rDNA sequences from TSA isolates showed high similarities to cultured species represented in sequence databases, although TSA-8 showed only 92.5% similarity to the nearest relative, Bacillus insolitus. In general, there was very little overlap between the culturable and uncultured bacterial communities, although two sequences, PIA-2 and TSA-13, showed >99% similarity to soil clones. A cloning step was included prior to sequence analysis of two isolates, TSA-5 and TSA-14, and analysis of several clones confirmed that these cultures comprised at least four and three sequence types, respectively. All isolate clones were most closely related to uncultured bacteria, with clone TSA-5.1 showing 99.8% similarity to a sequence amplified directly from the same soil sample. Interestingly, one clone, TSA-5.4, clustered within a novel group comprising only uncultured sequences. This group, which is associated with the novel, deep-branching Acidobacterium capsulatum lineage, also included clones isolated during direct analysis of the same soil and from a wide range of other sample types studied elsewhere. The study demonstrates the value of fine-scale molecular analysis for identification of laboratory isolates and indicates the culturability of approximately 1% of the total population but under a restricted range of media and cultivation conditions.
Sequence analysis by iterated maps, a review.

PubMed

Almeida, Jonas S

2014-05-01

Among alignment-free methods, Iterated Maps (IMs) are on a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.
G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read-Depth Methods.

PubMed

Manconi, Andrea; Manca, Emanuele; Moscatelli, Marco; Gnocchi, Matteo; Orro, Alessandro; Armano, Giuliano; Milanesi, Luciano

2015-01-01

Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SVs and to study how they are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) are increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions with significantly different read-depth from the other ones. The pipeline analysis of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV regions identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short-reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.

Model Performance Evaluation and Scenario Analysis ...

EPA Pesticide Factsheets

This tool consists of two parts: model performance evaluation and scenario analysis (MPESA). The model performance evaluation consists of two components: model performance evaluation metrics and model diagnostics. These metrics provides modelers with statistical goodness-of-fit measures that capture magnitude only, sequence only, and combined magnitude and sequence errors. The performance measures include error analysis, coefficient of determination, Nash-Sutcliffe efficiency, and a new weighted rank method. These performance metrics only provide useful information about the overall model performance. Note that MPESA is based on the separation of observed and simulated time series into magnitude and sequence components. The separation of time series into magnitude and sequence components and the reconstruction back to time series provides diagnostic insights to modelers. For example, traditional approaches lack the capability to identify if the source of uncertainty in the simulated data is due to the quality of the input data or the way the analyst adjusted the model parameters. This report presents a suite of model diagnostics that identify if mismatches between observed and simulated data result from magnitude or sequence related errors. MPESA offers graphical and statistical options that allow HSPF users to compare observed and simulated time series and identify the parameter values to adjust or the input data to modify. The scenario analysis part of the too
Analysis of a diverse assemblage of diazotrophic bacteria from Spartina alterniflora using DGGE and clone library screening.

PubMed

Lovell, Charles R; Decker, Peter V; Bagwell, Christopher E; Thompson, Shelly; Matsui, George Y

2008-05-01

Methods to assess the diversity of the diazotroph assemblage in the rhizosphere of the salt marsh cordgrass, Spartina alterniflora were examined. The effectiveness of nifH PCR-denaturing gradient gel electrophoresis (DGGE) was compared to that of nifH clone library analysis. Seventeen DGGE gel bands were sequenced and yielded 58 nonidentical nifH sequences from a total of 67 sequences determined. A clone library constructed using the GC-clamp nifH primers that were employed in the PCR-DGGE (designated the GC-Library) yielded 83 nonidentical sequences from a total of 257 nifH sequences. A second library constructed using an alternate set of nifH primers (N-Library) yielded 83 nonidentical sequences from a total of 138 nifH sequences. Rarefaction curves for the libraries did not reach saturation, although the GC-Library curve was substantially dampened and appeared to be closer to saturation than the N-Library curve. Phylogenetic analyses showed that DGGE gel band sequencing recovered nifH sequences that were frequently sampled in the GC-Library, as well as sequences that were infrequently sampled, and provided a species composition assessment that was robust, efficient, and relatively inexpensive to obtain. Further, the DGGE method permits a large number of samples to be examined for differences in banding patterns, after which bands of interest can be sampled for sequence determination.
Detection of signals in mRNAs that influence translation.

PubMed

Brown, Chris M; Jacobs, Grant; Stockwell, Peter; Schreiber, Mark

2003-01-01

Genome sequencing efforts mean that we now have extensive data from a wide range of organisms to study. Understanding the differing natures of the biology of these organisms is an important aim of genome analysis. We are interested in signals that affect translation of mRNAs. Some signals in the mRNA influence how efficiently it is translated into protein. Previous studies have indicated that many important signals are located around the initiation and termination codons. We have developed tools described here to extract the relevant sequence regions from GenBank. To create databases organised by species, or higher taxonomic groupings (eg planta), a program was developed to dynamically view and edit the taxonomy database. Data from relevant species were then extracted using our Genbank feature table parser. We analysed all available sequences, particularly those from complete genomes. Patterns were then identified using information theory. The software is available from http://transterm.otago.ac.nz. Patterns around the initiation codons for most of the organisms fall into two groups, containing the previously known Shine-Dalgarno and Kozaks efficiency signals. However, we have identified several organisms that appear to utilise novel systems. Our analysis indicates that some organisms with extremely high GC% genomes do not have a strong dependence on base pairing ribosome binding sites, as the complementary sequence is absent from many genes.
Motion video analysis using planar parallax

NASA Astrophysics Data System (ADS)

Sawhney, Harpreet S.

1994-04-01

Motion and structure analysis in video sequences can lead to efficient descriptions of objects and their motions. Interesting events in videos can be detected using such an analysis--for instance independent object motion when the camera itself is moving, figure-ground segregation based on the saliency of a structure compared to its surroundings. In this paper we present a method for 3D motion and structure analysis that uses a planar surface in the environment as a reference coordinate system to describe a video sequence. The motion in the video sequence is described as the motion of the reference plane, and the parallax motion of all the non-planar components of the scene. It is shown how this method simplifies the otherwise hard general 3D motion analysis problem. In addition, a natural coordinate system in the environment is used to describe the scene which can simplify motion based segmentation. This work is a part of an ongoing effort in our group towards video annotation and analysis for indexing and retrieval. Results from a demonstration system being developed are presented.
Cloning, expression, and sequence analysis of the Bacillus methanolicus C1 methanol dehydrogenase gene.

PubMed Central

de Vries, G E; Arfman, N; Terpstra, P; Dijkhuizen, L

1992-01-01

The gene (mdh) coding for methanol dehydrogenase (MDH) of thermotolerant, methylotroph Bacillus methanolicus C1 has been cloned and sequenced. The deduced amino acid sequence of the mdh gene exhibited similarity to those of five other alcohol dehydrogenase (type III) enzymes, which are distinct from the long-chain zinc-containing (type I) or short-chain zinc-lacking (type II) enzymes. Highly efficient expression of the mdh gene in Escherichia coli was probably driven from its own promoter sequence. After purification of MDH from E. coli, the kinetic and biochemical properties of the enzyme were investigated. The physiological effect of MDH synthesis in E. coli and the role of conserved sequence patterns in type III alcohol dehydrogenases have been analyzed and are discussed. Images PMID:1644761
Identification of (R)-selective ω-aminotransferases by exploring evolutionary sequence space.

PubMed

Kim, Eun-Mi; Park, Joon Ho; Kim, Byung-Gee; Seo, Joo-Hyun

2018-03-01

Several (R)-selective ω-aminotransferases (R-ωATs) have been reported. The existence of additional R-ωATs having different sequence characteristics from previous ones is highly expected. In addition, it is generally accepted that R-ωATs are variants of aminotransferase group III. Based on these backgrounds, sequences in RefSeq database were scored using family profiles of branched-chain amino acid aminotransferase (BCAT) and d-alanine aminotransferase (DAT) to predict and identify putative R-ωATs. Sequences with two profile analysis scores were plotted on two-dimensional score space. Candidates with relatively similar scores in both BCAT and DAT profiles (i.e., profile analysis score using BCAT profile was similar to profile analysis score using DAT profile) were selected. Experimental results for selected candidates showed that putative R-ωATs from Saccharopolyspora erythraea (R-ωAT_Sery), Bacillus cellulosilyticus (R-ωAT_Bcel), and Bacillus thuringiensis (R-ωAT_Bthu) had R-ωAT activity. Additional experiments revealed that R-ωAT_Sery also possessed DAT activity while R-ωAT_Bcel and R-ωAT_Bthu had BCAT activity. Selecting putative R-ωATs from regions with similar profile analysis scores identified potential R-ωATs. Therefore, R-ωATs could be efficiently identified by using simple family profile analysis and exploring evolutionary sequence space. Copyright © 2017 Elsevier Inc. All rights reserved.
DMRfinder: efficiently identifying differentially methylated regions from MethylC-seq data.

PubMed

Gaspar, John M; Hart, Ronald P

2017-11-29

DNA methylation is an epigenetic modification that is studied at a single-base resolution with bisulfite treatment followed by high-throughput sequencing. After alignment of the sequence reads to a reference genome, methylation counts are analyzed to determine genomic regions that are differentially methylated between two or more biological conditions. Even though a variety of software packages is available for different aspects of the bioinformatics analysis, they often produce results that are biased or require excessive computational requirements. DMRfinder is a novel computational pipeline that identifies differentially methylated regions efficiently. Following alignment, DMRfinder extracts methylation counts and performs a modified single-linkage clustering of methylation sites into genomic regions. It then compares methylation levels using beta-binomial hierarchical modeling and Wald tests. Among its innovative attributes are the analyses of novel methylation sites and methylation linkage, as well as the simultaneous statistical analysis of multiple sample groups. To demonstrate its efficiency, DMRfinder is benchmarked against other computational approaches using a large published dataset. Contrasting two replicates of the same sample yielded minimal genomic regions with DMRfinder, whereas two alternative software packages reported a substantial number of false positives. Further analyses of biological samples revealed fundamental differences between DMRfinder and another software package, despite the fact that they utilize the same underlying statistical basis. For each step, DMRfinder completed the analysis in a fraction of the time required by other software. Among the computational approaches for identifying differentially methylated regions from high-throughput bisulfite sequencing datasets, DMRfinder is the first that integrates all the post-alignment steps in a single package. Compared to other software, DMRfinder is extremely efficient and unbiased in this process. DMRfinder is free and open-source software, available on GitHub ( github.com/jsh58/DMRfinder ); it is written in Python and R, and is supported on Linux.
Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

PubMed Central

Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.

2013-01-01

The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392
Rapid identification of sequences for orphan enzymes to power accurate protein annotation.

PubMed

Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G

2013-01-01

The power of genome sequencing depends on the ability to understand what those genes and their proteins products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.
Coding visual features extracted from video sequences.

PubMed

Baroffio, Luca; Cesana, Matteo; Redondi, Alessandro; Tagliasacchi, Marco; Tubaro, Stefano

2014-05-01

Visual features are successfully exploited in several applications (e.g., visual search, object recognition and tracking, etc.) due to their ability to efficiently represent image content. Several visual analysis tasks require features to be transmitted over a bandwidth-limited network, thus calling for coding techniques to reduce the required bit budget, while attaining a target level of efficiency. In this paper, we propose, for the first time, a coding architecture designed for local features (e.g., SIFT, SURF) extracted from video sequences. To achieve high coding efficiency, we exploit both spatial and temporal redundancy by means of intraframe and interframe coding modes. In addition, we propose a coding mode decision based on rate-distortion optimization. The proposed coding scheme can be conveniently adopted to implement the analyze-then-compress (ATC) paradigm in the context of visual sensor networks. That is, sets of visual features are extracted from video frames, encoded at remote nodes, and finally transmitted to a central controller that performs visual analysis. This is in contrast to the traditional compress-then-analyze (CTA) paradigm, in which video sequences acquired at a node are compressed and then sent to a central unit for further processing. In this paper, we compare these coding paradigms using metrics that are routinely adopted to evaluate the suitability of visual features in the context of content-based retrieval, object recognition, and tracking. Experimental results demonstrate that, thanks to the significant coding gains achieved by the proposed coding scheme, ATC outperforms CTA with respect to all evaluation metrics.
VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis.

PubMed

Cornwell, MacIntosh; Vangala, Mahesh; Taing, Len; Herbert, Zachary; Köster, Johannes; Li, Bo; Sun, Hanfei; Li, Taiwen; Zhang, Jian; Qiu, Xintao; Pun, Matthew; Jeselsohn, Rinath; Brown, Myles; Liu, X Shirley; Long, Henry W

2018-04-12

RNA sequencing has become a ubiquitous technology used throughout life sciences as an effective method of measuring RNA abundance quantitatively in tissues and cells. The increase in use of RNA-seq technology has led to the continuous development of new tools for every step of analysis from alignment to downstream pathway analysis. However, effectively using these analysis tools in a scalable and reproducible way can be challenging, especially for non-experts. Using the workflow management system Snakemake we have developed a user friendly, fast, efficient, and comprehensive pipeline for RNA-seq analysis. VIPER (Visualization Pipeline for RNA-seq analysis) is an analysis workflow that combines some of the most popular tools to take RNA-seq analysis from raw sequencing data, through alignment and quality control, into downstream differential expression and pathway analysis. VIPER has been created in a modular fashion to allow for the rapid incorporation of new tools to expand the capabilities. This capacity has already been exploited to include very recently developed tools that explore immune infiltrate and T-cell CDR (Complementarity-Determining Regions) reconstruction abilities. The pipeline has been conveniently packaged such that minimal computational skills are required to download and install the dozens of software packages that VIPER uses. VIPER is a comprehensive solution that performs most standard RNA-seq analyses quickly and effectively with a built-in capacity for customization and expansion.
Parallel human genome analysis: microarray-based expression monitoring of 1000 genes.

PubMed Central

Schena, M; Shalon, D; Heller, R; Chai, A; Brown, P O; Davis, R W

1996-01-01

Microarrays containing 1046 human cDNAs of unknown sequence were printed on glass with high-speed robotics. These 1.0-cm2 DNA "chips" were used to quantitatively monitor differential expression of the cognate human genes using a highly sensitive two-color hybridization assay. Array elements that displayed differential expression patterns under given experimental conditions were characterized by sequencing. The identification of known and novel heat shock and phorbol ester-regulated genes in human T cells demonstrates the sensitivity of the assay. Parallel gene analysis with microarrays provides a rapid and efficient method for large-scale human gene discovery. Images Fig. 1 Fig. 2 Fig. 3 PMID:8855227
CAFE: aCcelerated Alignment-FrEe sequence analysis

PubMed Central

Lu, Yang Young; Tang, Kujin; Ren, Jie; Fuhrman, Jed A.; Waterman, Michael S.

2017-01-01

Abstract Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$d_2^*$\\end{document} and \\documentclass[12pt]{minimal} \\usepackage{amsmath} \\usepackage{wasysym} \\usepackage{amsfonts} \\usepackage{amssymb} \\usepackage{amsbsy} \\usepackage{upgreek} \\usepackage{mathrsfs} \\setlength{\\oddsidemargin}{-69pt} \\begin{document} }{}$d_2^S$\\end{document} are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE. PMID:28472388
Development of self-compressing BLSOM for comprehensive analysis of big sequence data.

PubMed

Kikuchi, Akihito; Ikemura, Toshimichi; Abe, Takashi

2015-01-01

With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of available big sequence data. We previously developed a Batch-Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype solely dependent on oligonucleotide composition and applied to genome and metagenomic studies. BLSOM is suitable for high-performance parallel-computing and can analyze big data simultaneously, but a large-scale BLSOM needs a large computational resource. We have developed Self-Compressing BLSOM (SC-BLSOM) for reduction of computation time, which allows us to carry out comprehensive analysis of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to hierarchically construct BLSOMs according to data class, such as phylotype. The first-layer BLSOM was constructed with each of the divided input data pieces that represents the data subclass, such as phylotype division, resulting in compression of the number of data pieces. The second BLSOM was constructed with a total of weight vectors obtained in the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and cluster the sequences according to phylotype with high accuracy, showing the method's suitability for efficient knowledge discovery from big sequence data.
PWHATSHAP: efficient haplotyping for future generation sequencing.

PubMed

Bracciali, Andrea; Aldinucci, Marco; Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; Merelli, Ivan; Torquati, Massimo

2016-09-22

Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage. Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
An enhancement of NASTRAN for the seismic analysis of structures. [nuclear power plants

NASA Technical Reports Server (NTRS)

Burroughs, J. W.

1980-01-01

New modules, bulk data cards and DMAP sequence were added to NASTRAN to aid in the seismic analysis of nuclear power plant structures. These allow input consisting of acceleration time histories and result in the generation of acceleration floor response spectra. The resulting system contains numerous user convenience features, as well as being reasonably efficient.
GobyWeb: Simplified Management and Analysis of Gene Expression and DNA Methylation Sequencing Data

PubMed Central

Dorff, Kevin C.; Chambwe, Nyasha; Zeno, Zachary; Simi, Manuele; Shaknovich, Rita; Campagne, Fabien

2013-01-01

We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), or the detection of pathogens in sequenced data. In contrast to previous analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found it to either outperform or have similar performance to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state of the art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed in source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions http://github.com/CampagneLaboratory/gobyweb2-plugins. PMID:23936070
Experience of targeted Usher exome sequencing as a clinical test

PubMed Central

Besnard, Thomas; García-García, Gema; Baux, David; Vaché, Christel; Faugère, Valérie; Larrieu, Lise; Léonard, Susana; Millan, Jose M; Malcolm, Sue; Claustres, Mireille; Roux, Anne-Françoise

2014-01-01

We show that massively parallel targeted sequencing of 19 genes provides a new and reliable strategy for molecular diagnosis of Usher syndrome (USH) and nonsyndromic deafness, particularly appropriate for these disorders characterized by a high clinical and genetic heterogeneity and a complex structure of several of the genes involved. A series of 71 patients including Usher patients previously screened by Sanger sequencing plus newly referred patients was studied. Ninety-eight percent of the variants previously identified by Sanger sequencing were found by next-generation sequencing (NGS). NGS proved to be efficient as it offers analysis of all relevant genes which is laborious to reach with Sanger sequencing. Among the 13 newly referred Usher patients, both mutations in the same gene were identified in 77% of cases (10 patients) and one candidate pathogenic variant in two additional patients. This work can be considered as pilot for implementing NGS for genetically heterogeneous diseases in clinical service. PMID:24498627
Evol and ProDy for bridging protein sequence evolution and structural dynamics

PubMed Central

Mao, Wenzhi; Liu, Ying; Chennubhotla, Chakra; Lezon, Timothy R.; Bahar, Ivet

2014-01-01

Correlations between sequence evolution and structural dynamics are of utmost importance in understanding the molecular mechanisms of function and their evolution. We have integrated Evol, a new package for fast and efficient comparative analysis of evolutionary patterns and conformational dynamics, into ProDy, a computational toolbox designed for inferring protein dynamics from experimental and theoretical data. Using information-theoretic approaches, Evol coanalyzes conservation and coevolution profiles extracted from multiple sequence alignments of protein families with their inferred dynamics. Availability and implementation: ProDy and Evol are open-source and freely available under MIT License from http://prody.csb.pitt.edu/. Contact: bahar@pitt.edu PMID:24849577
Reinventing Cell Penetrating Peptides Using Glycosylated Methionine Sulfonium Ion Sequences.

PubMed

Kramer, Jessica R; Schmidt, Nathan W; Mayle, Kristine M; Kamei, Daniel T; Wong, Gerard C L; Deming, Timothy J

2015-05-27

Cell penetrating peptides (CPPs) are intriguing molecules that have received much attention, both in terms of mechanistic analysis and as transporters for intracellular therapeutic delivery. Most CPPs contain an abundance of cationic charged residues, typically arginine, where the amino acid compositions, rather than specific sequences, tend to determine their ability to enter cells. Hydrophobic residues are often added to cationic sequences to create efficient CPPs, but typically at the penalty of increased cytotoxicity. Here, we examined polypeptides containing glycosylated, cationic derivatives of methionine, where we found these hydrophilic polypeptides to be surprisingly effective as CPPs and to also possess low cytotoxicity. X-ray analysis of how these new polypeptides interact with lipid membranes revealed that the incorporation of sterically demanding hydrophilic cationic groups in polypeptides is an unprecedented new concept for design of potent CPPs.

Reinventing cell penetrating peptides using glycosylated methionine sulfonium ion sequences

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kramer, Jessica R.; Schmidt, Nathan W.; Mayle, Kristine M.

2015-04-15

Cell penetrating peptides (CPPs) are intriguing molecules that have received much attention, both in terms of mechanistic analysis and as transporters for intracellular therapeutic delivery. Most CPPs contain an abundance of cationic charged residues, typically arginine, where the amino acid compositions, rather than specific sequences, tend to determine their ability to enter cells. Hydrophobic residues are often added to cationic sequences to create efficient CPPs, but typically at the penalty of increased cytotoxicity. Here, we examined polypeptides containing glycosylated, cationic derivatives of methionine, where we found these hydrophilic polypeptides to be surprisingly effective as CPPs and to also possess lowmore » cytotoxicity. X-ray analysis of how these new polypeptides interact with lipid membranes revealed that the incorporation of sterically demanding hydrophilic cationic groups in polypeptides is an unprecedented new concept for design of potent CPPs.« less
Determining mutation density using Restriction Enzyme Sequence Comparative Analysis (RESCAN)

USDA-ARS?s Scientific Manuscript database

The average mutation density of a mutant population is a major consideration when developing resources for the efficient, cost-effective implementation of reverse genetics methods such as Targeting of Induced Local Lesions in Genomes (TILLING). Reliable estimates of mutation density can be achieved ...
[Rapid detection of hot spot mutations of FGFR3 gene with PCR-high resolution melting assay].

PubMed

Li, Shan; Wang, Han; Su, Hua; Gao, Jinsong; Zhao, Xiuli

2017-08-10

To identify the causative mutations in five individuals affected with dyschondroplasia and develop an efficient procedure for detecting hot spot mutations of the FGFR3 gene. Genomic DNA was extracted from peripheral blood samples with a standard phenol/chloroform method. PCR-Sanger sequencing was used to analyze the causative mutations in the five probands. PCR-high resolution melting (HRM) was developed to detect the identified mutations. A c.1138G>A mutation in exon 8 was found in 4 probands, while a c.1620C>G mutation was found in exon 11 of proband 5 whom had a mild phenotype. All patients were successfully distinguished from healthy controls with the PCR-HRM method. The results of HRM analysis were highly consistent with that of Sanger sequencing. The Gly380Arg and Asn540Lys are hot spot mutations of the FGFR3 gene among patients with ACH/HCH. PCR-HRM analysis is more efficient for detecting hot spot mutations of the FGFR3 gene.
Isolation and identification of efficient Egyptian malathion-degrading bacterial isolates.

PubMed

Hamouda, S A; Marzouk, M A; Abbassy, M A; Abd-El-Haleem, D A; Shamseldin, Abdelaal

2015-03-01

Bacterial isolates degrading malathion were isolated from the soil and agricultural waste water due to their ability to grow on minimal salt media amended with malathion as a sole carbon source. Efficiencies of native Egyptian bacterial malathion-degrading isolates were investigated and the study generated nine highly effective malathion-degrading bacterial strains among 40. Strains were identified by partial sequencing of 16S rDNA analysis. Comparative analysis of 16S rDNA sequences revealed that these bacteria are similar with the genus Acinetobacter and Bacillus spp. and RFLP based PCR of 16S rDNA gave four different RFLP patterns among strains with enzyme HinfI while with enzyme HaeI they gave two RFLP profiles. The degradation rate of malathion in liquid culture was estimated using gas chromatography. Bacterial strains could degrade more than 90% of the initial malathion concentration (1000 ppm) within 4 days. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
New technology and resources for cryptococcal research

PubMed Central

Zhang, Nannan; Park, Yoon-Dong; Williamson, Peter R.

2014-01-01

Rapid advances in molecular biology and genome sequencing have enabled the generation of new technology and resources for cryptococcal research. RNAi-mediated specific gene knock down has become routine and more efficient by utilizing modified shRNA plasmids and convergent promoter RNAi constructs. This system was recently applied in a high-throughput screen to identify genes involved in host-pathogen interactions. Gene deletion efficiencies have also been improved by increasing rates of homologous recombination through a number of approaches, including a combination of double-joint PCR with split-marker transformation, the use of dominant selectable markers and the introduction of Cre-Loxp systems into Cryptococcus. Moreover, visualization of cryptococcal proteins has become more facile using fusions with codon-optimized fluorescent tags, such as green or red fluorescent proteins or, mCherry. Using recent genome-wide analytical tools, new transcriptional factors and regulatory proteins have been identified in novel virulence-related signaling pathways by employing microarray analysis, RNA-sequencing and proteomic analysis. PMID:25460849
Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM)☆

PubMed Central

Parson, Walther; Strobl, Christina; Huber, Gabriela; Zimmermann, Bettina; Gomes, Sibylle M.; Souto, Luis; Fendt, Liane; Delport, Rhena; Langit, Reina; Wootton, Sharon; Lagacé, Robert; Irwin, Jodi

2013-01-01

Insights into the human mitochondrial phylogeny have been primarily achieved by sequencing full mitochondrial genomes (mtGenomes). In forensic genetics (partial) mtGenome information can be used to assign haplotypes to their phylogenetic backgrounds, which may, in turn, have characteristic geographic distributions that would offer useful information in a forensic case. In addition and perhaps even more relevant in the forensic context, haplogroup-specific patterns of mutations form the basis for quality control of mtDNA sequences. The current method for establishing (partial) mtDNA haplotypes is Sanger-type sequencing (STS), which is laborious, time-consuming, and expensive. With the emergence of Next Generation Sequencing (NGS) technologies, the body of available mtDNA data can potentially be extended much more quickly and cost-efficiently. Customized chemistries, laboratory workflows and data analysis packages could support the community and increase the utility of mtDNA analysis in forensics. We have evaluated the performance of mtGenome sequencing using the Personal Genome Machine (PGM) and compared the resulting haplotypes directly with conventional Sanger-type sequencing. A total of 64 mtGenomes (>1 million bases) were established that yielded high concordance with the corresponding STS haplotypes (<0.02% differences). About two-thirds of the differences were observed in or around homopolymeric sequence stretches. In addition, the sequence alignment algorithm employed to align NGS reads played a significant role in the analysis of the data and the resulting mtDNA haplotypes. Further development of alignment software would be desirable to facilitate the application of NGS in mtDNA forensic genetics. PMID:23948325
Analysis and Prediction of Exon Skipping Events from RNA-Seq with Sequence Information Using Rotation Forest.

PubMed

Du, Xiuquan; Hu, Changlin; Yao, Yu; Sun, Shiwei; Zhang, Yanping

2017-12-12

In bioinformatics, exon skipping (ES) event prediction is an essential part of alternative splicing (AS) event analysis. Although many methods have been developed to predict ES events, a solution has yet to be found. In this study, given the limitations of machine learning algorithms with RNA-Seq data or genome sequences, a new feature, called RS (RNA-seq and sequence) features, was constructed. These features include RNA-Seq features derived from the RNA-Seq data and sequence features derived from genome sequences. We propose a novel Rotation Forest classifier to predict ES events with the RS features (RotaF-RSES). To validate the efficacy of RotaF-RSES, a dataset from two human tissues was used, and RotaF-RSES achieved an accuracy of 98.4%, a specificity of 99.2%, a sensitivity of 94.1%, and an area under the curve (AUC) of 98.6%. When compared to the other available methods, the results indicate that RotaF-RSES is efficient and can predict ES events with RS features.
SAMSA2: a standalone metatranscriptome analysis pipeline.

PubMed

Westreich, Samuel T; Treiber, Michelle L; Mills, David A; Korf, Ian; Lemay, Danielle G

2018-05-21

Complex microbial communities are an area of growing interest in biology. Metatranscriptomics allows researchers to quantify microbial gene expression in an environmental sample via high-throughput sequencing. Metatranscriptomic experiments are computationally intensive because the experiments generate a large volume of sequence data and each sequence must be compared with reference sequences from thousands of organisms. SAMSA2 is an upgrade to the original Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) pipeline that has been redesigned for standalone use on a supercomputing cluster. SAMSA2 is faster due to the use of the DIAMOND aligner, and more flexible and reproducible because it uses local databases. SAMSA2 is available with detailed documentation, and example input and output files along with examples of master scripts for full pipeline execution. SAMSA2 is a rapid and efficient metatranscriptome pipeline for analyzing large RNA-seq datasets in a supercomputing cluster environment. SAMSA2 provides simplified output that can be examined directly or used for further analyses, and its reference databases may be upgraded, altered or customized to fit the needs of any experiment.
Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing

PubMed Central

2012-01-01

Background RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Results Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. Conclusions This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates. PMID:22985019
Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing.

PubMed

Robles, José A; Qureshi, Sumaira E; Stephen, Stuart J; Wilson, Susan R; Burden, Conrad J; Taylor, Jennifer M

2012-09-17

RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates.
A dual selection based, targeted gene replacement tool for Magnaporthe grisea and Fusarium oxysporum.

PubMed

Khang, Chang Hyun; Park, Sook-Young; Lee, Yong-Hwan; Kang, Seogchan

2005-06-01

Rapid progress in fungal genome sequencing presents many new opportunities for functional genomic analysis of fungal biology through the systematic mutagenesis of the genes identified through sequencing. However, the lack of efficient tools for targeted gene replacement is a limiting factor for fungal functional genomics, as it often necessitates the screening of a large number of transformants to identify the desired mutant. We developed an efficient method of gene replacement and evaluated factors affecting the efficiency of this method using two plant pathogenic fungi, Magnaporthe grisea and Fusarium oxysporum. This method is based on Agrobacterium tumefaciens-mediated transformation with a mutant allele of the target gene flanked by the herpes simplex virus thymidine kinase (HSVtk) gene as a conditional negative selection marker against ectopic transformants. The HSVtk gene product converts 5-fluoro-2'-deoxyuridine to a compound toxic to diverse fungi. Because ectopic transformants express HSVtk, while gene replacement mutants lack HSVtk, growing transformants on a medium amended with 5-fluoro-2'-deoxyuridine facilitates the identification of targeted mutants by counter-selecting against ectopic transformants. In addition to M. grisea and F. oxysporum, the method and associated vectors are likely to be applicable to manipulating genes in a broad spectrum of fungi, thus potentially serving as an efficient, universal functional genomic tool for harnessing the growing body of fungal genome sequence data to study fungal biology.
MutAid: Sanger and NGS Based Integrated Pipeline for Mutation Identification, Validation and Annotation in Human Molecular Genetics.

PubMed

Pandey, Ram Vinay; Pabinger, Stephan; Kriegner, Albert; Weinhäusel, Andreas

2016-01-01

Traditional Sanger sequencing as well as Next-Generation Sequencing have been used for the identification of disease causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms including Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers including BWA, TMAP, Bowtie, Bowtie2 and GSNAP and four variant callers including GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2 pipelines are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid.
MutAid: Sanger and NGS Based Integrated Pipeline for Mutation Identification, Validation and Annotation in Human Molecular Genetics

PubMed Central

Pandey, Ram Vinay; Pabinger, Stephan; Kriegner, Albert; Weinhäusel, Andreas

2016-01-01

Traditional Sanger sequencing as well as Next-Generation Sequencing have been used for the identification of disease causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms including Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers including BWA, TMAP, Bowtie, Bowtie2 and GSNAP and four variant callers including GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2 pipelines are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid. PMID:26840129
An integrated PCR colony hybridization approach to screen cDNA libraries for full-length coding sequences.

PubMed

Pollier, Jacob; González-Guzmán, Miguel; Ardiles-Diaz, Wilson; Geelen, Danny; Goossens, Alain

2011-01-01

cDNA-Amplified Fragment Length Polymorphism (cDNA-AFLP) is a commonly used technique for genome-wide expression analysis that does not require prior sequence knowledge. Typically, quantitative expression data and sequence information are obtained for a large number of differentially expressed gene tags. However, most of the gene tags do not correspond to full-length (FL) coding sequences, which is a prerequisite for subsequent functional analysis. A medium-throughput screening strategy, based on integration of polymerase chain reaction (PCR) and colony hybridization, was developed that allows in parallel screening of a cDNA library for FL clones corresponding to incomplete cDNAs. The method was applied to screen for the FL open reading frames of a selection of 163 cDNA-AFLP tags from three different medicinal plants, leading to the identification of 109 (67%) FL clones. Furthermore, the protocol allows for the use of multiple probes in a single hybridization event, thus significantly increasing the throughput when screening for rare transcripts. The presented strategy offers an efficient method for the conversion of incomplete expressed sequence tags (ESTs), such as cDNA-AFLP tags, to FL-coding sequences.
Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling.

PubMed

Tao, Ran; Zeng, Donglin; Franceschini, Nora; North, Kari E; Boerwinkle, Eric; Lin, Dan-Yu

2015-06-01

High-throughput DNA sequencing allows for the genotyping of common and rare variants for genetic association studies. At the present time and for the foreseeable future, it is not economically feasible to sequence all individuals in a large cohort. A cost-effective strategy is to sequence those individuals with extreme values of a quantitative trait. We consider the design under which the sampling depends on multiple quantitative traits. Under such trait-dependent sampling, standard linear regression analysis can result in bias of parameter estimation, inflation of type I error, and loss of power. We construct a likelihood function that properly reflects the sampling mechanism and utilizes all available data. We implement a computationally efficient EM algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. Our methods can be used to perform separate inference on each trait or simultaneous inference on multiple traits. We pay special attention to gene-level association tests for rare variants. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study and the National Heart, Lung, and Blood Institute Exome Sequencing Project.
Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery

NASA Astrophysics Data System (ADS)

Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher

2017-05-01

Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.
Analysis of Variability in HIV-1 Subtype A Strains in Russia Suggests a Combination of Deep Sequencing and Multitarget RNA Interference for Silencing of the Virus.

PubMed

Kretova, Olga V; Chechetkin, Vladimir R; Fedoseeva, Daria M; Kravatsky, Yuri V; Sosin, Dmitri V; Alembekov, Ildar R; Gorbacheva, Maria A; Gashnikova, Natalya M; Tchurikov, Nickolai A

2017-02-01

Any method for silencing the activity of the HIV-1 retrovirus should tackle the extremely high variability of HIV-1 sequences and mutational escape. We studied sequence variability in the vicinity of selected RNA interference (RNAi) targets from isolates of HIV-1 subtype A in Russia, and we propose that using artificial RNAi is a potential alternative to traditional antiretroviral therapy. We prove that using multiple RNAi targets overcomes the variability in HIV-1 isolates. The optimal number of targets critically depends on the conservation of the target sequences. The total number of targets that are conserved with a probability of 0.7-0.8 should exceed at least 2. Combining deep sequencing and multitarget RNAi may provide an efficient approach to cure HIV/AIDS.
Analysis of in vitro evolution reveals the underlying distribution of catalytic activity among random sequences.

PubMed

Pressman, Abe; Moretti, Janina E; Campbell, Gregory W; Müller, Ulrich F; Chen, Irene A

2017-08-21

The emergence of catalytic RNA is believed to have been a key event during the origin of life. Understanding how catalytic activity is distributed across random sequences is fundamental to estimating the probability that catalytic sequences would emerge. Here, we analyze the in vitro evolution of triphosphorylating ribozymes and translate their fitnesses into absolute estimates of catalytic activity for hundreds of ribozyme families. The analysis efficiently identified highly active ribozymes and estimated catalytic activity with good accuracy. The evolutionary dynamics follow Fisher's Fundamental Theorem of Natural Selection and a corollary, permitting retrospective inference of the distribution of fitness and activity in the random sequence pool for the first time. The frequency distribution of rate constants appears to be log-normal, with a surprisingly steep dropoff at higher activity, consistent with a mechanism for the emergence of activity as the product of many independent contributions. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Genetic mutation analysis of human gastric adenocarcinomas using ion torrent sequencing platform.

PubMed

Xu, Zhi; Huo, Xinying; Ye, Hua; Tang, Chuanning; Nandakumar, Vijayalakshmi; Lou, Feng; Zhang, Dandan; Dong, Haichao; Sun, Hong; Jiang, Shouwen; Zhang, Guangchun; Liu, Zhiyuan; Dong, Zhishou; Guo, Baishuai; He, Yan; Yan, Chaowei; Wang, Lu; Su, Ziyi; Li, Yangyang; Gu, Dongying; Zhang, Xiaojing; Wu, Xiaomin; Wei, Xiaowei; Hong, Lingzhi; Zhang, Yangmei; Yang, Jinsong; Gong, Yonglin; Tang, Cuiju; Jones, Lindsey; Huang, Xue F; Chen, Si-Yi; Chen, Jinfei

2014-01-01

Gastric cancer is the one of the major causes of cancer-related death, especially in Asia. Gastric adenocarcinoma, the most common type of gastric cancer, is heterogeneous and its incidence and cause varies widely with geographical regions, gender, ethnicity, and diet. Since unique mutations have been observed in individual human cancer samples, identification and characterization of the molecular alterations underlying individual gastric adenocarcinomas is a critical step for developing more effective, personalized therapies. Until recently, identifying genetic mutations on an individual basis by DNA sequencing remained a daunting task. Recent advances in new next-generation DNA sequencing technologies, such as the semiconductor-based Ion Torrent sequencing platform, makes DNA sequencing cheaper, faster, and more reliable. In this study, we aim to identify genetic mutations in the genes which are targeted by drugs in clinical use or are under development in individual human gastric adenocarcinoma samples using Ion Torrent sequencing. We sequenced 737 loci from 45 cancer-related genes in 238 human gastric adenocarcinoma samples using the Ion Torrent Ampliseq Cancer Panel. The sequencing analysis revealed a high occurrence of mutations along the TP53 locus (9.7%) in our sample set. Thus, this study indicates the utility of a cost and time efficient tool such as Ion Torrent sequencing to screen cancer mutations for the development of personalized cancer therapy.
T1 weighted fat/water separated PROPELLER acquired with dual bandwidths.

PubMed

Rydén, Henric; Berglund, Johan; Norbeck, Ola; Avventi, Enrico; Skare, Stefan

2018-04-24

To describe a fat/water separated dual receiver bandwidth (rBW) spin echo PROPELLER sequence that eliminates the dead time associated with single rBW sequences. A nonuniform noise whitening by regularization of the fat/water inverse problem is proposed, to enable dual rBW reconstructions. Bipolar, flyback, and dual spin echo sequences were developed. All sequences acquire two echoes with different rBW without dead time. Chemical shift displacement was corrected by performing the fat/water separation in k-space, prior to gridding. The proposed sequences were compared to fat saturation, and single rBW sequences, in terms of SNR and CNR efficiency, using clinically relevant acquisition parameters. The impact of motion was investigated. Chemical shift correction greatly improved the image quality, especially at high resolution acquired with low rBW, and also improved motion estimates. SNR efficiency of the dual spin echo sequence was up to 20% higher than the single rBW acquisition, while CNR efficiency was 50% higher for the bipolar acquisition. Noise whitening was deemed necessary for all dual rBW acquisitions, rendering high image quality with strong and homogenous fat suppression. Dual rBW sequences eliminate the dead time present in single rBW sequences, which improves SNR efficiency. In combination with the proposed regularization, this enables highly efficient T1-weighted PROPELLER images without chemical shift displacement. © 2018 International Society for Magnetic Resonance in Medicine.

Whole Exome Sequencing in Pediatric Neurology Patients: Clinical Implications and Estimated Cost Analysis.

PubMed

Nolan, Danielle; Carlson, Martha

2016-06-01

Genetic heterogeneity in neurologic disorders has been an obstacle to phenotype-based diagnostic testing. The authors hypothesized that information compiled via whole exome sequencing will improve clinical diagnosis and management of pediatric neurology patients. The authors performed a retrospective chart review of patients evaluated in the University of Michigan Pediatric Neurology clinic between 6/2011 and 6/2015. The authors recorded previous diagnostic testing, indications for whole exome sequencing, and whole exome sequencing results. Whole exome sequencing was recommended for 135 patients and obtained in 53 patients. Insurance barriers often precluded whole exome sequencing. The most common indication for whole exome sequencing was neurodevelopmental disorders. Whole exome sequencing improved the presumptive diagnostic rate in the patient cohort from 25% to 48%. Clinical implications included family planning, medication selection, and systemic investigation. Compared to current second tier testing, whole exome sequencing can result in lower long-term charges and more timely diagnosis. Overcoming barriers related to whole exome sequencing insurance authorization could allow for more efficient and fruitful diagnostic neurological evaluations. © The Author(s) 2016.
Polyfluorophore Labels on DNA: Dramatic Sequence Dependence of Quenching

PubMed Central

Teo, Yin Nah; Wilson, James N.

2010-01-01

We describe studies carried out in the DNA context to test how a common fluorescence quencher, dabcyl, interacts with oligodeoxynu-cleoside fluorophores (ODFs)—a system of stacked, electronically interacting fluorophores built on a DNA scaffold. We tested twenty different tetrameric ODF sequences containing varied combinations and orderings of pyrene (Y), benzopyrene (B), perylene (E), dimethylaminostilbene (D), and spacer (S) monomers conjugated to the 3′ end of a DNA oligomer. Hybridization of this probe sequence to a dabcyl-labeled complementary strand resulted in strong quenching of fluorescence in 85% of the twenty ODF sequences. The high efficiency of quenching was also established by their large Stern–Volmer constants (KSV) of between 2.1 × 104 and 4.3 × 105M−1, measured with a free dabcyl quencher. Interestingly, quenching of ODFs displayed strong sequence dependence. This was particularly evident in anagrams of ODF sequences; for example, the sequence BYDS had a KSV that was approximately two orders of magnitude greater than that of BSDY, which has the same dye composition. Other anagrams, for example EDSY and ESYD, also displayed different responses upon quenching by dabcyl. Analysis of spectra showed that apparent excimer and exciplex emission bands were quenched with much greater efficiency compared to monomer emission bands by at least an order of magnitude. This suggests an important role played by delocalized excited states of the π stack of fluorophores in the amplified quenching of fluorescence. PMID:19780115
Target Site Recognition by a Diversity-Generating Retroelement

PubMed Central

Guo, Huatao; Tse, Longping V.; Nieh, Angela W.; Czornyj, Elizabeth; Williams, Steven; Oukil, Sabrina; Liu, Vincent B.; Miller, Jeff F.

2011-01-01

Diversity-generating retroelements (DGRs) are in vivo sequence diversification machines that are widely distributed in bacterial, phage, and plasmid genomes. They function to introduce vast amounts of targeted diversity into protein-encoding DNA sequences via mutagenic homing. Adenine residues are converted to random nucleotides in a retrotransposition process from a donor template repeat (TR) to a recipient variable repeat (VR). Using the Bordetella bacteriophage BPP-1 element as a prototype, we have characterized requirements for DGR target site function. Although sequences upstream of VR are dispensable, a 24 bp sequence immediately downstream of VR, which contains short inverted repeats, is required for efficient retrohoming. The inverted repeats form a hairpin or cruciform structure and mutational analysis demonstrated that, while the structure of the stem is important, its sequence can vary. In contrast, the loop has a sequence-dependent function. Structure-specific nuclease digestion confirmed the existence of a DNA hairpin/cruciform, and marker coconversion assays demonstrated that it influences the efficiency, but not the site of cDNA integration. Comparisons with other phage DGRs suggested that similar structures are a conserved feature of target sequences. Using a kanamycin resistance determinant as a reporter, we found that transplantation of the IMH and hairpin/cruciform-forming region was sufficient to target the DGR diversification machinery to a heterologous gene. In addition to furthering our understanding of DGR retrohoming, our results suggest that DGRs may provide unique tools for directed protein evolution via in vivo DNA diversification. PMID:22194701
A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection

PubMed Central

Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike

2018-01-01

ABSTRACT Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2, with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publically available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection. PMID:29564396
A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection.

PubMed

Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike; Khan, Arifa S

2018-01-01

Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined, semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2, with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publically available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.
Verification of 2A peptide cleavage.

PubMed

Szymczak-Workman, Andrea L; Vignali, Kate M; Vignali, Dario A A

2012-02-01

The need for reliable, multicistronic vectors for multigene delivery is at the forefront of biomedical technology. It is now possible to express multiple proteins from a single open reading frame (ORF) using 2A peptide-linked multicistronic vectors. These small sequences, when cloned between genes, allow for efficient, stoichiometric production of discrete protein products within a single vector through a novel "cleavage" event within the 2A peptide sequence. The easiest and most effective way to assess 2A cleavage is to perform transient transfection of 293T cells (human embryonic kidney cells) followed by western blot analysis, as described in this protocol. 293T cells are easy to grow and can be efficiently transfected with a variety of vectors. Cleavage can be assessed by detection with antibodies against the target proteins or anti-2A serum.
Leveraging genome-wide datasets to quantify the functional role of the anti-Shine-Dalgarno sequence in regulating translation efficiency.

PubMed

Hockenberry, Adam J; Pah, Adam R; Jewett, Michael C; Amaral, Luís A N

2017-01-01

Studies dating back to the 1970s established that sequence complementarity between the anti-Shine-Dalgarno (aSD) sequence on prokaryotic ribosomes and the 5' untranslated region of mRNAs helps to facilitate translation initiation. The optimal location of aSD sequence binding relative to the start codon, the full extents of the aSD sequence and the functional form of the relationship between aSD sequence complementarity and translation efficiency have not been fully resolved. Here, we investigate these relationships by leveraging the sequence diversity of endogenous genes and recently available genome-wide estimates of translation efficiency. We show that-after accounting for predicted mRNA structure-aSD sequence complementarity increases the translation of endogenous mRNAs by roughly 50%. Further, we observe that this relationship is nonlinear, with translation efficiency maximized for mRNAs with intermediate levels of aSD sequence complementarity. The mechanistic insights that we observe are highly robust: we find nearly identical results in multiple datasets spanning three distantly related bacteria. Further, we verify our main conclusions by re-analysing a controlled experimental dataset. © 2017 The Authors.
Initial steps towards a production platform for DNA sequence analysis on the grid.

PubMed

Luyf, Angela C M; van Schaik, Barbera D C; de Vries, Michel; Baas, Frank; van Kampen, Antoine H C; Olabarriaga, Silvia D

2010-12-14

Bioinformatics is confronted with a new data explosion due to the availability of high throughput DNA sequencers. Data storage and analysis becomes a problem on local servers, and therefore it is needed to switch to other IT infrastructures. Grid and workflow technology can help to handle the data more efficiently, as well as facilitate collaborations. However, interfaces to grids are often unfriendly to novice users. In this study we reused a platform that was developed in the VL-e project for the analysis of medical images. Data transfer, workflow execution and job monitoring are operated from one graphical interface. We developed workflows for two sequence alignment tools (BLAST and BLAT) as a proof of concept. The analysis time was significantly reduced. All workflows and executables are available for the members of the Dutch Life Science Grid and the VL-e Medical virtual organizations All components are open source and can be transported to other grid infrastructures. The availability of in-house expertise and tools facilitates the usage of grid resources by new users. Our first results indicate that this is a practical, powerful and scalable solution to address the capacity and collaboration issues raised by the deployment of next generation sequencers. We currently adopt this methodology on a daily basis for DNA sequencing and other applications. More information and source code is available via http://www.bioinformaticslaboratory.nl/
A Tale of Tails: Dissecting the Enhancing Effect of Tailed Primers in Real-Time PCR

PubMed Central

Vandenbussche, Frank; Mathijs, Elisabeth; Lefebvre, David; De Clercq, Kris; Van Borm, Steven

2016-01-01

Non-specific tail sequences are often added to the 5’-terminus of primers to improve the robustness and overall performance of diagnostic assays. Despite the widespread use of tailed primers, the underlying working mechanism is not well understood. To address this problem, we conducted a detailed in vitro and in silico analysis of the enhancing effect of primer tailing on 2 well-established foot-and-mouth disease virus (FMDV) RT-qPCR assays using an FMDV reference panel. Tailing of the panFMDV-5UTR primers mainly affected the shape of the amplification curves. Modelling of the raw fluorescence data suggested a reduction of the amplification efficiency due to the accumulation of inhibitors. In depth analysis of PCR products indeed revealed the rapid accumulation of forward-primer derived artefacts. More importantly, tailing of the forward primer delayed artefacts formation and concomitantly restored the sigmoidal shape of the amplification curves. Our analysis also showed that primer tailing can alter utilisation patterns of degenerate primers and increase the number of primer variants that are able to participate in the reaction. The impact of tailed primers was less pronounced in the panFMDV-3D assay with only 5 out of 50 isolates showing a clear shift in Cq values. Sequence analysis of the target region of these 5 isolates revealed several mutations in the inter-primer region that extend an existing hairpin structure immediately downstream of the forward primer binding site. Stabilisation of the forward primer with either a tail sequence or cationic spermine units restored the sensitivity of the assay, which suggests that the enhancing effect in the panFMDV-3D assay is due to a more efficient extension of the forward primer. ur results show that primer tailing can alter amplification through various mechanisms that are determined by both the assay and target region. These findings expand our understanding of primer tailing and should enable a more targeted and efficient use of tailed primers. PMID:27723800
Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma

PubMed Central

Wrzeszczynski, Kazimierz O.; Frank, Mayu O.; Koyama, Takahiko; Rhrissorrakrai, Kahn; Robine, Nicolas; Utro, Filippo; Emde, Anne-Katrin; Chen, Bo-Juen; Arora, Kanika; Shah, Minita; Vacic, Vladimir; Norel, Raquel; Bilal, Erhan; Bergmann, Ewa A.; Moore Vogel, Julia L.; Bruce, Jeffrey N.; Lassman, Andrew B.; Canoll, Peter; Grommes, Christian; Harvey, Steve; Parida, Laxmi; Michelini, Vanessa V.; Zody, Michael C.; Jobanputra, Vaidehi; Royyuru, Ajay K.

2017-01-01

Objective: To analyze a glioblastoma tumor specimen with 3 different platforms and compare potentially actionable calls from each. Methods: Tumor DNA was analyzed by a commercial targeted panel. In addition, tumor-normal DNA was analyzed by whole-genome sequencing (WGS) and tumor RNA was analyzed by RNA sequencing (RNA-seq). The WGS and RNA-seq data were analyzed by a team of bioinformaticians and cancer oncologists, and separately by IBM Watson Genomic Analytics (WGA), an automated system for prioritizing somatic variants and identifying drugs. Results: More variants were identified by WGS/RNA analysis than by targeted panels. WGA completed a comparable analysis in a fraction of the time required by the human analysts. Conclusions: The development of an effective human-machine interface in the analysis of deep cancer genomic datasets may provide potentially clinically actionable calls for individual patients in a more timely and efficient manner than currently possible. ClinicalTrials.gov identifier: NCT02725684. PMID:28740869
Model-based reconstruction of synthetic promoter library in Corynebacterium glutamicum.

PubMed

Zhang, Shuanghong; Liu, Dingyu; Mao, Zhitao; Mao, Yufeng; Ma, Hongwu; Chen, Tao; Zhao, Xueming; Wang, Zhiwen

2018-05-01

To develop an efficient synthetic promoter library for fine-tuned expression of target genes in Corynebacterium glutamicum. A synthetic promoter library for C. glutamicum was developed based on conserved sequences of the - 10 and - 35 regions. The synthetic promoter library covered a wide range of strengths, ranging from 1 to 193% of the tac promoter. 68 promoters were selected and sequenced for correlation analysis between promoter sequence and strength with a statistical model. A new promoter library was further reconstructed with improved promoter strength and coverage based on the results of correlation analysis. Tandem promoter P70 was finally constructed with increased strength by 121% over the tac promoter. The promoter library developed in this study showed a great potential for applications in metabolic engineering and synthetic biology for the optimization of metabolic networks. To the best of our knowledge, this is the first reconstruction of synthetic promoter library based on statistical analysis of C. glutamicum.
Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing.

PubMed

Angiuoli, Samuel V; White, James R; Matalka, Malcolm; White, Owen; Fricke, W Florian

2011-01-01

The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.
Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

PubMed Central

Angiuoli, Samuel V.; White, James R.; Matalka, Malcolm; White, Owen; Fricke, W. Florian

2011-01-01

Background The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers. PMID:22028928
Sequence analysis of the pyruvylated galactan sulfate-derived oligosaccharides by negative-ion electrospray tandem mass spectrometry.

PubMed

Li, Na; Mao, Wenjun; Liu, Xue; Wang, Shuyao; Xia, Zheng; Cao, Sujian; Li, Lin; Zhang, Qi; Liu, Shan

2016-10-04

Five sulfated oligosaccharide fragments, F1-F5, were prepared from a pyruvylated galactan sulfate from the green alga Codium divaricatum, by partial depolymerization using mild acid hydrolysis and purification with gel-permeation chromatography. Negative-ion electrospray tandem mass spectrometry with collision-induced dissociation (ES-CID-MS/MS) is attempted for sequence determination of the sulfated oligosaccharides. The sequence of F1 with homogeneous disaccharide composition was first characterized to be Galp-(4SO4)-(1 → 3)-Galp by detailed nuclear magnetic resonance spectroscopic analyses. The fragmentation pattern of F1 in the product ion spectra was established on the basis of negative-ion ES-CID MS/MS, which was then applied to sequence analysis of other sulfated oligosaccharides. The sequences of F2 and F3 were deduced to be Galp-(4SO4)-(1 → 3)-Galp-(1 → 3)-Galp-(1 → 3)-Galp and 3,4-O-(1-carboxyethylidene)-Galp-(6SO4)-(1 → 3)-Galp, respectively. The sequences of major fragments in F4 and F5 were also deduced. The investigation demonstrated that negative-ion ES-CID-MS/MS was an efficient method for the sequence analysis of the pyruvylated galactan sulfate-derived oligosaccharides which revealed the patterns of substitution and glycosidic linkages. The pyruvylated galactan sulfate-derived oligosaccharides were novel sulfated oligosaccharides different from other algal polysaccharide-derived oligosaccharides. Copyright © 2016 Elsevier Ltd. All rights reserved.
[Molecular identification of astragali radix and its adulterants by ITS sequences].

PubMed

Cui, Zhan-Hu; Li, Yue; Yuan, Qing-Jun; Zhou, Li-She; Li, Min-Hui

2012-12-01

To explore a new method for identification Astragali Radix from its adulterants by using ITS sequence. Thirteen samples of the different Astragali Radix materials and 6 samples of the adulterants of the roots of Hedysarum polybotrys, Medicago sativa and Althaea rosea were collected. ITS sequence was amplified by PCR and sequenced unidirectionally. The interspecific K-2-P distances of Astragali Radix and its adulterants were calculated, and NJ tree and UPGMA tree were constructed by MEGA 4. ITS sequences were obtained from 19 samples respectively, there were Astragali Radix 646-650 bp, H. polybotrys 664 bp, Medicago sativa 659 bp, Althaea rosea 728 bp, which were registered in the GenBank. Phylogeny trees reconstruction using NJ and UPGMA analysis based on ITS nucleotide sequences can effectively distinguish Astragali Radix from adulterants. ITS sequence can be used to identify Astragali Radix from its adulterants successfully and is an efficient molecular marker for authentication of Astragali Radix and its adulterants.
Elimination sequence optimization for SPAR

NASA Technical Reports Server (NTRS)

Hogan, Harry A.

1986-01-01

SPAR is a large-scale computer program for finite element structural analysis. The program allows user specification of the order in which the joints of a structure are to be eliminated since this order can have significant influence over solution performance, in terms of both storage requirements and computer time. An efficient elimination sequence can improve performance by over 50% for some problems. Obtaining such sequences, however, requires the expertise of an experienced user and can take hours of tedious effort to affect. Thus, an automatic elimination sequence optimizer would enhance productivity by reducing the analysts' problem definition time and by lowering computer costs. Two possible methods for automating the elimination sequence specifications were examined. Several algorithms based on the graph theory representations of sparse matrices were studied with mixed results. Significant improvement in the program performance was achieved, but sequencing by an experienced user still yields substantially better results. The initial results provide encouraging evidence that the potential benefits of such an automatic sequencer would be well worth the effort.
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads.

PubMed

Song, Li; Florea, Liliana

2015-01-01

Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
Simple, quick and cost-efficient: A universal RT-PCR and sequencing strategy for genomic characterisation of foot-and-mouth disease viruses.

PubMed

Dill, V; Beer, M; Hoffmann, B

2017-08-01

Foot-and-mouth disease (FMD) is a major contributor to poverty and food insecurity in Africa and Asia, and it is one of the biggest threats to agriculture in highly developed countries. As FMD is extremely contagious, strategies for its prevention, early detection, and the immediate characterisation of outbreak strains are of great importance. The generation of whole-genome sequences enables phylogenetic characterisation, the epidemiological tracing of virus transmission pathways and is supportive in disease control strategies. This study describes the development and validation of a rapid, universal and cost-efficient RT-PCR system to generate genome sequences of FMDV, reaching from the IRES to the end of the open reading frame. The method was evaluated using twelve different virus strains covering all seven serotypes of FMDV. Additionally, samples from experimentally infected animals were tested to mimic diagnostic field samples. All primer pairs showed a robust amplification with a high sensitivity for all serotypes. In summary, the described assay is suitable for the generation of FMDV sequences from all serotypes to allow immediate phylogenetic analysis, detailed genotyping and molecular epidemiology. Copyright © 2017 Elsevier B.V. All rights reserved.
A novel approach to enhance biological nutrient removal using a culture supernatant from Micrococcus luteus containing resuscitation-promoting factor (Rpf) in SBR process.

PubMed

Liu, Yindong; Su, Xiaomei; Lu, Lian; Ding, Linxian; Shen, Chaofeng

2016-03-01

A culture supernatant from Micrococcus luteus containing resuscitation-promoting factor (SRpf) was used to enhance the biological nutrient removal of potentially functional bacteria. The obtained results suggest that SRpf accelerated the start-up process and significantly enhanced the biological nutrient removal in sequencing batch reactor (SBR). PO4 (3-)-P removal efficiency increased by over 12 % and total nitrogen removal efficiency increased by over 8 % in treatment reactor acclimated by SRpf compared with those without SRpf addition. The Illumina high-throughput sequencing analysis showed that SRpf played an essential role in shifts in the composition and diversity of bacterial community. The phyla of Proteobacteria and Actinobacteria, which were closely related to biological nutrient removal, were greatly abundant after SRpf addition. This study demonstrates that SRpf acclimation or addition might hold great potential as an efficient and cost-effective alternative for wastewater treatment plants (WWTPs) to meet more stringent operation conditions and legislations.
Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa

PubMed Central

2012-01-01

Introduction Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa. Results cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned with Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. Conclusions We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry. PMID:23190771

Molecular analysis of microbial community in a groundwater sample polluted by landfill leachate and seawater*

PubMed Central

Tian, Yang-jie; Yang, Hong; Wu, Xiu-juan; Li, Dao-tang

2005-01-01

Seashore landfill aquifers are environments of special physicochemical conditions (high organic load and high salinity), and microbes in leachate-polluted aquifers play a significant role for intrinsic bioremediation. In order to characterize microbial diversity and look for clues on the relationship between microbial community structure and hydrochemistry, a culture-independent examination of a typical groundwater sample obtained from a seashore landfill was conducted by sequence analysis of 16S rDNA clone library. Two sets of universal 16S rDNA primers were used to amplify DNA extracted from the groundwater so that problems arising from primer efficiency and specificity could be reduced. Of 74 clones randomly selected from the libraries, 30 contained unique sequences whose analysis showed that the majority of them belonged to bacteria (95.9%), with Proteobacteria (63.5%) being the dominant division. One archaeal sequence and one eukaryotic sequence were found as well. Bacterial sequences belonging to the following phylogenic groups were identified: Bacteroidetes (20.3%), β, γ, δ and ε-subdivisions of Proteobacteria (47.3%, 9.5%, 5.4% and 1.3%, respectively), Firmicutes (1.4%), Actinobacteria (2.7%), Cyanobacteria (2.7%). The percentages of Proteobacteria and Bacteroides in seawater were greater than those in the groundwater from a non-seashore landfill, indicating a possible influence of seawater. Quite a few sequences had close relatives in marine or hypersaline environments. Many sequences showed affiliations with microbes involved in anaerobic fermentation. The remarkable abundance of sequences related to (per)chlorate-reducing bacteria (ClRB) in the groundwater was significant and worthy of further study. PMID:15682499
The transcriptional terminator sequences downstream of the covR gene terminate covR/S operon transcription to generate covR monocistronic transcripts in Streptococcus pyogenes.

PubMed

Chiang-Ni, Chuan; Tsou, Chih-Cheng; Lin, Yee-Shin; Chuang, Woei-Jer; Lin, Ming-T; Liu, Ching-Chuan; Wu, Jiunn-Jong

2008-12-31

CovR/S is an important two component regulatory system, which regulates about 15% of the gene expression in Streptococcus pyogenes. The covR/S locus was identified as an operon generating an RNA transcript around 2.5-kb in size. In this study, we found the covR/S operon produced three RNA transcripts (around 2.5-, 1.0-, and 0.8-kb in size). Using RNA transcriptional terminator sequence prediction and transcriptional terminator analysis, we identified two atypical rho-independent terminator sequences downstream of the covR gene and showed these terminator sequences terminate RNA transcription efficiently. These results indicate that covR/S operon generates covR/S transcript and monocistronic covR transcripts.
Genomics and metagenomics in medical microbiology.

PubMed

Padmanabhan, Roshan; Mishra, Ajay Kumar; Raoult, Didier; Fournier, Pierre-Edouard

2013-12-01

Over the last two decades, sequencing tools have evolved from laborious time-consuming methodologies to real-time detection and deciphering of genomic DNA. Genome sequencing, especially using next generation sequencing (NGS) has revolutionized the landscape of microbiology and infectious disease. This deluge of sequencing data has not only enabled advances in fundamental biology but also helped improve diagnosis, typing of pathogen, virulence and antibiotic resistance detection, and development of new vaccines and culture media. In addition, NGS also enabled efficient analysis of complex human micro-floras, both commensal, and pathological, through metagenomic methods, thus helping the comprehension and management of human diseases such as obesity. This review summarizes technological advances in genomics and metagenomics relevant to the field of medical microbiology. Copyright © 2013 Elsevier B.V. All rights reserved.
High-Performance Integrated Virtual Environment (HIVE) Tools and Applications for Big Data Analysis.

PubMed

Simonyan, Vahan; Mazumder, Raja

2014-09-30

The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently, available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.
High-Performance Integrated Virtual Environment (HIVE) Tools and Applications for Big Data Analysis

PubMed Central

Simonyan, Vahan; Mazumder, Raja

2014-01-01

The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently, available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis. PMID:25271953
Condenser-type diffusion denuders for the collection of sulfur dioxide in a cleanroom.

PubMed

Chang, In-Hyoung; Lee, Dong Soo; Ock, Soon-Ho

2003-02-01

High-efficiency condenser-type diffusion denuders of cylindrical and planar geometries are described. The film condensation of water vapor onto a cooled denuder surface can be used as a method for collecting water-soluble gases. By using SO(2) as the test gas, the planar design offers quantitative collection efficiency at air sampling rates up to 5 L min(-1). Coupled to ion chromatography, the limit of detection (LOD) for SO(2) is 0.014 ppbv with a 30-min successive analysis sequence. The method has been successfully applied to the analysis of temperature- and humidity-controlled cleanroom air.
Programs for analysis and resizing of complex structures. [computerized minimum weight design

NASA Technical Reports Server (NTRS)

Haftka, R. T.; Prasad, B.

1978-01-01

The paper describes the PARS (Programs for Analysis and Resizing of Structures) system. PARS is a user oriented system of programs for the minimum weight design of structures modeled by finite elements and subject to stress, displacement, flutter and thermal constraints. The system is built around SPAR - an efficient and modular general purpose finite element program, and consists of a series of processors that communicate through the use of a data base. An efficient optimizer based on the Sequence of Unconstrained Minimization Technique (SUMT) with an extended interior penalty function and Newton's method is used. Several problems are presented for demonstration of the system capabilities.
Transcriptome analysis of Pseudomonas syringae identifies new genes, ncRNAs, and antisense activity

USDA-ARS?s Scientific Manuscript database

To fully understand how bacteria respond to their environment, it is essential to assess genome-wide transcriptional activity. New high throughput sequencing technologies make it possible to query the transcriptome of an organism in an efficient unbiased manner. We applied a strand-specific method t...
Role of indirect readout mechanism in TATA box binding protein-DNA interaction.

PubMed

Mondal, Manas; Choudhury, Devapriya; Chakrabarti, Jaydeb; Bhattacharyya, Dhananjay

2015-03-01

Gene expression generally initiates from recognition of TATA-box binding protein (TBP) to the minor groove of DNA of TATA box sequence where the DNA structure is significantly different from B-DNA. We have carried out molecular dynamics simulation studies of TBP-DNA system to understand how the DNA structure alters for efficient binding. We observed rigid nature of the protein while the DNA of TATA box sequence has an inherent flexibility in terms of bending and minor groove widening. The bending analysis of the free DNA and the TBP bound DNA systems indicate presence of some similar structures. Principal coordinate ordination analysis also indicates some structural features of the protein bound and free DNA are similar. Thus we suggest that the DNA of TATA box sequence regularly oscillates between several alternate structures and the one suitable for TBP binding is induced further by the protein for proper complex formation.
Mining sequential patterns for protein fold recognition.

PubMed

Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I

2008-02-01

Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
Efficient moving target analysis for inverse synthetic aperture radar images via joint speeded-up robust features and regular moment

NASA Astrophysics Data System (ADS)

Yang, Hongxin; Su, Fulin

2018-01-01

We propose a moving target analysis algorithm using speeded-up robust features (SURF) and regular moment in inverse synthetic aperture radar (ISAR) image sequences. In our study, we first extract interest points from ISAR image sequences by SURF. Different from traditional feature point extraction methods, SURF-based feature points are invariant to scattering intensity, target rotation, and image size. Then, we employ a bilateral feature registering model to match these feature points. The feature registering scheme can not only search the isotropic feature points to link the image sequences but also reduce the error matching pairs. After that, the target centroid is detected by regular moment. Consequently, a cost function based on correlation coefficient is adopted to analyze the motion information. Experimental results based on simulated and real data validate the effectiveness and practicability of the proposed method.
The emerging High Efficiency Video Coding standard (HEVC)

NASA Astrophysics Data System (ADS)

Raja, Gulistan; Khan, Awais

2013-12-01

High definition video (HDV) is becoming popular day by day. This paper describes the performance analysis of latest upcoming video standard known as High Efficiency Video Coding (HEVC). HEVC is designed to fulfil all the requirements for future high definition videos. In this paper, three configurations (intra only, low delay and random access) of HEVC are analyzed using various 480p, 720p and 1080p high definition test video sequences. Simulation results show the superior objective and subjective quality of HEVC.
Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics.

PubMed

Dutheil, Julien; Gaillard, Sylvain; Bazin, Eric; Glémin, Sylvain; Ranwez, Vincent; Galtier, Nicolas; Belkhir, Khalid

2006-04-04

A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/output methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications. We present Bio++, a set of Object Oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus. Implementation of methods aims at being both efficient and user-friendly. A special concern was given to the library design to enable easy extension and new methods development. We defined a general hierarchy of classes that allow the developer to implement its own algorithms while remaining compatible with the rest of the libraries. Bio++ source code is distributed free of charge under the CeCILL general public licence from its website http://kimura.univ-montp2.fr/BioPP.
Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments.

PubMed

Canver, Matthew C; Haeussler, Maximilian; Bauer, Daniel E; Orkin, Stuart H; Sanjana, Neville E; Shalem, Ophir; Yuan, Guo-Cheng; Zhang, Feng; Concordet, Jean-Paul; Pinello, Luca

2018-05-01

CRISPR (clustered regularly interspaced short palindromic repeats) genome-editing experiments offer enormous potential for the evaluation of genomic loci using arrayed single guide RNAs (sgRNAs) or pooled sgRNA libraries. Numerous computational tools are available to help design sgRNAs with optimal on-target efficiency and minimal off-target potential. In addition, computational tools have been developed to analyze deep-sequencing data resulting from genome-editing experiments. However, these tools are typically developed in isolation and oftentimes are not readily translatable into laboratory-based experiments. Here, we present a protocol that describes in detail both the computational and benchtop implementation of an arrayed and/or pooled CRISPR genome-editing experiment. This protocol provides instructions for sgRNA design with CRISPOR (computational tool for the design, evaluation, and cloning of sgRNA sequences), experimental implementation, and analysis of the resulting high-throughput sequencing data with CRISPResso (computational tool for analysis of genome-editing outcomes from deep-sequencing data). This protocol allows for design and execution of arrayed and pooled CRISPR experiments in 4-5 weeks by non-experts, as well as computational data analysis that can be performed in 1-2 d by both computational and noncomputational biologists alike using web-based and/or command-line versions.
Recurrent Network models of sequence generation and memory

PubMed Central

Rajan, Kanaka; Harvey, Christopher D; Tank, David W

2016-01-01

SUMMARY Sequential activation of neurons is a common feature of network activity during a variety of behaviors, including working memory and decision making. Previous network models for sequences and memory emphasized specialized architectures in which a principled mechanism is pre-wired into their connectivity. Here, we demonstrate that starting from random connectivity and modifying a small fraction of connections, a largely disordered recurrent network can produce sequences and implement working memory efficiently. We use this process, called Partial In-Network training (PINning), to model and match cellular-resolution imaging data from the posterior parietal cortex during a virtual memory-guided two-alternative forced choice task [Harvey, Coen and Tank, 2012]. Analysis of the connectivity reveals that sequences propagate by the cooperation between recurrent synaptic interactions and external inputs, rather than through feedforward or asymmetric connections. Together our results suggest that neural sequences may emerge through learning from largely unstructured network architectures. PMID:26971945
Optimized approach for Ion Proton RNA sequencing reveals details of RNA splicing and editing features of the transcriptome.

PubMed

Brown, Roger B; Madrid, Nathaniel J; Suzuki, Hideaki; Ness, Scott A

2017-01-01

RNA-sequencing (RNA-seq) has become the standard method for unbiased analysis of gene expression but also provides access to more complex transcriptome features, including alternative RNA splicing, RNA editing, and even detection of fusion transcripts formed through chromosomal translocations. However, differences in library methods can adversely affect the ability to recover these different types of transcriptome data. For example, some methods have bias for one end of transcripts or rely on low-efficiency steps that limit the complexity of the resulting library, making detection of rare transcripts less likely. We tested several commonly used methods of RNA-seq library preparation and found vast differences in the detection of advanced transcriptome features, such as alternatively spliced isoforms and RNA editing sites. By comparing several different protocols available for the Ion Proton sequencer and by utilizing detailed bioinformatics analysis tools, we were able to develop an optimized random primer based RNA-seq technique that is reliable at uncovering rare transcript isoforms and RNA editing features, as well as fusion reads from oncogenic chromosome rearrangements. The combination of optimized libraries and rapid Ion Proton sequencing provides a powerful platform for the transcriptome analysis of research and clinical samples.
Sequence analysis of ORF IV RTBV isolated from tungro infected Oryza sativa L. cv Ciherang

NASA Astrophysics Data System (ADS)

Hastilestari, Bernadetta Rina; Astuti, Dwi; Estiati, Amy; Nugroho, Satya

2015-09-01

The Effort to increase rice production is often constrained by pest and disease such as Tungro. The Tungro disease is caused by the joint infection with two dissimilar viruses; a bacil-form-DNA virus, the Rice tungro bacilliform virus(RTBV) and the spherical RNA virus, Rice tungro spherical virus (RTSV) and transmitted by Green leafhopper (Nephotettix virescens). The symptom of disease is caused by the presence of RTBV. The genome of RTBV consists of four Open reading frames (ORFs) which encode functional proteins. Of the four, ORF IV is unique because it exists only in RTBV. The most efficient method of generating disease resistance plants is to look for natural sources of resistance genes in wild or germplasm and then transfer the gene and the accompanying resistance in cultivated crop varieties. The aim of this study is, therefore, to isolate and analyze of 1170 bp gene of ORF 4 of Tungro virus isolated from an Indonesian rice cultivar, Ciherang (Oryza sativa L. cv Indica). DNA sequencing analysis using BLAST showed 94% similarity with the reference sequence gen bank Acc.M65026.1. The comparisons and mutation analysis of DNA sequences were discussed in this research.
Multi-objective Analysis for a Sequencing Planning of Mixed-model Assembly Line

NASA Astrophysics Data System (ADS)

Shimizu, Yoshiaki; Waki, Toshiya; Yoo, Jae Kyu

Diversified customer demands are raising importance of just-in-time and agile manufacturing much more than before. Accordingly, introduction of mixed-model assembly lines becomes popular to realize the small-lot-multi-kinds production. Since it produces various kinds on the same assembly line, a rational management is of special importance. With this point of view, this study focuses on a sequencing problem of mixed-model assembly line including a paint line as its preceding process. By taking into account the paint line together, reducing work-in-process (WIP) inventory between these heterogeneous lines becomes a major concern of the sequencing problem besides improving production efficiency. Finally, we have formulated the sequencing problem as a bi-objective optimization problem to prevent various line stoppages, and to reduce the volume of WIP inventory simultaneously. Then we have proposed a practical method for the multi-objective analysis. For this purpose, we applied the weighting method to derive the Pareto front. Actually, the resulting problem is solved by a meta-heuristic method like SA (Simulated Annealing). Through numerical experiments, we verified the validity of the proposed approach, and discussed the significance of trade-off analysis between the conflicting objectives.
Efficient Semiparametric Inference Under Two-Phase Sampling, With Applications to Genetic Association Studies.

PubMed

Tao, Ran; Zeng, Donglin; Lin, Dan-Yu

2017-01-01

In modern epidemiological and clinical studies, the covariates of interest may involve genome sequencing, biomarker assay, or medical imaging and thus are prohibitively expensive to measure on a large number of subjects. A cost-effective solution is the two-phase design, under which the outcome and inexpensive covariates are observed for all subjects during the first phase and that information is used to select subjects for measurements of expensive covariates during the second phase. For example, subjects with extreme values of quantitative traits were selected for whole-exome sequencing in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP). Herein, we consider general two-phase designs, where the outcome can be continuous or discrete, and inexpensive covariates can be continuous and correlated with expensive covariates. We propose a semiparametric approach to regression analysis by approximating the conditional density functions of expensive covariates given inexpensive covariates with B-spline sieves. We devise a computationally efficient and numerically stable EM-algorithm to maximize the sieve likelihood. In addition, we establish the consistency, asymptotic normality, and asymptotic efficiency of the estimators. Furthermore, we demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. Finally, we present applications to the aforementioned NHLBI ESP.
The Automated Array Assembly Task of the Low-cost Silicon Solar Array Project, Phase 2

NASA Technical Reports Server (NTRS)

Coleman, M. G.; Grenon, L.; Pastirik, E. M.; Pryor, R. A.; Sparks, T. G.

1978-01-01

An advanced process sequence for manufacturing high efficiency solar cells and modules in a cost-effective manner is discussed. Emphasis is on process simplicity and minimizing consumed materials. The process sequence incorporates texture etching, plasma processes for damage removal and patterning, ion implantation, low pressure silicon nitride deposition, and plated metal. A reliable module design is presented. Specific process step developments are given. A detailed cost analysis was performed to indicate future areas of fruitful cost reduction effort. Recommendations for advanced investigations are included.

dictyExpress: a web-based platform for sequence data management and analytics in Dictyostelium and beyond.

PubMed

Stajdohar, Miha; Rosengarten, Rafael D; Kokosar, Janez; Jeran, Luka; Blenkus, Domen; Shaulsky, Gad; Zupan, Blaz

2017-06-02

Dictyostelium discoideum, a soil-dwelling social amoeba, is a model for the study of numerous biological processes. Research in the field has benefited mightily from the adoption of next-generation sequencing for genomics and transcriptomics. Dictyostelium biologists now face the widespread challenges of analyzing and exploring high dimensional data sets to generate hypotheses and discovering novel insights. We present dictyExpress (2.0), a web application designed for exploratory analysis of gene expression data, as well as data from related experiments such as Chromatin Immunoprecipitation sequencing (ChIP-Seq). The application features visualization modules that include time course expression profiles, clustering, gene ontology enrichment analysis, differential expression analysis and comparison of experiments. All visualizations are interactive and interconnected, such that the selection of genes in one module propagates instantly to visualizations in other modules. dictyExpress currently stores the data from over 800 Dictyostelium experiments and is embedded within a general-purpose software framework for management of next-generation sequencing data. dictyExpress allows users to explore their data in a broader context by reciprocal linking with dictyBase-a repository of Dictyostelium genomic data. In addition, we introduce a companion application called GenBoard, an intuitive graphic user interface for data management and bioinformatics analysis. dictyExpress and GenBoard enable broad adoption of next generation sequencing based inquiries by the Dictyostelium research community. Labs without the means to undertake deep sequencing projects can mine the data available to the public. The entire information flow, from raw sequence data to hypothesis testing, can be accomplished in an efficient workspace. The software framework is generalizable and represents a useful approach for any research community. To encourage more wide usage, the backend is open-source, available for extension and further development by bioinformaticians and data scientists.
ITS1: a DNA barcode better than ITS2 in eukaryotes?

PubMed

Wang, Xin-Cun; Liu, Chang; Huang, Liang; Bengtsson-Palme, Johan; Chen, Haimei; Zhang, Jian-Hui; Cai, Dayong; Li, Jian-Qin

2015-05-01

A DNA barcode is a short piece of DNA sequence used for species determination and discovery. The internal transcribed spacer (ITS/ITS2) region has been proposed as the standard DNA barcode for fungi and seed plants and has been widely used in DNA barcoding analyses for other biological groups, for example algae, protists and animals. The ITS region consists of both ITS1 and ITS2 regions. Here, a large-scale meta-analysis was carried out to compare ITS1 and ITS2 from three aspects: PCR amplification, DNA sequencing and species discrimination, in terms of the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality. In total, 85 345 sequence pairs in 10 major groups of eukaryotes, including ascomycetes, basidiomycetes, liverworts, mosses, ferns, gymnosperms, monocotyledons, eudicotyledons, insects and fishes, covering 611 families, 3694 genera, and 19 060 species, were analysed. Using similarity-based methods, we calculated species discrimination efficiencies for ITS1 and ITS2 in all major groups, families and genera. Using Fisher's exact test, we found that ITS1 has significantly higher efficiencies than ITS2 in 17 of the 47 families and 20 of the 49 genera, which are sample-rich. By in silico PCR amplification evaluation, primer universality of the extensively applied ITS1 primers was found superior to that of ITS2 primers. Additionally, shorter length of amplification product and lower GC content was discovered to be two other advantages of ITS1 for sequencing. In summary, ITS1 represents a better DNA barcode than ITS2 for eukaryotic species. © 2014 John Wiley & Sons Ltd.
Vander Lugt correlation of DNA sequence data

NASA Astrophysics Data System (ADS)

Christens-Barry, William A.; Hawk, James F.; Martin, James C.

1990-12-01

DNA, the molecule containing the genetic code of an organism, is a linear chain of subunits. It is the sequence of subunits, of which there are four kinds, that constitutes the unique blueprint of an individual. This sequence is the focus of a large number of analyses performed by an army of geneticists, biologists, and computer scientists. Most of these analyses entail searches for specific subsequences within the larger set of sequence data. Thus, most analyses are essentially pattern recognition or correlation tasks. Yet, there are special features to such analysis that influence the strategy and methods of an optical pattern recognition approach. While the serial processing employed in digital electronic computers remains the main engine of sequence analyses, there is no fundamental reason that more efficient parallel methods cannot be used. We describe an approach using optical pattern recognition (OPR) techniques based on matched spatial filtering. This allows parallel comparison of large blocks of sequence data. In this study we have simulated a Vander Lugt1 architecture implementing our approach. Searches for specific target sequence strings within a block of DNA sequence from the Co/El plasmid2 are performed.
Transcriptome-based differentiation of closely-related Miscanthus lines.

PubMed

Chouvarine, Philippe; Cooksey, Amanda M; McCarthy, Fiona M; Ray, David A; Baldwin, Brian S; Burgess, Shane C; Peterson, Daniel G

2012-01-01

Distinguishing between individuals is critical to those conducting animal/plant breeding, food safety/quality research, diagnostic and clinical testing, and evolutionary biology studies. Classical genetic identification studies are based on marker polymorphisms, but polymorphism-based techniques are time and labor intensive and often cannot distinguish between closely related individuals. Illumina sequencing technologies provide the detailed sequence data required for rapid and efficient differentiation of related species, lines/cultivars, and individuals in a cost-effective manner. Here we describe the use of Illumina high-throughput exome sequencing, coupled with SNP mapping, as a rapid means of distinguishing between related cultivars of the lignocellulosic bioenergy crop giant miscanthus (Miscanthus × giganteus). We provide the first exome sequence database for Miscanthus species complete with Gene Ontology (GO) functional annotations. A SNP comparative analysis of rhizome-derived cDNA sequences was successfully utilized to distinguish three Miscanthus × giganteus cultivars from each other and from other Miscanthus species. Moreover, the resulting phylogenetic tree generated from SNP frequency data parallels the known breeding history of the plants examined. Some of the giant miscanthus plants exhibit considerable sequence divergence. Here we describe an analysis of Miscanthus in which high-throughput exome sequencing was utilized to differentiate between closely related genotypes despite the current lack of a reference genome sequence. We functionally annotated the exome sequences and provide resources to support Miscanthus systems biology. In addition, we demonstrate the use of the commercial high-performance cloud computing to do computational GO annotation.
Analysis of sequencing data for probing RNA secondary structures and protein-RNA binding in studying posttranscriptional regulations.

PubMed

Hu, Xihao; Wu, Yang; Lu, Zhi John; Yip, Kevin Y

2016-11-01

High-throughput sequencing has been used to study posttranscriptional regulations, where the identification of protein-RNA binding is a major and fast-developing sub-area, which is in turn benefited by the sequencing methods for whole-transcriptome probing of RNA secondary structures. In the study of RNA secondary structures using high-throughput sequencing, bases are modified or cleaved according to their structural features, which alter the resulting composition of sequencing reads. In the study of protein-RNA binding, methods have been proposed to immuno-precipitate (IP) protein-bound RNA transcripts in vitro or in vivo By sequencing these transcripts, the protein-RNA interactions and the binding locations can be identified. For both types of data, read counts are affected by a combination of confounding factors, including expression levels of transcripts, sequence biases, mapping errors and the probing or IP efficiency of the experimental protocols. Careful processing of the sequencing data and proper extraction of important features are fundamentally important to a successful analysis. Here we review and compare different experimental methods for probing RNA secondary structures and binding sites of RNA-binding proteins (RBPs), and the computational methods proposed for analyzing the corresponding sequencing data. We suggest how these two types of data should be integrated to study the structural properties of RBP binding sites as a systematic way to better understand posttranscriptional regulations. © The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
An efficient method for the prediction of deleterious multiple-point mutations in the secondary structure of RNAs using suboptimal folding solutions

PubMed Central

Churkin, Alexander; Barash, Danny

2008-01-01

Background RNAmute is an interactive Java application which, given an RNA sequence, calculates the secondary structure of all single point mutations and organizes them into categories according to their similarity to the predicted structure of the wild type. The secondary structure predictions are performed using the Vienna RNA package. A more efficient implementation of RNAmute is needed, however, to extend from the case of single point mutations to the general case of multiple point mutations, which may often be desired for computational predictions alongside mutagenesis experiments. But analyzing multiple point mutations, a process that requires traversing all possible mutations, becomes highly expensive since the running time is O(nm) for a sequence of length n with m-point mutations. Using Vienna's RNAsubopt, we present a method that selects only those mutations, based on stability considerations, which are likely to be conformational rearranging. The approach is best examined using the dot plot representation for RNA secondary structure. Results Using RNAsubopt, the suboptimal solutions for a given wild-type sequence are calculated once. Then, specific mutations are selected that are most likely to cause a conformational rearrangement. For an RNA sequence of about 100 nts and 3-point mutations (n = 100, m = 3), for example, the proposed method reduces the running time from several hours or even days to several minutes, thus enabling the practical application of RNAmute to the analysis of multiple-point mutations. Conclusion A highly efficient addition to RNAmute that is as user friendly as the original application but that facilitates the practical analysis of multiple-point mutations is presented. Such an extension can now be exploited prior to site-directed mutagenesis experiments by virologists, for example, who investigate the change of function in an RNA virus via mutations that disrupt important motifs in its secondary structure. A complete explanation of the application, called MultiRNAmute, is available at [1]. PMID:18445289
Sequenced sorghum mutant library- an efficient platform for discovery of causal gene mutations

USDA-ARS?s Scientific Manuscript database

Ethyl methanesulfonate (EMS) efficiently generates high-density mutations in genomes. We applied whole-genome sequencing to 256 phenotyped mutant lines of sorghum (Sorghum bicolor L. Moench) to 16x coverage. Comparisons with the reference sequence revealed >1.8 million canonical EMS-induced G/C to A...
Transcriptome Sequencing of Gracilariopsis lemaneiformis to Analyze the Genes Related to Optically Active Phycoerythrin Synthesis.

PubMed

Huang, Xiaoyun; Zang, Xiaonan; Wu, Fei; Jin, Yuming; Wang, Haitao; Liu, Chang; Ding, Yating; He, Bangxiang; Xiao, Dongfang; Song, Xinwei; Liu, Zhu

2017-01-01

Gracilariopsis lemaneiformis (aka Gracilaria lemaneiformis) is a red macroalga rich in phycoerythrin, which can capture light efficiently and transfer it to photosystemⅡ. However, little is known about the synthesis of optically active phycoerythrinin in G. lemaneiformis at the molecular level. With the advent of high-throughput sequencing technology, analysis of genetic information for G. lemaneiformis by transcriptome sequencing is an effective means to get a deeper insight into the molecular mechanism of phycoerythrin synthesis. Illumina technology was employed to sequence the transcriptome of two strains of G. lemaneiformis- the wild type and a green-pigmented mutant. We obtained a total of 86915 assembled unigenes as a reference gene set, and 42884 unigenes were annotated in at least one public database. Taking the above transcriptome sequencing as a reference gene set, 4041 differentially expressed genes were screened to analyze and compare the gene expression profiles of the wild type and green mutant. By GO and KEGG pathway analysis, we concluded that three factors, including a reduction in the expression level of apo-phycoerythrin, an increase of chlorophyll light-harvesting complex synthesis, and reduction of phycoerythrobilin by competitive inhibition, caused the reduction of optically active phycoerythrin in the green-pigmented mutant.
Spectrum of benzo[a]pyrene-induced mutations in the Pig-a gene of L5178YTk+/- cells identified with next generation sequencing.

PubMed

Revollo, Javier; Wang, Yiying; McKinzie, Page; Dad, Azra; Pearce, Mason; Heflich, Robert H; Dobrovolsky, Vasily N

2017-12-01

We used Sanger sequencing and next generation sequencing (NGS) for analysis of mutations in the endogenous X-linked Pig-a gene of clonally expanded L5178YTk +/- cells. The clones developed from single cells that were sorted on a flow cytometer based upon the expression pattern of the GPI-anchored marker, CD90, on their surface. CD90-deficient and CD90-proficient cells were sorted from untreated cultures and CD90-deficient cells were sorted from cultures treated with benzo[a]pyrene (B[a]P). Pig-a mutations were identified in all clones developed from CD90-deficient cells; no Pig-a mutations were found in clones of CD90-proficient cells. The spectrum of B[a]P-induced Pig-a mutations was dominated by basepair substitutions, small insertions and deletions at G:C, or at sequences rich in G:C content. We observed high concordance between Pig-a mutations determined by Sanger sequencing and by NGS, but NGS was able to identify mutations in samples that were difficult to analyze by Sanger sequencing (e.g., mixtures of two mutant clones). Overall, the NGS method is a cost and labor efficient high throughput approach for analysis of a large number of mutant clones. Published by Elsevier B.V.
Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.

The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a frameworkmore » based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl. gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu« less
Draft sequencing and comparative genomics of Xylella fastidiosa strains reveal novel biological insights.

PubMed

Bhattacharyya, Anamitra; Stilwagen, Stephanie; Reznik, Gary; Feil, Helene; Feil, William S; Anderson, Iain; Bernal, Axel; D'Souza, Mark; Ivanova, Natalia; Kapatral, Vinayak; Larsen, Niels; Los, Tamara; Lykidis, Athanasios; Selkov, Eugene; Walunas, Theresa L; Purcell, Alexander; Edwards, Rob A; Hawkins, Trevor; Haselkorn, Robert; Overbeek, Ross; Kyrpides, Nikos C; Predki, Paul F

2002-10-01

Draft sequencing is a rapid and efficient method for determining the near-complete sequence of microbial genomes. Here we report a comparative analysis of one complete and two draft genome sequences of the phytopathogenic bacterium, Xylella fastidiosa, which causes serious disease in plants, including citrus, almond, and oleander. We present highlights of an in silico analysis based on a comparison of reconstructions of core biological subsystems. Cellular pathway reconstructions have been used to identify a small number of genes, which are likely to reside within the draft genomes but are not captured in the draft assembly. These represented only a small fraction of all genes and were predominantly large and small ribosomal subunit protein components. By using this approach, some of the inherent limitations of draft sequence can be significantly reduced. Despite the incomplete nature of the draft genomes, it is possible to identify several phage-related genes, which appear to be absent from the draft genomes and not the result of insufficient sequence sampling. This region may therefore identify potential host-specific functions. Based on this first functional reconstruction of a phytopathogenic microbe, we spotlight an unusual respiration machinery as a potential target for biological control. We also predicted and developed a new defined growth medium for Xylella.
Using Next Generation Sequencing for Multiplexed Trait-Linked Markers in Wheat

PubMed Central

Bernardo, Amy; Wang, Shan; St. Amand, Paul; Bai, Guihua

2015-01-01

With the advent of next generation sequencing (NGS) technologies, single nucleotide polymorphisms (SNPs) have become the major type of marker for genotyping in many crops. However, the availability of SNP markers for important traits of bread wheat ( Triticum aestivum L.) that can be effectively used in marker-assisted selection (MAS) is still limited and SNP assays for MAS are usually uniplex. A shift from uniplex to multiplex assays will allow the simultaneous analysis of multiple markers and increase MAS efficiency. We designed 33 locus-specific markers from SNP or indel-based marker sequences that linked to 20 different quantitative trait loci (QTL) or genes of agronomic importance in wheat and analyzed the amplicon sequences using an Ion Torrent Proton Sequencer and a custom allele detection pipeline to determine the genotypes of 24 selected germplasm accessions. Among the 33 markers, 27 were successfully multiplexed and 23 had 100% SNP call rates. Results from analysis of "kompetitive allele-specific PCR" (KASP) and sequence tagged site (STS) markers developed from the same loci fully verified the genotype calls of 23 markers. The NGS-based multiplexed assay developed in this study is suitable for rapid and high-throughput screening of SNPs and some indel-based markers in wheat. PMID:26625271
Whole exome sequencing is an efficient, sensitive and specific method for determining the genetic cause of short-rib thoracic dystrophies.

PubMed

McInerney-Leo, A M; Harris, J E; Leo, P J; Marshall, M S; Gardiner, B; Kinning, E; Leong, H Y; McKenzie, F; Ong, W P; Vodopiutz, J; Wicking, C; Brown, M A; Zankl, A; Duncan, E L

2015-12-01

Short-rib thoracic dystrophies (SRTDs) are congenital disorders due to defects in primary cilium function. SRTDs are recessively inherited with mutations identified in 14 genes to date (comprising 398 exons). Conventional mutation detection (usually by iterative Sanger sequencing) is inefficient and expensive, and often not undertaken. Whole exome massive parallel sequencing has been used to identify new genes for SRTD (WDR34, WDR60 and IFT172); however, the clinical utility of whole exome sequencing (WES) has not been established. WES was performed in 11 individuals with SRTDs. Compound heterozygous or homozygous mutations were identified in six confirmed SRTD genes in 10 individuals (IFT172, DYNC2H1, TTC21B, WDR60, WDR34 and NEK1), giving overall sensitivity of 90.9%. WES data from 993 unaffected individuals sequenced using similar technology showed two individuals with rare (minor allele frequency <0.005) compound heterozygous variants of unknown significance in SRTD genes (specificity >99%). Costs for consumables, laboratory processing and bioinformatic analysis were
Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets.

PubMed

Shrimankar, D D; Sathe, S R

2016-01-01

Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets

PubMed Central

Shrimankar, D. D.; Sathe, S. R.

2016-01-01

Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868
Suppressive subtractive hybridization approach revealed differential expression of hypersensitive response and reactive oxygen species production genes in tea (Camellia sinensis (L.) O. Kuntze) leaves during Pestalotiopsis thea infection.

PubMed

Senthilkumar, Palanisamy; Thirugnanasambantham, Krishnaraj; Mandal, Abul Kalam Azad

2012-12-01

Tea (Camellia sinensis (L.) O. Kuntze) is an economically important plant cultivated for its leaves. Infection of Pestalotiopsis theae in leaves causes gray blight disease and enormous loss to the tea industry. We used suppressive subtractive hybridization (SSH) technique to unravel the differential gene expression pattern during gray blight disease development in tea. Complementary DNA from P. theae-infected and uninfected leaves of disease tolerant cultivar UPASI-10 was used as tester and driver populations respectively. Subtraction efficiency was confirmed by comparing abundance of β-actin gene. A total of 377 and 720 clones with insert size >250 bp from forward and reverse library respectively were sequenced and analyzed. Basic Local Alignment Search Tool analysis revealed 17 sequences in forward SSH library have high degree of similarity with disease and hypersensitive response related genes and 20 sequences with hypothetical proteins while in reverse SSH library, 23 sequences have high degree of similarity with disease and stress response-related genes and 15 sequences with hypothetical proteins. Functional analysis indicated unknown (61 and 59 %) or hypothetical functions (23 and 18 %) for most of the differentially regulated genes in forward and reverse SSH library, respectively, while others have important role in different cellular activities. Majority of the upregulated genes are related to hypersensitive response and reactive oxygen species production. Based on these expressed sequence tag data, putative role of differentially expressed genes were discussed in relation to disease. We also demonstrated the efficiency of SSH as a tool in enriching gray blight disease related up- and downregulated genes in tea. The present study revealed that many genes related to disease resistance were suppressed during P. theae infection and enhancing these genes by the application of inducers may impart better disease tolerance to the plants.
Front-End Electron Transfer Dissociation Coupled to a 21 Tesla FT-ICR Mass Spectrometer for Intact Protein Sequence Analysis

NASA Astrophysics Data System (ADS)

Weisbrod, Chad R.; Kaiser, Nathan K.; Syka, John E. P.; Early, Lee; Mullen, Christopher; Dunyach, Jean-Jacques; English, A. Michelle; Anderson, Lissa C.; Blakney, Greg T.; Shabanowitz, Jeffrey; Hendrickson, Christopher L.; Marshall, Alan G.; Hunt, Donald F.

2017-09-01

High resolution mass spectrometry is a key technology for in-depth protein characterization. High-field Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) enables high-level interrogation of intact proteins in the most detail to date. However, an appropriate complement of fragmentation technologies must be paired with FTMS to provide comprehensive sequence coverage, as well as characterization of sequence variants, and post-translational modifications. Here we describe the integration of front-end electron transfer dissociation (FETD) with a custom-built 21 tesla FT-ICR mass spectrometer, which yields unprecedented sequence coverage for proteins ranging from 2.8 to 29 kDa, without the need for extensive spectral averaging (e.g., 60% sequence coverage for apo-myoglobin with four averaged acquisitions). The system is equipped with a multipole storage device separate from the ETD reaction device, which allows accumulation of multiple ETD fragment ion fills. Consequently, an optimally large product ion population is accumulated prior to transfer to the ICR cell for mass analysis, which improves mass spectral signal-to-noise ratio, dynamic range, and scan rate. We find a linear relationship between protein molecular weight and minimum number of ETD reaction fills to achieve optimum sequence coverage, thereby enabling more efficient use of instrument data acquisition time. Finally, real-time scaling of the number of ETD reactions fills during method-based acquisition is shown, and the implications for LC-MS/MS top-down analysis are discussed. [Figure not available: see fulltext.
eRNA: a graphic user interface-based tool optimized for large data analysis from high-throughput RNA sequencing

PubMed Central

2014-01-01

Background RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers. Results We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module miRNA identification” includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module “mRNA identification” includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module “Target screening” provides expression profiling analyses and graphic visualization. The module “Self-testing” offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs including Bowtie, miRDeep2, and miRspring extend the program’s functionality. Conclusions eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory. PMID:24593312
[Study of a family with epidermolysis bullosa simplex resulting from a novel mutation of KRT14 gene].

PubMed

Meng, Lanlan; Du, Juan; Li, Wen; Lu, Guangxiu; Tan, Yueqiu

2017-08-10

To determine the molecular etiology for a Chinese pedigree affected with epidermolysis bullosa simplex (EBS). Target region sequencing using a hereditary epidermolysis bullosa capture array combined with Sanger sequencing and bioinformatics analysis were used. Mutation taster, PolyPhen-2, Provean, and SIFT software and NCBI online were employed to assess the pathogenicity and conservation of detected mutations. One hundred healthy unrelated individuals were used as controls. Target region sequencing showed that the proband has carried a unreported heterozygous c.1234A>G (p.Ile412Val) mutation of the KRT14 gene, which was confirmed by Sanger sequencing in other 8 affected individuals but not among healthy members of the pedigree. Bioinformatics analysis indicated that the mutation is highly pathogenic. Remarkably, 3 members of the family (2 affected and 1 unaffected) have carried a heterozygous c.1237G>A (p.Ala413Thr) mutation of the KRT14 gene, which was collected in Human Gene Mutation Database (HGMD). Bioinformatics analysis indicated that the mutation may not be pathogenic. Both mutations were not detected among the 100 healthy controls. The novel c.1234A>G(p.Ile412Val) mutation of the KRT14 gene is probably responsible for the disease, while c.1237G>A (p.Ala413Thr) mutation of KRT14 gene may be a polymorphism. Compared with Sanger sequencing, target region capture sequencing is more efficient and can significantly reduce the cost of genetic testing for EBS.
eRNA: a graphic user interface-based tool optimized for large data analysis from high-throughput RNA sequencing.

PubMed

Yuan, Tiezheng; Huang, Xiaoyi; Dittmar, Rachel L; Du, Meijun; Kohli, Manish; Boardman, Lisa; Thibodeau, Stephen N; Wang, Liang

2014-03-05

RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers. We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module miRNA identification" includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module "mRNA identification" includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module "Target screening" provides expression profiling analyses and graphic visualization. The module "Self-testing" offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs including Bowtie, miRDeep2, and miRspring extend the program's functionality. eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory.

Molecular Cytogenetics Guides Massively Parallel Sequencing of a Radiation-Induced Chromosome Translocation in Human Cells.

PubMed

Cornforth, Michael N; Anur, Pavana; Wang, Nicholas; Robinson, Erin; Ray, F Andrew; Bedford, Joel S; Loucas, Bradford D; Williams, Eli S; Peto, Myron; Spellman, Paul; Kollipara, Rahul; Kittler, Ralf; Gray, Joe W; Bailey, Susan M

2018-05-11

Chromosome rearrangements are large-scale structural variants that are recognized drivers of oncogenic events in cancers of all types. Cytogenetics allows for their rapid, genome-wide detection, but does not provide gene-level resolution. Massively parallel sequencing (MPS) promises DNA sequence-level characterization of the specific breakpoints involved, but is strongly influenced by bioinformatics filters that affect detection efficiency. We sought to characterize the breakpoint junctions of chromosomal translocations and inversions in the clonal derivatives of human cells exposed to ionizing radiation. Here, we describe the first successful use of DNA paired-end analysis to locate and sequence across the breakpoint junctions of a radiation-induced reciprocal translocation. The analyses employed, with varying degrees of success, several well-known bioinformatics algorithms, a task made difficult by the involvement of repetitive DNA sequences. As for underlying mechanisms, the results of Sanger sequencing suggested that the translocation in question was likely formed via microhomology-mediated non-homologous end joining (mmNHEJ). To our knowledge, this represents the first use of MPS to characterize the breakpoint junctions of a radiation-induced chromosomal translocation in human cells. Curiously, these same approaches were unsuccessful when applied to the analysis of inversions previously identified by directional genomic hybridization (dGH). We conclude that molecular cytogenetics continues to provide critical guidance for structural variant discovery, validation and in "tuning" analysis filters to enable robust breakpoint identification at the base pair level.
Collaborative development for setup, execution, sharing and analytics of complex NMR experiments.

PubMed

Irvine, Alistair G; Slynko, Vadim; Nikolaev, Yaroslav; Senthamarai, Russell R P; Pervushin, Konstantin

2014-02-01

Factory settings of NMR pulse sequences are rarely ideal for every scenario in which they are utilised. The optimisation of NMR experiments has for many years been performed locally, with implementations often specific to an individual spectrometer. Furthermore, these optimised experiments are normally retained solely for the use of an individual laboratory, spectrometer or even single user. Here we introduce a web-based service that provides a database for the deposition, annotation and optimisation of NMR experiments. The application uses a Wiki environment to enable the collaborative development of pulse sequences. It also provides a flexible mechanism to automatically generate NMR experiments from deposited sequences. Multidimensional NMR experiments of proteins and other macromolecules consume significant resources, in terms of both spectrometer time and effort required to analyse the results. Systematic analysis of simulated experiments can enable optimal allocation of NMR resources for structural analysis of proteins. Our web-based application (http://nmrplus.org) provides all the necessary information, includes the auxiliaries (waveforms, decoupling sequences etc.), for analysis of experiments by accurate numerical simulation of multidimensional NMR experiments. The online database of the NMR experiments, together with a systematic evaluation of their sensitivity, provides a framework for selection of the most efficient pulse sequences. The development of such a framework provides a basis for the collaborative optimisation of pulse sequences by the NMR community, with the benefits of this collective effort being available to the whole community. Copyright © 2013 Elsevier Inc. All rights reserved.
Use of tuf Sequences for Genus-Specific PCR Detection and Phylogenetic Analysis of 28 Streptococcal Species

PubMed Central

Picard, François J.; Ke, Danbing; Boudreau, Dominique K.; Boissinot, Maurice; Huletsky, Ann; Richard, Dave; Ouellette, Marc; Roy, Paul H.; Bergeron, Michel G.

2004-01-01

A 761-bp portion of the tuf gene (encoding the elongation factor Tu) from 28 clinically relevant streptococcal species was obtained by sequencing amplicons generated using broad-range PCR primers. These tuf sequences were used to select Streptococcus-specific PCR primers and to perform phylogenetic analysis. The specificity of the PCR assay was verified using 102 different bacterial species, including the 28 streptococcal species. Genomic DNA purified from all streptococcal species was efficiently detected, whereas there was no amplification with DNA from 72 of the 74 nonstreptococcal bacterial species tested. There was cross-amplification with DNAs from Enterococcus durans and Lactococcus lactis. However, the 15 to 31% nucleotide sequence divergence in the 761-bp tuf portion of these two species compared to any streptococcal tuf sequence provides ample sequence divergence to allow the development of internal probes specific to streptococci. The Streptococcus-specific assay was highly sensitive for all 28 streptococcal species tested (i.e., detection limit of 1 to 10 genome copies per PCR). The tuf sequence data was also used to perform extensive phylogenetic analysis, which was generally in agreement with phylogeny determined on the basis of 16S rRNA gene data. However, the tuf gene provided a better discrimination at the streptococcal species level that should be particularly useful for the identification of very closely related species. In conclusion, tuf appears more suitable than the 16S ribosomal RNA gene for the development of diagnostic assays for the detection and identification of streptococcal species because of its higher level of species-specific genetic divergence. PMID:15297518
Generation of sequence signatures from DNA amplification fingerprints with mini-hairpin and microsatellite primers.

PubMed

Caetano-Anollés, G; Gresshoff, P M

1996-06-01

DNA amplification fingerprinting (DAF) with mini-hairpins harboring arbitrary "core" sequences at their 3' termini were used to fingerprint a variety of templates, including PCR products and whole genomes, to establish genetic relationships between plant tax at the interspecific and intraspecific level, and to identify closely related fungal isolates and plant accessions. No correlation was observed between the sequence of the arbitrary core, the stability of the mini-hairpin structure and DAF efficiency. Mini-hairpin primers with short arbitrary cores and primers complementary to simple sequence repeats present in microsatellites were also used to generate arbitrary signatures from amplification profiles (ASAP). The ASAP strategy is a dual-step amplification procedure that uses at least one primer in each fingerprinting stage. ASAP was able to reproducibly amplify DAF products (representing about 10-15 kb of sequence) following careful optimization of amplification parameters such as primer and template concentration. Avoidance of primer sequences partially complementary to DAF product termini was necessary in order to produce distinct fingerprints. This allowed the combinatorial use of oligomers in nucleic acid screening, with numerous ASAP fingerprinting reactions based on a limited number of primer sequences. Mini-hairpin primers and ASAP analysis significantly increased detection of polymorphic DNA, separating closely related bermudagrass (Cynodon) cultivars and detecting putatively linked markers in bulked segregant analysis of the soybean (Glycine max) supernodulation (nitrate-tolerant symbiosis) locus.
Genomic analysis of the aconidial and high-performance protein producer, industrially relevant Aspergillus niger SH2 strain.

PubMed

Yin, Chao; Wang, Bin; He, Pan; Lin, Ying; Pan, Li

2014-05-15

Aspergillus niger is usually regarded as a beneficial species widely used in biotechnological industry. Obtaining the genome sequence of the widely used aconidial A. niger SH2 strain is of great importance to understand its unusual production capability. In this study we assembled a high-quality genome sequence of A. niger SH2 with approximately 11,517 ORFs. Relatively high proportion of genes enriched for protein expression related FunCat items verify its efficient capacity in protein production. Furthermore, genome-wide comparative analysis between A. niger SH2 and CBS513.88 reveals insights into unique properties of A. niger SH2. A. niger SH2 lacks the gene related with the initiation of asexual sporulation (PrpA), leading to its distinct aconidial phenotype. Frame shift mutations and non-synonymous SNPs in genes of cell wall integrity signaling, β-1,3-glucan synthesis and chitin synthesis influence its cell wall development which is important for its hyphal fragmentation during industrial high-efficiency protein production. Copyright © 2014 Elsevier B.V. All rights reserved.
Microprocessor activity controls differential miRNA biogenesis In Vivo.

PubMed

Conrad, Thomas; Marsico, Annalisa; Gehre, Maja; Orom, Ulf Andersson

2014-10-23

In miRNA biogenesis, pri-miRNA transcripts are converted into pre-miRNA hairpins. The in vivo properties of this process remain enigmatic. Here, we determine in vivo transcriptome-wide pri-miRNA processing using next-generation sequencing of chromatin-associated pri-miRNAs. We identify a distinctive Microprocessor signature in the transcriptome profile from which efficiency of the endogenous processing event can be accurately quantified. This analysis reveals differential susceptibility to Microprocessor cleavage as a key regulatory step in miRNA biogenesis. Processing is highly variable among pri-miRNAs and a better predictor of miRNA abundance than primary transcription itself. Processing is also largely stable across three cell lines, suggesting a major contribution of sequence determinants. On the basis of differential processing efficiencies, we define functionality for short sequence features adjacent to the pre-miRNA hairpin. In conclusion, we identify Microprocessor as the main hub for diversified miRNA output and suggest a role for uncoupling miRNA biogenesis from host gene expression. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
Comparative Analysis of Single-Cell RNA Sequencing Methods.

PubMed

Ziegenhain, Christoph; Vieth, Beate; Parekh, Swati; Reinius, Björn; Guillaumet-Adkins, Amy; Smets, Martha; Leonhardt, Heinrich; Heyn, Holger; Hellmann, Ines; Enard, Wolfgang

2017-02-16

Single-cell RNA sequencing (scRNA-seq) offers new possibilities to address biological and medical questions. However, systematic comparisons of the performance of diverse scRNA-seq protocols are lacking. We generated data from 583 mouse embryonic stem cells to evaluate six prominent scRNA-seq methods: CEL-seq2, Drop-seq, MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2. While Smart-seq2 detected the most genes per cell and across cells, CEL-seq2, Drop-seq, MARS-seq, and SCRB-seq quantified mRNA levels with less amplification noise due to the use of unique molecular identifiers (UMIs). Power simulations at different sequencing depths showed that Drop-seq is more cost-efficient for transcriptome quantification of large numbers of cells, while MARS-seq, SCRB-seq, and Smart-seq2 are more efficient when analyzing fewer cells. Our quantitative comparison offers the basis for an informed choice among six prominent scRNA-seq methods, and it provides a framework for benchmarking further improvements of scRNA-seq protocols. Copyright © 2017 Elsevier Inc. All rights reserved.
Search-based optimization

NASA Technical Reports Server (NTRS)

Wheeler, Ward C.

2003-01-01

The problem of determining the minimum cost hypothetical ancestral sequences for a given cladogram is known to be NP-complete (Wang and Jiang, 1994). Traditionally, point estimations of hypothetical ancestral sequences have been used to gain heuristic, upper bounds on cladogram cost. These include procedures with such diverse approaches as non-additive optimization of multiple sequence alignment, direct optimization (Wheeler, 1996), and fixed-state character optimization (Wheeler, 1999). A method is proposed here which, by extending fixed-state character optimization, replaces the estimation process with a search. This form of optimization examines a diversity of potential state solutions for cost-efficient hypothetical ancestral sequences and can result in greatly more parsimonious cladograms. Additionally, such an approach can be applied to other NP-complete phylogenetic optimization problems such as genomic break-point analysis. c2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
Protein Sequencing with Tandem Mass Spectrometry

NASA Astrophysics Data System (ADS)

Ziady, Assem G.; Kinter, Michael

The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.
deepTools: a flexible platform for exploring deep-sequencing data.

PubMed

Ramírez, Fidel; Dündar, Friederike; Diehl, Sarah; Grüning, Björn A; Manke, Thomas

2014-07-01

We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome”

PubMed Central

Tettelin, Hervé; Masignani, Vega; Cieslewicz, Michael J.; Donati, Claudio; Medini, Duccio; Ward, Naomi L.; Angiuoli, Samuel V.; Crabtree, Jonathan; Jones, Amanda L.; Durkin, A. Scott; DeBoy, Robert T.; Davidsen, Tanja M.; Mora, Marirosa; Scarselli, Maria; Margarit y Ros, Immaculada; Peterson, Jeremy D.; Hauser, Christopher R.; Sundaram, Jaideep P.; Nelson, William C.; Madupu, Ramana; Brinkac, Lauren M.; Dodson, Robert J.; Rosovitz, Mary J.; Sullivan, Steven A.; Daugherty, Sean C.; Haft, Daniel H.; Selengut, Jeremy; Gwinn, Michelle L.; Zhou, Liwei; Zafar, Nikhat; Khouri, Hoda; Radune, Diana; Dimitrov, George; Watkins, Kisha; O'Connor, Kevin J. B.; Smith, Shannon; Utterback, Teresa R.; White, Owen; Rubens, Craig E.; Grandi, Guido; Madoff, Lawrence C.; Kasper, Dennis L.; Telford, John L.; Wessels, Michael R.; Rappuoli, Rino; Fraser, Claire M.

2005-01-01

The development of efficient and inexpensive genome sequencing methods has revolutionized the study of human bacterial pathogens and improved vaccine design. Unfortunately, the sequence of a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or for antimicrobial targets. We have generated the genomic sequence of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans. Analysis of these genomes and those available in databases showed that the S. agalactiae species can be described by a pan-genome consisting of a core genome shared by all isolates, accounting for ≈80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes. PMID:16172379
Top-down analysis of protein samples by de novo sequencing techniques.

PubMed

Vyatkina, Kira; Wu, Si; Dekker, Lennard J M; VanDuijn, Martijn M; Liu, Xiaowen; Tolić, Nikola; Luider, Theo M; Paša-Tolić, Ljiljana; Pevzner, Pavel A

2016-09-15

Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns. Freely available on the web at http://bioinf.spbau.ru/en/twister vyatkina@spbau.ru or ppevzner@ucsd.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Quantitation of next generation sequencing library preparation protocol efficiencies using droplet digital PCR assays - a systematic comparison of DNA library preparation kits for Illumina sequencing.

PubMed

Aigrain, Louise; Gu, Yong; Quail, Michael A

2016-06-13

The emergence of next-generation sequencing (NGS) technologies in the past decade has allowed the democratization of DNA sequencing both in terms of price per sequenced bases and ease to produce DNA libraries. When it comes to preparing DNA sequencing libraries for Illumina, the current market leader, a plethora of kits are available and it can be difficult for the users to determine which kit is the most appropriate and efficient for their applications; the main concerns being not only cost but also minimal bias, yield and time efficiency. We compared 9 commercially available library preparation kits in a systematic manner using the same DNA sample by probing the amount of DNA remaining after each protocol steps using a new droplet digital PCR (ddPCR) assay. This method allows the precise quantification of fragments bearing either adaptors or P5/P7 sequences on both ends just after ligation or PCR enrichment. We also investigated the potential influence of DNA input and DNA fragment size on the final library preparation efficiency. The overall library preparations efficiencies of the libraries show important variations between the different kits with the ones combining several steps into a single one exhibiting some final yields 4 to 7 times higher than the other kits. Detailed ddPCR data also reveal that the adaptor ligation yield itself varies by more than a factor of 10 between kits, certain ligation efficiencies being so low that it could impair the original library complexity and impoverish the sequencing results. When a PCR enrichment step is necessary, lower adaptor-ligated DNA inputs leads to greater amplification yields, hiding the latent disparity between kits. We describe a ddPCR assay that allows us to probe the efficiency of the most critical step in the library preparation, ligation, and to draw conclusion on which kits is more likely to preserve the sample heterogeneity and reduce the need of amplification.
Identification of peptide sequences that target to the brain using in vivo phage display.

PubMed

Li, Jingwei; Zhang, Qizhi; Pang, Zhiqing; Wang, Yuchen; Liu, Qingfeng; Guo, Liangran; Jiang, Xinguo

2012-06-01

Phage display technology could provide a rapid means for the discovery of novel peptides. To find peptide ligands specific for the brain vascular receptors, we performed a modified phage display method. Phages were recovered from mice brain parenchyma after administrated with a random 7-mer peptide library intravenously. A longer circulation time was arranged according to the biodistributive brain/blood ratios of phage particles. Following sequential rounds of isolation, a number of phages were sequenced and a peptide sequence (CTSTSAPYC, denoted as PepC7) was identified. Clone 7-1, which encodes PepC7, exhibited translocation efficiency about 41-fold higher than the random library phage. Immunofluorescence analysis revealed that Clone 7-1 had a significant superiority on transport efficiency into the brain compared with native M13 phage. Clone 7-1 was inhibited from homing to the brain in a dose-dependent fashion when cyclic peptides of the same sequence were present in a competition assay. Interestingly, the linear peptide (ATSTSAPYA, Pep7) and a scrambled control peptide PepSC7 (CSPATSYTC) did not compete with the phage at the same tested concentration (0.2-200 pg). Labeled by Cy5.5, PepC7 exhibited significant brain-targeting capability in in vivo optical imaging analysis. The cyclic conformation of PepC7 formed by disulfide bond, and the correct structure itself play a critical role in maintaining the selectivity and affinity for the brain. In conclusion, PepC7 is a promising brain-target motif never been reported before and it could be applied to targeted drug delivery into the brain.
Phylogenetic Analysis of Nuclear-Encoded RNA Maturases

PubMed Central

Malik, Sunita; Upadhyaya, KC; Khurana, SM Paul

2017-01-01

Posttranscriptional processes, such as splicing, play a crucial role in gene expression and are prevalent not only in nuclear genes but also in plant mitochondria where splicing of group II introns is catalyzed by a class of proteins termed maturases. In plant mitochondria, there are 22 mitochondrial group II introns. matR, nMAT1, nMAT2, nMAT3, and nMAT4 proteins have been shown to be required for efficient splicing of several group II introns in Arabidopsis thaliana. Nuclear maturases (nMATs) are necessary for splicing of mitochondrial genes, leading to normal oxidative phosphorylation. Sequence analysis through phylogenetic tree (including bootstrapping) revealed high homology with maturase sequences of A thaliana and other plants. This study shows the phylogenetic relationship of nMAT proteins between A thaliana and other nonredundant plant species taken from BLASTP analysis. PMID:28607538
JANE: efficient mapping of prokaryotic ESTs and variable length sequence reads on related template genomes

PubMed Central

2009-01-01

Background ESTs or variable sequence reads can be available in prokaryotic studies well before a complete genome is known. Use cases include (i) transcriptome studies or (ii) single cell sequencing of bacteria. Without suitable software their further analysis and mapping would have to await finalization of the corresponding genome. Results The tool JANE rapidly maps ESTs or variable sequence reads in prokaryotic sequencing and transcriptome efforts to related template genomes. It provides an easy-to-use graphics interface for information retrieval and a toolkit for EST or nucleotide sequence function prediction. Furthermore, we developed for rapid mapping an enhanced sequence alignment algorithm which reassembles and evaluates high scoring pairs provided from the BLAST algorithm. Rapid assembly on and replacement of the template genome by sequence reads or mapped ESTs is achieved. This is illustrated (i) by data from Staphylococci as well as from a Blattabacteria sequencing effort, (ii) mapping single cell sequencing reads is shown for poribacteria to sister phylum representative Rhodopirellula Baltica SH1. The algorithm has been implemented in a web-server accessible at http://jane.bioapps.biozentrum.uni-wuerzburg.de. Conclusion Rapid prokaryotic EST mapping or mapping of sequence reads is achieved applying JANE even without knowing the cognate genome sequence. PMID:19943962
Integration of adeno-associated virus vectors in CD34+ human hematopoietic progenitor cells after transduction.

PubMed

Fisher-Adams, G; Wong, K K; Podsakoff, G; Forman, S J; Chatterjee, S

1996-07-15

Gene transfer vectors based on adeno-associated virus (AAV) appear promising because of their high transduction frequencies regardless of cell cycle status and ability to integrate into chromosomal DNA. We tested AAV-mediated gene transfer into a panel of human bone marrow or umbilical cord-derived CD34+ hematopoietic progenitor cells, using vectors encoding several transgenes under the control of viral and cellular promoters. Gene transfer was evaluated by (1) chromosomal integration of vector sequences and (2) analysis of transgene expression. Southern hybridization and fluorescence in situ hybridization analysis of transduced CD34 genomic DNA showed the presence of integrated vector sequences in chromosomal DNA in a portion of transduced cells and showed that integrated vector sequences were replicated along with cellular DNA during mitosis. Transgene expression in transduced CD34 cells in suspension cultures and in myeloid colonies differentiating in vitro from transduced CD34 cells approximated that predicted by the multiplicity of transduction. This was true in CD34 cells from different donors, regardless of the transgene or selective pressure. Comparisons of CD34 cell transduction either before or after cytokine stimulation showed similar gene transfer frequencies. Our findings suggest that AAV transduction of CD34+ hematopoietic progenitor cells is efficient, can lead to stable integration in a population of transduced cells, and may therefore provide the basis for safe and efficient ex vivo gene therapy of the hematopoietic system.
Application of sorting and next generation sequencing to study 5΄-UTR influence on translation efficiency in Escherichia coli

PubMed Central

Evfratov, Sergey A.; Osterman, Ilya A.; Komarova, Ekaterina S.; Pogorelskaya, Alexandra M.; Rubtsova, Maria P.; Zatsepin, Timofei S.; Semashko, Tatiana A.; Kostryukova, Elena S.; Mironov, Andrey A.; Burnaev, Evgeny; Krymova, Ekaterina; Gelfand, Mikhail S.; Govorun, Vadim M.; Bogdanov, Alexey A.; Dontsova, Olga A.

2017-01-01

Abstract Yield of protein per translated mRNA may vary by four orders of magnitude. Many studies analyzed the influence of mRNA features on the translation yield. However, a detailed understanding of how mRNA sequence determines its propensity to be translated is still missing. Here, we constructed a set of reporter plasmid libraries encoding CER fluorescent protein preceded by randomized 5΄ untranslated regions (5΄-UTR) and Red fluorescent protein (RFP) used as an internal control. Each library was transformed into Escherchia coli cells, separated by efficiency of CER mRNA translation by a cell sorter and subjected to next generation sequencing. We tested efficiency of translation of the CER gene preceded by each of 48 natural 5΄-UTR sequences and introduced random and designed mutations into natural and artificially selected 5΄-UTRs. Several distinct properties could be ascribed to a group of 5΄-UTRs most efficient in translation. In addition to known ones, several previously unrecognized features that contribute to the translation enhancement were found, such as low proportion of cytidine residues, multiple SD sequences and AG repeats. The latter could be identified as translation enhancer, albeit less efficient than SD sequence in several natural 5΄-UTRs. PMID:27899632
AMPLISAS: a web server for multilocus genotyping using next-generation amplicon sequencing data.

PubMed

Sebastian, Alvaro; Herdegen, Magdalena; Migalska, Magdalena; Radwan, Jacek

2016-03-01

Next-generation sequencing (NGS) technologies are revolutionizing the fields of biology and medicine as powerful tools for amplicon sequencing (AS). Using combinations of primers and barcodes, it is possible to sequence targeted genomic regions with deep coverage for hundreds, even thousands, of individuals in a single experiment. This is extremely valuable for the genotyping of gene families in which locus-specific primers are often difficult to design, such as the major histocompatibility complex (MHC). The utility of AS is, however, limited by the high intrinsic sequencing error rates of NGS technologies and other sources of error such as polymerase amplification or chimera formation. Correcting these errors requires extensive bioinformatic post-processing of NGS data. Amplicon Sequence Assignment (AMPLISAS) is a tool that performs analysis of AS results in a simple and efficient way, while offering customization options for advanced users. AMPLISAS is designed as a three-step pipeline consisting of (i) read demultiplexing, (ii) unique sequence clustering and (iii) erroneous sequence filtering. Allele sequences and frequencies are retrieved in excel spreadsheet format, making them easy to interpret. AMPLISAS performance has been successfully benchmarked against previously published genotyped MHC data sets obtained with various NGS technologies. © 2015 John Wiley & Sons Ltd.
K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics.

PubMed

Lin, Jie; Adjeroh, Donald A; Jiang, Bing-Hua; Jiang, Yue

2018-05-15

Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Comparing to the other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating the phylogenetic tree, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which is able to determine the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. The K2 and K2* approaches are implemented in the R language as a package and is freely available for open access (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz). yueljiang@163.com. Supplementary data are available at Bioinformatics online.

Technical Considerations for Reduced Representation Bisulfite Sequencing with Multiplexed Libraries

PubMed Central

Chatterjee, Aniruddha; Rodger, Euan J.; Stockwell, Peter A.; Weeks, Robert J.; Morison, Ian M.

2012-01-01

Reduced representation bisulfite sequencing (RRBS), which couples bisulfite conversion and next generation sequencing, is an innovative method that specifically enriches genomic regions with a high density of potential methylation sites and enables investigation of DNA methylation at single-nucleotide resolution. Recent advances in the Illumina DNA sample preparation protocol and sequencing technology have vastly improved sequencing throughput capacity. Although the new Illumina technology is now widely used, the unique challenges associated with multiplexed RRBS libraries on this platform have not been previously described. We have made modifications to the RRBS library preparation protocol to sequence multiplexed libraries on a single flow cell lane of the Illumina HiSeq 2000. Furthermore, our analysis incorporates a bioinformatics pipeline specifically designed to process bisulfite-converted sequencing reads and evaluate the output and quality of the sequencing data generated from the multiplexed libraries. We obtained an average of 42 million paired-end reads per sample for each flow-cell lane, with a high unique mapping efficiency to the reference human genome. Here we provide a roadmap of modifications, strategies, and trouble shooting approaches we implemented to optimize sequencing of multiplexed libraries on an a RRBS background. PMID:23193365
SSR_pipeline--computer software for the identification of microsatellite sequences from paired-end Illumina high-throughput DNA sequence data

USGS Publications Warehouse

Miller, Mark P.; Knaus, Brian J.; Mullins, Thomas D.; Haig, Susan M.

2013-01-01

SSR_pipeline is a flexible set of programs designed to efficiently identify simple sequence repeats (SSRs; for example, microsatellites) from paired-end high-throughput Illumina DNA sequencing data. The program suite contains three analysis modules along with a fourth control module that can be used to automate analyses of large volumes of data. The modules are used to (1) identify the subset of paired-end sequences that pass quality standards, (2) align paired-end reads into a single composite DNA sequence, and (3) identify sequences that possess microsatellites conforming to user specified parameters. Each of the three separate analysis modules also can be used independently to provide greater flexibility or to work with FASTQ or FASTA files generated from other sequencing platforms (Roche 454, Ion Torrent, etc). All modules are implemented in the Python programming language and can therefore be used from nearly any computer operating system (Linux, Macintosh, Windows). The program suite relies on a compiled Python extension module to perform paired-end alignments. Instructions for compiling the extension from source code are provided in the documentation. Users who do not have Python installed on their computers or who do not have the ability to compile software also may choose to download packaged executable files. These files include all Python scripts, a copy of the compiled extension module, and a minimal installation of Python in a single binary executable. See program documentation for more information.
RNA-ID, a Powerful Tool for Identifying and Characterizing Regulatory Sequences.

PubMed

Brule, C E; Dean, K M; Grayhack, E J

2016-01-01

The identification and analysis of sequences that regulate gene expression is critical because regulated gene expression underlies biology. RNA-ID is an efficient and sensitive method to discover and investigate regulatory sequences in the yeast Saccharomyces cerevisiae, using fluorescence-based assays to detect green fluorescent protein (GFP) relative to a red fluorescent protein (RFP) control in individual cells. Putative regulatory sequences can be inserted either in-frame or upstream of a superfolder GFP fusion protein whose expression, like that of RFP, is driven by the bidirectional GAL1,10 promoter. In this chapter, we describe the methodology to identify and study cis-regulatory sequences in the RNA-ID system, explaining features and variations of the RNA-ID reporter, as well as some applications of this system. We describe in detail the methods to analyze a single regulatory sequence, from construction of a single GFP variant to assay of variants by flow cytometry, as well as modifications required to screen libraries of different strains simultaneously. We also describe subsequent analyses of regulatory sequences. © 2016 Elsevier Inc. All rights reserved.
Multilevel analysis of sports video sequences

NASA Astrophysics Data System (ADS)

Han, Jungong; Farin, Dirk; de With, Peter H. N.

2006-01-01

We propose a fully automatic and flexible framework for analysis and summarization of tennis broadcast video sequences, using visual features and specific game-context knowledge. Our framework can analyze a tennis video sequence at three levels, which provides a broad range of different analysis results. The proposed framework includes novel pixel-level and object-level tennis video processing algorithms, such as a moving-player detection taking both the color and the court (playing-field) information into account, and a player-position tracking algorithm based on a 3-D camera model. Additionally, we employ scene-level models for detecting events, like service, base-line rally and net-approach, based on a number real-world visual features. The system can summarize three forms of information: (1) all court-view playing frames in a game, (2) the moving trajectory and real-speed of each player, as well as relative position between the player and the court, (3) the semantic event segments in a game. The proposed framework is flexible in choosing the level of analysis that is desired. It is effective because the framework makes use of several visual cues obtained from the real-world domain to model important events like service, thereby increasing the accuracy of the scene-level analysis. The paper presents attractive experimental results highlighting the system efficiency and analysis capabilities.
Electrotransformation and expression of bacterial genes encoding hygromycin phosphotransferase and beta-galactosidase in the pathogenic fungus Histoplasma capsulatum.

PubMed

Woods, J P; Heinecke, E L; Goldman, W E

1998-04-01

We developed an efficient electrotransformation system for the pathogenic fungus Histoplasma capsulatum and used it to examine the effects of features of the transforming DNA on transformation efficiency and fate of the transforming DNA and to demonstrate fungal expression of two recombinant Escherichia coli genes, hph and lacZ. Linearized DNA and plasmids containing Histoplasma telomeric sequences showed the greatest transformation efficiencies, while the plasmid vector had no significant effect, nor did the derivation of the selectable URA5 marker (native Histoplasma gene or a heterologous Podospora anserina gene). Electrotransformation resulted in more frequent multimerization, other modification, or possibly chromosomal integration of transforming telomeric plasmids when saturating amounts of DNA were used, but this effect was not observed with smaller amounts of transforming DNA. We developed another selection system using a hygromycin B resistance marker from plasmid pAN7-1, consisting of the E. coli hph gene flanked by Aspergillus nidulans promoter and terminator sequences. Much of the heterologous fungal sequences could be removed without compromising function in H. capsulatum, allowing construction of a substantially smaller effective marker fragment. Transformation efficiency increased when nonselective conditions were maintained for a time after electrotransformation before selection with the protein synthesis inhibitor hygromycin B was imposed. Finally, we constructed a readily detectable and quantifiable reporter gene by fusing Histoplasma URA5 with E. coli lacZ, resulting in expression of functional beta-galactosidase in H. capsulatum. Demonstration of expression of bacterial genes as effective selectable markers and reporters, together with a highly efficient electrotransformation system, provide valuable approaches for molecular genetic analysis and manipulation of H. capsulatum, which have proven useful for examination of targeted gene disruption, regulated gene expression, and potential virulence determinants in this fungus.
Molecular epidemiological analysis of paired pol/env sequences from Portuguese HIV type 1 patients.

PubMed

Abecasis, Ana B; Martins, Andreia; Costa, Inês; Carvalho, Ana P; Diogo, Isabel; Gomes, Perpétua; Camacho, Ricardo J

2011-07-01

The advent of new therapeutic approaches targeting env and the search for efficient anti-HIV-1 vaccines make it necessary to identify the number of recombinant forms using genomic regions that were previously not frequently sequenced. In this study, we have subtyped paired pol and env sequences from HIV-1 strains infecting 152 patients being clinically followed in Portugal. The percentage of strains in which we found discordant subtypes in pol and env was 25.7%. When the subtype in pol and env was concordant (65.1%), the most prevalent subtypes were subtype B (40.8%), followed by subtype C (17.8%) and subtype G (5.3%). The most prevalent recombinant form was CRF14_BGpol/Genv (7.2%).
Evol and ProDy for bridging protein sequence evolution and structural dynamics.

PubMed

Bakan, Ahmet; Dutta, Anindita; Mao, Wenzhi; Liu, Ying; Chennubhotla, Chakra; Lezon, Timothy R; Bahar, Ivet

2014-09-15

Correlations between sequence evolution and structural dynamics are of utmost importance in understanding the molecular mechanisms of function and their evolution. We have integrated Evol, a new package for fast and efficient comparative analysis of evolutionary patterns and conformational dynamics, into ProDy, a computational toolbox designed for inferring protein dynamics from experimental and theoretical data. Using information-theoretic approaches, Evol coanalyzes conservation and coevolution profiles extracted from multiple sequence alignments of protein families with their inferred dynamics. ProDy and Evol are open-source and freely available under MIT License from http://prody.csb.pitt.edu/. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
A critical analysis of computational protein design with sparse residue interaction graphs

PubMed Central

Georgiev, Ivelin S.

2017-01-01

Protein design algorithms enumerate a combinatorial number of candidate structures to compute the Global Minimum Energy Conformation (GMEC). To efficiently find the GMEC, protein design algorithms must methodically reduce the conformational search space. By applying distance and energy cutoffs, the protein system to be designed can thus be represented using a sparse residue interaction graph, where the number of interacting residue pairs is less than all pairs of mutable residues, and the corresponding GMEC is called the sparse GMEC. However, ignoring some pairwise residue interactions can lead to a change in the energy, conformation, or sequence of the sparse GMEC vs. the original or the full GMEC. Despite the widespread use of sparse residue interaction graphs in protein design, the above mentioned effects of their use have not been previously analyzed. To analyze the costs and benefits of designing with sparse residue interaction graphs, we computed the GMECs for 136 different protein design problems both with and without distance and energy cutoffs, and compared their energies, conformations, and sequences. Our analysis shows that the differences between the GMECs depend critically on whether or not the design includes core, boundary, or surface residues. Moreover, neglecting long-range interactions can alter local interactions and introduce large sequence differences, both of which can result in significant structural and functional changes. Designs on proteins with experimentally measured thermostability show it is beneficial to compute both the full and the sparse GMEC accurately and efficiently. To this end, we show that a provable, ensemble-based algorithm can efficiently compute both GMECs by enumerating a small number of conformations, usually fewer than 1000. This provides a novel way to combine sparse residue interaction graphs with provable, ensemble-based algorithms to reap the benefits of sparse residue interaction graphs while avoiding their potential inaccuracies. PMID:28358804
Structural Analysis of Biodiversity

PubMed Central

Sirovich, Lawrence; Stoeckle, Mark Y.; Zhang, Yu

2010-01-01

Large, recently-available genomic databases cover a wide range of life forms, suggesting opportunity for insights into genetic structure of biodiversity. In this study we refine our recently-described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17000 DNA barcode sequences covering 12 widely-separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly-studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity. PMID:20195371
EggLib: processing, analysis and simulation tools for population genetics and genomics

PubMed Central

2012-01-01

Background With the considerable growth of available nucleotide sequence data over the last decade, integrated and flexible analytical tools have become a necessity. In particular, in the field of population genetics, there is a strong need for automated and reliable procedures to conduct repeatable and rapid polymorphism analyses, coalescent simulations, data manipulation and estimation of demographic parameters under a variety of scenarios. Results In this context, we present EggLib (Evolutionary Genetics and Genomics Library), a flexible and powerful C++/Python software package providing efficient and easy to use computational tools for sequence data management and extensive population genetic analyses on nucleotide sequence data. EggLib is a multifaceted project involving several integrated modules: an underlying computationally efficient C++ library (which can be used independently in pure C++ applications); two C++ programs; a Python package providing, among other features, a high level Python interface to the C++ library; and the egglib script which provides direct access to pre-programmed Python applications. Conclusions EggLib has been designed aiming to be both efficient and easy to use. A wide array of methods are implemented, including file format conversion, sequence alignment edition, coalescent simulations, neutrality tests and estimation of demographic parameters by Approximate Bayesian Computation (ABC). Classes implementing different demographic scenarios for ABC analyses can easily be developed by the user and included to the package. EggLib source code is distributed freely under the GNU General Public License (GPL) from its website http://egglib.sourceforge.net/ where a full documentation and a manual can also be found and downloaded. PMID:22494792
EggLib: processing, analysis and simulation tools for population genetics and genomics.

PubMed

De Mita, Stéphane; Siol, Mathieu

2012-04-11

With the considerable growth of available nucleotide sequence data over the last decade, integrated and flexible analytical tools have become a necessity. In particular, in the field of population genetics, there is a strong need for automated and reliable procedures to conduct repeatable and rapid polymorphism analyses, coalescent simulations, data manipulation and estimation of demographic parameters under a variety of scenarios. In this context, we present EggLib (Evolutionary Genetics and Genomics Library), a flexible and powerful C++/Python software package providing efficient and easy to use computational tools for sequence data management and extensive population genetic analyses on nucleotide sequence data. EggLib is a multifaceted project involving several integrated modules: an underlying computationally efficient C++ library (which can be used independently in pure C++ applications); two C++ programs; a Python package providing, among other features, a high level Python interface to the C++ library; and the egglib script which provides direct access to pre-programmed Python applications. EggLib has been designed aiming to be both efficient and easy to use. A wide array of methods are implemented, including file format conversion, sequence alignment edition, coalescent simulations, neutrality tests and estimation of demographic parameters by Approximate Bayesian Computation (ABC). Classes implementing different demographic scenarios for ABC analyses can easily be developed by the user and included to the package. EggLib source code is distributed freely under the GNU General Public License (GPL) from its website http://egglib.sourceforge.net/ where a full documentation and a manual can also be found and downloaded.
An interdisciplinary analysis of ERTS data for Colorado mountain environments using ADP Techniques

NASA Technical Reports Server (NTRS)

Hoffer, R. M. (Principal Investigator)

1972-01-01

Author identified significant preliminary results from the Ouachita portion of the Texoma frame of data indicate many potentials in the analysis and interpretation of ERTS data. It is believed that one of the more significant aspects of this analysis sequence has been the investigation of a technique to relate ERTS analysis and surface observation analysis. At present a sequence involving (1) preliminary analysis based solely upon the spectral characteristics of the data, followed by (2) a surface observation mission to obtain visual information and oblique photography to particular points of interest in the test site area, appears to provide an extremely efficient technique for obtaining particularly meaningful surface observation data. Following such a procedure permits concentration on particular points of interest in the entire ERTS frame and thereby makes the surface observation data obtained to be particularly significant and meaningful. The analysis of the Texoma frame has also been significant from the standpoint of demonstrating a fast turn around analysis capability. Additionally, the analysis has shown the potential accuracy and degree of complexity of features that can be identified and mapped using ERTS data.
Methods and compositions for efficient nucleic acid sequencing

DOEpatents

Drmanac, Radoje

2006-07-04

Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.
Methods and compositions for efficient nucleic acid sequencing

DOEpatents

Drmanac, Radoje

2002-01-01

Disclosed are novel methods and compositions for rapid and highly efficient nucleic acid sequencing based upon hybridization with two sets of small oligonucleotide probes of known sequences. Extremely large nucleic acid molecules, including chromosomes and non-amplified RNA, may be sequenced without prior cloning or subcloning steps. The methods of the invention also solve various current problems associated with sequencing technology such as, for example, high noise to signal ratios and difficult discrimination, attaching many nucleic acid fragments to a surface, preparing many, longer or more complex probes and labelling more species.
Unexpected substrate specificity of T4 DNA ligase revealed by in vitro selection

NASA Technical Reports Server (NTRS)

Harada, Kazuo; Orgel, Leslie E.

1993-01-01

We have used in vitro selection techniques to characterize DNA sequences that are ligated efficiently by T4 DNA ligase. We find that the ensemble of selected sequences ligates about 50 times as efficiently as the random mixture of sequences used as the input for selection. Surprisingly many of the selected sequences failed to produce a match at or close to the ligation junction. None of the 20 selected oligomers that we sequenced produced a match two bases upstream from the ligation junction.
Metagenomics: Probing pollutant fate in natural and engineered ecosystems.

PubMed

Bouhajja, Emna; Agathos, Spiros N; George, Isabelle F

2016-12-01

Polluted environments are a reservoir of microbial species able to degrade or to convert pollutants to harmless compounds. The proper management of microbial resources requires a comprehensive characterization of their genetic pool to assess the fate of contaminants and increase the efficiency of bioremediation processes. Metagenomics offers appropriate tools to describe microbial communities in their whole complexity without lab-based cultivation of individual strains. After a decade of use of metagenomics to study microbiomes, the scientific community has made significant progress in this field. In this review, we survey the main steps of metagenomics applied to environments contaminated with organic compounds or heavy metals. We emphasize technical solutions proposed to overcome encountered obstacles. We then compare two metagenomic approaches, i.e. library-based targeted metagenomics and direct sequencing of metagenomes. In the former, environmental DNA is cloned inside a host, and then clones of interest are selected based on (i) their expression of biodegradative functions or (ii) sequence homology with probes and primers designed from relevant, already known sequences. The highest score for the discovery of novel genes and degradation pathways has been achieved so far by functional screening of large clone libraries. On the other hand, direct sequencing of metagenomes without a cloning step has been more often applied to polluted environments for characterization of the taxonomic and functional composition of microbial communities and their dynamics. In this case, the analysis has focused on 16S rRNA genes and marker genes of biodegradation. Advances in next generation sequencing and in bioinformatic analysis of sequencing data have opened up new opportunities for assessing the potential of biodegradation by microbes, but annotation of collected genes is still hampered by a limited number of available reference sequences in databases. Although metagenomics is still facing technical and computational challenges, our review of the recent literature highlights its value as an aid to efficiently monitor the clean-up of contaminated environments and develop successful strategies to mitigate the impact of pollutants on ecosystems. Copyright © 2016 Elsevier Inc. All rights reserved.
Eigenproblem solution by a combined Sturm sequence and inverse iteration technique.

NASA Technical Reports Server (NTRS)

Gupta, K. K.

1973-01-01

Description of an efficient and numerically stable algorithm, along with a complete listing of the associated computer program, developed for the accurate computation of specified roots and associated vectors of the eigenvalue problem Aq = lambda Bq with band symmetric A and B, B being also positive-definite. The desired roots are first isolated by the Sturm sequence procedure; then a special variant of the inverse iteration technique is applied for the individual determination of each root along with its vector. The algorithm fully exploits the banded form of relevant matrices, and the associated program written in FORTRAN V for the JPL UNIVAC 1108 computer proves to be most significantly economical in comparison to similar existing procedures. The program may be conveniently utilized for the efficient solution of practical engineering problems, involving free vibration and buckling analysis of structures. Results of such analyses are presented for representative structures.
Efficient robust reconstruction of dynamic PET activity maps with radioisotope decay constraints.

PubMed

Gao, Fei; Liu, Huafeng; Shi, Pengcheng

2010-01-01

Dynamic PET imaging performs sequence of data acquisition in order to provide visualization and quantification of physiological changes in specific tissues and organs. The reconstruction of activity maps is generally the first step in dynamic PET. State space Hinfinity approaches have been proved to be a robust method for PET image reconstruction where, however, temporal constraints are not considered during the reconstruction process. In addition, the state space strategies for PET image reconstruction have been computationally prohibitive for practical usage because of the need for matrix inversion. In this paper, we present a minimax formulation of the dynamic PET imaging problem where a radioisotope decay model is employed as physics-based temporal constraints on the photon counts. Furthermore, a robust steady state Hinfinity filter is developed to significantly improve the computational efficiency with minimal loss of accuracy. Experiments are conducted on Monte Carlo simulated image sequences for quantitative analysis and validation.
High Resolution Melt analysis for mutation screening in PKD1 and PKD2

PubMed Central

2011-01-01

Background Autosomal dominant polycystic kidney disease (ADPKD) is the most common hereditary kidney disorder. It is characterized by focal development and progressive enlargement of renal cysts leading to end-stage renal disease. PKD1 and PKD2 have been implicated in ADPKD pathogenesis but genetic features and the size of PKD1 make genetic diagnosis tedious. Methods We aim to prove that high resolution melt analysis (HRM), a recent technique in molecular biology, can facilitate molecular diagnosis of ADPKD. We screened for mutations in PKD1 and PKD2 with HRM in 37 unrelated patients with ADPKD. Results We identified 440 sequence variants in the 37 patients. One hundred and thirty eight were different. We found 28 pathogenic mutations (25 in PKD1 and 3 in PKD2 ) within 28 different patients, which is a diagnosis rate of 75% consistent with literature mean direct sequencing diagnosis rate. We describe 52 new sequence variants in PKD1 and two in PKD2. Conclusion HRM analysis is a sensitive and specific method for molecular diagnosis of ADPKD. HRM analysis is also costless and time sparing. Thus, this method is efficient and might be used for mutation pre-screening in ADPKD genes. PMID:22008521
The mutant vasopressin gene from diabetes insipidus (Brattleboro) rats is transcribed but the message is not efficiently translated.

PubMed Central

Schmale, H; Ivell, R; Breindl, M; Darmer, D; Richter, D

1984-01-01

The vasopressin gene from normal and diabetes insipidus (Brattleboro) rats has been isolated and sequenced. Except for a single deletion of a G residue in region coding for the neurophysin carrier protein the approximately 2300 nucleotides of both genes are identical. Blot analysis of hypothalamic RNA as well as transfection and microinjection experiments indicate that the mutant gene is correctly transcribed and spliced, however the resulting mRNA is not efficiently translated. Images Fig. 2. Fig. 3. PMID:6526016

AMP-forming acetyl-CoA synthetases in Archaea show unexpected diversity in substrate utilization

PubMed Central

Ingram-Smith, Cheryl; Smith, Kerry S.

2007-01-01

Adenosine monophosphate (AMP)-forming acetyl-CoA synthetase (ACS; acetate:CoA ligase (AMP-forming), EC 6.2.1.1) is a key enzyme for conversion of acetate to acetyl-CoA, an essential intermediate at the junction of anabolic and catabolic pathways. Phylogenetic analysis of putative short and medium chain acyl-CoA synthetase sequences indicates that the ACSs form a distinct clade from other acyl-CoA synthetases. Within this clade, the archaeal ACSs are not monophyletic and fall into three groups composed of both bacterial and archaeal sequences. Kinetic analysis of two archaeal enzymes, an ACS from Methanothermobacter thermautotrophicus (designated as MT-ACS1) and an ACS from Archaeoglobus fulgidus (designated as AF-ACS2), revealed that these enzymes have very different properties. MT-ACS1 has nearly 11-fold higher affinity and 14-fold higher catalytic efficiency with acetate than with propionate, a property shared by most ACSs. However, AF-ACS2 has only 2.3-fold higher affinity and catalytic efficiency with acetate than with propionate. This enzyme has an affinity for propionate that is almost identical to that of MT-ACS1 for acetate and nearly tenfold higher than the affinity of MT-ACS1 for propionate. Furthermore, MT-ACS1 is limited to acetate and propionate as acyl substrates, whereas AF-ACS2 can also utilize longer straight and branched chain acyl substrates. Phylogenetic analysis, sequence alignment and structural modeling suggest a molecular basis for the altered substrate preference and expanded substrate range of AF-ACS2 versus MT-ACS1. PMID:17350930
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

PubMed Central

2012-01-01

Background Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems. PMID:22551152
MTTE: an innovative strategy for the evaluation of targeted/exome enrichment efficiency

PubMed Central

Klonowska, Katarzyna; Handschuh, Luiza; Swiercz, Aleksandra; Figlerowicz, Marek; Kozlowski, Piotr

2016-01-01

Although currently available strategies for the preparation of exome-enriched libraries are well established, a final validation of the libraries in terms of exome enrichment efficiency prior to the sequencing step is of considerable importance. Here, we present a strategy for the evaluation of exome enrichment, i.e., the Multipoint Test for Targeted-enrichment Efficiency (MTTE), PCR-based approach utilizing multiplex ligation-dependent probe amplification with capillary electrophoresis separation. We used MTTE for the analysis of subsequent steps of the Illumina TruSeq Exome Enrichment procedure. The calculated values of enrichment-associated parameters (i.e., relative enrichment, relative clearance, overall clearance, and fold enrichment) and the comparison of MTTE results with the actual enrichment revealed the high reliability of our assay. Additionally, the MTTE assay enabled the determination of the sequence-associated features that may confer bias in the enrichment of different targets. Importantly, the MTTE is low cost method that can be easily adapted to the region of interest important for a particular project. Thus, the MTTE strategy is attractive for post-capture validation in a variety of targeted/exome enrichment NGS projects. PMID:27572310
MTTE: an innovative strategy for the evaluation of targeted/exome enrichment efficiency.

PubMed

Klonowska, Katarzyna; Handschuh, Luiza; Swiercz, Aleksandra; Figlerowicz, Marek; Kozlowski, Piotr

2016-10-11

Although currently available strategies for the preparation of exome-enriched libraries are well established, a final validation of the libraries in terms of exome enrichment efficiency prior to the sequencing step is of considerable importance. Here, we present a strategy for the evaluation of exome enrichment, i.e., the Multipoint Test for Targeted-enrichment Efficiency (MTTE), PCR-based approach utilizing multiplex ligation-dependent probe amplification with capillary electrophoresis separation. We used MTTE for the analysis of subsequent steps of the Illumina TruSeq Exome Enrichment procedure. The calculated values of enrichment-associated parameters (i.e., relative enrichment, relative clearance, overall clearance, and fold enrichment) and the comparison of MTTE results with the actual enrichment revealed the high reliability of our assay. Additionally, the MTTE assay enabled the determination of the sequence-associated features that may confer bias in the enrichment of different targets. Importantly, the MTTE is low cost method that can be easily adapted to the region of interest important for a particular project. Thus, the MTTE strategy is attractive for post-capture validation in a variety of targeted/exome enrichment NGS projects.
Homeologous plastid DNA transformation in tobacco is mediated by multiple recombination events.

PubMed Central

Kavanagh, T A; Thanh, N D; Lao, N T; McGrath, N; Peter, S O; Horváth, E M; Dix, P J; Medgyesy, P

1999-01-01

Efficient plastid transformation has been achieved in Nicotiana tabacum using cloned plastid DNA of Solanum nigrum carrying mutations conferring spectinomycin and streptomycin resistance. The use of the incompletely homologous (homeologous) Solanum plastid DNA as donor resulted in a Nicotiana plastid transformation frequency comparable with that of other experiments where completely homologous plastid DNA was introduced. Physical mapping and nucleotide sequence analysis of the targeted plastid DNA region in the transformants demonstrated efficient site-specific integration of the 7.8-kb Solanum plastid DNA and the exclusion of the vector DNA. The integration of the cloned Solanum plastid DNA into the Nicotiana plastid genome involved multiple recombination events as revealed by the presence of discontinuous tracts of Solanum-specific sequences that were interspersed between Nicotiana-specific markers. Marked position effects resulted in very frequent cointegration of the nonselected peripheral donor markers located adjacent to the vector DNA. Data presented here on the efficiency and features of homeologous plastid DNA recombination are consistent with the existence of an active RecA-mediated, but a diminished mismatch, recombination/repair system in higher-plant plastids. PMID:10388829
Evaluation of the effectiveness and efficiency of two stimulus prompt strategies with severely handicapped students.

PubMed Central

Steege, M W; Wacker, D P; McMahon, C M

1987-01-01

In this study we compared the effectiveness and efficiency of two treatment packages that used stimulus prompt sequences and task analyses for teaching community living skills to severely handicapped students. Four severely and multiply handicapped students were trained to perform four tasks: (a) making toast, (b) making popcorn, (c) operating a clothes dryer, and (d) operating a washing machine. Following baseline, each student was exposed to two types of training procedures, each involving a task analysis of the target behavior. Training Procedure 1 (Traditional) utilized a least-to-most restrictive prompt sequence. Training Procedure 2 (Prescriptive) utilized ongoing behavioral assessment data to identify discriminative stimuli. The assessment data were used to prescribe instructional prompts across successive training trials. Performance on the tasks was evaluated within a combination multiple baseline (across subjects) and probe (across tasks) design. Training conditions were counterbalanced across subjects and tasks. Results indicated that both training procedures were equally effective in increasing independent task acquisition for subjects on all tasks; however, the prescriptive procedure was the more efficient procedure. PMID:3667479
Stratification of co-evolving genomic groups using ranked phylogenetic profiles

PubMed Central

Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A

2009-01-01

Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884
A generic assay for whole-genome amplification and deep sequencing of enterovirus A71

PubMed Central

Tan, Le Van; Tuyen, Nguyen Thi Kim; Thanh, Tran Tan; Ngan, Tran Thuy; Van, Hoang Minh Tu; Sabanathan, Saraswathy; Van, Tran Thi My; Thanh, Le Thi My; Nguyet, Lam Anh; Geoghegan, Jemma L.; Ong, Kien Chai; Perera, David; Hang, Vu Thi Ty; Ny, Nguyen Thi Han; Anh, Nguyen To; Ha, Do Quang; Qui, Phan Tu; Viet, Do Chau; Tuan, Ha Manh; Wong, Kum Thong; Holmes, Edward C.; Chau, Nguyen Van Vinh; Thwaites, Guy; van Doorn, H. Rogier

2015-01-01

Enterovirus A71 (EV-A71) has emerged as the most important cause of large outbreaks of severe and sometimes fatal hand, foot and mouth disease (HFMD) across the Asia-Pacific region. EV-A71 outbreaks have been associated with (sub)genogroup switches, sometimes accompanied by recombination events. Understanding EV-A71 population dynamics is therefore essential for understanding this emerging infection, and may provide pivotal information for vaccine development. Despite the public health burden of EV-A71, relatively few EV-A71 complete-genome sequences are available for analysis and from limited geographical localities. The availability of an efficient procedure for whole-genome sequencing would stimulate effort to generate more viral sequence data. Herein, we report for the first time the development of a next-generation sequencing based protocol for whole-genome sequencing of EV-A71 directly from clinical specimens. We were able to sequence viruses of subgenogroup C4 and B5, while RNA from culture materials of diverse EV-A71 subgenogroups belonging to both genogroup B and C was successfully amplified. The nature of intra-host genetic diversity was explored in 22 clinical samples, revealing 107 positions carrying minor variants (ranging from 0 to 15 variants per sample). Our analysis of EV-A71 strains sampled in 2013 showed that they all belonged to subgenogroup B5, representing the first report of this subgenogroup in Vietnam. In conclusion, we have successfully developed a high-throughput next-generation sequencing-based assay for whole-genome sequencing of EV-A71 from clinical samples. PMID:25704598
Encapsidation of Host RNAs by Cucumber Necrosis Virus Coat Protein during both Agroinfiltration and Infection.

PubMed

Ghoshal, Kankana; Theilmann, Jane; Reade, Ron; Maghodia, Ajay; Rochon, D'Ann

2015-11-01

Next-generation sequence analysis of virus-like particles (VLPs) produced during agroinfiltration of cucumber necrosis virus (CNV) coat protein (CP) and of authentic CNV virions was conducted to assess if host RNAs can be encapsidated by CNV CP. VLPs containing host RNAs were found to be produced during agroinfiltration, accumulating to approximately 1/60 the level that CNV virions accumulated during infection. VLPs contained a variety of host RNA species, including the major rRNAs as well as cytoplasmic, chloroplast, and mitochondrial mRNAs. The most predominant host RNA species encapsidated in VLPs were chloroplast encoded, consistent with the efficient targeting of CNV CP to chloroplasts during agroinfiltration. Interestingly, droplet digital PCR analysis showed that the CNV CP mRNA expressed during agroinfiltration was the most efficiently encapsidated mRNA, suggesting that the CNV CP open reading frame may contain a high-affinity site or sites for CP binding and thus contribute to the specificity of CNV RNA encapsidation. Approximately 0.09% to 0.7% of the RNA derived from authentic CNV virions contained host RNA, with chloroplast RNA again being the most prominent species. This is consistent with our previous finding that a small proportion of CNV CP enters chloroplasts during the infection process and highlights the possibility that chloroplast targeting is a significant aspect of CNV infection. Remarkably, 6 to 8 of the top 10 most efficiently encapsidated nucleus-encoded RNAs in CNV virions correspond to retrotransposon or retrotransposon-like RNA sequences. Thus, CNV could potentially serve as a vehicle for horizontal transmission of retrotransposons to new hosts and thereby significantly influence genome evolution. Viruses predominantly encapsidate their own virus-related RNA species due to the possession of specific sequences and/or structures on viral RNA which serve as high-affinity binding sites for the coat protein. In this study, we show, using next-generation sequence analysis, that CNV also encapsidates host RNA species, which account for ∼0.1% of the RNA packaged in CNV particles. The encapsidated host RNAs predominantly include chloroplast RNAs, reinforcing previous observations that CNV CP enters chloroplasts during infection. Remarkably, the most abundantly encapsidated cytoplasmic mRNAs consisted of retrotransposon-like RNA sequences, similar to findings recently reported for flock house virus (A. Routh, T. Domitrovic, and J. E. Johnson, Proc Natl Acad Sci U S A 109:1907-1912, 2012). Encapsidation of retrotransposon sequences may contribute to their horizontal transmission should CNV virions carrying retrotransposons infect a new host. Such an event could lead to large-scale genomic changes in a naive plant host, thus facilitating host evolutionary novelty. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
Polymerase ribozyme efficiency increased by G/T-rich DNA oligonucleotides

PubMed Central

Yao, Chengguo; Müller, Ulrich F.

2011-01-01

The RNA world hypothesis states that the early evolution of life went through a stage where RNA served as genome and as catalyst. The replication of RNA world organisms would have been facilitated by ribozymes that catalyze RNA polymerization. To recapitulate an RNA world in the laboratory, a series of RNA polymerase ribozymes was developed previously. However, these ribozymes have a polymerization efficiency that is too low for self-replication, and the most efficient ribozymes prefer one specific template sequence. The limiting factor for polymerization efficiency is the weak sequence-independent binding to its primer/template substrate. Most of the known polymerase ribozymes bind an RNA heptanucleotide to form the P2 duplex on the ribozyme. By modifying this heptanucleotide, we were able to significantly increase polymerization efficiency. Truncations at the 3′-terminus of this heptanucleotide increased full-length primer extension by 10-fold, on a specific template sequence. In contrast, polymerization on several different template sequences was improved dramatically by replacing the RNA heptanucleotide with DNA oligomers containing randomized sequences of 15 nt. The presence of G and T in the random sequences was sufficient for this effect, with an optimal composition of 60% G and 40% T. Our results indicate that these DNA sequences function by establishing many weak and nonspecific base-pairing interactions to the single-stranded portion of the template. Such low-specificity interactions could have had important functions in an RNA world. PMID:21622900
Solar cells and modules from dentritic web silicon

NASA Technical Reports Server (NTRS)

Campbell, R. B.; Rohatgi, A.; Seman, E. J.; Davis, J. R.; Rai-Choudhury, P.; Gallagher, B. D.

1980-01-01

Some of the noteworthy features of the processes developed in the fabrication of solar cell modules are the handling of long lengths of web, the use of cost effective dip coating of photoresist and antireflection coatings, selective electroplating of the grid pattern and ultrasonic bonding of the cell interconnect. Data on the cells is obtained by means of dark I-V analysis and deep level transient spectroscopy. A histogram of over 100 dentritic web solar cells fabricated in a number of runs using different web crystals shows an average efficiency of over 13%, with some efficiencies running above 15%. Lower cell efficiency is generally associated with low minority carrier time due to recombination centers sometimes present in the bulk silicon. A cost analysis of the process sequence using a 25 MW production line indicates a selling price of $0.75/peak watt in 1986. It is concluded that the efficiency of dentritic web cells approaches that of float zone silicon cells, reduced somewhat by the lower bulk lifetime of the former.
Analysis of endoscopic third ventriculostomy patency by MRI: value of different pulse sequences, the sequence parameters, and the imaging planes for investigation of flow void.

PubMed

Dinçer, Alp; Yildiz, Erdem; Kohan, Saeed; Memet Özek, M

2011-01-01

The aim of the study is to evaluate the efficiency of turbo spin-echo (TSE), three-dimensional constructive interference in the steady state (3D CISS) and cine phase contrast (Cine PC) sequences in determining flow through the endoscopic third ventriculostomy (ETV) fenestration, and to determine the effect of various TSE sequence parameters. The study was approved by our institutional review board and informed consent from all patients was obtained. Two groups of patients were included: group I (24 patients with good clinical outcome after ETV) and group II (22 patients with hydrocephalus evaluated preoperatively). The imaging protocol for both groups was identical. TSE T2 with various sequence parameters and imaging planes, and 3D CISS, followed by cine PC were obtained. Flow void was graded as four-point scales. The sensitivity, specificity, accuracy, positive and negative predictive values of sequences were calculated. Bidirectional flow through the fenestration was detected in all group I patients by cine PC. Stroke volumes through the fenestration in group I ranged 10-160.8 ml/min. There was no correlation between the presence of reversed flow and flow void grading. Also, there was no correlation between the stroke volumes and flow void grading. The sensitivity of 3D CISS was low, and 2 mm sagittal TSE T2, nearly equal to cine PC, provided best result. Cine PC and TSE T2 both have high confidence in the assessment of the flow through the fenestration. But, sequence parameters significantly affect the efficiency of TSE T2.
Elements in the transcriptional regulatory region flanking herpes simplex virus type 1 oriS stimulate origin function.

PubMed

Wong, S W; Schaffer, P A

1991-05-01

Like other DNA-containing viruses, the three origins of herpes simplex virus type 1 (HSV-1) DNA replication are flanked by sequences containing transcriptional regulatory elements. In a transient plasmid replication assay, deletion of sequences comprising the transcriptional regulatory elements of ICP4 and ICP22/47, which flank oriS, resulted in a greater than 80-fold decrease in origin function compared with a plasmid, pOS-822, which retains these sequences. In an effort to identify specific cis-acting elements responsible for this effect, we conducted systematic deletion analysis of the flanking region with plasmid pOS-822 and tested the resulting mutant plasmids for origin function. Stimulation by cis-acting elements was shown to be both distance and orientation dependent, as changes in either parameter resulted in a decrease in oriS function. Additional evidence for the stimulatory effect of flanking sequences on origin function was demonstrated by replacement of these sequences with the cytomegalovirus immediate-early promoter, resulting in nearly wild-type levels of oriS function. In competition experiments, cotransfection of cells with the test plasmid, pOS-822, and increasing molar concentrations of a competitor plasmid which contained the ICP4 and ICP22/47 transcriptional regulatory regions but lacked core origin sequences resulted in a significant reduction in the replication efficiency of pOS-822, demonstrating that factors which bind specifically to the oriS-flanking sequences are likely involved as auxiliary proteins in oriS function. Together, these studies demonstrate that trans-acting factors and the sites to which they bind play a critical role in the efficiency of HSV-1 DNA replication from oriS in transient-replication assays.
Identification and characterization of unrecognized viruses in stool samples of non-polio acute flaccid paralysis children by simplified VIDISCA.

PubMed

Shaukat, Shahzad; Angez, Mehar; Alam, Muhammad Masroor; Jebbink, Maarten F; Deijs, Martin; Canuti, Marta; Sharif, Salmaan; de Vries, Michel; Khurshid, Adnan; Mahmood, Tariq; van der Hoek, Lia; Zaidi, Syed Sohail Zahoor

2014-08-12

The use of sequence independent methods combined with next generation sequencing for identification purposes in clinical samples appears promising and exciting results have been achieved to understand unexplained infections. One sequence independent method, Virus Discovery based on cDNA Amplified Fragment Length Polymorphism (VIDISCA) is capable of identifying viruses that would have remained unidentified in standard diagnostics or cell cultures. VIDISCA is normally combined with next generation sequencing, however, we set up a simplified VIDISCA which can be used in case next generation sequencing is not possible. Stool samples of 10 patients with unexplained acute flaccid paralysis showing cytopathic effect in rhabdomyosarcoma cells and/or mouse cells were used to test the efficiency of this method. To further characterize the viruses, VIDISCA-positive samples were amplified and sequenced with gene specific primers. Simplified VIDISCA detected seven viruses (70%) and the proportion of eukaryotic viral sequences from each sample ranged from 8.3 to 45.8%. Human enterovirus EV-B97, EV-B100, echovirus-9 and echovirus-21, human parechovirus type-3, human astrovirus probably a type-3/5 recombinant, and tetnovirus-1 were identified. Phylogenetic analysis based on the VP1 region demonstrated that the human enteroviruses are more divergent isolates circulating in the community. Our data support that a simplified VIDISCA protocol can efficiently identify unrecognized viruses grown in cell culture with low cost, limited time without need of advanced technical expertise. Also complex data interpretation is avoided thus the method can be used as a powerful diagnostic tool in limited resources. Redesigning the routine diagnostics might lead to additional detection of previously undiagnosed viruses in clinical samples of patients.
Visualization of Concurrent Program Executions

NASA Technical Reports Server (NTRS)

Artho, Cyrille; Havelund, Klaus; Honiden, Shinichi

2007-01-01

Various program analysis techniques are efficient at discovering failures and properties. However, it is often difficult to evaluate results, such as program traces. This calls for abstraction and visualization tools. We propose an approach based on UML sequence diagrams, addressing shortcomings of such diagrams for concurrency. The resulting visualization is expressive and provides all the necessary information at a glance.
Analysis of the gut microbiome in beef cattle and its association with feed intake, growth, and efficiency

USDA-ARS?s Scientific Manuscript database

Next-generation sequencing has taken a central role in studies of microbial ecology, especially with regard to culture-independent methods based on molecular phylogenies of the small-subunit ribosomal RNA gene (16S rRNA gene). The ability to relate trends at the species or genus level to host/envir...
Development of a 690K SNP array in catfish and its application for genetic mapping and validation of the reference genome sequence

USDA-ARS?s Scientific Manuscript database

Single nucleotide polymorphisms (SNPs) are capable of providing the highest level of genome coverage for genomic and genetic analysis because of their abundance and relatively even distribution in the genome. Such a capacity, however, cannot be achieved without an efficient genotyping platform such ...
A harmonized immunoassay with liquid chromatography-mass spectrometry analysis in egg allergen determination.

PubMed

Nimata, Masaomi; Okada, Hideki; Kurihara, Kei; Sugimoto, Tsukasa; Honjoh, Tsutomu; Kuroda, Kazuhiko; Yano, Takeo; Tachibana, Hirofumi; Shoji, Masahiro

2018-01-01

Food allergy is a serious health issue worldwide. Implementing allergen labeling regulations is extremely challenging for regulators, food manufacturers, and analytical kit manufacturers. Here we have developed an "amino acid sequence immunoassay" approach to ELISA. The new ELISA comprises of a monoclonal antibody generated via an analyte specific peptide antigen and sodium lauryl sulfate/sulfite solution. This combination enables the antibody to access the epitope site in unfolded analyte protein. The newly developed ELISA recovered 87.1%-106.4% ovalbumin from ovalbumin-incurred model processed foods, thereby demonstrating its applicability as practical egg allergen determination. Furthermore, the comparison of LC-MS/MS and the new ELISA, which targets the amino acid sequence conforming to the LC-MS/MS detection peptide, showed a good agreement. Consequently the harmonization of two methods was demonstrated. The complementary use of the new ELISA and LC-MS analysis can offer a wide range of practical benefits in terms of easiness, cost, accuracy, and efficiency in food allergen analysis. In addition, the new assay is attractive in respect to its easy antigen preparation and predetermined specificity. Graphical abstract The ELISA composing of the monoclonal antibody targeting the amino acid sequence conformed to LC-MS detection peptide, and the protein conformation unfolding reagent was developed. In ovalbumin determination, the developed ELISA showed a good agreement with LC-MS analysis. Consequently the harmonization of immunoassay with LC-MS analysis by using common target amino acid sequence was demonstrated.
A novel bioinformatics method for efficient knowledge discovery by BLSOM from big genomic sequence data.

PubMed

Bai, Yu; Iwasaki, Yuki; Kanaya, Shigehiko; Zhao, Yue; Ikemura, Toshimichi

2014-01-01

With remarkable increase of genomic sequence data of a wide range of species, novel tools are needed for comprehensive analyses of the big sequence data. Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on one map. By modifying the conventional SOM, we have previously developed Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes and then the compositions in the human and mouse genomes in order to investigate an efficient method for detecting differences between the closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," and the specific regions specifically enriched in transcription-factor-binding sequences. Because the classification and visualization power is very high, BLSOM is an efficient powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
User Guidelines for the Brassica Database: BRAD.

PubMed

Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu

2016-01-01

The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users to mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding application, especially for those with low levels of bioinformatic training. Therefore, the Brassica database (BRAD) was built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes with the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene study information for use with newly sequenced Brassica crops.

Investigation of the effect of finite pulse errors on the BABA pulse sequence using the Floquet-Magnus expansion approach

NASA Astrophysics Data System (ADS)

Mananga, Eugene S.; Reid, Alicia E.

2013-01-01

This paper presents a study of finite pulse widths for the BABA pulse sequence using the Floquet-Magnus expansion (FME) approach. In the FME scheme, the first order ? is identical to its counterparts in average Hamiltonian theory (AHT) and Floquet theory (FT). However, the timing part in the FME approach is introduced via the ? function not present in other schemes. This function provides an easy way for evaluating the spin evolution during the time in between' through the Magnus expansion of the operator connected to the timing part of the evolution. The evaluation of ? is particularly useful for the analysis of the non-stroboscopic evolution. Here, the importance of the boundary conditions, which provide a natural choice of ? , is ignored. This work uses the ? function to compare the efficiency of the BABA pulse sequence with ? and the BABA pulse sequence with finite pulses. Calculations of ? and ? are presented.
Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism.

PubMed

Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander

2017-09-10

The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in the last years faster alternatives have been proposed including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale-out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy. Copyright © 2017 The Author(s). Published by Elsevier B.V. All rights reserved.
Rapid and Easy Protocol for Quantification of Next-Generation Sequencing Libraries.

PubMed

Hawkins, Steve F C; Guest, Paul C

2018-01-01

The emergence of next-generation sequencing (NGS) over the last 10 years has increased the efficiency of DNA sequencing in terms of speed, ease, and price. However, the exact quantification of a NGS library is crucial in order to obtain good data on sequencing platforms developed by the current market leader Illumina. Different approaches for DNA quantification are available currently and the most commonly used are based on analysis of the physical properties of the DNA through spectrophotometric or fluorometric methods. Although these methods are technically simple, they do not allow exact quantification as can be achieved using a real-time quantitative PCR (qPCR) approach. A qPCR protocol for DNA quantification with applications in NGS library preparation studies is presented here. This can be applied in various fields of study such as medical disorders resulting from nutritional programming disturbances.
Two‐phase designs for joint quantitative‐trait‐dependent and genotype‐dependent sampling in post‐GWAS regional sequencing

PubMed Central

Espin‐Garcia, Osvaldo; Craiu, Radu V.

2017-01-01

ABSTRACT We evaluate two‐phase designs to follow‐up findings from genome‐wide association study (GWAS) when the cost of regional sequencing in the entire cohort is prohibitive. We develop novel expectation‐maximization‐based inference under a semiparametric maximum likelihood formulation tailored for post‐GWAS inference. A GWAS‐SNP (where SNP is single nucleotide polymorphism) serves as a surrogate covariate in inferring association between a sequence variant and a normally distributed quantitative trait (QT). We assess test validity and quantify efficiency and power of joint QT‐SNP‐dependent sampling and analysis under alternative sample allocations by simulations. Joint allocation balanced on SNP genotype and extreme‐QT strata yields significant power improvements compared to marginal QT‐ or SNP‐based allocations. We illustrate the proposed method and evaluate the sensitivity of sample allocation to sampling variation using data from a sequencing study of systolic blood pressure. PMID:29239496
Mycofier: a new machine learning-based classifier for fungal ITS sequences.

PubMed

Delgado-Serrano, Luisa; Restrepo, Silvia; Bustos, Jose Ricardo; Zambrano, Maria Mercedes; Anzola, Juan Manuel

2016-08-11

The taxonomic and phylogenetic classification based on sequence analysis of the ITS1 genomic region has become a crucial component of fungal ecology and diversity studies. Nowadays, there is no accurate alignment-free classification tool for fungal ITS1 sequences for large environmental surveys. This study describes the development of a machine learning-based classifier for the taxonomical assignment of fungal ITS1 sequences at the genus level. A fungal ITS1 sequence database was built using curated data. Training and test sets were generated from it. A Naïve Bayesian classifier was built using features from the primary sequence with an accuracy of 87 % in the classification at the genus level. The final model was based on a Naïve Bayes algorithm using ITS1 sequences from 510 fungal genera. This classifier, denoted as Mycofier, provides similar classification accuracy compared to BLASTN, but the database used for the classification contains curated data and the tool, independent of alignment, is more efficient and contributes to the field, given the lack of an accurate classification tool for large data from fungal ITS1 sequences. The software and source code for Mycofier are freely available at https://github.com/ldelgado-serrano/mycofier.git .
Image Encryption Algorithm Based on Hyperchaotic Maps and Nucleotide Sequences Database

PubMed Central

2017-01-01

Image encryption technology is one of the main means to ensure the safety of image information. Using the characteristics of chaos, such as randomness, regularity, ergodicity, and initial value sensitiveness, combined with the unique space conformation of DNA molecules and their unique information storage and processing ability, an efficient method for image encryption based on the chaos theory and a DNA sequence database is proposed. In this paper, digital image encryption employs a process of transforming the image pixel gray value by using chaotic sequence scrambling image pixel location and establishing superchaotic mapping, which maps quaternary sequences and DNA sequences, and by combining with the logic of the transformation between DNA sequences. The bases are replaced under the displaced rules by using DNA coding in a certain number of iterations that are based on the enhanced quaternary hyperchaotic sequence; the sequence is generated by Chen chaos. The cipher feedback mode and chaos iteration are employed in the encryption process to enhance the confusion and diffusion properties of the algorithm. Theoretical analysis and experimental results show that the proposed scheme not only demonstrates excellent encryption but also effectively resists chosen-plaintext attack, statistical attack, and differential attack. PMID:28392799
Assembly and analysis of a male sterile rubber tree mitochondrial genome reveals DNA rearrangement events and a novel transcript.

PubMed

Shearman, Jeremy R; Sangsrakru, Duangjai; Ruang-Areerate, Panthita; Sonthirod, Chutima; Uthaipaisanwong, Pichahpuk; Yoocha, Thippawan; Poopear, Supannee; Theerawattanasuk, Kanikar; Tragoonrung, Somvong; Tangphatsornruang, Sithichoke

2014-02-10

The rubber tree, Hevea brasiliensis, is an important plant species that is commercially grown to produce latex rubber in many countries. The rubber tree variety BPM 24 exhibits cytoplasmic male sterility, inherited from the variety GT 1. We constructed the rubber tree mitochondrial genome of a cytoplasmic male sterile variety, BPM 24, using 454 sequencing, including 8 kb paired-end libraries, plus Illumina paired-end sequencing. We annotated this mitochondrial genome with the aid of Illumina RNA-seq data and performed comparative analysis. We then compared the sequence of BPM 24 to the contigs of the published rubber tree, variety RRIM 600, and identified a rearrangement that is unique to BPM 24 resulting in a novel transcript containing a portion of atp9. The novel transcript is consistent with changes that cause cytoplasmic male sterility through a slight reduction to ATP production efficiency. The exhaustive nature of the search rules out alternative causes and supports previous findings of novel transcripts causing cytoplasmic male sterility.
Analysis of protein-coding genetic variation in 60,706 humans.

PubMed

Lek, Monkol; Karczewski, Konrad J; Minikel, Eric V; Samocha, Kaitlin E; Banks, Eric; Fennell, Timothy; O'Donnell-Luria, Anne H; Ware, James S; Hill, Andrew J; Cummings, Beryl B; Tukiainen, Taru; Birnbaum, Daniel P; Kosmicki, Jack A; Duncan, Laramie E; Estrada, Karol; Zhao, Fengmei; Zou, James; Pierce-Hoffman, Emma; Berghout, Joanne; Cooper, David N; Deflaux, Nicole; DePristo, Mark; Do, Ron; Flannick, Jason; Fromer, Menachem; Gauthier, Laura; Goldstein, Jackie; Gupta, Namrata; Howrigan, Daniel; Kiezun, Adam; Kurki, Mitja I; Moonshine, Ami Levy; Natarajan, Pradeep; Orozco, Lorena; Peloso, Gina M; Poplin, Ryan; Rivas, Manuel A; Ruano-Rubio, Valentin; Rose, Samuel A; Ruderfer, Douglas M; Shakir, Khalid; Stenson, Peter D; Stevens, Christine; Thomas, Brett P; Tiao, Grace; Tusie-Luna, Maria T; Weisburd, Ben; Won, Hong-Hee; Yu, Dongmei; Altshuler, David M; Ardissino, Diego; Boehnke, Michael; Danesh, John; Donnelly, Stacey; Elosua, Roberto; Florez, Jose C; Gabriel, Stacey B; Getz, Gad; Glatt, Stephen J; Hultman, Christina M; Kathiresan, Sekar; Laakso, Markku; McCarroll, Steven; McCarthy, Mark I; McGovern, Dermot; McPherson, Ruth; Neale, Benjamin M; Palotie, Aarno; Purcell, Shaun M; Saleheen, Danish; Scharf, Jeremiah M; Sklar, Pamela; Sullivan, Patrick F; Tuomilehto, Jaakko; Tsuang, Ming T; Watkins, Hugh C; Wilson, James G; Daly, Mark J; MacArthur, Daniel G

2016-08-18

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
Modified midi- and mini-multiplex PCR systems for mitochondrial DNA control region sequence analysis in degraded samples.

PubMed

Kim, Na Young; Lee, Hwan Young; Park, Sun Joo; Yang, Woo Ick; Shin, Kyoung-Jin

2013-05-01

Two multiplex polymerase chain reaction (PCR) systems (Midiplex and Miniplex) were developed for the amplification of the mitochondrial DNA (mtDNA) control region, and the efficiencies of the multiplexes for amplifying degraded DNA were validated using old skeletal remains. The Midiplex system consisted of two multiplex PCRs to amplify six overlapping amplicons ranging in length from 227 to 267 bp. The Miniplex system consisted of three multiplex PCRs to amplify 10 overlapping short amplicons ranging in length from 142 to 185 bp. Most mtDNA control region sequences of several 60-year-old and 400-500-year-old skeletal remains were successfully obtained using both PCR systems and consistent with those previously obtained by monoplex amplification. The multiplex system consisting of smaller amplicons is effective for mtDNA sequence analyses of ancient and forensic degraded samples, saving time, cost, and the amount of DNA sample consumed during analysis. © 2013 American Academy of Forensic Sciences.
Calibrating genomic and allelic coverage bias in single-cell sequencing.

PubMed

Zhang, Cheng-Zhong; Adalsteinsson, Viktor A; Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L; Meyerson, Matthew; Love, J Christopher

2015-04-16

Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1-10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (∼0.1 × ) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples.
Calibrating genomic and allelic coverage bias in single-cell sequencing

PubMed Central

Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L.; Meyerson, Matthew; Love, J. Christopher

2016-01-01

Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1–10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (~0.1 ×) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples. PMID:25879913
FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences.

PubMed

Waldmann, Jost; Gerken, Jan; Hankeln, Wolfgang; Schweer, Timmy; Glöckner, Frank Oliver

2014-06-14

Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy is hampered. FastaValidator represents a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software which needs to parse quickly and accurately large amounts of sequence data. For end-users FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines. The accuracy and performance of the FastaValidator library qualifies it for large data sets such as those commonly produced by massive parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.
Faster sequence homology searches by clustering subsequences.

PubMed

Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka

2015-04-15

Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic analysis. We developed a fast homology search method based on database subsequence clustering, and implemented it as GHOSTZ. This method clusters similar subsequences from a database to perform an efficient seed search and ungapped extension by reducing alignment candidates based on triangle inequality. The database subsequence clustering technique achieved an ∼2-fold increase in speed without a large decrease in search sensitivity. When we measured with metagenomic data, GHOSTZ is ∼2.2-2.8 times faster than RAPSearch and is ∼185-261 times faster than BLASTX. The source code is freely available for download at http://www.bi.cs.titech.ac.jp/ghostz/ akiyama@cs.titech.ac.jp Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
An evolution based biosensor receptor DNA sequence generation algorithm.

PubMed

Kim, Eungyeong; Lee, Malrey; Gatton, Thomas M; Lee, Jaewan; Zang, Yupeng

2010-01-01

A biosensor is composed of a bioreceptor, an associated recognition molecule, and a signal transducer that can selectively detect target substances for analysis. DNA based biosensors utilize receptor molecules that allow hybridization with the target analyte. However, most DNA biosensor research uses oligonucleotides as the target analytes and does not address the potential problems of real samples. The identification of recognition molecules suitable for real target analyte samples is an important step towards further development of DNA biosensors. This study examines the characteristics of DNA used as bioreceptors and proposes a hybrid evolution-based DNA sequence generating algorithm, based on DNA computing, to identify suitable DNA bioreceptor recognition molecules for stable hybridization with real target substances. The Traveling Salesman Problem (TSP) approach is applied in the proposed algorithm to evaluate the safety and fitness of the generated DNA sequences. This approach improves efficiency and stability for enhanced and variable-length DNA sequence generation and allows extension to generation of variable-length DNA sequences with diverse receptor recognition requirements.
Monitoring and Surveillance of Marine Invasive Species in Californian Waters by DNA Barcoding: Methodological and Analytical Solutions

NASA Astrophysics Data System (ADS)

Campbell, T. L.; Geller, J. B.; Heller, P.; Ruiz, G.; Chang, A.; McCann, L.; Ceballos, L.; Marraffini, M.; Ashton, G.; Larson, K.; Havard, S.; Meagher, K.; Wheelock, M.; Drake, C.; Rhett, G.

2016-02-01

The Ballast Water Management Act, the Marine Invasive Species Act, and the Coastal Ecosystem Protection Act require the California Department of Fish and Wildlife to monitor and evaluate the extent of biological invasions in the state's marine and estuarine waters. This has been performed statewide, using a variety of methodologies. Conventional sample collection and processing is laborious, slow and costly, and may require considerable taxonomic expertise requiring detailed time-consuming microscopic study of multiple specimens. These factors limit the volume of biomass that can be searched for introduced species. New technologies continue to reduce the cost and increase the throughput of genetic analyses, which become efficient alternatives to traditional morphological analysis for identification, monitoring and surveillance of marine invasive species. Using next-generation sequencing of mitochondrial Cytochrome c oxidase subunit I (COI) and nuclear large subunit ribosomal RNA (LSU), we analyzed over 15,000 individual marine invertebrates collected in Californian waters. We have created sequence databases of California native and non-native species to assist in molecular identification and surveillance in North American waters. Metagenetics, the next-generation sequencing of environmental samples with comparison to DNA sequence databases, is a faster and cost-effective alternative to individual sample analysis. We have sequenced from biomass collected from whole settlement plates and plankton in California harbors, and used our introduced species database to create species lists. We can combine these species lists for individual marinas with collected environmental data, such as temperature, salinity, and dissolved oxygen to understand the ecology of marine invasions. Here we discuss high throughput sampling, sequencing, and COASTLINE, our data analysis answer to challenges working with hundreds of millions of sequencing reads from tens of thousands of specimens.
Diagnosis of autosomal dominant polycystic kidney disease using efficient PKD1 and PKD2 targeted next-generation sequencing.

PubMed

Trujillano, Daniel; Bullich, Gemma; Ossowski, Stephan; Ballarín, José; Torra, Roser; Estivill, Xavier; Ars, Elisabet

2014-09-01

Molecular diagnostics of autosomal dominant polycystic kidney disease (ADPKD) relies on mutation screening of PKD1 and PKD2, which is complicated by extensive allelic heterogeneity and the presence of six highly homologous sequences of PKD1. To date, specific sequencing of PKD1 requires laborious long-range amplifications. The high cost and long turnaround time of PKD1 and PKD2 mutation analysis using conventional techniques limits its widespread application in clinical settings. We performed targeted next-generation sequencing (NGS) of PKD1 and PKD2. Pooled barcoded DNA patient libraries were enriched by in-solution hybridization with PKD1 and PKD2 capture probes. Bioinformatics analysis was performed using an in-house developed pipeline. We validated the assay in a cohort of 36 patients with previously known PKD1 and PKD2 mutations and five control individuals. Then, we used the same assay and bioinformatics analysis in a discovery cohort of 12 uncharacterized patients. We detected 35 out of 36 known definitely, highly likely, and likely pathogenic mutations in the validation cohort, including two large deletions. In the discovery cohort, we detected 11 different pathogenic mutations in 10 out of 12 patients. This study demonstrates that laborious long-range PCRs of the repeated PKD1 region can be avoided by in-solution enrichment of PKD1 and PKD2 and NGS. This strategy significantly reduces the cost and time for simultaneous PKD1 and PKD2 sequence analysis, facilitating routine genetic diagnostics of ADPKD.
Comparative Analysis of the Molecular Adjuvants and Their Binding Efficiency with CR1.

PubMed

Saranya, B; Saxena, Shweta; Saravanan, K M; Shakila, H

2016-03-01

There are so many obstacles in developing a vaccine or vaccine technology for diseases like cancer and human immunodeficiency virus infection. While developing vaccines that target specific infection, molecular adjuvants are indispensable. These molecular adjuvants act as a vaccine delivery vehicle to the immune system to increase the effectiveness of the specific antigens. In the present work, a computational study has been done on molecular adjuvants like IgGFc, GMCSF and C3d to find out how efficiently they are binding to CR1. Sequence, structure and mutational analysis are performed on the molecular adjuvants to understand the features important for their binding with the receptor. Results obtained from our study indicate that the adjuvant IgGFc complexed with the receptor CR1 has the best binding efficiency, which can be used further to develop better vaccine technologies.
Understanding the complex evolution of rapidly mutating viruses with deep sequencing: Beyond the analysis of viral diversity.

PubMed

Leung, Preston; Eltahla, Auda A; Lloyd, Andrew R; Bull, Rowena A; Luciani, Fabio

2017-07-15

With the advent of affordable deep sequencing technologies, detection of low frequency variants within genetically diverse viral populations can now be achieved with unprecedented depth and efficiency. The high-resolution data provided by next generation sequencing technologies is currently recognised as the gold standard in estimation of viral diversity. In the analysis of rapidly mutating viruses, longitudinal deep sequencing datasets from viral genomes during individual infection episodes, as well as at the epidemiological level during outbreaks, now allow for more sophisticated analyses such as statistical estimates of the impact of complex mutation patterns on the evolution of the viral populations both within and between hosts. These analyses are revealing more accurate descriptions of the evolutionary dynamics that underpin the rapid adaptation of these viruses to the host response, and to drug therapies. This review assesses recent developments in methods and provide informative research examples using deep sequencing data generated from rapidly mutating viruses infecting humans, particularly hepatitis C virus (HCV), human immunodeficiency virus (HIV), Ebola virus and influenza virus, to understand the evolution of viral genomes and to explore the relationship between viral mutations and the host adaptive immune response. Finally, we discuss limitations in current technologies, and future directions that take advantage of publically available large deep sequencing datasets. Copyright © 2016 Elsevier B.V. All rights reserved.
Fragment analysis represents a suitable approach for the detection of hotspot c.7541_7542delCT NOTCH1 mutation in chronic lymphocytic leukemia.

PubMed

Vavrova, Eva; Kantorova, Barbara; Vonkova, Barbara; Kabathova, Jitka; Skuhrova-Francova, Hana; Diviskova, Eva; Letocha, Ondrej; Kotaskova, Jana; Brychtova, Yvona; Doubek, Michael; Mayer, Jiri; Pospisilova, Sarka

2017-09-01

The hotspot c.7541_7542delCT NOTCH1 mutation has been proven to have a negative clinical impact in chronic lymphocytic leukemia (CLL). However, an optimal method for its detection has not yet been specified. The aim of our study was to examine the presence of the NOTCH1 mutation in CLL using three commonly used molecular methods. Sanger sequencing, fragment analysis and allele-specific PCR were compared in the detection of the c.7541_7542delCT NOTCH1 mutation in 201 CLL patients. In 7 patients with inconclusive mutational analysis results, the presence of the NOTCH1 mutation was also confirmed using ultra-deep next generation sequencing. The NOTCH1 mutation was detected in 15% (30/201) of examined patients. Only fragment analysis was able to identify all 30 NOTCH1-mutated patients. Sanger sequencing and allele-specific PCR showed a lower detection efficiency, determining 93% (28/30) and 80% (24/30) of the present NOTCH1 mutations, respectively. Considering these three most commonly used methodologies for c.7541_7542delCT NOTCH1 mutation screening in CLL, we defined fragment analysis as the most suitable approach for detecting the hotspot NOTCH1 mutation. Copyright © 2017 Elsevier Ltd. All rights reserved.
New Uses for Sensitivity Analysis: How Different Movement Tasks Effect Limb Model Parameter Sensitivity

NASA Technical Reports Server (NTRS)

Winters, J. M.; Stark, L.

1984-01-01

Original results for a newly developed eight-order nonlinear limb antagonistic muscle model of elbow flexion and extension are presented. A wider variety of sensitivity analysis techniques are used and a systematic protocol is established that shows how the different methods can be used efficiently to complement one another for maximum insight into model sensitivity. It is explicitly shown how the sensitivity of output behaviors to model parameters is a function of the controller input sequence, i.e., of the movement task. When the task is changed (for instance, from an input sequence that results in the usual fast movement task to a slower movement that may also involve external loading, etc.) the set of parameters with high sensitivity will in general also change. Such task-specific use of sensitivity analysis techniques identifies the set of parameters most important for a given task, and even suggests task-specific model reduction possibilities.

19F DOSY NMR analysis for spin systems with nJFF couplings.

PubMed

Dal Poggetto, Guilherme; Favaro, Denize C; Nilsson, Mathias; Morris, Gareth A; Tormena, Cláudio F

2014-04-01

NMR is a powerful method for identification and quantification of drug components and contaminations. These problems present themselves as mixtures, and here, one of the most powerful tools is DOSY. DOSY works best when there is no spectral overlap between components, so drugs containing fluorine substituents are well-suited for DOSY analysis as (19)F spectra are typically very sparse. Here, we demonstrate the use of a modified (19)F DOSY experiment (on the basis of the Oneshot sequences) for various fluorinated benzenes. For compounds with significant (n) JFF coupling constants, as is common, the undesirable J-modulation can be efficiently suppressed using the Oneshot45 pulse sequence. This investigation highlights (19)F DOSY as a valuable and robust method for analysis of molecular systems containing fluorine atoms even where there are large fluorine-fluorine couplings. Copyright © 2014 John Wiley & Sons, Ltd.
[Identification of antler powder components based on DNA barcoding technology].

PubMed

Jia, Jing; Shi, Lin-chun; Xu, Zhi-chao; Xin, Tian-yi; Song, Jing-yuan; Chen Shi, Lin

2015-10-01

In order to authenticate the components of antler powder in the market, DNA barcoding technology coupled with cloning method were used. Cytochrome c oxidase subunit I (COI) sequences were obtained according to the DNA barcoding standard operation procedure (SOP). For antler powder with possible mixed components, the cloning method was used to get each COI sequence. 65 COI sequences were successfully obtained from commercial antler powders via sequencing PCR products. The results indicates that only 38% of these samples were derived from Cervus nippon Temminck or Cervus elaphus Linnaeus which is recorded in the 2010 edition of "Chinese Pharmacopoeia", while 62% of them were derived from other species. Rangifer tarandus Linnaeus was the most frequent species among the adulterants. Further analysis showed that some samples collected from different regions, companies and prices, contained adulterants. Analysis of 36 COI sequences obtained by the cloning method showed that C. elaphus and C. nippon were main components. In addition, some samples were marked clearly as antler powder on the label, however, C. elaphus or R. tarandus were their main components. In summary, DNA barcoding can accurately and efficiently distinguish the exact content in the commercial antler powder, which provides a new technique to ensure clinical safety and improve quality control of Chinese traditional medicine
Bacterial Pathogens and Community Composition in Advanced Sewage Treatment Systems Revealed by Metagenomics Analysis Based on High-Throughput Sequencing

PubMed Central

Lu, Xin; Zhang, Xu-Xiang; Wang, Zhu; Huang, Kailong; Wang, Yuan; Liang, Weigang; Tan, Yunfei; Liu, Bo; Tang, Junying

2015-01-01

This study used 454 pyrosequencing, Illumina high-throughput sequencing and metagenomic analysis to investigate bacterial pathogens and their potential virulence in a sewage treatment plant (STP) applying both conventional and advanced treatment processes. Pyrosequencing and Illumina sequencing consistently demonstrated that Arcobacter genus occupied over 43.42% of total abundance of potential pathogens in the STP. At species level, potential pathogens Arcobacter butzleri, Aeromonas hydrophila and Klebsiella pneumonia dominated in raw sewage, which was also confirmed by quantitative real time PCR. Illumina sequencing also revealed prevalence of various types of pathogenicity islands and virulence proteins in the STP. Most of the potential pathogens and virulence factors were eliminated in the STP, and the removal efficiency mainly depended on oxidation ditch. Compared with sand filtration, magnetic resin seemed to have higher removals in most of the potential pathogens and virulence factors. However, presence of the residual A. butzleri in the final effluent still deserves more concerns. The findings indicate that sewage acts as an important source of environmental pathogens, but STPs can effectively control their spread in the environment. Joint use of the high-throughput sequencing technologies is considered a reliable method for deep and comprehensive overview of environmental bacterial virulence. PMID:25938416
Real-Time PCR Quantification Using A Variable Reaction Efficiency Model

PubMed Central

Platts, Adrian E.; Johnson, Graham D.; Linnemann, Amelia K.; Krawetz, Stephen A.

2008-01-01

Quantitative real-time PCR remains a cornerstone technique in gene expression analysis and sequence characterization. Despite the importance of the approach to experimental biology the confident assignment of reaction efficiency to the early cycles of real-time PCR reactions remains problematic. Considerable noise may be generated where few cycles in the amplification are available to estimate peak efficiency. An alternate approach that uses data from beyond the log-linear amplification phase is explored with the aim of reducing noise and adding confidence to efficiency estimates. PCR reaction efficiency is regressed to estimate the per-cycle profile of an asymptotically departed peak efficiency, even when this is not closely approximated in the measurable cycles. The process can be repeated over replicates to develop a robust estimate of peak reaction efficiency. This leads to an estimate of the maximum reaction efficiency that may be considered primer-design specific. Using a series of biological scenarios we demonstrate that this approach can provide an accurate estimate of initial template concentration. PMID:18570886
BamTools: a C++ API and toolkit for analyzing and managing BAM files.

PubMed

Barnett, Derek W; Garrison, Erik K; Quinlan, Aaron R; Strömberg, Michael P; Marth, Gabor T

2011-06-15

Analysis of genomic sequencing data requires efficient, easy-to-use access to alignment results and flexible data management tools (e.g. filtering, merging, sorting, etc.). However, the enormous amount of data produced by current sequencing technologies is typically stored in compressed, binary formats that are not easily handled by the text-based parsers commonly used in bioinformatics research. We introduce a software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. BamTools was written in C++, and is supported on Linux, Mac OSX and MS Windows. Source code and documentation are freely available at http://github.org/pezmaster31/bamtools.
Dynamic multiplexed analysis method using ion mobility spectrometer

DOEpatents

Belov, Mikhail E [Richland, WA

2010-05-18

A method for multiplexed analysis using ion mobility spectrometer in which the effectiveness and efficiency of the multiplexed method is optimized by automatically adjusting rates of passage of analyte materials through an IMS drift tube during operation of the system. This automatic adjustment is performed by the IMS instrument itself after determining the appropriate levels of adjustment according to the method of the present invention. In one example, the adjustment of the rates of passage for these materials is determined by quantifying the total number of analyte molecules delivered to the ion trap in a preselected period of time, comparing this number to the charge capacity of the ion trap, selecting a gate opening sequence; and implementing the selected gate opening sequence to obtain a preselected rate of analytes within said IMS drift tube.
Copy number variants calling for single cell sequencing data by multi-constrained optimization.

PubMed

Xu, Bo; Cai, Hongmin; Zhang, Changsheng; Yang, Xi; Han, Guoqiang

2016-08-01

Variations in DNA copy number carry important information on genome evolution and regulation of DNA replication in cancer cells. The rapid development of single-cell sequencing technology allows one to explore gene expression heterogeneity among single-cells, thus providing important cancer cell evolution information. Single-cell DNA/RNA sequencing data usually have low genome coverage, which requires an extra step of amplification to accumulate enough samples. However, such amplification will introduce large bias and makes bioinformatics analysis challenging. Accurately modeling the distribution of sequencing data and effectively suppressing the bias influence is the key to success variations analysis. Recent advances demonstrate the technical noises by amplification are more likely to follow negative binomial distribution, a special case of Poisson distribution. Thus, we tackle the problem CNV detection by formulating it into a quadratic optimization problem involving two constraints, in which the underling signals are corrupted by Poisson distributed noises. By imposing the constraints of sparsity and smoothness, the reconstructed read depth signals from single-cell sequencing data are anticipated to fit the CNVs patterns more accurately. An efficient numerical solution based on the classical alternating direction minimization method (ADMM) is tailored to solve the proposed model. We demonstrate the advantages of the proposed method using both synthetic and empirical single-cell sequencing data. Our experimental results demonstrate that the proposed method achieves excellent performance and high promise of success with single-cell sequencing data. Crown Copyright © 2016. Published by Elsevier Ltd. All rights reserved.
Cloning and sequence analysis demonstrate the chromate reduction ability of a novel chromate reductase gene from Serratia sp.

PubMed

Deng, Peng; Tan, Xiaoqing; Wu, Ying; Bai, Qunhua; Jia, Yan; Xiao, Hong

2015-03-01

The ChrT gene encodes a chromate reductase enzyme which catalyzes the reduction of Cr(VI). The chromate reductase is also known as flavin mononucleotide (FMN) reductase (FMN_red). The aim of the present study was to clone the full-length ChrT DNA from Serratia sp. CQMUS2 and analyze the deduced amino acid sequence and three-dimensional structure. The putative ChrT gene fragment of Serratia sp. CQMUS2 was isolated by polymerase chain reaction (PCR), according to the known FMN_red gene sequence from Serratia sp. AS13. The flanking sequences of the ChrT gene were obtained by high efficiency TAIL-PCR, while the full-length gene of ChrT was cloned in Escherichia coli for subsequent sequencing. The nucleotide sequence of ChrT was submitted onto GenBank under the accession number, KF211434. Sequence analysis of the gene and amino acids was conducted using the Basic Local Alignment Search Tool, and open reading frame (ORF) analysis was performed using ORF Finder software. The ChrT gene was found to be an ORF of 567 bp that encodes a 188-amino acid enzyme with a calculated molecular weight of 20.4 kDa. In addition, the ChrT protein was hypothesized to be an NADPH-dependent FMN_red and a member of the flavodoxin-2 superfamily. The amino acid sequence of ChrT showed high sequence similarity to the FMN reductase genes of Klebsiella pneumonia and Raoultella ornithinolytica , which belong to the flavodoxin-2 superfamily. Furthermore, ChrT was shown to have a 85.6% similarity to the three-dimensional structure of Escherichia coli ChrR, sharing four common enzyme active sites for chromate reduction. Therefore, ChrT gene cloning and protein structure determination demonstrated the ability of the gene for chromate reduction. The results of the present study provide a basis for further studies on ChrT gene expression and protein function.
Cloning and sequence analysis demonstrate the chromate reduction ability of a novel chromate reductase gene from Serratia sp

PubMed Central

DENG, PENG; TAN, XIAOQING; WU, YING; BAI, QUNHUA; JIA, YAN; XIAO, HONG

2015-01-01

The ChrT gene encodes a chromate reductase enzyme which catalyzes the reduction of Cr(VI). The chromate reductase is also known as flavin mononucleotide (FMN) reductase (FMN_red). The aim of the present study was to clone the full-length ChrT DNA from Serratia sp. CQMUS2 and analyze the deduced amino acid sequence and three-dimensional structure. The putative ChrT gene fragment of Serratia sp. CQMUS2 was isolated by polymerase chain reaction (PCR), according to the known FMN_red gene sequence from Serratia sp. AS13. The flanking sequences of the ChrT gene were obtained by high efficiency TAIL-PCR, while the full-length gene of ChrT was cloned in Escherichia coli for subsequent sequencing. The nucleotide sequence of ChrT was submitted onto GenBank under the accession number, KF211434. Sequence analysis of the gene and amino acids was conducted using the Basic Local Alignment Search Tool, and open reading frame (ORF) analysis was performed using ORF Finder software. The ChrT gene was found to be an ORF of 567 bp that encodes a 188-amino acid enzyme with a calculated molecular weight of 20.4 kDa. In addition, the ChrT protein was hypothesized to be an NADPH-dependent FMN_red and a member of the flavodoxin-2 superfamily. The amino acid sequence of ChrT showed high sequence similarity to the FMN reductase genes of Klebsiella pneumonia and Raoultella ornithinolytica, which belong to the flavodoxin-2 superfamily. Furthermore, ChrT was shown to have a 85.6% similarity to the three-dimensional structure of Escherichia coli ChrR, sharing four common enzyme active sites for chromate reduction. Therefore, ChrT gene cloning and protein structure determination demonstrated the ability of the gene for chromate reduction. The results of the present study provide a basis for further studies on ChrT gene expression and protein function. PMID:25667630
CapZyme-Seq Comprehensively Defines Promoter-Sequence Determinants for RNA 5' Capping with NAD.

PubMed

Vvedenskaya, Irina O; Bird, Jeremy G; Zhang, Yuanchao; Zhang, Yu; Jiao, Xinfu; Barvík, Ivan; Krásný, Libor; Kiledjian, Megerditch; Taylor, Deanne M; Ebright, Richard H; Nickels, Bryce E

2018-05-03

Nucleoside-containing metabolites such as NAD + can be incorporated as 5' caps on RNA by serving as non-canonical initiating nucleotides (NCINs) for transcription initiation by RNA polymerase (RNAP). Here, we report CapZyme-seq, a high-throughput-sequencing method that employs NCIN-decapping enzymes NudC and Rai1 to detect and quantify NCIN-capped RNA. By combining CapZyme-seq with multiplexed transcriptomics, we determine efficiencies of NAD + capping by Escherichia coli RNAP for ∼16,000 promoter sequences. The results define preferred transcription start site (TSS) positions for NAD + capping and define a consensus promoter sequence for NAD + capping: HRRASWW (TSS underlined). By applying CapZyme-seq to E. coli total cellular RNA, we establish that sequence determinants for NCIN capping in vivo match the NAD + -capping consensus defined in vitro, and we identify and quantify NCIN-capped small RNAs (sRNAs). Our findings define the promoter-sequence determinants for NCIN capping with NAD + and provide a general method for analysis of NCIN capping in vitro and in vivo. Copyright © 2018 Elsevier Inc. All rights reserved.
VariantBam: filtering and profiling of next-generational sequencing data using region-specific rules.

PubMed

Wala, Jeremiah; Zhang, Cheng-Zhong; Meyerson, Matthew; Beroukhim, Rameen

2016-07-01

We developed VariantBam, a C ++ read filtering and profiling tool for use with BAM, CRAM and SAM sequencing files. VariantBam provides a flexible framework for extracting sequencing reads or read-pairs that satisfy combinations of rules, defined by any number of genomic intervals or variant sites. We have implemented filters based on alignment data, sequence motifs, regional coverage and base quality. For example, VariantBam achieved a median size reduction ratio of 3.1:1 when applied to 10 lung cancer whole genome BAMs by removing large tags and selecting for only high-quality variant-supporting reads and reads matching a large dictionary of sequence motifs. Thus VariantBam enables efficient storage of sequencing data while preserving the most relevant information for downstream analysis. VariantBam and full documentation are available at github.com/jwalabroad/VariantBam rameen@broadinstitute.org Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
A comparative analysis of exome capture.

PubMed

Parla, Jennifer S; Iossifov, Ivan; Grabill, Ian; Spector, Mona S; Kramer, Melissa; McCombie, W Richard

2011-09-29

Human exome resequencing using commercial target capture kits has been and is being used for sequencing large numbers of individuals to search for variants associated with various human diseases. We rigorously evaluated the capabilities of two solution exome capture kits. These analyses help clarify the strengths and limitations of those data as well as systematically identify variables that should be considered in the use of those data. Each exome kit performed well at capturing the targets they were designed to capture, which mainly corresponds to the consensus coding sequences (CCDS) annotations of the human genome. In addition, based on their respective targets, each capture kit coupled with high coverage Illumina sequencing produced highly accurate nucleotide calls. However, other databases, such as the Reference Sequence collection (RefSeq), define the exome more broadly, and so not surprisingly, the exome kits did not capture these additional regions. Commercial exome capture kits provide a very efficient way to sequence select areas of the genome at very high accuracy. Here we provide the data to help guide critical analyses of sequencing data derived from these products.
The immediate upstream region of the 5′-UTR from the AUG start codon has a pronounced effect on the translational efficiency in Arabidopsis thaliana

PubMed Central

Kim, Younghyun; Lee, Goeun; Jeon, Eunhyun; Sohn, Eun ju; Lee, Yongjik; Kang, Hyangju; Lee, Dong wook; Kim, Dae Heon; Hwang, Inhwan

2014-01-01

The nucleotide sequence around the translational initiation site is an important cis-acting element for post-transcriptional regulation. However, it has not been fully understood how the sequence context at the 5′-untranslated region (5′-UTR) affects the translational efficiency of individual mRNAs. In this study, we provide evidence that the 5′-UTRs of Arabidopsis genes showing a great difference in the nucleotide sequence vary greatly in translational efficiency with more than a 200-fold difference. Of the four types of nucleotides, the A residue was the most favourable nucleotide from positions −1 to −21 of the 5′-UTRs in Arabidopsis genes. In particular, the A residue in the 5′-UTR from positions −1 to −5 was required for a high-level translational efficiency. In contrast, the T residue in the 5′-UTR from positions −1 to −5 was the least favourable nucleotide in translational efficiency. Furthermore, the effect of the sequence context in the −1 to −21 region of the 5′-UTR was conserved in different plant species. Based on these observations, we propose that the sequence context immediately upstream of the AUG initiation codon plays a crucial role in determining the translational efficiency of plant genes. PMID:24084084
DNA pooling: a comprehensive, multi-stage association analysis of ACSL6 and SIRT5 polymorphisms in schizophrenia.

PubMed

Chowdari, K V; Northup, A; Pless, L; Wood, J; Joo, Y H; Mirnics, K; Lewis, D A; Levitt, P R; Bacanu, S-A; Nimgaonkar, V L

2007-04-01

Many candidate gene association studies have evaluated incomplete, unrepresentative sets of single nucleotide polymorphisms (SNPs), producing non-significant results that are difficult to interpret. Using a rapid, efficient strategy designed to investigate all common SNPs, we tested associations between schizophrenia and two positional candidate genes: ACSL6 (Acyl-Coenzyme A synthetase long-chain family member 6) and SIRT5 (silent mating type information regulation 2 homologue 5). We initially evaluated the utility of DNA sequencing traces to estimate SNP allele frequencies in pooled DNA samples. The mean variances for the DNA sequencing estimates were acceptable and were comparable to other published methods (mean variance: 0.0008, range 0-0.0119). Using pooled DNA samples from cases with schizophrenia/schizoaffective disorder (Diagnostic and Statistical Manual of Mental Disorders edition IV criteria) and controls (n=200, each group), we next sequenced all exons, introns and flanking upstream/downstream sequences for ACSL6 and SIRT5. Among 69 identified SNPs, case-control allele frequency comparisons revealed nine suggestive associations (P<0.2). Each of these SNPs was next genotyped in the individual samples composing the pools. A suggestive association with rs 11743803 at ACSL6 remained (allele-wise P=0.02), with diminished evidence in an extended sample (448 cases, 554 controls, P=0.062). In conclusion, we propose a multi-stage method for comprehensive, rapid, efficient and economical genetic association analysis that enables simultaneous SNP detection and allele frequency estimation in large samples. This strategy may be particularly useful for research groups lacking access to high throughput genotyping facilities. Our analyses did not yield convincing evidence for associations of schizophrenia with ACSL6 or SIRT5.
SNP discovery in the bovine milk transcriptome using RNA-Seq technology.

PubMed

Cánovas, Angela; Rincon, Gonzalo; Islas-Trejo, Alma; Wickramasinghe, Saumya; Medrano, Juan F

2010-12-01

High-throughput sequencing of RNA (RNA-Seq) was developed primarily to analyze global gene expression in different tissues. However, it also is an efficient way to discover coding SNPs. The objective of this study was to perform a SNP discovery analysis in the milk transcriptome using RNA-Seq. Seven milk samples from Holstein cows were analyzed by sequencing cDNAs using the Illumina Genome Analyzer system. We detected 19,175 genes expressed in milk samples corresponding to approximately 70% of the total number of genes analyzed. The SNP detection analysis revealed 100,734 SNPs in Holstein samples, and a large number of those corresponded to differences between the Holstein breed and the Hereford bovine genome assembly Btau4.0. The number of polymorphic SNPs within Holstein cows was 33,045. The accuracy of RNA-Seq SNP discovery was tested by comparing SNPs detected in a set of 42 candidate genes expressed in milk that had been resequenced earlier using Sanger sequencing technology. Seventy of 86 SNPs were detected using both RNA-Seq and Sanger sequencing technologies. The KASPar Genotyping System was used to validate unique SNPs found by RNA-Seq but not observed by Sanger technology. Our results confirm that analyzing the transcriptome using RNA-Seq technology is an efficient and cost-effective method to identify SNPs in transcribed regions. This study creates guidelines to maximize the accuracy of SNP discovery and prevention of false-positive SNP detection, and provides more than 33,000 SNPs located in coding regions of genes expressed during lactation that can be used to develop genotyping platforms to perform marker-trait association studies in Holstein cattle.
Recovery of soil unicellular eukaryotes: an efficiency and activity analysis on the single cell level.

PubMed

Lentendu, Guillaume; Hübschmann, Thomas; Müller, Susann; Dunker, Susanne; Buscot, François; Wilhelm, Christian

2013-12-01

Eukaryotic unicellular organisms are an important part of the soil microbial community, but they are often neglected in soil functional microbial diversity analysis, principally due to the absence of specific investigation methods in the special soil environment. In this study we used a method based on high-density centrifugation to specifically isolate intact algal and yeast cells, with the aim to analyze them with flow cytometry and sort them for further molecular analysis such as deep sequencing. Recovery efficiency was tested at low abundance levels that fit those in natural environments (10(4) to 10(6) cells per g soil). Five algae and five yeast morphospecies isolated from soil were used for the testing. Recovery efficiency was between 1.5 to 43.16% and 2 to 30.2%, respectively, and was dependent on soil type for three of the algae. Control treatments without soil showed that the majority of cells were lost due to the method itself (58% and 55.8% respectively). However, the cell extraction technique did not much compromise cell vitality because a fluorescein di-acetate assay indicated high viability percentages (73.3% and 97.2% of cells, respectively). The low abundant algae and yeast morphospecies recovered from soil were cytometrically analyzed and sorted. Following, their DNA was isolated and amplified using specific primers. The developed workflow enables isolation and enrichment of intact autotrophic and heterotrophic soil unicellular eukaryotes from natural environments for subsequent application of deep sequencing technologies. Copyright © 2013 Elsevier B.V. All rights reserved.
Genetic analysis of biodegradation of tetralin by a Sphingomonas strain

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hernaez, M.J.; Santero, E.; Reineke, W.

Tetralin (1,2,3,4-tetrahydronaphthalene) is produced for industrial purposes from naphthalene by catalytic hydrogenation or from anthracene by cracking. A strain designated TFA which very efficiently utilizes tetralin has been isolated from the Rhine river. The strain has been identified as Sphingomonas macrogoltabidus, based on 16S rDNA sequence similarity. Genetic analysis of tetralin biodegradation has been performed by insertion mutagenesis and by physical analysis and analysis of complementation between the mutants. The genes involved in tetralin utilization are clustered in a region of 9 kb, comprising at least five genes grouped in two divergently transcribed operons.
Non coding extremities of the seven influenza virus type C vRNA segments: effect on transcription and replication by the type C and type A polymerase complexes

PubMed Central

Crescenzo-Chaigne, Bernadette; Barbezange, Cyril; van der Werf, Sylvie

2008-01-01

Background The transcription/replication of the influenza viruses implicate the terminal nucleotide sequences of viral RNA, which comprise sequences at the extremities conserved among the genomic segments as well as variable 3' and 5' non-coding (NC) regions. The plasmid-based system for the in vivo reconstitution of functional ribonucleoproteins, upon expression of viral-like RNAs together with the nucleoprotein and polymerase proteins has been widely used to analyze transcription/replication of influenza viruses. It was thus shown that the type A polymerase could transcribe and replicate type A, B, or C vRNA templates whereas neither type B nor type C polymerases were able to transcribe and replicate type A templates efficiently. Here we studied the importance of the NC regions from the seven segments of type C influenza virus for efficient transcription/replication by the type A and C polymerases. Results The NC sequences of the seven genomic segments of the type C influenza virus C/Johannesburg/1/66 strain were found to be more variable in length than those of the type A and B viruses. The levels of transcription/replication of viral-like vRNAs harboring the NC sequences of the respective type C virus segments flanking the CAT reporter gene were comparable in the presence of either type C or type A polymerase complexes except for the NS and PB2-like vRNAs. For the NS-like vRNA, the transcription/replication level was higher after introduction of a U residue at position 6 in the 5' NC region as for all other segments. For the PB2-like vRNA the CAT expression level was particularly reduced with the type C polymerase. Analysis of mutants of the 5' NC sequence in the PB2-like vRNA, the shortest 5' NC sequence among the seven segments, showed that additional sequences within the PB2 ORF were essential for the efficiency of transcription but not replication by the type C polymerase complex. Conclusion In the context of a PB2-like reporter vRNA template, the sequence upstream the polyU stretch plays a role in the transcription/replication process by the type C polymerase complex. PMID:18973655
MetaGenSense: A web-application for analysis and exploration of high throughput sequencing metagenomic data

PubMed Central

Denis, Jean-Baptiste; Vandenbogaert, Mathias; Caro, Valérie

2016-01-01

The detection and characterization of emerging infectious agents has been a continuing public health concern. High Throughput Sequencing (HTS) or Next-Generation Sequencing (NGS) technologies have proven to be promising approaches for efficient and unbiased detection of pathogens in complex biological samples, providing access to comprehensive analyses. As NGS approaches typically yield millions of putatively representative reads per sample, efficient data management and visualization resources have become mandatory. Most usually, those resources are implemented through a dedicated Laboratory Information Management System (LIMS), solely to provide perspective regarding the available information. We developed an easily deployable web-interface, facilitating management and bioinformatics analysis of metagenomics data-samples. It was engineered to run associated and dedicated Galaxy workflows for the detection and eventually classification of pathogens. The web application allows easy interaction with existing Galaxy metagenomic workflows, facilitates the organization, exploration and aggregation of the most relevant sample-specific sequences among millions of genomic sequences, allowing them to determine their relative abundance, and associate them to the most closely related organism or pathogen. The user-friendly Django-Based interface, associates the users’ input data and its metadata through a bio-IT provided set of resources (a Galaxy instance, and both sufficient storage and grid computing power). Galaxy is used to handle and analyze the user’s input data from loading, indexing, mapping, assembly and DB-searches. Interaction between our application and Galaxy is ensured by the BioBlend library, which gives API-based access to Galaxy’s main features. Metadata about samples, runs, as well as the workflow results are stored in the LIMS. For metagenomic classification and exploration purposes, we show, as a proof of concept, that integration of intuitive exploratory tools, like Krona for representation of taxonomic classification, can be achieved very easily. In the trend of Galaxy, the interface enables the sharing of scientific results to fellow team members. PMID:28451381
New mutations in non-syndromic primary ovarian insufficiency patients identified via whole-exome sequencing.

PubMed

Patiño, Liliana Catherine; Beau, Isabelle; Carlosama, Carolina; Buitrago, July Constanza; González, Ronald; Suárez, Carlos Fernando; Patarroyo, Manuel Alfonso; Delemer, Brigitte; Young, Jacques; Binart, Nadine; Laissue, Paul

2017-07-01

Is it possible to identify new mutations potentially associated with non-syndromic primary ovarian insufficiency (POI) via whole-exome sequencing (WES)? WES is an efficient tool to study genetic causes of POI as we have identified new mutations, some of which lead to protein destablization potentially contributing to the disease etiology. POI is a frequently occurring complex pathology leading to infertility. Mutations in only few candidate genes, mainly identified by Sanger sequencing, have been definitively related to the pathogenesis of the disease. This is a retrospective cohort study performed on 69 women affected by POI. WES and an innovative bioinformatics analysis were used on non-synonymous sequence variants in a subset of 420 selected POI candidate genes. Mutations in BMPR1B and GREM1 were modeled by using fragment molecular orbital analysis. Fifty-five coding variants in 49 genes potentially related to POI were identified in 33 out of 69 patients (48%). These genes participate in key biological processes in the ovary, such as meiosis, follicular development, granulosa cell differentiation/proliferation and ovulation. The presence of at least two mutations in distinct genes in 42% of the patients argued in favor of a polygenic nature of POI. It is possible that regulatory regions, not analyzed in the present study, carry further variants related to POI. WES and the in silico analyses presented here represent an efficient approach for mapping variants associated with POI etiology. Sequence variants presented here represents potential future genetic biomarkers. This study was supported by the Universidad del Rosario and Colciencias (Grants CS/CIGGUR-ABN062-2016 and 672-2014). Colciencias supported Liliana Catherine Patiño´s work (Fellowship: 617, 2013). The authors declare no conflict of interest. © The Author 2017. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

MetaGenSense: A web-application for analysis and exploration of high throughput sequencing metagenomic data.

PubMed

Correia, Damien; Doppelt-Azeroual, Olivia; Denis, Jean-Baptiste; Vandenbogaert, Mathias; Caro, Valérie

2015-01-01

The detection and characterization of emerging infectious agents has been a continuing public health concern. High Throughput Sequencing (HTS) or Next-Generation Sequencing (NGS) technologies have proven to be promising approaches for efficient and unbiased detection of pathogens in complex biological samples, providing access to comprehensive analyses. As NGS approaches typically yield millions of putatively representative reads per sample, efficient data management and visualization resources have become mandatory. Most usually, those resources are implemented through a dedicated Laboratory Information Management System (LIMS), solely to provide perspective regarding the available information. We developed an easily deployable web-interface, facilitating management and bioinformatics analysis of metagenomics data-samples. It was engineered to run associated and dedicated Galaxy workflows for the detection and eventually classification of pathogens. The web application allows easy interaction with existing Galaxy metagenomic workflows, facilitates the organization, exploration and aggregation of the most relevant sample-specific sequences among millions of genomic sequences, allowing them to determine their relative abundance, and associate them to the most closely related organism or pathogen. The user-friendly Django-Based interface, associates the users' input data and its metadata through a bio-IT provided set of resources (a Galaxy instance, and both sufficient storage and grid computing power). Galaxy is used to handle and analyze the user's input data from loading, indexing, mapping, assembly and DB-searches. Interaction between our application and Galaxy is ensured by the BioBlend library, which gives API-based access to Galaxy's main features. Metadata about samples, runs, as well as the workflow results are stored in the LIMS. For metagenomic classification and exploration purposes, we show, as a proof of concept, that integration of intuitive exploratory tools, like Krona for representation of taxonomic classification, can be achieved very easily. In the trend of Galaxy, the interface enables the sharing of scientific results to fellow team members.
Developing and Evaluating the HRM Technique for Identifying Cytochrome P450 2D6 Polymorphisms.

PubMed

Lu, Hsiu-Chin; Chang, Ya-Sian; Chang, Chun-Chi; Lin, Ching-Hsiung; Chang, Jan-Gowth

2015-05-01

Cytochrome P450 2D6 is one of the important enzymes involved in the metabolism of many widely used drugs. Genetic polymorphisms of CYP2D6 can affect its activity. Therefore, an efficient method for identifying CYP2D6 polymorphisms is clinically important. We developed a high-resolution melting (HRM) analysis to investigate CYP2D6 polymorphisms. Genomic DNA was extracted from peripheral blood samples from 71 healthy individuals. All nine exons of the CYP2D6 gene were sequenced before screening by HRM analysis. This method can detect the most genotypes (*1, *2, *4, *10, *14, *21 *39, and *41) of CYP2D6 in Chinese. All samples were successfully genotyped. The four most common mutant CYP2D6 alleles (*1, *2, *10, and *41) can be genotyped. The single nucleotides polymorphism (SNP) frequencies of 100C > T (rs1065852), 1039C > T (rs1081003), 1661G > C (rs1058164), 2663G > A (rs28371722), 2850C > T (rs16947), 2988G > A (rs28371725), 3181A > G, and 4180G > C (rs1135840) were 58%, 61%, 73%, 1%, 13%, 3%, 1%, 73%, respectively. We identified 100% of all heterozygotes without any errors. The two homozygous genotypes (1661G > C and 4180G > C) can be distinguished by mixing with a known genotype sample to generate an artificial heterozygote for HRM analysis. Therefore, all samples could be identified using our HRM method, and the results of HRM analysis are identical to those obtained by sequencing. Our method achieved 100% sensitivity, specificity, positive prediction value and negative prediction value. HRM analysis is a nongel resolution method that is faster and less expensive than direct sequencing. Our study shows that it is an efficient tool for typing CYP2D6 polymorphisms. © 2014 Wiley Periodicals, Inc.
The complicated substrates enhance the microbial diversity and zinc leaching efficiency in sphalerite bioleaching system.

PubMed

Xiao, Yunhua; Xu, YongDong; Dong, Weiling; Liang, Yili; Fan, Fenliang; Zhang, Xiaoxia; Zhang, Xian; Niu, Jiaojiao; Ma, Liyuan; She, Siyuan; He, Zhili; Liu, Xueduan; Yin, Huaqun

2015-12-01

This study used an artificial enrichment microbial consortium to examine the effects of different substrate conditions on microbial diversity, composition, and function (e.g., zinc leaching efficiency) through adding pyrite (SP group), chalcopyrite (SC group), or both (SPC group) in sphalerite bioleaching systems. 16S rRNA gene sequencing analysis showed that microbial community structures and compositions dramatically changed with additions of pyrite or chalcopyrite during the sphalerite bioleaching process. Shannon diversity index showed a significantly increase in the SP (1.460), SC (1.476), and SPC (1.341) groups compared with control (sphalerite group, 0.624) on day 30, meanwhile, zinc leaching efficiencies were enhanced by about 13.4, 2.9, and 13.2%, respectively. Also, additions of pyrite or chalcopyrite could increase electric potential (ORP) and the concentrations of Fe3+ and H+, which were the main factors shaping microbial community structures by Mantel test analysis. Linear regression analysis showed that ORP, Fe3+ concentration, and pH were significantly correlated to zinc leaching efficiency and microbial diversity. In addition, we found that leaching efficiency showed a positive and significant relationship with microbial diversity. In conclusion, our results showed that the complicated substrates could significantly enhance microbial diversity and activity of function.
Simple and efficient identification of rare recessive pathologically important sequence variants from next generation exome sequence data.

PubMed

Carr, Ian M; Morgan, Joanne; Watson, Christopher; Melnik, Svitlana; Diggle, Christine P; Logan, Clare V; Harrison, Sally M; Taylor, Graham R; Pena, Sergio D J; Markham, Alexander F; Alkuraya, Fowzan S; Black, Graeme C M; Ali, Manir; Bonthron, David T

2013-07-01

Massively parallel ("next generation") DNA sequencing (NGS) has quickly become the method of choice for seeking pathogenic mutations in rare uncharacterized monogenic diseases. Typically, before DNA sequencing, protein-coding regions are enriched from patient genomic DNA, representing either the entire genome ("exome sequencing") or selected mapped candidate loci. Sequence variants, identified as differences between the patient's and the human genome reference sequences, are then filtered according to various quality parameters. Changes are screened against datasets of known polymorphisms, such as dbSNP and the 1000 Genomes Project, in the effort to narrow the list of candidate causative variants. An increasing number of commercial services now offer to both generate and align NGS data to a reference genome. This potentially allows small groups with limited computing infrastructure and informatics skills to utilize this technology. However, the capability to effectively filter and assess sequence variants is still an important bottleneck in the identification of deleterious sequence variants in both research and diagnostic settings. We have developed an approach to this problem comprising a user-friendly suite of programs that can interactively analyze, filter and screen data from enrichment-capture NGS data. These programs ("Agile Suite") are particularly suitable for small-scale gene discovery or for diagnostic analysis. © 2013 WILEY PERIODICALS, INC.
Strategies to Improve Efficiency and Specificity of Degenerate Primers in PCR.

PubMed

Campos, Maria Jorge; Quesada, Alberto

2017-01-01

PCR with degenerate primers can be used to identify the coding sequence of an unknown protein or to detect a genetic variant within a gene family. These primers, which are complex mixtures of slightly different oligonucleotide sequences, can be optimized to increase the efficiency and/or specificity of PCR in the amplification of a sequence of interest by the introduction of mismatches with the target sequence and balancing their position toward the primers 5'- or 3'-ends. In this work, we explain in detail examples of rational design of primers in two different applications, including the use of specific determinants at the 3'-end, to: (1) improve PCR efficiency with coding sequences for members of a protein family by fully degeneration at a core box of conserved genetic information, with the reduction of degeneration at the 5'-end, and (2) optimize specificity of allelic discrimination of closely related orthologous by 5'-end degenerate primers.
Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition.

PubMed

Ibrahim, Wisam; Abadeh, Mohammad Saniee

2017-05-21

Protein fold recognition is an important problem in bioinformatics to predict three-dimensional structure of a protein. One of the most challenging tasks in protein fold recognition problem is the extraction of efficient features from the amino-acid sequences to obtain better classifiers. In this paper, we have proposed six descriptors to extract features from protein sequences. These descriptors are applied in the first stage of a three-stage framework PCA-DELM-LDA to extract feature vectors from the amino-acid sequences. Principal Component Analysis PCA has been implemented to reduce the number of extracted features. The extracted feature vectors have been used with original features to improve the performance of the Deep Extreme Learning Machine DELM in the second stage. Four new features have been extracted from the second stage and used in the third stage by Linear Discriminant Analysis LDA to classify the instances into 27 folds. The proposed framework is implemented on the independent and combined feature sets in SCOP datasets. The experimental results show that extracted feature vectors in the first stage could improve the performance of DELM in extracting new useful features in second stage. Copyright © 2017 Elsevier Ltd. All rights reserved.
CEQer: a graphical tool for copy number and allelic imbalance detection from whole-exome sequencing data.

PubMed

Piazza, Rocco; Magistroni, Vera; Pirola, Alessandra; Redaelli, Sara; Spinelli, Roberta; Redaelli, Serena; Galbiati, Marta; Valletta, Simona; Giudici, Giovanni; Cazzaniga, Giovanni; Gambacorti-Passerini, Carlo

2013-01-01

Copy number alterations (CNA) are common events occurring in leukaemias and solid tumors. Comparative Genome Hybridization (CGH) is actually the gold standard technique to analyze CNAs; however, CGH analysis requires dedicated instruments and is able to perform only low resolution Loss of Heterozygosity (LOH) analyses. Here we present CEQer (Comparative Exome Quantification analyzer), a new graphical, event-driven tool for CNA/allelic-imbalance (AI) coupled analysis of exome sequencing data. By using case-control matched exome data, CEQer performs a comparative digital exonic quantification to generate CNA data and couples this information with exome-wide LOH and allelic imbalance detection. This data is used to build mixed statistical/heuristic models allowing the identification of CNA/AI events. To test our tool, we initially used in silico generated data, then we performed whole-exome sequencing from 20 leukemic specimens and corresponding matched controls and we analyzed the results using CEQer. Taken globally, these analyses showed that the combined use of comparative digital exon quantification and LOH/AI allows generating very accurate CNA data. Therefore, we propose CEQer as an efficient, robust and user-friendly graphical tool for the identification of CNA/AI in the context of whole-exome sequencing data.
Efficient Extracellular Expression of Phospholipase D in Escherichia Coli with an Optimized Signal Peptide

NASA Astrophysics Data System (ADS)

Yang, Leyun; Xu, Yu; Chen, Yong; Ying, Hanjie

2018-01-01

New secretion vectors containing the synthetic signal sequence (OmpA’) was constructed for the secretory production of recombinant proteins in Escherichia coli. The E. coli Phospholipase D structural gene (Accession number:NC_018658) fused to various signal sequence were expressed from the Lac promoter in E. coli Rosetta strains by induction with 0.4mM IPTG at 28°C for 48h. SDS-PaGe analysis of expression and subcellular fractions of recombinant constructs revealed the translocation of Phospholipase D (PLD) not only to the medium but also remained in periplasm of E. coli with OmpA’ signal sequence at the N-terminus of PLD. Thus the study on the effects of various surfactants on PLD extracellular production in Escherichia coli in shake flasks revealed that optimal PLD extracellular production could be achieved by adding 0.4% Triton X-100 into the medium. The maximal extracellular PLD production and extracellular enzyme activity were 0.23mg ml-1 and 16U ml-1, respectively. These results demonstrate the possibility of efficient secretory production of recombinant PLD in E. coli should be a potential industrial applications.
A note on the efficiencies of sampling strategies in two-stage Bayesian regional fine mapping of a quantitative trait.

PubMed

Chen, Zhijian; Craiu, Radu V; Bull, Shelley B

2014-11-01

In focused studies designed to follow up associations detected in a genome-wide association study (GWAS), investigators can proceed to fine-map a genomic region by targeted sequencing or dense genotyping of all variants in the region, aiming to identify a functional sequence variant. For the analysis of a quantitative trait, we consider a Bayesian approach to fine-mapping study design that incorporates stratification according to a promising GWAS tag SNP in the same region. Improved cost-efficiency can be achieved when the fine-mapping phase incorporates a two-stage design, with identification of a smaller set of more promising variants in a subsample taken in stage 1, followed by their evaluation in an independent stage 2 subsample. To avoid the potential negative impact of genetic model misspecification on inference we incorporate genetic model selection based on posterior probabilities for each competing model. Our simulation study shows that, compared to simple random sampling that ignores genetic information from GWAS, tag-SNP-based stratified sample allocation methods reduce the number of variants continuing to stage 2 and are more likely to promote the functional sequence variant into confirmation studies. © 2014 WILEY PERIODICALS, INC.
Complete sequencing and pan-genomic analysis of Lactobacillus delbrueckii subsp. bulgaricus reveal its genetic basis for industrial yogurt production.

PubMed

Hao, Pei; Zheng, Huajun; Yu, Yao; Ding, Guohui; Gu, Wenyi; Chen, Shuting; Yu, Zhonghao; Ren, Shuangxi; Oda, Munehiro; Konno, Tomonobu; Wang, Shengyue; Li, Xuan; Ji, Zai-Si; Zhao, Guoping

2011-01-17

Lactobacillus delbrueckii subsp. bulgaricus (Lb. bulgaricus) is an important species of Lactic Acid Bacteria (LAB) used for cheese and yogurt fermentation. The genome of Lb. bulgaricus 2038, an industrial strain mainly used for yogurt production, was completely sequenced and compared against the other two ATCC collection strains of the same subspecies. Specific physiological properties of strain 2038, such as lysine biosynthesis, formate production, aspartate-related carbon-skeleton intermediate metabolism, unique EPS synthesis and efficient DNA restriction/modification systems, are all different from those of the collection strains that might benefit the industrial production of yogurt. Other common features shared by Lb. bulgaricus strains, such as efficient protocooperation with Streptococcus thermophilus and lactate production as well as well-equipped stress tolerance mechanisms may account for it being selected originally for yogurt fermentation industry. Multiple lines of evidence suggested that Lb. bulgaricus 2038 was genetically closer to the common ancestor of the subspecies than the other two sequenced collection strains, probably due to a strict industrial maintenance process for strain 2038 that might have halted its genome decay and sustained a gene network suitable for large scale yogurt production.
Complete Sequencing and Pan-Genomic Analysis of Lactobacillus delbrueckii subsp. bulgaricus Reveal Its Genetic Basis for Industrial Yogurt Production

PubMed Central

Ding, Guohui; Gu, Wenyi; Chen, Shuting; Yu, Zhonghao; Ren, Shuangxi; Oda, Munehiro; Konno, Tomonobu; Wang, Shengyue; Li, Xuan; Ji, Zai-Si; Zhao, Guoping

2011-01-01

Lactobacillus delbrueckii subsp. bulgaricus (Lb. bulgaricus) is an important species of Lactic Acid Bacteria (LAB) used for cheese and yogurt fermentation. The genome of Lb. bulgaricus 2038, an industrial strain mainly used for yogurt production, was completely sequenced and compared against the other two ATCC collection strains of the same subspecies. Specific physiological properties of strain 2038, such as lysine biosynthesis, formate production, aspartate-related carbon-skeleton intermediate metabolism, unique EPS synthesis and efficient DNA restriction/modification systems, are all different from those of the collection strains that might benefit the industrial production of yogurt. Other common features shared by Lb. bulgaricus strains, such as efficient protocooperation with Streptococcus thermophilus and lactate production as well as well-equipped stress tolerance mechanisms may account for it being selected originally for yogurt fermentation industry. Multiple lines of evidence suggested that Lb. bulgaricus 2038 was genetically closer to the common ancestor of the subspecies than the other two sequenced collection strains, probably due to a strict industrial maintenance process for strain 2038 that might have halted its genome decay and sustained a gene network suitable for large scale yogurt production. PMID:21264216
Genomic region operation kit for flexible processing of deep sequencing data.

PubMed

Ovaska, Kristian; Lyly, Lauri; Sahu, Biswajyoti; Jänne, Olli A; Hautaniemi, Sampsa

2013-01-01

Computational analysis of data produced in deep sequencing (DS) experiments is challenging due to large data volumes and requirements for flexible analysis approaches. Here, we present a mathematical formalism based on set algebra for frequently performed operations in DS data analysis to facilitate translation of biomedical research questions to language amenable for computational analysis. With the help of this formalism, we implemented the Genomic Region Operation Kit (GROK), which supports various DS-related operations such as preprocessing, filtering, file conversion, and sample comparison. GROK provides high-level interfaces for R, Python, Lua, and command line, as well as an extension C++ API. It supports major genomic file formats and allows storing custom genomic regions in efficient data structures such as red-black trees and SQL databases. To demonstrate the utility of GROK, we have characterized the roles of two major transcription factors (TFs) in prostate cancer using data from 10 DS experiments. GROK is freely available with a user guide from >http://csbi.ltdk.helsinki.fi/grok/.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs

PubMed Central

Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G.; Rigoutsos, Isidore

2017-01-01

Abstract Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has been presenting a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. PMID:28108659
Spatial Data Structures for Robotic Vehicle Route Planning

DTIC Science & Technology

1988-12-01

goal will be realized in an intelligent Spatial Data Structure Development System (SDSDS) intended for use by Terrain Analysis applications...from the user the details of representation and to permit the infrastructure itself to decide which representations will be most efficient or effective ...to intelligently predict performance of algorithmic sequences and thereby optimize the application (within the accuracy of the prediction models). The
Enabling interspecies epigenomic comparison with CEpBrowser.

PubMed

Cao, Xiaoyi; Zhong, Sheng

2013-05-01

We developed the Comparative Epigenome Browser (CEpBrowser) to allow the public to perform multi-species epigenomic analysis. The web-based CEpBrowser integrates, manages and visualizes sequencing-based epigenomic datasets. Five key features were developed to maximize the efficiency of interspecies epigenomic comparisons. CEpBrowser is a web application implemented with PHP, MySQL, C and Apache. URL: http://www.cepbrowser.org/.
Research in Computational Astrobiology

NASA Technical Reports Server (NTRS)

Chaban, Galina; Colombano, Silvano; Scargle, Jeff; New, Michael H.; Pohorille, Andrew; Wilson, Michael A.

2003-01-01

We report on several projects in the field of computational astrobiology, which is devoted to advancing our understanding of the origin, evolution and distribution of life in the Universe using theoretical and computational tools. Research projects included modifying existing computer simulation codes to use efficient, multiple time step algorithms, statistical methods for analysis of astrophysical data via optimal partitioning methods, electronic structure calculations on water-nuclei acid complexes, incorporation of structural information into genomic sequence analysis methods and calculations of shock-induced formation of polycylic aromatic hydrocarbon compounds.
A transmission imaging spectrograph and microfabricated channel system for DNA analysis.

PubMed

Simpson, J W; Ruiz-Martinez, M C; Mulhern, G T; Berka, J; Latimer, D R; Ball, J A; Rothberg, J M; Went, G T

2000-01-01

In this paper we present the development of a DNA analysis system using a microfabricated channel device and a novel transmission imaging spectrograph which can be efficiently incorporated into a high throughput genomics facility for both sizing and sequencing of DNA fragments. The device contains 48 channels etched on a glass substrate. The channels are sealed with a flat glass plate which also provides a series of apertures for sample loading and contact with buffer reservoirs. Samples can be easily loaded in volumes up to 640 nL without band broadening because of an efficient electrokinetic stacking at the electrophoresis channel entrance. The system uses a dual laser excitation source and a highly sensitive charge-coupled device (CCD) detector allowing for simultaneous detection of many fluorescent dyes. The sieving matrices for the separation of single-stranded DNA fragments are polymerized in situ in denaturing buffer systems. Examples of separation of single-stranded DNA fragments up to 500 bases in length are shown, including accurate sizing of GeneCalling fragments, and sequencing samples prepared with a reduced amount of dye terminators. An increase in sample throughput has been achieved by color multiplexing.
OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis.

PubMed

Verzotto, Davide; M Teo, Audrey S; Hillmer, Axel M; Nagarajan, Niranjan

2016-01-01

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilo base pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6-2 times more sensitive) and are more efficient (170-200 %) and precise in their alignments (nearly 99 % precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
probeBase—an online resource for rRNA-targeted oligonucleotide probes and primers: new features 2016

PubMed Central

Greuter, Daniel; Loy, Alexander; Horn, Matthias; Rattei, Thomas

2016-01-01

probeBase http://www.probebase.net is a manually maintained and curated database of rRNA-targeted oligonucleotide probes and primers. Contextual information and multiple options for evaluating in silico hybridization performance against the most recent rRNA sequence databases are provided for each oligonucleotide entry, which makes probeBase an important and frequently used resource for microbiology research and diagnostics. Here we present a major update of probeBase, which was last featured in the NAR Database Issue 2007. This update describes a complete remodeling of the database architecture and environment to accommodate computationally efficient access. Improved search functions, sequence match tools and data output now extend the opportunities for finding suitable hierarchical probe sets that target an organism or taxon at different taxonomic levels. To facilitate the identification of complementary probe sets for organisms represented by short rRNA sequence reads generated by amplicon sequencing or metagenomic analysis with next generation sequencing technologies such as Illumina and IonTorrent, we introduce a novel tool that recovers surrogate near full-length rRNA sequences for short query sequences and finds matching oligonucleotides in probeBase. PMID:26586809
Whole exome sequencing is an efficient, sensitive and specific method of mutation detection in osteogenesis imperfecta and Marfan syndrome

PubMed Central

McInerney-Leo, Aideen M; Marshall, Mhairi S; Gardiner, Brooke; Coucke, Paul J; Van Laer, Lut; Loeys, Bart L; Summers, Kim M; Symoens, Sofie; West, Jennifer A; West, Malcolm J; Paul Wordsworth, B; Zankl, Andreas; Leo, Paul J; Brown, Matthew A; Duncan, Emma L

2013-01-01

Osteogenesis imperfecta (OI) and Marfan syndrome (MFS) are common Mendelian disorders. Both conditions are usually diagnosed clinically, as genetic testing is expensive due to the size and number of potentially causative genes and mutations. However, genetic testing may benefit patients, at-risk family members and individuals with borderline phenotypes, as well as improving genetic counseling and allowing critical differential diagnoses. We assessed whether whole exome sequencing (WES) is a sensitive method for mutation detection in OI and MFS. WES was performed on genomic DNA from 13 participants with OI and 10 participants with MFS who had known mutations, with exome capture followed by massive parallel sequencing of multiplexed samples. Single nucleotide polymorphisms (SNPs) and small indels were called using Genome Analysis Toolkit (GATK) and annotated with ANNOVAR. CREST, exomeCopy and exomeDepth were used for large deletion detection. Results were compared with the previous data. Specificity was calculated by screening WES data from a control population of 487 individuals for mutations in COL1A1, COL1A2 and FBN1. The target capture of five exome capture platforms was compared. All 13 mutations in the OI cohort and 9/10 in the MFS cohort were detected (sensitivity=95.6%) including non-synonymous SNPs, small indels (<10 bp), and a large UTR5/exon 1 deletion. One mutation was not detected by GATK due to strand bias. Specificity was 99.5%. Capture platforms and analysis programs differed considerably in their ability to detect mutations. Consumable costs for WES were low. WES is an efficient, sensitive, specific and cost-effective method for mutation detection in patients with OI and MFS. Careful selection of platform and analysis programs is necessary to maximize success. PMID:24501682

Contribution of PCR Denaturing Gradient Gel Electrophoresis Combined with Mixed Chromatogram Software Separation for Complex Urinary Sample Analysis.

PubMed

Kotásková, Iva; Mališová, Barbora; Obručová, Hana; Holá, Veronika; Peroutková, Tereza; Růžička, Filip; Freiberger, Tomáš

2017-01-01

Complex samples are a challenge for sequencing-based broad-range diagnostics. We analysed 19 urinary catheter, ureteral Double-J catheter, and urine samples using 3 methodological approaches. Out of the total 84 operational taxonomic units, 37, 61, and 88% were identified by culture, PCR-DGGE-SS (PCR denaturing gradient gel electrophoresis followed by Sanger sequencing), and PCR-DGGE-RM (PCR- DGGE combined with software chromatogram separation by RipSeq Mixed tool), respectively. The latter approach was shown to be an efficient tool to complement culture in complex sample assessment. © 2017 S. Karger AG, Basel.
Breaking the computational barriers of pairwise genome comparison.

PubMed

Torreno, Oscar; Trelles, Oswaldo

2015-08-11

Conventional pairwise sequence comparison software algorithms are being used to process much larger datasets than they were originally designed for. This can result in processing bottlenecks that limit software capabilities or prevent full use of the available hardware resources. Overcoming the barriers that limit the efficient computational analysis of large biological sequence datasets by retrofitting existing algorithms or by creating new applications represents a major challenge for the bioinformatics community. We have developed C libraries for pairwise sequence comparison within diverse architectures, ranging from commodity systems to high performance and cloud computing environments. Exhaustive tests were performed using different datasets of closely- and distantly-related sequences that span from small viral genomes to large mammalian chromosomes. The tests demonstrated that our solution is capable of generating high quality results with a linear-time response and controlled memory consumption, being comparable or faster than the current state-of-the-art methods. We have addressed the problem of pairwise and all-versus-all comparison of large sequences in general, greatly increasing the limits on input data size. The approach described here is based on a modular out-of-core strategy that uses secondary storage to avoid reaching memory limits during the identification of High-scoring Segment Pairs (HSPs) between the sequences under comparison. Software engineering concepts were applied to avoid intermediate result re-calculation, to minimise the performance impact of input/output (I/O) operations and to modularise the process, thus enhancing application flexibility and extendibility. Our computationally-efficient approach allows tasks such as the massive comparison of complete genomes, evolutionary event detection, the identification of conserved synteny blocks and inter-genome distance calculations to be performed more effectively.
Bioinformatics Pipelines for Targeted Resequencing and Whole-Exome Sequencing of Human and Mouse Genomes: A Virtual Appliance Approach for Instant Deployment

PubMed Central

Saeed, Isaam; Wong, Stephen Q.; Mar, Victoria; Goode, David L.; Caramia, Franco; Doig, Ken; Ryland, Georgina L.; Thompson, Ella R.; Hunter, Sally M.; Halgamuge, Saman K.; Ellul, Jason; Dobrovic, Alexander; Campbell, Ian G.; Papenfuss, Anthony T.; McArthur, Grant A.; Tothill, Richard W.

2014-01-01

Targeted resequencing by massively parallel sequencing has become an effective and affordable way to survey small to large portions of the genome for genetic variation. Despite the rapid development in open source software for analysis of such data, the practical implementation of these tools through construction of sequencing analysis pipelines still remains a challenging and laborious activity, and a major hurdle for many small research and clinical laboratories. We developed TREVA (Targeted REsequencing Virtual Appliance), making pre-built pipelines immediately available as a virtual appliance. Based on virtual machine technologies, TREVA is a solution for rapid and efficient deployment of complex bioinformatics pipelines to laboratories of all sizes, enabling reproducible results. The analyses that are supported in TREVA include: somatic and germline single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway and significantly mutated genes analyses. TREVA is flexible and easy to use, and can be customised by Linux-based extensions if required. TREVA can also be deployed on the cloud (cloud computing), enabling instant access without investment overheads for additional hardware. TREVA is available at http://bioinformatics.petermac.org/treva/. PMID:24752294
Nanoliter-Scale Oil-Air-Droplet Chip-Based Single Cell Proteomic Analysis.

PubMed

Li, Zi-Yi; Huang, Min; Wang, Xiu-Kun; Zhu, Ying; Li, Jin-Song; Wong, Catherine C L; Fang, Qun

2018-04-17

Single cell proteomic analysis provides crucial information on cellular heterogeneity in biological systems. Herein, we describe a nanoliter-scale oil-air-droplet (OAD) chip for achieving multistep complex sample pretreatment and injection for single cell proteomic analysis in the shotgun mode. By using miniaturized stationary droplet microreaction and manipulation techniques, our system allows all sample pretreatment and injection procedures to be performed in a nanoliter-scale droplet with minimum sample loss and a high sample injection efficiency (>99%), thus substantially increasing the analytical sensitivity for single cell samples. We applied the present system in the proteomic analysis of 100 ± 10, 50 ± 5, 10, and 1 HeLa cell(s), and protein IDs of 1360, 612, 192, and 51 were identified, respectively. The OAD chip-based system was further applied in single mouse oocyte analysis, with 355 protein IDs identified at the single oocyte level, which demonstrated its special advantages of high enrichment of sequence coverage, hydrophobic proteins, and enzymatic digestion efficiency over the traditional in-tube system.
Identification, characterization and phylogenetic analysis of antifungal Trichoderma from tomato rhizosphere.

PubMed

Rai, Shalini; Kashyap, Prem Lal; Kumar, Sudheer; Srivastava, Alok Kumar; Ramteke, Pramod W

2016-01-01

The use of Trichoderma isolates with efficient antagonistic activity represents a potentially effective and alternative disease management strategy to replace health hazardous chemical control. In this context, twenty isolates were obtained from tomato rhizosphere and evaluated by their antagonistic activity against four fungal pathogens ( Fusarium oxysporum f. sp. lycopersici , Alternaria alternata , Colletotrichum gloeosporoides and Rhizoctonia solani ). The production of extracellular cell wall degrading enzymes of tested isolates was also measured. All the isolates significantly reduced the mycelial growth of tested pathogens but the amount of growth reduction varied significantly as well. There was a positive correlation between the antagonistic capacity of Trichoderma isolates towards fungal pathogens and their lytic enzyme production. The Trichoderma isolates were initially sorted according to morphology and based on the translation elongation factor 1-α gene sequence similarity, the isolates were designated as Trichoderma harzianum , T. koningii , T. asperellum , T. virens and T. viride . PCA analysis explained 31.53, 61.95, 62.22 and 60.25% genetic variation among Trichoderma isolates based on RAPD, REP-, ERIC- and BOX element analysis, respectively. ERG - 1 gene, encoding a squalene epoxidase has been used for the first time for diversity analysis of antagonistic Trichoderma from tomato rhizosphere. Phylogenetic analysis of ERG -1 gene sequences revealed close relatedness of ERG -1sequences with earlier reported sequences of Hypocrea lixii , T. arundinaceum and T. reesei. However, ERG -1 gene also showed heterogeneity among some antagonistic isolates and indicated the possibility of occurrence of squalene epoxidase driven triterpene biosynthesis as an alternative biocontrol mechanism in Trichoderma species.
RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure.

PubMed

Liu, Qi; Yang, Yu; Chen, Chun; Bu, Jiajun; Zhang, Yin; Ye, Xiuzi

2008-03-31

With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression. RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are two fold: (1) present a robust and effective way for RNA structural data compression; (2) design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as Gencompress, winrar and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective. A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure

PubMed Central

Liu, Qi; Yang, Yu; Chen, Chun; Bu, Jiajun; Zhang, Yin; Ye, Xiuzi

2008-01-01

Background With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression. Results RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are two fold: (1) present a robust and effective way for RNA structural data compression; (2) design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as Gencompress, winrar and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective. Conclusion A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules. PMID:18373878
Computationally assisted screening and design of cell-interactive peptides by a cell-based assay using peptide arrays and a fuzzy neural network algorithm.

PubMed

Kaga, Chiaki; Okochi, Mina; Tomita, Yasuyuki; Kato, Ryuji; Honda, Hiroyuki

2008-03-01

We developed a method of effective peptide screening that combines experiments and computational analysis. The method is based on the concept that screening efficiency can be enhanced from even limited data by use of a model derived from computational analysis that serves as a guide to screening and combining the model with subsequent repeated experiments. Here we focus on cell-adhesion peptides as a model application of this peptide-screening strategy. Cell-adhesion peptides were screened by use of a cell-based assay of a peptide array. Starting with the screening data obtained from a limited, random 5-mer library (643 sequences), a rule regarding structural characteristics of cell-adhesion peptides was extracted by fuzzy neural network (FNN) analysis. According to this rule, peptides with unfavored residues in certain positions that led to inefficient binding were eliminated from the random sequences. In the restricted, second random library (273 sequences), the yield of cell-adhesion peptides having an adhesion rate more than 1.5-fold to that of the basal array support was significantly high (31%) compared with the unrestricted random library (20%). In the restricted third library (50 sequences), the yield of cell-adhesion peptides increased to 84%. We conclude that a repeated cycle of experiments screening limited numbers of peptides can be assisted by the rule-extracting feature of FNN.
De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq

PubMed Central

2010-01-01

Background De novo assembly of transcript sequences produced by short-read DNA sequencing technologies offers a rapid approach to obtain expressed gene catalogs for non-model organisms. A draft genome sequence will be produced in 2010 for a Eucalyptus tree species (E. grandis) representing the most important hardwood fibre crop in the world. Genome annotation of this valuable woody plant and genetic dissection of its superior growth and productivity will be greatly facilitated by the availability of a comprehensive collection of expressed gene sequences from multiple tissues and organs. Results We present an extensive expressed gene catalog for a commercially grown E. grandis × E. urophylla hybrid clone constructed using only Illumina mRNA-Seq technology and de novo assembly. A total of 18,894 transcript-derived contigs, a large proportion of which represent full-length protein coding genes were assembled and annotated. Analysis of assembly quality, length and diversity show that this dataset represent the most comprehensive expressed gene catalog for any Eucalyptus tree. mRNA-Seq analysis furthermore allowed digital expression profiling of all of the assembled transcripts across diverse xylogenic and non-xylogenic tissues, which is invaluable for ascribing putative gene functions. Conclusions De novo assembly of Illumina mRNA-Seq reads is an efficient approach for transcriptome sequencing and profiling in Eucalyptus and other non-model organisms. The transcriptome resource (Eucspresso, http://eucspresso.bi.up.ac.za/) generated by this study will be of value for genomic analysis of woody biomass production in Eucalyptus and for comparative genomic analysis of growth and development in woody and herbaceous plants. PMID:21122097
Development of a Prokaryotic Universal Primer for Simultaneous Analysis of Bacteria and Archaea Using Next-Generation Sequencing

PubMed Central

Takahashi, Shunsuke; Tomita, Junko; Nishioka, Kaori; Hisada, Takayoshi; Nishijima, Miyuki

2014-01-01

For the analysis of microbial community structure based on 16S rDNA sequence diversity, sensitive and robust PCR amplification of 16S rDNA is a critical step. To obtain accurate microbial composition data, PCR amplification must be free of bias; however, amplifying all 16S rDNA species with equal efficiency from a sample containing a large variety of microorganisms remains challenging. Here, we designed a universal primer based on the V3-V4 hypervariable region of prokaryotic 16S rDNA for the simultaneous detection of Bacteria and Archaea in fecal samples from crossbred pigs (Landrace×Large white×Duroc) using an Illumina MiSeq next-generation sequencer. In-silico analysis showed that the newly designed universal prokaryotic primers matched approximately 98.0% of Bacteria and 94.6% of Archaea rRNA gene sequences in the Ribosomal Database Project database. For each sequencing reaction performed with the prokaryotic universal primer, an average of 69,330 (±20,482) reads were obtained, of which archaeal rRNA genes comprised approximately 1.2% to 3.2% of all prokaryotic reads. In addition, the detection frequency of Bacteria belonging to the phylum Verrucomicrobia, including members of the classes Verrucomicrobiae and Opitutae, was higher in the NGS analysis using the prokaryotic universal primer than that performed with the bacterial universal primer. Importantly, this new prokaryotic universal primer set had markedly lower bias than that of most previously designed universal primers. Our findings demonstrate that the prokaryotic universal primer set designed in the present study will permit the simultaneous detection of Bacteria and Archaea, and will therefore allow for a more comprehensive understanding of microbial community structures in environmental samples. PMID:25144201
Improvement of energy efficiency via spectrum optimization of excitation sequence for multichannel simultaneously triggered airborne sonar system

NASA Astrophysics Data System (ADS)

Meng, Qing-Hao; Yao, Zhen-Jing; Peng, Han-Yang

2009-12-01

Both the energy efficiency and correlation characteristics are important in airborne sonar systems to realize multichannel ultrasonic transducers working together. High energy efficiency can increase echo energy and measurement range, and sharp autocorrelation and flat cross correlation can help eliminate cross-talk among multichannel transducers. This paper addresses energy efficiency optimization under the premise that cross-talk between different sonar transducers can be avoided. The nondominated sorting genetic algorithm-II is applied to optimize both the spectrum and correlation characteristics of the excitation sequence. The central idea of the spectrum optimization is to distribute most of the energy of the excitation sequence within the frequency band of the sonar transducer; thus, less energy is filtered out by the transducers. Real experiments show that a sonar system consisting of eight-channel Polaroid 600 series electrostatic transducers excited with 2 ms optimized pulse-position-modulation sequences can work together without cross-talk and can measure distances up to 650 cm with maximal 1% relative error.
Efficient generation of cavitation bubbles and reactive oxygen species using triggered high-intensity focused ultrasound sequence for sonodynamic treatment

NASA Astrophysics Data System (ADS)

Yasuda, Jun; Yoshizawa, Shin; Umemura, Shin-ichiro

2016-07-01

Sonodynamic treatment is a method of treating cancer using reactive oxygen species (ROS) generated by cavitation bubbles in collaboration with a sonosensitizer at a target tissue. In this treatment method, both localized ROS generation and ROS generation with high efficiency are important. In this study, a triggered high-intensity focused ultrasound (HIFU) sequence, which consists of a short, extremely high intensity pulse immediately followed by a long, moderate-intensity burst, was employed for the efficient generation of ROS. In experiments, a solution sealed in a chamber was exposed to a triggered HIFU sequence. Then, the distribution of generated ROS was observed by the luminol reaction, and the amount of generated ROS was quantified using KI method. As a result, the localized ROS generation was demonstrated by light emission from the luminol reaction. Moreover, it was demonstrated that the triggered HIFU sequence has higher efficiency of ROS generation by both the KI method and the luminol reaction emission.
Depletion of Unwanted Nucleic Acid Templates by Selective Cleavage: LNAzymes, Catalytically Active Oligonucleotides Containing Locked Nucleic Acids, Open a New Window for Detecting Rare Microbial Community Members

PubMed Central

Dolinšek, Jan; Dorninger, Christiane; Lagkouvardos, Ilias; Wagner, Michael

2013-01-01

Many studies of molecular microbial ecology rely on the characterization of microbial communities by PCR amplification, cloning, sequencing, and phylogenetic analysis of genes encoding rRNAs or functional marker enzymes. However, if the established clone libraries are dominated by one or a few sequence types, the cloned diversity is difficult to analyze by random clone sequencing. Here we present a novel approach to deplete unwanted sequence types from complex nucleic acid mixtures prior to cloning and downstream analyses. It employs catalytically active oligonucleotides containing locked nucleic acids (LNAzymes) for the specific cleavage of selected RNA targets. When combined with in vitro transcription and reverse transcriptase PCR, this LNAzyme-based technique can be used with DNA or RNA extracts from microbial communities. The simultaneous application of more than one specific LNAzyme allows the concurrent depletion of different sequence types from the same nucleic acid preparation. This new method was evaluated with defined mixtures of cloned 16S rRNA genes and then used to identify accompanying bacteria in an enrichment culture dominated by the nitrite oxidizer “Candidatus Nitrospira defluvii.” In silico analysis revealed that the majority of publicly deposited rRNA-targeted oligonucleotide probes may be used as specific LNAzymes with no or only minor sequence modifications. This efficient and cost-effective approach will greatly facilitate tasks such as the identification of microbial symbionts in nucleic acid preparations dominated by plastid or mitochondrial rRNA genes from eukaryotic hosts, the detection of contaminants in microbial cultures, and the analysis of rare organisms in microbial communities of highly uneven composition. PMID:23263968
Whole Genome Sequencing for Genomics-Guided Investigations of Escherichia coli O157:H7 Outbreaks.

PubMed

Rusconi, Brigida; Sanjar, Fatemeh; Koenig, Sara S K; Mammel, Mark K; Tarr, Phillip I; Eppinger, Mark

2016-01-01

Multi isolate whole genome sequencing (WGS) and typing for outbreak investigations has become a reality in the post-genomics era. We applied this technology to strains from Escherichia coli O157:H7 outbreaks. These include isolates from seven North America outbreaks, as well as multiple isolates from the same patient and from different infected individuals in the same household. Customized high-resolution bioinformatics sequence typing strategies were developed to assess the core genome and mobilome plasticity. Sequence typing was performed using an in-house single nucleotide polymorphism (SNP) discovery and validation pipeline. Discriminatory power becomes of particular importance for the investigation of isolates from outbreaks in which macrogenomic techniques such as pulse-field gel electrophoresis or multiple locus variable number tandem repeat analysis do not differentiate closely related organisms. We also characterized differences in the phage inventory, allowing us to identify plasticity among outbreak strains that is not detectable at the core genome level. Our comprehensive analysis of the mobilome identified multiple plasmids that have not previously been associated with this lineage. Applied phylogenomics approaches provide strong molecular evidence for exceptionally little heterogeneity of strains within outbreaks and demonstrate the value of intra-cluster comparisons, rather than basing the analysis on archetypal reference strains. Next generation sequencing and whole genome typing strategies provide the technological foundation for genomic epidemiology outbreak investigation utilizing its significantly higher sample throughput, cost efficiency, and phylogenetic relatedness accuracy. These phylogenomics approaches have major public health relevance in translating information from the sequence-based survey to support timely and informed countermeasures. Polymorphisms identified in this work offer robust phylogenetic signals that index both short- and long-term evolution and can complement currently employed typing schemes for outbreak ex- and inclusion, diagnostics, surveillance, and forensic studies.
Whole Genome Sequencing for Genomics-Guided Investigations of Escherichia coli O157:H7 Outbreaks

PubMed Central

Rusconi, Brigida; Sanjar, Fatemeh; Koenig, Sara S. K.; Mammel, Mark K.; Tarr, Phillip I.; Eppinger, Mark

2016-01-01

Multi isolate whole genome sequencing (WGS) and typing for outbreak investigations has become a reality in the post-genomics era. We applied this technology to strains from Escherichia coli O157:H7 outbreaks. These include isolates from seven North America outbreaks, as well as multiple isolates from the same patient and from different infected individuals in the same household. Customized high-resolution bioinformatics sequence typing strategies were developed to assess the core genome and mobilome plasticity. Sequence typing was performed using an in-house single nucleotide polymorphism (SNP) discovery and validation pipeline. Discriminatory power becomes of particular importance for the investigation of isolates from outbreaks in which macrogenomic techniques such as pulse-field gel electrophoresis or multiple locus variable number tandem repeat analysis do not differentiate closely related organisms. We also characterized differences in the phage inventory, allowing us to identify plasticity among outbreak strains that is not detectable at the core genome level. Our comprehensive analysis of the mobilome identified multiple plasmids that have not previously been associated with this lineage. Applied phylogenomics approaches provide strong molecular evidence for exceptionally little heterogeneity of strains within outbreaks and demonstrate the value of intra-cluster comparisons, rather than basing the analysis on archetypal reference strains. Next generation sequencing and whole genome typing strategies provide the technological foundation for genomic epidemiology outbreak investigation utilizing its significantly higher sample throughput, cost efficiency, and phylogenetic relatedness accuracy. These phylogenomics approaches have major public health relevance in translating information from the sequence-based survey to support timely and informed countermeasures. Polymorphisms identified in this work offer robust phylogenetic signals that index both short- and long-term evolution and can complement currently employed typing schemes for outbreak ex- and inclusion, diagnostics, surveillance, and forensic studies. PMID:27446025
PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results.

PubMed

He, Ji; Dai, Xinbin; Zhao, Xuechun

2007-02-09

BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at http://bioinfo.noble.org/plan/. The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users.
PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results

PubMed Central

He, Ji; Dai, Xinbin; Zhao, Xuechun

2007-01-01

Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at . The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users. PMID:17291345
BamTools: a C++ API and toolkit for analyzing and managing BAM files

PubMed Central

Barnett, Derek W.; Garrison, Erik K.; Quinlan, Aaron R.; Strömberg, Michael P.; Marth, Gabor T.

2011-01-01

Motivation: Analysis of genomic sequencing data requires efficient, easy-to-use access to alignment results and flexible data management tools (e.g. filtering, merging, sorting, etc.). However, the enormous amount of data produced by current sequencing technologies is typically stored in compressed, binary formats that are not easily handled by the text-based parsers commonly used in bioinformatics research. Results: We introduce a software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. Availability: BamTools was written in C++, and is supported on Linux, Mac OSX and MS Windows. Source code and documentation are freely available at http://github.org/pezmaster31/bamtools. Contact: barnetde@bc.edu PMID:21493652
Differential principal component analysis of ChIP-seq.

PubMed

Ji, Hongkai; Li, Xia; Wang, Qian-fei; Ning, Yang

2013-04-23

We propose differential principal component analysis (dPCA) for analyzing multiple ChIP-sequencing datasets to identify differential protein-DNA interactions between two biological conditions. dPCA integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single framework. It uses a small number of principal components to summarize concisely the major multiprotein synergistic differential patterns between the two conditions. For each pattern, it detects and prioritizes differential genomic loci by comparing the between-condition differences with the within-condition variation among replicate samples. dPCA provides a unique tool for efficiently analyzing large amounts of ChIP-sequencing data to study dynamic changes of gene regulation across different biological conditions. We demonstrate this approach through analyses of differential chromatin patterns at transcription factor binding sites and promoters as well as allele-specific protein-DNA interactions.
Directed evolution and expression tuning of geraniol synthase for efficient geraniol production in Escherichia coli.

PubMed

Tashiro, Miki; Fujii, Akira; Kawai-Noma, Shigeko; Saito, Kyoichi; Umeno, Daisuke

2017-11-17

To achieve an efficient production of geraniol and its derivatives in Escherichia coli, we aimed to improve the activity of geraniol synthase (GES) through a single round of mutagenesis and screening for higher substrate consumption. We isolated GES variants that outperform their parent in geraniol production. The analysis of GES variants indicated that the expression level of GES was the bottleneck for geraniol synthesis. Over-expression of the mutant GES M53 with a 5'-untranslated sequence designed for high translational efficiency, along with the additional expression of mevalonate pathway enzymes, isopentenyl pyrophosphate isomerase, and geranyl pyrophosphate synthase, yielded 300 mg/L/12 h geraniol and its derivatives (>1000 mg/L/42 h in total) in a shaking flask.

Googling DNA sequences on the World Wide Web.

PubMed

Hajibabaei, Mehrdad; Singer, Gregory A C

2009-11-10

New web-based technologies provide an excellent opportunity for sharing and accessing information and using web as a platform for interaction and collaboration. Although several specialized tools are available for analyzing DNA sequence information, conventional web-based tools have not been utilized for bioinformatics applications. We have developed a novel algorithm and implemented it for searching species-specific genomic sequences, DNA barcodes, by using popular web-based methods such as Google. We developed an alignment independent character based algorithm based on dividing a sequence library (DNA barcodes) and query sequence to words. The actual search is conducted by conventional search tools such as freely available Google Desktop Search. We implemented our algorithm in two exemplar packages. We developed pre and post-processing software to provide customized input and output services, respectively. Our analysis of all publicly available DNA barcode sequences shows a high accuracy as well as rapid results. Our method makes use of conventional web-based technologies for specialized genetic data. It provides a robust and efficient solution for sequence search on the web. The integration of our search method for large-scale sequence libraries such as DNA barcodes provides an excellent web-based tool for accessing this information and linking it to other available categories of information on the web.
MANGO: a new approach to multiple sequence alignment.

PubMed

Zhang, Zefeng; Lin, Hao; Li, Ming

2007-01-01

Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
Flexbar 3.0 - SIMD and multicore parallelization.

PubMed

Roehr, Johannes T; Dieterich, Christoph; Reinert, Knut

2017-09-15

High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible. We present Flexbar 3.0, the successor of the popular program Flexbar. It employs now twofold parallelism: multi-threading and additionally SIMD vectorization. Both types of parallelism are used to speed-up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark. https://github.com/seqan/flexbar. johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Synchronized moving aperture radiation therapy (SMART): superimposing tumor motion on IMRT MLC leaf sequences under realistic delivery conditions

NASA Astrophysics Data System (ADS)

Xu, Jun; Papanikolaou, Nikos; Shi, Chengyu; Jiang, Steve B.

2009-08-01

Synchronized moving aperture radiation therapy (SMART) has been proposed to account for tumor motions during radiotherapy in prior work. The basic idea of SMART is to synchronize the moving radiation beam aperture formed by a dynamic multileaf collimator (DMLC) with the tumor motion induced by respiration. In this paper, a two-dimensional (2D) superimposing leaf sequencing method is presented for SMART. A leaf sequence optimization strategy was generated to assure the SMART delivery under realistic delivery conditions. The study of delivery performance using the Varian LINAC and the Millennium DMLC showed that clinical factors such as collimator angle, dose rate, initial phase and machine tolerance affect the delivery accuracy and efficiency. An in-house leaf sequencing software was developed to implement the 2D superimposing leaf sequencing method and optimize the motion-corrected leaf sequence under realistic clinical conditions. The analysis of dynamic log (Dynalog) files showed that optimization of the leaf sequence for various clinical factors can avoid beam hold-offs which break the synchronization of SMART and fail the SMART dose delivery. Through comparison between the simulated delivered fluence map and the planed fluence map, it was shown that the motion-corrected leaf sequence can greatly reduce the dose error.
Synchronized moving aperture radiation therapy (SMART): superimposing tumor motion on IMRT MLC leaf sequences under realistic delivery conditions.

PubMed

Xu, Jun; Papanikolaou, Nikos; Shi, Chengyu; Jiang, Steve B

2009-08-21

Synchronized moving aperture radiation therapy (SMART) has been proposed to account for tumor motions during radiotherapy in prior work. The basic idea of SMART is to synchronize the moving radiation beam aperture formed by a dynamic multileaf collimator (DMLC) with the tumor motion induced by respiration. In this paper, a two-dimensional (2D) superimposing leaf sequencing method is presented for SMART. A leaf sequence optimization strategy was generated to assure the SMART delivery under realistic delivery conditions. The study of delivery performance using the Varian LINAC and the Millennium DMLC showed that clinical factors such as collimator angle, dose rate, initial phase and machine tolerance affect the delivery accuracy and efficiency. An in-house leaf sequencing software was developed to implement the 2D superimposing leaf sequencing method and optimize the motion-corrected leaf sequence under realistic clinical conditions. The analysis of dynamic log (Dynalog) files showed that optimization of the leaf sequence for various clinical factors can avoid beam hold-offs which break the synchronization of SMART and fail the SMART dose delivery. Through comparison between the simulated delivered fluence map and the planed fluence map, it was shown that the motion-corrected leaf sequence can greatly reduce the dose error.
Improving membrane protein expression by optimizing integration efficiency

PubMed Central

2017-01-01

The heterologous overexpression of integral membrane proteins in Escherichia coli often yields insufficient quantities of purifiable protein for applications of interest. The current study leverages a recently demonstrated link between co-translational membrane integration efficiency and protein expression levels to predict protein sequence modifications that improve expression. Membrane integration efficiencies, obtained using a coarse-grained simulation approach, robustly predicted effects on expression of the integral membrane protein TatC for a set of 140 sequence modifications, including loop-swap chimeras and single-residue mutations distributed throughout the protein sequence. Mutations that improve simulated integration efficiency were 4-fold enriched with respect to improved experimentally observed expression levels. Furthermore, the effects of double mutations on both simulated integration efficiency and experimentally observed expression levels were cumulative and largely independent, suggesting that multiple mutations can be introduced to yield higher levels of purifiable protein. This work provides a foundation for a general method for the rational overexpression of integral membrane proteins based on computationally simulated membrane integration efficiencies. PMID:28918393
Regions flanking ori sequences affect the replication efficiency of the mitochondrial genome of ori+ petite mutants from yeast.

PubMed

Rayko, E; Goursot, R; Cherif-Zahar, B; Melis, R; Bernardi, G

1988-03-31

The mitochondrial genomes of progenies from 26 crosses between 17 cytoplasmic, spontaneous, suppressive, ori+ petite mutants of Saccharomyces cerevisiae have been studied by electrophoresis of restriction fragments. Only parental genomes (or occasionally, genomes derived from them by secondary excisions) were found in the progenies of the almost 500 diploids investigated; no evidence for illegitimate, site-specific mitochondrial recombination was detected. One of the parental genomes was always found to be predominate over the other one, although to different extents in different crosses. This predominance appears to be due to a higher replication efficiency, which is correlated with a greater density of ori sequences on the mitochondrial genome (and with a shorter repeat unit size of the latter). Exceptions to the 'repeat-unit-size rule' were found, however, even when the parental mitochondrial genomes carried the same ori sequence. This indicates that noncoding, intergenic sequences outside ori sequences also play a role in modulating replication efficiency. Since in different petites such sequences differ in primary structure, size, and position relative to ori sequences, this modulation is likely to take place through an indirect effect on DNA and nucleoid structure.
Research on respiratory motion correction method based on liver contrast-enhanced ultrasound images of single mode

NASA Astrophysics Data System (ADS)

Zhang, Ji; Li, Tao; Zheng, Shiqiang; Li, Yiyong

2015-03-01

To reduce the effects of respiratory motion in the quantitative analysis based on liver contrast-enhanced ultrasound (CEUS) image sequencesof single mode. The image gating method and the iterative registration method using model image were adopted to register liver contrast-enhanced ultrasound image sequences of single mode. The feasibility of the proposed respiratory motion correction method was explored preliminarily using 10 hepatocellular carcinomas CEUS cases. The positions of the lesions in the time series of 2D ultrasound images after correction were visually evaluated. Before and after correction, the quality of the weighted sum of transit time (WSTT) parametric images were also compared, in terms of the accuracy and spatial resolution. For the corrected and uncorrected sequences, their mean deviation values (mDVs) of time-intensity curve (TIC) fitting derived from CEUS sequences were measured. After the correction, the positions of the lesions in the time series of 2D ultrasound images were almost invariant. In contrast, the lesions in the uncorrected images all shifted noticeably. The quality of the WSTT parametric maps derived from liver CEUS image sequences were improved more greatly. Moreover, the mDVs of TIC fitting derived from CEUS sequences after the correction decreased by an average of 48.48+/-42.15. The proposed correction method could improve the accuracy of quantitative analysis based on liver CEUS image sequences of single mode, which would help in enhancing the differential diagnosis efficiency of liver tumors.
Detailed analysis of stem I and its 5' and 3' neighbor regions in the trans-acting HDV ribozyme.

PubMed Central

Nishikawa, F; Roy, M; Fauzi, H; Nishikawa, S

1999-01-01

To determine the stem I structure of the human hepatitis delta virus (HDV) ribozyme, which is related to the substrate sequence in the trans -acting system, we kinetically studied stem I length and sequences. Stem I extension from 7 to 8 or 9 bp caused a loss of activity and a low amount of active complex with 9 bp in the trans -acting system. In a previous report, we presented cleavage in a 6 bp stem I. The observed reaction rates indicate that the original 7 bp stem I is in the most favorable location for catalytic reaction among the possible 6-8 bp stems. To test base specificity, we replaced the original GC-rich sequence in stem I with AU-rich sequences containing six AU or UA base pairs with the natural +1G.U wobble base pair at the cleavage site. The cis -acting AU-rich molecules demonstrated similar catalytic activity to that of the wild-type. In trans -acting molecules, due to stem I instability, reaction efficiency strongly depended on the concentration of the ribozyme-substrate complex and reaction temperature. Multiple turnover was observed at 37 degreesC, strongly suggesting that stem I has no base specificity and more efficient activity can be expected under multiple turnover conditions by substituting several UA or AU base pairs into stem I. We also studied the substrate damaging sequences linked to both ends of stem I for its development in therapeutic applications and confirmed the functions of the unique structure. PMID:9862958
Entropic Profiler – detection of conservation in genomes using information theory

PubMed Central

Fernandes, Francisco; Freitas, Ana T; Almeida, Jonas S; Vinga, Susana

2009-01-01

Background In the last decades, with the successive availability of whole genome sequences, many research efforts have been made to mathematically model DNA. Entropic Profiles (EP) were proposed recently as a new measure of continuous entropy of genome sequences. EP represent local information plots related to DNA randomness and are based on information theory and statistical concepts. They express the weighed relative abundance of motifs for each position in genomes. Their study is very relevant because under or over-representation segments are often associated with significant biological meaning. Findings The Entropic Profiler application here presented is a new tool designed to detect and extract under and over-represented DNA segments in genomes by using EP. It allows its computation in a very efficient way by recurring to improved algorithms and data structures, which include modified suffix trees. Available through a web interface and as downloadable source code, it allows to study positions and to search for motifs inside the whole sequence or within a specified range. DNA sequences can be entered from different sources, including FASTA files, pre-loaded examples or resuming a previously saved work. Besides the EP value plots, p-values and z-scores for each motif are also computed, along with the Chaos Game Representation of the sequence. Conclusion EP are directly related with the statistical significance of motifs and can be considered as a new method to extract and classify significant regions in genomes and estimate local scales in DNA. The present implementation establishes an efficient and useful tool for whole genome analysis. PMID:19416538
Research on parallel algorithm for sequential pattern mining

NASA Astrophysics Data System (ADS)

Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao

2008-03-01

Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the business database and has extended to new data sources such as Web and advanced science fields such as DNA analysis. The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms haven't considered the above-mentioned characteristics synthetically. According to the traits mentioned above and combining the parallel theory, this paper puts forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate the candidated sequences, which abates the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
A numerical formulation and algorithm for limit and shakedown analysis of large-scale elastoplastic structures

NASA Astrophysics Data System (ADS)

Peng, Heng; Liu, Yinghua; Chen, Haofeng

2018-05-01

In this paper, a novel direct method called the stress compensation method (SCM) is proposed for limit and shakedown analysis of large-scale elastoplastic structures. Without needing to solve the specific mathematical programming problem, the SCM is a two-level iterative procedure based on a sequence of linear elastic finite element solutions where the global stiffness matrix is decomposed only once. In the inner loop, the static admissible residual stress field for shakedown analysis is constructed. In the outer loop, a series of decreasing load multipliers are updated to approach to the shakedown limit multiplier by using an efficient and robust iteration control technique, where the static shakedown theorem is adopted. Three numerical examples up to about 140,000 finite element nodes confirm the applicability and efficiency of this method for two-dimensional and three-dimensional elastoplastic structures, with detailed discussions on the convergence and the accuracy of the proposed algorithm.
High-resolution melting analysis (HRM) for differentiation of four major Taeniidae species in dogs Taenia hydatigena, Taenia multiceps, Taenia ovis, and Echinococcus granulosus sensu stricto.

PubMed

Dehghani, Mansoureh; Mohammadi, Mohammad Ali; Rostami, Sima; Shamsaddini, Saeedeh; Mirbadie, Seyed Reza; Harandi, Majid Fasihi

2016-07-01

Tapeworms of the genus Taenia include several species of important parasites with considerable medical and veterinary significance. Accurate identification of these species in dogs is the prerequisite of any prevention and control program. Here, we have applied an efficient method for differentiating four major Taeniid species in dogs, i.e., Taenia hydatigena, T. multiceps, T. ovis, and Echinococcus granulosus sensu stricto. High-resolution melting (HRM) analysis is simpler, less expensive, and faster technique than conventional DNA-based assays and enables us to detect PCR amplicons in a closed system. Metacestode samples were collected from local abattoirs from sheep. All the isolates had already been identified by PCR-sequencing, and their sequence data were deposited in the GenBank. Real-time PCR coupled with HRM analysis targeting mitochondrial cox1 and ITS1 genes was used to differentiate taeniid species. Distinct melting curves were obtained from ITS1 region enabling accurate differentiation of three Taenia species and E. granulosus in dogs. The HRM curves of Taenia species and E .granulosus were clearly separated at Tm of 85 to 87 °C. In addition, double-pick melting curves were produced in mixed infections. Cox1 melting curves were not decisive enough to distinguish four taeniids. In this work, the efficiency of HRM analysis to differentiate four major taeniid species in dogs has been demonstrated using ITS1 gene.
Cis-acting elements in the promoter region of the human aldolase C gene.

PubMed

Buono, P; de Conciliis, L; Olivetta, E; Izzo, P; Salvatore, F

1993-08-16

We investigated the cis-acting sequences involved in the expression of the human aldolase C gene by transient transfections into human neuroblastoma cells (SKNBE). We demonstrate that 420 bp of the 5'-flanking DNA direct at high efficiency the transcription of the CAT reporter gene. A deletion between -420 bp and -164 bp causes a 60% decrease of CAT activity. Gel shift and DNase I footprinting analyses revealed four protected elements: A, B, C and D. Competition analyses indicate that Sp1 or factors sharing a similar sequence specificity bind to elements A and B, but not to elements C and D. Sequence analysis shows a half palindromic ERE motif (GGTCA), in elements B and D. Region D binds a transactivating factor which appears also essential to stabilize the initiation complex.
Genetic analysis of Fasciola isolates from cattle in Korea based on second internal transcribed spacer (ITS-2) sequence of nuclear ribosomal DNA.

PubMed

Choe, Se-Eun; Nguyen, Thuy Thi-Dieu; Kang, Tae-Gyu; Kweon, Chang-Hee; Kang, Seung-Won

2011-09-01

Nuclear ribosomal DNA sequence of the second internal transcribed spacer (ITS-2) has been used efficiently to identify the liver fluke species collected from different hosts and various geographic regions. ITS-2 sequences of 19 Fasciola samples collected from Korean native cattle were determined and compared. Sequence comparison including ITS-2 sequences of isolates from this study and reference sequences from Fasciola hepatica and Fasciola gigantica and intermediate Fasciola in Genbank revealed seven identical variable sites of investigated isolates. Among 19 samples, 12 individuals had ITS-2 sequences completely identical to that of pure F. hepatica, five possessed the sequences identical to F. gigantica type, whereas two shared the sequence of both F. hepatica and F. gigantica. No variations in length and nucleotide composition of ITS-2 sequence were observed within isolates that belonged to F. hepatica or F. gigantica. At the position of 218, five Fasciola containing a single-base substitution (C>T) formed a distinct branch inside the F. gigantica-type group which was similar to those of Asian-origin isolates. The phylogenetic tree of the Fasciola spp. based on complete ITS-2 sequences from this study and other representative isolates in different locations clearly showed that pure F. hepatica, F. gigantica type and intermediate Fasciola were observed. The result also provided additional genetic evidence for the existence of three forms of Fasciola isolated from native cattle in Korea by genetic approach using ITS-2 sequence.
Combination of COFRADIC and high temperature-extended column length conventional liquid chromatography: a very efficient way to tackle complex protein samples, such as serum.

PubMed

Sandra, Koen; Verleysen, Katleen; Labeur, Christine; Vanneste, Lies; D'Hondt, Filip; Thomas, Grégoire; Kas, Koen; Gevaert, Kris; Vandekerckhove, Joël; Sandra, Pat

2007-03-01

The previously reported COmbined FRActional DIagonal Chromatography (COFRA-DIC) methodology, in which a subset of peptides representative for their parent proteins are sorted, is particularly powerful for whole proteome analysis. This peptide-centric technology is built around diagonal chromatography, where peptide separations are crucial. This paper presents high efficiency peptide separations, in which four 250 x 2.1 mm, 5 microm Zorbax 300SB-C18 columns (total length 1 m) were coupled at operating temperatures of 60'C using a dedicated LC oven and conventional LC equipment. The high efficiency separations were combined with the COFRADIC procedure. This extremely powerful combination resulted, for the analysis of serum, in an increase in the uniquely identified peptide sequences by a factor of 2.6, compared to the COFRADIC procedure on a 25 cm column. This is a reflection of the increased peak capacity obtained on the 1 m column, which was calculated to be a factor 2.7 higher than on the 25 cm column. Besides more efficient sorting, less ion suppression was noticed.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs.

PubMed

Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G; Rigoutsos, Isidore; Kirino, Yohei

2017-05-19

Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has been presenting a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
High-accuracy biodistribution analysis of adeno-associated virus variants by double barcode sequencing.

PubMed

Marsic, Damien; Méndez-Gómez, Héctor R; Zolotukhin, Sergei

2015-01-01

Biodistribution analysis is a key step in the evaluation of adeno-associated virus (AAV) capsid variants, whether natural isolates or produced by rational design or directed evolution. Indeed, when screening candidate vectors, accurate knowledge about which tissues are infected and how efficiently is essential. We describe the design, validation, and application of a new vector, pTR-UF50-BC, encoding a bioluminescent protein, a fluorescent protein and a DNA barcode, which can be used to visualize localization of transduction at the organism, organ, tissue, or cellular levels. In addition, by linking capsid variants to different barcoded versions of the vector and amplifying the barcode region from various tissue samples using barcoded primers, biodistribution of viral genomes can be analyzed with high accuracy and efficiency.
Selection of mRNA 5'-untranslated region sequence with high translation efficiency through ribosome display

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mie, Masayasu; Shimizu, Shun; Takahashi, Fumio

2008-08-15

The 5'-untranslated region (5'-UTR) of mRNAs functions as a translation enhancer, promoting translation efficiency. Many in vitro translation systems exhibit a reduced efficiency in protein translation due to decreased translation initiation. The use of a 5'-UTR sequence with high translation efficiency greatly enhances protein production in these systems. In this study, we have developed an in vitro selection system that favors 5'-UTRs with high translation efficiency using a ribosome display technique. A 5'-UTR random library, comprised of 5'-UTRs tagged with a His-tag and Renilla luciferase (R-luc) fusion, were in vitro translated in rabbit reticulocytes. By limiting the translation period, onlymore » mRNAs with high translation efficiency were translated. During translation, mRNA, ribosome and translated R-luc with His-tag formed ternary complexes. They were collected with translated His-tag using Ni-particles. Extracted mRNA from ternary complex was amplified using RT-PCR and sequenced. Finally, 5'-UTR with high translation efficiency was obtained from random 5'-UTR library.« less
An Efficient Site-Specific Method for Irreversible Covalent Labeling of Proteins with a Fluorophore.

PubMed

Liu, Jiaquan; Hanne, Jeungphill; Britton, Brooke M; Shoffner, Matthew; Albers, Aaron E; Bennett, Jared; Zatezalo, Rachel; Barfield, Robyn; Rabuka, David; Lee, Jong-Bong; Fishel, Richard

2015-11-19

Fluorophore labeling of proteins while preserving native functions is essential for bulk Förster resonance energy transfer (FRET) interaction and single molecule imaging analysis. Here we describe a versatile, efficient, specific, irreversible, gentle and low-cost method for labeling proteins with fluorophores that appears substantially more robust than a similar but chemically distinct procedure. The method employs the controlled enzymatic conversion of a central Cys to a reactive formylglycine (fGly) aldehyde within a six amino acid Formylglycine Generating Enzyme (FGE) recognition sequence in vitro. The fluorophore is then irreversibly linked to the fGly residue using a Hydrazinyl-Iso-Pictet-Spengler (HIPS) ligation reaction. We demonstrate the robust large-scale fluorophore labeling and purification of E.coli (Ec) mismatch repair (MMR) components. Fluorophore labeling did not alter the native functions of these MMR proteins in vitro or in singulo. Because the FGE recognition sequence is easily portable, FGE-HIPS fluorophore-labeling may be easily extended to other proteins.

A tag-based approach for high-throughput analysis of CCWGG methylation.

PubMed

Denisova, Oksana V; Chernov, Andrei V; Koledachkina, Tatyana Y; Matvienko, Nicholas I

2007-10-15

Non-CpG methylation occurring in the context of CNG sequences is found in plants at a large number of genomic loci. However, there is still little information available about non-CpG methylation in mammals. Efficient methods that would allow detection of scarcely localized methylated sites in small quantities of DNA are required to elucidate the biological role of non-CpG methylation in both plants and animals. In this study, we tested a new whole genome approach to identify sites of CCWGG methylation (W is A or T), a particular case of CNG methylation, in genomic DNA. This technique is based on digestion of DNAs with methylation-sensitive restriction endonucleases EcoRII-C and AjnI. Short DNAs flanking methylated CCWGG sites (tags) are selectively purified and assembled in tandem arrays of up to nine tags. This allows high-throughput sequencing of tags, identification of flanking regions, and their exact positions in the genome. In this study, we tested specificity and efficiency of the approach.
A modified operational sequence methodology for zoo exhibit design and renovation: conceptualizing animals, staff, and visitors as interdependent coworkers.

PubMed

Kelling, Nicholas J; Gaalema, Diann E; Kelling, Angela S

2014-01-01

Human factors analyses have been used to improve efficiency and safety in various work environments. Although generally limited to humans, the universality of these analyses allows for their formal application to a much broader domain. This paper outlines a model for the use of human factors to enhance zoo exhibits and optimize spaces for all user groups; zoo animals, zoo visitors, and zoo staff members. Zoo exhibits are multi-faceted and each user group has a distinct set of requirements that can clash or complement each other. Careful analysis and a reframing of the three groups as interdependent coworkers can enhance safety, efficiency, and experience for all user groups. This paper details a general creation and specific examples of the use of the modified human factors tools of function allocation, operational sequence diagram and needs assessment. These tools allow for adaptability and ease of understanding in the design or renovation of exhibits. © 2014 Wiley Periodicals, Inc.
A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences.

PubMed

Xue, Yun; Liao, Zhengling; Li, Meihang; Luo, Jie; Kuang, Qiuhua; Hu, Xiaohui; Li, Tiechen

2015-01-01

Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms which are unable to reveal OPSMs entirely in NP-complete problem. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to disclose all common subsequence (ACS) between every two row sequences, and therefore all deep OPSMs will not be missed. Then, an improved data structure for prefix tree was used to store and traverse ACS, and Apriori principle was employed to efficiently mine the frequent sequential pattern. Finally, experiments were implemented on gene and synthetic datasets. Results demonstrated the effectiveness and efficiency of this method.
A novel model for DNA sequence similarity analysis based on graph theory.

PubMed

Qi, Xingqin; Wu, Qin; Zhang, Yusen; Fuller, Eddie; Zhang, Cun-Quan

2011-01-01

Determination of sequence similarity is one of the major steps in computational phylogenetic studies. As we know, during evolutionary history, not only DNA mutations for individual nucleotide but also subsequent rearrangements occurred. It has been one of major tasks of computational biologists to develop novel mathematical descriptors for similarity analysis such that various mutation phenomena information would be involved simultaneously. In this paper, different from traditional methods (eg, nucleotide frequency, geometric representations) as bases for construction of mathematical descriptors, we construct novel mathematical descriptors based on graph theory. In particular, for each DNA sequence, we will set up a weighted directed graph. The adjacency matrix of the directed graph will be used to induce a representative vector for DNA sequence. This new approach measures similarity based on both ordering and frequency of nucleotides so that much more information is involved. As an application, the method is tested on a set of 0.9-kb mtDNA sequences of twelve different primate species. All output phylogenetic trees with various distance estimations have the same topology, and are generally consistent with the reported results from early studies, which proves the new method's efficiency; we also test the new method on a simulated data set, which shows our new method performs better than traditional global alignment method when subsequent rearrangements happen frequently during evolutionary history.
Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic

PubMed Central

Derkach, Andriy; Chiang, Theodore; Gong, Jiafen; Addis, Laura; Dobbins, Sara; Tomlinson, Ian; Houlston, Richard; Pal, Deb K.; Strug, Lisa J.

2014-01-01

Motivation: Sufficiently powered case–control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, these studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that substitutes genotype calls by their expected values given observed sequence data. Results: We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. We also demonstrate that, using simulated and real NGS data, the RVS method controls Type I error and has comparable power to the ‘gold standard’ analysis with the true underlying genotypes for both common and rare variants. Availability and implementation: An RVS R script and instructions can be found at strug.research.sickkids.ca, and at https://github.com/strug-lab/RVS. Contact: lisa.strug@utoronto.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24733292
A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis.

PubMed

Liu, Bin; Wang, Xiaolong; Lin, Lei; Dong, Qiwen; Wang, Xuan

2008-12-01

Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.
Terminator oligo blocking efficiently eliminates rRNA from Drosophila small RNA sequencing libraries.

PubMed

Wickersheim, Michelle L; Blumenstiel, Justin P

2013-11-01

A large number of methods are available to deplete ribosomal RNA reads from high-throughput RNA sequencing experiments. Such methods are critical for sequencing Drosophila small RNAs between 20 and 30 nucleotides because size selection is not typically sufficient to exclude the highly abundant class of 30 nucleotide 2S rRNA. Here we demonstrate that pre-annealing terminator oligos complimentary to Drosophila 2S rRNA prior to 5' adapter ligation and reverse transcription efficiently depletes 2S rRNA sequences from the sequencing reaction in a simple and inexpensive way. This depletion is highly specific and is achieved with minimal perturbation of miRNA and piRNA profiles.
A multiple-alignment based primer design algorithm for genetically highly variable DNA targets

PubMed Central

2013-01-01

Background Primer design for highly variable DNA sequences is difficult, and experimental success requires attention to many interacting constraints. The advent of next-generation sequencing methods allows the investigation of rare variants otherwise hidden deep in large populations, but requires attention to population diversity and primer localization in relatively conserved regions, in addition to recognized constraints typically considered in primer design. Results Design constraints include degenerate sites to maximize population coverage, matching of melting temperatures, optimizing de novo sequence length, finding optimal bio-barcodes to allow efficient downstream analyses, and minimizing risk of dimerization. To facilitate primer design addressing these and other constraints, we created a novel computer program (PrimerDesign) that automates this complex procedure. We show its powers and limitations and give examples of successful designs for the analysis of HIV-1 populations. Conclusions PrimerDesign is useful for researchers who want to design DNA primers and probes for analyzing highly variable DNA populations. It can be used to design primers for PCR, RT-PCR, Sanger sequencing, next-generation sequencing, and other experimental protocols targeting highly variable DNA samples. PMID:23965160
DNA-Templated Polymerization of Side-Chain-Functionalized Peptide Nucleic Acid Aldehydes

PubMed Central

Kleiner, Ralph E.; Brudno, Yevgeny; Birnbaum, Michael E.; Liu, David R.

2009-01-01

The DNA-templated polymerization of synthetic building blocks provides a potential route to the laboratory evolution of sequence-defined polymers with structures and properties not necessarily limited to those of natural biopolymers. We previously reported the efficient and sequence-specific DNA-templated polymerization of peptide nucleic acid (PNA) aldehydes. Here, we report the enzyme-free, DNA-templated polymerization of side-chain-functionalized PNA tetramer and pentamer aldehydes. We observed that the polymerization of tetramer and pentamer PNA building blocks with a single lysine-based side chain at various positions in the building block could proceed efficiently and sequence-specifically. In addition, DNA-templated polymerization also proceeded efficiently and in a sequence-specific manner with pentamer PNA aldehydes containing two or three lysine side chains in a single building block to generate more densely functionalized polymers. To further our understanding of side-chain compatibility and expand the capabilities of this system, we also examined the polymerization efficiencies of 20 pentamer building blocks each containing one of five different side-chain groups and four different side-chain regio- and stereochemistries. Polymerization reactions were efficient for all five different side-chain groups and for three of the four combinations of side-chain regio- and stereochemistries. Differences in the efficiency and initial rate of polymerization correlate with the apparent melting temperature of each building block, which is dependent on side-chain regio- and stereochemistry, but relatively insensitive to side-chain structure among the substrates tested. Our findings represent a significant step towards the evolution of sequence-defined synthetic polymers and also demonstrate that enzyme-free nucleic acid-templated polymerization can occur efficiently using substrates with a wide range of side-chain structures, functionalization positions within each building block, and functionalization densities. PMID:18341334
Large scale RNAi screen in Tribolium reveals novel target genes for pest control and the proteasome as prime target.

PubMed

Ulrich, Julia; Dao, Van Anh; Majumdar, Upalparna; Schmitt-Engel, Christian; Schwirz, Jonas; Schultheis, Dorothea; Ströhlein, Nadi; Troelenberg, Nicole; Grossmann, Daniela; Richter, Tobias; Dönitz, Jürgen; Gerischer, Lizzy; Leboulle, Gérard; Vilcinskas, Andreas; Stanke, Mario; Bucher, Gregor

2015-09-03

Insect pest control is challenged by insecticide resistance and negative impact on ecology and health. One promising pest specific alternative is the generation of transgenic plants, which express double stranded RNAs targeting essential genes of a pest species. Upon feeding, the dsRNA induces gene silencing in the pest resulting in its death. However, the identification of efficient RNAi target genes remains a major challenge as genomic tools and breeding capacity is limited in most pest insects impeding whole-animal-high-throughput-screening. We use the red flour beetle Tribolium castaneum as a screening platform in order to identify the most efficient RNAi target genes. From about 5,000 randomly screened genes of the iBeetle RNAi screen we identify 11 novel and highly efficient RNAi targets. Our data allowed us to determine GO term combinations that are predictive for efficient RNAi target genes with proteasomal genes being most predictive. Finally, we show that RNAi target genes do not appear to act synergistically and that protein sequence conservation does not correlate with the number of potential off target sites. Our results will aid the identification of RNAi target genes in many pest species by providing a manageable number of excellent candidate genes to be tested and the proteasome as prime target. Further, the identified GO term combinations will help to identify efficient target genes from organ specific transcriptomes. Our off target analysis is relevant for the sequence selection used in transgenic plants.
Crosslinking transcription factors to their recognition sequences with PtII complexes

NASA Technical Reports Server (NTRS)

Chu, B. C.; Orgel, L. E.

1992-01-01

We have prepared phosphorothioate-containing cyclic oligodeoxynucleotides that fold into 'dumbbells' containing CRE and TRE sequences, the binding sequences for the CREB and JUN proteins, respectively. Six phosphorothioate residues were introduced into each of the recognition sequences. K2PtCl4 crosslinks CRE to CREB and TRE to JUN. The extent of crosslinking is about eight times greater than that observed with standard oligodeoxynucleotides and amounts to 30-50% of the efficiency of non-covalent association as estimated by gel-shift assays. Crosslinking is reversed by incubation with NaCN. The crosslinking reaction is specific--a dumbbell oligonucleotide with six phosphorothioate groups introduced into the Sp1 recognition sequence could not be crosslinked efficiently to CREB or JUN proteins with K2PtCl4. The binding of TRE to CREB is not strong enough for effective detection by gel-shift assays, but the TRE-CREB complex is crosslinked efficiently by K2PtCl4 and can then readily be detected.
Kinetic analysis of bypass of abasic site by the catalytic core of yeast DNA polymerase eta.

PubMed

Yang, Juntang; Wang, Rong; Liu, Binyan; Xue, Qizhen; Zhong, Mengyu; Zeng, Hao; Zhang, Huidong

2015-09-01

Abasic sites (Apurinic/apyrimidinic (AP) sites), produced ∼ 50,000 times/cell/day, are very blocking and miscoding. To better understand miscoding mechanisms of abasic site for yeast DNA polymerase η, pre-steady-state nucleotide incorporation and LC-MS/MS sequence analysis of extension product were studied using pol η(core) (catalytic core, residues 1-513), which can completely eliminate the potential effects of the C-terminal C2H2 motif of pol η on dNTP incorporation. The extension beyond the abasic site was very inefficient. Compared with incorporation of dCTP opposite G, the incorporation efficiencies opposite abasic site were greatly reduced according to the order of dGTP > dATP > dCTP and dTTP. Pol η(core) showed no fast burst phase for any incorporation opposite G or abasic site, suggesting that the catalytic step is not faster than the dissociation of polymerase from DNA. LC-MS/MS sequence analysis of extension products showed that 53% products were dGTP misincorporation, 33% were dATP and 14% were -1 frameshift, indicating that Pol η(core) bypasses abasic site by a combined G-rule, A-rule and -1 frameshift deletions. Compared with full-length pol η, pol η(core) relatively reduced the efficiency of incorporation of dCTP opposite G, increased the efficiencies of dNTP incorporation opposite abasic site and the exclusive incorporation of dGTP opposite abasic site, but inhibited the extension beyond abasic site, and increased the priority in extension of A: abasic site relative to G: abasic site. This study provides further understanding in the mutation mechanism of abasic sites for yeast DNA polymerase η. Copyright © 2015 Elsevier B.V. All rights reserved.
Combining and Comparing Coalescent, Distance and Character-Based Approaches for Barcoding Microalgaes: A Test with Chlorella-Like Species (Chlorophyta).

PubMed

Zou, Shanmei; Fei, Cong; Song, Jiameng; Bao, Yachao; He, Meilin; Wang, Changhai

2016-01-01

Several different barcoding methods of distinguishing species have been advanced, but which method is the best is still controversial. Chlorella is becoming particularly promising in the development of second-generation biofuels. However, the taxonomy of Chlorella-like organisms is easily confused. Here we report a comprehensive barcoding analysis of Chlorella-like species from Chlorella, Chloroidium, Dictyosphaerium and Actinastrum based on rbcL, ITS, tufA and 16S sequences to test the efficiency of traditional barcoding, GMYC, ABGD, PTP, P ID and character-based barcoding methods. First of all, the barcoding results gave new insights into the taxonomic assessment of Chlorella-like organisms studied, including the clear species discrimination and resolution of potentially cryptic species complexes in C. sorokiniana, D. ehrenbergianum and C. Vulgaris. The tufA proved to be the most efficient barcoding locus, which thus could be as potential "specific barcode" for Chlorella-like species. The 16S failed in discriminating most closely related species. The resolution of GMYC, PTP, P ID, ABGD and character-based barcoding methods were variable among rbcL, ITS and tufA genes. The best resolution for species differentiation appeared in tufA analysis where GMYC, PTP, ABGD and character-based approaches produced consistent groups while the PTP method over-split the taxa. The character analysis of rbcL, ITS and tufA sequences could clearly distinguish all taxonomic groups respectively, including the potentially cryptic lineages, with many character attributes. Thus, the character-based barcoding provides an attractive complement to coalescent and distance-based barcoding. Our study represents the test that proves the efficiency of multiple DNA barcoding in species discrimination of microalgaes.
Combining and Comparing Coalescent, Distance and Character-Based Approaches for Barcoding Microalgaes: A Test with Chlorella-Like Species (Chlorophyta)

PubMed Central

Zou, Shanmei; Fei, Cong; Song, Jiameng; Bao, Yachao; He, Meilin; Wang, Changhai

2016-01-01

Several different barcoding methods of distinguishing species have been advanced, but which method is the best is still controversial. Chlorella is becoming particularly promising in the development of second-generation biofuels. However, the taxonomy of Chlorella–like organisms is easily confused. Here we report a comprehensive barcoding analysis of Chlorella-like species from Chlorella, Chloroidium, Dictyosphaerium and Actinastrum based on rbcL, ITS, tufA and 16S sequences to test the efficiency of traditional barcoding, GMYC, ABGD, PTP, P ID and character-based barcoding methods. First of all, the barcoding results gave new insights into the taxonomic assessment of Chlorella-like organisms studied, including the clear species discrimination and resolution of potentially cryptic species complexes in C. sorokiniana, D. ehrenbergianum and C. Vulgaris. The tufA proved to be the most efficient barcoding locus, which thus could be as potential “specific barcode” for Chlorella-like species. The 16S failed in discriminating most closely related species. The resolution of GMYC, PTP, P ID, ABGD and character-based barcoding methods were variable among rbcL, ITS and tufA genes. The best resolution for species differentiation appeared in tufA analysis where GMYC, PTP, ABGD and character-based approaches produced consistent groups while the PTP method over-split the taxa. The character analysis of rbcL, ITS and tufA sequences could clearly distinguish all taxonomic groups respectively, including the potentially cryptic lineages, with many character attributes. Thus, the character-based barcoding provides an attractive complement to coalescent and distance-based barcoding. Our study represents the test that proves the efficiency of multiple DNA barcoding in species discrimination of microalgaes. PMID:27092945
Assembly and analysis of a male sterile rubber tree mitochondrial genome reveals DNA rearrangement events and a novel transcript

PubMed Central

2014-01-01

Background The rubber tree, Hevea brasiliensis, is an important plant species that is commercially grown to produce latex rubber in many countries. The rubber tree variety BPM 24 exhibits cytoplasmic male sterility, inherited from the variety GT 1. Results We constructed the rubber tree mitochondrial genome of a cytoplasmic male sterile variety, BPM 24, using 454 sequencing, including 8 kb paired-end libraries, plus Illumina paired-end sequencing. We annotated this mitochondrial genome with the aid of Illumina RNA-seq data and performed comparative analysis. We then compared the sequence of BPM 24 to the contigs of the published rubber tree, variety RRIM 600, and identified a rearrangement that is unique to BPM 24 resulting in a novel transcript containing a portion of atp9. Conclusions The novel transcript is consistent with changes that cause cytoplasmic male sterility through a slight reduction to ATP production efficiency. The exhaustive nature of the search rules out alternative causes and supports previous findings of novel transcripts causing cytoplasmic male sterility. PMID:24512148
Integrated genome sequence and linkage map of physic nut (Jatropha curcas L.), a biodiesel plant.

PubMed

Wu, Pingzhi; Zhou, Changpin; Cheng, Shifeng; Wu, Zhenying; Lu, Wenjia; Han, Jinli; Chen, Yanbo; Chen, Yan; Ni, Peixiang; Wang, Ying; Xu, Xun; Huang, Ying; Song, Chi; Wang, Zhiwen; Shi, Nan; Zhang, Xudong; Fang, Xiaohua; Yang, Qing; Jiang, Huawu; Chen, Yaping; Li, Meiru; Wang, Ying; Chen, Fan; Wang, Jun; Wu, Guojiang

2015-03-01

The family Euphorbiaceae includes some of the most efficient biomass accumulators. Whole genome sequencing and the development of genetic maps of these species are important components in molecular breeding and genetic improvement. Here we report the draft genome of physic nut (Jatropha curcas L.), a biodiesel plant. The assembled genome has a total length of 320.5 Mbp and contains 27,172 putative protein-coding genes. We established a linkage map containing 1208 markers and anchored the genome assembly (81.7%) to this map to produce 11 pseudochromosomes. After gene family clustering, 15,268 families were identified, of which 13,887 existed in the castor bean genome. Analysis of the genome highlighted specific expansion and contraction of a number of gene families during the evolution of this species, including the ribosome-inactivating proteins and oil biosynthesis pathway enzymes. The genomic sequence and linkage map provide a valuable resource not only for fundamental and applied research on physic nut but also for evolutionary and comparative genomics analysis, particularly in the Euphorbiaceae. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.
Quantitative Assessment of In-solution Digestion Efficiency Identifies Optimal Protocols for Unbiased Protein Analysis*

PubMed Central

León, Ileana R.; Schwämmle, Veit; Jensen, Ole N.; Sprenger, Richard R.

2013-01-01

The majority of mass spectrometry-based protein quantification studies uses peptide-centric analytical methods and thus strongly relies on efficient and unbiased protein digestion protocols for sample preparation. We present a novel objective approach to assess protein digestion efficiency using a combination of qualitative and quantitative liquid chromatography-tandem MS methods and statistical data analysis. In contrast to previous studies we employed both standard qualitative as well as data-independent quantitative workflows to systematically assess trypsin digestion efficiency and bias using mitochondrial protein fractions. We evaluated nine trypsin-based digestion protocols, based on standard in-solution or on spin filter-aided digestion, including new optimized protocols. We investigated various reagents for protein solubilization and denaturation (dodecyl sulfate, deoxycholate, urea), several trypsin digestion conditions (buffer, RapiGest, deoxycholate, urea), and two methods for removal of detergents before analysis of peptides (acid precipitation or phase separation with ethyl acetate). Our data-independent quantitative liquid chromatography-tandem MS workflow quantified over 3700 distinct peptides with 96% completeness between all protocols and replicates, with an average 40% protein sequence coverage and an average of 11 peptides identified per protein. Systematic quantitative and statistical analysis of physicochemical parameters demonstrated that deoxycholate-assisted in-solution digestion combined with phase transfer allows for efficient, unbiased generation and recovery of peptides from all protein classes, including membrane proteins. This deoxycholate-assisted protocol was also optimal for spin filter-aided digestions as compared with existing methods. PMID:23792921
HSA: a heuristic splice alignment tool.

PubMed

Bu, Jingde; Chi, Xuebin; Jin, Zhong

2013-01-01

RNA-Seq methodology is a revolutionary transcriptomics sequencing technology, which is the representative of Next generation Sequencing (NGS). With the high throughput sequencing of RNA-Seq, we can acquire much more information like differential expression and novel splice variants from deep sequence analysis and data mining. But the short read length brings a great challenge to alignment, especially when the reads span two or more exons. A two steps heuristic splice alignment tool is generated in this investigation. First, map raw reads to reference with unspliced aligner--BWA; second, split initial unmapped reads into three equal short reads (seeds), align each seed to the reference, filter hits, search possible split position of read and extend hits to a complete match. Compare with other splice alignment tools like SOAPsplice and Tophat2, HSA has a better performance in call rate and efficiency, but its results do not as accurate as the other software to some extent. HSA is an effective spliced aligner of RNA-Seq reads mapping, which is available at https://github.com/vlcc/HSA.
Nuclear Mitochondrial DNA Activates Replication in Saccharomyces cerevisiae

PubMed Central

Chatre, Laurent; Ricchetti, Miria

2011-01-01

The nuclear genome of eukaryotes is colonized by DNA fragments of mitochondrial origin, called NUMTs. These insertions have been associated with a variety of germ-line diseases in humans. The significance of this uptake of potentially dangerous sequences into the nuclear genome is unclear. Here we provide functional evidence that sequences of mitochondrial origin promote nuclear DNA replication in Saccharomyces cerevisiae. We show that NUMTs are rich in key autonomously replicating sequence (ARS) consensus motifs, whose mutation results in the reduction or loss of DNA replication activity. Furthermore, 2D-gel analysis of the mrc1 mutant exposed to hydroxyurea shows that several NUMTs function as late chromosomal origins. We also show that NUMTs located close to or within ARS provide key sequence elements for replication. Thus NUMTs can act as independent origins, when inserted in an appropriate genomic context or affect the efficiency of pre-existing origins. These findings show that migratory mitochondrial DNAs can impact on the replication of the nuclear region they are inserted in. PMID:21408151
Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing.

PubMed

Espin-Garcia, Osvaldo; Craiu, Radu V; Bull, Shelley B

2018-02-01

We evaluate two-phase designs to follow-up findings from genome-wide association study (GWAS) when the cost of regional sequencing in the entire cohort is prohibitive. We develop novel expectation-maximization-based inference under a semiparametric maximum likelihood formulation tailored for post-GWAS inference. A GWAS-SNP (where SNP is single nucleotide polymorphism) serves as a surrogate covariate in inferring association between a sequence variant and a normally distributed quantitative trait (QT). We assess test validity and quantify efficiency and power of joint QT-SNP-dependent sampling and analysis under alternative sample allocations by simulations. Joint allocation balanced on SNP genotype and extreme-QT strata yields significant power improvements compared to marginal QT- or SNP-based allocations. We illustrate the proposed method and evaluate the sensitivity of sample allocation to sampling variation using data from a sequencing study of systolic blood pressure. © 2017 The Authors. Genetic Epidemiology Published by Wiley Periodicals, Inc.

Nuclear mitochondrial DNA activates replication in Saccharomyces cerevisiae.

PubMed

Chatre, Laurent; Ricchetti, Miria

2011-03-08

The nuclear genome of eukaryotes is colonized by DNA fragments of mitochondrial origin, called NUMTs. These insertions have been associated with a variety of germ-line diseases in humans. The significance of this uptake of potentially dangerous sequences into the nuclear genome is unclear. Here we provide functional evidence that sequences of mitochondrial origin promote nuclear DNA replication in Saccharomyces cerevisiae. We show that NUMTs are rich in key autonomously replicating sequence (ARS) consensus motifs, whose mutation results in the reduction or loss of DNA replication activity. Furthermore, 2D-gel analysis of the mrc1 mutant exposed to hydroxyurea shows that several NUMTs function as late chromosomal origins. We also show that NUMTs located close to or within ARS provide key sequence elements for replication. Thus NUMTs can act as independent origins, when inserted in an appropriate genomic context or affect the efficiency of pre-existing origins. These findings show that migratory mitochondrial DNAs can impact on the replication of the nuclear region they are inserted in.
Human adenovirus serotype 12 virion precursors pMu and pVI are cleaved at amino-terminal and carboxy-terminal sites that conform to the adenovirus 2 endoproteinase cleavage consensus sequence.

PubMed

Freimuth, P; Anderson, C W

1993-03-01

The sequence of a 1158-base pair fragment of the human adenovirus serotype 12 (Ad12) genome was determined. This segment encodes the precursors for virion components Mu and VI. Both Ad12 precursors contain two sequences that conform to a consensus sequence motif for cleavage by the endoproteinase of adenovirus 2 (Ad2). Analysis of the amino terminus of VI and of the peptide fragments found in Ad12 virions demonstrated that these sites are cleaved during Ad12 maturation. This observation suggests that the recognition motif for adenovirus endoproteinases is highly conserved among human serotypes. The adenovirus 2 endoproteinase polypeptide requires additional co-factors for activity (C. W. Anderson, Protein Expression Purif., 1993, 4, 8-15). Synthetic Ad12 or Ad2 pVI carboxy-terminal peptides each permitted efficient cleavage of an artificial endoproteinase substrate by recombinant Ad2 endoproteinase polypeptide.
Genomic sequencing and the impact of molecular diagnosis on patient care.

PubMed

Solomon, Benjamin D

2015-02-01

Evolving sequencing technologies allow more accurate, efficient and affordable genomic analysis. As a result, these technologies are increasingly available, especially to provide molecular diagnoses for patients with suspected genetic disorders. However, there are many challenges to using genomic sequencing to benefit patients, including concerns that there is insufficient evidence that identifying an underlying molecular explanation may positively impact a patient's healthcare. This concern has many repercussions, including funding and/or (in some countries and healthcare systems) insurance reimbursement for genomic sequencing. To investigate this concern, all monogenic disorders were analyzed based on the impact of achieving molecular diagnosis. Of the 2,849 individual genes in which germline mutations cause disorders (not including contiguous gene syndromes or what may be categorized as susceptibility alleles), our analyses showed a specific, available intervention related to at least one affected organ system for 1,419 (49.8%) genes. In 95.6% of these genes, the intervention(s) would be recommended during the pediatric time frame.
Probing the phylogenetic relationships of a few newly recorded intertidal zoanthids of Gujarat coast (India) with mtDNA COI sequences.

PubMed

Joseph, Sneha; Poriya, Paresh; Kundu, Rahul

2016-11-01

The present study reports the phylogenetic relationship of six zoanthid species belonging to three genera, Isaurus, Palythoa, and Zoanthus identified using systematic computational analysis of mtDNA gene sequences. All six species are first recorded from the coasts of Kathiawar Peninsula, India. Genus: Isaurus is represented by Isaurus tuberculatus, genus Zoanthus is represented by Zoanthus kuroshio and Zoanthus sansibaricus, while genus Palythoa is represented by Palythoa tuberculosa, P. sp. JVK-2006 and Palythoa heliodiscus. Results of the present study revealed that among the various species observed along the coastline, a minimum of 99% sequence divergence and a maximum of 96% sequence divergence were seen. An interspecific divergence of 1-4% and negligible intraspecific divergence was observed. These results not only highlighted the efficiency of the COI gene region in species identification but also demonstrated the genetic variability of zoanthids along the Saurashtra coastline of the west coast of India.
Short communication: development and characterization of novel transcriptome-derived microsatellites for genetic analysis of persimmon.

PubMed

Luo, C; Zhang, Q L; Luo, Z R

2014-04-16

Oriental persimmon (Diospyros kaki Thunb.) (2n = 6x = 90) is a major commercial and deciduous fruit tree that is believed to have originated in China. However, rare transcriptomic and genomic information on persimmon is available. Using Roche 454 sequencing technology, the transcriptome from RNA of the flowers of D. kaki was analyzed. A total of 1,250,893 reads were generated and 83,898 unigenes were assembled. A total of 42,711 SSR loci were identified from 23,494 unigenes and 289 polymerase chain reaction primer pairs were designed. Of these 289 primers, 155 (53.6%) showed robust PCR amplification and 98 revealed polymorphism between 15 persimmon genotypes, indicating a polymorphic rate of 63.23% of the productive primers for characterization and genotyping of the genus Diospyros. Transcriptome sequence data generated from next-generation sequencing technology to identify microsatellite loci appears to be rapid and cost-efficient, particularly for species with no genomic sequence information available.
Function-Based Algorithms for Biological Sequences

ERIC Educational Resources Information Center

Mohanty, Pragyan Sheela P.

2015-01-01

Two problems at two different abstraction levels of computational biology are studied. At the molecular level, efficient pattern matching algorithms in DNA sequences are presented. For gene order data, an efficient data structure is presented capable of storing all gene re-orderings in a systematic manner. A common characteristic of presented…
A comparison of RNA with DNA in template-directed synthesis

NASA Technical Reports Server (NTRS)

Zielinski, M.; Kozlov, I. A.; Orgel, L. E.; Bada, J. L. (Principal Investigator)

2000-01-01

Nonenzymatic template-directed copying of RNA sequences rich in cytidylic acid using nucleoside 5'-(2-methylimidazol-1-yl phosphates) as substrates is substantially more efficient than the copying of corresponding DNA sequences. However, many sequences cannot be copied, and the prospect of replication in this system is remote, even for RNA. Surprisingly, wobble-pairing leads to much more efficient incorporation of G opposite U on RNA templates than of G opposite T on DNA templates.
XS: a FASTQ read simulator.

PubMed

Pratas, Diogo; Pinho, Armando J; Rodrigues, João M O S

2014-01-16

The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.
Analysis of genomic sequences by Chaos Game Representation.

PubMed

Almeida, J S; Carriço, J A; Maretzek, A; Noble, P A; Fletcher, M

2001-05-01

Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find the coordinates for their position in a continuous space. This distribution of positions has two properties: it is unique, and the source sequence can be recovered from the coordinates such that distance between positions measures similarity between the corresponding sequences. The possibility of using the latter property to identify succession schemes have been entirely overlooked in previous studies which raises the possibility that CGR may be upgraded from a mere representation technique to a sequence modeling tool. The distribution of positions in the CGR plane were shown to be a generalization of Markov chain probability tables that accommodates non-integer orders. Therefore, Markov models are particular cases of CGR models rather than the reverse, as currently accepted. In addition, the CGR generalization has both practical (computational efficiency) and fundamental (scale independence) advantages. These results are illustrated by using Escherichia coli K-12 as a test data-set, in particular, the genes thrA, thrB and thrC of the threonine operon.
Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics.

PubMed

Straub, Shannon C K; Parks, Matthew; Weitemier, Kevin; Fishbein, Mark; Cronn, Richard C; Liston, Aaron

2012-02-01

Just as Sanger sequencing did more than 20 years ago, next-generation sequencing (NGS) is poised to revolutionize plant systematics. By combining multiplexing approaches with NGS throughput, systematists may no longer need to choose between more taxa or more characters. Here we describe a genome skimming (shallow sequencing) approach for plant systematics. Through simulations, we evaluated optimal sequencing depth and performance of single-end and paired-end short read sequences for assembly of nuclear ribosomal DNA (rDNA) and plastomes and addressed the effect of divergence on reference-guided plastome assembly. We also used simulations to identify potential phylogenetic markers from low-copy nuclear loci at different sequencing depths. We demonstrated the utility of genome skimming through phylogenetic analysis of the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae). Paired-end reads performed better than single-end reads. Minimum sequencing depths for high quality rDNA and plastome assemblies were 40× and 30×, respectively. Divergence from the reference significantly affected plastome assembly, but relatively similar references are available for most seed plants. Deeper rDNA sequencing is necessary to characterize intragenomic polymorphism. The low-copy fraction of the nuclear genome was readily surveyed, even at low sequencing depths. Nearly 160000 bp of sequence from three organelles provided evidence of phylogenetic incongruence in the SDC. Adoption of NGS will facilitate progress in plant systematics, as whole plastome and rDNA cistrons, partial mitochondrial genomes, and low-copy nuclear markers can now be efficiently obtained for molecular phylogenetics studies.
[Correlation of codon biases and potential secondary structures with mRNA translation efficiency in unicellular organisms].

PubMed

Vladimirov, N V; Likhoshvaĭ, V A; Matushkin, Iu G

2007-01-01

Gene expression is known to correlate with degree of codon bias in many unicellular organisms. However, such correlation is absent in some organisms. Recently we demonstrated that inverted complementary repeats within coding DNA sequence must be considered for proper estimation of translation efficiency, since they may form secondary structures that obstruct ribosome movement. We have developed a program for estimation of potential coding DNA sequence expression in defined unicellular organism using its genome sequence. The program computes elongation efficiency index. Computation is based on estimation of coding DNA sequence elongation efficiency, taking into account three key factors: codon bias, average number of inverted complementary repeats, and free energy of potential stem-loop structures formed by the repeats. The influence of these factors on translation is numerically estimated. An optimal proportion of these factors is computed for each organism individually. Quantitative translational characteristics of 384 unicellular organisms (351 bacteria, 28 archaea, 5 eukaryota) have been computed using their annotated genomes from NCBI GenBank. Five potential evolutionary strategies of translational optimization have been determined among studied organisms. A considerable difference of preferred translational strategies between Bacteria and Archaea has been revealed. Significant correlations between elongation efficiency index and gene expression levels have been shown for two organisms (S. cerevisiae and H. pylori) using available microarray data. The proposed method allows to estimate numerically the coding DNA sequence translation efficiency and to optimize nucleotide composition of heterologous genes in unicellular organisms. http://www.mgs.bionet.nsc.ru/mgs/programs/eei-calculator/.
Identification of Immunogenic Hot Spots within Plum Pox Potyvirus Capsid Protein for Efficient Antigen Presentation

PubMed Central

Fernández-Fernández, M. Rosario; Martínez-Torrecuadrada, Jorge L.; Roncal, Fernando; Domínguez, Elvira; García, Juan Antonio

2002-01-01

PEPSCAN analysis has been used to characterize the immunogenic regions of the capsid protein (CP) in virions of plum pox potyvirus (PPV). In addition to the well-known highly immunogenic N- and C-terminal domains of CP, regions within the core domain of the protein have also shown high immunogenicity. Moreover, the N terminus of CP is not homogeneously immunogenic, alternatively showing regions frequently recognized by antibodies and others that are not recognized at all. These results have helped us to design efficient antigen presentation vectors based on PPV. As predicted by PEPSCAN analysis, a small displacement of the insertion site in a previously constructed vector, PPV-γ, turned the derived chimeras into efficient immunogens. Vectors expressing foreign peptides at different positions within a highly immunogenic region (amino acids 43 to 52) in the N-terminal domain of CP were the most effective at inducing specific antibody responses against the foreign sequence. PMID:12438590
Phenol-degrading anode biofilm with high coulombic efficiency in graphite electrodes microbial fuel cell.

PubMed

Zhang, Dongdong; Li, Zhiling; Zhang, Chunfang; Zhou, Xue; Xiao, Zhixing; Awata, Takanori; Katayama, Arata

2017-03-01

A microbial fuel cell (MFC), with graphite electrodes as both the anode and cathode, was operated with a soil-free anaerobic consortium for phenol degradation. This phenol-degrading MFC showed high efficiency with a current density of 120 mA/m 2 and a coulombic efficiency of 22.7%, despite the lack of a platinum catalyst cathode and inoculation of sediment/soil. Removal of planktonic bacteria by renewing the anaerobic medium did not decrease the performance, suggesting that the phenol-degrading MFC was not maintained by the planktonic bacteria but by the microorganisms in the anode biofilm. Cyclic voltammetry analysis of the anode biofilm showed distinct oxidation and reduction peaks. Analysis of the microbial community structure of the anode biofilm and the planktonic bacteria based on 16S rRNA gene sequences suggested that Geobacter sp. was the phenol degrader in the anode biofilm and was responsible for current generation. Copyright © 2016 The Society for Biotechnology, Japan. Published by Elsevier B.V. All rights reserved.
Comparative evaluation of rRNA depletion procedures for the improved analysis of bacterial biofilm and mixed pathogen culture transcriptomes

PubMed Central

Petrova, Olga E.; Garcia-Alcalde, Fernando; Zampaloni, Claudia; Sauer, Karin

2017-01-01

Global transcriptomic analysis via RNA-seq is often hampered by the high abundance of ribosomal (r)RNA in bacterial cells. To remove rRNA and enrich coding sequences, subtractive hybridization procedures have become the approach of choice prior to RNA-seq, with their efficiency varying in a manner dependent on sample type and composition. Yet, despite an increasing number of RNA-seq studies, comparative evaluation of bacterial rRNA depletion methods has remained limited. Moreover, no such study has utilized RNA derived from bacterial biofilms, which have potentially higher rRNA:mRNA ratios and higher rRNA carryover during RNA-seq analysis. Presently, we evaluated the efficiency of three subtractive hybridization-based kits in depleting rRNA from samples derived from biofilm, as well as planktonic cells of the opportunistic human pathogen Pseudomonas aeruginosa. Our results indicated different rRNA removal efficiency for the three procedures, with the Ribo-Zero kit yielding the highest degree of rRNA depletion, which translated into enhanced enrichment of non-rRNA transcripts and increased depth of RNA-seq coverage. The results indicated that, in addition to improving RNA-seq sensitivity, efficient rRNA removal enhanced detection of low abundance transcripts via qPCR. Finally, we demonstrate that the Ribo-Zero kit also exhibited the highest efficiency when P. aeruginosa/Staphylococcus aureus co-culture RNA samples were tested. PMID:28117413
Comparison of Seven Methods for Boolean Factor Analysis and Their Evaluation by Information Gain.

PubMed

Frolov, Alexander A; Húsek, Dušan; Polyakov, Pavel Yu

2016-03-01

An usual task in large data set analysis is searching for an appropriate data representation in a space of fewer dimensions. One of the most efficient methods to solve this task is factor analysis. In this paper, we compare seven methods for Boolean factor analysis (BFA) in solving the so-called bars problem (BP), which is a BFA benchmark. The performance of the methods is evaluated by means of information gain. Study of the results obtained in solving BP of different levels of complexity has allowed us to reveal strengths and weaknesses of these methods. It is shown that the Likelihood maximization Attractor Neural Network with Increasing Activity (LANNIA) is the most efficient BFA method in solving BP in many cases. Efficacy of the LANNIA method is also shown, when applied to the real data from the Kyoto Encyclopedia of Genes and Genomes database, which contains full genome sequencing for 1368 organisms, and to text data set R52 (from Reuters 21578) typically used for label categorization.
Genome sequence analysis of a flocculant-producing bacterium, Paenibacillus shenyangensis.

PubMed

Fu, Lili; Jiang, Binhui; Liu, Jinliang; Zhao, Xin; Liu, Qian; Hu, Xiaomin

2016-03-01

To explore the metabolic process of Paenibacillus shenyangensis that is an efficient bioflocculant-producing bacterium. The biosynthesis mechanism of bioflocculation was used to enrich the genome of Paenibacillus shenyangensis and provide a basis for molecular genetics and functional genomics analyses. According to the analysis of de novo assembly, a total of 5,501,467 bp clean reads were generated, and were assembled into 92 contigs. 4800 unigenes were predicted of which 4393 were annotated showing a specific gene function in the NCBI-Nr database. 3423 genes were found in the database of cluster of orthologous groups. Among the 168 Kyoto Encyclopedia of Genes and Genomes database, cell growth and metabolism were the main biological processes, and a potential metabolic pathway was predicted from glucose to exopolysaccharide within the starch and sucrose metabolism pathway. By using the high-throughput sequencing technology, we provide a genome analysis of Paenibacillus shenyangensis that predicts the main metabolic processes and a potential pathway of exopolysaccharide biosynthesis.
Tilted pillar array fabrication by the combination of proton beam writing and soft lithography for microfluidic cell capture Part 2: Image sequence analysis based evaluation and biological application.

PubMed

Járvás, Gábor; Varga, Tamás; Szigeti, Márton; Hajba, László; Fürjes, Péter; Rajta, István; Guttman, András

2018-02-01

As a continuation of our previously published work, this paper presents a detailed evaluation of a microfabricated cell capture device utilizing a doubly tilted micropillar array. The device was fabricated using a novel hybrid technology based on the combination of proton beam writing and conventional lithography techniques. Tilted pillars offer unique flow characteristics and support enhanced fluidic interaction for improved immunoaffinity based cell capture. The performance of the microdevice was evaluated by an image sequence analysis based in-house developed single-cell tracking system. Individual cell tracking allowed in-depth analysis of the cell-chip surface interaction mechanism from hydrodynamic point of view. Simulation results were validated by using the hybrid device and the optimized surface functionalization procedure. Finally, the cell capture capability of this new generation microdevice was demonstrated by efficiently arresting cells from a HT29 cell-line suspension. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Neutralizing antibodies against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing

PubMed Central

Tsioris, Konstantinos; Gupta, Namita T.; Ogunniyi, Adebola O.; Zimnisky, Ross M.; Qian, Feng; Yao, Yi; Wang, Xiaomei; Stern, Joel N. H.; Chari, Raj; Briggs, Adrian W.; Clouser, Christopher R.; Vigneault, Francois; Church, George M.; Garcia, Melissa N.; Murray, Kristy O.; Montgomery, Ruth R.; Kleinstein, Steven H.; Love, J. Christopher

2015-01-01

West Nile virus infection (WNV) is an emerging mosquito-borne disease that can lead to severe neurological illness and currently has no available treatment or vaccine. Using microengraving, an integrated single-cell analysis method, we analyzed a cohort of subjects infected with WNV - recently infected and post-convalescent subjects - and efficiently identified four novel WNV neutralizing antibodies. We also assessed the humoral response to WNV on a single-cell and repertoire level by integrating next generation sequencing (NGS) into our analysis. The results from single-cell analysis indicate persistence of WNV-specific memory B cells and antibody-secreting cells in post-convalescent subjects. These cells exhibited class-switched antibody isotypes. Furthermore, the results suggest that the antibody response itself does not predict the clinical severity of the disease (asymptomatic or symptomatic). Using the nucleotide coding sequences for WNV-specific antibodies derived from single cells, we revealed the ontogeny of expanded WNV-specific clones in the repertoires of recently infected subjects through NGS and bioinformatic analysis. This analysis also indicated that the humoral response to WNV did not depend on an anamnestic response, due to an unlikely previous exposure to the virus. The innovative and integrative approach presented here to analyze the evolution of neutralizing antibodies from natural infection on a single-cell and repertoire level can also be applied to vaccine studies, and could potentially aid the development of therapeutic antibodies and our basic understanding of other infectious diseases. PMID:26481611
Neutralizing antibodies against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing.

PubMed

Tsioris, Konstantinos; Gupta, Namita T; Ogunniyi, Adebola O; Zimnisky, Ross M; Qian, Feng; Yao, Yi; Wang, Xiaomei; Stern, Joel N H; Chari, Raj; Briggs, Adrian W; Clouser, Christopher R; Vigneault, Francois; Church, George M; Garcia, Melissa N; Murray, Kristy O; Montgomery, Ruth R; Kleinstein, Steven H; Love, J Christopher

2015-12-01

West Nile virus (WNV) infection is an emerging mosquito-borne disease that can lead to severe neurological illness and currently has no available treatment or vaccine. Using microengraving, an integrated single-cell analysis method, we analyzed a cohort of subjects infected with WNV - recently infected and post-convalescent subjects - and efficiently identified four novel WNV neutralizing antibodies. We also assessed the humoral response to WNV on a single-cell and repertoire level by integrating next generation sequencing (NGS) into our analysis. The results from single-cell analysis indicate persistence of WNV-specific memory B cells and antibody-secreting cells in post-convalescent subjects. These cells exhibited class-switched antibody isotypes. Furthermore, the results suggest that the antibody response itself does not predict the clinical severity of the disease (asymptomatic or symptomatic). Using the nucleotide coding sequences for WNV-specific antibodies derived from single cells, we revealed the ontogeny of expanded WNV-specific clones in the repertoires of recently infected subjects through NGS and bioinformatic analysis. This analysis also indicated that the humoral response to WNV did not depend on an anamnestic response, due to an unlikely previous exposure to the virus. The innovative and integrative approach presented here to analyze the evolution of neutralizing antibodies from natural infection on a single-cell and repertoire level can also be applied to vaccine studies, and could potentially aid the development of therapeutic antibodies and our basic understanding of other infectious diseases.
The MHOST finite element program: 3-D inelastic analysis methods for hot section components. Volume 3: Systems' manual

NASA Technical Reports Server (NTRS)

Nakazawa, Shohei

1989-01-01

The internal structure is discussed of the MHOST finite element program designed for 3-D inelastic analysis of gas turbine hot section components. The computer code is the first implementation of the mixed iterative solution strategy for improved efficiency and accuracy over the conventional finite element method. The control structure of the program is covered along with the data storage scheme and the memory allocation procedure and the file handling facilities including the read and/or write sequences.

The low information content of Neurospora splicing signals: implications for RNA splicing and intron origin.

PubMed

Collins, Richard A; Stajich, Jason E; Field, Deborah J; Olive, Joan E; DeAbreu, Diane M

2015-05-01

When we expressed a small (0.9 kb) nonprotein-coding transcript derived from the mitochondrial VS plasmid in the nucleus of Neurospora we found that it was efficiently spliced at one or more of eight 5' splice sites and ten 3' splice sites, which are present apparently by chance in the sequence. Further experimental and bioinformatic analyses of other mitochondrial plasmids, random sequences, and natural nuclear genes in Neurospora and other fungi indicate that fungal spliceosomes recognize a wide range of 5' splice site and branchpoint sequences and predict introns to be present at high frequency in random sequence. In contrast, analysis of intronless fungal nuclear genes indicates that branchpoint, 5' splice site and 3' splice site consensus sequences are underrepresented compared with random sequences. This underrepresentation of splicing signals is sufficient to deplete the nuclear genome of splice sites at locations that do not comprise biologically relevant introns. Thus, the splicing machinery can recognize a wide range of splicing signal sequences, but splicing still occurs with great accuracy, not because the splicing machinery distinguishes correct from incorrect introns, but because incorrect introns are substantially depleted from the genome. © 2015 Collins et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing

NASA Astrophysics Data System (ADS)

Ferreira, Pedro G.; Oti, Martin; Barann, Matthias; Wieland, Thomas; Ezquina, Suzana; Friedländer, Marc R.; Rivas, Manuel A.; Esteve-Codina, Anna; Estivill, Xavier; Guigó, Roderic; Dermitzakis, Emmanouil; Antonarakis, Stylianos; Meitinger, Thomas; Strom, Tim M.; Palotie, Aarno; François Deleuze, Jean; Sudbrak, Ralf; Lerach, Hans; Gut, Ivo; Syvänen, Ann-Christine; Gyllensten, Ulf; Schreiber, Stefan; Rosenstiel, Philip; Brunner, Han; Veltman, Joris; Hoen, Peter A. C. T.; Jan van Ommen, Gert; Carracedo, Angel; Brazma, Alvis; Flicek, Paul; Cambon-Thomsen, Anne; Mangion, Jonathan; Bentley, David; Hamosh, Ada; Rosenstiel, Philip; Strom, Tim M.; Lappalainen, Tuuli; Guigó, Roderic; Sammeth, Michael

2016-09-01

Recent advances in the cost-efficiency of sequencing technologies enabled the combined DNA- and RNA-sequencing of human individuals at the population-scale, making genome-wide investigations of the inter-individual genetic impact on gene expression viable. Employing mRNA-sequencing data from the Geuvadis Project and genome sequencing data from the 1000 Genomes Project we show that the computational analysis of DNA sequences around splice sites and poly-A signals is able to explain several observations in the phenotype data. In contrast to widespread assessments of statistically significant associations between DNA polymorphisms and quantitative traits, we developed a computational tool to pinpoint the molecular mechanisms by which genetic markers drive variation in RNA-processing, cataloguing and classifying alleles that change the affinity of core RNA elements to their recognizing factors. The in silico models we employ further suggest RNA editing can moonlight as a splicing-modulator, albeit less frequently than genomic sequence diversity. Beyond existing annotations, we demonstrate that the ultra-high resolution of RNA-Seq combined from 462 individuals also provides evidence for thousands of bona fide novel elements of RNA processing—alternative splice sites, introns, and cleavage sites—which are often rare and lowly expressed but in other characteristics similar to their annotated counterparts.
Efficient use of unlabeled data for protein sequence classification: a comparative study.

PubMed

Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

2009-04-29

Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.
Analysis of levels of support and resonance demonstrated by an elite singing teacher

NASA Astrophysics Data System (ADS)

Scherer, Ronald C.; Radhakrishnan, Nandhakumar; Poulimenos, Andreas

2003-04-01

This was a study of levels of singing expertise demonstrated by an elite operatic singer and teacher. This approach may prove advantageous because the teacher demonstrates what he thinks is important, not what the nonsinging scientist thinks should be important. Two pedagogical sequences were studied: (1) the location of support-glottis (poor), chest (better), abdomen (best); (2) locations of resonance-hard palate/straight tone (poor), mouth (better), sinus/head (best). Measures were obtained for a single frequency (196 Hz), the vowel /ae/, and for mezzo-forte loudness using the /pae pae pae/ technique. Sequence differences: The support sequence was characterized by formant frequency lowering suggestive of vocal tract lengthening. The resonance sequence was characterized by flow (AC, mean flow) and abduction increases. Sequence similarities: The best locations had the widest F2 bandwidths. The better and best locations had the largest dB difference between F2 and F3. Although acoustic power increased through the sequences, the acoustic efficiency was not a discriminating factor. Open and speed quotients were not differentiating. The flow resistance was highest and aerodynamic power the lowest for the first of each sequence. Combined data: The maximum flow declination rate correlated highly with the AC flow (r=-0.92) and SPL (r=0.901).
Phenotypic and genomic analysis of serotype 3 Sabin poliovirus vaccine produced in MRC-5 cell substrate.

PubMed

Alirezaie, Behnam; Taqavian, Mohammad; Aghaiypour, Khosrow; Esna-Ashari, Fatemeh; Shafyi, Abbas

2011-05-01

The cell substrate has a pivotal role in live virus vaccines production. It is necessary to evaluate the effects of the cell substrate on the properties of the propagated viruses, especially in the case of viruses which are unstable genetically such as polioviruses, by monitoring the molecular and phenotypical characteristics of harvested viruses. To investigate the presence/absence of mutation(s), the near full-length genomic sequence of different harvests of the type 3 Sabin strain of poliovirus propagated in MRC-5 cells were determined. The sequences were compared with genomic sequences of different virus seeds, vaccines, and OPV-like isolates. Nearly complete genomic sequencing results, however, revealed no detectable mutations throughout the genome RNA-plaque purified (RSO)-derived monopool of type 3 OPVs manufactured in MRC-5. Thirty-six years of experience in OPV production, trend analysis, and vaccine surveillance also suggest that: (i) different monopools of serotype 3 OPV produced in MRC-5 retained their phenotypic characteristics (temperature sensitivity and neuroattenuation), (ii) MRC-5 cells support the production of acceptable virus yields, (iii) OPV replicated in the MRC-5 cell substrate is a highly efficient and safe vaccine. These results confirm previous reports that MRC-5 is a desirable cell substrate for the production of OPV. Copyright © 2011 Wiley-Liss, Inc.
[Bacterial diversity in sequencing batch biofilm reactor (SBBR) for landfill leachate treatment using PCR-DGGE].

PubMed

Xiao, Yong; Yang, Zhao-hui; Zeng, Guang-ming; Ma, Yan-he; Liu, You-sheng; Wang, Rong-juan; Xu, Zheng-yong

2007-05-01

For studying the bacterial diversity and the mechanism of denitrification in sequencing bath biofilm reactor (SBBR) treating landfill leachate to provide microbial evidence for technique improvements, total microbial DNA was extracted from samples which were collected from natural landfill leachate and biofilm of a SBBR that could efficiently remove NH4+ -N and COD of high concentration. 16S rDNA fragments were amplified from the total DNA successfully using a pair of universal bacterial 16S rDNA primer, GC341F and 907R, and then were used for denaturing gradient gel electrophoresis (DGGE) analysis. The bands in the gel were analyzed by statistical methods and excided from the gel for sequencing, and the sequences were used for homology analysis and then two phylogenetic trees were constructed using DNAStar software. Results indicated that the bacterial diversity of the biofilm in SBBR and the landfill leachate was abundant, and no obvious change of community structure happened during running in the biofilm, in which most bacteria came from the landfill leachate. There may be three different modes of denitrification in the reactor because several different nitrifying bacteria, denitrifying bacteria and anaerobic ammonia oxidation bacteria coexisted in it. The results provided some valuable references for studying microbiological mechanism of denitrification in SBBR.
CEQer: A Graphical Tool for Copy Number and Allelic Imbalance Detection from Whole-Exome Sequencing Data

PubMed Central

Piazza, Rocco; Magistroni, Vera; Pirola, Alessandra; Redaelli, Sara; Spinelli, Roberta; Redaelli, Serena; Galbiati, Marta; Valletta, Simona; Giudici, Giovanni; Cazzaniga, Giovanni; Gambacorti-Passerini, Carlo

2013-01-01

Copy number alterations (CNA) are common events occurring in leukaemias and solid tumors. Comparative Genome Hybridization (CGH) is actually the gold standard technique to analyze CNAs; however, CGH analysis requires dedicated instruments and is able to perform only low resolution Loss of Heterozygosity (LOH) analyses. Here we present CEQer (Comparative Exome Quantification analyzer), a new graphical, event-driven tool for CNA/allelic-imbalance (AI) coupled analysis of exome sequencing data. By using case-control matched exome data, CEQer performs a comparative digital exonic quantification to generate CNA data and couples this information with exome-wide LOH and allelic imbalance detection. This data is used to build mixed statistical/heuristic models allowing the identification of CNA/AI events. To test our tool, we initially used in silico generated data, then we performed whole-exome sequencing from 20 leukemic specimens and corresponding matched controls and we analyzed the results using CEQer. Taken globally, these analyses showed that the combined use of comparative digital exon quantification and LOH/AI allows generating very accurate CNA data. Therefore, we propose CEQer as an efficient, robust and user-friendly graphical tool for the identification of CNA/AI in the context of whole-exome sequencing data. PMID:24124457
Genetic analysis of a Chinese family with members affected with Usher syndrome type II and Waardenburg syndrome type IV.

PubMed

Wang, Xueling; Lin, Xiao-Jiang; Tang, Xiangrong; Chai, Yong-Chuan; Yu, De-Hong; Chen, Dong-Ye; Wu, Hao

2017-11-01

The purpose of this study was to identify the genetic causes of a family presenting with multiple symptoms overlapping Usher syndrome type II (USH2) and Waardenburg syndrome type IV (WS4). Targeted next-generation sequencing including the exon and flanking intron sequences of 79 deafness genes was performed on the proband. Co-segregation of the disease phenotype and the detected variants were confirmed in all family members by PCR amplification and Sanger sequencing. The affected members of this family had two different recessive disorders, USH2 and WS4. By targeted next-generation sequencing, we identified that USH2 was caused by a novel missense mutation, p.V4907D in GPR98; whereas WS4 due to p.V185M in EDNRB. This is the first report of homozygous p.V185M mutation in EDNRB in patient with WS4. This study reported a Chinese family with multiple independent and overlapping phenotypes. In condition, molecular level analysis was efficient to identify the causative variant p.V4907D in GPR98 and p.V185M in EDNRB, also was helpful to confirm the clinical diagnosis of USH2 and WS4. Copyright © 2017 Elsevier B.V. All rights reserved.
Isolation and characterization of Pseudomonas aeruginosa strain SJTD-2 for degrading long-chain n-alkanes and crude oil.

PubMed

Xu, Jing; Liu, Huan; Liu, Jianhua; Liang, Rubing

2015-06-04

Oil pollution poses a severe threat to ecosystems, and bioremediation is considered as a safe and efficient alternative to physicochemical. for eliminating this contaminant. In this study, a gram-negative bacteria strain SJTD-2 isolated from oil-contaminated soil was found capable of utilizing n-alkanes and crude oil as sole energy sources. The efficiency of this strain in degrading these pollutants was analyzed. Strain SJTD-2 was identified on the basis of its phenotype, its physiological features, and a comparative genetic analysis using 16S rRNA sequence. Growth of strain SJTD-2 with different carbon sources (n-alkanes of different lengths and crude oil) was assessed, and the gas chromatography-mass spectrometry method was used to analyze the degradation efficiency of strain SJTD-2 for n-alkanes and petroleum by detecting the residual n-alkane concentrations. Strain SJTD-2 was identified as Pseudomonas aeruginosa based on the phenotype, physiological features, and 16S rRNA sequence analysis. This strain can efficiently decompose medium-chain and long-chain n-alkanes (C10-C26), and petroleum as its sole carbon sources. It preferred the long-chain n-alkanes (C18-C22), and n-docosane was considered as the best carbon source for its growth. In 48 h, 500 mg/L n-docosane could be degraded completely, and 2 g/L n-docosane was decomposed to undetectable levels within 72 h. Moreover, strain SJTD-2 could utilize about 88% of 2 g/L crude oil in 7days. Compared with other alkane-utilizing strains, strain SJTD-2 showed outstanding degradation efficiency for long-chain n-alkanes and high tolerance to petroleum at elevated concentrations. The isolation and characterization of strain SJTD-2 would help researchers study the mechanisms underlying the biodegradation of n-alkanes, and this strain could be used as a potential strain for environmental governance and soil bioremediation.
Knock-in reporter mice demonstrate that DNA repair by non-homologous end joining declines with age.

PubMed

Vaidya, Amita; Mao, Zhiyong; Tian, Xiao; Spencer, Brianna; Seluanov, Andrei; Gorbunova, Vera

2014-07-01

Accumulation of genome rearrangements is a characteristic of aged tissues. Since genome rearrangements result from faulty repair of DNA double strand breaks (DSBs), we hypothesized that DNA DSB repair becomes less efficient with age. The Non-Homologous End Joining (NHEJ) pathway repairs a majority of DSBs in vertebrates. To examine age-associated changes in NHEJ, we have generated an R26NHEJ mouse model in which a GFP-based NHEJ reporter cassette is knocked-in to the ROSA26 locus. In this model, NHEJ repair of DSBs generated by the site-specific endonuclease, I-SceI, reconstitutes a functional GFP gene. In this system NHEJ efficiency can be compared across tissues of the same mouse and in mice of different age. Using R26NHEJ mice, we found that NHEJ efficiency was higher in the skin, lung, and kidney fibroblasts, and lower in the heart fibroblasts and brain astrocytes. Furthermore, we observed that NHEJ efficiency declined with age. In the 24-month old animals compared to the 5-month old animals, NHEJ efficiency declined 1.8 to 3.8-fold, depending on the tissue, with the strongest decline observed in the skin fibroblasts. The sequence analysis of 300 independent NHEJ repair events showed that, regardless of age, mice utilize microhomology sequences at a significantly higher frequency than expected by chance. Furthermore, the frequency of microhomology-mediated end joining (MMEJ) events increased in the heart and lung fibroblasts of old mice, suggesting that NHEJ becomes more mutagenic with age. In summary, our study provides a versatile mouse model for the analysis of NHEJ in a wide range of tissues and demonstrates that DNA repair by NHEJ declines with age in mice, which could provide a mechanism for age-related genomic instability and increased cancer incidence with age.
High-Resolution Analysis of the Efficiency, Heritability, and Editing Outcomes of CRISPR/Cas9-Induced Modifications of NCED4 in Lettuce (Lactuca sativa).

PubMed

Bertier, Lien D; Ron, Mily; Huo, Heqiang; Bradford, Kent J; Britt, Anne B; Michelmore, Richard W

2018-05-04

CRISPR/Cas9 is a transformative tool for making targeted genetic alterations. In plants, high mutation efficiencies have been reported in primary transformants. However, many of the mutations analyzed were somatic and therefore not heritable. To provide more insights into the efficiency of creating stable homozygous mutants using CRISPR/Cas9, we targeted LsNCED4 ( 9-cis-EPOXYCAROTENOID DIOXYGENASE4) , a gene conditioning thermoinhibition of seed germination in lettuce. Three constructs, each capable of expressing Cas9 and a single gRNA targeting different sites in LsNCED4 , were stably transformed into lettuce (Lactuca sativa) cvs. Salinas and Cobham Green. Analysis of 47 primary transformants (T 1 ) and 368 T 2 plants by deep amplicon sequencing revealed that 57% of T 1 plants contained events at the target site: 28% of plants had germline mutations in one allele indicative of an early editing event (mono-allelic), 8% of plants had germline mutations in both alleles indicative of two early editing events (bi-allelic), and the remaining 21% of plants had multiple low frequency mutations indicative of late events (chimeric plants). Editing efficiency was similar in both genotypes, while the different gRNAs varied in efficiency. Amplicon sequencing of 20 T 1 and more than 100 T 2 plants for each of the three gRNAs showed that repair outcomes were not random, but reproducible and characteristic for each gRNA. Knockouts of NCED4 resulted in large increases in the maximum temperature for seed germination, with seeds of both cultivars capable of germinating >70% at 37°. Knockouts of NCED4 provide a whole-plant selectable phenotype that has minimal pleiotropic consequences. Targeting NCED4 in a co-editing strategy could therefore be used to enrich for germline-edited events simply by germinating seeds at high temperature. Copyright © 2018 Bertier et al.
High-Resolution Analysis of the Efficiency, Heritability, and Editing Outcomes of CRISPR/Cas9-Induced Modifications of NCED4 in Lettuce (Lactuca sativa)

PubMed Central

Bertier, Lien D.; Ron, Mily; Huo, Heqiang; Bradford, Kent J.; Britt, Anne B.; Michelmore, Richard W.

2018-01-01

CRISPR/Cas9 is a transformative tool for making targeted genetic alterations. In plants, high mutation efficiencies have been reported in primary transformants. However, many of the mutations analyzed were somatic and therefore not heritable. To provide more insights into the efficiency of creating stable homozygous mutants using CRISPR/Cas9, we targeted LsNCED4 (9-cis-EPOXYCAROTENOID DIOXYGENASE4), a gene conditioning thermoinhibition of seed germination in lettuce. Three constructs, each capable of expressing Cas9 and a single gRNA targeting different sites in LsNCED4, were stably transformed into lettuce (Lactuca sativa) cvs. Salinas and Cobham Green. Analysis of 47 primary transformants (T1) and 368 T2 plants by deep amplicon sequencing revealed that 57% of T1 plants contained events at the target site: 28% of plants had germline mutations in one allele indicative of an early editing event (mono-allelic), 8% of plants had germline mutations in both alleles indicative of two early editing events (bi-allelic), and the remaining 21% of plants had multiple low frequency mutations indicative of late events (chimeric plants). Editing efficiency was similar in both genotypes, while the different gRNAs varied in efficiency. Amplicon sequencing of 20 T1 and more than 100 T2 plants for each of the three gRNAs showed that repair outcomes were not random, but reproducible and characteristic for each gRNA. Knockouts of NCED4 resulted in large increases in the maximum temperature for seed germination, with seeds of both cultivars capable of germinating >70% at 37°. Knockouts of NCED4 provide a whole-plant selectable phenotype that has minimal pleiotropic consequences. Targeting NCED4 in a co-editing strategy could therefore be used to enrich for germline-edited events simply by germinating seeds at high temperature. PMID:29511025
Whole-genome sequencing of Bacillus subtilis XF-1 reveals mechanisms for biological control and multiple beneficial properties in plants.

PubMed

Guo, Shengye; Li, Xingyu; He, Pengfei; Ho, Honhing; Wu, Yixin; He, Yueqiu

2015-06-01

Bacillus subtilis XF-1 is a gram-positive, plant-associated bacterium that stimulates plant growth and produces secondary metabolites that suppress soil-borne plant pathogens. In particular, it is especially highly efficient at controlling the clubroot disease of cruciferous crops. Its 4,061,186-bp genome contains an estimated 3853 protein-coding sequences and the 1155 genes of XF-1 are present in most genome-sequenced Bacillus strains: 3757 genes in B. subtilis 168, and 1164 in B. amyloliquefaciens FZB42. Analysis using the Cluster of Orthologous Groups database of proteins shows that 60 genes control bacterial mobility, 221 genes are related to cell wall and membrane biosynthesis, and more than 112 are genes associated with secondary metabolites. In addition, the genes contributed to the strain's plant colonization, bio-control and stimulation of plant growth. Sequencing of the genome is a fundamental step for developing a desired strain to serve as an efficient biological control agent and plant growth stimulator. Similar to other members of the taxon, XF-1 has a genome that contains giant gene clusters for the non-ribosomal synthesis of antifungal lipopeptides (surfactin and fengycin), the polyketides (macrolactin and bacillaene), the siderophore bacillibactin, and the dipeptide bacilysin. There are two synthesis pathways for volatile growth-promoting compounds. The expression of biosynthesized antibiotic peptides in XF-1 was revealed by matrix-assisted laser desorption/ionization-time of flight mass spectrometry.
zUMIs - A fast and flexible pipeline to process RNA sequencing data with UMIs.

PubMed

Parekh, Swati; Ziegenhain, Christoph; Vieth, Beate; Enard, Wolfgang; Hellmann, Ines

2018-06-01

Single-cell RNA-sequencing (scRNA-seq) experiments typically analyze hundreds or thousands of cells after amplification of the cDNA. The high throughput is made possible by the early introduction of sample-specific bar codes (BCs), and the amplification bias is alleviated by unique molecular identifiers (UMIs). Thus, the ideal analysis pipeline for scRNA-seq data needs to efficiently tabulate reads according to both BC and UMI. zUMIs is a pipeline that can handle both known and random BCs and also efficiently collapse UMIs, either just for exon mapping reads or for both exon and intron mapping reads. If BC annotation is missing, zUMIs can accurately detect intact cells from the distribution of sequencing reads. Another unique feature of zUMIs is the adaptive downsampling function that facilitates dealing with hugely varying library sizes but also allows the user to evaluate whether the library has been sequenced to saturation. To illustrate the utility of zUMIs, we analyzed a single-nucleus RNA-seq dataset and show that more than 35% of all reads map to introns. Also, we show that these intronic reads are informative about expression levels, significantly increasing the number of detected genes and improving the cluster resolution. zUMIs flexibility makes if possible to accommodate data generated with any of the major scRNA-seq protocols that use BCs and UMIs and is the most feature-rich, fast, and user-friendly pipeline to process such scRNA-seq data.
Onco-STS: a web-based laboratory information management system for sample and analysis tracking in oncogenomic experiments.

PubMed

Gavrielides, Mike; Furney, Simon J; Yates, Tim; Miller, Crispin J; Marais, Richard

2014-01-01

Whole genomes, whole exomes and transcriptomes of tumour samples are sequenced routinely to identify the drivers of cancer. The systematic sequencing and analysis of tumour samples, as well other oncogenomic experiments, necessitates the tracking of relevant sample information throughout the investigative process. These meta-data of the sequencing and analysis procedures include information about the samples and projects as well as the sequencing centres, platforms, data locations, results locations, alignments, analysis specifications and further information relevant to the experiments. The current work presents a sample tracking system for oncogenomic studies (Onco-STS) to store these data and make them easily accessible to the researchers who work with the samples. The system is a web application, which includes a database and a front-end web page that allows the remote access, submission and updating of the sample data in the database. The web application development programming framework Grails was used for the development and implementation of the system. The resulting Onco-STS solution is efficient, secure and easy to use and is intended to replace the manual data handling of text records. Onco-STS allows simultaneous remote access to the system making collaboration among researchers more effective. The system stores both information on the samples in oncogenomic studies and details of the analyses conducted on the resulting data. Onco-STS is based on open-source software, is easy to develop and can be modified according to a research group's needs. Hence it is suitable for laboratories that do not require a commercial system.
A generative, probabilistic model of local protein structure.

PubMed

Boomsma, Wouter; Mardia, Kanti V; Taylor, Charles C; Ferkinghoff-Borg, Jesper; Krogh, Anders; Hamelryck, Thomas

2008-07-01

Despite significant progress in recent years, protein structure prediction maintains its status as one of the prime unsolved problems in computational biology. One of the key remaining challenges is an efficient probabilistic exploration of the structural space that correctly reflects the relative conformational stabilities. Here, we present a fully probabilistic, continuous model of local protein structure in atomic detail. The generative model makes efficient conformational sampling possible and provides a framework for the rigorous analysis of local sequence-structure correlations in the native state. Our method represents a significant theoretical and practical improvement over the widely used fragment assembly technique by avoiding the drawbacks associated with a discrete and nonprobabilistic approach.
Identification of forensic samples by using an infrared-based automatic DNA sequencer.

PubMed

Ricci, Ugo; Sani, Ilaria; Klintschar, Michael; Cerri, Nicoletta; De Ferrari, Francesco; Giovannucci Uzielli, Maria Luisa

2003-06-01

We have recently introduced a new protocol for analyzing all core loci of the Federal Bureau of Investigation's (FBI) Combined DNA Index System (CODIS) with an infrared (IR) automatic DNA sequencer (LI-COR 4200). The amplicons were labeled with forward oligonucleotide primers, covalently linked to a new infrared fluorescent molecule (IRDye 800). The alleles were displayed as familiar autoradiogram-like images with real-time detection. This protocol was employed for paternity testing, population studies, and identification of degraded forensic samples. We extensively analyzed some simulated forensic samples and mixed stains (blood, semen, saliva, bones, and fixed archival embedded tissues), comparing the results with donor samples. Sensitivity studies were also performed for the four multiplex systems. Our results show the efficiency, reliability, and accuracy of the IR system for the analysis of forensic samples. We also compared the efficiency of the multiplex protocol with ultraviolet (UV) technology. Paternity tests, undegraded DNA samples, and real forensic samples were analyzed with this approach based on IR technology and with UV-based automatic sequencers in combination with commercially-available kits. The comparability of the results with the widespread UV methods suggests that it is possible to exchange data between laboratories using the same core group of markers but different primer sets and detection methods.
Characterization of efficient aerobic denitrifiers isolated from two different sequencing batch reactors by 16S-rRNA analysis.

PubMed

Wang, Ping; Li, Xiuting; Xiang, Mufei; Zhai, Qian

2007-06-01

By adopting two sequencing batch reactors (SBRs) A and B, nitrate as the substrate, and the intermittent aeration mode, activated sludge was domesticated to enrich aerobic denitrifiers. The pHs of reactor A were approximately 6.3 at DOs 2.2-6.1 mg/l for a carbon source of 720 mg/l COD; the pHs of reactor B were 6.8-7.8 at DOs 2.2-3.0 mg/l for a carbon source of 1500 mg/l COD. Both reactors maintained an influent nitrate concentration of 80 mg/l NO3- -N. When the total inorganic nitrogen (TIN) removal efficiency of both reactors reached 60%, aerobic denitrifier accumulation was regarded completed. By bromthymol blue (BTB) medium, 20 bacteria were isolated from the two SBRs and DNA samples of 8 of these 20 strains were amplified by PCR and processed for 16SrRNA sequencing. The obtained results were analysed by a Blast similarity search of the GenBank database, and constructing a phylogenetic tree for identification by comparison. The 8 bacteria were found to belong to the genera Pseudomonas, Delftia, Herbaspirillum and Comamonas. At present, no Delftia has been reported to be an aerobic denitrifier.
Efficient sequence-specific isolation of DNA fragments and chromatin by in vitro enChIP technology using recombinant CRISPR ribonucleoproteins.

PubMed

Fujita, Toshitsugu; Yuno, Miyuki; Fujii, Hodaka

2016-04-01

The clustered regularly interspaced short palindromic repeats (CRISPR) system is widely used for various biological applications, including genome editing. We developed engineered DNA-binding molecule-mediated chromatin immunoprecipitation (enChIP) using CRISPR to isolate target genomic regions from cells for their biochemical characterization. In this study, we developed 'in vitro enChIP' using recombinant CRISPR ribonucleoproteins (RNPs) to isolate target genomic regions. in vitro enChIP has the great advantage over conventional enChIP of not requiring expression of CRISPR complexes in cells. We first showed that in vitro enChIP using recombinant CRISPR RNPs can be used to isolate target DNA from mixtures of purified DNA in a sequence-specific manner. In addition, we showed that this technology can be used to efficiently isolate target genomic regions, while retaining their intracellular molecular interactions, with negligible contamination from irrelevant genomic regions. Thus, in vitro enChIP technology is of potential use for sequence-specific isolation of DNA, as well as for identification of molecules interacting with genomic regions of interest in vivo in combination with downstream analysis. © 2016 The Authors. Genes to Cells published by Molecular Biology Society of Japan and John Wiley & Sons Australia, Ltd.
DNA barcoding of human-biting black flies (Diptera: Simuliidae) in Thailand.

PubMed

Pramual, Pairot; Thaijarern, Jiraporn; Wongpakam, Komgrit

2016-12-01

Black flies (Diptera: Simuliidae) are important insect vectors and pests of humans and animals. Accurate identification, therefore, is important for control and management. In this study, we used mitochondrial cytochrome oxidase I (COI) barcoding sequences to test the efficiency of species identification for the human-biting black flies in Thailand. We used human-biting specimens because they enabled us to link information with previous studies involving the immature stages. Three black fly taxa, Simulium nodosum, S. nigrogilvum and S. doipuiense complex, were collected. The S. doipuiense complex was confirmed for the first time as having human-biting habits. The COI sequences revealed considerable genetic diversity in all three species. Comparisons to a COI sequence library of black flies in Thailand and in a public database indicated a high efficiency for specimen identification for S. nodosum and S. nigrogilvum, but this method was not successful for the S. doipuiense complex. Phylogenetic analyses revealed two divergent lineages in the S. doipuiense complex. Human-biting specimens formed a separate clade from other members of this complex. The results are consistent with the Barcoding Index Number System (BINs) analysis that found six BINs in the S. doipuiense complex. Further taxonomic work is needed to clarify the species status of these human-biting specimens. Copyright © 2016 Elsevier B.V. All rights reserved.

Efficient theory of dipolar recoupling in solid-state nuclear magnetic resonance of rotating solids using Floquet-Magnus expansion: application on BABA and C7 radiofrequency pulse sequences.

PubMed

Mananga, Eugene S; Reid, Alicia E; Charpentier, Thibault

2012-02-01

This article describes the use of an alternative expansion scheme called Floquet-Magnus expansion (FME) to study the dynamics of spin system in solid-state NMR. The main tool used to describe the effect of time-dependent interactions in NMR is the average Hamiltonian theory (AHT). However, some NMR experiments, such as sample rotation and pulse crafting, seem to be more conveniently described using the Floquet theory (FT). Here, we present the first report highlighting the basics of the Floquet-Magnus expansion (FME) scheme and hint at its application on recoupling sequences that excite more efficiently double-quantum coherences, namely BABA and C7 radiofrequency pulse sequences. The use of Λ(n)(t) functions available only in the FME scheme, allows the comparison of the efficiency of BABA and C7 sequences. Copyright © 2011 Elsevier Inc. All rights reserved.
Efficient theory of dipolar recoupling in–solid state nuclear magnetic resonance of rotating solids using Floquet-Magnus expansion: Application on BABA and C7 radiofrequency pulse sequences

PubMed Central

Reid, Alicia E.; Charpentier, Thibault

2013-01-01

This article describes the use of an alternative expansion scheme called Floquet-Magnus expansion (FME) to study the dynamics of spin system in solid-state NMR. The main tool used to describe the effect of time-dependent interactions in NMR is the average Hamiltonian theory (AHT). However, some NMR experiments, such as sample rotation and pulse crafting, seem to be more conveniently described using the Floquet theory (FT). Here, we present the first report highlighting the basics of the Floquet-Magnus expansion (FME) scheme and hint at its application on recoupling sequences that excite more efficiently double-quantum coherences, namely BABA and C7 radiofrequency pulse sequences. The use of Λn(t) functions available only in the FME scheme, allows the comparison of the efficiency of BABA and C7 sequences. PMID:22197191
VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs

USDA-ARS?s Scientific Manuscript database

Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived siRNAs has proven to be a highly efficient approach for virus discovery. However, to date no computational tools specifically designed for both k...
Molecular Identification of Date Palm Cultivars Using Random Amplified Polymorphic DNA (RAPD) Markers.

PubMed

Al-Khalifah, Nasser S; Shanavaskhan, A E

2017-01-01

Ambiguity in the total number of date palm cultivars across the world is pointing toward the necessity for an enumerative study using standard morphological and molecular markers. Among molecular markers, DNA markers are more suitable and ubiquitous to most applications. They are highly polymorphic in nature, frequently occurring in genomes, easy to access, and highly reproducible. Various molecular markers such as restriction fragment length polymorphism (RFLP), amplified fragment length polymorphism (AFLP), simple sequence repeats (SSR), inter-simple sequence repeats (ISSR), and random amplified polymorphic DNA (RAPD) markers have been successfully used as efficient tools for analysis of genetic variation in date palm. This chapter explains a stepwise protocol for extracting total genomic DNA from date palm leaves. A user-friendly protocol for RAPD analysis and a table showing the primers used in different molecular techniques that produce polymorphisms in date palm are also provided.
Characterization of a filamentous virus from Bermuda grass and its molecular, serological and biological comparison with Spartina mottle virus.

PubMed

Hosseini, A; Koohi Habibi, M; Izadpanah, K; Mosahebi, G H; Rubies-Autonell, C; Ratti, C

2010-10-01

Bermuda grass with mosaic symptoms have been found in many parts of Iran. No serological correlation was observed between two isolates of this filamentous virus and any of the members of the family Potyviridae that were tested. Aphid transmission was demonstrated at low efficiency for isolates of this virus, whereas no transmission through seed was observed. A DNA fragment corresponding to the 3' end of the viral genome of these two isolates from Iran and one isolate from Italy was amplified and sequenced. A BLAST search showed that these isolates are more closely related to Spartina mottle virus (SpMV) than to any other virus in the family Potyviridae. Specific serological assays confirmed the phylogenetic analysis. Sequence and phylogenetic analysis suggested that these isolates could be considered as divergent strains of SpMV in the proposed genus Sparmovirus.
Fast single-pass alignment and variant calling using sequencing data

USDA-ARS?s Scientific Manuscript database

Sequencing research requires efficient computation. Few programs use already known information about DNA variants when aligning sequence data to the reference map. New program findmap.f90 reads the previous variant list before aligning sequence, calling variant alleles, and summing the allele counts...
On avoided words, absent words, and their application to biological sequence analysis.

PubMed

Almirantis, Yannis; Charalampopoulos, Panagiotis; Gao, Jia; Iliopoulos, Costas S; Mohamed, Manal; Pissis, Solon P; Polychronopoulos, Dimitris

2017-01-01

The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided . This concept is particularly useful in DNA linguistic analysis. The value of the deviation of w , denoted by [Formula: see text], effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length [Formula: see text] is a [Formula: see text]-avoided word in x if [Formula: see text], for a given threshold [Formula: see text]. Notice that such a word may be completely absent from x . Hence, computing all such words naïvely can be a very time-consuming procedure, in particular for large k . In this article, we propose an [Formula: see text]-time and [Formula: see text]-space algorithm to compute all [Formula: see text]-avoided words of length k in a given sequence of length n over a fixed-sized alphabet. We also present a time-optimal [Formula: see text]-time algorithm to compute all [Formula: see text]-avoided words (of any length) in a sequence of length n over an integer alphabet of size [Formula: see text]. In addition, we provide a tight asymptotic upper bound for the number of [Formula: see text]-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency and applicability of our implementation in biological sequence analysis. The systematic search for avoided words is particularly useful for biological sequence analysis. We present a linear-time and linear-space algorithm for the computation of avoided words of length k in a given sequence x . We suggest a modification to this algorithm so that it computes all avoided words of x , irrespective of their length, within the same time complexity. We also present combinatorial results with regards to avoided words and absent words.
Chromatin accessibility and guide sequence secondary structure affect CRISPR-Cas9 gene editing efficiency.

PubMed

Jensen, Kristopher Torp; Fløe, Lasse; Petersen, Trine Skov; Huang, Jinrong; Xu, Fengping; Bolund, Lars; Luo, Yonglun; Lin, Lin

2017-07-01

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-associated protein 9 (CRISPR-Cas9) systems have emerged as the method of choice for genome editing, but large variations in on-target efficiencies continue to limit their applicability. Here, we investigate the effect of chromatin accessibility on Cas9-mediated gene editing efficiency for 20 gRNAs targeting 10 genomic loci in HEK293T cells using both SpCas9 and the eSpCas9(1.1) variant. Our study indicates that gene editing is more efficient in euchromatin than in heterochromatin, and we validate this finding in HeLa cells and in human fibroblasts. Furthermore, we investigate the gRNA sequence determinants of CRISPR-Cas9 activity using a surrogate reporter system and find that the efficiency of Cas9-mediated gene editing is dependent on guide sequence secondary structure formation. This knowledge can aid in the further improvement of tools for gRNA design. © 2017 Federation of European Biochemical Societies.
Transient effects in π-pulse sequences in MAS solid-state NMR

NASA Astrophysics Data System (ADS)

Hellwagner, Johannes; Wili, Nino; Ibáñez, Luis Fábregas; Wittmann, Johannes J.; Meier, Beat H.; Ernst, Matthias

2018-02-01

Dipolar recoupling techniques that use isolated rotor-synchronized π pulses are commonly used in solid-state NMR spectroscopy to gain insight into the structure of biological molecules. These sequences excel through their simplicity, stability towards radio-frequency (rf) inhomogeneity, and low rf requirements. For a theoretical understanding of such sequences, we present a Floquet treatment based on an interaction-frame transformation including the chemical-shift offset dependence. This approach is applied to the homonuclear dipolar-recoupling sequence Radio-Frequency Driven Recoupling (RFDR) and the heteronuclear recoupling sequence Rotational Echo Double Resonance (REDOR). Based on the Floquet approach, we show the influence of effective fields caused by pulse transients and discuss the advantages of pulse-transient compensation. We demonstrate experimentally that the transfer efficiency for homonuclear recoupling can be doubled in some cases in model compounds as well as in simple peptides if pulse-transient compensation is applied to the π pulses. Additionally, we discuss the influence of various phase cycles on the recoupling efficiency in order to reduce the magnitude of effective fields. Based on the findings from RFDR, we are able to explain why the REDOR sequence does not suffer in the recoupling efficiency despite the presence of effective fields.
A two-step recognition of signal sequences determines the translocation efficiency of proteins.

PubMed Central

Belin, D; Bost, S; Vassalli, J D; Strub, K

1996-01-01

The cytosolic and secreted, N-glycosylated, forms of plasminogen activator inhibitor-2 (PAI-2) are generated by facultative translocation. To study the molecular events that result in the bi-topological distribution of proteins, we determined in vitro the capacities of several signal sequences to bind the signal recognition particle (SRP) during targeting, and to promote vectorial transport of murine PAI-2 (mPAI-2). Interestingly, the six signal sequences we compared (mPAI-2 and three mutated derivatives thereof, ovalbumin and preprolactin) were found to have the differential activities in the two events. For example, the mPAI-2 signal sequence first binds SRP with moderate efficiency and secondly promotes the vectorial transport of only a fraction of the SRP-bound nascent chains. Our results provide evidence that the translocation efficiency of proteins can be controlled by the recognition of their signal sequences at two steps: during SRP-mediated targeting and during formation of a committed translocation complex. This second recognition may occur at several time points during the insertion/translocation step. In conclusion, signal sequences have a more complex structure than previously anticipated, allowing for multiple and independent interactions with the translocation machinery. Images PMID:8599930
A two-step recognition of signal sequences determines the translocation efficiency of proteins.

PubMed

Belin, D; Bost, S; Vassalli, J D; Strub, K

1996-02-01

The cytosolic and secreted, N-glycosylated, forms of plasminogen activator inhibitor-2 (PAI-2) are generated by facultative translocation. To study the molecular events that result in the bi-topological distribution of proteins, we determined in vitro the capacities of several signal sequences to bind the signal recognition particle (SRP) during targeting, and to promote vectorial transport of murine PAI-2 (mPAI-2). Interestingly, the six signal sequences we compared (mPAI-2 and three mutated derivatives thereof, ovalbumin and preprolactin) were found to have the differential activities in the two events. For example, the mPAI-2 signal sequence first binds SRP with moderate efficiency and secondly promotes the vectorial transport of only a fraction of the SRP-bound nascent chains. Our results provide evidence that the translocation efficiency of proteins can be controlled by the recognition of their signal sequences at two steps: during SRP-mediated targeting and during formation of a committed translocation complex. This second recognition may occur at several time points during the insertion/translocation step. In conclusion, signal sequences have a more complex structure than previously anticipated, allowing for multiple and independent interactions with the translocation machinery.
Using Python Packages in 6D (Py)Ferret: EOF Analysis, OPeNDAP Sequence Data

NASA Astrophysics Data System (ADS)

Smith, K. M.; Manke, A.; Hankin, S. C.

2012-12-01

PyFerret was designed to provide the easy methods of access, analysis, and display of data found in the Ferret under the simple yet powerful Python scripting/programming language. This has enabled PyFerret to take advantage of a large and expanding collection of third-party scientific Python modules. Furthermore, ensemble and forecast axes have been added to Ferret and PyFerret for creating and working with collections of related data in Ferret's delayed-evaluation and minimal-data-access mode of operation. These axes simplify processing and visualization of these collections of related data. As one example, an empirical orthogonal function (EOF) analysis Python module was developed, taking advantage of the linear algebra module and other standard functionality in NumPy for efficient numerical array processing. This EOF analysis module is used in a Ferret function to provide an ensemble of levels of data explained by each EOF and Time Amplitude Function (TAF) product. Another example makes use of the PyDAP Python module to provide OPeNDAP sequence data for use in Ferret with minimal data access characteristic of Ferret.
Sulfo-NHS-SS-biotin derivatization: a versatile tool for MALDI mass analysis of PTMs in lysine-rich proteins.

PubMed

Markoutsa, Stavroula; Bahr, Ute; Papasotiriou, Dimitrios G; Häfner, Ann-Kathrin; Karas, Michael; Sorg, Bernd L

2014-03-01

The discovery of PTMs in proteins by MS requires nearly complete sequence coverage of the detected proteolytic peptides. Unfortunately, mass spectrometric analysis of the desired sequence fragments is often impeded due to low ionization efficiency and/or signal suppression in complex samples. When several lysine residues are in close proximity tryptic peptides may be too short for mass analysis. Moreover, modified peptides often appear in low stoichiometry and need to be enriched before analysis. We present here how the use of sulfo-NHS-SS-biotin derivatization of lysine side chain can help to detect PTMs in lysine-rich proteins. This label leads to a mass shift which can be adjusted by reduction of the SS bridge and alkylation with different reagents. Low intensity peptides can be enriched by use of streptavidin beads. Using this method, the functionally relevant protein kinase A phosphorylation site in 5-lipoxygenase was detected for the first time by MS. Additionally, methylation and acetylation could be unambiguously determined in histones. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Molecular studies in olive (Olea europaea L.): overview on DNA markers applications and recent advances in genome analysis.

PubMed

Bracci, T; Busconi, M; Fogher, C; Sebastiani, L

2011-04-01

Olive (Olea europaea L.) is one of the oldest agricultural tree crops worldwide and is an important source of oil with beneficial properties for human health. This emblematic tree crop of the Mediterranean Basin, which has conserved a very wide germplasm estimated in more than 1,200 cultivars, is a diploid species (2n = 2x = 46) that is present in two forms, namely wild (Olea europaea subsp. europaea var. sylvestris) and cultivated (Olea europaea subsp. europaea var. europaea). In spite of its economic and nutritional importance, there are few data about the genetic of olive if compared with other fruit crops. Available molecular data are especially related to the application of molecular markers to the analysis of genetic variability in Olea europaea complex and to develop efficient molecular tools for the olive oil origin traceability. With regard to genomic research, in the last years efforts are made for the identification of expressed sequence tag, with particular interest in those sequences expressed during fruit development and in pollen allergens. Very recently the sequencing of chloroplast genome provided new information on the olive nucleotide sequence, opening the olive genomic era. In this article, we provide an overview of the most relevant results in olive molecular studies. A particular attention was given to DNA markers and their application that constitute the most part of published researches. The first important results in genome analysis were reported.
MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants.

PubMed

Elshazly, Hatem; Souilmi, Yassine; Tonellato, Peter J; Wall, Dennis P; Abouelhoda, Mohamed

2017-01-20

Next Generation Genome sequencing techniques became affordable for massive sequencing efforts devoted to clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the used analysis tools to reduce the overall computational processing time, and concomitantly reduce the processing cost. Furthermore, it is important to capitalize on the use of the recent development in the cloud computing market, which have witnessed more providers competing in terms of products and prices. In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as, any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios to exploit the spot instance model of Amazon in combination with the use of other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. MC-GenomeKey provides an efficient multicloud based solution to detect and annotate mutations. The package can run in different commercial cloud platforms, which enables the user to seize the best offers. The package also provides a reliable means to make use of the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web-interface and it is available for free for academic use.
A privacy-preserving solution for compressed storage and selective retrieval of genomic data.

PubMed

Huang, Zhicong; Ayday, Erman; Lin, Huang; Aiyar, Raeka S; Molyneaux, Adam; Xu, Zhenyu; Fellay, Jacques; Steinmetz, Lars M; Hubaux, Jean-Pierre

2016-12-01

In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients' complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data. © 2016 Huang et al.; Published by Cold Spring Harbor Laboratory Press.
A privacy-preserving solution for compressed storage and selective retrieval of genomic data

PubMed Central

Huang, Zhicong; Ayday, Erman; Lin, Huang; Aiyar, Raeka S.; Molyneaux, Adam; Xu, Zhenyu; Hubaux, Jean-Pierre

2016-01-01

In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients’ complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data. PMID:27789525
Detection of integrated papillomavirus sequences by ligation-mediated PCR (DIPS-PCR) and molecular characterization in cervical cancer cells.

PubMed

Luft, F; Klaes, R; Nees, M; Dürst, M; Heilmann, V; Melsheimer, P; von Knebel Doeberitz, M

2001-04-01

Human papillomavirus (HPV) genomes usually persist as episomal molecules in HPV associated preneoplastic lesions whereas they are frequently integrated into the host cell genome in HPV-related cancers cells. This suggests that malignant conversion of HPV-infected epithelia is linked to recombination of cellular and viral sequences. Due to technical limitations, precise sequence information on viral-cellular junctions were obtained only for few cell lines and primary lesions. In order to facilitate the molecular analysis of genomic HPV integration, we established a ligation-mediated PCR assay for the detection of integrated papillomavirus sequences (DIPS-PCR). DIPS-PCR was initially used to amplify genomic viral-cellular junctions from HPV-associated cervical cancer cell lines (C4-I, C4-II, SW756, and HeLa) and HPV-immortalized keratinocyte lines (HPKIA, HPKII). In addition to junctions already reported in public data bases, various new fusion fragments were identified. Subsequently, 22 different viral-cellular junctions were amplified from 17 cervical carcinomas and 1 vulval intraepithelial neoplasia (VIN III). Sequence analysis of each junction revealed that the viral E1 open reading frame (ORF) was fused to cellular sequences in 20 of 22 (91%) cases. Chromosomal integration loci mapped to chromosomes 1 (2n), 2 (3n), 7 (2n), 8 (3n), 10 (1n), 14 (5n), 16 (1n), 17 (2n), and mitochondrial DNA (1n), suggesting random distribution of chromosomal integration sites. Precise sequence information obtained by DIPS-PCR was further used to monitor the monoclonal origin of 4 cervical cancers, 1 case of recurrent premalignant lesions and 1 lymph node metastasis. Therefore, DIPS-PCR might allow efficient therapy control and prediction of relapse in patients with HPV-associated anogenital cancers. Copyright 2001 Wiley-Liss, Inc.
Long Read Alignment with Parallel MapReduce Cloud Platform

PubMed Central

Al-Absi, Ahmed Abdulhakim; Kang, Dae-Ki

2015-01-01

Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms. PMID:26839887
Long Read Alignment with Parallel MapReduce Cloud Platform.

PubMed

Al-Absi, Ahmed Abdulhakim; Kang, Dae-Ki

2015-01-01

Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. Therefore, in this paper, we introduce Burrows-Wheeler Aligner's Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. The proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.