King, Brian R; Aburdene, Maurice; Thompson, Alex; Warres, Zach
2014-01-01
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence, from which relative measures of similarity among DNA sequences are derived. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
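The abstract does not give the exact ICD formula, so the sketch below only illustrates the general shape of such a method under stated assumptions: bases are mapped to Voss binary indicator sequences, a DFT magnitude spectrum is computed, and differences between adjacent coefficients form an alignment-free signature compared by Euclidean distance. The function names and the specific mapping are illustrative, not the authors' implementation.

```python
import numpy as np

def dna_spectrum(seq, n=64):
    """Summed magnitude spectrum of the four Voss indicator sequences,
    zero-padded or truncated to a fixed length n."""
    x = np.zeros((4, n))
    for i, base in enumerate(seq[:n]):
        j = "ACGT".find(base)
        if j >= 0:
            x[j, i] = 1.0
    return np.abs(np.fft.fft(x, axis=1)).sum(axis=0)

def icd_signature(seq, n=64):
    """Differences between adjacent DFT coefficient magnitudes
    (a hypothetical form of the inter-coefficient difference)."""
    return np.diff(dna_spectrum(seq, n))

def icd_distance(a, b, n=64):
    """Alignment-free dissimilarity between two DNA sequences."""
    return float(np.linalg.norm(icd_signature(a, n) - icd_signature(b, n)))
```

A distance matrix built from `icd_distance` over a set of sequences could then feed a standard tree-building method such as neighbor joining.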
Image encryption using random sequence generated from generalized information domain
NASA Astrophysics Data System (ADS)
Xia-Yan, Zhang; Guo-Ji, Zhang; Xuan, Li; Ya-Zhou, Ren; Jie-Hua, Wu
2016-05-01
A novel image encryption method based on the random sequence generated from the generalized information domain and permutation-diffusion architecture is proposed. The random sequence is generated by reconstruction from the generalized information file and discrete trajectory extraction from the data stream. The trajectory address sequence is used to generate a P-box to shuffle the plain image, while random sequences are treated as keystreams. A new factor, called the drift factor, is employed to accelerate and enhance the performance of the random sequence generator. An initial value is introduced so that the encryption method approximates a one-time pad. Experimental results show that the random sequences pass the NIST statistical test with a high ratio, and extensive analysis demonstrates that the new encryption scheme has superior security.
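The permutation-diffusion architecture itself is generic and can be sketched compactly. In the sketch below, the paper's generalized-information-domain generator is replaced by an iterated SHA-256 keystream (purely a stand-in); the P-box is derived by sorting a keyed stream, and diffusion is a byte-wise XOR.

```python
import hashlib
import numpy as np

def keystream(key: bytes, n: int) -> np.ndarray:
    """Byte keystream via iterated SHA-256 (stand-in for the paper's generator)."""
    out, block = bytearray(), key
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out.extend(block)
    return np.frombuffer(bytes(out[:n]), dtype=np.uint8)

def encrypt(img: np.ndarray, key: bytes) -> np.ndarray:
    flat = img.ravel()
    ks = keystream(key, flat.size)
    perm = np.argsort(keystream(key + b"p", flat.size), kind="stable")  # P-box
    return (flat[perm] ^ ks).reshape(img.shape)       # permute, then diffuse

def decrypt(ct: np.ndarray, key: bytes) -> np.ndarray:
    flat = ct.ravel()
    ks = keystream(key, flat.size)
    perm = np.argsort(keystream(key + b"p", flat.size), kind="stable")
    inv = np.empty_like(perm)
    inv[perm] = np.arange(perm.size)                  # inverse permutation
    return ((flat ^ ks)[inv]).reshape(ct.shape)
```

Making the keystream depend on a per-image initial value, as the paper does, is what pushes the scheme toward one-time-pad behavior.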
Next generation sequencers: methods and applications in food-borne pathogens
USDA-ARS's Scientific Manuscript database
Next generation sequencers are able to produce millions of short sequence reads in a high-throughput, low-cost way. The emergence of these technologies has not only facilitated genome sequencing but also started to change the landscape of life sciences. This chapter will survey their methods and app...
NASA Astrophysics Data System (ADS)
Yasuda, Jun; Yoshizawa, Shin; Umemura, Shin-ichiro
2016-07-01
Sonodynamic treatment is a method of treating cancer using reactive oxygen species (ROS) generated by cavitation bubbles in collaboration with a sonosensitizer at a target tissue. In this treatment method, both localized ROS generation and highly efficient ROS generation are important. In this study, a triggered high-intensity focused ultrasound (HIFU) sequence, which consists of a short, extremely high intensity pulse immediately followed by a long, moderate-intensity burst, was employed for the efficient generation of ROS. In experiments, a solution sealed in a chamber was exposed to a triggered HIFU sequence. The distribution of generated ROS was then observed by the luminol reaction, and the amount of generated ROS was quantified using the KI method. As a result, localized ROS generation was demonstrated by light emission from the luminol reaction. Moreover, both the KI method and the luminol reaction emission demonstrated that the triggered HIFU sequence generates ROS with higher efficiency.
System, method and apparatus for generating phrases from a database
NASA Technical Reports Server (NTRS)
McGreevy, Michael W. (Inventor)
2004-01-01
Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input; it may include a single term, a sequence of terms, multiple individual terms, multiple sequences of terms, or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.
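The patent abstract leaves the relational model unspecified; a minimal sketch of the idea, assuming the "contextual relations" are simply bigram successor counts, is to extend a query term greedily with its most frequent successors. All names here are illustrative.

```python
from collections import Counter, defaultdict

def build_model(corpus):
    """Bigram 'contextual relations': successor counts for each term."""
    succ = defaultdict(Counter)
    for doc in corpus:
        toks = doc.lower().split()
        for a, b in zip(toks, toks[1:]):
            succ[a][b] += 1
    return succ

def generate_phrases(model, query, length=3, top=3):
    """Greedily extend the query term with its most frequent successors,
    producing candidate phrases of the requested length."""
    phrases = [[query.lower()]]
    for _ in range(length - 1):
        new = []
        for p in phrases:
            for nxt, _ in model[p[-1]].most_common(top):
                new.append(p + [nxt])
        phrases = new or phrases
    return [" ".join(p) for p in phrases]
```

A real implementation would also rank the assembled phrases by how strongly they co-occur with the query across the database, rather than by raw successor frequency alone.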
Metagenome assembly through clustering of next-generation sequencing data using protein sequences.
Sim, Mikang; Kim, Jaebum
2015-02-01
The study of environmental microbial communities, called metagenomics, has gained a lot of attention because of the recent advances in next-generation sequencing (NGS) technologies. Microbes play a critical role in changing their environments, and the mechanisms of these effects can be elucidated by investigating metagenomes. However, characteristics of metagenomes, such as the mixture of multiple microbes and uneven species abundance, make metagenome assembly particularly challenging. In this paper, we developed a new metagenome assembly method by utilizing protein sequences in addition to the NGS read sequences. Our method (i) builds read clusters by using mapping information against available protein sequences, and (ii) creates contig sequences by finding consensus sequences through probabilistic choices from the read clusters. By using simulated NGS read sequences from real microbial genome sequences, we evaluated our method in comparison with four existing assembly programs. We found that our method could generate relatively long and accurate metagenome assemblies, indicating that the idea of using protein sequences as a guide for the assembly is promising. Copyright © 2015 Elsevier B.V. All rights reserved.
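Step (ii) above, building a contig from a read cluster, reduces at its core to a per-column consensus. The sketch below is a deterministic majority-vote simplification (the paper uses probabilistic choices, and real assembly would place reads by mapping coordinates rather than at a common offset):

```python
from collections import Counter

def consensus(reads):
    """Per-column majority vote over reads assumed to share a start
    coordinate; columns beyond a read's end are simply skipped."""
    length = max(len(r) for r in reads)
    out = []
    for i in range(length):
        col = Counter(r[i] for r in reads if i < len(r))
        out.append(col.most_common(1)[0][0])
    return "".join(out)
```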
Archer, Stuart K; Shirokikh, Nikolay E; Preiss, Thomas
2015-04-01
Most applications for RNA-seq require the depletion of abundant transcripts to gain greater coverage of the underlying transcriptome. The sequences to be targeted for depletion depend on application and species and in many cases may not be supported by commercial depletion kits. This unit describes a method for generating RNA-seq libraries that incorporates probe-directed degradation (PDD), which can deplete any unwanted sequence set, with the low-bias split-adapter method of library generation (although many other library generation methods are in principle compatible). The overall strategy is suitable for applications requiring customized sequence depletion or where faithful representation of fragment ends and lack of sequence bias is paramount. We provide guidelines to rapidly design specific probes against the target sequence, and a detailed protocol for library generation using the split-adapter method including several strategies for streamlining the technique and reducing adapter dimer content. Copyright © 2015 John Wiley & Sons, Inc.
DNA sequence similarity recognition by hybridization to short oligomers
Milosavljevic, Aleksandar
1999-01-01
Methods are disclosed for the comparison of nucleic acid sequences. Data is generated by hybridizing sets of oligomers with target nucleic acids. The data thus generated is manipulated simultaneously with respect to both (i) matching between oligomers and (ii) matching between oligomers and putative reference sequences available in databases. Using data compression methods to manipulate this mutual information, sequences for the target can be constructed.
Guilfoyle, Richard A.; Guo, Zhen
2001-01-01
A restriction site indexing method for selectively amplifying any fragment generated by a Class II restriction enzyme includes adaptors specific to fragment ends containing adaptor indexing sequences complementary to fragment indexing sequences near the termini of fragments generated by Class II enzyme cleavage. A method for combinatorial indexing facilitates amplification of restriction fragments whose sequence is not known.
Chen, Guiqian; Qiu, Yuan; Zhuang, Qingye; Wang, Suchun; Wang, Tong; Chen, Jiming; Wang, Kaicheng
2018-05-09
Next generation sequencing (NGS) is a powerful tool for the characterization, discovery, and molecular identification of RNA viruses. Multiple NGS library preparation methods have been published for strand-specific RNA-seq, but some are not suitable for identifying and characterizing RNA viruses. In this study, we report an NGS library preparation method to identify RNA viruses using the Ion Torrent PGM platform. The NGS sequencing adapters were directly inserted into the sequencing library through reverse transcription and polymerase chain reaction, without fragmentation and ligation of nucleic acids. The results show that this method is simple to perform and able to identify multiple species of RNA viruses in clinical samples.
Quick, Josh; Grubaugh, Nathan D; Pullan, Steven T; Claro, Ingra M; Smith, Andrew D; Gangavarapu, Karthik; Oliveira, Glenn; Robles-Sikisaka, Refugio; Rogers, Thomas F; Beutler, Nathan A; Burton, Dennis R; Lewis-Ximenez, Lia Laura; de Jesus, Jaqueline Goes; Giovanetti, Marta; Hill, Sarah; Black, Allison; Bedford, Trevor; Carroll, Miles W; Nunes, Marcio; Alcantara, Luiz Carlos; Sabino, Ester C; Baylis, Sally A; Faria, Nuno; Loose, Matthew; Simpson, Jared T; Pybus, Oliver G; Andersen, Kristian G; Loman, Nicholas J
2018-01-01
Genome sequencing has become a powerful tool for studying emerging infectious diseases; however, genome sequencing directly from clinical samples without isolation remains challenging for viruses such as Zika, where metagenomic sequencing methods may generate insufficient numbers of viral reads. Here we present a protocol for generating coding-sequence complete genomes comprising an online primer design tool, a novel multiplex PCR enrichment protocol, optimised library preparation methods for the portable MinION sequencer (Oxford Nanopore Technologies) and the Illumina range of instruments, and a bioinformatics pipeline for generating consensus sequences. The MinION protocol does not require an internet connection for analysis, making it suitable for field applications with limited connectivity. Our method relies on multiplex PCR for targeted enrichment of viral genomes from samples containing as few as 50 genome copies per reaction. Viral consensus sequences can be achieved starting with clinical samples in 1-2 days following a simple laboratory workflow. This method has been successfully used by several groups studying Zika virus evolution and is facilitating an understanding of the spread of the virus in the Americas. PMID:28538739
Altimari, Annalisa; de Biase, Dario; De Maglio, Giovanna; Gruppioni, Elisa; Capizzi, Elisa; Degiovanni, Alessio; D’Errico, Antonia; Pession, Annalisa; Pizzolitto, Stefano; Fiorentino, Michelangelo; Tallini, Giovanni
2013-01-01
Detection of KRAS mutations in archival pathology samples is critical for therapeutic appropriateness of anti-EGFR monoclonal antibodies in colorectal cancer. We compared the sensitivity, specificity, and accuracy of Sanger sequencing, ARMS-Scorpion (TheraScreen®) real-time polymerase chain reaction (PCR), pyrosequencing, chip array hybridization, and 454 next-generation sequencing to assess KRAS codon 12 and 13 mutations in 60 nonconsecutive selected cases of colorectal cancer. Twenty of the 60 cases were detected as wild-type KRAS by all methods with 100% specificity. Among the 40 mutated cases, 13 were discrepant with at least one method. The sensitivity was 85%, 90%, 93%, and 92%, and the accuracy was 90%, 93%, 95%, and 95% for Sanger sequencing, TheraScreen real-time PCR, pyrosequencing, and chip array hybridization, respectively. The main limitation of Sanger sequencing was its low analytical sensitivity, whereas TheraScreen real-time PCR, pyrosequencing, and chip array hybridization showed higher sensitivity but suffered from the limitations of predesigned assays. Concordance between the methods was k = 0.79 for Sanger sequencing and k > 0.85 for the other techniques. Tumor cell enrichment correlated significantly with the abundance of KRAS-mutated deoxyribonucleic acid (DNA), evaluated as ΔCt for TheraScreen real-time PCR (P = 0.03), percentage of mutation for pyrosequencing (P = 0.001), ratio for chip array hybridization (P = 0.003), and percentage of mutation for 454 next-generation sequencing (P = 0.004). Also, 454 next-generation sequencing showed the best cross correlation for quantification of mutation abundance compared with all the other methods (P < 0.001). Our comparison showed the superiority of next-generation sequencing over the other techniques in terms of sensitivity and specificity. Next-generation sequencing will replace Sanger sequencing as the reference technique for diagnostic detection of KRAS mutation in archival tumor tissues. 
PMID:23950653
A Next-Generation Sequencing Primer—How Does It Work and What Can It Do?
Alekseyev, Yuriy O.; Fazeli, Roghayeh; Yang, Shi; Basran, Raveen; Miller, Nancy S.
2018-01-01
Next-generation sequencing refers to a high-throughput technology that determines the nucleic acid sequences and identifies variants in a sample. The technology has been introduced into clinical laboratory testing and produces test results for precision medicine. Since next-generation sequencing is relatively new, graduate students, medical students, pathology residents, and other physicians may benefit from a primer to provide a foundation about basic next-generation sequencing methods and applications, as well as specific examples where it has had diagnostic and prognostic utility. Next-generation sequencing technology grew out of advances in multiple fields to produce a sophisticated laboratory test with tremendous potential. Next-generation sequencing may be used in the clinical setting to look for specific genetic alterations in patients with cancer, diagnose inherited conditions such as cystic fibrosis, and detect and profile microbial organisms. This primer will review DNA sequencing technology, the commercialization of next-generation sequencing, and clinical uses of next-generation sequencing. Specific applications where next-generation sequencing has demonstrated utility in oncology are provided. PMID:29761157
USDA-ARS's Scientific Manuscript database
Next-generation sequencing technology such as genotyping-by-sequencing (GBS) made low-cost, but often low-coverage, whole-genome sequencing widely available. Extensive inbreeding in crop plants provides an untapped, high quality source of phased haplotypes for imputing missing genotypes. We introduc...
A resampling procedure for generating conditioned daily weather sequences
Clark, Martyn P.; Gangopadhyay, Subhrendu; Brandon, David; Werner, Kevin; Hay, Lauren E.; Rajagopalan, Balaji; Yates, David
2004-01-01
A method is introduced to generate conditioned daily precipitation and temperature time series at multiple stations. The method resamples data from the historical record “nens” times for the period of interest (nens = number of ensemble members) and reorders the ensemble members to reconstruct the observed spatial (intersite) and temporal correlation statistics. The weather generator model is applied to 2307 stations in the contiguous United States and is shown to reproduce the observed spatial correlation between neighboring stations, the observed correlation between variables (e.g., between precipitation and temperature), and the observed temporal correlation between subsequent days in the generated weather sequence. The weather generator model is extended to produce sequences of weather that are conditioned on climate indices (in this case the Niño 3.4 index). Example illustrations of conditioned weather sequences are provided for a station in Arizona (Petrified Forest, 34.8°N, 109.9°W), where El Niño and La Niña conditions have a strong effect on winter precipitation. The conditioned weather sequences generated using the methods described in this paper are appropriate for use as input to hydrologic models to produce multiseason forecasts of streamflow.
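The reordering step described above resembles the Schaake shuffle: a generated sample is rearranged so that its rank order matches that of an observed (historical) sequence, which restores temporal and spatial correlation. The sketch below shows that core operation on synthetic data; the gamma draws are a stand-in for real daily precipitation, not the paper's dataset.

```python
import numpy as np

def reorder_to_ranks(sample, template):
    """Reorder `sample` so its rank order matches `template`
    (the Schaake-shuffle idea used to restore space-time correlation)."""
    sample = np.sort(np.asarray(sample, dtype=float))
    ranks = np.argsort(np.argsort(template))  # rank of each template value
    return sample[ranks]

rng = np.random.default_rng(0)
hist = rng.gamma(2.0, 3.0, size=30)    # historical daily precipitation (synthetic stand-in)
member = rng.gamma(2.0, 3.0, size=30)  # one resampled/generated ensemble member
shuffled = reorder_to_ranks(member, hist)
```

After the shuffle, `shuffled` contains exactly the generated values but in an order whose wet/dry pattern mirrors the historical template; applying the same template across stations reproduces intersite correlation.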
Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.
Alkhateeb, Abedalrhman; Rueda, Luis
2017-08-01
Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score, and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. The Zseq algorithm provides a better mapping rate and reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. A method for estimating the cutoff threshold using labeling rules is also introduced, with promising results.
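The core scoring-and-filtering loop can be sketched briefly. This is a simplification under stated assumptions: the score is just the count of distinct k-mers, and the ambiguous-base and GC-content penalties mentioned in the abstract are omitted.

```python
from statistics import mean, pstdev

def kmer_score(seq, k=4):
    """Complexity score: number of distinct k-mers in the read."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def zseq_filter(reads, k=4, z_min=-1.0):
    """Keep reads whose k-mer-complexity z-score is at least z_min."""
    scores = [kmer_score(r, k) for r in reads]
    mu, sd = mean(scores), pstdev(scores) or 1.0
    return [r for r, s in zip(reads, scores) if (s - mu) / sd >= z_min]
```

A homopolymer read such as `"A" * 20` has only one distinct 4-mer and falls far below the mean complexity, so it is removed, while diverse reads pass.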
Quick, Joshua; Grubaugh, Nathan D; Pullan, Steven T; Claro, Ingra M; Smith, Andrew D; Gangavarapu, Karthik; Oliveira, Glenn; Robles-Sikisaka, Refugio; Rogers, Thomas F; Beutler, Nathan A; Burton, Dennis R; Lewis-Ximenez, Lia Laura; de Jesus, Jaqueline Goes; Giovanetti, Marta; Hill, Sarah C; Black, Allison; Bedford, Trevor; Carroll, Miles W; Nunes, Marcio; Alcantara, Luiz Carlos; Sabino, Ester C; Baylis, Sally A; Faria, Nuno R; Loose, Matthew; Simpson, Jared T; Pybus, Oliver G; Andersen, Kristian G; Loman, Nicholas J
2017-06-01
Genome sequencing has become a powerful tool for studying emerging infectious diseases; however, genome sequencing directly from clinical samples (i.e., without isolation and culture) remains challenging for viruses such as Zika, for which metagenomic sequencing methods may generate insufficient numbers of viral reads. Here we present a protocol for generating coding-sequence-complete genomes, comprising an online primer design tool, a novel multiplex PCR enrichment protocol, optimized library preparation methods for the portable MinION sequencer (Oxford Nanopore Technologies) and the Illumina range of instruments, and a bioinformatics pipeline for generating consensus sequences. The MinION protocol does not require an Internet connection for analysis, making it suitable for field applications with limited connectivity. Our method relies on multiplex PCR for targeted enrichment of viral genomes from samples containing as few as 50 genome copies per reaction. Viral consensus sequences can be achieved in 1-2 d by starting with clinical samples and following a simple laboratory workflow. This method has been successfully used by several groups studying Zika virus evolution and is facilitating an understanding of the spread of the virus in the Americas. The protocol can be used to sequence other viral genomes using the online Primal Scheme primer designer software. It is suitable for sequencing either RNA or DNA viruses in the field during outbreaks or as an inexpensive, convenient method for use in the lab.
Fernández-Caballero Rico, Jose Ángel; Chueca Porcuna, Natalia; Álvarez Estévez, Marta; Mosquera Gutiérrez, María Del Mar; Marcos Maeso, María Ángeles; García, Federico
2018-02-01
To show how to generate a consensus sequence from the information of massively parallel sequencing data obtained from routine HIV anti-retroviral resistance studies, which may be suitable for molecular epidemiology studies. Paired Sanger (Trugene-Siemens) and next-generation sequencing (NGS) (454 GS Junior-Roche) HIV RT and protease sequences from 62 patients were studied. NGS consensus sequences were generated using Mesquite, using 10%, 15%, and 20% thresholds. Molecular Evolutionary Genetics Analysis (MEGA) was used for phylogenetic studies. At a 10% threshold, NGS-Sanger sequences from 17/62 patients were phylogenetically related, with a median bootstrap value of 88% (IQR 83.5-95.5). Association increased to 36/62 sequences (median bootstrap 94%, IQR 85.5-98) using a 15% threshold. Maximum association was at the 20% threshold, with 61/62 sequences associated and a median bootstrap value of 99% (IQR 98-100). A reliable method is presented to generate consensus sequences from HIV NGS data at a 20% threshold, which will prove useful for molecular epidemiological studies. Copyright © 2016 Elsevier España, S.L.U. and Sociedad Española de Enfermedades Infecciosas y Microbiología Clínica. All rights reserved.
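Threshold-based consensus calling of this kind can be sketched as follows. This is an assumption about the general technique, not Mesquite's exact behavior: at each column, every base whose frequency reaches the threshold is retained, and multiple surviving bases are encoded as an IUPAC ambiguity code.

```python
from collections import Counter

# Minimal IUPAC table for one- and two-base calls
IUPAC = {frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
         frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
         frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CG"): "S",
         frozenset("AT"): "W"}

def ngs_consensus(reads, threshold=0.20):
    """Per-column consensus over equal-length aligned reads: bases at or
    above `threshold` frequency are kept; ties to IUPAC, else majority."""
    out = []
    for col in zip(*reads):
        freqs = Counter(col)
        total = sum(freqs.values())
        kept = frozenset(b for b, c in freqs.items() if c / total >= threshold)
        out.append(IUPAC.get(kept, max(freqs, key=freqs.get)))
    return "".join(out)
```

Lowering the threshold admits more minority variants into the consensus as ambiguity codes, which is exactly the trade-off the study quantifies via phylogenetic concordance with Sanger sequences.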
Caruccio, Nicholas
2011-01-01
DNA library preparation is a common entry point and bottleneck for next-generation sequencing. Current methods generally consist of distinct steps that often involve significant sample loss and hands-on time: DNA fragmentation, end-polishing, and adaptor-ligation. In vitro transposition with Nextera™ Transposomes simultaneously fragments and covalently tags the target DNA, thereby combining these three distinct steps into a single reaction. Platform-specific sequencing adaptors can be added, and the sample can be enriched and bar-coded using limited-cycle PCR to prepare di-tagged DNA fragment libraries. Nextera technology offers a streamlined, efficient, and high-throughput method for generating bar-coded libraries compatible with multiple next-generation sequencing platforms.
DNA Base-Calling from a Nanopore Using a Viterbi Algorithm
Timp, Winston; Comer, Jeffrey; Aksimentiev, Aleksei
2012-01-01
Nanopore-based DNA sequencing is the most promising third-generation sequencing method. It has superior read length, speed, and sample requirements compared with state-of-the-art second-generation methods. However, base-calling still presents substantial difficulty because the resolution of the technique is limited compared with the measured signal/noise ratio. Here we demonstrate a method to decode 3-bp-resolution nanopore electrical measurements into a DNA sequence using a Hidden Markov model. This method shows tremendous potential for accuracy (∼98%), even with a poor signal/noise ratio. PMID:22677395
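The decoding step described above is standard Viterbi decoding of a Hidden Markov model. The sketch below is a generic implementation applied to a toy two-state model (low/high current level), not the paper's trained 3-bp k-mer model.

```python
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path for an observation sequence.
    log_trans[i, j]: log P(state j | state i); log_emit[s, o]: log P(obs o | state s)."""
    T, S = len(obs), len(states)
    dp = np.full((T, S), -np.inf)          # best log-probability ending in each state
    ptr = np.zeros((T, S), dtype=int)      # backpointers
    dp[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            cand = dp[t - 1] + log_trans[:, s]
            ptr[t, s] = int(np.argmax(cand))
            dp[t, s] = cand[ptr[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):          # backtrack
        path.append(int(ptr[t, path[-1]]))
    return [states[s] for s in reversed(path)]

states = ["L", "H"]
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1, 1], states, log_start, log_trans, log_emit)
```

In the nanopore setting the states would be overlapping k-mers, transitions would be constrained to single-base shifts, and emissions would be measured current levels.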
Liu, Gary W; Livesay, Brynn R; Kacherovsky, Nataly A; Cieslewicz, Maryelise; Lutz, Emi; Waalkes, Adam; Jensen, Michael C; Salipante, Stephen J; Pun, Suzie H
2015-08-19
Peptide ligands are used to increase the specificity of drug carriers to their target cells and to facilitate intracellular delivery. One method to identify such peptide ligands, phage display, enables high-throughput screening of peptide libraries for ligands binding to therapeutic targets of interest. However, conventional methods for identifying target binders in a library by Sanger sequencing are low-throughput, labor-intensive, and provide a limited perspective (<0.01%) of the complete sequence space. Moreover, the small sample space can be dominated by nonspecific, preferentially amplifying "parasitic sequences" and plastic-binding sequences, which may lead to the identification of false positives or exclude the identification of target-binding sequences. To overcome these challenges, we employed next-generation Illumina sequencing to couple high-throughput screening and high-throughput sequencing, enabling more comprehensive access to the phage display library sequence space. In this work, we define the hallmarks of binding sequences in next-generation sequencing data, and develop a method that identifies several target-binding phage clones for murine, alternatively activated M2 macrophages with a high (100%) success rate: sequences and binding motifs were reproducibly present across biological replicates; binding motifs were identified across multiple unique sequences; and an unselected, amplified library accurately filtered out parasitic sequences. In addition, we validate the Multiple Em for Motif Elicitation tool as an efficient and principled means of discovering binding sequences.
Concatenated shift registers generating maximally spaced phase shifts of PN-sequences
NASA Technical Reports Server (NTRS)
Hurd, W. J.; Welch, L. R.
1977-01-01
A large class of linearly concatenated shift registers is shown to generate approximately maximally spaced phase shifts of PN-sequences, for use in pseudorandom number generation. A constructive method is presented for finding members of this class for almost all degrees for which primitive trinomials exist. The resulting sequences are not normally characterized by trinomial recursions, which is desirable since trinomial sequences can have some undesirable randomness properties.
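The building block here is a shift register driven by a primitive trinomial. A minimal sketch of PN-sequence generation with a Fibonacci LFSR follows (the concatenation construction of the abstract is omitted); the tap convention is an assumption, but a primitive trinomial or its reciprocal yields a maximal-length sequence either way.

```python
def lfsr_trinomial(degree, tap, seed, n):
    """Fibonacci LFSR whose feedback taps correspond to the trinomial
    x^degree + x^tap + 1 (or its reciprocal, depending on convention).
    A primitive trinomial yields a PN-sequence of period 2**degree - 1."""
    mask = (1 << degree) - 1
    state = seed & mask
    bits = []
    for _ in range(n):
        bits.append(state & 1)                       # output bit
        feedback = (state ^ (state >> tap)) & 1      # XOR of the two taps
        state = (state >> 1) | (feedback << (degree - 1))
    return bits

# x^7 + x + 1 is a primitive trinomial, so the period is 2^7 - 1 = 127
seq = lfsr_trinomial(7, 1, 1, 254)
```

A maximal-length sequence of degree 7 contains exactly 64 ones per period, one of the classical balance properties of m-sequences.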
Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong
2017-01-01
Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, we first defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large-scale anonymity datasets. We then proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method and to explore the influence of the parameter K. The results demonstrated that our proposed approach is faster and more effective at hiding privacy-sensitive sequence rules, as measured by the ratio of sensitive rules hidden, and thus at eliminating inference attacks.
Our method also had fewer side effects than the traditional spatial-temporal k-anonymity method in terms of the ratio of new sensitive rules generated, and essentially the same side effects in terms of the variation ratio of non-sensitive rules. Furthermore, we characterized how performance varies with the parameter K, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.
A statistical method for the detection of variants from next-generation resequencing of DNA pools.
Bansal, Vikas
2010-06-15
Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
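CRISP combines two tests; the sketch below illustrates only the second one, evaluating whether the non-reference base calls in a pool are plausible under sequencing error alone. The error rate and significance threshold are illustrative values, and the cross-pool contingency-table test is omitted.

```python
from math import comb

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p), computed exactly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def call_variant(pools, err=0.01, alpha=1e-4):
    """Flag a site if, in any pool, the observed non-reference count is
    improbable under sequencing error alone.
    `pools` is a list of (ref_count, alt_count) pairs, one per pool."""
    return any(binom_sf(alt, ref + alt, err) < alpha for ref, alt in pools)
```

At a 1% error rate, 2 alternate calls out of 1000 reads is entirely expected, while 50 out of 1000 is vanishingly unlikely without a true variant in the pool.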
Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment
2011-01-01
Background Many Bioinformatics studies begin with a multiple sequence alignment as the foundation for their research. This is because multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence structure relationships. Results In this paper, we have proposed a Vertical Decomposition with Genetic Algorithm (VDGA) for Multiple Sequence Alignment (MSA). In VDGA, we divide the sequences vertically into two or more subsequences, and then solve them individually using a guide tree approach. Finally, we combine all the subsequences to generate a new multiple sequence alignment. This technique is applied on the solutions of the initial generation and of each child generation within VDGA. We have used two mechanisms to generate an initial population in this research: the first mechanism is to generate guide trees with randomly selected sequences and the second is shuffling the sequences inside such trees. Two different genetic operators have been implemented with VDGA. To test the performance of our algorithm, we have compared it with existing well-known methods, namely PRRP, CLUSTALX, DIALIGN, HMMT, SB_PIMA, ML_PIMA, MULTALIGN, and PILEUP8, and also other methods, based on Genetic Algorithms (GA), such as SAGA, MSA-GA and RBT-GA, by solving a number of benchmark datasets from BAliBase 2.0. Conclusions The experimental results showed that the VDGA with three vertical divisions was the most successful variant for most of the test cases in comparison to other divisions considered with VDGA. The experimental results also confirmed that VDGA outperformed the other methods considered in this research. PMID:21867510
Pseudorandom number generation using chaotic true orbits of the Bernoulli map
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saito, Asaki, E-mail: saito@fun.ac.jp; Yamaguchi, Akihiro
We devise a pseudorandom number generator that exactly computes chaotic true orbits of the Bernoulli map on quadratic algebraic integers. Moreover, we describe a way to select the initial points (seeds) for generating multiple pseudorandom binary sequences. This selection method distributes the initial points almost uniformly (equidistantly) in the unit interval, and latter parts of the generated sequences are guaranteed not to coincide. We also demonstrate through statistical testing that the generated sequences possess good randomness properties.
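The core idea — iterating the Bernoulli (doubling) map exactly on a quadratic irrational so the orbit never becomes periodic — can be sketched with integer arithmetic alone. This is an illustrative reconstruction, not the authors' generator; their seed-selection scheme over quadratic algebraic integers is more elaborate. The state x = (a + b·√d)/c is held as integers, and exact sign tests replace floating point:

```python
def nonneg(a, b, d):
    """Exact test: is a + b*sqrt(d) >= 0, for d a positive non-square int?"""
    if b >= 0 and a >= 0:
        return True
    if b <= 0 and a <= 0:
        return False          # cannot be exactly zero: sqrt(d) is irrational
    if b > 0:                 # a < 0: b*sqrt(d) >= -a  <=>  b*b*d >= a*a
        return b * b * d >= a * a
    return a * a >= b * b * d # a > 0, b < 0

def bernoulli_bits(a, b, c, d, n):
    """First n output bits of the doubling map x -> 2x mod 1, with
    x = (a + b*sqrt(d)) / c in [0, 1) computed exactly as integers."""
    bits = []
    for _ in range(n):
        a, b = 2 * a, 2 * b                    # x <- 2x
        bit = 1 if nonneg(a - c, b, d) else 0  # integer part of 2x
        a -= bit * c                           # keep the fractional part
        bits.append(bit)
    return bits
```

For the seed x = √2 − 1 (a = −1, b = 1, c = 1, d = 2), the output reproduces the binary expansion of √2 − 1 exactly, with no round-off-induced collapse to a periodic orbit.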
repRNA: a web server for generating various feature vectors of RNA sequences.
Liu, Bin; Liu, Fule; Fang, Longyun; Wang, Xiaolong; Chou, Kuo-Chen
2016-02-01
With the rapid growth of RNA sequences generated in the postgenomic age, it is highly desired to develop a flexible method that can generate various kinds of vectors to represent these sequences by focusing on their different features. This is because nearly all the existing machine-learning methods, such as SVM (support vector machine) and KNN (k-nearest neighbor), can only handle vectors but not sequences. To meet the increasing demands and speed up the genome analyses, we have developed a new web server, called "representations of RNA sequences" (repRNA). Compared with the existing methods, repRNA is much more comprehensive, flexible and powerful, as reflected by the following facts: (1) it can generate 11 different modes of feature vectors for users to choose according to their investigation purposes; (2) it allows users to select the features from 22 built-in physicochemical properties and even those defined by users' own; (3) the resultant feature vectors and the secondary structures of the corresponding RNA sequences can be visualized. The repRNA web server is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/repRNA/ .
DNA base-calling from a nanopore using a Viterbi algorithm.
Timp, Winston; Comer, Jeffrey; Aksimentiev, Aleksei
2012-05-16
Nanopore-based DNA sequencing is the most promising third-generation sequencing method. It has superior read length, speed, and sample requirements compared with state-of-the-art second-generation methods. However, base-calling still presents substantial difficulty because the resolution of the technique is limited compared with the measured signal/noise ratio. Here we demonstrate a method to decode 3-bp-resolution nanopore electrical measurements into a DNA sequence using a Hidden Markov model. This method shows tremendous potential for accuracy (~98%), even with a poor signal/noise ratio. Copyright © 2012 Biophysical Society. Published by Elsevier Inc. All rights reserved.
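The decoding step described above is standard Viterbi inference on a Hidden Markov model; in the paper the hidden states are 3-mer sequence contexts and the observations are current levels. A generic toy sketch (two hypothetical states, symbolic observations — not the paper's nanopore model) shows the dynamic program:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.
    Each DP column maps state -> (best path probability, predecessor)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        V.append({s: max(((V[-1][p][0] * trans_p[p][s] * emit_p[s][o], p)
                          for p in states), key=lambda t: t[0])
                  for s in states})
    best = max(states, key=lambda s: V[-1][s][0])   # best final state
    path = [best]
    for col in reversed(V[1:]):                      # backtrack
        path.append(col[path[-1]][1])
    return path[::-1]
```

A real base-caller would work in log space to avoid underflow on long reads; this sketch keeps raw probabilities for readability.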
Using structure to explore the sequence alignment space of remote homologs.
Kuziemko, Andrew; Honig, Barry; Petrey, Donald
2011-10-01
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents.
Liu, Sophia S; Hockenberry, Adam J; Lancichinetti, Andrea; Jewett, Michael C; Amaral, Luís A N
2016-11-01
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. In order to accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have developed into a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
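Under a maximum-entropy model with a GC constraint, synonymous codons are weighted by a Boltzmann factor in their GC count. The sketch below illustrates that form; it is not the NullSeq package itself, the codon table is a hypothetical three-amino-acid subset, and the multiplier `lam` (which NullSeq would tune to hit a target GC content) is left as a free parameter:

```python
import math
import random

# Hypothetical subset of the genetic code, for illustration only.
CODONS = {'K': ['AAA', 'AAG'], 'F': ['TTT', 'TTC'],
          'G': ['GGT', 'GGC', 'GGA', 'GGG']}

def gc(codon):
    """Number of G or C bases in a codon."""
    return sum(base in 'GC' for base in codon)

def sample_coding_sequence(protein, lam, rng=None):
    """Draw one random coding sequence for `protein`: within each
    synonymous family, a codon is chosen with Boltzmann weight
    exp(lam * GC-count) -- the maximum-entropy distribution under
    fixed amino acid sequence and a mean-GC constraint."""
    rng = rng or random.Random(0)
    seq = []
    for aa in protein:
        family = CODONS[aa]
        weights = [math.exp(lam * gc(c)) for c in family]
        seq.append(rng.choices(family, weights=weights)[0])
    return ''.join(seq)
```

Setting `lam = 0` recovers uniform synonymous-codon sampling; positive values bias toward GC-rich codons without ever changing the encoded protein.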
USDA-ARS?s Scientific Manuscript database
The Kauffman White (KW) serotyping method requires more than 250 antisera to characterize more than 2,500 Salmonella serovars. The complexity of serotyping could be overcome using molecular methods. In this study, a dkgB-linked intergenic sequence ribotyping (ISR) method that generates sequence occu...
A new strategy for genome assembly using short sequence reads and reduced representation libraries.
Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H
2010-02-01
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need for a reference sequence. As proof of concept, we applied this approach to sequence and assemble the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the currently available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
Puton, Tomasz; Kozlowski, Lukasz P.; Rother, Kristian M.; Bujnicki, Janusz M.
2013-01-01
We present a continuous benchmarking approach for the assessment of RNA secondary structure prediction methods implemented in the CompaRNA web server. As of 3 October 2012, the performance of 28 single-sequence and 13 comparative methods has been evaluated on RNA sequences/structures released weekly by the Protein Data Bank. We also provide a static benchmark generated on RNA 2D structures derived from the RNAstrand database. Benchmarks on both data sets offer insight into the relative performance of RNA secondary structure prediction methods on RNAs of different size and with respect to different types of structure. According to our tests, on the average, the most accurate predictions obtained by a comparative approach are generated by CentroidAlifold, MXScarna, RNAalifold and TurboFold. On the average, the most accurate predictions obtained by single-sequence analyses are generated by CentroidFold, ContextFold and IPknot. The best comparative methods typically outperform the best single-sequence methods if an alignment of homologous RNA sequences is available. This article presents the results of our benchmarks as of 3 October 2012, whereas the rankings presented online are continuously updated. We will gladly include new prediction methods and new measures of accuracy in the new editions of CompaRNA benchmarks. PMID:23435231
Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs
2014-01-01
Background We introduce Sequence Bundles--a novel data visualisation method for representing multiple sequence alignments (MSAs). We identify and address key limitations of the existing bioinformatics data visualisation methods (i.e. the Sequence Logo) by enabling Sequence Bundles to give salient visual expression to sequence motifs and other data features, which would otherwise remain hidden. Methods For the development of Sequence Bundles we employed research-led information design methodologies. Sequences are encoded as uninterrupted, semi-opaque lines plotted on a 2-dimensional reconfigurable grid. Each line represents a single sequence. The thickness and opacity of the stack at each residue in each position indicates the level of conservation and the lines' curved paths expose patterns in correlation and functionality. Several MSAs can be visualised in a composite image. The Sequence Bundles method is designed to favour a tangible, continuous and intuitive display of information. Results We have developed a software demonstration application for generating a Sequence Bundles visualisation of MSAs provided for the BioVis 2013 redesign contest. A subsequent exploration of the visualised line patterns allowed for the discovery of a number of interesting features in the dataset. Reported features include the extreme conservation of sequences displaying a specific residue and bifurcations of the consensus sequence. Conclusions Sequence Bundles is a novel method for visualisation of MSAs and the discovery of sequence motifs. It can aid in generating new insight and hypothesis making. Sequence Bundles is well disposed for future implementation as an interactive visual analytics software, which can complement existing visualisation tools. PMID:25237395
An efficient approach to BAC based assembly of complex genomes.
Visendi, Paul; Berkman, Paul J; Hayashi, Satomi; Golicz, Agnieszka A; Bayer, Philipp E; Ruperao, Pradeep; Hurgobin, Bhavna; Montenegro, Juan; Chan, Chon-Kit Kenneth; Staňková, Helena; Batley, Jacqueline; Šimková, Hana; Doležel, Jaroslav; Edwards, David
2016-01-01
There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, to variable success. We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
Whole-genome sequencing for comparative genomics and de novo genome assembly.
Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C
2015-01-01
Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).
USDA-ARS?s Scientific Manuscript database
High-throughput next-generation sequencing was used to scan the genome and generate reliable sequence of high copy number regions. Using this method, we examined whole plastid genomes as well as nearly 6000 bases of nuclear ribosomal DNA sequences for nine genotypes of Theobroma cacao and an indivi...
Jun, Goo; Flickinger, Matthew; Hetrick, Kurt N.; Romm, Jane M.; Doheny, Kimberly F.; Abecasis, Gonçalo R.; Boehnke, Michael; Kang, Hyun Min
2012-01-01
DNA sample contamination is a serious problem in DNA sequencing studies and may result in systematic genotype misclassification and false positive associations. Although methods exist to detect and filter out cross-species contamination, few methods to detect within-species sample contamination are available. In this paper, we describe methods to identify within-species DNA sample contamination based on (1) a combination of sequencing reads and array-based genotype data, (2) sequence reads alone, and (3) array-based genotype data alone. Analysis of sequencing reads allows contamination detection after sequence data is generated but prior to variant calling; analysis of array-based genotype data allows contamination detection prior to generation of costly sequence data. Through a combination of analysis of in silico and experimentally contaminated samples, we show that our methods can reliably detect and estimate levels of contamination as low as 1%. We evaluate the impact of DNA contamination on genotype accuracy and propose effective strategies to screen for and prevent DNA contamination in sequencing studies. PMID:23103226
Peck, Michelle A; Sturk-Andreaggi, Kimberly; Thomas, Jacqueline T; Oliver, Robert S; Barritt-Ross, Suzanne; Marshall, Charla
2018-05-01
Generating mitochondrial genome (mitogenome) data from reference samples in a rapid and efficient manner is critical to harnessing the greater power of discrimination of the entire mitochondrial DNA (mtDNA) marker. The method of long-range target enrichment, Nextera XT library preparation, and Illumina sequencing on the MiSeq is a well-established technique for generating mitogenome data from high-quality samples. To this end, a validation was conducted for this mitogenome method processing up to 24 samples simultaneously along with analysis in the CLC Genomics Workbench and utilizing the AQME (AFDIL-QIAGEN mtDNA Expert) tool to generate forensic profiles. This validation followed the Federal Bureau of Investigation's Quality Assurance Standards (QAS) for forensic DNA testing laboratories and the Scientific Working Group on DNA Analysis Methods (SWGDAM) validation guidelines. The evaluation of control DNA, non-probative samples, blank controls, mixtures, and nonhuman samples demonstrated the validity of this method. Specifically, the sensitivity was established at ≥25 pg of nuclear DNA input for accurate mitogenome profile generation. Unreproducible low-level variants were observed in samples with low amplicon yields. Further, variant quality was shown to be a useful metric for identifying sequencing error and crosstalk. Success of this method was demonstrated with a variety of reference sample substrates and extract types. These studies further demonstrate the advantages of using NGS techniques by highlighting the quantitative nature of heteroplasmy detection. The results presented herein from more than 175 samples processed in ten sequencing runs, show this mitogenome sequencing method and analysis strategy to be valid for the generation of reference data. Copyright © 2018 Elsevier B.V. All rights reserved.
Besaratinia, Ahmad; Li, Haiqing; Yoon, Jae-In; Zheng, Albert; Gao, Hanlin; Tommasi, Stella
2012-01-01
Many carcinogens leave a unique mutational fingerprint in the human genome. These mutational fingerprints manifest as specific types of mutations often clustering at certain genomic loci in tumor genomes from carcinogen-exposed individuals. To develop a high-throughput method for detecting the mutational fingerprint of carcinogens, we have devised a cost-, time- and labor-effective strategy, in which the widely used transgenic Big Blue® mouse mutation detection assay is made compatible with the Roche/454 Genome Sequencer FLX Titanium next-generation sequencing technology. As proof of principle, we have used this novel method to establish the mutational fingerprints of three prominent carcinogens with varying mutagenic potencies, including sunlight ultraviolet radiation, 4-aminobiphenyl and secondhand smoke that are known to be strong, moderate and weak mutagens, respectively. For verification purposes, we have compared the mutational fingerprints of these carcinogens obtained by our newly developed method with those obtained by parallel analyses using the conventional low-throughput approach, that is, standard mutation detection assay followed by direct DNA sequencing using a capillary DNA sequencer. We demonstrate that this high-throughput next-generation sequencing-based method is highly specific and sensitive to detect the mutational fingerprints of the tested carcinogens. The method is reproducible, and its accuracy is comparable with that of the currently available low-throughput method. In conclusion, this novel method has the potential to move the field of carcinogenesis forward by allowing high-throughput analysis of mutations induced by endogenous and/or exogenous genotoxic agents. PMID:22735701
Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome
Margulies, Elliott H.; Cooper, Gregory M.; Asimenos, George; Thomas, Daryl J.; Dewey, Colin N.; Siepel, Adam; Birney, Ewan; Keefe, Damian; Schwartz, Ariel S.; Hou, Minmei; Taylor, James; Nikolaev, Sergey; Montoya-Burgos, Juan I.; Löytynoja, Ari; Whelan, Simon; Pardi, Fabio; Massingham, Tim; Brown, James B.; Bickel, Peter; Holmes, Ian; Mullikin, James C.; Ureta-Vidal, Abel; Paten, Benedict; Stone, Eric A.; Rosenbloom, Kate R.; Kent, W. James; Bouffard, Gerard G.; Guan, Xiaobin; Hansen, Nancy F.; Idol, Jacquelyn R.; Maduro, Valerie V.B.; Maskeri, Baishali; McDowell, Jennifer C.; Park, Morgan; Thomas, Pamela J.; Young, Alice C.; Blakesley, Robert W.; Muzny, Donna M.; Sodergren, Erica; Wheeler, David A.; Worley, Kim C.; Jiang, Huaiyang; Weinstock, George M.; Gibbs, Richard A.; Graves, Tina; Fulton, Robert; Mardis, Elaine R.; Wilson, Richard K.; Clamp, Michele; Cuff, James; Gnerre, Sante; Jaffe, David B.; Chang, Jean L.; Lindblad-Toh, Kerstin; Lander, Eric S.; Hinrichs, Angie; Trumbower, Heather; Clawson, Hiram; Zweig, Ann; Kuhn, Robert M.; Barber, Galt; Harte, Rachel; Karolchik, Donna; Field, Matthew A.; Moore, Richard A.; Matthewson, Carrie A.; Schein, Jacqueline E.; Marra, Marco A.; Antonarakis, Stylianos E.; Batzoglou, Serafim; Goldman, Nick; Hardison, Ross; Haussler, David; Miller, Webb; Pachter, Lior; Green, Eric D.; Sidow, Arend
2007-01-01
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization. PMID:17567995
Reiman, Anne; Pandey, Sarojini; Lloyd, Kate L; Dyer, Nigel; Khan, Mike; Crockard, Martin; Latten, Mark J; Watson, Tracey L; Cree, Ian A; Grammatopoulos, Dimitris K
2016-11-01
Background Detection of disease-associated mutations in patients with familial hypercholesterolaemia is crucial for early interventions to reduce risk of cardiovascular disease. Screening for these mutations represents a methodological challenge since more than 1200 different causal mutations in the low-density lipoprotein receptor have been identified. A number of methodological approaches have been developed for screening by clinical diagnostic laboratories. Methods Using primers targeting the low-density lipoprotein receptor, apolipoprotein B, and proprotein convertase subtilisin/kexin type 9, we developed a novel Ion Torrent-based targeted re-sequencing method. We validated this in a West Midlands-UK small cohort of 58 patients screened in parallel with other mutation-targeting methods, such as multiplex polymerase chain reaction (Elucigene FH20), oligonucleotide arrays (Randox familial hypercholesterolaemia array) or the Illumina next-generation sequencing platform. Results In this small cohort, the next-generation sequencing method achieved excellent analytical performance characteristics and showed 100% and 89% concordance with the Randox array and the Elucigene FH20 assay. Investigation of the discrepant results identified two cases of mutation misclassification of the Elucigene FH20 multiplex polymerase chain reaction assay. A number of novel mutations not previously reported were also identified by the next-generation sequencing method. Conclusions Ion Torrent-based next-generation sequencing can deliver a suitable alternative for the molecular investigation of familial hypercholesterolaemia patients, especially when comprehensive mutation screening for rare or unknown mutations is required.
Genovo: De Novo Assembly for Metagenomes
NASA Astrophysics Data System (ADS)
Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne
Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.
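The Chinese restaurant process prior mentioned above lets the number of genomes in the sample remain unbounded: read i joins an existing cluster k with probability proportional to its current size, or opens a new cluster with probability proportional to a concentration parameter α. A minimal sketch of drawing such assignments (the sampling step only, not Genovo's full generative model or its hill-climbing inference):

```python
import random

def crp_assignments(n, alpha, rng=None):
    """Seat n items (reads) at an unbounded number of tables
    (candidate genomes) under a Chinese restaurant process: table k
    with count n_k attracts the next item with weight n_k, a new
    table with weight alpha."""
    rng = rng or random.Random(0)
    counts, labels = [], []
    for _ in range(n):
        weights = counts + [alpha]             # existing tables + new table
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)                   # open a new table
        else:
            counts[k] += 1
        labels.append(k)
    return labels
```

Larger α yields more clusters on average, matching the intuition that a more diverse sample should be explained by more genomes.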
Barbau-Piednoir, Elodie; De Keersmaecker, Sigrid C J; Delvoye, Maud; Gau, Céline; Philipp, Patrick; Roosens, Nancy H
2015-11-11
Recently, the presence of an unauthorized genetically modified (GM) Bacillus subtilis bacterium overproducing vitamin B2 in a feed additive was notified by the Rapid Alert System for Food and Feed (RASFF). This has demonstrated that contamination by a GM micro-organism (GMM) may occur in feed additives and has confronted the enforcement laboratories for the first time with this type of RASFF. As no sequence information of this GMM nor any specific detection or identification method was available, Next Generation Sequencing (NGS) was used to generate sequence information. However, NGS data analysis often requires appropriate tools, involving bioinformatics expertise which is not always present in the average enforcement laboratory. This hampers the use of this technology to rapidly obtain critical sequence information in order to be able to develop a specific qPCR detection method. Data generated by NGS were exploited using a simple BLAST approach. A TaqMan® qPCR method was developed and tested on isolated bacterial strains and on the feed additive directly. In this study, a very simple strategy based on the common BLAST tools, which can be used by any enforcement lab without profound bioinformatics expertise, was successfully used to analyse the B. subtilis data generated by NGS. The results were used to design and assess a new TaqMan® qPCR method, specifically detecting this GM vitamin B2 overproducing bacterium. The method complies with EU critical performance parameters for specificity, sensitivity, PCR efficiency and repeatability. The VitB2-UGM method also could detect the B. subtilis strain in genomic DNA extracted from the feed additive, without a prior culturing step. The proposed method provides a crucial tool for specifically and rapidly identifying this unauthorized GM bacterium in food and feed additives by enforcement laboratories.
Moreover, this work can be seen as a case study to substantiate how the use of NGS data can offer an added value to easily gain access to sequence information needed to develop qPCR methods to detect unknown and unauthorized GMOs in food and feed.
Probabilistic topic modeling for the analysis and classification of genomic sequences
2015-01-01
Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques. Methods The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied on DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734
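The preprocessing step that makes topic modeling applicable to DNA — treating each sequence as a "document" whose "words" are its overlapping fixed-length k-mers — can be sketched directly. Only the tokenization and bag-of-words counting are shown here; the LDA training itself would be delegated to a standard library and is not part of this sketch:

```python
from collections import Counter

def kmer_tokens(seq, k=4):
    """Decompose a DNA sequence into its overlapping k-mers,
    the 'words' handed to the topic model."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_counts(seq, k=4):
    """Bag-of-words representation (k-mer -> frequency) of one sequence,
    i.e. one row of the document-term matrix fed to LDA."""
    return Counter(kmer_tokens(seq, k))
```

For example, `kmer_tokens('ACGTA', 4)` yields the two 4-mers `['ACGT', 'CGTA']`; stacking such counts over a corpus of barcode sequences produces the document-term matrix on which LDA infers topics.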
Montesino, Marta; Prieto, Lourdes
2012-01-01
Cycle sequencing reaction with Big-Dye terminators provides the methodology to analyze mtDNA Control Region amplicons by means of capillary electrophoresis. DNA sequencing with ddNTPs, or terminators, was developed by (1). The progressive automation of the method, by combining the use of fluorescent-dye terminators with cycle sequencing, has made it possible to increase the sensitivity and efficiency of the method and hence has allowed its introduction into the forensic field. PCR-generated mitochondrial DNA products are the templates for sequencing reactions. Different sets of primers can be used to generate amplicons of different sizes according to the quality and quantity of the DNA extract, providing sequence data for different ranges inside the Control Region.
Multi-site Stochastic Simulation of Daily Streamflow with Markov Chain and KNN Algorithm
NASA Astrophysics Data System (ADS)
Mathai, J.; Mujumdar, P.
2017-12-01
A key focus of this study is to develop a method which is physically consistent with the hydrologic processes and can capture short-term characteristics of the daily hydrograph as well as the correlation of streamflow in the temporal and spatial domains. In complex water resource systems, flow fluctuations at small time intervals require that discretisation be done at small time scales such as daily scales. Also, simultaneous generation of synthetic flows at different sites in the same basin is required. We propose a method to equip water managers with a streamflow generator within a stochastic streamflow simulation framework. The motivation for the proposed method is to generate sequences that extend beyond the variability represented in the historical record of the streamflow time series. The method has two steps: in step 1, daily flow is generated independently at each station by a two-state Markov chain, with rising-limb increments randomly sampled from a Gamma distribution and the falling limb modelled as an exponential recession; in step 2, the streamflow generated in step 1 is input to a nonparametric K-nearest neighbor (KNN) time series bootstrap resampler. The KNN model, being data driven, does not require assumptions on the dependence structure of the time series. A major limitation of KNN-based streamflow generators is that they do not produce new values, but merely reshuffle the historical data to generate realistic streamflow sequences. The Markov chain approach, in contrast, is capable of generating a rich variety of daily flow sequences. Furthermore, the rising and falling limbs of the daily hydrograph represent different physical processes, and hence they need to be modelled individually. Thus, our method combines the strengths of the two approaches. We show the utility of the method and the improvement over the traditional KNN approach by simulating daily streamflow sequences at 7 locations in the Godavari River basin in India.
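Step 1 of the scheme (two-state Markov chain with Gamma rising-limb increments and exponential recession) can be sketched as below. All parameter values are illustrative placeholders, not calibrated to any basin, and the KNN resampling of step 2 is omitted.

```python
import random

def generate_daily_flow(n_days, p_rise=0.3, p_fall=0.6,
                        gamma_shape=2.0, gamma_scale=5.0,
                        recession_k=0.9, q0=10.0, seed=42):
    """Two-state (rise/fall) Markov chain daily flow sketch.

    Rising-limb increments are Gamma-distributed; the falling limb
    follows an exponential recession Q[t+1] = k * Q[t]. Transition
    probabilities and all other parameters are illustrative.
    """
    rng = random.Random(seed)
    flows, q, state = [q0], q0, "rise"
    for _ in range(n_days - 1):
        if state == "rise":
            q += rng.gammavariate(gamma_shape, gamma_scale)
            if rng.random() < p_fall:   # transition rise -> fall
                state = "fall"
        else:
            q *= recession_k            # exponential recession
            if rng.random() < p_rise:   # transition fall -> rise
                state = "rise"
        flows.append(q)
    return flows

series = generate_daily_flow(365)
```

In the full method, a series like this (one per station) would then be fed to a KNN bootstrap resampler to restore temporal and spatial dependence.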
Zheng, Guanglou; Fang, Gengfa; Shankaran, Rajan; Orgun, Mehmet A; Zhou, Jie; Qiao, Li; Saleem, Kashif
2017-05-01
Generating random binary sequences (BSes) is a fundamental requirement in cryptography. A BS is a sequence of N bits, and each bit has a value of 0 or 1. For securing sensors within wireless body area networks (WBANs), electrocardiogram (ECG)-based BS generation methods have been widely investigated in which interpulse intervals (IPIs) from each heartbeat cycle are processed to produce BSes. Using these IPI-based methods to generate a 128-bit BS in real time normally takes around half a minute. In order to improve the time efficiency of such methods, this paper presents an ECG multiple fiducial-points based binary sequence generation (MFBSG) algorithm. The technique of discrete wavelet transforms is employed to detect the arrival times of fiducial points, such as the P, Q, R, S, and T peaks. Time intervals between them, including RR, RQ, RS, RP, and RT intervals, are then calculated from these arrival times and used as ECG features to generate random BSes with low latency. According to our analysis of real ECG data, these ECG feature values exhibit the property of randomness and thus can be utilized to generate random BSes. Compared with schemes that rely solely on IPIs to generate BSes, the MFBSG algorithm uses five feature values from one heartbeat cycle, and can be up to five times faster than the solely IPI-based methods. It thus achieves the design goal of low latency. According to our analysis, the complexity of the algorithm is comparable to that of fast Fourier transforms. These randomly generated ECG BSes can be used as security keys for encryption or authentication in a WBAN system.
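The interval-to-bits step can be illustrated with the common approach of keeping a few low-order bits of each quantized timing feature; the exact MFBSG encoding differs, and the interval values below are invented for the example.

```python
def intervals_to_bits(intervals_ms, bits_per_interval=4):
    """Sketch: extract low-order bits from quantized ECG time intervals.

    Illustrates the general idea of mining randomness from inter-peak
    timing jitter; the paper's actual MFBSG encoding is more involved.
    """
    out = []
    for iv in intervals_ms:
        q = int(round(iv))                  # quantize to 1 ms
        for b in range(bits_per_interval):  # take least-significant bits
            out.append((q >> b) & 1)
    return out

# Five features per beat (e.g. RR, RQ, RS, RP, RT) instead of one IPI,
# which is where the up-to-5x speedup comes from. Values are illustrative.
bits = intervals_to_bits([812.3, 160.7, 95.2, 171.4, 402.9])
```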
Meher, J K; Meher, P K; Dash, G N; Raval, M K
2012-01-01
The first step in the gene identification problem based on genomic signal processing is to convert character strings into numerical sequences. These numerical sequences are then analysed spectrally, or using digital filtering techniques, for the period-3 peaks which are present in exons (coding regions) and absent in introns (non-coding regions). In this paper, we show that single-indicator sequences can be generated by encoding schemes based on physico-chemical properties. Two new methods are proposed for generating single-indicator sequences based on hydration energy and dipole moments. The proposed methods produce high peaks at exon locations and effectively suppress false exons (intron regions with higher peaks than exon regions), resulting in a high discrimination factor, sensitivity and specificity.
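The classic period-3 measure underlying such methods can be sketched with plain 0/1 indicator sequences and the DFT coefficient at frequency N/3; the paper's contribution is to replace the 0/1 encoding with physico-chemical weights, which is not reproduced here, and the test strings are illustrative.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence for one nucleotide."""
    return [1.0 if c == base else 0.0 for c in seq]

def period3_power(seq):
    """Total power of the four indicator spectra at frequency k = N/3,
    the classic period-3 exon signature (sketch with 0/1 indicators,
    not the paper's physico-chemical encodings)."""
    n = len(seq)
    k = n // 3
    total = 0.0
    for base in "ACGT":
        x = indicator(seq, base)
        X = sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                for m in range(n))
        total += abs(X) ** 2
    return total

coding_like = period3_power("ACGACGACGACG")  # perfect period-3 repeat
homopolymer = period3_power("AAAAAAAAAAAA")  # no period-3 structure
```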
Choi, Jung-Han; Lim, Young-Jun; Kim, Chang-Whe; Kim, Myung-Joo
2009-01-01
This study evaluated the effect of different screw-tightening sequences, forces, and methods on the stresses generated on a well-fitting internal-connection implant (Astra Tech) superstructure. A metal framework directly connected to four parallel implants was fabricated on a fully edentulous mandibular resin model. Six stone casts with four implant replicas were made from a pickup impression of the superstructure to represent a "well-fitting" situation. Stresses generated by four screw-tightening sequences (1-2-3-4, 4-3-2-1, 2-4-3-1, and 2-3-1-4), two forces (10 and 20 Ncm), and two methods (one-step and two-step) were evaluated. In the two-step method, screws were tightened to the initial torque (10 Ncm) in a predetermined screw-tightening sequence and then to the final torque (20 Ncm) in the same sequence. Stresses were recorded twice by three strain gauges attached to the framework (superior face midway between abutments). Deformation data were analyzed using multiple analysis of variance at a .05 level of statistical significance. In all stone casts, stresses were produced by the superstructure connection, regardless of screw-tightening sequence, force, and method. No statistically significant differences for superstructure preload stresses were found based on screw-tightening sequences (-180.0 to -181.6 microm/m) or forces (-163.4 and -169.2 microm/m) (P > .05). However, different screw-tightening methods induced different stresses on the superstructure. The two-step screw-tightening method (-180.1 microm/m) produced significantly higher stress than the one-step method (-169.2 microm/m) (P = .0457). Within the limitations of this in vitro study, screw-tightening sequence and force were not critical factors in the stress generated on a well-fitting internal-connection implant superstructure. The stress caused by the two-step method was greater than that produced using the one-step method. 
Further studies are needed to evaluate the effect of screw-tightening techniques on preload stress in various clinical situations.
GENESUS: a two-step sequence design program for DNA nanostructure self-assembly.
Tsutsumi, Takanobu; Asakawa, Takeshi; Kanegami, Akemi; Okada, Takao; Tahira, Tomoko; Hayashi, Kenshi
2014-01-01
DNA has been recognized as an ideal material for bottom-up construction of nanometer scale structures by self-assembly. The generation of sequences optimized for unique self-assembly (GENESUS) program reported here is a straightforward method for generating sets of strand sequences optimized for self-assembly of arbitrarily designed DNA nanostructures by a generate-candidates-and-choose-the-best strategy. A scalable procedure to prepare single-stranded DNA having arbitrary sequences is also presented. Strands for the assembly of various structures were designed and successfully constructed, validating both the program and the procedure.
Choi, Jung-Han
2011-01-01
This study aimed to evaluate the effect of different screw-tightening sequences, torques, and methods on the strains generated on an internal-connection implant (Astra Tech) superstructure with good fit. An edentulous mandibular master model and a metal framework directly connected to four parallel implants with a passive fit to each other were fabricated. Six stone casts were made from a dental stone master model by a splinted impression technique to represent a well-fitting situation with the metal framework. Strains generated by four screw-tightening sequences (1-2-3-4, 4-3-2-1, 2-4-3-1, and 2-3-1-4), two torques (10 and 20 Ncm), and two methods (one-step and two-step) were evaluated. In the two-step method, screws were tightened to the initial torque (10 Ncm) in a predetermined screw-tightening sequence and then to the final torque (20 Ncm) in the same sequence. Strains were recorded twice by three strain gauges attached to the framework (superior face midway between abutments). Deformation data were analyzed using multiple analysis of variance at a .05 level of statistical significance. In all stone casts, strains were produced by connection of the superstructure, regardless of screw-tightening sequence, torque, and method. No statistically significant differences in superstructure strains were found based on screw-tightening sequences (range, -409.8 to -413.8 μm/m), torques (-409.7 and -399.1 μm/m), or methods (-399.1 and -410.3 μm/m). Within the limitations of this in vitro study, screw-tightening sequence, torque, and method were not critical factors for the strain generated on a well-fitting internal-connection implant superstructure by the splinted impression technique. Further studies are needed to evaluate the effect of screw-tightening techniques on the preload stress in various clinical situations.
Using Maximum Entropy to Find Patterns in Genomes
NASA Astrophysics Data System (ADS)
Liu, Sophia; Hockenberry, Adam; Lancichinetti, Andrea; Jewett, Michael; Amaral, Luis
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to be able to generate accurate null models that are appropriate for the sequences under study. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. This approach can also easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes. National Institute of General Medical Science, Northwestern University Presidential Fellowship, National Science Foundation, David and Lucile Packard Foundation, Camille Dreyfus Teacher Scholar Award.
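A maximum-entropy constraint of this kind is typically enforced by exponentially tilting the uniform codon distribution. The sketch below shows the tilting idea only: the codon table is truncated to a few amino acids for brevity, the tilt parameter `lam` is fixed by hand rather than solved for a target GC content, and none of this reproduces the paper's actual tool.

```python
import math
import random

# Truncated codon table for illustration (a few amino acids only).
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "G": ["GGA", "GGC", "GGG", "GGT"],
    "L": ["TTA", "TTG", "CTA", "CTC", "CTG", "CTT"],
}

def gc(codon):
    """Number of G or C bases in a codon."""
    return sum(c in "GC" for c in codon)

def sample_sequence(protein, lam, seed=0):
    """Max-entropy-style sketch: among the synonymous codons for each
    amino acid, sample with probability proportional to exp(lam * GC).
    lam = 0 gives the uniform (unconstrained maximum-entropy) choice;
    in practice lam would be tuned numerically to hit a target GC
    content (tuning not shown)."""
    rng = random.Random(seed)
    out = []
    for aa in protein:
        codons = CODONS[aa]
        weights = [math.exp(lam * gc(c)) for c in codons]
        out.append(rng.choices(codons, weights=weights)[0])
    return "".join(out)

low_gc = sample_sequence("MKGL", lam=-5.0)   # tilt toward AT-rich codons
high_gc = sample_sequence("MKGL", lam=5.0)   # tilt toward GC-rich codons
```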
A robust and cost-effective approach to sequence and analyze complete genomes of small RNA viruses
USDA-ARS?s Scientific Manuscript database
Background: Next-generation sequencing (NGS) allows ultra-deep sequencing of nucleic acids. The use of sequence-independent amplification of viral nucleic acids without utilization of target-specific primers provides advantages over traditional sequencing methods and allows detection of unsuspected ...
Analysis of plant microbe interactions in the era of next generation sequencing technologies
Knief, Claudia
2014-01-01
Next generation sequencing (NGS) technologies have impressively accelerated research in the biological sciences in recent years by enabling the production of large volumes of sequence data at a drastically lower price per base compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed, and methods that help to overcome these shortcomings are described. A focus is on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant-associated microbiota to demonstrate the worth of the new methods. PMID:24904612
Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies
Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...
2015-04-14
During the past decade, DNA sequencing output has been mostly dominated by second-generation sequencing platforms, which are characterized by low cost, high throughput and shorter read lengths, for example Illumina. The emergence and development of so-called third-generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high-quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches, and will encourage interest in the development of innovative experimental and computational methods for NGS data.
NASA Astrophysics Data System (ADS)
Choudhary, Kuldeep; Kumar, Santosh
2017-05-01
The application of the electro-optic effect in a lithium-niobate-based Mach-Zehnder interferometer to design a 3-bit optical pseudorandom binary sequence (PRBS) generator has been proposed, which is characterized by its simplicity of generation and stability. The proposed device is optoelectronic in nature. The PRBS generator is immensely applicable for pattern generation, encryption, and coding applications in optical networks. The study is carried out by simulating the proposed device with the beam propagation method.
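The logic of a 3-bit PRBS generator is that of a maximal-length linear feedback shift register (LFSR), which repeats with period 2^3 - 1 = 7; the interferometer realizes this logic optically. A software model of the bit sequence, assuming the standard feedback polynomial x^3 + x^2 + 1 (the abstract does not specify the polynomial), can be sketched as:

```python
def prbs3(seed=0b001, taps=(3, 2), n=7):
    """3-bit Fibonacci LFSR sketch (assumed polynomial x^3 + x^2 + 1).

    Outputs the register's least significant bit each cycle; feedback
    is the XOR of the tapped bits. A maximal-length 3-bit LFSR cycles
    through all 7 nonzero states, so the sequence has period 7.
    """
    state = seed & 0b111
    out = []
    for _ in range(n):
        out.append(state & 1)  # output LSB
        fb = ((state >> (taps[0] - 1)) ^ (state >> (taps[1] - 1))) & 1
        state = ((state << 1) | fb) & 0b111
    return out

seq = prbs3()  # one full period of the PRBS
```

A maximal-length property worth checking: one period contains 2^(n-1) = 4 ones and 3 zeros, and the sequence repeats exactly every 7 bits.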
Masking as an effective quality control method for next-generation sequencing data analysis.
Yun, Sajung; Yun, Sijung
2014-12-13
Next generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low quality base calls with 'N's (undetermined bases), whereas trimming removes low quality bases, which results in shorter read lengths. We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither preprocessing method affected the false-negative rate in SNP calling with statistical significance compared to the data analysis without preprocessing. False-positive and false-negative rates for small insertions and deletions did not differ between masking and trimming. We recommend masking over trimming as the more effective preprocessing method for next generation sequencing data analysis, since masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate, although trimming is currently more commonly used in the field. The perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).
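The two preprocessing strategies being compared can be sketched in a few lines; the quality threshold and the read/quality values are illustrative, and this is not the paper's Perl implementation.

```python
def mask_read(bases, quals, threshold=20):
    """Masking sketch: replace bases whose Phred quality is below
    `threshold` with 'N', keeping the read length intact."""
    return "".join(b if q >= threshold else "N"
                   for b, q in zip(bases, quals))

def trim_read(bases, quals, threshold=20):
    """Trimming sketch: drop low-quality bases, shortening the read."""
    return "".join(b for b, q in zip(bases, quals) if q >= threshold)

read = "ACGTACGT"
quals = [30, 30, 5, 30, 12, 30, 30, 30]  # illustrative Phred scores
masked = mask_read(read, quals)
trimmed = trim_read(read, quals)
```

Masking preserves read length and base positions (the property the paper credits for the lower false-positive rate), while trimming changes both.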
Suyama, Yoshihisa; Matsuki, Yu
2015-01-01
Restriction-enzyme (RE)-based next-generation sequencing methods have revolutionized marker-assisted genetic studies; however, the use of REs has limited their widespread adoption, especially for field samples with low-quality DNA and/or small quantities of DNA. Here, we developed a PCR-based procedure to construct reduced-representation libraries without RE digestion steps, enabling de novo single-nucleotide polymorphism discovery and genotyping using next-generation sequencing. Using multiplexed inter-simple sequence repeat (ISSR) primers, thousands of genome-wide regions were amplified effectively from a wide variety of genomes, without prior genetic information. We demonstrated: 1) Mendelian gametic segregation of the discovered variants; 2) reproducibility of genotyping, by checking its applicability for individual identification; and 3) applicability to a wide variety of species, by performing standard population genetic analyses. This approach, called multiplexed ISSR genotyping by sequencing, should be applicable to many marker-assisted genetic studies with a wide range of DNA qualities and quantities. PMID:26593239
Method for phosphorothioate antisense DNA sequencing by capillary electrophoresis with UV detection.
Froim, D; Hopkins, C E; Belenky, A; Cohen, A S
1997-11-01
The progress of antisense DNA therapy demands the development of reliable and convenient methods for sequencing short single-stranded oligonucleotides. A method of phosphorothioate antisense DNA sequencing analysis using UV detection coupled to capillary electrophoresis (CE) has been developed based on a modified chain-termination sequencing method. The proposed method reduces the sequencing cost since it uses affordable CE-UV instrumentation and requires no labeling, with minimal sample processing before analysis. Cycle sequencing with ThermoSequenase generates quantities of sequencing products that are readily detectable by UV. Discrimination of undesired components from sequencing products in the reaction mixture, previously accomplished by fluorescent or radioactive labeling, is now achieved by bringing concentrations of undesired components below the UV detection range, which yields a 'clean', well-defined sequence. UV detection coupled with CE offers additional conveniences for sequencing since it can be accomplished with commercially available CE-UV equipment and is readily amenable to automation.
USDA-ARS?s Scientific Manuscript database
Background: Vertebrate immune systems generate diverse repertoires of antibodies capable of mediating response to a variety of antigens. Next generation sequencing methods provide unique approaches to a number of immuno-based research areas including antibody discovery and engineering, disease surve...
Wu, Wei; Lu, Chao-Xia; Wang, Yi-Ning; Liu, Fang; Chen, Wei; Liu, Yong-Tai; Han, Ye-Chen; Cao, Jian; Zhang, Shu-Yang; Zhang, Xue
2015-07-10
MYBPC3 dysfunctions have been proven to induce dilated cardiomyopathy, hypertrophic cardiomyopathy, and/or left ventricular noncompaction; however, the genotype-phenotype correlation between MYBPC3 and restrictive cardiomyopathy (RCM) has not been established. The newly developed next-generation sequencing method is capable of broad genomic DNA sequencing with high throughput and can help explore novel correlations between genetic variants and cardiomyopathies. A proband from a multigenerational family with 3 live patients and 1 unrelated patient with clinical diagnoses of RCM underwent a next-generation sequencing workflow based on a custom AmpliSeq panel, including 64 candidate pathogenic genes for cardiomyopathies, on the Ion Personal Genome Machine high-throughput sequencing benchtop instrument. The selected panel contained a total of 64 genes that were reportedly associated with inherited cardiomyopathies. All patients fulfilled strict criteria for RCM with clinical characteristics, echocardiography, and/or cardiac magnetic resonance findings. The multigenerational family with 3 adult RCM patients carried an identical nonsense MYBPC3 mutation, and the unrelated patient carried a missense mutation in the MYBPC3 gene. All of these results were confirmed by the Sanger sequencing method. This study demonstrated that MYBPC3 gene mutations, revealed by next-generation sequencing, were associated with familial and sporadic RCM patients. It is suggested that the next-generation sequencing platform with a selected panel provides a highly efficient approach for molecular diagnosis of hereditary and idiopathic RCM and helps build new genotype-phenotype correlations. © 2015 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley Blackwell.
Li, Haonan; Jin, Peng; Hao, Qian; Zhu, Wei; Chen, Xia; Wang, Ping
2017-11-01
Waardenburg syndrome (WS) is a rare autosomal dominant disorder associated with pigmentation abnormalities and sensorineural hearing loss. In this study, we investigated the genetic cause of WSII in a patient and evaluated the reliability of the targeted next-generation exome sequencing method for the genetic diagnosis of WS. Clinical evaluations were conducted on the patient and targeted next-generation sequencing (NGS) was used to identify the candidate genes responsible for WSII. Multiplex ligation-dependent probe amplification (MLPA) and real-time quantitative polymerase chain reaction (qPCR) were performed to confirm the targeted NGS results. Targeted NGS detected the entire deletion of the coding sequence (CDS) of the SOX10 gene in the WSII patient. MLPA results indicated that all exons of the SOX10 heterozygous deletion were detected; no aberrant copy number in the PAX3 and microphthalmia-associated transcription factor (MITF) genes was found. Real-time qPCR results identified the mutation as a de novo heterozygous deletion. This is the first report of using a targeted NGS method for WS candidate gene sequencing; its accuracy was verified by using the MLPA and qPCR methods. Our research provides a valuable method for the genetic diagnosis of WS.
BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences
Gao, Jianzhao; Faraggi, Eshel; Zhou, Yaoqi; Ruan, Jishou; Kurgan, Lukasz
2012-01-01
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some methods that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors, including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering predictions from antigen chains and from chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction of 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which are, respectively, significantly and slightly better than the next-best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods. PMID:22761950
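The sliding-20-mer averaging at the core of the architecture can be sketched as follows: each 20-mer window gets one predictor score, and a residue's final score is the mean over all windows covering it. The window scores below are invented placeholders, not real SVM outputs, and the SVM itself is not reproduced.

```python
def average_window_scores(window_scores, window=20):
    """BEST-style aggregation sketch: window_scores[i] is the predictor
    score for residues i .. i+window-1; each residue's score is the
    mean over all windows that cover it."""
    chain_len = len(window_scores) + window - 1
    sums = [0.0] * chain_len
    counts = [0] * chain_len
    for i, s in enumerate(window_scores):
        for j in range(i, i + window):
            sums[j] += s
            counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

# Three overlapping 20-mer scores on a 22-residue chain (illustrative).
per_residue = average_window_scores([0.2, 0.8, 0.5], window=20)
```

Terminal residues are covered by fewer windows, so their scores come from fewer predictions, which is why the denominator is tracked per residue.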
Next-generation sequencing for endocrine cancers: Recent advances and challenges.
Suresh, Padmanaban S; Venkatesh, Thejaswini; Tsutsumi, Rie; Shetty, Abhishek
2017-05-01
Contemporary molecular biology research tools have enriched numerous areas of biomedical research that address challenging diseases, including endocrine cancers (pituitary, thyroid, parathyroid, adrenal, testicular, ovarian, and neuroendocrine cancers). These tools have placed several intriguing clues before the scientific community. Endocrine cancers pose a major challenge in health care and research despite considerable attempts by researchers to understand their etiology. Microarray analyses have provided gene signatures from many cells, tissues, and organs that can differentiate healthy states from diseased ones, and even show patterns that correlate with stages of a disease. Microarray data can also elucidate the responses of endocrine tumors to therapeutic treatments. The rapid progress in next-generation sequencing methods has overcome many of the initial challenges of these technologies, and their advantages over microarray techniques have enabled them to emerge as valuable aids for clinical research applications (prognosis, identification of drug targets, etc.). A comprehensive review describing the recent advances in next-generation sequencing methods and their application in the evaluation of endocrine and endocrine-related cancers is lacking. The main purpose of this review is to illustrate the concepts that collectively constitute our current view of the possibilities offered by next-generation sequencing technological platforms, challenges to relevant applications, and perspectives on the future of clinical genetic testing of patients with endocrine tumors. We focus on recent discoveries in the use of next-generation sequencing methods for clinical diagnosis of endocrine tumors in patients and conclude with a discussion on persisting challenges and future objectives.
Next-generation sequencing: the future of molecular genetics in poultry production and food safety.
Diaz-Sanchez, S; Hanning, I; Pendleton, Sean; D'Souza, Doris
2013-02-01
The era of molecular biology and automation of the Sanger chain-terminator sequencing method has led to discovery and advances in diagnostics and biotechnology. The Sanger methodology dominated research for over 2 decades, leading to significant accomplishments and technological improvements in DNA sequencing. Next-generation high-throughput sequencing (HT-NGS) technologies were developed subsequently to overcome the limitations of this first-generation technology, offering higher speed, less labor, and lower cost. The various platforms developed include sequencing-by-synthesis (454 Life Sciences), Illumina (Solexa) sequencing, SOLiD sequencing (among others), and the Ion Torrent semiconductor sequencing technologies, which use different detection principles. As technology advances, progress toward third-generation sequencing technologies is being reported; these include nanopore sequencing and real-time monitoring of PCR activity through fluorescence resonance energy transfer. The advantages of these technologies include scalability and simplicity, with increasing DNA polymerase performance and yields, lower error rates, and greater economic feasibility, with the eventual goal of obtaining real-time results. These technologies can be directly applied to improve poultry production and enhance food safety. For example, sequence-based (determination of the gut microbial community, genes for metabolic pathways, or presence of plasmids) and function-based (screening for function such as antibiotic resistance, or vitamin production) metagenomic analyses can be carried out. Gut microbial flora/communities of poultry can be sequenced to determine the changes that affect health and disease, along with the efficacy of methods to control pathogenic growth. Thus, the purpose of this review is to provide an overview of the principles of these current technologies and their potential application to improve poultry production and food safety as well as public health.
Delay test generation for synchronous sequential circuits
NASA Astrophysics Data System (ADS)
Devadas, Srinivas
1989-05-01
We address the problem of generating tests for delay faults in non-scan synchronous sequential circuits. Delay test generation for sequential circuits is a considerably more difficult problem than delay testing of combinational circuits and has received much less attention. In this paper, we present a method for generating test sequences to detect delay faults in sequential circuits using the stuck-at fault sequential test generator STALLION. The method is complete in that it will generate a delay test sequence for a targeted fault given sufficient CPU time, if such a sequence exists. We term faults for which no delay test sequence exists, under our test methodology, sequentially delay redundant. We describe means of eliminating sequential delay redundancies in logic circuits. We present a partial-scan methodology for enhancing the testability of difficult-to-test or untestable sequential circuits, wherein a small number of flip-flops are selected and made controllable/observable. The selection process guarantees the elimination of all sequential delay redundancies. We show that an intimate relationship exists between state assignment and delay testability of a sequential machine. We describe a state assignment algorithm for the synthesis of sequential machines with maximal delay fault testability. Preliminary experimental results using the test generation, partial-scan, and synthesis algorithms are presented.
Single-Molecule Electrical Random Resequencing of DNA and RNA
NASA Astrophysics Data System (ADS)
Ohshiro, Takahito; Matsubara, Kazuki; Tsutsui, Makusu; Furuhashi, Masayuki; Taniguchi, Masateru; Kawai, Tomoji
2012-07-01
Two paradigm shifts in DNA sequencing technologies--from bulk to single molecules and from optical to electrical detection--are expected to realize label-free, low-cost DNA sequencing that does not require PCR amplification, leading to the development of high-throughput third-generation sequencing technologies for personalized medicine. Although nanopore devices have been proposed as third-generation DNA-sequencing devices, a significant milestone in these technologies has been attained by demonstrating a novel technique for resequencing DNA using electrical signals. Here we report single-molecule electrical resequencing of DNA and RNA using a hybrid method of identifying single-base molecules via tunneling currents and random sequencing. Our method reads the sequences of nine types of DNA oligomers. The complete sequence of 5'-UGAGGUA-3' from the let-7 microRNA family was also identified by creating a composite of overlapping fragment sequences, which were randomly determined using the tunneling current conducted by single-base molecules as they passed between a pair of nanoelectrodes.
Heinke, Florian; Bittrich, Sebastian; Kaiser, Florian; Labudde, Dirk
2016-01-01
To understand the molecular function of biopolymers, studying their structural characteristics is of central importance. Graphics programs are often used to visualize these properties, but with the increasing number of available structures in databases, and of structure models produced by automated modeling frameworks, this process requires assistance from tools that allow automated structure visualization. In this paper, a web server and its underlying method for generating graphical sequence representations of molecular structures are presented. The method, called SequenceCEROSENE (color encoding of residues obtained by spatial neighborhood embedding), retrieves the sequence of each amino acid or nucleotide chain in a given structure and produces a color coding for each residue based on three-dimensional structure information. From this, color-highlighted sequences are obtained, where the coloring represents each residue's three-dimensional location in the structure. This color encoding thus provides a one-dimensional representation from which spatial interactions, proximity, and relations between residues or entire chains can be deduced quickly and solely from color similarity. Furthermore, additional heteroatoms and chemical compounds bound to the structure, such as ligands or coenzymes, are processed and reported as well. To provide free access to SequenceCEROSENE, a web server has been implemented that allows generating color codings for structures deposited in the Protein Data Bank or for structure models uploaded by the user. Besides retrieving visualizations in popular graphic formats, the underlying raw data can be downloaded as well. In addition, the server provides user interactivity with generated visualizations and the three-dimensional structure in question.
Color-encoded sequences generated by SequenceCEROSENE can help researchers quickly perceive the general characteristics of a structure of interest (or entire sets of complexes), thus supporting the researcher in the initial phase of structure-based studies. In this respect, the web server can be a valuable tool, as users are allowed to process multiple structures, quickly switch between results, and interact with generated visualizations in an intuitive manner. The SequenceCEROSENE web server is available at https://biosciences.hs-mittweida.de/seqcerosene.
NASA Astrophysics Data System (ADS)
Liu, Qiong; Wang, Wen-xi; Zhu, Ke-ren; Zhang, Chao-yong; Rao, Yun-qing
2014-11-01
Mixed-model assembly line sequencing is significant in reducing the production time and overall cost of production. To improve production efficiency, a mathematical model aiming to simultaneously minimize overtime, idle time, and total set-up costs is developed. To obtain high-quality and stable solutions, an advanced scatter search approach is proposed. In the proposed algorithm, a new diversification generation method based on a genetic algorithm is presented to generate a set of diverse, high-quality initial solutions. Several components, including reference set update, subset generation, solution combination, and improvement methods, are designed to maintain population diversity and to obtain high-quality solutions. The proposed model and algorithm are applied and validated in a case company. The results indicate that the proposed advanced scatter search approach is significant for mixed-model assembly line sequencing in this company.
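The scatter-search components named in the abstract (diversification, reference set update, subset generation, solution combination, improvement) can be sketched on a toy objective. Everything below is illustrative: the cost function simply counts adjacent model changeovers rather than the paper's overtime/idle-time/set-up objective, and `combine` and `improve` are generic stand-ins for the authors' operators.

```python
import random
from collections import Counter

def setup_cost(seq):
    """Toy objective: number of adjacent model changeovers."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

def combine(p1, p2):
    """Solution combination: keep a prefix of p1, fill the remaining
    demand following p2's order (multiset-safe order crossover)."""
    cut = random.randrange(1, len(p1))
    head = p1[:cut]
    remaining = Counter(p1) - Counter(head)
    tail = []
    for m in p2:
        if remaining[m] > 0:
            tail.append(m)
            remaining[m] -= 1
    return head + tail

def improve(seq):
    """Improvement method: first-improvement pairwise-swap local search."""
    best = list(seq)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = list(best)
                cand[i], cand[j] = cand[j], cand[i]
                if setup_cost(cand) < setup_cost(best):
                    best, improved = cand, True
    return best

def scatter_search(demand, pool_size=20, ref_size=5, iters=30, seed=0):
    random.seed(seed)
    base = [m for m, d in demand.items() for _ in range(d)]
    # diversification: random shuffles stand in for the GA-based generator
    pool = [random.sample(base, len(base)) for _ in range(pool_size)]
    ref = sorted(pool, key=setup_cost)[:ref_size]      # reference set
    for _ in range(iters):
        a, b = random.sample(ref, 2)                   # subset generation
        child = improve(combine(a, b))                 # combine + improve
        if setup_cost(child) < setup_cost(ref[-1]):    # reference set update
            ref[-1] = child
            ref.sort(key=setup_cost)
    return ref[0]

# demand: 3 units of model A, 3 of B, 2 of C
best = scatter_search({"A": 3, "B": 3, "C": 2})
```

A fully grouped sequence such as AAABBBCC has only two changeovers, which is the lower bound for three models under this toy cost.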
ERIC Educational Resources Information Center
Berge, Analia
2006-01-01
Burn (2005) proposes a "genetic approach" to teaching limits of numerical sequences. The article includes an explanation of the Method of Exhaustion, a generalization of this method, and a description of how this method was used for obtaining areas and lengths in the seventeenth century. The author uses these historical and mathematical analyses…
Quantum-Sequencing: Biophysics of quantum tunneling through nucleic acids
NASA Astrophysics Data System (ADS)
Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant
2014-03-01
Tunneling microscopy and spectroscopy have been used extensively in the physical surface sciences to study quantum tunneling, to measure the electronic local density of states of nanomaterials, and to characterize adsorbed species. Quantum-Sequencing (Q-Seq) is a new method based on tunneling microscopy for the electronic sequencing of single molecules of nucleic acids. A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free, single-molecule sequencing method. Here, we present the unique ``electronic fingerprints'' for all nucleotides of DNA and RNA obtained using Q-Seq, along with their intrinsic biophysical parameters. We analyzed tunneling spectra for the nucleotides at different pH conditions and extracted the HOMO, LUMO, and energy gap for each of them. In addition, we report a number of biophysical parameters that further characterize all nucleobases (electron and hole transition voltages and energy barriers). These results highlight the robustness of Q-Seq as a technique for next-generation sequencing.
Watson, Christopher M.; Crinnion, Laura A.; Gurgel‐Gianetti, Juliana; Harrison, Sally M.; Daly, Catherine; Antanavicuite, Agne; Lascelles, Carolina; Markham, Alexander F.; Pena, Sergio D. J.; Bonthron, David T.
2015-01-01
ABSTRACT Autozygosity mapping is a powerful technique for the identification of rare, autosomal recessive, disease‐causing genes. The ease with which this category of disease gene can be identified has greatly increased through the availability of genome‐wide SNP genotyping microarrays and, subsequently, of exome sequencing. Although these methods have simplified the generation of experimental data, their analysis, particularly when disparate data types must be integrated, remains time-consuming. Moreover, the huge volume of sequence variant data generated from next generation sequencing experiments opens up the possibility of using these data instead of microarray genotype data to identify disease loci. To allow these two types of data to be used in an integrated fashion, we have developed AgileVCFMapper, a program that performs both the mapping of disease loci by SNP genotyping and the analysis of potentially deleterious variants using exome sequence variant data, in a single step. This method does not require microarray SNP genotype data, although analysis with a combination of microarray and exome genotype data enables more precise delineation of disease loci, due to superior marker density and distribution. PMID:26037133
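A core step in autozygosity mapping is locating runs of homozygosity in ordered genotype calls. The sketch below is a minimal illustration of that idea, not AgileVCFMapper's algorithm; the genotype encoding and the marker threshold are assumptions.

```python
def homozygous_runs(genotypes, min_markers=25):
    """Scan genotype calls for runs of homozygosity.

    genotypes: list of (position, call) pairs ordered by position, with
    call in {'AA', 'AB', 'BB', 'NC'} (hypothetical encoding).
    Returns (start_pos, end_pos, n_markers) for each run of consecutive
    homozygous calls spanning at least min_markers markers.
    """
    runs, start, last, count = [], None, None, 0
    for pos, call in genotypes + [(None, "AB")]:  # sentinel flushes last run
        if call in ("AA", "BB"):
            if start is None:
                start = pos
            last, count = pos, count + 1
        else:
            if start is not None and count >= min_markers:
                runs.append((start, last, count))
            start, count = None, 0
    return runs
```

A real implementation would also tolerate occasional genotyping errors inside a run and weight runs by physical length; both are omitted here for clarity.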
Wei, Xiaoming; Sun, Yan; Xie, Jiansheng; Shi, Quan; Qu, Ning; Yang, Guanghui; Cai, Jun; Yang, Yi; Liang, Yu; Wang, Wei; Yi, Xin
2012-11-20
Targeted enrichment and next-generation sequencing (NGS) have been employed for the detection of genetic diseases. The purpose of this study was to validate the accuracy and sensitivity of our method for comprehensive mutation detection in hereditary hearing loss, and to identify inherited mutations involved in human deafness accurately and economically. To make the genetic diagnosis of hereditary hearing loss simple and timesaving, we designed a 0.60-Mb array-based capture chip covering 69 nuclear genes and the mitochondrial genome implicated in human deafness, and performed NGS on ten patients carrying five known mutations as well as on a Chinese family with hearing loss that had never been genetically investigated. Sequencing of the ten patients validated the sensitivity of the method: we identified four known mutations in two nuclear deafness-causing genes (GJB2 and SLC26A4) and one in mitochondrial DNA. We then applied the method to analyze the variants in the Chinese family and identified compound heterozygosity for two novel mutations in the MYO7A gene. This compound heterozygosity causes Usher syndrome 1B with severe phenotypes. The results support the combination of targeted gene enrichment and next-generation sequencing as a valuable molecular diagnostic tool for hereditary deafness, suitable for clinical application. Copyright © 2012 Elsevier B.V. All rights reserved.
On the design of henon and logistic map-based random number generator
NASA Astrophysics Data System (ADS)
Magfirawaty; Suryadi, M. T.; Ramli, Kalamullah
2017-10-01
The key sequence is one of the main elements of a cryptosystem. The True Random Number Generator (TRNG) is one approach to generating the key sequence. The randomness sources of TRNGs fall into three main groups: electrical-noise based, jitter based, and chaos based. Chaos-based generators utilize a non-linear dynamic system (continuous-time or discrete-time) as an entropy source. In this study, a new TRNG design based on a discrete-time chaotic system is proposed and simulated in LabVIEW. The principle of the design consists of combining a 2D and a 1D chaotic system. A mathematical model is implemented for numerical simulations. We used a comparator process as the harvesting method to obtain the series of random bits. Without any post-processing, the proposed design generated random bit sequences with high entropy that passed all NIST SP 800-22 statistical tests.
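A minimal sketch of the 2D-plus-1D idea: step a Henon map and a logistic map in parallel and harvest one bit per step with a comparator. The parameter values, the normalization constant, and the comparator rule below are illustrative choices, not the paper's design (which also adds a drift factor and hardware harvesting).

```python
def henon_logistic_bits(n, x=0.1, y=0.3, z=0.4, a=1.4, b=0.3, r=3.99):
    """Harvest one bit per step by comparing a 2D Henon trajectory
    against a 1D logistic trajectory (comparator harvesting).
    All parameters and the normalization are illustrative."""
    bits = []
    for _ in range(n):
        x, y = 1 - a * x * x + y, b * x     # Henon step (RHS uses old x, y)
        z = r * z * (1 - z)                 # logistic step, stays in (0, 1)
        bits.append(1 if (x + 1.3) / 2.6 > z else 0)  # comparator
    return bits

bits = henon_logistic_bits(2000)
```

A real design would follow this with an entropy assessment and statistical testing (e.g. the NIST suite), which the sketch omits.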
Thomsen, Martin Christen Frølund; Nielsen, Morten
2012-01-01
Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and a low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims at resolving these issues by allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for a low number of observations, and different logotype representations, each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSAs, Seq2Logo accepts input as BLAST sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed). PMID:22638583
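The quantity behind any Shannon-style sequence logo is the per-column information content, with letter heights proportional to frequency times information. The sketch below shows that computation in its plainest form, without the sequence weighting, pseudo counts, or depletion logotypes that Seq2Logo layers on top.

```python
import math
from collections import Counter

def logo_heights(msa, alphabet="ACGT"):
    """Per-column letter heights of a classical Shannon sequence logo:
    height(a) = freq(a) * (log2 |alphabet| - column entropy).
    No weighting or pseudo counts, unlike Seq2Logo's corrected variants."""
    heights = []
    for i in range(len(msa[0])):
        column = [seq[i] for seq in msa]
        freqs = {a: c / len(column) for a, c in Counter(column).items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        info = math.log2(len(alphabet)) - entropy   # R_i in bits
        heights.append({a: p * info for a, p in freqs.items()})
    return heights

h = logo_heights(["AC", "AC", "AG"])
```

In the first column every sequence carries A, so the letter gets the full 2 bits; in the second column C and G split the reduced information content.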
SMARTIV: combined sequence and structure de-novo motif discovery for in-vivo RNA binding data.
Polishchuk, Maya; Paz, Inbal; Yakhini, Zohar; Mandel-Gutfreund, Yael
2018-05-25
Gene expression regulation is highly dependent on binding of RNA-binding proteins (RBPs) to their RNA targets. Growing evidence supports the notion that both RNA primary sequence and its local secondary structure play a role in specific Protein-RNA recognition and binding. Despite the great advance in high-throughput experimental methods for identifying sequence targets of RBPs, predicting the specific sequence and structure binding preferences of RBPs remains a major challenge. We present a novel webserver, SMARTIV, designed for discovering and visualizing combined RNA sequence and structure motifs from high-throughput RNA-binding data, generated from in-vivo experiments. The uniqueness of SMARTIV is that it predicts motifs from enriched k-mers that combine information from ranked RNA sequences and their predicted secondary structure, obtained using various folding methods. Consequently, SMARTIV generates Position Weight Matrices (PWMs) in a combined sequence and structure alphabet with assigned P-values. SMARTIV concisely represents the sequence and structure motif content as a single graphical logo, which is informative and easy for visual perception. SMARTIV was examined extensively on a variety of high-throughput binding experiments for RBPs from different families, generated from different technologies, showing consistent and accurate results. Finally, SMARTIV is a user-friendly webserver, highly efficient in run-time and freely accessible via http://smartiv.technion.ac.il/.
Method of generating polynucleotides encoding enhanced folding variants
Bradbury, Andrew M.; Kiss, Csaba; Waldo, Geoffrey S.
2017-05-02
The invention provides directed evolution methods for improving the folding, solubility and stability (including thermostability) characteristics of polypeptides. In one aspect, the invention provides a method for generating folding- and stability-enhanced variants of proteins, including but not limited to fluorescent proteins, chromophoric proteins and enzymes. In another aspect, the invention provides methods for generating thermostable variants of a target protein or polypeptide via an internal destabilization baiting strategy. Internal destabilization of a protein of interest is achieved by inserting a heterologous, folding-destabilizing sequence (folding interference domain) within DNA encoding the protein of interest, then evolving the protein sequences adjacent to the heterologous insertion to overcome the destabilization (using any number of mutagenesis methods), thereby creating a library of variants. The variants in the library are expressed, and those with enhanced folding characteristics are selected.
Genome Sequencing and Assembly by Long Reads in Plants
Li, Changsheng; Lin, Feng; An, Dong; Huang, Ruidong
2017-01-01
Plant genomes generated by Sanger and Next Generation Sequencing (NGS) have provided insight into species diversity and evolution. However, Sanger sequencing is limited in its applications due to high cost, labor intensity, and low throughput, while NGS reads are too short to resolve abundant repeats and polyploidy, leading to incomplete or ambiguous assemblies. The advent and improvement of long-read sequencing by Third Generation Sequencing (TGS) methods such as PacBio and Nanopore have shown promise in producing high-quality assemblies for complex genomes. Here, we review the development of sequencing, introducing the applications as well as considerations of experimental design in TGS of plant genomes. We also introduce recent revolutionary scaffolding technologies, including BioNano, Hi-C, and 10x Genomics. We expect that this guidance on genome sequencing and assembly by long reads will help scientists initiate their own projects. PMID:29283420
Implementation of a quantum random number generator based on the optimal clustering of photocounts
NASA Astrophysics Data System (ADS)
Balygin, K. A.; Zaitsev, V. I.; Klimov, A. N.; Kulik, S. P.; Molotkov, S. N.
2017-10-01
To implement quantum random number generators, it is fundamentally important to have a mathematically provable and experimentally testable process of measurement of the system from which the initial random sequence is generated. This ensures that the randomness is indeed quantum in nature. A quantum random number generator has been implemented using the detection of quasi-single-photon radiation by a silicon photomultiplier (SiPM) matrix, which makes it possible to reliably reach Poisson statistics of photocounts. The choice and use of the optimal clustering of photocounts for the initial sequence of photodetection events, together with a method of extracting a random sequence of 0's and 1's that is polynomial in the length of the sequence, have made it possible to reach a yield of 64 Mbit/s for the certified random output sequence.
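The extraction step can be illustrated with the classical von Neumann debiasing procedure applied to the parity of simulated Poisson photocounts. This is a textbook stand-in for the paper's optimal-clustering construction, and the Poisson source here is software-simulated, not a physical SiPM.

```python
import math
import random

def poisson_sample(rng, lam):
    """Knuth's Poisson sampler; a software stand-in for SiPM photocounts."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def poisson_parity_bits(n, lam=2.0, seed=1):
    """Raw bit per time window: parity of the window's photocount."""
    rng = random.Random(seed)
    return [poisson_sample(rng, lam) % 2 for _ in range(n)]

def von_neumann_extract(raw_bits):
    """Map pairs 01 -> 0, 10 -> 1, drop 00/11: removes the bias of an
    independent-bit source at the cost of throughput."""
    return [b1 for b1, b2 in zip(raw_bits[::2], raw_bits[1::2]) if b1 != b2]
```

Von Neumann extraction assumes independent raw bits; correlated sources need stronger extractors, which is part of what the paper's construction addresses.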
Evaluation of normalization methods in mammalian microRNA-Seq data
Garmire, Lana Xia; Subramaniam, Shankar
2012-01-01
Simple total tag count normalization is inadequate for microRNA sequencing data generated from the next generation sequencing technology. However, so far systematic evaluation of normalization methods on microRNA sequencing data is lacking. We comprehensively evaluate seven commonly used normalization methods, including global normalization, Lowess normalization, Trimmed Mean Method (TMM), quantile normalization, scaling normalization, variance stabilization, and the invariant method. We assess these methods on two individual experimental data sets with the empirical statistical metrics of mean square error (MSE) and the Kolmogorov-Smirnov (K-S) statistic. Additionally, we evaluate the methods with results from quantitative PCR validation. Our results consistently show that Lowess normalization and quantile normalization perform the best, whereas TMM, a method applied to RNA-Sequencing normalization, performs the worst. The poor performance of TMM normalization is further evidenced by abnormal results from the test of differential expression (DE) of microRNA-Seq data. Compared with the models used for DE, the choice of normalization method is the primary factor that affects the results of DE. In summary, Lowess normalization and quantile normalization are recommended for normalizing microRNA-Seq data, whereas the TMM method should be used with caution. PMID:22532701
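Quantile normalization, one of the two recommended methods, can be sketched in a few lines: each sample's value at rank r is replaced by the mean across samples of the rank-r values. This minimal version ignores tie handling for brevity.

```python
def quantile_normalize(samples):
    """samples: list of equal-length count vectors, one per sample.
    Each sample's rank-r value is replaced by the mean of the rank-r
    values across all samples (ties not averaged, for brevity)."""
    n = len(samples[0])
    sorted_cols = [sorted(s) for s in samples]
    rank_means = [sum(col[r] for col in sorted_cols) / len(samples)
                  for r in range(n)]
    out = []
    for s in samples:
        order = sorted(range(n), key=s.__getitem__)  # indices by rank
        new = [0.0] * n
        for rank, idx in enumerate(order):
            new[idx] = rank_means[rank]
        out.append(new)
    return out

normalized = quantile_normalize([[2, 1, 3], [20, 10, 30]])
```

After normalization both samples share the same value distribution while each sample's internal ordering is preserved.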
Refinetti, Paulo; Morgenthaler, Stephan; Ekstrøm, Per O
2016-07-01
Cycling temperature capillary electrophoresis has been optimised for mutation detection in 76% of the mitochondrial genome. The method was tested on a mixed sample and compared to mutation detection by next generation sequencing. Out of 152 fragments, 90 were concordant, 51 discordant, and 11 semi-concordant. Dilution experiments show that cycling temperature capillary electrophoresis has a detection limit of 1-3%, whereas the detection limit of routine next generation sequencing was in the range of 15-30%. Cycling temperature capillary electrophoresis detects and accurately quantifies mutations at a fraction of the cost and time required to perform a next generation sequencing analysis. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Uribe-Convers, Simon; Duke, Justin R.; Moore, Michael J.; Tank, David C.
2014-01-01
• Premise of the study: We present an alternative approach for molecular systematic studies that combines long PCR and next-generation sequencing. Our approach can be used to generate templates from any DNA source for next-generation sequencing. Here we test our approach by amplifying complete chloroplast genomes, and we present a set of 58 potentially universal primers for angiosperms to do so. Additionally, this approach is likely to be particularly useful for nuclear and mitochondrial regions. • Methods and Results: Chloroplast genomes of 30 species across angiosperms were amplified to test our approach. Amplification success varied depending on whether PCR conditions were optimized for a given taxon. To further test our approach, some amplicons were sequenced on an Illumina HiSeq 2000. • Conclusions: Although here we tested this approach by sequencing plastomes, long PCR amplicons could be generated using DNA from any genome, expanding the possibilities of this approach for molecular systematic studies. PMID:25202592
Library construction for next-generation sequencing: Overviews and challenges
Head, Steven R.; Komori, H. Kiyomi; LaMere, Sarah A.; Whisenant, Thomas; Van Nieuwerburgh, Filip; Salomon, Daniel R.; Ordoukhanian, Phillip
2014-01-01
High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed. PMID:24502796
Statistical inference of the generation probability of T-cell receptors from sequence repertoires.
Murugan, Anand; Mora, Thierry; Walczak, Aleksandra M; Callan, Curtis G
2012-10-02
Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as "VDJ recombination", is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system.
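The central idea, that a given CDR3 can be produced by many hidden recombination scenarios whose probabilities must be summed, can be illustrated with a deliberately tiny model. The segment sets, insertion-length distribution, and uniform nucleotide model below are toy assumptions, not the inferred human parameters.

```python
from itertools import product

def generation_probability(seq, v_segments, j_segments, p_ins, p_nt=0.25):
    """Sum P(scenario) over every hidden scenario (V choice, J choice,
    inserted nucleotides) that yields seq. v_segments / j_segments map
    segment strings to choice probabilities; p_ins maps insertion length
    to probability. All parameters here are toy assumptions."""
    total = 0.0
    for (v, pv), (j, pj) in product(v_segments.items(), j_segments.items()):
        mid = len(seq) - len(v) - len(j)
        if mid >= 0 and seq.startswith(v) and seq.endswith(j):
            # a specific run of `mid` inserted letters has prob p_nt ** mid
            total += pv * pj * p_ins.get(mid, 0.0) * p_nt ** mid
    return total
```

For example, with V segments {'CA': 0.6, 'C': 0.4} and a single J segment 'F', the sequence 'CAF' can arise either with no insertions or with one inserted letter, so its two scenario probabilities add, exactly the degeneracy the paper's maximum-likelihood machinery handles at scale.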
PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
2012-01-01
Background With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available but this wealth of data remains underused and are frequently poorly annotated since no user-friendly tool exists to analyze and explore it. Results PHYLOViZ is platform independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions PHYLOViZ is a user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net. PMID:22568821
USDA-ARS?s Scientific Manuscript database
Using next-generation-sequencing technology to assess entire transcriptomes requires high quality starting RNA. Currently, RNA quality is routinely judged using automated microfluidic gel electrophoresis platforms and associated algorithms. Here we report that such automated methods generate false-n...
Sequencing technologies - the next generation.
Metzker, Michael L
2010-01-01
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.
Yang, Lei; Naylor, Gavin J P
2016-01-01
We determined the complete mitochondrial genome sequence (16,760 bp) of the peacock skate Pavoraja nitida using a long-PCR based next generation sequencing method. It has 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes, and 1 control region in the typical vertebrate arrangement. Primers, protocols, and procedures used to obtain this mitogenome are provided. We anticipate that this approach will facilitate rapid collection of mitogenome sequences for studies on phylogenetic relationships, population genetics, and conservation of cartilaginous fishes.
Shum, Bennett O V; Henner, Ilya; Belluoccio, Daniele; Hinchcliffe, Marcus J
2017-07-01
The sensitivity and specificity of next-generation sequencing laboratory developed tests (LDTs) are typically determined by an analyte-specific approach. Analyte-specific validations use disease-specific controls to assess an LDT's ability to detect known pathogenic variants. Alternatively, a methods-based approach can be used for LDT technical validations. Methods-focused validations do not use disease-specific controls but use benchmark reference DNA that contains known variants (benign, variants of unknown significance, and pathogenic) to assess variant calling accuracy of a next-generation sequencing workflow. Recently, four whole-genome reference materials (RMs) from the National Institute of Standards and Technology (NIST) were released to standardize methods-based validations of next-generation sequencing panels across laboratories. We provide a practical method for using NIST RMs to validate multigene panels. We analyzed the utility of RMs in validating a novel newborn screening test that targets 70 genes, called NEO1. Despite the NIST RM variant truth set originating from multiple sequencing platforms, replicates, and library types, we discovered a 5.2% false-negative variant detection rate in the RM truth set genes that were assessed in our validation. We developed a strategy using complementary non-RM controls to demonstrate 99.6% sensitivity of the NEO1 test in detecting variants. Our findings have implications for laboratories or proficiency testing organizations using whole-genome NIST RMs for testing. Copyright © 2017 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
High-Throughput Next-Generation Sequencing of Polioviruses
Montmayeur, Anna M.; Schmidt, Alexander; Zhao, Kun; Magaña, Laura; Iber, Jane; Castro, Christina J.; Chen, Qi; Henderson, Elizabeth; Ramos, Edward; Shaw, Jing; Tatusov, Roman L.; Dybdahl-Sissoko, Naomi; Endegue-Zanga, Marie Claire; Adeniji, Johnson A.; Oberste, M. Steven; Burns, Cara C.
2016-01-01
ABSTRACT The poliovirus (PV) is currently targeted for worldwide eradication and containment. Sanger-based sequencing of the viral protein 1 (VP1) capsid region is currently the standard method for PV surveillance. However, the whole-genome sequence is sometimes needed for higher resolution global surveillance. In this study, we optimized whole-genome sequencing protocols for poliovirus isolates and FTA cards using next-generation sequencing (NGS), aiming for high sequence coverage, efficiency, and throughput. We found that DNase treatment of poliovirus RNA followed by random reverse transcription (RT), amplification, and the use of the Nextera XT DNA library preparation kit produced significantly better results than other preparations. The average viral reads per total reads, a measurement of efficiency, was as high as 84.2% ± 15.6%. PV genomes covering >99 to 100% of the reference length were obtained and validated with Sanger sequencing. A total of 52 PV genomes were generated, multiplexing as many as 64 samples in a single Illumina MiSeq run. This high-throughput, sequence-independent NGS approach facilitated the detection of a diverse range of PVs, especially for those in vaccine-derived polioviruses (VDPV), circulating VDPV, or immunodeficiency-related VDPV. In contrast to results from previous studies on other viruses, our results showed that filtration and nuclease treatment did not discernibly increase the sequencing efficiency of PV isolates. However, DNase treatment after nucleic acid extraction to remove host DNA significantly improved the sequencing results. This NGS method has been successfully implemented to generate PV genomes for molecular epidemiology of the most recent PV isolates. Additionally, the ability to obtain full PV genomes from FTA cards will aid in facilitating global poliovirus surveillance. PMID:27927929
Quick, Joshua; Quinlan, Aaron R; Loman, Nicholas J
2014-01-01
The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. The MinION™ measures the change in current resulting from DNA strands interacting with a charged protein nanopore. These measurements can then be used to deduce the underlying nucleotide sequence. We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION™ Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods.
Zhang, Lu; Xu, Jinhao; Ma, Jinbiao
2016-07-25
RNA-binding proteins exert important biological functions by specifically recognizing RNA motifs. SELEX (Systematic Evolution of Ligands by EXponential enrichment), an in vitro selection method, can obtain consensus motifs with high affinity and specificity for many target molecules from DNA or RNA libraries. Here, we combined SELEX with next-generation sequencing to study protein-RNA interactions in vitro. A pool of RNAs containing 20-bp random sequences was transcribed from a T7 promoter, and the target protein was expressed from a plasmid carrying an SBP-tag, allowing capture on streptavidin beads. With only one selection cycle, specific RNA motifs could be obtained, dramatically improving selection efficiency. Using this method, we found that the human hnRNP A1 RRM domains (UP1 domain) bound RNA motifs containing AGG and AG sequences. EMSA experiments indicated that hnRNP A1 RRMs could bind the obtained RNA motif. Taken together, this method provides a rapid and effective way to study the RNA-binding specificity of proteins.
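The downstream analysis of such a one-round selection reduces, at its simplest, to ranking k-mers by their enrichment in the bound pool over the input pool. This sketch is an assumed minimal version (pseudocount of 1 on the background, no statistical testing), not the authors' pipeline.

```python
from collections import Counter

def enriched_kmers(selected, background, k=4, top=5):
    """Rank k-mers by frequency enrichment in protein-bound reads over
    the input pool (pseudocount of 1 on the background, no statistics)."""
    def kmer_counts(reads):
        counts = Counter()
        for r in reads:
            for i in range(len(r) - k + 1):
                counts[r[i:i + k]] += 1
        return counts, sum(counts.values()) or 1

    sel, sel_total = kmer_counts(selected)
    bg, bg_total = kmer_counts(background)
    score = {m: (sel[m] / sel_total) / ((bg[m] + 1) / bg_total) for m in sel}
    return sorted(score, key=score.get, reverse=True)[:top]
```

On toy reads enriched for AGG-containing motifs, the AGG-overlapping k-mers rise to the top of the ranking.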
An alternative extragradient projection method for quasi-equilibrium problems.
Chen, Haibin; Wang, Yiju; Xu, Yi
2018-01-01
For the quasi-equilibrium problem, in which both the players' costs and their strategies depend on their rivals' decisions, we design an alternative extragradient projection method. Unlike the classical extragradient projection method, whose generated sequence has the contraction property with respect to the solution set, the newly designed method possesses an expansion property with respect to a given initial point. The global convergence of the method is established under the assumptions of pseudomonotonicity of the equilibrium function and of continuity of the underlying multi-valued mapping. Furthermore, we show that the generated sequence converges to the point of the solution set nearest to the initial point. Numerical experiments demonstrate the efficiency of the method.
Rapid Sequencing of Complete env Genes from Primary HIV-1 Samples.
Laird Smith, Melissa; Murrell, Ben; Eren, Kemal; Ignacio, Caroline; Landais, Elise; Weaver, Steven; Phung, Pham; Ludka, Colleen; Hepler, Lance; Caballero, Gemma; Pollner, Tristan; Guo, Yan; Richman, Douglas; Poignard, Pascal; Paxinos, Ellen E; Kosakovsky Pond, Sergei L; Smith, Davey M
2016-07-01
The ability to study rapidly evolving viral populations has been constrained by the read length of next-generation sequencing approaches and the sampling depth of single-genome amplification methods. Here, we develop and characterize a method using Pacific Biosciences' Single Molecule, Real-Time (SMRT®) sequencing technology to sequence multiple, intact full-length human immunodeficiency virus-1 env genes amplified from viral RNA populations circulating in blood, and provide computational tools for analyzing and visualizing these data.
Li, Man; Ling, Cheng; Xu, Qi; Gao, Jingyang
2018-02-01
Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, predicting the function of the increasingly large-scale and diverse body of sequences has relied heavily on machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach that combines multiple sequence alignment, an N-gram probabilistic language model and deep learning. The essence of the proposed method is that if each group of sequences can be represented by one feature sequence composed of homologous sites, then adding a more relevant sequence to the group should incur less loss when the feature sequence is rebuilt. On this basis, predicting whether a query sequence belongs to a group of sequences reduces to calculating the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs): it begins by extracting the feature sequences from the multiple sequence alignment results of the GPCR sub-subfamilies, then applies the N-gram model to construct the input vectors, and finally imports these vectors into a convolutional neural network to make a prediction. The experimental results show that the proposed method provides significant performance improvements: the classification error rate is reduced by at least 4.67% (family level I) and 5.75% (family level II) in comparison with the current state-of-the-art methods. The implementation of the proposed work is freely available at https://github.com/alanFchina/CNN.
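The N-gram vector-construction step described above can be sketched in a few lines of Python. This is a minimal illustration only: the amino-acid alphabet, bigram order, and count normalization are assumptions for the sketch, not the paper's exact language model, and the downstream CNN is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ngram_vector(sequence, n=2):
    """Build a normalized n-gram frequency vector for a protein sequence.

    A fixed ordering over all 20**n possible n-grams makes vectors from
    different sequences directly comparable as fixed-width network input.
    """
    grams = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    index = {g: i for i, g in enumerate(grams)}
    vec = [0.0] * len(grams)
    total = max(len(sequence) - n + 1, 1)
    for i in range(len(sequence) - n + 1):
        g = sequence[i:i + n]
        if g in index:                 # skip n-grams with ambiguous residues
            vec[index[g]] += 1.0 / total
    return vec

v = ngram_vector("MKTAYIAKQR")         # hypothetical 10-residue query
print(len(v))                          # 400 bigram dimensions
```

Each classified sub-subfamily would contribute one such vector per (feature) sequence, stacked as input rows for the network.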
Next generation sequencing (NGS): a golden tool in forensic toolkit.
Aly, S M; Sabri, D M
DNA analysis is a cornerstone of contemporary forensic science. DNA sequencing technologies are powerful tools that enriched the molecular sciences in the past through Sanger sequencing and continue to advance them through next generation sequencing (NGS). NGS has excellent potential to expand the molecular applications of forensic science by overcoming the pitfalls of the conventional sequencing method. Its main advantage over the conventional method is that it interrogates a large number of genetic markers simultaneously and yields high-resolution genetic data. These advantages will help in solving several challenges, such as mixture analysis and dealing with minute, degraded samples. Based on these new technologies, many markers can be examined to obtain important biological information such as age, geographical origin, tissue type, externally visible traits and the identification of monozygotic twins. Data related to microbes, insects, plants and soil, which are of great medico-legal importance, can also be obtained. Despite the dozens of forensic studies involving NGS, there are requirements to be met before this technology can be used routinely in forensic casework; in particular, more studies are needed that address the robustness of these techniques. This work therefore highlights the applications of forensic science in the era of massively parallel sequencing.
A Pan-HIV Strategy for Complete Genome Sequencing
Yamaguchi, Julie; Alessandri-Gradt, Elodie; Tell, Robert W.; Brennan, Catherine A.
2015-01-01
Molecular surveillance is essential to monitor HIV diversity and track emerging strains. We have developed a universal library preparation method (HIV-SMART [i.e., switching mechanism at 5′ end of RNA transcript]) for next-generation sequencing that harnesses the specificity of HIV-directed priming to enable full genome characterization of all HIV-1 groups (M, N, O, and P) and HIV-2. Broad application of the HIV-SMART approach was demonstrated using a panel of diverse cell-cultured virus isolates. HIV-1 non-subtype B-infected clinical specimens from Cameroon were then used to optimize the protocol to sequence directly from plasma. When multiplexing 8 or more libraries per MiSeq run, full genome coverage at a median ∼2,000× depth was routinely obtained for either sample type. The method reproducibly generated the same consensus sequence, consistently identified viral sequence heterogeneity present in specimens, and at viral loads of ≤4.5 log copies/ml yielded sufficient coverage to permit strain classification. HIV-SMART provides an unparalleled opportunity to identify diverse HIV strains in patient specimens and to determine phylogenetic classification based on the entire viral genome. Easily adapted to sequence any RNA virus, this technology illustrates the utility of next-generation sequencing (NGS) for viral characterization and surveillance. PMID:26699702
Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions
2014-01-01
Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads. PMID:24428920
van den Oever, Jessica M E; van Minderhout, Ivonne J H M; Harteveld, Cornelis L; den Hollander, Nicolette S; Bakker, Egbert; van der Stoep, Nienke; Boon, Elles M J
2015-09-01
The challenge in noninvasive prenatal diagnosis for monogenic disorders lies in the detection of low levels of fetal variants in the excess of maternal cell-free plasma DNA. Next-generation sequencing, which is the main method used for noninvasive prenatal testing and diagnosis, can overcome this challenge. However, this method may not be accessible to all genetic laboratories. Moreover, shotgun next-generation sequencing as, for instance, currently applied for noninvasive fetal trisomy screening may not be suitable for the detection of inherited mutations. We have developed a sensitive, mutation-specific, and fast alternative for next-generation sequencing-mediated noninvasive prenatal diagnosis using a PCR-based method. For this proof-of-principle study, noninvasive fetal paternally inherited mutation detection was performed using cell-free DNA from maternal plasma. Preferential amplification of the paternally inherited allele was accomplished through a personalized approach using a blocking probe against maternal sequences in a high-resolution melting curve analysis-based assay. Enhanced detection of the fetal paternally inherited mutation was obtained for both an autosomal dominant and a recessive monogenic disorder by blocking the amplification of maternal sequences in maternal plasma. Copyright © 2015 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Mapping Ribonucleotides Incorporated into DNA by Hydrolytic End-Sequencing.
Orebaugh, Clinton D; Lujan, Scott A; Burkholder, Adam B; Clausen, Anders R; Kunkel, Thomas A
2018-01-01
Ribonucleotides embedded within DNA render the DNA sensitive to the formation of single-stranded breaks under alkali conditions. Here, we describe a next-generation sequencing method called hydrolytic end sequencing (HydEn-seq) to map ribonucleotides inserted into the genome of Saccharomyces cerevisiae strains deficient in ribonucleotide excision repair. We use this method to map several genomic features in wild-type and replicase variant yeast strains.
Test Input Generation for Red-Black Trees using Abstraction
NASA Technical Reports Server (NTRS)
Visser, Willem; Pasareanu, Corina S.; Pelanek, Radek
2005-01-01
We consider the problem of test input generation for code that manipulates complex data structures. Test inputs are sequences of method calls from the data structure interface. We describe test input generation techniques that rely on state matching to avoid generation of redundant tests. Exhaustive techniques use explicit state model checking to explore all the possible test sequences up to predefined input sizes. Lossy techniques rely on abstraction mappings to compute and store abstract versions of the concrete states; they explore under-approximations of all the possible test sequences. We have implemented the techniques on top of the Java PathFinder model checker and we evaluate them using a Java implementation of red-black trees.
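The state-matching idea above can be sketched independently of Java PathFinder. In this minimal Python sketch (all names illustrative, not the paper's implementation), a set of keys stands in for the red-black tree's contents, and the abstraction function maps each concrete state to a stored key: the identity gives exhaustive matching, while a lossy mapping such as the state's size gives the under-approximating lossy search.

```python
from collections import deque

def explore(methods, max_depth, abstraction, initial=frozenset()):
    """Breadth-first exploration of method-call sequences with state matching.

    `methods` maps a call name to a function state -> new state.
    `abstraction` maps a concrete state to a hashable key; sequences that
    reach an already-seen key are pruned as redundant tests.
    """
    seen = {abstraction(initial)}
    tests = []
    queue = deque([(initial, [])])
    while queue:
        state, seq = queue.popleft()
        if len(seq) == max_depth:          # bound on input size
            continue
        for name, op in methods.items():
            new_state = op(state)
            key = abstraction(new_state)
            if key not in seen:            # state matching prunes duplicates
                seen.add(key)
                tests.append(seq + [name])
                queue.append((new_state, seq + [name]))
    return tests

# Three insert operations; the lossy "size" abstraction keeps one test per size.
methods = {f"add({v})": (lambda v: lambda s: s | {v})(v) for v in (1, 2, 3)}
print(explore(methods, 2, abstraction=len))
# → [['add(1)'], ['add(1)', 'add(2)']]
```

With `abstraction=lambda s: s` the same driver enumerates all distinct concrete states instead, mirroring the exhaustive explicit-state search.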
GENETIC-BASED ANALYTICAL METHODS FOR BACTERIA AND FUNGI
In the past two decades, advances in high-throughput sequencing technologies have led to a veritable explosion in the generation of nucleic acid sequence information (1). While these advances are illustrated most prominently by the successful sequencing of the human genome, they...
Proteome Studies of Filamentous Fungi
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Scott E.; Panisko, Ellen A.
2011-04-20
The continued fast pace of fungal genome sequence generation has enabled proteomic analysis of a wide variety of organisms that span the breadth of the Kingdom Fungi. There is some phylogenetic bias to the current catalog of fungi with reasonable DNA sequence databases (genomic or EST) that could be analyzed at a global proteomic level. However, the rapid development of next generation sequencing platforms has lowered the cost of genome sequencing such that in the near future, having a genome sequence will no longer be a time or cost bottleneck for downstream proteomic (and transcriptomic) analyses. High throughput, non-gel based proteomics offers a snapshot of proteins present in a given sample at a single point in time. There are a number of different variations on the general method and technologies for identifying peptides in a given sample. We present a method that can serve as a “baseline” for proteomic studies of fungi.
Proteome studies of filamentous fungi.
Baker, Scott E; Panisko, Ellen A
2011-01-01
The continued fast pace of fungal genome sequence generation has enabled proteomic analysis of a wide variety of organisms that span the breadth of the Kingdom Fungi. There is some phylogenetic bias to the current catalog of fungi with reasonable DNA sequence databases (genomic or EST) that could be analyzed at a global proteomic level. However, the rapid development of next generation sequencing platforms has lowered the cost of genome sequencing such that in the near future, having a genome sequence will no longer be a time or cost bottleneck for downstream proteomic (and transcriptomic) analyses. High throughput, nongel-based proteomics offers a snapshot of proteins present in a given sample at a single point in time. There are a number of variations on the general methods and technologies for identifying peptides in a given sample. We present a method that can serve as a "baseline" for proteomic studies of fungi.
Using expected sequence features to improve basecalling accuracy of amplicon pyrosequencing data.
Rask, Thomas S; Petersen, Bent; Chen, Donald S; Day, Karen P; Pedersen, Anders Gorm
2016-04-22
Amplicon pyrosequencing targets a known genetic region and thus inherently produces reads highly anticipated to have certain features, such as conserved nucleotide sequence, and in the case of protein coding DNA, an open reading frame. Pyrosequencing errors, consisting mainly of nucleotide insertions and deletions, are on the other hand likely to disrupt open reading frames. Such an inverse relationship between errors and expectation based on prior knowledge can be used advantageously to guide the process known as basecalling, i.e. the inference of nucleotide sequence from raw sequencing data. The new basecalling method described here, named Multipass, implements a probabilistic framework for working with the raw flowgrams obtained by pyrosequencing. For each sequence variant, Multipass calculates the likelihood and nucleotide sequence of the several most likely sequences given the flowgram data. This probabilistic approach enables integration of basecalling into a larger model where other parameters can be incorporated, such as the likelihood for observing a full-length open reading frame at the targeted region. We apply the method to 454 amplicon pyrosequencing data obtained from a malaria virulence gene family, where Multipass generates 20% more error-free sequences than current state-of-the-art methods, and provides sequence characteristics that allow generation of a set of high-confidence error-free sequences. This novel method can be used to increase the accuracy of existing and future amplicon sequencing data, particularly where extensive prior knowledge is available about the obtained sequences, for example in analysis of the immunoglobulin VDJ region, where Multipass can be combined with a model for the known recombining germline genes. Multipass is available for Roche 454 data at http://www.cbs.dtu.dk/services/MultiPass-1.0, and the concept can potentially be implemented for other sequencing technologies as well.
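The probabilistic flowgram framework can be illustrated with a toy model: each flow value is treated as a noisy observation of a homopolymer length, and a candidate sequence is scored by its likelihood under Gaussian flow noise. The flow values, noise level, and Gaussian model here are invented for illustration; Multipass's actual flowgram model is more elaborate.

```python
import math

def flow_loglik(flowgram, homopolymer_lengths, sigma=0.2):
    """Log-likelihood of a candidate sequence's homopolymer lengths given
    observed flow signals, assuming i.i.d. Gaussian flow noise (toy model)."""
    ll = 0.0
    for obs, h in zip(flowgram, homopolymer_lengths):
        ll += (-0.5 * ((obs - h) / sigma) ** 2
               - math.log(sigma * math.sqrt(2.0 * math.pi)))
    return ll

# A naive basecall just rounds each flow value; the likelihood lets us
# rank alternative basecalls instead of committing to the rounded one.
flows = [1.02, 0.08, 2.31, 0.97]        # hypothetical flow signals
naive = [round(f) for f in flows]       # → [1, 0, 2, 1]
print(naive)
```

Scoring several near-maximal candidates this way is what allows prior knowledge, such as an intact open reading frame, to tip the balance between them.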
Shaukat, Shahzad; Angez, Mehar; Alam, Muhammad Masroor; Jebbink, Maarten F; Deijs, Martin; Canuti, Marta; Sharif, Salmaan; de Vries, Michel; Khurshid, Adnan; Mahmood, Tariq; van der Hoek, Lia; Zaidi, Syed Sohail Zahoor
2014-08-12
The use of sequence-independent methods combined with next generation sequencing for identification purposes in clinical samples appears promising, and exciting results have been achieved in understanding unexplained infections. One sequence-independent method, Virus Discovery based on cDNA Amplified Fragment Length Polymorphism (VIDISCA), is capable of identifying viruses that would have remained unidentified in standard diagnostics or cell cultures. VIDISCA is normally combined with next generation sequencing; however, we set up a simplified VIDISCA that can be used when next generation sequencing is not possible. Stool samples of 10 patients with unexplained acute flaccid paralysis showing cytopathic effect in rhabdomyosarcoma cells and/or mouse cells were used to test the efficiency of this method. To further characterize the viruses, VIDISCA-positive samples were amplified and sequenced with gene-specific primers. Simplified VIDISCA detected seven viruses (70%), and the proportion of eukaryotic viral sequences from each sample ranged from 8.3 to 45.8%. Human enterovirus EV-B97, EV-B100, echovirus-9 and echovirus-21, human parechovirus type-3, a human astrovirus (probably a type-3/5 recombinant), and tetnovirus-1 were identified. Phylogenetic analysis based on the VP1 region demonstrated that the human enteroviruses are more divergent isolates circulating in the community. Our data support the conclusion that a simplified VIDISCA protocol can efficiently identify unrecognized viruses grown in cell culture at low cost and in limited time, without the need for advanced technical expertise. Complex data interpretation is also avoided; thus, the method can be used as a powerful diagnostic tool in resource-limited settings. Redesigning routine diagnostics along these lines might lead to additional detection of previously undiagnosed viruses in clinical samples.
Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes.
Oyola, Samuel O; Otto, Thomas D; Gu, Yong; Maslen, Gareth; Manske, Magnus; Campino, Susana; Turner, Daniel J; Macinnis, Bronwyn; Kwiatkowski, Dominic P; Swerdlow, Harold P; Quail, Michael A
2012-01-03
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences. We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC-neutral templates. We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of either extreme of base composition.
This development will greatly benefit the sequencing of clinical samples, which often require amplification due to the low mass of starting DNA.
Comparison of next generation sequencing technologies for transcriptome characterization
2009-01-01
Background: We have developed a simulation approach to help determine the optimal mixture of sequencing methods for the most complete and cost-effective transcriptome sequencing. We compared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultra high-throughput technologies. The simulation model was parameterized using mappings of 130,000 cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We also generated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy (Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methods for cDNA synthesis. Results: The Arabidopsis reads tagged more than 15,000 genes, including new splice variants and extended UTR regions. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly to known exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exon boundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-based inference of relative gene expression levels correlated significantly with microarray data. As expected, NG sequencing of normalized libraries tagged more genes than non-normalized libraries, although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis data were used to simulate additional rounds of NG and traditional EST sequencing, and various combinations of each. Our simulations suggest a combination of FLX and Solexa sequencing for optimal transcriptome coverage at modest cost. We have also developed ESTcalc (http://fgp.huck.psu.edu/NG_Sims/ngsim.pl), an online web tool, which allows users to explore the results of this study by specifying individualized costs and sequencing characteristics. Conclusion: NG sequencing technologies are a highly flexible set of platforms that can be scaled to suit different project goals. 
In terms of sequence coverage alone, the NG sequencing is a dramatic advance over capillary-based sequencing, but NG sequencing also presents significant challenges in assembly and sequence accuracy due to short read lengths, method-specific sequencing errors, and the absence of physical clones. These problems may be overcome by hybrid sequencing strategies using a mixture of sequencing methodologies, by new assemblers, and by sequencing more deeply. Sequencing and microarray outcomes from multiple experiments suggest that our simulator will be useful for guiding NG transcriptome sequencing projects in a wide range of organisms. PMID:19646272
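As a rough, closed-form counterpart to the simulations described above, the expected number of distinct genes tagged by random reads from a uniform-abundance library follows a coupon-collector estimate. This is a deliberate simplification (real cDNA libraries are far from uniform, which is exactly why normalized libraries tag more genes than non-normalized ones); the gene and read counts below are illustrative, loosely echoing the scale of the Arabidopsis dataset.

```python
def expected_genes_tagged(n_genes, n_reads):
    """Expected number of distinct genes hit when n_reads land uniformly at
    random on n_genes equally abundant transcripts (coupon-collector model)."""
    return n_genes * (1.0 - (1.0 - 1.0 / n_genes) ** n_reads)

# Diminishing returns: doubling reads does not double the genes tagged.
print(round(expected_genes_tagged(15000, 130000)))
print(round(expected_genes_tagged(15000, 260000)))
```

A cost simulator effectively evaluates curves like this per platform, with empirical read-to-gene mappings replacing the uniformity assumption.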
Single molecule sequencing of the M13 virus genome without amplification
Zhao, Luyang; Deng, Liwei; Li, Gailing; Jin, Huan; Cai, Jinsen; Shang, Huan; Li, Yan; Wu, Haomin; Xu, Weibin; Zeng, Lidong; Zhang, Renli; Zhao, Huan; Wu, Ping; Zhou, Zhiliang; Zheng, Jiao; Ezanno, Pierre; Yang, Andrew X.; Yan, Qin; Deem, Michael W.; He, Jiankui
2017-01-01
Next generation sequencing (NGS) has revolutionized life sciences research. However, GC bias and costly, time-intensive library preparation make NGS an ill fit for increasing sequencing demands in the clinic. A new class of third-generation sequencing platforms has arrived to meet this need, capable of directly measuring DNA and RNA sequences at the single-molecule level without amplification. Here, we use the new GenoCare single-molecule sequencing platform from Direct Genomics to sequence the genome of the M13 virus. Our platform detects single-molecule fluorescence by total internal reflection microscopy, with sequencing-by-synthesis chemistry. We sequenced the genome of M13 to a depth of 316x, with 100% coverage. We determined a consensus sequence accuracy of 100%. In contrast to GC bias inherent to NGS results, we demonstrated that our single-molecule sequencing method yields minimal GC bias. PMID:29253901
Fast and secure encryption-decryption method based on chaotic dynamics
Protopopescu, Vladimir A.; Santoro, Robert T.; Tolliver, Johnny S.
1995-01-01
A method and system for the secure encryption of information. The method comprises the steps of dividing a message of length L into its character components; generating m chaotic iterates from m independent chaotic maps; producing an "initial" value based upon the m chaotic iterates; transforming the "initial" value to create a pseudo-random integer; repeating the steps of generating, producing and transforming until a pseudo-random integer sequence of length L is created; and encrypting the message as ciphertext based upon the pseudo-random integer sequence. A system for accomplishing the invention is also provided.
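The steps enumerated in the claim can be sketched as follows. The choice of logistic maps for the m independent chaotic maps, the map parameters, and XOR as the final encryption step are illustrative assumptions; the patent does not fix these details here.

```python
def logistic_iterates(x0, r=3.99, burn_in=100):
    """Yield chaotic iterates of the logistic map x_{n+1} = r*x_n*(1 - x_n)."""
    x = x0
    for _ in range(burn_in):               # discard the transient
        x = r * x * (1.0 - x)
    while True:
        x = r * x * (1.0 - x)
        yield x

def keystream(seeds, length):
    """Combine m independent chaotic maps into a length-L integer sequence."""
    maps = [logistic_iterates(s) for s in seeds]
    out = []
    for _ in range(length):
        initial = sum(next(m) for m in maps) % 1.0   # "initial" value from m iterates
        out.append(int(initial * 256) % 256)         # transform to a pseudo-random integer
    return out

def encrypt(message, seeds):
    """Encrypt by XORing message bytes with the chaotic keystream (illustrative)."""
    data = message.encode()
    return bytes(b ^ k for b, k in zip(data, keystream(seeds, len(data))))

ciphertext = encrypt("attack at dawn", [0.123, 0.456, 0.789])
# XOR is its own inverse, so decryption re-applies the same keystream
plaintext = bytes(
    b ^ k for b, k in zip(ciphertext, keystream([0.123, 0.456, 0.789], len(ciphertext)))
).decode()
```

Because each map is deterministic given its seed, the seeds play the role of the shared secret key.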
Method of multiplexed analysis using ion mobility spectrometer
Belov, Mikhail E [Richland, WA; Smith, Richard D [Richland, WA
2009-06-02
A method for analyzing analytes from a sample introduced into a Spectrometer by generating a pseudo-random sequence of modulation bins, organizing each modulation bin as a series of submodulation bins, thereby forming an extended pseudo-random sequence of submodulation bins, releasing the analytes in a series of analyte packets into a Spectrometer, thereby generating an unknown original ion signal vector, detecting the analytes at a detector, and characterizing the sample using the plurality of analyte signal subvectors. The method is advantageously applied to an Ion Mobility Spectrometer, and an Ion Mobility Spectrometer interfaced with a Time of Flight Mass Spectrometer.
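The pseudo-random modulation sequence in such multiplexing schemes is typically a maximal-length binary sequence. A minimal Galois LFSR sketch follows; the polynomial, register width, and submodulation factor are illustrative choices, not the parameters of the patented method.

```python
def prbs(mask=0b1100, nbits=4, seed=0b0001):
    """Maximal-length pseudo-random binary sequence from a Galois LFSR
    (polynomial x^4 + x^3 + 1). A 1 opens the ion gate for that modulation
    bin; a 0 keeps it closed."""
    state = seed
    seq = []
    for _ in range(2 ** nbits - 1):     # maximal period 2^n - 1 = 15
        seq.append(state & 1)
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask
    return seq

modulation = prbs()
# Each modulation bin is organized as a series of submodulation bins;
# here, simply by subdividing each element four ways.
extended = [b for b in modulation for _ in range(4)]
print(len(modulation), sum(modulation), len(extended))   # 15 8 60
```

An m-sequence of period 15 contains 8 ones and 7 zeros, so roughly half the gate openings are used, which is what yields the multiplexing throughput gain over a single-gate-opening experiment.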
Radtke, Robert P; Stokes, Robert H; Glowka, David A
2014-12-02
A method for operating an impulsive type seismic energy source in a firing sequence having at least two actuations for each seismic impulse to be generated by the source. The actuations have a time delay between them related to a selected energy frequency peak of the source output. One example of the method is used for generating seismic signals in a wellbore and includes discharging electric current through a spark gap disposed in the wellbore in at least one firing sequence. The sequence includes at least two actuations of the spark gap separated by an amount of time selected to cause acoustic energy resulting from the actuations to have peak amplitude at a selected frequency.
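The relationship between the inter-actuation delay and the spectral peak can be illustrated with the two-impulse spectrum: a pair of unit impulses separated by dt seconds has magnitude spectrum 2|cos(pi f dt)|, which is maximal when f*dt is an integer. The abstract does not give the patented source's exact delay rule; this sketch shows only the general principle, with an invented target frequency.

```python
import math

def pair_spectrum_mag(f_hz, dt_s):
    """Magnitude of the Fourier transform of two unit impulses dt_s apart."""
    return abs(2.0 * math.cos(math.pi * f_hz * dt_s))

f_peak = 50.0            # desired spectral peak in Hz (illustrative)
dt = 1.0 / f_peak        # delay placing a constructive maximum at f_peak
print(pair_spectrum_mag(f_peak, dt))          # fully constructive at the target
print(pair_spectrum_mag(f_peak / 2.0, dt))    # a spectral null midway between peaks
```

Firing twice per impulse therefore shapes the emitted spectrum around the selected frequency rather than changing the total energy alone.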
Wall, Melanie; Cheslack-Postava, Keely; Hu, Mei-Chen; Feng, Tianshu; Griesler, Pamela; Kandel, Denise B
2018-01-01
This study sought to specify (1) the position of nonmedical prescription opioids (NMPO) in drug initiation sequences among Millennials (1979-96), Generation X (1964-79), and Baby Boomers (1949-64) and (2) gender and racial/ethnic differences in sequences among Millennials. Data are from the 2013-2014 National Surveys on Drug Use and Health (n = 73,026). We identified statistically significant drug initiation sequences involving alcohol/cigarettes, marijuana, NMPO, cocaine, and heroin using a novel method distinguishing significant sequences from patterns expected only due to correlations induced by common liability among drugs. Alcohol/cigarettes followed by marijuana was the most common sequence. NMPO or cocaine use after marijuana, and heroin use after NMPO or cocaine, differed by generation. Among successively younger generations, NMPO after marijuana and heroin after NMPO increased. Millennials were more likely to initiate NMPO than cocaine after marijuana; Generation X and Baby Boomers were less likely (odds ratios = 1.4;0.3;0.2). Millennials were more likely than Generation X and Baby Boomers to use heroin after NMPO (hazard ratios = 7.1;3.4;2.5). In each generation, heroin users were far more likely to start heroin after both NMPO and cocaine than either alone. Sequences were similar by gender. Fewer paths were significant among African-Americans. NMPOs play a more prominent role in drug initiation sequences among Millennials than prior generations. Among Millennials, NMPO use is more likely than cocaine to follow marijuana use. In all generations, transition to heroin from NMPO significantly occurs only when both NMPO and cocaine have been used. Delineation of drug sequences suggests optimal points in development for prevention and treatment efforts. Copyright © 2017 Elsevier B.V. All rights reserved.
Rapid Sequencing of Complete env Genes from Primary HIV-1 Samples
Eren, Kemal; Ignacio, Caroline; Landais, Elise; Weaver, Steven; Phung, Pham; Ludka, Colleen; Hepler, Lance; Caballero, Gemma; Pollner, Tristan; Guo, Yan; Richman, Douglas; Poignard, Pascal; Paxinos, Ellen E.; Kosakovsky Pond, Sergei L.
2016-01-01
The ability to study rapidly evolving viral populations has been constrained by the read length of next-generation sequencing approaches and the sampling depth of single-genome amplification methods. Here, we develop and characterize a method using Pacific Biosciences’ Single Molecule, Real-Time (SMRT®) sequencing technology to sequence multiple, intact full-length human immunodeficiency virus-1 env genes amplified from viral RNA populations circulating in blood, and provide computational tools for analyzing and visualizing these data. PMID:29492273
Mohammed, Monzoorul Haque; Dutta, Anirban; Bose, Tungadri; Chadaram, Sudha; Mande, Sharmila S
2012-10-01
An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data. We present a novel compression algorithm (DELIMINATE) that can rapidly compress genomic sequence data in a loss-less fashion. Validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general-purpose compression algorithms, namely gzip, bzip2 and lzma. Linux, Windows and Mac implementations (both 32 and 64-bit) of DELIMINATE are freely available for download at http://metagenomics.atc.tcs.com/compression/DELIMINATE. Contact: sharmila@atc.tcs.com. Supplementary data are available at Bioinformatics online.
Nakamura, Kosuke; Kondo, Kazunari; Akiyama, Hiroshi; Ishigaki, Takumi; Noguchi, Akio; Katsumata, Hiroshi; Takasaki, Kazuto; Futo, Satoshi; Sakata, Kozue; Fukuda, Nozomi; Mano, Junichi; Kitta, Kazumi; Tanaka, Hidenori; Akashi, Ryo; Nishimaki-Mogami, Tomoko
2016-08-15
Identification of transgenic sequences in an unknown genetically modified (GM) papaya (Carica papaya L.) by whole genome sequence analysis was demonstrated. Whole genome sequence data were generated for a GM-positive fresh papaya fruit commodity detected in monitoring using real-time polymerase chain reaction (PCR). The sequences obtained were mapped against an open database for papaya genome sequence. Transgenic construct- and event-specific sequences were identified as a GM papaya developed to resist infection from a Papaya ringspot virus. Based on the transgenic sequences, a specific real-time PCR detection method for GM papaya applicable to various food commodities was developed. Whole genome sequence analysis enabled identifying unknown transgenic construct- and event-specific sequences in GM papaya and development of a reliable method for detecting them in papaya food commodities. Copyright © 2016 Elsevier Ltd. All rights reserved.
A multiplex primer design algorithm for target amplification of continuous genomic regions.
Ozturk, Ahmet Rasit; Can, Tolga
2017-06-19
Targeted next-generation sequencing (NGS) assays are cost-efficient and reliable alternatives to Sanger sequencing. For sequencing a very large set of genes, the target enrichment approach is suitable. However, for smaller genomic regions, the target amplification method is more efficient than both target enrichment and Sanger sequencing. The major difficulty of the target amplification method is the preparation of amplicons, in terms of required time, equipment, and labor. Multiplex PCR (MPCR) is a good solution to these problems. We propose a novel method to design MPCR primers for a continuous genomic region, following the best practices of clinically reliable PCR design processes. On an experimental setup with 48 different combinations of factors, we have shown that multiple parameters might affect finding the first feasible solution. Increasing the length of the initial primer candidate selection sequence gives better results, whereas waiting longer for the first feasible solution does not have a significant impact. We generated MPCR primer designs for the HBB whole gene, MEFV coding regions, and human exons between 2000 and 2100 bp long. Our benchmarking experiments show that the proposed MPCR approach is able to produce reliable NGS assay primers for a given sequence in a reasonable amount of time.
Exome-wide DNA capture and next generation sequencing in domestic and wild species.
Cosart, Ted; Beja-Pereira, Albano; Chen, Shanyuan; Ng, Sarah B; Shendure, Jay; Luikart, Gordon
2011-07-05
Gene-targeted and genome-wide markers are crucial to advance evolutionary biology, agriculture, and biodiversity conservation by improving our understanding of genetic processes underlying adaptation and speciation. Unfortunately, for eukaryotic species with large genomes it remains costly to obtain genome sequences and to develop genome resources such as genome-wide SNPs. A method is needed to allow gene-targeted, next-generation sequencing that is flexible enough to include any gene or number of genes, unlike transcriptome sequencing. Such a method would allow sequencing of many individuals, avoiding ascertainment bias in subsequent population genetic analyses. We demonstrate the usefulness of a recent technology, exon capture, for genome-wide, gene-targeted marker discovery in species with no genome resources. We use coding gene sequences from the domestic cow genome sequence (Bos taurus) to capture (enrich for), and subsequently sequence, thousands of exons of B. taurus, B. indicus, and Bison bison (wild bison). Our capture array has probes for 16,131 exons in 2,570 genes, including 203 candidate genes with known function and of interest for their association with disease and other fitness traits. We successfully sequenced and mapped exon sequences from across the 29 autosomes and X chromosome in the B. taurus genome sequence. Exon capture and high-throughput sequencing identified thousands of putative SNPs spread evenly across all reference chromosomes, in all three individuals, including hundreds of SNPs in our targeted candidate genes. This study shows that exon capture can be customized for SNP discovery in many individuals and for non-model species without genomic resources. Our captured exome subset was small enough for affordable next-generation sequencing, and successfully captured exons from a divergent wild species using the domestic cow genome as reference.
Bit error rate tester using fast parallel generation of linear recurring sequences
Pierson, Lyndon G.; Witzke, Edward L.; Maestas, Joseph H.
2003-05-06
A fast method for generating linear recurring sequences by parallel linear recurring sequence generators (LRSGs) with a feedback circuit optimized to balance minimum propagation delay against maximal sequence period. Parallel generation of linear recurring sequences requires decimating the sequence (creating small contiguous sections of the sequence in each LRSG). A companion matrix form is selected depending on whether the LFSR is right-shifting or left-shifting. The companion matrix is completed by selecting a primitive irreducible polynomial with 1's most closely grouped in a corner of the companion matrix. A decimation matrix is created by raising the companion matrix to the (n·k)th power, where k is the number of parallel LRSGs and n is the number of bits to be generated at a time by each LRSG. Companion matrices with 1's closely grouped in a corner will yield sparse decimation matrices. A feedback circuit comprised of XOR logic gates implements the decimation matrix in hardware. Sparse decimation matrices can be implemented with a minimum number of XOR gates, and therefore a minimum propagation delay through the feedback circuit. The LRSG of the invention is particularly well suited to use as a bit error rate tester on high speed communication lines because it permits the receiver to synchronize to the transmitted pattern within 2n bits.
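The decimation scheme described above can be sketched in software. The following is a minimal sketch using a toy degree-4 primitive polynomial (x^4 + x + 1), not the patented hardware circuit: each of k generators emits n contiguous bits, then jumps ahead k·n steps by multiplying its state by the precomputed decimation matrix M^(n·k) over GF(2).

```python
import numpy as np

def mat_pow_gf2(M, e):
    """Square-and-multiply matrix exponentiation modulo 2."""
    R = np.eye(M.shape[0], dtype=np.uint8)
    M = M.copy()
    while e:
        if e & 1:
            R = (R @ M) % 2
        M = (M @ M) % 2
        e >>= 1
    return R

# State-update (companion) matrix for the primitive polynomial
# x^4 + x + 1, i.e. the recurrence s[t+4] = s[t+1] + s[t] over GF(2).
M = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 0]], dtype=np.uint8)

def serial_bits(s0, count):
    """Reference serial LFSR: emit one bit per state update."""
    s, out = s0.copy(), []
    for _ in range(count):
        out.append(int(s[0]))
        s = (M @ s) % 2
    return out

def parallel_bits(s0, k, n, cycles):
    """k LRSGs, each emitting n contiguous bits per cycle, then jumping
    ahead k*n steps with the precomputed decimation matrix D = M^(k*n)."""
    D = mat_pow_gf2(M, k * n)
    # generator j starts offset by j*n steps into the sequence
    states = [(mat_pow_gf2(M, j * n) @ s0) % 2 for j in range(k)]
    out = []
    for _ in range(cycles):
        for j in range(k):
            s = states[j].copy()
            for _ in range(n):                 # emit n contiguous bits
                out.append(int(s[0]))
                s = (M @ s) % 2
            states[j] = (D @ states[j]) % 2    # jump past the other LRSGs
    return out

s0 = np.array([1, 0, 0, 0], dtype=np.uint8)
assert parallel_bits(s0, k=2, n=3, cycles=5) == serial_bits(s0, 30)
```

Because the polynomial is primitive, the serial generator walks through all 15 nonzero states before repeating, and the interleaved output of the parallel generators reproduces the serial sequence exactly.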
Davey, Sue; Navarrete, Cristina; Brown, Colin
2017-06-01
Twenty-nine human platelet antigen systems have been described to date, but the majority of current genotyping methods are restricted to the identification of those most commonly associated with alloantibody production in a clinical context. This can result in a protracted investigation if causative human platelet antigens are rare or novel. A targeted next-generation sequencing approach was designed to detect all known human platelet antigens with the additional capability of identifying novel mutations in the encoding genes. A targeted enrichment, high-sensitivity HaloPlex assay was designed to sequence all exons and flanking regions of the six genes known to encode human platelet antigens. Indexed DNA libraries were prepared from 47 previously human platelet antigen-genotyped samples and subsequently combined into one of three pools for sequencing on an Illumina MiSeq platform. The generated FASTQ files were aligned and scrutinized for each human platelet antigen polymorphism using SureCall data analysis software. Forty-six samples were successfully genotyped for human platelet antigens 1 through 29bw, with an average per base coverage depth of 1144. Concordance with historical human platelet antigen genotypes was 100%. A putative novel mutation in Exon 10 of the integrin β-3 (ITGB3) gene from an unsolved case of fetal neonatal alloimmune thrombocytopenia was also detected. A next-generation sequencing-based method that can accurately define all known human platelet antigen polymorphisms was developed. With the ability to sequence up to 96 samples simultaneously, our HaloPlex design could be used for high-throughput human platelet antigen genotyping. This method is also applicable for investigating fetal neonatal alloimmune thrombocytopenia when rare or novel human platelet antigens are suspected. © 2017 AABB.
Pancoska, Petr; Moravek, Zdenek; Moll, Ute M
2004-01-01
Nucleic acids are molecules of choice for both established and emerging nanoscale technologies. These technologies benefit from large functional densities of 'DNA processing elements' that can be readily manufactured. To achieve the desired functionality, polynucleotide sequences are currently designed by a process that involves tedious and laborious filtering of potential candidates against a series of requirements and parameters. Here, we present a complete novel methodology for the rapid rational design of large sets of DNA sequences. This method allows for the direct implementation of very complex and detailed requirements for the generated sequences, thus avoiding 'brute force' filtering. At the same time, these sequences have narrow distributions of melting temperatures. The molecular part of the design process can be done without computer assistance, using an efficient 'human engineering' approach by drawing a single blueprint graph that represents all generated sequences. Moreover, the method eliminates the necessity for extensive thermodynamic calculations. Melting temperature can be calculated only once (or not at all). In addition, the isostability of the sequences is independent of the selection of a particular set of thermodynamic parameters. Applications are presented for DNA sequence designs for microarrays, universal microarray zip sequences and electron transfer experiments.
Quantiprot - a Python package for quantitative analysis of protein sequences.
Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold
2017-07-17
The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of this approach is that quantitative properties define a multidimensional solution space in which sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates the distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application for the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where a large number of sequences generated by a model can be compared to actually observed sequences.
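Two of the characteristics mentioned, the n-gram distribution and the Zipf's law coefficient, are straightforward to compute. The sketch below is an independent illustration of those quantities, not Quantiprot's actual API: the coefficient is estimated as the least-squares slope of log-frequency versus log-rank.

```python
import math
from collections import Counter

def ngram_counts(seq, n=2):
    """Counts of overlapping n-grams in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def zipf_coefficient(counts):
    """Least-squares slope of log(frequency) vs log(rank), a simple
    estimate of the Zipf's law exponent for an n-gram distribution."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

counts = ngram_counts("MKVLAAGMKVLA", n=2)
slope = zipf_coefficient(counts)   # more negative = steeper rank decay
```

A perfectly Zipfian distribution would give a slope near -1; flat (uniform) n-gram usage gives a slope near 0.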
Nilyanimit, Pornjarim; Chansaenroj, Jira; Poomipak, Witthaya; Praianantathavorn, Kesmanee; Payungporn, Sunchai; Poovorawan, Yong
2018-03-01
Human papillomavirus (HPV) infection causes cervical cancer, thus necessitating early detection by screening. Rapid and accurate HPV genotyping is crucial both for the assessment of patients with HPV infection and for surveillance studies. Fifty-eight cervicovaginal samples were tested for HPV genotypes using four methods in parallel: nested-PCR followed by conventional sequencing, INNO-LiPA, electrochemical DNA chip, and next-generation sequencing (NGS). Seven HPV genotypes (16, 18, 31, 33, 45, 56, and 58) were identified by all four methods. Nineteen HPV genotypes were detected by NGS, but not by nested-PCR, INNO-LiPA, or electrochemical DNA chip. Although NGS is relatively expensive and complex, it may serve as a sensitive HPV genotyping method. Because of its highly sensitive detection of multiple HPV genotypes, NGS may serve as an alternative for diagnostic HPV genotyping in certain situations. © The Korean Society for Laboratory Medicine
Okamoto, Nobuhiko; Nakashima, Mitsuko; Tsurusaki, Yoshinori; Miyake, Noriko; Saitsu, Hirotomo; Matsumoto, Naomichi
2013-01-01
Next-generation sequencing (NGS) combined with enrichment of target genes enables highly efficient and low-cost sequencing of multiple genes for genetic diseases. The aim of this study was to validate the accuracy and sensitivity of our method for comprehensive mutation detection in autism spectrum disorder (ASD). We assessed the performance of the bench-top Ion Torrent PGM and Illumina MiSeq platforms as optimized solutions for mutation detection, using microdroplet PCR-based enrichment of 62 ASD associated genes. Ten patients with known mutations were sequenced using NGS to validate the sensitivity of our method. The overall read quality was better with MiSeq, largely because of the increased indel-related error associated with PGM. The sensitivity of SNV detection was similar between the two platforms, suggesting they are both suitable for SNV detection in the human genome. Next, we used these methods to analyze 28 patients with ASD, and identified 22 novel variants in genes associated with ASD, with one mutation detected by MiSeq only. Thus, our results support the combination of target gene enrichment and NGS as a valuable molecular method for investigating rare variants in ASD. PMID:24066114
Comparing viral metagenomics methods using a highly multiplexed human viral pathogens reagent
Li, Linlin; Deng, Xutao; Mee, Edward T.; Collot-Teixeira, Sophie; Anderson, Rob; Schepelmann, Silke; Minor, Philip D.; Delwart, Eric
2014-01-01
Unbiased metagenomic sequencing holds significant potential as a diagnostic tool for the simultaneous detection of any previously genetically described viral nucleic acids in clinical samples. Viral genome sequences can also inform on likely phenotypes including drug susceptibility or neutralization serotypes. In this study, the effects of different variables in the laboratory methods often used to generate viral metagenomics libraries were compared with respect to the efficiency of viral detection and virus genome coverage. A biological reagent consisting of 25 different human RNA and DNA viral pathogens was used to estimate the effect of filtration and nuclease digestion, DNA/RNA extraction methods, pre-amplification and the use of different library preparation kits on the detection of viral nucleic acids. Filtration and nuclease treatment led to slight decreases in the percentage of viral sequence reads and the number of viruses detected. For nucleic acid extraction, silica spin columns improved viral sequence recovery relative to magnetic beads and Trizol extraction. Pre-amplification using random RT-PCR, while generating more viral sequence reads, resulted in detection of fewer viruses, more overlapping sequences, and lower genome coverage. The ScriptSeq library preparation method retrieved more viruses and a greater fraction of their genomes than the TruSeq and Nextera methods. Viral metagenomics sequencing was able to simultaneously detect up to 22 different viruses in the biological reagent analyzed, including all those detected by qPCR. Further optimization will be required for the detection of viruses in biologically more complex samples such as tissues, blood, or feces. PMID:25497414
Two general models that generate long range correlation
NASA Astrophysics Data System (ADS)
Gan, Xiaocong; Han, Zhangang
2012-06-01
In this paper we study two models that generate sequences with LRC (long-range correlation). For the IFT (inverse Fourier transform) model, our conclusion is that the low-frequency part leads to LRC, while the high-frequency part tends to eliminate it. Therefore, a typical method to generate a sequence with LRC is to multiply the spectrum of a white noise sequence by a decaying function. A special case is analyzed: the linear combination of a smooth curve and a white noise sequence, in which the DFA plot consists of two line segments. For the patch model, our conclusion is that long subsequences lead to LRC, while short subsequences tend to eliminate it. Therefore, we can generate a sequence with LRC by using a fat-tailed PDF (probability distribution function) for the length of the subsequences. A special case is also analyzed: if a patch model with long subsequences is mixed with a white noise sequence, the DFA plot will consist of two line segments. We have checked known models and actual data, and found they are all consistent with this study.
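The IFT recipe above (damp the high-frequency part of a white-noise spectrum, then invert the transform) can be sketched directly. In this minimal sketch the power-law exponent `beta` is an assumed tuning parameter for the decaying function, not a value from the paper:

```python
import numpy as np

def lrc_sequence(n, beta=1.0, seed=0):
    """Generate a sequence with long-range correlation by shaping the
    spectrum of white noise: multiply each Fourier coefficient by a
    power-law decay f**(-beta/2), then apply the inverse transform."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                  # avoid division by zero at DC
    spectrum *= freqs ** (-beta / 2)     # damp the high-frequency part
    return np.fft.irfft(spectrum, n)

x = lrc_sequence(4096, beta=1.0)
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]    # lag-1 autocorrelation
```

The shaped sequence shows strong short-lag autocorrelation, unlike the white noise it was built from; plotting its DFA fluctuation function on log-log axes would reveal the scaling the paper analyzes.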
Chen, He; Yao, Jiacheng; Fu, Yusi; Pang, Yuhong; Wang, Jianbin; Huang, Yanyi
2018-04-11
Next-generation sequencing (NGS) technologies have rapidly evolved and been applied to various research fields, but they often suffer from losing long-range information due to short library sizes and read lengths. Here, we develop a simple, cost-efficient, and versatile NGS library preparation method, called tagmentation on microbeads (TOM). This method is capable of recovering long-range information through tagmentation mediated by microbead-immobilized transposomes. Using transposomes with DNA barcodes to identically label adjacent sequences during tagmentation, we can restore the inter-read connections of fragments from the original DNA molecule through fragment-barcode linkage after sequencing. In our proof-of-principle experiment, more than 4.5% of the reads are linked with their adjacent reads, and the longest linkage is over 1112 bp. We demonstrate TOM with eight barcodes, but the number of barcodes can be scaled up by an ultrahigh-complexity construction. We also show that this method has low amplification bias and is well suited to applications such as identifying copy number variations.
Fritsch, Leonie; Fischer, Rainer; Wambach, Christoph; Dudek, Max; Schillberg, Stefan; Schröper, Florian
2015-08-01
Simple and reliable, high-throughput techniques to detect the zygosity of transgenic events in plants are valuable for biotechnology and plant breeding companies seeking robust genotyping data for the assessment of new lines and the monitoring of breeding programs. We show that next-generation sequencing (NGS) applied to short PCR products spanning the transgene integration site provides accurate zygosity data that are more robust and reliable than those generated by PCR-based methods. The NGS reads covered the 5' border of the transgenic events (incorporating part of the transgene and the flanking genomic DNA), or the genomic sequences flanking the unfilled transgene integration site at the wild-type locus. We compared the NGS method to competitive real-time PCR with transgene-specific and wild-type-specific primer/probe pairs, one pair matching the 5' genomic flanking sequence and 5' part of the transgene and the other matching the unfilled transgene integration site. Although both NGS and real-time PCR provided useful zygosity data, the NGS technique was favorable because it needed fewer optimization steps. It also provided statistically more-reliable evidence for the presence of each allele because each product was often covered by more than 100 reads. The NGS method is also more suitable for the genotyping of large panels of plants because up to 80 million reads can be produced in one sequencing run. Our novel method is therefore ideal for the rapid and accurate genotyping of large numbers of samples.
A better sequence-read simulator program for metagenomics.
Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony
2014-01-01
There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.
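The non-parametric idea, sampling read lengths from the empirically observed distribution rather than assuming a uniform or normal model, can be sketched as follows. This is an illustrative toy, not BEAR's machine-learning implementation, and it ignores quality values and error models entirely:

```python
import random
from collections import Counter

def empirical_length_sampler(observed_lengths):
    """Build a non-parametric sampler over read lengths seen in real
    sequencing data, so simulated lengths follow the empirical
    distribution instead of a predefined parametric model."""
    counts = Counter(observed_lengths)
    lengths = list(counts)
    weights = [counts[l] for l in lengths]
    def sample(k, seed=None):
        rng = random.Random(seed)
        return rng.choices(lengths, weights=weights, k=k)
    return sample

def simulate_reads(genome, n_reads, sampler, seed=1):
    """Draw reads from random positions with empirically sampled lengths."""
    rng = random.Random(seed)
    reads = []
    for length in sampler(n_reads, seed=seed):
        start = rng.randrange(0, max(1, len(genome) - length))
        reads.append(genome[start:start + length])
    return reads

observed = [100, 100, 150, 150, 150, 200]     # toy empirical lengths
sampler = empirical_length_sampler(observed)
reads = simulate_reads("ACGT" * 500, 10, sampler)
```

A real simulator would additionally learn per-position quality and error profiles from the input data, as the abstract describes, and draw genome abundances for the metagenomic community.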
Quantum-Sequencing: Fast electronic single DNA molecule sequencing
NASA Astrophysics Data System (ADS)
Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant
2014-03-01
A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free, high-throughput and cost-effective, single-molecule sequencing method. Here, we present the first demonstration of a unique "electronic fingerprint" of all nucleotides (A, G, T, C), with single-molecule DNA sequencing, using Quantum-tunneling Sequencing (Q-Seq) at room temperature. We show that the electronic state of the nucleobases shift depending on the pH, with most distinct states identified at acidic pH. We also demonstrate identification of single nucleotide modifications (methylation here). Using these unique electronic fingerprints (or tunneling data), we report a partial sequence of beta lactamase (bla) gene, which encodes resistance to beta-lactam antibiotics, with over 95% success rate. These results highlight the potential of Q-Seq as a robust technique for next-generation sequencing.
Neugebauer, Tomasz; Bordeleau, Eric; Burrus, Vincent; Brzezinski, Ryszard
2015-01-01
Data visualization methods are necessary during the exploration and analysis activities of an increasingly data-intensive scientific process. There are few existing visualization methods for raw nucleotide sequences of a whole genome or chromosome. Software for data visualization should allow the researchers to create accessible data visualization interfaces that can be exported and shared with others on the web. Herein, novel software developed for generating DNA data visualization interfaces is described. The software converts DNA data sets into images that are further processed as multi-scale images to be accessed through a web-based interface that supports zooming, panning and sequence fragment selection. Nucleotide composition frequencies and GC skew of a selected sequence segment can be obtained through the interface. The software was used to generate DNA data visualization of human and bacterial chromosomes. Examples of visually detectable features such as short and long direct repeats, long terminal repeats, mobile genetic elements, heterochromatic segments in microbial and human chromosomes, are presented. The software and its source code are available for download and further development. The visualization interfaces generated with the software allow for the immediate identification and observation of several types of sequence patterns in genomes of various sizes and origins. The visualization interfaces generated with the software are readily accessible through a web browser. This software is a useful research and teaching tool for genetics and structural genomics.
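The per-segment statistics the interface exposes, nucleotide composition frequencies and GC skew, reduce to a few lines of code. The sketch below illustrates only the underlying computation and is unrelated to the actual tool's source:

```python
def gc_skew(seq, window=1000):
    """Windowed GC skew, (G - C)/(G + C), over a nucleotide sequence.
    Sign changes in the skew often mark replication origins/termini."""
    skews = []
    for i in range(0, len(seq) - window + 1, window):
        w = seq[i:i + window].upper()
        g, c = w.count('G'), w.count('C')
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews

def composition(seq):
    """Nucleotide composition frequencies for a selected segment."""
    seq = seq.upper()
    return {b: seq.count(b) / len(seq) for b in 'ACGT'}

skews = gc_skew("GGGGCC" * 200, window=600)   # toy G-rich sequence
freqs = composition("ACGT" * 100)
```

In a zoomable interface these values would be recomputed for whatever fragment the user selects.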
Tillmar, Andreas O.; Dell'Amico, Barbara; Welander, Jenny; Holmlund, Gunilla
2013-01-01
Species identification can be of interest in a wide range of areas, for example in forensic applications, food monitoring and archeology. The vast majority of existing DNA typing methods developed for species determination focus mainly on a single species source. There are, however, many instances where all species from mixed sources need to be determined, even when a minority species constitutes less than 1% of the sample. The introduction of next generation sequencing opens new possibilities for such challenging samples. In this study we present a universal deep sequencing method using 454 GS Junior sequencing of a target on the mitochondrial gene 16S rRNA. The method was designed through phylogenetic analyses of DNA reference sequences from more than 300 mammal species. Experiments were performed on artificial species-species mixture samples in order to verify the method's robustness and its ability to detect all species within a mixture. The method was also tested on samples from authentic forensic casework. The results were promising, demonstrating discrimination of over 99.9% of mammal species, the ability to detect multiple donors within a mixture, and detection of minor components as low as 1% of a mixed sample. PMID:24358309
Minimizing structural vibrations with Input Shaping (TM)
NASA Technical Reports Server (NTRS)
Singhose, Bill; Singer, Neil
1995-01-01
A new method for commanding machines to move with increased dynamic performance was developed. This method is an enhanced version of input shaping, a patented vibration suppression algorithm. This technique intercepts a command input to a system and reshapes it into a command that moves the mechanical system with increased performance and reduced residual vibration. This document describes many advanced methods for generating highly optimized shaping sequences which are tuned to particular systems. The shaping sequence is important because it determines the trade-off between the move/settle time of the system and the insensitivity of the input shaping algorithm to variations or uncertainties in the machine being controlled. For example, a system with a 5 Hz resonance that takes 1 second to settle can be improved to settle in 0.2 second using a 0.2 second shaping sequence (thus improving settle time by a factor of 5). This system could vary by plus or minus 15% in its natural frequency and still have no apparent vibration. However, the same system shaped with a 0.3 second shaping sequence could tolerate plus or minus 40% or more variation in natural frequency. This document describes how to generate sequences that maximize performance, sequences that maximize insensitivity, and sequences that trade off between the two. Several software tools are documented and included.
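The simplest shaping sequence of this kind is the two-impulse Zero-Vibration (ZV) shaper. The sketch below uses the textbook ZV formulas, not the patented Input Shaping tool, for a mode at a given frequency and damping ratio; convolving the command with these impulses produces the shaped command:

```python
import math

def zv_shaper(freq_hz, zeta=0.0):
    """Two-impulse Zero-Vibration (ZV) shaper for a vibratory mode at
    freq_hz with damping ratio zeta.  Returns impulse amplitudes (which
    sum to 1, so the final setpoint is unchanged) and impulse times
    (the second impulse lands half a damped period after the first)."""
    wd = 2 * math.pi * freq_hz * math.sqrt(1 - zeta ** 2)   # damped freq
    K = math.exp(-zeta * math.pi / math.sqrt(1 - zeta ** 2))
    amps = [1 / (1 + K), K / (1 + K)]
    times = [0.0, math.pi / wd]
    return amps, times

amps, times = zv_shaper(5.0)   # the 5 Hz mode from the example above
# an undamped 5 Hz mode gives two equal impulses 0.1 s apart, so the
# shaped command is only 0.1 s longer than the unshaped one
```

More insensitive designs (ZVD, EI) add impulses, lengthening the sequence in exchange for tolerance to frequency error, which is exactly the trade-off the passage describes.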
Milius, Robert P; Heuer, Michael; Valiga, Daniel; Doroschak, Kathryn J; Kennedy, Caleb J; Bolon, Yung-Tsi; Schneider, Joel; Pollack, Jane; Kim, Hwa Ran; Cereb, Nezih; Hollenbach, Jill A; Mack, Steven J; Maiers, Martin
2015-12-01
We present an electronic format for exchanging data for HLA and KIR genotyping with extensions for next-generation sequencing (NGS). This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the proposed Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines (miring.immunogenomics.org). Our refinements of HML include two major additions. First, NGS is supported by new XML structures to capture additional NGS data and metadata required to produce a genotyping result, including analysis-dependent (dynamic) and method-dependent (static) components. A full genotype, consensus sequence, and the surrounding metadata are included directly, while the raw sequence reads and platform documentation are externally referenced. Second, genotype ambiguity is fully represented by integrating Genotype List Strings, which use a hierarchical set of delimiters to represent allele and genotype ambiguity in a complete and accurate fashion. HML also continues to enable the transmission of legacy methods (e.g. sequence-specific oligonucleotide, sequence-specific priming, and Sequence Based Typing (SBT)), adding features such as allowing multiple group-specific sequencing primers, and fully leveraging techniques that combine multiple methods to obtain a single result, such as SBT integrated with NGS. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
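A Genotype List String encodes ambiguity with a hierarchy of delimiters. Assuming the common reading in which '^' separates loci, '|' separates ambiguous genotypes, '+' separates the two gene copies, and '/' separates ambiguous alleles, the string maps naturally onto nested lists. This is an illustrative sketch of the delimiter hierarchy, not the reference GL String grammar:

```python
def parse_gl(gl):
    """Parse a GL String into locus -> genotype -> chromosome -> allele
    nesting using the hierarchical delimiters described above."""
    return [
        [
            [chrom.split('/') for chrom in genotype.split('+')]
            for genotype in locus.split('|')
        ]
        for locus in gl.split('^')
    ]

# two ambiguous alleles on one chromosome, an unambiguous allele on the other
parsed = parse_gl("HLA-A*01:01/HLA-A*01:02+HLA-A*24:02")
```

The nesting makes the ambiguity explicit rather than flattening it into a single best-guess call, which is the point of embedding GL Strings in HML.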
Review of sequencing platforms and their applications in phaeochromocytoma and paragangliomas.
Pillai, Suja; Gopalan, Vinod; Lam, Alfred King-Yin
2017-08-01
Genetic testing is recommended for patients with phaeochromocytoma (PCC) and paraganglioma (PGL) because of their genetic heterogeneity and heritability. Due to the large number of susceptibility genes associated with PCC/PGL, next-generation sequencing (NGS) technology is ideally suited for carrying out genetic screening of these individuals. New generations of DNA sequencing technologies facilitate the development of comprehensive genetic testing in PCC/PGL at a lower cost. Whole-exome sequencing and targeted NGS are the preferred methods for screening of PCC/PGL, both having precise mutation detection methods and low costs. RNA sequencing and DNA methylation studies using NGS technology in PCC/PGL can be adopted to act as diagnostic or prognostic biomarkers as well as in planning targeted epigenetic treatment of patients with PCC/PGL. The designs of NGS having a high depth of coverage and robust analytical pipelines can lead to the successful detection of a wide range of genomic defects in PCC/PGL. Nevertheless, the major challenges of this technology must be addressed before it has practical applications in the clinical diagnostics to fulfill the goal of personalized medicine in PCC/PGL. In future, novel approaches of sequencing, such as third and fourth generation sequencing can alter the workflow, cost, analysis, and interpretation of genomics associated with PCC/PGL. Copyright © 2017 Elsevier B.V. All rights reserved.
USDA-ARS's Scientific Manuscript database
Next generation sequencing technologies and improved bioinformatics methods have provided opportunities to study sequence variability in complex polyploid transcriptomes. In this study, we used a diverse panel of twenty-two Arachis accessions representing seven Arachis hypogaea market classes, A-, B...
Rhodes, Johanna; Beale, Mathew A; Fisher, Matthew C
2014-01-01
The industry of next-generation sequencing is constantly evolving, with novel library preparation methods and new sequencing machines being released by the major sequencing technology companies annually. The Illumina TruSeq v2 library preparation method was the most widely used kit and the market leader; however, it has now been discontinued, and in 2013 was replaced by the TruSeq Nano and TruSeq PCR-free methods, leaving a gap in knowledge regarding which is the most appropriate library preparation method to use. Here, we used isolates of the pathogenic fungus Cryptococcus neoformans var. grubii and sequenced them using the existing TruSeq DNA v2 kit (Illumina), along with two new kits: the TruSeq Nano DNA kit (Illumina) and the NEBNext Ultra DNA kit (New England Biolabs) to provide a comparison. Compared to the original TruSeq DNA v2 kit, both newer kits gave equivalent or better sequencing data, with increased coverage. When comparing the two newer kits, we found little difference in cost and workflow, with the NEBNext Ultra both slightly cheaper and faster than the TruSeq Nano. However, the quality of data generated using the TruSeq Nano DNA kit was superior due to higher coverage at regions of low GC content, and more SNPs identified. Researchers should therefore evaluate their resources and the type of application (and hence data quality) being considered when ultimately deciding on which library prep method to use.
Manlig, Erika; Wahlberg, Per
2017-01-01
Sodium bisulphite treatment of DNA combined with next generation sequencing (NGS) is a powerful approach for the interrogation of genome-wide DNA methylation profiles. Library preparation for whole genome bisulphite sequencing (WGBS) is challenging due to side effects of the bisulphite treatment, which leads to extensive DNA damage. Recently, a new generation of methods for bisulphite sequencing library preparation has been devised. They are based on initial bisulphite treatment of the DNA, followed by adaptor tagging of single-stranded DNA fragments, and enable WGBS using low quantities of input DNA. In this study, we present a novel approach for quick and cost-effective WGBS library preparation that is based on splinted adaptor tagging (SPLAT) of bisulphite-converted single-stranded DNA. Moreover, we validate SPLAT against three commercially available WGBS library preparation techniques, two of which are based on bisulphite treatment prior to adaptor tagging and one of which is a conventional WGBS method. PMID:27899585
Zhu, Yinzhou; Pirnie, Stephan P; Carmichael, Gordon G
2017-08-01
Ribose methylation (2'-O-methylation, 2'-OMe) occurs at high frequencies in rRNAs and other small RNAs and is carried out using a shared mechanism across eukaryotes and archaea. As RNA modifications are important for ribosome maturation, and alterations in these modifications are associated with cellular defects and diseases, it is important to characterize the landscape of 2'-O-methylation. Here we report the development of a highly sensitive and accurate method for ribose methylation detection using next-generation sequencing. A key feature of this method is the generation of RNA fragments with random 3'-ends, followed by periodate oxidation of all molecules terminating in 2',3'-OH groups. This allows only RNAs harboring 2'-OMe groups at their 3'-ends to be sequenced. Although currently requiring microgram amounts of starting material, this method is robust for the analysis of rRNAs even at low sequencing depth. © 2017 Zhu et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
A fast sequence assembly method based on compressed data structures.
Liang, Peifeng; Zhang, Yancong; Lin, Kui; Hu, Jinglu
2014-01-01
Assembling a large genome from next generation sequencing reads requires large computer memory and a long execution time. To reduce these requirements, we present a memory- and time-efficient assembler, the FMJ-Assembler, obtained by applying an FM-index in JR-Assembler, where FM stands for an FMR-index derived from the FM-index and the Burrows-Wheeler transform (BWT), and J for jumping extension. The FMJ-Assembler uses the expanded FM-index and BWT to compress read data to save memory, while the jumping extension method reduces CPU time. An extensive comparison of the FMJ-Assembler with current assemblers shows that it achieves better or comparable overall assembly quality while requiring less memory and CPU time. These advantages indicate that the FMJ-Assembler will be an efficient assembly method for next generation sequencing data.
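The BWT-based backward search that underpins FM-index assemblers can be sketched in a few lines. The following is a minimal illustrative Python implementation of BWT construction and occurrence counting, not the FMJ-Assembler's actual code (which uses compressed rank structures and jumping extension):

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for small inputs)."""
    text += "$"  # unique terminator, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(bwt_str):
    """Precompute the C (first-occurrence) table and Occ (rank) table."""
    alphabet = sorted(set(bwt_str))
    # C[c] = number of characters in bwt_str strictly smaller than c
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += bwt_str.count(ch)
    # occ[c][i] = occurrences of c in bwt_str[:i]
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt_str):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return c_table, occ

def count_occurrences(pattern, c_table, occ, n):
    """Backward search: number of times pattern occurs in the original text."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

A real assembler would store Occ in a compressed, sampled form rather than as full per-character arrays; the point here is only that read matching never touches the uncompressed text.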
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T.; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J.; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A.; Lempicki, Richard A.; Huang, Da Wei
2013-01-01
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by first- and second-generation sequencing technologies. As it is a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from this experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, improving to 1.3% with an SVM-based multi-parameter QC method. In addition, de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post-error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results. PMID:24179701
Gürtler, Nicolas; Röthlisberger, Benno; Ludin, Katja; Schlegel, Christoph; Lalwani, Anil K
2017-07-01
Identification of the causative mutation in autosomal-dominant hereditary hearing impairment by next-generation sequencing, as with mutation analysis by classic genetic methods, is hindered by the high heterogeneity of the disease. Two Swiss families with autosomal-dominant hereditary hearing impairment were studied. Amplified DNA libraries for next-generation sequencing were constructed from genomic DNA extracted from peripheral blood and enriched with a custom-made sequence capture library. Validated, pooled libraries were sequenced on an Illumina MiSeq instrument (300 cycles, paired-end). Technical data analysis was performed with SeqMonk, and variant analysis with GeneTalk or VariantStudio. Mutations in genes related to hearing loss detected by next-generation sequencing were subsequently confirmed using specific polymerase chain reaction and Sanger sequencing. The main outcome measure was mutation detection in hearing-loss-related genes. The first family harbored the mutation c.5383+5delGTGA in the TECTA gene. In the second family, a novel mutation c.2614-2625delCATGGCGCCGTG in the WFS1 gene and a second mutation, TCOF1-c.1028G>A, were identified. Next-generation sequencing successfully identified the causative mutation in families with autosomal-dominant hereditary hearing impairment. The results helped to clarify the pathogenic role of a known mutation and led to the detection of a novel one. NGS represents a feasible approach with great future potential for the diagnostics of hereditary hearing impairment, even in smaller labs.
K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics.
Lin, Jie; Adjeroh, Donald A; Jiang, Bing-Hua; Jiang, Yue
2018-05-15
Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than alignment-based methods. We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Compared to other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which determines the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or dataset size. The K2 and K2* approaches are implemented as an R package and are freely available for open access (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz). yueljiang@163.com. Supplementary data are available at Bioinformatics online.
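The general idea of Kendall-statistic-based alignment-free comparison can be illustrated by rank-correlating k-mer count profiles. The sketch below is a simplified illustration of that idea, not the paper's exact K2 statistic or normalization:

```python
from itertools import product, combinations

def kmer_counts(seq, k, alphabet="ACGT"):
    """Count vector over all k-mers of the alphabet, in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:
            counts[km] += 1
    return [counts[km] for km in kmers]

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a) between two count vectors."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

def kmer_rank_similarity(seq_a, seq_b, k=2):
    """Alignment-free similarity: Kendall correlation of k-mer profiles."""
    return kendall_tau(kmer_counts(seq_a, k), kmer_counts(seq_b, k))
```

Because only the relative ranking of k-mer abundances matters, the measure is non-parametric, which is the property the K2 family exploits.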
Vector representation of user's view using self-organizing map
NASA Astrophysics Data System (ADS)
Ae, Tadashi; Yamaguchi, Tomohisa; Monden, Eri; Kawabata, Shunji; Kamitani, Motoki
2004-05-01
Various objects, such as pictures, music, and texts, exist in our environment. We form a view of these objects by looking, reading, or listening. This view is deeply connected with our behavior and is very important for understanding it. Therefore, we propose a method that acquires a view as a vector and applies the vector to sequence generation. We focus on sequences of items that a user selects from a multimedia database containing pictures, music, movies, etc. These data cannot be stereotyped, because each user's view of them is different. Therefore, we represent the structure of the multimedia database by the vector representing the user's view together with a stereotyped vector, and acquire sequences containing this structure as elements. As an application of sequence generation incorporating the user's view, we demonstrate a city-sequence generation system that reflects the user's intention. We apply a self-organizing map to this system to represent the user's view.
A Comparison of Three Random Number Generators for Aircraft Dynamic Modeling Applications
NASA Technical Reports Server (NTRS)
Grauer, Jared A.
2017-01-01
Three random number generators, which produce Gaussian white noise sequences, were compared to assess their suitability in aircraft dynamic modeling applications. The first generator considered was the MATLAB (registered) implementation of the Mersenne-Twister algorithm. The second generator was a website called Random.org, which processes atmospheric noise measured using radios to create the random numbers. The third generator was based on synthesis of the Fourier series, where the random number sequences are constructed from prescribed amplitude and phase spectra. A total of 200 sequences, each having 601 random numbers, for each generator were collected and analyzed in terms of the mean, variance, normality, autocorrelation, and power spectral density. These sequences were then applied to two problems in aircraft dynamic modeling, namely estimating stability and control derivatives from simulated onboard sensor data, and simulating flight in atmospheric turbulence. In general, each random number generator had good performance and is well-suited for aircraft dynamic modeling applications. Specific strengths and weaknesses of each generator are discussed. For Monte Carlo simulation, the Fourier synthesis method is recommended because it most accurately and consistently approximated Gaussian white noise and can be implemented with reasonable computational effort.
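The third generator described, Fourier synthesis, constructs a noise sequence from a prescribed amplitude spectrum and random phases. A minimal NumPy sketch of that idea (flat unit amplitude, uniformly random phases, inverse real FFT) follows; it is an illustration of the technique, not the report's exact implementation:

```python
import numpy as np

def fourier_white_noise(n, sigma=1.0, seed=None):
    """Approximate Gaussian white noise by Fourier synthesis: prescribe a
    flat amplitude spectrum, draw phases uniformly at random, and invert."""
    rng = np.random.default_rng(seed)
    n_freq = n // 2 + 1                       # bins of the real FFT
    phases = rng.uniform(0.0, 2.0 * np.pi, n_freq)
    spectrum = np.exp(1j * phases)            # unit amplitude everywhere
    spectrum[0] = 0.0                         # zero the DC bin -> zero mean
    if n % 2 == 0:
        spectrum[-1] = 1.0                    # Nyquist bin must be real
    x = np.fft.irfft(spectrum, n)
    return sigma * x / np.std(x)              # rescale to the requested std
```

The flat amplitude spectrum guarantees whiteness by construction, while the sum over many random-phase sinusoids makes the samples approximately Gaussian, which is why the report found this method the most consistent for Monte Carlo work.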
Comparative modeling without implicit sequence alignments.
Kolinski, Andrzej; Gront, Dominik
2007-10-01
The number of known protein sequences is about a thousand times larger than the number of experimentally solved 3D structures. For more than half of all protein sequences, a close or distant structural analog can be identified. The key starting point in classical comparative modeling is to generate the best possible sequence alignment with a template or templates. With decreasing sequence similarity, the number of errors in the alignments increases, and these errors are the main cause of the decreasing accuracy of the molecular models generated. Here we propose a new approach to comparative modeling that does not require an implicit alignment: the model building phase explores geometric, evolutionary, and physical properties of a template (or templates). The proposed method requires prior identification of a template, although the initial sequence alignment is ignored. The model is built using CABS, a very efficient reduced-representation search engine, to find the best possible superposition of the query protein onto the template represented as a 3D multi-featured scaffold. The criteria used include sequence similarity, predicted secondary structure consistency, local geometric features, and hydrophobicity profile. For more difficult cases, the new method qualitatively outperforms existing schemes of comparative modeling. The algorithm unifies de novo modeling, 3D threading, and sequence-based methods. The main idea is general and could easily be combined with other efficient modeling tools such as Rosetta, UNRES, and others.
Pedersen, M S; Fahnøe, U; Hansen, T A; Pedersen, A G; Jenssen, H; Bukh, J; Schønning, K
2018-06-01
The current treatment options for hepatitis C virus (HCV), based on direct-acting antivirals (DAAs), depend on virus genotype and previous treatment experience. Treatment failures have been associated with detection of resistance-associated substitutions (RASs) in the DAA targets of HCV, the NS3, NS5A, and NS5B proteins. Our aim was to develop a next generation sequencing based method that provides genotype and detection of HCV NS3, NS5A, and NS5B RASs without prior knowledge of the sample genotype. In total, 101 residual plasma samples from patients with HCV, covering 10 different viral subtypes across 4 genotypes with viral loads of 3.84-7.61 Log IU/mL, were included. All samples were de-identified, and consequently the prior treatment status of the patients was unknown. Near-full-length open reading frame amplicons (∼9 kb) were generated using RT-PCR with a single primer set. The resulting amplicons were sequenced with high throughput sequencing and analysed using an in-house developed script for detecting RASs. The method successfully amplified and sequenced 94% (95/101) of samples with an average coverage of 14,035; four of the six failed samples were genotype 4a. Samples analysed twice yielded reproducible nucleotide frequencies across all sites. RASs were detected in 21/95 (22%) samples at a 15% threshold. The method identified one patient infected with two genotype 2b variants, and detected the presence of subgenomic deletion variants in 8 (8.4%) of the 95 successfully sequenced samples. The presented method may thus provide HCV genotype identification and RAS detection, and may detect multiple HCV infections, without prior knowledge of the sample genotype. Copyright © 2018 Elsevier B.V. All rights reserved.
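Calling substitutions at a fixed frequency threshold, like the 15% cutoff used here, reduces to a simple per-site frequency test. The sketch below assumes per-position base counts have already been extracted from the read alignment; the data structures are illustrative, since the paper's in-house script is not described in detail:

```python
def call_substitutions(counts_by_site, reference, threshold=0.15):
    """Report substitutions whose read frequency is at least `threshold`.

    counts_by_site: {position: {base: read_count}} from an alignment pileup
    reference:      {position: reference_base}
    Returns a list of (position, ref_base, alt_base, frequency) tuples.
    """
    calls = []
    for pos, counts in counts_by_site.items():
        depth = sum(counts.values())
        if depth == 0:
            continue
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= threshold:
                calls.append((pos, reference[pos], base, n / depth))
    return calls
```

In practice one would also require a minimum depth per site, since at low coverage a 15% frequency can be produced by a handful of error-bearing reads.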
Deep sequencing methods for protein engineering and design.
Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A
2017-08-01
The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering have followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. Copyright © 2016 Elsevier Ltd. All rights reserved.
Ehrhardt, J; Säring, D; Handels, H
2007-01-01
Modern tomographic imaging devices enable the acquisition of spatial and temporal image sequences. However, the spatial and temporal resolution of such devices is limited, and image interpolation techniques are therefore needed to represent images at a desired level of discretization. This paper presents a method for structure-preserving interpolation between neighboring slices in temporal or spatial image sequences. In a first step, the spatiotemporal velocity field between image slices is determined using an optical flow-based registration method in order to establish spatial correspondence between adjacent slices. An iterative algorithm is applied using the spatial and temporal image derivatives and a spatiotemporal smoothing step. Afterwards, the calculated velocity field is used to generate an interpolated image at the desired time by averaging intensities between corresponding points. Three quantitative measures are defined to evaluate the performance of the interpolation method. The behavior and capability of the algorithm are demonstrated on synthetic images. A population of 17 temporal and spatial image sequences is utilized to compare the optical flow-based interpolation method to linear and shape-based interpolation. The quantitative results show that the optical flow-based method outperforms linear and shape-based interpolation with statistical significance. The interpolation method presented is able to generate image sequences with the spatial or temporal resolution needed for image comparison, analysis, or visualization tasks. Quantitative and qualitative measures extracted from synthetic phantoms and medical image data show that the new method has clear advantages over linear and shape-based interpolation.
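The final step described, averaging intensities between corresponding points given a velocity field, can be sketched as follows. This toy version assumes the displacement field is already known and uses nearest-neighbour sampling, whereas the paper estimates the field by optical flow registration and would interpolate sub-pixel positions:

```python
import numpy as np

def interpolate_frame(img0, img1, flow, t=0.5):
    """Intermediate frame at time t in [0, 1], given a dense displacement
    field `flow` of shape (h, w, 2) holding (dy, dx) per pixel, which maps
    points of img0 onto img1. Corresponding intensities are blended with
    weight t (nearest-neighbour sampling keeps the sketch dependency-free)."""
    h, w = img0.shape
    out = np.zeros_like(img0, dtype=float)
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y, x]
            # point (y, x) in img0 corresponds to (y + dy, x + dx) in img1
            y1 = min(max(int(round(float(y + dy))), 0), h - 1)
            x1 = min(max(int(round(float(x + dx))), 0), w - 1)
            out[y, x] = (1.0 - t) * img0[y, x] + t * img1[y1, x1]
    return out
```

Blending along correspondences, rather than pixel-wise, is what avoids the ghosting artifacts of plain linear interpolation when structures move between slices.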
Hiding message into DNA sequence through DNA coding and chaotic maps.
Liu, Guoyan; Liu, Hongjun; Kadir, Abdurahman
2014-09-01
The paper proposes an improved reversible substitution method to hide data in a deoxyribonucleic acid (DNA) sequence. Four measures are taken to enhance robustness and enlarge the hiding capacity: encoding the secret message by DNA coding, encrypting it with a pseudo-random sequence, generating the relative hiding locations with a piecewise linear chaotic map, and embedding the encoded and encrypted message into a randomly selected DNA sequence using the complementary rule. The key space and the hiding capacity are analyzed. Experimental results indicate that the proposed method performs better than competing methods with respect to robustness and capacity.
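A piecewise linear chaotic map (PWLCM) keystream generator of the kind used to derive pseudo-random sequences and hiding locations can be sketched as follows. The parameter names, burn-in length, and quantization to bytes are illustrative choices, not the paper's exact scheme:

```python
def pwlcm(x, p):
    """Piecewise linear chaotic map on (0, 1) with control parameter p in (0, 0.5)."""
    if x < p:
        return x / p
    elif x <= 0.5:
        return (x - p) / (0.5 - p)
    else:
        return pwlcm(1.0 - x, p)   # the map is symmetric about 0.5

def keystream(x0, p, n, burn_in=100):
    """Byte keystream: iterate the map, discard transients, quantize to 0-255.
    The pair (x0, p) acts as the secret key."""
    x = x0
    for _ in range(burn_in):
        x = pwlcm(x, p)
    out = []
    for _ in range(n):
        x = pwlcm(x, p)
        out.append(int(x * 256) % 256)
    return out
```

Sensitivity to the initial condition is the property being exploited: a tiny change in x0 yields a completely different keystream after the burn-in, so the hiding locations cannot be reproduced without the key.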
Day-Williams, Aaron G.; McLay, Kirsten; Drury, Eleanor; Edkins, Sarah; Coffey, Alison J.; Palotie, Aarno; Zeggini, Eleftheria
2011-01-01
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size. PMID:22069447
The history and advances of reversible terminators used in new generations of sequencing technology.
Chen, Fei; Dong, Mengxing; Ge, Meng; Zhu, Lingxiang; Ren, Lufeng; Liu, Guocheng; Mu, Rong
2013-02-01
DNA sequencing using reversible terminators, as one sequencing-by-synthesis strategy, has garnered a great deal of interest due to its popular application in second-generation high-throughput DNA sequencing technology. In this review, we provide the history of development, classification, and working mechanism of this technology. We also outline the screening strategies for DNA polymerases to accommodate the reversible terminators as substrates during polymerization; in particular, we introduce the "REAP" method developed by us. At the end of this review, we discuss current limitations of this approach and provide potential solutions to extend its application. Copyright © 2013. Production and hosting by Elsevier Ltd.
Ho, Cynthia K. Y.; Raghwani, Jayna; Koekkoek, Sylvie; Liang, Richard H.; Van der Meer, Jan T. M.; Van Der Valk, Marc; De Jong, Menno; Pybus, Oliver G.
2016-01-01
In contrast to other available next-generation sequencing platforms, PacBio single-molecule, real-time (SMRT) sequencing has the advantage of generating long reads albeit with a relatively higher error rate in unprocessed data. Using this platform, we longitudinally sampled and sequenced the hepatitis C virus (HCV) envelope genome region (1,680 nucleotides [nt]) from individuals belonging to a cluster of sexually transmitted cases. All five subjects were coinfected with HIV-1 and a closely related strain of HCV genotype 4d. In total, 50 samples were analyzed by using SMRT sequencing. By using 7 passes of circular consensus sequencing, the error rate was reduced to 0.37%, and the median number of sequences was 612 per sample. A further reduction of insertions was achieved by alignment against a sample-specific reference sequence. However, in vitro recombination during PCR amplification could not be excluded. Phylogenetic analysis supported close relationships among HCV sequences from the four male subjects and subsequent transmission from one subject to his female partner. Transmission was characterized by a strong genetic bottleneck. Viral genetic diversity was low during acute infection and increased upon progression to chronicity but subsequently fluctuated during chronic infection, caused by the alternate detection of distinct coexisting lineages. SMRT sequencing combines long reads with sufficient depth for many phylogenetic analyses and can therefore provide insights into within-host HCV evolutionary dynamics without the need for haplotype reconstruction using statistical algorithms. IMPORTANCE Next-generation sequencing has revolutionized the study of genetically variable RNA virus populations, but for phylogenetic and evolutionary analyses, longer sequences than those generated by most available platforms, while minimizing the intrinsic error rate, are desired.
Here, we demonstrate for the first time that PacBio SMRT sequencing technology can be used to generate full-length HCV envelope sequences at the single-molecule level, providing a data set with large sequencing depth for the characterization of intrahost viral dynamics. The selection of consensus reads derived from at least 7 full circular consensus sequencing rounds significantly reduced the intrinsic high error rate of this method. We used this method to genetically characterize a unique transmission cluster of sexually transmitted HCV infections, providing insight into the distinct evolutionary pathways in each patient over time and identifying the transmission-associated genetic bottleneck as well as fluctuations in viral genetic diversity over time, accompanied by dynamic shifts in viral subpopulations. PMID:28077634
Mining and Development of Novel SSR Markers Using Next Generation Sequencing (NGS) Data in Plants.
Taheri, Sima; Lee Abdullah, Thohirah; Yusop, Mohd Rafii; Hanafi, Mohamed Musa; Sahebi, Mahbod; Azizi, Parisa; Shamshiri, Redmond Ramin
2018-02-13
Microsatellites, or simple sequence repeats (SSRs), are among the most informative and multi-purpose genetic markers exploited in plant functional genomics. However, the discovery and development of SSRs using traditional methods are laborious, time-consuming, and costly. Recently, the availability of high-throughput sequencing technologies has enabled researchers to identify a substantial number of microsatellites at less cost and effort than traditional approaches. Illumina is a noteworthy transcriptome sequencing technology currently used in SSR marker development. Although 454 pyrosequencing datasets can be used for SSR development, this type of sequencing is no longer supported. This review presents an overview of next generation sequencing, with a focus on the efficient use of de novo transcriptome sequencing (RNA-Seq) and related tools for mining and developing microsatellites in plants.
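The core of SSR mining can be illustrated with a simple regular-expression search. The sketch below flags any 1-6 bp motif tandemly repeated at least five times; real SSR pipelines such as those the review covers add motif canonicalization, per-motif-length thresholds, and flanking-primer design, none of which are shown here:

```python
import re

def find_ssrs(seq, motif_lengths=(1, 2, 3, 4, 5, 6), min_repeats=5):
    """Find simple sequence repeats: a short motif tandemly repeated
    at least `min_repeats` times. Returns (start, motif, repeat_count)."""
    ssrs = []
    for k in motif_lengths:
        # a k-bp motif captured once, then repeated min_repeats-1 more times
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            ssrs.append((m.start(), m.group(1), len(m.group(0)) // k))
    return ssrs
```

Note one simplification: a long mononucleotide run also matches the dinucleotide pattern, so a production tool would collapse such redundant calls to the shortest motif.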
Next-Generation Technologies for Multiomics Approaches Including Interactome Sequencing
Ohashi, Hiroyuki; Miyamoto-Sato, Etsuko
2015-01-01
The development of high-speed analytical techniques such as next-generation sequencing and microarrays allows high-throughput analysis of biological information at a low cost. These techniques contribute to medical and bioscience advancements and provide new avenues for scientific research. Here, we outline a variety of new innovative techniques and discuss their use in omics research (e.g., genomics, transcriptomics, metabolomics, proteomics, and interactomics). We also discuss the possible applications of these methods, including an interactome sequencing technology that we developed, in future medical and life science research. PMID:25649523
Talking Drums: Generating drum grooves with neural networks
NASA Astrophysics Data System (ADS)
Hutchings, P.
2017-05-01
Presented is a method of generating a full drum kit part for a provided kick-drum sequence. A sequence-to-sequence neural network model used in natural language translation was adopted to encode multiple musical styles, and an online survey was developed to test different techniques for sampling the output of the softmax function. The strongest results were found using a sampling technique that drew from the three most probable outputs at each subdivision of the drum pattern, but the consistency of the output was found to be heavily dependent on style.
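The sampling technique described, drawing from the three most probable softmax outputs, is top-k sampling with renormalization. The following is an illustrative implementation of that general technique, not the author's code:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_top_k(logits, k=3, rng=random):
    """Sample an index from the k most probable softmax outputs,
    renormalizing their probabilities to sum to one."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    r = rng.random() * total
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r <= acc:
            return i
    return top[-1]
```

At each subdivision of the drum pattern the model's output logits would be passed through such a sampler, which injects variety while excluding low-probability (often stylistically implausible) hits.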
Taggart, David J.; Camerlengo, Terry L.; Harrison, Jason K.; Sherrer, Shanen M.; Kshetry, Ajay K.; Taylor, John-Stephen; Huang, Kun; Suo, Zucai
2013-01-01
Cellular genomes are constantly damaged by endogenous and exogenous agents that covalently and structurally modify DNA to produce DNA lesions. Although most lesions are mended by various DNA repair pathways in vivo, a significant number of damage sites persist during genomic replication. Our understanding of the mutagenic outcomes derived from these unrepaired DNA lesions has been hindered by the low throughput of existing sequencing methods. Therefore, we have developed a cost-effective high-throughput short oligonucleotide sequencing assay that uses next-generation DNA sequencing technology for the assessment of the mutagenic profiles of translesion DNA synthesis catalyzed by any error-prone DNA polymerase. The vast amount of sequencing data produced were aligned and quantified by using our novel software. As an example, the high-throughput short oligonucleotide sequencing assay was used to analyze the types and frequencies of mutations upstream, downstream and at a site-specifically placed cis–syn thymidine–thymidine dimer generated individually by three lesion-bypass human Y-family DNA polymerases. PMID:23470999
Zhu, Yuan O; Aw, Pauline P K; de Sessions, Paola Florez; Hong, Shuzhen; See, Lee Xian; Hong, Lewis Z; Wilm, Andreas; Li, Chen Hao; Hue, Stephane; Lim, Seng Gee; Nagarajan, Niranjan; Burkholder, William F; Hibberd, Martin
2017-10-27
Viral populations are complex, dynamic, and fast evolving. The evolution of groups of closely related viruses in a competitive environment is termed quasispecies. To fully understand the role that quasispecies play in viral evolution, characterizing the trajectories of viral genotypes in an evolving population is the key. In particular, long-range haplotype information for thousands of individual viruses is critical; yet generating this information is non-trivial. Popular deep sequencing methods generate relatively short reads that do not preserve linkage information, while third generation sequencing methods have higher error rates that make detection of low frequency mutations a bioinformatics challenge. Here we applied BAsE-Seq, an Illumina-based single-virion sequencing technology, to eight samples from four chronic hepatitis B (CHB) patients - once before antiviral treatment and once after viral rebound due to resistance. With single-virion sequencing, we obtained 248-8796 single-virion sequences per sample, which allowed us to find evidence for both hard and soft selective sweeps. We were able to reconstruct population demographic history that was independently verified by clinically collected data. We further verified four of the samples independently through PacBio SMRT and Illumina Pooled deep sequencing. Overall, we showed that single-virion sequencing yields insight into viral evolution and population dynamics in an efficient and high throughput manner. We believe that single-virion sequencing is widely applicable to the study of viral evolution in the context of drug resistance and host adaptation, allows differentiation between soft or hard selective sweeps, and may be useful in the reconstruction of intra-host viral population demographic history.
Gangras, Pooja; Dayeh, Daniel M; Mabin, Justin W; Nakanishi, Kotaro; Singh, Guramrit
2018-01-01
Argonaute proteins (AGOs) are loaded with small RNAs as guides to recognize target mRNAs. Since target specificity depends heavily on the base complementarity between the two strands, it is important to identify both the small guide RNAs and the long target RNAs bound to AGOs. For this purpose, next-generation sequencing (NGS) technologies have extended our view truly to the nucleotide level. However, the identification of RNAs via NGS from scarce RNA samples remains a challenge. Further, most commercial and published methods are compatible with either small RNAs or long RNAs, but are not equally applicable to both. Therefore, a single method that yields quantitative, bias-free NGS libraries for identifying small and long RNAs from low levels of input will be of wide interest. Here, we introduce such a procedure, based on several modifications of two published protocols, that allows robust, sensitive, and reproducible cloning and sequencing of small amounts of RNAs of variable lengths. The method was applied to the identification of small RNAs bound to a purified eukaryotic AGO. Following ligation of a DNA adapter to the RNA 3'-end, the key feature of this method is to use the adapter to prime reverse transcription (RT), wherein biotinylated deoxyribonucleotides are specifically incorporated into the extended complementary DNA. Such RT products are enriched on streptavidin beads, circularized while immobilized on the beads, and directly used for PCR amplification. We provide a stepwise guide to generating RNA-Seq libraries, their purification, quantification, validation, and preparation for next-generation sequencing. We also provide basic steps in post-NGS data analysis using Galaxy, an open-source, web-based platform.
Solving the problem of comparing whole bacterial genomes across different sequencing platforms.
Kaas, Rolf S; Leekitcharoenphon, Pimlapas; Aarestrup, Frank M; Lund, Ole
2014-01-01
Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology, because each technology has a systematic bias that makes integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies from WGS data across multiple platforms. The methods were evaluated on three bacterial data sets, each sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. We conclude that the success of these new procedures is due to validation of all informative sites included in the analysis. The procedures are available as web tools.
Zhao, Min; Wang, Qingguo; Wang, Quan; Jia, Peilin; Zhao, Zhongming
2013-01-01
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) and genotyping arrays have been the standard technologies to detect large regions subject to copy number changes; more recently, high-resolution sequence data generated by next-generation sequencing (NGS) can be analyzed instead. During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
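The most common class of NGS-based CNV detection covered by such reviews, read-depth analysis, can be illustrated with a naive window-based caller: average the per-base depth in fixed windows, compare each window to the genome-wide median, and flag gains and losses. The window size and ratio cutoffs below are arbitrary illustrative choices:

```python
import statistics

def depth_cnv_calls(depths, window=100, dup_ratio=1.5, del_ratio=0.5):
    """Naive read-depth CNV caller.

    depths: per-base read depth along a chromosome (list of ints)
    Returns (start, end, kind) for windows whose mean depth deviates
    from the genome-wide median by the given ratios.
    """
    means = [sum(depths[i:i + window]) / window
             for i in range(0, len(depths) - window + 1, window)]
    baseline = statistics.median(means)
    calls = []
    for w, m in enumerate(means):
        if m >= dup_ratio * baseline:
            calls.append((w * window, (w + 1) * window, "gain"))
        elif m <= del_ratio * baseline:
            calls.append((w * window, (w + 1) * window, "loss"))
    return calls
```

Real tools additionally correct depth for GC content and mappability and merge adjacent windows into segments, which is where most of the methodological differences among the reviewed programs lie.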
Rapid and Easy Protocol for Quantification of Next-Generation Sequencing Libraries.
Hawkins, Steve F C; Guest, Paul C
2018-01-01
The emergence of next-generation sequencing (NGS) over the last 10 years has increased the efficiency of DNA sequencing in terms of speed, ease, and price. However, exact quantification of an NGS library is crucial in order to obtain good data on sequencing platforms developed by the current market leader Illumina. Different approaches for DNA quantification are currently available; the most commonly used are based on analysis of the physical properties of the DNA through spectrophotometric or fluorometric methods. Although these methods are technically simple, they do not allow quantification as exact as that achieved with a real-time quantitative PCR (qPCR) approach. A qPCR protocol for DNA quantification with applications in NGS library preparation studies is presented here. It can be applied in various fields of study, such as medical disorders resulting from nutritional programming disturbances.
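The standard-curve arithmetic underlying such a qPCR quantification can be sketched as follows; the slope, intercept, Cq value and dilution factor are hypothetical, and the function name is ours:

```python
def quantify_library(cq_sample, slope, intercept, dilution_factor=1.0):
    """Estimate library concentration from a qPCR quantification cycle (Cq)
    using a standard curve fitted as: Cq = slope * log10(conc) + intercept."""
    log10_conc = (cq_sample - intercept) / slope
    return (10 ** log10_conc) * dilution_factor

# Hypothetical curve fitted to serial dilutions of a known standard; a slope
# near -3.32 corresponds to ~100% amplification efficiency.
slope, intercept = -3.32, 38.0
# The sample was diluted 1:1000 before qPCR, so scale the estimate back up.
stock_conc = quantify_library(24.7, slope, intercept, dilution_factor=1000.0)
```

In practice the curve is fitted from several dilution points, and replicate Cq values are averaged before conversion.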
Scholz, Christian F P; Jensen, Anders
2017-01-01
The protocol describes a computational method to develop a Single Locus Sequence Typing (SLST) scheme for typing bacterial species. The resulting scheme can be used to type bacterial isolates as well as bacterial species directly from complex communities using next-generation sequencing technologies.
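As a rough illustration of typing directly from reads, the sketch below counts reads that contain each allele sequence exactly; the allele set and the matching rule are invented for illustration and are much simpler than a real SLST scheme, which must tolerate sequencing errors:

```python
from collections import Counter

def slst_profile(reads, allele_db):
    """Profile a community by counting reads that contain each single-locus
    allele sequence as an exact substring (toy stand-in for SLST typing)."""
    counts = Counter()
    for name, allele_seq in allele_db.items():
        counts[name] = sum(1 for read in reads if allele_seq in read)
    return counts

alleles = {"ST1": "ACGTTGCA", "ST2": "GGCCATTA"}  # hypothetical SLST alleles
reads = ["TTACGTTGCAGG", "AACGTTGCAT", "TGGCCATTACC"]
profile = slst_profile(reads, alleles)
```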
Using chaos to generate variations on movement sequences
NASA Astrophysics Data System (ADS)
Bradley, Elizabeth; Stuart, Joshua
1998-12-01
We describe a method for introducing variations into predefined motion sequences using a chaotic symbol-sequence reordering technique. A progression of symbols representing the body positions in a dance piece, martial arts form, or other motion sequence is mapped onto a chaotic trajectory, establishing a symbolic dynamics that links the movement sequence and the attractor structure. A variation on the original piece is created by generating a trajectory with slightly different initial conditions, inverting the mapping, and using special corpus-based graph-theoretic interpolation schemes to smooth any abrupt transitions. Sensitive dependence guarantees that the variation is different from the original; the attractor structure and the symbolic dynamics guarantee that the two resemble one another in both aesthetic and mathematical senses.
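A stripped-down version of the idea, assuming the logistic map as the chaotic system and an equal-width binning of the unit interval as the symbolic partition (the corpus-based graph-theoretic interpolation step is omitted):

```python
def chaotic_variation(moves, x0=0.3, delta=1e-6, r=3.99):
    """Re-read a movement-symbol alphabet along a logistic-map trajectory
    started from slightly perturbed initial conditions. Sensitive dependence
    makes the output differ from `moves`, while the shared attractor keeps
    the symbol statistics similar."""
    symbols = sorted(set(moves))          # deterministic symbol -> bin order
    k = len(symbols)
    x = x0 + delta                        # perturbed initial condition
    variation = []
    for _ in range(len(moves)):
        x = r * x * (1 - x)               # one logistic-map step
        variation.append(symbols[min(int(x * k), k - 1)])
    return variation

original = list("ABABCCAB")               # toy movement sequence
variant = chaotic_variation(original)
```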
A simple method for MR elastography: a gradient-echo type multi-echo sequence.
Numano, Tomokazu; Mizuhara, Kazuyuki; Hata, Junichi; Washio, Toshikatsu; Homma, Kazuhiro
2015-01-01
To demonstrate the feasibility of a novel MR elastography (MRE) technique based on a conventional gradient-echo type multi-echo MR sequence that does not require additional bipolar magnetic field gradients (motion encoding gradient: MEG), yet is sensitive to vibration. In a gradient-echo type multi-echo MR sequence, several images are produced from each echo of the train with different echo times (TEs). If these echoes are synchronized with the vibration, each readout's gradient lobes achieve a MEG-like effect, with later echoes producing a greater MEG-like effect. The sequence was tested on tissue-mimicking agarose gel phantoms and the psoas major muscles of healthy volunteers. It was confirmed that the readout gradient lobes caused a MEG-like effect and that the later-TE images had higher sensitivity to vibrations. Magnitude images from later echoes suffered from T2 decay and susceptibility artifacts, but the corresponding wave images and elastograms were unaffected. In in vivo experiments, this method was able to measure the mean shear modulus of the psoas major muscle. The results of the phantom experiments and volunteer studies show that this method has potential for clinical application. Copyright © 2014 Elsevier Inc. All rights reserved.
De novo assembly of human genomes with massively parallel short read sequencing.
Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue; Qian, Wubin; Fang, Xiaodong; Shi, Zhongbin; Li, Yingrui; Li, Shengting; Shan, Gao; Kristiansen, Karsten; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun
2010-02-01
Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
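The N50 contig size quoted above has a simple definition; a minimal sketch:

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L together cover
    at least half of the total assembly length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
```

For example, an assembly with contigs of 8, 8, 4, 3, 3, 2, 2 and 2 kb (32 kb total) has an N50 of 8 kb, since the two largest contigs already cover half the assembly.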
Hughes, Paul; Deng, Wenjie; Olson, Scott C; Coombs, Robert W; Chung, Michael H; Frenkel, Lisa M
2016-03-01
Accurate analysis of minor populations of drug-resistant HIV requires analysis of a sufficient number of viral templates. We assessed the effect of experimental conditions on the analysis of HIV pol 454 pyrosequences generated from plasma using (1) the "Insertion-deletion (indel) and Carry Forward Correction" (ICC) pipeline, which clusters sequence reads using a nonsubstitution approach and can correct for indels and carry forward errors, and (2) the "Primer Identification (ID)" method, which facilitates construction of a consensus sequence to correct for sequencing errors and allelic skewing. The Primer ID and ICC methods produced similar estimates of viral diversity, but differed in the number of sequence variants generated. Sequence preparation for ICC was comparably simple, but was limited by an inability to assess the number of templates analyzed and allelic skewing. The more costly Primer ID method corrected for allelic skewing and provided the number of viral templates analyzed, which revealed that amplifiable HIV templates varied across specimens and did not correlate with clinical viral load. This latter observation highlights the value of the Primer ID method, which by determining the number of templates amplified, enables more accurate assessment of minority species in the virus population, which may be relevant to prescribing effective antiretroviral therapy.
Dai, Hongying; Wu, Guodong; Wu, Michael; Zhi, Degui
2016-01-01
Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency, [Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e., ε → 0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
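Under independence, the Lancaster procedure generalizes Fisher's method by giving each p-value its own chi-square weight; a sketch (the correlation adjustment used in the paper is omitted, and the weights here are arbitrary):

```python
from scipy import stats

def lancaster_combine(p_values, weights):
    """Combine p-values by summing inverse-chi-square transforms, where the
    i-th p-value receives w_i degrees of freedom. With all weights equal to 2
    this reduces exactly to Fisher's method, since the chi-square quantile
    with 2 df is -2*ln(p)."""
    t = sum(stats.chi2.ppf(1.0 - p, df=w) for p, w in zip(p_values, weights))
    return float(stats.chi2.sf(t, df=sum(weights)))
```

With weights (2, 2, 2), combining (0.01, 0.04, 0.5) matches `scipy.stats.combine_pvalues(..., method="fisher")`; unequal weights let stronger or larger genes contribute more.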
Li, Jun; Tibshirani, Robert
2015-01-01
We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or ‘sequencing depths’. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by ‘outliers’ in the data. We introduce a simple, nonparametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods. PMID:22127579
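The flavor of a resampling approach that accounts for sequencing depth can be sketched as a permutation test on depth-normalized counts; this is a generic illustration, not the authors' exact statistic:

```python
import random

def permutation_test(counts_a, counts_b, depths_a, depths_b,
                     n_perm=2000, seed=0):
    """Two-class test for one gene: compare mean depth-normalized counts
    between classes and calibrate by permuting class labels, so no Poisson
    or negative binomial model is assumed."""
    rng = random.Random(seed)
    norm = [c / d for c, d in zip(counts_a + counts_b, depths_a + depths_b)]
    n_a, n_b = len(counts_a), len(counts_b)

    def stat(values):
        return abs(sum(values[:n_a]) / n_a - sum(values[n_a:]) / n_b)

    observed = stat(norm)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(norm)
        if stat(norm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Hypothetical gene: high counts in class A, low in class B, equal depths.
p_signal = permutation_test([100, 110, 90, 105], [10, 12, 8, 9],
                            [1.0] * 4, [1.0] * 4)
```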
[The principle and application of the single-molecule real-time sequencing technology].
Yanhu, Liu; Lu, Wang; Li, Yu
2015-03-01
The last decade witnessed the explosive development of third-generation sequencing strategies, including single-molecule real-time sequencing (SMRT), true single-molecule sequencing (tSMS™) and single-molecule nanopore DNA sequencing. In this review, we summarize the principle, performance and application of the SMRT sequencing technology. Compared with the traditional Sanger method and the next-generation sequencing (NGS) technologies, the SMRT approach has several advantages, including long read length, high speed, PCR-free operation and the capability of directly detecting epigenetic modifications. However, its low accuracy, mostly resulting from insertions and deletions, is also notable, so raw sequence data need to be corrected before assembly. To date, SMRT is a good fit for de novo genomic sequencing and high-quality assemblies of small genomes. In the future, it is expected to play an important role in epigenetics, transcriptomic sequencing, and assemblies of large genomes.
Johnson, Paul A; Tencate, James A; Le Bas, Pierre-Yves; Guyer, Robert; Vu, Cung Khac; Skelt, Christopher
2013-10-08
In some aspects of the disclosure, a method and an apparatus are disclosed for investigating material surrounding a borehole. The method includes generating within the borehole an intermittent low frequency vibration that propagates as a tube wave longitudinally to the borehole and induces a nonlinear response in one or more features in the material that are substantially perpendicular to a longitudinal axis of the borehole; generating within the borehole a sequence of high frequency pulses directed such that they travel longitudinally to the borehole within the surrounding material; and receiving, at one or more receivers positionable in the borehole, a signal that includes components from the low frequency vibration and the sequence of high frequency pulses during intermittent generation of the low frequency vibration, to investigate the material surrounding the borehole.
Mack, Steven J.; Milius, Robert P.; Gifford, Benjamin D.; Sauter, Jürgen; Hofmann, Jan; Osoegawa, Kazutoyo; Robinson, James; Groeneweg, Mathijs; Turenchalk, Gregory S.; Adai, Alex; Holcomb, Cherie; Rozemuller, Erik H.; Penning, Maarten T.; Heuer, Michael L.; Wang, Chunlin; Salit, Marc L.; Schmidt, Alexander H.; Parham, Peter R.; Müller, Carlheinz; Hague, Tim; Fischer, Gottfried; Fernandez-Viña, Marcelo; Hollenbach, Jill A.; Norman, Paul J.; Maiers, Martin
2015-01-01
The development of next-generation sequencing (NGS) technologies for HLA and KIR genotyping is rapidly advancing knowledge of genetic variation of these highly polymorphic loci. NGS genotyping is poised to replace older methods for clinical use, but standard methods for reporting and exchanging these new, high quality genotype data are needed. The Immunogenomic NGS Consortium, a broad collaboration of histocompatibility and immunogenetics clinicians, researchers, instrument manufacturers and software developers, has developed the Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines. MIRING is a checklist that specifies the content of NGS genotyping results as well as a set of messaging guidelines for reporting the results. A MIRING message includes five categories of structured information – message annotation, reference context, full genotype, consensus sequence and novel polymorphism – and references to three categories of accessory information – NGS platform documentation, read processing documentation and primary data. These eight categories of information ensure the long-term portability and broad application of this NGS data for all current histocompatibility and immunogenetics use cases. In addition, MIRING can be extended to allow the reporting of genotype data generated using pre-NGS technologies. Because genotyping results reported using MIRING are easily updated in accordance with reference and nomenclature databases, MIRING represents a bold departure from previous methods of reporting HLA and KIR genotyping results, which have provided static and less-portable data. More information about MIRING can be found online at miring.immunogenomics.org. PMID:26407912
Genetic markers, genotyping methods & next generation sequencing in Mycobacterium tuberculosis
Desikan, Srinidhi; Narayanan, Sujatha
2015-01-01
Molecular epidemiology (ME) is one of the main areas of tuberculosis research and is widely used to study transmission, epidemics and outbreaks of tubercle bacilli. It exploits the presence of various polymorphisms in the genome of the bacteria that can be used as genetic markers. Many DNA typing methods apply these genetic markers to differentiate various strains and to study the evolutionary relationships between them. The three widely used genotyping tools to differentiate Mycobacterium tuberculosis strains are IS6110 restriction fragment length polymorphism (RFLP), spacer oligotyping (Spoligotyping), and mycobacterial interspersed repeat units - variable number of tandem repeats (MIRU-VNTR). A new prospect for ME was introduced with the development of whole genome sequencing (WGS) and next generation sequencing (NGS) methods, in which the entire genome is sequenced; this not only helps point out minute differences between sequences but also saves time and cost. NGS is also useful for identifying single nucleotide polymorphisms (SNPs), for comparative genomics, and for studying various aspects of transmission dynamics. These techniques enable the identification of mycobacterial strains and also facilitate the study of their phylogenetic and evolutionary traits. PMID:26205019
2013-01-01
Background Human leukocyte antigen matching at allelic resolution has proven clinically significant in hematopoietic stem cell transplantation, lowering the risk of graft-versus-host disease and mortality. However, due to the ever-growing HLA allele database, tissue typing laboratories face substantial challenges. In light of the complexity and the high degree of allelic diversity, it has become increasingly difficult to define the classical transplantation antigens at high resolution using well-tried methods. Thus, next-generation sequencing is entering diagnostic laboratories at the perfect time, serving as a promising tool to overcome intrinsic HLA typing problems. Therefore, we have developed and validated a scalable automated HLA class I and class II typing approach suitable for diagnostic use. Results A validation panel of 173 clinical and proficiency testing samples was analysed, demonstrating 100% concordance with the reference method. From a total of 1,273 loci we were able to generate 1,241 (97.3%) initial successful typings. The mean ambiguity reduction for the analysed loci was 93.5%. Allele assignment including intronic sequences showed an improved resolution (99.2%) of non-expressed HLA alleles. Conclusion We provide a powerful HLA typing protocol offering a short turnaround time of only two days, a fully integrated workflow and, most importantly, a high degree of typing reliability. The presented automated assay is flexible and can be scaled by specific primer compilations and the use of different 454 sequencing systems. The workflow was successfully validated according to the policies of the European Federation for Immunogenetics. Next-generation sequencing seems set to become one of the new methods in the field of histocompatibility. PMID:23557197
Hu, Jiazhi; Meyers, Robin M; Dong, Junchao; Panchakshari, Rohit A; Alt, Frederick W; Frock, Richard L
2016-05-01
Unbiased, high-throughput assays for detecting and quantifying DNA double-stranded breaks (DSBs) across the genome in mammalian cells will facilitate basic studies of the mechanisms that generate and repair endogenous DSBs. They will also enable more applied studies, such as those to evaluate the on- and off-target activities of engineered nucleases. Here we describe a linear amplification-mediated high-throughput genome-wide sequencing (LAM-HTGTS) method for the detection of genome-wide 'prey' DSBs via their translocation in cultured mammalian cells to a fixed 'bait' DSB. Bait-prey junctions are cloned directly from isolated genomic DNA using LAM-PCR and unidirectionally ligated to bridge adapters; subsequent PCR steps amplify the single-stranded DNA junction library in preparation for Illumina Miseq paired-end sequencing. A custom bioinformatics pipeline identifies prey sequences that contribute to junctions and maps them across the genome. LAM-HTGTS differs from related approaches because it detects a wide range of broken end structures with nucleotide-level resolution. Familiarity with nucleic acid methods and next-generation sequencing analysis is necessary for library generation and data interpretation. LAM-HTGTS assays are sensitive, reproducible, relatively inexpensive, scalable and straightforward to implement with a turnaround time of <1 week.
Ai, Jing-Wen; Li, Yang; Cheng, Qi; Cui, Peng; Wu, Hong-Long; Xu, Bin; Zhang, Wen-Hong
2018-06-01
A 45-year-old man who complained of continuous fever and multiple hepatic masses was admitted to our hospital. Repeated MRI findings were similar, while the radiological reports suggested contradictory diagnoses, pointing to either infection or malignancy. Pathologic examination of the liver tissue showed no direct evidence of either infection or tumor. We performed next-generation sequencing on the liver tissue and peripheral blood to investigate the possible etiology. High-throughput sequencing was performed on the liver lesion tissue using the BGISEQ-100 platform, and the data were mapped to the Microbial Genome Databases after filtering low-quality data and human reads. We identified a total of 299 sequencing reads of Mycobacterium tuberculosis (M. tuberculosis) complex sequences from the liver tissue, including 8,229 of 4,424,435 M. tuberculosis nucleotide sequences; Mycobacterium africanum, Mycobacterium bovis, and Mycobacterium canettii were also detected owing to the 99.9% sequence identity among these strains. No specific M. tuberculosis nucleotide sequence was detected in the peripheral blood sample. The patient's symptoms resolved quickly after anti-tuberculosis treatment, and repeated Ziehl-Neelsen staining of the liver tissue finally identified small numbers of positive bacilli. The diagnosis was difficult to establish before next-generation sequencing because of contradictory radiological results and negative pathological findings. More sensitive diagnostic methods are urgently needed. This is the first case report of hepatic tuberculosis confirmed by next-generation sequencing, and it marks the promising potential of next-generation sequencing in the diagnosis of hepatic lesions of unknown etiology. Copyright © 2018 Elsevier Masson SAS. All rights reserved.
Dudley, Dawn M.; Chin, Emily N.; Bimber, Benjamin N.; Sanabani, Sabri S.; Tarosso, Leandro F.; Costa, Priscilla R.; Sauer, Mariana M.; Kallas, Esper G.; O.’Connor, David H.
2012-01-01
Background Great efforts have been made to increase accessibility of HIV antiretroviral therapy (ART) in low- and middle-income countries. The threat of wide-scale emergence of drug resistance could severely hamper ART scale-up efforts. Population-based surveillance of transmitted HIV drug resistance ensures the use of appropriate first-line regimens to maximize efficacy of ART programs where drug options are limited. However, traditional HIV genotyping is extremely expensive, providing a cost barrier to wide-scale and frequent HIV drug resistance surveillance. Methods/Results We have developed a low-cost, laboratory-scale, next-generation sequencing-based genotyping method to monitor drug resistance. We designed primers specifically to amplify protease and reverse transcriptase from Brazilian HIV subtypes and developed a multiplexing scheme using multiplex identifier tags to minimize cost while providing more robust data than traditional genotyping techniques. Using this approach, we characterized drug resistance from the plasma of 81 HIV-infected individuals in São Paulo, Brazil. We describe the complexities of analyzing next-generation sequencing data and present a simplified open-source workflow to analyze drug resistance data. From these data, we identified drug resistance mutations in 20% of treatment-naïve individuals in our cohort, which is similar to frequencies identified using traditional genotyping in Brazilian patient samples. Conclusion The ultra-wide sequencing approach described here allows multiplexing of at least 48 patient samples per sequencing run, 4 times more than the current genotyping method. This method is also 4-fold more sensitive (5% minimal detection frequency vs. 20%) at a cost 3–5× less than the traditional Sanger-based genotyping method. Lastly, by using a benchtop next-generation sequencer (Roche/454 GS Junior), this approach can be more easily implemented in low-resource settings.
These data provide proof of concept that next-generation HIV drug resistance genotyping is a feasible and low-cost alternative to current genotyping methods and may be particularly beneficial for in-country surveillance of transmitted drug resistance. PMID:22574170
Cheng, Chia-Ying; Tsai, Chia-Feng; Chen, Yu-Ju; Sung, Ting-Yi; Hsu, Wen-Lian
2013-05-03
As spectral library searching has received increasing attention for peptide identification, constructing good decoy spectra from the target spectra is the key to correctly estimating the false discovery rate when searching against the concatenated target-decoy spectral library. Several methods have been proposed to construct decoy spectral libraries. Most of them construct decoy peptide sequences and then generate theoretical spectra accordingly. In this paper, we propose a method, called precursor-swap, which constructs decoy spectral libraries directly at the "spectrum level", without generating decoy peptide sequences, by swapping the precursors of two spectra selected according to a very simple rule. Our spectrum-based method does not require additional effort to deal with ion types (e.g., a, b or c ions), fragmentation mechanisms (e.g., CID or ETD), or unannotated peaks, but preserves many spectral properties. The precursor-swap method is evaluated on different spectral libraries, and the resulting decoy ratios show that it is comparable to other methods. Notably, it is efficient in time and memory usage for constructing decoy libraries. A software tool called Precursor-Swap-Decoy-Generation (PSDG) is publicly available for download at http://ms.iis.sinica.edu.tw/PSDG/.
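A toy version of the precursor-swap construction; the pairing rule below (sort by precursor m/z and pair spectra from opposite ends of the sorted list, requiring a minimum m/z gap) is a simplified stand-in for the paper's selection rule:

```python
def precursor_swap_decoys(spectra, min_mz_gap=5.0):
    """Create decoy spectra by swapping precursor m/z between pairs of target
    spectra, leaving fragment peaks untouched so that spectral properties
    are preserved at the spectrum level."""
    ordered = sorted(spectra, key=lambda s: s["precursor_mz"])
    decoys = []
    i, j = 0, len(ordered) - 1
    while i < j:
        low, high = ordered[i], ordered[j]
        if high["precursor_mz"] - low["precursor_mz"] >= min_mz_gap:
            decoys.append({**low, "precursor_mz": high["precursor_mz"], "decoy": True})
            decoys.append({**high, "precursor_mz": low["precursor_mz"], "decoy": True})
        i += 1
        j -= 1
    return decoys

# Hypothetical target spectra: precursor m/z plus (m/z, intensity) peaks.
targets = [{"precursor_mz": 420.7, "peaks": [(101.1, 5.0)]},
           {"precursor_mz": 980.3, "peaks": [(231.2, 9.0)]}]
decoys = precursor_swap_decoys(targets)
```

The m/z gap keeps a decoy's swapped precursor far enough from the original that the decoy cannot match the same target peptide.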
Weirather, Jason L.; Afshar, Pegah Tootoonchi; Clark, Tyson A.; Tseng, Elizabeth; Powers, Linda S.; Underwood, Jason G.; Zabner, Joseph; Korlach, Jonas; Wong, Wing Hung; Au, Kin Fai
2015-01-01
We developed an innovative hybrid sequencing approach, IDP-fusion, to detect fusion genes, determine fusion sites and identify and quantify fusion isoforms. IDP-fusion is the first method to study gene fusion events by integrating Third Generation Sequencing long reads and Second Generation Sequencing short reads. We applied IDP-fusion to PacBio data and Illumina data from the MCF-7 breast cancer cells. Compared with the existing tools, IDP-fusion detects fusion genes at higher precision and a very low false positive rate. The results show that IDP-fusion will be useful for unraveling the complexity of multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes. PMID:26040699
Method and apparatus for automated assembly
Jones, Rondall E.; Wilson, Randall H.; Calton, Terri L.
1999-01-01
A process and apparatus generates a sequence of steps for assembly or disassembly of a mechanical system. Each step in the sequence is geometrically feasible, i.e., the part motions required are physically possible. Each step in the sequence is also constraint feasible, i.e., the step satisfies user-definable constraints. Constraints allow process and other such limitations, not usually represented in models of the completed mechanical system, to affect the sequence.
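One simple way to generate only constraint-feasible step sequences is a topological sort over precedence constraints; the parts and constraints below are invented, and real geometric-feasibility checks are far more involved than this sketch:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def assembly_sequence(parts, precedence):
    """Return an assembly order satisfying every (before, after) precedence
    constraint; graphlib raises CycleError if the constraints conflict."""
    graph = {part: set() for part in parts}
    for before, after in precedence:
        graph[after].add(before)  # `after` depends on `before`
    return list(TopologicalSorter(graph).static_order())

order = assembly_sequence(
    ["base", "gear", "axle", "cover"],
    [("base", "gear"), ("axle", "gear"), ("gear", "cover")])
```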
de Bruijn cycles for neural decoding.
Aguirre, Geoffrey Karl; Mattar, Marcelo Gomes; Magis-Weinberg, Lucía
2011-06-01
Stimulus counterbalance is critical for studies of neural habituation, bias, anticipation, and (more generally) the effect of stimulus history and context. We introduce de Bruijn cycles, a class of combinatorial objects, as the ideal source of pseudo-random stimulus sequences with arbitrary levels of counterbalance. Neuro-vascular imaging studies (such as BOLD fMRI) have an additional requirement imposed by the filtering and noise properties of the method: only some temporal frequencies of neural modulation are detectable. Extant methods of generating counterbalanced stimulus sequences yield neural modulations that are weakly (or not at all) detected by BOLD fMRI. We solve this limitation using a novel "path-guided" approach for the generation of de Bruijn cycles. The algorithm encodes a hypothesized neural modulation of specific temporal frequency within the seemingly random order of events. By positioning the modulation between the signal and noise bands of the neuro-vascular imaging method, the resulting sequence markedly improves detection power. These sequences may be used to study stimulus context and history effects in a manner not previously possible. Copyright © 2011 Elsevier Inc. All rights reserved.
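A standard construction of such cycles is the FKM (Lyndon-word) algorithm; note this produces the plain counterbalanced cycle, not the path-guided variant introduced in the abstract:

```python
def de_bruijn(k, n):
    """Return a de Bruijn cycle B(k, n): a cyclic sequence over the symbols
    0..k-1 in which every length-n subsequence occurs exactly once."""
    a = [0] * (k * n)
    cycle = []

    def db(t, p):
        if t > n:
            if n % p == 0:           # keep only full Lyndon-word periods
                cycle.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for symbol in range(a[t - p] + 1, k):
                a[t] = symbol
                db(t + 1, t)

    db(1, 1)
    return cycle

seq = de_bruijn(2, 3)  # binary cycle of length 2**3 containing all 3-grams
```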
Jézéquel, Laetitia; Loeper, Jacqueline; Pompon, Denis
2008-11-01
Combinatorial libraries coding for mosaic enzymes with predefined crossover points constitute useful tools to address and model structure-function relationships and for functional optimization of enzymes based on multivariate statistics. The presented method, called sequence-independent generation of a chimera-ordered library (SIGNAL), allows easy shuffling of any predefined amino acid segment between two or more proteins. This method is particularly well adapted to the exchange of protein structural modules. The procedure could also be well suited to generate ordered combinatorial libraries independent of sequence similarities in a robotized manner. Sequence segments to be recombined are first extracted by PCR from a single-stranded template coding for an enzyme of interest using a biotin-avidin-based method. This technique allows the reduction of parental template contamination in the final library. Specific PCR primers allow amplification of two complementary mosaic DNA fragments, overlapping in the region to be exchanged. Fragments are finally reassembled using a fusion PCR. The process is illustrated via the construction of a set of mosaic CYP2B enzymes using this highly modular approach.
A weighted U-statistic for genetic association analyses of sequencing data.
Wei, Changshuai; Li, Ming; He, Zihuai; Vsevolozhskaya, Olga; Schaid, Daniel J; Lu, Qing
2014-12-01
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol. © 2014 WILEY PERIODICALS, INC.
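The shape of such a pairwise statistic can be sketched with toy kernels; the weighted allele-match and phenotype-difference kernels below are illustrative choices, not the actual WU-SEQ kernels:

```python
from itertools import combinations

def weighted_u_statistic(genotypes, phenotypes, weights):
    """Average, over all subject pairs, of a weighted genotype-similarity
    kernel times a phenotype-similarity kernel. No distributional model is
    assumed for the phenotype, mirroring the nonparametric U-statistic idea."""
    n = len(phenotypes)
    total = 0.0
    for i, j in combinations(range(n), 2):
        genetic = sum(w * (gi == gj)
                      for w, gi, gj in zip(weights, genotypes[i], genotypes[j]))
        phenotypic = -abs(phenotypes[i] - phenotypes[j])
        total += genetic * phenotypic
    return total / (n * (n - 1) / 2)

# Three subjects, two variants; the rarer variant gets the larger weight.
geno = [(0, 1), (0, 1), (1, 0)]
u = weighted_u_statistic(geno, [1.2, 1.1, 3.5], weights=[1.0, 4.0])
```

Real rare-variant tests derive the weights from allele frequencies and calibrate the statistic's null distribution; this sketch only shows the pairwise-kernel structure.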
HLA genotyping by next-generation sequencing of complementary DNA.
Segawa, Hidenobu; Kukita, Yoji; Kato, Kikuya
2017-11-28
Genotyping of the human leucocyte antigen (HLA) is indispensable for various medical treatments. However, unambiguous genotyping is technically challenging due to the high polymorphism of the corresponding genomic region. Next-generation sequencing is changing the landscape of genotyping. Beyond its high data throughput, a further advantage is that DNA templates are derived from single molecules, a strong merit for the phasing problem. Although most currently developed technologies use genomic DNA, the use of cDNA could enable genotyping with reduced costs in data production and analysis. We thus developed an HLA genotyping system based on next-generation sequencing of cDNA. Each HLA gene was divided into 3 or 4 target regions subjected to PCR amplification and subsequent sequencing with Ion Torrent PGM. The sequence data were then subjected to an automated analysis. The principle of the analysis was to construct candidate sequences from all possible combinations of variable bases and arrange them in decreasing order of the number of supporting reads. Upon collecting candidate sequences from all target regions, 2 haplotypes were usually assigned. Cases not assigned 2 haplotypes were forwarded to 4 additional processes: selection of candidate sequences applying more stringent criteria, removal of artificial haplotypes, selection of candidate sequences with a relaxed threshold for sequence matching, and a countermeasure for incomplete sequences in the HLA database. The genotyping system was evaluated using 30 samples; the overall accuracy was 97.0% at the field 3 level and 98.3% at the G group level. With one sample, genotyping of DPB1 was not completed due to short read size. We then developed a method for complete sequencing of individual molecules of the DPB1 gene using molecular barcode technology. The performance of the automatic genotyping system was comparable to that of systems developed in previous studies. Thus, next-generation sequencing of cDNA is a viable option for HLA genotyping.
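The ranking step described above, enumerating candidate sequences from every combination of variable bases and sorting them by supporting reads, can be sketched as follows (a toy illustration under simplified assumptions, not the authors' pipeline: reads are taken to be equal-length, error-free strings covering one target region):

```python
from collections import Counter
from itertools import product

def candidate_sequences(reads):
    """Enumerate candidate sequences from all combinations of the
    bases observed at each position, ranked in decreasing order of
    the number of reads matching each candidate exactly."""
    counts = Counter(reads)
    length = len(reads[0])
    # Bases observed at each position across all reads
    observed = [sorted({r[i] for r in reads}) for i in range(length)]
    candidates = ["".join(bases) for bases in product(*observed)]
    return sorted(candidates, key=lambda c: counts[c], reverse=True)

# Toy reads over a 4-base region with two variable positions
reads = ["ACGT", "ACGT", "ACGT", "ATGT", "ATGT", "ACGA"]
ranked = candidate_sequences(reads)
# The two best-supported candidates suggest the two haplotypes
```

In this toy example the two top-ranked candidates (ACGT and ATGT) would be assigned as the haplotype pair.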
Gullapalli, Rama R; Desai, Ketaki V; Santana-Santos, Lucas; Kant, Jeffrey A; Becich, Michael J
2012-01-01
The Human Genome Project (HGP) provided the initial draft of mankind's DNA sequence in 2001. The HGP was produced by 23 collaborating laboratories using Sanger sequencing of mapped regions as well as shotgun sequencing techniques in a process that occupied 13 years at a cost of ~$3 billion. Today, Next Generation Sequencing (NGS) techniques represent the next phase in the evolution of DNA sequencing technology at dramatically reduced cost compared to traditional Sanger sequencing. A single laboratory today can sequence the entire human genome in a few days for a few thousand dollars in reagents and staff time. Routine whole exome or even whole genome sequencing of clinical patients is well within the realm of affordability for many academic institutions across the country. This paper reviews current sequencing technology methods and upcoming advancements in sequencing technology as well as challenges associated with data generation, data manipulation and data storage. Implementation of routine NGS data in cancer genomics is discussed along with potential pitfalls in the interpretation of the NGS data. The overarching importance of bioinformatics in the clinical implementation of NGS is emphasized. We also review the issue of physician education, which is an important consideration for the successful implementation of NGS in the clinical workplace. NGS technologies represent a golden opportunity for the next generation of pathologists to be at the leading edge of the personalized medicine approaches coming our way. Often under-emphasized issues of data access and control as well as potential ethical implications of whole genome NGS sequencing are also discussed. Despite some challenges, it's hard not to be optimistic about the future of personalized genome sequencing and its potential impact on patient care and the advancement of knowledge of human biology and disease in the near future.
Evaluation of nearest-neighbor methods for detection of chimeric small-subunit rRNA sequences
NASA Technical Reports Server (NTRS)
Robison-Cox, J. F.; Bateson, M. M.; Ward, D. M.
1995-01-01
Detection of chimeric artifacts formed when PCR is used to retrieve naturally occurring small-subunit (SSU) rRNA sequences may rely on demonstrating that different sequence domains have different phylogenetic affiliations. We evaluated the CHECK_CHIMERA method of the Ribosomal Database Project and another method which we developed, both based on determining nearest neighbors of different sequence domains, for their ability to discern artificially generated SSU rRNA chimeras from authentic Ribosomal Database Project sequences. The reliability of both methods decreases when the parental sequences which contribute to chimera formation are more than 82 to 84% similar. Detection is also complicated by the occurrence of authentic SSU rRNA sequences that behave like chimeras. We developed a naive statistical test based on CHECK_CHIMERA output and used it to evaluate previously reported SSU rRNA chimeras. Application of this test also suggests that chimeras might be formed by retrieving SSU rRNAs as cDNA. The amount of uncertainty associated with nearest-neighbor analyses indicates that such tests alone are insufficient and that better methods are needed.
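The nearest-neighbor idea, flagging a sequence whose different domains have different closest relatives, can be illustrated with a minimal sketch. Here a hypothetical k-mer/Jaccard similarity stands in for the phylogenetic affiliation measure; this is not the CHECK_CHIMERA algorithm itself:

```python
def kmer_set(seq, k=4):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def nearest(fragment, references, k=4):
    """Name of the reference with the highest Jaccard k-mer similarity."""
    frag = kmer_set(fragment, k)
    def jaccard(name):
        ref = kmer_set(references[name], k)
        return len(frag & ref) / len(frag | ref)
    return max(references, key=jaccard)

def looks_chimeric(query, references, k=4):
    """Flag the query if its two halves have different nearest neighbors."""
    mid = len(query) // 2
    return nearest(query[:mid], references, k) != nearest(query[mid:], references, k)

refs = {"A": "ACGTACGTACGTACGTACGT", "B": "TTGGCCAATTGGCCAATTGG"}
chimera = refs["A"][:10] + refs["B"][10:]
# looks_chimeric(chimera, refs) -> True; looks_chimeric(refs["A"], refs) -> False
```

As the abstract notes, such tests degrade as the parental sequences become more similar, since both halves then pick the same nearest neighbor.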
Dawson, Natalie L; Sillitoe, Ian; Lees, Jonathan G; Lam, Su Datt; Orengo, Christine A
2017-01-01
This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.
CORALINA: a universal method for the generation of gRNA libraries for CRISPR-based screening.
Köferle, Anna; Worf, Karolina; Breunig, Christopher; Baumann, Valentin; Herrero, Javier; Wiesbeck, Maximilian; Hutter, Lukas H; Götz, Magdalena; Fuchs, Christiane; Beck, Stephan; Stricker, Stefan H
2016-11-14
The bacterial CRISPR system is fast becoming the most popular genetic and epigenetic engineering tool due to its universal applicability and adaptability. The desire to deploy CRISPR-based methods in a large variety of species and contexts has created an urgent need for the development of easy, time- and cost-effective methods enabling large-scale screening approaches. Here we describe CORALINA (comprehensive gRNA library generation through controlled nuclease activity), a method for the generation of comprehensive gRNA libraries for CRISPR-based screens. CORALINA gRNA libraries can be derived from any source of DNA without the need for complex oligonucleotide synthesis. We show the utility of CORALINA for human and mouse genomic DNA, its reproducibility in covering the most relevant genomic features, including regulatory, coding and non-coding sequences, and confirm the functionality of CORALINA-generated gRNAs. Its simplicity and cost-effectiveness make CORALINA suitable for any experimental system. The unprecedented sequence complexities obtainable with CORALINA libraries are a necessary prerequisite for less biased large-scale genomic and epigenomic screens.
Carr, Ian M; Morgan, Joanne; Watson, Christopher; Melnik, Svitlana; Diggle, Christine P; Logan, Clare V; Harrison, Sally M; Taylor, Graham R; Pena, Sergio D J; Markham, Alexander F; Alkuraya, Fowzan S; Black, Graeme C M; Ali, Manir; Bonthron, David T
2013-07-01
Massively parallel ("next generation") DNA sequencing (NGS) has quickly become the method of choice for seeking pathogenic mutations in rare uncharacterized monogenic diseases. Typically, before DNA sequencing, protein-coding regions are enriched from patient genomic DNA, representing either the entire genome ("exome sequencing") or selected mapped candidate loci. Sequence variants, identified as differences between the patient's and the human genome reference sequences, are then filtered according to various quality parameters. Changes are screened against datasets of known polymorphisms, such as dbSNP and the 1000 Genomes Project, in an effort to narrow the list of candidate causative variants. An increasing number of commercial services now offer to both generate and align NGS data to a reference genome. This potentially allows small groups with limited computing infrastructure and informatics skills to utilize this technology. However, the capability to effectively filter and assess sequence variants is still an important bottleneck in the identification of deleterious sequence variants in both research and diagnostic settings. We have developed an approach to this problem comprising a user-friendly suite of programs that can interactively analyze, filter and screen enrichment-capture NGS data. These programs ("Agile Suite") are particularly suitable for small-scale gene discovery or for diagnostic analysis. © 2013 WILEY PERIODICALS, INC.
Software for pre-processing Illumina next-generation sequencing short read sequences
2014-01-01
Background When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. Methods We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Results Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance.
In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. Conclusions Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. We recommend combining sequencing artifacts removal, and quality score based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects. PMID:24955109
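One of the pre-processing operations compared above, quality-score-based 3' trimming with length filtering, can be sketched as follows (an illustrative simplification, not the ngsShoRT implementation; thresholds are arbitrary examples):

```python
def trim_read(seq, quals, min_q=20, min_len=30):
    """Trim low-quality bases from the 3' end of a read, then discard
    the read entirely if it falls below a minimum length.
    `quals` holds per-base Phred quality scores."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end] if end >= min_len else None

# Toy read: six good bases followed by two low-quality ones
trimmed = trim_read("ACGTACGT", [30] * 6 + [10, 5], min_len=4)  # -> "ACGTAC"
```

Real trimmers combine this with adapter removal, 5' trimming and sliding-window quality rules, and stream millions of reads in parallel.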
Design of association studies with pooled or un-pooled next-generation sequencing data.
Kim, Su Yeon; Li, Yingrui; Guo, Yiran; Li, Ruiqiang; Holmkvist, Johan; Hansen, Torben; Pedersen, Oluf; Wang, Jun; Nielsen, Rasmus
2010-07-01
Most common hereditary diseases in humans are complex and multifactorial. Large-scale genome-wide association studies based on SNP genotyping have only identified a small fraction of the heritable variation of these diseases. One explanation may be that many rare variants (minor allele frequency, MAF, <5%), which are not included in the common genotyping platforms, may contribute substantially to the genetic variation of these diseases. Next-generation sequencing, which would allow the analysis of rare variants, is now becoming so cheap that it provides a viable alternative to SNP genotyping. In this paper, we present cost-effective protocols for using next-generation sequencing in association mapping studies based on pooled and un-pooled samples, and identify optimal designs with respect to the total number of individuals, the number of individuals per pool, and the sequencing coverage. We perform a small empirical study to evaluate the pooling variance in a realistic setting where pooling is combined with exon-capturing. To test for associations, we develop a likelihood ratio statistic that accounts for the high error rate of next-generation sequencing data. We also perform extensive simulations to determine the power and accuracy of this method. Overall, our findings suggest that with a fixed cost, sequencing many individuals at a more shallow depth with larger pool size achieves higher power than sequencing a small number of individuals at higher depth with smaller pool size, even in the presence of high error rates. Our results provide guidelines for researchers who are developing association mapping studies based on next-generation sequencing. (c) 2010 Wiley-Liss, Inc.
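The likelihood ratio idea, comparing pool-specific allele frequencies against a shared frequency while folding the sequencing error rate into the read-level probability, can be sketched as follows. This is a simplified binomial model with a naive moment-based frequency estimate, not the paper's statistic:

```python
from math import log, lgamma

def log_binom(k, n, p):
    """log Binomial(k; n, p), with p clipped away from 0 and 1."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def read_prob(f, err):
    """Probability a read shows the alternative allele, given pool
    allele frequency f and a symmetric per-read error rate err."""
    return f * (1 - err) + (1 - f) * err

def lrt_stat(k_case, n_case, k_ctrl, n_ctrl, err=0.01):
    """2 * (logL_alt - logL_null): pool-specific vs shared allele
    frequency, with frequencies estimated by moment inversion of the
    observed read fraction (clipped to [0, 1])."""
    def f_hat(k, n):
        return min(max(((k / n) - err) / (1 - 2 * err), 0.0), 1.0)
    f1, f0 = f_hat(k_case, n_case), f_hat(k_ctrl, n_ctrl)
    fs = f_hat(k_case + k_ctrl, n_case + n_ctrl)
    alt = (log_binom(k_case, n_case, read_prob(f1, err))
           + log_binom(k_ctrl, n_ctrl, read_prob(f0, err)))
    null = (log_binom(k_case, n_case, read_prob(fs, err))
            + log_binom(k_ctrl, n_ctrl, read_prob(fs, err)))
    return 2 * (alt - null)
```

A larger statistic indicates a stronger case/control frequency difference than error alone would explain; equal read fractions give a statistic of zero.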
Rasmussen, Ulla; Svenning, Mette M.
1998-01-01
The presence of repeated DNA (short tandemly repeated repetitive [STRR] and long tandemly repeated repetitive [LTRR]) sequences in the genome of cyanobacteria was used to generate a fingerprint method for symbiotic and free-living isolates. Primers corresponding to the STRR and LTRR sequences were used in the PCR, resulting in a method which generates specific fingerprints for individual isolates. The method was useful both with purified DNA and with intact cyanobacterial filaments or cells as templates for the PCR. Twenty-three of a total of 35 Nostoc isolates were symbiotic isolates from angiosperm Gunnera species, including isolates from the same Gunnera species as well as from different species. The results show a genetic similarity among isolates from different Gunnera species as well as a genetic heterogeneity among isolates from the same Gunnera species. Isolates which have been postulated to be closely related or identical gave similar results by the PCR method, indicating that the technique is useful for clustering of even closely related strains. The method was also applied to nonheterocystous cyanobacteria, from which a fingerprint pattern was obtained. PMID:16349487
Problems with the random number generator RANF implemented on the CDC cyber 205
NASA Astrophysics Data System (ADS)
Kalle, Claus; Wansleben, Stephan
1984-10-01
We show that using RANF may lead to wrong results when lattice models are simulated by Monte Carlo methods. We present a shift-register sequence random number generator which generates two random numbers per cycle on a two-pipe CDC Cyber 205.
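A generic shift-register (lagged-XOR) generator of the kind referred to can be sketched as follows. This is an illustrative R(250,103)-style generator, not the authors' vectorized Cyber 205 code, and the seeding scheme (filling the register from Python's stdlib RNG) is an assumption of the sketch:

```python
import random

class ShiftRegisterRNG:
    """x[n] = x[n-p] XOR x[n-q]: a generalized feedback shift register
    generator in the spirit of R(250, 103)."""
    def __init__(self, seed=1, p=250, q=103, bits=32):
        rnd = random.Random(seed)
        self.p, self.q = p, q
        self.mask = (1 << bits) - 1
        self.state = [rnd.getrandbits(bits) for _ in range(p)]
        self.i = 0  # index of the oldest word, x[n-p]

    def next(self):
        # XOR the two lagged words, overwrite the oldest with the result
        x = self.state[self.i] ^ self.state[(self.i - self.q) % self.p]
        self.state[self.i] = x
        self.i = (self.i + 1) % self.p
        return x & self.mask

g = ShiftRegisterRNG(seed=42)
sample = [g.next() for _ in range(5)]  # deterministic for a given seed
```

Shift-register generators of this family are fast and vectorize well, but, as the abstract's warning about RANF illustrates, any generator's correlations should be checked against the simulation at hand.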
SU-F-T-527: A Novel Dynamic Multileaf Collimator Leaf-Sequencing Algorithm in Radiation Therapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jing, J; Lin, H; Chow, J
Purpose: A novel leaf-sequencing algorithm is developed for generating arbitrary beam intensity profiles in discrete levels using a dynamic multileaf collimator (MLC). The efficiency of this dynamic MLC leaf-sequencing method was evaluated using external beam treatment plans delivered by the intensity modulated radiation therapy technique. Methods: To qualify and validate this algorithm, an integral test for the beam segment of the MLC generated by the CORVUS treatment planning system was performed with clinical intensity map experiments. The treatment plans were optimized and the fluence maps for all photon beams were determined. This algorithm starts with the algebraic expression for the area under the beam profile. The coefficients in the expression can be transformed into the specifications for the leaf-setting sequence. The leaf optimization procedure was then applied and analyzed for clinically relevant intensity profiles in cancer treatment. Results: The macrophysical effect of this method can be described by volumetric plan evaluation tools such as dose-volume histograms (DVHs). The DVH results are in good agreement with those from the CORVUS treatment planning system. Conclusion: We developed a dynamic MLC method to examine the stability of leaf speed, including effects of acceleration and deceleration of leaf motion, to ensure that the stability of leaf speed did not affect the generated intensity profile. It was found that the mechanical requirements were better satisfied using this method. The project is sponsored by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
Logan, Grace; Freimanis, Graham L; King, David J; Valdazo-González, Begoña; Bachanek-Bankowska, Katarzyna; Sanderson, Nicholas D; Knowles, Nick J; King, Donald P; Cottam, Eleanor M
2014-09-30
Next-Generation Sequencing (NGS) is revolutionizing molecular epidemiology by providing new approaches to undertake whole genome sequencing (WGS) in diagnostic settings for a variety of human and veterinary pathogens. Previous sequencing protocols have been subject to biases such as those encountered during PCR amplification and cell culture, or are restricted by the need for large quantities of starting material. We describe here a simple and robust methodology for the generation of whole genome sequences on the Illumina MiSeq. This protocol is specific for foot-and-mouth disease virus (FMDV) or other polyadenylated RNA viruses and circumvents both the use of PCR and the requirement for large amounts of initial template. The protocol was successfully validated using five FMDV positive clinical samples from the 2001 epidemic in the United Kingdom, as well as a panel of representative viruses from all seven serotypes. In addition, this protocol was successfully used to recover 94% of an FMDV genome that had previously been identified as cell culture negative. Genome sequences from three other non-FMDV polyadenylated RNA viruses (EMCV, ERAV, VESV) were also obtained with minor protocol amendments. We calculated that a minimum coverage depth of 22 reads was required to produce an accurate consensus sequence for FMDV O. This was achieved in 5 FMDV/O/UKG isolates and the type O FMDV from the serotype panel with the exception of the 5' genomic termini and area immediately flanking the poly(C) region. We have developed a universal WGS method for FMDV and other polyadenylated RNA viruses. This method works successfully from a limited quantity of starting material and eliminates the requirement for genome-specific PCR amplification. This protocol has the potential to generate consensus-level sequences within a routine high-throughput diagnostic environment.
Laurie, Matthew T; Bertout, Jessica A; Taylor, Sean D; Burton, Joshua N; Shendure, Jay A; Bielas, Jason H
2013-08-01
Due to the high cost of failed runs and suboptimal data yields, quantification and determination of fragment size range are crucial steps in the library preparation process for massively parallel sequencing (or next-generation sequencing). Current library quality control methods commonly involve quantification using real-time quantitative PCR and size determination using gel or capillary electrophoresis. These methods are laborious and subject to a number of significant limitations that can make library calibration unreliable. Herein, we propose and test an alternative method for quality control of sequencing libraries using droplet digital PCR (ddPCR). By exploiting a correlation we have discovered between droplet fluorescence and amplicon size, we achieve the joint quantification and size determination of target DNA with a single ddPCR assay. We demonstrate the accuracy and precision of applying this method to the preparation of sequencing libraries.
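The exploited correlation between droplet fluorescence and amplicon size suggests a simple calibration workflow, sketched here under the assumption of a linear relationship (the calibration data are hypothetical; the paper reports a correlation but this sketch's functional form and values are illustrative only):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def estimate_size(fluorescence, a, b):
    """Invert the fitted calibration to predict amplicon size."""
    return (fluorescence - b) / a

# Hypothetical calibration: known amplicon sizes (bp) vs mean droplet
# fluorescence (arbitrary units); the values are made up for illustration.
sizes = [100, 200, 300, 400]
fluor = [0.01 * s + 2.0 for s in sizes]
a, b = fit_line(sizes, fluor)
unknown_size = estimate_size(4.5, a, b)  # approximately 250 bp
```

In the assay itself the same droplet data also yield absolute quantification, since the fraction of positive droplets determines template concentration by Poisson statistics.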
MIPS bacterial genomes functional annotation benchmark dataset.
Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner
2005-05-15
Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as a benchmark) as well as tedious preparatory work to generate the sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab
Dias, Miguel de Sousa; Hernan, Imma; Pascual, Beatriz; Borràs, Emma; Mañé, Begoña; Gamundi, Maria José
2013-01-01
Purpose To devise an effective method for detecting mutations in 12 genes (CA4, CRX, IMPDH1, NR2E3, RP9, PRPF3, PRPF8, PRPF31, PRPH2, RHO, RP1, and TOPORS) commonly associated with autosomal dominant retinitis pigmentosa (adRP) that account for more than 95% of known mutations. Methods We used long-range PCR (LR-PCR) amplification and next-generation sequencing (NGS) performed on a GS Junior 454 benchtop sequencing platform. Twenty LR-PCR fragments, between 3,000 and 10,000 bp, containing all coding exons and flanking regions of the 12 genes, were obtained from DNA samples of patients with adRP. Sequencing libraries were prepared with an enzymatic (Fragmentase technology) method. Results Complete coverage of the coding and flanking sequences of the 12 genes assayed was obtained with NGS, with an average sequence depth of 380× (ranging from 128× to 1,077×). Five previously known mutations in the adRP genes were detected with a sequence variation percentage between 35% and 65%. We also performed a parallel sequence analysis of four samples, three of them from new index patients with adRP, in which two novel mutations were detected in RHO (p.Asn73del) and PRPF31 (p.Ile109del). Conclusions The results demonstrate that genomic LR-PCR amplification together with NGS is an effective method for analyzing individual patient samples for mutations in a monogenic heterogeneous disease such as adRP. This approach proved effective for the parallel analysis of adRP and has been introduced as a routine procedure. Additionally, this approach could be extended to other heterogeneous genetic diseases. PMID:23559859
Nakano, Shogo; Motoyama, Tomoharu; Miyashita, Yurina; Ishizuka, Yuki; Matsuo, Naoya; Tokiwa, Hiroaki; Shinoda, Suguru; Asano, Yasuhisa; Ito, Sohei
2018-05-22
The expansion of protein sequence databases has enabled us to design artificial proteins by sequence-based design methods, such as full consensus design (FCD) and ancestral sequence reconstruction (ASR). Artificial proteins with enhanced activity levels compared with native ones can potentially be generated by such methods, but successful design is rare because preparing a sequence library by curating the database and selecting a method is difficult. Utilizing a curated library prepared by reducing conservation energies, we successfully designed two artificial L-threonine 3-dehydrogenases (SDR-TDHs) with higher activity levels than native SDR-TDH, FcTDH-N1 and AncTDH, using FCD and ASR, respectively. The artificial SDR-TDHs had excellent thermal stability and NAD+ recognition compared to the native SDR-TDH from Cupriavidus necator (CnTDH): the melting temperatures of FcTDH-N1 and AncTDH were about 10 and 5°C higher than that of CnTDH, respectively, and the dissociation constants toward NAD+ of FcTDH-N1 and AncTDH were two- and seven-fold lower than that of CnTDH, respectively. The enzymatic efficiency of the artificial SDR-TDHs was comparable to that of CnTDH. Crystal structures of FcTDH-N1 and AncTDH were determined at 2.8 and 2.1 Å resolution, respectively. Structural and MD simulation analysis of the SDR-TDHs indicated that only the flexibility at specific regions was changed, suggesting that the multiple mutations introduced in the artificial SDR-TDHs altered their flexibility and thereby affected their enzymatic properties. Benchmark analysis of the SDR-TDHs indicated that both FCD and ASR can generate highly functional proteins if a curated library is prepared appropriately.
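The core of full consensus design, taking the most frequent residue at each column of a curated multiple sequence alignment, can be sketched as follows (a minimal illustration; the authors additionally curate the library by reducing conservation energies, which this sketch omits):

```python
from collections import Counter

def full_consensus(alignment):
    """Column-wise majority residue of an aligned sequence set.
    Gaps ('-') are ignored; ties are broken by Counter ordering."""
    consensus = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r != "-")
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)

# Toy alignment of three short protein fragments
designed = full_consensus(["MKV", "MKI", "MRV"])  # -> "MKV"
```

Ancestral sequence reconstruction differs in that it infers residues at internal nodes of a phylogeny rather than taking a flat majority, but both start from the same kind of curated alignment.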
An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.
Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A V S K; Varshney, Rajeev K
2014-01-01
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of a command line interface, massive computational resources and expertise, which is a daunting prospect for biologists. Further, the SNP information generated may not be readily usable for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools with a graphical user interface, called Integrated SNP Mining and Utilization (ISMU), for SNP discovery and utilization through the development of genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets, such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data, at high speed. The pipeline is very useful for the plant genetics and breeding community with no computational expertise, enabling SNP discovery and utilization in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge next generation sequencing datasets. It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as standalone free software.
G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read-Depth Methods.
Manconi, Andrea; Manca, Emanuele; Moscatelli, Marco; Gnocchi, Matteo; Orro, Alessandro; Armano, Giuliano; Milanesi, Luciano
2015-01-01
Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SVs and to study how they are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) are increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions with significantly different read-depth from the other ones. The pipeline analysis of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV regions identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short-reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.
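The first stages of the read-depth pipeline described above, building the raw signal from mapped read positions and normalizing it, can be sketched as follows (a serial toy version of operations that G-CNV parallelizes on GPUs; bin size and normalization choice are illustrative):

```python
def read_depth_signal(positions, genome_len, bin_size):
    """Count mapped read start positions per fixed-size bin:
    the raw read-depth signal."""
    n_bins = (genome_len + bin_size - 1) // bin_size
    depth = [0] * n_bins
    for pos in positions:
        depth[pos // bin_size] += 1
    return depth

def median_normalize(depth):
    """Divide each bin by the median depth so unaltered (e.g. diploid)
    regions sit near 1.0; deletions fall below, duplications above."""
    ordered = sorted(depth)
    median = ordered[len(ordered) // 2]
    return [d / median for d in depth] if median else depth

signal = read_depth_signal([0, 1, 2, 100, 100, 200], 300, 100)  # -> [3, 2, 1]
normalized = median_normalize(signal)  # -> [1.5, 1.0, 0.5]
```

Downstream stages would then segment this normalized signal into regions of consistent copy number and estimate the copy number per segment.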
BASiNET-BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification.
Ito, Eric Augusto; Katahira, Isaque; Vicente, Fábio Fernandes da Rocha; Pereira, Luiz Filipe Protasio; Lopes, Fabrício Martins
2018-06-05
With the emergence of Next Generation Sequencing (NGS) technologies, a large volume of sequence data, in particular from de novo sequencing, has been produced rapidly at relatively low cost. In this context, computational tools are increasingly important to assist in the identification of relevant information to understand the functioning of organisms. This work introduces BASiNET, an alignment-free tool for classifying biological sequences based on feature extraction from complex network measurements. The method initially transforms the sequences and represents them as complex networks. It then extracts topological measures and constructs a feature vector that is used to classify the sequences. The method was evaluated in the classification of coding and non-coding RNAs of 13 species and compared to the CNCI, PLEK and CPC2 methods. BASiNET outperformed all compared methods in all adopted organisms and datasets. BASiNET classified sequences in all organisms with high accuracy and low standard deviation, showing that the method is robust and not biased by the organism. The proposed methodology is implemented in open source in the R language and freely available for download at https://cran.r-project.org/package=BASiNET.
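The transformation of a sequence into a complex network and the extraction of topological measures can be illustrated with a minimal sketch (a hypothetical k-mer adjacency construction with two toy measures; BASiNET's actual network construction and measure set differ):

```python
def kmer_network(seq, k=3):
    """Build a directed network whose nodes are the k-mers of `seq`
    and whose edges connect consecutive overlapping k-mers."""
    nodes, edges = set(), set()
    prev = seq[:k]
    nodes.add(prev)
    for i in range(1, len(seq) - k + 1):
        cur = seq[i:i + k]
        nodes.add(cur)
        edges.add((prev, cur))
        prev = cur
    return nodes, edges

def features(nodes, edges):
    """A few simple topological measures usable as a feature vector."""
    n, e = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": e,
        "avg_degree": 2 * e / n if n else 0.0,
        "density": e / (n * (n - 1)) if n > 1 else 0.0,  # directed graph
    }

nodes, edges = kmer_network("ACGTACG")
feats = features(nodes, edges)  # 4 nodes, 4 edges, avg_degree 2.0
```

Feature vectors of this kind, computed identically for every sequence, are what the downstream classifier consumes, which is why the approach needs no alignment.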
PipeOnline 2.0: automated EST processing and functional data sorting.
Ayoubi, Patricia; Jin, Xiaojing; Leite, Saul; Liu, Xianghui; Martajaja, Jeson; Abduraham, Abdurashid; Wan, Qiaolan; Yan, Wei; Misawa, Eduardo; Prade, Rolf A
2002-11-01
Expressed sequence tags (ESTs) are generated and deposited in the public domain as redundant, unannotated, single-pass reactions, with virtually no biological content. PipeOnline automatically analyses and transforms large collections of raw DNA-sequence data from chromatograms or FASTA files by calling base quality, screening for and removing vector sequences, assembling and rewriting consensus sequences of redundant input files into a unigene EST data set, and finally annotating it through translation, amino acid sequence similarity searches, annotation from public databases and functional data. PipeOnline generates an annotated database, retaining the processed unigene sequence, clone/file history, alignments with similar sequences, and a proposed functional classification, if available. Functional annotation is automatic and based on a novel method that relies on homology of amino acid sequence multiplicity within GenBank records. Records are examined through a function-ordered browser or keyword queries, with automated export of results. PipeOnline offers customization for individual projects (MyPipeOnline), automated updating and an alert service. PipeOnline is available at http://stress-genomics.org.
Highly multiplexed targeted DNA sequencing from single nuclei.
Leung, Marco L; Wang, Yong; Kim, Charissa; Gao, Ruli; Jiang, Jerry; Sei, Emi; Navin, Nicholas E
2016-02-01
Single-cell DNA sequencing methods are challenged by poor physical coverage, high technical error rates and low throughput. To address these issues, we developed a single-cell DNA sequencing protocol that combines flow-sorting of single nuclei, time-limited multiple-displacement amplification (MDA), low-input library preparation, DNA barcoding, targeted capture and next-generation sequencing (NGS). This approach represents a major improvement over our previous single nucleus sequencing (SNS) Nature Protocols paper in terms of generating higher-coverage data (>90%), thereby enabling the detection of genome-wide variants in single mammalian cells at base-pair resolution. Furthermore, by pooling 48-96 single-cell libraries together for targeted capture, this approach can be used to sequence many single-cell libraries in parallel in a single reaction. This protocol greatly reduces the cost of single-cell DNA sequencing, and it can be completed in 5-6 d by advanced users. This single-cell DNA sequencing protocol has broad applications for studying rare cells and complex populations in diverse fields of biological research and medicine.
Fast, accurate and easy-to-pipeline methods for amplicon sequence processing
NASA Astrophysics Data System (ADS)
Antonielli, Livio; Sessitsch, Angela
2016-04-01
Next-generation sequencing (NGS) technologies have been established for years as an essential resource in microbiology. While metagenomic studies benefit from the continuously increasing throughput of the Illumina (Solexa) technology, the spread of third-generation sequencing technologies (PacBio, Oxford Nanopore) is taking whole-genome sequencing beyond the assembly of fragmented draft genomes, making it possible to finish bacterial genomes even without short-read correction. Besides (meta)genomic analysis, next-generation amplicon sequencing remains fundamental for microbial studies. Amplicon sequencing of the 16S rRNA gene and the ITS (Internal Transcribed Spacer) is a well-established, widespread method for a multitude of purposes concerning the identification and comparison of archaeal/bacterial (16S rRNA gene) and fungal (ITS) communities occurring in diverse environments. Numerous pipelines have been developed to process NGS-derived amplicon sequences, of which Mothur, QIIME and USEARCH are the best known and most cited. The entire process will be addressed, from initial raw sequence data through read error correction, paired-end read assembly, primer stripping, quality filtering, clustering, OTU taxonomic classification and BIOM table rarefaction, as well as alternative "normalization" methods. An effective and accurate strategy using state-of-the-art bioinformatic tools will be presented, along with the example of a straightforward one-script pipeline for 16S rRNA gene or ITS MiSeq amplicon sequencing. Finally, instructions on how to automatically retrieve nucleotide sequences from NCBI and thereby apply the pipeline to targets other than the 16S rRNA gene (Greengenes, SILVA) and ITS (UNITE) will be discussed.
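One of the pipeline steps listed above, quality filtering, can be sketched in a few lines. The thresholds and the function names (`mean_phred`, `quality_filter`) are illustrative choices, not the defaults of any particular pipeline:

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Sanger +33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def quality_filter(reads, min_len=100, min_q=25):
    """Keep (sequence, quality) pairs that meet minimum length and
    minimum mean-quality thresholds."""
    return [(s, q) for s, q in reads
            if len(s) >= min_len and mean_phred(q) >= min_q]

reads = [("ACGT" * 30, "I" * 120),   # mean Q40, length 120: kept
         ("ACGT" * 30, "#" * 120),   # mean Q2: discarded
         ("ACGT", "IIII")]           # too short: discarded
passed = quality_filter(reads)
```

Real pipelines apply additional criteria (expected-error filtering, ambiguous-base limits, sliding-window trimming), but the shape of the step is the same.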
Mohd-Yusoff, Nur Fatihah; Ruperao, Pradeep; Tomoyoshi, Nurain Emylia; Edwards, David; Gresshoff, Peter M.; Biswas, Bandana; Batley, Jacqueline
2015-01-01
Genetic structure can be altered by chemical mutagenesis, which is a common method applied in molecular biology and genetics. Second-generation sequencing provides a platform to reveal base alterations occurring across the whole genome due to mutagenesis. A model legume, Lotus japonicus ecotype Miyakojima, was chemically mutated with the alkylating agent ethyl methanesulfonate (EMS) for the scanning of DNA lesions throughout the genome. Using second-generation sequencing, two individually mutated third-generation progeny (M3, named AM and AS) were sequenced and analyzed to identify single-nucleotide polymorphisms and reveal the effects of EMS on nucleotide sequences in these mutant genomes. Single-nucleotide polymorphisms were found on average every 208 kb (AS) and 202 kb (AM), with a bias toward G/C-to-A/T changes at a low percentage. Most mutations were intergenic. The mutation spectra of the genomes were comparable across their individual chromosomes; however, each mutated genome carries unique alterations, which are useful for identifying the causal mutations behind their phenotypic changes. The data obtained demonstrate that whole-genome sequencing is applicable as a high-throughput tool to investigate genomic changes due to mutagenesis. The identification of these single-point mutations will facilitate the identification of phenotypically causative mutations in EMS-mutated germplasm. PMID:25660167
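The reported G/C-to-A/T bias can be tallied directly from called SNPs. The `(ref, alt)` pair representation and the function name below are assumptions for illustration, not the study's actual pipeline:

```python
from collections import Counter

def mutation_spectrum(snps):
    """Count substitution types from (ref, alt) base pairs and return
    the fraction that are G->A or C->T, the transitions EMS enriches."""
    counts = Counter(f"{ref}>{alt}" for ref, alt in snps)
    gc_to_at = counts["G>A"] + counts["C>T"]
    total = sum(counts.values())
    return counts, (gc_to_at / total if total else 0.0)

snps = [("G", "A"), ("C", "T"), ("G", "A"), ("A", "G"), ("C", "T")]
counts, bias = mutation_spectrum(snps)
```

Comparing the resulting spectra between the two mutant genomes is then a matter of comparing two small count tables.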
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit, Gaëtan; Lemaitre, Claire; Lavenier, Dominique; Drezen, Erwan; Dayris, Thibault; Uricaru, Raluca; Rizk, Guillaume
2015-09-14
Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools such as the widely used gzip. We present a novel reference-free method to compress data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses. LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.
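The anchor-plus-bifurcations encoding can be illustrated with an ordinary k-mer set standing in for LEON's Bloom-filter de Bruijn graph. This is a simplified sketch, not LEON's implementation: it omits the probabilistic filter, error handling and quality-score processing, and all names are hypothetical:

```python
def encode_read(read, kmers, k=4):
    """Encode a read as (anchor k-mer, bifurcation list): walk the
    k-mer graph and record the chosen base only where the path forks."""
    anchor, node, forks = read[:k], read[:k], []
    for base in read[k:]:
        successors = [b for b in "ACGT" if node[1:] + b in kmers]
        if len(successors) > 1:
            forks.append(base)          # ambiguity: remember the choice
        node = node[1:] + base
    return anchor, forks

def decode_read(anchor, forks, kmers, length, k=4):
    """Rebuild the read by re-walking the graph, consuming one recorded
    choice at every fork."""
    read, node, it = anchor, anchor, iter(forks)
    while len(read) < length:
        successors = [b for b in "ACGT" if node[1:] + b in kmers]
        base = successors[0] if len(successors) == 1 else next(it)
        read += base
        node = node[1:] + base
    return read

ref = "ATGCGTACGATGCA"
kmers = {ref[i:i + 4] for i in range(len(ref) - 3)}
anchor, forks = encode_read(ref, kmers)
restored = decode_read(anchor, forks, kmers, len(ref))
```

Because most graph positions have a single successor, the stored representation is just one k-mer plus a short list of fork choices, which is where the compression comes from.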
Hodzic, Jasin; Gurbeta, Lejla; Omanovic-Miklicanin, Enisa; Badnjevic, Almir
2017-01-01
Introduction: Major advancements in DNA sequencing methods introduced in the first decade of the new millennium initiated a rapid expansion of sequencing studies, which yielded a tremendous amount of DNA sequence data, including whole sequenced genomes of various species, plants among them. A set of novel sequencing platforms, often collectively named "next-generation sequencing" (NGS), completely transformed the life sciences by allowing extensive throughput while greatly reducing the necessary time, labor and cost of any sequencing endeavor. Purpose: The purpose of this paper is to present an overview of the NGS platforms used to produce the current compendium of published draft plant genomes, namely Roche/454, ABI/SOLiD and Solexa/Illumina, and to determine the most frequently used platform for whole-genome sequencing of plants in light of the genotyping of the immortelle plant. Materials and methods: 45 papers (presenting 47 draft plant genome sequences) were selected, and the sequencing techniques and NGS platforms (Roche/454, ABI/SOLiD and Illumina/Solexa) used in each were determined. Subsequently, the frequency of usage of each platform or combination of platforms was calculated. Results: Illumina/Solexa platforms were used either as the sole sequencing tool (40.42% of published genomes) or in combination with other platforms (an additional 48.94%), followed by Roche/454 platforms, used in combination with the traditional Sanger sequencing method (10.64%) and never as a sole tool. ABI/SOLiD was used only in combination with Illumina/Solexa and Roche/454, in 4.25% of publications. Conclusions: Illumina/Solexa platforms are by far the most preferred by researchers, most probably owing to the most affordable sequencing costs. Taking into consideration the current economic situation in the Balkans region, Illumina/Solexa is the best (if not the only) platform choice if sequencing of the immortelle plant (Helichrysum arenarium) is to be performed by researchers in this region. PMID:28974852
F-16 Instructional Sequencing Plan Report.
1981-03-01
information). 2. Interference (learning of some tasks interferes with the learning of other tasks when they possess similar but confusing differences) ... profound effect on the total training expense. This increases the desirability of systematic, precise methods of syllabus generation. Inherent in a given ... Least cost: the syllabus must make maximum use of the expensive-to-acquire resource; select sequences which provide a least-total-cost method of ...
Dechartres, Agnes; Trinquart, Ludovic; Atal, Ignacio; Moher, David; Dickersin, Kay; Boutron, Isabelle; Perrodeau, Elodie; Altman, Douglas G; Ravaud, Philippe
2017-06-08
Objective To examine how poor reporting and inadequate methods for key methodological features in randomised controlled trials (RCTs) have changed over the past three decades. Design Mapping of trials included in Cochrane reviews. Data sources Data from RCTs included in all Cochrane reviews published between March 2011 and September 2014 reporting an evaluation of the Cochrane risk of bias items: sequence generation, allocation concealment, blinding, and incomplete outcome data. Data extraction For each RCT, we extracted the consensus risk-of-bias ratings made by the review authors and identified the primary reference to extract publication year and journal. We matched journal names with Journal Citation Reports to obtain 2014 impact factors. Main outcome measures We considered the proportions of trials rated by review authors at unclear and high risk of bias as surrogates for poor reporting and inadequate methods, respectively. Results We analysed 20 920 RCTs (from 2001 reviews) published in 3136 journals. The proportion of trials with unclear risk of bias was 48.7% for sequence generation and 57.5% for allocation concealment; the proportion at high risk of bias was 4.0% and 7.2%, respectively. For blinding and incomplete outcome data, 30.6% and 24.7% of trials were at unclear risk and 33.1% and 17.1% were at high risk, respectively. A higher journal impact factor was associated with a lower proportion of trials at unclear or high risk of bias. The proportion of trials at unclear risk of bias decreased over time, especially for sequence generation, which fell from 69.1% in 1986-1990 to 31.2% in 2011-14, and for allocation concealment (70.1% to 44.6%). After excluding trials at unclear risk of bias, use of inadequate methods also decreased over time: from 14.8% to 4.6% for sequence generation and from 32.7% to 11.6% for allocation concealment. 
Conclusions Poor reporting and inadequate methods have decreased over time, especially for sequence generation and allocation concealment. But more could be done, especially in lower impact factor journals. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Fujimori, Shigeo; Hirai, Naoya; Ohashi, Hiroyuki; Masuoka, Kazuyo; Nishikimi, Akihiko; Fukui, Yoshinori; Washio, Takanori; Oshikubo, Tomohiro; Yamashita, Tatsuhiro; Miyamoto-Sato, Etsuko
2012-01-01
Next-generation sequencing (NGS) has been applied to various kinds of omics studies, resulting in many biological and medical discoveries. However, high-throughput protein-protein interactome datasets derived from detection by sequencing are scarce, because protein-protein interaction analysis requires many cell manipulations to examine the interactions. The low reliability of the high-throughput data is also a problem. Here, we describe a cell-free display technology combined with NGS that can improve both the coverage and reliability of interactome datasets. The completely cell-free method gives a high-throughput and a large detection space, testing the interactions without using clones. The quantitative information provided by NGS reduces the number of false positives. The method is suitable for the in vitro detection of proteins that interact not only with the bait protein, but also with DNA, RNA and chemical compounds. Thus, it could become a universal approach for exploring the large space of protein sequences and interactome networks. PMID:23056904
DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.
Kelly, Steven; Maini, Philip K
2013-01-01
The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from others by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data, we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly used distance-based methods, though not as accurate as maximum-likelihood methods applied to good-quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to a previously published analysis of the same dataset using conventional methods. Taken together, these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.
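A score-to-distance conversion in the spirit of such BLAST-based tree building can be sketched as follows. The self-score normalization and the `-log` transform are illustrative assumptions, not DendroBLAST's published formula, and both function names are hypothetical:

```python
import math

def blast_distance(s_ab, s_aa, s_bb):
    """Convert a pairwise similarity score into a distance by
    normalizing against the two self-scores: identical sequences get
    distance 0, weaker hits progressively larger distances."""
    norm = s_ab / math.sqrt(s_aa * s_bb)
    return -math.log(max(min(norm, 1.0), 1e-9))

def distance_matrix(names, scores):
    """scores maps sorted pairs (a, b), including self pairs (a, a),
    to similarity scores; returns a dense distance dictionary usable
    by standard tree-building methods (e.g. neighbor joining)."""
    return {(a, b): blast_distance(scores[tuple(sorted((a, b)))],
                                   scores[(a, a)], scores[(b, b)])
            for a in names for b in names}

scores = {("x", "x"): 100.0, ("y", "y"): 100.0, ("x", "y"): 50.0}
d = distance_matrix(["x", "y"], scores)
```

Feeding such a matrix to a distance-based tree builder is what makes the whole procedure alignment-free: no multiple sequence alignment is ever constructed.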
Biological intuition in alignment-free methods: response to Posada.
Ragan, Mark A; Chan, Cheong Xin
2013-08-01
A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
Next-Generation Sequencing of Antibody Display Repertoires
Rouet, Romain; Jackson, Katherine J. L.; Langley, David B.; Christ, Daniel
2018-01-01
In vitro selection technology has transformed the development of therapeutic monoclonal antibodies. Using methods such as phage, ribosome, and yeast display, high affinity binders can be selected from diverse repertoires. Here, we review strategies for the next-generation sequencing (NGS) of phage- and other antibody-display libraries, as well as NGS platforms and analysis tools. Moreover, we discuss recent examples relating to the use of NGS to assess library diversity, clonal enrichment, and affinity maturation. PMID:29472918
Kane, Nolan; Sveinsson, Saemundur; Dempewolf, Hannes; Yang, Ji Yong; Zhang, Dapeng; Engels, Johannes M M; Cronk, Quentin
2012-02-01
To reliably identify lineages below the species level such as subspecies or varieties, we propose an extension to DNA-barcoding using next-generation sequencing to produce whole organellar genomes and substantial nuclear ribosomal sequence. Because this method uses much longer versions of the traditional DNA-barcoding loci in the plastid and ribosomal DNA, we call our approach ultra-barcoding (UBC). We used high-throughput next-generation sequencing to scan the genome and generate reliable sequence of high copy number regions. Using this method, we examined whole plastid genomes as well as nearly 6000 bases of nuclear ribosomal DNA sequences for nine genotypes of Theobroma cacao and an individual of the related species T. grandiflorum, as well as an additional publicly available whole plastid genome of T. cacao. All individuals of T. cacao examined were uniquely distinguished, and evidence of reticulation and gene flow was observed. Sequence variation was observed in some of the canonical barcoding regions between species, but other regions of the chloroplast were more variable both within species and between species, as were ribosomal spacers. Furthermore, no single region provides the level of data available using the complete plastid genome and rDNA. Our data demonstrate that UBC is a viable, increasingly cost-effective approach for reliably distinguishing varieties and even individual genotypes of T. cacao. This approach shows great promise for applications where very closely related or interbreeding taxa must be distinguished.
Ståhlberg, Anders; Krzyzanowski, Paul M; Jackson, Jennifer B; Egyud, Matthew; Stein, Lincoln; Godfrey, Tony E
2016-06-20
Detection of cell-free DNA in liquid biopsies offers great potential for use in non-invasive prenatal testing and as a cancer biomarker. Fetal and tumor DNA fractions, however, can be extremely low in these samples, and ultra-sensitive methods are required for their detection. Here, we report an extremely simple and fast method for introducing barcodes into DNA libraries made from 5 ng of DNA. Barcoded adapter primers are designed with an oligonucleotide hairpin structure to protect the molecular barcodes during the first rounds of polymerase chain reaction (PCR) and prevent them from participating in mis-priming events. Our approach enables high-level multiplexing and next-generation sequencing library construction with flexible library content. We show that uniform libraries of 1-, 5-, 13- and 31-plex can be generated. Utilizing the barcodes to generate consensus reads for each original DNA molecule reduces background sequencing noise and allows detection of variant alleles below 0.1% frequency in clonal cell line DNA and in cell-free plasma DNA. Thus, our approach bridges the gap between the highly sensitive but specific capabilities of digital PCR, which only allows a limited number of variants to be analyzed, and the broad target capability of next-generation sequencing, which traditionally lacks the sensitivity to detect rare variants. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
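The consensus step that the molecular barcodes enable can be sketched as a per-position majority vote within each barcode family. This is a deliberately simplified illustration; real pipelines also apply family-size thresholds and handle quality scores and indels:

```python
from collections import Counter, defaultdict

def consensus_reads(tagged_reads):
    """Collapse reads sharing a molecular barcode into one consensus
    sequence by per-position majority vote; an error present in only
    one read of a family is voted out."""
    families = defaultdict(list)
    for barcode, seq in tagged_reads:
        families[barcode].append(seq)
    return {barcode: "".join(Counter(col).most_common(1)[0][0]
                             for col in zip(*seqs))
            for barcode, seqs in families.items()}

reads = [("AATG", "ACGTACGT"),
         ("AATG", "ACGTACGT"),
         ("AATG", "ACCTACGT"),   # one read carries a sequencing error
         ("CCGA", "TTGGCCAA")]
cons = consensus_reads(reads)
```

Because random sequencing errors rarely recur at the same position within a family, the consensus suppresses background noise and lets true sub-0.1% variants stand out.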
Galbadrakh, Bulgan; Lee, Kyung-Eun; Park, Hyun-Seok
2012-12-01
Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we developed JSequitur for automatically generating the grammatical structure of biological sequences within an inference framework of string compression algorithms. Our original motivation was to find any grammatical traits of several cancer genes that can be detected by string compression algorithms. Through this research we have not yet found any meaningful unique traits of the cancer genes, but we have observed interesting relationships among gene length, sequence similarity, the patterns of the generated grammar, and compression rate.
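The flavor of grammar-based compression being exploited here can be illustrated with a greedy digram-substitution sketch. This is closer to Re-Pair than to Sequitur's incremental algorithm, and the rule names `R0`, `R1`, ... are arbitrary:

```python
from collections import Counter

def infer_grammar(seq):
    """Repeatedly replace the most frequent pair of adjacent symbols
    with a fresh nonterminal until no pair occurs twice; returns the
    compressed symbol string and the grammar rules."""
    symbols = list(seq)
    rules, next_rule = {}, 0
    while True:
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{next_rule}"
        next_rule += 1
        rules[name] = pair
        out, i = [], 0
        while i < len(symbols):                 # non-overlapping rewrite
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules

symbols, rules = infer_grammar("ABABAB")
```

The ratio of the grammar's size to the original sequence length is one way to define the compression rate that the abstract relates to gene length and sequence similarity.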
Ludgate, Jackie L; Wright, James; Stockwell, Peter A; Morison, Ian M; Eccles, Michael R; Chatterjee, Aniruddha
2017-08-31
Formalin fixed paraffin embedded (FFPE) tumor samples are a major source of DNA from patients in cancer research. However, FFPE is a challenging material to work with due to macromolecular fragmentation and nucleic acid crosslinking. FFPE tissue poses particular challenges for methylation analysis and for preparing sequencing-based libraries that rely on bisulfite conversion. Successful bisulfite conversion is a key requirement for sequencing-based methylation analysis. Here we describe a complete and streamlined workflow for preparing next-generation sequencing libraries for methylation analysis from FFPE tissues. This includes counting cells from FFPE blocks, extracting DNA from FFPE slides, testing bisulfite conversion efficiency with a polymerase chain reaction (PCR) based test, preparing reduced representation bisulfite sequencing libraries and performing massively parallel sequencing. The main features and advantages of this protocol are: an optimized method for extracting good-quality DNA from FFPE tissues; an efficient bisulfite conversion and next-generation sequencing library preparation protocol that uses 50 ng of DNA from FFPE tissue; and incorporation of a PCR-based test to assess bisulfite conversion efficiency prior to sequencing. We provide a complete workflow and an integrated protocol for performing DNA methylation analysis at the genome scale, and we believe this will facilitate clinical epigenetic research involving FFPE tissue.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Larsen, P. E.; Trivedi, G.; Sreedasyam, A.
2010-07-06
Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. 
The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.
Identification of genomic indels and structural variations using split reads
2011-01-01
Background Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. Results We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring events gapped in the center of a read more highly). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. 
Conclusions Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful. PMID:21787423
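The split-read idea of pinpointing exact breakpoints can be shown with a toy exact-matching sketch. Real callers such as SRiC use gapped alignment with error-aware scoring and calibration, which this deliberately omits; the function name is hypothetical:

```python
def split_read_deletion(read, ref):
    """Anchor the longest exact prefix of the read on the reference,
    then map the remaining suffix downstream; reference bases skipped
    between the two anchors are called as a deletion.
    Returns (breakpoint_in_ref, deletion_length) or None."""
    p = len(read)
    while p > 0 and ref.find(read[:p]) == -1:
        p -= 1                          # shrink prefix until it maps
    if p == 0:
        return None
    pre_start = ref.find(read[:p])
    suf_start = ref.find(read[p:], pre_start + p)
    if suf_start == -1:
        return None
    del_len = suf_start - (pre_start + p)
    return (pre_start + p, del_len) if del_len > 0 else None

ref = "AAAACGTTTTGGGGCCCC"
read = "AAAACG" + "GGGGCCCC"        # a read spanning a 4 bp deletion
call = split_read_deletion(read, ref)
```

Because the two halves of the read anchor on either side of the event, the breakpoint coordinate falls out at base-pair resolution, which is the advantage the conclusions claim over read-depth and read-pair approaches.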
Recombination of polynucleotide sequences using random or defined primers
Arnold, Frances H.; Shao, Zhixin; Affholter, Joseph A.; Zhao, Huimin H; Giver, Lorraine J.
2000-01-01
A method for in vitro mutagenesis and recombination of polynucleotide sequences based on polymerase-catalyzed extension of primer oligonucleotides is disclosed. The method involves priming template polynucleotide(s) with random-sequence or defined-sequence primers to generate a pool of short DNA fragments with a low level of point mutations. The DNA fragments are subjected to denaturation followed by annealing and further enzyme-catalyzed DNA polymerization. This procedure is repeated a sufficient number of times to produce full-length genes which comprise mutants of the original template polynucleotides. These genes can be further amplified by the polymerase chain reaction and cloned into a vector for expression of the encoded proteins.
Recombination of polynucleotide sequences using random or defined primers
Arnold, Frances H.; Shao, Zhixin; Affholter, Joseph A.; Zhao, Huimin; Giver, Lorraine J.
2001-01-01
A method for in vitro mutagenesis and recombination of polynucleotide sequences based on polymerase-catalyzed extension of primer oligonucleotides is disclosed. The method involves priming template polynucleotide(s) with random-sequence or defined-sequence primers to generate a pool of short DNA fragments with a low level of point mutations. The DNA fragments are subjected to denaturation followed by annealing and further enzyme-catalyzed DNA polymerization. This procedure is repeated a sufficient number of times to produce full-length genes which comprise mutants of the original template polynucleotides. These genes can be further amplified by the polymerase chain reaction and cloned into a vector for expression of the encoded proteins.
Amicarelli, Giulia; Adlerstein, Daniel; Shehi, Erlet; Wang, Fengfei; Makrigiorgos, G Mike
2006-10-01
Genotyping methods that reveal single-nucleotide differences are useful for a wide range of applications. We used digestion of 3-way DNA junctions in a novel technology, OneCutEventAmplificatioN (OCEAN), which allows sequence-specific signal generation and amplification. We combined OCEAN with peptide-nucleic-acid (PNA)-based variant enrichment to detect and simultaneously genotype v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS) codon 12 sequence variants in human tissue specimens. We analyzed KRAS codon 12 sequence variants in 106 lung cancer surgical specimens. We conducted a PNA-PCR reaction that suppresses wild-type KRAS amplification and genotyped the product with a set of OCEAN reactions carried out in fluorescence microplate format. The isothermal OCEAN assay enables a 3-way DNA junction to form between the specific target nucleic acid, a fluorescently labeled "amplifier", and an "anchor". The amplifier-anchor contact contains the recognition site for a restriction enzyme. Digestion produces a cleaved amplifier and generation of a fluorescent signal. The cleaved amplifier dissociates from the 3-way DNA junction, allowing a new amplifier to bind and propagate the reaction. The system detected and genotyped KRAS sequence variants down to approximately 0.3% variant-to-wild-type alleles. PNA-PCR/OCEAN had a concordance rate with PNA-PCR/sequencing of 93% to 98%, depending on the exact implementation. The concordance rate with restriction endonuclease-mediated selective-PCR/sequencing was 89%. OCEAN is a practical and low-cost novel technology for sequence-specific signal generation. Reliable analysis of KRAS sequence alterations in human specimens circumvents the requirement for sequencing. Applications are expected in genotyping KRAS codon 12 sequence variants in surgical specimens or in bodily fluids, as well as single-base variations and sequence alterations in other genes.
ParticleCall: A particle filter for base calling in next-generation sequencing systems
2012-01-01
Background Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. Accuracy and lengths of their reads, however, are yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data. Results In this paper, we consider Illumina’s sequencing-by-synthesis platform which relies on reversible terminator chemistry and describe the acquired signal by reformulating its mathematical model as a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on a data set obtained by sequencing phiX174 bacteriophage using Illumina’s Genome Analyzer II. The results show that the developed base calling scheme is significantly more computationally efficient than the best performing unsupervised method currently available, while achieving the same accuracy. Conclusions The proposed ParticleCall provides more accurate calls than the Illumina’s base calling algorithm, Bustard. At the same time, ParticleCall is significantly more computationally efficient than other recent schemes with similar performance, rendering it more feasible for high-throughput sequencing data analysis. Improvement of base calling accuracy will have immediate beneficial effects on the performance of downstream applications such as SNP and genotype calling. ParticleCall is freely available at https://sourceforge.net/projects/particlecall. PMID:22776067
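A bootstrap particle filter of the kind ParticleCall builds on can be sketched for a toy two-state HMM with Gaussian emissions. The transition matrix, emission means and all names here are illustrative and far simpler than real Illumina signal models:

```python
import math
import random

def particle_filter(obs, trans, means, sigma=0.5, n=500, seed=1):
    """Sequential Monte Carlo state estimation: propagate particles
    through the transition model, weight them by the Gaussian emission
    likelihood of each observation, resample, and report the majority
    state at each step."""
    rng = random.Random(seed)
    particles = [rng.randrange(2) for _ in range(n)]
    path = []
    for y in obs:
        # propagate each particle through the 2-state transition model
        particles = [s if rng.random() < trans[s][s] else 1 - s
                     for s in particles]
        # weight by emission likelihood, then resample proportionally
        weights = [math.exp(-((y - means[s]) ** 2) / (2 * sigma ** 2))
                   for s in particles]
        particles = rng.choices(particles, weights=weights, k=n)
        path.append(max((0, 1), key=particles.count))
    return path

trans = [[0.9, 0.1], [0.1, 0.9]]     # sticky two-state chain
means = [0.0, 5.0]                   # well-separated emission means
path = particle_filter([0.1, -0.2, 5.1, 4.9, 0.0], trans, means)
```

In a base-calling setting the hidden state would range over nucleotide hypotheses and the observations would be fluorescence intensities, but the propagate-weight-resample loop is the same.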
Identifying and reducing error in cluster-expansion approximations of protein energies.
Hahn, Seungsoo; Ashenberg, Orr; Grigoryan, Gevorg; Keating, Amy E
2010-12-01
Protein design involves searching a vast space for sequences that are compatible with a defined structure. This can pose significant computational challenges. Cluster expansion is a technique that can accelerate the evaluation of protein energies by generating a simple functional relationship between sequence and energy. The method consists of several steps. First, for a given protein structure, a training set of sequences with known energies is generated. Next, this training set is used to expand energy as a function of clusters consisting of single residues, residue pairs, and higher order terms, if required. The accuracy of the sequence-based expansion is monitored and improved using cross-validation testing and iterative inclusion of additional clusters. As a trade-off for evaluation speed, the cluster-expansion approximation causes prediction errors, which can be reduced by including more training sequences, including higher order terms in the expansion, and/or reducing the sequence space described by the cluster expansion. This article analyzes the sources of error and introduces a method whereby accuracy can be improved by judiciously reducing the described sequence space. The method is applied to describe the sequence-stability relationship for several protein structures: coiled-coil dimers and trimers, a PDZ domain, and T4 lysozyme as examples with computationally derived energies, and SH3 domains in amphiphysin-1 and endophilin-1 as examples where the expanded pseudo-energies are obtained from experiments. Our open-source software package Cluster Expansion Version 1.0 allows users to expand their own energy function of interest and thereby apply cluster expansion to custom problems in protein design. © 2010 Wiley Periodicals, Inc.
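The sequence-to-energy expansion described above can be sketched as a linear least-squares fit over one-hot cluster features. The two-letter alphabet and additive training energies below are toy assumptions; real applications use all 20 amino acids and physics- or experiment-derived energies:

```python
import itertools
import numpy as np

AA = "AV"  # toy alphabet; real cluster expansions use the 20 amino acids

def cluster_features(seq):
    """One-hot features for point (single-residue) and pair clusters."""
    f = [1.0 if a == b else 0.0 for a in seq for b in AA]      # point clusters
    for a, b in itertools.combinations(seq, 2):                # pair clusters
        f += [1.0 if (a, b) == (c, d) else 0.0 for c in AA for d in AA]
    return f

def fit_cluster_expansion(train_seqs, energies):
    """Least-squares fit of cluster coefficients to training energies."""
    X = np.array([cluster_features(s) for s in train_seqs])
    coeffs, *_ = np.linalg.lstsq(X, np.array(energies), rcond=None)
    return coeffs

def predict_energy(coeffs, seq):
    """Fast sequence-based energy evaluation from the fitted expansion."""
    return float(np.array(cluster_features(seq)) @ coeffs)
```

Cross-validation error on held-out sequences would then guide the iterative inclusion of higher-order clusters, as the abstract describes.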
A filtering method to generate high quality short reads using illumina paired-end technology.
Eren, A Murat; Vineis, Joseph H; Morrison, Hilary G; Sogin, Mitchell L
2013-01-01
Consensus between independent reads improves the accuracy of genome and transcriptome analyses; however, lack of consensus between very similar sequences in metagenomic studies can, and often does, represent natural variation of biological significance. Machine-assigned quality scores on next-generation platforms do not necessarily correlate with accuracy. Here, we describe using the overlap of paired-end, short sequence reads to identify error-prone reads in marker gene analyses and their contribution to spurious OTUs following clustering analysis using QIIME. Our approach can also reduce error in shotgun sequencing data generated from libraries with small, tightly constrained insert sizes. The open-source implementation of this algorithm in the Python programming language, with user instructions, can be obtained from https://github.com/meren/illumina-utils.
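The overlap criterion can be sketched as follows, assuming the reverse read has already been reverse-complemented and the pair overlaps end-to-end (as in complete-overlap libraries); the mismatch threshold is illustrative, not the published default:

```python
def overlap_mismatches(fwd, rev_comp):
    """Count disagreements between a forward read and its mate
    (assumed already reverse-complemented) over the overlap region."""
    n = min(len(fwd), len(rev_comp))
    return sum(a != b for a, b in zip(fwd[:n], rev_comp[:n]))

def passes_filter(fwd, rev_comp, max_mismatches=3):
    """Keep a pair only if the overlap disagrees at most max_mismatches times."""
    return overlap_mismatches(fwd, rev_comp) <= max_mismatches
```

Pairs failing the filter are discarded before clustering, so that sequencing error is not mistaken for biological variation.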
Acceleration of convergence of vector sequences
NASA Technical Reports Server (NTRS)
Sidi, A.; Ford, W. F.; Smith, D. A.
1983-01-01
A general approach to the construction of convergence acceleration methods for vector sequences is proposed. Using this approach, one can generate some known methods, such as the minimal polynomial extrapolation, the reduced rank extrapolation, and the topological epsilon algorithm, and also some new ones. Some of the new methods are easier to implement than the known methods and are observed to have similar numerical properties. The convergence analysis of these new methods is carried out, and it is shown that they are especially suitable for accelerating the convergence of vector sequences that are obtained when one solves linear systems of equations iteratively. A stability analysis is also given, and numerical examples are provided. The convergence and stability properties of the topological epsilon algorithm are likewise given.
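Of the methods mentioned, minimal polynomial extrapolation (MPE) admits a compact sketch: solve a least-squares problem on the first differences of the iterates, then form a weighted average. This is the generic textbook formulation, not the paper's unified derivation:

```python
import numpy as np

def mpe(iterates):
    """Minimal polynomial extrapolation from k+2 vector iterates.
    Solves U c ~= -u_k in least squares over the first differences
    u_j = x_{j+1} - x_j, then returns the gamma-weighted average of
    x_0..x_k (assumes sum(gamma) != 0)."""
    X = np.array(iterates, dtype=float)   # rows x_0 .. x_{k+1}
    U = np.diff(X, axis=0)                # rows u_0 .. u_k
    c, *_ = np.linalg.lstsq(U[:-1].T, -U[-1], rcond=None)
    gamma = np.append(c, 1.0)
    return gamma @ X[:-1] / gamma.sum()
```

For a linear iteration x_{j+1} = M x_j + b whose minimal polynomial has low degree, MPE recovers the fixed point exactly from a handful of iterates, which is why these methods suit iteratively solved linear systems.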
NASA Astrophysics Data System (ADS)
El-Assaad, Atlal; Dawy, Zaher; Nemer, Georges; Kobeissy, Firas
2017-01-01
The crucial biological role of proteases has become increasingly visible with the development of degradomics, a discipline concerned with identifying protease/substrate pairs and the resulting breakdown products (BDPs), which can be utilized as putative biomarkers of differing biological and clinical significance. In the field of cancer biology, matrix metalloproteinases (MMPs) have been shown to generate protein BDPs that are indicative of malignant growth, while in the field of neural injury, the calpain-2 and caspase-3 proteases generate BDP fragments that are indicative of different neural cell death mechanisms in different injury scenarios. Advanced proteomic techniques have shown remarkable progress in identifying these BDPs experimentally. In this work, we present a bioinformatics-based prediction method that identifies protease-associated BDPs with high precision and efficiency. The method utilizes state-of-the-art sequence matching and alignment algorithms. It starts by locating consensus sequence occurrences and their variants in any set of protein substrates, generating all fragments resulting from cleavage. The algorithm requires O(mn) space and O(Nmn) time, where N, m, and n are the number of protein sequences, the length of the consensus sequence, and the length per protein sequence, respectively. Finally, the proposed methodology is validated against the βII-spectrin protein, a validated brain injury biomarker.
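The fragment-generation step can be sketched with a regex over a consensus cleavage motif. The motif below ("GS" or "GT", cutting after the match) is purely hypothetical, not a real calpain or caspase consensus:

```python
import re

def cleavage_fragments(protein, consensus):
    """Cut `protein` after every occurrence of the consensus regex and
    return all resulting breakdown products (BDPs). Scanning each of N
    substrates against an m-character consensus over length-n sequences
    gives the O(Nmn)-style cost noted in the text."""
    cut_sites = [m.end() for m in re.finditer(consensus, protein)]
    bounds = [0] + cut_sites + [len(protein)]
    return [protein[i:j] for i, j in zip(bounds, bounds[1:]) if i < j]
```

Predicted fragments would then be matched against proteomically observed BDP masses or sequences.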
Regional stochastic generation of streamflows using an ARIMA (1,0,1) process and disaggregation
Armbruster, Jeffrey T.
1979-01-01
An ARIMA (1,0,1) model was calibrated and used to generate long annual flow sequences at three sites in the Juniata River basin, Pennsylvania. The model preserves the mean, variance, and cross correlations of the observed station data. In addition, it has a desirable blend of both high and low frequency characteristics and therefore is capable of preserving the Hurst coefficient, h. The generated annual flows are disaggregated into monthly sequences using a modification of the Valencia-Schaake model. The low-flow frequency and flow duration characteristics of the generated monthly flows, with length equal to the historical data, compare favorably with the historical data. Once the models were verified, 100-year sequences were generated and analyzed for their low flow characteristics. One-, three-, and six-month low-flow frequencies at recurrence intervals greater than 10 years are generally found to be lower than those computed from the historical flows. A method is proposed for synthesizing flows at ungaged sites. (Kosco-USGS)
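An ARIMA(1,0,1), i.e. ARMA(1,1), annual-flow generator can be sketched directly from its defining recursion; the parameter values in the test are illustrative, not the calibrated Juniata basin values:

```python
import random

def arma11_flows(n, mu, phi, theta, sigma, seed=42):
    """Generate n synthetic annual flows from the ARMA(1,1) recursion
    (x_t - mu) = phi*(x_{t-1} - mu) + e_t - theta*e_{t-1},
    with innovations e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    flows, dev, e_prev = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, sigma)
        dev = phi * dev + e - theta * e_prev   # deviation from the mean
        flows.append(mu + dev)
        e_prev = e
    return flows
```

A disaggregation step (e.g., of the Valencia-Schaake type) would then split each annual value into monthly flows.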
Gao, Xinxin; Yo, Peggy; Keith, Andrew; Ragan, Timothy J.; Harris, Thomas K.
2003-01-01
A novel thermodynamically-balanced inside-out (TBIO) method of primer design was developed and compared with a thermodynamically-balanced conventional (TBC) method of primer design for PCR-based gene synthesis of codon-optimized gene sequences for the human protein kinase B-2 (PKB2; 1494 bp), p70 ribosomal S6 subunit protein kinase-1 (S6K1; 1622 bp) and phosphoinositide-dependent protein kinase-1 (PDK1; 1712 bp). Each of the 60mer TBIO primers coded for identical nucleotide regions that the 60mer TBC primers covered, except that half of the TBIO primers were reverse complement sequences. In addition, the TBIO and TBC primers contained identical regions of temperature-optimized primer overlaps. The TBC method was optimized to generate sequential overlapping fragments (∼0.4–0.5 kb) for each of the gene sequences, and simultaneous and sequential combinations of overlapping fragments were tested for their ability to be assembled under an array of PCR conditions. However, no fully synthesized gene sequences could be obtained by this approach. In contrast, the TBIO method generated an initial central fragment (∼0.4–0.5 kb), which could be gel purified and used for further inside-out bidirectional elongation by additional increments of 0.4–0.5 kb. By using the newly developed TBIO method of PCR-based gene synthesis, error-free synthetic genes for the human protein kinases PKB2, S6K1 and PDK1 were obtained with little or no corrective mutagenesis. PMID:14602936
Single molecule targeted sequencing for cancer gene mutation detection.
Gao, Yan; Deng, Liwei; Yan, Qin; Gao, Yongqian; Wu, Zengding; Cai, Jinsen; Ji, Daorui; Li, Gailing; Wu, Ping; Jin, Huan; Zhao, Luyang; Liu, Song; Ge, Liangjin; Deem, Michael W; He, Jiankui
2016-05-19
With the rapid decline in the cost of sequencing, it is now affordable to examine multiple genes in a single disease-targeted clinical test using next generation sequencing. Current targeted sequencing methods require a separate step of targeted capture enrichment during sample preparation before sequencing. Although there are fast sample preparation methods available on the market, the library preparation process is still relatively complicated for physicians to use routinely. Here, we introduce an amplification-free Single Molecule Targeted Sequencing (SMTS) technology, which combines targeted capture and sequencing in one step. We demonstrate that this technology can detect low-frequency mutations using an artificially synthesized DNA sample. SMTS has several potential advantages, including simple sample preparation in which no biases or errors are introduced by a PCR reaction. SMTS has the potential to be an easy and quick sequencing technology for clinical diagnosis such as cancer gene mutation detection, infectious disease detection, inherited condition screening and noninvasive prenatal diagnosis.
Tree crops: Advances in insects and disease management
USDA-ARS?s Scientific Manuscript database
Advances in next-generation sequencing have enabled genome sequencing to be fast and affordable. Thus today researchers and industries can address new methods in pest and pathogen management. Biological control of insect pests that occur in large areas, such as forests and farming systems of fruit t...
Analysis of genetic diversity using SNP markers in oat
USDA-ARS?s Scientific Manuscript database
A large-scale single nucleotide polymorphism (SNP) discovery was carried out in cultivated oat using Roche 454 sequencing methods. DNA sequences were generated from cDNAs originating from a panel of 20 diverse oat cultivars, and from Diversity Array Technology (DArT) genomic complexity reductions fr...
Comparison of Methods of Detection of Exceptional Sequences in Prokaryotic Genomes.
Rusinov, I S; Ershova, A S; Karyagina, A S; Spirin, S A; Alexeevski, A V
2018-02-01
Many proteins need recognition of specific DNA sequences for functioning. The number of recognition sites and their distribution along the DNA might be of biological importance. For example, the number of restriction sites is often reduced in prokaryotic and phage genomes to decrease the probability of DNA cleavage by restriction endonucleases. We call a sequence an exceptional one if its frequency in a genome significantly differs from that predicted by some mathematical model. An exceptional sequence could be either under- or over-represented, depending on its frequency in comparison with the predicted one. Exceptional sequences could be considered biologically meaningful, for example, as targets of DNA-binding proteins or as parts of abundant repetitive elements. Several methods are used to predict the frequency of a short sequence in a genome based on the actual frequencies of certain of its subsequences. The most popular are methods based on Markov chain models. However, no rigorous comparison of these methods had previously been performed. We compared three methods for the prediction of short sequence frequencies: the maximum-order Markov chain model-based method, the method that uses the geometric mean of extended Markovian estimates, and the method that utilizes frequencies of all subsequences, including discontiguous ones. We applied them to restriction sites in complete genomes of 2500 prokaryotic species and demonstrated that the results depend greatly on the method used: lists of the 5% most under-represented sites differed by up to 50%. The method designed by Burge and coauthors in 1992, which utilizes all subsequences of the sequence, showed a higher precision than the other two methods both on prokaryotic genomes and on randomly generated sequences after computational imitation of selective pressure. We propose this method as the first choice for detection of exceptional sequences in prokaryotic genomes.
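The maximum-order Markov chain estimate mentioned above has a simple closed form: the expected count of a word w is the product of the counts of w without its last base and w without its first base, divided by the count of w without both. A minimal sketch (words of length at least 3):

```python
def count_overlapping(genome, word):
    """Number of (possibly overlapping) occurrences of `word`."""
    return sum(genome[i:i + len(word)] == word
               for i in range(len(genome) - len(word) + 1))

def markov_expected_count(genome, word):
    """Maximum-order Markov estimate of the expected count of `word`:
    E[N(w)] = N(w[:-1]) * N(w[1:]) / N(w[1:-1])."""
    return (count_overlapping(genome, word[:-1])
            * count_overlapping(genome, word[1:])
            / count_overlapping(genome, word[1:-1]))
```

A site is then called under-represented when its observed count falls well below this estimate (observed/expected much less than 1).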
Unbalanced voltage control of virtual synchronous generator in isolated micro-grid
NASA Astrophysics Data System (ADS)
Cao, Y. Z.; Wang, H. N.; Chen, B.
2017-06-01
Virtual synchronous generator (VSG) control is recommended to stabilize the voltage and frequency in an isolated micro-grid. However, common VSG control is challenged by widely used unbalanced loads, and the resulting voltage unbalance worsens the power quality of the micro-grid. In this paper, the mathematical model of the VSG is presented. Based on the analysis of the positive- and negative-sequence equivalent circuits of the VSG, an approach is proposed to eliminate the negative-sequence voltage of the VSG under unbalanced loads. A delay-cancellation method and a PI controller are utilized to identify and suppress the negative-sequence voltages. Simulation results verify the feasibility of the proposed control strategy.
Yleaf: Software for Human Y-Chromosomal Haplogroup Inference from Next-Generation Sequencing Data.
Ralf, Arwin; Montiel González, Diego; Zhong, Kaiyin; Kayser, Manfred
2018-05-01
Next-generation sequencing (NGS) technologies offer immense possibilities given the large amounts of genomic data they simultaneously deliver. The human Y-chromosome serves as a good example of how NGS benefits various applications in evolution, anthropology, genealogy, and forensics. Prior to NGS, the Y-chromosome phylogenetic tree consisted of a few hundred branches; based on NGS data, it now contains many thousands. The complexity of both the Y tree and NGS data provides challenges for haplogroup assignment. For effective analysis and interpretation of Y-chromosome NGS data, we present Yleaf, a publicly available, automated, user-friendly software for high-resolution Y-chromosome haplogroup inference independent of library and sequencing methods.
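Haplogroup inference from NGS-derived marker states essentially amounts to descending the Y phylogeny along branches whose defining variants are observed in the derived state. A minimal sketch on a toy five-branch tree (the marker-to-branch assignments follow common convention, e.g. M207 for R and M343 for R1b, but the tree is drastically simplified and not Yleaf's actual logic):

```python
# branch -> (defining markers, child branches); drastically simplified tree
Y_TREE = {
    "Y":   (set(),       ["E", "R"]),
    "E":   ({"M96"},     []),
    "R":   ({"M207"},    ["R1a", "R1b"]),
    "R1a": ({"M420"},    []),
    "R1b": ({"M343"},    []),
}

def infer_haplogroup(derived_markers, node="Y"):
    """Return the deepest branch all of whose defining markers are
    observed in the derived state in the sample."""
    _, children = Y_TREE[node]
    for child in children:
        child_markers, _ = Y_TREE[child]
        if child_markers <= derived_markers:
            return infer_haplogroup(derived_markers, child)
    return node
```

Real tools must additionally handle missing coverage, back-mutations, and conflicting marker calls across the many thousands of branches.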
Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A
2016-07-01
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
Malware analysis using visualized image matrices.
Han, KyoungSoo; Kang, BooJoong; Im, Eul Gyu
2014-01-01
This paper proposes a novel malware visual analysis method that contains not only a visualization method to convert binary files into images, but also a similarity calculation method between these images. The proposed method generates RGB-colored pixels on image matrices using the opcode sequences extracted from malware samples and calculates the similarities for the image matrices. In particular, our proposed methods are applicable to packed malware samples by applying them to the execution traces extracted through dynamic analysis. When the images are generated, we can reduce the overheads by extracting the opcode sequences only from the blocks that include the instructions related to staple behaviors such as functions and application programming interface (API) calls. In addition, we propose a technique that generates a representative image for each malware family in order to reduce the number of comparisons for the classification of unknown samples; the colored pixel information in the image matrices is used to calculate the similarities between the images. Our experimental results show that the image matrices of malware can effectively be used to classify malware families both statically and dynamically, with accuracies of 0.9896 and 0.9732, respectively.
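The visualization step can be sketched by hashing consecutive opcode pairs to pixel coordinates and RGB values; the hashing scheme and image size here are illustrative stand-ins for the paper's actual mapping:

```python
import hashlib

def opcode_image(opcodes, width=8, height=8):
    """Render an opcode sequence as an RGB matrix: each consecutive
    opcode pair is hashed to a pixel coordinate and a colour."""
    img = [[(0, 0, 0)] * width for _ in range(height)]
    for a, b in zip(opcodes, opcodes[1:]):
        h = hashlib.md5(f"{a},{b}".encode()).digest()
        img[h[1] % height][h[0] % width] = (h[2], h[3], h[4])
    return img

def image_similarity(img1, img2):
    """Fraction of pixel positions on which the two matrices agree."""
    pairs = [(p, q) for r1, r2 in zip(img1, img2) for p, q in zip(r1, r2)]
    return sum(p == q for p, q in pairs) / len(pairs)
```

Classification then compares an unknown sample's image against one representative image per family rather than against every known sample.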
Seo, Joo-Hyun; Park, Jihyang; Kim, Eun-Mi; Kim, Juhan; Joo, Keehyoung; Lee, Jooyoung; Kim, Byung-Gee
2014-02-01
Sequence subgrouping for a given sequence set can enable various informative tasks such as the functional discrimination of sequence subsets and the functional inference of unknown sequences. Because an identity threshold for sequence subgrouping may vary according to the given sequence set, it is highly desirable to construct a robust subgrouping algorithm which automatically identifies an optimal identity threshold and generates subgroups for a given sequence set. To this end, an automatic sequence subgrouping method, named 'Subgrouping Automata', was constructed. First, a tree analysis module analyzes the structure of the tree and calculates all possible subgroups at each node. A sequence similarity analysis module calculates the average sequence similarity for all subgroups at each node. A representative sequence generation module finds a representative sequence for each subgroup using profile analysis and self-scoring. Average sequence similarities are calculated for all nodes, and 'Subgrouping Automata' searches for the node showing the statistically maximal increase in sequence similarity using Student's t-value. The node showing the maximum t-value, which gives the most significant difference in average sequence similarity between two adjacent nodes, is determined to be the optimal subgrouping node in the phylogenetic tree. Further analysis showed that the optimal subgrouping node from SA prevents under-subgrouping and over-subgrouping. Copyright © 2013. Published by Elsevier Ltd.
Large-Scale Concatenation cDNA Sequencing
Yu, Wei; Andersson, Björn; Worley, Kim C.; Muzny, Donna M.; Ding, Yan; Liu, Wen; Ricafrente, Jennifer Y.; Wentland, Meredith A.; Lennon, Greg; Gibbs, Richard A.
1997-01-01
A total of 100 kb of DNA derived from 69 individual human brain cDNA clones of 0.7–2.0 kb was sequenced by concatenated cDNA sequencing (CCS), whereby multiple individual DNA fragments are sequenced simultaneously in a single shotgun library. The method yielded accurate sequences and efficiency similar to that of other shotgun libraries constructed from single DNA fragments (>20 kb). Computer analyses were carried out on 65 cDNA clone sequences and their corresponding end sequences to examine both nucleic acid and amino acid sequence similarities in the databases. Thirty-seven clones revealed no DNA database matches, 12 clones generated exact matches (≥98% identity), and 16 clones generated nonexact matches (57%–97% identity) to either known human or other species genes. Of those 28 matched clones, 8 had corresponding end sequences that failed to identify similarities. In a protein similarity search, 27 clone sequences displayed significant matches, whereas only 20 of the end sequences had matches to known protein sequences. Our data indicate that full-length cDNA insert sequences provide significantly more nucleic acid and protein sequence similarity matches than expressed sequence tags (ESTs) for database searching. [All 65 cDNA clone sequences described in this paper have been submitted to the GenBank data library under accession nos. U79240–U79304.] PMID:9110174
ADEPT, a dynamic next generation sequencing data error-detection program with trimming
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feng, Shihai; Lo, Chien-Chi; Li, Po-E
Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method based on the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read, compared to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
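ADEPT's core comparison, each base's quality score against the position-specific score distribution of the run, can be sketched with a simple z-score rule (the real tool also uses neighboring-base scores and trims reads; the cutoff here is illustrative):

```python
from statistics import mean, pstdev

def flag_suspect_bases(quality_matrix, z_cut=-2.0):
    """quality_matrix: one list of per-base Phred scores per read, all
    reads the same length. Returns {(read_index, position), ...} for
    bases scoring more than |z_cut| SDs below that position's mean."""
    flagged = set()
    for j in range(len(quality_matrix[0])):
        column = [read[j] for read in quality_matrix]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no variation at this sequencing cycle
        for i, q in enumerate(column):
            if (q - mu) / sd < z_cut:
                flagged.add((i, j))
    return flagged
```

Because the reference distribution is per position, this avoids the position-dependent under-prediction that a single global quality threshold suffers from.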
Bakal, Tomas; Janata, Jiri; Sabova, Lenka; Grabic, Roman; Zlabek, Vladimir; Najmanova, Lucie
2018-06-16
A robust and widely applicable method for sampling of aquatic microbial biofilm and further sample processing is presented. The method is based on next-generation sequencing of the V4-V5 variable regions of the 16S rRNA gene and further statistical analysis of the sequencing data, which could be useful not only to investigate the taxonomic composition of biofilm bacterial consortia but also to assess aquatic ecosystem health. Five artificial materials commonly used for biofilm growth (glass, stainless steel, aluminum, polypropylene, polyethylene) were tested to determine the one giving the most robust and reproducible results. The effect of the sampler material on total microbial composition was not statistically significant; however, the non-plastic materials (glass, metal) gave more stable outputs without irregularities among sample parallels. The bias of the method is assessed with respect to the employment of a non-quantitative step (PCR amplification) to obtain quantitative results (relative abundance of identified taxa). This aspect is often overlooked in ecological and medical studies. We document that sequencing a mixture of three merged primary PCR reactions for each sample, and evaluating median values from three technical replicates per sample, overcomes this bias and gives robust, repeatable results that distinguish well among sampling localities and seasons.
In vivo generation of DNA sequence diversity for cellular barcoding
Peikon, Ian D.; Gizatullina, Diana I.; Zador, Anthony M.
2014-01-01
Heterogeneity is a ubiquitous feature of biological systems. A complete understanding of such systems requires a method for uniquely identifying and tracking individual components and their interactions with each other. We have developed a novel method of uniquely tagging individual cells in vivo with a genetic ‘barcode’ that can be recovered by DNA sequencing. Our method is a two-component system comprised of a genetic barcode cassette whose fragments are shuffled by Rci, a site-specific DNA invertase. The system is highly scalable, with the potential to generate theoretical diversities in the billions. We demonstrate the feasibility of this technique in Escherichia coli. Currently, this method could be employed to track the dynamics of populations of microbes through various bottlenecks. Advances in this method should prove useful in tracking interactions of cells within a network, and/or heterogeneity within complex biological samples. PMID:25013177
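The Rci-driven shuffling can be simulated by repeatedly inverting a random span of the cassette, reversing fragment order and flipping orientations. The cassette size and event count below are arbitrary illustration choices, not the published design:

```python
import random

def shuffle_cassette(cassette, n_events, rng):
    """Apply n_events random inversions to a tuple of
    (fragment_label, orientation) pairs; orientation is +1 or -1.
    Each event reverses a random contiguous span and flips the
    orientation of every fragment inside it."""
    frags = list(cassette)
    for _ in range(n_events):
        i, j = sorted(rng.sample(range(len(frags) + 1), 2))
        frags[i:j] = [(lab, -ori) for lab, ori in reversed(frags[i:j])]
    return tuple(frags)
```

For an n-fragment cassette the diversity is bounded by n! * 2^n distinct arrangements (384 for n = 4), which is how moderately sized cassettes can reach barcode spaces in the billions.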
Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows
Torri, Federica; Dinov, Ivo D.; Zamanyan, Alen; Hobel, Sam; Genco, Alex; Petrosyan, Petros; Clark, Andrew P.; Liu, Zhizhong; Eggert, Paul; Pierce, Jonathan; Knowles, James A.; Ames, Joseph; Kesselman, Carl; Toga, Arthur W.; Potkin, Steven G.; Vawter, Marquis P.; Macciardi, Fabio
2012-01-01
Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. These methods can be applied to complex disorders as well, and have been adopted as one of the current mainstream approaches in population genetics. These achievements have been made possible by next generation sequencing (NGS) technologies, which require substantial bioinformatics resources to analyze the dense and complex sequence data. The huge analytical burden of data from genome sequencing might be seen as a bottleneck slowing the publication of NGS papers at this time, especially in psychiatric genetics. We review the existing methods for processing NGS data, to place into context the rationale for the design of a computational resource. We describe our method, the Graphical Pipeline for Computational Genomics (GPCG), to perform the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. These workflows cover all the analytical steps required for NGS data, from processing the raw reads to variant calling and annotation. The current version of the pipeline is freely available at http://pipeline.loni.ucla.edu. These applications of NGS analysis may gain clinical utility in the near future (e.g., identifying miRNA signatures in diseases) when the bioinformatics approach is made feasible. Taken together, the annotation tools and strategies that have been developed to retrieve information and test hypotheses about the functional role of variants present in the human genome will help to pinpoint the genetic risk factors for psychiatric disorders. PMID:23139896
Lu, Chaoxia; Wu, Wei; Xiao, Jifang; Meng, Yan; Zhang, Shuyang; Zhang, Xue
2013-06-01
To detect pathogenic mutations in Marfan syndrome (MFS) using an Ion Torrent Personal Genome Machine (PGM) and to validate the results of targeted next-generation semiconductor sequencing for the diagnosis of genetic disorders. Peripheral blood samples were collected from three MFS patients and a normal control with informed consent. Genomic DNA was isolated by a standard method and then subjected to targeted sequencing using an Ion Ampliseq(TM) Inherited Disease Panel. Three multiplex PCR reactions were carried out to amplify the coding exons of 328 genes including FBN1, TGFBR1 and TGFBR2. DNA fragments from different samples were ligated with barcoded sequencing adaptors. Template preparation, emulsion PCR, and Ion Sphere Particle enrichment were carried out using an Ion OneTouch system. The Ion Sphere Particles were sequenced on a 318 chip using the PGM platform. Data from the PGM runs were processed using Ion Torrent Suite 3.2 software to generate sequence reads. After sequence alignment and extraction of SNPs and indels, all the variants were filtered against dbSNP137. DNA sequences were visualized with an Integrated Genomics Viewer. The most likely disease-causing variants were analyzed by Sanger sequencing. PGM sequencing yielded an output of 855.80 Mb, with a >100× median sequencing depth and >98% coverage of the targeted regions in all four samples. After data analysis and database filtering, one known missense mutation (p.E1811K) and two novel premature termination mutations (p.E2264X and p.L871FfsX23) in the FBN1 gene were identified in the three MFS patients. All mutations were verified by conventional Sanger sequencing. Pathogenic FBN1 mutations were identified in all patients with MFS, indicating that targeted next-generation sequencing on the PGM platform can be applied for accurate and high-throughput testing of genetic disorders.
Porter, Teresita M.; Golding, G. Brian
2012-01-01
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, also in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215
Olson, Nathan D.; Lund, Steven P.; Zook, Justin M.; Rojas-Cornejo, Fabiola; Beck, Brian; Foy, Carole; Huggett, Jim; Whale, Alexandra S.; Sui, Zhiwei; Baoutina, Anna; Dobeson, Michael; Partis, Lina; Morrow, Jayne B.
2015-01-01
This study presents the results from an interlaboratory sequencing study for which we developed a novel high-resolution method for comparing data from different sequencing platforms for a multi-copy, paralogous gene. The combination of PCR amplification and 16S ribosomal RNA gene (16S rRNA) sequencing has revolutionized bacteriology by enabling rapid identification, frequently without the need for culture. To assess variability between laboratories in sequencing 16S rRNA, six laboratories sequenced the gene encoding the 16S rRNA from Escherichia coli O157:H7 strain EDL933 and Listeria monocytogenes serovar 4b strain NCTC11994. Participants performed sequencing methods and protocols available in their laboratories: Sanger sequencing, Roche 454 pyrosequencing®, or Ion Torrent PGM®. The sequencing data were evaluated on three levels: (1) identity of biologically conserved positions, (2) ratio of 16S rRNA gene copies featuring identified variants, and (3) the collection of variant combinations in a set of 16S rRNA gene copies. The same set of biologically conserved positions was identified for each sequencing method. Analytical methods using Bayesian and maximum likelihood statistics were developed to estimate variant copy ratios, which describe the ratio of nucleotides at each identified biologically variable position, as well as the likely set of variant combinations present in 16S rRNA gene copies. Our results indicate that estimated variant copy ratios at biologically variable positions were only reproducible for high throughput sequencing methods. Furthermore, the likely variant combination set was only reproducible with increased sequencing depth and longer read lengths. We also demonstrate novel methods for evaluating variable positions when comparing multi-copy gene sequence data from multiple laboratories generated using multiple sequencing technologies. PMID:27077030
NASA Astrophysics Data System (ADS)
Nie, Xiaokai; Luo, Jingjing; Coca, Daniel; Birkin, Mark; Chen, Jing
2018-03-01
The paper introduces a method for reconstructing one-dimensional iterated maps that are driven by an external control input and subjected to an additive stochastic perturbation, from experimentally observed sequences of probability density functions generated by the stochastic dynamical system.
Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.
Staats, Martijn; Erkens, Roy H J; van de Vossenberg, Bart; Wieringa, Jan J; Kraaijeveld, Ken; Stielow, Benjamin; Geml, József; Richardson, James E; Bakker, Freek T
2013-01-01
Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both the generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2-71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contamination was observed in two of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but they will at least generate vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses. Furthermore, NGS of historical DNA enables the recovery of crucial genetic information from old type specimens that to date have remained mostly unutilized, and thus opens up a new frontier for taxonomic research.
Engineering of a DNA Polymerase for Direct m6A Sequencing.
Aschenbrenner, Joos; Werner, Stephan; Marchand, Virginie; Adam, Martina; Motorin, Yuri; Helm, Mark; Marx, Andreas
2018-01-08
Methods for the detection of RNA modifications are of fundamental importance for advancing epitranscriptomics. N6-methyladenosine (m6A) is the most abundant RNA modification in mammalian mRNA and is involved in the regulation of gene expression. Current detection techniques are laborious and rely on antibody-based enrichment of m6A-containing RNA prior to sequencing, since m6A modifications are generally "erased" during reverse transcription (RT). To overcome the drawbacks associated with indirect detection, we aimed to generate novel DNA polymerase variants for direct m6A sequencing. Therefore, we developed a screen to evolve an RT-active KlenTaq DNA polymerase variant that sets a mark for N6-methylation. We identified a mutant that exhibits increased misincorporation opposite m6A compared to unmodified A. Application of the generated DNA polymerase in next-generation sequencing allowed the identification of m6A sites directly from the sequencing data of untreated RNA samples. © 2017 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA.
Control method of Three-phase Four-leg converter based on repetitive control
NASA Astrophysics Data System (ADS)
Hui, Wang
2018-03-01
This research takes the magnetic levitation wind power generation system as its object. To address the power quality problems caused by unbalanced loads in a stand-alone power supply system, an independent control strategy for the three-phase four-leg converter is proposed, combining the characteristics of the magnetic levitation wind power generation system with the repetitive control principle. Based on the symmetric component method, a second-order generalized integrator is used to extract the positive- and negative-sequence signals, and decoupling control is carried out in the synchronous rotating reference frame: the positive- and negative-sequence voltages are regulated with double closed-loop PI control, while a PI regulator combined with repetitive control is introduced to eliminate the steady-state error caused by the fundamental-frequency fluctuation of the zero-sequence component. Simulation results in Matlab/Simulink show that the proposed control scheme effectively suppresses the disturbances caused by unbalanced loads and maintains load voltage balance. The scheme is easy to implement and markedly improves the power quality of the independent supply system.
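The symmetric component method underlying the control strategy above can be sketched in the phasor domain. This omits the second-order generalized integrator and the controllers entirely and shows only the Fortescue decomposition of an unbalanced three-phase set into zero-, positive- and negative-sequence components; the example voltages are hypothetical:

```python
import cmath

A = cmath.exp(2j * cmath.pi / 3)  # 120-degree rotation operator

def symmetric_components(va, vb, vc):
    """Fortescue transform: phase phasors -> (zero, positive, negative)."""
    v0 = (va + vb + vc) / 3
    v1 = (va + A * vb + A**2 * vc) / 3
    v2 = (va + A**2 * vb + A * vc) / 3
    return v0, v1, v2

# Hypothetical unbalanced load voltages (per-unit phasors):
va = 1.0
vb = 0.9 * cmath.exp(-2j * cmath.pi / 3)
vc = 1.1 * cmath.exp(2j * cmath.pi / 3)
v0, v1, v2 = symmetric_components(va, vb, vc)
print(abs(v0), abs(v1), abs(v2))
```

For a balanced set the zero- and negative-sequence magnitudes vanish; the nonzero v0 and v2 here quantify the imbalance that the repetitive controller must reject.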
Beltman, Joost B; Urbanus, Jos; Velds, Arno; van Rooij, Nienke; Rohr, Jan C; Naik, Shalin H; Schumacher, Ton N
2016-04-02
Next generation sequencing (NGS) of amplified DNA is a powerful tool to describe genetic heterogeneity within cell populations that can both be used to investigate the clonal structure of cell populations and to perform genetic lineage tracing. For applications in which both abundant and rare sequences are biologically relevant, the relatively high error rate of NGS techniques complicates data analysis, as it is difficult to distinguish rare true sequences from spurious sequences that are generated by PCR or sequencing errors. This issue, for instance, applies to cellular barcoding strategies that aim to follow the amount and type of offspring of single cells, by supplying these with unique heritable DNA tags. Here, we use genetic barcoding data from the Illumina HiSeq platform to show that straightforward read threshold-based filtering of data is typically insufficient to filter out spurious barcodes. Importantly, we demonstrate that specific sequencing errors occur at an approximately constant rate across different samples that are sequenced in parallel. We exploit this observation by developing a novel approach to filter out spurious sequences. Application of our new method demonstrates its value in the identification of true sequences amongst spurious sequences in biological data sets.
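The filtering idea in the study above, that a specific sequencing error recurs at an approximately constant rate across samples sequenced in parallel, can be caricatured in a few lines. This is an illustrative simplification, not the authors' published method; the barcode strings, counts, and error-rate threshold are invented:

```python
def filter_spurious(barcode_counts, max_error_rate=0.01):
    """Flag barcodes as spurious when, in every sample, their count stays
    below max_error_rate times the count of a more abundant barcode one
    mismatch away (a candidate PCR/sequencing-error parent).

    barcode_counts: {barcode: [count_sample1, count_sample2, ...]}
    Returns the set of barcodes judged spurious.
    """
    def one_mismatch(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

    spurious = set()
    for bc, counts in barcode_counts.items():
        for parent, pcounts in barcode_counts.items():
            if parent == bc or not one_mismatch(bc, parent):
                continue
            if all(c <= max_error_rate * p for c, p in zip(counts, pcounts)):
                spurious.add(bc)
                break
    return spurious

counts = {
    "ACGT": [10000, 8000],  # abundant true barcode
    "ACTT": [60, 55],       # constant small fraction of ACGT: likely an error
    "GGCC": [40, 3000],     # rare in only one sample: kept as true
}
print(filter_spurious(counts))
```

A read-count threshold alone would wrongly discard GGCC in the first sample; requiring a consistent error-like ratio across parallel samples keeps it.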
De novo assembly of a haplotype-resolved human genome.
Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun
2015-06-01
The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
Transcriptome Analysis at the Single-Cell Level Using SMART Technology.
Fish, Rachel N; Bostick, Magnolia; Lehman, Alisa; Farmer, Andrew
2016-10-10
RNA sequencing (RNA-seq) is a powerful method for analyzing cell state, with minimal bias, and has broad applications within the biological sciences. However, transcriptome analysis of seemingly homogenous cell populations may in fact overlook significant heterogeneity that can be uncovered at the single-cell level. The ultra-low amount of RNA contained in a single cell requires extraordinarily sensitive and reproducible transcriptome analysis methods. As next-generation sequencing (NGS) technologies mature, transcriptome profiling by RNA-seq is increasingly being used to decipher the molecular signature of individual cells. This unit describes an ultra-sensitive and reproducible protocol to generate cDNA and sequencing libraries directly from single cells or RNA inputs ranging from 10 pg to 10 ng. Important considerations for working with minute RNA inputs are given. © 2016 John Wiley & Sons, Inc.
Altmüller, Janine; Budde, Birgit S; Nürnberg, Peter
2014-02-01
Targeted re-sequencing such as gene panel sequencing (GPS) has become very popular in medical genetics, both for research projects and in diagnostic settings. The technical principles of the different enrichment methods have been reviewed several times before; however, new enrichment products are constantly entering the market, and researchers are often puzzled about the requirement to take decisions about long-term commitments, both for the enrichment product and the sequencing technology. This review summarizes important considerations for the experimental design and provides helpful recommendations in choosing the best sequencing strategy for various research projects and diagnostic applications.
Doitsidou, Maria; Jarriault, Sophie; Poole, Richard J.
2016-01-01
The use of next-generation sequencing (NGS) has revolutionized the way phenotypic traits are assigned to genes. In this review, we describe NGS-based methods for mapping a mutation and identifying its molecular identity, with an emphasis on applications in Caenorhabditis elegans. In addition to an overview of the general principles and concepts, we discuss the main methods, provide practical and conceptual pointers, and guide the reader in the types of bioinformatics analyses that are required. Owing to the speed and the plummeting costs of NGS-based methods, mapping and cloning a mutation of interest has become straightforward, quick, and relatively easy. Removing this bottleneck previously associated with forward genetic screens has significantly advanced the use of genetics to probe fundamental biological processes in an unbiased manner. PMID:27729495
Farhat, Maha; Shaheed, Raja A; Al-Ali, Haider H; Al-Ghamdi, Abdullah S; Al-Hamaqi, Ghadeer M; Maan, Hawraa S; Al-Mahfoodh, Zainab A; Al-Seba, Hussain Z
2018-02-01
To investigate the presence of Legionella spp. in cooling tower water. Legionella proliferation in cooling tower water has serious public health implications as it can be transmitted to humans via aerosols and cause Legionnaires' disease. Samples of cooling tower water were collected from King Fahd Hospital of the University (KFHU) (Imam Abdulrahman Bin Faisal University, 2015/2016). The water samples were analyzed by a standard Legionella culture method, real-time polymerase chain reaction (RT-PCR), and 16S rRNA next-generation sequencing. In addition, the bacterial community composition was evaluated. All samples were negative by conventional Legionella culture. In contrast, all water samples yielded positive results by real-time PCR (10^5 to 10^6 GU/L). The results of 16S rRNA next-generation sequencing showed high similarity and reproducibility among the water samples. The majority of sequences were Alpha-, Beta-, and Gamma-proteobacteria, and Legionella was the predominant genus. The hydrogen-oxidizing gram-negative bacterium Hydrogenophaga was present at high abundance, indicating high metabolic activity. Sphingopyxis, which is known for its resistance to antimicrobials and as a pioneer in biofilm formation, was also detected. Our findings indicate that monitoring of Legionella in cooling tower water would be enhanced by use of both conventional culturing and molecular methods.
Weirather, Jason L; Afshar, Pegah Tootoonchi; Clark, Tyson A; Tseng, Elizabeth; Powers, Linda S; Underwood, Jason G; Zabner, Joseph; Korlach, Jonas; Wong, Wing Hung; Au, Kin Fai
2015-10-15
We developed an innovative hybrid sequencing approach, IDP-fusion, to detect fusion genes, determine fusion sites and identify and quantify fusion isoforms. IDP-fusion is the first method to study gene fusion events by integrating Third Generation Sequencing long reads and Second Generation Sequencing short reads. We applied IDP-fusion to PacBio data and Illumina data from the MCF-7 breast cancer cells. Compared with the existing tools, IDP-fusion detects fusion genes at higher precision and a very low false positive rate. The results show that IDP-fusion will be useful for unraveling the complexity of multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Shen, Kang-Ning; Chen, Ching-Hung; Hsiao, Chung-Der
2016-05-01
In this study, the complete mitogenome sequence of the hornlip mullet Plicomugil labiosus (Teleostei: Mugilidae) was determined by a next-generation sequencing method. The assembled mitogenome, consisting of 16,829 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNAs, two ribosomal RNA genes and a non-coding control region (D-loop). The D-loop, 1057 bp in length, is located between tRNA-Pro and tRNA-Phe. The overall base composition of P. labiosus is 28.0% A, 29.3% C, 15.5% G and 27.2% T. The complete mitogenome may provide essential DNA molecular data for further population, phylogenetic and evolutionary analyses of the Mugilidae.
Shen, Kang-Ning; Tsai, Shiou-Yi; Chen, Ching-Hung; Hsiao, Chung-Der; Durand, Jean-Dominique
2016-11-01
In this study, the complete mitogenome sequence of the largescale mullet (Teleostei: Mugilidae) was determined by a next-generation sequencing method. The assembled mitogenome, consisting of 16,832 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNAs, two ribosomal RNA genes, and a non-coding control region (D-loop). The D-loop, 1094 bp in length, is located between tRNA-Pro and tRNA-Phe. The overall base composition of the largescale mullet is 27.8% A, 30.1% C, 16.2% G, and 25.9% T. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analyses of the Mugilidae.
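The base compositions reported in the two mitogenome announcements above are straightforward to compute from an assembled sequence. A minimal sketch (the ten-base fragment is invented; a real mitogenome would be roughly 16.8 kb):

```python
def base_composition(seq):
    """Percentage of each nucleotide in a DNA sequence."""
    seq = seq.upper()
    return {b: 100.0 * seq.count(b) / len(seq) for b in "ACGT"}

# Tiny made-up fragment; the records above report, e.g., 27.8% A overall.
comp = base_composition("ACCTGACCTA")
print(comp)
```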
Single-Cell RNA Sequencing of Glioblastoma Cells.
Sen, Rajeev; Dolgalev, Igor; Bayin, N Sumru; Heguy, Adriana; Tsirigos, Aris; Placantonakis, Dimitris G
2018-01-01
Single-cell RNA sequencing (sc-RNASeq) is a recently developed technique used to evaluate the transcriptome of individual cells. As opposed to conventional RNASeq in which entire populations are sequenced in bulk, sc-RNASeq can be beneficial when trying to better understand gene expression patterns in markedly heterogeneous populations of cells or when trying to identify transcriptional signatures of rare cells that may be underrepresented when using conventional bulk RNASeq. In this method, we describe the generation and analysis of cDNA libraries from single patient-derived glioblastoma cells using the C1 Fluidigm system. The protocol details the use of the C1 integrated fluidics circuit (IFC) for capturing, imaging and lysing cells; performing reverse transcription; and generating cDNA libraries that are ready for sequencing and analysis.
Selective amplification and sequencing of cyclic phosphate-containing RNAs by the cP-RNA-seq method.
Honda, Shozo; Morichika, Keisuke; Kirino, Yohei
2016-03-01
RNA digestions catalyzed by many ribonucleases generate RNA fragments that contain a 2',3'-cyclic phosphate (cP) at their 3' termini. However, standard RNA-seq methods are unable to accurately capture cP-containing RNAs because the cP inhibits the adapter ligation reaction. We recently developed a method named cP-RNA-seq that is able to selectively amplify and sequence cP-containing RNAs. Here we describe the cP-RNA-seq protocol in which the 3' termini of all RNAs, except those containing a cP, are cleaved through a periodate treatment after phosphatase treatment; hence, subsequent adapter ligation and cDNA amplification steps are exclusively applied to cP-containing RNAs. cP-RNA-seq takes ∼6 d, excluding the time required for sequencing and bioinformatics analyses, which are not covered in detail in this protocol. Biochemical validation of the existence of cP in the identified RNAs takes ∼3 d. Even though the cP-RNA-seq method was developed to identify angiogenin-generating 5'-tRNA halves as a proof of principle, the method should be applicable to global identification of cP-containing RNA repertoires in various transcriptomes.
Advances in Molecular Serotyping and Subtyping of Escherichia coli.
Fratamico, Pina M; DebRoy, Chitrita; Liu, Yanhong; Needleman, David S; Baranzoni, Gian Marco; Feng, Peter
2016-01-01
Escherichia coli plays an important role as a member of the gut microbiota; however, pathogenic strains also exist, including various diarrheagenic E. coli pathotypes and extraintestinal pathogenic E. coli that cause illness outside of the GI-tract. E. coli have traditionally been serotyped using antisera against the ca. 186 O-antigens and 53 H-flagellar antigens. Phenotypic methods, including bacteriophage typing and O- and H- serotyping for differentiating and characterizing E. coli have been used for many years; however, these methods are generally time consuming and not always accurate. Advances in next generation sequencing technologies have made it possible to develop genetic-based subtyping and molecular serotyping methods for E. coli, which are more discriminatory compared to phenotypic typing methods. Furthermore, whole genome sequencing (WGS) of E. coli is replacing established subtyping methods such as pulsed-field gel electrophoresis, providing a major advancement in the ability to investigate food-borne disease outbreaks and for trace-back to sources. A variety of sequence analysis tools and bioinformatic pipelines are being developed to analyze the vast amount of data generated by WGS and to obtain specific information such as O- and H-group determination and the presence of virulence genes and other genetic markers.
Random sequences generation through optical measurements by phase-shifting interferometry
NASA Astrophysics Data System (ADS)
François, M.; Grosges, T.; Barchiesi, D.; Erra, R.; Cornet, A.
2012-04-01
The development of new techniques for producing random sequences with a high level of security is a challenging topic of research in modern cryptography. The proposed method is based on the measurement by phase-shifting interferometry of the speckle signals of the interaction between light and structures. We show how the combination of amplitude and phase distributions (maps) under a numerical process can produce random sequences. The produced sequences satisfy all the statistical requirements of randomness and can be used in cryptographic schemes.
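The combination of amplitude and phase maps into a bit sequence can be caricatured numerically. This toy XORs the least significant bits of quantized map values and is not the authors' interferometric process; the synthetic maps merely stand in for measured speckle data:

```python
import math

def maps_to_bits(amplitude, phase, scale=1e6):
    """Toy combination of amplitude and phase maps into a bit sequence:
    quantize each value and XOR the least significant bits pointwise.
    (Illustrative only; the paper's numerical process differs.)"""
    bits = []
    for a, p in zip(amplitude, phase):
        bits.append((int(a * scale) ^ int(p * scale)) & 1)
    return bits

# Synthetic speckle-like maps (stand-ins for interferometric measurements):
n = 1000
amp = [abs(math.sin(0.1 * i) * math.cos(0.037 * i ** 2)) for i in range(n)]
ph = [math.atan2(math.sin(0.07 * i ** 2), math.cos(0.013 * i)) % (2 * math.pi)
      for i in range(n)]
bits = maps_to_bits(amp, ph)
print(sum(bits) / len(bits))
```

A real generator would be validated against the full NIST statistical test suite rather than a simple balance check of the bit stream.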
[Sequencing technology in gene diagnosis and its application].
Yibin, Guo
2014-11-01
The study of gene mutation is one of the hot topics in the field of life science nowadays, and the related detection methods and diagnostic technologies have developed rapidly. Sequencing technology plays an indispensable role in the definitive diagnosis and classification of genetic diseases. In this review, we summarize the research progress in sequencing technology, evaluate the advantages and disadvantages of first- to third-generation sequencing technologies, and describe their application in gene diagnosis. We also discuss trends and prospects for their future development.
Clustering evolving proteins into homologous families.
Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A
2013-04-08
Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches. Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment. Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. 
For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
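The graph-based Markov clustering (MCL) algorithm evaluated above alternates an expansion step (matrix squaring, simulating random-walk flow) with an inflation step (entrywise powering and renormalization, strengthening strong flows) on a column-stochastic similarity matrix. A toy dense-matrix sketch, far from the sparse implementations used on real protein datasets; the matrix and parameters are invented:

```python
def mcl(sim, inflation=2.0, iters=50):
    """Minimal Markov clustering (MCL) sketch on a dense similarity matrix.

    sim: symmetric n x n list of lists with self-loops (sim[i][i] > 0).
    Returns clusters as frozensets of node indices (per-row support).
    """
    n = len(sim)

    def normalize(m):  # make each column sum to 1 (column-stochastic)
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
        return m

    def expand(m):     # matrix square: spread flow along the graph
        return [[sum(m[i][k] * m[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]

    def inflate(m):    # raise entries to a power: sharpen cluster structure
        return normalize([[m[i][j] ** inflation for j in range(n)]
                          for i in range(n)])

    m = normalize([row[:] for row in sim])
    for _ in range(iters):
        m = inflate(expand(m))

    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Two hypothetical protein families: nodes 0-2 and 3-4.
S = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(mcl([[float(x) for x in row] for row in S]))
```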
Development of Pineapple Microsatellite Markers and Germplasm Genetic Diversity Analysis
Tong, Helin; Chen, You; Wang, Jingyi; Chen, Yeyuan; Sun, Guangming; He, Junhu; Wu, Yaoting
2013-01-01
Two methods were used to develop pineapple microsatellite markers. Genomic library-based SSR development: using a selectively amplified microsatellite assay, 86 sequences were generated from a pineapple genomic library. 91 (96.8%) of the 94 Simple Sequence Repeat (SSR) loci were dinucleotide repeats (39 AC/GT repeats and 52 GA/TC repeats, accounting for 42.9% and 57.1%, respectively), and the other three were mononucleotide repeats. Thirty-six pairs of SSR primers were designed; 24 of them generated clear bands of the expected sizes, and 13 of them showed polymorphism. EST-based SSR development: 5659 pineapple EST sequences obtained from NCBI were analyzed; among 1397 nonredundant EST sequences, 843 were found to contain 1110 SSR loci (217 of them contained more than one SSR locus). The frequency of SSRs in pineapple EST sequences is 1 SSR per 3.73 kb, and 44 repeat types were found. Mononucleotide, dinucleotide, and trinucleotide repeats dominate, accounting for 95.6% in total. AG/CT and AGC/GCT were the dominant types of dinucleotide and trinucleotide repeats, accounting for 83.5% and 24.1%, respectively. Thirty pairs of primers were designed, one for each of 30 randomly selected sequences; 26 of them generated clear and reproducible bands, and 22 of them showed polymorphism. Eighteen pairs of primers obtained by one or the other of the two methods above that showed polymorphism were selected to carry out germplasm genetic diversity analysis of 48 pineapple breeds; similarity coefficients of these breeds were between 0.59 and 1.00, and they can be divided into four groups accordingly. Amplification products of five SSR markers were extracted and sequenced; the corresponding repeat loci were found, and mutations at these loci mainly involve the copy number of repeats and base changes in the flanking regions. PMID:24024187
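Dinucleotide SSR loci of the kind mined above can be located in sequence data with a short regular-expression scan. A sketch under simplifying assumptions (only perfect dinucleotide repeats are found; the test fragment and function name are invented):

```python
import re

def find_dinucleotide_ssrs(seq, min_repeats=5):
    """Find dinucleotide SSRs with at least min_repeats tandem copies.
    Returns (motif, start, repeat_count) tuples."""
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        if motif[0] == motif[1]:      # skip mononucleotide runs like AAAA...
            continue
        hits.append((motif, m.start(), len(m.group(0)) // 2))
    return hits

# Hypothetical fragment with a (GA)6 repeat starting at index 4:
print(find_dinucleotide_ssrs("ACGTGAGAGAGAGAGATTTT"))
```

Real SSR-mining tools also handle tri- and higher-order motifs and imperfect (interrupted) repeats, which this sketch deliberately ignores.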
Multiplex Reverse Transcription-PCR for Simultaneous Surveillance of Influenza A and B Viruses
Zhou, Bin; Barnes, John R.; Sessions, October M.; Chou, Tsui-Wen; Wilson, Malania; Stark, Thomas J.; Volk, Michelle; Spirason, Natalie; Halpin, Rebecca A.; Kamaraj, Uma Sangumathi; Ding, Tao; Stockwell, Timothy B.; Ghedin, Elodie; Barr, Ian G.
2017-01-01
Influenza A and B viruses are the causative agents of annual influenza epidemics that can be severe, and influenza A viruses intermittently cause pandemics. Sequence information from influenza virus genomes is instrumental in determining mechanisms underpinning antigenic evolution and antiviral resistance. However, due to sequence diversity and the dynamics of influenza virus evolution, rapid and high-throughput sequencing of influenza viruses remains a challenge. We developed a single-reaction influenza A/B virus (FluA/B) multiplex reverse transcription-PCR (RT-PCR) method that amplifies the most critical genomic segments (hemagglutinin [HA], neuraminidase [NA], and matrix [M]) of seasonal influenza A and B viruses for next-generation sequencing, regardless of viral type, subtype, or lineage. Herein, we demonstrate that the strategy is highly sensitive and robust. The strategy was validated on thousands of seasonal influenza A and B virus-positive specimens using multiple next-generation sequencing platforms. PMID:28978683
Hwang, Byungjin; Bang, Duhee
2016-01-01
All synthetic DNA materials require prior programming of the building blocks of the oligonucleotide sequences. The development of a programmable microarray platform provides cost-effective and time-efficient solutions in the field of data storage using DNA. However, the scalability of the synthesis is not on par with the accelerating sequencing capacity. Here, we report on a new paradigm of generating genetic material (writing) using a degenerate oligonucleotide and optomechanical retrieval method that leverages sequencing (reading) throughput to generate the desired number of oligonucleotides. As a proof of concept, we demonstrate the feasibility of our concept in digital information storage in DNA. In simulation, the ability to store data is expected to exponentially increase with increase in degenerate space. The present study highlights the major framework change in conventional DNA writing paradigm as a sequencer itself can become a potential source of making genetic materials. PMID:27876825
More of an art than a science: Using microbial DNA sequences to compose music
Larsen, Peter E.
2016-03-01
Bacteria are everywhere. Microbial ecology is emerging as a critical field for understanding the relationships between these ubiquitous bacterial communities, the environment, and human health. Next generation DNA sequencing technology provides us a powerful tool to indirectly observe the communities by sequencing and analyzing all of the bacterial DNA present in an environment. The results of the DNA sequencing experiments can generate gigabytes to terabytes of information, however, making it difficult for the citizen scientist to grasp and for the educator to convey these data. Here, we present a method for interpreting massive amounts of microbial ecology data as musical performances, easily generated on any computer using only commonly available or freely available software and the 'Microbial Bebop' algorithm. Furthermore, using this approach, citizen scientists and biology educators can sonify complex data in a fun and interactive format, making it easier to communicate both the importance and the excitement of exploring planet Earth's largest ecosystem.
He, Shan; Li, Yangyang; Chen, Yang; Zhu, Yue; Zhang, Xinyu; Xia, Xiaoli; Sun, Huaichang
2016-08-01
Pigs are the most economically important livestock, but pig cell lines useful for physiological studies and/or vaccine development are limited. Although several pig cell lines have been generated by oncogene transformation or human telomerase reverse transcriptase (TERT) immortalization, these cell lines contain viral sequences and/or antibiotic resistance genes. In this study, we established a new method for generating pig cell lines using the Sleeping Beauty (SB) transposon-mediated ectopic expression of porcine telomerase reverse transcriptase (pTERT). The performance of the new method was confirmed by generating a pig fibroblast cell (PFC) line. After transfection of primary PFCs with the SB transposon system, one cell clone containing the pTERT expression cassette was selected by dilution cloning and passaged for multiple generations. After passage for more than 40 generations, the cell line retained stable expression of ectopic pTERT and continuous growth potential. Further characterization showed that the cell line kept the fibroblast morphology, growth curve, population doubling time, cloning efficiency, marker gene expression pattern, cell cycle distribution and anchorage-dependent growth properties of the primary cells. These data suggest that the new method is useful for generating pig cell lines free of viral sequences and antibiotic resistance genes.
Evaluation of next generation sequencing for the analysis of Eimeria communities in wildlife.
Vermeulen, Elke T; Lott, Matthew J; Eldridge, Mark D B; Power, Michelle L
2016-05-01
Next-generation sequencing (NGS) techniques are well-established for studying bacterial communities but not yet for microbial eukaryotes. Parasite communities remain poorly studied, due in part to the lack of reliable and accessible molecular methods to analyse eukaryotic communities. We aimed to develop and evaluate a methodology to analyse communities of the protozoan parasite Eimeria from populations of the Australian marsupial Petrogale penicillata (brush-tailed rock-wallaby) using NGS. An oocyst purification method for small sample sizes and a polymerase chain reaction (PCR) protocol for the 18S rRNA locus targeting Eimeria were developed and optimised prior to sequencing on the Illumina MiSeq platform. A data analysis approach was developed by modifying methods from bacterial metagenomics and utilising existing Eimeria sequences in GenBank. Operational taxonomic unit (OTU) assignment at a high similarity threshold (97%) was more accurate at grouping Eimeria contigs into Eimeria OTUs, whereas a lower threshold (95%) gave greater resolution between OTU consensus sequences. An assessment of two amplification PCR methods prior to Illumina MiSeq, single and nested PCR, determined that single PCR was more sensitive to Eimeria, as more Eimeria OTUs were detected in single-PCR amplicons. We have developed a simple and cost-effective data analysis pipeline for community analysis of eukaryotic organisms, using Eimeria communities as a model. The pipeline provides a basis for evaluation with other eukaryotic organisms and has potential for diverse community analysis studies. Copyright © 2016 Elsevier B.V. All rights reserved.
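The accuracy-versus-resolution trade-off between the 97% and 95% thresholds can be illustrated with a minimal greedy centroid-clustering sketch (a simplification with hypothetical helper names and toy reads; real pipelines use alignment-based identity over abundance-sorted input):

```python
def identity(a, b):
    """Fraction of matching positions between two sequences (positional, no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def cluster_otus(seqs, threshold):
    """Greedy centroid clustering: each sequence joins the first centroid it
    matches at or above the identity threshold, else it founds a new OTU."""
    centroids, otus = [], []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                otus[i].append(s)
                break
        else:
            centroids.append(s)
            otus.append([s])
    return otus

reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(len(cluster_otus(reads, 0.90)))  # two OTUs at 90% identity
```

Raising the threshold to 0.95 splits the two near-identical reads into separate OTUs, mirroring the resolution effect reported above.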
Yin, Li; Yao, Jiqiang; Gardner, Brent P; Chang, Kaifen; Yu, Fahong; Goodenow, Maureen M
2012-01-01
Next-generation sequencing (NGS) applied to human papilloma viruses (HPV) can provide sensitive methods to investigate the molecular epidemiology of multiple-type HPV infection. Currently, a genotyping system with a comprehensive collection of updated HPV reference sequences and the capacity to handle NGS data sets is lacking. HPV-QUEST was developed as an automated and rapid HPV genotyping system. The web-based HPV-QUEST subtyping algorithm was developed using HTML, PHP, the Perl scripting language, and MySQL as the database backend. HPV-QUEST includes a database of annotated HPV reference sequences with updated nomenclature, covering 5 genera, 14 species and 150 mucosal and cutaneous types, to genotype BLAST-matched query sequences. HPV-QUEST processes up to 10 megabases of sequence within 1 to 2 minutes. Results are reported in HTML, text and Excel formats; they display the E-value, BLAST score, and local and coverage identities; provide the genus, species, type, infection site and risk for the best-matched reference HPV sequence; and are ready for additional analyses.
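The core best-match lookup of such a genotyping system can be sketched as follows (the reference table and scores are illustrative stand-ins, not HPV-QUEST's actual database or output format):

```python
# Hypothetical annotated reference table: type -> (genus, species, risk)
REFERENCES = {
    "HPV16": ("Alphapapillomavirus", "9", "high"),
    "HPV6":  ("Alphapapillomavirus", "10", "low"),
}

def genotype(hits):
    """Pick the best-scoring reference hit for a query and return its annotation.
    `hits` is a list of (reference_type, blast_score) pairs for one query."""
    best_type, best_score = max(hits, key=lambda h: h[1])
    genus, species, risk = REFERENCES[best_type]
    return {"type": best_type, "genus": genus,
            "species": species, "risk": risk, "score": best_score}

result = genotype([("HPV6", 180.0), ("HPV16", 310.5)])
```

The same dictionary-driven annotation step scales to any number of reference types; only the hit-scoring front end (BLAST) is specific to the sequencing data.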
Binladen, Jonas; Gilbert, M Thomas P; Bollback, Jonathan P; Panitz, Frank; Bendixen, Christian; Nielsen, Rasmus; Willerslev, Eske
2007-02-14
The invention of the Genome Sequence 20 DNA Sequencing System (454 parallel sequencing platform) has enabled the rapid and high-volume production of sequence data. Until now, however, individual emulsion PCR (emPCR) reactions and subsequent sequencing runs have been unable to combine template DNA from multiple individuals, as homologous sequences cannot be subsequently assigned to their original sources. We use conventional PCR with 5'-nucleotide tagged primers to generate homologous DNA amplification products from multiple specimens, followed by sequencing through the high-throughput Genome Sequence 20 DNA Sequencing System (GS20, Roche/454 Life Sciences). Each DNA sequence is subsequently traced back to its individual source through 5'-tag analysis. We demonstrate that this new approach enables the assignment of virtually all the generated DNA sequences to the correct source once sequencing anomalies are accounted for (mis-assignment rate <0.4%). Therefore, the method enables accurate sequencing and assignment of homologous DNA sequences from multiple sources in a single high-throughput GS20 run. We observe a bias in the distribution of the differently tagged primers that is dependent on the 5' nucleotide of the tag. In particular, primers 5' labelled with a cytosine are heavily overrepresented among the final sequences, while those 5' labelled with a thymine are strongly underrepresented. A weaker bias also exists with regard to the distribution of the sequences as sorted by the second nucleotide of the dinucleotide tags. As the results are based on a single GS20 run, the general applicability of the approach requires confirmation. However, our experiments demonstrate that 5'-primer tagging is a useful method in which the sequencing power of the GS20 can be applied to PCR-based assays of multiple homologous PCR products.
The new approach will be of value to a broad range of research areas, such as those of comparative genomics, complete mitochondrial analyses, population genetics, and phylogenetics.
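The 5'-tag assignment step can be sketched as a simple demultiplexer (the dinucleotide tags and sample names below are hypothetical; the real pipeline must also tolerate sequencing anomalies within the tag itself):

```python
def demultiplex(reads, tag_to_sample, tag_len=2):
    """Assign each read to its source specimen by its 5' tag, then trim
    the tag off. Reads with an unrecognised tag are collected separately."""
    by_sample = {s: [] for s in tag_to_sample.values()}
    unassigned = []
    for r in reads:
        sample = tag_to_sample.get(r[:tag_len])
        if sample is None:
            unassigned.append(r)
        else:
            by_sample[sample].append(r[tag_len:])
    return by_sample, unassigned

tags = {"CA": "specimen_1", "TG": "specimen_2"}
reads = ["CAACGTACGT", "TGACGTACGA", "GGACGTACGT"]
assigned, unknown = demultiplex(reads, tags)
```

The unassigned bin is where tag-position sequencing errors would surface, which is why the observed per-nucleotide tag biases matter for tag design.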
Keller, A; Danner, N; Grimmer, G; Ankenbrand, M; von der Ohe, K; von der Ohe, W; Rost, S; Härtel, S; Steffan-Dewenter, I
2015-03-01
The identification of pollen plays an important role in ecology, palaeo-climatology, honey quality control and other areas. Currently, expert knowledge and reference collections are essential to identify pollen origin through light microscopy. Pollen identification through molecular sequencing and DNA barcoding has been proposed as an alternative approach, but the assessment of mixed pollen samples originating from multiple plant species is still a tedious and error-prone task. Next-generation sequencing has been proposed to avoid this hindrance. In this study we assessed mixed pollen probes through next-generation sequencing of amplicons from the highly variable, species-specific internal transcribed spacer 2 region of nuclear ribosomal DNA. Further, we developed a bioinformatic workflow to analyse these high-throughput data with a newly created reference database. To evaluate the feasibility, we compared results from classical identification based on light microscopy from the same samples with our sequencing results. We assessed in total 16 mixed pollen samples, 14 originated from honeybee colonies and two from solitary bee nests. The sequencing technique resulted in higher taxon richness (deeper assignments and more identified taxa) compared to light microscopy. Abundance estimations from sequencing data were significantly correlated with counted abundances through light microscopy. Simulation analyses of taxon specificity and sensitivity indicate that 96% of taxa present in the database are correctly identifiable at the genus level and 70% at the species level. Next-generation sequencing thus presents a useful and efficient workflow to identify pollen at the genus and species level without requiring specialised palynological expert knowledge. © 2014 German Botanical Society and The Royal Botanical Society of the Netherlands.
Efficient error correction for next-generation sequencing of viral amplicons.
Skums, Pavel; Dimitrova, Zoya; Campo, David S; Vaughan, Gilberto; Rossi, Livia; Forbi, Joseph C; Yokosawa, Jonny; Zelikovsky, Alex; Khudyakov, Yury
2012-06-25
Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and the data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm. PMID:22759430
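The k-mer idea behind methods like KEC can be sketched as follows (this shows only the underlying k-mer spectrum filter; the published KEC algorithm corrects rather than discards reads and calibrates its thresholds per amplicon):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def solid_reads(reads, k, min_count):
    """Keep only reads whose every k-mer occurs at least min_count times;
    rare k-mers are treated as likely sequencing errors."""
    counts = kmer_counts(reads, k)
    return [r for r in reads
            if all(counts[r[i:i + k]] >= min_count
                   for i in range(len(r) - k + 1))]

reads = ["ACGTACGT"] * 5 + ["ACGTACCT"]  # last read carries a likely error
kept = solid_reads(reads, 4, 2)
```

Because an amplicon's true haplotypes are sampled many times, their k-mers are abundant, while a random 454 error creates k-mers seen only once or twice.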
Method for construction of normalized cDNA libraries
Soares, Marcelo B.; Efstratiadis, Argiris
1998-11-03
This invention provides a method to normalize a directional cDNA library constructed in a vector that allows propagation in single-stranded circle form comprising: (a) propagating the directional cDNA library in single-stranded circles; (b) generating fragments complementary to the 3' noncoding sequence of the single-stranded circles in the library to produce partial duplexes; (c) purifying the partial duplexes; (d) melting and reassociating the purified partial duplexes to appropriate Cot; and (e) purifying the unassociated single-stranded circles, thereby generating a normalized cDNA library. This invention also provides normalized cDNA libraries generated by the above-described method and uses of the generated libraries. 19 figs.
Truly random number generation: an example
NASA Astrophysics Data System (ADS)
Frauchiger, Daniela; Renner, Renato
2013-10-01
Randomness is crucial for a variety of applications, ranging from gambling to computer simulations, and from cryptography to statistics. However, many of the currently used methods for generating randomness do not meet the criteria that are necessary for these applications to work properly and safely. A common problem is that a sequence of numbers may look random but nevertheless not be truly random. In fact, the sequence may pass all standard statistical tests and yet be perfectly predictable. This renders it useless for many applications. For example, in cryptography, the predictability of a "randomly" chosen password is obviously undesirable. Here, we review a recently developed approach to generating true, and hence unpredictable, randomness.
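The "passes statistical tests yet perfectly predictable" point is easy to demonstrate: a seeded pseudo-random generator emits a stream that looks statistically random, yet anyone who knows (or can guess) the seed reproduces it exactly. Python's Mersenne Twister is used here purely as an illustration:

```python
import random

# Two generators seeded identically produce exactly the same "random" stream.
# The output may pass standard statistical test batteries, but it is
# perfectly predictable to anyone who knows the seed.
gen_a = random.Random(1234)
gen_b = random.Random(1234)
stream_a = [gen_a.randrange(10) for _ in range(20)]
stream_b = [gen_b.randrange(10) for _ in range(20)]
print(stream_a == stream_b)  # True
```

This is exactly why cryptographic applications need entropy from a physical source rather than from a deterministic algorithm alone.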
Abe, Takashi; Hamano, Yuta; Ikemura, Toshimichi
2014-01-01
A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method, "BLSOM", for di-, tri-, and tetranucleotide compositions in genome sequences, which can capture the sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In a detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and their classification on the BLSOMs was observed, visualizing the oligonucleotides diagnostic for species-specific clustering.
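The composition vectors a BLSOM is trained on can be sketched as a plain oligonucleotide frequency vector (the published method additionally normalises across many fragments and feeds them to a batch-learning self-organizing map; the helper below is a hypothetical simplification):

```python
from itertools import product
from collections import Counter

def oligo_frequencies(seq, k=4):
    """Frequency vector over all 4**k oligonucleotides, in a fixed
    lexicographic order, normalised by the number of windows. This is
    the kind of alignment-free 'genome signature' vector a BLSOM uses."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[''.join(p)] / total
            for p in product("ACGT", repeat=k)]

vec = oligo_frequencies("ACGT" * 100)  # tetranucleotide signature, 256 dims
```

Because every fragment maps to the same fixed-length vector, fragments from any species can be compared directly, with no alignment step.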
Dumonceaux, Tim J.; Green, Margaret; Hammond, Christine; Perez, Edel; Olivier, Chrystel
2014-01-01
Phytoplasmas (‘Candidatus Phytoplasma’ spp.) are insect-vectored bacteria that infect a wide variety of plants, including many agriculturally important species. The infections can cause devastating yield losses by inducing morphological changes that dramatically alter inflorescence development. Detection of phytoplasma infection typically utilizes sequences located within the 16S–23S rRNA-encoding locus, and these sequences are necessary for strain identification by currently accepted standards for phytoplasma classification. However, these methods can generate PCR products >1400 bp that are less divergent in sequence than protein-encoding genes, limiting strain resolution in certain cases. We describe a method for accessing the chaperonin-60 (cpn60) gene sequence from a diverse array of ‘Ca. Phytoplasma’ spp. Two degenerate primer sets were designed based on the known sequence diversity of cpn60 from ‘Ca. Phytoplasma’ spp. and used to amplify cpn60 gene fragments from various reference samples and infected plant tissues. Forty-three cpn60 sequences were thereby determined. The cpn60 PCR-gel electrophoresis method was highly sensitive compared to 16S–23S-targeted PCR-gel electrophoresis. The topology of a phylogenetic tree generated using cpn60 sequences was congruent with that reported for 16S rRNA-encoding genes. The cpn60 sequences were used to design a hybridization array using oligonucleotide-coupled fluorescent microspheres, providing rapid diagnosis and typing of phytoplasma infections. The oligonucleotide-coupled fluorescent microsphere assay revealed samples that were infected simultaneously with two subtypes of phytoplasma. These tools were applied to show that two host plants, Brassica napus and Camelina sativa, displayed different phytoplasma infection patterns. PMID:25551224
NASA Astrophysics Data System (ADS)
Tsao, Shih-Ming; Lai, Ji-Ching; Horng, Horng-Er; Liu, Tu-Chen; Hong, Chin-Yih
2017-04-01
Aptamers are oligonucleotides that can bind to specific target molecules. Most aptamers are generated using random libraries in the standard systematic evolution of ligands by exponential enrichment (SELEX). Each random library contains oligonucleotides with a randomized central region and two fixed primer regions at both ends. The fixed primer regions are necessary for amplifying target-bound sequences by PCR. However, these extra sequences may cause non-specific binding, which can interfere with the binding of otherwise well-suited random sequences. Magnetic-Assisted Rapid Aptamer Selection (MARAS) is a newly developed protocol for generating single-strand DNA aptamers that requires no repeated selection cycles. This study proposes and demonstrates a method to isolate aptamers for C-reactive protein (CRP) from a randomized ssDNA library containing no fixed sequences at the 5' and 3' termini using the MARAS platform. Furthermore, the isolated primer-free aptamer was sequenced, and its binding affinity for CRP was analyzed. The specificity of the obtained aptamer was validated using blind serum samples. The result was consistent with monoclonal antibody-based nephelometry analysis, which indicated that the primer-free aptamer has high specificity toward its target. MARAS is a feasible platform for efficiently generating primer-free aptamers for clinical diagnoses.
Rinke, Jenny; Schäfer, Vivien; Schmidt, Mathias; Ziermann, Janine; Kohlmann, Alexander; Hochhaus, Andreas; Ernst, Thomas
2013-08-01
We sought to establish a convenient, sensitive next-generation sequencing (NGS) method for genotyping the 26 most commonly mutated leukemia-associated genes in a single work flow and to optimize this method for low amounts of input template DNA. We designed 184 PCR amplicons that cover all of the candidate genes. NGS was performed with genomic DNA (gDNA) from a cohort of 10 individuals with chronic myelomonocytic leukemia. The results were compared with NGS data obtained from sequencing of DNA generated by whole-genome amplification (WGA) of 20 ng template gDNA. Differences between gDNA and WGA samples in variant frequencies were determined for 2 different WGA kits. For gDNA samples, 25 of 26 genes were successfully sequenced with a sensitivity of 5%, which was achieved by a median coverage of 492 reads (range, 308-636 reads) per amplicon. We identified 24 distinct mutations in 11 genes. With WGA samples, we reliably detected all mutations above 5% sensitivity with a median coverage of 506 reads (range, 256-653 reads) per amplicon. With all variants included in the analysis, WGA amplification by the 2 kits tested yielded differences in variant frequencies that ranged from -28.19% to +9.94% [mean (SD) difference, -0.2% (4.08%)] and from -35.03% to +18.67% [mean difference, -0.75% (5.12%)]. Our method permits simultaneous analysis of a wide range of leukemia-associated target genes in a single sequencing run. NGS can be performed after WGA of template DNA for reliable detection of variants without introducing appreciable bias.
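The variant-frequency comparison between gDNA and WGA samples reduces to simple read-count arithmetic. The counts below are illustrative, loosely echoing the coverage figures above, not data from the study:

```python
def variant_frequency(alt_reads, total_reads):
    """Variant allele frequency (%) from read counts at one position."""
    return 100.0 * alt_reads / total_reads

def wga_bias(gdna_freq, wga_freq):
    """Difference introduced by whole-genome amplification, in percentage
    points (WGA minus gDNA), as tabulated in the kit comparison above."""
    return wga_freq - gdna_freq

g = variant_frequency(123, 492)   # 25.0% in genomic DNA
w = variant_frequency(152, 506)   # ~30.0% after WGA
bias = wga_bias(g, w)             # positive: WGA overestimates this variant
```

Aggregating such per-variant differences across all calls gives the mean and spread figures (e.g. -0.2% ± 4.08%) reported for each WGA kit.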
Lim, Byung Chan; Lee, Seungbok; Shin, Jong-Yeon; Kim, Jong-Il; Hwang, Hee; Kim, Ki Joong; Hwang, Yong Seung; Seo, Jeong-Sun; Chae, Jong Hee
2011-11-01
Duchenne muscular dystrophy or Becker muscular dystrophy might be a suitable candidate disease for application of next-generation sequencing in genetic diagnosis, because the complex mutational spectrum and the large size of the dystrophin gene require two or more analytical methods and incur a high cost. The authors tested whether large deletions/duplications or small mutations, such as point mutations or short insertions/deletions of the dystrophin gene, could be predicted accurately in a single platform using next-generation sequencing technology. A custom solution-based target enrichment kit was designed to capture the whole genomic region of the dystrophin gene and other muscular-dystrophy-related genes. A multiplexing strategy, wherein four differently bar-coded samples were captured and sequenced together in a single lane of the Illumina Genome Analyzer, was applied. The study subjects were 25 patients: 16 with deficient dystrophin expression but without a large deletion/duplication, and 9 with a known large deletion/duplication. Nearly 100% of the exonic region of the dystrophin gene was covered by at least eight reads, with a mean read depth of 107. Pathogenic small mutations were identified in 15 of the 16 patients without a large deletion/duplication. Using these 16 patients as the standard, the authors' method accurately predicted the deleted or duplicated exons in the 9 patients with known mutations. Inclusion of non-coding regions and paired-end sequence analysis enabled accurate identification by increasing the read depth and providing information about the breakpoint junction. The current method has an advantage for the genetic diagnosis of Duchenne muscular dystrophy and Becker muscular dystrophy, wherein a comprehensive mutational search may be feasible using a single platform.
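Deletion/duplication calling from read depth, as used for the 9 patients above, can be sketched as a depth-ratio classifier. The thresholds and counts below are hypothetical illustrations, not the authors' calibration:

```python
def call_copy_number(exon_depths, reference_depth):
    """Classify each exon as deleted, duplicated, or normal from its mean
    read depth relative to a sample-wide reference depth.
    Thresholds (0.25x, 1.75x) are illustrative only."""
    calls = {}
    for exon, depth in exon_depths.items():
        ratio = depth / reference_depth
        if ratio < 0.25:
            calls[exon] = "deletion"
        elif ratio > 1.75:
            calls[exon] = "duplication"
        else:
            calls[exon] = "normal"
    return calls

# Hypothetical per-exon mean depths against the reported mean depth of 107
calls = call_copy_number({"exon45": 3, "exon46": 210, "exon47": 101}, 107)
```

This is why including non-coding regions helps: more mapped reads per region stabilises the depth ratio on which the call depends.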
Bringing the fathead minnow into the genomic era
The fathead minnow is a well-established ecotoxicological model organism that has been widely used for regulatory ecotoxicity testing and research for over half a century. While a large amount of molecular information has been gathered on the fathead minnow over the years, the lack of genomic sequence data has limited its utility for certain applications. To address this limitation, high-throughput Illumina sequencing technology was employed to sequence the fathead minnow genome. Approximately 100X coverage was achieved by sequencing several libraries of paired-end reads with differing genome insert sizes. Two draft genome assemblies were generated using the SOAPdenovo and String Graph Assembler (SGA) methods, respectively. When these were compared, the SOAPdenovo assembly had a higher scaffold N50 value (60.4 kbp versus 15.4 kbp) and also performed better in a Core Eukaryotic Genes Mapping Analysis (CEGMA), mapping 91% versus 67% of genes. As such, this assembly was selected for further development and annotation. The foundation for genome annotation was generated using AUGUSTUS, an ab initio method for gene prediction. A total of 43,345 potential coding sequences were predicted on the genome assembly. These predicted sequences were translated to peptides and queried in a BLAST search against all vertebrates, with 28,290 of these sequences corresponding to zebrafish peptides and 5,242 producing no significant alignments.
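The scaffold N50 statistic used to compare the two assemblies has a short standard definition, sketched here:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs/scaffolds of length >= L
    together cover at least half of the total assembly span."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for L in lengths:
        running += L
        if running >= half:
            return L
    return 0

value = n50([100, 80, 60, 40, 20])  # half of 300 is reached at the 80-long contig
```

A higher N50 (60.4 kbp for SOAPdenovo versus 15.4 kbp for SGA) means half the assembled bases sit in longer, more contiguous scaffolds.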
Guo, Bingfu; Guo, Yong; Hong, Huilong; Qiu, Li-Juan
2016-01-01
Molecular characterization of sequence flanking exogenous fragment insertion is essential for safety assessment and labeling of genetically modified organism (GMO). In this study, the T-DNA insertion sites and flanking sequences were identified in two newly developed transgenic glyphosate-tolerant soybeans GE-J16 and ZH10-6 based on whole genome sequencing (WGS) method. More than 22.4 Gb sequence data (∼21 × coverage) for each line was generated on Illumina HiSeq 2500 platform. The junction reads mapped to boundaries of T-DNA and flanking sequences in these two events were identified by comparing all sequencing reads with soybean reference genome and sequence of transgenic vector. The putative insertion loci and flanking sequences were further confirmed by PCR amplification, Sanger sequencing, and co-segregation analysis. All these analyses supported that exogenous T-DNA fragments were integrated in positions of Chr19: 50543767-50543792 and Chr17: 7980527-7980541 in these two transgenic lines. Identification of genomic insertion sites of G2-EPSPS and GAT transgenes will facilitate the utilization of their glyphosate-tolerant traits in soybean breeding program. These results also demonstrated that WGS was a cost-effective and rapid method for identifying sites of T-DNA insertions and flanking sequences in soybean.
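Junction-read detection of the kind described above (reads spanning the T-DNA/genome boundary) can be sketched with exact substring matching. The sequences here are toy examples; the actual analysis aligns all reads against the full soybean reference genome and the transgenic vector:

```python
def find_junction_reads(reads, vector_seq, genome_seq, min_overlap=6):
    """A read is a junction candidate if one end matches the vector (T-DNA)
    and the other end matches the host genome. Exact substring matching is
    used here for clarity; real pipelines use full alignment with mismatches."""
    junctions = []
    for r in reads:
        for split in range(min_overlap, len(r) - min_overlap + 1):
            left, right = r[:split], r[split:]
            if (left in vector_seq and right in genome_seq) or \
               (left in genome_seq and right in vector_seq):
                junctions.append((r, split))
                break
    return junctions

vector = "TTTTTTGGATCCAAAAAA"   # toy T-DNA sequence
genome = "CCCCCCACGTACGTCCCC"   # toy host genome region
reads = ["GGATCCACGTACGT", "ACGTACGTACGTAC"]
hits = find_junction_reads(reads, vector, genome)
```

The split position of each junction read localises the insertion breakpoint, which is then confirmed by PCR and Sanger sequencing as in the study.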
2011-01-01
Background Anophthalmia/microphthalmia (A/M) is caused by mutations in several different transcription factors, but mutations in each causative gene are relatively rare, emphasizing the need for a testing approach that screens multiple genes simultaneously. We used next-generation sequencing to screen 15 A/M patients for mutations in 9 pathogenic genes to evaluate this technology for screening in A/M. Methods We used a pooled sequencing design, together with custom single nucleotide polymorphism (SNP) calling software. We verified predicted sequence alterations using Sanger sequencing. Results We verified three mutations - c.542delC in SOX2, resulting in p.Pro181Argfs*22, p.Glu105X in OTX2 and p.Cys240X in FOXE3. We found several novel sequence alterations and SNPs that were likely to be non-pathogenic - p.Glu42Lys in CRYBA4, p.Val201Met in FOXE3 and p.Asp291Asn in VSX2. Our analysis methodology gave one false positive result comprising a mutation in PAX6 (c.1268A > T, predicting p.X423LeuextX*15) that was not verified by Sanger sequencing. We also failed to detect one 20 base pair (bp) deletion and one 3 bp duplication in SOX2. Conclusions Our results demonstrated the power of next-generation sequencing with pooled sample groups for the rapid screening of candidate genes for A/M as we were correctly able to identify disease-causing mutations. However, next-generation sequencing was less useful for small, intragenic deletions and duplications. We did not find mutations in 10/15 patients and conclude that there is a need for further gene discovery in A/M. PMID:22204637
A New Method for Generating Probability Tables in the Unresolved Resonance Region
Holcomb, Andrew M.; Leal, Luiz C.; Rahnema, Farzad; ...
2017-04-18
A new method for constructing probability tables in the unresolved resonance region (URR) has been developed. This methodology is an extensive modification of the single-level Breit-Wigner (SLBW) pseudo-resonance pair sequence method commonly used to generate probability tables in the URR. The new method uses a Monte Carlo process to generate many pseudo-resonance sequences by first sampling the average resonance parameter data in the URR and then converting the sampled resonance parameters to the more robust R-matrix limited (RML) format. For each sampled set of pseudo-resonance sequences, the temperature-dependent cross sections are reconstructed on a small grid around the energy of reference using the Reich-Moore formalism and the Leal-Hwang Doppler broadening methodology. The effective cross sections calculated at the energies of reference are then used to construct probability tables in the URR. The RML cross-section reconstruction algorithm has been rigorously tested for a variety of isotopes, including 16O, 19F, 35Cl, 56Fe, 63Cu, and 65Cu. The new URR method also produced normalized cross-section factor probability tables for 238U that were found to be in agreement with current standards. The modified 238U probability tables were shown to produce results in excellent agreement with several standard benchmarks, including the IEU-MET-FAST-007 (BIG TEN), IEU-MET-FAST-003, and IEU-COMP-FAST-004 benchmarks.
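The final tabulation step, turning many sampled cross-section values at a reference energy into a probability table, can be sketched generically. This toy version bins values into equal-probability bands; it omits the resonance sampling, RML reconstruction, and Doppler broadening that actually produce the samples:

```python
def probability_table(sampled_xs, n_bands):
    """Sort sampled cross sections at one reference energy into
    equal-probability bands and record each band's mean value.
    Returns a list of (probability, mean cross section) pairs."""
    xs = sorted(sampled_xs)
    size = len(xs) // n_bands
    bands = []
    for b in range(n_bands):
        # Last band absorbs any remainder when len(xs) % n_bands != 0.
        chunk = xs[b * size:(b + 1) * size] if b < n_bands - 1 else xs[b * size:]
        bands.append((1.0 / n_bands, sum(chunk) / len(chunk)))
    return bands

table = probability_table([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2)
```

A transport code can then sample a band with its probability and use that band's cross section, reproducing the statistical self-shielding behaviour of the URR.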
Automatic Generation of Passer-by Record Images using Internet Camera
NASA Astrophysics Data System (ADS)
Terada, Kenji; Atsuta, Koji
Recently, many brutal crimes have shocked us, while the proportion of solved crimes has declined. The importance of security and self-defense has therefore increased more and more. As an example of self-defense, many surveillance cameras are installed in buildings, homes and offices. But even if we want to detect a suspicious person, we cannot check the surveillance videos immediately, because a huge number of image sequences is stored in each video system. In this paper, we propose an automatic method of generating passer-by record images using an internet camera. In the first step, passers-by are recognized in the image sequence obtained from the internet camera. Our method separates the subject region into individual persons using the space-time image, and also obtains the time, direction and number of passers-by from this space-time image. Next, the method detects five characteristics: the center of gravity, the position of the person's head, the brightness, the size, and the shape of the person. Finally, an image of each person is selected from the image sequence by integrating the five characteristics and is added to the passer-by record image. Some experimental results using a simple experimental system are also reported, which indicate the effectiveness of the proposed method. In most scenes, every person was detected by the proposed method and the passer-by record image was generated correctly.
Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd
2013-01-17
Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, like Alzheimer's disease, and the number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been found experimentally. Experimental testing of all possible amino acid combinations is currently not feasible; instead, they can be predicted by computational methods. The 3D profile is a physicochemical method that has generated the largest dataset to date, ZipperDB; however, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides using a more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were the Alternating Decision Tree and the Multilayer Perceptron. Both obtained an area under the ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency.
Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as the Alternating Decision Tree and the Multilayer Perceptron can replace the energy-based classifier, with the advantage of greatly reduced computational time and a much simpler analysis. Additionally, a decision tree provides a set of easily interpretable rules.
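A toy version of this classification setup, a threshold classifier over one physicochemical feature (Kyte-Doolittle hydropathy), gives the flavour of replacing the energy-based method with a learned rule. The peptides, labels, and single-feature model are hypothetical stand-ins for the paper's decision trees and perceptrons:

```python
# Kyte-Doolittle hydropathy values for the amino acids used below.
HYDROPATHY = {"A": 1.8, "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8,
              "G": -0.4, "S": -0.8, "N": -3.5, "D": -3.5, "K": -3.9}

def mean_hydropathy(peptide):
    """Average hydropathy of a hexapeptide: one scalar feature."""
    return sum(HYDROPATHY[a] for a in peptide) / len(peptide)

def learn_threshold(train):
    """'Train' a one-rule classifier: the midpoint between the mean
    feature values of the two classes."""
    pos = [mean_hydropathy(p) for p, y in train if y]
    neg = [mean_hydropathy(p) for p, y in train if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

train = [("IIVLFA", True), ("VVIALF", True),     # hypothetical amyloidogenic
         ("DKSNGS", False), ("NKDSGS", False)]   # hypothetical non-amyloidogenic
t = learn_threshold(train)
pred = mean_hydropathy("IVIVLA") > t  # classify an unseen hexapeptide
```

Once such a rule is learned, classifying a hexapeptide costs a handful of arithmetic operations, which is why the machine-learning route runs in seconds instead of CPU-hours.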
Single-cell genomic sequencing using Multiple Displacement Amplification.
Lasken, Roger S
2007-10-01
Single microbial cells can now be sequenced using DNA amplified by the Multiple Displacement Amplification (MDA) reaction. The few femtograms of DNA in a bacterium are amplified into micrograms of high molecular weight DNA suitable for DNA library construction and Sanger sequencing. The MDA-generated DNA also performs well when used directly as template for pyrosequencing by the 454 Life Sciences method. While MDA from single cells loses some of the genomic sequence, this approach will greatly accelerate the pace of sequencing from uncultured microbes. The genetically linked sequences from single cells are also a powerful tool to be used in guiding genomic assembly of shotgun sequences of multiple organisms from environmental DNA extracts (metagenomic sequences).
Genomic signal processing methods for computation of alignment-free distances from DNA sequences.
Borrayo, Ernesto; Mendizabal-Ruiz, E Gerardo; Vélez-Pérez, Hugo; Romo-Vázquez, Rebeca; Mendizabal, Adriana P; Morales, J Alejandro
2014-01-01
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
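A doublet-value mapping of the kind described can be sketched as follows; the exact amplitude assignment and the three DSP distance metrics used by GAFD are not given in the abstract, so the 16-level encoding and the Euclidean distance below are illustrative assumptions.

```python
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def doublet_signal(seq):
    """Map a DNA string to a numeric signal using overlapping
    dinucleotides, so each sample takes one of 16 amplitude levels
    rather than the 4 of a single-base mapping."""
    s = seq.upper()
    return [4 * BASE[a] + BASE[b] for a, b in zip(s, s[1:])]

def signal_distance(u, v):
    """A simple alignment-free distance between two equal-length
    signals (Euclidean here; GAFD evaluates several DSP metrics)."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
```

The larger alphabet of amplitude values is the point: a dinucleotide signal distinguishes contexts (e.g. "AC" vs. "AG") that a single-base mapping collapses.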
An Integrated SNP Mining and Utilization (ISMU) Pipeline for Next Generation Sequencing Data
Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M.; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A. V. S. K.; Varshney, Rajeev K.
2014-01-01
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of a command line interface, massive computational resources, and expertise, which is daunting for biologists. Further, the SNP information generated may not be readily usable for downstream processes such as genotyping. Hence, a comprehensive pipeline, Integrated SNP Mining and Utilization (ISMU), has been developed by integrating several open source next generation sequencing (NGS) tools with a graphical user interface for SNP discovery and their utilization in genotyping assays. The pipeline features pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction methods (SAMtools/SOAPsnp/CNS2snp and CbCC), and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of the genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and any errors. The pipeline also provides a confidence score or polymorphism information content value, with flanking sequences, for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets, such as whole genome re-sequencing, restriction site associated DNA sequencing, and transcriptome sequencing data, at high speed, and it has been parallelized to handle very large next generation sequencing datasets. It is particularly useful for the plant genetics and breeding community, allowing researchers without computational expertise to discover SNPs and use them in genomics, genetics, and breeding studies.
It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software. PMID:25003610
Correcting for batch effects in case-control microbiome studies
Gibbons, Sean M.; Duvallet, Claire
2018-01-01
High-throughput data generation platforms, such as mass spectrometry, microarrays, and second-generation sequencing, are susceptible to batch effects due to run-to-run variation in reagents, equipment, protocols, or personnel. Currently, batch correction methods are not commonly applied to microbiome sequencing datasets. In this paper, we compare different batch-correction methods applied to microbiome case-control studies. We introduce a model-free normalization procedure in which features (i.e., bacterial taxa) in case samples are converted to percentiles of the equivalent features in control samples within a study, prior to pooling data across studies. We compare this percentile-normalization method to traditional meta-analysis methods for combining independent p-values, and to limma and ComBat, widely used batch-correction models developed for RNA microarray data. Overall, we show that percentile-normalization is a simple, non-parametric approach for correcting batch effects and improving sensitivity in case-control meta-analyses. PMID:29684016
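The core percentile-conversion step can be sketched in a few lines; the handling of ties below (mean-rank percentile) is an assumption, as the abstract does not specify how the published procedure treats ties or zero counts.

```python
from bisect import bisect_left, bisect_right

def percentile_normalize(case_values, control_values):
    """Convert each case measurement to its percentile (0-100) within
    the within-study control distribution, so that studies become
    comparable before pooling (sketch of the model-free
    percentile-normalization idea, not the published code)."""
    ctrl = sorted(control_values)
    n = len(ctrl)
    out = []
    for x in case_values:
        lo = bisect_left(ctrl, x)   # controls strictly below x
        hi = bisect_right(ctrl, x)  # controls at or below x
        out.append(100.0 * (lo + hi) / (2 * n))  # mean-rank percentile
    return out
```

Because every feature is expressed relative to that study's own controls, run-to-run shifts that affect cases and controls alike cancel out before the cross-study pooling step.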
DOE Office of Scientific and Technical Information (OSTI.GOV)
Podoshvedov, S. A., E-mail: podoshvedov@mail.ru
A method to generate Schrödinger cat states in free propagating optical fields, based on the use of displaced states (or displacement operators), is developed. Some optical schemes with photon-added coherent states are studied. The schemes are modifications of the general method, based on a sequence of displacements and photon additions or subtractions adjusted to generate Schrödinger cat states of a larger size. The effects of detection inefficiency are taken into account.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pilus, Nur Shazwani Mohd; Ahmad, Azrin; Yusof, Nurul Yuziana Mohd
Scaffold/matrix attachment regions (S/MARs) are potential elements that can be integrated into expression vectors to increase expression of recombinant proteins. Many studies on S/MARs have been done, but none has revealed the distribution of S/MARs in a genome. In this study, we isolated S/MAR sequences from HEK293 and Chinese hamster ovary (CHO DG44) cell lines using two different methods, utilizing 2 M NaCl and lithium-3,5-diiodosalicylate (LIS). The isolated S/MARs were sequenced on a Next Generation Sequencing (NGS) platform. Based on reference mapping analysis against the human genome database, totals of 8,994,856 and 8,412,672 contigs of S/MAR sequences were retrieved from the 2 M NaCl and LIS extractions of HEK293, respectively. On the other hand, reference mapping analysis of S/MARs derived from CHO DG44 against our own CHO DG44 database generated totals of 7,204,348 and 4,672,913 contigs from the 2 M NaCl and LIS extraction methods, respectively.
SPMBR: a scalable algorithm for mining sequential patterns based on bitmaps
NASA Astrophysics Data System (ADS)
Xu, Xiwei; Zhang, Changhai
2013-12-01
Some existing sequential pattern mining algorithms generate too many candidate sequences, increasing the processing cost of support counting. We therefore present an effective and scalable algorithm called SPMBR (Sequential Patterns Mining based on Bitmap Representation) to solve the problem of mining sequential patterns in large databases. Our method differs from previous work on mining sequential patterns mainly in that the sequence database is represented by bitmaps, for which a simplified bitmap structure is first introduced. The algorithm generates candidate sequences by sequence extension (SE) and item extension (IE), and then obtains all frequent sequences by comparing the original bitmap with the extended-item bitmap. This approach simplifies the problem of mining sequential patterns and avoids the high processing cost of support counting. Both theory and experiments indicate that SPMBR performs well on large transaction databases, that much less memory is required for storing temporal data during mining, and that all sequential patterns can be mined feasibly.
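The general idea behind bitmap-based support counting can be sketched as follows; this illustrates plain itemset support with one bit per transaction, not SPMBR's more elaborate sequence-extension bitmaps, and all names are illustrative.

```python
def build_bitmaps(transactions):
    """One integer bitmap per item: bit t is set iff the item occurs
    in transaction t (Python ints serve as arbitrary-width bitmaps)."""
    bm = {}
    for t, items in enumerate(transactions):
        for it in items:
            bm[it] = bm.get(it, 0) | (1 << t)
    return bm

def itemset_support(bm, items):
    """Support of an itemset: AND the per-item bitmaps and count the
    set bits, avoiding a scan over the raw transaction database."""
    acc = bm[items[0]]
    for it in items[1:]:
        acc &= bm[it]
    return bin(acc).count("1")
```

Replacing per-candidate database scans with bitwise AND plus a popcount is what makes this representation cheap: extending a candidate only requires intersecting bitmaps that are already in memory.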
Tabata, Ryo; Kamiya, Takehiro; Shigenobu, Shuji; Yamaguchi, Katsushi; Yamada, Masashi; Hasebe, Mitsuyasu; Fujiwara, Toru; Sawa, Shinichiro
2013-01-01
Next-generation sequencing (NGS) technologies enable the rapid production of an enormous quantity of sequence data. These powerful new technologies allow the identification of mutations by whole-genome sequencing. However, most reported NGS-based mapping methods, which are based on bulked segregant analysis, are costly and laborious. To address these limitations, we designed a versatile NGS-based mapping method that combines low- to medium-coverage multiplex SOLiD (Sequencing by Oligonucleotide Ligation and Detection) with classical genetic rough mapping. Using only low to medium coverage reduces the SOLiD sequencing costs and, since just 10 to 20 mutant F2 plants are required for rough mapping, the operation is simple enough to handle in a laboratory with limited space and funding. As a proof of principle, we successfully applied this method to identify CTR1, which is involved in boron-mediated root development, from a population of high-boron-requiring Arabidopsis thaliana mutants. Our work demonstrates that this NGS-based mapping method is a moderately priced and versatile method that can readily be applied to other model organisms. PMID:23104114
Watanabe, Manabu; Kusano, Junko; Ohtaki, Shinsaku; Ishikura, Takashi; Katayama, Jin; Koguchi, Akira; Paumen, Michael; Hayashi, Yoshiharu
2014-09-01
Combining single-cell methods and next-generation sequencing should provide a powerful means to understand single-cell biology and obviate the effects of sample heterogeneity. Here we report a single-cell identification method and seamless cancer gene profiling using semiconductor-based massively parallel sequencing. A549 cells (an adenocarcinomic human alveolar basal epithelial cell line) were used as a model. Single-cell capture was performed using laser capture microdissection (LCM) with an Arcturus® XT system, and a captured single cell and a bulk population of A549 cells (≈10^6 cells) were subjected to whole genome amplification (WGA). For cell identification, a multiplex PCR method (AmpliSeq™ SNP HID panel) was used to enrich 136 highly discriminatory SNPs with a genotype concordance probability of 10^(31-35). For cancer gene profiling, mutation profiling was performed in parallel using a hotspot panel for 50 cancer-related genes. Sequencing was performed using a semiconductor-based benchtop sequencer. The distribution of sequence reads for both HID and Cancer panel amplicons was consistent across these samples. For the bulk population of cells, the percentages of sequence covered at more than 100× coverage were 99.04% for the HID panel and 98.83% for the Cancer panel, while for the single cell they were 55.93% and 65.96%, respectively. Partial amplification failure, randomly distributed non-amplified regions across single-cell samples during the WGA procedure, or random allele drop-out probably caused these differences. However, comparative analyses showed that this method successfully discriminated a single A549 cancer cell from a bulk population of A549 cells. Thus, our approach provides a powerful means to overcome tumor sample heterogeneity when searching for somatic mutations.
A method for multi-codon scanning mutagenesis of proteins based on asymmetric transposons.
Liu, Jia; Cropp, T Ashton
2012-02-01
Random mutagenesis followed by selection or screening is a commonly used strategy to improve protein function. Despite many available methods for random mutagenesis, nearly all generate mutations at the nucleotide level. An ideal mutagenesis method would allow for the generation of 'codon mutations' to change protein sequence with defined or mixed amino acids of choice. Herein we report a method that allows for mutations of one, two or three consecutive codons. Key to this method is the development of a Mu transposon variant with asymmetric terminal sequences. As a demonstration of the method, we performed multi-codon scanning on the gene encoding superfolder GFP (sfGFP). Characterization of 50 randomly chosen clones from each library showed that more than 40% of the mutants in these three libraries contained seamless, in-frame mutations with low site preference. By screening only 500 colonies from each library, we successfully identified several spectral-shift mutations, including an S205D variant that was found to bear a single excitation peak in the UV region.
Lin, You-Yu; Hsieh, Chia-Hung; Chen, Jiun-Hong; Lu, Xuemei; Kao, Jia-Horng; Chen, Pei-Jer; Chen, Ding-Shinn; Wang, Hurng-Yi
2017-04-26
The accuracy of metagenomic assembly is usually compromised by high levels of polymorphism, because divergent reads from the same genomic region are recognized as different loci when sequenced and assembled together. A viral quasispecies is a group of abundant and diversified, genetically related viruses found in a single carrier. Current mainstream assembly methods, such as Velvet and SOAPdenovo, were not originally intended for the assembly of such metagenomic data, creating a demand for new methods that provide accurate and informative assembly results. In this study, we present a hybrid method for assembling highly polymorphic data, combining the partial de novo-reference assembly (PDR) strategy and the BLAST-based assembly pipeline (BBAP). The PDR strategy generates in situ reference sequences through de novo assembly of a randomly extracted partial data set, which is subsequently used for reference assembly of the full data set. BBAP employs a greedy algorithm to assemble polymorphic reads. We used 12 hepatitis B virus quasispecies NGS data sets from a previous study to assess and compare the performance of PDR and BBAP. Analyses suggest that the high polymorphism of a full metagenomic data set leads to fragmented de novo assembly results, whereas the biased or limited representation of external reference sequences incorporates fewer reads into the assembly, with lower assembly accuracy and variation sensitivity. In comparison, the PDR-generated in situ reference sequence incorporated more reads into the final PDR assembly of the full metagenomic data set, with greater accuracy and higher variation sensitivity. BBAP assembly results also suggest higher assembly efficiency and accuracy compared to other assembly methods. Additionally, BBAP assembly recovered HBV structural variants that were not observed in the assembly results of other methods. Together, the PDR/BBAP assembly results were significantly better than those of the other compared methods.
Both PDR and BBAP independently increased the assembly efficiency and accuracy of highly polymorphic data, and assembly performances were further improved when used together. BBAP also provides nucleotide frequency information. Together, PDR and BBAP provide powerful tools for metagenomic data studies.
Brancaccio, Rosario N; Robitaille, Alexis; Dutta, Sankhadeep; Cuenin, Cyrille; Santare, Daiga; Skenders, Girts; Leja, Marcis; Fischer, Nicole; Giuliano, Anna R; Rollison, Dana E; Grundhoff, Adam; Tommasino, Massimo; Gheit, Tarik
2018-05-07
With the advent of new molecular tools, the discovery of new papillomaviruses (PVs) has accelerated during the past decade, enabling the expansion of knowledge about the viral populations that inhabit the human body. Human PVs (HPVs) are etiologically linked to benign or malignant lesions of the skin and mucosa. The detection of HPV types can vary widely, depending mainly on the methodology and the quality of the biological sample. Next-generation sequencing is one of the most powerful tools, enabling the discovery of novel viruses in a wide range of biological material. Here, we report a novel protocol for the detection of known and unknown HPV types in human skin and oral gargle samples using improved PCR protocols combined with next-generation sequencing. We identified 105 putative new PV types in addition to 296 known types, thus providing important information about the viral distribution in the oral cavity and skin. Copyright © 2018. Published by Elsevier Inc.
A deep learning method for lincRNA detection using auto-encoder algorithm.
Yu, Ning; Yu, Zeng; Pan, Yi
2017-12-06
RNA sequencing (RNA-seq) enables scientists to develop novel data-driven methods for discovering unidentified lincRNAs. Meanwhile, knowledge-based technologies are experiencing a potential revolution ignited by new deep learning methods. By scanning newly generated RNA-seq data sets, scientists have found that (1) the expression of lincRNAs appears to be regulated, that is, relevance exists along the DNA sequences; and (2) lincRNAs contain conserved patterns/motifs tethered together by non-conserved regions. These two observations motivate adopting knowledge-based deep learning methods for lincRNA detection. Similar to coding-region transcription, non-coding regions are split at transcriptional sites; however, regulatory RNAs rather than messenger RNAs are generated. That is, the transcribed RNAs participate in biological processes as regulatory units instead of producing proteins. Identifying these transcriptional regions within non-coding DNA is the first step towards lincRNA recognition. The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show the excellent performance of the predictive deep neural network on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, the method is validated on the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has extensive ability for lincRNA prediction. The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data. Subsequently, a two-layer deep neural network is developed for lincRNA detection, which adopts the auto-encoder algorithm and utilizes different encoding schemes to obtain the best performance over intergenic DNA sequence data.
Driven by the newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can capture useful features and information correlations along genomic DNA sequences for lincRNA detection. To our knowledge, this is the first application of deep learning techniques to identifying lincRNA transcription sequences.
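The pre-processing implied by such a pipeline — sliding candidate windows over a sequence and encoding each window numerically for a dense input layer — can be sketched as below; the window size, step, and 2-bit encoding are illustrative assumptions, not the paper's stated parameters or its specific encoding schemes.

```python
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def windows(seq, size, step):
    """Slide a fixed window along a DNA sequence to produce candidate
    segments for a transcription-site classifier."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

def encode_window(win):
    """Flatten one window into a binary vector suitable as input to a
    dense neural-network layer (one of many possible encodings)."""
    return [bit for base in win.upper() for bit in CODE[base]]
```

Each encoded window would then be fed to the auto-encoder for feature learning, with the downstream classifier deciding whether the window covers a transcriptional site.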
Next-generation technologies and data analytical approaches for epigenomics.
Mensaert, Klaas; Denil, Simon; Trooskens, Geert; Van Criekinge, Wim; Thas, Olivier; De Meyer, Tim
2014-04-01
Epigenetics refers to the collection of heritable features that modulate the genome-environment interaction without being encoded in the actual DNA sequence. While being mitotically and sometimes even meiotically transmitted, epigenetic traits often demonstrate extensive flexibility. This allows cells to acquire diverse gene expression patterns during differentiation, but also to adapt to a changing environment. However, epigenetic alterations are not always beneficial to the organism, as they are, for example, frequently identified in human diseases such as cancer. Accurate and cost-efficient genome-scale profiling of epigenetic features is thus of major importance to pinpoint these "epimutations," for example, to monitor the epigenetic impact of environmental exposure. Over the last decade, the field of epigenetics has been revolutionized by several innovative "epigenomics" technologies exactly addressing this need. In this review, we discuss and compare widely used next-generation methods to assess DNA methylation and hydroxymethylation, noncoding RNA expression, histone modifications, and nucleosome positioning. Although recent methods are typically based on "second-generation" sequencing, we also pay attention to still commonly used array- and PCR-based methods, and look forward to the additional advantages of single-molecule sequencing. As the current bottleneck in epigenomics research is the analysis rather than generation of data, the basic difficulties and problem-solving strategies regarding data preprocessing and statistical analysis are introduced for the different technologies. Finally, we also consider the complications associated with epigenomic studies of species with yet unsequenced genomes and possible solutions. Copyright © 2013 Wiley Periodicals, Inc.
Liao, Xiaolei; Zhao, Juanjuan; Jiao, Cheng; Lei, Lei; Qiang, Yan; Cui, Qiang
2016-01-01
Background: Lung parenchyma segmentation is often performed as an important pre-processing step in the computer-aided diagnosis of lung nodules based on CT image sequences. However, existing lung parenchyma image segmentation methods cannot fully segment all lung parenchyma images and have a slow processing speed, particularly for images in the top and bottom of the lung and the images that contain lung nodules. Method: Our proposed method first uses the position of the lung parenchyma image features to obtain lung parenchyma ROI image sequences. A gradient and sequential linear iterative clustering algorithm (GSLIC) for sequence image segmentation is then proposed to segment the ROI image sequences and obtain superpixel samples. The SGNF, which is optimized by a genetic algorithm (GA), is then utilized for superpixel clustering. Finally, the grey and geometric features of the superpixel samples are used to identify and segment all of the lung parenchyma image sequences. Results: Our proposed method achieves higher segmentation precision and greater accuracy in less time. It has an average processing time of 42.21 seconds for each dataset and an average volume pixel overlap ratio of 92.22 ± 4.02% for four types of lung parenchyma image sequences. PMID:27532214
Clinical analysis of genome next-generation sequencing data using the Omicia platform
Coonrod, Emily M; Margraf, Rebecca L; Russell, Archie; Voelkerding, Karl V; Reese, Martin G
2013-01-01
Aims: Next-generation sequencing is being implemented in the clinical laboratory environment for the purposes of candidate causal variant discovery in patients affected with a variety of genetic disorders. The successful implementation of this technology for diagnosing genetic disorders requires a rapid, user-friendly method to annotate variants and generate short lists of clinically relevant variants of interest. This report describes Omicia's Opal platform, a new software tool designed for variant discovery and interpretation in a clinical laboratory environment. The software allows clinical scientists to process, analyze, interpret and report on personal genome files. Materials & Methods: To demonstrate the software, the authors describe the interactive use of the system for the rapid discovery of disease-causing variants using three cases. Results & Conclusion: Here, the authors show the features of the Opal system and their use in uncovering variants of clinical significance. PMID:23895124
Technical Considerations for Reduced Representation Bisulfite Sequencing with Multiplexed Libraries
Chatterjee, Aniruddha; Rodger, Euan J.; Stockwell, Peter A.; Weeks, Robert J.; Morison, Ian M.
2012-01-01
Reduced representation bisulfite sequencing (RRBS), which couples bisulfite conversion and next generation sequencing, is an innovative method that specifically enriches genomic regions with a high density of potential methylation sites and enables investigation of DNA methylation at single-nucleotide resolution. Recent advances in the Illumina DNA sample preparation protocol and sequencing technology have vastly improved sequencing throughput capacity. Although the new Illumina technology is now widely used, the unique challenges associated with multiplexed RRBS libraries on this platform have not been previously described. We have made modifications to the RRBS library preparation protocol to sequence multiplexed libraries on a single flow cell lane of the Illumina HiSeq 2000. Furthermore, our analysis incorporates a bioinformatics pipeline specifically designed to process bisulfite-converted sequencing reads and evaluate the output and quality of the sequencing data generated from the multiplexed libraries. We obtained an average of 42 million paired-end reads per sample for each flow-cell lane, with a high unique mapping efficiency to the reference human genome. Here we provide a roadmap of modifications, strategies, and troubleshooting approaches we implemented to optimize sequencing of multiplexed libraries on an RRBS background. PMID:23193365
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
Reid-Bayliss, Kate S; Loeb, Lawrence A
2017-08-29
Transcriptional mutagenesis (TM) due to misincorporation during RNA transcription can result in mutant RNAs, or epimutations, that generate proteins with altered properties. TM has long been hypothesized to play a role in aging, cancer, and viral and bacterial evolution. However, inadequate methodologies have limited progress in elucidating a causal association. We present a high-throughput, highly accurate RNA sequencing method to measure epimutations with single-molecule sensitivity. Accurate RNA consensus sequencing (ARC-seq) uniquely combines RNA barcoding and generation of multiple cDNA copies per RNA molecule to eliminate errors introduced during cDNA synthesis, PCR, and sequencing. The stringency of ARC-seq can be scaled to accommodate the quality of input RNAs. We apply ARC-seq to directly assess transcriptome-wide epimutations resulting from RNA polymerase mutants and oxidative stress.
MOCASSIN-prot: a multi-objective clustering approach for protein similarity networks.
Keel, Brittney N; Deng, Bo; Moriyama, Etsuko N
2018-04-15
Proteins often include multiple conserved domains. Various evolutionary events, including duplication and loss of domains, domain shuffling, and sequence divergence, contribute to generating complexities in protein structures and, consequently, in their functions. The evolutionary history of proteins is hence best modeled through networks that incorporate information from both sequence divergence and domain content. Here, a game-theoretic approach proposed for protein network construction is adapted into the framework of multi-objective optimization and extended to incorporate a clustering refinement procedure. The new method, MOCASSIN-prot, was applied to cluster multi-domain proteins from ten genomes. The performance of MOCASSIN-prot was compared against two protein clustering methods, Markov clustering (TRIBE-MCL) and spectral clustering (SCPS). We showed that compared to these two methods, MOCASSIN-prot, which uses both domain composition and quantitative sequence similarity information, generates fewer false positives. It achieves more functionally coherent protein clusters and better differentiates protein families. MOCASSIN-prot, implemented in Perl and Matlab, is freely available at http://bioinfolab.unl.edu/emlab/MOCASSINprot. emoriyama2@unl.edu. Supplementary data are available at Bioinformatics online.
Current and future molecular diagnostics for ocular infectious diseases.
Doan, Thuy; Pinsky, Benjamin A
2016-11-01
Confirmation of ocular infections can pose great challenges to the clinician. A fundamental limitation is the small amounts of specimen that can be obtained from the eye. Molecular diagnostics can circumvent this limitation and have been shown to be more sensitive than conventional culture. The purpose of this review is to describe new molecular methods and to discuss the applications of next-generation sequencing-based approaches in the diagnosis of ocular infections. Efforts have focused on improving the sensitivity of pathogen detection using molecular methods. This review describes a new molecular target for Toxoplasma gondii-directed polymerase chain reaction assays. Molecular diagnostics for Chlamydia trachomatis and Acanthamoeba species are also discussed. Finally, we describe a hypothesis-free approach, metagenomic deep sequencing, which can detect DNA and RNA pathogens from a single specimen in one test. In some cases, this method can provide the geographic location and timing of the infection. Pathogen-directed PCRs have been powerful tools in the diagnosis of ocular infections for over 20 years. The use of next-generation sequencing-based approaches, when available, will further improve sensitivity of detection with the potential to improve patient care.
Park, Doori; Park, Su-Hyun; Ban, Yong Wook; Kim, Youn Shic; Park, Kyoung-Cheul; Kim, Nam-Soo; Kim, Ju-Kon; Choi, Ik-Young
2017-08-15
Genetically modified (GM) crops have been developed to improve the agricultural traits of modern crop cultivars. Safety assessments of GM crops are of paramount importance in research at the developmental stage and before transgenic plants are released into the marketplace. Sequencing technology is developing rapidly, with higher output and better labor efficiency, and will eventually replace existing methods for the molecular characterization of genetically modified organisms. To detect the transgene insertion locations in three GM rice genomes, Illumina sequencing reads were mapped against both the rice genome and the plasmid sequence; the mapped reads were then classified to characterize the junction sites between plant and transgene sequences by sequence alignment. Herein, we present a next generation sequencing (NGS)-based molecular characterization method, using the transgenic rice plants SNU-Bt9-5, SNU-Bt9-30, and SNU-Bt9-109. Specifically, using bioinformatics tools, we detected the precise insertion locations and copy numbers of transfer DNA, genetic rearrangements, and the absence of backbone sequences, with results equivalent to those obtained from Southern blot analyses. NGS methods have been suggested as an effective means of characterizing and detecting transgene insertion locations in genomes. Our results demonstrate that a combination of NGS technology and bioinformatics approaches offers cost- and time-effective methods for assessing the safety of transgenic plants.
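The junction-characterization idea — a read whose prefix matches one reference and whose suffix matches the other marks an insertion boundary — can be sketched with naive substring search; real pipelines use a proper aligner and soft-clip analysis, so this exact-match heuristic and its parameters are purely illustrative.

```python
def is_junction_read(read, genome, plasmid, min_part=20):
    """Heuristic sketch: a read supports a plant/transgene junction if
    some split point leaves its prefix in one reference and its suffix
    in the other. min_part guards against spurious short matches."""
    for i in range(min_part, len(read) - min_part + 1):
        p, s = read[:i], read[i:]
        if (p in genome and s in plasmid) or (p in plasmid and s in genome):
            return True
    return False
```

Counting distinct junction positions supported by such reads is one way a pipeline can estimate insertion loci and copy number from mapped data.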
Using single cell sequencing data to model the evolutionary history of a tumor.
Kim, Kyung In; Simon, Richard
2014-01-24
The introduction of next-generation sequencing (NGS) technology has made it possible to detect genomic alterations within tumor cells on a large scale. However, most applications of NGS show the genetic content of mixtures of cells. Recently developed single cell sequencing technology can identify variation within a single cell. Characterization of multiple samples from a tumor using single cell sequencing can potentially provide information on the evolutionary history of that tumor. This may facilitate understanding how key mutations accumulate and evolve in lineages to form a heterogeneous tumor. We provide a computational method to infer an evolutionary mutation tree based on single cell sequencing data. Our approach differs from traditional phylogenetic tree approaches in that our mutation tree directly describes temporal order relationships among mutation sites. Our method also accommodates sequencing errors. Furthermore, we provide a method for estimating the proportion of time from the earliest mutation event of the sample to the most recent common ancestor of the sample of cells. Finally, we discuss current limitations on modeling with single cell sequencing data and possible improvements under those limitations. Inferring the temporal ordering of mutational sites using current single cell sequencing data is a challenge. Our proposed method may help elucidate relationships among key mutations and their role in tumor progression.
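The temporal-ordering idea in the abstract above can be illustrated with a minimal sketch. Under a simple accumulation model (no back-mutation, error-free calls, a single lineage), a mutation that arose earlier is present in at least as many cells as one that arose later, so ranking sites by cell count yields a candidate order. This is an illustrative simplification, not the paper's actual method, which additionally models sequencing error.

```python
def mutation_order(matrix):
    # matrix[cell][site] = 1 if the mutation was called in that cell, else 0.
    # Rank sites by how many cells carry them: under a no-back-mutation
    # model, earlier mutations appear in at least as many cells.
    n_sites = len(matrix[0])
    counts = [sum(row[s] for row in matrix) for s in range(n_sites)]
    return sorted(range(n_sites), key=lambda s: -counts[s])
```

For example, three cells genotyped at three sites, where site 0 is present in all cells and site 2 in only one, yield the order [0, 1, 2].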
Tracking B-Cell Repertoires and Clonal Histories in Normal and Malignant Lymphocytes.
Weston-Bell, Nicola J; Cowan, Graeme; Sahota, Surinder S
2017-01-01
Methods for tracking B-cell repertoires and clonal history in normal and malignant B-cells based on immunoglobulin variable region (IGV) gene analysis have developed rapidly with the advent of massive parallel next-generation sequencing (mpNGS) protocols. mpNGS permits a depth of analysis of IGV genes not hitherto feasible, and presents challenges of bioinformatics analysis, which can be readily met by current pipelines. This strategy offers a potential resolution of B-cell usage at a depth that may capture fully the natural state, in a given biological setting. Conventional methods based on RT-PCR amplification and Sanger sequencing are also available where mpNGS is not accessible. Each method offers distinct advantages. Conventional methods for IGV gene sequencing are readily adaptable to most laboratories and provide an ease of analysis to capture salient features of B-cell use. This chapter describes two methods in detail for analysis of IGV genes, mpNGS and conventional RT-PCR with Sanger sequencing.
CNV-seq, a new method to detect copy number variation using high-throughput sequencing.
Xie, Chao; Tammi, Martti T
2009-03-06
DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations. Here, we describe CNV-seq, a method to detect copy number variation using shotgun sequencing. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads, is the key factor determining the resolution of detection. This favors next-generation sequencing methods that rapidly produce large amounts of short reads. Simulation of various sequencing methods with coverage between 0.1x and 8x shows overall specificity between 91.7% and 99.9%, and sensitivity between 72.2% and 96.5%. We also show results for assessment of CNV between two individual human genomes.
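The core of a read-count-based CNV comparison can be sketched as follows: count reads per fixed-size window in each sample, normalise by library size, and take the log2 ratio. This is a minimal illustration of the idea, not the CNV-seq implementation itself, which uses sliding windows and a full statistical model for confidence values.

```python
from collections import Counter
import math

def cnv_log2_ratios(reads_x, reads_y, window, genome_len):
    # reads_* are lists of mapped read start positions (illustrative input).
    # Count reads per fixed-size window for each sample.
    cx = Counter(pos // window for pos in reads_x)
    cy = Counter(pos // window for pos in reads_y)
    n_windows = genome_len // window
    total_x = len(reads_x) or 1
    total_y = len(reads_y) or 1
    ratios = {}
    for w in range(n_windows):
        x, y = cx.get(w, 0), cy.get(w, 0)
        if x > 0 and y > 0:
            # Normalise by library size, then take log2 of the ratio;
            # positive values suggest a gain in sample x relative to y.
            ratios[w] = math.log2((x / total_x) / (y / total_y))
    return ratios
```

A window with twice the normalised depth in one sample yields a ratio near +1, consistent with a single-copy gain in a diploid genome.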
Malware Analysis Using Visualized Image Matrices
Im, Eul Gyu
2014-01-01
This paper proposes a novel malware visual analysis method that contains not only a visualization method to convert binary files into images, but also a similarity calculation method between these images. The proposed method generates RGB-colored pixels on image matrices using the opcode sequences extracted from malware samples and calculates the similarities for the image matrices. In particular, our proposed methods are applicable to packed malware samples by applying them to the execution traces extracted through dynamic analysis. When the images are generated, we can reduce the overheads by extracting the opcode sequences only from the blocks that include the instructions related to staple behaviors such as functions and application programming interface (API) calls. In addition, we propose a technique that generates a representative image for each malware family in order to reduce the number of comparisons for the classification of unknown samples; the colored pixel information in the image matrices is used to calculate the similarities between the images. Our experimental results show that the image matrices of malware can effectively be used to classify malware families both statically and dynamically with accuracy of 0.9896 and 0.9732, respectively. PMID:25133202
Secure self-calibrating quantum random-bit generator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fiorentino, M.; Santori, C.; Spillane, S. M.
2007-03-15
Random-bit generators (RBGs) are key components of a variety of information processing applications ranging from simulations to cryptography. In particular, cryptographic systems require 'strong' RBGs that produce high-entropy bit sequences, but traditional software pseudo-RBGs have very low entropy content and therefore are relatively weak for cryptography. Hardware RBGs yield entropy from chaotic or quantum physical systems and therefore are expected to exhibit high entropy, but in current implementations their exact entropy content is unknown. Here we report a quantum random-bit generator (QRBG) that harvests entropy by measuring single-photon and entangled two-photon polarization states. We introduce and implement a quantum tomographic method to measure a lower bound on the 'min-entropy' of the system, and we employ this value to distill a truly random-bit sequence. This approach is secure: even if an attacker takes control of the source of optical states, a secure random sequence can be distilled.
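The min-entropy quantity central to the abstract above is H_min = -log2(p_max), where p_max is the probability of the most likely outcome. A minimal sketch, estimating p_max empirically from a bit sample (the QRBG instead bounds p_max via quantum state tomography, which this sketch does not attempt):

```python
import math
from collections import Counter

def min_entropy_per_bit(bits):
    # Min-entropy H_min = -log2(p_max), with p_max estimated as the
    # empirical frequency of the most common symbol in the sample.
    counts = Counter(bits)
    p_max = max(counts.values()) / len(bits)
    return -math.log2(p_max)
```

A perfectly unbiased bit stream gives 1.0 bit of min-entropy per bit, while a stream that is 75% zeros gives only about 0.415 bits, quantifying how much distillation is needed to obtain truly random output.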
Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.
2015-05-01
The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm shift, individual projects are now capable of generating billions of raw sequence reads that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequence homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment at large scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale.
Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.
Riman, Sarah; Kiesler, Kevin M; Borsuk, Lisa A; Vallone, Peter M
2017-07-01
Standard Reference Materials SRM 2392 and 2392-I are intended to provide quality control when amplifying and sequencing human mitochondrial genome sequences. The National Institute of Standards and Technology (NIST) offers these SRMs to laboratories performing DNA-based forensic human identification, molecular diagnosis of mitochondrial diseases, mutation detection, evolutionary anthropology, and genetic genealogy. The entire mtGenome (∼16569bp) of SRM 2392 and 2392-I have previously been characterized at NIST by Sanger sequencing. Herein, we used the sensitivity, specificity, and accuracy offered by next generation sequencing (NGS) to: (1) re-sequence the certified values of the SRM 2392 and 2392-I; (2) confirm Sanger data with a high coverage new sequencing technology; (3) detect lower level heteroplasmies (<20%); and thus (4) support mitochondrial sequencing communities in the adoption of NGS methods. To obtain a consensus sequence for the SRMs as well as identify and control any bias, sequencing was performed using two NGS platforms and data was analyzed using different bioinformatics pipelines. Our results confirm five low level heteroplasmy sites that were not previously observed with Sanger sequencing: three sites in the GM09947A template in SRM 2392 and two sites in the HL-60 template in SRM 2392-I. Copyright © 2017 Elsevier B.V. All rights reserved.
NASA Technical Reports Server (NTRS)
Wheeler, Ward C.
2003-01-01
A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendant states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable, since the homologies derived from the implied alignment depend on its natal cladogram, and any variance between DO and IA + Search would be due to the heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. © 2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
GeneSilico protein structure prediction meta-server.
Kurowski, Michal A; Bujnicki, Janusz M
2003-07-01
Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta.
Compositions and methods for xylem-specific expression in plant cells
DOE Office of Scientific and Technical Information (OSTI.GOV)
Han, Kyung-Hwan; Ko, Jae-Heung
The invention provides promoter sequences that regulate specific expression of operably linked sequences in developing xylem cells and/or in developing xylem tissue. The developing xylem-specific sequences are exemplified by the DX5, DX8, DX11, and DX15 promoters, portions thereof, and homologs thereof. The invention further provides expression vectors, cells, tissues and plants that contain the invention's sequences. The compositions of the invention and methods of using them are useful in, for example, improving the quantity (biomass) and/or the quality (wood density, lignin content, sugar content, etc.) of expressed biomass feedstock products that may be used for bioenergy, biorefinery, and generating wood products such as pulp, paper, and solid wood.
Song, Dandan; Li, Ning; Liao, Lejian
2015-01-01
Because whole-exome sequencing generates enormous amounts of data at lower cost and in shorter time, it provides dramatic opportunities for identifying disease genes implicated in Mendelian disorders. Since thousands of genomic variants can be sequenced in each exome, it is challenging to filter pathogenic variants in protein-coding regions while keeping the number of missed true variants low. Therefore, an automatic and efficient pipeline for finding disease variants in Mendelian disorders is designed by exploiting a combination of variant-filtering steps to analyze the family-based exome sequencing approach. Recent studies on the Freeman-Sheldon disease are revisited, showing that the proposed method outperforms other existing candidate-gene identification methods.
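One representative family-based filtering step can be sketched as an inheritance-pattern check. The example below keeps variants consistent with autosomal-recessive inheritance: homozygous-alternate in the affected child and heterozygous in both parents. The genotype coding (0/1/2 = ref/het/hom-alt) and field names are hypothetical; the paper's pipeline combines several such filters.

```python
def filter_recessive(variants, affected, parents):
    # Each variant is a dict mapping sample name -> genotype code,
    # where 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alt.
    kept = []
    for v in variants:
        # Recessive model: the affected child must be hom-alt while
        # both unaffected parents are carriers (heterozygous).
        if v[affected] == 2 and all(v[p] == 1 for p in parents):
            kept.append(v)
    return kept
```

Stacking filters like this one (plus population-frequency and functional-impact filters) is what reduces thousands of exome variants to a handful of candidates.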
MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping.
Lee, Wan-Ping; Stromberg, Michael P; Ward, Alistair; Stewart, Chip; Garrison, Erik P; Marth, Gabor T
2014-01-01
MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
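The seeding stage of a hash-based mapper like MOSAIK can be sketched as a k-mer index plus seed voting: hash every reference k-mer to its positions, then let each read k-mer vote for the read start it implies; positions with many agreeing seeds go on to gapped (Smith-Waterman) alignment. This is a minimal illustration under assumed parameters, not MOSAIK's actual hash-clustering code.

```python
def build_kmer_index(reference, k=11):
    # Map every k-mer of the reference to the list of positions
    # where it occurs (the hash table used for seeding).
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def candidate_positions(read, index, k=11):
    # Each matching k-mer at read offset j and reference position pos
    # implies the read starts at pos - j; tally these votes and return
    # candidate starts, best-supported first.
    votes = {}
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            start = pos - j
            votes[start] = votes.get(start, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

Only the top-voted candidates need the expensive Smith-Waterman step, which is what makes hash-based seeding fast while still tolerating mismatches and short indels inside the read.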
Gruenstaeudl, Michael; Gerschler, Nico; Borsch, Thomas
2018-06-21
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
Oligonucleotide fingerprinting of rRNA genes for analysis of fungal community composition.
Valinsky, Lea; Della Vedova, Gianluca; Jiang, Tao; Borneman, James
2002-12-01
Thorough assessments of fungal diversity are currently hindered by technological limitations. Here we describe a new method for identifying fungi, oligonucleotide fingerprinting of rRNA genes (OFRG). OFRG sorts arrayed rRNA gene (ribosomal DNA [rDNA]) clones into taxonomic clusters through a series of hybridization experiments, each using a single oligonucleotide probe. A simulated annealing algorithm was used to design an OFRG probe set for fungal rDNA. Analysis of 1,536 fungal rDNA clones derived from soil generated 455 clusters. A pairwise sequence analysis showed that clones with average sequence identities of 99.2% were grouped into the same cluster. To examine the accuracy of the taxonomic identities produced by this OFRG experiment, we determined the nucleotide sequences for 117 clones distributed throughout the tree. For all but two of these clones, the taxonomic identities generated by this OFRG experiment were consistent with those generated by a nucleotide sequence analysis. Eighty-eight percent of the clones were affiliated with Ascomycota, while 12% belonged to Basidiomycota. A large fraction of the clones were affiliated with the genera Fusarium (404 clones) and Raciborskiomyces (176 clones). Smaller assemblages of clones had high sequence identities to the Alternaria, Ascobolus, Chaetomium, Cryptococcus, and Rhizoctonia clades.
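The clustering step above can be illustrated minimally: each clone's series of probe-hybridization outcomes forms a binary fingerprint, and clones with matching fingerprints fall into the same taxonomic cluster. Exact-match grouping is a deliberate simplification; OFRG's actual probe design uses simulated annealing and its clustering tolerates hybridization noise.

```python
def fingerprint_clusters(fingerprints):
    # fingerprints maps clone id -> list of 0/1 outcomes, one per
    # oligonucleotide probe.  Clones with identical fingerprints are
    # grouped into one cluster.
    clusters = {}
    for clone, fp in fingerprints.items():
        clusters.setdefault(tuple(fp), []).append(clone)
    return list(clusters.values())
```

With a well-chosen probe set, fingerprint equality tracks sequence identity closely, which is why clones averaging 99.2% identity landed in the same clusters.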
Shinozuka, Hiroshi; Cogan, Noel O I; Shinozuka, Maiko; Marshall, Alexis; Kay, Pippa; Lin, Yi-Han; Spangenberg, German C; Forster, John W
2015-04-11
Fragmentation at random nucleotide locations is an essential process for preparation of DNA libraries to be used on massively parallel short-read DNA sequencing platforms. Although instruments for physical shearing, such as the Covaris S2 focused-ultrasonicator system, and products for enzymatic shearing, such as the Nextera technology and NEBNext dsDNA Fragmentase kit, are commercially available, a simple and inexpensive method is desirable for high-throughput sequencing library preparation. MspJI is a recently characterised restriction enzyme which recognises the sequence motif CNNR (where R = G or A) when the first base is modified to 5-methylcytosine or 5-hydroxymethylcytosine. A semi-random enzymatic DNA amplicon fragmentation method was developed based on the unique cleavage properties of MspJI. In this method, random incorporation of 5-methyl-2'-deoxycytidine-5'-triphosphate is achieved through DNA amplification with DNA polymerase, followed by DNA digestion with MspJI. Due to the recognition sequence of the enzyme, DNA amplicons are fragmented in a relatively sequence-independent manner. The size range of the resulting fragments could be controlled by optimising the 5-methyl-2'-deoxycytidine-5'-triphosphate concentration in the reaction mixture. A library suitable for sequencing using the Illumina MiSeq platform was prepared and processed using the proposed method. Alignment of generated short reads to a reference sequence demonstrated a relatively high level of random fragmentation. The proposed method may be performed with standard laboratory equipment. Although the uniformity of coverage was slightly inferior to that of the Covaris physical shearing procedure, efficiencies of cost and labour may make the method more suitable than existing approaches for implementation in large-scale sequencing activities, such as bacterial artificial chromosome (BAC)-based genome sequence assembly, pan-genomic studies and locus-targeted genotyping-by-sequencing.
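The semi-random fragmentation logic can be sketched as follows: scan for the CNNR recognition motif (R = A or G), keep only sites whose leading cytosine happens to carry the 5-methyl modification, and split the amplicon there. This is an illustrative simplification; real MspJI cleaves at fixed offsets downstream of the recognised base, which this sketch ignores.

```python
import re

def mspji_cut_sites(seq, methylated):
    # Positions matching the MspJI motif CNNR (R = A or G) whose
    # leading C is in the supplied set of methylated positions.
    # Lookahead makes overlapping motif occurrences visible.
    return [m.start() for m in re.finditer(r'(?=C..[AG])', seq)
            if m.start() in methylated]

def fragments(seq, methylated):
    # Split the sequence at every recognised methylated site.
    cuts = mspji_cut_sites(seq, methylated)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]
```

Because methylated cytosines are incorporated at random during amplification, which subset of motif sites is cut varies from molecule to molecule, and a lower 5mdCTP concentration means fewer cut sites and hence longer fragments.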
Friis-Nielsen, Jens; Vinner, Lasse; Hansen, Thomas Arn; Richter, Stine Raith; Fridholm, Helena; Herrera, Jose Alejandro Romero; Lund, Ole; Brunak, Søren; Izarzugaza, Jose M. G.; Mourier, Tobias; Nielsen, Lars Peter
2016-01-01
Propionibacterium acnes is the most abundant bacterium on human skin, particularly in sebaceous areas. P. acnes is suggested to be an opportunistic pathogen involved in the development of diverse medical conditions but is also a proven contaminant of human clinical samples and surgical wounds. Its significance as a pathogen is consequently a matter of debate. In the present study, we investigated the presence of P. acnes DNA in 250 next-generation sequencing data sets generated from 180 samples of 20 different sample types, mostly of cancerous origin. The samples were subjected to either microbial enrichment, involving nuclease treatment to reduce the amount of host nucleic acids, or shotgun sequencing. We detected high proportions of P. acnes DNA in enriched samples, particularly skin tissue-derived and other tissue samples, with the levels being higher in enriched samples than in shotgun-sequenced samples. P. acnes reads were detected in most samples analyzed, though the proportions in most shotgun-sequenced samples were low. Our results show that P. acnes can be detected in practically all sample types when molecular methods, such as next-generation sequencing, are employed. The possibility of contamination from the patient or other sources, including laboratory reagents or environment, should therefore always be considered carefully when P. acnes is detected in clinical samples. We advocate that detection of P. acnes always be accompanied by experiments validating the association between this bacterium and any clinical condition. PMID:26818667
A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
2008-01-01
task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach...previous methods in our experiments. One application of LDSs in computer vision is learning dynamic textures from video data [8]. An advantage of...over-the-counter (OTC) drug sales for biosurveillance , and sunspot numbers from the UCR archive [9]. Comparison to the best alternative methods [7, 10
Ciric, Milica; Moon, Christina D; Leahy, Sinead C; Creevey, Christopher J; Altermann, Eric; Attwood, Graeme T; Rakonjac, Jasna; Gagic, Dragana
2014-05-12
In silico, secretome proteins can be predicted from completely sequenced genomes using various available algorithms that identify membrane-targeting sequences. For the metasecretome (the collection of surface, secreted and transmembrane proteins from environmental microbial communities) this approach is impractical, considering that metasecretome open reading frames (ORFs) comprise only 10% to 30% of the total metagenome, and are poorly represented in the dataset due to the overall low coverage of the metagenomic gene pool, even in large-scale projects. By combining secretome-selective phage display and next-generation sequencing, we focused the sequence analysis of a complex rumen microbial community on the metasecretome component of the metagenome. This approach achieved high enrichment (29 fold) of secreted fibrolytic enzymes from the plant-adherent microbial community of the bovine rumen. In particular, we identified hundreds of heretofore rare modules belonging to cellulosomes, cell-surface complexes specialised for recognition and degradation of the plant fibre. As a method, metasecretome phage display combined with next-generation sequencing has the power to sample the diversity of low-abundance surface and secreted proteins that would otherwise require exceptionally large metagenomic sequencing projects. As a resource, the metasecretome display library, backed by the dataset obtained by next-generation sequencing, is ready for i) affinity selection by standard phage display methodology and ii) easy purification of displayed proteins as part of the virion for individual functional analysis.
Generation of Synthetic Copolymer Libraries by Combinatorial Assembly on Nucleic Acid Templates.
Kong, Dehui; Yeung, Wayland; Hili, Ryan
2016-07-11
Recent advances in nucleic acid-templated copolymerization have expanded the scope of sequence-controlled synthetic copolymers beyond the molecular architectures witnessed in nature. This has enabled the power of molecular evolution to be applied to synthetic copolymer libraries to evolve molecular function ranging from molecular recognition to catalysis. This Review seeks to summarize different approaches available to generate sequence-defined monodispersed synthetic copolymer libraries using nucleic acid-templated polymerization. Key concepts and principles governing nucleic acid-templated polymerization, as well as the fidelity of various copolymerization technologies, will be described. The Review will focus on methods that enable the combinatorial generation of copolymer libraries and their molecular evolution for desired function.
Gao, Shen; Yao, Bei; Lu, Zuhong
2015-01-01
Background The chimeric sequences produced by phi29 DNA polymerase, which are named as chimeras, influence the performance of the multiple displacement amplification (MDA) and also increase the difficulty of sequence data process. Despite several articles have reported the existence of chimeric sequence, there was only one research focusing on the structure and generation mechanism of chimeras, and it was merely based on hundreds of chimeras found in the sequence data of E. coli genome. Method We finished data mining towards a series of Next Generation Sequencing (NGS) reads which were used for whole genome haplotype assembling in a primary study. We established a bioinformatics pipeline based on subsection alignment strategy to discover all the chimeras inside and achieve their structural visualization. Then, we artificially defined two statistical indexes (the chimeric distance and the overlap length), and their regular abundance distribution helped illustrate of the structural characteristics of the chimeras. Finally we analyzed the relationship between the chimera type and the average insertion size, so that illustrate a method to decrease the proportion of wasted data in the procedure of DNA library construction. Results/Conclusion 131.4 Gb pair-end (PE) sequence data was reanalyzed for the chimeras. Totally, 40,259,438 read pairs (6.19%) with chimerism were discovered among 650,430,811 read pairs. The chimeric sequences are consisted of two or more parts which locate inconsecutively but adjacently on the chromosome. The chimeric distance between the locations of adjacent parts on the chromosome followed an approximate bimodal distribution ranging from 0 to over 5,000 nt, whose peak was at about 250 to 300 nt. The overlap length of adjacent parts followed an approximate Poisson distribution and revealed a peak at 6 nt. 
Moreover, a linear correlation analysis showed that unmapped chimeras, which were classified as wasted data, could be reduced by appropriately increasing the insert size. Significance This study exhibited the profile of phi29 MDA chimeras using tens of millions of chimeric sequences and helps to elucidate the amplification mechanism of the phi29 DNA polymerase. Our work also illustrates the importance of NGS data reanalysis, not only for improving data utilization efficiency, but also for uncovering additional genomic information. PMID:26440104
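The two junction statistics above can be computed directly from the coordinates of a chimeric read's aligned parts. A minimal sketch, assuming each read has already been split into parts with known read and reference coordinates (the half-open `(read_start, read_end, ref_start, ref_end)` layout is our own convention, not the paper's pipeline):

```python
def junction_stats(segments):
    """Per-junction statistics for one chimeric read.

    `segments` is a list of (read_start, read_end, ref_start, ref_end)
    tuples, one per aligned part, ordered by position in the read
    (half-open coordinates). Returns one (chimeric_distance,
    overlap_length) pair per junction between adjacent parts.
    """
    stats = []
    for a, b in zip(segments, segments[1:]):
        # Distance skipped on the reference between the two parts.
        chimeric_distance = abs(b[2] - a[3])
        # Bases shared by both parts at the junction in the read
        # (microhomology); 0 if the parts abut or leave a gap.
        overlap_length = max(0, a[1] - b[0])
        stats.append((chimeric_distance, overlap_length))
    return stats
```

With these definitions, a read whose two parts share 6 bases at the junction and whose second part restarts 250 nt downstream on the reference yields exactly the modal values reported above.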
De novo design and engineering of functional metal and porphyrin-binding protein domains
NASA Astrophysics Data System (ADS)
Everson, Bernard H.
In this work, I describe an approach to the rational, iterative design and characterization of two functional cofactor-binding protein domains. First, a hybrid computational/experimental method was developed with the aim of algorithmically generating a suite of porphyrin-binding protein sequences with minimal mutual sequence information. This method was explored by generating libraries of sequences, which were then expressed and evaluated for function. One successful sequence is shown to bind a variety of porphyrin-like cofactors, and exhibits light-activated electron transfer in mixed hemin:chlorin e6 and hemin:Zn(II)-protoporphyrin IX complexes. These results imply that many sophisticated functions such as cofactor binding and electron transfer require only a very small number of residue positions in a protein sequence to be fixed. Net charge and hydrophobic content are important in determining protein solubility and stability. Accordingly, rational modifications were made to the aforementioned design procedure in order to improve its overall success rate. The effects of these modifications are explored using two 'next-generation' sequence libraries, which were separately expressed and evaluated. Particular modifications to these design parameters are demonstrated to effectively double the purification success rate of the procedure. Finally, I describe the redesign of the artificial di-iron protein DF2 into CDM13, a single-chain di-manganese four-helix bundle. CDM13 acts as a functional model of natural manganese catalase, exhibiting a kcat of 0.08 s-1 under steady-state conditions. The bound manganese cofactors have a reduction potential of +805 mV vs NHE, which is too high for efficient dismutation of hydrogen peroxide. These results indicate that as a high-potential manganese complex, CDM13 may represent a promising first step toward a polypeptide model of the Oxygen Evolving Complex of the photosynthetic enzyme Photosystem II.
Methyl-CpG island-associated genome signature tags
Dunn, John J
2014-05-20
Disclosed is a method for analyzing the organismic complexity of a sample through analysis of the nucleic acid in the sample. In the disclosed method, through a series of steps, including digestion with a type II restriction enzyme, ligation of capture adapters and linkers and digestion with a type IIS restriction enzyme, genome signature tags are produced. The sequences of a statistically significant number of the signature tags are determined and the sequences are used to identify and quantify the organisms in the sample. Various embodiments of the invention described herein include methods for using single point genome signature tags to analyze the related families present in a sample, methods for analyzing sequences associated with hyper- and hypo-methylated CpG islands, methods for visualizing organismic complexity change in a sampling location over time and methods for generating the genome signature tag profile of a sample of fragmented DNA.
QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles.
Van der Borght, Koen; Thys, Kim; Wetzels, Yves; Clement, Lieven; Verbist, Bie; Reumers, Joke; van Vlijmen, Herman; Aerssens, Jeroen
2015-11-10
Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNV(D)). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNV(HS)). To further increase specificity, called SNVs were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNV(HS-P80)). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNV(D) performed similarly to the existing approaches. QQ-SNV(HS) was more sensitive on all test sets but with more false positives. QQ-SNV(HS-P80) was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5%, QQ-SNV(HS-P80) revealed a sensitivity of 100% (vs. 40-60% for the existing methods) and a specificity of 100% (vs. 98.0-99.7% for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets.
Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5% were consistently detected by QQ-SNV(HS-P80) from different generations of Illumina sequencers. We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
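The two post-processing steps can be sketched in a few lines. The trained logistic model itself is not reproduced here; the sketch assumes each candidate already carries a model probability, and the quantile features and the HS-P80 percentile filter follow our simplified reading of the description above (the function names and tuple layout are hypothetical):

```python
import math

def quantiles(quality_scores, probs=(0.25, 0.5, 0.75)):
    """Empirical quantiles of the base-quality scores supporting a
    variant, computed with linear interpolation; these are the kind of
    features a quality-quantile classifier consumes."""
    xs = sorted(quality_scores)
    out = []
    for p in probs:
        idx = p * (len(xs) - 1)
        lo, hi = int(math.floor(idx)), int(math.ceil(idx))
        out.append(xs[lo] + (xs[hi] - xs[lo]) * (idx - lo))
    return out

def call_snvs(candidates, error_freqs, prob_cutoff=0.0001, pctl=80):
    """Keep candidates whose model probability passes the cutoff AND
    whose observed frequency exceeds the given percentile of the
    error-frequency distribution (the HS-P80 rule described above).

    `candidates` is a list of (name, probability, frequency) tuples.
    """
    cut = sorted(error_freqs)[min(len(error_freqs) - 1,
                                  int(len(error_freqs) * pctl / 100))]
    return [name for name, p, f in candidates if p >= prob_cutoff and f > cut]
```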
Tan, BoonFei; Ng, Charmaine; Nshimyimana, Jean Pierre; Loh, Lay Leng; Gin, Karina Y.-H.; Thompson, Janelle R.
2015-01-01
Water quality is an emergent property of a complex system comprised of interacting microbial populations and introduced microbial and chemical contaminants. Studies leveraging next-generation sequencing (NGS) technologies are providing new insights into the ecology of microbially mediated processes that influence fresh water quality such as algal blooms, contaminant biodegradation, and pathogen dissemination. In addition, sequencing methods targeting small subunit (SSU) rRNA hypervariable regions have allowed identification of signature microbial species that serve as bioindicators for sewage contamination in these environments. Beyond amplicon sequencing, metagenomic and metatranscriptomic analyses of microbial communities in fresh water environments reveal the genetic capabilities and interplay of waterborne microorganisms, shedding light on the mechanisms for production and biodegradation of toxins and other contaminants. This review discusses the challenges and benefits of applying NGS-based methods to water quality research and assessment. We will consider the suitability and biases inherent in the application of NGS as a screening tool for assessment of biological risks and discuss the potential and limitations for direct quantitative interpretation of NGS data. Secondly, we will examine case studies from recent literature where NGS based methods have been applied to topics in water quality assessment, including development of bioindicators for sewage pollution and microbial source tracking, characterizing the distribution of toxin and antibiotic resistance genes in water samples, and investigating mechanisms of biodegradation of harmful pollutants that threaten water quality. Finally, we provide a short review of emerging NGS platforms and their potential applications to the next generation of water quality assessment tools. PMID:26441948
USDA-ARS?s Scientific Manuscript database
We needed to obtain an alternative to conventional cloning to generate high-quality DNA sequences from a variety of nuclear orthologs for phylogenetic studies in potato, to save time and money and to avoid problems typically encountered in cloning. We tested a variety of SSCP protocols to include pu...
Random Item Generation Is Affected by Age
ERIC Educational Resources Information Center
Multani, Namita; Rudzicz, Frank; Wong, Wing Yiu Stephanie; Namasivayam, Aravind Kumar; van Lieshout, Pascal
2016-01-01
Purpose: Random item generation (RIG) involves central executive functioning. Measuring aspects of random sequences can therefore provide a simple method to complement other tools for cognitive assessment. We examine the extent to which RIG relates to specific measures of cognitive function, and whether those measures can be estimated using RIG…
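Classical RIG analyses score a generated sequence with simple stereotypy indices. A minimal sketch of two such indices (these are generic illustrations of "measuring aspects of random sequences"; the specific measures used in the study may differ):

```python
def adjacency_score(seq):
    """Fraction of consecutive response pairs that are numerically
    adjacent (e.g. 3 followed by 2 or 4), a common stereotypy index;
    truly random generation keeps this low."""
    pairs = list(zip(seq, seq[1:]))
    return sum(1 for a, b in pairs if abs(a - b) == 1) / len(pairs)

def repetition_score(seq):
    """Fraction of immediate repetitions, another common RIG index."""
    pairs = list(zip(seq, seq[1:]))
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```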
NASA Astrophysics Data System (ADS)
Molotkov, S. N.
2017-03-01
Various methods for the clustering of photocounts constituting a sequence of random numbers are considered. It is shown that the clustering of photocounts resulting in the Fermi-Dirac distribution makes it possible to achieve the theoretical limit of the random number generation rate.
BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data
Ji, Yuan; Xu, Yanxun; Zhang, Qiong; Tsui, Kam-Wah; Yuan, Yuan; Norris, Clift; Liang, Shoudan; Liang, Han
2011-01-01
Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real-life data that the Bayesian method yields better mapping than the current leading methods. We provide a downloadable C++ program, which is being packaged into user-friendly software. PMID:21517792
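The core idea, posterior mapping probabilities for a multiread, can be illustrated with a deliberately minimal Bayes computation. BM-Map's actual model is far richer (position-specific mismatch profiles, hierarchical priors); this sketch assumes a flat per-base error rate and a prior proportional to uniquely mapped coverage, both our own simplifications:

```python
def mapping_posteriors(mismatches, unique_coverage, error_rate=0.01, read_len=50):
    """Posterior probability that a multiread belongs to each candidate
    locus: a binomial-style mismatch likelihood under a per-base error
    model, weighted by a prior proportional to coverage from uniquely
    mapped reads at that locus."""
    weights = []
    for m, cov in zip(mismatches, unique_coverage):
        likelihood = (error_rate ** m) * ((1 - error_rate) ** (read_len - m))
        prior = cov + 1  # pseudocount so unseen loci keep some mass
        weights.append(likelihood * prior)
    total = sum(weights)
    return [w / total for w in weights]
```

A read matching two loci equally well is then assigned in proportion to the surrounding unique coverage, rather than being discarded.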
de la Fuente, Gabriel; Belanche, Alejandro; Girwood, Susan E.; Pinloche, Eric; Wilkinson, Toby; Newbold, C. Jamie
2014-01-01
The development of next generation sequencing has challenged the use of other molecular fingerprinting methods used to study microbial diversity. We analysed the bacterial diversity in the rumen of defaunated sheep following the introduction of different protozoal populations, using both next generation sequencing (NGS: Ion Torrent PGM) and terminal restriction fragment length polymorphism (T-RFLP). Although absolute numbers differed, there was a high correlation between NGS and T-RFLP in terms of richness and diversity, with R values of 0.836 and 0.781 for richness and the Shannon-Wiener index, respectively. Dendrograms for both datasets were also highly correlated (Mantel test = 0.742). Eighteen OTUs and ten genera were significantly impacted by the addition of rumen protozoa, with an increase in the relative abundance of Prevotella, Bacteroides and Ruminobacter, related to an increase in free ammonia levels in the rumen. Our findings suggest that classic fingerprinting methods are still valuable tools to study microbial diversity and structure in complex environments but that NGS techniques now provide cost-effective alternatives that deliver a far greater level of information on the individual members of the microbial population. PMID:25051490
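The two diversity statistics compared above are standard and easy to compute from an OTU count table. A minimal sketch (the formulas are the textbook definitions, not code from the study):

```python
import math

def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)

def shannon_wiener(counts):
    """Shannon-Wiener diversity index H' = -sum(p_i * ln p_i),
    where p_i is the relative abundance of OTU i."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```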
Method and apparatus for characterizing reflected ultrasonic pulses
NASA Technical Reports Server (NTRS)
Yost, William T. (Inventor); Cantrell, John H., Jr. (Inventor)
1991-01-01
The invention is a method of and apparatus for characterizing the amplitudes of a sequence of reflected pulses R1, R2, and R3 by converting them into corresponding electric signals E1, E2, and E3 of substantially the same value during each sequence, thereby restoring the reflected pulses R1, R2, and R3 to their initial reflection values by means of timing circuitry, an exponential generator, and a time-gain compensator. Envelope and baseline-reject circuits permit the display and accurate location of the time-spaced sequence of electric signals having substantially the same amplitude on a measurement scale on a suitable video display or oscilloscope.
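The exponential-generator/time-gain-compensation stage can be sketched digitally: echoes attenuated exponentially in time are multiplied by a matching exponential gain so successive reflections come out at comparable amplitude. This is a toy model assuming a single attenuation rate `alpha`, not the patented analog circuit:

```python
import math

def time_gain_compensate(samples, dt, alpha):
    """Apply exponential gain e^(alpha * t) to a digitized echo train so
    that echoes attenuated as e^(-alpha * t) are restored to comparable
    amplitude. `alpha` is the assumed attenuation rate (1/s) and `dt`
    the sampling interval (s)."""
    return [s * math.exp(alpha * i * dt) for i, s in enumerate(samples)]
```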
Salehi, Sohrab; Steif, Adi; Roth, Andrew; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P
2017-03-01
Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.
Arnold, Frances H.; Shao, Zhixin; Zhao, Huimin; Giver, Lorraine J.
2002-01-01
A method for in vitro mutagenesis and recombination of polynucleotide sequences based on polymerase-catalyzed extension of primer oligonucleotides is disclosed. The method involves priming template polynucleotide(s) with random-sequence or defined-sequence primers to generate a pool of short DNA fragments with a low level of point mutations. The DNA fragments are subjected to denaturation followed by annealing and further enzyme-catalyzed DNA polymerization. This procedure is repeated a sufficient number of times to produce full-length genes which comprise mutants of the original template polynucleotides. These genes can be further amplified by the polymerase chain reaction and cloned into a vector for expression of the encoded proteins.
Mining sequence variations in representative polyploid sugarcane germplasm accessions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Xiping; Song, Jian; You, Qian
Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS) was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and genome-complexity reduction using restriction enzymes. To use GBS for high-throughput genotyping of highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variant callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, the non-reference-based Universal Network Enabled Analysis Kit (UNEAK) and Stacks, run de novo, called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single-dose SNPs ranged from 38.3% to 62.3%, with an average of 49.6%, much higher than the proportions of multiple-dose SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors between 0.19 and 1.65 MYA.
The GBS pipelines, including the reference sequences, alignment methods, sequence variant callers, and sequence depths, were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.
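Dosage calling in a polyploid is conceptually a matter of comparing the alternate-allele read fraction with the expected k/ploidy ratios. A deliberately rough sketch (real pipelines use depth-aware likelihood models, and sugarcane ploidy varies by accession; the nearest-ratio rule and ploidy value are illustrative assumptions):

```python
def classify_dosage(alt_depth, total_depth, ploidy=8):
    """Rough dosage call for a polyploid SNP from read depths: the
    alternate-allele fraction is compared with the expected k/ploidy
    ratios and the nearest dose k (1 = single dose) is returned."""
    frac = alt_depth / total_depth
    return min(range(1, ploidy), key=lambda k: abs(frac - k / ploidy))
```

Under this rule an alternate fraction near 1/8 is scored as a single-dose SNP, the class reported above to make up roughly half of all calls.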
Mining sequence variations in representative polyploid sugarcane germplasm accessions
Yang, Xiping; Song, Jian; You, Qian; ...
2017-08-09
Selective amplification and sequencing of cyclic phosphate-containing RNAs by the cP-RNA-seq method
Honda, Shozo; Morichika, Keisuke; Kirino, Yohei
2016-01-01
RNA digestions catalyzed by many ribonucleases generate RNA fragments containing a 2′,3′-cyclic phosphate (cP) at their 3′-termini. However, standard RNA-seq methods are unable to accurately capture cP-containing RNAs because the cP inhibits the adapter ligation reaction. We recently developed a method named “cP-RNA-seq” that is able to selectively amplify and sequence cP-containing RNAs. Here we describe the cP-RNA-seq protocol, in which the 3′-termini of all RNAs, except those containing a cP, are cleaved through a periodate treatment after phosphatase treatment; hence subsequent adapter ligation and cDNA amplification steps are exclusively applied to cP-containing RNAs. cP-RNA-seq takes ~6 d, excluding the time required for sequencing and bioinformatics analyses; such downstream analyses are not covered in detail in this protocol. Biochemical validation of the existence of cP in the identified RNAs takes ~3 d. Even though the cP-RNA-seq method was developed to identify angiogenin-generated 5′-tRNA halves as a proof of principle, the method should be applicable to global identification of cP-containing RNA repertoires in various transcriptomes. PMID:26866791
Advances in molecular serotyping and subtyping of Escherichia coli
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fratamico, Pina M.; DebRoy, Chitrita; Liu, Yanhong
Escherichia coli plays an important role as a member of the gut microbiota; however, pathogenic strains also exist, including various diarrheagenic E. coli pathotypes and extraintestinal pathogenic E. coli that cause illness outside of the GI-tract. E. coli have traditionally been serotyped using antisera against the ca. 186 O-antigens and 53 H-flagellar antigens. Phenotypic methods, including bacteriophage typing and O- and H-serotyping for differentiating and characterizing E. coli, have been used for many years; however, these methods are generally time consuming and not always accurate. Advances in next generation sequencing technologies have made it possible to develop genetic-based subtyping and molecular serotyping methods for E. coli, which are more discriminatory compared to phenotypic typing methods. Furthermore, whole genome sequencing (WGS) of E. coli is replacing established subtyping methods such as pulsed-field gel electrophoresis, providing a major advancement in the ability to investigate food-borne disease outbreaks and for trace-back to sources. In addition, a variety of sequence analysis tools and bioinformatic pipelines are being developed to analyze the vast amount of data generated by WGS and to obtain specific information such as O- and H-group determination and the presence of virulence genes and other genetic markers.
Advances in molecular serotyping and subtyping of Escherichia coli
Fratamico, Pina M.; DebRoy, Chitrita; Liu, Yanhong; ...
2016-05-03
A systematic comparison of error correction enzymes by next-generation sequencing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lubock, Nathan B.; Zhang, Di; Sidore, Angus M.
Gene synthesis, the process of assembling gene-length fragments from shorter groups of oligonucleotides (oligos), is becoming an increasingly important tool in molecular and synthetic biology. The length, quality and cost of gene synthesis are limited by errors produced during oligo synthesis and subsequent assembly. Enzymatic error correction methods are cost-effective means to ameliorate errors in gene synthesis. Previous analyses of these methods relied on cloning and Sanger sequencing to evaluate their efficiencies, limiting quantitative assessment. Here, we develop a method to quantify errors in synthetic DNA by next-generation sequencing. We analyzed errors in model gene assemblies and systematically compared six different error correction enzymes across 11 conditions. We find that ErrASE and T7 Endonuclease I are the most effective at decreasing average error rates (up to 5.8-fold relative to the input), whereas MutS is the best for increasing the number of perfect assemblies (up to 25.2-fold). We are able to quantify differential specificities, such as the preference of ErrASE for correcting C/G transversions and of T7 Endonuclease I for A/T transversions. More generally, this experimental and computational pipeline is a fast, scalable and extensible way to analyze errors in gene assemblies, to profile error correction methods, and to benchmark DNA synthesis methods.
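The headline numbers above are ratios of simple per-pool metrics. A minimal sketch of the arithmetic, assuming error and base counts tallied from NGS alignments (the counts in the usage note are invented for illustration, not the paper's data):

```python
def error_rate(num_errors, num_bases):
    """Per-base error rate of an assembly pool, from NGS-aligned counts."""
    return num_errors / num_bases

def fold_improvement(input_value, corrected_value):
    """Fold change of an error metric after enzymatic correction,
    e.g. a 5.8-fold error-rate reduction relative to the input."""
    return input_value / corrected_value
```

For instance, a pool whose error rate drops from 5.8 to 1.0 errors per 10 kb after treatment shows a 5.8-fold improvement.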
A systematic comparison of error correction enzymes by next-generation sequencing
Lubock, Nathan B.; Zhang, Di; Sidore, Angus M.; ...
2017-08-01
Importance Sampling of Word Patterns in DNA and Protein Sequences
Chan, Hock Peng; Chen, Louis H.Y.
2010-01-01
Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs. PMID:21128856
Transient sequences in a hypernetwork generated by an adaptive network of spiking neurons.
Maslennikov, Oleg V; Shchapin, Dmitry S; Nekorkin, Vladimir I
2017-06-28
We propose a model of an adaptive network of spiking neurons that gives rise to a hypernetwork of its dynamic states at the upper level of description. Left to itself, the network exhibits a sequence of transient clustering which corresponds to traffic in the hypernetwork in the form of a random walk. When it receives inputs, the system is able to generate reproducible sequences corresponding to stimulus-specific paths in the hypernetwork. We illustrate these basic notions by a simple network of discrete-time spiking neurons together with its FPGA realization and analyse their properties. This article is part of the themed issue 'Mathematical methods in medicine: neuroscience, cardiology and pathology'. © 2017 The Author(s).
Bandeira, Nuno; Clauser, Karl R; Pevzner, Pavel A
2007-07-01
Despite significant advances in the identification of known proteins, the analysis of unknown proteins by MS/MS still remains a challenging open problem. Although Klaus Biemann recognized the potential of MS/MS for sequencing of unknown proteins in the 1980s, low-throughput Edman degradation followed by cloning still remains the main method to sequence unknown proteins. The automated interpretation of MS/MS spectra has been limited by a focus on individual spectra and has not capitalized on the information contained in spectra of overlapping peptides. Indeed, the powerful shotgun DNA sequencing strategies have not been extended to automated protein sequencing. We demonstrate, for the first time, the feasibility of automated shotgun protein sequencing of protein mixtures by utilizing MS/MS spectra of overlapping and possibly modified peptides generated via multiple proteases of different specificities. We validate this approach by generating highly accurate de novo reconstructions of multiple regions of various proteins in western diamondback rattlesnake venom. We further argue that shotgun protein sequencing has the potential to overcome the limitations of current protein sequencing approaches and thus catalyze the otherwise impractical applications of proteomics methodologies in studies of unknown proteins.
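The shotgun idea, extending contigs from overlapping peptides produced by proteases of different specificities, can be illustrated with a toy string merge. The real method works on spectrum graphs of MS/MS data, not on known residue strings, so this is only an analogy (the minimum-overlap threshold is an arbitrary assumption):

```python
def merge_overlapping(a, b, min_overlap=3):
    """Merge two peptide strings if a suffix of one matches a prefix of
    the other by at least `min_overlap` residues; return the merged
    contig, or None if no sufficient overlap exists. This mimics how
    overlapping peptides from different proteases extend coverage."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), min_overlap - 1, -1):
            if x[-k:] == y[:k]:
                return x + y[k:]
    return None
```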
El-Assaad, Atlal; Dawy, Zaher; Nemer, Georges; Hajj, Hazem; Kobeissy, Firas H
2017-01-01
Degradomics is a novel discipline that involves determination of the protease/substrate fragmentation profile, called the substrate degradome, and has recently been applied in different disciplines. A major application of degradomics is its utility in the field of biomarkers, where the breakdown products (BDPs) of different proteases have been investigated. Among the major proteases assessed, calpain and caspase proteases have been associated with the execution phases of pro-apoptotic and pro-necrotic cell death, generating caspase/calpain-specific cleaved fragments. The distinction between calpain and caspase protein fragments has been applied to distinguish injury mechanisms. Advanced proteomics technology has been used to identify these BDPs experimentally. However, it has been a challenge to identify these BDPs with high precision and efficiency, especially if we are targeting a number of proteins at one time. In this chapter, we present a novel bioinformatic detection method that identifies BDPs accurately and efficiently, with validation against experimental data. This method aims at predicting the consensus sequence occurrences and their variants in a large set of experimentally detected protein sequences based on state-of-the-art sequence matching and alignment algorithms. After detection, the method generates all the potential fragments cleaved by a specific protease. This space- and time-efficient algorithm is flexible enough to handle the different orientations that the consensus sequence and the protein sequence can take before cleaving. It is O(mn) in space complexity and O(Nmn) in time complexity, with N the number of protein sequences, m the length of the consensus sequence, and n the length of each protein sequence. Ultimately, this knowledge will subsequently feed into the development of a novel tool for researchers to detect diverse types of selected BDPs as putative disease markers, contributing to the diagnosis and treatment of related disorders.
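The core step, scanning each protein for a protease consensus and emitting the cleaved fragments, can be sketched with a naive scan whose cost matches the stated O(mn) per protein (O(Nmn) over N proteins). The 'X'-as-wildcard convention and cut-after-site rule are our illustrative assumptions, not the chapter's exact algorithm:

```python
def cleave(protein, consensus):
    """Find every occurrence of a protease consensus site in a protein
    sequence (naive O(mn) scan) and return the breakdown products
    generated by cutting immediately after each site. 'X' in the
    consensus matches any residue."""
    m = len(consensus)
    cuts = [i + m for i in range(len(protein) - m + 1)
            if all(c == 'X' or c == r
                   for c, r in zip(consensus, protein[i:i + m]))]
    fragments, prev = [], 0
    for c in cuts:
        fragments.append(protein[prev:c])
        prev = c
    fragments.append(protein[prev:])
    return [f for f in fragments if f]
```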
Ananda, Guruprasad; Mockus, Susan; Lundquist, Micaela; Spotlow, Vanessa; Simons, Al; Mitchell, Talia; Stafford, Grace; Philip, Vivek; Stearns, Timothy; Srivastava, Anuj; Barter, Mary; Rowe, Lucy; Malcolm, Joan; Bult, Carol; Karuturi, Radha Krishna Murthy; Rasmussen, Karen; Hinerfeld, Douglas
2015-01-01
Background The continued development of targeted therapeutics for cancer treatment has required the concomitant development of more expansive methods for the molecular profiling of the patient’s tumor. We describe the validation of the JAX Cancer Treatment Profile™ (JAX-CTP™), a next-generation sequencing (NGS)-based molecular diagnostic assay that detects actionable mutations in solid tumors to inform the selection of targeted therapeutics for cancer treatment. Methods NGS libraries are generated from DNA extracted from formalin-fixed, paraffin-embedded tumors. Using hybrid capture, the genes of interest are enriched and sequenced on the Illumina HiSeq 2500 or MiSeq sequencers, followed by variant detection and functional and clinical annotation for the generation of a clinical report. Results The JAX-CTP™ detects actionable variants, in the form of single-nucleotide variations and small insertions and deletions (≤50 bp), in 190 genes in specimens with a neoplastic cell content of ≥10%. The JAX-CTP™ is also validated for the detection of clinically actionable gene amplifications. Conclusions There is a lack of consensus in the molecular diagnostics field on the best method for the validation of NGS-based assays in oncology, hence the importance of communicating methods, as in this report. The growing number of targeted therapeutics and the complexity of the tumor genome necessitate continued development and refinement of advanced assays for tumor profiling to enable precision cancer treatment. PMID:25562415
2013-01-01
Background With high quantity and quality data production and low cost, next-generation sequencing has the potential to provide new opportunities for plant phylogeographic studies of single and multiple species. Here we present an approach for in silico chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value, and quality, we compared the short-read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer-read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and confirmed low diversity across the distribution. Conclusions Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery, and therefore a useful approach for landscape-level studies of evolutionary ecology. PMID:23497206
ERIC Educational Resources Information Center
Gauthier, N.
2006-01-01
This note describes a method for evaluating the sums of the m-th powers of n consecutive terms of a general arithmetic sequence: S_m, m = 0, 1, 2, …. The method is based on the use of a differential operator that is repeatedly applied to a generating function. A known linear recurrence is then obtained and the m-th sum, S_m…
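The operator-on-generating-function idea can be made explicit with a standard identity (a reconstruction consistent with the note's description, not quoted from it). Write the arithmetic sequence as a, a+d, …, a+(n-1)d; then differentiating the exponential generating function m times and evaluating at zero recovers the power sum:

```latex
% Generating function for the arithmetic sequence a, a+d, ..., a+(n-1)d:
\[
  G(x) \;=\; \sum_{k=0}^{n-1} e^{(a+kd)x}
        \;=\; e^{ax}\,\frac{e^{ndx}-1}{e^{dx}-1},
\]
% Applying the operator d/dx a total of m times and setting x = 0
% makes each term contribute (a+kd)^m, giving the m-th power sum:
\[
  S_m \;=\; \sum_{k=0}^{n-1} (a+kd)^m
        \;=\; \left.\frac{d^m}{dx^m}\, G(x)\right|_{x=0}.
\]
```

Expanding the closed form of G(x) in powers of x is what yields the linear recurrence mentioned in the note.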
Maceachern, Sean; Muir, William M; Crosby, Seth; Cheng, Hans H
2011-06-03
Marek's disease (MD), a T cell lymphoma induced by the highly oncogenic α-herpesvirus Marek's disease virus (MDV), is the main chronic infectious disease concern threatening the poultry industry. Enhancing genetic resistance to MD in commercial poultry is an attractive method to augment MD vaccines, which are currently the control method of choice. In order to optimally implement this control strategy through marker-assisted selection (MAS) and to gain biological information, it is necessary to identify specific genes that influence MD incidence. A genome-wide screen for allele-specific expression (ASE) in response to MDV infection was conducted. The highly inbred ADOL chicken lines 6 (MD resistant) and 7 (MD susceptible) were inter-mated in reciprocal crosses and half of the progeny challenged with MDV. Splenic RNA pools at a single time point after infection were generated for each treatment group, sequenced using a next-generation sequencer, and then analyzed for ASE. To validate and extend the results, Illumina GoldenGate assays for selected cSNPs were developed and used on all RNA samples from all 6 time points following MDV challenge. RNA sequencing resulted in 11-13+ million mappable reads per treatment group, 1.7+ Gb of total sequence, and 22,655 high-confidence cSNPs. Analysis of these cSNPs revealed that 5360 cSNPs in 3773 genes exhibited statistically significant allelic imbalance. Of the 1536 GoldenGate assays, 1465 were successfully scored, with all but 19 exhibiting evidence for allelic imbalance. ASE is an efficient method to identify potentially all or most of the genes influencing this complex trait. The identified cSNPs can be further evaluated in resource populations to determine their allelic direction and size of effect on genetic resistance to MD, as well as being directly implemented in genomic selection programs.
The described method, although demonstrated in inbred chicken lines, is applicable to all traits in any diploid species, and should prove to be a simple method to identify the majority of genes controlling any complex trait.
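To make the notion of "statistically significant allelic imbalance" concrete: the abstract does not state which test the study used, so the sketch below uses a simple two-sided exact binomial test on allele read counts, with `binom_two_sided` a hypothetical helper name:

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Two-sided exact binomial test: the probability of an allele-count
    split at least as unlikely as k out of n reads under balanced (50/50)
    expression of the two parental alleles."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(pr for pr in probs if pr <= probs[k] + 1e-12))

# 70 of 100 reads carrying one parental allele at a cSNP would be
# called imbalanced at any conventional threshold:
print(binom_two_sided(70, 100))   # a p-value far below 0.05
```

In a genome-wide screen such per-cSNP p-values would additionally need multiple-testing correction, which this sketch omits.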
Dissecting enzyme function with microfluidic-based deep mutational scanning.
Romero, Philip A; Tran, Tuan M; Abate, Adam R
2015-06-09
Natural enzymes are incredibly proficient catalysts, but engineering them to have new or improved functions is challenging due to the complexity of how an enzyme's sequence relates to its biochemical properties. Here, we present an ultrahigh-throughput method for mapping enzyme sequence-function relationships that combines droplet microfluidic screening with next-generation DNA sequencing. We apply our method to map the activity of millions of glycosidase sequence variants. Microfluidic-based deep mutational scanning provides a comprehensive and unbiased view of the enzyme function landscape. The mapping displays expected patterns of mutational tolerance and a strong correspondence to sequence variation within the enzyme family, but also reveals previously unreported sites that are crucial for glycosidase function. We modified the screening protocol to include a high-temperature incubation step, and the resulting thermotolerance landscape allowed the discovery of mutations that enhance enzyme thermostability. Droplet microfluidics provides a general platform for enzyme screening that, when combined with DNA-sequencing technologies, enables high-throughput mapping of enzyme sequence space.
Compression of next-generation sequencing quality scores using memetic algorithm
2014-01-01
Background The exponential growth of next-generation sequencing (NGS) derived DNA data poses great challenges to data storage and transmission. Although many compression algorithms have been proposed for DNA reads in NGS data, few methods are designed specifically to handle the quality scores. Results In this paper we present a memetic algorithm (MA) based NGS quality score data compressor, namely MMQSC. The algorithm extracts raw quality score sequences from FASTQ formatted files, and designs compression codebook using MA based multimodal optimization. The input data is then compressed in a substitutional manner. Experimental results on five representative NGS data sets show that MMQSC obtains higher compression ratio than the other state-of-the-art methods. Particularly, MMQSC is a lossless reference-free compression algorithm, yet obtains an average compression ratio of 22.82% on the experimental data sets. Conclusions The proposed MMQSC compresses NGS quality score data effectively. It can be utilized to improve the overall compression ratio on FASTQ formatted files. PMID:25474747
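The substitutional, codebook-based idea can be illustrated with a toy sketch. Note the swap: a plain frequency count stands in for MMQSC's memetic-algorithm codebook optimization, and `codebook_compress` is a hypothetical helper, not part of the published tool:

```python
from collections import Counter

def codebook_compress(quals, k=4, size=16):
    """Substitutional compression sketch for quality-score strings: split
    each string into k-mers, keep the `size` most frequent k-mers as the
    codebook, and count one byte per codebook hit versus a marker byte
    plus the raw k-mer for misses. Returns the codebook and the ratio
    compressed_bytes / original_bytes (lossless: every k-mer is either
    indexed or stored verbatim)."""
    chunks = [q[i:i + k] for q in quals for i in range(0, len(q), k)]
    book = [c for c, _ in Counter(chunks).most_common(size)]
    index = {c: i for i, c in enumerate(book)}
    encoded = sum(1 if c in index else 1 + len(c) for c in chunks)
    return book, encoded / sum(len(q) for q in quals)

book, ratio = codebook_compress(["IIIIIIII", "IIIIFFFF", "FFFFIIII"])
print(book, ratio)
```

The point of the memetic optimizer in the paper is precisely that greedy frequency counting, as above, is not guaranteed to find the codebook minimizing the encoded size.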
Single-Cell Sequencing for Precise Cancer Research: Progress and Prospects.
Zhang, Xiaoyan; Marjani, Sadie L; Hu, Zhaoyang; Weissman, Sherman M; Pan, Xinghua; Wu, Shixiu
2016-03-15
Advances in genomic technology have enabled the faithful detection and measurement of mutations and the gene expression profile of cancer cells at the single-cell level. Recently, several single-cell sequencing methods have been developed that permit the comprehensive and precise analysis of the cancer-cell genome, transcriptome, and epigenome. The use of these methods to analyze cancer cells has led to a series of unanticipated discoveries, such as high heterogeneity and stochastic changes in cancer-cell populations, new driver mutations and complicated clonal evolution mechanisms, and novel biomarkers of tumor variants. These methods and the knowledge gained from their utilization could potentially improve the early detection and monitoring of rare cancer cells, such as circulating tumor cells and disseminated tumor cells, and promote the development of personalized and highly precise cancer therapy. Here, we discuss the current methods for single cancer-cell sequencing, with a strong focus on those practically used or potentially valuable in cancer research, including single-cell isolation, whole-genome and transcriptome amplification, epigenome profiling, multi-dimensional sequencing, and next-generation sequencing and analysis. We also examine the current applications, challenges, and prospects of single cancer-cell sequencing. ©2016 American Association for Cancer Research.
Whole-genome sequencing and genetic variant analysis of a Quarter Horse mare.
Doan, Ryan; Cohen, Noah D; Sawyer, Jason; Ghaffari, Noushin; Johnson, Charlie D; Dindot, Scott V
2012-02-17
The catalog of genetic variants in the horse genome originates from a few select animals, the majority originating from the Thoroughbred mare used for the equine genome sequencing project. The purpose of this study was to identify genetic variants, including single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (INDELs), and copy number variants (CNVs) in the genome of an individual Quarter Horse mare sequenced by next-generation sequencing. Using massively parallel paired-end sequencing, we generated 59.6 Gb of DNA sequence from a Quarter Horse mare resulting in an average of 24.7X sequence coverage. Reads were mapped to approximately 97% of the reference Thoroughbred genome. Unmapped reads were de novo assembled resulting in 19.1 Mb of new genomic sequence in the horse. Using a stringent filtering method, we identified 3.1 million SNPs, 193 thousand INDELs, and 282 CNVs. Genetic variants were annotated to determine their impact on gene structure and function. Additionally, we genotyped this Quarter Horse for mutations of known diseases and for variants associated with particular traits. Functional clustering analysis of genetic variants revealed that most of the genetic variation in the horse's genome was enriched in sensory perception, signal transduction, and immunity and defense pathways. This is the first sequencing of a horse genome by next-generation sequencing and the first genomic sequence of an individual Quarter Horse mare. We have increased the catalog of genetic variants for use in equine genomics by the addition of novel SNPs, INDELs, and CNVs. The genetic variants described here will be a useful resource for future studies of genetic variation regulating performance traits and diseases in equids.
Next-generation sequencing reveals a novel NDP gene mutation in a Chinese family with Norrie disease
Huang, Xiaoyan; Tian, Mao; Li, Jiankang; Cui, Ling; Li, Min; Zhang, Jianguo
2017-01-01
Purpose: Norrie disease (ND) is a rare X-linked genetic disorder, the main symptoms of which are congenital blindness and white pupils. ND has been reported to be caused by mutations in the NDP gene. Although many mutations in NDP have been reported, the genetic cause for many patients remains unknown. The aim of this study was to investigate the genetic defect in a five-generation family with typical symptoms of ND. Methods: To identify the causative gene, next-generation sequencing-based target capture sequencing was performed. Segregation analysis of the candidate variant was performed in additional family members using Sanger sequencing. Results: We identified a novel missense variant (c.314C>A) located within the NDP gene. The mutation cosegregated with the disease in all affected individuals in the family and was not found in unaffected members. Incidentally, in this family we also detected a known pathogenic variant for retinitis pigmentosa in a healthy individual. Conclusion: The c.314C>A mutation of the NDP gene is novel and broadens the genetic spectrum of ND. PMID:29133643
NASA Astrophysics Data System (ADS)
Sheynkman, Gloria M.; Shortreed, Michael R.; Cesnik, Anthony J.; Smith, Lloyd M.
2016-06-01
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Madi, Nada; Al-Nakib, Widad; Mustafa, Abu Salim; Habibi, Nazima
2018-03-01
A metagenomic approach based on target-independent next-generation sequencing has become a known method for the detection of both known and novel viruses in clinical samples. This study aimed to use the metagenomic sequencing approach to characterize the viral diversity in respiratory samples from patients with respiratory tract infections. We investigated 86 respiratory samples received from various hospitals in Kuwait between 2015 and 2016 for the diagnosis of respiratory tract infections. A metagenomic approach using a next-generation sequencer was used to characterize viruses. According to the metagenomic analysis, an average of 145,019 reads were identified, and 2% of these reads were of viral origin. Metagenomic analysis of the viral sequences revealed many known respiratory viruses, which were detected in 30.2% of the clinical samples; sequences of non-respiratory viruses were detected in 14% of the clinical samples, while sequences of non-human viruses were detected in 55.8% of the clinical samples. The average genome coverage of the viruses was 12%, with the highest genome coverage of 99.2% for respiratory syncytial virus and the lowest of 1% for torque teno midi virus 2. Our results showed 47.7% agreement between multiplex real-time PCR and metagenomic sequencing in the detection of respiratory viruses in the clinical samples. Although there are some difficulties in applying this method to clinical samples, such as specimen quality, these observations are indicative of the promising utility of the metagenomic sequencing approach for the identification of respiratory viruses in patients with respiratory tract infections. © 2017 Wiley Periodicals, Inc.
Mitsui, Jun; Fukuda, Yoko; Azuma, Kyo; Tozaki, Hirokazu; Ishiura, Hiroyuki; Takahashi, Yuji; Goto, Jun; Tsuji, Shoji
2010-07-01
We have recently found that multiple rare variants of the glucocerebrosidase gene (GBA) confer a robust risk for Parkinson disease, supporting the 'common disease-multiple rare variants' hypothesis. To develop an efficient method of identifying rare variants in a large number of samples, we applied multiplexed resequencing using a next-generation sequencer to the identification of rare variants of GBA. Sixteen sets of pooled DNAs, each comprising six pooled DNA samples, were prepared. Each set of pooled DNAs was subjected to polymerase chain reaction to amplify the target gene (GBA) covering 6.5 kb, pooled into one tube with barcode indexing, and then subjected to extensive sequence analysis using the SOLiD System. Individual samples were also subjected to direct nucleotide sequence analysis. With optimization of data processing, we were able to extract all the variants from the 96 samples with acceptable rates of false-positive single-nucleotide variants.
Shen, Kang-Ning; Chen, Ching-Hung; Hsiao, Chung-Der; Durand, Jean-Dominique
2016-09-01
In this study, the complete mitogenome sequence of a cryptic species from East Australia (Mugil sp. H) belonging to the worldwide Mugil cephalus species complex (Teleostei: Mugilidae) has been sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,845 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNAs, 2 ribosomal RNA genes and a non-coding control region (D-loop). The D-loop is 1067 bp long and is located between tRNA-Pro and tRNA-Phe. The overall base composition of East Australian M. cephalus is 28.4% A, 29.3% C, 15.4% G and 26.9% T. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analysis of the flathead mullet species complex.
Shen, Kang-Ning; Yen, Ta-Chi; Chen, Ching-Hung; Li, Huei-Ying; Chen, Pei-Lung; Hsiao, Chung-Der
2016-05-01
In this study, the complete mitogenome sequence of the Northwestern Pacific 2 (NWP2) cryptic species of flathead mullet, Mugil cephalus (Teleostei: Mugilidae), has been amplified by long-range PCR and sequenced by a next-generation sequencing method. The assembled mitogenome, consisting of 16,686 bp, had the typical vertebrate mitochondrial gene arrangement, including 13 protein-coding genes, 22 transfer RNAs, 2 ribosomal RNA genes and a non-coding control region (D-loop). The D-loop was 909 bp long and was located between tRNA-Pro and tRNA-Phe. The overall base composition of NWP2 M. cephalus was 28.4% A, 29.8% C, 26.5% T and 15.3% G. The complete mitogenome may provide essential DNA molecular data for further phylogenetic and evolutionary analysis of the flathead mullet species complex.
Laboratory procedures to generate viral metagenomes.
Thurber, Rebecca V; Haynes, Matthew; Breitbart, Mya; Wegley, Linda; Rohwer, Forest
2009-01-01
This collection of laboratory protocols describes the steps to collect viruses from various samples with the specific aim of generating viral metagenome sequence libraries (viromes). Viral metagenomics, the study of uncultured viral nucleic acid sequences from different biomes, relies on several concentration, purification, extraction, sequencing and heuristic bioinformatic methods. No single technique can provide an all-inclusive approach, and therefore the protocols presented here will be discussed in terms of hypothetical projects. However, care must be taken to individualize each step depending on the source and type of viral-particles. This protocol is a description of the processes we have successfully used to: (i) concentrate viral particles from various types of samples, (ii) eliminate contaminating cells and free nucleic acids and (iii) extract, amplify and purify viral nucleic acids. Overall, a sample can be processed to isolate viral nucleic acids suitable for high-throughput sequencing in approximately 1 week.
NASA Technical Reports Server (NTRS)
Desmarais, R. N.
1982-01-01
The method is capable of generating approximations of arbitrary accuracy. It is based on approximating the algebraic part of the nonelementary integrals in the kernel by exponential functions and then integrating termwise. The exponent spacing in the approximation is a geometric sequence. The coefficients and exponent multiplier of the exponential approximation are computed by least squares, so the method is completely automated. Exponential approximations generated in this manner are two orders of magnitude more accurate than the exponential approximation that is currently most often used for this purpose. The method can be used to generate approximations attaining any desired trade-off between accuracy and computing cost.
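The fitting step can be sketched as follows. This is a minimal illustration under stated assumptions: the target function 1/(1+x), the interval, and the exponent values are stand-ins chosen for the example, not the kernel integrals of the report; only the structure (geometric exponent spacing, coefficients by linear least squares) follows the description above:

```python
import numpy as np

# Fit f(x) ~ sum_i c_i * exp(-b_i * x), where the exponents b_i form a
# geometric sequence and the coefficients c_i are found by least squares.
x = np.linspace(0.0, 5.0, 200)
f = 1.0 / (1.0 + x)                    # stand-in for the algebraic kernel part
b = 0.25 * 2.0 ** np.arange(6)         # geometric exponent spacing (ratio 2)
A = np.exp(-np.outer(x, b))            # design matrix: A[j, i] = exp(-b_i x_j)
c, *_ = np.linalg.lstsq(A, f, rcond=None)
max_err = np.max(np.abs(A @ c - f))
print(f"max absolute error: {max_err:.2e}")
```

The report additionally optimizes the exponent multiplier itself by least squares, making the procedure fully automatic; here the multiplier is fixed for brevity.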
Imai, Kazuo; Tarumoto, Norihito; Misawa, Kazuhisa; Runtuwene, Lucky Ronald; Sakai, Jun; Hayashida, Kyoko; Eshita, Yuki; Maeda, Ryuichiro; Tuda, Josef; Murakami, Takashi; Maesaki, Shigefumi; Suzuki, Yutaka; Yamagishi, Junya; Maeda, Takuya
2017-09-13
A simple and accurate molecular diagnostic method for malaria is urgently needed due to the limitations of conventional microscopic examination. In this study, we demonstrate a new diagnostic procedure for human malaria using loop mediated isothermal amplification (LAMP) and the MinION™ nanopore sequencer. We generated specific LAMP primers targeting the 18S-rRNA gene of all five human Plasmodium species including two P. ovale subspecies (P. falciparum, P. vivax, P. ovale wallikeri, P. ovale curtisi, P. knowlesi and P. malariae) and examined human blood samples collected from 63 malaria patients in Indonesia. Additionally, we performed amplicon sequencing of our LAMP products using MinION™ nanopore sequencer to identify each Plasmodium species. Our LAMP method allowed amplification of all targeted 18S-rRNA genes of the reference plasmids with detection limits of 10-100 copies per reaction. Among the 63 clinical samples, 54 and 55 samples were positive by nested PCR and our LAMP method, respectively. Identification of the Plasmodium species by LAMP amplicon sequencing analysis using the MinION™ was consistent with the reference plasmid sequences and the results of nested PCR. Our diagnostic method combined with LAMP and MinION™ could become a simple and accurate tool for the identification of human Plasmodium species, even in resource-limited situations.
Overcoming Sequence Misalignments with Weighted Structural Superposition
Khazanov, Nickolay A.; Damm-Ganamet, Kelly L.; Quang, Daniel X.; Carlson, Heather A.
2012-01-01
An appropriate structural superposition identifies similarities and differences between homologous proteins that are not evident from sequence alignments alone. We have coupled our Gaussian-weighted RMSD (wRMSD) tool with a sequence aligner and seed extension (SE) algorithm to create a robust technique for overlaying structures and aligning sequences of homologous proteins (HwRMSD). HwRMSD overcomes errors in the initial sequence alignment that would normally propagate into a standard RMSD overlay. SE can generate a corrected sequence alignment from the improved structural superposition obtained by wRMSD. HwRMSD’s robust performance and its superiority over standard RMSD are demonstrated over a range of homologous proteins. Its better overlay results in corrected sequence alignments with good agreement to HOMSTRAD. Finally, HwRMSD is compared to established structural alignment methods: FATCAT, SSM, CE, and Dalilite. Most methods are comparable at placing residue pairs within 2 Å, but HwRMSD places many more residue pairs within 1 Å, providing a clear advantage. Such high accuracy is essential in drug design, where small distances can have a large impact on computational predictions. This level of accuracy is also needed to correct sequence alignments in an automated fashion, especially for omics-scale analysis. HwRMSD can align homologs with low sequence identity and large conformational differences, cases where both sequence-based and structural-based methods may fail. The HwRMSD pipeline overcomes the dependency of structural overlays on initial sequence pairing and removes the need to determine the best sequence-alignment method, substitution matrix, and gap parameters for each unique pair of homologs. PMID:22733542
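The Gaussian weighting at the heart of wRMSD can be sketched as below. This is a simplified stand-in: the published method iterates the weighting together with an optimal rigid-body superposition, which is omitted here, and the weight scale `c` is an illustrative value:

```python
import numpy as np

def gaussian_wrmsd(P, Q, c=2.0):
    """Gaussian-weighted RMSD between paired coordinate arrays P and Q
    (shape (n_atoms, 3)). Well-matched pairs get weights near 1, while
    outlier pairs are exponentially down-weighted, so a few divergent
    loops do not dominate the superposition."""
    d2 = np.sum((P - Q) ** 2, axis=1)   # squared per-atom deviations
    w = np.exp(-d2 / c)                 # Gaussian weights
    return np.sqrt(np.sum(w * d2) / np.sum(w))

# Five paired atoms, one badly misaligned: the outlier dominates the
# plain RMSD but is almost ignored by the weighted form.
P = np.zeros((5, 3))
Q = np.zeros((5, 3))
Q[4] = [5.0, 0.0, 0.0]
plain = np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1)))
print(plain, gaussian_wrmsd(P, Q))
```

This insensitivity to outliers is what lets wRMSD recover a good overlay even when the initial sequence alignment mispairs some residues.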
Electrophoretic mobility shift scanning using an automated infrared DNA sequencer.
Sano, M; Ohyama, A; Takase, K; Yamamoto, M; Machida, M
2001-11-01
Electrophoretic mobility shift assay (EMSA) is widely used in the study of sequence-specific DNA-binding proteins, including transcription factors and mismatch-binding proteins. We have established a non-radioisotope-based protocol for EMSA that features an automated DNA sequencer with an infrared fluorescent dye (IRDye) detection unit. Our modification of the electrophoresis unit, which includes cooling the gel plates with a reduced well-to-read length, has made it possible to detect shifted bands within 1 h. Further, we have developed a rapid ligation-based method for generating IRDye-labeled probes with an approximately 60% cost reduction. This method has the advantages of real-time scanning, stability of labeled probes, and the better safety associated with nonradioactive methods of detection. Analysis of a promoter from an industrially important filamentous fungus, Aspergillus oryzae, in a prototype experiment revealed that the method we describe has potential for use in the systematic scanning and identification of the functionally important elements to which cellular factors bind in a sequence-specific manner.
NASA Astrophysics Data System (ADS)
Li, Zixiang; Janardhanan, Mukund Nilakantan; Tang, Qiuhua; Nielsen, Peter
2018-05-01
This article presents the first method to simultaneously balance and sequence robotic mixed-model assembly lines (RMALB/S), which involves three sub-problems: task assignment, model sequencing and robot allocation. A new mixed-integer programming model is developed to minimize makespan and, using CPLEX solver, small-size problems are solved for optimality. Two metaheuristics, the restarted simulated annealing algorithm and co-evolutionary algorithm, are developed and improved to address this NP-hard problem. The restarted simulated annealing method replaces the current temperature with a new temperature to restart the search process. The co-evolutionary method uses a restart mechanism to generate a new population by modifying several vectors simultaneously. The proposed algorithms are tested on a set of benchmark problems and compared with five other high-performing metaheuristics. The proposed algorithms outperform their original editions and the benchmarked methods. The proposed algorithms are able to solve the balancing and sequencing problem of a robotic mixed-model assembly line effectively and efficiently.
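The restart rule for the simulated annealing variant can be sketched generically. This is an illustrative stand-in, not the paper's RMALB/S algorithm: the toy objective (weighted completion time of a task sequence), the swap neighborhood, and all parameter values are assumptions, and the restart is modeled as reheating the temperature after a stall:

```python
import math, random

def restarted_sa(cost, neighbor, x0, t0=1.0, cooling=0.95,
                 steps=2000, stall_limit=100, seed=0):
    """Simulated annealing with a restart rule: when no improvement of the
    best solution is seen for `stall_limit` steps, the current temperature
    is replaced by t0 to re-diversify the search."""
    rng = random.Random(seed)
    x, t, stall = x0, t0, 0
    best, best_c = x0, cost(x0)
    c = best_c
    for _ in range(steps):
        y = neighbor(x, rng)
        cy = cost(y)
        # Metropolis acceptance: always take improvements, sometimes worse moves.
        if cy <= c or rng.random() < math.exp((c - cy) / max(t, 1e-12)):
            x, c = y, cy
        if c < best_c:
            best, best_c, stall = x, c, 0
        else:
            stall += 1
        if stall >= stall_limit:
            t, stall = t0, 0        # restart: reheat instead of cooling further
        else:
            t *= cooling
    return best, best_c

# Toy sequencing instance: order tasks to minimize total completion time
# (the optimum is the shortest-processing-time order).
times = [4, 2, 7, 1, 5]
def cost(perm):
    done, total = 0, 0
    for j in perm:
        done += times[j]
        total += done
    return total
def neighbor(perm, rng):
    i, j = rng.sample(range(len(perm)), 2)
    p = list(perm)
    p[i], p[j] = p[j], p[i]
    return p

best, best_c = restarted_sa(cost, neighbor, list(range(len(times))))
print(best, best_c)
```

The paper's co-evolutionary restart differs in kind (it regenerates a population by modifying several solution vectors at once), but the motivation is the same: escaping stagnation on this NP-hard search space.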
Ávila-Arcos, María C.; Cappellini, Enrico; Romero-Navarro, J. Alberto; Wales, Nathan; Moreno-Mayar, J. Víctor; Rasmussen, Morten; Fordyce, Sarah L.; Montiel, Rafael; Vielle-Calzada, Jean-Philippe; Willerslev, Eske; Gilbert, M. Thomas P.
2011-01-01
The development of second-generation sequencing technologies has greatly benefitted the field of ancient DNA (aDNA). Its application can be further exploited by the use of targeted capture-enrichment methods to overcome restrictions posed by low endogenous and contaminating DNA in ancient samples. We tested the performance of Agilent's SureSelect and Mycroarray's MySelect in-solution capture systems on Illumina sequencing libraries built from ancient maize to identify key factors influencing aDNA capture experiments. High levels of clonality as well as the presence of multiple-copy sequences in the capture targets led to biases in the data regardless of the capture method. Neither method consistently outperformed the other in terms of average target enrichment, and no obvious difference was observed either when two tiling designs were compared. In addition to demonstrating the plausibility of capturing aDNA from ancient plant material, our results also enable us to provide useful recommendations for those planning targeted-sequencing on aDNA. PMID:22355593
Roden, Suzanne E; Dutton, Peter H; Morin, Phillip A
2009-01-01
The green sea turtle, Chelonia mydas, was used as a case study for single nucleotide polymorphism (SNP) discovery in a species that has little genetic sequence information available. As green turtles have a complex population structure, additional nuclear markers beyond microsatellites could add to our understanding of their complex life history. The amplified fragment length polymorphism (AFLP) technique was used to generate sets of random fragments of genomic DNA, which were then electrophoretically separated on precast gels, stained with SYBR green, excised, and directly sequenced. It was possible to perform this method without the use of polyacrylamide gels, radioactively or fluorescently labeled primers, or hybridization methods, reducing the time, expense, and safety hazards of SNP discovery. Within 13 loci, 2547 base pairs were screened, resulting in the discovery of 35 SNPs. Using this method, it was possible to yield a sufficient number of loci to screen for SNP markers without the availability of prior sequence information.
Kim, Ki Hwan; Do, Won-Joon; Park, Sung-Hong
2018-05-04
The routine MRI scan protocol consists of multiple pulse sequences that acquire images of varying contrast. Since high frequency contents such as edges are not significantly affected by image contrast, down-sampled images in one contrast may be improved by high resolution (HR) images acquired in another contrast, reducing the total scan time. In this study, we propose a new deep learning framework that uses HR MR images in one contrast to generate HR MR images from highly down-sampled MR images in another contrast. The proposed convolutional neural network (CNN) framework consists of two CNNs: (a) a reconstruction CNN for generating HR images from the down-sampled images using HR images acquired with a different MRI sequence and (b) a discriminator CNN for improving the perceptual quality of the generated HR images. The proposed method was evaluated using a public brain tumor database and in vivo datasets. The performance of the proposed method was assessed in tumor and no-tumor cases separately, with perceptual image quality being judged by a radiologist. To overcome the challenge of training the network with a small number of available in vivo datasets, the network was pretrained using the public database and then fine-tuned using the small number of in vivo datasets. The performance of the proposed method was also compared to that of several compressed sensing (CS) algorithms. Incorporating HR images of another contrast improved the quantitative assessments of the generated HR image in reference to ground truth. Also, incorporating a discriminator CNN yielded perceptually higher image quality. These results were verified in regions of normal tissue as well as tumors for various MRI sequences from pseudo k-space data generated from the public database. The combination of pretraining with the public database and fine-tuning with the small number of real k-space datasets enhanced the performance of CNNs in in vivo application compared to training CNNs from scratch. 
The proposed method outperformed the compressed sensing methods. The proposed method can be a good strategy for accelerating routine MRI scanning. © 2018 American Association of Physicists in Medicine.
SeSaM-Tv-II generates a protein sequence space that is unobtainable by epPCR.
Mundhada, Hemanshu; Marienhagen, Jan; Scacioc, Andreea; Schenk, Alexander; Roccatano, Danilo; Schwaneberg, Ulrich
2011-07-04
Generating high-quality mutant libraries in which each amino acid is equally targeted and substituted in a chemically diverse manner is crucial to obtaining improved variants from small mutant libraries. The sequence saturation mutagenesis method (SeSaM-Tv(+)) offers the opportunity to generate such high-quality mutant libraries by introducing consecutive mutations and by enriching transversions. In this study, automated gel electrophoresis, real-time quantitative PCR, and a phosphorimager quantification system were developed and employed to optimize each step of the previously reported SeSaM-Tv(+) method. Advancements of the SeSaM-Tv(+) protocol and the use of a novel DNA polymerase quadrupled the number of transversions by doubling the fraction of consecutive mutations (from 16.7% to 37.1%). About 33% of all amino acid substitutions observed in a model library are rarely introduced by epPCR methods, and around 10% of all clones carried amino acid substitutions that are unobtainable by epPCR. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Liu, Biao; Conroy, Jeffrey M.; Morrison, Carl D.; Odunsi, Adekunle O.; Qin, Maochun; Wei, Lei; Trump, Donald L.; Johnson, Candace S.; Liu, Song; Wang, Jianmin
2015-01-01
Somatic Structural Variations (SVs) are a complex collection of chromosomal mutations that could directly contribute to carcinogenesis. Next Generation Sequencing (NGS) technology has emerged as the primary means of interrogating the SVs of the cancer genome in recent investigations. Sophisticated computational methods are required to accurately identify SV events and delineate their breakpoints from the massive amounts of reads generated by an NGS experiment. In this review, we provide an overview of current analytic tools used for SV detection in NGS-based cancer studies. We summarize the features of common SV groups and the primary types of NGS signatures that can be used in SV detection methods. We discuss the principles and key similarities and differences of existing computational programs and comment on unresolved issues related to this research field. The aim of this article is to provide a practical guide to relevant concepts, computational methods, software tools and important factors for analyzing and interpreting NGS data for the detection of SVs in the cancer genome. PMID:25849937
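The review above catalogs NGS signatures used by SV callers; one of the simplest, the discordant read pair, can be sketched in a few lines. The thresholds, labels, and four-standard-deviation cutoff below are illustrative assumptions, not taken from any specific tool discussed:

```python
# Sketch of one common NGS signature for SV detection: discordant read
# pairs. A pair is "discordant" when its apparent insert size deviates
# strongly from the library mean, or when mate orientation is unexpected.
# (Illustrative only; real callers combine many more signals.)

def classify_pair(pos1, pos2, orient1, orient2, mean_insert, sd, nsd=4):
    """Classify a read pair as concordant or as an SV candidate."""
    insert = abs(pos2 - pos1)
    if orient1 == orient2:                 # same-strand mates
        return "inversion_candidate"
    if insert > mean_insert + nsd * sd:
        return "deletion_candidate"        # reference longer than sample
    if insert < mean_insert - nsd * sd:
        return "insertion_candidate"       # reference shorter than sample
    return "concordant"

pairs = [
    (100, 400, "+", "-"),    # normal insert size
    (100, 5400, "+", "-"),   # spans a putative deletion
    (100, 420, "+", "+"),    # orientation flip suggests an inversion
]
calls = [classify_pair(p1, p2, o1, o2, mean_insert=300, sd=30)
         for p1, p2, o1, o2 in pairs]
print(calls)  # -> ['concordant', 'deletion_candidate', 'inversion_candidate']
```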
A disruptive sequencer meets disruptive publishing.
Loman, Nick; Goodwin, Sarah; Jansen, Hans; Loose, Matt
2015-01-01
Nanopore sequencing was recently made available to users in the form of the Oxford Nanopore MinION. Released to users through an early access programme, the MinION is made unique by its tiny form factor and ability to generate very long sequences from single DNA molecules. The platform is undergoing rapid evolution with three distinct nanopore types and five updates to library preparation chemistry in the last 18 months. To keep pace with the rapid evolution of this sequencing platform, and to provide a space where new analysis methods can be openly discussed, we present a new F1000Research channel devoted to updates to and analysis of nanopore sequence data.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G.; Rigoutsos, Isidore
2017-01-01
Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and the molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has presented a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of a Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity, detecting most isoacceptors from minute amounts of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability across cell lines. YAMAT-seq thus provides a high-throughput technique for identifying tRNA profiles and their regulation in various transcriptomes, which could play important regulatory roles in translation and other biological processes. PMID:28108659
Analyzing Immunoglobulin Repertoires
Chaudhary, Neha; Wesemann, Duane R.
2018-01-01
Somatic assembly of T cell receptor and B cell receptor (BCR) genes produces a vast diversity of lymphocyte antigen recognition capacity. The advent of efficient high-throughput sequencing of lymphocyte antigen receptor genes has recently generated unprecedented opportunities for exploration of adaptive immune responses. With these opportunities have come significant challenges in understanding the analysis techniques that most accurately reflect underlying biological phenomena. In this regard, sample preparation and sequence analysis techniques, which have largely been borrowed and adapted from other fields, continue to evolve. Here, we review current methods and challenges of library preparation, sequencing and statistical analysis of lymphocyte receptor repertoire studies. We discuss the general steps in the process of immune repertoire generation including sample preparation, platforms available for sequencing, processing of sequencing data, measurable features of the immune repertoire, and the statistical tools that can be used for analysis and interpretation of the data. Because BCR analysis harbors additional complexities, such as immunoglobulin (Ig) (i.e., antibody) gene somatic hypermutation and class switch recombination, the emphasis of this review is on Ig/BCR sequence analysis. PMID:29593723
Dentinger, Bryn T M; Margaritescu, Simona; Moncalvo, Jean-Marc
2010-07-01
We present two methods for DNA extraction from fresh and dried mushrooms that are adaptable to high-throughput sequencing initiatives, such as DNA barcoding. Our results show that these protocols yield ∼85% sequencing success from recently collected materials. Tests with both recent (<2 year) and older (>100 years) specimens reveal that older collections have low success rates and may be an inefficient resource for populating a barcode database. However, our method of extracting DNA from herbarium samples using a small amount of tissue is reliable and could be used for important historical specimens. The application of these protocols greatly reduces the time, and therefore cost, of generating DNA sequences from mushrooms and other fungi vs. traditional extraction methods. The efficiency of these methods illustrates that standardization and streamlining of sample processing should be shifted from the laboratory to the field. © 2009 Blackwell Publishing Ltd.
An extension of command shaping methods for controlling residual vibration using frequency sampling
NASA Technical Reports Server (NTRS)
Singer, Neil C.; Seering, Warren P.
1992-01-01
The authors present an extension to the impulse shaping technique for commanding machines to move with reduced residual vibration. The extension, called frequency sampling, is a method for generating constraints that are used to obtain shaping sequences which minimize residual vibration in systems, such as robots, whose resonant frequencies change during motion. The authors present a review of impulse shaping methods, a development of the proposed extension, and a comparison of results of tests conducted on a simple model of the space shuttle robot arm. Frequency sampling provides a method for minimizing the impulse sequence duration required to give the desired insensitivity.
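For context, the basic two-impulse zero vibration (ZV) shaper from the impulse shaping literature that the authors extend can be computed in closed form. This sketch shows only that fixed-frequency baseline, not the frequency-sampling extension itself:

```python
import math

def zv_shaper(wn, zeta):
    """Two-impulse Zero Vibration (ZV) shaper for a mode with natural
    frequency wn (rad/s) and damping ratio zeta. Returns a list of
    (time, amplitude) impulses; convolving any command with them cancels
    residual vibration at exactly (wn, zeta)."""
    wd = wn * math.sqrt(1.0 - zeta**2)                    # damped frequency
    K = math.exp(-zeta * math.pi / math.sqrt(1.0 - zeta**2))
    # Amplitudes sum to 1 so the shaped command reaches the same setpoint.
    return [(0.0, 1.0 / (1.0 + K)), (math.pi / wd, K / (1.0 + K))]

seq = zv_shaper(wn=2 * math.pi, zeta=0.05)   # 1 Hz mode, 5% damping
print(seq)  # second impulse lands half a damped period after the first
```

Because this two-impulse solution is exact only at one frequency, systems whose resonances shift during motion motivate the constraint-generation approach described in the abstract.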
2013-01-01
Background Mitochondrial DNA (mtDNA) typing can be a useful aid for identifying people from compromised samples when nuclear DNA is too damaged, degraded or below detection thresholds for routine short tandem repeat (STR)-based analysis. Standard mtDNA typing, focused on PCR amplicon sequencing of the control region (HVS I and HVS II), is limited by the resolving power of this short sequence, which misses up to 70% of the variation present in the mtDNA genome. Methods We used in-solution hybridisation-based DNA capture (using DNA capture probes prepared from modern human mtDNA) to recover mtDNA from post-mortem human remains in which the majority of DNA is both highly fragmented (<100 base pairs in length) and chemically damaged. The method ‘immortalises’ the finite quantities of DNA in valuable extracts as DNA libraries, which is followed by the targeted enrichment of endogenous mtDNA sequences and characterisation by next-generation sequencing (NGS). Results We sequenced whole mitochondrial genomes for human identification from samples where standard nuclear STR typing produced only partial profiles or demonstrably failed and/or where standard mtDNA hypervariable region sequences lacked resolving power. Multiple rounds of enrichment can substantially improve coverage and sequencing depth of mtDNA genomes from highly degraded samples. The application of this method has led to the reliable mitochondrial sequencing of human skeletal remains from unidentified World War Two (WWII) casualties approximately 70 years old and from archaeological remains (up to 2,500 years old). Conclusions This approach has potential applications in forensic science, historical human identification cases, archived medical samples, kinship analysis and population studies. In particular the methodology can be applied to any case, involving human or non-human species, where whole mitochondrial genome sequences are required to provide the highest level of maternal lineage discrimination. 
Multiple rounds of in-solution hybridisation-based DNA capture can retrieve whole mitochondrial genome sequences from even the most challenging samples. PMID:24289217
Sequence comparison alignment-free approach based on suffix tree and L-words frequency.
Soares, Inês; Goios, Ana; Amorim, António
2012-01-01
The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts with the computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words of a preset length L (L-words) in each sequence is rapidly calculated. Based on the L-word frequency profile of each sequence, a pairwise standard Euclidean distance is then computed, producing a symmetric genetic distance matrix which can be used to generate a neighbor-joining dendrogram or a multidimensional scaling graph. We present an improvement to word-counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures with the word-counting tasks. Our approach is thus a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in the Python language and is freely available on the web.
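The core of the distance computation can be sketched directly. Note that this toy version counts L-words by direct hashing rather than with the linear-time generalized suffix tree the paper uses, and the example sequences are invented:

```python
import math
from collections import Counter

def lword_profile(seq, L):
    """Frequency profile of all length-L words (L-words) in seq."""
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclidean_distance(p, q):
    """Standard Euclidean distance between two L-word frequency profiles."""
    words = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))

a = "ACGTACGTACGT"
b = "ACGTTTTTACGT"
d_aa = euclidean_distance(lword_profile(a, 3), lword_profile(a, 3))
d_ab = euclidean_distance(lword_profile(a, 3), lword_profile(b, 3))
print(d_aa, d_ab)  # identical sequences give distance 0; distinct ones > 0
```

Computing this distance for every sequence pair fills the symmetric matrix fed to neighbor joining or multidimensional scaling.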
Lee, Jennifer F.; Hesselberth, Jay R.; Meyers, Lauren Ancel; Ellington, Andrew D.
2004-01-01
The aptamer database is designed to contain comprehensive sequence information on aptamers and unnatural ribozymes that have been generated by in vitro selection methods. Such data are not normally collected in ‘natural’ sequence databases, such as GenBank. Besides serving as a storehouse of sequences that may have diagnostic or therapeutic utility, the database serves as a valuable resource for theoretical biologists who describe and explore fitness landscapes. The database is updated monthly and is publicly available at http://aptamer.icmb.utexas.edu/. PMID:14681367
Veeranagouda, Yaligara; Debono-Lagneaux, Delphine; Fournet, Hamida; Thill, Gilbert; Didier, Michel
2018-01-16
The emergence of clustered regularly interspaced short palindromic repeats-Cas9 (CRISPR-Cas9) gene editing systems has enabled the creation of specific mutants at low cost, in a short time and with high efficiency in eukaryotic cells. Since a CRISPR-Cas9 system typically creates an array of mutations in targeted sites, a successful gene editing project requires careful selection of edited clones. This process can be very challenging, especially when working with multiallelic genes and/or polyploid cells (such as cancer cells and plant cells). Here we describe a next-generation sequencing method called CRISPR-Cas9 Edited Site Sequencing (CRES-Seq) for the efficient and high-throughput screening of CRISPR-Cas9-edited clones. CRES-Seq facilitates precise genotyping of up to 96 CRISPR-Cas9-edited sites (CRES) in a single MiniSeq (Illumina) run with an approximate sequencing cost of $6/clone. CRES-Seq is particularly useful when multiple genes are simultaneously targeted by CRISPR-Cas9, and also for the screening of clones generated from multiallelic genes/polyploid cells. Copyright © 2018 John Wiley & Sons, Inc.
Analysis and Visualization Tool for Targeted Amplicon Bisulfite Sequencing on Ion Torrent Sequencers
Pabinger, Stephan; Ernst, Karina; Pulverer, Walter; Kallmeyer, Rainer; Valdes, Ana M.; Metrustry, Sarah; Katic, Denis; Nuzzo, Angelo; Kriegner, Albert; Vierlinger, Klemens; Weinhaeusel, Andreas
2016-01-01
Targeted sequencing of PCR amplicons generated from bisulfite-deaminated DNA is a flexible, cost-effective way to study methylation of a sample at single-CpG resolution and perform subsequent multi-target, multi-sample comparisons. Currently, no platform-specific protocol, support, or analysis solution is provided to perform targeted bisulfite sequencing on a Personal Genome Machine (PGM). Here, we present a novel tool, called TABSAT, for analyzing targeted bisulfite sequencing data generated on Ion Torrent sequencers. The workflow starts with raw sequencing data, performs quality assessment, and uses a tailored version of Bismark to map the reads to a reference genome. The pipeline visualizes results as lollipop plots and is able to deduce specific methylation patterns present in a sample. The obtained profiles are then summarized and compared between samples. In order to assess the performance of the targeted bisulfite sequencing workflow, 48 samples were used to generate 53 different bisulfite-sequencing PCR amplicons from each sample, resulting in 2,544 amplicon targets. We obtained a mean coverage of 282X using 1,196,822 aligned reads. Next, we compared the sequencing results of these targets to the methylation level of the corresponding sites on an Illumina 450k methylation chip. The calculated average Pearson correlation coefficient of 0.91 confirms the sequencing results with one of the industry-leading CpG methylation platforms and shows that targeted amplicon bisulfite sequencing provides an accurate and cost-efficient method for DNA methylation studies, e.g., to provide platform-independent confirmation of Illumina Infinium 450k methylation data. TABSAT offers a novel way to analyze data generated by Ion Torrent instruments and can also be used with data from the Illumina MiSeq platform. It can be easily accessed via the Platomics platform, which offers a web-based graphical user interface along with sample and parameter storage.
TABSAT is freely available under a GNU General Public License version 3.0 (GPLv3) at https://github.com/tadkeys/tabsat/ and http://demo.platomics.com/. PMID:27467908
Brooks, Matthew J.; Rajasimha, Harsha K.; Roger, Jerome E.
2011-01-01
Purpose Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT–PCR) methods and to evaluate protocols for optimal high-throughput data analysis. Methods Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl−/−) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows–Wheeler Aligner (BWA) followed by analysis of variance (ANOVA), and TopHat followed by Cufflinks. qRT–PCR validation was performed using TaqMan and SYBR Green assays. Results Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl−/− mice with the BWA workflow and 34,115 transcripts with the TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT–PCR. RNA-seq data had a linear relationship with qRT–PCR for more than four orders of magnitude and a goodness of fit (R2) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl−/− retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT–PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with the BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling. 
Conclusions Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions. PMID:22162623
Xiao, Fanshu; Yu, Yuhe; Li, Jinjin; Juneau, Philippe; Yan, Qingyun
2018-05-25
The 16S rRNA gene has been one of the most commonly used molecular markers for estimating bacterial diversity over the past decades. However, there is no consistency in sequencing depth (from thousands to millions of sequences per sample), and the clustering methods used to generate OTUs may also differ among studies. These inconsistent premises make effective comparisons among studies difficult or unreliable. This study aims to determine the sequencing depth and clustering method needed to ensure stable diversity patterns when studying fish gut microbiota. A dataset of 42 gut microbiota samples from Siniperca chuatsi (a carnivorous fish) was used to test how sequencing depth and clustering may affect the alpha and beta diversity patterns of fish intestinal microbiota. Interestingly, we found that the sequencing depth (resampling 1,000-11,000 sequences per sample) and the clustering method (UPARSE versus UCLUST) did not bias the estimates of diversity patterns during fish development from larva to adult. Although we acknowledge that a suitable sequencing depth may differ case by case, our finding indicates that shallow sequencing, such as 1,000 sequences per sample, may be enough to reflect the general diversity patterns of fish gut microbiota. However, we show that strict pre-processing of the original sequences is required to ensure reliable results. This study provides evidence to help make a sound scientific choice of sequencing depth and clustering method for future studies of fish gut microbiota, while reducing analysis costs as much as possible.
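The effect of sequencing depth on diversity estimates is conventionally examined by rarefaction, i.e., repeated subsampling of reads to a fixed depth. A minimal sketch, with an invented toy sample rather than the study's data:

```python
import random

def rarefied_richness(otu_counts, depth, trials=100, seed=0):
    """Estimate OTU richness at a fixed sequencing depth by repeatedly
    subsampling `depth` reads without replacement and averaging the
    number of distinct OTUs observed."""
    rng = random.Random(seed)
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds sample size")
    total = 0
    for _ in range(trials):
        total += len(set(rng.sample(reads, depth)))
    return total / trials

sample = {"OTU_%d" % i: 50 for i in range(20)}   # 20 OTUs, 1,000 reads
print(rarefied_richness(sample, depth=100))      # shallow depth
print(rarefied_richness(sample, depth=1000))     # full depth recovers all 20
```

Comparing such curves across depths is one way to judge whether a shallow depth (e.g., 1,000 sequences per sample) already captures the diversity pattern.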
Pilotte, Nils; Papaiakovou, Marina; Grant, Jessica R; Bierwert, Lou Ann; Llewellyn, Stacey; McCarthy, James S; Williams, Steven A
2016-03-01
The soil transmitted helminths are a group of parasitic worms responsible for extensive morbidity in many of the world's most economically depressed locations. With growing emphasis on disease mapping and eradication, the availability of accurate and cost-effective diagnostic measures is of paramount importance to global control and elimination efforts. While real-time PCR-based molecular detection assays have shown great promise, to date, these assays have utilized sub-optimal targets. By performing next-generation sequencing-based repeat analyses, we have identified high copy-number, non-coding DNA sequences from a series of soil transmitted pathogens. We have used these repetitive DNA elements as targets in the development of novel, multi-parallel, PCR-based diagnostic assays. Utilizing next-generation sequencing and the Galaxy-based RepeatExplorer web server, we performed repeat DNA analysis on five species of soil transmitted helminths (Necator americanus, Ancylostoma duodenale, Trichuris trichiura, Ascaris lumbricoides, and Strongyloides stercoralis). Employing high copy-number, non-coding repeat DNA sequences as targets, novel real-time PCR assays were designed, and assays were tested against established molecular detection methods. Each assay provided consistent detection of genomic DNA at quantities of 2 fg or less, demonstrated species-specificity, and showed an improved limit of detection over the existing, proven PCR-based assay. The utilization of next-generation sequencing-based repeat DNA analysis methodologies for the identification of molecular diagnostic targets has the ability to improve assay species-specificity and limits of detection. By exploiting such high copy-number repeat sequences, the assays described here will facilitate soil transmitted helminth diagnostic efforts. We recommend similar analyses when designing PCR-based diagnostic tests for the detection of other eukaryotic pathogens.
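The underlying idea of mining high copy-number repeats as diagnostic targets can be illustrated with simple k-mer counting. This stands in for the RepeatExplorer-based analysis described above, whose actual graph-clustering approach is more involved; the toy genome is invented:

```python
from collections import Counter

def high_copy_kmers(genome, k, min_copies):
    """Return k-mers whose copy number in `genome` meets `min_copies`.
    High copy-number, non-coding repeats make sensitive PCR targets
    because each genome contributes many template copies."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_copies}

# Toy "genome": a 12-bp repeat unit embedded 30 times among spacer sequence
repeat = "ACGTTGCAACGT"
genome = ("GATTACA" + repeat) * 30 + "TTGACCA"
hits = high_copy_kmers(genome, k=12, min_copies=25)
print(repeat in hits, hits.get(repeat))  # the repeat is found 30 times
```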
RAD tag sequencing as a source of SNP markers in Cynara cardunculus L
2012-01-01
Background The globe artichoke (Cynara cardunculus L. var. scolymus) genome is relatively poorly explored, especially compared to those of the other major Asteraceae crops sunflower and lettuce. No SNP markers are in the public domain. We have combined the recently developed restriction-site associated DNA (RAD) approach with the Illumina DNA sequencing platform to effect the rapid and mass discovery of SNP markers for C. cardunculus. Results RAD tags were sequenced from the genomic DNA of three C. cardunculus mapping population parents, generating 9.7 million reads, corresponding to ~1 Gbp of sequence. An assembly based on paired ends produced ~6.0 Mbp of genomic sequence, separated into ~19,000 contigs (mean length 312 bp), of which ~21% were fragments of putative coding sequence. The shared sequences allowed for the discovery of ~34,000 SNPs and nearly 800 indels, equivalent to a SNP frequency of 5.6 per 1,000 nt and an indel frequency of 0.2 per 1,000 nt. A sample of heterozygous SNP loci was mapped by CAPS assays, and this exercise provided validation of our mining criteria. The repetitive fraction of the genome had a high representation of retrotransposon sequence, followed by simple repeats, AT-low complexity regions and mobile DNA elements. The genomic k-mer distribution and CpG rate of C. cardunculus, compared with data derived from three whole-genome-sequenced dicot species, provided further evidence of the random representation of the C. cardunculus genome generated by RAD sampling. Conclusion The RAD tag sequencing approach is a cost-effective and rapid method to develop SNP markers in a highly heterozygous species. Our approach enabled the generation of a large and robust SNP dataset through the adoption of optimized filtering criteria. PMID:22214349
Designing the optimal shutter sequences for the flutter shutter imaging method
NASA Astrophysics Data System (ADS)
Jelinek, Jan
2010-04-01
Acquiring iris or face images of moving subjects at larger distances using a flash to prevent motion blur quickly runs into eye safety concerns as the acquisition distance is increased. For that reason the flutter shutter method recently proposed by Raskar et al. has generated considerable interest in the biometrics community. This paper concerns the design of shutter sequences that produce the best images. The number of possible sequences grows exponentially in both the subject's motion velocity and the desired exposure value, with the majority being useless. Because the exact solution leads to an intractable mixed integer programming problem, we propose an approximate solution based on pre-screening the sequences according to the distribution of roots in their Fourier transform. A very fast algorithm utilizing the Jury criterion allows the testing to be done without explicitly computing the roots, making the approach practical for moderately long sequences.
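The proposed pre-screening by the spectral behavior of candidate codes can be illustrated by brute force on short sequences. This sketch ranks binary shutter codes by the minimum magnitude of their DFT (the paper's root-distribution test via the Jury criterion is a faster surrogate for the same property); the code length and open-slot count are illustrative:

```python
import cmath
from itertools import product

def min_dft_magnitude(code):
    """Smallest DFT magnitude of a binary shutter code. A code whose
    spectrum stays well away from zero yields an invertible motion blur;
    spectral nulls destroy image information irrecoverably."""
    n = len(code)
    mags = []
    for k in range(n):
        s = sum(c * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, c in enumerate(code))
        mags.append(abs(s))
    return min(mags)

# Brute-force pre-screen of all length-10 codes with exactly 5 open slots
best = max(
    (code for code in product((0, 1), repeat=10) if sum(code) == 5),
    key=min_dft_magnitude,
)
worst = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)  # a "box" shutter: deep spectral nulls
print(min_dft_magnitude(best) > min_dft_magnitude(worst))  # -> True
```

Exhaustive search like this is exactly what becomes intractable for realistic lengths, motivating the fast screening criterion.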
Direct Detection and Sequencing of Damaged DNA Bases
Clark, Tyson A; Spittle, Kristi E; Turner, Stephen W; Korlach, Jonas
2011-12-20
Products of various forms of DNA damage have been implicated in a variety of important biological processes, such as aging, neurodegenerative diseases, and cancer. Therefore, there exists great interest to develop methods for interrogating damaged DNA in the context of sequencing. Here, we demonstrate that single-molecule, real-time (SMRT®) DNA sequencing can directly detect damaged DNA bases in the DNA template - as a by-product of the sequencing method - through an analysis of the DNA polymerase kinetics that are altered by the presence of a modified base. We demonstrate the sequencing of several DNA templates containing products of DNA damage, including 8-oxoguanine, 8-oxoadenine, O6-methylguanine, 1-methyladenine, O4-methylthymine, 5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil, or thymine dimers, and show that these base modifications can be readily detected with single-modification resolution and DNA strand specificity. We characterize the distinct kinetic signatures generated by these DNA base modifications. PMID:22185597
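The kinetics-based detection principle above can be caricatured in a few lines: positions where polymerase interpulse durations (IPDs) deviate strongly from an unmodified control are flagged. The ratio threshold and simulated values are illustrative assumptions, not the platform's actual statistical model:

```python
def detect_modified_positions(native_ipds, control_ipds, min_ratio=2.0):
    """Flag template positions whose mean interpulse duration (IPD) in the
    native sample exceeds the unmodified control by `min_ratio`-fold.
    (Toy version of kinetic-signature detection; real analyses model
    per-position kinetics statistically.)"""
    flagged = []
    for pos, (native, control) in enumerate(zip(native_ipds, control_ipds)):
        mean_n = sum(native) / len(native)
        mean_c = sum(control) / len(control)
        if mean_n / mean_c >= min_ratio:
            flagged.append(pos)
    return flagged

# Simulated IPDs (seconds) at 6 positions; position 3 carries a lesion
control = [[0.1, 0.12, 0.11]] * 6
native = ([[0.1, 0.11, 0.12]] * 3
          + [[0.5, 0.45, 0.55]]          # polymerase stalls at the lesion
          + [[0.1, 0.12, 0.1]] * 2)
print(detect_modified_positions(native, control))  # -> [3]
```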
Automated Sequence Generation Process and Software
NASA Technical Reports Server (NTRS)
Gladden, Roy
2007-01-01
"Automated sequence generation" (autogen) signifies both a process and software used to automatically generate sequences of commands to operate various spacecraft. The autogen software comprises the autogen script plus the Activity Plan Generator (APGEN) program. APGEN can be used for planning missions and command sequences.
Complementary DNA libraries: an overview.
Ying, Shao-Yao
2004-07-01
The generation of complete, full-length cDNA libraries is essential for functional assays of specific gene sequences in biotechnology and biomedical research. The field of cDNA library generation has changed rapidly in the past 10 yr. This review presents an overview of the methods available for generating cDNA libraries, including the definition of a cDNA library, the different kinds of cDNA libraries, the differences between conventional approaches and a novel strategy for cDNA library generation, and the quality of the resulting libraries. It is anticipated that the high-quality cDNA libraries so generated will facilitate studies involving genechips and microarrays, differential display, subtractive hybridization, gene cloning, and peptide library generation.
The correlation structure of several popular pseudorandom number generators
NASA Technical Reports Server (NTRS)
Neuman, F.; Merrick, R.; Martin, C. F.
1973-01-01
One of the desirable properties of a pseudorandom number generator is that the sequence of numbers it generates should have very low autocorrelation for all shifts except zero shift and multiples of its cycle length. Because of the simple methods used to construct random numbers, this ideal is often not quite fulfilled. A simple method of examining any random number generator for previously unsuspected regularities is discussed. Once such regularities are discovered, it is often easy to derive the mathematical relationships that describe the regular behavior. As examples, it is shown that high correlation exists in mixed and multiplicative congruential random number generators and in prime-moduli Lehmer generators for shifts that are a fraction of their cycle lengths.
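The autocorrelation test described above is easy to reproduce for a mixed congruential generator. The constants below are classic glibc-style LCG parameters, used purely for illustration:

```python
def lcg(a, c, m, seed, n):
    """Mixed congruential generator x_{k+1} = (a*x_k + c) mod m."""
    out, x = [], seed
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)          # normalize to [0, 1)
    return out

def autocorrelation(xs, lag):
    """Sample (circular) autocorrelation of xs at a given shift."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[(i + lag) % n] - mean)
              for i in range(n)) / n
    return cov / var

xs = lcg(a=1103515245, c=12345, m=2**31, seed=42, n=2000)
print(autocorrelation(xs, 1))     # small for a well-chosen generator
print(autocorrelation(xs, 2000))  # shift = sample length: correlation is 1
```

Scanning many lags this way is the kind of empirical probe that exposes the regularities the paper derives analytically.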
Detection and Composition of Bacterial Communities in Waters using RNA-based Methods
In recent years, microbial water quality assessments have shifted from solely relying on pure culture-based methods to monitoring bacterial groups of interest using molecular assays such as PCR and qPCR. Furthermore, coupling next generation sequencing technologies with ribosomal...
Method and structure for an improved data reformatting procedure
Chatterjee, Siddhartha [Yorktown Heights, NY; Gunnels, John A [Brewster, NY; Gustavson, Fred Gehrung [Briarcliff Manor, NY
2009-06-30
A method (and structure) of managing memory in which a low-level mechanism is executed to signal, in a sequence of instructions generated at a higher level, that at least a portion of a contiguous area of memory is permitted to be overwritten.
Cotten, Matthew; Oude Munnink, Bas; Canuti, Marta; Deijs, Martin; Watson, Simon J; Kellam, Paul; van der Hoek, Lia
2014-01-01
We have developed a full genome virus detection process that combines sensitive nucleic acid preparation optimised for virus identification in fecal material with Illumina MiSeq sequencing and a novel post-sequencing virus identification algorithm. Enriched viral nucleic acid was converted to double-stranded DNA and subjected to Illumina MiSeq sequencing. The resulting short reads were processed with a novel iterative Python algorithm SLIM for the identification of sequences with homology to known viruses. De novo assembly was then used to generate full viral genomes. The sensitivity of this process was demonstrated with a set of fecal samples from HIV-1 infected patients. A quantitative assessment of the mammalian, plant, and bacterial virus content of this compartment was generated and the deep sequencing data were sufficient to assembly 12 complete viral genomes from 6 virus families. The method detected high levels of enteropathic viruses that are normally controlled in healthy adults, but may be involved in the pathogenesis of HIV-1 infection and will provide a powerful tool for virus detection and for analyzing changes in the fecal virome associated with HIV-1 progression and pathogenesis.
Li, Fan; Ma, Liying; Feng, Yi; Hu, Jing; Ni, Na; Ruan, Yuhua; Shao, Yiming
2017-06-01
HIV-1 transmission in intravenous drug users (IDUs) has been characterized by high genetic multiplicity, suggesting a greater challenge for blocking HIV-1 infection. We investigated a total of 749 full-length gp160 gene sequences obtained by single genome sequencing (SGS) from 22 HIV-1 early-infected IDUs in Xinjiang province, northwest China, and generated a transmitted and founder virus (T/F virus) consensus sequence (IDU.CON). The T/F virus was classified as subtype CRF07_BC and predicted to be a CCR5-tropic virus. The variable regions (V1, V2, and V4 loops) of IDU.CON showed length variation compared with the heterosexual T/F virus consensus sequence (HSX.CON) and the homosexual T/F virus consensus sequence (MSM.CON). A total of 26 N-linked glycosylation sites were discovered in the IDU.CON sequence, fewer than in MSM.CON and HSX.CON. Characterization of T/F viruses from IDUs highlights the genetic make-up and complexity of the virus near the moment of transmission or in early infection preceding systemic dissemination, and is important for the development of effective HIV-1 preventive methods, including vaccines.
Next-generation sequencing in clinical virology: Discovery of new viruses.
Datta, Sibnarayan; Budhauliya, Raghvendra; Das, Bidisha; Chatterjee, Soumya; Vanlalhmuaka; Veer, Vijay
2015-08-12
Viruses are a cause of significant health problems worldwide, especially in developing nations. Due to different anthropological activities, human populations are exposed to different viral pathogens, many of which emerge as outbreaks. In such situations, the discovery of novel viruses is of utmost importance for deciding prevention and treatment strategies. Over the last century, a number of different virus discovery methods, based on cell culture inoculation and sequence-independent PCR, have been used to identify a variety of viruses. However, the recent emergence and commercial availability of next-generation sequencers (NGS) has entirely changed the field of virus discovery. These massively parallel sequencing platforms can sequence a mixture of genetic materials from a very heterogeneous mix with high sensitivity. Moreover, these platforms work in a sequence-independent manner, making them ideal tools for virus discovery. However, for their application in clinics, sample preparation or enrichment is necessary to detect low-abundance virus populations. A number of techniques have also been developed for the enrichment of viral nucleic acids. In this manuscript, we review the evolution of sequencing and the NGS technologies available today, as well as widely used virus enrichment technologies. We also discuss the challenges associated with their application in clinical virus discovery.
Liang, Chanjuan; van Dijk, Jeroen P; Scholtens, Ingrid M J; Staats, Martijn; Prins, Theo W; Voorhuijzen, Marleen M; da Silva, Andrea M; Arisi, Ana Carolina Maisonnave; den Dunnen, Johan T; Kok, Esther J
2014-04-01
The growing number of biotech crops with novel genetic elements increasingly complicates the detection of genetically modified organisms (GMOs) in food and feed samples using conventional screening methods. Unauthorized GMOs (UGMOs) in food and feed are currently identified through combining GMO element screening with sequencing the DNA flanking these elements. In this study, a specific and sensitive qPCR assay was developed for vip3A element detection based on the vip3Aa20 coding sequences of the recently marketed MIR162 maize and COT102 cotton. Furthermore, SiteFinding-PCR in combination with Sanger, Illumina or Pacific BioSciences (PacBio) sequencing was performed targeting the flanking DNA of the vip3Aa20 element in MIR162. De novo assembly and Basic Local Alignment Search Tool searches were used to mimic UGMO identification. PacBio data resulted in relatively long contigs in the upstream (1,326 nucleotides (nt); 95 % identity) and downstream (1,135 nt; 92 % identity) regions, whereas Illumina data resulted in two smaller contigs of 858 and 1,038 nt with higher sequence identity (>99 % identity). Both approaches outperformed Sanger sequencing, underlining the potential for next-generation sequencing in UGMO identification.
Length-independent structural similarities enrich the antibody CDR canonical class model.
Nowak, Jaroslaw; Baker, Terry; Georges, Guy; Kelm, Sebastian; Klostermann, Stefan; Shi, Jiye; Sridharan, Sudharsan; Deane, Charlotte M
2016-01-01
Complementarity-determining regions (CDRs) are antibody loops that make up the antigen binding site. Here, we show that all CDR types have structurally similar loops of different lengths. Based on these findings, we created length-independent canonical classes for the non-H3 CDRs. Our length variable structural clusters show strong sequence patterns suggesting either that they evolved from the same original structure or result from some form of convergence. We find that our length-independent method not only clusters a larger number of CDRs, but also predicts canonical class from sequence better than the standard length-dependent approach. To demonstrate the usefulness of our findings, we predicted cluster membership of CDR-L3 sequences from 3 next-generation sequencing datasets of the antibody repertoire (over 1,000,000 sequences). Using the length-independent clusters, we can structurally classify an additional 135,000 sequences, which represents a ∼20% improvement over the standard approach. This suggests that our length-independent canonical classes might be a highly prevalent feature of antibody space, and could substantially improve our ability to accurately predict the structure of novel CDRs identified by next-generation sequencing.
Kocot, Kevin M; Citarella, Mathew R; Moroz, Leonid L; Halanych, Kenneth M
2013-01-01
Molecular phylogenetics relies on accurate identification of orthologous sequences among the taxa of interest. Most orthology inference programs available for use in phylogenomics rely on small sets of pre-defined orthologs from model organisms or phenetic approaches such as all-versus-all sequence comparisons followed by Markov graph-based clustering. Such approaches have high sensitivity but may erroneously include paralogous sequences. We developed PhyloTreePruner, a software utility that uses a phylogenetic approach to refine orthology inferences made using phenetic methods. PhyloTreePruner checks single-gene trees for evidence of paralogy and generates a new alignment for each group containing only sequences inferred to be orthologs. Importantly, PhyloTreePruner takes into account support values on the tree and avoids unnecessarily deleting sequences in cases where a weakly supported tree topology incorrectly indicates paralogy. A test of PhyloTreePruner on a dataset generated from 11 completely sequenced arthropod genomes identified 2,027 orthologous groups sampled for all taxa. Phylogenetic analysis of the concatenated supermatrix yielded a generally well-supported topology that was consistent with the current understanding of arthropod phylogeny. PhyloTreePruner is freely available from http://sourceforge.net/projects/phylotreepruner/.
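The core pruning idea described above, keeping the largest subtree in which no taxon appears more than once, can be sketched in a few lines of Python. This is a toy illustration with invented names, not the PhyloTreePruner implementation, and it ignores the branch-support handling the tool also performs; trees are nested tuples with leaf labels of the assumed form 'taxon|sequence_id'.

```python
def leaves(tree):
    """Collect leaf labels from a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out

def taxon(leaf):
    """Leaf labels are assumed to be 'taxon|sequence_id'."""
    return leaf.split("|")[0]

def largest_ortholog_subtree(tree):
    """Return the leaf set of the largest subtree in which every taxon
    occurs at most once -- a crude stand-in for the pruning step applied
    to single-gene trees showing evidence of paralogy."""
    best = []
    def visit(node):
        nonlocal best
        lv = leaves(node)
        taxa = [taxon(l) for l in lv]
        if len(set(taxa)) == len(taxa) and len(lv) > len(best):
            best = lv
        if not isinstance(node, str):
            for child in node:
                visit(child)
    visit(tree)
    return best
```

For example, a gene tree containing two sequences from taxon A is pruned back to its largest duplicate-free subtree.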
NASA Astrophysics Data System (ADS)
Prato, Marco; Bonettini, Silvia; Loris, Ignace; Porta, Federica; Rebegoldi, Simone
2016-10-01
The scaled gradient projection (SGP) method is a first-order optimization method applicable to the constrained minimization of smooth functions and exploiting a scaling matrix multiplying the gradient and a variable steplength parameter to improve the convergence of the scheme. For a general nonconvex function, the limit points of the sequence generated by SGP have been proved to be stationary, while in the convex case and with some restrictions on the choice of the scaling matrix the sequence itself converges to a constrained minimum point. In this paper we extend these convergence results by showing that the SGP sequence converges to a limit point provided that the objective function satisfies the Kurdyka-Łojasiewicz property at each point of its domain and its gradient is Lipschitz continuous.
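A stripped-down version of the SGP iteration on a nonnegativity constraint can be sketched as follows. This is a minimal illustration under simplifying assumptions: a fixed steplength and a constant diagonal scaling matrix, whereas the actual method uses variable steplengths and more general scalings; the function names are ours.

```python
def sgp_nonneg(grad, x0, steps=200, alpha=0.1, scale=None):
    """Minimal scaled-gradient-projection sketch for min f(x) s.t. x >= 0.
    'scale' holds the positive diagonal entries of the scaling matrix D;
    the update is x <- P(x - alpha * D * grad f(x)), where P projects
    onto the nonnegative orthant."""
    x = list(x0)
    d = scale or [1.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        x = [max(0.0, xi - alpha * di * gi)
             for xi, di, gi in zip(x, d, g)]
    return x
```

On the convex model problem f(x) = ||x - b||^2 / 2 the iterates converge to the projection of b onto the feasible set, consistent with the convergence results discussed above.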
Identification of cancer-specific motifs in mimotope profiles of serum antibody repertoire.
Gerasimov, Ekaterina; Zelikovsky, Alex; Măndoiu, Ion; Ionov, Yurij
2017-06-07
In fighting cancer, earlier detection is crucial. Circulating auto-antibodies produced by the patient's own immune system after exposure to cancer proteins are promising biomarkers for the early detection of cancer. Since an antibody recognizes not the whole antigen but 4-7 critical amino acids within the antigenic determinant (epitope), the whole proteome can be represented by a random peptide phage display library. This opens the possibility of developing an early cancer detection test based on a set of peptide sequences identified by comparing the global peptide profiles of antibody specificities of cancer patients and healthy donors. Due to the enormously large number of peptide sequences contained in global peptide profiles generated by next-generation sequencing, a large number of cancer and control sera is required to identify cancer-specific peptides with a high degree of statistical significance. To decrease the number of peptides in these profiles without losing cancer-specific sequences, we generated profiles from a phage library enriched by panning on a pool of cancer sera. To further decrease the complexity of the profiles, we used computational methods to transform the list of peptides constituting the mimotope profiles into a list of motifs formed by similar peptide sequences. We show that amino-acid order is meaningful in mimotope motifs, since they contain significantly more peptides than motifs derived from peptides in which the amino acids are randomly permuted. Single-sample motifs also differ significantly from motifs in peptides drawn from multiple samples. Finally, multiple cancer-specific motifs were identified.
Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu
2016-04-01
Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given by modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional chi-square statistic has an approximate gamma distribution under the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as the R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html. Supplementary data are available at Bioinformatics online.
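The counting primitive underlying such word-count methods, estimating order-k transition probabilities directly from unassembled reads by pooling (k+1)-mer counts, can be sketched as follows. This is a minimal Python illustration with invented names, not the paper's normal approximation or the 'NGS.MC' package.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count k-mers across a set of short reads (no assembly needed)."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def transition_probs(reads, order):
    """Estimate order-'order' Markov transition probabilities from reads
    by normalizing (order+1)-mer counts by their order-mer prefixes."""
    num = kmer_counts(reads, order + 1)
    den = Counter()
    for w, n in num.items():
        den[w[:-1]] += n
    return {w: n / den[w[:-1]] for w, n in num.items()}
```

Because counts are pooled over reads rather than taken from one long sequence, the sampling distribution of these estimates differs from the single-sequence case, which is exactly the gap the approximations above address.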
Weber, Stefanie; Büscher, Anja K; Hagmann, Henning; Liebau, Max C; Heberle, Christian; Ludwig, Michael; Rath, Sabine; Alberer, Martin; Beissert, Antje; Zenker, Martin; Hoyer, Peter F; Konrad, Martin; Klein, Hanns-Georg; Hoefele, Julia
2016-01-01
Steroid-resistant nephrotic syndrome (SRNS) is a severe cause of progressive renal disease. Genetic forms of SRNS can present with autosomal recessive or autosomal dominant inheritance. Recent studies have identified mutations in multiple podocyte genes responsible for SRNS. Improved sequencing methods (next-generation sequencing, NGS) now promise rapid mutational testing of SRNS genes. In the present study, a simultaneous screening of ten SRNS genes in 37 SRNS patients was performed by NGS. In 38 % of the patients, causative mutations in one SRNS gene were found. In 22 % of the patients, in addition to these mutations, a secondary variant in a different gene was identified. This high incidence of accumulating sequence variants was unexpected; although they might have modifier effects, the pathogenic potential of these additional sequence variants remains unclear so far. This example of molecular diagnostics by NGS in SRNS patients shows that the new sequencing technologies might provide further insight into molecular pathogenicity in genetic disorders, but will also generate results that are difficult to interpret and that complicate genetic counseling. Although NGS promises more frequent identification of disease-causing mutations, the interpretation of causative mutations and incidental findings may pose problems and pitfalls, which will hopefully diminish with further experience and the elucidation of molecular interactions.
DNA sequencing using polymerase substrate-binding kinetics
Previte, Michael John Robert; Zhou, Chunhong; Kellinger, Matthew; Pantoja, Rigo; Chen, Cheng-Yao; Shi, Jin; Wang, BeiBei; Kia, Amirali; Etchin, Sergey; Vieceli, John; Nikoomanzar, Ali; Bomati, Erin; Gloeckner, Christian; Ronaghi, Mostafa; He, Molly Min
2015-01-01
Next-generation sequencing (NGS) has transformed genomic research by decreasing the cost of sequencing. However, whole-genome sequencing is still costly and complex for diagnostics purposes. In the clinical space, targeted sequencing has the advantage of allowing researchers to focus on specific genes of interest. Routine clinical use of targeted NGS mandates inexpensive instruments, fast turnaround time and an integrated and robust workflow. Here we demonstrate a version of the Sequencing by Synthesis (SBS) chemistry that potentially can become a preferred targeted sequencing method in the clinical space. This sequencing chemistry uses natural nucleotides and is based on real-time recording of the differential polymerase/DNA-binding kinetics in the presence of correct or mismatch nucleotides. This ensemble SBS chemistry has been implemented on an existing Illumina sequencing platform with integrated cluster amplification. We discuss the advantages of this sequencing chemistry for targeted sequencing as well as its limitations for other applications. PMID:25612848
Bandelt, Hans-Jürgen; Yao, Yong-Gang; Bravi, Claudio M; Salas, Antonio; Kivisild, Toomas
2009-03-01
Sequence analysis of the mitochondrial genome has become a routine method in the study of mitochondrial diseases. Quite often, the sequencing efforts in the search of pathogenic or disease-associated mutations are affected by technical and interpretive problems, caused by sample mix-up, contamination, biochemical problems, incomplete sequencing, misdocumentation and insufficient reference to previously published data. To assess data quality in case studies of mitochondrial diseases, it is recommended to compare any mtDNA sequence under consideration to their phylogenetically closest lineages available in the Web. The median network method has proven useful for visualizing potential problems with the data. We contrast some early reports of complete mtDNA sequences to more recent total mtDNA sequencing efforts in studies of various mitochondrial diseases. We conclude that the quality of complete mtDNA sequences generated in the medical field in the past few years is somewhat unsatisfactory and may even fall behind that of pioneer manual sequencing in the early nineties. Our study provides a paradigm for an a posteriori evaluation of sequence quality and for detection of potential problems with inferring a pathogenic status of a particular mutation.
Looking for trees in the forest: summary tree from posterior samples.
Heled, Joseph; Bouckaert, Remco R
2013-10-04
Bayesian phylogenetic analysis generates a set of trees which are often condensed into a single tree representing the whole set. Many methods exist for selecting a representative topology for a set of unrooted trees, few exist for assigning branch lengths to a fixed topology, and even fewer for simultaneously setting the topology and branch lengths. However, there is very little research into locating a good representative for a set of rooted time trees like the ones obtained from a BEAST analysis. We empirically compare new and known methods for generating a summary tree. Some new methods are motivated by mathematical constructions such as tree metrics, while the rest employ tree concepts which work well in practice. These use more of the posterior than existing methods, which discard information not directly mapped to the chosen topology. Using results from a large number of simulations we assess the quality of a summary tree, measuring (a) how well it explains the sequence data under the model and (b) how close it is to the "truth", i.e. to the tree used to generate the sequences. Our simulations indicate that no single method is "best". Methods producing good divergence time estimates have poor branch lengths and lower model fit, and vice versa. Using the results presented here, a user can choose the appropriate method based on the purpose of the summary tree. PMID:24093883
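One ingredient shared by several summary-tree methods is the posterior frequency of each clade across the sampled trees, which drives majority-rule and maximum-clade-credibility summaries. A minimal Python sketch with invented function names, representing rooted trees as nested tuples of tip labels (branch lengths omitted):

```python
from collections import Counter

def clades(tree):
    """Return the set of nontrivial clades (frozensets of tips) of a
    nested-tuple rooted tree."""
    out = set()
    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        t = frozenset().union(*(tips(c) for c in node))
        out.add(t)
        return t
    tips(tree)
    return out

def clade_support(trees):
    """Posterior frequency of each clade across a sample of trees."""
    counts = Counter()
    for t in trees:
        for c in clades(t):
            counts[c] += 1
    n = len(trees)
    return {c: k / n for c, k in counts.items()}
```

A summary method might then pick the topology maximizing the product of its clade supports; as the abstract notes, no single choice is best for every purpose.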
Cheng, Ji-Hong; Liu, Wen-Chun; Chang, Ting-Tsung; Hsieh, Sun-Yuan; Tseng, Vincent S
2017-10-01
Many studies have suggested that deletions in the Hepatitis B Virus (HBV) genome are associated with the development of progressive liver diseases, even ultimately resulting in hepatocellular carcinoma (HCC). Among the methods for detecting deletions from next-generation sequencing (NGS) data, few consider the characteristics of viruses, such as high evolution rates and high divergence among different HBV genomes. Sequencing highly divergent HBV genomes with NGS technology produces millions of reads, so detecting the exact breakpoints of deletions from these large and complex data incurs a very high computational cost. We propose a novel analytical method named VirDelect (Virus Deletion Detect), which uses split-read alignment to detect exact breakpoints and a diversity variable to account for high divergence in single-end read data, so that the computational cost can be reduced without losing accuracy. We use four simulated read datasets and two real paired-end read datasets of HBV genome sequences to verify the accuracy of VirDelect with score functions. The experimental results show that VirDelect outperforms the state-of-the-art method Pindel in accuracy score on all simulated datasets, and VirDelect made only two base errors even on the real datasets. VirDelect also delivers high accuracy when analyzing single-end read data as well as paired-end data. VirDelect can serve as an effective and efficient bioinformatics tool, applicable to further analyses of viruses with characteristics similar to HBV in genome length and divergence. The software can be downloaded at https://sourceforge.net/projects/virdelect/.
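The split-read principle, anchoring a read's prefix on the reference and locating the remaining suffix downstream to expose a deletion, can be illustrated with a toy exact-match sketch. This is our own simplification: real callers such as VirDelect and Pindel additionally tolerate mismatches to cope with divergent viral genomes.

```python
def split_read_deletion(ref, read, seed=5):
    """Toy split-read breakpoint caller: anchor the first 'seed' bases of
    the read in the reference, extend the exact prefix match, then locate
    the remaining suffix downstream. Returns (del_start, del_end) in
    reference coordinates, or None if no deletion is evident."""
    start = ref.find(read[:seed])
    if start < 0:
        return None
    i = 0
    while i < len(read) and start + i < len(ref) and read[i] == ref[start + i]:
        i += 1
    if i == len(read):
        return None  # read maps contiguously: no deletion
    suffix = read[i:]
    j = ref.find(suffix, start + i + 1)
    if j < 0:
        return None
    return (start + i, j)
```

A read drawn from a genome lacking a reference segment splits cleanly at the deletion boundaries, which is the signal such callers aggregate over millions of reads.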
Hirsch, B; Endris, V; Lassmann, S; Weichert, W; Pfarr, N; Schirmacher, P; Kovaleva, V; Werner, M; Bonzheim, I; Fend, F; Sperveslage, J; Kaulich, K; Zacher, A; Reifenberger, G; Köhrer, K; Stepanow, S; Lerke, S; Mayr, T; Aust, D E; Baretton, G; Weidner, S; Jung, A; Kirchner, T; Hansmann, M L; Burbat, L; von der Wall, E; Dietel, M; Hummel, M
2018-04-01
The simultaneous detection of multiple somatic mutations in the context of molecular diagnostics of cancer is frequently performed by means of amplicon-based targeted next-generation sequencing (NGS). However, only few studies are available comparing multicenter testing of different NGS platforms and gene panels. Therefore, seven partner sites of the German Cancer Consortium (DKTK) performed a multicenter interlaboratory trial for targeted NGS using the same formalin-fixed, paraffin-embedded (FFPE) specimens of molecularly pre-characterized tumors (n = 15; n = 5 cases each of breast, lung, and colon carcinoma) and a colorectal cancer cell line DNA dilution series. Detailed information regarding pre-characterized mutations was not disclosed to the partners. Commercially available and custom-designed cancer gene panels were used for library preparation and subsequent sequencing on several devices of two different NGS platforms. For every case, centrally extracted DNA and FFPE tissue sections for local processing were delivered to each partner site to be sequenced with the commercial gene panel and local bioinformatics. For cancer-specific panel-based sequencing, only centrally extracted DNA was analyzed at seven sequencing sites. Subsequently, local data were compiled and bioinformatics was performed centrally. We were able to demonstrate that all pre-characterized mutations were re-identified correctly, irrespective of NGS platform or gene panel used. However, locally processed FFPE tissue sections disclosed that the DNA extraction method can affect the detection of mutations, with a trend in favor of magnetic bead-based DNA extraction methods. In conclusion, targeted NGS is a very robust method for simultaneous detection of various mutations in FFPE tissue specimens if certain pre-analytical conditions are carefully considered.
Algorithms for optimizing cross-overs in DNA shuffling.
He, Lu; Friedman, Alan M; Bailey-Kellogg, Chris
2012-03-21
DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining parent genes. The resulting libraries are subjected to large-scale genetic selection or screening to identify those chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been applied quite successfully, it is limited by its homology-dependent, stochastic nature. Consequently, it is used only with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library. This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more diverse parents and to generate more predictable, optimized libraries. Our CODNS (cross-over optimization for DNA shuffling) approach employs polynomial-time dynamic programming algorithms to select codons for the parental amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms to optimize the local sequence identity or the nearest-neighbor approximation of the change in free energy upon annealing, objectives that were previously optimized by computationally-expensive integer programming methods. We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% amino acid sequence identity), two very distantly related DNA polymerases, Pol X and β (15%), and beta-lactamases of varying identity (26-47%). 
Our methods provide the protein engineer with a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.
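The "runs of common nucleotides" objective mentioned above can be illustrated with a short sketch that merely measures such runs between two aligned parent sequences; the CODNS algorithms themselves go further and choose codons by dynamic programming to lengthen these runs, which this toy code (with invented names) does not attempt.

```python
def common_runs(seq1, seq2, min_len=4):
    """List (start, length) of runs of identical nucleotides between two
    equal-length aligned parent sequences. Shuffling cross-overs tend to
    concentrate in such runs, which is the quantity CODNS-style codon
    optimization seeks to enhance."""
    assert len(seq1) == len(seq2)
    runs, start = [], None
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    if start is not None and len(seq1) - start >= min_len:
        runs.append((start, len(seq1) - start))
    return runs
```

Summing run lengths gives a crude score of how recombination-prone a pair of parent gene designs is, the kind of objective the paper optimizes over codon choices.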
Navigating Microbiological Food Safety in the Era of Whole-Genome Sequencing
Nasheri, Neda; Petronella, Nicholas; Pagotto, Franco
2016-01-01
The epidemiological investigation of a foodborne outbreak, including identification of related cases, source attribution, and development of intervention strategies, relies heavily on the ability to subtype the etiological agent at a high enough resolution to differentiate related from nonrelated cases. Historically, several different molecular subtyping methods have been used for this purpose; however, emerging techniques, such as single nucleotide polymorphism (SNP)-based techniques, that use whole-genome sequencing (WGS) offer a resolution that was previously not possible. With WGS, unlike traditional subtyping methods that lack complete information, data can be used to elucidate phylogenetic relationships and disease-causing lineages can be tracked and monitored over time. The subtyping resolution and evolutionary context provided by WGS data allow investigators to connect related illnesses that would be missed by traditional techniques. The added advantage of data generated by WGS is that these data can also be used for secondary analyses, such as virulence gene detection, antibiotic resistance gene profiling, synteny comparisons, mobile genetic element identification, and geographic attribution. In addition, several software packages are now available to generate in silico results for traditional molecular subtyping methods from the whole-genome sequence, allowing for efficient comparison with historical databases. Metagenomic approaches using next-generation sequencing have also been successful in the detection of nonculturable foodborne pathogens. This review addresses state-of-the-art techniques in microbial WGS and analysis and then discusses how this technology can be used to help support food safety investigations. Retrospective outbreak investigations using WGS are presented to provide organism-specific examples of the benefits, and challenges, associated with WGS in comparison to traditional molecular subtyping techniques. PMID:27559074
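At its simplest, the SNP-based subtyping that WGS enables reduces to pairwise SNP distances between aligned genomes, which are then clustered to separate outbreak isolates from background cases. A minimal Python sketch with invented names; real pipelines call SNPs from read alignments against a reference rather than comparing pre-aligned consensus sequences as done here.

```python
def snp_distance(a, b):
    """Pairwise SNP distance between two aligned genome sequences,
    ignoring positions with gaps ('-') or ambiguous 'N' calls."""
    assert len(a) == len(b)
    skip = set("N-")
    return sum(1 for x, y in zip(a, b)
               if x != y and x not in skip and y not in skip)

def distance_matrix(genomes):
    """All-vs-all SNP distances for a dict of name -> aligned sequence,
    the input to clustering outbreak isolates."""
    names = sorted(genomes)
    return {(p, q): snp_distance(genomes[p], genomes[q])
            for p in names for q in names if p < q}
```

Isolates within a small SNP distance of one another (thresholds are organism-specific) would be flagged as candidate members of the same outbreak cluster.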
Schmidt, Martin; Van Bel, Michiel; Woloszynska, Magdalena; Slabbinck, Bram; Martens, Cindy; De Block, Marc; Coppens, Frederik; Van Lijsebettens, Mieke
2017-07-06
Cytosine methylation in plant genomes is important for the regulation of gene transcription and transposon activity. Genome-wide methylomes are studied upon mutation of the DNA methyltransferases, adaptation to environmental stresses or during development. However, from basic biology to breeding programs, there is a need to monitor multiple samples to determine transgenerational methylation inheritance or differential cytosine methylation. Methylome data obtained by sodium hydrogen sulfite (bisulfite)-conversion and next-generation sequencing (NGS) provide genome-wide information on cytosine methylation. However, a profiling method that detects cytosine methylation state dispersed over the genome would allow high-throughput analysis of multiple plant samples with distinct epigenetic signatures. We use specific restriction endonucleases to enrich for cytosine coverage in a bisulfite and NGS-based profiling method, which was compared to whole-genome bisulfite sequencing of the same plant material. We established an effective methylome profiling method in plants, termed plant-reduced representation bisulfite sequencing (plant-RRBS), using optimized double restriction endonuclease digestion, fragment end repair, adapter ligation, followed by bisulfite conversion, PCR amplification and NGS. We report a performant laboratory protocol and a straightforward bioinformatics data analysis pipeline for plant-RRBS, applicable for any reference-sequenced plant species. As a proof of concept, methylome profiling was performed using an Oryza sativa ssp. indica pure breeding line and a derived epigenetically altered line (epiline). Plant-RRBS detects methylation levels at tens of millions of cytosine positions deduced from bisulfite conversion in multiple samples. To evaluate the method, the coverage of cytosine positions, the intra-line similarity and the differential cytosine methylation levels between the pure breeding line and the epiline were determined. 
Plant-RRBS reproducibly covers commonly up to one fourth of the cytosine positions in the rice genome when using MspI-DpnII within a group of five biological replicates of a line. The method predominantly detects cytosine methylation in putative promoter regions and not-annotated regions in rice. Plant-RRBS offers high-throughput and broad, genome-dispersed methylation detection by effective read number generation obtained from reproducibly covered genome fractions using optimized endonuclease combinations, facilitating comparative analyses of multi-sample studies for cytosine methylation and transgenerational stability in experimental material and plant breeding populations.
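The enrichment step above can be mimicked in silico: cut the genome at every MspI (CCGG) and DpnII (GATC) recognition site and keep fragments in a sequencable size window. A simplified Python sketch of our own; cut positions are placed at the site start for clarity, whereas the real enzymes cut within or beside their recognition sites, and the size window below is illustrative.

```python
def double_digest(seq, sites=("CCGG", "GATC")):
    """In silico double restriction digest: cut at every occurrence of
    either recognition site and return the resulting fragments."""
    cuts = sorted({i for s in sites
                   for i in range(len(seq))
                   if seq.startswith(s, i)})
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

def size_select(fragments, lo=40, hi=300):
    """Keep fragments within the size window retained for sequencing."""
    return [f for f in fragments if lo <= len(f) <= hi]
```

Running such a digest over a reference genome predicts which cytosine positions a given endonuclease combination will reproducibly cover, which is how alternative enzyme pairs can be compared before running the wet-lab protocol.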
Eduardoff, Mayra; Xavier, Catarina; Strobl, Christina; Casas-Vargas, Andrea; Parson, Walther
2017-01-01
The analysis of mitochondrial DNA (mtDNA) has proven useful in forensic genetics and ancient DNA (aDNA) studies, where specimens are often highly compromised and DNA quality and quantity are low. In forensic genetics, the mtDNA control region (CR) is commonly sequenced using established Sanger-type Sequencing (STS) protocols involving fragment sizes down to approximately 150 base pairs (bp). Recent developments include Massively Parallel Sequencing (MPS) of (multiplex) PCR-generated libraries using the same amplicon sizes. Molecular genetic studies on archaeological remains that harbor more degraded aDNA have pioneered alternative approaches to target mtDNA, such as capture hybridization and primer extension capture (PEC) methods followed by MPS. These assays target smaller mtDNA fragment sizes (down to 50 bp or less), and have proven to be substantially more successful in obtaining useful mtDNA sequences from these samples compared to electrophoretic methods. Here, we present the modification and optimization of a PEC method, earlier developed for sequencing the Neanderthal mitochondrial genome, with forensic applications in mind. Our approach was designed for a more sensitive enrichment of the mtDNA CR in a single tube assay and short laboratory turnaround times, thus complying with forensic practices. We characterized the method using sheared, high quantity mtDNA (six samples), and tested challenging forensic samples (n = 2) as well as compromised solid tissue samples (n = 15) up to 8 kyrs of age. The PEC MPS method produced reliable and plausible mtDNA haplotypes that were useful in the forensic context. It yielded plausible data in samples that did not provide results with STS and other MPS techniques. We addressed the issue of contamination by including four generations of negative controls, and discuss the results in the forensic context. 
We finally offer perspectives for future research to enable the validation and accreditation of the PEC MPS method for final implementation in forensic genetic laboratories. PMID:28934125
Method for construction of normalized cDNA libraries
Soares, M.B.; Efstratiadis, A.
1996-01-09
This invention provides a method to normalize a directional cDNA library constructed in a vector that allows propagation in single-stranded circle form. The method comprises: (a) propagating the directional cDNA library in single-stranded circles; (b) generating fragments complementary to the 3' noncoding sequence of the single-stranded circles in the library to produce partial duplexes; (c) purifying the partial duplexes; (d) melting and reassociating the purified partial duplexes to moderate Cot; and (e) purifying the unassociated single-stranded circles, thereby generating a normalized cDNA library. 4 figs.
McCourt, Clare M; McArt, Darragh G; Mills, Ken; Catherwood, Mark A; Maxwell, Perry; Waugh, David J; Hamilton, Peter; O'Sullivan, Joe M; Salto-Tellez, Manuel
2013-01-01
Next Generation Sequencing (NGS) has the potential to become an important tool in clinical diagnosis and therapeutic decision-making in oncology owing to its enhanced sensitivity in DNA mutation detection, fast turnaround of samples in comparison to current gold standard methods and the potential to sequence a large number of cancer-driving genes at the same time. We aim to test the diagnostic accuracy of current NGS technology in the analysis of mutations that represent current standard-of-care, and its reliability to generate concomitant information on other key genes in human oncogenesis. Thirteen clinical samples (8 lung adenocarcinomas, 3 colon carcinomas and 2 malignant melanomas) already genotyped for EGFR, KRAS and BRAF mutations by current standard-of-care methods (Sanger Sequencing and q-PCR) were analysed for detection of mutations in the same three genes using two NGS platforms, and for an additional 43 genes with one of these platforms. The results were analysed using closed, platform-specific proprietary bioinformatics software as well as open third-party applications. Our results indicate that the existing format of the NGS technology performed well in detecting the clinically relevant mutations stated above but may not be reliable for a broader unsupervised analysis of the wider genome in its current design. Our study represents a diagnostically led validation of the major strengths and weaknesses of this technology before consideration for diagnostic use.
A Multiplexed Amplicon Approach for Detecting Gene Fusions by Next-Generation Sequencing.
Beadling, Carol; Wald, Abigail I; Warrick, Andrea; Neff, Tanaya L; Zhong, Shan; Nikiforov, Yuri E; Corless, Christopher L; Nikiforova, Marina N
2016-03-01
Chromosomal rearrangements that result in oncogenic gene fusions are clinically important drivers of many cancer types. Rapid and sensitive methods are therefore needed to detect a broad range of gene fusions in clinical specimens that are often of limited quantity and quality. We describe a next-generation sequencing approach that uses a multiplex PCR-based amplicon panel to interrogate fusion transcripts that involve 19 driver genes and 94 partners implicated in solid tumors. The panel also includes control assays that evaluate the 3'/5' expression ratios of 12 oncogenic kinases, which might be used to infer gene fusion events when the partner is unknown or not included on the panel. There was good concordance between the solid tumor fusion gene panel and other methods, including fluorescence in situ hybridization, real-time PCR, Sanger sequencing, and other next-generation sequencing panels, with 40 specimens known to harbor gene fusions correctly identified. No specific fusion reads were observed in 59 fusion-negative specimens. The 3'/5' expression ratio was informative for fusions that involved ALK, RET, and NTRK1 but not for BRAF or ROS1 fusions. However, among 37 ALK or RET fusion-negative specimens, four exhibited elevated 3'/5' expression ratios, indicating that fusions predicted solely by 3'/5' read ratios require confirmatory testing. Copyright © 2016 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
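The 3'/5' imbalance control described above can be sketched as a simple read-ratio check. The pseudocount and threshold here are illustrative assumptions, not the panel's actual cutoffs:

```python
def three_prime_five_prime_ratio(reads_3p, reads_5p, pseudocount=1):
    """3'/5' expression ratio for a kinase gene.

    An oncogenic fusion that retains the kinase domain typically
    overexpresses the 3' exons relative to the 5' exons, inflating
    this ratio even when the fusion partner is not on the panel.
    """
    return (reads_3p + pseudocount) / (reads_5p + pseudocount)

def flag_possible_fusion(reads_3p, reads_5p, threshold=5.0):
    # An elevated ratio only *suggests* a fusion; as the abstract notes,
    # confirmatory testing (e.g. FISH or targeted sequencing) is required.
    return three_prime_five_prime_ratio(reads_3p, reads_5p) >= threshold
```

A balanced gene yields a ratio near 1, so only strong 3' skew trips the flag.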
An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments.
Bansal, Vikas
2018-01-01
The short read lengths of current high-throughput sequencing technologies limit the ability to recover long-range haplotype information. Dilution pool methods for preparing DNA sequencing libraries from high molecular weight DNA fragments enable the recovery of long DNA fragments from short sequence reads. These approaches require computational methods for identifying the DNA fragments using aligned sequence reads and assembling the fragments into long haplotypes. Although a number of computational methods have been developed for haplotype assembly, the problem of identifying DNA fragments from dilution pool sequence data has not received much attention. We formulate the problem of detecting DNA fragments from dilution pool sequencing experiments as a genome segmentation problem and develop an algorithm that uses dynamic programming to optimize a likelihood function derived from a generative model for the sequence reads. This algorithm uses an iterative approach to automatically infer the mean background read depth and the number of fragments in each pool. Using simulated data, we demonstrate that our method, FragmentCut, has 25-30% greater sensitivity compared with an HMM based method for fragment detection and can also detect overlapping fragments. On a whole-genome human fosmid pool dataset, the haplotypes assembled using the fragments identified by FragmentCut had greater N50 length, 16.2% lower switch error rate and 35.8% lower mismatch error rate compared with two existing methods. We further demonstrate the greater accuracy of our method using two additional dilution pool datasets. FragmentCut is available from https://bansal-lab.github.io/software/FragmentCut. vibansal@ucsd.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
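FragmentCut's segmentation idea can be illustrated with a generic changepoint dynamic program over binned read depth: fragments appear as runs of elevated depth against a low background. This sketch substitutes a penalized least-squares cost for the paper's read-based likelihood model, so it is a stand-in for the approach rather than the published algorithm:

```python
def segment_depth(depth, penalty=10.0):
    """Segment a per-bin read-depth vector into constant-depth pieces.

    best[i] is the minimal cost of segmenting depth[:i]; each new
    segment pays a fixed penalty, which discourages over-segmentation.
    Returns (start, end) bin index pairs for each segment.
    """
    n = len(depth)
    # Prefix sums allow O(1) evaluation of a segment's squared error.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, d in enumerate(depth):
        s1[i + 1] = s1[i] + d
        s2[i + 1] = s2[i] + d * d

    def cost(j, i):  # sum of squared deviations of depth[j:i] from its mean
        m = (s1[i] - s1[j]) / (i - j)
        return (s2[i] - s2[j]) - (i - j) * m * m

    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            c = best[j] + cost(j, i) + penalty
            if c < best[i]:
                best[i], back[i] = c, j

    bounds, i = [], n
    while i > 0:                 # recover boundaries by backtracking
        bounds.append((back[i], i))
        i = back[i]
    return bounds[::-1]
```

Segments whose mean depth exceeds the inferred background would then be reported as candidate DNA fragments in a pool.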
Yang, Huaan; Tao, Ye; Zheng, Zequn; Li, Chengdao; Sweetingham, Mark W; Howieson, John G
2012-07-17
In the last 30 years, a number of DNA fingerprinting methods such as RFLP, RAPD, AFLP, SSR, DArT, have been extensively used in marker development for molecular plant breeding. However, it remains a daunting task to identify highly polymorphic and closely linked molecular markers for a target trait for molecular marker-assisted selection. The next-generation sequencing (NGS) technology is far more powerful than any existing generic DNA fingerprinting methods in generating DNA markers. In this study, we employed a grain legume crop Lupinus angustifolius (lupin) as a test case, and examined the utility of an NGS-based method of RAD (restriction-site associated DNA) sequencing as DNA fingerprinting for rapid, cost-effective marker development tagging a disease resistance gene for molecular breeding. Twenty informative plants from a cross of RxS (disease resistant x susceptible) in lupin were subjected to RAD single-end sequencing by multiplex identifiers. The entire RAD sequencing products were resolved in two lanes of the 16-lanes per run sequencing platform Solexa HiSeq2000. A total of 185 million raw reads, approximately 17 Gb of sequencing data, were collected. Sequence comparison among the 20 test plants discovered 8207 SNP markers. Filtration of DNA sequencing data with marker identification parameters resulted in the discovery of 38 molecular markers linked to the disease resistance gene Lanr1. Five randomly selected markers were converted into cost-effective, simple PCR-based markers. Linkage analysis using marker genotyping data and disease resistance phenotyping data on a F8 population consisting of 186 individual plants confirmed that all these five markers were linked to the R gene. 
Two of these newly developed sequence-specific PCR markers, AnSeq3 and AnSeq4, flanked the target R gene at a genetic distance of 0.9 centiMorgan (cM), and are now replacing the markers previously developed by a traditional DNA fingerprinting method for marker-assisted selection in the Australian national lupin breeding program. We demonstrated that more than 30 molecular markers linked to a target gene of agronomic trait of interest can be identified from a small portion (1/8) of one sequencing run on HiSeq2000 by applying NGS based RAD sequencing in marker development. The markers developed by the strategy described in this study are all co-dominant SNP markers, which can readily be converted into high throughput multiplex format or low-cost, simple PCR-based markers desirable for large scale marker implementation in plant breeding programs. The high density and closely linked molecular markers associated with a target trait help to overcome a major bottleneck for implementation of molecular markers on a wide range of germplasm in breeding programs. We conclude that application of NGS based RAD sequencing as DNA fingerprinting is a very rapid and cost-effective strategy for marker development in molecular plant breeding. The strategy does not require any prior genome knowledge or molecular information for the species under investigation, and it is applicable to other plant species.
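The linkage-confirmation step above amounts to counting recombinants between each candidate marker and the resistance phenotype. A toy sketch for an inbred mapping population, with marker names and the 5% cutoff chosen purely for illustration:

```python
def recombination_fraction(marker, phenotype):
    """Fraction of individuals whose marker genotype disagrees with the
    resistance phenotype (both coded 0/1). In an inbred mapping
    population a fraction near zero indicates tight linkage; 1 cM
    corresponds roughly to a 1% recombination fraction.
    """
    assert len(marker) == len(phenotype)
    recombinants = sum(m != p for m, p in zip(marker, phenotype))
    return recombinants / len(marker)

def linked_markers(genotypes, phenotype, max_rf=0.05):
    """Keep candidate SNP markers within roughly 5 cM of the R gene.

    genotypes: marker name -> list of 0/1 calls across the population.
    """
    return [name for name, calls in genotypes.items()
            if recombination_fraction(calls, phenotype) <= max_rf]
```

Applied to an F8 population of 186 plants, this kind of filter is how candidate SNPs from RAD data are reduced to a short list of trait-linked markers.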
Classification of a set of vectors using self-organizing map- and rule-based technique
NASA Astrophysics Data System (ADS)
Ae, Tadashi; Okaniwa, Kaishirou; Nosaka, Kenzaburou
2005-02-01
Various objects, such as pictures, music and texts, surround us, and we form a view of them by looking, reading or listening. This view is deeply connected with our behavior and is important for understanding it: we form a view of an object and then decide the next action (data selection, etc.) based on that view, so a series of such actions constructs a sequence. We therefore propose a method that acquires a view as a vector derived from several descriptive words and applies this vector to sequence generation. We focus on sequences of data that a user selects from a multimedia database containing pictures, music, movies, etc. These data cannot be stereotyped, because each user's view of them differs. We therefore represent the structure of the multimedia database by both the vector representing the user's view and the stereotyped vector, and acquire sequences that contain this structure as elements. Such vectors can be classified by a self-organizing map (SOM). A hidden Markov model (HMM) is a method for generating sequences, so we use an HMM in which each state corresponds to a representative vector of the user's view, and acquire sequences that capture changes in the user's view. We call this the vector-state Markov model (VMM). We also introduce rough set theory as a rule-based technique, which serves to classify sets of data such as the set "Tour".
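The vector-classification step can be illustrated with a minimal 1-D self-organizing map. This is generic SOM training, not the authors' implementation; the unit count and learning schedule are arbitrary assumptions:

```python
import random

def train_som(vectors, n_units=4, epochs=50, lr=0.5, seed=0):
    """Tiny 1-D self-organizing map: each unit holds a prototype vector;
    the best-matching unit (BMU) and its line neighbours are pulled
    toward each input, with a decaying learning rate."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(epochs):
        rate = lr * (1 - t / epochs)
        for v in vectors:
            bmu = min(range(n_units),
                      key=lambda u: sum((a - b) ** 2 for a, b in zip(units[u], v)))
            for u in (bmu - 1, bmu, bmu + 1):  # BMU and its line neighbours
                if 0 <= u < n_units:
                    units[u] = [a + rate * (b - a) for a, b in zip(units[u], v)]
    return units

def classify(units, v):
    """Map a 'view' vector to the index of its nearest prototype."""
    return min(range(len(units)),
               key=lambda u: sum((a - b) ** 2 for a, b in zip(units[u], v)))
```

The resulting unit index per view vector is what a downstream sequence model (the VMM in the abstract) would treat as a state.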
Calibrating genomic and allelic coverage bias in single-cell sequencing.
Zhang, Cheng-Zhong; Adalsteinsson, Viktor A; Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L; Meyerson, Matthew; Love, J Christopher
2015-04-16
Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1-10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (~0.1×) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multiple displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples.
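The calibration idea, predicting coverage yield at higher depth from the bias pattern seen in low-pass data, can be sketched with binned counts. This linear-scaling version is a simplification of the paper's statistical models; bin size and thresholds are illustrative:

```python
import math

def coverage_bias_cv(bin_counts):
    """Coefficient of variation of binned read counts (e.g. 10-kb bins
    from ~0.1x low-pass sequencing), a summary of amplification bias."""
    n = len(bin_counts)
    mean = sum(bin_counts) / n
    var = sum((c - mean) ** 2 for c in bin_counts) / n
    return math.sqrt(var) / mean

def predict_covered_fraction(bin_counts, target_mean_depth, min_depth=5):
    """Scale low-pass per-bin counts to a target mean depth and report
    the fraction of bins expected to reach min_depth. Assumes the
    relative bin rates (the bias pattern) are preserved at higher depth,
    consistent with bias arising at the amplicon (1-10 kb) scale."""
    mean = sum(bin_counts) / len(bin_counts)
    scale = target_mean_depth / mean
    return sum(c * scale >= min_depth for c in bin_counts) / len(bin_counts)
```

A library with high CV at 0.1x will show diminishing covered-fraction returns as sequencing depth is increased, which is the practical content of the calibration.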
Monroe, J Grey; Allen, Zachariah A; Tanger, Paul; Mullen, Jack L; Lovell, John T; Moyers, Brook T; Whitley, Darrell; McKay, John K
2017-01-01
Recent advances in nucleic acid sequencing technologies have led to a dramatic increase in the number of markers available to generate genetic linkage maps. This increased marker density can be used to improve genome assemblies as well as add much needed resolution for loci controlling variation in ecologically and agriculturally important traits. However, traditional genetic map construction methods from these large marker datasets can be computationally prohibitive and highly error prone. We present TSPmap, a method which implements both approximate and exact Traveling Salesperson Problem solvers to generate linkage maps. We demonstrate that for datasets with large numbers of genomic markers (e.g. 10,000) and in multiple population types generated from inbred parents, TSPmap can rapidly produce high quality linkage maps with low sensitivity to missing and erroneous genotyping data compared to two other benchmark methods, JoinMap and MSTmap. TSPmap is open source and freely available as an R package. With the advancement of low cost sequencing technologies, the number of markers used in the generation of genetic maps is expected to continue to rise. TSPmap will be a useful tool to handle such large datasets into the future, quickly producing high quality maps using a large number of genomic markers.
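TSPmap's framing of marker ordering as a Traveling Salesperson Problem can be illustrated with pairwise recombination fractions as distances and a greedy nearest-neighbour tour; real approximate and exact TSP solvers (as in TSPmap) would replace the greedy step:

```python
def pairwise_rf(genotypes):
    """Recombination fraction between every pair of markers.

    genotypes: marker name -> list of 0/1 calls across an inbred-derived
    population; disagreement frequency approximates genetic distance.
    """
    names = list(genotypes)
    rf = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ga, gb = genotypes[a], genotypes[b]
            r = sum(x != y for x, y in zip(ga, gb)) / len(ga)
            rf[a, b] = rf[b, a] = r
    return names, rf

def order_markers(genotypes, start=None):
    """Greedy nearest-neighbour tour over recombination distances,
    a toy stand-in for a TSP solver producing a linkage-map order."""
    names, rf = pairwise_rf(genotypes)
    current = start or names[0]
    order, remaining = [current], set(names) - {current}
    while remaining:
        nxt = min(remaining, key=lambda m: rf[order[-1], m])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

With tens of thousands of markers the pairwise distance matrix dominates the cost, which is why TSPmap's solver choice and robustness to genotyping errors matter.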
Lee, Sejoon; Lee, Soohyun; Ouellette, Scott; Park, Woong-Yang; Lee, Eunjung A; Park, Peter J
2017-06-20
In many next-generation sequencing (NGS) studies, multiple samples or data types are profiled for each individual. An important quality control (QC) step in these studies is to ensure that datasets from the same subject are properly paired. Given the heterogeneity of data types, file types and sequencing depths in a multi-dimensional study, a robust program that provides a standardized metric for genotype comparisons would be useful. Here, we describe NGSCheckMate, a user-friendly software package for verifying sample identities from FASTQ, BAM or VCF files. This tool uses a model-based method to compare allele read fractions at known single-nucleotide polymorphisms, considering depth-dependent behavior of similarity metrics for identical and unrelated samples. Our evaluation shows that NGSCheckMate is effective for a variety of data types, including exome sequencing, whole-genome sequencing, RNA-seq, ChIP-seq, targeted sequencing and single-cell whole-genome sequencing, with a minimal requirement for sequencing depth (>0.5X). An alignment-free module can be run directly on FASTQ files for a quick initial check. We recommend using this software as a QC step in NGS studies. https://github.com/parklab/NGSCheckMate. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
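NGSCheckMate's core comparison, allele read fractions at known SNPs, can be sketched as follows. The fixed mean-difference threshold stands in for the tool's depth-dependent model and is an illustrative assumption:

```python
def allele_fractions(snp_counts):
    """Variant allele fraction at each known SNP from (ref, alt) read
    counts; sites with no coverage are skipped."""
    return {snp: alt / (ref + alt)
            for snp, (ref, alt) in snp_counts.items() if ref + alt > 0}

def same_individual(sample_a, sample_b, max_mean_diff=0.12):
    """Crude identity check on shared SNPs: genuinely paired samples
    show similar allele fractions genome-wide, unrelated samples do not.
    Returns None when the samples share no covered SNPs."""
    fa, fb = allele_fractions(sample_a), allele_fractions(sample_b)
    shared = fa.keys() & fb.keys()
    if not shared:
        return None
    diff = sum(abs(fa[s] - fb[s]) for s in shared) / len(shared)
    return diff <= max_mean_diff
```

Because the comparison only needs read counts at a fixed SNP panel, the same check applies across exome, WGS, RNA-seq or ChIP-seq inputs, which is the basis of the tool's cross-data-type QC.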
Design of nucleic acid sequences for DNA computing based on a thermodynamic approach
Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma
2005-01-01
We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between the sequence and its complement and that do not hybridize non-specifically with each other based on the minimum free energy (ΔGmin). Sequences that satisfy these constraints can be utilized in DNA computing, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate ΔGmin. This effectively excludes inappropriate sequences before ΔGmin is calculated, thereby reducing computation time drastically when compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the Hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (ΔGexp) of 126 sequences correlated better with ΔGmin (|R| = 0.90) than with the Hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java. PMID:15701762
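The random generate-and-test loop can be sketched with deliberately crude filters: GC content as a melting-temperature proxy and a longest-complementary-run check in place of the ΔGmin calculation, which is the paper's actual and more predictive filter. All thresholds here are illustrative:

```python
import random

BASES = "ACGT"
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def max_complementary_run(a, b):
    """Longest contiguous stretch of a that is Watson-Crick complementary
    to some stretch of b, a crude proxy for non-specific hybridization
    potential (a free-energy model would be more predictive)."""
    rb = revcomp(b)
    best = 0
    for i in range(len(a)):
        for j in range(len(rb)):
            k = 0
            while i + k < len(a) and j + k < len(rb) and a[i + k] == rb[j + k]:
                k += 1
            best = max(best, k)
    return best

def design(n, length=20, gc=(0.45, 0.55), max_run=6, seed=0):
    """Random generate-and-test: accept a candidate only if its GC
    content (rough Tm proxy) is in range and it cannot form long
    duplexes with any previously accepted sequence."""
    rng = random.Random(seed)
    pool = []
    while len(pool) < n:
        cand = "".join(rng.choice(BASES) for _ in range(length))
        if not (gc[0] <= gc_fraction(cand) <= gc[1]):
            continue
        if any(max_complementary_run(cand, s) > max_run for s in pool):
            continue
        pool.append(cand)
    return pool
```

The cheap filters reject most candidates before any expensive scoring, which mirrors the paper's strategy of pre-filtering before the ΔGmin computation.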
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
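Bar-seq's readout reduces to counting barcodes downstream of a common priming sequence and comparing normalized abundances between pools. The flank sequence and 20-nt barcode length below are placeholders, not the actual yeast deletion-collection sequences:

```python
def count_barcodes(reads, barcode_to_strain, flank="GTCGAC"):
    """Tally reads per deletion strain by locating the common priming
    flank and matching the following 20-nt barcode exactly.
    (flank and barcode length are illustrative.)"""
    counts = {strain: 0 for strain in barcode_to_strain.values()}
    for read in reads:
        idx = read.find(flank)
        if idx < 0:
            continue  # flank not found; read is uninformative
        bc = read[idx + len(flank): idx + len(flank) + 20]
        strain = barcode_to_strain.get(bc)
        if strain:
            counts[strain] += 1
    return counts

def fitness_ratio(treated, control, strain, pseudocount=1):
    """Relative fitness: normalized barcode abundance in a drug-treated
    pool vs control. Strong depletion suggests the strain's deleted
    gene sensitizes cells to the drug."""
    t = (treated[strain] + pseudocount) / (sum(treated.values()) + pseudocount)
    c = (control[strain] + pseudocount) / (sum(control.values()) + pseudocount)
    return t / c
```

The abstract's point about re-annotating barcodes matters here: exact matching fails silently when the catalogued barcode differs from the sequenced one, so a corrected barcode list directly improves count quality.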
Systems and Methods for Automated Vessel Navigation Using Sea State Prediction
NASA Technical Reports Server (NTRS)
Huntsberger, Terrance L. (Inventor); Howard, Andrew B. (Inventor); Reinhart, Rene Felix (Inventor); Aghazarian, Hrand (Inventor); Rankin, Arturo (Inventor)
2017-01-01
Systems and methods for sea state prediction and autonomous navigation in accordance with embodiments of the invention are disclosed. One embodiment of the invention includes a method of predicting a future sea state including generating a sequence of at least two 3D images of a sea surface using at least two image sensors, detecting peaks and troughs in the 3D images using a processor, identifying at least one wavefront in each 3D image based upon the detected peaks and troughs using the processor, characterizing at least one propagating wave based upon the propagation of wavefronts detected in the sequence of 3D images using the processor, and predicting a future sea state using at least one propagating wave characterizing the propagation of wavefronts in the sequence of 3D images using the processor. Another embodiment includes a method of autonomous vessel navigation based upon a predicted sea state and target location.
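The peak/trough detection and wave-propagation steps can be illustrated in one dimension; a real implementation operates on stereo-derived 3D surface maps, so this is a simplified sketch:

```python
def peaks_and_troughs(profile):
    """Local maxima and minima along one scanline of a sea-surface
    height map (a 1-D simplification of the patent's detection step)."""
    peaks, troughs = [], []
    for i in range(1, len(profile) - 1):
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]:
            peaks.append(i)
        elif profile[i] < profile[i - 1] and profile[i] < profile[i + 1]:
            troughs.append(i)
    return peaks, troughs

def wave_speed(positions, dt):
    """Phase speed of one wavefront tracked across consecutive frames:
    crest displacement divided by elapsed time. Extrapolating tracked
    positions forward gives a short-horizon sea-state prediction."""
    return (positions[-1] - positions[0]) / ((len(positions) - 1) * dt)
```

Characterizing several such propagating wavefronts, rather than a single crest, is what lets the system predict the future sea state used for navigation decisions.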
Methods, Tools and Current Perspectives in Proteogenomics
Ruggles, Kelly V.; Krug, Karsten; Wang, Xiaojing; Clauser, Karl R.; Wang, Jing; Payne, Samuel H.; Fenyö, David; Zhang, Bing; Mani, D. R.
2017-01-01
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies have identified novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches and in this article, we systematically classify published methods and tools into four major categories, (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications. PMID:28456751
Bozan, Mahir; Akyol, Çağrı; Ince, Orhan; Aydin, Sevcan; Ince, Bahar
2017-09-01
The anaerobic digestion of lignocellulosic wastes is considered an efficient method for managing the world's energy shortages and resolving contemporary environmental problems. However, the recalcitrance of lignocellulosic biomass represents a barrier to maximizing biogas production. The purpose of this review is to examine the extent to which sequencing methods can be employed to monitor such biofuel conversion processes. From a microbial perspective, we present a detailed insight into anaerobic digesters that utilize lignocellulosic biomass and discuss some benefits and disadvantages associated with the microbial sequencing techniques that are typically applied. We further evaluate the extent to which a hybrid approach incorporating a variation of existing methods can be utilized to develop a more in-depth understanding of microbial communities. It is hoped that this deeper knowledge will enhance the reliability and extent of research findings with the end objective of improving the stability of anaerobic digesters that manage lignocellulosic biomass.
Barcode extension for analysis and reconstruction of structures
NASA Astrophysics Data System (ADS)
Myhrvold, Cameron; Baym, Michael; Hanikel, Nikita; Ong, Luvena L.; Gootenberg, Jonathan S.; Yin, Peng
2017-03-01
Collections of DNA sequences can be rationally designed to self-assemble into predictable three-dimensional structures. The geometric and functional diversity of DNA nanostructures created to date has been enhanced by improvements in DNA synthesis and computational design. However, existing methods for structure characterization typically image the final product or laboriously determine the presence of individual, labelled strands using gel electrophoresis. Here we introduce a new method of structure characterization that uses barcode extension and next-generation DNA sequencing to quantitatively measure the incorporation of every strand into a DNA nanostructure. By quantifying the relative abundances of distinct DNA species in product and monomer bands, we can study the influence of geometry and sequence on assembly. We have tested our method using 2D and 3D DNA brick and DNA origami structures. Our method is general and should be extensible to a wide variety of DNA nanostructures.
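Quantifying per-strand incorporation from barcode counts in the product band versus the monomer band can be sketched as a simple ratio; the pseudocount is an assumption added to stabilize low counts:

```python
def incorporation_fraction(product_counts, monomer_counts, pseudocount=1):
    """Per-strand incorporation into the assembled structure, from
    barcode-extension read counts in the product (assembled) band vs
    the unassembled monomer band. Values near 1 mean the strand is
    almost always incorporated; low values flag problem strands whose
    geometry or sequence hinders assembly."""
    strands = product_counts.keys() | monomer_counts.keys()
    frac = {}
    for s in strands:
        p = product_counts.get(s, 0) + pseudocount
        m = monomer_counts.get(s, 0) + pseudocount
        frac[s] = p / (p + m)
    return frac
```

Comparing these fractions across all strands of a 2D or 3D structure is what turns a single sequencing run into a strand-by-strand assembly report.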
Mutation Scanning in Wheat by Exon Capture and Next-Generation Sequencing.
King, Robert; Bird, Nicholas; Ramirez-Gonzalez, Ricardo; Coghill, Jane A; Patil, Archana; Hassani-Pak, Keywan; Uauy, Cristobal; Phillips, Andrew L
2015-01-01
Targeting Induced Local Lesions in Genomes (TILLING) is a reverse genetics approach to identify novel sequence variation in genomes, with the aims of investigating gene function and/or developing useful alleles for breeding. Despite recent advances in wheat genomics, most current TILLING methods are low to medium in throughput, being based on PCR amplification of the target genes. We performed a pilot-scale evaluation of TILLING in wheat by next-generation sequencing through exon capture. An oligonucleotide-based enrichment array covering ~2 Mbp of wheat coding sequence was used to carry out exon capture and sequencing on three mutagenised lines of wheat containing previously identified mutations in the TaGA20ox1 homoeologous genes. After testing different mapping algorithms and settings, candidate SNPs were identified by mapping to the IWGSC wheat Chromosome Survey Sequences. Where sequence data for all three homoeologues were found in the reference, mutant calls were unambiguous; however, where the reference lacked one or two of the homoeologues, captured reads from these genes were mis-mapped to other homoeologues, resulting either in dilution of the variant allele frequency or assignment of mutations to the wrong homoeologue. Competitive PCR assays were used to validate the putative SNPs and estimate cut-off levels for SNP filtering. At least 464 high-confidence SNPs were detected across the three mutagenised lines, including the three known alleles in TaGA20ox1, indicating a mutation rate of ~35 SNPs per Mb, similar to that estimated by PCR-based TILLING. This demonstrates the feasibility of using exon capture for genome re-sequencing as a method of mutation detection in polyploid wheat, but accurate mutation calling will require an improved genomic reference with more comprehensive coverage of homoeologues.
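The dilution effect described in this abstract can be made concrete with a minimal sketch. This is not the authors' pipeline; the function names, cut-offs, and read counts below are illustrative assumptions showing how a variant allele frequency (VAF) filter behaves when reads from a missing homoeologue are mis-mapped onto the same position.

```python
# Illustrative sketch (not the published pipeline): VAF-based SNP
# filtering, and how mis-mapped homoeologous reads dilute the signal.

def vaf(alt_reads, total_reads):
    """Fraction of reads supporting the variant allele."""
    return alt_reads / total_reads if total_reads else 0.0

def passes_filter(alt_reads, total_reads, min_vaf=0.2, min_depth=10):
    """Keep a candidate SNP only above (assumed) depth and VAF cut-offs."""
    return total_reads >= min_depth and vaf(alt_reads, total_reads) >= min_vaf

# Mutation in one homoeologue, reads mapped uniquely:
print(vaf(12, 25))   # 0.48 - a clean call

# Reads from a second, unrepresented homoeologue mis-mapped onto the
# same position double the depth and halve the apparent VAF:
print(vaf(12, 50))   # 0.24 - now close to the filtering threshold
```

Under a stricter threshold such as `min_vaf=0.3`, the diluted call would be rejected outright, which is the failure mode the abstract attributes to gaps in the reference.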
Phylogenomics and taxonomy of Lecomtelleae (Poaceae), an isolated panicoid lineage from Madagascar
Besnard, Guillaume; Christin, Pascal-Antoine; Malé, Pierre-Jean G.; Coissac, Eric; Ralimanana, Hélène; Vorontsova, Maria S.
2013-01-01
Background and Aims An accurate characterization of biodiversity requires analyses of DNA sequences in addition to classical morphological descriptions. New methods based on high-throughput sequencing may allow investigation of specimens with a large set of genetic markers to infer their evolutionary history. In the grass family, the phylogenetic position of the monotypic genus Lecomtella, a rare bamboo-like endemic from Madagascar, has never been appropriately evaluated. Until now its taxonomic treatment has remained controversial, indicating the need for re-evaluation based on a combination of molecular and morphological data. Methods The phylogenetic position of Lecomtella in Poaceae was evaluated based on sequences from the nuclear and plastid genomes generated by next-generation sequencing (NGS). In addition, a detailed morphological description of L. madagascariensis was produced, and its distribution and habit were investigated in order to assess its conservation status. Key Results The complete plastid sequence, a ribosomal DNA unit and fragments of low-copy nuclear genes (phyB and ppc) were obtained. All phylogenetic analyses place Lecomtella as an isolated member of the core panicoids, which last shared a common ancestor with other species >20 million years ago. Although Lecomtella exhibits morphological characters typical of Panicoideae, an unusual combination of traits supports its treatment as a separate group. Conclusions The study showed that NGS can be used to generate abundant phylogenetic information rapidly, opening new avenues for grass phylogenetics. These data clearly showed that Lecomtella forms an isolated lineage, which, in combination with its morphological peculiarities, justifies its treatment as a separate tribe: Lecomtelleae. New descriptions of the tribe, genus and species are presented with a typification, a distribution map and an IUCN conservation assessment. PMID:23985988
Exome sequencing of a multigenerational human pedigree.
Hedges, Dale J; Hedges, Dale; Burges, Dan; Powell, Eric; Almonte, Cherylyn; Huang, Jia; Young, Stuart; Boese, Benjamin; Schmidt, Mike; Pericak-Vance, Margaret A; Martin, Eden; Zhang, Xinmin; Harkins, Timothy T; Züchner, Stephan
2009-12-14
Over the next few years, the efficient use of next-generation sequencing (NGS) in human genetics research will depend heavily upon the effective mechanisms for the selective enrichment of genomic regions of interest. Recently, comprehensive exome capture arrays have become available for targeting approximately 33 Mb or approximately 180,000 coding exons across the human genome. Selective genomic enrichment of the human exome offers an attractive option for new experimental designs aiming to quickly identify potential disease-associated genetic variants, especially in family-based studies. We have evaluated a 2.1 M feature human exome capture array on eight individuals from a three-generation family pedigree. We were able to cover up to 98% of the targeted bases at a long-read sequence read depth of ≥3, 86% at a read depth of ≥10, and over 50% of all targets were covered with ≥20 reads. We identified up to 14,284 SNPs and small indels per individual exome, with up to 1,679 of these representing putative novel polymorphisms. Applying the conservative genotype calling approach HCDiff, the average rate of detection of a variant allele based on Illumina 1 M BeadChips genotypes was 95.2% at ≥10x sequence. Further, we propose an advantageous genotype calling strategy for low covered targets that empirically determines cut-off thresholds at a given coverage depth based on existing genotype data. Application of this method was able to detect >99% of SNPs covered ≥8x. Our results offer guidance for "real-world" applications in human genetics and provide further evidence that microarray-based exome capture is an efficient and reliable method to enrich for chromosomal regions of interest in next-generation sequencing experiments.
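The coverage summary quoted above (98% at depth ≥3, 86% at ≥10, >50% at ≥20) is a standard breadth-of-coverage calculation. A minimal sketch, with made-up per-base depths rather than real exome data:

```python
# Sketch of a breadth-of-coverage summary: the fraction of targeted
# bases covered at or above each read-depth threshold.

def coverage_breadth(depths, thresholds=(3, 10, 20)):
    """depths: per-base read depths over the targeted bases.
    Returns {threshold: fraction of bases with depth >= threshold}."""
    n = len(depths)
    return {t: sum(1 for d in depths if d >= t) / n for t in thresholds}

# Toy depths for ten targeted bases:
depths = [0, 2, 3, 5, 10, 12, 20, 25, 30, 40]
print(coverage_breadth(depths))  # {3: 0.8, 10: 0.6, 20: 0.4}
```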
New developments in ancient genomics.
Millar, Craig D; Huynen, Leon; Subramanian, Sankar; Mohandesan, Elmira; Lambert, David M
2008-07-01
Ancient DNA research is on the crest of a 'third wave' of progress due to the introduction of a new generation of DNA sequencing technologies. Here we review the advantages and disadvantages of the four new DNA sequencers that are becoming available to researchers. These machines now allow the recovery of orders of magnitude more DNA sequence data, albeit as short sequence reads. Hence, the potential reassembly of complete ancient genomes seems imminent, and when used to screen libraries of ancient sequences, these methods are cost effective. This new wealth of data is also likely to herald investigations into the functional properties of extinct genes and gene complexes and will improve our understanding of the biological basis of extinct phenotypes.
Detection of DNA Methylation by Whole-Genome Bisulfite Sequencing.
Li, Qing; Hermanson, Peter J; Springer, Nathan M
2018-01-01
DNA methylation plays an important role in the regulation of the expression of transposons and genes. Various methods have been developed to assay DNA methylation levels. Bisulfite sequencing is considered to be the "gold standard" for single-base resolution measurement of DNA methylation levels. Coupled with next-generation sequencing, whole-genome bisulfite sequencing (WGBS) allows DNA methylation to be evaluated at a genome-wide scale. Here, we describe a protocol for WGBS in plant species with large genomes. This protocol has been successfully applied to assay genome-wide DNA methylation levels in maize and barley. This protocol has also been successfully coupled with sequence capture technology to assay DNA methylation levels in a targeted set of genomic regions.
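The single-base measurement that WGBS yields reduces to a simple count ratio: unmethylated cytosines read as T after bisulfite conversion, so the methylation level at a site is C/(C+T). A minimal sketch with invented counts (not from this protocol):

```python
# Sketch of WGBS methylation calls from read counts. Unconverted (C)
# reads indicate methylation; converted (T) reads indicate none.

def methylation_level(c_reads, t_reads):
    """Per-cytosine methylation level, or None with no coverage."""
    total = c_reads + t_reads
    return c_reads / total if total else None

def weighted_level(sites):
    """Genome- or region-wide level: pool counts rather than
    averaging per-site levels, so low-coverage sites count less."""
    c = sum(s[0] for s in sites)
    t = sum(s[1] for s in sites)
    return c / (c + t)

sites = [(9, 1), (0, 10), (5, 5)]  # (C reads, T reads) per cytosine
print([round(methylation_level(c, t), 2) for c, t in sites])  # [0.9, 0.0, 0.5]
print(round(weighted_level(sites), 2))                        # 0.47
```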
Predicting the host of influenza viruses based on the word vector.
Xu, Beibei; Tan, Zhiying; Li, Kenli; Jiang, Taijiao; Peng, Yousong
2017-01-01
Newly emerging influenza viruses continue to threaten public health. A rapid determination of the host range of newly discovered influenza viruses would assist in early assessment of their risk. Here, we attempted to predict the host of influenza viruses using the Support Vector Machine (SVM) classifier based on the word vector, a new representation and feature extraction method for biological sequences. The results show that the length of the word within the word vector, the sequence type (DNA or protein) and the species from which the sequences were derived for generating the word vector all influence the performance of models in predicting the host of influenza viruses. In nearly all cases, the models built on the surface proteins hemagglutinin (HA) and neuraminidase (NA) (or their genes) produced better results than internal influenza proteins (or their genes). The best performance was achieved when the model was built on the HA gene based on word vectors (words of three-letters long) generated from DNA sequences of the influenza virus. This results in accuracies of 99.7% for avian, 96.9% for human and 90.6% for swine influenza viruses. Compared to the method of sequence homology best-hit searches using the Basic Local Alignment Search Tool (BLAST), the word vector-based models still need further improvements in predicting the host of influenza A viruses.
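The feature-extraction step this abstract describes starts by splitting each sequence into overlapping fixed-length "words" (three letters long in the best-performing model), which a word-vector model such as word2vec then embeds before SVM classification. A sketch of just the word-splitting step, with the embedding and classifier omitted:

```python
# Sketch of the tokenisation assumed above: overlapping k-letter
# "words" from a DNA sequence, in sequence order, ready for a
# word-vector (word2vec-style) embedding model.

def dna_words(seq, k=3):
    """Overlapping k-mer 'words' of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(dna_words("ATGCGT"))  # ['ATG', 'TGC', 'GCG', 'CGT']
```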
Next generation sequencing and its applications in forensic genetics.
Børsting, Claus; Morling, Niels
2015-09-01
It has been almost a decade since the first next generation sequencing (NGS) technologies emerged and quickly changed the way genetic research is conducted. Today, full genomes are mapped and published almost weekly and with ever increasing speed and decreasing costs. NGS methods and platforms have matured during the last 10 years, and the quality of the sequences has reached a level where NGS is used in clinical diagnostics of humans. Forensic genetic laboratories have also explored NGS technologies and especially in the last year, there has been a small explosion in the number of scientific articles and presentations at conferences with forensic aspects of NGS. These contributions have demonstrated that NGS offers new possibilities for forensic genetic case work. More information may be obtained from unique samples in a single experiment by analyzing combinations of markers (STRs, SNPs, insertion/deletions, mRNA) that cannot be analyzed simultaneously with the standard PCR-CE methods used today. The true variation in core forensic STR loci has been uncovered, and previously unknown STR alleles have been discovered. The detailed sequence information may aid mixture interpretation and will increase the statistical weight of the evidence. In this review, we will give an introduction to NGS and single-molecule sequencing, and we will discuss the possible applications of NGS in forensic genetics. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Derkach, Andriy; Chiang, Theodore; Gong, Jiafen; Addis, Laura; Dobbins, Sara; Tomlinson, Ian; Houlston, Richard; Pal, Deb K.; Strug, Lisa J.
2014-01-01
Motivation: Sufficiently powered case–control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, these studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that substitutes genotype calls by their expected values given observed sequence data. Results: We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. We also demonstrate that, using simulated and real NGS data, the RVS method controls Type I error and has comparable power to the ‘gold standard’ analysis with the true underlying genotypes for both common and rare variants. Availability and implementation: An RVS R script and instructions can be found at strug.research.sickkids.ca, and at https://github.com/strug-lab/RVS. Contact: lisa.strug@utoronto.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24733292
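The core substitution in the RVS idea is replacing a hard genotype call with its expected value given the observed reads, so that read-depth differences between cases and controls do not bias allele-frequency estimates. A toy sketch of that expectation; the likelihoods and the HWE-style priors below are invented for illustration and are not the RVS implementation:

```python
# Illustrative sketch: expected genotype (dosage) given per-genotype
# read likelihoods, instead of a hard call. Priors here assume an
# alt-allele frequency of 0.2 under Hardy-Weinberg (0.64, 0.32, 0.04).

def expected_dosage(likelihoods, priors=(0.64, 0.32, 0.04)):
    """likelihoods: P(reads | g) for genotypes g = 0, 1, 2 alt alleles.
    Returns E[g | reads] under the given genotype priors."""
    post = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(post)
    return sum(g * p for g, p in enumerate(post)) / z

# Deep coverage: reads strongly favour the heterozygote.
print(round(expected_dosage((0.001, 0.9, 0.001)), 3))  # 0.998

# Shallow coverage: likelihoods barely discriminate, so the dosage
# shrinks toward the prior mean dosage of 0.4.
print(round(expected_dosage((0.3, 0.35, 0.3)), 3))     # 0.43
```

The point of the shrinkage is exactly the robustness the abstract claims: a low-depth sample contributes a soft, prior-tempered dosage rather than a confident but possibly wrong genotype.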
Cooper, James; Ding, Yi; Song, Jiuzhou; Zhao, Keji
2017-11-01
Increased chromatin accessibility is a feature of cell-type-specific cis-regulatory elements; therefore, mapping of DNase I hypersensitive sites (DHSs) enables the detection of active regulatory elements of transcription, including promoters, enhancers, insulators and locus-control regions. Single-cell DNase sequencing (scDNase-seq) is a method of detecting genome-wide DHSs when starting with either single cells or <1,000 cells from primary cell sources. This technique enables genome-wide mapping of hypersensitive sites in a wide range of cell populations that cannot be analyzed using conventional DNase I sequencing because of the requirement for millions of starting cells. Fresh cells, formaldehyde-cross-linked cells or cells recovered from formalin-fixed paraffin-embedded (FFPE) tissue slides are suitable for scDNase-seq assays. To generate scDNase-seq libraries, cells are lysed and then digested with DNase I. Circular carrier plasmid DNA is included during subsequent DNA purification and library preparation steps to prevent loss of the small quantity of DHS DNA. Libraries are generated for high-throughput sequencing on the Illumina platform using standard methods. Preparation of scDNase-seq libraries requires only 2 d. The materials and molecular biology techniques described in this protocol should be accessible to any general molecular biology laboratory. Processing of high-throughput sequencing data requires basic bioinformatics skills and uses publicly available bioinformatics software.
MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads
Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas
2017-01-01
An increasing number of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprised of 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision of 75% were obtained for strain level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets. PMID:28467460
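The sensitivity and precision figures quoted for strain-level annotation come from comparing the predicted taxa against the known mock-community composition. A minimal sketch with hypothetical strain labels (the real benchmark used 12 named strains):

```python
# Sketch of the benchmark arithmetic: sensitivity and precision of a
# set of predicted taxa against a known (mock-community) truth set.

def sensitivity_precision(predicted, truth):
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    sensitivity = tp / len(truth)      # fraction of true taxa recovered
    precision = tp / len(predicted)    # fraction of calls that are true
    return sensitivity, precision

truth = {"strain1", "strain2", "strain3", "strain4"}
predicted = {"strain1", "strain2", "strain3", "strainX"}  # one miss, one false call
print(sensitivity_precision(predicted, truth))  # (0.75, 0.75)
```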
MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads.
Petersen, Thomas Nordahl; Lukjancenko, Oksana; Thomsen, Martin Christen Frølund; Maddalena Sperotto, Maria; Lund, Ole; Møller Aarestrup, Frank; Sicheritz-Pontén, Thomas
2017-01-01
An increasing number of species and gene identification studies rely on the use of next generation sequence analysis of either single isolate or metagenomics samples. Several methods are available to perform taxonomic annotations, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package to process raw next generation sequence data and perform reference based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species and strain level resolution. An in-vitro bacterial mock community sample comprised of 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. A sensitivity and precision of 75% were obtained for strain level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy at species level using 84.8% of the sequence reads, compared to 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. The use of custom databases is possible for the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web-version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.
Cousins, Matthew M.; Ou, San-San; Wawer, Maria J.; Munshaw, Supriya; Swan, David; Magaret, Craig A.; Mullis, Caroline E.; Serwadda, David; Porcella, Stephen F.; Gray, Ronald H.; Quinn, Thomas C.; Donnell, Deborah; Eshleman, Susan H.
2012-01-01
Next-generation sequencing (NGS) has recently been used for analysis of HIV diversity, but this method is labor-intensive, costly, and requires complex protocols for data analysis. We compared diversity measures obtained using NGS data to those obtained using a diversity assay based on high-resolution melting (HRM) of DNA duplexes. The HRM diversity assay provides a single numeric score that reflects the level of diversity in the region analyzed. HIV gag and env from individuals in Rakai, Uganda, were analyzed in a previous study using NGS (n = 220 samples from 110 individuals). Three sequence-based diversity measures were calculated from the NGS sequence data (percent diversity, percent complexity, and Shannon entropy). The amplicon pools used for NGS were analyzed with the HRM diversity assay. HRM scores were significantly associated with sequence-based measures of HIV diversity for both gag and env (P < 0.001 for all measures). The level of diversity measured by the HRM diversity assay and NGS increased over time in both regions analyzed (P < 0.001 for all measures except for percent complexity in gag), and similar amounts of diversification were observed with both methods (P < 0.001 for all measures except for percent complexity in gag). Diversity measures obtained using the HRM diversity assay were significantly associated with those from NGS, and similar increases in diversity over time were detected by both methods. The HRM diversity assay is faster and less expensive than NGS, facilitating rapid analysis of large studies of HIV diversity and evolution. PMID:22785188
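One of the sequence-based diversity measures compared here, Shannon entropy, is computed over the frequencies of distinct variants in the read population. A sketch with toy reads (the study computed this over NGS amplicon reads of gag and env):

```python
# Sketch of the Shannon-entropy diversity measure: entropy (in bits)
# over the frequencies of distinct variants among the reads.

from collections import Counter
from math import log2

def shannon_entropy(reads):
    counts = Counter(reads)
    n = len(reads)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy read population: one variant at 50%, two at 25% each.
reads = ["ACGT"] * 8 + ["ACGA"] * 4 + ["ACTA"] * 4
print(shannon_entropy(reads))  # 1.5 bits
```

A homogeneous population scores 0; entropy rises as variants become more numerous and more evenly distributed, which is why it tracks the within-host diversification over time that both assays detected.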
Urabe, N; Ishii, Y; Hyodo, Y; Aoki, K; Yoshizawa, S; Saga, T; Murayama, S Y; Sakai, K; Homma, S; Tateda, K
2016-04-01
Between 18 November and 3 December 2011, five renal transplant patients at the Department of Nephrology, Toho University Omori Medical Centre, Tokyo, were diagnosed with Pneumocystis pneumonia (PCP). We used molecular epidemiologic methods to determine whether the patients were infected with the same strain of Pneumocystis jirovecii. DNA extracted from the residual bronchoalveolar lavage fluid from the five outbreak cases and from another 20 cases of PCP between 2007 and 2014 were used for multilocus sequence typing to compare the genetic similarity of the P. jirovecii. DNA base sequencing by the Sanger method showed some regions where two bases overlapped and could not be defined. A next-generation sequencer was used to analyse the types and ratios of these overlapping bases. DNA base sequences of P. jirovecii in the bronchoalveolar lavage fluid from four of the five PCP patients in the 2011 outbreak and from another two renal transplant patients who developed PCP in 2013 were highly homologous. The Sanger method revealed 14 genomic regions where two differing DNA bases overlapped and could not be identified. Analyses of the overlapping bases by a next-generation sequencer revealed that the differing types of base were present in almost identical ratios. There is a strong possibility that the PCP outbreak at the Toho University Omori Medical Centre was caused by the same strain of P. jirovecii. Two different types of base present in some regions may be due to P. jirovecii's being a diploid species. Copyright © 2015 European Society of Clinical Microbiology and Infectious Diseases. Published by Elsevier Ltd. All rights reserved.
Adhikari, Badri; Hou, Jie; Cheng, Jianlin
2018-03-01
In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignment to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66. © 2017 Wiley Periodicals, Inc.
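The "top L/5 long-range precision" metric used throughout this evaluation can be sketched directly. The data layout and the separation cut-off of 24 residues are the conventional CASP definitions, assumed here rather than taken from the paper:

```python
# Sketch of the CASP-style evaluation metric: precision of the top
# L/5 predicted long-range contacts, where L is the domain length and
# "long-range" conventionally means sequence separation >= 24.

def top_l5_precision(predictions, true_contacts, L, min_sep=24):
    """predictions: (i, j, score) tuples; true_contacts: set of (i, j)."""
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_sep]
    top = sorted(long_range, key=lambda p: -p[2])[: max(L // 5, 1)]
    hits = sum((i, j) in true_contacts for i, j, _ in top)
    return hits / len(top)

# Toy example: four predictions, one short-range pair is excluded,
# the top 2 of the rest are scored against the true contact set.
preds = [(1, 30, 0.9), (2, 40, 0.8), (3, 50, 0.7), (5, 10, 0.99)]
truth = {(1, 30), (3, 50)}
print(top_l5_precision(preds, truth, L=10))  # 1 of top 2 correct -> 0.5
```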
DNA Barcode Sequence Identification Incorporating Taxonomic Hierarchy and within Taxon Variability
Little, Damon P.
2011-01-01
For DNA barcoding to succeed as a scientific endeavor an accurate and expeditious query sequence identification method is needed. Although a global multiple-sequence alignment can be generated for some barcoding markers (e.g. COI, rbcL), not all barcoding markers are as structurally conserved (e.g. matK). Thus, algorithms that depend on global multiple-sequence alignments are not universally applicable. Some sequence identification methods that use local pairwise alignments (e.g. BLAST) are unable to accurately differentiate between highly similar sequences and are not designed to cope with hierarchic phylogenetic relationships or within taxon variability. Here, I present a novel alignment-free sequence identification algorithm, BRONX, that accounts for observed within taxon variability and hierarchic relationships among taxa. BRONX identifies short variable segments and corresponding invariant flanking regions in reference sequences. These flanking regions are used to score variable regions in the query sequence without the production of a global multiple-sequence alignment. By incorporating observed within taxon variability into the scoring procedure, misidentifications arising from shared alleles/haplotypes are minimized. An explicit treatment of more inclusive terminals allows for separate identifications to be made for each taxonomic level and/or for user-defined terminals. BRONX performs better than all other methods when there is imperfect overlap between query and reference sequences (e.g. mini-barcode queries against a full-length barcode database). BRONX consistently produced better identifications at the genus level for all query types. PMID:21857897
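The flank-anchored extraction at the heart of this approach can be sketched in a few lines. This is a drastic simplification of BRONX itself: exact string matching stands in for its scoring, and the sequences and taxon table are invented:

```python
# Illustrative sketch (far simpler than BRONX): use invariant flanking
# strings to pull the variable segment out of a query with no global
# alignment, then look it up against per-taxon observed alleles.

def extract_variable(query, left_flank, right_flank):
    """Segment between the two flanks, or None if either is absent."""
    i = query.find(left_flank)
    j = query.find(right_flank, i + len(left_flank)) if i != -1 else -1
    if i == -1 or j == -1:
        return None
    return query[i + len(left_flank):j]

# Recording several alleles per taxon captures within-taxon variability.
alleles = {"AA": "Taxon1", "AT": "Taxon1", "GG": "Taxon2"}

segment = extract_variable("CCGGAATTCC", "CCGG", "TTCC")
print(segment, "->", alleles.get(segment))  # AA -> Taxon1
```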
Evolutionary profiles from the QR factorization of multiple sequence alignments
Sethi, Anurag; O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-01-01
We present an algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of the homologous group. The method, based on the multidimensional QR factorization of numerically encoded multiple sequence alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. We observe a general trend that these smaller, more evolutionarily balanced profiles have comparable and, in many cases, better performance in database searches than conventional profiles containing hundreds of sequences, constructed in an iterative and computationally intensive procedure. For more diverse families or superfamilies, with sequence identity <30%, structural alignments, based purely on the geometry of the protein structures, provide better alignments than pure sequence-based methods. Merging the structure and sequence information allows the construction of accurate profiles for distantly related groups. These structure-based profiles outperformed other sequence-based methods for finding distant homologs and were used to identify a putative class II cysteinyl-tRNA synthetase (CysRS) in several archaea that eluded previous annotation studies. Phylogenetic analysis showed the putative class II CysRSs to be a monophyletic group and homology modeling revealed a constellation of active site residues similar to that in the known class I CysRS. PMID:15741270
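The ordering step this abstract describes, a pivoted QR-style factorization that ranks sequences by how much linearly independent information each contributes, can be sketched with a greedy pivoted Gram-Schmidt pass. The numeric encoding here is a toy stand-in for the paper's alignment encoding:

```python
# Sketch of pivoted-QR sequence ordering: repeatedly pick the encoded
# sequence with the largest residual norm, then project that direction
# out of the rest. Linearly dependent (redundant) sequences end up last.

def pivot_order(cols):
    """Greedy pivoted Gram-Schmidt over a list of column vectors."""
    cols = [list(map(float, c)) for c in cols]
    n = len(cols)
    chosen = []
    for _ in range(n):
        # Pick the remaining column with the largest residual norm.
        best, best_norm = None, -1.0
        for j in range(n):
            if j in chosen:
                continue
            norm = sum(x * x for x in cols[j]) ** 0.5
            if norm > best_norm:
                best, best_norm = j, norm
        chosen.append(best)
        if best_norm > 1e-12:
            u = [x / best_norm for x in cols[best]]
            for j in range(n):
                dot = sum(a * b for a, b in zip(u, cols[j]))
                cols[j] = [a - dot * b for a, b in zip(cols[j], u)]
    return chosen

# Toy "alignment": three encoded sequences; sequence 2 is a scaled
# copy of sequence 0, contributes nothing new, and is ordered last.
seqs = [[1, 3, 2, 4], [2, 1, 2, 3], [0.5, 1.5, 1.0, 2.0]]
print(pivot_order(seqs))  # [0, 1, 2]
```

Truncating the ordering where the residual norms collapse yields the minimal basis set of sequences the paper uses to build evolutionary profiles.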
Fingerprints of Modified RNA Bases from Deep Sequencing Profiles.
Kietrys, Anna M; Velema, Willem A; Kool, Eric T
2017-11-29
Posttranscriptional modifications of RNA bases are not only found in many noncoding RNAs but have also recently been identified in coding (messenger) RNAs as well. They require complex and laborious methods to locate, and many still lack methods for localized detection. Here we test the ability of next-generation sequencing (NGS) to detect and distinguish between ten modified bases in synthetic RNAs. We compare ultradeep sequencing patterns of modified bases, including miscoding, insertions and deletions (indels), and truncations, to unmodified bases in the same contexts. The data show widely varied responses to modification, ranging from no response, to high levels of mutations, insertions, deletions, and truncations. The patterns are distinct for several of the modifications, and suggest the future use of ultradeep sequencing as a fingerprinting strategy for locating and identifying modifications in cellular RNAs.
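The "fingerprint" here is a per-position profile of error types across deep-sequencing reads. A hypothetical sketch of that summary (the alignment format, `-` for deletions, and the counts are invented for illustration):

```python
# Sketch of a modification fingerprint: per-position miscoding and
# deletion rates across aligned reads; modified bases show
# characteristic elevated rates relative to unmodified contexts.

def error_profile(reads, reference):
    """reads: aligned read strings, '-' marking a deletion.
    Returns [(miscoding_rate, deletion_rate)] per reference position."""
    profile = []
    for pos, ref_base in enumerate(reference):
        mis = sum(r[pos] not in (ref_base, '-') for r in reads)
        dele = sum(r[pos] == '-' for r in reads)
        profile.append((mis / len(reads), dele / len(reads)))
    return profile

reads = ["ACGT", "ACAT", "AC-T", "ACAT"]
print(error_profile(reads, "ACGT"))
# Position 2 stands out: half the reads miscode, a quarter delete.
```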
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment
2013-01-01
Background: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. Results: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Conclusion: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA. PMID:24564200
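The core approximation behind quasi-alignments, comparing segments by p-mer frequency profiles instead of computing an edit-distance alignment, can be sketched directly. The choice of p=3 and the Manhattan profile distance below are illustrative assumptions, not the paper's exact parameters:

```python
# Sketch of the quasi-alignment approximation: compare segments via
# their p-mer frequency profiles; similar segments give small profile
# distances, with no alignment computed.

from collections import Counter

def pmer_profile(segment, p=3):
    """Counts of overlapping p-mers in a segment."""
    return Counter(segment[i:i + p] for i in range(len(segment) - p + 1))

def profile_distance(a, b, p=3):
    """Manhattan distance between p-mer profiles: a cheap proxy for
    the edit distance between the two segments."""
    pa, pb = pmer_profile(a, p), pmer_profile(b, p)
    return sum(abs(pa[k] - pb[k]) for k in set(pa) | set(pb))

print(profile_distance("ACGTACGT", "ACGTACGT"))  # identical: 0
print(profile_distance("ACGTACGT", "ACGTTCGT"))  # one substitution: 6
```

Because each substitution perturbs at most p of the overlapping p-mers, the profile distance grows with edit distance, which is what lets a data-stream clusterer group near-identical segments without alignment.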
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.
Nagar, Anurag; Hahsler, Michael
2013-01-01
Tran, Tuan-Anh; Vo, Nam Tri; Nguyen, Hoang Duc; Pham, Bao The
2015-12-01
Recombinant proteins play an important role in many aspects of life and have generated a huge income, notably in the industrial enzyme business. A gene is introduced into a vector and expressed in a host organism (for example, E. coli) to obtain a high productivity of the target protein. However, genes transferred from particular organisms are not usually compatible with the host's expression system for various reasons, for example, codon usage bias, GC content, repetitive sequences, and secondary structure. One solution is to develop programs that design an optimized nucleotide sequence from a peptide sequence using properties of highly expressed genes (HEGs) of the host organism. Existing HEG data determined by experimental and computer-based methods are insufficient in both quality and quantity. Therefore, the demand for developing a new HEG prediction method is critical. We propose a new method for predicting HEGs and criteria to evaluate gene optimization. Codon usage bias was weighted by amplifying the difference between HEGs and non-highly expressed genes (non-HEGs). The number of predicted HEGs is 5% of the genome. In comparison with Puigbò's method, our result is twice as good in kernel ratio and kernel sensitivity. Concerning transcription/translation factor proteins (TF), the proposed method gives low TF sensitivity, while Puigbò's method gives a moderate one. In summary, the results indicate that the proposed method can be a good optional method for predicting optimized genes for particular organisms, and we generated an HEG database for further research in gene design.
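The weighting idea, amplifying the contrast in codon usage between HEGs and a background set, might be sketched as follows. The ratio-based weight is a hypothetical stand-in for the authors' actual amplification scheme, and the toy sequences are made up:

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in a coding sequence (length assumed divisible by 3)."""
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def usage_bias_weights(heg_seqs, background_seqs):
    """Weight each codon by its frequency in highly expressed genes (HEGs)
    relative to a background set: one simple way to amplify the difference
    between HEG and non-HEG codon usage."""
    heg, bg = Counter(), Counter()
    for s in heg_seqs:
        heg.update(codon_counts(s))
    for s in background_seqs:
        bg.update(codon_counts(s))
    total_h, total_b = sum(heg.values()), sum(bg.values())
    # Codons absent from the HEG set are dropped rather than given weight 0.
    return {c: (heg[c] / total_h) / (bg[c] / total_b) for c in bg if heg[c]}

w = usage_bias_weights(["ATGAAAAAA"], ["ATGAAAAAG"])  # AAA is HEG-enriched
```

A codon with weight above 1 is over-represented in HEGs and would be favored when back-translating a peptide sequence for the host.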
Accurate read-based metagenome characterization using a hierarchical suite of unique signatures
Freitas, Tracey Allen K.; Li, Po-E; Scholz, Matthew B.; Chain, Patrick S. G.
2015-01-01
A major challenge in the field of shotgun metagenomics is the accurate identification of organisms present within a microbial community, based on classification of short sequence reads. Though existing microbial community profiling methods have attempted to rapidly classify the millions of reads output from modern sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, errors and biases in sequencing technologies, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here, we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly and consistently smaller FDR than any other available method. Our algorithm circumvents false positives using a series of non-redundant signature databases and examines Genomic Origins Through Taxonomic CHAllenge (GOTTCHA). GOTTCHA was tested and validated on 20 synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools. PMID:25765641
Townsley, Brad T; Covington, Michael F; Ichihashi, Yasunori; Zumstein, Kristina; Sinha, Neelima R
2015-01-01
Next Generation Sequencing (NGS) is driving rapid advancement in biological understanding and RNA-sequencing (RNA-seq) has become an indispensable tool for biology and medicine. There is a growing need for access to these technologies although preparation of NGS libraries remains a bottleneck to wider adoption. Here we report a novel method for the production of strand specific RNA-seq libraries utilizing the terminal breathing of double-stranded cDNA to capture and incorporate a sequencing adapter. Breath Adapter Directional sequencing (BrAD-seq) reduces sample handling and requires far fewer enzymatic steps than most available methods to produce high quality strand-specific RNA-seq libraries. The method we present is optimized for 3-prime Digital Gene Expression (DGE) libraries and can easily extend to full transcript coverage shotgun (SHO) type strand-specific libraries and is modularized to accommodate a diversity of RNA and DNA input materials. BrAD-seq offers a highly streamlined and inexpensive option for RNA-seq libraries.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs.
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G; Rigoutsos, Isidore; Kirino, Yohei
2017-05-19
Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has been presenting a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
ChIP-seq and RNA-seq methods to study circadian control of transcription in mammals
Takahashi, Joseph S.; Kumar, Vivek; Nakashe, Prachi; Koike, Nobuya; Huang, Hung-Chung; Green, Carla B.; Kim, Tae-Kyung
2015-01-01
Genome-wide analyses have revolutionized our ability to study the transcriptional regulation of circadian rhythms. The advent of next-generation sequencing methods has facilitated the use of two such technologies, ChIP-seq and RNA-seq. In this chapter, we describe detailed methods and protocols for these two techniques, with emphasis on their usage in circadian rhythm experiments in the mouse liver, a major target organ of the circadian clock system. Critical factors for these methods are highlighted and issues arising with time series samples for ChIP-seq and RNA-seq are discussed. Finally detailed protocols for library preparation suitable for Illumina sequencing platforms are presented. PMID:25662462
A Rapid Method for Engineering Recombinant Polioviruses or Other Enteroviruses.
Bessaud, Maël; Pelletier, Isabelle; Blondel, Bruno; Delpeyroux, Francis
2016-01-01
The cloning of large enterovirus RNA sequences is labor-intensive because of the frequent instability in bacteria of plasmid vectors containing the corresponding cDNAs. In order to circumvent this issue, we have developed a PCR-based method that allows the generation of highly modified or chimeric full-length enterovirus genomes. This method relies on fusion PCR, which enables the concatenation of several overlapping cDNA amplicons produced separately. A T7 promoter sequence added upstream of the fusion PCR products allows their transcription into infectious genomic RNAs directly in transfected cells constitutively expressing the phage T7 RNA polymerase. This method permits the rapid recovery of modified viruses that can subsequently be amplified on adequate cell lines.
Direct generation of all-optical random numbers from optical pulse amplitude chaos.
Li, Pu; Wang, Yun-Cai; Wang, An-Bang; Yang, Ling-Zhen; Zhang, Ming-Jiang; Zhang, Jian-Zhong
2012-02-13
We propose and theoretically demonstrate an all-optical method for directly generating all-optical random numbers from pulse amplitude chaos produced by a mode-locked fiber ring laser. Under an appropriate pump intensity, the mode-locked laser can experience a quasi-periodic route to chaos. Such chaos consists of a stream of pulses with a fixed repetition frequency but random intensities. In this method, we require no sampling procedure or externally triggered clocks but directly quantize the chaotic pulse stream into a random number sequence via an all-optical flip-flop. Moreover, our simulation results show that the pulse amplitude chaos has no periodicity and possesses a highly symmetric distribution of amplitude. Thus, in theory, the obtained random number sequence has high-quality randomness without post-processing, as verified by industry-standard statistical tests.
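The industry-standard statistical tests referred to here are typically the NIST SP 800-22 suite; its first and simplest member, the frequency (monobit) test, can be written directly from the specification:

```python
import math

def monobit_p_value(bits):
    """NIST SP 800-22 frequency (monobit) test: p-value for the null
    hypothesis that the bit sequence is random. A very small p-value
    indicates the zeros and ones are badly imbalanced."""
    n = len(bits)
    # Map 0 -> -1, 1 -> +1 and take the absolute partial sum.
    s = abs(sum(1 if b else -1 for b in bits))
    return math.erfc(s / math.sqrt(2 * n))

print(monobit_p_value([0, 1] * 500))  # balanced sequence: prints 1.0
print(monobit_p_value([1] * 1000))    # constant sequence: p is essentially 0
```

A generated sequence passes this test when the p-value is at or above the chosen significance level (NIST uses 0.01).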
Error baseline rates of five sample preparation methods used to characterize RNA virus populations.
Kugelman, Jeffrey R; Wiley, Michael R; Nagle, Elyse R; Reyes, Daniel; Pfeffer, Brad P; Kuhn, Jens H; Sanchez-Lockhart, Mariano; Palacios, Gustavo F
2017-01-01
Individual RNA viruses typically occur as populations of genomes that differ slightly from each other due to mutations introduced by the error-prone viral polymerase. Understanding the variability of RNA virus genome populations is critical for understanding virus evolution because individual mutant genomes may gain evolutionary selective advantages and give rise to dominant subpopulations, possibly even leading to the emergence of viruses resistant to medical countermeasures. Reverse transcription of virus genome populations followed by next-generation sequencing is the only available method to characterize variation for RNA viruses. However, both steps may lead to the introduction of artificial mutations, thereby skewing the data. To better understand how such errors are introduced during sample preparation, we determined and compared error baseline rates of five different sample preparation methods by analyzing in vitro transcribed Ebola virus RNA from an artificial plasmid-based system. These methods included: shotgun sequencing from plasmid DNA or in vitro transcribed RNA as a basic "no amplification" method, amplicon sequencing from the plasmid DNA or in vitro transcribed RNA as a "targeted" amplification method, sequence-independent single-primer amplification (SISPA) as a "random" amplification method, rolling circle reverse transcription sequencing (CirSeq) as an advanced "no amplification" method, and Illumina TruSeq RNA Access as a "targeted" enrichment method. The measured error frequencies indicate that RNA Access offers the best tradeoff between sensitivity and sample preparation error (1.4 × 10⁻⁵) of all compared methods. PMID:28182717
Genetic basis of arrhythmogenic cardiomyopathy.
Karmouch, Jennifer; Protonotarios, Alexandros; Syrris, Petros
2018-05-01
To date, 16 genes have been associated with arrhythmogenic cardiomyopathy (ACM). Mutations in these genes can lead to a broad spectrum of phenotypic expression, ranging from disease affecting predominantly the right or left ventricle to biventricular subtypes. Understanding the genetic causes of ACM is important in the diagnosis and management of the disorder. This review summarizes recent advances in molecular genetics and discusses the application of next-generation sequencing technology in genetic testing for ACM. Use of next-generation sequencing methods has resulted in the identification of novel causative variants and genes for ACM. The involvement of filamin C in ACM demonstrates the genetic overlap between ACM and other types of cardiomyopathy. Putative pathogenic variants have been detected in the cadherin 2 gene, which encodes a protein involved in cell adhesion. Large genomic rearrangements in desmosome genes have been systematically investigated in a cohort of ACM patients. Recent studies have identified novel causes of ACM, providing new insights into the genetic spectrum of the disease and highlighting an overlapping phenotype between ACM and dilated cardiomyopathy. Next-generation sequencing is a useful tool for research and genetic diagnostic screening, but interpretation of identified sequence variants requires caution and should be performed in specialized centres.
Hoople, Gordon D; Richards, Andrew; Wu, Yan; Pisano, Albert P; Zhang, Kun
2018-03-26
The ability to amplify and sequence either DNA or RNA from small starting samples has only been achieved in the last five years. Unfortunately, the standard protocols for generating genomic or transcriptomic libraries are incompatible and researchers must choose whether to sequence DNA or RNA for a particular sample. Gel-seq solves this problem by enabling researchers to simultaneously prepare libraries for both DNA and RNA starting with 100 - 1000 cells using a simple hydrogel device. This paper presents a detailed approach for the fabrication of the device as well as the biological protocol to generate paired libraries. We designed Gel-seq so that it could be easily implemented by other researchers; many genetics labs already have the necessary equipment to reproduce the Gel-seq device fabrication. Our protocol employs commonly-used kits for both whole-transcript amplification (WTA) and library preparation, which are also likely to be familiar to researchers already versed in generating genomic and transcriptomic libraries. Our approach allows researchers to bring to bear the power of both DNA and RNA sequencing on a single sample without splitting and with negligible added cost.
Dill, V; Beer, M; Hoffmann, B
2017-08-01
Foot-and-mouth disease (FMD) is a major contributor to poverty and food insecurity in Africa and Asia, and it is one of the biggest threats to agriculture in highly developed countries. As FMD is extremely contagious, strategies for its prevention, early detection, and the immediate characterisation of outbreak strains are of great importance. The generation of whole-genome sequences enables phylogenetic characterisation, the epidemiological tracing of virus transmission pathways and is supportive in disease control strategies. This study describes the development and validation of a rapid, universal and cost-efficient RT-PCR system to generate genome sequences of FMDV, reaching from the IRES to the end of the open reading frame. The method was evaluated using twelve different virus strains covering all seven serotypes of FMDV. Additionally, samples from experimentally infected animals were tested to mimic diagnostic field samples. All primer pairs showed a robust amplification with a high sensitivity for all serotypes. In summary, the described assay is suitable for the generation of FMDV sequences from all serotypes to allow immediate phylogenetic analysis, detailed genotyping and molecular epidemiology. Copyright © 2017 Elsevier B.V. All rights reserved.
Marshall, Owen J; Southall, Tony D; Cheetham, Seth W; Brand, Andrea H
2016-09-01
This protocol is an extension to: Nat. Protoc. 2, 1467-1478 (2007); doi:10.1038/nprot.2007.148; published online 7 June 2007. The ability to profile transcription and chromatin binding in a cell-type-specific manner is a powerful aid to understanding cell-fate specification and cellular function in multicellular organisms. We recently developed targeted DamID (TaDa) to enable genome-wide, cell-type-specific profiling of DNA- and chromatin-binding proteins in vivo without cell isolation. As a protocol extension, this article describes substantial modifications to an existing protocol, and it offers additional applications. TaDa builds upon DamID, a technique for detecting genome-wide DNA-binding profiles of proteins, by coupling it with the GAL4 system in Drosophila to enable both temporal and spatial resolution. TaDa ensures that Dam-fusion proteins are expressed at very low levels, thus avoiding toxicity and potential artifacts from overexpression. The modifications to the core DamID technique presented here also increase the speed of sample processing and throughput, and adapt the method to next-generation sequencing technology. TaDa is robust, reproducible and highly sensitive. Compared with other methods for cell-type-specific profiling, the technique requires no cell sorting, cross-linking or antisera, and binding profiles can be generated from as few as 10,000 total induced cells. By profiling the genome-wide binding of RNA polymerase II (Pol II), TaDa can also identify transcribed genes in a cell-type-specific manner. Here we describe a detailed protocol for carrying out TaDa experiments and preparing the material for next-generation sequencing. Although we developed TaDa in Drosophila, it should be easily adapted to other organisms with an inducible expression system. Once transgenic animals are obtained, the entire experimental procedure, from collecting tissue samples to generating sequencing libraries, can be accomplished within 5 d.
The Impact of Normalization Methods on RNA-Seq Data Analysis
Zyprych-Walczak, J.; Szabelska, A.; Handschuh, L.; Górczak, K.; Klamecka, K.; Figlerowicz, M.; Siatkowski, I.
2015-01-01
High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. Massive and complex data sets produced by the sequencers create a need for development of statistical and computational methods that can tackle the analysis and management of data. The data normalization is one of the most crucial steps of data processing and this process must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. The described workflow includes calculation of the bias and variance values for the control genes, sensitivity and specificity of the methods, and classification errors as well as generation of the diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably. PMID:26176014
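One of the depth-related normalizations typically included in such comparisons is upper-quartile scaling; a minimal sketch follows. The specific quantile index and the averaging scheme below are illustrative choices, and real analyses use packages such as edgeR or DESeq:

```python
def upper_quartile_factors(counts):
    """Per-sample scaling factors from the upper quartile of nonzero gene
    counts, one of the sequencing-depth normalizations compared in studies
    like this one (alongside total-count, median, TMM, and size factors)."""
    def uq(sample):
        nonzero = sorted(c for c in sample if c > 0)
        return nonzero[int(0.75 * (len(nonzero) - 1))]

    uqs = [uq(sample) for sample in counts]
    mean_uq = sum(uqs) / len(uqs)
    # Multiply each sample's counts by its factor to equalize depth.
    return [mean_uq / u for u in uqs]

# counts: one list of per-gene read counts per sample
factors = upper_quartile_factors([[10, 20, 30, 40], [20, 40, 60, 80]])
```

Here the second sample was sequenced twice as deeply, so its counts are scaled down by a factor below 1 while the first sample's are scaled up.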
Ravi, Koustuban; Schimpf, Damian N; Kärtner, Franz X
2016-10-31
The use of laser pulse sequences to drive the cascaded difference frequency generation of high-energy, high-peak-power and multi-cycle terahertz pulses in cryogenically cooled (100 K) periodically poled lithium niobate is proposed and studied. Detailed simulations considering the coupled nonlinear interaction of terahertz and optical waves (or pump depletion) show that unprecedented optical-to-terahertz energy conversion efficiencies > 5% and peak electric fields of hundred(s) of megavolts per meter at terahertz pulse durations of hundred(s) of picoseconds can be achieved. The proposed methods are shown to circumvent laser-induced damage limitations at Joule-level pumping by 1 µm lasers to enable multi-cycle terahertz sources with pulse energies > 10 millijoules. Various pulse sequence formats are proposed and analyzed. Numerical calculations for periodically poled structures accounting for cascaded difference frequency generation, self-phase modulation, cascaded second harmonic generation and laser-induced damage are introduced. The physics governing terahertz generation using pulse sequences in this high-conversion-efficiency regime, limitations, and practical considerations are discussed. It is shown that varying the poling period along the crystal length and further reduction of absorption can lead to even higher energy conversion efficiencies > 10%. In addition to numerical calculations, an analytic formulation valid for arbitrary pulse formats and closed-form expressions for important cases are presented. Parameters optimizing conversion efficiency in the 0.1-1 THz range, the corresponding peak electric fields, crystal lengths and terahertz pulse properties are furnished.
Nanopore sequencing in microgravity
McIntyre, Alexa B R; Rizzardi, Lindsay; Yu, Angela M; Alexander, Noah; Rosen, Gail L; Botkin, Douglas J; Stahl, Sarah E; John, Kristen K; Castro-Wallace, Sarah L; McGrath, Ken; Burton, Aaron S; Feinberg, Andrew P; Mason, Christopher E
2016-01-01
Rapid DNA sequencing and analysis has been a long-sought goal in remote research and point-of-care medicine. In microgravity, DNA sequencing can facilitate novel astrobiological research and close monitoring of crew health, but spaceflight places stringent restrictions on the mass and volume of instruments, crew operation time, and instrument functionality. The recent emergence of portable, nanopore-based tools with streamlined sample preparation protocols finally enables DNA sequencing on missions in microgravity. As a first step toward sequencing in space and aboard the International Space Station (ISS), we tested the Oxford Nanopore Technologies MinION during a parabolic flight to understand the effects of variable gravity on the instrument and data. In a successful proof-of-principle experiment, we found that the instrument generated DNA reads over the course of the flight, including the first ever sequenced in microgravity, and additional reads measured after the flight concluded its parabolas. Here we detail modifications to the sample-loading procedures to facilitate nanopore sequencing aboard the ISS and in other microgravity environments. We also evaluate existing analysis methods and outline two new approaches, the first based on a wave-fingerprint method and the second on entropy signal mapping. Computationally light analysis methods offer the potential for in situ species identification, but are limited by the error profiles (stays, skips, and mismatches) of older nanopore data. Higher accuracies attainable with modified sample processing methods and the latest version of flow cells will further enable the use of nanopore sequencers for diagnostics and research in space. PMID:28725742
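A computationally light entropy analysis of the kind outlined in this record might look like the following sketch. The integer quantization and window size are assumptions for illustration, not the authors' entropy signal mapping algorithm:

```python
import math
from collections import Counter

def window_entropy(signal, win=6):
    """Shannon entropy (bits) of each sliding window of a nanopore current
    trace after crude integer quantization; low-entropy stretches can flag
    repetitive or uninformative regions without full basecalling."""
    levels = [int(x) for x in signal]  # coarse quantization of current levels
    out = []
    for i in range(len(levels) - win + 1):
        c = Counter(levels[i:i + win])
        n = sum(c.values())
        out.append(-sum(v / n * math.log2(v / n) for v in c.values()))
    return out

# Toy current trace with three distinct levels:
ent = window_entropy([80.1, 80.3, 95.2, 95.4, 60.0, 60.1], win=3)
```

Every length-3 window in this toy trace contains two distinct levels with counts 2 and 1, so each entry of `ent` is log2(3) - 2/3, about 0.918 bits.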
It’s More Than Stamp Collecting: How Genome Sequencing Can Unify Biological Research
Richards, Stephen
2015-01-01
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, whilst the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to “Big Science” survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. PMID:26003218
The quest for rare variants: pooled multiplexed next generation sequencing in plants.
Marroni, Fabio; Pinosio, Sara; Morgante, Michele
2012-01-01
Next generation sequencing (NGS) instruments produce an unprecedented amount of sequence data at contained costs. This gives researchers the possibility of designing studies with adequate power to identify rare variants at a fraction of the economic and labor resources required by individual Sanger sequencing. As of today, few research groups working in plant sciences have exploited this potential, showing that pooled NGS provides results in excellent agreement with those obtained by individual Sanger sequencing. The aim of this review is to convey to the reader the general ideas underlying the use of pooled NGS for the identification of rare variants. To facilitate a thorough understanding of the possibilities of the method, we will explain in detail the possible experimental and analytical approaches and discuss their advantages and disadvantages. We will show that allele count information obtained by pooled NGS can be used to accurately compute basic population genetics indexes such as allele frequency, nucleotide diversity, and Tajima's D. Finally, we will discuss applications and future perspectives of the multiplexed NGS approach.
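The population-genetic quantities mentioned can be computed directly from pooled read counts; a minimal per-site sketch using standard formulas, not code from the review:

```python
def site_pi(ref_count, alt_count):
    """Per-site nucleotide diversity (expected heterozygosity) from pooled
    read counts at one variant site, with the usual n/(n-1) small-sample
    correction, where n is the number of reads covering the site."""
    n = ref_count + alt_count
    p = alt_count / n
    return (n / (n - 1)) * 2 * p * (1 - p)

# 30 of 100 reads covering a site carry the variant allele:
freq = 30 / 100          # pooled allele frequency estimate
pi = site_pi(70, 30)     # per-site diversity, about 0.424
```

Summing `site_pi` over all sites and dividing by sequence length gives a genome- or region-wide diversity estimate; combined with the count of segregating sites, this is the raw material for Tajima's D.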
Molecular Characterization of Transgenic Events Using Next Generation Sequencing Approach.
Guttikonda, Satish K; Marri, Pradeep; Mammadov, Jafar; Ye, Liang; Soe, Khaing; Richey, Kimberly; Cruse, James; Zhuang, Meibao; Gao, Zhifang; Evans, Clive; Rounsley, Steve; Kumpatla, Siva P
2016-01-01
Demand for the commercial use of genetically modified (GM) crops has been increasing in light of the projected growth of the world population to nine billion by 2050. A prerequisite of paramount importance for regulatory submissions is the rigorous safety assessment of GM crops. One component of safety assessment is molecular characterization at the DNA level, which helps to determine the copy number, integrity and stability of a transgene; characterize the integration site within a host genome; and confirm the absence of vector DNA. Historically, molecular characterization has been carried out using Southern blot analysis coupled with Sanger sequencing. While this is a robust approach to characterize transgenic crops, it is both time- and resource-consuming. The emergence of next-generation sequencing (NGS) technologies has provided a highly sensitive and cost- and labor-effective alternative to traditional Southern blot analysis for molecular characterization. Herein, we demonstrate the successful application of both whole genome sequencing and target capture sequencing approaches for the characterization of single and stacked transgenic events and compare the results and inferences with the traditional method with respect to key criteria required for regulatory submissions.
A multiple-alignment based primer design algorithm for genetically highly variable DNA targets
2013-01-01
Background: Primer design for highly variable DNA sequences is difficult, and experimental success requires attention to many interacting constraints. The advent of next-generation sequencing methods allows the investigation of rare variants otherwise hidden deep in large populations, but requires attention to population diversity and primer localization in relatively conserved regions, in addition to recognized constraints typically considered in primer design. Results: Design constraints include degenerate sites to maximize population coverage, matching of melting temperatures, optimizing de novo sequence length, finding optimal bio-barcodes to allow efficient downstream analyses, and minimizing risk of dimerization. To facilitate primer design addressing these and other constraints, we created a novel computer program (PrimerDesign) that automates this complex procedure. We show its powers and limitations and give examples of successful designs for the analysis of HIV-1 populations. Conclusions: PrimerDesign is useful for researchers who want to design DNA primers and probes for analyzing highly variable DNA populations. It can be used to design primers for PCR, RT-PCR, Sanger sequencing, next-generation sequencing, and other experimental protocols targeting highly variable DNA samples. PMID:23965160
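Melting-temperature matching, one of the constraints listed in this record, can be screened quickly with the classic Wallace rule. This is a generic sketch, not PrimerDesign's algorithm; production tools use more accurate nearest-neighbor thermodynamics:

```python
def wallace_tm(primer):
    """Wallace rule melting temperature: 2 degrees C per A/T base and
    4 degrees C per G/C base. A rough screen for short primers only."""
    at = sum(primer.count(b) for b in "AT")
    gc = sum(primer.count(b) for b in "GC")
    return 2 * at + 4 * gc

def tm_matched(p1, p2, tolerance=4):
    """Accept a primer pair only if their Tm estimates differ by no more
    than the tolerance (in degrees C)."""
    return abs(wallace_tm(p1) - wallace_tm(p2)) <= tolerance

print(wallace_tm("ACGTACGTACGTACGT"))  # prints 48
```

A candidate pair failing `tm_matched` would be discarded before the more expensive checks (degenerate-site coverage, dimerization risk) are run.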