ERIC Educational Resources Information Center
Noell, George H.; Gresham, Frank M.
2001-01-01
Describes the design logic and potential uses of a variant of the multiple-baseline design. The multiple-baseline multiple-sequence (MBL-MS) design consists of multiple-baseline designs that are interlaced with one another and include all possible sequences of treatments. The MBL-MS design appears to be primarily useful for comparison of treatments taking…
PFAAT version 2.0: a tool for editing, annotating, and analyzing multiple sequence alignments.
Caffrey, Daniel R; Dana, Paul H; Mathur, Vidhya; Ocano, Marco; Hong, Eun-Jong; Wang, Yaoyu E; Somaroo, Shyamal; Caffrey, Brian E; Potluri, Shobha; Huang, Enoch S
2007-10-11
By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis. Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue. Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition. PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotation. The annotations can be combined with sophisticated analysis to test hypotheses that relate to sequence, structure and function.
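To illustrate the idea of computing percent identities over selected alignment columns, here is a minimal Python sketch. It is not PFAAT code: the function name `percent_identity`, the gap symbol, and the choice to skip columns where both sequences have a gap are all assumptions made for this example.

```python
def percent_identity(seq_a, seq_b, columns=None, gap="-"):
    """Percent identity between two aligned sequences.

    `columns` is an optional iterable of 0-based column indices (for example,
    columns picked out by a residue annotation). Columns where both sequences
    have a gap are ignored; this convention is an assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if columns is None:
        columns = range(len(seq_a))
    compared = 0
    matches = 0
    for i in columns:
        a, b = seq_a[i], seq_b[i]
        if a == gap and b == gap:
            continue  # nothing to compare in a gap-only column
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Whole alignment versus a hand-picked subset of columns.
    print(percent_identity("ACDE", "ACDF"))
    print(percent_identity("ACDE", "AGDE", columns=[0, 2]))
```

Restricting `columns` to annotated positions gives an identity value for just that region, which is one way a column selection could feed a downstream calculation.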
Computer-aided visualization and analysis system for sequence evaluation
Chee, M.S.
1998-08-18
A computer system for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments are improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area and sample sequences in another area on a display device. 27 figs.
Computer-aided visualization and analysis system for sequence evaluation
Chee, Mark S.; Wang, Chunwei; Jevons, Luis C.; Bernhart, Derek H.; Lipshutz, Robert J.
2004-05-11
A computer system for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments are improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area and sample sequences in another area on a display device.
Computer-aided visualization and analysis system for sequence evaluation
Chee, Mark S.
1998-08-18
A computer system for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments are improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area and sample sequences in another area on a display device.
Computer-aided visualization and analysis system for sequence evaluation
Chee, Mark S.
2003-08-19
A computer system for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments may be improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area and sample sequences in another area on a display device.
Bellerophon: A program to detect chimeric sequences in multiple sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huber, Thomas; Faulkner, Geoffrey; Hugenholtz, Philip
2003-12-23
Bellerophon is a program for detecting chimeric sequences in multiple sequence datasets by an adaption of partial treeing analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries of environmental samples but can be applied to other nucleotide sequence alignments.
Computer-aided visualization and analysis system for sequence evaluation
Chee, Mark S.
1999-10-26
A computer system (1) for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments may be improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area (814) and sample sequences in another area (816) on a display device (3).
Computer-aided visualization and analysis system for sequence evaluation
Chee, Mark S.
2001-06-05
A computer system (1) for analyzing nucleic acid sequences is provided. The computer system is used to perform multiple methods for determining unknown bases by analyzing the fluorescence intensities of hybridized nucleic acid probes. The results of individual experiments may be improved by processing nucleic acid sequences together. Comparative analysis of multiple experiments is also provided by displaying reference sequences in one area (814) and sample sequences in another area (816) on a display device (3).
USDA-ARS's Scientific Manuscript database
The Spodoptera littoralis multiple nucleopolyhedrovirus (SpliMNPV), a pathogen of the Egyptian cotton leaf worm Spodoptera littoralis, was subjected to sequencing of its entire DNA genome and bioassay analysis comparing its virulence to that of other baculoviruses. The annotated SpliMNPV genome of...
USDA-ARS's Scientific Manuscript database
Geographic isolates of Lymantria dispar multiple nucleopolyhedrovirus: Genome sequence analysis and pathogenicity against European and Asian gypsy moth strains. To evaluate the genetic diversity of Lymantria dispar nucleopolyhedrovirus (LdMNPV) at the genomic level, the genomes of three isolates of...
Bellerophon: a program to detect chimeric sequences in multiple sequence alignments.
Huber, Thomas; Faulkner, Geoffrey; Hugenholtz, Philip
2004-09-22
Bellerophon is a program for detecting chimeric sequences in multiple sequence datasets by an adaption of partial treeing analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries of environmental samples but can be applied to other nucleotide sequence alignments. Bellerophon is available as an interactive web server at http://foo.maths.uq.edu.au/~huber/bellerophon.pl
eShadow: A tool for comparing closely related sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ovcharenko, Ivan; Boffelli, Dario; Loots, Gabriela G.
2004-01-15
Primate sequence comparisons are difficult to interpret due to the high degree of sequence similarity shared between such closely related species. Recently, a novel method, phylogenetic shadowing, has been pioneered for predicting functional elements in the human genome through the analysis of multiple primate sequence alignments. We have expanded this theoretical approach to create a computational tool, eShadow, for the identification of elements under selective pressure in multiple sequence alignments of closely related genomes, such as in comparisons of human to primate or mouse to rat DNA. This tool integrates two different statistical methods and allows for the dynamic visualization of the resulting conservation profile. eShadow also includes a versatile optimization module capable of training the underlying Hidden Markov Model to differentially predict functional sequences. This module grants the tool high flexibility in the analysis of multiple sequence alignments and in comparing sequences with different divergence rates. Here, we describe the eShadow comparative tool and its potential uses for analyzing both multiple nucleotide and protein alignments to predict putative functional elements. The eShadow tool is publicly available at http://eshadow.dcode.org/
BlockLogo: visualization of peptide and sequence motif conservation
Olsen, Lars Rønn; Kudahl, Ulrich Johan; Simon, Christian; Sun, Jing; Schönbach, Christian; Reinherz, Ellis L.; Zhang, Guang Lan; Brusic, Vladimir
2013-01-01
BlockLogo is a web-server application for visualization of protein and nucleotide fragments, continuous protein sequence motifs, and discontinuous sequence motifs using calculation of block entropy from multiple sequence alignments. The user input consists of a multiple sequence alignment, selection of motif positions, type of sequence, and output format definition. The output has BlockLogo along with the sequence logo, and a table of motif frequencies. We deployed BlockLogo as an online application and have demonstrated its utility through examples that show visualization of T-cell epitopes and B-cell epitopes (both continuous and discontinuous). Our additional example shows a visualization and analysis of structural motifs that determine specificity of peptide binding to HLA-DR molecules. The BlockLogo server also employs selected experimentally validated prediction algorithms to enable on-the-fly prediction of MHC binding affinity to 15 common HLA class I and class II alleles as well as visual analysis of discontinuous epitopes from multiple sequence alignments. It enables the visualization and analysis of structural and functional motifs that are usually described as regular expressions. It provides a compact view of discontinuous motifs composed of distant positions within biological sequences. BlockLogo is available at: http://research4.dfci.harvard.edu/cvc/blocklogo/ and http://methilab.bu.edu/blocklogo/ PMID:24001880
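The block entropy idea can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the BlockLogo server's code: it assumes a block is the string of residues found at the user-selected motif positions of each aligned sequence, and that block entropy is the Shannon entropy (in bits) of the block frequencies. The function name `block_entropy` is hypothetical.

```python
import math
from collections import Counter


def block_entropy(alignment, positions):
    """Return (entropy_in_bits, block_counts) for the selected columns.

    `alignment` is a list of equal-length aligned sequences and `positions`
    a list of 0-based column indices, which need not be contiguous.
    """
    # One block per sequence: the residues at the chosen positions, in order.
    blocks = ["".join(seq[p] for p in positions) for seq in alignment]
    counts = Counter(blocks)
    total = len(blocks)
    # Shannon entropy over block frequencies, written as p * log2(1/p).
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return entropy, counts


if __name__ == "__main__":
    aln = ["ACDE", "ACDF", "AGDE", "AGDF"]
    h, table = block_entropy(aln, [1, 3])  # a discontinuous two-position motif
    print(h, dict(table))
```

The returned counts correspond to the kind of motif frequency table the abstract mentions; because positions are passed as an arbitrary list, continuous and discontinuous motifs are handled the same way.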
Analysis of Ribosome Inactivating Protein (RIP): A Bioinformatics Approach
NASA Astrophysics Data System (ADS)
Jothi, G. Edward Gnana; Majilla, G. Sahaya Jose; Subhashini, D.; Deivasigamani, B.
2012-10-01
In spite of the medical advances in recent years, the world is in need of different sources to counter certain health issues. Ribosome Inactivating Proteins (RIPs) were found to be one among them. To give easy access to information about RIPs, there is a need to analyse RIPs with a view to constructing a database on RIPs. Also, multiple sequence alignment was done to screen for homologues of significant RIPs from rare sources against RIPs from easily available sources in terms of similarity. Protein sequences were retrieved from SWISS-PROT and further analysed using pairwise and multiple sequence alignment. Analysis shows that 151 RIPs have been characterized to date. Amongst them, there are 87 type I, 37 type II, 1 type III and 25 unknown RIPs. Sequence length information for the various RIPs, indicating whether the full or a partial sequence is available, was also found. The multiple sequence alignment of 37 type I RIPs using the online server Multalin indicates the presence of 20 conserved residues. Pairwise alignment and multiple sequence alignment of certain selected RIPs in two groups, namely Group I and Group II, were carried out and the consensus level was found to be 98%, 98% and 90% respectively.
Bernsen, M R; Dijkman, H B; de Vries, E; Figdor, C G; Ruiter, D J; Adema, G J; van Muijen, G N
1998-10-01
Molecular analysis of small tissue samples has become increasingly important in biomedical studies. Using a laser dissection microscope and modified nucleic acid isolation protocols, we demonstrate that multiple mRNA as well as DNA sequences can be identified from a single-cell sample. In addition, we show that the specificity of procurement of tissue samples is not compromised by smear contamination resulting from scraping of the microtome knife during sectioning of lesions. The procedures described herein thus allow for efficient RT-PCR or PCR analysis of multiple nucleic acid sequences from small tissue samples obtained by laser-assisted microdissection.
EUGENE'HOM: A generic similarity-based gene finder using multiple homologous sequences.
Foissac, Sylvain; Bardou, Philippe; Moisan, Annick; Cros, Marie-Josée; Schiex, Thomas
2003-07-01
EUGENE'HOM is a gene prediction software for eukaryotic organisms based on comparative analysis. EUGENE'HOM is able to take into account multiple homologous sequences from more or less closely related organisms. It integrates the results of TBLASTX analysis, splice site and start codon prediction and a robust coding/non-coding probabilistic model which allows EUGENE'HOM to handle sequences from a variety of organisms. The current target of EUGENE'HOM is plant sequences. The EUGENE'HOM web site is available at http://genopole.toulouse.inra.fr/bioinfo/eugene/EuGeneHom/cgi-bin/EuGeneHom.pl.
MANGO: a new approach to multiple sequence alignment.
Zhang, Zefeng; Lin, Hao; Li, Ming
2007-01-01
Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
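The difference between a spaced seed and a consecutive k-mer can be shown with a small Python sketch. It is not MANGO's algorithm and says nothing about how MANGO optimizes or combines its seeds; the seed patterns and function names (`spaced_keys`, `shared_hits`) are hypothetical and chosen only for illustration. In a pattern, `1` marks a position that must match and `0` a position that is ignored.

```python
def spaced_keys(seq, pattern):
    """Yield (start, key) for every window of `seq` under a seed pattern."""
    care = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        # The key keeps only the residues at the "must match" positions.
        yield start, "".join(seq[start + i] for i in care)


def shared_hits(seq_a, seq_b, pattern):
    """Return (pos_in_a, pos_in_b) pairs whose seed keys are identical."""
    index = {}
    for pos, key in spaced_keys(seq_a, pattern):
        index.setdefault(key, []).append(pos)
    return [(pa, pb)
            for pb, key in spaced_keys(seq_b, pattern)
            for pa in index.get(key, [])]


if __name__ == "__main__":
    a = "ACGTTGCA"
    b = "ACGATGCA"  # differs from `a` only at position 3
    print(shared_hits(a, b, "111"))   # consecutive 3-mer seed
    print(shared_hits(a, b, "1101"))  # spaced seed of the same weight
```

In this toy pair, the consecutive seed `111` produces no hit whose window covers the mismatch at position 3, while the spaced seed `1101` reports a hit at (1, 1) because its ignored position falls on the mismatch. Both seeds have three "must match" positions, which is the sense in which a spaced seed can be more sensitive at equal weight.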
Coupling detrended fluctuation analysis for multiple warehouse-out behavioral sequences
NASA Astrophysics Data System (ADS)
Yao, Can-Zhong; Lin, Ji-Nan; Zheng, Xu-Zhou
2017-01-01
Interaction patterns among different warehouses could make the warehouse-out behavioral sequences less predictable. We first apply coupling detrended fluctuation analysis to the warehouse-out quantity, and find that the multivariate sequences exhibit significant coupling multifractal characteristics regardless of the types of steel products. Secondly, we track the sources of multifractal warehouse-out sequences by shuffling and surrogating the original ones, and we find that the fat-tail distribution contributes more to multifractal features than the long-term memory, regardless of the types of steel products. From the perspective of warehouse contribution, some warehouses steadily contribute more to multifractality than other warehouses. Finally, based on multiscale multifractal analysis, we propose a Hurst surface structure to investigate coupling multifractality, and show that multiple behavioral sequences exhibit significant coupling multifractal features that emerge and are usually restricted to relatively large time-scale intervals.
EUGÈNE'HOM: a generic similarity-based gene finder using multiple homologous sequences
Foissac, Sylvain; Bardou, Philippe; Moisan, Annick; Cros, Marie-Josée; Schiex, Thomas
2003-01-01
EUGÈNE'HOM is a gene prediction software for eukaryotic organisms based on comparative analysis. EUGÈNE'HOM is able to take into account multiple homologous sequences from more or less closely related organisms. It integrates the results of TBLASTX analysis, splice site and start codon prediction and a robust coding/non-coding probabilistic model which allows EUGÈNE'HOM to handle sequences from a variety of organisms. The current target of EUGÈNE'HOM is plant sequences. The EUGÈNE'HOM web site is available at http://genopole.toulouse.inra.fr/bioinfo/eugene/EuGeneHom/cgi-bin/EuGeneHom.pl. PMID:12824408
DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.
Kelly, Steven; Maini, Philip K
2013-01-01
The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly-used distance based methods though not as accurate as maximum likelihood methods from good quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analysis of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.
TaxI: a software tool for DNA barcoding using distance methods
Steinke, Dirk; Vences, Miguel; Salzburger, Walter; Meyer, Axel
2005-01-01
DNA barcoding is a promising approach to the diagnosis of biological diversity in which DNA sequences serve as the primary key for information retrieval. Most existing software for evolutionary analysis of DNA sequences was designed for phylogenetic analyses and, hence, those algorithms do not offer appropriate solutions for the rapid, but precise analyses needed for DNA barcoding, and are also unable to process the often large comparative datasets. We developed a flexible software tool for DNA taxonomy, named TaxI. This program calculates sequence divergences between a query sequence (taxon to be barcoded) and each sequence of a dataset of reference sequences defined by the user. Because the analysis is based on separate pairwise alignments, this software is also able to work with sequences characterized by multiple insertions and deletions that are difficult to align in large sequence sets (i.e. thousands of sequences) by multiple alignment algorithms because of computational restrictions. Here, we demonstrate the utility of this approach with two datasets of fish larvae and juveniles from Lake Constance and juvenile land snails under different models of sequence evolution. Sets of ribosomal 16S rRNA sequences, characterized by multiple indels, performed as well as or better than cox1 sequence sets in assigning sequences to species, demonstrating the suitability of rRNA genes for DNA barcoding. PMID:16214755
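The query-versus-reference workflow described above can be sketched in Python as follows. This is not TaxI's implementation: the scoring values, the plain Needleman-Wunsch alignment, and the uncorrected p-distance (which ignores gapped columns) are assumptions picked for brevity, and the function names are hypothetical. TaxI itself is described as supporting different models of sequence evolution, which this sketch does not attempt.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(aligned_a, aligned_b, gap="-"):
    """Fraction of differing sites among columns where neither side is a gap."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def rank_references(query, references):
    """Align the query to each reference separately; sort by divergence."""
    results = [(name, p_distance(*global_align(query, seq)))
               for name, seq in references.items()]
    return sorted(results, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    refs = {"sp1": "ACGTACGT", "sp2": "ACGTTCGT", "sp3": "ACGACGT"}
    for name, dist in rank_references("ACGTACGT", refs):
        print(name, dist)
```

Because each reference is aligned to the query on its own, a reference with an indel (here the made-up `sp3`) needs no multiple alignment, which is the property the abstract highlights for indel-rich markers.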
Texture analysis of common renal masses in multiple MR sequences for prediction of pathology
NASA Astrophysics Data System (ADS)
Hoang, Uyen N.; Malayeri, Ashkan A.; Lay, Nathan S.; Summers, Ronald M.; Yao, Jianhua
2017-03-01
This pilot study performs texture analysis on multiple magnetic resonance (MR) images of common renal masses for differentiation of renal cell carcinoma (RCC). Bounding boxes are drawn around each mass on one axial slice in T1 delayed sequence to use for feature extraction and classification. All sequences (T1 delayed, venous, arterial, pre-contrast phases, T2, and T2 fat saturated sequences) are co-registered and texture features are extracted from each sequence simultaneously. Random forest is used to construct models to classify lesions on 96 normal regions, 87 clear cell RCCs, 8 papillary RCCs, and 21 renal oncocytomas; ground truths are verified through pathology reports. The highest performance is seen in random forest model when data from all sequences are used in conjunction, achieving an overall classification accuracy of 83.7%. When using data from one single sequence, the overall accuracies achieved for T1 delayed, venous, arterial, and pre-contrast phase, T2, and T2 fat saturated were 79.1%, 70.5%, 56.2%, 61.0%, 60.0%, and 44.8%, respectively. This demonstrates promising results of utilizing intensity information from multiple MR sequences for accurate classification of renal masses.
Panwar, Priyankar; Verma, A K; Dubey, Ashutosh
2018-05-01
Barnyard (Echinochloa frumentacea) and finger (Eleusine coracana) millet growing in the northwestern Himalaya were explored for the α-amylase inhibitor (α-AI). The mature seeds of barnyard millet variety PRJ1 had maximum α-AI activity, which increases across developmental stages. α-AI was purified up to 22.25-fold from barnyard millet variety PRJ1. Semi-quantitative PCR of different developmental stages of barnyard millet seeds showed increased levels of the transcript from 7 to 28 days. Sequence analysis revealed that it contained a 315 bp nucleotide sequence which encodes a 104 amino acid sequence with molecular weight 10.72 kDa. The predicted 3D structure of α-AI was 86.73% similar to a bifunctional inhibitor of ragi. In silico analysis of 71 α-AI protein sequences was carried out for biochemical features, homology search, multiple sequence alignment, phylogenetic tree construction, motif, and superfamily distribution of protein sequences. Analysis of the multiple sequence alignment revealed the existence of the conserved region NPLP[S/G]CRWYVV[S/Q][Q/R]TCG[V/I] throughout the sequences. Superfam analysis revealed that α-AI protein sequences were distributed among seven different superfamilies.
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment
2013-01-01
Background: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. Results: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Conclusion: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA. PMID:24564200
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.
Nagar, Anurag; Hahsler, Michael
2013-01-01
Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. 
Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
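The first step of the method, comparing segments through p-mer frequency counts instead of alignment, can be sketched in Python. This is an illustration under assumptions, not the authors' code: the segment length, the value of p, the non-overlapping segmentation, and the use of Manhattan distance between count profiles are choices made for this example, and the data stream clustering step is left out entirely.

```python
from collections import Counter


def pmer_counts(segment, p=3):
    """Count every overlapping p-mer in a segment."""
    return Counter(segment[i:i + p] for i in range(len(segment) - p + 1))


def manhattan(counts_a, counts_b):
    """Sum of absolute count differences over all p-mers seen in either."""
    keys = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[k] - counts_b[k]) for k in keys)


def segment_profiles(seq, seg_len=8, p=3):
    """Cut a sequence into non-overlapping segments and profile each one."""
    return [pmer_counts(seq[s:s + seg_len], p)
            for s in range(0, len(seq) - seg_len + 1, seg_len)]


if __name__ == "__main__":
    original = pmer_counts("ACGTACGT")
    mutated = pmer_counts("ACGAACGT")  # one substitution
    print(manhattan(original, mutated))
    print(len(segment_profiles("ACGTACGTACGTACGT")))
```

A single substitution can change at most p of the overlapping p-mers on each side, so the profile distance stays small for near-identical segments. Each profile is built in one pass over its segment, which is consistent with the linear scaling the abstract reports, though the sketch itself is not a benchmark.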
USDA-ARS's Scientific Manuscript database
The Agrotis ipsilon multiple nucleopolyhedrovirus (AgipMNPV) is a group II nucleopolyhedrovirus (NPV) from the black cutworm, A. ipsilon, with potential as a biopesticide to control infestations of cutworm larvae. The genome of the Illinois strain of AgipMNPV was completely sequenced. The AgipMNPV...
Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.
The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu
DNA Translator and Aligner: HyperCard utilities to aid phylogenetic analysis of molecules.
Eernisse, D J
1992-04-01
DNA Translator and Aligner are molecular phylogenetics HyperCard stacks for Macintosh computers. They manipulate sequence data to provide graphical gene mapping, conversions, translations and manual multiple-sequence alignment editing. DNA Translator is able to convert documented GenBank or EMBL sequences into linearized, rescalable gene maps whose gene sequences are extractable by clicking on the corresponding map button or by selection from a scrolling list. Provided gene maps, complete with extractable sequences, consist of nine metazoan, one yeast, and one ciliate mitochondrial DNAs and three green plant chloroplast DNAs. Single or multiple sequences can be manipulated to aid in phylogenetic analysis. Sequences can be translated between nucleic acids and proteins in either direction with flexible support of alternate genetic codes and ambiguous nucleotide symbols. Multiple aligned sequence output from diverse sources can be converted to Nexus, Hennig86 or PHYLIP format for subsequent phylogenetic analysis. Input or output alignments can be examined with Aligner, a convenient accessory stack included in the DNA Translator package. Aligner is an editor for the manual alignment of up to 100 sequences that toggles between display of matched characters and normal unmatched sequences. DNA Translator also generates graphic displays of amino acid coding and codon usage frequency relative to all other, or only synonymous, codons for approximately 70 select organism-organelle combinations. Codon usage data is compatible with spreadsheet or UWGCG formats for incorporation of additional molecules of interest. The complete package is available via anonymous ftp and is free for non-commercial uses.
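Translation under alternate genetic codes can be sketched in Python as below. This is not the HyperCard implementation: the names `build_code`, `STANDARD`, `VERT_MITO` and `translate` are hypothetical. The sketch starts from the standard genetic code and applies the four codon reassignments of the vertebrate mitochondrial code. As a simplification, any codon containing an ambiguous symbol is reported as `X`, and only the nucleic acid to protein direction is shown.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
STANDARD_AAS = ("FFLLSSSSYY**CC*W"
                "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR"
                "VVVVAAAADDEEGGGG")


def build_code(overrides=None):
    """Return a codon -> amino acid table, optionally with reassignments."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AAS))
    table.update(overrides or {})
    return table


STANDARD = build_code()
# Vertebrate mitochondrial code: four codons differ from the standard code.
VERT_MITO = build_code({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, code=STANDARD, unknown="X"):
    """Translate in frame 1; trailing bases that do not fill a codon are dropped."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(code.get(dna[i:i + 3], unknown)
                   for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGTGATAA"))             # standard code
    print(translate("ATGTGATAA", VERT_MITO))  # vertebrate mitochondrial code
```

Keeping each code as a plain dictionary means another alternate code can be added by passing a different set of overrides to `build_code`.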
Differential evolution-simulated annealing for multiple sequence alignment
NASA Astrophysics Data System (ADS)
Addawe, R. C.; Addawe, J. M.; Sueño, M. R. K.; Magadia, J. C.
2017-10-01
Multiple sequence alignments (MSA) are used in the analysis of molecular evolution and sequence structure relationships. In this paper, a hybrid algorithm, Differential Evolution - Simulated Annealing (DESA) is applied in optimizing multiple sequence alignments (MSAs) based on structural information, non-gaps percentage and totally conserved columns. DESA is a robust algorithm characterized by self-organization, mutation, crossover, and SA-like selection scheme of the strategy parameters. Here, the MSA problem is treated as a multi-objective optimization problem of the hybrid evolutionary algorithm, DESA. Thus, we name the algorithm as DESA-MSA. Simulated sequences and alignments were generated to evaluate the accuracy and efficiency of DESA-MSA using different indel sizes, sequence lengths, deletion rates and insertion rates. The proposed hybrid algorithm obtained acceptable solutions particularly for the MSA problem evaluated based on the three objectives.
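Two of the three objectives named above, non-gaps percentage and totally conserved columns, are simple to compute from an alignment. The following is a minimal sketch only: the function names and the toy alignment are illustrative assumptions, and it does not reproduce the authors' DESA-MSA scoring or its structural-information objective.

```python
def non_gap_percentage(alignment):
    """Percentage of alignment cells that hold a residue rather than a gap ('-')."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return 100.0 * (total - gaps) / total


def totally_conserved_columns(alignment):
    """Count columns in which every sequence carries the same non-gap residue."""
    count = 0
    for column in zip(*alignment):  # iterate over columns of the alignment
        if "-" not in column and len(set(column)) == 1:
            count += 1
    return count


# Hypothetical toy alignment (rows are aligned sequences of equal length).
toy_alignment = [
    "ACG-T",
    "ACGAT",
    "AC-AT",
]

print(non_gap_percentage(toy_alignment))
print(totally_conserved_columns(toy_alignment))
```

A multi-objective optimizer would evaluate such functions on each candidate alignment and trade them off against one another; how DESA-MSA combines them is not specified in the abstract.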
Angiuoli, Samuel V; Matalka, Malcolm; Gussman, Aaron; Galens, Kevin; Vangala, Mahesh; Riley, David R; Arze, Cesar; White, James R; White, Owen; Fricke, W Florian
2011-08-30
Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom built virtual machines to distribute pre-packaged with pre-configured software. We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
A functional U-statistic method for association analysis of sequencing data.
Jadhav, Sneha; Tong, Xiaoran; Lu, Qing
2017-11-01
Although sequencing studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of the sequencing data brings tremendous challenges to data analysis. Moreover, for many complex diseases (e.g., psychiatric disorders) multiple related phenotypes are collected. These phenotypes can be different measurements of an underlying disease, or measurements characterizing multiple related diseases for studying common genetic mechanism. Although jointly analyzing these phenotypes could potentially increase the power of identifying disease-associated genes, the different types of phenotypes pose challenges for association analysis. To address these challenges, we propose a nonparametric method, functional U-statistic method (FU), for multivariate analysis of sequencing data. It first constructs smooth functions from individuals' sequencing data, and then tests the association of these functions with multiple phenotypes by using a U-statistic. The method provides a general framework for analyzing various types of phenotypes (e.g., binary and continuous phenotypes) with unknown distributions. Fitting the genetic variants within a gene using a smoothing function also allows us to capture complexities of gene structure (e.g., linkage disequilibrium, LD), which could potentially increase the power of association analysis. Through simulations, we compared our method to the multivariate outcome score test (MOST), and found that our test attained better performance than MOST. In a real data application, we apply our method to the sequencing data from Minnesota Twin Study (MTS) and found potential associations of several nicotine receptor subunit (CHRN) genes, including CHRNB3, associated with nicotine dependence and/or alcohol dependence. © 2017 WILEY PERIODICALS, INC.
Chiu, Chi-yang; Jung, Jeesun; Chen, Wei; Weeks, Daniel E; Ren, Haobo; Boehnke, Michael; Amos, Christopher I; Liu, Aiyi; Mills, James L; Ting Lee, Mei-ling; Xiong, Momiao; Fan, Ruzong
2017-01-01
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai–Bartlett trace, Hotelling–Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to perform meta-analysis of multiple studies instead of analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen at least for two reasons: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data. PMID:28000696
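For reference, the three multivariate test statistics named above have standard textbook definitions. The notation here is an assumption, not taken from the paper: H denotes the hypothesis sum-of-squares-and-cross-products matrix and E the error (residual) one. The abstract does not give the specific F approximations built on them, so they are not reproduced.

```latex
\begin{align}
  \Lambda_{\text{Wilks}} &= \frac{\det(\mathbf{E})}{\det(\mathbf{H} + \mathbf{E})}, \\
  V_{\text{Pillai--Bartlett}} &= \operatorname{tr}\!\left[\mathbf{H}\,(\mathbf{H} + \mathbf{E})^{-1}\right], \\
  U_{\text{Hotelling--Lawley}} &= \operatorname{tr}\!\left[\mathbf{H}\,\mathbf{E}^{-1}\right].
\end{align}
```

Small values of Wilks's Lambda, and large values of the two trace statistics, indicate evidence against the null hypothesis of no association.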
Chiu, Chi-Yang; Jung, Jeesun; Chen, Wei; Weeks, Daniel E; Ren, Haobo; Boehnke, Michael; Amos, Christopher I; Liu, Aiyi; Mills, James L; Ting Lee, Mei-Ling; Xiong, Momiao; Fan, Ruzong
2017-02-01
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to perform meta-analysis of multiple studies instead of analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen at least for two reasons: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data.
Ajawatanawong, Pravech; Atkinson, Gemma C; Watson-Haigh, Nathan S; Mackenzie, Bryony; Baldauf, Sandra L
2012-07-01
Analyses of multiple sequence alignments generally focus on well-defined conserved sequence blocks, while the rest of the alignment is largely ignored or discarded. This is especially true in phylogenomics, where large multigene datasets are produced through automated pipelines. However, some of the most powerful phylogenetic markers have been found in the variable length regions of multiple alignments, particularly insertions/deletions (indels) in protein sequences. We have developed Sequence Feature and Indel Region Extractor (SeqFIRE) to enable the automated identification and extraction of indels from protein sequence alignments. The program can also extract conserved blocks and identify fast evolving sites using a combination of conservation and entropy. All major variables can be adjusted by the user, allowing them to identify the sets of variables most suited to a particular analysis or dataset. Thus, all major tasks in preparing an alignment for further analysis are combined in a single flexible and user-friendly program. The output includes a numbered list of indels, alignments in NEXUS format with indels annotated or removed and indel-only matrices. SeqFIRE is a user-friendly web application, freely available online at www.seqfire.org/.
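The two per-column quantities mentioned above, indel-containing positions and entropy, can be illustrated as follows. This is a hedged sketch: the function names and the toy protein alignment are hypothetical, and SeqFIRE's actual thresholds and scoring rules are not given in the abstract.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the non-gap residues in one alignment column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def indel_columns(alignment):
    """Zero-based indices of columns that contain at least one gap."""
    return [i for i, col in enumerate(zip(*alignment)) if "-" in col]


# Hypothetical toy protein alignment.
toy = [
    "MKV-L",
    "MKIAL",
    "MR-AL",
]

print(indel_columns(toy))
print([round(column_entropy(col), 3) for col in zip(*toy)])
```

Columns with entropy near zero are candidates for conserved blocks, while high-entropy columns would be flagged as fast evolving under whatever cutoff the user selects.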
Mitsui, Jun; Fukuda, Yoko; Azuma, Kyo; Tozaki, Hirokazu; Ishiura, Hiroyuki; Takahashi, Yuji; Goto, Jun; Tsuji, Shoji
2010-07-01
We have recently found that multiple rare variants of the glucocerebrosidase gene (GBA) confer a robust risk for Parkinson disease, supporting the 'common disease-multiple rare variants' hypothesis. To develop an efficient method of identifying rare variants in a large number of samples, we applied multiplexed resequencing using a next-generation sequencer to identification of rare variants of GBA. Sixteen sets of pooled DNAs from six pooled DNA samples were prepared. Each set of pooled DNAs was subjected to polymerase chain reaction to amplify the target gene (GBA) covering 6.5 kb, pooled into one tube with barcode indexing, and then subjected to extensive sequence analysis using the SOLiD System. Individual samples were also subjected to direct nucleotide sequence analysis. With the optimization of data processing, we were able to extract all the variants from 96 samples with acceptable rates of false-positive single-nucleotide variants.
Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P
2016-05-03
DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. 
It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
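The filtering step described above (minimum length, copy number, and reproducibility across PCR replicates) can be sketched as below. The function name, parameter names, and toy reads are assumptions for illustration; this is not DAMe's implementation or command-line interface.

```python
from collections import Counter


def filter_sequences(replicates, min_length, min_copies, min_replicates):
    """Keep sequences that are long enough and that reach `min_copies`
    in at least `min_replicates` of the PCR replicates.

    replicates: list of lists of read strings, one inner list per PCR replicate.
    Returns {sequence: summed copy number over the supporting replicates}.
    """
    counts_per_replicate = [Counter(reads) for reads in replicates]
    all_sequences = set().union(*counts_per_replicate)
    kept = {}
    for seq in all_sequences:
        if len(seq) < min_length:
            continue
        supporting = [c[seq] for c in counts_per_replicate if c[seq] >= min_copies]
        if len(supporting) >= min_replicates:
            kept[seq] = sum(supporting)
    return kept


# Hypothetical reads from three PCR replicates of one sample.
toy_replicates = [
    ["ACGT", "ACGT", "TTTT"],
    ["ACGT", "ACGT", "GG"],
    ["ACGT", "TTTT", "TTTT"],
]

print(filter_sequences(toy_replicates, min_length=3, min_copies=2, min_replicates=2))
```

Requiring a sequence to recur across replicates is what removes sporadic PCR or sequencing errors, which tend to appear in only one replicate.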
2011-01-01
Background Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom built virtual machines to distribute pre-packaged with pre-configured software. Results We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. Conclusion The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing. PMID:21878105
Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits.
Zhang, Futao; Xie, Dan; Liang, Meimei; Xiong, Momiao
2016-04-01
To date, most genetic analyses of phenotypes have focused on analyzing single traits or analyzing each phenotype independently. However, joint epistasis analysis of multiple complementary traits will increase statistical power and improve our understanding of the complicated genetic structure of the complex diseases. Despite their importance in uncovering the genetic structure of complex traits, the statistical methods for identifying epistasis in multiple phenotypes remains fundamentally unexplored. To fill this gap, we formulate a test for interaction between two genes in multiple quantitative trait analysis as a multiple functional regression (MFRG) in which the genotype functions (genetic variant profiles) are defined as a function of the genomic position of the genetic variants. We use large-scale simulations to calculate Type I error rates for testing interaction between two genes with multiple phenotypes and to compare the power with multivariate pairwise interaction analysis and single trait interaction analysis by a single variate functional regression model. To further evaluate performance, the MFRG for epistasis analysis is applied to five phenotypes of exome sequence data from the NHLBI's Exome Sequencing Project (ESP) to detect pleiotropic epistasis. A total of 267 pairs of genes that formed a genetic interaction network showed significant evidence of epistasis influencing five traits. The results demonstrate that the joint interaction analysis of multiple phenotypes has a much higher power to detect interaction than the interaction analysis of a single trait and may open a new direction to fully uncovering the genetic structure of multiple phenotypes.
Jayawardene, Wasantha Parakrama; YoussefAgha, Ahmed Hassan
2014-01-01
This study aimed to identify the sequential patterns of drug use initiation, which included prescription drugs misuse (PDM), among 12th-grade students in Indiana. The study also tested the suitability of the data mining method Market Basket Analysis (MBA) to detect common drug use initiation sequences in large-scale surveys. Data from 2007 to 2009 Annual Surveys of Alcohol, Tobacco, and Other Drug Use by Indiana Children and Adolescents were used for this study. A close-ended, self-administered questionnaire was used to ask adolescents about the use of 21 substance categories and the age of first use. "Support%" and "confidence%" statistics of Market Basket Analysis detected multiple and substitute addictions, respectively. The lifetime prevalence of using any addictive substance was 73.3%, and it has been decreasing during past few years. Although the lifetime prevalence of PDM was 19.2%, it has been increasing. Males and whites were more likely to use drugs and engage in multiple addictions. Market Basket Analysis identified common drug use initiation sequences that involved 11 drugs. High levels of support existed for associations among alcohol, cigarettes, and marijuana, whereas associations that included prescription drugs had medium levels of support. Market Basket Analysis is useful for the detection of common substance use initiation sequences in large-scale surveys. Before initiation of prescription drugs, physicians should consider the adolescents' risk of addiction. Prevention programs should address multiple addictions, substitute addictions, common sequences in drug use initiation, sex and racial differences in PDM, and normative beliefs of parents and adolescents in relation to PDM.
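The "support%" and "confidence%" statistics referred to above are the standard Market Basket Analysis measures. A minimal sketch follows, with invented toy respondents rather than survey data; it ignores the order of initiation, which the study derived from age of first use.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent):
    support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical respondents; each set lists the substances a respondent reported.
respondents = [
    {"alcohol", "cigarettes"},
    {"alcohol", "cigarettes", "marijuana"},
    {"alcohol"},
    {"cigarettes", "marijuana"},
]

print(support(respondents, {"alcohol", "cigarettes"}))
print(confidence(respondents, {"alcohol"}, {"cigarettes"}))
```

Here support measures how often substances co-occur across respondents, and confidence measures how often the consequent is reported among those who report the antecedent.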
MetaSeq: privacy preserving meta-analysis of sequencing-based association studies.
Singh, Angad Pal; Zafer, Samreen; Pe'er, Itsik
2013-01-01
Human genetics recently transitioned from GWAS to studies based on NGS data. For GWAS, small effects dictated large sample sizes, typically made possible through meta-analysis by exchanging summary statistics across consortia. NGS studies groupwise-test for association of multiple potentially-causal alleles along each gene. They are subject to similar power constraints and therefore likely to resort to meta-analysis as well. The problem arises when considering privacy of the genetic information during the data-exchange process. Many scoring schemes for NGS association rely on the frequency of each variant thus requiring the exchange of identity of the sequenced variant. As such variants are often rare, potentially revealing the identity of their carriers and jeopardizing privacy. We have thus developed MetaSeq, a protocol for meta-analysis of genome-wide sequencing data by multiple collaborating parties, scoring association for rare variants pooled per gene across all parties. We tackle the challenge of tallying frequency counts of rare, sequenced alleles, for metaanalysis of sequencing data without disclosing the allele identity and counts, thereby protecting sample identity. This apparent paradoxical exchange of information is achieved through cryptographic means. The key idea is that parties encrypt identity of genes and variants. When they transfer information about frequency counts in cases and controls, the exchanged data does not convey the identity of a mutation and therefore does not expose carrier identity. The exchange relies on a 3rd party, trusted to follow the protocol although not trusted to learn about the raw data. We show applicability of this method to publicly available exome-sequencing data from multiple studies, simulating phenotypic information for powerful meta-analysis. The MetaSeq software is publicly available as open source.
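The key idea stated above, that parties encrypt variant identities before a third party tallies counts, can be illustrated with a keyed hash. This is a simplified sketch under stated assumptions and not the MetaSeq protocol: it uses HMAC-SHA256 with a key shared by the collaborating parties but withheld from the tallying party, the variant labels are hypothetical, and unlike MetaSeq it leaves the per-party counts visible to the tallying party.

```python
import hashlib
import hmac
from collections import Counter


def blind(variant_id, shared_key):
    """Replace a variant identifier with a keyed hash the tallying party cannot invert."""
    return hmac.new(shared_key, variant_id.encode("utf-8"), hashlib.sha256).hexdigest()


def blinded_counts(variant_counts, shared_key):
    """Blind every identifier in one party's {variant: carrier count} table."""
    return {blind(v, shared_key): n for v, n in variant_counts.items()}


def tally(submissions):
    """Third-party step: sum counts per blinded identifier across parties."""
    total = Counter()
    for submission in submissions:
        total.update(submission)
    return total


key = b"secret shared by the studies only"          # hypothetical key
party_a = blinded_counts({"geneX:var1": 2}, key)
party_b = blinded_counts({"geneX:var1": 1, "geneX:var2": 3}, key)
totals = tally([party_a, party_b])

print(totals[blind("geneX:var1", key)])
```

Because both parties derive the same blinded identifier for the same variant, counts can be pooled without the tallying party ever seeing which variant they refer to.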
Binladen, Jonas; Gilbert, M Thomas P; Bollback, Jonathan P; Panitz, Frank; Bendixen, Christian; Nielsen, Rasmus; Willerslev, Eske
2007-02-14
The invention of the Genome Sequence 20 DNA Sequencing System (454 parallel sequencing platform) has enabled the rapid and high-volume production of sequence data. Until now, however, individual emulsion PCR (emPCR) reactions and subsequent sequencing runs have been unable to combine template DNA from multiple individuals, as homologous sequences cannot be subsequently assigned to their original sources. We use conventional PCR with 5'-nucleotide tagged primers to generate homologous DNA amplification products from multiple specimens, followed by sequencing through the high-throughput Genome Sequence 20 DNA Sequencing System (GS20, Roche/454 Life Sciences). Each DNA sequence is subsequently traced back to its individual source through 5' tag analysis. We demonstrate that this new approach enables the assignment of virtually all the generated DNA sequences to the correct source once sequencing anomalies are accounted for (misassignment rate <0.4%). Therefore, the method enables accurate sequencing and assignment of homologous DNA sequences from multiple sources in a single high-throughput GS20 run. We observe a bias in the distribution of the differently tagged primers that is dependent on the 5' nucleotide of the tag. In particular, primers 5' labelled with a cytosine are heavily overrepresented among the final sequences, while those 5' labelled with a thymine are strongly underrepresented. A weaker bias also exists with regard to the distribution of the sequences as sorted by the second nucleotide of the dinucleotide tags. As the results are based on a single GS20 run, the general applicability of the approach requires confirmation. However, our experiments demonstrate that 5' primer tagging is a useful method in which the sequencing power of the GS20 can be applied to PCR-based assays of multiple homologous PCR products.
The new approach will be of value to a broad range of research areas, such as those of comparative genomics, complete mitochondrial analyses, population genetics, and phylogenetics.
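Tracing each read back to its source by its 5' tag amounts to a prefix lookup. The sketch below assumes dinucleotide tags, as mentioned in the abstract; the tag-to-specimen table, the reads, and the function name are hypothetical, and no handling of sequencing errors within the tag is attempted.

```python
def assign_by_tag(reads, tag_to_sample, tag_length=2):
    """Split reads by their 5' tag.

    Returns (assigned, unassigned): `assigned` maps each sample to its reads
    with the tag removed; `unassigned` holds reads whose tag is not in the table.
    """
    assigned = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            unassigned.append(read)
        else:
            assigned[sample].append(insert)
    return assigned, unassigned


# Hypothetical tag table and reads.
tags = {"AC": "specimen_1", "TG": "specimen_2"}
reads = ["ACGGGTTT", "TGCCCAAA", "GGAAAA"]

assigned, unassigned = assign_by_tag(reads, tags)
print(assigned)
print(unassigned)
```

Counting reads per tag in the `assigned` table is also how a tag-dependent representation bias of the kind reported above would become visible.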
MACSIMS : multiple alignment of complete sequences information management system
Thompson, Julie D; Muller, Arnaud; Waterhouse, Andrew; Procter, Jim; Barton, Geoffrey J; Plewniak, Frédéric; Poch, Olivier
2006-01-01
Background In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. Results MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. Conclusion MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at . PMID:16792820
Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes
USDA-ARS's Scientific Manuscript database
In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...
Reconstructing evolutionary trees in parallel for massive sequences.
Zou, Quan; Wan, Shixiang; Zeng, Xiangxiang; Ma, Zhanshan Sam
2017-12-14
Building evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing an evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time- and space-consuming. Hadoop and Spark, which were developed recently, offer new hope for classical computational biology problems. In this paper, we tried to solve multiple sequence alignment and evolutionary reconstruction in parallel. HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users can use HPTree for reconstructing evolutionary trees on computer clusters or cloud platforms (e.g., Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model was employed for evolutionary tree building. We have released our software together with its source code at http://lab.malab.cn/soft/HPtree/ .
AlignMe—a membrane protein sequence alignment web server
Stamm, Marcus; Staritzbichler, René; Khafizov, Kamil; Forrest, Lucy R.
2014-01-01
We present a web server for pair-wise alignment of membrane protein sequences, using the program AlignMe. The server makes available two operational modes of AlignMe: (i) sequence to sequence alignment, taking two sequences in fasta format as input, combining information about each sequence from multiple sources and producing a pair-wise alignment (PW mode); and (ii) alignment of two multiple sequence alignments to create family-averaged hydropathy profile alignments (HP mode). For the PW sequence alignment mode, four different optimized parameter sets are provided, each suited to pairs of sequences with a specific similarity level. These settings utilize different types of inputs: (position-specific) substitution matrices, secondary structure predictions and transmembrane propensities from transmembrane predictions or hydrophobicity scales. In the second (HP) mode, each input multiple sequence alignment is converted into a hydrophobicity profile averaged over the provided set of sequence homologs; the two profiles are then aligned. The HP mode enables qualitative comparison of transmembrane topologies (and therefore potentially of 3D folds) of two membrane proteins, which can be useful if the proteins have low sequence similarity. In summary, the AlignMe web server provides user-friendly access to a set of tools for analysis and comparison of membrane protein sequences. Access is available at http://www.bioinfo.mpg.de/AlignMe PMID:24753425
Palzkill, T G; Oliver, S G; Newlon, C S
1986-01-01
Four fragments of Saccharomyces cerevisiae chromosome III DNA which carry ARS elements have been sequenced. Each fragment contains multiple copies of sequences that have at least 10 out of 11 bases of homology to a previously reported 11 bp core consensus sequence. A survey of these new ARS sequences and previously reported sequences revealed the presence of an additional 11 bp conserved element located on the 3' side of the T-rich strand of the core consensus. Subcloning analysis as well as deletion and transposon insertion mutagenesis of ARS fragments support a role for 3' conserved sequence in promoting ARS activity. PMID:3529036
Regularized rare variant enrichment analysis for case-control exome sequencing data.
Larson, Nicholas B; Schaid, Daniel J
2014-02-01
Rare variants have recently garnered an immense amount of attention in genetic association analysis. However, unlike methods traditionally used for single marker analysis in GWAS, rare variant analysis often requires some method of aggregation, since single marker approaches are poorly powered for typical sequencing study sample sizes. Advancements in sequencing technologies have rendered next-generation sequencing platforms a realistic alternative to traditional genotyping arrays. Exome sequencing in particular not only provides base-level resolution of genetic coding regions, but also a natural paradigm for aggregation via genes and exons. Here, we propose the use of penalized regression in combination with variant aggregation measures to identify rare variant enrichment in exome sequencing data. In contrast to marginal gene-level testing, we simultaneously evaluate the effects of rare variants in multiple genes, focusing on gene-based least absolute shrinkage and selection operator (LASSO) and exon-based sparse group LASSO models. By using gene membership as a grouping variable, the sparse group LASSO can be used as a gene-centric analysis of rare variants while also providing a penalized approach toward identifying specific regions of interest. We apply extensive simulations to evaluate the performance of these approaches with respect to specificity and sensitivity, comparing these results to multiple competing marginal testing methods. Finally, we discuss our findings and outline future research. © 2013 WILEY PERIODICALS, INC.
Oshiki, Mamoru; Segawa, Takahiro; Ishii, Satoshi
2018-02-02
Various microorganisms play key roles in the Nitrogen (N) cycle. Quantitative PCR (qPCR) and PCR-amplicon sequencing of the N cycle functional genes allow us to analyze the abundance and diversity of microbes responsible in the N transforming reactions in various environmental samples. However, analysis of multiple target genes can be cumbersome and expensive. PCR-independent analysis, such as metagenomics and metatranscriptomics, is useful but expensive especially when we analyze multiple samples and try to detect N cycle functional genes present at relatively low abundance. Here, we present the application of microfluidic qPCR chip technology to simultaneously quantify and prepare amplicon sequence libraries for multiple N cycle functional genes as well as taxon-specific 16S rRNA gene markers for many samples. This approach, named as N cycle evaluation (NiCE) chip, was evaluated by using DNA from pure and artificially mixed bacterial cultures and by comparing the results with those obtained by conventional qPCR and amplicon sequencing methods. Quantitative results obtained by the NiCE chip were comparable to those obtained by conventional qPCR. In addition, the NiCE chip was successfully applied to examine abundance and diversity of N cycle functional genes in wastewater samples. Although non-specific amplification was detected on the NiCE chip, this could be overcome by optimizing the primer sequences in the future. As the NiCE chip can provide high-throughput format to quantify and prepare sequence libraries for multiple N cycle functional genes, this tool should advance our ability to explore N cycling in various samples. Importance. We report a novel approach, namely Nitrogen Cycle Evaluation (NiCE) chip by using microfluidic qPCR chip technology. By sequencing the amplicons recovered from the NiCE chip, we can assess diversities of the N cycle functional genes. 
The NiCE chip technology is applicable to analyzing the temporal dynamics of N cycle gene transcription in wastewater treatment bioreactors. The NiCE chip can provide a high-throughput format to quantify and prepare sequence libraries for multiple N cycle functional genes. While there is room for future improvement, this tool should significantly advance our ability to explore the N cycle in various environmental samples. Copyright © 2018 American Society for Microbiology.
USDA-ARS's Scientific Manuscript database
Fov isolates belonging to all known races, biotypes, and most of known genotypes were characterized by phylogenetic and VCG analysis. VCGs with multiple members were sequenced for at least two members, and the resulting sequences were always identical except for VCG01111 members. Vegetative compatib...
QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation.
Wagle, Prerana; Nikolić, Miloš; Frommolt, Peter
2015-07-01
Next-Generation Sequencing (NGS) has emerged as a widely used tool in molecular biology. While time and cost for the sequencing itself are decreasing, the analysis of the massive amounts of data remains challenging. Since multiple algorithmic approaches for the basic data analysis have been developed, there is now an increasing need to efficiently use these tools to obtain results in reasonable time. We have developed QuickNGS, a new workflow system for laboratories with the need to analyze data from multiple NGS projects at a time. QuickNGS takes advantage of parallel computing resources, a comprehensive back-end database, and a careful selection of previously published algorithmic approaches to build fully automated data analysis workflows. We demonstrate the efficiency of our new software by a comprehensive analysis of 10 RNA-Seq samples which we can finish in only a few minutes of hands-on time. The approach we have taken is suitable to process even much larger numbers of samples and multiple projects at a time. Our approach considerably reduces the barriers that still limit the usability of the powerful NGS technology and finally decreases the time to be spent before proceeding to further downstream analysis and interpretation of the data.
Thomas, W. Kelley; Vida, J. T.; Frisse, Linda M.; Mundo, Manuel; Baldwin, James G.
1997-01-01
To effectively integrate DNA sequence analysis and classical nematode taxonomy, we must be able to obtain DNA sequences from formalin-fixed specimens. Microdissected sections of nematodes were removed from specimens fixed in formalin, using standard protocols and without destroying morphological features. The fixed sections provided sufficient template for multiple polymerase chain reaction-based DNA sequence analyses. PMID:19274156
NASA Astrophysics Data System (ADS)
Furrer, Julien; Kramer, Frank; Marino, John P.; Glaser, Steffen J.; Luy, Burkhard
2004-01-01
Homonuclear Hartmann-Hahn transfer is one of the most important building blocks in modern high-resolution NMR. It constitutes a very efficient transfer element for the assignment of proteins, nucleic acids, and oligosaccharides. Nevertheless, in macromolecules exceeding ~10 kDa TOCSY-experiments can show decreasing sensitivity due to fast transverse relaxation processes that are active during the mixing periods. In this article we propose the MOCCA-XY16 multiple pulse sequence, originally developed for efficient TOCSY transfer through residual dipolar couplings, as a homonuclear Hartmann-Hahn sequence with improved relaxation properties. A theoretical analysis of the coherence transfer via scalar couplings and its relaxation behavior as well as experimental transfer curves for MOCCA-XY16 relative to the well-characterized DIPSI-2 multiple pulse sequence are given.
Applications of Single-Cell Sequencing for Multiomics.
Xu, Yungang; Zhou, Xiaobo
2018-01-01
Single-cell sequencing interrogates the sequence or chromatin information from individual cells with advanced next-generation sequencing technologies. It provides a higher resolution of cellular differences and a better understanding of the underlying genetic and epigenetic mechanisms of an individual cell in the context of its survival and adaptation to microenvironment. However, it is more challenging to perform single-cell sequencing and downstream data analysis, owing to the minimal amount of starting materials, sample loss, and contamination. In addition, due to the picogram level of the amount of nucleic acids used, heavy amplification is often needed during sample preparation of single-cell sequencing, resulting in the uneven coverage, noise, and inaccurate quantification of sequencing data. All these unique properties raise challenges in and thus high demands for computational methods that specifically fit single-cell sequencing data. We here comprehensively survey the current strategies and challenges for multiple single-cell sequencing, including single-cell transcriptome, genome, and epigenome, beginning with a brief introduction to multiple sequencing techniques for single cells.
Pseudomonas specific 16S rDNA PCR amplification and multiple enzyme restriction fragment length polymorphism (MERFLP) analysis using a single digestion mixture of Alu I, Hinf I, Rsa I, and Tru 9I distinguished 150 published sequences and reference strains of authentic Pseudomonas...
Generalized causal mediation and path analysis: Extensions and practical considerations.
Albert, Jeffrey M; Cho, Jang Ik; Liu, Yiying; Nelson, Suchitra
2018-01-01
Causal mediation analysis seeks to decompose the effect of a treatment or exposure among multiple possible paths and provide causally interpretable path-specific effect estimates. Recent advances have extended causal mediation analysis to situations with a sequence of mediators or multiple contemporaneous mediators. However, available methods still have limitations, and computational and other challenges remain. The present paper provides an extended causal mediation and path analysis methodology. The new method, implemented in the new R package, gmediation (described in a companion paper), accommodates both a sequence (two stages) of mediators and multiple mediators at each stage, and allows for multiple types of outcomes following generalized linear models. The methodology can also handle unsaturated models and clustered data. Addressing other practical issues, we provide new guidelines for the choice of a decomposition, and for the choice of a reference group multiplier for the reduction of Monte Carlo error in mediation formula computations. The new method is applied to data from a cohort study to illuminate the contribution of alternative biological and behavioral paths in the effect of socioeconomic status on dental caries in adolescence.
Shih, Arthur Chun-Chieh; Lee, DT; Peng, Chin-Lin; Wu, Yu-Wei
2007-01-01
Background: When several hundred or several thousand sequences are aligned, such as epidemic virus sequences or homologous/orthologous sequences of large gene families, to reconstruct their epidemiological history or phylogenies, analyzing and visualizing the alignment of so many sequences becomes a new challenge for computational biologists. Although there are several tools available for visualization of very long sequence alignments, few of them are applicable to the alignments of many sequences. Results: A multiple-logo alignment visualization tool, called Phylo-mLogo, is presented in this paper. Phylo-mLogo calculates the variabilities and homogeneities of alignment sequences by base frequencies or entropies. Unlike traditional representations of sequence logos, Phylo-mLogo not only displays the global logo patterns of the whole alignment of multiple sequences, but also shows the local homologous logos for each clade hierarchically. In addition, Phylo-mLogo allows the user to focus only on the analysis of important, structurally or functionally constrained sites in the alignment, selected by the user or by built-in automatic calculation. Conclusion: With Phylo-mLogo, the user can symbolically and hierarchically visualize hundreds of aligned sequences simultaneously and easily check the changes of their amino acid sites when analyzing many homologous/orthologous or influenza virus sequences. More information on Phylo-mLogo can be found at URL . PMID:17319966
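The per-site variability calculation mentioned in this abstract (base frequencies or entropies) can be sketched as follows. This is an illustrative example only, assuming Shannon entropy in bits over the characters of each alignment column; the function name and the toy alignment are hypothetical and are not taken from Phylo-mLogo.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of every column of an alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts.values())
        # sum of p * log2(1/p); a fully conserved column gives 0.0
        entropies.append(sum((n / total) * math.log2(total / n) for n in counts.values()))
    return entropies


toy_alignment = ["ACGT", "ACGA", "ACTA", "AGTA"]
entropies = column_entropies(toy_alignment)
print([round(h, 3) for h in entropies])
```

Columns with low entropy are conserved and columns with high entropy are variable. Computing the same quantity separately for the sequences of each clade would give the kind of local, per-clade summary the abstract describes.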
Sequence analysis by iterated maps, a review.
Almeida, Jonas S
2014-05-01
Among alignment-free methods, Iterated Maps (IMs) are at a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.
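The Chaos Game Representation named in this abstract is the simplest iterated map for a DNA sequence: each nucleotide moves the current point halfway toward the corner of the unit square assigned to that nucleotide. A minimal sketch follows; the corner assignment and the starting point at the center of the square are common conventions assumed here, and the function name is hypothetical.

```python
# Assumed corner convention for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Map a DNA sequence to its Chaos Game Representation coordinates."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]  # raises KeyError on symbols outside ACGT
        # Move halfway from the current point to the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


points = cgr_points("ACG")
print(points)
```

Because every step halves the distance to a corner, the quadrant of a point identifies the last base, the sub-quadrant identifies the base before it, and so on. This is the scale-free property the abstract refers to: one coordinate pair encodes suffixes of every length.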
Customisation of the exome data analysis pipeline using a combinatorial approach.
Pattnaik, Swetansu; Vaidyanathan, Srividya; Pooja, Durgad G; Deepak, Sa; Panda, Binay
2012-01-01
The advent of next generation sequencing (NGS) technologies has revolutionised the way biologists produce, analyse and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow up validation due to the high error rates associated with various sequencing chemistries. Recently, whole exome sequencing has been proposed as an affordable option compared to whole genome runs but it still requires follow up validation of all the novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology, alignment and post alignment variant detection algorithms. However, the aforementioned approach warrants the use of multiple sequencing chemistries, multiple alignment tools, and multiple variant callers, which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and face difficulty in choosing from the list of freely available analytical tools for NGS data analysis. Hence, there is a need to customise the NGS data analysis pipeline to preferentially retain true variants by minimising the incidence of false positives and make the choice of the right analytical tools easier. To this end, we have sampled different freely available tools used at the alignment and post alignment stages, suggesting the use of the most suitable combination determined by a simple framework of pre-existing metrics to create significant datasets.
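The customary consensus approach that this abstract contrasts itself with can be sketched in a few lines: a variant is retained only if enough independent callers report it. The example below is illustrative; the caller names, the variant tuples, and the `min_support` parameter are hypothetical and do not describe the pipeline evaluated by the authors.

```python
from collections import Counter


def consensus_variants(call_sets, min_support=2):
    """Keep variants reported by at least min_support of the given callers."""
    support = Counter()
    for calls in call_sets.values():
        support.update(set(calls))  # each caller counts once per variant
    return {variant for variant, n in support.items() if n >= min_support}


# Variants as (chromosome, position, reference allele, alternate allele).
call_sets = {
    "caller_1": [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
    "caller_2": [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")],
    "caller_3": [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")],
}
majority = consensus_variants(call_sets, min_support=2)
unanimous = consensus_variants(call_sets, min_support=3)
print(sorted(majority))
print(sorted(unanimous))
```

Raising `min_support` trades sensitivity for specificity, and every additional caller adds compute time and configuration effort, which is the cost the abstract argues individual investigators may not be able to bear.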
Kashuk, Carl S.; Stone, Eric A.; Grice, Elizabeth A.; Portnoy, Matthew E.; Green, Eric D.; Sidow, Arend; Chakravarti, Aravinda; McCallion, Andrew S.
2005-01-01
The ability to discriminate between deleterious and neutral amino acid substitutions in the genes of patients remains a significant challenge in human genetics. The increasing availability of genomic sequence data from multiple vertebrate species allows sequence conservation and physicochemical properties of residues to be used for functional prediction. In this study, the RET receptor tyrosine kinase serves as a model disease gene in which a broad spectrum (≥116) of disease-associated mutations has been identified among patients with Hirschsprung disease and multiple endocrine neoplasia type 2. We report the alignment of the human RET protein sequence with the orthologous sequences of 12 non-human vertebrates (eight mammalian, one avian, and three teleost species), their comparative analysis, the evolutionary topology of the RET protein, and predicted tolerance for all published missense mutations. We show that, although evolutionary conservation alone provides significant information to predict the effect of a RET mutation, a model that combines comparative sequence data with analysis of physicochemical properties in a quantitative framework provides far greater accuracy. Although the ability to discern the impact of a mutation is imperfect, our analyses permit substantial discrimination between predicted functional classes of RET mutations and disease severity even for a multigenic disease such as Hirschsprung disease. PMID:15956201
Solving the problem of comparing whole bacterial genomes across different sequencing platforms.
Kaas, Rolf S; Leekitcharoenphon, Pimlapas; Aarestrup, Frank M; Lund, Ole
2014-01-01
Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology because each technology has a systematic bias, making integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies in WGS data across multiple platforms. The methods were evaluated on three bacterial data sets sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. We conclude that the success of these new procedures is due to validation of all informative sites that are included in the analysis. The procedures are available as web tools.
Gardner, Shea N; Wagner, Mark C
2005-01-01
Background: Microbial forensics is important in tracking the source of a pathogen, whether the disease is a naturally occurring outbreak or part of a criminal investigation. Results: A method and SPR Opt (SNP and PCR-RFLP Optimization) software to perform a comprehensive, whole-genome analysis to forensically discriminate multiple sequences is presented. Tools for the optimization of forensic typing using Single Nucleotide Polymorphism (SNP) and PCR-Restriction Fragment Length Polymorphism (PCR-RFLP) analyses across multiple isolate sequences of a species are described. The PCR-RFLP analysis includes prediction and selection of optimal primers and restriction enzymes to enable maximum isolate discrimination based on sequence information. SPR Opt calculates all SNP or PCR-RFLP variations present in the sequences, groups them into haplotypes according to their co-segregation across those sequences, and performs combinatoric analyses to determine which sets of haplotypes provide maximal discrimination among all the input sequences. Those set combinations requiring that membership in the fewest haplotypes be queried (i.e. the fewest assays be performed) are found. These analyses highlight variable regions based on existing sequence data. These markers may be heterogeneous among unsequenced isolates as well, and thus may be useful for characterizing the relationships among unsequenced as well as sequenced isolates. The predictions are multi-locus. Analyses of mumps and SARS viruses are summarized. Phylogenetic trees created based on SNPs, PCR-RFLPs, and full genomes are compared for SARS virus, illustrating that purported phylogenies based only on SNP or PCR-RFLP variations do not match those based on multiple sequence alignment of the full genomes. Conclusion: This is the first software to optimize the selection of forensic markers to maximize information gained from the fewest assays, accepting whole or partial genome sequence data as input. 
As more sequence data become available for multiple strains and isolates of a species, automated, computational approaches such as those described here will be essential to make sense of large amounts of information, and to guide and optimize efforts in the laboratory. The software and source code for SPR Opt are publicly available and free for non-profit use at . PMID:15904493
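The idea of finding the fewest markers that still tell all input sequences apart can be illustrated with a small sketch. Note that this is a greedy heuristic swapped in for brevity, not the combinatoric analysis SPR Opt performs, and it works on individual variable sites rather than on haplotype groups; the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def variable_sites(seqs):
    """Indices of alignment columns where at least two sequences differ."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def greedy_marker_set(seqs):
    """Greedily pick sites until every pair of sequences is distinguished.

    Returns the chosen site indices and any pairs that no site can separate.
    """
    unresolved = set(combinations(range(len(seqs)), 2))
    candidates = variable_sites(seqs)
    chosen = []
    while unresolved:
        best, best_split = None, set()
        for site in candidates:
            split = {(a, b) for a, b in unresolved if seqs[a][site] != seqs[b][site]}
            if len(split) > len(best_split):
                best, best_split = site, split
        if best is None:
            break  # the remaining pairs are identical at every site
        chosen.append(best)
        unresolved -= best_split
    return chosen, unresolved


toy_isolates = ["AAAA", "AACA", "GACA", "GAAA"]
markers, indistinguishable = greedy_marker_set(toy_isolates)
print(markers, indistinguishable)
```

A greedy choice is not guaranteed to find the smallest possible marker set, which is one reason an exhaustive combinatoric search, as described in the abstract, can be worth its extra cost when each assay is expensive.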
Roca, Alberto I
2014-01-01
The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position in the aligned protein family. The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for comparison with the Sequence Logo reference images. The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid software is free and available from http://www.ProfileGrid.org.
Genome, Epigenome and RNA sequences of Monozygotic Twins Discordant for Multiple Sclerosis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, Neil
2010-06-02
Neil Miller, Deputy Director of Software Engineering at the National Center for Genome Resources, discusses a monozygotic twin study on June 2, 2010 at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM.
NASA Technical Reports Server (NTRS)
Wang, C.-W.; Stark, W.
2005-01-01
This article considers a quaternary direct-sequence code-division multiple-access (DS-CDMA) communication system with asymmetric quadrature phase-shift-keying (AQPSK) modulation for unequal error protection (UEP) capability. Both time synchronous and asynchronous cases are investigated. An expression for the probability distribution of the multiple-access interference is derived. The exact bit-error performance and the approximate performance using a Gaussian approximation and random signature sequences are evaluated by extending the techniques used for uniform quadrature phase-shift-keying (QPSK) and binary phase-shift-keying (BPSK) DS-CDMA systems. Finally, a general system model with unequal user power and the near-far problem is considered and analyzed. The results show that, for a system with UEP capability, the less protected data bits are more sensitive to the near-far effect that occurs in a multiple-access environment than are the more protected bits.
Genome-wide comparative analysis of four Indian Drosophila species.
Mohanty, Sujata; Khanna, Radhika
2017-12-01
Comparative analysis of multiple genomes of closely or distantly related Drosophila species undoubtedly creates excitement among evolutionary biologists in exploring genomic changes from an ecological and evolutionary perspective. We present herewith the de novo assembled whole genome sequences of four Drosophila species of Indian origin, D. bipectinata, D. takahashii, D. biarmipes and D. nasuta, obtained using Next Generation Sequencing technology on an Illumina platform, along with their detailed assembly statistics. Comparative genomics analyses, e.g. gene predictions and annotations, functional and orthogroup analysis of coding sequences, and genome wide SNP distribution, were performed. The whole genome of Zaprionus indianus of Indian origin published earlier by us and the genome sequences of 12 previously sequenced Drosophila species available in the NCBI database were included in the analysis. The present work is a part of our ongoing genomics project of Indian Drosophila species.
Analysis on the use of Multi-Sequence MRI Series for Segmentation of Abdominal Organs
NASA Astrophysics Data System (ADS)
Selver, M. A.; Selvi, E.; Kavur, E.; Dicle, O.
2015-01-01
Segmentation of abdominal organs from MRI data sets is a challenging task due to various limitations and artefacts. During routine clinical practice, radiologists use multiple MR sequences in order to analyze different anatomical properties. These sequences have different characteristics in terms of acquisition parameters (such as contrast mechanisms and pulse sequence designs) and image properties (such as pixel spacing, slice thicknesses and dynamic range). For a complete understanding of the data, computational techniques should combine the information coming from these various MRI sequences. These sequences are not acquired in parallel but in a sequential manner (one after another). Therefore, patient movements and respiratory motions change the position and shape of the abdominal organs. In this study, the extent of these effects is measured using three different symmetric surface distance metrics applied to three-dimensional data acquired from various MRI sequences. The results are compared to intra- and inter-observer differences, and discussions on using multiple MRI sequences for segmentation and the necessity of registration are presented.
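Symmetric surface distance metrics of the kind mentioned in this abstract compare two organ surfaces by measuring, for every point on one surface, the distance to the closest point on the other, in both directions. The sketch below assumes the three commonly used variants (average, root-mean-square, and maximum); the abstract does not state which three metrics were used, and the function names and toy point sets are hypothetical.

```python
import math


def nearest_distances(a, b):
    """For each 3D point in a, the Euclidean distance to its closest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def symmetric_surface_distances(a, b):
    """Return (average, root-mean-square, maximum) symmetric surface distance."""
    d = nearest_distances(a, b) + nearest_distances(b, a)
    average = sum(d) / len(d)
    rms = math.sqrt(sum(v * v for v in d) / len(d))
    maximum = max(d)  # equals the Hausdorff distance between the point sets
    return average, rms, maximum


# Toy surface samples in millimetres (hypothetical), e.g. the same organ
# outlined on two sequentially acquired MRI series.
surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)]
assd, rmssd, mssd = symmetric_surface_distances(surface_a, surface_b)
print(round(assd, 3), round(rmssd, 3), round(mssd, 3))
```

The brute-force nearest-point search is quadratic in the number of surface points, so real volumes would normally use a spatial index or a distance transform instead. Point coordinates must be expressed in physical units, which matters here because the abstract notes that pixel spacing and slice thickness differ between sequences.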
Whole Genome Sequencing for Genomics-Guided Investigations of Escherichia coli O157:H7 Outbreaks.
Rusconi, Brigida; Sanjar, Fatemeh; Koenig, Sara S K; Mammel, Mark K; Tarr, Phillip I; Eppinger, Mark
2016-01-01
Multi-isolate whole genome sequencing (WGS) and typing for outbreak investigations has become a reality in the post-genomics era. We applied this technology to strains from Escherichia coli O157:H7 outbreaks. These include isolates from seven North American outbreaks, as well as multiple isolates from the same patient and from different infected individuals in the same household. Customized high-resolution bioinformatics sequence typing strategies were developed to assess the core genome and mobilome plasticity. Sequence typing was performed using an in-house single nucleotide polymorphism (SNP) discovery and validation pipeline. Discriminatory power becomes of particular importance for the investigation of isolates from outbreaks in which macrogenomic techniques such as pulsed-field gel electrophoresis or multiple locus variable number tandem repeat analysis do not differentiate closely related organisms. We also characterized differences in the phage inventory, allowing us to identify plasticity among outbreak strains that is not detectable at the core genome level. Our comprehensive analysis of the mobilome identified multiple plasmids that have not previously been associated with this lineage. Applied phylogenomics approaches provide strong molecular evidence for exceptionally little heterogeneity of strains within outbreaks and demonstrate the value of intra-cluster comparisons, rather than basing the analysis on archetypal reference strains. Next generation sequencing and whole genome typing strategies provide the technological foundation for genomic epidemiology outbreak investigation utilizing its significantly higher sample throughput, cost efficiency, and phylogenetic relatedness accuracy. These phylogenomics approaches have major public health relevance in translating information from the sequence-based survey to support timely and informed countermeasures. 
Polymorphisms identified in this work offer robust phylogenetic signals that index both short- and long-term evolution and can complement currently employed typing schemes for outbreak exclusion and inclusion, diagnostics, surveillance, and forensic studies. PMID:27446025
Smith, R F; Wiese, B A; Wojzynski, M K; Davison, D B; Worley, K C
1996-05-01
The BCM Search Launcher is an integrated set of World Wide Web (WWW) pages that organize molecular biology-related search and analysis services available on the WWW by function, and provide a single point of entry for related searches. The Protein Sequence Search Page, for example, provides a single sequence entry form for submitting sequences to WWW servers that offer remote access to a variety of different protein sequence search tools, including BLAST, FASTA, Smith-Waterman, BEAUTY, PROSITE, and BLOCKS searches. Other Launch pages provide access to (1) nucleic acid sequence searches, (2) multiple and pair-wise sequence alignments, (3) gene feature searches, (4) protein secondary structure prediction, and (5) miscellaneous sequence utilities (e.g., six-frame translation). The BCM Search Launcher also provides a mechanism to extend the utility of other WWW services by adding supplementary hypertext links to results returned by remote servers. For example, links to the NCBI's Entrez data base and to the Sequence Retrieval System (SRS) are added to search results returned by the NCBI's WWW BLAST server. These links provide easy access to auxiliary information, such as Medline abstracts, that can be extremely helpful when analyzing BLAST data base hits. For new or infrequent users of sequence data base search tools, we have preset the default search parameters to provide the most informative first-pass sequence analysis possible. We have also developed a batch client interface for Unix and Macintosh computers that allows multiple input sequences to be searched automatically as a background task, with the results returned as individual HTML documents directly to the user's system. The BCM Search Launcher and batch client are available on the WWW at URL http://gc.bcm.tmc.edu:8088/search-launcher.html.
Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling.
Tao, Ran; Zeng, Donglin; Franceschini, Nora; North, Kari E; Boerwinkle, Eric; Lin, Dan-Yu
2015-06-01
High-throughput DNA sequencing allows for the genotyping of common and rare variants for genetic association studies. At the present time and for the foreseeable future, it is not economically feasible to sequence all individuals in a large cohort. A cost-effective strategy is to sequence those individuals with extreme values of a quantitative trait. We consider the design under which the sampling depends on multiple quantitative traits. Under such trait-dependent sampling, standard linear regression analysis can result in bias of parameter estimation, inflation of type I error, and loss of power. We construct a likelihood function that properly reflects the sampling mechanism and utilizes all available data. We implement a computationally efficient EM algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. Our methods can be used to perform separate inference on each trait or simultaneous inference on multiple traits. We pay special attention to gene-level association tests for rare variants. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study and the National Heart, Lung, and Blood Institute Exome Sequencing Project.
Processing and population genetic analysis of multigenic datasets with ProSeq3 software.
Filatov, Dmitry A
2009-12-01
The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.
Berthier, Y; Thierry, D; Lemattre, M; Guesdon, J L
1994-01-01
A new insertion sequence was isolated from Xanthomonas campestris pv. dieffenbachiae. Sequence analysis showed that this element is 1,158 bp long and has 15-bp inverted repeat ends containing two mismatches. Comparison of this sequence with sequences in data bases revealed significant homology with Escherichia coli IS5. IS1051, which detected multiple restriction fragment length polymorphisms, was used as a probe to characterize strains from the pathovar dieffenbachiae. PMID:7906933
Xu, Jiajia; Li, Yuanyuan; Ma, Xiuling; Ding, Jianfeng; Wang, Kai; Wang, Sisi; Tian, Ye; Zhang, Hui; Zhu, Xin-Guang
2013-09-01
Setaria viridis is an emerging model species for genetic studies of C4 photosynthesis. Many basic molecular resources need to be developed to support this species. In this paper, we performed a comprehensive transcriptome analysis from multiple developmental stages and tissues of S. viridis using next-generation sequencing technologies. Sequencing of the transcriptome from multiple tissues across three developmental stages (seed germination, vegetative growth, and reproduction) yielded a total of 71 million single-end 100 bp reads. Reference-based assembly using the Setaria italica genome as a reference generated 42,754 transcripts. De novo assembly generated 60,751 transcripts. In addition, 9,576 and 7,056 potential simple sequence repeats (SSRs) covering the S. viridis genome were identified when using the reference-based assembled transcripts and the de novo assembled transcripts, respectively. The transcripts and SSRs identified in this study can be used for both reverse and forward genetic studies based on S. viridis.
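The SSR search mentioned above can be illustrated with a minimal sketch. The abstract does not say which tool or thresholds were used, so the function name `find_ssrs` and the parameters `min_unit`, `max_unit` and `min_repeats` are assumptions made for illustration only. The sketch simply looks for a short unit that is repeated in tandem.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for tandem repeats in seq.

    Thresholds are illustrative assumptions, not values from the study.
    """
    # The backreference \1 forces the captured unit to repeat back to back.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

print(find_ssrs("TTG" + "AC" * 5 + "GGT"))
```

A dedicated SSR tool would also handle mononucleotide runs, compound repeats and imperfect repeats, which this sketch ignores.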
NASA Astrophysics Data System (ADS)
Du, Mao-Kang; He, Bo; Wang, Yong
2011-01-01
Recently, the cryptosystem based on chaos has attracted much attention. Wang and Yu (Commun. Nonlin. Sci. Numer. Simulat. 14 (2009) 574) proposed a block encryption algorithm based on dynamic sequences of multiple chaotic systems. We analyze the potential flaws in the algorithm. Then, a chosen-plaintext attack is presented. Some remedial measures are suggested to avoid the flaws effectively. Furthermore, an improved encryption algorithm is proposed to resist the attacks and to keep all the merits of the original cryptosystem.
Huang, Fengying; Meng, Qiuping; Tan, Guanghong; Huang, Yonghao; Wang, Hua; Mei, Wenli; Dai, Haofu
2011-06-01
To analyze and identify a bacterial strain isolated from a laboratory breeding mouse far away from a hospital. The phenotype of the isolate was investigated by conventional microbiological methods, including Gram staining, colony morphology, tests for haemolysis, catalase, and coagulase, and an antimicrobial susceptibility test. The mecA and 16S rRNA genes were amplified by the polymerase chain reaction (PCR) and sequenced. The base sequence of the PCR product was compared with known 16S rRNA gene sequences in the GenBank database by phylogenetic analysis and multiple sequence alignment. The isolate in this study was a Gram-positive, coagulase-negative, and catalase-positive coccus. The isolate was resistant to oxacillin, methicillin, penicillin, ampicillin, cefazolin, ciprofloxacin, and erythromycin, among others. PCR results indicated that the isolate was mecA gene positive and its 16S rRNA was 1,465 bp. Phylogenetic analysis of the resultant 16S rRNA indicated that the isolate belonged to the genus Staphylococcus, and multiple sequence alignment showed that the isolate was Staphylococcus haemolyticus with only one base difference from the corresponding 16S rRNA deposited in GenBank. 16S rRNA gene sequencing is a suitable technique for non-specialist researchers. Laboratory animals are possible sources of lethal pathogens, and researchers must adopt protective measures when they manipulate animals. Copyright © 2011 Hainan Medical College. Published by Elsevier B.V. All rights reserved.
The MIGenAS integrated bioinformatics toolkit for web-based sequence analysis
Rampp, Markus; Soddemann, Thomas; Lederer, Hermann
2006-01-01
We describe a versatile and extensible integrated bioinformatics toolkit for the analysis of biological sequences over the Internet. The web portal offers convenient interactive access to a growing pool of chainable bioinformatics software tools and databases that are centrally installed and maintained by the RZG. Currently, supported tasks comprise sequence similarity searches in public or user-supplied databases, computation and validation of multiple sequence alignments, phylogenetic analysis and protein–structure prediction. Individual tools can be seamlessly chained into pipelines allowing the user to conveniently process complex workflows without the necessity to take care of any format conversions or tedious parsing of intermediate results. The toolkit is part of the Max-Planck Integrated Gene Analysis System (MIGenAS) of the Max Planck Society available at (click ‘Start Toolkit’). PMID:16844980
Fast alignment-free sequence comparison using spaced-word frequencies.
Leimeister, Chris-Andre; Boden, Marcus; Horwege, Sebastian; Lindner, Sebastian; Morgenstern, Burkhard
2014-07-15
Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Our program is freely available at http://spaced.gobics.de/. © The Author 2014. Published by Oxford University Press.
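To make the idea of a spaced word concrete, here is a minimal sketch that counts spaced words under a single pattern. It is a plain, unoptimized illustration and not the recursive-hashing, bit-operation implementation the abstract describes. The function name `spaced_words` and the pattern encoding ('1' for a match position, '0' for a don't-care position) are assumptions.

```python
from collections import Counter

def spaced_words(seq, pattern):
    """Count spaced words of seq under a pattern of '1' (match) and '0' (don't care)."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    # Slide the pattern along the sequence and keep only the match positions.
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        counts[word] += 1
    return counts

print(spaced_words("ACGTAC", "101"))
```

With the pattern "101", the windows ACG, CGT, GTA and TAC contribute the spaced words AG, CT, GA and TC. The frequency vectors of two sequences could then be compared with any distance measure to obtain input for phylogeny reconstruction.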
An intuitive graphical webserver for multiple-choice protein sequence search.
Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince
2014-04-10
Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" has become a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, forming the query on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple-choice queries by simply drawing the residues to their positions; if more than one residue is drawn to the same position, they will be stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes it possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org. Copyright © 2014 Elsevier B.V. All rights reserved.
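A multiple-choice query of this kind corresponds to a regular expression with one character class per position. The sketch below shows that correspondence only; it is not the server's code, and the function name `build_query` and the residue groupings are hypothetical choices made for illustration.

```python
import re

def build_query(choices):
    """Build a regex from per-position residue choices; '' means any residue."""
    parts = ["." if not c else "[" + "".join(sorted(set(c))) + "]" for c in choices]
    return re.compile("".join(parts))

# Hypothetical residue classes, chosen for illustration only.
HYDROPHOBIC = "AVLIMFW"
POLAR = "STNQ"

# Position 1: hydrophobic, position 2: anything, position 3: polar.
query = build_query([HYDROPHOBIC, "", POLAR])
match = query.search("GGASTKK")
print(match.start(), match.group(0))
```

Stacking several residues at one position in the graphical interface plays the role of the bracketed character class here.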
Java bioinformatics analysis web services for multiple sequence alignment--JABAWS:MSA.
Troshin, Peter V; Procter, James B; Barton, Geoffrey J
2011-07-15
JABAWS is a web services framework that simplifies the deployment of web services for bioinformatics. JABAWS:MSA provides services for five multiple sequence alignment (MSA) methods (Probcons, T-coffee, Muscle, Mafft and ClustalW), and is the system employed by the Jalview multiple sequence analysis workbench since version 2.6. A fully functional, easy to set up server is provided as a Virtual Appliance (VA), which can be run on most operating systems that support a virtualization environment such as VMware or Oracle VirtualBox. JABAWS is also distributed as a Web Application aRchive (WAR) and can be configured to run on a single computer and/or a cluster managed by Grid Engine, LSF or other queuing systems that support DRMAA. JABAWS:MSA provides clients full access to each application's parameters, allows administrators to specify named parameter preset combinations and execution limits for each application through simple configuration files. The JABAWS command-line client allows integration of JABAWS services into conventional scripts. JABAWS is made freely available under the Apache 2 license and can be obtained from: http://www.compbio.dundee.ac.uk/jabaws.
Sequence Diversity Diagram for comparative analysis of multiple sequence alignments.
Sakai, Ryo; Aerts, Jan
2014-01-01
The sequence logo is a graphical representation of a set of aligned sequences, commonly used to depict conservation of amino acid or nucleotide sequences. Although it effectively communicates the amount of information present at every position, this visual representation falls short when the domain task is to compare between two or more sets of aligned sequences. We present a new visual presentation called a Sequence Diversity Diagram and validate our design choices with a case study. Our software was developed using the open-source program called Processing. It loads multiple sequence alignment FASTA files and a configuration file, which can be modified as needed to change the visualization. The redesigned figure improves on the visual comparison of two or more sets, and it additionally encodes information on sequential position conservation. In our case study of the adenylate kinase lid domain, the Sequence Diversity Diagram reveals unexpected patterns and new insights, for example the identification of subgroups within the protein subfamily. Our future work will integrate this visual encoding into interactive visualization tools to support higher level data exploration tasks.
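For reference, the "amount of information present at every position" in a sequence logo is conventionally the maximum entropy of the alphabet minus the observed Shannon entropy of the column. The sketch below computes that quantity without the small-sample correction and without gap handling; the function name `column_information` is an assumption, and the code is not part of the Sequence Diversity Diagram software.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, no small-sample correction."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# A fully conserved nucleotide column carries the maximum of 2 bits.
print(column_information("AAAA", alphabet_size=4))
```

A column in which all four nucleotides are equally frequent carries no information under this definition, which is why comparing two alignment sets by logo height alone can hide which residues differ.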
2014-01-01
Background: The 2013 BioVis Contest provided an opportunity to evaluate different paradigms for visualizing protein multiple sequence alignments. Such data sets are becoming extremely large and thus taxing current visualization paradigms. Sequence Logos represent consensus sequences but have limitations for protein alignments. As an alternative, ProfileGrids are a new protein sequence alignment visualization paradigm that represents an alignment as a color-coded matrix of the residue frequency occurring at every homologous position in the aligned protein family. Results: The JProfileGrid software program was used to analyze the BioVis contest data sets to generate figures for comparison with the Sequence Logo reference images. Conclusions: The ProfileGrid representation allows for the clear and effective analysis of protein multiple sequence alignments. This includes both a general overview of the conservation and diversity sequence patterns as well as the interactive ability to query the details of the protein residue distributions in the alignment. The JProfileGrid software is free and available from http://www.ProfileGrid.org. PMID:25237393
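The matrix underlying a ProfileGrid, as described above, holds the frequency of each residue at each alignment position. A minimal sketch of how such a matrix could be tabulated follows; the function name `residue_frequencies` is an assumption, and color coding and interactivity are outside its scope.

```python
from collections import Counter

def residue_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment), "sequences must be aligned"
    grid = []
    # zip(*alignment) walks the alignment column by column.
    for col in zip(*alignment):
        counts = Counter(col)
        grid.append({res: n / len(col) for res, n in counts.items()})
    return grid

print(residue_frequencies(["AC", "AG", "AC", "TC"]))
```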
ERIC Educational Resources Information Center
Lavigne, Frederic; Dumercy, Laurent; Darmon, Nelly
2011-01-01
Recall and language comprehension while processing sequences of words involves multiple semantic priming between several related and/or unrelated words. Accounting for multiple and interacting priming effects in terms of underlying neuronal structure and dynamics is a challenge for current models of semantic priming. Further elaboration of current…
Jakupciak, John P.; Wells, Jeffrey M.; Karalus, Richard J.; Pawlowski, David R.; Lin, Jeffrey S.; Feldman, Andrew B.
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations. PMID:24455204
Cuddy, L L; Thompson, W F
1992-01-01
In a probe-tone experiment, two groups of listeners--one trained, the other untrained, in traditional music theory--rated the goodness of fit of each of the 12 notes of the chromatic scale to four-voice harmonic sequences. Sequences were 12 simplified excerpts from Bach chorales, 4 nonmodulating, and 8 modulating. Modulations occurred either one or two steps in either the clockwise or the counterclockwise direction on the cycle of fifths. A consistent pattern of probe-tone ratings was obtained for each sequence, with no significant differences between listener groups. Two methods of analysis (Fourier analysis and regression analysis) revealed a directional asymmetry in the perceived key movement conveyed by modulating sequences. For a given modulation distance, modulations in the counterclockwise direction effected a clearer shift in tonal organization toward the final key than did clockwise modulations. The nature of the directional asymmetry was consistent with results reported for identification and rating of key change in the sequences (Thompson & Cuddy, 1989a). Further, according to the multiple-regression analysis, probe-tone ratings did not merely reflect the distribution of tones in the sequence. Rather, ratings were sensitive to the temporal structure of the tonal organization in the sequence.
String Mining in Bioinformatics
NASA Astrophysics Data System (ADS)
Abouelhoda, Mohamed; Ghanem, Moustafa
Sequence analysis is a major area in bioinformatics encompassing the methods and techniques for studying the biological sequences, DNA, RNA, and proteins, on the linear structure level. The focus of this area is generally on the identification of intra- and inter-molecular similarities. Identifying intra-molecular similarities boils down to detecting repeated segments within a given sequence, while identifying inter-molecular similarities amounts to spotting common segments among two or more sequences. From a data mining point of view, sequence analysis is nothing but string or pattern mining specific to biological strings. For a long time, however, this point of view has been explicitly embraced neither in the data mining nor in the sequence analysis textbooks, which may be attributed to the co-evolution of the two apparently independent fields. In other words, although the word "data-mining" is almost missing in the sequence analysis literature, its basic concepts have been implicitly applied. Interestingly, recent research in biological sequence analysis introduced efficient solutions to many problems in data mining, such as querying and analyzing time series [49,53], extracting information from web pages [20], fighting spam mails [50], detecting plagiarism [22], and spotting duplications in software systems [14].
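As a small illustration of detecting repeated segments within one sequence, the sketch below indexes every substring of a fixed length k and reports those that occur more than once. The function name `repeated_kmers` is an assumption; practical tools typically rely on suffix trees or suffix arrays, which find repeats of any length far more efficiently.

```python
from collections import defaultdict

def repeated_kmers(seq, k):
    """Map each length-k substring occurring more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

print(repeated_kmers("ACGTTACGA", 3))
```

Running the same indexing over two sequences and intersecting the keys would give the inter-molecular counterpart, that is, segments common to both.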
Skeleton-based human action recognition using multiple sequence alignment
NASA Astrophysics Data System (ADS)
Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong
2015-05-01
Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. The method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels the actionlets with letters of the English alphabet according to the Davies-Bouldin index value. An action can therefore be represented as a sequence of actionlet symbols, which preserves the temporal order of occurrence of the actionlets. Finally, we employ sequence comparison to classify multiple actions using a string matching algorithm (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments with the proposed method on three challenging 3D action datasets show promising results.
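The final classification step can be sketched as a Needleman-Wunsch global alignment between actionlet strings. The scoring values, the nearest-template rule and the names `needleman_wunsch_score` and `classify` are assumptions made for illustration; the abstract does not state the scoring scheme or the decision rule the authors used.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b (scores are illustrative)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def classify(query, templates):
    """Return the label whose template string aligns best with query."""
    return max(templates, key=lambda label: needleman_wunsch_score(query, templates[label]))

# Hypothetical actionlet strings for two action classes.
templates = {"wave": "ABCD", "kick": "DCBA"}
print(classify("ABCD", templates))
```

Only two rows of the dynamic programming table are kept because the score alone is needed here; recovering the alignment itself would require the full table or a traceback structure.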
CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.
Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan
2017-06-24
Multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain implicit assumptions that limit their generality. First, the properties of users' sequences, including the sizes of datasets and the lengths of sequences, can take arbitrary values and are generally unknown before submission, a fact that previous work unfortunately ignores. Second, the center star strategy is suited for aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by running the workload on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on multiple similar RNA/DNA sequence alignment and proposes a novel bitmap-based algorithm to improve the center star strategy.
We can conclude that harvesting the high performance of modern GPU is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .
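For context, the center sequence selection stage of the classic center star strategy can be sketched as an exhaustive search for the sequence with the smallest total distance to all others. This is the textbook baseline only, not CMSA's bitmap-based algorithm, which the abstract does not describe. The use of edit distance as the pairwise measure and the names `edit_distance` and `select_center` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def select_center(seqs):
    """Index of the sequence minimizing the summed distance to all sequences."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)

seqs = ["ACGT", "ACGA", "ACTT", "TCGT"]
print(select_center(seqs))
```

Because every pair of sequences is compared, the cost of this baseline grows quickly with the number of sequences, which is the bottleneck an improved selection step aims to remove.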
USDA-ARS's Scientific Manuscript database
Here, we present the draft genome sequences of nine multidrug-resistant Escherichia coli isolated from humans (n=6) and chicken carcass (n=3) from Lagos, Nigeria in 2013. Multiple extended-spectrum beta-lactamase (ESBL) genes were identified in these isolates. ...
Agnelli, Luca; Tassone, Pierfrancesco; Neri, Antonino
2013-06-01
Multiple myeloma is a fatal malignant proliferation of clonal bone marrow Ig-secreting plasma cells, characterized by wide clinical, biological, and molecular heterogeneity. Herein, global gene and microRNA expression, genome-wide DNA profiling, and next-generation sequencing technology used to investigate the genomic alterations underlying the bio-clinical heterogeneity in multiple myeloma are discussed. High-throughput technologies have undoubtedly allowed a better comprehension of the molecular basis of the disease, a fine stratification, and early identification of high-risk patients, and have provided insights toward targeted therapy studies. However, such technologies are at risk of being affected by laboratory- or cohort-specific biases, and are moreover influenced by a high number of expected false positives. This aspect has a major weight in myeloma, which is characterized by large molecular heterogeneity. Therefore, meta-analysis as well as multiple approaches are desirable, if not mandatory, to validate the results obtained, in line with commonly accepted recommendations for tumor diagnostic/prognostic biomarker studies.
mESAdb: microRNA Expression and Sequence Analysis Database
Kaya, Koray D.; Karakülah, Gökhan; Yakıcıer, Cengiz M.; Acar, Aybar C.; Konu, Özlen
2011-01-01
microRNA expression and sequence analysis database (http://konulab.fen.bilkent.edu.tr/mirna/) (mESAdb) is a regularly updated database for the multivariate analysis of sequences and expression of microRNAs from multiple taxa. mESAdb is modular and has a user interface implemented in PHP and JavaScript and coupled with statistical analysis and visualization packages written for the R language. The database primarily comprises mature microRNA sequences and their target data, along with selected human, mouse and zebrafish expression data sets. mESAdb analysis modules allow (i) mining of microRNA expression data sets for subsets of microRNAs selected manually or by motif; (ii) pair-wise multivariate analysis of expression data sets within and between taxa; and (iii) association of microRNA subsets with annotation databases, HUGE Navigator, KEGG and GO. The use of existing and customized R packages facilitates future addition of data sets and analysis tools. Furthermore, the ability to upload and analyze user-specified data sets makes mESAdb an interactive and expandable analysis tool for microRNA sequence and expression data. PMID:21177657
Illuminator, a desktop program for mutation detection using short-read clonal sequencing.
Carr, Ian M; Morgan, Joanne E; Diggle, Christine P; Sheridan, Eamonn; Markham, Alexander F; Logan, Clare V; Inglehearn, Chris F; Taylor, Graham R; Bonthron, David T
2011-10-01
Current methods for sequencing clonal populations of DNA molecules yield several gigabases of data per day, typically comprising reads of < 100 nt. Such datasets permit widespread genome resequencing and transcriptome analysis or other quantitative tasks. However, this huge capacity can also be harnessed for the resequencing of smaller (gene-sized) target regions, through the simultaneous parallel analysis of multiple subjects, using sample "tagging" or "indexing". These methods promise to have a huge impact on diagnostic mutation analysis and candidate gene testing. Here we describe a software package developed for such studies, offering the ability to resolve pooled samples carrying barcode tags and to align reads to a reference sequence using a mutation-tolerant process. The program, Illuminator, can identify rare sequence variants, including insertions and deletions, and permits interactive data analysis on standard desktop computers. It facilitates the effective analysis of targeted clonal sequencer data without dedicated computational infrastructure or specialized training. Copyright © 2011 Elsevier Inc. All rights reserved.
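Resolving pooled samples by their barcode tags can be sketched as follows. The sketch assumes that each read starts with a fixed-length tag and that tags must match exactly; neither assumption comes from the abstract, and the function name `demultiplex` is hypothetical rather than part of Illuminator.

```python
def demultiplex(reads, barcodes):
    """Group reads by a leading barcode tag and strip the tag.

    barcodes maps tag sequence -> sample name; all tags share one length.
    """
    tag_len = len(next(iter(barcodes)))
    samples = {name: [] for name in barcodes.values()}
    unassigned = []
    for read in reads:
        name = barcodes.get(read[:tag_len])
        if name is None:
            unassigned.append(read)  # tag not recognized, keep the read intact
        else:
            samples[name].append(read[tag_len:])
    return samples, unassigned

reads = ["ACGTTTGA", "TGCAGGCC", "NNNNAAAA"]
samples, unassigned = demultiplex(reads, {"ACGT": "s1", "TGCA": "s2"})
print(samples, unassigned)
```

A production tool would usually tolerate sequencing errors in the tag, for example by accepting tags within a small edit distance of a known barcode.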
KinView: A visual comparative sequence analysis tool for integrated kinome research
McSkimming, Daniel Ian; Dastgheib, Shima; Baffi, Timothy R.; Byrne, Dominic P.; Ferries, Samantha; Scott, Steven Thomas; Newton, Alexandra C.; Eyers, Claire E.; Kochut, Krzysztof J.; Eyers, Patrick A.
2017-01-01
Multiple sequence alignments (MSAs) are a fundamental analysis tool used throughout biology to investigate relationships between protein sequence, structure, function, evolutionary history, and patterns of disease-associated variants. However, their widespread application in systems biology research is currently hindered by the lack of user-friendly tools to simultaneously visualize, manipulate and query the information conceptualized in large sequence alignments, and the challenges in integrating MSAs with multiple orthogonal data such as cancer variants and post-translational modifications, which are often stored in heterogeneous data sources and formats. Here, we present the Multiple Sequence Alignment Ontology (MSAOnt), which represents a profile or consensus alignment in an ontological format. Subsets of the alignment are easily selected through the SPARQL Protocol and RDF Query Language for downstream statistical analysis or visualization. We have also created the Kinome Viewer (KinView), an interactive integrative visualization that places eukaryotic protein kinase cancer variants in the context of natural sequence variation and experimentally determined post-translational modifications, which play central roles in the regulation of cellular signaling pathways. Using KinView, we identified differential phosphorylation patterns between tyrosine and serine/threonine kinases in the activation segment, a major kinase regulatory region that is often mutated in proliferative diseases. We discuss cancer variants that disrupt phosphorylation sites in the activation segment, and show how KinView can be used as a comparative tool to identify differences and similarities in natural variation, cancer variants and post-translational modifications between kinase groups, families and subfamilies. Based on KinView comparisons, we identify and experimentally characterize a regulatory tyrosine (Y177PLK4) in the PLK4 C-terminal activation segment region termed the P+1 loop. 
To further demonstrate the application of KinView in hypothesis generation and testing, we formulate and validate a hypothesis explaining a novel predicted loss-of-function variant (D523NPKCβ) in the regulatory spine of PKCβ, a recently identified tumor suppressor kinase. KinView provides a novel, extensible interface for performing comparative analyses between subsets of kinases and for integrating multiple types of residue specific annotations in user friendly formats. PMID:27731453
Cho, Jin-Young; Lee, Hyoung-Joo; Jeong, Seul-Ki; Paik, Young-Ki
2017-12-01
Mass spectrometry (MS) is a widely used proteome analysis tool for biomedical science. In an MS-based bottom-up proteomic approach to protein identification, sequence database (DB) searching has been routinely used because of its simplicity and convenience. However, searching a sequence DB with multiple variable modification options can increase processing time and false-positive errors in large and complicated MS data sets. Spectral library searching is an alternative solution, avoiding the limitations of sequence DB searching and allowing the detection of more peptides with high sensitivity. Unfortunately, this technique has less proteome coverage, resulting in limitations in the detection of novel and whole peptide sequences in biological samples. To solve these problems, we previously developed the "Combo-Spec Search" method, which manually uses multiple references and simulated spectral library searching to analyze whole proteomes in a biological sample. In this study, we have developed a new analytical interface tool called "Epsilon-Q" to enhance the functions of both the Combo-Spec Search method and label-free protein quantification. Epsilon-Q automatically performs multiple spectral library searching, class-specific false-discovery rate control, and result integration. It has a user-friendly graphical interface and demonstrates good performance in identifying and quantifying proteins by supporting standard MS data formats and spectrum-to-spectrum matching powered by SpectraST. Furthermore, when the Epsilon-Q interface is combined with the Combo-Spec Search method (a combination called the Epsilon-Q system), it shows a synergistic function by outperforming other sequence DB search engines for identifying and quantifying low-abundance proteins in biological samples. The Epsilon-Q system can be a versatile tool for comparative proteome analysis based on multiple spectral libraries and label-free quantification.
Genomic Sequencing: Assessing The Health Care System, Policy, And Big-Data Implications
Phillips, Kathryn A.; Trosman, Julia; Kelley, Robin K.; Pletcher, Mark J.; Douglas, Michael P.; Weldon, Christine B.
2014-01-01
New genomic sequencing technologies enable the high-speed analysis of multiple genes simultaneously, including all of those in a person's genome. Sequencing is a prominent example of a “big data” technology because of the massive amount of information it produces and its complexity, diversity, and timeliness. Our objective in this article is to provide a policy primer on sequencing and illustrate how it can affect health care system and policy issues. Toward this end, we developed an easily applied classification of sequencing based on inputs, methods, and outputs. We used it to examine the implications of sequencing for three health care system and policy issues: making care more patient-centered, developing coverage and reimbursement policies, and assessing economic value. We conclude that sequencing has great promise but that policy challenges include how to optimize patient engagement as well as privacy, develop coverage policies that distinguish research from clinical uses and account for bioinformatics costs, and determine the economic value of sequencing through complex economic models that take into account multiple findings and downstream costs. PMID:25006153
Genomic sequencing: assessing the health care system, policy, and big-data implications.
Phillips, Kathryn A; Trosman, Julia R; Kelley, Robin K; Pletcher, Mark J; Douglas, Michael P; Weldon, Christine B
2014-07-01
New genomic sequencing technologies enable the high-speed analysis of multiple genes simultaneously, including all of those in a person's genome. Sequencing is a prominent example of a "big data" technology because of the massive amount of information it produces and its complexity, diversity, and timeliness. Our objective in this article is to provide a policy primer on sequencing and illustrate how it can affect health care system and policy issues. Toward this end, we developed an easily applied classification of sequencing based on inputs, methods, and outputs. We used it to examine the implications of sequencing for three health care system and policy issues: making care more patient-centered, developing coverage and reimbursement policies, and assessing economic value. We conclude that sequencing has great promise but that policy challenges include how to optimize patient engagement as well as privacy, develop coverage policies that distinguish research from clinical uses and account for bioinformatics costs, and determine the economic value of sequencing through complex economic models that take into account multiple findings and downstream costs. Project HOPE—The People-to-People Health Foundation, Inc.
Kumar, Sudhir; Stecher, Glen; Peterson, Daniel; Tamura, Koichiro
2012-10-15
There is a growing need in the research community to apply the molecular evolutionary genetics analysis (MEGA) software tool for batch processing a large number of datasets and to integrate it into analysis workflows. Therefore, we now make available the computing core of the MEGA software as a stand-alone executable (MEGA-CC), along with an analysis prototyper (MEGA-Proto). MEGA-CC provides users with access to all the computational analyses available through MEGA's graphical user interface version. This includes methods for multiple sequence alignment, substitution model selection, evolutionary distance estimation, phylogeny inference, substitution rate and pattern estimation, tests of natural selection and ancestral sequence inference. Additionally, we have upgraded the source code for phylogenetic analysis using the maximum likelihood methods for parallel execution on multiple processors and cores. Here, we describe MEGA-CC and outline the steps for using MEGA-CC in tandem with MEGA-Proto for iterative and automated data analysis. http://www.megasoftware.net/.
Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay
2013-01-01
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Owing to the lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, no report is currently available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
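To make the read depths in this abstract concrete, the following minimal sketch applies the standard coverage relation, depth = (number of reads × read length) / genome size, to estimate how many reads a target depth implies. The function name and the 100-bp read length are assumptions chosen for illustration; the abstract does not state the read lengths used in the study.

```python
import math

def reads_for_depth(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Smallest read count such that reads * read_length / genome_size >= depth."""
    return math.ceil(genome_size_bp * depth / read_length_bp)

# Genome sizes quoted in the abstract, converted to base pairs.
genomes = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}

ASSUMED_READ_LENGTH_BP = 100  # hypothetical value, not taken from the abstract

for name, size in genomes.items():
    n_reads = reads_for_depth(size, 50, ASSUMED_READ_LENGTH_BP)
    print(f"{name}: about {n_reads:,} reads for 50X depth")
```

Doubling the depth to 100X, as reported for Meraculous, doubles the required read count under the same assumptions.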
Initial genome sequencing and analysis of multiple myeloma
Chapman, Michael A.; Lawrence, Michael S.; Keats, Jonathan J.; Cibulskis, Kristian; Sougnez, Carrie; Schinzel, Anna C.; Harview, Christina L.; Brunet, Jean-Philippe; Ahmann, Gregory J.; Adli, Mazhar; Anderson, Kenneth C.; Ardlie, Kristin G.; Auclair, Daniel; Baker, Angela; Bergsagel, P. Leif; Bernstein, Bradley E.; Drier, Yotam; Fonseca, Rafael; Gabriel, Stacey B.; Hofmeister, Craig C.; Jagannath, Sundar; Jakubowiak, Andrzej J.; Krishnan, Amrita; Levy, Joan; Liefeld, Ted; Lonial, Sagar; Mahan, Scott; Mfuko, Bunmi; Monti, Stefano; Perkins, Louise M.; Onofrio, Robb; Pugh, Trevor J.; Vincent Rajkumar, S.; Ramos, Alex H.; Siegel, David S.; Sivachenko, Andrey; Trudel, Suzanne; Vij, Ravi; Voet, Douglas; Winckler, Wendy; Zimmerman, Todd; Carpten, John; Trent, Jeff; Hahn, William C.; Garraway, Levi A.; Meyerson, Matthew; Lander, Eric S.; Getz, Gad; Golub, Todd R.
2013-01-01
Multiple myeloma is an incurable malignancy of plasma cells, and its pathogenesis is poorly understood. Here we report the massively parallel sequencing of 38 tumor genomes and their comparison to matched normal DNAs. Several new and unexpected oncogenic mechanisms were suggested by the pattern of somatic mutation across the dataset. These include the mutation of genes involved in protein translation (seen in nearly half of the patients), genes involved in histone methylation, and genes involved in blood coagulation. In addition, a broader than anticipated role of NF-κB signaling was suggested by mutations in 11 members of the NF-κB pathway. Of potential immediate clinical relevance, activating mutations of the kinase BRAF were observed in 4% of patients, suggesting the evaluation of BRAF inhibitors in multiple myeloma clinical trials. These results indicate that cancer genome sequencing of large collections of samples will yield new insights into cancer not anticipated by existing knowledge. PMID:21430775
pico-PLAZA, a genome database of microbial photosynthetic eukaryotes.
Vandepoele, Klaas; Van Bel, Michiel; Richard, Guilhem; Van Landeghem, Sofie; Verhelst, Bram; Moreau, Hervé; Van de Peer, Yves; Grimsley, Nigel; Piganeau, Gwenael
2013-08-01
With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. To illustrate the versatility of the platform, different case studies are presented demonstrating how pico-PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichments analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains. © 2013 John Wiley & Sons Ltd and Society for Applied Microbiology.
A Comprehensive Strategy for Accurate Mutation Detection of the Highly Homologous PMS2.
Li, Jianli; Dai, Hongzheng; Feng, Yanming; Tang, Jia; Chen, Stella; Tian, Xia; Gorman, Elizabeth; Schmitt, Eric S; Hansen, Terah A A; Wang, Jing; Plon, Sharon E; Zhang, Victor Wei; Wong, Lee-Jun C
2015-09-01
Germline mutations in the DNA mismatch repair gene PMS2 underlie the cancer susceptibility syndrome, Lynch syndrome. However, accurate molecular testing of PMS2 is complicated by a large number of highly homologous sequences. To establish a comprehensive approach for mutation detection of PMS2, we have designed a strategy combining targeted capture next-generation sequencing (NGS), multiplex ligation-dependent probe amplification, and long-range PCR followed by NGS to simultaneously detect point mutations and copy number changes of PMS2. Exonic deletions (E2 to E9, E5 to E9, E8, E10, E14, and E1 to E15), duplications (E11 to E12), and a nonsense mutation, p.S22*, were identified. Traditional multiplex ligation-dependent probe amplification and Sanger sequencing approaches cannot differentiate the origin of the exonic deletions in the 3' region when PMS2 and PMS2CL share identical sequences as a result of gene conversion. Our approach allows unambiguous identification of mutations in the active gene with a straightforward long-range-PCR/NGS method. Breakpoint analysis of multiple samples revealed that recurrent exon 14 deletions are mediated by homologous Alu sequences. Our comprehensive approach provides a reliable tool for accurate molecular analysis of genes containing multiple copies of highly homologous sequences and should improve PMS2 molecular analysis for patients with Lynch syndrome. Copyright © 2015 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Zhang, Ran; Yin, Yinliang; Zhang, Yujun; Li, Kexin; Zhu, Hongxia; Gong, Qin; Wang, Jianwu; Hu, Xiaoxiang; Li, Ning
2012-01-01
As the number of transgenic livestock increases, reliable detection and molecular characterization of transgene integration sites and copy number are crucial not only for interpreting the relationship between the integration site and the specific phenotype but also for commercial and economic demands. However, the ability of conventional PCR techniques to detect incomplete and multiple integration events is limited, making it technically challenging to characterize transgenes. Next-generation sequencing has enabled cost-effective, routine and widespread high-throughput genomic analysis. Here, we demonstrate the use of next-generation sequencing to extensively characterize cattle harboring a 150-kb human lactoferrin transgene that was initially analyzed by chromosome walking without success. Using this approach, the sites upstream and downstream of the target gene integration site in the host genome were identified at the single nucleotide level. The sequencing result was verified by event-specific PCR for the integration sites and FISH for the chromosomal location. Sequencing depth analysis revealed that multiple copies of the incomplete target gene and the vector backbone were present in the host genome. Upon integration, complex recombination was also observed between the target gene and the vector backbone. These findings indicate that next-generation sequencing is a reliable and accurate approach for the molecular characterization of the transgene sequence, integration sites and copy number in transgenic species. PMID:23185606
Molecular Cloning and Sequence Analysis of a Phenylalanine Ammonia-Lyase Gene from Dendrobium
Cai, Yongping; Lin, Yi
2013-01-01
In this study, a phenylalanine ammonia-lyase (PAL) gene was cloned from Dendrobium candidum using homology cloning and RACE. The full-length PAL cDNA of D. candidum (designated Dc-PAL1, GenBank No. JQ765748) has 2,458 bp and contains a complete open reading frame (ORF) of 2,142 bp, which encodes 713 amino acid residues. The amino acid sequence of DcPAL1 has more than 80% sequence identity with the PALs of other plants, as indicated by multiple alignments. The dominant sites and catalytic active sites, which are similar to those in the PAL proteins of Arabidopsis thaliana and Nicotiana tabacum, are also found in DcPAL1. Phylogenetic tree analysis revealed that DcPAL is more closely related to PALs from Orchidaceae plants than to those of other plants. The differential expression patterns of PAL in protocorm-like body, leaf, stem, and root suggest that the PAL gene performs multiple physiological functions in Dendrobium candidum. PMID:23638048
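The ORF length and residue count reported here are consistent with each other if the 2,142-bp ORF is taken to include the stop codon, which is an assumption on our part rather than something the abstract states. A short check of the arithmetic, with a hypothetical helper name:

```python
def residues_from_orf(orf_length_bp: int) -> int:
    """Amino acid count for an ORF whose length includes the stop codon (assumed)."""
    if orf_length_bp % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_length_bp // 3
    return codons - 1  # the stop codon does not encode a residue

print(residues_from_orf(2142))  # 2142 / 3 = 714 codons, minus the stop codon
```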
Setliff, Ian; McDonnell, Wyatt J; Raju, Nagarajan; Bombardi, Robin G; Murji, Amyn A; Scheepers, Cathrine; Ziki, Rutendo; Mynhardt, Charissa; Shepherd, Bryan E; Mamchak, Alusha A; Garrett, Nigel; Karim, Salim Abdool; Mallal, Simon A; Crowe, James E; Morris, Lynn; Georgiev, Ivelin S
2018-06-13
Characterization of single antibody lineages within infected individuals has provided insights into the development of Env-specific antibodies. However, a systems-level understanding of the humoral response against HIV-1 is limited. Here, we interrogated the antibody repertoires of multiple HIV-infected donors from an infection-naive state through acute and chronic infection using next-generation sequencing. This analysis revealed the existence of "public" antibody clonotypes that were shared among multiple HIV-infected individuals. The HIV-1 reactivity for representative antibodies from an identified public clonotype shared by three donors was confirmed. Furthermore, a meta-analysis of publicly available antibody repertoire sequencing datasets revealed antibodies with high sequence identity to known HIV-reactive antibodies, even in repertoires that were reported to be HIV naive. The discovery of public antibody clonotypes in HIV-infected individuals represents an avenue of significant potential for better understanding antibody responses to HIV-1 infection, as well as for clonotype-specific vaccine development. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Leakey, Tatiana I; Zielinski, Jerzy; Siegfried, Rachel N; Siegel, Eric R; Fan, Chun-Yang; Cooney, Craig A
2008-06-01
DNA methylation at cytosines is a widely studied epigenetic modification. Methylation is commonly detected using bisulfite modification of DNA followed by PCR and additional techniques such as restriction digestion or sequencing. These additional techniques are either laborious, require specialized equipment, or are not quantitative. Here we describe a simple algorithm that yields quantitative results from analysis of conventional four-dye-trace sequencing. We call this method Mquant and we compare it with the established laboratory method of combined bisulfite restriction assay (COBRA). This analysis of sequencing electropherograms provides a simple, easily applied method to quantify DNA methylation at specific CpG sites.
Analysis Commons, A Team Approach to Discovery in a Big-Data Environment for Genetic Epidemiology
Brody, Jennifer A.; Morrison, Alanna C.; Bis, Joshua C.; O'Connell, Jeffrey R.; Brown, Michael R.; Huffman, Jennifer E.; Ames, Darren C.; Carroll, Andrew; Conomos, Matthew P.; Gabriel, Stacey; Gibbs, Richard A.; Gogarten, Stephanie M.; Gupta, Namrata; Jaquish, Cashell E.; Johnson, Andrew D.; Lewis, Joshua P.; Liu, Xiaoming; Manning, Alisa K.; Papanicolaou, George J.; Pitsillides, Achilleas N.; Rice, Kenneth M.; Salerno, William; Sitlani, Colleen M.; Smith, Nicholas L.; Heckbert, Susan R.; Laurie, Cathy C.; Mitchell, Braxton D.; Vasan, Ramachandran S.; Rich, Stephen S.; Rotter, Jerome I.; Wilson, James G.; Boerwinkle, Eric; Psaty, Bruce M.; Cupples, L. Adrienne
2017-01-01
The exploding volume of whole-genome sequence (WGS) and multi-omics data requires new approaches for analysis. As one solution, we have created a cloud-based Analysis Commons, which brings together genotype and phenotype data from multiple studies in a setting that is accessible by multiple investigators. This framework addresses many of the challenges of multi-center WGS analyses, including data sharing mechanisms, phenotype harmonization, integrated multi-omics analyses, annotation, and computational flexibility. In this setting, the computational pipeline facilitates a sequence-to-discovery analysis workflow illustrated here by an analysis of plasma fibrinogen levels in 3996 individuals from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) WGS program. The Analysis Commons represents a novel model for transforming WGS resources from a massive quantity of phenotypic and genomic data into knowledge of the determinants of health and disease risk in diverse human populations. PMID:29074945
Sequential addition of short DNA oligos in DNA-polymerase-based synthesis reactions
Gardner, Shea N; Mariella, Jr., Raymond P; Christian, Allen T; Young, Jennifer A; Clague, David S
2013-06-25
A method of preselecting a multiplicity of DNA sequence segments that will comprise the DNA molecule of user-defined sequence, separating the DNA sequence segments temporally, and combining the multiplicity of DNA sequence segments with at least one polymerase enzyme wherein the multiplicity of DNA sequence segments join to produce the DNA molecule of user-defined sequence. Sequence segments may be of length n, where n is an odd integer. In one embodiment the length of desired hybridizing overlap is specified by the user and the sequences and the protocol for combining them are guided by computational (bioinformatics) predictions. In one embodiment sequence segments are combined from multiple reading frames to span the same region of a sequence, so that multiple desired hybridizations may occur with different overlap lengths.
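The idea of covering a user-defined sequence with segments that share a user-specified hybridizing overlap can be illustrated with a simple tiling routine. This is only a sketch under stated assumptions: the function name and parameters are hypothetical, it tiles a single strand, and it does not model the temporal separation, the polymerase step, or the computational predictions described in the record.

```python
def overlapping_segments(sequence: str, segment_length: int, overlap: int) -> list[str]:
    """Tile `sequence` with segments in which consecutive segments share `overlap` bases."""
    if not 0 <= overlap < segment_length:
        raise ValueError("overlap must be non-negative and shorter than the segment length")
    step = segment_length - overlap
    segments = []
    start = 0
    while True:
        segments.append(sequence[start:start + segment_length])
        if start + segment_length >= len(sequence):
            break  # the last segment reaches the end (it may be shorter than the rest)
        start += step
    return segments

# Odd segment length (5) with a 2-base overlap on a made-up sequence.
for segment in overlapping_segments("ACGTACGTACGT", 5, 2):
    print(segment)
```

Each segment ends with the same two bases that begin the next one, which is the property the overlap parameter is meant to guarantee.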
Chen, DaYang; Zhen, HeFu; Qiu, Yong; Liu, Ping; Zeng, Peng; Xia, Jun; Shi, QianYu; Xie, Lin; Zhu, Zhu; Gao, Ya; Huang, GuoDong; Wang, Jian; Yang, HuanMing; Chen, Fang
2018-03-21
Research based on a strategy of single-cell low-coverage whole genome sequencing (SLWGS) has enabled better reproducibility and accuracy for detection of copy number variations (CNVs). The whole genome amplification (WGA) method and sequencing platform are critical factors for successful SLWGS (<0.1 × coverage). In this study, we compared single cell and multiple cells sequencing data produced by the HiSeq2000 and Ion Proton platforms using two WGA kits and then comprehensively evaluated the GC-bias, reproducibility, uniformity and CNV detection among different experimental combinations. Our analysis demonstrated that the PicoPLEX WGA Kit resulted in higher reproducibility, lower sequencing error frequency but more GC-bias than the GenomePlex Single Cell WGA Kit (WGA4 kit) independent of the cell number on the HiSeq2000 platform. On the Ion Proton platform, by contrast, the WGA4 kit (both single cell and multiple cells) had higher uniformity and less GC-bias but lower reproducibility than the PicoPLEX WGA Kit. Moreover, on these two sequencing platforms, the sensitivity and specificity of CNV detection with the two WGA kits differed depending on cell number. The results can help researchers who plan to use SLWGS on single or multiple cells to select appropriate experimental conditions for their applications.
REFGEN and TREENAMER: Automated Sequence Data Handling for Phylogenetic Analysis in the Genomic Era
Leonard, Guy; Stevens, Jamie R.; Richards, Thomas A.
2009-01-01
The phylogenetic analysis of nucleotide sequences and increasingly that of amino acid sequences is used to address a number of biological questions. Access to extensive datasets, including numerous genome projects, means that standard phylogenetic analyses can include many hundreds of sequences. Unfortunately, most phylogenetic analysis programs do not tolerate the sequence naming conventions of genome databases. Managing large numbers of sequences and standardizing sequence labels for use in phylogenetic analysis programs can be a time consuming and laborious task. Here we report the availability of an online resource for the management of gene sequences recovered from public access genome databases such as GenBank. These web utilities include the facility for renaming every sequence in a FASTA alignment file, with each sequence label derived from a user-defined combination of the species name and/or database accession number. This facility enables the user to keep track of the branching order of the sequences/taxa during multiple tree calculations and re-optimisations. Post phylogenetic analysis, these webpages can then be used to rename every label in the subsequent tree files (with a user-defined combination of species name and/or database accession number). Together these programs drastically reduce the time required for managing sequence alignments and labelling phylogenetic figures. Additional features of our platform include the automatic removal of identical accession numbers (recorded in the report file) and generation of species and accession number lists for use in supplementary materials or figure legends. PMID:19812722
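The renaming step described in this abstract, replacing each FASTA label with a combination of species name and accession number, can be sketched as follows. This is not the REFGEN implementation: the function name, the lookup table, and the example accession are all hypothetical, and the sketch assumes the user supplies the species and accession for each original label.

```python
def rename_fasta(lines: list[str], labels: dict[str, tuple[str, str]]) -> list[str]:
    """Replace each FASTA header with 'Genus_species_ACCESSION'; sequence lines pass through."""
    renamed = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:].strip()
            species, accession = labels[original]  # KeyError if a label is unmapped
            renamed.append(">" + species.replace(" ", "_") + "_" + accession)
        else:
            renamed.append(line)
    return renamed

# Made-up record; the header text and accession are illustrative only.
fasta = [">seq1 hypothetical protein [example]", "MKTAYIAKQR", "QISFVKSHFS"]
table = {"seq1 hypothetical protein [example]": ("Homo sapiens", "AB000001")}

for out_line in rename_fasta(fasta, table):
    print(out_line)
```

Keeping the lookup table alongside the alignment makes it possible to apply the same mapping later to the labels in the resulting tree files.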
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jaing, Crystal; Vergez, Lisa; Hinckley, Aubree
2011-06-21
The objective of this project is to provide DHS a comprehensive evaluation of the current genomic technologies including genotyping, Taqman PCR, multiple locus variable tandem repeat analysis (MLVA), microarray and high-throughput DNA sequencing in the analysis of biothreat agents from complex environmental samples. As the result of a different DHS project, we have selected for and isolated a large number of ciprofloxacin resistant B. anthracis Sterne isolates. These isolates vary in the concentrations of ciprofloxacin that they can tolerate, suggesting multiple mutations in the samples. In collaboration with University of Houston, Eureka Genomics and Oak Ridge National Laboratory, we analyzed the ciprofloxacin resistant B. anthracis Sterne isolates by microarray hybridization, Illumina and Roche 454 sequencing to understand the error rates and sensitivity of the different methods. The report provides an assessment of the results and a complete set of all protocols used and all data generated along with information to interpret the protocols and data sets.
[Genetic analysis of two children patients affected with CHARGE syndrome].
Li, Guoqiang; Li, Niu; Xu, Yufei; Li, Juan; Ding, Yu; Shen, Yiping; Wang, Xiumin; Wang, Jian
2018-04-10
To analyze two Chinese pediatric patients with multiple malformations and growth and developmental delay. Both patients were subjected to targeted gene sequencing, and the results were analyzed with Ingenuity Variant Analysis software. Suspected pathogenic variants were verified by Sanger sequencing. High-throughput sequencing showed that both patients carried heterozygous variants of the CHD7 gene. Patient 1 carried a nonsense mutation in exon 36 (c.7957C>T, p.Arg2653*), while patient 2 carried a nonsense mutation in exon 2 (c.718C>T, p.Gln240*). Sanger sequencing confirmed the above mutations in both patients, while their parents were wild-type at the corresponding sites, indicating that the two mutations occurred de novo. Both patients were diagnosed with CHARGE syndrome by high-throughput sequencing.
Squires, R Burke; Pickett, Brett E; Das, Sajal; Scheuermann, Richard H
2014-12-01
In 2009 a novel pandemic H1N1 influenza virus (H1N1pdm09) emerged as the first official influenza pandemic of the 21st century. Early genomic sequence analysis pointed to the swine origin of the virus. Here we report a novel computational approach to determine the evolutionary trajectory of viral sequences that uses data-driven estimations of nucleotide substitution rates to track the gradual accumulation of observed sequence alterations over time. Phylogenetic analysis and multiple sequence alignments show that sequences belonging to the resulting evolutionary trajectory of the H1N1pdm09 lineage exhibit a gradual accumulation of sequence variations and tight temporal correlations in the topological structure of the phylogenetic trees. These results suggest that our evolutionary trajectory analysis (ETA) can more effectively pinpoint the evolutionary history of viruses, including the host and geographical location traversed by each segment, when compared against either BLAST or traditional phylogenetic analysis alone. Copyright © 2014 Elsevier B.V. All rights reserved.
Uptake, Results, and Outcomes of Germline Multiple-Gene Sequencing After Diagnosis of Breast Cancer.
Kurian, Allison W; Ward, Kevin C; Hamilton, Ann S; Deapen, Dennis M; Abrahamse, Paul; Bondarenko, Irina; Li, Yun; Hawley, Sarah T; Morrow, Monica; Jagsi, Reshma; Katz, Steven J
2018-05-10
Low-cost sequencing of multiple genes is increasingly available for cancer risk assessment. Little is known about uptake or outcomes of multiple-gene sequencing after breast cancer diagnosis in community practice. To examine the effect of multiple-gene sequencing on the experience and treatment outcomes for patients with breast cancer. For this population-based retrospective cohort study, patients with breast cancer diagnosed from January 2013 to December 2015 and accrued from SEER registries across Georgia and in Los Angeles, California, were surveyed (n = 5080, response rate = 70%). Responses were merged with SEER data and results of clinical genetic tests, either BRCA1 and BRCA2 (BRCA1/2) sequencing only or including additional other genes (multiple-gene sequencing), provided by 4 laboratories. Type of testing (multiple-gene sequencing vs BRCA1/2-only sequencing), test results (negative, variant of unknown significance, or pathogenic variant), patient experiences with testing (timing of testing, who discussed results), and treatment (strength of patient consideration of, and surgeon recommendation for, prophylactic mastectomy), and prophylactic mastectomy receipt. We defined a patient subgroup with higher pretest risk of carrying a pathogenic variant according to practice guidelines. Among 5026 patients (mean [SD] age, 59.9 [10.7]), 1316 (26.2%) were linked to genetic results from any laboratory. Multiple-gene sequencing increasingly replaced BRCA1/2-only testing over time: in 2013, the rate of multiple-gene sequencing was 25.6% and BRCA1/2-only testing, 74.4%; in 2015 the rate of multiple-gene sequencing was 66.5% and BRCA1/2-only testing, 33.5%. Multiple-gene sequencing was more often ordered by genetic counselors (multiple-gene sequencing, 25.5% and BRCA1/2-only testing, 15.3%) and delayed until after surgery (multiple-gene sequencing, 32.5% and BRCA1/2-only testing, 19.9%).
Multiple-gene sequencing substantially increased rate of detection of any pathogenic variant (multiple-gene sequencing: higher-risk patients, 12%; average-risk patients, 4.2% and BRCA1/2-only testing: higher-risk patients, 7.8%; average-risk patients, 2.2%) and variants of uncertain significance, especially in minorities (multiple-gene sequencing: white patients, 23.7%; black patients, 44.5%; and Asian patients, 50.9% and BRCA1/2-only testing: white patients, 2.2%; black patients, 5.6%; and Asian patients, 0%). Multiple-gene sequencing was not associated with an increase in the rate of prophylactic mastectomy use, which was highest with pathogenic variants in BRCA1/2 (BRCA1/2, 79.0%; other pathogenic variant, 37.6%; variant of uncertain significance, 30.2%; negative, 35.3%). Multiple-gene sequencing rapidly replaced BRCA1/2-only testing for patients with breast cancer in the community and enabled 2-fold higher detection of clinically relevant pathogenic variants without an associated increase in prophylactic mastectomy. However, important targets for improvement in the clinical utility of multiple-gene sequencing include postsurgical delay and racial/ethnic disparity in variants of uncertain significance.
A new polymorphic and multicopy MHC gene family related to nonmammalian class I
DOE Office of Scientific and Technical Information (OSTI.GOV)
Leelayuwat, C.; Degli-Esposti, M.A.; Abraham, L.J.
1994-12-31
The authors have used genomic analysis to characterize a region of the central major histocompatibility complex (MHC) spanning approximately 300 kilobases (kb) between TNF and HLA-B. This region has been suggested to carry genetic factors relevant to the development of autoimmune diseases such as myasthenia gravis (MG) and insulin dependent diabetes mellitus (IDDM). Genomic sequence was analyzed for coding potential, using two neural network programs, GRAIL and GeneParser. A genomic probe, JAB, containing putative coding sequences (PERB11) located 60 kb centromeric of HLA-B, was used for northern analysis of human tissues. Multiple transcripts were detected. Southern analysis of genomic DNA and overlapping YAC clones, covering the region from BAT1 to HLA-F, indicated that there are at least five copies of PERB11, four of which are located within this region of the MHC. The partial cDNA sequence of PERB11 was obtained from poly-A RNA derived from skeletal muscle. The putative amino acid sequence of PERB11 shares approximately 30% identity to MHC class I molecules from various species, including reptiles, chickens, and frogs, as well as to other MHC class I-like molecules, such as the IgG FcR of the mouse and rat and the human Zn-α2-glycoprotein. From direct comparison of amino acid sequences, it is concluded that PERB11 is a distinct molecule more closely related to nonmammalian than known mammalian MHC class I molecules. Genomic sequence analysis of PERB11 from five MHC ancestral haplotypes (AH) indicated that the gene is polymorphic at both DNA and protein level. The results suggest that the authors have identified a novel polymorphic gene family with multiple copies within the MHC. 48 refs., 10 figs., 2 tabs.
Bandeira, Nuno; Clauser, Karl R; Pevzner, Pavel A
2007-07-01
Despite significant advances in the identification of known proteins, the analysis of unknown proteins by MS/MS still remains a challenging open problem. Although Klaus Biemann recognized the potential of MS/MS for sequencing of unknown proteins in the 1980s, low throughput Edman degradation followed by cloning still remains the main method to sequence unknown proteins. The automated interpretation of MS/MS spectra has been limited by a focus on individual spectra and has not capitalized on the information contained in spectra of overlapping peptides. Indeed the powerful shotgun DNA sequencing strategies have not been extended to automated protein sequencing. We demonstrate, for the first time, the feasibility of automated shotgun protein sequencing of protein mixtures by utilizing MS/MS spectra of overlapping and possibly modified peptides generated via multiple proteases of different specificities. We validate this approach by generating highly accurate de novo reconstructions of multiple regions of various proteins in western diamondback rattlesnake venom. We further argue that shotgun protein sequencing has the potential to overcome the limitations of current protein sequencing approaches and thus catalyze the otherwise impractical applications of proteomics methodologies in studies of unknown proteins.
Integrated databanks access and sequence/structure analysis services at the PBIL.
Perrière, Guy; Combet, Christophe; Penel, Simon; Blanchet, Christophe; Thioulouse, Jean; Geourjon, Christophe; Grassot, Julien; Charavay, Céline; Gouy, Manolo; Duret, Laurent; Deléage, Gilbert
2003-07-01
The World Wide Web server of the PBIL (Pôle Bioinformatique Lyonnais) provides on-line access to sequence databanks and to many tools for nucleic acid and protein sequence analysis. This server allows users to query nucleotide sequence banks in the EMBL and GenBank formats and protein sequence banks in the SWISS-PROT and PIR formats. The query engine on which our data bank access is based is the ACNUC system. It makes it possible to build complex queries to access functional zones of biological interest and to retrieve large sequence sets. Of special interest are the unique features provided by this system to query the data banks of gene families developed at the PBIL. The server also provides access to a wide range of sequence analysis methods: similarity search programs, multiple alignments, protein structure prediction and multivariate statistics. An original feature of this server is the integration of these two aspects: sequence retrieval and sequence analysis. Indeed, thanks to the introduction of re-usable lists, it is possible to perform treatments on large sets of data. The PBIL server can be reached at: http://pbil.univ-lyon1.fr.
Quantiprot - a Python package for quantitative analysis of protein sequences.
Konopka, Bogumił M; Marciniak, Marta; Dyrka, Witold
2017-07-17
The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties define a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates the distribution of n-grams and computes the Zipf's law coefficient. We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where a large number of sequences generated by the model can be compared to actually observed sequences.
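The n-gram distribution mentioned in the abstract above can be illustrated with a minimal sketch in plain Python. This does not use the Quantiprot API; the function names `ngram_counts` and `ngram_frequencies` and the toy sequence are hypothetical and serve only to show what an overlapping n-gram count looks like.

```python
from collections import Counter


def ngram_counts(sequence, n=2):
    """Count overlapping n-grams (substrings of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequence, n=2):
    """Convert n-gram counts into relative frequencies that sum to 1."""
    counts = ngram_counts(sequence, n)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical toy protein fragment, not taken from any database.
seq = "MKVLAAGMKVL"
counts = ngram_counts(seq, 2)
freqs = ngram_frequencies(seq, 2)
```

Frequency vectors of this kind could then be compared between sequences without an alignment, which is the sort of alignment-free use the abstract proposes.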
Etienne, Kizee A.; Gillece, John; Hilsabeck, Remy; Schupp, Jim M.; Colman, Rebecca; Lockhart, Shawn R.; Gade, Lalitha; Thompson, Elizabeth H.; Sutton, Deanna A.; Neblett-Fanfair, Robyn; Park, Benjamin J.; Turabelidze, George; Keim, Paul; Brandt, Mary E.; Deak, Eszter; Engelthaler, David M.
2012-01-01
Case reports of Apophysomyces spp. in immunocompetent hosts have been a result of traumatic deep implantation of Apophysomyces spp. spore-contaminated soil or debris. On May 22, 2011 a tornado occurred in Joplin, MO, leaving 13 tornado victims with Apophysomyces trapeziformis infections as a result of lacerations from airborne material. We used whole genome sequence typing (WGST) for high-resolution phylogenetic SNP analysis of 17 outbreak Apophysomyces isolates and five additional temporally and spatially diverse Apophysomyces control isolates (three A. trapeziformis and two A. variabilis isolates). Whole genome SNP phylogenetic analysis revealed three clusters of genotypically related or identical A. trapeziformis isolates and multiple distinct isolates among the Joplin group; this indicated multiple genotypes from a single or multiple sources. Though no linkage between genotype and location of exposure was observed, WGST analysis determined that the Joplin isolates were more closely related to each other than to the control isolates, suggesting local population structure. Additionally, species delineation based on WGST demonstrated the need to reassess currently accepted taxonomic classifications of phylogenetic species within the genus Apophysomyces. PMID:23209631
Harmonic Analysis of Sedimentary Cyclic Sequences in Kansas, Midcontinent, USA
Merriam, D.F.; Robinson, J.E.
1997-01-01
Several stratigraphic sequences in the Upper Carboniferous (Pennsylvanian) in Kansas (Midcontinent, USA) were analyzed quantitatively for periodic repetitions. The sequences were coded by lithologic type into strings of datasets. The strings then were analyzed by an adaptation of a one-dimensional Fourier transform analysis and examined for evidence of periodicity. The method was tested using different states in coding to determine the robustness of the method and data. The most persistent response is in multiples of 8-10 ft (2.5-3.0 m) and probably is dependent on the depositional thickness of the original lithologic units. Other cyclicities occurred in multiples of the basic frequency of 8-10 with persistent ones at 22 and 30 feet (6.5-9.0 m) and large ones at 80 and 160 feet (25-50 m). These levels of thickness relate well to the basic cyclothem and megacyclothem as measured on outcrop. We propose that this approach is a suitable one for analyzing cyclic events in the stratigraphic record.
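As a rough illustration of the approach described above, the sketch below codes a lithologic column as numbers and looks for the strongest period with a one-dimensional discrete Fourier transform. The coding scheme, the sample column, and the function names are assumptions made for illustration; the abstract does not specify the authors' adaptation of the transform.

```python
import cmath


def power_spectrum(values):
    """Return (harmonic k, power) pairs for k = 1 .. n/2 of a mean-centered series."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append((k, abs(coeff) ** 2 / n))
    return spectrum


def dominant_period(values):
    """Period (in sample intervals) of the harmonic with the largest power."""
    k, _ = max(power_spectrum(values), key=lambda pair: pair[1])
    return len(values) / k


# Hypothetical two-state coding of a column sampled at equal depth intervals.
codes = {"shale": 0, "limestone": 1}
column = ["shale", "shale", "limestone", "limestone"] * 6
signal = [codes[unit] for unit in column]
period = dominant_period(signal)
```

Multiplying the period by the sampling interval would convert it to a thickness, which is how cycle lengths in feet or metres such as those quoted in the abstract could be read off.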
Wang, Yongjie; Kleespies, Regina G; Ramle, Moslim B; Jehle, Johannes A
2008-09-01
The genomic sequence analysis of many large dsDNA viruses is hampered by the lack of enough sample materials. Here, we report a whole genome amplification of the Oryctes rhinoceros nudivirus (OrNV) isolate Ma07 starting from as few as about 10 ng of purified viral DNA by application of phi29 DNA polymerase- and exonuclease-resistant random hexamer-based multiple displacement amplification (MDA) method. About 60 microg of high molecular weight DNA with fragment sizes of up to 25 kbp was amplified. A genomic DNA clone library was generated using the product DNA. After 8-fold sequencing coverage, the 127,615 bp of OrNV whole genome was sequenced successfully. The results demonstrate that the MDA-based whole genome amplification enables rapid access to genomic information from exiguous virus samples.
IVisTMSA: Interactive Visual Tools for Multiple Sequence Alignments.
Pervez, Muhammad Tariq; Babar, Masroor Ellahi; Nadeem, Asif; Aslam, Naeem; Naveed, Nasir; Ahmad, Sarfraz; Muhammad, Shah; Qadri, Salman; Shahid, Muhammad; Hussain, Tanveer; Javed, Maryam
2015-01-01
IVisTMSA is a software package of seven graphical tools for multiple sequence alignments. MSApad is an editing and analysis tool. It can load 409% more data than Jalview, STRAP, CINEMA, and Base-by-Base. MSA comparator allows the user to visualize consistent and inconsistent regions of reference and test alignments of more than 21-MB size in less than 12 seconds. MSA comparator is 5,200% efficient and more than 40% efficient as compared to BALiBASE c program and FastSP, respectively. MSA reconstruction tool provides graphical user interfaces for four popular aligners and allows the user to load several sequence files at a time. FASTA generator converts seven formats of alignments of unlimited size into FASTA format in a few seconds. MSA ID calculator calculates the identity matrix of more than 11,000 sequences with a sequence length of 2,696 base pairs in less than 100 seconds. Tree and Distance Matrix calculation tools generate a phylogenetic tree and a distance matrix, respectively, using neighbor joining, % identity, and the BLOSUM 62 matrix.
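An identity matrix of the kind the MSA ID calculator produces can be sketched as follows. This is not IVisTMSA code; the convention used here (columns where both sequences have a gap are skipped, and a gap against a residue counts as a mismatch) is one assumption among several in common use, and the alignment is a made-up example.

```python
def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(alignment):
    """Symmetric matrix of pairwise percent identities."""
    return [[percent_identity(a, b) for b in alignment] for a in alignment]


# Hypothetical three-sequence alignment.
alignment = ["ACGT-A", "ACGTTA", "TCGT-A"]
matrix = identity_matrix(alignment)
```

The double loop is quadratic in the number of sequences, so a tool handling more than 11,000 sequences would need a considerably more efficient implementation than this sketch.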
Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G
2012-09-01
Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB based alignments are used to obtain a three dimensional fit of the structures followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs show that the alignment quality is better than MULTIPROT, MUSTANG and the alignments in HOMSTRAD, in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods.
[Exome sequencing revealed Allan-Herndon-Dudley syndrome underlying multiple disabilities].
Arvio, Maria; Philips, Anju K; Ahvenainen, Minna; Somer, Mirja; Kalscheuer, Vera; Järvelä, Irma
2014-01-01
Normal function of the thyroid gland is the cornerstone of a child's mental development and physical growth. We describe a Finnish family, in which the diagnosis of three brothers became clear after investigations that lasted for more than 30 years. Two of the sons have already died. DNA analysis of the third one, a 16-year-old boy, revealed in exome sequencing of the complete X chromosome a mutation in the SLC16A2 gene, i.e. MCT8, coding for a thyroid hormone transport protein. Allan-Herndon-Dudley syndrome was thus shown to be the cause of multiple disabilities.
ISRNA: an integrative online toolkit for short reads from high-throughput sequencing data.
Luo, Guan-Zheng; Yang, Wei; Ma, Ying-Ke; Wang, Xiu-Jie
2014-02-01
Integrative Short Reads NAvigator (ISRNA) is an online toolkit for analyzing high-throughput small RNA sequencing data. Besides the high-speed genome mapping function, ISRNA provides statistics for genomic location, length distribution and nucleotide composition bias analysis of sequence reads. Number of reads mapped to known microRNAs and other classes of short non-coding RNAs, coverage of short reads on genes, expression abundance of sequence reads as well as some other analysis functions are also supported. The versatile search functions enable users to select sequence reads according to their sub-sequences, expression abundance, genomic location, relationship to genes, etc. A specialized genome browser is integrated to visualize the genomic distribution of short reads. ISRNA also supports management and comparison among multiple datasets. ISRNA is implemented in Java/C++/Perl/MySQL and can be freely accessed at http://omicslab.genetics.ac.cn/ISRNA/.
A Polyglot Approach to Bioinformatics Data Integration: A Phylogenetic Analysis of HIV-1
Reisman, Steven; Hatzopoulos, Thomas; Läufer, Konstantin; Thiruvathukal, George K.; Putonti, Catherine
2016-01-01
As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 sequences. Phylogenetic analyses were conducted for >6,000 HIV-1 sequences revealing spatial and temporal factors influence the evolution of the individual genes uniquely. Nevertheless, signatures of origin can be extrapolated even despite increased globalization. The approach developed here can easily be customized for any species of interest. PMID:26819543
Sequential addition of short DNA oligos in DNA-polymerase-based synthesis reactions
Gardner, Shea N [San Leandro, CA; Mariella, Jr., Raymond P.; Christian, Allen T [Tracy, CA; Young, Jennifer A [Berkeley, CA; Clague, David S [Livermore, CA
2011-01-18
A method of fabricating a DNA molecule of user-defined sequence. The method comprises the steps of preselecting a multiplicity of DNA sequence segments that will comprise the DNA molecule of user-defined sequence, separating the DNA sequence segments temporally, and combining the multiplicity of DNA sequence segments with at least one polymerase enzyme wherein the multiplicity of DNA sequence segments join to produce the DNA molecule of user-defined sequence. Sequence segments may be of length n, where n is an even or odd integer. In one embodiment the length of desired hybridizing overlap is specified by the user and the sequences and the protocol for combining them are guided by computational (bioinformatics) predictions. In one embodiment sequence segments are combined from multiple reading frames to span the same region of a sequence, so that multiple desired hybridizations may occur with different overlap lengths. In one embodiment starting sequence fragments are of different lengths, n, n+1, n+2, etc.
Conservation of tubulin-binding sequences in TRPV1 throughout evolution.
Sardar, Puspendu; Kumar, Abhishek; Bhandari, Anita; Goswami, Chandan
2012-01-01
Transient Receptor Potential Vanilloid subtype 1 (TRPV1), commonly known as the capsaicin receptor, can detect multiple stimuli, including noxious compounds, low pH, temperature, and electromagnetic waves at different ranges. In addition, this receptor is involved in multiple physiological and sensory processes. Therefore, functions of TRPV1 also have direct influences on adaptation and further evolution. The availability of various eukaryotic genomic sequences in the public domain facilitates the study of the molecular evolution of the TRPV1 protein and the respective conservation of certain domains, motifs and interacting regions that are functionally important. Using statistical and bioinformatics tools, our analysis reveals that TRPV1 evolved ∼420 million years ago (MYA). Our analysis reveals that specific regions, domains and motifs of TRPV1 have undergone different selection pressures and thus have different levels of conservation. We found that, among all of these, the TRP box is the most conserved and thus has functional significance. Our results also indicate that the tubulin binding sequences (TBS) have evolutionary significance, as these sequence stretches are more conserved than many other essential regions of TRPV1. The overall distribution of positively charged residues within the TBS motifs is conserved throughout evolution. In silico analysis reveals that the TBS-1 and TBS-2 of TRPV1 can form helical structures and may play an important role in TRPV1 function. Our analysis identifies the regions of TRPV1 that are important for the structure-function relationship. This analysis indicates that tubulin binding sequence-1 (TBS-1) near the TRP box forms a potential helix and that the tubulin interactions with TRPV1 via TBS-1 have evolutionary significance. This interaction may be required for proper channel function and regulation and may also have significance in the context of Taxol®-induced neuropathy.
FunGene: the functional gene pipeline and repository.
Fish, Jordan A; Chai, Benli; Wang, Qiong; Sun, Yanni; Brown, C Titus; Tiedje, James M; Cole, James R
2013-01-01
Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Patterns and Sequences: Interactive Exploration of Clickstreams to Understand Common Visitor Paths.
Liu, Zhicheng; Wang, Yang; Dontcheva, Mira; Hoffman, Matthew; Walker, Seth; Wilson, Alan
2017-01-01
Modern web clickstream data consists of long, high-dimensional sequences of multivariate events, making it difficult to analyze. Following the overarching principle that the visual interface should provide information about the dataset at multiple levels of granularity and allow users to easily navigate across these levels, we identify four levels of granularity in clickstream analysis: patterns, segments, sequences and events. We present an analytic pipeline consisting of three stages: pattern mining, pattern pruning and coordinated exploration between patterns and sequences. Based on this approach, we discuss properties of maximal sequential patterns, propose methods to reduce the number of patterns and describe design considerations for visualizing the extracted sequential patterns and the corresponding raw sequences. We demonstrate the viability of our approach through an analysis scenario and discuss the strengths and limitations of the methods based on user feedback.
NASA Astrophysics Data System (ADS)
Amiroch, S.; Pradana, M. S.; Irawan, M. I.; Mukhlash, I.
2017-09-01
Multiple alignment (MA) is a particularly important tool for studying viral genomes and determining the evolutionary process of a specific virus. Applying MA to the spread of the severe acute respiratory syndrome (SARS) epidemic is of interest because the epidemic spread so quickly a few years ago that it drew medical attention in many countries. Although much software exists for processing multiple sequences, the use of pairwise alignment to build the MA deserves careful consideration. Previous research used the Super Pairwise Alignment algorithm to align the sequences for MA, whereas this study uses the Needleman-Wunsch dynamic programming algorithm, simulated in Matlab. The MA analysis yields stable and unstable regions, which indicate the positions where mutations occur; a network topology that produces the phylogenetic tree of the SARS epidemic by a distance method; and a network of mutation areas.
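For reference, the Needleman-Wunsch dynamic programming algorithm named in the abstract above can be sketched as follows. The study used Matlab; this Python version is only an illustration, and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions, not the parameters used by the authors.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

A multiple alignment built from pairwise steps would call a routine like this repeatedly, for example in a progressive scheme; how the pairwise results are combined is not described in the abstract.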
Automatic detection of pelvic lymph nodes using multiple MR sequences
NASA Astrophysics Data System (ADS)
Yan, Michelle; Lu, Yue; Lu, Renzhi; Requardt, Martin; Moeller, Thomas; Takahashi, Satoru; Barentsz, Jelle
2007-03-01
A system for automatic detection of pelvic lymph nodes is developed by incorporating complementary information extracted from multiple MR sequences. A single MR sequence lacks sufficient diagnostic information for lymph node localization and staging. Correct diagnosis often requires input from multiple complementary sequences, which makes manual detection of lymph nodes very labor intensive. Small lymph nodes are often missed even by highly-trained radiologists. The proposed system is aimed at assisting radiologists in finding lymph nodes faster and more accurately. To the best of our knowledge, this is the first such system reported in the literature. A 3-dimensional (3D) MR angiography (MRA) image is employed for extracting blood vessels that serve as a guide in searching for pelvic lymph nodes. Segmentation, shape and location analysis of potential lymph nodes are then performed using a high resolution 3D T1-weighted VIBE (T1-vibe) MR sequence acquired by a Siemens 3T scanner. An optional contrast-agent enhanced MR image, such as a post ferumoxtran-10 T2*-weighted MEDIC sequence, can also be incorporated to further improve detection accuracy of malignant nodes. The system outputs a list of potential lymph node locations that are overlaid onto the corresponding MR sequences and presents them to users with associated confidence levels as well as their sizes and lengths in each axis. Preliminary studies demonstrate the feasibility of automatic lymph node detection and scenarios in which this system may be used to assist radiologists in diagnosis and reporting.
Oligo Design: a computer program for development of probes for oligonucleotide microarrays.
Herold, Keith E; Rasooly, Avraham
2003-12-01
Oligonucleotide microarrays have demonstrated potential for the analysis of gene expression, genotyping, and mutational analysis. Our work focuses primarily on the detection and identification of bacteria based on known short sequences of DNA. Oligo Design, the software described here, automates several design aspects that enable the improved selection of oligonucleotides for use with microarrays for these applications. Two major features of the program are: (i) a tiling algorithm for the design of short overlapping temperature-matched oligonucleotides of variable length, which are useful for the analysis of single nucleotide polymorphisms and (ii) a set of tools for the analysis of multiple alignments of gene families and related short DNA sequences, which allow for the identification of conserved DNA sequences for PCR primer selection and variable DNA sequences for the selection of unique probes for identification. Note that the program does not address the full genome perspective but, instead, is focused on the genetic analysis of short segments of DNA. The program is Internet-enabled and includes a built-in browser and the automated ability to download sequences from GenBank by specifying the GI number. The program also includes several utilities, including audio recital of a DNA sequence (useful for verifying sequences against a written document), a random sequence generator that provides insight into the relationship between melting temperature and GC content, and a PCR calculator.
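The tiling of overlapping, temperature-matched oligonucleotides of variable length described above might look roughly like the sketch below. The abstract does not state which melting-temperature model Oligo Design uses; the Wallace rule (2 °C per A/T, 4 °C per G/C) is swapped in here as a simple stand-in, and the function names, parameters, and target sequence are all hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Approximate melting temperature in degrees C by the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def tile_oligos(target, target_tm=50, step=5, min_len=12, max_len=30):
    """Slide along the target; at each start, extend until Tm reaches target_tm."""
    oligos = []
    for start in range(0, len(target) - min_len + 1, step):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(target):
                break  # ran off the end before reaching the target Tm
            oligo = target[start:end]
            if wallace_tm(oligo) >= target_tm:
                oligos.append((start, oligo))
                break
    return oligos


# Hypothetical 42-base target, not a real gene fragment.
target = "ATGCGTACGTTAGCCTAGGCTAACGTTAGCATCGGATCCGTA"
oligos = tile_oligos(target)
```

Because GC-rich stretches reach the target temperature sooner, the oligos come out with different lengths but similar estimated melting temperatures, which also reflects the relationship between melting temperature and GC content mentioned in the abstract.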
CoCoNUT: an efficient system for the comparison and analysis of genomes
2008-01-01
Background Comparative genomics is the analysis and comparison of genomes from different species. This area of research is driven by the large number of sequenced genomes and heavily relies on efficient algorithms and software to perform pairwise and multiple genome comparisons. Results Most of the software tools available are tailored for one specific task. In contrast, we have developed a novel system CoCoNUT (Computational Comparative geNomics Utility Toolkit) that allows solving several different tasks in a unified framework: (1) finding regions of high similarity among multiple genomic sequences and aligning them, (2) comparing two draft or multi-chromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and (4) mapping cDNA/EST to genomic sequences. Conclusion CoCoNUT is competitive with other software tools w.r.t. the quality of the results. The use of state of the art algorithms and data structures allows CoCoNUT to solve comparative genomics tasks more efficiently than previous tools. With the improved user interface (including an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use software tool for large scale studies in comparative genomics. PMID:19014477
Nawrocki, Eric P.; Burge, Sarah W.
2013-01-01
The development of RNA bioinformatic tools began more than 30 y ago with the description of the Nussinov and Zuker dynamic programming algorithms for single sequence RNA secondary structure prediction. Since then, many tools have been developed for various RNA sequence analysis problems such as homology search, multiple sequence alignment, de novo RNA discovery, read-mapping, and many more. In this issue, we have collected a sampling of reviews and original research that demonstrate some of the many ways bioinformatics is integrated with current RNA biology research. PMID:23948768
Ma, Lijun; Lee, Letitia; Barani, Igor; Hwang, Andrew; Fogh, Shannon; Nakamura, Jean; McDermott, Michael; Sneed, Penny; Larson, David A; Sahgal, Arjun
2011-11-21
Rapid delivery of multiple shots or isocenters is one of the hallmarks of Gamma Knife radiosurgery. In this study, we investigated whether the temporal order of shots delivered with Gamma Knife Perfexion would significantly influence the biological equivalent dose for complex multi-isocenter treatments. Twenty single-target cases were selected for analysis. For each case, 3D dose matrices of individual shots were extracted and single-fraction equivalent uniform dose (sEUD) values were determined for all possible shot delivery sequences, corresponding to different patterns of temporal dose delivery within the target. We found significant variations in the sEUD values among these sequences exceeding 15% for certain cases. However, the sequences for the actual treatment delivery were found to agree (<3%) and to correlate (R² = 0.98) excellently with the sequences yielding the maximum sEUD values for all studied cases. This result is applicable for both fast and slow growing tumors with α/β values of 2 to 20 according to the linear-quadratic model. In conclusion, despite large potential variations in different shot sequences for multi-isocenter Gamma Knife treatments, current clinical delivery sequences exhibited consistent biological target dosing that approached that maximally achievable for all studied cases.
Genetic analysis of duck circovirus in Pekin ducks from South Korea.
Cha, S-Y; Kang, M; Cho, J-G; Jang, H-K
2013-11-01
The genetic organization of the 24 duck circovirus (DuCV) strains detected in commercial Pekin ducks from South Korea between 2011 and 2012 is described in this study. Multiple sequence alignment and phylogenetic analyses were performed on the 24 viral genome sequences as well as on 45 genome sequences available from the GenBank database. Phylogenetic analyses based on the genomic and open reading frame 2/cap sequences demonstrated that all DuCV strains belonged to genotype 1 and were designated in a subcluster under genotype 1. Analysis of the capsid protein amino acid sequences of the 24 Korean DuCV strains showed 10 substitutions compared with that of other genotype 1 strains. Our analysis showed that genotype 1 is predominant and circulating in South Korea. These present results serve as incentive to add more data to the DuCV database and provide insight to conduct further intensive study on the geographic relationships among these virus strains.
Protein Sectors: Statistical Coupling Analysis versus Conservation
Teşileanu, Tiberiu; Colwell, Lucy J.; Leibler, Stanislas
2015-01-01
Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed “sectors”. The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation. PMID:25723535
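The abstract above argues that, for single-sector proteins, sequence conservation alone carries most of the signal. As a rough illustration of what a per-column conservation measure looks like, here is a minimal Python sketch. It assumes Shannon entropy as the measure and a toy alignment; the function names are hypothetical, and this is not the weighting used by SCA itself, which the abstract does not specify.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Lower entropy means a more conserved column; 0 means fully conserved.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conservation_profile(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment (hypothetical data): column 1 is invariant, column 3 is fully variable.
toy_alignment = ["ACD", "ACE", "AGF", "AGG"]
profile = conservation_profile(toy_alignment)
```

Ranking columns by ascending entropy gives a simple conservation-only ordering that could be compared against sector assignments.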
Improved multiple displacement amplification (iMDA) and ultraclean reagents.
Motley, S Timothy; Picuri, John M; Crowder, Chris D; Minich, Jeremiah J; Hofstadler, Steven A; Eshoo, Mark W
2014-06-06
Next-generation sequencing sample preparation requires nanogram to microgram quantities of DNA; however, many relevant samples are comprised of only a few cells. Genomic analysis of these samples requires a whole genome amplification method that is unbiased and free of exogenous DNA contamination. To address these challenges we have developed protocols for the production of DNA-free consumables including reagents and have improved upon multiple displacement amplification (iMDA). A specialized ethylene oxide treatment was developed that renders free DNA and DNA present within Gram positive bacterial cells undetectable by qPCR. To reduce DNA contamination in amplification reagents, a combination of ion exchange chromatography, filtration, and lot testing protocols were developed. Our multiple displacement amplification protocol employs a second strand-displacing DNA polymerase, improved buffers, improved reaction conditions and DNA free reagents. The iMDA protocol, when used in combination with DNA-free laboratory consumables and reagents, significantly improved efficiency and accuracy of amplification and sequencing of specimens with moderate to low levels of DNA. The sensitivity and specificity of sequencing of amplified DNA prepared using iMDA was compared to that of DNA obtained with two commercial whole genome amplification kits using 10 fg (~1-2 bacterial cells worth) of bacterial genomic DNA as a template. Analysis showed >99% of the iMDA reads mapped to the template organism whereas only 0.02% of the reads from the commercial kits mapped to the template. To assess the ability of iMDA to achieve balanced genomic coverage, a non-stochastic amount of bacterial genomic DNA (1 pg) was amplified and sequenced, and data obtained were compared to sequencing data obtained directly from genomic DNA. 
The iMDA DNA and genomic DNA sequencing had comparable coverage: 99.98% of the reference genome at ≥1X coverage and 99.9% at ≥5X coverage, while maintaining both balance and representation of the genome. The iMDA protocol, in combination with DNA-free laboratory consumables, significantly improved the ability to sequence specimens with low levels of DNA. iMDA has broad utility in metagenomics, diagnostics, ancient DNA analysis, pre-implantation embryo screening, single-cell genomics, whole genome sequencing of unculturable organisms, and forensic applications for both human and microbial targets.
Macher, Hada C; Martinez-Broca, Maria A; Rubio-Calvo, Amalia; Leon-Garcia, Cristina; Conde-Sanchez, Manuel; Costa, Alzenira; Navarro, Elena; Guerrero, Juan M
2012-01-01
Multiple endocrine neoplasia type 2A (MEN2A) is a monogenic disorder with an autosomal dominant pattern of inheritance, characterized by a high risk of medullary thyroid carcinoma in all mutation carriers. Although this disorder is classified as a rare disease, affected patients have a low quality of life and require very expensive, continuous treatment. At present, MEN2A is diagnosed by gene sequencing after birth, with the aim of starting treatment early and reducing morbidity and mortality. We first evaluated the presence of the MEN2A mutation (C634Y) in serum of 25 patients, previously diagnosed by sequencing in peripheral blood leucocytes, using HRM genotyping analysis. In a second step, we used a COLD-PCR approach followed by HRM genotyping analysis for non-invasive prenatal diagnosis of a pregnant woman carrying a fetus with a C634Y mutation. HRM analysis revealed differences in melting curve shapes that correlated with patients diagnosed for MEN2A by gene sequencing analysis with 100% accuracy. Moreover, the pregnant woman carrying the fetus with the C634Y mutation revealed a melting curve shape in agreement with the positive controls in the COLD-PCR study. The mutation was confirmed by sequencing of the COLD-PCR amplification product. In conclusion, we have established HRM analysis in serum samples as a new primary diagnosis method suitable for the detection of C634Y mutations in MEN2A patients. In addition, we have applied the increased sensitivity of the COLD-PCR assay, combined with HRM analysis, to the non-invasive prenatal diagnosis of C634Y fetal mutations using serum from pregnant women.
Chen, Hui; Luthra, Rajyalakshmi; Goswami, Rashmi S; Singh, Rajesh R; Roy-Chowdhuri, Sinchita
2015-08-28
Application of next-generation sequencing (NGS) technology to routine clinical practice has enabled characterization of personalized cancer genomes to identify patients likely to have a response to targeted therapy. The proper selection of tumor sample for downstream NGS based mutational analysis is critical to generate accurate results and to guide therapeutic intervention. However, multiple pre-analytic factors come into play in determining the success of NGS testing. In this review, we discuss pre-analytic requirements for AmpliSeq PCR-based sequencing using Ion Torrent Personal Genome Machine (PGM) (Life Technologies), a NGS sequencing platform that is often used by clinical laboratories for sequencing solid tumors because of its low input DNA requirement from formalin fixed and paraffin embedded tissue. The success of NGS mutational analysis is affected not only by the input DNA quantity but also by several other factors, including the specimen type, the DNA quality, and the tumor cellularity. Here, we review tissue requirements for solid tumor NGS based mutational analysis, including procedure types, tissue types, tumor volume and fraction, decalcification, and treatment effects.
Quantitative analysis of the anti-noise performance of an m-sequence in an electromagnetic method
NASA Astrophysics Data System (ADS)
Yuan, Zhe; Zhang, Yiming; Zheng, Qijia
2018-02-01
An electromagnetic method with a transmitted waveform coded by an m-sequence achieved better anti-noise performance compared to the conventional manner with a square-wave. The anti-noise performance of the m-sequence varied with multiple coding parameters; hence, a quantitative analysis of the anti-noise performance for m-sequences with different coding parameters was required to optimize them. This paper proposes the concept of an identification system, with the identified Earth impulse response obtained by measuring the system output with the input of the voltage response. A quantitative analysis of the anti-noise performance of the m-sequence was achieved by analyzing the amplitude-frequency response of the corresponding identification system. The effects of the coding parameters on the anti-noise performance are summarized by numerical simulation, and their optimization is further discussed in our conclusions; the validity of the conclusions is further verified by field experiment. The quantitative analysis method proposed in this paper provides a new insight into the anti-noise mechanism of the m-sequence, and could be used to evaluate the anti-noise performance of artificial sources in other time-domain exploration methods, such as the seismic method.
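For readers unfamiliar with m-sequences, the following Python sketch shows how one period of such a sequence can be generated with a linear feedback shift register and why it suits noise rejection: its circular autocorrelation is sharply peaked. The register length, tap positions, and function names are illustrative assumptions and are not the coding parameters studied in the paper.

```python
def m_sequence(taps, nbits):
    """Generate one period of a maximal-length sequence from a Fibonacci LFSR.

    taps: 1-based register positions XORed to form the feedback bit.
    nbits: register length; suitable taps give a period of 2**nbits - 1.
    """
    state = [1] * nbits  # any non-zero seed works
    out = []
    for _ in range(2 ** nbits - 1):
        out.append(state[-1])  # output the last register cell
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]  # shift right, insert feedback bit
    return out


def circular_autocorrelation(bits, shift):
    """Autocorrelation of the bipolar (+1/-1) sequence at a circular shift."""
    s = [1 - 2 * b for b in bits]
    n = len(s)
    return sum(s[i] * s[(i + shift) % n] for i in range(n))


# Assumed example: a 4-bit register with taps (3, 4), i.e. a[k+4] = a[k+1] XOR a[k].
seq = m_sequence(taps=(3, 4), nbits=4)
peak = circular_autocorrelation(seq, 0)  # equals the period, 15
off_peak = [circular_autocorrelation(seq, k) for k in range(1, len(seq))]  # all -1
```

The flat off-peak autocorrelation is what lets a receiver recover the impulse response by correlation while uncorrelated noise averages down.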
Genomic Diversity and Evolution of the Lyssaviruses
Delmas, Olivier; Holmes, Edward C.; Talbi, Chiraz; Larrous, Florence; Dacheux, Laurent; Bouchier, Christiane; Bourhy, Hervé
2008-01-01
Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as ‘Lagos Bat’. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses. PMID:18446239
Marck, C
1988-01-01
DNA Strider is a new integrated DNA and protein sequence analysis program written in C for the Macintosh Plus, SE and II computers. It has been designed to be easy to learn and use, as well as a fast and efficient tool for day-to-day sequence analysis work. The program consists of a multi-window sequence editor and various DNA and protein analysis functions. The editor may use 4 different types of sequences (DNA, degenerate DNA, RNA and one-letter coded protein) and can handle simultaneously 6 sequences of any type up to 32.5 kB each. Negative numbering of the bases is allowed for DNA sequences. All classical restriction and translation analysis functions are present and can be performed in any order on any open sequence or part of a sequence. The main feature of the program is that the same analysis function can be repeated several times on different sequences, thus generating multiple windows on the screen. Many graphic capabilities have been incorporated, such as a graphic restriction map, a hydrophobicity profile and the CAI plot (codon adaptation index according to Sharp and Li). The restriction site search uses a newly designed fast hexamer look-ahead algorithm. Typical runtime for the search of all sites with a library of 130 restriction endonucleases is 1 second per 10,000 bases. The circular graphic restriction map of the pBR322 plasmid can therefore be computed from its sequence and displayed on the Macintosh Plus screen within 2 seconds, and its multiline restriction map obtained in a scrolling window within 5 seconds. PMID:2832831
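The abstract does not describe the hexamer look-ahead algorithm, so the Python sketch below is only a stand-in that conveys the general idea of a restriction site search: each 6-base window is looked up in a table of recognition sites. It assumes every enzyme has an unambiguous 6-bp site, and the function name and test sequence are hypothetical.

```python
def find_sites(sequence, enzymes):
    """Return (1-based position, enzyme name) for each 6-bp site found.

    enzymes: mapping of enzyme name to a 6-base recognition sequence.
    Assumption: all sites are exactly 6 bases with no degenerate symbols.
    """
    site_to_name = {site: name for name, site in enzymes.items()}
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - 5):
        name = site_to_name.get(seq[i:i + 6])  # one dictionary lookup per window
        if name is not None:
            hits.append((i + 1, name))
    return hits


# Three well-known 6-cutters and a made-up test sequence.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}
hits = find_sites("TTGAATTCAAGGATCCTT", ENZYMES)
```

Because the lookup cost per window does not grow with the number of enzymes, a table-driven scan of this kind stays fast even for a library of over a hundred enzymes.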
Churkin, Alexander; Barash, Danny
2008-01-01
Background: RNAmute is an interactive Java application which, given an RNA sequence, calculates the secondary structure of all single point mutations and organizes them into categories according to their similarity to the predicted structure of the wild type. The secondary structure predictions are performed using the Vienna RNA package. A more efficient implementation of RNAmute is needed, however, to extend from the case of single point mutations to the general case of multiple point mutations, which may often be desired for computational predictions alongside mutagenesis experiments. But analyzing multiple point mutations, a process that requires traversing all possible mutations, becomes highly expensive since the running time is O(n^m) for a sequence of length n with m-point mutations. Using Vienna's RNAsubopt, we present a method that selects only those mutations, based on stability considerations, which are likely to cause conformational rearrangement. The approach is best examined using the dot plot representation for RNA secondary structure. Results: Using RNAsubopt, the suboptimal solutions for a given wild-type sequence are calculated once. Then, specific mutations are selected that are most likely to cause a conformational rearrangement. For an RNA sequence of about 100 nts and 3-point mutations (n = 100, m = 3), for example, the proposed method reduces the running time from several hours or even days to several minutes, thus enabling the practical application of RNAmute to the analysis of multiple-point mutations. Conclusion: A highly efficient addition to RNAmute that is as user friendly as the original application but that facilitates the practical analysis of multiple-point mutations is presented. Such an extension can now be exploited prior to site-directed mutagenesis experiments by virologists, for example, who investigate the change of function in an RNA virus via mutations that disrupt important motifs in its secondary structure. 
A complete explanation of the application, called MultiRNAmute, is available at [1]. PMID:18445289
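To make the cost of exhaustive enumeration concrete, the Python sketch below counts and lists every m-point mutant of an RNA sequence. It illustrates only the brute-force search space that grows polynomially in n with exponent m; it does not implement the RNAsubopt-based selection described in the abstract, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

BASES = "ACGU"


def count_m_point_mutants(n, m):
    """Number of distinct sequences differing from the wild type at exactly m sites."""
    return comb(n, m) * 3 ** m  # choose the sites, then 3 alternative bases per site


def enumerate_mutants(seq, m):
    """Yield every sequence that differs from seq at exactly m positions."""
    for positions in combinations(range(len(seq)), m):
        choices = [[b for b in BASES if b != seq[p]] for p in positions]
        for subs in product(*choices):
            mutant = list(seq)
            for p, b in zip(positions, subs):
                mutant[p] = b
            yield "".join(mutant)


# The example from the abstract: n = 100, m = 3.
total = count_m_point_mutants(100, 3)  # 4365900 structure predictions if done exhaustively
```

With several million candidates each needing a structure prediction, it is easy to see why filtering the mutations before folding them pays off.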
Presentation Extensions of the SOAP
NASA Technical Reports Server (NTRS)
Carnright, Robert; Stodden, David; Coggi, John
2009-01-01
A set of extensions of the Satellite Orbit Analysis Program (SOAP) enables simultaneous and/or sequential presentation of information from multiple sources. SOAP is used in the aerospace community as a means of collaborative visualization and analysis of data on planned spacecraft missions. The following definitions of terms also describe the display modalities of SOAP as now extended. In SOAP terminology, "View" signifies an animated three-dimensional (3D) scene, two-dimensional still image, plot of numerical data, or any other visible display derived from a computational simulation or other data source; "Viewport" signifies a rectangular portion of a computer-display window containing a view; "Palette" signifies a collection of one or more viewports configured for simultaneous (split-screen) display in the same window; "Slide" signifies a palette with a beginning and ending time and an animation time step; and "Presentation" signifies a prescribed sequence of slides. For example, multiple 3D views from different locations can be crafted for simultaneous display and combined with numerical plots and other representations of data for both qualitative and quantitative analysis. The resulting sets of views can be temporally sequenced to convey visual impressions of a sequence of events for a planned mission.
Oluwayelu, D O; Todd, D; Olaleye, O D
2008-12-01
This work reports the first molecular analysis study of chicken anaemia virus (CAV) in backyard chickens in Africa using molecular cloning and sequence analysis to characterize CAV strains obtained from commercial chickens and Nigerian backyard chickens. Partial VP1 gene sequences were determined for three CAVs from commercial chickens and for six CAV variants present in samples from a backyard chicken. Multiple alignment analysis revealed that the 6% and 4% nucleotide diversity obtained respectively for the commercial and backyard chicken strains translated to only 2% amino acid diversity for each breed. Overall, the amino acid composition of Nigerian CAVs was found to be highly conserved. Since the partial VP1 gene sequence of two backyard chicken cloned CAV strains (NGR/CI-8 and NGR/CI-9) were almost identical and evolutionarily closely related to the commercial chicken strains NGR-1, and NGR-4 and NGR-5, respectively, we concluded that CAV infections had crossed the farm boundary.
Chang, Suhua; Zhang, Jiajie; Liao, Xiaoyun; Zhu, Xinxing; Wang, Dahai; Zhu, Jiang; Feng, Tao; Zhu, Baoli; Gao, George F; Wang, Jian; Yang, Huanming; Yu, Jun; Wang, Jing
2007-01-01
Frequent outbreaks of highly pathogenic avian influenza and the increasing data available for comparative analysis require a central database specialized in influenza viruses (IVs). We have established the Influenza Virus Database (IVDB) to integrate information and create an analysis platform for genetic, genomic, and phylogenetic studies of the virus. IVDB hosts complete genome sequences of influenza A virus generated by Beijing Institute of Genomics (BIG) and curates all other published IV sequences after expert annotation. Our Q-Filter system classifies and ranks all nucleotide sequences into seven categories according to sequence content and integrity. IVDB provides a series of tools and viewers for comparative analysis of the viral genomes, genes, genetic polymorphisms and phylogenetic relationships. A search system has been developed for users to retrieve a combination of different data types by setting search options. To facilitate analysis of global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) has been developed to display the worldwide geographic distribution of chosen viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to pre-computed alignments and polymorphisms of IV genes and proteins, and presents the results as SNP distribution plots and minor allele distributions. IVDB is publicly available at http://influenza.genomics.org.cn.
He, Zihuai; Xu, Bin; Lee, Seunggeun; Ionita-Laza, Iuliana
2017-09-07
Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Non-redundant patent sequence databases with value-added annotations at two levels
Li, Weizhong; McWilliam, Hamish; de la Torre, Ana Richart; Grodowski, Adam; Benediktovich, Irina; Goujon, Mickael; Nauche, Stephane; Lopez, Rodrigo
2010-01-01
The European Bioinformatics Institute (EMBL-EBI) provides public access to patent data, including abstracts, chemical compounds and sequences. Sequences can appear multiple times due to the filing of the same invention with multiple patent offices, or the use of the same sequence by different inventors in different contexts. Information relating to the source invention may be incomplete, and biological information available in patent documents elsewhere may not be reflected in the annotation of the sequence. Search and analysis of these data have become increasingly challenging for both the scientific and intellectual-property communities. Here, we report a collection of non-redundant patent sequence databases, which cover the EMBL-Bank nucleotides patent class and the patent protein databases and contain value-added annotations from patent documents. The databases were created at two levels by the use of sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing. The databases are available at http://www.ebi.ac.uk/patentdata/nr/. PMID:19884134
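The two-level clustering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EMBL-EBI pipeline: the record fields (`id`, `sequence`, `family`), the upper-casing before hashing, and the function names are all hypothetical.

```python
import hashlib
from collections import defaultdict


def level1_clusters(records):
    """Group records whose sequences are identical over their whole length."""
    clusters = defaultdict(list)
    for rec in records:
        normalized = rec["sequence"].upper()  # assumption: case is not significant
        digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
        clusters[digest].append(rec)
    return dict(clusters)


def level2_clusters(level1):
    """Sub-group each level-1 cluster by patent family."""
    out = {}
    for digest, recs in level1.items():
        by_family = defaultdict(list)
        for rec in recs:
            by_family[rec["family"]].append(rec)
        for family, members in by_family.items():
            out[(digest, family)] = members
    return out


# Made-up records: P1-P3 share a sequence, but P3 belongs to another patent family.
records = [
    {"id": "P1", "sequence": "ACGT", "family": "F1"},
    {"id": "P2", "sequence": "acgt", "family": "F1"},
    {"id": "P3", "sequence": "ACGT", "family": "F2"},
    {"id": "P4", "sequence": "TTGA", "family": "F2"},
]
l1 = level1_clusters(records)
l2 = level2_clusters(l1)
```

Hashing each sequence once keeps the grouping linear in the number of records, which matters when the same sequence is filed with several patent offices.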
Saw, Jimmy H. W.; Yuryev, Anton; Kanbe, Masaomi; Hou, Shaobin; Young, Aaron G.; Aizawa, Shin-Ichi
2012-01-01
Saprospira grandis is a coastal marine bacterium that can capture and prey upon other marine bacteria using a mechanism known as ‘ixotrophy’. Here, we present the complete genome sequence of Saprospira grandis str. Lewin isolated from La Jolla beach in San Diego, California. The complete genome sequence comprises a chromosome of 4.35 Mbp and a plasmid of 54.9 Kbp. Genome analysis revealed incomplete pathways for the biosynthesis of nine essential amino acids but presence of a large number of peptidases. The genome encodes multiple copies of sensor globin-coupled rsbR genes thought to be essential for stress response and the presence of such sensor globins in Bacteroidetes is unprecedented. A total of 429 spacer sequences within the three CRISPR repeat regions were identified in the genome and this number is the largest among all the Bacteroidetes sequenced to date. PMID:22675601
New Tools For Understanding Microbial Diversity Using High-throughput Sequence Data
NASA Astrophysics Data System (ADS)
Knight, R.; Hamady, M.; Liu, Z.; Lozupone, C.
2007-12-01
High-throughput sequencing techniques such as 454 are straining the limits of tools traditionally used to build trees, choose OTUs, and perform other essential sequencing tasks. We have developed a workflow for phylogenetic analysis of large-scale sequence data sets that combines existing tools, such as the Arb phylogeny package and the NAST multiple sequence alignment tool, with new methods for choosing and clustering OTUs and for performing phylogenetic community analysis with UniFrac. This talk discusses the cyberinfrastructure we are developing to support the human microbiome project, and the application of these workflows to analyze very large data sets that contrast the gut microbiota with a range of physical environments. These tools will ultimately help to define core and peripheral microbiomes in a range of environments, and will allow us to understand the physical and biotic factors that contribute most to differences in microbial diversity.
GWFASTA: server for FASTA search in eukaryotic and microbial genomes.
Issac, Biju; Raghava, G P S
2002-09-01
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotated genomes. In fact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequence alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful tool for biologists.
Zhu, Tianqi; Dos Reis, Mario; Yang, Ziheng
2015-03-01
Genetic sequence data provide information about the distances between species or branch lengths in a phylogeny, but not about the absolute divergence times or the evolutionary rates directly. Bayesian methods for dating species divergences estimate times and rates by assigning priors on them. In particular, the prior on times (node ages on the phylogeny) incorporates information in the fossil record to calibrate the molecular tree. Because times and rates are confounded, our posterior time estimates will not approach point values even if an infinite amount of sequence data are used in the analysis. In a previous study we developed a finite-sites theory to characterize the uncertainty in Bayesian divergence time estimation in analysis of large but finite sequence data sets under a strict molecular clock. As most modern clock dating analyses use more than one locus and are conducted under relaxed clock models, here we extend the theory to the case of relaxed clock analysis of data from multiple loci (site partitions). Uncertainty in posterior time estimates is partitioned into three sources: sampling errors in the estimates of branch lengths in the tree for each locus due to limited sequence length, variation of substitution rates among lineages and among loci, and uncertainty in fossil calibrations. Using a simple but analogous estimation problem involving the multivariate normal distribution, we predict that as the number of loci (L) goes to infinity, the variance in posterior time estimates decreases and approaches the infinite-data limit at the rate of 1/L, and the limit is independent of the number of sites in the sequence alignment. We then confirmed the predictions by using computer simulation on phylogenies of two or three species, and by analyzing a real genomic data set for six primate species. 
Our results suggest that with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way for improving the precision of posterior time estimation. However, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates. © The Author(s) 2014. Published by Oxford University Press on behalf of the Society of Systematic Biologists.
AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis
Aniba, Mohamed Radhouene; Poch, Olivier; Marchler-Bauer, Aron; Thompson, Julie Dawn
2010-01-01
Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of ‘meta-methods’ that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/~aniba/alexsys. PMID:20530533
Yoon, Jun-Hee; Kim, Thomas W; Mendez, Pedro; Jablons, David M; Kim, Il-Jin
2017-01-01
The development of next-generation sequencing (NGS) technology makes it possible to sequence whole exomes or genomes. However, data analysis is still the biggest bottleneck for its wide implementation. Most laboratories still depend on manual procedures for data handling and analyses, which translates into delays and decreased efficiency in the delivery of NGS results to doctors and patients. Thus, there is high demand for an automatic and easy-to-use NGS data analysis system. We developed a comprehensive, automatic genetic analysis controller named Mobile Genome Express (MGE) that works on smartphones and other mobile devices. MGE can handle all the steps of genetic analysis, such as sample information submission, sequencing run quality check from the sequencer, secured data transfer and results review. We sequenced an Actrometrix control DNA containing multiple proven human mutations using a targeted sequencing panel, and the whole analysis was managed by MGE and its data reviewing program, called ELECTRO. All steps were processed automatically except for the final sequencing review procedure with ELECTRO to confirm mutations. The data analysis process was completed within several hours. We confirmed that the mutations we identified were consistent with our previous results obtained by using multi-step, manual pipelines.
Fanali, Gabriella; Ascenzi, Paolo; Bernardi, Giorgio; Fasano, Mauro
2012-01-01
Serum albumin (SA) is a circulating protein providing a depot and carrier for many endogenous and exogenous compounds. At least seven major binding sites have been identified by structural and functional investigations mainly in human SA. SA is conserved in vertebrates, with at least 49 entries in protein sequence databases. The multiple sequence analysis of this set of entries leads to the definition of a cladistic tree for the molecular evolution of SA orthologs in vertebrates, thus showing the clustering of the considered species, with lamprey SAs (Lethenteron japonicum and Petromyzon marinus) in a separate outgroup. Sequence analysis aimed at searching conserved domains revealed that most SA sequences are made up of three repeated domains (about 600 residues), as extensively characterized for human SA. On the contrary, lamprey SAs are giant proteins (about 1400 residues) comprising seven repeated domains. The phylogenetic analysis of the SA family reveals a stringent correlation with the taxonomic classification of the species available in sequence databases. A focused inspection of the sequences of ligand binding sites in SA revealed that in all sites most residues involved in ligand binding are conserved, although the versatility towards different ligands could be peculiar to higher organisms. Moreover, the analysis of molecular links between the different sites suggests that allosteric modulation mechanisms could be restricted to higher vertebrates.
Harrison Robert L.; Daniel L. Rowley; Melody A. Keena
2016-01-01
Isolates of the baculovirus species Lymantria dispar multiple nucleopolyhedrovirus have been formulated and applied to suppress outbreaks of the gypsy moth, L. dispar. To evaluate the genetic diversity in this species at the genomic level, the genomes of three isolates from Massachusetts, USA (LdMNPV-Aba624), Spain (LdMNPV-3054...
Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing.
Zhao, Shanrong; Prenger, Kurt; Smith, Lance; Messina, Thomas; Fan, Hongtao; Jaeger, Edward; Stephens, Susan
2013-06-27
Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. 
Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html.
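The abstract lists splitting large sequence files for better downstream load balance as one of Rainbow's additions. The idea can be sketched in a few lines of Python. This is only an illustration, not Rainbow's code: the function name `split_fastq` and the chunk size are hypothetical, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
from itertools import islice


def split_fastq(lines, records_per_chunk):
    """Yield lists of FASTQ lines holding at most `records_per_chunk` records.

    Assumption: every record is exactly four lines
    (header, sequence, '+', qualities).
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    iterator = iter(lines)
    lines_per_chunk = 4 * records_per_chunk
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


# Three toy records split into chunks of at most two records each.
toy_reads = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "TTGA", "+", "IIII",
    "@r3", "GGCA", "+", "IIII",
]
chunks = list(split_fastq(toy_reads, records_per_chunk=2))
```

Each chunk could then be handed to a separate worker or cloud instance; how Rainbow actually distributes the pieces is not described in the abstract.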
Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing
2013-01-01
Background Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Results Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Conclusions Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. 
For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html. PMID:23802613
ERIC Educational Resources Information Center
Luiselli, James K.; Luiselli, Tracy Evans
1995-01-01
This report describes a behavior analysis treatment approach to establishing oral feeding in children with multiple developmental disabilities and gastrostomy-tube dependency. Pretreatment screening, functional assessment, and treatment are reported as implemented within a behavioral consultation model. A case study illustrates the sequence and…
ADOMA: A Command Line Tool to Modify ClustalW Multiple Alignment Output.
Zaal, Dionne; Nota, Benjamin
2016-01-01
We present ADOMA, a command line tool that produces alternative outputs from ClustalW multiple alignments of nucleotide or protein sequences. ADOMA can simplify the output of alignments by showing only the different residues between sequences, which is often desirable when only small differences such as single nucleotide polymorphisms are present (e.g., between different alleles). Another feature of ADOMA is that it can enhance the ClustalW output by coloring the residues in the alignment. This tool is easily integrated into automated Linux pipelines for next-generation sequencing data analysis, and may be useful for researchers in a broad range of scientific disciplines including evolutionary biology and biomedical sciences. The source code is freely available at https://sourceforge.net/projects/adoma/. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
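The "show only the differing residues" simplification described above can be illustrated with a short Python sketch. It is not ADOMA's implementation or output format: the function name `show_differences`, the choice of the first sequence as the reference, and the `.` placeholder are all assumptions made for illustration.

```python
def show_differences(aligned, match_char="."):
    """Mask residues that are identical to the first (reference) sequence.

    `aligned` is a list of (name, sequence) pairs taken from an alignment,
    so all sequences must have the same length. The reference row is
    returned unchanged; the other rows keep only the residues that differ.
    """
    ref_name, ref_seq = aligned[0]
    if any(len(seq) != len(ref_seq) for _, seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = [(ref_name, ref_seq)]
    for name, seq in aligned[1:]:
        masked = "".join(
            match_char if residue == ref else residue
            for residue, ref in zip(seq, ref_seq)
        )
        result.append((name, masked))
    return result


# Two toy alleles that differ by a single nucleotide.
rows = show_differences([("allele1", "ACGTAC"), ("allele2", "ACGTTC")])
for name, seq in rows:
    print(f"{name:10s}{seq}")
```

With alleles that differ at a single position, only that position remains visible in the second row, which is the kind of output the abstract describes as useful for SNP-level comparisons.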
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andersen, G.L.; He, Z.; DeSantis, T.Z.
Microarrays have proven to be a useful and high-throughput method to provide targeted DNA sequence information for up to many thousands of specific genetic regions in a single test. A microarray consists of multiple DNA oligonucleotide probes that, under high stringency conditions, hybridize only to specific complementary nucleic acid sequences (targets). A fluorescent signal indicates the presence and, in many cases, the abundance of genetic regions of interest. In this chapter we will look at how microarrays are used in microbial ecology, especially with the recent increase in microbial community DNA sequence data. Of particular interest to microbial ecologists, phylogenetic microarrays are used for the analysis of phylotypes in a community and functional gene arrays are used for the analysis of functional genes, and, by inference, phylotypes in environmental samples. A phylogenetic microarray that has been developed by the Andersen laboratory, the PhyloChip, will be discussed as an example of a microarray that targets the known diversity within the 16S rRNA gene to determine microbial community composition. Using multiple, confirmatory probes to increase the confidence of detection and a mismatch probe for every perfect match probe to minimize the effect of cross-hybridization by non-target regions, the PhyloChip is able to simultaneously identify any of thousands of taxa present in an environmental sample. The PhyloChip is shown to reveal greater diversity within a community than rRNA gene sequencing due to the placement of the entire gene product on the microarray compared with the analysis of up to thousands of individual molecules by traditional sequencing methods. A functional gene array that has been developed by the Zhou laboratory, the GeoChip, will be discussed as an example of a microarray that dynamically identifies functional activities of multiple members within a community. 
The recent version of GeoChip contains more than 24,000 50mer oligonucleotide probes and covers more than 10,000 gene sequences in 150 gene categories involved in carbon, nitrogen, sulfur, and phosphorus cycling, metal resistance and reduction, and organic contaminant degradation. GeoChip can be used as a generic tool for microbial community analysis, and also link microbial community structure to ecosystem functioning. Examples of the application of both arrays in different environmental samples will be described in the two subsequent sections.
FASMA: a service to format and analyze sequences in multiple alignments.
Costantini, Susan; Colonna, Giovanni; Facchiano, Angelo M
2007-12-01
Multiple sequence alignments are successfully applied in many studies for understanding the structural and functional relations among single nucleic acids and protein sequences as well as whole families. Because of the rapid growth of sequence databases, multiple sequence alignments can often be very large and difficult to visualize and analyze. We offer a new service aimed to visualize and analyze the multiple alignments obtained with different external algorithms, with new features useful for the comparison of the aligned sequences as well as for the creation of a final image of the alignment. The service is named FASMA and is available at http://bioinformatica.isa.cnr.it/FASMA/.
Odronitz, Florian; Kollmar, Martin
2006-11-29
Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database-driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.
High quality de novo sequencing and assembly of the Saccharomyces arboricolus genome
2013-01-01
Background Comparative genomics is a formidable tool to identify functional elements throughout a genome. In the past ten years, studies in the budding yeast Saccharomyces cerevisiae and a set of closely related species have been instrumental in showing the benefit of analyzing patterns of sequence conservation. Increasing the number of closely related genome sequences makes the comparative genomics approach more powerful and accurate. Results Here, we report the genome sequence and analysis of Saccharomyces arboricolus, a yeast species recently isolated in China, that is closely related to S. cerevisiae. We obtained high quality de novo sequence and assemblies using a combination of next generation sequencing technologies, established the phylogenetic position of this species and considered its phenotypic profile under multiple environmental conditions in the light of its gene content and phylogeny. Conclusions We suggest that the genome of S. arboricolus will be useful in future comparative genomics analysis of the Saccharomyces sensu stricto yeasts. PMID:23368932
Sequence investigation of 34 forensic autosomal STRs with massively parallel sequencing.
Zhang, Suhua; Niu, Yong; Bian, Yingnan; Dong, Rixia; Liu, Xiling; Bao, Yun; Jin, Chao; Zheng, Hancheng; Li, Chengtao
2018-05-01
STRs vary not only in the length of the repeat units and the number of repeats but also in the region with which they conform to an incremental repeat pattern. Massively parallel sequencing (MPS) offers new possibilities in the analysis of STRs since it can simultaneously sequence multiple targets in a single reaction and capture potential internal sequence variations. Here, we sequenced 34 STRs applied in the forensic community of China with a custom-designed panel. MPS performance was evaluated from sequencing reads analysis, concordance study and sensitivity testing. High coverage sequencing data were obtained to determine the constitute ratios and heterozygous balance. No actual inconsistent genotypes were observed between capillary electrophoresis (CE) and MPS, demonstrating the reliability of the panel and the MPS technology. With the sequencing data from the 200 investigated individuals, 346 and 418 alleles were obtained via CE and MPS technologies, respectively, at the 34 STRs, indicating that MPS technology provides higher discrimination than CE detection. The whole study demonstrated that STR genotyping with the custom panel and MPS technology has the potential not only to reveal length and sequence variations but also to satisfy the demands of high throughput and high multiplexing with acceptable sensitivity.
Evaluation of CDMA system capacity for mobile satellite system applications
NASA Technical Reports Server (NTRS)
Smith, Partrick O.; Geraniotis, Evaggelos A.
1988-01-01
A specific Direct-Sequence/Pseudo-Noise (DS/PN) Code-Division Multiple-Access (CDMA) mobile satellite system (MSAT) architecture is discussed. The performance of this system is evaluated in terms of the maximum number of active MSAT subscribers that can be supported at a given uncoded bit-error probability. The evaluation decouples the analysis of the multiple-access capability (i.e., the number of instantaneous user signals) from the analysis of the multiple-access multiplier effect allowed by the use of CDMA with burst-modem operation. We combine the results of these two analyses and present numerical results for scenarios of interest to the mobile satellite system community.
Rift Valley Fever, Sudan, 2007 and 2010
Aradaib, Imadeldin E.; Erickson, Bobbie R.; Elageb, Rehab M.; Khristova, Marina L.; Carroll, Serena A.; Elkhidir, Isam M.; Karsany, Mubarak E.; Karrar, AbdelRahim E.; Elbashir, Mustafa I.
2013-01-01
To elucidate whether Rift Valley fever virus (RVFV) diversity in Sudan resulted from multiple introductions or from acquired changes over time from 1 introduction event, we generated complete genome sequences from RVFV strains detected during the 2007 and 2010 outbreaks. Phylogenetic analyses of small, medium, and large RNA segment sequences indicated several genetic RVFV variants were circulating in Sudan, which all grouped into Kenya-1 or Kenya-2 sublineages from the 2006–2008 eastern Africa epizootic. Bayesian analysis of sequence differences estimated that diversity among the 2007 and 2010 Sudan RVFV variants shared a most recent common ancestor circa 1996. The data suggest multiple introductions of RVFV into Sudan as part of sweeping epizootics from eastern Africa. The sequences indicate recent movement of RVFV and support the need for surveillance to recognize when and where RVFV circulates between epidemics, which can make data from prediction tools easier to interpret and preventive measures easier to direct toward high-risk areas. PMID:23347790
A DYNAMICAL SIGNATURE OF MULTIPLE STELLAR POPULATIONS IN 47 TUCANAE
DOE Office of Scientific and Technical Information (OSTI.GOV)
Richer, Harvey B.; Heyl, Jeremy; Anderson, Jay
2013-07-01
Based on the width of its main sequence, and an actual observed split when viewed through particular filters, it is widely accepted that 47 Tucanae contains multiple stellar populations. In this contribution, we divide the main sequence of 47 Tuc into four color groups, which presumably represent stars of various chemical compositions. The kinematic properties of each of these groups are explored via proper motions, and a strong signal emerges of differing proper-motion anisotropies with differing main-sequence color; the bluest main-sequence stars exhibit the largest proper-motion anisotropy which becomes undetectable for the reddest stars. In addition, the bluest stars are also the most centrally concentrated. A similar analysis for Small Magellanic Cloud stars, which are located in the background of 47 Tuc on our frames, yields none of the anisotropy exhibited by the 47 Tuc stars. We discuss implications of these results for possible formation scenarios of the various populations.
Evolutionary profiles from the QR factorization of multiple sequence alignments
Sethi, Anurag; O'Donoghue, Patrick; Luthey-Schulten, Zaida
2005-01-01
We present an algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of the homologous group. The method, based on the multidimensional QR factorization of numerically encoded multiple sequence alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. We observe a general trend that these smaller, more evolutionarily balanced profiles have comparable and, in many cases, better performance in database searches than conventional profiles containing hundreds of sequences, constructed in an iterative and computationally intensive procedure. For more diverse families or superfamilies, with sequence identity <30%, structural alignments, based purely on the geometry of the protein structures, provide better alignments than pure sequence-based methods. Merging the structure and sequence information allows the construction of accurate profiles for distantly related groups. These structure-based profiles outperformed other sequence-based methods for finding distant homologs and were used to identify a putative class II cysteinyl-tRNA synthetase (CysRS) in several archaea that eluded previous annotation studies. Phylogenetic analysis showed the putative class II CysRSs to be a monophyletic group and homology modeling revealed a constellation of active site residues similar to that in the known class I CysRS. PMID:15741270
SARA-Coffee web server, a tool for the computation of RNA sequence and structure multiple alignments
Di Tommaso, Paolo; Bussotti, Giovanni; Kemena, Carsten; Capriotti, Emidio; Chatzou, Maria; Prieto, Pablo; Notredame, Cedric
2014-01-01
This article introduces the SARA-Coffee web server; a service allowing the online computation of 3D structure based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee. PMID:24972831
Dai, Hongying; Wu, Guodong; Wu, Michael; Zhi, Degui
2016-01-01
Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency,[Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε →0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
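The second stage described above combines gene-level p-values with a Lancaster procedure. As a hedged illustration, the Python sketch below implements only the classical Lancaster combination for independent p-values, not the correlated variant the abstract refers to. Each p-value p_i is mapped to the upper-tail quantile of a chi-square distribution with w_i degrees of freedom, and the sum is referred to a chi-square distribution with the summed degrees of freedom. To stay within the standard library, the sketch restricts the weights to even integers, for which the chi-square tail has a closed form. The function names and the example weights are hypothetical.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even `df`."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def chi2_isf_even(p, df):
    """Inverse of chi2_sf_even in x, found by bisection."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    lo, hi = 0.0, 1.0
    while chi2_sf_even(hi, df) > p:   # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lancaster(p_values, weights):
    """Combine independent p-values; weights act as degrees of freedom."""
    if len(p_values) != len(weights):
        raise ValueError("need one weight per p-value")
    statistic = sum(chi2_isf_even(p, w) for p, w in zip(p_values, weights))
    return chi2_sf_even(statistic, sum(weights))


# With all weights equal to 2 the procedure reduces to Fisher's method.
fisher_like = lancaster([0.01, 0.04], [2, 2])
# Hypothetical unequal weights, e.g. giving the first gene more influence.
weighted = lancaster([0.01, 0.04], [4, 2])
```

How the weights are chosen and how correlation between genes is handled are the substantive parts of the published method and are not reproduced here.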
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
Chronodes: Interactive Multifocus Exploration of Event Sequences
POLACK, PETER J.; CHEN, SHANG-TSE; KAHNG, MINSUK; DE BARBARO, KAYA; BASOLE, RAHUL; SHARMIN, MOUSHUMI; CHAU, DUEN HORNG
2018-01-01
The advent of mobile health (mHealth) technologies challenges the capabilities of current visualizations, interactive tools, and algorithms. We present Chronodes, an interactive system that unifies data mining and human-centric visualization techniques to support explorative analysis of longitudinal mHealth data. Chronodes extracts and visualizes frequent event sequences that reveal chronological patterns across multiple participant timelines of mHealth data. It then combines novel interaction and visualization techniques to enable multifocus event sequence analysis, which allows health researchers to interactively define, explore, and compare groups of participant behaviors using event sequence combinations. Through summarizing insights gained from a pilot study with 20 behavioral and biomedical health experts, we discuss Chronodes’s efficacy and potential impact in the mHealth domain. Ultimately, we outline important open challenges in mHealth, and offer recommendations and design guidelines for future research. PMID:29515937
NASA Astrophysics Data System (ADS)
Tene, Yair; Tene, Noam; Tene, G.
1993-08-01
An interactive data fusion methodology of video, audio, and nonlinear structural dynamic analysis for potential application in forensic engineering is presented. The methodology was developed and successfully demonstrated in the analysis of heavy transportable bridge collapse during preparation for testing. Multiple bridge element failures were identified after the collapse, including fracture, cracks and rupture of high performance structural materials. Videotape recording by a hand-held camcorder was the only source of information about the collapse sequence. The interactive data fusion methodology resulted in extracting relevant information from the videotape and from dynamic nonlinear structural analysis, leading to a full account of the sequence of events during the bridge collapse.
Genetic characterization of Measles Viruses in China, 2004
Zhang, Yan; Ji, Yixin; Jiang, Xiaohong; Xu, Songtao; Zhu, Zhen; Zheng, Lei; He, Jilan; Ling, Hua; Wang, Yan; Liu, Yang; Du, Wen; Yang, Xuelei; Mao, Naiying; Xu, Wenbo
2008-01-01
Genetic characterization of wild-type measles virus was studied using nucleotide sequencing of the C-terminal region of the N protein gene and phylogenetic analysis on 59 isolates from 16 provinces of China in 2004. The results showed that all of the isolates belonged to genotype H1. Fifty-one isolates belonged to cluster 1 and 8 isolates to cluster 2, and viruses from both clusters were distributed throughout China without a distinct geographic pattern. The nucleotide sequence and predicted amino acid homologies of the 59 H1 strains were 96.5%–100% and 95.7%–100%, respectively. The report showed that the transmission pattern of genotype H1 viruses in China in 2004 was consistent with ongoing endemic transmission of multiple lineages of a single, endemic genotype. Multiple transmission pathways led to multiple lineages within the endemic genotype. PMID:18928575
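Homology ranges such as those quoted above are typically summaries over all pairwise comparisons in an alignment. The Python sketch below shows one common way to compute such a range. It is an assumption-laden illustration: the abstract does not state how identity was calculated, the function names are hypothetical, and columns where either sequence has a gap are simply skipped.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def identity_range(sequences):
    """Return (lowest, highest) pairwise percent identity in a set."""
    values = [percent_identity(a, b) for a, b in combinations(sequences, 2)]
    return min(values), max(values)


# Toy aligned fragments standing in for real isolate sequences.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
lowest, highest = identity_range(toy)
```

For the three toy fragments, two are identical and the third differs at one of ten positions, so the reported range runs from 90% to 100%.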
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
NASA Astrophysics Data System (ADS)
Ma, Lijun; Lee, Letitia; Barani, Igor; Hwang, Andrew; Fogh, Shannon; Nakamura, Jean; McDermott, Michael; Sneed, Penny; Larson, David A.; Sahgal, Arjun
2011-11-01
Rapid delivery of multiple shots or isocenters is one of the hallmarks of Gamma Knife radiosurgery. In this study, we investigated whether the temporal order of shots delivered with Gamma Knife Perfexion would significantly influence the biological equivalent dose for complex multi-isocenter treatments. Twenty single-target cases were selected for analysis. For each case, 3D dose matrices of individual shots were extracted and single-fraction equivalent uniform dose (sEUD) values were determined for all possible shot delivery sequences, corresponding to different patterns of temporal dose delivery within the target. We found significant variations in the sEUD values among these sequences exceeding 15% for certain cases. However, the sequences for the actual treatment delivery were found to agree (<3%) and to correlate (R² = 0.98) excellently with the sequences yielding the maximum sEUD values for all studied cases. This result is applicable for both fast and slow growing tumors with α/β values of 2 to 20 according to the linear-quadratic model. In conclusion, despite large potential variations in different shot sequences for multi-isocenter Gamma Knife treatments, current clinical delivery sequences exhibited consistent biological target dosing that approached that maximally achievable for all studied cases.
Milius, Robert P; Heuer, Michael; Valiga, Daniel; Doroschak, Kathryn J; Kennedy, Caleb J; Bolon, Yung-Tsi; Schneider, Joel; Pollack, Jane; Kim, Hwa Ran; Cereb, Nezih; Hollenbach, Jill A; Mack, Steven J; Maiers, Martin
2015-12-01
We present an electronic format for exchanging data for HLA and KIR genotyping with extensions for next-generation sequencing (NGS). This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the proposed Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines (miring.immunogenomics.org). Our refinements of HML include two major additions. First, NGS is supported by new XML structures to capture additional NGS data and metadata required to produce a genotyping result, including analysis-dependent (dynamic) and method-dependent (static) components. A full genotype, consensus sequence, and the surrounding metadata are included directly, while the raw sequence reads and platform documentation are externally referenced. Second, genotype ambiguity is fully represented by integrating Genotype List Strings, which use a hierarchical set of delimiters to represent allele and genotype ambiguity in a complete and accurate fashion. HML also continues to enable the transmission of legacy methods (e.g. site-specific oligonucleotide, sequence-specific priming, and Sequence Based Typing (SBT)), adding features such as allowing multiple group-specific sequencing primers, and fully leveraging techniques that combine multiple methods to obtain a single result, such as SBT integrated with NGS. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
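The "hierarchical set of delimiters" in Genotype List Strings lends itself to a small recursive parser. The Python sketch below assumes the delimiter precedence usually given for GL Strings: `^` between loci, `|` between ambiguous genotypes, `+` between gene copies, `~` between alleles in phase, and `/` between ambiguous alleles. It is an illustrative sketch, not part of HML or of any official tooling; the function name `parse_gl_string` and the example alleles are chosen here for demonstration.

```python
# Delimiters from lowest to highest precedence (assumed ordering).
GL_DELIMITERS = "^|+~/"


def parse_gl_string(text, delimiters=GL_DELIMITERS):
    """Split a GL String into nested lists, one nesting level per delimiter.

    Levels, outermost first: loci, genotype alternatives, gene copies,
    phased alleles, allele alternatives. Leaves are allele names.
    """
    if not delimiters:
        return text
    head, rest = delimiters[0], delimiters[1:]
    return [parse_gl_string(part, rest) for part in text.split(head)]


# One locus, one genotype: an ambiguous first allele plus a second allele.
single_locus = parse_gl_string("HLA-A*01:01/HLA-A*01:02+HLA-A*02:01")

# Two loci separated by '^'.
two_loci = parse_gl_string(
    "HLA-A*01:01+HLA-A*02:01^HLA-B*07:02+HLA-B*08:01"
)
```

Because every level is always present in the nested result, downstream code can index to a fixed depth without first checking which delimiters actually appeared in the input.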
Wang, Xueling; Lin, Xiao-Jiang; Tang, Xiangrong; Chai, Yong-Chuan; Yu, De-Hong; Chen, Dong-Ye; Wu, Hao
2017-11-01
The purpose of this study was to identify the genetic causes of disease in a family presenting with multiple symptoms overlapping Usher syndrome type II (USH2) and Waardenburg syndrome type IV (WS4). Targeted next-generation sequencing covering the exons and flanking intron sequences of 79 deafness genes was performed on the proband. Co-segregation of the disease phenotype and the detected variants was confirmed in all family members by PCR amplification and Sanger sequencing. The affected members of this family had two different recessive disorders, USH2 and WS4. By targeted next-generation sequencing, we found that USH2 was caused by a novel missense mutation, p.V4907D in GPR98, whereas WS4 was due to p.V185M in EDNRB. This is the first report of a homozygous p.V185M mutation in EDNRB in a patient with WS4. This study reports a Chinese family with multiple independent and overlapping phenotypes. In conclusion, molecular-level analysis was efficient in identifying the causative variants p.V4907D in GPR98 and p.V185M in EDNRB, and was helpful in confirming the clinical diagnoses of USH2 and WS4. Copyright © 2017 Elsevier B.V. All rights reserved.
Multiplexed Fragaria chloroplast genome sequencing
W. Njuguna; A. Liston; R. Cronn; N.V. Bassil
2010-01-01
A method to sequence multiple chloroplast genomes using ultra high throughput sequencing technologies was recently described. Complete chloroplast genome sequences can resolve phylogenetic relationships at low taxonomic levels and identify informative point mutations and indels. The objective of this research was to sequence multiple Fragaria...
Rodas, Claudia; Klena, John D.; Nicklasson, Matilda; Iniguez, Volga; Sjöling, Åsa
2011-01-01
Background Enterotoxigenic Escherichia coli (ETEC) is a major cause of traveller's and infantile diarrhoea in the developing world. ETEC produces two toxins, a heat-stable toxin (known as ST) and a heat-labile toxin (LT) and colonization factors that help the bacteria to attach to epithelial cells. Methodology/Principal Findings In this study, we characterized a subset of ETEC clinical isolates recovered from Bolivian children under 5 years of age using a combination of multilocus sequence typing (MLST) analysis, virulence typing, serotyping and antimicrobial resistance test patterns in order to determine the genetic background of ETEC strains circulating in Bolivia. We found that strains expressing the heat-labile (LT) enterotoxin and colonization factor CS17 were common and belonged to several MLST sequence types but mainly to sequence type-423 and sequence type-443 (Achtman scheme). To further study the LT/CS17 strains we analysed the nucleotide sequence of the CS17 operon and compared the structure to LT/CS17 ETEC isolates from Bangladesh. Sequence analysis confirmed that all sequence type-423 strains from Bolivia had a single nucleotide polymorphism; SNPbol in the CS17 operon that was also found in some other MLST sequence types from Bolivia but not in strains recovered from Bangladeshi children. The dominant ETEC clone in Bolivia (sequence type-423/SNPbol) was found to persist over multiple years and was associated with severe diarrhoea but these strains were variable with respect to antimicrobial resistance patterns. Conclusion/Significance The results showed that although the LT/CS17 phenotype is common among ETEC strains in Bolivia, multiple clones, as determined by unique MLST sequence types, populate this phenotype. Our data also appear to suggest that acquisition and loss of antimicrobial resistance in LT-expressing CS17 ETEC clones is more dynamic than acquisition or loss of virulence factors. PMID:22140423
Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements
Mukherjee, Supratim; Stamatis, Dimitri; Bertsch, Jon; Ovchinnikova, Galina; Verezemska, Olena; Isbandi, Michelle; Thomas, Alex D.; Ali, Rida; Sharma, Kaushal; Kyrpides, Nikos C.; Reddy, T. B. K.
2017-01-01
The Genomes Online Database (GOLD) (https://gold.jgi.doe.gov) is a manually curated data management system that catalogs sequencing projects with associated metadata from around the world. In the current version of GOLD (v.6), all projects are organized based on a four-level classification system in the form of a Study, Organism (for isolates) or Biosample (for environmental samples), Sequencing Project and Analysis Project. Currently, GOLD provides information for 26 117 Studies, 239 100 Organisms, 15 887 Biosamples, 97 212 Sequencing Projects and 78 579 Analysis Projects. These are integrated with over 312 metadata fields, of which 58 are controlled vocabularies with 2067 terms. The web interface facilitates submission of a diverse range of Sequencing Projects (such as isolate genome, single-cell genome, metagenome, metatranscriptome) and complex Analysis Projects (such as genome from metagenome, or combined assembly from multiple Sequencing Projects). GOLD provides a seamless interface with the Integrated Microbial Genomes (IMG) system and supports and promotes the Genomic Standards Consortium (GSC) Minimum Information standards. This paper describes the data updates and additional features added during the last two years. PMID:27794040
A simple procedure for parallel sequence analysis of both strands of 5'-labeled DNA.
Razvi, F; Gargiulo, G; Worcel, A
1983-08-01
Ligation of a 5'-labeled DNA restriction fragment results in a circular DNA molecule carrying the two 32Ps at the reformed restriction site. Double digestion of the circular DNA with the original enzyme and a second restriction enzyme that cleaves near the labeled site allows direct chemical sequencing of one 5'-labeled DNA strand. A similar double digestion, using an isoschizomer that cleaves differently at the 32P-labeled site, allows direct sequencing of the now 3'-labeled complementary DNA strand. It is possible to directly sequence both strands of cloned DNA inserts by using the above protocol and a multiple cloning site vector that provides the necessary restriction sites. The simultaneous and parallel visualization of both DNA strands eliminates sequence ambiguities. In addition, the labeled circular molecules are particularly useful for single-hit DNA cleavage studies and DNA footprint analysis. As an example, we show here an analysis of the micrococcal nuclease-induced breaks on the two strands of the somatic 5S RNA gene of Xenopus borealis, which suggests that the enzyme may recognize and cleave small AT-containing palindromes along the DNA helix.
PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
2012-01-01
Background With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available, but this wealth of data remains underused and is frequently poorly annotated, since no user-friendly tool exists to analyze and explore it. Results PHYLOViZ is platform-independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions PHYLOViZ is user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net. PMID:22568821
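As a rough illustration of the minimum spanning tree idea, the sketch below links allelic profiles by Hamming distance using Prim's algorithm. It is a simplified stand-in and not the goeBURST algorithm, which applies additional tie-break rules; the profile names and allele numbers are hypothetical.

```python
def hamming(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles):
    """Prim's algorithm over the complete graph of profiles (ties broken by name)."""
    names = sorted(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge from the current tree to a profile not yet in it.
        dist, u, v = min(
            (hamming(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Hypothetical seven-locus MLST profiles.
profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 1, 1, 1, 1, 1, 2),
    "ST3": (1, 1, 1, 1, 1, 3, 2),
    "ST4": (5, 6, 7, 1, 1, 1, 1),
}
edges = minimum_spanning_tree(profiles)
for u, v, dist in edges:
    print(u, v, dist)
```

The quadratic edge scan is adequate for a handful of profiles; a large typing database would call for a priority-queue implementation.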
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
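The consensus-embedding idea can be sketched as follows, under strong simplifying assumptions: the conserved region is an ungapped alignment block, its position in the representative sequence is known, and the consensus is a plain per-column majority vote rather than one derived from a PSSM. All sequences and names are hypothetical.

```python
from collections import Counter


def column_consensus(block):
    """Most frequent residue in each column of an ungapped alignment block."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*block))


def embed_consensus(representative, block, start):
    """Replace the region of `representative` beginning at `start` with the block consensus."""
    consensus = column_consensus(block)
    return representative[:start] + consensus + representative[start + len(consensus):]


# Hypothetical conserved block from four family members.
block = ["GKSTL", "GKTTL", "GKSSL", "AKSTL"]
representative = "MAVAKSTLQE"
embedded = embed_consensus(representative, block, start=3)
print(embedded)
```

Residues outside the block are left untouched, which mirrors the stated goal of retaining single-sequence information where the alignment is uncertain.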
Sand, Olivier; Thomas-Chollier, Morgane; Vervisch, Eric; van Helden, Jacques
2008-01-01
This protocol shows how to access the Regulatory Sequence Analysis Tools (RSAT) via a programmatic interface in order to automate the analysis of multiple data sets. We describe the steps for writing a Perl client that connects to the RSAT Web services and implements a workflow to discover putative cis-acting elements in promoters of gene clusters. In the presented example, we apply this workflow to lists of transcription factor target genes resulting from ChIP-chip experiments. For each factor, the protocol predicts the binding motifs by detecting significantly overrepresented hexanucleotides in the target promoters and generates a feature map that displays the positions of putative binding sites along the promoter sequences. This protocol is addressed to bioinformaticians and biologists with programming skills (notions of Perl). Running time is approximately 6 min on the example data set.
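The hexanucleotide over-representation step can be approximated with a short local sketch. This is not the RSAT Web services client described in the protocol: it counts overlapping hexamers on one strand only and scores them with a binomial upper tail against a uniform background, both of which are simplifying assumptions. The promoter sequences are hypothetical.

```python
from collections import Counter
from math import comb


def count_kmers(sequences, k=6):
    """Count overlapping k-mers on the given strand of each sequence."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def binomial_upper_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(observed, trials + 1)
    )


# Hypothetical promoters sharing one hexamer.
promoters = ["TTCACGTGAA", "GGCACGTGCC", "AACACGTGTT"]
counts = count_kmers(promoters)
total = sum(counts.values())
p_background = 1 / 4**6  # uniform background: an assumption, not a calibrated model

top_kmer, top_count = counts.most_common(1)[0]
p_value = binomial_upper_tail(top_count, total, p_background)
print(top_kmer, top_count, p_value)
```

A realistic analysis would estimate background frequencies from a set of reference promoters and correct for the number of hexamers tested.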
Tuan, Nguyen Ngoc; Chang, Yi-Chia; Yu, Chang-Ping; Huang, Shir-Ly
2014-01-01
In this study, the first survey of the microbial community in a thermophilic anaerobic digester using swine manure as the sole feedstock was performed using multiple approaches, including denaturing gradient gel electrophoresis (DGGE), clone library and pyrosequencing techniques. The integrated analysis of 21 DGGE bands, 126 clones and 8506 pyrosequencing reads revealed that Clostridia from the phylum Firmicutes were the most dominant Bacteria. In addition, our analysis identified taxa that were missed by previous studies, including members of the bacterial phyla Synergistetes, Planctomycetes, Armatimonadetes, Chloroflexi and Nitrospira, which might also play a role in thermophilic anaerobic digesters. In contrast to previous studies, most archaeal 16S rRNA sequences could be assigned to the order Methanobacteriales rather than Methanomicrobiales. In addition, this study reports that a member of the genus Methanothermobacter was found for the first time in a thermophilic anaerobic digester. Copyright © 2014 Elsevier GmbH. All rights reserved.
Bull, Marta E; Heath, Laura M; McKernan-Mullin, Jennifer L; Kraft, Kelli M; Acevedo, Luis; Hitti, Jane E; Cohn, Susan E; Tapia, Kenneth A; Holte, Sarah E; Dragavon, Joan A; Coombs, Robert W; Mullins, James I; Frenkel, Lisa M
2013-04-15
Whether unique human immunodeficiency virus type 1 (HIV) genotypes occur in the genital tract is important for vaccine development and management of drug-resistant viruses. Multiple cross-sectional studies suggest HIV is compartmentalized within the female genital tract. We hypothesize that bursts of HIV replication and/or proliferation of infected cells captured in cross-sectional analyses drive compartmentalization, but over time genital-specific viral lineages do not form; rather, viruses mix between genital tract and blood. Eight women with ongoing HIV replication were studied during a period of 1.5 to 4.5 years. Multiple viral sequences were derived by single-genome amplification of the HIV C2-V5 region of env from genital secretions and blood plasma. Maximum likelihood phylogenies were evaluated for compartmentalization using 4 statistical tests. In cross-sectional analyses, compartmentalization of genital from blood viruses was detected in three of eight women by all tests; this was associated with tissue-specific clades containing multiple monotypic sequences. In longitudinal analysis, the tissue-specific clades did not persist to form viral lineages. Rather, across women, HIV lineages comprised both genital tract and blood sequences. The observation of genital-specific HIV clades only in cross-sectional analysis and an absence of genital-specific lineages in longitudinal analyses suggest a dynamic interchange of HIV variants between the female genital tract and blood.
Darville, Lancia N F; Merchant, Mark E; Maccha, Venkata; Siddavarapu, Vivekananda Reddy; Hasan, Azeem; Murray, Kermit K
2012-02-01
Mass spectrometry in conjunction with de novo sequencing was used to determine the amino acid sequence of a 35 kDa lectin protein isolated from the serum of the American alligator that exhibits binding to mannose. The protein N-terminal sequence was determined using Edman degradation, and enzymatic digestion with different proteases was used to generate peptide fragments for analysis by liquid chromatography tandem mass spectrometry (LC MS/MS). Separate analysis of the protein digests with multiple enzymes enhanced the protein sequence coverage. De novo sequencing was accomplished using MASCOT Distiller and PEAKS software, and the sequences were searched against the NCBI database using MASCOT and BLAST to identify homologous peptides. MS analysis of the intact protein indicated that it is present primarily as monomer and dimer in vitro. The isolated 35 kDa protein was ~98% sequenced, found to have 313 amino acids and nine cysteine residues, and identified as an alligator lectin. The alligator lectin sequence was aligned with other lectin sequences using DIALIGN and ClustalW software and was found to exhibit 58% and 59% similarity to human and mouse intelectin-1, respectively. The alligator lectin exhibited strong binding affinities toward mannan and mannose as compared to other tested carbohydrates. Copyright © 2011 Elsevier Inc. All rights reserved.
Herrera, Victoria L M; Steffen, Martin; Moran, Ann Marie; Tan, Glaiza A; Pasion, Khristine A; Rivera, Keith; Pappin, Darryl J; Ruiz-Opazo, Nelson
2016-06-14
In contrast to rat and mouse databases, the NCBI gene database lists the human dual-endothelin1/VEGFsp receptor (DEspR, formerly Dear) as a unitary transcribed pseudogene due to a stop [TGA]-codon at codon#14 in automated DNA and RNA sequences. However, re-analysis is needed given that prior single gene studies detected a tryptophan [TGG]-codon#14 by manual Sanger sequencing and demonstrated DEspR translatability and functionality, and since the demonstration of actual non-translatability through expression studies, the standard-of-excellence for pseudogene designation, has not been performed. Re-analysis must meet UNIPROT criteria for demonstration of a protein's existence at the highest (protein) level, which a priori, would override DNA- or RNA-based deductions. To dissect the nucleotide sequence discrepancy, we performed Maxam-Gilbert sequencing and reviewed 727 RNA-seq entries. To comply with the highest level multiple UNIPROT criteria for determining DEspR's existence, we performed various experiments using multiple anti-DEspR monoclonal antibodies (mAbs) targeting distinct DEspR epitopes with one spanning the contested tryptophan [TGG]-codon#14, assessing: (a) DEspR protein expression, (b) predicted full-length protein size, (c) sequence-predicted protein-specific properties beyond codon#14: receptor glycosylation and internalization, (d) protein-partner interactions, and (e) DEspR functionality via DEspR-inhibition effects. Maxam-Gilbert sequencing and some RNA-seq entries demonstrate two guanines, hence a tryptophan [TGG]-codon#14 within a compression site spanning an error-prone compression sequence motif. Western blot analysis using anti-DEspR mAbs targeting distinct DEspR epitopes detects the identical glycosylated 17.5 kDa pull-down protein. A decrease in DEspR-protein size after PNGase-F digest demonstrates post-translational glycosylation, concordant with the consensus-glycosylation site beyond codon#14. As with other small single-transmembrane proteins, mass spectrometry analysis of anti-DEspR mAb pull-down proteins does not detect DEspR, but detects DEspR-protein interactions with proteins implicated in intracellular trafficking and cancer. FACS analyses also detect DEspR-protein in different human cancer stem-like cells (CSCs). DEspR-inhibition studies identify DEspR-roles in CSC survival and growth. Live cell imaging detects fluorescently-labeled anti-DEspR mAb targeted-receptor internalization, concordant with the single internalization-recognition sequence also located beyond codon#14. Data confirm translatability of DEspR, the full-length DEspR protein beyond codon#14, and elucidate DEspR-specific functionality. Along with detection of the tryptophan [TGG]-codon#14 within an error-prone compression site, cumulative data demonstrating DEspR protein existence fulfill multiple UNIPROT criteria, thus refuting its pseudogene designation.
GAMES identifies and annotates mutations in next-generation sequencing projects.
Sana, Maria Elena; Iascone, Maria; Marchetti, Daniela; Palatini, Jeff; Galasso, Marco; Volinia, Stefano
2011-01-01
Next-generation sequencing (NGS) methods have the potential to change the landscape of biomedical science, but at the same time pose several problems in analysis and interpretation. Currently, there are many commercial and public software packages that analyze NGS data. However, these applications are limited by output that is insufficiently annotated and difficult for end users to interpret functionally. We developed GAMES (Genomic Analysis of Mutations Extracted by Sequencing), a pipeline aiming to serve as an efficient middleman between data deluge and investigators. GAMES performs multiple levels of filtering and annotation, such as aligning the reads to a reference genome, performing quality control and mutational analysis, integrating results with genome annotations and sorting each mismatch/deletion according to a range of parameters. Variations are matched to known polymorphisms. The prediction of functional mutations is achieved by using different approaches. Overall, GAMES enables an effective complexity reduction in large-scale DNA-sequencing projects. GAMES is available free of charge to academic users and may be obtained from http://aqua.unife.it/GAMES.
Multiple Myeloma Genomics: A Systematic Review.
Weaver, Casey J; Tariman, Joseph D
2017-08-01
This integrative review describes the genomic variants that have been found to be associated with poor prognosis in patients diagnosed with multiple myeloma (MM). Second, it identifies MM genetic and genomic changes using next-generation sequencing, specifically whole-genome sequencing or exome sequencing. A search for peer-reviewed articles through PubMed, EBSCOhost, and DePaul WorldCat Libraries Worldwide yielded 33 articles that were included in the final analysis. The most commonly reported genetic changes were KRAS, NRAS, TP53, FAM46C, BRAF, DIS3, ATM, and CCND1. These genetic changes play a role in the pathogenesis of MM, prognostication, and therapeutic targets for novel therapies. MM genetics and genomics are expanding rapidly; oncology nurse clinicians must have basic competencies in genetics and genomics to help patients understand the complexities of genetic and genomic alterations and be able to refer patients to appropriate genomic professionals if needed. Copyright © 2017 Elsevier Inc. All rights reserved.
Kuraku, Shigehiro; Zmasek, Christian M.; Nishimura, Osamu; Katoh, Kazutaka
2013-07-01
We report a new web server, aLeaves (http://aleaves.cdb.riken.jp/), for homologue collection from diverse animal genomes. In molecular comparative studies involving multiple species, orthology identification is the basis on which most subsequent biological analyses rely. It can be achieved most accurately by explicit phylogenetic inference. More and more species are subjected to large-scale sequencing, but the resultant resources are scattered across independent project-based web sites and separate multi-species web sites. This complicates data access and is becoming a serious barrier to the comprehensiveness of molecular phylogenetic analysis. aLeaves, launched to overcome this difficulty, collects sequences similar to an input query sequence from various data sources. The collected sequences can be passed on to the MAFFT sequence alignment server (http://mafft.cbrc.jp/alignment/server/), which has been significantly improved in interactivity. This update enables switching between (i) sequence selection using the Archaeopteryx tree viewer, (ii) multiple sequence alignment and (iii) tree inference. This can be performed as a loop until one reaches a sensible data set, which minimizes redundancy for better visibility and handling in phylogenetic inference while covering relevant taxa. The workflow achieved by the seamless link between aLeaves and MAFFT provides a convenient online platform to address various questions in zoology and evolutionary biology. PMID:23677614
A statistical method for the detection of variants from next-generation resequencing of DNA pools.
Bansal, Vikas
2010-06-15
Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
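The two statistical ingredients named above can be sketched in a few lines. This is an illustration under stated assumptions, not the CRISP implementation: a Pearson chi-square statistic stands in for the contingency-table comparison across pools, and a binomial upper tail with a fixed per-base error rate stands in for the sequencing-error test. The read counts and the 1% error rate are hypothetical.

```python
from math import comb


def chi_square_statistic(alt_counts, ref_counts):
    """Pearson chi-square statistic for a 2 x k table of alt/ref reads across k pools."""
    total_alt, total_ref = sum(alt_counts), sum(ref_counts)
    grand = total_alt + total_ref
    stat = 0.0
    for alt, ref in zip(alt_counts, ref_counts):
        pool = alt + ref
        for observed, row_total in ((alt, total_alt), (ref, total_ref)):
            expected = row_total * pool / grand
            if expected > 0:  # skip empty rows to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat


def error_tail(alt_reads, depth, error_rate):
    """P(at least alt_reads non-reference calls arise from sequencing errors alone)."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(alt_reads, depth + 1)
    )


# Hypothetical site: three pools at depth 100; only the third carries the variant.
alt = [0, 0, 20]
ref = [100, 100, 80]
stat = chi_square_statistic(alt, ref)
p_error = error_tail(20, 100, 0.01)
print(stat, p_error)
```

With two degrees of freedom, a statistic this far above the 5% critical value of about 5.99 indicates that the allele counts differ between pools, while the tiny error-tail probability shows that 20 non-reference reads in one pool are very unlikely to be errors alone.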
Bidirectional Retroviral Integration Site PCR Methodology and Quantitative Data Analysis Workflow.
Suryawanshi, Gajendra W; Xu, Song; Xie, Yiming; Chou, Tom; Kim, Namshin; Chen, Irvin S Y; Kim, Sanggu
2017-06-14
Integration Site (IS) assays are a critical component of the study of retroviral integration sites and their biological significance. In recent retroviral gene therapy studies, IS assays, in combination with next-generation sequencing, have been used as a cell-tracking tool to characterize clonal stem cell populations sharing the same IS. For the accurate comparison of repopulating stem cell clones within and across different samples, the detection sensitivity, data reproducibility, and high-throughput capacity of the assay are among the most important assay qualities. This work provides a detailed protocol and data analysis workflow for bidirectional IS analysis. The bidirectional assay can simultaneously sequence both upstream and downstream vector-host junctions. Compared to conventional unidirectional IS sequencing approaches, the bidirectional approach significantly improves IS detection rates and the characterization of integration events at both ends of the target DNA. The data analysis pipeline described here accurately identifies and enumerates identical IS sequences through multiple steps of comparison that map IS sequences onto the reference genome and determine sequencing errors. Using an optimized assay procedure, we have recently published the detailed repopulation patterns of thousands of Hematopoietic Stem Cell (HSC) clones following transplant in rhesus macaques, demonstrating for the first time the precise time point of HSC repopulation and the functional heterogeneity of HSCs in the primate system. The following protocol describes the step-by-step experimental procedure and data analysis workflow that accurately identifies and quantifies identical IS sequences.
Brody, Thomas; Yavatkar, Amarendra S; Kuzin, Alexander; Kundu, Mukta; Tyson, Leonard J; Ross, Jermaine; Lin, Tzu-Yang; Lee, Chi-Hon; Awasaki, Takeshi; Lee, Tzumin; Odenwald, Ward F
2012-01-01
Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and analysis that takes full-advantage of the sequence information within enhancer CSCs. Results: We have generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable the discovery of functionally related enhancers. The program first identifies conserved repeat elements within an input enhancer and then searches the database for CSCs that score highly against the input CSC. Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the balance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, providing insights into the mode of integration of multiple regulating transcription factors. The database and accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any developmental process. Developmental Dynamics 241:169–189, 2012. © 2011 Wiley Periodicals, Inc. Key findings A genome-wide catalog of Drosophila conserved DNA sequence clusters. cis-Decoder discovers functionally related enhancers. Functionally related enhancers share balanced sequence element copy numbers. Many enhancers function during multiple phases of development. PMID:22174086
Simultaneous phylogeny reconstruction and multiple sequence alignment
Yue, Feng; Shi, Jian; Tang, Jijun
2009-01-01
Background A phylogeny is the evolutionary history of a group of organisms. To date, sequence data is still the most used data type for phylogenetic reconstruction. Before any sequences can be used for phylogeny reconstruction, they must be aligned, and the quality of the multiple sequence alignment has been shown to affect the quality of the inferred phylogeny. At the same time, all the current multiple sequence alignment programs use a guide tree to produce the alignment, and experiments have shown that good guide trees can significantly improve the multiple alignment quality. Results We devise a new algorithm to simultaneously align multiple sequences and search for the phylogenetic tree that leads to the best alignment. We also implemented the algorithm as a C program package, which can handle both DNA and protein data and can take a simple cost model as well as complex substitution matrices, such as PAM250 or BLOSUM62. The performance of the new method is compared with that of other popular multiple sequence alignment tools, including widely used programs such as ClustalW and T-Coffee. Experimental results suggest that this method has good performance in terms of both phylogeny accuracy and alignment quality. Conclusion We present an algorithm to align multiple sequences and reconstruct the phylogenies that minimize the alignment score, which is based on an efficient algorithm to solve the median problems for three sequences. Our extensive experiments suggest that this method is very promising and can produce high quality phylogenies and alignments. PMID:19208110
Pang, Y H; Zhao, J X; Du, W L; Li, Y L; Wang, J; Wang, L M; Wu, J; Cheng, X N; Yang, Q H; Chen, X H
2014-05-23
Leymus mollis (Trin.) Pilger (NsNsXmXm, 2n = 28), a wild relative of common wheat, possesses many traits that are potentially valuable for wheat improvement. In order to exploit and utilize the useful genes of L. mollis, we developed a multiple alien substitution line, 10DM50, from the progenies of octoploid Tritileymus M842-16 x Triticum durum cv. D4286. Genomic in situ hybridization analysis of mitosis and meiosis (metaphase I), using labeled total DNA of Psathyrostachys huashanica as probe, showed that the substitution line 10DM50 was a cytogenetically stable alien substitution line with 36 chromosomes from wheat and three pairs of Ns genome chromosomes from L. mollis. Simple sequence repeat analysis showed that the chromosomes 3D, 6D, and 7D were absent in 10DM50. Expressed sequence tag-sequence tagged sites analysis showed that new chromatin from 3Ns, 6Ns, and 7Ns of L. mollis were detected in 10DM50. We deduced that the substitution line 10DM50 was a multiple alien substitution line with the 3D, 6D, and 7D chromosomes replaced by 3Ns, 6Ns, and 7Ns from L. mollis. 10DM50 showed high resistance to leaf rust and significantly improved spike length, spikes per plant, and kernels per spike, which are correlated with higher wheat yield. These results suggest that line 10DM50 could be used as intermediate material for transferring desirable traits from L. mollis into common wheat in breeding programs.
Quasispecies Analyses of the HIV-1 Near-full-length Genome With Illumina MiSeq
Ode, Hirotaka; Matsuda, Masakazu; Matsuoka, Kazuhiro; Hachiya, Atsuko; Hattori, Junko; Kito, Yumiko; Yokomaku, Yoshiyuki; Iwatani, Yasumasa; Sugiura, Wataru
2015-01-01
Human immunodeficiency virus type-1 (HIV-1) exhibits high between-host genetic diversity and within-host heterogeneity, recognized as quasispecies. Because HIV-1 quasispecies fluctuate in terms of multiple factors, such as antiretroviral exposure and host immunity, analyzing the HIV-1 genome is critical for selecting effective antiretroviral therapy and understanding within-host viral coevolution mechanisms. Here, to obtain HIV-1 genome sequence information that includes minority variants, we sought to develop a method for evaluating quasispecies throughout the HIV-1 near-full-length genome using the Illumina MiSeq benchtop deep sequencer. To ensure the reliability of minority mutation detection, we applied an analysis method of sequence read mapping onto a consensus sequence derived from de novo assembly followed by iterative mapping and subsequent unique error correction. Deep sequencing analyses of an HIV-1 clone showed that the analysis method reduced erroneous base prevalence below 1% in each sequence position and discarded only < 1% of all collected nucleotides, maximizing the usage of the collected genome sequences. Further, we designed primer sets to amplify the HIV-1 near-full-length genome from clinical plasma samples. Deep sequencing of 92 samples in combination with the primer sets and our analysis method provided sufficient coverage to identify >1%-frequency sequences throughout the genome. When we evaluated sequences of pol genes from 18 treatment-naïve patients' samples, the deep sequencing results were in agreement with Sanger sequencing and identified numerous additional minority mutations. The results suggest that our deep sequencing method would be suitable for identifying within-host viral population dynamics throughout the genome. PMID:26617593
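The idea of reporting variants above a 1% frequency threshold can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: reads are already mapped, error-corrected and padded to the consensus length (with "-" for uncovered positions), and the function name and data layout are hypothetical rather than taken from the published pipeline.

```python
from collections import Counter

def minority_variants(reads, consensus, min_freq=0.01):
    """Return (position, base, frequency) for non-consensus bases.

    Assumes every read is a string of the same length as `consensus`,
    with '-' marking positions the read does not cover.
    """
    hits = []
    for pos, ref_base in enumerate(consensus):
        counts = Counter(read[pos] for read in reads if read[pos] != "-")
        depth = sum(counts.values())
        if depth == 0:
            continue  # no coverage at this position
        for base, n in sorted(counts.items()):
            freq = n / depth
            if base != ref_base and freq >= min_freq:
                hits.append((pos, base, freq))
    return hits
```

In practice the threshold only makes sense when the per-position error rate after correction is below it, which is the property the abstract reports for the HIV-1 clone control.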
MinION Analysis and Reference Consortium: Phase 1 data release and analysis
Eccles, David A.; Zalunin, Vadim; Urban, John M.; Piazza, Paolo; Bowden, Rory J.; Paten, Benedict; Mwaigwisya, Solomon; Batty, Elizabeth M.; Simpson, Jared T.; Snutch, Terrance P.
2015-01-01
The advent of a miniaturized DNA sequencing device with a high-throughput contextual sequencing capability embodies the next generation of large scale sequencing tools. The MinION™ Access Programme (MAP) was initiated by Oxford Nanopore Technologies™ in April 2014, giving public access to their USB-attached miniature sequencing device. The MinION Analysis and Reference Consortium (MARC) was formed by a subset of MAP participants, with the aim of evaluating and providing standard protocols and reference data to the community. Envisaged as a multi-phased project, this study provides the global community with the Phase 1 data from MARC, where the reproducibility of the performance of the MinION was evaluated at multiple sites. Five laboratories on two continents generated data using a control strain of Escherichia coli K-12, preparing and sequencing samples according to a revised ONT protocol. Here, we provide the details of the protocol used, along with a preliminary analysis of the characteristics of typical runs including the consistency, rate, volume and quality of data produced. Further analysis of the Phase 1 data presented here, and additional experiments in Phase 2 of E. coli from MARC are already underway to identify ways to improve and enhance MinION performance. PMID:26834992
Design of DNA pooling to allow incorporation of covariates in rare variants analysis.
Guan, Weihua; Li, Chun
2014-01-01
Rapid advances in next-generation sequencing technologies facilitate genetic association studies of an increasingly wide array of rare variants. To capture the rare or less common variants, a large number of individuals will be needed. However, the cost of a large scale study using whole genome or exome sequencing is still high. DNA pooling can serve as a cost-effective approach, but with a potential limitation that the identity of individual genomes would be lost and therefore individual characteristics and environmental factors could not be adjusted in association analysis, which may result in power loss and a biased estimate of genetic effect. For case-control studies, we propose a design strategy for pool creation and an analysis strategy that allows covariate adjustment, using a multiple imputation technique. Simulations show that our approach can obtain a reasonable estimate of the genotypic effect with only a slight loss of power compared to the much more expensive approach of sequencing individual genomes. Our design and analysis strategies enable more powerful and cost-effective sequencing studies of complex diseases, while allowing incorporation of covariate adjustment.
Rahimi, Pooneh; Tabatabaie, H; Gouya, Mohammad M; Mahmudi, M; Musavi, T; Rad, K Samimi; Azad, T Mokhtari; Nategh, R
2009-06-01
The 66 serotypes of human enteroviruses (EVs) are classified into four species A-D, based on phylogenetic relationships in multiple genome regions. Partial VP(1) amplification and sequence analysis are reliable methods for identifying non-polio enterovirus serotypes, especially in negative cell culture specimens from patients with residual paralysis. In Iran during the years 2000-2002, there were 29 residual paralysis cases with negative cell (RD, HEp(2) and L(20)B) culture results. The genomic RNA was extracted from stool specimens from cases of residual paralysis and detected by amplification of the 5'-nontranslated region using RT-PCR with Pan-EV primers. Partial VP(1) amplification by semi-nested RT-PCR (snRT-PCR) and sequence analysis were done. Specimens from the 29 culture-negative cases contained echoviruses of six different serotypes. The global eradication of wild polioviruses is near and study of non-polio enteroviruses, which can cause poliomyelitis, is increasingly important to understand their pathogenesis. The VP(1) sequences, derived from the snRT-PCR products, allowed rapid molecular analysis of these non-polio strains.
Single-cell genomic sequencing using Multiple Displacement Amplification.
Lasken, Roger S
2007-10-01
Single microbial cells can now be sequenced using DNA amplified by the Multiple Displacement Amplification (MDA) reaction. The few femtograms of DNA in a bacterium are amplified into micrograms of high molecular weight DNA suitable for DNA library construction and Sanger sequencing. The MDA-generated DNA also performs well when used directly as template for pyrosequencing by the 454 Life Sciences method. While MDA from single cells loses some of the genomic sequence, this approach will greatly accelerate the pace of sequencing from uncultured microbes. The genetically linked sequences from single cells are also a powerful tool to be used in guiding genomic assembly of shotgun sequences of multiple organisms from environmental DNA extracts (metagenomic sequences).
Odronitz, Florian; Kollmar, Martin
2006-01-01
Background Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Description Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. Conclusion We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein. PMID:17134497
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.
Bauer, Markus; Klau, Gunnar W; Reinert, Knut
2007-07-27
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.
Association analysis of whole genome sequencing data accounting for longitudinal and family designs.
Hu, Yijuan; Hui, Qin; Sun, Yan V
2014-01-01
Using the whole genome sequencing data and the simulated longitudinal phenotypes for 849 pedigree-based individuals from Genetic Analysis Workshop 18, we investigated various approaches to detecting the association of rare and common variants with blood pressure traits. We compared three strategies for longitudinal data: (a) using the baseline measurement only, (b) using the average from multiple visits, and (c) using all individual measurements. We also compared the power of using all of the pedigree-based data and the unrelated subset. The analyses were performed without knowledge of the underlying simulating model.
High-Throughput Analysis of T-DNA Location and Structure Using Sequence Capture.
Inagaki, Soichi; Henry, Isabelle M; Lieberman, Meric C; Comai, Luca
2015-01-01
Agrobacterium-mediated transformation of plants with T-DNA is used both to introduce transgenes and for mutagenesis. Conventional approaches used to identify the genomic location and the structure of the inserted T-DNA are laborious, and high-throughput methods using next-generation sequencing are being developed to address these problems. Here, we present a cost-effective approach that uses sequence capture targeted to the T-DNA borders to select genomic DNA fragments containing T-DNA-genome junctions, followed by Illumina sequencing to determine the location and junction structure of T-DNA insertions. Multiple probes can be mixed so that transgenic lines transformed with different T-DNA types can be processed simultaneously, using a simple, index-based pooling approach. We also developed a simple bioinformatic tool to find sequence read pairs that span the junction between the genome and T-DNA or any foreign DNA. We analyzed 29 transgenic lines of Arabidopsis thaliana, each containing inserts from 4 different T-DNA vectors. We determined the location of T-DNA insertions in 22 lines, 4 of which carried multiple insertion sites. Additionally, our analysis uncovered a high frequency of unconventional and complex T-DNA insertions, highlighting the need for high-throughput methods for T-DNA localization and structural characterization. Transgene insertion events have to be fully characterized prior to use as commercial products. Our method greatly facilitates the first step of this characterization of transgenic plants by providing an efficient screen for the selection of promising lines.
HTLV-1aA introduction into Brazil and its association with the trans-Atlantic slave trade.
Amoussa, Adjile Edjide Roukiyath; Wilkinson, Eduan; Giovanetti, Marta; de Almeida Rego, Filipe Ferreira; Araujo, Thessika Hialla A; de Souza Gonçalves, Marilda; de Oliveira, Tulio; Alcantara, Luiz Carlos Junior
2017-03-01
Human T-lymphotropic virus (HTLV) is an endemic virus in some parts of the world, with Africa being home to most of the viral genetic diversity. In Brazil, HTLV-1 is endemic amongst Japanese and African immigrant populations. Multiple introductions of the virus in Brazil from other epidemic foci were hypothesized. The long terminal repeat (LTR) region of HTLV-1 was used to infer the origin of the virus in Brazil, using phylogenetic analysis. LTR sequences were obtained from the HTLV-1 database (http://htlv1db.bahia.fiocruz.br). Sequences were aligned and maximum-likelihood and Bayesian tree topologies were inferred. Brazilian-specific clusters were identified and molecular-clock and coalescent models were used to estimate each cluster's time to the most recent common ancestor (tMRCA). Three Brazilian clusters were identified with posterior probabilities ranging from 0.61 to 0.99. Molecular clock analysis of these three clusters dated their respective tMRCAs to between the year 1499 and the year 1668. Additional analysis also identified a close association between Brazilian sequences and new sequences from South Africa. Our results support the hypothesis of multiple introductions of HTLV-1 into Brazil, with the majority of introductions occurring in the post-Columbian period. Our results further suggest that HTLV-1 introduction into Brazil was facilitated by the trans-Atlantic slave trade from endemic areas of Africa. The close association between southern African and Brazilian sequences also suggested that greater numbers of the southern African Bantu population might also have been part of the slave trade than previously thought. Copyright © 2016. Published by Elsevier B.V.
Camerlengo, Terry; Ozer, Hatice Gulcin; Onti-Srinivasan, Raghuram; Yan, Pearlly; Huang, Tim; Parvin, Jeffrey; Huang, Kun
2012-01-01
Next Generation Sequencing (NGS) is highly resource intensive. NGS tasks related to data processing, management and analysis require high-end computing servers or even clusters. Additionally, processing NGS experiments requires suitable storage space and significant manual interaction. At The Ohio State University's Biomedical Informatics Shared Resource, we designed and implemented a scalable architecture to address the challenges associated with the resource intensive nature of NGS secondary analysis built around Illumina Genome Analyzer II sequencers and Illumina's Gerald data processing pipeline. The software infrastructure includes a distributed computing platform consisting of a LIMS called QUEST (http://bisr.osumc.edu), an Automation Server, a computer cluster for processing NGS pipelines, and a network attached storage device expandable up to 40TB. The system has been architected to scale to multiple sequencers without requiring additional computing or labor resources. This platform demonstrates how to manage and automate NGS experiments in an institutional or core facility setting.
Muangkram, Yuttamol; Amano, Akira; Wajjwalku, Worawidh; Pinyopummintr, Tanu; Thongtip, Nikorn; Kaolim, Nongnid; Sukmak, Manakorn; Kamolnorranath, Sumate; Siriaroonrat, Boripat; Tipkantha, Wanlaya; Maikaew, Umaporn; Thomas, Warisara; Polsrila, Kanda; Dongsaard, Kwanreaun; Sanannu, Saowaphang; Wattananorrasate, Anuwat
2017-07-01
The Asian tapir (Tapirus indicus) has been classified as Endangered on the IUCN Red List of Threatened Species (2008). Genetic diversity data provide important information for the management of captive breeding and conservation of this species. We analyzed mitochondrial control region (CR) sequences from 37 captive Asian tapirs in Thailand. Multiple alignments of the full-length CR sequences, 1268 bp in size, comprised three domains as described in other mammal species. Analysis of 16 parsimony-informative variable sites revealed 11 haplotypes. Furthermore, the phylogenetic analysis using a median-joining network clearly showed three clades that correlated with our earlier cytochrome b gene study in this endangered species. The repetitive motif is located between the first and second conserved sequence blocks, similar to the Brazilian tapir. The most polymorphic site was located in the extended termination associated sequences domain. The results could inform future genetic management, both in captivity and in the wild, aimed at maintaining stable populations.
Characterization of tannase protein sequences of bacteria and fungi: an in silico study.
Banerjee, Amrita; Jana, Arijit; Pati, Bikash R; Mondal, Keshab C; Das Mohapatra, Pradeep K
2012-04-01
The tannase protein sequences of 149 bacteria and 36 fungi were retrieved from the NCBI database. Of these, only the 77 bacterial and 31 fungal tannase sequences with distinct amino acid compositions were retained. These sequences were analysed for different physical and chemical properties, superfamily search, multiple sequence alignment, phylogenetic tree construction and motif finding to identify the functional motif and the evolutionary relationship among them. The superfamily search for these tannases revealed the occurrence of proline iminopeptidase-like, biotin biosynthesis protein BioH, O-acetyltransferase, carboxylesterase/thioesterase 1, carbon-carbon bond hydrolase, haloperoxidase, prolyl oligopeptidase, C-terminal domain and mycobacterial antigens families and the alpha/beta hydrolase superfamily. Some bacterial and fungal sequences showed similarity with different families individually. The multiple sequence alignment of these tannase protein sequences showed conserved regions at different stretches, with maximum homology at amino acid residues 389-469 and 482-523, which could be used for designing degenerate primers or probes specific for tannase-producing bacterial and fungal species. The phylogenetic tree showed two different clusters: one containing only bacteria and the other containing both fungi and bacteria, showing some relationship between these different genera. In the second cluster, however, nearly all fungal species grouped together, which indicates sequence-level similarity among fungal genera. Analysis of the distribution of fourteen motifs revealed that Motif 1, with a signature sequence of 29 amino acids (GCSTGGREALKQAQRWPHDYDGIIANNPA), was uniformly observed in 83.3% of the studied tannase sequences, suggesting its involvement in structure and enzymatic function.
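A figure such as "observed in 83.3% of sequences" can be reproduced in spirit with a simple count. The Python sketch below reports the percentage of sequences that contain a motif as an exact substring. That is a deliberate simplification: motif-finding tools normally score approximate matches against a position-specific profile, and the function name here is hypothetical. Only the motif string is taken from the abstract.

```python
# Motif 1 as quoted in the abstract (29 residues).
MOTIF1 = "GCSTGGREALKQAQRWPHDYDGIIANNPA"

def motif_coverage(sequences, motif):
    """Percentage of sequences containing `motif` as an exact substring.

    Exact matching is an assumption made for brevity; real motif
    occurrence calls usually tolerate substitutions.
    """
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if motif in seq.upper())
    return 100.0 * hits / len(sequences)
```

Called with a list of protein sequences read from a FASTA file, the function returns a value between 0 and 100 that can be compared across motifs.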
A novel approach to multiple sequence alignment using Hadoop data grids.
Sudha Sadasivam, G; Baktavatchalam, G
2010-01-01
Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are the speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computationally intensive. In this paper we propose a time-efficient approach to sequence alignment that also produces quality alignments. The dynamic nature of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. The principle of block splitting in Hadoop, coupled with its scalability, facilitates alignment of very large sequences.
DSAP: deep-sequencing small RNA analysis pipeline.
Huang, Po-Jung; Liu, Yi-Chung; Lee, Chi-Ching; Lin, Wei-Chen; Gan, Richie Ruei-Chi; Lyu, Ping-Chiang; Tang, Petrus
2010-07-01
DSAP is an automated multiple-task web service designed to provide a total solution to analyzing deep-sequencing small RNA datasets generated by next-generation sequencing technology. DSAP uses a tab-delimited file as an input format, which holds the unique sequence reads (tags) and their corresponding number of copies generated by the Solexa sequencing platform. The input data will go through four analysis steps in DSAP: (i) cleanup: removal of adaptors and poly-A/T/C/G/N nucleotides; (ii) clustering: grouping of cleaned sequence tags into unique sequence clusters; (iii) non-coding RNA (ncRNA) matching: sequence homology mapping against a transcribed sequence library from the ncRNA database Rfam (http://rfam.sanger.ac.uk/); and (iv) known miRNA matching: detection of known miRNAs in miRBase (http://www.mirbase.org/) based on sequence homology. The expression levels corresponding to matched ncRNAs and miRNAs are summarized in multi-color clickable bar charts linked to external databases. DSAP is also capable of displaying miRNA expression levels from different jobs using a log(2)-scaled color matrix. Furthermore, a cross-species comparative function is also provided to show the distribution of identified miRNAs in different species as deposited in miRBase. DSAP is available at http://dsap.cgu.edu.tw.
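Steps (i) and (ii) of the workflow described above, cleanup and clustering, can be sketched as follows. This Python fragment is an assumption-laden illustration and not DSAP's implementation: it expects each input line to hold a tag sequence and its copy number separated by a tab, trims everything from the first exact occurrence of a user-supplied adaptor, and merges the counts of tags that become identical after trimming. Poly-A/T/C/G/N removal and the later homology-matching steps are omitted.

```python
from collections import defaultdict

def clean_and_cluster(lines, adaptor):
    """Trim the adaptor from each tag and sum copy numbers per unique tag.

    `lines` is an iterable of 'SEQUENCE<TAB>COUNT' strings (assumed format).
    """
    clusters = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seq, count = line.split("\t")
        cut = seq.find(adaptor)
        if cut != -1:
            seq = seq[:cut]  # drop the adaptor and anything after it
        if seq:  # tags that were pure adaptor are discarded
            clusters[seq] += int(count)
    return dict(clusters)
```

The resulting dictionary maps each cleaned tag to its total copy number, which is the kind of table the later matching steps would take as input.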
Krüger, Melanie; Hinder, Mark R; Puri, Rohan; Summers, Jeffery J
2017-01-01
Objectives: The aim of this study was to investigate how age-related performance differences in a visuospatial sequence learning task relate to age-related declines in cognitive functioning. Method: Cognitive functioning of 18 younger and 18 older participants was assessed using a standardized test battery. Participants then undertook a perceptual visuospatial sequence learning task. Various relationships between sequence learning and participants' cognitive functioning were examined through correlation and factor analysis. Results: Older participants exhibited significantly lower performance than their younger counterparts in the sequence learning task as well as in multiple cognitive functions. Factor analysis revealed two independent subsets of cognitive functions associated with performance in the sequence learning task, related to either the processing and storage of sequence information (first subset) or problem solving (second subset). Age-related declines were only found for the first subset of cognitive functions, which also explained a significant degree of the performance differences in the sequence learning task between age-groups. Discussion: The results suggest that age-related performance differences in perceptual visuospatial sequence learning can be explained by declines in the ability to process and store sequence information in older adults, while a set of cognitive functions related to problem solving mediates performance differences independent of age.
Bhore, Subhash J; Kassim, Amelia; Loh, Chye Ying; Shah, Farida H
2010-01-01
It is well known that the nutritional quality of the American oil-palm (Elaeis oleifera) mesocarp oil is superior to that of African oil-palm (Elaeis guineensis Jacq. Tenera) mesocarp oil. Therefore, it is important to identify the genetic features responsible for its superior value. This could be achieved through genome sequencing of the oil-palm. However, the genome sequence is not available in the public domain due to commercial secrecy. Hence, we constructed a cDNA library and generated expressed sequence tags (3,205) from the mesocarp tissue of the American oil-palm. We continued to annotate each of these cDNAs after submitting them to GenBank/DDBJ/EMBL. A rough analysis turned our attention to the cDNA encoding the beta-carotene hydroxylase (Chyb) enzyme. We then completed full sequencing of both strands of the cDNA clone using M13 forward and reverse primers. The full nucleotide and protein sequence was further analyzed and annotated using various bioinformatics tools. The analysis results showed the presence of a fatty acid hydroxylase superfamily domain in the protein sequence. The multiple sequence alignment of selected Chyb amino acid sequences from other plant species and algal members with E. oleifera Chyb using ClustalW, and its phylogenetic analysis, suggest that Chyb from the monocotyledonous plant species Lilium hybrid, Crocus sativus and Zea mays are the most closely related evolutionarily to E. oleifera Chyb. This study reports the annotation of E. oleifera Chyb. Abbreviations ESTs - expressed sequence tags, EoChyb - Elaeis oleifera beta-carotene hydroxylase, MC - main cluster PMID:21364789
spads 1.0: a toolbox to perform spatial analyses on DNA sequence data sets.
Dellicour, Simon; Mardulyn, Patrick
2014-05-01
SPADS 1.0 (for 'Spatial and Population Analysis of DNA Sequences') is a population genetic toolbox for characterizing genetic variability within and among populations from DNA sequences. In view of the drastic increase in genetic information available through sequencing methods, SPADS was specifically designed to deal with multilocus data sets of DNA sequences. It computes several summary statistics from populations or groups of populations, performs input file conversions for other population genetic programs and implements locus-by-locus and multilocus versions of two clustering algorithms to study the genetic structure of populations. The toolbox also includes two MATLAB and R functions, GDISPAL and GDIVPAL, to display differentiation and diversity patterns across landscapes. These functions aim to generate interpolating surfaces based on multilocus distance and diversity indices. In the case of multiple loci, such surfaces can represent a useful alternative to the multiple pie chart maps traditionally used in phylogeography to represent the spatial distribution of genetic diversity. These coloured surfaces can also be used to compare different data sets or different diversity and/or distance measures estimated on the same data set. © 2013 John Wiley & Sons Ltd.
GibbsCluster: unsupervised clustering and alignment of peptide sequences.
Andreatta, Massimo; Alvarez, Bruno; Nielsen, Morten
2017-07-03
Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertions and deletions to account for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
2014-01-01
Background Ambiscript is a graphically-designed nucleic acid notation that uses symbol symmetries to support sequence complementation, highlight biologically-relevant palindromes, and facilitate the analysis of consensus sequences. Although the original Ambiscript notation was designed to easily represent consensus sequences for multiple sequence alignments, the notation’s black-on-white ambiguity characters are unable to reflect the statistical distribution of nucleotides found at each position. We now propose a color-augmented ambigraphic notation to encode the frequency of positional polymorphisms in these consensus sequences. Results We have implemented this color-coding approach by creating an Adobe Flash® application (http://www.ambiscript.org) that shades and colors modified Ambiscript characters according to the prevalence of the encoded nucleotide at each position in the alignment. The resulting graphic helps viewers perceive biologically-relevant patterns in multiple sequence alignments by uniquely combining color, shading, and character symmetries to highlight palindromes and inverted repeats in conserved DNA motifs. Conclusion Juxtaposing an intuitive color scheme over the deliberate character symmetries of an ambigraphic nucleic acid notation yields a highly-functional nucleic acid notation that maximizes information content and successfully embodies key principles of graphic excellence put forth by the statistician and graphic design theorist, Edward Tufte. PMID:24447494
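The quantity that drives the shading described above is the prevalence of each nucleotide at each alignment position. A minimal Python sketch of that calculation is shown here; it is not the Ambiscript application's code, and the function name and the choice to ignore gap characters ("-") are assumptions made for the example.

```python
from collections import Counter

def column_frequencies(alignment):
    """Per-column nucleotide frequencies for equal-length aligned rows.

    Returns one dict per column mapping base -> fraction of non-gap rows.
    Columns made only of gaps yield an empty dict.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    freqs = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base != "-")
        total = sum(counts.values())
        freqs.append({b: n / total for b, n in counts.items()} if total else {})
    return freqs
```

A renderer could then map each fraction to a shade or color intensity for the corresponding consensus character.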
Sun, Zhifu; Cunningham, Julie; Slager, Susan; Kocher, Jean-Pierre
2015-01-01
Bisulfite treatment-based methylation microarray (mainly Illumina 450K Infinium array) and next-generation sequencing (reduced representation bisulfite sequencing, Agilent SureSelect Human Methyl-Seq, NimbleGen SeqCap Epi CpGiant or whole-genome bisulfite sequencing) are commonly used for base resolution DNA methylome research. Although multiple tools and methods have been developed and used for data preprocessing and analysis, confusion remains for these platforms, including how and whether the 450K array should be normalized; which platform should be used to better fit researchers’ needs; and which statistical models would be more appropriate for differential methylation analysis. This review presents the commonly used platforms and compares the pros and cons of each in methylome profiling. We then discuss approaches to study design, data normalization, bias correction and model selection for differentially methylated individual CpGs and regions. PMID:26366945
Wang, Edwin; Zou, Jinfeng; Zaman, Naif; Beitel, Lenore K; Trifiro, Mark; Paliouras, Miltiadis
2013-08-01
Recent tumor genome sequencing confirmed that one tumor often consists of multiple cell subpopulations (clones) which bear different, but related, genetic profiles such as mutation and copy number variation profiles. Thus far, one tumor has been viewed as a whole entity in cancer functional studies. With the advances of genome sequencing and computational analysis, we are able to quantify and computationally dissect clones from tumors, and then conduct clone-based analysis. Emerging technologies such as single-cell genome sequencing and RNA-Seq could profile tumor clones. Thus, we should reconsider how to conduct cancer systems biology studies in the genome sequencing era. We outline new directions for conducting cancer systems biology by considering that genome sequencing technology can be used for dissecting, quantifying and genetically characterizing clones from tumors. Topics discussed in Part 1 of this review include computational quantification of tumor subpopulations; clone-based network modeling; cancer hallmark-based networks and their high-order rewiring principles; and the principles of cell survival networks of fast-growing clones. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
Zou, Xiaohui; Tang, Guangpeng; Zhao, Xiang; Huang, Yan; Chen, Tao; Lei, Mingyu; Chen, Wenbing; Yang, Lei; Zhu, Wenfei; Zhuang, Li; Yang, Jing; Feng, Zhaomin; Wang, Dayan; Wang, Dingming; Shu, Yuelong
2017-03-01
Many viruses can cause respiratory diseases in humans. Although great advances have been achieved in methods of diagnosis, it remains challenging to identify pathogens in unexplained pneumonia (UP) cases. In this study, we applied next-generation sequencing (NGS) technology and a metagenomic approach to detect and characterize respiratory viruses in UP cases from Guizhou Province, China. A total of 33 oropharyngeal swabs were obtained from hospitalized UP patients and subjected to NGS. An unbiased metagenomic analysis pipeline identified 13 virus species in 16 samples. Human rhinovirus C was the virus most frequently detected and was identified in seven samples. Human measles virus, adenovirus B 55 and coxsackievirus A10 were also identified. Metagenomic sequencing also provided virus genomic sequences, which enabled genotype characterization and phylogenetic analysis. For cases of multiple infection, metagenomic sequencing afforded information regarding the quantity of each virus in the sample, which could be used to evaluate each virus's role in the disease. Our study highlights the potential of metagenomic sequencing for pathogen identification in UP cases.
Walker, M D; Park, C W; Rosen, A; Aronheim, A
1990-01-01
Cell-specific expression of the insulin gene is achieved through transcriptional mechanisms operating on multiple DNA sequence elements located in the 5' flanking region of the gene. Of particular importance in the rat insulin I gene are two closely similar 9 bp sequences (IEB1 and IEB2): mutation of either of these leads to a 5- to 10-fold reduction in transcriptional activity. We have screened an expression cDNA library derived from mouse pancreatic endocrine beta cells with a radioactive DNA probe containing multiple copies of the IEB1 sequence. A cDNA clone (A1) isolated by this procedure encodes a protein which shows efficient binding to the IEB1 probe, but much weaker binding to either an unrelated DNA probe or to a probe bearing a single base pair insertion within the recognition sequence. DNA sequence analysis indicates a protein belonging to the helix-loop-helix family of DNA-binding proteins. The ability of the protein encoded by clone A1 to recognize a number of wild type and mutant DNA sequences correlates closely with the ability of each sequence element to support transcription in vivo in the context of the insulin 5' flanking DNA. We conclude that the isolated cDNA may encode a transcription factor that participates in control of insulin gene expression. PMID:2181401
A distributed system for fast alignment of next-generation sequencing data.
Srimani, Jaydeep K; Wu, Po-Yen; Phan, John H; Wang, May D
2010-12-01
We developed a scalable distributed computing system using the Berkeley Open Interface for Network Computing (BOINC) to align next-generation sequencing (NGS) data quickly and accurately. NGS technology is emerging as a promising platform for gene expression analysis due to its high sensitivity compared to traditional genomic microarray technology. However, despite the benefits, NGS datasets can be prohibitively large, requiring significant computing resources to obtain sequence alignment results. Moreover, as the data and alignment algorithms become more prevalent, it will become necessary to examine the effect of the multitude of alignment parameters on various NGS systems. We validate the distributed software system by (1) computing simple timing results to show the speed-up gained by using multiple computers, (2) optimizing alignment parameters using simulated NGS data, and (3) computing NGS expression levels for a single biological sample using optimal parameters and comparing these expression levels to those of a microarray sample. Results indicate that the distributed alignment system achieves an approximately linear speed-up and correctly distributes sequence data to and gathers alignment results from multiple compute clients.
NASA Astrophysics Data System (ADS)
Tibbetts, Clark; Lichanska, Agnieszka M.; Borsuk, Lisa A.; Weslowski, Brian; Morris, Leah M.; Lorence, Matthew C.; Schafer, Klaus O.; Campos, Joseph; Sene, Mohamadou; Myers, Christopher A.; Faix, Dennis; Blair, Patrick J.; Brown, Jason; Metzgar, David
2010-04-01
High-density resequencing microarrays support simultaneous detection and identification of multiple viral and bacterial pathogens. Because detection and identification using RPM is based upon multiple specimen-specific target pathogen gene sequences generated in the individual test, the test results enable both a differential diagnostic analysis and epidemiological tracking of detected pathogen strains and variants from one specimen to the next. The RPM assay enables detection and identification of pathogen sequences that share as little as 80% sequence similarity to prototype target gene sequences represented as detector tiles on the array. This capability enables the RPM to detect and identify previously unknown strains and variants of a detected pathogen, as in sentinel cases associated with an infectious disease outbreak. We illustrate this capability using assay results from testing influenza A virus vaccines configured with strains that were first defined years after the design of the RPM microarray. Results are also presented from RPM-Flu testing of three specimens independently confirmed to be positive for the 2009 Novel H1N1 outbreak strain of influenza virus.
The Poultry-Associated Microbiome: Network Analysis and Farm-to-Fork Characterizations
Oakley, Brian B.; Morales, Cesar A.; Line, J.; Berrang, Mark E.; Meinersmann, Richard J.; Tillman, Glenn E.; Wise, Mark G.; Siragusa, Gregory R.; Hiett, Kelli L.; Seal, Bruce S.
2013-01-01
Microbial communities associated with agricultural animals are important for animal health, food safety, and public health. Here we combine high-throughput sequencing (HTS), quantitative-PCR assays, and network analysis to profile the poultry-associated microbiome and important pathogens at various stages of commercial poultry production from the farm to the consumer. Analysis of longitudinal data following two flocks from the farm through processing showed a core microbiome containing multiple sequence types most closely related to genera known to be pathogenic for animals and/or humans, including Campylobacter, Clostridium, and Shigella. After the final stage of commercial poultry processing, taxonomic richness was ca. 2–4 times lower than the richness of fecal samples from the same flocks and Campylobacter abundance was significantly reduced. Interestingly, however, carcasses sampled at 48 hr after processing harboured the greatest proportion of unique taxa (those not encountered in other samples), significantly more than expected by chance. Among these were anaerobes such as Prevotella, Veillonella, Leptotrichia, and multiple Campylobacter sequence types. Retail products were dominated by Pseudomonas, but also contained 27 other genera, most of which were potentially metabolically active and encountered in on-farm samples. Network analysis was focused on the foodborne pathogen Campylobacter and revealed a majority of sequence types with no significant interactions with other taxa, perhaps explaining the limited efficacy of previous attempts at competitive exclusion of Campylobacter. These data represent the first use of HTS to characterize the poultry microbiome across a series of farm-to-fork samples and demonstrate the utility of HTS in monitoring the food supply chain and identifying sources of potential zoonoses and interactions among taxa in complex communities. PMID:23468931
Targeted Analysis of Whole Genome Sequence Data to Diagnose Genetic Cardiomyopathy
Golbus, Jessica R.; Puckelwartz, Megan J.; Dellefave-Castillo, Lisa; ...
2014-09-01
Background: Cardiomyopathy is highly heritable but genetically diverse. At present, genetic testing for cardiomyopathy uses targeted sequencing to simultaneously assess the coding regions of more than 50 genes. New genes are routinely added to panels to improve the diagnostic yield. With the anticipated $1000 genome, it is expected that genetic testing will shift towards comprehensive genome sequencing accompanied by targeted gene analysis. Therefore, we assessed the reliability of whole genome sequencing and targeted analysis to identify cardiomyopathy variants in 11 subjects with cardiomyopathy. Methods and Results: Whole genome sequencing with an average of 37× coverage was combined with targeted analysis focused on 204 genes linked to cardiomyopathy. Genetic variants were scored using multiple prediction algorithms combined with frequency data from public databases. This pipeline yielded 1-14 potentially pathogenic variants per individual. Variants were further analyzed using clinical criteria and/or segregation analysis. Three of three previously identified primary mutations were detected by this analysis. In six subjects for whom the primary mutation was previously unknown, we identified mutations that segregated with disease, had clinical correlates, and/or had additional pathological correlation to provide evidence for causality. For two subjects with previously known primary mutations, we identified additional variants that may act as modifiers of disease severity. In total, we identified the likely pathological mutation in 9 of 11 (82%) subjects. We conclude that these pilot data demonstrate that ~30-40× coverage whole genome sequencing combined with targeted analysis is feasible and sensitive to identify rare variants in cardiomyopathy-associated genes.
Comparison of Next-Generation Sequencing Systems
Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie
2012-01-01
With the rapid development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to help decode life mysteries, make better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, technologies of these systems are reviewed, and first-hand data from extensive experience is summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized. PMID:22829749
MSAViewer: interactive JavaScript visualization of multiple sequence alignments.
Yachdav, Guy; Wilzbach, Sebastian; Rauscher, Benedikt; Sheridan, Robert; Sillitoe, Ian; Procter, James; Lewis, Suzanna E; Rost, Burkhard; Goldberg, Tatyana
2016-11-15
The MSAViewer is a quick and easy visualization and analysis JavaScript component for Multiple Sequence Alignment data of any size. Core features include interactive navigation through the alignment, application of popular color schemes, sorting, selecting and filtering. The MSAViewer is 'web ready': written entirely in JavaScript, compatible with modern web browsers and does not require any specialized software. The MSAViewer is part of the BioJS collection of components. The MSAViewer is released as open source software under the Boost Software License 1.0. Documentation, source code and the viewer are available at http://msa.biojs.net/. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: msa@bio.sh. © The Author 2016. Published by Oxford University Press.
MSAViewer: interactive JavaScript visualization of multiple sequence alignments
Yachdav, Guy; Wilzbach, Sebastian; Rauscher, Benedikt; Sheridan, Robert; Sillitoe, Ian; Procter, James; Lewis, Suzanna E.; Rost, Burkhard; Goldberg, Tatyana
2016-01-01
Summary: The MSAViewer is a quick and easy visualization and analysis JavaScript component for Multiple Sequence Alignment data of any size. Core features include interactive navigation through the alignment, application of popular color schemes, sorting, selecting and filtering. The MSAViewer is ‘web ready’: written entirely in JavaScript, compatible with modern web browsers and does not require any specialized software. The MSAViewer is part of the BioJS collection of components. Availability and Implementation: The MSAViewer is released as open source software under the Boost Software License 1.0. Documentation, source code and the viewer are available at http://msa.biojs.net/. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: msa@bio.sh PMID:27412096
CisSERS: Customizable in silico sequence evaluation for restriction sites
Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus; ...
2016-04-12
High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
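The core idea described here, scanning FASTA sequences for restriction sites and predicting the fragment sizes a gel would show, can be sketched in a few lines of Python. This is not CisSERS code and does not read REBASE: the three-enzyme table is hard-coded for illustration, the function names are hypothetical, and the molecule is assumed to be linear.

```python
import re

# Recognition site and top-strand cut offset (bases from the site start).
# A small hard-coded table for illustration; a real tool would load REBASE.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def parse_fasta(text):
    """Parse FASTA text into a dict of record name -> uppercase sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def find_sites(seq, site):
    """Return 0-based start positions of every (possibly overlapping) match."""
    pattern = "(?=" + re.escape(site) + ")"
    return [match.start() for match in re.finditer(pattern, seq)]


def digest(seq, enzyme_names):
    """Return fragment lengths, in sequence order, for a linear digest."""
    cuts = set()
    for enzyme in enzyme_names:
        site, offset = ENZYMES[enzyme]
        cuts.update(pos + offset for pos in find_sites(seq, site))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    fasta = ">demo hypothetical record\nAAGAATTCTT\nTTGGATCCAA\n"
    for name, seq in parse_fasta(fasta).items():
        print(name, digest(seq, ["EcoRI", "BamHI"]))
```

Comparing the fragment lists for two alleles of the same locus is the basis of a CAPS marker: a polymorphism that creates or destroys a site changes the predicted band pattern.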
NASA Astrophysics Data System (ADS)
Miyatake, Teruhiko; Chiba, Kazuki; Hamamura, Masanori; Tachikawa, Shin'ichi
We propose a novel asynchronous direct-sequence code-division multiple access (DS-CDMA) scheme using feedback-controlled spreading sequences (FCSSs) (FCSS/DS-CDMA). At the receiver of FCSS/DS-CDMA, the code-orthogonalizing filter (COF) produces a spreading sequence, and the receiver returns the spreading sequence to the transmitter. The transmitter then uses the returned spreading sequence as its updated version. The performance of FCSS/DS-CDMA is evaluated over time-dispersive channels. The results indicate that FCSS/DS-CDMA greatly suppresses both the intersymbol interference (ISI) and multiple access interference (MAI) over time-invariant channels. FCSS/DS-CDMA is applicable to decentralized multiple access.
2013-01-01
Background Mitochondrial DNA (mtDNA) typing can be a useful aid for identifying people from compromised samples when nuclear DNA is too damaged, degraded or below detection thresholds for routine short tandem repeat (STR)-based analysis. Standard mtDNA typing, focused on PCR amplicon sequencing of the control region (HVS I and HVS II), is limited by the resolving power of this short sequence, which misses up to 70% of the variation present in the mtDNA genome. Methods We used in-solution hybridisation-based DNA capture (using DNA capture probes prepared from modern human mtDNA) to recover mtDNA from post-mortem human remains in which the majority of DNA is both highly fragmented (<100 base pairs in length) and chemically damaged. The method ‘immortalises’ the finite quantities of DNA in valuable extracts as DNA libraries, which is followed by the targeted enrichment of endogenous mtDNA sequences and characterisation by next-generation sequencing (NGS). Results We sequenced whole mitochondrial genomes for human identification from samples where standard nuclear STR typing produced only partial profiles or demonstrably failed and/or where standard mtDNA hypervariable region sequences lacked resolving power. Multiple rounds of enrichment can substantially improve coverage and sequencing depth of mtDNA genomes from highly degraded samples. The application of this method has led to the reliable mitochondrial sequencing of human skeletal remains from unidentified World War Two (WWII) casualties approximately 70 years old and from archaeological remains (up to 2,500 years old). Conclusions This approach has potential applications in forensic science, historical human identification cases, archived medical samples, kinship analysis and population studies. In particular the methodology can be applied to any case, involving human or non-human species, where whole mitochondrial genome sequences are required to provide the highest level of maternal lineage discrimination. 
Multiple rounds of in-solution hybridisation-based DNA capture can retrieve whole mitochondrial genome sequences from even the most challenging samples. PMID:24289217
A Generalized Least-Squares Estimate for the Origin of Sporophytic Self-Incompatibility
Uyenoyama, M. K.
1995-01-01
Analysis of nucleotide sequences that regulate the expression of self-incompatibility in flowering plants affords a direct means of examining classical hypotheses for the origin and evolution of this major feature of mating systems. Departing from the classical view of monophyly of all forms of self-incompatibility, the current paradigm for the origin of self-incompatibility postulates multiple episodes of recruitment and modification of preexisting genes. In Brassica, the S locus, which regulates sporophytic self-incompatibility, shows homology to a multigene family present both in self-compatible congeners and in groups for which this form of self-incompatibility is atypical. A phylogenetic analysis of S-allele sequences together with homologous sequences that do not cosegregate with self-incompatibility permits dating the change of function that marked the origin of self-incompatibility. A generalized least-squares method is introduced that provides closed-form expressions for estimates and standard errors for function-specific divergence rates and times of divergence among sequences. This analysis suggests that the age of the sporophytic self-incompatibility system expressed in Brassica exceeds species divergence within the genus by four- to fivefold. The extraordinarily high levels of sequence diversity exhibited by S alleles appear to reflect their ancient derivation, with the alternative hypothesis of hypermutability rejected by the analysis. PMID:7713446
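For orientation, the textbook generalized least-squares estimator has the closed form below. The abstract does not give the paper's specific parameterization, so all symbols here are assumptions: $y$ stands for a vector of observed pairwise divergences, $X$ for a design matrix relating them to the unknown rates and times $\beta$, and $V$ for the covariance matrix of the observations.

```latex
\hat{\beta} = \left( X^{\top} V^{-1} X \right)^{-1} X^{\top} V^{-1} y,
\qquad
\operatorname{Var}\!\left( \hat{\beta} \right) = \left( X^{\top} V^{-1} X \right)^{-1},
\qquad
\operatorname{SE}\!\left( \hat{\beta}_j \right) = \sqrt{ \left[ \left( X^{\top} V^{-1} X \right)^{-1} \right]_{jj} }
```

Weighting by $V^{-1}$ accounts for the correlation among pairwise distances that share branches of the same tree, which ordinary least squares would ignore.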
Rare Variant Association Test with Multiple Phenotypes
Lee, Selyeong; Won, Sungho; Kim, Young Jin; Kim, Yongkang; Kim, Bong-Jo; Park, Taesung
2016-01-01
Although genome-wide association studies (GWAS) have now discovered thousands of genetic variants associated with common traits, such variants cannot explain the large degree of “missing heritability,” likely due to rare variants. The advent of next generation sequencing technology has allowed rare variant detection and association with common traits, often by investigating specific genomic regions for rare variant effects on a trait. Although multiple correlated phenotypes are often concurrently observed in GWAS, most studies analyze only single phenotypes, which may lessen statistical power. To increase power, multivariate analyses, which consider correlations between multiple phenotypes, can be used. However, few existing multivariate analyses can identify rare variants associated with multiple phenotypes. Here, we propose Multivariate Association Analysis using Score Statistics (MAAUSS), to identify rare variants associated with multiple phenotypes, based on the widely used Sequence Kernel Association Test (SKAT) for a single phenotype. We applied MAAUSS to Whole Exome Sequencing (WES) data from a Korean population of 1,058 subjects, to discover genes associated with multiple traits of liver function. We then validated those genes in a replication study, using an independent dataset of 3,445 individuals. Notably, we detected the gene ZNF620 among five significant genes. We then performed a simulation study to compare MAAUSS's performance with existing methods. Overall, MAAUSS successfully conserved type 1 error rates and, in many cases, had higher power than the existing methods. This study illustrates a feasible and straightforward approach for identifying rare variants correlated with multiple phenotypes, with likely relevance to missing heritability. PMID:28039885
SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.
Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver
2012-07-15
In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
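The k-mer searching step mentioned above can be pictured as ranking reference sequences by how many k-mers they share with the query, so that only the closest references feed the more expensive alignment stage. The Python sketch below illustrates that general idea only; it is not SINA's implementation, and the function names, the value of k, and the toy sequences are assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(query, references, k=4, top=2):
    """Rank reference sequences by the number of distinct k-mers shared with the query.

    references: dict of name -> sequence.
    Returns up to `top` (name, shared_count) pairs, best first; ties break by name.
    """
    query_kmers = kmers(query, k)
    scored = [
        (len(query_kmers & kmers(ref, k)), name)
        for name, ref in references.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, shared) for shared, name in scored[:top]]


if __name__ == "__main__":
    # Hypothetical toy references; real rRNA references are ~1500 nt or longer.
    refs = {
        "refA": "ACGTACGTGG",
        "refB": "TTTTTTTTTT",
        "refC": "ACGTTTGAAA",
    }
    print(rank_references("ACGTACGT", refs))
```

Because set intersection is cheap, this kind of pre-filter scales to very large reference collections; the selected references would then be handed to the alignment step.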
NASA Astrophysics Data System (ADS)
Zobin, Vyacheslav M.
2018-02-01
The 10-11 July 2015 partial collapses of the lava dome in the crater of Volcán de Colima, México, were accompanied by a two-stage sequence of multiple pyroclastic density currents (PDCs), separated by a 15-h interval, with a total bulk volume of 14.2 × 10⁶ m³ of fragmentary material and runout distances reaching 9.1 and 10.3 km, respectively (Reyes-Dávila et al., 2016). Broad-band seismic signals associated with the PDCs and recorded at seismic station EZ5, installed 4 km from the crater, were used to analyze the 20-h eruption process. This process included two stages of multiple-PDC emplacement, a one-hour period of preliminary events before each stage, and the inter-stage period. Analysis of the seismic signals allowed us to identify the types of volcanic events composing this eruption episode and to estimate their quantitative characteristics and the spectral parameters of the generated seismic signals. The seismic signals produced by PDC emplacement differed between the two stages. The second-stage PDCs radiated greater seismic energy than the PDCs emplaced during the first stage. Spectral analysis of the seismic signals produced by the PDCs indicates a clear separation in frequency content at 1.95 Hz between the higher-frequency events of the first stage and the lower-frequency events of the second stage. This difference in spectral content may reflect a relative difference in the volumes of the PDCs of the two sequences, as suggested by the difference in radiated seismic energy, and a change in the bottom conditions of the ravines as the PDCs passed along them. The results of the seismic study were used to discuss the nature of the two-stage eruptive process.
Bull, Marta E.; Heath, Laura M.; McKernan-Mullin, Jennifer L.; Kraft, Kelli M.; Acevedo, Luis; Hitti, Jane E.; Cohn, Susan E.; Tapia, Kenneth A.; Holte, Sarah E.; Dragavon, Joan A.; Coombs, Robert W.; Mullins, James I.; Frenkel, Lisa M.
2013-01-01
Background. Whether unique human immunodeficiency virus type 1 (HIV) genotypes occur in the genital tract is important for vaccine development and management of drug-resistant viruses. Multiple cross-sectional studies suggest HIV is compartmentalized within the female genital tract. We hypothesize that bursts of HIV replication and/or proliferation of infected cells captured in cross-sectional analyses drive compartmentalization but over time genital-specific viral lineages do not form; rather viruses mix between genital tract and blood. Methods. Eight women with ongoing HIV replication were studied during a period of 1.5 to 4.5 years. Multiple viral sequences were derived by single-genome amplification of the HIV C2-V5 region of env from genital secretions and blood plasma. Maximum likelihood phylogenies were evaluated for compartmentalization using 4 statistical tests. Results. In cross-sectional analyses compartmentalization of genital from blood viruses was detected in three of eight women by all tests; this was associated with tissue-specific clades containing multiple monotypic sequences. In longitudinal analysis, the tissue-specific clades did not persist to form viral lineages. Rather, across women, HIV lineages comprised both genital tract and blood sequences. Conclusions. The observation of genital-specific HIV clades only in cross-sectional analysis and an absence of genital-specific lineages in longitudinal analyses suggest a dynamic interchange of HIV variants between the female genital tract and blood. PMID:23315326
Kneider, M; Bergström, T; Gustafsson, C; Nenonen, N; Ahlgren, C; Nilsson, S; Andersen, O
2009-04-01
Upper respiratory infections have been reported to trigger multiple sclerosis (MS) relapses, and a relationship between picornavirus infections and MS relapses was recently reported. We aimed to evaluate whether human rhinovirus (HRV) is associated with MS relapses and whether any particular strain is predominant. Nasopharyngeal fluid was aspirated from 36 MS patients at pre-defined critical time points. Reverse-transcriptase PCR was performed to detect HRV RNA. Positive amplicons were sequenced. We found that HRV RNA was present in 17/40 (43%) of specimens obtained at the onset of an upper respiratory tract infection (URTI) in 19 patients, in 1/21 specimens during convalescence after URTI in 14 patients, in 0/6 specimens obtained in 5 patients on average a week after the onset of an "at risk" relapse, occurring within a window in time from one week before to three weeks after an infection, and in 0/17 specimens obtained after the onset of a "not at risk" relapse not associated with any infection in 12 patients. Fifteen specimens from healthy control persons not associated with URTI were negative. The frequency of HRV presence in URTI was similar to that reported for community infections. Eight amplicons from patients represented 5 different HRV strains. We were unable to reproduce previous findings of association between HRV infections and MS relapses. HRV was not present in nasopharyngeal aspirates obtained during "at risk" or "not at risk" relapses. Sequencing of HRV obtained from patients during URTI did not reveal any strain with predominance in MS.
Tracking Algorithm of Multiple Pedestrians Based on Particle Filters in Video Sequences
Liu, Yun; Wang, Chuanxu; Zhang, Shujun; Cui, Xuehong
2016-01-01
Pedestrian tracking is a critical problem in computer vision. Particle filters have proven very useful in pedestrian tracking for nonlinear and non-Gaussian estimation problems. However, pedestrian tracking in complex environments still faces many problems due to changes in pedestrian posture and scale, moving backgrounds, mutual occlusion, and the presence of pedestrians. To surmount these difficulties, this paper presents a tracking algorithm for multiple pedestrians based on particle filters in video sequences. The algorithm acquires confidence values for the object and the background by extracting a priori knowledge, thereby achieving multipedestrian detection; it incorporates color and texture features into the particle filter to obtain better observation results and then automatically adjusts the weight of each feature according to the current tracking environment. During tracking, the algorithm handles severe occlusion to prevent the drift and loss caused by object occlusion, and it associates detection results with particle states to provide a discrimination method for object disappearance and emergence, thereby achieving robust tracking of multiple pedestrians. Experimental verification and analysis on video sequences demonstrate that the proposed algorithm improves tracking performance. PMID:27847514
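The predict, weight, and resample cycle that particle filters rely on can be illustrated with a minimal sketch. This is a generic one-dimensional bootstrap particle filter, not the authors' multi-pedestrian algorithm: the random-walk motion model, the Gaussian observation likelihood, and every name and parameter (`particle_filter`, `process_std`, `obs_std`) are assumptions made for illustration, and the color and texture feature fusion described in the abstract is not modeled.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_std=1.0,
                    obs_std=1.0, seed=0):
    """Track a scalar position from noisy observations (illustrative only)."""
    rng = random.Random(seed)
    # Initialise particles around the first observation.
    particles = [rng.gauss(observations[0], obs_std) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: propagate each particle with an assumed random-walk model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # Weight: Gaussian likelihood of the observation given each particle.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # Degenerate case: fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: weighted mean of the particle set.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


track = particle_filter([0.0, 1.0, 2.0, 3.0, 4.0])
```

In a feature-fusion tracker of the kind the abstract describes, the single likelihood in the weighting step would presumably be replaced by a weighted combination of per-feature likelihoods; that combination is not shown here.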
DeVry, C G; Tsai, W; Clarke, S
1996-11-15
The protein L-isoaspartyl/D-aspartyl O-methyltransferase (EC 2.1.1.77) catalyzes the first step in the repair of proteins damaged in the aging process by isomerization or racemization reactions at aspartyl and asparaginyl residues. A single gene has been localized to human chromosome 6 and multiple transcripts arising through alternative splicing have been identified. Restriction enzyme mapping, subcloning, and DNA sequence analysis of three overlapping clones from a human genomic library in bacteriophage P1 indicate that the gene spans approximately 60 kb and is composed of 8 exons interrupted by 7 introns. Analysis of intron/exon splice junctions reveals that all of the donor and acceptor splice sites are in agreement with the mammalian consensus splicing sequence. Determination of transcription initiation sites by primer extension analysis of poly(A)+ mRNA from human brain identifies multiple start sites, with a major site 159 nucleotides upstream from the ATG start codon. Sequence analysis of the 5'-untranslated region demonstrates several potential cis-acting DNA elements including SP1, ETF, AP1, AP2, ARE, XRE, CREB, MED-1, and half-palindromic ERE motifs. The promoter of this methyltransferase gene lacks an identifiable TATA box but is characterized by a CpG island which begins approximately 723 nucleotides upstream of the major transcriptional start site and extends through exon 1 and into the first intron. These features are characteristic of housekeeping genes and are consistent with the wide tissue distribution observed for this methyltransferase activity.
Introduction to bioinformatics.
Can, Tolga
2014-01-01
Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.
SD-MSAEs: Promoter recognition in human genome based on deep feature extraction.
Xu, Wenxuan; Zhang, Li; Lu, Yaping
2016-06-01
The prediction and recognition of promoters in the human genome play an important role in DNA sequence analysis. Entropy, in the Shannon sense of information theory, has multiple uses in detailed bioinformatic analysis. Relative entropy estimator methods based on statistical divergence (SD) are used to extract meaningful features that distinguish different regions of DNA sequences. In this paper, we choose context features and use a set of SD methods to select the most effective n-mers for distinguishing promoter regions from other DNA regions in the human genome. From the total possible combinations of n-mers, we obtain four sparse distributions based on promoter and non-promoter training samples. The informative n-mers are selected by optimizing the extent to which these distributions differ. Specifically, we combine the advantages of statistical divergence and multiple sparse auto-encoders (MSAEs) in deep learning to extract deep features for promoter recognition. We then apply multiple SVMs and a decision model to construct a human promoter recognition method called SD-MSAEs. The framework is flexible in that it can freely integrate new feature extraction or new classification models. Experimental results show that our method has high sensitivity and specificity. Copyright © 2016 Elsevier Inc. All rights reserved.
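As a rough illustration of divergence-based n-mer selection, the sketch below ranks n-mers by their contribution to the Kullback-Leibler divergence between a promoter and a background n-mer distribution. It is one plausible reading of the idea, not the SD-MSAEs method itself: the function names, the add-one smoothing, and the toy sequences are all assumptions, and the auto-encoder and SVM stages are not covered.

```python
from collections import Counter
from math import log2


def count_nmers(sequences, n):
    """Count overlapping n-mers across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def rank_nmers(promoters, background, n=2, pseudocount=1.0):
    """Rank n-mers by their term in KL(promoter || background)."""
    pos = count_nmers(promoters, n)
    neg = count_nmers(background, n)
    vocab = set(pos) | set(neg)
    # Additive smoothing (an assumption) keeps every probability non-zero.
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    scores = {}
    for nmer in vocab:
        p = (pos[nmer] + pseudocount) / pos_total
        q = (neg[nmer] + pseudocount) / neg_total
        scores[nmer] = p * log2(p / q)  # per-n-mer contribution to the divergence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: GC-rich "promoters" versus AT-rich "background" (hypothetical).
ranked = rank_nmers(["CGCGCG", "GCGCGC"], ["ATATAT", "TATATA"], n=2)
```

The highest-ranked n-mers are those over-represented in the promoter set; a feature selector would keep the top few as inputs for the downstream classifier.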
Molecular detection and characterization of noroviruses in river water in Thailand.
Inoue, K; Motomura, K; Boonchan, M; Takeda, N; Ruchusatsawa, K; Guntapong, R; Tacharoenmuang, R; Sangkitporn, S; Chantaroj, S
2016-03-01
Norovirus (NoV) generally exists as a mixture of multiple genotype variants in nature. However, there has been no published report monitoring NoV in natural settings in Thailand. To obtain information on mixed presence of the NoV RNA genome, we conducted viral genome analysis of 15 water specimens collected from five sites in a river near Bangkok between August 2013 and August 2014. The number of viral RNA copies per specimen declined progressively from the most upstream to the most downstream site. Following direct nucleotide sequencing of the PCR products, we obtained three partial genome sequences of the NoV GI strain and 13 partial genome sequences of the NoV GII strains. Phylogenetic analysis indicated the presence of four GII.4 variant groups pro-circulated after the Den Haag_2006b, New Orleans_2009 and Sydney_2012 outbreaks. On the other hand, only GI.4 was observed in the specimens collected in April 2014. These results indicated that multiple genogroups and genotypes of noroviruses are present and are circulating in the natural environment in Thailand as in other countries. Our study provides comprehensive information on the occurrence of new variants. Our study is the first to report that multiple genogroups and genotypes of norovirus exist and are circulating in the river water near Bangkok, Thailand. Phylogenetic analysis indicated the presence of four GII.4 variant groups pro-circulated after the Den Haag_2006b, New Orleans_2009 and Sydney_2012 variants that caused outbreaks worldwide. Continued research will be essential for understanding the natural history of NoV and the control of future outbreaks. © 2015 The Society for Applied Microbiology.
Pourcel, Christine; Minandri, Fabrizia; Hauck, Yolande; D'Arezzo, Silvia; Imperi, Francesco; Vergnaud, Gilles; Visca, Paolo
2011-01-01
Acinetobacter baumannii is an important opportunistic pathogen responsible for nosocomial outbreaks, mostly occurring in intensive care units. Due to the multiplicity of infection sources, reliable molecular fingerprinting techniques are needed to establish epidemiological correlations among A. baumannii isolates. Multiple-locus variable-number tandem-repeat analysis (MLVA) has proven to be a fast, reliable, and cost-effective typing method for several bacterial species. In this study, an MLVA assay compatible with simple PCR- and agarose gel-based electrophoresis steps as well as with high-throughput automated methods was developed for A. baumannii typing. Preliminarily, 10 potential polymorphic variable-number tandem repeats (VNTRs) were identified upon bioinformatic screening of six annotated genome sequences of A. baumannii. A collection of 7 reference strains plus 18 well-characterized isolates, including unique types and representatives of the three international A. baumannii lineages, was then evaluated in a two-center study aimed at validating the MLVA assay and comparing it with other genotyping assays, namely, macrorestriction analysis with pulsed-field gel electrophoresis (PFGE) and PCR-based sequence group (SG) profiling. The results showed that MLVA can discriminate between isolates with identical PFGE types and SG profiles. A panel of eight VNTR markers was selected, all showing the ability to be amplified and good amounts of polymorphism in the majority of strains. Independently generated MLVA profiles, composed of an ordered string of allele numbers corresponding to the number of repeats at each VNTR locus, were concordant between centers. Typeability, reproducibility, stability, discriminatory power, and epidemiological concordance were excellent. A database containing information and MLVA profiles for several A. baumannii strains is available from http://mlva.u-psud.fr/. PMID:21147956
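An MLVA profile, described above as an ordered string of allele numbers (one repeat count per VNTR locus), lends itself to a simple comparison by counting the loci at which two isolates differ. The sketch below assumes a hyphen-separated text encoding and uses made-up eight-locus profiles; neither the encoding nor the values come from the study or its database.

```python
def parse_profile(text):
    """Parse a hyphen-separated MLVA profile (assumed format) into a tuple of ints."""
    return tuple(int(part) for part in text.split("-"))


def locus_differences(profile_a, profile_b):
    """Count the VNTR loci at which two profiles carry different allele numbers."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same VNTR loci")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


# Hypothetical eight-locus profiles for two isolates.
isolate_1 = parse_profile("7-3-12-5-9-2-4-6")
isolate_2 = parse_profile("7-3-11-5-9-2-4-8")
n_diff = locus_differences(isolate_1, isolate_2)
```

Two isolates with identical PFGE types could still be separated this way whenever `n_diff` is greater than zero, which is the kind of added discrimination the abstract reports.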
A core microbiome associated with the peritoneal tumors of pseudomyxoma peritonei
2013-01-01
Background Pseudomyxoma peritonei (PMP) is a malignancy characterized by dissemination of mucus-secreting cells throughout the peritoneum. This disease is associated with significant morbidity and mortality and despite effective treatment options for early-stage disease, patients with PMP often relapse. Thus, there is a need for additional treatment options to reduce relapse rate and increase long-term survival. A previous study identified the presence of both typed and non-culturable bacteria associated with PMP tissue and determined that increased bacterial density was associated with more severe disease. These findings highlighted the possible role for bacteria in PMP disease. Methods To more clearly define the bacterial communities associated with PMP disease, we employed a sequence-based analysis to profile the bacterial populations found in PMP tumor and mucin tissue in 11 patients. Sequencing data were confirmed by in situ hybridization at multiple taxonomic depths and by culturing. A pilot clinical study was initiated to determine whether the addition of antibiotic therapy affected PMP patient outcome. Main results We determined that the types of bacteria present are highly conserved in all PMP patients; the dominant phyla are the Proteobacteria, Actinobacteria, Firmicutes and Bacteroidetes. A core set of taxon-specific sequences was found in all 11 patients; many of these sequences were classified into taxonomic groups that also contain known human pathogens. In situ hybridization directly confirmed the presence of bacteria in PMP at multiple taxonomic depths and supported our sequence-based analysis. Furthermore, culturing of PMP tissue samples allowed us to isolate 11 different bacterial strains from eight independent patients, and in vitro analysis of a subset of these isolates suggests that at least some of these strains may interact with the PMP-associated mucin MUC2. Finally, we provide evidence suggesting that targeting these bacteria with antibiotic treatment may increase the survival of PMP patients. Conclusions Using 16S amplicon-based sequencing, direct in situ hybridization analysis and culturing methods, we have identified numerous bacterial taxa that are consistently present in all PMP patients tested. Combined with data from a pilot clinical study, these data support the hypothesis that adding antimicrobials to the standard PMP treatment could improve PMP patient survival. PMID:23844722
Rodriguez-Rivas, Juan; Marsili, Simone; Juan, David; Valencia, Alfonso
2016-01-01
Protein–protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein–protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein–protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein–protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach. PMID:27965389
Rodriguez-Rivas, Juan; Marsili, Simone; Juan, David; Valencia, Alfonso
2016-12-27
Protein-protein interactions are fundamental for the proper functioning of the cell. As a result, protein interaction surfaces are subject to strong evolutionary constraints. Recent developments have shown that residue coevolution provides accurate predictions of heterodimeric protein interfaces from sequence information. So far these approaches have been limited to the analysis of families of prokaryotic complexes for which large multiple sequence alignments of homologous sequences can be compiled. We explore the hypothesis that coevolution points to structurally conserved contacts at protein-protein interfaces, which can be reliably projected to homologous complexes with distantly related sequences. We introduce a domain-centered protocol to study the interplay between residue coevolution and structural conservation of protein-protein interfaces. We show that sequence-based coevolutionary analysis systematically identifies residue contacts at prokaryotic interfaces that are structurally conserved at the interface of their eukaryotic counterparts. In turn, this allows the prediction of conserved contacts at eukaryotic protein-protein interfaces with high confidence using solely mutational patterns extracted from prokaryotic genomes. Even in the context of high divergence in sequence (the twilight zone), where standard homology modeling of protein complexes is unreliable, our approach provides sequence-based accurate information about specific details of protein interactions at the residue level. Selected examples of the application of prokaryotic coevolutionary analysis to the prediction of eukaryotic interfaces further illustrate the potential of this approach.
Using a Sequence of Earcons to Monitor Multiple Simulated Patients.
Hickling, Anna; Brecknell, Birgit; Loeb, Robert G; Sanderson, Penelope
2017-03-01
The aim of this study was to determine whether a sequence of earcons can effectively convey the status of multiple processes, such as the status of multiple patients in a clinical setting. Clinicians often monitor multiple patients. An auditory display that intermittently conveys the status of multiple patients may help. Nonclinician participants listened to sequences of 500-ms earcons that each represented the heart rate (HR) and oxygen saturation (SpO2) levels of a different simulated patient. In each sequence, one, two, or three patients had an abnormal level of HR and/or SpO2. In Experiment 1, participants reported which of nine patients in a sequence were abnormal. In Experiment 2, participants identified the vital signs of one, two, or three abnormal patients in sequences of one, five, or nine patients, where the interstimulus interval (ISI) between earcons was 150 ms. Experiment 3 used the five-sequence condition of Experiment 2, but the ISI was either 150 ms or 800 ms. Participants reported which patient(s) were abnormal with median 95% accuracy. Identification accuracy for vital signs decreased as the number of abnormal patients increased from one to three, p < .001, but accuracy was unaffected by number of patients in a sequence. Overall, identification accuracy was significantly higher with an ISI of 800 ms (89%) compared with an ISI of 150 ms (83%), p < .001. A multiple-patient display can be created by cycling through earcons that represent individual patients. The principles underlying the multiple-patient display can be extended to other vital signs, designs, and domains.
Eastman, Alexander W.; Yuan, Ze-Chun
2015-01-01
Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing despite the large numbers of published genomes. Shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, all located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangement generated during the initial in silico assembly of sequencing reads. Our approach reduces the effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provide a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing projects. PMID:25653642
da Fonseca, Néli José; Lima Afonso, Marcelo Querino; Pedersolli, Natan Gonçalves; de Oliveira, Lucas Carrijo; Andrade, Dhiego Souto; Bleicher, Lucas
2017-10-28
Flaviviruses are responsible for serious diseases such as dengue, yellow fever, and zika fever. Their genomes encode a polyprotein which, after cleavage, results in three structural and seven non-structural proteins. Homologous proteins can be studied by conservation and coevolution analysis as detected in multiple sequence alignments, usually reporting positions which are strictly necessary for the structure and/or function of all members in a protein family or which are involved in a specific sub-class feature requiring the coevolution of residue sets. This study provides a complete conservation and coevolution analysis on all flaviviruses non-structural proteins, with results mapped on all well-annotated available sequences. A literature review on the residues found in the analysis enabled us to compile available information on their roles and distribution among different flaviviruses. Also, we provide the mapping of conserved and coevolved residues for all sequences currently in SwissProt as a supplementary material, so that particularities in different viruses can be easily analyzed. Copyright © 2017 Elsevier Inc. All rights reserved.
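Per-position conservation in a multiple sequence alignment is often scored from the Shannon entropy of each column. The sketch below uses that standard measure as a stand-in, since the abstract does not state which conservation score the study applied; the normalisation by log2(20), the treatment of gaps, and the toy alignment are assumptions.

```python
from collections import Counter
from math import log2


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    1.0 means a fully conserved column; gaps ('-') are ignored (an assumption).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no signal
            continue
        total = len(residues)
        entropy = -sum((k / total) * log2(k / total)
                       for k in Counter(residues).values())
        # Normalise by the maximum entropy over 20 amino acids.
        scores.append(1.0 - entropy / log2(20))
    return scores


# Toy four-sequence alignment (hypothetical).
scores = column_conservation(["MKV", "MRV", "MKI", "MK-"])
```

Coevolution analysis needs pairwise column statistics (for example mutual information) and is not covered by this sketch.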
Lindholdt, Louise; Labriola, Merete; Nielsen, Claus Vinther; Horsbøl, Trine Allerslev; Lund, Thomas
2017-01-01
Introduction The return-to-work (RTW) process after long-term sickness absence is often complex and long and implies multiple shifts between different labour market states for the absentee. Standard methods in RTW research typically rely on the analysis of one outcome measure at a time, which will not capture the many possible states and transitions the absentee can go through. The purpose of this study was to explore the potential added value of sequence analysis as a supplement to standard regression analysis of a multidisciplinary RTW intervention among patients with low back pain (LBP). Methods The study population consisted of 160 patients randomly allocated to either a hospital-based brief or a multidisciplinary intervention. Data on labour market participation following intervention were obtained from a national register and analysed in two ways: as a binary outcome expressed as active or passive relief at a 1-year follow-up and as four different categories for labour market participation. Logistic regression and sequence analysis were performed. Results The logistic regression analysis showed no difference in labour market participation for patients in the two groups after 1 year. Applying sequence analysis showed differences in subsequent labour market participation 2 years after baseline in favour of the brief intervention group versus the multidisciplinary intervention group. Conclusion The study indicated that sequence analysis could provide added analytical value as a supplement to traditional regression analysis in prospective studies of RTW among patients with LBP. PMID:28729315
High-throughput analysis of T-DNA location and structure using sequence capture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Inagaki, Soichi; Henry, Isabelle M.; Lieberman, Meric C.
Agrobacterium-mediated transformation of plants with T-DNA is used both to introduce transgenes and for mutagenesis. Conventional approaches used to identify the genomic location and the structure of the inserted T-DNA are laborious and high-throughput methods using next-generation sequencing are being developed to address these problems. Here, we present a cost-effective approach that uses sequence capture targeted to the T-DNA borders to select genomic DNA fragments containing T-DNA–genome junctions, followed by Illumina sequencing to determine the location and junction structure of T-DNA insertions. Multiple probes can be mixed so that transgenic lines transformed with different T-DNA types can be processed simultaneously, using a simple, index-based pooling approach. We also developed a simple bioinformatic tool to find sequence read pairs that span the junction between the genome and T-DNA or any foreign DNA. We analyzed 29 transgenic lines of Arabidopsis thaliana, each containing inserts from 4 different T-DNA vectors. We determined the location of T-DNA insertions in 22 lines, 4 of which carried multiple insertion sites. Additionally, our analysis uncovered a high frequency of unconventional and complex T-DNA insertions, highlighting the needs for high-throughput methods for T-DNA localization and structural characterization. Transgene insertion events have to be fully characterized prior to use as commercial products. As a result, our method greatly facilitates the first step of this characterization of transgenic plants by providing an efficient screen for the selection of promising lines.
High-throughput analysis of T-DNA location and structure using sequence capture
Inagaki, Soichi; Henry, Isabelle M.; Lieberman, Meric C.; ...
2015-10-07
Agrobacterium-mediated transformation of plants with T-DNA is used both to introduce transgenes and for mutagenesis. Conventional approaches used to identify the genomic location and the structure of the inserted T-DNA are laborious and high-throughput methods using next-generation sequencing are being developed to address these problems. Here, we present a cost-effective approach that uses sequence capture targeted to the T-DNA borders to select genomic DNA fragments containing T-DNA–genome junctions, followed by Illumina sequencing to determine the location and junction structure of T-DNA insertions. Multiple probes can be mixed so that transgenic lines transformed with different T-DNA types can be processed simultaneously, using a simple, index-based pooling approach. We also developed a simple bioinformatic tool to find sequence read pairs that span the junction between the genome and T-DNA or any foreign DNA. We analyzed 29 transgenic lines of Arabidopsis thaliana, each containing inserts from 4 different T-DNA vectors. We determined the location of T-DNA insertions in 22 lines, 4 of which carried multiple insertion sites. Additionally, our analysis uncovered a high frequency of unconventional and complex T-DNA insertions, highlighting the needs for high-throughput methods for T-DNA localization and structural characterization. Transgene insertion events have to be fully characterized prior to use as commercial products. As a result, our method greatly facilitates the first step of this characterization of transgenic plants by providing an efficient screen for the selection of promising lines.
Application of the MIDAS approach for analysis of lysine acetylation sites.
Evans, Caroline A; Griffiths, John R; Unwin, Richard D; Whetton, Anthony D; Corfe, Bernard M
2013-01-01
Multiple Reaction Monitoring Initiated Detection and Sequencing (MIDAS™) is a mass spectrometry-based technique for the detection and characterization of specific post-translational modifications (Unwin et al. 4:1134-1144, 2005), for example acetylated lysine residues (Griffiths et al. 18:1423-1428, 2007). The MIDAS™ technique has application for discovery and analysis of acetylation sites. It is a hypothesis-driven approach that requires a priori knowledge of the primary sequence of the target protein and a proteolytic digest of this protein. MIDAS essentially performs a targeted search for the presence of modified, for example acetylated, peptides. The detection is based on the combination of the predicted molecular weight (measured as mass-charge ratio) of the acetylated proteolytic peptide and a diagnostic fragment (product ion of m/z 126.1), which is generated by specific fragmentation of acetylated peptides during collision induced dissociation performed in tandem mass spectrometry (MS) analysis. Sequence information is subsequently obtained which enables acetylation site assignment. The technique of MIDAS was later trademarked by ABSciex for targeted protein analysis where an MRM scan is combined with full MS/MS product ion scan to enable sequence confirmation.
Genomic characterization reconfirms the taxonomic status of Lactobacillus parakefiri
TANIZAWA, Yasuhiro; KOBAYASHI, Hisami; KAMINUMA, Eli; SAKAMOTO, Mitsuo; OHKUMA, Moriya; NAKAMURA, Yasukazu; ARITA, Masanori; TOHNO, Masanori
2017-01-01
Whole-genome sequencing was performed for Lactobacillus parakefiri JCM 8573T to confirm its hitherto controversial taxonomic position. Here, we report its first reliable reference genome. Genome-wide metrics, such as average nucleotide identity and digital DNA-DNA hybridization, and phylogenomic analysis based on multiple genes supported its taxonomic status as a distinct species in the genus Lactobacillus. The availability of a reliable genome sequence will aid future investigations on the industrial applications of L. parakefiri in functional foods such as kefir grains. PMID:28748134
Methods for comparative metagenomics
Huson, Daniel H; Richter, Daniel C; Mitra, Suparna; Auch, Alexander F; Schuster, Stephan C
2009-01-01
Background Metagenomics is a rapidly growing field of research that aims at studying uncultured organisms to understand the true diversity of microbes, their functions, cooperation and evolution, in environments such as soil, water, ancient remains of animals, or the digestive system of animals and humans. The recent development of ultra-high throughput sequencing technologies, which do not require cloning or PCR amplification, and can produce huge numbers of DNA reads at an affordable cost, has boosted the number and scope of metagenomic sequencing projects. Increasingly, there is a need for new ways of comparing multiple metagenomics datasets, and for fast and user-friendly implementations of such approaches. Results This paper introduces a number of new methods for interactively exploring, analyzing and comparing multiple metagenomic datasets, which will be made freely available in a new, comparative version 2.0 of the stand-alone metagenome analysis tool MEGAN. Conclusion There is a great need for powerful and user-friendly tools for comparative analysis of metagenomic data and MEGAN 2.0 will help to fill this gap. PMID:19208111
Burkholderia: an update on taxonomy and biotechnological potential as antibiotic producers.
Depoorter, Eliza; Bull, Matt J; Peeters, Charlotte; Coenye, Tom; Vandamme, Peter; Mahenthiralingam, Eshwar
2016-06-01
Burkholderia is an incredibly diverse and versatile Gram-negative genus, within which over 80 species have been formally named and multiple other genotypic groups likely represent new species. Phylogenetic analysis based on the 16S rRNA gene sequence and core genome ribosomal multilocus sequence typing analysis indicates the presence of at least three major clades within the genus. Biotechnologically, Burkholderia are well-known for their bioremediation and biopesticidal properties. Within this review, we explore the ability of Burkholderia to synthesise a wide range of antimicrobial compounds ranging from historically characterised antifungals to recently described antibacterial antibiotics with activity against multiresistant clinical pathogens. The production of multiple Burkholderia antibiotics is controlled by quorum sensing and examples of quorum sensing pathways found across the genus are discussed. The capacity for antibiotic biosynthesis and secondary metabolism encoded within Burkholderia genomes is also evaluated. Overall, Burkholderia demonstrate significant biotechnological potential as a source of novel antibiotics and bioactive secondary metabolites.
Debunking Occam's razor: Diagnosing multiple genetic diseases in families by whole-exome sequencing.
Balci, T B; Hartley, T; Xi, Y; Dyment, D A; Beaulieu, C L; Bernier, F P; Dupuis, L; Horvath, G A; Mendoza-Londono, R; Prasad, C; Richer, J; Yang, X-R; Armour, C M; Bareke, E; Fernandez, B A; McMillan, H J; Lamont, R E; Majewski, J; Parboosingh, J S; Prasad, A N; Rupar, C A; Schwartzentruber, J; Smith, A C; Tétreault, M; Innes, A M; Boycott, K M
2017-09-01
Recent clinical whole exome sequencing (WES) cohorts have identified unanticipated multiple genetic diagnoses in single patients. However, the frequency of multiple genetic diagnoses in families is largely unknown. We set out to identify the rate of multiple genetic diagnoses in probands and their families referred for analysis in two national research programs in Canada. We retrospectively analyzed WES results for 802 undiagnosed probands referred over the past 5 years in either the FORGE or Care4Rare Canada WES initiatives. Of the 802 probands, 226 (28.2%) were diagnosed based on mutations in known disease genes. Eight (3.5%) had two or more genetic diagnoses explaining their clinical phenotype, a rate in keeping with the large published studies (average 4.3%; 1.4 - 7.2%). Seven of the 8 probands had family members with one or more of the molecularly diagnosed diseases. Consanguinity and multisystem disease appeared to increase the likelihood of multiple genetic diagnoses in a family. Our findings highlight the importance of comprehensive clinical phenotyping of family members to ultimately provide accurate genetic counseling. © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
NASA Astrophysics Data System (ADS)
Qiu, Kun; Zhang, Chongfu; Ling, Yun; Wang, Yibo
2007-11-01
This paper proposes, for the first time to the best of our knowledge, an all-optical label processing scheme using multiple optical orthogonal codes sequences (MOOCS) for optical packet switching (OPS) networks (MOOCS-OPS). In this scheme, the multiple optical orthogonal codes (MOOC) from multiple-groups optical orthogonal codes (MGOOC) are permuted and combined to obtain the MOOCS for the optical labels, which effectively enlarges the capacity of available optical codes for optical labels. The optical label processing (OLP) schemes are reviewed and analyzed, the principles of MOOCS-based optical labels for OPS networks are given and analyzed, and the MOOCS-OPS topology and the key realization units of the MOOCS-based optical label packets are studied in detail. The performance of this novel all-optical label processing technology is analyzed and a corresponding simulation is performed. The analysis and results show that the proposed scheme can overcome the shortage of available optical orthogonal code (OOC)-based optical labels caused by the limited number of single OOCs with short code length, and indicate that the MOOCS-OPS scheme is feasible.
Model-based quality assessment and base-calling for second-generation sequencing data.
Bravo, Héctor Corrada; Irizarry, Rafael A
2010-09-01
Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance. © 2009, The International Biometric Society.
Change in IgHV Mutational Status of CLL Suggests Origin From Multiple Clones.
Osman, Afaf; Gocke, Christopher D; Gladstone, Douglas E
2017-02-01
Fluorescence in situ hybridization and immunoglobulin (Ig) heavy-chain variable-region (IgHV) mutational status are used to predict outcome in chronic lymphocytic leukemia (CLL). Although DNA aberrations change over time, IgHV sequences and mutational status are considered stable. In a retrospective review, 409 CLL patients, between 2008 and 2015, had IgHV analysis: 56 patients had multiple analyses performed. Seven patients' IgHV results changed: 2 from unmutated to mutated and 5 from mutated to unmutated IgHV sequence. Three concurrently changed their variable heavy-chain sequence. Secondary to allelic exclusion, 2 of the new variable heavy chains produced were biologically nonplausible. The existence of these new nonplausible heavy-chain variable regions suggests either the CLL cancer stem-cell maintains the ability to rearrange a previously silenced IgH allele or more likely that the cancer stem-cell produced at least 2 subclones, suggesting that the CLL cancer stem cell exists before the process of allelic exclusion occurs. Copyright © 2016 Elsevier Inc. All rights reserved.
1994-01-01
The apparatus that permits protein translocation across the internal thylakoid membranes of chloroplasts is completely unknown, even though these membranes have been the subject of extensive biochemical analysis. We have used a genetic approach to characterize the translocation of Chlamydomonas cytochrome f, a chloroplast-encoded protein that spans the thylakoid once. Mutations in the hydrophobic core of the cytochrome f signal sequence inhibit the accumulation of cytochrome f, lead to an accumulation of precursor, and impair the ability of Chlamydomonas cells to grow photosynthetically. One hydrophobic core mutant also reduces the accumulation of other thylakoid membrane proteins, but not those that translocate completely across the membrane. These results suggest that the signal sequence of cytochrome f is required and is involved in one of multiple insertion pathways. Suppressors of two signal peptide mutations describe at least two nuclear genes whose products likely describe the translocation apparatus, and selected second-site chloroplast suppressors further define regions of the cytochrome f signal peptide. PMID:8034740
Romer, Katherine A.; Kayombya, Guy-Richard; Fraenkel, Ernest
2007-01-01
WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs. PMID:17584794
Phylogenetic shadowing of primate sequences to find functional regions of the human genome.
Boffelli, Dario; McAuliffe, Jon; Ovcharenko, Dmitriy; Lewis, Keith D; Ovcharenko, Ivan; Pachter, Lior; Rubin, Edward M
2003-02-28
Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes. Much of the information content of the comprehensive primate sequence comparisons could be captured with a small subset of phylogenetically close primates. These results demonstrate the utility of intraprimate sequence comparisons to discover common mammalian as well as primate-specific functional elements in the human genome, which are unattainable through the evaluation of more evolutionarily distant species.
Evol and ProDy for bridging protein sequence evolution and structural dynamics
Mao, Wenzhi; Liu, Ying; Chennubhotla, Chakra; Lezon, Timothy R.; Bahar, Ivet
2014-01-01
Correlations between sequence evolution and structural dynamics are of utmost importance in understanding the molecular mechanisms of function and their evolution. We have integrated Evol, a new package for fast and efficient comparative analysis of evolutionary patterns and conformational dynamics, into ProDy, a computational toolbox designed for inferring protein dynamics from experimental and theoretical data. Using information-theoretic approaches, Evol coanalyzes conservation and coevolution profiles extracted from multiple sequence alignments of protein families with their inferred dynamics. Availability and implementation: ProDy and Evol are open-source and freely available under MIT License from http://prody.csb.pitt.edu/. Contact: bahar@pitt.edu PMID:24849577
DOE Office of Scientific and Technical Information (OSTI.GOV)
Golbus, Jessica R.; Puckelwartz, Megan J.; Dellefave-Castillo, Lisa
Background: Cardiomyopathy is highly heritable but genetically diverse. At present, genetic testing for cardiomyopathy uses targeted sequencing to simultaneously assess the coding regions of more than 50 genes. New genes are routinely added to panels to improve the diagnostic yield. With the anticipated $1000 genome, it is expected that genetic testing will shift towards comprehensive genome sequencing accompanied by targeted gene analysis. Therefore, we assessed the reliability of whole genome sequencing and targeted analysis to identify cardiomyopathy variants in 11 subjects with cardiomyopathy. Methods and Results: Whole genome sequencing with an average of 37× coverage was combined with targeted analysis focused on 204 genes linked to cardiomyopathy. Genetic variants were scored using multiple prediction algorithms combined with frequency data from public databases. This pipeline yielded 1-14 potentially pathogenic variants per individual. Variants were further analyzed using clinical criteria and/or segregation analysis. Three of three previously identified primary mutations were detected by this analysis. In six subjects for whom the primary mutation was previously unknown, we identified mutations that segregated with disease, had clinical correlates, and/or had additional pathological correlation to provide evidence for causality. For two subjects with previously known primary mutations, we identified additional variants that may act as modifiers of disease severity. In total, we identified the likely pathological mutation in 9 of 11 (82%) subjects. We conclude that these pilot data demonstrate that ~30-40× coverage whole genome sequencing combined with targeted analysis is feasible and sensitive to identify rare variants in cardiomyopathy-associated genes.
Ryan, Niamh M; Lihm, Jayon; Kramer, Melissa; McCarthy, Shane; Morris, Stewart W; Arnau-Soler, Aleix; Davies, Gail; Duff, Barbara; Ghiban, Elena; Hayward, Caroline; Deary, Ian J; Blackwood, Douglas H R; Lawrie, Stephen M; McIntosh, Andrew M; Evans, Kathryn L; Porteous, David J; McCombie, W Richard; Thomson, Pippa A
2018-06-07
Psychiatric disorders are a group of genetically related diseases with highly polygenic architectures. Genome-wide association analyses have made substantial progress towards understanding the genetic architecture of these disorders. More recently, exome- and whole-genome sequencing of cases and families have identified rare, high penetrant variants that provide direct functional insight. There remains, however, a gap in the heritability explained by these complementary approaches. To understand how multiple genetic variants combine to modify both severity and penetrance of a highly penetrant variant, we sequenced 48 whole genomes from a family with a high loading of psychiatric disorder linked to a balanced chromosomal translocation. The (1;11)(q42;q14.3) translocation directly disrupts three genes: DISC1, DISC2, DISC1FP and has been linked to multiple brain imaging and neurocognitive outcomes in the family. Using DNA sequence-level linkage analysis, functional annotation and population-based association, we identified common and rare variants in GRM5 (minor allele frequency (MAF) > 0.05), PDE4D (MAF > 0.2) and CNTN5 (MAF < 0.01) that may help explain the individual differences in phenotypic expression in the family. We suggest that whole-genome sequencing in large families will improve the understanding of the combined effects of the rare and common sequence variation underlying psychiatric phenotypes.
Akkuratov, Evgeny E; Walters, Lorraine; Saha-Mandal, Arnab; Khandekar, Sushant; Crawford, Erin; Zirbel, Craig L; Leisner, Scott; Prakash, Ashwin; Fedorova, Larisa; Fedorov, Alexei
2014-09-10
Orthologous introns have identical positions relative to the coding sequence in orthologous genes of different species. By analyzing the complete genomes of five plants we generated a database of 40,512 orthologous intron groups of dicotyledonous plants, 28,519 orthologous intron groups of angiosperms, and 15,726 of land plants (moss and angiosperms). Multiple sequence alignments of each orthologous intron group were obtained using the Mafft algorithm. The number of conserved regions in plant introns appeared to be hundreds of times fewer than that in mammals or vertebrates. Approximately three quarters of conserved intronic regions among angiosperms and dicots, in particular, correspond to alternatively-spliced exonic sequences. We registered only a handful of conserved intronic ncRNAs of flowering plants. However, the most evolutionarily conserved intronic region, which is ubiquitous for all plants examined in this study, including moss, possessed multiple structural features of tRNAs, which caused us to classify it as a putative tRNA-like ncRNA. Intronic sequences encoding tRNA-like structures are not unique to plants. Bioinformatics examination of the presence of tRNA inside introns revealed an unusually long-term association of four glycine tRNAs inside the Vac14 gene of fish, amniotes, and mammals. Copyright © 2014 Elsevier B.V. All rights reserved.
Bullich, Gemma; Trujillano, Daniel; Santín, Sheila; Ossowski, Stephan; Mendizábal, Santiago; Fraga, Gloria; Madrid, Álvaro; Ariceta, Gema; Ballarín, José; Torra, Roser; Estivill, Xavier; Ars, Elisabet
2015-09-01
Genetic diagnosis of steroid-resistant nephrotic syndrome (SRNS) using Sanger sequencing is complicated by the high genetic heterogeneity and phenotypic variability of this disease. We aimed to improve the genetic diagnosis of SRNS by simultaneously sequencing 26 glomerular genes using massive parallel sequencing and to study whether mutations in multiple genes increase disease severity. High-throughput mutation analysis was performed in 50 SRNS and/or focal segmental glomerulosclerosis (FSGS) patients, a validation cohort of 25 patients with known pathogenic mutations, and a discovery cohort of 25 uncharacterized patients with probable genetic etiology. In the validation cohort, we identified the 42 previously known pathogenic mutations across NPHS1, NPHS2, WT1, TRPC6, and INF2 genes. In the discovery cohort, disease-causing mutations in SRNS/FSGS genes were found in nine patients. We detected three patients with mutations in an SRNS/FSGS gene and COL4A3. Two of them were familial cases and presented a more severe phenotype than family members with mutation in only one gene. In conclusion, our results show that massive parallel sequencing is feasible and robust for genetic diagnosis of SRNS/FSGS. Our results indicate that patients carrying mutations in an SRNS/FSGS gene and also in COL4A3 gene have increased disease severity.
Circular RNA expression in basal cell carcinoma.
Sand, Michael; Bechara, Falk G; Sand, Daniel; Gambichler, Thilo; Hahn, Stephan A; Bromba, Michael; Stockfleth, Eggert; Hessam, Schapoor
2016-05-01
Circular RNAs (circRNAs) are non-protein-coding RNAs consisting of a circular loop with multiple miRNA binding sites, called miRNA response elements (MREs), that function as miRNA sponges. This study was performed to identify differentially expressed circRNAs and their MREs in basal cell carcinoma (BCC). Microarray circRNA expression profiles were acquired from BCC and control samples, followed by qRT-PCR validation. Bioinformatic target prediction revealed multiple MREs. Sequence analysis was performed to assess the MRE interaction potential with the BCC miRNome. We identified 23 upregulated and 48 downregulated circRNAs with 354 miRNA response elements capable of sequestering miRNA target sequences of the BCC miRNome. The present study describes a variety of circRNAs that are potentially involved in the molecular pathogenesis of BCC.
Image encryption algorithm based on multiple mixed hash functions and cyclic shift
NASA Astrophysics Data System (ADS)
Wang, Xingyuan; Zhu, Xiaoqiang; Wu, Xiangjun; Zhang, Yingqian
2018-08-01
This paper proposes a new one-time pad scheme for chaotic image encryption that is based on multiple mixed hash functions and a cyclic-shift function. The initial value is generated using both information from the plaintext image and the chaotic sequences, which are calculated from the SHA1 and MD5 hash algorithms. The scrambling sequences are generated by the nonlinear equations and logistic map. This paper aims to address the deficiencies of the traditional Baptista algorithm and its improved variants. We employ the cyclic-shift function and piece-wise linear chaotic maps (PWLCM), which give each shift number the characteristics of chaos, to diffuse the image. Experimental results and security analysis show that the new scheme has better security and can resist common attacks.
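The general idea of seeding a logistic map from hashes of the plaintext and turning the chaotic sequence into a scrambling order can be sketched as follows. This is only an illustration under stated assumptions, not the scheme from the paper: the function names, the way the seed is derived from the SHA1 and MD5 digests, the parameter r = 3.99, and the burn-in length are all hypothetical choices.

```python
import hashlib


def seed_from_hashes(data: bytes) -> float:
    """Derive an initial value in (0, 1) from SHA1 and MD5 digests (assumed mapping)."""
    digest = hashlib.sha1(data).digest() + hashlib.md5(data).digest()
    value = int.from_bytes(digest[:6], "big")  # 48 bits, exactly representable as a float
    return (value + 0.5) / 2 ** 48


def logistic_sequence(x0: float, n: int, r: float = 3.99, burn_in: int = 100) -> list:
    """Iterate x -> r * x * (1 - x), discarding the first burn_in values."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    values = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


def scramble_order(data: bytes, n: int) -> list:
    """Return a permutation of range(n) obtained by ranking the chaotic sequence."""
    seq = logistic_sequence(seed_from_hashes(data), n)
    return sorted(range(n), key=seq.__getitem__)


def scramble(items: list, order: list) -> list:
    return [items[i] for i in order]


def unscramble(items: list, order: list) -> list:
    restored = [None] * len(items)
    for position, source_index in enumerate(order):
        restored[source_index] = items[position]
    return restored
```

Because the seed depends on the plaintext, a different image yields a different permutation, which is the property the abstract attributes to its plaintext-dependent initial value. The sketch covers only the scrambling step and makes no security claim.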
Li, Cheng-Wei; Chen, Bor-Sen
2016-01-01
Epigenetic and microRNA (miRNA) regulation are associated with carcinogenesis and the development of cancer. By using the available omics data, including those from next-generation sequencing (NGS), genome-wide methylation profiling, candidate integrated genetic and epigenetic network (IGEN) analysis, and drug response genome-wide microarray analysis, we constructed an IGEN system based on three coupling regression models that characterize protein-protein interaction networks (PPINs), gene regulatory networks (GRNs), miRNA regulatory networks (MRNs), and epigenetic regulatory networks (ERNs). By applying system identification method and principal genome-wide network projection (PGNP) to IGEN analysis, we identified the core network biomarkers to investigate bladder carcinogenic mechanisms and design multiple drug combinations for treating bladder cancer with minimal side-effects. The progression of DNA repair and cell proliferation in stage 1 bladder cancer ultimately results not only in the derepression of miR-200a and miR-200b but also in the regulation of the TNF pathway to metastasis-related genes or proteins, cell proliferation, and DNA repair in stage 4 bladder cancer. We designed a multiple drug combination comprising gefitinib, estradiol, yohimbine, and fulvestrant for treating stage 1 bladder cancer with minimal side-effects, and another multiple drug combination comprising gefitinib, estradiol, chlorpromazine, and LY294002 for treating stage 4 bladder cancer with minimal side-effects.
Detecting and Analyzing Genetic Recombination Using RDP4.
Martin, Darren P; Murrell, Ben; Khoosal, Arjun; Muhire, Brejnev
2017-01-01
Recombination between nucleotide sequences is a major process influencing the evolution of most species on Earth. The evolutionary value of recombination has been widely debated and so too has its influence on evolutionary analysis methods that assume nucleotide sequences replicate without recombining. When nucleic acids recombine, the evolution of the daughter or recombinant molecule cannot be accurately described by a single phylogeny. This simple fact can seriously undermine the accuracy of any phylogenetics-based analytical approach which assumes that the evolutionary history of a set of recombining sequences can be adequately described by a single phylogenetic tree. There are presently a large number of available methods and associated computer programs for analyzing and characterizing recombination in various classes of nucleotide sequence datasets. Here we examine the use of some of these methods to derive and test recombination hypotheses using multiple sequence alignments.
Kretova, Olga V; Chechetkin, Vladimir R; Fedoseeva, Daria M; Kravatsky, Yuri V; Sosin, Dmitri V; Alembekov, Ildar R; Gorbacheva, Maria A; Gashnikova, Natalya M; Tchurikov, Nickolai A
2017-02-01
Any method for silencing the activity of the HIV-1 retrovirus should tackle the extremely high variability of HIV-1 sequences and mutational escape. We studied sequence variability in the vicinity of selected RNA interference (RNAi) targets from isolates of HIV-1 subtype A in Russia, and we propose that using artificial RNAi is a potential alternative to traditional antiretroviral therapy. We prove that using multiple RNAi targets overcomes the variability in HIV-1 isolates. The optimal number of targets critically depends on the conservation of the target sequences. The total number of targets that are conserved with a probability of 0.7-0.8 should exceed at least 2. Combining deep sequencing and multitarget RNAi may provide an efficient approach to cure HIV/AIDS.
2013-01-01
A need for a genomic species definition is emerging from several independent studies worldwide. In this commentary paper, we discuss recent studies on the genomic taxonomy of diverse microbial groups and a unified species definition based on genomics. Accordingly, strains from the same microbial species share >95% Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI), >95% identity based on multiple alignment genes, <10 in Karlin genomic signature, and > 70% in silico Genome-to-Genome Hybridization similarity (GGDH). Species of the same genus will form monophyletic groups on the basis of 16S rRNA gene sequences, Multilocus Sequence Analysis (MLSA) and supertree analysis. In addition to the established requirements for species descriptions, we propose that new taxa descriptions should also include at least a draft genome sequence of the type strain in order to obtain a clear outlook on the genomic landscape of the novel microbe. The application of the new genomic species definition put forward here will allow researchers to use genome sequences to define simultaneously coherent phenotypic and genomic groups. PMID:24365132
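The thresholds listed in this abstract can be written down as a simple check. The sketch below is a hypothetical helper, not software from the commentary: the class and field names are invented for illustration, and it assumes the metrics have already been computed by other tools.

```python
from dataclasses import dataclass


@dataclass
class GenomePairMetrics:
    """Pairwise metrics between two strains (field names are illustrative)."""
    aai_percent: float                 # Average Amino Acid Identity
    ani_percent: float                 # Average Nucleotide Identity
    core_gene_identity_percent: float  # identity based on multiple alignment genes
    karlin_signature: float            # Karlin genomic signature
    ggdh_percent: float                # in silico Genome-to-Genome Hybridization similarity


def failed_criteria(m: GenomePairMetrics) -> list:
    """Return the names of the criteria from the abstract that the pair does not meet."""
    checks = {
        "AAI > 95%": m.aai_percent > 95.0,
        "ANI > 95%": m.ani_percent > 95.0,
        "core gene identity > 95%": m.core_gene_identity_percent > 95.0,
        "Karlin signature < 10": m.karlin_signature < 10.0,
        "GGDH > 70%": m.ggdh_percent > 70.0,
    }
    return [name for name, passed in checks.items() if not passed]


def same_species(m: GenomePairMetrics) -> bool:
    """True only when every threshold quoted in the abstract is satisfied."""
    return not failed_criteria(m)
```

Returning the list of failed criteria, rather than a bare boolean, makes it easier to see which metric separates two borderline strains.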
Pollier, Jacob; González-Guzmán, Miguel; Ardiles-Diaz, Wilson; Geelen, Danny; Goossens, Alain
2011-01-01
cDNA-Amplified Fragment Length Polymorphism (cDNA-AFLP) is a commonly used technique for genome-wide expression analysis that does not require prior sequence knowledge. Typically, quantitative expression data and sequence information are obtained for a large number of differentially expressed gene tags. However, most of the gene tags do not correspond to full-length (FL) coding sequences, which is a prerequisite for subsequent functional analysis. A medium-throughput screening strategy, based on integration of polymerase chain reaction (PCR) and colony hybridization, was developed that allows in parallel screening of a cDNA library for FL clones corresponding to incomplete cDNAs. The method was applied to screen for the FL open reading frames of a selection of 163 cDNA-AFLP tags from three different medicinal plants, leading to the identification of 109 (67%) FL clones. Furthermore, the protocol allows for the use of multiple probes in a single hybridization event, thus significantly increasing the throughput when screening for rare transcripts. The presented strategy offers an efficient method for the conversion of incomplete expressed sequence tags (ESTs), such as cDNA-AFLP tags, to FL-coding sequences.
Analysis of the cytochrome c oxidase subunit II (COX2) gene in giant panda, Ailuropoda melanoleuca.
Ling, S S; Zhu, Y; Lan, D; Li, D S; Pang, H Z; Wang, Y; Li, D Y; Wei, R P; Zhang, H M; Wang, C D; Hu, Y D
2017-01-23
The giant panda, Ailuropoda melanoleuca (Ursidae), has a unique bamboo-based diet; however, this low-energy intake has been sufficient to maintain the metabolic processes of this species since the fourth ice age. As mitochondria are the main sites for energy metabolism in animals, the protein-coding genes involved in mitochondrial respiratory chains, particularly cytochrome c oxidase subunit II (COX2), which is the rate-limiting enzyme in electron transfer, could play an important role in giant panda metabolism. Therefore, the present study aimed to isolate, sequence, and analyze the COX2 DNA from individuals kept at the Giant Panda Protection and Research Center, China, and compare these sequences with those of the other Ursidae family members. Multiple sequence alignment showed that the COX2 gene had three point mutations that defined three haplotypes, with 60% of the sequences corresponding to haplotype I. The neutrality tests revealed that the COX2 gene was conserved throughout evolution, and the maximum likelihood phylogenetic analysis, using homologous sequences from other Ursidae species, showed clustering of the COX2 sequences of giant pandas, suggesting that this gene evolved differently in them.
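As a rough illustration of how point mutations in a multiple sequence alignment define haplotypes, the sketch below finds the variable columns of an alignment and groups sequences by their bases at those columns. The function names and the toy sequences in the usage note are hypothetical. They are not COX2 data from the study, and the sketch assumes gap-free sequences of equal length.

```python
from collections import Counter


def variable_sites(aligned: list) -> list:
    """Return the column indices at which the aligned sequences differ."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


def haplotype_frequencies(aligned: list) -> dict:
    """Map each haplotype (bases at the variable sites) to its relative frequency."""
    sites = variable_sites(aligned)
    keys = ["".join(seq[i] for i in sites) for seq in aligned]
    counts = Counter(keys)
    total = len(aligned)
    return {hap: count / total for hap, count in counts.most_common()}
```

For a toy alignment such as `["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC", "ACGTAA"]`, two columns vary and three haplotypes result, the most common covering three of the five sequences.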
Pandey, Ram Vinay; Pabinger, Stephan; Kriegner, Albert; Weinhäusel, Andreas
2016-01-01
Traditional Sanger sequencing as well as Next-Generation Sequencing have been used for the identification of disease causing mutations in human molecular research. The majority of currently available tools are developed for research and explorative purposes and often do not provide a complete, efficient, one-stop solution. As the focus of currently developed tools is mainly on NGS data analysis, no integrative solution for the analysis of Sanger data is provided and consequently a one-stop solution to analyze reads from both sequencing platforms is not available. We have therefore developed a new pipeline called MutAid to analyze and interpret raw sequencing data produced by Sanger or several NGS sequencing platforms. It performs format conversion, base calling, quality trimming, filtering, read mapping, variant calling, variant annotation and analysis of Sanger and NGS data under a single platform. It is capable of analyzing reads from multiple patients in a single run to create a list of potential disease causing base substitutions as well as insertions and deletions. MutAid has been developed for expert and non-expert users and supports four sequencing platforms including Sanger, Illumina, 454 and Ion Torrent. Furthermore, for NGS data analysis, five read mappers including BWA, TMAP, Bowtie, Bowtie2 and GSNAP and four variant callers including GATK-HaplotypeCaller, SAMTOOLS, Freebayes and VarScan2 pipelines are supported. MutAid is freely available at https://sourceforge.net/projects/mutaid. PMID:26840129
Identification of G-quadruplex forming sequences in three manatee papillomaviruses
Zahin, Maryam; Dean, William L.; Ghim, Shin-je; Joh, Joongho; Gray, Robert D.; Khanal, Sujita; Bossart, Gregory D.; Mignucci-Giannoni, Antonio A.; Rouchka, Eric C.; Jenson, Alfred B.; Trent, John O.; Chaires, Jonathan B.
2018-01-01
The Florida manatee (Trichechus manatus latirostris) is a threatened aquatic mammal in United States coastal waters. Over the past decade, the appearance of papillomavirus-induced lesions and viral papillomatosis in manatees has been a concern for those involved in the management and rehabilitation of this species. To date, three manatee papillomaviruses (TmPVs) have been identified in Florida manatees, one forming cutaneous lesions (TmPV1) and two forming genital lesions (TmPV3 and TmPV4). We identified DNA sequences with the potential to form G-quadruplex structures (G4) across the three genomes. G4 were located on both DNA strands and across coding and non-coding regions on all TmPVs, offering multiple targets for viral control. Although G4 have been identified in several viral genomes, including human PVs, most research has focused on canonical structures composed of three G-tetrads. In contrast, the vast majority of sequences we identified would allow the formation of non-canonical structures with only two G-tetrads. Our biophysical analysis confirmed the formation of G4 with parallel topology in three such sequences from the E2 region. Two of the structures appear to be composed of multiple stacked two G-tetrad structures, perhaps serving to increase structural stability. Computational analysis demonstrated enrichment of G4 sequences on all TmPVs on the reverse strand in the E2/E4 region and on both strands in the L2 region. Several G4 sequences occurred at similar regional locations on all PVs, most notably on the reverse strand in the E2 region. In other cases, G4 were identified at similar regional locations only on PVs forming genital lesions. On all TmPVs, G4 sequences were located in the non-coding region near putative E2 binding sites. Together, these findings suggest that G4 are possible regulatory elements in TmPVs. PMID:29630682
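One common way to scan a genome for putative G-quadruplex motifs is a regular expression of the form G-run, loop, G-run, loop, G-run, loop, G-run. The sketch below follows that convention and is not the computational method used in the study: the function names, the default loop length of 1 to 7 nucleotides, and the toy sequences in the usage note are assumptions. Setting `min_tetrads=2` relaxes the pattern to the two-tetrad motifs the abstract describes, and the reverse strand is searched through the reverse complement.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def g4_regex(min_tetrads: int = 3, max_loop: int = 7):
    """Build the pattern: four G-runs separated by three loops of 1..max_loop bases."""
    run = "G{%d,}" % min_tetrads
    loop = "[ACGT]{1,%d}" % max_loop
    return re.compile(run + "(?:" + loop + run + "){3}")


def find_g4(seq: str, min_tetrads: int = 3, max_loop: int = 7) -> list:
    """Return (start, end, strand) tuples in forward-strand, 0-based, end-exclusive coordinates."""
    seq = seq.upper()
    pattern = g4_regex(min_tetrads, max_loop)
    hits = [(m.start(), m.end(), "+") for m in pattern.finditer(seq)]
    n = len(seq)
    for m in pattern.finditer(reverse_complement(seq)):
        # Convert reverse-complement coordinates back to the forward strand.
        hits.append((n - m.end(), n - m.start(), "-"))
    return sorted(hits)
```

As a usage example, `find_g4("AAGGTGGAGGTGGAA", min_tetrads=2)` reports one forward-strand hit, while the default three-tetrad pattern reports none for the same toy sequence. A regex match only flags a candidate; whether a structure actually forms has to be confirmed experimentally, as the authors did with biophysical analysis.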
Performance analysis of multiple PRF technique for ambiguity resolution
NASA Technical Reports Server (NTRS)
Chang, C. Y.; Curlander, J. C.
1992-01-01
For short wavelength spaceborne synthetic aperture radar (SAR), ambiguity in Doppler centroid estimation occurs when the azimuth squint angle uncertainty is larger than the azimuth antenna beamwidth. Multiple pulse recurrence frequency (PRF) hopping is a technique developed to resolve the ambiguity by operating the radar in different PRF's in the pre-imaging sequence. Performance analysis results of the multiple PRF technique are presented, given the constraints of the attitude bound, the drift rate uncertainty, and the arbitrary numerical values of PRF's. The algorithm performance is derived in terms of the probability of correct ambiguity resolution. Examples, using the Shuttle Imaging Radar-C (SIR-C) and X-SAR parameters, demonstrate that the probability of correct ambiguity resolution obtained by the multiple PRF technique is greater than 95 percent and 80 percent for the SIR-C and X-SAR applications, respectively. The success rate is significantly higher than that achieved by the range cross correlation technique.
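The idea of resolving the Doppler ambiguity with several PRFs can be sketched as follows. Each PRF only reveals the Doppler centroid modulo that PRF, and the candidate that agrees with all measurements within the allowed bound is selected. This is an illustrative, noise-free sketch and not the algorithm analysed in the report; the function names and the PRF values used in any example are assumptions, not SIR-C or X-SAR parameters.

```python
def wrap(freq_hz, prf_hz):
    """Fold a frequency into the baseband interval [-PRF/2, PRF/2)."""
    return (freq_hz + prf_hz / 2) % prf_hz - prf_hz / 2

def resolve_doppler_ambiguity(baseband_hz, prfs_hz, max_abs_doppler_hz):
    """Pick the absolute Doppler centroid most consistent with all PRF measurements.

    baseband_hz[i] is the wrapped estimate obtained with prfs_hz[i].
    """
    f0, prf0 = baseband_hz[0], prfs_hz[0]
    n_max = int(max_abs_doppler_hz // prf0) + 1
    best, best_err = None, float("inf")
    for n in range(-n_max, n_max + 1):
        candidate = f0 + n * prf0          # unwrap using the first PRF
        if abs(candidate) > max_abs_doppler_hz:
            continue
        # Sum of squared wrapped residuals against the remaining PRFs.
        err = sum(wrap(candidate - f, prf) ** 2
                  for f, prf in zip(baseband_hz[1:], prfs_hz[1:]))
        if err < best_err:
            best, best_err = candidate, err
    return best
```

With noisy estimates the minimum-residual candidate can be wrong, which is why the report expresses performance as a probability of correct ambiguity resolution.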
DNA Multiple Sequence Alignment Guided by Protein Domains: The MSA-PAD 2.0 Method.
Balech, Bachir; Monaco, Alfonso; Perniola, Michele; Santamaria, Monica; Donvito, Giacinto; Vicario, Saverio; Maggi, Giorgio; Pesole, Graziano
2018-01-01
Multiple sequence alignment (MSA) is a fundamental component in many DNA sequence analyses including metagenomics studies and phylogeny inference. When guided by protein profiles, DNA multiple alignments assume a higher precision and robustness. Here we present details of the use of the upgraded version of MSA-PAD (2.0), which is a DNA multiple sequence alignment framework able to align DNA sequences coding for single/multiple protein domains guided by PFAM or user-defined annotations. MSA-PAD has two alignment strategies, called "Gene" and "Genome," accounting for coding domain order and genomic rearrangements, respectively. Novel options were added to the present version, where the MSA can be guided by protein profiles provided by the user. This allows MSA-PAD 2.0 to run faster and lets users add custom protein profiles of interest that may not be present in the PFAM database. MSA-PAD 2.0 is currently freely available as a Web application at https://recasgateway.cloud.ba.infn.it/.
Evaluation of microbial community in hydrothermal field by direct DNA sequencing
NASA Astrophysics Data System (ADS)
Kawarabayasi, Y.; Maruyama, A.
2002-12-01
Many extremophiles have been discovered from terrestrial and marine hydrothermal fields. Some thermophiles can grow beyond 90°C in culture, while direct microscopic analysis occasionally indicates that microbes may survive in much hotter hydrothermal fluids. However, it is very difficult to isolate and cultivate such microbes from these environments; over 99% of all microbes remain undiscovered. Based on our experience with whole microbial genome analysis (Y.K.) and microbial community analysis (A.M.), we set out to find unique microbes and genes in hydrothermal fields through direct sequencing of environmental DNA fragments. First, shotgun plasmid libraries were constructed directly from DNA prepared from mixed microbes collected by an in situ filtration system from low-temperature fluids at RM24 in the Southern East Pacific Rise (S-EPR). Gene amplification (PCR) was not used, to avoid introducing mutations during the process. The nucleotide sequences of 285 clones showed that none was identical to any sequence in public databases. Among the 27 clones whose entire sequences were determined, no ORF was identified in 14 clones, a feature resembling eukaryotic introns. In four clones, multiple tandem repeats of tetranucleotide units were identified. This type of sequence has been identified in some familial diseases in humans. The result indicates that living or dead material with eukaryotic features may exist in this low-temperature field. Second, shotgun plasmid libraries were constructed from environmental DNA prepared from Beppu hot springs. Among 143 randomly selected clones used for sequencing, no known sequence was identified. Unlike the clones in the S-EPR library, clear ORFs were identified in all nine clones whose entire sequences were determined. One clone, H4052, was found to contain the complete aspartyl-tRNA synthetase gene.
Phylogenetic analysis using amino acid sequences of this gene indicated that it diverged from those of other Euryarchaea before the differentiation of species. Thus, some novel archaeal species are expected to be present in this field. The present direct cloning and sequencing technique is now opening a window onto a new world in hydrothermal microbial community analysis.
Heuristics for multiobjective multiple sequence alignment.
Abbasi, Maryam; Paquete, Luís; Pereira, Francisco B
2016-07-15
Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameters setting, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may provide an undesirable bias in further steps of the analysis and give too simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives to the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments. 
Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.
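The two-objective view of an alignment described above can be sketched as follows: each candidate alignment is scored by a sum-of-pairs substitution score (to maximise) and a gap count (to minimise), and only non-dominated candidates are kept. This is a simplified illustration, not the heuristics of the paper. The match/mismatch scoring stands in for a substitution matrix, counting individual gap characters is one possible reading of "number of indels/gaps", and the function names are hypothetical.

```python
from itertools import combinations

def objectives(alignment, match=1, mismatch=-1):
    """Return (substitution_score, gap_count) for a list of equal-length aligned rows."""
    score = 0
    for row_a, row_b in combinations(alignment, 2):
        for x, y in zip(row_a, row_b):
            if x == "-" or y == "-":
                continue  # gapped positions only affect the second objective
            score += match if x == y else mismatch
    gaps = sum(row.count("-") for row in alignment)
    return score, gaps

def pareto_front(alignments):
    """Keep alignments not dominated under (maximise score, minimise gaps)."""
    scored = [(objectives(a), a) for a in alignments]
    front = []
    for (s, g), a in scored:
        dominated = any(
            s2 >= s and g2 <= g and (s2 > s or g2 < g)
            for (s2, g2), _ in scored
        )
        if not dominated:
            front.append(((s, g), a))
    return front
```

The hypervolume indicator mentioned in the abstract would then be computed on the score vectors of such a front; that step is omitted here.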
Identification of species by multiplex analysis of variable-length sequences
Pereira, Filipe; Carneiro, João; Matthiesen, Rune; van Asch, Barbara; Pinto, Nádia; Gusmão, Leonor; Amorim, António
2010-01-01
The quest for a universal and efficient method of identifying species has been a longstanding challenge in biology. Here, we show that accurate identification of species in all domains of life can be accomplished by multiplex analysis of variable-length sequences containing multiple insertion/deletion variants. The new method, called SPInDel, is able to discriminate 93.3% of eukaryotic species from 18 taxonomic groups. We also demonstrate that the identification of prokaryotic and viral species with numeric profiles of fragment lengths is generally straightforward. A computational platform is presented to facilitate the planning of projects and includes a large data set with nearly 1800 numeric profiles for species in all domains of life (1556 for eukaryotes, 105 for prokaryotes and 130 for viruses). Finally, a SPInDel profiling kit for discrimination of 10 mammalian species was successfully validated on highly processed food products with species mixtures and proved to be easily adaptable to multiple screening procedures routinely used in molecular biology laboratories. These results suggest that SPInDel is a reliable and cost-effective method for broad-spectrum species identification that is appropriate for use in suboptimal samples and is amenable to different high-throughput genotyping platforms without the need for DNA sequencing. PMID:20923781
Kumar, Yadhu; Westram, Ralf; Kipfer, Peter; Meier, Harald; Ludwig, Wolfgang
2006-01-01
Background Availability of high-resolution RNA crystal structures for the 30S and 50S ribosomal subunits and the subsequent validation of comparative secondary structure models have prompted the biologists to use three-dimensional structure of ribosomal RNA (rRNA) for evaluating sequence alignments of rRNA genes. Furthermore, the secondary and tertiary structural features of rRNA are highly useful and successfully employed in designing rRNA targeted oligonucleotide probes intended for in situ hybridization experiments. RNA3D, a program to combine sequence alignment information with three-dimensional structure of rRNA was developed. Integration into ARB software package, which is used extensively by the scientific community for phylogenetic analysis and molecular probe designing, has substantially extended the functionality of ARB software suite with 3D environment. Results Three-dimensional structure of rRNA is visualized in OpenGL 3D environment with the abilities to change the display and overlay information onto the molecule, dynamically. Phylogenetic information derived from the multiple sequence alignments can be overlaid onto the molecule structure in a real time. Superimposition of both statistical and non-statistical sequence associated information onto the rRNA 3D structure can be done using customizable color scheme, which is also applied to a textual sequence alignment for reference. Oligonucleotide probes designed by ARB probe design tools can be mapped onto the 3D structure along with the probe accessibility models for evaluation with respect to secondary and tertiary structural conformations of rRNA. Conclusion Visualization of three-dimensional structure of rRNA in an intuitive display provides the biologists with the greater possibilities to carry out structure based phylogenetic analysis. Coupled with secondary structure models of rRNA, RNA3D program aids in validating the sequence alignments of rRNA genes and evaluating probe target sites. 
Superimposition of the information derived from the multiple sequence alignment onto the molecule dynamically allows the researchers to observe any sequence inherited characteristics (phylogenetic information) in real-time environment. The extended ARB software package is made freely available for the scientific community via . PMID:16672074
Stable scalable control of soliton propagation in broadband nonlinear optical waveguides
NASA Astrophysics Data System (ADS)
Peleg, Avner; Nguyen, Quan M.; Huynh, Toan T.
2017-02-01
We develop a method for achieving scalable transmission stabilization and switching of N colliding soliton sequences in optical waveguides with broadband delayed Raman response and narrowband nonlinear gain-loss. We show that dynamics of soliton amplitudes in N-sequence transmission is described by a generalized N-dimensional predator-prey model. Stability and bifurcation analysis for the predator-prey model are used to obtain simple conditions on the physical parameters for robust transmission stabilization as well as on-off and off-on switching of M out of N soliton sequences. Numerical simulations for single-waveguide transmission with a system of N coupled nonlinear Schrödinger equations with 2 ≤ N ≤ 4 show excellent agreement with the predator-prey model's predictions and stable propagation over significantly larger distances compared with other broadband nonlinear single-waveguide systems. Moreover, stable on-off and off-on switching of multiple soliton sequences and stable multiple transmission switching events are demonstrated by the simulations. We discuss the reasons for the robustness and scalability of transmission stabilization and switching in waveguides with broadband delayed Raman response and narrowband nonlinear gain-loss, and explain their advantages compared with other broadband nonlinear waveguides.
2013-01-01
Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made. PMID:23800020
DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors.
Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard
2004-09-09
Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristics by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
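Strategy (a) above, distributing independent pairwise alignments to several workers, can be sketched as follows. The sketch uses a plain Needleman-Wunsch global alignment score as a stand-in for DIALIGN's segment-based pairwise step, and the function names and scoring parameters are assumptions. A thread pool is used for portability; for CPU-bound work in CPython a process pool (`concurrent.futures.ProcessPoolExecutor`) would be the natural choice and can be passed as `executor_cls`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def nw_score(pair, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a (seq_a, seq_b) pair (stand-in pairwise step)."""
    a, b = pair
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def all_pairwise_scores(seqs, workers=4, executor_cls=ThreadPoolExecutor):
    """Score every sequence pair; pairs are independent, so order of execution is irrelevant."""
    pairs = list(combinations(range(len(seqs)), 2))
    jobs = [(seqs[i], seqs[j]) for i, j in pairs]
    with executor_cls(max_workers=workers) as pool:
        scores = list(pool.map(nw_score, jobs))  # map preserves job order
    return dict(zip(pairs, scores))
```

Because each pair is scored independently, the result is the same for any number of workers, which mirrors the abstract's point that distribution has no effect on the output alignments.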
González, Carolina; Tabernero, David; Cortese, Maria Francesca; Gregori, Josep; Casillas, Rosario; Riveiro-Barciela, Mar; Godoy, Cristina; Sopena, Sara; Rando, Ariadna; Yll, Marçal; Lopez-Martinez, Rosa; Quer, Josep; Esteban, Rafael; Buti, Maria; Rodríguez-Frías, Francisco
2018-05-21
To detect hyper-conserved regions in the hepatitis B virus (HBV) X gene (HBX) 5' region that could be candidates for gene therapy. The study included 27 chronic hepatitis B treatment-naive patients in various clinical stages (from chronic infection to cirrhosis and hepatocellular carcinoma, both HBeAg-negative and HBeAg-positive), and infected with HBV genotypes A-F and H. In a serum sample from each patient with viremia > 3.5 log IU/mL, the HBX 5' end region [nucleotide (nt) 1255-1611] was PCR-amplified and submitted to next-generation sequencing (NGS). We assessed genotype variants by phylogenetic analysis, and evaluated conservation of this region by calculating the information content of each nucleotide position in a multiple alignment of all unique sequences (haplotypes) obtained by NGS. Conservation at the HBx protein amino acid (aa) level was also analyzed. NGS yielded 1333069 sequences from the 27 samples, with a median of 4578 sequences/sample (2487-9279, IQR 2817). In 14/27 patients (51.8%), phylogenetic analysis of viral nucleotide haplotypes showed a complex mixture of genotypic variants. Analysis of the information content in the haplotype multiple alignments detected 2 hyper-conserved nucleotide regions, one in the HBX upstream non-coding region (nt 1255-1286) and the other in the 5' end coding region (nt 1519-1603). This last region coded for a conserved amino acid region (aa 63-76) that partially overlaps a Kunitz-like domain. Two hyper-conserved regions detected in the HBX 5' end may be of value for targeted gene therapy, regardless of the patients' clinical stage or HBV genotype.
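The per-position information content used above to evaluate conservation can be illustrated with a short sketch. It computes the standard Shannon-based value, log2(4) minus the column entropy, for each alignment column and then reports runs of highly conserved columns. The abstract does not give the exact formula, haplotype weighting, or thresholds used in the study, so the unweighted haplotypes, the absence of a small-sample correction, and the names `information_content` and `conserved_runs` are all assumptions.

```python
import math
from collections import Counter

def information_content(alignment, alphabet="ACGT"):
    """Per-column information content in bits (maximum log2(len(alphabet)))."""
    max_bits = math.log2(len(alphabet))
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)  # gaps and Ns ignored
        total = sum(counts.values())
        if total == 0:
            result.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        result.append(max_bits - entropy)
    return result

def conserved_runs(ic, threshold, min_len):
    """Return 0-based inclusive (start, end) runs with ic >= threshold over >= min_len columns."""
    runs, start = [], None
    for i, value in enumerate(list(ic) + [float("-inf")]):  # sentinel closes a trailing run
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs
```

A fully conserved nucleotide column scores 2 bits and a column with all four bases equally frequent scores 0 bits, so a hyper-conserved region appears as a long run of columns close to 2 bits.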
Genome-wide gene–gene interaction analysis for next-generation sequencing
Zhao, Jinying; Zhu, Yun; Xiong, Momiao
2016-01-01
The critical barrier in interaction analysis for next-generation sequencing (NGS) data is that the traditional pairwise interaction analysis that is suitable for common variants is difficult to apply to rare variants because of their prohibitive computational time, large number of tests and low power. The great challenges for successful detection of interactions with NGS data are (1) the demands in the paradigm of changes in interaction analysis; (2) severe multiple testing; and (3) heavy computations. To meet these challenges, we shift the paradigm of interaction analysis between two SNPs to interaction analysis between two genomic regions. In other words, we take a gene as a unit of analysis and use functional data analysis techniques as dimensional reduction tools to develop a novel statistic to collectively test interaction between all possible pairs of SNPs within two genome regions. By intensive simulations, we demonstrate that the functional logistic regression for interaction analysis has the correct type 1 error rates and higher power to detect interaction than the currently used methods. The proposed method was applied to a coronary artery disease dataset from the Wellcome Trust Case Control Consortium (WTCCC) study and the Framingham Heart Study (FHS) dataset, and the early-onset myocardial infarction (EOMI) exome sequence datasets with European origin from the NHLBI's Exome Sequencing Project. We discovered that 6 of 27 pairs of significantly interacted genes in the FHS were replicated in the independent WTCCC study and 24 pairs of significantly interacted genes after applying Bonferroni correction in the EOMI study. PMID:26173972
Dcode.org anthology of comparative genomic tools.
Loots, Gabriela G; Ovcharenko, Ivan
2005-07-01
Comparative genomics provides the means to demarcate functional regions in anonymous DNA sequences. The successful application of this method to identifying novel genes is currently shifting to deciphering the non-coding encryption of gene regulation across genomes. To facilitate the practical application of comparative sequence analysis to genetics and genomics, we have developed several analytical and visualization tools for the analysis of arbitrary sequences and whole genomes. These tools include two alignment tools, zPicture and Mulan; a phylogenetic shadowing tool, eShadow for identifying lineage- and species-specific functional elements; two evolutionary conserved transcription factor analysis tools, rVista and multiTF; a tool for extracting cis-regulatory modules governing the expression of co-regulated genes, Creme 2.0; and a dynamic portal to multiple vertebrate and invertebrate genome alignments, the ECR Browser. Here, we briefly describe each one of these tools and provide specific examples on their practical applications. All the tools are publicly available at the http://www.dcode.org/ website.
Ivors, K; Garbelotto, M; Vries, I D E; Ruyter-Spira, C; Te Hekkert, B; Rosenzweig, N; Bonants, P
2006-05-01
Analysis of 12 polymorphic simple sequence repeats identified in the genome sequence of Phytophthora ramorum, causal agent of 'sudden oak death', revealed genotypic diversity to be significantly higher in nurseries (91% of total) than in forests (18% of total). Our analysis identified only two closely related genotypes in US forests, while the genetic structure of populations from European nurseries was of intermediate complexity, including multiple, closely related genotypes. Multilocus analysis determined populations in US forests reproduce clonally and are likely descendants of a single introduced individual. The 151 isolates analysed clustered in three clades. US forest and European nursery isolates clustered into two distinct clades, while one isolate from a US nursery belonged to a third novel clade. The combined microsatellite, sequencing and morphological analyses suggest the three clades represent distinct evolutionary lineages. All three clades were identified in some US nurseries, emphasizing the role of commercial plant trade in the movement of this pathogen.
Typing and comparative genome analysis of Brucella melitensis isolated from Lebanon.
Abou Zaki, Natalia; Salloum, Tamara; Osman, Marwan; Rafei, Rayane; Hamze, Monzer; Tokajian, Sima
2017-10-16
Brucella melitensis is the main causative agent of the zoonotic disease brucellosis. This study aimed at typing and characterizing genetic variation in 33 Brucella isolates recovered from patients in Lebanon. Bruce-ladder multiplex PCR and PCR-RFLP of omp31, omp2a and omp2b were performed. Sixteen representative isolates were chosen for draft-genome sequencing and analyzed to determine variations in virulence, resistance, genomic islands, prophages and insertion sequences. Comparative whole-genome single nucleotide polymorphism analysis was also performed. The isolates were confirmed to be B. melitensis. Genome analysis revealed multiple virulence determinants and efflux pumps. Genome comparisons and single nucleotide polymorphisms divided the isolates based on geographical distribution but revealed high levels of similarity between the strains. Sequence divergence in B. melitensis was mainly due to lateral gene transfer of mobile elements. This is the first report of an in-depth genomic characterization of B. melitensis in Lebanon. © FEMS 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Multiple Access Interference Reduction Using Received Response Code Sequence for DS-CDMA UWB System
NASA Astrophysics Data System (ADS)
Toh, Keat Beng; Tachikawa, Shin'ichi
This paper proposes a combination of novel Received Response (RR) sequence at the transmitter and a Matched Filter-RAKE (MF-RAKE) combining scheme receiver system for the Direct Sequence-Code Division Multiple Access Ultra Wideband (DS-CDMA UWB) multipath channel model. This paper also demonstrates the effectiveness of the RR sequence in Multiple Access Interference (MAI) reduction for the DS-CDMA UWB system. It suggests that by using conventional binary code sequence such as the M sequence or the Gold sequence, there is a possibility of generating extra MAI in the UWB system. Therefore, it is quite difficult to collect the energy efficiently although the RAKE reception method is applied at the receiver. The main purpose of the proposed system is to overcome the performance degradation for UWB transmission due to the occurrence of MAI during multiple accessing in the DS-CDMA UWB system. The proposed system improves the system performance by improving the RAKE reception performance using the RR sequence which can reduce the MAI effect significantly. Simulation results verify that significant improvement can be obtained by the proposed system in the UWB multipath channel models.
Doddapaneni, Harshavardhan; Yao, Jiqiang; Lin, Hong; Walker, M Andrew; Civerolo, Edwin L
2006-01-01
Background The Gram-negative, xylem-limited phytopathogenic bacterium Xylella fastidiosa is responsible for causing economically important diseases in grapevine, citrus and many other plant species. Despite its economic impact, relatively little is known about the genomic variations among strains isolated from different hosts and their influence on the population genetics of this pathogen. With the availability of genome sequence information for four strains, it is now possible to perform genome-wide analyses to identify and categorize such DNA variations and to understand their influence on strain functional divergence. Results There are 1,579 genes and 194 non-coding homologous sequences present in the genomes of all four strains, representing a 76.2% conservation of the sequenced genome. About 60% of the X. fastidiosa unique sequences exist as tandem gene clusters of 6 or more genes. Multiple alignments identified 12,754 SNPs and 14,449 INDELs in the 1528 common genes and 20,779 SNPs and 10,075 INDELs in the 194 non-coding sequences. The average SNP frequency was 1.08 × 10^-2 per base pair of DNA and the average INDEL frequency was 2.06 × 10^-2 per base pair of DNA. On an average, 60.33% of the SNPs were synonymous type while 39.67% were non-synonymous type. The mutation frequency, primarily in the form of external INDELs was the main type of sequence variation. The relative similarity between the strains was discussed according to the INDEL and SNP differences. The number of genes unique to each strain were 60 (9a5c), 54 (Dixon), 83 (Ann1) and 9 (Temecula-1). A sub-set of the strain specific genes showed significant differences in terms of their codon usage and GC composition from the native genes suggesting their xenologous origin. Tandem repeat analysis of the genomic sequences of the four strains identified associations of repeat sequences with hypothetical and phage related functions. 
Conclusion INDELs and strain specific genes have been identified as the main source of variations among strains, with individual strains showing different rates of genome evolution. Based on these genome comparisons, it appears that the Pierce's disease strain Temecula-1 genome represents the ancestral genome of the X. fastidiosa. Results of this analysis are publicly available in the form of a web database. PMID:16948851
Pollen, Alex A; Nowakowski, Tomasz J; Shuga, Joe; Wang, Xiaohui; Leyrat, Anne A; Lui, Jan H; Li, Nianzhen; Szpankowski, Lukasz; Fowler, Brian; Chen, Peilin; Ramalingam, Naveen; Sun, Gang; Thu, Myo; Norris, Michael; Lebofsky, Ronald; Toppani, Dominique; Kemp, Darnell W; Wong, Michael; Clerkson, Barry; Jones, Brittnee N; Wu, Shiquan; Knutsson, Lawrence; Alvarado, Beatriz; Wang, Jing; Weaver, Lesley S; May, Andrew P; Jones, Robert C; Unger, Marc A; Kriegstein, Arnold R; West, Jay A A
2014-10-01
Large-scale surveys of single-cell gene expression have the potential to reveal rare cell populations and lineage relationships but require efficient methods for cell capture and mRNA sequencing. Although cellular barcoding strategies allow parallel sequencing of single cells at ultra-low depths, the limitations of shallow sequencing have not been investigated directly. By capturing 301 single cells from 11 populations using microfluidics and analyzing single-cell transcriptomes across downsampled sequencing depths, we demonstrate that shallow single-cell mRNA sequencing (~50,000 reads per cell) is sufficient for unbiased cell-type classification and biomarker identification. In the developing cortex, we identify diverse cell types, including multiple progenitor and neuronal subtypes, and we identify EGR1 and FOS as previously unreported candidate targets of Notch signaling in human but not mouse radial glia. Our strategy establishes an efficient method for unbiased analysis and comparison of cell populations from heterogeneous tissue by microfluidic single-cell capture and low-coverage sequencing of many cells.
iSeq: Web-Based RNA-seq Data Analysis and Visualization.
Zhang, Chao; Fan, Caoqi; Gan, Jingbo; Zhu, Ping; Kong, Lei; Li, Cheng
2018-01-01
Transcriptome sequencing (RNA-seq) is becoming a standard experimental methodology for genome-wide characterization and quantification of transcripts at single base-pair resolution. However, downstream analysis of massive amounts of sequencing data can be prohibitively technical for wet-lab researchers. A functionally integrated and user-friendly platform is required to meet this demand. Here, we present iSeq, an R-based Web server, for RNA-seq data analysis and visualization. iSeq is a streamlined Web-based R application under the Shiny framework, featuring a simple user interface and multiple data analysis modules. Users without programming and statistical skills can analyze their RNA-seq data and construct publication-level graphs through a standardized yet customizable analytical pipeline. iSeq is accessible via Web browsers on any operating system at http://iseq.cbi.pku.edu.cn.
Spectral analysis of variable-length coded digital signals
NASA Astrophysics Data System (ADS)
Cariolaro, G. L.; Pierobon, G. L.; Pupolin, S. G.
1982-05-01
A spectral analysis is conducted for a variable-length word sequence produced by an encoder driven by a stationary memoryless source. A finite-state sequential machine is considered as a model of the line encoder, and the spectral analysis of the encoded message is performed under the assumption that the sourceword sequence is composed of independent identically distributed words. Closed form expressions for both the continuous and discrete parts of the spectral density are derived in terms of the encoder law and sourceword statistics. The jump part exhibits jumps at integer multiples of 1/(lambda_0 T), where lambda_0 is the greatest common divisor of the possible codeword lengths, and T is the symbol period. The derivation of the continuous part can be conveniently factorized, and the theory is applied to the spectral analysis of BnZS and HDBn codes.
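As a small worked illustration of where the discrete spectral lines fall, the sketch below takes a set of possible codeword lengths, computes their greatest common divisor lambda_0, and lists the first few line frequencies. It reads the abstract's statement as placing lines at integer multiples of 1/(lambda_0 T), which is an interpretation of the abstract's wording. The function name and the example values are hypothetical.

```python
from functools import reduce
from math import gcd

def spectral_line_frequencies(codeword_lengths, symbol_period, count):
    """Return (lambda_0, first `count` line frequencies k / (lambda_0 * T), k = 0, 1, ...)."""
    lambda_0 = reduce(gcd, codeword_lengths)      # gcd of all possible codeword lengths
    spacing = 1.0 / (lambda_0 * symbol_period)    # distance between adjacent spectral lines
    return lambda_0, [k * spacing for k in range(count)]
```

For codeword lengths of 4, 6 and 10 symbols and a symbol period of 0.5 s, lambda_0 is 2 and the lines are spaced 1 Hz apart.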
Jiménez, Cristina; Jara-Acevedo, María; Corchete, Luis A; Castillo, David; Ordóñez, Gonzalo R; Sarasquete, María E; Puig, Noemí; Martínez-López, Joaquín; Prieto-Conde, María I; García-Álvarez, María; Chillón, María C; Balanzategui, Ana; Alcoceba, Miguel; Oriol, Albert; Rosiñol, Laura; Palomera, Luis; Teruel, Ana I; Lahuerta, Juan J; Bladé, Joan; Mateos, María V; Orfão, Alberto; San Miguel, Jesús F; González, Marcos; Gutiérrez, Norma C; García-Sanz, Ramón
2017-01-01
Identification and characterization of genetic alterations are essential for diagnosis of multiple myeloma and may guide therapeutic decisions. Currently, genomic analysis of myeloma to cover the diverse range of alterations with prognostic impact requires fluorescence in situ hybridization (FISH), single nucleotide polymorphism arrays, and sequencing techniques, which are costly and labor intensive and require large numbers of plasma cells. To overcome these limitations, we designed a targeted-capture next-generation sequencing approach for one-step identification of IGH translocations, V(D)J clonal rearrangements, the IgH isotype, and somatic mutations to rapidly identify risk groups and specific targetable molecular lesions. Forty-eight newly diagnosed myeloma patients were tested with the panel, which included IGH and six genes that are recurrently mutated in myeloma: NRAS, KRAS, HRAS, TP53, MYC, and BRAF. We identified 14 of 17 IGH translocations previously detected by FISH and three confirmed translocations not detected by FISH, with the additional advantage of breakpoint identification, which can be used as a target for evaluating minimal residual disease. IgH subclass and V(D)J rearrangements were identified in 77% and 65% of patients, respectively. Mutation analysis revealed the presence of missense protein-coding alterations in at least one of the evaluating genes in 16 of 48 patients (33%). This method may represent a time- and cost-effective diagnostic method for the molecular characterization of multiple myeloma. Copyright © 2017 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Leshinsky-Silver, Esther; Malinger, Gustavo; Ben-Sira, Liat; Kidron, Dvora; Cohen, Sarit; Inbar, Shani; Bezaleli, Tali; Levine, Arie; Vinkler, Chana; Lev, Dorit; Lerman-Sagie, Tally
2011-01-01
Aicardi–Goutières syndrome (AGS) is a genetic neurodegenerative disorder with clinical symptoms mimicking a congenital viral infection. Five causative genes have been described: three prime repair exonuclease 1 (TREX1), ribonucleases H2A, B and C, and most recently SAM domain and HD domain 1 (SAMHD1). We performed a detailed clinical and molecular characterization of a family with autosomal recessive neurodegenerative disorder showing white matter destruction and calcifications, presenting in utero and associated with multiple mtDNA deletions. A muscle biopsy was normal and did not show any evidence of respiratory chain dysfunction. Southern blot analysis of tissue from a living child and affected fetuses demonstrated multiple mtDNA deletions. Molecular analysis of genes involved in mtDNA synthesis and maintenance (POLGα, POLGβ, Twinkle, ANT1, TK2, SUCLA1 and DGOUK) revealed normal sequences. Sequencing of TREX1 and ribonucleases H2A, B and C failed to reveal any mutations. Whole-genome homozygosity mapping revealed a candidate region containing the SAMHD1 gene. Sequencing of the gene in the affected child and two affected fetuses revealed a large deletion (9 kb), spanning the promoter, exon 1 and intron 1. The parents were found to be heterozygous for this deletion. The identification of a homozygous large deletion in the SAMHD1 gene causing atypical AGS with multiple mtDNA deletions may add information regarding the involvement of mitochondria in self-activation of innate immunity by cell intrinsic components. PMID:21102625
Due to the accumulating evidence that suggests that numerous unhealthy conditions in the indoor environment are the result of abnormal growth of the filamentous fungi (mold) in and on building surfaces, it is necessary to accurately determine the organisms responsible for these m...
A Common Framework for Multiple Sources of Bacterial Annotation
White, Owen
2018-05-03
Owen White, professor of epidemiology and preventive medicine at the University of Maryland School of Medicine and a researcher at the University of Maryland Institute for Genome Sciences, gives the May 29, 2009 keynote speech at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM.
The genomic landscape of rapid, repeated evolutionary rescue from toxic pollution in wild fish
USDA-ARS's Scientific Manuscript database
Here we describe evolutionary rescue from intense pollution via multiple modes of selection in killifish populations from 4 urban estuaries of the US eastern seaboard. Comparative transcriptomics and analysis of 384 whole genome sequences show that the functioning of a receptor-based signaling pathw...
Lindholdt, Louise; Labriola, Merete; Nielsen, Claus Vinther; Horsbøl, Trine Allerslev; Lund, Thomas
2017-07-20
The return-to-work (RTW) process after long-term sickness absence is often complex and long and implies multiple shifts between different labour market states for the absentee. Standard methods for examining RTW research typically rely on the analysis of one outcome measure at a time, which will not capture the many possible states and transitions the absentee can go through. The purpose of this study was to explore the potential added value of sequence analysis as a supplement to standard regression analysis of a multidisciplinary RTW intervention among patients with low back pain (LBP). The study population consisted of 160 patients randomly allocated to either a hospital-based brief or a multidisciplinary intervention. Data on labour market participation following intervention were obtained from a national register and analysed in two ways: as a binary outcome expressed as active or passive relief at a 1-year follow-up and as four different categories for labour market participation. Logistic regression and sequence analysis were performed. The logistic regression analysis showed no difference in labour market participation for patients in the two groups after 1 year. Applying sequence analysis showed differences in subsequent labour market participation 2 years after baseline in favour of the brief intervention group versus the multidisciplinary intervention group. The study indicated that sequence analysis could provide added analytical value as a supplement to traditional regression analysis in prospective studies of RTW among patients with LBP. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Secondary structure prediction and structure-specific sequence analysis of single-stranded DNA.
Dong, F; Allawi, H T; Anderson, T; Neri, B P; Lyamichev, V I
2001-08-01
DNA sequence analysis by oligonucleotide binding is often affected by interference with the secondary structure of the target DNA. Here we describe an approach that improves DNA secondary structure prediction by combining enzymatic probing of DNA by structure-specific 5'-nucleases with an energy minimization algorithm that utilizes the 5'-nuclease cleavage sites as constraints. The method can identify structural differences between two DNA molecules caused by minor sequence variations such as a single nucleotide mutation. It also demonstrates the existence of long-range interactions between DNA regions separated by >300 nt and the formation of multiple alternative structures by a 244 nt DNA molecule. The differences in the secondary structure of DNA molecules revealed by 5'-nuclease probing were used to design structure-specific probes for mutation discrimination that target the regions of structural, rather than sequence, differences. We also demonstrate the performance of structure-specific 'bridge' probes complementary to non-contiguous regions of the target molecule. The structure-specific probes do not require the high stringency binding conditions necessary for methods based on mismatch formation and permit mutation detection at temperatures from 4 to 37 degrees C. Structure-specific sequence analysis is applied for mutation detection in the Mycobacterium tuberculosis katG gene and for genotyping of the hepatitis C virus.
Using whole-exome sequencing to identify variants inherited from mosaic parents
Rios, Jonathan J; Delgado, Mauricio R
2015-01-01
Whole-exome sequencing (WES) has allowed the discovery of genes and variants causing rare human disease. This is often achieved by comparing nonsynonymous variants between unrelated patients, and particularly for sporadic or recessive disease, often identifies a single or few candidate genes for further consideration. However, despite the potential for this approach to elucidate the genetic cause of rare human disease, a majority of patients fail to realize a genetic diagnosis using standard exome analysis methods. Although genetic heterogeneity contributes to the difficulty of exome sequence analysis between patients, it remains plausible that rare human disease is not caused by de novo or recessive variants. Multiple human disorders have been described for which the variant was inherited from a phenotypically normal mosaic parent. Here we highlight the potential for exome sequencing to identify a reasonable number of candidate genes when dominant disease variants are inherited from a mosaic parent. We show the power of WES to identify a limited number of candidate genes using this disease model and how sequence coverage affects identification of mosaic variants by WES. We propose this analysis as an alternative to discover genetic causes of rare human disorders for which typical WES approaches fail to identify likely pathogenic variants. PMID:24986828
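The dependence of mosaic variant detection on sequence coverage can be illustrated with a simple binomial model. The sketch below is an assumption for illustration only, not the authors' analysis: it treats each read as an independent draw that carries the variant allele with probability `alt_fraction`, and asks how likely it is to observe at least `min_alt_reads` variant reads at a given `depth`. The function name and parameters are hypothetical, and sequencing error and mapping bias are ignored.

```python
from math import comb


def detection_probability(depth, alt_fraction, min_alt_reads):
    """Probability of seeing at least `min_alt_reads` variant reads.

    Assumes reads are independent and each one carries the variant
    allele with probability `alt_fraction` (a simplifying assumption).
    """
    # Sum the binomial probabilities of seeing too few variant reads.
    upper = min(min_alt_reads, depth + 1)
    miss = sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(upper)
    )
    return 1.0 - miss


if __name__ == "__main__":
    # Compare a low-level mosaic variant at two hypothetical depths.
    for depth in (20, 100):
        p = detection_probability(depth, alt_fraction=0.1, min_alt_reads=3)
        print(depth, round(p, 3))
```

Under this toy model, a variant present in a small fraction of reads is much more likely to reach a fixed read-count threshold at higher depth, which is the qualitative relationship the abstract describes.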
Seneca, Sara; Vancampenhout, Kim; Van Coster, Rudy; Smet, Joél; Lissens, Willy; Vanlander, Arnaud; De Paepe, Boel; Jonckheere, An; Stouffs, Katrien; De Meirleir, Linda
2015-01-01
Next-generation sequencing (NGS), an innovative sequencing technology that enables the successful analysis of numerous gene sequences in a massive parallel sequencing approach, has revolutionized the field of molecular biology. Although NGS was introduced in a rather recent past, the technology has already demonstrated its potential and effectiveness in many research projects, and is now on the verge of being introduced into the diagnostic setting of routine laboratories to delineate the molecular basis of genetic disease in undiagnosed patient samples. We tested a benchtop device on retrospective genomic DNA (gDNA) samples of controls and patients with a clinical suspicion of a mitochondrial DNA disorder. This Ion Torrent Personal Genome Machine platform is a high-throughput sequencer with a fast turnaround time and reasonable running costs. We challenged the chemistry and technology with the analysis and processing of a mutational spectrum composed of samples with single-nucleotide substitutions, indels (insertions and deletions) and large single or multiple deletions, occasionally in heteroplasmy. The output data were compared with previously obtained conventional dideoxy sequencing results and the mitochondrial revised Cambridge Reference Sequence (rCRS). We were able to identify the majority of all nucleotide alterations, but three false-negative results were also encountered in the data set. At the same time, the poor performance of the PGM instrument in regions associated with homopolymeric stretches generated many false-positive miscalls demanding additional manual curation of the data.
Rybarczyk-Mydłowska, Katarzyna; Maboreke, Hazel Ruvimbo; van Megen, Hanny; van den Elsen, Sven; Mooyman, Paul; Smant, Geert; Bakker, Jaap; Helder, Johannes
2012-11-21
Plant parasitic nematodes are unusual Metazoans as they are equipped with genes that allow for symbiont-independent degradation of plant cell walls. Among the cell wall-degrading enzymes, glycoside hydrolase family 5 (GHF5) cellulases are relatively well characterized, especially for high impact parasites such as root-knot and cyst nematodes. Interestingly, ancestors of extant nematodes most likely acquired these GHF5 cellulases from a prokaryote donor by one or multiple lateral gene transfer events. To obtain insight into the origin of GHF5 cellulases among evolutionary advanced members of the order Tylenchida, cellulase biodiversity data from less distal family members were collected and analyzed. Single nematodes were used to obtain (partial) genomic sequences of cellulases from representatives of the genera Meloidogyne, Pratylenchus, Hirschmanniella and Globodera. Combined Bayesian analysis of ≈ 100 cellulase sequences revealed three types of catalytic domains (A, B, and C). Represented by 84 sequences, type B is numerically dominant, and the overall topology of the catalytic domain type shows remarkable resemblance with trees based on neutral (= pathogenicity-unrelated) small subunit ribosomal DNA sequences. Bayesian analysis further suggested a sister relationship between the lesion nematode Pratylenchus thornei and all type B cellulases from root-knot nematodes. Yet, the relationship between the three catalytic domain types remained unclear. Superposition of intron data onto the cellulase tree suggests that types B and C are related, and together distinct from type A that is characterized by two unique introns. All Tylenchida members investigated here harbored one or multiple GHF5 cellulases. Three types of catalytic domains are distinguished, and the presence of at least two types is relatively common among plant parasitic Tylenchida. 
Analysis of coding sequences of cellulases suggests that root-knot and cyst nematodes did not acquire this gene directly by lateral gene transfer. More likely, these genes were passed on by ancestors of a family nowadays known as the Pratylenchidae.
Image sequence analysis workstation for multipoint motion analysis
NASA Astrophysics Data System (ADS)
Mostafavi, Hassan
1990-08-01
This paper describes an application-specific engineering workstation designed and developed to analyze motion of objects from video sequences. The system combines the software and hardware environment of a modern graphic-oriented workstation with digital image acquisition, processing and display techniques. In addition to automation and increase in throughput of data reduction tasks, the objective of the system is to provide less invasive methods of measurement by offering the ability to track objects that are more complex than reflective markers. Grey level image processing and spatial/temporal adaptation of the processing parameters are used for location and tracking of more complex features of objects under uncontrolled lighting and background conditions. The applications of such an automated and noninvasive measurement tool include analysis of the trajectory and attitude of rigid bodies such as human limbs, robots, aircraft in flight, etc. The system's key features are: 1) acquisition and storage of image sequences by digitizing and storing real-time video; 2) computer-controlled movie loop playback, freeze frame display, and digital image enhancement; 3) multiple leading edge tracking in addition to object centroids at up to 60 fields per second from either live input video or a stored image sequence; 4) model-based estimation and tracking of the six degrees of freedom of a rigid body; 5) field-of-view and spatial calibration; 6) image sequence and measurement data base management; and 7) offline analysis software for trajectory plotting and statistical analysis.
The Saccharomyces Genome Database Variant Viewer
Sheppard, Travis K.; Hitz, Benjamin C.; Engel, Stacia R.; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C.; Dalusag, Kyla S.; Demeter, Janos; Hellerstedt, Sage T.; Karra, Kalpana; Nash, Robert S.; Paskov, Kelley M.; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Cherry, J. Michael
2016-01-01
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer. PMID:26578556
van der Ley, P
1988-11-01
Gonococci express a family of related outer membrane proteins designated protein II (P.II). These surface proteins are subject to both phase variation and antigenic variation. The P.II gene repertoire of Neisseria gonorrhoeae strain JS3 was found to consist of at least ten genes, eight of which were cloned. Sequence analysis and DNA hybridization studies revealed that one particular P.II-encoding sequence is present in three distinct, but almost identical, copies in the JS3 genome. These genes encode the P.II protein that was previously identified as P.IIc. Comparison of their sequences shows that the multiple copies of this P.IIc-encoding gene might have been generated by both gene conversion and gene duplication.
A Simple Exact Error Rate Analysis for DS-CDMA with Arbitrary Pulse Shape in Flat Nakagami Fading
NASA Astrophysics Data System (ADS)
Rahman, Mohammad Azizur; Sasaki, Shigenobu; Kikuchi, Hisakazu; Harada, Hiroshi; Kato, Shuzo
A simple exact error rate analysis is presented for random binary direct sequence code division multiple access (DS-CDMA) considering a general pulse shape and flat Nakagami fading channel. First of all, a simple model is developed for the multiple access interference (MAI). Based on this, a simple exact expression of the characteristic function (CF) of MAI is developed in a straightforward manner. Finally, an exact expression of error rate is obtained following the CF method of error rate analysis. The exact error rate so obtained can be evaluated much more easily than the only reliable approximate error rate expression currently available, which is based on the Improved Gaussian Approximation (IGA).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Czarnecki, Olaf; Bryan, Anthony C.; Jawdy, Sara S.
Genetic engineering of plants that results in successful establishment of new biochemical or regulatory pathways requires stable introduction of one or more genes into the plant genome. It might also be necessary to down-regulate or turn off expression of endogenous genes in order to reduce activity of competing pathways. An established way to knockdown gene expression in plants is expressing a hairpin-RNAi construct, eventually leading to degradation of a specifically targeted mRNA. Knockdown of multiple genes that do not share homologous sequences is still challenging and involves either sophisticated cloning strategies to create vectors with different serial expression constructs or multiple transformation events that is often restricted by a lack of available transformation markers. Synthetic RNAi fragments were assembled in yeast carrying homologous sequences to six or seven non-family genes and introduced into pAGRIKOLA. Transformation of Arabidopsis thaliana and subsequent expression analysis of targeted genes proved efficient knockdown of all target genes. In conclusion, we present a simple and cost-effective method to create constructs to simultaneously knockdown multiple non-family genes or genes that do not share sequence homology. The presented method can be applied in plant and animal synthetic biology as well as traditional plant and animal genetic engineering.
Schmidt, S; Pericak-Vance, M A; Sawcer, S; Barcellos, L F; Hart, J; Sims, J; Prokop, A M; van der Walt, J; DeLoa, C; Lincoln, R R; Oksenberg, J R; Compston, A; Hauser, S L; Haines, J L; Gregory, S G
2006-07-01
Discrepant findings have been reported regarding an association of the apolipoprotein E (APOE) gene with the clinical course of multiple sclerosis (MS). To resolve these discrepancies, we examined common sequence variation in six candidate genes residing in a 380-kb genomic region surrounding and including the APOE locus for an association with MS severity. We genotyped at least three polymorphisms in each of six candidate genes in 1,540 Caucasian MS families (729 single-case and multiple-case families from the United States, 811 single-case families from the UK). By applying the quantitative transmission/disequilibrium test to a recently proposed MS severity score, the only statistically significant (P=0.003) association with MS severity was found for an intronic variant in the Herpes Virus Entry Mediator-B Gene PVRL2. Additional genotyping extended the association to a 16.6 kb block spanning intron 1 to intron 2 of the gene. Sequencing of PVRL2 failed to identify variants with an obvious functional role. In conclusion, the analysis of a very large data set suggests that genetic polymorphisms in PVRL2 may influence MS severity and supports the possibility that viral factors may contribute to the clinical course of MS, consistent with previous reports.
Czarnecki, Olaf; Bryan, Anthony C.; Jawdy, Sara S.; ...
2016-02-17
Genetic engineering of plants that results in successful establishment of new biochemical or regulatory pathways requires stable introduction of one or more genes into the plant genome. It might also be necessary to down-regulate or turn off expression of endogenous genes in order to reduce activity of competing pathways. An established way to knockdown gene expression in plants is expressing a hairpin-RNAi construct, eventually leading to degradation of a specifically targeted mRNA. Knockdown of multiple genes that do not share homologous sequences is still challenging and involves either sophisticated cloning strategies to create vectors with different serial expression constructs or multiple transformation events that is often restricted by a lack of available transformation markers. Synthetic RNAi fragments were assembled in yeast carrying homologous sequences to six or seven non-family genes and introduced into pAGRIKOLA. Transformation of Arabidopsis thaliana and subsequent expression analysis of targeted genes proved efficient knockdown of all target genes. In conclusion, we present a simple and cost-effective method to create constructs to simultaneously knockdown multiple non-family genes or genes that do not share sequence homology. The presented method can be applied in plant and animal synthetic biology as well as traditional plant and animal genetic engineering.
A Novel Center Star Multiple Sequence Alignment Algorithm Based on Affine Gap Penalty and K-Band
NASA Astrophysics Data System (ADS)
Zou, Quan; Shan, Xiao; Jiang, Yi
Multiple sequence alignment is one of the most important topics in computational biology, but existing methods cannot yet handle large data sets. With the development of copy-number variant (CNV) and single nucleotide polymorphism (SNP) research, many researchers want to align large numbers of similar sequences to detect CNVs and SNPs. In this paper, we propose a novel multiple sequence alignment algorithm based on affine gap penalty and k-band. It aligns more quickly and accurately, which will be helpful for mining CNVs and SNPs. Experiments demonstrate the performance of our algorithm.
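The two ideas named in the title, a k-band restriction and a center star, can be sketched in a few lines. The code below is a simplified illustration and not the authors' algorithm: it uses unit edit costs in place of an affine gap penalty, and the function names `banded_edit_distance` and `pick_center` are hypothetical. The band limits the dynamic-programming table to cells with |i - j| <= k, which is what makes alignment of highly similar sequences fast.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b, filling only cells with |i - j| <= k.

    Returns None when the lengths differ by more than k, because no
    alignment path can then stay inside the band. Unit costs are used
    here as a simplification (the paper uses an affine gap penalty).
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


def pick_center(seqs, k):
    """Index of the sequence with the smallest summed banded distance."""
    best, best_total = None, float("inf")
    for i, s in enumerate(seqs):
        total = 0
        for j, t in enumerate(seqs):
            if i == j:
                continue
            d = banded_edit_distance(s, t, k)
            if d is None:  # pair cannot be aligned within the band
                total = float("inf")
                break
            total += d
        if total < best_total:
            best, best_total = i, total
    return best


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
    print(pick_center(reads, k=2))
```

In a center star scheme, every other sequence would then be aligned pairwise to the chosen center and the pairwise alignments merged. The banded value equals the true edit distance only when an optimal path fits inside the band, so k trades accuracy for speed.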
Yadav, Saurabh; Kumari, Pragati; Kushwaha, Hemant Ritturaj
2013-01-01
Glutaredoxins are small, ubiquitous, glutathione-dependent enzymatic antioxidants classified under the thioredoxin-fold superfamily. Glutaredoxins are classified into two types: dithiol and monothiol. Monothiol glutaredoxins, which carry the signature "CGFS" as a redox-active motif, are known for their role in oxidative stress inside the cell. In the present analysis, the 138 amino acid long monothiol glutaredoxin, AgGRX1 from Ashbya gossypii, was identified and used for the analysis. The multiple sequence alignment of the AgGRX1 protein sequence revealed the characteristic motif of typical monothiol glutaredoxin as observed in various other organisms. The proposed structure of the AgGRX1 protein was used to analyze signature folds related to the thioredoxin superfamily. Further, the study highlighted the structural features pertaining to the complex mechanism of glutathione docking and interacting residues.
Computational and experimental analysis of DNA shuffling
Maheshri, Narendra; Schaffer, David V.
2003-01-01
We describe a computational model of DNA shuffling based on the thermodynamics and kinetics of this process. The model independently tracks a representative ensemble of DNA molecules and records their states at every stage of a shuffling reaction. These data can subsequently be analyzed to yield information on any relevant metric, including reassembly efficiency, crossover number, type and distribution, and DNA sequence length distributions. The predictive ability of the model was validated by comparison to three independent sets of experimental data, and analysis of the simulation results led to several unique insights into the DNA shuffling process. We examine a tradeoff between crossover frequency and reassembly efficiency and illustrate the effects of experimental parameters on this relationship. Furthermore, we discuss conditions that promote the formation of useless “junk” DNA sequences or multimeric sequences containing multiple copies of the reassembled product. This model will therefore aid in the design of optimal shuffling reaction conditions. PMID:12626764
Estimating differential expression from multiple indicators
Ilmjärv, Sten; Hundahl, Christian Ansgar; Reimets, Riin; Niitsoo, Margus; Kolde, Raivo; Vilo, Jaak; Vasar, Eero; Luuk, Hendrik
2014-01-01
Regardless of the advent of high-throughput sequencing, microarrays remain central in current biomedical research. Conventional microarray analysis pipelines apply data reduction before the estimation of differential expression, which is likely to render the estimates susceptible to noise from signal summarization and reduce statistical power. We present a probe-level framework, which capitalizes on the high number of concurrent measurements to provide more robust differential expression estimates. The framework naturally extends to various experimental designs and target categories (e.g. transcripts, genes, genomic regions) as well as small sample sizes. Benchmarking in relation to popular microarray and RNA-sequencing data-analysis pipelines indicated high and stable performance on the Microarray Quality Control dataset and in a cell-culture model of hypoxia. Experimental data exhibiting long-range epigenetic silencing of gene expression were used to demonstrate the efficacy of detecting differential expression of genomic regions, a level of analysis not embraced by conventional workflows. Finally, we designed and conducted an experiment to identify hypothermia-responsive genes in terms of monotonic time-response. As a novel insight, hypothermia-dependent up-regulation of multiple genes of two major antioxidant pathways was identified and verified by quantitative real-time PCR. PMID:24586062
Enhancing knowledge discovery from cancer genomics data with Galaxy
Albuquerque, Marco A.; Grande, Bruno M.; Ritch, Elie J.; Pararajalingam, Prasath; Jessa, Selin; Krzywinski, Martin; Grewal, Jasleen K.; Shah, Sohrab P.; Boutros, Paul C.
2017-01-01
The field of cancer genomics has demonstrated the power of massively parallel sequencing techniques to inform on the genes and specific alterations that drive tumor onset and progression. Although large comprehensive sequence data sets continue to be made increasingly available, data analysis remains an ongoing challenge, particularly for laboratories lacking dedicated resources and bioinformatics expertise. To address this, we have produced a collection of Galaxy tools that represent many popular algorithms for detecting somatic genetic alterations from cancer genome and exome data. We developed new methods for parallelization of these tools within Galaxy to accelerate runtime and have demonstrated their usability and summarized their runtimes on multiple cloud service providers. Some tools represent extensions or refinement of existing toolkits to yield visualizations suited to cohort-wide cancer genomic analysis. For example, we present Oncocircos and Oncoprintplus, which generate data-rich summaries of exome-derived somatic mutation. Workflows that integrate these to achieve data integration and visualizations are demonstrated on a cohort of 96 diffuse large B-cell lymphomas and enabled the discovery of multiple candidate lymphoma-related genes. Our toolkit is available from our GitHub repository as Galaxy tool and dependency definitions and has been deployed using virtualization on multiple platforms including Docker. PMID:28327945
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes
Li, Li; Stoeckert, Christian J.; Roos, David S.
2003-01-01
The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome. PMID:12952885
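The Markov Cluster step mentioned above alternates two operations on a column-stochastic matrix: expansion (matrix squaring, which spreads flow along paths) and inflation (element-wise powering followed by renormalization, which strengthens strong connections). The sketch below is a toy, pure-Python illustration of that general procedure under stated assumptions; it is not the OrthoMCL implementation, the function names are hypothetical, and the input here is an unweighted adjacency matrix rather than similarity scores between proteins.

```python
def normalize_columns(m):
    """Scale each column in place so that it sums to 1."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m


def markov_cluster(adj, inflation=2.0, iterations=50):
    """Toy Markov clustering of a small undirected graph.

    `adj` is a square adjacency matrix (list of lists). Self-loops are
    added before normalization, a common convention assumed here.
    """
    n = len(adj)
    m = [
        [float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix.
        m = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise entries to a power, then renormalize columns.
        m = [[x**inflation for x in row] for row in m]
        normalize_columns(m)
    # Each row that retains mass defines a cluster of the columns it reaches.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two disconnected triangles: nodes 0-2 and nodes 3-5.
    graph = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(markov_cluster(graph))
```

The `inflation` parameter controls cluster granularity in this kind of procedure: larger values tend to produce smaller, tighter clusters.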
Enhancing knowledge discovery from cancer genomics data with Galaxy.
Albuquerque, Marco A; Grande, Bruno M; Ritch, Elie J; Pararajalingam, Prasath; Jessa, Selin; Krzywinski, Martin; Grewal, Jasleen K; Shah, Sohrab P; Boutros, Paul C; Morin, Ryan D
2017-05-01
The field of cancer genomics has demonstrated the power of massively parallel sequencing techniques to inform on the genes and specific alterations that drive tumor onset and progression. Although large comprehensive sequence data sets continue to be made increasingly available, data analysis remains an ongoing challenge, particularly for laboratories lacking dedicated resources and bioinformatics expertise. To address this, we have produced a collection of Galaxy tools that represent many popular algorithms for detecting somatic genetic alterations from cancer genome and exome data. We developed new methods for parallelization of these tools within Galaxy to accelerate runtime and have demonstrated their usability and summarized their runtimes on multiple cloud service providers. Some tools represent extensions or refinement of existing toolkits to yield visualizations suited to cohort-wide cancer genomic analysis. For example, we present Oncocircos and Oncoprintplus, which generate data-rich summaries of exome-derived somatic mutation. Workflows that integrate these to achieve data integration and visualizations are demonstrated on a cohort of 96 diffuse large B-cell lymphomas and enabled the discovery of multiple candidate lymphoma-related genes. Our toolkit is available from our GitHub repository as Galaxy tool and dependency definitions and has been deployed using virtualization on multiple platforms including Docker. © The Author 2017. Published by Oxford University Press.
Mavromatis, Konstantinos; Land, Miriam L; Brettin, Thomas S; Quest, Daniel J; Copeland, Alex; Clum, Alicia; Goodwin, Lynne; Woyke, Tanja; Lapidus, Alla; Klenk, Hans Peter; Cottingham, Robert W; Kyrpides, Nikos C
2012-01-01
The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases, assembly results, although of high quality on average, need to be viewed critically, and sources of error in them should be considered prior to analysis. These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).
Marques, M Carmen; Alonso-Cantabrana, Hugo; Forment, Javier; Arribas, Raquel; Alamar, Santiago; Conejero, Vicente; Perez-Amador, Miguel A
2009-01-01
Background Interpretation of ever-increasing raw sequence information generated by modern genome sequencing technologies faces multiple challenges, such as gene function analysis and genome annotation. Indeed, nearly 40% of genes in plants encode proteins of unknown function. Functional characterization of these genes is one of the main challenges in modern biology. In this regard, the availability of full-length cDNA clones may fill in the gap created between sequence information and biological knowledge. Full-length cDNA clones facilitate functional analysis of the corresponding genes enabling manipulation of their expression in heterologous systems and the generation of a variety of tagged versions of the native protein. In addition, the development of full-length cDNA sequences has the power to improve the quality of genome annotation. Results We developed an integrated method to generate a new normalized EST collection enriched in full-length and rare transcripts of different citrus species from multiple tissues and developmental stages. We constructed a total of 15 cDNA libraries, from which we isolated 10,898 high-quality ESTs representing 6142 different genes. Percentages of redundancy and proportion of full-length clones range from 8 to 33, and 67 to 85, respectively, indicating good efficiency of the approach employed. The new EST collection adds 2113 new citrus ESTs, representing 1831 unigenes, to the collection of citrus genes available in the public databases. To facilitate functional analysis, cDNAs were introduced in a Gateway-based cloning vector for high-throughput functional analysis of genes in planta. Herein, we describe the technical methods used in the library construction, sequence analysis of clones and the overexpression of CitrSEP, a citrus homolog to the Arabidopsis SEP3 gene, in Arabidopsis as an example of a practical application of the engineered Gateway vector for functional analysis. 
Conclusion The new EST collection denotes an important step towards the identification of all genes in the citrus genome. Furthermore, public availability of the cDNA clones generated in this study, and not only their sequence, enables testing of the biological function of the genes represented in the collection. Expression of the citrus SEP3 homologue, CitrSEP, in Arabidopsis results in early flowering, along with other phenotypes resembling the over-expression of the Arabidopsis SEPALLATA genes. Our findings suggest that the members of the SEP gene family play similar roles in these quite distant plant species. PMID:19747386
The repetitive landscape of the chicken genome
Wicker, Thomas; Robertson, Jon S.; Schulze, Stefan R.; Feltus, F. Alex; Magrini, Vincent; Morrison, Jason A.; Mardis, Elaine R.; Wilson, Richard K.; Peterson, Daniel G.; Paterson, Andrew H.; Ivarie, Robert
2005-01-01
Cot-based cloning and sequencing (CBCS) is a powerful tool for isolating and characterizing the various repetitive components of any genome, combining the established principles of DNA reassociation kinetics with high-throughput sequencing. CBCS was used to generate sequence libraries representing the high, middle, and low-copy fractions of the chicken genome. Sequencing high-copy DNA of chicken to about 2.7× coverage of its estimated sequence complexity led to the initial identification of several new repeat families, which were then used for a survey of the newly released first draft of the complete chicken genome. The analysis provided insight into the diversity and biology of known repeat structures such as CR1 and CNM, for which only limited sequence data had previously been available. Cot sequence data also resulted in the identification of four novel repeats (Birddawg, Hitchcock, Kronos, and Soprano), two new subfamilies of CR1 repeats, and many elements absent from the chicken genome assembly. Multiple autonomous elements were found for a novel Mariner-like transposon, Galluhop, in addition to nonautonomous deletion derivatives. Phylogenetic analysis of the high-copy repeats CR1, Galluhop, and Birddawg provided insight into two distinct genome dispersion strategies. This study also exemplifies the power of the CBCS method to create representative databases for the repetitive fractions of genomes for which only limited sequence data is available. PMID:15256510
Zuiter, Afnan Saeid; Sawwan, Jammal; Al Abdallat, Ayed
2012-08-10
Background Hawthorn is the common name of all plant species in the genus Crataegus, which belongs to the Rosaceae family. Crataegus are considered useful medicinal plants because of their high content of proanthocyanidins (PAs) and other related compounds. To improve PAs production in Crataegus tissues, the sequences of genes encoding PAs biosynthetic enzymes are required. Findings Different bioinformatics tools, including BLAST, multiple sequence alignment and alignment PCR analysis were used to design primers suitable for the amplification of DNA fragments from 10 candidate genes encoding enzymes involved in PAs biosynthesis in C. aronia. DNA sequencing results proved the utility of the designed primers. The primers were used successfully to amplify DNA fragments of different PAs biosynthesis genes in different Rosaceae plants. Conclusion To the best of our knowledge, this is the first use of the alignment PCR approach to isolate DNA sequences encoding PAs biosynthetic enzymes in Rosaceae plants. PMID:22883984
MassSieve: Panning MS/MS peptide data for proteins
Slotta, Douglas J.; McFarland, Melinda A.; Markey, Sanford P.
2010-01-01
We present MassSieve, a Java-based platform for visualization and parsimony analysis of single and comparative LC-MS/MS database search engine results. The success of mass spectrometric peptide sequence assignment algorithms has led to the need for a tool to merge and evaluate the increasing data set sizes that result from LC-MS/MS-based shotgun proteomic experiments. MassSieve supports reports from multiple search engines with differing search characteristics, which can increase peptide sequence coverage and/or identify conflicting or ambiguous spectral assignments. PMID:20564260
3D polymer gel dosimetry using a 3D (DESS) and a 2D MultiEcho SE (MESE) sequence
NASA Astrophysics Data System (ADS)
Maris, Thomas G.; Pappas, Evangelos; Karolemeas, Kostantinos; Papadakis, Antonios E.; Zacharopoulou, Fotini; Papanikolaou, Nickolas; Gourtsoyiannis, Nicholas
2006-12-01
The utilization of 3D techniques in Magnetic Resonance Imaging data acquisition and post-processing analysis is a prerequisite especially when modern radiotherapy techniques (conformal RT, IMRT, Stereotactic RT) are to be used. The aim of this work is to compare a 3D Double Echo Steady State (DESS) and a 2D Multiple Echo Spin Echo (MESE) sequence in 3D MRI radiation dosimetry using two different MRI scanners and utilising N-VInylPyrrolidone (VIPAR) based polymer gels.
2009-01-01
Background Sequence identification of ESTs from non-model species offers distinct challenges particularly when these species have duplicated genomes and when they are phylogenetically distant from sequenced model organisms. For the common carp, an environmental model of aquacultural interest, large numbers of ESTs remained unidentified using BLAST sequence alignment. We have used the expression profiles from large-scale microarray experiments to suggest gene identities. Results Expression profiles from ~700 cDNA microarrays describing responses of 7 major tissues to multiple environmental stressors were used to define a co-expression landscape. This was based on the Pearson correlation coefficient relating each gene with all other genes, from which a network description provided clusters of highly correlated genes as 'mountains'. We show that these contain genes with known identities and genes with unknown identities, and that the correlation constitutes evidence of identity in the latter. This procedure has suggested identities to 522 of 2701 unknown carp EST sequences. We also discriminate several common carp genes and gene isoforms that were not discriminated by BLAST sequence alignment alone. Precision in identification was substantially improved by use of data from multiple tissues and treatments. Conclusion The detailed analysis of co-expression landscapes is a sensitive technique for suggesting an identity for the large number of BLAST-unidentified cDNAs generated in EST projects. It is capable of detecting even subtle changes in expression profiles, and thereby of distinguishing genes with a common BLAST identity into different identities. It benefits from the use of multiple treatments or contrasts, and from the large-scale microarray data. PMID:19939286
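The co-expression idea in this abstract can be illustrated with a minimal sketch. It is a simplification, not the authors' network-clustering procedure: it only computes the Pearson correlation between an unidentified EST's expression profile and each known gene's profile, and proposes the best-correlated known gene when the correlation passes a threshold. The gene names, the profiles and the `min_r` cutoff are all made-up assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sx * sy)


def suggest_identity(unknown_profile, known_profiles, min_r=0.9):
    """Return (gene_name, r) for the best-correlated known gene, or None."""
    best = None
    for name, profile in known_profiles.items():
        r = pearson(unknown_profile, profile)
        if r >= min_r and (best is None or r > best[1]):
            best = (name, r)
    return best


# Toy expression profiles across five hypothetical arrays (made-up numbers).
known = {
    "hsp70": [1.0, 2.0, 3.0, 4.0, 5.0],
    "actb": [5.0, 3.0, 4.0, 1.0, 2.0],
}
unknown_est = [2.1, 4.0, 6.2, 8.1, 9.9]
print(suggest_identity(unknown_est, known))
```

In this toy case the unknown profile tracks the hypothetical `hsp70` profile almost linearly, so that gene is proposed; a real analysis would use many arrays and treat the correlation as evidence to be weighed, not as proof of identity.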
Kovács, Endre R; Benko, Mária
2009-03-01
Partial genome characterisation of a novel adenovirus, found recently in organ samples of multiple species of dead birds of prey, was carried out by sequence analysis of PCR-amplified DNA fragments. The virus, named raptor adenovirus 1 (RAdV-1), was originally detected by a nested PCR method with consensus primers targeting the adenoviral DNA polymerase gene. Phylogenetic analysis with the deduced amino acid sequence of the small PCR product implied that a new siadenovirus type was present in the samples. Since virus isolation attempts remained unsuccessful, further characterisation of this putative novel siadenovirus was carried out with the use of PCR on the infected organ samples. The DNA sequence of the central genome part of RAdV-1, encompassing nine full (pTP, 52K, pIIIa, III, pVII, pX, pVI, hexon, protease) and two partial (DNA polymerase and DBP) genes and exceeding 12 kb pairs in size, was determined. Phylogenetic tree reconstructions, based on several genes, unambiguously confirmed the preliminary classification of RAdV-1 as a new species within the genus Siadenovirus. Further study of RAdV-1 is of interest since it represents a rare adenovirus genus of yet undetermined host origin.
Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment
2011-01-01
Background Many Bioinformatics studies begin with a multiple sequence alignment as the foundation for their research. This is because multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence structure relationships. Results In this paper, we have proposed a Vertical Decomposition with Genetic Algorithm (VDGA) for Multiple Sequence Alignment (MSA). In VDGA, we divide the sequences vertically into two or more subsequences, and then solve them individually using a guide tree approach. Finally, we combine all the subsequences to generate a new multiple sequence alignment. This technique is applied on the solutions of the initial generation and of each child generation within VDGA. We have used two mechanisms to generate an initial population in this research: the first mechanism is to generate guide trees with randomly selected sequences and the second is shuffling the sequences inside such trees. Two different genetic operators have been implemented with VDGA. To test the performance of our algorithm, we have compared it with existing well-known methods, namely PRRP, CLUSTALX, DIALIGN, HMMT, SB_PIMA, ML_PIMA, MULTALIGN, and PILEUP8, and also other methods, based on Genetic Algorithms (GA), such as SAGA, MSA-GA and RBT-GA, by solving a number of benchmark datasets from BAliBase 2.0. Conclusions The experimental results showed that the VDGA with three vertical divisions was the most successful variant for most of the test cases in comparison to other divisions considered with VDGA. The experimental results also confirmed that VDGA outperformed the other methods considered in this research. PMID:21867510
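The vertical split-and-recombine step described above can be sketched as follows. This is only an illustration under stated assumptions: the chunk boundaries (near-equal contiguous pieces) and the stand-in aligner `pad_align`, which merely right-pads chunks with gaps, are hypothetical placeholders. VDGA itself solves each block with a guide tree approach inside a genetic algorithm, which is not reproduced here.

```python
def split_vertically(seq, parts):
    """Cut one sequence into `parts` contiguous chunks of near-equal length."""
    size, extra = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def pad_align(block):
    """Stand-in aligner (assumption): right-pad every chunk with gaps."""
    width = max(len(s) for s in block)
    return [s.ljust(width, "-") for s in block]


def vertical_decompose_align(seqs, parts=3, align=pad_align):
    """Split all sequences into column blocks, align each block, rejoin."""
    chunked = [split_vertically(s, parts) for s in seqs]
    # Transpose so that blocks[j] holds the j-th chunk of every sequence.
    blocks = [list(col) for col in zip(*chunked)]
    aligned_blocks = [align(b) for b in blocks]
    # Concatenate the aligned chunks of each sequence back into full rows.
    return ["".join(row) for row in zip(*aligned_blocks)]


seqs = ["ACGTACGTAC", "ACGACGTA", "ACGTACG"]
msa = vertical_decompose_align(seqs, parts=3)
for row in msa:
    print(row)
```

Any function that takes a list of strings and returns equally long gapped strings can be passed as `align`, so a real block aligner could replace the padding placeholder without changing the decomposition logic.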
Novel genomic findings in multiple myeloma identified through routine diagnostic sequencing.
Ryland, Georgina L; Jones, Kate; Chin, Melody; Markham, John; Aydogan, Elle; Kankanige, Yamuna; Caruso, Marisa; Guinto, Jerick; Dickinson, Michael; Prince, H Miles; Yong, Kwee; Blombery, Piers
2018-05-14
Multiple myeloma is a genomically complex haematological malignancy with many genomic alterations recognised as important in diagnosis, prognosis and therapeutic decision making. Here, we provide a summary of genomic findings identified through routine diagnostic next-generation sequencing at our centre. A cohort of 86 patients with multiple myeloma underwent diagnostic sequencing using a custom hybridisation-based panel targeting 104 genes. Sequence variants, genome-wide copy number changes and structural rearrangements were detected using a bioinformatics pipeline developed in-house. At least one mutation was found in 69 (80%) patients. Frequently mutated genes included TP53 (36%), KRAS (22.1%), NRAS (15.1%), FAM46C/DIS3 (8.1%) and TET2/FGFR3 (5.8%), including multiple mutations not previously described in myeloma. Importantly, we observed TP53 mutations in the absence of a 17p deletion in 8% of the cohort, highlighting the need for sequencing-based assessment in addition to cytogenetics to identify these high-risk patients. Multiple novel copy number changes and immunoglobulin heavy chain translocations are also discussed. Our results demonstrate that many clinically relevant genomic findings remain in multiple myeloma which have not yet been identified through large-scale sequencing efforts, and provide important mechanistic insights into plasma cell pathobiology.
Li, Ying; Shi, Xiaohu; Liang, Yanchun; Xie, Juan; Zhang, Yu; Ma, Qin
2017-01-21
RNAs have been found to carry diverse functionalities in nature. Inferring the similarity between two given RNAs is a fundamental step to understand and interpret their functional relationship. The majority of functional RNAs show conserved secondary structures, rather than sequence conservation. Algorithms that rely on sequence-based features alone are therefore usually limited in their prediction performance. Hence, integrating RNA structure features is critical for RNA analysis. Existing algorithms mainly fall into two categories: alignment-based and alignment-free. The alignment-free algorithms of RNA comparison usually have lower time complexity than alignment-based algorithms. An alignment-free RNA comparison algorithm was proposed, in which novel numerical representations, RNA-TVcurve (triple vector curve representation), of the RNA sequence and corresponding secondary structure features are provided. Then a multi-scale similarity score of two given RNAs was designed based on wavelet decomposition of their numerical representation. In support of RNA mutation and phylogenetic analysis, a web server (RNA-TVcurve) was designed based on this alignment-free RNA comparison algorithm. It provides three functional modules: 1) visualization of numerical representation of RNA secondary structure; 2) detection of single-point mutation based on secondary structure; and 3) comparison of pairwise and multiple RNA secondary structures. The web server requires RNA primary sequences as input, while corresponding secondary structures are optional. When only primary sequences are given, the web server can compute the secondary structures by free energy minimization using the RNAfold tool from the Vienna RNA package. RNA-TVcurve is the first integrated web server, based on an alignment-free method, to deliver a suite of RNA analysis functions, including visualization, mutation analysis and multiple RNAs structure comparison.
The comparison results with two popular RNA comparison tools, RNApdist and RNAdistance, showed that RNA-TVcurve can efficiently capture subtle relationships among RNAs for mutation detection and non-coding RNA classification. All the relevant results are shown in an intuitive graphical manner, and can be freely downloaded from this server. RNA-TVcurve, along with test examples and detailed documents, is available at: http://ml.jlu.edu.cn/tvcurve/ .
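To make the notion of a multi-scale score based on wavelet decomposition concrete, here is a minimal sketch using a plain Haar decomposition. It is not the RNA-TVcurve algorithm: the abstract does not specify the triple vector curve, so the nucleotide-to-number `ENCODING` below is a hypothetical placeholder, secondary structure is ignored, and the score is simply the sum over scales of Euclidean distances between Haar coefficients. Inputs are assumed to have equal length that is a power of two.

```python
from math import sqrt

# Hypothetical numeric encoding of nucleotides; NOT the TV-curve representation.
ENCODING = {"A": 1.0, "C": 2.0, "G": 3.0, "U": 4.0}


def haar_levels(signal):
    """Full Haar decomposition: detail coefficients per scale, then the mean."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    levels, current = [], list(signal)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        levels.append([(a - b) / 2 for a, b in pairs])  # detail at this scale
        current = [(a + b) / 2 for a, b in pairs]       # coarser approximation
    levels.append(current)  # single overall average
    return levels


def multiscale_distance(seq_a, seq_b):
    """Sum over scales of the Euclidean distance between Haar coefficients."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    la = haar_levels([ENCODING[c] for c in seq_a])
    lb = haar_levels([ENCODING[c] for c in seq_b])
    return sum(
        sqrt(sum((x - y) ** 2 for x, y in zip(da, db)))
        for da, db in zip(la, lb)
    )


print(multiscale_distance("GGGAAACC", "GGGAAACC"))  # identical inputs give 0.0
print(multiscale_distance("GGGAAACC", "GGGAUACC"))  # one substitution, > 0
```

Because a single substitution changes coefficients at every scale that covers its position, the distance responds to point changes while still summarising coarse differences, which is the general motivation for comparing numerical representations across scales.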
Li, Kai; Chen, Wenyuan; Zhang, Weiping
2011-01-01
Beam’s multiple-contact mode, characterized by multiple and discrete contact regions, non-uniform stoppers’ heights, irregular contact sequence, seesaw-like effect, indirect interaction between different stoppers, and complex coupling relationship between loads and deformation is studied. A novel analysis method and a novel high speed calculation model are developed for multiple-contact mode under mechanical load and electrostatic load, without limitations on stopper height and distribution, providing the beam has stepped or curved shape. Accurate values of deflection, contact load, contact region and so on are obtained directly, with a subsequent validation by CoventorWare. A new concept design of high-g threshold microaccelerometer based on multiple-contact mode is presented, featuring multiple acceleration thresholds of one sensitive component and consequently small sensor size. PMID:22163897
Chitty, Lyn S; Mason, Sarah; Barrett, Angela N; McKay, Fiona; Lench, Nicholas; Daley, Rebecca; Jenkins, Lucy A
2015-01-01
Abstract Objective Accurate prenatal diagnosis of genetic conditions can be challenging and usually requires invasive testing. Here, we demonstrate the potential of next-generation sequencing (NGS) for the analysis of cell-free DNA in maternal blood to transform prenatal diagnosis of monogenic disorders. Methods Analysis of cell-free DNA using a PCR and restriction enzyme digest (PCR–RED) was compared with a novel NGS assay in pregnancies at risk of achondroplasia and thanatophoric dysplasia. Results PCR–RED was performed in 72 cases and was correct in 88.6%, inconclusive in 7% with one false negative. NGS was performed in 47 cases and was accurate in 96.2% with no inconclusives. Both approaches were used in 27 cases, with NGS giving the correct result in the two cases inconclusive with PCR–RED. Conclusion NGS provides an accurate, flexible approach to non-invasive prenatal diagnosis of de novo and paternally inherited mutations. It is more sensitive than PCR–RED and is ideal when screening a gene with multiple potential pathogenic mutations. These findings highlight the value of NGS in the development of non-invasive prenatal diagnosis for other monogenic disorders. © 2015 The Authors. Prenatal Diagnosis published by John Wiley & Sons, Ltd. What's already known about this topic? Non-invasive prenatal diagnosis (NIPD) using PCR-based methods has been reported for the detection or exclusion of individual paternally inherited or de novo alleles in maternal plasma. What does this study add? NIPD using next generation sequencing provides an accurate, more sensitive approach which can be used to detect multiple mutations in a single assay and so is ideal when screening a gene with multiple potential pathogenic mutations. Next generation sequencing thus provides a flexible approach to non-invasive prenatal diagnosis ideal for use in a busy service laboratory. PMID:25728633
PSAT: A web tool to compare genomic neighborhoods of multiple prokaryotic genomes
Fong, Christine; Rohmer, Laurence; Radey, Matthew; Wasnick, Michael; Brittnacher, Mitchell J
2008-01-01
Background The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes. Results PSAT utilizes a database that is preloaded with gene annotation, BLAST hit results, and gene-clustering scores designed to help identify regions of conserved gene order. Researchers use the PSAT web interface to find a gene of interest in a reference genome and efficiently retrieve the sequence homologs found in other bacterial genomes. The tool generates a graphic of the genomic neighborhood surrounding the selected gene and the corresponding regions for its homologs in each comparison genome. Homologs in each region are color coded to assist users with analyzing gene order among various genomes. In contrast to common comparative analysis methods that filter sequence homolog data based on alignment score cutoffs, PSAT leverages gene context information for homologs, including those with weak alignment scores, enabling a more sensitive analysis. Features for constraining or ordering results are designed to help researchers browse results from large numbers of comparison genomes in an organized manner. PSAT has been demonstrated to be useful for helping to identify gene orthologs and potential functional gene clusters, and detecting genome modifications that may result in loss of function. Conclusion PSAT allows researchers to investigate the order of genes within local genomic neighborhoods of multiple genomes. 
A PSAT web server for public use is available for performing analyses on a growing set of reference genomes through any web browser with no client side software setup or installation required. Source code is freely available to researchers interested in setting up a local version of PSAT for analysis of genomes not available through the public server. Access to the public web server and instructions for obtaining source code can be found at . PMID:18366802
NASA Technical Reports Server (NTRS)
Khanampompan, Teerapat; Gladden, Roy; Fisher, Forest; DelGuercio, Chris
2008-01-01
The Sequence History Update Tool performs Web-based sequence statistics archiving for Mars Reconnaissance Orbiter (MRO). Using a single UNIX command, the software takes advantage of sequencing conventions to automatically extract the needed statistics from multiple files. This information is then used to populate a PHP database, which is then seamlessly formatted into a dynamic Web page. This tool replaces a previous tedious and error-prone process of manually editing HTML code to construct a Web-based table. Because the tool manages all of the statistics gathering and file delivery to and from multiple data sources spread across multiple servers, there are also considerable savings in time and effort. With the Sequence History Update Tool, what previously took minutes is now done in less than 30 seconds, and the result is a more accurate archival record of the sequence commanding for MRO.
Miyake, Sou; Ngugi, David K.; Stingl, Ulrich
2016-01-01
Epulopiscium is a group of giant bacteria found in high abundance in intestinal tracts of herbivorous surgeonfish. Despite their peculiarly large cell size (can be up to 600 μm), extreme polyploidy (some with over 100,000 genome copies per cell) and viviparity (whereby mother cells produce live offspring), details about their diversity, distribution or their role in the host gut are lacking. Previous studies have highlighted the existence of morphologically distinct Epulopiscium cell types (defined as morphotypes A to J) in some surgeonfish genera, but the corresponding genetic diversity and distribution among other surgeonfishes remain mostly unknown. Therefore, we investigated the phylogenetic diversity of Epulopiscium, distribution and co-occurrence in multiple hosts. Here, we identified eleven new phylogenetic clades, six of which were also morphologically characterized. Three of these novel clades were phylogenetically and morphologically similar to cigar-shaped type A1 cells, found in a wide range of surgeonfishes including Acanthurus nigrofuscus, while three were similar to smaller, rod-shaped type E that has not been phylogenetically classified thus far. Our results also confirmed that biogeography appears to have relatively little influence on Epulopiscium diversity, as clades found in the Great Barrier Reef and Hawaii were also recovered from the Red Sea. Although multiple symbiont clades inhabited a given species of host surgeonfish and multiple host species possessed a given symbiont clade, statistical analysis of host and symbiont phylogenies indicated significant cophylogeny, which in turn suggests co-evolutionary relationships. A cluster analysis of Epulopiscium sequences from previously published amplicon sequencing dataset revealed a similar pattern, where specific clades were consistently found in high abundance amongst closely related surgeonfishes. 
Differences in abundance may indicate specialization of clades to certain gut environments reflected by inferred differences in the host diets. Overall, our analysis identified a large phylogenetic diversity of Epulopiscium (up to 10% sequence divergence of 16S rRNA genes), which lets us hypothesize that there are multiple species that are spread across guts of different host species. PMID:27014209
StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics
Ramirez-Gonzalez, Ricardo H.; Leggett, Richard M.; Waite, Darren; Thanki, Anil; Drou, Nizar; Caccamo, Mario; Davey, Robert
2014-01-01
Modern sequencing platforms generate enormous quantities of data in ever-decreasing amounts of time. Additionally, techniques such as multiplex sequencing allow one run to contain hundreds of different samples. With such data comes a significant challenge to understand its quality and to understand how the quality and yield are changing across instruments and over time. As well as the desire to understand historical data, sequencing centres often have a duty to provide clear summaries of individual run performance to collaborators or customers. We present StatsDB, an open-source software package for storage and analysis of next generation sequencing run metrics. The system has been designed for incorporation into a primary analysis pipeline, either at the programmatic level or via integration into existing user interfaces. Statistics are stored in an SQL database and APIs provide the ability to store and access the data while abstracting the underlying database design. This abstraction allows simpler, wider querying across multiple fields than is possible by the manual steps and calculation required to dissect individual reports, e.g. “provide metrics about nucleotide bias in libraries using adaptor barcode X, across all runs on sequencer A, within the last month”. The software is supplied with modules for storage of statistics from FastQC, a commonly used tool for analysis of sequence reads, but the open nature of the database schema means it can be easily adapted to other tools. Currently at The Genome Analysis Centre (TGAC), reports are accessed through our LIMS system or through a standalone GUI tool, but the API and supplied examples make it easy to develop custom reports and to interface with other packages. PMID:24627795
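The kind of cross-run query quoted in this abstract can be illustrated with a small sketch using Python's built-in sqlite3 module. Everything here is an assumption for illustration: the `run_metric` table, its columns, the `mean_metric` helper and the sample rows are hypothetical and do not reflect the actual StatsDB schema or API.

```python
import sqlite3

# In-memory database with a hypothetical single-table layout (not StatsDB's).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE run_metric (
           run_id TEXT, instrument TEXT, barcode TEXT,
           run_date TEXT, metric TEXT, value REAL)"""
)
rows = [
    ("run1", "sequencerA", "X", "2014-01-05", "gc_percent", 41.0),
    ("run2", "sequencerA", "X", "2014-01-20", "gc_percent", 43.0),
    ("run3", "sequencerA", "Y", "2014-01-21", "gc_percent", 55.0),
    ("run4", "sequencerB", "X", "2014-01-22", "gc_percent", 60.0),
    ("run5", "sequencerA", "X", "2013-11-02", "gc_percent", 30.0),
]
conn.executemany("INSERT INTO run_metric VALUES (?, ?, ?, ?, ?, ?)", rows)


def mean_metric(conn, metric, barcode, instrument, since):
    """Average one metric over runs matching barcode and instrument since a date."""
    cur = conn.execute(
        """SELECT AVG(value), COUNT(*) FROM run_metric
           WHERE metric = ? AND barcode = ? AND instrument = ?
             AND run_date >= ?""",
        (metric, barcode, instrument, since),
    )
    return cur.fetchone()


# Barcode X on sequencer A since 2014-01-01: only run1 and run2 qualify.
print(mean_metric(conn, "gc_percent", "X", "sequencerA", "2014-01-01"))
```

Storing dates as ISO 8601 strings lets the `>=` comparison work lexicographically, and wrapping the SQL in a function mirrors the abstract's point that an API can hide the database layout from the caller.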
Yin, Li; Yao, Jiqiang; Gardner, Brent P; Chang, Kaifen; Yu, Fahong; Goodenow, Maureen M
2012-01-01
Next generation sequencing (NGS) applied to human papilloma viruses (HPV) can provide sensitive methods to investigate the molecular epidemiology of multiple type HPV infection. Currently a genotyping system with a comprehensive collection of updated HPV reference sequences and a capacity to handle NGS data sets is lacking. HPV-QUEST was developed as an automated and rapid HPV genotyping system. The web-based HPV-QUEST subtyping algorithm was developed using HTML, PHP, Perl scripting language, and MySQL as the database backend. HPV-QUEST includes a database of annotated HPV reference sequences with updated nomenclature covering 5 genera, 14 species and 150 mucosal and cutaneous types to genotype blasted query sequences. HPV-QUEST processes up to 10 megabases of sequences within 1 to 2 minutes. Results are reported in HTML, text and Excel formats and display e-value, blast score, and local and coverage identities; provide genus, species, type, infection site and risk for the best matched reference HPV sequence; and produce results ready for additional analyses.
Comparative analysis and visualization of multiple collinear genomes
2012-01-01
Background Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research. Results We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations. Conclusions Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains. PMID:22536897
Bueno, Danilo; Palacios-Gimenez, Octavio Manuel; Martí, Dardo Andrea; Mariguela, Tatiane Casagrande; Cabral-de-Mello, Diogo Cavalcanti
2016-08-01
The 5S ribosomal DNA (rDNA) sequences are subject to dynamic evolution at chromosomal and molecular levels, evolving in a concerted and/or birth-and-death fashion. Among grasshoppers, the chromosomal location for this sequence was established for some species, but little molecular information was obtained to infer evolutionary patterns. Here, we integrated data from chromosomal and nucleotide sequence analysis for 5S rDNA in two Abracris species aiming to identify evolutionary dynamics. For both species, two arrays were identified, a larger sequence (named type-I) that consisted of the entire 5S rDNA gene plus NTS (non-transcribed spacer) and a smaller (named type-II) with truncated 5S rDNA gene plus short NTS that was considered a pseudogene. For type-I sequences, the gene corresponding region contained the internal control region and poly-T motif and the NTS presented partial transposable elements. Between the species, nucleotide differences for type-I were noticed, while type-II was identical, suggesting pseudogenization in a common ancestor. From a chromosomal point of view, type-II was placed in one bivalent, while type-I occurred in multiple copies in distinct chromosomes. In Abracris, the evolution of 5S rDNA was apparently influenced by the chromosomal distribution of clusters (single or multiple location), resulting in a mixed mechanism integrating concerted and birth-and-death evolution depending on the unit.
Shahinyan, Grigor; Margaryan, Armine; Panosyan, Hovik; Trchounian, Armen
2017-05-02
Among the huge diversity of thermophilic bacteria, mainly bacilli have been reported as active thermostable lipase producers. Geothermal springs serve as the main source for isolation of thermostable lipase-producing bacilli. Thermostable lipolytic enzymes, functioning in harsh conditions, have promising applications in processing of organic chemicals, detergent formulation, synthesis of biosurfactants, pharmaceutical processing, etc. In order to study the distribution of lipase-producing thermophilic bacilli and their specific lipase protein primary structures, three lipase producers from different genera were isolated from mesothermal (27.5-70 °C) springs distributed on the territory of Armenia and Nagorno Karabakh. Based on phenotypic characteristics and 16S rRNA gene sequencing the isolates were identified as Geobacillus sp., Bacillus licheniformis and Anoxibacillus flavithermus strains. The lipase genes of isolates were sequenced by using initially designed primer sets. Multiple alignments generated from primary structures of the lipase proteins and annotated lipase protein sequences, conserved regions analysis and amino acid composition have illustrated the similarity (98-99%) of the lipases with true lipases (family I) and GDSL esterase family (family II). A conserved sequence block that determines the thermostability has been identified in the multiple alignments of the lipase proteins. The results shed light on the distribution of lipase-producing bacilli in geothermal springs in Armenia and Nagorno Karabakh. Newly isolated bacilli strains could be a prospective source for thermostable lipases and their genes.
Multiple introductions and onward transmission of HIV-1 subtype B strains in Shanghai, China.
Li, Xiaoshan; Zhu, Kexin; Xue, Yile; Wei, Feiran; Gao, Rong; Duerr, Ralf; Fang, Kun; Li, Wei; Song, Yue; Du, Guoping; Yan, Wenjuan; Musa, Taha Hussein; Ge, You; Ji, Yu; Zhong, Ping; Wei, Pingmin
2017-08-01
To investigate the viral genetic evolution, spatial origins and patterns of transmission of HIV-1 subtype B in Shanghai, China. A total of 242 Shanghai subtype B and 1519 reference pol sequences were subjected to phylogenetic inference and genetic transmission network analyses. Phylogenetic analysis revealed that subtype B strains circulating in Shanghai were genetically diverse and closely associated with viral sequence lineages in Beijing (76 of 242 [31.4%]), Central China (Henan/Hebei/Hunan/Hubei) (43 of 242 [17.8%]), Chinese Taiwan (20 of 242 [8.3%]), Japan (6 of 242 [2.5%]), and Korea (7 of 242 [2.9%]), suggesting multiple introductions into Shanghai from mainland China and Taiwan, Japan, and Korea. Interestingly, a monophyletic Shanghai lineage (SH-L) (36 of 242 [14.9%]) of HIV-1 subtype B most likely originated from an Argentine strain, transferred through Liaoning infected individuals. In-depth analyses of 195 Shanghai subtype B sequences revealed that a total of 37.9% (n = 74) sequences contributed to 35 transmission networks, whereof 33.8% (n = 25) of the sequences associated with infected individuals from other provinces. Our new findings reflect the evolution complexity and transmission dynamics of HIV-1 subtype B in Shanghai, which would provide critical information for the design of effective prevention measures against HIV transmission. Copyright © 2017 The British Infection Association. Published by Elsevier Ltd. All rights reserved.
An Imaging And Graphics Workstation For Image Sequence Analysis
NASA Astrophysics Data System (ADS)
Mostafavi, Hassan
1990-01-01
This paper describes an application-specific engineering workstation designed and developed to analyze imagery sequences from a variety of sources. The system combines the software and hardware environment of the modern graphic-oriented workstations with the digital image acquisition, processing and display techniques. The objective is to achieve automation and high throughput for many data reduction tasks involving metric studies of image sequences. The applications of such an automated data reduction tool include analysis of the trajectory and attitude of aircraft, missile, stores and other flying objects in various flight regimes including launch and separation as well as regular flight maneuvers. The workstation can also be used in an on-line or off-line mode to study three-dimensional motion of aircraft models in simulated flight conditions such as wind tunnels. The system's key features are: 1) Acquisition and storage of image sequences by digitizing real-time video or frames from a film strip; 2) computer-controlled movie loop playback, slow motion and freeze frame display combined with digital image sharpening, noise reduction, contrast enhancement and interactive image magnification; 3) multiple leading edge tracking in addition to object centroids at up to 60 fields per second from both live input video or a stored image sequence; 4) automatic and manual field-of-view and spatial calibration; 5) image sequence data base generation and management, including the measurement data products; 6) off-line analysis software for trajectory plotting and statistical analysis; 7) model-based estimation and tracking of object attitude angles; and 8) interface to a variety of video players and film transport sub-systems.
Combining results of multiple search engines in proteomics.
Shteynberg, David; Nesvizhskii, Alexey I; Moritz, Robert L; Deutsch, Eric W
2013-09-01
A crucial component of the analysis of shotgun proteomics datasets is the search engine, an algorithm that attempts to identify the peptide sequence from the parent molecular ion that produced each fragment ion spectrum in the dataset. There are many different search engines, both commercial and open source, each employing a somewhat different technique for spectrum identification. The set of high-scoring peptide-spectrum matches for a defined set of input spectra differs markedly among the various search engine results; individual engines each provide unique correct identifications among a core set of correlative identifications. This has led to the approach of combining the results from multiple search engines to achieve improved analysis of each dataset. Here we review the techniques and available software for combining the results of multiple search engines and briefly compare the relative performance of these techniques.
Combining Results of Multiple Search Engines in Proteomics*
Shteynberg, David; Nesvizhskii, Alexey I.; Moritz, Robert L.; Deutsch, Eric W.
2013-01-01
A crucial component of the analysis of shotgun proteomics datasets is the search engine, an algorithm that attempts to identify the peptide sequence from the parent molecular ion that produced each fragment ion spectrum in the dataset. There are many different search engines, both commercial and open source, each employing a somewhat different technique for spectrum identification. The set of high-scoring peptide-spectrum matches for a defined set of input spectra differs markedly among the various search engine results; individual engines each provide unique correct identifications among a core set of correlative identifications. This has led to the approach of combining the results from multiple search engines to achieve improved analysis of each dataset. Here we review the techniques and available software for combining the results of multiple search engines and briefly compare the relative performance of these techniques. PMID:23720762
Importation and co-circulation of multiple serotypes of dengue virus in Sarawak, Malaysia.
Holmes, Edward C; Tio, Phaik-Hooi; Perera, David; Muhi, Jamail; Cardosa, Jane
2009-07-01
Although dengue is a common disease in South-East Asia, there is a marked absence of virological data from the Malaysian state of Sarawak located on the island of Borneo. From 1997 to 2002 we noted the co-circulation of DENV-2, DENV-3 and DENV-4 in Sarawak. To determine the origins of these Sarawak viruses we obtained the complete E gene sequences of 21 isolates. A phylogenetic analysis revealed multiple entries of DENV-2 and DENV-4 into Sarawak, such that multiple lineages co-circulate, yet with little exportation from Sarawak. Notably, all viral isolates were most closely related to those circulating in different localities in South-East Asia. In sum, our analysis reveals a frequent traffic of DENV in South-East Asia, with Sarawak representing a local sink population.
Enhanced sequencing coverage with digital droplet multiple displacement amplification
Sidore, Angus M.; Lan, Freeman; Lim, Shaun W.; Abate, Adam R.
2016-01-01
Sequencing small quantities of DNA is important for applications ranging from the assembly of uncultivable microbial genomes to the identification of cancer-associated mutations. To obtain sufficient quantities of DNA for sequencing, the small amount of starting material must be amplified significantly. However, existing methods often yield errors or non-uniform coverage, reducing sequencing data quality. Here, we describe digital droplet multiple displacement amplification, a method that enables massive amplification of low-input material while maintaining sequence accuracy and uniformity. The low-input material is compartmentalized as single molecules in millions of picoliter droplets. Because the molecules are isolated in compartments, they amplify to saturation without competing for resources; this yields uniform representation of all sequences in the final product and, in turn, enhances the quality of the sequence data. We demonstrate the ability to uniformly amplify the genomes of single Escherichia coli cells, comprising just 4.7 fg of starting DNA, and obtain sequencing coverage distributions that rival that of unamplified material. Digital droplet multiple displacement amplification provides a simple and effective method for amplifying minute amounts of DNA for accurate and uniform sequencing. PMID:26704978
ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.
Zeng, Victor; Extavour, Cassandra G
2012-01-01
The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck in genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information.
We anticipate that this database will be useful for members of multiple research communities, including developmental biology, physiology, evolutionary biology, ecology, comparative genomics and phylogenomics. Database URL: asgard.rc.fas.harvard.edu.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Geraghty, M.T.; Stetten, G.; Kearns, W.
1994-09-01
X-linked adrenoleukodystrophy (ALD) is a disorder of peroxisomal β-oxidation of very long chain fatty acids. It presents either as progressive dementia in childhood or as progressive paraparesis in later years. Adrenal insufficiency occurs in both phenotypes. The gene of the ALD protein has been mapped to Xq28 and has recently been cloned and characterized. The ALD protein has significant homology to the peroxisomal membrane protein, PMP70 and belongs to the ATP binding cassette superfamily of transporters. We screened a human genomic library with an ALDP cDNA and isolated 5 different but highly similar clones containing sequences corresponding to the 3' end of the ALDP gene. Comparison of the sequences over the region corresponding to exon 9 through the 3' end of the ALDP gene reveals approximately 96% nucleotide identity in both exonic and intronic regions. Splice sites and open reading frames are maintained. Using both FISH and human-rodent DNA mapping panels, we positively assign these ALDP-related sequences to chromosomes 2, 16 and 22, and provisionally to 1 and 20. Southern blot of primate DNA probed with a partial ALDP cDNA (exon 2-10) shows that expansion of ALDP-related sequences occurred in higher primates (chimp, gorilla and human). Although Northern blots show multiple ALDP-hybridizing transcripts in certain tissues, we have no evidence to date for expression of these ALDP-related sequences. In conclusion, our data show there has been an unusual and recent dispersal to multiple chromosomes of structural gene sequences related to the ALDP gene. The functional significance of these sequences remains to be determined but their existence complicates PCR and mutation analysis of the ALDP gene.
Multiple Origins of a Mitochondrial Mutation Conferring Deafness
Hutchin, T. P.; Cortopassi, G. A.
1997-01-01
A point mutation (1555G) in the smaller ribosomal subunit of the mitochondrial DNA (mtDNA) has been associated with maternally inherited traits of hypersensitivity to streptomycin and sensorineural deafness in a number of families from China, Japan, Israel, and Africa. To determine whether this distribution was the result of a single or multiple mutational events, we carried out genetic distance analysis and phylogenetic analysis of 10 independent mtDNA D-loop sequences from Africa and Asia. The mtDNA sequence diversity was high (2.21%). Phylogenetic analysis assigned 1555G-bearing haplotypes at very divergent points in the human mtDNA evolutionary tree, and the 1555G mutations occur in many cases on race-specific mtDNA haplotypes, both facts are inconsistent with a recent introgression of the mutation into these races. The simplest interpretation of the available data is that there have been multiple origins of the 1555G mutation. The genetic distance among mtDNAs bearing the pathogenic 1555G mutation is much larger than among mtDNAs bearing either evolutionarily neutral or weakly deleterious nucleotide substitutions (such as the 4336G mutation). These results are consistent with the view that pathogenic mtDNA haplotypes such as 1555G arise on disparate mtDNA lineages which because of negative natural selection leave relatively few related descendants. The co-existence of the same mutation with deafness in individuals with very different nuclear and mitochondrial genetic backgrounds confirms the pathogenicity of the 1555G mutation. PMID:9055086
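The diversity figure quoted above (2.21%) is the kind of number a mean pairwise distance produces. The following is a minimal sketch only, assuming pre-aligned sequences of equal length and an uncorrected p-distance; the abstract does not state which distance measure the authors used, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (uncorrected p-distance).

    Assumption: both sequences come from the same alignment, so they
    have equal length and position i in one matches position i in the other.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def mean_pairwise_diversity(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy alignment, not real mtDNA data.
    toy = ["AAAA", "AAAT", "AATT"]
    print(f"mean pairwise diversity: {mean_pairwise_diversity(toy):.2%}")
```

A corrected distance (for example one that models multiple substitutions per site) would give larger values for divergent sequences; the uncorrected form is used here only because it is the simplest to follow.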
Kang, Guangliang; Du, Li; Zhang, Hong
2016-06-22
The growing complexity of biological experiment design based on high-throughput RNA sequencing (RNA-seq) is calling for more accommodative statistical tools. We focus on differential expression (DE) analysis using RNA-seq data in the presence of multiple treatment conditions. We propose a novel method, multiDE, for facilitating DE analysis using RNA-seq read count data with multiple treatment conditions. The read count is assumed to follow a log-linear model incorporating two factors (i.e., condition and gene), where an interaction term is used to quantify the association between gene and condition. The number of degrees of freedom is reduced to one through the first-order decomposition of the interaction, leading to a dramatic power improvement in testing DE genes when the number of conditions is greater than two. In our simulation situations, multiDE outperformed the benchmark methods (i.e. edgeR and DESeq2) even if the underlying model was severely misspecified, and the power gain increased with the number of conditions. In the application to two real datasets, multiDE identified more biologically meaningful DE genes than the benchmark methods. An R package implementing multiDE is available publicly at http://homepage.fudan.edu.cn/zhangh/softwares/multiDE . When the number of conditions is two, multiDE performs comparably with the benchmark methods. When the number of conditions is greater than two, multiDE outperforms the benchmark methods.
Bodelle, Boris; Luboldt, Wolfgang; Wichmann, Julian L; Fischer, Sebastian; Vogl, Thomas J; Beeres, Martin
2016-01-01
To determine the value of the 2D multiple-echo data image combination (MEDIC) sequence relative to the short-tau inversion recovery (STIR) sequence regarding the depiction of chondral lesions in the patellofemoral joint. During a period of 6 months, patients with acute pain at the anterior aspect of the knee, joint effusion and a suspected chondral lesion in the patellofemoral joint underwent MRI including axial MEDIC and STIR imaging. Patients with chondral lesions in the patellofemoral joint on at least one sequence were included. The MEDIC and STIR sequences were compared quantitatively regarding the patella cartilage-to-effusion contrast-to-noise ratio (CNR) and qualitatively regarding the depiction of chondral lesions, independently scored by two radiologists on a 3-point scale (1 = not depicted; 2 = blurred depiction; 3 = clearly depicted), using the Wilcoxon-Mann-Whitney test. For the analysis of inter-observer agreement, Cohen's weighted kappa test was used. 30 of 58 patients (male:female, 21:9; age: 44 ± 12 yrs) revealed cartilage lesions (fissures, n = 5 including fibrillation; gaps, n = 15; delamination, n = 7; osteoarthritis, n = 3) and were included in this study. The STIR sequence was significantly (p < 0.001) superior to the MEDIC sequence regarding both the patella cartilage-to-effusion CNR (mean CNR: 232 ± 61 vs. 40 ± 16) and the depiction of chondral lesions (mean score: 2.83 ± 0.4 vs. 1.75 ± 0.7), with substantial inter-observer agreement in the rating of both sequences (κ = 0.76-0.89). For the depiction of chondral lesions in the patellofemoral joint, the axial STIR sequence should be chosen in preference to the axial MEDIC sequence.
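Inter-observer agreement on an ordinal 3-point scale, as reported above, is commonly summarised with Cohen's weighted kappa. The sketch below is illustrative only: it assumes linear disagreement weights (the abstract does not say whether linear or quadratic weights were used), the function name is hypothetical, and the ratings in the usage example are invented.

```python
def weighted_kappa(rater_a, rater_b, categories, quadratic=False):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    categories must list the scale in order, e.g. [1, 2, 3].
    Disagreement weights are |i - j| / (k - 1), squared if quadratic=True.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed contingency table of paired ratings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            if quadratic:
                w *= w
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += w * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Invented scores for six lesions on the 3-point depiction scale.
    reader_1 = [3, 3, 2, 3, 1, 2]
    reader_2 = [3, 2, 2, 3, 1, 3]
    print(round(weighted_kappa(reader_1, reader_2, [1, 2, 3]), 3))
```

Weighting matters on ordinal scales because a 1-versus-3 disagreement is penalised more heavily than a 2-versus-3 disagreement, which unweighted kappa would treat identically.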
The Papillomavirus Episteme: a major update to the papillomavirus sequence database.
Van Doorslaer, Koenraad; Li, Zhiwen; Xirasagar, Sandhya; Maes, Piet; Kaminsky, David; Liou, David; Sun, Qiang; Kaur, Ramandeep; Huyen, Yentram; McBride, Alison A
2017-01-04
The Papillomavirus Episteme (PaVE) is a database of curated papillomavirus genomic sequences, accompanied by web-based sequence analysis tools. This update describes the addition of major new features. The papillomavirus genomes within PaVE have been further annotated, and now include the major spliced mRNA transcripts. Viral genes and transcripts can be visualized on both linear and circular genome browsers. Evolutionary relationships among PaVE reference protein sequences can be analysed using multiple sequence alignments and phylogenetic trees. To assist in viral discovery, PaVE offers a typing tool: a simplified algorithm to determine whether a newly sequenced virus is novel. PaVE also now contains an image library containing gross clinical and histopathological images of papillomavirus-infected lesions. Database URL: https://pave.niaid.nih.gov/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Sequence analysis of MHC class I α2 from sockeye salmon (Oncorhynchus nerka).
McClelland, Erin K; Ming, Tobi J; Tabata, Amy; Miller, Kristina M
2011-09-01
Most studies assessing adaptive MHC diversity in salmon populations have focused on the classical class II DAB or DAA loci, as these have been most amenable to single PCR amplifications due to their relatively low level of sequence divergence. Herein, we report the characterization of the classical class I UBA α2 locus based on collections taken throughout the species range of sockeye salmon (Oncorhynchus nerka). Through use of multiple lineage-specific primer sets, denaturing gradient gel electrophoresis and sequencing, we identified thirty-four alleles from three highly divergent lineages. Sequence identity between lineages ranged from 30.0% to 56.8% but was relatively high within lineages. Allelic identity within the antigen recognition site (ARS) was greater than for the longer sequence. Global positive selection on UBA was seen at the sequence level (dN:dS = 1.012) with four codons under positive selection and 12 codons under negative selection. Crown Copyright © 2011. Published by Elsevier Ltd. All rights reserved.
Nucleic Acid Detection Methods
Smith, Cassandra L.; Yaar, Ron; Szafranski, Przemyslaw; Cantor, Charles R.
1998-05-19
The invention relates to methods for rapidly determining the sequence and/or length of a target sequence. The target sequence may be a series of known or unknown repeat sequences which are hybridized to an array of probes. The hybridized array is digested with a single-strand nuclease and free 3'-hydroxyl groups extended with a nucleic acid polymerase. Nuclease-cleaved heteroduplexes can be easily distinguished from nuclease-uncleaved heteroduplexes by differential labeling. Probes and target can be differentially labeled with detectable labels. Matched target can be detected by cleaving resulting loops from the hybridized target and creating free 3'-hydroxyl groups. These groups are recognized and extended by polymerases added into the reaction system, which also adds or releases one label into solution. The resulting products can be analyzed using either solid phase or solution. These methods can be used to detect characteristic nucleic acid sequences, to determine target sequence and to screen for genetic defects and disorders. Assays can be conducted on solid surfaces allowing for multiple reactions to be conducted in parallel and, if desired, automated.
Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.
Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter
2015-01-01
To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
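The alignment details mentioned above (score, number of gaps and mismatches) come from the traceback step of Smith-Waterman. The sketch below is a plain single-threaded Python illustration of that idea, not PaSWAS or its GPU implementation; the scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions chosen for readability.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns the best local score plus counts of matches, mismatches
    and gaps along one optimal traceback path.
    """
    rows, cols = len(query) + 1, len(target) + 1
    scores = [[0] * cols for _ in range(rows)]
    best_score, best_cell = 0, (0, 0)

    # Fill the dynamic-programming matrix; cells never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if query[i - 1] == target[j - 1] else mismatch
            scores[i][j] = max(
                0,
                scores[i - 1][j - 1] + substitution,  # align two residues
                scores[i - 1][j] + gap,               # gap in target
                scores[i][j - 1] + gap,               # gap in query
            )
            if scores[i][j] > best_score:
                best_score, best_cell = scores[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_cell
    matches = mismatches = gaps = 0
    while i > 0 and j > 0 and scores[i][j] > 0:
        substitution = match if query[i - 1] == target[j - 1] else mismatch
        if scores[i][j] == scores[i - 1][j - 1] + substitution:
            if query[i - 1] == target[j - 1]:
                matches += 1
            else:
                mismatches += 1
            i, j = i - 1, j - 1
        elif scores[i][j] == scores[i - 1][j] + gap:
            gaps += 1
            i -= 1
        else:
            gaps += 1
            j -= 1

    return {"score": best_score, "matches": matches,
            "mismatches": mismatches, "gaps": gaps}


if __name__ == "__main__":
    print(smith_waterman("ACGTACGT", "ACGTTCGT"))
```

Each matrix cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; that independence is what makes the algorithm a natural fit for massively parallel hardware.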
Motion and Structure Estimation of Manoeuvring Objects in Multiple- Camera Image Sequences
1992-11-01
and Speckert [23], Gennery [24], Hallman [25], Legters and Young [26], Stuller and Krishnamurthy [27], Wu et al. [38], Matthies, Kanade, and Szeliski...26] G.R. Legters, T.Y. Young, "A mathematical model for computer image tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence
Phylogenetic Analysis of Klebsiella pneumoniae from Hospitalized Children, Pakistan.
Ejaz, Hasan; Wang, Nancy; Wilksch, Jonathan J; Page, Andrew J; Cao, Hanwei; Gujaran, Shruti; Keane, Jacqueline A; Lithgow, Trevor; Ul-Haq, Ikram; Dougan, Gordon; Strugnell, Richard A; Heinz, Eva
2017-11-01
Klebsiella pneumoniae shows increasing emergence of multidrug-resistant lineages, including strains resistant to all available antimicrobial drugs. We conducted whole-genome sequencing of 178 highly drug-resistant isolates from a tertiary hospital in Lahore, Pakistan. Phylogenetic analyses to place these isolates into global context demonstrate the expansion of multiple independent lineages, including K. quasipneumoniae.
Jie Jin, Feng; Hara, Seiichi; Sato, Atsushi; Koyama, Yasuji
2014-01-01
Wild-type Aspergillus oryzae RIB40 contains two copies of the AO090005001597 gene. We previously constructed A. oryzae RIB40 strain, RKuAF8B, with multiple chromosomal deletions, in which the AO090005001597 copy number was found to be increased significantly. Sequence analysis indicated that AO090005001597 is part of a putative 6,000-bp retrotransposable element, flanked by two long terminal repeats (LTRs) of 669 bp, with characteristics of retroviruses and retrotransposons, and thus designated AoLTR (A. oryzae LTR-retrotransposable element). AoLTR comprised putative reverse transcriptase, RNase H, and integrase domains. The deduced amino acid sequence alignment of AoLTR showed 94% overall identity with AFLAV, an A. flavus Tf1/sushi retrotransposon. Quantitative real-time RT-PCR showed that AoLTR gene expression was significantly increased in the RKuAF8B, in accordance with the increased copy number. Inverse PCR indicated that the full-length retrotransposable element was randomly integrated into multiple genomic locations. However, no obvious phenotypic changes were associated with the increased AoLTR gene copy number.
Batchu, Navish Kumar; Khater, Shradha; Patil, Sonal; Nagle, Vinod; Das, Gautam; Bhadra, Bhaskar; Sapre, Ajit; Dasgupta, Santanu
2018-03-05
A filamentous cyanobacterium, Geitlerinema sp. FC II, was isolated from a marine algae culture pond at Reliance Industries Limited (RIL), India. The 6.7 Mb draft genome of FC II encodes 6697 protein-coding genes. Analysis of the whole genome sequence revealed the presence of a nif gene cluster, supporting its capability to fix atmospheric nitrogen. The FC II genome contains two variants of sulfide:quinone oxidoreductases (SQR), which is a crucial electron donor in cyanobacterial metabolic processes. FC II is characterized by the presence of multiple CRISPR-Cas (Clustered Regularly Interspaced Short Palindrome Repeats - CRISPR associated proteins) clusters, multiple variants of genes encoding photosystem reaction centres, and biosynthetic gene clusters of alkane, polyketides and non-ribosomal peptides. The presence of these pathways will help FC II gain an ecological advantage over other strains for biomass production in large-scale cultivation systems. Hence, FC II may be used for production of biofuel and other industrially important metabolites. Copyright © 2018 Elsevier Inc. All rights reserved.
The Genome Portal of the Department of Energy Joint Genome Institute
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nordberg, Henrik; Cantor, Michael; Dushekyo, Serge
2014-03-14
The JGI Genome Portal (http://genome.jgi.doe.gov) provides unified access to all JGI genomic databases and analytical tools. A user can search, download and explore multiple data sets available for all DOE JGI sequencing projects including their status, assemblies and annotations of sequenced genomes. Genome Portal was significantly updated in the past 2 years, with a specific emphasis on efficient handling of the rapidly growing amount of diverse genomic data accumulated in JGI. A critical aspect of handling big data in genomics is the development of visualization and analysis tools that allow scientists to derive meaning from what are otherwise terabases of inert sequence. An interactive visualization tool developed in the group allows us to explore contigs resulting from a single metagenome assembly. Implemented with modern web technologies that take advantage of the power of the computer's graphics processing unit (GPU), the tool allows the user to easily navigate over 100,000 data points in multiple dimensions, among many biologically meaningful parameters of a dataset such as relative abundance, contig length, and G+C content.
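Two of the per-contig parameters named above, contig length and G+C content, are simple to compute. The sketch below is a hedged illustration and not part of the Genome Portal: the function names and the dictionary input format are assumptions, and ambiguous bases such as N are excluded from the G+C denominator by choice.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C.

    Returns 0.0 for sequences with no unambiguous bases.
    """
    sequence = sequence.upper()
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / unambiguous


def contig_summary(contigs):
    """Per-contig length and G+C content.

    contigs: mapping of contig name to nucleotide sequence (hypothetical
    input format; a real pipeline would parse these from a FASTA file).
    """
    return [
        {"name": name, "length": len(seq), "gc": gc_content(seq)}
        for name, seq in contigs.items()
    ]


if __name__ == "__main__":
    # Invented contigs for illustration only.
    toy_contigs = {"contig_1": "ATGCGCGCTA", "contig_2": "ATATATNNAT"}
    for row in contig_summary(toy_contigs):
        print(row)
```

Plotting such values against relative abundance is a common way to separate contigs from different organisms in a metagenome, since genomes tend to differ in base composition and coverage.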
Pena, S D; Barreto, G; Vago, A R; De Marco, L; Reinach, F C; Dias Neto, E; Simpson, A J
1994-01-01
Low-stringency single specific primer PCR (LSSP-PCR) is an extremely simple PCR-based technique that detects single or multiple mutations in gene-sized DNA fragments. A purified DNA fragment is subjected to PCR using high concentrations of a single specific oligonucleotide primer, large amounts of Taq polymerase, and a very low annealing temperature. Under these conditions the primer hybridizes specifically to its complementary region and nonspecifically to multiple sites within the fragment, in a sequence-dependent manner, producing a heterogeneous set of reaction products resolvable by electrophoresis. The complex banding pattern obtained is significantly altered by even a single-base change and thus constitutes a unique "gene signature." Therefore LSSP-PCR will have almost unlimited application in all fields of genetics and molecular medicine where rapid and sensitive detection of mutations and sequence variations is important. The usefulness of LSSP-PCR is illustrated by applications in the study of mutants of smooth muscle myosin light chain, analysis of a family with X-linked nephrogenic diabetes insipidus, and identity testing using human mitochondrial DNA. PMID:8127912
Parkin, Derek B; Archer, Linda L; Childress, April L; Wellehan, James F X
2009-07-01
Bearded dragons (Pogona vitticeps) are popular pets in the United States. Agamid Adenovirus 1 (AgAdV1) is an important infectious agent of bearded dragons. The only AgAdV1 sequences available to date are from a highly conserved region of the DNA polymerase gene. Degenerate primers were designed to amplify a variable region of the AgAdV1 hexon gene for sequencing. Genetic differences were identified within the hexon gene of 17 bearded dragons from 4 collections. Much less diversity was present in the polymerase gene. Bayesian analysis of the hexon nucleotide alignment identified two larger groups and two isolates that did not tightly cluster with these two groups. Multiple genotypes were identified within collections, and individual genotypes were seen in different collections. Three bearded dragons appeared to be infected by multiple strains. These findings show that this hexon region is useful for AgAdV1 genotyping, which can be used epidemiologically as well as in future investigations of AgAdV1 evolution and clinical implications of strain differences.
NASA Astrophysics Data System (ADS)
Zhang, Xunxun; Xu, Hongke; Fang, Jianwu
2018-01-01
Along with the rapid development of the unmanned aerial vehicle technology, multiple vehicle tracking (MVT) in aerial video sequence has received widespread interest for providing the required traffic information. Due to the camera motion and complex background, MVT in aerial video sequence poses unique challenges. We propose an efficient MVT algorithm via driver behavior-based Kalman filter (DBKF) and an improved deterministic data association (IDDA) method. First, a hierarchical image registration method is put forward to compensate the camera motion. Afterward, to improve the accuracy of the state estimation, we propose the DBKF module by incorporating the driver behavior into the Kalman filter, where artificial potential field is introduced to reflect the driver behavior. Then, to implement the data association, a local optimization method is designed instead of global optimization. By introducing the adaptive operating strategy, the proposed IDDA method can also deal with the situation in which the vehicles suddenly appear or disappear. Finally, comprehensive experiments on the DARPA VIVID data set and KIT AIS data set demonstrate that the proposed algorithm can generate satisfactory and superior results.
Kinoti, Wycliff M; Constable, Fiona E; Nancarrow, Narelle; Plummer, Kim M; Rodoni, Brendan
2017-01-01
PCR amplicon next generation sequencing (NGS) analysis offers a broadly applicable and targeted approach to detect populations of both high- or low-frequency virus variants in one or more plant samples. In this study, amplicon NGS was used to explore the diversity of the tripartite genome virus, Prunus necrotic ringspot virus (PNRSV) from 53 PNRSV-infected trees using amplicons from conserved gene regions of each of PNRSV RNA1, RNA2 and RNA3. Sequencing of the amplicons from 53 PNRSV-infected trees revealed differing levels of polymorphism across the three different components of the PNRSV genome with a total number of 5040, 2083 and 5486 sequence variants observed for RNA1, RNA2 and RNA3 respectively. The RNA2 had the lowest diversity of sequences compared to RNA1 and RNA3, reflecting the lack of flexibility tolerated by the replicase gene that is encoded by this RNA component. Distinct PNRSV phylo-groups, consisting of closely related clusters of sequence variants, were observed in each of PNRSV RNA1, RNA2 and RNA3. Most plant samples had a single phylo-group for each RNA component. Haplotype network analysis showed that smaller clusters of PNRSV sequence variants were genetically connected to the largest sequence variant cluster within a phylo-group of each RNA component. Some plant samples had sequence variants occurring in multiple PNRSV phylo-groups in at least one of each RNA and these phylo-groups formed distinct clades that represent PNRSV genetic strains. Variants within the same phylo-group of each Prunus plant sample had ≥97% similarity and phylo-groups within a Prunus plant sample and between samples had ≤97% similarity. Based on the analysis of diversity, a definition of a PNRSV genetic strain was proposed. The proposed definition was applied to determine the number of PNRSV genetic strains in each of the plant samples and the complexity in defining genetic strains in multipartite genome viruses was explored.
Ahn, Yul-Kyun; Tripathi, Swati; Kim, Jeong-Ho; Cho, Young-Il; Lee, Hye-Eun; Kim, Do-Sun; Woo, Jong-Gyu; Cho, Myeong-Cheoul
2014-01-10
Next generation sequencing technologies have proven to be a rapid and cost-effective means to assemble and characterize gene content and identify molecular markers in various organisms. Pepper (Capsicum annuum L., Solanaceae) is a major staple vegetable crop, which is economically important and has worldwide distribution. High-throughput transcriptome profiling of two pepper cultivars, Mandarin and Blackcluster, using 454 GS-FLX pyrosequencing yielded 279,221 and 316,357 sequenced reads with a total 120.44 and 142.54 Mb of sequence data (average read length of 431 and 450 nucleotides). These reads resulted from 17,525 and 16,341 'isogroups' and were assembled into 19,388 and 18,057 isotigs, and 22,217 and 13,153 singletons for both the cultivars, respectively. Assembled sequences were annotated functionally based on homology to genes in multiple public databases. Detailed sequence variant analysis identified a total of 9701 and 12,741 potential SNPs which eventually resulted in 1025 and 1059 genotype specific SNPs, for both the varieties, respectively, after examining SNP frequency distribution for each mapped unigene. These markers for pepper will be highly valuable for marker-assisted breeding and other genetic studies. © 2013 Elsevier B.V. All rights reserved.
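The reported average read lengths can be checked against the read counts and total yields given in the abstract. The short sketch below does that arithmetic; it assumes 1 Mb means 10^6 bases (the abstract does not say), and the function and variable names are illustrative only. Under that assumption both computed means fall within one nucleotide of the reported values.

```python
# Figures taken from the abstract: (reads, total megabases, reported mean length in nt).
RUNS = {
    "Mandarin": (279_221, 120.44, 431),
    "Blackcluster": (316_357, 142.54, 450),
}

BASES_PER_MB = 1_000_000  # assumption: 1 Mb = 10^6 bases


def mean_read_length(reads: int, megabases: float) -> float:
    """Average read length in nucleotides = total bases / number of reads."""
    return megabases * BASES_PER_MB / reads


for cultivar, (reads, megabases, reported) in RUNS.items():
    computed = mean_read_length(reads, megabases)
    print(f"{cultivar}: {computed:.1f} nt computed vs {reported} nt reported")
```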
CFGP: a web-based, comparative fungal genomics platform.
Park, Jongsun; Park, Bongsoo; Jung, Kyongyong; Jang, Suwang; Yu, Kwangyul; Choi, Jaeyoung; Kong, Sunghyung; Park, Jaejin; Kim, Seryun; Kim, Hyojeong; Kim, Soonok; Kim, Jihyun F; Blair, Jaime E; Lee, Kwangwon; Kang, Seogchan; Lee, Yong-Hwan
2008-01-01
Since the completion of the Saccharomyces cerevisiae genome sequencing project in 1996, the genomes of over 80 fungal species have been sequenced or are currently being sequenced. Resulting data provide opportunities for studying and comparing fungal biology and evolution at the genome level. To support such studies, the Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr), a web-based multifunctional informatics workbench, was developed. The CFGP comprises three layers, including the basal layer, middleware and the user interface. The data warehouse in the basal layer contains standardized genome sequences of 65 fungal species. The middleware processes queries via six analysis tools, including BLAST, ClustalW, InterProScan, SignalP 3.0, PSORT II and a newly developed tool named BLASTMatrix. The BLASTMatrix permits the identification and visualization of genes homologous to a query across multiple species. The Data-driven User Interface (DUI) of the CFGP was built on a new concept of pre-collecting data and post-executing analysis instead of the 'fill-in-the-form-and-press-SUBMIT' user interfaces utilized by most bioinformatics sites. A tool termed Favorite, which supports the management of encapsulated sequence data and provides a personalized data repository to users, is another novel feature in the DUI.
DLocalMotif: a discriminative approach for discovering local motifs in protein sequences.
Mehdi, Ahmed M; Sehgal, Muhammad Shoaib B; Kobe, Bostjan; Bailey, Timothy L; Bodén, Mikael
2013-01-01
Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. http://bioinf.scmb.uq.edu.au/dlocalmotif/
Investigating the viral ecology of global bee communities with high-throughput metagenomics.
Galbraith, David A; Fuller, Zachary L; Ray, Allyson M; Brockmann, Axel; Frazier, Maryann; Gikungu, Mary W; Martinez, J Francisco Iturralde; Kapheim, Karen M; Kerby, Jeffrey T; Kocher, Sarah D; Losyev, Oleksiy; Muli, Elliud; Patch, Harland M; Rosa, Cristina; Sakamoto, Joyce M; Stanley, Scott; Vaudo, Anthony D; Grozinger, Christina M
2018-06-11
Bee viral ecology is a fascinating emerging area of research: viruses exert a range of effects on their hosts, exacerbate impacts of other environmental stressors, and, importantly, are readily shared across multiple bee species in a community. However, our understanding of bee viral communities is limited, as it is primarily derived from studies of North American and European Apis mellifera populations. Here, we examined viruses in populations of A. mellifera and 11 other bee species from 9 countries, across 4 continents and Oceania. We developed a novel pipeline to rapidly and inexpensively screen for bee viruses. This pipeline includes purification of encapsulated RNA/DNA viruses, sequence-independent amplification, high throughput sequencing, integrated assembly of contigs, and filtering to identify contigs specifically corresponding to viral sequences. We identified sequences for (+)ssRNA, (-)ssRNA, dsRNA, and ssDNA viruses. Overall, we found 127 contigs corresponding to novel viruses (i.e. previously not observed in bees), with 27 represented by >0.1% of the reads in a given sample, and 7 contained an RdRp or replicase sequence which could be used for robust phylogenetic analysis. This study provides a sequence-independent pipeline for viral metagenomics analysis, and greatly expands our understanding of the diversity of viruses found in bee communities.
Using Next Generation Sequencing for Multiplexed Trait-Linked Markers in Wheat
Bernardo, Amy; Wang, Shan; St. Amand, Paul; Bai, Guihua
2015-01-01
With the advent of next generation sequencing (NGS) technologies, single nucleotide polymorphisms (SNPs) have become the major type of marker for genotyping in many crops. However, the availability of SNP markers for important traits of bread wheat (Triticum aestivum L.) that can be effectively used in marker-assisted selection (MAS) is still limited and SNP assays for MAS are usually uniplex. A shift from uniplex to multiplex assays will allow the simultaneous analysis of multiple markers and increase MAS efficiency. We designed 33 locus-specific markers from SNP or indel-based marker sequences that linked to 20 different quantitative trait loci (QTL) or genes of agronomic importance in wheat and analyzed the amplicon sequences using an Ion Torrent Proton Sequencer and a custom allele detection pipeline to determine the genotypes of 24 selected germplasm accessions. Among the 33 markers, 27 were successfully multiplexed and 23 had 100% SNP call rates. Results from analysis of "kompetitive allele-specific PCR" (KASP) and sequence tagged site (STS) markers developed from the same loci fully verified the genotype calls of 23 markers. The NGS-based multiplexed assay developed in this study is suitable for rapid and high-throughput screening of SNPs and some indel-based markers in wheat. PMID:26625271
Sailaja, B; Anjum, Najreen; Patil, Yogesh K; Agarwal, Surekha; Malathi, P; Krishnaveni, D; Balachandran, S M; Viraktamath, B C; Mangrauthia, Satendra K
2013-12-01
In this study, complete genome of a south Indian isolate of Rice tungro spherical virus (RTSV) from Andhra Pradesh (AP) was sequenced, and the predicted amino acid sequence was analysed. The RTSV RNA genome consists of 12,171 nt without the poly(A) tail, encoding a putative typical polyprotein of 3,470 amino acids. Furthermore, cleavage sites and sequence motifs of the polyprotein were predicted. Multiple alignment with other RTSV isolates showed a nucleotide sequence identity of 95% to east Indian isolates and 90% to Philippines isolates. A phylogenetic tree based on complete genome sequence showed that Indian isolates clustered together, while Vt6 and PhilA isolates of Philippines formed two separate clusters. Twelve recombination events were detected in RNA genome of RTSV using the Recombination Detection Program version 3. Recombination analysis suggested significant role of 5' end and central region of genome in virus evolution. Further, AP and Odisha isolates appeared as important RTSV isolates involved in diversification of this virus in India through recombination phenomenon. The new addition of complete genome of first south Indian isolate provided an opportunity to establish the molecular evolution of RTSV through recombination analysis and phylogenetic relationship.
Wide distribution of O157-antigen biosynthesis gene clusters in Escherichia coli.
Iguchi, Atsushi; Shirai, Hiroki; Seto, Kazuko; Ooka, Tadasuke; Ogura, Yoshitoshi; Hayashi, Tetsuya; Osawa, Kayo; Osawa, Ro
2011-01-01
Most Escherichia coli O157-serogroup strains are classified as enterohemorrhagic E. coli (EHEC), which is known as an important food-borne pathogen for humans. They usually produce Shiga toxin (Stx) 1 and/or Stx2, and express H7-flagella antigen (or nonmotile). However, O157 strains that do not produce Stxs and express H antigens different from H7 are sometimes isolated from clinical and other sources. Multilocus sequence analysis revealed that these 21 O157:non-H7 strains tested in this study belong to multiple evolutionary lineages different from that of EHEC O157:H7 strains, suggesting a wide distribution of the gene set encoding the O157-antigen biosynthesis in multiple lineages. To gain insight into the gene organization and the sequence similarity of the O157-antigen biosynthesis gene clusters, we conducted genomic comparisons of the chromosomal regions (about 59 kb in each strain) covering the O-antigen gene cluster and its flanking regions between six O157:H7/non-H7 strains. Gene organization of the O157-antigen gene cluster was identical among O157:H7/non-H7 strains, but was divided into two distinct types at the nucleotide sequence level. Interestingly, distribution of the two types did not clearly follow the evolutionary lineages of the strains, suggesting that horizontal gene transfer of both types of O157-antigen gene clusters has occurred independently among E. coli strains. Additionally, detailed sequence comparison revealed that some positions of the repetitive extragenic palindromic (REP) sequences in the regions flanking the O-antigen gene clusters were coincident with possible recombination points. From these results, we conclude that the horizontal transfer of the O157-antigen gene clusters induced the emergence of multiple O157 lineages within E. coli and speculate that REP sequences may involve one of the driving forces for exchange and evolution of O-antigen loci.
Goto, Taichiro; Hirotsu, Yosuke; Mochizuki, Hitoshi; Nakagomi, Takahiro; Shikata, Daichi; Yokoyama, Yujiro; Oyama, Toshio; Amemiya, Kenji; Okimoto, Kenichiro; Omata, Masao
2017-05-09
In cases of multiple lung cancers, individual tumors may represent either a primary lung cancer or both primary and metastatic lung cancers. Treatment selection varies depending on such features, and this discrimination is critically important in predicting prognosis. The present study was undertaken to determine the efficacy and validity of mutation analysis as a means of determining whether multiple lung cancers are primary or metastatic in nature. The study involved 12 patients who underwent surgery in our department for multiple lung cancers between July 2014 and March 2016. Tumor cells were collected from formalin-fixed paraffin-embedded tissues of the primary lesions by using laser capture microdissection, and targeted sequencing of 53 lung cancer-related genes was performed. In surgically treated patients with multiple lung cancers, the driver mutation profile differed among the individual tumors. Meanwhile, in a case of a solitary lung tumor that appeared after surgery for double primary lung cancers, gene mutation analysis using a bronchoscopic biopsy sample revealed a gene mutation profile consistent with the surgically resected specimen, thus demonstrating that the tumor in this case was metastatic. In cases of multiple lung cancers, the comparison of driver mutation profiles clarifies the clonal origin of the tumors and enables discrimination between primary and metastatic tumors.
Single-molecule dilution and multiple displacement amplification for molecular haplotyping.
Paul, Philip; Apgar, Josh
2005-04-01
Separate haploid analysis is frequently required for heterozygous genotyping to resolve phase ambiguity or confirm allelic sequence. We demonstrate a technique of single-molecule dilution followed by multiple strand displacement amplification to haplotype polymorphic alleles. Dilution of DNA to haploid equivalency, or a single molecule, is a simple method for separating di-allelic DNA. Strand displacement amplification is a robust method for non-specific DNA expansion that employs random hexamers and phage polymerase Phi29 for double-stranded DNA displacement and primer extension, resulting in high processivity and exceptional product length. Single-molecule dilution was followed by strand displacement amplification to expand separated alleles to microgram quantities of DNA for more efficient haplotype analysis of heterozygous genes.
ComplexContact: a web server for inter-protein contact prediction using deep learning.
Zeng, Hong; Wang, Sheng; Zhou, Tianming; Zhao, Feifeng; Li, Xiufeng; Wu, Qing; Xu, Jinbo
2018-05-22
ComplexContact (http://raptorx2.uchicago.edu/ComplexContact/) is a web server for sequence-based interfacial residue-residue contact prediction of a putative protein complex. Interfacial residue-residue contacts are critical for understanding how proteins form complexes and interact at the residue level. When receiving a pair of protein sequences, ComplexContact first searches for their sequence homologs and builds two paired multiple sequence alignments (MSA), then it applies co-evolution analysis and a CASP-winning deep learning (DL) method to predict interfacial contacts from paired MSAs and visualizes the prediction as an image. The DL method was originally developed for intra-protein contact prediction and performed the best in CASP12. Our large-scale experimental test further shows that ComplexContact greatly outperforms pure co-evolution methods for inter-protein contact prediction, regardless of the species.
NASA Technical Reports Server (NTRS)
Wheeler, Ward C.
2003-01-01
The problem of determining the minimum cost hypothetical ancestral sequences for a given cladogram is known to be NP-complete (Wang and Jiang, 1994). Traditionally, point estimations of hypothetical ancestral sequences have been used to gain heuristic, upper bounds on cladogram cost. These include procedures with such diverse approaches as non-additive optimization of multiple sequence alignment, direct optimization (Wheeler, 1996), and fixed-state character optimization (Wheeler, 1999). A method is proposed here which, by extending fixed-state character optimization, replaces the estimation process with a search. This form of optimization examines a diversity of potential state solutions for cost-efficient hypothetical ancestral sequences and can result in greatly more parsimonious cladograms. Additionally, such an approach can be applied to other NP-complete phylogenetic optimization problems such as genomic break-point analysis. © 2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
Ming, De-Song; Chen, Qing-Qing; Chen, Xiao-Tin
2018-05-14
To clarify the resistance mechanisms of Pannonibacter phragmitetus 31801, isolated from the blood of a liver abscess patient, at the genomic level, we performed whole genomic sequencing using a PacBio RS II single-molecule real-time long-read sequencer. Bioinformatic analysis of the resulting sequence was then carried out to identify any possible resistance genes. Analyses included Basic Local Alignment Search Tool searches against the Antibiotic Resistance Genes Database, ResFinder analysis of the genome sequence, and Resistance Gene Identifier analysis within the Comprehensive Antibiotic Resistance Database. Prophages, clustered regularly interspaced short palindromic repeats (CRISPR), and other putative virulence factors were also identified using PHAST, CRISPRfinder, and the Virulence Factors Database, respectively. The circular chromosome and single plasmid of P. phragmitetus 31801 contained multiple antibiotic resistance genes, including those coding for three different types of β-lactamase [NPS β-lactamase (EC 3.5.2.6), β-lactamase class C, and a metal-dependent hydrolase of β-lactamase superfamily I]. In addition, genes coding for subunits of several multidrug-resistance efflux pumps were identified, including those targeting macrolides (adeJ, cmeB), tetracycline (acrB, adeAB), fluoroquinolones (acrF, ceoB), and aminoglycosides (acrD, amrB, ceoB, mexY, smeB). However, apart from the tripartite macrolide efflux pump macAB-tolC, the genome did not appear to contain the complete complement of subunit genes required for production of most of the major multidrug-resistance efflux pumps.
Metavir 2: new tools for viral metagenome comparison and assembled virome analysis
2014-01-01
Background Metagenomics, based on culture-independent sequencing, is a well-fitted approach to provide insights into the composition, structure and dynamics of environmental viral communities. Following recent advances in sequencing technologies, new challenges arise for existing bioinformatic tools dedicated to viral metagenome (i.e. virome) analysis as (i) the number of viromes is rapidly growing and (ii) large genomic fragments can now be obtained by assembling the huge amount of sequence data generated for each metagenome. Results To face these challenges, a new version of Metavir was developed. First, all Metavir tools have been adapted to support comparative analysis of viromes in order to improve the analysis of multiple datasets. In addition to the sequence comparison previously provided, viromes can now be compared through their k-mer frequencies, their taxonomic compositions, recruitment plots and phylogenetic trees containing sequences from different datasets. Second, a new section has been specifically designed to handle assembled viromes made of thousands of large genomic fragments (i.e. contigs). This section includes an annotation pipeline for uploaded viral contigs (gene prediction, similarity search against reference viral genomes and protein domains) and an extensive comparison between contigs and reference genomes. Contigs and their annotations can be explored on the website through specifically developed dynamic genomic maps and interactive networks. Conclusions The new features of Metavir 2 allow users to explore and analyze viromes composed of raw reads or assembled fragments through a set of adapted tools and a user-friendly interface. PMID:24646187
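Comparing viromes through their k-mer frequencies can be illustrated with a minimal sketch: build a relative-frequency profile of k-mers for each dataset, then measure how far apart two profiles are. The abstract does not state which k or which distance Metavir uses, so the choice of k = 4, the Bray-Curtis dissimilarity, and all function names below are assumptions made for illustration, not the tool's implementation.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_frequencies(sequences, k=4):
    """Relative frequency of each k-mer observed across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    numerator = sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)
    denominator = sum(p.get(key, 0.0) + q.get(key, 0.0) for key in keys)
    return numerator / denominator if denominator else 0.0


# Toy reads standing in for two viromes (hypothetical data).
virome_a = ["ACGTACGTACGT", "TTGACCAGTACG"]
virome_b = ["ACGTACGTTTGA", "GGGCCCAAATTT"]
distance = bray_curtis(kmer_frequencies(virome_a), kmer_frequencies(virome_b))
print(f"k-mer dissimilarity: {distance:.3f}")
```

A pairwise matrix of such distances over many datasets is the usual input to clustering or ordination, which is one plausible way to read "compared through their k-mer frequencies".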
NASA Astrophysics Data System (ADS)
Weisbrod, Chad R.; Kaiser, Nathan K.; Syka, John E. P.; Early, Lee; Mullen, Christopher; Dunyach, Jean-Jacques; English, A. Michelle; Anderson, Lissa C.; Blakney, Greg T.; Shabanowitz, Jeffrey; Hendrickson, Christopher L.; Marshall, Alan G.; Hunt, Donald F.
2017-09-01
High resolution mass spectrometry is a key technology for in-depth protein characterization. High-field Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) enables high-level interrogation of intact proteins in the most detail to date. However, an appropriate complement of fragmentation technologies must be paired with FTMS to provide comprehensive sequence coverage, as well as characterization of sequence variants, and post-translational modifications. Here we describe the integration of front-end electron transfer dissociation (FETD) with a custom-built 21 tesla FT-ICR mass spectrometer, which yields unprecedented sequence coverage for proteins ranging from 2.8 to 29 kDa, without the need for extensive spectral averaging (e.g., 60% sequence coverage for apo-myoglobin with four averaged acquisitions). The system is equipped with a multipole storage device separate from the ETD reaction device, which allows accumulation of multiple ETD fragment ion fills. Consequently, an optimally large product ion population is accumulated prior to transfer to the ICR cell for mass analysis, which improves mass spectral signal-to-noise ratio, dynamic range, and scan rate. We find a linear relationship between protein molecular weight and minimum number of ETD reaction fills to achieve optimum sequence coverage, thereby enabling more efficient use of instrument data acquisition time. Finally, real-time scaling of the number of ETD reaction fills during method-based acquisition is shown, and the implications for LC-MS/MS top-down analysis are discussed.
NASA Technical Reports Server (NTRS)
Clancy, R. T.; Lee, Steven W.
1991-01-01
The present analysis of emission-phase function (EPF) observations from the IR thermal mapper aboard the Viking Orbiter encompasses polar latitudes and Viking Lander sites, and spans a wide range of solar longitudes. A multiple scattering radiative transfer model which incorporates a bidirectional phase function for the surface and atmospheric scattering by dust and clouds yields surface albedos and dust and ice optical properties and optical depths for the variety of Mars conditions. It is possible to fit all analyzed EPF sequences corresponding to dust scattering with an albedo of 0.92, rather than the 0.86 given by Pollack et al. on the basis of Viking Lander observations.
Morrison, Heather; Roscoe, Eileen M; Atwell, Amy
2011-01-01
We evaluated antecedent exercise for treating the automatically reinforced problem behavior of 4 individuals with autism. We conducted preference assessments to identify leisure and exercise items that were associated with high levels of engagement and low levels of problem behavior. Next, we conducted three 3-component multiple-schedule sequences: an antecedent-exercise test sequence, a noncontingent leisure-item control sequence, and a social-interaction control sequence. Within each sequence, we used a 3-component multiple schedule to evaluate preintervention, intervention, and postintervention effects. Problem behavior decreased during the postintervention component relative to the preintervention component for 3 of the 4 participants during the exercise-item assessment; however, the effects could not be attributed solely to exercise for 1 of these participants. PMID:21941383
deFUME: Dynamic exploration of functional metagenomic sequencing data.
van der Helm, Eric; Geertz-Hansen, Henrik Marcus; Genee, Hans Jasper; Malla, Sailesh; Sommer, Morten Otto Alexander
2015-07-31
Functional metagenomic selections represent a powerful technique that is widely applied for identification of novel genes from complex metagenomic sources. However, whereas hundreds to thousands of clones can be easily generated and sequenced over a few days of experiments, analyzing the data is time consuming and constitutes a major bottleneck for experimental researchers in the field. Here we present the deFUME web server, an easy-to-use web-based interface for processing, annotation and visualization of functional metagenomics sequencing data, tailored to meet the requirements of non-bioinformaticians. The web-server integrates multiple analysis steps into one single workflow: read assembly, open reading frame prediction, and annotation with BLAST, InterPro and GO classifiers. Analysis results are visualized in an online dynamic web-interface. The deFUME webserver provides a fast track from raw sequence to a comprehensive visual data overview that facilitates effortless inspection of gene function, clustering and distribution. The webserver is available at cbs.dtu.dk/services/deFUME/ and the source code is distributed at github.com/EvdH0/deFUME.
Tettelin, Hervé; Masignani, Vega; Cieslewicz, Michael J.; Donati, Claudio; Medini, Duccio; Ward, Naomi L.; Angiuoli, Samuel V.; Crabtree, Jonathan; Jones, Amanda L.; Durkin, A. Scott; DeBoy, Robert T.; Davidsen, Tanja M.; Mora, Marirosa; Scarselli, Maria; Margarit y Ros, Immaculada; Peterson, Jeremy D.; Hauser, Christopher R.; Sundaram, Jaideep P.; Nelson, William C.; Madupu, Ramana; Brinkac, Lauren M.; Dodson, Robert J.; Rosovitz, Mary J.; Sullivan, Steven A.; Daugherty, Sean C.; Haft, Daniel H.; Selengut, Jeremy; Gwinn, Michelle L.; Zhou, Liwei; Zafar, Nikhat; Khouri, Hoda; Radune, Diana; Dimitrov, George; Watkins, Kisha; O'Connor, Kevin J. B.; Smith, Shannon; Utterback, Teresa R.; White, Owen; Rubens, Craig E.; Grandi, Guido; Madoff, Lawrence C.; Kasper, Dennis L.; Telford, John L.; Wessels, Michael R.; Rappuoli, Rino; Fraser, Claire M.
2005-01-01
The development of efficient and inexpensive genome sequencing methods has revolutionized the study of human bacterial pathogens and improved vaccine design. Unfortunately, the sequence of a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or for antimicrobial targets. We have generated the genomic sequence of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans. Analysis of these genomes and those available in databases showed that the S. agalactiae species can be described by a pan-genome consisting of a core genome shared by all isolates, accounting for ≈80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes. PMID:16172379
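The pan-genome decomposition described above (a core genome shared by all isolates plus a dispensable genome of partially shared and strain-specific genes) reduces to set operations once each strain is represented as a set of gene families. The sketch below uses made-up strain names and gene identifiers purely for illustration; how genes are grouped into families in the actual study is not covered here.

```python
def partition_pan_genome(genomes):
    """Split gene families into pan, core, dispensable and strain-specific sets.

    genomes: dict mapping strain name -> set of gene-family identifiers.
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)           # every family seen in any strain
    core = set.intersection(*gene_sets)     # families present in all strains
    dispensable = pan - core                # partially shared or strain-specific
    strain_specific = {}
    for strain, genes in genomes.items():
        others = set().union(*(g for s, g in genomes.items() if s != strain))
        strain_specific[strain] = genes - others
    return pan, core, dispensable, strain_specific


# Hypothetical toy input: three strains, six gene families.
toy_genomes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}
pan, core, dispensable, unique = partition_pan_genome(toy_genomes)
for strain, genes in toy_genomes.items():
    print(f"{strain}: core fraction {len(core) / len(genes):.2f}")
```

Tracking how the pan and core set sizes change as strains are added one at a time gives the kind of curve that the abstract's extrapolation is based on.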
Kaján, Győző L; Kajon, Adriana E; Pinto, Alexis Castillo; Bartha, Dániel; Arnberg, Niklas
2017-10-15
A novel human adenovirus was isolated from a pediatric case of acute respiratory disease in Panama City, Panama in 2011. The clinical isolate was initially identified as an intertypic recombinant based on hexon and fiber gene sequencing. Based on the analysis of its complete genome sequence, the novel complex recombinant Human mastadenovirus D (HAdV-D) strain was classified into a new HAdV type: HAdV-84, and it was designated Adenovirus D human/PAN/P309886/2011/84[P43H17F84]. HAdV-D types usually possess an ocular or gastrointestinal tropism, and respiratory association is scarcely reported. The virus has a novel fiber type, most closely related to, but still clearly distant from that of HAdV-36. The predicted fiber is hypothesised to bind sialic acid with lower affinity compared to HAdV-37. Bioinformatic analysis of the complete genomic sequence of HAdV-84 revealed multiple homologous recombination events and provided deeper insight into HAdV evolution. Copyright © 2017 Elsevier B.V. All rights reserved.
Robust analysis of semiparametric renewal process models
Lin, Feng-Chang; Truong, Young K.; Fine, Jason P.
2013-01-01
Summary A rate model is proposed for a modulated renewal process comprising a single long sequence, where the covariate process may not capture the dependencies in the sequence as in standard intensity models. We consider partial likelihood-based inferences under a semiparametric multiplicative rate model, which has been widely studied in the context of independent and identical data. Under an intensity model, gap times in a single long sequence may be used naively in the partial likelihood with variance estimation utilizing the observed information matrix. Under a rate model, the gap times cannot be treated as independent and studying the partial likelihood is much more challenging. We employ a mixing condition in the application of limit theory for stationary sequences to obtain consistency and asymptotic normality. The estimator's variance is quite complicated owing to the unknown gap times dependence structure. We adapt block bootstrapping and cluster variance estimators to the partial likelihood. Simulation studies and an analysis of a semiparametric extension of a popular model for neural spike train data demonstrate the practical utility of the rate approach in comparison with the intensity approach. PMID:24550568
Exact Performance Analysis of Two Distributed Processes with Multiple Synchronization Points.
1987-05-01
...number of processes with straight-line sequences of semaphore operations. We use the geometric model for performance analysis, in contrast to proving... Report number: CS-TR-1845. Performing organization: University of Maryland. Monitoring organization: Office of Naval Research.
Validation of a next-generation sequencing assay for clinical molecular oncology.
Cottrell, Catherine E; Al-Kateb, Hussam; Bredemeyer, Andrew J; Duncavage, Eric J; Spencer, David H; Abel, Haley J; Lockwood, Christina M; Hagemann, Ian S; O'Guin, Stephanie M; Burcea, Lauren C; Sawyer, Christopher S; Oschwald, Dayna M; Stratman, Jennifer L; Sher, Dorie A; Johnson, Mark R; Brown, Justin T; Cliften, Paul F; George, Bijoy; McIntosh, Leslie D; Shrivastava, Savita; Nguyen, Tudung T; Payton, Jacqueline E; Watson, Mark A; Crosby, Seth D; Head, Richard D; Mitra, Robi D; Nagarajan, Rakesh; Kulkarni, Shashikant; Seibert, Karen; Virgin, Herbert W; Milbrandt, Jeffrey; Pfeifer, John D
2014-01-01
Currently, oncology testing includes molecular studies and cytogenetic analysis to detect genetic aberrations of clinical significance. Next-generation sequencing (NGS) allows rapid analysis of multiple genes for clinically actionable somatic variants. The WUCaMP assay uses targeted capture for NGS analysis of 25 cancer-associated genes to detect mutations at actionable loci. We present clinical validation of the assay and a detailed framework for design and validation of similar clinical assays. Deep sequencing of 78 tumor specimens (≥ 1000× average unique coverage across the capture region) achieved high sensitivity for detecting somatic variants at low allele fraction (AF). Validation revealed sensitivities and specificities of 100% for detection of single-nucleotide variants (SNVs) within coding regions, compared with SNP array sequence data (95% CI = 83.4-100.0 for sensitivity and 94.2-100.0 for specificity) or whole-genome sequencing (95% CI = 89.1-100.0 for sensitivity and 99.9-100.0 for specificity) of HapMap samples. Sensitivity for detecting variants at an observed 10% AF was 100% (95% CI = 93.2-100.0) in HapMap mixes. Analysis of 15 masked specimens harboring clinically reported variants yielded concordant calls for 13/13 variants at AF of ≥ 15%. The WUCaMP assay is a robust and sensitive method to detect somatic variants of clinical significance in molecular oncology laboratories, with reduced time and cost of genetic analysis allowing for strategic patient management. Copyright © 2014 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
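Intervals such as "100% (95% CI = 93.2-100.0)" arise whenever every one of n expected variants is detected. The abstract does not state which interval method was used; the sketch below assumes an exact (Clopper-Pearson) binomial interval, whose lower bound has a closed form in the all-detected case. The function name and the sample size used in the example are hypothetical.

```python
def exact_lower_bound_all_detected(n, confidence=0.95):
    """Lower Clopper-Pearson bound when all n trials succeed.

    With x = n successes the upper bound is 1 and the lower bound
    solves p**n = alpha / 2, i.e. p = (alpha / 2) ** (1 / n).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


# Hypothetical sample size, not taken from the study.
lower = exact_lower_bound_all_detected(20)
```

The bound tightens as n grows, which is why an observed sensitivity of 100% can still carry a noticeably wide interval when only a few dozen variants were tested.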
ERIC Educational Resources Information Center
Rau, M. A.; Aleven, V.; Rummel, N.; Pardos, Z.
2014-01-01
Providing learners with multiple representations of learning content has been shown to enhance learning outcomes. When multiple representations are presented across consecutive problems, we have to decide in what sequence to present them. Prior research has demonstrated that interleaving "tasks types" (as opposed to blocking them) can…
Miao, L X; Jiang, M; Zhang, Y C; Yang, X F; Zhang, H Q; Zhang, Z F; Wang, Y Z; Jiang, G H
2016-08-05
The MLO (powdery mildew locus O) gene family is important in resistance to powdery mildew (PM). In this study, all of the members of the MLO family were identified and analyzed in the strawberry (Fragaria vesca) genome. The strawberry genome contains at least 20 members of the MLO family; the predicted proteins ranged from 171 to 1485 amino acids, and the genes contained 0-34 introns. Chromosomal localization showed that the MLOs were unevenly distributed on each of the chromosomes, except for chromosome 4. The greatest number of MLOs (seven) was found on chromosome 3. A phylogenetic tree showed that the MLOs were divided into seven groups (I-VII), four of which consisted of MLOs from strawberry, Arabidopsis thaliana, rice, and maize, suggesting that these genes may have evolved after the divergence of monocots and dicots. Multiple sequence alignment showed that strawberry MLO candidates related to powdery mildew resistance possessed seven highly conserved transmembrane domains, a calmodulin-binding domain, and two conserved regions, all of which are important domains for powdery mildew resistance genes. Expressed sequence tag analysis revealed that the MLOs were induced by multiple abiotic stressors, including low and high temperature, drought, and high salinity. These findings will contribute to the functional characterization of MLOs related to PM susceptibility, and will assist in the development of disease resistance in strawberries.
Huang, Yu-Feng; Midha, Mohit; Chen, Tzu-Han; Wang, Yu-Tai; Smith, David Glenn; Pei, Kurtis Jai-Chyi; Chiu, Kuo Ping
2015-01-01
The Taiwanese (Formosan) macaque (Macaca cyclopis) is the only nonhuman primate endemic to Taiwan. This primate species is valuable for evolutionary studies and as a subject in medical research. However, only partial fragments of the mitochondrial genome (mitogenome) of this primate species have been sequenced, let alone its nuclear genome. We employed next-generation sequencing to generate 2 x 90 bp paired-end reads, followed by reference-assisted de novo assembly with a multiple k-mer strategy to characterize the M. cyclopis mitogenome. We compared the assembled mitogenome with that of other macaque species for phylogenetic analysis. Our results show that the M. cyclopis mitogenome consists of 16,563 nucleotides encoding 13 protein-coding genes, 2 ribosomal RNAs and 22 transfer RNAs. Phylogenetic analysis indicates that M. cyclopis is most closely related to M. mulatta lasiota (Chinese rhesus macaque), supporting the notion of an Asia-continental origin of M. cyclopis proposed in previous studies based on partial mitochondrial sequences. Our work presents a novel approach for assembling a mitogenome that utilizes the capabilities of de novo genome assembly with the assistance of a reference genome. The availability of the complete Taiwanese macaque mitogenome will facilitate the study of primate evolution and the characterization of genetic variations for the potential use of this species as a non-human primate model for medical research.
NASA Astrophysics Data System (ADS)
Stanley, Daniel Jean
1993-01-01
Petrological analysis of geological sections in St. Croix in the Caribbean, the Niesenflysch in Switzerland and the Annot Sandstone in the French Maritime Alps sheds light on multiple process transport in deep marine settings. A model depicting a turbidite-to-contourite continuum of stratal types is applied to these three rock units. Recognition of a diverse suite of bedforms, coupled with analysis of paleocurrents, helps to better interpret depositional origin and basin paleogeography. The St. Croix strata record emplacement by gravity flows and, subsequently, by bottom currents flowing parallel to the base of slope; these sediments accumulated on a lower slope apron. A Niesenflysch section in the Swiss Alps west of Adelboden includes turbidites which were deposited at fairly regular intervals beyond the base of slope, in a setting more distal than that of the St. Croix sequences. Most of these turbidites appear to have been partially reworked by bottom currents related to basin circulation or to density flows from the basin margins. In the Annot Sandstone, reworked turbidites (termed transitional variants) and packets of entirely rippled strata are observed in submarine fan and slope sequences in the Peira-Cava area. In contrast to those in St. Croix and the Niesenflysch, the current-emplaced deposits of the Annot Sandstone are directly associated with fan-valley deposits. Such rippled strata in channels are deposits of gravity flow origin which were subsequently reworked downslope by currents generated by successive gravity flows; they also occur on levees by overbank flow. Consideration of multiple process transport is of special help to interpret sections which are poorly exposed, or which can be examined in cores, or which are located in sequences that have been highly deformed structurally.
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration
Kamran, Fariha; Andrade, Anenisia C.; Nella, Aikaterini A.; Clokie, Samuel J.; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey
2015-01-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age–down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3′-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth. PMID:25866874
Evidence That Up-Regulation of MicroRNA-29 Contributes to Postnatal Body Growth Deceleration.
Kamran, Fariha; Andrade, Anenisia C; Nella, Aikaterini A; Clokie, Samuel J; Rezvani, Geoffrey; Nilsson, Ola; Baron, Jeffrey; Lui, Julian C
2015-06-01
Body growth is rapid in infancy but subsequently slows and eventually ceases due to a progressive decline in cell proliferation that occurs simultaneously in multiple organs. We previously showed that this decline in proliferation is driven in part by postnatal down-regulation of a large set of growth-promoting genes in multiple organs. We hypothesized that this growth-limiting genetic program is orchestrated by microRNAs (miRNAs). Bioinformatic analysis identified target sequences of the miR-29 family of miRNAs to be overrepresented in age-down-regulated genes. Concomitantly, expression microarray analysis in mouse kidney and lung showed that all members of the miR-29 family, miR-29a, -b, and -c, were strongly up-regulated from 1 to 6 weeks of age. Real-time PCR confirmed that miR-29a, -b, and -c were up-regulated with age in liver, kidney, lung, and heart, and their expression levels were higher in hepatocytes isolated from 5-week-old mice than in hepatocytes from embryonic mouse liver at embryonic day 16.5. We next focused on 3 predicted miR-29 target genes (Igf1, Imp1, and Mest), all of which are growth-promoting. A 3'-untranslated region containing the predicted target sequences from each gene was placed individually in a luciferase reporter construct. Transfection of miR-29 mimics suppressed luciferase gene activity for all 3 genes, and this suppression was diminished by mutating the target sequences, suggesting that these genes are indeed regulated by miR-29. Taken together, the findings suggest that up-regulation of miR-29 during juvenile life drives the down-regulation of multiple growth-promoting genes, thus contributing to physiological slowing and eventual cessation of body growth.
Different propagation speeds of recalled sequences in plastic spiking neural networks
NASA Astrophysics Data System (ADS)
Huang, Xuhui; Zheng, Zhigang; Hu, Gang; Wu, Si; Rasch, Malte J.
2015-03-01
Neural networks can generate spatiotemporal patterns of spike activity. Sequential activity learning and retrieval have been observed in many brain areas and are crucial, for example, for coding of episodic memory in the hippocampus or for generating temporal patterns during song production in birds. In a recent study, a sequential activity pattern was directly entrained onto the neural activity of the primary visual cortex (V1) of rats and subsequently successfully recalled by a local and transient trigger. It was observed that the speed of activity propagation in coordinates of the retinotopically organized neural tissue was constant during retrieval regardless of how the speed of light stimulation sweeping across the visual field during training was varied. It is well known that spike-timing dependent plasticity (STDP) is a potential mechanism for embedding temporal sequences into neural network activity. How training and retrieval speeds relate to each other and how network and learning parameters influence retrieval speeds, however, is not well described. We here theoretically analyze sequential activity learning and retrieval in a recurrent neural network with realistic synaptic short-term dynamics and STDP. Testing multiple STDP rules, we confirm that sequence learning can be achieved by STDP. However, we found that a multiplicative nearest-neighbor (NN) weight update rule generated weight distributions and recall activities that best matched the experiments in V1. Using network simulations and mean-field analysis, we further investigated the learning mechanisms and the influence of network parameters on recall speeds. Our analysis suggests that a multiplicative STDP rule with dominant NN spike interaction might be implemented in V1 since recall speed was almost constant in an NMDA-dominant regime. Interestingly, in an AMPA-dominant regime, neural circuits might exhibit recall speeds that instead follow the change in stimulus speeds. This prediction could be tested in experiments.
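To make the idea of a multiplicative nearest-neighbor STDP rule concrete, here is a minimal sketch of one common textbook form, in which potentiation scales with the remaining headroom (w_max - w) and depression scales with the current weight w. The abstract does not give the exact rule or parameters used in the study, so the functional form, the function names and all constants below are assumptions for illustration.

```python
import math


def multiplicative_stdp(w, dt_ms, w_max=1.0, a_plus=0.01, a_minus=0.012,
                        tau_plus=20.0, tau_minus=20.0):
    """Update weight w for one pre/post spike pair.

    dt_ms = t_post - t_pre. Positive dt (pre before post) potentiates,
    negative dt depresses. All constants are hypothetical.
    """
    if dt_ms > 0:
        dw = a_plus * (w_max - w) * math.exp(-dt_ms / tau_plus)
    elif dt_ms < 0:
        dw = -a_minus * w * math.exp(dt_ms / tau_minus)
    else:
        dw = 0.0
    return min(max(w + dw, 0.0), w_max)  # keep the weight in [0, w_max]


def nearest_pre_spike(post_time, pre_times):
    """Nearest-neighbor pairing: the latest pre spike before post_time."""
    earlier = [t for t in pre_times if t < post_time]
    return max(earlier) if earlier else None


# One post spike at 10 ms paired only with its nearest preceding pre spike.
t_pre = nearest_pre_spike(10.0, [1.0, 4.0, 12.0])
w_new = multiplicative_stdp(0.5, 10.0 - t_pre)
```

Because each update is proportional to the distance from a bound, weights under such a rule tend to settle into a unimodal distribution rather than saturating at the extremes, which is the kind of property the abstract compares against the V1 experiments.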
Ray Wu as Fifth Business: Deconstructing collective memory in the history of DNA sequencing.
Onaga, Lisa A
2014-06-01
The concept of 'Fifth Business' is used to analyze a minority standpoint and bring serious attention to the role of scientists who play a galvanizing role in a science but for multiple reasons appear less prominently in more common recounts of any particular development. Biochemist Ray Wu (1928-2008) published a DNA sequencing experiment in March 1970 using DNA polymerase catalysis and specific nucleotide labeling, both of which are foundational to general sequencing methods today. The scant mention of Wu's work in textbooks, research articles, and other accounts of DNA sequencing calls into question how scientific collective memory forms. This alternative history seeks to understand why a key figure in nucleic acid sequence analysis has remained less visibly connected or peripheral to solidifying narratives about the history of DNA sequencing. The study resists predictable dismissals of Wu's work in order to seriously examine the formation of his nucleic acid sequence analysis research program and how he shared his knowledge of sequencing during a period of rapid advancement in the field. An analysis of Wu's work on sequencing the cohesive ends of lambda bacteriophage in the 1960s and 1970s exemplifies how a variety of individuals and groups attempted to develop protocols for sequencing the order of nucleotide base pairs comprising DNA. This historical examination of the sociality of scientific research suggests a way to understand how Wu and others contributed to the very collective memory of DNA sequencing that Wu eventually tried to repair. The study of Wu, who was a Chinese immigrant to the United States, provides a foundation for further critical scholarship on the heterogeneous histories of Asian American bioscientists, the sociality of their scientific works, and how the resulting knowledge produced is preserved, if not evenly, in a scientific field's collective memory. Copyright © 2014 Elsevier Ltd. All rights reserved.
Park, Bongsoo; Park, Jongsun; Cheong, Kyeong-Chae; Choi, Jaeyoung; Jung, Kyongyong; Kim, Donghan; Lee, Yong-Hwan; Ward, Todd J.; O'Donnell, Kerry; Geiser, David M.; Kang, Seogchan
2011-01-01
The fungal genus Fusarium includes many plant and/or animal pathogenic species and produces diverse toxins. Although accurate species identification is critical for managing such threats, it is difficult to identify Fusarium morphologically. Fortunately, extensive molecular phylogenetic studies, founded on well-preserved culture collections, have established a robust foundation for Fusarium classification. Genomes of four Fusarium species have been published with more being currently sequenced. The Cyber infrastructure for Fusarium (CiF; http://www.fusariumdb.org/) was built to support archiving and utilization of rapidly increasing data and knowledge and consists of Fusarium-ID, Fusarium Comparative Genomics Platform (FCGP) and Fusarium Community Platform (FCP). The Fusarium-ID archives phylogenetic marker sequences from most known species along with information associated with characterized isolates and supports strain identification and phylogenetic analyses. The FCGP currently archives five genomes from four species. Besides supporting genome browsing and analysis, the FCGP presents computed characteristics of multiple gene families and functional groups. The Cart/Favorite function allows users to collect sequences from Fusarium-ID and the FCGP and analyze them later using multiple tools without requiring repeated copying-and-pasting of sequences. The FCP is designed to serve as an online community forum for sharing and preserving accumulated experience and knowledge to support future research and education. PMID:21087991
Arent, Z; Frizzell, C; Gilmore, C; Allen, A; Ellis, W A
2016-07-15
Strains of Leptospira interrogans belonging to two very closely related serovars - Bratislava and Muenchen - have been associated with disease in domestic animals, in particular pigs, but also in horses and dogs. Similar strains have also been recovered from various wildlife species. Their epidemiology is poorly understood. Two hundred and forty-seven such isolates, from UK domestic animal and wildlife species, were examined by restriction endonuclease analysis (REA) in an attempt to elucidate their epidemiology. A representative sub-sample of 65 of these isolates was further examined by multiple-locus variable-number tandem repeat analysis (MLVA) and 22 by secY sequencing. Ten restriction pattern types were identified. The majority of isolates fell into one of three REA pattern types designated B2a, B2b and M2a. B2a was ubiquitous: it was isolated from 10 species and represented the majority of the horse isolates and all dog isolates. B2b was very different, being isolated only from pigs, indicating that this type was maintained by pigs. The pattern M2a was reported for the majority of isolates from pigs but was also common in small rodent isolates. Five restriction pattern types were found only in wildlife, suggesting that they are unlikely to pose a disease threat to domestic animals. MLVA identified six clusters. The REA types B2a and B2b were all found in one MLVA cluster, while the majority of the M2a strains examined occurred in another cluster. The secY sequencing detected only one sequence type, which clustered with other serovars of Leptospira interrogans. Copyright © 2016 Elsevier B.V. All rights reserved.
MicRhoDE: a curated database for the analysis of microbial rhodopsin diversity and evolution
Boeuf, Dominique; Audic, Stéphane; Brillet-Guéguen, Loraine; Caron, Christophe; Jeanthon, Christian
2015-01-01
Microbial rhodopsins are a diverse group of photoactive transmembrane proteins found in all three domains of life and in viruses. Today, microbial rhodopsin research is a flourishing research field in which new understandings of rhodopsin diversity, function and evolution are contributing to broader microbiological and molecular knowledge. Here, we describe MicRhoDE, a comprehensive, high-quality and freely accessible database that facilitates analysis of the diversity and evolution of microbial rhodopsins. Rhodopsin sequences isolated from a vast array of marine and terrestrial environments were manually collected and curated. To each rhodopsin sequence are associated related metadata, including predicted spectral tuning of the protein, putative activity and function, taxonomy for sequences that can be linked to a 16S rRNA gene, sampling date and location, and supporting literature. The database currently covers 7857 aligned sequences from more than 450 environmental samples or organisms. Based on a robust phylogenetic analysis, we introduce an operational classification system with multiple phylogenetic levels ranging from superclusters to species-level operational taxonomic units. An integrated pipeline for online sequence alignment and phylogenetic tree construction is also provided. With a user-friendly interface and integrated online bioinformatics tools, this unique resource should be highly valuable for upcoming studies of the biogeography, diversity, distribution and evolution of microbial rhodopsins. Database URL: http://micrhode.sb-roscoff.fr. PMID:26286928
The EMBL-EBI bioinformatics web and programmatic tools framework.
Li, Weizhong; Cowley, Andrew; Uludag, Mahmut; Gur, Tamer; McWilliam, Hamish; Squizzato, Silvano; Park, Young Mi; Buso, Nicola; Lopez, Rodrigo
2015-07-01
Since 2009 the EMBL-EBI Job Dispatcher framework has provided free access to a range of mainstream sequence analysis applications. These include sequence similarity search services (https://www.ebi.ac.uk/Tools/sss/) such as BLAST, FASTA and PSI-Search, multiple sequence alignment tools (https://www.ebi.ac.uk/Tools/msa/) such as Clustal Omega, MAFFT and T-Coffee, and other sequence analysis tools (https://www.ebi.ac.uk/Tools/pfa/) such as InterProScan. Through these services users can search mainstream sequence databases such as ENA, UniProt and Ensembl Genomes, either through a uniform web interface or systematically through Web Services interfaces (https://www.ebi.ac.uk/Tools/webservices/) using common programming languages, and obtain enriched results with novel visualisations. Integration with EBI Search (https://www.ebi.ac.uk/ebisearch/) and the dbfetch retrieval service (https://www.ebi.ac.uk/Tools/dbfetch/) further expands the usefulness of the framework. New tools and updates such as NCBI BLAST+, InterProScan 5 and PfamScan, new categories such as RNA analysis tools (https://www.ebi.ac.uk/Tools/rna/), new databases such as ENA non-coding, WormBase ParaSite, Pfam and Rfam, and new workflow methods, together with the retirement of deprecated services, ensure that the framework remains relevant to today's biological community. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
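As an illustration of programmatic access, the sketch below only builds a retrieval URL for the dbfetch service mentioned above; it does not send any request. The endpoint path under the dbfetch base URL and the query parameter names (`db`, `id`, `format`) are assumptions here and should be checked against the current EMBL-EBI documentation; the helper name and the example accession are likewise hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint path beneath the dbfetch URL given in the abstract.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def dbfetch_url(db, entry_id, fmt="fasta"):
    """Build (but do not fetch) a dbfetch-style retrieval URL.

    Parameter names are assumptions, not a confirmed API contract.
    """
    query = urlencode({"db": db, "id": entry_id, "format": fmt})
    return f"{DBFETCH_BASE}?{query}"


# Illustrative database name and accession only.
url = dbfetch_url("uniprotkb", "P12345")
```

Keeping URL construction separate from the network call makes the request easy to inspect and test before anything is sent to the service.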
Identification of MHC class I sequences in Chinese-origin rhesus macaques
Karl, Julie A.; Wiseman, Roger W.; Campbell, Kevin J.; Blasky, Alex J.; Hughes, Austin L.; Ferguson, Betsy; Read, Daniel S.
2010-01-01
The rhesus macaque (Macaca mulatta) is an excellent model for human disease and vaccine research. Two populations exhibiting distinctive morphological and physiological characteristics, Indian- and Chinese-origin rhesus macaques, are commonly used in research. Genetic analysis has focused on the Indian macaque population, but the accessibility of these animals for research is limited. Due to their greater availability, Chinese rhesus macaques are now being used more frequently, particularly in vaccine and biodefense studies, although relatively little is known about their immunogenetics. In this study, we discovered major histocompatibility complex (MHC) class I cDNAs in 12 Chinese rhesus macaques and detected 41 distinct Mamu-A and Mamu-B sequences. Twenty-seven of these class I cDNAs were novel, while six and eight of these sequences were previously reported in Chinese and Indian rhesus macaques, respectively. We then performed microsatellite analysis on DNA from these 12 animals, as well as an additional 18 animals, and developed sequence specific primer PCR (PCR-SSP) assays for eight cDNAs found in multiple animals. We also examined our cohort for potential admixture of Chinese and Indian origin animals using a recently developed panel of single nucleotide polymorphisms (SNPs). The discovery of 27 novel MHC class I sequences in this analysis underscores the genetic diversity of Chinese rhesus macaques and contributes reagents that will be valuable for studying cellular immunology in this population. PMID:18097659
Sun, Xiaoyong; Wang, Lin; Ding, Jiechao; Wang, Yanru; Wang, Jiansheng; Zhang, Xiaoyang; Che, Yulei; Liu, Ziwei; Zhang, Xinran; Ye, Jiazhen; Wang, Jie; Sablok, Gaurav; Deng, Zhiping; Zhao, Hongwei
2016-10-01
A new regulatory class of small endogenous RNAs called circular RNAs (circRNAs) has been described as miRNA sponges in animals. Using 16 Arabidopsis thaliana RNA-Seq data sets, we identified 803 circRNAs in RNase R-/non-RNase R-treated samples. The results revealed the following features: Canonical and noncanonical splicing can generate circRNAs; chloroplasts are a hotspot for circRNA generation; furthermore, limited complementary sequences exist not only in introns, but also in the sequences flanking splice sites. The latter finding suggests that multiple combinations between complementary sequences may facilitate the formation of the circular structure. Our results contribute to a better understanding of this novel class of plant circRNAs. © 2016 Federation of European Biochemical Societies.
Evol and ProDy for bridging protein sequence evolution and structural dynamics.
Bakan, Ahmet; Dutta, Anindita; Mao, Wenzhi; Liu, Ying; Chennubhotla, Chakra; Lezon, Timothy R; Bahar, Ivet
2014-09-15
Correlations between sequence evolution and structural dynamics are of utmost importance in understanding the molecular mechanisms of function and their evolution. We have integrated Evol, a new package for fast and efficient comparative analysis of evolutionary patterns and conformational dynamics, into ProDy, a computational toolbox designed for inferring protein dynamics from experimental and theoretical data. Using information-theoretic approaches, Evol coanalyzes conservation and coevolution profiles extracted from multiple sequence alignments of protein families with their inferred dynamics. ProDy and Evol are open-source and freely available under MIT License from http://prody.csb.pitt.edu/. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
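A simple information-theoretic conservation measure of the kind alluded to above is the Shannon entropy of each alignment column. The sketch below uses only the Python standard library and does not call the ProDy or Evol API; the function names and the toy alignment are hypothetical, and ignoring gap characters is a simplifying assumption.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(alignment):
    """Per-column entropies for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment, for illustration only.
alignment = ["MKVL", "MRVL", "MKIL", "MRAL"]
entropies = alignment_entropies(alignment)
```

Fully conserved columns score zero and more variable columns score higher; coevolution measures such as mutual information extend the same counting idea to pairs of columns.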
NASA Astrophysics Data System (ADS)
Cho, Hoonkyung; Chun, Joohwan; Song, Sungchan
2016-09-01
Tracking dim moving targets in infrared image sequences in the presence of high clutter and noise has recently been under intensive investigation. The track-before-detect (TBD) algorithm, which processes the image sequence over a number of frames before deciding on the target track and existence, is known to be especially attractive in very low SNR environments (⩽ 3 dB). In this paper, we briefly present a three-dimensional (3-D) TBD with dynamic programming (TBD-DP) algorithm using multiple IR image sensors. Since the traditional two-dimensional TBD algorithm cannot track and detect along the viewing direction, we use 3-D TBD with multiple sensors and also rigorously analyze the detection performance (false alarm and detection probabilities) based on the Fisher-Tippett-Gnedenko theorem. The 3-D TBD-DP algorithm, which does not require a separate image registration step, uses the pixel intensity values jointly read off from multiple image frames to compute the merit function required in the DP process. Therefore, we also establish the relationship between the pixel coordinates of the image frame and the reference coordinates.
Li, Hu; Leavengood, John M.; Chapman, Eric G.; Burkhardt, Daniel; Song, Fan; Jiang, Pei; Liu, Jinpeng; Cai, Wanzhi
2017-01-01
Hemiptera, the largest non-holometabolous order of insects, represents approximately 7% of metazoan diversity. With extraordinary life histories and highly specialized morphological adaptations, hemipterans have exploited diverse habitats and food sources through approximately 300 Myr of evolution. To elucidate the phylogeny and evolutionary history of Hemiptera, we carried out the most comprehensive mitogenomics analysis on the richest taxon sampling to date covering all the suborders and infraorders, including 34 newly sequenced and 94 published mitogenomes. With optimized branch length and sequence heterogeneity, Bayesian analyses using a site-heterogeneous mixture model resolved the higher-level hemipteran phylogeny as (Sternorrhyncha, (Auchenorrhyncha, (Coleorrhyncha, Heteroptera))). Ancestral character state reconstruction and divergence time estimation suggest that the success of true bugs (Heteroptera) is probably due to angiosperm coevolution, but key adaptive innovations (e.g. prognathous mouthpart, predatory behaviour, and haemelytron) facilitated multiple independent shifts among diverse feeding habits and multiple independent colonizations of aquatic habitats. PMID:28878063
Saisawang, Chonticha; Ketterman, Albert J.
2014-01-01
Glutathione transferases (GST) are an ancient superfamily comprising a large number of paralogous proteins in a single organism. This multiplicity of GSTs has allowed the copies to diverge for neofunctionalization with proposed roles ranging from detoxication and oxidative stress response to involvement in signal transduction cascades. We performed a comparative genomic analysis using FlyBase annotations and Drosophila melanogaster GST sequences as templates to further annotate the GST orthologs in the 12 Drosophila sequenced genomes. We found that GST genes in the Drosophila subgenera have undergone repeated local duplications followed by transposition, inversion, and micro-rearrangements of these copies. The colinearity and orientations of the orthologous GST genes appear to be unique in many of the species which suggests that genomic rearrangement events have occurred multiple times during speciation. The high micro-plasticity of the genomes appears to have a functional contribution utilized for evolution of this gene family. PMID:25310450
Whole-Genome Sequence Variation among Multiple Isolates of Pseudomonas aeruginosa
Spencer, David H.; Kas, Arnold; Smith, Eric E.; Raymond, Christopher K.; Sims, Elizabeth H.; Hastings, Michele; Burns, Jane L.; Kaul, Rajinder; Olson, Maynard V.
2003-01-01
Whole-genome shotgun sequencing was used to study the sequence variation of three Pseudomonas aeruginosa isolates, two from clonal infections of cystic fibrosis patients and one from an aquatic environment, relative to the genomic sequence of reference strain PAO1. The majority of the PAO1 genome is represented in these strains; however, at least three prominent islands of PAO1-specific sequence are apparent. Conversely, ∼10% of the sequencing reads derived from each isolate fail to align with the PAO1 backbone. While average sequence variation among all strains is roughly 0.5%, regions of pronounced differences were evident in whole-genome scans of nucleotide diversity. We analyzed two such divergent loci, the pyoverdine and O-antigen biosynthesis regions, by complete resequencing. A thorough analysis of isolates collected over time from one of the cystic fibrosis patients revealed independent mutations resulting in the loss of O-antigen synthesis alternating with a mucoid phenotype. Overall, we conclude that most of the PAO1 genome represents a core P. aeruginosa backbone sequence while the strains addressed in this study possess additional genetic material that accounts for at least 10% of their genomes. Approximately half of these additional sequences are novel. PMID:12562802
Multiple DNA and protein sequence alignment on a workstation and a supercomputer.
Tajima, K
1988-11-01
This paper describes a multiple alignment method using a workstation and supercomputer. The method is based on the alignment of a set of aligned sequences with the new sequence, and uses a recursive procedure of such alignment. The alignment is executed in a reasonable computation time on diverse levels from a workstation to a supercomputer, from the viewpoint of alignment results and computational speed by parallel processing. The application of the algorithm is illustrated by several examples of multiple alignment of 12 amino acid and DNA sequences of HIV (human immunodeficiency virus) env genes. Colour graphic programs on a workstation and parallel processing on a supercomputer are discussed.
Version VI of the ESTree db: an improved tool for peach transcriptome analysis
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Merelli, Ivan; Barale, Francesca; Milanesi, Luciano; Stella, Alessandra; Pozzi, Carlo
2008-01-01
Background The ESTree database (db) is a collection of Prunus persica and Prunus dulcis EST sequences that in its current version encompasses 75,404 sequences from 3 almond and 19 peach libraries. Nine peach genotypes and four peach tissues are represented, from four fruit developmental stages. The aim of this work was to implement the already existing ESTree db by adding new sequences and analysis programs. Particular care was given to the implementation of the web interface, that allows querying each of the database features. Results A Perl modular pipeline is the backbone of sequence analysis in the ESTree db project. Outputs obtained during the pipeline steps are automatically arrayed into the fields of a MySQL database. Apart from standard clustering and annotation analyses, version VI of the ESTree db encompasses new tools for tandem repeat identification, annotation against genomic Rosaceae sequences, and positioning on the database of oligomer sequences that were used in a peach microarray study. Furthermore, known protein patterns and motifs were identified by comparison to PROSITE. Based on data retrieved from sequence annotation against the UniProtKB database, a script was prepared to track positions of homologous hits on the GO tree and build statistics on the ontologies distribution in GO functional categories. EST mapping data were also integrated in the database. The PHP-based web interface was upgraded and extended. The aim of the authors was to enable querying the database according to all the biological aspects that can be investigated from the analysis of data available in the ESTree db. This is achieved by allowing multiple searches on logical subsets of sequences that represent different biological situations or features. Conclusions The version VI of ESTree db offers a broad overview on peach gene expression. 
Sequence analyses results contained in the database, extensively linked to external related resources, represent a large amount of information that can be queried via the tools offered in the web interface. Flexibility and modularity of the ESTree analysis pipeline and of the web interface allowed the authors to set up similar structures for different datasets, with limited manual intervention. PMID:18387211
Analysis of MHC class I genes across horse MHC haplotypes
Tallmadge, Rebecca L.; Campbell, Julie A.; Miller, Donald C.; Antczak, Douglas F.
2010-01-01
The genomic sequences of 15 horse Major Histocompatibility Complex (MHC) class I genes and a collection of MHC class I homozygous horses of five different haplotypes were used to investigate the genomic structure and polymorphism of the equine MHC. A combination of conserved and locus-specific primers was used to amplify horse MHC class I genes with classical and non-classical characteristics. Multiple clones from each haplotype identified three to five classical sequences per homozygous animal, and two to three non-classical sequences. Phylogenetic analysis was applied to these sequences and groups were identified which appear to be allelic series, but some sequences were left ungrouped. Sequences determined from MHC class I heterozygous horses and previously described MHC class I sequences were then added, representing a total of ten horse MHC haplotypes. These results were consistent with those obtained from the MHC homozygous horses alone, and 30 classical sequences were assigned to four previously confirmed loci and three new provisional loci. The non-classical genes had few alleles and the classical genes had higher levels of allelic polymorphism. Alleles for two classical loci with the expected pattern of polymorphism were found in the majority of haplotypes tested, but alleles at two other commonly detected loci had more variation outside of the hypervariable region than within. Our data indicate that the equine Major Histocompatibility Complex is characterized by variation in the complement of class I genes expressed in different haplotypes in addition to the expected allelic polymorphism within loci. PMID:20099063
High-Resolution Melt Analysis for Rapid Comparison of Bacterial Community Compositions
Hjelmsø, Mathis Hjort; Hansen, Lars Hestbjerg; Bælum, Jacob; Feld, Louise; Holben, William E.
2014-01-01
In the study of bacterial community composition, 16S rRNA gene amplicon sequencing is today among the preferred methods of analysis. The cost of nucleotide sequence analysis, including requisite computational and bioinformatic steps, however, takes up a large part of many research budgets. High-resolution melt (HRM) analysis is the study of the melt behavior of specific PCR products. Here we describe a novel high-throughput approach in which we used HRM analysis targeting the 16S rRNA gene to rapidly screen multiple complex samples for differences in bacterial community composition. We hypothesized that HRM analysis of amplified 16S rRNA genes from a soil ecosystem could be used as a screening tool to identify changes in bacterial community structure. This hypothesis was tested using a soil microcosm setup exposed to a total of six treatments representing different combinations of pesticide and fertilization treatments. The HRM analysis identified a shift in the bacterial community composition in two of the treatments, both including the soil fumigant Basamid GR. These results were confirmed with both denaturing gradient gel electrophoresis (DGGE) analysis and 454-based 16S rRNA gene amplicon sequencing. HRM analysis was shown to be a fast, high-throughput technique that can serve as an effective alternative to gel-based screening methods to monitor microbial community composition. PMID:24610853
Selvan, A. Sakthivel; Gupta, I. D.; Verma, A.; Chaudhari, M. V.; Magotra, A.
2016-01-01
Aim: The present study was undertaken to characterize and analyze combined genotypes of the cluster of differentiation 14 (CD14) gene and to explore its association with clinical mastitis in Karan Fries (KF) cows maintained in the National Dairy Research Institute herd, Karnal. Materials and Methods: Genomic DNA was extracted from the blood of 94 randomly selected lactating KF cattle by the phenol-chloroform method. After checking its quality and quantity, polymerase chain reaction (PCR) was carried out using six sets of reported gene-specific primers to amplify the complete KF CD14 gene. The forward and reverse sequences for each PCR fragment were assembled to form the complete sequence for the respective region of the KF CD14 gene. Multiple sequence alignments of the edited sequences with the corresponding reported Bos taurus reference sequence (EU148610.1) were performed with ClustalW software to identify single nucleotide polymorphisms (SNPs). Basic Local Alignment Search Tool analysis was performed to compare the sequence identity of the KF CD14 gene with other species. Restriction fragment length polymorphism (RFLP) analysis was carried out in all KF cows using the Helicobacter pylori 188I (Hpy188I) (contig 2) and Haemophilus influenzae I (HinfI) (contig 4) restriction enzymes (RE). Cows were assigned genotypes obtained by PCR-RFLP analysis, and the association study was done using the Chi-square (χ²) test. The genotypes of contigs (loci) 2 and 4 were combined for each animal to construct combined genotype patterns. Results: Two types of KF sequences were obtained: one of 2630 bp with one insertion at nucleotide (nt) position 616 and one deletion at nt position 1117, and another of 2629 bp with only one deletion at nt position 615. ClustalW multiple alignment of the KF CD14 gene sequence with the B. taurus cattle sequence (EU148610.1) revealed 24 nt changes (SNPs).
Cows were also screened using PCR-RFLP with Hpy188I (contig 2) and HinfI (contig 4) RE, which revealed three genotypes each that differed significantly regarding mastitis incidence. The two loci can form at most nine combined genotype patterns, of which only eight were observed: AACC, AACD, AADD, ABCD, ABDD, BBCC, BBCD, and BBDD. The combined genotype ABCC was not observed in the studied population of KF cows. Out of 94 animals, AACD combined genotype animals (10.63%) were found to be not affected with mastitis, and ABDD combined genotype animals were observed to have the highest mastitis incidence of 15.96%. Conclusion: AACD typed cows were found to be least susceptible to mastitis incidence as compared to other combined genotypes. PMID:27536026
Selvan, A Sakthivel; Gupta, I D; Verma, A; Chaudhari, M V; Magotra, A
2016-07-01
The present study was undertaken to characterize and analyze combined genotypes of the cluster of differentiation 14 (CD14) gene and to explore its association with clinical mastitis in Karan Fries (KF) cows maintained in the National Dairy Research Institute herd, Karnal. Genomic DNA was extracted from the blood of 94 randomly selected lactating KF cattle by the phenol-chloroform method. After checking its quality and quantity, polymerase chain reaction (PCR) was carried out using six sets of reported gene-specific primers to amplify the complete KF CD14 gene. The forward and reverse sequences for each PCR fragment were assembled to form the complete sequence for the respective region of the KF CD14 gene. Multiple sequence alignments of the edited sequences with the corresponding reported Bos taurus reference sequence (EU148610.1) were performed with ClustalW software to identify single nucleotide polymorphisms (SNPs). Basic Local Alignment Search Tool analysis was performed to compare the sequence identity of the KF CD14 gene with other species. Restriction fragment length polymorphism (RFLP) analysis was carried out in all KF cows using the Helicobacter pylori 188I (Hpy188I) (contig 2) and Haemophilus influenzae I (HinfI) (contig 4) restriction enzymes (RE). Cows were assigned genotypes obtained by PCR-RFLP analysis, and the association study was done using the Chi-square (χ²) test. The genotypes of contigs (loci) 2 and 4 were combined for each animal to construct combined genotype patterns. Two types of KF sequences were obtained: one of 2630 bp with one insertion at nucleotide (nt) position 616 and one deletion at nt position 1117, and another of 2629 bp with only one deletion at nt position 615. ClustalW multiple alignment of the KF CD14 gene sequence with the B. taurus cattle sequence (EU148610.1) revealed 24 nt changes (SNPs).
Cows were also screened using PCR-RFLP with Hpy188I (contig 2) and HinfI (contig 4) RE, which revealed three genotypes each that differed significantly regarding mastitis incidence. The two loci can form at most nine combined genotype patterns, of which only eight were observed: AACC, AACD, AADD, ABCD, ABDD, BBCC, BBCD, and BBDD. The combined genotype ABCC was not observed in the studied population of KF cows. Out of 94 animals, AACD combined genotype animals (10.63%) were found to be not affected with mastitis, and ABDD combined genotype animals were observed to have the highest mastitis incidence of 15.96%. AACD typed cows were found to be least susceptible to mastitis incidence as compared to other combined genotypes.
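The count of nine possible combined genotypes follows from pairing three genotypes at one locus with three at the other. The short sketch below enumerates them and recovers the unobserved pattern. Assigning the A/B alleles to contig 2 and the C/D alleles to contig 4 is an assumption read off the order of letters in the combined genotype names, and the variable names are illustrative only.

```python
from itertools import product

# Assumed locus assignment: A/B genotypes at contig 2, C/D genotypes at contig 4.
contig2_genotypes = ["AA", "AB", "BB"]
contig4_genotypes = ["CC", "CD", "DD"]

# Every pairing of one genotype per locus: 3 x 3 = 9 combined patterns.
possible = [g2 + g4 for g2, g4 in product(contig2_genotypes, contig4_genotypes)]

# The eight combined genotypes reported as observed in the abstract.
observed = {"AACC", "AACD", "AADD", "ABCD", "ABDD", "BBCC", "BBCD", "BBDD"}

# Patterns that are possible but were not observed.
missing = sorted(set(possible) - observed)
```

Here `missing` contains the single pattern ABCC, in agreement with the abstract.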
Multi-Harmony: detecting functional specificity from sequence alignment
Brandt, Bernd W.; Feenstra, K. Anton; Heringa, Jaap
2010-01-01
Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help in finding targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww. PMID:20525785
Ventura, Marco; Canchaya, Carlos; Meylan, Valèrie; Klaenhammer, Todd R.; Zink, Ralf
2003-01-01
We analyzed the tuf gene, encoding elongation factor Tu, from 33 strains representing 17 Lactobacillus species and 8 Bifidobacterium species. The tuf sequences were aligned and used to infer phylogenesis among species of lactobacilli and bifidobacteria. We demonstrated that the synonymous substitution affecting this gene renders elongation factor Tu a reliable molecular clock for investigating evolutionary distances of lactobacilli and bifidobacteria. In fact, the phylogeny generated by these tuf sequences is consistent with that derived from 16S rRNA analysis. The investigation of a multiple alignment of tuf sequences revealed regions conserved among strains belonging to the same species but distinct from those of other species. PCR primers complementary to these regions allowed species-specific identification of closely related species, such as Lactobacillus casei group members. These tuf gene-based assays developed in this study provide an alternative to present methods for the identification for lactic acid bacterial species. Since a variable number of tuf genes have been described for bacteria, the presence of multiple genes was examined. Southern analysis revealed one tuf gene in the genomes of lactobacilli and bifidobacteria, but the tuf gene was arranged differently in the genomes of these two taxa. Our results revealed that the tuf gene in bifidobacteria is flanked by the same gene constellation as the str operon, as originally reported for Escherichia coli. In contrast, bioinformatic and transcriptional analyses of the DNA region flanking the tuf gene in four Lactobacillus species indicated the same four-gene unit and suggested a novel tuf operon specific for the genus Lactobacillus. PMID:14602655
Yang, Xiaojun; Wang, Xiaohong; Liang, Zhijuan; Zhang, Xiaoya; Wang, Yanbo; Wang, Zhenhai
2014-05-01
To study the species and abundance of bacteria in sputum of patients with ventilator-associated pneumonia (VAP) by using 16S rDNA sequencing analysis, and to explore a new method for etiologic diagnosis of VAP. Bronchoalveolar lavage sputum samples were collected from 31 patients with VAP. Bacterial DNA was extracted from the samples and identified by polymerase chain reaction (PCR). At the same time, sputum specimens were processed for routine bacterial culture. High-throughput sequencing was performed on PCR-positive samples using 16S rDNA metagenomic sequencing technology, the sequencing results were analyzed using bioinformatics, and the results of sequencing and bacterial culture were then compared. (1) Specific DNA sequences of 550 bp were amplified in sputum specimens from 27 of the 31 patients with VAP, and they were used for sequencing analysis. 103 856 sequences were obtained from those sputum specimens using 16S rDNA sequencing, yielding approximately 39 Mb of raw data. Tag sequencing provided genus-level information for all 27 samples. (2) Alpha-diversity analysis showed that sputum samples of patients with VAP had significantly higher variability and richness in bacterial species (Shannon index values 1.20, Simpson index values 0.48). Rarefaction curve analysis showed that there were more species that were not detected by sequencing from some VAP sputum samples. (3) Analysis of the 27 VAP sputum samples using 16S rDNA sequences yielded four phyla, namely Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria. At the genus level, the dominant genera included Streptococcus 88.9% (24/27), Limnohabitans 77.8% (21/27), Acinetobacter 70.4% (19/27), Sphingomonas 63.0% (17/27), Prevotella 63.0% (17/27), Klebsiella 55.6% (15/27), Pseudomonas 55.6% (15/27), Aquabacterium 55.6% (15/27), and Corynebacterium 55.6% (15/27).
(4) Pyrosequencing revealed that Prevotella, Limnohabitans, Aquabacterium, and Sphingomonas might not be detected by routine bacterial culture. Among the seven genera identified by both methods, pyrosequencing yielded a higher positive rate than ordinary bacterial culture [Streptococcus: 88.9% (24/27) vs. 18.5% (5/27), Klebsiella: 55.6% (15/27) vs. 18.5% (5/27), Acinetobacter: 70.4% (19/27) vs. 37.0% (10/27), Corynebacterium: 55.6% (15/27) vs. 7.4% (2/27), P<0.05 or P<0.01]. The positive rate for Pseudomonas was also higher with sequencing than with culture [55.6% (15/27) vs. 25.9% (7/27), P=0.050]. No significant differences were observed between sequencing and ordinary bacterial culture for the detection of the genera Staphylococcus [7.4% (2/27) vs. 11.1% (3/27)] and Neisseria [18.5% (5/27) vs. 3.7% (1/27), both P>0.05]. 16S rDNA sequencing analysis confirmed that the pathogenic bacteria in sputum of VAP patients were complex, with multiple drug-resistant strains. Compared with routine bacterial culture, pyrosequencing had a higher positive rate in detecting pathogens. 16S rDNA gene sequencing technology may become a new method for etiological diagnosis of VAP.
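For readers unfamiliar with the alpha-diversity values quoted above, the following is a minimal sketch of how Shannon and Simpson indices are commonly computed from per-taxon read counts. The abstract does not state which logarithm base or which Simpson variant was used, so the natural logarithm and the sum-of-squared-proportions form are assumptions here, and the read counts are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    return sum((n / total) * math.log(total / n) for n in counts if n > 0)

def simpson_index(counts):
    """Simpson dominance D = sum(p^2); lower values mean higher diversity."""
    total = sum(counts)
    return sum((n / total) ** 2 for n in counts)

# Hypothetical read counts for three genera in one sample.
reads_per_genus = [50, 30, 20]

h = shannon_index(reads_per_genus)
d = simpson_index(reads_per_genus)
```

Other conventions report 1 - D or 1/D for Simpson and use log base 2 for Shannon, so published values are only comparable when the convention is known.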
MultiSeq: unifying sequence and structure data for evolutionary analysis
Roberts, Elijah; Eargle, John; Wright, Dan; Luthey-Schulten, Zaida
2006-01-01
Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. 
Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software: PMID:16914055
Wan, Shixiang; Zou, Quan
2017-01-01
Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme growth of next-generation sequencing has resulted in a shortage of efficient ultra-large biological sequence alignment approaches that can cope with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files of more than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient tool, HAlign-II, to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Experiments on large-scale DNA and protein data sets, with files of more than 1 GB, showed that HAlign-II could save time and space and that it outperformed current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.
Du, Yushen; Wu, Nicholas C; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting; Sun, Ren
2016-11-01
Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. To fully comprehend the diverse functions of a protein, it is essential to understand the functionality of individual residues. 
Current methods are highly dependent on evolutionary sequence conservation, which is usually limited by sampling size. Sequence conservation-based methods are further confounded by structural constraints and multifunctionality of proteins. Here we present a method that can systematically identify and annotate functional residues of a given protein. We used a high-throughput functional profiling platform to identify essential residues. Coupling it with homologous-structure comparison, we were able to annotate multiple functions of proteins. We demonstrated the method with the PB1 protein of influenza A virus and identified novel functional residues in addition to its canonical function as an RNA-dependent RNA polymerase. Not limited to virology, this method is generally applicable to other proteins that can be functionally selected and about which homologous-structure information is available. Copyright © 2016 Du et al.
Evolutionary genetics of insect innate immunity.
Viljakainen, Lumi
2015-11-01
Patterns of evolution in immune defense genes help to understand the evolutionary dynamics between hosts and pathogens. Multiple insect genomes have been sequenced, with many of them having annotated immune genes, which paves the way for a comparative genomic analysis of insect immunity. In this review, I summarize the current state of comparative and evolutionary genomics of insect innate immune defense. The focus is on the conserved and divergent components of immunity with an emphasis on gene family evolution and evolution at the sequence level; both population genetics and molecular evolution frameworks are considered. © The Author 2015. Published by Oxford University Press.
A Comparative Analysis of Three Monocular Passive Ranging Methods on Real Infrared Sequences
NASA Astrophysics Data System (ADS)
Bondžulić, Boban P.; Mitrović, Srđan T.; Barbarić, Žarko P.; Andrić, Milenko S.
2013-09-01
Three monocular passive ranging methods are analyzed and tested on real infrared sequences. The first method exploits scale changes of an object in successive frames, while the other two use Beer-Lambert's Law. The ranging methods are evaluated by comparison with reference data obtained simultaneously at the test site. The research addresses scenarios where multiple sensor views or active measurements are not possible. The results show that these methods for range estimation can provide the fidelity required for object tracking. Maximum values of relative distance estimation errors in near-ideal conditions are less than 8%.
Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW.
Oliver, Tim; Schmidt, Bertil; Nathan, Darran; Clemens, Ralf; Maskell, Douglas
2005-08-15
Aligning hundreds of sequences using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. We present a new approach to compute multiple sequence alignments in far shorter time using reconfigurable hardware. This results in an implementation of ClustalW with significant runtime savings on a standard off-the-shelf FPGA.
Mango: multiple alignment with N gapped oligos.
Zhang, Zefeng; Lin, Hao; Li, Ming
2008-06-01
Multiple sequence alignment is a classical and challenging task. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art works suffer from the "once a gap, always a gap" phenomenon. Is there a radically new way to do multiple sequence alignment? In this paper, we introduce a novel and orthogonal multiple sequence alignment method, using both multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole and tries to build the alignment vertically, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds have proved significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks, showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0, and Kalign 2.0. We have further demonstrated the scalability of MANGO on very large datasets of repeat elements. MANGO can be downloaded at http://www.bioinfo.org.cn/mango/ and is free for academic usage.
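The contrast between spaced seeds and consecutive k-mers mentioned in the abstract above can be illustrated with a minimal sketch. This is not MANGO's algorithm or its optimized seed set; the pattern "1101", the function names, and the toy sequences are assumptions chosen only to show why a seed with a "don't care" position can still hit across a mismatch.

```python
from collections import defaultdict

def spaced_key(seq, pos, pattern):
    """Characters of seq under the '1' positions of the pattern, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")

def seed_hits(a, b, pattern):
    """Return (i, j) pairs where a[i:] and b[j:] agree on every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[spaced_key(a, i, pattern)].append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(spaced_key(b, j, pattern), []):
            hits.append((i, j))
    return hits

# Hypothetical toy sequences that differ at a single position.
a = "ACGTAC"
b = "ACTTAC"
consecutive = seed_hits(a, b, "1111")  # contiguous 4-mer seed
spaced = seed_hits(a, b, "1101")       # same span, third position ignored
```

On these toy inputs the contiguous seed of span 4 finds no hit, while the spaced seed of the same span tolerates the mismatch at the ignored position.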
USDA-ARS's Scientific Manuscript database
Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...
USDA-ARS's Scientific Manuscript database
Oocyte-specific genes play critical roles in oogenesis, folliculogenesis and early embryonic development. Through analysis of expressed sequence tags (ESTs) from a rainbow trout oocyte cDNA library, we identified a novel transcript which is represented by multiple ESTs derived only from the oocyte c...
Conservation in the face of diversity: multistrain analysis of an intracellular bacterium
USDA-ARS's Scientific Manuscript database
Comparisons of multiple strains revealed that A. marginale has a closed-core genome with few highly plastic regions, which include the msp2 and msp3 genes, as well as the aaap locus. Comparison of the Florida and St. Maries genome sequences found that SNPs comprise 0.8% of the longer Florida genome,...
Phylogenetic Analysis of Klebsiella pneumoniae from Hospitalized Children, Pakistan
Ejaz, Hasan; Wang, Nancy; Wilksch, Jonathan J.; Page, Andrew J.; Cao, Hanwei; Gujaran, Shruti; Keane, Jacqueline A.; Lithgow, Trevor; ul-Haq, Ikram; Dougan, Gordon
2017-01-01
Klebsiella pneumoniae shows increasing emergence of multidrug-resistant lineages, including strains resistant to all available antimicrobial drugs. We conducted whole-genome sequencing of 178 highly drug-resistant isolates from a tertiary hospital in Lahore, Pakistan. Phylogenetic analyses to place these isolates into global context demonstrate the expansion of multiple independent lineages, including K. quasipneumoniae. PMID:29048298
Optimized scheduling technique of null subcarriers for peak power control in 3GPP LTE downlink.
Cho, Soobum; Park, Sang Kyu
2014-01-01
Orthogonal frequency division multiple access (OFDMA) is a key multiple access technique for the long term evolution (LTE) downlink. However, high peak-to-average power ratio (PAPR) can cause the degradation of power efficiency. The well-known PAPR reduction technique, dummy sequence insertion (DSI), can be a realistic solution because of its structural simplicity. However, the large usage of subcarriers for the dummy sequences may decrease the transmitted data rate in the DSI scheme. In this paper, a novel DSI scheme is applied to the LTE system. Firstly, we obtain the null subcarriers in single-input single-output (SISO) and multiple-input multiple-output (MIMO) systems, respectively; then, optimized dummy sequences are inserted into the obtained null subcarrier. Simulation results show that Walsh-Hadamard transform (WHT) sequence is the best for the dummy sequence and the ratio of 16 to 20 for the WHT and randomly generated sequences has the maximum PAPR reduction performance. The number of near optimal iteration is derived to prevent exhausted iterations. It is also shown that there is no bit error rate (BER) degradation with the proposed technique in LTE downlink system. PMID:24883376
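The quantity being minimized in the abstract above, PAPR, is the peak instantaneous power of the time-domain symbol divided by its mean power. The sketch below is only an illustration under stated assumptions: it uses a naive inverse DFT, a hypothetical helper best_dummy that tries candidate dummy sequences on given null-subcarrier indices, and it does not reproduce the paper's WHT construction, its 16-to-20 ratio, or its iteration bound.

```python
import cmath
import math

def idft(freq):
    """Naive O(N^2) inverse DFT; adequate for a small illustration."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def papr_db(freq):
    """PAPR in dB of the time-domain signal for one frequency-domain symbol."""
    power = [abs(x) ** 2 for x in idft(freq)]
    return 10 * math.log10(max(power) / (sum(power) / len(power)))

def best_dummy(data, null_idx, candidates):
    """Place each candidate on the null subcarriers; return (lowest PAPR, candidate)."""
    best = None
    for cand in candidates:
        symbol = list(data)
        for i, value in zip(null_idx, cand):
            symbol[i] = value
        score = papr_db(symbol)
        if best is None or score < best[0]:
            best = (score, cand)
    return best

# Hypothetical 8-subcarrier symbol whose last two subcarriers are null.
data = [1, 1, 1, 1, 1, 1, 0, 0]
candidates = [(0, 0), (1, -1), (-1, 1), (-1, -1)]
score, chosen = best_dummy(data, [6, 7], candidates)
```

Because the all-zero candidate is in the list, the selected PAPR can never be worse than leaving the null subcarriers empty.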
Garcia-Hermoso, Dea; Criscuolo, Alexis; Lee, Soo Chan; Legrand, Matthieu; Chaouat, Marc; Denis, Blandine; Lafaurie, Matthieu; Rouveau, Martine; Soler, Charles; Schaal, Jean-Vivien; Mimoun, Maurice; Mebazaa, Alexandre; Heitman, Joseph; Dromer, Françoise; Brisse, Sylvain; Bretagne, Stéphane; Alanio, Alexandre
2018-04-24
Mucorales are ubiquitous environmental molds responsible for mucormycosis in diabetic, immunocompromised, and severely burned patients. Small outbreaks of invasive wound mucormycosis (IWM) have already been reported in burn units without extensive microbiological investigations. We faced an outbreak of IWM in our center and investigated the clinical isolates with whole-genome sequencing (WGS) analysis. We analyzed M. circinelloides isolates from patients in our burn unit (BU1, Hôpital Saint-Louis, Paris, France) together with nonoutbreak isolates from Burn Unit 2 (BU2, Paris area) and from France over a 2-year period (2013 to 2015). A total of 21 isolates, including 14 isolates from six BU1 patients, were analyzed by whole-genome sequencing (WGS). Phylogenetic classification based on de novo assembly and assembly free approaches showed that the clinical isolates clustered in four highly divergent clades. Clade 1 contained at least one of the strains from the six epidemiologically linked BU1 patients. The clinical isolates were specific to each patient. Two patients were infected with more than two strains from different clades, suggesting that an environmental reservoir of clonally unrelated isolates was the source of contamination. Only two patients from BU1 shared one strain, which could correspond to direct transmission or contamination with the same environmental source. In conclusion, WGS of several isolates per patients coupled with precise epidemiological data revealed a complex situation combining potential cross-transmission between patients and multiple contaminations with a heterogeneous pool of strains from a cryptic environmental reservoir. IMPORTANCE Invasive wound mucormycosis (IWM) is a severe infection due to environmental molds belonging to the order Mucorales. Severely burned patients are particularly at risk for IWM. 
Here, we used whole-genome sequencing (WGS) analysis to resolve an outbreak of IWM due to Mucor circinelloides that occurred in our hospital (BU1). We sequenced 21 clinical isolates, including 14 from BU1 and 7 unrelated isolates, and compared them to the reference genome (1006PhL). This analysis revealed that the outbreak was mainly due to multiple strains that seemed patient specific, suggesting that the patients were more likely infected from a pool of diverse strains from the environment rather than from direct transmission among them. This study revealed the complexity of a Mucorales outbreak in the settings of IWM in burn patients, which has been highlighted based on WGS combined with careful sampling. Copyright © 2018 Garcia-Hermoso et al.
Álvarez-Pérez, Sergio; de Vega, Clara; Herrera, Carlos M.
2013-01-01
The genetic and evolutionary relationships among floral nectar-dwelling Pseudomonas ‘sensu stricto’ isolates associated to South African and Mediterranean plants were investigated by multilocus sequence analysis (MLSA) of four core housekeeping genes (rrs, gyrB, rpoB and rpoD). A total of 35 different sequence types were found for the 38 nectar bacterial isolates characterised. Phylogenetic analyses resulted in the identification of three main clades [nectar groups (NGs) 1, 2 and 3] of nectar pseudomonads, which were closely related to five intrageneric groups: Pseudomonas oryzihabitans (NG 1); P. fluorescens, P. lutea and P. syringae (NG 2); and P. rhizosphaerae (NG 3). Linkage disequilibrium analysis pointed to a mostly clonal population structure, even when the analysis was restricted to isolates from the same floristic region or belonging to the same NG. Nevertheless, signatures of recombination were observed for NG 3, which exclusively included isolates retrieved from the floral nectar of insect-pollinated Mediterranean plants. In contrast, the other two NGs comprised both South African and Mediterranean isolates. Analyses relating diversification to floristic region and pollinator type revealed that there has been more unique evolution of the nectar pseudomonads within the Mediterranean region than would be expected by chance. This is the first work analysing the sequence of multiple loci to reveal geno- and ecotypes of nectar bacteria. PMID:24116076
Subject-level reliability analysis of fast fMRI with application to epilepsy.
Hao, Yongfu; Khoo, Hui Ming; von Ellenrieder, Nicolas; Gotman, Jean
2017-07-01
Recent studies have applied the new magnetic resonance encephalography (MREG) sequence to the study of interictal epileptic discharges (IEDs) in the electroencephalogram (EEG) of epileptic patients. However, there are no criteria to quantitatively evaluate different processing methods, to properly use the new sequence. We evaluated different processing steps of this new sequence under the common generalized linear model (GLM) framework by assessing the reliability of results. A bootstrap sampling technique was first used to generate multiple replicated data sets; a GLM with different processing steps was then applied to obtain activation maps, and the reliability of these maps was assessed. We applied our analysis in an event-related GLM related to IEDs. A higher reliability was achieved by using a GLM with head motion confound regressor with 24 components rather than the usual 6, with an autoregressive model of order 5 and with a canonical hemodynamic response function (HRF) rather than variable latency or patient-specific HRFs. Comparison of activation with IED field also favored the canonical HRF, consistent with the reliability analysis. The reliability analysis helps to optimize the processing methods for this fast fMRI sequence, in a context in which we do not know the ground truth of activation areas. Magn Reson Med 78:370-382, 2017. © 2016 International Society for Magnetic Resonance in Medicine.
Scala, Giovanni; Affinito, Ornella; Palumbo, Domenico; Florio, Ermanno; Monticelli, Antonella; Miele, Gennaro; Chiariotti, Lorenzo; Cocozza, Sergio
2016-11-25
CpG sites in an individual molecule may exist in a binary state (methylated or unmethylated) and each individual DNA molecule, containing a certain number of CpGs, is a combination of these states defining an epihaplotype. Classic quantification based approaches to study DNA methylation are intrinsically unable to fully represent the complexity of the underlying methylation substrate. Epihaplotype based approaches, on the other hand, allow methylation profiles of cell populations to be studied at the single molecule level. For such investigations, next-generation sequencing techniques can be used, both for quantitative and for epihaplotype analysis. Currently available tools for methylation analysis lack output formats that explicitly report CpG methylation profiles at the single molecule level, and they lack suitable statistical tools for interpreting such profiles. Here we present ampliMethProfiler, a Python-based pipeline for the extraction and statistical epihaplotype analysis of amplicons from targeted deep bisulfite sequencing of multiple DNA regions. The ampliMethProfiler tool provides an easy and user-friendly way to extract and analyze the epihaplotype composition of reads from targeted bisulfite sequencing experiments. ampliMethProfiler is written in Python and requires a local installation of BLAST and (optionally) QIIME tools. It can be run on Linux and OS X platforms. The software is open source and freely available at http://amplimethprofiler.sourceforge.net.
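The distinction between site-level quantification and epihaplotype analysis can be made concrete with a small sketch. It is not ampliMethProfiler code and does not use its input or output formats; it assumes each read has already been reduced to a string of "1" (methylated) and "0" (unmethylated) over the same CpG sites, and the function names are hypothetical.

```python
from collections import Counter

def epihaplotype_counts(reads):
    """Count how many reads carry each distinct methylation pattern."""
    return Counter(reads)

def site_methylation(reads):
    """Fraction of reads methylated at each CpG position (classic quantification)."""
    n = len(reads)
    width = len(reads[0])
    return [sum(r[i] == "1" for r in reads) / n for i in range(width)]

# Two hypothetical populations of reads over two CpG sites.
pop_a = ["11", "11", "00", "00"]
pop_b = ["10", "10", "01", "01"]

# Both populations are 50% methylated at every site ...
same_averages = site_methylation(pop_a) == site_methylation(pop_b)
# ... yet they share no epihaplotype at all.
shared = set(epihaplotype_counts(pop_a)) & set(epihaplotype_counts(pop_b))
```

The example shows why per-site averages alone cannot distinguish populations whose single-molecule patterns differ.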
Bolivar, I; Fahrni, J F; Smirnov, A; Pawlowski, J
2001-12-01
Naked lobose amoebae (gymnamoebae) are among the most abundant group of protists present in all aquatic and terrestrial biotopes. Yet, because of lack of informative morphological characters, the origin and evolutionary history of gymnamoebae are poorly known. The first molecular studies revealed multiple origins for the amoeboid lineages and an extraordinary diversity of amoebae species. Molecular data, however, exist only for a few species of the numerous taxa belonging to this group. Here, we present the small-subunit (SSU) rDNA sequences of four species of typical large gymnamoebae: Amoeba proteus, Amoeba leningradensis, Chaos nobile, and Chaos carolinense. Sequence analysis suggests that the four species are closely related to the species of genera Saccamoeba, Leptomyxa, Rhizamoeba, Paraflabellula, Hartmannella, and Echinamoeba. All of them form a relatively well-supported clade, which corresponds to the subclass Gymnamoebia, in agreement with morphology-based taxonomy. The other gymnamoebae cluster in small groups or branch separately. Their relationships change depending on the type of analysis and the model of nucleotide substitution. All gymnamoebae branch together in Neighbor-Joining analysis with corrections for among-site rate heterogeneity and proportion of invariable sites. This clade, however, is not statistically supported by SSU rRNA gene sequences and further analysis of protein sequence data will be necessary to test the monophyly of gymnamoebae.
Learning of goal-relevant and -irrelevant complex visual sequences in human V1.
Rosenthal, Clive R; Mallik, Indira; Caballero-Gaudes, Cesar; Sereno, Martin I; Soto, David
2018-06-12
Learning and memory are supported by a network involving the medial temporal lobe and linked neocortical regions. Emerging evidence indicates that primary visual cortex (i.e., V1) may contribute to recognition memory, but this has been tested only with a single visuospatial sequence as the target memorandum. The present study used functional magnetic resonance imaging to investigate whether human V1 can support the learning of multiple, concurrent complex visual sequences involving discontinuous (second-order) associations. Two peripheral, goal-irrelevant but structured sequences of orientated gratings appeared simultaneously in fixed locations of the right and left visual fields alongside a central, goal-relevant sequence that was in the focus of spatial attention. Pseudorandom sequences were introduced at multiple intervals during the presentation of the three structured visual sequences to provide an online measure of sequence-specific knowledge at each retinotopic location. We found that a network involving the precuneus and V1 was involved in learning the structured sequence presented at central fixation, whereas right V1 was modulated by repeated exposure to the concurrent structured sequence presented in the left visual field. The same result was not found in left V1. These results indicate for the first time that human V1 can support the learning of multiple concurrent sequences involving complex discontinuous inter-item associations, even peripheral sequences that are goal-irrelevant. Copyright © 2018. Published by Elsevier Inc.
Reddy, M K; Nair, S; Singh, B N; Mudgil, Y; Tewari, K K; Sopory, S K
2001-01-24
We report the cloning and sequencing of both cDNA and genomic DNA of a 33 kDa chloroplast ribonucleoprotein (33RNP) from pea. The analysis of the predicted amino acid sequence of the cDNA clone revealed that the encoded protein contains two RNA binding domains, including the conserved consensus ribonucleoprotein sequences CS-RNP1 and CS-RNP2, on the C-terminus half and the presence of a putative transit peptide sequence in the N-terminus region. The phylogenetic and multiple sequence alignment analysis of pea chloroplast RNP along with RNPs reported from the other plant sources revealed that the pea 33RNP is very closely related to Nicotiana sylvestris 31RNP and 28RNP and also to 31RNP and 28RNP of Arabidopsis and spinach, respectively. The pea 33RNP was expressed in Escherichia coli and purified to homogeneity. The in vitro import of precursor protein into chloroplasts confirmed that the N-terminus putative transit peptide is a bona fide transit peptide and 33RNP is localized in the chloroplast. The nucleic acid-binding properties of the recombinant protein, as revealed by South-Western analysis, showed that 33RNP has higher binding affinity for poly (U) and oligo dT than for ssDNA and dsDNA. The steady state transcript level was higher in leaves than in roots and the expression of this gene is light stimulated. Sequence analysis of the genomic clone revealed that the gene contains four exons and three introns. We have also isolated and analyzed the 5' flanking region of the pea 33RNP gene.
Zeil, Catharina; Widmann, Michael; Fademrecht, Silvia; Vogel, Constantin; Pleiss, Jürgen
2016-05-01
The Lactamase Engineering Database (www.LacED.uni-stuttgart.de) was developed to facilitate the classification and analysis of TEM β-lactamases. The current version contains 474 TEM variants. Two hundred fifty-nine variants form a large scale-free network of highly connected point mutants. The network was divided into three subnetworks which were enriched by single phenotypes: one network with predominantly 2be and two networks with 2br phenotypes. Fifteen positions were found to be highly variable, contributing to the majority of the observed variants. Since it is expected that a considerable fraction of the theoretical sequence space is functional, the currently sequenced 474 variants represent only the tip of the iceberg of functional TEM β-lactamase variants which form a huge natural reservoir of highly interconnected variants. Almost 50% of the variants are part of a quartet. Thus, two single mutations that result in functional enzymes can be combined into a functional protein. Most of these quartets consist of the same phenotype, or the mutations are additive with respect to the phenotype. By predicting quartets from triplets, 3,916 unknown variants were constructed. Eighty-seven variants complement multiple quartets and therefore have a high probability of being functional. The construction of a TEM β-lactamase network and subsequent analyses by clustering and quartet prediction are valuable tools to gain new insights into the viable sequence space of TEM β-lactamases and to predict their phenotype. The highly connected sequence space of TEM β-lactamases is ideally suited to network analysis and demonstrates the strengths of network analysis over tree reconstruction methods. Copyright © 2016, American Society for Microbiology. All Rights Reserved.
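The idea of predicting quartets from triplets can be sketched as follows. This is an illustration under assumptions, not the database's implementation: each variant is modelled as a frozenset of mutation labels relative to a reference, a quartet is taken to be the four variants S, S+a, S+b and S+a+b, and the labels m1 to m3 are hypothetical. The sketch also ignores the fact that two substitutions at the same position cannot be combined.

```python
from itertools import combinations

def predict_from_triplets(known):
    """Return variants that would complete a quartet with exactly three known corners."""
    known = {frozenset(v) for v in known}
    universe = sorted(set().union(*known)) if known else []
    predicted = set()
    for variant in known:
        for a, b in combinations(universe, 2):
            base = variant - {a, b}
            corners = [base, base | {a}, base | {b}, base | {a, b}]
            missing = [c for c in corners if c not in known]
            if len(missing) == 1:
                predicted.add(missing[0])
    return predicted

# Hypothetical network: the reference and two single mutants are known.
known = [set(), {"m1"}, {"m2"}]
candidates = predict_from_triplets(known)  # the double mutant {m1, m2}
```

A predicted variant that completes several different quartets would appear from more than one triplet, which is the kind of support the abstract describes for the 87 high-probability variants; counting that support is left out of this sketch.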
Cocco, Simona; Monasson, Remi; Weigt, Martin
2013-01-01
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant ‘patterns’ of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold. PMID:23990764
Large Scale Comparative Visualisation of Regulatory Networks with TRNDiff
Chua, Xin-Yi; Buckingham, Lawrence; Hogan, James M.; ...
2015-06-01
The advent of Next Generation Sequencing (NGS) technologies has seen explosive growth in genomic datasets, and dense coverage of related organisms, supporting study of subtle, strain-specific variations as a determinant of function. Such data collections present fresh and complex challenges for bioinformatics, those of comparing models of complex relationships across hundreds and even thousands of sequences. Transcriptional Regulatory Network (TRN) structures document the influence of regulatory proteins called Transcription Factors (TFs) on associated Target Genes (TGs). TRNs are routinely inferred from model systems or iterative search, and analysis at these scales requires simultaneous displays of multiple networks well beyond those of existing network visualisation tools [1]. In this paper we describe TRNDiff, an open source system supporting the comparative analysis and visualization of TRNs (and similarly structured data) from many genomes, allowing rapid identification of functional variations within species. The approach is demonstrated through a small scale multiple TRN analysis of the Fur iron-uptake system of Yersinia, suggesting a number of candidate virulence factors; and through a larger study exploiting integration with the RegPrecise database (http://regprecise.lbl.gov; [2]) - a collection of hundreds of manually curated and predicted transcription factor regulons drawn from across the entire spectrum of prokaryotic organisms.
Reddy, E P; Mettus, R V; DeFreitas, E; Wroblewska, Z; Cisco, M; Koprowski, H
1988-01-01
Human T-cell lymphotropic virus type 1 (HTLV-I), the etiologic agent of human T-cell leukemia, has recently been shown to be associated with neurologic disorders such as tropical spastic paraparesis, HTLV-associated myelopathy, and possibly with multiple sclerosis. In this communication, we have examined one specific case of neurologic disorder that can be classified as multiple sclerosis or tropical spastic paraparesis. The patient suffering from chronic neurologic disorder was found to contain antibodies to HTLV-I envelope and gag proteins in his serum and cerebrospinal fluid. Lymphocytes from peripheral blood and cerebrospinal fluid of the patient were shown to express viral RNA sequences by in situ hybridization. Southern blot analysis of the patient lymphocyte DNA revealed the presence of HTLV-I-related sequences. Blot-hybridization analysis of the RNA from fresh peripheral lymphocytes stimulated with interleukin 2 revealed the presence of abundant amounts of genomic viral RNA with little or no subgenomic RNA. We have cloned the proviral genome from the DNA of the peripheral lymphocytes and determined its restriction map. This analysis shows that this proviral genome is very similar if not identical to that of the prototype HTLV-I genome. PMID:2897123
Long-range barcode labeling-sequencing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Feng; Zhang, Tao; Singh, Kanwar K.
Methods for sequencing single large DNA molecules by clonal multiple displacement amplification using barcoded primers. Sequences are binned based on barcode sequences and sequenced using a microdroplet-based method for sequencing large polynucleotide templates to enable assembly of haplotype-resolved complex genomes and metagenomes.
Ferret, Yann; Caillault, Aurélie; Sebda, Shéhérazade; Duez, Marc; Grardel, Nathalie; Duployez, Nicolas; Villenet, Céline; Figeac, Martin; Preudhomme, Claude; Salson, Mikaël; Giraud, Mathieu
2016-05-01
High-throughput sequencing (HTS) is considered a technical revolution that has improved our knowledge of lymphoid and autoimmune diseases, changing our approach to leukaemia both at diagnosis and during follow-up. As part of an immunoglobulin/T cell receptor-based minimal residual disease (MRD) assessment of acute lymphoblastic leukaemia patients, we assessed the performance and feasibility of the replacement of the first steps of the approach based on DNA isolation and Sanger sequencing, using a HTS protocol combined with bioinformatics analysis and visualization using the Vidjil software. We prospectively analysed the diagnostic and relapse samples of 34 paediatric patients, thus identifying 125 leukaemic clones with recombinations on multiple loci (TRG, TRD, IGH and IGK), including Dd2/Dd3 and Intron/KDE rearrangements. Sequencing failures were halved (14% vs. 34%, P = 0.0007), enabling more patients to be monitored. Furthermore, more markers per patient could be monitored, reducing the probability of false negative MRD results. The whole analysis, from sample receipt to clinical validation, was shorter than our current diagnostic protocol, with equal resources. V(D)J recombination was successfully assigned by the software, even for unusual recombinations. This study emphasizes the progress that HTS with adapted bioinformatics tools can bring to the diagnosis of leukaemia patients. © 2016 John Wiley & Sons Ltd.
Botelho, Ana; Canto, Ana; Leão, Célia; Cunha, Mónica V
2015-01-01
Typical CRISPR (clustered, regularly interspaced, short palindromic repeat) regions are constituted by short direct repeats (DRs), interspersed with similarly sized non-repetitive spacers, derived from transmissible genetic elements, acquired when the cell is challenged with foreign DNA. The analysis of the structure, in number and nature, of CRISPR spacers is a valuable tool for molecular typing since these loci are polymorphic among strains, originating characteristic signatures. The existence of CRISPR structures in the genome of the members of Mycobacterium tuberculosis complex (MTBC) enabled the development of a genotyping method, based on the analysis of the presence or absence of 43 oligonucleotide spacers separated by conserved DRs. This method, called spoligotyping, consists of PCR amplification of the DR chromosomal region and recognition, after hybridization, of the spacers that are present. In the workflow behind this methodology, the PCR products are brought onto a membrane containing synthetic oligonucleotides whose sequences are complementary to the spacer sequences. Lack of hybridization of the PCR products to a specific oligonucleotide sequence indicates absence of the corresponding spacer sequence in the examined strain. Spoligotyping gained great notoriety as a robust identification and typing tool for members of MTBC, enabling multiple epidemiological studies on human and animal tuberculosis.
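The presence/absence readout described above is naturally a 43-position binary pattern. The sketch below turns a list of hybridized spacers into that pattern and into a shorter octal code. The octal convention used here (14 triplets read as octal digits, followed by the 43rd bit written as is) is an assumption not stated in the abstract, and the function name is hypothetical.

```python
N_SPACERS = 43  # number of spacers interrogated, as stated in the abstract

def spoligotype_codes(present_spacers):
    """present_spacers: iterable of 1-based spacer numbers that hybridized."""
    present = set(present_spacers)
    binary = "".join(
        "1" if i in present else "0" for i in range(1, N_SPACERS + 1)
    )
    # Assumed convention: spacers 1-42 form 14 triplets, each read as one
    # octal digit; spacer 43 contributes a final 0 or 1.
    triplets = [binary[i:i + 3] for i in range(0, 42, 3)]
    octal = "".join(str(int(t, 2)) for t in triplets) + binary[42]
    return binary, octal

# Hypothetical pattern in which only spacers 35 to 43 hybridize.
binary, octal = spoligotype_codes(range(35, 44))
# octal == "000000000003771"
```

Keeping the binary string alongside the octal code preserves the per-spacer information while giving a compact label for comparison between strains.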
Ventura, Marco; Jankovic, Ivana; Walker, D. Carey; Pridmore, R. David; Zink, Ralf
2002-01-01
We have identified and sequenced the genes encoding the aggregation-promoting factor (APF) protein from six different strains of Lactobacillus johnsonii and Lactobacillus gasseri. Both species harbor two apf genes, apf1 and apf2, which are in the same orientation and encode proteins of 257 to 326 amino acids. Multiple alignments of the deduced amino acid sequences of these apf genes demonstrate a very strong sequence conservation of all of the genes with the exception of their central regions. Northern blot analysis showed that both genes are transcribed, reaching their maximum expression during the exponential phase. Primer extension analysis revealed that apf1 and apf2 harbor a putative promoter sequence that is conserved in all of the genes. Western blot analysis of the LiCl cell extracts showed that APF proteins are located on the cell surface. Intact cells of L. johnsonii revealed the typical cell wall architecture of S-layer-carrying gram-positive eubacteria, which could be selectively removed with LiCl treatment. In addition, the amino acid composition, physical properties, and genetic organization were found to be quite similar to those of S-layer proteins. These results suggest that APF is a novel surface protein of the Lactobacillus acidophilus B-homology group which might belong to an S-layer-like family. PMID:12450842
MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling.
Piro, Vitor C; Matschkowski, Marcel; Renard, Bernhard Y
2017-08-14
Many metagenome analysis tools are presently available to classify sequences and profile environmental samples. In particular, taxonomic profiling and binning methods are commonly used for such tasks. Tools available among these two categories make use of several techniques, e.g., read mapping, k-mer alignment, and composition analysis. Variations on the construction of the corresponding reference sequence databases are also common. In addition, different tools provide good results in different datasets and configurations. All this variation makes it complicated for researchers to decide which methods to use. Installation, configuration and execution can also be difficult, especially when dealing with multiple datasets and tools. We propose MetaMeta: a pipeline to execute and integrate results from metagenome analysis tools. MetaMeta provides an easy workflow to run multiple tools with multiple samples, producing a single enhanced output profile for each sample. MetaMeta includes database generation, pre-processing, execution, and integration steps, allowing easy execution and parallelization. The integration relies on the co-occurrence of organisms from different methods as the main feature to improve community profiling while accounting for differences in their databases. In a controlled case with simulated and real data, we show that the integrated profiles of MetaMeta outperform the best single profile. Using the same input data, it provides more sensitive and reliable results with the presence of each organism being supported by several methods. MetaMeta uses Snakemake and has six pre-configured tools, all available at BioConda channel for easy installation (conda install -c bioconda metameta). The MetaMeta pipeline is open-source and can be downloaded at: https://gitlab.com/rki_bioinformatics.
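The principle of integrating profiles by co-occurrence can be sketched in a few lines. This is not MetaMeta's scoring scheme, which also accounts for database differences; it is a simplified illustration that assumes each tool reports relative abundances per taxon, keeps only taxa reported by at least min_tools tools, and averages them. The tool names and the min_tools parameter are hypothetical.

```python
from collections import defaultdict

def integrate_profiles(profiles, min_tools=2):
    """profiles: {tool name: {taxon: relative abundance}} for one sample."""
    support = defaultdict(list)
    for profile in profiles.values():
        for taxon, abundance in profile.items():
            support[taxon].append(abundance)
    # Keep taxa supported by enough tools and average their abundances.
    kept = {
        taxon: sum(values) / len(values)
        for taxon, values in support.items()
        if len(values) >= min_tools
    }
    total = sum(kept.values())
    # Renormalize so the integrated profile sums to 1.
    return {taxon: value / total for taxon, value in kept.items()} if total else {}

# Hypothetical output of three tools on the same sample.
profiles = {
    "tool_a": {"Escherichia coli": 0.6, "Bacillus subtilis": 0.4},
    "tool_b": {"Escherichia coli": 0.5, "Staphylococcus aureus": 0.5},
    "tool_c": {"Escherichia coli": 0.7, "Bacillus subtilis": 0.3},
}
merged = integrate_profiles(profiles)
```

In this toy case the taxon reported by a single tool is dropped, which mirrors the abstract's point that each retained organism is supported by several methods.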
Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes
Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.
2012-01-01
Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300
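A naïve Bayesian classifier over sequence "words" (k-mers) can be sketched as follows. This is a simplified illustration, not the Ribosomal Database Project implementation: the word size, the 0.5 pseudo-count smoothing, and the toy training sequences and taxon labels are assumptions.

```python
import math
from collections import defaultdict

def kmers(seq, k=8):
    """Return the set of overlapping k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(training, k=8):
    """training: list of (taxon, sequence) pairs.

    Returns (counts, sizes): counts[taxon][word] is the number of training
    sequences of that taxon containing the word; sizes[taxon] is the number
    of training sequences of that taxon.
    """
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for taxon, seq in training:
        sizes[taxon] += 1
        for word in kmers(seq, k):
            counts[taxon][word] += 1
    return counts, sizes

def classify(seq, counts, sizes, k=8):
    """Pick the taxon maximising the summed log-probability of the query words."""
    best, best_score = None, -math.inf
    for taxon, n in sizes.items():
        # Pseudo-count smoothing (an assumption) so unseen words do not
        # drive the probability to zero.
        score = sum(
            math.log((counts[taxon].get(word, 0) + 0.5) / (n + 1))
            for word in kmers(seq, k)
        )
        if score > best_score:
            best, best_score = taxon, score
    return best

# Toy training set with made-up sequences; k=4 keeps the example small.
training = [
    ("Ascomycota", "ACGTACGTACGTAAAC"),
    ("Basidiomycota", "TTGGCCAATTGGCCAA"),
]
counts, sizes = train(training, k=4)
```

Because scoring is a sum over precomputed word counts rather than an alignment, classification cost grows with the number of taxa and query words, which is consistent with the speed advantage over BLASTN reported in the abstract.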
Yoshida, Catherine E; Kruczkiewicz, Peter; Laing, Chad R; Lingohr, Erika J; Gannon, Victor P J; Nash, John H E; Taboada, Eduardo N
2016-01-01
For nearly 100 years serotyping has been the gold standard for the identification of Salmonella serovars. Despite the increasing adoption of DNA-based subtyping approaches, serotype information remains a cornerstone in food safety and public health activities aimed at reducing the burden of salmonellosis. At the same time, recent advances in whole-genome sequencing (WGS) promise to revolutionize our ability to perform advanced pathogen characterization in support of improved source attribution and outbreak analysis. We present the Salmonella In Silico Typing Resource (SISTR), a bioinformatics platform for rapidly performing simultaneous in silico analyses for several leading subtyping methods on draft Salmonella genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). We show how phylogenetic context from cgMLST analysis can supplement the genoserotyping analysis and increase the accuracy of in silico serovar prediction to over 94.6% on a dataset comprised of 4,188 finished genomes and WGS draft assemblies. In addition to allowing analysis of user-uploaded whole-genome assemblies, the SISTR platform incorporates a database comprising over 4,000 publicly available genomes, allowing users to place their isolates in a broader phylogenetic and epidemiological context. The resource incorporates several metadata driven visualizations to examine the phylogenetic, geospatial and temporal distribution of genome-sequenced isolates. As sequencing of Salmonella isolates at public health laboratories around the world becomes increasingly common, rapid in silico analysis of minimally processed draft genome assemblies provides a powerful approach for molecular epidemiology in support of public health investigations. 
Moreover, this type of integrated analysis using multiple sequence-based methods of sub-typing allows for continuity with historical serotyping data as we transition towards the increasing adoption of genomic analyses in epidemiology. The SISTR platform is freely available on the web at https://lfz.corefacility.ca/sistr-app/.
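The idea of using cgMLST context to support serovar prediction can be sketched as a nearest-neighbour lookup over allele profiles. This is an assumption-laden illustration, not SISTR's implementation: the locus names, allele numbers, distance definition, and function names are hypothetical.

```python
def allele_distance(profile_a, profile_b):
    """Fraction of shared, called loci at which two allele profiles differ.

    Profiles map locus name -> allele number, with None for a missing call.
    """
    shared = [
        locus for locus in profile_a
        if locus in profile_b
        and profile_a[locus] is not None
        and profile_b[locus] is not None
    ]
    if not shared:
        return 1.0
    diffs = sum(profile_a[locus] != profile_b[locus] for locus in shared)
    return diffs / len(shared)

def nearest_serovar(query, references):
    """references: list of (serovar, profile). Return (serovar, distance) of the closest."""
    return min(
        ((serovar, allele_distance(query, profile)) for serovar, profile in references),
        key=lambda item: item[1],
    )

# Hypothetical four-locus profiles; real cgMLST schemes use far more loci.
query = {"l1": 1, "l2": 4, "l3": 2, "l4": None}
references = [
    ("Enteritidis", {"l1": 1, "l2": 4, "l3": 3, "l4": 7}),
    ("Typhimurium", {"l1": 2, "l2": 5, "l3": 2, "l4": 7}),
]
closest = nearest_serovar(query, references)
```

Ignoring uncalled loci lets draft assemblies with gaps still be compared, which matters for the minimally processed assemblies the abstract describes.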
TCW: Transcriptome Computational Workbench
Soderlund, Carol; Nelson, William; Willer, Mark; Gang, David R.
2013-01-01
Background: The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility. Methodology: The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results. Conclusion: It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. TCW is freely available from www.agcol.arizona.edu/software/tcw. PMID:23874959
High-speed multiple sequence alignment on a reconfigurable platform.
Oliver, Tim; Schmidt, Bertil; Maskell, Douglas; Nathan, Darran; Clemens, Ralf
2006-01-01
Progressive alignment is a widely used approach to compute multiple sequence alignments (MSAs). However, aligning several hundred sequences by popular progressive alignment tools requires hours on sequential computers. Due to the rapid growth of sequence databases biologists have to compute MSAs in a far shorter time. In this paper we present a new approach to MSA on reconfigurable hardware platforms to gain high performance at low cost. We have constructed a linear systolic array to perform pairwise sequence distance computations using dynamic programming. This results in an implementation with significant runtime savings on a standard FPGA.
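The pairwise distance stage of progressive alignment can be sketched in software as a dynamic programming recurrence. The abstract does not state the scoring scheme used on the FPGA, so unit-cost edit distance is assumed here, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion from a
                curr[j - 1] + 1,                   # insertion into a
                prev[j - 1] + (char_a != char_b),  # match or substitution
            ))
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances, as needed to build a guide tree."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    return dist

matrix = distance_matrix(["ACGT", "ACGA", "TTGA"])
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; that independence is what a linear systolic array can exploit to compute them in parallel.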
The Saccharomyces Genome Database Variant Viewer.
Sheppard, Travis K; Hitz, Benjamin C; Engel, Stacia R; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla S; Demeter, Janos; Hellerstedt, Sage T; Karra, Kalpana; Nash, Robert S; Paskov, Kelley M; Skrzypek, Marek S; Weng, Shuai; Wong, Edith D; Cherry, J Michael
2016-01-04
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
A communal catalogue reveals Earth's multiscale microbial diversity.
Thompson, Luke R; Sanders, Jon G; McDonald, Daniel; Amir, Amnon; Ladau, Joshua; Locey, Kenneth J; Prill, Robert J; Tripathi, Anupriya; Gibbons, Sean M; Ackermann, Gail; Navas-Molina, Jose A; Janssen, Stefan; Kopylova, Evguenia; Vázquez-Baeza, Yoshiki; González, Antonio; Morton, James T; Mirarab, Siavash; Zech Xu, Zhenjiang; Jiang, Lingjing; Haroon, Mohamed F; Kanbar, Jad; Zhu, Qiyun; Jin Song, Se; Kosciolek, Tomasz; Bokulich, Nicholas A; Lefler, Joshua; Brislawn, Colin J; Humphrey, Gregory; Owens, Sarah M; Hampton-Marcell, Jarrad; Berg-Lyons, Donna; McKenzie, Valerie; Fierer, Noah; Fuhrman, Jed A; Clauset, Aaron; Stevens, Rick L; Shade, Ashley; Pollard, Katherine S; Goodwin, Kelly D; Jansson, Janet K; Gilbert, Jack A; Knight, Rob
2017-11-23
Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth's microbial diversity.
Bailey, Sarah F; Scheible, Melissa K; Williams, Christopher; Silva, Deborah S B S; Hoggan, Marina; Eichman, Christopher; Faith, Seth A
2017-11-01
Next-generation Sequencing (NGS) is a rapidly evolving technology with demonstrated benefits for forensic genetic applications, and the strategies to analyze and manage the massive NGS datasets are currently in development. Here, the computing, data storage, connectivity, and security resources of the Cloud were evaluated as a model for forensic laboratory systems that produce NGS data. A complete front-to-end Cloud system was developed to upload, process, and interpret raw NGS data using a web browser dashboard. The system was extensible, demonstrating analysis capabilities of autosomal and Y-STRs from a variety of NGS instrumentation (Illumina MiniSeq and MiSeq, and Oxford Nanopore MinION). NGS data for STRs were concordant with standard reference materials previously characterized with capillary electrophoresis and Sanger sequencing. The computing power of the Cloud was implemented with on-demand auto-scaling to allow multiple file analysis in tandem. The system was designed to store resulting data in a relational database, amenable to downstream sample interpretations and databasing applications following the most recent guidelines in nomenclature for sequenced alleles. Lastly, a multi-layered Cloud security architecture was tested and showed that industry standards for securing data and computing resources were readily applied to the NGS system without disadvantageous effects for bioinformatic analysis, connectivity or data storage/retrieval. The results of this study demonstrate the feasibility of using Cloud-based systems for secured NGS data analysis, storage, databasing, and multi-user distributed connectivity. Copyright © 2017 Elsevier B.V. All rights reserved.
Shao, Bing; Li, Hang; Liu, Sheng-Yuan; Li, Wen-Jing; Huang, Chao-Qun; Lin, Yuan-Long; Wang, Fu-Xiang; Wang, Bin-You
2013-05-01
The aim was to identify the currently prevalent subtypes and to study the genetic variation of HIV-1 strains in men who have sex with men (MSM) residing in Heilongjiang province, China. We analyzed the characteristics of the nucleotide sequences and the corresponding deduced protein of Vif of HIV-1 strains isolated from 17 drug-naive HIV-1-seropositive MSM. Subtypes B (17.65%) and B' (Thailand B) (11.76%), CRF07_BC (47.06%), and CRF01_AE (23.53%) were identified. Phylogenetic analysis showed that there was a close relationship between our strains and those from the same MSM population in Hebei province, which is geographically close to Heilongjiang. Most of the documented Vif functional motifs are well conserved in the majority of our analyzed sequences. Taken together, our results suggest that there might be multiple introductions of HIV in Heilongjiang MSM and frequent sexual communications with other geographically nearby MSM populations.
PredictProtein—an open resource for online prediction of protein structural and functional features
Yachdav, Guy; Kloppmann, Edda; Kajan, Laszlo; Hecht, Maximilian; Goldberg, Tatyana; Hamp, Tobias; Hönigschmid, Peter; Schafferhans, Andrea; Roos, Manfred; Bernhofer, Michael; Richter, Lothar; Ashkenazy, Haim; Punta, Marco; Schlessinger, Avner; Bromberg, Yana; Schneider, Reinhard; Vriend, Gerrit; Sander, Chris; Ben-Tal, Nir; Rost, Burkhard
2014-01-01
PredictProtein is a meta-service for sequence analysis that has been predicting structural and functional features of proteins since 1992. Queried with a protein sequence it returns: multiple sequence alignments, predicted aspects of structure (secondary structure, solvent accessibility, transmembrane helices (TMSEG) and strands, coiled-coil regions, disulfide bonds and disordered regions) and function. The service incorporates analysis methods for the identification of functional regions (ConSurf), homology-based inference of Gene Ontology terms (metastudent), comprehensive subcellular localization prediction (LocTree3), protein–protein binding sites (ISIS2), protein–polynucleotide binding sites (SomeNA) and predictions of the effect of point mutations (non-synonymous SNPs) on protein function (SNAP2). Our goal has always been to develop a system optimized to meet the demands of experimentalists not highly experienced in bioinformatics. To this end, the PredictProtein results are presented as both text and a series of intuitive, interactive and visually appealing figures. The web server and sources are available at http://ppopen.rostlab.org. PMID:24799431
Calibrating genomic and allelic coverage bias in single-cell sequencing.
Zhang, Cheng-Zhong; Adalsteinsson, Viktor A; Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L; Meyerson, Matthew; Love, J Christopher
2015-04-16
Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1-10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (∼0.1 × ) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples.
Calibrating genomic and allelic coverage bias in single-cell sequencing
Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L.; Meyerson, Matthew; Love, J. Christopher
2016-01-01
Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1–10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (~0.1 ×) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples. PMID:25879913
Finding similar nucleotide sequences using network BLAST searches.
Ladunga, Istvan
2009-06-01
The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge.
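As a sketch of integrating parsed BLAST results into a pipeline, the following reads the 12-column tabular output of BLAST+ (`-outfmt 6`) and keeps hits below an expected-value threshold. The threshold value, the function name, and the example hits are assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def parse_blast_tabular(handle, max_evalue=1e-5):
    """Return hits (as dicts) whose E-value is at or below max_evalue."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits

# Two made-up hits: one strong, one weak.
example = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t101\t300\t1e-100\t360\n"
    "query1\tsubjB\t75.00\t40\t10\t0\t5\t44\t9\t48\t0.5\t30.1\n"
)
hits = parse_blast_tabular(io.StringIO(example))
```

The weak hit is filtered out here, but in line with the abstract's advice a pipeline might log such hits separately as hints for further analysis rather than discard them.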
System, method and apparatus for generating phrases from a database
NASA Technical Reports Server (NTRS)
McGreevy, Michael W. (Inventor)
2004-01-01
Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input. The query includes a term or a sequence of terms or multiple individual terms or multiple sequences of terms or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.
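The steps in this abstract (build a model, take a query, assemble related term sequences, sort, output) can be sketched with a very simple stand-in model. The patent's relational model is not specified here, so counting adjacent term pairs is an assumption, as are the function names and the sample texts.

```python
from collections import Counter

def build_model(documents):
    """Count ordered adjacent term pairs (a simple stand-in for a relational model)."""
    pairs = Counter()
    for text in documents:
        terms = text.lower().split()
        pairs.update(zip(terms, terms[1:]))
    return pairs

def generate_phrases(model, query, length=3):
    """Assemble term sequences that start at `query`, scored by summed pair counts.

    Sequences that cannot be extended to the next length are dropped.
    """
    phrases = [((query,), 0)]
    for _ in range(length - 1):
        extended = [
            (terms + (b,), score + n)
            for terms, score in phrases
            for (a, b), n in model.items()
            if a == terms[-1]
        ]
        if not extended:
            break
        phrases = extended
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(phrases, key=lambda p: (-p[1], p[0]))

# Made-up database text.
documents = [
    "engine fire warning light",
    "engine fire during climb",
    "engine failure during climb",
]
model = build_model(documents)
phrases = generate_phrases(model, "engine", length=3)
```

Calling `generate_phrases` again with the last term of an output phrase as the new query gives a crude version of the iterative process the abstract mentions.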
Foltz, T M; Welsh, B M
1999-01-01
This paper uses the fact that the discrete Fourier transform diagonalizes a circulant matrix to provide an alternate derivation of the symmetric convolution-multiplication property for discrete trigonometric transforms. Derived in this manner, the symmetric convolution-multiplication property extends easily to multiple dimensions using the notion of block circulant matrices and generalizes to multidimensional asymmetric sequences. The symmetric convolution of multidimensional asymmetric sequences can then be accomplished by taking the product of the trigonometric transforms of the sequences and then applying an inverse trigonometric transform to the result. An example is given of how this theory can be used for applying a two-dimensional (2-D) finite impulse response (FIR) filter with nonlinear phase which models atmospheric turbulence.
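The starting fact of this abstract, that the discrete Fourier transform diagonalizes a circulant matrix, can be checked numerically. The sketch below covers only that DFT property (circular convolution), not the symmetric convolution of the trigonometric transforms derived in the paper; function names and test vectors are illustrative.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by x.

    Entry (i, j) of the matrix is c[(i - j) % n].
    """
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def circular_convolve_dft(c, x):
    """Same product via the DFT: elementwise multiply the spectra, then invert."""
    spectrum = [a * b for a, b in zip(dft(c), dft(x))]
    return dft(spectrum, inverse=True)

c = [1, 2, 3, 4]
x = [1, 0, -1, 2]
direct = circulant_matvec(c, x)
via_dft = circular_convolve_dft(c, x)
```

The two results agree up to rounding, which shows that in the Fourier basis the circulant matrix acts as a diagonal matrix whose entries are the DFT of its first column.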
Multiple host-switching of Haemosporidia parasites in bats
Duval, Linda; Robert, Vincent; Csorba, Gabor; Hassanin, Alexandre; Randrianarivelojosia, Milijaona; Walston, Joe; Nhim, Thy; Goodman, Steve M; Ariey, Frédéric
2007-01-01
Background: There have been reported cases of host-switching in avian and lizard species of Plasmodium (Apicomplexa, Haemosporidia), as well as in those infecting different primate species. However, no evidence has previously been found for host-swapping between wild birds and mammals. Methods: This paper presents the results of the sampling of blood parasites of wild-captured bats from Madagascar and Cambodia. The presence of Haemosporidia infection in these animals is confirmed and cytochrome b gene sequences were used to construct a phylogenetic analysis. Results: Results reveal at least three different and independent Haemosporidia evolutionary histories in three different bat lineages from Madagascar and Cambodia. Conclusion: Phylogenetic analysis strongly suggests multiple host-switching of Haemosporidia parasites in bats with those from avian and primate hosts. PMID:18045505
Klopfenstein, Ned B; Stewart, Jane E; Ota, Yuko; Hanna, John W; Richardson, Bryce A; Ross-Davis, Amy L; Elías-Román, Rubén D; Korhonen, Kari; Keča, Nenad; Iturritxa, Eugenia; Alvarado-Rosales, Dionicio; Solheim, Halvor; Brazee, Nicholas J; Łakomy, Piotr; Cleary, Michelle R; Hasegawa, Eri; Kikuchi, Taisei; Garza-Ocañas, Fortunato; Tsopelas, Panaghiotis; Rigling, Daniel; Prospero, Simone; Tsykun, Tetyana; Bérubé, Jean A; Stefani, Franck O P; Jafarpour, Saeideh; Antonín, Vladimír; Tomšovský, Michal; McDonald, Geral I; Woodward, Stephen; Kim, Mee-Sook
2017-01-01
Armillaria possesses several intriguing characteristics that have inspired wide interest in understanding phylogenetic relationships within and among species of this genus. Nuclear ribosomal DNA sequence-based analyses of Armillaria provide only limited information for phylogenetic studies among widely divergent taxa. More recent studies have shown that translation elongation factor 1-α (tef1) sequences are highly informative for phylogenetic analysis of Armillaria species within diverse global regions. This study used Neighbor-net and coalescence-based Bayesian analyses to examine phylogenetic relationships of newly determined and existing tef1 sequences derived from diverse Armillaria species from across the Northern Hemisphere, with Southern Hemisphere Armillaria species included for reference. Based on the Bayesian analysis of tef1 sequences, Armillaria species from the Northern Hemisphere are generally contained within the following four superclades, which are named according to the specific epithet of the most frequently cited species within the superclade: (i) Socialis/Tabescens (exannulate) superclade including Eurasian A. ectypa, North American A. socialis (A. tabescens), and Eurasian A. socialis (A. tabescens) clades; (ii) Mellea superclade including undescribed annulate North American Armillaria sp. (Mexico) and four separate clades of A. mellea (Europe and Iran, eastern Asia, and two groups from North America); (iii) Gallica superclade including Armillaria Nag E (Japan), multiple clades of A. gallica (Asia and Europe), A. calvescens (eastern North America), A. cepistipes (North America), A. altimontana (western USA), A. nabsnona (North America and Japan), and at least two A. gallica clades (North America); and (iv) Solidipes/Ostoyae superclade including two A. solidipes/ostoyae clades (North America), A. gemina (eastern USA), A. solidipes/ostoyae (Eurasia), A. cepistipes (Europe and Japan), A. sinapina (North America and Japan), and A. borealis (Eurasia) clade 2. Of note is that A. borealis (Eurasia) clade 1 appears basal to the Solidipes/Ostoyae and Gallica superclades. The Neighbor-net analysis showed similar phylogenetic relationships. This study further demonstrates the utility of tef1 for global phylogenetic studies of Armillaria species and provides critical insights into multiple taxonomic issues that warrant further study.
Identification of Prostate Cancer-Specific microDNAs
2016-02-01
circular DNA by rolling circle amplification (RCA) and then amplified DNA fragments were subject to deep sequencing. Deep sequencing of the...demonstrate the existence of microDNAs in prostate cancer. We adopted multiple displacement amplification (MDA) with random 2 primers for enriched...prostate cancer cells through multiple displacement amplification and next generation sequencing. [Figure residue: axis label "Relative cell growth (%)".]
Applying Agrep to r-NSA to solve multiple sequences approximate matching.
Ni, Bing; Wong, Man-Hon; Lam, Chi-Fai David; Leung, Kwong-Sak
2014-01-01
This paper addresses the approximate matching problem in a database consisting of multiple DNA sequences, where the proposed approach applies Agrep to a new truncated suffix array, r-NSA. The construction time of the structure is linear in the database size, and indexing a substring in the structure takes constant time. The number of characters processed in applying Agrep is analysed theoretically, and the theoretical upper bound closely approximates the empirical number of characters, which is obtained by enumerating the characters in the actual structure built. Experiments are carried out using (synthetic) random DNA sequences, as well as (real) genome sequences including Hepatitis-B Virus and X-chromosome. Experimental results show that, compared to the straightforward approach that applies Agrep to multiple sequences individually, the proposed approach solves the matching problem in much shorter time. The speed-up of our approach depends on the sequence patterns, and for highly similar homologous genome sequences, which are the common cases in real-life genomes, it can be up to several orders of magnitude.
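The baseline the abstract compares against, approximate matching applied to each sequence individually, can be sketched with a standard dynamic programming search. This is neither Agrep's bit-parallel algorithm nor the r-NSA index; it is a plain column-wise DP (often attributed to Sellers), and the function names and sequences are illustrative.

```python
def approx_find(pattern, text, k):
    """Return 1-based end positions in text where pattern matches with <= k edits."""
    m = len(pattern)
    # col[i] = fewest edits needed to match pattern[:i] against some
    # substring of text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text, start=1):
        prev_diag, col[0] = col[0], 0  # a match may start anywhere in text
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                           # skip a text character
                col[i - 1] + 1,                       # skip a pattern character
                prev_diag + (pattern[i - 1] != ch),   # match or substitution
            )
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(pos)
    return ends

def search_database(pattern, database, k):
    """Apply the search to each named sequence individually."""
    return {name: approx_find(pattern, seq, k) for name, seq in database.items()}

# Made-up sequences: one exact occurrence, one with a single substitution.
database = {"seq1": "TTACGTTT", "seq2": "TTACCTTT"}
results = search_database("ACGT", database, k=1)
```

Every sequence is scanned in full here, so highly similar sequences repeat almost the same work; sharing that work across sequences is the kind of redundancy an index over the whole database can remove.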
Targeted Quantitation of Proteins by Mass Spectrometry
2013-01-01
Quantitative measurement of proteins is one of the most fundamental analytical tasks in a biochemistry laboratory, but widely used immunochemical methods often have limited specificity and high measurement variation. In this review, we discuss applications of multiple-reaction monitoring (MRM) mass spectrometry, which allows sensitive, precise quantitative analyses of peptides and the proteins from which they are derived. Systematic development of MRM assays is permitted by databases of peptide mass spectra and sequences, software tools for analysis design and data analysis, and rapid evolution of tandem mass spectrometer technology. Key advantages of MRM assays are the ability to target specific peptide sequences, including variants and modified forms, and the capacity for multiplexing that allows analysis of dozens to hundreds of peptides. Different quantitative standardization methods provide options that balance precision, sensitivity, and assay cost. Targeted protein quantitation by MRM and related mass spectrometry methods can advance biochemistry by transforming approaches to protein measurement. PMID:23517332
Targeted quantitation of proteins by mass spectrometry.
Liebler, Daniel C; Zimmerman, Lisa J
2013-06-04
Quantitative measurement of proteins is one of the most fundamental analytical tasks in a biochemistry laboratory, but widely used immunochemical methods often have limited specificity and high measurement variation. In this review, we discuss applications of multiple-reaction monitoring (MRM) mass spectrometry, which allows sensitive, precise quantitative analyses of peptides and the proteins from which they are derived. Systematic development of MRM assays is permitted by databases of peptide mass spectra and sequences, software tools for analysis design and data analysis, and rapid evolution of tandem mass spectrometer technology. Key advantages of MRM assays are the ability to target specific peptide sequences, including variants and modified forms, and the capacity for multiplexing that allows analysis of dozens to hundreds of peptides. Different quantitative standardization methods provide options that balance precision, sensitivity, and assay cost. Targeted protein quantitation by MRM and related mass spectrometry methods can advance biochemistry by transforming approaches to protein measurement.
Reid, Jeffrey G; Carroll, Andrew; Veeraraghavan, Narayanan; Dahdouli, Mahmoud; Sundquist, Andreas; English, Adam; Bainbridge, Matthew; White, Simon; Salerno, William; Buhay, Christian; Yu, Fuli; Muzny, Donna; Daly, Richard; Duyk, Geoff; Gibbs, Richard A; Boerwinkle, Eric
2014-01-29
Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.
Su, Fei; Ou, Hong-Yu; Tao, Fei; Tang, Hongzhi; Xu, Ping
2013-12-27
With genomic sequences of many closely related bacterial strains made available by deep sequencing, it is now possible to investigate trends in prokaryotic microevolution. Positive selection is a sub-process of microevolution, in which a particular mutation is favored, causing the allele frequency to continuously shift in one direction. Wide scanning of prokaryotic genomes has shown that positive selection at the molecular level is much more frequent than expected. Genes with significant positive selection may play key roles in bacterial adaptation to different environmental pressures. However, selection pressure analyses are computationally intensive and awkward to configure. Here we describe an open access web server, designated PSP (Positive Selection analysis for Prokaryotic genomes), for performing evolutionary analysis on orthologous coding genes, specifically designed for rapid comparison of dozens of closely related prokaryotic genomes. PSP also facilitates functional exploration at multiple levels through assignment and enrichment of KO, GO or COG terms. To illustrate this user-friendly tool, we analyzed Escherichia coli and Bacillus cereus genomes and found that several genes that play key roles in human infection and antibiotic resistance show significant evidence of positive selection. PSP is freely available to all users without any login requirement at: http://db-mml.sjtu.edu.cn/PSP/. PSP ultimately allows researchers to perform genome-scale analysis of evolutionary selection across multiple prokaryotic genomes rapidly and easily, and to identify the genes undergoing positive selection, which may play key roles in host-pathogen interactions and/or environmental adaptation.
Anticancer property of sediment actinomycetes against MCF-7 and MDA-MB-231 cell lines.
Ravikumar, S; Fredimoses, M; Gnanadesigan, M
2012-02-01
To investigate the anticancer property of marine sediment actinomycetes against two different breast cancer cell lines. In vitro anticancer activity was assessed against breast cancer cell lines (MCF-7 and MDA-MB-231). Partial sequencing of the 16S rRNA gene, phylogenetic tree construction, multiple sequence analysis and secondary structure analysis were also carried out with the actinomycete isolates. Of the five selected actinomycete isolates, ACT01 and ACT02 showed IC50 values of (10.13±0.92) and (22.34±5.82) µg/mL, respectively, for the MCF-7 cell line at 48 h, but ACT01 showed the minimum IC50 value (18.54±2.49 µg/mL) with the MDA-MB-231 cell line. Further, the 16S rRNA partial sequences of the ACT01, ACT02, ACT03, ACT04 and ACT05 isolates were deposited in the NCBI data bank with the accession numbers GQ478246, GQ478247, GQ478248, GQ478249 and GQ478250, respectively. The phylogenetic tree analysis showed that the isolates ACT02 and ACT03 were represented in groups I and III, respectively, but ACT01 and ACT02 were represented in group II. The multiple sequence alignment of the actinomycete isolates showed that the maximum identical conserved regions were identified with the nucleotide regions of 125 to 221st base pairs, 65 to 119th base pairs and 55, 48 and 31st base pairs. Secondary structure prediction of the 16S rRNA showed that the maximum free energy was consumed with the ACT03 isolate (-45.4 kcal/mol) and the minimum free energy was consumed with the ACT04 isolate (-57.6 kcal/mol). The actinomycete isolates ACT01 and ACT02 (GQ478246 and GQ478247), which were isolated from the sediment sample, can be further used as anticancer agents against breast cancer cell lines.
Report for the NGFA-5 project.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jaing, C; Jackson, P; Thissen, J
The objective of this project is to provide DHS a comprehensive evaluation of the current genomic technologies including genotyping, TaqMan PCR, multiple locus variable tandem repeat analysis (MLVA), microarray and high-throughput DNA sequencing in the analysis of biothreat agents from complex environmental samples. To effectively compare the sensitivity and specificity of the different genomic technologies, we used SNP TaqMan PCR, MLVA, microarray and high-throughput Illumina and 454 sequencing to test various strains from B. anthracis, B. thuringiensis, BioWatch aerosol filter extracts or soil samples that were spiked with B. anthracis, and samples that were previously collected during DHS and EPA environmental release exercises that were known to contain B. thuringiensis spores. The results of all the samples against the various assays are discussed in this report.
DraGnET: Software for storing, managing and analyzing annotated draft genome sequence data
2010-01-01
Background New "next generation" DNA sequencing technologies offer individual researchers the ability to rapidly generate large amounts of genome sequence data at dramatically reduced costs. As a result, a need has arisen for new software tools for storage, management and analysis of genome sequence data. Although bioinformatic tools are available for the analysis and management of genome sequences, limitations still remain. For example, restrictions on the submission of data and use of these tools may be imposed, thereby making them unsuitable for sequencing projects that need to remain in-house or proprietary during their initial stages. Furthermore, the availability and use of next generation sequencing in industrial, governmental and academic environments requires biologists to have access to computational support for the curation and analysis of the data generated; however, this type of support is not always immediately available. Results To address these limitations, we have developed DraGnET (Draft Genome Evaluation Tool). DraGnET is an open source web application which allows researchers with no experience in programming and database management to set up their own in-house projects for storing, retrieving, organizing and managing annotated draft and complete genome sequence data. The software provides a web interface for the use of BLAST, allowing users to perform preliminary comparative analysis among multiple genomes. We demonstrate the utility of DraGnET for performing comparative genomics on closely related bacterial strains. Furthermore, DraGnET can be further developed to incorporate additional tools for more sophisticated analyses. Conclusions DraGnET is designed for use either by individual researchers or as a collaborative tool available through Internet (or Intranet) deployment. For genome projects that require genome sequencing data to initially remain proprietary, DraGnET provides the means for researchers to keep their data in-house for analysis using local programs or until it is made publicly available, at which point it may be uploaded to additional analysis software applications. The DraGnET home page is available at http://www.dragnet.cvm.iastate.edu and includes example files for examining the functionalities, a link for downloading the DraGnET setup package and a link to the DraGnET source code hosted with full documentation on SourceForge. PMID:20175920
NASA Astrophysics Data System (ADS)
Campbell, T. L.; Geller, J. B.; Heller, P.; Ruiz, G.; Chang, A.; McCann, L.; Ceballos, L.; Marraffini, M.; Ashton, G.; Larson, K.; Havard, S.; Meagher, K.; Wheelock, M.; Drake, C.; Rhett, G.
2016-02-01
The Ballast Water Management Act, the Marine Invasive Species Act, and the Coastal Ecosystem Protection Act require the California Department of Fish and Wildlife to monitor and evaluate the extent of biological invasions in the state's marine and estuarine waters. This has been performed statewide, using a variety of methodologies. Conventional sample collection and processing is laborious, slow and costly, and may require considerable taxonomic expertise requiring detailed time-consuming microscopic study of multiple specimens. These factors limit the volume of biomass that can be searched for introduced species. New technologies continue to reduce the cost and increase the throughput of genetic analyses, which become efficient alternatives to traditional morphological analysis for identification, monitoring and surveillance of marine invasive species. Using next-generation sequencing of mitochondrial Cytochrome c oxidase subunit I (COI) and nuclear large subunit ribosomal RNA (LSU), we analyzed over 15,000 individual marine invertebrates collected in Californian waters. We have created sequence databases of California native and non-native species to assist in molecular identification and surveillance in North American waters. Metagenetics, the next-generation sequencing of environmental samples with comparison to DNA sequence databases, is a faster and cost-effective alternative to individual sample analysis. We have sequenced from biomass collected from whole settlement plates and plankton in California harbors, and used our introduced species database to create species lists. We can combine these species lists for individual marinas with collected environmental data, such as temperature, salinity, and dissolved oxygen to understand the ecology of marine invasions. Here we discuss high throughput sampling, sequencing, and COASTLINE, our data analysis answer to challenges working with hundreds of millions of sequencing reads from tens of thousands of specimens.
Jaschob, Daniel; Davis, Trisha N; Riffle, Michael
2014-07-23
As high throughput sequencing grows more commonplace, the need to disseminate the resulting data via web applications continues to grow. In particular, there is a need to disseminate multiple versions of related gene and protein sequences simultaneously, whether they represent alleles present in a single species, variations of the same gene among different strains, or homologs among separate species. Often this is accomplished by displaying all versions of the sequence at once in a manner that is not intuitive or space-efficient and does not facilitate human understanding of the data. Web-based applications needing to disseminate multiple versions of sequences would benefit from a drop-in module designed to effectively disseminate these data. SnipViz is a client-side software tool designed to disseminate multiple versions of related gene and protein sequences on web sites. SnipViz has a space-efficient, interactive, and dynamic interface for navigating, analyzing and visualizing sequence data. It is written using standard World Wide Web technologies (HTML, JavaScript, and CSS) and is compatible with most web browsers. SnipViz is designed as a modular client-side web component and may be incorporated into virtually any web site and be implemented without any programming. SnipViz is a drop-in client-side module for web sites designed to efficiently visualize and disseminate gene and protein sequences. SnipViz is open source and is freely available at https://github.com/yeastrc/snipviz.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ruggles, Kelly V.; Tang, Zuojian; Wang, Xuya
Improvements in mass spectrometry (MS)-based peptide sequencing provide a new opportunity to determine whether polymorphisms, mutations and splice variants identified in cancer cells are translated. Herein we therefore describe a proteogenomic data integration tool (QUILTS) and illustrate its application to whole genome, transcriptome and global MS peptide sequence datasets generated from a pair of luminal and basal-like breast cancer patient derived xenografts (PDX). The sensitivity of proteogenomic analysis for single nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS process replicates. Despite over thirty sample replicates, only about 10% of all SNV (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNV without a detectable mRNA transcript were also observed, demonstrating that the transcriptome coverage was also incomplete (~80%). In contrast to germ-line variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than in the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, the QUILTS program integrates DNA, RNA and peptide sequencing to assess the degree to which somatic mutations are translated and therefore biologically active. By identifying gaps in sequence coverage, QUILTS benchmarks current technology and assesses progress towards whole cancer proteome and transcriptome analysis.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Willis, Leslie G.; Siepp, Robyn; Stewart, Taryn M.
2005-08-01
The genome of the Trichoplusia ni single nucleopolyhedrovirus (TnSNPV), a group II NPV which infects the cabbage looper (T. ni), has been completely sequenced and analyzed. The TnSNPV DNA genome consists of 134,394 bp and has an overall G + C content of 39%. Gene analysis predicted 144 open reading frames (ORFs) of 150 nucleotides or greater that showed minimal overlap. Comparisons with previously sequenced baculoviruses indicate that 119 TnSNPV ORFs were homologues of previously reported viral gene sequences. Ninety-four TnSNPV ORFs returned an Autographa californica multiple NPV (AcMNPV) homologue while 25 ORFs returned poor or no sequence matches with the current databases. A putative photolyase gene was also identified that had highest amino acid identity to the photolyase genes of Chrysodeixis chalcites NPV (ChchNPV) (47%) and Danio rerio (zebrafish) (40%). In addition, unlike all other baculoviruses, no obvious homologous repeat (hr) sequences were identified. Comparison of the TnSNPV and AcMNPV genomes provides a unique opportunity to examine two baculoviruses that are highly virulent for a common insect host (T. ni) yet belong to diverse baculovirus taxonomic groups and possess distinct biological features. In vitro fusion assays demonstrated that the TnSNPV F protein induces membrane fusion and syncytia formation; these syncytia were compared to those formed by AcMNPV GP64.
Generating Models of Surgical Procedures using UMLS Concepts and Multiple Sequence Alignment
Meng, Frank; D’Avolio, Leonard W.; Chen, Andrew A.; Taira, Ricky K.; Kangarloo, Hooshang
2005-01-01
Surgical procedures can be viewed as a process composed of a sequence of steps performed on, by, or with the patient’s anatomy. This sequence is typically the pattern followed by surgeons when generating surgical report narratives for documenting surgical procedures. This paper describes a methodology for semi-automatically deriving a model of conducted surgeries, utilizing a sequence of derived Unified Medical Language System (UMLS) concepts for representing surgical procedures. A multiple sequence alignment was computed from a collection of such sequences and was used for generating the model. These models have the potential of being useful in a variety of informatics applications such as information retrieval and automatic document generation. PMID:16779094
NASA Astrophysics Data System (ADS)
He, Lidong; Anderson, Lissa C.; Barnidge, David R.; Murray, David L.; Hendrickson, Christopher L.; Marshall, Alan G.
2017-05-01
With the rapid growth of therapeutic monoclonal antibodies (mAbs), stringent quality control is needed to ensure clinical safety and efficacy. Monoclonal antibody primary sequence and post-translational modifications (PTM) are conventionally analyzed with labor-intensive, bottom-up tandem mass spectrometry (MS/MS), which is limited by incomplete peptide sequence coverage and introduction of artifacts during the lengthy analysis procedure. Here, we describe top-down and middle-down approaches with the advantages of fast sample preparation with minimal artifacts, ultrahigh mass accuracy, and extensive residue cleavages by use of 21 tesla FT-ICR MS/MS. The ultrahigh mass accuracy yields an RMS error of 0.2-0.4 ppm for antibody light chain, heavy chain, heavy chain Fc/2, and Fd subunits. The corresponding sequence coverages are 81%, 38%, 72%, and 65% with MS/MS RMS error 4 ppm. Extension to a monoclonal antibody in human serum as a monoclonal gammopathy model yielded 53% sequence coverage from two nano-LC MS/MS runs. A blind analysis of five therapeutic monoclonal antibodies at clinically relevant concentrations in human serum resulted in correct identification of all five antibodies. Nano-LC 21 T FT-ICR MS/MS provides nonpareil mass resolution, mass accuracy, and sequence coverage for mAbs, and sets a benchmark for MS/MS analysis of multiple mAbs in serum. This is the first time that extensive cleavages for both variable and constant regions have been achieved for mAbs in a human serum background.
Winterhoff, Boris J; Maile, Makayla; Mitra, Amit Kumar; Sebe, Attila; Bazzaro, Martina; Geller, Melissa A; Abrahante, Juan E; Klein, Molly; Hellweg, Raffaele; Mullany, Sally A; Beckman, Kenneth; Daniel, Jerry; Starr, Timothy K
2017-03-01
The purpose of this study was to determine the level of heterogeneity in high grade serous ovarian cancer (HGSOC) by analyzing RNA expression in single epithelial and cancer associated stromal cells. In addition, we explored the possibility of identifying subgroups based on pathway activation and pre-defined signatures from cancer stem cells and chemo-resistant cells. A fresh HGSOC tumor specimen derived from ovary was enzymatically digested and depleted of immune infiltrating cells. RNA sequencing was performed on 92 single cells and 66 of these single cell datasets passed quality control checks. Sequences were analyzed using multiple bioinformatics tools, including clustering, principal component analysis, and gene set enrichment analysis to identify subgroups and activated pathways. Immunohistochemistry for ovarian cancer, stem cell and stromal markers was performed on adjacent tumor sections. Analysis of the gene expression patterns identified two major subsets of cells characterized by epithelial and stromal gene expression patterns. The epithelial group was characterized by proliferative genes including genes associated with oxidative phosphorylation and MYC activity, while the stromal group was characterized by increased expression of extracellular matrix (ECM) genes and genes associated with epithelial-to-mesenchymal transition (EMT). Neither group expressed a signature correlating with published chemo-resistant gene signatures, but many cells, predominantly in the stromal subgroup, expressed markers associated with cancer stem cells. Single cell sequencing provides a means of identifying subpopulations of cancer cells within a single patient. Single cell sequence analysis may prove to be critical for understanding the etiology, progression and drug resistance in ovarian cancer. Copyright © 2017 Elsevier Inc. All rights reserved.
Jenista, Elizabeth R; Stokes, Ashley M; Branca, Rosa Tamara; Warren, Warren S
2009-11-28
A recent quantum computing paper (G. S. Uhrig, Phys. Rev. Lett. 98, 100504 (2007)) analytically derived optimal pulse spacings for a multiple spin echo sequence designed to remove decoherence in a two-level system coupled to a bath. The spacings in what has been called a "Uhrig dynamic decoupling (UDD) sequence" differ dramatically from the conventional, equal pulse spacing of a Carr-Purcell-Meiboom-Gill (CPMG) multiple spin echo sequence. The UDD sequence was derived for a model that is unrelated to magnetic resonance, but was recently shown theoretically to be more general. Here we show that the UDD sequence has theoretical advantages for magnetic resonance imaging of structured materials such as tissue, where diffusion in compartmentalized and microstructured environments leads to fluctuating fields on a range of different time scales. We also show experimentally, both in excised tissue and in a live mouse tumor model, that optimal UDD sequences produce different T2-weighted contrast than do CPMG sequences with the same number of pulses and total delay, with substantial enhancements in most regions. This permits improved characterization of low-frequency spectral density functions in a wide range of applications.
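The difference between the two pulse schedules can be made concrete with a short sketch. It assumes the standard formulas from the literature rather than anything stated in the abstract itself: for n pulses over a total delay T, UDD places pulse j at t_j = T sin^2(pi j / (2n + 2)), while CPMG places it at t_j = T (2j - 1) / (2n). The function names are illustrative only.

```python
import math


def cpmg_times(n, total):
    """Pulse times for n equally spaced CPMG pulses over a total delay."""
    return [total * (2 * j - 1) / (2 * n) for j in range(1, n + 1)]


def udd_times(n, total):
    """Pulse times from the UDD formula t_j = T * sin^2(pi * j / (2n + 2))."""
    return [
        total * math.sin(math.pi * j / (2 * n + 2)) ** 2
        for j in range(1, n + 1)
    ]


if __name__ == "__main__":
    # Compare both schedules for the same pulse count and total delay.
    n, total = 6, 1.0
    for j, (c, u) in enumerate(zip(cpmg_times(n, total), udd_times(n, total)), 1):
        print(f"pulse {j}: CPMG {c:.4f}  UDD {u:.4f}")
```

Under these formulas the two schedules coincide for one or two pulses and diverge from three pulses onward, with UDD pulses crowding toward the start and end of the delay.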
Cui, Zhihua; Zhang, Yi
2014-02-01
As a promising and innovative research field, bioinformatics has attracted increasing attention recently. Among the enormous number of open problems in this field, one fundamental issue is the need for accurate and efficient computational methodology that can deal with tremendous amounts of data. In this paper, we survey some applications of swarm intelligence to discovering patterns in multiple sequences. To provide deeper insight, ant colony optimization, particle swarm optimization, artificial bee colony and artificial fish swarm algorithms are selected, and their applications to multiple sequence alignment and motif detection problems are discussed.
Gent, Jonathan I; Wang, Na; Dawe, R Kelly
2017-06-21
Paradoxically, centromeres are known both for their characteristic repeat sequences (satellite DNA) and for being epigenetically defined. Maize (Zea mays mays) is an attractive model for studying centromere positioning because many of its large (~2 Mb) centromeres are not dominated by satellite DNA. These centromeres, which we call complex centromeres, allow for both assembly into reference genomes and for mapping short reads from ChIP-seq with antibodies to centromeric histone H3 (cenH3). We found frequent complex centromeres in maize and its wild relatives Z. mays parviglumis, Z. mays mexicana, and particularly Z. mays huehuetenangensis. Analysis of individual plants reveals minor variation in the positions of complex centromeres among siblings. However, such positional shifts are stochastic and not heritable, consistent with prior findings that centromere positioning is stable at the population level. Centromeres are also stable in multiple F1 hybrid contexts. Analysis of repeats in Z. mays and other species (Zea diploperennis, Zea luxurians, and Tripsacum dactyloides) reveals tenfold differences in abundance of the major satellite CentC, but similar high levels of sequence polymorphism in individual CentC copies. Deviation from the CentC consensus has little or no effect on binding of cenH3. These data indicate that complex centromeres are neither a peculiarity of cultivation nor inbreeding in Z. mays. While extensive arrays of CentC may be the norm for other Zea and Tripsacum species, these data also reveal that a wide diversity of DNA sequences and multiple types of genetic elements in and near centromeres support centromere function and constrain centromere positions.
Zheng, Guanglou; Fang, Gengfa; Shankaran, Rajan; Orgun, Mehmet A; Zhou, Jie; Qiao, Li; Saleem, Kashif
2017-05-01
Generating random binary sequences (BSes) is a fundamental requirement in cryptography. A BS is a sequence of N bits, and each bit has a value of 0 or 1. For securing sensors within wireless body area networks (WBANs), electrocardiogram (ECG)-based BS generation methods have been widely investigated in which interpulse intervals (IPIs) from each heartbeat cycle are processed to produce BSes. Using these IPI-based methods to generate a 128-bit BS in real time normally takes around half a minute. In order to improve the time efficiency of such methods, this paper presents an ECG multiple fiducial-points based binary sequence generation (MFBSG) algorithm. The discrete wavelet transform is employed to detect the arrival times of these fiducial points, such as the P, Q, R, S, and T peaks. Time intervals between them, including the RR, RQ, RS, RP, and RT intervals, are then calculated from these arrival times and are used as ECG features to generate random BSes with low latency. According to our analysis on real ECG data, these ECG feature values exhibit the property of randomness and, thus, can be utilized to generate random BSes. Compared with schemes that rely solely on IPIs to generate BSes, the MFBSG algorithm uses five feature values from one heartbeat cycle, and can be up to five times faster than the solely IPI-based methods. It thus achieves the design goal of low latency. According to our analysis, the complexity of the algorithm is comparable to that of fast Fourier transforms. These randomly generated ECG BSes can be used as security keys for encryption or authentication in a WBAN system.
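The interval features and the latency argument in this abstract can be sketched as follows. This is a minimal illustration, not the MFBSG algorithm itself: the abstract does not say how intervals are turned into bits, so the quantizer below (keeping the four low-order bits of each interval in milliseconds) and all function names are hypothetical, and the fiducial-point arrival times are assumed to have been detected already.

```python
def beat_intervals(prev_r, beat):
    """Return the RR, RQ, RS, RP, RT intervals (ms) for one heartbeat.

    beat maps fiducial-point names to arrival times in ms; prev_r is the
    arrival time of the previous beat's R peak.
    """
    r = beat["R"]
    return [
        r - prev_r,     # RR: previous R peak to this R peak
        r - beat["Q"],  # RQ
        beat["S"] - r,  # RS
        r - beat["P"],  # RP
        beat["T"] - r,  # RT
    ]


def bits_from_intervals(intervals, bits_per_feature=4):
    """Hypothetical quantizer: keep the low-order bits of each interval."""
    out = []
    for value in intervals:
        v = int(round(value))
        for shift in range(bits_per_feature - 1, -1, -1):
            out.append((v >> shift) & 1)
    return out


def beats_needed(key_bits, features_per_beat, bits_per_feature=4):
    """Heartbeats required to collect key_bits bits (ceiling division)."""
    per_beat = features_per_beat * bits_per_feature
    return -(-key_bits // per_beat)


if __name__ == "__main__":
    # Made-up arrival times for one beat, for illustration only.
    beat = {"P": 840, "Q": 960, "R": 1000, "S": 1030, "T": 1250}
    print(bits_from_intervals(beat_intervals(180, beat)))
    print(beats_needed(128, 1), beats_needed(128, 5))
```

With the assumed four bits per feature, a 128-bit sequence needs 32 beats from a single interval per beat but 7 beats from five intervals, which is in line with the "around half a minute" and "up to five times faster" figures quoted above.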