Sharma, Aseem; Chatterjee, Arindam; Goyal, Manu; Parsons, Matthew S; Bartel, Seth
2015-04-01
Targeting redundancy within MRI can improve its cost-effective utilization. We sought to quantify potential redundancy in our brain MRI protocols. In this retrospective review, we aggregated 207 consecutive adults who underwent brain MRI and reviewed their medical records to document clinical indication, core diagnostic information provided by MRI, and its clinical impact. Contributory imaging abnormalities constituted positive core diagnostic information whereas absence of imaging abnormalities constituted negative core diagnostic information. The senior author selected core sequences deemed sufficient for extraction of core diagnostic information. For validating core sequences selection, four readers assessed the relative ease of extracting core diagnostic information from the core sequences. Potential redundancy was calculated by comparing the average number of core sequences to the average number of sequences obtained. Scanning had been performed using 9.4±2.8 sequences over 37.3±12.3 minutes. Core diagnostic information was deemed extractable from 2.1±1.1 core sequences, with an assumed scanning time of 8.6±4.8 minutes, reflecting a potential redundancy of 74.5%±19.1%. Potential redundancy was least in scans obtained for treatment planning (14.9%±25.7%) and highest in scans obtained for follow-up of benign diseases (81.4%±12.6%). In 97.4% of cases, all four readers considered core diagnostic information to be either easily extractable from core sequences or the ease to be equivalent to that from the entire study. With only one MRI lacking clinical impact (0.48%), overutilization did not seem to contribute to potential redundancy. High potential redundancy that can be targeted for more efficient scanner utilization exists in brain MRI protocols.
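The redundancy figure above can be illustrated with a short sketch. The abstract does not state the exact formula, so this assumes redundancy is computed per scan as one minus the ratio of core sequences to sequences obtained, then averaged across scans. The function name and the sample counts are hypothetical and are not study data.

```python
from statistics import mean, stdev

def potential_redundancy(core_counts, obtained_counts):
    """Per-scan redundancy: fraction of acquired sequences not needed
    for the core diagnostic information (assumed formula: 1 - core/obtained)."""
    return [1 - core / obtained
            for core, obtained in zip(core_counts, obtained_counts)]

# Hypothetical sequence counts for four scans (illustrative only).
core = [2, 1, 3, 2]
obtained = [10, 8, 12, 8]

per_scan = potential_redundancy(core, obtained)
print(f"potential redundancy: {mean(per_scan):.1%} ± {stdev(per_scan):.1%}")
```

Averaging per-scan values, rather than dividing the two cohort averages, is one plausible reading of how a standard deviation could accompany the reported percentage.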
Rubin, D A; Dores, R M
1995-06-01
In order to obtain a better-resolved phylogeny of teleosts based on growth hormone (GH) sequences, phylogenetic analyses were performed in which deletions (gaps), which appear to be order specific, were retained to preserve GH's structural information. Sequences were analyzed at 194 amino acid positions. In addition, the two groups most closely related genealogically to the teleosts, represented by Amia calva and Acipenser guldenstadti, were used as outgroups. Modified sequence alignments were also analyzed to determine clade stability. Analyses indicated, in the most parsimonious cladogram, that molecular and morphological relationships for the orders of fishes are congruent. With GH molecular sequence data it was possible to resolve all clades at the familial level. Analyses of the primary sequence data indicate that: (a) the halecomorphean and chondrostean GH sequences are the appropriate outgroups for generating the most parsimonious cladogram for teleosts; (b) proper alignment of teleost GH sequence by the inclusion of gaps is necessary for resolution of the Percomorpha; and (c) removal of sequence information by deleting improperly aligned sequence decreases the phylogenetic signal obtained.
Kamatuka, Kenta; Hattori, Masahiro; Sugiyama, Tomoyasu
2016-12-01
RNA interference (RNAi) screening is extensively used in the field of reverse genetics. RNAi libraries constructed using random oligonucleotides have made this technology affordable. However, the new methodology requires exploration of the RNAi target gene information after screening because the RNAi library includes non-natural sequences that are not found in genes. Here, we developed a web-based tool to support RNAi screening. The system performs short hairpin RNA (shRNA) target prediction that is informed by comprehensive enquiry (SPICE). SPICE automates several tasks that are laborious but indispensable to evaluate the shRNAs obtained by RNAi screening. SPICE has four main functions: (i) sequence identification of shRNA in the input sequence (the sequence might be obtained by sequencing clones in the RNAi library), (ii) searching the target genes in the database, (iii) demonstrating biological information obtained from the database, and (iv) preparation of search result files that can be utilized in a local personal computer (PC). Using this system, we demonstrated that genes targeted by random oligonucleotide-derived shRNAs were not different from those targeted by organism-specific shRNA. The system facilitates RNAi screening, which requires sequence analysis after screening. The SPICE web application is available at http://www.spice.sugysun.org/.
Kwarciak, Kamil; Radom, Marcin; Formanowicz, Piotr
2016-04-01
Classical sequencing by hybridization takes into account only binary information about sequence composition: a given element of an oligonucleotide library either is or is not a part of the target sequence. However, DNA chip technology has developed to the point where it can provide partial information about the multiplicity of each oligonucleotide that the analyzed sequence consists of. Currently, it is not possible to obtain exact data of this type, but even partial information should be very useful. Two realistic multiplicity information models are considered in this paper. The first one, called "one and many", assumes that it is possible to determine whether a given oligonucleotide occurs in a reconstructed sequence once or more than once. According to the second model, called "one, two and many", a biochemical experiment can reveal whether a given oligonucleotide is present in an analyzed sequence once, twice or at least three times. An ant colony optimization algorithm has been implemented to verify the above models and to compare them with existing algorithms for sequencing by hybridization that utilize the additional information. The proposed algorithm solves the problem with any kind of hybridization errors. Computational experiment results confirm that using even partial information about multiplicity leads to increased quality of reconstructed sequences. Moreover, they also show that the more precise model yields better solutions and that the ant colony optimization algorithm outperforms the existing ones. Test data sets and the proposed ant colony optimization algorithm are available at: http://bioserver.cs.put.poznan.pl/download/ACO4mSBH.zip. Copyright © 2016 Elsevier Ltd. All rights reserved.
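The two multiplicity models can be illustrated by building an error-free oligonucleotide spectrum and capping the observable count. This is a minimal sketch and not the authors' code: the function name and the `cap` parameter are assumptions introduced here. A cap of 1 gives the classical binary spectrum, 2 gives "one and many", and 3 gives "one, two and many".

```python
from collections import Counter

def spectrum_with_multiplicity(sequence, k, cap):
    """Count every k-mer of `sequence`, reporting at most `cap` occurrences.

    cap=1 -> classical binary spectrum (present / absent)
    cap=2 -> "one and many" model (once, or more than once)
    cap=3 -> "one, two and many" model (once, twice, or at least three times)
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {oligo: min(n, cap) for oligo, n in counts.items()}

# Illustrative target sequence (hypothetical, error-free hybridization).
target = "ACGACGACGT"
for cap in (1, 2, 3):
    print(cap, spectrum_with_multiplicity(target, k=3, cap=cap))
```

In this toy sequence the 3-mer ACG occurs three times, so it is the only oligonucleotide whose reported value differs between the two multiplicity models.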
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.
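As a rough illustration of the quantity that SAMMI maximizes, the sketch below computes the mutual information between MSA columns and averages it over all column pairs. It is not the SAMMI implementation: the function names, the plain frequency estimates, and the choice of a mean over column pairs as the summary are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long alignment columns."""
    n = len(col_a)
    freq_a, freq_b = Counter(col_a), Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
        for (a, b), c in freq_ab.items()
    )

def mean_pairwise_mi(msa):
    """Average mutual information over all column pairs of an MSA (list of rows)."""
    columns = ["".join(row[i] for row in msa) for i in range(len(msa[0]))]
    pairs = list(combinations(columns, 2))
    return sum(column_mutual_information(a, b) for a, b in pairs) / len(pairs)

# Toy alignment (hypothetical): the two columns co-vary perfectly.
toy_msa = ["AG", "AG", "CT", "CT"]
print(mean_pairwise_mi(toy_msa))
```

Under these assumptions, candidate MSAs could be ranked by such a score and the highest-scoring one retained; how SAMMI actually aggregates and normalizes the column information is not specified in the abstract.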
[Complete genome sequencing and sequence analysis of BCG Tice].
Wang, Zhiming; Pan, Yuanlong; Wu, Jun; Zhu, Baoli
2012-10-04
The objective of this study is to obtain the complete genome sequence of Bacillus Calmette-Guerin Tice (BCG Tice), in order to provide more information about the molecular biology of BCG Tice and design more reasonable vaccines to prevent tuberculosis. We assembled the data from high-throughput sequencing with SOAPdenovo software, obtaining many contigs and scaffolds. Many sequence gaps and physical gaps remained as a result of regional low coverage and low quality. We designed primers at the ends of contigs and performed PCR amplification in order to link these contigs and scaffolds. By using various enzymes for PCR amplification, adjusting the PCR reaction conditions, and combining these with clone construction for sequencing, all the gaps were closed. We obtained the complete genome sequence of BCG Tice and submitted it to GenBank of the National Center for Biotechnology Information (NCBI). The genome of BCG Tice is 4334064 base pairs in length, with a GC content of 65.65%. The problems and strategies during the finishing step of BCG Tice sequencing are described here, with the hope of affording some experience to those who are involved in the finishing step of genome sequencing. The microarray data were verified by our results.
Rutvisuttinunt, Wiriya; Chinnawirotpisan, Piyawan; Simasathien, Sriluck; Shrestha, Sanjaya K; Yoon, In-Kyu; Klungthong, Chonticha; Fernandez, Stefan
2013-11-01
Active global surveillance and characterization of influenza viruses are essential for better preparation against possible pandemic events. Obtaining comprehensive information about the influenza genome can improve our understanding of the evolution of influenza viruses and emergence of new strains, and improve the accuracy when designing preventive vaccines. This study investigated the use of deep sequencing by the next-generation sequencing (NGS) Illumina MiSeq Platform to obtain complete genome sequence information from influenza virus isolates. The influenza virus isolates were cultured from 6 acute respiratory clinical specimens collected in Thailand and Nepal. DNA libraries obtained from each viral isolate were mixed and all were sequenced simultaneously. Total information of 2.6 Gbases was obtained from a 455±14 K/mm² density with 95.76% (8,571,655/8,950,724 clusters) of the clusters passing quality control (QC) filters. Approximately 93.7% of all sequences from Read1 and 83.5% from Read2 contained high quality sequences that were ≥Q30, a base calling QC score standard. Alignment analysis identified three seasonal influenza A H3N2 strains, one 2009 pandemic influenza A H1N1 strain and two influenza B strains. The nearly complete genomes of all six virus isolates yielded sequence coverage depths of 600-fold or greater. The MiSeq Platform efficiently identified seasonal influenza A H3N2, 2009 pandemic influenza A H1N1 and influenza B in the DNA library mixtures. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
Poliovirus serotype-specific VP1 sequencing primers.
Kilpatrick, David R; Iber, Jane C; Chen, Qi; Ching, Karen; Yang, Su-Ju; De, Lina; Mandelbaum, Mark D; Emery, Brian; Campagnoli, Ray; Burns, Cara C; Kew, Olen
2011-06-01
The Global Polio Laboratory Network routinely uses poliovirus-specific PCR primers and probes to determine the serotype and genotype of poliovirus isolates obtained as part of global poliovirus surveillance. To provide detailed molecular epidemiologic information, poliovirus isolates are further characterized by sequencing the ~900-nucleotide region encoding the major capsid protein, VP1. It is difficult to obtain quality sequence information when clinical or environmental samples contain poliovirus mixtures. As an alternative to conventional methods for resolving poliovirus mixtures, sets of serotype-specific primers were developed for amplifying and sequencing the VP1 regions of individual components of mixed populations of vaccine-vaccine, vaccine-wild, and wild-wild polioviruses. Published by Elsevier B.V.
Nakazato, Takeru; Bono, Hidemasa
2017-01-01
It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the large number of search results. Repository users need to be able to set a threshold to reduce the number of results to obtain a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information and metadata of experiments and samples. We provide quality information of all of the archived sequencing data, which enable users to obtain sufficient quality sequencing data for reanalyses. The calculated quality data are available to the public in various formats. Our data also provide an example of enhancing the reuse of public data by adding metadata to published research data by a third party. PMID:28449062
Cryptosporidium meleagridis in an Indian ring-necked parrot (Psittacula krameri).
Morgan, U M; Xiao, L; Limor, J; Gelis, S; Raidal, S R; Fayer, R; Lal, A; Elliot, A; Thompson, R C
2000-03-01
To perform a morphological and genetic characterisation of a Cryptosporidium infection in an Indian ring-necked parrot (Psittacula krameri) and to compare this with C meleagridis from a turkey. Tissue and intestinal sections from an Indian ring-necked parrot were examined microscopically for Cryptosporidium. The organism was also purified from the crop and intestine, the DNA extracted and a portion of the 18S rDNA gene amplified, sequenced and compared with sequence and biological information obtained for C meleagridis from a turkey as well as sequence information for other species of Cryptosporidium. Morphological examination of tissue sections from an Indian ring-necked parrot revealed large numbers of Cryptosporidium oocysts attached to the apical border of enterocytes lining the intestinal tract. Purified Cryptosporidium oocysts measured about 5.1 x 4.5 microns, which conformed morphologically to C meleagridis. The sequence obtained from this isolate was identical to sequence information obtained from a C meleagridis isolate from a turkey. Cryptosporidium meleagridis was detected in an Indian ring-necked parrot using morphological and molecular methods. This is the first time that this species of Cryptosporidium has been reported in a non-galliform host and extends the known host range of C meleagridis.
Yohda, Masafumi; Yagi, Osami; Takechi, Ayane; Kitajima, Mizuki; Matsuda, Hisashi; Miyamura, Naoaki; Aizawa, Tomoko; Nakajima, Mutsuyasu; Sunairi, Michio; Daiba, Akito; Miyajima, Takashi; Teruya, Morimi; Teruya, Kuniko; Shiroma, Akino; Shimoji, Makiko; Tamotsu, Hinako; Juan, Ayaka; Nakano, Kazuma; Aoyama, Misako; Terabayashi, Yasunobu; Satou, Kazuhito; Hirano, Takashi
2015-07-01
A Dehalococcoides-containing bacterial consortium that performed dechlorination of 0.20 mM cis-1,2-dichloroethene to ethene in 14 days was obtained from the sediment mud of a lotus field. To obtain detailed information about the consortium, the metagenome was analyzed using the short-read next-generation sequencer SOLiD 3. Matching the obtained sequence tags with the reference genome sequences indicated that the Dehalococcoides sp. in the consortium was highly homologous to Dehalococcoides mccartyi CBDB1 and BAV1. Sequence comparison with the reference sequence constructed from 16S rRNA gene sequences in a public database showed the presence of Sedimentibacter, Sulfurospirillum, Clostridium, Desulfovibrio, Parabacteroides, Alistipes, Eubacterium, Peptostreptococcus and Proteocatella in addition to Dehalococcoides sp. After further enrichment, the members of the consortium were narrowed down to almost three species. Finally, the full-length circular genome sequence of the Dehalococcoides sp. in the consortium, D. mccartyi IBARAKI, was determined by analyzing the metagenome with the single-molecule DNA sequencer PacBio RS. The accuracy of the sequence was confirmed by matching it to the tag sequences obtained by SOLiD 3. The genome is 1,451,062 nt and the number of CDS is 1566, which includes 3 rRNA genes and 47 tRNA genes. There exist twenty-eight RDase genes that are accompanied by the genes for anchor proteins. The genome exhibits significant sequence identity with other Dehalococcoides spp. throughout the genome, but there exists a significant difference in the distribution of RDase genes. The combination of a short-read next-generation DNA sequencer and a long-read single-molecule DNA sequencer gives detailed information about a bacterial consortium. Copyright © 2014 The Society for Biotechnology, Japan. Published by Elsevier B.V. All rights reserved.
AMPLIFICATION OF RIBOSOMAL RNA SEQUENCES
This book chapter offers an overview of the use of ribosomal RNA sequences. A history of the technology traces the evolution of techniques to measure bacterial phylogenetic relationships and recent advances in obtaining rRNA sequence information. The manual also describes procedu...
Ohta, Tazro; Nakazato, Takeru; Bono, Hidemasa
2017-06-01
It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the large number of search results. Repository users need to be able to set a threshold to reduce the number of results to obtain a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information and metadata of experiments and samples. We provide quality information of all of the archived sequencing data, which enable users to obtain sufficient quality sequencing data for reanalyses. The calculated quality data are available to the public in various formats. Our data also provide an example of enhancing the reuse of public data by adding metadata to published research data by a third party. © The Authors 2017. Published by Oxford University Press.
Elman RNN based classification of proteins sequences on account of their mutual information.
Mishra, Pooja; Nath Pandey, Paras
2012-10-21
In the present work we estimate residue correlation within protein sequences by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence, so that each protein sequence is associated with its corresponding MIVs. These MIVs are given to an Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment free classification of protein sequences. We also conclude that MIVs based on sequence structural and solvent accessibility properties are better predictors. Copyright © 2012 Elsevier Ltd. All rights reserved.
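One way to picture a mutual information vector is as the list of MI values between residues separated by 1, 2, ..., D positions within a single sequence. The sketch below follows that reading, which is an assumption; the abstract does not give the exact construction. The function name, the `classes` argument, and the two-class residue grouping in the usage lines are hypothetical placeholders and are not the property classes used by the authors.

```python
from collections import Counter
from math import log2

def mutual_information_vector(sequence, max_gap, classes=None):
    """Return [MI(gap=1), ..., MI(gap=max_gap)] in bits for one sequence.

    `classes` optionally maps each residue to a property class (for example
    a structural or solvent accessibility class) before MI is estimated.
    """
    symbols = [classes.get(r, r) for r in sequence] if classes else list(sequence)
    miv = []
    for gap in range(1, max_gap + 1):
        pairs = list(zip(symbols, symbols[gap:]))
        n = len(pairs)
        if n == 0:
            miv.append(0.0)  # sequence too short for this separation
            continue
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)
        right = Counter(b for _, b in pairs)
        miv.append(sum(
            (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
            for (a, b), c in joint.items()
        ))
    return miv

# Hypothetical two-class grouping, for illustration only.
toy_classes = {"A": "h", "V": "h", "L": "h", "K": "p", "E": "p", "S": "p"}
print(mutual_information_vector("AKVELSAKVELS", max_gap=3, classes=toy_classes))
```

Each sequence then yields a fixed-length numeric vector regardless of its length, which is what makes such a representation usable as input to a classifier without alignment.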
NASA Astrophysics Data System (ADS)
McMillen, Chelsea L.; Wright, Patience M.; Cassady, Carolyn J.
2016-05-01
Matrix-assisted laser desorption/ionization (MALDI) in-source decay was studied in the negative ion mode on deprotonated peptides to determine its usefulness for obtaining extensive sequence information for acidic peptides. Eight biological acidic peptides, ranging in size from 11 to 33 residues, were studied by negative ion mode ISD (nISD). The matrices 2,5-dihydroxybenzoic acid, 2-aminobenzoic acid, 2-aminobenzamide, 1,5-diaminonaphthalene, 5-amino-1-naphthol, 3-aminoquinoline, and 9-aminoacridine were used with each peptide. Optimal fragmentation was produced with 1,5-diaminonaphthalene (DAN), and extensive sequence informative fragmentation was observed for every peptide except hirudin(54-65). Cleavage at the N-Cα bond of the peptide backbone, producing c' and z' ions, was dominant for all peptides. Cleavage of the N-Cα bond N-terminal to proline residues was not observed. The formation of c and z ions is also found in electron transfer dissociation (ETD), electron capture dissociation (ECD), and positive ion mode ISD, which are considered to be radical-driven techniques. Oxidized insulin chain A, which has four highly acidic oxidized cysteine residues, had less extensive fragmentation. This peptide also exhibited the only charge-localized fragmentation, with more pronounced product ion formation adjacent to the highly acidic residues. In addition, spectra were obtained by positive ion mode ISD for each protonated peptide; more sequence informative fragmentation was observed via nISD for all peptides. Three of the peptides studied had no product ion formation in ISD, but extensive sequence informative fragmentation was found in their nISD spectra. The results of this study indicate that nISD can be used to readily obtain sequence information for acidic peptides.
Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool.
Hoenen, Thomas; Groseth, Allison; Rosenke, Kyle; Fischer, Robert J; Hoenen, Andreas; Judson, Seth D; Martellaro, Cynthia; Falzarano, Darryl; Marzi, Andrea; Squires, R Burke; Wollenberg, Kurt R; de Wit, Emmie; Prescott, Joseph; Safronetz, David; van Doremalen, Neeltje; Bushmaker, Trenton; Feldmann, Friederike; McNally, Kristin; Bolay, Fatorma K; Fields, Barry; Sealy, Tara; Rayfield, Mark; Nichol, Stuart T; Zoon, Kathryn C; Massaquoi, Moses; Munster, Vincent J; Feldmann, Heinz
2016-02-01
Rapid sequencing of RNA/DNA from pathogen samples obtained during disease outbreaks provides critical scientific and public health information. However, challenges exist for exporting samples to laboratories or establishing conventional sequencers in remote outbreak regions. We successfully used a novel, pocket-sized nanopore sequencer at a field diagnostic laboratory in Liberia during the current Ebola virus outbreak.
Lu, Hui-Meng; Yin, Da-Chuan; Ye, Ya-Jing; Luo, Hui-Min; Geng, Li-Qiang; Li, Hai-Sheng; Guo, Wei-Hong; Shang, Peng
2009-01-01
As the most widely utilized technique to determine the 3-dimensional structure of protein molecules, X-ray crystallography can provide structure of the highest resolution among the developed techniques. The resolution obtained via X-ray crystallography is known to be influenced by many factors, such as the crystal quality, diffraction techniques, and X-ray sources, etc. In this paper, the authors found that the protein sequence could also be one of the factors. We extracted information of the resolution and the sequence of proteins from the Protein Data Bank (PDB), classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the best resolution obtained. The results showed that there was a pronounced correlation between the sequence similarity and the obtained resolution. These results indicate that protein structure itself is one variable that may affect resolution when X-ray crystallography is used.
Yoshida, Mitsunori; Fukano, Hanako; Miyamoto, Yuji; Shibayama, Keigo; Suzuki, Masato; Hoshino, Yoshihiko
2018-05-17
Mycobacterium marinum is a slowly growing, broad-host-range mycobacterial species. Here, we report the complete genome sequence of a Mycobacterium marinum type strain that was isolated from tubercles of diseased fish. This sequence will provide essential information for future taxonomic and comparative genome studies of its relatives. Copyright © 2018 Yoshida et al.
Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool
Groseth, Allison; Rosenke, Kyle; Fischer, Robert J.; Hoenen, Andreas; Judson, Seth D.; Martellaro, Cynthia; Falzarano, Darryl; Marzi, Andrea; Squires, R. Burke; Wollenberg, Kurt R.; de Wit, Emmie; Prescott, Joseph; Safronetz, David; van Doremalen, Neeltje; Bushmaker, Trenton; Feldmann, Friederike; McNally, Kristin; Bolay, Fatorma K.; Fields, Barry; Sealy, Tara; Rayfield, Mark; Nichol, Stuart T.; Zoon, Kathryn C.; Massaquoi, Moses; Munster, Vincent J.; Feldmann, Heinz
2016-01-01
Rapid sequencing of RNA/DNA from pathogen samples obtained during disease outbreaks provides critical scientific and public health information. However, challenges exist for exporting samples to laboratories or establishing conventional sequencers in remote outbreak regions. We successfully used a novel, pocket-sized nanopore sequencer at a field diagnostic laboratory in Liberia during the current Ebola virus outbreak. PMID:26812583
Bastien, Olivier; Maréchal, Eric
2008-08-07
Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value²) following the TULIP theorem. Simulated Z-value distributions are known to fit a Gumbel law. This remarkable property was not demonstrated and had no obvious biological support. We built a model of evolution of sequences based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory) is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the system's longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the folded and maturated protein level (positive selection pressure). Homologous sequences were considered as systems 1) having a high redundancy of information reflected by the magnitude of their alignment scores, and 2) whose components are the amino acids that can independently be damaged by random DNA mutations. From these assumptions, we deduced that information shared at each amino acid position evolved with a constant rate, corresponding to the information hazard rate, and that pairwise sequence alignment scores should follow a Gumbel distribution, whose parameters could find some theoretical rationale. In particular, one parameter corresponds to the information hazard rate. Extreme value distribution of alignment scores, assessed from high scoring segment pairs following the Karlin-Altschul model, can also be deduced from the Reliability Theory applied to molecular sequences. It reflects the redundancy of information between homologous sequences, under functional conservative pressure. This model also provides a link between concepts of biological sequence analysis and of systems biology.
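The Z-value approach described in this abstract can be sketched as follows: score the real pair, score many shuffled versions of one sequence, and standardize. The sketch below is illustrative only and is not the authors' code. The scoring scheme (match 2, mismatch -1, linear gap -2), the number of shuffles, and all function names are assumptions; the only relation taken from the abstract is the upper bound P ≤ 1/Z².

```python
import random
from statistics import mean, pstdev

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a simple linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def z_value(a, b, n_shuffles=200, seed=0):
    """Monte-Carlo Z-value: observed score versus scores against shuffled b."""
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        shuffled_scores.append(smith_waterman_score(a, "".join(letters)))
    sigma = pstdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean(shuffled_scores)) / sigma

def p_value_upper_bound(z):
    """Upper bound on the P-value, 1/Z^2, capped at 1 and uninformative for Z <= 0."""
    return min(1.0, 1.0 / z ** 2) if z > 0 else 1.0

# Hypothetical peptide compared with itself.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
z = z_value(seq, seq)
print(z, p_value_upper_bound(z))
```

Shuffling preserves amino acid composition while destroying order, so the Z-value measures how far the real score lies above what composition alone would produce.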
Ghosh, Pritha; Mathew, Oommen K; Sowdhamini, Ramanathan
2016-10-07
RNA-binding proteins (RBPs) interact with their cognate RNA(s) to form large biomolecular assemblies. They are versatile in their functionality and are involved in a myriad of processes inside the cell. RBPs with similar structural features and common biological functions are grouped together into families and superfamilies. It will be useful to obtain an early understanding and association of the RNA-binding property of sequences of gene products. Here, we report a web server, RStrucFam, to predict the structure, type of cognate RNA(s) and function(s) of proteins, where possible, from mere sequence information. The web server employs Hidden Markov Model scan (hmmscan) to enable association to a back-end database of structural and sequence families. The database (HMMRBP) comprises 437 HMMs of RBP families of known structure that have been generated using structure-based sequence alignments and 746 sequence-centric RBP family HMMs. The input protein sequence is associated with structural or sequence domain families, if structure or sequence signatures exist. When the protein is associated with a family of known structures, output features such as a multiple structure-based sequence alignment (MSSA) of the query with all other members of that family are provided. Further, cognate RNA partner(s) for that protein, Gene Ontology (GO) annotations, if any, and a homology model of the protein can be obtained. The users can also browse through the database for details pertaining to each family, protein or RNA and their related information based on keyword search or RNA motif search. RStrucFam is a web server that exploits structurally conserved features of RBPs, derived from known family members and imprinted in mathematical profiles, to predict putative RBPs from sequence information. Proteins that fail to associate with such structure-centric families are further queried against the sequence-centric RBP family HMMs in the HMMRBP database. Further, all other essential information pertaining to an RBP, such as overall function annotations, is provided. The web server can be accessed at the following link: http://caps.ncbs.res.in/rstrucfam .
Draft Genome Sequence of Ideonella sp. Strain A 288, Isolated from an Iron-Precipitating Biofilm
Künzel, Sven; Szewzyk, Ulrich
2017-01-01
ABSTRACT Here, we report the draft genome sequence of the betaproteobacterium Ideonella sp. strain A_288. This isolate, obtained from a bog iron ore-containing floodplain area in Germany, provides valuable information about the genetic diversity of neutrophilic iron-depositing bacteria. The Illumina NextSeq technique was used to sequence the genome of the strain. PMID:28818902
Rapid in silico cloning of genes using expressed sequence tags (ESTs).
Gill, R W; Sanseau, P
2000-01-01
Expressed sequence tags (ESTs) are short single-pass DNA sequences obtained from either end of cDNA clones. These ESTs are derived from a vast number of cDNA libraries obtained from different species. Human ESTs make up the bulk of the data and have been widely used to identify new members of gene families, as markers on the human chromosomes, to discover polymorphism sites and to compare expression patterns in different tissues or pathological states. Information strategies have been devised to query EST databases. Since most of the analysis is performed with a computer, the term "in silico" strategy has been coined. In this chapter we will review the current status of EST databases, the pros and cons of EST-type data and describe possible strategies to retrieve meaningful information.
Mariappan, Yogesh K.; Dzyubak, Bogdan; Glaser, Kevin J.; Venkatesh, Sudhakar K.; Sirlin, Claude B.; Hooker, Jonathan; McGee, Kiaran P.
2017-01-01
Purpose To (a) evaluate modified spin-echo (SE) magnetic resonance (MR) elastographic sequences for acquiring MR images with improved signal-to-noise ratio (SNR) in patients in whom the standard gradient-echo (GRE) MR elastographic sequence yields low hepatic signal intensity and (b) compare the stiffness values obtained with these sequences with those obtained with the conventional GRE sequence. Materials and Methods This HIPAA-compliant retrospective study was approved by the institutional review board; the requirement to obtain informed consent was waived. Data obtained with modified SE and SE echo-planar imaging (EPI) MR elastographic pulse sequences with short echo times were compared with those obtained with the conventional GRE MR elastographic sequence in two patient cohorts, one that exhibited adequate liver signal intensity and one that exhibited low liver signal intensity. Shear stiffness values obtained with the three sequences in 130 patients with successful GRE-based examinations were retrospectively tested for statistical equivalence by using a 5% margin. In 47 patients in whom GRE examinations were considered to have failed because of low SNR, the SNR and confidence level with the SE-based sequences were compared with those with the GRE sequence. Results The results of this study helped confirm the equivalence of SE MR elastography and SE-EPI MR elastography to GRE MR elastography (P = .0212 and P = .0001, respectively). The SE and SE-EPI MR elastographic sequences provided substantially improved SNR and stiffness inversion confidence level in 47 patients in whom GRE MR elastography had failed. Conclusion Modified SE-based MR elastographic sequences provide higher SNR MR elastographic data and reliable stiffness measurements; thus, they enable quantification of stiffness in patients in whom the conventional GRE MR elastographic sequence failed owing to low signal intensity. 
The equivalence of the three sequences indicates that the current diagnostic thresholds are applicable to SE MR elastographic sequences for assessing liver fibrosis. © RSNA, 2016 PMID:27509543
Methods for making nucleotide probes for sequencing and synthesis
Church, George M; Zhang, Kun; Chou, Joseph
2014-07-08
Compositions and methods for making a plurality of probes for analyzing a plurality of nucleic acid samples are provided. Compositions and methods for analyzing a plurality of nucleic acid samples to obtain sequence information in each nucleic acid sample are also provided.
Modeling genome coverage in single-cell sequencing
Daley, Timothy; Smith, Andrew D.
2014-01-01
Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. However, because such experiments begin with very small amounts of starting material, the amount of information obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol and to variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material. Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries. Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq. Contact: andrewds@usc.edu Supplementary information: Supplementary material is available at Bioinformatics online. PMID:25107873
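To make the idea of predicting coverage gain from a shallow experiment concrete, the following Python sketch implements the classical Good-Toulmin estimator, a well-known non-parametric empirical Bayes estimator under a Poisson sampling model. It is offered as an illustration of this family of methods and is not claimed to be the algorithm implemented in preseq. The histogram values are hypothetical: n_j is taken to be the number of genomic bins covered exactly j times in the shallow experiment, and t is the amount of additional sequencing relative to the shallow experiment (t = 1 means doubling the depth).

```python
from typing import Dict


def good_toulmin(count_hist: Dict[int, int], t: float) -> float:
    """Expected number of newly covered bins after t-fold extra sequencing.

    count_hist maps j -> n_j, the number of bins observed exactly j times.
    The alternating series is only well behaved for 0 <= t <= 1.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("this plain series is only reliable for 0 <= t <= 1")
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in count_hist.items())


# Hypothetical shallow-run histogram: 10 bins seen once, 4 twice, 1 three times.
shallow = {1: 10, 2: 4, 3: 1}
gain_doubling = good_toulmin(shallow, 1.0)  # 10 - 4 + 1
gain_half = good_toulmin(shallow, 0.5)      # 5 - 1 + 0.125
```

The restriction to t <= 1 is a property of this plain series; extrapolating to much deeper sequencing needs a more stable formulation, which this sketch does not attempt.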
Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano
2008-01-01
Background The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. 
Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975
Arrays of probes for positional sequencing by hybridization
Cantor, Charles R [Boston, MA]; Prezetakiewiczr, Marek [East Boston, MA]; Smith, Cassandra L [Boston, MA]; Sano, Takeshi [Waltham, MA]
2008-01-15
This invention is directed to methods and reagents useful for sequencing nucleic acid targets utilizing sequencing by hybridization technology comprising probes, arrays of probes and methods whereby sequence information is obtained rapidly and efficiently in discrete packages. That information can be used for the detection, identification, purification and complete or partial sequencing of a particular target nucleic acid. When coupled with a ligation step, these methods can be performed under a single set of hybridization conditions. The invention also relates to the replication of probe arrays and methods for making and replicating arrays of probes which are useful for the large scale manufacture of diagnostic aids used to screen biological samples for specific target sequences. Arrays created using PCR technology may comprise probes with 5'- and/or 3'-overhangs.
Cescutti, Paola; Campa, Cristiana; Delben, Franco; Rizzo, Roberto
2002-11-29
Dimers and trimers obtained by enzymatic hydrolysis of the glucomannan produced by the plant Amorphophallus konjac were analysed in order to obtain information on the saccharidic sequences present in the polymer. The polysaccharide was digested with cellulase and beta-mannanase and the oligomers produced were isolated by means of size-exclusion chromatography. They were structurally characterised using electrospray mass spectrometry, capillary electrophoresis, and NMR. The investigation revealed that many possible sequences were present in the polymer backbone suggesting a Bernoulli-type chain.
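What a Bernoulli-type chain implies for the oligomers can be sketched in Python: if each residue is placed independently of its neighbours, the expected fraction of any dimer or trimer is simply the product of the mole fractions of its residues. The sketch below is illustrative only; the glucose (G) and mannose (M) mole fractions are hypothetical placeholders, since the abstract does not report the composition.

```python
from itertools import product
from math import prod
from typing import Dict


def bernoulli_oligomer_fractions(mole_fractions: Dict[str, float],
                                 length: int) -> Dict[str, float]:
    """Expected fraction of every oligomer of a given length.

    Under a Bernoulli (zero-order) chain model, residues are independent,
    so each sequence's probability is the product of its residue fractions.
    """
    return {
        "".join(combo): prod(mole_fractions[r] for r in combo)
        for combo in product(sorted(mole_fractions), repeat=length)
    }


# Hypothetical composition, for illustration only.
composition = {"G": 0.4, "M": 0.6}
dimers = bernoulli_oligomer_fractions(composition, 2)
trimers = bernoulli_oligomer_fractions(composition, 3)
```

Comparing measured dimer and trimer proportions against such expected values is one way to judge whether a backbone is compatible with a Bernoulli-type distribution; every possible sequence has a non-zero expected fraction, which is consistent with many sequences being observed.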
TANDEM: matching proteins with tandem mass spectra.
Craig, Robertson; Beavis, Ronald C
2004-06-12
Tandem mass spectra obtained from fragmenting peptide ions contain some peptide sequence specific information, but often there is not enough information to sequence the original peptide completely. Several proprietary software applications have been developed to attempt to match the spectra with a list of protein sequences that may contain the sequence of the peptide. The application TANDEM was written to provide the proteomics research community with a set of components that can be used to test new methods and algorithms for performing this type of sequence-to-data matching. The source code and binaries for this software are available at http://www.proteome.ca/opensource.html, for Windows, Linux and Macintosh OSX. The source code is made available under the Artistic License, from the authors.
Xia, Wei; Mason, Annaliese S.; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru
2013-01-01
Background Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. Methodology/Principal Findings To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Conclusions/Significance Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species. PMID:23555859
Fan, Haikuo; Xiao, Yong; Yang, Yaodong; Xia, Wei; Mason, Annaliese S; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru
2013-01-01
Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species.
Investigation of modulation parameters in multiplexing gas chromatography.
Trapp, Oliver
2010-10-22
Combination of information technology and separation sciences opens a new avenue to achieve high sample throughputs and therefore is of great interest to bypass bottlenecks in catalyst screening of parallelized reactors or using multitier well plates in reaction optimization. Multiplexing gas chromatography utilizes pseudo-random injection sequences derived from Hadamard matrices to perform rapid sample injections which gives a convoluted chromatogram containing the information of a single sample or of several samples with similar analyte composition. The conventional chromatogram is obtained by application of the Hadamard transform using the known injection sequence or in case of several samples an averaged transformed chromatogram is obtained which can be used in a Gauss-Jordan deconvolution procedure to obtain all single chromatograms of the individual samples. The performance of such a system depends on the modulation precision and on the parameters, e.g. the sequence length and modulation interval. Here we demonstrate the effects of the sequence length and modulation interval on the deconvoluted chromatogram, peak shapes and peak integration for sequences between 9-bit (511 elements) and 13-bit (8191 elements) and modulation intervals Δt between 5 s and 500 ms using a mixture of five components. It could be demonstrated that even for high-speed modulation at time intervals of 500 ms the chromatographic information is very well preserved and that the separation efficiency can be improved by very narrow sample injections. Furthermore this study shows that the relative peak areas in multiplexed chromatograms do not deviate from conventionally recorded chromatograms. Copyright © 2010 Elsevier B.V. All rights reserved.
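The encoding and decoding principle behind multiplexed injection can be sketched in Python for a toy 3-bit sequence (7 elements). The sketch assumes an idealized, noise-free, circular model in which each element of the pseudo-random sequence either injects the sample (1) or does not (0); it uses the standard closed-form inverse of the cyclic S-matrix built from a maximal-length sequence rather than any instrument-specific procedure, and the chromatogram values are hypothetical.

```python
from typing import List


def m_sequence_3bit() -> List[int]:
    """7-element maximal-length sequence from a[k] = a[k-2] XOR a[k-3]."""
    a = [1, 0, 0]
    while len(a) < 7:
        a.append(a[-2] ^ a[-3])
    return a


def multiplex(x: List[float], s: List[int]) -> List[float]:
    """Circularly convolve chromatogram x with injection sequence s."""
    n = len(s)
    return [sum(s[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def demultiplex(y: List[float], s: List[int]) -> List[float]:
    """Recover x using S^-1 = 2/(N+1) * (2*S^T - J) for a cyclic S-matrix."""
    n = len(s)
    scale = 2.0 / (n + 1)
    return [
        scale * sum((2 * s[(i - j) % n] - 1) * y[i] for i in range(n))
        for j in range(n)
    ]


seq = m_sequence_3bit()
single_injection = [0.0, 0.0, 5.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical peak
convoluted = multiplex(single_injection, seq)
recovered = demultiplex(convoluted, seq)
```

The sequences studied in the abstract are much longer (511 to 8191 elements), but the same relation between the injection sequence, the convoluted chromatogram and the recovered chromatogram applies in this idealized model.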
Hu, Zhi-Liang; Ramos, Antonio M.; Humphray, Sean J.; Rogers, Jane; Reecy, James M.; Rothschild, Max F.
2011-01-01
The newly available pig genome sequence has provided new information to fine map quantitative trait loci (QTL) in order to eventually identify causal variants. With targeted genomic sequencing efforts, we were able to obtain high quality BAC sequences that cover a region on pig chromosome 17 where a number of meat quality QTL have been previously discovered. Sequences from 70 BAC clones were assembled to form an 8-Mbp contig. Subsequently, we successfully mapped five previously identified QTL, three for meat color and two for lactate related traits, to the contig. With an additional 25 genetic markers that were identified by sequence comparison, we were able to carry out further linkage disequilibrium analysis to narrow down the genomic locations of these QTL, which allowed identification of the chromosomal regions that likely contain the causative variants. This research has provided one practical approach to combine genetic and molecular information for QTL mining. PMID:22303339
ERIC Educational Resources Information Center
Rimoldi, Horacio J. A.
The study of problem solving is made through the analysis of the process that leads to the final answer. The type of information obtained through the study of the process is compared with the information obtained by studying the final answer. The experimental technique used makes it possible to identify the sequence of questions (tactics) that subjects ask…
Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout
USDA-ARS's Scientific Manuscript database
Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, we reported a transcriptome reference sequence using a 19X coverage of Sanger and 454-pyrosequencing dat...
Paiardini, Alessandro; Bossa, Francesco; Pascarella, Stefano
2004-01-01
The wealth of biological information provided by structural and genomic projects opens new prospects of understanding life and evolution at the molecular level. In this work, it is shown how computational approaches can be exploited to pinpoint protein structural features that remain invariant upon long evolutionary periods in the fold-type I, PLP-dependent enzymes. A nonredundant set of 23 superposed crystallographic structures belonging to this superfamily was built. Members of this family typically display high-structural conservation despite low-sequence identity. For each structure, a multiple-sequence alignment of orthologous sequences was obtained, and the 23 alignments were merged using the structural information to obtain a comprehensive multiple alignment of 921 sequences of fold-type I enzymes. The structurally conserved regions (SCRs), the evolutionarily conserved residues, and the conserved hydrophobic contacts (CHCs) were extracted from this data set, using both sequence and structural information. The results of this study identified a structural pattern of hydrophobic contacts shared by all of the superfamily members of fold-type I enzymes and involved in native interactions. This profile highlights the presence of a nucleus for this fold, in which residues participating in the most conserved native interactions exhibit preferential evolutionary conservation, that correlates significantly (r = 0.70) with the extent of mean hydrophobic contact value of their apolar fraction. PMID:15498941
3D knee segmentation based on three MRI sequences from different planes.
Zhou, L; Chav, R; Cresson, T; Chartrand, G; de Guise, J
2016-08-01
In clinical practice, knee MRI sequences with 3.5~5 mm slice distance in the sagittal, coronal, and axial planes are often requested for knee examination, since their acquisition is faster than that of a high-resolution MRI sequence in a single plane, thereby reducing the probability of motion artifacts. In order to take advantage of the three sequences from different planes, a 3D segmentation method based on the combination of three knee models obtained from the three sequences is proposed in this paper. In the method, sub-segmentation is performed separately on the sagittal, coronal, and axial MRI sequences in the image coordinate system. With each sequence, an initial knee model is hierarchically deformed, and then the three deformed models are mapped to the reference coordinate system defined by the DICOM standard and combined to obtain a patient-specific model. The experimental results verified that the three sub-segmentation results can complement each other, and that their integration can compensate for the insufficiency of boundary information caused by the 3.5~5 mm gap between consecutive slices. Therefore, the obtained patient-specific model is substantially more accurate than each individual sub-segmentation result.
Devesse, Laurence; Ballard, David; Davenport, Lucinda; Riethorst, Immy; Mason-Buck, Gabriella; Syndercombe Court, Denise
2018-05-01
By using sequencing technology to genotype loci of forensic interest it is possible to simultaneously target autosomal, X and Y STRs as well as identity, ancestry and phenotypic informative SNPs, resulting in a breadth of data obtained from a single run that is considerable when compared to that generated with standard technologies. It is important, however, that this information aligns with the genotype data currently obtained using commercially available kits for CE-based investigations, such that results are compatible with existing databases and hence can be of use to the forensic community. In this work, 400 samples were typed using commercially available STR kits and CE, as well as using the Illumina ForenSeq™ DNA Signature Prep Kit and MiSeq® FGx, to assess concordance of autosomal STRs and population variability. Results show a concordance rate between the two technologies exceeding 99.98%, while numerous novel sequence-based alleles are described. In order to make use of the sequence variation observed, sequence-specific allele frequencies were generated for White British and British Chinese populations. Copyright © 2017 Elsevier B.V. All rights reserved.
33 CFR 385.30 - Master Implementation Sequencing Plan.
Code of Federal Regulations, 2010 CFR
2010-07-01
... projects of the Plan, including pilot projects and operational elements, based on the best scientific... Florida Water Management District shall also consult with the South Florida Ecosystem Restoration Task...; (ii) Information obtained from pilot projects; (iii) Updated funding information; (iv) Approved...
USDA-ARS's Scientific Manuscript database
Aspergillus flavus and A. parasiticus fungi, carcinogen-mycotoxins producers, infect peanut seeds, causing considerable impact on both human health and the economy. Here we report 9 genome sequences of Aspergillus spp. isolated from peanut seeds. The information obtained will allow conducting biodiv...
Sasaki, Katsutomo; Mitsuda, Nobutaka; Nashima, Kenji; Kishimoto, Kyutaro; Katayose, Yuichi; Kanamori, Hiroyuki; Ohmiya, Akemi
2017-09-04
Chrysanthemum morifolium is one of the most economically valuable ornamental plants worldwide. Chrysanthemum is an allohexaploid plant with a large genome that is commercially propagated by vegetative reproduction. New cultivars with different floral traits, such as color, morphology, and scent, have been generated mainly by classical cross-breeding and mutation breeding. However, only limited genetic resources and their genome information are available for the generation of new floral traits. To obtain useful information about molecular bases for floral traits of chrysanthemums, we read expressed sequence tags (ESTs) of chrysanthemums by high-throughput sequencing using the 454 pyrosequencing technology. We constructed normalized cDNA libraries, consisting of full-length, 3'-UTR, and 5'-UTR cDNAs derived from various tissues of chrysanthemums. These libraries produced a total number of 3,772,677 high-quality reads, which were assembled into 213,204 contigs. By comparing the data obtained with those of full genome-sequenced species, we confirmed that our chrysanthemum contig set contained the majority of all expressed genes, which was sufficient for further molecular analysis in chrysanthemums. We confirmed that our chrysanthemum EST set (contigs) contained a number of contigs that encoded transcription factors and enzymes involved in pigment and aroma compound metabolism that was comparable to that of other species. This information can serve as an informative resource for identifying genes involved in various biological processes in chrysanthemums. Moreover, the findings of our study will contribute to a better understanding of the floral characteristics of chrysanthemums including the myriad cultivars at the molecular level.
Extension of the COG and arCOG databases by amino acid and nucleotide sequences
Meereis, Florian; Kaufmann, Michael
2008-01-01
Background The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries. Results Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and in addition the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at . Conclusion NUCOCOG offers the possibility to analyze any sequence related property in the context of the COG and arCOG framework simply by using script languages such as PERL applied to a large but single XML document. PMID:19014535
A public HTLV-1 molecular epidemiology database for sequence management and data mining.
Araujo, Thessika Hialla Almeida; Souza-Brito, Leandro Inacio; Libin, Pieter; Deforche, Koen; Edwards, Dustin; de Albuquerque-Junior, Antonio Eduardo; Vandamme, Anne-Mieke; Galvao-Castro, Bernardo; Alcantara, Luiz Carlos Junior
2012-01-01
It is estimated that 15 to 20 million people are infected with the human T-cell lymphotropic virus type 1 (HTLV-1). At present, there are more than 2,000 unique HTLV-1 isolate sequences published. A central database to aggregate sequence information from a range of epidemiological aspects including HTLV-1 infections, pathogenesis, origins, and evolutionary dynamics would be useful to scientists and physicians worldwide. Here, we describe a database that collects and annotates sequence data and can be accessed through a user-friendly search interface. The HTLV-1 Molecular Epidemiology Database website is available at http://htlv1db.bahia.fiocruz.br/. All data was obtained from publications available at GenBank or through contact with the authors. The database was developed using Apache Webserver 2.1.6 and SGBD MySQL. The webpage interfaces were developed in HTML, with server-side scripting written in PHP. The HTLV-1 Molecular Epidemiology Database is hosted on the Gonçalo Moniz/FIOCRUZ Research Center server. There are currently 2,457 registered sequences with 2,024 (82.37%) of those sequences representing unique isolates. Of these sequences, 803 (39.67%) contain information about clinical status (TSP/HAM, 17.19%; ATL, 7.41%; asymptomatic, 12.89%; other diseases, 2.17%; and no information, 60.32%). Further, 7.26% of sequences contain information on patient gender while 5.23% of sequences provide the age of the patient. The HTLV-1 Molecular Epidemiology Database retrieves and stores annotated HTLV-1 proviral sequences from clinical, epidemiological, and geographical studies. The collected sequences and related information are now accessible on a publicly available and user-friendly website. This open-access database will support clinical research and vaccine development related to viral genotype.
Transcriptome analysis by strand-specific sequencing of complementary DNA
Parkhomchuk, Dmitri; Borodina, Tatiana; Amstislavskiy, Vyacheslav; Banaru, Maria; Hallen, Linda; Krobitsch, Sylvia; Lehrach, Hans; Soldatov, Alexey
2009-01-01
High-throughput complementary DNA sequencing (RNA-Seq) is a powerful tool for whole-transcriptome analysis, supplying information about a transcript's expression level and structure. However, it is difficult to determine the polarity of transcripts, and therefore identify which strand is transcribed. Here, we present a simple cDNA sequencing protocol that preserves information about a transcript's direction. Using Saccharomyces cerevisiae and mouse brain transcriptomes as models, we demonstrate that knowing the transcript's orientation allows more accurate determination of the structure and expression of genes. It also helps to identify new genes and enables studying promoter-associated and antisense transcription. The transcriptional landscapes we obtained are available online. PMID:19620212
Transcriptome analysis by strand-specific sequencing of complementary DNA.
Parkhomchuk, Dmitri; Borodina, Tatiana; Amstislavskiy, Vyacheslav; Banaru, Maria; Hallen, Linda; Krobitsch, Sylvia; Lehrach, Hans; Soldatov, Alexey
2009-10-01
High-throughput complementary DNA sequencing (RNA-Seq) is a powerful tool for whole-transcriptome analysis, supplying information about a transcript's expression level and structure. However, it is difficult to determine the polarity of transcripts, and therefore identify which strand is transcribed. Here, we present a simple cDNA sequencing protocol that preserves information about a transcript's direction. Using Saccharomyces cerevisiae and mouse brain transcriptomes as models, we demonstrate that knowing the transcript's orientation allows more accurate determination of the structure and expression of genes. It also helps to identify new genes and enables studying promoter-associated and antisense transcription. The transcriptional landscapes we obtained are available online.
Wellehan, James F.X.; Pessier, Allan P.; Archer, Linda L.; Childress, April L.; Jacobson, Elliott R.; Tesh, Robert B.
2012-01-01
Rhabdoviruses infect a variety of hosts, including non-avian reptiles. Consensus PCR techniques were used to obtain partial RNA-dependent RNA polymerase gene sequence from five rhabdoviruses of South American lizards: Marco, Chaco, Timbo, Sena Madureira, and a rhabdovirus from a caiman lizard (Dracaena guianensis). The caiman lizard rhabdovirus formed inclusions in erythrocytes, which may be a route for infecting hematophagous insects. This is the first information on behavior of a rhabdovirus in squamates. We also obtained sequence from two rhabdoviruses of Australian lizards, confirming previous Charleville virus sequence and finding that, unlike a previous sequence report but in agreement with serologic reports, Almpiwar virus is clearly distinct from Charleville virus. Bayesian and maximum likelihood phylogenetic analysis revealed that most known rhabdoviruses of squamates cluster in the Almpiwar subgroup. The exception is Marco virus, which is found in the Hart Park group. PMID:22397930
Transcriptome Analysis and Comparison of Marmota monax and Marmota himalayana.
Liu, Yanan; Wang, Baoju; Wang, Lu; Vikash, Vikash; Wang, Qin; Roggendorf, Michael; Lu, Mengji; Yang, Dongliang; Liu, Jia
2016-01-01
The Eastern woodchuck (Marmota monax) is a classical animal model for studying hepatitis B virus (HBV) infection and hepatocellular carcinoma (HCC) in humans. Recently, we found that Marmota himalayana, an Asian animal species closely related to Marmota monax, is susceptible to woodchuck hepatitis virus (WHV) infection and can be used as a new mammalian model for HBV infection. However, the lack of genomic sequence information of both Marmota models strongly limited their application breadth and depth. To address this major obstacle of the Marmota models, we utilized Illumina RNA-Seq technology to sequence the cDNA libraries of liver and spleen samples of two Marmota monax and four Marmota himalayana. In total, over 13 billion nucleotide bases were sequenced and approximately 1.5 billion clean reads were obtained. Following assembly, 106,496 consensus sequences of Marmota monax and 78,483 consensus sequences of Marmota himalayana were detected. For functional annotation, in total 73,603 Unigenes of Marmota monax and 78,483 Unigenes of Marmota himalayana were identified using different databases (NR, NT, Swiss-Prot, KEGG, COG, GO). The Unigenes were aligned by blastx to protein databases to determine the coding DNA sequences (CDS) and in total 41,247 CDS of Marmota monax and 34,033 CDS of Marmota himalayana were predicted. The single nucleotide polymorphisms (SNPs) and the simple sequence repeats (SSRs) were also analyzed for all Unigenes obtained. Moreover, a large-scale transcriptome comparison was performed and revealed a high similarity in transcriptome sequences between the two Marmota species. Our study provides an extensive amount of novel sequence information for Marmota monax and Marmota himalayana. 
This information may serve as a valuable genomics resource for further molecular, developmental and comparative evolutionary studies, as well as for the identification and characterization of functional genes that are involved in WHV infection and HCC development in the woodchuck model.
Transcriptome Analysis and Comparison of Marmota monax and Marmota himalayana
Wang, Lu; Vikash, Vikash; Wang, Qin; Roggendorf, Michael; Lu, Mengji; Yang, Dongliang; Liu, Jia
2016-01-01
The Eastern woodchuck (Marmota monax) is a classical animal model for studying hepatitis B virus (HBV) infection and hepatocellular carcinoma (HCC) in humans. Recently, we found that Marmota himalayana, an Asian animal species closely related to Marmota monax, is susceptible to woodchuck hepatitis virus (WHV) infection and can be used as a new mammalian model for HBV infection. However, the lack of genomic sequence information of both Marmota models strongly limited their application breadth and depth. To address this major obstacle of the Marmota models, we utilized Illumina RNA-Seq technology to sequence the cDNA libraries of liver and spleen samples of two Marmota monax and four Marmota himalayana. In total, over 13 billion nucleotide bases were sequenced and approximately 1.5 billion clean reads were obtained. Following assembly, 106,496 consensus sequences of Marmota monax and 78,483 consensus sequences of Marmota himalayana were detected. For functional annotation, in total 73,603 Unigenes of Marmota monax and 78,483 Unigenes of Marmota himalayana were identified using different databases (NR, NT, Swiss-Prot, KEGG, COG, GO). The Unigenes were aligned by blastx to protein databases to decide the coding DNA sequences (CDS) and in total 41,247 CDS of Marmota monax and 34,033 CDS of Marmota himalayana were predicted. The single nucleotide polymorphisms (SNPs) and the simple sequence repeats (SSRs) were also analyzed for all Unigenes obtained. Moreover, a large-scale transcriptome comparison was performed and revealed a high similarity in transcriptome sequences between the two marmota species. Our study provides an extensive amount of novel sequence information for Marmota monax and Marmota himalayana. 
This information may serve as a valuable genomics resource for further molecular, developmental and comparative evolutionary studies, as well as for the identification and characterization of functional genes that are involved in WHV infection and HCC development in the woodchuck model. PMID:27806133
Using Problem Solving to Assess Young Children's Mathematics Knowledge
ERIC Educational Resources Information Center
Charlesworth, Rosalind; Leali, Shirley A.
2012-01-01
Mathematics problem solving provides a means for obtaining a view of young children's understanding of mathematics as they move through the early childhood concept development sequence. Assessment information can be obtained through observations and interviews as children develop problem solutions. Examples of preschool, kindergarten, and primary…
Valenzuela-González, Fabiola; Martínez-Porchas, Marcel; Villalpando-Canchola, Enrique; Vargas-Albores, Francisco
2016-03-01
Ultrafast metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classifying 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest common ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long, random environmental 16S rDNA sequences produced by cloning and Sanger sequencing. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or even deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken accumulates the information of both sequence senses, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for the taxonomic assignment of sequences obtained by Sanger sequencing or by third generation sequencing, whose main goal is to generate longer sequences. Copyright © 2016 Elsevier B.V. All rights reserved.
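The classification idea described in this abstract (k-mers mapped to a lowest common ancestor, then weighted paths through the taxonomy) can be illustrated with a toy sketch. This is not Kraken's implementation: the taxonomy, the reference sequences, the k-mer length and all function names below are hypothetical, and tie handling is simplified.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Vibrio": "Bacteria",
    "Vibrio cholerae": "Vibrio",
    "Vibrio harveyi": "Vibrio",
}

def path_to_root(taxon):
    """Return the list of taxa from `taxon` up to the root (taxon first)."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lowest_common_ancestor(a, b):
    """Deepest taxon shared by the root paths of a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return "root"

def build_kmer_index(references, k):
    """Map each k-mer to the LCA of all reference taxa that contain it."""
    index = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index:
                index[kmer] = lowest_common_ancestor(index[kmer], taxon)
            else:
                index[kmer] = taxon
    return index

def classify(read, index, k):
    """Assign a read to the hit taxon whose root-to-taxon path has the most k-mer hits."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer matched: unclassified

    def path_score(taxon):
        # Sum the hit counts of the taxon and all of its ancestors.
        return sum(hits.get(node, 0) for node in path_to_root(taxon))

    # Ties are broken in favour of the deeper taxon (a simplification).
    return max(hits, key=lambda t: (path_score(t), len(path_to_root(t))))

# Made-up reference sequences, for illustration only.
REFERENCES = {"Vibrio cholerae": "ACGTACGTTT", "Vibrio harveyi": "ACGTACGAAA"}
K = 5
INDEX = build_kmer_index(REFERENCES, K)

print(classify("GTACGTTT", INDEX, K))  # species-specific k-mers present
print(classify("ACGTACG", INDEX, K))   # only k-mers shared by both species
```

In this sketch a read that carries only k-mers shared by both species stops at the genus, while a read with species-specific k-mers is pushed down to the species, which is the behaviour the abstract attributes to weighted subtrees.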
The Complete Sequence of a Human Parainfluenzavirus 4 Genome
Yea, Carmen; Cheung, Rose; Collins, Carol; Adachi, Dena; Nishikawa, John; Tellier, Raymond
2009-01-01
Although the human parainfluenza virus 4 (HPIV4) has been known for a long time, its genome, alone among the human paramyxoviruses, has not been completely sequenced to date. In this study we obtained the first complete genomic sequence of HPIV4 from a clinical isolate named SKPIV4 obtained at the Hospital for Sick Children in Toronto (Ontario, Canada). The coding regions for the N, P/V, M, F and HN proteins show very high identities (95% to 97%) with previously available partial sequences for HPIV4B. The sequence for the L protein and the non-coding regions represent new information. A surprising feature of the genome is its length, more than 17 kb, making it the longest genome within the genus Rubulavirus, although the length is well within the known range of 15 kb to 19 kb for the subfamily Paramyxovirinae. The availability of a complete genomic sequence will facilitate investigations on a respiratory virus that is still not completely characterized. PMID:21994536
Garcia-Reyero, Natàlia; Griffitt, Robert J.; Liu, Li; Kroll, Kevin J.; Farmerie, William G.; Barber, David S.; Denslow, Nancy D.
2009-01-01
A novel custom microarray for largemouth bass (Micropterus salmoides) was designed with sequences obtained from a normalized cDNA library using the 454 Life Sciences GS-20 pyrosequencer. This approach yielded in excess of 58 million bases of high-quality sequence. The sequence information was combined with 2,616 reads obtained by traditional suppressive subtractive hybridizations to derive a total of 31,391 unique sequences. Annotation and coding sequences were predicted for these transcripts where possible. 16,350 annotated transcripts were selected as target sequences for the design of the custom largemouth bass oligonucleotide microarray. The microarray was validated by examining the transcriptomic response in male largemouth bass exposed to 17β-œstradiol. Transcriptomic responses were assessed in liver and gonad, and indicated gene expression profiles typical of exposure to œstradiol. The results demonstrate the potential to rapidly create the tools necessary to assess large scale transcriptional responses in non-model species, paving the way for expanded impact of toxicogenomics in ecotoxicology. PMID:19936325
The nuclear 18S ribosomal RNA gene as a source of phylogenetic information in the genus Taenia.
Yan, Hongbin; Lou, Zhongzi; Li, Li; Ni, Xingwei; Guo, Aijiang; Li, Hongmin; Zheng, Yadong; Dyachenko, Viktor; Jia, Wanzhong
2013-03-01
Most species of the genus Taenia are of considerable medical and veterinary significance. In this study, complete nuclear 18S rRNA gene sequences were obtained from seven members of genus Taenia [Taenia multiceps, Taenia saginata, Taenia asiatica, Taenia solium, Taenia pisiformis, Taenia hydatigena, and Taenia taeniaeformis] and a phylogeny inferred using these sequences. Most of the variable sites fall within the variable regions, V1-V5. We show that sequences from the nuclear 18S ribosomal RNA gene have considerable promise as sources of phylogenetic information within the genus Taenia. Furthermore, given that almost all the variable sites lie within defined variable portions of that gene, it will be appropriate and economical to sequence only those regions for additional species of Taenia.
Studies and simulations of the DigiCipher system
NASA Technical Reports Server (NTRS)
Sayood, K.; Chen, Y. C.; Kipp, G.
1993-01-01
During this period the development of simulators for the various high definition television (HDTV) systems proposed to the FCC was continued. The FCC has indicated that it wants the various proposers to collaborate on a single system. Based on all available information, this system will look very much like the advanced digital television (ADTV) system with major contributions only from the DigiCipher system. The results of our simulations of the DigiCipher system are described. This simulator was tested using test sequences from the MPEG committee. The results are extrapolated to HDTV video sequences. Once again, some caveats are in order. The sequences used for testing the simulator and generating the results are those used for testing the MPEG algorithm. The sequences are of much lower resolution than the HDTV sequences would be, and therefore the extrapolations are not totally accurate. One would expect to get significantly higher compression in terms of bits per pixel with sequences that are of higher resolution. However, the simulator itself is a valid one, and should HDTV sequences become available, they could be used directly with the simulator. A brief overview of the DigiCipher system is given. Some coding results obtained using the simulator are examined and compared to those obtained using the ADTV system. The results are evaluated in the context of the CCSDS specifications, and some suggestions are made as to how the DigiCipher system could be implemented in the NASA network. Simulations such as the ones reported can be biased depending on the particular source sequence used. In order to get more complete information about the system, one needs to obtain a reasonable set of models which mirror the various kinds of sources encountered during video coding. A set of models which can be used to effectively model the various possible scenarios is provided. As this is somewhat tangential to the other work reported, the results are included as an appendix.
Sanderson, Nicholas D.; Atkins, Bridget L.; Brent, Andrew J.; Cole, Kevin; Foster, Dona; McNally, Martin A.; Oakley, Sarah; Peto, Leon; Taylor, Adrian; Peto, Tim E. A.; Crook, Derrick W.; Eyre, David W.
2017-01-01
Culture of multiple periprosthetic tissue samples is the current gold standard for microbiological diagnosis of prosthetic joint infections (PJI). Additional diagnostic information may be obtained through culture of sonication fluid from explants. However, current techniques can have relatively low sensitivity, with prior antimicrobial therapy and infection by fastidious organisms influencing results. We assessed whether metagenomic sequencing of total DNA extracts obtained directly from sonication fluid can provide an alternative rapid and sensitive tool for diagnosis of PJI. We compared metagenomic sequencing with standard aerobic and anaerobic culture in 97 sonication fluid samples from prosthetic joint and other orthopedic device infections. Reads from Illumina MiSeq sequencing were taxonomically classified using Kraken. Using 50 derivation samples, we determined optimal thresholds for the number and proportion of bacterial reads required to identify an infection and confirmed our findings in 47 independent validation samples. Compared to results from sonication fluid culture, the species-level sensitivity of metagenomic sequencing was 61/69 (88%; 95% confidence interval [CI], 77 to 94%; for derivation samples, 35/38 [92%; 95% CI, 79 to 98%]; for validation samples, 26/31 [84%; 95% CI, 66 to 95%]), and genus-level sensitivity was 64/69 (93%; 95% CI, 84 to 98%). Species-level specificity, adjusting for plausible fastidious causes of infection, species found in concurrently obtained tissue samples, and prior antibiotics, was 85/97 (88%; 95% CI, 79 to 93%; for derivation samples, 43/50 [86%; 95% CI, 73 to 94%]; for validation samples, 42/47 [89%; 95% CI, 77 to 96%]). High levels of human DNA contamination were seen despite the use of laboratory methods to remove it. Rigorous laboratory good practice was required to minimize bacterial DNA contamination. We demonstrate that metagenomic sequencing can provide accurate diagnostic information in PJI. Our findings, combined with the increasing availability of portable, random-access sequencing technology, offer the potential to translate metagenomic sequencing into a rapid diagnostic tool in PJI. PMID:28490492
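The sensitivity and specificity figures in this abstract are proportions with 95% confidence intervals. The sketch below shows how such a proportion and an interval could be computed from the reported counts. The abstract does not say which interval method the authors used; the Wilson score interval here is an assumption, so the bounds may differ slightly from the published ones, and the function name is hypothetical.

```python
import math

def proportion_with_wilson_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95% interval. The choice of the
    Wilson method is an assumption, not taken from the abstract.
    """
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, centre - half, centre + half

# Counts reported in the abstract: species-level sensitivity and specificity.
sens, sens_lo, sens_hi = proportion_with_wilson_ci(61, 69)
spec, spec_lo, spec_hi = proportion_with_wilson_ci(85, 97)

print(f"sensitivity {sens:.1%} (approx. 95% CI {sens_lo:.1%} to {sens_hi:.1%})")
print(f"specificity {spec:.1%} (approx. 95% CI {spec_lo:.1%} to {spec_hi:.1%})")
```

An exact (Clopper-Pearson) interval would be a reasonable alternative and tends to be slightly wider for samples of this size.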
Bauer, Jan Stefan; Noël, Peter Benjamin; Vollhardt, Christiane; Much, Daniela; Degirmenci, Saliha; Brunner, Stefanie; Rummeny, Ernst Josef; Hauner, Hans
2015-01-01
Purpose: MR might be well suited to obtain reproducible and accurate measures of fat tissues in infants. This study evaluates MR-measurements of adipose tissue in young infants in vitro and in vivo. Material and Methods: MR images of ten phantoms simulating subcutaneous fat of an infant’s torso were obtained using a 1.5T MR scanner with and without simulated breathing. Scans consisted of a Cartesian water-suppression turbo spin echo (wsTSE) sequence, and a PROPELLER wsTSE sequence. Fat volume was quantified directly and by MR imaging using k-means clustering and threshold-based segmentation procedures to calculate accuracy in vitro. Whole body MR was obtained in sleeping young infants (average age 67±30 days). This study was approved by the local review board. All parents gave written informed consent. To obtain reproducibility in vivo, Cartesian and PROPELLER wsTSE sequences were repeated in seven and four young infants, respectively. Overall, 21 repetitions were performed for the Cartesian sequence and 13 repetitions for the PROPELLER sequence. Results: In vitro accuracy errors depended on the chosen segmentation procedure, ranging from 5.4% to 76%, while the sequence showed no significant influence. Artificial breathing increased the minimal accuracy error to 9.1%. In vivo reproducibility errors for total fat volume of the sleeping infants ranged from 2.6% to 3.4%. Neither segmentation nor sequence significantly influenced reproducibility. Conclusion: With both Cartesian and PROPELLER sequences an accurate and reproducible measure of body fat was achieved. Adequate segmentation was mandatory for high accuracy. PMID:25706876
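The two segmentation approaches named in this abstract, k-means clustering and thresholding, can be sketched on a flat list of voxel intensities. The intensities, the voxel volume and the function names below are hypothetical, and the sketch assumes that fat is the bright class in water-suppressed images. It is not the study's pipeline.

```python
def kmeans_1d(values, iterations=50):
    """Two-cluster k-means on scalar intensities; returns (low_centre, high_centre)."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        # Assign every intensity to the nearer of the two centres.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        if not low_group or not high_group:
            break  # degenerate case, e.g. all intensities identical
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if new_low == low and new_high == high:
            break  # converged
        low, high = new_low, new_high
    return low, high

def fat_volume_ml(values, voxel_volume_mm3, threshold):
    """Count voxels brighter than `threshold` and convert to millilitres."""
    fat_voxels = sum(1 for v in values if v > threshold)
    return fat_voxels * voxel_volume_mm3 / 1000.0

# Made-up intensities: dark background/water and bright fat voxels.
intensities = [10, 12, 11, 9, 200, 210, 190, 205, 15, 195]
low_centre, high_centre = kmeans_1d(intensities)

# Use the midpoint between the two cluster centres as the fat threshold.
threshold = (low_centre + high_centre) / 2
volume = fat_volume_ml(intensities, voxel_volume_mm3=2.0, threshold=threshold)
print(low_centre, high_centre, threshold, volume)
```

Deriving the threshold from the cluster centres, rather than fixing it by hand, is one way to make the segmentation less dependent on scanner gain, which matters given how strongly the abstract reports accuracy to depend on the segmentation procedure.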
Short memory fuzzy fusion image recognition schema employing spatial and Fourier descriptors
NASA Astrophysics Data System (ADS)
Raptis, Sotiris N.; Tzafestas, Spyros G.
2001-03-01
Single images quite often do not bear enough information for precise interpretation, for a variety of reasons. Multiple image fusion and adequate integration have recently become the state of the art in the pattern recognition field. In the paper presented here, an enhanced multiple-observation schema is discussed, investigating improvements to the baseline fuzzy-probabilistic image fusion methodology. The first innovation consists in considering only a limited but seemingly more effective part of the uncertainty information obtained by a certain time, restricting older uncertainty dependencies and alleviating the computational burden, which is now needed only for a short sequence of samples (stored in memory). The second innovation essentially consists in grouping them into feature-blind object hypotheses. The experimental setting includes a sequence of independent views obtained by a camera being moved around the investigated object.
NASA Astrophysics Data System (ADS)
Yang, Hong
Until recently, recovery and analysis of genetic information encoded in ancient DNA sequences from Pleistocene fossils were impossible. Recent advances in molecular biology have offered technical tools to obtain ancient DNA sequences from well-preserved Quaternary fossils and opened the possibility of directly studying genetic changes in fossil species to address various biological and paleontological questions. Ancient DNA studies involving Pleistocene fossil material and ancient DNA degradation and preservation in Quaternary deposits are reviewed. The molecular technology applied to isolate, amplify, and sequence ancient DNA is also presented. Authentication of ancient DNA sequences and technical problems associated with modern and ancient DNA contamination are discussed. As illustrated in recent studies on ancient DNA from proboscideans, it is apparent that fossil DNA sequence data can shed light on many aspects of Quaternary research such as systematics and phylogeny, conservation biology, evolutionary theory, molecular taphonomy, and forensic sciences. Improvement of molecular techniques and a better understanding of DNA degradation during fossilization are likely to build on current strengths and to overcome existing problems, making fossil DNA data a unique source of information for Quaternary scientists.
Information Entropy Analysis of the H1N1 Genetic Code
NASA Astrophysics Data System (ADS)
Martwick, Andy
2010-03-01
During the current H1N1 pandemic, viral samples are being obtained from large numbers of infected people world-wide and are being sequenced on the NCBI Influenza Virus Resource Database. The information entropy of the sequences was computed from the probability of occurrence of each nucleotide base at every position of each set of sequences using Shannon's definition of information entropy, $H = \sum_b p_b \log_2(1/p_b)$, where $H$ is the observed information entropy at each nucleotide position and $p_b$ is the probability of the base pair of the nucleotides A, C, G, U. Information entropy of the current H1N1 pandemic is compared to reference human and swine H1N1 entropy. As expected, the current H1N1 entropy is in a low entropy state and has a very large mutation potential. Using the entropy method in mature genes we can identify low entropy regions of nucleotides that generally correlate to critical protein function.
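The per-position Shannon entropy described in this abstract can be sketched as follows. The aligned sequences are made up for illustration and the function names are hypothetical. The sketch assumes equal-length, already aligned sequences and estimates each base probability from its observed frequency in the column.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column: sum of p_b * log2(1 / p_b)."""
    counts = Counter(column)
    total = sum(counts.values())
    # p_b = count / total, so log2(1 / p_b) = log2(total / count).
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def alignment_entropy(sequences):
    """Entropy at every position of a set of equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy(col) for col in zip(*sequences)]

# Made-up aligned RNA fragments (alphabet A, C, G, U), for illustration only.
ALIGNED = ["AUGC", "AUGA", "AUCA", "AUCG"]
ENTROPY = alignment_entropy(ALIGNED)
print(ENTROPY)
```

A fully conserved position gives 0 bits and four equally frequent bases give the maximum of 2 bits, so low values mark the conserved regions the abstract associates with critical protein function.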
Wang, Chun Guo; Chen, Xiao Qiang; Li, Hui; Zhao, Qian Cheng; Sun, De Ling; Song, Wen Qin
2008-02-01
Inter-simple sequence repeat (ISSR) and differential display reverse transcriptase polymerase chain reaction (DDRT-PCR) analyses were performed to compare the cytoplasmic male sterile cauliflower line ogura-A with its corresponding maintainer line ogura-B. In total, 306 detectable bands were obtained by ISSR using thirty oligonucleotide primers, with six to twelve bands typically produced per primer. Among these primers, only ISSR3 gave a polymorphic amplification: a specific 1100 bp band, named ISSR3(1100), was detected only in the maintainer line. Sequence analysis indicated that ISSR3(1100) was highly homologous to the corresponding mitochondrial genome sequences of Brassica napus and Arabidopsis thaliana, which suggested that ISSR3(1100) may derive from the mitochondrial genome of cauliflower. For the DDRT-PCR analysis, three anchor primers were combined with fifteen random primers. In total, 1122 bands ranging from 1000 bp to 50 bp were detected. However, only four bands, named ogura-A205, ogura-A383, ogura-B307 and ogura-B352, were confirmed to be differentially displayed between the two lines. This result was further confirmed by reverse Northern dot blotting analysis. Of these four bands, ogura-A205 and ogura-A383 were expressed only in the cytoplasmic male sterile line, while ogura-B307 and ogura-B352 were detected only in the maintainer line. Sequence analysis indicated that these four sequences are reported here for the first time in cauliflower. Interestingly, ogura-A205 and ogura-B307 did not show similarity to any reported sequences in other species, and more investigation is required to obtain further information. ogura-A383 and ogura-B352 were also new sequences; they showed high similarity to the corresponding chloroplast sequences of Arabidopsis thaliana and Brassica rapa subsp. pekinensis, so we speculate that these two sequences may derive from the chloroplast genome. All the results obtained in this study offer new and significant information for investigating the molecular mechanism of cytoplasmic male sterility and fertility maintenance in cauliflower.
Application of next generation sequencing in clinical microbiology and infection prevention.
Deurenberg, Ruud H; Bathoorn, Erik; Chlebowicz, Monika A; Couto, Natacha; Ferdous, Mithila; García-Cobos, Silvia; Kooistra-Smid, Anna M D; Raangs, Erwin C; Rosema, Sigrid; Veloo, Alida C M; Zhou, Kai; Friedrich, Alexander W; Rossen, John W A
2017-02-10
Current molecular diagnostics of human pathogens provide limited information that is often not sufficient for outbreak and transmission investigation. Next generation sequencing (NGS) determines the DNA sequence of a complete bacterial genome in a single sequence run, and from these data, information on resistance and virulence, as well as information for typing is obtained, useful for outbreak investigation. The obtained genome data can be further used for the development of an outbreak-specific screening test. In this review, a general introduction to NGS is presented, including the library preparation and the major characteristics of the most common NGS platforms, such as the MiSeq (Illumina) and the Ion PGM™ (ThermoFisher). An overview of the software used for NGS data analyses used at the medical microbiology diagnostic laboratory in the University Medical Center Groningen in The Netherlands is given. Furthermore, applications of NGS in the clinical setting are described, such as outbreak management, molecular case finding, characterization and surveillance of pathogens, rapid identification of bacteria using the 16S-23S rRNA region, taxonomy, metagenomics approaches on clinical samples, and the determination of the transmission of zoonotic micro-organisms from animals to humans. Finally, we share our vision on the use of NGS in personalised microbiology in the near future, pointing out specific requirements. Copyright © 2016 The Author(s). Published by Elsevier B.V. All rights reserved.
Deurenberg, Ruud H; Bathoorn, Erik; Chlebowicz, Monika A; Couto, Natacha; Ferdous, Mithila; García-Cobos, Silvia; Kooistra-Smid, Anna M D; Raangs, Erwin C; Rosema, Sigrid; Veloo, Alida C M; Zhou, Kai; Friedrich, Alexander W; Rossen, John W A
2017-05-20
Current molecular diagnostics of human pathogens provide limited information that is often not sufficient for outbreak and transmission investigation. Next generation sequencing (NGS) determines the DNA sequence of a complete bacterial genome in a single sequence run, and from these data, information on resistance and virulence, as well as information for typing is obtained, useful for outbreak investigation. The obtained genome data can be further used for the development of an outbreak-specific screening test. In this review, a general introduction to NGS is presented, including the library preparation and the major characteristics of the most common NGS platforms, such as the MiSeq (Illumina) and the Ion PGM™ (ThermoFisher). An overview of the software used for NGS data analyses used at the medical microbiology diagnostic laboratory in the University Medical Center Groningen in The Netherlands is given. Furthermore, applications of NGS in the clinical setting are described, such as outbreak management, molecular case finding, characterization and surveillance of pathogens, rapid identification of bacteria using the 16S-23S rRNA region, taxonomy, metagenomics approaches on clinical samples, and the determination of the transmission of zoonotic micro-organisms from animals to humans. Finally, we share our vision on the use of NGS in personalised microbiology in the near future, pointing out specific requirements. Copyright © 2017. Published by Elsevier B.V.
Omics Metadata Management Software (OMMS)
Perez-Arriaga, Martha O; Wilson, Susan; Williams, Kelly P; Schoeniger, Joseph; Waymire, Russel L; Powell, Amy Jo
2015-01-01
Next-generation sequencing projects have underappreciated information management tasks requiring detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods required for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide weak curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, typically offering no automation supporting unified naming and numbering conventions for sequencing production environments that routinely deal with hundreds, if not thousands, of samples at a time. Moreover, existing tools are not readily interfaced with bioinformatics executables (e.g., BLAST, Bowtie2, custom pipelines). Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and perform analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to validate installation and integrated function, and suggest possible methodological road maps for prospective users. Provided examples highlight possible OMMS workflows for metadata curation, multistep analyses, and results management and downloading. The OMMS can be implemented as a stand-alone package for individual laboratories, or can be configured for web-based deployment supporting geographically dispersed projects. The OMMS was developed using an open-source software base, is flexible, extensible and easily installed and executed. Availability: The OMMS can be obtained at http://omms.sandia.gov. PMID:26124554
Multi-proxies Approach of Climatic Records In Terrestrial Mollusks Shells
NASA Astrophysics Data System (ADS)
Labonne, M.; Rousseau, D. D.; Ben Othman, D.; Luck, J. M.; Metref, S.
Fossil land snail shells constitute a valuable source of information for the study of Quaternary deposits, as they are commonly preserved in many regions and notably in loess sequences. The stable isotope composition of the carbonate in the shells has previously been used to reconstruct past climates or environments, but the technique has not been widely exploited or compared with other proxies from the same sequence. In this study, we have analysed stable isotopes, trace elements and Sr isotopes from both shells of the land snail Vertigo modesta and the sediment from the Eustis upper Pleistocene loess sequence (Nebraska, USA). This series developed during the last glaciation and records the last deglaciation between 18,000 and 12,000 years B.P. We compare the paleoclimatic information obtained from different proxies, such as magnetic susceptibility and the temperature and moisture estimated from land snail assemblages, with geochemical data measured on land snail shells in order to validate the climatic information obtained with this proxy. Our study demonstrates that shell carbonate reflects environmental conditions estimated by other proxies. Carbon and oxygen isotopes show cyclic variations (millennial cycles) along the profile which correlate with stratigraphic units and could be linked with the retreat of the Laurentide ice sheet. Trace elements and Sr isotopes in the shells indicate various origins for the eolian dusts in the two main loess units along the sequence.
Sharp, Richard R
2011-03-01
As we look to a time when whole-genome sequencing is integrated into patient care, it is possible to anticipate a number of ethical challenges that will need to be addressed. The most intractable of these concern informed consent and the responsible management of very large amounts of genetic information. Given the range of possible findings, it remains unclear to what extent it will be possible to obtain meaningful patient consent to genomic testing. Equally unclear is how clinicians will disseminate the enormous volume of genetic information produced by whole-genome sequencing. Toward developing practical strategies for managing these ethical challenges, we propose a research agenda that approaches multiplexed forms of clinical genetic testing as natural laboratories in which to develop best practices for managing the ethical complexities of genomic medicine.
Decoding DNA, RNA and peptides with quantum tunnelling
NASA Astrophysics Data System (ADS)
di Ventra, Massimiliano; Taniguchi, Masateru
2016-02-01
Drugs and treatments could be precisely tailored to an individual patient by extracting their cellular- and molecular-level information. For this approach to be feasible on a global scale, however, information on complete genomes (DNA), transcriptomes (RNA) and proteomes (all proteins) needs to be obtained quickly and at low cost. Quantum mechanical phenomena could potentially be of value here, because the biological information needs to be decoded at an atomic level and quantum tunnelling has recently been shown to be able to differentiate single nucleobases and amino acids in short sequences. Here, we review the different approaches to using quantum tunnelling for sequencing, highlighting the theoretical background to the method and the experimental capabilities demonstrated to date. We also explore the potential advantages of the approach and the technical challenges that must be addressed to deliver practical quantum sequencing devices.
1987-01-01
identified in the difference spectra, implying that there are five to seven tryptophans within 17 A of the spin-label hapten. Amino acid sequences...of the heavy and light chains were obtained by a combination of amino acid and DNA sequencing. A molecular model was constructed from the sequence...Clore & acids yields detailed information about the amino acid com- Gronenborn, 1982, 1983). This technique should also identify position of the combining
Recognition of Drainage Tunnels during Glacier Lake Outburst Events from Terrestrial Image Sequences
NASA Astrophysics Data System (ADS)
Schwalbe, E.; Koschitzki, R.; Maas, H.-G.
2016-06-01
In recent years, many glaciers all over the world have been distinctly retreating and thinning. One of the consequences of this is the increase of so-called glacier lake outburst flood events (GLOFs). The mechanisms ruling such GLOF events are not yet fully understood by glaciologists. Thus, there is a demand for data and measurements that can help to understand and model the phenomena. A main issue is to obtain information about the location and formation of subglacial channels through which some lakes, dammed by a glacier, start to drain. The paper will show how photogrammetric image sequence analysis can be used to collect such data. For the purpose of detecting a subglacial tunnel, a camera has been installed in a pilot study to observe the area of the Colonia Glacier (Northern Patagonian Ice Field) where it dams Lake Cachet II. To verify the hypothesis that the course of the subglacial tunnel is indicated by irregular surface motion patterns during its collapse, the camera acquired image sequences of the glacier surface during several GLOF events. Applying tracking techniques to these image sequences, surface feature motion trajectories could be obtained for a dense raster of glacier points. Since only a single camera has been used for image sequence acquisition, depth information is required to scale the trajectories. Thus, for scaling and georeferencing of the measurements a GPS-supported photogrammetric network has been measured. The obtained motion fields of the Colonia Glacier deliver information about the glacier's behaviour before, during and after a GLOF event. If the daily vertical motion of the glacier is integrated over a period of several days and projected into a satellite image, the location and shape of the drainage channel underneath the glacier become visible. The high temporal resolution of the motion fields may also allow for an analysis of the tunnel's dynamics in comparison to the changing water level of the lake.
Ferreira, Diogo C; van der Linden, Marx G; de Oliveira, Leandro C; Onuchic, José N; de Araújo, Antônio F Pereira
2016-04-01
Recent ab initio folding simulations for a limited number of small proteins have corroborated a previous suggestion that atomic burial information obtainable from sequence could be sufficient for tertiary structure determination when combined with sequence-independent geometrical constraints. Here, we use simulations parameterized by native burials to investigate the required amount of information in a diverse set of globular proteins comprising different structural classes and a wide size range. Burial information is provided by a potential term pushing each atom towards one among a small number L of equiprobable concentric layers. An upper bound for the required information is provided by the minimal number of layers L(min) still compatible with correct folding behavior. We obtain L(min) between 3 and 5 for seven small to medium proteins with 50 ≤ Nr ≤ 110 residues, while for a larger protein with Nr = 141 we find that L ≥ 6 is required to maintain native stability. We additionally estimate the usable redundancy for a given L ≥ L(min) from the burial entropy associated with the largest folding-compatible fraction of "superfluous" atoms, for which the burial term can be turned off or target layers can be chosen randomly. The estimated redundancy for small proteins with L = 4 is close to 0.8. Our results are consistent with the above-average quality of burial predictions used in previous simulations and indicate that the fraction of approachable proteins could increase significantly with even a mild, plausible improvement in sequence-dependent burial prediction or in sequence-independent constraints that augment the detectable redundancy during simulations. © 2016 Wiley Periodicals, Inc.
Wellehan, James F X; Pessier, Allan P; Archer, Linda L; Childress, April L; Jacobson, Elliott R; Tesh, Robert B
2012-08-17
Rhabdoviruses infect a variety of hosts, including non-avian reptiles. Consensus PCR techniques were used to obtain partial RNA-dependent RNA polymerase gene sequence from five rhabdoviruses of South American lizards; Marco, Chaco, Timbo, Sena Madureira, and a rhabdovirus from a caiman lizard (Dracaena guianensis). The caiman lizard rhabdovirus formed inclusions in erythrocytes, which may be a route for infecting hematophagous insects. This is the first information on behavior of a rhabdovirus in squamates. We also obtained sequence from two rhabdoviruses of Australian lizards, confirming previous Charleville virus sequence and finding that, unlike a previous sequence report but in agreement with serologic reports, Almpiwar virus is clearly distinct from Charleville virus. Bayesian and maximum likelihood phylogenetic analysis revealed that most known rhabdoviruses of squamates cluster in the Almpiwar subgroup. The exception is Marco virus, which is found in the Hart Park group. Copyright © 2012 Elsevier B.V. All rights reserved.
Novel Approach to Analyzing MFE of Noncoding RNA Sequences
George, Tina P.; Thomas, Tessamma
2016-01-01
Genomic studies have become noncoding RNA (ncRNA) centric after the study of different genomes provided enormous information on ncRNA over the past decades. The function of ncRNA is decided by its secondary structure, and across organisms, the secondary structure is more conserved than the sequence itself. In this study, the optimal secondary structure or the minimum free energy (MFE) structure of ncRNA was found based on the thermodynamic nearest neighbor model. MFE of over 2600 ncRNA sequences was analyzed in view of its signal properties. Mathematical models linking MFE to the signal properties were found for each of the four classes of ncRNA analyzed. MFE values computed with the proposed models were in concordance with those obtained with the standard web servers. A total of 95% of the sequences analyzed had deviation of MFE values within ±15% relative to those obtained from standard web servers. PMID:27695341
Dual-pathway multi-echo sequence for simultaneous frequency and T2 mapping
NASA Astrophysics Data System (ADS)
Cheng, Cheng-Chieh; Mei, Chang-Sheng; Duryea, Jeffrey; Chung, Hsiao-Wen; Chao, Tzu-Cheng; Panych, Lawrence P.; Madore, Bruno
2016-04-01
Purpose: To present a dual-pathway multi-echo steady state sequence and reconstruction algorithm to capture T2, T2∗ and field map information. Methods: Typically, pulse sequences based on spin echoes are needed for T2 mapping while gradient echoes are needed for field mapping, making it difficult to jointly acquire both types of information. A dual-pathway multi-echo pulse sequence is employed here to generate T2 and field maps from the same acquired data. The approach might be used, for example, to obtain both thermometry and tissue damage information during thermal therapies, or susceptibility and T2 information from a same head scan, or to generate bonus T2 maps during a knee scan. Results: Quantitative T2, T2∗ and field maps were generated in gel phantoms, ex vivo bovine muscle, and twelve volunteers. T2 results were validated against a spin-echo reference standard: A linear regression based on ROI analysis in phantoms provided close agreement (slope/R2 = 0.99/0.998). A pixel-wise in vivo Bland-Altman analysis of R2 = 1/T2 showed a bias of 0.034 Hz (about 0.3%), as averaged over four volunteers. Ex vivo results, with and without motion, suggested that tissue damage detection based on T2 rather than temperature-dose measurements might prove more robust to motion. Conclusion: T2, T2∗ and field maps were obtained simultaneously, from the same datasets, in thermometry, susceptibility-weighted imaging and knee-imaging contexts.
Support for HIV-1 Intervention Therapy
1993-10-01
I. Kiselev, and E. S. Severin. 1990. Amplification of DNA 46 sequences of Epstein-Barr and human immunodeficiency viruses using DNA-polymerase from... develop and validate assays that predict or demonstrate disease progression for use in interventional trials with an emphasis on molecular biologic...to stay on the leading edge of technology development. A potential problem in obtaining quality sequence information is the occurrence of template
Rapid and accurate pyrosequencing of angiosperm plastid genomes
Moore, Michael J; Dhingra, Amit; Soltis, Pamela S; Shaw, Regina; Farmerie, William G; Folta, Kevin M; Soltis, Douglas E
2006-01-01
Background Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae). Results More than 99.75% of each plastid genome was simultaneously obtained during two GS 20 sequence runs, to an average depth of coverage of 24.6× in Nandina and 17.3× in Platanus. The Nandina and Platanus plastid genomes shared essentially identical gene complements and possessed the typical angiosperm plastid structure and gene arrangement. To assess the accuracy of the GS 20 sequence, over 45 kilobases of sequence were generated for each genome using conventional sequencing. Overall error rates of 0.043% and 0.031% were observed in GS 20 sequence for Nandina and Platanus, respectively. More than 97% of all observed errors were associated with homopolymer runs, with ~60% of all errors associated with homopolymer runs of 5 or more nucleotides and ~50% of all errors associated with regions of extensive homopolymer runs. No substitution errors were present in either genome. Error rates were generally higher in the single-copy and noncoding regions of both plastid genomes relative to the inverted repeat and coding regions. Conclusion Highly accurate and essentially complete sequence information was obtained for the Nandina and Platanus plastid genomes using the GS 20 System. 
More importantly, the high accuracy observed in the GS 20 plastid genome sequence was generated for a significant reduction in time and cost over traditional shotgun-based genome sequencing techniques, although with approximately half the coverage of previously reported GS 20 de novo genome sequence. The GS 20 should be broadly applicable to angiosperm plastid genome sequencing, and therefore promises to expand the scale of plant genetic and phylogenetic research dramatically. PMID:16934154
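Because the abstract above attributes most GS 20 errors to homopolymer runs, a small illustration of how such runs can be located in a sequence may help. The following is a minimal sketch only: the function name `homopolymer_runs` and the default threshold of 5 nucleotides (taken from the "5 or more nucleotides" figure in the abstract) are assumptions for illustration, not the authors' pipeline.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for every run of a single base
    that is at least min_len nucleotides long (0-based start)."""
    runs = []
    pos = 0
    # groupby yields one group per maximal stretch of identical characters
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Hypothetical toy sequence, not real plastid data
    print(homopolymer_runs("ACGTTTTTTGAAAAACC"))
```

Positions reported this way could then be intersected with observed error positions to estimate what fraction of errors fall in or near homopolymer runs.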
Pal, Debojyoti; Sharma, Deepak; Kumar, Mukesh; Sandur, Santosh K
2016-09-01
S-glutathionylation of proteins plays an important role in various biological processes and is known to be a protective modification during oxidative stress. Since experimental detection of S-glutathionylation is labor intensive and time consuming, a bioinformatics-based approach is a viable alternative. Available methods require relatively long sequence information, which may prevent prediction if sequence information is incomplete. Here, we present a model to predict glutathionylation sites from pentapeptide sequences. It is based upon differential association of amino acids with glutathionylated and non-glutathionylated cysteines from a database of experimentally verified sequences. These data were used to calculate position-dependent F-scores, which measure how a particular amino acid at a particular position may affect the likelihood of a glutathionylation event. A glutathionylation score (G-score), indicating the propensity of a sequence to undergo glutathionylation, was calculated using position-dependent F-scores for each amino acid. Cut-off values were used for prediction. Our model returned an accuracy of 58% with a Matthews correlation coefficient (MCC) value of 0.165. On an independent dataset, our model outperformed the currently available model, in spite of needing much less sequence information. Pentapeptide motifs having high abundance among glutathionylated proteins were identified. A list of potential glutathionylation hotspot sequences was obtained by assigning G-scores, and subsequent Protein-BLAST analysis revealed a total of 254 putative glutathionable proteins, a number of which were already known to be glutathionylated. Our model predicted glutathionylation sites in 93.93% of experimentally verified glutathionylated proteins. The outcome of this study may assist in discovering novel glutathionylation sites and finding candidate proteins for glutathionylation.
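The abstract above reports both accuracy and the Matthews correlation coefficient. As a reminder of how the two differ, here is a minimal sketch that computes both from a binary confusion matrix using the standard definitions. The function names and the example counts are hypothetical; the abstract does not give the underlying confusion matrix.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary classifier.
    Returns 0.0 when any marginal total is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation
    tp, tn, fp, fn = 30, 28, 22, 20
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

MCC ranges from -1 to 1 and stays near 0 for predictions that are close to random, which is why a moderate accuracy can coexist with a low MCC.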
Source-Adaptation-Based Wireless Video Transport: A Cross-Layer Approach
NASA Astrophysics Data System (ADS)
Qu, Qi; Pei, Yong; Modestino, James W.; Tian, Xusheng
2006-12-01
Real-time packet video transmission over wireless networks is expected to experience bursty packet losses that can cause substantial degradation to the transmitted video quality. In wireless networks, channel state information is hard to obtain in a reliable and timely manner due to the rapid change of wireless environments. However, the source motion information is always available and can be obtained easily and accurately from video sequences. Therefore, in this paper, we propose a novel cross-layer framework that exploits only the motion information inherent in video sequences and efficiently combines a packetization scheme, a cross-layer forward error correction (FEC)-based unequal error protection (UEP) scheme, an intracoding rate selection scheme as well as a novel intraframe interleaving scheme. Our objective and subjective results demonstrate that the proposed approach is very effective in dealing with the bursty packet losses occurring on wireless networks without incurring any additional implementation complexity or delay. Thus, the simplicity of our proposed system has important implications for the implementation of a practical real-time video transmission system.
Schilmiller, Anthony L.; Miner, Dennis P.; Larson, Matthew; McDowell, Eric; Gang, David R.; Wilkerson, Curtis; Last, Robert L.
2010-01-01
Shotgun proteomics analysis allows hundreds of proteins to be identified and quantified from a single sample at relatively low cost. Extensive DNA sequence information is a prerequisite for shotgun proteomics, and it is ideal to have sequence for the organism being studied rather than from related species or accessions. While this requirement has limited the set of organisms that are candidates for this approach, next generation sequencing technologies make it feasible to obtain deep DNA sequence coverage from any organism. As part of our studies of specialized (secondary) metabolism in tomato (Solanum lycopersicum) trichomes, 454 sequencing of cDNA was combined with shotgun proteomics analyses to obtain in-depth profiles of genes and proteins expressed in leaf and stem glandular trichomes of 3-week-old plants. The expressed sequence tag and proteomics data sets combined with metabolite analysis led to the discovery and characterization of a sesquiterpene synthase that produces β-caryophyllene and α-humulene from E,E-farnesyl diphosphate in trichomes of leaf but not of stem. This analysis demonstrates the utility of combining high-throughput cDNA sequencing with proteomics experiments in a target tissue. These data can be used for dissection of other biochemical processes in these specialized epidermal cells. PMID:20431087
Ultraaccurate genome sequencing and haplotyping of single human cells.
Chu, Wai Keung; Edge, Peter; Lee, Ho Suk; Bansal, Vikas; Bafna, Vineet; Huang, Xiaohua; Zhang, Kun
2017-11-21
Accurate detection of variants and long-range haplotypes in genomes of single human cells remains very challenging. Common approaches require extensive in vitro amplification of genomes of individual cells using DNA polymerases and high-throughput short-read DNA sequencing. These approaches have two notable drawbacks. First, polymerase replication errors could generate tens of thousands of false-positive calls per genome. Second, relatively short sequence reads contain little to no haplotype information. Here we report a method, which is dubbed SISSOR (single-stranded sequencing using microfluidic reactors), for accurate single-cell genome sequencing and haplotyping. A microfluidic processor is used to separate the Watson and Crick strands of the double-stranded chromosomal DNA in a single cell and to randomly partition megabase-size DNA strands into multiple nanoliter compartments for amplification and construction of barcoded libraries for sequencing. The separation and partitioning of large single-stranded DNA fragments of the homologous chromosome pairs allows for the independent sequencing of each of the complementary and homologous strands. This enables the assembly of long haplotypes and reduction of sequence errors by using the redundant sequence information and haplotype-based error removal. We demonstrated the ability to sequence single-cell genomes with error rates as low as 10⁻⁸ and average 500-kb-long DNA fragments that can be assembled into haplotype contigs with N50 greater than 7 Mb. The performance could be further improved with more uniform amplification and more accurate sequence alignment. The ability to obtain accurate genome sequences and haplotype information from single cells will enable applications of genome sequencing for diverse clinical needs. Copyright © 2017 the Author(s). Published by PNAS.
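The N50 statistic quoted above is easy to misread, so a short worked sketch may be useful. It uses the common definition: the largest length L such that contigs of length L or more together cover at least half of the total assembled length. The function name `n50` and the toy lengths are illustrative assumptions, not values from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or haplotype block) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in Mb
    print(n50([2, 3, 4, 5, 6]))
```

With the toy input the total is 20; the two longest contigs (6 and 5) cover 11, which is at least half, so the N50 is 5.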
Deng, Youping; Dong, Yinghua; Thodima, Venkata; Clem, Rollie J; Passarelli, A Lorena
2006-01-01
Background Little is known about the genome sequences of lepidopteran insects, although this group of insects has been studied extensively in the fields of endocrinology, development, immunity, and pathogen-host interactions. In addition, cell lines derived from Spodoptera frugiperda and other lepidopteran insects are routinely used for baculovirus foreign gene expression. This study reports the results of an expressed sequence tag (EST) sequencing project in cells from the lepidopteran insect S. frugiperda, the fall armyworm. Results We have constructed an EST database using two cDNA libraries from the S. frugiperda-derived cell line, SF-21. The database consists of 2,367 ESTs which were assembled into 244 contigs and 951 singlets for a total of 1,195 unique sequences. Conclusion S. frugiperda is an agriculturally important pest insect and genomic information will be instrumental for establishing initial transcriptional profiling and gene function studies, and for obtaining information about genes manipulated during infections by insect pathogens such as baculoviruses. PMID:17052344
Protein Sectors: Statistical Coupling Analysis versus Conservation
Teşileanu, Tiberiu; Colwell, Lucy J.; Leibler, Stanislas
2015-01-01
Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed “sectors”. The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation. PMID:25723535
ERIC Educational Resources Information Center
Byrd, Rita, Comp.; And Others
This guide is intended to help Kentucky families of children with disabilities or other special needs to find services in their communities. Introductory information offers guidance in selecting services, applying for services, obtaining services, and organizing to obtain unavailable services. A question-and-answer format is used, with space…
Ribot, Emeline J; Trotier, Aurélien J; Castets, Charles R; Dallaudière, Benjamin; Thiaudière, Eric; Franconi, Jean-Michel; Miraux, Sylvain
2016-02-01
The goal of this study was to develop a 3D diffusion-weighted sequence for free-breathing liver imaging in small animals at high magnetic field. Hepatic metastases were detected and the apparent diffusion coefficients (ADC) were measured. A 3D SE-EPI sequence was developed by inserting (i) a water-selective excitation radiofrequency pulse to suppress adipose tissue signal and (ii) bipolar diffusion gradients to decrease the sensitivity to respiratory motion. Mice with hepatic metastases were imaged at 7T by applying b values from 200 to 1100 s/mm². 3D images with high spatial resolution (182 × 156 × 125 µm) were obtained in only 8 min 32 s. The modified DW-SE-EPI sequence yielded 3D abdominal images of healthy mice with fat SNR 2.5 times lower than without any fat suppression method and sharpness 2.8 times higher than on respiration-triggered images. Due to the high spatial resolution, the core and the periphery of disseminated hepatic metastases were differentiated at high b-values only, demonstrating the presence of edema and proliferating cells (with ADC of 2.65 × 10⁻³ and 1.55 × 10⁻³ mm²/s, respectively). Furthermore, these metastases were accurately distinguished from proliferating ones within the same animal at high b-values (mean ADC of 0.38 × 10⁻³ mm²/s). Metastases of less than 1.7 mm³ diameter were detected. The new 3D SE-EPI sequence made it possible to obtain diffusion information within liver metastases. In addition to intra-metastasis heterogeneity, differences in diffusion were measured between metastases within an animal. This sequence could be used to obtain diffusion information at high magnetic field.
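The abstract does not say how ADC values were fitted. A common approach, assumed here purely for illustration, is the mono-exponential model S(b) = S0 · exp(−b · ADC), fitted by linear least squares on ln S against b. The function name `adc_fit` and the synthetic signal values below are hypothetical.

```python
import math


def adc_fit(b_values, signals):
    """Estimate ADC (mm^2/s if b is in s/mm^2) as minus the least-squares
    slope of ln(signal) versus b, assuming mono-exponential decay."""
    n = len(b_values)
    if n < 2 or n != len(signals):
        raise ValueError("need at least two matched (b, signal) pairs")
    logs = [math.log(s) for s in signals]
    mean_b = sum(b_values) / n
    mean_l = sum(logs) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (l - mean_l) for b, l in zip(b_values, logs))
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic, noise-free signals generated with a known ADC
    true_adc = 1.55e-3
    bs = [200, 500, 800, 1100]
    sig = [1000.0 * math.exp(-b * true_adc) for b in bs]
    print(f"fitted ADC = {adc_fit(bs, sig):.3e} mm^2/s")
```

On noise-free synthetic data the fit recovers the ADC used to generate the signals; real data would need noise handling and per-voxel fitting.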
De novo transcriptomic analysis and development of EST-SSRs for Sorbus pohuashanensis (Hance) Hedl.
Guan, Xuelian; Fu, Qiang; Zhang, Ze; Hu, Zenghui; Zheng, Jian; Lu, Yizeng; Li, Wei
2017-01-01
Sorbus pohuashanensis is a native tree species of northern China that is used for a variety of ecological purposes. The species is often grown as an ornamental landscape tree because of its beautiful form, silver flowers in early summer, attractive pinnate leaves in summer, and red leaves and fruits in autumn. However, development and further utilization of the species are hindered by the lack of comprehensive genetic information, which impedes research into its genetics and molecular biology. Recent advances in de novo transcriptome sequencing (RNA-seq) technology have provided an effective means to obtain genomic information from non-model species. Here, we applied RNA-seq for sequencing S. pohuashanensis leaves and obtained a total of 137,506 clean reads. After assembly, 96,213 unigenes with an average length of 770 bp were obtained. We found that 64.5% of the unigenes could be annotated using bioinformatics tools to analyze gene function and alignment with the NCBI database. Overall, 59,089 unigenes were annotated using the Nr (non-redundant protein) database, 35,225 unigenes were annotated using the GO (Gene Ontology) database, and 33,168 unigenes were annotated using the COG (Cluster of Orthologous Groups) database. Analysis of the unigenes using the KEGG (Kyoto Encyclopedia of Genes and Genomes) database indicated that 13,953 unigenes were involved in 322 metabolic pathways. Finally, simple sequence repeat (SSR) site detection identified 6,604 unigenes that included EST-SSRs and a total of 7,473 EST-SSRs in the unigene sequences. Fifteen polymorphic SSRs were screened and found to be of use for future genetic research. These unigene sequences will provide important genetic resources for genetic improvement and investigation of biochemical processes in S. pohuashanensis. PMID:28614366
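To make the SSR detection step above more concrete, here is a minimal regular-expression sketch that finds perfect tandem repeats in a nucleotide sequence. It is not the tool used in the study, which the abstract does not name; the function name `find_ssrs` and the thresholds (motifs of 2 to 6 nt, at least 4 repeats) are assumptions chosen only for illustration.

```python
import re


def find_ssrs(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats.
    Mononucleotide stretches (e.g. AAAAAAAA) are skipped."""
    # Lazy motif match so the shortest repeating unit is preferred;
    # \2 then requires at least min_repeats - 1 further copies of it.
    pattern = re.compile(
        r"(([ACGT]{%d,%d}?)\2{%d,})" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(2)
        if len(set(motif)) == 1:
            continue  # homopolymer matched as a repeated "AA", "CCC", ...
        hits.append((m.start(), motif, len(m.group(1)) // len(motif)))
    return hits


if __name__ == "__main__":
    # Hypothetical toy unigene fragment
    print(find_ssrs("GGCATATATATATCCGTT"))
```

Dedicated SSR tools typically use different minimum repeat counts per motif length and also report compound repeats, which this sketch does not attempt.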
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2009-01-01
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank® staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
Genome sequencing of a single tardigrade Hypsibius dujardini individual
Arakawa, Kazuharu; Yoshida, Yuki; Tomita, Masaru
2016-01-01
Tardigrades are ubiquitous microscopic animals that play an important role in the study of metazoan phylogeny. Most terrestrial tardigrades can withstand extreme environments by entering an ametabolic desiccated state termed anhydrobiosis. Due to their small size and the non-axenic nature of laboratory cultures, molecular studies of tardigrades are prone to contamination. To minimize the possibility of microbial contaminations and to obtain high-quality genomic information, we have developed an ultra-low input library sequencing protocol to enable the genome sequencing of a single tardigrade Hypsibius dujardini individual. Here, we describe the details of our sequencing data and the ultra-low input library preparation methodologies. PMID:27529330
Genome sequencing of a single tardigrade Hypsibius dujardini individual.
Arakawa, Kazuharu; Yoshida, Yuki; Tomita, Masaru
2016-08-16
Tardigrades are ubiquitous microscopic animals that play an important role in the study of metazoan phylogeny. Most terrestrial tardigrades can withstand extreme environments by entering an ametabolic desiccated state termed anhydrobiosis. Due to their small size and the non-axenic nature of laboratory cultures, molecular studies of tardigrades are prone to contamination. To minimize the possibility of microbial contaminations and to obtain high-quality genomic information, we have developed an ultra-low input library sequencing protocol to enable the genome sequencing of a single tardigrade Hypsibius dujardini individual. Here, we describe the details of our sequencing data and the ultra-low input library preparation methodologies.
New Stopping Criteria for Segmenting DNA Sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Wentian
2001-06-18
We propose a solution to the problem of choosing a stopping criterion when segmenting inhomogeneous DNA sequences with complex statistical patterns. The new stopping criterion is based on the Bayesian information criterion in the model selection framework. When this criterion is applied to the telomere of S. cerevisiae and the complete sequence of E. coli, borders of biologically meaningful units were identified, and a more reasonable number of domains was obtained. We also introduce a measure called segmentation strength, which can be used to control the delineation of large domains. The relationship between the average domain size and the threshold of segmentation strength is determined for several genome sequences.
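The idea of a BIC-based stopping rule can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes a simple model in which bases within a segment are independent, and it assumes that each accepted cut adds four parameters (three free base frequencies plus the cut position). The function names and the `extra_params` argument are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(seq):
    """Maximized multinomial log-likelihood of a segment with i.i.d. symbols."""
    n = len(seq)
    return sum(c * math.log(c / n) for c in Counter(seq).values())

def best_split(seq):
    """Return (cut, gain) for the cut that maximizes the log-likelihood gain."""
    whole = log_likelihood(seq)
    best_cut, best_gain = None, 0.0
    for cut in range(1, len(seq)):
        gain = log_likelihood(seq[:cut]) + log_likelihood(seq[cut:]) - whole
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

def segmentation_strength(gain, n, extra_params=4):
    """Relative excess of 2*gain over the assumed BIC penalty."""
    penalty = extra_params * math.log(n)
    return (2 * gain - penalty) / penalty

def segment(seq, offset=0, extra_params=4):
    """Recursively cut seq; stop when the BIC penalty outweighs the gain."""
    n = len(seq)
    if n < 2:
        return []
    cut, gain = best_split(seq)
    # Assumed stopping rule: accept a cut only if 2*gain > extra_params * ln(n).
    if cut is None or 2 * gain <= extra_params * math.log(n):
        return []
    left = segment(seq[:cut], offset, extra_params)
    right = segment(seq[cut:], offset + cut, extra_params)
    return left + [offset + cut] + right

if __name__ == "__main__":
    print(segment("A" * 50 + "G" * 50))
```

The exhaustive search in `best_split` is quadratic in the segment length, so this sketch is only suitable for short sequences. Raising the threshold on `segmentation_strength` above zero would keep only the stronger borders.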
Arai, Yuuki; Maeda, Akiko; Hirami, Yasuhiko; Ishigami, Chie; Kosugi, Shinji; Mandai, Michiko; Kurimoto, Yasuo; Takahashi, Masayo
2015-01-01
The aim of this study was to gain information about disease prevalence and to identify the responsible genes for inherited retinal dystrophies (IRD) in Japanese populations. Clinical and molecular evaluations were performed on 349 patients with IRD. For segregation analyses, 63 of their family members were employed. Bioinformatics data from 1,208 Japanese individuals were used as controls. Molecular diagnosis was obtained by direct sequencing in a stepwise fashion utilizing one or two panels of 15 and 27 genes for retinitis pigmentosa patients. If a specific clinical diagnosis was suspected, direct sequencing of disease-specific genes, that is, ABCA4 for Stargardt disease, was conducted. Limited availability of intrafamily information and decreasing family size hampered identifying inherited patterns. Differential disease profiles with lower prevalence of Stargardt disease from European and North American populations were obtained. We found 205 sequence variants in 159 of 349 probands with an identification rate of 45.6%. This study found 43 novel sequence variants. In silico analysis suggests that 20 of 25 novel missense variants are pathogenic. EYS mutations had the highest prevalence at 23.5%. c.4957_4958insA and c.8868C>A were the two major EYS mutations identified in this cohort. EYS mutations are the most prevalent among Japanese patients with IRD.
DArT Markers Effectively Target Gene Space in the Rye Genome
Gawroński, Piotr; Pawełkowicz, Magdalena; Tofil, Katarzyna; Uszyński, Grzegorz; Sharifova, Saida; Ahluwalia, Shivaksh; Tyrka, Mirosław; Wędzony, Maria; Kilian, Andrzej; Bolibok-Brągoszewska, Hanna
2016-01-01
Large genome size and complexity hamper considerably the genomics research in relevant species. Rye (Secale cereale L.) has one of the largest genomes among cereal crops and repetitive sequences account for over 90% of its length. Diversity Arrays Technology is a high-throughput genotyping method, in which a preferential sampling of gene-rich regions is achieved through the use of methylation sensitive restriction enzymes. We obtained sequences of 6,177 rye DArT markers and following a redundancy analysis assembled them into 3,737 non-redundant sequences, which were then used in homology searches against five Pooideae sequence sets. In total 515 DArT sequences could be incorporated into publicly available rye genome zippers providing a starting point for the integration of DArT- and transcript-based genomics resources in rye. Using Blast2Go pipeline we attributed putative gene functions to 1101 (29.4%) of the non-redundant DArT marker sequences, including 132 sequences with putative disease resistance-related functions, which were found to be preferentially located in the 4RL and 6RL chromosomes. Comparative analysis based on the DArT sequences revealed obvious inconsistencies between two recently published high density consensus maps of rye. Furthermore we demonstrated that DArT marker sequences can be a source of SSR polymorphisms. Obtained data demonstrate that DArT markers effectively target gene space in the large, complex, and repetitive rye genome. Through the annotation of putative gene functions and the alignment of DArT sequences relative to reference genomes we obtained information, that will complement the results of the studies, where DArT genotyping was deployed, by simplifying the gene ontology and microcolinearity based identification of candidate genes. PMID:27833625
dCITE: Measuring Necessary Cladistic Information Can Help You Reduce Polytomy Artefacts in Trees.
Wise, Michael J
2016-01-01
Biologists regularly create phylogenetic trees to better understand the evolutionary origins of their species of interest, and often use genomes as their data source. However, as more and more incomplete genomes are published, in many cases it may not be possible to compute genome-based phylogenetic trees due to large gaps in the assembled sequences. In addition, comparison of complete genomes may not even be desirable due to the presence of horizontally acquired and homologous genes. A decision must therefore be made about which gene, or gene combinations, should be used to compute a tree. Deflated Cladistic Information based on Total Entropy (dCITE) is proposed as an easily computed metric for measuring the cladistic information in multiple sequence alignments representing a range of taxa, without the need to first compute the corresponding trees. dCITE scores can be used to rank candidate genes or decide whether input sequences provide insufficient cladistic information, making artefactual polytomies more likely. The dCITE method can be applied to protein, nucleotide or encoded phenotypic data, so can be used to select which data-type is most appropriate, given the choice. In a series of experiments the dCITE method was compared with related measures. Then, as a practical demonstration, the ideas developed in the paper were applied to a dataset representing species from the order Campylobacterales; trees based on sequence combinations, selected on the basis of their dCITE scores, were compared with a tree constructed to mimic Multi-Locus Sequence Typing (MLST) combinations of fragments. We see that the greater the dCITE score the more likely it is that the computed phylogenetic tree will be free of artefactual polytomies. Secondly, cladistic information saturates, beyond which little additional cladistic information can be obtained by adding additional sequences. Finally, sequences with high cladistic information produce more consistent trees for the same taxa.
PMID:27898695
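The abstract does not give the dCITE formula, so the following Python sketch does not reproduce it. It only illustrates the kind of ingredient an entropy-based score can build on: the Shannon entropy of each column of a multiple sequence alignment, summed over columns. Ignoring gap characters and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps ('-')."""
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def total_entropy(alignment):
    """Sum of column entropies over an alignment of equal-length rows."""
    return sum(column_entropy(col) for col in zip(*alignment))

if __name__ == "__main__":
    rows = ["ACGT", "ACGA", "ACCA", "ATCA"]
    print(total_entropy(rows))
```

A fully conserved column contributes zero, so under this simple measure candidate genes whose alignments vary more across taxa score higher.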
Cabrera, Ana R; Donohue, Kevin V; Khalil, Sayed M S; Scholl, Elizabeth; Opperman, Charles; Sonenshine, Daniel E; Roe, R Michael
2011-01-01
Many species of mites and ticks are of agricultural and medical importance. Much can be learned from the study of transcriptomes of acarines which can generate DNA-sequence information of potential target genes for the control of acarine pests. High throughput transcriptome sequencing can also yield sequences of genes critical during physiological processes poorly understood in acarines, i.e., the regulation of female reproduction in mites. The predatory mite, Phytoseiulus persimilis, was selected to conduct a transcriptome analysis using 454 pyrosequencing. The objective of this project was to obtain DNA-sequence information of expressed genes from P. persimilis with special interest in sequences corresponding to vitellogenin (Vg) and the vitellogenin receptor (VgR). These genes are critical to the understanding of vitellogenesis, and they will facilitate the study of the regulation of mite female reproduction. A total of 12,556 contiguous sequences (contigs) were assembled with an average size of 935bp. From these sequences, the putative translated peptides of 11 contigs were similar in amino acid sequences to other arthropod Vgs, while 6 were similar to VgRs. We selected some of these sequences to conduct stage-specific expression studies to further determine their function. 2010 Elsevier Ltd. All rights reserved.
Code of Federal Regulations, 2014 CFR
2014-01-01
... Requirements for Licensed Launch, Including Suborbital Launch I. General Information A. Mission description. 1.... Orbit altitudes (apogee and perigee). 2. Flight sequence. 3. Staging events and the time for each event... shall cover the range of launch trajectories, inclinations and orbits for which authorization is sought...
Code of Federal Regulations, 2013 CFR
2013-01-01
... Requirements for Licensed Launch, Including Suborbital Launch I. General Information A. Mission description. 1.... Orbit altitudes (apogee and perigee). 2. Flight sequence. 3. Staging events and the time for each event... shall cover the range of launch trajectories, inclinations and orbits for which authorization is sought...
Code of Federal Regulations, 2012 CFR
2012-01-01
... Requirements for Licensed Launch, Including Suborbital Launch I. General Information A. Mission description. 1.... Orbit altitudes (apogee and perigee). 2. Flight sequence. 3. Staging events and the time for each event... shall cover the range of launch trajectories, inclinations and orbits for which authorization is sought...
Code of Federal Regulations, 2011 CFR
2011-01-01
... Requirements for Licensed Launch, Including Suborbital Launch I. General Information A. Mission description. 1.... Orbit altitudes (apogee and perigee). 2. Flight sequence. 3. Staging events and the time for each event... shall cover the range of launch trajectories, inclinations and orbits for which authorization is sought...
Mathur, Rinku; Adlakha, Neeru
2014-06-01
Phylogenetic trees give information about the vertical relationships between ancestors and descendants, whereas phylogenetic networks are used to visualize horizontal relationships among different organisms. Predicting reticulate events therefore requires the construction of phylogenetic networks. Here, a Linear Programming (LP) model has been developed for the construction of phylogenetic networks. The model is validated using data sets of chloroplast 16S rRNA sequences of photosynthetic organisms and Influenza A/H5N1 viruses. The results obtained are in agreement with those obtained by earlier researchers.
Tamura, Takeyuki; Akutsu, Tatsuya
2007-11-30
Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. This is a problem of predicting which part in a cell a given protein is transported to, where an amino acid sequence of the protein is given as an input. This problem is becoming more important since information on subcellular location is helpful for annotation of proteins and genes and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracies. In this paper, we propose a novel and general predicting method by combining techniques for sequence alignment and feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. Through fivefold cross validation tests, the obtained overall accuracies and average MCC were 0.9096 and 0.8655 respectively. We also applied our method to other datasets including that of WoLF PSORT. Although there is a predictor which uses the information of gene ontology and yields higher accuracy than ours, our accuracies are higher than existing predictors which use only sequence information. Since such information as gene ontology can be obtained only for known proteins, our predictor is considered to be useful for subcellular location prediction of newly-discovered proteins. Furthermore, the idea of combination of alignment and amino acid frequency is novel and general so that it may be applied to other problems in bioinformatics. Our method for plant is also implemented as a web-system and available on http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html.
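Two quantities mentioned in this abstract are easy to make concrete: an amino acid composition feature vector and the Matthews correlation coefficient (MCC) used to report accuracy. The Python sketch below shows only these two pieces. How the authors combine composition with alignment information and feed it to a support vector machine is not described here, so it is not attempted. The function names and the residue ordering are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues

def composition(sequence):
    """Return a 20-dimensional vector of amino acid frequencies."""
    seq = [ch for ch in sequence.upper() if ch in AMINO_ACIDS]
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom

if __name__ == "__main__":
    print(composition("MKTAYIAKQR")[:3])
    print(mcc(tp=45, tn=40, fp=5, fn=10))
```

For a multi-class problem such as subcellular location, an MCC can be computed per class (one class versus the rest) and then averaged, which is one reading of the "average MCC" reported above.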
Empirical Bayes Estimation of Coalescence Times from Nucleotide Sequence Data.
King, Leandra; Wakeley, John
2016-09-01
We demonstrate the advantages of using information at many unlinked loci to better calibrate estimates of the time to the most recent common ancestor (TMRCA) at a given locus. To this end, we apply a simple empirical Bayes method to estimate the TMRCA. This method is both asymptotically optimal, in the sense that the estimator converges to the true value when the number of unlinked loci for which we have information is large, and has the advantage of not making any assumptions about demographic history. The algorithm works as follows: we first split the sample at each locus into inferred left and right clades to obtain many estimates of the TMRCA, which we can average to obtain an initial estimate of the TMRCA. We then use nucleotide sequence data from other unlinked loci to form an empirical distribution that we can use to improve this initial estimate. Copyright © 2016 by the Genetics Society of America.
Luft, F; Klaes, R; Nees, M; Dürst, M; Heilmann, V; Melsheimer, P; von Knebel Doeberitz, M
2001-04-01
Human papillomavirus (HPV) genomes usually persist as episomal molecules in HPV-associated preneoplastic lesions, whereas they are frequently integrated into the host cell genome in HPV-related cancer cells. This suggests that malignant conversion of HPV-infected epithelia is linked to recombination of cellular and viral sequences. Due to technical limitations, precise sequence information on viral-cellular junctions has been obtained for only a few cell lines and primary lesions. In order to facilitate the molecular analysis of genomic HPV integration, we established a ligation-mediated PCR assay for the detection of integrated papillomavirus sequences (DIPS-PCR). DIPS-PCR was initially used to amplify genomic viral-cellular junctions from HPV-associated cervical cancer cell lines (C4-I, C4-II, SW756, and HeLa) and HPV-immortalized keratinocyte lines (HPKIA, HPKII). In addition to junctions already reported in public databases, various new fusion fragments were identified. Subsequently, 22 different viral-cellular junctions were amplified from 17 cervical carcinomas and 1 vulval intraepithelial neoplasia (VIN III). Sequence analysis of each junction revealed that the viral E1 open reading frame (ORF) was fused to cellular sequences in 20 of 22 (91%) cases. Chromosomal integration loci mapped to chromosomes 1 (2n), 2 (3n), 7 (2n), 8 (3n), 10 (1n), 14 (5n), 16 (1n), 17 (2n), and mitochondrial DNA (1n), suggesting random distribution of chromosomal integration sites. Precise sequence information obtained by DIPS-PCR was further used to monitor the monoclonal origin of 4 cervical cancers, 1 case of recurrent premalignant lesions and 1 lymph node metastasis. Therefore, DIPS-PCR might allow efficient therapy control and prediction of relapse in patients with HPV-associated anogenital cancers. Copyright 2001 Wiley-Liss, Inc.
The quest for rare variants: pooled multiplexed next generation sequencing in plants.
Marroni, Fabio; Pinosio, Sara; Morgante, Michele
2012-01-01
Next generation sequencing (NGS) instruments produce an unprecedented amount of sequence data at contained costs. This gives researchers the possibility of designing studies with adequate power to identify rare variants at a fraction of the economic and labor resources required by individual Sanger sequencing. As of today, few research groups working in plant sciences have exploited this potentiality, showing that pooled NGS provides results in excellent agreement with those obtained by individual Sanger sequencing. The aim of this review is to convey to the reader the general ideas underlying the use of pooled NGS for the identification of rare variants. To facilitate a thorough understanding of the possibilities of the method, we will explain in detail the possible experimental and analytical approaches and discuss their advantages and disadvantages. We will show that information on allele frequency obtained by pooled NGS can be used to accurately compute basic population genetics indexes such as allele frequency, nucleotide diversity, and Tajima's D. Finally, we will discuss applications and future perspectives of the multiplexed NGS approach.
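As a rough illustration of how allele frequencies can feed the population genetics indexes named above, the Python sketch below computes nucleotide diversity and Tajima's D from a list of per-site allele frequencies, using the standard textbook formulas. It assumes biallelic segregating sites and treats `n` as the number of chromosomes in the pool. It ignores the extra sampling noise that read depth introduces in pooled data, which a real analysis would need to correct for. The function names are hypothetical.

```python
import math

def nucleotide_diversity(freqs, n):
    """Sum over segregating sites of the unbiased heterozygosity n/(n-1) * 2p(1-p)."""
    return sum(n / (n - 1) * 2 * p * (1 - p) for p in freqs)

def tajimas_d(freqs, n):
    """Tajima's D from allele frequencies at S segregating sites in n chromosomes."""
    s = len(freqs)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    pi = nucleotide_diversity(freqs, n)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

if __name__ == "__main__":
    print(nucleotide_diversity([0.1, 0.5], n=10))
    print(tajimas_d([0.1] * 5, n=10))
```

Dividing the diversity by the number of sites surveyed gives the per-site value. A sample dominated by rare variants, as in the second call, yields a negative D.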
Zhang, Hongkai; Torkamani, Ali; Jones, Teresa M; Ruiz, Diana I; Pons, Jaume; Lerner, Richard A
2011-08-16
Use of large combinatorial antibody libraries and next-generation sequencing of nucleic acids are two of the most powerful methods in modern molecular biology. The libraries are screened using the principles of evolutionary selection, albeit in real time, to enrich for members with a particular phenotype. This selective process necessarily results in the loss of information about less-fit molecules. On the other hand, sequencing of the library, by itself, gives information that is mostly unrelated to phenotype. If the two methods could be combined, the full potential of very large molecular libraries could be realized. Here we report the implementation of a phenotype-information-phenotype cycle that integrates information and gene recovery. After selection for phage-encoded antibodies that bind to targets expressed on the surface of Escherichia coli, the information content of the selected pool is obtained by pyrosequencing. Sequences that encode specific antibodies are identified by a bioinformatic analysis and recovered by a stringent affinity method that is uniquely suited for gene isolation from a highly degenerate collection of nucleic acids. This approach can be generalized for selection of antibodies against targets that are present as minor components of complex systems.
Zhu, Yuan O; Aw, Pauline P K; de Sessions, Paola Florez; Hong, Shuzhen; See, Lee Xian; Hong, Lewis Z; Wilm, Andreas; Li, Chen Hao; Hue, Stephane; Lim, Seng Gee; Nagarajan, Niranjan; Burkholder, William F; Hibberd, Martin
2017-10-27
Viral populations are complex, dynamic, and fast evolving. The evolution of groups of closely related viruses in a competitive environment is termed quasispecies. To fully understand the role that quasispecies play in viral evolution, characterizing the trajectories of viral genotypes in an evolving population is the key. In particular, long-range haplotype information for thousands of individual viruses is critical; yet generating this information is non-trivial. Popular deep sequencing methods generate relatively short reads that do not preserve linkage information, while third generation sequencing methods have higher error rates that make detection of low frequency mutations a bioinformatics challenge. Here we applied BAsE-Seq, an Illumina-based single-virion sequencing technology, to eight samples from four chronic hepatitis B (CHB) patients - once before antiviral treatment and once after viral rebound due to resistance. With single-virion sequencing, we obtained 248-8796 single-virion sequences per sample, which allowed us to find evidence for both hard and soft selective sweeps. We were able to reconstruct population demographic history that was independently verified by clinically collected data. We further verified four of the samples independently through PacBio SMRT and Illumina Pooled deep sequencing. Overall, we showed that single-virion sequencing yields insight into viral evolution and population dynamics in an efficient and high throughput manner. We believe that single-virion sequencing is widely applicable to the study of viral evolution in the context of drug resistance and host adaptation, allows differentiation between soft or hard selective sweeps, and may be useful in the reconstruction of intra-host viral population demographic history.
Analysis of secreted proteins from Aspergillus flavus.
Medina, Martha L; Haynes, Paul A; Breci, Linda; Francisco, Wilson A
2005-08-01
MS/MS techniques in proteomics make possible the identification of proteins from organisms with little or no genome sequence information available. Peptide sequences are obtained from tandem mass spectra by matching peptide mass and fragmentation information to protein sequence information from related organisms, including unannotated genome sequence data. This peptide identification data can then be grouped and reconstructed into protein data. In this study, we have used this approach to study protein secretion by Aspergillus flavus, a filamentous fungus for which very little genome sequence information is available. A. flavus is capable of degrading the flavonoid rutin (quercetin 3-O-glycoside), as the only source of carbon via an extracellular enzyme system. In this continuing study, a proteomic analysis was used to identify secreted proteins from A. flavus when grown on rutin. The growth media glucose and potato dextrose were used to identify differentially expressed secreted proteins. The secreted proteins were analyzed by 1- and 2-DE and MS/MS. A total of 51 unique A. flavus secreted proteins were identified from the three growth conditions. Ten proteins were unique to rutin-, five to glucose- and one to potato dextrose-grown A. flavus. Sixteen secreted proteins were common to all three media. Fourteen identifications were of hypothetical proteins or proteins of unknown functions. To our knowledge, this is the first extensive proteomic study conducted to identify the secreted proteins from a filamentous fungus.
Badisco, Liesbeth; Huybrechts, Jurgen; Simonet, Gert; Verlinden, Heleen; Marchal, Elisabeth; Huybrechts, Roger; Schoofs, Liliane; De Loof, Arnold; Vanden Broeck, Jozef
2011-03-21
The desert locust (Schistocerca gregaria) displays a fascinating type of phenotypic plasticity, designated as 'phase polyphenism'. Depending on environmental conditions, one genome can be translated into two highly divergent phenotypes, termed the solitarious and gregarious (swarming) phase. Although many of the underlying molecular events remain elusive, the central nervous system (CNS) is expected to play a crucial role in the phase transition process. Locusts have also proven to be interesting model organisms in a physiological and neurobiological research context. However, molecular studies in locusts are hampered by the fact that genome/transcriptome sequence information available for this branch of insects is still limited. We have generated 34,672 raw expressed sequence tags (EST) from the CNS of desert locusts in both phases. These ESTs were assembled in 12,709 unique transcript sequences and nearly 4,000 sequences were functionally annotated. Moreover, the obtained S. gregaria EST information is highly complementary to the existing orthopteran transcriptomic data. Since many novel transcripts encode neuronal signaling and signal transduction components, this paper includes an overview of these sequences. Furthermore, several transcripts being differentially represented in solitarious and gregarious locusts were retrieved from this EST database. The findings highlight the involvement of the CNS in the phase transition process and indicate that this novel annotated database may also add to the emerging knowledge of concomitant neuronal signaling and neuroplasticity events. In summary, we met the need for novel sequence data from desert locust CNS. To our knowledge, we hereby also present the first insect EST database that is derived from the complete CNS. The obtained S. gregaria EST data constitute an important new source of information that will be instrumental in further unraveling the molecular principles of phase polyphenism, in further establishing locusts as valuable research model organisms and in molecular evolutionary and comparative entomology.
SMARTIV: combined sequence and structure de-novo motif discovery for in-vivo RNA binding data.
Polishchuk, Maya; Paz, Inbal; Yakhini, Zohar; Mandel-Gutfreund, Yael
2018-05-25
Gene expression regulation is highly dependent on binding of RNA-binding proteins (RBPs) to their RNA targets. Growing evidence supports the notion that both RNA primary sequence and its local secondary structure play a role in specific Protein-RNA recognition and binding. Despite the great advance in high-throughput experimental methods for identifying sequence targets of RBPs, predicting the specific sequence and structure binding preferences of RBPs remains a major challenge. We present a novel webserver, SMARTIV, designed for discovering and visualizing combined RNA sequence and structure motifs from high-throughput RNA-binding data, generated from in-vivo experiments. The uniqueness of SMARTIV is that it predicts motifs from enriched k-mers that combine information from ranked RNA sequences and their predicted secondary structure, obtained using various folding methods. Consequently, SMARTIV generates Position Weight Matrices (PWMs) in a combined sequence and structure alphabet with assigned P-values. SMARTIV concisely represents the sequence and structure motif content as a single graphical logo, which is informative and easy for visual perception. SMARTIV was examined extensively on a variety of high-throughput binding experiments for RBPs from different families, generated from different technologies, showing consistent and accurate results. Finally, SMARTIV is a user-friendly webserver, highly efficient in run-time and freely accessible via http://smartiv.technion.ac.il/.
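The notion of k-mers over a combined sequence and structure alphabet can be sketched in a few lines of Python. The encoding used here (uppercase for unpaired bases, lowercase for paired bases, with pairing read from dot-bracket notation) is an assumption chosen for illustration and is not taken from the SMARTIV documentation. The function names are hypothetical.

```python
from collections import Counter

def combine(sequence, structure):
    """Merge an RNA sequence with its dot-bracket structure into one string.

    Assumed encoding: paired positions ('(' or ')') become lowercase,
    unpaired positions ('.') become uppercase.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return "".join(
        base.lower() if mark in "()" else base.upper()
        for base, mark in zip(sequence, structure)
    )

def kmer_counts(combined, k):
    """Count all overlapping k-mers in the combined string."""
    return Counter(combined[i:i + k] for i in range(len(combined) - k + 1))

if __name__ == "__main__":
    merged = combine("GGAUCC", "((..))")
    print(merged)
    print(kmer_counts(merged, 2).most_common(3))
```

Counting such k-mers separately in high-ranked and low-ranked sequences, and comparing the counts, is one simple way enrichment could be assessed before building position weight matrices.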
Deng, Yue; Bao, Feng; Yang, Yang; Ji, Xiangyang; Du, Mulong; Zhang, Zhengdong
2017-01-01
The automated transcript discovery and quantification of high-throughput RNA sequencing (RNA-seq) data are important tasks of next-generation sequencing (NGS) research. However, these tasks are challenging due to the uncertainties that arise in the inference of complete splicing isoform variants from partially observed short reads. Here, we address this problem by explicitly reducing the inherent uncertainties in a biological system caused by missing information. In our approach, the RNA-seq procedure for transforming transcripts into short reads is considered an information transmission process. Consequently, the data uncertainties are substantially reduced by exploiting the information transduction capacity of information theory. The experimental results obtained from the analyses of simulated datasets and RNA-seq datasets from cell lines and tissues demonstrate the advantages of our method over state-of-the-art competitors. Our algorithm is an open-source implementation of MaxInfo. PMID:28911101
Image denoising and deblurring using multispectral data
NASA Astrophysics Data System (ADS)
Semenishchev, E. A.; Voronin, V. V.; Marchuk, V. I.
2017-05-01
Decision-making systems are becoming widespread. These systems are based on the analysis of video sequences together with additional data, such as volume, change in size, the behavior of one or a group of objects, temperature gradient, the presence of local areas with strong differences, and others. Security and control systems are the main areas of application. Noise in the images strongly influences subsequent processing and decision making. This paper considers the problem of primary signal processing for image denoising and deblurring of multispectral data. The additional information from multispectral channels can improve the efficiency of object classification. In this paper we use a method of combining information about the objects obtained by cameras in different frequency bands. We apply a method based on simultaneous minimization of L2 and the first-order square difference of the sequence of estimates to denoise the image and restore blurred edges. Where information is lost, we apply an approach based on interpolation of data taken from the analysis of objects located in other areas and of information obtained from the multispectral camera. The effectiveness of the proposed approach is shown on a set of test images.
Ruibal, Monica P; Peakall, Rod; Foret, Sylvain; Linde, Celeste C
2014-06-01
To investigate fungal species identity and diversity in mycorrhizal fungi of order Sebacinales, we developed phylogenetic markers. These new markers will enable future studies investigating species delineation and phylogenetic relationships of the fungal symbionts and facilitate investigations into evolutionary interactions among Sebacina species and their orchid hosts. We generated partial genome sequences for a Sebacina symbiont originating from Caladenia huegelii with 454 genome sequencing and from three symbionts from Eriochilus dilatatus and one from E. pulchellus using Illumina sequencing. Six nuclear and two mitochondrial loci showed high variability (10-31% parsimony informative sites) for Sebacinales mycorrhizal fungi across four genera of Australian orchids (Caladenia, Eriochilus, Elythranthera, and Glossodia). We obtained highly informative DNA markers that will allow investigation of mycorrhizal diversity of Sebacinaceae fungi associated with terrestrial orchids in Australia and worldwide.
Optimum quantum receiver for detecting weak signals in PAM communication systems
NASA Astrophysics Data System (ADS)
Sharma, Navneet; Rawat, Tarun Kumar; Parthasarathy, Harish; Gautam, Kumar
2017-09-01
This paper deals with the modeling of an optimum quantum receiver for pulse amplitude modulator (PAM) communication systems. The information bearing sequence {I_k}_{k=0}^{N-1} is estimated using the maximum likelihood (ML) method. The ML method is based on quantum mechanical measurements of an observable X in the Hilbert space of the quantum system at discrete times, when the Hamiltonian of the system is perturbed by an operator obtained by modulating a potential V with a PAM signal derived from the information bearing sequence {I_k}_{k=0}^{N-1}. The measurement process at each time instant causes collapse of the system state to an observable eigenstate. All probabilities of getting different outcomes from an observable are calculated using the perturbed evolution operator combined with the collapse postulate. For given probability densities, calculation of the mean square error evaluates the performance of the receiver. Finally, we present an example involving estimating an information bearing sequence that modulates a quantum electromagnetic field incident on a quantum harmonic oscillator.
Der Sarkissian, Clio; Allentoft, Morten E.; Ávila-Arcos, María C.; Barnett, Ross; Campos, Paula F.; Cappellini, Enrico; Ermini, Luca; Fernández, Ruth; da Fonseca, Rute; Ginolhac, Aurélien; Hansen, Anders J.; Jónsson, Hákon; Korneliussen, Thorfinn; Margaryan, Ashot; Martin, Michael D.; Moreno-Mayar, J. Víctor; Raghavan, Maanasa; Rasmussen, Morten; Velasco, Marcela Sandoval; Schroeder, Hannes; Schubert, Mikkel; Seguin-Orlando, Andaine; Wales, Nathan; Gilbert, M. Thomas P.; Willerslev, Eske; Orlando, Ludovic
2015-01-01
The past decade has witnessed a revolution in ancient DNA (aDNA) research. Although the field's focus was previously limited to mitochondrial DNA and a few nuclear markers, whole genome sequences from the deep past can now be retrieved. This breakthrough is tightly connected to the massive sequence throughput of next generation sequencing platforms and the ability to target short and degraded DNA molecules. Many ancient specimens previously unsuitable for DNA analyses because of extensive degradation can now successfully be used as source materials. Additionally, the analytical power obtained by increasing the number of sequence reads to billions effectively means that contamination issues that have haunted aDNA research for decades, particularly in human studies, can now be efficiently and confidently quantified. At present, whole genomes have been sequenced from ancient anatomically modern humans, archaic hominins, ancient pathogens and megafaunal species. Those have revealed important functional and phenotypic information, as well as unexpected adaptation, migration and admixture patterns. As such, the field of aDNA has entered the new era of genomics and has provided valuable information when testing specific hypotheses related to the past. PMID:25487338
Wheeler, David
2007-01-01
GenBank(R) is a comprehensive database of publicly available DNA sequences for more than 205,000 named organisms and for more than 60,000 within the embryophyta, obtained through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Daily data exchange with the European Molecular Biology Laboratory (EMBL) in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases with taxonomy, genome, mapping, protein structure, and domain information and the biomedical journal literature through PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available through FTP. GenBank usage scenarios ranging from local analyses of the data available through FTP to online analyses supported by the NCBI Web-based tools are discussed. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
De novo sequencing and analysis of the transcriptome of Panax ginseng in the leaf-expansion period.
Liu, Shichao; Wang, Siming; Liu, Meichen; Yang, Fei; Zhang, Hui; Liu, Shiyang; Wang, Qun; Zhao, Yu
2016-08-01
Panax ginseng, a traditional Chinese medicine, is used worldwide for its variety of health benefits and its treatment efficacy. However, it is difficult to cultivate due to its vulnerability to environmental stresses. The present study provided the first report, to the best of our knowledge, of transcriptome analysis of ginseng at the leaf‑expansion stage. Using the Illumina sequencing platform, >40,000,000 high‑quality paired‑end reads were obtained and assembled into 100,533 unique sequences. When the sequences were searched against the publicly available National Center for Biotechnology Information protein database using The Basic Local Alignment Search Tool, 61,599 sequences exhibited similarity to known proteins. Functional annotation and classification, including use of the Gene Ontology, Clusters of Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes databases, revealed that the activated genes in ginseng were predominantly ribonuclease‑like storage genes, environmental stress genes, pathogenesis-related genes and other antioxidant genes. A number of candidate genes in environmental stress‑associated pathways were also identified. These novel data provide useful information on the growth and development stages of ginseng, and serve as an important public information platform for further understanding of the molecular mechanisms and functional genomics of ginseng.
Saito, T; Ochiai, H
1999-10-01
cDNA fragments putatively encoding amino acid sequences characteristic of the fatty acid desaturase were obtained using expressed sequence tag (EST) information of the Dictyostelium cDNA project. Using this sequence, we have determined the cDNA sequence and genomic sequence of a desaturase. The cloned cDNA is 1489 nucleotides long and the deduced amino acid sequence comprises 464 amino acid residues containing an N-terminal cytochrome b5 domain. The whole sequence was 38.6% identical to the initially identified Delta5-desaturase of Mortierella alpina. We have confirmed its function as Delta5-desaturase by an overexpression mutation in D. discoideum and by a gain-of-function mutation in the yeast Saccharomyces cerevisiae. Analysis of the lipids from transformed D. discoideum and yeast demonstrated the accumulation of Delta5-desaturated products. This is the first report concerning fatty acid desaturase in cellular slime molds.
Differential evolution-simulated annealing for multiple sequence alignment
NASA Astrophysics Data System (ADS)
Addawe, R. C.; Addawe, J. M.; Sueño, M. R. K.; Magadia, J. C.
2017-10-01
Multiple sequence alignments (MSA) are used in the analysis of molecular evolution and sequence structure relationships. In this paper, a hybrid algorithm, Differential Evolution - Simulated Annealing (DESA) is applied in optimizing multiple sequence alignments (MSAs) based on structural information, non-gaps percentage and totally conserved columns. DESA is a robust algorithm characterized by self-organization, mutation, crossover, and SA-like selection scheme of the strategy parameters. Here, the MSA problem is treated as a multi-objective optimization problem of the hybrid evolutionary algorithm, DESA. Thus, we name the algorithm as DESA-MSA. Simulated sequences and alignments were generated to evaluate the accuracy and efficiency of DESA-MSA using different indel sizes, sequence lengths, deletion rates and insertion rates. The proposed hybrid algorithm obtained acceptable solutions particularly for the MSA problem evaluated based on the three objectives.
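Two of the three objectives named above can be illustrated on a toy alignment. The sketch below assumes that the non-gaps percentage is the share of alignment cells that are not the gap character "-", and that a totally conserved column is one where every sequence carries the same residue and no gap. These definitions and the function names are assumptions for illustration; the abstract does not spell them out, and the structural objective is not covered here.

```python
def non_gap_percentage(alignment):
    """Percentage of cells in the alignment that are not gaps ('-')."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return 100.0 * (total - gaps) / total


def totally_conserved_columns(alignment):
    """Number of columns where all rows share one residue and none is a gap."""
    count = 0
    for column in zip(*alignment):
        if "-" not in column and len(set(column)) == 1:
            count += 1
    return count


if __name__ == "__main__":
    toy = ["AC-GT", "ACTGT", "AC-GA"]  # hypothetical aligned sequences
    print(non_gap_percentage(toy), totally_conserved_columns(toy))
```

A multi-objective optimizer would score each candidate alignment on such measures at the same time rather than collapsing them into a single number.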
Geographically widespread swordfish barcode stock identification: a case study of its application.
Pappalardo, Anna Maria; Guarino, Francesca; Reina, Simona; Messina, Angela; De Pinto, Vito
2011-01-01
The swordfish (Xiphias gladius) is a cosmopolitan large pelagic fish inhabiting temperate and tropical waters and is a target species for fisheries all around the world. The present study investigated the ability of COI barcoding to reliably identify swordfish and particularly specific stocks of this commercially important species. We applied the classical DNA barcoding technology to a 682 bp segment of COI and compared swordfish sequences from different geographical sources (Atlantic and Indian Oceans and Mediterranean Sea). The sequences of the 5' hyper-variable fragment of the control region (5'dloop) were also used to validate the efficacy of COI as a stock-specific marker. This information was successfully applied to the discrimination of unknown samples from the market, detecting in some cases mislabeled seafood products. The NJ distance-based phenogram (K2P model) obtained with COI sequences allowed us to correlate the swordfish haplotypes with the different geographical stocks. Similar results were obtained with 5'dloop. Our preliminary data in swordfish Xiphias gladius confirm that Cytochrome Oxidase I can be proposed as an efficient species-specific marker that also has the potential to assign geographical provenance. This information might speed up sample analysis in commercial applications of barcoding.
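For readers unfamiliar with the K2P model mentioned above, the Kimura two-parameter distance between two aligned sequences is d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P is the proportion of transitions and Q the proportion of transversions. The sketch below is illustrative only: the function name is hypothetical, and skipping sites with gaps or ambiguous bases is an assumption, not something stated in the abstract.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such pairwise distances is the usual input to neighbor-joining tree construction.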
HUNT: launch of a full-length cDNA database from the Helix Research Institute.
Yudate, H T; Suwa, M; Irie, R; Matsui, H; Nishikawa, T; Nakamura, Y; Yamaguchi, D; Peng, Z Z; Yamamoto, T; Nagai, K; Hayashi, K; Otsuki, T; Sugiyama, T; Ota, T; Suzuki, Y; Sugano, S; Isogai, T; Masuho, Y
2001-01-01
The Helix Research Institute (HRI) in Japan is releasing 4356 HUman Novel Transcripts and related information in the newly established HUNT database. The institute is a joint research project principally funded by the Japanese Ministry of International Trade and Industry, and the clones were sequenced in the governmental New Energy and Industrial Technology Development Organization (NEDO) Human cDNA Sequencing Project. The HUNT database contains an extensive amount of annotation from advanced analysis and represents an essential bioinformatics contribution towards understanding of the gene function. The HRI human cDNA clones were obtained from full-length enriched cDNA libraries constructed with the oligo-capping method and have resulted in novel full-length cDNA sequences. A large fraction has little similarity to any proteins of known function and to obtain clues about possible function we have developed original analysis procedures. Any putative function deduced here can be validated or refuted by complementary analysis results. The user can also extract information from specific categories like PROSITE patterns, PFAM domains, PSORT localization, transmembrane helices and clones with GENIUS structure assignments. The HUNT database can be accessed at http://www.hri.co.jp/HUNT.
Archean metamorphic sequence and surfaces, Kangerdlugssuaq Fjord, East Greenland
NASA Technical Reports Server (NTRS)
Kays, M. A.
1986-01-01
The characteristics of Archean metamorphic surfaces and fabrics of a mapped sequence of rocks older than about 3000 Ma provide information basic to an understanding of the structural evolution and metamorphic history in Kangerdlugssuaq Fjord, east Greenland. This information and the additional results of petrologic and geochemical studies have culminated in an extended chronology of Archean plutonic, metamorphic, and tectonic events. The basis for the chronology is considered, especially the nature of the metamorphic fabrics and surfaces in the Archean sequence. The surfaces, which are planar mineral parageneses, may prove to be mappable outside Kangerdlugssuaq Fjord, and if so, will be helpful in extending the events that they represent to other Archean sequences in east Greenland. The surfaces will become especially important reference planes if the absolute ages of their metamorphic assemblages can be determined in at least one location where strain was low subsequent to their recrystallization. Once an isochron is obtained, the dynamothermal age of the regionally identifiable metamorphic surface is determined everywhere it can be mapped.
NASA Technical Reports Server (NTRS)
Springer, E.; Sachs, M. S.; Woese, C. R.; Boone, D. R.
1995-01-01
Representatives of the family Methanosarcinaceae were analyzed phylogenetically by comparing partial sequences of their methyl-coenzyme M reductase (mcrI) genes. A 490-bp fragment from the A subunit of the gene was selected, amplified by the PCR, cloned, and sequenced for each of 25 strains belonging to the Methanosarcinaceae. The sequences obtained were aligned with the corresponding portions of five previously published sequences, and all of the sequences were compared to determine phylogenetic distances by Fitch distance matrix methods. We prepared analogous trees based on 16S rRNA sequences; these trees corresponded closely to the mcrI trees, although the mcrI sequences of pairs of organisms had 3.01 +/- 0.541 times more changes than the respective pairs of 16S rRNA sequences, suggesting that the mcrI fragment evolved about three times more rapidly than the 16S rRNA gene. The qualitative similarity of the mcrI and 16S rRNA trees suggests that transfer of genetic information between dissimilar organisms has not significantly affected these sequences, although we found inconsistencies between some mcrI distances that we measured and previously published DNA reassociation data. It is unlikely that multiple mcrI isogenes were present in the organisms that we examined, because we found no major discrepancies in multiple determinations of mcrI sequences from the same organism. Our primers for the PCR also match analogous sites in the previously published mcrII sequences, but all of the sequences that we obtained from members of the Methanosarcinaceae were more closely related to mcrI sequences than to mcrII sequences, suggesting that members of the Methanosarcinaceae do not have distinct mcrII genes.
When are pathogen genome sequences informative of transmission events?
Ferguson, Neil; Jombart, Thibaut
2018-01-01
Recent years have seen the development of numerous methodologies for reconstructing transmission trees in infectious disease outbreaks from densely sampled whole genome sequence data. However, a fundamental and as of yet poorly addressed limitation of such approaches is the requirement for genetic diversity to arise on epidemiological timescales. Specifically, the position of infected individuals in a transmission tree can only be resolved by genetic data if mutations have accumulated between the sampled pathogen genomes. To quantify and compare the useful genetic diversity expected from genetic data in different pathogen outbreaks, we introduce here the concept of ‘transmission divergence’, defined as the number of mutations separating whole genome sequences sampled from transmission pairs. Using parameter values obtained by literature review, we simulate outbreak scenarios alongside sequence evolution using two models described in the literature to describe transmission divergence of ten major outbreak-causing pathogens. We find that while mean values vary significantly between the pathogens considered, their transmission divergence is generally very low, with many outbreaks characterised by large numbers of genetically identical transmission pairs. We describe the impact of transmission divergence on our ability to reconstruct outbreaks using two outbreak reconstruction tools, the R packages outbreaker and phybreak, and demonstrate that, in agreement with previous observations, genetic sequence data of rapidly evolving pathogens such as RNA viruses can provide valuable information on individual transmission events. Conversely, sequence data of pathogens with lower mean transmission divergence, including Streptococcus pneumoniae, Shigella sonnei and Clostridium difficile, provide little to no information about individual transmission events. Our results highlight the informational limitations of genetic sequence data in certain outbreak scenarios, and demonstrate the need to expand the toolkit of outbreak reconstruction tools to integrate other types of epidemiological data. PMID:29420641
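To make the notion of transmission divergence concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the number of mutations separating a transmission pair is Poisson distributed with mean equal to substitution rate times genome length times the time separating the two samples. This is not claimed to be either of the two models used in the study, and the parameter values are hypothetical rather than taken from the literature review.

```python
import math


def expected_divergence(subs_per_site_per_day, genome_length, days_between_samples):
    """Mean number of mutations separating a transmission pair (assumed model)."""
    return subs_per_site_per_day * genome_length * days_between_samples


def prob_identical_pair(mean_divergence):
    """Poisson probability that a pair shows zero mutations."""
    return math.exp(-mean_divergence)


def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k mutations."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


if __name__ == "__main__":
    # Hypothetical fast-evolving small genome vs. slow-evolving large genome.
    fast = expected_divergence(1e-5, 10_000, 10)
    slow = expected_divergence(1e-8, 2_000_000, 10)
    for name, mean in (("fast", fast), ("slow", slow)):
        print(name, mean, prob_identical_pair(mean))
```

Under this simplified model, a low mean divergence implies that most transmission pairs are genetically identical, which is the situation in which sequence data alone cannot resolve who infected whom.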
Jia, Ying; Cantu, Bruno A; Sánchez, Elda E; Pérez, John C
2008-06-15
To advance our knowledge of snake venom composition and of the transcripts expressed in the venom gland at the molecular level, we constructed a cDNA library from the venom gland of Agkistrodon piscivorus leucostoma to generate an expressed sequence tag (EST) database. From 2112 randomly sequenced independent clones, we obtained ESTs for 1309 (62%) cDNAs, which showed significant deduced amino acid sequence similarity (scores >80) to previously characterized proteins in the National Center for Biotechnology Information (NCBI) database. Ribosomal proteins make up 47 clones (2%), and the remaining 756 (36%) cDNAs either are of unknown identity or show BLASTX sequence identity scores of <80 with known GenBank accessions. The most highly expressed gene, encoding phospholipase A(2) (PLA(2)) and accounting for 35% of A. p. leucostoma venom gland cDNAs, was identified and further confirmed by sodium dodecyl sulfate/polyacrylamide gel electrophoresis (SDS-PAGE) of crude venom and by protein sequencing. A total of 180 representative genes were obtained from the sequence assemblies and deposited in the EST database. Clones showing sequence identity to disintegrins, thrombin-like enzymes, hemorrhagic toxins, fibrinogen clotting inhibitors and plasminogen activators were also identified in our EST database. These data can be used to develop a research program that will help us identify genes encoding proteins that are of medical importance or proteins involved in the mechanisms of the toxin venom.
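The clone counts reported above can be checked with a few lines of arithmetic. The category labels in this sketch are shorthand introduced here, not terms from the study.

```python
total_clones = 2112

# Clone counts as reported in the abstract; keys are informal labels.
clone_counts = {
    "similar_to_known_proteins": 1309,
    "ribosomal_proteins": 47,
    "unknown_or_low_score": 756,
}


def rounded_percentages(counts, total):
    """Return each category's share of the total, rounded to whole percent."""
    return {name: round(100 * n / total) for name, n in counts.items()}


percentages = rounded_percentages(clone_counts, total_clones)
```

The three categories add up to all 2112 clones, and the rounded shares match the 62%, 2% and 36% quoted in the text.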
Froenicke, Lutz; Lavelle, Dean; Martineau, Belinda; Perroud, Bertrand; Michelmore, Richard
2013-01-01
Several applications of high throughput genome and transcriptome sequencing would benefit from a reduction of the high-copy-number sequences in the libraries being sequenced and analyzed, particularly when applied to species with large genomes. We adapted and analyzed the consequences of a method that utilizes a thermostable duplex-specific nuclease for reducing the high-copy components in transcriptomic and genomic libraries prior to sequencing. This reduces the time, cost, and computational effort of obtaining informative transcriptomic and genomic sequence data for both fully sequenced and non-sequenced genomes. It also reduces contamination from organellar DNA in preparations of nuclear DNA. Hybridization in the presence of 3 M tetramethylammonium chloride (TMAC), which equalizes the rates of hybridization of GC and AT nucleotide pairs, reduced the bias against sequences with high GC content. Consequences of this method on the reduction of high-copy and enrichment of low-copy sequences are reported for Arabidopsis and lettuce. PMID:23409088
Matvienko, Marta; Kozik, Alexander; Froenicke, Lutz; Lavelle, Dean; Martineau, Belinda; Perroud, Bertrand; Michelmore, Richard
2013-01-01
Several applications of high throughput genome and transcriptome sequencing would benefit from a reduction of the high-copy-number sequences in the libraries being sequenced and analyzed, particularly when applied to species with large genomes. We adapted and analyzed the consequences of a method that utilizes a thermostable duplex-specific nuclease for reducing the high-copy components in transcriptomic and genomic libraries prior to sequencing. This reduces the time, cost, and computational effort of obtaining informative transcriptomic and genomic sequence data for both fully sequenced and non-sequenced genomes. It also reduces contamination from organellar DNA in preparations of nuclear DNA. Hybridization in the presence of 3 M tetramethylammonium chloride (TMAC), which equalizes the rates of hybridization of GC and AT nucleotide pairs, reduced the bias against sequences with high GC content. Consequences of this method on the reduction of high-copy and enrichment of low-copy sequences are reported for Arabidopsis and lettuce.
Shortt, Jonathan A; Card, Daren C; Schield, Drew R; Liu, Yang; Zhong, Bo; Castoe, Todd A; Carlton, Elizabeth J; Pollock, David D
2017-01-01
In areas where schistosomiasis control programs have been implemented, morbidity and prevalence have been greatly reduced. However, to sustain these reductions and move towards interruption of transmission, new tools for disease surveillance are needed. Genomic methods have the potential to help trace the sources of new infections, and allow us to monitor drug resistance. Large-scale genotyping efforts for schistosome species have been hindered by cost, limited numbers of established target loci, and the small amount of DNA obtained from miracidia, the life stage most readily acquired from humans. Here, we present a method using next generation sequencing to provide high-resolution genomic data from S. japonicum for population-based studies. We applied whole genome amplification followed by double digest restriction site associated DNA sequencing (ddRADseq) to individual S. japonicum miracidia preserved on Whatman FTA cards. We found that we could effectively and consistently survey hundreds of thousands of variants from 10,000 to 30,000 loci from archived miracidia as old as six years. An analysis of variation from eight miracidia obtained from three hosts in two villages in Sichuan showed clear population structuring by village and host even within this limited sample. This high-resolution sequencing approach yields three orders of magnitude more information than microsatellite genotyping methods that have been employed over the last decade, creating the potential to answer detailed questions about the sources of human infections and to monitor drug resistance. Costs per sample range from $50-$200, depending on the amount of sequence information desired, and we expect these costs can be reduced further given continued reductions in sequencing costs, improvement of protocols, and parallelization. This approach provides new promise for using modern genome-scale sampling for S. japonicum surveillance, and could be applied to other schistosome species and other parasitic helminths.
Strope, Pooja K; Chaverri, Priscila; Gazis, Romina; Ciufo, Stacy; Domrachev, Michael; Schoch, Conrad L
2017-01-01
The ITS (nuclear ribosomal internal transcribed spacer) RefSeq database at the National Center for Biotechnology Information (NCBI) is dedicated to the clear association between name, specimen and sequence data. This database is focused on sequences obtained from type material stored in public collections. While the initial ITS sequence curation effort together with numerous fungal taxonomy experts attempted to cover as many orders as possible, we extended our latest focus to the family and genus ranks. We focused on Trichoderma for several reasons, mainly because the asexual and sexual synonyms were well documented, and a list of proposed names and type material was recently published. In this case study the recent taxonomic information was applied to do a complete taxonomic audit for the genus Trichoderma in the NCBI Taxonomy database. A name status report is available here: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi. As a result, the ITS RefSeq Targeted Loci database at NCBI has been augmented with more sequences from type and verified material from Trichoderma species. Additionally, to aid in the cross referencing of data from single loci and genomes we have collected a list of quality records of the RPB2 gene obtained from type material in GenBank that could help validate future submissions. During the process of curation misidentified genomes were discovered, and sequence records from type material were found hidden under previous classifications. Source metadata curation, although more cumbersome, proved to be useful as confirmation of the type material designation. Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353 PMID:29220466
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
NASA Astrophysics Data System (ADS)
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carriers of genetic information in living organisms. Encoding, sequencing, and clustering DNA sequences have become key routine tasks in molecular biology, particularly in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, K-Means (partitioning clustering) and DIANA (hierarchical clustering), into what we call hybrid clustering. Hybrid clustering using the parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts with collecting HPV DNA sequences from NCBI (National Centre for Biotechnology Information), followed by feature extraction from the DNA sequences. The extracted features are stored in matrix form, the matrix is normalized using Min-Max normalization, and the genetic distance is calculated using the Euclidean distance. The hybrid clustering is then applied using implementations of the parallel K-Means algorithm and the DIANA algorithm. The aim of using hybrid clustering is to obtain better clustering results. To validate the resulting clusters and find the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, parallel K-Means clustering grouped the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering grouped the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The DBI values of hybrid clustering are lower than the DBI value obtained when only parallel K-Means clustering is performed at the first stage. This means that hybrid clustering clusters HPV DNA sequences better than parallel K-Means clustering alone.
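The preprocessing and validation steps named above (Min-Max normalization, Euclidean distance, Davies-Bouldin Index) can be sketched as follows. This is a plain-Python illustration using the standard definitions, with hypothetical function names and toy data. It is not the authors' parallel implementation, and the feature extraction and the K-Means and DIANA steps themselves are not shown.

```python
import math


def min_max_normalize(matrix):
    """Rescale each column of a feature matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin_index(points, labels):
    """Davies-Bouldin Index of a labelled clustering (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        label: sum(euclidean(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)
```

Compact, well-separated clusters give a small index, which is why a lower value is read as a better clustering. The sketch does not handle the degenerate case of two clusters sharing the same centroid.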
LLNL Genomic Assessment: Viral and Bacterial Sequencing Needs for TMTI, Tier 1 Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Slezak, T; Borucki, M; Lenhoff, R
2009-09-29
The Lawrence Livermore National Lab Bioinformatics group has recently taken on a role in DTRA's Transformation Medical Technologies Initiative (TMTI). The high-level goal of TMTI is to accelerate the development of broad-spectrum countermeasures. To achieve those goals, TMTI has a near term need to obtain more sequence information across a large range of pathogens, near neighbors, and across a broad geographical and host range. Our role in this project is to research available sequence data for the organisms of interest and identify critical microbial sequence and knowledge gaps that need to be filled to meet TMTI objectives. This effort includes: (1) assessing current genomic sequence for each agent including phylogenetic and geographical diversity, host range, date of isolation range, virulence, sequence availability of key near neighbors, and other characteristics; (2) identifying Subject Matter Experts (SMEs) and potential holders of isolate collections, contacting appropriate SMEs with known expertise and isolate collections to obtain information on isolate availability and specific recommendations; (3) identifying sequence as well as knowledge gaps (e.g., virulence, host range, and antibiotic resistance determinants); (4) providing specific recommendations as to the most valuable strains to be placed on the DTRA sequencing queue. We acknowledge that criteria for prioritization of isolates for sequencing fall into two categories aligning with priority queues 1 and 2 as described in the summary. (Priority queue 0 relates to DTRA operational isolates whose availability is not predictable in advance.) 1. Selection of isolates that appear to have likelihood to provide information on virulence and antibiotic resistance. This will include sequence of known virulent strains.
Particularly valuable would be virulent strains that have genetically similar yet avirulent, or non human transmissible, counterparts that can be used for comparison to help identify key virulence or host range genes. This approach will provide information that can be used by structural biologists to help develop therapeutics and vaccines. We have pointed out such high priority strains of which we are aware, and note that if any such isolates should be discovered, they will rise to the top priority. We anticipate difficulty locating samples with unusual resistance phenotypes, in particular. Sequencing strategies for isolates in queue 1 should aim for as complete finishing status as possible, since high-quality initial annotation (gene-calling) will be necessary for the follow-on protein structure analyses contributing to countermeasure development. Queue 2 for sequencing determination will be more dynamic than queue 1, and samples will be added to it as they become available to the TMTI program. 2. Selection of isolates that will provide broader information about diversity and phylogenetics and aid in specific detection as well as forensics. This approach focuses on sequencing of isolates that will provide better resolution of variants that are (or were) circulating in nature. The finishing strategy for queue 2 does not require complete closing with annotation. This queue is more static, as there is considerable phylogenetic data, and in this report we have sought to reveal gaps and make suggestions to fill them given existing sequence data and strain information. In this report we identify current sequencing gaps in both priority queue categories. Note that this is most applicable to the bacterial pathogens, as most viruses are by default in queue 1. The Phase I focus of this project is on viral hemorrhagic fever viruses and Category A bacterial agents as defined to us by TMTI. 
We have carried out individual analyses on each species of interest, and these are included as chapters in this report. Viruses and bacteria are biologically very distinct from each other and require different methods of analysis and criteria for sequencing prioritization. Therefore, we will describe our methods, analyses and conclusions separately for each category.
Ortuño, Francisco M; Valenzuela, Olga; Rojas, Fernando; Pomares, Hector; Florido, Javier P; Urquiza, Jose M; Rojas, Ignacio
2013-09-01
Multiple sequence alignments (MSAs) are widely used in bioinformatics as a basis for other tasks such as structure prediction, biological function analysis or phylogenetic modeling. However, current tools usually provide only partially optimal alignments, as each one focuses on specific biological features. Thus, the same set of sequences can produce different alignments, above all when the sequences are less similar. Consequently, researchers and biologists do not agree on the most suitable way to evaluate MSAs. Recent evaluations tend to use more complex scores that include further biological features. Among them, 3D structures are increasingly being used to evaluate alignments. Because structures are more conserved in proteins than sequences, scores with structural information are better suited to evaluating more distant relationships between sequences. The proposed multiobjective algorithm, based on the non-dominated sorting genetic algorithm, aims to jointly optimize three objectives: STRIKE score, non-gaps percentage and totally conserved columns. It was assessed on the BAliBASE benchmark, with statistically significant results according to the Kruskal-Wallis test (P < 0.01). This algorithm also outperforms other aligners, such as ClustalW, Multiple Sequence Alignment Genetic Algorithm (MSA-GA), PRRP, DIALIGN, Hidden Markov Model Training (HMMT), Pattern-Induced Multi-sequence Alignment (PIMA), MULTIALIGN, Sequence Alignment Genetic Algorithm (SAGA), PILEUP, Rubber Band Technique Genetic Algorithm (RBT-GA) and Vertical Decomposition Genetic Algorithm (VDGA), according to the Wilcoxon signed-rank test (P < 0.05), whereas its results are not significantly different from those of 3D-COFFEE (P > 0.05), with the advantage of being able to use fewer structures. Structural information is included within the objective function to evaluate the obtained alignments more accurately. The source code is available at http://www.ugr.es/~fortuno/MOSAStrE/MO-SAStrE.zip.
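Two of the three objectives named in the abstract above, non-gaps percentage and totally conserved columns, can be illustrated with a minimal sketch. The function names, the use of '-' as the gap character and the toy alignment are assumptions made for illustration; they are not taken from the MO-SAStrE source code, and the STRIKE score is omitted because it requires 3D structure data.

```python
def non_gap_percentage(alignment):
    """Percentage of alignment cells that hold a residue rather than a gap ('-')."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return 100.0 * (total - gaps) / total


def totally_conserved_columns(alignment):
    """Count columns in which every sequence carries the same residue and no gap."""
    count = 0
    for column in zip(*alignment):  # iterate over alignment columns
        if "-" not in column and len(set(column)) == 1:
            count += 1
    return count


# Hypothetical toy alignment: three aligned sequences of equal length.
toy_alignment = [
    "MKV-LA",
    "MKVELA",
    "MRV-LA",
]

if __name__ == "__main__":
    print(non_gap_percentage(toy_alignment))
    print(totally_conserved_columns(toy_alignment))
```

In a multiobjective setting such as the one described, values like these would be kept as separate objectives rather than collapsed into a single weighted score.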
Hong, Soon Gyu; Cramer, Robert A; Lawrence, Christopher B; Pryor, Barry M
2005-02-01
A gene for the Alternaria major allergen, Alt a 1, was amplified from 52 species of Alternaria and related genera, and sequence information was used for phylogenetic study. Alt a 1 gene sequences evolved 3.8 times faster and contained 3.5 times more parsimony-informative sites than glyceraldehyde-3-phosphate dehydrogenase (gpd) sequences. Analyses of Alt a 1 gene and gpd exon sequences strongly supported grouping of Alternaria spp. and related taxa into several species-groups described in previous studies, especially the infectoria, alternata, porri, brassicicola, and radicina species-groups and the Embellisia group. The sonchi species-group was newly suggested in this study. Monophyly of the Nimbya group was moderately supported, and monophyly of the Ulocladium group was weakly supported. Relationships among species-groups and among closely related species of the same species-group were not fully resolved. However, higher resolution could be obtained using Alt a 1 sequences or a combined dataset than using gpd sequences alone. Despite high levels of variation in amino acid sequences, results of in silico prediction of protein secondary structure for Alt a 1 demonstrated a high degree of structural similarity for most of the species suggesting a conservation of function.
Optical Processing Techniques For Pseudorandom Sequence Prediction
NASA Astrophysics Data System (ADS)
Gustafson, Steven C.
1983-11-01
Pseudorandom sequences are series of apparently random numbers generated, for example, by linear or nonlinear feedback shift registers. An important application of these sequences is in spread spectrum communication systems, in which, for example, the transmitted carrier phase is digitally modulated rapidly and pseudorandomly and in which the information to be transmitted is incorporated as a slow modulation in the pseudorandom sequence. In this case the transmitted information can be extracted only by a receiver that uses for demodulation the same pseudorandom sequence used by the transmitter, and thus this type of communication system has a very high immunity to third-party interference. However, if a third party can predict in real time the probable future course of the transmitted pseudorandom sequence given past samples of this sequence, then interference immunity can be significantly reduced. In this application effective pseudorandom sequence prediction techniques should be (1) applicable in real time to rapid (e.g., megahertz) sequence generation rates, (2) applicable to both linear and nonlinear pseudorandom sequence generation processes, and (3) applicable to error-prone past sequence samples of limited number and continuity. Certain optical processing techniques that may meet these requirements are discussed in this paper. In particular, techniques based on incoherent optical processors that perform general linear transforms or (more specifically) matrix-vector multiplications are considered. Computer simulation examples are presented which indicate that significant prediction accuracy can be obtained using these transforms for simple pseudorandom sequences. However, the useful prediction of more complex pseudorandom sequences will probably require the application of more sophisticated optical processing techniques.
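As a rough illustration of why a simple linear feedback shift register (LFSR) sequence is predictable by a matrix-vector multiplication, the sketch below generates bits from a 4-bit register and advances a window of past samples with a companion matrix over GF(2). The register length, the taps (recurrence s[k+4] = s[k] XOR s[k+1]) and all function names are assumptions chosen for illustration. The sketch also assumes the feedback taps are known and the samples are error-free, which is a much easier setting than the one the paper addresses.

```python
def lfsr_bits(state, taps, n):
    """Generate n output bits from a Fibonacci LFSR.

    state: list of bits, oldest first; taps: indices of state bits XORed
    together to form the new bit appended at the end.
    """
    state = list(state)
    out = []
    for _ in range(n):
        out.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]
    return out


def predict_next_state(matrix, window):
    """Advance a window of past samples by one step: matrix-vector product mod 2."""
    return [sum(m * s for m, s in zip(row, window)) % 2 for row in matrix]


# Companion matrix for the assumed recurrence s[k+4] = s[k] XOR s[k+1].
COMPANION = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

if __name__ == "__main__":
    bits = lfsr_bits([1, 0, 0, 0], taps=[0, 1], n=30)
    window = bits[5:9]                      # four consecutive past samples
    print(predict_next_state(COMPANION, window), bits[6:10])
```

For this particular recurrence the output repeats every 15 bits, the maximum for a 4-bit register, so four consecutive clean samples determine the entire future sequence.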
Evaluation of MR imaging with T1 and T2* mapping for the determination of hepatic iron overload.
Henninger, B; Kremser, C; Rauch, S; Eder, R; Zoller, H; Finkenstedt, A; Michaely, H J; Schocke, M
2012-11-01
To evaluate MRI using T1 and T2* mapping sequences in patients with suspected hepatic iron overload (HIO). Twenty-five consecutive patients with clinically suspected HIO were retrospectively studied. All underwent MRI and liver biopsy. For the quantification of liver T2* values we used a fat-saturated multi-echo gradient echo sequence with 12 echoes (TR = 200 ms, TE = 0.99 ms + n × 1.41 ms, flip angle 20°). T1 values were obtained using a fast T1 mapping sequence based on an inversion recovery snapshot FLASH sequence. Parameter maps were analysed using regions of interest. ROC analysis calculated cut-off points at 10.07 ms and 15.47 ms for T2* in the determination of HIO with accuracy 88 %/88 %, sensitivity 84 %/89.5 % and specificity 100 %/83 %. MRI correctly classified 20 patients (80 %). All patients with HIO only had decreased T1 and T2* relaxation times. There was a significant difference in T1 between patients with HIO only and patients with HIO and steatohepatitis (P = 0.018). MRI-based T2* relaxation diagnoses HIO very accurately, even at low iron concentrations. Important additional information may be obtained by the combination of T1 and T2* mapping. It is a rapid, non-invasive, accurate and reproducible technique for validating the evidence of even low hepatic iron concentrations. • Hepatic iron overload causes fibrosis, cirrhosis and increases hepatocellular carcinoma risk. • MRI detects iron because of the field heterogeneity generated by haemosiderin. • T2* relaxation is very accurate in diagnosing hepatic iron overload. • Additional information may be obtained by T1 and T2* mapping.
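The echo-time formula quoted above (TE = 0.99 ms + n × 1.41 ms, 12 echoes) can be combined with a monoexponential decay model to estimate T2* from a region-of-interest signal. The sketch below is only an illustration: it assumes n runs from 0 to 11, assumes the model S(TE) = S0 · exp(-TE / T2*), and uses a log-linear least-squares fit. The abstract does not state which fitting procedure the authors used, and the function names are hypothetical.

```python
import math


def echo_times_ms(n_echoes=12, first_te=0.99, spacing=1.41):
    """Echo times in ms, assuming the echo index n starts at 0."""
    return [first_te + n * spacing for n in range(n_echoes)]


def fit_t2star_ms(tes, signals):
    """Estimate T2* (ms) by fitting a straight line to log(signal) versus TE.

    Assumes S(TE) = S0 * exp(-TE / T2*), so the slope of the line is -1 / T2*.
    """
    ys = [math.log(s) for s in signals]
    n = len(tes)
    mean_x = sum(tes) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in tes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tes, ys))
    slope = sxy / sxx
    return -1.0 / slope


if __name__ == "__main__":
    tes = echo_times_ms()
    # Synthetic, noise-free signal with a chosen T2* of 8 ms (illustrative value).
    signals = [100.0 * math.exp(-te / 8.0) for te in tes]
    print(fit_t2star_ms(tes, signals))
```

With noisy data or very short T2* values, a log-linear fit can be biased by the noise floor, so a nonlinear fit or truncation of late echoes may be preferable in practice.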
Improving protein complex classification accuracy using amino acid composition profile.
Huang, Chien-Hung; Chou, Szu-Yu; Ng, Ka-Lok
2013-09-01
Protein complex prediction approaches are based on the assumptions that complexes have dense protein-protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain. Copyright © 2013 Elsevier Ltd. All rights reserved.
A Workshop Report on Wheat Genome Sequencing
Gill, Bikram S.; Appels, Rudi; Botha-Oberholster, Anna-Maria; Buell, C. Robin; Bennetzen, Jeffrey L.; Chalhoub, Boulos; Chumley, Forrest; Dvořák, Jan; Iwanaga, Masaru; Keller, Beat; Li, Wanlong; McCombie, W. Richard; Ogihara, Yasunari; Quetier, Francis; Sasaki, Takuji
2004-01-01
Sponsored by the National Science Foundation and the U.S. Department of Agriculture, a wheat genome sequencing workshop was held November 10–11, 2003, in Washington, DC. It brought together 63 scientists of diverse research interests and institutions, including 45 from the United States and 18 from a dozen foreign countries (see list of participants at http://www.ksu.edu/igrow). The objectives of the workshop were to discuss the status of wheat genomics, obtain feedback from ongoing genome sequencing projects, and develop strategies for sequencing the wheat genome. The purpose of this report is to convey the information discussed at the workshop and provide the basis for an ongoing dialogue, bringing forth comments and suggestions from the genetics community. PMID:15514080
Genetic Identification of Orientobilharzia turkestanicum from Sheep Isolates in Iran.
Tabaripour, Reza; Youssefi, Mohammad Reza; Tabaripour, Rabeeh
2015-01-01
Adult worms of Orientobilharzia turkestanicum live in the portal or intestinal veins of cattle, sheep, goats and many other mammals, causing orientobilharziasis. Orientobilharziasis causes significant economic losses to the livestock industry of Iran. However, there is limited information about genotypes of O. turkestanicum in Iran. In this study, 30 isolates of O. turkestanicum obtained from sheep were characterized by sequencing the mitochondrial cytochrome c oxidase subunit 1 (cox1) and nicotinamide adenine dinucleotide dehydrogenase subunit 1 (nad1) genes. The mitochondrial cox1 and nad1 DNA were amplified by polymerase chain reaction (PCR) and then sequenced and compared with sequences of O. turkestanicum and of other members of the Schistosomatidae available in GenBank(™). Phylogenetic relationships between them were reconstructed using the maximum parsimony method. Phylogenetic analyses performed in the present study placed O. turkestanicum within the Schistosoma genus and indicated that O. turkestanicum was phylogenetically closer to the African schistosome group than to the Asian schistosome group. Comparison of nad1 and cox1 sequences of O. turkestanicum obtained in this study with corresponding sequences available in GenBank(™) revealed some sequence variations and provided evidence for the presence of microvariants in Iran.
Lee, James W.; Thundat, Thomas G.
2005-06-14
An apparatus and method for performing nucleic acid (DNA and/or RNA) sequencing on a single molecule. The genetic sequence information is obtained by probing through a DNA or RNA molecule base by base at nanometer scale as though looking through a strip of movie film. This DNA sequencing nanotechnology has the theoretical capability of performing DNA sequencing at a maximal rate of about 1,000,000 bases per second. This enhanced performance is made possible by a series of innovations including: novel applications of a fine-tuned nanometer gap for passage of a single DNA or RNA molecule; thin layer microfluidics for sample loading and delivery; and programmable electric fields for precise control of DNA or RNA movement. Detection methods include nanoelectrode-gated tunneling current measurements, dielectric molecular characterization, and atomic force microscopy/electrostatic force microscopy (AFM/EFM) probing for nanoscale reading of the nucleic acid sequences.
Building a genome database using an object-oriented approach.
Barbasiewicz, Anna; Liu, Lin; Lang, B Franz; Burger, Gertraud
2002-01-01
GOBASE is a relational database that integrates data associated with mitochondria and chloroplasts. The most important data in GOBASE, i.e., molecular sequences and taxonomic information, are obtained from the public sequence data repository at the National Center for Biotechnology Information (NCBI), and are validated by our experts. Maintaining a curated genomic database comes with a towering labor cost, due to the sheer volume of available genomic sequences and the plethora of annotation errors and omissions in records retrieved from public repositories. Here we describe our approach to increase automation of the database population process, thereby reducing manual intervention. As a first step, we used Unified Modeling Language (UML) to construct a list of potential errors. Each case was evaluated independently, and an expert solution was devised and represented as a diagram. Subsequently, the UML diagrams were used as templates for writing object-oriented automation programs in the Java programming language.
Zhang, Li; Liao, Bo; Li, Dachao; Zhu, Wen
2009-07-21
Apoptosis, or programmed cell death, plays an important role in the development of an organism. Obtaining information on the subcellular location of apoptosis proteins is very helpful for understanding the apoptosis mechanism. In this paper, based on the concept that the position distribution information of amino acids is closely related to the structure and function of proteins, we introduce the concept of distance frequency [Matsuda, S., Vert, J.P., Ueda, N., Toh, H., Akutsu, T., 2005. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci. 14, 2804-2813] and propose a novel way to calculate distance frequencies. In order to calculate the local features, each protein sequence is separated into p parts of the same length. We then use the novel representation of protein sequences and adopt a support vector machine to predict subcellular location. The overall prediction accuracy, assessed by the jackknife test, is significantly improved.
Ferrer, Rebecca A; Taber, Jennifer M; Klein, William M P; Harris, Peter R; Lewis, Katie L; Biesecker, Leslie G
2015-01-01
One reason for not seeking personally threatening information may be negative current and anticipated affective responses. We examined whether current (e.g., worry) and anticipated negative affect predicted intentions to seek sequencing results in the context of an actual genomic sequencing trial (ClinSeq®; n = 545) and whether spontaneous self-affirmation mitigated any (negative) association between affect and intentions. Anticipated affective response negatively predicted intentions to obtain and share results pertaining to both medically actionable and non-actionable disease, whereas current affect was only a marginal predictor. The negative association between anticipated affect and intentions to obtain results pertaining to non-actionable disease was weaker in individuals who were higher in spontaneous self-affirmation. These results have implications for the understanding of current and anticipated affect, self-affirmation and consequential decision-making and contribute to a growing body of evidence on the role of affect in medical decisions.
Haebel, S.; Jensen, C.; Andersen, S. O.; Roepstorff, P.
1995-01-01
Simultaneous sequencing, using a combination of mass spectrometry and Edman degradation, of three approximately 15-kDa variants of a cuticular protein extracted from the meal beetle Tenebrio molitor larva is demonstrated. The information obtained by matrix-assisted laser desorption ionization mass spectrometry (MALDI MS) time-course monitoring of enzymatic digests was found essential to identify the differences among the three variants and for alignment of the peptides in the sequence. To determine whether each individual insect larva contains all three protein variants, proteins extracted from single animals were separated by two-dimensional gel electrophoresis, electroeluted from the gel spots, and analyzed by MALDI MS. Molecular weights of the proteins present in each sample could be obtained, and mass spectrometric mapping of the peptides after digestion with trypsin gave additional information. The protein isoforms were found to be allelic variants. PMID:7795523
Tan, Qian-Qian; Zhu, Li; Li, Yi; Liu, Wen; Ma, Wei-Hua; Lei, Chao-Liang; Wang, Xiao-Ping
2015-01-01
The cabbage beetle Colaphellus bowringi Baly is a serious insect pest of crucifers and undergoes reproductive diapause in soil. An understanding of the molecular mechanisms of diapause regulation, insecticide resistance, and other physiological processes is helpful for developing new management strategies for this beetle. However, the lack of genomic information and valid reference genes limits knowledge on the molecular bases of these physiological processes in this species. Using Illumina sequencing, we obtained more than 57 million sequence reads derived from C. bowringi, which were assembled into 39,390 unique sequences. A Clusters of Orthologous Groups classification was obtained for 9,048 of these sequences, covering 25 categories, and 16,951 were assigned to 255 Kyoto Encyclopedia of Genes and Genomes pathways. Eleven candidate reference gene sequences from the transcriptome were then identified through reverse transcriptase polymerase chain reaction. Among these candidate genes, EF1α, ACT1, and RPL19 proved to be the most stable reference genes for different reverse transcriptase quantitative polymerase chain reaction experiments in C. bowringi. Conversely, aTUB and GAPDH were the least stable reference genes. The abundant putative C. bowringi transcript sequences reported enrich the genomic resources of this beetle. Importantly, the larger number of gene sequences and valid reference genes provide a valuable platform for future gene expression studies, especially with regard to exploring the molecular mechanisms of different physiological processes in this species.
2014-01-01
Background Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to transfer functional information from the homologs to the given protein. Sequence-based comparison cannot detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. Structure-based comparisons can detect remote homologs, but most methods for doing so are too expensive to apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations have been proposed that allow fast detection of remote homologs with reasonable accuracy. These representations have also been used to obtain linearly-reducible maps of protein structure space. It has been shown, and is additionally supported by analysis in this paper, that such maps preserve functional co-localization of the protein structure space. Methods Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures to provide an alternative route for remote homology detection and organization of the protein structure space in few dimensions. Various techniques based on natural language processing are proposed and employed to aid the analysis of topics in the protein structure domain. Results We show that a topic-based representation is just as effective as a fragment-based one at automated detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the information content in the topic-based representation, showing that topics have semantic meaning. The fragment-based and topic-based representations are also shown to allow prediction of superfamily membership.
Conclusions This work opens exciting avenues in designing novel representations to extract information about protein structures, as well as organizing and mining protein structure space with mature text mining tools. PMID:25080993
Li, Jitao; Li, Jian; Chen, Ping; Liu, Ping; He, Yuying
2015-01-01
The ridgetail white prawn Exopalaemon carinicauda is one of the major economic mariculture species in eastern China. The lack of genomic and transcriptomic data is becoming a bottleneck for further research on its favorable traits. In the present study, 454 pyrosequencing was undertaken to investigate the transcriptome profiles of E. carinicauda. A collection of 1,028,710 sequence reads (459.59 Mb) obtained from cDNA prepared from eyestalk and hemocytes was assembled into 162,056 expressed sequence tags (ESTs). Of these, 29.88 % of 48,428 contigs and 70.12 % of 113,628 singlets possessed high similarities to sequences in the GenBank non-redundant database, with the most significant (E value < 1e-10) unigene matches occurring with crustacean and insect sequences. KEGG analysis of unigenes identified putative members of biological pathways related to growth and immunity. In addition, we obtained a total of 125,112 putative SNPs and 13,467 microsatellites. These results will contribute to the understanding of the genome makeup and provide useful information for future functional genomic research in E. carinicauda.
Failure to produce response variability with reinforcement
Schwartz, Barry
1982-01-01
Two experiments attempted to train pigeons to produce variable response sequences. In the first, naive pigeons were exposed to a procedure requiring four pecks on each of two keys in any order, with a reinforcer delivered only if a given sequence was different from the preceding one. In the second experiment, the same pigeons were exposed to this procedure after having been trained successfully to alternate between two specific response sequences. In neither case did any pigeon produce more than a few different sequences or obtain more than 50% of the possible reinforcers. Stereotyped sequences developed even though stereotypy was not reinforced. It is suggested that reinforcers have both hedonic and informative properties and that the hedonic properties are responsible for stereotyped repetition of reinforced responses, even when stereotypy is negatively related to reinforcer delivery. PMID:16812263
WEB-server for search of a periodicity in amino acid and nucleotide sequences
NASA Astrophysics Data System (ADS)
E Frenkel, F.; Skryabin, K. G.; Korotkov, E. V.
2017-12-01
A new web server (http://victoria.biengi.ac.ru/splinter/login.php) was designed and developed to search for periodicity in nucleotide and amino acid sequences. The web server is based on a new mathematical method of searching for multiple alignments, which relies on optimization of position weight matrices and on two-dimensional dynamic programming. This approach allows the construction of multiple alignments of weakly similar amino acid and nucleotide sequences that have accumulated more than 1.5 substitutions per amino acid or nucleotide, without performing pairwise sequence comparisons. The article examines the principles of the web server operation and two examples of studying amino acid and nucleotide sequences, as well as the information that can be obtained using the web server.
Fourment, Mathieu; Gibbs, Mark J
2008-02-05
Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.
Dölz, R; Mossé, M O; Slonimski, P P; Bairoch, A; Linder, P
1994-01-01
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. In this database each sequence has been attributed a single genetic name. In the case of duplicated sequences a simple method has been applied to distinguish sequences of one and the same gene from non-allelic sequences of duplicated genes. If necessary, synonyms are given in the case of allelic duplicated sequences. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference of the publication of the sequence, the chromosomal location as far as known, and the Swissprot and EMBL accession numbers. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections, resulting in LISTA-HON and LISTA-HOP. The LISTA database can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS). PMID:7937046
Nolan, Danielle; Carlson, Martha
2016-06-01
Genetic heterogeneity in neurologic disorders has been an obstacle to phenotype-based diagnostic testing. The authors hypothesized that information compiled via whole exome sequencing will improve clinical diagnosis and management of pediatric neurology patients. The authors performed a retrospective chart review of patients evaluated in the University of Michigan Pediatric Neurology clinic between 6/2011 and 6/2015. The authors recorded previous diagnostic testing, indications for whole exome sequencing, and whole exome sequencing results. Whole exome sequencing was recommended for 135 patients and obtained in 53 patients. Insurance barriers often precluded whole exome sequencing. The most common indication for whole exome sequencing was neurodevelopmental disorders. Whole exome sequencing improved the presumptive diagnostic rate in the patient cohort from 25% to 48%. Clinical implications included family planning, medication selection, and systemic investigation. Compared to current second tier testing, whole exome sequencing can result in lower long-term charges and more timely diagnosis. Overcoming barriers related to whole exome sequencing insurance authorization could allow for more efficient and fruitful diagnostic neurological evaluations. © The Author(s) 2016.
Automatic annotation of protein motif function with Gene Ontology terms.
Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G
2004-09-02
Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary, and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and an information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
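The mutual information feature mentioned above can be sketched for a single motif and GO term pair using a 2x2 table of sequence counts. This is a generic textbook formulation in bits. The abstract does not give the exact definition the authors used, so the function name, argument layout and example counts are assumptions.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between two binary events from a 2x2 count table.

    n11: sequences matching the motif and annotated with the GO term
    n10: sequences matching the motif but lacking the GO term
    n01: sequences lacking the motif but annotated with the GO term
    n00: sequences with neither
    """
    total = n11 + n10 + n01 + n00
    # Each cell: (joint count, motif-marginal count, term-marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, motif_marginal, term_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += (joint / total) * math.log2(
                joint * total / (motif_marginal * term_marginal)
            )
    return mi


if __name__ == "__main__":
    # Hypothetical counts: the motif and the GO term co-occur more often than chance.
    print(mutual_information(40, 10, 10, 40))
```

A value near zero indicates that motif matches and GO annotations are statistically independent across the sequence set, while larger values indicate a stronger association that a classifier could use as one feature among several.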
A new arenavirus in a cluster of fatal transplant-associated diseases.
Palacios, Gustavo; Druce, Julian; Du, Lei; Tran, Thomas; Birch, Chris; Briese, Thomas; Conlan, Sean; Quan, Phenix-Lan; Hui, Jeffrey; Marshall, John; Simons, Jan Fredrik; Egholm, Michael; Paddock, Christopher D; Shieh, Wun-Ju; Goldsmith, Cynthia S; Zaki, Sherif R; Catton, Mike; Lipkin, W Ian
2008-03-06
Three patients who received visceral-organ transplants from a single donor on the same day died of a febrile illness 4 to 6 weeks after transplantation. Culture, polymerase-chain-reaction (PCR) and serologic assays, and oligonucleotide microarray analysis for a wide range of infectious agents were not informative. We evaluated RNA obtained from the liver and kidney transplant recipients. Unbiased high-throughput sequencing was used to identify microbial sequences not found by means of other methods. The specificity of sequences for a new candidate pathogen was confirmed by means of culture and by means of PCR, immunohistochemical, and serologic analyses. High-throughput sequencing yielded 103,632 sequences, of which 14 represented an Old World arenavirus. Additional sequence analysis showed that this new arenavirus was related to lymphocytic choriomeningitis viruses. Specific PCR assays based on a unique sequence confirmed the presence of the virus in the kidneys, liver, blood, and cerebrospinal fluid of the recipients. Immunohistochemical analysis revealed arenavirus antigen in the liver and kidney transplants in the recipients. IgM and IgG antiviral antibodies were detected in the serum of the donor. Seroconversion was evident in serum specimens obtained from one recipient at two time points. Unbiased high-throughput sequencing is a powerful tool for the discovery of pathogens. The use of this method during an outbreak of disease facilitated the identification of a new arenavirus transmitted through solid-organ transplantation. Copyright 2008 Massachusetts Medical Society.
Sharma, Amit K; Gohel, Sangeeta; Singh, Satya P
2012-01-01
Actinobase is a relational database of molecular diversity, phylogeny and biocatalytic potential of haloalkaliphilic actinomycetes. The main objective of this database is to provide easy access to a range of information, data storage, comparison and analysis, while reducing data redundancy and data entry, storage and retrieval costs and improving data security. Information related to habitat, cell morphology, Gram reaction, biochemical characterization and molecular features would help researchers understand the identification and stress adaptation of existing and new candidates belonging to salt tolerant alkaliphilic actinomycetes. The PHP front end helps to add nucleotide and protein sequences of reported entries, which directly helps researchers to obtain the required details. Analysis of the genus-wise status of the salt tolerant alkaliphilic actinomycetes indicated 6 different genera among the 40 classified entries. The results indicated widespread occurrence of salt tolerant alkaliphilic actinomycetes belonging to diverse taxonomic positions. Entries and information related to actinomycetes in the database are publicly accessible at http://www.actinobase.in. On clustalW/X multiple sequence alignment of the alkaline protease gene sequences, different clusters emerged among the groups. The narrow search and limit options of the constructed database provided comparable information. The user-friendly PHP front end facilitates the addition of sequences of reported entries. The database is available for free at http://www.actinobase.in.
Shi, Xiaohe; Lu, Wen-Cong; Cai, Yu-Dong; Chou, Kuo-Chen
2011-01-01
Background With the huge amount of uncharacterized protein sequences generated in the post-genomic age, it is highly desirable to develop effective computational methods for quickly and accurately predicting their functions. The information thus obtained would be very useful for both basic research and drug development in a timely manner. Methodology/Principal Findings Although many efforts have been made in this regard, most of them were based on either sequence similarity or protein-protein interaction (PPI) information. However, the former often fails to work if a query protein has no or very little sequence similarity to any function-known proteins, while the latter has a similar problem if the relevant PPI information is not available. In view of this, a new approach is proposed by hybridizing the PPI information and the biochemical/physicochemical features of protein sequences. The overall first-order success rates by the new predictor for the functions of mouse proteins on training set and test set were 69.1% and 70.2%, respectively, and the success rate covered by the results of the top-4 order from a total of 24 orders was 65.2%. Conclusions/Significance The results indicate that the new approach is quite promising and may open a new avenue or direction for addressing this difficult and complicated problem. PMID:21283518
UPIC: Perl scripts to determine the number of SSR markers to run
USDA-ARS's Scientific Manuscript database
We have developed Perl Scripts for the cost-effective planning of fingerprinting and genotyping experiments. The UPIC scripts detect the best combination of polymorphic simple sequence repeat (SSR) markers and provide coefficients of the amount of information obtainable (number of alleles of patter...
Modulation of Molecular Markers by CLA.
1998-10-01
sequence information obtained for each gene fragment, a gene-specific primer was synthesized (Integrated DNA Technology, Inc, Coralville , IA) as the down...G.W. and Cochran, W.G. (1967) Statistical Methods, Ed. 6 Iowa University Press. 81. JK Beckman, T Yoshioka, SM Knobel, HL Green. Biphasic changes in
Wang, Ruijia; Nambiar, Ram; Zheng, Dinghai
2018-01-01
Abstract PolyA_DB is a database cataloging cleavage and polyadenylation sites (PASs) in several genomes. Previous versions were based mainly on expressed sequence tags (ESTs), which had a limited amount and could lead to inaccurate PAS identification due to the presence of internal A-rich sequences in transcripts. Here, we present an updated version of the database based solely on deep sequencing data. First, PASs are mapped by the 3′ region extraction and deep sequencing (3′READS) method, ensuring unequivocal PAS identification. Second, a large volume of data based on diverse biological samples increases PAS coverage by 3.5-fold over the EST-based version and provides PAS usage information. Third, strand-specific RNA-seq data are used to extend annotated 3′ ends of genes to obtain more thorough annotations of alternative polyadenylation (APA) sites. Fourth, conservation information of PAS across mammals sheds light on significance of APA sites. The database (URL: http://www.polya-db.org/v3) currently holds PASs in human, mouse, rat and chicken, and has links to the UCSC genome browser for further visualization and for integration with other genomic data. PMID:29069441
Lee, Chi-Ching; Chen, Yi-Ping Phoebe; Yao, Tzu-Jung; Ma, Cheng-Yu; Lo, Wei-Cheng; Lyu, Ping-Chiang; Tang, Chuan Yi
2013-04-10
Sequencing of microbial genomes is important because microbes carry antibiotic and pathogenetic activities. However, even with the help of new assembling software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenetic or antibiotic genes are carried in genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for genomes whose sequencing is still ongoing. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) which integrates a sequence assembling tool, a functional annotation pipeline, and a high-performance GI predicting module, in a support vector machine (SVM)-based method called genomic island genomic profile scanning (GI-GPS). The draft genomes of ongoing genome projects, in contigs or scaffolds, can be submitted to our Web server, and it provides the functional annotation and highly probable GI-predicting results. GI-POP is a comprehensive annotation Web server designed for ongoing genome project analysis. Researchers can perform annotation and obtain pre-analytic information, including possible GIs, coding/non-coding sequences and functional analysis, from their draft genomes. This pre-analytic system can provide useful information for finishing a genome sequencing project. Copyright © 2012 Elsevier B.V. All rights reserved.
Clinical sequencing in leukemia with the assistance of artificial intelligence.
Tojo, Arinobu
2017-01-01
Next generation sequencing (NGS) of cancer genomes is now becoming a prerequisite for accurate diagnosis and proper treatment in clinical oncology. Because the genomic regions for NGS expand from a certain set of genes to the whole exome or whole genome, the resulting sequence data becomes incredibly enormous and makes it quite laborious to translate the genomic data into medicine, so-called annotation and curation. We organized a clinical sequencing team and established a bidirectional (bed-to-bench and bench-to-bed) system to integrate clinical and genomic data for hematological malignancies. We also started a collaborative research project with IBM Japan to adopt the artificial intelligence Watson for Genomics (WfG) to the pipeline of medical informatics. Genomic DNA was prepared from malignant as well as normal tissues in each patient and subjected to NGS. Sequence data was analyzed using an in-house semi-automated pipeline in combination with WfG, which was used to identify candidate driver mutations and relevant pathways from which applicable drug information was deduced. Currently, we have analyzed more than 150 patients with hematological disorders, including AML and ALL, and obtained many informative findings. In this presentation, I will introduce some of the achievements we have made so far.
2014-01-01
Background Leptotrombidium pallidum and Leptotrombidium scutellare are the major vector mites for Orientia tsutsugamushi, the causative agent of scrub typhus. Before these organisms can be subjected to whole-genome sequencing, it is necessary to estimate their genome sizes to obtain basic information for establishing the strategies that should be used for genome sequencing and assembly. Method The genome sizes of L. pallidum and L. scutellare were estimated by a method based on quantitative real-time PCR. In addition, a k-mer analysis of the whole-genome sequences obtained through Illumina sequencing was conducted to verify the mutual compatibility and reliability of the results. Results The genome sizes estimated using qPCR were 191 ± 7 Mb for L. pallidum and 262 ± 13 Mb for L. scutellare. The k-mer analysis-based genome lengths were estimated to be 175 Mb for L. pallidum and 286 Mb for L. scutellare. The estimates from these two independent methods were mutually complementary and within a similar range to those of other Acariform mites. Conclusions The estimation method based on qPCR appears to be a useful alternative when the standard methods, such as flow cytometry, are impractical. The relatively small estimated genome sizes should facilitate whole-genome analysis, which could contribute to our understanding of Arachnida genome evolution and provide key information for scrub typhus prevention and mite vector competence. PMID:24947244
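As an illustration of the k-mer analysis mentioned in this abstract, the sketch below uses the common textbook estimate: genome size is roughly the total number of k-mer occurrences divided by the depth at the main peak of the k-mer histogram. The abstract does not state which tool or formula the authors used, so this is an assumption, and the function names and the `min_depth` error cutoff are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as total k-mer occurrences / peak depth.

    k-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored (a simplifying assumption).
    """
    # histogram maps depth -> number of distinct k-mers seen at that depth
    histogram = Counter(kmer_counts.values())
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_occurrences = sum(depth * n for depth, n in usable.items())
    return total_occurrences / peak_depth
```

Real data would also need handling of heterozygosity and repeats, which this sketch ignores.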
Li, Zhao-Qun; Zhang, Shuai; Ma, Yan; Luo, Jun-Yu; Wang, Chun-Yi; Lv, Li-Min; Dong, Shuang-Lin; Cui, Jin-Jie
2013-01-01
Chrysopa pallens (Rambur) are the most important natural enemies and predators of various agricultural pests. Understanding the sophisticated olfactory system in insect antennae is crucial for studying the physiological bases of olfaction and also could lead to effective applications of C. pallens in integrated pest management. However no transcriptome information is available for Neuroptera, and sequence data for C. pallens are scarce, so obtaining more sequence data is a priority for researchers on this species. To facilitate identifying sets of genes involved in olfaction, a normalized transcriptome of C. pallens was sequenced. A total of 104,603 contigs were obtained and assembled into 10,662 clusters and 39,734 singletons; 20,524 were annotated based on BLASTX analyses. A large number of candidate chemosensory genes were identified, including 14 odorant-binding proteins (OBPs), 22 chemosensory proteins (CSPs), 16 ionotropic receptors, 14 odorant receptors, and genes potentially involved in olfactory modulation. To better understand the OBPs, CSPs and cytochrome P450s, phylogenetic trees were constructed. In addition, 10 digital gene expression libraries of different tissues were constructed and gene expression profiles were compared among different tissues in males and females. Our results provide a basis for exploring the mechanisms of chemoreception in C. pallens, as well as other insects. The evolutionary analyses in our study provide new insights into the differentiation and evolution of insect OBPs and CSPs. Our study provided large-scale sequence information for further studies in C. pallens.
A proteomic analysis of leaf sheaths from rice.
Shen, Shihua; Matsubae, Masami; Takao, Toshifumi; Tanaka, Naoki; Komatsu, Setsuko
2002-10-01
The proteins extracted from the leaf sheaths of rice seedlings were separated by 2-D PAGE, and analyzed by Edman sequencing and mass spectrometry, followed by database searching. Image analysis revealed 352 protein spots on 2-D PAGE after staining with Coomassie Brilliant Blue. The amino acid sequences of 44 of 84 proteins were determined; for 31 of these proteins, a clear function could be assigned, whereas for 12 proteins, no function could be assigned. Forty proteins did not yield amino acid sequence information, because they were N-terminally blocked, or the obtained sequences were too short and/or did not give unambiguous results. Fifty-nine proteins were analyzed by mass spectrometry; all of these proteins were identified by matching to the protein database. The amino acid sequences of 19 of 27 proteins analyzed by mass spectrometry were similar to the results of Edman sequencing. These results suggest that 2-D PAGE combined with Edman sequencing and mass spectrometry analysis can be effectively used to identify plant proteins.
Professionally Responsible Disclosure of Genomic Sequencing Results in Pediatric Practice
Brothers, Kyle B.; Chung, Wendy K.; Joffe, Steven; Koenig, Barbara A.; Wilfond, Benjamin; Yu, Joon-Ho
2015-01-01
Genomic sequencing is being rapidly introduced into pediatric clinical practice. The results of sequencing are distinctive for their complexity and subsequent challenges of interpretation for generalist and specialist pediatricians, parents, and patients. Pediatricians therefore need to prepare for the professionally responsible disclosure of sequencing results to parents and patients and guidance of parents and patients in the interpretation and use of these results, including managing uncertain data. This article provides an ethical framework to guide and evaluate the professionally responsible disclosure of the results of genomic sequencing in pediatric practice. The ethical framework comprises 3 core concepts of pediatric ethics: the best interests of the child standard, parental surrogate decision-making, and pediatric assent. When recommending sequencing, pediatricians should explain the nature of the proposed test, its scope and complexity, the categories of results, and the concept of a secondary or incidental finding. Pediatricians should obtain the informed permission of parents and the assent of mature adolescents about the scope of sequencing to be performed and the return of results. PMID:26371191
Connected Component Model for Multi-Object Tracking.
He, Zhenyu; Li, Xin; You, Xinge; Tao, Dacheng; Tang, Yuan Yan
2016-08-01
In multi-object tracking, it is critical to explore the data associations by exploiting the temporal information from a sequence of frames rather than the information from the adjacent two frames. Since straightforwardly obtaining data associations from multi-frames is an NP-hard multi-dimensional assignment (MDA) problem, most existing methods solve this MDA problem by either developing complicated approximate algorithms, or simplifying MDA as a 2D assignment problem based upon the information extracted only from adjacent frames. In this paper, we show that the relation between associations of two observations is the equivalence relation in the data association problem, based on the spatial-temporal constraint that the trajectories of different objects must be disjoint. Therefore, the MDA problem can be equivalently divided into independent subproblems by equivalence partitioning. In contrast to existing works for solving the MDA problem, we develop a connected component model (CCM) by exploiting the constraints of the data association and the equivalence relation on the constraints. Based upon CCM, we can efficiently obtain the global solution of the MDA problem for multi-object tracking by optimizing a sequence of independent data association subproblems. Experiments on challenging public data sets demonstrate that our algorithm outperforms the state-of-the-art approaches.
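The idea of splitting the assignment problem into independent subproblems by equivalence partitioning can be sketched with a generic union-find over observations. This is not the authors' connected component model. It only shows how observations linked by candidate associations fall into connected components that can then be solved separately. The node numbering and edge list are hypothetical inputs.

```python
def connected_components(num_nodes, edges):
    """Partition nodes 0..num_nodes-1 into connected components (union-find)."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Six observations; candidate associations link 0-1-2 and 3-4, while 5 is isolated.
subproblems = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

Each returned group could then be passed to a separate, smaller data association solver.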
MuffinInfo: HTML5-Based Statistics Extractor from Next-Generation Sequencing Data.
Alic, Andy S; Blanquer, Ignacio
2016-09-01
Usually, the information known a priori about a newly sequenced organism is limited. Even resequencing the same organism can generate unpredictable output. We introduce MuffinInfo, a FastQ/Fasta/SAM information extractor implemented in HTML5 capable of offering insights into next-generation sequencing (NGS) data. Our new tool can run on any software or hardware environment, in command line or graphically, and in browser or standalone. It presents information such as average length, base distribution, quality scores distribution, k-mer histogram, and homopolymers analysis. MuffinInfo improves upon the existing extractors by adding the ability to save and then reload the results obtained after a run as a navigable file (also supporting saving pictures of the charts), by supporting custom statistics implemented by the user, and by offering user-adjustable parameters involved in the processing, all in one software. At the moment, the extractor works with all base space technologies such as Illumina, Roche, Ion Torrent, Pacific Biosciences, and Oxford Nanopore. Owing to HTML5, our software demonstrates the readiness of web technologies for mild intensive tasks encountered in bioinformatics.
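A few of the statistics listed in this abstract (average length, base distribution, homopolymers) can be sketched in a handful of lines. MuffinInfo itself is implemented in HTML5; the Python below is only an illustration of the kind of computation involved, and the function name and output keys are hypothetical.

```python
from collections import Counter
from itertools import groupby


def read_statistics(reads):
    """Summarize a list of read strings: mean length, base counts, longest homopolymer."""
    lengths = [len(read) for read in reads]
    base_counts = Counter("".join(reads))
    # groupby yields runs of identical consecutive characters within each read.
    longest_run = max(
        (len(list(run)) for read in reads for _, run in groupby(read)),
        default=0,
    )
    return {
        "average_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "base_counts": dict(base_counts),
        "longest_homopolymer": longest_run,
    }
```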
Song, Jiangning; Burrage, Kevin; Yuan, Zheng; Huber, Thomas
2006-03-09
The majority of peptide bonds in proteins are found to occur in the trans conformation. However, for proline residues, a considerable fraction of Prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function. In this paper, we propose a new approach to predict the proline cis/trans isomerization in proteins using support vector machine (SVM). The preliminary results indicated that using Radial Basis Function (RBF) kernels could lead to better prediction performance than that of polynomial and linear kernel functions. We used single sequence information of different local window sizes, amino acid compositions of different local sequences, multiple sequence alignment obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes in order to investigate their effects on the prediction performance. The training and testing of this approach was performed on a newly enlarged dataset of 2424 non-homologous proteins determined by X-Ray diffraction method using 5-fold cross-validation. Selecting the window size 11 provided the best performance for determining the proline cis/trans isomerization based on the single amino acid sequence. It was found that using multiple sequence alignments in the form of PSI-BLAST profiles could significantly improve the prediction performance, the prediction accuracy increased from 62.8% with single sequence to 69.8% and Matthews Correlation Coefficient (MCC) improved from 0.26 with single local sequence to 0.40. 
Furthermore, if coupled with the predicted secondary structure information by PSIPRED, our method yielded a prediction accuracy of 71.5% and MCC of 0.43, 9% and 0.17 higher than the accuracy achieved based on the single sequence information, respectively. A new method has been developed to predict the proline cis/trans isomerization in proteins based on support vector machine, which used the single amino acid sequence with different local window sizes, the amino acid compositions of local sequence flanking centered proline residues, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of SVM approach in this study reinforced that SVM is a powerful tool in predicting proline cis/trans isomerization in proteins and biological sequence analysis.
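Two pieces of this abstract lend themselves to a short sketch: extracting a fixed-size local window centered on each proline, and computing the Matthews Correlation Coefficient from a confusion matrix. The MCC formula is the standard one. The window helper, its padding character "X" and both function names are assumptions for illustration, not the authors' encoding.

```python
import math


def proline_windows(sequence, size=11, pad="X"):
    """Return one window of `size` residues centered on each proline (P).

    Windows that run past either end of the sequence are padded with `pad`.
    """
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i, residue in enumerate(sequence) if residue == "P"]


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when any marginal is empty
    return (tp * tn - fp * fn) / denominator
```

MCC ranges from -1 to 1 and, unlike accuracy, is informative when one class (here, cis) is much rarer than the other.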
Comparative 454 pyrosequencing of transcripts from two olive genotypes during fruit development
Alagna, Fiammetta; D'Agostino, Nunzio; Torchia, Laura; Servili, Maurizio; Rao, Rosa; Pietrella, Marco; Giuliano, Giovanni; Chiusano, Maria Luisa; Baldoni, Luciana; Perrotta, Gaetano
2009-01-01
Background Despite its primary economic importance, genomic information on olive tree is still lacking. 454 pyrosequencing was used to enrich the very few sequence data currently available for the Olea europaea species and to identify genes involved in expression of fruit quality traits. Results Fruits of Coratina, a widely cultivated variety characterized by a very high phenolic content, and Tendellone, an oleuropein-lacking natural variant, were used as starting material for monitoring the transcriptome. Four different cDNA libraries were sequenced, respectively at the beginning and at the end of drupe development. A total of 261,485 reads were obtained, for an output of about 58 Mb. Raw sequence data were processed using a four step pipeline procedure and data were stored in a relational database with a web interface. Conclusion Massively parallel sequencing of different fruit cDNA collections has provided large scale information about the structure and putative function of gene transcripts accumulated during fruit development. Comparative transcript profiling allowed the identification of differentially expressed genes with potential relevance in regulating the fruit metabolism and phenolic content during ripening. PMID:19709400
2013-01-01
Background Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. This regeneration is nerve-dependent because blastema formation and regeneration does not occur after limb denervation. To obtain the genomic information of blastema tissues, de novo transcriptomes from both blastema tissues and denervated stump ends of Ambystoma mexicanum (axolotls) 14 days post-amputation were sequenced and compared using Solexa DNA sequencing. Results The sequencing done for this study produced 40,688,892 reads that were assembled into 307,345 transcribed sequences. The N50 of transcribed sequence length was 562 bases. A similarity search with known proteins identified 39,200 different genes to be expressed during limb regeneration with a cut-off E-value exceeding 10^-5. We annotated assembled sequences by using gene descriptions, gene ontology, and clusters of orthologous group terms. Targeted searches using these annotations showed that the majority of the genes were in the categories of essential metabolic pathways, transcription factors and conserved signaling pathways, and novel candidate genes for regenerative processes. We discovered and confirmed numerous sequences of the candidate genes by using quantitative polymerase chain reaction and in situ hybridization. Conclusion The results of this study demonstrate that de novo transcriptome sequencing allows gene expression analysis in a species lacking genome information and provides the most comprehensive mRNA sequence resources for axolotls. The characterization of the axolotl transcriptome can help elucidate the molecular mechanisms underlying blastema formation during limb regeneration. PMID:23815514
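The N50 statistic reported in this abstract has a simple definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. A minimal sketch, with a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 is undefined for an empty collection")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For lengths 2, 3, 4, 5 and 6 (total 20), the two longest sequences sum to 11, which passes the halfway mark of 10, so the N50 is 5.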
Taxonomic and functional assignment of cloned sequences from high Andean forest soil metagenome.
Montaña, José Salvador; Jiménez, Diego Javier; Hernández, Mónica; Angel, Tatiana; Baena, Sandra
2012-02-01
Total metagenomic DNA was isolated from high Andean forest soil and subjected to taxonomical and functional composition analyses by means of clone library generation and sequencing. The obtained yield of 1.7 μg of DNA/g of soil was used to construct a metagenomic library of approximately 20,000 clones (in the plasmid p-Bluescript II SK+) with an average insert size of 4 Kb, covering 80 Mb of the total metagenomic DNA. Metagenomic sequences near the plasmid cloning site were sequenced and then trimmed and assembled, obtaining 299 reads and 31 contigs (0.3 Mb). Taxonomic assignment of total sequences was performed by BLASTX, resulting in 68.8, 44.8 and 24.5% classification into taxonomic groups using the metagenomic RAST server v2.0, WebCARMA v1.0 online system and MetaGenome Analyzer v3.8 software, respectively. Most clone sequences were classified as Bacteria belonging to phyla Actinobacteria, Proteobacteria and Acidobacteria. Among the most represented orders were Actinomycetales (34% average), Rhizobiales, Burkholderiales and Myxococcales and with a greater number of sequences in the genus Mycobacterium (7% average), Frankia, Streptomyces and Bradyrhizobium. The vast majority of sequences were associated with the metabolism of carbohydrates, proteins, lipids and catalytic functions, such as phosphatases, glycosyltransferases, dehydrogenases, methyltransferases, dehydratases and epoxide hydrolases. In this study we compared different methods of taxonomic and functional assignment of metagenomic clone sequences to evaluate microbial diversity in an unexplored soil ecosystem, searching for putative enzymes of biotechnological interest and generating important information for further functional screening of clone libraries.
2012-01-01
Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. 
The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026
NASA Technical Reports Server (NTRS)
Stevenson, William A. (Inventor)
1989-01-01
A process for infrared spectroscopic monitoring of in situ compositional changes in a polymeric material comprises the steps of providing an elongated infrared radiation transmitting fiber that has a transmission portion and a sensor portion, embedding the sensor portion in the polymeric material to be monitored, subjecting the polymeric material to a processing sequence, applying a beam of infrared radiation to the fiber for transmission through the transmitting portion to the sensor portion for modification as a function of properties of the polymeric material, monitoring the modified infrared radiation spectra as the polymeric material is being subjected to the processing sequence to obtain kinetic data on changes in the polymeric material during the processing sequence, and adjusting the processing sequence as a function of the kinetic data provided by the modified infrared radiation spectra information.
NASA Technical Reports Server (NTRS)
Stevenson, William A. (Inventor)
1992-01-01
A process for infrared spectroscopic monitoring of in situ compositional changes in a polymeric material comprises the steps of providing an elongated infrared radiation transmitting fiber that has a transmission portion and a sensor portion, embedding the sensor portion in the polymeric material to be monitored, subjecting the polymeric material to a processing sequence, applying a beam of infrared radiation to the fiber for transmission through the transmitting portion to the sensor portion for modification as a function of properties of the polymeric material, monitoring the modified infrared radiation spectra as the polymeric material is being subjected to the processing sequence to obtain kinetic data on changes in the polymeric material during the processing sequence, and adjusting the processing sequence as a function of the kinetic data provided by the modified infrared radiation spectra information.
Bastien, Olivier; Ortet, Philippe; Roy, Sylvaine; Maréchal, Eric
2005-03-10
Popular methods to reconstruct molecular phylogenies are based on multiple sequence alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence alignments, respect probabilistic properties of Z-scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction. We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise alignment score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-scores. Deduced trees, called TULIP trees, are consistent with multiple-alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny. The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionary consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology, provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-score computations.
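The Z-scores referred to in this abstract come from Monte Carlo comparison of a real pair-wise score against scores obtained after shuffling one sequence. The sketch below shows that general procedure only. It does not implement the CSHP or TULIP constructions, and `identity_score` is a deliberately crude, hypothetical stand-in for a real pair-wise alignment score.

```python
import random
import statistics


def identity_score(seq_a, seq_b):
    """Toy score: number of positions where the two sequences carry the same residue."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def shuffle_z_score(seq_a, seq_b, score=identity_score, n_shuffles=200, seed=0):
    """Z-score of score(seq_a, seq_b) against scores for shuffled versions of seq_b."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    real_score = score(seq_a, seq_b)
    residues = list(seq_b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # preserves composition, destroys residue order
        shuffled_scores.append(score(seq_a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    spread = statistics.pstdev(shuffled_scores)
    if spread == 0:
        raise ValueError("shuffled scores have zero variance; Z-score is undefined")
    return (real_score - mean) / spread
```

Because shuffling keeps amino acid composition fixed, a large Z-score indicates similarity beyond what composition alone would produce.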
USDA-ARS's Scientific Manuscript database
It has been often stated that we have moved from an age of chemistry to an age of biology. The ease of sequencing genomes and obtaining related genotypic, transcriptomic, proteomic, and metabolomics information is leading to the development of new commercial technologies where problems are solved "...
PhosphoBase: a database of phosphorylation sites.
Blom, N; Kreegipuu, A; Brunak, S
1998-01-01
PhosphoBase is a database of experimentally verified phosphorylation sites. Version 1.0 contains 156 entries and 398 experimentally determined phosphorylation sites. Entries are compiled and revised from the literature and from major protein sequence databases such as SwissProt and PIR. The entries provide information about the phosphoprotein and the exact position of its phosphorylation sites. Furthermore, part of the entries contain information about kinetic data obtained from enzyme assays on specific peptides. To illustrate the use of data extracted from PhosphoBase we present a sequence logo displaying the overall conservation of positions around serines phosphorylated by protein kinase A (PKA). PhosphoBase is available on the WWW at http://www.cbs.dtu.dk/databases/PhosphoBase/ PMID:9399879
Giehr, Pascal; Walter, Jörn
2018-01-01
The accurate and quantitative detection of 5-methylcytosine is of great importance in the field of epigenetics. The method of choice is usually bisulfite sequencing because of its high resolution and the possibility of combining it with next generation sequencing. Nevertheless, this method also has its limitations. Following bisulfite treatment, the DNA strands are no longer complementary, so that in a subsequent PCR amplification the DNA methylation pattern information of only one of the two DNA strands is preserved. Several years ago, hairpin bisulfite sequencing was developed as a method to obtain the pattern information on complementary DNA strands. The method requires fragmentation (usually by enzymatic cleavage) of genomic DNA followed by covalent linking of both DNA strands through ligation of a short DNA hairpin oligonucleotide to both strands. The ligated, covalently linked dsDNA products are then subjected to a conventional bisulfite treatment during which all unmodified cytosines are converted to uracils. During the treatment the DNA is denatured, forming noncomplementary ssDNA circles. These circles serve as a template for a locus-specific PCR to amplify chromosomal patterns of the region of interest. As a result, one ends up with a linearized product that contains the methylation information of both complementary DNA strands.
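The chemical conversion described above can be mimicked in silico. The sketch below is an illustration only: the function names are hypothetical, methylated cytosines are passed as 0-based positions, and the hairpin ligation and PCR steps are not modeled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def bisulfite_convert(seq, methylated_positions=()):
    """Convert unmethylated C to U; methylated C (5mC) is left unchanged."""
    methylated = set(methylated_positions)
    return "".join(
        "U" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def reverse_complement(seq):
    """Reverse complement of an unconverted DNA strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


top = "ACGTCG"
bottom = reverse_complement(top)           # strand complementary to `top`
top_bs = bisulfite_convert(top, {1})       # C at index 1 is methylated
bottom_bs = bisulfite_convert(bottom)      # bottom strand fully unmethylated
```

After conversion, `top_bs` and `bottom_bs` no longer pair base for base, which is why each strand has to be read separately unless the two are physically linked, as in the hairpin approach.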
Halachev, Mihail R; Chan, Jacqueline Z-M; Constantinidou, Chrystala I; Cumley, Nicola; Bradley, Craig; Smith-Banks, Matthew; Oppenheim, Beryl; Pallen, Mark J
2014-01-01
Multidrug-resistant Acinetobacter baumannii commonly causes hospital outbreaks. However, within an outbreak, it can be difficult to identify the routes of cross-infection rapidly and accurately enough to inform infection control. Here, we describe a protracted hospital outbreak of multidrug-resistant A. baumannii, in which whole-genome sequencing (WGS) was used to obtain a high-resolution view of the relationships between isolates. To delineate and investigate the outbreak, we attempted to genome-sequence 114 isolates that had been assigned to the A. baumannii complex by the Vitek2 system and obtained informative draft genome sequences from 102 of them. Genomes were mapped against an outbreak reference sequence to identify single nucleotide variants (SNVs). We found that the pulsotype 27 outbreak strain was distinct from all other genome-sequenced strains. Seventy-four isolates from 49 patients could be assigned to the pulsotype 27 outbreak on the basis of genomic similarity, while WGS allowed 18 isolates to be ruled out of the outbreak. Among the pulsotype 27 outbreak isolates, we identified 31 SNVs and seven major genotypic clusters. In two patients, we documented within-host diversity, including mixtures of unrelated strains and within-strain clouds of SNV diversity. By combining WGS and epidemiological data, we reconstructed potential transmission events that linked all but 10 of the patients and confirmed links between clinical and environmental isolates. Identification of a contaminated bed and a burns theatre as sources of transmission led to enhanced environmental decontamination procedures. WGS is now poised to make an impact on hospital infection prevention and control, delivering cost-effective identification of routes of infection within a clinically relevant timeframe and allowing infection control teams to track, and even prevent, the spread of drug-resistant hospital pathogens.
Amadoz, Alicia; González-Candelas, Fernando
2007-04-20
Most research scientists working in the fields of molecular epidemiology, population and evolutionary genetics are confronted with the management of large volumes of data. Moreover, the data used in studies of infectious diseases are complex and usually derive from different institutions such as hospitals or laboratories. Since no public database scheme incorporating clinical and epidemiological information about patients and molecular information about pathogens is currently available, we have developed an information system, composed of a main database and a web-based interface, which integrates both types of data and satisfies requirements of good organization, simple accessibility, data security and multi-user support. From the moment a patient arrives at a hospital or health centre until the processing and analysis of molecular sequences obtained from infectious pathogens in the laboratory, a large amount of information is collected from different sources. We have divided the most relevant data into 12 conceptual modules around which we have organized the database schema. Our schema is comprehensive and covers many aspects of sample sources, samples, laboratory processes, molecular sequences, phylogenetics results, clinical tests and results, clinical information, treatments, pathogens, transmissions, outbreaks and bibliographic information. Communication between end-users and the selected Relational Database Management System (RDMS) is carried out by default through a command-line window or through a user-friendly, web-based interface which provides access and management tools for the data. epiPATH is an information system for managing clinical and molecular information from infectious diseases. It facilitates daily work related to infectious pathogens and sequences obtained from them. This software is intended for local installation in order to safeguard private data and gives advanced SQL users the flexibility to adapt it to their needs.
The database schema, tool scripts and web-based interface are free software but data stored in our database server are not publicly available. epiPATH is distributed under the terms of GNU General Public License. More details about epiPATH can be found at http://genevo.uv.es/epipath.
IMM estimator with out-of-sequence measurements
NASA Astrophysics Data System (ADS)
Bar-Shalom, Yaakov; Chen, Huimin
2004-08-01
In multisensor tracking systems that operate in a centralized information processing architecture, measurements from the same target obtained by different sensors can arrive at the processing center out of sequence. In order to avoid either a delay in the output or the need for reordering and reprocessing an entire sequence of measurements, such measurements have to be processed as out-of-sequence measurements (OOSM). Recent work developed procedures for incorporating OOSMs into a Kalman filter (KF). Since the state of the art tracker for real (maneuvering) targets is the Interacting Multiple Model (IMM) estimator, this paper presents the algorithm for incorporating OOSMs into an IMM estimator. Both data association and estimation are considered. Simulation results are presented for two realistic problems using measurements from two airborne GMTI sensors. It is shown that the proposed algorithm for incorporating OOSMs into an IMM estimator yields practically the same performance as the reordering and in-sequence reprocessing of the measurements.
Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.
2007-01-01
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (). PMID:17202161
Quasispecies in population of compositional assemblies.
Gross, Renan; Fouxon, Itzhak; Lancet, Doron; Markovitch, Omer
2014-12-30
The quasispecies model refers to information carriers that undergo self-replication with errors. A quasispecies is a steady-state population of biopolymer sequence variants generated by mutations from a master sequence. A quasispecies error threshold is a minimal replication accuracy below which the population structure breaks down. Theory and experimentation of this model often refer to biopolymers, e.g. RNA molecules or viral genomes, while its prebiotic context is often associated with an RNA world scenario. Here, we study the possibility that compositional entities which code for compositional information, intrinsically different from biopolymers coding for sequential information, could show quasispecies dynamics. We employed a chemistry-based model, graded autocatalysis replication domain (GARD), which simulates the network dynamics within compositional molecular assemblies. In GARD, a compotype represents a population of similar assemblies that constitute a quasi-stationary state in compositional space. A compotype's center-of-mass is found to be analogous to a master sequence for a sequential quasispecies. Using single-cycle GARD dynamics, we measured the quasispecies transition matrix (Q) for the probabilities of transition from one center-of-mass Euclidean distance to another. Similarly, the quasispecies' growth rate vector (A) was obtained. This allowed computing a steady state distribution of distances to the center of mass, as derived from the quasispecies equation. In parallel, a steady state distribution was obtained via the GARD equation kinetics. Rewardingly, a significant correlation was observed between the distributions obtained by these two methods. This was only seen for distances to the compotype center-of-mass, and not to randomly selected compositions. A similar correspondence was found when comparing the quasispecies time dependent dynamics towards steady state. 
Further, changing the error rate by modifying basal assembly joining rate of GARD kinetics was found to display an error catastrophe, similar to the standard quasispecies model. Additional augmentation of compositional mutations leads to the complete disappearance of the master-like composition. Our results show that compositional assemblies, as simulated by the GARD formalism, portray significant attributes of quasispecies dynamics. This expands the applicability of the quasispecies model beyond sequence-based entities, and potentially enhances validity of GARD as a model for prebiotic evolution.
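The steady state derived from the quasispecies equation, given a transition matrix Q and a growth rate vector A, can be sketched as the dominant eigenvector of the matrix W with entries W[i][j] = Q[i][j] * A[j]. The code below is a minimal illustration by power iteration, assuming the convention that Q[i][j] is the probability that replication of type j yields type i; the function name and the toy values are hypothetical and are not taken from the GARD study.

```python
def steady_state(Q, A, iterations=1000):
    """Power iteration on W[i][j] = Q[i][j] * A[j].

    Returns the dominant eigenvector normalized to sum to 1, i.e. the
    stationary relative abundances under the quasispecies equation.
    """
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = [sum(Q[i][j] * A[j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        x = [value / total for value in y]
    return x


# Toy two-type system: type 0 replicates twice as fast as type 1,
# and each replication produces the other type with probability 0.1.
Q = [[0.9, 0.1],
     [0.1, 0.9]]
A = [2.0, 1.0]
distribution = steady_state(Q, A)
```

In this toy setting the faster-replicating type dominates the stationary distribution but does not take over completely, because mutation keeps regenerating the other type.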
Wellehan, James F. X.; Johnson, April J.; Harrach, Balázs; Benkö, Mária; Pessier, Allan P.; Johnson, Calvin M.; Garner, Michael M.; Childress, April; Jacobson, Elliott R.
2004-01-01
A consensus nested-PCR method was designed for investigation of the DNA polymerase gene of adenoviruses. Gene fragments were amplified and sequenced from six novel adenoviruses from seven lizard species, including four species from which adenoviruses had not previously been reported. Host species included Gila monster, leopard gecko, fat-tail gecko, blue-tongued skink, Tokay gecko, bearded dragon, and mountain chameleon. This is the first sequence information from lizard adenoviruses. Phylogenetic analysis indicated that these viruses belong to the genus Atadenovirus, supporting the reptilian origin of atadenoviruses. This PCR method may be useful for obtaining templates for initial sequencing of novel adenoviruses. PMID:15542689
Wellehan, James F X; Johnson, April J; Harrach, Balázs; Benkö, Mária; Pessier, Allan P; Johnson, Calvin M; Garner, Michael M; Childress, April; Jacobson, Elliott R
2004-12-01
A consensus nested-PCR method was designed for investigation of the DNA polymerase gene of adenoviruses. Gene fragments were amplified and sequenced from six novel adenoviruses from seven lizard species, including four species from which adenoviruses had not previously been reported. Host species included Gila monster, leopard gecko, fat-tail gecko, blue-tongued skink, Tokay gecko, bearded dragon, and mountain chameleon. This is the first sequence information from lizard adenoviruses. Phylogenetic analysis indicated that these viruses belong to the genus Atadenovirus, supporting the reptilian origin of atadenoviruses. This PCR method may be useful for obtaining templates for initial sequencing of novel adenoviruses.
Reverse transcription polymerase chain reaction protocols for cloning small circular RNAs.
Navarro, B; Daròs, J A; Flores, R
1998-07-01
A protocol is described for general application for cloning small circular RNAs which requires only minimal amounts of template (approximately 50 ng) of unknown sequence. Both cDNA strands are synthesized with a 26-mer primer whose six 3'-terminal positions are totally degenerate in two consecutive reactions catalyzed by reverse transcriptase and DNA polymerase, respectively. The cDNAs are then PCR-amplified, using a 20-mer primer with the non-degenerate sequence of the previous primer, cloned and sequenced. This information permits the synthesis of one or more pairs of specific and adjacent primers for obtaining full-length cDNA clones by a protocol which is also described.
Muñoz-Colmenero, Marta; Martínez, Jose Luis; Roca, Agustín; Garcia-Vazquez, Eva
2017-01-01
Next Generation Sequencing methodologies are considered the next step within DNA-based methods, and their applicability in different fields is being evaluated. Here, we tested the usefulness of the Ion Torrent Personal Genome Machine (PGM) in food traceability by analyzing candies as a model of highly processed foods, and compared the results with those obtained by PCR-cloning-sequencing (PCR-CS). The majority of samples exhibited consistency between methodologies, with the PGM platform yielding more information and more species per product than PCR-CS. Significantly higher AT-content in sequences of the same species was also obtained from PGM. This, together with some taxonomical discrepancies between methodologies, suggests that the PGM platform is still premature for use in food traceability of complex, highly processed products. It could be a good option for analysis of less complex food, saving time and cost per sample. Copyright © 2016 Elsevier Ltd. All rights reserved.
Fernández-Caballero Rico, Jose Ángel; Chueca Porcuna, Natalia; Álvarez Estévez, Marta; Mosquera Gutiérrez, María Del Mar; Marcos Maeso, María Ángeles; García, Federico
2018-02-01
To show how to generate a consensus sequence, suitable for molecular epidemiology studies, from the massively parallel sequencing data obtained in routine HIV antiretroviral resistance studies. Paired Sanger (Trugene-Siemens) and next-generation sequencing (NGS) (454 GSJunior-Roche) HIV RT and protease sequences from 62 patients were studied. NGS consensus sequences were generated using Mesquite, using 10%, 15%, and 20% thresholds. Molecular evolutionary genetics analysis (MEGA) was used for phylogenetic studies. At a 10% threshold, NGS-Sanger sequences from 17/62 patients were phylogenetically related, with a median bootstrap value of 88% (IQR 83.5-95.5). Association increased to 36/62 sequences, with a median bootstrap value of 94% (IQR 85.5-98), using a 15% threshold. Maximum association was at the 20% threshold, with 61/62 sequences associated and a median bootstrap value of 99% (IQR 98-100). A safe method is presented to generate consensus sequences from HIV-NGS data at a 20% threshold, which will prove useful for molecular epidemiological studies. Copyright © 2016 Elsevier España, S.L.U. and Sociedad Española de Enfermedades Infecciosas y Microbiología Clínica. All rights reserved.
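The effect of the frequency threshold can be illustrated with a small sketch. It assumes that reads are already aligned, of equal length and limited to A, C, G and T, and that every base reaching the threshold at a position is kept and encoded with an IUPAC ambiguity code; the `consensus` function is hypothetical and does not reproduce the Mesquite workflow used in the study.

```python
from collections import Counter

# IUPAC nucleotide codes keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus(reads, threshold=0.20):
    """Per-column consensus; bases at or above `threshold` frequency are kept."""
    out = []
    for column in zip(*reads):
        counts = Counter(column)
        depth = len(column)
        kept = sorted(b for b, n in counts.items() if n / depth >= threshold)
        # Fall back to N when no base (or an unknown combination) qualifies.
        out.append(IUPAC.get("".join(kept), "N"))
    return "".join(out)


reads = ["ACGT"] * 8 + ["ACAT"] * 2   # a 20% minority variant at position 3
low = consensus(reads, threshold=0.15)
high = consensus(reads, threshold=0.25)
```

With the lower threshold the minority variant is kept and the position becomes ambiguous; with the higher threshold only the majority base survives, which yields a cleaner sequence for tree building.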
Narad, Priyanka; Kumar, Abhishek; Chakraborty, Amlan; Patni, Pranav; Sengupta, Abhishek; Wadhwa, Gulshan; Upadhyaya, K C
2017-09-01
Transcription factors are trans-acting proteins that interact with specific nucleotide sequences known as transcription factor binding sites (TFBS), and these interactions are implicated in the regulation of gene expression. Regulation of transcriptional activation of a gene often involves multiple interactions of transcription factors with various sequence elements. Identification of these sequence elements is the first step in understanding the underlying molecular mechanism(s) that regulate gene expression. For in silico identification of these sequence elements, we have developed, for the first time, an online computational tool named transcription factor information system (TFIS) for detecting TFBS; it uses a collection of Java programs and is mainly based on TFBS detection using a position weight matrix (PWM). The databases used for obtaining position frequency matrices (PFM) are JASPAR and HOCOMOCO, which are open-access databases of transcription factor binding profiles. Pseudo-counts are used while converting PFM to PWM, and TFBS detection is carried out on the basis of a percent score taken as the threshold value. TFIS is equipped with advanced features such as direct sequence retrieval from the NCBI database using gene identification number and accession number, detection of binding sites for a common TF in a batch of gene sequences, and TFBS detection after generating a PWM from known raw binding sequences, in addition to general detection methods. TFIS can detect the presence of potential TFBSs in both directions at the same time, which increases its efficiency. The results of this dual detection are presented in different colors specific to the orientation of the binding site.
Results obtained by TFIS are more detailed and specific to the detected TFs because the result pages integrate informative links to related web servers such as Gene Ontology, the PAZAR database and the Transcription Factor Encyclopedia, in addition to NCBI and UniProt. Common TFs such as SP1, AP1 and NF-KB of the amyloid beta precursor gene are easily detected using TFIS, along with multiple binding sites. In another scenario, the embryonic developmental process, TFs of the FOX family (FOXL1 and FOXC1) were also identified. TFIS is platform-independent and is publicly available, along with its support and documentation, at http://tfistool.appspot.com and http://www.bioinfoplus.com/tfis/. TFIS is licensed under the GNU General Public License, version 3 (GPL-3.0).
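The PFM-to-PWM conversion with pseudo-counts and the two-strand scan described in this abstract can be sketched as follows. This is not the TFIS code (which is written in Java); the function names, the uniform background of 0.25, the pseudo-count of 1 and the definition of the percent score (position within the range between the minimum and maximum attainable PWM score) are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert a PFM (base -> counts per position) to a log2-odds PWM."""
    length = len(pfm["A"])
    pwm = []
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((pfm[b][i] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, site):
    """Sum of per-position log-odds values for a candidate site."""
    return sum(col[base] for col, base in zip(pwm, site))


def revcomp(seq):
    """Reverse complement, used to scan the opposite strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(pwm, seq, percent=80.0):
    """Return (position, strand, site) hits scoring at least `percent`
    of the way from the minimum to the maximum attainable PWM score."""
    width = len(pwm)
    max_s = sum(max(col.values()) for col in pwm)
    min_s = sum(min(col.values()) for col in pwm)
    cutoff = min_s + (max_s - min_s) * percent / 100.0
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        for strand, site in (("+", window), ("-", revcomp(window))):
            if score(pwm, site) >= cutoff:
                hits.append((i, strand, site))
    return hits


# Toy motif AACG built from ten identical binding sequences.
pfm = {"A": [10, 10, 0, 0], "C": [0, 0, 10, 0],
       "G": [0, 0, 0, 10], "T": [0, 0, 0, 0]}
hits = scan(pfm_to_pwm(pfm), "TTAACGTT", percent=90.0)
```

The pseudo-count keeps the log-odds finite for bases never observed at a position, and scanning the reverse complement of each window reports sites on the opposite strand in the same pass.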
Jones, David T; Kandathil, Shaun M
2018-04-26
In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. DeepCov is freely available at https://github.com/psipred/DeepCov. d.t.jones@ucl.ac.uk.
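The pairwise covariance data mentioned above can be illustrated for a single pair of alignment columns. The sketch assumes the usual covariance of indicator variables, cov(a, b) = f_ij(a, b) - f_i(a) * f_j(b), computed from raw (unweighted) frequencies; the function name is hypothetical and the code is not taken from the DeepCov repository, which may weight sequences or handle gaps differently.

```python
from collections import Counter


def pair_covariance(alignment, i, j):
    """Covariance between residue identities at alignment columns i and j.

    Returns a dict mapping (a, b) to f_ij(a, b) - f_i(a) * f_j(b), where the
    f values are raw frequencies over the sequences in the alignment.
    """
    n = len(alignment)
    fi = Counter(seq[i] for seq in alignment)
    fj = Counter(seq[j] for seq in alignment)
    fij = Counter((seq[i], seq[j]) for seq in alignment)
    return {
        (a, b): fij[(a, b)] / n - (fi[a] / n) * (fj[b] / n)
        for a in fi
        for b in fj
    }


# Toy alignment in which columns 0 and 1 co-vary perfectly.
cov = pair_covariance(["AC", "AC", "GT", "GT"], 0, 1)
```

Positive entries mark residue pairs that co-occur more often than expected from the single-column frequencies; a network operating on such values for all column pairs sees only local alignment statistics, as the abstract describes.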
GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank.
You, Ronghui; Zhang, Zihan; Xiong, Yi; Sun, Fengzhu; Mamitsuka, Hiroshi; Zhu, Shanfeng
2018-03-07
Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of more than 70 million proteins in UniProtKB have experimental GO annotations, implying a strong need for automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem because one protein can have a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore, homology-based SAFP tools are competitive in AFP competitions, but they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins that already have annotations. Thus the vital and challenging problem now is how to develop a method for SAFP, particularly for difficult proteins. The key to this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs and integrate them into a predictor in a manner that is both effective and efficient. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a paradigm of machine learning, especially powerful for multilabel classification. The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods. http://datamining-iip.fudan.edu.cn/golabeler. zhusf@fudan.edu.cn. Supplementary data are available at Bioinformatics online.
Genetic characterization of Zostera asiatica on the Pacific Coast of North America
Talbot, S.L.; Wyllie-Echeverria, S.; Ward, D.H.; Rearick, J.R.; Sage, G.K.; Chesney, B.; Phillips, R.C.
2006-01-01
We gathered sequence information from the nuclear 5.8S rDNA gene and associated internal transcribed spacers, ITS-1 and ITS-2 (5.8S rDNA/ITS), and the chloroplast maturase K (matK) gene, from Zostera samples collected from subtidal habitats in Monterey and Santa Barbara (Isla Vista) bays, California, to test the hypothesis that these plants are conspecific with Z. asiatica Miki of Asia. Sequences from approximately 520 base pairs of the nuclear 5.8S rDNA/ITS obtained from the subtidal Monterey and Isla Vista Zostera samples were identical to homologous sequences obtained from Z. marina collected from intertidal habitats in Japan, Alaska, Oregon and California. Similarly, sequences from the matK gene from the subtidal Zostera samples were identical to matK sequences obtained from Z. marina collected from intertidal habitats in Japan, Alaska, Oregon and California, but differed from Z. asiatica sequences accessioned into GenBank. This suggests the subtidal plants are conspecific with Z. marina, not Z. asiatica. However, we found that herbarium samples accessioned into the Kyoto University Herbarium, determined to be Z. asiatica, yielded 5.8S rDNA/ITS sequences consistent with either Z. japonica, in two cases, or Z. marina, in one case. Similar results were observed for the chloroplast matK gene; we found haplotypes that were inconsistent with published matK sequences from Z. asiatica collected from Japan. These results underscore the need for closer examination of the relationship between Z. marina along the Pacific Coast of North America, and Z. asiatica of Asia, for the retention and verification of specimens examined in scientific studies, and for assessment of the usefulness of morphological characters in the determination of taxonomic relationships within Zosteraceae.
Kowalczyk, Marek; Sekuła, Andrzej; Mleczko, Piotr; Olszowy, Zofia; Kujawa, Anna; Zubek, Szymon; Kupiec, Tomasz
2015-01-01
Aim: To assess the usefulness of a DNA-based method for identifying mushroom species for application in forensic laboratory practice. Methods: Two hundred twenty-one samples of clinical forensic material (dried mushrooms, food remains, stomach contents, feces, etc) were analyzed. The ITS2 region of nuclear ribosomal DNA (nrDNA) was sequenced and the sequences were compared with reference sequences collected from the National Center for Biotechnology Information gene bank (GenBank). Sporological identification of mushrooms was also performed for 57 samples of clinical material. Results: Of 221 samples, positive sequencing results were obtained for 152 (69%). The highest percentage of positive results was obtained for samples of dried mushrooms (96%) and food remains (91%). Comparison with GenBank sequences enabled identification of all samples at least at the genus level. Most samples (90%) were identified at the level of species or a group of closely related species. Sporological and molecular identification were consistent at the level of species or genus for 30% of analyzed samples. Conclusion: Molecular analysis identified a larger number of species than the sporological method. It proved to be suitable for analysis of evidential material (dried hallucinogenic mushrooms) in forensic genetic laboratories as well as to complement classical methods in the analysis of clinical material. PMID:25727040
Kowalczyk, Marek; Sekuła, Andrzej; Mleczko, Piotr; Olszowy, Zofia; Kujawa, Anna; Zubek, Szymon; Kupiec, Tomasz
2015-02-01
To assess the usefulness of a DNA-based method for identifying mushroom species for application in forensic laboratory practice. Two hundred twenty-one samples of clinical forensic material (dried mushrooms, food remains, stomach contents, feces, etc) were analyzed. The ITS2 region of nuclear ribosomal DNA (nrDNA) was sequenced and the sequences were compared with reference sequences collected from the National Center for Biotechnology Information gene bank (GenBank). Sporological identification of mushrooms was also performed for 57 samples of clinical material. Of 221 samples, positive sequencing results were obtained for 152 (69%). The highest percentage of positive results was obtained for samples of dried mushrooms (96%) and food remains (91%). Comparison with GenBank sequences enabled identification of all samples at least at the genus level. Most samples (90%) were identified at the level of species or a group of closely related species. Sporological and molecular identification were consistent at the level of species or genus for 30% of analyzed samples. Molecular analysis identified a larger number of species than the sporological method. It proved to be suitable for analysis of evidential material (dried hallucinogenic mushrooms) in forensic genetic laboratories as well as to complement classical methods in the analysis of clinical material.
Hemichordates and the Origin of Chordates
NASA Technical Reports Server (NTRS)
Gerhart, John; Kirschner, Marc; Lowe, Chris
2002-01-01
At the start of the period of the NASA grant three years ago, we had no information on the organization and development of the body axis of the hemichordate, Saccoglossus kowalevskii. Now we have substantial findings about the anteroposterior axis and dorsoventral axis, and based on this information, we have new insights about the origin of chordates from ancestral deuterostomes. We found ways to obtain and preserve large numbers of embryos and hatched juveniles. We can now collect about 40,000 embryos in the month of September, the time of S. kowalevskii spawning at Woods Hole. Excellent cDNA libraries were prepared from three developmental stages. From these libraries, we directly isolated about 30 gene ortholog sequences by screening and PCR techniques, all of them sequences of interest in the inquiry about the animal's organization and development. We also performed a mid-sized EST project (60,000 randomly picked clones, many of these arrayed). About half of these have been analyzed so far by blastx and are suitable for direct use of clones. We have obtained about 50 interesting sequences from this set. The rest still await analysis. Thus, at this time we have isolated orthologs of 80 genes that are known to be expressed in chordates in conserved domains and known to have interesting roles in chordate organization and development. The orthology of the S. kowalevskii sequences has been verified by neighbor joining and parsimony methods, with bootstrap estimates of validity. The S. kowalevskii sequences cluster with other deuterostome sequences, namely, other hemichordates, echinoderms, ascidians, amphioxus, or vertebrates, depending on what sequences are available in the database for comparison. We have used these sequences to do high quality in situ hybridization on S. kowalevskii embryos, and the results can be divided into three sections: those concerning the anteroposterior axis of S. kowalevskii in comparison to the same axis of chordates, those concerning the dorsoventral axis of S. kowalevskii in comparison to the same axis of chordates, and those concerning the signals and transcription factors found in the endoderm of S. kowalevskii compared to the signals and transcription factors in the endo-mesodermal cells of Spemann's organizer of chordates.
From protein sequence to dynamics and disorder with DynaMine.
Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F
2013-01-01
Protein function and dynamics are closely related; however, accurate dynamics information is difficult to obtain. Here based on a carefully assembled data set derived from experimental data for proteins in solution, we quantify backbone dynamics properties on the amino-acid level and develop DynaMine--a fast, high-quality predictor of protein backbone dynamics. DynaMine uses only protein sequence information as input and shows great potential in distinguishing regions of different structural organization, such as folded domains, disordered linkers, molten globules and pre-structured binding motifs of different sizes. It also identifies disordered regions within proteins with an accuracy comparable to the most sophisticated existing predictors, without depending on prior disorder knowledge or three-dimensional structural information. DynaMine provides molecular biologists with an important new method that grasps the dynamical characteristics of any protein of interest, as we show here for human p53 and E1A from human adenovirus 5.
On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification
NASA Astrophysics Data System (ADS)
Aygün, Eser; Oommen, B. John; Cataltepe, Zehra
Syntactic methods in pattern recognition have been used extensively in bioinformatics, and in particular, in the analysis of gene and protein expressions, and in the recognition and classification of bio-sequences. These methods are almost universally distance-based. This paper concerns the use of an Optimal and Information Theoretic (OIT) probabilistic model [11] to achieve peptide classification using the information residing in their syntactic representations. The latter has traditionally been achieved using the edit distances required in the respective peptide comparisons. We advocate that one can model the differences between compared strings as a mutation model consisting of random Substitutions, Insertions and Deletions (SID) obeying the OIT model. Thus, in this paper, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, using which a Support Vector Machine (SVM)-based peptide classifier, referred to as OIT_SVM, can be devised.
Molecular interaction networks in the analyses of sequence variation and proteomics data.
Stelzl, Ulrich
2013-12-01
Protein-protein interaction networks are typically generated in standard cell lines or model organisms as it is prohibitively difficult to record large interaction datasets from specific tissues or disease models at a reasonable pace. Although the interaction data are of high confidence, they thus do not reflect in vivo relationships as such. A wealth of physiologically relevant protein information, obtained under different conditions and from different systems, is available including information on genetic variation, protein levels, and PTMs. However, these data are difficult to assess comprehensively because the relationships between the entities remain elusive from the measurements. Here, we exemplarily highlight recent studies that gained deeper insight from genetic variation, protein, and PTM measurements using interaction information pointing toward the importance and potential of interaction networks for the interpretation of sequencing and proteomics data. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Earth field NMR with chemical shift spectral resolution: theory and proof of concept.
Katz, Itai; Shtirberg, Lazar; Shakour, Gubrail; Blank, Aharon
2012-06-01
A new method for obtaining an NMR signal in the Earth's magnetic field (EF) is presented. The method makes use of a simple pulse sequence with only DC fields which is much less demanding than previous approaches in terms of the pulses' rise and fall times. Furthermore, it offers the possibility of obtaining NMR data with enough spectral resolution to allow retrieving high resolution molecular chemical shift (CS) information - a capability that was not considered possible in EF NMR until now. Details of the pulse sequence, the experimental system, and our specially tailored EF NMR probe are provided. The experimental results demonstrate the capability to differentiate between three types of samples made of common fluorine compounds, based on their CS data. Copyright © 2012 Elsevier Inc. All rights reserved.
Four new topological indices based on the molecular path code.
Balaban, Alexandru T; Beteringhe, Adrian; Constantinescu, Titus; Filip, Petru A; Ivanciuc, Ovidiu
2007-01-01
The sequence of all paths $p_i$ of lengths $i = 1$ to the maximum possible length in a hydrogen-depleted molecular graph (which sequence is also called the molecular path code) contains significant information on the molecular topology, and as such it is a reasonable choice to be selected as the basis of topological indices (TIs). Four new (or five partly new) TIs with progressively improved performance (judged by correctly reflecting branching, centricity, and cyclicity of graphs, ordering of alkanes, and low degeneracy) have been explored. (i) By summing the squares of all numbers in the sequence one obtains $\sum_i p_i^2$, and by dividing this sum by one plus the cyclomatic number, a Quadratic TI is obtained: $Q = \sum_i p_i^2/(\mu+1)$. (ii) On summing the Square roots of all numbers in the sequence one obtains $\sum_i p_i^{1/2}$, and by dividing this sum by one plus the cyclomatic number, the TI denoted by S is obtained: $S = \sum_i p_i^{1/2}/(\mu+1)$. (iii) On dividing terms in this sum by the corresponding topological distances, one obtains the Distance-reduced index $D = \sum_i \{p_i^{1/2}/[i(\mu+1)]\}$. Two similar formulas define the next two indices, the first one with no square roots: (iv) distance-Attenuated index: $A = \sum_i \{p_i/[i(\mu+1)]\}$; and (v) the last TI with two square roots: Path-count index: $P = \sum_i \{p_i^{1/2}/[i^{1/2}(\mu+1)]\}$. These five TIs are compared for their degeneracy, ordering of alkanes, and performance in QSPR (for all alkanes with 3-12 carbon atoms and for all possible chemical cyclic or acyclic graphs with 4-6 carbon atoms) in correlations with six physical properties and one chemical property.
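The five indices defined in this abstract can be evaluated directly from a path code. The sketch below is only an illustration: the function name and the input convention (a list whose k-th entry is the number of paths of length k+1) are assumptions, and the example path code (3, 2, 1 with cyclomatic number 0) is the one for the hydrogen-depleted graph of n-butane.

```python
import math

def path_code_indices(path_counts, mu):
    """Return the Q, S, D, A and P indices for a molecular path code.

    path_counts[k] is the number of paths of length k + 1 (assumed convention);
    mu is the cyclomatic number of the graph.
    """
    denom = mu + 1
    numbered = list(enumerate(path_counts, start=1))  # (length i, count p_i)
    return {
        "Q": sum(p ** 2 for _, p in numbered) / denom,
        "S": sum(math.sqrt(p) for _, p in numbered) / denom,
        "D": sum(math.sqrt(p) / (i * denom) for i, p in numbered),
        "A": sum(p / (i * denom) for i, p in numbered),
        "P": sum(math.sqrt(p) / (math.sqrt(i) * denom) for i, p in numbered),
    }

# n-butane: 3 paths of length 1, 2 of length 2, 1 of length 3; acyclic, so mu = 0
butane = path_code_indices([3, 2, 1], mu=0)
```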
Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping
2007-01-01
Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730
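The counts reported in this abstract can be cross-checked with a few lines of arithmetic. The sketch uses only the figures stated above. The variable names are illustrative, and the accession ranges are assumed to be contiguous.

```python
# Figures as stated in the abstract
contigs, singletons = 448, 1783
ests_in_contigs = 1361
similar_to_genbank = 743

unique_sequences = contigs + singletons          # expected: 2231
good_quality_ests = ests_in_contigs + singletons  # expected: 3144
percent_similar = 100 * similar_to_genbank / unique_sequences  # about 33%

# Number of accessions covered by the two dbEST ranges, inclusive
accessions = (672369 - 669363 + 1) + (515580 - 515444 + 1)
```

Under these assumptions the contig and singleton counts, the good-quality EST total, and the accession ranges are mutually consistent.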
Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.
2008-01-01
GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov PMID:18073190
Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS.
Warris, Sven; Yalcin, Feyruz; Jackson, Katherine J L; Nap, Jan Peter
2015-01-01
To obtain large-scale sequence alignments in a fast and flexible way is an important step in the analyses of next generation sequencing data. Applications based on the Smith-Waterman (SW) algorithm are often either not fast enough, limited to dedicated tasks or not sufficiently accurate due to statistical issues. Current SW implementations that run on graphics hardware do not report the alignment details necessary for further analysis. With the Parallel SW Alignment Software (PaSWAS) it is possible (a) to have easy access to the computational power of NVIDIA-based general purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments, and (b) retrieve relevant information such as score, number of gaps and mismatches. The software reports multiple hits per alignment. The added value of the new SW implementation is demonstrated with two test cases: (1) tag recovery in next generation sequence data and (2) isotype assignment within an immunoglobulin 454 sequence data set. Both cases show the usability and versatility of the new parallel Smith-Waterman implementation.
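As a reference point for the Smith-Waterman (SW) algorithm named in this abstract, a minimal CPU sketch of the local-alignment scoring recurrence is given below. It is not the PaSWAS implementation. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration, and only the best score and its end position are reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, (row, col) where it ends)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return best, best_pos
```

Recovering the details the abstract mentions (number of gaps and mismatches) would additionally require a traceback from the best-scoring cell, which is omitted here.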
Fourment, Mathieu; Gibbs, Mark J
2008-01-01
Background Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. Results The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. Conclusion VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically. PMID:18251994
DDRprot: a database of DNA damage response-related proteins.
Andrés-León, Eduardo; Cases, Ildefonso; Arcas, Aida; Rojas, Ana M
2016-01-01
The DNA Damage Response (DDR) signalling network is an essential system that protects the genome's integrity. The DDRprot database presented here is a resource that integrates manually curated information on the human DDR network and its sub-pathways. For each particular DDR protein, we present detailed information about its function. If involved in post-translational modifications (PTMs) with each other, we depict the position of the modified residue/s in the three-dimensional structures, when resolved structures are available for the proteins. All this information is linked to the original publication from where it was obtained. Phylogenetic information is also shown, including time of emergence and conservation across 47 selected species, family trees and sequence alignments of homologues. The DDRprot database can be queried by different criteria: pathways, species, evolutionary age or involvement in PTMs. Sequence searches using hidden Markov models can also be used. Database URL: http://ddr.cbbio.es. © The Author(s) 2016. Published by Oxford University Press.
Shortt, Jonathan A.; Card, Daren C.; Schield, Drew R.; Liu, Yang; Zhong, Bo; Castoe, Todd A.
2017-01-01
Background In areas where schistosomiasis control programs have been implemented, morbidity and prevalence have been greatly reduced. However, to sustain these reductions and move towards interruption of transmission, new tools for disease surveillance are needed. Genomic methods have the potential to help trace the sources of new infections, and allow us to monitor drug resistance. Large-scale genotyping efforts for schistosome species have been hindered by cost, limited numbers of established target loci, and the small amount of DNA obtained from miracidia, the life stage most readily acquired from humans. Here, we present a method using next generation sequencing to provide high-resolution genomic data from S. japonicum for population-based studies. Methodology/Principal Findings We applied whole genome amplification followed by double digest restriction site associated DNA sequencing (ddRADseq) to individual S. japonicum miracidia preserved on Whatman FTA cards. We found that we could effectively and consistently survey hundreds of thousands of variants from 10,000 to 30,000 loci from archived miracidia as old as six years. An analysis of variation from eight miracidia obtained from three hosts in two villages in Sichuan showed clear population structuring by village and host even within this limited sample. Conclusions/Significance This high-resolution sequencing approach yields three orders of magnitude more information than microsatellite genotyping methods that have been employed over the last decade, creating the potential to answer detailed questions about the sources of human infections and to monitor drug resistance. Costs per sample range from $50-$200, depending on the amount of sequence information desired, and we expect these costs can be reduced further given continued reductions in sequencing costs, improvement of protocols, and parallelization. This approach provides new promise for using modern genome-scale sampling to S. japonicum surveillance, and could be applied to other schistosome species and other parasitic helminths. PMID:28107347
Tran, T T Nha; Brinkworth, Craig S; Bowie, John H
2015-01-30
To use negative-ion nano-electrospray ionization mass spectrometry of peptides from the tryptic digest of ricin D, to provide sequence information; in particular, to identify disulfide position and connectivity. Negative-ion fragmentations of peptides from the tryptic digest of ricin D were studied using a Waters QTOF2 mass spectrometer operating in MS and MS(2) modes. Twenty-three peptides were obtained following high-performance liquid chromatography and studied by negative-ion mass spectrometry covering 73% of the amino-acid residues of ricin D. Five disulfide-containing peptides were identified, three intermolecular and two intramolecular disulfide-containing peptides. The [M-H](-) anions of the intermolecular disulfides undergo facile cleavage of the disulfide units to produce fragment peptides. In negative-ion collision-induced dissociation (CID) these source-formed anions undergo backbone cleavages, which provide sequencing information. The two intramolecular disulfides were converted proteolytically into intermolecular disulfides, which were identified as outlined above. The positions of the five disulfide groups in ricin D may be determined by characteristic negative-ion cleavage of the disulfide groups, while sequence information may be determined using the standard negative-ion backbone cleavages of the resulting cleaved peptides. Negative-ion mass spectrometry can also be used to provide partial sequencing information for other peptides (i.e. those not containing Cys) using the standard negative-ion backbone cleavages of these peptides. Copyright © 2014 John Wiley & Sons, Ltd.
USDA-ARS's Scientific Manuscript database
Flowering and plant and ear height-related traits are extensively studied in maize for three main reasons: 1) ease of obtaining phenotypic measurements, 2) advances in genotyping and sequencing technologies have reduced the cost of genomic information, and 3) the importance of these traits for adapt...
Sequencing of Oligourea Foldamers by Tandem Mass Spectrometry
NASA Astrophysics Data System (ADS)
Bathany, Katell; Owens, Neil W.; Guichard, Gilles; Schmitter, Jean-Marie
2013-03-01
This study is focused on sequence analysis of peptidomimetic helical oligoureas by means of tandem mass spectrometry, to build a basis for de novo sequencing for future high-throughput combinatorial library screening of oligourea foldamers. After the evaluation of MS/MS spectra obtained for model compounds with either MALDI or ESI sources, we found that the MALDI-TOF-TOF instrument gave more satisfactory results. MS/MS spectra of oligoureas generated by decay of singly charged precursor ions show major ion series corresponding to fragmentation across both CO-NH and N'H-CO urea bonds. Oligourea backbones fragment to produce a pattern of a, x, b, and y type fragment ions. De novo decoding of spectral information is facilitated by the occurrence of low mass reporter ions, representative of constitutive monomers, in an analogous manner to the use of immonium ions for peptide sequencing.
Li, Zhao-Qun; Zhang, Shuai; Ma, Yan; Luo, Jun-Yu; Wang, Chun-Yi; Lv, Li-Min; Dong, Shuang-Lin; Cui, Jin-Jie
2013-01-01
Background Chrysopa pallens (Rambur) are the most important natural enemies and predators of various agricultural pests. Understanding the sophisticated olfactory system in insect antennae is crucial for studying the physiological bases of olfaction and also could lead to effective applications of C. pallens in integrated pest management. However no transcriptome information is available for Neuroptera, and sequence data for C. pallens are scarce, so obtaining more sequence data is a priority for researchers on this species. Results To facilitate identifying sets of genes involved in olfaction, a normalized transcriptome of C. pallens was sequenced. A total of 104,603 contigs were obtained and assembled into 10,662 clusters and 39,734 singletons; 20,524 were annotated based on BLASTX analyses. A large number of candidate chemosensory genes were identified, including 14 odorant-binding proteins (OBPs), 22 chemosensory proteins (CSPs), 16 ionotropic receptors, 14 odorant receptors, and genes potentially involved in olfactory modulation. To better understand the OBPs, CSPs and cytochrome P450s, phylogenetic trees were constructed. In addition, 10 digital gene expression libraries of different tissues were constructed and gene expression profiles were compared among different tissues in males and females. Conclusions Our results provide a basis for exploring the mechanisms of chemoreception in C. pallens, as well as other insects. The evolutionary analyses in our study provide new insights into the differentiation and evolution of insect OBPs and CSPs. Our study provided large-scale sequence information for further studies in C. pallens. PMID:23826220
Mining SNPs from EST sequences using filters and ensemble classifiers.
Wang, J; Zou, Q; Guo, M Z
2010-05-04
Abundant single nucleotide polymorphisms (SNPs) provide the most complete information for genome-wide association studies. However, due to the bottleneck of manual discovery of putative SNPs and the inaccessibility of the original sequencing reads, it is essential to develop a more efficient and accurate computational method for automated SNP detection. We propose a novel computational method to rapidly find true SNPs in publicly available EST (expressed sequence tag) databases; this method is implemented as SNPDigger. EST sequences are clustered and aligned. SNP candidates are then obtained according to a measure of redundant frequency. Several new informative biological features, such as the structural neighbor profiles and the physical position of the SNP, were extracted from EST sequences, and the effectiveness of these features was demonstrated. An ensemble classifier, which employs a carefully selected feature set, was included for the imbalanced training data. The sensitivity and specificity of our method both exceeded 80% for human genetic data in the cross validation. Our method enables detection of SNPs from the user's own EST dataset and can be used on species for which there is no genome data. Our tests showed that this method can effectively guide SNP discovery in ESTs and will be useful for reducing the cost of biological analyses.
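The candidate-selection step ("SNP candidates are then obtained according to a measure of redundant frequency") can be illustrated with a simple column scan over aligned ESTs. This is a hypothetical sketch, not the SNPDigger procedure. It assumes equal-length, gap-padded sequences, and the rule that each allele must be seen at least `min_count` times is likewise an assumption.

```python
from collections import Counter

def snp_candidates(aligned, min_count=2):
    """Return (position, alleles) for columns with two or more well-supported bases.

    aligned: equal-length, gap-padded sequences from one EST cluster (assumed input).
    min_count: minimum number of reads that must carry an allele (assumed rule).
    """
    candidates = []
    for pos in range(len(aligned[0])):
        # Count only unambiguous bases; gaps and N are ignored
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in "ACGT")
        supported = sorted(base for base, n in counts.items() if n >= min_count)
        if len(supported) >= 2:
            candidates.append((pos, supported))
    return candidates

reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTA", "ACGTT"]
found = snp_candidates(reads)  # the single T at position 4 is not redundant
```

Requiring redundancy filters out isolated sequencing errors, such as the lone T in the last read, before any classifier is applied.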
Lavery, Richard; Zakrzewska, Krystyna; Beveridge, David; Bishop, Thomas C.; Case, David A.; Cheatham, Thomas; Dixit, Surjit; Jayaram, B.; Lankas, Filip; Laughton, Charles; Maddocks, John H.; Michon, Alexis; Osman, Roman; Orozco, Modesto; Perez, Alberto; Singh, Tanya; Spackova, Nada; Sponer, Jiri
2010-01-01
It is well recognized that base sequence exerts a significant influence on the properties of DNA and plays a significant role in protein–DNA interactions vital for cellular processes. Understanding and predicting base sequence effects requires an extensive structural and dynamic dataset which is currently unavailable from experiment. A consortium of laboratories was consequently formed to obtain this information using molecular simulations. This article describes results providing information not only on all 10 unique base pair steps, but also on all possible nearest-neighbor effects on these steps. These results are derived from simulations of 50–100 ns on 39 different DNA oligomers in explicit solvent and using a physiological salt concentration. We demonstrate that the simulations are converged in terms of helical and backbone parameters. The results show that nearest-neighbor effects on base pair steps are very significant, implying that dinucleotide models are insufficient for predicting sequence-dependent behavior. Flanking base sequences can notably lead to base pair step parameters in dynamic equilibrium between two conformational sub-states. Although this study only provides limited data on next-nearest-neighbor effects, we suggest that such effects should be analyzed before attempting to predict the sequence-dependent behavior of DNA. PMID:19850719
QRS complex detection based on continuous density hidden Markov models using univariate observations
NASA Astrophysics Data System (ADS)
Sotelo, S.; Arenas, W.; Altuve, M.
2018-04-01
In the electrocardiogram (ECG), the detection of QRS complexes is a fundamental step in the ECG signal processing chain since it allows the determination of other characteristic waves of the ECG and provides information about heart rate variability. In this work, an automatic QRS complex detector based on continuous density hidden Markov models (HMM) is proposed. HMM were trained using univariate observation sequences taken either from QRS complexes or their derivatives. The detection approach is based on the log-likelihood comparison of the observation sequence with a fixed threshold. A sliding window was used to obtain the observation sequence to be evaluated by the model. The threshold was optimized by receiver operating characteristic curves. Sensitivity (Sen), specificity (Spc) and F1 score were used to evaluate the detection performance. The approach was validated using ECG recordings from the MIT-BIH Arrhythmia database. A 6-fold cross-validation shows that the best detection performance was achieved with a 2-state HMM trained with QRS complex sequences (Sen = 0.668, Spc = 0.360 and F1 = 0.309). We concluded that these univariate sequences provide enough information to characterize the QRS complex dynamics from HMM. Future works are directed to the use of multivariate observations to increase the detection performance.
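The detection rule described in this abstract (slide a window over the signal, score each window by its HMM log-likelihood, and compare with a fixed threshold) can be sketched as below. Everything concrete here is an assumption made for illustration: the Gaussian emissions, the toy model parameters, the window length and the threshold are not the authors' trained values.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def hmm_log_likelihood(obs, start, trans, means, variances):
    """Forward algorithm in the log domain (all probabilities must be > 0)."""
    n = len(start)
    alpha = [math.log(start[s]) + log_gauss(obs[0], means[s], variances[s])
             for s in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[r] + math.log(trans[r][s]) for r in range(n)])
                 + log_gauss(x, means[s], variances[s])
                 for s in range(n)]
    return logsumexp(alpha)

def detect(signal, window, threshold, model):
    """Return start indices of windows whose log-likelihood exceeds the threshold."""
    return [i for i in range(len(signal) - window + 1)
            if hmm_log_likelihood(signal[i:i + window], **model) > threshold]

# Hypothetical 2-state model: an upstroke state followed by a downstroke state
toy_model = {"start": [0.5, 0.5],
             "trans": [[0.7, 0.3], [0.3, 0.7]],
             "means": [1.0, -0.5],
             "variances": [0.1, 0.1]}
toy_signal = [0, 0, 0, 0, 1.0, -0.5, 0, 0, 0]
hits = detect(toy_signal, window=2, threshold=-1.6, model=toy_model)
```

In the study the threshold was tuned with receiver operating characteristic curves. Here it is simply picked by hand so that only the window containing the spike is flagged.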
Barnes, D W
2012-04-01
Two of the most commonly used elasmobranch experimental model species are the spiny dogfish Squalus acanthias and the little skate Leucoraja erinacea. Comparative biology and genomics with these species have provided useful information in physiology, pharmacology, toxicology, immunology, evolutionary developmental biology and genetics. A wealth of information has been obtained using in vitro approaches to study isolated cells and tissues from these organisms under circumstances in which the extracellular environment can be controlled. In addition to classical work with primary cell cultures, continuously proliferating cell lines have been derived recently, representing the first cell lines from cartilaginous fishes. These lines have proved to be valuable tools with which to explore functional genomic and biological questions and to test hypotheses at the molecular level. In genomic experiments, complementary (c)DNA libraries have been constructed, and c. 8000 unique transcripts identified, with over 3000 representing previously unknown gene sequences. A sub-set of messenger (m)RNAs has been detected for which the 3' untranslated regions show elements that are remarkably well conserved evolutionarily, representing novel, potentially regulatory gene sequences. The cell culture systems provide physiologically valid tools to study functional roles of these sequences and other aspects of elasmobranch molecular cell biology and physiology. Information derived from the use of in vitro cell cultures is valuable in revealing gene diversity and information for genomic sequence assembly, as well as for identification of new genes and molecular markers, construction of gene-array probes and acquisition of full-length cDNA sequences. © 2012 The Author. Journal of Fish Biology © 2012 The Fisheries Society of the British Isles.
Wacker, Michael A.
2010-01-01
Borehole geophysical logs were obtained from selected exploratory coreholes in the vicinity of the Florida Power and Light Company Turkey Point Power Plant. The geophysical logging tools used and logging sequences performed during this project are summarized herein to include borehole logging methods, descriptions of the properties measured, types of data obtained, and calibration information.
Massouras, Andreas; Decouttere, Frederik; Hens, Korneel; Deplancke, Bart
2010-01-01
High-throughput sequencing (HTS) is revolutionizing our ability to obtain cheap, fast and reliable sequence information. Many experimental approaches are expected to benefit from the incorporation of such sequencing features in their pipeline. Consequently, software tools that facilitate such an incorporation should be of great interest. In this context, we developed WebPrInSeS, a web server tool allowing automated full-length clone sequence identification and verification using HTS data. WebPrInSeS encompasses two separate software applications. The first is WebPrInSeS-C which performs automated sequence verification of user-defined open-reading frame (ORF) clone libraries. The second is WebPrInSeS-E, which identifies positive hits in cDNA or ORF-based library screening experiments such as yeast one- or two-hybrid assays. Both tools perform de novo assembly using HTS data from any of the three major sequencing platforms. Thus, WebPrInSeS provides a highly integrated, cost-effective and efficient way to sequence-verify or identify clones of interest. WebPrInSeS is available at http://webprinses.epfl.ch/ and is open to all users. PMID:20501601
Proels, Reinhard K; Roitsch, Thomas
2006-03-01
Very few CACTA transposon-like sequences have been described in Solanaceae species. Sequence information has been restricted to partial transposase (TPase)-like fragments, and no target gene of CACTA-like transposon insertion has been described in tomato to date. In this manuscript, we report on a CACTA transposon-like insertion in intron I of tomato (Lycopersicon esculentum) invertase gene Lin5 and TPase-like sequences of several Solanaceae species. Consensus primers deduced from the TPase region of the tomato CACTA transposon-like element allowed the amplification of similar sequences from various Solanaceae species of different subfamilies including Solaneae (Solanum tuberosum), Cestreae (Nicotiana tabacum) and Datureae (Datura stramonium). This demonstrates the ubiquitous presence of CACTA-like elements in Solanaceae genomes. The obtained partial sequences are highly conserved, and allow further detection and detailed analysis of CACTA-like transposons throughout Solanaceae species. CACTA-like transposon sequences make possible the evaluation of their use for genome analysis, functional studies of genes and the evolutionary relationships between plant species.
Combining Physicochemical and Evolutionary Information for Protein Contact Prediction
Schneider, Michael; Brock, Oliver
2014-01-01
We introduce a novel contact prediction method that achieves high prediction accuracy by combining evolutionary and physicochemical information about native contacts. We obtain evolutionary information from multiple-sequence alignments and physicochemical information from predicted ab initio protein structures. These structures represent low-energy states in an energy landscape and thus capture the physicochemical information encoded in the energy function. Such low-energy structures are likely to contain native contacts, even if their overall fold is not native. To differentiate native from non-native contacts in those structures, we develop a graph-based representation of the structural context of contacts. We then use this representation to train a support vector machine classifier to identify most likely native contacts in otherwise non-native structures. The resulting contact predictions are highly accurate. As a result of combining two sources of information (evolutionary and physicochemical), we maintain prediction accuracy even when only few sequence homologs are present. We show that the predicted contacts help to improve ab initio structure prediction. A web service is available at http://compbio.robotics.tu-berlin.de/epc-map/. PMID:25338092
Measuring information-based energy and temperature of literary texts
NASA Astrophysics Data System (ADS)
Chang, Mei-Chu; Yang, Albert C.-C.; Eugene Stanley, H.; Peng, C.-K.
2017-02-01
We apply a statistical method, information-based energy, to quantify informative symbolic sequences. To apply this method to literary texts, it is assumed that different words with different occurrence frequencies are at different energy levels, and that the energy-occurrence frequency distribution obeys a Boltzmann distribution. The temperature within the Boltzmann distribution can be an indicator for the author's writing capacity as the repertory of thoughts. The relative temperature of a text is obtained by comparing the energy-occurrence frequency distributions of words collected from one text versus from all texts of the same author. Combining the relative temperature with the Shannon entropy as the text complexity, the information-based energy of the text is defined and can be viewed as a quantitative evaluation of an author's writing performance. We demonstrate the method by analyzing two authors, Shakespeare in English and Jin Yong in Chinese, and find that their well-known works are associated with higher information-based energies. This method can be used to measure the creativity level of a writer's work in linguistics, and can also quantify symbolic sequences in different systems.
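One ingredient named in this abstract, the Shannon entropy used as a measure of text complexity, is straightforward to compute from word occurrence frequencies. The sketch below covers only that ingredient. Tokenizing on whitespace and using base-2 logarithms are assumptions, and the relative temperature and the information-based energy itself are not reproduced here.

```python
import math
from collections import Counter

def shannon_entropy(words):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical usage on a whitespace-tokenized line of text
sample = "to be or not to be".split()
entropy_bits = shannon_entropy(sample)
```

A text that repeats a single word has zero entropy, while a text in which every word is distinct reaches the maximum of log2(N) bits for N words.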
Simões-Araújo, Jean Luiz; Rumjanek, Norma Gouvêa; Xavier, Gustavo Ribeiro; Zilli, Jerri Édson
The strain BR 3351 T (Bradyrhizobium manausense) was obtained from nodules of cowpea (Vigna unguiculata L. Walp) growing in soil collected from the Amazon rainforest. The strain was observed to have a high capacity to fix nitrogen in symbiosis with cowpea. We report here the draft genome sequence of strain BR 3351 T. The information presented will be important for comparative analysis of nodulation and nitrogen fixation in diazotrophic bacteria. A draft genome of 9,145,311 bp with a GC content of 62.9% was assembled in 127 scaffolds using 100-bp paired-end reads from an Illumina MiSeq system. The RAST annotation identified 8603 coding sequences and 51 RNA genes, classified into 504 subsystems. Published by Elsevier Editora Ltda.
Comparison of Metabolic Pathways in Escherichia coli by Using Genetic Algorithms
Ortegon, Patricia; Poot-Hernández, Augusto C.; Perez-Rueda, Ernesto; Rodriguez-Vazquez, Katya
2015-01-01
In order to understand how cellular metabolism has taken its modern form, the conservation and variations between metabolic pathways were evaluated by using a genetic algorithm (GA). The GA approach considered information on the complete metabolism of the bacterium Escherichia coli K-12, as deposited in the KEGG database, and the enzymes belonging to a particular pathway were transformed into enzymatic step sequences by using the breadth-first search algorithm. These sequences represent contiguous enzymes linked to each other, based on their catalytic activities as they are encoded in the Enzyme Commission numbers. In a posterior step, these sequences were compared using a GA in an all-against-all (pairwise comparisons) approach. Individual reactions were chosen based on their measure of fitness to act as parents of offspring, which constitute the new generation. The sequences compared were used to construct a similarity matrix (of fitness values) that was then considered to be clustered by using a k-medoids algorithm. A total of 34 clusters of conserved reactions were obtained, and their sequences were finally aligned with a multiple-sequence alignment GA optimized to align all the reaction sequences included in each group or cluster. From these comparisons, maps associated with the metabolism of similar compounds also contained similar enzymatic step sequences, reinforcing the Patchwork Model for the evolution of metabolism in E. coli K-12, an observation that can be expanded to other organisms, for which there is metabolism information. Finally, our mapping of these reactions is discussed, with illustrations from a particular case. PMID:25973143
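The breadth-first search step described in the abstract above, which turns a pathway into an ordered enzymatic step sequence, might look roughly like the sketch below. The graph `toy_pathway` and the function `bfs_step_sequence` are hypothetical; the links between the EC numbers are illustrative only and are not taken from KEGG or from the authors' pipeline.

```python
from collections import deque

def bfs_step_sequence(graph, start):
    """Return the nodes of `graph` in breadth-first order starting from `start`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:  # visit each enzyme only once
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical toy pathway: keys are EC numbers, values are downstream steps
toy_pathway = {
    "2.7.1.1": ["5.3.1.9"],
    "5.3.1.9": ["2.7.1.11", "1.1.1.49"],
    "2.7.1.11": ["4.1.2.13"],
    "1.1.1.49": [],
    "4.1.2.13": [],
}
steps = bfs_step_sequence(toy_pathway, "2.7.1.1")
print(steps)
```

The resulting list is the kind of linear sequence that could then be compared pairwise against sequences from other pathways.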
NASA Astrophysics Data System (ADS)
Müller, Vilhelm; Rajer, Fredrika; Frykholm, Karolin; Nyberg, Lena K.; Quaderi, Saair; Fritzsche, Joachim; Kristiansson, Erik; Ambjörnsson, Tobias; Sandegren, Linus; Westerlund, Fredrik
2016-12-01
Bacterial plasmids are extensively involved in the rapid global spread of antibiotic resistance. We here present an assay, based on optical DNA mapping of single plasmids in nanofluidic channels, which provides detailed information about the plasmids present in a bacterial isolate. In a single experiment, we obtain the number of different plasmids in the sample, the size of each plasmid, an optical barcode that can be used to identify and trace the plasmid of interest, and information about which plasmid carries a specific resistance gene. Gene identification is done using CRISPR/Cas9 loaded with a guide-RNA (gRNA) complementary to the gene of interest that linearizes the circular plasmids at a specific location that is identified using the optical DNA maps. We demonstrate the principle on clinically relevant extended spectrum beta-lactamase (ESBL) producing isolates. We discuss how the gRNA sequence can be varied to obtain the desired information. The gRNA can either be very specific to identify a homogeneous group of genes or general to detect several groups of genes at the same time. Finally, we demonstrate an example where we use a combination of two gRNA sequences to identify carbapenemase-encoding genes in two previously uncharacterized clinical bacterial samples.
EuroPineDB: a high-coverage web database for maritime pine transcriptome
2011-01-01
Background Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases. Description EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided.
Conclusions The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome. PMID:21762488
GenomeRNAi: a database for cell-based RNAi phenotypes.
Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael
2007-01-01
RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasingly large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at http://rnai.dkfz.de.
GenomeRNAi: a database for cell-based RNAi phenotypes
Horn, Thomas; Arziman, Zeynep; Berger, Juerg; Boutros, Michael
2007-01-01
RNA interference (RNAi) has emerged as a powerful tool to generate loss-of-function phenotypes in a variety of organisms. Combined with the sequence information of almost completely annotated genomes, RNAi technologies have opened new avenues to conduct systematic genetic screens for every annotated gene in the genome. As increasingly large datasets of RNAi-induced phenotypes become available, an important challenge remains the systematic integration and annotation of functional information. Genome-wide RNAi screens have been performed both in Caenorhabditis elegans and Drosophila for a variety of phenotypes and several RNAi libraries have become available to assess phenotypes for almost every gene in the genome. These screens were performed using different types of assays from visible phenotypes to focused transcriptional readouts and provide a rich data source for functional annotation across different species. The GenomeRNAi database provides access to published RNAi phenotypes obtained from cell-based screens and maps them to their genomic locus, including possible non-specific regions. The database also gives access to sequence information of RNAi probes used in various screens. It can be searched by phenotype, by gene, by RNAi probe or by sequence and is accessible at http://rnai.dkfz.de. PMID:17135194
Heuristics for multiobjective multiple sequence alignment.
Abbasi, Maryam; Paquete, Luís; Pereira, Francisco B
2016-07-15
Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameter settings, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may introduce an undesirable bias in further steps of the analysis and give overly simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments.
Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.
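The two objectives named in the abstract above (substitution score and number of indels) can be illustrated with a small scoring sketch. It assumes a flat match/mismatch score instead of a real substitution matrix and counts every gap character as one indel; `alignment_objectives` and `dominates` are hypothetical helper names, not part of the authors' software.

```python
from itertools import combinations

def alignment_objectives(rows, match=1, mismatch=-1):
    """Return (sum-of-pairs substitution score, number of gap characters)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    subst = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            # Only residue-residue columns contribute to the substitution score
            if x != "-" and y != "-":
                subst += match if x == y else mismatch
    gaps = sum(r.count("-") for r in rows)
    return subst, gaps

def dominates(p, q):
    """True if p is at least as good as q in both objectives and better in one.

    Objective 0 (substitution score) is maximized, objective 1 (gaps) minimized.
    """
    return p[0] >= q[0] and p[1] <= q[1] and p != q

aln = ["AC-GT", "ACGGT", "A--GT"]
score, gap_count = alignment_objectives(aln)
print(score, gap_count)
```

Keeping only the alignments that no other candidate dominates gives the trade-off set from which a practitioner could choose.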
Sequence Composition and Gene Content of the Short Arm of Rye (Secale cereale) Chromosome 1
Fluch, Silvia; Kopecky, Dieter; Burg, Kornel; Šimková, Hana; Taudien, Stefan; Petzold, Andreas; Kubaláková, Marie; Platzer, Matthias; Berenyi, Maria; Krainer, Siegfried; Doležel, Jaroslav; Lelley, Tamas
2012-01-01
Background The purpose of the study is to elucidate the sequence composition of the short arm of rye chromosome 1 (Secale cereale) with special focus on its gene content, because this portion of the rye genome is an integrated part of several hundreds of bread wheat varieties worldwide. Methodology/Principal Findings Multiple Displacement Amplification of 1RS DNA, obtained from flow sorted 1RS chromosomes, using 1RS ditelosomic wheat-rye addition line, and subsequent Roche 454FLX sequencing of this DNA yielded 195,313,589 bp sequence information. This quantity of sequence information resulted in 0.43× sequence coverage of the 1RS chromosome arm, permitting the identification of genes with estimated probability of 95%. A detailed analysis revealed that more than 5% of the 1RS sequence consisted of gene space, identifying at least 3,121 gene loci representing 1,882 different gene functions. Repetitive elements comprised about 72% of the 1RS sequence, Gypsy/Sabrina (13.3%) being the most abundant. More than four thousand simple sequence repeat (SSR) sites mostly located in gene related sequence reads were identified for possible marker development. The existence of chloroplast insertions in 1RS has been verified by identifying chimeric chloroplast-genomic sequence reads. Synteny analysis of 1RS to the full genomes of Oryza sativa and Brachypodium distachyon revealed that about half of the genes of 1RS correspond to the distal end of the short arm of rice chromosome 5 and the proximal region of the long arm of Brachypodium distachyon chromosome 2. Comparison of the gene content of 1RS to 1HS barley chromosome arm revealed high conservation of genes related to chromosome 5 of rice. Conclusions The present study revealed the gene content and potential gene functions on this chromosome arm and demonstrated numerous sequence elements like SSRs and gene-related sequences, which can be utilised for future research as well as in breeding of wheat and rye. PMID:22328922
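As a quick arithmetic check on the figures in the abstract above, the stated yield and coverage imply an arm size, since coverage is the number of sequenced bases divided by the target length. The implied size and the Poisson (Lander-Waterman style) estimate below are derived here under the assumption of uniformly random reads; neither value is stated in the abstract, and the Poisson figure is unrelated to the 95% gene-identification probability the authors report.

```python
import math

sequenced_bp = 195_313_589   # sequence yield reported in the abstract
coverage = 0.43              # reported sequence coverage of 1RS

# coverage = sequenced_bp / arm_size  =>  arm_size = sequenced_bp / coverage
implied_arm_bp = sequenced_bp / coverage

# Assumption: reads land uniformly at random, so the chance that a given
# base is hit at least once is 1 - exp(-coverage).
fraction_bases_covered = 1 - math.exp(-coverage)

print(f"implied arm size: {implied_arm_bp / 1e6:.1f} Mb")
print(f"expected fraction of bases covered: {fraction_bases_covered:.3f}")
```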
Ambers, Angie D; Churchill, Jennifer D; King, Jonathan L; Stoljarova, Monika; Gill-King, Harrell; Assidi, Mourad; Abu-Elmagd, Muhammad; Buhmeida, Abdelbaset; Al-Qahtani, Mohammed; Budowle, Bruce
2016-10-17
Although the primary objective of forensic DNA analyses of unidentified human remains is positive identification, cases involving historical or archaeological skeletal remains often lack reference samples for comparison. Massively parallel sequencing (MPS) offers an opportunity to provide biometric data in such cases, and these cases provide valuable data on the feasibility of applying MPS for characterization of modern forensic casework samples. In this study, MPS was used to characterize 140-year-old human skeletal remains discovered at a historical site in Deadwood, South Dakota, United States. The remains were in an unmarked grave and there were no records or other metadata available regarding the identity of the individual. Due to the high throughput of MPS, a variety of biometric markers could be typed using a single sample. Using MPS and suitable forensic genetic markers, more relevant information could be obtained from a limited quantity and quality sample. Results were obtained for 25/26 Y-STRs, 34/34 Y SNPs, 166/166 ancestry-informative SNPs, 24/24 phenotype-informative SNPs, 102/102 human identity SNPs, 27/29 autosomal STRs (plus amelogenin), and 4/8 X-STRs (as well as ten regions of mtDNA). The Y-chromosome (Y-STR, Y-SNP) and mtDNA profiles of the unidentified skeletal remains are consistent with the R1b and H1 haplogroups, respectively. Both of these haplogroups are the most common haplogroups in Western Europe. Ancestry-informative SNP analysis also supported European ancestry. The genetic results are consistent with anthropological findings that the remains belong to a male of European ancestry (Caucasian). Phenotype-informative SNP data provided strong support that the individual had light red hair and brown eyes. This study is among the first to genetically characterize historical human remains with forensic genetic marker kits specifically designed for MPS. 
The outcome demonstrates that substantially more genetic information can be obtained from the same initial quantities of DNA as that of current CE-based analyses.
Magnetic resonance imaging of the equine temporomandibular joint anatomy.
Rodríguez, M J; Agut, A; Soler, M; López-Albors, O; Arredondo, J; Querol, M; Latorre, R
2010-04-01
In human medicine, magnetic resonance imaging (MRI) is considered the 'gold standard' imaging procedure to assess the temporomandibular joint (TMJ). However, there is no information regarding MRI evaluation of equine TMJ. The objectives were to describe the normal sectional MRI anatomy of equine TMJ by using frozen and plastinated anatomical sections as reference, and to determine the best imaging planes and sequences to visualise TMJ components. TMJs from 6 Spanish Purebred horse cadavers (4 immature and 2 mature) underwent MRI examination. Spin-echo T1-weighting (SE T1W), T2*W, fat-suppressed (FS) proton density-weighting (PDW) and fast spin-echo T2-weighting (FSE T2W) sequences were obtained in oblique sagittal, transverse and dorsal planes. Anatomical sections were procured on the same planes for a thorough interpretation. The oblique sagittal and transverse planes were the most informative anatomical planes. SE T1W images showed excellent spatial resolution and resulted in superior anatomic detail compared with other sequences. The FSE T2W sequence provided an acceptable anatomical depiction, but T2*W and fat-suppressed PDW demonstrated higher contrast in visualisation of the disc, synovial fluid, synovial pouches and articular cartilage. The SE T1W sequence in oblique sagittal and transverse planes should be the baseline to identify anatomy. The T2*W and fat-suppressed PDW sequences enhance the study of the articular cartilage and synovial pouches better than FSE T2W. The information provided in this paper should aid clinicians in the interpretation of MRI images of equine TMJ and assist in the early diagnosis of those problems that could not be diagnosed by other means.
Comparative Sequence Analysis of Multidrug-Resistant IncA/C Plasmids from Salmonella enterica.
Hoffmann, Maria; Pettengill, James B; Gonzalez-Escalona, Narjol; Miller, John; Ayers, Sherry L; Zhao, Shaohua; Allard, Marc W; McDermott, Patrick F; Brown, Eric W; Monday, Steven R
2017-01-01
Determinants of multidrug resistance (MDR) are often encoded on mobile elements, such as plasmids, transposons, and integrons, which have the potential to transfer among foodborne pathogens, as well as to other virulent pathogens, increasing the threats these traits pose to human and veterinary health. Our understanding of MDR among Salmonella has been limited by the lack of closed plasmid genomes for comparisons across resistance phenotypes, due to difficulties in effectively separating the DNA of these high-molecular weight, low-copy-number plasmids from chromosomal DNA. To resolve this problem, we demonstrate an efficient protocol for isolating, sequencing and closing IncA/C plasmids from Salmonella sp. using single molecule real-time sequencing on a Pacific Biosciences (PacBio) RS II Sequencer. We obtained six Salmonella enterica isolates from poultry, representing six different serovars, each exhibiting the MDR-Ampc resistance profile. Salmonella plasmids were obtained using a modified mini preparation and transformed with Escherichia coli DH10Br. A Qiagen Large-Construct kit™ was used to recover highly concentrated and purified plasmid DNA that was sequenced using PacBio technology. These six closed IncA/C plasmids ranged in size from 104 to 191 kb and shared a stable, conserved backbone containing 98 core genes, with only six differences among those core genes. The plasmids encoded a number of antimicrobial resistance genes, including those for quaternary ammonium compounds and mercury. We then compared our six IncA/C plasmid sequences: first with 14 IncA/C plasmids derived from S. enterica available at the National Center for Biotechnology Information (NCBI), and then with an additional 38 IncA/C plasmids derived from different taxa. These comparisons allowed us to build an evolutionary picture of how antimicrobial resistance may be mediated by this common plasmid backbone.
Our project provides detailed genetic information about resistance genes in plasmids, advances in plasmid sequencing, and phylogenetic analyses, and important insights about how MDR evolution occurs across diverse serotypes from different animal sources, particularly in agricultural settings where antimicrobial drug use practices vary.
Álvarez-Cervantes, Jorge; Díaz-Godínez, Gerardo; Mercado-Flores, Yuridia; Gupta, Vijai Kumar; Anducho-Reyes, Miguel Angel
2016-01-01
In this paper, the amino acid sequence of the β-xylanase SRXL1 of Sporisorium reilianum, a pathogenic fungus of maize, was used as a model protein to determine its phylogenetic relationship with other xylanases of Ascomycetes and Basidiomycetes. The information obtained allowed a hypothesis of monophyly and of biological role to be established. A total of 84 amino acid sequences of β-xylanase obtained from the GenBank database were used. Higher-level grouping analysis in the Pfam database showed that the proteins under study were classified into the GH10 and GH11 families, based on the regions of highly conserved amino acids, 233–318 and 180–193 respectively, where glutamate residues are responsible for the catalysis. PMID:27040368
Xu, Yi-Hua; Manoharan, Herbert T; Pitot, Henry C
2007-09-01
The bisulfite genomic sequencing technique is one of the most widely used techniques to study sequence-specific DNA methylation because of its unambiguous ability to reveal DNA methylation status at single-nucleotide resolution. One characteristic feature of the bisulfite genomic sequencing technique is that a number of sample sequence files will be produced from a single DNA sample. The PCR products of bisulfite-treated DNA samples cannot be sequenced directly because they are heterogeneous in nature; therefore they should be cloned into suitable plasmids and then sequenced. This procedure generates an enormous number of sample DNA sequence files and also adds extra bases belonging to the plasmids to the sequence, which will cause problems in the final sequence comparison. Finding the methylation status for each CpG in each sample sequence is not an easy job. CpG PatternFinder was therefore developed for this purpose. The main functions of the CpG PatternFinder are: (i) to analyze the reference sequence to obtain CpG and non-CpG-C residue position information; (ii) to tailor sample sequence files (delete insertions and mark deletions from the sample sequence files) based on a configuration of ClustalW multiple alignment; (iii) to align sample sequence files with a reference file to obtain bisulfite conversion efficiency and CpG methylation status; and (iv) to produce graphics, highlighted aligned sequence text and a summary report which can be easily exported to the Microsoft Office suite. CpG PatternFinder is designed to operate cooperatively with BioEdit, a freeware program available on the internet. It can handle up to 100 files of sample DNA sequences simultaneously, and the total CpG pattern analysis process can be finished in minutes. CpG PatternFinder is an ideal software tool for DNA methylation studies to determine the differential methylation pattern in a large number of individuals in a population.
Previously we developed the CpG Analyzer program; CpG PatternFinder is our further effort to create software tools for DNA methylation studies.
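Functions (i) and (iii) in the abstract above can be sketched in a few lines. This is not CpG PatternFinder's code; `cpg_positions` and `call_methylation` are hypothetical names, and the sketch assumes the read is already aligned to the reference with no insertions or deletions and that only the forward strand is considered.

```python
def cpg_positions(ref):
    """0-based positions of the C in every CpG dinucleotide of `ref`."""
    ref = ref.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]

def call_methylation(ref, read):
    """Classify each CpG and estimate bisulfite conversion efficiency.

    After bisulfite treatment an unmethylated C reads as T, while a
    methylated C stays C. Non-CpG cytosines are assumed unmethylated, so
    the fraction of them read as T estimates the conversion efficiency.
    """
    ref, read = ref.upper(), read.upper()
    if len(ref) != len(read):
        raise ValueError("read must be aligned to the reference (equal length)")
    cpg = set(cpg_positions(ref))
    calls = {}
    converted = non_cpg_c = 0
    for i, base in enumerate(ref):
        if base != "C":
            continue
        if i in cpg:
            calls[i] = {"C": "methylated", "T": "unmethylated"}.get(read[i], "unknown")
        else:
            non_cpg_c += 1
            converted += read[i] == "T"
    efficiency = converted / non_cpg_c if non_cpg_c else None
    return calls, efficiency

# Toy example (hypothetical sequences)
calls, efficiency = call_methylation("ACGTCCGATC", "ACGTTTGATT")
print(calls, efficiency)
```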
Seo, Dong-Won; Oh, Jae-Don; Jin, Shil; Song, Ki-Duk; Park, Hee-Bok; Heo, Kang-Nyeong; Shin, Younhee; Jung, Myunghee; Park, Junhyung; Jo, Cheorun; Lee, Hak-Kyo; Lee, Jun-Heon
2015-02-01
There are five native chicken lines in Korea, which are mainly classified by plumage colors (black, white, red, yellow, gray). These five lines are very important genetic resources in the Korean poultry industry. Based on a next generation sequencing technology, whole genome sequence and reference assemblies were performed using Gallus_gallus_4.0 (NCBI) with whole genome sequences from these lines to identify common and novel single nucleotide polymorphisms (SNPs). We obtained 36,660,731,136 ± 1,257,159,120 bp of raw sequence and average 26.6-fold of 25-29 billion reference assembly sequences representing 97.288 % coverage. Also, 4,006,068 ± 97,534 SNPs were observed from 29 autosomes and the Z chromosome and, of these, 752,309 SNPs are the common SNPs across lines. Among the identified SNPs, the number of novel- and known-location assigned SNPs was 1,047,951 ± 14,956 and 2,948,648 ± 81,414, respectively. The number of unassigned known SNPs was 1,181 ± 150 and unassigned novel SNPs was 8,238 ± 1,019. Synonymous SNPs, non-synonymous SNPs, and SNPs having character changes were 26,266 ± 1,456, 11,467 ± 604, 8,180 ± 458, respectively. Overall, 443,048 ± 26,389 SNPs in each bird were identified by comparing with dbSNP in NCBI. The presently obtained genome sequence and SNP information in Korean native chickens have wide applications for further genome studies such as genetic diversity studies to detect causative mutations for economic and disease related traits.
SOMKE: kernel density estimation over data streams by sequences of self-organizing maps.
Cao, Yuan; He, Haibo; Man, Hong
2012-08-01
In this paper, we propose a novel method, SOMKE, for kernel density estimation (KDE) over data streams based on sequences of self-organizing maps (SOMs). In many stream data mining applications, the traditional KDE methods are infeasible because of the high computational cost, processing time, and memory requirement. To reduce the time and space complexity, we propose a SOM structure in this paper to obtain well-defined data clusters to estimate the underlying probability distributions of incoming data streams. The main idea of this paper is to build a series of SOMs over the data streams via two operations, that is, creating and merging the SOM sequences. The creation phase produces the SOM sequence entries for windows of the data, which obtains clustering information of the incoming data streams. The size of the SOM sequences can be further reduced by combining the consecutive entries in the sequence based on the measure of Kullback-Leibler divergence. Finally, the probability density functions over arbitrary time periods along the data streams can be estimated using such SOM sequences. We compare SOMKE with two other KDE methods for data streams, the M-kernel approach and the cluster kernel approach, in terms of accuracy and processing time for various stationary data streams. Furthermore, we also investigate the use of SOMKE over nonstationary (evolving) data streams, including a synthetic nonstationary data stream, a real-world financial data stream and a group of network traffic data streams. The simulation results illustrate the effectiveness and efficiency of the proposed approach.
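The general idea of estimating a density from a small set of cluster prototypes, instead of from every stream sample, can be illustrated with a weighted Gaussian KDE. This is a one-dimensional sketch under stated assumptions and not the SOMKE algorithm: the prototypes, hit counts and fixed bandwidth are hypothetical, and no SOM training or Kullback-Leibler merging is performed.

```python
import math

def weighted_gaussian_kde(x, centers, weights, bandwidth):
    """Density at x from Gaussian kernels placed on weighted prototypes."""
    total = sum(weights)
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return sum(
        (w / total) * norm * math.exp(-0.5 * ((x - c) / bandwidth) ** 2)
        for c, w in zip(centers, weights)
    )

# Hypothetical prototypes summarizing one window of a stream:
# each center stands in for `weight` original samples.
centers = [-1.0, 0.0, 2.5]
hits = [30, 50, 20]
density_at_zero = weighted_gaussian_kde(0.0, centers, hits, bandwidth=0.5)
print(density_at_zero)
```

Storing three prototypes and their hit counts in place of one hundred samples is the kind of memory saving that motivates prototype-based estimators on streams.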
Personalized Oncology Through Integrative High-Throughput Sequencing: A Pilot Study
Roychowdhury, Sameek; Iyer, Matthew K.; Robinson, Dan R.; Lonigro, Robert J.; Wu, Yi-Mi; Cao, Xuhong; Kalyana-Sundaram, Shanker; Sam, Lee; Balbin, O. Alejandro; Quist, Michael J.; Barrette, Terrence; Everett, Jessica; Siddiqui, Javed; Kunju, Lakshmi P.; Navone, Nora; Araujo, John C.; Troncoso, Patricia; Logothetis, Christopher J.; Innis, Jeffrey W.; Smith, David C.; Lao, Christopher D.; Kim, Scott Y.; Roberts, J. Scott; Gruber, Stephen B.; Pienta, Kenneth J.; Talpaz, Moshe; Chinnaiyan, Arul M.
2012-01-01
Individual cancers harbor a set of genetic aberrations that can be informative for identifying rational therapies currently available or in clinical trials. We implemented a pilot study to explore the practical challenges of applying high-throughput sequencing in clinical oncology. We enrolled patients with advanced or refractory cancer who were eligible for clinical trials. For each patient, we performed whole-genome sequencing of the tumor, targeted whole-exome sequencing of tumor and normal DNA, and transcriptome sequencing (RNA-Seq) of the tumor to identify potentially informative mutations in a clinically relevant time frame of 3 to 4 weeks. With this approach, we detected several classes of cancer mutations including structural rearrangements, copy number alterations, point mutations, and gene expression alterations. A multidisciplinary Sequencing Tumor Board (STB) deliberated on the clinical interpretation of the sequencing results obtained. We tested our sequencing strategy on human prostate cancer xenografts. Next, we enrolled two patients into the clinical protocol and were able to review the results at our STB within 24 days of biopsy. The first patient had metastatic colorectal cancer in which we identified somatic point mutations in NRAS, TP53, AURKA, FAS, and MYH11, plus amplification and overexpression of cyclin-dependent kinase 8 (CDK8). The second patient had malignant melanoma, in which we identified a somatic point mutation in HRAS and a structural rearrangement affecting CDKN2C. The STB identified the CDK8 amplification and Ras mutation as providing a rationale for clinical trials with CDK inhibitors or MEK (mitogen-activated or extracellular signal–regulated protein kinase kinase) and PI3K (phosphatidylinositol 3-kinase) inhibitors, respectively. Integrative high-throughput sequencing of patients with advanced cancer generates a comprehensive, individual mutational landscape to facilitate biomarker-driven clinical trials in oncology.
PMID:22133722
2012-09-01
meaning. Information (Know-what): The interpretation of a sequence of elements or in this example, ingredients such as flour , water, sugar, spices, and...the current situation. In addition, obtaining expertise from external specialty sources enriches knowledge and enhances the ability to take action
An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data.
Iwasaki, Yuki; Abe, Takashi; Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi
2017-09-12
Unsupervised data mining capable of extracting a wide range of knowledge from big data without prior knowledge or particular models is a timely application in the era of big sequence data accumulation in genome research. By handling oligonucleotide compositions as high-dimensional data, we have previously modified the conventional self-organizing map (SOM) for genome informatics and established BLSOM, which can analyze more than ten million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes (tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible for the amino acid-dependent clustering, as well as other functionally and structurally important consensus motifs, which have been evolutionarily conserved. BLSOM is also useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we constructed BLSOM with 'species-unknown' tDNAs from metagenomic sequences plus 'species-known' microbial tDNAs, a large portion of metagenomic tDNAs self-organized with species-known tDNAs, yielding information on microbial communities in environmental samples. BLSOM can also enhance accuracy in the tDNA database obtained from big sequence data. This unsupervised data mining should become important for studying numerous functionally unclear RNAs obtained from a wide range of organisms.
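The input representation described in the abstract above, an oligonucleotide composition treated as a high-dimensional vector, can be sketched as below. The function `kmer_composition` is a hypothetical helper that computes plain tetranucleotide frequencies; it does not implement BLSOM, and any strand-merging or normalization the authors apply is not reproduced.

```python
from collections import Counter
from itertools import product

def kmer_composition(seq, k=4):
    """Frequency vector over all 4**k k-mers of the alphabet ACGT.

    Windows containing other characters (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

# Toy sequence (hypothetical); a real tDNA would be roughly 70-90 nt long
vector = kmer_composition("ACGTACGT", k=4)
print(len(vector), max(vector))
```

Each sequence becomes one point in a 256-dimensional space for k = 4 (1024 dimensions for k = 5), which is the form a self-organizing map takes as input.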
From cultured to uncultured genome sequences: metagenomics and modeling microbial ecosystems.
Garza, Daniel R; Dutilh, Bas E
2015-11-01
Microorganisms and the viruses that infect them are the most numerous biological entities on Earth and enclose its greatest biodiversity and genetic reservoir. With strength in their numbers, these microscopic organisms are major players in the cycles of energy and matter that sustain all life. Scientists have only scratched the surface of this vast microbial world through culture-dependent methods. Recent developments in generating metagenomes, large random samples of nucleic acid sequences isolated directly from the environment, are providing comprehensive portraits of the composition, structure, and functioning of microbial communities. Moreover, advances in metagenomic analysis have created the possibility of obtaining complete or nearly complete genome sequences from uncultured microorganisms, providing important means to study their biology, ecology, and evolution. Here we review some of the recent developments in the field of metagenomics, focusing on the discovery of genetic novelty and on methods for obtaining uncultured genome sequences, including through the recycling of previously published datasets. Moreover we discuss how metagenomics has become a core scientific tool to characterize eco-evolutionary patterns of microbial ecosystems, thus allowing us to simultaneously discover new microbes and study their natural communities. We conclude by discussing general guidelines and challenges for modeling the interactions between uncultured microorganisms and viruses based on the information contained in their genome sequences. These models will significantly advance our understanding of the functioning of microbial ecosystems and the roles of microbes in the environment.
Khew, Gillian Su-Wen; Chia, Tet Fatt
2011-01-01
Background and aims: The popular hybrid orchid Vanda Miss Joaquim was made Singapore's national flower in 1981. It was originally described in the Gardeners’ Chronicle in 1893, as a cross between Vanda hookeriana and Vanda teres. However, no record had been kept as to which parent contributed the pollen. This study was conducted using DNA barcoding techniques to determine the pod parent of V. Miss Joaquim, thereby inferring the pollen parent of the hybrid by exclusion. Methodology: Two chloroplast genes, matK and rbcL, from five related taxa, V. hookeriana, V. teres var. alba, V. teres var. andersonii, V. teres var. aurorea and V. Miss Joaquim ‘Agnes’, were sequenced. The matK gene from herbarium specimens of V. teres and V. Miss Joaquim, both collected in 1893, was also sequenced. Principal results: No sequence variation was found in the 600-bp region of rbcL sequenced. Sequence variation was found in the matK gene of V. hookeriana, V. teres var. alba, V. teres var. aurorea and V. Miss Joaquim ‘Agnes’. Complete sequence identity was established between V. teres var. andersonii and V. Miss Joaquim ‘Agnes’. The matK sequences obtained from the herbarium specimens of V. teres and V. Miss Joaquim were completely identical to the sequences obtained from the fresh samples of V. teres var. andersonii and V. Miss Joaquim ‘Agnes’. Conclusions: The pod parent of V. Miss Joaquim ‘Agnes’ is V. teres var. andersonii and, by exclusion, the pollen parent is V. hookeriana. The herbarium and fresh samples of V. teres var. andersonii and V. Miss Joaquim share the same inferred maternity. The matK gene was more informative than rbcL and facilitated differentiation of varieties of V. teres. PMID:22476488
Statistical Inference in Hidden Markov Models Using k-Segment Constraints
Titsias, Michalis K.; Holmes, Christopher C.; Yau, Christopher
2016-01-01
Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward–backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online. PMID:27226674
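For orientation, the unconstrained baseline mentioned above, the Viterbi MAP path, can be sketched as follows. This is the standard textbook recursion in log space, not the k-segment algorithms of the article; the helper names (`log_table`, `viterbi`, `count_segments`) and the dictionary-based model layout are assumptions made for illustration.

```python
import math


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def count_segments(path):
    """Number of segments: one plus the number of state changes along the path."""
    return 1 + sum(a != b for a, b in zip(path, path[1:]))
```

`count_segments` is the quantity a k-segment constraint would condition on; enforcing such a constraint requires the extended recursions described in the article, which this sketch does not attempt.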
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods compare the unseen sequence with all identified protein sequences and return the category index of the protein with the highest similarity score, which is usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single-hidden-layer feedforward networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimally pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensemble members. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
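The majority-voting step that combines the ensemble members can be sketched in a few lines. The sketch assumes each classifier has already produced one label per sequence; training the networks themselves is out of scope here, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of label lists, one list per classifier, all the same length.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the label with the highest count; with equal
        # counts, the label seen first (earliest classifier) is returned.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

For example, three classifiers predicting `[0, 1, 2]`, `[0, 1, 1]` and `[1, 1, 2]` for three sequences would be combined into `[0, 1, 2]`.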
Microsatellite analysis in the genome of Acanthaceae: An in silico approach.
Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar
2015-01-01
Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. Microsatellites present in complete genome sequences can play a direct role in genome organization, recombination, gene regulation, quantitative genetic variation, and the evolution of genes. The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. The whole nucleotide sequences of Acanthaceae species were obtained from the National Center for Biotechnology Information database and screened for the presence of SSRs. The SSR Locator tool was used to predict the microsatellites and its built-in Primer3 module was used for primer design. In total, 110 repeats were identified from 108 sequences of Acanthaceae family plant genomes, and dinucleotide repeats were found to be abundant in the genome sequences. The essential amino acid isoleucine was found to be abundant in all the sequences. We also designed SSR-based primers/markers for 59 sequences of this family that contain microsatellite repeats in their genome. The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to the Acanthaceae family in the future.
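Screening a sequence for perfect microsatellites can be illustrated with a regular expression. This is a minimal sketch and not the SSR Locator algorithm; the function name `find_ssrs` and the default threshold of five repeats are assumptions chosen for the example.

```python
import re


def find_ssrs(seq, motif_len=2, min_repeats=5):
    """Find perfect tandem repeats of a motif of the given length.

    Returns a list of (start, motif, repeat_count) tuples.
    """
    seq = seq.upper()
    # One copy of the motif followed by at least (min_repeats - 1) further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        # Skip motifs that are themselves repeats of a shorter unit (e.g. "AA"),
        # since those belong to a shorter repeat class.
        if any(motif == motif[:d] * (motif_len // d)
               for d in range(1, motif_len) if motif_len % d == 0):
            continue
        hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return hits
```

Calling `find_ssrs` with `motif_len=2`, `3`, and so on gives per-class counts comparable in spirit to the dinucleotide and trinucleotide tallies reported in such studies.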
Mapping the Space of Genomic Signatures
Kari, Lila; Hill, Kathleen A.; Sayem, Abu S.; Karamichalis, Rallis; Bryans, Nathaniel; Davis, Katelyn; Dattani, Nikesh S.
2015-01-01
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and that the sequence most different from it in this dataset belongs to a cucumber. PMID:26000734
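The Chaos Game Representation itself is simple to compute, and a hedged sketch may help make the idea concrete. The corner assignment below (A, C, G, T at the four corners of the unit square) is one common convention and an assumption here, as are the function names; the DSSIM image distance and the MDS step are not reproduced.

```python
def cgr_points(seq):
    """Return the CGR point for each base of a DNA sequence."""
    # Assumed corner convention; other layouts are also in use.
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in seq.upper():
        if base not in corners:
            continue  # skip ambiguous bases such as N
        cx, cy = corners[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, k=3):
    """Bin CGR points into a 2**k x 2**k grid of counts (a grayscale image)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Because each grid cell at resolution 2^k collects points whose last k moves were the same, the image approximates a k-mer count table, which is why comparing two such images implicitly compares oligomer occurrences.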
Characterization of Austrian koi herpesvirus samples based on the ORF40 region.
Marek, A; Schachner, O; Bilic, I; Hess, M
2010-02-17
Using a PCR that amplifies a region of the thymidine kinase (TK) gene, an epidemic spread of koi herpesvirus (KHV) was determined in koi carps in Austria in 2007. A total of 15 virus samples from different locations in Austria were analyzed to determine their genetic relatedness following PCR and nucleic acid sequencing of the open reading frame 40 (ORF40) region of the KHV genome. ORF40-specific PCR amplification products that were obtained from tissue samples shared 100% nucleotide sequence identity with the published sequence of the Japanese strain of KHV. The ORF40 sequence of one isolate from the UK that was included in the present study was 100% identical with the published sequence of an Israeli strain of KHV. This is the first study that used a larger number of samples and a PCR method, which allowed distinguishing all 3 strains of KHV. The present investigation provides information on the epidemiology of KHV infections in Europe and describes a useful molecular tool for epidemiological studies.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2007-01-01
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov).
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2005-01-01
GenBank is a comprehensive database that contains publicly available DNA sequences for more than 165,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2006-01-01
GenBank® is a comprehensive database that contains publicly available DNA sequences for more than 205 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the Web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at www.ncbi.nlm.nih.gov.
Ruppitsch, W; Stöger, A; Indra, A; Grif, K; Schabereiter-Gurtner, C; Hirschl, A; Allerberger, F
2007-03-01
In a bioterrorism event a rapid tool is needed to identify relevant dangerous bacteria. The aim of the study was to assess the usefulness of partial 16S rRNA gene sequence analysis and the suitability of diverse databases for identifying dangerous bacterial pathogens. For rapid identification purposes a 500-bp fragment of the 16S rRNA gene of 28 isolates comprising Bacillus anthracis, Brucella melitensis, Burkholderia mallei, Burkholderia pseudomallei, Francisella tularensis, Yersinia pestis, and eight genus-related and unrelated control strains was amplified and sequenced. The obtained sequence data were submitted to three public and two commercial sequence databases for species identification. The most frequent reason for incorrect identification was the lack of the respective 16S rRNA gene sequences in the database. Sequence analysis of a 500-bp 16S rDNA fragment allows the rapid identification of dangerous bacterial species. However, for discrimination of closely related species sequencing of the entire 16S rRNA gene, additional sequencing of the 23S rRNA gene or sequencing of the 16S-23S rRNA intergenic spacer is essential. This work provides comprehensive information on the suitability of partial 16S rDNA analysis and diverse databases for rapid and accurate identification of dangerous bacterial pathogens.
Information Security Scheme Based on Computational Temporal Ghost Imaging.
Jiang, Shan; Wang, Yurong; Long, Tao; Meng, Xiangfeng; Yang, Xiulun; Shu, Rong; Sun, Baoqing
2017-08-09
An information security scheme based on computational temporal ghost imaging is proposed. A sequence of independent 2D random binary patterns is used as the encryption key and multiplied with the 1D data stream. The cipher text is obtained by summing the weighted encryption key. Decryption is realized by a correlation measurement between the encrypted information and the encryption key. Owing to the intrinsic high-level randomness of the key, the security of this method is well guaranteed. The feasibility of the method and its robustness against both occlusion and additive noise attacks are examined by simulation.
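The encrypt-by-weighted-sum and decrypt-by-correlation idea can be illustrated numerically. The sketch below is a toy model under stated assumptions: the 2D patterns are flattened to 1D pixel lists, the data stream is binary, and the decision threshold of 0.125 (half the variance of a fair binary pattern) is chosen for this example. None of the function names come from the paper.

```python
import random


def make_key(n_bits, n_pixels, seed):
    """One random binary pattern (flattened 2D image) per data bit."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_pixels)] for _ in range(n_bits)]


def encrypt(bits, key):
    """Cipher text: pixel-wise sum of the patterns weighted by the data bits."""
    n_pixels = len(key[0])
    return [sum(b * pattern[p] for b, pattern in zip(bits, key))
            for p in range(n_pixels)]


def decrypt(cipher, key):
    """Recover each bit from the covariance between its pattern and the cipher."""
    n = len(cipher)
    mean_c = sum(cipher) / n
    bits = []
    for pattern in key:
        mean_p = sum(pattern) / n
        cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(pattern, cipher)) / n
        # A pattern that contributed to the cipher has covariance near 0.25,
        # one that did not has covariance near 0.
        bits.append(1 if cov > 0.125 else 0)
    return bits
```

With far more pixels than data bits (for instance 4096 pixels for 8 bits), the cross-correlation between different random patterns is small, so the thresholded covariance separates the two cases cleanly; without the key patterns the summed image carries little usable structure.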
Smit, Kyra N; van Poppelen, Natasha M; Vaarwater, Jolanda; Verdijk, Robert; van Marion, Ronald; Kalirai, Helen; Coupland, Sarah E; Thornton, Sophie; Farquhar, Neil; Dubbink, Hendrikus-Jan; Paridaens, Dion; de Klein, Annelies; Kiliç, Emine
2018-05-01
Uveal melanoma is a highly aggressive cancer of the eye, in which nearly 50% of the patients die from metastasis. It is the most common type of primary eye cancer in adults. Chromosome and mutation status have been shown to correlate with the disease-free survival. Loss of chromosome 3 and inactivating mutations in BAP1, which is located on chromosome 3, are strongly associated with 'high-risk' tumors that metastasize early. Other genes often involved in uveal melanoma are SF3B1 and EIF1AX, which are found to be mutated in intermediate- and low-risk tumors, respectively. To obtain genetic information of all genes in one test, we developed a targeted sequencing method that can detect mutations in uveal melanoma genes and chromosomal anomalies in chromosome 1, 3, and 8. With as little as 10 ng DNA, we obtained enough coverage on all genes to detect mutations, such as substitutions, deletions, and insertions. These results were validated with Sanger sequencing in 28 samples. In >90% of the cases, the BAP1 mutation status corresponded to the BAP1 immunohistochemistry. The results obtained in the Ion Torrent single-nucleotide polymorphism assay were confirmed with several other techniques, such as fluorescence in situ hybridization, multiplex ligation-dependent probe amplification, and Illumina SNP array. By validating our assay in 27 formalin-fixed paraffin-embedded and 43 fresh uveal melanomas, we show that mutations and chromosome status can reliably be obtained using targeted next-generation sequencing. Implementing this technique as a diagnostic pathology application for uveal melanoma will allow prediction of the patients' metastatic risk and potentially assess eligibility for new therapies.
Obregón, Walter D; Liggieri, Constanza S; Trejo, Sebastian A; Avilés, Francesc X; Vairo-Cavalli, Sandra E; Priolo, Nora S
2009-01-01
Latices from Asclepias spp are used in wound healing and the treatment of some digestive disorders. These pharmacological actions have been attributed to the presence of cysteine proteases in these milky latices. Asclepias curassavica (Asclepiadaceae), "scarlet milkweed" is a perennial subshrub native to South America. In the current paper we report a new approach directed at the selective biochemical and molecular characterization of asclepain cI (acI) and asclepain cII (acII), the enzymes responsible for the proteolytic activity of the scarlet milkweed latex. SDS-PAGE spots of both purified peptidases were digested with trypsin and Peptide Mass Fingerprints (PMFs) obtained showed no equivalent peptides. No identification was possible by MASCOT search due to the paucity of information concerning Asclepiadaceae latex cysteine proteinases available in databases. From total RNA extracted from latex samples, cDNA of both peptidases was obtained by RT-PCR using degenerate primers encoding Asclepiadaceae cysteine peptidase conserved domains. Theoretical PMFs of partial polypeptide sequences obtained by cloning (186 and 185 amino acids) were compared with empirical PMFs, confirming that the sequences of 186 and 185 amino acids correspond to acI and acII, respectively. N-terminal sequences of acI and acII, characterized by Edman sequencing, were overlapped with those coming from the cDNA to obtain the full-length sequence of both mature peptidases (212 and 211 residues respectively). Alignment and phylogenetic analysis confirmed that acI and acII belong to the subfamily C1A forming a new group of papain-like cysteine peptidases together with asclepain f from Asclepias fruticosa. 
We conclude that PMF could be adopted as an excellent tool to differentiate, in a fast and unequivocal way, peptidases with very similar physicochemical and functional properties, with advantages over other conventional methods (for instance enzyme kinetics) that are time consuming and afford less reliable results.
“One code to find them all”: a perl tool to conveniently parse RepeatMasker output files
2014-01-01
Background: Of the different bioinformatic methods used to recover transposable elements (TEs) in genome sequences, one of the most commonly used procedures is the homology-based method proposed by the RepeatMasker program. RepeatMasker generates several output files, including the .out file, which provides annotations for all detected repeats in a query sequence. However, a remaining challenge consists of identifying the different copies of TEs that correspond to the identified hits. This step is essential for any evolutionary/comparative analysis of the different copies within a family. Different possibilities can lead to multiple hits corresponding to a unique copy of an element, such as the presence of large deletions/insertions or undetermined bases, and distinct consensus corresponding to a single full-length sequence (as for long terminal repeat (LTR)-retrotransposons). These possibilities must be taken into account to determine the exact number of TE copies. Results: We have developed a Perl tool that parses the RepeatMasker .out file to better determine the number and positions of TE copies in the query sequence, in addition to computing quantitative information for the different families. To determine the accuracy of the program, we tested it on several RepeatMasker .out files corresponding to two organisms (Drosophila melanogaster and Homo sapiens) for which the TE content has already been largely described and which present great differences in genome size, TE content, and TE families. Conclusions: Our tool provides access to detailed information concerning the TE content in a genome at the family level from the .out file of RepeatMasker. This information includes the exact position and orientation of each copy, its proportion in the query sequence, and its quality compared to the reference element. In addition, our tool allows a user to directly retrieve the sequence of each copy and obtain the same detailed information at the family level when a local library with incomplete TE class/subclass information was used with RepeatMasker. We hope that this tool will be helpful for people working on the distribution and evolution of TEs within genomes.
de Andrade, Roberto R S; Vaslin, Maite F S
2014-03-07
Background: Next-generation parallel sequencing (NGS) allows the identification of viral pathogens by sequencing the small RNAs of infected hosts. Thus, viral genomes may be assembled from host immune response products without prior virus enrichment, amplification or purification. However, mapping the vast amount of information obtained presents a bioinformatics challenge. Methods: To bypass the need for the command line and basic bioinformatics knowledge, we developed mapping software with a graphical interface for assembling viral genomes from small RNA datasets obtained by NGS. SearchSmallRNA was developed in Java version 7 using the NetBeans IDE 7.1. The program also allows analysis of the viral small interfering RNA (vsRNA) profile, providing an overview of the size distribution and other features of the vsRNAs produced in infected cells. Results: The program compares each sequenced read in a library with a chosen reference genome. Reads whose Hamming distance to the reference is smaller than or equal to an allowed number of mismatches are selected as positives and used to assemble a long nucleotide genome sequence. To validate the software, distinct analyses using NGS datasets obtained from HIV and two plant viruses were used to reconstruct whole viral genomes. Conclusions: SearchSmallRNA was able to reconstruct viral genomes from NGS small RNA datasets with a high degree of reliability, so it will be a valuable tool for virus sequencing and discovery. It is accessible and free to all research communities and has the advantage of an easy-to-use graphical interface. Availability and implementation: SearchSmallRNA was written in Java and is freely available at http://www.microbiologia.ufrj.br/ssrna/. PMID:24607237
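The read-selection rule described in the abstract (keep a read if its Hamming distance to some window of the reference is within an allowed number of mismatches) can be sketched as below. This is a brute-force illustration in Python rather than the tool's Java implementation; all function names and the default of one allowed mismatch are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, reference, max_mismatches=1):
    """Return (position, read) pairs for every acceptable placement of a read."""
    hits = []
    for read in reads:
        # Slide the read along the reference and test every window.
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.append((pos, read))
    return hits


def coverage(hits, ref_len):
    """Per-base depth of the reference given the accepted placements."""
    depth = [0] * ref_len
    for pos, read in hits:
        for i in range(pos, pos + len(read)):
            depth[i] += 1
    return depth
```

The coverage profile indicates which parts of the reference are supported by reads; a real assembler would additionally call a consensus base at each covered position and would use an index rather than an exhaustive scan.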
Detection of Different DNA Animal Species in Commercial Candy Products.
Muñoz-Colmenero, Marta; Martínez, Jose Luis; Roca, Agustín; Garcia-Vazquez, Eva
2016-03-01
Candy products are consumed all across the world, but there is not much information about their composition. In this study we used a DNA-based approach to determine the animal species occurring in 40 commercial candies of different types. We extracted DNA and performed PCR amplification, cloning and sequencing to obtain species-informative DNA sequences. Eight species were identified, including fish (hake and anchovy) in 22% of the products analyzed. Bovine and porcine DNA were the most abundant, each appearing in 27 samples. Most products contained a mixture of species. Marshmallows (7), jelly-types, and gummies (20) contained a significantly higher number of species than hard candies (9). We demonstrated the presence of DNA from animal species in candy products, information that allows consumers to make choices and prevent allergic reactions. © 2016 Institute of Food Technologists®
Mumps virus F gene and HN gene sequencing as a molecular tool to study mumps virus transmission.
Gouma, Sigrid; Cremer, Jeroen; Parkkali, Saara; Veldhuijzen, Irene; van Binnendijk, Rob S; Koopmans, Marion P G
2016-11-01
Various mumps outbreaks have occurred in the Netherlands since 2004, particularly among persons who had received 2 doses of measles, mumps, and rubella (MMR) vaccination. Genomic typing of pathogens can be used to track outbreaks, but the established genotyping of mumps virus based on the small hydrophobic (SH) gene sequences did not provide sufficient resolution. Therefore, we expanded the sequencing to include fusion (F) gene and haemagglutinin-neuraminidase (HN) gene sequences in addition to the SH gene sequences from 109 mumps virus genotype G strains obtained between 2004 and mid 2015 in the Netherlands. When the molecular information from these 3 genes was combined, we were able to identify separate mumps virus clusters and track mumps virus transmission. The analyses suggested that multiple mumps virus introductions occurred in the Netherlands between 2004 and 2015 resulting in several mumps outbreaks throughout this period, whereas during some local outbreaks the molecular data pointed towards endemic circulation. Combined analysis of epidemiological data and sequence data collected in 2015 showed good support for the phylogenetic clustering. Copyright © 2016 Elsevier B.V. All rights reserved.
NGS Catalog: A Database of Next Generation Sequencing Studies in Humans
Xia, Junfeng; Wang, Qingguo; Jia, Peilin; Wang, Bing; Pao, William; Zhao, Zhongming
2015-01-01
Next generation sequencing (NGS) technologies have been rapidly applied in biomedical and biological research since their advent only a few years ago, and they are expected to advance at an unprecedented pace in the following years. To provide the research community with a comprehensive NGS resource, we have developed the database Next Generation Sequencing Catalog (NGS Catalog, http://bioinfo.mc.vanderbilt.edu/NGS/index.html), a continually updated database that collects, curates and manages available human NGS data obtained from published literature. NGS Catalog deposits publication information of NGS studies and their mutation characteristics (SNVs, small insertions/deletions, copy number variations, and structural variants), as well as mutated genes and gene fusions detected by NGS. Other functions include user data upload, NGS general analysis pipelines, and NGS software. NGS Catalog is particularly useful for investigators who are new to NGS but would like to take advantage of these powerful technologies for their own research. Finally, based on the data deposited in NGS Catalog, we summarized features and findings from whole exome sequencing, whole genome sequencing, and transcriptome sequencing studies for human diseases or traits. PMID:22517761
Sequence Analysis and Domain Motifs in the Porcine Skin Decorin Glycosaminoglycan Chain
Zhao, Xue; Yang, Bo; Solakylidirim, Kemal; Joo, Eun Ji; Toida, Toshihiko; Higashi, Kyohei; Linhardt, Robert J.; Li, Lingyun
2013-01-01
Decorin proteoglycan comprises a core protein containing a single O-linked dermatan sulfate/chondroitin sulfate glycosaminoglycan (GAG) chain. Although the sequence of the decorin core protein is determined by the gene encoding its structure, the structure of its GAG chain is determined in the Golgi. The recent application of modern MS to bikunin, a far simpler chondroitin sulfate proteoglycan, suggests that it has a single or small number of defined sequences. On this basis, a similar approach was undertaken to sequence the much larger and more structurally complex dermatan sulfate/chondroitin sulfate GAG chain of porcine skin decorin. This approach yielded information on the consistency/variability of its linkage region at the reducing end of the GAG chain, its iduronic acid-rich domain, glucuronic acid-rich domain, and non-reducing end. A general motif for the porcine skin decorin GAG chain was established. A single small decorin GAG chain was sequenced using MS/MS analysis. The data obtained in the study suggest that the decorin GAG chain has a small or a limited number of sequences. PMID:23423381
Drummond, Alexei J; Nicholls, Geoff K; Rodrigo, Allen G; Solomon, Wiremu
2002-01-01
Molecular sequences obtained at different sampling times from populations of rapidly evolving pathogens and from ancient subfossil and fossil sources are increasingly available with modern sequencing technology. Here, we present a Bayesian statistical inference approach to the joint estimation of mutation rate and population size that incorporates the uncertainty in the genealogy of such temporally spaced sequences by using Markov chain Monte Carlo (MCMC) integration. The Kingman coalescent model is used to describe the time structure of the ancestral tree. We recover information about the unknown true ancestral coalescent tree, population size, and the overall mutation rate from temporally spaced data, that is, from nucleotide sequences gathered at different times, from different individuals, in an evolving haploid population. We briefly discuss the methodological implications and show what can be inferred, in various practically relevant states of prior knowledge. We develop extensions for exponentially growing population size and joint estimation of substitution model parameters. We illustrate some of the important features of this approach on a genealogy of HIV-1 envelope (env) partial sequences. PMID:12136032
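The time structure that the Kingman coalescent imposes on an ancestral tree can be illustrated with a small sketch. It is not the authors' MCMC method: it assumes sequences sampled at a single time point (not temporally spaced), a constant haploid population of size `pop_size`, and time measured in generations. The function names are hypothetical.

```python
import random


def expected_tmrca(n_samples: int, pop_size: float) -> float:
    """Expected time (generations) to the most recent common ancestor.

    With k lineages, the coalescence rate in a haploid population of size N
    is k*(k-1)/2 / N, so the expected waiting time is N / (k*(k-1)/2).
    Summing over k = n..2 gives 2*N*(1 - 1/n).
    """
    return sum(pop_size / (k * (k - 1) / 2) for k in range(2, n_samples + 1))


def simulate_coalescent_times(n_samples: int, pop_size: float, rng: random.Random) -> list:
    """Draw one set of coalescence times (oldest event last)."""
    times = []
    t = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2 / pop_size  # pairwise coalescence rate
        t += rng.expovariate(rate)         # exponential waiting time
        times.append(t)
    return times


if __name__ == "__main__":
    print(expected_tmrca(10, 1000))
    print(simulate_coalescent_times(5, 1000, random.Random(1)))
```

A full analysis of the kind described above would place a prior such as this on the tree and integrate over trees by MCMC; the sketch only shows the waiting-time structure.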
Drummond, Alexei J; Nicholls, Geoff K; Rodrigo, Allen G; Solomon, Wiremu
2002-07-01
Molecular sequences obtained at different sampling times from populations of rapidly evolving pathogens and from ancient subfossil and fossil sources are increasingly available with modern sequencing technology. Here, we present a Bayesian statistical inference approach to the joint estimation of mutation rate and population size that incorporates the uncertainty in the genealogy of such temporally spaced sequences by using Markov chain Monte Carlo (MCMC) integration. The Kingman coalescent model is used to describe the time structure of the ancestral tree. We recover information about the unknown true ancestral coalescent tree, population size, and the overall mutation rate from temporally spaced data, that is, from nucleotide sequences gathered at different times, from different individuals, in an evolving haploid population. We briefly discuss the methodological implications and show what can be inferred, in various practically relevant states of prior knowledge. We develop extensions for exponentially growing population size and joint estimation of substitution model parameters. We illustrate some of the important features of this approach on a genealogy of HIV-1 envelope (env) partial sequences.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2010-01-01
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.
Zou, Xiaohui; Tang, Guangpeng; Zhao, Xiang; Huang, Yan; Chen, Tao; Lei, Mingyu; Chen, Wenbing; Yang, Lei; Zhu, Wenfei; Zhuang, Li; Yang, Jing; Feng, Zhaomin; Wang, Dayan; Wang, Dingming; Shu, Yuelong
2017-03-01
Many viruses can cause respiratory diseases in humans. Although great advances have been achieved in methods of diagnosis, it remains challenging to identify pathogens in unexplained pneumonia (UP) cases. In this study, we applied next-generation sequencing (NGS) technology and a metagenomic approach to detect and characterize respiratory viruses in UP cases from Guizhou Province, China. A total of 33 oropharyngeal swabs were obtained from hospitalized UP patients and subjected to NGS. An unbiased metagenomic analysis pipeline identified 13 virus species in 16 samples. Human rhinovirus C was the virus most frequently detected and was identified in seven samples. Human measles virus, adenovirus B 55 and coxsackievirus A10 were also identified. Metagenomic sequencing also provided virus genomic sequences, which enabled genotype characterization and phylogenetic analysis. For cases of multiple infection, metagenomic sequencing afforded information regarding the quantity of each virus in the sample, which could be used to evaluate each virus's role in the disease. Our study highlights the potential of metagenomic sequencing for pathogen identification in UP cases.
Blank-Landeshammer, Bernhard; Kollipara, Laxmikanth; Biß, Karsten; Pfenninger, Markus; Malchow, Sebastian; Shuvaev, Konstantin; Zahedi, René P; Sickmann, Albert
2017-09-01
Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, oftentimes is the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which in conjunction with minor tweaks in data acquisition yields a lower empirical FDR compared to the analysis using single algorithms. Results were validated using state-of-the-art database search algorithms as well as specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11 120 PSMs (combined) instead of 3476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digestion.
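The abstract does not say how the results of the individual algorithms are combined. One simple possibility, shown here purely as an assumption, is to keep a peptide call for a spectrum only when at least `min_agree` algorithms report the same sequence. The function names, the input layout and the voting rule are all hypothetical; the last function merely checks the reported fold increase from the two PSM counts quoted above.

```python
from collections import Counter


def consensus_peptides(calls_by_algorithm: dict, min_agree: int = 2) -> dict:
    """Keep one peptide per spectrum if enough algorithms agree on it.

    calls_by_algorithm maps algorithm name -> {spectrum_id: peptide}.
    This voting rule is an illustrative assumption, not the published method.
    """
    spectra = set().union(*(calls.keys() for calls in calls_by_algorithm.values()))
    consensus = {}
    for spectrum in spectra:
        votes = Counter(
            calls[spectrum]
            for calls in calls_by_algorithm.values()
            if spectrum in calls
        )
        peptide, count = votes.most_common(1)[0]
        if count >= min_agree:
            consensus[spectrum] = peptide
    return consensus


def fold_increase(combined_psms: int, single_psms: int) -> float:
    """Ratio of PSM counts, e.g. combined result versus best single algorithm."""
    return combined_psms / single_psms


if __name__ == "__main__":
    toy_calls = {
        "algo_a": {"s1": "PEPTIDE", "s2": "AAA"},
        "algo_b": {"s1": "PEPTIDE", "s2": "AAK"},
        "algo_c": {"s1": "PEPTLDE"},
    }
    print(consensus_peptides(toy_calls))
    print(round(fold_increase(11120, 3476), 1))
```

With the counts from the abstract, 11 120 / 3476 is roughly 3.2, consistent with the stated "more than 3-fold" increase.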
Reid, Allecia E.; Taber, Jennifer M.; Ferrer, Rebecca A.; Biesecker, Barbara B.; Lewis, Katie L.; Biesecker, Leslie G.; Klein, William M. P.
2018-01-01
Objective Genomic sequencing is becoming increasingly accessible, highlighting the need to understand the social and psychological factors that drive interest in receiving testing results. These decisions may depend on perceived descriptive norms (how most others behave) and injunctive norms (what is approved of by others). We predicted that descriptive norms would be directly associated with intentions to learn genomic sequencing results, whereas injunctive norms would be associated indirectly, via attitudes. These differential associations with intentions versus attitudes were hypothesized to be strongest when individuals held ambivalent attitudes toward obtaining results. Methods Participants enrolled in a genomic sequencing trial (n=372) reported intentions to learn medically actionable, non-medically actionable, and carrier sequencing results. Descriptive norms items referenced other study participants. Injunctive norms were analyzed separately for close friends and family members. Attitudes, attitudinal ambivalence, and sociodemographic covariates were also assessed. Results In structural equation models, both descriptive norms and friend injunctive norms were associated with intentions to receive all sequencing results (ps<.004). Attitudes consistently mediated all friend injunctive norms-intentions associations, but not the descriptive norms-intentions associations. Attitudinal ambivalence moderated the association between friend injunctive norms (p≤.001), but not descriptive norms (p=.16), and attitudes. Injunctive norms were significantly associated with attitudes when ambivalence was high, but were unrelated when ambivalence was low. Results replicated for family injunctive norms. Conclusions Descriptive and injunctive norms play roles in genomic sequencing decisions. Considering mediators and moderators of these processes enhances ability to optimize use of normative information to support informed decision making. PMID:29745680
Predicting Protein-Protein Interactions by Combing Various Sequence-Derived.
Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao
2011-09-20
Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information; these features measure, from different aspects, the interactions between residues a certain distance apart in the protein sequences. To achieve better performance, principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to a support vector machine (SVM). Experimental results on Drosophila melanogaster and Helicobacter pylori datasets show that our method is very promising for predicting PPIs and may at least be a useful supplementary tool to existing methods.
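The abstract does not list the 930 features. As a hedged illustration of what a feature "between residues a certain distance apart" can look like, the sketch below computes a lagged auto-covariance of a per-residue property. The descriptor choice, the function names and the two-letter `TOY_SCALE` are assumptions made for the example, not the published feature set.

```python
def auto_covariance(sequence: str, scale: dict, lag: int) -> float:
    """Average product of centred property values for residues `lag` apart."""
    values = [scale[residue] for residue in sequence]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    mean = sum(values) / n
    return sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)


def lag_features(sequence: str, scale: dict, max_lag: int) -> list:
    """One feature per lag 1..max_lag for a single property scale."""
    return [auto_covariance(sequence, scale, lag) for lag in range(1, max_lag + 1)]


if __name__ == "__main__":
    # Toy property values for two residues only; not a real physicochemical scale.
    TOY_SCALE = {"A": 1.0, "G": -1.0}
    print(lag_features("AGAG", TOY_SCALE, 2))
```

Features of this kind from several property scales and lags could then be concatenated for the two proteins of a pair before dimensionality reduction and classification.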
Dynamic visual attention: motion direction versus motion magnitude
NASA Astrophysics Data System (ADS)
Bur, A.; Wurtz, P.; Müri, R. M.; Hügli, H.
2008-02-01
Defined as an attentive process in the context of visual sequences, dynamic visual attention refers to the selection of the most informative parts of a video sequence. This paper investigates the contribution of motion in dynamic visual attention, and specifically compares computer models designed with the motion component expressed either as the speed magnitude or as the speed vector. Several computer models, including static features (color, intensity and orientation) and motion features (magnitude and vector), are considered. Qualitative and quantitative evaluations are performed by comparing the computer model output with human saliency maps obtained experimentally from eye movement recordings. The model suitability is evaluated in various situations (synthetic and real sequences, acquired with fixed and moving camera perspective), showing the advantages and drawbacks of each method as well as its preferred domain of application.
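The distinction between motion magnitude and motion vector can be made concrete with a short sketch. It assumes a per-pixel motion estimate given as a hypothetical `(vx, vy)` pair and is not taken from the models evaluated in the paper.

```python
import math


def motion_features(vx: float, vy: float) -> tuple:
    """Return (speed magnitude, direction in radians) of a motion vector."""
    magnitude = math.hypot(vx, vy)   # speed only, direction discarded
    direction = math.atan2(vy, vx)   # angle in (-pi, pi]
    return magnitude, direction


if __name__ == "__main__":
    # Two objects moving at the same speed in opposite directions.
    print(motion_features(1.0, 0.0))
    print(motion_features(-1.0, 0.0))
```

A magnitude-based feature treats the two objects in the usage example as identical, whereas a vector-based feature can separate them by direction.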
Lepère, Cécile; Domaizon, Isabelle; Debroas, Didier
2008-01-01
The diversity of small eukaryotes (0.2 to 5 μm) in a mesotrophic lake (Lake Bourget) was investigated using 18S rRNA gene library construction and fluorescent in situ hybridization coupled with tyramide signal amplification (TSA-FISH). Samples collected from the epilimnion on two dates were used to extend a data set previously obtained using similar approaches for lakes with a range of trophic types. A high level of diversity was recorded for this system with intermediate trophic status, and the main sequences from Lake Bourget were affiliated with ciliates (maximum, 19% of the operational taxonomic units [OTUs]), cryptophytes (33%), stramenopiles (13.2%), and cercozoa (9%). Although the comparison of TSA-FISH results and clone libraries suggested that the level of Chlorophyceae may have been underestimated using PCR with 18S rRNA primers, heterotrophic organisms dominated the small-eukaryote assemblage. We found that a large fraction of the sequences belonged to potential parasites of freshwater phytoplankton, including sequences affiliated with fungi and Perkinsozoa. On average, these sequences represented 30% of the OTUs (40% of the clones) obtained for each of two dates for Lake Bourget. Our results provide information on lacustrine small-eukaryote diversity and structure, adding to the phylogenetic data available for lakes with various trophic types. PMID:18359836
Lepère, Cécile; Domaizon, Isabelle; Debroas, Didier
2008-05-01
The diversity of small eukaryotes (0.2 to 5 μm) in a mesotrophic lake (Lake Bourget) was investigated using 18S rRNA gene library construction and fluorescent in situ hybridization coupled with tyramide signal amplification (TSA-FISH). Samples collected from the epilimnion on two dates were used to extend a data set previously obtained using similar approaches for lakes with a range of trophic types. A high level of diversity was recorded for this system with intermediate trophic status, and the main sequences from Lake Bourget were affiliated with ciliates (maximum, 19% of the operational taxonomic units [OTUs]), cryptophytes (33%), stramenopiles (13.2%), and cercozoa (9%). Although the comparison of TSA-FISH results and clone libraries suggested that the level of Chlorophyceae may have been underestimated using PCR with 18S rRNA primers, heterotrophic organisms dominated the small-eukaryote assemblage. We found that a large fraction of the sequences belonged to potential parasites of freshwater phytoplankton, including sequences affiliated with fungi and Perkinsozoa. On average, these sequences represented 30% of the OTUs (40% of the clones) obtained for each of two dates for Lake Bourget. Our results provide information on lacustrine small-eukaryote diversity and structure, adding to the phylogenetic data available for lakes with various trophic types.
Joseph, Agnel Praveen; Srinivasan, Narayanaswamy; de Brevern, Alexandre G
2012-09-01
Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB based alignments are used to obtain a three dimensional fit of the structures followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underline that the alignment quality is better than MULTIPROT, MUSTANG and the alignments in HOMSTRAD, in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods. Copyright © 2012 Elsevier Masson SAS. All rights reserved.
Evaluation of the impact of RNA preservation methods of spiders for de novo transcriptome assembly.
Kono, Nobuaki; Nakamura, Hiroyuki; Ito, Yusuke; Tomita, Masaru; Arakawa, Kazuharu
2016-05-01
With advances in high-throughput sequencing technologies, de novo transcriptome sequencing and assembly has become a cost-effective method to obtain comprehensive genetic information of a species of interest, especially in nonmodel species with large genomes such as spiders. However, high-quality RNA is essential for successful sequencing, and sample preservation conditions require careful consideration for the effective storage of field-collected samples. To this end, we report a streamlined feasibility study of various storage conditions and their effects on de novo transcriptome assembly results. The storage parameters considered include temperatures ranging from room temperature to -80°C; preservatives, including ethanol, RNAlater, TRIzol and RNAlater-ICE; and sample submersion states. As a result, intact RNA was extracted and assembly was successful when samples were preserved at low temperatures regardless of the type of preservative used. The assemblies as well as the gene expression profiles were shown to be robust to RNA degradation, when 30 million 150-bp paired-end reads are obtained. The parameters for sample storage, RNA extraction, library preparation, sequencing and in silico assembly considered in this work provide a guideline for the study of field-collected samples of spiders. © 2015 John Wiley & Sons Ltd.
Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca
2015-01-01
Few studies have investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared the obtained sequence information with the available donkey draft genome (and the Illumina reads from which it originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than the number aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionarily important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450
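A nucleotide divergence figure such as the ~0.52-0.57% quoted above is, in its simplest form, the percentage of aligned sites at which two sequences differ. The sketch below shows that basic calculation for two pre-aligned sequences. How the authors filtered sites or handled gaps is not stated in the abstract, so the rule used here (skip any column that is not A/C/G/T in both sequences) and the function name are assumptions.

```python
def percent_divergence(seq_a: str, seq_b: str) -> float:
    """Percentage of mismatching sites among columns with A/C/G/T in both."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":  # ignore gaps and ambiguous bases
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * mismatches / compared


if __name__ == "__main__":
    print(percent_divergence("ACGTACGTAC", "ACGTACGTAT"))
    print(percent_divergence("AC-TN", "ACGTA"))
```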
Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca
2015-01-01
Few studies have investigated the donkey (Equus asinus) at the whole genome level so far. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared the obtained sequence information with the available donkey draft genome (and the Illumina reads from which it originated) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than the number aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~ 0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionarily important in equids. Comparing Y-chromosome regions we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources.
Frías-López, Cristina; Sánchez-Herrero, José F; Guirao-Rico, Sara; Mora, Elisa; Arnedo, Miquel A; Sánchez-Gracia, Alejandro; Rozas, Julio
2016-12-15
The development of molecular markers is one of the most important challenges in phylogenetic and genome wide population genetics studies, especially in studies with non-model organisms. A highly promising approach for obtaining suitable markers is the utilization of genomic partitioning strategies for the simultaneous discovery and genotyping of a large number of markers. Unfortunately, not all markers obtained from these strategies provide enough information for solving multiple evolutionary questions at a reasonable taxonomic resolution. We have developed Development Of Molecular markers In Non-model Organisms (DOMINO), a bioinformatics tool for informative marker development from both next generation sequencing (NGS) data and pre-computed sequence alignments. The application implements popular NGS tools with new utilities in a highly versatile pipeline specifically designed to discover or select personalized markers at different levels of taxonomic resolution. These markers can be directly used to study the taxa surveyed for their design, utilized for further downstream PCR amplification in a broader taxonomic scope, or exploited as suitable templates for bait design in target DNA enrichment techniques. We conducted an exhaustive evaluation of the performance of DOMINO via computer simulations and illustrate its utility to find informative markers in an empirical dataset. DOMINO is freely available from www.ub.edu/softevol/domino. Contact: elsanchez@ub.edu or jrozas@ub.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Next generation sequencing and its applications in forensic genetics.
Børsting, Claus; Morling, Niels
2015-09-01
It has been almost a decade since the first next generation sequencing (NGS) technologies emerged and quickly changed the way genetic research is conducted. Today, full genomes are mapped and published almost weekly and with ever increasing speed and decreasing costs. NGS methods and platforms have matured during the last 10 years, and the quality of the sequences has reached a level where NGS is used in clinical diagnostics of humans. Forensic genetic laboratories have also explored NGS technologies and especially in the last year, there has been a small explosion in the number of scientific articles and presentations at conferences with forensic aspects of NGS. These contributions have demonstrated that NGS offers new possibilities for forensic genetic case work. More information may be obtained from unique samples in a single experiment by analyzing combinations of markers (STRs, SNPs, insertion/deletions, mRNA) that cannot be analyzed simultaneously with the standard PCR-CE methods used today. The true variation in core forensic STR loci has been uncovered, and previously unknown STR alleles have been discovered. The detailed sequence information may aid mixture interpretation and will increase the statistical weight of the evidence. In this review, we will give an introduction to NGS and single-molecule sequencing, and we will discuss the possible applications of NGS in forensic genetics. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Liu, Dan; Wang, Qianqian; Ruan, Zengliang; He, Qian; Zhang, Liming
2015-01-01
Background Jellyfish contain diverse toxins and other bioactive components. However, large-scale identification of novel toxins and bioactive components from jellyfish has been hampered by the low efficiency of traditional isolation and purification methods. Results We performed de novo transcriptome sequencing of the tentacle tissue of the jellyfish Cyanea capillata. A total of 51,304,108 reads were obtained and assembled into 50,536 unigenes. Of these, 21,357 unigenes had homologues in public databases, but the remaining unigenes had no significant matches due to the limited sequence information available and species-specific novel sequences. Functional annotation of the unigenes also revealed general gene expression profile characteristics in the tentacle of C. capillata. A primary goal of this study was to identify putative toxin transcripts. As expected, we screened many transcripts encoding proteins similar to several well-known toxin families including phospholipases, metalloproteases, serine proteases and serine protease inhibitors. In addition, some transcripts also resembled molecules with potential toxic activities, including cnidarian CfTX-like toxins with hemolytic activity, plancitoxin-1, venom toxin-like peptide-6, histamine-releasing factor, neprilysin, dipeptidyl peptidase 4, vascular endothelial growth factor A, angiotensin-converting enzyme-like and endothelin-converting enzyme 1-like proteins. Most of these molecules have not been previously reported in jellyfish. Interestingly, we also characterized a number of transcripts with similarities to proteins relevant to several degenerative diseases, including Huntington’s, Alzheimer’s and Parkinson’s diseases. This is the first description of degenerative disease-associated genes in jellyfish. Conclusion We obtained a well-categorized and annotated transcriptome of C. capillata tentacle that will be an important and valuable resource for further understanding of jellyfish at the molecular level and information on the underlying molecular mechanisms of jellyfish stinging. The findings of this study may also be used in comparative studies of gene expression profiling among different jellyfish species. PMID:26551022
Liu, Guoyan; Zhou, Yonghong; Liu, Dan; Wang, Qianqian; Ruan, Zengliang; He, Qian; Zhang, Liming
2015-01-01
Jellyfish contain diverse toxins and other bioactive components. However, large-scale identification of novel toxins and bioactive components from jellyfish has been hampered by the low efficiency of traditional isolation and purification methods. We performed de novo transcriptome sequencing of the tentacle tissue of the jellyfish Cyanea capillata. A total of 51,304,108 reads were obtained and assembled into 50,536 unigenes. Of these, 21,357 unigenes had homologues in public databases, but the remaining unigenes had no significant matches due to the limited sequence information available and species-specific novel sequences. Functional annotation of the unigenes also revealed general gene expression profile characteristics in the tentacle of C. capillata. A primary goal of this study was to identify putative toxin transcripts. As expected, we screened many transcripts encoding proteins similar to several well-known toxin families including phospholipases, metalloproteases, serine proteases and serine protease inhibitors. In addition, some transcripts also resembled molecules with potential toxic activities, including cnidarian CfTX-like toxins with hemolytic activity, plancitoxin-1, venom toxin-like peptide-6, histamine-releasing factor, neprilysin, dipeptidyl peptidase 4, vascular endothelial growth factor A, angiotensin-converting enzyme-like and endothelin-converting enzyme 1-like proteins. Most of these molecules have not been previously reported in jellyfish. Interestingly, we also characterized a number of transcripts with similarities to proteins relevant to several degenerative diseases, including Huntington's, Alzheimer's and Parkinson's diseases. This is the first description of degenerative disease-associated genes in jellyfish. We obtained a well-categorized and annotated transcriptome of C. capillata tentacle that will be an important and valuable resource for further understanding of jellyfish at the molecular level and information on the underlying molecular mechanisms of jellyfish stinging. The findings of this study may also be used in comparative studies of gene expression profiling among different jellyfish species.
Identification of a Herbal Powder by Deoxyribonucleic Acid Barcoding and Structural Analyses.
Sheth, Bhavisha P; Thaker, Vrinda S
2015-10-01
Authentic identification of plants is essential for exploiting their medicinal properties and for stopping adulteration and malpractice in their trade. The aim was to identify a herbal powder obtained from a herbalist in the local vicinity of Rajkot, Gujarat, using deoxyribonucleic acid (DNA) barcoding and molecular tools. DNA was extracted from the herbal powder and selected Cassia species, followed by polymerase chain reaction (PCR) and sequencing of the rbcL barcode locus. Thereafter, the sequences were subjected to National Center for Biotechnology Information (NCBI) basic local alignment search tool (BLAST) analysis, followed by determination of the three-dimensional structure of the rbcL protein from the herbal powder and the Cassia species, namely Cassia fistula, Cassia tora and Cassia javanica (sequences obtained in the present study), and Cassia roxburghii and Cassia abbreviata (sequences retrieved from GenBank). Further, multiple and pairwise structural alignments were carried out in order to identify the herbal powder. The nucleotide sequences obtained from the selected species of Cassia were submitted to GenBank (Accession No. JX141397, JX141405, JX141420). The NCBI BLAST analysis of the rbcL protein from the herbal powder showed an equal sequence similarity (with reference to different parameters like E value, maximum identity, total score, query coverage) to C. javanica and C. roxburghii. In order to resolve the ambiguities of the BLAST result, a protein structural approach was implemented. The protein homology models obtained in the present study were submitted to the protein model database (PM0079748-PM0079753). The pairwise structural alignment of the herbal powder (as template) and C. javanica and C. roxburghii (as targets individually) revealed a close similarity of the herbal powder with C. javanica.
A strategy as used here, incorporating the integrated use of DNA barcoding and protein structural analyses could be adopted, as a novel rapid and economic procedure, especially in cases when protein coding loci are considered. Authentic identification of plants is essential for exploiting their medicinal properties as well as to stop the adulteration and malpractices with the trade of the same. A herbal powder was obtained from a herbalist in the local vicinity of Rajkot, Gujarat. An integrated approach using DNA barcoding and structural analyses was carried out to identify the herbal powder. The herbal powder was identified as Cassia javanica L.
Ordered transport and identification of particles
Shera, E.B.
1993-05-11
A method and apparatus are provided for application of electrical field gradients to induce particle velocities to enable particle sequence and identification information to be obtained. Particle sequence is maintained by providing electroosmotic flow for an electrolytic solution in a particle transport tube. The transport tube and electrolytic solution are selected to provide an electroosmotic radius of >100 so that a plug flow profile is obtained for the electrolytic solution in the transport tube. Thus, particles are maintained in the same order in which they are introduced in the transport tube. When the particles also have known electrophoretic velocities, the field gradients introduce an electrophoretic velocity component onto the electroosmotic velocity. The time that the particles pass selected locations along the transport tube may then be detected and the electrophoretic velocity component calculated for particle identification. One particular application is the ordered transport and identification of labeled nucleotides sequentially cleaved from a strand of DNA.
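The calculation described in the last steps can be sketched as follows. It assumes a particle timed between two detection points a known distance apart, a uniform (plug-flow) electroosmotic velocity that is known separately, and velocities that simply add. The function name, parameter names and units are illustrative assumptions, not taken from the patent.

```python
def electrophoretic_component(
    distance_m: float, transit_time_s: float, electroosmotic_velocity_m_s: float
) -> float:
    """Electrophoretic velocity = observed velocity - electroosmotic velocity."""
    if transit_time_s <= 0:
        raise ValueError("transit time must be positive")
    observed_velocity = distance_m / transit_time_s
    return observed_velocity - electroosmotic_velocity_m_s


if __name__ == "__main__":
    # Particle crosses 1 cm between detectors in 2 s; bulk flow is 4 mm/s.
    print(electrophoretic_component(0.01, 2.0, 0.004))
```

Comparing the computed component against the known electrophoretic velocities of the labeled species would then identify the particle, while the plug flow keeps the arrival order unchanged.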
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moore, L.L.; Jones, A.D.
This report presents and summarizes exhaust emission data and other information obtained as a result of the testing and inspection of 350 in-use passenger cars. The test fleet was made up of 1978, 1979 and 1980 automobiles manufactured by Ford, General Motors, Mazda, Saab, Toyota, Volkswagen/Audi and Volvo. Each vehicle was equipped with a three-way catalyst control system. They were obtained randomly from private owners in the Los Angeles and Orange County areas. The testing was completed December, 1979. Each vehicle was tested only in as-received condition. The test sequence consisted of the 1975 Federal Test Procedure (exhaust emissions only), a Highway Fuel Economy test, a Two-Speed Idle test, a Federal Three Mode test, and a Loaded Two Mode test. After the initial test sequence, each vehicle was subjected to a thorough underhood inspection.
Ordered transport and identification of particles
Shera, E. Brooks
1993-01-01
A method and apparatus are provided for application of electrical field gradients to induce particle velocities to enable particle sequence and identification information to be obtained. Particle sequence is maintained by providing electroosmotic flow for an electrolytic solution in a particle transport tube. The transport tube and electrolytic solution are selected to provide an electroosmotic radius of >100 so that a plug flow profile is obtained for the electrolytic solution in the transport tube. Thus, particles are maintained in the same order in which they are introduced in the transport tube. When the particles also have known electrophoretic velocities, the field gradients introduce an electrophoretic velocity component onto the electroosmotic velocity. The time that the particles pass selected locations along the transport tube may then be detected and the electrophoretic velocity component calculated for particle identification. One particular application is the ordered transport and identification of labeled nucleotides sequentially cleaved from a strand of DNA.
rbcL gene sequences provide evidence for the evolutionary lineages of leptosporangiate ferns.
Hasebe, M; Omori, T; Nakazawa, M; Sano, T; Kato, M; Iwatsuki, K
1994-06-07
Pteridophytes have a longer evolutionary history than any other vascular land plant and, therefore, have endured greater loss of phylogenetically informative information. This factor has resulted in substantial disagreements in evaluating characters and, thus, controversy in establishing a stable classification. To compare competing classifications, we obtained DNA sequences of a chloroplast gene. The sequence of 1206 nt of the large subunit of the ribulose-bisphosphate carboxylase gene (rbcL) was determined from 58 species, representing almost all families of leptosporangiate ferns. Phylogenetic trees were inferred by the neighbor-joining and the parsimony methods. The two methods produced almost identical phylogenetic trees that provided insights concerning major general evolutionary trends in the leptosporangiate ferns. Interesting findings were as follows: (i) two morphologically distinct heterosporous water ferns, Marsilea and Salvinia, are sister genera; (ii) the tree ferns (Cyatheaceae, Dicksoniaceae, and Metaxyaceae) are monophyletic; and (iii) polypodioids are distantly related to the gleichenioids in spite of the similarity of their exindusiate soral morphology and are close to the higher indusiate ferns. In addition, the affinities of several "problematic genera" were assessed.
3D Dose reconstruction: Banding artefacts in cine mode EPID images during VMAT delivery
NASA Astrophysics Data System (ADS)
Woodruff, H. C.; Greer, P. B.
2013-06-01
Cine (continuous) mode images obtained during VMAT delivery are heavily degraded by banding artefacts. We have developed a method to reconstruct the pulse sequence (and hence dose deposited) from open field images. For clinical VMAT fields we have devised a frame averaging strategy that greatly improves image quality and dosimetric information for three-dimensional dose reconstruction.
Image-based aircraft pose estimation: a comparison of simulations and real-world data
NASA Astrophysics Data System (ADS)
Breuers, Marcel G. J.; de Reus, Nico
2001-10-01
The problem of estimating aircraft pose information from mono-ocular image data is considered using a Fourier descriptor based algorithm. The dependence of pose estimation accuracy on image resolution and aspect angle is investigated through simulations using sets of synthetic aircraft images. Further evaluation shows that good pose estimation accuracy can be obtained in real world image sequences.
Yoon, Jun-Hee; Kim, Thomas W; Mendez, Pedro; Jablons, David M; Kim, Il-Jin
2017-01-01
The development of next-generation sequencing (NGS) technology makes it possible to sequence whole exomes or genomes. However, data analysis is still the biggest bottleneck for its wide implementation. Most laboratories still depend on manual procedures for data handling and analyses, which translates into a delay and decreased efficiency in the delivery of NGS results to doctors and patients. Thus, there is high demand for developing an automatic and easy-to-use NGS data analysis system. We developed a comprehensive, automatic genetic analysis controller named Mobile Genome Express (MGE) that works on smartphones or other mobile devices. MGE can handle all the steps of genetic analysis, such as sample information submission, sequencing run quality check from the sequencer, secured data transfer and results review. We sequenced an Actrometrix control DNA containing multiple proven human mutations using a targeted sequencing panel, and the whole analysis was managed by MGE and its data reviewing program called ELECTRO. All steps were processed automatically except for the final sequencing review procedure with ELECTRO to confirm mutations. The data analysis process was completed within several hours. We confirmed that the mutations we identified were consistent with our previous results obtained by using multi-step, manual pipelines.
The twilight zone of cis element alignments.
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-02-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein-DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein-DNA interfaces is significantly more conserved than their sequence and measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
The twilight zone of cis element alignments
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-01-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein–DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein–DNA interfaces is significantly more conserved than their sequence and measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments. PMID:23268451
Dinçer, Alp; Yildiz, Erdem; Kohan, Saeed; Memet Özek, M
2011-01-01
The aim of the study is to evaluate the efficiency of turbo spin-echo (TSE), three-dimensional constructive interference in the steady state (3D CISS) and cine phase contrast (Cine PC) sequences in determining flow through the endoscopic third ventriculostomy (ETV) fenestration, and to determine the effect of various TSE sequence parameters. The study was approved by our institutional review board and informed consent from all patients was obtained. Two groups of patients were included: group I (24 patients with good clinical outcome after ETV) and group II (22 patients with hydrocephalus evaluated preoperatively). The imaging protocol for both groups was identical. TSE T2 with various sequence parameters and imaging planes, and 3D CISS, followed by cine PC were obtained. Flow void was graded on a four-point scale. The sensitivity, specificity, accuracy, positive and negative predictive values of sequences were calculated. Bidirectional flow through the fenestration was detected in all group I patients by cine PC. Stroke volumes through the fenestration in group I ranged from 10 to 160.8 ml/min. There was no correlation between the presence of reversed flow and flow void grading. Also, there was no correlation between the stroke volumes and flow void grading. The sensitivity of 3D CISS was low, and 2 mm sagittal TSE T2, nearly equal to cine PC, provided the best result. Cine PC and TSE T2 both have high confidence in the assessment of the flow through the fenestration. However, sequence parameters significantly affect the efficiency of TSE T2.
Lu, Min; An, Huaming; Li, Liangliang
2016-01-01
Rosa roxburghii Tratt is an important commercial horticultural crop in China that is recognized for its nutritional and medicinal values. In spite of the economic significance, genomic information on this rose species is currently unavailable. In the present research, a genome survey of R. roxburghii was carried out using next-generation sequencing (NGS) technologies. A total of 30.29 Gb of sequence data was obtained by HiSeq 2500 sequencing and the estimated genome size of R. roxburghii was 480.97 Mb, in which the guanine plus cytosine (GC) content was calculated to be 38.63%. All of these reads were technically assembled and a total of 627,554 contigs with an N50 length of 1.484 kb and furthermore 335,902 scaffolds with a total length of 409.36 Mb were obtained. Transposable element (TE) sequences totalling 90.84 Mb, which comprised 29.20% of the genome, and 167,859 simple sequence repeats (SSRs) were identified from the scaffolds. Among these, the mono-(66.30%), di-(25.67%), and tri-(6.64%) nucleotide repeats contributed to nearly 99% of the SSRs, and sequence motifs AG/CT (28.81%) and GAA/TTC (14.76%) were the most abundant among the dinucleotide and trinucleotide repeat motifs, respectively. Genome analysis predicted a total of 22,721 genes which have an average length of 2311.52 bp, an average exon length of 228.15 bp, and average intron length of 401.18 bp. Eleven genes putatively involved in ascorbate metabolism were identified and their expression in R. roxburghii leaves was validated by quantitative real-time PCR (qRT-PCR). This is the first report of genome-wide characterization of this rose species.
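The N50 statistic quoted for the contigs can be computed as in the sketch below. It uses the common definition (the length of the contig at which the running total of lengths, taken longest first, reaches half of the assembly size); the abstract does not say which tool or exact convention the authors used, so treat the function as illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Contigs are sorted longest first and accumulated until at least
    half of the total assembled length is covered; the length of the
    contig that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

With toy contig lengths of 2, 3, 4, 5 and 6 kb (20 kb in total), the two longest contigs already cover 11 kb, so the N50 is 5 kb.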
Hügel, Theresa; van Meir, Vincent; Muñoz-Meneses, Amanda; Clarin, B-Markus; Siemers, Björn M; Goerlitz, Holger R
2017-01-01
Animals can gain important information by attending to the signals and cues of other animals in their environment, with acoustic information playing a major role in many taxa. Echolocation call sequences of bats contain information about the identity and behaviour of the sender which is perceptible to close-by receivers. Increasing evidence supports the communicative function of echolocation within species, yet data about its role for interspecific information transfer is scarce. Here, we asked which information bats extract from heterospecific echolocation calls during foraging. In three linked playback experiments, we tested in the flight room and field if foraging Myotis bats approached the foraging call sequences of conspecifics and four heterospecifics that were similar in acoustic call structure only (acoustic similarity hypothesis), in foraging ecology only (foraging similarity hypothesis), both, or none. Compared to the natural prey capture rate of 1.3 buzzes per minute of bat activity, our playbacks of foraging sequences with 23-40 buzzes/min simulated foraging patches with significantly higher profitability. In the flight room, M. capaccinii only approached call sequences of conspecifics and of the heterospecific M. daubentonii with similar acoustics and foraging ecology. In the field, M. capaccinii and M. daubentonii only showed a weak positive response to those two species. Our results confirm information transfer across species boundaries and highlight the importance of context on the studied behaviour, but cannot resolve whether information transfer in trawling Myotis is based on acoustic similarity only or on a combination of similarity in acoustics and foraging ecology. Animals transfer information, both voluntarily and inadvertently, and within and across species boundaries. 
In echolocating bats, acoustic call structure and foraging ecology are linked, making echolocation calls a rich source of information about species identity, ecology and activity of the sender, which receivers might exploit to find profitable foraging grounds. We tested in three lab and field experiments if information transfer occurs between bat species and if bats obtain information about ecology from echolocation calls. Myotis capaccinii/daubentonii bats approached call playbacks, but only those from con- and heterospecifics with similar call structure and foraging ecology, confirming interspecific information transfer. Reactions differed between lab and field, emphasising situation-dependent differences in animal behaviour, the importance of field research, and the need for further studies on the underlying mechanism of information transfer and the relative contributions of acoustic and ecological similarity.
GMDD: a database of GMO detection methods.
Dong, Wei; Yang, Litao; Shen, Kailin; Kim, Banghyun; Kleter, Gijs A; Marvin, Hans J P; Guo, Rong; Liang, Wanqi; Zhang, Dabing
2008-06-04
Since more than one hundred genetically modified organism (GMO) events have been developed and approved for commercialization worldwide, GMO analysis methods are essential for the enforcement of GMO labelling regulations. Protein and nucleic acid-based detection techniques have been developed and utilized for GMO identification and quantification. However, information for the harmonization and standardization of GMO analysis methods at the global level is needed. The GMO Detection method Database (GMDD) has collected almost all previously developed and reported GMO detection methods, which have been grouped by strategy (screen-, gene-, construct-, and event-specific), and also provides a user-friendly search service for the detection methods by GMO event name, exogenous gene, or protein information, etc. In this database, users can obtain the sequences of exogenous integration, which will facilitate the design of PCR primers and probes. Information on endogenous genes, certified reference materials, reference molecules, and the validation status of developed methods is also included in this database. Furthermore, registered users can submit new detection methods and sequences to this database, and the newly submitted information will be released soon after being checked. GMDD contains comprehensive information on GMO detection methods. The database will make GMO analysis much easier.
Narravula, Alekhya; Garber, Kathryn B; Askree, S Hussain; Hegde, Madhuri; Hall, Patricia L
2017-01-01
As exome and genome sequencing using high-throughput sequencing technologies move rapidly into the diagnostic process, laboratories and clinicians need to develop a strategy for dealing with uncertain findings. A commitment must be made to minimize these findings, and all parties may need to make adjustments to their processes. The information required to reclassify these variants is often available but not communicated to all relevant parties. To illustrate these issues, we focused on three well-characterized monogenic, metabolic disorders included in newborn screens: classic galactosemia, caused by GALT variants; phenylketonuria, caused by PAH variants; and medium-chain acyl-CoA dehydrogenase (MCAD) deficiency, caused by ACADM variants. In 10 years of clinical molecular testing, we have observed 134 unique GALT variants, 46 of which were variants of uncertain significance (VUS). In PAH, we observed 132 variants, including 17 VUS, and for ACADM, we observed 64 unique variants, of which 33 were uncertain. After this review, 17 VUS (37%; 7 in ACADM, 9 in GALT, and 1 in PAH) were reclassified from uncertain (6 to benign or likely benign and 11 to pathogenic or likely pathogenic). We identified common types of missing information that would have helped make a definitive classification and categorized this information by ease and cost to obtain. Genet Med 19 1, 77-82.
Noise and drift analysis of non-equally spaced timing data
NASA Technical Reports Server (NTRS)
Vernotte, F.; Zalamansky, G.; Lantz, E.
1994-01-01
Generally, it is possible to obtain equally spaced timing data from oscillators. The measurement of the drifts and noises affecting oscillators is then performed by using a variance (Allan variance, modified Allan variance, or time variance) or a system of several variances (multivariance method). However, in some cases, several samples, or even several sets of samples, are missing. In the case of millisecond pulsar timing data, for instance, observations are quite irregularly spaced in time. Nevertheless, since some observations are very close together (one minute) and since the timing data sequence is very long (more than ten years), information on both short-term and long-term stability is available. Unfortunately, a direct variance analysis is not possible without interpolating missing data. Different interpolation algorithms (linear interpolation, cubic spline) are used to calculate variances in order to verify that they neither lose information nor add erroneous information. A comparison of the results of the different algorithms is given. Finally, the multivariance method was adapted to the measurement sequence of the millisecond pulsar timing data: the responses of each variance of the system are calculated for each type of noise and drift, with the same missing samples as in the pulsar timing sequence. An estimation of precision, dynamics, and separability of this method is given.
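The interpolate-then-compute procedure outlined in this abstract can be sketched as below. This is a minimal illustration, not the authors' method: it assumes time-error samples at irregular, strictly increasing epochs, resamples them by linear interpolation onto a regular grid, and then applies the standard overlapping Allan variance estimator for time-error data. The function names and the choice of estimator are assumptions made for the example.

```python
from bisect import bisect_right


def interpolate_linear(times, values, step):
    """Resample irregular (times, values) onto a regular grid.

    times must be strictly increasing. The grid starts at times[0]
    and advances by `step` until it passes times[-1].
    """
    n_points = int((times[-1] - times[0]) / step + 1e-9) + 1
    resampled = []
    for k in range(n_points):
        t = times[0] + k * step
        # Index of the sample at or just before t, clamped so that a
        # right-hand neighbour always exists.
        j = min(bisect_right(times, t) - 1, len(times) - 2)
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(values[j] + weight * (values[j + 1] - values[j]))
    return resampled


def allan_variance(x, tau0, m=1):
    """Overlapping Allan variance of time-error samples x spaced tau0
    apart, evaluated at averaging time tau = m * tau0."""
    n = len(x)
    if n < 2 * m + 1:
        raise ValueError("not enough samples for this averaging factor")
    total = sum((x[i + 2 * m] - 2 * x[i + m] + x[i]) ** 2 for i in range(n - 2 * m))
    return total / (2.0 * (m * tau0) ** 2 * (n - 2 * m))
```

A quick sanity check of the sketch: a purely linear time error (a constant frequency offset) has zero second differences, so its Allan variance should vanish even after gaps have been filled by linear interpolation. Comparing this result with a different interpolator, such as a cubic spline, is the kind of check the abstract describes.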
Rtools: a web server for various secondary structural analyses on single RNA sequences.
Hamada, Michiaki; Ono, Yukiteru; Kiryu, Hisanori; Sato, Kengo; Kato, Yuki; Fukunaga, Tsukasa; Mori, Ryota; Asai, Kiyoshi
2016-07-08
The secondary structures, as well as the nucleotide sequences, are the important features of RNA molecules to characterize their functions. According to the thermodynamic model, however, the probability of any secondary structure is very small. As a consequence, any tool to predict the secondary structures of RNAs has limited accuracy. On the other hand, there are a few tools to compensate for the imperfect predictions by calculating and visualizing the secondary structural information from RNA sequences. It is desirable to obtain the rich information from those tools through a friendly interface. We implemented a web server of the tools to predict secondary structures and to calculate various structural features based on the energy models of secondary structures. By just giving an RNA sequence to the web server, the user can get the different types of solutions of the secondary structures, the marginal probabilities such as base-pairing probabilities, loop probabilities and accessibilities of the local bases, the energy changes by arbitrary base mutations as well as the measures for validations of the predicted secondary structures. The web server is available at http://rtools.cbrc.jp, which integrates software tools, CentroidFold, CentroidHomfold, IPKnot, CapR, Raccess, Rchange and RintD. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Schroeter, Elena R; DeHart, Caroline J; Cleland, Timothy P; Zheng, Wenxia; Thomas, Paul M; Kelleher, Neil L; Bern, Marshall; Schweitzer, Mary H
2017-02-03
Sequence data from biomolecules such as DNA and proteins, which provide critical information for evolutionary studies, have been assumed to be forever outside the reach of dinosaur paleontology. Proteins, which are predicted to have greater longevity than DNA, have been recovered from two nonavian dinosaurs, but these results remain controversial. For proteomic data derived from extinct Mesozoic organisms to reach their greatest potential for investigating questions of phylogeny and paleobiology, it must be shown that peptide sequences can be reliably and reproducibly obtained from fossils and that fragmentary sequences for ancient proteins can be increasingly expanded. To test the hypothesis that peptides can be repeatedly detected and validated from fossil tissues many millions of years old, we applied updated extraction methodology, high-resolution mass spectrometry, and bioinformatics analyses on a Brachylophosaurus canadensis specimen (MOR 2598) from which collagen I peptides were recovered in 2009. We recovered eight peptide sequences of collagen I: two identical to peptides recovered in 2009 and six new peptides. Phylogenetic analyses place the recovered sequences within basal archosauria. When only the new sequences are considered, B. canadensis is grouped more closely to crocodylians, but when all sequences (current and those reported in 2009) are analyzed, B. canadensis is placed more closely to basal birds. The data robustly support the hypothesis of an endogenous origin for these peptides, confirm the idea that peptides can survive in specimens tens of millions of years old, and bolster the validity of the 2009 study. Furthermore, the new data expand the coverage of B. canadensis collagen I (a 33.6% increase in collagen I alpha 1 and 116.7% in alpha 2). 
Finally, this study demonstrates the importance of reexamining previously studied specimens with updated methods and instrumentation, as we obtained roughly the same amount of sequence data as the previous study with substantially less sample material. Data are available via ProteomeXchange with identifier PXD005087.
PASS2: an automated database of protein alignments organised as structural superfamilies.
Bhaduri, Anirban; Pugalenthi, Ganesan; Sowdhamini, Ramanathan
2004-04-02
The functional selection and three-dimensional structural constraints of proteins in nature often relates to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins, provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionary related, sequentially distant proteins. An automated and updated version of PASS2 is, in direct correspondence with SCOP 1.63, consisting of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members both based on sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. 
The search engine has been improved for simpler browsing of the database. The database resolves alignments among the structural domains consisting of evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
An, Jianyu; Yin, Mengqi; Zhang, Qin; Gong, Dongting; Jia, Xiaowen; Guan, Yajing; Hu, Jin
2017-09-11
Luffa cylindrica (L.) Roem. is an economically important vegetable crop in China. However, the genomic information on this species is currently unknown. In this study, for the first time, a genome survey of L. cylindrica was carried out using next-generation sequencing (NGS) technology. In total, 43.40 Gb sequence data of L. cylindrica, about 54.94× coverage of the estimated genome size of 789.97 Mb, were obtained from HiSeq 2500 sequencing, in which the guanine plus cytosine (GC) content was calculated to be 37.90%. The heterozygosity of genome sequences was only 0.24%. In total, 1,913,731 contigs (>200 bp) with 525 bp N50 length and 1,410,117 scaffolds (>200 bp) with 885.01 Mb total length were obtained. From the initial assembled L. cylindrica genome, 431,234 microsatellites (SSRs) (≥5 repeats) were identified. The motif types of SSR repeats included 62.88% di-nucleotide, 31.03% tri-nucleotide, 4.59% tetra-nucleotide, 0.96% penta-nucleotide and 0.54% hexa-nucleotide. Eighty genomic SSR markers were developed, and 51/80 primers could be used in both "Zheda 23" and "Zheda 83". Nineteen SSRs were used to investigate the genetic diversity among 32 accessions through SSR-HRM analysis. The unweighted pair group method analysis (UPGMA) dendrogram tree was built by calculating the SSR-HRM raw data. SSR-HRM could be effectively used for genotype relationship analysis of Luffa species.
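Two of the figures in this abstract lend themselves to a short worked example: the sequencing depth (total bases divided by estimated genome size) and the detection of microsatellites with di- to hexa-nucleotide motifs repeated at least five times. The sketch below is a simplified regular-expression scan written for illustration; the abstract does not name the SSR-finding software the authors used, and the function names here are hypothetical.

```python
import re


def sequencing_depth(total_bases, genome_size_bases):
    """Average depth of coverage: sequenced bases per genome base."""
    return total_bases / genome_size_bases


# A motif of 2 to 6 bases (shortest tried first) followed by at least
# four more copies of itself, i.e. five or more tandem repeats.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit
    (so 'AG' passes, while 'AA' or 'AGAG' do not)."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each di- to
    hexa-nucleotide SSR with at least five tandem repeats."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if not is_primitive(motif):
            continue  # e.g. homopolymer runs matched as 'AA' x N
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits
```

Using the numbers reported above, 43.40 Gb of data over a 789.97 Mb genome gives a depth of roughly 54.94×, which agrees with the stated coverage.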
Pardo, Belén G; Álvarez-Dios, José Antonio; Cao, Asunción; Ramilo, Andrea; Gómez-Tato, Antonio; Planas, Josep V; Villalba, Antonio; Martínez, Paulino
2016-12-01
The flat oyster, Ostrea edulis, is one of the main farmed oysters, not only in Europe but also in the United States and Canada. Bonamiosis due to the parasite Bonamia ostreae has been associated with high mortality episodes in this species. This parasite is an intracellular protozoan that infects haemocytes, the main cells involved in oyster defence. Due to the economical and ecological importance of flat oyster, genomic data are badly needed for genetic improvement of the species, but they are still very scarce. The objective of this study is to develop a sequence database, OedulisDB, with new genomic and transcriptomic resources, providing new data and convenient tools to improve our knowledge of the oyster's immune mechanisms. Transcriptomic and genomic sequences were obtained using 454 pyrosequencing and compiled into an O. edulis database, OedulisDB, consisting of two sets of 10,318 and 7159 unique sequences that represent the oyster's genome (WG) and de novo haemocyte transcriptome (HT), respectively. The flat oyster transcriptome was obtained from two strains (naïve and tolerant) challenged with B. ostreae, and from their corresponding non-challenged controls. Approximately 78.5% of 5619 HT unique sequences were successfully annotated by Blast search using public databases. A total of 984 sequences were identified as being related to immune response and several key immune genes were identified for the first time in flat oyster. Additionally, transcriptome information was used to design and validate the first oligo-microarray in flat oyster enriched with immune sequences from haemocytes. Our transcriptomic and genomic sequencing and subsequent annotation have largely increased the scarce resources available for this economically important species and have enabled us to develop an OedulisDB database and accompanying tools for gene expression analysis. This study represents the first attempt to characterize in depth the O. 
edulis haemocyte transcriptome in response to B. ostreae through massively sequencing and has aided to improve our knowledge of the immune mechanisms of flat oyster. The validated oligo-microarray and the establishment of a reference transcriptome will be useful for large-scale gene expression studies in this species. Copyright © 2016 Elsevier Ltd. All rights reserved.
Egorov, Evgeny S; Merzlyak, Ekaterina M; Shelenkov, Andrew A; Britanova, Olga V; Sharonov, George V; Staroverov, Dmitriy B; Bolotin, Dmitriy A; Davydov, Alexey N; Barsova, Ekaterina; Lebedev, Yuriy B; Shugay, Mikhail; Chudakov, Dmitriy M
2015-06-15
Emerging high-throughput sequencing methods for the analyses of complex structure of TCR and BCR repertoires give a powerful impulse to adaptive immunity studies. However, there are still essential technical obstacles for performing a truly quantitative analysis. Specifically, it remains challenging to obtain comprehensive information on the clonal composition of small lymphocyte populations, such as Ag-specific, functional, or tissue-resident cell subsets isolated by sorting, microdissection, or fine needle aspirates. In this study, we report a robust approach based on unique molecular identifiers that allows profiling Ag receptors for several hundred to thousand lymphocytes while preserving qualitative and quantitative information on clonal composition of the sample. We also describe several general features regarding the data analysis with unique molecular identifiers that are critical for accurate counting of starting molecules in high-throughput sequencing applications. Copyright © 2015 by The American Association of Immunologists, Inc.
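The idea of counting starting molecules with unique molecular identifiers (UMIs) can be sketched as follows. This is a deliberately simplified illustration and not the authors' pipeline: it assumes each read has already been reduced to a (UMI, clonotype) pair, takes a majority vote within each UMI group, and discards UMIs supported by too few reads. The function name and the `min_reads_per_umi` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def count_molecules(reads, min_reads_per_umi=2):
    """Count starting molecules per clonotype.

    reads: iterable of (umi, clonotype) pairs, one per sequencing read.
    Reads sharing a UMI are treated as copies of one starting molecule;
    the most frequent clonotype within the group is taken as its
    consensus, which suppresses PCR and sequencing errors.
    """
    per_umi = defaultdict(Counter)
    for umi, clonotype in reads:
        per_umi[umi][clonotype] += 1

    molecules = Counter()
    for votes in per_umi.values():
        if sum(votes.values()) < min_reads_per_umi:
            continue  # too few reads to trust this UMI
        consensus, _ = votes.most_common(1)[0]
        molecules[consensus] += 1  # one UMI counts as one molecule
    return dict(molecules)
```

In this scheme, a clonotype amplified into thousands of reads still counts once per UMI, which is what keeps the estimate quantitative for small lymphocyte samples.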
Ganguli, Sayak; Gupta, Manoj Kumar; Basu, Protip; Banik, Rahul; Singh, Pankaj Kumar; Vishal, Vineet; Bera, Abhisek Ranjan; Chakraborty, Hirak Jyoti; Das, Sasti Gopal
2014-01-01
With the advent of the age of big data and advances in high-throughput technology, accessing data has become one of the most important steps in the entire knowledge discovery process. Most users are not able to decipher the query result that is obtained when nonspecific keywords or a combination of keywords are used. Intelligent access to sequence and structure databases (IASSD) is a desktop application for the Windows operating system. It is written in Java and utilizes the Web Services Description Language (WSDL) files and Jar files of E-utilities of various databases such as the National Center for Biotechnology Information (NCBI) and Protein Data Bank (PDB). In addition, IASSD allows the user to view protein structures using a Jmol application that supports conditional editing. The Jar file is freely available through e-mail from the corresponding author.
Recovery of a Medieval Brucella melitensis Genome Using Shotgun Metagenomics
Kay, Gemma L.; Sergeant, Martin J.; Giuffra, Valentina; Bandiera, Pasquale; Milanese, Marco; Bramanti, Barbara
2014-01-01
ABSTRACT Shotgun metagenomics provides a powerful assumption-free approach to the recovery of pathogen genomes from contemporary and historical material. We sequenced the metagenome of a calcified nodule from the skeleton of a 14th-century middle-aged male excavated from the medieval Sardinian settlement of Geridu. We obtained 6.5-fold coverage of a Brucella melitensis genome. Sequence reads from this genome showed signatures typical of ancient or aged DNA. Despite the relatively low coverage, we were able to use information from single-nucleotide polymorphisms to place the medieval pathogen genome within a clade of B. melitensis strains that included the well-studied Ether strain and two other recent Italian isolates. We confirmed this placement using information from deletions and IS711 insertions. We conclude that metagenomics stands ready to document past and present infections, shedding light on the emergence, evolution, and spread of microbial pathogens. PMID:25028426
Microsatellite analysis in the genome of Acanthaceae: An in silico approach
Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar
2015-01-01
Background: Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. The microsatellites existing in complete genome sequences have a direct role in genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes. Objective: The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. Materials and Methods: The whole nucleotide sequences of Acanthaceae species were obtained from the National Center for Biotechnology Information database and screened for the presence of SSRs. The SSR Locator tool was used to predict the microsatellites and its inbuilt Primer3 module was used for primer design. Results: In total, 110 repeats from 108 sequences of Acanthaceae family plant genomes were identified, and dinucleotide repeats were found to be abundant in the genome sequences. The essential amino acid isoleucine was found to be abundant in all the sequences. We also designed SSR-based primers/markers for 59 sequences of this family that contain microsatellite repeats in their genome. Conclusion: The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to the Acanthaceae family in the future. PMID:25709226
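Screening a nucleotide sequence for SSRs, as described above, can be sketched with a regular expression in Python. This is an illustrative sketch only and not the SSR Locator algorithm: the function name and the minimum repeat counts per motif length are assumptions chosen for the example.

```python
import re

# Assumed minimum number of repeats per motif length (illustrative thresholds).
DEFAULT_MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect tandem repeats of 1-6 nt."""
    if min_repeats is None:
        min_repeats = DEFAULT_MIN_REPEATS
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT" or "AAA").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


# Synthetic example sequence, not taken from any Acanthaceae genome.
example = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "T"
print(find_ssrs(example))
```

The flanking sequence around each reported start position is what a primer design step would then take as input.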
An interdisciplinary analysis of ERTS data for Colorado mountain environments using ADP Techniques
NASA Technical Reports Server (NTRS)
Hoffer, R. M. (Principal Investigator)
1972-01-01
Author-identified significant preliminary results from the Ouachita portion of the Texoma frame of data indicate many potentials in the analysis and interpretation of ERTS data. It is believed that one of the more significant aspects of this analysis sequence has been the investigation of a technique to relate ERTS analysis and surface observation analysis. At present a sequence involving (1) preliminary analysis based solely upon the spectral characteristics of the data, followed by (2) a surface observation mission to obtain visual information and oblique photography of particular points of interest in the test site area, appears to provide an extremely efficient technique for obtaining particularly meaningful surface observation data. Following such a procedure permits concentration on particular points of interest in the entire ERTS frame and thereby makes the surface observation data obtained particularly significant and meaningful. The analysis of the Texoma frame has also been significant from the standpoint of demonstrating a fast turnaround analysis capability. Additionally, the analysis has shown the potential accuracy and degree of complexity of features that can be identified and mapped using ERTS data.
Numeric promoter description - A comparative view on concepts and general application.
Beier, Rico; Labudde, Dirk
2016-01-01
Nucleic acid molecules play a key role in a variety of biological processes. Starting from storage and transfer tasks, this also comprises the triggering of biological processes, regulatory effects and the active influence gained by target binding. Based on the experimental output (in this case promoter sequences), further in silico analyses aid in gaining new insights into these processes and interactions. The numerical description of nucleic acids thereby constitutes a bridge between the concrete biological issues and the analytical methods. Hence, this study compares 26 descriptor sets obtained by applying well-known numerical description concepts to an established dataset of 38 DNA promoter sequences. The suitability of the description sets was evaluated by computing partial least squares regression models and assessing the model accuracy. We conclude that the major importance regarding the descriptive power is attached to positional information rather than to explicitly incorporated physico-chemical information, since a sufficient amount of implicit physico-chemical information is already encoded in the nucleobase classification. The regression models especially benefited from employing the information that is encoded in the sequential and structural neighborhood of the nucleobases. Thus, the analyses of n-grams (short fragments of length n) suggested that they are valuable descriptors for DNA target interactions. A mixed n-gram descriptor set thereby yielded the best description of the promoter sequences. The corresponding regression model was checked and found to be plausible as it was able to reproduce the characteristic binding motifs of promoter sequences in a reasonable degree. As most functional nucleic acids are based on the principle of molecular recognition, the findings are not restricted to promoter sequences, but can rather be transferred to other kinds of functional nucleic acids. 
Thus, the concepts presented in this study could provide advantages for future nucleic acid-based technologies, like biosensoring, therapeutics and molecular imaging. Copyright © 2015 Elsevier Inc. All rights reserved.
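The n-gram description of a nucleotide sequence discussed above can be illustrated with a short Python sketch. It assumes a plain relative-frequency encoding over the alphabet A, C, G, T; the function names and the choice of n = 1, 2, 3 for the mixed set are hypothetical and are not the descriptor definitions used in the study.

```python
from collections import Counter
from itertools import product


def ngram_descriptor(seq, n):
    """Relative frequency of every possible n-gram over A, C, G, T in seq."""
    seq = seq.upper()
    windows = max(len(seq) - n + 1, 1)
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    all_ngrams = ("".join(p) for p in product("ACGT", repeat=n))
    return {gram: counts[gram] / windows for gram in all_ngrams}


def mixed_ngram_descriptor(seq, sizes=(1, 2, 3)):
    """Concatenate descriptors for several n into one feature dictionary."""
    features = {}
    for n in sizes:
        features.update(ngram_descriptor(seq, n))
    return features


# Illustrative input: a TATAAT-like hexamer, used here only as example text.
descriptor = ngram_descriptor("TATAAT", 2)
print(descriptor["TA"], descriptor["AT"], descriptor["AA"])
```

A feature dictionary of this kind could then serve as the input matrix for a regression model; position-specific variants would need the window index to be part of each key.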
Region-based multifocus image fusion for the precise acquisition of Pap smear images.
Tello-Mijares, Santiago; Bescós, Jesús
2018-05-01
A multifocus image fusion method to obtain a single focused image from a sequence of microscopic high-magnification Papanicolaou (Pap smear) source images is presented. These images, each captured at a different position of the microscope lens, frequently show partially focused cells or parts of cells, which makes them impractical for the direct application of image analysis techniques. The proposed method obtains a focused image with high preservation of original pixel information while keeping fusion artifacts negligibly visible. The method starts by identifying the best-focused image of the sequence; then, it performs a mean-shift segmentation over this image; the focus level of the segmented regions is evaluated in all the images of the sequence, and best-focused regions are merged in a single combined image; finally, this image is processed with an adaptive artifact removal process. The combination of a region-oriented approach, instead of block-based approaches, and a minimum modification of the value of focused pixels in the original images achieves a highly contrasted image with no visible artifacts, which makes this method especially convenient for the medical imaging domain. The proposed method is compared with several state-of-the-art alternatives over a representative dataset. The experimental results show that our proposal obtains the best and most stable quality indicators. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao
2012-05-01
Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information. These features measure, from different aspects, the interactions between residues a certain distance apart in the protein sequences. To achieve better performance, principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to a Support Vector Machine (SVM). Experimental results on Drosophila melanogaster and Helicobacter pylori datasets show that our method is very promising for predicting PPIs and may at least be a useful supplementary tool to existing methods.
Schrijver, Iris; Aziz, Nazneen; Farkas, Daniel H; Furtado, Manohar; Gonzalez, Andrea Ferreira; Greiner, Timothy C; Grody, Wayne W; Hambuch, Tina; Kalman, Lisa; Kant, Jeffrey A; Klein, Roger D; Leonard, Debra G B; Lubin, Ira M; Mao, Rong; Nagan, Narasimhan; Pratt, Victoria M; Sobel, Mark E; Voelkerding, Karl V; Gibson, Jane S
2012-11-01
This report of the Whole Genome Analysis group of the Association for Molecular Pathology illuminates the opportunities and challenges associated with clinical diagnostic genome sequencing. With the reality of clinical application of next-generation sequencing, technical aspects of molecular testing can be accomplished at greater speed and with higher volume, while much information is obtained. Although this testing is a next logical step for molecular pathology laboratories, the potential impact on the diagnostic process and clinical correlations is extraordinary and clinical interpretation will be challenging. We review the rapidly evolving technologies; provide application examples; discuss aspects of clinical utility, ethics, and consent; and address the analytic, postanalytic, and professional implications. Copyright © 2012 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Hyun, Tae Kyung; Lee, Sarah; Kumar, Dhinesh; Rim, Yeonggil; Kumar, Ritesh; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean
2014-10-01
Using Illumina sequencing technology, we have generated large-scale transcriptome sequencing data containing abundant information on genes involved in the metabolic pathways in R. idaeus cv. Nova fruits. Rubus idaeus (red raspberry) is an economically important crop that possesses numerous nutrients, micronutrients and phytochemicals with essential health benefits to humans. The molecular mechanism underlying the ripening process and phytochemical biosynthesis in red raspberry is attributed to changes in gene expression, but very limited transcriptomic and genomic information is available in public databases. To address this issue, we generated more than 51 million sequencing reads from R. idaeus cv. Nova fruit using Illumina RNA-Seq technology. After de novo assembly, we obtained 42,604 unigenes with an average length of 812 bp. At the protein level, the Nova fruit transcriptome showed 77 and 68 % sequence similarities with Rubus coreanus and Fragaria vesca, respectively, indicating the evolutionary relationship between them. In addition, 69 % of assembled unigenes were annotated using public databases including NCBI non-redundant, Cluster of Orthologous Groups and Gene Ontology databases, suggesting that our transcriptome dataset provides a valuable resource for investigating metabolic processes in red raspberry. To analyze the relationship between several novel transcripts and the amounts of metabolites such as γ-aminobutyric acid and anthocyanins, real-time PCR and target metabolite analysis were performed on two different ripening stages of Nova. This is the first attempt to use the Illumina sequencing platform for RNA sequencing and de novo assembly of Nova fruit without a reference genome. Our data provide the most comprehensive transcriptome resource available for Rubus fruits, and will be useful for understanding the ripening process and for breeding R. idaeus cultivars with improved fruit quality.
Loss-resistant unambiguous phase measurement
NASA Astrophysics Data System (ADS)
Dinani, Hossein T.; Berry, Dominic W.
2014-08-01
Entangled multiphoton states have the potential to provide improved measurement accuracy, but are sensitive to photon loss. It is possible to calculate ideal loss-resistant states that maximize the Fisher information, but it is unclear how these could be experimentally generated. Here we propose a set of states that can be obtained by processing the output from parametric down-conversion. Although these states are not optimal, they provide performance very close to that of optimal states for a range of parameters. Moreover, we show how to use sequences of such states in order to obtain an unambiguous phase measurement that beats the standard quantum limit. We consider the optimization of parameters in order to minimize the final phase variance, and find that the optimum parameters are different from those that maximize the Fisher information.
NASA Technical Reports Server (NTRS)
Koenig, John C.; Billitti, Joseph W.; Tallon, John M.
1980-01-01
Criteria are defined for auditing photovoltaic system applications and experiments. The purpose of the audit is twofold: to see if the application is meeting its stated objectives and to measure the application's progress in terms of the National Photovoltaic Program's goals of performance, cost, reliability, safety, and socio-environmental acceptance. The information obtained from an audit will be used to assess the status of an application and to provide the Department of Energy with recommendations on the future conduct of the application. The aspects of a site audit necessary to produce a systematic method for gathering qualitative and quantitative data to measure the success of an application are covered. A sequence of audit events and guidelines for obtaining the required information are presented.
Guo, Fei; Yu, Jiao; Zhang, Lu; Li, Jun
2017-11-01
The ForenSeq™ DNA Signature Prep Kit (ForenSeq Kit) is designed to detect more than 200 forensically relevant markers in a single reaction on the MiSeq FGx™ Forensic Genomics System (MiSeq FGx System), including Amelogenin, 27 autosomal short tandem repeats (A-STRs), 7 X chromosomal STRs (X-STRs), 24 Y chromosomal STRs (Y-STRs) and 94 identity-informative single nucleotide polymorphisms (iSNPs) with the option to contain 22 phenotypic-informative SNPs (pSNPs) and 56 ancestry-informative SNPs (aSNPs). In this study, we evaluated the MiSeq FGx System on three major parts: methodological optimization (DNA extraction, sample quantification, library normalization, diluted libraries concentration, and sample-to-cell arrangement), massively parallel sequencing (MPS) performance (depth of coverage, sequence coverage ratio, and allele coverage ratio), and ForenSeq Kit characteristics (repeatability and concordance, sensitivity, mixture, stability and case-type samples). Results showed that quantitative polymerase chain reaction (qPCR)-based sample quantification and library normalization and the appropriate number of pooled libraries and concentration of diluted libraries provided a greater level of MPS performance and repeatability. Repeatable and concordant genotypes were obtained by the ForenSeq Kit. Full profiles were obtained from ≥100pg input DNA for STRs and ≥200pg for SNPs. A sample with ≥5% minor contributors was considered as a mixture by imbalanced allele coverage ratio distribution, and full profiles from minor contributors were easily detected between 9:1 and 1:9 mixtures with known reference profiles. The ForenSeq Kit tolerated considerable concentrations of inhibitors like ≤200μM hematin and ≤50μg/ml humic acid, and >56% STR profiles and >88% SNP profiles were obtained from ≥200-bp degraded samples. Also, it was adapted to case-type samples. 
As a whole, the ForenSeq Kit is a well-performing, robust, reliable, reproducible and highly informative assay, and it can fully meet requirements for human identification. Further, the sensitive QC indicator and automated sample comparison function in the ForenSeq™ Universal Analysis Software are quite helpful, so that we can concentrate on questionable genotypes and avoid tedious, time-consuming labor, making the most of the time spent in data analysis. Copyright © 2017 Elsevier B.V. All rights reserved.
González-Caballero, Natalia; Valenzuela, Jesus G; Ribeiro, José M C; Cuervo, Patricia; Brazil, Reginaldo P
2013-03-07
Molecules involved in pheromone biosynthesis may represent alternative targets for insect population control. This may be particularly useful in managing the reproduction of Lutzomyia longipalpis, the main vector of the protozoan parasite Leishmania infantum in Latin America. Apart from the chemical identity of the major components of the L. longipalpis sex pheromone, there is no information regarding the molecular biology behind its production. To understand this process, obtaining information on which genes are expressed in the pheromone gland is essential. In this study we used a transcriptomic approach to explore the pheromone gland and adjacent abdominal tergites in order to obtain substantial general sequence information. We used a laboratory-reared L. longipalpis (one spot, 9-Methyl GermacreneB) population, captured in Lapinha Cave, state of Minas Gerais, Brazil for this analysis. From a total of 3,547 cDNA clones, 2,502 high quality sequences from the pheromone gland and adjacent tissues were obtained and assembled into 1,387 contigs. Through BLAST searches of public databases, a group of transcripts encoding proteins potentially involved in the production of terpenoid precursors was identified in the 4th abdominal tergite, the segment containing the pheromone gland. Among them, protein-coding transcripts for four enzymes of the mevalonate pathway (3-hydroxy-3-methylglutaryl-CoA reductase, phosphomevalonate kinase, diphosphomevalonate decarboxylase, and isopentenyl pyrophosphate isomerase) were identified. Moreover, transcripts coding for farnesyl diphosphate synthase and NADP+ dependent farnesol dehydrogenase were also found in the same tergite. Additionally, genes potentially involved in pheromone transportation were identified from the three abdominal tergites analyzed. This study constitutes the first transcriptomic analysis exploring the repertoire of genes expressed in the tissue containing the L. longipalpis pheromone gland as well as the flanking tissues. Using a comparative approach, a set of molecules potentially present in the mevalonate pathway emerges as an interesting subject for further study regarding their association with pheromone biosynthesis. The sequences presented here may be used as a reference set for future research on pheromone production or other characteristics of pheromone communication in this insect. Moreover, some matches for transcripts of unknown function may provide fertile ground for an in-depth study of pheromone-gland specific molecules.
Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.
Asara, John M; Schweitzer, Mary H; Freimark, Lisa M; Phillips, Matthew; Cantley, Lewis C
2007-04-13
Fossilized bones from extinct taxa harbor the potential for obtaining protein or DNA sequences that could reveal evolutionary links to extant species. We used mass spectrometry to obtain protein sequences from bones of a 160,000- to 600,000-year-old extinct mastodon (Mammut americanum) and a 68-million-year-old dinosaur (Tyrannosaurus rex). The presence of T. rex sequences indicates that their peptide bonds were remarkably stable. Mass spectrometry can thus be used to determine unique sequences from ancient organisms from peptide fragmentation patterns, a valuable tool to study the evolution and adaptation of ancient taxa from which genomic sequences are unlikely to be obtained.
Microsatellite DNA capture from enriched libraries.
Gonzalez, Elena G; Zardoya, Rafael
2013-01-01
Microsatellites are DNA sequences of tandem repeats of one to six nucleotides, which are highly polymorphic, and thus the molecular markers of choice in many kinship, population genetic, and conservation studies. There have been significant technical improvements since the early methods for microsatellite isolation were developed, and today the most common procedures take advantage of hybrid capture methods for enriching targeted microsatellite DNA. Furthermore, recent advances in sequencing technologies (i.e., next-generation sequencing, NGS) have fostered the mining of microsatellite markers in non-model organisms, affording a cost-effective way of obtaining a large amount of sequence data potentially useful for loci characterization. The rapid improvements of NGS platforms together with the increase in available microsatellite information open new avenues to the understanding of the evolutionary forces that shape genetic structuring in wild populations. Here, we provide detailed methodological procedures for microsatellite isolation based on the screening of GT microsatellite-enriched libraries, either by cloning and Sanger sequencing of positive clones or by direct NGS. Guides for designing new species-specific primers and basic genotyping are also given.
Lopez-Doriga, Adriana; Feliubadaló, Lídia; Menéndez, Mireia; Lopez-Doriga, Sergio; Morón-Duran, Francisco D; del Valle, Jesús; Tornero, Eva; Montes, Eva; Cuesta, Raquel; Campos, Olga; Gómez, Carolina; Pineda, Marta; González, Sara; Moreno, Victor; Capellá, Gabriel; Lázaro, Conxi
2014-03-01
Next-generation sequencing (NGS) has revolutionized genomic research and is set to have a major impact on genetic diagnostics thanks to the advent of benchtop sequencers and flexible kits for targeted libraries. Among the main hurdles in NGS are the difficulty of performing bioinformatic analysis of the huge volume of data generated and the high number of false positive calls that could be obtained, depending on the NGS technology and the analysis pipeline. Here, we present the development of a free and user-friendly Web data analysis tool that detects and filters sequence variants, provides coverage information, and allows the user to customize some basic parameters. The tool has been developed to provide accurate genetic analysis of targeted sequencing of common high-risk hereditary cancer genes using amplicon libraries run in a GS Junior System. The Web resource is linked to our own mutation database, to assist in the clinical classification of identified variants. We believe that this tool will greatly facilitate the use of the NGS approach in routine laboratories.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2011-01-01
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
Barbau-Piednoir, Elodie; De Keersmaecker, Sigrid C J; Delvoye, Maud; Gau, Céline; Philipp, Patrick; Roosens, Nancy H
2015-11-11
Recently, the presence of an unauthorized genetically modified (GM) Bacillus subtilis bacterium overproducing vitamin B2 in a feed additive was notified by the Rapid Alert System for Food and Feed (RASFF). This has demonstrated that contamination by a GM micro-organism (GMM) may occur in feed additives and has confronted enforcement laboratories with this type of RASFF notification for the first time. As neither sequence information for this GMM nor any specific detection or identification method was available, Next Generation Sequencing (NGS) was used to generate sequence information. However, NGS data analysis often requires appropriate tools and bioinformatics expertise, which is not always present in the average enforcement laboratory. This hampers the use of this technology to rapidly obtain the critical sequence information needed to develop a specific qPCR detection method. Data generated by NGS were exploited using a simple BLAST approach. A TaqMan® qPCR method was developed and tested on isolated bacterial strains and on the feed additive directly. In this study, a very simple strategy based on the common BLAST tools, which can be used by any enforcement lab without profound bioinformatics expertise, was successfully used to analyse the B. subtilis data generated by NGS. The results were used to design and assess a new TaqMan® qPCR method, specifically detecting this GM vitamin B2 overproducing bacterium. The method complies with EU critical performance parameters for specificity, sensitivity, PCR efficiency and repeatability. The VitB2-UGM method could also detect the B. subtilis strain in genomic DNA extracted from the feed additive, without a prior culturing step. The proposed method provides a crucial tool for specifically and rapidly identifying this unauthorized GM bacterium in food and feed additives by enforcement laboratories. 
Moreover, this work can be seen as a case study to substantiate how the use of NGS data can offer an added value to easily gain access to sequence information needed to develop qPCR methods to detect unknown and unauthorized GMO in food and feed.
Churchill, Jennifer D; Novroski, Nicole M M; King, Jonathan L; Seah, Lay Hong; Budowle, Bruce
2017-09-01
The MiSeq FGx Forensic Genomics System (Illumina) enables amplification and massively parallel sequencing of 59 STRs, 94 identity informative SNPs, 54 ancestry informative SNPs, and 24 phenotypic informative SNPs. Allele frequency and population statistics data were generated for the 172 SNP loci included in this panel on four major population groups (Chinese, African Americans, US Caucasians, and Southwest Hispanics). Single-locus and combined random match probability values were generated for the identity informative SNPs. The average combined STR and identity informative SNP random match probabilities (assuming independence) across all four populations were 1.75E-67 and 2.30E-71 with length-based and sequence-based STR alleles, respectively. Ancestry and phenotype predictions were obtained using the ForenSeq™ Universal Analysis System (UAS; Illumina) based on the ancestry informative and phenotype informative SNP profiles generated for each sample. Additionally, performance metrics, including profile completeness, read depth, relative locus performance, and allele coverage ratios, were evaluated and detailed for the 725 samples included in this study. While some genetic markers included in this panel performed notably better than others, performance across populations was generally consistent. The performance and population data included in this study support that accurate and reliable profiles were generated and provide valuable background information for laboratories considering internal validation studies and implementation. Copyright © 2017 Elsevier B.V. All rights reserved.
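The combination of single-locus random match probabilities under the independence assumption mentioned above can be sketched in Python. The sketch assumes biallelic SNPs in Hardy-Weinberg equilibrium and uses made-up allele frequencies; the function names are hypothetical and none of the numbers come from the study.

```python
from math import prod


def snp_genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for a biallelic SNP with allele frequency p."""
    q = 1.0 - p
    return [p * p, 2.0 * p * q, q * q]


def locus_rmp(genotype_freqs):
    """Single-locus random match probability: sum of squared genotype frequencies."""
    return sum(f * f for f in genotype_freqs)


def combined_rmp(allele_freqs):
    """Product of single-locus RMPs, valid only if the loci are independent."""
    return prod(locus_rmp(snp_genotype_freqs(p)) for p in allele_freqs)


# Hypothetical allele frequencies for three SNPs (illustration only).
print(combined_rmp([0.5, 0.5, 0.5]))
```

Because the combined value is a product over loci, it shrinks quickly as markers are added, which is why very small probabilities arise from panels with many STRs and SNPs.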
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-01-01
Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds, and thus determining structural similarity without sequence similarity would be desirable for structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for datasets in which the sequence identity of any pair of sequences belongs to the twilight zone. We propose the SCPRED method, which improves prediction accuracy for sequences that share twilight-zone pairwise similarity with the sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on an extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior to over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble-of-classifiers predictors. Conclusion SCPRED can accurately find similar structures for sequences that share low identity with the sequences used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616
Stepanauskas, Ramunas; Fergusson, Elizabeth A; Brown, Joseph; Poulton, Nicole J; Tupper, Ben; Labonté, Jessica M; Becraft, Eric D; Brown, Julia M; Pachiadaki, Maria G; Povilaitis, Tadas; Thompson, Brian P; Mascena, Corianna J; Bellows, Wendy K; Lubys, Arvydas
2017-07-20
Microbial single-cell genomics can be used to provide insights into the metabolic potential, interactions, and evolution of uncultured microorganisms. Here we present WGA-X, a method based on multiple displacement amplification of DNA that utilizes a thermostable mutant of the phi29 polymerase. WGA-X enhances genome recovery from individual microbial cells and viral particles while maintaining ease of use and scalability. The greatest improvements are observed when amplifying high G+C content templates, such as those belonging to the predominant bacteria in agricultural soils. By integrating WGA-X with calibrated index-cell sorting and high-throughput genomic sequencing, we are able to analyze genomic sequences and cell sizes of hundreds of individual, uncultured bacteria, archaea, protists, and viral particles, obtained directly from marine and soil samples, in a single experiment. This approach may find diverse applications in microbiology and in biomedical and forensic studies of humans and other multicellular organisms. Single-cell genomics can be used to study uncultured microorganisms. Here, Stepanauskas et al. present a method combining improved multiple displacement amplification and FACS, to obtain genomic sequences and cell size information from uncultivated microbial cells and viral particles in environmental samples.
Investigation of the Iterative Phase Retrieval Algorithm for Interferometric Applications
NASA Astrophysics Data System (ADS)
Gombkötő, Balázs; Kornis, János
2010-04-01
Sequentially recorded intensity patterns reflected from a coherently illuminated diffuse object can be used to reconstruct the complex amplitude of the scattered beam. Several iterative phase retrieval algorithms are known in the literature to obtain the initially unknown phase from these longitudinally displaced intensity patterns. When two sequences are recorded in two different states of a centimeter-sized object in optical setups that are similar to digital holographic interferometry (but omitting the reference wave), displacement, deformation, or shape measurement is theoretically possible. To do this, the retrieved phase pattern should contain information not only about the intensities and locations of the point sources of the object surface, but their relative phase as well. Not only do experiments require strict mechanical precision to record useful data, but even in simulations several parameters influence the capabilities of iterative phase retrieval, such as object to camera distance range, uniform or varying camera step sequence, speckle field characteristics, and sampling. Experiments were also done to demonstrate this principle with a deformable object as large as 5×5 cm. Good initial results were obtained in an imaging setup, where the intensity pattern sequences were recorded near the image plane.
Individual predictions of eye-movements with dynamic scenes
NASA Astrophysics Data System (ADS)
Barth, Erhardt; Drewes, Jan; Martinetz, Thomas
2003-06-01
We present a model that predicts saccadic eye-movements and can be tuned to a particular human observer who is viewing a dynamic sequence of images. Our work is motivated by applications that involve gaze-contingent interactive displays on which information is displayed as a function of gaze direction. The approach therefore differs from standard approaches in two ways: (1) we deal with dynamic scenes, and (2) we provide means of adapting the model to a particular observer. As an indicator for the degree of saliency we evaluate the intrinsic dimension of the image sequence within a geometric approach implemented by using the structure tensor. Out of these candidate saliency-based locations, the currently attended location is selected according to a strategy found by supervised learning. The data are obtained with an eye-tracker and subjects who view video sequences. The selection algorithm receives candidate locations of current and past frames and a limited history of locations attended in the past. We use a linear mapping that is obtained by minimizing the quadratic difference between the predicted and the actually attended location by gradient descent. Being linear, the learned mapping can be quickly adapted to the individual observer.
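The learning step described in this abstract, a linear mapping fitted by gradient descent on the quadratic difference between predicted and attended locations, can be illustrated with a minimal sketch. It is not the authors' implementation: the feature layout, the learning rate, the number of epochs, and the synthetic data are all assumptions chosen for illustration.

```python
import random

def predict(weights, features):
    # weights holds one row per output coordinate (x and y of the gaze location).
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def fit_linear_mapping(samples, n_features, lr=0.5, epochs=300):
    """Minimise the mean squared difference between predicted and
    attended locations by plain batch gradient descent."""
    weights = [[0.0] * n_features for _ in range(2)]
    for _ in range(epochs):
        grads = [[0.0] * n_features for _ in range(2)]
        for features, target in samples:
            pred = predict(weights, features)
            for k in range(2):
                err = pred[k] - target[k]
                for j in range(n_features):
                    grads[k][j] += 2.0 * err * features[j] / len(samples)
        for k in range(2):
            for j in range(n_features):
                weights[k][j] -= lr * grads[k][j]
    return weights

# Synthetic stand-in for eye-tracker data (hypothetical): each feature vector
# would encode candidate and previously attended locations.
random.seed(0)
true_weights = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
samples = []
for _ in range(100):
    features = [random.uniform(-1, 1) for _ in range(3)]
    samples.append((features, predict(true_weights, features)))

learned = fit_linear_mapping(samples, n_features=3)
```

Because the model is linear, adapting it to a new observer amounts to rerunning the same cheap update on that observer's recordings.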
Integration of Temporal and Ordinal Information During Serial Interception Sequence Learning
Gobel, Eric W.; Sanchez, Daniel J.; Reber, Paul J.
2011-01-01
The expression of expert motor skills typically involves learning to perform a precisely timed sequence of movements (e.g., language production, music performance, athletic skills). Research examining incidental sequence learning has previously relied on a perceptually-cued task that gives participants exposure to repeating motor sequences but does not require timing of responses for accuracy. Using a novel perceptual-motor sequence learning task, learning a precisely timed cued sequence of motor actions is shown to occur without explicit instruction. Participants learned a repeating sequence through practice and showed sequence-specific knowledge via a performance decrement when switched to an unfamiliar sequence. In a second experiment, the integration of representation of action order and timing sequence knowledge was examined. When either action order or timing sequence information was selectively disrupted, performance was reduced to levels similar to completely novel sequences. Unlike prior sequence-learning research that has found timing information to be secondary to learning action sequences, when the task demands require accurate action and timing information, an integrated representation of these types of information is acquired. These results provide the first evidence for incidental learning of fully integrated action and timing sequence information in the absence of an independent representation of action order, and suggest that this integrative mechanism may play a material role in the acquisition of complex motor skills. PMID:21417511
Cellular automata and its applications in protein bioinformatics.
Xiao, Xuan; Wang, Pu; Chou, Kuo-Chen
2011-09-01
With the explosion of protein sequences generated in the postgenomic era, it is highly desirable to develop high-throughput tools for rapidly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. The knowledge thus obtained can help us timely utilize these newly found protein sequences for both basic research and drug discovery. Many bioinformatics tools have been developed by means of machine learning methods. This review is focused on the applications of a new kind of science (cellular automata) in protein bioinformatics. A cellular automaton (CA) is an open, flexible and discrete dynamic model that holds enormous potentials in modeling complex systems, in spite of the simplicity of the model itself. Researchers, scientists and practitioners from different fields have utilized cellular automata for visualizing protein sequences, investigating their evolution processes, and predicting their various attributes. Owing to its impressive power, intuitiveness and relative simplicity, the CA approach has great potential for use as a tool for bioinformatics.
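For readers unfamiliar with cellular automata, the sketch below steps a generic one-dimensional, two-state automaton under a Wolfram-style rule number. It only illustrates the kind of discrete dynamic model the review refers to; the encoding of protein sequences into cell states used in the reviewed work is not reproduced here, and the rule number and periodic boundary are assumptions.

```python
def step(cells, rule):
    """Advance a 1-D binary cellular automaton by one generation.

    `rule` is a Wolfram rule number (0-255); boundaries wrap around."""
    n = len(cells)
    next_cells = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The three-cell neighbourhood selects one bit of the rule number.
        index = (left << 2) | (center << 1) | right
        next_cells.append((rule >> index) & 1)
    return next_cells

def evolve(cells, rule, generations):
    """Return the list of successive states, starting with the input."""
    history = [list(cells)]
    for _ in range(generations):
        history.append(step(history[-1], rule))
    return history

# Rule 90 sets each cell to the XOR of its two neighbours.
history = evolve([0, 0, 1, 0, 0], rule=90, generations=1)
```

Stacking the rows of `history` gives the familiar space-time image that CA-based visualisation methods build on.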
NASA Astrophysics Data System (ADS)
Gong, Liang; Wu, Yu; Jian, Qijie; Yin, Chunxiao; Li, Taotao; Gupta, Vijai Kumar; Duan, Xuewu; Jiang, Yueming
2018-01-01
Vibrio qinghaiensis sp.-Q67 (Vqin-Q67) is a freshwater luminescent bacterium that continuously emits blue-green light (485 nm). The bacterium has been widely used for detecting toxic contaminants. Here, we report the complete genome sequence of Vqin-Q67, obtained using third-generation PacBio sequencing technology. Continuous long reads were attained from three PacBio sequencing runs and reads >500 bp with a quality value of >0.75 were merged together into a single dataset. This resultant highly-contiguous de novo assembly has no genome gaps, and comprises two chromosomes with substantial genetic information, including protein-coding genes, non-coding RNA, transposon and gene islands. Our dataset can be useful as a comparative genome for evolution and speciation studies, as well as for the analysis of protein-coding gene families, the pathogenicity of different Vibrio species in fish, the evolution of non-coding RNA and transposon, and the regulation of gene expression in relation to the bioluminescence of Vqin-Q67.
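The read-selection step mentioned in this abstract (reads longer than 500 bp with a quality value above 0.75, pooled across runs) can be sketched as follows. The record layout with `sequence` and `quality` keys is hypothetical, and the sketch does not stand in for the PacBio software the authors would have used.

```python
def merge_filtered_reads(runs, min_length=500, min_quality=0.75):
    """Pool reads from several sequencing runs, keeping only reads that
    exceed both the length and the quality thresholds."""
    merged = []
    for run in runs:
        for read in run:
            if len(read["sequence"]) > min_length and read["quality"] > min_quality:
                merged.append(read)
    return merged

# Toy data standing in for three sequencing runs (hypothetical values).
runs = [
    [{"id": "r1", "sequence": "A" * 600, "quality": 0.80}],
    [{"id": "r2", "sequence": "C" * 400, "quality": 0.90}],   # too short
    [{"id": "r3", "sequence": "G" * 900, "quality": 0.70},    # quality too low
     {"id": "r4", "sequence": "T" * 501, "quality": 0.76}],
]
dataset = merge_filtered_reads(runs)
kept_ids = [read["id"] for read in dataset]
```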
Chang, D D; Clayton, D A
1986-01-01
Transcription of the heavy strand of mouse mitochondrial DNA starts from two closely spaced, distinct sites located in the displacement loop region of the genome. We report here an analysis of regulatory sequences required for faithful transcription from these two sites. Data obtained from in vitro assays demonstrated that a 51-base-pair region, encompassing nucleotides -40 to +11 of the downstream start site, contains sufficient information for accurate transcription from both start sites. Deletion of the 3' flanking sequences, including one or both start sites to -17, resulted in the initiation of transcription by the mitochondrial RNA polymerase from alternative sites within vector DNA sequences. This feature places the mouse heavy-strand promoter uniquely among other known mitochondrial promoters, all of which absolutely require cognate start sites for transcription. Comparison of the heavy-strand promoter with those of other vertebrate mitochondrial DNAs revealed a remarkably high rate of sequence divergence among species. Images PMID:3785226
Variability of Actinobacteria, a minor component of rumen microflora.
Suľák, M; Sikorová, L; Jankuvová, J; Javorský, P; Pristaš, P
2012-07-01
Actinobacteria (Actinomycetes) are a significant and interesting group of gram-positive bacteria. They are regular, though infrequent, members of the microbial life in the rumen and represent up to 3 % of total rumen bacteria; there is considerable lack of information about ecology and biology of rumen actinobacteria. During the characterization of variability of rumen treponemas using non-cultivation approach, we also noted the variability of rumen actinobacteria. By using Treponema-specific primers a specific 16S rRNA gene library was prepared from cow and sheep rumen total DNA. About 10 % of recombinant clones contained actinobacteria-like sequences. Phylogenetic analyses of 11 clones obtained showed the high variability of actinobacteria in the ruminant digestive system. While some sequences are nearly identical to known sequences of actinobacteria, we detected completely new clusters of actinobacteria-like sequences, representing probably new, as yet undiscovered, group of rumen Actinobacteria. Further research will be necessary for understanding their nature and functions in the rumen.
Analysis for complete genomic sequence of HLA-B and HLA-C alleles in the Chinese Han population.
Zhu, F; He, Y; Zhang, W; He, J; He, J; Xu, X; Lv, H; Yan, L
2011-08-01
In the present study, we have determined the complete genomic sequence and analysed the intron polymorphism of partial HLA-B and HLA-C alleles in the Chinese Han population. DNA fragments of over 3.0 kb from the HLA-B and HLA-C loci, spanning from the partial 5' untranslated region to the 3' noncoding region, were amplified by polymerase chain reaction, and the amplified products were then sequenced. Full-length nucleotide sequences of 14 HLA-B alleles and 10 HLA-C alleles were obtained and have been submitted to GenBank and the IMGT/HLA database. Two novel alleles, HLA-B*52:01:01:02 and HLA-B*59:01:01:02, were identified, and the complete genomic sequence of HLA-B*52:01:01:01 is reported for the first time. In total, 157 and 167 polymorphic positions were found in the full-length genomic sequences of the HLA-B and HLA-C loci, respectively. Our results suggest that many single nucleotide polymorphisms exist in the exon and intron regions, and the data can provide useful information for understanding the evolution of HLA-B and HLA-C alleles. © 2011 Blackwell Publishing Ltd.
Pollier, Jacob; González-Guzmán, Miguel; Ardiles-Diaz, Wilson; Geelen, Danny; Goossens, Alain
2011-01-01
cDNA-Amplified Fragment Length Polymorphism (cDNA-AFLP) is a commonly used technique for genome-wide expression analysis that does not require prior sequence knowledge. Typically, quantitative expression data and sequence information are obtained for a large number of differentially expressed gene tags. However, most of the gene tags do not correspond to full-length (FL) coding sequences, which is a prerequisite for subsequent functional analysis. A medium-throughput screening strategy, based on integration of polymerase chain reaction (PCR) and colony hybridization, was developed that allows in parallel screening of a cDNA library for FL clones corresponding to incomplete cDNAs. The method was applied to screen for the FL open reading frames of a selection of 163 cDNA-AFLP tags from three different medicinal plants, leading to the identification of 109 (67%) FL clones. Furthermore, the protocol allows for the use of multiple probes in a single hybridization event, thus significantly increasing the throughput when screening for rare transcripts. The presented strategy offers an efficient method for the conversion of incomplete expressed sequence tags (ESTs), such as cDNA-AFLP tags, to FL-coding sequences.
Detection and characterization of Pasteuria 16S rRNA gene sequences from nematodes and soils.
Duan, Y P; Castro, H F; Hewlett, T E; White, J H; Ogram, A V
2003-01-01
Various bacterial species in the genus Pasteuria have great potential as biocontrol agents against plant-parasitic nematodes, although study of this important genus is hampered by the current inability to cultivate Pasteuria species outside their host. To aid in the study of this genus, an extensive 16S rRNA gene sequence phylogeny was constructed and this information was used to develop cultivation-independent methods for detection of Pasteuria in soils and nematodes. Thirty new clones of Pasteuria 16S rRNA genes were obtained directly from nematodes and soil samples. These were sequenced and used to construct an extensive phylogeny of this genus. These sequences were divided into two deeply branching clades within the low-G + C, Gram-positive division; some sequences appear to represent novel species within the genus Pasteuria. In addition, a surprising degree of 16S rRNA gene sequence diversity was observed within what had previously been designated a single strain of Pasteuria penetrans (P-20). PCR primers specific to Pasteuria 16S rRNA for detection of Pasteuria in soils were also designed and evaluated. Detection limits for soil DNA were 100-10,000 Pasteuria endospores (g soil)(-1).
Novel primers for complete mitochondrial cytochrome b gene sequencing in mammals
Naidu, Ashwin; Fitak, Robert R.; Munguia-Vega, Adrian; Culver, Melanie
2011-01-01
Sequence-based species identification relies on the extent and integrity of sequence data available in online databases such as GenBank. When identifying species from a sample of unknown origin, partial DNA sequences obtained from the sample are aligned against existing sequences in databases. When the sequence from the matching species is not present in the database, high-scoring alignments with closely related sequences might produce unreliable results on species identity. For species identification in mammals, the cytochrome b (cyt b) gene has been identified to be highly informative; thus, large amounts of reference sequence data from the cyt b gene are much needed. To enhance availability of cyt b gene sequence data on a large number of mammalian species in GenBank and other such publicly accessible online databases, we identified a primer pair for complete cyt b gene sequencing in mammals. Using this primer pair, we successfully PCR amplified and sequenced the complete cyt b gene from 40 of 44 mammalian species representing 10 orders of mammals. We submitted 40 complete, correctly annotated, cyt b protein coding sequences to GenBank. To our knowledge, this is the first single primer pair to amplify the complete cyt b gene in a broad range of mammalian species. This primer pair can be used for the addition of new cyt b gene sequences and to enhance data available on species represented in GenBank. The availability of novel and complete gene sequences as high-quality reference data can improve the reliability of sequence-based species identification.
A chain-retrieval model for voluntary task switching.
Vandierendonck, André; Demanet, Jelle; Liefooghe, Baptist; Verbruggen, Frederick
2012-09-01
To account for the findings obtained in voluntary task switching, this article describes and tests the chain-retrieval model. This model postulates that voluntary task selection involves retrieval of task information from long-term memory, which is then used to guide task selection and task execution. The model assumes that the retrieved information consists of acquired sequences (or chains) of tasks, that selection may be biased towards chains containing more task repetitions and that bottom-up triggered repetitions may overrule the intended task. To test this model, four experiments are reported. In Studies 1 and 2, sequences of task choices and the corresponding transition sequences (task repetitions or switches) were analyzed with the help of dependency statistics. The free parameters of the chain-retrieval model were estimated on the observed task sequences and these estimates were used to predict autocorrelations of tasks and transitions. In Studies 3 and 4, sequences of hand choices and their transitions were analyzed similarly. In all studies, the chain-retrieval model yielded better fits and predictions than statistical models of event choice. In applications to voluntary task switching (Studies 1 and 2), all three parameters of the model were needed to account for the data. When no task switching was required (Studies 3 and 4), the chain-retrieval model could account for the data with one or two parameters clamped to a neutral value. Implications for our understanding of voluntary task selection and broader theoretical implications are discussed. Copyright © 2012 Elsevier Inc. All rights reserved.
Houghton, Rebecca; Ellis, Joanna; Galiano, Monica; Clark, Tristan W; Wyllie, Sarah
2017-04-01
We describe haemagglutinin (HA) and neuraminidase (NA) sequencing in an apparent cross-site influenza A(H1N1) outbreak in renal transplant and haemodialysis patients, confirmed with whole genome sequencing (WGS). Isolates were sequenced from influenza positive individuals. Phylogenetic trees were constructed using HA and NA sequencing and subsequently WGS. Sequence data was analysed to determine genetic relatedness of viruses obtained from inpatient and outpatient cohorts and compared with epidemiological outbreak information. There were 6 patient cases of influenza in the inpatient renal ward cohort (associated with 3 deaths) and 9 patient cases in the outpatient haemodialysis unit cohort (no deaths). WGS confirmed clustered transmission of two genetically different influenza A(H1N1)pdm09 strains initially identified by analysis of HA and NA genes. WGS took longer, and in this case was not required to determine whether or not the two seemingly linked outbreaks were related. Rapid sequencing of HA and NA genes may be sufficient to aid early influenza outbreak investigation making it appealing for future outbreak investigation. However, as next generation sequencing becomes cheaper and more widely available and bioinformatics software is now freely accessible next generation whole genome analysis may increasingly become a valuable tool for real-time Influenza outbreak investigation. Crown Copyright © 2017. Published by Elsevier Ltd. All rights reserved.
DNA Metabarcoding of Amazonian Ichthyoplankton Swarms.
Maggia, M E; Vigouroux, Y; Renno, J F; Duponchelle, F; Desmarais, E; Nunez, J; García-Dávila, C; Carvajal-Vallejos, F M; Paradis, E; Martin, J F; Mariac, C
2017-01-01
Tropical rainforests harbor extraordinary biodiversity. The Amazon basin is thought to hold 30% of all river fish species in the world. Information about the ecology, reproduction, and recruitment of most species is still lacking, thus hampering fisheries management and successful conservation strategies. One of the key understudied issues in the study of population dynamics is recruitment. Fish larval ecology in tropical biomes is still in its infancy owing to identification difficulties. Molecular techniques are very promising tools for the identification of larvae at the species level. However, one of their limits is obtaining individual sequences with large samples of larvae. To facilitate this task, we developed a new method based on the massive parallel sequencing capability of next generation sequencing (NGS) coupled with hybridization capture. We focused on the mitochondrial marker cytochrome oxidase I (COI). The results obtained using the new method were compared with individual larval sequencing. We validated the ability of the method to identify Amazonian catfish larvae at the species level and to estimate the relative abundance of species in batches of larvae. Finally, we applied the method and provided evidence for strong temporal variation in reproductive activity of catfish species in the Ucayalí River in the Peruvian Amazon. This new time and cost effective method enables the acquisition of large datasets, paving the way for a finer understanding of reproductive dynamics and recruitment patterns of tropical fish species, with major implications for fisheries management and conservation.
Powers, T. O.; Harris, T. S.; Hyman, B. C.
1993-01-01
Mitochondrial DNA sequences were obtained from the NADH dehydrogenase subunit 3 (ND3), large rRNA, and cytochrome b genes from Meloidogyne incognita and Romanomermis culicivorax. Both species show considerable genetic distance within these same genes when compared with Caenorhabditis elegans or Ascaris suum, two species previously analyzed. Caenorhabditis, Ascaris, and Meloidogyne were selected as representatives of three subclasses in the nematode class Secernentea: Rhabditia, Spiruria, and Diplogasteria, respectively. Romanomermis served as a representative out-group of the class Adenophorea. The divergence between the phytoparasitic lineage (represented by Meloidogyne) and the three other species is so great that virtually every variable position in these genes appears to have accumulated multiple mutations, obscuring the phylogenetic information obtainable from these comparisons. The 39 and 42% amino acid similarity between the M. incognita and C. elegans ND3 and cytochrome b coding sequences, respectively, are approximately the same as those of C. elegans-mouse comparisons for the same genes (26 and 44%). This discovery calls into question the feasibility of employing cloned C. elegans probes as reagents to isolate phytoparasitic nematode genes. The genetic distance between the phytoparasitic nematode lineage and C. elegans markedly contrasts with the 79% amino acid similarity between C. elegans and A. suum for the same sequences. The molecular data suggest that Caenorhabditis and Ascaris belong to the same subclass. PMID:19279810
Ancient Mitochondrial DNA Analyses of Ascaris Eggs Discovered in Coprolites from Joseon Tomb
Oh, Chang Seok; Seo, Min; Hong, Jong Ha; Chai, Jong-Yil; Oh, Seung Whan; Park, Jun Bum; Shin, Dong Hoon
2015-01-01
Analysis of ancient DNA (aDNA) extracted from Ascaris is very important for understanding the phylogenetic lineage of the parasite species. When aDNAs obtained from a Joseon tomb (SN2-19-1) coprolite in which Ascaris eggs were identified were amplified with primers for cytochrome b (cyt b) and 18S small subunit ribosomal RNA (18S rRNA) gene, the outcome exhibited Ascaris specific amplicon bands. By cloning, sequencing, and analysis of the amplified DNA, we obtained information valuable for comprehending genetic lineage of Ascaris prevalent among pre-modern Joseon peoples. PMID:25925186
Application of polymer sensitive MRI sequence to localization of EEG electrodes.
Butler, Russell; Gilbert, Guillaume; Descoteaux, Maxime; Bernier, Pierre-Michel; Whittingstall, Kevin
2017-02-15
The growing popularity of simultaneous electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) opens up the possibility of imaging EEG electrodes while the subject is in the scanner. Such information could be useful for improving the fusion of EEG-fMRI datasets. Here, we report for the first time how an ultra-short echo time (UTE) MR sequence can image the materials of an MR-compatible EEG cap, finding that electrodes and some parts of the wiring are visible in a high resolution UTE. Using these images, we developed a segmentation procedure to obtain electrode coordinates based on voxel intensity from the raw UTE, using hand labeled coordinates as the starting point. We were able to visualize and segment 95% of EEG electrodes using a short (3.5min) UTE sequence. We provide scripts and template images so this approach can now be easily implemented to obtain precise, subject-specific EEG electrode positions while adding minimal acquisition time to the simultaneous EEG-fMRI protocol. T1 gel artifacts are not robust enough to localize all electrodes across subjects, the polymers composing Brainvision cap electrodes are not visible on a T1, and adding T1 visible materials to the EEG cap is not always possible. We therefore consider our method superior to existing methods for obtaining electrode positions in the scanner, as it is hardware free and should work on a wide range of materials (caps). EEG electrode positions are obtained with high precision and no additional hardware. Copyright © 2016 Elsevier B.V. All rights reserved.
Whole-Genome Sequencing and Variant Analysis of Human Papillomavirus 16 Infections.
van der Weele, Pascal; Meijer, Chris J L M; King, Audrey J
2017-10-01
Human papillomavirus (HPV) is a strongly conserved DNA virus, high-risk types of which can cause cervical cancer in persistent infections. The most common type found in HPV-attributable cancer is HPV16, which can be subdivided into four lineages (A to D) with different carcinogenic properties. Studies have shown HPV16 sequence diversity in different geographical areas, but only limited information is available regarding HPV16 diversity within a population, especially at the whole-genome level. We analyzed HPV16 major variant diversity and conservation in persistent infections and performed a single nucleotide polymorphism (SNP) comparison between persistent and clearing infections. Materials were obtained in the Netherlands from a cohort study with longitudinal follow-up for up to 3 years. Our analysis shows a remarkably large variant diversity in the population. Whole-genome sequences were obtained for 57 persistent and 59 clearing HPV16 infections, resulting in 109 unique variants. Interestingly, persistent infections were completely conserved through time. One reinfection event was identified where the initial and follow-up samples clustered differently. Non-A1/A2 variants seemed to clear preferentially (P = 0.02). Our analysis shows that population-wide HPV16 sequence diversity is very large. In persistent infections, the HPV16 sequence was fully conserved. Sequencing can identify HPV16 reinfections, although occurrence is rare. SNP comparison identified no strongly acting effect of the viral genome affecting HPV16 infection clearance or persistence in up to 3 years of follow-up. These findings suggest the progression of an early HPV16 infection could be host related. IMPORTANCE Human papillomavirus 16 (HPV16) is the predominant type found in cervical cancer.
Progression of initial infection to cervical cancer has been linked to sequence properties; however, knowledge of variants circulating in European populations, especially with longitudinal follow-up, is limited. By sequencing a number of infections with known follow-up for up to 3 years, we gained initial insights into the genetic diversity of HPV16 and the effects of the viral genome on the persistence of infections. A SNP comparison between sequences obtained from clearing and persistent infections did not identify strongly acting DNA variations responsible for these infection outcomes. In addition, we identified an HPV16 reinfection event where sequencing of initial and follow-up samples showed different HPV16 variants. Based on conventional genotyping, this infection would incorrectly be considered a persistent HPV16 infection. In the context of vaccine efficacy and monitoring studies, such infections could potentially cause reduced reported efficacy or efficiency. Copyright © 2017 van der Weele et al.
Environmental DNA sequencing primers for eutardigrades and bdelloid rotifers
2009-01-01
Background The time it takes to isolate individuals from environmental samples and then extract DNA from each individual is one of the problems with generating molecular data from meiofauna such as eutardigrades and bdelloid rotifers. The lack of consistent morphological information and the extreme abundance of these classes makes morphological identification of rare, or even common cryptic taxa a large and unwieldy task. This limits the ability to perform large-scale surveys of the diversity of these organisms. Here we demonstrate a culture-independent molecular survey approach that enables the generation of large amounts of eutardigrade and bdelloid rotifer sequence data directly from soil. Our PCR primers, specific to the 18s small-subunit rRNA gene, were developed for both eutardigrades and bdelloid rotifers. Results The developed primers successfully amplified DNA of their target organism from various soil DNA extracts. This was confirmed by both the BLAST similarity searches and phylogenetic analyses. Tardigrades showed much better phylogenetic resolution than bdelloids. Both groups of organisms exhibited varying levels of endemism. Conclusion The development of clade-specific primers for characterizing eutardigrades and bdelloid rotifers from environmental samples should greatly increase our ability to characterize the composition of these taxa in environmental samples. Environmental sequencing as shown here differs from other molecular survey methods in that there is no need to pre-isolate the organisms of interest from soil in order to amplify their DNA. The DNA sequences obtained from methods that do not require culturing can be identified post-hoc and placed phylogenetically as additional closely related sequences are obtained from morphologically identified conspecifics. 
Our non-cultured environmental sequence based approach will be able to provide a rapid and large-scale screening of the presence, absence and diversity of Bdelloidea and Eutardigrada in a variety of soils. PMID:20003362
Yu, Yongxin; Cai, Hui; Hu, Linghao; Lei, Rongwei; Pan, Yingjie; Yan, Shuling
2015-01-01
Noroviruses (NoVs) are a leading cause of epidemic and sporadic cases of acute gastroenteritis worldwide. Oysters are well recognized as the main vectors of environmentally transmitted NoVs, and disease outbreaks linked to oyster consumption have been commonly observed. Here, to quantify the genetic diversity, temporal distribution, and circulation of oyster-related NoVs on a global scale, 1,077 oyster-related NoV sequences deposited from 1983 to 2014 were downloaded from both NCBI GenBank and the NoroNet outbreak database and were then screened for quality control. A total of 665 sequences with reliable information were obtained and were subsequently subjected to genotyping and phylogenetic analyses. The results indicated that the majority of oyster-related NoV sequences were obtained from coastal countries and regions and that the numbers of sequences in these regions were unevenly distributed. Moreover, >80% of human NoV genotypes were detected in oyster samples or oyster-related outbreaks. A higher proportion of genogroup I (GI) (34%) was observed for oyster-related sequences than for non-oyster-related outbreaks, where GII strains dominated with an overwhelming majority of >90%, indicating that the prevalences of GI and GII are different in humans and oysters. In addition, a related convergence of the circulation trend was found between oyster-related NoV sequences and human pandemic outbreaks. This suggests that oysters not only act as a vector of NoV through environmental transmission but also serve as an important reservoir of human NoVs. These results highlight the importance of oysters in the persistence and transmission of human NoVs in the environment and have important implications for the surveillance of human NoVs in oyster samples. PMID:26319869
Validation of Splicing Events in Transcriptome Sequencing Data
Kaisers, Wolfgang; Ptok, Johannes; Schwender, Holger; Schaal, Heiner
2017-01-01
Genomic alignments of sequenced cellular messenger RNA contain gapped alignments, which are interpreted as a consequence of intron removal. The resulting gap-sites, genomic locations of alignment gaps, are landmarks representing potential splice-sites. As alignment algorithms report gap-sites with a considerable false discovery rate, validations are required. We describe two quality scores, gap quality score (gqs) and weighted gap information score (wgis), developed for validation of putative splicing events: while gqs relies solely on alignment data, wgis additionally considers information from the genomic sequence. FASTQ files obtained from 54 human dermal fibroblast samples were aligned against the human genome (GRCh38) using the TopHat and STAR aligners. Statistical properties of gap-sites validated by gqs and wgis were evaluated by their sequence similarity to known exon-intron borders. Within the 54 samples, TopHat identifies 1,000,380 and STAR reports 6,487,577 gap-sites. Due to the lack of strand information, however, the percentage of identified GT-AG gap-sites is rather low. While gap-sites from TopHat contain ≈89% GT-AG, gap-sites from STAR only contain ≈42% GT-AG dinucleotide pairs in merged data from 54 fibroblast samples. Validation with gqs yields 156,251 gap-sites from TopHat alignments and 166,294 from STAR alignments. Validation with wgis yields 770,327 gap-sites from TopHat alignments and 1,065,596 from STAR alignments. Both alignment algorithms, TopHat and STAR, report gap-sites with a considerable false discovery rate, which can be drastically reduced by validation with gqs and wgis. PMID:28545234
Characterization of a new apple luteovirus identified by high-throughput sequencing.
Liu, Huawei; Wu, Liping; Nikolaeva, Ekaterina; Peter, Kari; Liu, Zongrang; Mollov, Dimitre; Cao, Mengji; Li, Ruhui
2018-05-15
'Rapid Apple Decline' (RAD) is a newly emerging problem of young, dwarf apple trees in the Northeastern USA. The affected trees show trunk necrosis, cracking and canker before collapse in summer. In this study, we discovered and characterized a new luteovirus from apple trees in RAD-affected orchards using high-throughput sequencing (HTS) technology and subsequent Sanger sequencing. Illumina NextSeq sequencing was applied to total RNAs prepared from three diseased apple trees. Sequence reads were de novo assembled, and contigs were annotated by BLASTx. RT-PCR and 5'/3' RACE sequencing were used to obtain the complete genome of a new virus. RT-PCR was used to detect the virus. Three common apple viruses and a new luteovirus were identified from the diseased trees by HTS and RT-PCR. Sequence analyses of the complete genome of the new virus show that it is a new species of the genus Luteovirus in the family Luteoviridae. The virus is graft transmissible and detected by RT-PCR in apple trees in a couple of orchards. A new luteovirus and/or three known viruses were found to be associated with RAD. Molecular characterization of the new luteovirus provides important information for further investigation of its distribution and etiological role.
Milius, Robert P; Heuer, Michael; Valiga, Daniel; Doroschak, Kathryn J; Kennedy, Caleb J; Bolon, Yung-Tsi; Schneider, Joel; Pollack, Jane; Kim, Hwa Ran; Cereb, Nezih; Hollenbach, Jill A; Mack, Steven J; Maiers, Martin
2015-12-01
We present an electronic format for exchanging data for HLA and KIR genotyping with extensions for next-generation sequencing (NGS). This format addresses NGS data exchange by refining the Histoimmunogenetics Markup Language (HML) to conform to the proposed Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING) reporting guidelines (miring.immunogenomics.org). Our refinements of HML include two major additions. First, NGS is supported by new XML structures to capture additional NGS data and metadata required to produce a genotyping result, including analysis-dependent (dynamic) and method-dependent (static) components. A full genotype, consensus sequence, and the surrounding metadata are included directly, while the raw sequence reads and platform documentation are externally referenced. Second, genotype ambiguity is fully represented by integrating Genotype List Strings, which use a hierarchical set of delimiters to represent allele and genotype ambiguity in a complete and accurate fashion. HML also continues to enable the transmission of legacy methods (e.g. site-specific oligonucleotide, sequence-specific priming, and Sequence Based Typing (SBT)), adding features such as allowing multiple group-specific sequencing primers, and fully leveraging techniques that combine multiple methods to obtain a single result, such as SBT integrated with NGS. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
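To illustrate how a hierarchical set of delimiters can encode allele and genotype ambiguity, the following minimal sketch splits a Genotype List String level by level. The delimiter order assumed here (`^` between loci, `|` between alternative genotypes, `+` between gene copies, `~` between phased alleles, `/` between alternative alleles) is taken from the published GL String grammar and is not stated in the abstract above; the function name and the example string are hypothetical and are not part of HML.

```python
# Delimiters ordered from the outermost level to the innermost level.
# This ordering is an assumption based on the GL String grammar.
DELIMITERS = ["^", "|", "+", "~", "/"]


def parse_gl_string(text, level=0):
    """Recursively split a GL String into nested lists, one level per delimiter."""
    if level == len(DELIMITERS):
        return text  # innermost element: a single allele name
    parts = text.split(DELIMITERS[level])
    return [parse_gl_string(part, level + 1) for part in parts]


# Hypothetical example: the first gene copy is ambiguous between two alleles.
example = "HLA-A*01:01/HLA-A*01:02+HLA-A*24:02"
tree = parse_gl_string(example)

# tree[locus][genotype][gene copy][phased element] -> list of alternative alleles
first_copy_alleles = tree[0][0][0][0]
second_copy_alleles = tree[0][0][1][0]
```

Each nesting level of the result corresponds to one delimiter, so the ambiguity at any level can be read off as the length of the list at that depth.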
Gene finding in metatranscriptomic sequences.
Ismail, Wazim Mohammed; Ye, Yuzhen; Tang, Haixu
2014-01-01
Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics. In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TransGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript. We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TransGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply that, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture of the genes (and their functions) in a microbial community. TransGeneScan is available as open-source software on SourceForge at https://sourceforge.net/projects/transgenescan/.
Quasispecies Analyses of the HIV-1 Near-full-length Genome With Illumina MiSeq
Ode, Hirotaka; Matsuda, Masakazu; Matsuoka, Kazuhiro; Hachiya, Atsuko; Hattori, Junko; Kito, Yumiko; Yokomaku, Yoshiyuki; Iwatani, Yasumasa; Sugiura, Wataru
2015-01-01
Human immunodeficiency virus type-1 (HIV-1) exhibits high between-host genetic diversity and within-host heterogeneity, recognized as quasispecies. Because HIV-1 quasispecies fluctuate in terms of multiple factors, such as antiretroviral exposure and host immunity, analyzing the HIV-1 genome is critical for selecting effective antiretroviral therapy and understanding within-host viral coevolution mechanisms. Here, to obtain HIV-1 genome sequence information that includes minority variants, we sought to develop a method for evaluating quasispecies throughout the HIV-1 near-full-length genome using the Illumina MiSeq benchtop deep sequencer. To ensure the reliability of minority mutation detection, we applied an analysis method of sequence read mapping onto a consensus sequence derived from de novo assembly followed by iterative mapping and subsequent unique error correction. Deep sequencing analyses of an HIV-1 clone showed that the analysis method reduced erroneous base prevalence to below 1% at each sequence position and discarded < 1% of all collected nucleotides, maximizing the usage of the collected genome sequences. Further, we designed primer sets to amplify the HIV-1 near-full-length genome from clinical plasma samples. Deep sequencing of 92 samples in combination with the primer sets and our analysis method provided sufficient coverage to identify >1%-frequency sequences throughout the genome. When we evaluated sequences of pol genes from 18 treatment-naïve patients' samples, the deep sequencing results were in agreement with Sanger sequencing and identified numerous additional minority mutations. The results suggest that our deep sequencing method would be suitable for identifying within-host viral population dynamics throughout the genome. PMID:26617593
Nielsen, Flemming K; Egund, Niels; Jørgensen, Anette; Peters, David A; Jurik, Anne Grethe
2016-11-16
Bone marrow lesions (BMLs) in knee osteoarthritis (OA) can be assessed using fluid sensitive and contrast enhanced sequences. The association between BMLs and symptoms has been investigated in several studies but only using fluid sensitive sequences. Our aims were to assess BMLs by contrast enhanced MRI sequences in comparison with a fluid sensitive STIR sequence using two different segmentation methods and to analyze the association between the MR findings and disability and pain. Twenty-two patients (mean age 61 years, range 41-79 years) with medial femoro-tibial knee OA underwent MRI and filled out a WOMAC questionnaire at baseline and follow-up (median interval of 334 days). STIR, dynamic contrast enhanced-MRI (DCE-MRI) and fat saturated T1 post-contrast (T1 CE FS) MRI sequences were obtained. All STIR and T1 CE FS sequences were assessed independently by two readers for STIR-BMLs and contrast enhancing areas of BMLs (CEA-BMLs) using manual segmentation and computer assisted segmentation, and the measurements were compared. DCE-MRIs were assessed for the relative distribution of voxels with an inflammatory enhancement pattern, Nvoxel, in the bone marrow. All findings were compared to WOMAC scores, including pain and overall symptoms, and changes from baseline to follow-up were analyzed. The average volume of CEA-BML was smaller than the STIR-BML volume by manual segmentation. The opposite was found for computer assisted segmentation, where the average CEA-BML volume was larger than the STIR-BML volume. The contradictory finding by computer assisted segmentation was partly caused by a number of outliers with an apparent generally increased signal intensity in the anterior parts of the femoral condyle and tibial plateau, causing an overestimation of the CEA-BML volume. CEA-BML, STIR-BML and Nvoxel were all significantly correlated with symptoms, and to a similar degree.
A significant reduction in total WOMAC score was seen at follow-up, but no significant changes were observed for CEA-BML, STIR-BML or Nvoxel. Neither the degree nor the volume of contrast enhancement in BMLs seems to add any clinical information compared to BMLs visualized by fluid sensitive sequences. Manual segmentation may be needed to obtain valid CEA-BML measurements.
Accuracy of Reaction Cross Section for Exotic Nuclei in Glauber Model Based on MCMC Diagnostics
NASA Astrophysics Data System (ADS)
Rueter, Keiti; Novikov, Ivan
2017-01-01
Parameters of the nuclear density distribution for exotic nuclei with halo or skin structures can be determined from the experimentally measured reaction cross-section. In the presented work, to extract parameters such as nuclear size information for a halo and core, we compare experimental data on reaction cross-sections with values obtained using expressions of the Glauber Model. These calculations are performed using a Markov Chain Monte Carlo algorithm. We discuss the accuracy of the Monte Carlo approach and its dependence on k*, the power law turnover point in the discrete power spectrum of the random number sequence, and on the lag-1 autocorrelation time of the random number sequence.
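As a rough illustration of one of the diagnostics named above, the sketch below estimates the lag-1 autocorrelation coefficient of a sampled sequence. It is a generic estimator, not the authors' code; the synthetic AR(1) chain with coefficient 0.9 merely stands in for real MCMC output, and all names are hypothetical.

```python
import random


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation coefficient of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Covariance between neighbouring samples, normalised by the total variance.
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


# Synthetic stand-in for a Markov chain: x[t] = 0.9 * x[t-1] + Gaussian noise.
random.seed(0)
chain = [0.0]
for _ in range(20000):
    chain.append(0.9 * chain[-1] + random.gauss(0.0, 1.0))

rho = lag1_autocorrelation(chain)
```

A value of `rho` close to 1 indicates strongly correlated successive samples, which means the chain carries less independent information than its length suggests.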
Sequence Capture versus Restriction Site Associated DNA Sequencing for Shallow Systematics.
Harvey, Michael G; Smith, Brian Tilston; Glenn, Travis C; Faircloth, Brant C; Brumfield, Robb T
2016-09-01
Sequence capture and restriction site associated DNA sequencing (RAD-Seq) are two genomic enrichment strategies for applying next-generation sequencing technologies to systematics studies. At shallow timescales, such as within species, RAD-Seq has been widely adopted among researchers, although there has been little discussion of the potential limitations and benefits of RAD-Seq and sequence capture. We discuss a series of issues that may impact the utility of sequence capture and RAD-Seq data for shallow systematics in non-model species. We review prior studies that used both methods, and investigate differences between the methods by re-analyzing existing RAD-Seq and sequence capture data sets from a Neotropical bird (Xenops minutus). We suggest that the strengths of RAD-Seq data sets for shallow systematics are the wide dispersion of markers across the genome, the relative ease and cost of laboratory work, the deep coverage and read overlap at recovered loci, and the high overall information that results. Sequence capture's benefits include flexibility and repeatability in the genomic regions targeted, success using low-quality samples, more straightforward read orthology assessment, and higher per-locus information content. The utility of a method in systematics, however, rests not only on its performance within a study, but on the comparability of data sets and inferences with those of prior work. In RAD-Seq data sets, comparability is compromised by low overlap of orthologous markers across species and the sensitivity of genetic diversity in a data set to an interaction between the level of natural heterozygosity in the samples examined and the parameters used for orthology assessment. 
In contrast, sequence capture of conserved genomic regions permits interrogation of the same loci across divergent species, which is preferable for maintaining comparability among data sets and studies for the purpose of drawing general conclusions about the impact of historical processes across biotas. We argue that sequence capture should be given greater attention as a method of obtaining data for studies in shallow systematics and comparative phylogeography. © The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Heinke, Florian; Bittrich, Sebastian; Kaiser, Florian; Labudde, Dirk
2016-01-01
To understand the molecular function of biopolymers, studying their structural characteristics is of central importance. Graphics programs are often utilized to conceive these properties, but with the increasing number of available structures in databases or structure models produced by automated modeling frameworks this process requires assistance from tools that allow automated structure visualization. In this paper a web server and its underlying method for generating graphical sequence representations of molecular structures is presented. The method, called SequenceCEROSENE (color encoding of residues obtained by spatial neighborhood embedding), retrieves the sequence of each amino acid or nucleotide chain in a given structure and produces a color coding for each residue based on three-dimensional structure information. From this, color-highlighted sequences are obtained, where residue coloring represents three-dimensional residue locations in the structure. This color encoding thus provides a one-dimensional representation, from which spatial interactions, proximity and relations between residues or entire chains can be deduced quickly and solely from color similarity. Furthermore, additional heteroatoms and chemical compounds bound to the structure, like ligands or coenzymes, are processed and reported as well. To provide free access to SequenceCEROSENE, a web server has been implemented that allows generating color codings for structures deposited in the Protein Data Bank or structure models uploaded by the user. Besides retrieving visualizations in popular graphic formats, underlying raw data can be downloaded as well. In addition, the server provides user interactivity with generated visualizations and the three-dimensional structure in question.
Color-encoded sequences generated by SequenceCEROSENE can help users quickly perceive the general characteristics of a structure of interest (or entire sets of complexes), thus supporting the researcher in the initial phase of structure-based studies. In this respect, the web server can be a valuable tool, as users are allowed to process multiple structures, quickly switch between results, and interact with generated visualizations in an intuitive manner. The SequenceCEROSENE web server is available at https://biosciences.hs-mittweida.de/seqcerosene.
Transcriptome and gene expression analysis during flower blooming in Rosa chinensis 'Pallida'.
Yan, Huijun; Zhang, Hao; Chen, Min; Jian, Hongying; Baudino, Sylvie; Caissard, Jean-Claude; Bendahmane, Mohammed; Li, Shubin; Zhang, Ting; Zhou, Ningning; Qiu, Xianqin; Wang, Qigang; Tang, Kaixue
2014-04-25
Rosa chinensis 'Pallida' (Rosa L.) is one of the most important ancient rose cultivars originating from China. It contributed the 'tea scent' trait to modern roses. However, little information is available on the gene regulatory networks involved in scent biosynthesis and metabolism in Rosa. In this study, the transcriptome of R. chinensis 'Pallida' petals at different developmental stages, from flower buds to senescent flowers, was investigated using Illumina sequencing technology. De novo assembly generated 89,614 clusters with an average length of 428 bp. Based on sequence similarity search with known proteins, 62.9% of total clusters were annotated. Out of these annotated transcripts, 25,705 and 37,159 sequences were assigned to gene ontology and clusters of orthologous groups, respectively. The dataset provides information on transcripts putatively associated with known scent metabolic pathways. Digital gene expression (DGE) was obtained using RNA samples from flower bud, open flower and senescent flower stages. Comparative DGE and quantitative real time PCR permitted the identification of five transcripts encoding proteins putatively associated with scent biosynthesis in roses. The study provides a foundation for scent-related gene discovery in roses. Copyright © 2014. Published by Elsevier B.V.
PoMaMo—a comprehensive database for potato genome data
Meyer, Svenja; Nagel, Axel; Gebhardt, Christiane
2005-01-01
A database for potato genome data (PoMaMo, Potato Maps and More) was established. The database contains molecular maps of all twelve potato chromosomes with about 1000 mapped elements, sequence data, putative gene functions, results from BLAST analysis, SNP and InDel information from different diploid and tetraploid potato genotypes, publication references, links to other public databases like GenBank (http://www.ncbi.nlm.nih.gov/) or SGN (Solanaceae Genomics Network, http://www.sgn.cornell.edu/), etc. Flexible search and data visualization interfaces enable easy access to the data via internet (https://gabi.rzpd.de/PoMaMo.html). The Java servlet tool YAMB (Yet Another Map Browser) was designed to interactively display chromosomal maps. Maps can be zoomed in and out, and detailed information about mapped elements can be obtained by clicking on an element of interest. The GreenCards interface allows a text-based data search by marker-, sequence- or genotype name, by sequence accession number, gene function, BLAST Hit or publication reference. The PoMaMo database is a comprehensive database for different potato genome data, and to date the only database containing SNP and InDel data from diploid and tetraploid potato genotypes. PMID:15608284
Hume, Maxwell A; Barrera, Luis A; Gisselbrecht, Stephen S; Bulyk, Martha L
2015-01-01
The Universal PBM Resource for Oligonucleotide Binding Evaluation (UniPROBE) serves as a convenient source of information on published data generated using universal protein-binding microarray (PBM) technology, which provides in vitro data about the relative DNA-binding preferences of transcription factors for all possible sequence variants of a length k ('k-mers'). The database displays important information about the proteins and displays their DNA-binding specificity data in terms of k-mers, position weight matrices and graphical sequence logos. This update to the database documents the growth of UniPROBE since the last update 4 years ago, and introduces a variety of new features and tools, including a new streamlined pipeline that facilitates data deposition by universal PBM data generators in the research community, a tool that generates putative nonbinding (i.e. negative control) DNA sequences for one or more proteins and novel motifs obtained by analyzing the PBM data using the BEEML-PBM algorithm for motif inference. The UniPROBE database is available at http://uniprobe.org. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Building toy models of proteins using coevolutionary information
NASA Astrophysics Data System (ADS)
Cheng, Ryan; Raghunathan, Mohit; Onuchic, Jose
2015-03-01
Recent developments in global statistical methodologies have advanced the analysis of large collections of protein sequences for coevolutionary information. Coevolution between amino acids in a protein arises from compensatory mutations that are needed to maintain the stability or function of a protein over the course of evolution. This gives rise to quantifiable correlations between amino acid positions within the multiple sequence alignment of a protein family. Here, we use Direct Coupling Analysis (DCA) to infer a Potts model Hamiltonian governing the correlated mutations in a protein family to obtain the sequence-dependent interaction energies of a toy protein model. We demonstrate that this methodology predicts residue-residue interaction energies that are consistent with experimental mutational changes in protein stabilities as well as other computational methodologies. Furthermore, we demonstrate with several examples that DCA could be used to construct a structure-based model that quantitatively agrees with experimental data on folding mechanisms. This work serves as a potential framework for generating models of proteins that are enriched by evolutionary data that can potentially be used to engineer key functional motions and interactions in protein systems. This research has been supported by the NSF INSPIRE award MCB-1241332 and by the CTBP sponsored by the NSF (Grant PHY-1427654).
Shao, Chengchen; Zhang, Yaqi; Zhou, Yueqin; Zhu, Wei; Xu, Hongmei; Liu, Zhiping; Tang, Qiqun; Shen, Yiwen; Xie, Jianhui
2015-01-01
Aim: To systematically select and evaluate short tandem repeats (STRs) on chromosome 14 and obtain new STR loci as expanded genotyping markers for forensic application. Methods: STRs on chromosome 14 were filtered from the Tandem Repeats Database and further selected based on their positions on the chromosome, repeat patterns of the core sequences, sequence homology of the flanking regions, and suitability of flanking regions in primer design. The STR locus with the highest heterozygosity and polymorphism information content (PIC) was selected for further analysis of genetic polymorphism, forensic parameters, and the core sequence. Results: Among 26 STR loci selected as candidates, D14S739 had the highest heterozygosity (0.8691) and PIC (0.8432), and showed no deviation from the Hardy-Weinberg equilibrium. Fourteen alleles were observed, ranging in size from 21 to 34 tetranucleotide units in the core region of (GATA)9-18 (GACA)7-12 GACG (GACA)2 GATA. Paternity testing showed no mutations. Conclusion: D14S739 is a highly informative STR locus and could be a suitable genetic marker for forensic applications in the Han Chinese population. PMID:26526885
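For readers unfamiliar with the two summary statistics, the sketch below computes expected heterozygosity and PIC from a list of allele frequencies. It uses the textbook definitions, H = 1 - sum(p_i^2) and PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2, which are an assumption here: the abstract does not say which estimator the authors used, and its heterozygosity value may be an observed rather than an expected one. The frequencies in the example are invented.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2."""
    squares = [p * p for p in freqs]
    total = sum(squares)
    # Identity: sum_{i<j} 2 a_i a_j = (sum a_i)^2 - sum a_i^2, with a_i = p_i^2.
    cross_term = total * total - sum(s * s for s in squares)
    return 1.0 - total - cross_term


# Invented example: four equally frequent alleles.
freqs = [0.25, 0.25, 0.25, 0.25]
h = expected_heterozygosity(freqs)
pic = polymorphism_information_content(freqs)
```

Both statistics increase with the number of alleles and with how evenly their frequencies are distributed, which is why a locus with 14 observed alleles can reach values above 0.8.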
Chamrad, Daniel C; Körting, Gerhard; Schäfer, Heike; Stephan, Christian; Thiele, Herbert; Apweiler, Rolf; Meyer, Helmut E; Marcus, Katrin; Blüggel, Martin
2006-09-01
A novel software tool named PTM-Explorer has been applied to LC-MS/MS datasets acquired within the Human Proteome Organisation (HUPO) Brain Proteome Project (BPP). PTM-Explorer enables automatic identification of peptide MS/MS spectra that were not explained in typical sequence database searches. The main focus was detection of PTMs, but PTM-Explorer also detects unspecific peptide cleavage, mass measurement errors, experimental modifications, amino acid substitutions, transpeptidation products and unknown mass shifts. To avoid a combinatorial problem, the search is restricted to a set of selected protein sequences, which stem from previous protein identifications using a common sequence database search. Prior to application to the HUPO BPP data, PTM-Explorer was evaluated on carefully characterized and manually evaluated LC-MS/MS data sets from Alpha-A-Crystallin gel spots obtained from mouse eye lens. Besides various PTMs including phosphorylation, a wealth of experimental modifications and unspecific cleavage products were successfully detected, completing the primary structure information of the measured proteins. Our results indicate that a large number of MS/MS spectra that currently remain unidentified in standard database searches contain valuable information that can only be elucidated using suitable software tools.
Ertefaie, Ashkan; Shortreed, Susan; Chakraborty, Bibhas
2016-06-15
Q-learning is a regression-based approach that uses longitudinal data to construct dynamic treatment regimes, which are sequences of decision rules that use patient information to inform future treatment decisions. An optimal dynamic treatment regime is composed of a sequence of decision rules that indicate how to optimally individualize treatment using the patients' baseline and time-varying characteristics to optimize the final outcome. Constructing optimal dynamic regimes using Q-learning depends heavily on the assumption that regression models at each decision point are correctly specified; yet model checking in the context of Q-learning has been largely overlooked in the current literature. In this article, we show that residual plots obtained from standard Q-learning models may fail to adequately check the quality of the model fit. We present a modified Q-learning procedure that accommodates residual analyses using standard tools. We present simulation studies showing the advantage of the proposed modification over standard Q-learning. We illustrate this new Q-learning approach using data collected from a sequential multiple assignment randomized trial of patients with schizophrenia. Copyright © 2016 John Wiley & Sons, Ltd.
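The regression-based construction can be sketched for two decision points as follows. This is standard Q-learning by backward induction with ordinary least squares, not the modified procedure proposed in the article; the linear working models, the -1/+1 treatment coding, the simulated data and every name in the code are assumptions made for illustration only.

```python
import numpy as np


def fit_ols(design, response):
    """Ordinary least-squares coefficients for response ~ design."""
    coefficients, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefficients


def q_learning_two_stage(x1, a1, x2, a2, y):
    """Standard two-stage Q-learning with linear working models."""
    # Stage 2: regress the final outcome on history and treatment interaction.
    design2 = np.column_stack([np.ones_like(x2), x2, a2, a2 * x2])
    beta2 = fit_ols(design2, y)
    # Pseudo-outcome: predicted outcome under the best stage-2 treatment.
    # With a2 in {-1, +1}, the maximum of a2 * c over a2 equals |c|.
    pseudo = beta2[0] + beta2[1] * x2 + np.abs(beta2[2] + beta2[3] * x2)
    # Stage 1: regress the pseudo-outcome on stage-1 history and treatment.
    design1 = np.column_stack([np.ones_like(x1), x1, a1, a1 * x1])
    beta1 = fit_ols(design1, pseudo)
    return beta1, beta2


# Simulated data (hypothetical): only the stage-2 treatment affects the outcome.
rng = np.random.default_rng(1)
n = 5000
x1 = rng.standard_normal(n)
a1 = rng.choice([-1.0, 1.0], size=n)
x2 = 0.5 * x1 + rng.standard_normal(n)
a2 = rng.choice([-1.0, 1.0], size=n)
y = x2 + a2 * (0.5 + x2) + rng.standard_normal(n)

beta1, beta2 = q_learning_two_stage(x1, a1, x2, a2, y)
# Estimated stage-2 rule: treat with +1 when beta2[2] + beta2[3] * x2 > 0.
```

Because the stage-1 regression is fitted to a pseudo-outcome that contains an absolute value, its residuals need not behave like those of an ordinary regression, which is the kind of model-checking difficulty the abstract refers to.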
Pasquier, C; Promponas, V J; Hamodrakas, S J
2001-08-15
A cascading system of hierarchical, artificial neural networks (named PRED-CLASS) is presented for the generalized classification of proteins into four distinct classes (transmembrane, fibrous, globular, and mixed) from information solely encoded in their amino acid sequences. The architecture of the individual component networks is kept very simple, reducing the number of free parameters (network synaptic weights) for faster training, improved generalization, and the avoidance of data overfitting. Capturing information from as few as 50 protein sequences spread among the four target classes (6 transmembrane, 10 fibrous, 13 globular, and 17 mixed), PRED-CLASS was able to obtain 371 correct predictions out of a set of 387 proteins (success rate approximately 96%) unambiguously assigned into one of the target classes. The application of PRED-CLASS to several test sets and complete proteomes of several organisms demonstrates that such a method could serve as a valuable tool in the annotation of genomic open reading frames with no functional assignment or as a preliminary step in fold recognition and ab initio structure prediction methods. Detailed results obtained for various data sets and completed genomes, along with a web server running the PRED-CLASS algorithm, can be accessed over the World Wide Web at http://o2.biol.uoa.gr/PRED-CLASS.
[Cardiac Synchronization Function Estimation Based on ASM Level Set Segmentation Method].
Zhang, Yaonan; Gao, Yuan; Tang, Liang; He, Ying; Zhang, Huie
At present, there are no accurate, quantitative methods for determining cardiac mechanical synchrony, and quantitative determination of the synchronization function of the four cardiac cavities from medical images has great clinical value. This paper uses the whole-heart ultrasound image sequence and segments the left and right atria and the left and right ventricles in each frame. After the segmentation, the number of pixels in each cavity in each frame is recorded, and the areas of the four cavities across the image sequence are thereby obtained. The area change curves of the four cavities are then extracted, and the synchronization information of the four cavities is obtained. Because of the low SNR of ultrasound images, the boundary lines of cardiac cavities are vague, so the extraction of cardiac contours is still a challenging problem. Therefore, ASM model information is added to the traditional level set method to constrain the curve evolution process. According to the experimental results, the improved method improves the accuracy of the segmentation. Furthermore, based on the ventricular segmentation, the right and left ventricular systolic functions are evaluated, mainly according to the area changes. The synchronization of the four cavities of the heart is estimated based on the area changes and the volume changes.
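The pixel-counting step described above can be sketched as follows, assuming the segmentation of each frame is available as a two-dimensional grid of integer labels. The label convention (1 to 4 for the four cavities, 0 for background), the function name and the toy frames are hypothetical; the ASM level set segmentation itself is not reproduced here.

```python
def cavity_area_curves(label_frames, labels=(1, 2, 3, 4)):
    """Count the pixels of each cavity label in every frame.

    Returns a dict mapping each label to its list of areas over time.
    """
    curves = {label: [] for label in labels}
    for frame in label_frames:
        flat = [pixel for row in frame for pixel in row]
        for label in labels:
            curves[label].append(flat.count(label))
    return curves


# Two tiny 3x3 "frames" standing in for segmented ultrasound images.
frames = [
    [[1, 1, 0], [2, 0, 0], [3, 4, 4]],
    [[1, 0, 0], [2, 2, 0], [3, 3, 4]],
]
curves = cavity_area_curves(frames)
```

Each list in `curves` is an area change curve in pixel units; multiplying by the physical pixel area would convert it to square millimetres.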
Zamecnik, Patrik; Schouten, Martijn G; Krafft, Axel J; Maier, Florian; Schlemmer, Heinz-Peter; Barentsz, Jelle O; Bock, Michael; Fütterer, Jurgen J
2014-12-01
To assess the feasibility of automatic needle-guide tracking by using a real-time phase-only cross correlation (POCC) algorithm-based sequence for transrectal 3-T in-bore magnetic resonance (MR)-guided prostate biopsies. This study was approved by the ethics review board, and written informed consent was obtained from all patients. Eleven patients with a prostate-specific antigen level of at least 4 ng/mL (4 μg/L) and at least one transrectal ultrasonography-guided biopsy session with negative findings were enrolled. Regions suspicious for cancer were identified on 3-T multiparametric MR images. During a subsequent MR-guided biopsy, the regions suspicious for cancer were reidentified and targeted by using the POCC-based tracking sequence. In addition to testing the general technical feasibility of the biopsy procedure by using the POCC-based tracking sequence, the procedure times were measured, and a pathologic analysis of the biopsy cores was performed. Thirty-eight core samples were obtained from 25 regions suspicious for cancer. It was technically feasible to perform the POCC-based biopsies in all regions suspicious for cancer in each patient, with adequate biopsy samples obtained with each biopsy attempt. The median size of the region suspicious for cancer was 8 mm (range, 4-13 mm). In each region suspicious for cancer (median number per patient, two; range, 1-4), a median of one core sample per region was obtained (range, 1-3). The median time for guidance per target was 1.5 minutes (range, 0.7-5 minutes). Nineteen of 38 core biopsy samples contained cancer. This study shows that it is feasible to perform transrectal 3-T MR-guided biopsies by using a POCC algorithm-based real-time tracking sequence. © RSNA, 2014.
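The general idea behind phase-only cross correlation can be sketched as follows: the cross-power spectrum of two images is normalised to unit magnitude so that only phase differences remain, and the peak of its inverse transform marks the displacement between them. This is a generic illustration on a synthetic image with a circular shift, not the tracking sequence used in the study; the function name and test data are hypothetical.

```python
import numpy as np


def pocc_shift(reference, moving):
    """Return the (row, column) shift that maps `reference` onto `moving`."""
    spectrum_ref = np.fft.fft2(reference)
    spectrum_mov = np.fft.fft2(moving)
    cross_power = np.conj(spectrum_ref) * spectrum_mov
    # Discard magnitudes so that only phase differences remain.
    cross_power = cross_power / np.maximum(np.abs(cross_power), 1e-12)
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Indices past the midpoint correspond to negative (wrapped) shifts.
    return tuple(
        int(p) if p <= s // 2 else int(p) - s
        for p, s in zip(peak, correlation.shape)
    )


# Synthetic check: shift a random image by 3 rows down and 5 columns left.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
shifted = np.roll(image, shift=(3, -5), axis=(0, 1))
estimated = pocc_shift(image, shifted)
```

Because the magnitudes are removed, the correlation peak is sharp and largely insensitive to overall intensity differences between the two images.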
He, Wei; Zhuang, Huihui; Fu, Yanping; Guo, Linwei; Guo, Bin; Guo, Lizhu; Zhang, Xiuhong; Wei, Yahui
2015-01-01
Background: Locoweeds (toxic Oxytropis and Astragalus species), containing the toxic agent swainsonine, pose serious threats to animal husbandry on grasslands in both China and the US. Some locoweeds have evolved adaptations in order to resist various stress conditions such as drought, salt and cold. As a result they replace other plants in their communities and become an ecological problem. Currently very limited genetic information on locoweeds is available and this hinders our understanding of the molecular basis of their environmental plasticity, and the interaction between locoweeds and their symbiotic swainsonine producing endophytes. Next-generation sequencing provides a means of obtaining transcriptomic sequences in a timely manner, which is particularly useful for non-model plants. In this study, we performed transcriptome sequencing of Oxytropis ochrocephala plants followed by a de novo assembly. Our primary aim was to provide an enriched pool of genetic sequences of an Oxytropis sp. for further locoweed research. Results: Transcriptomes of four different O. ochrocephala samples, from control (CK) plants, and those that had experienced either drought (20% PEG), salt (150 mM NaCl) or cold (4°C) stress were sequenced using an Illumina Hiseq 2000 platform. From 232,209,506 clean reads (23,220,950,600 nucleotides, ~23 G), 182,430 transcripts and 88,942 unigenes were retrieved, with an N50 value of 1237. Differential expression analysis revealed putative genes encoding heat shock proteins (HSPs) and late embryogenesis abundant (LEA) proteins, enzymes in secondary metabolite and plant hormone biosyntheses, and transcription factors which are involved in stress tolerance in O. ochrocephala. In order to validate our sequencing results, we further analyzed the expression profiles of nine genes by quantitative real-time PCR. Finally, we discuss the possible mechanism of O. ochrocephala's adaptations to stress environment. 
Conclusion: Our transcriptome sequencing data present useful genetic information of a locoweed species. This genetic information will underpin further research in elucidating the environmental acclimation mechanism in locoweeds and the endophyte-plant association. PMID:26697040
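The N50 value reported in this abstract is a standard assembly statistic: the length L such that sequences of length at least L together account for at least half of the total assembled length. A minimal Python sketch follows; the lengths are invented for illustration and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical transcript lengths, for illustration only.
example_lengths = [2000, 1500, 1237, 900, 600, 300]
print(n50(example_lengths))
```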
Hassa, Julia; Maus, Irena; Off, Sandra; Pühler, Alfred; Scherer, Paul; Klocke, Michael; Schlüter, Andreas
2018-06-01
The production of biogas by anaerobic digestion (AD) of agricultural residues, organic wastes, animal excrements, municipal sludge, and energy crops has a firm place in sustainable energy production and bio-economy strategies. Focusing on the microbial community involved in biomass conversion offers the opportunity to control and engineer the biogas process with the objective to optimize its efficiency. Taxonomic profiling of biogas producing communities by means of high-throughput 16S rRNA gene amplicon sequencing provided high-resolution insights into bacterial and archaeal structures of AD assemblages and their linkages to fed substrates and process parameters. Commonly, the bacterial phyla Firmicutes and Bacteroidetes appeared to dominate biogas communities in varying abundances depending on the apparent process conditions. Regarding the community of methanogenic Archaea, their diversity was mainly affected by the nature and composition of the substrates, availability of nutrients and ammonium/ammonia contents, but not by the temperature. It also appeared that a high proportion of 16S rRNA sequences can only be classified on higher taxonomic ranks indicating that many community members and their participation in AD within functional networks are still unknown. Although cultivation-based approaches to isolate microorganisms from biogas fermentation samples yielded hundreds of novel species and strains, this approach intrinsically is limited to the cultivable fraction of the community. To obtain genome sequence information of non-cultivable biogas community members, metagenome sequencing including assembly and binning strategies was highly valuable. Corresponding research has led to the compilation of hundreds of metagenome-assembled genomes (MAGs) frequently representing novel taxa whose metabolism and lifestyle could be reconstructed based on nucleotide sequence information. 
In contrast to metagenome analyses revealing the genetic potential of microbial communities, metatranscriptome sequencing provided insights into the metabolically active community. Taking advantage of genome sequence information, transcriptional activities were evaluated considering the microorganism's genetic background. Metaproteome studies uncovered enzyme profiles expressed by biogas community members. Enzymes involved in cellulose and hemicellulose decomposition and utilization of other complex biopolymers were identified. Future studies on biogas functional microbial networks will increasingly involve integrated multi-omics analyses evaluating metagenome, transcriptome, proteome, and metabolome datasets.
Prediction of TF target sites based on atomistic models of protein-DNA complexes
Angarica, Vladimir Espinosa; Pérez, Abel González; Vasconcelos, Ana T; Collado-Vides, Julio; Contreras-Moreira, Bruno
2008-01-01
Background The specific recognition of genomic cis-regulatory elements by transcription factors (TFs) plays an essential role in the regulation of coordinated gene expression. Studying the mechanisms determining binding specificity in protein-DNA interactions is thus an important goal. Most current approaches for modeling TF specific recognition rely on the knowledge of large sets of cognate target sites and consider only the information contained in their primary sequence. Results Here we describe a structure-based methodology for predicting sequence motifs starting from the coordinates of a TF-DNA complex. Our algorithm combines information regarding the direct and indirect readout of DNA into an atomistic statistical model, which is used to estimate the interaction potential. We first measure the ability of our method to correctly estimate the binding specificities of eight prokaryotic and eukaryotic TFs that belong to different structural superfamilies. Secondly, the method is applied to two homology models, finding that sampling of interface side-chain rotamers remarkably improves the results. Thirdly, the algorithm is compared with a reference structural method based on contact counts, obtaining comparable predictions for the experimental complexes and more accurate sequence motifs for the homology models. Conclusion Our results demonstrate that atomic-detail structural information can be feasibly used to predict TF binding sites. The computational method presented here is universal and might be applied to other systems involving protein-DNA recognition. PMID:18922190
GMDD: a database of GMO detection methods
Dong, Wei; Yang, Litao; Shen, Kailin; Kim, Banghyun; Kleter, Gijs A; Marvin, Hans JP; Guo, Rong; Liang, Wanqi; Zhang, Dabing
2008-01-01
Background Since more than one hundred events of genetically modified organisms (GMOs) have been developed and approved for commercialization worldwide, GMO analysis methods are essential for the enforcement of GMO labelling regulations. Protein and nucleic acid-based detection techniques have been developed and utilized for GMO identification and quantification. However, information for the harmonization and standardization of GMO analysis methods at the global level is needed. Results The GMO Detection method Database (GMDD) has collected almost all the previously developed and reported GMO detection methods, which have been grouped by different strategies (screen-, gene-, construct-, and event-specific), and also provides a user-friendly search service for the detection methods by GMO event name, exogenous gene, or protein information, etc. In this database, users can obtain the sequences of exogenous integration, which will facilitate the design of PCR primers and probes. Information on endogenous genes, certified reference materials, reference molecules, and the validation status of developed methods is also included in this database. Furthermore, registered users can submit new detection methods and sequences to this database, and the newly submitted information will be released soon after being checked. Conclusion GMDD contains comprehensive information on GMO detection methods. The database will make GMO analysis much easier. PMID:18522755
Jaenicke, Sebastian; Ander, Christina; Bekel, Thomas; Bisdorf, Regina; Dröge, Marcus; Gartemann, Karl-Heinz; Jünemann, Sebastian; Kaiser, Olaf; Krause, Lutz; Tille, Felix; Zakrzewski, Martha; Pühler, Alfred
2011-01-01
Biogas production from renewable resources is attracting increased attention as an alternative energy source due to the limited availability of traditional fossil fuels. Many countries are promoting the use of alternative energy sources for sustainable energy production. In this study, a metagenome from a production-scale biogas fermenter was analysed employing Roche's GS FLX Titanium technology and compared to a previous dataset obtained from the same community DNA sample that was sequenced on the GS FLX platform. Taxonomic profiling based on 16S rRNA-specific sequences and an Environmental Gene Tag (EGT) analysis employing CARMA demonstrated that both approaches benefit from the longer read lengths obtained on the Titanium platform. Results confirmed Clostridia as the most prevalent taxonomic class, whereas species of the order Methanomicrobiales are dominant among methanogenic Archaea. However, the analyses also identified additional taxa that were missed by the previous study, including members of the genera Streptococcus, Acetivibrio, Garciella, Tissierella, and Gelria, which might also play a role in the fermentation process leading to the formation of methane. Taking advantage of the CARMA feature to correlate taxonomic information of sequences with their assigned functions, it appeared that Firmicutes, followed by Bacteroidetes and Proteobacteria, dominate within the functional context of polysaccharide degradation whereas Methanomicrobiales represent the most abundant taxonomic group responsible for methane production. Clostridia is the most important class involved in the reductive CoA pathway (Wood-Ljungdahl pathway) that is characteristic for acetogenesis. Based on binning of 16S rRNA-specific sequences allocated to the dominant genus Methanoculleus, it could be shown that this genus is represented by several different species. 
Phylogenetic analysis of these sequences placed them in close proximity to the hydrogenotrophic methanogen Methanoculleus bourgensis. While rarefaction analyses still indicate incomplete coverage, examination of the GS FLX Titanium dataset resulted in the identification of additional genera and functional elements, providing a far more complete coverage of the community involved in anaerobic fermentative pathways leading to methane formation. PMID:21297863
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing
Bankevich, Anton; Nurk, Sergey; Antipov, Dmitry; Gurevich, Alexey A.; Dvorkin, Mikhail; Kulikov, Alexander S.; Lesin, Valery M.; Nikolenko, Sergey I.; Pham, Son; Prjibelski, Andrey D.; Pyshkin, Alexey V.; Sirotkin, Alexander V.; Vyahhi, Nikolay; Tesler, Glenn; Pevzner, Pavel A.
2012-01-01
Abstract The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software. PMID:22506599
Liu, Tianyu; Liang, Yinan; Zhong, Xiuqin; Wang, Ning; Hu, Dandan; Zhou, Xuan; Gu, Xiaobin; Peng, Xuerong; Yang, Guangyou
2014-01-01
Dirofilaria immitis (heartworm) is the causative agent of an important zoonotic disease that is spread by mosquitoes. In this study, molecular and phylogenetic characterization of D. immitis were performed based on complete ND1 and 16S rDNA gene sequences, which provided the foundation for more advanced molecular diagnosis, prevention, and control of heartworm diseases. The mutation rate and evolutionary divergence in adult heartworm samples from seven dogs in western China were analyzed to obtain information on genetic diversity and variability. Phylogenetic relationships were inferred using both maximum parsimony (MP) and Bayes methods based on the complete gene sequences. The results suggest that D. immitis formed an independent monophyletic group in which the 16S rDNA gene has mutated more rapidly than has ND1. PMID:24639299
Mainardi, L T; Pattini, L; Cerutti, S
2007-01-01
A novel method is presented for the investigation of protein properties of sequences using Ramanujan Fourier Transform (RFT). The new methodology involves the preprocessing of protein sequence data by numerically encoding it and then applying the RFT. The RFT is based on projecting the obtained numerical series on a set of basis functions constituted by Ramanujan sums (RS). In RS components, periodicities of finite integer length, rather than frequency, (as in classical harmonic analysis) are considered. The potential of the new approach is documented by a few examples in the analysis of hydrophobic profiles of proteins in two classes including abundance of alpha-helices (group A) or beta-strands (group B). Different patterns are provided as evidence. RFT can be used to characterize the structural properties of proteins and integrate complementary information provided by other signal processing transforms.
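The Ramanujan sums mentioned above are integer-valued, q-periodic sequences defined as c_q(n) = sum of exp(2πi·a·n/q) over all a in 1..q coprime to q. The Python sketch below computes them and projects a numeric series onto one of them. The `project` helper and its normalisation are assumptions made for illustration and are not taken from the paper.

```python
from math import cos, gcd, pi


def ramanujan_sum(q, n):
    """Ramanujan sum c_q(n); the imaginary parts cancel, so a cosine sum suffices."""
    total = sum(cos(2 * pi * a * n / q) for a in range(1, q + 1) if gcd(a, q) == 1)
    return round(total)  # the exact value is always an integer


def project(series, q):
    """Mean of series[n] * c_q(n) for n = 1..N (illustrative normalisation)."""
    return sum(x * ramanujan_sum(q, n) for n, x in enumerate(series, start=1)) / len(series)


# A toy numeric series with an exact period of 3 (hypothetical encoding).
signal = [-1, -1, 2] * 4
for q in (2, 3, 4):
    print(q, project(signal, q))
```

In this toy case the projection is non-zero only for q = 3, which matches the idea that each component picks out an integer-length periodicity.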
MIPE: A metagenome-based community structure explorer and SSU primer evaluation tool
Zhou, Quan
2017-01-01
An understanding of microbial community structure is an important issue in the field of molecular ecology. The traditional molecular method involves amplification of small subunit ribosomal RNA (SSU rRNA) genes by polymerase chain reaction (PCR). However, PCR-based amplicon approaches are affected by primer bias and chimeras. With the development of high-throughput sequencing technology, unbiased SSU rRNA gene sequences can be mined from shotgun sequencing-based metagenomic or metatranscriptomic datasets to obtain a reflection of the microbial community structure in specific types of environment and to evaluate SSU primers. However, the use of short reads obtained through next-generation sequencing for primer evaluation has not been well resolved. The software MIPE (MIcrobiota metagenome Primer Explorer) was developed to adapt numerous short reads from metagenomes and metatranscriptomes. Using metagenomic or metatranscriptomic datasets as input, MIPE extracts and aligns rRNA to reveal detailed information on microbial composition and evaluate SSU rRNA primers. A mock dataset, a real Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) test dataset, two PrimerProspector test datasets and a real metatranscriptomic dataset were used to validate MIPE. The software calls Mothur (v1.33.3) and the SILVA database (v119) for the alignment and classification of rRNA genes from a metagenome or metatranscriptome. MIPE can effectively extract shotgun rRNA reads from a metagenome or metatranscriptome and is capable of classifying these sequences and exhibiting sensitivity to different SSU rRNA PCR primers. Therefore, MIPE can be used to guide primer design for specific environmental samples. PMID:28350876
Transcriptome characterisation of Pinus tabuliformis and evolution of genes in the Pinus phylogeny
2013-01-01
Background The Chinese pine (Pinus tabuliformis) is an indigenous conifer species in northern China but is relatively underdeveloped as a genomic resource, which limits gene discovery and breeding. Large-scale transcriptome data were obtained using a next-generation sequencing platform to compensate for the lack of P. tabuliformis genomic information. Results The increasing amount of transcriptome data on Pinus provides an excellent resource for multi-gene phylogenetic analysis and studies on how conserved genes and functions are maintained in the face of species divergence. The first P. tabuliformis transcriptome from a normalised cDNA library of multiple tissues and individuals was sequenced in a full 454 GS-FLX run, producing 911,302 sequencing reads. The high quality overlapping expressed sequence tags (ESTs) were assembled into 46,584 putative transcripts, and more than 700 SSRs and 92,000 SNPs/InDels were characterised. Comparative analysis of the transcriptome of six conifer species yielded 191 orthologues, from which we inferred a phylogenetic tree, evolutionary patterns and calculated rates of gene diversion. We also identified 938 fast evolving sequences that may be useful for identifying genes that perhaps evolved in response to positive selection and might be responsible for speciation in the Pinus lineage. Conclusions A large collection of high-quality ESTs was obtained, de novo assembled and characterised, which represents a dramatic expansion of the current transcript catalogues of P. tabuliformis and which will gradually be applied in breeding programs of P. tabuliformis. Furthermore, these data will facilitate future studies of the comparative genomics of P. tabuliformis and other related species. PMID:23597112
Rector, Annabel; Tachezy, Ruth; Van Ranst, Marc
2004-01-01
The discovery of novel viruses has often been accomplished by using hybridization-based methods that necessitate the availability of a previously characterized virus genome probe or knowledge of the viral nucleotide sequence to construct consensus or degenerate PCR primers. In their natural replication cycle, certain viruses employ a rolling-circle mechanism to propagate their circular genomes, and multiply primed rolling-circle amplification (RCA) with φ29 DNA polymerase has recently been applied in the amplification of circular plasmid vectors used in cloning. We employed an isothermal RCA protocol that uses random hexamer primers to amplify the complete genomes of papillomaviruses without the need for prior knowledge of their DNA sequences. We optimized this RCA technique with extracted human papillomavirus type 16 (HPV-16) DNA from W12 cells, using a real-time quantitative PCR assay to determine amplification efficiency, and obtained a 2.4 × 10^4-fold increase in HPV-16 DNA concentration. We were able to clone the complete HPV-16 genome from this multiply primed RCA product. The optimized protocol was subsequently applied to a bovine fibropapillomatous wart tissue sample. Whereas no papillomavirus DNA could be detected by restriction enzyme digestion of the original sample, multiply primed RCA enabled us to obtain a sufficient amount of papillomavirus DNA for restriction enzyme analysis, cloning, and subsequent sequencing of a novel variant of bovine papillomavirus type 1. The multiply primed RCA method allows the discovery of previously unknown papillomaviruses, and possibly also other circular DNA viruses, without a priori sequence information. PMID:15113879
An, Jianyu; Yin, Mengqi; Zhang, Qin; Gong, Dongting; Jia, Xiaowen; Guan, Yajing; Hu, Jin
2017-01-01
Luffa cylindrica (L.) Roem. is an economically important vegetable crop in China. However, the genomic information on this species is currently unknown. In this study, for the first time, a genome survey of L. cylindrica was carried out using next-generation sequencing (NGS) technology. In total, 43.40 Gb sequence data of L. cylindrica, about 54.94× coverage of the estimated genome size of 789.97 Mb, were obtained from HiSeq 2500 sequencing, in which the guanine plus cytosine (GC) content was calculated to be 37.90%. The heterozygosity of genome sequences was only 0.24%. In total, 1,913,731 contigs (>200 bp) with 525 bp N50 length and 1,410,117 scaffolds (>200 bp) with 885.01 Mb total length were obtained. From the initial assembled L. cylindrica genome, 431,234 microsatellites (SSRs) (≥5 repeats) were identified. The motif types of SSR repeats included 62.88% di-nucleotide, 31.03% tri-nucleotide, 4.59% tetra-nucleotide, 0.96% penta-nucleotide and 0.54% hexa-nucleotide. Eighty genomic SSR markers were developed, and 51/80 primers could be used in both “Zheda 23” and “Zheda 83”. Nineteen SSRs were used to investigate the genetic diversity among 32 accessions through SSR-HRM analysis. The unweighted pair group method analysis (UPGMA) dendrogram tree was built by calculating the SSR-HRM raw data. SSR-HRM could be effectively used for genotype relationship analysis of Luffa species. PMID:28891982
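The coverage figure quoted in this abstract follows directly from the total sequenced bases divided by the estimated genome size. A short Python check using only the numbers stated in the abstract (the function name is chosen for this example):

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth of coverage: sequenced bases per base of genome."""
    return total_bases / genome_size


# Values as reported in the abstract: 43.40 Gb of data, 789.97 Mb genome.
depth = sequencing_depth(43.40e9, 789.97e6)
print(f"{depth:.2f}x")

# The reported SSR motif shares (di- to hexa-nucleotide) should sum to ~100%.
motif_shares = [62.88, 31.03, 4.59, 0.96, 0.54]
print(round(sum(motif_shares), 2))
```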
The recurrence sequences via Sylvester matrices
NASA Astrophysics Data System (ADS)
Karaduman, Erdal; Deveci, Ömür
2017-07-01
In this work, we define the Pell-Jacobsthal-Sylvester sequence and the Jacobsthal-Pell-Sylvester sequence by using the Sylvester matrices which are obtained from the characteristic polynomials of the Pell and Jacobsthal sequences, and then we study the sequences defined modulo m. We also obtain the cyclic groups and the semigroups from the generating matrices of these sequences when read modulo m, and then we derive the relationships among the orders of the cyclic groups and the periods of the sequences. Furthermore, we redefine the Pell-Jacobsthal-Sylvester sequence and the Jacobsthal-Pell-Sylvester sequence by means of the elements of the groups, and then we examine them in the finite groups.
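To make "period of a sequence modulo m" concrete, the Python sketch below computes it for the classical Pell (P_n = 2P_{n-1} + P_{n-2}) and Jacobsthal (J_n = J_{n-1} + 2J_{n-2}) sequences, both started from 0, 1. It does not construct the Sylvester-matrix sequences defined in the paper; the function name and the search bound are choices made for this illustration.

```python
def period_mod(a, b, m):
    """Period of x_n = a*x_{n-1} + b*x_{n-2} (x_0 = 0, x_1 = 1) modulo m.

    Returns None when the sequence never returns to its starting pair,
    i.e. when it is not purely periodic modulo m.
    """
    if m < 2:
        raise ValueError("modulus must be at least 2")
    prev, curr = 0, 1
    # There are only m*m possible consecutive pairs, so this bound suffices.
    for n in range(1, m * m + 1):
        prev, curr = curr, (a * curr + b * prev) % m
        if (prev, curr) == (0, 1):
            return n
    return None


for m in (2, 3, 5):
    print(m, "Pell:", period_mod(2, 1, m), "Jacobsthal:", period_mod(1, 2, m))
```

The Jacobsthal recurrence has a generating matrix with determinant -2, so it is not invertible modulo an even m; in that case the sequence is only eventually periodic and the sketch returns None.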
The least channel capacity for chaos synchronization.
Wang, Mogei; Wang, Xingyuan; Liu, Zhenzhen; Zhang, Huaguang
2011-03-01
Recently researchers have found that a channel with capacity exceeding the Kolmogorov-Sinai entropy of the drive system (h(KS)) is theoretically necessary and sufficient to sustain the unidirectional synchronization to arbitrarily high precision. In this study, we use symbolic dynamics and the automaton reset sequence to distinguish the information that is required in identifying the current drive word and obtaining the synchronization. Then, we show that the least channel capacity that is sufficient to transmit the distinguished information and attain the synchronization of arbitrarily high precision is h(KS). Numerical simulations provide support for our conclusions.
Estimation and classification by sigmoids based on mutual information
NASA Technical Reports Server (NTRS)
Baram, Yoram
1994-01-01
An estimate of the probability density function of a random vector is obtained by maximizing the mutual information between the input and the output of a feedforward network of sigmoidal units with respect to the input weights. Classification problems can be solved by selecting the class associated with the maximal estimated density. Newton's method, applied to an estimated density, yields a recursive maximum likelihood estimator, consisting of a single internal layer of sigmoids, for a random variable or a random sequence. Applications to the diamond classification and to the prediction of a sun-spot process are demonstrated.
Mochida, Keiichi; Uehara-Yamaguchi, Yukiko; Takahashi, Fuminori; Yoshida, Takuhiro; Sakurai, Tetsuya; Shinozaki, Kazuo
2013-01-01
A comprehensive collection of full-length cDNAs is essential for correct structural gene annotation and functional analyses of genes. We constructed a mixed full-length cDNA library from 21 different tissues of Brachypodium distachyon Bd21, and obtained 78,163 high quality expressed sequence tags (ESTs) from both ends of ca. 40,000 clones (including 16,079 contigs). We updated gene structure annotations of Brachypodium genes based on full-length cDNA sequences in comparison with the latest publicly available annotations. About 10,000 non-redundant gene models were supported by full-length cDNAs; ca. 6,000 showed some transcription unit modifications. We also found ca. 580 novel gene models, including 362 newly identified in Bd21. Using the updated transcription start sites, we searched a total of 580 plant cis-motifs in the −3 kb promoter regions and determined a genome-wide Brachypodium promoter architecture. Furthermore, we integrated the Brachypodium full-length cDNAs and updated gene structures with available sequence resources in wheat and barley in a web-accessible database, the RIKEN Brachypodium FL cDNA database. The database represents a “one-stop” information resource for all genomic information in the Pooideae, facilitating functional analysis of genes in this model grass plant and seamless knowledge transfer to the Triticeae crops. PMID:24130698
Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network.
Zhang, Buzhong; Li, Linqing; Lü, Qiang
2018-05-25
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
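The two evaluation measures quoted in this abstract, mean absolute error and Pearson's correlation coefficient, and the conversion of continuous values into binary states can be sketched in a few lines of Python. The values and the 0.25 threshold below are hypothetical; the abstract does not state which thresholds were used.

```python
from math import sqrt


def mean_absolute_error(predicted, observed):
    """Average absolute difference between paired values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def pearson(predicted, observed):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(observed)
    mp, mo = sum(predicted) / n, sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov / sqrt(var_p * var_o)


def to_binary(values, threshold):
    """Label each residue 'exposed' if its value reaches the threshold, else 'buried'."""
    return ["exposed" if v >= threshold else "buried" for v in values]


# Hypothetical relative solvent accessibility values for four residues.
observed = [0.10, 0.40, 0.80, 0.20]
predicted = [0.15, 0.35, 0.70, 0.30]
print(mean_absolute_error(predicted, observed))
print(pearson(predicted, observed))
print(to_binary(predicted, 0.25))
```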
Molecular analysis of the human faecal archaea in a southern Indian population.
Rani, Sandya B; Balamurugan, Ramadass; Ramakrishna, Balakrishnan S
2017-03-01
Archaea are an important constituent of the human gut microbiota, but there is no information on human gut archaea in an Indian population. In this study, faecal samples were obtained from different age groups (neonatal babies, preschool children, school-going children, adolescents, adults and elderly) of a southern Indian population, and from a tribal population also resident in southern India. 16S rRNA gene sequences specific to Archaea were amplified from pooled faecal DNA in each group, sequenced, and aligned against the NCBI database. Of the 806 adequate sequences in the study, most aligned with 22 known sequences. There were 9 novel sequences in the present study. All sequences were deposited in the GenBank nucleotide sequence database with the following accession numbers: KF607113 - KF607918. Methanobrevibacter was the most prevalent genus among all the age groups accounting for 98% in neonates, 96% in post-weaning, and 100% each in preschool, school and adult population. In the elderly, Methanobrevibacter accounted for 96% and in tribal adults, 99% of the clones belonged to Methanobrevibacter genus. Other genera detected included Caldisphaera, Halobaculum, Methanosphaera and Thermogymnomonas. Methanobrevibacter smithii predominated in all age groups, accounting for 749 (92.9%) of the 806 sequences. Archaea can be found in the faeces of southern Indian residents immediately after birth. Methanobrevibacter smithii was the dominant faecal archaeon in all age groups, with other genera being found at the extremes of age.
Analysis of expressed sequence tags for Frankliniella occidentalis, the western flower thrips.
Rotenberg, D; Whitfield, A E
2010-08-01
Thrips are members of the insect order Thysanoptera and Frankliniella occidentalis (the western flower thrips) is the most economically important pest within this order. F. occidentalis is both a direct pest of crops and an efficient vector of plant viruses, including Tomato spotted wilt virus (TSWV). Despite the world-wide importance of thrips in agriculture, there is little knowledge of the F. occidentalis genome or gene functions at this time. A normalized cDNA library was constructed from first instar thrips and 13 839 expressed sequence tags (ESTs) were obtained. Our EST data assembled into 894 contigs and 11 806 singletons (12 700 nonredundant sequences). We found that 31% of these sequences had significant similarity (E ≤ 10^-10) to protein sequences in the National Center for Biotechnology Information nonredundant (nr) protein database, and 25% were functionally annotated using Blast2GO. We identified 74 sequences with putative homology to proteins associated with insect innate immunity. Sixteen sequences had significant similarity to proteins associated with small RNA-mediated gene silencing pathways (RNA interference; RNAi), including the antiviral pathway (short interfering RNA-mediated pathway). Our EST collection provides new sequence resources for characterizing gene functions in F. occidentalis and other thrips species with regards to vital biological processes, studying the mechanism of interactions with the viruses harboured and transmitted by the vector, and identifying new insect gene-centred targets for plant disease and insect control.
Pereiro, Patricia; Balseiro, Pablo; Romero, Alejandro; Dios, Sonia; Forn-Cuni, Gabriel; Fuste, Berta; Planas, Josep V.; Beltran, Sergi; Novoa, Beatriz; Figueras, Antonio
2012-01-01
Background Turbot (Scophthalmus maximus L.) is an important aquacultural resource both in Europe and Asia. However, there is little information on gene sequences available in public databases. Currently, one of the main problems affecting the culture of this flatfish is mortality due to several pathogens, especially viral diseases which are not treatable. In order to identify new genes involved in immune defense, we conducted 454-pyrosequencing of the turbot transcriptome after different immune stimulations. Methodology/Principal Findings Turbot were injected with viral stimuli to increase the expression level of immune-related genes. High-throughput deep sequencing using 454-pyrosequencing technology yielded 915,256 high-quality reads. These sequences were assembled into 55,404 contigs that were subjected to annotation steps. Intriguingly, 55.16% of the deduced proteins were not significantly similar to any sequences in the databases used for the annotation and only 0.85% of the BLASTx top-hits matched S. maximus protein sequences. This relatively low level of annotation is possibly due to the limited information for this species and other flatfish in the database. These results suggest the identification of a large number of new genes in turbot and in fish in general. A more detailed analysis showed the presence of putative members of several innate and specific immune pathways. Conclusions/Significance To our knowledge, this study is the first transcriptome analysis using 454-pyrosequencing for turbot. Previously, there were only 12,471 ESTs and fewer than 1,500 nucleotide sequences for S. maximus in the NCBI database. Our results provide a rich source of data (55,404 contigs and 181,845 singletons) for discovering and identifying new genes, which will serve as a basis for microarray construction, gene expression characterization and for identification of genetic markers to be used in several applications. 
Immune stimulation in turbot was very effective, obtaining an enormous variety of sequences belonging to genes involved in the defense mechanisms. PMID:22629298
Tank, David C.
2016-01-01
Advances in high-throughput sequencing (HTS) have allowed researchers to obtain large amounts of biological sequence information at speeds and costs unimaginable only a decade ago. Phylogenetics, and the study of evolution in general, is quickly migrating towards using HTS to generate larger and more complex molecular datasets. In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses. The approach uses the Fluidigm Access Array System (Fluidigm, San Francisco, CA, USA) and two sets of PCR primers to simultaneously amplify 48 target regions across 48 samples, incorporating sample-specific barcodes and HTS adapters (2,304 unique amplicons per Access Array). The final product is a pooled set of amplicons ready to be sequenced, and thus, there is no need to construct separate, costly genomic libraries for each sample. Further, we present a bioinformatics pipeline to process the raw HTS reads to either generate consensus sequences (with or without ambiguities) for every locus in every sample or—more importantly—recover the separate alleles from heterozygous target regions in each sample. This is important because it adds allelic information that is well suited for coalescent-based phylogenetic analyses that are becoming very common in conservation and evolutionary biology. To test our approach and bioinformatics pipeline, we sequenced 576 samples across 96 target regions belonging to the South American clade of the genus Bartsia L. in the plant family Orobanchaceae. After sequencing cleanup and alignment, the experiment resulted in ~25,300bp across 486 samples for a set of 48 primer pairs targeting the plastome, and ~13,500bp for 363 samples for a set of primers targeting regions in the nuclear genome. Finally, we constructed a combined concatenated matrix from all 96 primer combinations, resulting in a combined aligned length of ~40,500bp for 349 samples. PMID:26828929
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields.
Wang, Sheng; Peng, Jian; Ma, Jianzhu; Xu, Jinbo
2016-01-11
Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.
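The Q3 figure quoted in this abstract is the fraction of residues whose three-state label (helix, strand, coil) is predicted correctly. A minimal sketch of that calculation follows; the function name and the ten-residue strings are hypothetical illustrations, not data or code from DeepCNF.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose 3-state label (H, E, C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    # Count positions where the predicted state equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical labels for a 10-residue protein (illustration only).
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEE"
print(q3_accuracy(predicted, observed))  # 8 of 10 residues agree
```

Q8 accuracy is computed the same way over the eight-state alphabet.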
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields
NASA Astrophysics Data System (ADS)
Wang, Sheng; Peng, Jian; Ma, Jianzhu; Xu, Jinbo
2016-01-01
Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.
Effects of informed consent for individual genome sequencing on relevant knowledge.
Kaphingst, K A; Facio, F M; Cheng, M-R; Brooks, S; Eidem, H; Linn, A; Biesecker, B B; Biesecker, L G
2012-11-01
Increasing availability of individual genomic information suggests that patients will need knowledge about genome sequencing to make informed decisions, but prior research is limited. In this study, we examined genome sequencing knowledge before and after informed consent among 311 participants enrolled in the ClinSeq™ sequencing study. An exploratory factor analysis of knowledge items yielded two factors (sequencing limitations knowledge; sequencing benefits knowledge). In multivariable analysis, high pre-consent sequencing limitations knowledge scores were significantly related to education [odds ratio (OR): 8.7, 95% confidence interval (CI): 2.45-31.10 for post-graduate education, and OR: 3.9; 95% CI: 1.05, 14.61 for college degree compared with less than college degree] and race/ethnicity (OR: 2.4, 95% CI: 1.09, 5.38 for non-Hispanic Whites compared with other racial/ethnic groups). Mean values increased significantly between pre- and post-consent for the sequencing limitations knowledge subscale (6.9-7.7, p < 0.0001) and sequencing benefits knowledge subscale (7.0-7.5, p < 0.0001); increase in knowledge did not differ by sociodemographic characteristics. This study highlights gaps in genome sequencing knowledge and underscores the need to target educational efforts toward participants with less education or from minority racial/ethnic groups. The informed consent process improved genome sequencing knowledge. Future studies could examine how genome sequencing knowledge influences informed decision making. © 2012 John Wiley & Sons A/S.
A DNA 'barcode blitz': rapid digitization and sequencing of a natural history collection.
Hebert, Paul D N; Dewaard, Jeremy R; Zakharov, Evgeny V; Prosser, Sean W J; Sones, Jayme E; McKeown, Jaclyn T A; Mantle, Beth; La Salle, John
2013-01-01
DNA barcoding protocols require the linkage of each sequence record to a voucher specimen that has, whenever possible, been authoritatively identified. Natural history collections would seem an ideal resource for barcode library construction, but they have never seen large-scale analysis because of concerns linked to DNA degradation. The present study examines the strength of this barrier, carrying out a comprehensive analysis of moth and butterfly (Lepidoptera) species in the Australian National Insect Collection. Protocols were developed that enabled tissue samples, specimen data, and images to be assembled rapidly. Using these methods, a five-person team processed 41,650 specimens representing 12,699 species in 14 weeks. Subsequent molecular analysis took about six months, reflecting the need for multiple rounds of PCR as sequence recovery was impacted by age, body size, and collection protocols. Despite these variables and the fact that specimens averaged 30.4 years old, barcode records were obtained from 86% of the species. In fact, one or more barcode-compliant sequences (>487 bp) were recovered from virtually all species represented by five or more individuals, even when the youngest was 50 years old. By assembling specimen images, distributional data, and DNA barcode sequences on a web-accessible informatics platform, this study has greatly advanced accessibility to information on thousands of species. Moreover, much of the specimen data became publicly accessible within days of its acquisition, while most sequence results saw release within three months. As such, this study reveals the speed with which DNA barcode workflows can mobilize biodiversity data, often providing the first web-accessible information for a species. These results further suggest that existing collections can enable the rapid development of a comprehensive DNA barcode library for the most diverse compartment of terrestrial biodiversity - insects.
Dactyl Alphabet Gesture Recognition in a Video Sequence Using Microsoft Kinect
NASA Astrophysics Data System (ADS)
Artyukhin, S. G.; Mestetskiy, L. M.
2015-05-01
This paper presents an efficient framework for static gesture recognition based on data obtained from web cameras and the Kinect depth sensor (RGB-D data). Each gesture is given by a pair of images: a color image and a depth map. The database stores each gesture of the alphabet as a feature description generated from a frame. The recognition algorithm takes a video sequence (a sequence of frames) as input and labels it: each frame is matched to a gesture from the database, or the algorithm decides that the database contains no suitable gesture. First, each frame of the video sequence is classified separately, without interframe information. Then, a run of successfully labeled frames showing the same gesture is grouped into a single static gesture. We propose a method that combines segmentation of the frame by depth map and by RGB image. The primary segmentation is based on the depth map; it gives information about the position of the hand and yields a rough hand boundary. The boundary is then refined using the color image, and the shape of the hand is analyzed. The continuous skeleton method is used to generate features. We propose a method of skeleton terminal branches, which makes it possible to determine the positions of the fingers and the wrist. The classification features for a gesture describe the positions of the fingers relative to the wrist. Experiments with the developed algorithm were carried out on American Sign Language. An American Sign Language gesture has several components, including the shape of the hand, its orientation in space, and the type of movement. The accuracy of the proposed method is evaluated on a collection of gestures consisting of 2700 frames.
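The grouping step described in this abstract, where frames are classified one by one and consecutive frames with the same label are then merged into one static gesture, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the use of `None` for "no suitable gesture", and the `min_run` threshold are all hypothetical.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def group_frames(labels: List[Optional[str]],
                 min_run: int = 3) -> List[Tuple[str, int, int]]:
    """Collapse runs of identically labeled frames into
    (gesture, first_frame, last_frame) tuples."""
    gestures = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        # None marks a frame for which no database gesture was found.
        # min_run is an assumed filter that drops very short runs.
        if label is not None and length >= min_run:
            gestures.append((label, index, index + length - 1))
        index += length
    return gestures


# Hypothetical per-frame classifier output (illustration only).
frames = ["A", "A", "A", None, "B", "B", "B", "B", "A", None]
print(group_frames(frames))
```

With the default threshold, the single trailing "A" frame is discarded as too short to count as a gesture.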
Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik
2014-12-05
For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics', and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available. We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard. We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.
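The Jaccard similarity coefficient used above to compare pipelines is the size of the intersection of two sets divided by the size of their union. The sketch below averages it over all pipeline pairs; the abstract does not say exactly which sets were compared, so treating each pipeline's output as a set of GO terms is an assumption, and the pipeline names and term sets are placeholders.

```python
from itertools import combinations
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(annotations: Dict[str, Set[str]]) -> float:
    """Average Jaccard coefficient over all unordered pairs of pipelines."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        raise ValueError("need at least two pipelines")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder predictions for one gene from three hypothetical pipelines.
annotations = {
    "pipeline_a": {"GO:0006355", "GO:0006412", "GO:0008152"},
    "pipeline_b": {"GO:0006355", "GO:0008152", "GO:0016310"},
    "pipeline_c": {"GO:0006412"},
}
print(round(mean_pairwise_jaccard(annotations), 3))
```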
Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl
2010-01-01
β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC = 0.50, Qtotal = 82.1%, sensitivity = 75.6%, PPV = 68.8% and AUC = 0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17 – 0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. Conclusion The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences. PMID:21152409
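The two-class measures reported for NetTurnP (MCC, Qtotal, sensitivity, PPV) all derive from the four cells of a confusion matrix. A minimal sketch using the standard definitions follows; the counts are hypothetical and are not taken from the BT426 evaluation.

```python
from math import sqrt
from typing import Dict


def two_class_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard two-class measures from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 rather than failing when a denominator is zero.
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "q_total": ratio(tp + tn, tp + fp + tn + fn),  # overall accuracy
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical per-residue counts (turn = positive class).
metrics = two_class_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike Qtotal, MCC stays informative when the classes are unbalanced, which matters here because only about a quarter of residues are in β-turns.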
Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl
2010-11-30
β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC=0.50, Qtotal=82.1%, sensitivity=75.6%, PPV=68.8% and AUC=0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17-0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences.
Samuels, Amy K; Weisrock, David W; Smith, Jeramiah J; France, Katherine J; Walker, John A; Putta, Srikrishna; Voss, S Randal
2005-04-11
We report on a study that extended mitochondrial transcript information from a recent EST project to obtain complete mitochondrial genome sequence for 5 tiger salamander complex species (Ambystoma mexicanum, A. t. tigrinum, A. andersoni, A. californiense, and A. dumerilii). We describe, for the first time, aspects of mitochondrial transcription in a representative amphibian, and then use complete mitochondrial sequence data to examine salamander phylogeny at both deep and shallow levels of evolutionary divergence. The available mitochondrial ESTs for A. mexicanum (N=2481) and A. t. tigrinum (N=1205) provided 92% and 87% coverage of the mitochondrial genome, respectively. Complete mitochondrial sequences for all species were rapidly obtained by using long distance PCR and DNA sequencing. A number of genome structural characteristics (base pair length, base composition, gene number, gene boundaries, codon usage) were highly similar among all species and to other distantly related salamanders. Overall, mitochondrial transcription in Ambystoma approximated the pattern observed in other vertebrates. We inferred from the mapping of ESTs onto mtDNA that transcription occurs from both heavy and light strand promoters and continues around the entire length of the mtDNA, followed by post-transcriptional processing. However, the observation of many short transcripts corresponding to rRNA genes indicates that transcription may often terminate prematurely to bias transcription of rRNA genes; indeed an rRNA transcription termination signal sequence was observed immediately following the 16S rRNA gene. Phylogenetic analyses of salamander family relationships consistently grouped Ambystomatidae in a clade containing Cryptobranchidae and Hynobiidae, to the exclusion of Salamandridae. This robust result suggests a novel alternative hypothesis because previous studies have consistently identified Ambystomatidae and Salamandridae as closely related taxa. 
Phylogenetic analyses of tiger salamander complex species also produced robustly supported trees. The D-loop, used in previous molecular phylogenetic studies of the complex, was found to contain a relatively low level of variation and we identified mitochondrial regions with higher rates of molecular evolution that are more useful in resolving relationships among species. Our results show the benefit of using complete genome mitochondrial information in studies of recently and rapidly diverged taxa.
Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.
Wang, Sheng; Sun, Siqi; Li, Zhen; Zhang, Renyu; Xu, Jinbo
2017-01-01
Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction. This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain higher-quality contact prediction regardless of how many sequence homologs are available for proteins in question. Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. 
Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. http://raptorx.uchicago.edu/ContactMap/.
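The "top L" and "top L/10" long-range accuracies quoted in this abstract rank the predicted residue pairs by score, keep the best L/k long-range pairs (L being the sequence length), and report the fraction that are true contacts. The sketch below assumes the common convention that long-range means a sequence separation of at least 24 residues, which the abstract does not state; the function name and all scores are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue indices (i, j) with i < j


def top_long_range_precision(scores: Dict[Pair, float],
                             true_contacts: Set[Pair],
                             seq_len: int,
                             k: int = 1,
                             min_sep: int = 24) -> float:
    """Precision of the top seq_len // k long-range predicted contacts."""
    # Keep only pairs separated by at least min_sep residues (assumed cutoff).
    long_range = [(pair, score) for pair, score in scores.items()
                  if abs(pair[1] - pair[0]) >= min_sep]
    long_range.sort(key=lambda item: item[1], reverse=True)
    top = long_range[:max(1, seq_len // k)]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in true_contacts)
    # Divides by the number of pairs actually ranked, which can be
    # smaller than seq_len // k when few pairs were scored.
    return hits / len(top)


# Hypothetical predictions for a 50-residue protein.
scores = {
    (1, 30): 0.95, (2, 40): 0.90, (5, 45): 0.80,
    (3, 10): 0.99,  # short-range pair, ignored by the filter
    (7, 35): 0.70, (8, 48): 0.60, (9, 49): 0.10,
}
true_contacts = {(1, 30), (5, 45), (8, 48), (3, 10)}
print(top_long_range_precision(scores, true_contacts, seq_len=50, k=10))
```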
Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Li, Zhen; Zhang, Renyu
2017-01-01
Motivation Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction. Method This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain higher-quality contact prediction regardless of how many sequence homologs are available for proteins in question. Results Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. 
Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then. Availability http://raptorx.uchicago.edu/ContactMap/ PMID:28056090
Nepusz, Tamás; Sasidharan, Rajkumar; Paccanaro, Alberto
2010-03-09
An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community. SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences). 
Besides the spectral method, SCPS also implements connected component analysis and hierarchical clustering, it integrates TribeMCL, it provides different cluster quality tools, it can extract human-readable protein descriptions using GI numbers from NCBI, it interfaces with external tools such as BLAST and Cytoscape, and it can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
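Connected component analysis, one of the baseline methods this abstract lists alongside the spectral method, can be sketched with the standard library alone: link every pair of proteins whose similarity reaches a threshold and report the resulting components as clusters. This is not the SCPS implementation and not the spectral method; the function name, threshold and similarity values are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(proteins: Iterable[str],
                         similarities: Dict[Tuple[str, str], float],
                         threshold: float) -> List[List[str]]:
    """Cluster proteins by linking pairs with similarity >= threshold."""
    parent = {p: p for p in proteins}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for p in parent:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarities (illustration only).
proteins = ["P1", "P2", "P3", "P4", "P5"]
similarities = {("P1", "P2"): 0.9, ("P2", "P3"): 0.8,
                ("P3", "P4"): 0.1, ("P4", "P5"): 0.7}
print(connected_components(proteins, similarities, threshold=0.5))
```

A single above-threshold link is enough to merge two groups, which illustrates why such purely local criteria can be outperformed by global methods like spectral clustering.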
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shi, CY; Yang, H; Wei, CL
Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly (A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. 
Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.
2011-01-01
Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly (A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. 
Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). Conclusions An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090
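The abstract above reports an N50 of 506 bp for the assembled unigenes. As a reminder of what that statistic measures, here is a minimal Python sketch of the usual N50 definition (the length L such that sequences of length L or longer contain at least half of all assembled bases). The function name `n50` and the toy lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths, purely illustrative (not taken from the tea transcriptome).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because N50 weights long sequences more heavily than the plain mean does, an assembly dominated by short singletons can show an N50 well above its average length, as in the figures quoted above (355 bp mean, 506 bp N50).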
Reinprecht, Yarmilla; Yadegari, Zeinab; Perry, Gregory E.; Siddiqua, Mahbuba; Wright, Lori C.; McClean, Phillip E.; Pauls, K. Peter
2013-01-01
Legumes contain a variety of phytochemicals derived from the phenylpropanoid pathway that have important effects on human health as well as seed coat color, plant disease resistance and nodulation. However, the information about the genes involved in this important pathway is fragmentary in common bean (Phaseolus vulgaris L.). The objectives of this research were to isolate genes that function in and control the phenylpropanoid pathway in common bean, determine their genomic locations in silico in common bean and soybean, and analyze sequences of the 4CL gene family in two common bean genotypes. Sequences of phenylpropanoid pathway genes available for common bean or other plant species were aligned, and the conserved regions were used to design sequence-specific primers. The PCR products were cloned and sequenced and the gene sequences along with common bean gene-based (g) markers were BLASTed against the Glycine max v.1.0 genome and the P. vulgaris v.1.0 (Andean) early release genome. In addition, gene sequences were BLASTed against the OAC Rex (Mesoamerican) genome sequence assembly. In total, fragments of 46 structural and regulatory phenylpropanoid pathway genes were characterized in this way and placed in silico on common bean and soybean sequence maps. The maps contain over 250 common bean g and SSR (simple sequence repeat) markers and identify the positions of more than 60 additional phenylpropanoid pathway gene sequences, plus the putative locations of seed coat color genes. The majority of cloned phenylpropanoid pathway gene sequences were mapped to one location in the common bean genome but had two positions in soybean. The comparison of the genomic maps confirmed previous studies, which show that common bean and soybean share genomic regions, including those containing phenylpropanoid pathway gene sequences, with conserved synteny. 
Indels identified in the comparison of Andean and Mesoamerican common bean 4CL gene sequences might be used to develop inter-pool phenylpropanoid pathway gene-based markers. We anticipate that the information obtained by this study will simplify and accelerate selections of common bean with specific phenylpropanoid pathway alleles to increase the contents of beneficial phenylpropanoids in common bean and other legumes. PMID:24046770
Zhang, Wei; Zhang, Xiaolong; Qiang, Yan; Tian, Qi; Tang, Xiaoxian
2017-01-01
The fast and accurate segmentation of lung nodule image sequences is the basis of subsequent processing and diagnostic analyses. However, previous research investigating nodule segmentation algorithms cannot entirely segment cavitary nodules, and the segmentation of juxta-vascular nodules is inaccurate and inefficient. To solve these problems, we propose a new method for the segmentation of lung nodule image sequences based on superpixels and density-based spatial clustering of applications with noise (DBSCAN). First, our method uses three-dimensional computed tomography image features of the average intensity projection combined with multi-scale dot enhancement for preprocessing. Hexagonal clustering and morphological optimized sequential linear iterative clustering (HMSLIC) for sequence image oversegmentation is then proposed to obtain superpixel blocks. The adaptive weight coefficient is then constructed to calculate the distance required between superpixels to achieve precise lung nodules positioning and to obtain the subsequent clustering starting block. Moreover, by fitting the distance and detecting the change in slope, an accurate clustering threshold is obtained. Thereafter, a fast DBSCAN superpixel sequence clustering algorithm, which is optimized by the strategy of only clustering the lung nodules and adaptive threshold, is then used to obtain lung nodule mask sequences. Finally, the lung nodule image sequences are obtained. The experimental results show that our method rapidly, completely and accurately segments various types of lung nodule image sequences. PMID:28880916
Esteve-Codina, Anna; Arpi, Oriol; Martinez-García, Maria; Pineda, Estela; Mallo, Mar; Gut, Marta; Carrato, Cristina; Rovira, Anna; Lopez, Raquel; Tortosa, Avelina; Dabad, Marc; Del Barco, Sonia; Heath, Simon; Bagué, Silvia; Ribalta, Teresa; Alameda, Francesc; de la Iglesia, Nuria
2017-01-01
The molecular classification of glioblastoma (GBM) based on gene expression might better explain outcome and response to treatment than clinical factors. Whole transcriptome sequencing using next-generation sequencing platforms is rapidly becoming accepted as a tool for measuring gene expression for both research and clinical use. Fresh frozen (FF) tissue specimens of GBM are difficult to obtain since tumor tissue obtained at surgery is often scarce and necrotic and diagnosis is prioritized over freezing. After diagnosis, leftover tissue is usually stored as formalin-fixed paraffin-embedded (FFPE) tissue. However, RNA from FFPE tissues is usually degraded, which could hamper gene expression analysis. We compared RNA-Seq data obtained from matched pairs of FF and FFPE GBM specimens. Only three FFPE out of eleven FFPE-FF matched samples yielded informative results. Several quality-control measurements showed that RNA from FFPE samples was highly degraded but maintained transcriptomic similarities to RNA from FF samples. Certain issues regarding mutation analysis and subtype prediction were detected. Nevertheless, our results suggest that RNA-Seq of FFPE GBM specimens provides reliable gene expression data that can be used in molecular studies of GBM if the RNA is sufficiently preserved. PMID:28122052
Porter, Teresita M.; Golding, G. Brian
2012-01-01
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, increasingly, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215
Carnegie, Nicole Bohme; Wang, Rui; Novitsky, Vladimir; De Gruttola, Victor
2014-01-01
Linkage analysis is useful in investigating disease transmission dynamics and the effect of interventions on them, but estimates of probabilities of linkage between infected people from observed data can be biased downward when missingness is informative. We investigate variation in the rates at which subjects' viral genotypes link across groups defined by viral load (low/high) and antiretroviral treatment (ART) status using blood samples from household surveys in the Northeast sector of Mochudi, Botswana. The probability of obtaining a sequence from a sample varies with viral load; samples with low viral load are harder to amplify. Pairwise genetic distances were estimated from aligned nucleotide sequences of HIV-1C env gp120. It is first shown that the probability that randomly selected sequences are linked can be estimated consistently from observed data. This is then used to develop estimates of the probability that a sequence from one group links to at least one sequence from another group under the assumption of independence across pairs. Furthermore, a resampling approach is developed that accounts for the presence of correlation across pairs, with diagnostics for assessing the reliability of the method. Sequences were obtained for 65% of subjects with high viral load (HVL, n = 117), 54% of subjects with low viral load but not on ART (LVL, n = 180), and 45% of subjects on ART (ART, n = 126). The probability of linkage between two individuals is highest if both have HVL, and lowest if one has LVL and the other has LVL or is on ART. Linkage across groups is high for HVL and lower for LVL and ART. Adjustment for missing data increases the group-wise linkage rates by 40–100%, and changes the relative rates between groups. Bias in inferences regarding HIV viral linkage that arise from differential ability to genotype samples can be reduced by appropriate methods for accommodating missing data. PMID:24415932
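The group-wise linkage estimate described above rests on an independence assumption across pairs. A minimal Python sketch of that step follows, assuming a single pairwise linkage probability `p_pair` and `n_other` sequences in the other group; both names and the example value of `p_pair` are hypothetical, and the study's actual estimators (including the resampling approach for correlated pairs) are more involved than this.

```python
def prob_links_to_at_least_one(p_pair, n_other):
    """P(a sequence links to >= 1 of n_other sequences), assuming each
    pair links independently with probability p_pair."""
    if not 0.0 <= p_pair <= 1.0:
        raise ValueError("p_pair must be a probability")
    if n_other < 0:
        raise ValueError("n_other must be non-negative")
    # Complement of "links to none of them".
    return 1.0 - (1.0 - p_pair) ** n_other


# Illustration only: p_pair is an invented value. The group size (117) and
# the 65% sequencing rate are the HVL figures quoted in the abstract.
p_pair = 0.01
n_all = 117
n_sequenced = round(0.65 * n_all)
print(prob_links_to_at_least_one(p_pair, n_sequenced))  # observed sequences only
print(prob_links_to_at_least_one(p_pair, n_all))        # all subjects in the group
```

Counting only the sequences that were actually obtained gives the smaller of the two probabilities, which is one way to see why ignoring missing genotypes biases linkage rates downward.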
Methodologic European external quality assurance for DNA sequencing: the EQUALseq program.
Ahmad-Nejad, Parviz; Dorn-Beineke, Alexandra; Pfeiffer, Ulrike; Brade, Joachim; Geilenkeuser, Wolf-Jochen; Ramsden, Simon; Pazzagli, Mario; Neumaier, Michael
2006-04-01
DNA sequencing is a key technique in molecular diagnostics, but to date no comprehensive methodologic external quality assessment (EQA) programs have been instituted. Between 2003 and 2005, the European Union funded, as specific support actions, the EQUAL initiative to develop methodologic EQA schemes for genotyping (EQUALqual), quantitative PCR (EQUALquant), and sequencing (EQUALseq). Here we report on the results of the EQUALseq program. The participating laboratories received a 4-sample set comprising 2 DNA plasmids, a PCR product, and a finished sequencing reaction to be analyzed. Data and information from detailed questionnaires were uploaded online and evaluated by use of a scoring system for technical skills and proficiency of data interpretation. Sixty laboratories from 21 European countries registered, and 43 participants (72%) returned data and samples. Capillary electrophoresis was the predominant platform (n = 39; 91%). The median contiguous correct sequence stretch was 527 nucleotides with considerable variation in quality of both primary data and data evaluation. The association between laboratory performance and the number of sequencing assays/year was statistically significant (P <0.05). Interestingly, more than 30% of participants neither added comments to their data nor made efforts to identify the gene sequences or mutational positions. Considerable variations exist even in a highly standardized methodology such as DNA sequencing. Methodologic EQAs are appropriate tools to uncover strengths and weaknesses in both technique and proficiency, and our results emphasize the need for mandatory EQAs. The results of EQUALseq should help improve the overall quality of molecular genetics findings obtained by DNA sequencing.
Garrido-Martín, Diego; Pazos, Florencio
2018-02-27
The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this has never been assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as the only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Dong, Zheng; Zhou, Hongyu; Tao, Peng
2018-02-01
PAS domains are widespread in archaea, bacteria, and eukaryota, and play important roles in various functions. In this study, we aim to explore the functional evolutionary relationships among proteins in the PAS domain superfamily in view of the sequence-structure-dynamics-function relationship. We collected protein sequences and crystal structure data from the RCSB Protein Data Bank for PAS domain superfamily members belonging to three biological functions (nucleotide binding, photoreceptor activity, and transferase activity). Protein sequences were aligned and then used to select sequence-conserved residues and build a phylogenetic tree. Three-dimensional structure alignment was also applied to obtain structure-conserved residues. The protein dynamics were analyzed using an elastic network model (ENM) and validated by molecular dynamics (MD) simulation. The results showed that proteins with the same function could be grouped by sequence similarity, and proteins in different functional groups displayed statistically significant differences in their vibrational patterns. Interestingly, in all three functional groups, conserved amino acid residues identified by sequence and structure conservation analysis generally have a lower fluctuation than other residues. In addition, the fluctuation of conserved residues in each biological function group was strongly correlated with the corresponding biological function. This research suggests a direct connection in which the protein sequences are related to various functions through structural dynamics. This is a new attempt to delineate functional evolution of proteins using the integrated information of sequence, structure, and dynamics. © 2017 The Protein Society.
Information-Theoretical Analysis of EEG Microstate Sequences in Python.
von Wegner, Frederic; Laufs, Helmut
2018-01-01
We present an open-source Python package to compute information-theoretical quantities for electroencephalographic data. Electroencephalography (EEG) measures the electrical potential generated by the cerebral cortex and the set of spatial patterns projected by the brain's electrical potential on the scalp surface can be clustered into a set of representative maps called EEG microstates. Microstate time series are obtained by competitively fitting the microstate maps back into the EEG data set, i.e., by substituting the EEG data at a given time with the label of the microstate that has the highest similarity with the actual EEG topography. As microstate sequences consist of non-metric random variables, e.g., the letters A-D, we recently introduced information-theoretical measures to quantify these time series. In wakeful resting state EEG recordings, we found new characteristics of microstate sequences such as periodicities related to EEG frequency bands. The algorithms used are here provided as an open-source package and their use is explained in a tutorial style. The package is self-contained and the programming style is procedural, focusing on code intelligibility and easy portability. Using a sample EEG file, we demonstrate how to perform EEG microstate segmentation using the modified K-means approach, and how to compute and visualize the recently introduced information-theoretical tests and quantities. The time-lagged mutual information function is derived as a discrete symbolic alternative to the autocorrelation function for metric time series and confidence intervals are computed from Markov chain surrogate data. The software package provides an open-source extension to the existing implementations of the microstate transform and is specifically designed to analyze resting state EEG recordings.
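The time-lagged mutual information mentioned above can be written as I(k) = H(X_t) + H(X_{t+k}) - H(X_t, X_{t+k}), with entropies estimated from symbol counts. The following is a minimal standard-library sketch of that quantity for a symbolic sequence. The function names are hypothetical and do not reproduce the published package's API, and the toy label string is invented for illustration.

```python
from collections import Counter
from math import log


def entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


def lagged_mutual_information(seq, lag):
    """I(X_t; X_{t+lag}) = H(X_t) + H(X_{t+lag}) - H(X_t, X_{t+lag})."""
    if lag < 0 or lag >= len(seq):
        raise ValueError("lag must satisfy 0 <= lag < len(seq)")
    head = seq[:len(seq) - lag]   # symbols at time t
    tail = seq[lag:]              # symbols at time t + lag
    h_x = entropy(Counter(head))
    h_y = entropy(Counter(tail))
    h_xy = entropy(Counter(zip(head, tail)))
    return h_x + h_y - h_xy


# Toy microstate label sequence over the letters A-D (illustrative only).
labels = "AABBCCDDAABBCCDD" * 8
for k in range(0, 9):
    print(k, round(lagged_mutual_information(labels, k), 4))
```

At lag 0 the value equals the entropy of the sequence itself, and for a periodic toy sequence like this one the function rises again at multiples of the period, which is the kind of signature the abstract relates to EEG frequency bands. Confidence intervals from Markov chain surrogates, as described above, are not included in this sketch.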
Curci, Pasquale L.; De Paola, Domenico; Danzi, Donatella; Vendramin, Giovanni G.; Sonnante, Gabriella
2015-01-01
With over 20,000 species, Asteraceae is the second largest plant family. High-throughput sequencing of nuclear and chloroplast genomes has allowed for a better understanding of the evolutionary relationships within large plant families. Here, the globe artichoke chloroplast (cp) genome was obtained by a combination of whole-genome and BAC clone high-throughput sequencing. The artichoke cp genome is 152,529 bp in length, consisting of two single-copy regions separated by a pair of inverted repeats (IRs) of 25,155 bp, representing the longest IRs found in the Asteraceae family so far. The large (LSC) and the small (SSC) single-copy regions span 83,578 bp and 18,641 bp, respectively. The artichoke cp sequence was compared to the other eight Asteraceae complete cp genomes available, revealing an IR expansion at the SSC/IR boundary. This expansion consists of 17 bp of the ndhF gene generating an overlap between the ndhF and ycf1 genes. A total of 127 cp simple sequence repeats (cpSSRs) were identified in the artichoke cp genome, potentially suitable for future population studies in the Cynara genus. Parsimony-informative regions were evaluated and allowed a Cynara species to be placed within the Asteraceae family tree. The eight most informative coding regions were also considered and tested for “specific barcode” purposes in the Asteraceae family. Our results highlight the usefulness of cp genome sequencing in exploring plant genome diversity and retrieving reliable molecular resources for phylogenetic and evolutionary studies, as well as for specific barcodes in plants. PMID:25774672
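As a quick consistency check on the region sizes quoted above, the quadripartite structure (one LSC, one SSC, and two IR copies) should sum to the total genome length. The symbols below are introduced here for illustration only.

```latex
\begin{aligned}
L_{\mathrm{cp}} &= L_{\mathrm{LSC}} + L_{\mathrm{SSC}} + 2\,L_{\mathrm{IR}} \\
                &= 83{,}578 + 18{,}641 + 2 \times 25{,}155 \\
                &= 152{,}529\ \text{bp}
\end{aligned}
```

The sum matches the reported genome length of 152,529 bp.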
Osmundson, Todd W.; Robert, Vincent A.; Schoch, Conrad L.; Baker, Lydia J.; Smith, Amy; Robich, Giovanni; Mizzan, Luca; Garbelotto, Matteo M.
2013-01-01
Despite recent advances spearheaded by molecular approaches and novel technologies, species description and DNA sequence information are significantly lagging for fungi compared to many other groups of organisms. Large scale sequencing of vouchered herbarium material can aid in closing this gap. Here, we describe an effort to obtain broad ITS sequence coverage of the approximately 6000 macrofungal-species-rich herbarium of the Museum of Natural History in Venice, Italy. Our goals were to investigate issues related to large sequencing projects, develop heuristic methods for assessing the overall performance of such a project, and evaluate the prospects of such efforts to reduce the current gap in fungal biodiversity knowledge. The effort generated 1107 sequences submitted to GenBank, including 416 previously unrepresented taxa and 398 sequences exhibiting a best BLAST match to an unidentified environmental sequence. Specimen age and taxon affected sequencing success, and subsequent work on failed specimens showed that an ITS1 mini-barcode greatly increased sequencing success without greatly reducing the discriminating power of the barcode. Similarity comparisons and nonmetric multidimensional scaling ordinations based on pairwise distance matrices proved to be useful heuristic tools for validating the overall accuracy of specimen identifications, flagging potential misidentifications, and identifying taxa in need of additional species-level revision. Comparison of within- and among-species nucleotide variation showed a strong increase in species discriminating power at 1–2% dissimilarity, and identified potential barcoding issues (same sequence for different species and vice-versa). All sequences are linked to a vouchered specimen, and results from this study have already prompted revisions of species-sequence assignments in several taxa. PMID:23638077
A protein block based fold recognition method for the annotation of twilight zone sequences.
Suresh, V; Ganesan, K; Parthasarathy, S
2013-03-01
The description of the protein backbone was recently improved with groups of structural fragments called Structural Alphabets, which replace the regular three-state (Helix, Sheet and Coil) secondary structure description. Protein Blocks are one of the Structural Alphabets used to describe every region of the protein backbone, including the coil. According to de Brevern (2000), the Protein Blocks comprise 16 structural fragments, each 5 residues in length. Protein Blocks are highly informative among the available Structural Alphabets and have been used for many applications. Here, we present a protein fold recognition method based on Protein Blocks for the annotation of twilight zone sequences. In our method, we align the predicted Protein Blocks of a query amino acid sequence with a library of assigned Protein Blocks of 953 known folds using local pair-wise alignment. Alignment results with z-value ≥ 2.5 and P-value ≤ 0.08 are predicted as possible folds. Our method is able to recognize possible folds for nearly 35.5% of the twilight zone sequences, using their predicted Protein Block sequences obtained by pb_prediction, which is available at the Protein Block Export server.
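The local pair-wise alignment step described above can be illustrated with a minimal Smith-Waterman scoring sketch in Python. It assumes Protein Block sequences are written as strings over the 16 letters a to p and uses a flat match/mismatch/gap scheme. These scores are placeholder assumptions: the published method presumably relies on a Protein Block substitution matrix and on z-value and P-value statistics, none of which are reproduced here.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            curr.append(max(0, diag, up, left))  # 0 restarts the alignment
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical Protein Block strings (letters a-p), for illustration only.
query = "mmmmnopacdfklmm"
library_entry = "ddfklmmmmnopab"
print(local_alignment_score(query, library_entry))
```

In a fold recognition setting, a score like this would be computed against every library entry and then compared with scores from shuffled sequences to obtain a z-value.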
The processing of images of biological threats in visual short-term memory.
Quinlan, Philip T; Yue, Yue; Cohen, Dale J
2017-08-30
The idea that there is enhanced memory for negatively, emotionally charged pictures was examined. Performance was measured under rapid, serial visual presentation (RSVP) conditions in which, on every trial, a sequence of six photo-images was presented. Briefly after the offset of the sequence, two alternative images (a target and a foil) were presented and participants attempted to choose which image had occurred in the sequence. Images were of threatening and non-threatening cats and dogs. The target depicted either an animal expressing an emotion distinct from the other images, or the sequences contained only images depicting the same emotional valence. Enhanced memory was found for targets that differed in emotional valence from the other sequence images, compared to targets that expressed the same emotional valence. Further controls in stimulus selection were then introduced and the same emotional distinctiveness effect obtained. In ruling out possible visual and attentional accounts of the data, an informal dual route topic model is discussed. This places emphasis on how visual short-term memory reveals a sensitivity to the emotional content of the input as it unfolds over time. Items that present with a distinctive emotional content stand out in memory. © 2017 The Author(s).
Protein sequences clustering of herpes virus by using Tribe Markov clustering (Tribe-MCL)
NASA Astrophysics Data System (ADS)
Bustamam, A.; Siswantining, T.; Febriyani, N. L.; Novitasari, I. D.; Cahyaningrum, R. D.
2017-07-01
The herpes virus is widespread, and one of its important characteristics is its ability to cause acute and chronic infections, which can lead to severe complications. The herpes virus consists of DNA and protein, wrapped by glycoproteins. In this work, the herpes virus family is classified and analyzed by clustering its protein sequences using the Tribe Markov Clustering (Tribe-MCL) algorithm. Tribe-MCL is an efficient clustering method, based on the theory of Markov chains, that classifies protein families from protein sequences using pre-computed sequence similarity information. We implement the Tribe-MCL algorithm in R, an open-source program. We select 24 herpes virus protein sequences obtained from the NCBI database. The dataset consists of three types of glycoprotein: B, F, and H. Each type is represented by eight herpes viruses that infect humans. In our simulations using inflation factors r = 1.5, 2, and 3, we obtain varying numbers of clusters: the greater the inflation factor, the greater the number of clusters. Each protein is grouped together with proteins of the same type.
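The Markov clustering procedure underlying Tribe-MCL alternates two steps on a column-stochastic similarity matrix: expansion (matrix squaring) and inflation (raising each entry to the power r and renormalizing each column). The sketch below is a minimal pure-Python illustration of that loop. It is not the authors' R implementation, the helper names are hypothetical, and the toy similarity matrix is invented rather than derived from BLAST scores.

```python
def normalize_columns(matrix):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster node indices with a basic Markov clustering loop."""
    n = len(similarity)
    # Add self-loops, then make the matrix column-stochastic.
    m = normalize_columns(
        [[float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        expanded = matmul(m, m)                           # expansion
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])  # inflation
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Assign each node (column) to the row holding most of its flow.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())


# Toy similarity matrix: two triangles joined by one weak edge (invented).
sim = [[0, 1, 1, 0.1, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0.1, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
for r in (1.5, 2.0, 3.0):
    print(r, mcl(sim, inflation=r))
```

Inflation strengthens strong flows and weakens weak ones, so larger values of r tend to split the graph into more and smaller clusters, consistent with the trend reported above.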
Huang, Xiaoyun; Zang, Xiaonan; Wu, Fei; Jin, Yuming; Wang, Haitao; Liu, Chang; Ding, Yating; He, Bangxiang; Xiao, Dongfang; Song, Xinwei; Liu, Zhu
2017-01-01
Gracilariopsis lemaneiformis (aka Gracilaria lemaneiformis) is a red macroalga rich in phycoerythrin, which can capture light efficiently and transfer it to photosystem II. However, little is known about the synthesis of optically active phycoerythrin in G. lemaneiformis at the molecular level. With the advent of high-throughput sequencing technology, analysis of genetic information for G. lemaneiformis by transcriptome sequencing is an effective means to gain deeper insight into the molecular mechanism of phycoerythrin synthesis. Illumina technology was employed to sequence the transcriptome of two strains of G. lemaneiformis: the wild type and a green-pigmented mutant. We obtained a total of 86915 assembled unigenes as a reference gene set, and 42884 unigenes were annotated in at least one public database. Taking the above transcriptome sequencing as a reference gene set, 4041 differentially expressed genes were screened to analyze and compare the gene expression profiles of the wild type and green mutant. By GO and KEGG pathway analysis, we concluded that three factors, including a reduction in the expression level of apo-phycoerythrin, an increase of chlorophyll light-harvesting complex synthesis, and reduction of phycoerythrobilin by competitive inhibition, caused the reduction of optically active phycoerythrin in the green-pigmented mutant.
Cross-referencing yeast genetics and mammalian genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hieter, P.; Basset, D.; Boguski, M.
1994-09-01
We have initiated a project that will systematically transfer information about yeast genes onto the genetic maps of mice and human beings. Rapidly expanding human EST data will serve as a source of candidate human homologs that will be repeatedly searched using yeast protein sequence queries. Search results will be automatically reported to participating labs. Human cDNA sequences from which the ESTs are derived will be mapped at high resolution in the human and mouse genomes. The comparative mapping information cross-references the genomic position of novel human cDNAs with functional information known about the cognate yeast genes. This should facilitate the initial identification of genes responsible for mammalian mutant phenotypes, including human disease. In addition, the identification of mammalian homologs of yeast genes provides reagents for determining evolutionary conservation and for performing direct experiments in multicellular eukaryotes to enhance study of the yeast protein's function. For example, ESTs homologous to CDC27 and CDC16 were identified, and the corresponding cDNA clones were obtained from ATTC, completely sequenced, and mapped on human and mouse chromosomes. In addition, the CDC17hs cDNA has been used to raise antisera to the CDC27Hs protein and used in subcellular localization experiments and junctional studies in mammalian cells. We have received funding from the National Center for Human Genome Research to provide a community resource which will establish comprehensive cross-referencing among yeast, human, and mouse loci. The project is set up as a service and information on how to communicate with this effort will be provided.
Prediction of enhancer-promoter interactions via natural language processing.
Zeng, Wanwen; Wu, Mengmeng; Jiang, Rui
2018-05-09
Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important to deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is a challenging task to distinguish true interactions from other nearby non-interacting ones since the power of traditional experimental methods is limited due to low resolution or low throughput. We propose a novel computational framework, EP2vec, to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method from natural language processing. Then, we train a classifier to predict EPIs using the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, which outperforms existing methods. We demonstrate the robustness of sequence embedding features by carrying out sensitivity analysis. In addition, we identify motifs that represent cell line-specific information through analysis of the learned sequence embedding features by adopting an attention mechanism. Last, we show that even better performance, with F1 scores of 0.889 to 0.940, can be achieved by combining sequence embedding features and experimental features. EP2vec sheds light on feature extraction for DNA sequences of arbitrary lengths and provides a powerful approach for EPI identification.
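To make "treating DNA as text" concrete, the sketch below splits a sequence into overlapping k-mer "words" and computes an F1 score from classification counts. The abstract does not state how EP2vec tokenizes sequences, so the k-mer scheme, the parameter values, and the function names here are assumptions for illustration only; the embedding model itself is not reproduced.

```python
def kmer_tokens(sequence, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'.

    The resulting token list could be fed to any text-embedding method;
    k and stride are illustrative choices, not values from the abstract.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


tokens = kmer_tokens("acgtac", k=3)
score = f1_score(tp=8, fp=2, fn=2)
```

Because the tokens overlap, a sequence of length n yields n - k + 1 words at stride 1, so sequences of different lengths become "documents" of different lengths that an embedding method can map to fixed-length vectors.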
Kim, Kyunghee; Lee, Sang-Choon; Lee, Junki; Lee, Hyun Oh; Joh, Ho Jun; Kim, Nam-Hoon; Park, Hyun-Seung; Yang, Tae-Jin
2015-01-01
We report complete sequences of chloroplast (cp) genome and 45S nuclear ribosomal DNA (45S nrDNA) for 11 Panax ginseng cultivars. We have obtained complete sequences of cp and 45S nrDNA, the representative barcoding target sequences for cytoplasm and nuclear genome, respectively, based on low coverage NGS sequence of each cultivar. The cp genomes sizes ranged from 156,241 to 156,425 bp and the major size variation was derived from differences in copy number of tandem repeats in the ycf1 gene and in the intergenic regions of rps16-trnUUG and rpl32-trnUAG. The complete 45S nrDNA unit sequences were 11,091 bp, representing a consensus single transcriptional unit with an intergenic spacer region. Comparative analysis of these sequences as well as those previously reported for three Chinese accessions identified very rare but unique polymorphism in the cp genome within P. ginseng cultivars. There were 12 intra-species polymorphisms (six SNPs and six InDels) among 14 cultivars. We also identified five SNPs from 45S nrDNA of 11 Korean ginseng cultivars. From the 17 unique informative polymorphic sites, we developed six reliable markers for analysis of ginseng diversity and cultivar authentication. PMID:26061692
High dynamic range image acquisition based on multiplex cameras
NASA Astrophysics Data System (ADS)
Zeng, Hairui; Sun, Huayan; Zhang, Tinghua
2018-03-01
High dynamic range imaging is an important technology for photoelectric information acquisition: it provides a higher dynamic range and more image detail, and it better reflects the light and color information of the real environment. Current methods that synthesize a high dynamic range image from a sequence of differently exposed images cannot adapt to dynamic scenes; they fail to overcome the effects of moving targets, which results in ghosting. Therefore, a new high dynamic range image acquisition method based on a multiplex camera system is proposed. First, differently exposed image sequences are captured with the camera array, the deviation between images is obtained with a derivative optical flow method based on the color gradient, and the images are aligned. Then, a high dynamic range image fusion weighting function is established by combining the inverse camera response function with the deviation between images, and is applied to generate a high dynamic range image. Experiments show that the proposed method can effectively obtain high dynamic range images of dynamic scenes and achieves good results.
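The idea of a fusion weighting function can be illustrated with a much simpler scheme than the one the abstract describes. The sketch below assumes a linear camera response (so radiance is proportional to pixel value divided by exposure time), already aligned images, and a triangular "hat" weight that trusts mid-range pixel values most; it does not include the optical-flow alignment or the deviation term used by the authors, and all names are hypothetical.

```python
def hat_weight(z, z_min=0, z_max=255):
    """Triangular weight: highest for mid-range values, zero at the extremes."""
    mid = (z_min + z_max) / 2
    return (z - z_min) if z <= mid else (z_max - z)


def fuse_pixel(values, exposure_times):
    """Estimate relative radiance at one pixel from several exposures.

    values: pixel values (0..255) of the same scene point in each image.
    exposure_times: matching exposure times.
    Assumes a linear response, so each exposure estimates radiance as z / t.
    """
    num = 0.0
    den = 0.0
    for z, t in zip(values, exposure_times):
        w = hat_weight(z)
        num += w * (z / t)
        den += w
    if den == 0:
        # Every sample is fully dark or saturated: fall back to a plain mean.
        return sum(z / t for z, t in zip(values, exposure_times)) / len(values)
    return num / den


consistent = fuse_pixel([50, 100, 200], [1, 2, 4])
fallback = fuse_pixel([0, 255], [1, 2])
```

Applying `fuse_pixel` to every pixel position of an aligned exposure stack gives a radiance map; a method like the one in the abstract would additionally reduce the weight of pixels whose alignment deviation is large.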
Hepatitis E Virus of Subtype 3a in a Pig Farm, South-Eastern France.
Colson, P; Saint-Jacques, P; Ferretti, A; Davoust, B
2015-12-01
Hepatitis E virus (HEV) has emerged during the past decade as a causative agent of autochthonous hepatitis and is a clinical concern in Western developed countries. It has been increasingly recognized that pigs are a major reservoir of HEV of genotypes 3 and 4 worldwide and that pig-derived food items represent a potential source of infections by these viruses in humans. Hepatitis E virus RNA testing was performed here on faeces from rectal swabs sampled in 2012 from 50 3-month-old farm pigs from the same farm in south-eastern France as in a previous study conducted in 2007. Pig HEV sequences corresponding to genomic fragments of ORF2 and ORF1 genes were obtained after RT-PCR amplification with in-house protocols. Hepatitis E virus genotype was determined by phylogenetic analysis. Prevalence was similar to that determined 5 years earlier (68% versus 62%). Two robust phylogenetic clusters of HEV subtypes 3a and 3f were identified, and the sequences obtained in 2012 differ largely from those obtained in 2007. Notably, HEV sequences obtained in 2012 from a majority (62%) of the infected pigs belonged to subtype 3a, which had not previously been described in France, whether in humans, pigs or wild boars. Further studies are needed to assess the circulation of HEV-3a in pigs and humans in this country. In addition, along with previous findings, this study supports the need for increased information to the public on the risk of HEV infection through contacts with pigs or consumption of pig-derived products in France. © 2015 Blackwell Verlag GmbH.
DNA Metabarcoding of Amazonian Ichthyoplankton Swarms
Maggia, M. E.; Vigouroux, Y.; Renno, J. F.; Duponchelle, F.; Desmarais, E.; Nunez, J.; García-Dávila, C.; Carvajal-Vallejos, F. M.; Paradis, E.; Martin, J. F.; Mariac, C.
2017-01-01
Tropical rainforests harbor extraordinary biodiversity. The Amazon basin is thought to hold 30% of all river fish species in the world. Information about the ecology, reproduction, and recruitment of most species is still lacking, thus hampering fisheries management and successful conservation strategies. One of the key understudied issues in the study of population dynamics is recruitment. Fish larval ecology in tropical biomes is still in its infancy owing to identification difficulties. Molecular techniques are very promising tools for the identification of larvae at the species level. However, one of their limits is obtaining individual sequences with large samples of larvae. To facilitate this task, we developed a new method based on the massive parallel sequencing capability of next generation sequencing (NGS) coupled with hybridization capture. We focused on the mitochondrial marker cytochrome oxidase I (COI). The results obtained using the new method were compared with individual larval sequencing. We validated the ability of the method to identify Amazonian catfish larvae at the species level and to estimate the relative abundance of species in batches of larvae. Finally, we applied the method and provided evidence for strong temporal variation in reproductive activity of catfish species in the Ucayalí River in the Peruvian Amazon. This new time and cost effective method enables the acquisition of large datasets, paving the way for a finer understanding of reproductive dynamics and recruitment patterns of tropical fish species, with major implications for fisheries management and conservation. PMID:28095487
Elucidating and mining the Tulipa and Lilium transcriptomes.
Moreno-Pachon, Natalia M; Leeggangers, Hendrika A C F; Nijveen, Harm; Severing, Edouard; Hilhorst, Henk; Immink, Richard G H
2016-10-01
Genome sequencing remains a challenge for species with large and complex genomes containing extensive repetitive sequences, of which the bulbous and monocotyledonous plants tulip and lily are examples. In such a case, sequencing of only the active part of the genome, represented by the transcriptome, is a good alternative to obtain information about gene content. In this study we aimed to generate a high quality transcriptome of tulip and lily and to make this data available as an open-access resource via a user-friendly web-based interface. The Illumina HiSeq 2000 platform was applied and the transcribed RNA was sequenced from a collection of different lily and tulip tissues, respectively. In order to obtain good transcriptome coverage and to facilitate effective data mining, assembly was done using different filtering parameters for clearing out contamination and noise of the RNAseq datasets. This analysis revealed limitations of commonly applied methods and parameter settings used in de novo transcriptome assembly. The final created transcriptomes are publicly available via a user friendly Transcriptome browser ( http://www.bioinformatics.nl/bulbs/db/species/index ). The usefulness of this resource has been exemplified by a search for all potential transcription factors in lily and tulip, with special focus on the TCP transcription factor family. This analysis and other quality parameters point out the quality of the transcriptomes, which can serve as a basis for further genomics studies in lily, tulip, and bulbous plants in general.
The molecular epidemiological study of bovine leukemia virus infection in Myanmar cattle.
Polat, Meripet; Moe, Hla Hla; Shimogiri, Takeshi; Moe, Kyaw Kyaw; Takeshima, Shin-Nosuke; Aida, Yoko
2017-02-01
Bovine leukemia virus (BLV) is the etiological agent of enzootic bovine leukosis, which is the most common neoplastic disease of cattle. BLV infects cattle worldwide and affects both health status and productivity. However, no studies have examined the distribution of BLV in Myanmar, and the genetic characteristics of Myanmar BLV strains are unknown. Therefore, the aim of this study was to detect BLV infection in Myanmar and examine genetic variability. Blood samples were obtained from 66 cattle from different farms in four townships of the Nay Pyi Taw Union Territory of central Myanmar. BLV provirus was detected by nested PCR and real-time PCR targeting BLV long terminal repeats. Results were confirmed by nested PCR targeting the BLV env-gp51 gene and real-time PCR targeting the BLV tax gene. Out of 66 samples, six (9.1 %) were positive for BLV provirus. A phylogenetic tree, constructed using five distinct partial and complete env-gp51 sequences from BLV strains isolated from three different townships, indicated that Myanmar strains were genotype-10. A phylogenetic tree constructed from whole genome sequences obtained by sequencing cloned, overlapping PCR products from two Myanmar strains confirmed the existence of genotype-10 in Myanmar. Comparative analysis of complete genome sequences identified genotype-10-specific amino acid substitutions in both structural and non-structural genes, thereby distinguishing genotype-10 strains from other known genotypes. This study provides information regarding BLV infection levels in Myanmar and confirms that genotype-10 is circulating in Myanmar.
Gifford, Robert J.; Rhee, Soo-Yon; Eriksson, Nicolas; Liu, Tommy F.; Kiuchi, Mark; Das, Amar K.; Shafer, Robert W.
2008-01-01
Design: Promiscuous guanine (G) to adenine (A) substitutions catalysed by apolipoprotein B RNA-editing catalytic component (APOBEC) enzymes are observed in a proportion of HIV-1 sequences in vivo and can introduce artifacts into some genetic analyses. The potential impact of undetected lethal editing on genotypic estimation of transmitted drug resistance was assessed. Methods: Classifiers of lethal, APOBEC-mediated editing were developed by analysis of lentiviral pol gene sequence variation and evaluated using control sets of HIV-1 sequences. The potential impact of sequence editing on genotypic estimation of drug resistance was assessed in sets of sequences obtained from 77 studies of 25 or more therapy-naive individuals, using mixture modelling approaches to determine the maximum likelihood classification of sequences as lethally edited as opposed to viable. Results: Analysis of 6437 protease and reverse transcriptase sequences from therapy-naive individuals using a novel classifier of lethal, APOBEC3G-mediated sequence editing, the ‘polypeptide-like 3G (APOBEC3G)-mediated defectives (A3GD) index’, detected lethal editing in association with spurious ‘transmitted drug resistance’ in nearly 3% of proviral sequences obtained from whole blood and 0.2% of samples obtained from plasma. Conclusion: Screening for lethally edited sequences in datasets containing a proportion of proviral DNA, such as those likely to be obtained for epidemiological surveillance of transmitted drug resistance in the developing world, can eliminate rare but potentially significant errors in genotypic estimation of transmitted drug resistance. PMID:18356601
Sergeant, Martin J.; Constantinidou, Chrystala; Cogan, Tristan; Penn, Charles W.; Pallen, Mark J.
2012-01-01
The analysis of 16S-rDNA sequences to assess the bacterial community composition of a sample is a widely used technique that has increased with the advent of high throughput sequencing. Although considerable effort has been devoted to identifying the most informative region of the 16S gene and the optimal informatics procedures to process the data, little attention has been paid to the PCR step, in particular annealing temperature and primer length. To address this, amplicons derived from 16S-rDNA were generated from chicken caecal content DNA using different annealing temperatures, primers and different DNA extraction procedures. The amplicons were pyrosequenced to determine the optimal protocols for capture of maximum bacterial diversity from a chicken caecal sample. Even at very low annealing temperatures there was little effect on the community structure, although the abundance of some OTUs such as Bifidobacterium increased. Using shorter primers did not reveal any novel OTUs but did change the community profile obtained. Mechanical disruption of the sample by bead beating had a significant effect on the results obtained, as did repeated freezing and thawing. In conclusion, existing primers and standard annealing temperatures captured as much diversity as lower annealing temperatures and shorter primers. PMID:22666455
NASA Astrophysics Data System (ADS)
Baraúna, R. A.; Graças, D. A.; Ramos, R. T.; Carneiro, A. R.; Lopes, T. S.; Lima, A. R.; Zahlouth, R. L.; Pellizari, V. H.; Silva, A.
2013-05-01
Methanosarcina mazei is a strictly anaerobic methanogen from the Methanosarcinales order. This species is known for its broad catabolic range among methanogens and is widespread throughout diverse environments. The draft genome of a strain cultivated from the sediment of the Tucuruí hydroelectric power station, the fourth largest hydroelectric dam in the world, is described here. Approximately 80% of methane is produced by biogenic sources, such as methanogenic archaea from M. mazei species. Although the methanogenesis pathway is well known, some aspects of the core genome, genome evolution and shared genes are still unclear. A sediment sample from the Tucuruí hydropower station reservoir was inoculated in mineral media supplemented with acetate and methanol. This media was maintained in an H2:CO2 (80:20) atmosphere to enrich and cultivate M. mazei. The enrichment was conducted at 30°C under standard anaerobic conditions. After several molecular and cellular analyses, total DNA was extracted from a non-pure culture of M. mazei, amplified using phi29 DNA polymerase (BioLabs) and finally used as a source template for genome sequencing. The draft genome was obtained after two rounds of sequencing. First, the genome was sequenced using a SOLiD System V3 with a mate-paired library, which yielded 24,405,103 and 24,399,268 reads (50 bp) for the R3 and F3 tags, respectively. The second round of sequencing was performed using the SOLiD 5500 XL platform with a mate-paired library, resulting in a total of 113,588,848 reads (60 bp) for each tag (F3 and R3). All reads obtained by this procedure were filtered using Quality Assessment software, whereby reads with an average quality score below Phred 20 were removed. Velvet and Edena were used to assemble the reads, and Simplifier was used to remove the redundant sequences. After this, a total of 16,811 contigs were obtained. M. mazei GO1 (AE008384) genome was used to map the contigs and generate the scaffolds. 
We used the Graphical Contig Analyzer for All Sequencing Platforms software (G4ALL; http://g4all.sourceforge.net/) to manually curate and generate the genome scaffold with gaps. The resultant gaps were manually closed using CLC Genomics Workbench software. M. mazei TUC01 genome contained 3,420,400 bp with a GC content of 42.47% distributed over 3 scaffolds that were annotated by RAST. A total of 2,959 coding DNA sequences (CDS) were predicted. The genome of M. mazei TUC01 (accession number: CP003077) will provide valuable information about the ecology of Methanosarcinales order and more accurate information about the methanogenesis pathway observed in the Neotropics. SPONSOR: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); Agência Nacional de Energia Elétrica (ANEEL); Centrais Elétricas do Norte do Brasil (Eletronorte).
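The read-filtering rule mentioned above (dropping reads whose average quality is below Phred 20) can be sketched as follows. This is not the Quality Assessment software used by the authors; the data layout (a read name mapped to a list of integer Phred scores) and the function names are assumptions chosen to keep the example independent of any file format.

```python
def phred_to_error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality(scores):
    """Arithmetic mean of the per-base Phred scores of one read."""
    return sum(scores) / len(scores)


def filter_reads(reads, threshold=20):
    """Keep reads whose mean Phred quality is at least `threshold`.

    reads: dict mapping a read name to its list of per-base Phred scores
    (a hypothetical in-memory representation).
    """
    return {name: scores for name, scores in reads.items()
            if scores and mean_quality(scores) >= threshold}


example_reads = {
    "read_1": [30, 30, 10],   # mean 23.3 -> kept
    "read_2": [10, 15, 20],   # mean 15.0 -> removed
    "read_3": [20, 20, 20],   # mean 20.0 -> kept (not below the threshold)
}
kept = filter_reads(example_reads)
```

A Phred score of 20 corresponds to a 1% chance that a base call is wrong, which is why it is a common cutoff; the threshold is a parameter here so that stricter or looser filtering can be tried.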
Srinivasan, A R; Yathindra, N
1977-01-01
A novel description of the conformational characteristics of all the individual nucleotides and the phosphodiesters in tRNAs is presented in the form of a circular plot. This representation furnishes information of the base sequence with the folding patterns of the polynucleotide chain as one traverses along the circumference and with the individual nucleotide and phosphodiester linkage torsions along the radii. The circular plot obtained for yeast tRNAPhe strikingly distinguishes the helical and the loop regions. The variation of the different nucleotide torsions along the entire chain length and their effect on the secondary helical and tertiary loop regions become readily apparent. PMID:339206
Scoping Study Investigating PWR Instrumentation during a Severe Accident Scenario
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rempe, J. L.; Knudson, D. L.; Lutz, R. J.
The accidents at the Three Mile Island Unit 2 (TMI-2) and Fukushima Daiichi Units 1, 2, and 3 nuclear power plants demonstrate the critical importance of accurate, relevant, and timely information on the status of reactor systems during a severe accident. These events also highlight the critical importance of understanding and focusing on the key elements of system status information in an environment where operators may be overwhelmed with superfluous and sometimes conflicting data. While progress in these areas has been made since TMI-2, the events at Fukushima suggest that there may still be a potential need to ensure that critical plant information is available to plant operators. Recognizing the significant technical and economic challenges associated with plant modifications, it is important to focus on instrumentation that can address these critical information needs. As part of a program initiated by the Department of Energy, Office of Nuclear Energy (DOE-NE), a scoping effort was initiated to assess critical information needs identified for severe accident management and mitigation in commercial Light Water Reactors (LWRs), to quantify the environment that instruments monitoring this data would have to survive, and to identify gaps where predicted environments exceed instrumentation qualification envelope (QE) limits. Results from the Pressurized Water Reactor (PWR) scoping evaluations are documented in this report. The PWR evaluations were limited in this scoping evaluation to quantifying the environmental conditions for an unmitigated Short-Term Station BlackOut (STSBO) sequence in one unit at the Surry nuclear power station. Results were obtained using the MELCOR models developed for the US Nuclear Regulatory Commission (NRC)-sponsored State of the Art Consequence Assessment (SOARCA) program project. 
Results from this scoping evaluation indicate that some instrumentation identified to provide critical information would be exposed to conditions that significantly exceeded QE limits for extended time periods for the low frequency STSBO sequence evaluated in this study. It is recognized that the core damage frequency (CDF) of the sequence evaluated in this scoping effort would be considerably lower if evaluations considered new FLEX equipment being installed by industry. Nevertheless, because of uncertainties in instrumentation response when exposed to conditions beyond QE limits and alternate challenges associated with different sequences that may impact sensor performance, it is recommended that additional evaluations of instrumentation performance be completed to provide confidence that operators have access to accurate, relevant, and timely information on the status of reactor systems for a broad range of challenges associated with risk important severe accident sequences.
Alonso, Ana; Larraga, Vicente; Alcolea, Pedro J
2018-05-07
The first genome project of any living organism, excluding viruses, was that of the gammaproteobacterium Haemophilus influenzae, completed in 1995. Until the last decade, genome sequencing was very tedious because genome survey sequences (GSS) and/or expressed sequence tags (ESTs) belonging to plasmid, cosmid and artificial chromosome genome libraries had to be sequenced and assembled in silico. Even today, no genome is truly completely assembled, because gaps and unassembled contigs always remain. However, most assemblies represent the whole genome of the organism of origin from a practical point of view. The first genome sequencing projects of trypanosomatid parasites were completed in 2005 following those strategies, and covered Leishmania major, Trypanosoma cruzi and T. brucei. The functional genomics era rapidly developed on the basis of microarray technology and has been evolving ever since. In the case of the genus Leishmania, substantial biological information about differentiation in the digenetic life cycle of the parasite has been obtained. Later on, next generation sequencing revolutionized genome sequencing and functional genomics, leading to more sensitive, accurate results while using far fewer resources. This new technology is more advantageous, but does not invalidate microarray results. In fact, promising vaccine candidates and drug targets have been found on the basis of microarray-based screening and preliminary proof-of-concept tests. Copyright © 2018. Published by Elsevier B.V.
Postel, Alexander; Schmeiser, Stefanie; Zimmermann, Bernd; Becher, Paul
2016-01-01
Molecular epidemiology has become an indispensable tool in the diagnosis of diseases and in tracing the infection routes of pathogens. Due to advances in conventional sequencing and the development of high throughput technologies, the field of sequence determination is in the process of being revolutionized. Platforms for sharing sequence information and providing standardized tools for phylogenetic analyses are becoming increasingly important. The database (DB) of the European Union (EU) and World Organisation for Animal Health (OIE) Reference Laboratory for classical swine fever offers one of the world’s largest semi-public virus-specific sequence collections combined with a module for phylogenetic analysis. The classical swine fever (CSF) DB (CSF-DB) became a valuable tool for supporting diagnosis and epidemiological investigations of this highly contagious disease in pigs with high socio-economic impacts worldwide. The DB has been re-designed and now allows for the storage and analysis of traditionally used, well established genomic regions and of larger genomic regions including complete viral genomes. We present an application example for the analysis of highly similar viral sequences obtained in an endemic disease situation and introduce the new geographic “CSF Maps” tool. The concept of this standardized and easy-to-use DB with an integrated genetic typing module is suited to serve as a blueprint for similar platforms for other human or animal viruses. PMID:27827988
MRI and MRA of spinal cord arteriovenous shunts.
Condette-Auliac, Stéphanie; Boulin, Anne; Roccatagliata, Luca; Coskun, Oguzhan; Guieu, Stéphanie; Guedin, Pierre; Rodesch, Georges
2014-12-01
The purpose of this review is to describe the diagnostic criteria for spinal cord arteriovenous shunts (SCAVSs) when using magnetic resonance imaging (MRI) and magnetic resonance angiography (MRA), and to discuss the extent to which the different MRI and MRA sequences and technical parameters provide the information that is required to diagnose these lesions properly. SCAVSs are divided into four groups according to location (paraspinal, epidural, dural, or intradural) and type (fistula or nidus); each type of lesion is described. SCAVSs are responsible for neurological symptoms due to spinal cord or nerve root involvement. MRI is usually the first examination performed when a spinal cord lesion is suspected. Recognition of the image characteristics of vascular lesions is mandatory if useful sequences are to be performed-especially MRA sequences. Because the treatment of SCAVSs relies mainly on endovascular therapies, MRI and MRA help with the planning of the angiographic procedure. We explain the choice of MRA sequences and parameters, the advantages and pitfalls to be aware of in order to obtain the best visualization, and the analysis of each lesion. © 2014 Wiley Periodicals, Inc.
Prediction of β-turns in proteins from multiple alignment using neural network
Kaur, Harpreet; Raghava, Gajendra Pal Singh
2003-01-01
A neural network-based method has been developed for the prediction of β-turns in proteins by using multiple sequence alignment. Two feed-forward back-propagation networks with a single hidden layer are used where the first-sequence structure network is trained with the multiple sequence alignment in the form of PSI-BLAST–generated position-specific scoring matrices. The initial predictions from the first network and PSIPRED-predicted secondary structure are used as input to the second structure-structure network to refine the predictions obtained from the first net. A significant improvement in prediction accuracy has been achieved by using evolutionary information contained in the multiple sequence alignment. The final network yields an overall prediction accuracy of 75.5% when tested by sevenfold cross-validation on a set of 426 nonhomologous protein chains. The corresponding Qpred, Qobs, and Matthews correlation coefficient values are 49.8%, 72.3%, and 0.43, respectively, and are the best among all the previously published β-turn prediction methods. The Web server BetaTPred2 (http://www.imtech.res.in/raghava/betatpred2/) has been developed based on this approach. PMID:12592033
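The four reported measures can be computed from a confusion matrix as sketched below. The definitions used here (overall accuracy, Qpred as the share of predicted turns that are correct, Qobs as the share of observed turns that are found, and the Matthews correlation coefficient) follow common usage in the β-turn prediction literature and are an assumption, since the abstract does not spell them out; the counts in the example are made up.

```python
import math


def turn_prediction_metrics(tp, tn, fp, fn):
    """Return (Qtotal %, Qpred %, Qobs %, MCC) from residue-level counts.

    tp: turn residues predicted as turn      fp: non-turn predicted as turn
    tn: non-turn predicted as non-turn       fn: turn predicted as non-turn
    """
    total = tp + tn + fp + fn
    q_total = 100.0 * (tp + tn) / total
    q_pred = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    q_obs = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q_total, q_pred, q_obs, mcc


# Made-up counts, for illustration only.
q_total, q_pred, q_obs, mcc = turn_prediction_metrics(tp=40, tn=40, fp=10, fn=10)
```

Because β-turn residues are a minority class, overall accuracy alone can look good for a predictor that rarely predicts turns, which is why the correlation coefficient is reported alongside it.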
Metagenomic characterization of viral communities in Goseong Bay, Korea
NASA Astrophysics Data System (ADS)
Hwang, Jinik; Park, So Yun; Park, Mirye; Lee, Sukchan; Jo, Yeonhwa; Cho, Won Kyong; Lee, Taek-Kyun
2016-12-01
In this study, seawater samples were collected from Goseong Bay, Korea in March 2014 and viral populations were examined by metagenomics assembly. Enrichment of marine viral particles using FeCl3 followed by next-generation sequencing produced numerous sequences. De novo assembly and BLAST search showed that most of the obtained contigs were unknown sequences and only 0.74% of sequences were associated with known viruses. As a result, 138 viruses, including bacteriophages (87%), viruses infecting algae and others (13%) were identified. The identified 138 viruses were divided into 11 orders, 14 families, 34 genera, and 133 species. The dominant viruses were Pelagibacter phage HTVC010P and Roseobacter phage SIO1. The viruses infecting algae, including the Ostreococcus species, accounted for 9.4% of total identified viruses. In addition, we identified pathogenic herpes viruses infecting fishes and giant viruses infecting parasitic acanthamoeba species. This is a comprehensive study to reveal the viral populations in the Goseong Bay using metagenomics. The information associated with the marine viral community in Goseong Bay, Korea will be useful for comparative analysis in other marine viral communities.
Robust analysis of semiparametric renewal process models
Lin, Feng-Chang; Truong, Young K.; Fine, Jason P.
2013-01-01
Summary A rate model is proposed for a modulated renewal process comprising a single long sequence, where the covariate process may not capture the dependencies in the sequence as in standard intensity models. We consider partial likelihood-based inferences under a semiparametric multiplicative rate model, which has been widely studied in the context of independent and identical data. Under an intensity model, gap times in a single long sequence may be used naively in the partial likelihood with variance estimation utilizing the observed information matrix. Under a rate model, the gap times cannot be treated as independent and studying the partial likelihood is much more challenging. We employ a mixing condition in the application of limit theory for stationary sequences to obtain consistency and asymptotic normality. The estimator's variance is quite complicated owing to the unknown gap times dependence structure. We adapt block bootstrapping and cluster variance estimators to the partial likelihood. Simulation studies and an analysis of a semiparametric extension of a popular model for neural spike train data demonstrate the practical utility of the rate approach in comparison with the intensity approach. PMID:24550568
Using hidden Markov models to align multiple sequences.
Mount, David W
2009-07-01
A hidden Markov model (HMM) is a probabilistic model of a multiple sequence alignment (msa) of proteins. In the model, each column of symbols in the alignment is represented by a frequency distribution of the symbols (called a "state"), and insertions and deletions are represented by other states. One moves through the model along a particular path from state to state in a Markov chain (i.e., random choice of next move), trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that state from a previous one (the transition probability). State and transition probabilities are multiplied to obtain a probability of the given sequence. The hidden nature of the HMM is due to the lack of information about the value of a specific state, which is instead represented by a probability distribution over all possible values. This article discusses the advantages and disadvantages of HMMs in msa and presents algorithms for calculating an HMM and the conditions for producing the best HMM.
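The multiplication of state and transition probabilities described above can be sketched as follows. This is a simplified illustration that uses match states only (no insert or delete states); the state names, the probability tables and the function name are hypothetical and are not taken from the article.

```python
def path_probability(sequence, path, emissions, transitions, start="begin"):
    """Probability of emitting `sequence` along one fixed state path.

    emissions[state][symbol]  -> probability that `state` emits `symbol`
    transitions[prev][state]  -> probability of moving from `prev` to `state`
    Missing entries are treated as probability 0.
    """
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    prob = 1.0
    prev = start
    for symbol, state in zip(sequence, path):
        step = transitions.get(prev, {}).get(state, 0.0)
        emit = emissions.get(state, {}).get(symbol, 0.0)
        prob *= step * emit
        prev = state
    return prob

# Toy two-column model; all numbers are invented for illustration.
emissions = {"M1": {"A": 0.8, "G": 0.2}, "M2": {"C": 0.5, "T": 0.5}}
transitions = {"begin": {"M1": 1.0}, "M1": {"M2": 0.9}}
prob = path_probability("AC", ["M1", "M2"], emissions, transitions)
```

In practice the products become very small for long sequences, so implementations usually sum log-probabilities instead of multiplying raw values.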
Molecular detection and characterization of noroviruses in river water in Thailand.
Inoue, K; Motomura, K; Boonchan, M; Takeda, N; Ruchusatsawa, K; Guntapong, R; Tacharoenmuang, R; Sangkitporn, S; Chantaroj, S
2016-03-01
Norovirus (NoV) generally exists as a mixture of multiple genotype variants in nature. However, there has been no published report monitoring NoV in natural settings in Thailand. To obtain information on mixed presence of the NoV RNA genome, we conducted viral genome analysis of 15 water specimens collected from five sites in a river near Bangkok between August 2013 and August 2014. The number of viral RNA copies per specimen declined progressively from the most upstream to the most downstream site. Following direct nucleotide sequencing of the PCR products, we obtained three partial genome sequences of the NoV GI strain and 13 partial genome sequences of the NoV GII strains. Phylogenetic analysis indicated the presence of four GII.4 variant groups pro-circulated after the Den Haag_2006b, New Orleans_2009 and Sydney_2012 outbreaks. On the other hand, only GI.4 was observed in the specimens collected in April 2014. These results indicated that multiple genogroups and genotypes of noroviruses are present and are circulating in the natural environment in Thailand as in other countries. Our study provides comprehensive information on the occurrence of new variants. Our study is the first report that multiple genogroups and genotypes of norovirus exist and are circulating in the river water near Bangkok, Thailand. Phylogenetic analysis indicated the presence of four GII.4 variant groups pro-circulated after the Den Haag_2006b, New Orleans_2009 and Sydney_2012 variants that caused outbreaks worldwide. Continued research will be essential for understanding the natural history of NoV and the control of future outbreaks. © 2015 The Society for Applied Microbiology.
Wang, Yanjie; Dong, Chunlan; Xue, Zeyun; Jin, Qijiang; Xu, Yingchun
2016-01-15
Paeonia ostii, an important ornamental and medicinal plant, grows normally on copper (Cu) mines with widespread Cu contamination of soils, and it has the ability to lower Cu contents in the Cu-contaminated soils. However, very little molecular information concerned with Cu resistance of P. ostii is available. In this study, high-throughput de novo transcriptome sequencing was carried out for P. ostii with and without Cu treatment using Illumina HiSeq 2000 platform. A total of 77,704 All-unigenes were obtained with a mean length of 710 bp. Of these unigenes, 47,461 were annotated with public databases based on sequence similarities. Comparative transcript profiling allowed the discovery of 4324 differentially expressed genes (DEGs), with 2207 up-regulated and 2117 down-regulated unigenes in Cu-treated library as compared to the control counterpart. Based on these DEGs, Gene Ontology (GO) enrichment analysis indicated Cu stress-relevant terms, such as 'membrane' and 'antioxidant activity'. Meanwhile, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis uncovered some important pathways, including 'biosynthesis of secondary metabolites' and 'metabolic pathways'. In addition, expression patterns of 12 selected DEGs derived from quantitative real-time polymerase chain reaction (qRT-PCR) were consistent with their transcript abundance changes obtained by transcriptomic analyses, suggesting that all the 12 genes were authentically involved in Cu tolerance in P. ostii. This is the first report to identify genes related to Cu stress responses in P. ostii, which could offer valuable information on the molecular mechanisms of Cu resistance, and provide a basis for further genomics research on this and related ornamental species for phytoremediation. Copyright © 2015 Elsevier B.V. All rights reserved.
Csizmár, Nikolett; Mihók, Sándor; Jávor, András; Kusza, Szilvia
2018-01-01
The Hungarian draft is a horse breed with a recent mixed ancestry created in the 1920s by crossing local mares with draught horses imported from France and Belgium. The interest in its conservation and characterization has increased over the last few years. The aim of this work is to contribute to the characterization of the endangered Hungarian heavy draft horse populations in order to obtain useful information to implement conservation strategies for these genetic stocks. To genetically characterize the breed and to set up the basis for a conservation program, in the present study a hypervariable region of the mitochondrial DNA (D-loop) was used to assess genetic diversity in Hungarian draft horses. Two hundred and eighty-five sequences obtained in our laboratory and 419 downloaded sequences available from GenBank were analyzed. One hundred and sixty-four haplotypes and thirty-six polymorphic sites were observed. High haplotype and nucleotide diversity values (Hd = 0.954 ± 0.004; π = 0.028 ± 0.0004) were identified in the Hungarian population, although they were higher within than among the different populations (Hd = 0.972 ± 0.002; π = 0.03097 ± 0.002). Fourteen of the previously observed seventeen haplogroups were detected. Our samples showed a large intra- and interbreed variation. There was no clear clustering on the median joining network figure. The overall information collected in this work led us to consider that the genetic scenario observed for the Hungarian draft breed is more likely the result of contributions from 'ancestrally' different genetic backgrounds. This study could contribute to the development of a breeding plan for Hungarian draft horses and help to formulate a genetic conservation plan, avoiding inbreeding while.
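The two diversity statistics reported above (Hd and π) can be illustrated with a short sketch. It assumes the usual textbook definitions, namely haplotype diversity Hd = n/(n-1) · (1 - Σ p_i²) over haplotype frequencies p_i, and nucleotide diversity π as the mean proportion of differing sites over all sequence pairs. The abstract does not state which estimators or software were used, the input is assumed to be aligned, equal-length and gap-free, and the toy sequences are invented.

```python
from collections import Counter
from itertools import combinations

def haplotype_diversity(seqs):
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)

# Toy alignment, for illustration only (not D-loop data from the study).
toy = ["ACGT", "ACGT", "ACGA", "TCGA"]
hd = haplotype_diversity(toy)
pi = nucleotide_diversity(toy)
```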
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
Dölz, R; Mossé, M O; Slonimski, P P; Bairoch, A; Linder, P
1996-01-01
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. As in previous editions, the genetic names are consistently associated with each sequence with a known and confirmed ORF. If necessary, synonyms are given in the case of allelic duplicated sequences. Although the first publication of a sequence gives, according to our rules, the genetic name of a gene, in some instances more commonly used names are given to avoid nomenclature problems and the use of ancient designations which are no longer used. In these cases the old designation is given as a synonym. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference of the publication of the sequence, the chromosomal location as far as known, and the SWISSPROT and EMBL accession numbers. New entries will also contain the name from the systematic sequencing efforts. Since the release of LISTA4.1 we have updated the database continuously. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections, resulting in LISTA-HON and LISTA-HOP. This release includes reports from full Smith and Waterman peptide-level searches against a non-redundant protein sequence database. The LISTA database can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS). The database is available by FTP and on the World Wide Web. PMID:8594599
Bowers, Robert M.; Kyrpides, Nikos C.; Stepanauskas, Ramunas; ...
2017-08-08
Here, we present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a Metagenome-Assembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Gene Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.
Lan, DaoLiang; Xiong, XianRong; Wei, YanLi; Xu, Tong; Zhong, JinCheng; Zhi, XiangDong; Wang, Yong; Li, Jian
2014-09-01
RNA-Seq, a high-throughput (HT) sequencing technique, has been used effectively in large-scale transcriptomic studies, and is particularly useful for improving gene structure information and mining of new genes. In this study, RNA-Seq HT technology was employed to analyze the transcriptome of yak ovary. After Illumina-Solexa deep sequencing, 26826516 clean reads with a total of 4828772880 bp were obtained from the ovary library. Alignment analysis showed that 16992 yak genes mapped to the yak genome and 3734 of these genes were involved in alternative splicing. Gene structure refinement analysis showed that 7340 genes that were annotated in the yak genome could be extended at the 5' or 3' ends based on the alignments between the transcripts and the genome sequence. Novel transcript prediction analysis identified 6321 new transcripts with lengths ranging from 180 to 14884 bp, and 2267 of them were predicted to code proteins. BLAST analysis of the new transcripts showed that 1200–4933 mapped to the non-redundant (nr), nucleotide (nt) and/or SwissProt sequence databases. Comparative statistical analysis of the new mapped transcripts showed that the majority of them were similar to genes in Bos taurus (41.4%), Bos grunniens mutus (33.0%), Ovis aries (6.3%), Homo sapiens (2.8%), Mus musculus (1.6%) and other species. Functional analysis showed that these expressed genes were involved in various Gene Ontology (GO) categories and Kyoto Encyclopedia of Genes and Genomes pathways. GO analysis of the new transcripts found that the largest proportion of them was associated with reproduction. The results of this study will provide a basis for describing the normal transcriptome map of yak ovary and for future studies on yak breeding performance. Moreover, the results confirmed that RNA-Seq HT technology is highly advantageous in improving gene structure information and mining of new genes, as well as in providing valuable data to expand the yak genome information.
Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout
Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo
2015-01-01
Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler, yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877
smRNAome profiling to identify conserved and novel microRNAs in Stevia rebaudiana Bertoni
2012-01-01
Background: MicroRNAs (miRNAs) constitute a family of small RNAs (sRNAs) that regulate gene expression and play an important role in plant development, metabolism, signal transduction and stress response. Extensive studies on miRNAs have been performed in different plants such as Arabidopsis thaliana and Oryza sativa, and the volume of the miRNA database, miRBase, has been increasing on a day-to-day basis. Stevia rebaudiana Bertoni is an important perennial herb which accumulates high concentrations of diterpene steviol glycosides, which contribute to its high indexed sweetening property with no calorific value. Several studies have been carried out to understand the molecular mechanism involved in the biosynthesis of these glycosides; however, information about miRNAs has been lacking in S. rebaudiana. Deep sequencing of small RNAs combined with transcriptomic data is a powerful tool for identifying conserved and novel miRNAs irrespective of the availability of genome sequence data. Results: To identify miRNAs in S. rebaudiana, an sRNA library was constructed and sequenced using the Illumina Genome Analyzer II. A total of 30,472,534 reads representing 2,509,190 distinct sequences were obtained from the sRNA library. Based on sequence similarity, we identified 100 miRNAs belonging to 34 highly conserved families. We also identified 12 novel miRNAs whose precursors were potentially generated from stevia EST and nucleotide sequences. None of the novel sequences has been described previously in other plant species. Putative target genes were predicted for most conserved and novel miRNAs. The predicted targets are mainly mRNAs encoding enzymes regulating essential plant metabolic and signaling pathways. Conclusions: This study led to the identification of 34 highly conserved miRNA families and 12 novel potential miRNAs, indicating that specific miRNAs exist in stevia species. Our results provide information on stevia miRNAs and their targets, building a foundation for future studies to understand their roles in key stevia traits. PMID:23116282
RNA Sequencing Analysis of the Gametophyte Transcriptome from the Liverwort, Marchantia polymorpha
Sharma, Niharika; Jung, Chol-Hee; Bhalla, Prem L.; Singh, Mohan B.
2014-01-01
The liverwort Marchantia polymorpha is a member of the most basal lineage of land plants (embryophytes) and likely retains many ancestral morphological, physiological and molecular characteristics. Despite its phylogenetic importance and the availability of previous EST studies, M. polymorpha’s lack of economic importance limits accessible genomic resources for this species. We employed Illumina RNA-Seq technology to sequence the gametophyte transcriptome of M. polymorpha. cDNA libraries from 6 different male and female developmental tissues were sequenced to delineate a global view of the M. polymorpha transcriptome. Approximately 80 million short reads were obtained and assembled into a non-redundant set of 46,533 transcripts (≥ 200 bp) from 46,070 loci. The average length and the N50 length of the transcripts were 757 bp and 471 bp, respectively. Sequence comparison of assembled transcripts with non-redundant proteins from embryophytes resulted in the annotation of 43% of the transcripts. The transcripts were also compared with M. polymorpha expressed sequence tags (ESTs), and approximately 69.5% of the transcripts appeared to be novel. Twenty-one percent of the transcripts were assigned GO terms to improve annotation. In addition, 6,112 simple sequence repeats (SSRs) were identified as potential molecular markers, which may be useful in studies of genetic diversity. A comparative genomics approach revealed that a substantial proportion of the genes (35.5%) expressed in M. polymorpha were conserved across phylogenetically related species, such as Selaginella and Physcomitrella, and identified 580 genes that are potentially unique to liverworts. Our study presents an extensive amount of novel sequence information for M. polymorpha. This information will serve as a valuable genomics resource for further molecular, developmental and comparative evolutionary studies, as well as for the isolation and characterization of functional genes that are involved in sex differentiation and sexual reproduction in this liverwort. PMID:24841988
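For readers unfamiliar with the N50 statistic quoted in this abstract, a minimal sketch follows. It assumes the common definition (the length of the transcript at which the running total, taken from longest to shortest, first reaches half of the summed length); the assembler used in the study may apply a slightly different convention, and the lengths below are invented.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")

def mean_length(lengths):
    """Arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)

# Invented transcript lengths, for illustration only.
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
toy_mean = mean_length(toy_lengths)
```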
smRNAome profiling to identify conserved and novel microRNAs in Stevia rebaudiana Bertoni.
Mandhan, Vibha; Kaur, Jagdeep; Singh, Kashmir
2012-11-01
MicroRNAs (miRNAs) constitute a family of small RNAs (sRNAs) that regulate gene expression and play an important role in plant development, metabolism, signal transduction and stress response. Extensive studies on miRNAs have been performed in different plants such as Arabidopsis thaliana and Oryza sativa, and the volume of the miRNA database, miRBase, has been increasing on a day-to-day basis. Stevia rebaudiana Bertoni is an important perennial herb which accumulates high concentrations of diterpene steviol glycosides, which contribute to its high indexed sweetening property with no calorific value. Several studies have been carried out to understand the molecular mechanism involved in the biosynthesis of these glycosides; however, information about miRNAs has been lacking in S. rebaudiana. Deep sequencing of small RNAs combined with transcriptomic data is a powerful tool for identifying conserved and novel miRNAs irrespective of the availability of genome sequence data. To identify miRNAs in S. rebaudiana, an sRNA library was constructed and sequenced using the Illumina Genome Analyzer II. A total of 30,472,534 reads representing 2,509,190 distinct sequences were obtained from the sRNA library. Based on sequence similarity, we identified 100 miRNAs belonging to 34 highly conserved families. We also identified 12 novel miRNAs whose precursors were potentially generated from stevia EST and nucleotide sequences. None of the novel sequences has been described previously in other plant species. Putative target genes were predicted for most conserved and novel miRNAs. The predicted targets are mainly mRNAs encoding enzymes regulating essential plant metabolic and signaling pathways. This study led to the identification of 34 highly conserved miRNA families and 12 novel potential miRNAs, indicating that specific miRNAs exist in stevia species. Our results provide information on stevia miRNAs and their targets, building a foundation for future studies to understand their roles in key stevia traits.
Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido
2018-05-23
Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that takes into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what is needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.
2013-01-01
Background: Molecules involved in pheromone biosynthesis may represent alternative targets for insect population control. This may be particularly useful in managing the reproduction of Lutzomyia longipalpis, the main vector of the protozoan parasite Leishmania infantum in Latin America. Besides the chemical identity of the major components of the L. longipalpis sex pheromone, there is no information regarding the molecular biology behind its production. To understand this process, obtaining information on which genes are expressed in the pheromone gland is essential. Methods: In this study we used a transcriptomic approach to explore the pheromone gland and adjacent abdominal tergites in order to obtain substantial general sequence information. We used a laboratory-reared L. longipalpis (one spot, 9-Methyl GermacreneB) population, captured in Lapinha Cave, state of Minas Gerais, Brazil, for this analysis. Results: From a total of 3,547 cDNA clones, 2,502 high quality sequences from the pheromone gland and adjacent tissues were obtained and assembled into 1,387 contigs. Through BLAST searches of public databases, a group of transcripts encoding proteins potentially involved in the production of terpenoid precursors were identified in the 4th abdominal tergite, the segment containing the pheromone gland. Among them, protein-coding transcripts for four enzymes of the mevalonate pathway, namely 3-hydroxyl-3-methyl glutaryl CoA reductase, phosphomevalonate kinase, diphosphomevalonate decarboxylase, and isopentenyl pyrophosphate isomerase, were identified. Moreover, transcripts coding for farnesyl diphosphate synthase and NADP+ dependent farnesol dehydrogenase were also found in the same tergite. Additionally, genes potentially involved in pheromone transportation were identified from the three abdominal tergites analyzed. Conclusion: This study constitutes the first transcriptomic analysis exploring the repertoire of genes expressed in the tissue containing the L. longipalpis pheromone gland as well as the flanking tissues. Using a comparative approach, a set of molecules potentially present in the mevalonate pathway emerge as interesting subjects for further study regarding their association with pheromone biosynthesis. The sequences presented here may be used as a reference set for future research on pheromone production or other characteristics of pheromone communication in this insect. Moreover, some matches for transcripts of unknown function may provide fertile ground for an in-depth study of pheromone-gland specific molecules. PMID:23497448
Helm, Benjamin M; Langley, Katherine; Spangler, Brooke B; Schrier Vergano, Samantha A
2015-01-01
Whole-exome sequencing (WES) has increased our ability to analyze large parts of the human genome, bringing with it a plethora of ethical, legal, and social implications. A topic dominating discussion of WES is identification of "secondary findings" (SFs), defined as the identification of risk in an asymptomatic individual unrelated to the indication for the test. SFs can have considerable psychosocial impact on patients and families, and patients with an SF may have concerns regarding genomic privacy and genetic discrimination. The Genetic Information Nondiscrimination Act of 2008 (GINA) currently excludes protections for members of the military. This may cause concern in military members and families regarding genetic discrimination when considering genetic testing. In this report, we discuss a case involving a patient and family in which a secondary finding was discovered by WES. The family members have careers in the U.S. military, and a risk-predisposing condition could negatively affect employment. While beneficial medical management changes were made, the information placed exceptional stress on the family, who were forced to navigate career-sensitive "extra-medical" issues, to consider the impacts of uncovering risk-predisposition, and to manage the privacy of their genetic information. We highlight how information obtained from WES may collide with these issues and emphasize the importance of genetic counseling for anyone undergoing WES.
[Study on ITS sequences of Aconitum vilmorinianum and its medicinal adulterant].
Zhang, Xiao-nan; Du, Chun-hua; Fu, De-huan; Gao, Li; Zhou, Pei-jun; Wang, Li
2012-09-01
To analyze and compare the ITS sequences of Aconitum vilmorinianum and its medicinal adulterant Aconitum austroyunnanense. Total genomic DNA was extracted from sample materials by an improved CTAB method; the ITS sequences of the samples were amplified by PCR, directly sequenced, and analyzed using the software DNAStar, ClustalX 1.81 and MEGA 4.0. In the ITS1 sequences, 299 consistent sites, 19 variable sites and 13 informative sites were found; in the 5.8S sequences, 162 consistent sites, 2 variable sites and 1 informative site; and in the ITS2 sequences, 217 consistent sites, 3 variable sites and 1 informative site. Comparison of the ITS sequence data matrix showed that base transitions and transversions were absent only from the 5.8S sequences; 2 transitions and 1 transversion were found in the ITS1 sequences, and only 1 transversion was found in the ITS2 sequences. By analyzing the ITS sequence data matrix from 2 populations of Aconitum vilmorinianum and 3 populations of Aconitum austroyunnanense, we found a stable informative site at the 596th base in the ITS2 sequences: in all samples of Aconitum vilmorinianum the base was C, and in all samples of Aconitum austroyunnanense the base was A. Aconitum vilmorinianum and Aconitum austroyunnanense can be identified by the characters of their ITS sequences, and the ITS1 sequences contain more variable sites than the ITS2 sequences.
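The site categories counted in this abstract can be illustrated with a small sketch. It assumes the usual conventions: a consistent (conserved) site has one character state in every sequence, a variable site has more than one, and an informative (parsimony-informative) site is a variable site with at least two states that each occur in at least two sequences. The study's software may handle gaps and ambiguity codes differently; here a gap is simply treated as another character, and the toy alignment is invented.

```python
from collections import Counter

def classify_sites(alignment):
    """Return (conserved, variable, informative) column counts.

    `alignment` is a list of equal-length aligned sequences.
    Informative sites are counted as a subset of variable sites.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("sequences must be aligned to equal length")
    conserved = variable = informative = 0
    for column in zip(*alignment):
        counts = Counter(column)
        if len(counts) == 1:
            conserved += 1
            continue
        variable += 1
        # At least two states must each be shared by two or more sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            informative += 1
    return conserved, variable, informative

# Invented alignment, for illustration only (not Aconitum ITS data).
toy = ["ACGTA", "ACGTC", "ACTTC", "ATTTA"]
conserved, variable, informative = classify_sites(toy)
```

A diagnostic position such as the one reported for ITS2 would appear in this scheme as an informative column whose states split exactly along species lines.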
Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping
2013-01-14
In past years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. The recently popularized proteomic method has greatly improved our ability to analyze the proteins in the venom. However, the lack of sequence information for isolated venom proteins dramatically limits the ability to confidently identify venom proteins. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of a cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in the LVPD. Thus manual de novo sequencing coupled to LVPD search proved to be an extremely effective approach for the analysis of venom proteins. In addition, the approach offers a distinct advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterification of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing. Crown Copyright © 2012. Published by Elsevier B.V. All rights reserved.
Archaeal β diversity patterns under the seafloor along geochemical gradients
NASA Astrophysics Data System (ADS)
Koyano, Hitoshi; Tsubouchi, Taishi; Kishino, Hirohisa; Akutsu, Tatsuya
2014-09-01
Recently, deep drilling into the seafloor has revealed that there are vast sedimentary ecosystems of diverse microorganisms, particularly archaea, in subsurface areas. We investigated the β diversity patterns of archaeal communities in sediment layers under the seafloor and their determinants. This study was accomplished by analyzing large environmental samples of 16S ribosomal RNA gene sequences and various geochemical data collected from a sediment core of 365.3 m, obtained by drilling into the seafloor off the east coast of the Shimokita Peninsula. To extract the maximum amount of information from these environmental samples, we first developed a method for measuring β diversity using sequence data by applying probability theory on a set of strings developed by two of the authors in a previous publication. We introduced an index of β diversity between sequence populations from which the sequence data were sampled. We then constructed an estimator of the β diversity index based on the sequence data and demonstrated that it converges to the β diversity index between sequence populations with probability of 1 as the number of sampled sequences increases. Next, we applied this new method to quantify β diversities between archaeal sequence populations under the seafloor and constructed a quantitative model of the estimated β diversity patterns. Nearly 90% of the variation in the archaeal β diversity was explained by a model that included as variables the differences in the abundances of chlorine, iodine, and carbon between the sediment layers.
5W1H Information Extraction with CNN-Bidirectional LSTM
NASA Astrophysics Data System (ADS)
Nurdin, A.; Maulidevi, N. U.
2018-03-01
In this work, information about who did what, when, where, why, and how was extracted from Indonesian news articles by combining a Convolutional Neural Network and a Bidirectional Long Short-Term Memory network. The Convolutional Neural Network can learn semantically meaningful representations of sentences. The Bidirectional LSTM can analyze the relations among words in the sequence. We also use word2vec word embeddings for word representation. By combining these algorithms, we obtained an F-measure of 0.808. Our experiments show that CNN-BLSTM outperforms other shallow methods, namely IBk, C4.5, and Naïve Bayes, with F-measures of 0.655, 0.645, and 0.595, respectively.
Rocheta, Margarida; Dionísio, F Miguel; Fonseca, Luís; Pires, Ana M
2007-12-01
Paternity analysis using microsatellite information is a well-studied subject. These markers are ideal for parentage studies and fingerprinting, due to their high-discrimination power. This type of data is used to assign paternity, to compute the average selfing and outcrossing rates and to estimate the biparental inbreeding. There are several public domain programs that compute all this information from data. Most of the time, it is necessary to export data to some sort of format, feed it to the program and import the output to an Excel book for further processing. In this article we briefly describe a program referred from now on as Paternity Analysis in Excel (PAE), developed at IST and IBET (see the acknowledgments) that computes paternity candidates from data, and other information, from within Excel. In practice this means that the end user provides the data in an Excel sheet and, by pressing an appropriate button, obtains the results in another Excel sheet. For convenience PAE is divided into two modules. The first one is a filtering module that selects data from the sequencer and reorganizes it in a format appropriate to process paternity analysis, assuming certain conventions for the names of parents and offspring from the sequencer. The second module carries out the paternity analysis assuming that one parent is known. Both modules are written in Excel-VBA and can be obtained at the address (www.math.ist.utl.pt/~fmd/pa/pa.zip). They are free for non-commercial purposes and have been tested with different data and against different software (Cervus, FaMoz, and MLTR).
Valenzuela-Muñoz, Valentina; Sturm, Armin; Gallardo-Escárate, Cristian
2015-04-09
The ATP-binding cassette (ABC) protein family comprises membrane proteins involved in the transport of various biomolecules through the cellular membrane. These proteins have been identified in all taxa and have important physiological functions, including the process of insecticide detoxification in arthropods. For that reason the ectoparasite Caligus rogercresseyi represents a model species for understanding the molecular underpinnings involved in insecticide drug resistance. Illumina sequencing was performed using sea lice exposed to 2 and 3 ppb of deltamethrin and azamethiphos. Contigs obtained from de novo assembly were annotated by Blastx. RNA-Seq analysis was performed and validated by qPCR analysis. From the transcriptome database of C. rogercresseyi, 57 putative ABC protein sequences were identified and phylogenetically classified into the eight subfamilies described for ABC transporters in arthropods. Transcriptomic profiles for the ABC protein subfamilies were evaluated throughout C. rogercresseyi development. Moreover, RNA-Seq analysis was performed for adult male and female salmon lice exposed to the delousing drugs azamethiphos and deltamethrin. High transcript levels of the ABCB and ABCC subfamilies were observed. Furthermore, SNP mining was carried out for the ABC protein sequences, revealing pivotal genomic information. The present study gives a comprehensive transcriptome analysis of ABC proteins from C. rogercresseyi, providing relevant information about transporter roles during ontogeny and in relation to delousing drug responses in salmon lice. This genomic information represents a valuable tool for pest management in the Chilean salmon aquaculture industry.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Denef, Vincent; Shah, Manesh B; Verberkmoes, Nathan C
The recent surge in microbial genomic sequencing, combined with the development of high-throughput liquid chromatography-mass-spectrometry-based (LC/LC-MS/MS) proteomics, has raised the question of the extent to which genomic information of one strain or environmental sample can be used to profile proteomes of related strains or samples. Even with decreasing sequencing costs, it remains impractical to obtain genomic sequence for every strain or sample analyzed. Here, we evaluate how shotgun proteomics is affected by amino acid divergence between the sample and the genomic database using a probability-based model and a random mutation simulation model constrained by experimental data. To assess the effects of nonrandom distribution of mutations, we also evaluated identification levels using in silico peptide data from sequenced isolates with average amino acid identities (AAI) varying between 76 and 98%. We compared the predictions to experimental protein identification levels for a sample that was evaluated using a database that included genomic information for the dominant organism and for a closely related variant (95% AAI). The range of models set the boundaries at which half of the proteins in a proteomic experiment can be identified to be 77-92% AAI between orthologs in the sample and database. Consistent with this prediction, experimental data indicated loss of half the identifiable proteins at 90% AAI. Additional analysis indicated a 6.4% reduction of the initial protein coverage per 1% amino acid divergence and total identification loss at 86% AAI. Consequently, shotgun proteomics is capable of cross-strain identifications but avoids most cross-species false positives.
Li, Lingli; Zhang, Hehua; Liu, Zhongshuai; Cui, Xiaoyue; Zhang, Tong; Li, Yanfang; Zhang, Lingyun
2016-10-12
Blueberry is an economically important fruit crop in Ericaceae family. The substantial quantities of flavonoids in blueberry have been implicated in a broad range of health benefits. However, the information regarding fruit development and flavonoid metabolites based on the transcriptome level is still limited. In the present study, the transcriptome and gene expression profiling over berry development, especially during color development were initiated. A total of approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit and color development. A large number of simple sequence repeats (SSRs) and candidate genes, which are potentially involved in plant development, metabolic and hormone pathways, were identified. A total of 6429 sequences containing 8796 SSRs were characterized from 15,457 unigenes and 1763 unigenes contained more than one SSR. The expression profiles of key genes involved in anthocyanin biosynthesis were also studied. In addition, a comparison between our dataset and other published results was carried out. Our high quality reads produced in this study are an important advancement and provide a new resource for the interpretation of high-throughput data for blueberry species whether regarding sequencing data depth or species extension. The use of this transcriptome data will serve as a valuable public information database for the studies of blueberry genome and would greatly boost the research of fruit and color development, flavonoid metabolisms and regulation and breeding of more healthful blueberries.
Multilevel analysis of sports video sequences
NASA Astrophysics Data System (ADS)
Han, Jungong; Farin, Dirk; de With, Peter H. N.
2006-01-01
We propose a fully automatic and flexible framework for analysis and summarization of tennis broadcast video sequences, using visual features and specific game-context knowledge. Our framework can analyze a tennis video sequence at three levels, which provides a broad range of different analysis results. The proposed framework includes novel pixel-level and object-level tennis video processing algorithms, such as a moving-player detection taking both the color and the court (playing-field) information into account, and a player-position tracking algorithm based on a 3-D camera model. Additionally, we employ scene-level models for detecting events, like service, base-line rally and net-approach, based on a number of real-world visual features. The system can summarize three forms of information: (1) all court-view playing frames in a game, (2) the moving trajectory and real speed of each player, as well as the relative position between the player and the court, (3) the semantic event segments in a game. The proposed framework is flexible in choosing the level of analysis that is desired. It is effective because the framework makes use of several visual cues obtained from the real-world domain to model important events like service, thereby increasing the accuracy of the scene-level analysis. The paper presents attractive experimental results highlighting the system efficiency and analysis capabilities.
The 100 brightest Blue Straggler Stars.
NASA Astrophysics Data System (ADS)
Morales Durán, C.; Llorente de Andrés, F.; Ahumada, J. A.
2015-05-01
Blue straggler stars (BSS) are characterized by their position in the CMD of globular and open clusters, on the extension of the Main Sequence, above the turn-off and blueward of it. According to the Standard Theory of stellar evolution, BSS should have left the Main Sequence and be on the Giant Branch if they really belong to the cluster and formed at the same time as the rest of the cluster stars. There are several theories that try to explain the existence of BSS, but at present the prevailing idea is that they can be the product of mass transfer in binaries (McCrea, 1964), with the luminosity of the receiving star increased in such a way that it now lies above the Main Sequence turn-off point of its cluster. It is also believed that they are the result of the stellar fusion of two or several stars, especially in dense systems such as globular cluster nuclei. This work focuses on all the BSS brighter than V = 10 mag. that we have been able to identify in open clusters. The sample is unprecedented in size and has plentiful observational information, which is why we hope to be able to confirm their membership of the parent cluster and obtain reliable information about their possible origin.
Recent Advances in Conotoxin Classification by Using Machine Learning Methods.
Dao, Fu-Ying; Yang, Hui; Su, Zhen-Dong; Yang, Wuritu; Wu, Yun; Hui, Ding; Chen, Wei; Tang, Hua; Lin, Hao
2017-06-25
Conotoxins are disulfide-rich small peptides, which are invaluable peptides that target ion channel and neuronal receptors. Conotoxins have been demonstrated as potent pharmaceuticals in the treatment of a series of diseases, such as Alzheimer's disease, Parkinson's disease, and epilepsy. In addition, conotoxins are also ideal molecular templates for the development of new drug lead compounds and play important roles in neurobiological research as well. Thus, the accurate identification of conotoxin types will provide key clues for the biological research and clinical medicine. Generally, conotoxin types are confirmed when their sequence, structure, and function are experimentally validated. However, it is time-consuming and costly to acquire the structure and function information by using biochemical experiments. Therefore, it is important to develop computational tools for efficiently and effectively recognizing conotoxin types based on sequence information. In this work, we reviewed the current progress in computational identification of conotoxins in the following aspects: (i) construction of benchmark dataset; (ii) strategies for extracting sequence features; (iii) feature selection techniques; (iv) machine learning methods for classifying conotoxins; (v) the results obtained by these methods and the published tools; and (vi) future perspectives on conotoxin classification. The paper provides the basis for in-depth study of conotoxins and drug therapy research.
Faster computation of exact RNA shape probabilities.
Janssen, Stefan; Giegerich, Robert
2010-03-01
Abstract shape analysis allows efficient computation of a representative sample of low-energy foldings of an RNA molecule. More comprehensive information is obtained by computing shape probabilities, accumulating the Boltzmann probabilities of all structures within each abstract shape. Such information is superior to free energies because it is independent of sequence length and base composition. However, up to this point, computation of shape probabilities evaluates all shapes simultaneously and comes with a computation cost which is exponential in the length of the sequence. We devise an approach called RapidShapes that computes the shapes above a specified probability threshold T by generating a list of promising shapes and constructing specialized folding programs for each shape to compute its share of Boltzmann probability. This aims at a heuristic improvement of runtime, while still computing exact probability values. Evaluating this approach and several substrategies, we find that only a small proportion of shapes have to be actually computed. For an RNA sequence of length 400, this leads, depending on the threshold, to a 10-138 fold speed-up compared with the previous complete method. Thus, probabilistic shape analysis has become feasible in medium-scale applications, such as the screening of RNA transcripts in a bacterial genome. RapidShapes is available via http://bibiserv.cebitec.uni-bielefeld.de/rnashapes
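The definition of a shape probability used in the abstract above (the accumulated Boltzmann probability of all structures within one abstract shape) can be written out as a small sketch. This is the naive enumeration view, not the RapidShapes algorithm itself; the structure list, the energies, the shape labels and the function names are all invented for illustration.

```python
import math
from collections import defaultdict

# Thermal energy in kcal/mol at 37 degrees C (R = 0.0019872 kcal/(mol*K)).
RT = 0.0019872 * 310.15

def shape_probabilities(structures, rt=RT):
    """Sum Boltzmann weights exp(-E/RT) per shape and normalise by the total."""
    weights = defaultdict(float)
    for shape, energy in structures:
        weights[shape] += math.exp(-energy / rt)
    partition = sum(weights.values())
    return {shape: weight / partition for shape, weight in weights.items()}

def shapes_above(probabilities, threshold):
    """Keep only the shapes whose probability exceeds the threshold T."""
    return {s: p for s, p in probabilities.items() if p > threshold}

# Hypothetical (shape, free energy in kcal/mol) pairs for one toy molecule.
toy_structures = [("[]", -3.0), ("[]", -2.5), ("[][]", -2.0), ("_", 0.0)]

probs = shape_probabilities(toy_structures)
likely = shapes_above(probs, 0.1)
print(probs, likely)
```

Because every structure must be enumerated, this direct form scales badly with sequence length, which is the cost the abstract says RapidShapes is designed to reduce.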
Zatz, Mayana; Pavanello, Rita de Cassia M; Lourenço, Naila Cristina V; Cerqueira, Antonia; Lazar, Monize; Vainzof, Mariz
2012-12-01
Improvement in DNA technology is increasingly revealing unexpected/unknown mutations in healthy persons and generating anxiety due to their still unknown health consequences. We report a 44-year-old healthy father of a 10-year-old daughter with bilateral coloboma and hearing loss, but without muscle weakness, in whom a whole-genome CGH revealed a deletion of exons 38-44 in the dystrophin gene. This mutation was inherited from her asymptomatic father, who was further clinically and molecularly evaluated for prognosis and genetic counseling (GC). We have never identified this deletion in 982 Duchenne/Becker patients. To assess whether the present case represents a rare case of non-penetrance, and aiming to obtain more information for prognosis and GC, we suggested that healthy older relatives submit their DNA for analysis, and several agreed. Mutation analysis revealed that his mother, brother, and 56-year-old maternal uncle also carry the 38-44 deletion, suggesting that it is an unlikely cause of muscle weakness. Genome sequencing will disclose mutations and variants whose health impact is still unknown, raising important problems in interpreting results, defining prognosis, and discussing GC. We suggest that, in addition to family history, keeping the DNA of older relatives could be very informative, in particular for those interested in having their genome sequenced.
BAYESIAN PROTEIN STRUCTURE ALIGNMENT.
Rodriguez, Abel; Schmidler, Scott C
The analysis of the three-dimensional structure of proteins is an important topic in molecular biochemistry. Structure plays a critical role in defining the function of proteins and is more strongly conserved than amino acid sequence over evolutionary timescales. A key challenge is the identification and evaluation of structural similarity between proteins; such analysis can aid in understanding the role of newly discovered proteins and help elucidate evolutionary relationships between organisms. Computational biologists have developed many clever algorithmic techniques for comparing protein structures, however, all are based on heuristic optimization criteria, making statistical interpretation somewhat difficult. Here we present a fully probabilistic framework for pairwise structural alignment of proteins. Our approach has several advantages, including the ability to capture alignment uncertainty and to estimate key "gap" parameters which critically affect the quality of the alignment. We show that several existing alignment methods arise as maximum a posteriori estimates under specific choices of prior distributions and error models. Our probabilistic framework is also easily extended to incorporate additional information, which we demonstrate by including primary sequence information to generate simultaneous sequence-structure alignments that can resolve ambiguities obtained using structure alone. This combined model also provides a natural approach for the difficult task of estimating evolutionary distance based on structural alignments. The model is illustrated by comparison with well-established methods on several challenging protein alignment examples.
Identification of Functionally Related Enzymes by Learning-to-Rank Methods.
Stock, Michiel; Fober, Thomas; Hüllermeier, Eyke; Glinca, Serghei; Klebe, Gerhard; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem
2014-01-01
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work, we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
Evolution and function of CAG/polyglutamine repeats in protein–protein interaction networks
Schaefer, Martin H.; Wanker, Erich E.; Andrade-Navarro, Miguel A.
2012-01-01
Expanded runs of consecutive trinucleotide CAG repeats encoding polyglutamine (polyQ) stretches are observed in the genes of a large number of patients with different genetic diseases such as Huntington's and several Ataxias. Protein aggregation, which is a key feature of most of these diseases, is thought to be triggered by these expanded polyQ sequences in disease-related proteins. However, polyQ tracts are a normal feature of many human proteins, suggesting that they have an important cellular function. To clarify the potential function of polyQ repeats in biological systems, we systematically analyzed available information stored in sequence and protein interaction databases. By integrating genomic, phylogenetic, protein interaction network and functional information, we obtained evidence that polyQ tracts in proteins stabilize protein interactions. This happens most likely through structural changes whereby the polyQ sequence extends a neighboring coiled-coil region to facilitate its interaction with a coiled-coil region in another protein. Alteration of this important biological function due to polyQ expansion results in gain of abnormal interactions, leading to pathological effects like protein aggregation. Our analyses suggest that research on polyQ proteins should shift focus from expanded polyQ proteins into the characterization of the influence of the wild-type polyQ on protein interactions. PMID:22287626
Ghouzam, Yassine; Postic, Guillaume; Guerin, Pierre-Edouard; de Brevern, Alexandre G.; Gelly, Jean-Christophe
2016-01-01
Protein structure prediction based on comparative modeling is the most efficient way to produce structural models when it can be performed. ORION is a dedicated webserver based on a new strategy that performs this task. The identification by ORION of suitable templates is performed using an original profile-profile approach that combines sequence and structure evolution information. Structure evolution information is encoded into profiles using structural features, such as solvent accessibility and local conformation —with Protein Blocks—, which give an accurate description of the local protein structure. ORION has recently been improved, increasing by 5% the quality of its results. The ORION web server accepts a single protein sequence as input and searches homologous protein structures within minutes. Various databases such as PDB, SCOP and HOMSTRAD can be mined to find an appropriate structural template. For the modeling step, a protein 3D structure can be directly obtained from the selected template by MODELLER and displayed with global and local quality model estimation measures. The sequence and the predicted structure of 4 examples from the CAMEO server and a recent CASP11 target from the ‘Hard’ category (T0818-D1) are shown as pertinent examples. Our web server is accessible at http://www.dsimb.inserm.fr/ORION/. PMID:27319297
Ghouzam, Yassine; Postic, Guillaume; Guerin, Pierre-Edouard; de Brevern, Alexandre G; Gelly, Jean-Christophe
2016-06-20
Protein structure prediction based on comparative modeling is the most efficient way to produce structural models when it can be performed. ORION is a dedicated webserver based on a new strategy that performs this task. The identification by ORION of suitable templates is performed using an original profile-profile approach that combines sequence and structure evolution information. Structure evolution information is encoded into profiles using structural features, such as solvent accessibility and local conformation -with Protein Blocks-, which give an accurate description of the local protein structure. ORION has recently been improved, increasing by 5% the quality of its results. The ORION web server accepts a single protein sequence as input and searches homologous protein structures within minutes. Various databases such as PDB, SCOP and HOMSTRAD can be mined to find an appropriate structural template. For the modeling step, a protein 3D structure can be directly obtained from the selected template by MODELLER and displayed with global and local quality model estimation measures. The sequence and the predicted structure of 4 examples from the CAMEO server and a recent CASP11 target from the 'Hard' category (T0818-D1) are shown as pertinent examples. Our web server is accessible at http://www.dsimb.inserm.fr/ORION/.
2010-04-01
biological precursors are obtained legally from legitimate corporations, these suppliers should incorporate some sort of chemical or genetic “barcode...traditionally come under the rubric of limited engagement, especially as such policy evolved during the Cold War between the US and USSR. With the end...information that may enable access to dual-use technology. A “bar-coding” procedure, by which a genetic sequence or chemical signature is used to
Conventions and nomenclature for double diffusion encoding NMR and MRI.
Shemesh, Noam; Jespersen, Sune N; Alexander, Daniel C; Cohen, Yoram; Drobnjak, Ivana; Dyrby, Tim B; Finsterbusch, Jurgen; Koch, Martin A; Kuder, Tristan; Laun, Fredrik; Lawrenz, Marco; Lundell, Henrik; Mitra, Partha P; Nilsson, Markus; Özarslan, Evren; Topgaard, Daniel; Westin, Carl-Fredrik
2016-01-01
Stejskal and Tanner's ingenious pulsed field gradient design from 1965 has made diffusion NMR and MRI the mainstay of most studies seeking to resolve microstructural information in porous systems in general and biological systems in particular. Methods extending beyond Stejskal and Tanner's design, such as double diffusion encoding (DDE) NMR and MRI, may provide novel quantifiable metrics that are less easily inferred from conventional diffusion acquisitions. Despite the growing interest on the topic, the terminology for the pulse sequences, their parameters, and the metrics that can be derived from them remains inconsistent and disparate among groups active in DDE. Here, we present a consensus of those groups on terminology for DDE sequences and associated concepts. Furthermore, the regimes in which DDE metrics appear to provide microstructural information that cannot be achieved using more conventional counterparts (in a model-free fashion) are elucidated. We highlight in particular DDE's potential for determining microscopic diffusion anisotropy and microscopic fractional anisotropy, which offer metrics of microscopic features independent of orientation dispersion and thus provide information complementary to the standard, macroscopic, fractional anisotropy conventionally obtained by diffusion MR. Finally, we discuss future vistas and perspectives for DDE. © 2015 Wiley Periodicals, Inc.
Multiple-camera/motion stereoscopy for range estimation in helicopter flight
NASA Technical Reports Server (NTRS)
Smith, Phillip N.; Sridhar, Banavar; Suorsa, Raymond E.
1993-01-01
Aiding the pilot to improve safety and reduce pilot workload by detecting obstacles and planning obstacle-free flight paths during low-altitude helicopter flight is desirable. Computer vision techniques provide an attractive method of obstacle detection and range estimation for objects within a large field of view ahead of the helicopter. Previous research has had considerable success using an image sequence from a single moving camera to solve this problem. The major limitations of single-camera approaches are that no range information can be obtained near the instantaneous direction of motion or in the absence of motion. These limitations can be overcome through the use of multiple cameras. This paper presents a hybrid motion/stereo algorithm which allows range refinement through recursive range estimation while avoiding loss of range information in the direction of travel. A feature-based approach is used to track objects between image frames. An extended Kalman filter combines knowledge of the camera motion and measurements of a feature's image location to recursively estimate the feature's range and to predict its location in future images. Performance of the algorithm will be illustrated using an image sequence, motion information, and independent range measurements from a low-altitude helicopter flight experiment.
Optimal network alignment with graphlet degree vectors.
Milenković, Tijana; Ng, Weng Leong; Hayes, Wayne; Przulj, Natasa
2010-06-30
Important biological information is encoded in the topology of biological networks. Comparative analyses of biological networks are proving to be valuable, as they can lead to transfer of knowledge between species and give deeper insights into biological function, disease, and evolution. We introduce a new method that uses the Hungarian algorithm to produce optimal global alignment between two networks using any cost function. We design a cost function based solely on network topology and use it in our network alignment. Our method can be applied to any two networks, not just biological ones, since it is based only on network topology. We use our new method to align protein-protein interaction networks of two eukaryotic species and demonstrate that our alignment exposes large and topologically complex regions of network similarity. At the same time, our alignment is biologically valid, since many of the aligned protein pairs perform the same biological function. From the alignment, we predict function of yet unannotated proteins, many of which we validate in the literature. Also, we apply our method to find topological similarities between metabolic networks of different species and build phylogenetic trees based on our network alignment score. The phylogenetic trees obtained in this way bear a striking resemblance to the ones obtained by sequence alignments. Our method detects topologically similar regions in large networks that are statistically significant. It does this independent of protein sequence or any other information external to network topology.
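The idea of an optimal global alignment under a topology-only cost function, as described in the abstract above, can be sketched on toy graphs. Two simplifications are assumed here and are not the authors' method: the cost is the absolute difference in node degree (a crude stand-in for graphlet degree vectors), and the optimum is found by brute force over permutations rather than with the Hungarian algorithm, which is only practical for a handful of nodes. All names and graphs are hypothetical.

```python
from itertools import permutations

def degrees(nodes, edges):
    """Return the degree of every node in an undirected edge list."""
    deg = {node: 0 for node in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def optimal_alignment(nodes_a, edges_a, nodes_b, edges_b):
    """Map each node of A to a distinct node of B with minimum total cost.

    Assumes len(nodes_a) <= len(nodes_b). Exhaustive search, toy sizes only.
    """
    deg_a = degrees(nodes_a, edges_a)
    deg_b = degrees(nodes_b, edges_b)

    def total_cost(candidate):
        return sum(abs(deg_a[a] - deg_b[b]) for a, b in zip(nodes_a, candidate))

    best = min(permutations(nodes_b, len(nodes_a)), key=total_cost)
    return dict(zip(nodes_a, best)), total_cost(best)

# Two small path graphs with the middle node placed at different positions.
nodes_a, edges_a = ["h", "x", "y"], [("h", "x"), ("h", "y")]
nodes_b, edges_b = ["p", "q", "r"], [("q", "p"), ("q", "r")]

mapping, cost = optimal_alignment(nodes_a, edges_a, nodes_b, edges_b)
print(mapping, cost)
```

For networks of realistic size, the same minimum-cost assignment is what a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` computes in polynomial time from the full cost matrix.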
Strategies for the acquisition of transcriptional and epigenetic information in single cells.
Li, Guang; Dzilic, Elda; Flores, Nick; Shieh, Alice; Wu, Sean M
2017-03-01
As the basic unit of living organisms, each single cell has unique molecular signatures and functions. Our ability to uncover the transcriptional and epigenetic signature of single cells has been hampered by the lack of tools to explore this area of research. The advent of microfluidic single cell technology along with single cell genome-wide DNA amplification methods has greatly improved our understanding of the expression variation in single cells. Transcriptional expression profiling by multiplex qPCR or genome-wide RNA sequencing has enabled us to examine gene expression in single cells in different tissues. With the new tools, new cellular heterogeneity, novel marker genes, unique subpopulations, and the spatial location of each single cell can be identified successfully. Epigenetic modifications for each single cell can also be obtained via similar methods. Based on single cell genome sequencing, single cell epigenetic information including histone modifications, DNA methylation, and chromatin accessibility has been explored and has provided valuable insights regarding gene regulation and disease prognosis. In this article, we review the development of strategies to obtain single cell transcriptional and epigenetic data. Furthermore, we discuss ways in which single cell studies may help to provide greater understanding of the mechanisms of basic cardiovascular biology that will eventually lead to improvement in our ability to diagnose disease and develop new therapies.
Wang, Zhong-Wei; Jiang, Cong; Wen, Qiang; Wang, Na; Tao, Yuan-Yuan; Xu, Li-An
2014-03-15
Camellia chekiangoleosa is an important species of genus Camellia. It provides high-quality edible oil and has great ornamental value. The flowers are big and red which bloom between February and March. Flower pigmentation is closely related to the accumulation of anthocyanin. Although anthocyanin biosynthesis has been studied extensively in herbaceous plants, little molecular information on the anthocyanin biosynthesis pathway of C. chekiangoleosa is yet known. In the present study, a cDNA library was constructed to obtain detailed and general data from the flowers of C. chekiangoleosa. To explore the transcriptome of C. chekiangoleosa and investigate genes involved in anthocyanin biosynthesis, a 454 GS FLX Titanium platform was used to generate an EST dataset. About 46,279 sequences were obtained, and 24,593 (53.1%) were annotated. Using Blast search against the AGRIS, 1740 unigenes were found homologous to 599 Arabidopsis transcription factor genes. Based on the transcriptome dataset, nine anthocyanin biosynthesis pathway genes (PAL, CHS1, CHS2, CHS3, CHI, F3H, DFR, ANS, and UFGT) were identified and cloned. The spatio-temporal expression patterns of these genes were also analyzed using quantitative real-time polymerase chain reaction. The study results not only enrich the gene resource but also provide valuable information for further studies concerning anthocyanin biosynthesis. Copyright © 2014 Elsevier B.V. All rights reserved.
New method of extracting information of arterial oxygen saturation based on ∑|Δ|
NASA Astrophysics Data System (ADS)
Dai, Wenting; Lin, Ling; Li, Gang
2017-04-01
Noninvasive detection of oxygen saturation with near-infrared spectroscopy has been widely used in clinics. In order to further enhance its detection precision and reliability, this paper proposes a method of time domain absolute difference summation (∑|Δ|) based on a dynamic spectrum. In this method, the ratio of the absolute differences between two differential sampling points at the same moment on the logarithmic photoplethysmography signals of red and infrared light is obtained in turn, yielding a ratio sequence that is screened with a statistical method. Finally, the summation of the screened ratio sequence is used as the oxygen saturation coefficient Q. We collected 120 reference samples of SpO2 and then compared the results of the two methods, ∑|Δ| and peak-peak. Average root-mean-square errors of the two methods were 3.02% and 6.80%, respectively, in the 20 cases which were selected randomly. In addition, the average variance of Q of the 10 samples obtained by the new method was reduced to 22.77% of that obtained by the peak-peak method. Compared with the commercial product, the new method makes the results more accurate. Theoretical and experimental analysis indicates that the application of the ∑|Δ| method could enhance the precision and reliability of oxygen saturation detection in real time.
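A literal reading of the ∑|Δ| procedure in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the screening statistic, so a mean ± k·standard-deviation filter is assumed here, and the function names, the parameter `k`, and the guard `eps` are all hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def abs_log_differences(signal: Sequence[float]) -> List[float]:
    """|Δ| between successive samples of the logarithm of a PPG signal."""
    logs = [math.log(x) for x in signal]
    return [abs(b - a) for a, b in zip(logs, logs[1:])]


def q_coefficient(red: Sequence[float], infrared: Sequence[float],
                  k: float = 2.0, eps: float = 1e-12) -> float:
    """Sum of the screened red/infrared |Δ| ratios (assumed screening rule)."""
    d_red = abs_log_differences(red)
    d_ir = abs_log_differences(infrared)
    # Ratio at each sampling moment; skip moments where the infrared
    # difference is numerically zero to avoid division by zero.
    ratios = [r / i for r, i in zip(d_red, d_ir) if i > eps]
    mu, sigma = mean(ratios), pstdev(ratios)
    # Assumed statistical screening: keep ratios within k standard deviations.
    kept = [x for x in ratios if abs(x - mu) <= k * sigma]
    return sum(kept)
```

Because Q is a plain sum in this reading, it scales with the number of retained sampling moments, so values are only comparable between recordings of equal length.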
Bertolini, Francesca; Ghionda, Marco Ciro; D’Alessandro, Enrico; Geraci, Claudia; Chiofalo, Vincenzo; Fontanesi, Luca
2015-01-01
The identification of the species of origin of meat and meat products is an important issue to prevent and detect frauds that might have economic, ethical and health implications. In this paper we evaluated the potential of the next generation semiconductor based sequencing technology (Ion Torrent Personal Genome Machine) for the identification of DNA from meat species (pig, horse, cattle, sheep, rabbit, chicken, turkey, pheasant, duck, goose and pigeon) as well as from human and rat in DNA mixtures through the sequencing of PCR products obtained from different pairs of universal primers that amplify 12S and 16S rRNA mitochondrial DNA genes. Six libraries were produced including PCR products obtained separately from 13 species or from DNA mixtures containing DNA from all species or only avian or only mammalian species at equimolar concentration or at 1:10 or 1:50 ratios for pig and horse DNA. Sequencing obtained a total of 33,294,511 called nucleotides, of which 29,109,688 (87.43%) had Q20 quality, in a total of 215,944 reads. Different alignment algorithms were used to assign the species based on sequence data. The error rate calculated after confirmation of the obtained sequences by Sanger sequencing ranged from 0.0003 to 0.02 for the different species. The correlation of the number of reads per species between different libraries was high for mammalian species (0.97) and lower for avian species (0.70). PCR competition limited the efficiency of amplification and sequencing for avian species for some primer pairs. Detection of low levels of pig and horse DNA was possible with reads obtained from different primer pairs. The sequencing of the products obtained from different universal PCR primers could be a useful strategy to overcome potential problems of amplification. Based on these results, the Ion Torrent technology can be applied for the identification of meat species in DNA mixtures. PMID:25923709
Xu, Li; Ding, Zhi-Shan; Zhou, Yun-Kai; Tao, Xue-Fen
2009-06-01
To obtain the full-length cDNA sequence of the Secoisolariciresinol Dehydrogenase gene from Dysosma versipellis by RACE PCR, and then to investigate the characteristics of the gene. The full-length cDNA sequence of the Secoisolariciresinol Dehydrogenase gene was obtained from Dysosma versipellis by 3'-RACE and 5'-RACE. This is the first report of the full cDNA sequence of Secoisolariciresinol Dehydrogenase in Dysosma versipellis. The acquired gene was 991 bp in full length, including a 5' untranslated region of 42 bp and a 3' untranslated region of 112 bp with poly(A). The open reading frame (ORF) encoded 278 amino acids, with a molecular weight of 29253.3 Daltons and an isoelectric point of 6.328. The GenBank nucleotide sequence accession number of the gene was EU573789. Semi-quantitative RT-PCR analysis revealed that the Secoisolariciresinol Dehydrogenase gene was highly expressed in the stem. Alignment of the amino acid sequence of Secoisolariciresinol Dehydrogenase indicated that there may be significant amino acid sequence differences among different species. The full-length cDNA sequence of the Secoisolariciresinol Dehydrogenase gene from Dysosma versipellis was obtained.
Rényi continuous entropy of DNA sequences.
Vinga, Susana; Almeida, Jonas S
2004-12-07
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique to evaluate global sequence randomness without some of the drawbacks of the former method. The asymptotic behaviour of this new measure was analytically deduced and the calculation of entropies for several synthetic and experimental biological sequences was performed. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences have shown a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provide additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
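The measure described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' MATLAB code: it assumes a two-dimensional chaos game representation (CGR) with an arbitrary corner assignment, and an isotropic Gaussian Parzen kernel of width `sigma`, for which the quadratic entropy reduces to H2 = -ln[(1/N²) Σ_i Σ_j G(x_i - x_j; 2σ²)]. The names `CORNERS`, `cgr_points`, and `renyi_quadratic_entropy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Assumed corner assignment for the unit-square CGR map.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> List[Point]:
    """Map a DNA sequence to CGR points by repeated halving toward corners."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points: Sequence[Point], sigma: float) -> float:
    """Rényi quadratic entropy of a 2-D Gaussian Parzen density estimate."""
    n = len(points)
    # Convolving two Gaussians of variance sigma^2 gives variance 2*sigma^2,
    # whose 2-D normalization constant is 1 / (4*pi*sigma^2).
    norm = 1.0 / (4.0 * math.pi * sigma ** 2)
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
            total += norm * math.exp(-d2 / (4.0 * sigma ** 2))
    return -math.log(total / n ** 2)
```

The double loop is quadratic in sequence length, which is acceptable for a sketch but would need a faster kernel summation for long genomes. Being a continuous (differential) entropy, the value can be negative, and it depends on `sigma`, consistent with the abstract's remark about kernel resolution.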
GenColors: annotation and comparative genomics of prokaryotes made easy.
Romualdi, Alessandro; Felder, Marius; Rose, Dominic; Gausmann, Ulrike; Schilhabel, Markus; Glöckner, Gernot; Platzer, Matthias; Sühnel, Jürgen
2007-01-01
GenColors (gencolors.fli-leibniz.de) is a new web-based software/database system aimed at an improved and accelerated annotation of prokaryotic genomes considering information on related genomes and making extensive use of genome comparison. It offers a seamless integration of data from ongoing sequencing projects and annotated genomic sequences obtained from GenBank. A variety of export/import filters manages an effective data flow from sequence assembly and manipulation programs (e.g., GAP4) to GenColors and back as well as to standard GenBank file(s). The genome comparison tools include best bidirectional hits, gene conservation, syntenies, and gene core sets. Precomputed UniProt matches allow annotation and analysis in an effective manner. In addition to these analysis options, base-specific quality data (coverage and confidence) can also be handled if available. The GenColors system can be used both for annotation purposes in ongoing genome projects and as an analysis tool for finished genomes. GenColors comes in two types, as dedicated genome browsers and as the Jena Prokaryotic Genome Viewer (JPGV). Dedicated genome browsers contain genomic information on a set of related genomes and offer a large number of options for genome comparison. The system has been efficiently used in the genomic sequencing of Borrelia garinii and is currently applied to various ongoing genome projects on Borrelia, Legionella, Escherichia, and Pseudomonas genomes. One of these dedicated browsers, the Spirochetes Genome Browser (sgb.fli-leibniz.de) with Borrelia, Leptospira, and Treponema genomes, is freely accessible. The others will be released after finalization of the corresponding genome projects. JPGV (jpgv.fli-leibniz.de) offers information on almost all finished bacterial genomes, as compared to the dedicated browsers with reduced genome comparison functionality, however. 
As of January 2006, this viewer includes 632 genomic elements (e.g., chromosomes and plasmids) of 293 species. The system provides versatile quick and advanced search options for all currently known prokaryotic genomes and generates circular and linear genome plots. Gene information sheets contain basic gene information, database search options, and links to external databases. GenColors is also available on request for local installation.
Yilmaz, Pelin; Kottmann, Renzo; Field, Dawn; Knight, Rob; Cole, James R; Amaral-Zettler, Linda; Gilbert, Jack A; Karsch-Mizrachi, Ilene; Johnston, Anjanette; Cochrane, Guy; Vaughan, Robert; Hunter, Christopher; Park, Joonhong; Morrison, Norman; Rocca-Serra, Philippe; Sterk, Peter; Arumugam, Manimozhiyan; Bailey, Mark; Baumgartner, Laura; Birren, Bruce W; Blaser, Martin J; Bonazzi, Vivien; Booth, Tim; Bork, Peer; Bushman, Frederic D; Buttigieg, Pier Luigi; Chain, Patrick S G; Charlson, Emily; Costello, Elizabeth K; Huot-Creasy, Heather; Dawyndt, Peter; DeSantis, Todd; Fierer, Noah; Fuhrman, Jed A; Gallery, Rachel E; Gevers, Dirk; Gibbs, Richard A; Gil, Inigo San; Gonzalez, Antonio; Gordon, Jeffrey I; Guralnick, Robert; Hankeln, Wolfgang; Highlander, Sarah; Hugenholtz, Philip; Jansson, Janet; Kau, Andrew L; Kelley, Scott T; Kennedy, Jerry; Knights, Dan; Koren, Omry; Kuczynski, Justin; Kyrpides, Nikos; Larsen, Robert; Lauber, Christian L; Legg, Teresa; Ley, Ruth E; Lozupone, Catherine A; Ludwig, Wolfgang; Lyons, Donna; Maguire, Eamonn; Methé, Barbara A; Meyer, Folker; Muegge, Brian; Nakielny, Sara; Nelson, Karen E; Nemergut, Diana; Neufeld, Josh D; Newbold, Lindsay K; Oliver, Anna E; Pace, Norman R; Palanisamy, Giriprakash; Peplies, Jörg; Petrosino, Joseph; Proctor, Lita; Pruesse, Elmar; Quast, Christian; Raes, Jeroen; Ratnasingham, Sujeevan; Ravel, Jacques; Relman, David A; Assunta-Sansone, Susanna; Schloss, Patrick D; Schriml, Lynn; Sinha, Rohini; Smith, Michelle I; Sodergren, Erica; Spor, Aymé; Stombaugh, Jesse; Tiedje, James M; Ward, Doyle V; Weinstock, George M; Wendel, Doug; White, Owen; Whiteley, Andrew; Wilke, Andreas; Wortman, Jennifer R; Yatsunenko, Tanya; Glöckner, Frank Oliver
2012-01-01
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere. PMID:21552244
Yu, Yongxin; Cai, Hui; Hu, Linghao; Lei, Rongwei; Pan, Yingjie; Yan, Shuling; Wang, Yongjie
2015-11-01
Noroviruses (NoVs) are a leading cause of epidemic and sporadic cases of acute gastroenteritis worldwide. Oysters are well recognized as the main vectors of environmentally transmitted NoVs, and disease outbreaks linked to oyster consumption have been commonly observed. Here, to quantify the genetic diversity, temporal distribution, and circulation of oyster-related NoVs on a global scale, 1,077 oyster-related NoV sequences deposited from 1983 to 2014 were downloaded from both NCBI GenBank and the NoroNet outbreak database and were then screened for quality control. A total of 665 sequences with reliable information were obtained and were subsequently subjected to genotyping and phylogenetic analyses. The results indicated that the majority of oyster-related NoV sequences were obtained from coastal countries and regions and that the numbers of sequences in these regions were unevenly distributed. Moreover, >80% of human NoV genotypes were detected in oyster samples or oyster-related outbreaks. A higher proportion of genogroup I (GI) (34%) was observed for oyster-related sequences than for non-oyster-related outbreaks, where GII strains dominated with an overwhelming majority of >90%, indicating that the prevalences of GI and GII are different in humans and oysters. In addition, a related convergence of the circulation trend was found between oyster-related NoV sequences and human pandemic outbreaks. This suggests that oysters not only act as a vector of NoV through environmental transmission but also serve as an important reservoir of human NoVs. These results highlight the importance of oysters in the persistence and transmission of human NoVs in the environment and have important implications for the surveillance of human NoVs in oyster samples. Copyright © 2015, American Society for Microbiology. All Rights Reserved.
Petzold, Markus; Prior, Karola; Moran-Gilad, Jacob; Harmsen, Dag; Lück, Christian
2017-01-01
Introduction: Whole genome sequencing (WGS) is increasingly used in Legionnaires’ disease (LD) outbreak investigations, owing to its higher resolution than sequence-based typing, the gold standard typing method for Legionella pneumophila, in the analysis of endemic strains. Recently, a gene-by-gene typing approach based on 1,521 core genes called core genome multilocus sequence typing (cgMLST) was described that enables a robust and standardised typing of L. pneumophila. Methods: We applied this cgMLST scheme to isolates obtained during the largest outbreak of LD reported so far in Germany. In this outbreak, the epidemic clone ST345 had been isolated from patients and four different environmental sources. In total 42 clinical and environmental isolates were retrospectively typed. Results: Epidemiologically unrelated ST345 isolates were clearly distinguishable from the epidemic clone. Remarkably, epidemic isolates split up into two distinct clusters, ST345-A and ST345-B, each respectively containing a mix of clinical and epidemiologically-related environmental samples. Discussion/conclusion: The outbreak was therefore likely caused by both variants of the single sequence type, which pre-existed in the environmental reservoirs. The two clusters differed by 40 alleles located in two neighbouring genomic regions of ca 42 and 26 kb. Additional analysis supported horizontal gene transfer of the two regions as responsible for the difference between the variants. Both regions comprise virulence genes and have previously been reported to be involved in recombination events. This corroborates the notion that genomic outbreak investigations should always take epidemiological information into consideration when making inferences. Overall, cgMLST proved helpful in disentangling the complex genomic epidemiology of the outbreak. PMID:29162202
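The gene-by-gene comparison underlying cgMLST, as in the "differed by 40 alleles" result above, amounts to counting loci whose allele numbers differ between two isolates. The sketch below illustrates that idea only; the profile representation, the locus names, and the choice to ignore loci with missing calls in either isolate are assumptions, not details taken from the abstract.

```python
from typing import Dict, Optional

# Allele profile: locus name -> allele number, or None if the locus was not called.
Profile = Dict[str, Optional[int]]


def allele_distance(a: Profile, b: Profile) -> int:
    """Number of loci, called in both profiles, at which the alleles differ."""
    distance = 0
    for locus, allele_a in a.items():
        allele_b = b.get(locus)
        if allele_a is None or allele_b is None:
            continue  # assumed handling: skip loci missing in either isolate
        if allele_a != allele_b:
            distance += 1
    return distance


# Hypothetical three-locus example.
isolate_1 = {"locus_1": 3, "locus_2": 7, "locus_3": 1}
isolate_2 = {"locus_1": 3, "locus_2": 9, "locus_3": None}
print(allele_distance(isolate_1, isolate_2))
```

Pairwise distances of this kind are typically the input to clustering or minimum spanning tree construction when isolates are grouped into outbreak clusters.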
Altuntaş, Esra; Schubert, Ulrich S
2014-01-15
Mass spectrometry (MS) is the most versatile and comprehensive method in "OMICS" sciences (i.e. in proteomics, genomics, metabolomics and lipidomics). The applications of MS and tandem MS (MS/MS or MS(n)) provide sequence information of the full complement of biological samples in order to understand the importance of the sequences on their precise and specific functions. Nowadays, the control of polymer sequences and their accurate characterization is one of the significant challenges of current polymer science. Therefore, a similar approach can be very beneficial for characterizing and understanding the complex structures of synthetic macromolecules. MS-based strategies allow a relatively precise examination of polymeric structures (e.g. their molar mass distributions, monomer units, side chain substituents, end-group functionalities, and copolymer compositions). Moreover, tandem MS offers accurate structural information from intricate macromolecular structures; however, it produces a vast amount of data to interpret. In "OMICS" sciences, the software for interpreting the obtained data has developed satisfactorily (e.g. in proteomics), because it is not possible to handle manually the amount of data acquired via (tandem) MS studies on biological samples. It can be expected that special software tools will improve the interpretation of (tandem) MS output from the investigations of synthetic polymers as well. Eventually, the MS/MS field will also open up for polymer scientists who are not MS-specialists. In this review, we dissect the overall framework of the MS and MS/MS analysis of synthetic polymers into its key components. We discuss the fundamentals of polymer analyses as well as recent advances in the areas of tandem mass spectrometry, software developments, and the overall future perspectives on the way to polymer sequencing, one of the last Holy Grails in polymer science. Copyright © 2013 Elsevier B.V. All rights reserved.
Triwitayakorn, Kanokporn; Chatkulkawin, Pornsupa; Kanjanawattanawong, Supanath; Sraphet, Supajit; Yoocha, Thippawan; Sangsrakru, Duangjai; Chanprasert, Juntima; Ngamphiw, Chumpol; Jomchai, Nukoon; Therawattanasuk, Kanikar; Tangphatsornruang, Sithichoke
2011-01-01
To obtain more information on the Hevea brasiliensis genome, we sequenced the transcriptome from the vegetative shoot apex yielding 2 311 497 reads. Clustering and assembly of the reads produced a total of 113 313 unique sequences, comprising 28 387 isotigs and 84 926 singletons. Also, 17 819 expressed sequence tag (EST)-simple sequence repeats (SSRs) were identified from the data set. To demonstrate the use of this EST resource for marker development, primers were designed for 430 of the EST-SSRs. Three hundred and twenty-three primer pairs were amplifiable in H. brasiliensis clones. Polymorphic information content values of 47 selected SSRs among 20 H. brasiliensis clones ranged from 0.13 to 0.71, with an average of 0.51. A dendrogram of genetic similarities between the 20 H. brasiliensis clones using these 47 EST-SSRs suggested two distinct groups that correlated well with clone pedigree. These novel EST-SSRs together with the published SSRs were used for the construction of an integrated parental linkage map of H. brasiliensis based on 81 lines of an F1 mapping population. The map consisted of 97 loci, comprising 37 novel EST-SSRs and 60 published SSRs, distributed on 23 linkage groups and covered 842.9 cM with a mean interval of 11.9 cM and ∼4 loci per linkage group. Although the number of linkage groups exceeds the haploid number (18), the presence of several markers in common between homologous linkage groups and the previous map indicated that the F1 map in this study is appropriate for further study in marker-assisted selection. PMID:22086998
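For readers unfamiliar with the polymorphic information content (PIC) values reported above, the sketch below computes PIC from allele frequencies using the commonly cited definition PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². That this is the exact formula the authors applied is an assumption, since the abstract does not state it, and the function name `pic` is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pic(frequencies: Sequence[float]) -> float:
    """Polymorphic information content for one marker's allele frequencies."""
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in frequencies)
    # Correction term over all unordered pairs of distinct alleles.
    pair_term = sum(2.0 * (p * p) * (q * q)
                    for p, q in combinations(frequencies, 2))
    return 1.0 - homozygosity - pair_term


# Two equally frequent alleles give the maximum PIC for a biallelic marker.
print(pic([0.5, 0.5]))
```

A monomorphic marker has a PIC of 0, and values rise as more alleles occur at more even frequencies, which is why multi-allelic SSRs can exceed the biallelic ceiling of 0.375.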
Jiang, Yanwen; Nie, Kui; Redmond, David; Melnick, Ari M; Tam, Wayne; Elemento, Olivier
2015-12-28
Understanding tumor clonality is critical to understanding the mechanisms involved in tumorigenesis and disease progression. In addition, understanding the clonal composition changes that occur within a tumor in response to certain microenvironments or treatments may lead to the design of more sophisticated and effective approaches to eradicate tumor cells. However, tracking tumor clonal sub-populations has been challenging due to the lack of distinguishable markers. To address this problem, a VDJ-seq protocol was created to trace the clonal evolution patterns of diffuse large B cell lymphoma (DLBCL) relapse by exploiting VDJ recombination and somatic hypermutation (SHM), two unique features of B cell lymphomas. In this protocol, Next-Generation sequencing (NGS) libraries with indexing potential were constructed from the amplified rearranged immunoglobulin heavy chain (IgH) VDJ region from pairs of primary diagnosis and relapse DLBCL samples. On average, more than half a million VDJ sequences per sample were obtained after sequencing, which contain both VDJ rearrangement and SHM information. In addition, customized bioinformatics pipelines were developed to fully utilize sequence information for the characterization of the IgH-VDJ repertoire within these samples. Furthermore, the pipeline allows the reconstruction and comparison of the clonal architecture of individual tumors, which enables the examination of the clonal heterogeneity within the diagnosis tumors and deduction of clonal evolution patterns between diagnosis and relapse tumor pairs. When applying this analysis to several diagnosis-relapse pairs, we uncovered key evidence that multiple distinctive tumor evolutionary patterns could lead to DLBCL relapse. Additionally, this approach can be expanded into other clinical aspects, such as identification of minimal residual disease, monitoring relapse progress and treatment response, and investigation of immune repertoires in non-lymphoma contexts.
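The diagnosis-versus-relapse comparison described above can be illustrated with a deliberately simplified sketch. It is not the authors' pipeline: here every distinct VDJ sequence string is treated as its own clone, whereas a real analysis would group reads by VDJ rearrangement and SHM pattern, and the names `clone_frequencies` and `compare_clones` are hypothetical. Both samples are assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def clone_frequencies(vdj_reads: Iterable[str]) -> Dict[str, float]:
    """Fraction of reads carrying each distinct VDJ sequence."""
    counts = Counter(vdj_reads)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}


def compare_clones(diagnosis: Iterable[str],
                   relapse: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map each clone to its (diagnosis, relapse) read fractions."""
    d = clone_frequencies(diagnosis)
    r = clone_frequencies(relapse)
    return {clone: (d.get(clone, 0.0), r.get(clone, 0.0))
            for clone in set(d) | set(r)}
```

A clone that is minor at diagnosis but dominant at relapse, or one that appears only at relapse, would show up directly in the paired fractions, which is the kind of pattern the abstract refers to as distinct evolutionary routes to relapse.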
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts
Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre
2013-01-01
The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
Almazan, Eugene Matthew P.; Lesko, Sydney L.; Markey, Michael P.; Rouhana, Labib
2017-01-01
Planarian flatworms are popular models for the study of regeneration and stem cell biology in vivo. Technical advances and increased availability of genetic information have fueled the discovery of molecules responsible for stem cell pluripotency and regeneration in flatworms. Unfortunately, most of the planarian research performed worldwide utilizes species that are not natural inhabitants of North America, which limits their availability to newcomer laboratories and impedes their distribution for educational activities. In order to circumvent these limitations and increase the genetic information available for comparative studies, we sequenced the transcriptome of Girardia dorotocephala, a planarian species pandemic and commercially available in North America. A total of 254,802,670 paired sequence reads were obtained from RNA extracted from intact individuals, regenerating fragments, as well as freshly excised auricles of a clonal line of G. dorotocephala (MA-C2), and used for de novo assembly of its transcriptome. The resulting transcriptome draft was validated through functional analysis of genetic markers of stem cells and their progeny in G. dorotocephala. Akin to orthologs in other planarian species, G. dorotocephala Piwi1 (GdPiwi1) was found to be a robust marker of the planarian stem cell population and GdPiwi2 an essential component for stem cell-driven regeneration. Identification of G. dorotocephala homologs of the early stem cell descendant marker PROG-1 revealed a family of lysine-rich proteins expressed during epithelial cell differentiation. Sequences from the MA-C2 transcriptome were found to be 98–99% identical to nucleotide sequences from G. dorotocephala populations with different chromosomal number, demonstrating strong conservation regardless of karyotype evolution. Altogether, this work establishes G. dorotocephala as a viable and accessible option for analysis of gene function in North America. PMID:28774726
Leonard, Susan R.; Mammel, Mark K.; Lacher, David W.
2015-01-01
Culture-independent diagnostics reduce the reliance on traditional (and slower) culture-based methodologies. Here we capitalize on advances in next-generation sequencing (NGS) to apply this approach to food pathogen detection. In this study, we spiked spinach with Shiga toxin-producing Escherichia coli (STEC) following an established FDA culture-based protocol and applied shotgun metagenomic sequencing to determine the limits of detection, sensitivity, and specificity and to obtain information on the microbiology of the protocol. We show that an expected level of contamination (∼10 CFU/100 g) could be adequately detected (including key virulence determinants and strain-level specificity) within 8 h of enrichment at a sequencing depth of 10,000,000 reads. We also rationalize the relative benefit of static versus shaking culture conditions and the addition of selected antimicrobial agents, thereby validating the long-standing culture-based parameters behind such protocols. Moreover, the shotgun metagenomic approach was informative regarding the dynamics of microbial communities during the enrichment process, including initial surveys of the microbial loads associated with bagged spinach; the microbes found included key genera such as Pseudomonas, Pantoea, and Exiguobacterium. Collectively, our metagenomic study highlights and considers various parameters required for transitioning to such sequencing-based diagnostics for food safety and the potential to develop better enrichment processes in a high-throughput manner not previously possible. Future studies will investigate new species-specific DNA signature target regimens, rational design of medium components in concert with judicious use of additives, such as antibiotics, and alterations in the sample processing protocol to enhance detection. PMID:26386062
Shao, En-Si; Lin, Gui-Fang; Liu, Sijun; Ma, Xiao-Li; Chen, Ming-Feng; Lin, Li; Wu, Song-Qing; Sha, Li; Liu, Zhao-Xia; Hu, Xiao-Hua; Guan, Xiong; Zhang, Ling-Ling
2017-01-01
Tea production across Asia has been significantly impacted by the false-eye leafhopper, Empoasca vitis (Göthe). To identify the key genes responsible for nutrition absorption, xenobiotic metabolism and immune response, the transcriptomes of alimentary tracts and of bodies minus the alimentary tract of E. vitis were sequenced and analyzed. Over 31 million reads were obtained from Illumina sequencing. De novo sequence assembly resulted in 52,182 unigenes with a mean size of 848 nt. The assembled unigenes were then annotated using various databases. Transcripts of at least 566 digestion-, 224 detoxification-, and 288 immune-related putative genes in E. vitis were identified. In addition, relative expression of highly abundant transcripts was verified through quantitative real-time PCR. Results from this investigation provide genomic information about E. vitis, which will be helpful in further study of E. vitis biology and in the development of novel strategies to control this devastating pest. Copyright © 2016 Elsevier Inc. All rights reserved.
Protein model discrimination using mutational sensitivity derived from deep sequencing.
Adkar, Bharat V; Tripathi, Arti; Sahoo, Anusmita; Bajaj, Kanika; Goswami, Devrishi; Chakrabarti, Purbani; Swarnkar, Mohit K; Gokhale, Rajesh S; Varadarajan, Raghavan
2012-02-08
A major bottleneck in protein structure prediction is the selection of correct models from a pool of decoys. Relative activities of ∼1,200 individual single-site mutants in a saturation library of the bacterial toxin CcdB were estimated by determining their relative populations using deep sequencing. This phenotypic information was used to define an empirical score for each residue (RankScore), which correlated with residue depth, and to identify active-site residues. Using these correlations, ∼98% of correct models of CcdB (RMSD ≤ 4Å) were identified from a large set of decoys. The model-discrimination methodology was further validated on eleven different monomeric proteins using simulated RankScore values. The methodology is also a rapid, accurate way to obtain relative activities of each mutant in a large pool and derive sequence-structure-function relationships without protein isolation or characterization. It can be applied to any system in which mutational effects can be monitored by a phenotypic readout. Copyright © 2012 Elsevier Ltd. All rights reserved.
Whole-Genome Sequencing in Outbreak Analysis
Turner, Stephen D.; Riley, Margaret F.; Petri, William A.; Hewlett, Erik L.
2015-01-01
In addition to the ever-present concern of medical professionals about epidemics of infectious diseases, the relative ease of access and low cost of obtaining, producing, and disseminating pathogenic organisms or biological toxins mean that bioterrorism activity should also be considered when facing a disease outbreak. Utilization of whole-genome sequencing (WGS) in outbreak analysis facilitates the rapid and accurate identification of virulence factors of the pathogen and can be used to identify the path of disease transmission within a population and provide information on the probable source. Molecular tools such as WGS are being refined and advanced at a rapid pace to provide robust and higher-resolution methods for identifying, comparing, and classifying pathogenic organisms. If these methods of pathogen characterization are properly applied, they will enable an improved public health response whether a disease outbreak was initiated by natural events or by accidental or deliberate human activity. The current application of next-generation sequencing (NGS) technology to microbial WGS and microbial forensics is reviewed. PMID:25876885