Solis, Armando D
2014-01-01
The most informative probability distribution functions (PDFs) describing the Ramachandran phi-psi dihedral angle pair, a fundamental descriptor of backbone conformation of protein molecules, are derived from high-resolution X-ray crystal structures using an information-theoretic approach. The Information Maximization Device (IMD) is established, based on fundamental information-theoretic concepts, and then applied specifically to derive highly resolved phi-psi maps for all 20 single amino acid and all 8000 triplet sequences at an optimal resolution determined by the volume of current data. The paper shows that utilizing the latent information contained in all viable high-resolution crystal structures found in the Protein Data Bank (PDB), totaling more than 77,000 chains, permits the derivation of a large number of optimized sequence-dependent PDFs. This work demonstrates the effectiveness of the IMD and the superiority of the resulting PDFs by extensive fold recognition experiments and rigorous comparisons with previously published triplet PDFs. Because it automatically optimizes PDFs, IMD results in improved performance of knowledge-based potentials, which rely on such PDFs. Furthermore, it provides an easy computational recipe for empirically deriving other kinds of sequence-dependent structural PDFs with greater detail and precision. The high-resolution phi-psi maps derived in this work are available for download.
Brassica ASTRA: an integrated database for Brassica genomic research.
Love, Christopher G; Robinson, Andrew J; Lim, Geraldine A C; Hopkins, Clare J; Batley, Jacqueline; Barker, Gary; Spangenberg, German C; Edwards, David
2005-01-01
Brassica ASTRA is a public database for genomic information on Brassica species. The database incorporates expressed sequences with Swiss-Prot and GenBank comparative sequence annotation as well as secondary Gene Ontology (GO) annotation derived from the comparison with Arabidopsis TAIR GO annotations. Simple sequence repeat molecular markers are identified within resident sequences and mapped onto the closely related Arabidopsis genome sequence. Bacterial artificial chromosome (BAC) end sequences derived from the Multinational Brassica Genome Project are also mapped onto the Arabidopsis genome sequence enabling users to identify candidate Brassica BACs corresponding to syntenic regions of Arabidopsis. This information is maintained in a MySQL database with a web interface providing the primary means of interrogation. The database is accessible at http://hornbill.cspp.latrobe.edu.au.
NASA Technical Reports Server (NTRS)
Gatlin, L. L.
1974-01-01
Concepts of information theory are applied to examine various proteins in terms of their redundancy in natural originators such as animals and plants. The Monte Carlo method is used to derive information parameters for random protein sequences. Real protein sequence parameters are compared with the standard parameters of protein sequences having a specific length. The tendency of a chain to contain some amino acids more frequently than others and the tendency of a chain to contain certain amino acid pairs more frequently than other pairs are used as randomness measures of individual protein sequences. Non-periodic proteins are generally found to have random Shannon redundancies except in cases of constraints due to short chain length and genetic codes. Redundant characteristics of highly periodic proteins are discussed. A degree of periodicity parameter is derived.
Paleovirology of bornaviruses: What can be learned from molecular fossils of bornaviruses.
Horie, Masayuki; Tomonaga, Keizo
2018-04-06
Endogenous viral elements (EVEs) are virus-derived sequences embedded in eukaryotic genomes formed by germline integration of viral sequences. As many EVEs were integrated into eukaryotic genomes millions of years ago, EVEs are considered molecular fossils of viruses. EVEs can be valuable informational sources about ancient viruses, including their time scale, geographical distribution, genetic information, and hosts. Although integration of viral sequences is not required for replications of viruses other than retroviruses, many non-retroviral EVEs have been reported to exist in eukaryotes. Investigation of these EVEs has expanded our knowledge regarding virus-host interactions, as well as provided information on ancient viruses. Among them, EVEs derived from bornaviruses, non-retroviral RNA viruses, have been relatively well studied. Bornavirus-derived EVEs are widely distributed in animal genomes, including the human genome, and the history of bornaviruses can be dated back to more than 65 million years. Although there are several reports focusing on the biological significance of bornavirus-derived sequences in mammals, paleovirology of bornaviruses has not yet been well described and summarized. In this paper, we describe what can be learned about bornaviruses from endogenous bornavirus-like elements from the view of paleovirology using published results and our novel data. Copyright © 2018 Elsevier B.V. All rights reserved.
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
Remnants of an Ancient Deltaretrovirus in the Genomes of Horseshoe Bats (Rhinolophidae).
Hron, Tomáš; Farkašová, Helena; Gifford, Robert J; Benda, Petr; Hulva, Pavel; Görföl, Tamás; Pačes, Jan; Elleder, Daniel
2018-04-10
Endogenous retrovirus (ERV) sequences provide a rich source of information about the long-term interactions between retroviruses and their hosts. However, most ERVs are derived from a subset of retrovirus groups, while ERVs derived from certain other groups remain extremely rare. In particular, only a single ERV sequence has been identified that shows evidence of being related to an ancient Deltaretrovirus , despite the large number of vertebrate genome sequences now available. In this report, we identify a second example of an ERV sequence putatively derived from a past deltaretroviral infection, in the genomes of several species of horseshoe bats (Rhinolophidae). This sequence represents a fragment of viral genome derived from a single integration. The time of the integration was estimated to be 11-19 million years ago. This finding, together with the previously identified endogenous Deltaretrovirus in long-fingered bats (Miniopteridae), suggest a close association of bats with ancient deltaretroviruses.
NASA Astrophysics Data System (ADS)
Qiu, Junchao; Zhang, Lin; Li, Diyang; Liu, Xingcheng
2016-06-01
Chaotic sequences can be applied to realize multiple user access and improve the system security for a visible light communication (VLC) system. However, since the map patterns of chaotic sequences are usually well known, eavesdroppers can possibly derive the key parameters of chaotic sequences and subsequently retrieve the information. We design an advanced encryption standard (AES) interleaving aided multiple user access scheme to enhance the security of a chaotic code division multiple access-based visible light communication (C-CDMA-VLC) system. We propose to spread the information with chaotic sequences, and then the spread information is interleaved by an AES algorithm and transmitted over VLC channels. Since the computation complexity of performing inverse operations to deinterleave the information is high, the eavesdroppers in a high speed VLC system cannot retrieve the information in real time; thus, the system security will be enhanced. Moreover, we build a mathematical model for the AES-aided VLC system and derive the theoretical information leakage to analyze the system security. The simulations are performed over VLC channels, and the results demonstrate the effectiveness and high security of our presented AES interleaving aided chaotic CDMA-VLC system.
Generating Models of Surgical Procedures using UMLS Concepts and Multiple Sequence Alignment
Meng, Frank; D’Avolio, Leonard W.; Chen, Andrew A.; Taira, Ricky K.; Kangarloo, Hooshang
2005-01-01
Surgical procedures can be viewed as a process composed of a sequence of steps performed on, by, or with the patient’s anatomy. This sequence is typically the pattern followed by surgeons when generating surgical report narratives for documenting surgical procedures. This paper describes a methodology for semi-automatically deriving a model of conducted surgeries, utilizing a sequence of derived Unified Medical Language System (UMLS) concepts for representing surgical procedures. A multiple sequence alignment was computed from a collection of such sequences and was used for generating the model. These models have the potential of being useful in a variety of informatics applications such as information retrieval and automatic document generation. PMID:16779094
FARME DB: a functional antibiotic resistance element database
Wallace, James C.; Port, Jesse A.; Smith, Marissa N.; Faustman, Elaine M.
2017-01-01
Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publically available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL: http://staff.washington.edu/jwallace/farme PMID:28077567
Lopopolo, Alessandro; Frank, Stefan L; van den Bosch, Antal; Willems, Roel M
2017-01-01
Language comprehension involves the simultaneous processing of information at the phonological, syntactic, and lexical level. We track these three distinct streams of information in the brain by using stochastic measures derived from computational language models to detect neural correlates of phoneme, part-of-speech, and word processing in an fMRI experiment. Probabilistic language models have proven to be useful tools for studying how language is processed as a sequence of symbols unfolding in time. Conditional probabilities between sequences of words are at the basis of probabilistic measures such as surprisal and perplexity which have been successfully used as predictors of several behavioural and neural correlates of sentence processing. Here we computed perplexity from sequences of words and their parts of speech, and their phonemic transcriptions. Brain activity time-locked to each word is regressed on the three model-derived measures. We observe that the brain keeps track of the statistical structure of lexical, syntactic and phonological information in distinct areas.
Deng, Youping; Dong, Yinghua; Thodima, Venkata; Clem, Rollie J; Passarelli, A Lorena
2006-01-01
Background Little is known about the genome sequences of lepidopteran insects, although this group of insects has been studied extensively in the fields of endocrinology, development, immunity, and pathogen-host interactions. In addition, cell lines derived from Spodoptera frugiperda and other lepidopteran insects are routinely used for baculovirus foreign gene expression. This study reports the results of an expressed sequence tag (EST) sequencing project in cells from the lepidopteran insect S. frugiperda, the fall armyworm. Results We have constructed an EST database using two cDNA libraries from the S. frugiperda-derived cell line, SF-21. The database consists of 2,367 ESTs which were assembled into 244 contigs and 951 singlets for a total of 1,195 unique sequences. Conclusion S. frugiperda is an agriculturally important pest insect and genomic information will be instrumental for establishing initial transcriptional profiling and gene function studies, and for obtaining information about genes manipulated during infections by insect pathogens such as baculoviruses. PMID:17052344
NASA Astrophysics Data System (ADS)
Shekhar, Karthik; Ruberman, Claire F.; Ferguson, Andrew L.; Barton, John P.; Kardar, Mehran; Chakraborty, Arup K.
2013-12-01
Mutational escape from vaccine-induced immune responses has thwarted the development of a successful vaccine against AIDS, whose causative agent is HIV, a highly mutable virus. Knowing the virus' fitness as a function of its proteomic sequence can enable rational design of potent vaccines, as this information can focus vaccine-induced immune responses to target mutational vulnerabilities of the virus. Spin models have been proposed as a means to infer intrinsic fitness landscapes of HIV proteins from patient-derived viral protein sequences. These sequences are the product of nonequilibrium viral evolution driven by patient-specific immune responses and are subject to phylogenetic constraints. How can such sequence data allow inference of intrinsic fitness landscapes? We combined computer simulations and variational theory á la Feynman to show that, in most circumstances, spin models inferred from patient-derived viral sequences reflect the correct rank order of the fitness of mutant viral strains. Our findings are relevant for diverse viruses.
NASA Astrophysics Data System (ADS)
Jiao, Yong; Wakakuwa, Eyuri; Ogawa, Tomohiro
2018-02-01
We consider asymptotic convertibility of an arbitrary sequence of bipartite pure states into another by local operations and classical communication (LOCC). We adopt an information-spectrum approach to address cases where each element of the sequences is not necessarily a tensor power of a bipartite pure state. We derive necessary and sufficient conditions for the LOCC convertibility of one sequence to another in terms of spectral entropy rates of entanglement of the sequences. Based on these results, we also provide simple proofs for previously known results on the optimal rates of entanglement concentration and dilution of general sequences of bipartite pure states.
The Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to address needs for rapid, cost effective methods of species extrapolation of chemical susceptibility. Specifically, the SeqAPASS tool compares the primary sequence (Level 1), functiona...
Kayesh, E; Bilkish, N; Liu, G S; Chen, W; Leng, X P; Fang, J G
2014-03-31
Among different classes of molecular markers, expressed sequence tags (ESTs) are a new resource for developing simple sequence repeat (SSR) functional markers for genotyping and genetic mapping in F1 hybrid populations of Vitis vinifera L. Recently, because of the availability of an enormous amount of data for ESTs in the public domain, the emphasis has shifted from genomic SSRs to EST-SSRs, which belong to transcribed regions of the genome and may have a role in gene expression or function. The objective of this study was to assess the polymorphisms among 94 F1 hybrids from "Early Rose" and "Red Globe" using 25 EST-derived and 25 non-EST SSR markers. A total collection of 362,375 grape ESTs that were retrieved from the National Center for Biotechnology Information (NCBI) and 2522 EST-SSR sequences were identified. From them, 205 primer pairs were randomly selected, including 176 pairs that were EST-derived and 29 non-EST SSR primer pairs, for polymerase chain reaction amplification. A total of 131 alleles were amplified using 50 pairs of primers; 78 alleles were amplified using EST-derived SSR primers and 53 were from non-EST SSR primers. At most, 6 and 5 alleles were amplified by EST-derived and non-EST SSR primers, respectively. The EST-derived SSR markers showed a maximum polymorphic information content (PIC) value of 1 and a minimum of 0.33 while non-EST SSR markers had maximum and minimum PIC values of 1 and 0.25, respectively. The average PIC value was 0.56 for EST-derived SSR markers and 0.45 for non-EST SSR markers.
Kumar, Yadhu; Westram, Ralf; Kipfer, Peter; Meier, Harald; Ludwig, Wolfgang
2006-01-01
Background Availability of high-resolution RNA crystal structures for the 30S and 50S ribosomal subunits and the subsequent validation of comparative secondary structure models have prompted the biologists to use three-dimensional structure of ribosomal RNA (rRNA) for evaluating sequence alignments of rRNA genes. Furthermore, the secondary and tertiary structural features of rRNA are highly useful and successfully employed in designing rRNA targeted oligonucleotide probes intended for in situ hybridization experiments. RNA3D, a program to combine sequence alignment information with three-dimensional structure of rRNA was developed. Integration into ARB software package, which is used extensively by the scientific community for phylogenetic analysis and molecular probe designing, has substantially extended the functionality of ARB software suite with 3D environment. Results Three-dimensional structure of rRNA is visualized in OpenGL 3D environment with the abilities to change the display and overlay information onto the molecule, dynamically. Phylogenetic information derived from the multiple sequence alignments can be overlaid onto the molecule structure in a real time. Superimposition of both statistical and non-statistical sequence associated information onto the rRNA 3D structure can be done using customizable color scheme, which is also applied to a textual sequence alignment for reference. Oligonucleotide probes designed by ARB probe design tools can be mapped onto the 3D structure along with the probe accessibility models for evaluation with respect to secondary and tertiary structural conformations of rRNA. Conclusion Visualization of three-dimensional structure of rRNA in an intuitive display provides the biologists with the greater possibilities to carry out structure based phylogenetic analysis. Coupled with secondary structure models of rRNA, RNA3D program aids in validating the sequence alignments of rRNA genes and evaluating probe target sites. Superimposition of the information derived from the multiple sequence alignment onto the molecule dynamically allows the researchers to observe any sequence inherited characteristics (phylogenetic information) in real-time environment. The extended ARB software package is made freely available for the scientific community via . PMID:16672074
Derivational Suffixes as Cues to Stress Position in Reading Greek
ERIC Educational Resources Information Center
Grimani, Aikaterini; Protopapas, Athanassios
2017-01-01
Background: In languages with lexical stress, reading aloud must include stress assignment. Stress information sources across languages include word-final letter sequences. Here, we examine whether such sequences account for stress assignment in Greek and whether this is attributable to absolute rules involving accenting morphemes or to…
CRITICA: coding region identification tool invoking comparative analysis
NASA Technical Reports Server (NTRS)
Badger, J. H.; Olsen, G. J.; Woese, C. R. (Principal Investigator)
1999-01-01
Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu in/pub/critica) and on the World Wide Web (http:/(/)rdpwww.life.uiuc.edu).
Restoration of distorted depth maps calculated from stereo sequences
NASA Technical Reports Server (NTRS)
Damour, Kevin; Kaufman, Howard
1991-01-01
A model-based Kalman estimator is developed for spatial-temporal filtering of noise and other degradations in velocity and depth maps derived from image sequences or cinema. As an illustration of the proposed procedures, edge information from image sequences of rigid objects is used in the processing of the velocity maps by selecting from a series of models for directional adaptive filtering. Adaptive filtering then allows for noise reduction while preserving sharpness in the velocity maps. Results from several synthetic and real image sequences are given.
Sanderson, Nicholas D.; Atkins, Bridget L.; Brent, Andrew J.; Cole, Kevin; Foster, Dona; McNally, Martin A.; Oakley, Sarah; Peto, Leon; Taylor, Adrian; Peto, Tim E. A.; Crook, Derrick W.; Eyre, David W.
2017-01-01
ABSTRACT Culture of multiple periprosthetic tissue samples is the current gold standard for microbiological diagnosis of prosthetic joint infections (PJI). Additional diagnostic information may be obtained through culture of sonication fluid from explants. However, current techniques can have relatively low sensitivity, with prior antimicrobial therapy and infection by fastidious organisms influencing results. We assessed if metagenomic sequencing of total DNA extracts obtained direct from sonication fluid can provide an alternative rapid and sensitive tool for diagnosis of PJI. We compared metagenomic sequencing with standard aerobic and anaerobic culture in 97 sonication fluid samples from prosthetic joint and other orthopedic device infections. Reads from Illumina MiSeq sequencing were taxonomically classified using Kraken. Using 50 derivation samples, we determined optimal thresholds for the number and proportion of bacterial reads required to identify an infection and confirmed our findings in 47 independent validation samples. Compared to results from sonication fluid culture, the species-level sensitivity of metagenomic sequencing was 61/69 (88%; 95% confidence interval [CI], 77 to 94%; for derivation samples 35/38 [92%; 95% CI, 79 to 98%]; for validation samples, 26/31 [84%; 95% CI, 66 to 95%]), and genus-level sensitivity was 64/69 (93%; 95% CI, 84 to 98%). Species-level specificity, adjusting for plausible fastidious causes of infection, species found in concurrently obtained tissue samples, and prior antibiotics, was 85/97 (88%; 95% CI, 79 to 93%; for derivation samples, 43/50 [86%; 95% CI, 73 to 94%]; for validation samples, 42/47 [89%; 95% CI, 77 to 96%]). High levels of human DNA contamination were seen despite the use of laboratory methods to remove it. Rigorous laboratory good practice was required to minimize bacterial DNA contamination. We demonstrate that metagenomic sequencing can provide accurate diagnostic information in PJI. Our findings, combined with the increasing availability of portable, random-access sequencing technology, offer the potential to translate metagenomic sequencing into a rapid diagnostic tool in PJI. PMID:28490492
Sputnik: a database platform for comparative plant genomics.
Rudd, Stephen; Mewes, Hans-Werner; Mayer, Klaus F X
2003-01-01
Two million plant ESTs, from 20 different plant species, and totalling more than one 1000 Mbp of DNA sequence, represents a formidable transcriptomic resource. Sputnik uses the potential of this sequence resource to fill some of the information gap in the un-sequenced plant genomes and to serve as the foundation for in silicio comparative plant genomics. The complexity of the individual EST collections has been reduced using optimised EST clustering techniques. Annotation of cluster sequences is performed by exploiting and transferring information from the comprehensive knowledgebase already produced for the completed model plant genome (Arabidopsis thaliana) and by performing additional state of-the-art sequence analyses relevant to today's plant biologist. Functional predictions, comparative analyses and associative annotations for 500 000 plant EST derived peptides make Sputnik (http://mips.gsf.de/proj/sputnik/) a valid platform for contemporary plant genomics.
Sputnik: a database platform for comparative plant genomics
Rudd, Stephen; Mewes, Hans-Werner; Mayer, Klaus F.X.
2003-01-01
Two million plant ESTs, from 20 different plant species, and totalling more than one 1000 Mbp of DNA sequence, represents a formidable transcriptomic resource. Sputnik uses the potential of this sequence resource to fill some of the information gap in the un-sequenced plant genomes and to serve as the foundation for in silicio comparative plant genomics. The complexity of the individual EST collections has been reduced using optimised EST clustering techniques. Annotation of cluster sequences is performed by exploiting and transferring information from the comprehensive knowledgebase already produced for the completed model plant genome (Arabidopsis thaliana) and by performing additional state of-the-art sequence analyses relevant to today's plant biologist. Functional predictions, comparative analyses and associative annotations for 500 000 plant EST derived peptides make Sputnik (http://mips.gsf.de/proj/sputnik/) a valid platform for contemporary plant genomics. PMID:12519965
Labudde, Dirk
2015-01-01
The importance of short membrane sequence motifs has been shown in many works and emphasizes the related sequence motif analysis. Together with specific transmembrane helix-helix interactions, the analysis of interacting sequence parts is helpful for understanding the process during membrane protein folding and in retaining the three-dimensional fold. Here we present a simple high-throughput analysis method for deriving mutational information of interacting sequence parts. Applied on aquaporin water channel proteins, our approach supports the analysis of mutational variants within different interacting subsequences and finally the investigation of natural variants which cause diseases like, for example, nephrogenic diabetes insipidus. In this work we demonstrate a simple method for massive membrane protein data analysis. As shown, the presented in silico analyses provide information about interacting sequence parts which are constrained by protein evolution. We present a simple graphical visualization medium for the representation of evolutionary influenced interaction pattern pairs (EIPPs) adapted to mutagen investigations of aquaporin-2, a protein whose mutants are involved in the rare endocrine disorder known as nephrogenic diabetes insipidus, and membrane proteins in general. Furthermore, we present a new method to derive new evolutionary variations within EIPPs which can be used for further mutagen laboratory investigations. PMID:26180540
Grunert, Steffen; Labudde, Dirk
2015-01-01
The importance of short membrane sequence motifs has been shown in many works and emphasizes the related sequence motif analysis. Together with specific transmembrane helix-helix interactions, the analysis of interacting sequence parts is helpful for understanding the process during membrane protein folding and in retaining the three-dimensional fold. Here we present a simple high-throughput analysis method for deriving mutational information of interacting sequence parts. Applied on aquaporin water channel proteins, our approach supports the analysis of mutational variants within different interacting subsequences and finally the investigation of natural variants which cause diseases like, for example, nephrogenic diabetes insipidus. In this work we demonstrate a simple method for massive membrane protein data analysis. As shown, the presented in silico analyses provide information about interacting sequence parts which are constrained by protein evolution. We present a simple graphical visualization medium for the representation of evolutionary influenced interaction pattern pairs (EIPPs) adapted to mutagen investigations of aquaporin-2, a protein whose mutants are involved in the rare endocrine disorder known as nephrogenic diabetes insipidus, and membrane proteins in general. Furthermore, we present a new method to derive new evolutionary variations within EIPPs which can be used for further mutagen laboratory investigations.
Protein location prediction using atomic composition and global features of the amino acid sequence
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cherian, Betsy Sheena, E-mail: betsy.skb@gmail.com; Nair, Achuthsankar S.
2010-01-22
Subcellular location of protein is constructive information in determining its function, screening for drug candidates, vaccine design, annotation of gene products and in selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full length protein sequence to incorporate more biological information. A new biological feature, distribution of atomic composition is effectivelymore » used with, multiple physiochemical properties, amino acid composition, three part amino acid composition, and sequence similarity for predicting the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system makes prediction with an accuracy of 100, 82.47, 88.81 for self-consistency test, jackknife test and independent data test respectively. Our results provide evidence that the prediction based on the biological features derived from the full length amino acid sequence gives better accuracy than those derived from N-terminal alone. Considering the features as a distribution within the entire sequence will bring out underlying property distribution to a greater detail to enhance the prediction accuracy.« less
Ramu, P; Kassahun, B; Senthilvel, S; Ashok Kumar, C; Jayashree, B; Folkertsma, R T; Reddy, L Ananda; Kuruvinashetti, M S; Haussmann, B I G; Hash, C T
2009-11-01
The sequencing and detailed comparative functional analysis of genomes of a number of select botanical models open new doors into comparative genomics among the angiosperms, with potential benefits for improvement of many orphan crops that feed large populations. In this study, a set of simple sequence repeat (SSR) markers was developed by mining the expressed sequence tag (EST) database of sorghum. Among the SSR-containing sequences, only those sharing considerable homology with rice genomic sequences across the lengths of the 12 rice chromosomes were selected. Thus, 600 SSR-containing sorghum EST sequences (50 homologous sequences on each of the 12 rice chromosomes) were selected, with the intention of providing coverage for corresponding homologous regions of the sorghum genome. Primer pairs were designed and polymorphism detection ability was assessed using parental pairs of two existing sorghum mapping populations. About 28% of these new markers detected polymorphism in this 4-entry panel. A subset of 55 polymorphic EST-derived SSR markers were mapped onto the existing skeleton map of a recombinant inbred population derived from cross N13 x E 36-1, which is segregating for Striga resistance and the stay-green component of terminal drought tolerance. These new EST-derived SSR markers mapped across all 10 sorghum linkage groups, mostly to regions expected based on prior knowledge of rice-sorghum synteny. The ESTs from which these markers were derived were then mapped in silico onto the aligned sorghum genome sequence, and 88% of the best hits corresponded to linkage-based positions. This study demonstrates the utility of comparative genomic information in targeted development of markers to fill gaps in linkage maps of related crop species for which sufficient genomic tools are not available.
The 1985 U.S. Environmental Protection Agency (EPA) Guidelines for Deriving Aquatic Life Criteria (ALC) require acute and chronic toxicity testing with a fixed list of taxa that cover aquatic organisms from vertebrates, invertebrates, and plants. In considering Guideline revision...
Barnes, D W
2012-04-01
Two of the most commonly used elasmobranch experimental model species are the spiny dogfish Squalus acanthias and the little skate Leucoraja erinacea. Comparative biology and genomics with these species have provided useful information in physiology, pharmacology, toxicology, immunology, evolutionary developmental biology and genetics. A wealth of information has been obtained using in vitro approaches to study isolated cells and tissues from these organisms under circumstances in which the extracellular environment can be controlled. In addition to classical work with primary cell cultures, continuously proliferating cell lines have been derived recently, representing the first cell lines from cartilaginous fishes. These lines have proved to be valuable tools with which to explore functional genomic and biological questions and to test hypotheses at the molecular level. In genomic experiments, complementary (c)DNA libraries have been constructed, and c. 8000 unique transcripts identified, with over 3000 representing previously unknown gene sequences. A sub-set of messenger (m)RNAs has been detected for which the 3' untranslated regions show elements that are remarkably well conserved evolutionarily, representing novel, potentially regulatory gene sequences. The cell culture systems provide physiologically valid tools to study functional roles of these sequences and other aspects of elasmobranch molecular cell biology and physiology. Information derived from the use of in vitro cell cultures is valuable in revealing gene diversity and information for genomic sequence assembly, as well as for identification of new genes and molecular markers, construction of gene-array probes and acquisition of full-length cDNA sequences. © 2012 The Author. Journal of Fish Biology © 2012 The Fisheries Society of the British Isles.
Palmer, Lance E; Dejori, Mathaeus; Bolanos, Randall; Fasulo, Daniel
2010-01-15
With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.
Apweiler, R; Gateau, A; Contrino, S; Martin, M J; Junker, V; O'Donovan, C; Lang, F; Mitaritonna, N; Kappus, S; Bairoch, A
1997-01-01
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROTs. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.
MeCorS: Metagenome-enabled error correction of single cell sequencing reads
Bremges, Andreas; Singer, Esther; Woyke, Tanja; ...
2016-03-15
Here we present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies.
Kamatuka, Kenta; Hattori, Masahiro; Sugiyama, Tomoyasu
2016-12-01
RNA interference (RNAi) screening is extensively used in the field of reverse genetics. RNAi libraries constructed using random oligonucleotides have made this technology affordable. However, the new methodology requires exploration of the RNAi target gene information after screening because the RNAi library includes non-natural sequences that are not found in genes. Here, we developed a web-based tool to support RNAi screening. The system performs short hairpin RNA (shRNA) target prediction that is informed by comprehensive enquiry (SPICE). SPICE automates several tasks that are laborious but indispensable to evaluate the shRNAs obtained by RNAi screening. SPICE has four main functions: (i) sequence identification of shRNA in the input sequence (the sequence might be obtained by sequencing clones in the RNAi library), (ii) searching the target genes in the database, (iii) demonstrating biological information obtained from the database, and (iv) preparation of search result files that can be utilized in a local personal computer (PC). Using this system, we demonstrated that genes targeted by random oligonucleotide-derived shRNAs were not different from those targeted by organism-specific shRNA. The system facilitates RNAi screening, which requires sequence analysis after screening. The SPICE web application is available at http://www.spice.sugysun.org/.
Protein Sequencing with Tandem Mass Spectrometry
NASA Astrophysics Data System (ADS)
Ziady, Assem G.; Kinter, Michael
The recent introduction of electrospray ionization techniques that are suitable for peptides and whole proteins has allowed for the design of mass spectrometric protocols that provide accurate sequence information for proteins. The advantages gained by these approaches over traditional Edman Degradation sequencing include faster analysis and femtomole, sometimes attomole, sensitivity. The ability to efficiently identify proteins has allowed investigators to conduct studies on their differential expression or modification in response to various treatments or disease states. In this chapter, we discuss the use of electrospray tandem mass spectrometry, a technique whereby protein-derived peptides are subjected to fragmentation in the gas phase, revealing sequence information for the protein. This powerful technique has been instrumental for the study of proteins and markers associated with various disorders, including heart disease, cancer, and cystic fibrosis. We use the study of protein expression in cystic fibrosis as an example.
Georgi, Laura; Johnson-Cicalese, Jennifer; Honig, Josh; Das, Sushma Parankush; Rajah, Veeran D; Bhattacharya, Debashish; Bassil, Nahla; Rowland, Lisa J; Polashock, James; Vorsa, Nicholi
2013-03-01
The first genetic map of cranberry (Vaccinium macrocarpon) has been constructed, comprising 14 linkage groups totaling 879.9 cM with an estimated coverage of 82.2 %. This map, based on four mapping populations segregating for field fruit-rot resistance, contains 136 distinct loci. Mapped markers include blueberry-derived simple sequence repeat (SSR) and cranberry-derived sequence-characterized amplified region markers previously used for fingerprinting cranberry cultivars. In addition, SSR markers were developed near cranberry sequences resembling genes involved in flavonoid biosynthesis or defense against necrotrophic pathogens, or conserved orthologous set (COS) sequences. The cranberry SSRs were developed from next-generation cranberry genomic sequence assemblies; thus, the positions of these SSRs on the genomic map provide information about the genomic location of the sequence scaffold from which they were derived. The use of SSR markers near COS and other functional sequences, plus 33 SSR markers from blueberry, facilitates comparisons of this map with maps of other plant species. Regions of the cranberry map were identified that showed conservation of synteny with Vitis vinifera and Arabidopsis thaliana. Positioned on this map are quantitative trait loci (QTL) for field fruit-rot resistance (FFRR), fruit weight, titratable acidity, and sound fruit yield (SFY). The SFY QTL is adjacent to one of the fruit weight QTL and may reflect pleiotropy. Two of the FFRR QTL are in regions of conserved synteny with grape and span defense gene markers, and the third FFRR QTL spans a flavonoid biosynthetic gene.
Information theory applications for biological sequence analysis.
Vinga, Susana
2014-05-01
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Robust temporal alignment of multimodal cardiac sequences
NASA Astrophysics Data System (ADS)
Perissinotto, Andrea; Queirós, Sandro; Morais, Pedro; Baptista, Maria J.; Monaghan, Mark; Rodrigues, Nuno F.; D'hooge, Jan; Vilaça, João. L.; Barbosa, Daniel
2015-03-01
Given the dynamic nature of cardiac function, correct temporal alignment of pre-operative models and intraoperative images is crucial for augmented reality in cardiac image-guided interventions. As such, the current study focuses on the development of an image-based strategy for temporal alignment of multimodal cardiac imaging sequences, such as cine Magnetic Resonance Imaging (MRI) or 3D Ultrasound (US). First, we derive a robust, modality-independent signal from the image sequences, estimated by computing the normalized cross-correlation between each frame in the temporal sequence and the end-diastolic frame. This signal is a resembler for the left-ventricle (LV) volume curve over time, whose variation indicates different temporal landmarks of the cardiac cycle. We then perform the temporal alignment of these surrogate signals derived from MRI and US sequences of the same patient through Dynamic Time Warping (DTW), allowing to synchronize both sequences. The proposed framework was evaluated in 98 patients, which have undergone both 3D+t MRI and US scans. The end-systolic frame could be accurately estimated as the minimum of the image-derived surrogate signal, presenting a relative error of 1.6 +/- 1.9% and 4.0 +/- 4.2% for the MRI and US sequences, respectively, thus supporting its association with key temporal instants of the cardiac cycle. The use of DTW reduces the desynchronization of the cardiac events in MRI and US sequences, allowing to temporally align multimodal cardiac imaging sequences. Overall, a generic, fast and accurate method for temporal synchronization of MRI and US sequences of the same patient was introduced. This approach could be straightforwardly used for the correct temporal alignment of pre-operative MRI information and intra-operative US images.
Wu, Tsung-Jung; Shamsaddini, Amirhossein; Pan, Yang; Smith, Krista; Crichton, Daniel J; Simonyan, Vahan; Mazumder, Raja
2014-01-01
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators has led to a rich repository of information on functional sites of genes and proteins. This information along with variation-related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform HIVE (High-performance Integrated Virtual Environment) for storing, analyzing, computing and curating NGS data and associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identifications of novel or common nsSNVs that can be prioritized in validation studies. Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu.
On analytic design of loudspeaker arrays with uniform radiation characteristics
Aarts; Janssen
2000-01-01
Some notes on analytical derived loudspeaker arrays with uniform radiation characteristics are presented. The array coefficients are derived via analytical means and compared with so-called maximal flat sequences known from telecommunications and information theory. It appears that the newly derived array, i.e., the quadratic phase array, has a higher efficiency than the Bessel array and a flatter response than the Barker array. The method discussed admits generalization to the design of arrays with desired nonuniform radiating characteristics.
BAC sequencing using pooled methods.
Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina
2015-01-01
Shotgun sequencing and assembly of a large, complex genome can be both expensive and challenging to accurately reconstruct the true genome sequence. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are main factors that plague de novo genome sequencing projects that typically result in highly fragmented assemblies and are difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences the genome fraction that encodes the relevant biological information, then it is possible to reduce overall sequencing costs and efforts that target a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
A better sequence-read simulator program for metagenomics.
Johnson, Stephen; Trost, Brett; Long, Jeffrey R; Pittet, Vanessa; Kusalik, Anthony
2014-01-01
There are many programs available for generating simulated whole-genome shotgun sequence reads. The data generated by many of these programs follow predefined models, which limits their use to the authors' original intentions. For example, many models assume that read lengths follow a uniform or normal distribution. Other programs generate models from actual sequencing data, but are limited to reads from single-genome studies. To our knowledge, there are no programs that allow a user to generate simulated data following non-parametric read-length distributions and quality profiles based on empirically-derived information from metagenomics sequencing data. We present BEAR (Better Emulation for Artificial Reads), a program that uses a machine-learning approach to generate reads with lengths and quality values that closely match empirically-derived distributions. BEAR can emulate reads from various sequencing platforms, including Illumina, 454, and Ion Torrent. BEAR requires minimal user input, as it automatically determines appropriate parameter settings from user-supplied data. BEAR also uses a unique method for deriving run-specific error rates, and extracts useful statistics from the metagenomic data itself, such as quality-error models. Many existing simulators are specific to a particular sequencing technology; however, BEAR is not restricted in this way. Because of its flexibility, BEAR is particularly useful for emulating the behaviour of technologies like Ion Torrent, for which no dedicated sequencing simulators are currently available. BEAR is also the first metagenomic sequencing simulator program that automates the process of generating abundances, which can be an arduous task. BEAR is useful for evaluating data processing tools in genomics. It has many advantages over existing comparable software, such as generating more realistic reads and being independent of sequencing technology, and has features particularly useful for metagenomics work.
Du, Xiuquan; Hu, Changlin; Yao, Yu; Sun, Shiwei; Zhang, Yanping
2017-12-12
In bioinformatics, exon skipping (ES) event prediction is an essential part of alternative splicing (AS) event analysis. Although many methods have been developed to predict ES events, a solution has yet to be found. In this study, given the limitations of machine learning algorithms with RNA-Seq data or genome sequences, a new feature, called RS (RNA-seq and sequence) features, was constructed. These features include RNA-Seq features derived from the RNA-Seq data and sequence features derived from genome sequences. We propose a novel Rotation Forest classifier to predict ES events with the RS features (RotaF-RSES). To validate the efficacy of RotaF-RSES, a dataset from two human tissues was used, and RotaF-RSES achieved an accuracy of 98.4%, a specificity of 99.2%, a sensitivity of 94.1%, and an area under the curve (AUC) of 98.6%. When compared to the other available methods, the results indicate that RotaF-RSES is efficient and can predict ES events with RS features.
Hidden Markov models of biological primary sequence information.
Baldi, P; Chauvin, Y; Hunkapiller, T; McClure, M A
1994-01-01
Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN2) operations, linear in the number of sequences. PMID:8302831
Bartels, Daniela; Kespohl, Sebastian; Albaum, Stefan; Drüke, Tanja; Goesmann, Alexander; Herold, Julia; Kaiser, Olaf; Pühler, Alfred; Pfeiffer, Friedhelm; Raddatz, Günter; Stoye, Jens; Meyer, Folker; Schuster, Stephan C
2005-04-01
We provide the graphical tool BACCardI for the construction of virtual clone maps from standard assembler output files or BLAST based sequence comparisons. This new tool has been applied to numerous genome projects to solve various problems including (a) validation of whole genome shotgun assemblies, (b) support for contig ordering in the finishing phase of a genome project, and (c) intergenome comparison between related strains when only one of the strains has been sequenced and a large insert library is available for the other. The BACCardI software can seamlessly interact with various sequence assembly packages. Genomic assemblies generated from sequence information need to be validated by independent methods such as physical maps. The time-consuming task of building physical maps can be circumvented by virtual clone maps derived from read pair information of large insert libraries.
Schuster, W; Brennicke, A
1987-01-01
We describe an open reading frame (ORF) with high homology to reverse transcriptase in the mitochondrial genome of Oenothera. This ORF displays all the characteristics of an active plant mitochondrial gene with a possible ribosome binding site and 39% T in the third codon position. It is located between a sequence fragment from the plastid genome and one of nuclear origin downstream from the gene encoding subunit 5 of the NADH dehydrogenase. The nuclear derived sequence consists of 528 nucleotides from the small ribosomal RNA and contains an expansion segment unique to nuclear rRNAs. The plastid sequence contains part of the ribosomal protein S4 and the complete tRNA(Ser). The observation that only transcribed sequences have been found i more than one subcellular compartment in higher plants suggests that interorganellar transfer of genetic information may occur via RNA and subsequent local reverse transcription and genomic integration. PMID:14650433
Xylella genomics and bacterial pathogenicity to plants.
Dow, J M; Daniels, M J
2000-12-01
Xylella fastidiosa, a pathogen of citrus, is the first plant pathogenic bacterium for which the complete genome sequence has been published. Inspection of the sequence reveals high relatedness to many genes of other pathogens, notably Xanthomonas campestris. Based on this, we suggest that Xylella possesses certain easily testable properties that contribute to pathogenicity. We also present some general considerations for deriving information on pathogenicity from bacterial genomics. Copyright 2000 John Wiley & Sons, Ltd.
Forest, David; Nishikawa, Ryuhei; Kobayashi, Hiroshi; Parton, Angela; Bayne, Christopher J.; Barnes, David W.
2007-01-01
We have established a cartilaginous fish cell line [Squalus acanthias embryo cell line (SAE)], a mesenchymal stem cell line derived from the embryo of an elasmobranch, the spiny dogfish shark S. acanthias. Elasmobranchs (sharks and rays) first appeared >400 million years ago, and existing species provide useful models for comparative vertebrate cell biology, physiology, and genomics. Comparative vertebrate genomics among evolutionarily distant organisms can provide sequence conservation information that facilitates identification of critical coding and noncoding regions. Although these genomic analyses are informative, experimental verification of functions of genomic sequences depends heavily on cell culture approaches. Using ESTs defining mRNAs derived from the SAE cell line, we identified lengthy and highly conserved gene-specific nucleotide sequences in the noncoding 3′ UTRs of eight genes involved in the regulation of cell growth and proliferation. Conserved noncoding 3′ mRNA regions detected by using the shark nucleotide sequences as a starting point were found in a range of other vertebrate orders, including bony fish, birds, amphibians, and mammals. Nucleotide identity of shark and human in these regions was remarkably well conserved. Our results indicate that highly conserved gene sequences dating from the appearance of jawed vertebrates and representing potential cis-regulatory elements can be identified through the use of cartilaginous fish as a baseline. Because the expression of genes in the SAE cell line was prerequisite for their identification, this cartilaginous fish culture system also provides a physiologically valid tool to test functional hypotheses on the role of these ancient conserved sequences in comparative cell biology. PMID:17227856
Objects and processes: Two notions for understanding biological information.
Mercado-Reyes, Agustín; Padilla-Longoria, Pablo; Arroyo-Santos, Alfonso
2015-09-07
In spite of being ubiquitous in life sciences, the concept of information is harshly criticized. Uses of the concept other than those derived from Shannon׳s theory are denounced as metaphoric. We perform a computational experiment to explore whether Shannon׳s information is adequate to describe the uses of said concept in commonplace scientific practice. Our results show that semantic sequences do not have unique complexity values different from the value of meaningless sequences. This result suggests that quantitative theoretical frameworks do not account fully for the complex phenomenon that the term "information" refers to. We propose a restructuring of the concept into two related, but independent notions, and conclude that a complete theory of biological information must account completely not only for both notions, but also for the relationship between them. Copyright © 2015 Elsevier Ltd. All rights reserved.
Inferring Higher Functional Information for RIKEN Mouse Full-Length cDNA Clones With FACTS
Nagashima, Takeshi; Silva, Diego G.; Petrovsky, Nikolai; Socha, Luis A.; Suzuki, Harukazu; Saito, Rintaro; Kasukawa, Takeya; Kurochkin, Igor V.; Konagaya, Akihiko; Schönbach, Christian
2003-01-01
FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies). PMID:12819151
Dlugosch, Katrina M.; Lai, Zhao; Bonin, Aurélie; Hierro, José; Rieseberg, Loren H.
2013-01-01
Transcriptome sequences are becoming more broadly available for multiple individuals of the same species, providing opportunities to derive population genomic information from these datasets. Using the 454 Life Science Genome Sequencer FLX and FLX-Titanium next-generation platforms, we generated 11−430 Mbp of sequence for normalized cDNA for 40 wild genotypes of the invasive plant Centaurea solstitialis, yellow starthistle, from across its worldwide distribution. We examined the impact of sequencing effort on transcriptome recovery and overlap among individuals. To do this, we developed two novel publicly available software pipelines: SnoWhite for read cleaning before assembly, and AllelePipe for clustering of loci and allele identification in assembled datasets with or without a reference genome. AllelePipe is designed specifically for cases in which read depth information is not appropriate or available to assist with disentangling closely related paralogs from allelic variation, as in transcriptome or previously assembled libraries. We find that modest applications of sequencing effort recover most of the novel sequences present in the transcriptome of this species, including single-copy loci and a representative distribution of functional groups. In contrast, the coverage of variable sites, observation of heterozygosity, and overlap among different libraries are all highly dependent on sequencing effort. Nevertheless, the information gained from overlapping regions was informative regarding coarse population structure and variation across our small number of population samples, providing the first genetic evidence in support of hypothesized invasion scenarios. PMID:23390612
Rapid in silico cloning of genes using expressed sequence tags (ESTs).
Gill, R W; Sanseau, P
2000-01-01
Expressed sequence tags (ESTs) are short single-pass DNA sequences obtained from either end of cDNA clones. These ESTs are derived from a vast number of cDNA libraries obtained from different species. Human ESTs are the bulk of the data and have been widely used to identify new members of gene families, as markers on the human chromosomes, to discover polymorphism sites and to compare expression patterns in different tissues or pathologies states. Information strategies have been devised to query EST databases. Since most of the analysis is performed with a computer, the term "in silico" strategy has been coined. In this chapter we will review the current status of EST databases, the pros and cons of EST-type data and describe possible strategies to retrieve meaningful information.
MAGIC database and interfaces: an integrated package for gene discovery and expression.
Cordonnier-Pratt, Marie-Michèle; Liang, Chun; Wang, Haiming; Kolychev, Dmitri S; Sun, Feng; Freeman, Robert; Sullivan, Robert; Pratt, Lee H
2004-01-01
The rapidly increasing rate at which biological data is being produced requires a corresponding growth in relational databases and associated tools that can help laboratories contend with that data. With this need in mind, we describe here a Modular Approach to a Genomic, Integrated and Comprehensive (MAGIC) Database. This Oracle 9i database derives from an initial focus in our laboratory on gene discovery via production and analysis of expressed sequence tags (ESTs), and subsequently on gene expression as assessed by both EST clustering and microarrays. The MAGIC Gene Discovery portion of the database focuses on information derived from DNA sequences and on its biological relevance. In addition to MAGIC SEQ-LIMS, which is designed to support activities in the laboratory, it contains several additional subschemas. The latter include MAGIC Admin for database administration, MAGIC Sequence for sequence processing as well as sequence and clone attributes, MAGIC Cluster for the results of EST clustering, MAGIC Polymorphism in support of microsatellite and single-nucleotide-polymorphism discovery, and MAGIC Annotation for electronic annotation by BLAST and BLAT. The MAGIC Microarray portion is a MIAME-compliant database with two components at present. These are MAGIC Array-LIMS, which makes possible remote entry of all information into the database, and MAGIC Array Analysis, which provides data mining and visualization. Because all aspects of interaction with the MAGIC Database are via a web browser, it is ideally suited not only for individual research laboratories but also for core facilities that serve clients at any distance.
Array-Based Rational Design of Short Peptide Probe-Derived from an Anti-TNT Monoclonal Antibody.
Okochi, Mina; Muto, Masaki; Yanai, Kentaro; Tanaka, Masayoshi; Onodera, Takeshi; Wang, Jin; Ueda, Hiroshi; Toko, Kiyoshi
2017-10-09
Complementarity-determining regions (CDRs) are sites on the variable chains of antibodies responsible for binding to specific antigens. In this study, a short peptide probe for recognition of 2,4,6-trinitrotoluene (TNT), was identified by testing sequences derived from the CDRs of an anti-TNT monoclonal antibody. The major TNT-binding site in this antibody was identified in the heavy chain CDR3 by antigen docking simulation and confirmed by an immunoassay using a spot-synthesis based peptide array comprising amino acid sequences of six CDRs in the variable region. A peptide derived from heavy chain CDR3 (RGYSSFIYWF) bound to TNT with a dissociation constant of 1.3 μM measured by surface plasmon resonance. Substitution of selected amino acids with basic residues increased TNT binding while substitution with acidic amino acids decreased affinity, an isoleucine to arginine change showed the greatest improvement of 1.8-fold. The ability to create simple peptide binders of volatile organic compounds from sequence information provided by the immune system in the creation of an immune response will be beneficial for sensor developments in the future.
Miotto, Olivo; Heiny, AT; Tan, Tin Wee; August, J Thomas; Brusic, Vladimir
2008-01-01
Background The identification of mutations that confer unique properties to a pathogen, such as host range, is of fundamental importance in the fight against disease. This paper describes a novel method for identifying amino acid sites that distinguish specific sets of protein sequences, by comparative analysis of matched alignments. The use of mutual information to identify distinctive residues responsible for functional variants makes this approach highly suitable for analyzing large sets of sequences. To support mutual information analysis, we developed the AVANA software, which utilizes sequence annotations to select sets for comparison, according to user-specified criteria. The method presented was applied to an analysis of influenza A PB2 protein sequences, with the objective of identifying the components of adaptation to human-to-human transmission, and reconstructing the mutation history of these components. Results We compared over 3,000 PB2 protein sequences of human-transmissible and avian isolates, to produce a catalogue of sites involved in adaptation to human-to-human transmission. This analysis identified 17 characteristic sites, five of which have been present in human-transmissible strains since the 1918 Spanish flu pandemic. Sixteen of these sites are located in functional domains, suggesting they may play functional roles in host-range specificity. The catalogue of characteristic sites was used to derive sequence signatures from historical isolates. These signatures, arranged in chronological order, reveal an evolutionary timeline for the adaptation of the PB2 protein to human hosts. Conclusion By providing the most complete elucidation to date of the functional components participating in PB2 protein adaptation to humans, this study demonstrates that mutual information is a powerful tool for comparative characterization of sequence sets. In addition to confirming previously reported findings, several novel characteristic sites within PB2 are reported. Sequence signatures generated using the characteristic sites catalogue characterize concisely the adaptation characteristics of individual isolates. Evolutionary timelines derived from signatures of early human influenza isolates suggest that characteristic variants emerged rapidly, and remained remarkably stable through subsequent pandemics. In addition, the signatures of human-infecting H5N1 isolates suggest that this avian subtype has low pandemic potential at present, although it presents more human adaptation components than most avian subtypes. PMID:18315849
Tso, Kai-Yuen; Lee, Sau Dan; Lo, Kwok-Wai; Yip, Kevin Y
2014-12-23
Patient-derived tumor xenografts in mice are widely used in cancer research and have become important in developing personalized therapies. When these xenografts are subject to DNA sequencing, the samples could contain various amounts of mouse DNA. It has been unclear how the mouse reads would affect data analyses. We conducted comprehensive simulations to compare three alignment strategies at different mutation rates, read lengths, sequencing error rates, human-mouse mixing ratios and sequenced regions. We also sequenced a nasopharyngeal carcinoma xenograft and a cell line to test how the strategies work on real data. We found the "filtering" and "combined reference" strategies performed better than aligning reads directly to human reference in terms of alignment and variant calling accuracies. The combined reference strategy was particularly good at reducing false negative variants calls without significantly increasing the false positive rate. In some scenarios the performance gain of these two special handling strategies was too small for special handling to be cost-effective, but it was found crucial when false non-synonymous SNVs should be minimized, especially in exome sequencing. Our study systematically analyzes the effects of mouse contamination in the sequencing data of human-in-mouse xenografts. Our findings provide information for designing data analysis pipelines for these data.
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Novel numerical and graphical representation of DNA sequences and proteins.
Randić, M; Novic, M; Vikić-Topić, D; Plavsić, D
2006-12-01
We have introduced novel numerical and graphical representations of DNA, which offer a simple and unique characterization of DNA sequences. The numerical representation of a DNA sequence is given as a sequence of real numbers derived from a unique graphical representation of the standard genetic code. There is no loss of information on the primary structure of a DNA sequence associated with this numerical representation. The novel representations are illustrated with the coding sequences of the first exon of beta-globin gene of half a dozen species in addition to human. The method can be extended to proteins as is exemplified by humanin, a 24-aa peptide that has recently been identified as a specific inhibitor of neuronal cell death induced by familial Alzheimer's disease mutant genes.
Volpi, Nicola; Linhardt, Robert J
2012-01-01
Glycosaminoglycans (GAGs) have proven to be very difficult to analyze and characterize because of their high negative charge density, polydispersity and sequence heterogeneity. As the specificity of the interactions between GAGs and proteins results from the structure of these polysaccharides, an understanding of GAG structure is essential for developing a structure–activity relationship. Electrospray ionization (ESI) mass spectrometry (MS) is particularly promising for the analysis of oligosaccharides chemically or enzymatically generated by GAGs because of its relatively soft ionization capacity. Furthermore, on-line high-performance liquid chromatography (HPLC)-MS greatly enhances the characterization of complex mixtures of GAG-derived oligosaccharides, providing important structural information and affording their disaccharide composition. A detailed protocol for producing oligosaccharides from various GAGs, using controlled, specific enzymatic or chemical depolymerization, is presented, together with their HPLC separation, using volatile reversed-phase ion-pairing reagents and on-line ESI-MS structural identification. This analysis provides an oligosaccharide map together with sequence information from a reading frame beginning at the nonreducing end of the GAG chains. The preparation of oligosaccharides can be carried out in 10 h, with subsequent HPLC analysis in 1–2 h and HPLC-MS analysis taking another 2 h. PMID:20448545
Automatic prediction of protein domains from sequence information using a hybrid learning system.
Nagarajan, Niranjan; Yona, Golan
2004-06-12
We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence and are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition positions between domains. The method was assessed using the domain definitions in SCOP and CATH for proteins of known structure and was compared with several other existing methods. Our method performs well both in terms of accuracy and sensitivity. It improves significantly over the best methods available, even some of the semi-manual ones, while being fully automatic. Our method can also be used to suggest and verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed. An online domain-prediction server is available at http://biozon.org/tools/domains/
Inferring Short-Range Linkage Information from Sequencing Chromatograms
Beggel, Bastian; Neumann-Fraune, Maria; Kaiser, Rolf; Verheyen, Jens; Lengauer, Thomas
2013-01-01
Direct Sanger sequencing of viral genome populations yields multiple ambiguous sequence positions. It is not straightforward to derive linkage information from sequencing chromatograms, which in turn hampers the correct interpretation of the sequence data. We present a method for determining the variants existing in a viral quasispecies in the case of two nearby ambiguous sequence positions by exploiting the effect of sequence context-dependent incorporation of dideoxynucleotides. The computational model was trained on data from sequencing chromatograms of clonal variants and was evaluated on two test sets of in vitro mixtures. The approach achieved high accuracies in identifying the mixture components of 97.4% on a test set in which the positions to be analyzed are only one base apart from each other, and of 84.5% on a test set in which the ambiguous positions are separated by three bases. In silico experiments suggest two major limitations of our approach in terms of accuracy. First, due to a basic limitation of Sanger sequencing, it is not possible to reliably detect minor variants with a relative frequency of no more than 10%. Second, the model cannot distinguish between mixtures of two or four clonal variants, if one of two sets of linear constraints is fulfilled. Furthermore, the approach requires repetitive sequencing of all variants that might be present in the mixture to be analyzed. Nevertheless, the effectiveness of our method on the two in vitro test sets shows that short-range linkage information of two ambiguous sequence positions can be inferred from Sanger sequencing chromatograms without any further assumptions on the mixture composition. Additionally, our model provides new insights into the established and widely used Sanger sequencing technology. The source code of our method is made available at http://bioinf.mpi-inf.mpg.de/publications/beggel/linkageinformation.zip. PMID:24376502
Rapid and reliable protein structure determination via chemical shift threading.
Hafsa, Noor E; Berjanskii, Mark V; Arndt, David; Wishart, David S
2018-01-01
Protein structure determination using nuclear magnetic resonance (NMR) spectroscopy can be both time-consuming and labor intensive. Here we demonstrate how chemical shift threading can permit rapid, robust, and accurate protein structure determination using only chemical shift data. Threading is a relatively old bioinformatics technique that uses a combination of sequence information and predicted (or experimentally acquired) low-resolution structural data to generate high-resolution 3D protein structures. The key motivations behind using NMR chemical shifts for protein threading lie in the fact that they are easy to measure, they are available prior to 3D structure determination, and they contain vital structural information. The method we have developed uses not only sequence and chemical shift similarity but also chemical shift-derived secondary structure, shift-derived super-secondary structure, and shift-derived accessible surface area to generate a high quality protein structure regardless of the sequence similarity (or lack thereof) to a known structure already in the PDB. The method (called E-Thrifty) was found to be very fast (often < 10 min/structure) and to significantly outperform other shift-based or threading-based structure determination methods (in terms of top template model accuracy)-with an average TM-score performance of 0.68 (vs. 0.50-0.62 for other methods). Coupled with recent developments in chemical shift refinement, these results suggest that protein structure determination, using only NMR chemical shifts, is becoming increasingly practical and reliable. E-Thrifty is available as a web server at http://ethrifty.ca .
Korber, B T; Osmanov, S; Esparza, J; Myers, G
1994-11-01
The World Health Organization Global Programme on AIDS (WHO/GPA) is conducting a large-scale collaborative study of human immunodeficiency virus type 1 (HIV-1) variation, based in four potential vaccine-trial site countries: Brazil, Rwanda, Thailand, and Uganda. Through the course of this study, it was crucial to keep track of certain attributes of the samples from which the viral nucleotide sequences were derived (e.g., country of origin and viral culture characterization), so that meaningful sequence comparisons could be made. Here we describe a system developed in the context of the WHO/GPA study that summarizes such critical attributes by representing them as standardized characters directly incorporated into sequence names. This nomenclature allows linkage of clinical, phenotypic, and geographic information with molecular data. We propose that other investigators involved in human immunodeficiency virus (HIV) nucleotide sequencing efforts adopt a similar standardized sequence nomenclature to facilitate cross-study sequence comparison. HIV sequence data are being generated at an ever-increasing rate; directly coupled to this increase is our deepening understanding of biological parameters that influence or result from sequence variability. A standardized sequence nomenclature that includes relevant biological information would enable researchers to better utilize the growing body of sequence data, and enhance their ability to interpret the biological implications of their own data through facilitating comparisons with previously published work.
Microbial Genome Analysis and Comparisons: Web-based Protocols and Resources
USDA-ARS?s Scientific Manuscript database
Fully annotated genome sequences of many microorganisms are publicly available as a resource. However, in-depth analysis of these genomes using specialized tools is required to derive meaningful information. We describe here the utility of three powerful publicly available genome databases and ana...
The problems and promise of DNA barcodes for species diagnosis of primate biomaterials
Lorenz, Joseph G; Jackson, Whitney E; Beck, Jeanne C; Hanner, Robert
2005-01-01
The Integrated Primate Biomaterials and Information Resource (www.IPBIR.org) provides essential research reagents to the scientific community by establishing, verifying, maintaining, and distributing DNA and RNA derived from primate cell cultures. The IPBIR uses mitochondrial cytochrome c oxidase subunit I sequences to verify the identity of samples for quality control purposes in the accession, cell culture, DNA extraction processes and prior to shipping to end users. As a result, IPBIR is accumulating a database of ‘DNA barcodes’ for many species of primates. However, this quality control process is complicated by taxon specific patterns of ‘universal primer’ failure, as well as the amplification or co-amplification of nuclear pseudogenes of mitochondrial origins. To overcome these difficulties, taxon specific primers have been developed, and reverse transcriptase PCR is utilized to exclude these extraneous sequences from amplification. DNA barcoding of primates has applications to conservation and law enforcement. Depositing barcode sequences in a public database, along with primer sequences, trace files and associated quality scores, makes this species identification technique widely accessible. Reference DNA barcode sequences should be derived from, and linked to, specimens of known provenance in web-accessible collections in order to validate this system of molecular diagnostics. PMID:16214744
Korber, B T; Kunstman, K J; Patterson, B K; Furtado, M; McEvilly, M M; Levy, R; Wolinsky, S M
1994-01-01
Human immunodeficiency virus type 1 (HIV-1) sequences were generated from blood and from brain tissue obtained by stereotactic biopsy from six patients undergoing a diagnostic neurosurgical procedure. Proviral DNA was directly amplified by nested PCR, and 8 to 36 clones from each sample were sequenced. Phylogenetic analysis of intrapatient envelope V3-V5 region HIV-1 DNA sequence sets revealed that brain viral sequences were clustered relative to the blood viral sequences, suggestive of tissue-specific compartmentalization of the virus in four of the six cases. In the other two cases, the blood and brain virus sequences were intermingled in the phylogenetic analyses, suggesting trafficking of virus between the two tissues. Slide-based PCR-driven in situ hybridization of two of the patients' brain biopsy samples confirmed our interpretation of the intrapatient phylogenetic analyses. Interpatient V3 region brain-derived sequence distances were significantly less than blood-derived sequence distances. Relative to the tip of the loop, the set of brain-derived viral sequences had a tendency towards negative or neutral charge compared with the set of blood-derived viral sequences. Entropy calculations were used as a measure of the variability at each position in alignments of blood and brain viral sequences. A relatively conserved set of positions were found, with a significantly lower entropy in the brain-than in the blood-derived viral sequences. These sites constitute a brain "signature pattern," or a noncontiguous set of amino acids in the V3 region conserved in viral sequences derived from brain tissue. This brain-derived signature pattern was also well preserved among isolates previously characterized in vitro as macrophage tropic. Macrophage-monocyte tropism may be the biological constraint that results in the conservation of the viral brain signature pattern. Images PMID:7933130
Rooting gene trees without outgroups: EP rooting.
Sinsheimer, Janet S; Little, Roderick J A; Lake, James A
2012-01-01
Gene sequences are routinely used to determine the topologies of unrooted phylogenetic trees, but many of the most important questions in evolution require knowing both the topologies and the roots of trees. However, general algorithms for calculating rooted trees from gene and genomic sequences in the absence of gene paralogs are few. Using the principles of evolutionary parsimony (EP) (Lake JA. 1987a. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 4:167-181) and its extensions (Cavender, J. 1989. Mechanized derivation of linear invariants. Mol Biol Evol. 6:301-316; Nguyen T, Speed TP. 1992. A derivation of all linear invariants for a nonbalanced transversion model. J Mol Evol. 35:60-76), we explicitly enumerate all linear invariants that solely contain rooting information and derive algorithms for rooting gene trees directly from gene and genomic sequences. These new EP linear rooting invariants allow one to determine rooted trees, even in the complete absence of outgroups and gene paralogs. EP rooting invariants are explicitly derived for three taxon trees, and rules for their extension to four or more taxa are provided. The method is demonstrated using 18S ribosomal DNA to illustrate how the new animal phylogeny (Aguinaldo AMA et al. 1997. Evidence for a clade of nematodes, arthropods, and other moulting animals. Nature 387:489-493; Lake JA. 1990. Origin of the metazoa. Proc Natl Acad Sci USA 87:763-766) may be rooted directly from sequences, even when they are short and paralogs are unavailable. These results are consistent with the current root (Philippe H et al. 2011. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470:255-260).
Rooting Gene Trees without Outgroups: EP Rooting
Sinsheimer, Janet S.; Little, Roderick J. A.; Lake, James A.
2012-01-01
Gene sequences are routinely used to determine the topologies of unrooted phylogenetic trees, but many of the most important questions in evolution require knowing both the topologies and the roots of trees. However, general algorithms for calculating rooted trees from gene and genomic sequences in the absence of gene paralogs are few. Using the principles of evolutionary parsimony (EP) (Lake JA. 1987a. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol. 4:167–181) and its extensions (Cavender, J. 1989. Mechanized derivation of linear invariants. Mol Biol Evol. 6:301–316; Nguyen T, Speed TP. 1992. A derivation of all linear invariants for a nonbalanced transversion model. J Mol Evol. 35:60–76), we explicitly enumerate all linear invariants that solely contain rooting information and derive algorithms for rooting gene trees directly from gene and genomic sequences. These new EP linear rooting invariants allow one to determine rooted trees, even in the complete absence of outgroups and gene paralogs. EP rooting invariants are explicitly derived for three taxon trees, and rules for their extension to four or more taxa are provided. The method is demonstrated using 18S ribosomal DNA to illustrate how the new animal phylogeny (Aguinaldo AMA et al. 1997. Evidence for a clade of nematodes, arthropods, and other moulting animals. Nature 387:489–493; Lake JA. 1990. Origin of the metazoa. Proc Natl Acad Sci USA 87:763–766) may be rooted directly from sequences, even when they are short and paralogs are unavailable. These results are consistent with the current root (Philippe H et al. 2011. Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470:255–260). PMID:22593551
Ghosh, Pritha; Mathew, Oommen K; Sowdhamini, Ramanathan
2016-10-07
RNA-binding proteins (RBPs) interact with their cognate RNA(s) to form large biomolecular assemblies. They are versatile in their functionality and are involved in a myriad of processes inside the cell. RBPs with similar structural features and common biological functions are grouped together into families and superfamilies. It will be useful to obtain an early understanding and association of RNA-binding property of sequences of gene products. Here, we report a web server, RStrucFam, to predict the structure, type of cognate RNA(s) and function(s) of proteins, where possible, from mere sequence information. The web server employs Hidden Markov Model scan (hmmscan) to enable association to a back-end database of structural and sequence families. The database (HMMRBP) comprises of 437 HMMs of RBP families of known structure that have been generated using structure-based sequence alignments and 746 sequence-centric RBP family HMMs. The input protein sequence is associated with structural or sequence domain families, if structure or sequence signatures exist. In case of association of the protein with a family of known structures, output features like, multiple structure-based sequence alignment (MSSA) of the query with all others members of that family is provided. Further, cognate RNA partner(s) for that protein, Gene Ontology (GO) annotations, if any and a homology model of the protein can be obtained. The users can also browse through the database for details pertaining to each family, protein or RNA and their related information based on keyword search or RNA motif search. RStrucFam is a web server that exploits structurally conserved features of RBPs, derived from known family members and imprinted in mathematical profiles, to predict putative RBPs from sequence information. Proteins that fail to associate with such structure-centric families are further queried against the sequence-centric RBP family HMMs in the HMMRBP database. Further, all other essential information pertaining to an RBP, like overall function annotations, are provided. The web server can be accessed at the following link: http://caps.ncbs.res.in/rstrucfam .
Baxter, Laura L; Hsu, Benjamin J; Umayam, Lowell; Wolfsberg, Tyra G; Larson, Denise M; Frith, Martin C; Kawai, Jun; Hayashizaki, Yoshihide; Carninci, Piero; Pavan, William J
2007-06-01
As part of the RIKEN mouse encyclopedia project, two cDNA libraries were prepared from melanocyte-derived cell lines, using techniques of full-length clone selection and subtraction/normalization to enrich for rare transcripts. End sequencing showed that these libraries display over 83% complete coding sequence at the 5' end and 96-97% complete coding sequence at the 3' end. Evaluation of the libraries, derived from B16F10Y tumor cells and melan-c cells, revealed that they contain clones for a majority of the genes previously demonstrated to function in melanocyte biology. Analysis of genomic locations for transcripts revealed that the distribution of melanocyte genes is non-random throughout the genome. Three genomic regions identified that showed significant clustering of melanocyte-expressed genes contain one or more genes previously shown to regulate melanocyte development or function. A catalog of genes expressed in these libraries is presented, providing a valuable resource of cDNA clones and sequence information that can be used for identification of new genes important for melanocyte development, function, and disease.
Contamination of sequence databases with adaptor sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoshikawa, Takeo; Sanders, A.R.; Detera-Wadleigh, S.D.
Because of the exponential increase in the amount of DNA sequences being added to the public databases on a daily basis, it has become imperative to identify sources of contamination rapidly. Previously, contaminations of sequence databases have been reported to alert the scientific community to the problem. These contaminations can be divided into two categories. The first category comprises host sequences that have been difficult for submitters to manage or control. Examples include anomalous sequences derived from Escherichia coli, which are inserted into the chromosomes (and plasmids) of the bacterial hosts. Insertion sequences are highly mobile and are capable ofmore » transposing themselves into plasmids during cloning manipulation. Another example of the first category is the infection with yeast genomic DNA or with bacterial DNA of some commercially available cDNA libraries from Clontech. The second category of database contamination is due to the inadvertent inclusion of nonhost sequences. This category includes incorporation of cloning-vector sequences and multicloning sites in the database submission. M13-derived artifacts have been common, since M13-based vectors have been widely used for subcloning DNA fragments. Recognizing this problem, the National Center for Biotechnology Information (NCBI) started to screen, in April 1994, all sequences directly submitted to GenBank, against a set of vector data retrieved from GenBank by use of key-word searches, such as {open_quotes}vector.{close_quotes} In this report, we present evidence for another sequence artifact that is widespread but that, to our knowledge, has not yet been reported. 11 refs., 1 tab.« less
Bowen, D; Littlechild, J A; Fothergill, J E; Watson, H C; Hall, L
1988-01-01
Using oligonucleotide probes derived from amino acid sequencing information, the structural gene for phosphoglycerate kinase from the extreme thermophile, Thermus thermophilus, was cloned in Escherichia coli and its complete nucleotide sequence determined. The gene consists of an open reading frame corresponding to a protein of 390 amino acid residues (calculated Mr 41,791) with an extreme bias for G or C (93.1%) in the codon third base position. Comparison of the deduced amino acid sequence with that of the corresponding mesophilic yeast enzyme indicated a number of significant differences. These are discussed in terms of the unusual codon bias and their possible role in enhanced protein thermal stability. Images Fig. 1. PMID:3052437
Optimum quantum receiver for detecting weak signals in PAM communication systems
NASA Astrophysics Data System (ADS)
Sharma, Navneet; Rawat, Tarun Kumar; Parthasarathy, Harish; Gautam, Kumar
2017-09-01
This paper deals with the modeling of an optimum quantum receiver for pulse amplitude modulator (PAM) communication systems. The information bearing sequence {I_k}_{k=0}^{N-1} is estimated using the maximum likelihood (ML) method. The ML method is based on quantum mechanical measurements of an observable X in the Hilbert space of the quantum system at discrete times, when the Hamiltonian of the system is perturbed by an operator obtained by modulating a potential V with a PAM signal derived from the information bearing sequence {I_k}_{k=0}^{N-1}. The measurement process at each time instant causes collapse of the system state to an observable eigenstate. All probabilities of getting different outcomes from an observable are calculated using the perturbed evolution operator combined with the collapse postulate. For given probability densities, calculation of the mean square error evaluates the performance of the receiver. Finally, we present an example involving estimating an information bearing sequence that modulates a quantum electromagnetic field incident on a quantum harmonic oscillator.
Song, Xuhao; Shen, Fujun; Huang, Jie; Huang, Yan; Du, Lianming; Wang, Chengdong; Fan, Zhenxin; Hou, Rong; Yue, Bisong; Zhang, Xiuyue
2016-09-01
Recently, an increasing number of microsatellites or simple sequence repeats (SSRs) have been found and characterized from transcriptomes. Such SSRs can be employed as putative functional markers to easily tag corresponding genes, which play an important role in biomedical studies and genetic analysis. However, the transcriptome-derived SSRs for giant panda (Ailuropoda melanoleuca) are not yet available. In this work, we identified and characterized 20 tetranucleotide microsatellite loci from a transcript database generated from the blood of giant panda. Furthermore, we assigned their predicted transcriptome locations: 16 loci were assigned to untranslated regions (UTRs) and 4 loci were assigned to coding regions (CDSs). Gene identities of 14 transcripts contained corresponding microsatellites were determined, which provide useful information to study the potential contribution of SSRs to gene regulation in giant panda. The polymorphic information content (PIC) values ranged from 0.293 to 0.789 with an average of 0.603 for the 16 UTRs-derived SSRs. Interestingly, 4 CDS-derived microsatellites developed in our study were also polymorphic, and the instability of these 4 CDS-derived SSRs was further validated by re-genotyping and sequencing. The genes containing these 4 CDS-derived SSRs were embedded with various types of repeat motifs. The interaction of all the length-changing SSRs might provide a way against coding region frameshift caused by microsatellite instability. We hope these newly gene-associated biomarkers will pave the way for genetic and biomedical studies for giant panda in the future. In sum, this set of transcriptome-derived markers complements the genetic resources available for giant panda. © The American Genetic Association. 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Luo, C; Zhang, Q L; Luo, Z R
2014-04-16
Oriental persimmon (Diospyros kaki Thunb.) (2n = 6x = 90) is a major commercial and deciduous fruit tree that is believed to have originated in China. However, rare transcriptomic and genomic information on persimmon is available. Using Roche 454 sequencing technology, the transcriptome from RNA of the flowers of D. kaki was analyzed. A total of 1,250,893 reads were generated and 83,898 unigenes were assembled. A total of 42,711 SSR loci were identified from 23,494 unigenes and 289 polymerase chain reaction primer pairs were designed. Of these 289 primers, 155 (53.6%) showed robust PCR amplification and 98 revealed polymorphism between 15 persimmon genotypes, indicating a polymorphic rate of 63.23% of the productive primers for characterization and genotyping of the genus Diospyros. Transcriptome sequence data generated from next-generation sequencing technology to identify microsatellite loci appears to be rapid and cost-efficient, particularly for species with no genomic sequence information available.
PrionHome: a database of prions and other sequences relevant to prion phenomena.
Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M A; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M
2012-01-01
Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion.
PrionHome: A Database of Prions and Other Sequences Relevant to Prion Phenomena
Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M. A.; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M.
2012-01-01
Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion. PMID:22363733
Abe, Takashi; Hamano, Yuta; Ikemura, Toshimichi
2014-01-01
A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method "BLSOM" for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.
Petsch, Harold E.
1979-01-01
Statistical summaries of daily streamflow data for 189 stations west of the Continental Divide in Colorado are presented in this report. Duration tables, high-flow sequence tables, and low-flow sequence tables provide information about daily mean discharge. The mean, variance, standard deviation, skewness, and coefficient of variation are provided for monthly and annual flows. Percentages of average flow are provided for monthly flows and first-order serial-correlation coefficients are provided for annual flows. The text explain the nature and derivation of the data and illustrates applications of the tabulated information by examples. The data may be used by agencies and individuals engaged in water studies. (USGS)
Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P
2016-05-03
DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
Automated use of mutagenesis data in structure prediction.
Nanda, Vikas; DeGrado, William F
2005-05-15
In the absence of experimental structural determination, numerous methods are available to indirectly predict or probe the structure of a target molecule. Genetic modification of a protein sequence is a powerful tool for identifying key residues involved in binding reactions or protein stability. Mutagenesis data is usually incorporated into the modeling process either through manual inspection of model compatibility with empirical data, or through the generation of geometric constraints linking sensitive residues to a binding interface. We present an approach derived from statistical studies of lattice models for introducing mutation information directly into the fitness score. The approach takes into account the phenotype of mutation (neutral or disruptive) and calculates the energy for a given structure over an ensemble of sequences. The structure prediction procedure searches for the optimal conformation where neutral sequences either have no impact or improve stability and disruptive sequences reduce stability relative to wild type. We examine three types of sequence ensembles: information from saturation mutagenesis, scanning mutagenesis, and homologous proteins. Incorporating multiple sequences into a statistical ensemble serves to energetically separate the native state and misfolded structures. As a result, the prediction of structure with a poor force field is sufficiently enhanced by mutational information to improve accuracy. Furthermore, by separating misfolded conformations from the target score, the ensemble energy serves to speed up conformational search algorithms such as Monte Carlo-based methods. Copyright 2005 Wiley-Liss, Inc.
Information-Theoretic Uncertainty of SCFG-Modeled Folding Space of The Non-coding RNA
Manzourolajdad, Amirhossein; Wang, Yingfeng; Shaw, Timothy I.; Malmberg, Russell L.
2012-01-01
RNA secondary structure ensembles define probability distributions for alternative equilibrium secondary structures of an RNA sequence. Shannon’s Entropy is a measure for the amount of diversity present in any ensemble. In this work, Shannon’s entropy of the SCFG ensemble on an RNA sequence is derived and implemented in polynomial time for both structurally ambiguous and unambiguous grammars. Micro RNA sequences generally have low folding entropy, as previously discovered. Surprisingly, signs of significantly high folding entropy were observed in certain ncRNA families. More effective models coupled with targeted randomization tests can lead to a better insight into folding features of these families. PMID:23160142
Deciphering the glycosaminoglycan code with the help of microarrays.
de Paz, Jose L; Seeberger, Peter H
2008-07-01
Carbohydrate microarrays have become a powerful tool to elucidate the biological role of complex sugars. Microarrays are particularly useful for the study of glycosaminoglycans (GAGs), a key class of carbohydrates. The high-throughput chip format enables rapid screening of large numbers of potential GAG sequences produced via a complex biosynthesis while consuming very little sample. Here, we briefly highlight the most recent advances involving GAG microarrays built with synthetic or naturally derived oligosaccharides. These chips are powerful tools for characterizing GAG-protein interactions and determining structure-activity relationships for specific sequences. Thereby, they contribute to decoding the information contained in specific GAG sequences.
Using the TIGR gene index databases for biological discovery.
Lee, Yuandan; Quackenbush, John
2003-11-01
The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.
Wang, Chun Guo; Chen, Xiao Qiang; Li, Hui; Zhao, Qian Cheng; Sun, De Ling; Song, Wen Qin
2008-02-01
Analysis of ISSR (Inter-Simple Sequence Repeat) and DDRT-PCR (Differential Display Reverse Transcriptase Polymerase Chain Reaction) was performed between cytoplasmic male sterility cauliflower ogura-A and its corresponding maintainer line ogura-B. Totally, 306 detectable bands were obtained by ISSR using thirty oligonucleotide primers. Commonly, six to twelve bands were produced per primer. Among all these primers only the amplification of primer ISSR3 was polymorphic, an 1100 bp specific band was only detected in maintainer line, named ISSR3(1100). Analysis of this sequence indicated that ISSR3(1100) was high homologous with the corresponding sequences of mitochondrial genome in Brassica napus and Arabidopsis thaliana,which suggested that ISSR3(1100) may derive from mitochondrial genome in cauliflower. To carry out DDRT-PCR analysis, three anchor primers and fifteen random primers were selected to combine. Totally, 1122 bands from 1 000 bp to 50 bp were detected. However, only four bands, named ogura-A 205, ogura-A383, ogura-B307 and ogura-B352, were confirmed to be different display in both lines. This result was further identified by reverse Northern dot blotting analysis. Among these four bands, ogura-A205 and ogura-A383 only express in cytoplasmic male sterility line, while ogura-B307 and ogura-B352 were only detected in maintainer line. Analysis of these sequences indicated that it was the first time that these four sequences were reported in cauliflower. Interestingly, ogura-A205 and ogura-B307 did not exhibit any similarities to other reported sequences in other species, more investigations were required to obtain further information. ogura-A383 and ogura-B352 were also two new sequences, they showed high similarities to corresponding chloroplast sequences of Arabidopsis thaliana and Brassica rapa subsp. pekinensis. So we speculated that these two sequences may derive from chloroplast genome. All these results obtained in this study offer new and significant information to investigate the molecular mechanism of cytoplasmic male sterility and fertile maintenance in cauliflower.
XX/XY System of Sex Determination in the Geophilomorph Centipede Strigamia maritima
Green, Jack E.; Dalíková, Martina; Sahara, Ken; Marec, František; Akam, Michael
2016-01-01
We show that the geophilomorph centipede Strigamia maritima possesses an XX/XY system of sex chromosomes, with males being the heterogametic sex. This is, to our knowledge, the first report of sex chromosomes in any geophilomorph centipede. Using the recently assembled Strigamia genome sequence, we identified a set of scaffolds differentially represented in male and female DNA sequence. Using quantitative real-time PCR, we confirmed that three candidate X chromosome-derived scaffolds are present at approximately twice the copy number in females as in males. Furthermore, we confirmed that six candidate Y chromosome-derived scaffolds contain male-specific sequences. Finally, using this molecular information, we designed an X chromosome-specific DNA probe and performed fluorescent in situ hybridization against mitotic and meiotic chromosome spreads to identify the Strigamia XY sex-chromosome pair cytologically. We found that the X and Y chromosomes are recognizably different in size during the early pachytene stage of meiosis, and exhibit incomplete and delayed pairing. PMID:26919730
Protein model discrimination using mutational sensitivity derived from deep sequencing.
Adkar, Bharat V; Tripathi, Arti; Sahoo, Anusmita; Bajaj, Kanika; Goswami, Devrishi; Chakrabarti, Purbani; Swarnkar, Mohit K; Gokhale, Rajesh S; Varadarajan, Raghavan
2012-02-08
A major bottleneck in protein structure prediction is the selection of correct models from a pool of decoys. Relative activities of ∼1,200 individual single-site mutants in a saturation library of the bacterial toxin CcdB were estimated by determining their relative populations using deep sequencing. This phenotypic information was used to define an empirical score for each residue (RankScore), which correlated with the residue depth, and identify active-site residues. Using these correlations, ∼98% of correct models of CcdB (RMSD ≤ 4Å) were identified from a large set of decoys. The model-discrimination methodology was further validated on eleven different monomeric proteins using simulated RankScore values. The methodology is also a rapid, accurate way to obtain relative activities of each mutant in a large pool and derive sequence-structure-function relationships without protein isolation or characterization. It can be applied to any system in which mutational effects can be monitored by a phenotypic readout. Copyright © 2012 Elsevier Ltd. All rights reserved.
Genome sequence and analysis of Lactobacillus helveticus
Cremonesi, Paola; Chessa, Stefania; Castiglioni, Bianca
2013-01-01
The microbiological characterization of lactobacilli is historically well developed, but the genomic analysis is recent. Because of the widespread use of Lactobacillus helveticus in cheese technology, information concerning the heterogeneity in this species is accumulating rapidly. Recently, the genome of five L. helveticus strains was sequenced to completion and compared with other genomically characterized lactobacilli. The genomic analysis of the first sequenced strain, L. helveticus DPC 4571, isolated from cheese and selected for its characteristics of rapid lysis and high proteolytic activity, has revealed a plethora of genes with industrial potential including those responsible for key metabolic functions such as proteolysis, lipolysis, and cell lysis. These genes and their derived enzymes can facilitate the production of cheese and cheese derivatives with potential for use as ingredients in consumer foods. In addition, L. helveticus has the potential to produce peptides with a biological function, such as angiotensin converting enzyme (ACE) inhibitory activity, in fermented dairy products, demonstrating the therapeutic value of this species. A most intriguing feature of the genome of L. helveticus is the remarkable similarity in gene content with many intestinal lactobacilli. Comparative genomics has allowed the identification of key gene sets that facilitate a variety of lifestyles including adaptation to food matrices or the gastrointestinal tract. As genome sequence and functional genomic information continues to explode, key features of the genomes of L. helveticus strains continue to be discovered, answering many questions but also raising many new ones. PMID:23335916
Evolutionary distances in the twilight zone--a rational kernel approach.
Schwarz, Roland F; Fletcher, William; Förster, Frank; Merget, Benjamin; Wolf, Matthias; Schultz, Jörg; Markowetz, Florian
2010-12-31
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
Coprolites as a source of information on the genome and diet of the cave hyena
Bon, Céline; Berthonaud, Véronique; Maksud, Frédéric; Labadie, Karine; Poulain, Julie; Artiguenave, François; Wincker, Patrick; Aury, Jean-Marc; Elalouf, Jean-Marc
2012-01-01
We performed high-throughput sequencing of DNA from fossilized faeces to evaluate this material as a source of information on the genome and diet of Pleistocene carnivores. We analysed coprolites derived from the extinct cave hyena (Crocuta crocuta spelaea), and sequenced 90 million DNA fragments from two specimens. The DNA reads enabled a reconstruction of the cave hyena mitochondrial genome with up to a 158-fold coverage. This genome, and those sequenced from extant spotted (Crocuta crocuta) and striped (Hyaena hyaena) hyena specimens, allows for the establishment of a robust phylogeny that supports a close relationship between the cave and the spotted hyena. We also demonstrate that high-throughput sequencing yields data for cave hyena multi-copy and single-copy nuclear genes, and that about 50 per cent of the coprolite DNA can be ascribed to this species. Analysing the data for additional species to indicate the cave hyena diet, we retrieved abundant sequences for the red deer (Cervus elaphus), and characterized its mitochondrial genome with up to a 3.8-fold coverage. In conclusion, we have demonstrated the presence of abundant ancient DNA in the coprolites surveyed. Shotgun sequencing of this material yielded a wealth of DNA sequences for a Pleistocene carnivore and allowed unbiased identification of diet. PMID:22456883
Hong, Yanbin; Pandey, Manish K; Liu, Ying; Chen, Xiaoping; Liu, Hong; Varshney, Rajeev K; Liang, Xuanqiang; Huang, Shangzhi
2015-01-01
The cultivated peanut (Arachis hypogaea L.) is an allotetraploid (AABB) species derived from the A-genome (Arachis duranensis) and B-genome (Arachis ipaensis) progenitors. Presence of two versions of a DNA sequence based on the two progenitor genomes poses a serious technical and analytical problem during single nucleotide polymorphism (SNP) marker identification and analysis. In this context, we have analyzed 200 amplicons derived from expressed sequence tags (ESTs) and genome survey sequences (GSS) to identify SNPs in a panel of genotypes consisting of 12 cultivated peanut varieties and two diploid progenitors representing the ancestral genomes. A total of 18 EST-SNPs and 44 genomic-SNPs were identified in 12 peanut varieties by aligning the sequence of A. hypogaea with diploid progenitors. The average frequency of sequence polymorphism was higher for genomic-SNPs than the EST-SNPs with one genomic-SNP every 1011 bp as compared to one EST-SNP every 2557 bp. In order to estimate the potential and further applicability of these identified SNPs, 96 peanut varieties were genotyped using high resolution melting (HRM) method. Polymorphism information content (PIC) values for EST-SNPs ranged between 0.021 and 0.413 with a mean of 0.172 in the set of peanut varieties, while genomic-SNPs ranged between 0.080 and 0.478 with a mean of 0.249. Total 33 SNPs were used for polymorphism detection among the parents and 10 selected lines from mapping population Y13Zh (Zhenzhuhei × Yueyou13). Of the total 33 SNPs, nine SNPs showed polymorphism in the mapping population Y13Zh, and seven SNPs were successfully mapped into five linkage groups. Our results showed that SNPs can be identified in allotetraploid peanut with high accuracy through amplicon sequencing and HRM assay. The identified SNPs were very informative and can be used for different genetic and breeding applications in peanut.
Blair, Matthew W; Hurtado, Natalia; Chavarro, Carolina M; Muñoz-Torres, Monica C; Giraldo, Martha C; Pedraza, Fabio; Tomkins, Jeff; Wing, Rod
2011-03-22
Sequencing of cDNA libraries for the development of expressed sequence tags (ESTs) as well as for the discovery of simple sequence repeats (SSRs) has been a common method of developing microsatellites or SSR-based markers. In this research, our objective was to further sequence and develop common bean microsatellites from leaf and root cDNA libraries derived from the Andean gene pool accession G19833 and the Mesoamerican gene pool accession DOR364, mapping parents of a commonly used reference map. The root libraries were made from high and low phosphorus treated plants. A total of 3,123 EST sequences from leaf and root cDNA libraries were screened and used for direct simple sequence repeat discovery. From these EST sequences we found 184 microsatellites; the majority containing tri-nucleotide motifs, many of which were GC rich (ACC, AGC and AGG in particular). Di-nucleotide motif microsatellites were about half as common as the tri-nucleotide motif microsatellites but most of these were AGn microsatellites with a moderate number of ATn microsatellites in root ESTs followed by few ACn and no GCn microsatellites. Out of the 184 new SSR loci, 120 new microsatellite markers were developed in the BMc (Bean Microsatellites from cDNAs) series and these were evaluated for their capacity to distinguish bean diversity in a germplasm panel of 18 genotypes. We developed a database with images of the microsatellites and their polymorphism information content (PIC), which averaged 0.310 for polymorphic markers. The present study produced information about microsatellite frequency in root and leaf tissues of two important genotypes for common bean genomics: namely G19833, the Andean genotype selected for whole genome shotgun sequencing from race Peru, and DOR364 a race Mesoamerica subgroup 2 genotype that is a small-red seeded, released variety in Central America. Both race Peru and Mesoamerica subgroup 2 (small red beans) have been understudied in comparison to race Nueva Granada and Mesoamerica subgroup 1 (black beans) both with regards to gene expression and as sources of markers. However, we found few differences between SSR type and frequency between the G19833 leaf and DOR364 root tissue-derived ESTs. Overall, our work adds to the analysis of microsatellite frequency evaluation for common bean and provides a new set of 120 BMc markers which combined with the 248 previously developed BMc markers brings the total in this series to 368 markers. Once we include BMd markers, which are derived from GenBank sequences, the current total of gene-based markers from our laboratory surpasses 500 markers. These markers are basic for studies of the transcriptome of common bean and can form anchor points for genetic mapping studies in the future.
Schröder, Christiane; Bleidorn, Christoph; Hartmann, Stefanie; Tiedemann, Ralph
2009-12-15
Investigating the dog genome we found 178965 introns with a moderate length of 200-1000 bp. A screening of these sequences against 23 different repeat libraries to find insertions of short interspersed elements (SINEs) detected 45276 SINEs. Virtually all of these SINEs (98%) belong to the tRNA-derived Can-SINE family. Can-SINEs arose about 55 million years ago before Carnivora split into two basal groups, the Caniformia (dog-like carnivores) and the Feliformia (cat-like carnivores). Genome comparisons of dog and cat recovered 506 putatively informative SINE loci for caniformian phylogeny. In this study we show how to use such genome information of model organisms to research the phylogeny of related non-model species of interest. Investigating a dataset including representatives of all major caniformian lineages, we analysed 24 randomly chosen loci for 22 taxa. All loci were amplifiable and revealed 17 parsimony-informative SINE insertions. The screening for informative SINE insertions yields a large amount of sequence information, in particular of introns, which contain reliable phylogenetic information as well. A phylogenetic analysis of intron- and SINE sequence data provided a statistically robust phylogeny which is congruent with the absence/presence pattern of our SINE markers. This phylogeny strongly supports a sistergroup relationship of Musteloidea and Pinnipedia. Within Pinnipedia, we see strong support from bootstrapping and the presence of a SINE insertion for a sistergroup relationship of the walrus with the Otariidae.
Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.
Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
2016-01-01
First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.
PatGen--a consolidated resource for searching genetic patent sequences.
Rouse, Richard J D; Castagnetto, Jesus; Niedner, Roland H
2005-04-15
Compared to the wealth of online resources covering genomic, proteomic and derived data the Bioinformatics community is rather underserved when it comes to patent information related to biological sequences. The current online resources are either incomplete or rather expensive. This paper describes, PatGen, an integrated database containing data from bioinformatic and patent resources. This effort addresses the inconsistency of publicly available genetic patent data coverage by providing access to a consolidated dataset. PatGen can be searched at http://www.patgendb.com rjdrouse@patentinformatics.com.
Potential energy landscapes identify the information-theoretic nature of the epigenome
Jenkinson, Garrett; Pujadas, Elisabet; Goutsias, John; Feinberg, Andrew P.
2017-01-01
Epigenetics studies genomic modifications carrying information independent of DNA sequence heritable through cell division. In 1940, Waddington coined the term “epigenetic landscape” as a metaphor for pluripotency and differentiation, but methylation landscapes have not yet been rigorously computed. By using principles of statistical physics and information theory, we derive epigenetic energy landscapes from whole-genome bisulfite sequencing data that allow us to quantify methylation stochasticity genome-wide using Shannon’s entropy and associate entropy with chromatin structure. Moreover, we consider the Jensen-Shannon distance between sample-specific energy landscapes as a measure of epigenetic dissimilarity and demonstrate its effectiveness for discerning epigenetic differences. By viewing methylation maintenance as a communications system, we introduce methylation channels and show that higher-order chromatin organization can be predicted from their informational properties. Our results provide a fundamental understanding of the information-theoretic nature of the epigenome that leads to a powerful approach for studying its role in disease and aging. PMID:28346445
Petsch, Harold E.
1979-01-01
Statistical summaries of daily streamflow data for 246 stations east of the Continental Divide in Colorado and adjacent States are presented in this report. Duration tables, high-flow sequence tables, and low-flow sequence tables provide information about daily mean discharge. The mean, variance, standard deviation, skewness, and coefficient of variation are provided for monthly and annual flows. Percentages of average flow are provided for monthly flows and first-order serial-correlation coefficients are provided for annual flows. The text explains the nature and derivation of the data and illustrates applications of the tabulated information by examples. The data may be used by agencies and individuals engaged in water studies. (USGS)
Anchoring in 4- to 6-Year-Old Children Relates to Predictors of Reading
ERIC Educational Resources Information Center
Banai, Karen; Yifat, Rachel
2012-01-01
Previous studies suggest that anchoring, a short-term dynamic and implicit process that allows individuals to benefit from contextual information embedded in stimulus sequences, might be causally related to reading acquisition. Here we report findings from two experiments in which two previously untested predictions derived from this anchoring…
Fair and Square Computation of Inverse "Z"-Transforms of Rational Functions
ERIC Educational Resources Information Center
Moreira, M. V.; Basilio, J. C.
2012-01-01
All methods presented in textbooks for computing inverse "Z"-transforms of rational functions have some limitation: 1) the direct division method does not, in general, provide enough information to derive an analytical expression for the time-domain sequence "x"("k") whose "Z"-transform is "X"("z"); 2) computation using the inversion integral…
Fluctuation sensitivity of a transcriptional signaling cascade
NASA Astrophysics Data System (ADS)
Pilkiewicz, Kevin R.; Mayo, Michael L.
2016-09-01
The internal biochemical state of a cell is regulated by a vast transcriptional network that kinetically correlates the concentrations of numerous proteins. Fluctuations in protein concentration that encode crucial information about this changing state must compete with fluctuations caused by the noisy cellular environment in order to successfully transmit information across the network. Oftentimes, one protein must regulate another through a sequence of intermediaries, and conventional wisdom, derived from the data processing inequality of information theory, leads us to expect that longer sequences should lose more information to noise. Using the metric of mutual information to characterize the fluctuation sensitivity of transcriptional signaling cascades, we find, counter to this expectation, that longer chains of regulatory interactions can instead lead to enhanced informational efficiency. We derive an analytic expression for the mutual information from a generalized chemical kinetics model that we reduce to simple, mass-action kinetics by linearizing for small fluctuations about the basal biological steady state, and we find that at long times this expression depends only on a simple ratio of protein production to destruction rates and the length of the cascade. We place bounds on the values of these parameters by requiring that the mutual information be at least one bit—otherwise, any received signal would be indistinguishable from noise—and we find not only that nature has devised a way to circumvent the data processing inequality, but that it must be circumvented to attain this one-bit threshold. We demonstrate how this result places informational and biochemical efficiency at odds with one another by correlating high transcription factor binding affinities with low informational output, and we conclude with an analysis of the validity of our assumptions and propose how they might be tested experimentally.
d’Avila-Levy, Claudia Masini; Boucinha, Carolina; Kostygov, Alexei; Santos, Helena Lúcia Carneiro; Morelli, Karina Alessandra; Grybchuk-Ieremenko, Anastasiia; Duval, Linda; Votýpka, Jan; Yurchenko, Vyacheslav; Grellier, Philippe; Lukeš, Julius
2015-01-01
The class Kinetoplastea encompasses both free-living and parasitic species from a wide range of hosts. Several representatives of this group are responsible for severe human diseases and for economic losses in agriculture and livestock. While this group encompasses over 30 genera, most of the available information has been derived from the vertebrate pathogenic genera Leishmaniaand Trypanosoma. Recent studies of the previously neglected groups of Kinetoplastea indicated that the actual diversity is much higher than previously thought. This article discusses the known segment of kinetoplastid diversity and how gene-directed Sanger sequencing and next-generation sequencing methods can help to deepen our knowledge of these interesting protists. PMID:26602872
Sturm, Marc; Quinten, Sascha; Huber, Christian G.; Kohlbacher, Oliver
2007-01-01
We propose a new model for predicting the retention time of oligonucleotides. The model is based on ν support vector regression using features derived from base sequence and predicted secondary structure of oligonucleotides. Because of the secondary structure information, the model is applicable even at relatively low temperatures where the secondary structure is not suppressed by thermal denaturing. This makes the prediction of oligonucleotide retention time for arbitrary temperatures possible, provided that the target temperature lies within the temperature range of the training data. We describe different possibilities of feature calculation from base sequence and secondary structure, present the results and compare our model to existing models. PMID:17567619
Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-Graciamedrano, Rosa Maria
2013-01-01
Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100-4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa.
Yamamoto, Naoki; Suzuki, Tomohiro; Kobayashi, Masaaki; Dohra, Hideo; Sasaki, Yohei; Hirai, Hirofumi; Yokoyama, Koji; Kawagishi, Hirokazu; Yano, Kentaro
2014-12-03
The angel's wing oyster mushroom (Pleurocybella porrigens, Sugihiratake) is a well-known delicacy. However, its potential risk in acute encephalopathy was recently revealed by a food poisoning incident. To disclose the genes underlying the accident and provide mechanistic insight, we seek to develop an information infrastructure containing omics data. In our previous work, we sequenced the genome and transcriptome using next-generation sequencing techniques. The next step in achieving our goal is to develop a web database to facilitate the efficient mining of large-scale omics data and identification of genes specifically expressed in the mushroom. This paper introduces a web database A-WINGS (http://bioinf.mind.meiji.ac.jp/a-wings/) that provides integrated genomic and transcriptomic information for the angel's wing oyster mushroom. The database contains structure and functional annotations of transcripts and gene expressions. Functional annotations contain information on homologous sequences from NCBI nr and UniProt, Gene Ontology, and KEGG Orthology. Digital gene expression profiles were derived from RNA sequencing (RNA-seq) analysis in the fruiting bodies and mycelia. The omics information stored in the database is freely accessible through interactive and graphical interfaces by search functions that include 'GO TREE VIEW' browsing, keyword searches, and BLAST searches. The A-WINGS database will accelerate omics studies on specific aspects of the angel's wing oyster mushroom and the family Tricholomataceae.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Myers, G.; Foley, B.; Korber, B.
1997-04-01
This compendium and the accompanying floppy diskettes are the result of an effort to compile and rapidly publish all relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses. The scope of the compendium and database is best summarized by the five parts that it comprises: (1) Nuclear Acid Alignments and Sequences; (2) Amino Acid Alignments; (3) Analysis; (4) Related Sequences; and (5) Database Communications. Information within all the parts is updated throughout the year on the Web site, http://hiv-web.lanl.gov. While this publication could take the form of a review or sequence monograph, it is not so conceived.more » Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. In addition to the general descriptions of the parts of the compendium, the user should read the individual introductions for each part.« less
Revealing the transcriptomic complexity of switchgrass by PacBio long-read sequencing.
Zuo, Chunman; Blow, Matthew; Sreedasyam, Avinash; Kuo, Rita C; Ramamoorthy, Govindarajan Kunde; Torres-Jerez, Ivone; Li, Guifen; Wang, Mei; Dilworth, David; Barry, Kerrie; Udvardi, Michael; Schmutz, Jeremy; Tang, Yuhong; Xu, Ying
2018-01-01
Switchgrass ( Panicum virgatum L.) is an important bioenergy crop widely used for lignocellulosic research. While extensive transcriptomic analyses have been conducted on this species using short read-based sequencing techniques, very little has been reliably derived regarding alternatively spliced (AS) transcripts. We present an analysis of transcriptomes of six switchgrass tissue types pooled together, sequenced using Pacific Biosciences (PacBio) single-molecular long-read technology. Our analysis identified 105,419 unique transcripts covering 43,570 known genes and 8795 previously unknown genes. 45,168 are novel transcripts of known genes. A total of 60,096 AS transcripts are identified, 45,628 being novel. We have also predicted 1549 transcripts of genes involved in cell wall construction and remodeling, 639 being novel transcripts of known cell wall genes. Most of the predicted transcripts are validated against Illumina-based short reads. Specifically, 96% of the splice junction sites in all the unique transcripts are validated by at least five Illumina reads. Comparisons between genes derived from our identified transcripts and the current genome annotation revealed that among the gene set predicted by both analyses, 16,640 have different exon-intron structures. Overall, substantial amount of new information is derived from the PacBio RNA data regarding both the transcriptome and the genome of switchgrass.
Pfaff, Florian; Müller, Thomas; Freuling, Conrad M; Fehlner-Gardiner, Christine; Nadin-Davis, Susan; Robardet, Emmanuelle; Cliquet, Florence; Vuta, Vlad; Hostnik, Peter; Mettenleiter, Thomas C; Beer, Martin; Höper, Dirk
2018-02-10
Live-attenuated rabies virus strains such as those derived from the field isolate Street Alabama Dufferin (SAD) have been used extensively and very effectively as oral rabies vaccines for the control of fox rabies in both Europe and Canada. Although these vaccines are safe, some cases of vaccine-derived rabies have been detected during rabies surveillance accompanying these campaigns. In recent analysis it was shown that some commercial SAD vaccines consist of diverse viral populations, rather than clonal genotypes. For cases of vaccine-derived rabies, only consensus sequence data have been available to date and information concerning their population diversity was thus lacking. In our study, we used high-throughput sequencing to analyze 11 cases of vaccine-derived rabies, and compared their viral population diversity to the related oral rabies vaccines using pairwise Manhattan distances. This extensive deep sequencing analysis of vaccine-derived rabies cases observed during oral vaccination programs provided deeper insights into the effect of accidental in vivo replication of genetically diverse vaccine strains in the central nervous system of target and non-target species under field conditions. The viral population in vaccine-derived cases appeared to be clonal in contrast to their parental vaccines. The change from a state of high population diversity present in the vaccine batches to a clonal genotype in the affected animal may indicate the presence of a strong bottleneck during infection. In conclusion, it is very likely that these few cases are the consequence of host factors and not the result of the selection of a more virulent genotype. Furthermore, this type of vaccine-derived rabies leads to the selection of clonal genotypes and the selected variants were genetically very similar to potent SAD vaccines that have undergone a history of in vitro selection. Copyright © 2018. Published by Elsevier Ltd.
BioPepDB: an integrated data platform for food-derived bioactive peptides.
Li, Qilin; Zhang, Chao; Chen, Hongjun; Xue, Jitong; Guo, Xiaolei; Liang, Ming; Chen, Ming
2018-03-12
Food-derived bioactive peptides play critical roles in regulating most biological processes and have considerable biological, medical and industrial importance. However, a large number of active peptides data, including sequence, function, source, commercial product information, references and other information are poorly integrated. BioPepDB is a searchable database of food-derived bioactive peptides and their related articles, including more than four thousand bioactive peptide entries. Moreover, BioPepDB provides modules of prediction and hydrolysis-simulation for discovering novel peptides. It can serve as a reference database to investigate the function of different bioactive peptides. BioPepDB is available at http://bis.zju.edu.cn/biopepdbr/ . The web page utilises Apache, PHP5 and MySQL to provide the user interface for accessing the database and predict novel peptides. The database itself is operated on a specialised server.
Complete mitochondrial genomes of eleven extinct or possibly extinct bird species.
Anmarkrud, Jarl A; Lifjeld, Jan T
2017-03-01
Natural history museum collections represent a vast source of ancient and historical DNA samples from extinct taxa that can be utilized by high-throughput sequencing tools to reveal novel genetic and phylogenetic information about them. Here, we report on the successful sequencing of complete mitochondrial genome sequences (mitogenomes) from eleven extinct bird species, using de novo assembly of short sequences derived from toepad samples of degraded DNA from museum specimens. For two species (the Passenger Pigeon Ectopistes migratorius and the South Island Piopio Turnagra capensis), whole mitogenomes were already available from recent studies, whereas for five others (the Great Auk Pinguinis impennis, the Imperial Woodpecker Campehilus imperialis, the Huia Heteralocha acutirostris, the Kauai Oo Moho braccathus and the South Island Kokako Callaeas cinereus), there were partial mitochondrial sequences available for comparison. For all seven species, we found sequence similarities of >98%. For the remaining four species (the Kamao Myadestes myadestinus, the Paradise Parrot Psephotellus pulcherrimus, the Ou Psittirostra psittacea and the Lesser Akialoa Akialoa obscura), there was no sequence information available for comparison, so we conducted blast searches and phylogenetic analyses to determine their phylogenetic positions and identify their closest extant relatives. These mitogenomes will be valuable for future analyses of avian phylogenetics and illustrate the importance of museum collections as repositories for genomics resources. © 2016 John Wiley & Sons Ltd.
From protein sequence to dynamics and disorder with DynaMine.
Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F
2013-01-01
Protein function and dynamics are closely related; however, accurate dynamics information is difficult to obtain. Here based on a carefully assembled data set derived from experimental data for proteins in solution, we quantify backbone dynamics properties on the amino-acid level and develop DynaMine--a fast, high-quality predictor of protein backbone dynamics. DynaMine uses only protein sequence information as input and shows great potential in distinguishing regions of different structural organization, such as folded domains, disordered linkers, molten globules and pre-structured binding motifs of different sizes. It also identifies disordered regions within proteins with an accuracy comparable to the most sophisticated existing predictors, without depending on prior disorder knowledge or three-dimensional structural information. DynaMine provides molecular biologists with an important new method that grasps the dynamical characteristics of any protein of interest, as we show here for human p53 and E1A from human adenovirus 5.
Global copy number profiling of cancer genomes | Office of Cancer Genomics
In this article, we introduce a robust and efficient strategy for deriving global and allele-specific copy number alternations (CNA) from cancer whole exome sequencing data based on Log R ratios and B-allele frequencies. Applying the approach to the analysis of over 200 skin cancer samples, we demonstrate its utility for discovering distinct CNA events and for deriving ancillary information such as tumor purity. Availability and implementation: https://github.com/xfwang/CLOSE CONTACT: xuefeng.wang@stonybrook.edu or michael.krauthammer@yale.edu. (Publication Abstract)
Liu, Qi; Yang, Yu; Chen, Chun; Bu, Jiajun; Zhang, Yin; Ye, Xiuzi
2008-03-31
With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression. RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are two fold: (1) present a robust and effective way for RNA structural data compression; (2) design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as Gencompress, winrar and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective. A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
Liu, Qi; Yang, Yu; Chen, Chun; Bu, Jiajun; Zhang, Yin; Ye, Xiuzi
2008-01-01
Background With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression. Results RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are two fold: (1) present a robust and effective way for RNA structural data compression; (2) design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as Gencompress, winrar and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective. Conclusion A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules. PMID:18373878
Deriving video content type from HEVC bitstream semantics
NASA Astrophysics Data System (ADS)
Nightingale, James; Wang, Qi; Grecos, Christos; Goma, Sergio R.
2014-05-01
As network service providers seek to improve customer satisfaction and retention levels, they are increasingly moving from traditional quality of service (QoS) driven delivery models to customer-centred quality of experience (QoE) delivery models. QoS models only consider metrics derived from the network however, QoE models also consider metrics derived from within the video sequence itself. Various spatial and temporal characteristics of a video sequence have been proposed, both individually and in combination, to derive methods of classifying video content either on a continuous scale or as a set of discrete classes. QoE models can be divided into three broad categories, full reference, reduced reference and no-reference models. Due to the need to have the original video available at the client for comparison, full reference metrics are of limited practical value in adaptive real-time video applications. Reduced reference metrics often require metadata to be transmitted with the bitstream, while no-reference metrics typically operate in the decompressed domain at the client side and require significant processing to extract spatial and temporal features. This paper proposes a heuristic, no-reference approach to video content classification which is specific to HEVC encoded bitstreams. The HEVC encoder already makes use of spatial characteristics to determine partitioning of coding units and temporal characteristics to determine the splitting of prediction units. We derive a function which approximates the spatio-temporal characteristics of the video sequence by using the weighted averages of the depth at which the coding unit quadtree is split and the prediction mode decision made by the encoder to estimate spatial and temporal characteristics respectively. Since the video content type of a sequence is determined by using high level information parsed from the video stream, spatio-temporal characteristics are identified without the need for full decoding and can be used in a timely manner to aid decision making in QoE oriented adaptive real time streaming.
The 1985 U.S. Environmental Protection Agency Guidelines for Deriving Aquatic Life Criteria require acute and chronic toxicity testing with a fixed list of taxa that cover a broad spectrum of aquatic organisms from vertebrate, invertebrate, and plant families. In considering revi...
OryzaGenome: Genome Diversity Database of Wild Oryza Species.
Ohyanagi, Hajime; Ebata, Toshinobu; Huang, Xuehui; Gong, Hao; Fujita, Masahiro; Mochizuki, Takako; Toyoda, Atsushi; Fujiyama, Asao; Kaminuma, Eli; Nakamura, Yasukazu; Feng, Qi; Wang, Zi-Xuan; Han, Bin; Kurata, Nori
2016-01-01
The species in the genus Oryza, encompassing nine genome types and 23 species, are a rich genetic resource and may have applications in deeper genomic analyses aiming to understand the evolution of plant genomes. With the advancement of next-generation sequencing (NGS) technology, a flood of Oryza species reference genomes and genomic variation information has become available in recent years. This genomic information, combined with the comprehensive phenotypic information that we are accumulating in our Oryzabase, can serve as an excellent genotype-phenotype association resource for analyzing rice functional and structural evolution, and the associated diversity of the Oryza genus. Here we integrate our previous and future phenotypic/habitat information and newly determined genotype information into a united repository, named OryzaGenome, providing the variant information with hyperlinks to Oryzabase. The current version of OryzaGenome includes genotype information of 446 O. rufipogon accessions derived by imputation and of 17 accessions derived by imputation-free deep sequencing. Two variant viewers are implemented: SNP Viewer as a conventional genome browser interface and Variant Table as a text-based browser for precise inspection of each variant one by one. Portable VCF (variant call format) file or tab-delimited file download is also available. Following these SNP (single nucleotide polymorphism) data, reference pseudomolecules/scaffolds/contigs and genome-wide variation information for almost all of the closely and distantly related wild Oryza species from the NIG Wild Rice Collection will be available in future releases. All of the resources can be accessed through http://viewer.shigen.info/oryzagenome/. © The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists.
Garcia-Reyero, Natàlia; Griffitt, Robert J.; Liu, Li; Kroll, Kevin J.; Farmerie, William G.; Barber, David S.; Denslow, Nancy D.
2009-01-01
A novel custom microarray for largemouth bass (Micropterus salmoides) was designed with sequences obtained from a normalized cDNA library using the 454 Life Sciences GS-20 pyrosequencer. This approach yielded in excess of 58 million bases of high-quality sequence. The sequence information was combined with 2,616 reads obtained by traditional suppressive subtractive hybridizations to derive a total of 31,391 unique sequences. Annotation and coding sequences were predicted for these transcripts where possible. 16,350 annotated transcripts were selected as target sequences for the design of the custom largemouth bass oligonucleotide microarray. The microarray was validated by examining the transcriptomic response in male largemouth bass exposed to 17β-œstradiol. Transcriptomic responses were assessed in liver and gonad, and indicated gene expression profiles typical of exposure to œstradiol. The results demonstrate the potential to rapidly create the tools necessary to assess large scale transcriptional responses in non-model species, paving the way for expanded impact of toxicogenomics in ecotoxicology. PMID:19936325
Memory as embodiment: The case of modality and serial short-term memory.
Macken, Bill; Taylor, John C; Kozlov, Michail D; Hughes, Robert W; Jones, Dylan M
2016-10-01
Classical explanations for the modality effect-superior short-term serial recall of auditory compared to visual sequences-typically recur to privileged processing of information derived from auditory sources. Here we critically appraise such accounts, and re-evaluate the nature of the canonical empirical phenomena that have motivated them. Three experiments show that the standard account of modality in memory is untenable, since auditory superiority in recency is often accompanied by visual superiority in mid-list serial positions. We explain this simultaneous auditory and visual superiority by reference to the way in which perceptual objects are formed in the two modalities and how those objects are mapped to speech motor forms to support sequence maintenance and reproduction. Specifically, stronger obligatory object formation operating in the standard auditory form of sequence presentation compared to that for visual sequences leads both to enhanced addressability of information at the object boundaries and reduced addressability for that in the interior. Because standard visual presentation does not lead to such object formation, such sequences do not show the boundary advantage observed for auditory presentation, but neither do they suffer loss of addressability associated with object information, thereby affording more ready mapping of that information into a rehearsal cohort to support recall. We show that a range of factors that impede this perceptual-motor mapping eliminate visual superiority while leaving auditory superiority unaffected. We make a general case for viewing short-term memory as an embodied, perceptual-motor process. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Sasaki, Katsutomo; Mitsuda, Nobutaka; Nashima, Kenji; Kishimoto, Kyutaro; Katayose, Yuichi; Kanamori, Hiroyuki; Ohmiya, Akemi
2017-09-04
Chrysanthemum morifolium is one of the most economically valuable ornamental plants worldwide. Chrysanthemum is an allohexaploid plant with a large genome that is commercially propagated by vegetative reproduction. New cultivars with different floral traits, such as color, morphology, and scent, have been generated mainly by classical cross-breeding and mutation breeding. However, only limited genetic resources and their genome information are available for the generation of new floral traits. To obtain useful information about molecular bases for floral traits of chrysanthemums, we read expressed sequence tags (ESTs) of chrysanthemums by high-throughput sequencing using the 454 pyrosequencing technology. We constructed normalized cDNA libraries, consisting of full-length, 3'-UTR, and 5'-UTR cDNAs derived from various tissues of chrysanthemums. These libraries produced a total number of 3,772,677 high-quality reads, which were assembled into 213,204 contigs. By comparing the data obtained with those of full genome-sequenced species, we confirmed that our chrysanthemum contig set contained the majority of all expressed genes, which was sufficient for further molecular analysis in chrysanthemums. We confirmed that our chrysanthemum EST set (contigs) contained a number of contigs that encoded transcription factors and enzymes involved in pigment and aroma compound metabolism that was comparable to that of other species. This information can serve as an informative resource for identifying genes involved in various biological processes in chrysanthemums. Moreover, the findings of our study will contribute to a better understanding of the floral characteristics of chrysanthemums including the myriad cultivars at the molecular level.
Utachee, Piraporn; Jinnopat, Piyamat; Isarangkura-Na-Ayuthaya, Panasda; de Silva, Udayanga Chandimal; Nakamura, Shota; Siripanyaphinyo, Uamporn; Wichukchinda, Nuanjun; Tokunaga, Kenzo; Yasunaga, Teruo; Sawanpanyalert, Pathom; Ikuta, Kazuyoshi; Auwanit, Wattana; Kameoka, Masanori
2009-02-01
CRF01_AE is a major subtype of human immunodeficiency virus type 1 (HIV-1) circulating in Southeast Asia, including Thailand. HIV-1 env genes were amplified by polymerase chain reaction from blood samples of HIV-1-infected patients residing in Thailand in 2006, and cloned into the pNL4-3-derived reporter viral construct. Generated envelope protein (Env)-recombinant virus was examined for its infectivity, and then 35 infectious CRF01_AE Env-recombinant viruses were selected. Sequencing analysis revealed that the interclone variation of the deduced amino acid sequences was higher in CRF01_AE env genes isolated in 2006 than in those isolated in the early 1990s, suggesting that env gene variation has been increasing gradually among CRF01_AE viruses prevalent in Thailand. We also examined the characteristics of the deduced amino acid sequences of 35 CRF01_AE env genes. Our results may provide useful information to help in better understanding the genotype of env genes of CRF01_AE viruses currently circulating in Thailand.
Zhao, Chao; Chu, Yanan; Li, Yanhong; Yang, Chengfeng; Chen, Yuqing; Wang, Xumin; Liu, Bin
2017-01-01
To analyze the microbial diversity and gene content of a thermophilic cellulose-degrading consortium from hot springs in Xiamen, China using 454 pyrosequencing for discovering cellulolytic enzyme resources. A thermophilic cellulose-degrading consortium, XM70 that was isolated from a hot spring, used sugarcane bagasse as sole carbon and energy source. DNA sequencing of the XM70 sample resulted in 349,978 reads with an average read length of 380 bases, accounting for 133,896,867 bases of sequence information. The characterization of sequencing reads and assembled contigs revealed that most microbes were derived from four phyla: Geobacillus (Firmicutes), Thermus, Bacillus, and Anoxybacillus. Twenty-eight homologous genes belonging to 15 glycoside hydrolase families were detected, including several cellulase genes. A novel hot spring metagenome-derived thermophilic cellulase was expressed and characterized. The application value of thermostable sugarcane bagasse-degrading enzymes is shown for production of cellulosic biofuel. The practical power of using a short-read-based metagenomic approach for harvesting novel microbial genes is also demonstrated.
Verifying Digital Components of Physical Systems: Experimental Evaluation of Test Quality
NASA Astrophysics Data System (ADS)
Laputenko, A. V.; López, J. E.; Yevtushenko, N. V.
2018-03-01
This paper continues the study of high quality test derivation for verifying digital components which are used in various physical systems; those are sensors, data transfer components, etc. We have used logic circuits b01-b010 of the package of ITC'99 benchmarks (Second Release) for experimental evaluation which as stated before, describe digital components of physical systems designed for various applications. Test sequences are derived for detecting the most known faults of the reference logic circuit using three different approaches to test derivation. Three widely used fault types such as stuck-at-faults, bridges, and faults which slightly modify the behavior of one gate are considered as possible faults of the reference behavior. The most interesting test sequences are short test sequences that can provide appropriate guarantees after testing, and thus, we experimentally study various approaches to the derivation of the so-called complete test suites which detect all fault types. In the first series of experiments, we compare two approaches for deriving complete test suites. In the first approach, a shortest test sequence is derived for testing each fault. In the second approach, a test sequence is pseudo-randomly generated by the use of an appropriate software for logic synthesis and verification (ABC system in our study) and thus, can be longer. However, after deleting sequences detecting the same set of faults, a test suite returned by the second approach is shorter. The latter underlines the fact that in many cases it is useless to spend `time and efforts' for deriving a shortest distinguishing sequence; it is better to use the test minimization afterwards. The performed experiments also show that the use of only randomly generated test sequences is not very efficient since such sequences do not detect all the faults of any type. After reaching the fault coverage around 70%, saturation is observed, and the fault coverage cannot be increased anymore. For deriving high quality short test suites, the approach that is the combination of randomly generated sequences together with sequences which are aimed to detect faults not detected by random tests, allows to reach the good fault coverage using shortest test sequences.
NASA Technical Reports Server (NTRS)
Diak, George R.; Smith, William L.
1993-01-01
The goals of this research endeavor have been to develop a flexible and relatively complete framework for the investigation of current and future satellite data sources in numerical meteorology. In order to realistically model how satellite information might be used for these purposes, it is necessary that Observing System Simulation Experiments (OSSEs) be as complete as possible. It is therefore desirable that these experiments simulate in entirety the sequence of steps involved in bringing satellite information from the radiance level through product retrieval to a realistic analysis and forecast sequence. In this project we have worked to make this sequence realistic by synthesizing raw satellite data from surrogate atmospheres, deriving satellite products from these data and subsequently producing analyses and forecasts using the retrieved products. The accomplishments made in 1991 are presented. The emphasis was on examining atmospheric soundings and microphysical products which we expect to produce with the launch of the Advanced Microwave Sounding Unit (AMSU), slated for flight in mid 1994.
Plechakova, Olga; Tranchant-Dubreuil, Christine; Benedet, Fabrice; Couderc, Marie; Tinaut, Alexandra; Viader, Véronique; De Block, Petra; Hamon, Perla; Campa, Claudine; de Kochko, Alexandre; Hamon, Serge; Poncet, Valérie
2009-01-01
Background In the past few years, functional genomics information has been rapidly accumulating on Rubiaceae species and especially on those belonging to the Coffea genus (coffee trees). An increasing number of expressed sequence tag (EST) data and EST- or genomic-derived microsatellite markers have been generated, together with Conserved Ortholog Set (COS) markers. This considerably facilitates comparative genomics or map-based genetic studies through the common use of orthologous loci across different species. Similar genomic information is available for e.g. tomato or potato, members of the Solanaceae family. Since both Rubiaceae and Solanaceae belong to the Euasterids I (lamiids) integration of information on genetic markers would be possible and lead to more efficient analyses and discovery of key loci involved in important traits such as fruit development, quality, and maturation, or adaptation. Our goal was to develop a comprehensive web data source for integrated information on validated orthologous markers in Rubiaceae. Description MoccaDB is an online MySQL-PHP driven relational database that houses annotated and/or mapped microsatellite markers in Rubiaceae. In its current release, the database stores 638 markers that have been defined on 259 ESTs and 379 genomic sequences. Marker information was retrieved from 11 published works, and completed with original data on 132 microsatellite markers validated in our laboratory. DNA sequences were derived from three Coffea species/hybrids. Microsatellite markers were checked for similarity, in vitro tested for cross-amplification and diversity/polymorphism status in up to 38 Rubiaceae species belonging to the Cinchonoideae and Rubioideae subfamilies. Functional annotation was provided and some markers associated with described metabolic pathways were also integrated. Users can search the database for marker, sequence, map or diversity information through multi-option query forms. The retrieved data can be browsed and downloaded, along with protocols used, using a standard web browser. MoccaDB also integrates bioinformatics tools (CMap viewer and local BLAST) and hyperlinks to related external data sources (NCBI GenBank and PubMed, SOL Genomic Network database). Conclusion We believe that MoccaDB will be extremely useful for all researchers working in the areas of comparative and functional genomics and molecular evolution, in general, and population analysis and association mapping of Rubiaceae and Solanaceae species, in particular. PMID:19788737
The Genome of the Beluga Whale (Delphinapterus leucas)
Taylor, Gregory A.; Chan, Simon; Warren, René L.; Hammond, S. Austin; Bilobram, Steven; Mordecai, Gideon; Miller, Kristina M.; Schulze, Angela; Chan, Amy M.; Jones, Samantha J.; Tse, Kane; Li, Irene; Cheung, Dorothy; Mungall, Karen L.; Choo, Caleb; Ally, Adrian; Dhalla, Noreen; Tam, Angela K. Y.; Troussard, Armelle; Kirk, Heather; Pandoh, Pawan; Paulino, Daniel; Coope, Robin J. N.; Moore, Richard; Zhao, Yongjun; Birol, Inanc; Ma, Yussanne; Marra, Marco; Haulena, Martin
2017-01-01
The beluga whale is a cetacean that inhabits arctic and subarctic regions, and is the only living member of the genus Delphinapterus. The genome of the beluga whale was determined using DNA sequencing approaches that employed both microfluidic partitioning library and non-partitioned library construction. The former allowed for the construction of a highly contiguous assembly with a scaffold N50 length of over 19 Mbp and total reconstruction of 2.32 Gbp. To aid our understanding of the functional elements, transcriptome data was also derived from brain, duodenum, heart, lung, spleen, and liver tissue. Assembled sequence and all of the underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the Bioproject accession number PRJNA360851A. PMID:29232881
The Genome of the Beluga Whale (Delphinapterus leucas).
Jones, Steven J M; Taylor, Gregory A; Chan, Simon; Warren, René L; Hammond, S Austin; Bilobram, Steven; Mordecai, Gideon; Suttle, Curtis A; Miller, Kristina M; Schulze, Angela; Chan, Amy M; Jones, Samantha J; Tse, Kane; Li, Irene; Cheung, Dorothy; Mungall, Karen L; Choo, Caleb; Ally, Adrian; Dhalla, Noreen; Tam, Angela K Y; Troussard, Armelle; Kirk, Heather; Pandoh, Pawan; Paulino, Daniel; Coope, Robin J N; Mungall, Andrew J; Moore, Richard; Zhao, Yongjun; Birol, Inanc; Ma, Yussanne; Marra, Marco; Haulena, Martin
2017-12-11
The beluga whale is a cetacean that inhabits arctic and subarctic regions, and is the only living member of the genus Delphinapterus . The genome of the beluga whale was determined using DNA sequencing approaches that employed both microfluidic partitioning library and non-partitioned library construction. The former allowed for the construction of a highly contiguous assembly with a scaffold N50 length of over 19 Mbp and total reconstruction of 2.32 Gbp. To aid our understanding of the functional elements, transcriptome data was also derived from brain, duodenum, heart, lung, spleen, and liver tissue. Assembled sequence and all of the underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the Bioproject accession number PRJNA360851A.
Genomic sequencing of Pleistocene cave bears
DOE Office of Scientific and Technical Information (OSTI.GOV)
Noonan, James P.; Hofreiter, Michael; Smith, Doug
2005-04-01
Despite the information content of genomic DNA, ancient DNA studies to date have largely been limited to amplification of mitochondrial DNA due to technical hurdles such as contamination and degradation of ancient DNAs. In this study, we describe two metagenomic libraries constructed using unamplified DNA extracted from the bones of two 40,000-year-old extinct cave bears. Analysis of {approx}1 Mb of sequence from each library showed that, despite significant microbial contamination, 5.8 percent and 1.1 percent of clones in the libraries contain cave bear inserts, yielding 26,861 bp of cave bear genome sequence. Alignment of this sequence to the dog genome,more » the closest sequenced genome to cave bear in terms of evolutionary distance, revealed roughly the expected ratio of cave bear exons, repeats and conserved noncoding sequences. Only 0.04 percent of all clones sequenced were derived from contamination with modern human DNA. Comparison of cave bear with orthologous sequences from several modern bear species revealed the evolutionary relationship of these lineages. Using the metagenomic approach described here, we have recovered substantial quantities of mammalian genomic sequence more than twice as old as any previously reported, establishing the feasibility of ancient DNA genomic sequencing programs.« less
Kringel, D; Ultsch, A; Zimmermann, M; Jansen, J-P; Ilias, W; Freynhagen, R; Griessinger, N; Kopf, A; Stein, C; Doehring, A; Resch, E; Lötsch, J
2017-01-01
Next-generation sequencing (NGS) provides unrestricted access to the genome, but it produces ‘big data’ exceeding in amount and complexity the classical analytical approaches. We introduce a bioinformatics-based classifying biomarker that uses emergent properties in genetics to separate pain patients requiring extremely high opioid doses from controls. Following precisely calculated selection of the 34 most informative markers in the OPRM1, OPRK1, OPRD1 and SIGMAR1 genes, pattern of genotypes belonging to either patient group could be derived using a k-nearest neighbor (kNN) classifier that provided a diagnostic accuracy of 80.6±4%. This outperformed alternative classifiers such as reportedly functional opioid receptor gene variants or complex biomarkers obtained via multiple regression or decision tree analysis. The accumulation of several genetic variants with only minor functional influences may result in a qualitative consequence affecting complex phenotypes, pointing at emergent properties in genetics. PMID:27139154
Kringel, D; Ultsch, A; Zimmermann, M; Jansen, J-P; Ilias, W; Freynhagen, R; Griessinger, N; Kopf, A; Stein, C; Doehring, A; Resch, E; Lötsch, J
2017-10-01
Next-generation sequencing (NGS) provides unrestricted access to the genome, but it produces 'big data' exceeding in amount and complexity the classical analytical approaches. We introduce a bioinformatics-based classifying biomarker that uses emergent properties in genetics to separate pain patients requiring extremely high opioid doses from controls. Following precisely calculated selection of the 34 most informative markers in the OPRM1, OPRK1, OPRD1 and SIGMAR1 genes, pattern of genotypes belonging to either patient group could be derived using a k-nearest neighbor (kNN) classifier that provided a diagnostic accuracy of 80.6±4%. This outperformed alternative classifiers such as reportedly functional opioid receptor gene variants or complex biomarkers obtained via multiple regression or decision tree analysis. The accumulation of several genetic variants with only minor functional influences may result in a qualitative consequence affecting complex phenotypes, pointing at emergent properties in genetics.
Jones, David T; Kandathil, Shaun M
2018-04-26
In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. DeepCov is freely available at https://github.com/psipred/DeepCov. d.t.jones@ucl.ac.uk.
Pan, Xiaoyong; Shen, Hong-Bin
2018-05-02
RNA-binding proteins (RBPs) take over 5∼10% of the eukaryotic proteome and play key roles in many biological processes, e.g. gene regulation. Experimental detection of RBP binding sites is still time-intensive and high-costly. Instead, computational prediction of the RBP binding sites using pattern learned from existing annotation knowledge is a fast approach. From the biological point of view, the local structure context derived from local sequences will be recognized by specific RBPs. However, in computational modeling using deep learning, to our best knowledge, only global representations of entire RNA sequences are employed. So far, the local sequence information is ignored in the deep model construction process. In this study, we present a computational method iDeepE to predict RNA-protein binding sites from RNA sequences by combining global and local convolutional neural networks (CNNs). For the global CNN, we pad the RNA sequences into the same length. For the local CNN, we split a RNA sequence into multiple overlapping fixed-length subsequences, where each subsequence is a signal channel of the whole sequence. Next, we train deep CNNs for multiple subsequences and the padded sequences to learn high-level features, respectively. Finally, the outputs from local and global CNNs are combined to improve the prediction. iDeepE demonstrates a better performance over state-of-the-art methods on two large-scale datasets derived from CLIP-seq. We also find that the local CNN run 1.8 times faster than the global CNN with comparable performance when using GPUs. Our results show that iDeepE has captured experimentally verified binding motifs. https://github.com/xypan1232/iDeepE. xypan172436@gmail.com or hbshen@sjtu.edu.cn. Supplementary data are available at Bioinformatics online.
Comparative Sequence Analysis of Multidrug-Resistant IncA/C Plasmids from Salmonella enterica.
Hoffmann, Maria; Pettengill, James B; Gonzalez-Escalona, Narjol; Miller, John; Ayers, Sherry L; Zhao, Shaohua; Allard, Marc W; McDermott, Patrick F; Brown, Eric W; Monday, Steven R
2017-01-01
Determinants of multidrug resistance (MDR) are often encoded on mobile elements, such as plasmids, transposons, and integrons, which have the potential to transfer among foodborne pathogens, as well as to other virulent pathogens, increasing the threats these traits pose to human and veterinary health. Our understanding of MDR among Salmonella has been limited by the lack of closed plasmid genomes for comparisons across resistance phenotypes, due to difficulties in effectively separating the DNA of these high-molecular weight, low-copy-number plasmids from chromosomal DNA. To resolve this problem, we demonstrate an efficient protocol for isolating, sequencing and closing IncA/C plasmids from Salmonella sp. using single molecule real-time sequencing on a Pacific Biosciences (Pacbio) RS II Sequencer. We obtained six Salmonella enterica isolates from poultry, representing six different serovars, each exhibiting the MDR-Ampc resistance profile. Salmonella plasmids were obtained using a modified mini preparation and transformed with Escherichia coli DH10Br. A Qiagen Large-Construct kit™ was used to recover highly concentrated and purified plasmid DNA that was sequenced using PacBio technology. These six closed IncA/C plasmids ranged in size from 104 to 191 kb and shared a stable, conserved backbone containing 98 core genes, with only six differences among those core genes. The plasmids encoded a number of antimicrobial resistance genes, including those for quaternary ammonium compounds and mercury. We then compared our six IncA/C plasmid sequences: first with 14 IncA/C plasmids derived from S. enterica available at the National Center for Biotechnology Information (NCBI), and then with an additional 38 IncA/C plasmids derived from different taxa. These comparisons allowed us to build an evolutionary picture of how antimicrobial resistance may be mediated by this common plasmid backbone. Our project provides detailed genetic information about resistance genes in plasmids, advances in plasmid sequencing, and phylogenetic analyses, and important insights about how MDR evolution occurs across diverse serotypes from different animal sources, particularly in agricultural settings where antimicrobial drug use practices vary.
NASA Technical Reports Server (NTRS)
Kaljurand, M.; Valentin, J. R.; Shao, M.
1996-01-01
Two alternative input sequences are commonly employed in correlation chromatography (CC). They are sequences derived according to the algorithm of the feedback shift register (i.e., pseudo random binary sequences (PRBS)) and sequences derived by using the uniform random binary sequences (URBS). These two sequences are compared. By applying the "cleaning" data processing technique to the correlograms that result from these sequences, we show that when the PRBS is used the S/N of the correlogram is much higher than the one resulting from using URBS.
A Complete Developmental Sequence of a Drosophila Neuronal Lineage as Revealed by Twin-Spot MARCM
He, Yisheng; Ding, Peng; Kao, Jui-Chun; Lee, Tzumin
2010-01-01
Drosophila brains contain numerous neurons that form complex circuits. These neurons are derived in stereotyped patterns from a fixed number of progenitors, called neuroblasts, and identifying individual neurons made by a neuroblast facilitates the reconstruction of neural circuits. An improved MARCM (mosaic analysis with a repressible cell marker) technique, called twin-spot MARCM, allows one to label the sister clones derived from a common progenitor simultaneously in different colors. It enables identification of every single neuron in an extended neuronal lineage based on the order of neuron birth. Here we report the first example, to our knowledge, of complete lineage analysis among neurons derived from a common neuroblast that relay olfactory information from the antennal lobe (AL) to higher brain centers. By identifying the sequentially derived neurons, we found that the neuroblast serially makes 40 types of AL projection neurons (PNs). During embryogenesis, one PN with multi-glomerular innervation and 18 uniglomerular PNs targeting 17 glomeruli of the adult AL are born. Many more PNs of 22 additional types, including four types of polyglomerular PNs, derive after the neuroblast resumes dividing in early larvae. Although different offspring are generated in a rather arbitrary sequence, the birth order strictly dictates the fate of each post-mitotic neuron, including the fate of programmed cell death. Notably, the embryonic progenitor has an altered temporal identity following each self-renewing asymmetric cell division. After larval hatching, the same progenitor produces multiple neurons for each cell type, but the number of neurons for each type is tightly regulated. These observations substantiate the origin-dependent specification of neuron types. Sequencing neuronal lineages will not only unravel how a complex brain develops but also permit systematic identification of neuron types for detailed structure and function analysis of the brain. PMID:20808769
Demberg, Lilian M; Winkler, Jana; Wilde, Caroline; Simon, Kay-Uwe; Schön, Julia; Rothemund, Sven; Schöneberg, Torsten; Prömel, Simone; Liebscher, Ines
2017-03-17
Members of the adhesion G protein-coupled receptor (aGPCR) family carry an agonistic sequence within their large ectodomains. Peptides derived from this region, called the Stachel sequence, can activate the respective receptor. As the conserved core region of the Stachel sequence is highly similar between aGPCRs, the agonist specificity of Stachel sequence-derived peptides was tested between family members using cell culture-based second messenger assays. Stachel peptides derived from aGPCRs of subfamily VI (GPR110/ADGRF1, GPR116/ADGRF5) and subfamily VIII (GPR64/ADGRG2, GPR126/ADGRG6) are able to activate more than one member of the respective subfamily supporting their evolutionary relationship and defining them as pharmacological receptor subtypes. Extended functional analyses of the Stachel sequences and derived peptides revealed agonist promiscuity, not only within, but also between aGPCR subfamilies. For example, the Stachel -derived peptide of GPR110 (subfamily VI) can activate GPR64 and GPR126 (both subfamily VIII). Our results indicate that key residues in the Stachel sequence are very similar between aGPCRs allowing for agonist promiscuity of several Stachel -derived peptides. Therefore, aGPCRs appear to be pharmacologically more closely related than previously thought. Our findings have direct implications for many aGPCR studies, as potential functional overlap has to be considered for in vitro and in vivo studies. However, it also offers the possibility of a broader use of more potent peptides when the original Stachel sequence is less effective. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
Lavery, Richard; Zakrzewska, Krystyna; Beveridge, David; Bishop, Thomas C.; Case, David A.; Cheatham, Thomas; Dixit, Surjit; Jayaram, B.; Lankas, Filip; Laughton, Charles; Maddocks, John H.; Michon, Alexis; Osman, Roman; Orozco, Modesto; Perez, Alberto; Singh, Tanya; Spackova, Nada; Sponer, Jiri
2010-01-01
It is well recognized that base sequence exerts a significant influence on the properties of DNA and plays a significant role in protein–DNA interactions vital for cellular processes. Understanding and predicting base sequence effects requires an extensive structural and dynamic dataset which is currently unavailable from experiment. A consortium of laboratories was consequently formed to obtain this information using molecular simulations. This article describes results providing information not only on all 10 unique base pair steps, but also on all possible nearest-neighbor effects on these steps. These results are derived from simulations of 50–100 ns on 39 different DNA oligomers in explicit solvent and using a physiological salt concentration. We demonstrate that the simulations are converged in terms of helical and backbone parameters. The results show that nearest-neighbor effects on base pair steps are very significant, implying that dinucleotide models are insufficient for predicting sequence-dependent behavior. Flanking base sequences can notably lead to base pair step parameters in dynamic equilibrium between two conformational sub-states. Although this study only provides limited data on next-nearest-neighbor effects, we suggest that such effects should be analyzed before attempting to predict the sequence-dependent behavior of DNA. PMID:19850719
QRS complex detection based on continuous density hidden Markov models using univariate observations
NASA Astrophysics Data System (ADS)
Sotelo, S.; Arenas, W.; Altuve, M.
2018-04-01
In the electrocardiogram (ECG), the detection of QRS complexes is a fundamental step in the ECG signal processing chain since it allows the determination of other characteristics waves of the ECG and provides information about heart rate variability. In this work, an automatic QRS complex detector based on continuous density hidden Markov models (HMM) is proposed. HMM were trained using univariate observation sequences taken either from QRS complexes or their derivatives. The detection approach is based on the log-likelihood comparison of the observation sequence with a fixed threshold. A sliding window was used to obtain the observation sequence to be evaluated by the model. The threshold was optimized by receiver operating characteristic curves. Sensitivity (Sen), specificity (Spc) and F1 score were used to evaluate the detection performance. The approach was validated using ECG recordings from the MIT-BIH Arrhythmia database. A 6-fold cross-validation shows that the best detection performance was achieved with 2 states HMM trained with QRS complexes sequences (Sen = 0.668, Spc = 0.360 and F1 = 0.309). We concluded that these univariate sequences provide enough information to characterize the QRS complex dynamics from HMM. Future works are directed to the use of multivariate observations to increase the detection performance.
Maldonado-Borges, Josefina Ines; Ku-Cauich, José Roberto; Escobedo-GraciaMedrano, Rosa Maria
2013-01-01
Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa. PMID:24027442
Determination of the sequences of protein-derived peptides and peptide mixtures by mass spectrometry
Morris, Howard R.; Williams, Dudley H.; Ambler, Richard P.
1971-01-01
Micro-quantities of protein-derived peptides have been converted into N-acetylated permethyl derivatives, and their sequences determined by low-resolution mass spectrometry without prior knowledge of their amino acid compositions or lengths. A new strategy is suggested for the mass spectrometric sequencing of oligopeptides or proteins, involving gel filtration of protein hydrolysates and subsequent sequence analysis of peptide mixtures. Finally, results are given that demonstrate for the first time the use of mass spectrometry for the analysis of a protein-derived peptide mixture, again without prior knowledge of the protein or components within the mixture. PMID:5158904
Bai, Yu; Iwasaki, Yuki; Kanaya, Shigehiko; Zhao, Yue; Ikemura, Toshimichi
2014-01-01
With remarkable increase of genomic sequence data of a wide range of species, novel tools are needed for comprehensive analyses of the big sequence data. Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on one map. By modifying the conventional SOM, we have previously developed Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes and then the compositions in the human and mouse genomes in order to investigate an efficient method for detecting differences between the closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," and the specific regions specifically enriched in transcription-factor-binding sequences. Because the classification and visualization power is very high, BLSOM is an efficient powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
Evaluating the protein coding potential of exonized transposable element sequences
Piriyapongsa, Jittima; Rutledge, Mark T; Patel, Sanil; Borodovsky, Mark; Jordan, I King
2007-01-01
Background Transposable element (TE) sequences, once thought to be merely selfish or parasitic members of the genomic community, have been shown to contribute a wide variety of functional sequences to their host genomes. Analysis of complete genome sequences have turned up numerous cases where TE sequences have been incorporated as exons into mRNAs, and it is widely assumed that such 'exonized' TEs encode protein sequences. However, the extent to which TE-derived sequences actually encode proteins is unknown and a matter of some controversy. We have tried to address this outstanding issue from two perspectives: i-by evaluating ascertainment biases related to the search methods used to uncover TE-derived protein coding sequences (CDS) and ii-through a probabilistic codon-frequency based analysis of the protein coding potential of TE-derived exons. Results We compared the ability of three classes of sequence similarity search methods to detect TE-derived sequences among data sets of experimentally characterized proteins: 1-a profile-based hidden Markov model (HMM) approach, 2-BLAST methods and 3-RepeatMasker. Profile based methods are more sensitive and more selective than the other methods evaluated. However, the application of profile-based search methods to the detection of TE-derived sequences among well-curated experimentally characterized protein data sets did not turn up many more cases than had been previously detected and nowhere near as many cases as recent genome-wide searches have. We observed that the different search methods used were complementary in the sense that they yielded largely non-overlapping sets of hits and differed in their ability to recover known cases of TE-derived CDS. The probabilistic analysis of TE-derived exon sequences indicates that these sequences have low protein coding potential on average. In particular, non-autonomous TEs that do not encode protein sequences, such as Alu elements, are frequently exonized but unlikely to encode protein sequences. Conclusion The exaptation of the numerous TE sequences found in exons as bona fide protein coding sequences may prove to be far less common than has been suggested by the analysis of complete genomes. We hypothesize that many exonized TE sequences actually function as post-transcriptional regulators of gene expression, rather than coding sequences, which may act through a variety of double stranded RNA related regulatory pathways. Indeed, their relatively high copy numbers and similarity to sequences dispersed throughout the genome suggests that exonized TE sequences could serve as master regulators with a wide scope of regulatory influence. Reviewers: This article was reviewed by Itai Yanai, Kateryna D. Makova, Melissa Wilson (nominated by Kateryna D. Makova) and Cedric Feschotte (nominated by John M. Logsdon Jr.). PMID:18036258
A matter of emphasis: Linguistic stress habits modulate serial recall.
Taylor, John C; Macken, Bill; Jones, Dylan M
2015-04-01
Models of short-term memory for sequential information rely on item-level, feature-based descriptions to account for errors in serial recall. Transposition errors within alternating similar/dissimilar letter sequences derive from interactions between overlapping features. However, in two experiments, we demonstrated that the characteristics of the sequence are what determine the fates of items, rather than the properties ascribed to the items themselves. Performance in alternating sequences is determined by the way that the sequences themselves induce particular prosodic rehearsal patterns, and not by the nature of the items per se. In a serial recall task, the shapes of the canonical "saw-tooth" serial position curves and transposition error probabilities at successive input-output distances were modulated by subvocal rehearsal strategies, despite all item-based parameters being held constant. We replicated this finding using nonalternating lists, thus demonstrating that transpositions are substantially influenced by prosodic features-such as stress-that emerge during subvocal rehearsal.
Automated design of degenerate codon libraries.
Mena, Marco A; Daugherty, Patrick S
2005-12-01
Degenerate codon libraries are frequently used in protein engineering and evolution studies but are often limited to targeting a small number of positions to adequately limit the search space. To mitigate this, codon degeneracy can be limited using heuristics or previous knowledge of the targeted positions. To automate design of libraries given a set of amino acid sequences, an algorithm (LibDesign) was developed that generates a set of possible degenerate codon libraries, their resulting size, and their score relative to a user-defined scoring function. A gene library of a specified size can then be constructed that is representative of the given amino acid distribution or that includes specific sequences or combinations thereof. LibDesign provides a new tool for automated design of high-quality protein libraries that more effectively harness existing sequence-structure information derived from multiple sequence alignment or computational protein design data.
De Groot, Anne S; Rappuoli, Rino
2004-02-01
Vaccine research entered a new era when the complete genome of a pathogenic bacterium was published in 1995. Since then, more than 97 bacterial pathogens have been sequenced and at least 110 additional projects are now in progress. Genome sequencing has also dramatically accelerated: high-throughput facilities can draft the sequence of an entire microbe (two to four megabases) in 1 to 2 days. Vaccine developers are using microarrays, immunoinformatics, proteomics and high-throughput immunology assays to reduce the truly unmanageable volume of information available in genome databases to a manageable size. Vaccines composed by novel antigens discovered from genome mining are already in clinical trials. Within 5 years we can expect to see a novel class of vaccines composed by genome-predicted, assembled and engineered T- and Bcell epitopes. This article addresses the convergence of three forces--microbial genome sequencing, computational immunology and new vaccine technologies--that are shifting genome mining for vaccines onto the forefront of immunology research.
Top-down analysis of protein samples by de novo sequencing techniques
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.
MOTIVATION: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. RESULTS: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. Themore » former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.« less
Compilation of DNA sequences of Escherichia coli (update 1991)
Kröger, Manfred; Wahl, Ralf; Rice, Peter
1991-01-01
We have compiled the DNA sequence data for E.coli available from the GENBANK and EMBL data libraries and over a period of several years independently from the literature. This is the third listing replacing and increasing the former listing roughly by one fifth. However, in order to save space this printed version contains DNA sequence information only. The complete compilation is now available in machine readable form from the EMBL data library (ECD release 6). After deletion of all detected overlaps a total of 1 492 282 individual bp is found to be determined till the beginning of 1991. This corresponds to a total of 31.62% of the entire E.coli chromosome consisting of about 4,720 kbp. This number may actually be higher by some extra 2,5% derived from lysogenic bacteriophage lambda and various DNA sequences already received for statistical purposes only. PMID:2041799
Tikkanen, Tuomas; Leroy, Bernard; Fournier, Jean Louis; Risques, Rosa Ana; Malcikova, Jitka; Soussi, Thierry
2018-07-01
Accurate annotation of genomic variants in human diseases is essential to allow personalized medicine. Assessment of somatic and germline TP53 alterations has now reached the clinic and is required in several circumstances such as the identification of the most effective cancer therapy for patients with chronic lymphocytic leukemia (CLL). Here, we present Seshat, a Web service for annotating TP53 information derived from sequencing data. A flexible framework allows the use of standard file formats such as Mutation Annotation Format (MAF) or Variant Call Format (VCF), as well as common TXT files. Seshat performs accurate variant annotations using the Human Genome Variation Society (HGVS) nomenclature and the stable TP53 genomic reference provided by the Locus Reference Genomic (LRG). In addition, using the 2017 release of the UMD_TP53 database, Seshat provides multiple statistical information for each TP53 variant including database frequency, functional activity, or pathogenicity. The information is delivered in standardized output tables that minimize errors and facilitate comparison of mutational data across studies. Seshat is a beneficial tool to interpret the ever-growing TP53 sequencing data generated by multiple sequencing platforms and it is freely available via the TP53 Website, http://p53.fr or directly at http://vps338341.ovh.net/. © 2018 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Westfeld, Patrick; Maas, Hans-Gerd; Bringmann, Oliver; Gröllich, Daniel; Schmauder, Martin
2013-11-01
The paper shows techniques for the determination of structured motion parameters from range camera image sequences. The core contribution of the work presented here is the development of an integrated least squares 3D tracking approach based on amplitude and range image sequences to calculate dense 3D motion vector fields. Geometric primitives of a human body model are fitted to time series of range camera point clouds using these vector fields as additional information. Body poses and motion information for individual body parts are derived from the model fit. On the basis of these pose and motion parameters, critical body postures are detected. The primary aim of the study is to automate ergonomic studies for risk assessments regulated by law, identifying harmful movements and awkward body postures in a workplace.
Onozawa, Masahiro; Zhang, Zhenhua; Kim, Yoo Jung; Goldberg, Liat; Varga, Tamas; Bergsagel, P Leif; Kuehl, W Michael; Aplan, Peter D
2014-05-27
We used the I-SceI endonuclease to produce DNA double-strand breaks (DSBs) and observed that a fraction of these DSBs were repaired by insertion of sequences, which we termed "templated sequence insertions" (TSIs), derived from distant regions of the genome. These TSIs were derived from genic, retrotransposon, or telomere sequences and were not deleted from the donor site in the genome, leading to the hypothesis that they were derived from reverse-transcribed RNA. Cotransfection of RNA and an I-SceI expression vector demonstrated insertion of RNA-derived sequences at the DNA-DSB site, and TSIs were suppressed by reverse-transcriptase inhibitors. Both observations support the hypothesis that TSIs were derived from RNA templates. In addition, similar insertions were detected at sites of DNA DSBs induced by transcription activator-like effector nuclease proteins. Whole-genome sequencing of myeloma cell lines revealed additional TSIs, demonstrating that repair of DNA DSBs via insertion was not restricted to experimentally produced DNA DSBs. Analysis of publicly available databases revealed that many of these TSIs are polymorphic in the human genome. Taken together, these results indicate that insertional events should be considered as alternatives to gross chromosomal rearrangements in the interpretation of whole-genome sequence data and that this mutagenic form of DNA repair may play a role in genetic disease, exon shuffling, and mammalian evolution.
Disfani, Fatemeh Miri; Hsu, Wei-Lun; Mizianty, Marcin J.; Oldfield, Christopher J.; Xue, Bin; Dunker, A. Keith; Uversky, Vladimir N.; Kurgan, Lukasz
2012-01-01
Motivation: Molecular recognition features (MoRFs) are short binding regions located within longer intrinsically disordered regions that bind to protein partners via disorder-to-order transitions. MoRFs are implicated in important processes including signaling and regulation. However, only a limited number of experimentally validated MoRFs is known, which motivates development of computational methods that predict MoRFs from protein chains. Results: We introduce a new MoRF predictor, MoRFpred, which identifies all MoRF types (α, β, coil and complex). We develop a comprehensive dataset of annotated MoRFs to build and empirically compare our method. MoRFpred utilizes a novel design in which annotations generated by sequence alignment are fused with predictions generated by a Support Vector Machine (SVM), which uses a custom designed set of sequence-derived features. The features provide information about evolutionary profiles, selected physiochemical properties of amino acids, and predicted disorder, solvent accessibility and B-factors. Empirical evaluation on several datasets shows that MoRFpred outperforms related methods: α-MoRF-Pred that predicts α-MoRFs and ANCHOR which finds disordered regions that become ordered when bound to a globular partner. We show that our predicted (new) MoRF regions have non-random sequence similarity with native MoRFs. We use this observation along with the fact that predictions with higher probability are more accurate to identify putative MoRF regions. We also identify a few sequence-derived hallmarks of MoRFs. They are characterized by dips in the disorder predictions and higher hydrophobicity and stability when compared to adjacent (in the chain) residues. Availability: http://biomine.ece.ualberta.ca/MoRFpred/; http://biomine.ece.ualberta.ca/MoRFpred/Supplement.pdf Contact: lkurgan@ece.ualberta.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22689782
Predicting Protein-Protein Interactions by Combing Various Sequence-Derived.
Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao
2011-09-20
Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information, these features measure the interactions between residues a certain distant apart in the protein sequences from different aspects. To achieve better performance, the principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to Support Vector Machine (SVM). Experimental results on Drosophila melanogaster and Helicobater pylori datasets show that our method is very promising to predict PPIs and may at least be a useful supplement tool to existing methods.
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequences database plays an important role for developing competitive pharmacological products. Comparing the unseen sequence with all the identified protein sequences and returning the category index with the highest similarity scored protein, conventional methods are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using SLFNs. The recent efficient extreme learning machine (ELM) and its invariants are utilized as the training algorithms. The optimal pruned ELM is first employed for protein sequence classification in this paper. To further enhance the performance, the ensemble based SLFNs structure is constructed where multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the priority of the proposed algorithms. PMID:24795876
Buckley, Michael; Warwood, Stacey; van Dongen, Bart; Kitchener, Andrew C; Manning, Phillip L
2017-05-31
A decade ago, reports that organic-rich soft tissue survived from dinosaur fossils were apparently supported by proteomics-derived sequence information of exceptionally well-preserved bone. This initial claim to the sequencing of endogenous collagen peptides from an approximately 68 Myr Tyrannosaurus rex fossil was highly controversial, largely on the grounds of potential contamination from either bacterial biofilms or from laboratory practice. In a subsequent study, collagen peptide sequences from an approximately 78 Myr Brachylophosaurus canadensis fossil were reported that have remained largely unchallenged. However, the endogeneity of these sequences relies heavily on a single peptide sequence, apparently unique to both dinosaurs. Given the potential for cross-contamination from modern bone analysed by the same team, here we extract collagen from bone samples of three individuals of ostrich, Struthio camelus The resulting LC-MS/MS data were found to match all of the proposed sequences for both the original Tyrannosaurus and Brachylophosaurus studies. Regardless of the true nature of the dinosaur peptides, our finding highlights the difficulty of differentiating such sequences with confidence. Our results not only imply that cross-contamination cannot be ruled out, but that appropriate measures to test for endogeneity should be further evaluated. © 2017 The Authors.
Warwood, Stacey; van Dongen, Bart; Kitchener, Andrew C.; Manning, Phillip L.
2017-01-01
A decade ago, reports that organic-rich soft tissue survived from dinosaur fossils were apparently supported by proteomics-derived sequence information of exceptionally well-preserved bone. This initial claim to the sequencing of endogenous collagen peptides from an approximately 68 Myr Tyrannosaurus rex fossil was highly controversial, largely on the grounds of potential contamination from either bacterial biofilms or from laboratory practice. In a subsequent study, collagen peptide sequences from an approximately 78 Myr Brachylophosaurus canadensis fossil were reported that have remained largely unchallenged. However, the endogeneity of these sequences relies heavily on a single peptide sequence, apparently unique to both dinosaurs. Given the potential for cross-contamination from modern bone analysed by the same team, here we extract collagen from bone samples of three individuals of ostrich, Struthio camelus. The resulting LC–MS/MS data were found to match all of the proposed sequences for both the original Tyrannosaurus and Brachylophosaurus studies. Regardless of the true nature of the dinosaur peptides, our finding highlights the difficulty of differentiating such sequences with confidence. Our results not only imply that cross-contamination cannot be ruled out, but that appropriate measures to test for endogeneity should be further evaluated. PMID:28566488
Genotype imputation in a coalescent model with infinitely-many-sites mutation
Huang, Lucy; Buzbas, Erkan O.; Rosenberg, Noah A.
2012-01-01
Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that under this model, imputation accuracy—as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence—increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined. PMID:23079542
NASA Technical Reports Server (NTRS)
Rogers, J. W.
1975-01-01
The results of an experimental investigation on recording information on thermoplastic are given. A description was given of a typical fabrication configuration, the recording sequence, and the samples which were examined. There are basically three configurations which can be used for the recording of information on thermoplastic. The most popular technique uses corona which furnishes free charge. The necessary energy for deformation is derived from a charge layer atop the thermoplastic. The other two techniques simply use a dc potential in place of the corona for deformation energy.
Chigira, M; Watanabe, H
1994-07-01
Preservation of the identity of DNA is the ultimate goal of multicellular organisms. An abnormal DNA sequence in cells within an individual means its parasitic nature in cell society as shown in tumors. Somatic gene arrangement and gene mutation in development may be considered as de novo formation of parasites. It is likely that the developmental process with genetic alterations means symbiosis between altered cells and germ line cells preserving genetic information without alterations, when somatic alteration of DNA sequence is a major mechanism of differentiation. According to the selfish gene theory of Dawkins, germ line cells permit symbiosis when somatic cell society derives clear profit for the replication of original DNA copies.
Defining and predicting structurally conserved regions in protein superfamilies
Huang, Ivan K.; Grishin, Nick V.
2013-01-01
Motivation: The structures of homologous proteins are generally better conserved than their sequences. This phenomenon is demonstrated by the prevalence of structurally conserved regions (SCRs) even in highly divergent protein families. Defining SCRs requires the comparison of two or more homologous structures and is affected by their availability and divergence, and our ability to deduce structurally equivalent positions among them. In the absence of multiple homologous structures, it is necessary to predict SCRs of a protein using information from only a set of homologous sequences and (if available) a single structure. Accurate SCR predictions can benefit homology modelling and sequence alignment. Results: Using pairwise DaliLite alignments among a set of homologous structures, we devised a simple measure of structural conservation, termed structural conservation index (SCI). SCI was used to distinguish SCRs from non-SCRs. A database of SCRs was compiled from 386 SCOP superfamilies containing 6489 protein domains. Artificial neural networks were then trained to predict SCRs with various features deduced from a single structure and homologous sequences. Assessment of the predictions via a 5-fold cross-validation method revealed that predictions based on features derived from a single structure perform similarly to ones based on homologous sequences, while combining sequence and structural features was optimal in terms of accuracy (0.755) and Matthews correlation coefficient (0.476). These results suggest that even without information from multiple structures, it is still possible to effectively predict SCRs for a protein. Finally, inspection of the structures with the worst predictions pinpoints difficulties in SCR definitions. Availability: The SCR database and the prediction server can be found at http://prodata.swmed.edu/SCR. Contact: 91huangi@gmail.com or grishin@chop.swmed.edu Supplementary information: Supplementary data are available at Bioinformatics Online PMID:23193223
Sequential associative memory with nonuniformity of the layer sizes.
Teramae, Jun-Nosuke; Fukai, Tomoki
2007-01-01
Sequence retrieval has a fundamental importance in information processing by the brain, and has extensively been studied in neural network models. Most of the previous sequential associative memory embedded sequences of memory patterns have nearly equal sizes. It was recently shown that local cortical networks display many diverse yet repeatable precise temporal sequences of neuronal activities, termed "neuronal avalanches." Interestingly, these avalanches displayed size and lifetime distributions that obey power laws. Inspired by these experimental findings, here we consider an associative memory model of binary neurons that stores sequences of memory patterns with highly variable sizes. Our analysis includes the case where the statistics of these size variations obey the above-mentioned power laws. We study the retrieval dynamics of such memory systems by analytically deriving the equations that govern the time evolution of macroscopic order parameters. We calculate the critical sequence length beyond which the network cannot retrieve memory sequences correctly. As an application of the analysis, we show how the present variability in sequential memory patterns degrades the power-law lifetime distribution of retrieved neural activities.
Portales-Casamar, Elodie; Arenillas, David; Lim, Jonathan; Swanson, Magdalena I.; Jiang, Steven; McCallum, Anthony; Kirov, Stefan; Wasserman, Wyeth W.
2009-01-01
The PAZAR database unites independently created and maintained data collections of transcription factor and regulatory sequence annotation. The flexible PAZAR schema permits the representation of diverse information derived from experiments ranging from biochemical protein–DNA binding to cellular reporter gene assays. Data collections can be made available to the public, or restricted to specific system users. The data ‘boutiques’ within the shopping-mall-inspired system facilitate the analysis of genomics data and the creation of predictive models of gene regulation. Since its initial release, PAZAR has grown in terms of data, features and through the addition of an associated package of software tools called the ORCA toolkit (ORCAtk). ORCAtk allows users to rapidly develop analyses based on the information stored in the PAZAR system. PAZAR is available at http://www.pazar.info. ORCAtk can be accessed through convenient buttons located in the PAZAR pages or via our website at http://www.cisreg.ca/ORCAtk. PMID:18971253
E-Learning for Rare Diseases: An Example Using Fabry Disease.
Cimmaruta, Chiara; Liguori, Ludovica; Monticelli, Maria; Andreotti, Giuseppina; Citro, Valentina
2017-09-24
Rare diseases represent a challenge for physicians because patients are rarely seen, and they can manifest with symptoms similar to those of common diseases. In this work, genetic confirmation of diagnosis is derived from DNA sequencing. We present a tutorial for the molecular analysis of a rare disease using Fabry disease as an example. An exonic sequence derived from a hypothetical male patient was matched against human reference data using a genome browser. The missense mutation was identified by running BlastX, and information on the affected protein was retrieved from the database UniProt. The pathogenic nature of the mutation was assessed with PolyPhen-2. Disease-specific databases were used to assess whether the missense mutation led to a severe phenotype, and whether pharmacological therapy was an option. An inexpensive bioinformatics approach is presented to get the reader acquainted with the diagnosis of Fabry disease. The reader is introduced to the field of pharmacological chaperones, a therapeutic approach that can be applied only to certain Fabry genotypes. The principle underlying the analysis of exome sequencing can be explained in simple terms using web applications and databases which facilitate diagnosis and therapeutic choices.
Quantifying the Relationships among Drug Classes
Hert, Jérôme; Keiser, Michael J.; Irwin, John J.; Oprea, Tudor I.; Shoichet, Brian K.
2009-01-01
The similarity of drug targets is typically measured using sequence or structural information. Here, we consider chemo-centric approaches that measure target similarity on the basis of their ligands, asking how chemoinformatics similarities differ from those derived bioinformatically, how stable the ligand networks are to changes in chemoinformatics metrics, and which network is the most reliable for prediction of pharmacology. We calculated the similarities between hundreds of drug targets and their ligands and mapped the relationship between them in a formal network. Bioinformatics networks were based on the BLAST similarity between sequences, while chemoinformatics networks were based on the ligand-set similarities calculated with either the Similarity Ensemble Approach (SEA) or a method derived from Bayesian statistics. By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity. In contrast, the chemoinformatics networks were stable to the method used to calculate the ligand-set similarities and to the chemical representation of the ligands. Also, the chemoinformatics networks were more natural and more organized, by network theory, than their bioinformatics counterparts: ligand-based networks were found to be small-world and broad-scale. PMID:18335977
Establishing gene models from the Pinus pinaster genome using gene capture and BAC sequencing.
Seoane-Zonjic, Pedro; Cañas, Rafael A; Bautista, Rocío; Gómez-Maldonado, Josefa; Arrillaga, Isabel; Fernández-Pozo, Noé; Claros, M Gonzalo; Cánovas, Francisco M; Ávila, Concepción
2016-02-27
In the era of DNA throughput sequencing, assembling and understanding gymnosperm mega-genomes remains a challenge. Although drafts of three conifer genomes have recently been published, this number is too low to understand the full complexity of conifer genomes. Using techniques focused on specific genes, gene models can be established that can aid in the assembly of gene-rich regions, and this information can be used to compare genomes and understand functional evolution. In this study, gene capture technology combined with BAC isolation and sequencing was used as an experimental approach to establish de novo gene structures without a reference genome. Probes were designed for 866 maritime pine transcripts to sequence genes captured from genomic DNA. The gene models were constructed using GeneAssembler, a new bioinformatic pipeline, which reconstructed over 82% of the gene structures, and a high proportion (85%) of the captured gene models contained sequences from the promoter regulatory region. In a parallel experiment, the P. pinaster BAC library was screened to isolate clones containing genes whose cDNA sequence were already available. BAC clones containing the asparagine synthetase, sucrose synthase and xyloglucan endotransglycosylase gene sequences were isolated and used in this study. The gene models derived from the gene capture approach were compared with the genomic sequences derived from the BAC clones. This combined approach is a particularly efficient way to capture the genomic structures of gene families with a small number of members. The experimental approach used in this study is a valuable combined technique to study genomic gene structures in species for which a reference genome is unavailable. It can be used to establish exon/intron boundaries in unknown gene structures, to reconstruct incomplete genes and to obtain promoter sequences that can be used for transcriptional studies. A bioinformatics algorithm (GeneAssembler) is also provided as a Ruby gem for this class of analyses.
Investigation of modulation parameters in multiplexing gas chromatography.
Trapp, Oliver
2010-10-22
Combination of information technology and separation sciences opens a new avenue to achieve high sample throughputs and therefore is of great interest to bypass bottlenecks in catalyst screening of parallelized reactors or using multitier well plates in reaction optimization. Multiplexing gas chromatography utilizes pseudo-random injection sequences derived from Hadamard matrices to perform rapid sample injections which gives a convoluted chromatogram containing the information of a single sample or of several samples with similar analyte composition. The conventional chromatogram is obtained by application of the Hadamard transform using the known injection sequence or in case of several samples an averaged transformed chromatogram is obtained which can be used in a Gauss-Jordan deconvolution procedure to obtain all single chromatograms of the individual samples. The performance of such a system depends on the modulation precision and on the parameters, e.g. the sequence length and modulation interval. Here we demonstrate the effects of the sequence length and modulation interval on the deconvoluted chromatogram, peak shapes and peak integration for sequences between 9-bit (511 elements) and 13-bit (8191 elements) and modulation intervals Δt between 5 s and 500 ms using a mixture of five components. It could be demonstrated that even for high-speed modulation at time intervals of 500 ms the chromatographic information is very well preserved and that the separation efficiency can be improved by very narrow sample injections. Furthermore this study shows that the relative peak areas in multiplexed chromatograms do not deviate from conventionally recorded chromatograms. Copyright © 2010 Elsevier B.V. All rights reserved.
A novel, privacy-preserving cryptographic approach for sharing sequencing data
Cassa, Christopher A; Miller, Rachel A; Mandl, Kenneth D
2013-01-01
Objective DNA samples are often processed and sequenced in facilities external to the point of collection. These samples are routinely labeled with patient identifiers or pseudonyms, allowing for potential linkage to identity and private clinical information if intercepted during transmission. We present a cryptographic scheme to securely transmit externally generated sequence data which does not require any patient identifiers, public key infrastructure, or the transmission of passwords. Materials and methods This novel encryption scheme cryptographically protects participant sequence data using a shared secret key that is derived from a unique subset of an individual’s genetic sequence. This scheme requires access to a subset of an individual’s genetic sequence to acquire full access to the transmitted sequence data, which helps to prevent sample mismatch. Results We validate that the proposed encryption scheme is robust to sequencing errors, population uniqueness, and sibling disambiguation, and provides sufficient cryptographic key space. Discussion Access to a set of an individual’s genotypes and a mutually agreed cryptographic seed is needed to unlock the full sequence, which provides additional sample authentication and authorization security. We present modest fixed and marginal costs to implement this transmission architecture. Conclusions It is possible for genomics researchers who sequence participant samples externally to protect the transmission of sequence data using unique features of an individual’s genetic sequence. PMID:23125421
The twilight zone of cis element alignments.
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-02-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein-DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein-DNA interfaces is significantly more conserved than their sequence and measures how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
The twilight zone of cis element alignments
Sebastian, Alvaro; Contreras-Moreira, Bruno
2013-01-01
Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein–DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein–DNA interfaces is significantly more conserved than their sequence and measures how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments. PMID:23268451
Niv, Masha Y.; Skrabanek, Lucy; Roberts, Richard J.; Scheraga, Harold A.; Weinstein, Harel
2008-01-01
Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CGGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering. PMID:17972284
Bào, Yīmíng; Amarasinghe, Gaya K; Basler, Christopher F; Bavari, Sina; Bukreyev, Alexander; Chandran, Kartik; Dolnik, Olga; Dye, John M; Ebihara, Hideki; Formenty, Pierre; Hewson, Roger; Kobinger, Gary P; Leroy, Eric M; Mühlberger, Elke; Netesov, Sergey V; Patterson, Jean L; Paweska, Janusz T; Smither, Sophie J; Takada, Ayato; Towner, Jonathan S; Volchkov, Viktor E; Wahl-Jensen, Victoria; Kuhn, Jens H
2017-05-11
The mononegaviral family Filoviridae has eight members assigned to three genera and seven species. Until now, genus and species demarcation were based on arbitrarily chosen filovirus genome sequence divergence values (≈50% for genera, ≈30% for species) and arbitrarily chosen phenotypic virus or virion characteristics. Here we report filovirus genome sequence-based taxon demarcation criteria using the publicly accessible PAirwise Sequencing Comparison (PASC) tool of the US National Center for Biotechnology Information (Bethesda, MD, USA). Comparison of all available filovirus genomes in GenBank using PASC revealed optimal genus demarcation at the 55-58% sequence diversity threshold range for genera and at the 23-36% sequence diversity threshold range for species. Because these thresholds do not change the current official filovirus classification, these values are now implemented as filovirus taxon demarcation criteria that may solely be used for filovirus classification in case additional data are absent. A near-complete, coding-complete, or complete filovirus genome sequence will now be required to allow official classification of any novel "filovirus." Classification of filoviruses into existing taxa or determining the need for novel taxa is now straightforward and could even become automated using a presented algorithm/flowchart rooted in RefSeq (type) sequences.
Sockeye: A 3D Environment for Comparative Genomics
Montgomery, Stephen B.; Astakhova, Tamara; Bilenky, Mikhail; Birney, Ewan; Fu, Tony; Hassel, Maik; Melsopp, Craig; Rak, Marcin; Robertson, A. Gordon; Sleumer, Monica; Siddiqui, Asim S.; Jones, Steven J.M.
2004-01-01
Comparative genomics techniques are used in bioinformatics analyses to identify the structural and functional properties of DNA sequences. As the amount of available sequence data steadily increases, the ability to perform large-scale comparative analyses has become increasingly relevant. In addition, the growing complexity of genomic feature annotation means that new approaches to genomic visualization need to be explored. We have developed a Java-based application called Sockeye that uses three-dimensional (3D) graphics technology to facilitate the visualization of annotation and conservation across multiple sequences. This software uses the Ensembl database project to import sequence and annotation information from several eukaryotic species. A user can additionally import their own custom sequence and annotation data. Individual annotation objects are displayed in Sockeye by using custom 3D models. Ensembl-derived and imported sequences can be analyzed by using a suite of multiple and pair-wise alignment algorithms. The results of these comparative analyses are also displayed in the 3D environment of Sockeye. By using the Java3D API to visualize genomic data in a 3D environment, we are able to compactly display cross-sequence comparisons. This provides the user with a novel platform for visualizing and comparing genomic feature organization. PMID:15123592
Niv, Masha Y; Skrabanek, Lucy; Roberts, Richard J; Scheraga, Harold A; Weinstein, Harel
2008-05-01
Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CGGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering.
NASA Astrophysics Data System (ADS)
Randahl, Mira
2016-08-01
This paper reports on a study about how the mathematics textbook was perceived and used by the teacher in the context of a calculus part of a basic mathematics course for first-year engineering students. The focus was on the teacher's choices and the use of definitions, examples and exercises in a sequence of lectures introducing the derivative concept. Data were collected during observations of lectures and an interview, and informal talks with the teacher. The introduction and the treatment of the derivative as proposed by the teacher during the lectures were analysed in relation to the results of the content text analysis of the textbook. The teacher's decisions were explored through the lens of intended learning goals for engineering students taking the mathematics course. The results showed that the sequence of concepts and the formal introduction of the derivative as proposed by the textbook were closely followed during the lectures. The examples and tasks offered to the students focused strongly on procedural knowledge. Although the textbook proposes both examples and exercises that promote conceptual knowledge, these opportunities were not fully utilized during the observed lectures. Possible reasons for the teacher's choices and decisions are discussed.
Multilayer material characterization using thermographic signal reconstruction
NASA Astrophysics Data System (ADS)
Shepard, Steven M.; Beemer, Maria Frendberg
2016-02-01
Active-thermography has become a well-established Nondestructive Testing (NDT) method for detection of subsurface flaws. In its simplest form, flaw detection is based on visual identification of contrast between a flaw and local intact regions in an IR image sequence of the surface temperature as the sample responds to thermal stimulation. However, additional information and insight can be obtained from the sequence, even in the absence of a flaw, through analysis of the logarithmic derivatives of individual pixel time histories using the Thermographic Signal Reconstruction (TSR) method. For example, the response of a flaw-free multilayer sample to thermal stimulation can be viewed as a simple transition between the responses of infinitely thick samples of the individual constituent layers over the lifetime of the thermal diffusion process. The transition is represented compactly and uniquely by the logarithmic derivatives, based on the ratio of thermal effusivities of the layers. A spectrum of derivative responses relative to thermal effusivity ratios allows prediction of the time scale and detectability of the interface, and measurement of the thermophysical properties of one layer if the properties of the other are known. A similar transition between steady diffusion states occurs for flat bottom holes, based on the hole aspect ratio.
Proteomics technique opens new frontiers in mobilome research.
Davidson, Andrew D; Matthews, David A; Maringer, Kevin
2017-01-01
A large proportion of the genome of most eukaryotic organisms consists of highly repetitive mobile genetic elements. The sum of these elements is called the "mobilome," which in eukaryotes is made up mostly of transposons. Transposable elements contribute to disease, evolution, and normal physiology by mediating genetic rearrangement, and through the "domestication" of transposon proteins for cellular functions. Although 'omics studies of mobilome genomes and transcriptomes are common, technical challenges have hampered high-throughput global proteomics analyses of transposons. In a recent paper, we overcame these technical hurdles using a technique called "proteomics informed by transcriptomics" (PIT), and thus published the first unbiased global mobilome-derived proteome for any organism (using cell lines derived from the mosquito Aedes aegypti ). In this commentary, we describe our methods in more detail, and summarise our major findings. We also use new genome sequencing data to show that, in many cases, the specific genomic element expressing a given protein can be identified using PIT. This proteomic technique therefore represents an important technological advance that will open new avenues of research into the role that proteins derived from transposons and other repetitive and sequence diverse genetic elements, such as endogenous retroviruses, play in health and disease.
Lehnherr, Dan; Chen, Chen; Pedramrazi, Zahra; DeBlase, Catherine R.; Alzola, Joaquin M.; Keresztes, Ivan; Lobkovsky, Emil B.
2016-01-01
A Cu-catalyzed benzannulation reaction transforms ortho(arylene ethynylene) oligomers into ortho-arylenes. This approach circumvents iterative Suzuki cross-coupling reactions previously used to assemble hindered ortho-arylene backbones. These derivatives form helical folded structures in the solid-state and in solution, as demonstrated by X-ray crystallography and solution-state NMR analysis. DFT calculations of misfolded conformations are correlated with variable-temperature 1H and EXSY NMR to reveal that folding is cooperative and more favorable in halide-substituted naphthalenes. Helical ortho-arylene foldamers with specific aromatic sequences organize functional π-electron systems into arrangements ideal for ambipolar charge transport and show preliminary promise for the surface-mediated synthesis of structurally defined graphene nanoribbons. PMID:28567248
Guarnieri, Michael T.; Nag, Ambarish; Smolinski, Sharon L.; Darzins, Al; Seibert, Michael; Pienkos, Philip T.
2011-01-01
Biofuels derived from algal lipids represent an opportunity to dramatically impact the global energy demand for transportation fuels. Systems biology analyses of oleaginous algae could greatly accelerate the commercialization of algal-derived biofuels by elucidating the key components involved in lipid productivity and leading to the initiation of hypothesis-driven strain-improvement strategies. However, higher-level systems biology analyses, such as transcriptomics and proteomics, are highly dependent upon available genomic sequence data, and the lack of these data has hindered the pursuit of such analyses for many oleaginous microalgae. In order to examine the triacylglycerol biosynthetic pathway in the unsequenced oleaginous microalga, Chlorella vulgaris, we have established a strategy with which to bypass the necessity for genomic sequence information by using the transcriptome as a guide. Our results indicate an upregulation of both fatty acid and triacylglycerol biosynthetic machinery under oil-accumulating conditions, and demonstrate the utility of a de novo assembled transcriptome as a search model for proteomic analysis of an unsequenced microalga. PMID:22043295
Zhao, Xiao-Wei; Ma, Zhi-Qiang; Yin, Ming-Hao
2012-05-01
Knowledge of protein-protein interactions (PPIs) plays an important role in constructing protein interaction networks and understanding the general machineries of biological systems. In this study, a new method is proposed to predict PPIs using a comprehensive set of 930 features based only on sequence information, these features measure the interactions between residues a certain distant apart in the protein sequences from different aspects. To achieve better performance, the principal component analysis (PCA) is first employed to obtain an optimized feature subset. Then, the resulting 67-dimensional feature vectors are fed to Support Vector Machine (SVM). Experimental results on Drosophila melanogaster and Helicobater pylori datasets show that our method is very promising to predict PPIs and may at least be a useful supplement tool to existing methods.
Analysis of Spatio-Temporal Traffic Patterns Based on Pedestrian Trajectories
NASA Astrophysics Data System (ADS)
Busch, S.; Schindler, T.; Klinger, T.; Brenner, C.
2016-06-01
For driver assistance and autonomous driving systems, it is essential to predict the behaviour of other traffic participants. Usually, standard filter approaches are used to this end, however, in many cases, these are not sufficient. For example, pedestrians are able to change their speed or direction instantly. Also, there may be not enough observation data to determine the state of an object reliably, e.g. in case of occlusions. In those cases, it is very useful if a prior model exists, which suggests certain outcomes. For example, it is useful to know that pedestrians are usually crossing the road at a certain location and at certain times. This information can then be stored in a map which then can be used as a prior in scene analysis, or in practical terms to reduce the speed of a vehicle in advance in order to minimize critical situations. In this paper, we present an approach to derive such a spatio-temporal map automatically from the observed behaviour of traffic participants in everyday traffic situations. In our experiments, we use one stationary camera to observe a complex junction, where cars, public transportation and pedestrians interact. We concentrate on the pedestrians trajectories to map traffic patterns. In the first step, we extract trajectory segments from the video data. These segments are then clustered in order to derive a spatial model of the scene, in terms of a spatially embedded graph. In the second step, we analyse the temporal patterns of pedestrian movement on this graph. We are able to derive traffic light sequences as well as the timetables of nearby public transportation. To evaluate our approach, we used a 4 hour video sequence. We show that we are able to derive traffic light sequences as well as time tables of nearby public transportation.
Thrombus segmentation by texture dynamics from microscopic image sequences
NASA Astrophysics Data System (ADS)
Brieu, Nicolas; Serbanovic-Canic, Jovana; Cvejic, Ana; Stemple, Derek; Ouwehand, Willem; Navab, Nassir; Groher, Martin
2010-03-01
The genetic factors of thrombosis are commonly explored by microscopically imaging the coagulation of blood cells induced by injuring a vessel of mice or of zebrafish mutants. The latter species is particularly interesting since skin transparency permits to non-invasively acquire microscopic images of the scene with a CCD camera and to estimate the parameters characterizing the thrombus development. These parameters are currently determined by manual outlining, which is both error prone and extremely time consuming. Even though a technique for automatic thrombus extraction would be highly valuable for gene analysts, little work can be found, which is mainly due to very low image contrast and spurious structures. In this work, we propose to semi-automatically segment the thrombus over time from microscopic image sequences of wild-type zebrafish larvae. To compensate the lack of valuable spatial information, our main idea consists of exploiting the temporal information by modeling the variations of the pixel intensities over successive temporal windows with a linear Markov-based dynamic texture formalization. We then derive an image from the estimated model parameters, which represents the probability of a pixel to belong to the thrombus. We employ this probability image to accurately estimate the thrombus position via an active contour segmentation incorporating also prior and spatial information of the underlying intensity images. The performance of our approach is tested on three microscopic image sequences. We show that the thrombus is accurately tracked over time in each sequence if the respective parameters controlling prior influence and contour stiffness are correctly chosen.
Schiavo, G; Strillacci, M G; Ribani, A; Bovo, S; Roman-Ponce, S I; Cerolini, S; Bertolini, F; Bagnato, A; Fontanesi, L
2018-06-01
Mitochondrial DNA (mtDNA) insertions have been detected in the nuclear genome of many eukaryotes. These sequences are pseudogenes originated by horizontal transfer of mtDNA fragments into the nuclear genome, producing nuclear DNA sequences of mitochondrial origin (numt). In this study we determined the frequency and distribution of mtDNA-originated pseudogenes in the turkey (Meleagris gallopavo) nuclear genome. The turkey reference genome (Turkey_2.01) was aligned with the reference linearized mtDNA sequence using last. A total of 32 numt sequences (corresponding to 18 numt regions derived by unique insertional events) were identified in the turkey nuclear genome (size ranging from 66 to 1415 bp; identity against the modern turkey mtDNA corresponding region ranging from 62% to 100%). Numts were distributed in nine chromosomes and in one scaffold. They derived from parts of 10 mtDNA protein-coding genes, ribosomal genes, the control region and 10 tRNA genes. Seven numt regions reported in the turkey genome were identified in orthologues positions in the Gallus gallus genome and therefore were present in the ancestral genome that in the Cretaceous originated the lineages of the modern crown Galliformes. Five recently integrated turkey numts were validated by PCR in 168 turkeys of six different domestic populations. None of the analysed numts were polymorphic (i.e. absence of the inserted sequence, as reported in numts of recent integration in other species), suggesting that the reticulate speciation model is not useful for explaining the origin of the domesticated turkey lineage. © 2018 Stichting International Foundation for Animal Genetics.
The NCBI BioCollections Database
Sharma, Shobha; Ciufo, Stacy; Starchenko, Elena; Darji, Dakshesh; Chlumsky, Larry; Karsch-Mizrachi, Ilene
2018-01-01
Abstract The rapidly growing set of GenBank submissions includes sequences that are derived from vouchered specimens. These are associated with culture collections, museums, herbaria and other natural history collections, both living and preserved. Correct identification of the specimens studied, along with a method to associate the sample with its institution, is critical to the outcome of related studies and analyses. The National Center for Biotechnology Information BioCollections Database was established to allow the association of specimen vouchers and related sequence records to their home institutions. This process also allows cross-linking from the home institution for quick identification of all records originating from each collection. Database URL: https://www.ncbi.nlm.nih.gov/biocollections PMID:29688360
Yang, Shihui; Vera, Jessica M; Grass, Jeff; Savvakis, Giannis; Moskvin, Oleg V; Yang, Yongfu; McIlwain, Sean J; Lyu, Yucai; Zinonos, Irene; Hebert, Alexander S; Coon, Joshua J; Bates, Donna M; Sato, Trey K; Brown, Steven D; Himmel, Michael E; Zhang, Min; Landick, Robert; Pappas, Katherine M; Zhang, Yaoping
2018-01-01
Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4 and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion. The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.
Literature information in PubChem: associations between PubChem records and scientific articles.
Kim, Sunghwan; Thiessen, Paul A; Cheng, Tiejun; Yu, Bo; Shoemaker, Benjamin A; Wang, Jiyao; Bolton, Evan E; Wang, Yanli; Bryant, Stephen H
2016-01-01
PubChem is an open archive consisting of a set of three primary public databases (BioAssay, Compound, and Substance). It contains information on a broad range of chemical entities, including small molecules, lipids, carbohydrates, and (chemically modified) amino acid and nucleic acid sequences (including siRNA and miRNA). Currently (as of Nov. 2015), PubChem contains more than 150 million depositor-provided chemical substance descriptions, 60 million unique chemical structures, and 225 million biological activity test results provided from over 1 million biological assay records. Many PubChem records (substances, compounds, and assays) include depositor-provided cross-references to scientific articles in PubMed. Some PubChem contributors provide bioactivity data extracted from scientific articles. Literature-derived bioactivity data complement high-throughput screening (HTS) data from the concluded NIH Molecular Libraries Program and other HTS projects. Some journals provide PubChem with information on chemicals that appear in their newly published articles, enabling concurrent publication of scientific articles in journals and associated data in public databases. In addition, PubChem links records to PubMed articles indexed with the Medical Subject Heading (MeSH) controlled vocabulary thesaurus. Literature information, both provided by depositors and derived from MeSH annotations, can be accessed using PubChem's web interfaces, enabling users to explore information available in literature related to PubChem records beyond typical web search results. Graphical abstractLiterature information for PubChem records is derived from various sources.
Bunawan, Hamidun; Yen, Choong Chee; Yaakop, Salmah; Noor, Normah Mohd
2017-01-26
The chloroplastic trnL intron and the nuclear internal transcribed spacer (ITS) region were sequenced for 11 Nepenthes species recorded in Peninsular Malaysia to examine their phylogenetic relationship and to evaluate the usage of trnL intron and ITS sequences for phylogenetic reconstruction of this genus. Phylogeny reconstruction was carried out using neighbor-joining, maximum parsimony and Bayesian analyses. All the trees revealed two major clusters, a lowland group consisting of N. ampullaria, N. mirabilis, N. gracilis and N. rafflesiana, and another containing both intermediately distributed species (N. albomarginata and N. benstonei) and four highland species (N. sanguinea, N. macfarlanei, N. ramispina and N. alba). The trnL intron and ITS sequences proved to provide phylogenetic informative characters for deriving a phylogeny of Nepenthes species in Peninsular Malaysia. To our knowledge, this is the first molecular phylogenetic study of Nepenthes species occurring along an altitudinal gradient in Peninsular Malaysia.
Fujimori, Shigeo; Hirai, Naoya; Ohashi, Hiroyuki; Masuoka, Kazuyo; Nishikimi, Akihiko; Fukui, Yoshinori; Washio, Takanori; Oshikubo, Tomohiro; Yamashita, Tatsuhiro; Miyamoto-Sato, Etsuko
2012-01-01
Next-generation sequencing (NGS) has been applied to various kinds of omics studies, resulting in many biological and medical discoveries. However, high-throughput protein-protein interactome datasets derived from detection by sequencing are scarce, because protein-protein interaction analysis requires many cell manipulations to examine the interactions. The low reliability of the high-throughput data is also a problem. Here, we describe a cell-free display technology combined with NGS that can improve both the coverage and reliability of interactome datasets. The completely cell-free method gives a high-throughput and a large detection space, testing the interactions without using clones. The quantitative information provided by NGS reduces the number of false positives. The method is suitable for the in vitro detection of proteins that interact not only with the bait protein, but also with DNA, RNA and chemical compounds. Thus, it could become a universal approach for exploring the large space of protein sequences and interactome networks. PMID:23056904
3D RNA and functional interactions from evolutionary couplings
Weinreb, Caleb; Riesselman, Adam; Ingraham, John B.; Gross, Torsten; Sander, Chris; Marks, Debora S.
2016-01-01
Summary Non-coding RNAs are ubiquitous, but the discovery of new RNA gene sequences far outpaces research on their structure and functional interactions. We mine the evolutionary sequence record to derive precise information about function and structure of RNAs and RNA-protein complexes. As in protein structure prediction, we use maximum entropy global probability models of sequence co-variation to infer evolutionarily constrained nucleotide-nucleotide interactions within RNA molecules, and nucleotide-amino acid interactions in RNA-protein complexes. The predicted contacts allow all-atom blinded 3D structure prediction at good accuracy for several known RNA structures and RNA-protein complexes. For unknown structures, we predict contacts in 160 non-coding RNA families. Beyond 3D structure prediction, evolutionary couplings help identify important functional interactions, e.g., at switch points in riboswitches and at a complex nucleation site in HIV. Aided by accelerating sequence accumulation, evolutionary coupling analysis can accelerate the discovery of functional interactions and 3D structures involving RNA. PMID:27087444
De Gannes, Vidya; Eudoxie, Gaius; Hickey, William J
2013-01-01
Fungal community composition in composts of lignocellulosic wastes was assessed via 454-pyrosequencing of ITS1 libraries derived from the three major composting phases. Ascomycota represented most (93%) of the 27,987 fungal sequences. A total of 102 genera, 120 species, and 222 operational taxonomic units (OTUs; >97% similarity) were identified. Thirty genera predominated (ca. 94% of the sequences), and at the species level, sequences matching Chaetomium funicola and Fusarium oxysporum were the most abundant (26 and 12%, respectively). In all composts, fungal diversity in the mature phase exceeded that of the mesophilic phase, but there was no consistent pattern in diversity changes occurring in the thermophilic phase. Fifteen species of human pathogens were identified, eight of which have not been previously identified in composts. This study demonstrated that deep sequencing can elucidate fungal community diversity in composts, and that this information can have important implications for compost use and human health.
De Gannes, Vidya; Eudoxie, Gaius; Hickey, William J.
2013-01-01
Fungal community composition in composts of lignocellulosic wastes was assessed via 454-pyrosequencing of ITS1 libraries derived from the three major composting phases. Ascomycota represented most (93%) of the 27,987 fungal sequences. A total of 102 genera, 120 species, and 222 operational taxonomic units (OTUs; >97% similarity) were identified. Thirty genera predominated (ca. 94% of the sequences), and at the species level, sequences matching Chaetomium funicola and Fusarium oxysporum were the most abundant (26 and 12%, respectively). In all composts, fungal diversity in the mature phase exceeded that of the mesophilic phase, but there was no consistent pattern in diversity changes occurring in the thermophilic phase. Fifteen species of human pathogens were identified, eight of which have not been previously identified in composts. This study demonstrated that deep sequencing can elucidate fungal community diversity in composts, and that this information can have important implications for compost use and human health. PMID:23785368
Methods for determining the genetic affinity of microorganisms and viruses
NASA Technical Reports Server (NTRS)
Fox, George E. (Inventor); Willson, III, Richard C. (Inventor); Zhang, Zhengdong (Inventor)
2012-01-01
Selecting which sub-sequences in a database of nucleic acid such as 16S rRNA are highly characteristic of particular groupings of bacteria, microorganisms, fungi, etc. on a substantially phylogenetic tree. Also applicable to viruses comprising viral genomic RNA or DNA. A catalogue of highly characteristic sequences identified by this method is assembled to establish the genetic identity of an unknown organism. The characteristic sequences are used to design nucleic acid hybridization probes that include the characteristic sequence or its complement, or are derived from one or more characteristic sequences. A plurality of these characteristic sequences is used in hybridization to determine the phylogenetic tree position of the organism(s) in a sample. Those target organisms represented in the original sequence database and sufficient characteristic sequences can identify to the species or subspecies level. Oligonucleotide arrays of many probes are especially preferred. A hybridization signal can comprise fluorescence, chemiluminescence, or isotopic labeling, etc.; or sequences in a sample can be detected by direct means, e.g. mass spectrometry. The method's characteristic sequences can also be used to design specific PCR primers. The method uniquely identifies the phylogenetic affinity of an unknown organism without requiring prior knowledge of what is present in the sample. Even if the organism has not been previously encountered, the method still provides useful information about which phylogenetic tree bifurcation nodes encompass the organism.
Using the Saccharomyces Genome Database (SGD) for analysis of genomic information
Skrzypek, Marek S.; Hirschman, Jodi
2011-01-01
Analysis of genomic data requires access to software tools that place the sequence-derived information in the context of biology. The Saccharomyces Genome Database (SGD) integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. This unit describes how the various types of functional data available at SGD can be searched, retrieved, and analyzed. Starting with the guided tour of the SGD Home page and Locus Summary page, this unit highlights how to retrieve data using YeastMine, how to visualize genomic information with GBrowse, how to explore gene expression patterns with SPELL, and how to use Gene Ontology tools to characterize large-scale datasets. PMID:21901739
Isolation of mini- and microsatellite loci from chromosome 19 library
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prosnyak, M.I.; Belajeva, O.V.; Polukarova, L.G.
Mini- and microsatellite sequences are abundant in the human genome and are very useful as genetic markers. We report the isolation of a panel of clones containing marker sequences from chromosome 19. We screened 10,000 clones from the chromosome 19 cosmid library for the presence of di-(CA)n, tri-(TCC)n, (CAC)n microsatellites and M13-like minisatellite sequences. For this we have used synthetic oligonucleotides and polynucleotides, including micro- (CA, TCC, CAC) and minisatellite (M13 core) sequences. Preliminary results indicated that the chromosome 19 cosmid library contained both human and hamster clones. In order to identify human sequences from this library we have developedmore » the technique of colony and blot hybridization with Alu-PCR, L1-PCR and B1-PCR probes. Dozens of clones have been selected, some of which were analyzed by conventional Southern blot analysis and non-radioactive in situ hybridization of chromosomes. Highly informative markers derived from these clones will be used for physical and genetic mapping of chromosome 19.« less
Role of Mitochondrial Inheritance on Prostate Cancer Outcome in African American Men. Addendum
2016-11-01
DNA sequencing technique developed by our collaborator using single amplicon long-range PCR that permits deep coverage (10,000-20,000X on average) of...the mitochondrial genome. We have sequenced 652 samples derived from frozen fully using this technology. The additional DNA samples derived from...paraffin embedded (FFPE) tissue were more challenging, but have now been sequenced . Mapping of DNA variants in our sequenced genomes to mitochondrial
Xie, Guosen; Mo, Zhongxi
2011-01-21
In this article, we introduce three 3D graphical representations of DNA primary sequences, which we call RY-curve, MK-curve and SW-curve, based on three classifications of the DNA bases. The advantages of our representations are that (i) these 3D curves are strictly non-degenerate and there is no loss of information when transferring a DNA sequence to its mathematical representation and (ii) the coordinates of every node on these 3D curves have clear biological implication. Two applications of these 3D curves are presented: (a) a simple formula is derived to calculate the content of the four bases (A, G, C and T) from the coordinates of nodes on the curves; and (b) a 12-component characteristic vector is constructed to compare similarity among DNA sequences from different species based on the geometrical centers of the 3D curves. As examples, we examine similarity among the coding sequences of the first exon of beta-globin gene from eleven species and validate similarity of cDNA sequences of beta-globin gene from eight species. Copyright © 2010 Elsevier Ltd. All rights reserved.
Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E
2013-08-15
Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
Characterization of an Avian Influenza Virus H9N2 Strain Isolated from Dove in Southern China.
Li, Dan; Li, ZhengTing; Xie, Zhixun; Li, Meng; Xie, Zhiqin; Liu, Jiabo; Xie, Liji; Deng, Xianwen; Luo, Sisi
2018-05-03
We report here the complete genome sequence of strain H9N2, an avian influenza virus (AIV) isolated from dove in Guangxi, China. Phylogenetic analysis showed that it was a novel reassortant AIV derived from chicken, duck, and wild bird. This finding provides useful information for understanding the H9N2 subtype of AIV circulating in southern China. Copyright © 2018 Li et al.
DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.
Ma, Wenxiu; Yang, Lin; Rohs, Remo; Noble, William Stafford
2017-10-01
Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF-DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites. We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein-DNA binding preference and affinity. This kernel extends an existing class of k-mer based sequence kernels, based on the recently described di-mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX-seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein-DNA binding affinity. In particular, we observe that (i) the k-spectrum + shape model performs better than the classical k-spectrum kernel, particularly for small k values; (ii) the di-mismatch kernel performs better than the k-mer kernel, for larger k; and (iii) the di-mismatch + shape kernel performs better than the di-mismatch kernel for intermediate k values. The software is available at https://bitbucket.org/wenxiu/sequence-shape.git. rohs@usc.edu or william-noble@uw.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Hayashida, Kyoko; Abe, Takashi; Weir, William; Nakao, Ryo; Ito, Kimihito; Kajino, Kiichi; Suzuki, Yutaka; Jongejan, Frans; Geysen, Dirk; Sugimoto, Chihiro
2013-01-01
The disease caused by the apicomplexan protozoan parasite Theileria parva, known as East Coast fever or Corridor disease, is one of the most serious cattle diseases in Eastern, Central, and Southern Africa. We performed whole-genome sequencing of nine T. parva strains, including one of the vaccine strains (Kiambu 5), field isolates from Zambia, Uganda, Tanzania, or Rwanda, and two buffalo-derived strains. Comparison with the reference Muguga genome sequence revealed 34 814–121 545 single nucleotide polymorphisms (SNPs) that were more abundant in buffalo-derived strains. High-resolution phylogenetic trees were constructed with selected informative SNPs that allowed the investigation of possible complex recombination events among ancestors of the extant strains. We further analysed the dN/dS ratio (non-synonymous substitutions per non-synonymous site divided by synonymous substitutions per synonymous site) for 4011 coding genes to estimate potential selective pressure. Genes under possible positive selection were identified that may, in turn, assist in the identification of immunogenic proteins or vaccine candidates. This study elucidated the phylogeny of T. parva strains based on genome-wide SNPs analysis with prediction of possible past recombination events, providing insight into the migration, diversification, and evolution of this parasite species in the African continent. PMID:23404454
Hayashida, Kyoko; Abe, Takashi; Weir, William; Nakao, Ryo; Ito, Kimihito; Kajino, Kiichi; Suzuki, Yutaka; Jongejan, Frans; Geysen, Dirk; Sugimoto, Chihiro
2013-06-01
The disease caused by the apicomplexan protozoan parasite Theileria parva, known as East Coast fever or Corridor disease, is one of the most serious cattle diseases in Eastern, Central, and Southern Africa. We performed whole-genome sequencing of nine T. parva strains, including one of the vaccine strains (Kiambu 5), field isolates from Zambia, Uganda, Tanzania, or Rwanda, and two buffalo-derived strains. Comparison with the reference Muguga genome sequence revealed 34 814-121 545 single nucleotide polymorphisms (SNPs) that were more abundant in buffalo-derived strains. High-resolution phylogenetic trees were constructed with selected informative SNPs that allowed the investigation of possible complex recombination events among ancestors of the extant strains. We further analysed the dN/dS ratio (non-synonymous substitutions per non-synonymous site divided by synonymous substitutions per synonymous site) for 4011 coding genes to estimate potential selective pressure. Genes under possible positive selection were identified that may, in turn, assist in the identification of immunogenic proteins or vaccine candidates. This study elucidated the phylogeny of T. parva strains based on genome-wide SNPs analysis with prediction of possible past recombination events, providing insight into the migration, diversification, and evolution of this parasite species in the African continent.
NASA Astrophysics Data System (ADS)
Acquisti, Claudia; Allegrini, Paolo; Bogani, Patrizia; Buiatti, Marcello; Catanese, Elena; Fronzoni, Leone; Grigolini, Paolo; Mersi, Giuseppe; Palatella, Luigi
2004-04-01
We investigate on a possible way to connect the presence of Low-Complexity Sequences (LCS) in DNA genomes and the nonstationary properties of base correlations. Under the hypothesis that these variations signal a change in the DNA function, we use a new technique, called Non-Stationarity Entropic Index (NSEI) method, and we prove that this technique is an efficient way to detect functional changes with respect to a random baseline. The remarkable aspect is that NSEI does not imply any training data or fitting parameter, the only arbitrarity being the choice of a marker in the sequence. We make this choice on the basis of biological information about LCS distributions in genomes. We show that there exists a correlation between changing the amount in LCS and the ratio of long- to short-range correlation.
Karyotype Analysis of Four Vicia Species using In Situ Hybridization with Repetitive Sequences
NAVRÁTILOVÁ, ALICE; NEUMANN, PAVEL; MACAS, JIŘÍ
2003-01-01
Mitotic chromosomes of four Vicia species (V. sativa, V. grandiflora, V. pannonica and V. narbonensis) were subjected to in situ hybridization with probes derived from conserved plant repetitive DNA sequences (18S–25S and 5S rDNA, telomeres) and genus‐specific satellite repeats (VicTR‐A and VicTR‐B). Numbers and positions of hybridization signals provided cytogenetic landmarks suitable for unambiguous identification of all chromosomes, and establishment of the karyotypes. The VicTR‐A and ‐B sequences, in particular, produced highly informative banding patterns that alone were sufficient for discrimination of all chromosomes. However, these patterns were not conserved among species and thus could not be employed for identification of homologous chromosomes. This fact, together with observed variations in positions and numbers of rDNA loci, suggests considerable divergence between karyotypes of the species studied. PMID:12770847
Thermodynamic framework for information in nanoscale systems with memory
NASA Astrophysics Data System (ADS)
Arias-Gonzalez, J. Ricardo
2017-11-01
Information is represented by linear strings of symbols with memory that carry errors as a result of their stochastic nature. Proofreading and edition are assumed to improve certainty although such processes may not be effective. Here, we develop a thermodynamic theory for material chains made up of nanoscopic subunits with symbolic meaning in the presence of memory. This framework is based on the characterization of single sequences of symbols constructed under a protocol and is used to derive the behavior of ensembles of sequences similarly constructed. We then analyze the role of proofreading and edition in the presence of memory finding conditions to make revision an effective process, namely, to decrease the entropy of the chain. Finally, we apply our formalism to DNA replication and RNA transcription finding that Watson and Crick hybridization energies with which nucleotides are branched to the template strand during the copying process are optimal to regulate the fidelity in proofreading. These results are important in applications of information theory to a variety of solid-state physical systems and other biomolecular processes.
Thermodynamic framework for information in nanoscale systems with memory.
Arias-Gonzalez, J Ricardo
2017-11-28
Information is represented by linear strings of symbols with memory that carry errors as a result of their stochastic nature. Proofreading and edition are assumed to improve certainty although such processes may not be effective. Here, we develop a thermodynamic theory for material chains made up of nanoscopic subunits with symbolic meaning in the presence of memory. This framework is based on the characterization of single sequences of symbols constructed under a protocol and is used to derive the behavior of ensembles of sequences similarly constructed. We then analyze the role of proofreading and edition in the presence of memory finding conditions to make revision an effective process, namely, to decrease the entropy of the chain. Finally, we apply our formalism to DNA replication and RNA transcription finding that Watson and Crick hybridization energies with which nucleotides are branched to the template strand during the copying process are optimal to regulate the fidelity in proofreading. These results are important in applications of information theory to a variety of solid-state physical systems and other biomolecular processes.
USDA-ARS?s Scientific Manuscript database
To discover resistance (R) and/or pathogen-induced (PR) genes involved in disease response, 12 bacterial artificial chromosome (BAC) clones from cv. Acala Maxxa (G. hirsutum) were sequenced at the Clemson University, Genomics Institute, Clemson, SC. These BACs derived MUSB single sequence repeat (SS...
Molecular cloning and expression of the CRISP family of proteins in the boar.
Vadnais, Melissa L; Foster, Douglas N; Roberts, Kenneth P
2008-12-01
The family of mammalian cysteine-rich secretory proteins (CRISP) have been well characterized in the rat, mouse, and human. Here we report the molecular cloning and expression analysis of CRISP1, CRISP2, and CRISP3 in the boar. A partial sequence published in the National Center for Biotechnology Information (NCBI) database was used to derive the full-length sequences for CRISP1 and CRISP2 using rapid amplification of cDNA ends. RT-PCR confirmed the expression of these mRNAs in the boar reproductive tract, and real time RT-PCR showed CRISP1 to be highly expressed throughout the epididymis, with CRISP2 highly expressed in the testis. A search of the porcine genomic sequence in the NCBI database identified a BAC (CH242-199E6) encoding the CRISP1 gene. This BAC is derived from porcine Chromosome 7 and is syntenic with the regions of the mouse, rat, and human genomes encoding the CRISP gene family. This BAC was found to encode a third CRISP protein with a predicted amino acid sequence of high similarity to human CRISP3. Using RT-PCR we show that CRISP3 expression in the boar reproductive tract is confined to the prostate. Recombinant porcine (rp) CRISP2 protein was produced and purified. When incubated with capacitated boar sperm, rpCRISP2 induced an acrosome reaction, consistent with its demonstrated ability to alter the activity of calcium channels.
Oldach, Klaus H; Peck, David M; Nair, Ramakrishnan M; Sokolova, Maria; Harris, John; Bogacki, Paul; Ballard, Ross
2014-04-17
The nematode Pratylenchus neglectus has a wide host range and is able to feed on the root systems of cereals, oilseeds, grain and pasture legumes. Under the Mediterranean low rainfall environments of Australia, annual Medicago pasture legumes are used in rotation with cereals to fix atmospheric nitrogen and improve soil parameters. Considerable efforts are being made in breeding programs to improve resistance and tolerance to Pratylenchus neglectus in the major crops wheat and barley, which makes it vital to develop appropriate selection tools in medics. A strong source of tolerance to root damage by the root lesion nematode (RLN) Pratylenchus neglectus had previously been identified in line RH-1 (strand medic, M. littoralis). Using RH-1, we have developed a single seed descent (SSD) population of 138 lines by crossing it to the intolerant cultivar Herald. After inoculation, RLN-associated root damage clearly segregated in the population. Genetic analysis was performed by constructing a genetic map using simple sequence repeat (SSR) and gene-based SNP markers. A highly significant quantitative trait locus (QTL), QPnTolMl.1, was identified explaining 49% of the phenotypic variation in the SSD population. All SSRs and gene-based markers in the QTL region were derived from chromosome 1 of the sequenced genome of the closely related species M. truncatula. Gene-based markers were validated in advanced breeding lines derived from the RH-1 parent and also a second RLN tolerance source, RH-2 (M. truncatula ssp. tricycla). Comparative analysis to sequenced legume genomes showed that the physical QTL interval exists as a synteny block in Lotus japonicus, common bean, soybean and chickpea. Furthermore, using the sequenced genome information of M. truncatula, the QTL interval contains 55 genes out of which five are discussed as potential candidate genes responsible for the mapped tolerance. The closely linked set of SNP-based PCR markers is directly applicable to select for two different sources of RLN tolerance in breeding programs. Moreover, genome sequence information has allowed proposing candidate genes for further functional analysis and nominates QPnTolMl.1 as a target locus for RLN tolerance in economically important grain legumes, e.g. chickpea.
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
Höps, Wolfram; Jeffryes, Matt; Bateman, Alex
2018-01-01
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
Ochirkhuu, Nyamsuren; Konnai, Satoru; Odbileg, Raadan; Murata, Shiro; Ohashi, Kazuhiko
2017-08-01
Anaplasma species are obligate intracellular rickettsial pathogens that cause great economic loss to the animal industry. Few studies on Anaplasma infections in Mongolian livestock have been conducted. This study examined the prevalence of Anaplasma marginale, Anaplasma ovis, Anaplasma phagocytophilum, and Anaplasma bovis by polymerase chain reaction assay in 928 blood samples collected from native cattle and dairy cattle (Bos taurus), yaks (Bos grunniens), sheep (Ovis aries), and goats (Capra aegagrus hircus) in four provinces of Ulaanbaatar city in Mongolia. We genetically characterized positive samples through sequencing analysis based on the heat-shock protein groEL, major surface protein 4 (msp4), and 16S rRNA genes. Only A. ovis was detected in Mongolian livestock (cattle, yaks, sheep, and goats), with 413 animals (44.5%) positive for groEL and 308 animals (33.2%) positive for msp4 genes. In the phylogenetic tree, we separated A. ovis sequences into two distinct clusters based on the groEL gene. One cluster comprised sequences derived mainly from sheep and goats, which was similar to that in A. ovis isolates from other countries. The other divergent cluster comprised sequences derived from cattle and yaks and appeared to be newly branched from that in previously published single isolates in Mongolian cattle. In addition, the msp4 gene of A. ovis using same and different samples with groEL gene of the pathogen demonstrated that all sequences derived from all animal species, except for three sequences derived from cattle and yak, were clustered together, and were identical or similar to those in isolates from other countries. We used 16S rRNA gene sequences to investigate the genetically divergent A. ovis and identified high homology of 99.3-100%. However, the sequences derived from cattle did not match those derived from sheep and goats. The results of this study on the prevalence and molecular characterization of A. ovis in Mongolian livestock can facilitate the control of infectious diseases in livestock.
Collins, Richard A; Stajich, Jason E; Field, Deborah J; Olive, Joan E; DeAbreu, Diane M
2015-05-01
When we expressed a small (0.9 kb) nonprotein-coding transcript derived from the mitochondrial VS plasmid in the nucleus of Neurospora we found that it was efficiently spliced at one or more of eight 5' splice sites and ten 3' splice sites, which are present apparently by chance in the sequence. Further experimental and bioinformatic analyses of other mitochondrial plasmids, random sequences, and natural nuclear genes in Neurospora and other fungi indicate that fungal spliceosomes recognize a wide range of 5' splice site and branchpoint sequences and predict introns to be present at high frequency in random sequence. In contrast, analysis of intronless fungal nuclear genes indicates that branchpoint, 5' splice site and 3' splice site consensus sequences are underrepresented compared with random sequences. This underrepresentation of splicing signals is sufficient to deplete the nuclear genome of splice sites at locations that do not comprise biologically relevant introns. Thus, the splicing machinery can recognize a wide range of splicing signal sequences, but splicing still occurs with great accuracy, not because the splicing machinery distinguishes correct from incorrect introns, but because incorrect introns are substantially depleted from the genome. © 2015 Collins et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Combining Rosetta with molecular dynamics (MD): A benchmark of the MD-based ensemble protein design.
Ludwiczak, Jan; Jarmula, Adam; Dunin-Horkawicz, Stanislaw
2018-07-01
Computational protein design is a set of procedures for computing amino acid sequences that will fold into a specified structure. Rosetta Design, a commonly used software for protein design, allows for the effective identification of sequences compatible with a given backbone structure, while molecular dynamics (MD) simulations can thoroughly sample near-native conformations. We benchmarked a procedure in which Rosetta design is started on MD-derived structural ensembles and showed that such a combined approach generates 20-30% more diverse sequences than currently available methods with only a slight increase in computation time. Importantly, the increase in diversity is achieved without a loss in the quality of the designed sequences assessed by their resemblance to natural sequences. We demonstrate that the MD-based procedure is also applicable to de novo design tasks started from backbone structures without any sequence information. In addition, we implemented a protocol that can be used to assess the stability of designed models and to select the best candidates for experimental validation. In sum our results demonstrate that the MD ensemble-based flexible backbone design can be a viable method for protein design, especially for tasks that require a large pool of diverse sequences. Copyright © 2018 Elsevier Inc. All rights reserved.
False positives complicate ancient pathogen identifications using high-throughput shotgun sequencing
2014-01-01
Background Identification of historic pathogens is challenging since false positives and negatives are a serious risk. Environmental non-pathogenic contaminants are ubiquitous. Furthermore, public genetic databases contain limited information regarding these species. High-throughput sequencing may help reliably detect and identify historic pathogens. Results We shotgun-sequenced 8 16th-century Mixtec individuals from the site of Teposcolula Yucundaa (Oaxaca, Mexico) who are reported to have died from the huey cocoliztli (‘Great Pestilence’ in Nahautl), an unknown disease that decimated native Mexican populations during the Spanish colonial period, in order to identify the pathogen. Comparison of these sequences with those deriving from the surrounding soil and from 4 precontact individuals from the site found a wide variety of contaminant organisms that confounded analyses. Without the comparative sequence data from the precontact individuals and soil, false positives for Yersinia pestis and rickettsiosis could have been reported. Conclusions False positives and negatives remain problematic in ancient DNA analyses despite the application of high-throughput sequencing. Our results suggest that several studies claiming the discovery of ancient pathogens may need further verification. Additionally, true single molecule sequencing’s short read lengths, inability to sequence through DNA lesions, and limited ancient-DNA-specific technical development hinder its application to palaeopathology. PMID:24568097
Discovery of the "RNA continent" through a contrarian's research strategy.
Hayashizaki, Yoshihide
2011-01-01
The International Human Genome Sequencing Consortium completed the decoding of the human genome sequence in 2003. Readers will be aware of the paradigm shift which has occurred since then in the field of life science research. At last, mankind has been able to focus on a complete picture of the full extent of the genome, on which is recorded the basic information that controls all life. Meanwhile, another genome project, centered on Japan and known as the mouse genome encyclopedia project, was progressing with participation from around the world. Led by our research group at RIKEN, it was a full-length cDNA project which aimed to decode the whole RNA (transcriptome) using the mouse as a model. The basic information that controls all life is recorded on the genome, but in order to obtain a complete picture of this extensive information, the decoding of the genome alone is far from sufficient. These two genome projects established that the number of letters in the genome, which is the blueprint of life, is finite, that the number of RNA molecules derived from it is also finite, and that the number of protein molecules derived from the RNA is probably finite too. A massive number of combinations is still involved, but we are now able to understand one section of the network formed by these data. Once an object of study has been understood to be finite, establishing an image of the whole is certain to lead us to an understanding of the whole. Omics is an approach that views the information controlling life as finite and seeks to assemble and analyze it as a whole. Here, I would like to present our transcriptome research while making reference to our unique research strategy.
Genetic diversity in Trypanosoma theileri from Sri Lankan cattle and water buffaloes.
Yokoyama, Naoaki; Sivakumar, Thillaiampalam; Fukushi, Shintaro; Tattiyapong, Muncharee; Tuvshintulga, Bumduuren; Kothalawala, Hemal; Silva, Seekkuge Susil Priyantha; Igarashi, Ikuo; Inoue, Noboru
2015-01-30
Trypanosoma theileri is a hemoprotozoan parasite that infects various ruminant species. We investigated the epidemiology of this parasite among cattle and water buffalo populations bred in Sri Lanka, using a diagnostic PCR assay based on the cathepsin L-like protein (CATL) gene. Blood DNA samples sourced from cattle (n=316) and water buffaloes (n=320) bred in different geographical areas of Sri Lanka were PCR screened for T. theileri. Parasite DNA was detected in cattle and water buffaloes alike in all the sampling locations. The overall T. theileri-positive rate was higher in water buffaloes (15.9%) than in cattle (7.6%). Subsequently, PCR amplicons were sequenced and the partial CATL sequences were phylogenetically analyzed. The identity values for the CATL gene were 89.6-99.7% among the cattle-derived sequences, compared with values of 90.7-100% for the buffalo-derived sequences. However, the cattle-derived sequences shared 88.2-100% identity values with those from buffaloes. In the phylogenetic tree, the Sri Lankan CATL gene sequences fell into two major clades (TthI and TthII), both of which contain CATL sequences from several other countries. Although most of the CATL sequences from Sri Lankan cattle and buffaloes clustered independently, two buffalo-derived sequences were observed to be closely related to those of the Sri Lankan cattle. Furthermore, a Sri Lankan buffalo sequence clustered with CATL gene sequences from Brazilian buffalo and Thai cattle. In addition to reporting the first PCR-based survey of T. theileri among Sri Lankan-bred cattle and water buffaloes, the present study found that some of the CATL gene fragments sourced from water buffaloes shared similarity with those determined from cattle in this country. Copyright © 2014 Elsevier B.V. All rights reserved.
Velagapudi, Sai Pradeep; Seedhouse, Steven J.; French, Jonathan
2011-01-01
RNA is an important therapeutic target, however, RNA targets are generally underexploited due to a lack of understanding of the small molecules that bind RNA and the RNA motifs that bind small molecules. Herein, we describe the identification of the RNA internal loops derived from a 4096-member 3×3 nucleotide loop library that are the most specific and highest affinity binders to a series of four designer, drug-like benzimidazoles. These studies establish a potentially general protocol to define the highest affinity and most specific RNA motif targets for heterocyclic small molecules. Such information could be used to target functionally important RNAs in genomic sequence. PMID:21604752
Barrett, Angela N; Xiong, Li; Tan, Tuan Z; Advani, Henna V; Hua, Rui; Laureano-Asibal, Cecille; Soong, Richie; Biswas, Arijit; Nagarajan, Niranjan; Choolani, Mahesh
2017-01-01
Cell-free DNA from maternal plasma can be used for non-invasive prenatal testing for aneuploidies and single gene disorders, and also has applications as a biomarker for monitoring high-risk pregnancies, such as those at risk of pre-eclampsia. On average, the fractional cell-free fetal DNA concentration in plasma is approximately 15%, but can vary from less than 4% to greater than 30%. Although quantification of cell-free fetal DNA is straightforward in the case of a male fetus, there is no universal fetal marker; in a female fetus measurement is more challenging. We have developed a panel of multiplexed insertion/deletion polymorphisms that can measure fetal fraction in all pregnancies in a simple, targeted sequencing reaction. A multiplex panel of primers was designed for 35 indels plus a ZFX/ZFY amplicon. cfDNA was extracted from plasma from 157 pregnant women, and maternal genomic DNA was extracted for 20 of these samples for panel validation. Sixty-one samples from pregnancies with a male fetus were subjected to whole genome sequencing on the Ion Proton sequencing platform, and fetal fraction derived from Y chromosome counts was compared to fetal fraction measured using the indel panel. A total of 157 cell-free DNA samples were sequenced using the indel panel, and informativity was assessed, along with the proportion of fetal DNA. Using gDNA we optimised the indel panel, removing amplicons giving rise to PCR bias. Good correlation was found between fetal fraction using indels and using whole genome sequencing of the Y chromosome (Spearmans r = 0.69). A median of 12 indels were informative per sample. The indel panel was informative in 157/157 cases (mean fetal fraction 14.4% (±0.58%)). Using our targeted next generation sequencing panel we can readily assess the fetal DNA percentage in male and female pregnancies.
Xiong, Li; Tan, Tuan Z.; Advani, Henna V.; Hua, Rui; Laureano-Asibal, Cecille; Soong, Richie; Biswas, Arijit; Nagarajan, Niranjan; Choolani, Mahesh
2017-01-01
Objective Cell-free DNA from maternal plasma can be used for non-invasive prenatal testing for aneuploidies and single gene disorders, and also has applications as a biomarker for monitoring high-risk pregnancies, such as those at risk of pre-eclampsia. On average, the fractional cell-free fetal DNA concentration in plasma is approximately 15%, but can vary from less than 4% to greater than 30%. Although quantification of cell-free fetal DNA is straightforward in the case of a male fetus, there is no universal fetal marker; in a female fetus measurement is more challenging. We have developed a panel of multiplexed insertion/deletion polymorphisms that can measure fetal fraction in all pregnancies in a simple, targeted sequencing reaction. Methods A multiplex panel of primers was designed for 35 indels plus a ZFX/ZFY amplicon. cfDNA was extracted from plasma from 157 pregnant women, and maternal genomic DNA was extracted for 20 of these samples for panel validation. Sixty-one samples from pregnancies with a male fetus were subjected to whole genome sequencing on the Ion Proton sequencing platform, and fetal fraction derived from Y chromosome counts was compared to fetal fraction measured using the indel panel. A total of 157 cell-free DNA samples were sequenced using the indel panel, and informativity was assessed, along with the proportion of fetal DNA. Results Using gDNA we optimised the indel panel, removing amplicons giving rise to PCR bias. Good correlation was found between fetal fraction using indels and using whole genome sequencing of the Y chromosome (Spearmans r = 0.69). A median of 12 indels were informative per sample. The indel panel was informative in 157/157 cases (mean fetal fraction 14.4% (±0.58%)). Conclusions Using our targeted next generation sequencing panel we can readily assess the fetal DNA percentage in male and female pregnancies. PMID:29084245
Badisco, Liesbeth; Huybrechts, Jurgen; Simonet, Gert; Verlinden, Heleen; Marchal, Elisabeth; Huybrechts, Roger; Schoofs, Liliane; De Loof, Arnold; Vanden Broeck, Jozef
2011-03-21
The desert locust (Schistocerca gregaria) displays a fascinating type of phenotypic plasticity, designated as 'phase polyphenism'. Depending on environmental conditions, one genome can be translated into two highly divergent phenotypes, termed the solitarious and gregarious (swarming) phase. Although many of the underlying molecular events remain elusive, the central nervous system (CNS) is expected to play a crucial role in the phase transition process. Locusts have also proven to be interesting model organisms in a physiological and neurobiological research context. However, molecular studies in locusts are hampered by the fact that genome/transcriptome sequence information available for this branch of insects is still limited. We have generated 34,672 raw expressed sequence tags (EST) from the CNS of desert locusts in both phases. These ESTs were assembled in 12,709 unique transcript sequences and nearly 4,000 sequences were functionally annotated. Moreover, the obtained S. gregaria EST information is highly complementary to the existing orthopteran transcriptomic data. Since many novel transcripts encode neuronal signaling and signal transduction components, this paper includes an overview of these sequences. Furthermore, several transcripts being differentially represented in solitarious and gregarious locusts were retrieved from this EST database. The findings highlight the involvement of the CNS in the phase transition process and indicate that this novel annotated database may also add to the emerging knowledge of concomitant neuronal signaling and neuroplasticity events. In summary, we met the need for novel sequence data from desert locust CNS. To our knowledge, we hereby also present the first insect EST database that is derived from the complete CNS. The obtained S. gregaria EST data constitute an important new source of information that will be instrumental in further unraveling the molecular principles of phase polyphenism, in further establishing locusts as valuable research model organisms and in molecular evolutionary and comparative entomology.
Two EST-derived marker systems for cultivar identification in tree peony.
Zhang, J J; Shu, Q Y; Liu, Z A; Ren, H X; Wang, L S; De Keyser, E
2012-02-01
Tree peony (Paeonia suffruticosa Andrews), a woody deciduous shrub, belongs to the section Moutan DC. in the genus of Paeonia of the Paeoniaceae family. To increase the efficiency of breeding, two EST-derived marker systems were developed based on a tree peony expressed sequence tag (EST) database. Using target region amplification polymorphism (TRAP), 19 of 39 primer pairs showed good amplification for 56 accessions with amplicons ranging from 120 to 3,000 bp long, among which 99.3% were polymorphic. In contrast, 7 of 21 primer pairs demonstrated adequate amplification with clear bands for simple sequence repeats (SSRs) developed from ESTs, and a total of 33 alleles were found in 56 accessions. The similarity matrices generated by TRAP and EST-SSR markers were compared, and the Mantel test (r = 0.57778, P = 0.0020) showed a moderate correlation between the two types of molecular markers. TRAP markers were suitable for DNA fingerprinting and EST-SSR markers were more appropriate for discriminating synonyms (the same cultivars with different names due to limited information exchanged among different geographic areas). The two sets of EST-derived markers will be used further for genetic linkage map construction and quantitative trait locus detection in tree peony.
Proteomics technique opens new frontiers in mobilome research
Davidson, Andrew D.; Matthews, David A.
2017-01-01
ABSTRACT A large proportion of the genome of most eukaryotic organisms consists of highly repetitive mobile genetic elements. The sum of these elements is called the “mobilome,” which in eukaryotes is made up mostly of transposons. Transposable elements contribute to disease, evolution, and normal physiology by mediating genetic rearrangement, and through the “domestication” of transposon proteins for cellular functions. Although ‘omics studies of mobilome genomes and transcriptomes are common, technical challenges have hampered high-throughput global proteomics analyses of transposons. In a recent paper, we overcame these technical hurdles using a technique called “proteomics informed by transcriptomics” (PIT), and thus published the first unbiased global mobilome-derived proteome for any organism (using cell lines derived from the mosquito Aedes aegypti). In this commentary, we describe our methods in more detail, and summarise our major findings. We also use new genome sequencing data to show that, in many cases, the specific genomic element expressing a given protein can be identified using PIT. This proteomic technique therefore represents an important technological advance that will open new avenues of research into the role that proteins derived from transposons and other repetitive and sequence diverse genetic elements, such as endogenous retroviruses, play in health and disease. PMID:28932623
Xiong, Dapeng; Zeng, Jianyang; Gong, Haipeng
2017-09-01
Residue-residue contacts are of great value for protein structure prediction, since contact information, especially from those long-range residue pairs, can significantly reduce the complexity of conformational sampling for protein structure prediction in practice. Despite progresses in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfaction. Methodologies for these hard targets still need further improvement. We presented a computational program DeepConPred, which includes a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) as well as a contact refinement step, to improve the prediction of long-range residue contacts from primary sequences. When compared with previous prediction approaches, our framework employed an effective scheme to identify optimal and important features for contact prediction, and was only trained with coevolutionary information derived from a limited number of homologous sequences to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets. All source data and codes are available at http://166.111.152.91/Downloads.html . hgong@tsinghua.edu.cn or zengjy321@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Morgan, Andrew P.; Didion, John P.; Doran, Anthony G.; Holt, James M.; McMillan, Leonard; Keane, Thomas M.; de Villena, Fernando Pardo-Manuel
2016-01-01
Wild-derived mouse inbred strains are becoming increasingly popular for complex traits analysis, evolutionary studies, and systems genetics. Here, we report the whole-genome sequencing of two wild-derived mouse inbred strains, LEWES/EiJ and ZALENDE/EiJ, of Mus musculus domesticus origin. These two inbred strains were selected based on their geographic origin, karyotype, and use in ongoing research. We generated 14× and 18× coverage sequence, respectively, and discovered over 1.1 million novel variants, most of which are private to one of these strains. This report expands the number of wild-derived inbred genomes in the Mus genus from six to eight. The sequence variation can be accessed via an online query tool; variant calls (VCF format) and alignments (BAM format) are available for download from a dedicated ftp site. Finally, the sequencing data have also been stored in a lossless, compressed, and indexed format using the multi-string Burrows-Wheeler transform. All data can be used without restriction. PMID:27765810
Simkovic, Felix; Thomas, Jens M H; Keegan, Ronan M; Winn, Martyn D; Mayans, Olga; Rigden, Daniel J
2016-07-01
For many protein families, the deluge of new sequence information together with new statistical protocols now allow the accurate prediction of contacting residues from sequence information alone. This offers the possibility of more accurate ab initio (non-homology-based) structure prediction. Such models can be used in structure solution by molecular replacement (MR) where the target fold is novel or is only distantly related to known structures. Here, AMPLE, an MR pipeline that assembles search-model ensembles from ab initio structure predictions ('decoys'), is employed to assess the value of contact-assisted ab initio models to the crystallographer. It is demonstrated that evolutionary covariance-derived residue-residue contact predictions improve the quality of ab initio models and, consequently, the success rate of MR using search models derived from them. For targets containing β-structure, decoy quality and MR performance were further improved by the use of a β-strand contact-filtering protocol. Such contact-guided decoys achieved 14 structure solutions from 21 attempted protein targets, compared with nine for simple Rosetta decoys. Previously encountered limitations were superseded in two key respects. Firstly, much larger targets of up to 221 residues in length were solved, which is far larger than the previously benchmarked threshold of 120 residues. Secondly, contact-guided decoys significantly improved success with β-sheet-rich proteins. Overall, the improved performance of contact-guided decoys suggests that MR is now applicable to a significantly wider range of protein targets than were previously tractable, and points to a direct benefit to structural biology from the recent remarkable advances in sequencing.
Simkovic, Felix; Thomas, Jens M. H.; Keegan, Ronan M.; Winn, Martyn D.; Mayans, Olga; Rigden, Daniel J.
2016-01-01
For many protein families, the deluge of new sequence information together with new statistical protocols now allow the accurate prediction of contacting residues from sequence information alone. This offers the possibility of more accurate ab initio (non-homology-based) structure prediction. Such models can be used in structure solution by molecular replacement (MR) where the target fold is novel or is only distantly related to known structures. Here, AMPLE, an MR pipeline that assembles search-model ensembles from ab initio structure predictions (‘decoys’), is employed to assess the value of contact-assisted ab initio models to the crystallographer. It is demonstrated that evolutionary covariance-derived residue–residue contact predictions improve the quality of ab initio models and, consequently, the success rate of MR using search models derived from them. For targets containing β-structure, decoy quality and MR performance were further improved by the use of a β-strand contact-filtering protocol. Such contact-guided decoys achieved 14 structure solutions from 21 attempted protein targets, compared with nine for simple Rosetta decoys. Previously encountered limitations were superseded in two key respects. Firstly, much larger targets of up to 221 residues in length were solved, which is far larger than the previously benchmarked threshold of 120 residues. Secondly, contact-guided decoys significantly improved success with β-sheet-rich proteins. Overall, the improved performance of contact-guided decoys suggests that MR is now applicable to a significantly wider range of protein targets than were previously tractable, and points to a direct benefit to structural biology from the recent remarkable advances in sequencing. PMID:27437113
Clinical evaluation incorporating a personal genome
Ashley, Euan A.; Butte, Atul J.; Wheeler, Matthew T.; Chen, Rong; Klein, Teri E.; Dewey, Frederick E.; Dudley, Joel T.; Ormond, Kelly E.; Pavlovic, Aleksandra; Hudgins, Louanne; Gong, Li; Hodges, Laura M.; Berlin, Dorit S.; Thorn, Caroline F.; Sangkuhl, Katrin; Hebert, Joan M.; Woon, Mark; Sagreiya, Hersh; Whaley, Ryan; Morgan, Alexander A.; Pushkarev, Dmitry; Neff, Norma F; Knowles, Joshua W.; Chou, Mike; Thakuria, Joseph; Rosenbaum, Abraham; Zaranek, Alexander Wait; Church, George; Greely, Henry T.; Quake, Stephen R.; Altman, Russ B.
2010-01-01
Background The cost of genomic information has fallen steeply but the path to clinical translation of risk estimates for common variants found in genome wide association studies remains unclear. Since the speed and cost of sequencing complete genomes is rapidly declining, more comprehensive means of analyzing these data in concert with rare variants for genetic risk assessment and individualisation of therapy are required. Here, we present the first integrated analysis of a complete human genome in a clinical context. Methods An individual with a family history of vascular disease and early sudden death was evaluated. Clinical assessment included risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome sequence data including 2.6 million single nucleotide polymorphisms and 752 copy number variations. The algorithm focused on predicting genetic risk of genes associated with known Mendelian disease, recognised drug responses, and pathogenicity for novel variants. In addition, since integration of risk ratios derived from case control studies is challenging, we estimated posterior probabilities from age and sex appropriate prior probability and likelihood ratios derived for each genotype. In addition, we developed a visualisation approach to account for gene-environment interactions and conditionally dependent risks. Findings We found increased genetic risk for myocardial infarction, type II diabetes and certain cancers. Rare variants in LPA are consistent with the family history of coronary artery disease. Pharmacogenomic analysis suggested a positive response to lipid lowering therapy, likely clopidogrel resistance, and a low initial dosing requirement for warfarin. Many variants of uncertain significance were reported. Interpretation Although challenges remain, our results suggest that whole genome sequencing can yield useful and clinically relevant information for individual patients, especially for those with a strong family history of significant disease. PMID:20435227
Quasispecies Analyses of the HIV-1 Near-full-length Genome With Illumina MiSeq
Ode, Hirotaka; Matsuda, Masakazu; Matsuoka, Kazuhiro; Hachiya, Atsuko; Hattori, Junko; Kito, Yumiko; Yokomaku, Yoshiyuki; Iwatani, Yasumasa; Sugiura, Wataru
2015-01-01
Human immunodeficiency virus type-1 (HIV-1) exhibits high between-host genetic diversity and within-host heterogeneity, recognized as quasispecies. Because HIV-1 quasispecies fluctuate in terms of multiple factors, such as antiretroviral exposure and host immunity, analyzing the HIV-1 genome is critical for selecting effective antiretroviral therapy and understanding within-host viral coevolution mechanisms. Here, to obtain HIV-1 genome sequence information that includes minority variants, we sought to develop a method for evaluating quasispecies throughout the HIV-1 near-full-length genome using the Illumina MiSeq benchtop deep sequencer. To ensure the reliability of minority mutation detection, we applied an analysis method of sequence read mapping onto a consensus sequence derived from de novo assembly followed by iterative mapping and subsequent unique error correction. Deep sequencing analyses of aHIV-1 clone showed that the analysis method reduced erroneous base prevalence below 1% in each sequence position and discarded only < 1% of all collected nucleotides, maximizing the usage of the collected genome sequences. Further, we designed primer sets to amplify the HIV-1 near-full-length genome from clinical plasma samples. Deep sequencing of 92 samples in combination with the primer sets and our analysis method provided sufficient coverage to identify >1%-frequency sequences throughout the genome. When we evaluated sequences of pol genes from 18 treatment-naïve patients' samples, the deep sequencing results were in agreement with Sanger sequencing and identified numerous additional minority mutations. The results suggest that our deep sequencing method would be suitable for identifying within-host viral population dynamics throughout the genome. PMID:26617593
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.
Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W
2010-07-02
The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
NASA Astrophysics Data System (ADS)
Akhir, Nor Azurah Mat; Nadzirin, Nurul; Mohamed, Rahmah; Firdaus-Raih, Mohd
2015-09-01
Hypothetical proteins of bacterial pathogens represent a large numbers of novel biological mechanisms which could belong to essential pathways in the bacteria. They lack functional characterizations mainly due to the inability of sequence homology based methods to detect functional relationships in the absence of detectable sequence similarity. The dataset derived from this study showed 550 candidates conserved in genomes that has pathogenicity information and only present in the Burkholderiales order. The dataset has been narrowed down to taxonomic clusters. Ten proteins were selected for ORF amplification, seven of them were successfully amplified, and only four proteins were successfully expressed. These proteins will be great candidates in determining the true function via structural biology.
2016-10-01
prostate cancer through sequencing xenografts and tissue samples. Qualify novel drivers of AR- prostate cancer through in vitro models. Develop novel...ability of RNASEH2A to modulate radio-sensitivity in prostate cancer cell lines and xenograft models. 3: Investigate RNASEH2A as a marker of radio...lung cancer clinical management. List of the Specific Aims: Aim 1: To establish patient-derived xenografts (PDX) models of pre-neoplastic lesions
Chan, Wen-Ling; Yang, Wen-Kuang; Huang, Hsien-Da; Chang, Jan-Gowth
2013-01-01
RNA interference (RNAi) is a gene silencing process within living cells, which is controlled by the RNA-induced silencing complex with a sequence-specific manner. In flies and mice, the pseudogene transcripts can be processed into short interfering RNAs (siRNAs) that regulate protein-coding genes through the RNAi pathway. Following these findings, we construct an innovative and comprehensive database to elucidate siRNA-mediated mechanism in human transcribed pseudogenes (TPGs). To investigate TPG producing siRNAs that regulate protein-coding genes, we mapped the TPGs to small RNAs (sRNAs) that were supported by publicly deep sequencing data from various sRNA libraries and constructed the TPG-derived siRNA-target interactions. In addition, we also presented that TPGs can act as a target for miRNAs that actually regulate the parental gene. To enable the systematic compilation and updating of these results and additional information, we have developed a database, pseudoMap, capturing various types of information, including sequence data, TPG and cognate annotation, deep sequencing data, RNA-folding structure, gene expression profiles, miRNA annotation and target prediction. As our knowledge, pseudoMap is the first database to demonstrate two mechanisms of human TPGs: encoding siRNAs and decoying miRNAs that target the parental gene. pseudoMap is freely accessible at http://pseudomap.mbc.nctu.edu.tw/. Database URL: http://pseudomap.mbc.nctu.edu.tw/
Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network.
Zhang, Buzhong; Li, Linqing; Lü, Qiang
2018-05-25
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
Schoch, Conrad L; Robbertse, Barbara; Robert, Vincent; Vu, Duong; Cardinali, Gianluigi; Irinyi, Laszlo; Meyer, Wieland; Nilsson, R Henrik; Hughes, Karen; Miller, Andrew N; Kirk, Paul M; Abarenkov, Kessy; Aime, M Catherine; Ariyawansa, Hiran A; Bidartondo, Martin; Boekhout, Teun; Buyck, Bart; Cai, Qing; Chen, Jie; Crespo, Ana; Crous, Pedro W; Damm, Ulrike; De Beer, Z Wilhelm; Dentinger, Bryn T M; Divakar, Pradeep K; Dueñas, Margarita; Feau, Nicolas; Fliegerova, Katerina; García, Miguel A; Ge, Zai-Wei; Griffith, Gareth W; Groenewald, Johannes Z; Groenewald, Marizeth; Grube, Martin; Gryzenhout, Marieka; Gueidan, Cécile; Guo, Liangdong; Hambleton, Sarah; Hamelin, Richard; Hansen, Karen; Hofstetter, Valérie; Hong, Seung-Beom; Houbraken, Jos; Hyde, Kevin D; Inderbitzin, Patrik; Johnston, Peter R; Karunarathna, Samantha C; Kõljalg, Urmas; Kovács, Gábor M; Kraichak, Ekaphan; Krizsan, Krisztina; Kurtzman, Cletus P; Larsson, Karl-Henrik; Leavitt, Steven; Letcher, Peter M; Liimatainen, Kare; Liu, Jian-Kui; Lodge, D Jean; Luangsa-ard, Janet Jennifer; Lumbsch, H Thorsten; Maharachchikumbura, Sajeewa S N; Manamgoda, Dimuthu; Martín, María P; Minnis, Andrew M; Moncalvo, Jean-Marc; Mulè, Giuseppina; Nakasone, Karen K; Niskanen, Tuula; Olariaga, Ibai; Papp, Tamás; Petkovits, Tamás; Pino-Bodas, Raquel; Powell, Martha J; Raja, Huzefa A; Redecker, Dirk; Sarmiento-Ramirez, J M; Seifert, Keith A; Shrestha, Bhushan; Stenroos, Soili; Stielow, Benjamin; Suh, Sung-Oui; Tanaka, Kazuaki; Tedersoo, Leho; Telleria, M Teresa; Udayanga, Dhanushka; Untereiner, Wendy A; Diéguez Uribeondo, Javier; Subbarao, Krishna V; Vágvölgyi, Csaba; Visagie, Cobus; Voigt, Kerstin; Walker, Donald M; Weir, Bevan S; Weiß, Michael; Wijayawardene, Nalin N; Wingfield, Michael J; Xu, J P; Yang, Zhu L; Zhang, Ning; Zhuang, Wen-Ying; Federhen, Scott
2014-01-01
DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi. Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353. Published by Oxford University Press 2013. This work is written by US Government employees and is in the public domain in the US.
Parton, Angela; Bayne, Christopher J.; Barnes, David W.
2010-01-01
Elasmobranchs are the most commonly used experimental models among the jawed, cartilaginous fish (Chondrichthyes). Previously we developed cell lines from embryos of two elasmobranchs, Squalus acanthias the spiny dogfish shark (SAE line), and Leucoraja erinacea the little skate (LEE-1 line). From these lines cDNA libraries were derived and expressed sequence tags (ESTs) generated. From the SAE cell line 4303 unique transcripts were identified, with 1848 of these representing unknown sequences (showing no BLASTX identification). From the LEE-1 cell line, 3660 unique transcripts were identified, and unknown, unique sequences totaled 1333. Gene Ontology (GO) annotation showed that GO assignments for the two cell lines were in general similar. These results suggest that the procedures used to derive the cell lines led to isolation of cell types of the same general embryonic origin from both species. The LEE-1 transcripts included GO categories “envelope” and “oxidoreductase activity” but the SAE transcripts did not. GO analysis of SAE transcripts identified the category “anatomical structure formation” that was not present in LEE-1 cells. Increased organelle compartments may exist within LEE-1 cells compared to SAE cells, and the higher oxidoreductase activity in LEE-1 cells may indicate a role for these cells in responses associated with innate immunity or in steroidogenesis. These EST libraries from elasmobranch cell lines provide information for assembly of genomic sequences and are useful in revealing gene diversity, new genes and molecular markers, as well as in providing means for elucidation of full-length cDNAs and probes for gene array analyses. This is the first study of this type with members of the Chondrichthyes. PMID:20471924
Parton, Angela; Bayne, Christopher J; Barnes, David W
2010-09-01
Elasmobranchs are the most commonly used experimental models among the jawed, cartilaginous fish (Chondrichthyes). Previously we developed cell lines from embryos of two elasmobranchs, Squalus acanthias the spiny dogfish shark (SAE line), and Leucoraja erinacea the little skate (LEE-1 line). From these lines cDNA libraries were derived and expressed sequence tags (ESTs) generated. From the SAE cell line 4303 unique transcripts were identified, with 1848 of these representing unknown sequences (showing no BLASTX identification). From the LEE-1 cell line, 3660 unique transcripts were identified, and unknown, unique sequences totaled 1333. Gene Ontology (GO) annotation showed that GO assignments for the two cell lines were in general similar. These results suggest that the procedures used to derive the cell lines led to isolation of cell types of the same general embryonic origin from both species. The LEE-1 transcripts included GO categories "envelope" and "oxidoreductase activity" but the SAE transcripts did not. GO analysis of SAE transcripts identified the category "anatomical structure formation" that was not present in LEE-1 cells. Increased organelle compartments may exist within LEE-1 cells compared to SAE cells, and the higher oxidoreductase activity in LEE-1 cells may indicate a role for these cells in responses associated with innate immunity or in steroidogenesis. These EST libraries from elasmobranch cell lines provide information for assembly of genomic sequences and are useful in revealing gene diversity, new genes and molecular markers, as well as in providing means for elucidation of full-length cDNAs and probes for gene array analyses. This is the first study of this type with members of the Chondrichthyes. Copyright 2010 Elsevier Inc. All rights reserved.
Dense depth maps from correspondences derived from perceived motion
NASA Astrophysics Data System (ADS)
Kirby, Richard; Whitaker, Ross
2017-01-01
Many computer vision applications require finding corresponding points between images and using the corresponding points to estimate disparity. Today's correspondence finding algorithms primarily use image features or pixel intensities common between image pairs. Some 3-D computer vision applications, however, do not produce the desired results using correspondences derived from image features or pixel intensities. Two examples are the multimodal camera rig and the center region of a coaxial camera rig. We present an image correspondence finding technique that aligns pairs of image sequences using optical flow fields. The optical flow fields provide information about the structure and motion of the scene, which are not available in still images but can be used in image alignment. We apply the technique to a dual focal length stereo camera rig consisting of a visible light-infrared camera pair and to a coaxial camera rig. We test our method on real image sequences and compare our results with the state-of-the-art multimodal and structure from motion (SfM) algorithms. Our method produces more accurate depth and scene velocity reconstruction estimates than the state-of-the-art multimodal and SfM algorithms.
Probing the evolution, ecology and physiology of marine protists using transcriptomics.
Caron, David A; Alexander, Harriet; Allen, Andrew E; Archibald, John M; Armbrust, E Virginia; Bachy, Charles; Bell, Callum J; Bharti, Arvind; Dyhrman, Sonya T; Guida, Stephanie M; Heidelberg, Karla B; Kaye, Jonathan Z; Metzner, Julia; Smith, Sarah R; Worden, Alexandra Z
2017-01-01
Protists, which are single-celled eukaryotes, critically influence the ecology and chemistry of marine ecosystems, but genome-based studies of these organisms have lagged behind those of other microorganisms. However, recent transcriptomic studies of cultured species, complemented by meta-omics analyses of natural communities, have increased the amount of genetic information available for poorly represented branches on the tree of eukaryotic life. This information is providing insights into the adaptations and interactions between protists and other microorganisms and macroorganisms, but many of the genes sequenced show no similarity to sequences currently available in public databases. A better understanding of these newly discovered genes will lead to a deeper appreciation of the functional diversity and metabolic processes in the ocean. In this Review, we summarize recent developments in our understanding of the ecology, physiology and evolution of protists, derived from transcriptomic studies of cultured strains and natural communities, and discuss how these novel large-scale genetic datasets will be used in the future.
High dynamic range image acquisition based on multiplex cameras
NASA Astrophysics Data System (ADS)
Zeng, Hairui; Sun, Huayan; Zhang, Tinghua
2018-03-01
High dynamic image is an important technology of photoelectric information acquisition, providing higher dynamic range and more image details, and it can better reflect the real environment, light and color information. Currently, the method of high dynamic range image synthesis based on different exposure image sequences cannot adapt to the dynamic scene. It fails to overcome the effects of moving targets, resulting in the phenomenon of ghost. Therefore, a new high dynamic range image acquisition method based on multiplex cameras system was proposed. Firstly, different exposure images sequences were captured with the camera array, using the method of derivative optical flow based on color gradient to get the deviation between images, and aligned the images. Then, the high dynamic range image fusion weighting function was established by combination of inverse camera response function and deviation between images, and was applied to generated a high dynamic range image. The experiments show that the proposed method can effectively obtain high dynamic images in dynamic scene, and achieves good results.
Structure-Templated Predictions of Novel Protein Interactions from Sequence Information
Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W. V
2007-01-01
The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information. PMID:17892321
An Information Theoretic Characterisation of Auditory Encoding
Overath, Tobias; Cusack, Rhodri; Kumar, Sukhbinder; von Kriegstein, Katharina; Warren, Jason D; Grube, Manon; Carlyon, Robert P; Griffiths, Timothy D
2007-01-01
The entropy metric derived from information theory provides a means to quantify the amount of information transmitted in acoustic streams like speech or music. By systematically varying the entropy of pitch sequences, we sought brain areas where neural activity and energetic demands increase as a function of entropy. Such a relationship is predicted to occur in an efficient encoding mechanism that uses less computational resource when less information is present in the signal: we specifically tested the hypothesis that such a relationship is present in the planum temporale (PT). In two convergent functional MRI studies, we demonstrated this relationship in PT for encoding, while furthermore showing that a distributed fronto-parietal network for retrieval of acoustic information is independent of entropy. The results establish PT as an efficient neural engine that demands less computational resource to encode redundant signals than those with high information content. PMID:17958472
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Liyou; Yi, T. Y.; Van Nostrand, Joy
Phylogenetic analyses were done for the Shewanella strains isolated from Baltic Sea (38 strains), US DOE Hanford Uranium bioremediation site [Hanford Reach of the Columbia River (HRCR), 11 strains], Pacific Ocean and Hawaiian sediments (8 strains), and strains from other resources (16 strains) with three out group strains, Rhodopseudomonas palustris, Clostridium cellulolyticum, and Thermoanaerobacter ethanolicus X514, using DNA relatedness derived from WCGA-based DNA-DNA hybridizations, sequence similarities of 16S rRNA gene and gyrB gene, and sequence similarities of 6 loci of Shewanella genome selected from a shared gene list of the Shewanella strains with whole genome sequenced based on the averagemore » nucleotide identity of them (ANI). The phylogenetic trees based on 16S rRNA and gyrB gene sequences, and DNA relatedness derived from WCGA hybridizations of the tested Shewanella strains share exactly the same sub-clusters with very few exceptions, in which the strains were basically grouped by species. However, the phylogenetic analysis based on DNA relatedness derived from WCGA hybridizations dramatically increased the differentiation resolution at species and strains level within Shewanella genus. When the tree based on DNA relatedness derived from WCGA hybridizations was compared to the tree based on the combined sequences of the selected functional genes (6 loci), we found that the resolutions of both methods are similar, but the clustering of the tree based on DNA relatedness derived from WMGA hybridizations was clearer. These results indicate that WCGA-based DNA-DNA hybridization is an idea alternative of conventional DNA-DNA hybridization methods and it is superior to the phylogenetics methods based on sequence similarities of single genes. Detailed analysis is being performed for the re-classification of the strains examined.« less
BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes
Jespersen, Martin Closter; Peters, Bjoern
2017-01-01
Abstract Antibodies have become an indispensable tool for many biotechnological and clinical applications. They bind their molecular target (antigen) by recognizing a portion of its structure (epitope) in a highly specific manner. The ability to predict epitopes from antigen sequences alone is a complex task. Despite substantial effort, limited advancement has been achieved over the last decade in the accuracy of epitope prediction methods, especially for those that rely on the sequence of the antigen only. Here, we present BepiPred-2.0 (http://www.cbs.dtu.dk/services/BepiPred/), a web server for predicting B-cell epitopes from antigen sequences. BepiPred-2.0 is based on a random forest algorithm trained on epitopes annotated from antibody-antigen protein structures. This new method was found to outperform other available tools for sequence-based epitope prediction both on epitope data derived from solved 3D structures, and on a large collection of linear epitopes downloaded from the IEDB database. The method displays results in a user-friendly and informative way, both for computer-savvy and non-expert users. We believe that BepiPred-2.0 will be a valuable tool for the bioinformatics and immunology community. PMID:28472356
Anonymization of electronic medical records for validating genome-wide association studies
Loukides, Grigorios; Gkoulalas-Divanis, Aris; Malin, Bradley
2010-01-01
Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt's University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks. PMID:20385806
Stevenson, Clare E. M.; Assaad, Aoun; Chandra, Govind; Le, Tung B. K.; Greive, Sandra J.; Bibb, Mervyn J.; Lawson, David M.
2013-01-01
Consistent with their complex lifestyles and rich secondary metabolite profiles, the genomes of streptomycetes encode a plethora of transcription factors, the vast majority of which are uncharacterized. Herein, we use Surface Plasmon Resonance (SPR) to identify and delineate putative operator sites for SCO3205, a MarR family transcriptional regulator from Streptomyces coelicolor that is well represented in sequenced actinomycete genomes. In particular, we use a novel SPR footprinting approach that exploits indirect ligand capture to vastly extend the lifetime of a standard streptavidin SPR chip. We define two operator sites upstream of sco3205 and a pseudopalindromic consensus sequence derived from these enables further potential operator sites to be identified in the S. coelicolor genome. We evaluate each of these through SPR and test the importance of the conserved bases within the consensus sequence. Informed by these results, we determine the crystal structure of a SCO3205-DNA complex at 2.8 Å resolution, enabling molecular level rationalization of the SPR data. Taken together, our observations support a DNA recognition mechanism involving both direct and indirect sequence readout. PMID:23748564
Human genome project: revolutionizing biology through leveraging technology
NASA Astrophysics Data System (ADS)
Dahl, Carol A.; Strausberg, Robert L.
1996-04-01
The Human Genome Project (HGP) is an international project to develop genetic, physical, and sequence-based maps of the human genome. Since the inception of the HGP it has been clear that substantially improved technology would be required to meet the scientific goals, particularly in order to acquire the complete sequence of the human genome, and that these technologies coupled with the information forthcoming from the project would have a dramatic effect on the way biomedical research is performed in the future. In this paper, we discuss the state-of-the-art for genomic DNA sequencing, technological challenges that remain, and the potential technological paths that could yield substantially improved genomic sequencing technology. The impact of the technology developed from the HGP is broad-reaching and a discussion of other research and medical applications that are leveraging HGP-derived DNA analysis technologies is included. The multidisciplinary approach to the development of new technologies that has been successful for the HGP provides a paradigm for facilitating new genomic approaches toward understanding the biological role of functional elements and systems within the cell, including those encoded within genomic DNA and their molecular products.
Predicting protein crystallization propensity from protein sequence
2011-01-01
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from a primary sequence correlate with the protein’s propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties were computed for over 1,300 proteins that expressed well but were insoluble, and for ~720 unique proteins that resulted in X-ray structures. The correlation of the protein’s iso-electric point and grand average hydropathy (GRAVY) with crystallizability was analyzed for full length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses we have identified a set of the attributes correlating with a protein’s propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on these. We have created applications to analyze and provide optimal boundary information for query sequences and to visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor. PMID:20177794
Top-down analysis of protein samples by de novo sequencing techniques.
Vyatkina, Kira; Wu, Si; Dekker, Lennard J M; VanDuijn, Martijn M; Liu, Xiaowen; Tolić, Nikola; Luider, Theo M; Paša-Tolić, Ljiljana; Pevzner, Pavel A
2016-09-15
Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns. Freely available on the web at http://bioinf.spbau.ru/en/twister vyatkina@spbau.ru or ppevzner@ucsd.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
A database of annotated tentative orthologs from crop abiotic stress transcripts.
Balaji, Jayashree; Crouch, Jonathan H; Petite, Prasad V N S; Hoisington, David A
2006-10-07
A minimal requirement to initiate a comparative genomics study on plant responses to abiotic stresses is a dataset of orthologous sequences. The availability of a large amount of sequence information, including those derived from stress cDNA libraries allow for the identification of stress related genes and orthologs associated with the stress response. Orthologous sequences serve as tools to explore genes and their relationships across species. For this purpose, ESTs from stress cDNA libraries across 16 crop species including 6 important cereal crops and 10 dicots were systematically collated and subjected to bioinformatics analysis such as clustering, grouping of tentative orthologous sets, identification of protein motifs/patterns in the predicted protein sequence, and annotation with stress conditions, tissue/library source and putative function. All data are available to the scientific community at http://intranet.icrisat.org/gt1/tog/homepage.htm. We believe that the availability of annotated plant abiotic stress ortholog sets will be a valuable resource for researchers studying the biology of environmental stresses in plant systems, molecular evolution and genomics.
DMINDA: an integrated web server for DNA motif identification and analyses
Ma, Qin; Zhang, Hanyuan; Mao, Xizeng; Zhou, Chuan; Liu, Bingqiang; Chen, Xin; Xu, Ying
2014-01-01
DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular. PMID:24753419
Kim, Kyunghee; Lee, Sang-Choon; Lee, Junki; Lee, Hyun Oh; Joh, Ho Jun; Kim, Nam-Hoon; Park, Hyun-Seung; Yang, Tae-Jin
2015-01-01
We report complete sequences of chloroplast (cp) genome and 45S nuclear ribosomal DNA (45S nrDNA) for 11 Panax ginseng cultivars. We have obtained complete sequences of cp and 45S nrDNA, the representative barcoding target sequences for cytoplasm and nuclear genome, respectively, based on low coverage NGS sequence of each cultivar. The cp genomes sizes ranged from 156,241 to 156,425 bp and the major size variation was derived from differences in copy number of tandem repeats in the ycf1 gene and in the intergenic regions of rps16-trnUUG and rpl32-trnUAG. The complete 45S nrDNA unit sequences were 11,091 bp, representing a consensus single transcriptional unit with an intergenic spacer region. Comparative analysis of these sequences as well as those previously reported for three Chinese accessions identified very rare but unique polymorphism in the cp genome within P. ginseng cultivars. There were 12 intra-species polymorphisms (six SNPs and six InDels) among 14 cultivars. We also identified five SNPs from 45S nrDNA of 11 Korean ginseng cultivars. From the 17 unique informative polymorphic sites, we developed six reliable markers for analysis of ginseng diversity and cultivar authentication. PMID:26061692
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Shihui; Vera, Jessica M.; Grass, Jeff
Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4more » and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion. The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Furthermore, plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.« less
Yang, Shihui; Vera, Jessica M.; Grass, Jeff; ...
2018-05-02
Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4more » and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion. The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Furthermore, plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.« less
Zill, Oliver A.; Sebisanovic, Dragan; Lopez, Rene; Blau, Sibel; Collisson, Eric A.; Divers, Stephen G.; Hoon, Dave S. B.; Kopetz, E. Scott; Lee, Jeeyun; Nikolinakos, Petros G.; Baca, Arthur M.; Kermani, Bahram G.; Eltoukhy, Helmy; Talasaz, AmirAli
2015-01-01
Next-generation sequencing of cell-free circulating solid tumor DNA addresses two challenges in contemporary cancer care. First this method of massively parallel and deep sequencing enables assessment of a comprehensive panel of genomic targets from a single sample, and second, it obviates the need for repeat invasive tissue biopsies. Digital SequencingTM is a novel method for high-quality sequencing of circulating tumor DNA simultaneously across a comprehensive panel of over 50 cancer-related genes with a simple blood test. Here we report the analytic and clinical validation of the gene panel. Analytic sensitivity down to 0.1% mutant allele fraction is demonstrated via serial dilution studies of known samples. Near-perfect analytic specificity (> 99.9999%) enables complete coverage of many genes without the false positives typically seen with traditional sequencing assays at mutant allele frequencies or fractions below 5%. We compared digital sequencing of plasma-derived cell-free DNA to tissue-based sequencing on 165 consecutive matched samples from five outside centers in patients with stage III-IV solid tumor cancers. Clinical sensitivity of plasma-derived NGS was 85.0%, comparable to 80.7% sensitivity for tissue. The assay success rate on 1,000 consecutive samples in clinical practice was 99.8%. Digital sequencing of plasma-derived DNA is indicated in advanced cancer patients to prevent repeated invasive biopsies when the initial biopsy is inadequate, unobtainable for genomic testing, or uninformative, or when the patient’s cancer has progressed despite treatment. Its clinical utility is derived from reduction in the costs, complications and delays associated with invasive tissue biopsies for genomic testing. PMID:26474073
Wallace, A. C.; Borkakoti, N.; Thornton, J. M.
1997-01-01
It is well established that sequence templates such as those in the PROSITE and PRINTS databases are powerful tools for predicting the biological function and tertiary structure for newly derived protein sequences. The number of X-ray and NMR protein structures is increasing rapidly and it is apparent that a 3D equivalent of the sequence templates is needed. Here, we describe an algorithm called TESS that automatically derives 3D templates from structures deposited in the Brookhaven Protein Data Bank. While a new sequence can be searched for sequence patterns, a new structure can be scanned against these 3D templates to identify functional sites. As examples, 3D templates are derived for enzymes with an O-His-O "catalytic triad" and for the ribonucleases and lysozymes. When these 3D templates are applied to a large data set of nonidentical proteins, several interesting hits are located. This suggests that the development of a 3D template database may help to identify the function of new protein structures, if unknown, as well as to design proteins with specific functions. PMID:9385633
Tang, Khanh G; Kent, Greggory T; Erden, Ihsan; Wu, Weiming
2017-10-04
cis -β-Bromostyrene derivatives were synthesized stereospecifically from cinnamic acids through β-lactone intermediates. The synthetic sequence did not require the purification of the β-lactone intermediates although they were found to be stable and readily purified in most cases.
Characterization of circulating transfer RNA-Derived RNA fragments in cattle
USDA-ARS?s Scientific Manuscript database
The objective was to characterize naturally occurring circulating transfer RNA-derived RNA Fragments (tRFs) in cattle. Serum from eight clinically normal adult dairy cows was collected, and small non-coding RNAs were extracted immediately after collection and sequenced by Illumina MiSeq. Sequences a...
Thakar, Sambhaji B; Ghorpade, Pradnya N; Kale, Manisha V; Sonawane, Kailas D
2015-01-01
Fern plants are known for their ethnomedicinal applications. Huge amount of fern medicinal plants information is scattered in the form of text. Hence, database development would be an appropriate endeavor to cope with the situation. So by looking at the importance of medicinally useful fern plants, we developed a web based database which contains information about several group of ferns, their medicinal uses, chemical constituents as well as protein/enzyme sequences isolated from different fern plants. Fern ethnomedicinal plant database is an all-embracing, content management web-based database system, used to retrieve collection of factual knowledge related to the ethnomedicinal fern species. Most of the protein/enzyme sequences have been extracted from NCBI Protein sequence database. The fern species, family name, identification, taxonomy ID from NCBI, geographical occurrence, trial for, plant parts used, ethnomedicinal importance, morphological characteristics, collected from various scientific literatures and journals available in the text form. NCBI's BLAST, InterPro, phylogeny, Clustal W web source has also been provided for the future comparative studies. So users can get information related to fern plants and their medicinal applications at one place. This Fern ethnomedicinal plant database includes information of 100 fern medicinal species. This web based database would be an advantageous to derive information specifically for computational drug discovery, botanists or botanical interested persons, pharmacologists, researchers, biochemists, plant biotechnologists, ayurvedic practitioners, doctors/pharmacists, traditional medicinal users, farmers, agricultural students and teachers from universities as well as colleges and finally fern plant lovers. This effort would be useful to provide essential knowledge for the users about the adventitious applications for drug discovery, applications, conservation of fern species around the world and finally to create social awareness.
Information-Theoretical Analysis of EEG Microstate Sequences in Python.
von Wegner, Frederic; Laufs, Helmut
2018-01-01
We present an open-source Python package to compute information-theoretical quantities for electroencephalographic data. Electroencephalography (EEG) measures the electrical potential generated by the cerebral cortex and the set of spatial patterns projected by the brain's electrical potential on the scalp surface can be clustered into a set of representative maps called EEG microstates. Microstate time series are obtained by competitively fitting the microstate maps back into the EEG data set, i.e., by substituting the EEG data at a given time with the label of the microstate that has the highest similarity with the actual EEG topography. As microstate sequences consist of non-metric random variables, e.g., the letters A-D, we recently introduced information-theoretical measures to quantify these time series. In wakeful resting state EEG recordings, we found new characteristics of microstate sequences such as periodicities related to EEG frequency bands. The algorithms used are here provided as an open-source package and their use is explained in a tutorial style. The package is self-contained and the programming style is procedural, focusing on code intelligibility and easy portability. Using a sample EEG file, we demonstrate how to perform EEG microstate segmentation using the modified K-means approach, and how to compute and visualize the recently introduced information-theoretical tests and quantities. The time-lagged mutual information function is derived as a discrete symbolic alternative to the autocorrelation function for metric time series and confidence intervals are computed from Markov chain surrogate data. The software package provides an open-source extension to the existing implementations of the microstate transform and is specifically designed to analyze resting state EEG recordings.
A new era in clinical genetic testing for hypertrophic cardiomyopathy.
Wheeler, Matthew; Pavlovic, Aleksandra; DeGoma, Emil; Salisbury, Heidi; Brown, Colleen; Ashley, Euan A
2009-12-01
Building on seminal studies of the last 20 years, genetic testing for hypertrophic cardiomyopathy (HCM) has become a clinical reality in the form of targeted exonic sequencing of known disease-causing genes. This has been driven primarily by the decreasing cost of sequencing, but the high profile of genome-wide association studies, the launch of direct-to-consumer genetic testing, and new legislative protection have also played important roles. In the clinical management of hypertrophic cardiomyopathy, genetic testing is primarily used for family screening. An increasing role is recognized, however, in diagnostic settings: in the differential diagnosis of HCM; in the differentiation of HCM from hypertensive or athlete's heart; and more rarely in preimplantation genetic diagnosis. Aside from diagnostic clarification and family screening, use of the genetic test for guiding therapy remains controversial, with data currently too limited to derive a reliable mutation risk prediction from within the phenotypic noise of different modifying genomes. Meanwhile, the power of genetic testing derives from the confidence with which a mutation can be called present or absent in a given individual. This confidence contrasts with our more limited ability to judge the significance of mutations for which co-segregation has not been demonstrated. These variants of "unknown" significance represent the greatest challenge to the wider adoption of genetic testing in HCM. Looking forward, next-generation sequencing technologies promise to revolutionize the current approach as whole genome sequencing will soon be available for the cost of today's targeted panel. In summary, our future will be characterized not by lack of genetic information but by our ability to effectively parse it.
Hayashimoto, Nobuhito; Ueno, Masami; Tkakura, Akira; Itoh, Toshio
2007-06-01
Phylogenetic analysis based on 16S rRNA sequences with sequence data of some bacterial species of Pasteurellaceae related to rodents deposited in GenBank was performed along with biochemical characterization for the 20 strains of V-factor dependent members of Pasteurellaceae derived from laboratory rats to obtain basic information and to investigate the taxonomic positions. The results of biochemical tests for all strains were identical except for three tests, the ornithine decarboxylase test, and fermentation tests of D(+) mannose and D(+) xylose. The biochemical properties of 8 of 20 strains that showed negative results for the fermentation test of D(+) xylose agreed with those of Haemophilus parainfluenzae complex. By phylogenetic analysis, the strains were divided into two clusters that agreed with the results of the fermentation test of xylose (group I: negative reaction for xylose, group II: positive reaction for xylose). The clusters were independent of other bacterial species of Pasteurellaceae tested. The sequences of the strains in group I showed 99.7-99.8% similarity and the strains in group II showed 99.3-99.7% similarity. None of the strains in group I had a close relation with Haemophilus parainfluenzae by phylogenetic analysis, although they showed the same biochemical properties. In conclusion, the strains had characteristic biochemical properties and formed two independent groups within the "rodent cluster" of Pasteurellaceae that differed in the results of the fermentation test of xylose. Therefore, they seemed to be hitherto undescribed taxa in Pasteurellaceae.
Sawada, Akihisa; Croom-Carter, Deborah; Kondo, Osamu; Yasui, Masahiro; Koyama-Sato, Maho; Inoue, Masami; Kawa, Keisei; Rickinson, Alan B; Tierney, Rosemary J
2011-05-01
Polymorphisms in Epstein-Barr virus (EBV) latent genes can identify virus strains from different human populations and individual strains within a population. An Asian EBV signature has been defined almost exclusively from Chinese viruses, with little information from other Asian countries. Here we sequenced polymorphic regions of the EBNA1, 2, 3A, 3B, 3C and LMP1 genes of 31 Japanese strains from control donors and EBV-associated T/NK-cell lymphoproliferative disease (T/NK-LPD) patients. Though identical to Chinese strains in their dominant EBNA1 and LMP1 alleles, Japanese viruses were subtly different at other loci. Thus, while Chinese viruses mainly fall into two families with strongly linked 'Wu' or 'Li' alleles at EBNA2 and EBNA3A/B/C, Japanese viruses all have the consensus Wu EBNA2 allele but fall into two families at EBNA3A/B/C. One family has variant Li-like sequences at EBNA3A and 3B and the consensus Li sequence at EBNA3C; the other family has variant Wu-like sequences at EBNA3A, variants of a low frequency Chinese allele 'Sp' at EBNA3B and a consensus Sp sequence at EBNA3C. Thus, EBNA3A/B/C allelotypes clearly distinguish Japanese from Chinese strains. Interestingly, most Japanese viruses also lack those immune-escape mutations in the HLA-A11 epitope-encoding region of EBNA3B that are so characteristic of viruses from the highly A11-positive Chinese population. Control donor-derived and T/NK-LPD-derived strains were similarly distributed across allelotypes and, by using allelic polymorphisms to track virus strains in patients pre- and post-haematopoietic stem-cell transplant, we show that a single strain can induce both T/NK-LPD and B-cell-lymphoproliferative disease in the same patient.
Chen, Mingchen; Lin, Xingcheng; Zheng, Weihua; Onuchic, José N; Wolynes, Peter G
2016-08-25
The associative memory, water mediated, structure and energy model (AWSEM) is a coarse-grained force field with transferable tertiary interactions that incorporates local in sequence energetic biases using bioinformatically derived structural information about peptide fragments with locally similar sequences that we call memories. The memory information from the protein data bank (PDB) database guides proper protein folding. The structural information about available sequences in the database varies in quality and can sometimes lead to frustrated free energy landscapes locally. One way out of this difficulty is to construct the input fragment memory information from all-atom simulations of portions of the complete polypeptide chain. In this paper, we investigate this approach first put forward by Kwac and Wolynes in a more complete way by studying the structure prediction capabilities of this approach for six α-helical proteins. This scheme which we call the atomistic associative memory, water mediated, structure and energy model (AAWSEM) amounts to an ab initio protein structure prediction method that starts from the ground up without using bioinformatic input. The free energy profiles from AAWSEM show that atomistic fragment memories are sufficient to guide the correct folding when tertiary forces are included. AAWSEM combines the efficiency of coarse-grained simulations on the full protein level with the local structural accuracy achievable from all-atom simulations of only parts of a large protein. The results suggest that a hybrid use of atomistic fragment memory and database memory in structural predictions may well be optimal for many practical applications.
Cross-referencing yeast genetics and mammalian genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hieter, P.; Basset, D.; Boguski, M.
1994-09-01
We have initiated a project that will systematically transfer information about yeast genes onto the genetic maps of mice and human beings. Rapidly expanding human EST data will serve as a source of candidate human homologs that will be repeatedly searched using yeast protein sequence queries. Search results will be automatically reported to participating labs. Human cDNA sequences from which the ESTs are derived will be mapped at high resolution in the human and mouse genomes. The comparative mapping information cross-references the genomic position of novel human cDNAs with functional information known about the cognate yeast genes. This should facilitatemore » the initial identification of genes responsible for mammalian mutant phenotypes, including human disease. In addition, the identification of mammalian homologs of yeast genes provides reagents for determining evolutionary conservation and for performing direct experiments in multicellular eukaryotes to enhance study of the yeast protein`s function. For example, ESTs homologous to CDC27 and CDC16 were identified, and the corresponding cDNA clones were obtained from ATTC, completely sequenced, and mapped on human and mouse chromosomes. In addition, the CDC17hs cDNA has been used to raise antisera to the CDC27Hs protein and used in subcellular localization experiments and junctional studies in mammalian cells. We have received funding from the National Center for Human Genome Research to provide a community resource which will establish comprehensive cross-referencing among yeast, human, and mouse loci. The project is set up as a service and information on how to communicate with this effort will be provided.« less
Generation of a Maize B Centromere Minimal Map Containing the Central Core Domain.
Ellis, Nathanael A; Douglas, Ryan N; Jackson, Caroline E; Birchler, James A; Dawe, R Kelly
2015-10-28
The maize B centromere has been used as a model for centromere epigenetics and as the basis for building artificial chromosomes. However, there are no sequence resources for this important centromere. Here we used transposon display for the centromere-specific retroelement CRM2 to identify a collection of 40 sequence tags that flank CRM2 insertion points on the B chromosome. These were confirmed to lie within the centromere by assaying deletion breakpoints from centromere misdivision derivatives (intracentromere breakages caused by centromere fission). Markers were grouped together on the basis of their association with other markers in the misdivision series and assembled into a pseudocontig containing 10.1 kb of sequence. To identify sequences that interact directly with centromere proteins, we carried out chromatin immunoprecipitation using antibodies to centromeric histone H3 (CENH3), a defining feature of functional centromeric sequences. The CENH3 chromatin immunoprecipitation map was interpreted relative to the known transmission rates of centromere misdivision derivatives to identify a centromere core domain spanning 33 markers. A subset of seven markers was mapped in additional B centromere misdivision derivatives with the use of unique primer pairs. A derivative previously shown to have no canonical centromere sequences (Telo3-3) lacks these core markers. Our results provide a molecular map of the B chromosome centromere and identify key sequences within the map that interact directly with centromeric histone H3. Copyright © 2015 Ellis et al.
Li, Na; Mao, Wenjun; Liu, Xue; Wang, Shuyao; Xia, Zheng; Cao, Sujian; Li, Lin; Zhang, Qi; Liu, Shan
2016-10-04
Five sulfated oligosaccharide fragments, F1-F5, were prepared from a pyruvylated galactan sulfate from the green alga Codium divaricatum, by partial depolymerization using mild acid hydrolysis and purification with gel-permeation chromatography. Negative-ion electrospray tandem mass spectrometry with collision-induced dissociation (ES-CID-MS/MS) is attempted for sequence determination of the sulfated oligosaccharides. The sequence of F1 with homogeneous disaccharide composition was first characterized to be Galp-(4SO4)-(1 → 3)-Galp by detailed nuclear magnetic resonance spectroscopic analyses. The fragmentation pattern of F1 in the product ion spectra was established on the basis of negative-ion ES-CID MS/MS, which was then applied to sequence analysis of other sulfated oligosaccharides. The sequences of F2 and F3 were deduced to be Galp-(4SO4)-(1 → 3)-Galp-(1 → 3)-Galp-(1 → 3)-Galp and 3,4-O-(1-carboxyethylidene)-Galp-(6SO4)-(1 → 3)-Galp, respectively. The sequences of major fragments in F4 and F5 were also deduced. The investigation demonstrated that negative-ion ES-CID-MS/MS was an efficient method for the sequence analysis of the pyruvylated galactan sulfate-derived oligosaccharides which revealed the patterns of substitution and glycosidic linkages. The pyruvylated galactan sulfate-derived oligosaccharides were novel sulfated oligosaccharides different from other algal polysaccharide-derived oligosaccharides. Copyright © 2016 Elsevier Ltd. All rights reserved.
Tan, Qian-Qian; Zhu, Li; Li, Yi; Liu, Wen; Ma, Wei-Hua; Lei, Chao-Liang; Wang, Xiao-Ping
2015-01-01
The cabbage beetle Colaphellus bowringi Baly is a serious insect pest of crucifers and undergoes reproductive diapause in soil. An understanding of the molecular mechanisms of diapause regulation, insecticide resistance, and other physiological processes is helpful for developing new management strategies for this beetle. However, the lack of genomic information and valid reference genes limits knowledge on the molecular bases of these physiological processes in this species. Using Illumina sequencing, we obtained more than 57 million sequence reads derived from C. bowringi, which were assembled into 39,390 unique sequences. A Clusters of Orthologous Groups classification was obtained for 9,048 of these sequences, covering 25 categories, and 16,951 were assigned to 255 Kyoto Encyclopedia of Genes and Genomes pathways. Eleven candidate reference gene sequences from the transcriptome were then identified through reverse transcriptase polymerase chain reaction. Among these candidate genes, EF1α, ACT1, and RPL19 proved to be the most stable reference genes for different reverse transcriptase quantitative polymerase chain reaction experiments in C. bowringi. Conversely, aTUB and GAPDH were the least stable reference genes. The abundant putative C. bowringi transcript sequences reported enrich the genomic resources of this beetle. Importantly, the larger number of gene sequences and valid reference genes provide a valuable platform for future gene expression studies, especially with regard to exploring the molecular mechanisms of different physiological processes in this species.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Halligan, Matthew
Radiated power calculation approaches for practical scenarios of incomplete high- density interface characterization information and incomplete incident power information are presented. The suggested approaches build upon a method that characterizes power losses through the definition of power loss constant matrices. Potential radiated power estimates include using total power loss information, partial radiated power loss information, worst case analysis, and statistical bounding analysis. A method is also proposed to calculate radiated power when incident power information is not fully known for non-periodic signals at the interface. Incident data signals are modeled from a two-state Markov chain where bit state probabilities aremore » derived. The total spectrum for windowed signals is postulated as the superposition of spectra from individual pulses in a data sequence. Statistical bounding methods are proposed as a basis for the radiated power calculation due to the statistical calculation complexity to find a radiated power probability density function.« less
Epstein, F H; Mugler, J P; Brookeman, J R
1994-02-01
A number of pulse sequence techniques, including magnetization-prepared gradient echo (MP-GRE), segmented GRE, and hybrid RARE, employ a relatively large number of variable pulse sequence parameters and acquire the image data during a transient signal evolution. These sequences have recently been proposed and/or used for clinical applications in the brain, spine, liver, and coronary arteries. Thus, the need for a method of deriving optimal pulse sequence parameter values for this class of sequences now exists. Due to the complexity of these sequences, conventional optimization approaches, such as applying differential calculus to signal difference equations, are inadequate. We have developed a general framework for adapting the simulated annealing algorithm to pulse sequence parameter value optimization, and applied this framework to the specific case of optimizing the white matter-gray matter signal difference for a T1-weighted variable flip angle 3D MP-RAGE sequence. Using our algorithm, the values of 35 sequence parameters, including the magnetization-preparation RF pulse flip angle and delay time, 32 flip angles in the variable flip angle gradient-echo acquisition sequence, and the magnetization recovery time, were derived. Optimized 3D MP-RAGE achieved up to a 130% increase in white matter-gray matter signal difference compared with optimized 3D RF-spoiled FLASH with the same total acquisition time. The simulated annealing approach was effective at deriving optimal parameter values for a specific 3D MP-RAGE imaging objective, and may be useful for other imaging objectives and sequences in this general class.
The testes transcriptome derived from the New World Screwworm, Cochliomyia hominivorax SRA
USDA-ARS?s Scientific Manuscript database
In a collaboration with National Center for Genome Resources researchers, we sequenced and assembled the testes transcriptome derived from the Pacora, Panama, production plant strain J06 of the New World Screwworm, Cochliomyia hominivorax. This sequencing project produced 72,750,822 raw reads and th...
NASA Technical Reports Server (NTRS)
1979-01-01
The results of the Coastal Zone Color Scanner protoflight tests are examined in detail while some of the test results are evaluated with respect to expected performance. Performance characteristics examined include spectral response, signal to noise ratio as a function of radiance input, radiance response, the modulation transfer function, and the field of view and coregistration. The results of orbital sequence tests are also included. The in orbit performance or return of radiometric data in the six spectral bands is evaluated along with the data processing sequence necessary to derive the final data products. Examples of the raw data are given and the housekeeping or diagnostic data which provides information on the day to day health or status of the instrument are discussed.
Heparin Characterization: Challenges and Solutions
NASA Astrophysics Data System (ADS)
Jones, Christopher J.; Beni, Szabolcs; Limtiaco, John F. K.; Langeslay, Derek J.; Larive, Cynthia K.
2011-07-01
Although heparin is an important and widely prescribed pharmaceutical anticoagulant, its high degree of sequence microheterogeneity and size polydispersity make molecular-level characterization challenging. Unlike nucleic acids and proteins that are biosynthesized through template-driven assembly processes, heparin and the related glycosaminoglycan heparan sulfate are actively remodeled during biosynthesis through a series of enzymatic reactions that lead to variable levels of O- and N-sulfonation and uronic acid epimers. As summarized in this review, heparin sequence information is determined through a bottom-up approach that relies on depolymerization reactions, size- and charge-based separations, and sensitive mass spectrometric and nuclear magnetic resonance experiments to determine the structural identity of component oligosaccharides. The structure-elucidation process, along with its challenges and opportunities for future analytical improvements, is reviewed and illustrated for a heparin-derived hexasaccharide.
Propeller speed and phase sensor
NASA Technical Reports Server (NTRS)
Collopy, Paul D. (Inventor); Bennett, George W. (Inventor)
1992-01-01
A speed and phase sensor counterrotates aircraft propellers. A toothed wheel is attached to each propeller, and the teeth trigger a sensor as they pass, producing a sequence of signals. From the sequence of signals, rotational speed of each propeller is computer based on time intervals between successive signals. The speed can be computed several times during one revolution, thus giving speed information which is highly up-to-date. Given that spacing between teeth may not be uniform, the signals produced may be nonuniform in time. Error coefficients are derived to correct for nonuniformities in the resulting signals, thus allowing accurate speed to be computed despite the spacing nonuniformities. Phase can be viewed as the relative rotational position of one propeller with respect to the other, but measured at a fixed time. Phase is computed from the signals.
Conventions and nomenclature for double diffusion encoding NMR and MRI.
Shemesh, Noam; Jespersen, Sune N; Alexander, Daniel C; Cohen, Yoram; Drobnjak, Ivana; Dyrby, Tim B; Finsterbusch, Jurgen; Koch, Martin A; Kuder, Tristan; Laun, Fredrik; Lawrenz, Marco; Lundell, Henrik; Mitra, Partha P; Nilsson, Markus; Özarslan, Evren; Topgaard, Daniel; Westin, Carl-Fredrik
2016-01-01
Stejskal and Tanner's ingenious pulsed field gradient design from 1965 has made diffusion NMR and MRI the mainstay of most studies seeking to resolve microstructural information in porous systems in general and biological systems in particular. Methods extending beyond Stejskal and Tanner's design, such as double diffusion encoding (DDE) NMR and MRI, may provide novel quantifiable metrics that are less easily inferred from conventional diffusion acquisitions. Despite the growing interest on the topic, the terminology for the pulse sequences, their parameters, and the metrics that can be derived from them remains inconsistent and disparate among groups active in DDE. Here, we present a consensus of those groups on terminology for DDE sequences and associated concepts. Furthermore, the regimes in which DDE metrics appear to provide microstructural information that cannot be achieved using more conventional counterparts (in a model-free fashion) are elucidated. We highlight in particular DDE's potential for determining microscopic diffusion anisotropy and microscopic fractional anisotropy, which offer metrics of microscopic features independent of orientation dispersion and thus provide information complementary to the standard, macroscopic, fractional anisotropy conventionally obtained by diffusion MR. Finally, we discuss future vistas and perspectives for DDE. © 2015 Wiley Periodicals, Inc.
Reinprecht, Yarmilla; Yadegari, Zeinab; Perry, Gregory E.; Siddiqua, Mahbuba; Wright, Lori C.; McClean, Phillip E.; Pauls, K. Peter
2013-01-01
Legumes contain a variety of phytochemicals derived from the phenylpropanoid pathway that have important effects on human health as well as seed coat color, plant disease resistance and nodulation. However, the information about the genes involved in this important pathway is fragmentary in common bean (Phaseolus vulgaris L.). The objectives of this research were to isolate genes that function in and control the phenylpropanoid pathway in common bean, determine their genomic locations in silico in common bean and soybean, and analyze sequences of the 4CL gene family in two common bean genotypes. Sequences of phenylpropanoid pathway genes available for common bean or other plant species were aligned, and the conserved regions were used to design sequence-specific primers. The PCR products were cloned and sequenced and the gene sequences along with common bean gene-based (g) markers were BLASTed against the Glycine max v.1.0 genome and the P. vulgaris v.1.0 (Andean) early release genome. In addition, gene sequences were BLASTed against the OAC Rex (Mesoamerican) genome sequence assembly. In total, fragments of 46 structural and regulatory phenylpropanoid pathway genes were characterized in this way and placed in silico on common bean and soybean sequence maps. The maps contain over 250 common bean g and SSR (simple sequence repeat) markers and identify the positions of more than 60 additional phenylpropanoid pathway gene sequences, plus the putative locations of seed coat color genes. The majority of cloned phenylpropanoid pathway gene sequences were mapped to one location in the common bean genome but had two positions in soybean. The comparison of the genomic maps confirmed previous studies, which show that common bean and soybean share genomic regions, including those containing phenylpropanoid pathway gene sequences, with conserved synteny. Indels identified in the comparison of Andean and Mesoamerican common bean 4CL gene sequences might be used to develop inter-pool phenylpropanoid pathway gene-based markers. We anticipate that the information obtained by this study will simplify and accelerate selections of common bean with specific phenylpropanoid pathway alleles to increase the contents of beneficial phenylpropanoids in common bean and other legumes. PMID:24046770
a Simple Symmetric Algorithm Using a Likeness with Introns Behavior in RNA Sequences
NASA Astrophysics Data System (ADS)
Regoli, Massimo
2009-02-01
The RNA-Crypto System (shortly RCS) is a symmetric key algorithm to cipher data. The idea for this new algorithm starts from the observation of nature. In particular from the observation of RNA behavior and some of its properties. The RNA sequences has some sections called Introns. Introns, derived from the term "intragenic regions", are non-coding sections of precursor mRNA (pre-mRNA) or other RNAs, that are removed (spliced out of the RNA) before the mature RNA is formed. Once the introns have been spliced out of a pre-mRNA, the resulting mRNA sequence is ready to be translated into a protein. The corresponding parts of a gene are known as introns as well. The nature and the role of Introns in the pre-mRNA is not clear and it is under ponderous researches by Biologists but, in our case, we will use the presence of Introns in the RNA-Crypto System output as a strong method to add chaotic non coding information and an unnecessary behaviour in the access to the secret key to code the messages. In the RNA-Crypto System algoritnm the introns are sections of the ciphered message with non-coding information as well as in the precursor mRNA.
Purification and sequence characterization of chondroitin sulfate and dermatan sulfate from fishes.
Lin, Na; Mo, Xiaoli; Yang, Yang; Zhang, Hong
2017-04-01
Chondroitin sulfate (CS) and dermatan sulfate (DS) were extracted and purified from skins or bones of salmon (Salmo salar), snakehead (Channa argus), monkfish (Lophius litulon) and skipjack tuna (Katsuwonus pelamis). Size, structural sequences and sulfate groups of oligosaccharides in the purified CS and DS could be characterized and identified using high performance liquid chromatography (HPLC) combined with Orbitrap mass spectrometry. CS and DS chain structure varies depending on origin, but motif structure appears consistent. Structures of CS and DS oligosaccharides with different size and sulfate groups were compared between fishes and other animals, and results showed that some minor differences of special structures could be identified by hydrophilic interaction chromatography-liquid chromatography-fourier transform-mass/mass spectrometry (HILIC-LC-FT-MS/MS). For example, data showed that salmon and skipjack CS had a higher percentage content of high-level sulfated oligosaccharides than that porcine CS. In addition, structural information of different origins of CS and DS was analyzed by principal component analysis (PCA) and results showed that CS and DS samples could be differentiated according to their molecular conformation and oligosaccharide fragments information. Understanding CS and DS structure derived from different origins may lead to the production of CS or DS with unique disaccharides or oligosaccharides sequence composition and biological functions.
MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction.
Fang, Chao; Shang, Yi; Xu, Dong
2018-05-01
Protein secondary structure prediction can provide important information for protein 3D structure prediction and protein functions. Deep learning offers a new opportunity to significantly improve prediction accuracy. In this article, a new deep neural network architecture, named the Deep inception-inside-inception (Deep3I) network, is proposed for protein secondary structure prediction and implemented as a software tool MUFOLD-SS. The input to MUFOLD-SS is a carefully designed feature matrix corresponding to the primary amino acid sequence of a protein, which consists of a rich set of information derived from individual amino acid, as well as the context of the protein sequence. Specifically, the feature matrix is a composition of physio-chemical properties of amino acids, PSI-BLAST profile, and HHBlits profile. MUFOLD-SS is composed of a sequence of nested inception modules and maps the input matrix to either eight states or three states of secondary structures. The architecture of MUFOLD-SS enables effective processing of local and global interactions between amino acids in making accurate prediction. In extensive experiments on multiple datasets, MUFOLD-SS outperformed the best existing methods and other deep neural networks significantly. MUFold-SS can be downloaded from http://dslsrv8.cs.missouri.edu/~cf797/MUFoldSS/download.html. © 2018 Wiley Periodicals, Inc.
Košir, Alexandra Bogožalec; Arulandhu, Alfred J; Voorhuijzen, Marleen M; Xiao, Hongmei; Hagelaar, Rico; Staats, Martijn; Costessi, Adalberto; Žel, Jana; Kok, Esther J; Dijk, Jeroen P van
2017-10-26
The majority of feed products in industrialised countries contains materials derived from genetically modified organisms (GMOs). In parallel, the number of reports of unauthorised GMOs (UGMOs) is gradually increasing. There is a lack of specific detection methods for UGMOs, due to the absence of detailed sequence information and reference materials. In this research, an adapted genome walking approach was developed, called ALF: Amplification of Linearly-enriched Fragments. Coupling of ALF to NGS aims for simultaneous detection and identification of all GMOs, including UGMOs, in one sample, in a single analysis. The ALF approach was assessed on a mixture made of DNA extracts from four reference materials, in an uneven distribution, mimicking a real life situation. The complete insert and genomic flanking regions were known for three of the included GMO events, while for MON15985 only partial sequence information was available. Combined with a known organisation of elements, this GMO served as a model for a UGMO. We successfully identified sequences matching with this organisation of elements serving as proof of principle for ALF as new UGMO detection strategy. Additionally, this study provides a first outline of an automated, web-based analysis pipeline for identification of UGMOs containing known GM elements.
Conceptual issues in Bayesian divergence time estimation
2016-01-01
Bayesian inference of species divergence times is an unusual statistical problem, because the divergence time parameters are not identifiable unless both fossil calibrations and sequence data are available. Commonly used marginal priors on divergence times derived from fossil calibrations may conflict with node order on the phylogenetic tree causing a change in the prior on divergence times for a particular topology. Care should be taken to avoid confusing this effect with changes due to informative sequence data. This effect is illustrated with examples. A topology-consistent prior that preserves the marginal priors is defined and examples are constructed. Conflicts between fossil calibrations and relative branch lengths (based on sequence data) can cause estimates of divergence times that are grossly incorrect, yet have a narrow posterior distribution. An example of this effect is given; it is recommended that overly narrow posterior distributions of divergence times should be carefully scrutinized. This article is part of the themed issue ‘Dating species divergences using rocks and clocks’. PMID:27325831
Conceptual issues in Bayesian divergence time estimation.
Rannala, Bruce
2016-07-19
Bayesian inference of species divergence times is an unusual statistical problem, because the divergence time parameters are not identifiable unless both fossil calibrations and sequence data are available. Commonly used marginal priors on divergence times derived from fossil calibrations may conflict with node order on the phylogenetic tree causing a change in the prior on divergence times for a particular topology. Care should be taken to avoid confusing this effect with changes due to informative sequence data. This effect is illustrated with examples. A topology-consistent prior that preserves the marginal priors is defined and examples are constructed. Conflicts between fossil calibrations and relative branch lengths (based on sequence data) can cause estimates of divergence times that are grossly incorrect, yet have a narrow posterior distribution. An example of this effect is given; it is recommended that overly narrow posterior distributions of divergence times should be carefully scrutinized.This article is part of the themed issue 'Dating species divergences using rocks and clocks'. © 2016 The Author(s).
Evolution of massive stars in very young clusters and associations
NASA Technical Reports Server (NTRS)
Stothers, R. B.
1985-01-01
Statistics concerning the stellar content of young galactic clusters and associations which show well defined main sequence turnups have been analyzed in order to derive information about stellar evolution in high-mass galaxies. The analytical approach is semiempirical and uses natural spectroscopic groups of stars on the H-R diagram together with the stars' apparent magnitudes. The new approach does not depend on absolute luminosities and requires only the most basic elements of stellar evolution theory. The following conclusions are offered on the basis of the statistical analysis: (1) O-tupe main-sequence stars evolve to a spectral type of B1 during core hydrogen burning; (2) most O-type blue stragglers are newly formed massive stars burning core hydrogen; (3) supergiants lying redward of the main-sequence turnup are burning core helium; and most Wolf-Rayet stars are burning core helium and originally had masses greater than 30-40 solar mass. The statistics of the natural spectroscopic stars in young galactic clusters and associations are given in a table.
Calibrating genomic and allelic coverage bias in single-cell sequencing.
Zhang, Cheng-Zhong; Adalsteinsson, Viktor A; Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L; Meyerson, Matthew; Love, J Christopher
2015-04-16
Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1-10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (∼0.1 × ) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples.
Calibrating genomic and allelic coverage bias in single-cell sequencing
Francis, Joshua; Cornils, Hauke; Jung, Joonil; Maire, Cecile; Ligon, Keith L.; Meyerson, Matthew; Love, J. Christopher
2016-01-01
Artifacts introduced in whole-genome amplification (WGA) make it difficult to derive accurate genomic information from single-cell genomes and require different analytical strategies from bulk genome analysis. Here, we describe statistical methods to quantitatively assess the amplification bias resulting from whole-genome amplification of single-cell genomic DNA. Analysis of single-cell DNA libraries generated by different technologies revealed universal features of the genome coverage bias predominantly generated at the amplicon level (1–10 kb). The magnitude of coverage bias can be accurately calibrated from low-pass sequencing (~0.1 ×) to predict the depth-of-coverage yield of single-cell DNA libraries sequenced at arbitrary depths. We further provide a benchmark comparison of single-cell libraries generated by multi-strand displacement amplification (MDA) and multiple annealing and looping-based amplification cycles (MALBAC). Finally, we develop statistical models to calibrate allelic bias in single-cell whole-genome amplification and demonstrate a census-based strategy for efficient and accurate variant detection from low-input biopsy samples. PMID:25879913
Ye, Yuzhen; Wang, Zhongwei; Zhou, Jianfeng; Wu, Qingjiang
2009-08-01
Microsatellite markers and D-loop sequences of mtDNA from a female allotetraploid parent carp and her progenies of generations 1 and 2 induced by sperm of five distant fish species were analyzed. Eleven microsatellite markers were used to identify 48 alleles from the allotetraploid female. The same number of alleles (48) appeared in the first and second generations of the gynogenetic offspring, regardless of the source of the sperm used as an activator. The mtDNA D-loop analysis was performed on the female tetraploid parent, 25 gynogenetic offspring, and 5 sperm-donor species. Fourteen variable sites from the 1,018 bp sequences were observed in the offspring as compared to the female tetraploid parent. Results from D-loop sequence and microsatellite marker analysis showed exclusive maternal transmission, and no genetic information was derived from the father. Our study suggests that progenies of artificial tetraploid carp are genetically stable, which is important for genetic breeding of this tetraploid fish.
Lew, Jocelyne M; Kapopoulou, Adamandia; Jones, Louis M; Cole, Stewart T
2011-01-01
TubercuList (http://tuberculist.epfl.ch/), the relational database that presents genome-derived information about H37Rv, the paradigm strain of Mycobacterium tuberculosis, has been active for ten years and now presents its twentieth release. Here, we describe some of the recent changes that have resulted from manual annotation with information from the scientific literature. Through manual curation, TubercuList strives to provide current gene-based information and is thus distinguished from other online sources of genome sequence data for M. tuberculosis. New, mostly small, genes have been discovered and the coordinates of some existing coding sequences have been changed when bioinformatics or experimental data suggest that this is required. Nucleotides that are polymorphic between different sources of H37Rv are annotated and gene essentiality data have been updated. A host of functional information has been gleaned from the literature and many new activities of proteins and RNAs have been included. To facilitate basic and translational research, TubercuList also provides links to other specialized databases that present diverse datasets such as 3D-structures, expression profiles, drug development criteria and drug resistance information, in addition to direct access to PubMed articles pertinent to particular genes. TubercuList has been and remains a highly valuable tool for the tuberculosis research community with >75,000 visitors per month. Copyright © 2010 Elsevier Ltd. All rights reserved.
Genome alignment with graph data structures: a comparison
2014-01-01
Background Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment. Results We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures. Conclusion We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools. PMID:24712884
Automatic classification of protein structures using physicochemical parameters.
Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam
2014-09-01
Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
What is bioinformatics? A proposed definition and overview of the field.
Luscombe, N M; Greenbaum, D; Gerstein, M
2001-01-01
The recent flood of data from genome sequences and functional genomics has given rise to new field, bioinformatics, which combines elements of biology and computer science. Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems. Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying "informatics" techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale. Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g. expression data). Additional information includes the text of scientific papers and "relationship data" from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational techniques including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches integrating a variety of computational methods and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available over the web at http://bioinfo.mbb.yale.edu/what-is-it.
RPAN: rice pan-genome browser for ∼3000 rice genomes.
Sun, Chen; Hu, Zhiqiang; Zheng, Tianqing; Lu, Kuangchen; Zhao, Yue; Wang, Wensheng; Shi, Jianxin; Wang, Chunchao; Lu, Jinyuan; Zhang, Dabing; Li, Zhikang; Wei, Chaochun
2017-01-25
A pan-genome is the union of the gene sets of all the individuals of a clade or a species and it provides a new dimension of genome complexity with the presence/absence variations (PAVs) of genes among these genomes. With the progress of sequencing technologies, pan-genome study is becoming affordable for eukaryotes with large-sized genomes. The Asian cultivated rice, Oryza sativa L., is one of the major food sources for the world and a model organism in plant biology. Recently, the 3000 Rice Genome Project (3K RGP) sequenced more than 3000 rice genomes with a mean sequencing depth of 14.3×, which provided a tremendous resource for rice research. In this paper, we present a genome browser, Rice Pan-genome Browser (RPAN), as a tool to search and visualize the rice pan-genome derived from 3K RGP. RPAN contains a database of the basic information of 3010 rice accessions, including genomic sequences, gene annotations, PAV information and gene expression data of the rice pan-genome. At least 12 000 novel genes absent in the reference genome were included. RPAN also provides multiple search and visualization functions. RPAN can be a rich resource for rice biology and rice breeding. It is available at http://cgm.sjtu.edu.cn/3kricedb/ or http://www.rmbreeding.cn/pan3k. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah
2012-01-01
Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple.
Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah
2012-01-01
Background Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603
Lentes, K U; Mathieu, E; Bischoff, R; Rasmussen, U B; Pavirani, A
1993-01-01
Current methods for comparative analyses of protein sequences are 1D-alignments of amino acid sequences based on the maximization of amino acid identity (homology) and the prediction of secondary structure elements. This method has a major drawback once the amino acid identity drops below 20-25%, since maximization of a homology score does not take into account any structural information. A new technique called Hydrophobic Cluster Analysis (HCA) has been developed by Lemesle-Varloot et al. (Biochimie 72, 555-574), 1990). This consists of comparing several sequences simultaneously and combining homology detection with secondary structure analysis. HCA is primarily based on the detection and comparison of structural segments constituting the hydrophobic core of globular protein domains, with or without transmembrane domains. We have applied HCA to the analysis of different families of G-protein coupled receptors, such as catecholamine receptors as well as peptide hormone receptors. Utilizing HCA the thrombin receptor, a new and as yet unique member of the family of G-protein coupled receptors, can be clearly classified as being closely related to the family of neuropeptide receptors rather than to the catecholamine receptors for which the shape of the hydrophobic clusters and the length of their third cytoplasmic loop are very different. Furthermore, the potential of HCA to predict relationships between new putative and already characterized members of this family of receptors will be presented.
NASA Astrophysics Data System (ADS)
Hamrouni, Sameh; Rougon, Nicolas; Pr"teux, Françoise
2011-03-01
In perfusion MRI (p-MRI) exams, short-axis (SA) image sequences are captured at multiple slice levels along the long-axis of the heart during the transit of a vascular contrast agent (Gd-DTPA) through the cardiac chambers and muscle. Compensating cardio-thoracic motions is a requirement for enabling computer-aided quantitative assessment of myocardial ischaemia from contrast-enhanced p-MRI sequences. The classical paradigm consists of registering each sequence frame on a reference image using some intensity-based matching criterion. In this paper, we introduce a novel unsupervised method for the spatio-temporal groupwise registration of cardiac p-MRI exams based on normalized mutual information (NMI) between high-dimensional feature distributions. Here, local contrast enhancement curves are used as a dense set of spatio-temporal features, and statistically matched through variational optimization to a target feature distribution derived from a registered reference template. The hard issue of probability density estimation in high-dimensional state spaces is bypassed by using consistent geometric entropy estimators, allowing NMI to be computed directly from feature samples. Specifically, a computationally efficient kth-nearest neighbor (kNN) estimation framework is retained, leading to closed-form expressions for the gradient flow of NMI over finite- and infinite-dimensional motion spaces. This approach is applied to the groupwise alignment of cardiac p-MRI exams using a free-form Deformation (FFD) model for cardio-thoracic motions. Experiments on simulated and natural datasets suggest its accuracy and robustness for registering p-MRI exams comprising more than 30 frames.
A generalized analysis of hydrophobic and loop clusters within globular protein sequences
Eudes, Richard; Le Tuan, Khanh; Delettré, Jean; Mornon, Jean-Paul; Callebaut, Isabelle
2007-01-01
Background Hydrophobic Cluster Analysis (HCA) is an efficient way to compare highly divergent sequences through the implicit secondary structure information directly derived from hydrophobic clusters. However, its efficiency and application are currently limited by the need of user expertise. In order to help the analysis of HCA plots, we report here the structural preferences of hydrophobic cluster species, which are frequently encountered in globular domains of proteins. These species are characterized only by their hydrophobic/non-hydrophobic dichotomy. This analysis has been extended to loop-forming clusters, using an appropriate loop alphabet. Results The structural behavior of hydrophobic cluster species, which are typical of protein globular domains, was investigated within banks of experimental structures, considered at different levels of sequence redundancy. The 294 more frequent hydrophobic cluster species were analyzed with regard to their association with the different secondary structures (frequencies of association with secondary structures and secondary structure propensities). Hydrophobic cluster species are predominantly associated with regular secondary structures, and a large part (60 %) reveals preferences for α-helices or β-strands. Moreover, the analysis of the hydrophobic cluster amino acid composition generally allows for finer prediction of the regular secondary structure associated with the considered cluster within a cluster species. We also investigated the behavior of loop forming clusters, using a "PGDNS" alphabet. These loop clusters do not overlap with hydrophobic clusters and are highly associated with coils. Finally, the structural information contained in the hydrophobic structural words, as deduced from experimental structures, was compared to the PSI-PRED predictions, revealing that β-strands and especially α-helices are generally over-predicted within the limits of typical β and α hydrophobic clusters. Conclusion The dictionary of hydrophobic clusters described here can help the HCA user to interpret and compare the HCA plots of globular protein sequences, as well as provides an original fundamental insight into the structural bricks of protein folds. Moreover, the novel loop cluster analysis brings additional information for secondary structure prediction on the whole sequence through a generalized cluster analysis (GCA), and not only on regular secondary structures. Such information lays the foundations for developing a new and original tool for secondary structure prediction. PMID:17210072
Finnerty, John R; Mazza, Maureen E; Jezewski, Peter A
2009-01-01
Background Msx originated early in animal evolution and is implicated in human genetic disorders. To reconstruct the functional evolution of Msx and inform the study of human mutations, we analyzed the phylogeny and synteny of 46 metazoan Msx proteins and tracked the duplication, diversification and loss of conserved motifs. Results Vertebrate Msx sequences sort into distinct Msx1, Msx2 and Msx3 clades. The sister-group relationship between MSX1 and MSX2 reflects their derivation from the 4p/5q chromosomal paralogon, a derivative of the original "MetaHox" cluster. We demonstrate physical linkage between Msx and other MetaHox genes (Hmx, NK1, Emx) in a cnidarian. Seven conserved domains, including two Groucho repression domains (N- and C-terminal), were present in the ancestral Msx. In cnidarians, the Groucho domains are highly similar. In vertebrate Msx1, the N-terminal Groucho domain is conserved, while the C-terminal domain diverged substantially, implying a novel function. In vertebrate Msx2 and Msx3, the C-terminal domain was lost. MSX1 mutations associated with ectodermal dysplasia or orofacial clefting disorders map to conserved domains in a non-random fashion. Conclusion Msx originated from a MetaHox ancestor that also gave rise to Tlx, Demox, NK, and possibly EHGbox, Hox and ParaHox genes. Duplication, divergence or loss of domains played a central role in the functional evolution of Msx. Duplicated domains allow pleiotropically expressed proteins to evolve new functions without disrupting existing interaction networks. Human missense sequence variants reside within evolutionarily conserved domains, likely disrupting protein function. This phylogenomic evaluation of candidate disease markers will inform clinical and functional studies. PMID:19154605
Finnerty, John R; Mazza, Maureen E; Jezewski, Peter A
2009-01-20
Msx originated early in animal evolution and is implicated in human genetic disorders. To reconstruct the functional evolution of Msx and inform the study of human mutations, we analyzed the phylogeny and synteny of 46 metazoan Msx proteins and tracked the duplication, diversification and loss of conserved motifs. Vertebrate Msx sequences sort into distinct Msx1, Msx2 and Msx3 clades. The sister-group relationship between MSX1 and MSX2 reflects their derivation from the 4p/5q chromosomal paralogon, a derivative of the original "MetaHox" cluster. We demonstrate physical linkage between Msx and other MetaHox genes (Hmx, NK1, Emx) in a cnidarian. Seven conserved domains, including two Groucho repression domains (N- and C-terminal), were present in the ancestral Msx. In cnidarians, the Groucho domains are highly similar. In vertebrate Msx1, the N-terminal Groucho domain is conserved, while the C-terminal domain diverged substantially, implying a novel function. In vertebrate Msx2 and Msx3, the C-terminal domain was lost. MSX1 mutations associated with ectodermal dysplasia or orofacial clefting disorders map to conserved domains in a non-random fashion. Msx originated from a MetaHox ancestor that also gave rise to Tlx, Demox, NK, and possibly EHGbox, Hox and ParaHox genes. Duplication, divergence or loss of domains played a central role in the functional evolution of Msx. Duplicated domains allow pleiotropically expressed proteins to evolve new functions without disrupting existing interaction networks. Human missense sequence variants reside within evolutionarily conserved domains, likely disrupting protein function. This phylogenomic evaluation of candidate disease markers will inform clinical and functional studies.
Cullinan, David B; Hondrogiannis, George; Henderson, Terry J
2008-04-15
Two-dimensional 1H-13C HSQC (heteronuclear single quantum correlation) and fast-HMQC (heteronuclear multiple quantum correlation) pulse sequences were implemented using a sensitivity-enhanced, cryogenic probehead for detecting compounds relevant to the Chemical Weapons Convention present in complex mixtures. The resulting methods demonstrated exceptional sensitivity for detecting the analytes at trace level concentrations. 1H-13C correlations of target analytes at < or = 25 microg/mL were easily detected in a sample where the 1H solvent signal was approximately 58,000-fold more intense than the analyte 1H signals. The problem of overlapping signals typically observed in conventional 1H spectroscopy was essentially eliminated, while 1H and 13C chemical shift information could be derived quickly and simultaneously from the resulting spectra. The fast-HMQC pulse sequences generated magnitude mode spectra suitable for detailed analysis in approximately 4.5 h and can be used in experiments to efficiently screen a large number of samples. The HSQC pulse sequences, on the other hand, required roughly twice the data acquisition time to produce suitable spectra. These spectra, however, were phase-sensitive, contained considerably more resolution in both dimensions, and proved to be superior for detecting analyte 1H-13C correlations. Furthermore, a HSQC spectrum collected with a multiplicity-edited pulse sequence provided additional structural information valuable for identifying target analytes. The HSQC pulse sequences are ideal for collecting high-quality data sets with overnight acquisitions and logically follow the use of fast-HMQC pulse sequences to rapidly screen samples for potential target analytes. Use of the pulse sequences considerably improves the performance of NMR spectroscopy as a complimentary technique for the screening, identification, and validation of chemical warfare agents and other small-molecule analytes present in complex mixtures and environmental samples.
Clément, Nathalie; Avalosse, Bernard; El Bakkouri, Karim; Velu, Thierry; Brandenburger, Annick
2001-01-01
The production of wild-type-free stocks of recombinant parvovirus minute virus of mice [MVM(p)] is difficult due to the presence of homologous sequences in vector and helper genomes that cannot easily be eliminated from the overlapping coding sequences. We have therefore cloned and sequenced spontaneously occurring defective particles of MVM(p) with very small genomes to identify the minimal cis-acting sequences required for DNA amplification and virus production. One of them has lost all capsid-coding sequences but is still able to replicate in permissive cells when nonstructural proteins are provided in trans by a helper plasmid. Vectors derived from this particle produce stocks with no detectable wild-type MVM after cotransfection with new, matched, helper plasmids that present no homology downstream from the transgene. PMID:11152501
A Unified Mixed-Effects Model for Rare-Variant Association in Sequencing Studies
Sun, Jianping; Zheng, Yingye; Hsu, Li
2013-01-01
For rare-variant association analysis, due to extreme low frequencies of these variants, it is necessary to aggregate them by a prior set (e.g., genes and pathways) in order to achieve adequate power. In this paper, we consider hierarchical models to relate a set of rare variants to phenotype by modeling the effects of variants as a function of variant characteristics while allowing for variant-specific effect (heterogeneity). We derive a set of two score statistics, testing the group effect by variant characteristics and the heterogeneity effect. We make a novel modification to these score statistics so that they are independent under the null hypothesis and their asymptotic distributions can be derived. As a result, the computational burden is greatly reduced compared with permutation-based tests. Our approach provides a general testing framework for rare variants association, which includes many commonly used tests, such as the burden test [Li and Leal, 2008] and the sequence kernel association test [Wu et al., 2011], as special cases. Furthermore, in contrast to these tests, our proposed test has an added capacity to identify which components of variant characteristics and heterogeneity contribute to the association. Simulations under a wide range of scenarios show that the proposed test is valid, robust and powerful. An application to the Dallas Heart Study illustrates that apart from identifying genes with significant associations, the new method also provides additional information regarding the source of the association. Such information may be useful for generating hypothesis in future studies. PMID:23483651
Feedback power control strategies in wireless sensor networks with joint channel decoding.
Abrardo, Andrea; Ferrari, Gianluigi; Martalò, Marco; Perna, Fabio
2009-01-01
In this paper, we derive feedback power control strategies for block-faded multiple access schemes with correlated sources and joint channel decoding (JCD). In particular, upon the derivation of the feasible signal-to-noise ratio (SNR) region for the considered multiple access schemes, i.e., the multidimensional SNR region where error-free communications are, in principle, possible, two feedback power control strategies are proposed: (i) a classical feedback power control strategy, which aims at equalizing all link SNRs at the access point (AP), and (ii) an innovative optimized feedback power control strategy, which tries to make the network operational point fall in the feasible SNR region at the lowest overall transmit energy consumption. These strategies will be referred to as "balanced SNR" and "unbalanced SNR," respectively. While they require, in principle, an unlimited power control range at the sources, we also propose practical versions with a limited power control range. We preliminary consider a scenario with orthogonal links and ideal feedback. Then, we analyze the robustness of the proposed power control strategies to possible non-idealities, in terms of residual multiple access interference and noisy feedback channels. Finally, we successfully apply the proposed feedback power control strategies to a limiting case of the class of considered multiple access schemes, namely a central estimating officer (CEO) scenario, where the sensors observe noisy versions of a common binary information sequence and the AP's goal is to estimate this sequence by properly fusing the soft-output information output by the JCD algorithm.
Computational modeling of RNA 3D structures, with the aid of experimental restraints
Magnus, Marcin; Matelska, Dorota; Łach, Grzegorz; Chojnowski, Grzegorz; Boniecki, Michal J; Purta, Elzbieta; Dawson, Wayne; Dunin-Horkawicz, Stanislaw; Bujnicki, Janusz M
2014-01-01
In addition to mRNAs whose primary function is transmission of genetic information from DNA to proteins, numerous other classes of RNA molecules exist, which are involved in a variety of functions, such as catalyzing biochemical reactions or performing regulatory roles. In analogy to proteins, the function of RNAs depends on their structure and dynamics, which are largely determined by the ribonucleotide sequence. Experimental determination of high-resolution RNA structures is both laborious and difficult, and therefore, the majority of known RNAs remain structurally uncharacterized. To address this problem, computational structure prediction methods were developed that simulate either the physical process of RNA structure formation (“Greek science” approach) or utilize information derived from known structures of other RNA molecules (“Babylonian science” approach). All computational methods suffer from various limitations that make them generally unreliable for structure prediction of long RNA sequences. However, in many cases, the limitations of computational and experimental methods can be overcome by combining these two complementary approaches with each other. In this work, we review computational approaches for RNA structure prediction, with emphasis on implementations (particular programs) that can utilize restraints derived from experimental analyses. We also list experimental approaches, whose results can be relatively easily used by computational methods. Finally, we describe case studies where computational and experimental analyses were successfully combined to determine RNA structures that would remain out of reach for each of these approaches applied separately. PMID:24785264
2014-01-01
Background The nematode Pratylenchus neglectus has a wide host range and is able to feed on the root systems of cereals, oilseeds, grain and pasture legumes. Under the Mediterranean low rainfall environments of Australia, annual Medicago pasture legumes are used in rotation with cereals to fix atmospheric nitrogen and improve soil parameters. Considerable efforts are being made in breeding programs to improve resistance and tolerance to Pratylenchus neglectus in the major crops wheat and barley, which makes it vital to develop appropriate selection tools in medics. Results A strong source of tolerance to root damage by the root lesion nematode (RLN) Pratylenchus neglectus had previously been identified in line RH-1 (strand medic, M. littoralis). Using RH-1, we have developed a single seed descent (SSD) population of 138 lines by crossing it to the intolerant cultivar Herald. After inoculation, RLN-associated root damage clearly segregated in the population. Genetic analysis was performed by constructing a genetic map using simple sequence repeat (SSR) and gene-based SNP markers. A highly significant quantitative trait locus (QTL), QPnTolMl.1, was identified explaining 49% of the phenotypic variation in the SSD population. All SSRs and gene-based markers in the QTL region were derived from chromosome 1 of the sequenced genome of the closely related species M. truncatula. Gene-based markers were validated in advanced breeding lines derived from the RH-1 parent and also a second RLN tolerance source, RH-2 (M. truncatula ssp. tricycla). Comparative analysis to sequenced legume genomes showed that the physical QTL interval exists as a synteny block in Lotus japonicus, common bean, soybean and chickpea. Furthermore, using the sequenced genome information of M. truncatula, the QTL interval contains 55 genes out of which five are discussed as potential candidate genes responsible for the mapped tolerance. Conclusion The closely linked set of SNP-based PCR markers is directly applicable to select for two different sources of RLN tolerance in breeding programs. Moreover, genome sequence information has allowed proposing candidate genes for further functional analysis and nominates QPnTolMl.1 as a target locus for RLN tolerance in economically important grain legumes, e.g. chickpea. PMID:24742262
Nagle, Padraic S; McKeever, Caitriona; Rodriguez, Fernando; Nguyen, Binh; Wilson, W David; Rozas, Isabel
2014-09-25
In this paper we report the design and biophysical evaluation of novel rigid-core symmetric and asymmetric dicationic DNA binders containing 9H-fluorene and 9,10-dihydroanthracene cores as well as the synthesis of one of these fluorene derivatives. First, the affinity toward particular DNA sequences of these compounds and flexible core derivatives was evaluated by means of surface plasmon resonance and thermal denaturation experiments finding that the position of the cations significantly influence the binding strength. Then their affinity and mode of binding were further studied by performing circular dichroism and UV studies and the results obtained were rationalized by means of DFT calculations. We found that the fluorene derivatives prepared have the ability to bind to the minor groove of certain DNA sequences and intercalate to others, whereas the dihydroanthracene compounds bind via intercalation to all the DNA sequences studied here.
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets
2010-01-01
Background The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
Pre-main-sequence isochrones - II. Revising star and planet formation time-scales
NASA Astrophysics Data System (ADS)
Bell, Cameron P. M.; Naylor, Tim; Mayne, N. J.; Jeffries, R. D.; Littlefair, S. P.
2013-09-01
We have derived ages for 13 young (<30 Myr) star-forming regions and find that they are up to a factor of 2 older than the ages typically adopted in the literature. This result has wide-ranging implications, including that circumstellar discs survive longer (≃ 10-12 Myr) and that the average Class I lifetime is greater (≃1 Myr) than currently believed. For each star-forming region, we derived two ages from colour-magnitude diagrams. First, we fitted models of the evolution between the zero-age main sequence and terminal-age main sequence to derive a homogeneous set of main-sequence ages, distances and reddenings with statistically meaningful uncertainties. Our second age for each star-forming region was derived by fitting pre-main-sequence stars to new semi-empirical model isochrones. For the first time (for a set of clusters younger than 50 Myr), we find broad agreement between these two ages, and since these are derived from two distinct mass regimes that rely on different aspects of stellar physics, it gives us confidence in the new age scale. This agreement is largely due to our adoption of empirical colour-Teff relations and bolometric corrections for pre-main-sequence stars cooler than 4000 K. The revised ages for the star-forming regions in our sample are: ˜2 Myr for NGC 6611 (Eagle Nebula; M 16), IC 5146 (Cocoon Nebula), NGC 6530 (Lagoon Nebula; M 8) and NGC 2244 (Rosette Nebula); ˜6 Myr for σ Ori, Cep OB3b and IC 348; ≃10 Myr for λ Ori (Collinder 69); ≃11 Myr for NGC 2169; ≃12 Myr for NGC 2362; ≃13 Myr for NGC 7160; ≃14 Myr for χ Per (NGC 884); and ≃20 Myr for NGC 1960 (M 36).
USDA-ARS?s Scientific Manuscript database
Polymorphic genetic markers were identified and characterized using a partial genomic library of Heliothis virescens enriched for simple sequence repeats (SSR) and nucleotide sequences of expressed sequence tags (EST). Nucleotide sequences of 192 clones from the partial genomic library yielded 147 u...
HOMFLYPT polynomial is the best quantifier for topological cascades of vortex knots
NASA Astrophysics Data System (ADS)
Ricca, Renzo L.; Liu, Xin
2018-02-01
In this paper we derive and compare numerical sequences obtained by adapted polynomials such as HOMFLYPT, Jones and Alexander-Conway for the topological cascade of vortex torus knots and links that progressively untie by a single reconnection event at a time. Two cases are considered: the alternate sequence of knots and co-oriented links (with positive crossings) and the sequence of two-component links with oppositely oriented components (negative crossings). New recurrence equations are derived and sequences of numerical values are computed. In all cases the adapted HOMFLYPT polynomial proves to be the best quantifier for the topological cascade of torus knots and links.
From genomics to functional markers in the era of next-generation sequencing.
Salgotra, R K; Gupta, B B; Stewart, C N
2014-03-01
The availability of complete genome sequences, along with other genomic resources for Arabidopsis, rice, pigeon pea, soybean and other crops, has revolutionized our understanding of the genetic make-up of plants. Next-generation DNA sequencing (NGS) has facilitated single nucleotide polymorphism discovery in plants. Functionally-characterized sequences can be identified and functional markers (FMs) for important traits can be developed at an ever-increasing ease. FMs are derived from sequence polymorphisms found in allelic variants of a functional gene. Linkage disequilibrium-based association mapping and homologous recombinants have been developed for identification of "perfect" markers for their use in crop improvement practices. Compared with many other molecular markers, FMs derived from the functionally characterized sequence genes using NGS techniques and their use provide opportunities to develop high-yielding plant genotypes resistant to various stresses at a fast pace.
NASA Astrophysics Data System (ADS)
Schwab, Valérie F.; Herrmann, Martina; Roth, Vanessa-Nina; Gleixner, Gerd; Lehmann, Robert; Pohnert, Georg; Trumbore, Susan; Küsel, Kirsten; Totsche, Kai U.
2017-05-01
Microorganisms in groundwater play an important role in aquifer biogeochemical cycles and water quality. However, the mechanisms linking the functional diversity of microbial populations and the groundwater physico-chemistry are still not well understood due to the complexity of interactions between surface and subsurface. Within the framework of Hainich (north-western Thuringia, central Germany) Critical Zone Exploratory of the Collaborative Research Centre AquaDiva, we used the relative abundances of phospholipid-derived fatty acids (PLFAs) to link specific biochemical markers within the microbial communities to the spatio-temporal changes of the groundwater physico-chemistry. The functional diversities of the microbial communities were mainly correlated with groundwater chemistry, including dissolved O2, Fet and NH4+ concentrations. Abundances of PLFAs derived from eukaryotes and potential nitrite-oxidizing bacteria (11Me16:0 as biomarker for Nitrospira moscoviensis) were high at sites with elevated O2 concentration where groundwater recharge supplies bioavailable substrates. In anoxic groundwaters more rich in Fet, PLFAs abundant in sulfate-reducing bacteria (SRB), iron-reducing bacteria and fungi increased with Fet and HCO3- concentrations, suggesting the occurrence of active iron reduction and the possible role of fungi in meditating iron solubilization and transport in those aquifer domains. In more NH4+-rich anoxic groundwaters, anammox bacteria and SRB-derived PLFAs increased with NH4+ concentration, further evidencing the dependence of the anammox process on ammonium concentration and potential links between SRB and anammox bacteria. Additional support of the PLFA-based bacterial communities was found in DNA- and RNA-based Illumina MiSeq amplicon sequencing of bacterial 16S rRNA genes, which showed high predominance of nitrite-oxidizing bacteria Nitrospira, e.g. Nitrospira moscoviensis, in oxic aquifer zones and of anammox bacteria in more NH4+-rich anoxic groundwater. Higher relative abundances of sequence reads in the RNA-based datasets affiliated with iron-reducing bacteria in more Fet-rich groundwater supported the occurrence of active dissimilatory iron reduction. The functional diversity of the microbial communities in the biogeochemically distinct groundwater assemblages can be largely attributed to the redox conditions linked to changes in bioavailable substrates and input of substrates with the seepage. Our results demonstrate the power of complementary information derived from PLFA-based and sequencing-based approaches.
BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes.
Jespersen, Martin Closter; Peters, Bjoern; Nielsen, Morten; Marcatili, Paolo
2017-07-03
Antibodies have become an indispensable tool for many biotechnological and clinical applications. They bind their molecular target (antigen) by recognizing a portion of its structure (epitope) in a highly specific manner. The ability to predict epitopes from antigen sequences alone is a complex task. Despite substantial effort, limited advancement has been achieved over the last decade in the accuracy of epitope prediction methods, especially for those that rely on the sequence of the antigen only. Here, we present BepiPred-2.0 (http://www.cbs.dtu.dk/services/BepiPred/), a web server for predicting B-cell epitopes from antigen sequences. BepiPred-2.0 is based on a random forest algorithm trained on epitopes annotated from antibody-antigen protein structures. This new method was found to outperform other available tools for sequence-based epitope prediction both on epitope data derived from solved 3D structures, and on a large collection of linear epitopes downloaded from the IEDB database. The method displays results in a user-friendly and informative way, both for computer-savvy and non-expert users. We believe that BepiPred-2.0 will be a valuable tool for the bioinformatics and immunology community. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Effect of hot acid hydrolysis and hot chlorine dioxide stage on bleaching effluent biodegradability.
Gomes, C M; Colodette, J L; Delantonio, N R N; Mounteer, A H; Silva, C M
2007-01-01
The hot acid hydrolysis followed by chlorine dioxide (A/D*) and hot chlorine dioxide (D*) technologies have proven very useful for bleaching of eucalyptus kraft pulp. Although the characteristics and biodegradability of effluents from conventional chlorine dioxide bleaching are well known, such information is not yet available for effluents derived from hot acid hydrolysis and hot chorine dioxide bleaching. This study discusses the characteristics and biodegradability of such effluents. Combined whole effluents from the complete sequences DEpD, D*EpD, A/D*EpD and ADEpD, and from the pre-bleaching sequences DEp, D*Ep, A/D*Ep and ADEp were characterized by quantifying their colour, AOX and organic load (BOD, COD, TOC). These effluents were also evaluated for their treatability by simulation of an activated sludge system. It was concluded that treatment in the laboratory sequencing batch reactor was efficient for removal of COD, BOD and TOC of all effluents. However, colour increased after biological treatment, with the greatest increase found for the effluent produced using the AD technology. Biological treatment was less efficient at removing AOX of effluents from the sequences with D*, A/D* and AD as the first stages, when compared to the reference D stage; there was evidence of the lower treatability of these organochlorine compounds from these sequences.
SubCellProt: predicting protein subcellular localization using machine learning approaches.
Garg, Prabha; Sharma, Virag; Chaudhari, Pradeep; Roy, Nilanjan
2009-01-01
High-throughput genome sequencing projects continue to churn out enormous amounts of raw sequence data. However, most of this raw sequence data is unannotated and, hence, not very useful. Among the various approaches to decipher the function of a protein, one is to determine its localization. Experimental approaches for proteome annotation including determination of a protein's subcellular localizations are very costly and labor intensive. Besides the available experimental methods, in silico methods present alternative approaches to accomplish this task. Here, we present two machine learning approaches for prediction of the subcellular localization of a protein from the primary sequence information. Two machine learning algorithms, k Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN) were used to classify an unknown protein into one of the 11 subcellular localizations. The final prediction is made on the basis of a consensus of the predictions made by two algorithms and a probability is assigned to it. The results indicate that the primary sequence derived features like amino acid composition, sequence order and physicochemical properties can be used to assign subcellular localization with a fair degree of accuracy. Moreover, with the enhanced accuracy of our approach and the definition of a prediction domain, this method can be used for proteome annotation in a high throughput manner. SubCellProt is available at www.databases.niper.ac.in/SubCellProt.
Cho, Kwang-Soo; Yun, Bong-Kyoung; Yoon, Young-Ho; Hong, Su-Young; Mekapogu, Manjulatha; Kim, Kyung-Hee; Yang, Tae-Jin
2015-01-01
We report the chloroplast (cp) genome sequence of tartary buckwheat (Fagopyrum tataricum) obtained by next-generation sequencing technology and compared this with the previously reported common buckwheat (F. esculentum ssp. ancestrale) cp genome. The cp genome of F. tataricum has a total sequence length of 159,272 bp, which is 327 bp shorter than the common buckwheat cp genome. The cp gene content, order, and orientation are similar to those of common buckwheat, but with some structural variation at tandem and palindromic repeat frequencies and junction areas. A total of seven InDels (around 100 bp) were found within the intergenic sequences and the ycf1 gene. Copy number variation of the 21-bp tandem repeat varied in F. tataricum (four repeats) and F. esculentum (one repeat), and the InDel of the ycf1 gene was 63 bp long. Nucleotide and amino acid have highly conserved coding sequence with about 98% homology and four genes—rpoC2, ycf3, accD, and clpP—have high synonymous (Ks) value. PCR based InDel markers were applied to diverse genetic resources of F. tataricum and F. esculentum, and the amplicon size was identical to that expected in silico. Therefore, these InDel markers are informative biomarkers to practically distinguish raw or processed buckwheat products derived from F. tataricum and F. esculentum. PMID:25966355
The MAR databases: development and implementation of databases specific for marine metagenomics
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen
2018-01-01
Abstract We introduce the marine databases; MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database for completely sequenced marine prokaryotic genomes, which represent a marine prokaryote reference genome database, MarDB includes all incomplete sequenced prokaryotic genomes regardless level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets the visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. PMID:29106641
Windowed R-PDLF recoupling: a flexible and reliable tool to characterize molecular dynamics.
Gansmüller, Axel; Simorre, Jean-Pierre; Hediger, Sabine
2013-09-01
This work focuses on the improvement of the R-PDLF heteronuclear recoupling scheme, a method that allows quantification of molecular dynamics up to the microsecond timescale in heterogeneous materials. We show how the stability of the sequence towards rf-imperfections, one of the main sources of error of this technique, can be improved by the insertion of windows without irradiation into the basic elements of the symmetry-based recoupling sequence. The impact of this modification on the overall performance of the sequence in terms of scaling factor and homonuclear decoupling efficiency is evaluated. This study indicates the experimental conditions for which precise and reliable measurement of dipolar couplings can be obtained using the popular R18(1)(7) recoupling sequence, as well as alternative symmetry-based R sequences suited for fast MAS conditions. An analytical expression for the recoupled dipolar modulation has been derived that applies to a whole class of sequences with similar recoupling properties as R18(1)(7). This analytical expression provides an efficient and precise way to extract dipolar couplings from the experimental dipolar modulation curves. We hereby provide helpful tools and information for tailoring R-PDLF recoupling schemes to specific sample properties and hardware capabilities. This approach is particularly well suited for the study of materials with strong and heterogeneous molecular dynamics where a precise measurement of dipolar couplings is crucial. Copyright © 2013 Elsevier Inc. All rights reserved.
Mulder, Willem H; Crawford, Forrest W
2015-01-07
Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are "born" from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck models of phenotypic change on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under discrete Markov mutation models. Our results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees. Copyright © 2014 Elsevier Ltd. All rights reserved.
Sasaki, S
2001-04-01
A number of cross-linking (alkylating) agents have been developed and incorporated into the oligonulceotides for sequence selective control of gene expression. Recently, potential application of such active oligonucleotides has been expanding from use for improvement of inhibition efficiency to new biotechnology that may enable chemical alteration of genetic information. These interests in active oligonucleotides have encouraged the generation of new cross-linking agents that exhibit high efficiency for application of either in vitro or in vivo. This mini review summarizes structures of alkylating agents, in particular, a new basic skeleton for cross-linking, a 2'-deoxyribose derivative of 2-amino-6-vinylpurine that has been recently developed by the author's group. The 2-amino-6-vinylpurine has been shown to form a complex with cytidine under acidic conditions, and brings the vinyl and the amino reactive groups into proximity to achieve efficient alkylation. A new strategy was designed so that the reactivity of 2-amino-6-vinylpurine can be induced from the corresponding phenylsulfoxide derivative within a duplex with the complementary strand. The validity of the new strategy has been proven by achievement of cytidine-selective cross-linking with remarkably efficiency.
Zhou, Yu; Pearson, John E; Auerbach, Anthony
2005-12-01
We derive the analytical form of a rate-equilibrium free-energy relationship (with slope Phi) for a bounded, linear chain of coupled reactions having arbitrary connecting rate constants. The results confirm previous simulation studies showing that Phi-values reflect the position of the perturbed reaction within the chain, with reactions occurring earlier in the sequence producing higher Phi-values than those occurring later in the sequence. The derivation includes an expression for the transmission coefficients of the overall reaction based on the rate constants of an arbitrary, discrete, finite Markov chain. The results indicate that experimental Phi-values can be used to calculate the relative heights of the energy barriers between intermediate states of the chain but provide no information about the energies of the wells along the reaction path. Application of the equations to the case of diliganded acetylcholine receptor channel gating suggests that the transition-state ensemble for this reaction is nearly flat. Although this mechanism accounts for many of the basic features of diliganded and unliganded acetylcholine receptor channel gating, the experimental rate-equilibrium free-energy relationships appear to be more linear than those predicted by the theory.
Improve the prediction of RNA-binding residues using structural neighbours.
Li, Quan; Cao, Zanxia; Liu, Haiyan
2010-03-01
The interactions between RNA-binding proteins (RBPs) with RNA play key roles in managing some of the cell's basic functions. The identification and prediction of RNA binding sites is important for understanding the RNA-binding mechanism. Computational approaches are being developed to predict RNA-binding residues based on the sequence- or structure-derived features. To achieve higher prediction accuracy, improvements on current prediction methods are necessary. We identified that the structural neighbors of RNA-binding and non-RNA-binding residues have different amino acid compositions. Combining this structure-derived feature with evolutionary (PSSM) and other structural information (secondary structure and solvent accessibility) significantly improves the predictions over existing methods. Using a multiple linear regression approach and 6-fold cross validation, our best model can achieve an overall correct rate of 87.8% and MCC of 0.47, with a specificity of 93.4%, correctly predict 52.4% of the RNA-binding residues for a dataset containing 107 non-homologous RNA-binding proteins. Compared with existing methods, including the amino acid compositions of structure neighbors lead to clearly improvement. A web server was developed for predicting RNA binding residues in a protein sequence (or structure),which is available at http://mcgill.3322.org/RNA/.
Phenomenological analysis of medical time series with regular and stochastic components
NASA Astrophysics Data System (ADS)
Timashev, Serge F.; Polyakov, Yuriy S.
2007-06-01
Flicker-Noise Spectroscopy (FNS), a general approach to the extraction and parameterization of resonant and stochastic components contained in medical time series, is presented. The basic idea of FNS is to treat the correlation links present in sequences of different irregularities, such as spikes, "jumps", and discontinuities in derivatives of different orders, on all levels of the spatiotemporal hierarchy of the system under study as main information carriers. The tools to extract and analyze the information are power spectra and difference moments (structural functions), which complement the information of each other. The structural function stochastic component is formed exclusively by "jumps" of the dynamic variable while the power spectrum stochastic component is formed by both spikes and "jumps" on every level of the hierarchy. The information "passport" characteristics that are determined by fitting the derived expressions to the experimental variations for the stochastic components of power spectra and structural functions are interpreted as the correlation times and parameters that describe the rate of "memory loss" on these correlation time intervals for different irregularities. The number of the extracted parameters is determined by the requirements of the problem under study. Application of this approach to the analysis of tremor velocity signals for a Parkinsonian patient is discussed.
Hu, Zhuang; Zhang, Tian; Gao, Xiao-Xiao; Wang, Yang; Zhang, Qiang; Zhou, Hui-Juan; Zhao, Gui-Fang; Wang, Ma-Li; Woeste, Keith E; Zhao, Peng
2016-04-01
Manchurian walnut (Juglans mandshurica Maxim.) is a vulnerable, temperate deciduous tree valued for its wood and nut, but transcriptomic and genomic data for the species are very limited. Next generation sequencing (NGS) has made it possible to develop molecular markers for this species rapidly and efficiently. Our goal is to use transcriptome information from RNA-Seq to understand development in J. mandshurica and develop polymorphic simple sequence repeats (SSRs, microsatellites) to understand the species' population genetics. In this study, more than 47.7 million clean reads were generated using Illumina sequencing technology. De novo assembly yielded 99,869 unigenes with an average length of 747 bp. Based on sequence similarity search with known proteins, a total of 39,708 (42.32 %) genes were identified. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) identified 15,903 (16.9 %) unigenes. Further, we identified and characterized 63 new transcriptome-derived microsatellite markers. By testing the markers on 4 to 14 individuals from four populations, we found that 20 were polymorphic and easily amplified. The number of alleles per locus ranged from 2 to 8. The observed and expected heterozygosity per locus ranged from 0.209 to 0.813 and 0.335 to 0.842, respectively. These twenty microsatellite markers will be useful for studies of population genetics, diversity, and genetic structure, and they will undoubtedly benefit future breeding studies of this walnut species. Moreover, the information uncovered in this research will also serve as a useful genetic resource for understanding the transcriptome and development of J. mandshurica and other Juglans species.
Image Motion Detection And Estimation: The Modified Spatio-Temporal Gradient Scheme
NASA Astrophysics Data System (ADS)
Hsin, Cheng-Ho; Inigo, Rafael M.
1990-03-01
The detection and estimation of motion are generally involved in computing a velocity field of time-varying images. A completely new modified spatio-temporal gradient scheme to determine motion is proposed. This is derived by using gradient methods and properties of biological vision. A set of general constraints is proposed to derive motion constraint equations. The constraints are that the second directional derivatives of image intensity at an edge point in the smoothed image will be constant at times t and t+L . This scheme basically has two stages: spatio-temporal filtering, and velocity estimation. Initially, image sequences are processed by a set of oriented spatio-temporal filters which are designed using a Gaussian derivative model. The velocity is then estimated for these filtered image sequences based on the gradient approach. From a computational stand point, this scheme offers at least three advantages over current methods. The greatest advantage of the modified spatio-temporal gradient scheme over the traditional ones is that an infinite number of motion constraint equations are derived instead of only one. Therefore, it solves the aperture problem without requiring any additional assumptions and is simply a local process. The second advantage is that because of the spatio-temporal filtering, the direct computation of image gradients (discrete derivatives) is avoided. Therefore the error in gradients measurement is reduced significantly. The third advantage is that during the processing of motion detection and estimation algorithm, image features (edges) are produced concurrently with motion information. The reliable range of detected velocity is determined by parameters of the oriented spatio-temporal filters. Knowing the velocity sensitivity of a single motion detection channel, a multiple-channel mechanism for estimating image velocity, seldom addressed by other motion schemes in machine vision, can be constructed by appropriately choosing and combining different sets of parameters. By applying this mechanism, a great range of velocity can be detected. The scheme has been tested for both synthetic and real images. The results of simulations are very satisfactory.
Intestinal flora of FAP patients containing APC-like sequences.
Hainova, K; Adamcikova, Z; Ciernikova, S; Stevurkova, V; Tyciakova, S; Zajac, V
2014-01-01
Colorectal cancer mortality is one of the most common cause of cancer-related mortality. A multiple risk factors are associated with colorectal cancer, including hereditary, enviromental and inflammatory syndromes affecting the gastrointestinal tract. Familial adenomatous polyposis (FAP) is characterized by the emergence of hundreds to thousands of colorectal adenomatous polyps and FAP syndrome is caused by mutations within the adenomatous polyposis coli (APC) tumor suppressor gene. We analyzed 21 rectal bacterial subclones isolated from FAP patient 41-1 with confirmed 5bp ACAAA deletion within codons 1060-1063 for the presence of APC-like sequences in longest exon 15. The studied section was defined by primers 15Efor-15Erev, what correlates with mutation cluster region (MCR) in which the 75% of all APC germline mutations were detected. More than 90% homology was showed by sequencing and subsequent software comparison. The expression of APC-like sequences was demostrated by Western blot analysis using monoclonal and polyclonal antibodies against APC protein. To study missing link between the DNA analysis (PCR, DNA sequencing) and protein expresion experiments (Western blotting) we analyzed bacterial transcripts containing the 15Efor-15Erev sequence of APC gene by reverse transcription-PCR, what indicated that an APC gene derived fragment may be produced. We observed 97-100 % homology after computer comparison of cDNA PCR products. Our results suggest that presence of APC-like sequences in intestinal/rectal bacteria is enrichment of bacterial genetic information in which horizontal gene transfer between humans and microflora play an important role.
Hsing, Michael; Cherkasov, Artem
2008-06-25
Insertions and deletions (indels) represent a common type of sequence variations, which are less studied and pose many important biological questions. Recent research has shown that the presence of sizable indels in protein sequences may be indicative of protein essentiality and their role in protein interaction networks. Examples of utilization of indels for structure-based drug design have also been recently demonstrated. Nonetheless many structural and functional characteristics of indels remain less researched or unknown. We have created a web-based resource, Indel PDB, representing a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank (PDB). Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites. Indel PDB contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing proteins. Unlike loop databases, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops. The insertion fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three dimensional structures. By utilizing the data available in Indel PDB, we have studied and presented here several sequence and structural features of indels. We anticipate that Indel PDB will not only enable future functional studies of indels, but will also assist protein modeling efforts and identification of indel-directed drug binding sites.
Naganeeswaran, Sudalaimuthu Asari; Subbian, Elain Apshara; Ramaswamy, Manimekalai
2012-01-01
Phytophthora megakarya, the causative agent of cacao black pod disease in West African countries causes an extensive loss of yield. In this study we have analyzed 4 libraries of ESTs derived from Phytophthora megakarya infected cocoa leaf and pod tissues. Totally 6379 redundant sequences were retrieved from ESTtik database and EST processing was performed using seqclean tool. Clustering and assembling using CAP3 generated 3333 non-redundant (907 contigs and 2426 singletons) sequences. The primary sequence analysis of 3333 non-redundant sequences showed that the GC percentage was 42.7 and the sequence length ranged from 101 - 2576 nucleotides. Further, functional analysis (Blast, Interproscan, Gene ontology and KEGG search) were executed and 1230 orthologous genes were annotated. Totally 272 enzymes corresponding to 114 metabolic pathways were identified. Functional annotation revealed that most of the sequences are related to molecular function, stress response and biological processes. The annotated enzymes are aldehyde dehydrogenase (E.C: 1.2.1.3), catalase (E.C: 1.11.1.6), acetyl-CoA C-acetyltransferase (E.C: 2.3.1.9), threonine ammonia-lyase (E.C: 4.3.1.19), acetolactate synthase (E.C: 2.2.1.6), O-methyltransferase (E.C: 2.1.1.68) which play an important role in amino acid biosynthesis and phenyl propanoid biosynthesis. All this information was stored in MySQL database management system to be used in future for reconstruction of biotic stress response pathway in cocoa.
The phylogeny of yellow fever virus 17D vaccines.
Stock, Nina K; Boschetti, Nicola; Herzog, Christian; Appelhans, Marc S; Niedrig, Matthias
2012-02-01
In recent years the safety of the yellow fever live vaccine 17D came under scrutiny. The focus was on serious adverse events after vaccinations that resemble a wild type infection with yellow fever and whose reasons are still not known. Also the exact mechanism of attenuation of the vaccine remains unknown to this day. In this context, the standards of safety and surveillance in vaccine production and administration have been discussed. Therein embodied was the demand for improved documentation of the derivation of the seed virus used for yellow fever vaccine production. So far, there was just a historical genealogy available that is based on source area and passage level. However, there is a need for a documentation based on molecular information to get better insights into the mechanisms of pathology. In this work we sequenced the whole genome of different passages of the YFV-17D strain used by Crucell Switzerland AG for vaccine production. Using all other publically available 17D full genome sequences we compared the sequence variance of all vaccine strains and oppose a phylogenetic tree based on full genome sequences to the historical genealogy. Copyright © 2011 Elsevier Ltd. All rights reserved.
Sequence determinants of improved CRISPR sgRNA design.
Xu, Han; Xiao, Tengfei; Chen, Chen-Hao; Li, Wei; Meyer, Clifford A; Wu, Qiu; Wu, Di; Cong, Le; Zhang, Feng; Liu, Jun S; Brown, Myles; Liu, X Shirley
2015-08-01
The CRISPR/Cas9 system has revolutionized mammalian somatic cell genetics. Genome-wide functional screens using CRISPR/Cas9-mediated knockout or dCas9 fusion-mediated inhibition/activation (CRISPRi/a) are powerful techniques for discovering phenotype-associated gene function. We systematically assessed the DNA sequence features that contribute to single guide RNA (sgRNA) efficiency in CRISPR-based screens. Leveraging the information from multiple designs, we derived a new sequence model for predicting sgRNA efficiency in CRISPR/Cas9 knockout experiments. Our model confirmed known features and suggested new features including a preference for cytosine at the cleavage site. The model was experimentally validated for sgRNA-mediated mutation rate and protein knockout efficiency. Tested on independent data sets, the model achieved significant results in both positive and negative selection conditions and outperformed existing models. We also found that the sequence preference for CRISPRi/a is substantially different from that for CRISPR/Cas9 knockout and propose a new model for predicting sgRNA efficiency in CRISPRi/a experiments. These results facilitate the genome-wide design of improved sgRNA for both knockout and CRISPRi/a studies. © 2015 Xu et al.; Published by Cold Spring Harbor Laboratory Press.
DMINDA: an integrated web server for DNA motif identification and analyses.
Ma, Qin; Zhang, Hanyuan; Mao, Xizeng; Zhou, Chuan; Liu, Bingqiang; Chen, Xin; Xu, Ying
2014-07-01
DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at http://csbl.bmb.uga.edu/DMINDA/. This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
High-throughput tetrad analysis.
Ludlow, Catherine L; Scott, Adrian C; Cromie, Gareth A; Jeffery, Eric W; Sirr, Amy; May, Patrick; Lin, Jake; Gilbert, Teresa L; Hays, Michelle; Dudley, Aimée M
2013-07-01
Tetrad analysis has been a gold-standard genetic technique for several decades. Unfortunately, the need to manually isolate, disrupt and space tetrads has relegated its application to small-scale studies and limited its integration with high-throughput DNA sequencing technologies. We have developed a rapid, high-throughput method, called barcode-enabled sequencing of tetrads (BEST), that uses (i) a meiosis-specific GFP fusion protein to isolate tetrads by FACS and (ii) molecular barcodes that are read during genotyping to identify spores derived from the same tetrad. Maintaining tetrad information allows accurate inference of missing genetic markers and full genotypes of missing (and presumably nonviable) individuals. An individual researcher was able to isolate over 3,000 yeast tetrads in 3 h, an output equivalent to that of almost 1 month of manual dissection. BEST is transferable to other microorganisms for which meiotic mapping is significantly more laborious.
Schmidt Am Busch, Marcel; Sedano, Audrey; Simonson, Thomas
2010-05-05
Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. WE EXPLORE THIS STRATEGY FOR FOUR SCOP FAMILIES: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences ressemble moderately-distant, natural homologues of the initial templates; e.g., the SUPERFAMILY, profile Hidden-Markov Model library recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
Amadoz, Alicia; González-Candelas, Fernando
2007-04-20
Most research scientists working in the fields of molecular epidemiology, population and evolutionary genetics are confronted with the management of large volumes of data. Moreover, the data used in studies of infectious diseases are complex and usually derive from different institutions such as hospitals or laboratories. Since no public database scheme incorporating clinical and epidemiological information about patients and molecular information about pathogens is currently available, we have developed an information system, composed by a main database and a web-based interface, which integrates both types of data and satisfies requirements of good organization, simple accessibility, data security and multi-user support. From the moment a patient arrives to a hospital or health centre until the processing and analysis of molecular sequences obtained from infectious pathogens in the laboratory, lots of information is collected from different sources. We have divided the most relevant data into 12 conceptual modules around which we have organized the database schema. Our schema is very complete and it covers many aspects of sample sources, samples, laboratory processes, molecular sequences, phylogenetics results, clinical tests and results, clinical information, treatments, pathogens, transmissions, outbreaks and bibliographic information. Communication between end-users and the selected Relational Database Management System (RDMS) is carried out by default through a command-line window or through a user-friendly, web-based interface which provides access and management tools for the data. epiPATH is an information system for managing clinical and molecular information from infectious diseases. It facilitates daily work related to infectious pathogens and sequences obtained from them. This software is intended for local installation in order to safeguard private data and provides advanced SQL-users the flexibility to adapt it to their needs. The database schema, tool scripts and web-based interface are free software but data stored in our database server are not publicly available. epiPATH is distributed under the terms of GNU General Public License. More details about epiPATH can be found at http://genevo.uv.es/epipath.
Bashir, Ali; Bansal, Vikas; Bafna, Vineet
2010-06-18
Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with their unique characteristics, pose a number of design challenges, regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimating mRNA transcript abundance. For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 Million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability. Together, our results form a generic framework for many design considerations related to high-throughput sequencing. We provide software tools http://bix.ucsd.edu/projects/NGS-DesignTools to derive platform independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
A SSR-based genetic linkage map of cultivated peanut (Arachis hypogaea L.)
USDA-ARS?s Scientific Manuscript database
The objective of this study was to construct a molecular linkage map of cultivated tetraploid peanut using simple sequence repeat (SSR) markers derived primarily from peanut genomic sequences, expressed sequence tags (ESTs), and by "data mining" sequences released in GenBank. Three recombinant inbre...
Gruszka, Damian; Marzec, Marek; Szarejko, Iwona
2012-06-14
The high level of conservation of genes that regulate DNA replication and repair indicates that they may serve as a source of information on the origin and evolution of the species and makes them a reliable system for the identification of cross-species homologs. Studies that had been conducted to date shed light on the processes of DNA replication and repair in bacteria, yeast and mammals. However, there is still much to be learned about the process of DNA damage repair in plants. These studies, which were conducted mainly using bioinformatics tools, enabled the list of genes that participate in various pathways of DNA repair in Arabidopsis thaliana (L.) Heynh to be outlined; however, information regarding these mechanisms in crop plants is still very limited. A similar, functional approach is particularly difficult for a species whose complete genomic sequences are still unavailable. One of the solutions is to apply ESTs (Expressed Sequence Tags) as the basis for gene identification. For the construction of the barley EST DNA Replication and Repair Database (bEST-DRRD), presented here, the Arabidopsis nucleotide and protein sequences involved in DNA replication and repair were used to browse for and retrieve the deposited sequences, derived from four barley (Hordeum vulgare L.) sequence databases, including the "Barley Genome version 0.05" database (encompassing ca. 90% of barley coding sequences) and from two databases covering the complete genomes of two monocot models: Oryza sativa L. and Brachypodium distachyon L. in order to identify homologous genes. Sequences of the categorised Arabidopsis queries are used for browsing the repositories, which are located on the ViroBLAST platform. The bEST-DRRD is currently used in our project during the identification and validation of the barley genes involved in DNA repair. The presented database provides information about the Arabidopsis genes involved in DNA replication and repair, their expression patterns and models of protein interactions. It was designed and established to provide an open-access tool for the identification of monocot homologs of known Arabidopsis genes that are responsible for DNA-related processes. The barley genes identified in the project are currently being analysed to validate their function.
Holst-Jensen, Arne; Spilsberg, Bjørn; Arulandhu, Alfred J; Kok, Esther; Shi, Jianxin; Zel, Jana
2016-07-01
The emergence of high-throughput, massive or next-generation sequencing technologies has created a completely new foundation for molecular analyses. Various selective enrichment processes are commonly applied to facilitate detection of predefined (known) targets. Such approaches, however, inevitably introduce a bias and are prone to miss unknown targets. Here we review the application of high-throughput sequencing technologies and the preparation of fit-for-purpose whole genome shotgun sequencing libraries for the detection and characterization of genetically modified and derived products. The potential impact of these new sequencing technologies for the characterization, breeding selection, risk assessment, and traceability of genetically modified organisms and genetically modified products is yet to be fully acknowledged. The published literature is reviewed, and the prospects for future developments and use of the new sequencing technologies for these purposes are discussed.
Dong, Hongjuan; Marchetti-Deschmann, Martina; Allmaier, Günter
2014-01-01
Traditionally characterization of microbial proteins is performed by a complex sequence of steps with the final step to be either Edman sequencing or mass spectrometry, which generally takes several weeks or months to be complete. In this work, we proposed a strategy for the characterization of tryptic peptides derived from Giberella zeae (anamorph: Fusarium graminearum) proteins in parallel to intact cell mass spectrometry (ICMS) in which no complicated and time-consuming steps were needed. Experimentally, after a simple washing treatment of the spores, the aliquots of the intact G. zeae macro conidia spores solution, were deposited two times onto one MALDI (matrix-assisted laser desorption ionization) mass spectrometry (MS) target (two spots). One spot was used for ICMS and the second spot was subject to a brief on-target digestion with bead-immobilized or non-immobilized trypsin. Subsequently, one spot was analyzed immediately by MALDI MS in the linear mode (ICMS) whereas the second spot containing the digested material was investigated by MALDI MS in the reflectron mode ("peptide mass fingerprint") followed by protonated peptide selection for MS/MS (post source decay (PSD) fragment ion) analysis. Based on the formed fragment ions of selected tryptic peptides a complete or partial amino acid sequence was generated by manual de novo sequencing. These sequence data were used for homology search for protein identification. Finally four different peptides of varying abundances have been identified successfully allowing the verification that our desorbed/ionized surface compounds were indeed derived from proteins. The presence of three different proteins could be found unambiguously. Interestingly, one of these proteins is belonging to the ribosomal superfamily which indicates that not only surface-associated proteins were digested. This strategy minimized the amount of time and labor required for obtaining deeper information on spore preparations within the nowadays widely used ICMS approach. Copyright © 2013 Elsevier Ltd. All rights reserved.
Min, Xiang Jia
2013-01-01
Expressed Sequence Tags (ESTs) are a rich resource for identifying Alternatively Splicing (AS) genes. The ASFinder webserver is designed to identify AS isoforms from EST-derived sequences. Two approaches are implemented in ASFinder. If no genomic sequences are provided, the server performs a local BLASTN to identify AS isoforms from ESTs having both ends aligned but an internal segment unaligned. Otherwise, ASFinder uses SIM4 to map ESTs to the genome, then the overlapping ESTs that are mapped to the same genomic locus and have internal variable exon/intron boundaries are identified as AS isoforms. The tool is available at http://proteomics.ysu.edu/tools/ASFinder.html.
Genomic Sequence Variation Markup Language (GSVML).
Nakaya, Jun; Kimura, Michio; Hiroi, Kaei; Ido, Keisuke; Yang, Woosung; Tanaka, Hiroshi
2010-02-01
With the aim of making good use of internationally accumulated genomic sequence variation data, which is increasing rapidly due to the explosive amount of genomic research at present, the development of an interoperable data exchange format and its international standardization are necessary. Genomic Sequence Variation Markup Language (GSVML) will focus on genomic sequence variation data and human health applications, such as gene based medicine or pharmacogenomics. We developed GSVML through eight steps, based on case analysis and domain investigations. By focusing on the design scope to human health applications and genomic sequence variation, we attempted to eliminate ambiguity and to ensure practicability. We intended to satisfy the requirements derived from the use case analysis of human-based clinical genomic applications. Based on database investigations, we attempted to minimize the redundancy of the data format, while maximizing the data covering range. We also attempted to ensure communication and interface ability with other Markup Languages, for exchange of omics data among various omics researchers or facilities. The interface ability with developing clinical standards, such as the Health Level Seven Genotype Information model, was analyzed. We developed the human health-oriented GSVML comprising variation data, direct annotation, and indirect annotation categories; the variation data category is required, while the direct and indirect annotation categories are optional. The annotation categories contain omics and clinical information, and have internal relationships. For designing, we examined 6 cases for three criteria as human health application and 15 data elements for three criteria as data formats for genomic sequence variation data exchange. The data format of five international SNP databases and six Markup Languages and the interface ability to the Health Level Seven Genotype Model in terms of 317 items were investigated. GSVML was developed as a potential data exchanging format for genomic sequence variation data exchange focusing on human health applications. The international standardization of GSVML is necessary, and is currently underway. GSVML can be applied to enhance the utilization of genomic sequence variation data worldwide by providing a communicable platform between clinical and research applications. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
Novel methods for parameter-based analysis of myocardial tissue in MR images
NASA Astrophysics Data System (ADS)
Hennemuth, A.; Behrens, S.; Kuehnel, C.; Oeltze, S.; Konrad, O.; Peitgen, H.-O.
2007-03-01
The analysis of myocardial tissue with contrast-enhanced MR yields multiple parameters, which can be used to classify the examined tissue. Perfusion images are often distorted by motion, while late enhancement images are acquired with a different size and resolution. Therefore, it is common to reduce the analysis to a visual inspection, or to the examination of parameters related to the 17-segment-model proposed by the American Heart Association (AHA). As this simplification comes along with a considerable loss of information, our purpose is to provide methods for a more accurate analysis regarding topological and functional tissue features. In order to achieve this, we implemented registration methods for the motion correction of the perfusion sequence and the matching of the late enhancement information onto the perfusion image and vice versa. For the motion corrected perfusion sequence, vector images containing the voxel enhancement curves' semi-quantitative parameters are derived. The resulting vector images are combined with the late enhancement information and form the basis for the tissue examination. For the exploration of data we propose different modes: the inspection of the enhancement curves and parameter distribution in areas automatically segmented using the late enhancement information, the inspection of regions segmented in parameter space by user defined threshold intervals and the topological comparison of regions segmented with different settings. Results showed a more accurate detection of distorted regions in comparison to the AHA-model-based evaluation.
Kim, Bo-Bae; Kim, Minji; Park, Yun-Hee; Ko, Youngkyung; Park, Jun-Beom
2017-06-01
Objective Next-generation sequencing was performed to evaluate the effects of short-term application of dexamethasone on human gingiva-derived mesenchymal stem cells. Methods Human gingiva-derived stem cells were treated with a final concentration of 10 -7 M dexamethasone and the same concentration of vehicle control. This was followed by mRNA sequencing and data analysis, gene ontology and pathway analysis, quantitative real-time polymerase chain reaction of mRNA, and western blot analysis of RUNX2 and β-catenin. Results In total, 26,364 mRNAs were differentially expressed. Comparison of the results of dexamethasone versus control at 2 hours revealed that 7 mRNAs were upregulated and 25 mRNAs were downregulated. The application of dexamethasone reduced the expression of RUNX2 and β-catenin in human gingiva-derived mesenchymal stem cells. Conclusion The effects of dexamethasone on stem cells were evaluated with mRNA sequencing, and validation of the expression was performed with qualitative real-time polymerase chain reaction and western blot analysis. The results of this study can provide new insights into the role of mRNA sequencing in maxillofacial areas.
Integration of Temporal and Ordinal Information During Serial Interception Sequence Learning
Gobel, Eric W.; Sanchez, Daniel J.; Reber, Paul J.
2011-01-01
The expression of expert motor skills typically involves learning to perform a precisely timed sequence of movements (e.g., language production, music performance, athletic skills). Research examining incidental sequence learning has previously relied on a perceptually-cued task that gives participants exposure to repeating motor sequences but does not require timing of responses for accuracy. Using a novel perceptual-motor sequence learning task, learning a precisely timed cued sequence of motor actions is shown to occur without explicit instruction. Participants learned a repeating sequence through practice and showed sequence-specific knowledge via a performance decrement when switched to an unfamiliar sequence. In a second experiment, the integration of representation of action order and timing sequence knowledge was examined. When either action order or timing sequence information was selectively disrupted, performance was reduced to levels similar to completely novel sequences. Unlike prior sequence-learning research that has found timing information to be secondary to learning action sequences, when the task demands require accurate action and timing information, an integrated representation of these types of information is acquired. These results provide the first evidence for incidental learning of fully integrated action and timing sequence information in the absence of an independent representation of action order, and suggest that this integrative mechanism may play a material role in the acquisition of complex motor skills. PMID:21417511
Pliaka, V; Dedepsidis, E; Kyriakopoulou, Z; Mpirli, K; Tsakogiannis, D; Pratti, A; Levidiotou-Stefanou, S; Markoulatos, P
2010-06-01
In the post-eradication era of wild polioviruses, the only remaining sources of poliovirus infection worldwide would be the vaccine-derived polioviruses (VDPVs). As the preponderance of countries certified to be polio-free has switched from OPV (oral poliovirus vaccine) to IPV (inactivated poliovirus vaccine), importation of recombinant evolved derivatives of vaccinal strains would have serious implication for public health. To test the robustness of the proposed RT-PCR screening analysis, eleven recombinant vaccine-derived polioviruses that were characterized previously by sequencing by our group, in addition to three recently identified recombinant environmental isolates were assayed. Although the most definitive characterization of VDPVs is by genomic sequencing, in this study we describe a new, inexpensive and broadly applicable RT-PCR assay for the identification of the predominant recombination types S3/Sx in 2C and S2/Sx in 3D genomic regions respectively of VDPVs, that can be readily implemented in laboratories lacking sequencing facilities as a first approach for the early detection of vaccine-derived poliovirus (VDPVs).
Witkiewicz, Agnieszka K; Balaji, Uthra; Eslinger, Cody; McMillan, Elizabeth; Conway, William; Posner, Bruce; Mills, Gordon B; O'Reilly, Eileen M; Knudsen, Erik S
2016-08-16
Pancreatic ductal adenocarcinoma (PDAC) harbors the worst prognosis of any common solid tumor, and multiple failed clinical trials indicate therapeutic recalcitrance. Here, we use exome sequencing of patient tumors and find multiple conserved genetic alterations. However, the majority of tumors exhibit no clearly defined therapeutic target. High-throughput drug screens using patient-derived cell lines found rare examples of sensitivity to monotherapy, with most models requiring combination therapy. Using PDX models, we confirmed the effectiveness and selectivity of the identified treatment responses. Out of more than 500 single and combination drug regimens tested, no single treatment was effective for the majority of PDAC tumors, and each case had unique sensitivity profiles that could not be predicted using genetic analyses. These data indicate a shortcoming of reliance on genetic analysis to predict efficacy of currently available agents against PDAC and suggest that sensitivity profiling of patient-derived models could inform personalized therapy design for PDAC. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
Characteristics of the Variable Star P Cygni Determined from Cluster Membership
NASA Technical Reports Server (NTRS)
Turner, David G.; Welch, Gary; Graham, Marianne; Fairweather, David; Horsford, Andrew; Seymour, Michael; Feibelman, Walter; Fisher, Richard (Technical Monitor)
2001-01-01
Empirical information on the luminosity, reddening, age, and mass of the variable B2 Oe supergiant P Cygni is derived from its assumed membership in the sparse anonymous cluster on which it is projected, as well as its association with the spatially adjacent cluster IC 4996, which forms a double cluster with the P Cyg cluster. Evidence for the high luminosity of P Cyg is confirmed by its derived absolute magnitude of M(sub V)= -8.46 +/- 0.03, which translates to log (L/L(sun)) = 5.54 +/- 0.02 for an effective temperature consistent with the star's derived space reddening (E(sub B-V) = 0.53 +/- 0.02). More surprising is an age for the associated clusters of 6 (+/- 1.5) x 10(exp 6) years, corresponding to a turnoff point mass of 25.1 (+/- 5.5) M(sun). By inference, P Cygni, as a post main-sequence object, should have a mass of no more than approximately 23-35 M(sun).
System for plotting subsoil structure and method therefor
NASA Technical Reports Server (NTRS)
Narasimhan, K. Y.; Nathan, R.; Parthasarathy, S. P. (Inventor)
1980-01-01
Data for use in producing a tomograph of subsoil structure between boreholes is derived by pacing spaced geophones in one borehole, on the Earth surface if desired, and by producing a sequence of shots at spaced apart locations in the other borehole. The signals, detected by each of the geophones from the various shots, are processed either on a time of arrival basis, or on the basis of signal amplitude, to provide information of the characteristics of a large number of incremental areas between the boreholes. Such information is useable to produce a tomograph of the subsoil structure between the boreholes. By processing signals of relatively high frequencies, e.g., up to 100 Hz, and by closely spacing the geophones, a high resolution tomograph can be produced.
Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P
1988-02-01
Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators.
Antalis, T M; Clark, M A; Barnes, T; Lehrbach, P R; Devine, P L; Schevzov, G; Goss, N H; Stephens, R W; Tolstoshev, P
1988-01-01
Human monocyte-derived plasminogen activator inhibitor (mPAI-2) was purified to homogeneity from the U937 cell line and partially sequenced. Oligonucleotide probes derived from this sequence were used to screen a cDNA library prepared from U937 cells. One positive clone was sequenced and contained most of the coding sequence as well as a long incomplete 3' untranslated region (1112 base pairs). This cDNA sequence was shown to encode mPAI-2 by hybrid-select translation. A cDNA clone encoding the remainder of the mPAI-2 mRNA was obtained by primer extension of U937 poly(A)+ RNA using a probe complementary to the mPAI-2 coding region. The coding sequence for mPAI-2 was placed under the control of the lambda PL promoter, and the protein expressed in Escherichia coli formed a complex with urokinase that could be detected immunologically. By nucleotide sequence analysis, mPAI-2 cDNA encodes a protein containing 415 amino acids with a predicted unglycosylated Mr of 46,543. The predicted amino acid sequence of mPAI-2 is very similar to placental PAI-2 (3 amino acid differences) and shows extensive homology with members of the serine protease inhibitor (serpin) superfamily. mPAI-2 was found to be more homologous to ovalbumin (37%) than the endothelial plasminogen activator inhibitor, PAI-1 (26%). Like ovalbumin, mPAI-2 appears to have no typical amino-terminal signal sequence. The 3' untranslated region of the mPAI-2 cDNA contains a putative regulatory sequence that has been associated with the inflammatory mediators. Images PMID:3257578
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T.; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J.; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A.; Lempicki, Richard A.; Huang, Da Wei
2013-01-01
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20-kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 prior known, closely related DNA amplicons were sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM based multi-parameter QC method. In addition, a De Novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post error-corrected it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results. PMID:24179701
A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS.
Jiao, Xiaoli; Zheng, Xin; Ma, Liang; Kutty, Geetha; Gogineni, Emile; Sun, Qiang; Sherman, Brad T; Hu, Xiaojun; Jones, Kristine; Raley, Castle; Tran, Bao; Munroe, David J; Stephens, Robert; Liang, Dun; Imamichi, Tomozumi; Kovacs, Joseph A; Lempicki, Richard A; Huang, Da Wei
2013-07-31
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20-kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 prior known, closely related DNA amplicons were sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM based multi-parameter QC method. In addition, a De Novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post error-corrected it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results.
Kweon, Chang-Hee; Nguyen, Lien Thi Kim; Yoo, Mi-Sun; Kang, Seung-Won
2015-09-15
Porcine circovirus type 2 (PCV2) is the causative agent of post-weaning multisystemic wasting syndrome (PMWS) in swine. Here, a phylogenetic tree was constructed using PCV2 nucleotide sequences derived from the bone marrow of Korean boar and previously reported PCV2 sequences isolated from various countries. PCV2 from Korean boar bone marrow (KC188796) was classified into the group containing PCV2a-Canada and other PCV2 strain from Korea. While the ORF1 region of the PCV2 genome was highly conserved, ORF2 (the capsid protein coding region) was relatively variable. The nucleotide sequences for bone marrow-derived PCV2 were 93.4-99.0% homologous to the other reference sequences. The deduced amino acid sequences for the ORF1 and ORF2 coding regions were 97.4-99.3% and 84.5-97.4% homologous with the other reference strains, respectively, indicating that KC188796 did not differ markedly from the other PCV2 strains. Phylogenetic analysis demonstrated that bone marrow-derived PCV2 was highly similar to PCV2a from Canada and may be related to persistent PCV2 infections in swine. Copyright © 2015 Elsevier B.V. All rights reserved.
MitoNuc: a database of nuclear genes coding for mitochondrial proteins. Update 2002.
Attimonelli, Marcella; Catalano, Domenico; Gissi, Carmela; Grillo, Giorgio; Licciulli, Flavio; Liuni, Sabino; Santamaria, Monica; Pesole, Graziano; Saccone, Cecilia
2002-01-01
Mitochondria, besides their central role in energy metabolism, have recently been found to be involved in a number of basic processes of cell life and to contribute to the pathogenesis of many degenerative diseases. All functions of mitochondria depend on the interaction of nuclear and organelle genomes. Mitochondrial genomes have been extensively sequenced and analysed and data have been collected in several specialised databases. In order to collect information on nuclear coded mitochondrial proteins we developed MitoNuc, a database containing detailed information on sequenced nuclear genes coding for mitochondrial proteins in Metazoa. The MitoNuc database can be retrieved through SRS and is available via the web site http://bighost.area.ba.cnr.it/mitochondriome where other mitochondrial databases developed by our group, the complete list of the sequenced mitochondrial genomes, links to other mitochondrial sites and related information, are available. The MitoAln database, related to MitoNuc in the previous release, reporting the multiple alignments of the relevant homologous protein coding regions, is no longer supported in the present release. In order to keep the links among entries in MitoNuc from homologous proteins, a new field in the database has been defined: the cluster identifier, an alpha numeric code used to identify each cluster of homologous proteins. A comment field derived from the corresponding SWISS-PROT entry has been introduced; this reports clinical data related to dysfunction of the protein. The logic scheme of MitoNuc database has been implemented in the ORACLE DBMS. This will allow the end-users to retrieve data through a friendly interface that will be soon implemented.
Human retroviruses and AIDS 1997
DOE Office of Scientific and Technical Information (OSTI.GOV)
Korber, B.; Foley, B.; Leitner, T.
1997-12-01
This compendium is the result of an effort to compile, organize, and rapidly publish as much relevant molecular data concerning the human immunodeficiency viruses (HIV) and related retroviruses as possible. The scope of the compendium and database is best summarized by the four parts that it comprises: (1) Nucleic Acid Alignments, (2) Amino Acid Alignments, (3) Reviews and Analyses, and (4) Related Sequences. Information within all the parts is updated throughout the year on the Web site, http://hiv-web.lanl.gov. This year we are not including floppy diskettes as the entire compendium is available both at our Web site and at ourmore » ftp site. If you need floppy diskettes please contact either Bette Korber (btk@t10.lanl.gov) or Kersti Rock (karm@t10.lanl.gov) by email or fax ((505) 665-4453). While this publication could take the form of a review or sequence monograph, it is not so conceived. Instead, the literature from which the database is derived has simply been summarized and some elementary computational analyses have been performed upon the data. Interpretation and commentary have been avoided insofar as possible so that the reader can form his or her own judgments concerning the complex information. The exception to this are reviews submitted by experts in areas deemed of particular and basic importance to research involving AIDS viral sequence information. These are included in Part III, and are contributed by scientists with particular expertise in the area of interest. In addition to the general descriptions below of the parts of the compendium, the user should read the individual introductions for each part.« less
Gupta, S K; Gopalakrishna, T
2010-07-01
Unigene sequences available in public databases provide a cost-effective and valuable source for the development of molecular markers. In this study, the identification and development of unigene-based SSR markers in cowpea (Vigna unguiculata (L.) Walp.) is presented. A total of 1071 SSRs were identified in 15 740 cowpea unigene sequences downloaded from the National Center for Biotechnology Information. The most frequent SSR motifs present in the unigenes were trinucleotides (59.7%), followed by dinucleotides (34.8%), pentanucleotides (4%), and tetranucleotides (1.5%). The copy number varied from 6 to 33 for dinucleotide, 5 to 29 for trinucleotide, 5 to 7 for tetranucleotide, and 4 to 6 for pentanucleotide repeats. Primer pairs were successfully designed for 803 SSR motifs and 102 SSR markers were finally characterized and validated. Putative function was assigned to 64.7% of the unigene SSR markers based on significant homology to reported proteins. About 31.7% of the SSRs were present in coding sequences and 68.3% in untranslated regions of the genes. About 87% of the SSRs located in the coding sequences were trinucleotide repeats. Allelic variation at 32 SSR loci produced 98 alleles in 20 cowpea genotypes. The polymorphic information content for the SSR markers varied from 0.10 to 0.83 with an average of 0.53. These unigene SSR markers showed a high rate of transferability (88%) across other Vigna species, thereby expanding their utility. Alignment of unigene sequences with soybean genomic sequences revealed the presence of introns in amplified products of some of the SSR markers. This study presents the distribution of SSRs in the expressed portion of the cowpea genome and is the first report of the development of functional unigene-based SSR markers in cowpea. These SSR markers would play an important role in molecular mapping, comparative genomics, and marker-assisted selection strategies in cowpea and other Vigna species.
Sitt, Tatjana; Pelle, Roger; Chepkwony, Maurine; Morrison, W Ivan; Toye, Philip
2018-05-06
The extent of sequence diversity among the genes encoding 10 antigens (Tp1-10) known to be recognized by CD8+ T lymphocytes from cattle immune to Theileria parva was analysed. The sequences were derived from parasites in 23 buffalo-derived cell lines, three cattle-derived isolates and one cloned cell line obtained from a buffalo-derived stabilate. The results revealed substantial variation among the antigens through sequence diversity. The greatest nucleotide and amino acid diversity were observed in Tp1, Tp2 and Tp9. Tp5 and Tp7 showed the least amount of allelic diversity, and Tp5, Tp6 and Tp7 had the lowest levels of protein diversity. Tp6 was the most conserved protein; only a single non-synonymous substitution was found in all obtained sequences. The ratio of non-synonymous: synonymous substitutions varied from 0.84 (Tp1) to 0.04 (Tp6). Apart from Tp2 and Tp9, we observed no variation in the other defined CD8+ T cell epitopes (Tp4, 5, 7 and 8), indicating that epitope variation is not a universal feature of T. parva antigens. In addition to providing markers that can be used to examine the diversity in T. parva populations, the results highlight the potential for using conserved antigens to develop vaccines that provide broad protection against T. parva.
Isobe, Sachiko N.; Hirakawa, Hideki; Sato, Shusei; Maeda, Fumi; Ishikawa, Masami; Mori, Toshiki; Yamamoto, Yuko; Shirasawa, Kenta; Kimura, Mitsuhiro; Fukami, Masanobu; Hashizume, Fujio; Tsuji, Tomoko; Sasamoto, Shigemi; Kato, Midori; Nanri, Keiko; Tsuruoka, Hisano; Minami, Chiharu; Takahashi, Chika; Wada, Tsuyuko; Ono, Akiko; Kawashima, Kumiko; Nakazaki, Naomi; Kishida, Yoshie; Kohara, Mitsuyo; Nakayama, Shinobu; Yamada, Manabu; Fujishiro, Tsunakazu; Watanabe, Akiko; Tabata, Satoshi
2013-01-01
The cultivated strawberry (Fragaria× ananassa) is an octoploid (2n = 8x = 56) of the Rosaceae family whose genomic architecture is still controversial. Several recent studies support the AAA′A′BBB′B′ model, but its complexity has hindered genetic and genomic analysis of this important crop. To overcome this difficulty and to assist genome-wide analysis of F. × ananassa, we constructed an integrated linkage map by organizing a total of 4474 of simple sequence repeat (SSR) markers collected from published Fragaria sequences, including 3746 SSR markers [Fragaria vesca expressed sequence tag (EST)-derived SSR markers] derived from F. vesca ESTs, 603 markers (F. × ananassa EST-derived SSR markers) from F. × ananassa ESTs, and 125 markers (F. × ananassa transcriptome-derived SSR markers) from F. × ananassa transcripts. Along with the previously published SSR markers, these markers were mapped onto five parent-specific linkage maps derived from three mapping populations, which were then assembled into an integrated linkage map. The constructed map consists of 1856 loci in 28 linkage groups (LGs) that total 2364.1 cM in length. Macrosynteny at the chromosome level was observed between the LGs of F. × ananassa and the genome of F. vesca. Variety distinction on 129 F. × ananassa lines was demonstrated using 45 selected SSR markers. PMID:23248204
Comparison of Dixon Sequences for Estimation of Percent Breast Fibroglandular Tissue
Ledger, Araminta E. W.; Scurr, Erica D.; Hughes, Julie; Macdonald, Alison; Wallace, Toni; Thomas, Karen; Wilson, Robin; Leach, Martin O.; Schmidt, Maria A.
2016-01-01
Objectives To evaluate sources of error in the Magnetic Resonance Imaging (MRI) measurement of percent fibroglandular tissue (%FGT) using two-point Dixon sequences for fat-water separation. Methods Ten female volunteers (median age: 31 yrs, range: 23–50 yrs) gave informed consent following Research Ethics Committee approval. Each volunteer was scanned twice following repositioning to enable an estimation of measurement repeatability from high-resolution gradient-echo (GRE) proton-density (PD)-weighted Dixon sequences. Differences in measures of %FGT attributable to resolution, T1 weighting and sequence type were assessed by comparison of this Dixon sequence with low-resolution GRE PD-weighted Dixon data, and against gradient-echo (GRE) or spin-echo (SE) based T1-weighted Dixon datasets, respectively. Results %FGT measurement from high-resolution PD-weighted Dixon sequences had a coefficient of repeatability of ±4.3%. There was no significant difference in %FGT between high-resolution and low-resolution PD-weighted data. Values of %FGT from GRE and SE T1-weighted data were strongly correlated with that derived from PD-weighted data (r = 0.995 and 0.96, respectively). However, both sequences exhibited higher mean %FGT by 2.9% (p < 0.0001) and 12.6% (p < 0.0001), respectively, in comparison with PD-weighted data; the increase in %FGT from the SE T1-weighted sequence was significantly larger at lower breast densities. Conclusion Although measurement of %FGT at low resolution is feasible, T1 weighting and sequence type impact on the accuracy of Dixon-based %FGT measurements; Dixon MRI protocols for %FGT measurement should be carefully considered, particularly for longitudinal or multi-centre studies. PMID:27011312
Feedback Power Control Strategies in Wireless Sensor Networks with Joint Channel Decoding
Abrardo, Andrea; Ferrari, Gianluigi; Martalò, Marco; Perna, Fabio
2009-01-01
In this paper, we derive feedback power control strategies for block-faded multiple access schemes with correlated sources and joint channel decoding (JCD). In particular, upon the derivation of the feasible signal-to-noise ratio (SNR) region for the considered multiple access schemes, i.e., the multidimensional SNR region where error-free communications are, in principle, possible, two feedback power control strategies are proposed: (i) a classical feedback power control strategy, which aims at equalizing all link SNRs at the access point (AP), and (ii) an innovative optimized feedback power control strategy, which tries to make the network operational point fall in the feasible SNR region at the lowest overall transmit energy consumption. These strategies will be referred to as “balanced SNR” and “unbalanced SNR,” respectively. While they require, in principle, an unlimited power control range at the sources, we also propose practical versions with a limited power control range. We preliminary consider a scenario with orthogonal links and ideal feedback. Then, we analyze the robustness of the proposed power control strategies to possible non-idealities, in terms of residual multiple access interference and noisy feedback channels. Finally, we successfully apply the proposed feedback power control strategies to a limiting case of the class of considered multiple access schemes, namely a central estimating officer (CEO) scenario, where the sensors observe noisy versions of a common binary information sequence and the AP's goal is to estimate this sequence by properly fusing the soft-output information output by the JCD algorithm. PMID:22291536
NASA Astrophysics Data System (ADS)
Andreasen, Daniel; Edmund, Jens M.; Zografos, Vasileios; Menze, Bjoern H.; Van Leemput, Koen
2016-03-01
In radiotherapy treatment planning that is only based on magnetic resonance imaging (MRI), the electron density information usually obtained from computed tomography (CT) must be derived from the MRI by synthesizing a so-called pseudo CT (pCT). This is a non-trivial task since MRI intensities are neither uniquely nor quantitatively related to electron density. Typical approaches involve either a classification or regression model requiring specialized MRI sequences to solve intensity ambiguities, or an atlas-based model necessitating multiple registrations between atlases and subject scans. In this work, we explore a machine learning approach for creating a pCT of the pelvic region from conventional MRI sequences without using atlases. We use a random forest provided with information about local texture, edges and spatial features derived from the MRI. This helps to solve intensity ambiguities. Furthermore, we use the concept of auto-context by sequentially training a number of classification forests to create and improve context features, which are finally used to train a regression forest for pCT prediction. We evaluate the pCT quality in terms of the voxel-wise error and the radiologic accuracy as measured by water-equivalent path lengths. We compare the performance of our method against two baseline pCT strategies, which either set all MRI voxels in the subject equal to the CT value of water, or in addition transfer the bone volume from the real CT. We show an improved performance compared to both baseline pCTs suggesting that our method may be useful for MRI-only radiotherapy.
Studies on Monitoring and Tracking Genetic Resources: An Executive Summary
Garrity, George M.; Thompson, Lorraine M.; Ussery, David W.; Paskin, Norman; Baker, Dwight; Desmeth, Philippe; Schindel, D.E.; Ong, P.S.
2009-01-01
The principles underlying fair and equitable sharing of benefits derived from the utilization of genetic resources are set out in Article 15 of the UN Convention on Biological Diversity, which stipulate that access to genetic resources is subject to the prior informed consent of the country where such resources are located and to mutually agreed terms regarding the sharing of benefits that could be derived from such access. One issue of particular concern for provider countries is how to monitor and track genetic resources once they have left the provider country and enter into use in a variety of forms. This report was commissioned to provide a detailed review of advances in DNA sequencing technologies, as those methods apply to identification of genetic resources, and the use of globally unique persistent identifiers for persistently linking to data and other forms of digital documentation that is linked to individual genetic resources. While the report was written for an audience with a mixture of technical, legal, and policy backgrounds it is relevant to the genomics community as it is an example of downstream application of genomics information. PMID:21304641
Angelastro, James M.; Klimaschewski, Lars; Tang, Song; Vitolo, Ottavio V.; Weissman, Tamily A.; Donlin, Laura T.; Shelanski, Michael L.; Greene, Lloyd A.
2000-01-01
Neurotrophic factors such as nerve growth factor (NGF) promote a wide variety of responses in neurons, including differentiation, survival, plasticity, and repair. Such actions often require changes in gene expression. To identify the regulated genes and thereby to more fully understand the NGF mechanism, we carried out serial analysis of gene expression (SAGE) profiling of transcripts derived from rat PC12 cells before and after NGF-promoted neuronal differentiation. Multiple criteria supported the reliability of the profile. Approximately 157,000 SAGE tags were analyzed, representing at least 21,000 unique transcripts. Of these, nearly 800 were regulated by 6-fold or more in response to NGF. Approximately 150 of the regulated transcripts have been matched to named genes, the majority of which were not previously known to be NGF-responsive. Functional categorization of the regulated genes provides insight into the complex, integrated mechanism by which NGF promotes its multiple actions. It is anticipated that as genomic sequence information accrues the data derived here will continue to provide information about neurotrophic factor mechanisms. PMID:10984536
Rana, Niki; Cultrara, Christopher; Phillips, Mariana; Sabatino, David
2017-09-01
In the search for more potent peptide-based anti-cancer conjugates the generation of new, functionally diverse nucleolipid derived D-(KLAKLAK) 2 -AK sequences has enabled a structure and anti-cancer activity relationship study. A reductive amination approach was key for the synthesis of alkylamine, diamine and polyamine derived nucleolipids as well as those incorporating heterocyclic functionality. The carboxy-derived nucleolipids were then coupled to the C-terminus of the D-(KLAKLAK) 2 -AK killer peptide sequence and produced with and without the FITC fluorophore for investigating biological activity in cancer cells. The amphiphilic, α-helical peptide-nucleolipid bioconjugates were found to exhibit variable effects on the viability of MM.1S cells, with the histamine derived nucleolipid peptide bioconjugate displaying the most significant anti-cancer effects. Thus, functionally diverse nucleolipids have been developed to fine-tune the structure and anti-cancer properties of killer peptide sequences, such as D-(KLAKLAK) 2 -AK. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ashrafi, Hamid; Hill, Theresa; Stoffel, Kevin; Kozik, Alexander; Yao, Jiqiang; Chin-Wo, Sebastian Reyes; Van Deynze, Allen
2012-10-30
Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes. Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80-120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins. Before availability of pepper genome sequence, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of Sanger-EST and Illumina transcriptome assemblies. These and other information have been curated in a database that we have dedicated for pepper project.
Modahl, Cassandra M.; Mackessy, Stephen P.
2016-01-01
Envenomation of humans by snakes is a complex and continuously evolving medical emergency, and treatment is made that much more difficult by the diverse biochemical composition of many venoms. Venomous snakes and their venoms also provide models for the study of molecular evolutionary processes leading to adaptation and genotype-phenotype relationships. To compare venom complexity and protein sequences, venom gland transcriptomes are assembled, which usually requires the sacrifice of snakes for tissue. However, toxin transcripts are also present in venoms, offering the possibility of obtaining cDNA sequences directly from venom. This study provides evidence that unknown full-length venom protein transcripts can be obtained from the venoms of multiple species from all major venomous snake families. These unknown venom protein cDNAs are obtained by the use of primers designed from conserved signal peptide sequences within each venom protein superfamily. This technique was used to assemble a partial venom gland transcriptome for the Middle American Rattlesnake (Crotalus simus tzabcan) by amplifying sequences for phospholipases A2, serine proteases, C-lectins, and metalloproteinases from within venom. Phospholipase A2 sequences were also recovered from the venoms of several rattlesnakes and an elapid snake (Pseudechis porphyriacus), and three-finger toxin sequences were recovered from multiple rear-fanged snake species, demonstrating that the three major clades of advanced snakes (Elapidae, Viperidae, Colubridae) have stable mRNA present in their venoms. These cDNA sequences from venom were then used to explore potential activities derived from protein sequence similarities and evolutionary histories within these large multigene superfamilies. Venom-derived sequences can also be used to aid in characterizing venoms that lack proteomic profiles and identify sequence characteristics indicating specific envenomation profiles. This approach, requiring only venom, provides access to cDNA sequences in the absence of living specimens, even from commercial venom sources, to evaluate important regional differences in venom composition and to study snake venom protein evolution. PMID:27280639
2010-01-01
Background Previous reports have shown that peptides derived from the apolipoprotein E receptor binding region and the amphipathic α-helical domains of apolipoprotein AI have broad anti-infective activity and antiviral activity respectively. Lipoproteins and viruses share a similar cell biological niche, being of overlapping size and displaying similar interactions with mammalian cells and receptors, which may have led to other antiviral sequences arising within apolipoproteins, in addition to those previously reported. We therefore designed a series of peptides based around either apolipoprotein receptor binding regions, or amphipathic α-helical domains, and tested these for antiviral and antibacterial activity. Results Of the nineteen new peptides tested, seven showed some anti-infective activity, with two of these being derived from two apolipoproteins not previously used to derive anti-infective sequences. Apolipoprotein J (151-170) - based on a predicted amphipathic alpha-helical domain from apolipoprotein J - had measurable anti-HSV1 activity, as did apolipoprotein B (3359-3367) dp (apoBdp), the latter being derived from the LDL receptor binding domain B of apolipoprotein B. The more active peptide - apoBdp - showed similarity to the previously reported apoE derived anti-infective peptide, and further modification of the apoBdp sequence to align the charge distribution more closely to that of apoEdp or to introduce aromatic residues resulted in increased breadth and potency of activity. The most active peptide of this type showed similar potent anti-HIV activity, comparable to that we previously reported for the apoE derived peptide apoEdpL-W. Conclusions These data suggest that further antimicrobial peptides may be obtained using human apolipoprotein sequences, selecting regions with either amphipathic α-helical structure, or those linked to receptor-binding regions. The finding that an amphipathic α-helical region of apolipoprotein J has antiviral activity comparable with that for the previously reported apolipoprotein AI derived peptide 18A, suggests that full-length apolipoprotein J may also have such activity, as has been reported for full-length apolipoprotein AI. Although the strength of the anti-infective activity of the sequences identified was limited, this could be increased substantially by developing related mutant peptides. Indeed the apolipoprotein B-derived peptide mutants uncovered by the present study may have utility as HIV therapeutics or microbicides. PMID:20298574
Jiang, Rui ; Yang, Hua ; Zhou, Linqi ; Kuo, C.-C. Jay ; Sun, Fengzhu ; Chen, Ting
2007-01-01
The increasing demand for the identification of genetic variation responsible for common diseases has translated into a need for sophisticated methods for effectively prioritizing mutations occurring in disease-associated genetic regions. In this article, we prioritize candidate nonsynonymous single-nucleotide polymorphisms (nsSNPs) through a bioinformatics approach that takes advantages of a set of improved numeric features derived from protein-sequence information and a new statistical learning model called “multiple selection rule voting” (MSRV). The sequence-based features can maximize the scope of applications of our approach, and the MSRV model can capture subtle characteristics of individual mutations. Systematic validation of the approach demonstrates that this approach is capable of prioritizing causal mutations for both simple monogenic diseases and complex polygenic diseases. Further studies of familial Alzheimer diseases and diabetes show that the approach can enrich mutations underlying these polygenic diseases among the top of candidate mutations. Application of this approach to unclassified mutations suggests that there are 10 suspicious mutations likely to cause diseases, and there is strong support for this in the literature. PMID:17668383
On the error probability of general tree and trellis codes with applications to sequential decoding
NASA Technical Reports Server (NTRS)
Johannesson, R.
1973-01-01
An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random binary tree codes is derived and shown to be independent of the length of the tree. An upper bound on the average error probability for maximum-likelihood decoding of the ensemble of random L-branch binary trellis codes of rate R = 1/n is derived which separates the effects of the tail length T and the memory length M of the code. It is shown that the bound is independent of the length L of the information sequence. This implication is investigated by computer simulations of sequential decoding utilizing the stack algorithm. These simulations confirm the implication and further suggest an empirical formula for the true undetected decoding error probability with sequential decoding.
Hosseinkhani, Hossein; Tabata, Yasuhiko
2003-01-09
The objective of this study is to investigate the efficiency of a non-viral gene carrier with RGD sequences, Pronectin F(+) for gene transfection. The Pronectin F(+) was cationized by introducing ethylenediamine (Ed), spermidine (Sd), and spermine (Sm) to the hydroxyl groups while the corresponding gelatin derivative was prepared similarly because gelatin also has one RGD sequence per molecule. The zeta potential and molecular size of Pronectin F(+) and gelatin derivatives were examined before and after polyion complexation with a plasmid DNA of luciferase. When complexed with the plasmid DNA at the Pronectin F(+)/plasmid DNA mixing ratio of 50, the complex exhibited a zeta potential of about 10 mV, which is similar to that of the gelatin derivative-plasmid DNA complex. Irrespective of the type of Pronectin F(+) and gelatin derivatives, their complexation enabled the apparent molecular size of plasmid DNA to reduce to about 200 nm, the size decreasing with the increased derivative/plasmid DNA weight mixing ratio. The rat gastric mucosal (RGM)-1 cells treated with both complexes exhibited significantly stronger luciferase activities than free plasmid DNA although the enhanced extent was significant for the Sm derivative compared with the corresponding Ed and Sd derivatives. Cell attachment was enhanced by the Pronectin F(+) derivative to a significant high extent compared with the gelatin derivative. The amount of plasmid DNA internalized into the cells was enhanced by the complexation with every Pronectin F(+) derivative compared with the gelatin derivative. For both of Pronectin F(+) and gelatin carriers, the buffering capacity of Sm derivatives was higher than that of Ed and Sd derivatives and comparable to that of polyethyleneimine. It is likely that the high efficiency of gene transfection for the Sm derivative is due to the superior buffering effect. We conclude that the Sm derivative of Pronectin F(+) is promising as a non-viral vector of gene transfection.
A high-speed on-chip pseudo-random binary sequence generator for multi-tone phase calibration
NASA Astrophysics Data System (ADS)
Gommé, Liesbeth; Vandersteen, Gerd; Rolain, Yves
2011-07-01
An on-chip reference generator is conceived by adopting the technique of decimating a pseudo-random binary sequence (PRBS) signal in parallel sequences. This is of great benefit when high-speed generation of PRBS and PRBS-derived signals is the objective. The design implemented standard CMOS logic is available in commercial libraries to provide the logic functions for the generator. The design allows the user to select the periodicity of the PRBS and the PRBS-derived signals. The characterization of the on-chip generator marks its performance and reveals promising specifications.
Foltz, T M; Welsh, B M
1999-01-01
This paper uses the fact that the discrete Fourier transform diagonalizes a circulant matrix to provide an alternate derivation of the symmetric convolution-multiplication property for discrete trigonometric transforms. Derived in this manner, the symmetric convolution-multiplication property extends easily to multiple dimensions using the notion of block circulant matrices and generalizes to multidimensional asymmetric sequences. The symmetric convolution of multidimensional asymmetric sequences can then be accomplished by taking the product of the trigonometric transforms of the sequences and then applying an inverse trigonometric transform to the result. An example is given of how this theory can be used for applying a two-dimensional (2-D) finite impulse response (FIR) filter with nonlinear phase which models atmospheric turbulence.
Colombo, M M; Swanton, M T; Donini, P; Prescott, D M
1984-01-01
Oxytricha nova is a hypotrichous ciliate with micronuclei and macronuclei. Micronuclei, which contain large, chromosomal-sized DNA, are genetically inert but undergo meiosis and exchange during cell mating. Macronuclei, which contain only small, gene-sized DNA molecules, provide all of the nuclear RNA needed to run the cell. After cell mating the macronucleus is derived from a micronucleus, a derivation that includes excision of the genes from chromosomes and elimination of the remaining DNA. The eliminated DNA includes all of the repetitious sequences and approximately 95% of the unique sequences. We cloned large restriction fragments from the micronucleus that confer replication ability on a replication-deficient plasmid in Saccharomyces cerevisiae. Sequences that confer replication ability are called autonomously replicating sequences. The frequency and effectiveness of autonomously replicating sequences in micronuclear DNA are similar to those reported for DNAs of other organisms introduced into yeast cells. Of the 12 micronuclear fragments with autonomously replicating sequence activity, 9 also showed homology to macronuclear DNA, indicating that they contain a macronuclear gene sequence. We conclude from this that autonomously replicating sequence activity is nonrandomly distributed throughout micronuclear DNA and is preferentially associated with those regions of micronuclear DNA that contain genes. Images PMID:6092934
Revising Star and Planet Formation Timescales
NASA Astrophysics Data System (ADS)
Bell, Cameron P. M.; Naylor, Tim; Mayne, N. J.; Jeffries, R. D.; Littlefair, S. P.
2013-07-01
We have derived ages for 13 young (<30 Myr) star-forming regions and find that they are up to a factor of 2 older than the ages typically adopted in the literature. This result has wide-ranging implications, including that circumstellar discs survive longer (≃ 10-12 Myr) and that the average Class I lifetime is greater (≃1 Myr) than currently believed. For each star-forming region, we derived two ages from colour-magnitude diagrams. First, we fitted models of the evolution between the zero-age main sequence and terminal-age main sequence to derive a homogeneous set of main-sequence ages, distances and reddenings with statistically meaningful uncertainties. Our second age for each star-forming region was derived by fitting pre-main-sequence stars to new semi-empirical model isochrones. For the first time (for a set of clusters younger than 50 Myr), we find broad agreement between these two ages, and since these are derived from two distinct mass regimes that rely on different aspects of stellar physics, it gives us confidence in the new age scale. This agreement is largely due to our adoption of empirical colour-Teff relations and bolometric corrections for pre-main-sequence stars cooler than 4000 K. The revised ages for the star-forming regions in our sample are: 2 Myr for NGC 6611 (Eagle Nebula; M 16), IC 5146 (Cocoon Nebula), NGC 6530 (Lagoon Nebula; M 8) and NGC 2244 (Rosette Nebula); 6 Myr for σ Ori, Cep OB3b and IC 348; ≃10 Myr for λ Ori (Collinder 69); ≃11 Myr for NGC 2169; ≃12 Myr for NGC 2362; ≃13 Myr for NGC 7160; ≃14 Myr for χ Per (NGC 884); and ≃20 Myr for NGC 1960 (M 36).
The MAR databases: development and implementation of databases specific for marine metagenomics.
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen; Willassen, Nils P
2018-01-04
We introduce the marine databases; MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database for completely sequenced marine prokaryotic genomes, which represent a marine prokaryote reference genome database, MarDB includes all incomplete sequenced prokaryotic genomes regardless level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets the visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Parameter Estimation in Atmospheric Data Sets
NASA Technical Reports Server (NTRS)
Wenig, Mark; Colarco, Peter
2004-01-01
In this study the structure tensor technique is used to estimate dynamical parameters in atmospheric data sets. The structure tensor is a common tool for estimating motion in image sequences. This technique can be extended to estimate other dynamical parameters such as diffusion constants or exponential decay rates. A general mathematical framework was developed for the direct estimation of the physical parameters that govern the underlying processes from image sequences. This estimation technique can be adapted to the specific physical problem under investigation, so it can be used in a variety of applications in trace gas, aerosol, and cloud remote sensing. As a test scenario this technique will be applied to modeled dust data. In this case vertically integrated dust concentrations were used to derive wind information. Those results can be compared to the wind vector fields which served as input to the model. Based on this analysis, a method to compute atmospheric data parameter fields will be presented. .
Applications of Mass Spectrometry to Structural Analysis of Marine Oligosaccharides
Lang, Yinzhi; Zhao, Xia; Liu, Lili; Yu, Guangli
2014-01-01
Marine oligosaccharides have attracted increasing attention recently in developing potential drugs and biomaterials for their particular physical and chemical properties. However, the composition and sequence analysis of marine oligosaccharides are very challenging for their structural complexity and heterogeneity. Mass spectrometry (MS) has become an important technique for carbohydrate analysis by providing more detailed structural information, including molecular mass, sugar constituent, sequence, inter-residue linkage position and substitution pattern. This paper provides an overview of the structural analysis based on MS approaches in marine oligosaccharides, which are derived from some biologically important marine polysaccharides, including agaran, carrageenan, alginate, sulfated fucan, chitosan, glycosaminoglycan (GAG) and GAG-like polysaccharides. Applications of electrospray ionization mass spectrometry (ESI-MS) are mainly presented and the general applications of matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) are also outlined. Some technical challenges in the structural analysis of marine oligosaccharides by MS have also been pointed out. PMID:24983643
Analysis of surface structures of chemically peculiar stars with modern and future interferometers
NASA Astrophysics Data System (ADS)
Shulyak, D.; Perraut, K.; Paladini, Claudia; Li Causi, G.; Sacuto, Stephane; Kochukhov, O.
2014-07-01
Interferometry is a very powerful observational technique known in astronomy for many decades. Its application to main-sequence stars, however, is still limited to only brightest objects. In this work we aim to explore the application of interferometry to a special class of main-sequence stars known as chemically peculiar (CP) stars. These stars demonstrate surface chemical abundance inhomogeneities (spots) that usually cover a considerable part of the stellar surface and induce a pronounced spectral and photometric variability. Interferometry thus has a potential to naturally resolve such spots in single stars, providing unique complementary information about spots sizes and contrasts. By means of numerical experiments we derive the actual interferometric requirements essential for the CP stars research that can be addressed in future instrument development. The first comparison between theoretical predictions and already available observations will also be discussed.
The determination of high-resolution spatio-temporal glacier motion fields from time-lapse sequences
NASA Astrophysics Data System (ADS)
Schwalbe, Ellen; Maas, Hans-Gerd
2017-12-01
This paper presents a comprehensive method for the determination of glacier surface motion vector fields at high spatial and temporal resolution. These vector fields can be derived from monocular terrestrial camera image sequences and are a valuable data source for glaciological analysis of the motion behaviour of glaciers. The measurement concepts for the acquisition of image sequences are presented, and an automated monoscopic image sequence processing chain is developed. Motion vector fields can be derived with high precision by applying automatic subpixel-accuracy image matching techniques on grey value patterns in the image sequences. Well-established matching techniques have been adapted to the special characteristics of the glacier data in order to achieve high reliability in automatic image sequence processing, including the handling of moving shadows as well as motion effects induced by small instabilities in the camera set-up. Suitable geo-referencing techniques were developed to transform image measurements into a reference coordinate system.The result of monoscopic image sequence analysis is a dense raster of glacier surface point trajectories for each image sequence. Each translation vector component in these trajectories can be determined with an accuracy of a few centimetres for points at a distance of several kilometres from the camera. Extensive practical validation experiments have shown that motion vector and trajectory fields derived from monocular image sequences can be used for the determination of high-resolution velocity fields of glaciers, including the analysis of tidal effects on glacier movement, the investigation of a glacier's motion behaviour during calving events, the determination of the position and migration of the grounding line and the detection of subglacial channels during glacier lake outburst floods.
Characterization of circulating transfer RNA-derived RNA fragments in cattle
Casas, Eduardo; Cai, Guohong; Neill, John D.
2015-01-01
The objective was to characterize naturally occurring circulating transfer RNA-derived RNA fragments (tRFs) in cattle1. Serum from eight clinically normal adult dairy cows was collected, and small non-coding RNAs were extracted immediately after collection and sequenced by Illumina MiSeq. Sequences aligned to transfer RNA (tRNA) genes or their flanking sequences were characterized. Sequences aligned to the beginning of 5′ end of the mature tRNA were classified as tRF5; those aligned to the 3′ end of mature tRNA were classified as tRF3; and those aligned to the beginning of the 3′ end flanking sequences were classified as tRF1. There were 3,190,962 sequences that mapped to transfer RNA and small non-coding RNAs in the bovine genome. Of these, 2,323,520 were identified as tRF5s, 562 were tRF3s, and 81 were tRF1s. There were 866,799 sequences identified as other small non-coding RNAs (microRNA, rRNA, snoRNA, etc.) and were excluded from the study. The tRF5s ranged from 28 to 40 nucleotides; and 98.7% ranged from 30 to 34 nucleotides in length. The tRFs with the greatest number of sequences were derived from tRNA of histidine, glutamic acid, lysine, glycine, and valine. There was no association between number of codons for each amino acid and number of tRFs in the samples. The reason for tRF5s being the most abundant can only be explained if these sequences are associated with function within the animal. PMID:26379699
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph.
Benoit, Gaëtan; Lemaitre, Claire; Lavenier, Dominique; Drezen, Erwan; Dayris, Thibault; Uricaru, Raluca; Rizk, Guillaume
2015-09-14
Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses. LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.
Prins, Theo W; Scholtens, Ingrid M J; Bak, Arno W; van Dijk, Jeroen P; Voorhuijzen, Marleen M; Laurensse, Emile J; Kok, Esther J
2016-12-15
During routine monitoring for GMOs in food in the Netherlands, papaya-containing food supplements were found positive for the genetically modified (GM) elements P-35S and T-nos. The goal of this study was to identify the unknown and EU unauthorised GM papaya event(s). A screening strategy was applied using additional GM screening elements including a newly developed PRSV coat protein PCR. The detected PRSV coat protein PCR product was sequenced and the nucleotide sequence showed identity to PRSV YK strains indigenous to China and Taiwan. The GM events 16-0-1 and 18-2-4 could be identified by amplifying and sequencing events-specific sequences. Further analyses showed that both papaya event 16-0-1 and event 18-2-4 were transformed with the same construct. For use in routine analysis, derived TaqMan qPCR methods for events 16-0-1 and 18-2-4 were developed. Event 16-0-1 was detected in all samples tested whereas event 18-2-4 was detected in one sample. This study presents a strategy for combining information from different sources (literature, patent databases) and novel sequence data to identify unknown GM papaya events. Copyright © 2016 Elsevier Ltd. All rights reserved.
Song, Wen Jun; Qin, Qi Wei; Qiu, Jin; Huang, Can Hua; Wang, Fan; Hew, Choy Leong
2004-01-01
Here we report the complete genome sequence of Singapore grouper iridovirus (SGIV). Sequencing of the random shotgun and restriction endonuclease genomic libraries showed that the entire SGIV genome consists of 140,131 nucleotide bp. One hundred sixty-two open reading frames (ORFs) from the sense and antisense DNA strands, coding for lengths varying from 41 to 1,268 amino acids, were identified. Computer-assisted analyses of the deduced amino acid sequences revealed that 77 of the ORFs exhibited homologies to known virus genes, 23 of which matched functional iridovirus proteins. Forty-two putative conserved domains or signatures were detected in the National Center for Biotechnology Information CD-Search database and PROSITE database. An assortment of enzyme activities involved in DNA replication, transcription, nucleotide metabolism, cell signaling, etc., were identified. Viruses were cultured on a cell line derived from the embryonated egg of the grouper Epinephelus tauvina, isolated, and purified by sucrose gradient ultracentrifugation. The protein extract from the purified virions was analyzed by polyacrylamide gel electrophoresis followed by in-gel digestion of protein bands. Matrix-assisted laser desorption ionization-time of flight mass spectrometry and database searching led to identification of 26 proteins. Twenty of these represented novel or previously unidentified genes, which were further confirmed by reverse transcription-PCR (RT-PCR) and DNA sequencing of their respective RT-PCR products. PMID:15507645
Solis, Armando D
2015-12-01
To reduce complexity, understand generalized rules of protein folding, and facilitate de novo protein design, the 20-letter amino acid alphabet is commonly reduced to a smaller alphabet by clustering amino acids based on some measure of similarity. In this work, we seek the optimal alphabet that preserves as much of the structural information found in long-range (contact) interactions among amino acids in natively-folded proteins. We employ the Information Maximization Device, based on information theory, to partition the amino acids into well-defined clusters. Numbering from 2 to 19 groups, these optimal clusters of amino acids, while generated automatically, embody well-known properties of amino acids such as hydrophobicity/polarity, charge, size, and aromaticity, and are demonstrated to maintain the discriminative power of long-range interactions with minimal loss of mutual information. Our measurements suggest that reduced alphabets (of less than 10) are able to capture virtually all of the information residing in native contacts and may be sufficient for fold recognition, as demonstrated by extensive threading tests. In an expansive survey of the literature, we observe that alphabets derived from various approaches-including those derived from physicochemical intuition, local structure considerations, and sequence alignments of remote homologs-fare consistently well in preserving contact interaction information, highlighting a convergence in the various factors thought to be relevant to the folding code. Moreover, we find that alphabets commonly used in experimental protein design are nearly optimal and are largely coherent with observations that have arisen in this work. © 2015 Wiley Periodicals, Inc.
Schneider, Sean E.; Thomas, James H.
2014-01-01
We show here that 105 regions in two Lepidoptera genomes appear to derive from horizontally transferred wasp DNA. We experimentally verified the presence of two of these sequences in a diverse set of silkworm (Bombyx mori) genomes. We hypothesize that these horizontal transfers are made possible by the unusual strategy many parasitoid wasps employ of injecting hosts with endosymbiotic polydnaviruses to minimize the host's defense response. Because these virus-like particles deliver wasp DNA to the cells of the host, there has been much interest in whether genetic information can be permanently transferred from the wasp to the host. Two transferred sequences code for a BEN domain, known to be associated with polydnaviruses and transcriptional regulation. These findings represent the first documented cases of horizontal transfer of genes between two organisms by a polydnavirus. This presents an interesting evolutionary paradigm in which host species can acquire new sequences from parasitoid wasps that attack them. Hymenoptera and Lepidoptera diverged ∼300 MYA, making this type of event a source of novel sequences for recipient species. Unlike many other cases of horizontal transfer between two eukaryote species, these sequence transfers can be explained without the need to invoke the sequences ‘hitchhiking’ on a third organism (e.g. retrovirus) capable of independent reproduction. The cellular machinery necessary for the transfer is contained entirely in the wasp genome. The work presented here is the first such discovery of what is likely to be a broader phenomenon among species affected by these wasps. PMID:25296163
Sequential de novo centromere formation and inactivation on a chromosomal fragment in maize.
Liu, Yalin; Su, Handong; Pang, Junling; Gao, Zhi; Wang, Xiu-Jie; Birchler, James A; Han, Fangpu
2015-03-17
The ability of centromeres to alternate between active and inactive states indicates significant epigenetic aspects controlling centromere assembly and function. In maize (Zea mays), misdivision of the B chromosome centromere on a translocation with the short arm of chromosome 9 (TB-9Sb) can produce many variants with varying centromere sizes and centromeric DNA sequences. In such derivatives of TB-9Sb, we found a de novo centromere on chromosome derivative 3-3, which has no canonical centromeric repeat sequences. This centromere is derived from a 288-kb region on the short arm of chromosome 9, and is 19 megabases (Mb) removed from the translocation breakpoint of chromosome 9 in TB-9Sb. The functional B centromere in progenitor telo2-2 is deleted from derivative 3-3, but some B-repeat sequences remain. The de novo centromere of derivative 3-3 becomes inactive in three further derivatives with new centromeres being formed elsewhere on each chromosome. Our results suggest that de novo centromere initiation is quite common and can persist on chromosomal fragments without a canonical centromere. However, we hypothesize that when de novo centromeres are initiated in opposition to a larger normal centromere, they are cleared from the chromosome by inactivation, thus maintaining karyotype integrity.
Sequential de novo centromere formation and inactivation on a chromosomal fragment in maize
Liu, Yalin; Su, Handong; Pang, Junling; Gao, Zhi; Wang, Xiu-Jie; Birchler, James A.; Han, Fangpu
2015-01-01
The ability of centromeres to alternate between active and inactive states indicates significant epigenetic aspects controlling centromere assembly and function. In maize (Zea mays), misdivision of the B chromosome centromere on a translocation with the short arm of chromosome 9 (TB-9Sb) can produce many variants with varying centromere sizes and centromeric DNA sequences. In such derivatives of TB-9Sb, we found a de novo centromere on chromosome derivative 3-3, which has no canonical centromeric repeat sequences. This centromere is derived from a 288-kb region on the short arm of chromosome 9, and is 19 megabases (Mb) removed from the translocation breakpoint of chromosome 9 in TB-9Sb. The functional B centromere in progenitor telo2-2 is deleted from derivative 3-3, but some B-repeat sequences remain. The de novo centromere of derivative 3-3 becomes inactive in three further derivatives with new centromeres being formed elsewhere on each chromosome. Our results suggest that de novo centromere initiation is quite common and can persist on chromosomal fragments without a canonical centromere. However, we hypothesize that when de novo centromeres are initiated in opposition to a larger normal centromere, they are cleared from the chromosome by inactivation, thus maintaining karyotype integrity. PMID:25733907
Torrent, C; Gabus, C; Darlix, J L
1994-02-01
Retroviral genomes consist of two identical RNA molecules associated at their 5' ends by the dimer linkage structure located in the packaging element (Psi or E) necessary for RNA dimerization in vitro and packaging in vivo. In murine leukemia virus (MLV)-derived vectors designed for gene transfer, the Psi + sequence of 600 nucleotides directs the packaging of recombinant RNAs into MLV virions produced by helper cells. By using in vitro RNA dimerization as a screening system, a sequence of rat VL30 RNA located next to the 5' end of the Harvey mouse sarcoma virus genome and as small as 67 nucleotides was found to form stable dimeric RNA. In addition, a purine-rich sequence located at the 5' end of this VL30 RNA seems to be critical for RNA dimerization. When this VL30 element was extended by 107 nucleotides at its 3' end and inserted into an MLV-derived vector lacking MLV Psi +, it directed the efficient encapsidation of recombinant RNAs into MLV virions. Because this VL30 packaging signal is smaller and more efficient in packaging recombinant RNAs than the MLV Psi + and does not contain gag or glyco-gag coding sequences, its use in MLV-derived vectors should render even more unlikely recombinations which could generate replication-competent viruses. Therefore, utilization of the rat VL30 packaging sequence should improve the biological safety of MLV vectors for human gene transfer.
Mahelka, Václav; Kopecký, David
2010-06-01
Four accessions of hexaploid Elymus repens from its native Central European distribution area were analyzed using sequencing of multicopy (internal transcribed spacer, ITS) and single-copy (granule-bound starch synthase I, GBSSI) DNA in concert with genomic and fluorescent in situ hybridization (GISH and FISH) to disentangle its allopolyploid origin. Despite extensive ITS homogenization, nrDNA in E. repens allowed us to identify at least four distinct lineages. Apart from Pseudoroegneria and Hordeum, representing the major genome constituents, the presence of further unexpected alien genetic material, originating from species outside the Triticeae and close to Panicum (Paniceae) and Bromus (Bromeae), was revealed. GBSSI sequences provided information complementary to the ITS. Apart from Pseudoroegneria and Hordeum, two additional gene variants from within the Triticeae were discovered: One was Taeniatherum-like, but the other did not have a close relationship with any of the diploids sampled. GISH results were largely congruent with the sequence-based markers. GISH clearly confirmed Pseudoroegneria and Hordeum as major genome constituents and further showed the presence of a small chromosome segment corresponding to Panicum. It resided in the Hordeum subgenome and probably represents an old acquisition of a Hordeum progenitor. Spotty hybridization signals across all chromosomes after GISH with Taeniatherum and Bromus probes suggested that gene acquisition from these species is more likely due to common ancestry of the grasses or early introgression than to recent hybridization or allopolyploid origin of E. repens. Physical mapping of rDNA loci using FISH revealed that all rDNA loci except one minor were located on Pseudoroegneria-derived chromosomes, which suggests the loss of all Hordeum-derived loci but one. Because homogenization mechanisms seem to operate effectively among Pseudoroegneria-like copies in this species, incomplete ITS homogenization in our samples is probably due to an interstitial position of an individual minor rDNA locus located within the Hordeum-derived subgenome.
Moonan, Francis; Mirkov, T. Erik
2002-01-01
We have analyzed the genotypic diversity of Sugarcane yellow leaf virus (SCYLV) collected from North, South, and Central America by fingerprinting assays and selective cDNA cloning and sequencing. One group of isolates from Colombia, designated the C-population, has been identified as residing at the root node between a separable superpopulation structure of SCYLV and other members of the family Luteoviridae, indicating that the progenitor viruses of the North, South, and Central American isolates of the SCYLV superpopulation most likely arose from a C-population structure. From a model of intrafamilial evolution (F. Moonan et al., Virology 269:156–171, 2000), a prediction could be made that within the SCYLV species, the capacity of genomic sequence divergence would range from lowest in the capsid protein open reading frame 3 (ORF 3) to highest in a region spanning across the carboxy-terminal end of the RNA-dependent RNA polymerase ORF. We have demonstrated the validity and applicability of this intrafamilial model for the prediction of intraspecies SCYLV diversity. Analysis of spatial phylogenetic variation (SPV) within the SCYLV isolates could not be assessed by application of a “partial likelihoods assessed through optimization” (PLATO)-derived intraspecies model alone. However, application of a PLATO-derived intrafamilial model with the intraspecies-derived model allowed distinction of three forms of SPV. Two of the SPV forms identified correspond to the extremes in a continuum of sequence evolution displayed in a SCYLV superpopulation structure, and the third form was diagnostic of a C-population structure. The application of these types of models has value in terms of predicting the types of SCYLV intraspecies diversity that may exist worldwide, and in general, may be useful in application for more informed design of transgenes for use in the elicitation of homology-dependent virus resistance mechanisms in transgenic plants. PMID:11773408
Effects of informed consent for individual genome sequencing on relevant knowledge.
Kaphingst, K A; Facio, F M; Cheng, M-R; Brooks, S; Eidem, H; Linn, A; Biesecker, B B; Biesecker, L G
2012-11-01
Increasing availability of individual genomic information suggests that patients will need knowledge about genome sequencing to make informed decisions, but prior research is limited. In this study, we examined genome sequencing knowledge before and after informed consent among 311 participants enrolled in the ClinSeq™ sequencing study. An exploratory factor analysis of knowledge items yielded two factors (sequencing limitations knowledge; sequencing benefits knowledge). In multivariable analysis, high pre-consent sequencing limitations knowledge scores were significantly related to education [odds ratio (OR): 8.7, 95% confidence interval (CI): 2.45-31.10 for post-graduate education, and OR: 3.9; 95% CI: 1.05, 14.61 for college degree compared with less than college degree] and race/ethnicity (OR: 2.4, 95% CI: 1.09, 5.38 for non-Hispanic Whites compared with other racial/ethnic groups). Mean values increased significantly between pre- and post-consent for the sequencing limitations knowledge subscale (6.9-7.7, p < 0.0001) and sequencing benefits knowledge subscale (7.0-7.5, p < 0.0001); increase in knowledge did not differ by sociodemographic characteristics. This study highlights gaps in genome sequencing knowledge and underscores the need to target educational efforts toward participants with less education or from minority racial/ethnic groups. The informed consent process improved genome sequencing knowledge. Future studies could examine how genome sequencing knowledge influences informed decision making. © 2012 John Wiley & Sons A/S.
Cluster-Based Multipolling Sequencing Algorithm for Collecting RFID Data in Wireless LANs
NASA Astrophysics Data System (ADS)
Choi, Woo-Yong; Chatterjee, Mainak
2015-03-01
With the growing use of RFID (Radio Frequency Identification), it is becoming important to devise ways to read RFID tags in real time. Access points (APs) of IEEE 802.11-based wireless Local Area Networks (LANs) are being integrated with RFID networks that can efficiently collect real-time RFID data. Several schemes, such as multipolling methods based on the dynamic search algorithm and random sequencing, have been proposed. However, as the number of RFID readers associated with an AP increases, it becomes difficult for the dynamic search algorithm to derive the multipolling sequence in real time. Though multipolling methods can eliminate the polling overhead, we still need to enhance the performance of the multipolling methods based on random sequencing. To that extent, we propose a real-time cluster-based multipolling sequencing algorithm that drastically eliminates more than 90% of the polling overhead, particularly so when the dynamic search algorithm fails to derive the multipolling sequence in real time.
New fundamental parameters for attitude representation
NASA Astrophysics Data System (ADS)
Patera, Russell P.
2017-08-01
A new attitude parameter set is developed to clarify the geometry of combining finite rotations in a rotational sequence and in combining infinitesimal angular increments generated by angular rate. The resulting parameter set of six Pivot Parameters represents a rotation as a great circle arc on a unit sphere that can be located at any clocking location in the rotation plane. Two rotations are combined by linking their arcs at either of the two intersection points of the respective rotation planes. In a similar fashion, linking rotational increments produced by angular rate is used to derive the associated kinematical equations, which are linear and have no singularities. Included in this paper is the derivation of twelve Pivot Parameter elements that represent all twelve Euler Angle sequences, which enables efficient conversions between Pivot Parameters and any Euler Angle sequence. Applications of this new parameter set include the derivation of quaternions and the quaternion composition rule, as well as, the derivation of the analytical solution to time dependent coning motion. The relationships between Pivot Parameters and traditional parameter sets are included in this work. Pivot Parameters are well suited for a variety of aerospace applications due to their effective composition rule, singularity free kinematic equations, efficient conversion to and from Euler Angle sequences and clarity of their geometrical foundation.
From prenatal genomic diagnosis to fetal personalized medicine: progress and challenges
Bianchi, Diana W
2015-01-01
Thus far, the focus of personalized medicine has been the prevention and treatment of conditions that affect adults. Although advances in genetic technology have been applied more frequently to prenatal diagnosis than to fetal treatment, genetic and genomic information is beginning to influence pregnancy management. Recent developments in sequencing the fetal genome combined with progress in understanding fetal physiology using gene expression arrays indicate that we could have the technical capabilities to apply an individualized medicine approach to the fetus. Here I review recent advances in prenatal genetic diagnostics, the challenges associated with these new technologies and how the information derived from them can be used to advance fetal care. Historically, the goal of prenatal diagnosis has been to provide an informed choice to prospective parents. We are now at a point where that goal can and should be expanded to incorporate genetic, genomic and transcriptomic data to develop new approaches to fetal treatment. PMID:22772565
Feng, Feiling; Cheng, Qingbao; Yang, Liang; Zhang, Dadong; Ji, Shunlong; Zhang, Qiangzu; Lin, Yihui; Li, Fugen; Xiong, Lei; Liu, Chen; Jiang, Xiaoqing
2017-01-17
Gallbladder sarcomatoid carcinoma is a rare cancer with no clinical standard treatment. With the rapid development of next generation sequencing, it has been able to provide reasonable treatment options for patients based on genetic variations. However, most cancer drugs are not approval for gallbladder sarcomatoid carcinoma indications. The correlation between drug response and a genetic variation needs to be further elucidated. Three patient-derived cells-JXQ-3D-001, JXQ-3D-002, and JXQ-3D-003, were derived from biopsy samples of one gallbladder sarcomatoid carcinoma patient with progression and have been characterized. In order to study the relationship between drug sensitivity and gene alteration, genetic mutations of three patient-derived cells were discovered by whole exome sequencing, and drug screening has been performed based on the gene alterations and related signaling pathways that are associated with drug targets. It has been found that there are differences in biological characteristics such as morphology, cell proliferation, cell migration and colony formation activity among these three patient-derived cells although they are derived from the same patient. Their sensitivities to the chemotherapy drugs-Fluorouracil, Doxorubicin, and Cisplatin are distinct. Moreover, none of common chemotherapy drugs could inhibit the proliferations of all three patient-derived cells. Comprehensive analysis of their whole exome sequencing demonstrated that tumor-associated genes TP53, AKT2, FGFR3, FGF10, SDHA, and PI3KCA were mutated or amplified. Part of these alterations are actionable. By screening a set of compounds that are associated with the genetic alteration, it has been found that GDC-0941 and PF-04691502 for PI3K-AKT-mTOR pathway inhibitors could dramatically decrease the proliferation of three patient-derived cells. Importantly, expression of phosphorylated AKT and phosphorylated S6 were markedly decreased after treatments with PI3K-AKT-mTOR pathway inhibitors GDC-0941 (0.5 μM) and PF-04691502 (0.1 μM) in all three patient-derived cells. These data suggested that inhibition of the PI3K-AKT-mTOR pathway that was activated by PIK3CA amplification in all three patient-derived cells could reduce the cell proliferation. A patient-derived cell model combined with whole exome sequencing is a powerful tool to elucidate relationship between drug sensitivities and genetic alternations. In these gallbladder sarcomatoid carcinoma patient-derived cells, it is found that PIK3CA amplification could be used as a biomarker to indicate PI3K-AKT-mTOR pathway activation. Block of the pathway may benefit the gallbladder sarcomatoid carcinoma patient with this alternation in hypothesis. The real efficacy needs to be confirmed in vivo or in a clinical trial.
West, Claire; James, Stephen A; Davey, Robert P; Dicks, Jo; Roberts, Ian N
2014-07-01
The ribosomal RNA encapsulates a wealth of evolutionary information, including genetic variation that can be used to discriminate between organisms at a wide range of taxonomic levels. For example, the prokaryotic 16S rDNA sequence is very widely used both in phylogenetic studies and as a marker in metagenomic surveys and the internal transcribed spacer region, frequently used in plant phylogenetics, is now recognized as a fungal DNA barcode. However, this widespread use does not escape criticism, principally due to issues such as difficulties in classification of paralogous versus orthologous rDNA units and intragenomic variation, both of which may be significant barriers to accurate phylogenetic inference. We recently analyzed data sets from the Saccharomyces Genome Resequencing Project, characterizing rDNA sequence variation within multiple strains of the baker's yeast Saccharomyces cerevisiae and its nearest wild relative Saccharomyces paradoxus in unprecedented detail. Notably, both species possess single locus rDNA systems. Here, we use these new variation datasets to assess whether a more detailed characterization of the rDNA locus can alleviate the second of these phylogenetic issues, sequence heterogeneity, while controlling for the first. We demonstrate that a strong phylogenetic signal exists within both datasets and illustrate how they can be used, with existing methodology, to estimate intraspecies phylogenies of yeast strains consistent with those derived from whole-genome approaches. We also describe the use of partial Single Nucleotide Polymorphisms, a type of sequence variation found only in repetitive genomic regions, in identifying key evolutionary features such as genome hybridization events and show their consistency with whole-genome Structure analyses. We conclude that our approach can transform rDNA sequence heterogeneity from a problem to a useful source of evolutionary information, enabling the estimation of highly accurate phylogenies of closely related organisms, and discuss how it could be extended to future studies of multilocus rDNA systems. [concerted evolution; genome hydridisation; phylogenetic analysis; ribosomal DNA; whole genome sequencing; yeast]. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
Functionally conserved enhancers with divergent sequences in distant vertebrates
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Song; Oksenberg, Nir; Takayama, Sachiko
To examine the contributions of sequence and function conservation in the evolution of enhancers, we systematically identified enhancers whose sequences are not conserved among distant groups of vertebrate species, but have homologous function and are likely to be derived from a common ancestral sequence. In conclusion, our approach combined comparative genomics and epigenomics to identify potential enhancer sequences in the genomes of three groups of distantly related vertebrate species.
Functionally conserved enhancers with divergent sequences in distant vertebrates
Yang, Song; Oksenberg, Nir; Takayama, Sachiko; ...
2015-10-30
To examine the contributions of sequence and function conservation in the evolution of enhancers, we systematically identified enhancers whose sequences are not conserved among distant groups of vertebrate species, but have homologous function and are likely to be derived from a common ancestral sequence. In conclusion, our approach combined comparative genomics and epigenomics to identify potential enhancer sequences in the genomes of three groups of distantly related vertebrate species.
Belyi, Vladimir A.; Levine, Arnold J.; Skalka, Anna Marie
2010-01-01
Vertebrate genomes contain numerous copies of retroviral sequences, acquired over the course of evolution. Until recently they were thought to be the only type of RNA viruses to be so represented, because integration of a DNA copy of their genome is required for their replication. In this study, an extensive sequence comparison was conducted in which 5,666 viral genes from all known non-retroviral families with single-stranded RNA genomes were matched against the germline genomes of 48 vertebrate species, to determine if such viruses could also contribute to the vertebrate genetic heritage. In 19 of the tested vertebrate species, we discovered as many as 80 high-confidence examples of genomic DNA sequences that appear to be derived, as long ago as 40 million years, from ancestral members of 4 currently circulating virus families with single strand RNA genomes. Surprisingly, almost all of the sequences are related to only two families in the Order Mononegavirales: the Bornaviruses and the Filoviruses, which cause lethal neurological disease and hemorrhagic fevers, respectively. Based on signature landmarks some, and perhaps all, of the endogenous virus-like DNA sequences appear to be LINE element-facilitated integrations derived from viral mRNAs. The integrations represent genes that encode viral nucleocapsid, RNA-dependent-RNA-polymerase, matrix and, possibly, glycoproteins. Integrations are generally limited to one or very few copies of a related viral gene per species, suggesting that once the initial germline integration was obtained (or selected), later integrations failed or provided little advantage to the host. The conservation of relatively long open reading frames for several of the endogenous sequences, the virus-like protein regions represented, and a potential correlation between their presence and a species' resistance to the diseases caused by these pathogens, are consistent with the notion that their products provide some important biological advantage to the species. In addition, the viruses could also benefit, as some resistant species (e.g. bats) may serve as natural reservoirs for their persistence and transmission. Given the stringent limitations imposed in this informatics search, the examples described here should be considered a low estimate of the number of such integration events that have persisted over evolutionary time scales. Clearly, the sources of genetic information in vertebrate genomes are much more diverse than previously suspected. PMID:20686665
Belyi, Vladimir A; Levine, Arnold J; Skalka, Anna Marie
2010-07-29
Vertebrate genomes contain numerous copies of retroviral sequences, acquired over the course of evolution. Until recently they were thought to be the only type of RNA viruses to be so represented, because integration of a DNA copy of their genome is required for their replication. In this study, an extensive sequence comparison was conducted in which 5,666 viral genes from all known non-retroviral families with single-stranded RNA genomes were matched against the germline genomes of 48 vertebrate species, to determine if such viruses could also contribute to the vertebrate genetic heritage. In 19 of the tested vertebrate species, we discovered as many as 80 high-confidence examples of genomic DNA sequences that appear to be derived, as long ago as 40 million years, from ancestral members of 4 currently circulating virus families with single strand RNA genomes. Surprisingly, almost all of the sequences are related to only two families in the Order Mononegavirales: the Bornaviruses and the Filoviruses, which cause lethal neurological disease and hemorrhagic fevers, respectively. Based on signature landmarks some, and perhaps all, of the endogenous virus-like DNA sequences appear to be LINE element-facilitated integrations derived from viral mRNAs. The integrations represent genes that encode viral nucleocapsid, RNA-dependent-RNA-polymerase, matrix and, possibly, glycoproteins. Integrations are generally limited to one or very few copies of a related viral gene per species, suggesting that once the initial germline integration was obtained (or selected), later integrations failed or provided little advantage to the host. The conservation of relatively long open reading frames for several of the endogenous sequences, the virus-like protein regions represented, and a potential correlation between their presence and a species' resistance to the diseases caused by these pathogens, are consistent with the notion that their products provide some important biological advantage to the species. In addition, the viruses could also benefit, as some resistant species (e.g. bats) may serve as natural reservoirs for their persistence and transmission. Given the stringent limitations imposed in this informatics search, the examples described here should be considered a low estimate of the number of such integration events that have persisted over evolutionary time scales. Clearly, the sources of genetic information in vertebrate genomes are much more diverse than previously suspected.
An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments.
Bansal, Vikas
2018-01-01
The short read lengths of current high-throughput sequencing technologies limit the ability to recover long-range haplotype information. Dilution pool methods for preparing DNA sequencing libraries from high molecular weight DNA fragments enable the recovery of long DNA fragments from short sequence reads. These approaches require computational methods for identifying the DNA fragments using aligned sequence reads and assembling the fragments into long haplotypes. Although a number of computational methods have been developed for haplotype assembly, the problem of identifying DNA fragments from dilution pool sequence data has not received much attention. We formulate the problem of detecting DNA fragments from dilution pool sequencing experiments as a genome segmentation problem and develop an algorithm that uses dynamic programming to optimize a likelihood function derived from a generative model for the sequence reads. This algorithm uses an iterative approach to automatically infer the mean background read depth and the number of fragments in each pool. Using simulated data, we demonstrate that our method, FragmentCut, has 25-30% greater sensitivity compared with an HMM based method for fragment detection and can also detect overlapping fragments. On a whole-genome human fosmid pool dataset, the haplotypes assembled using the fragments identified by FragmentCut had greater N50 length, 16.2% lower switch error rate and 35.8% lower mismatch error rate compared with two existing methods. We further demonstrate the greater accuracy of our method using two additional dilution pool datasets. FragmentCut is available from https://bansal-lab.github.io/software/FragmentCut. vibansal@ucsd.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Comparative analysis of the small RNA transcriptomes of Pinus contorta and Oryza sativa
Morin, Ryan D.; Aksay, Gozde; Dolgosheina, Elena; Ebhardt, H. Alexander; Magrini, Vincent; Mardis, Elaine R.; Sahinalp, S. Cenk; Unrau, Peter J.
2008-01-01
The diversity of microRNAs and small-interfering RNAs has been extensively explored within angiosperms by focusing on a few key organisms such as Oryza sativa and Arabidopsis thaliana. A deeper division of the plants is defined by the radiation of the angiosperms and gymnosperms, with the latter comprising the commercially important conifers. The conifers are expected to provide important information regarding the evolution of highly conserved small regulatory RNAs. Deep sequencing provides the means to characterize and quantitatively profile small RNAs in understudied organisms such as these. Pyrosequencing of small RNAs from O. sativa revealed, as expected, ∼21- and ∼24-nt RNAs. The former contained known microRNAs, and the latter largely comprised intergenic-derived sequences likely representing heterochromatin siRNAs. In contrast, sequences from Pinus contorta were dominated by 21-nt small RNAs. Using a novel sequence-based clustering algorithm, we identified sequences belonging to 18 highly conserved microRNA families in P. contorta as well as numerous clusters of conserved small RNAs of unknown function. Using multiple methods, including expressed sequence folding and machine learning algorithms, we found a further 53 candidate novel microRNA families, 51 appearing specific to the P. contorta library. In addition, alignment of small RNA sequences to the O. sativa genome revealed six perfectly conserved classes of small RNA that included chloroplast transcripts and specific types of genomic repeats. The conservation of microRNAs and other small RNAs between the conifers and the angiosperms indicates that important RNA silencing processes were highly developed in the earliest spermatophytes. Genomic mapping of all sequences to the O. sativa genome can be viewed at http://microrna.bcgsc.ca/cgi-bin/gbrowse/rice_build_3/. PMID:18323537
Zhang, Gaihua; Su, Zhen
2012-01-01
Work on protein structure prediction is very useful in biological research. To evaluate their accuracy, experimental protein structures or their derived data are used as the 'gold standard'. However, as proteins are dynamic molecular machines with structural flexibility such a standard may be unreliable. To investigate the influence of the structure flexibility, we analysed 3,652 protein structures of 137 unique sequences from 24 protein families. The results showed that (1) the three-dimensional (3D) protein structures were not rigid: the root-mean-square deviation (RMSD) of the backbone Cα of structures with identical sequences was relatively large, with the average of the maximum RMSD from each of the 137 sequences being 1.06 Å; (2) the derived data of the 3D structure was not constant, e.g. the highest ratio of the secondary structure wobble site was 60.69%, with the sequence alignments from structural comparisons of two proteins in the same family sometimes being completely different. Proteins may have several stable conformations and the data derived from resolved structures as a 'gold standard' should be optimized before being utilized as criteria to evaluate the prediction methods, e.g. sequence alignment from structural comparison. Helix/β-sheet transition exists in normal free proteins. The coil ratio of the 3D structure could affect its resolution as determined by X-ray crystallography.
Putaporntip, Chaturong; Thongaree, Siriporn; Jongwutiwes, Somchai
2013-08-01
To determine the genetic diversity and potential transmission routes of Plasmodium knowlesi, we analyzed the complete nucleotide sequence of the gene encoding the merozoite surface protein-1 of this simian malaria (Pkmsp-1), an asexual blood-stage vaccine candidate, from naturally infected humans and macaques in Thailand. Analysis of Pkmsp-1 sequences from humans (n=12) and monkeys (n=12) reveals five conserved and four variable domains. Most nucleotide substitutions in conserved domains were dimorphic whereas three of four variable domains contained complex repeats with extensive sequence and size variation. Besides purifying selection in conserved domains, evidence of intragenic recombination scattering across Pkmsp-1 was detected. The number of haplotypes, haplotype diversity, nucleotide diversity and recombination sites of human-derived sequences exceeded that of monkey-derived sequences. Phylogenetic networks based on concatenated conserved sequences of Pkmsp-1 displayed a character pattern that could have arisen from sampling process or the presence of two independent routes of P. knowlesi transmission, i.e. from macaques to human and from human to humans in Thailand. Copyright © 2013 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Campante, Tiago L.; Veras, Dimitri; North, Thomas S. H.; Miglio, Andrea; Morel, Thierry; Johnson, John A.; Chaplin, William J.; Davies, Guy R.; Huber, Daniel; Kuszlewicz, James S.; Lund, Mikkel N.; Cooke, Benjamin F.; Elsworth, Yvonne P.; Rodrigues, Thaíse S.; Vanderburg, Andrew
2017-08-01
Doppler-based planet surveys point to an increasing occurrence rate of giant planets with stellar mass. Such surveys rely on evolved stars for a sample of intermediate-mass stars (so-called retired A stars), which are more amenable to Doppler observations than their main-sequence progenitors. However, it has been hypothesized that the masses of subgiant and low-luminosity red-giant stars targeted by these surveys - typically derived from a combination of spectroscopy and isochrone fitting - may be systematically overestimated. Here, we test this hypothesis for the particular case of the exoplanet-host star HD 212771 using K2 asteroseismology. The benchmark asteroseismic mass (1.45^{+0.10}_{-0.09} M_{⊙) is significantly higher than the value reported in the discovery paper (1.15 ± 0.08 M⊙), which has been used to inform the stellar mass-planet occurrence relation. This result, therefore, does not lend support to the above hypothesis. Implications for the fates of planetary systems are sensitively dependent on stellar mass. Based on the derived asteroseismic mass, we predict the post-main-sequence evolution of the Jovian planet orbiting HD 212771 under the effects of tidal forces and stellar mass-loss.
Just, Rebecca S; Irwin, Jodi A
2018-05-01
Some of the expected advantages of next generation sequencing (NGS) for short tandem repeat (STR) typing include enhanced mixture detection and genotype resolution via sequence variation among non-homologous alleles of the same length. However, at the same time that NGS methods for forensic DNA typing have advanced in recent years, many caseworking laboratories have implemented or are transitioning to probabilistic genotyping to assist the interpretation of complex autosomal STR typing results. Current probabilistic software programs are designed for length-based data, and were not intended to accommodate sequence strings as the product input. Yet to leverage the benefits of NGS for enhanced genotyping and mixture deconvolution, the sequence variation among same-length products must be utilized in some form. Here, we propose use of the longest uninterrupted stretch (LUS) in allele designations as a simple method to represent sequence variation within the STR repeat regions and facilitate - in the nearterm - probabilistic interpretation of NGS-based typing results. An examination of published population data indicated that a reference LUS region is straightforward to define for most autosomal STR loci, and that using repeat unit plus LUS length as the allele designator can represent greater than 80% of the alleles detected by sequencing. A proof of concept study performed using a freely available probabilistic software demonstrated that the LUS length can be used in allele designations when a program does not require alleles to be integers, and that utilizing sequence information improves interpretation of both single-source and mixed contributor STR typing results as compared to using repeat unit information alone. The LUS concept for allele designation maintains the repeat-based allele nomenclature that will permit backward compatibility to extant STR databases, and the LUS lengths themselves will be concordant regardless of the NGS assay or analysis tools employed. Further, these biologically based, easy-to-derive designations uphold clear relationships between parent alleles and their stutter products, enabling analysis in fully continuous probabilistic programs that model stutter while avoiding the algorithmic complexities that come with string based searches. Though using repeat unit plus LUS length as the allele designator does not capture variation that occurs outside of the core repeat regions, this straightforward approach would permit the large majority of known STR sequence variation to be used for mixture deconvolution and, in turn, result in more informative mixture statistics in the near term. Ultimately, the method could bridge the gap from current length-based probabilistic systems to facilitate broader adoption of NGS by forensic DNA testing laboratories. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
GenSeq: An updated nomenclature and ranking for genetic sequences from type and non-type sources
Chakrabarty, Prosanta; Warren, Melanie; Page, Lawrence M.; Baldwin, Carole C.
2013-01-01
Abstract An improved and expanded nomenclature for genetic sequences is introduced that corresponds with a ranking of the reliability of the taxonomic identification of the source specimens. This nomenclature is an advancement of the “Genetypes” naming system, which some have been reluctant to adopt because of the use of the “type” suffix in the terminology. In the new nomenclature, genetic sequences are labeled “genseq,” followed by a reliability ranking (e.g., 1 if the sequence is from a primary type), followed by the name of the genes from which the sequences were derived (e.g., genseq-1 16S, COI). The numbered suffix provides an indication of the likely reliability of taxonomic identification of the voucher. Included in this ranking system, in descending order of taxonomic reliability, are the following: sequences from primary types – “genseq-1,” secondary types – “genseq-2,” collection-vouchered topotypes – “genseq-3,” collection-vouchered non-types – “genseq-4,” and non-types that lack specimen vouchers but have photo vouchers – “genseq-5.” To demonstrate use of the new nomenclature, we review recently published new-species descriptions in the ichthyological literature that include DNA data and apply the GenSeq nomenclature to sequences referenced in those publications. We encourage authors to adopt the GenSeq nomenclature (note capital “G” and “S” when referring to the nomenclatural program) to provide a searchable tag (e.g., “genseq”; note lowercase “g” and “s” when referring to sequences) for genetic sequences from types and other vouchered specimens. Use of the new nomenclature and ranking system will improve integration of molecular phylogenetics and biological taxonomy and enhance the ability of researchers to assess the reliability of sequence data. We further encourage authors to update sequence information on databases such as GenBank whenever nomenclatural changes are made. PMID:24223486
CRISPR-Cas9 provides the means to perform genome editing and facilitates loss-of-function screens. However, we and others demonstrated that expression of the Cas9 endonuclease induces a gene-independent response that correlates with the number of target sequences in the genome. An alternative approach to suppressing gene expression is to block transcription using a catalytically inactive Cas9 (dCas9). Here we directly compare genome editing by CRISPR-Cas9 (cutting, CRISPRc) and gene suppression using KRAB-dCas9 (CRISPRi) in loss-of-function screens to identify cell essential genes.
Activity recognition using Video Event Segmentation with Text (VEST)
NASA Astrophysics Data System (ADS)
Holloway, Hillary; Jones, Eric K.; Kaluzniacki, Andrew; Blasch, Erik; Tierno, Jorge
2014-06-01
Multi-Intelligence (multi-INT) data includes video, text, and signals that require analysis by operators. Analysis methods include information fusion approaches such as filtering, correlation, and association. In this paper, we discuss the Video Event Segmentation with Text (VEST) method, which provides event boundaries of an activity to compile related message and video clips for future interest. VEST infers meaningful activities by clustering multiple streams of time-sequenced multi-INT intelligence data and derived fusion products. We discuss exemplar results that segment raw full-motion video (FMV) data by using extracted commentary message timestamps, FMV metadata, and user-defined queries.
On the sum of generalized Fibonacci sequence
NASA Astrophysics Data System (ADS)
Chong, Chin-Yoon; Ho, C. K.
2014-06-01
We consider the generalized Fibonacci sequence {Un defined by U0 = 0, U1 = 1, and Un+2 = pUn+1+qUn for all n∈Z0+ and p, q∈Z+. In this paper, we derived various sums of the generalized Fibonacci sequence from their recursive relations.
TFBSshape: a motif database for DNA shape features of transcription factor binding sites.
Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W; Gordân, Raluca; Rohs, Remo
2014-01-01
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
TFBSshape: a motif database for DNA shape features of transcription factor binding sites
Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W.; Gordân, Raluca; Rohs, Remo
2014-01-01
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein–DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone. PMID:24214955
Thermodynamics-based models of transcriptional regulation with gene sequence.
Wang, Shuqiang; Shen, Yanyan; Hu, Jinxing
2015-12-01
Quantitative models of gene regulatory activity have the potential to improve our mechanistic understanding of transcriptional regulation. However, the few models available today have been based on simplistic assumptions about the sequences being modeled or heuristic approximations of the underlying regulatory mechanisms. In this work, we have developed a thermodynamics-based model to predict gene expression driven by any DNA sequence. The proposed model relies on a continuous time, differential equation description of transcriptional dynamics. The sequence features of the promoter are exploited to derive the binding affinity which is derived based on statistical molecular thermodynamics. Experimental results show that the proposed model can effectively identify the activity levels of transcription factors and the regulatory parameters. Comparing with the previous models, the proposed model can reveal more biological sense.
Andersson, P; Klein, M; Lilliebridge, R A; Giffard, P M
2013-09-01
Ultra-deep Illumina sequencing was performed on whole genome amplified DNA derived from a Chlamydia trachomatis-positive vaginal swab. Alignment of reads with reference genomes allowed robust SNP identification from the C. trachomatis chromosome and plasmid. This revealed that the C. trachomatis in the specimen was very closely related to the sequenced urogenital, serovar F, clade T1 isolate F-SW4. In addition, high genome-wide coverage was obtained for Prevotella melaninogenica, Gardnerella vaginalis, Clostridiales genomosp. BVAB3 and Mycoplasma hominis. This illustrates the potential of metagenome data to provide high resolution bacterial typing data from multiple taxa in a diagnostic specimen. ©2013 The Authors Clinical Microbiology and Infection ©2013 European Society of Clinical Microbiology and Infectious Diseases.
Jenista, Elizabeth R; Stokes, Ashley M; Branca, Rosa Tamara; Warren, Warren S
2009-11-28
A recent quantum computing paper (G. S. Uhrig, Phys. Rev. Lett. 98, 100504 (2007)) analytically derived optimal pulse spacings for a multiple spin echo sequence designed to remove decoherence in a two-level system coupled to a bath. The spacings in what has been called a "Uhrig dynamic decoupling (UDD) sequence" differ dramatically from the conventional, equal pulse spacing of a Carr-Purcell-Meiboom-Gill (CPMG) multiple spin echo sequence. The UDD sequence was derived for a model that is unrelated to magnetic resonance, but was recently shown theoretically to be more general. Here we show that the UDD sequence has theoretical advantages for magnetic resonance imaging of structured materials such as tissue, where diffusion in compartmentalized and microstructured environments leads to fluctuating fields on a range of different time scales. We also show experimentally, both in excised tissue and in a live mouse tumor model, that optimal UDD sequences produce different T(2)-weighted contrast than do CPMG sequences with the same number of pulses and total delay, with substantial enhancements in most regions. This permits improved characterization of low-frequency spectral density functions in a wide range of applications.
Multirate control with incomplete information over Profibus-DP network
NASA Astrophysics Data System (ADS)
Salt, J.; Casanova, V.; Cuenca, A.; Pizá, R.
2014-07-01
When a process field bus-decentralized peripherals (Profibus-DP) network is used in an industrial environment, a deterministic behaviour is usually claimed. However, due to some concerns such as bandwidth limitations, lack of synchronisation among different clocks and existence of time-varying delays, a more complex problem must be faced. This problem implies the transmission of irregular and, even, random sequences of incomplete information. The main consequence of this issue is the appearance of different sampling periods at different network devices. In this paper, this aspect is checked by means of a detailed Profibus-DP timescale study. In addition, in order to deal with the different periods, a delay-dependent dual-rate proportional-integral-derivative control is introduced. Stability for the proposed control system is analysed in terms of linear matrix inequalities.
SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction.
Boniecki, Michal J; Lach, Grzegorz; Dawson, Wayne K; Tomala, Konrad; Lukasz, Pawel; Soltysinski, Tomasz; Rother, Kristian M; Bujnicki, Janusz M
2016-04-20
RNA molecules play fundamental roles in cellular processes. Their function and interactions with other biomolecules are dependent on the ability to form complex three-dimensional (3D) structures. However, experimental determination of RNA 3D structures is laborious and challenging, and therefore, the majority of known RNAs remain structurally uncharacterized. Here, we present SimRNA: a new method for computational RNA 3D structure prediction, which uses a coarse-grained representation, relies on the Monte Carlo method for sampling the conformational space, and employs a statistical potential to approximate the energy and identify conformations that correspond to biologically relevant structures. SimRNA can fold RNA molecules using only sequence information, and, on established test sequences, it recapitulates secondary structure with high accuracy, including correct prediction of pseudoknots. For modeling of complex 3D structures, it can use additional restraints, derived from experimental or computational analyses, including information about secondary structure and/or long-range contacts. SimRNA also can be used to analyze conformational landscapes and identify potential alternative structures. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
3DNALandscapes: a database for exploring the conformational features of DNA.
Zheng, Guohui; Colasanti, Andrew V; Lu, Xiang-Jun; Olson, Wilma K
2010-01-01
3DNALandscapes, located at: http://3DNAscapes.rutgers.edu, is a new database for exploring the conformational features of DNA. In contrast to most structural databases, which archive the Cartesian coordinates and/or derived parameters and images for individual structures, 3DNALandscapes enables searches of conformational information across multiple structures. The database contains a wide variety of structural parameters and molecular images, computed with the 3DNA software package and known to be useful for characterizing and understanding the sequence-dependent spatial arrangements of the DNA sugar-phosphate backbone, sugar-base side groups, base pairs, base-pair steps, groove structure, etc. The data comprise all DNA-containing structures--both free and bound to proteins, drugs and other ligands--currently available in the Protein Data Bank. The web interface allows the user to link, report, plot and analyze this information from numerous perspectives and thereby gain insight into DNA conformation, deformability and interactions in different sequence and structural contexts. The data accumulated from known, well-resolved DNA structures can serve as useful benchmarks for the analysis and simulation of new structures. The collective data can also help to understand how DNA deforms in response to proteins and other molecules and undergoes conformational rearrangements.
KungFQ: a simple and powerful approach to compress fastq files.
Grassi, Elena; Di Gregorio, Federico; Molineris, Ivan
2012-01-01
Nowadays storing data derived from deep sequencing experiments has become pivotal and standard compression algorithms do not exploit in a satisfying manner their structure. A number of reference-based compression algorithms have been developed but they are less adequate when approaching new species without fully sequenced genomes or nongenomic data. We developed a tool that takes advantages of fastq characteristics and encodes them in a binary format optimized in order to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file, it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, losing some of the original information (IDs and/or qualities) but resulting in smaller files; it is also possible to define a quality cutoff under which corresponding base calls are converted to N. We achieve 2.82 to 7.77 compression ratios on various fastq files without losing information and 5.37 to 8.77 losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm performance with known tools, usually obtaining higher compression levels.
Directed C-H Bond Oxidation of (+)-Pleuromutilin.
Ma, Xiaoshen; Kucera, Roman; Goethe, Olivia F; Murphy, Stephen K; Herzon, Seth B
2018-05-01
Antibiotics derived from the diterpene fungal metabolite (+)-pleuromutilin (1) are useful agents for the treatment Gram-positive infections in humans and farm animals. Pleuromutilins elicit slow rates of resistance development and minimal cross-resistance with existing antibiotics. Despite efforts aimed at producing new derivatives by semisynthesis, modification of the tricyclic core is underexplored, in part due to a limited number of functional group handles. Herein, we report methods to selectively functionalize the methyl groups of (+)-pleuromutilin (1) by hydroxyl-directed iridium-catalyzed C-H silylation, followed by Tamao-Fleming oxidation. These reactions provided access to C16, C17, and C18 monooxidized products, as well as C15/C16 and C17/C18 dioxidized products. Four new functionalized derivatives were prepared from the protected C17 oxidation product. C6 carboxylic acid, aldehyde, and normethyl derivatives were prepared from the C16 oxidation product. Many of these sequences were executed on gram scales. The efficiency and practicality of these routes provides an easy method to rapidly interrogate structure-activity relationships that were previously beyond reach. This study will inform the design of fully synthetic approaches to novel pleuromutilins and underscores the power of the hydroxyl-directed iridium-catalyzed C-H silylation reaction.
Bounds on the cross-correlation functions of state m-sequences
NASA Astrophysics Data System (ADS)
Woodcock, C. F.; Davies, Phillip A.; Shaar, Ahmed A.
1987-03-01
Lower and upper bounds on the peaks of the periodic Hamming cross-correlation function for state m-sequences, which are often used in frequency-hopped spread-spectrum systems, are derived. The state position mapped (SPM) sequences of the state m-sequences are described. The use of SPM sequences for OR-channel code division multiplexing is studied. The relation between the Hamming cross-correlation function and the correlation function of SPM sequence is examined. Numerical results which support the theoretical data are presented.
Kimura, Tomohiro; Nakano, Toshiki; Yamaguchi, Toshiyasu; Sato, Minoru; Ogawa, Tomohisa; Muramoto, Koji; Yokoyama, Takehiko; Kan-No, Nobuhiro; Nagahisa, Eizou; Janssen, Frank; Grieshaber, Manfred K
2004-01-01
The complete complementary DNA sequences of genes presumably coding for opine dehydrogenases from Arabella iricolor (sandworm), Haliotis discus hannai (abalone), and Patinopecten yessoensis (scallop) were determined, and partial cDNA sequences were derived for Meretrix lusoria (Japanese hard clam) and Spisula sachalinensis (Sakhalin surf clam). The primers ODH-9F and ODH-11R proved useful for amplifying the sequences for opine dehydrogenases from the 4 mollusk species investigated in this study. The sequence of the sandworm was obtained using primers constructed from the amino acid sequence of tauropine dehydrogenase, the main opine dehydrogenase in A. iricolor. The complete cDNA sequence of A. iricolor, H. discus hannai, and P. yessoensis encode 397, 400, and 405 amino acids, respectively. All sequences were aligned and compared with published databank sequences of Loligo opalescens, Loligo vulgaris (squid), Sepia officinalis (cuttlefish), and Pecten maximus (scallop). As expected, a high level of homology was observed for the cDNA from closely related species, such as for cephalopods or scallops, whereas cDNA from the other species showed lower-level homologies. A similar trend was observed when the deduced amino acid sequences were compared. Furthermore, alignment of these sequences revealed some structural motifs that are possibly related to the binding sites of the substrates. The phylogenetic trees derived from the nucleotide and amino acid sequences were consistent with the classification of species resulting from classical taxonomic analyses.
Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar
2008-01-01
The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community. PMID:17965093
Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar; Vadivelu, Ilango
2008-01-01
The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community.
Palaeosymbiosis Revealed by Genomic Fossils of Wolbachia in a Strongyloidean Nematode
Koutsovoulos, Georgios; Makepeace, Benjamin; Tanya, Vincent N.; Blaxter, Mark
2014-01-01
Wolbachia are common endosymbionts of terrestrial arthropods, and are also found in nematodes: the animal-parasitic filaria, and the plant-parasite Radopholus similis. Lateral transfer of Wolbachia DNA to the host genome is common. We generated a draft genome sequence for the strongyloidean nematode parasite Dictyocaulus viviparus, the cattle lungworm. In the assembly, we identified nearly 1 Mb of sequence with similarity to Wolbachia. The fragments were unlikely to derive from a live Wolbachia infection: most were short, and the genes were disabled through inactivating mutations. Many fragments were co-assembled with definitively nematode-derived sequence. We found limited evidence of expression of the Wolbachia-derived genes. The D. viviparus Wolbachia genes were most similar to filarial strains and strains from the host-promiscuous clade F. We conclude that D. viviparus was infected by Wolbachia in the past, and that clade F-like symbionts may have been the source of filarial Wolbachia infections. PMID:24901418
PpTFDB: A pigeonpea transcription factor database for exploring functional genomics in legumes
Singh, Akshay; Sharma, Ajay Kumar; Singh, Nagendra Kumar
2017-01-01
Pigeonpea (Cajanus cajan L.), a diploid legume crop, is a member of the tribe Phaseoleae. This tribe is descended from the millettioid (tropical) clade of the subfamily Papilionoideae, which includes many important legume crop species such as soybean (Glycine max), mung bean (Vigna radiata), cowpea (Vigna ungiculata), and common bean (Phaseolus vulgaris). It plays major role in food and nutritional security, being rich source of proteins, minerals and vitamins. We have developed a comprehensive Pigeonpea Transcription Factors Database (PpTFDB) that encompasses information about 1829 putative transcription factors (TFs) and their 55 TF families. PpTFDB provides a comprehensive information about each of the identified TFs that includes chromosomal location, protein physicochemical properties, sequence data, protein functional annotation, simple sequence repeats (SSRs) with primers derived from their motifs, orthology with related legume crops, and gene ontology (GO) assignment to respective TFs. (PpTFDB: http://14.139.229.199/PpTFDB/Home.aspx) is a freely available and user friendly web resource that facilitates users to retrieve the information of individual members of a TF family through a set of query interfaces including TF ID or protein functional annotation. In addition, users can also get the information by browsing interfaces, which include browsing by TF Categories and by, GO Categories. This PpTFDB will serve as a promising central resource for researchers as well as breeders who are working towards crop improvement of legume crops. PMID:28651001
Schwegmann, Alexander; Lindemann, Jens P.; Egelhaaf, Martin
2014-01-01
Knowing the depth structure of the environment is crucial for moving animals in many behavioral contexts, such as collision avoidance, targeting objects, or spatial navigation. An important source of depth information is motion parallax. This powerful cue is generated on the eyes during translatory self-motion with the retinal images of nearby objects moving faster than those of distant ones. To investigate how the visual motion pathway represents motion-based depth information we analyzed its responses to image sequences recorded in natural cluttered environments with a wide range of depth structures. The analysis was done on the basis of an experimentally validated model of the visual motion pathway of insects, with its core elements being correlation-type elementary motion detectors (EMDs). It is the key result of our analysis that the absolute EMD responses, i.e., the motion energy profile, represent the contrast-weighted nearness of environmental structures during translatory self-motion at a roughly constant velocity. In other words, the output of the EMD array highlights contours of nearby objects. This conclusion is largely independent of the scale over which EMDs are spatially pooled and was corroborated by scrutinizing the motion energy profile after eliminating the depth structure from the natural image sequences. Hence, the well-established dependence of correlation-type EMDs on both velocity and textural properties of motion stimuli appears to be advantageous for representing behaviorally relevant information about the environment in a computationally parsimonious way. PMID:25136314
Ohyama, Akio; Shirasawa, Kenta; Matsunaga, Hiroshi; Negoro, Satomi; Miyatake, Koji; Yamaguchi, Hirotaka; Nunome, Tsukasa; Iwata, Hiroyoshi; Fukuoka, Hiroyuki; Hayashi, Takeshi
2017-08-01
Using newly developed euchromatin-derived genomic SSR markers and a flexible Bayesian mapping method, 13 significant agricultural QTLs were identified in a segregating population derived from a four-way cross of tomato. So far, many QTL mapping studies in tomato have been performed for progeny obtained from crosses between two genetically distant parents, e.g., domesticated tomatoes and wild relatives. However, QTL information of quantitative traits related to yield (e.g., flower or fruit number, and total or average weight of fruits) in such intercross populations would be of limited use for breeding commercial tomato cultivars because individuals in the populations have specific genetic backgrounds underlying extremely different phenotypes between the parents such as large fruit in domesticated tomatoes and small fruit in wild relatives, which may not be reflective of the genetic variation in tomato breeding populations. In this study, we constructed F 2 population derived from a cross between two commercial F 1 cultivars in tomato to extract QTL information practical for tomato breeding. This cross corresponded to a four-way cross, because the four parental lines of the two F 1 cultivars were considered to be the founders. We developed 2510 new expressed sequence tag (EST)-based (euchromatin-derived) genomic SSR markers and selected 262 markers from these new SSR markers and publicly available SSR markers to construct a linkage map. QTL analysis for ten agricultural traits of tomato was performed based on the phenotypes and marker genotypes of F 2 plants using a flexible Bayesian method. As results, 13 QTL regions were detected for six traits by the Bayesian method developed in this study.
Genome sequence of a Providencia stuartii strain isolated from Lucilia sericata salivary glands
USDA-ARS?s Scientific Manuscript database
We present the draft genome sequence of a Providencia stuartii strain derived from salivary glands of larval Lucilia sericata; a common blow fly important to forensic, medical and veterinary science. The genome sequence will help dissect coinfections involving Providencia stuartii and Proteus mirab...
From Arithmetic Sequences to Linear Equations
ERIC Educational Resources Information Center
Matsuura, Ryota; Harless, Patrick
2012-01-01
The first part of the article focuses on deriving the essential properties of arithmetic sequences by appealing to students' sense making and reasoning. The second part describes how to guide students to translate their knowledge of arithmetic sequences into an understanding of linear equations. Ryota Matsuura originally wrote these lessons for…
Eltaher, Shamseldeen; Sallam, Ahmed; Belamkar, Vikas; Emara, Hamdy A; Nower, Ahmed A; Salem, Khaled F M; Poland, Jesse; Baenziger, Peter S
2018-01-01
The availability of information on the genetic diversity and population structure in wheat ( Triticum aestivum L.) breeding lines will help wheat breeders to better use their genetic resources and manage genetic variation in their breeding program. The recent advances in sequencing technology provide the opportunity to identify tens or hundreds of thousands of single nucleotide polymorphism (SNPs) in large genome species (e.g., wheat). These SNPs can be utilized for understanding genetic diversity and performing genome wide association studies (GWAS) for complex traits. In this study, the genetic diversity and population structure were investigated in a set of 230 genotypes (F 3:6 ) derived from various crosses as a prerequisite for GWAS and genomic selection. Genotyping-by-sequencing provided 25,566 high-quality SNPs. The polymorphism information content (PIC) across chromosomes ranged from 0.09 to 0.37 with an average of 0.23. The distribution of SNPs markers on the 21 chromosomes ranged from 319 on chromosome 3D to 2,370 on chromosome 3B. The analysis of population structure revealed three subpopulations (G1, G2, and G3). Analysis of molecular variance identified 8% variance among and 92% within subpopulations. Of the three subpopulations, G2 had the highest level of genetic diversity based on three genetic diversity indices: Shannon's information index ( I ) = 0.494, diversity index ( h ) = 0.328 and unbiased diversity index (uh) = 0.331, while G3 had lowest level of genetic diversity ( I = 0.348, h = 0.226 and uh = 0.236). This high genetic diversity identified among the subpopulations can be used to develop new wheat cultivars.
VitisExpDB: a database resource for grape functional genomics.
Doddapaneni, Harshavardhan; Lin, Hong; Walker, M Andrew; Yao, Jiqiang; Civerolo, Edwin L
2008-02-28
The family Vitaceae consists of many different grape species that grow in a range of climatic conditions. In the past few years, several studies have generated functional genomic information on different Vitis species and cultivars, including the European grape vine, Vitis vinifera. Our goal is to develop a comprehensive web data source for Vitaceae. VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for V. vinifera and non-vinifera grape species and varieties. Currently, the database stores approximately 320,000 EST sequences derived from 8 species/hybrids, their annotation (BLAST top match) details and Gene Ontology based structured vocabulary. Putative homologs for each EST in other species and varieties along with information on their percent nucleotide identities, phylogenetic relationship and common primers can be retrieved. The database also includes information on probe sequence and annotation features of the high density 60-mer gene expression chip consisting of approximately 20,000 non-redundant set of ESTs. Finally, the database includes 14 processed global microarray expression profile sets. Data from 12 of these expression profile sets have been mapped onto metabolic pathways. A user-friendly web interface with multiple search indices and extensively hyperlinked result features that permit efficient data retrieval has been developed. Several online bioinformatics tools that interact with the database along with other sequence analysis tools have been added. In addition, users can submit their ESTs to the database. The developed database provides genomic resource to grape community for functional analysis of genes in the collection and for the grape genome annotation and gene function identification. The VitisExpDB database is available through our website http://cropdisease.ars.usda.gov/vitis_at/main-page.htm.
Eltaher, Shamseldeen; Sallam, Ahmed; Belamkar, Vikas; Emara, Hamdy A.; Nower, Ahmed A.; Salem, Khaled F. M.; Poland, Jesse; Baenziger, Peter S.
2018-01-01
The availability of information on the genetic diversity and population structure in wheat (Triticum aestivum L.) breeding lines will help wheat breeders to better use their genetic resources and manage genetic variation in their breeding program. The recent advances in sequencing technology provide the opportunity to identify tens or hundreds of thousands of single nucleotide polymorphism (SNPs) in large genome species (e.g., wheat). These SNPs can be utilized for understanding genetic diversity and performing genome wide association studies (GWAS) for complex traits. In this study, the genetic diversity and population structure were investigated in a set of 230 genotypes (F3:6) derived from various crosses as a prerequisite for GWAS and genomic selection. Genotyping-by-sequencing provided 25,566 high-quality SNPs. The polymorphism information content (PIC) across chromosomes ranged from 0.09 to 0.37 with an average of 0.23. The distribution of SNPs markers on the 21 chromosomes ranged from 319 on chromosome 3D to 2,370 on chromosome 3B. The analysis of population structure revealed three subpopulations (G1, G2, and G3). Analysis of molecular variance identified 8% variance among and 92% within subpopulations. Of the three subpopulations, G2 had the highest level of genetic diversity based on three genetic diversity indices: Shannon’s information index (I) = 0.494, diversity index (h) = 0.328 and unbiased diversity index (uh) = 0.331, while G3 had lowest level of genetic diversity (I = 0.348, h = 0.226 and uh = 0.236). This high genetic diversity identified among the subpopulations can be used to develop new wheat cultivars. PMID:29593779
2010-01-01
Background The maturing field of genomics is rapidly increasing the number of sequenced genomes and producing more information from those previously sequenced. Much of this additional information is variation data derived from sampling multiple individuals of a given species with the goal of discovering new variants and characterising the population frequencies of the variants that are already known. These data have immense value for many studies, including those designed to understand evolution and connect genotype to phenotype. Maximising the utility of the data requires that it be stored in an accessible manner that facilitates the integration of variation data with other genome resources such as gene annotation and comparative genomics. Description The Ensembl project provides comprehensive and integrated variation resources for a wide variety of chordate genomes. This paper provides a detailed description of the sources of data and the methods for creating the Ensembl variation databases. It also explores the utility of the information by explaining the range of query options available, from using interactive web displays, to online data mining tools and connecting directly to the data servers programmatically. It gives a good overview of the variation resources and future plans for expanding the variation data within Ensembl. Conclusions Variation data is an important key to understanding the functional and phenotypic differences between individuals. The development of new sequencing and genotyping technologies is greatly increasing the amount of variation data known for almost all genomes. The Ensembl variation resources are integrated into the Ensembl genome browser and provide a comprehensive way to access this data in the context of a widely used genome bioinformatics system. All Ensembl data is freely available at http://www.ensembl.org and from the public MySQL database server at ensembldb.ensembl.org. PMID:20459805
VitisExpDB: A database resource for grape functional genomics
Doddapaneni, Harshavardhan; Lin, Hong; Walker, M Andrew; Yao, Jiqiang; Civerolo, Edwin L
2008-01-01
Background The family Vitaceae consists of many different grape species that grow in a range of climatic conditions. In the past few years, several studies have generated functional genomic information on different Vitis species and cultivars, including the European grape vine, Vitis vinifera. Our goal is to develop a comprehensive web data source for Vitaceae. Description VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for V. vinifera and non-vinifera grape species and varieties. Currently, the database stores ~320,000 EST sequences derived from 8 species/hybrids, their annotation (BLAST top match) details and Gene Ontology based structured vocabulary. Putative homologs for each EST in other species and varieties along with information on their percent nucleotide identities, phylogenetic relationship and common primers can be retrieved. The database also includes information on probe sequence and annotation features of the high density 60-mer gene expression chip consisting of ~20,000 non-redundant set of ESTs. Finally, the database includes 14 processed global microarray expression profile sets. Data from 12 of these expression profile sets have been mapped onto metabolic pathways. A user-friendly web interface with multiple search indices and extensively hyperlinked result features that permit efficient data retrieval has been developed. Several online bioinformatics tools that interact with the database along with other sequence analysis tools have been added. In addition, users can submit their ESTs to the database. Conclusion The developed database provides genomic resource to grape community for functional analysis of genes in the collection and for the grape genome annotation and gene function identification. The VitisExpDB database is available through our website . PMID:18307813
Hu, Yanhui; Sopko, Richelle; Foos, Marianna; Kelley, Colleen; Flockhart, Ian; Ammeux, Noemie; Wang, Xiaowei; Perkins, Lizabeth; Perrimon, Norbert; Mohr, Stephanie E.
2013-01-01
The evaluation of specific endogenous transcript levels is important for understanding transcriptional regulation. More specifically, it is useful for independent confirmation of results obtained by the use of microarray analysis or RNA-seq and for evaluating RNA interference (RNAi)-mediated gene knockdown. Designing specific and effective primers for high-quality, moderate-throughput evaluation of transcript levels, i.e., quantitative, real-time PCR (qPCR), is nontrivial. To meet community needs, predefined qPCR primer pairs for mammalian genes have been designed and sequences made available, e.g., via PrimerBank. In this work, we adapted and refined the algorithms used for the mammalian PrimerBank to design 45,417 primer pairs for 13,860 Drosophila melanogaster genes, with three or more primer pairs per gene. We experimentally validated primer pairs for ~300 randomly selected genes expressed in early Drosophila embryos, using SYBR Green-based qPCR and sequence analysis of products derived from conventional PCR. All relevant information, including primer sequences, isoform specificity, spatial transcript targeting, and any available validation results and/or user feedback, is available from an online database (www.flyrnai.org/flyprimerbank). At FlyPrimerBank, researchers can retrieve primer information for fly genes either one gene at a time or in batch mode. Importantly, we included the overlap of each predicted amplified sequence with RNAi reagents from several public resources, making it possible for researchers to choose primers suitable for knockdown evaluation of RNAi reagents (i.e., to avoid amplification of the RNAi reagent itself). We demonstrate the utility of this resource for validation of RNAi reagents in vivo. PMID:23893746
Coughlan, Simone; Taylor, Ali Shirley; Feane, Eoghan; Sanders, Mandy; Schonian, Gabriele; Cotton, James A.
2018-01-01
The unicellular protozoan parasite Leishmania causes the neglected tropical disease leishmaniasis, affecting 12 million people in 98 countries. In South America, where the Viannia subgenus predominates, so far only L. (Viannia) braziliensis and L. (V.) panamensis have been sequenced, assembled and annotated as reference genomes. Addressing this deficit in molecular information can inform species typing, epidemiological monitoring and clinical treatment. Here, L. (V.) naiffi and L. (V.) guyanensis genomic DNA was sequenced to assemble these two genomes as draft references from short sequence reads. The methods used were tested using short sequence reads for L. braziliensis M2904 against its published reference as a comparison. This assembly and annotation pipeline identified 70 additional genes not annotated on the original M2904 reference. Phylogenetic and evolutionary comparisons of L. guyanensis and L. naiffi with 10 other Viannia genomes revealed four traits common to all Viannia: aneuploidy, 22 orthologous groups of genes absent in other Leishmania subgenera, elevated TATE transposon copies and a high NADH-dependent fumarate reductase gene copy number. Within the Viannia, there were limited structural changes in genome architecture specific to individual species: a 45 Kb amplification on chromosome 34 was present in all bar L. lainsoni, L. naiffi had a higher copy number of the virulence factor leishmanolysin, and laboratory isolate L. shawi M8408 had a possible minichromosome derived from the 3’ end of chromosome 34. This combination of genome assembly, phylogenetics and comparative analysis across an extended panel of diverse Viannia has uncovered new insights into the origin and evolution of this subgenus and can help improve diagnostics for leishmaniasis surveillance. PMID:29765675
Scaglione, Davide; Acquadro, Alberto; Portis, Ezio; Taylor, Christopher A; Lanteri, Sergio; Knapp, Steven J
2009-01-01
Background The globe artichoke (Cynara cardunculus var. scolymus L.) is a significant crop in the Mediterranean basin. Despite its commercial importance and its both dietary and pharmaceutical value, knowledge of its genetics and genomics remains scant. Microsatellite markers have become a key tool in genetic and genomic analysis, and we have exploited recently acquired EST (expressed sequence tag) sequence data (Composite Genome Project - CGP) to develop an extensive set of microsatellite markers. Results A unigene assembly was created from over 36,000 globe artichoke EST sequences, containing 6,621 contigs and 12,434 singletons. Over 12,000 of these unigenes were functionally assigned on the basis of homology with Arabidopsis thaliana reference proteins. A total of 4,219 perfect repeats, located within 3,308 unigenes was identified and the gene ontology (GO) analysis highlighted some GO term's enrichments among different classes of microsatellites with respect to their position. Sufficient flanking sequence was available to enable the design of primers to amplify 2,311 of these microsatellites, and a set of 300 was tested against a DNA panel derived from 28 C. cardunculus genotypes. Consistent amplification and polymorphism was obtained from 236 of these assays. Their polymorphic information content (PIC) ranged from 0.04 to 0.90 (mean 0.66). Between 176 and 198 of the assays were informative in at least one of the three available mapping populations. Conclusion EST-based microsatellites have provided a large set of de novo genetic markers, which show significant amounts of polymorphism both between and within the three taxa of C. cardunculus. They are thus well suited as assays for phylogenetic analysis, the construction of genetic maps, marker-assisted breeding, transcript mapping and other genomic applications in the species. PMID:19785740
Cell and molecular biology of SAE, a cell line from the spiny dogfish shark, Squalus acanthias.
Parton, Angela; Forest, David; Kobayashi, Hiroshi; Dowell, Lori; Bayne, Christopher; Barnes, David
2007-02-01
Cartilaginous fish, primarily sharks, rays and skates (elasmobranchs), appeared 450 million years ago. They are the most primitive vertebrates, exhibiting jaws and teeth, adaptive immunity, a pressurized circulatory system, thymus, spleen, and a liver comparable to that of humans. The most used elasmobranch in biomedical research is the spiny dogfish shark, Squalus acanthias. Comparative genomic analysis of the dogfish shark, the little skate (Leucoraja erincea), and other elasmobranchs have yielded insights into conserved functional domains of genes associated with human liver function, multidrug resistance, cystic fibrosis, and other biomedically relevant processes. While genomic information from these animals is informative in an evolutionary framework, experimental verification of functions of genomic sequences depends heavily on cell culture approaches. We have derived the first multipassage, continuously proliferating cell line of a cartilaginous fish. The line was initiated from embryos of the spiny dogfish shark. The cells were maintained in a medium modified for fish species and supplemented with cell type-specific hormones, other proteins and sera, and plated on a collagen substrate. SAE cells have been cultured continuously for three years. These cells can be transfected by plasmids and have been cryopreserved. Expressed Sequence Tags generated from a normalized SAE cDNA library included a number of markers for cartilage and muscle, as well as proteins influencing tissue differentiation and development, suggesting that SAE cells may be of mesenchymal stem cell origin. Examination of SAE EST sequences also revealed a cartilaginous fish-specific repetitive sequence that may be evidence of an ancient mobile genetic element that most likely was introduced into the cartilaginous fish lineage after divergence from the lineage leading to teleosts.
Rigoutsos, Isidore; Riek, Peter; Graham, Robert M.; Novotny, Jiri
2003-01-01
One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular α-helical character (i.e. π-helices, 310-helices and kinks). A ‘search engine’ derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above ‘non-canonical’ helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from α-helicity are encoded locally in sequence patterns only about 7–9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure–function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html. PMID:12888523
Rigoutsos, Isidore; Riek, Peter; Graham, Robert M; Novotny, Jiri
2003-08-01
One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular alpha-helical character (i.e. pi-helices, 3(10)-helices and kinks). A 'search engine' derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above 'non-canonical' helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from alpha-helicity are encoded locally in sequence patterns only about 7-9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure-function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html.
Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi
2016-06-15
Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502. yasu@bio.keio.ac.jp Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
PANGEA: pipeline for analysis of next generation amplicons
Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz FW; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W
2010-01-01
High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including preprocessing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the χ2 step, are joined into one program called the ‘backbone’. PMID:20182525
PANGEA: pipeline for analysis of next generation amplicons.
Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz F W; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W
2010-07-01
High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including pre-processing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the chi(2) step, are joined into one program called the 'backbone'.
Genetic Analysis of East Asian Grape Cultivars Suggests Hybridization with Wild Vitis.
Goto-Yamamoto, Nami; Sawler, Jason; Myles, Sean
2015-01-01
Koshu is a grape cultivar native to Japan and is one of the country's most important cultivars for wine making. Koshu and other oriental grape cultivars are widely believed to belong to the European domesticated grape species Vitis vinifera. To verify the domesticated origin of Koshu and four other cultivars widely grown in China and Japan, we genotyped 48 ancestry informative single nucleotide polymorphisms (SNPs) and estimated wild and domesticated ancestry proportions. Our principal components analysis (PCA) based ancestry estimation revealed that Koshu is 70% V. vinifera, and that the remaining 30% of its ancestry is most likely derived from wild East Asian Vitis species. Partial sequencing of chloroplast DNA suggests that Koshu's maternal line is derived from the Chinese wild species V. davidii or a closely related species. Our results suggest that many traditional East Asian grape cultivars such as Koshu were generated from hybridization events with wild grape species.
Yi, Ming; Zhao, Yongmei; Jia, Li; He, Mei; Kebebew, Electron; Stephens, Robert M.
2014-01-01
To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios—family pedigree information and SNP array data for the same samples, permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of these tools without directly assessing the quality of the derived SNPs. More importantly, the main purpose of our study was to establish a reusable procedure that applies high-throughput validation to compare the quality of SNP discovery tools with a focus on exome-seq, which can be used to compare any forthcoming tool(s) of interest. PMID:24831545
Circulation of Endemic Type 2 Vaccine-Derived Poliovirus in Egypt from 1983 to 1993
Yang, Chen-Fu; Naguib, Tary; Yang, Su-Ju; Nasr, Eman; Jorba, Jaume; Ahmed, Nahed; Campagnoli, Ray; van der Avoort, Harrie; Shimizu, Hiroyuki; Yoneyama, Tetsuo; Miyamura, Tatsuo; Pallansch, Mark; Kew, Olen
2003-01-01
From 1988 to 1993, 30 cases of poliomyelitis associated with poliovirus type 2 were found in seven governorates of Egypt. Because many of the cases were geographically and temporally clustered and because the case isolates differed antigenically from the vaccine strain, it was initially assumed that the cases signaled the continued circulation of wild type 2 poliovirus. However, comparison of sequences encoding the major capsid protein, VP1 (903 nucleotides), revealed that the isolates were related (93 to 97% nucleotide sequence identity) to the Sabin type 2 oral poliovirus vaccine (OPV) strain and unrelated (<82% nucleotide sequence identity) to the wild type 2 polioviruses previously indigenous to Egypt (last known isolate: 1979) or to any contemporary wild type 2 polioviruses found elsewhere. The rate and pattern of VP1 divergence among the circulating vaccine-derived poliovirus (cVDPV) isolates suggested that all lineages were derived from a single OPV infection that occurred around 1983 and that progeny from the initiating infection circulated for approximately a decade within Egypt along several independent chains of transmission. Complete genomic sequences of an early (1988) and a late (1993) cVDPV isolate revealed that their 5′ untranslated region (5′ UTR) and noncapsid- 3′ UTR sequences were derived from other species C enteroviruses. Circulation of type 2 cVDPVs occurred at a time of low OPV coverage in the affected communities and ceased when OPV coverage rates increased. The potential for cVDPVs to circulate in populations with low immunity to poliovirus has important implications for current and future strategies to eradicate polio worldwide. PMID:12857906
Torrent, C; Gabus, C; Darlix, J L
1994-01-01
Retroviral genomes consist of two identical RNA molecules associated at their 5' ends by the dimer linkage structure located in the packaging element (Psi or E) necessary for RNA dimerization in vitro and packaging in vivo. In murine leukemia virus (MLV)-derived vectors designed for gene transfer, the Psi + sequence of 600 nucleotides directs the packaging of recombinant RNAs into MLV virions produced by helper cells. By using in vitro RNA dimerization as a screening system, a sequence of rat VL30 RNA located next to the 5' end of the Harvey mouse sarcoma virus genome and as small as 67 nucleotides was found to form stable dimeric RNA. In addition, a purine-rich sequence located at the 5' end of this VL30 RNA seems to be critical for RNA dimerization. When this VL30 element was extended by 107 nucleotides at its 3' end and inserted into an MLV-derived vector lacking MLV Psi +, it directed the efficient encapsidation of recombinant RNAs into MLV virions. Because this VL30 packaging signal is smaller and more efficient in packaging recombinant RNAs than the MLV Psi + and does not contain gag or glyco-gag coding sequences, its use in MLV-derived vectors should render even more unlikely recombinations which could generate replication-competent viruses. Therefore, utilization of the rat VL30 packaging sequence should improve the biological safety of MLV vectors for human gene transfer. Images PMID:8289369
Circulation of endemic type 2 vaccine-derived poliovirus in Egypt from 1983 to 1993.
Yang, Chen-Fu; Naguib, Tary; Yang, Su-Ju; Nasr, Eman; Jorba, Jaume; Ahmed, Nahed; Campagnoli, Ray; van der Avoort, Harrie; Shimizu, Hiroyuki; Yoneyama, Tetsuo; Miyamura, Tatsuo; Pallansch, Mark; Kew, Olen
2003-08-01
From 1988 to 1993, 30 cases of poliomyelitis associated with poliovirus type 2 were found in seven governorates of Egypt. Because many of the cases were geographically and temporally clustered and because the case isolates differed antigenically from the vaccine strain, it was initially assumed that the cases signaled the continued circulation of wild type 2 poliovirus. However, comparison of sequences encoding the major capsid protein, VP1 (903 nucleotides), revealed that the isolates were related (93 to 97% nucleotide sequence identity) to the Sabin type 2 oral poliovirus vaccine (OPV) strain and unrelated (<82% nucleotide sequence identity) to the wild type 2 polioviruses previously indigenous to Egypt (last known isolate: 1979) or to any contemporary wild type 2 polioviruses found elsewhere. The rate and pattern of VP1 divergence among the circulating vaccine-derived poliovirus (cVDPV) isolates suggested that all lineages were derived from a single OPV infection that occurred around 1983 and that progeny from the initiating infection circulated for approximately a decade within Egypt along several independent chains of transmission. Complete genomic sequences of an early (1988) and a late (1993) cVDPV isolate revealed that their 5' untranslated region (5' UTR) and noncapsid- 3' UTR sequences were derived from other species C enteroviruses. Circulation of type 2 cVDPVs occurred at a time of low OPV coverage in the affected communities and ceased when OPV coverage rates increased. The potential for cVDPVs to circulate in populations with low immunity to poliovirus has important implications for current and future strategies to eradicate polio worldwide.
Moser, Lindsey A.; Ramirez-Carvajal, Lisbeth; Puri, Vinita; Pauszek, Steven J.; Matthews, Krystal; Dilley, Kari A.; Mullan, Clancy; McGraw, Jennifer; Khayat, Michael; Beeri, Karen; Yee, Anthony; Dugan, Vivien; Heise, Mark T.; Frieman, Matthew B.; Rodriguez, Luis L.; Bernard, Kristen A.; Wentworth, David E.
2016-01-01
ABSTRACT Several biosafety level 3 and/or 4 (BSL-3/4) pathogens are high-consequence, single-stranded RNA viruses, and their genomes, when introduced into permissive cells, are infectious. Moreover, many of these viruses are select agents (SAs), and their genomes are also considered SAs. For this reason, cDNAs and/or their derivatives must be tested to ensure the absence of infectious virus and/or viral RNA before transfer out of the BSL-3/4 and/or SA laboratory. This tremendously limits the capacity to conduct viral genomic research, particularly the application of next-generation sequencing (NGS). Here, we present a sequence-independent method to rapidly amplify viral genomic RNA while simultaneously abolishing both viral and genomic RNA infectivity across multiple single-stranded positive-sense RNA (ssRNA+) virus families. The process generates barcoded DNA amplicons that range in length from 300 to 1,000 bp, which cannot be used to rescue a virus and are stable to transport at room temperature. Our barcoding approach allows for up to 288 barcoded samples to be pooled into a single library and run across various NGS platforms without potential reconstitution of the viral genome. Our data demonstrate that this approach provides full-length genomic sequence information not only from high-titer virion preparations but it can also recover specific viral sequence from samples with limited starting material in the background of cellular RNA, and it can be used to identify pathogens from unknown samples. In summary, we describe a rapid, universal standard operating procedure that generates high-quality NGS libraries free of infectious virus and infectious viral RNA. IMPORTANCE This report establishes and validates a standard operating procedure (SOP) for select agents (SAs) and other biosafety level 3 and/or 4 (BSL-3/4) RNA viruses to rapidly generate noninfectious, barcoded cDNA amenable for next-generation sequencing (NGS). This eliminates the burden of testing all processed samples derived from high-consequence pathogens prior to transfer from high-containment laboratories to lower-containment facilities for sequencing. Our established protocol can be scaled up for high-throughput sequencing of hundreds of samples simultaneously, which can dramatically reduce the cost and effort required for NGS library construction. NGS data from this SOP can provide complete genome coverage from viral stocks and can also detect virus-specific reads from limited starting material. Our data suggest that the procedure can be implemented and easily validated by institutional biosafety committees across research laboratories. PMID:27822536
Gabrieli, Paolo; Gomulski, Ludvik M.; Bonomi, Angelica; Siciliano, Paolo; Scolari, Francesca; Franz, Gerald; Jessup, Andrew; Malacrida, Anna R.; Gasperi, Giuliano
2011-01-01
Background Diptera have an extraordinary variety of sex determination mechanisms, and Drosophila melanogaster is the paradigm for this group. However, the Drosophila sex determination pathway is only partially conserved and the family Tephritidae affords an interesting example. The tephritid Y chromosome is postulated to be necessary to determine male development. Characterization of Y sequences, apart from elucidating the nature of the male determining factor, is also important to understand the evolutionary history of sex chromosomes within the Tephritidae. We studied the Y sequences from the olive fly, Bactrocera oleae. Its Y chromosome is minute and highly heterochromatic, and displays high heteromorphism with the X chromosome. Methodology/Principal Findings A combined Representational Difference Analysis (RDA) and fluorescence in-situ hybridization (FISH) approach was used to investigate the Y chromosome to derive information on its sequence content. The Y chromosome is strewn with repetitive DNA sequences, the majority of which are also interdispersed in the pericentromeric regions of the autosomes. The Y chromosome appears to have accumulated small and large repetitive interchromosomal duplications. The large interchromosomal duplications harbour an importin-4-like gene fragment. Apart from these importin-4-like sequences, the other Y repetitive sequences are not shared with the X chromosome, suggesting molecular differentiation of these two chromosomes. Moreover, as the identified Y sequences were not detected on the Y chromosomes of closely related tephritids, we can infer divergence in the repetitive nature of their sequence contents. Conclusions/Significance The identification of Y-linked sequences may tell us much about the repetitive nature, the origin and the evolution of Y chromosomes. We hypothesize how these repetitive sequences accumulated and were maintained on the Y chromosome during its evolutionary history. Our data reinforce the idea that the sex chromosomes of the Tephritidae may have distinct evolutionary origins with respect to those of the Drosophilidae and other Dipteran families. PMID:21408187
Bowers, Robert M.; Kyrpides, Nikos C.; Stepanauskas, Ramunas; ...
2017-08-08
Here, we present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a MetagenomeAssembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Genemore » Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bowers, Robert M.; Kyrpides, Nikos C.; Stepanauskas, Ramunas
Here, we present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a MetagenomeAssembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Genemore » Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.« less
Cell type-specific termination of transcription by transposable element sequences.
Conley, Andrew B; Jordan, I King
2012-09-30
Transposable elements (TEs) encode sequences necessary for their own transposition, including signals required for the termination of transcription. TE sequences within the introns of human genes show an antisense orientation bias, which has been proposed to reflect selection against TE sequences in the sense orientation owing to their ability to terminate the transcription of host gene transcripts. While there is evidence in support of this model for some elements, the extent to which TE sequences actually terminate transcription of human gene across the genome remains an open question. Using high-throughput sequencing data, we have characterized over 9,000 distinct TE-derived sequences that provide transcription termination sites for 5,747 human genes across eight different cell types. Rarefaction curve analysis suggests that there may be twice as many TE-derived termination sites (TE-TTS) genome-wide among all human cell types. The local chromatin environment for these TE-TTS is similar to that seen for 3' UTR canonical TTS and distinct from the chromatin environment of other intragenic TE sequences. However, those TE-TTS located within the introns of human genes were found to be far more cell type-specific than the canonical TTS. TE-TTS were much more likely to be found in the sense orientation than other intragenic TE sequences of the same TE family and TE-TTS in the sense orientation terminate transcription more efficiently than those found in the antisense orientation. Alu sequences were found to provide a large number of relatively weak TTS, whereas LTR elements provided a smaller number of much stronger TTS. TE sequences provide numerous termination sites to human genes, and TE-derived TTS are particularly cell type-specific. Thus, TE sequences provide a powerful mechanism for the diversification of transcriptional profiles between cell types and among evolutionary lineages, since most TE-TTS are evolutionarily young. The extent of transcription termination by TEs seen here, along with the preference for sense-oriented TE insertions to provide TTS, is consistent with the observed antisense orientation bias of human TEs.
A new molecular evolution model for limited insertion independent of substitution.
Lèbre, Sophie; Michel, Christian J
2013-10-01
We recently introduced a new molecular evolution model called the IDIS model for Insertion Deletion Independent of Substitution [13,14]. In the IDIS model, the three independent processes of substitution, insertion and deletion of residues have constant rates. In order to control the genome expansion during evolution, we generalize here the IDIS model by introducing an insertion rate which decreases when the sequence grows and tends to 0 for a maximum sequence length nmax. This new model, called LIIS for Limited Insertion Independent of Substitution, defines a matrix differential equation satisfied by a vector P(t) describing the sequence content in each residue at evolution time t. An analytical solution is obtained for any diagonalizable substitution matrix M. Thus, the LIIS model gives an expression of the sequence content vector P(t) in each residue under evolution time t as a function of the eigenvalues and the eigenvectors of matrix M, the residue insertion rate vector R, the total insertion rate r, the initial and maximum sequence lengths n0 and nmax, respectively, and the sequence content vector P(t0) at initial time t0. The derivation of the analytical solution is much more technical, compared to the IDIS model, as it involves Gauss hypergeometric functions. Several propositions of the LIIS model are derived: proof that the IDIS model is a particular case of the LIIS model when the maximum sequence length nmax tends to infinity, fixed point, time scale, time step and time inversion. Using a relation between the sequence length l and the evolution time t, an expression of the LIIS model as a function of the sequence length l=n(t) is obtained. Formulas for 'insertion only', i.e. when the substitution rates are all equal to 0, are derived at evolution time t and sequence length l. Analytical solutions of the LIIS model are explicitly derived, as a function of either evolution time t or sequence length l, for two classical substitution matrices: the 3-parameter symmetric substitution matrix [12] (LIIS-SYM3) and the HKY asymmetric substitution matrix[9] (LIIS-HKY). An evaluation of the LIIS model (precisely, LIIS-HKY) based on four statistical analyses of the GC content in complete genomes of four prokaryotic taxonomic groups, namely Chlamydiae, Crenarchaeota, Spirochaetes and Thermotogae, shows the expected improvement from the theory of the LIIS model compared to the IDIS model. Copyright © 2013 Elsevier Inc. All rights reserved.
Yu, Jeong-Nam; Won, Changman; Jun, Jumin; Lim, YoungWoon; Kwak, Myounghai
2011-01-01
Background Microsatellites, a special class of repetitive DNA sequence, have become one of the most popular genetic markers for population/conservation genetic studies. However, its application to endangered species has been impeded by high development costs, a lack of available sequences, and technical difficulties. The water deer Hydropotes inermis is the sole existing endangered species of the subfamily Capreolinae. Although population genetics studies are urgently required for conservation management, no species-specific microsatellite marker has been reported. Methods We adopted next-generation sequencing (NGS) to elucidate the microsatellite markers of Korean water deer and overcome these impediments on marker developments. We performed genotyping to determine the efficiency of this method as applied to population genetics. Results We obtained 98 Mbp of nucleotide information from 260,467 sequence reads. A total of 20,101 di-/tri-nucleotide repeat motifs were identified; di-repeats were 5.9-fold more common than tri-repeats. [CA]n and [AAC]n/[AAT]n repeats were the most frequent di- and tri-repeats, respectively. Of the 17,206 di-repeats, 12,471 microsatellite primer pairs were derived. PCR amplification of 400 primer pairs yielded 106 amplicons and 79 polymorphic markers from 20 individual Korean water deer. Polymorphic rates of the 79 new microsatellites varied from 2 to 11 alleles per locus (He: 0.050–0.880; Ho: 0.000–1.000), while those of known microsatellite markers transferred from cattle to Chinese water deer ranged from 4 to 6 alleles per locus (He: 0.279–0.714; Ho: 0.300–0.400). Conclusions Polymorphic microsatellite markers from Korean water deer were successfully identified using NGS without any prior sequence information and deposited into the public database. Thus, the methods described herein represent a rapid and low-cost way to investigate the population genetics of endangered/non-model species. PMID:22069476
Orchestration of Molecular Information through Higher Order Chemical Recognition
NASA Astrophysics Data System (ADS)
Frezza, Brian M.
Broadly defined, higher order chemical recognition is the process whereby discrete chemical building blocks capable of specifically binding to cognate moieties are covalently linked into oligomeric chains. These chains, or sequences, are then able to recognize and bind to their cognate sequences with a high degree of cooperativity. Principally speaking, DNA and RNA are the most readily obtained examples of this chemical phenomenon, and function via Watson-Crick cognate pairing: guanine pairs with cytosine and adenine with thymine (DNA) or uracil (RNA), in an anti-parallel manner. While the theoretical principles, techniques, and equations derived herein apply generally to any higher-order chemical recognition system, in practice we utilize DNA oligomers as a model-building material to experimentally investigate and validate our hypotheses. Historically, general purpose information processing has been a task limited to semiconductor electronics. Molecular computing on the other hand has been limited to ad hoc approaches designed to solve highly specific and unique computation problems, often involving components or techniques that cannot be applied generally in a manner suitable for precise and predictable engineering. Herein, we provide a fundamental framework for harnessing high-order recognition in a modular and programmable fashion to synthesize molecular information process networks of arbitrary construction and complexity. This document provides a solid foundation for routinely embedding computational capability into chemical and biological systems where semiconductor electronics are unsuitable for practical application.
Kim, Mi Ae; Rhee, Jae-Sung; Kim, Tae Ha; Lee, Jung Sick; Choi, Ah-Young; Choi, Beom-Soon; Choi, Ik-Young; Sohn, Young Chang
2017-03-09
In order to characterize the female or male transcriptome of the Pacific abalone and further increase genomic resources, we sequenced the mRNA of full-length complementary DNA (cDNA) libraries derived from pooled tissues of female and male Haliotis discus hannai by employing the Iso-Seq protocol of the PacBio RSII platform. We successfully assembled whole full-length cDNA sequences and constructed a transcriptome database that included isoform information. After clustering, a total of 15,110 and 12,145 genes that coded for proteins were identified in female and male abalones, respectively. A total of 13,057 putative orthologs were retained from each transcriptome in abalones. Overall Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analyzed in each database showed a similar composition between sexes. In addition, a total of 519 and 391 isoforms were genome-widely identified with at least two isoforms from female and male transcriptome databases. We found that the number of isoforms and their alternatively spliced patterns are variable and sex-dependent. This information represents the first significant contribution to sex-preferential genomic resources of the Pacific abalone. The availability of whole female and male transcriptome database and their isoform information will be useful to improve our understanding of molecular responses and also for the analysis of population dynamics in the Pacific abalone.
Kim, Mi Ae; Rhee, Jae-Sung; Kim, Tae Ha; Lee, Jung Sick; Choi, Ah-Young; Choi, Beom-Soon; Choi, Ik-Young; Sohn, Young Chang
2017-01-01
In order to characterize the female or male transcriptome of the Pacific abalone and further increase genomic resources, we sequenced the mRNA of full-length complementary DNA (cDNA) libraries derived from pooled tissues of female and male Haliotis discus hannai by employing the Iso-Seq protocol of the PacBio RSII platform. We successfully assembled whole full-length cDNA sequences and constructed a transcriptome database that included isoform information. After clustering, a total of 15,110 and 12,145 genes that coded for proteins were identified in female and male abalones, respectively. A total of 13,057 putative orthologs were retained from each transcriptome in abalones. Overall Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analyzed in each database showed a similar composition between sexes. In addition, a total of 519 and 391 isoforms were genome-widely identified with at least two isoforms from female and male transcriptome databases. We found that the number of isoforms and their alternatively spliced patterns are variable and sex-dependent. This information represents the first significant contribution to sex-preferential genomic resources of the Pacific abalone. The availability of whole female and male transcriptome database and their isoform information will be useful to improve our understanding of molecular responses and also for the analysis of population dynamics in the Pacific abalone. PMID:28282934
The recurrence sequences via Sylvester matrices
NASA Astrophysics Data System (ADS)
Karaduman, Erdal; Deveci, Ömür
2017-07-01
In this work, we define the Pell-Jacobsthal-Slyvester sequence and the Jacobsthal-Pell-Slyvester sequence by using the Slyvester matrices which are obtained from the characteristic polynomials of the Pell and Jacobsthal sequences and then, we study the sequences defined modulo m. Also, we obtain the cyclic groups and the semigroups from the generating matrices of these sequences when read modulo m and then, we derive the relationships among the orders of the cyclic groups and the periods of the sequences. Furthermore, we redefine Pell-Jacobsthal-Slyvester sequence and the Jacobsthal-Pell-Slyvester sequence by means of the elements of the groups and then, we examine them in the finite groups.
Possenti, Andrea; Vendruscolo, Michele; Camilloni, Carlo; Tiana, Guido
2018-05-23
Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility to encode for such functions is controlled by the balance between the amount of information supplied by the sequence and that left after that the protein has folded into its structure. We study the amount of information necessary to specify the protein structure, providing an estimate that keeps into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding for its structure (the 'information gap') is very close to what needed to encode for its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize artificially-designed protein sequences. This article is protected by copyright. All rights reserved. © 2018 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Jamal, Wasifa; Das, Saptarshi; Maharatna, Koushik; Pan, Indranil; Kuyucu, Doga
2015-09-01
Degree of phase synchronization between different Electroencephalogram (EEG) channels is known to be the manifestation of the underlying mechanism of information coupling between different brain regions. In this paper, we apply a continuous wavelet transform (CWT) based analysis technique on EEG data, captured during face perception tasks, to explore the temporal evolution of phase synchronization, from the onset of a stimulus. Our explorations show that there exists a small set (typically 3-5) of unique synchronized patterns or synchrostates, each of which are stable of the order of milliseconds. Particularly, in the beta (β) band, which has been reported to be associated with visual processing task, the number of such stable states has been found to be three consistently. During processing of the stimulus, the switching between these states occurs abruptly but the switching characteristic follows a well-behaved and repeatable sequence. This is observed in a single subject analysis as well as a multiple-subject group-analysis in adults during face perception. We also show that although these patterns remain topographically similar for the general category of face perception task, the sequence of their occurrence and their temporal stability varies markedly between different face perception scenarios (stimuli) indicating toward different dynamical characteristics for information processing, which is stimulus-specific in nature. Subsequently, we translated these stable states into brain complex networks and derived informative network measures for characterizing the degree of segregated processing and information integration in those synchrostates, leading to a new methodology for characterizing information processing in human brain. The proposed methodology of modeling the functional brain connectivity through the synchrostates may be viewed as a new way of quantitative characterization of the cognitive ability of the subject, stimuli and information integration/segregation capability.
A novel procedure on next generation sequencing data analysis using text mining algorithm.
Zhao, Weizhong; Chen, James J; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen
2016-05-13
Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutional studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of the Salmonella enterica strains were used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. The output topics by LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. The implementation of topic modeling in NGS data analysis procedure provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data. The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
Fellner, Lea; Huptas, Christopher; Simon, Svenja; Mühlig, Anna; Neuhaus, Klaus
2016-01-01
Escherichia coli O157:H7 EDL933, isolated in 1982 in the United States, was the first enterohemorrhagic E. coli (EHEC) strain sequenced. Unfortunately, European labs can no longer receive the original strain. We checked three European EDL933 derivatives and found major genetic deviations (deletions, inversions) in two strains. All EDL933 strains contain the cryptic EHEC-plasmid, not reported before. PMID:27056239
Zheng, Deyou
2008-01-01
Background Sequencing and annotation of several mammalian genomes have revealed that segmental duplications are a common architectural feature of primate genomes; in fact, about 5% of the human genome is composed of large blocks of interspersed segmental duplications. These segmental duplications have been implicated in genomic copy-number variation, gene novelty, and various genomic disorders. However, the molecular processes involved in the evolution and regulation of duplicated sequences remain largely unexplored. Results In this study, the profile of about 20 histone modifications within human segmental duplications was characterized using high-resolution, genome-wide data derived from a ChIP-Seq study. The analysis demonstrates that derivative loci of segmental duplications often differ significantly from the original with respect to many histone methylations. Further investigation showed that genes are present three times more frequently in the original than in the derivative, whereas pseudogenes exhibit the opposite trend. These asymmetries tend to increase with the age of segmental duplications. The uneven distribution of genes and pseudogenes does not, however, fully account for the asymmetry in the profile of histone modifications. Conclusion The first systematic analysis of histone modifications between segmental duplications demonstrates that two seemingly 'identical' genomic copies are distinct in their epigenomic properties. Results here suggest that local chromatin environments may be implicated in the discrimination of derived copies of segmental duplications from their originals, leading to a biased pseudogenization of the new duplicates. The data also indicate that further exploration of the interactions between histone modification and sequence degeneration is necessary in order to understand the divergence of duplicated sequences. PMID:18598352
[Study on ITS sequences of Aconitum vilmorinianum and its medicinal adulterant].
Zhang, Xiao-nan; Du, Chun-hua; Fu, De-huan; Gao, Li; Zhou, Pei-jun; Wang, Li
2012-09-01
To analyze and compare the ITS sequences of Aconitum vilmorinianum and its medicinal adulterant Aconitum austroyunnanense. Total genomic DNA were extracted from sample materials by improved CTAB method, ITS sequences of samples were amplified using PCR systems, directly sequenced and analyzed using software DNAStar, ClustalX1.81 and MEGA 4.0. 299 consistent sites, 19 variable sites and 13 informative sites were found in ITS1 sequences, 162 consistent sites, 2 variable sites and 1 informative sites were found in 5.8S sequences, 217 consistent sites, 3 variable sites and 1 informative site were found in ITS2 sequences. Base transition and transversion was not found only in 5.8S sequences, 2 sites transition and 1 site transversion were found in ITS1 sequences, only 1 site transversion was found in ITS2 sequences comparting the ITS sequences data matrix. By analyzing the ITS sequences data matrix from 2 population of Aconitum vilmorinianum and 3 population of Aconitum austroyunnanense, we found a stable informative site at the 596th base in ITS2 sequences, in all the samples of Aconitum vilmorinianum the base was C, and in all the samples of Aconitum austroyunnanense the base was A. Aconitum vilmorinianum and Aconitum austroyunnanense can be identified by their characters of ITS sequences, and the variable sites in ITS1 sequences are more than in ITS2 sequences.
NASA Astrophysics Data System (ADS)
Zhang, Ji; Li, Tao; Zheng, Shiqiang; Li, Yiyong
2015-03-01
To reduce the effects of respiratory motion in the quantitative analysis based on liver contrast-enhanced ultrasound (CEUS) image sequencesof single mode. The image gating method and the iterative registration method using model image were adopted to register liver contrast-enhanced ultrasound image sequences of single mode. The feasibility of the proposed respiratory motion correction method was explored preliminarily using 10 hepatocellular carcinomas CEUS cases. The positions of the lesions in the time series of 2D ultrasound images after correction were visually evaluated. Before and after correction, the quality of the weighted sum of transit time (WSTT) parametric images were also compared, in terms of the accuracy and spatial resolution. For the corrected and uncorrected sequences, their mean deviation values (mDVs) of time-intensity curve (TIC) fitting derived from CEUS sequences were measured. After the correction, the positions of the lesions in the time series of 2D ultrasound images were almost invariant. In contrast, the lesions in the uncorrected images all shifted noticeably. The quality of the WSTT parametric maps derived from liver CEUS image sequences were improved more greatly. Moreover, the mDVs of TIC fitting derived from CEUS sequences after the correction decreased by an average of 48.48+/-42.15. The proposed correction method could improve the accuracy of quantitative analysis based on liver CEUS image sequences of single mode, which would help in enhancing the differential diagnosis efficiency of liver tumors.
Blasi, Maria; Carpenter, J Harris; Balakumaran, Bala; Cara, Andrea; Gao, Feng; Klotman, Mary E
2015-08-24
HIV-1 persists indefinitely in memory CD4 T cells and other long-lived cellular reservoirs despite antiretroviral therapy. Our group had previously demonstrated that HIV-1 can establish a productive infection in renal epithelial cells and that the kidney represents a separate compartment for HIV-1 replication. Here, to better understand the viruses in this unique site, we genetically characterized and compared the viruses in blood and urine specimens from 24 HIV-1 infected patients with detectable viremia. Blood and urine samples were obtained from 35 HIV-1 positive patients. Single-genome amplification was performed on HIV-1 env RNA and DNA isolated from urine supernatants and urine-derived cell pellets, respectively, as well as from plasma and peripheral blood mononuclear cell from the same individuals. Neighbor-joining trees were constructed under the Kimura 2-parameter model. We amplified and sequenced the full-length HIV-1 envelope (env) gene from 12 of the 24 individuals, indicating that 50% of the viremic HIV-1-positive patients had viral RNA in their urine. Phylogenetic analysis of the env sequences from four individuals with more than 15 urine-derived env sequences showed that the majority of the sequences from urine formed distinct cluster(s) independent of those peripheral blood mononuclear cell and plasma-derived sequences, consistent with viral compartmentalization in the urine. Our results suggest the presence of a distinct HIV compartment in the genitourinary tract.
BLASI, Maria; CARPENTER, J. Harris; BALAKUMARAN, Bala; CARA, Andrea; GAO, Feng; KLOTMAN, Mary E.
2015-01-01
Objective HIV-1 persists indefinitely in memory CD4+ T cells and other long-lived cellular reservoirs despite antiretroviral therapy (ART). Our group had previously demonstrated that HIV-1 can establish a productive infection in renal epithelial cells and that the kidney represents a separate compartment for HIV-1 replication. Here, to better understand the viruses in this unique site, we genetically characterized and compared the viruses in blood and urine specimens from twenty-four HIV-1 infected subjects with detectable viremia. Design and Methods Blood and urine samples were obtained from 35 HIV-1 positive subjects. Single-genome amplification was performed on HIV-1 env RNA and DNA isolated from urine supernatants and urine derived cell pellets respectively, as well as from plasma and PBMC from the same individuals. Neighbor-joining trees were constructed under the Kimura 2-parameter mode. Results We amplified and sequenced the full-length HIV-1 envelope (env) gene from twelve of the twenty-four individuals, indicating that fifty percent (50%) of the viremic HIV-1 positive patients had viral RNA in their urine. Phylogenetic analysis of the env sequences from four subjects with more than fifteen urine-derived env sequences showed that the majority of the sequences from urine formed distinct cluster(s) independent of those PBMC and plasma-derived sequences, consistent with viral compartmentalization in the urine. Conclusions Our results suggest the presence of a distinct HIV compartment in the genitourinary tract. PMID:26372275
Characterization of a native hammerhead ribozyme derived from schistosomes
OSBORNE, EDITH M.; SCHAAK, JANELL E.; DEROSE, VICTORIA J.
2005-01-01
A recent re-examination of the role of the helices surrounding the conserved core of the hammerhead ribozyme has identified putative loop–loop interactions between stems I and II in native hammerhead sequences. These extended hammerhead sequences are more active at low concentrations of divalent cations than are minimal hammerheads. The loop–loop interactions are proposed to stabilize a more active conformation of the conserved core. Here, a kinetic and thermodynamic characterization of an extended hammerhead sequence derived from Schistosoma mansoni is performed. Biphasic kinetics are observed, suggesting the presence of at least two conformers, one cleaving with a fast rate and the other with a slow rate. Replacing loop II with a poly(U) sequence designed to eliminate the interaction between the two loops results in greatly diminished activity, suggesting that the loop–loop interactions do aid in forming a more active conformation. Previous studies with minimal hammerheads have shown deleterious effects of Rp-phosphorothioate substitutions at the cleavage site and 5′ to A9, both of which could be rescued with Cd2+. Here, phosphorothioate modifications at the cleavage site and 5′ to A9 were made in the schistosome-derived sequence. In Mg2+, both phosphorothioate substitutions decreased the overall fraction cleaved without significantly affecting the observed rate of cleavage. The addition of Cd2+ rescued cleavage in both cases, suggesting that these are still putative metal binding sites in this native sequence. PMID:15659358
Meaningful call combinations and compositional processing in the southern pied babbler
Engesser, Sabrina; Ridley, Amanda R.; Townsend, Simon W.
2016-01-01
Language’s expressive power is largely attributable to its compositionality: meaningful words are combined into larger/higher-order structures with derived meaning. Despite its importance, little is known regarding the evolutionary origins and emergence of this syntactic ability. Although previous research has shown a rudimentary capability to combine meaningful calls in primates, because of a scarcity of comparative data, it is unclear to what extent analog forms might also exist outside of primates. Here, we address this ambiguity and provide evidence for rudimentary compositionality in the discrete vocal system of a social passerine, the pied babbler (Turdoides bicolor). Natural observations and predator presentations revealed that babblers produce acoustically distinct alert calls in response to close, low-urgency threats and recruitment calls when recruiting group members during locomotion. On encountering terrestrial predators, both vocalizations are combined into a “mobbing sequence,” potentially to recruit group members in a dangerous situation. To investigate whether babblers process the sequence in a compositional way, we conducted systematic experiments, playing back the individual calls in isolation as well as naturally occurring and artificial sequences. Babblers reacted most strongly to mobbing sequence playbacks, showing a greater attentiveness and a quicker approach to the loudspeaker, compared with individual calls or control sequences. We conclude that the sequence constitutes a compositional structure, communicating information on both the context and the requested action. Our work supports previous research suggesting combinatoriality as a viable mechanism to increase communicative output and indicates that the ability to combine and process meaningful vocal structures, a basic syntax, may be more widespread than previously thought. PMID:27155011
Meaningful call combinations and compositional processing in the southern pied babbler.
Engesser, Sabrina; Ridley, Amanda R; Townsend, Simon W
2016-05-24
Language's expressive power is largely attributable to its compositionality: meaningful words are combined into larger/higher-order structures with derived meaning. Despite its importance, little is known regarding the evolutionary origins and emergence of this syntactic ability. Although previous research has shown a rudimentary capability to combine meaningful calls in primates, because of a scarcity of comparative data, it is unclear to what extent analog forms might also exist outside of primates. Here, we address this ambiguity and provide evidence for rudimentary compositionality in the discrete vocal system of a social passerine, the pied babbler (Turdoides bicolor). Natural observations and predator presentations revealed that babblers produce acoustically distinct alert calls in response to close, low-urgency threats and recruitment calls when recruiting group members during locomotion. On encountering terrestrial predators, both vocalizations are combined into a "mobbing sequence," potentially to recruit group members in a dangerous situation. To investigate whether babblers process the sequence in a compositional way, we conducted systematic experiments, playing back the individual calls in isolation as well as naturally occurring and artificial sequences. Babblers reacted most strongly to mobbing sequence playbacks, showing a greater attentiveness and a quicker approach to the loudspeaker, compared with individual calls or control sequences. We conclude that the sequence constitutes a compositional structure, communicating information on both the context and the requested action. Our work supports previous research suggesting combinatoriality as a viable mechanism to increase communicative output and indicates that the ability to combine and process meaningful vocal structures, a basic syntax, may be more widespread than previously thought.
Brody, Thomas; Yavatkar, Amarendra S; Kuzin, Alexander; Kundu, Mukta; Tyson, Leonard J; Ross, Jermaine; Lin, Tzu-Yang; Lee, Chi-Hon; Awasaki, Takeshi; Lee, Tzumin; Odenwald, Ward F
2012-01-01
Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and analysis that takes full-advantage of the sequence information within enhancer CSCs. Results: We have generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable the discovery of functionally related enhancers. The program first identifies conserved repeat elements within an input enhancer and then searches the database for CSCs that score highly against the input CSC. Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the balance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, providing insights into the mode of integration of multiple regulating transcription factors. The database and accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any developmental process. Developmental Dynamics 241:169–189, 2012. © 2011 Wiley Periodicals, Inc. Key findings A genome-wide catalog of Drosophila conserved DNA sequence clusters. cis-Decoder discovers functionally related enhancers. Functionally related enhancers share balanced sequence element copy numbers. Many enhancers function during multiple phases of development. PMID:22174086
Van Ooteghem, Karen; Frank, James S.; Allard, Fran; Horak, Fay B
2011-01-01
Postural motor learning for dynamic balance tasks has been demonstrated in healthy older adults (Van Ooteghem et al. 2009). The purpose of this study was to investigate the type of knowledge (general or specific) obtained with balance training in this age group and to examine whether embedding perturbation regularities within a balance task masks specific learning. Two groups of older adults maintained balance on a constant frequency-variable amplitude oscillating platform. One group was trained using an embedded sequence (ES) protocol which contained the same 15-s sequence of variable amplitude oscillations in the middle of each trial. A second group was trained using a looped sequence (LS) protocol which contained a 15-s sequence repeated three times to form each trial. All trials were 45-s. Participants were not informed of any repetition. To examine learning, participants performed a retention test following a 24-h delay. LS participants also completed a transfer task. Specificity of learning was examined by comparing performance for repeated versus random sequences (ES) and training versus transfer sequences (LS). Performance was measured by deriving spatial and temporal measures of whole body centre of mass (COM), and trunk orientation. Both groups improved performance with practice as characterized by reduced COM displacement, improved COM-platform phase relationships, and decreased angular trunk motion. Improvements were also characterized by general rather than specific postural motor learning. These findings are similar to young adults (Van Ooteghem et al. 2008) and indicate that age does not influence the type of learning which occurs for balance control. PMID:20544184
Lisi, Simonetta; Chirichella, Michele; Arisi, Ivan; Goracci, Martina; Cremisi, Federico; Cattaneo, Antonino
2017-01-01
Antibody libraries are important resources to derive antibodies to be used for a wide range of applications, from structural and functional studies to intracellular protein interference studies to developing new diagnostics and therapeutics. Whatever the goal, the key parameter for an antibody library is its complexity (also known as diversity), i.e. the number of distinct elements in the collection, which directly reflects the probability of finding in the library an antibody against a given antigen, of sufficiently high affinity. Quantitative evaluation of antibody library complexity and quality has been for a long time inadequately addressed, due to the high similarity and length of the sequences of the library. Complexity was usually inferred by the transformation efficiency and tested either by fingerprinting and/or sequencing of a few hundred random library elements. Inferring complexity from such a small sampling is, however, very rudimental and gives limited information about the real diversity, because complexity does not scale linearly with sample size. Next-generation sequencing (NGS) has opened new ways to tackle the antibody library complexity quality assessment. However, much remains to be done to fully exploit the potential of NGS for the quantitative analysis of antibody repertoires and to overcome current limitations. To obtain a more reliable antibody library complexity estimate here we show a new, PCR-free, NGS approach to sequence antibody libraries on Illumina platform, coupled to a new bioinformatic analysis and software (Diversity Estimator of Antibody Library, DEAL) that allows to reliably estimate the complexity, taking in consideration the sequencing error. PMID:28505201
BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences
Gao, Jianzhao; Faraggi, Eshel; Zhou, Yaoqi; Ruan, Jishou; Kurgan, Lukasz
2012-01-01
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of the B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some method that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred, method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction at 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which is significantly and slightly better than the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods. PMID:22761950
Searching for evidence of selection in avian DNA barcodes.
Kerr, Kevin C R
2011-11-01
The barcode of life project has assembled a tremendous number of mitochondrial cytochrome c oxidase I (COI) sequences. Although these sequences were gathered to develop a DNA-based system for species identification, it has been suggested that further biological inferences may also be derived from this wealth of data. Recurrent selective sweeps have been invoked as an evolutionary mechanism to explain limited intraspecific COI diversity, particularly in birds, but this hypothesis has not been formally tested. In this study, I collated COI sequences from previous barcoding studies on birds and tested them for evidence of selection. Using this expanded data set, I re-examined the relationships between intraspecific diversity and interspecific divergence and sampling effort, respectively. I employed the McDonald-Kreitman test to test for neutrality in sequence evolution between closely related pairs of species. Because amino acid sequences were generally constrained between closely related pairs, I also included broader intra-order comparisons to quantify patterns of protein variation in avian COI sequences. Lastly, using 22 published whole mitochondrial genomes, I compared the evolutionary rate of COI against the other 12 protein-coding mitochondrial genes to assess intragenomic variability. I found no conclusive evidence of selective sweeps. Most evidence pointed to an overall trend of strong purifying selection and functional constraint. The COI protein did vary across the class Aves, but to a very limited extent. COI was the least variable gene in the mitochondrial genome, suggesting that other genes might be more informative for probing factors constraining mitochondrial variation within species. © 2011 Blackwell Publishing Ltd.
Workman, Rachael E; Myrka, Alexander M; Wong, G William; Tseng, Elizabeth; Welch, Kenneth C; Timp, Winston
2018-03-01
Hummingbirds oxidize ingested nectar sugars directly to fuel foraging but cannot sustain this fuel use during fasting periods, such as during the night or during long-distance migratory flights. Instead, fasting hummingbirds switch to oxidizing stored lipids that are derived from ingested sugars. The hummingbird liver plays a key role in moderating energy homeostasis and this remarkable capacity for fuel switching. Additionally, liver is the principle location of de novo lipogenesis, which can occur at exceptionally high rates, such as during premigratory fattening. Yet understanding how this tissue and whole organism moderates energy turnover is hampered by a lack of information regarding how relevant enzymes differ in sequence, expression, and regulation. We generated a de novo transcriptome of the hummingbird liver using PacBio full-length cDNA sequencing (Iso-Seq), yielding 8.6Gb of sequencing data, or 2.6M reads from 4 different size fractions. We analyzed data using the SMRTAnalysis v3.1 Iso-Seq pipeline, then clustered isoforms into gene families to generate de novo gene contigs using Cogent. We performed orthology analysis to identify closely related sequences between our transcriptome and other avian and human gene sets. Finally, we closely examined homology of critical lipid metabolism genes between our transcriptome data and avian and human genomes. We confirmed high levels of sequence divergence within hummingbird lipogenic enzymes, suggesting a high probability of adaptive divergent function in the hepatic lipogenic pathways. Our results leverage cutting-edge technology and a novel bioinformatics pipeline to provide a first direct look at the transcriptome of this incredible organism.
Fantini, Marco; Pandolfini, Luca; Lisi, Simonetta; Chirichella, Michele; Arisi, Ivan; Terrigno, Marco; Goracci, Martina; Cremisi, Federico; Cattaneo, Antonino
2017-01-01
Antibody libraries are important resources to derive antibodies to be used for a wide range of applications, from structural and functional studies to intracellular protein interference studies to developing new diagnostics and therapeutics. Whatever the goal, the key parameter for an antibody library is its complexity (also known as diversity), i.e. the number of distinct elements in the collection, which directly reflects the probability of finding in the library an antibody against a given antigen, of sufficiently high affinity. Quantitative evaluation of antibody library complexity and quality has been for a long time inadequately addressed, due to the high similarity and length of the sequences of the library. Complexity was usually inferred by the transformation efficiency and tested either by fingerprinting and/or sequencing of a few hundred random library elements. Inferring complexity from such a small sampling is, however, very rudimental and gives limited information about the real diversity, because complexity does not scale linearly with sample size. Next-generation sequencing (NGS) has opened new ways to tackle the antibody library complexity quality assessment. However, much remains to be done to fully exploit the potential of NGS for the quantitative analysis of antibody repertoires and to overcome current limitations. To obtain a more reliable antibody library complexity estimate here we show a new, PCR-free, NGS approach to sequence antibody libraries on Illumina platform, coupled to a new bioinformatic analysis and software (Diversity Estimator of Antibody Library, DEAL) that allows to reliably estimate the complexity, taking in consideration the sequencing error.
Hodgetts, Jennifer; Boonham, Neil; Mumford, Rick; Harrison, Nigel; Dickinson, Matthew
2008-08-01
Phytoplasma phylogenetics has focused primarily on sequences of the non-coding 16S rRNA gene and the 16S-23S rRNA intergenic spacer region (16-23S ISR), and primers that enable amplification of these regions from all phytoplasmas by PCR are well established. In this study, primers based on the secA gene have been developed into a semi-nested PCR assay that results in a sequence of the expected size (about 480 bp) from all 34 phytoplasmas examined, including strains representative of 12 16Sr groups. Phylogenetic analysis of secA gene sequences showed similar clustering of phytoplasmas when compared with clusters resolved by similar sequence analyses of a 16-23S ISR-23S rRNA gene contig or of the 16S rRNA gene alone. The main differences between trees were in the branch lengths, which were elongated in the 16-23S ISR-23S rRNA gene tree when compared with the 16S rRNA gene tree and elongated still further in the secA gene tree, despite this being a shorter sequence. The improved resolution in the secA gene-derived phylogenetic tree resulted in the 16SrII group splitting into two distinct clusters, while phytoplasmas associated with coconut lethal yellowing-type diseases split into three distinct groups, thereby supporting past proposals that they represent different candidate species within 'Candidatus Phytoplasma'. The ability to differentiate 16Sr groups and subgroups by virtual RFLP analysis of secA gene sequences suggests that this gene may provide an informative alternative molecular marker for pathogen identification and diagnosis of phytoplasma diseases.
Hara, Yuichiro; Tatsumi, Kaori; Yoshida, Michio; Kajikawa, Eriko; Kiyonari, Hiroshi; Kuraku, Shigehiro
2015-11-18
RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, there remains a large room for optimizing its workflow, in order to take full advantage of continuously developing sequencing capacity. Transcriptome sequencing for three embryonic stages of Madagascar ground gecko (Paroedura picta) was performed with the Illumina platform. The output reads were assembled de novo for reconstructing transcript sequences. In order to evaluate the completeness of transcriptome assemblies, we prepared a reference gene set consisting of vertebrate one-to-one orthologs. To take advantage of increased read length of >150 nt, we demonstrated shortened RNA fragmentation time, which resulted in a dramatic shift of insert size distribution. To evaluate products of multiple de novo assembly runs incorporating reads with different RNA sources, read lengths, and insert sizes, we introduce a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species)., The completeness assessment performed by the computational pipelines CEGMA and BUSCO referring to CVG, demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, we have derived the most comprehensive transcript sequence set of the Madagascar ground gecko by means of assembling individual libraries followed by clustering the assembled sequences based on their overall similarities. Our results provide several insights into optimizing de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. The approach and assembly assessment with CVG demonstrated here would be applicable to transcriptome analysis of other species as well as whole genome analyses.
Unveiling the Hybrid Genome Structure of Escherichia coli RR1 (HB101 RecA+)
Jeong, Haeyoung; Sim, Young Mi; Kim, Hyun Ju; Lee, Sang Jun
2017-01-01
There have been extensive genome sequencing studies for Escherichia coli strains, particularly for pathogenic isolates, because fast determination of pathogenic potential and/or drug resistance and their propagation routes is crucial. For laboratory E. coli strains, however, genome sequence information is limited except for several well-known strains. We determined the complete genome sequence of laboratory E. coli strain RR1 (HB101 RecA+), which has long been used as a general cloning host. A hybrid genome sequence of K-12 MG1655 and B BL21(DE3) was constructed based on the initial mapping of Illumina HiSeq reads to each reference, and iterative rounds of read mapping, variant detection, and consensus extraction were carried out. Finally, PCR and Sanger sequencing-based finishing were applied to resolve non-single nucleotide variant regions with aberrant read depths and breakpoints, most of them resulting from prophages and insertion sequence transpositions that are not present in the reference genome sequence. We found that 96.9% of the RR1 genome is derived from K-12, and identified exact crossover junctions between K-12 and B genomic fragments. However, because RR1 has experienced a series of genetic manipulations since branching from the common ancestor, it has a set of mutations different from those found in K-12 MG1655. As well as identifying all known genotypes of RR1 on the basis of genomic context, we found novel mutations. Our results extend current knowledge of the genotype of RR1 and its relatives, and provide insights into the pedigree, genomic background, and physiology of common laboratory strains. PMID:28421066
Efficient conformational space exploration in ab initio protein folding simulation.
Ullah, Ahammed; Ahmed, Nasif; Pappu, Subrata Dey; Shatabda, Swakkhar; Ullah, A Z M Dayem; Rahman, M Sohel
2015-08-01
Ab initio protein folding simulation largely depends on knowledge-based energy functions that are derived from known protein structures using statistical methods. These knowledge-based energy functions provide us with a good approximation of real protein energetics. However, these energy functions are not very informative for search algorithms and fail to distinguish the types of amino acid interactions that contribute largely to the energy function from those that do not. As a result, search algorithms frequently get trapped into the local minima. On the other hand, the hydrophobic-polar (HP) model considers hydrophobic interactions only. The simplified nature of HP energy function makes it limited only to a low-resolution model. In this paper, we present a strategy to derive a non-uniform scaled version of the real 20×20 pairwise energy function. The non-uniform scaling helps tackle the difficulty faced by a real energy function, whereas the integration of 20×20 pairwise information overcomes the limitations faced by the HP energy function. Here, we have applied a derived energy function with a genetic algorithm on discrete lattices. On a standard set of benchmark protein sequences, our approach significantly outperforms the state-of-the-art methods for similar models. Our approach has been able to explore regions of the conformational space which all the previous methods have failed to explore. Effectiveness of the derived energy function is presented by showing qualitative differences and similarities of the sampled structures to the native structures. Number of objective function evaluation in a single run of the algorithm is used as a comparison metric to demonstrate efficiency.
Nucleotide sequences specific to Yersinia pestis and methods for the detection of Yersinia pestis
McCready, Paula M [Tracy, CA; Radnedge, Lyndsay [San Mateo, CA; Andersen, Gary L [Berkeley, CA; Ott, Linda L [Livermore, CA; Slezak, Thomas R [Livermore, CA; Kuczmarski, Thomas A [Livermore, CA; Motin, Vladinir L [League City, TX
2009-02-24
Nucleotide sequences specific to Yersinia pestis that serve as markers or signatures for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.
Nucleotide sequences specific to Brucella and methods for the detection of Brucella
DOE Office of Scientific and Technical Information (OSTI.GOV)
McCready, Paula M; Radnedge, Lyndsay; Andersen, Gary L
Nucleotide sequences specific to Brucella that serves as a marker or signature for identification of this bacterium were identified. In addition, forward and reverse primers and hybridization probes derived from these nucleotide sequences that are used in nucleotide detection methods to detect the presence of the bacterium are disclosed.
ESTimating plant phylogeny: lessons from partitioning
de la Torre, Jose EB; Egan, Mary G; Katari, Manpreet S; Brenner, Eric D; Stevenson, Dennis W; Coruzzi, Gloria M; DeSalle, Rob
2006-01-01
Background While Expressed Sequence Tags (ESTs) have proven a viable and efficient way to sample genomes, particularly those for which whole-genome sequencing is impractical, phylogenetic analysis using ESTs remains difficult. Sequencing errors and orthology determination are the major problems when using ESTs as a source of characters for systematics. Here we develop methods to incorporate EST sequence information in a simultaneous analysis framework to address controversial phylogenetic questions regarding the relationships among the major groups of seed plants. We use an automated, phylogenetically derived approach to orthology determination called OrthologID generate a phylogeny based on 43 process partitions, many of which are derived from ESTs, and examine several measures of support to assess the utility of EST data for phylogenies. Results A maximum parsimony (MP) analysis resulted in a single tree with relatively high support at all nodes in the tree despite rampant conflict among trees generated from the separate analysis of individual partitions. In a comparison of broader-scale groupings based on cellular compartment (ie: chloroplast, mitochondrial or nuclear) or function, only the nuclear partition tree (based largely on EST data) was found to be topologically identical to the tree based on the simultaneous analysis of all data. Despite topological conflict among the broader-scale groupings examined, only the tree based on morphological data showed statistically significant differences. Conclusion Based on the amount of character support contributed by EST data which make up a majority of the nuclear data set, and the lack of conflict of the nuclear data set with the simultaneous analysis tree, we conclude that the inclusion of EST data does provide a viable and efficient approach to address phylogenetic questions within a parsimony framework on a genomic scale, if problems of orthology determination and potential sequencing errors can be overcome. In addition, approaches that examine conflict and support in a simultaneous analysis framework allow for a more precise understanding of the evolutionary history of individual process partitions and may be a novel way to understand functional aspects of different kinds of cellular classes of gene products. PMID:16776834
A multi-wavelength study of the evolution of early-type galaxies in groups: the ultraviolet view
NASA Astrophysics Data System (ADS)
Rampazzo, R.; Mazzei, P.; Marino, A.; Bianchi, L.; Plana, H.; Trinchieri, G.; Uslenghi, M.; Wolter, A.
2018-04-01
The ultraviolet-optical colour magnitude diagram of rich galaxy groups is characterised by a well developed Red Sequence, a Blue Cloud and the so-called Green Valley. Loose, less evolved groups of galaxies which are probably not virialised yet may lack a well defined Red Sequence. This is actually explained in the framework of galaxy evolution. We are focussing on understanding galaxy migration towards the Red Sequence, checking for signatures of such a transition in their photometric and morphological properties. We report on the ultraviolet properties of a sample of early-type (ellipticals+S0s) galaxies inhabiting the Red Sequence. The analysis of their structures, as derived by fitting a Sérsic law to their ultraviolet luminosity profiles, suggests the presence of an underlying disk. This is the hallmark of dissipation processes that still must have a role to play in the evolution of this class of galaxies. Smooth particle hydrodynamic simulations with chemo-photometric implementations able to match the global properties of our targets are used to derive their evolutionary paths through ultraviolet-optical colour magnitude diagrams, providing some fundamental information such as the crossing time through the Green Valley, which depends on their luminosity. The transition from the Blue Cloud to the Red Sequence takes several Gyrs, being about 3-5 Gyr for the brightest galaxies and longer for fainter ones, if occurring. The photometric study of nearby galaxy structures in the ultraviolet is seriously hampered by either the limited field of view of the cameras (e.g., in Hubble Space Telescope) or by the low spatial resolution of the images (e.g., in the Galaxy Evolution Explorer). Current missions equipped with telescopes and cameras sensitive to ultraviolet wavelengths, such as Swift- UVOT and Astrosat-UVIT, provide a relatively large field of view and a better resolution than the Galaxy Evolution Explorer. More powerful ultraviolet instruments (size, resolution and field of view) are obviously bound to yield fundamental advances in the accuracy and depth of the surface photometry and in the characterisation of the galaxy environment.
Kringel, Dario; Geisslinger, Gerd; Resch, Eduard; Oertel, Bruno G; Thrun, Michael C; Heinemann, Sarah; Lötsch, Jörn
2018-03-27
Heat pain and its modulation by capsaicin varies among subjects in experimental and clinical settings. A plausible cause is a genetic component, of which TRPV1 ion channels, by their response to both heat and capsaicin, are primary candidates. However, TRPA1 channels can heterodimerize with TRPV1 channels and carry genetic variants reported to modulate heat pain sensitivity. To address the role of these candidate genes in capsaicin-induced hypersensitization to heat, pain thresholds acquired before and after topical application of capsaicin and TRPA1/TRPV1 exomic sequences derived by next-generation sequencing were assessed in n = 75 healthy volunteers and the genetic information comprised 278 loci. Gaussian mixture modeling indicated 2 phenotype groups with high or low capsaicin-induced hypersensitization to heat. Unsupervised machine learning implemented as swarm-based clustering hinted at differences in the genetic pattern between these phenotype groups. Several methods of supervised machine learning implemented as random forests, adaptive boosting, k-nearest neighbors, naive Bayes, support vector machines, and for comparison, binary logistic regression predicted the phenotype group association consistently better when based on the observed genotypes than when using a random permutation of the exomic sequences. Of note, TRPA1 variants were more important for correct phenotype group association than TRPV1 variants. This indicates a role of the TRPA1 and TRPV1 next-generation sequencing-based genetic pattern in the modulation of the individual response to heat-related pain phenotypes. When considering earlier evidence that topical capsaicin can induce neuropathy-like quantitative sensory testing patterns in healthy subjects, implications for future analgesic treatments with transient receptor potential inhibitors arise.This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CCBY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal.
Mota, Talia M; Murray, John M; Center, Rob J; Purcell, Damian F J; McCaw, James M
2012-06-25
The characterization of HIV-1 transmission strains may inform the design of an effective vaccine. Shorter variable loops with fewer predicted glycosites have been suggested as signatures enriched in envelope sequences derived during acute HIV-1 infection. Specifically, a transmission-linked lack of glycosites within the V1 and V2 loops of gp120 provides greater access to an α4β7 binding motif, which promotes the establishment of infection. Also, a histidine at position 12 in the leader sequence of Env has been described as a transmission signature that is selected against during chronic infection. The purpose of this study is to measure the association of the presence of an α4β7 binding motif, the number of N-linked glycosites, the length of the variable loops, and the prevalence of histidine at position 12 with HIV-1 transmission. A case-control study design was used to measure the prevalence of these variables between subtype B and C transmission sequences and frequency-matched randomly-selected sequences derived from chronically infected controls. Subtype B transmission strains had shorter V3 regions than chronic strains (p = 0.031); subtype C transmission strains had shorter V1 loops than chronic strains (p = 0.047); subtype B transmission strains had more V3 loop glycosites (p = 0.024) than chronic strains. Further investigation showed that these statistically significant results were unlikely to be biologically meaningful. Also, there was no difference observed in the prevalence of a histidine at position 12 among transmission strains and controls of either subtype. Although a genetic bottleneck is observed after HIV-1 transmission, our results indicate that summary characteristics of Env hypothesised to be important in transmission are not divergent between transmission and chronic strains of either subtype. The success of a transmission strain to initiate infection may be a random event from the divergent pool of donor viral sequences. The characteristics explored through this study are important, but may not function as genotypic signatures of transmission as previously described.
PMS2 gene mutational analysis: direct cDNA sequencing to circumvent pseudogene interference.
Wimmer, Katharina; Wernstedt, Annekatrin
2014-01-01
The presence of highly homologous pseudocopies can compromise the mutation analysis of a gene of interest. In particular, when using PCR-based strategies, pseudogene co-amplification has to be effectively prevented. This is often achieved by using primers designed to be parental gene specific according to the reference sequence and by applying stringent PCR conditions. However, there are cases in which this approach is of limited utility. For example, it has been shown that the PMS2 gene exchanges sequences with one of its pseudogenes, named PMS2CL. This results in functional PMS2 alleles containing pseudogene-derived sequences at their 3'-end and in nonfunctional PMS2CL pseudogene alleles that contain gene-derived sequences. Hence, the paralogues cannot be distinguished according to the reference sequence. This shortcoming can be effectively circumvented by using direct cDNA sequencing. This approach is based on the selective amplification of PMS2 transcripts in two overlapping 1.6-kb RT-PCR products. In addition to avoiding pseudogene co-amplification and allele dropout, this method has also the advantage that it allows to effectively identify deletions, splice mutations, and de novo retrotransposon insertions that escape the detection of most DNA-based mutation analysis protocols.
2012-01-01
Background Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes. Results Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80–120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins. Conclusions Before availability of pepper genome sequence, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of Sanger-EST and Illumina transcriptome assemblies. These and other information have been curated in a database that we have dedicated for pepper project. PMID:23110314
Single cell RNA sequencing of stem cell-derived retinal ganglion cells.
Daniszewski, Maciej; Senabouth, Anne; Nguyen, Quan H; Crombie, Duncan E; Lukowski, Samuel W; Kulkarni, Tejal; Sluch, Valentin M; Jabbari, Jafar S; Chamling, Xitiz; Zack, Donald J; Pébay, Alice; Powell, Joseph E; Hewitt, Alex W
2018-02-13
We used single cell sequencing technology to characterize the transcriptomes of 1,174 human embryonic stem cell-derived retinal ganglion cells (RGCs) at the single cell level. The human embryonic stem cell line BRN3B-mCherry (A81-H7), was differentiated to RGCs using a guided differentiation approach. Cells were harvested at day 36 and prepared for single cell RNA sequencing. Our data indicates the presence of three distinct subpopulations of cells, with various degrees of maturity. One cluster of 288 cells showed increased expression of genes involved in axon guidance together with semaphorin interactions, cell-extracellular matrix interactions and ECM proteoglycans, suggestive of a more mature RGC phenotype.
Riquet, Franck B; Blanchard, Claire; Jegouic, Sophie; Balanant, Jean; Guillot, Sophie; Vibet, Marie-Anne; Rakoto-Andrianarivelo, Mala; Delpeyroux, Francis
2008-09-01
Pathogenic circulating vaccine-derived polioviruses (cVDPVs) have become a major obstacle to the successful completion of the global polio eradication program. Most cVDPVs are recombinant between the oral poliovirus vaccine (OPV) and human enterovirus species C (HEV-C). To study the role of HEV-C sequences in the phenotype of cVDPVs, we generated a series of recombinants between a Madagascar cVDPV isolate and its parental OPV type 2 strain. Results indicated that the HEV-C sequences present in this cVDPV contribute to its characteristics, including pathogenicity, suggesting that interspecific recombination contributes to the phenotypic biodiversity of polioviruses and may favor the emergence of cVDPVs.
Kwarciak, Kamil; Radom, Marcin; Formanowicz, Piotr
2016-04-01
The classical sequencing by hybridization takes into account a binary information about sequence composition. A given element from an oligonucleotide library is or is not a part of the target sequence. However, the DNA chip technology has been developed and it enables to receive a partial information about multiplicity of each oligonucleotide the analyzed sequence consist of. Currently, it is not possible to assess the exact data of such type but even partial information should be very useful. Two realistic multiplicity information models are taken into consideration in this paper. The first one, called "one and many" assumes that it is possible to obtain information if a given oligonucleotide occurs in a reconstructed sequence once or more than once. According to the second model, called "one, two and many", one is able to receive from biochemical experiment information if a given oligonucleotide is present in an analyzed sequence once, twice or at least three times. An ant colony optimization algorithm has been implemented to verify the above models and to compare with existing algorithms for sequencing by hybridization which utilize the additional information. The proposed algorithm solves the problem with any kind of hybridization errors. Computational experiment results confirm that using even the partial information about multiplicity leads to increased quality of reconstructed sequences. Moreover, they also show that the more precise model enables to obtain better solutions and the ant colony optimization algorithm outperforms the existing ones. Test data sets and the proposed ant colony optimization algorithm are available on: http://bioserver.cs.put.poznan.pl/download/ACO4mSBH.zip. Copyright © 2016 Elsevier Ltd. All rights reserved.
Diversity of the P2 protein among nontypeable Haemophilus influenzae isolates.
Bell, J; Grass, S; Jeanteur, D; Munson, R S
1994-01-01
The genes for outer membrane protein P2 of four nontypeable Haemophilus influenzae strains were cloned and sequenced. The derived amino acid sequences were compared with the outer membrane protein P2 sequence from H. influenzae type b MinnA and the sequences of P2 from three additional nontypeable H. influenzae strains. The sequences were 76 to 94% identical. The sequences had regions with considerable variability separated by regions which were highly conserved. The variable regions mapped to putative surface-exposed loops of the protein. PMID:8188390
Gupta, R S
1998-12-01
The presence of shared conserved insertion or deletions (indels) in protein sequences is a special type of signature sequence that shows considerable promise for phylogenetic inference. An alternative model of microbial evolution based on the use of indels of conserved proteins and the morphological features of prokaryotic organisms is proposed. In this model, extant archaebacteria and gram-positive bacteria, which have a simple, single-layered cell wall structure, are termed monoderm prokaryotes. They are believed to be descended from the most primitive organisms. Evidence from indels supports the view that the archaebacteria probably evolved from gram-positive bacteria, and I suggest that this evolution occurred in response to antibiotic selection pressures. Evidence is presented that diderm prokaryotes (i.e., gram-negative bacteria), which have a bilayered cell wall, are derived from monoderm prokaryotes. Signature sequences in different proteins provide a means to define a number of different taxa within prokaryotes (namely, low G+C and high G+C gram-positive, Deinococcus-Thermus, cyanobacteria, chlamydia-cytophaga related, and two different groups of Proteobacteria) and to indicate how they evolved from a common ancestor. Based on phylogenetic information from indels in different protein sequences, it is hypothesized that all eukaryotes, including amitochondriate and aplastidic organisms, received major gene contributions from both an archaebacterium and a gram-negative eubacterium. In this model, the ancestral eukaryotic cell is a chimera that resulted from a unique fusion event between the two separate groups of prokaryotes followed by integration of their genomes.
Biesecker, Leslie G
2012-04-01
The debate surrounding the return of results from high-throughput genomic interrogation encompasses many important issues including ethics, law, economics, and social policy. As well, the debate is also informed by the molecular, genetic, and clinical foundations of the emerging field of clinical genomics, which is based on this new technology. This article outlines the main biomedical considerations of sequencing technologies and demonstrates some of the early clinical experiences with the technology to enable the debate to stay focused on real-world practicalities. These experiences are based on early data from the ClinSeq project, which is a project to pilot the use of massively parallel sequencing in a clinical research context with a major aim to develop modes of returning results to individual subjects. The study has enrolled >900 subjects and generated exome sequence data on 572 subjects. These data are beginning to be interpreted and returned to the subjects, which provides examples of the potential usefulness and pitfalls of clinical genomics. There are numerous genetic results that can be readily derived from a genome including rare, high-penetrance traits, and carrier states. However, much work needs to be done to develop the tools and resources for genomic interpretation. The main lesson learned is that a genome sequence may be better considered as a health-care resource, rather than a test, one that can be interpreted and used over the lifetime of the patient.
Gupta, Radhey S.
1998-01-01
The presence of shared conserved insertion or deletions (indels) in protein sequences is a special type of signature sequence that shows considerable promise for phylogenetic inference. An alternative model of microbial evolution based on the use of indels of conserved proteins and the morphological features of prokaryotic organisms is proposed. In this model, extant archaebacteria and gram-positive bacteria, which have a simple, single-layered cell wall structure, are termed monoderm prokaryotes. They are believed to be descended from the most primitive organisms. Evidence from indels supports the view that the archaebacteria probably evolved from gram-positive bacteria, and I suggest that this evolution occurred in response to antibiotic selection pressures. Evidence is presented that diderm prokaryotes (i.e., gram-negative bacteria), which have a bilayered cell wall, are derived from monoderm prokaryotes. Signature sequences in different proteins provide a means to define a number of different taxa within prokaryotes (namely, low G+C and high G+C gram-positive, Deinococcus-Thermus, cyanobacteria, chlamydia-cytophaga related, and two different groups of Proteobacteria) and to indicate how they evolved from a common ancestor. Based on phylogenetic information from indels in different protein sequences, it is hypothesized that all eukaryotes, including amitochondriate and aplastidic organisms, received major gene contributions from both an archaebacterium and a gram-negative eubacterium. In this model, the ancestral eukaryotic cell is a chimera that resulted from a unique fusion event between the two separate groups of prokaryotes followed by integration of their genomes. PMID:9841678
Hand, Melanie L.; Spangenberg, German C.; Forster, John W.; Cogan, Noel O. I.
2013-01-01
Chloroplast genome sequences are of broad significance in plant biology, due to frequent use in molecular phylogenetics, comparative genomics, population genetics, and genetic modification studies. The present study used a second-generation sequencing approach to determine and assemble the plastid genomes (plastomes) of four representatives from the agriculturally important Lolium-Festuca species complex of pasture grasses (Lolium multiflorum, Festuca pratensis, Festuca altissima, and Festuca ovina). Total cellular DNA was extracted from either roots or leaves, was sequenced, and the output was filtered for plastome-related reads. A comparison between sources revealed fewer plastome-related reads from root-derived template but an increase in incidental bacterium-derived sequences. Plastome assembly and annotation indicated high levels of sequence identity and a conserved organization and gene content between species. However, frequent deletions within the F. ovina plastome appeared to contribute to a smaller plastid genome size. Comparative analysis with complete plastome sequences from other members of the Poaceae confirmed conservation of most grass-specific features. Detailed analysis of the rbcL–psaI intergenic region, however, revealed a “hot-spot” of variation characterized by independent deletion events. The evolutionary implications of this observation are discussed. The complete plastome sequences are anticipated to provide the basis for potential organelle-specific genetic modification of pasture grasses. PMID:23550121
Van Wagtendonk, Jan W.; Root, Ralph R.
2003-01-01
The objective of this study was to test the applicability of using Normalized Difference Vegetation Index (NDVI) values derived from a temporal sequence of six Landsat Thematic Mapper (TM) scenes to map fuel models for Yosemite National Park, USA. An unsupervised classification algorithm was used to define 30 unique spectral-temporal classes of NDVI values. A combination of graphical, statistical and visual techniques was used to characterize the 30 classes and identify those that responded similarly and could be combined into fuel models. The final classification of fuel models included six different types: short annual and perennial grasses, tall perennial grasses, medium brush and evergreen hardwoods, short-needled conifers with no heavy fuels, long-needled conifers and deciduous hardwoods, and short-needled conifers with a component of heavy fuels. The NDVI, when analysed over a season of phenologically distinct periods along with ancillary data, can elicit information necessary to distinguish fuel model types. Fuels information derived from remote sensors has proven to be useful for initial classification of fuels and has been applied to fire management situations on the ground.
A Motion Detection Algorithm Using Local Phase Information
Lazar, Aurel A.; Ukani, Nikul H.; Zhou, Yiyin
2016-01-01
Previous research demonstrated that global phase alone can be used to faithfully represent visual scenes. Here we provide a reconstruction algorithm by using only local phase information. We also demonstrate that local phase alone can be effectively used to detect local motion. The local phase-based motion detector is akin to models employed to detect motion in biological vision, for example, the Reichardt detector. The local phase-based motion detection algorithm introduced here consists of two building blocks. The first building block measures/evaluates the temporal change of the local phase. The temporal derivative of the local phase is shown to exhibit the structure of a second order Volterra kernel with two normalized inputs. We provide an efficient, FFT-based algorithm for implementing the change of the local phase. The second processing building block implements the detector; it compares the maximum of the Radon transform of the local phase derivative with a chosen threshold. We demonstrate examples of applying the local phase-based motion detection algorithm on several video sequences. We also show how the locally detected motion can be used for segmenting moving objects in video scenes and compare our local phase-based algorithm to segmentation achieved with a widely used optic flow algorithm. PMID:26880882
Molzan, Manuela; Ottmann, Christian
2013-03-01
Myeloid leukemia factor 1 (MLF1) is associated with the development of leukemic diseases such as acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS). However, information on the physiological function of MLF1 is limited and mostly derived from studies identifying MLF1 interaction partners like CSN3, MLF1IP, MADM, Manp and the 14-3-3 proteins. The 14-3-3-binding site surrounding S34 is one of the only known functional features of the MLF1 sequence, along with one nuclear export sequence (NES) and two nuclear localization sequences (NLS). It was recently shown that the subcellular localization of mouse MLF1 is dependent on 14-3-3 proteins. Based on these findings, we investigated whether the subcellular localization of human MLF1 was also directly 14-3-3-dependent. Live cell imaging with GFP-fused human MLF1 was used to study the effects of mutations and deletions on its subcellular localization. Surprisingly, we found that the subcellular localization of full-length human MLF1 is 14-3-3-independent, and is probably regulated by other as-yet-unknown proteins.
Presentation Extensions of the SOAP
NASA Technical Reports Server (NTRS)
Carnright, Robert; Stodden, David; Coggi, John
2009-01-01
A set of extensions of the Satellite Orbit Analysis Program (SOAP) enables simultaneous and/or sequential presentation of information from multiple sources. SOAP is used in the aerospace community as a means of collaborative visualization and analysis of data on planned spacecraft missions. The following definitions of terms also describe the display modalities of SOAP as now extended: In SOAP terminology, View signifies an animated three-dimensional (3D) scene, two-dimensional still image, plot of numerical data, or any other visible display derived from a computational simulation or other data source; a) "Viewport" signifies a rectangular portion of a computer-display window containing a view; b) "Palette" signifies a collection of one or more viewports configured for simultaneous (split-screen) display in the same window; c) "Slide" signifies a palette with a beginning and ending time and an animation time step; and d) "Presentation" signifies a prescribed sequence of slides. For example, multiple 3D views from different locations can be crafted for simultaneous display and combined with numerical plots and other representations of data for both qualitative and quantitative analysis. The resulting sets of views can be temporally sequenced to convey visual impressions of a sequence of events for a planned mission.
Biophysics of protein evolution and evolutionary protein biophysics
Sikosek, Tobias; Chan, Hue Sun
2014-01-01
The study of molecular evolution at the level of protein-coding genes often entails comparing large datasets of sequences to infer their evolutionary relationships. Despite the importance of a protein's structure and conformational dynamics to its function and thus its fitness, common phylogenetic methods embody minimal biophysical knowledge of proteins. To underscore the biophysical constraints on natural selection, we survey effects of protein mutations, highlighting the physical basis for marginal stability of natural globular proteins and how requirement for kinetic stability and avoidance of misfolding and misinteractions might have affected protein evolution. The biophysical underpinnings of these effects have been addressed by models with an explicit coarse-grained spatial representation of the polypeptide chain. Sequence–structure mappings based on such models are powerful conceptual tools that rationalize mutational robustness, evolvability, epistasis, promiscuous function performed by ‘hidden’ conformational states, resolution of adaptive conflicts and conformational switches in the evolution from one protein fold to another. Recently, protein biophysics has been applied to derive more accurate evolutionary accounts of sequence data. Methods have also been developed to exploit sequence-based evolutionary information to predict biophysical behaviours of proteins. The success of these approaches demonstrates a deep synergy between the fields of protein biophysics and protein evolution. PMID:25165599
NASA Astrophysics Data System (ADS)
Kase, Sue E.; Vanni, Michelle; Caylor, Justine; Hoye, Jeff
2017-05-01
The Human-Assisted Machine Information Exploitation (HAMIE) investigation utilizes large-scale online data collection for developing models of information-based problem solving (IBPS) behavior in a simulated time-critical operational environment. These types of environments are characteristic of intelligence workflow processes conducted during human-geo-political unrest situations when the ability to make the best decision at the right time ensures strategic overmatch. The project takes a systems approach to Human Information Interaction (HII) by harnessing the expertise of crowds to model the interaction of the information consumer and the information required to solve a problem at different levels of system restrictiveness and decisional guidance. The design variables derived from Decision Support Systems (DSS) research represent the experimental conditions in this online single-player against-the-clock game where the player, acting in the role of an intelligence analyst, is tasked with a Commander's Critical Information Requirement (CCIR) in an information overload scenario. The player performs a sequence of three information processing tasks (annotation, relation identification, and link diagram formation) with the assistance of `HAMIE the robot' who offers varying levels of information understanding dependent on question complexity. We provide preliminary results from a pilot study conducted with Amazon Mechanical Turk (AMT) participants on the Volunteer Science scientific research platform.
Kim, Kwang-Hwan; Hwang, Ji-Hyun; Han, Dong-Yeup; Park, Minkyu; Kim, Seungill; Choi, Doil; Kim, Yongjae; Lee, Gung Pyo; Kim, Sun-Tae; Park, Young-Hoon
2015-01-01
An intraspecific genetic map for watermelon was constructed using an F2 population derived from 'Arka Manik' × 'TS34' and transcript sequence variants and quantitative trait loci (QTL) for resistance to powdery mildew (PMR), seed size (SS), and fruit shape (FS) were analyzed. The map consists of 14 linkage groups (LGs) defined by 174 cleaved amplified polymorphic sequences (CAPS), 2 derived-cleaved amplified polymorphic sequence markers, 20 sequence-characterized amplified regions, and 8 expressed sequence tag-simple sequence repeat markers spanning 1,404.3 cM, with a mean marker interval of 6.9 cM and an average of 14.6 markers per LG. Genetic inheritance and QTL analyses indicated that each of the PMR, SS, and FS traits is controlled by an incompletely dominant effect of major QTLs designated as pmr2.1, ss2.1, and fsi3.1, respectively. The pmr2.1, detected on chromosome 2 (Chr02), explained 80.0% of the phenotypic variation (LOD = 30.76). This QTL was flanked by two CAPS markers, wsb2-24 (4.00 cM) and wsb2-39 (13.97 cM). The ss2.1, located close to pmr2.1 and CAPS marker wsb2-13 (1.00 cM) on Chr02, explained 92.3% of the phenotypic variation (LOD = 68.78). The fsi3.1, detected on Chr03, explained 79.7% of the phenotypic variation (LOD = 31.37) and was flanked by two CAPS, wsb3-24 (1.91 cM) and wsb3-9 (7.00 cM). Candidate gene-based CAPS markers were developed from the disease resistance and fruit shape gene homologs located on Chr.02 and Chr03 and were mapped on the intraspecific map. Colocalization of these markers with the major QTLs indicated that watermelon orthologs of a nucleotide-binding site-leucine-rich repeat class gene containing an RPW8 domain and a member of SUN containing the IQ67 domain are candidate genes for pmr2.1 and fsi3.1, respectively. The results presented herein provide useful information for marker-assisted breeding and gene cloning for PMR and fruit-related traits.
Kim, Kwang-Hwan; Hwang, Ji-Hyun; Han, Dong-Yeup; Park, Minkyu; Kim, Seungill; Choi, Doil; Kim, Yongjae; Lee, Gung Pyo; Kim, Sun-Tae; Park, Young-Hoon
2015-01-01
An intraspecific genetic map for watermelon was constructed using an F2 population derived from ‘Arka Manik’ × ‘TS34’ and transcript sequence variants and quantitative trait loci (QTL) for resistance to powdery mildew (PMR), seed size (SS), and fruit shape (FS) were analyzed. The map consists of 14 linkage groups (LGs) defined by 174 cleaved amplified polymorphic sequences (CAPS), 2 derived-cleaved amplified polymorphic sequence markers, 20 sequence-characterized amplified regions, and 8 expressed sequence tag-simple sequence repeat markers spanning 1,404.3 cM, with a mean marker interval of 6.9 cM and an average of 14.6 markers per LG. Genetic inheritance and QTL analyses indicated that each of the PMR, SS, and FS traits is controlled by an incompletely dominant effect of major QTLs designated as pmr2.1, ss2.1, and fsi3.1, respectively. The pmr2.1, detected on chromosome 2 (Chr02), explained 80.0% of the phenotypic variation (LOD = 30.76). This QTL was flanked by two CAPS markers, wsb2-24 (4.00 cM) and wsb2-39 (13.97 cM). The ss2.1, located close to pmr2.1 and CAPS marker wsb2-13 (1.00 cM) on Chr02, explained 92.3% of the phenotypic variation (LOD = 68.78). The fsi3.1, detected on Chr03, explained 79.7% of the phenotypic variation (LOD = 31.37) and was flanked by two CAPS, wsb3-24 (1.91 cM) and wsb3-9 (7.00 cM). Candidate gene-based CAPS markers were developed from the disease resistance and fruit shape gene homologs located on Chr.02 and Chr03 and were mapped on the intraspecific map. Colocalization of these markers with the major QTLs indicated that watermelon orthologs of a nucleotide-binding site-leucine-rich repeat class gene containing an RPW8 domain and a member of SUN containing the IQ67 domain are candidate genes for pmr2.1 and fsi3.1, respectively. The results presented herein provide useful information for marker-assisted breeding and gene cloning for PMR and fruit-related traits. PMID:26700647
Fellner, Lea; Huptas, Christopher; Simon, Svenja; Mühlig, Anna; Scherer, Siegfried; Neuhaus, Klaus
2016-04-07
Escherichia coliO157:H7 EDL933, isolated in 1982 in the United States, was the first enterohemorrhagicE. coli(EHEC) strain sequenced. Unfortunately, European labs can no longer receive the original strain. We checked three European EDL933 derivatives and found major genetic deviations (deletions, inversions) in two strains. All EDL933 strains contain the cryptic EHEC-plasmid, not reported before. Copyright © 2016 Fellner et al.