USDA-ARS?s Scientific Manuscript database
Genomic structural variations are an important source of genetic diversity. Copy number variations (CNVs), gains and losses of large regions of genomic sequence between individuals of a species, are known to be associated with both diseases and phenotypic traits. Deeply sequenced genomes are often u...
Human Genome Sequencing in Health and Disease
Gonzaga-Jauregui, Claudia; Lupski, James R.; Gibbs, Richard A.
2013-01-01
Following the “finished,” euchromatic, haploid human reference genome sequence, the rapid development of novel, faster, and cheaper sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal genome information, and describes the application of whole-genome and exome sequencing to identify the cause of genetic diseases and to suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing information that assists in the formulation of a differential diagnosis. We outline herein some of the remaining challenges. PMID:22248320
Variation block-based genomics method for crop plants.
Kim, Yul Ho; Park, Hyang Mi; Hwang, Tae-Young; Lee, Seuk Ki; Choi, Man Soo; Jho, Sungwoong; Hwang, Seungwoo; Kim, Hak-Min; Lee, Dongwoo; Kim, Byoung-Chul; Hong, Chang Pyo; Cho, Yun Sung; Kim, Hyunmin; Jeong, Kwang Ho; Seo, Min Jung; Yun, Hong Tai; Kim, Sun Lim; Kwon, Young-Up; Kim, Wook Han; Chun, Hye Kyung; Lim, Sang Jong; Shin, Young-Ah; Choi, Ik-Young; Kim, Young Sun; Yoon, Ho-Sung; Lee, Suk-Ha; Lee, Sunghoon
2014-06-15
In contrast with wild species, cultivated crop genomes consist of reshuffled recombination blocks, which occurred by crossing and selection processes. Accordingly, recombination block-based genomics analysis can be an effective approach for the screening of target loci for agricultural traits. We propose the variation block method, which is a three-step process for recombination block detection and comparison. The first step is to detect variations by comparing the short-read DNA sequences of the cultivar to the reference genome of the target crop. Next, sequence blocks with variation patterns are examined and defined. The boundaries between the variation-containing sequence blocks are regarded as recombination sites. All the assumed recombination sites in the cultivar set are used to split the genomes, and the resulting sequence regions are termed variation blocks. Finally, the genomes are compared using the variation blocks. The variation block method identified recurring recombination blocks accurately and successfully represented block-level diversities in the publicly available genomes of 31 soybean and 23 rice accessions. The practicality of this approach was demonstrated by the identification of a putative locus determining soybean hilum color. We suggest that the variation block method is an efficient genomics method for the recombination block-level comparison of crop genomes. We expect that this method will facilitate the development of crop genomics by bringing genomics technologies to the field of crop breeding.
Genomic Sequence Variation Markup Language (GSVML).
Nakaya, Jun; Kimura, Michio; Hiroi, Kaei; Ido, Keisuke; Yang, Woosung; Tanaka, Hiroshi
2010-02-01
With the aim of making good use of internationally accumulated genomic sequence variation data, which is increasing rapidly due to the explosive amount of genomic research at present, the development of an interoperable data exchange format and its international standardization are necessary. Genomic Sequence Variation Markup Language (GSVML) will focus on genomic sequence variation data and human health applications, such as gene based medicine or pharmacogenomics. We developed GSVML through eight steps, based on case analysis and domain investigations. By focusing on the design scope to human health applications and genomic sequence variation, we attempted to eliminate ambiguity and to ensure practicability. We intended to satisfy the requirements derived from the use case analysis of human-based clinical genomic applications. Based on database investigations, we attempted to minimize the redundancy of the data format, while maximizing the data covering range. We also attempted to ensure communication and interface ability with other Markup Languages, for exchange of omics data among various omics researchers or facilities. The interface ability with developing clinical standards, such as the Health Level Seven Genotype Information model, was analyzed. We developed the human health-oriented GSVML comprising variation data, direct annotation, and indirect annotation categories; the variation data category is required, while the direct and indirect annotation categories are optional. The annotation categories contain omics and clinical information, and have internal relationships. For designing, we examined 6 cases for three criteria as human health application and 15 data elements for three criteria as data formats for genomic sequence variation data exchange. The data format of five international SNP databases and six Markup Languages and the interface ability to the Health Level Seven Genotype Model in terms of 317 items were investigated. GSVML was developed as a potential data exchanging format for genomic sequence variation data exchange focusing on human health applications. The international standardization of GSVML is necessary, and is currently underway. GSVML can be applied to enhance the utilization of genomic sequence variation data worldwide by providing a communicable platform between clinical and research applications. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
Genetic Variation in Cardiomyopathy and Cardiovascular Disorders.
McNally, Elizabeth M; Puckelwartz, Megan J
2015-01-01
With the wider deployment of massively-parallel, next-generation sequencing, it is now possible to survey human genome data for research and clinical purposes. The reduced cost of producing short-read sequencing has now shifted the burden to data analysis. Analysis of genome sequencing remains challenged by the complexity of the human genome, including redundancy and the repetitive nature of genome elements and the large amount of variation in individual genomes. Public databases of human genome sequences greatly facilitate interpretation of common and rare genetic variation, although linking database sequence information to detailed clinical information is limited by privacy and practical issues. Genetic variation is a rich source of knowledge for cardiovascular disease because many, if not all, cardiovascular disorders are highly heritable. The role of rare genetic variation in predicting risk and complications of cardiovascular diseases has been well established for hypertrophic and dilated cardiomyopathy, where the number of genes that are linked to these disorders is growing. Bolstered by family data, where genetic variants segregate with disease, rare variation can be linked to specific genetic variation that offers profound diagnostic information. Understanding genetic variation in cardiomyopathy is likely to help stratify forms of heart failure and guide therapy. Ultimately, genetic variation may be amenable to gene correction and gene editing strategies.
2011-01-01
Background Integration of genomic variation with phenotypic information is an effective approach for uncovering genotype-phenotype associations. This requires an accurate identification of the different types of variation in individual genomes. Results We report the integration of the whole genome sequence of a single Holstein Friesian bull with data from single nucleotide polymorphism (SNP) and comparative genomic hybridization (CGH) array technologies to determine a comprehensive spectrum of genomic variation. The performance of resequencing SNP detection was assessed by combining SNPs that were identified to be either in identity by descent (IBD) or in copy number variation (CNV) with results from SNP array genotyping. Coding insertions and deletions (indels) were found to be enriched for size in multiples of 3 and were located near the N- and C-termini of proteins. For larger indels, a combination of split-read and read-pair approaches proved to be complementary in finding different signatures. CNVs were identified on the basis of the depth of sequenced reads, and by using SNP and CGH arrays. Conclusions Our results provide high resolution mapping of diverse classes of genomic variation in an individual bovine genome and demonstrate that structural variation surpasses sequence variation as the main component of genomic variability. Better accuracy of SNP detection was achieved with little loss of sensitivity when algorithms that implemented mapping quality were used. IBD regions were found to be instrumental for calculating resequencing SNP accuracy, while SNP detection within CNVs tended to be less reliable. CNV discovery was affected dramatically by platform resolution and coverage biases. The combined data for this study showed that at a moderate level of sequencing coverage, an ensemble of platforms and tools can be applied together to maximize the accurate detection of sequence and structural variants. PMID:22082336
ACTG: novel peptide mapping onto gene models.
Choi, Seunghyuk; Kim, Hyunwoo; Paek, Eunok
2017-04-15
In many proteogenomic applications, mapping peptide sequences onto genome sequences can be very useful, because it allows us to understand origins of the gene products. Existing software tools either take the genomic position of a peptide start site as an input or assume that the peptide sequence exactly matches the coding sequence of a given gene model. In case of novel peptides resulting from genomic variations, especially structural variations such as alternative splicing, these existing tools cannot be directly applied unless users supply information about the variant, either its genomic position or its transcription model. Mapping potentially novel peptides to genome sequences, while allowing certain genomic variations, requires introducing novel gene models when aligning peptide sequences to gene structures. We have developed a new tool called ACTG (Amino aCids To Genome), which maps peptides to genome, assuming all possible single exon skipping, junction variation allowing three edit distances from the original splice sites, exon extension and frame shift. In addition, it can also consider SNVs (single nucleotide variations) during mapping phase if a user provides the VCF (variant call format) file as an input. Available at http://prix.hanyang.ac.kr/ACTG/search.jsp . eunokpaek@hanyang.ac.kr. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Whole-Genome Sequence Variation among Multiple Isolates of Pseudomonas aeruginosa
Spencer, David H.; Kas, Arnold; Smith, Eric E.; Raymond, Christopher K.; Sims, Elizabeth H.; Hastings, Michele; Burns, Jane L.; Kaul, Rajinder; Olson, Maynard V.
2003-01-01
Whole-genome shotgun sequencing was used to study the sequence variation of three Pseudomonas aeruginosa isolates, two from clonal infections of cystic fibrosis patients and one from an aquatic environment, relative to the genomic sequence of reference strain PAO1. The majority of the PAO1 genome is represented in these strains; however, at least three prominent islands of PAO1-specific sequence are apparent. Conversely, ∼10% of the sequencing reads derived from each isolate fail to align with the PAO1 backbone. While average sequence variation among all strains is roughly 0.5%, regions of pronounced differences were evident in whole-genome scans of nucleotide diversity. We analyzed two such divergent loci, the pyoverdine and O-antigen biosynthesis regions, by complete resequencing. A thorough analysis of isolates collected over time from one of the cystic fibrosis patients revealed independent mutations resulting in the loss of O-antigen synthesis alternating with a mucoid phenotype. Overall, we conclude that most of the PAO1 genome represents a core P. aeruginosa backbone sequence while the strains addressed in this study possess additional genetic material that accounts for at least 10% of their genomes. Approximately half of these additional sequences are novel. PMID:12562802
The diploid genome sequence of an Asian individual
Wang, Jun; Wang, Wei; Li, Ruiqiang; Li, Yingrui; Tian, Geng; Goodman, Laurie; Fan, Wei; Zhang, Junqing; Li, Jun; Zhang, Juanbin; Guo, Yiran; Feng, Binxiao; Li, Heng; Lu, Yao; Fang, Xiaodong; Liang, Huiqing; Du, Zhenglin; Li, Dong; Zhao, Yiqing; Hu, Yujie; Yang, Zhenzhen; Zheng, Hancheng; Hellmann, Ines; Inouye, Michael; Pool, John; Yi, Xin; Zhao, Jing; Duan, Jinjie; Zhou, Yan; Qin, Junjie; Ma, Lijia; Li, Guoqing; Yang, Zhentao; Zhang, Guojie; Yang, Bin; Yu, Chang; Liang, Fang; Li, Wenjie; Li, Shaochuan; Li, Dawei; Ni, Peixiang; Ruan, Jue; Li, Qibin; Zhu, Hongmei; Liu, Dongyuan; Lu, Zhike; Li, Ning; Guo, Guangwu; Zhang, Jianguo; Ye, Jia; Fang, Lin; Hao, Qin; Chen, Quan; Liang, Yu; Su, Yeyang; san, A.; Ping, Cuo; Yang, Shuang; Chen, Fang; Li, Li; Zhou, Ke; Zheng, Hongkun; Ren, Yuanyuan; Yang, Ling; Gao, Yang; Yang, Guohua; Li, Zhuo; Feng, Xiaoli; Kristiansen, Karsten; Wong, Gane Ka-Shu; Nielsen, Rasmus; Durbin, Richard; Bolund, Lars; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian
2009-01-01
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics. PMID:18987735
Child Development and Structural Variation in the Human Genome
ERIC Educational Resources Information Center
Zhang, Ying; Haraksingh, Rajini; Grubert, Fabian; Abyzov, Alexej; Gerstein, Mark; Weissman, Sherman; Urban, Alexander E.
2013-01-01
Structural variation of the human genome sequence is the insertion, deletion, or rearrangement of stretches of DNA sequence sized from around 1,000 to millions of base pairs. Over the past few years, structural variation has been shown to be far more common in human genomes than previously thought. Very little is currently known about the effects…
Boussaha, Mekki; Michot, Pauline; Letaief, Rabia; Hozé, Chris; Fritz, Sébastien; Grohs, Cécile; Esquerré, Diane; Duchesne, Amandine; Philippe, Romain; Blanquet, Véronique; Phocas, Florence; Floriot, Sandrine; Rocha, Dominique; Klopp, Christophe; Capitan, Aurélien; Boichard, Didier
2016-11-15
In recent years, several bovine genome sequencing projects were carried out with the aim of developing genomic tools to improve dairy and beef production efficiency and sustainability. In this study, we describe the first French cattle genome variation dataset obtained by sequencing 274 whole genomes representing several major dairy and beef breeds. This dataset contains over 28 million single nucleotide polymorphisms (SNPs) and small insertions and deletions. Comparisons between sequencing results and SNP array genotypes revealed a very high genotype concordance rate, which indicates the good quality of our data. To our knowledge, this is the first large-scale catalog of small genomic variations in French dairy and beef cattle. This resource will contribute to the study of gene functions and population structure and also help to improve traits through genotype-guided selection.
Minimal Absent Words in Four Human Genome Assemblies
Garcia, Sara P.; Pinho, Armando J.
2011-01-01
Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species. PMID:22220210
Whole-Genome Sequences of Thirteen Isolates of Borrelia burgdorferi
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schutzer S. E.; Dunn J.; Fraser-Liggett, C. M.
2011-02-01
Borrelia burgdorferi is a causative agent of Lyme disease in North America and Eurasia. The first complete genome sequence of B. burgdorferi strain 31, available for more than a decade, has assisted research on the pathogenesis of Lyme disease. Because a single genome sequence is not sufficient to understand the relationship between genotypic and geographic variation and disease phenotype, we determined the whole-genome sequences of 13 additional B. burgdorferi isolates that span the range of natural variation. These sequences should allow improved understanding of pathogenesis and provide a foundation for novel detection, diagnosis, and prevention strategies.
Liu, Siyang; Huang, Shujia; Rao, Junhua; Ye, Weijian; Krogh, Anders; Wang, Jun
2015-01-01
Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) as well as large deletions. However, these approaches consistently display a substantial bias against the recovery of complex structural variants and novel sequence in individual genomes and do not provide interpretation information such as the annotation of ancestral state and formation mechanism. We present a novel approach implemented in a single software package, AsmVar, to discover, genotype and characterize different forms of structural variation and novel sequence from population-scale de novo genome assemblies up to nucleotide resolution. Application of AsmVar to several human de novo genome assemblies captures a wide spectrum of structural variants and novel sequences present in the human population in high sensitivity and specificity. Our method provides a direct solution for investigating structural variants and novel sequences from de novo genome assemblies, facilitating the construction of population-scale pan-genomes. Our study also highlights the usefulness of the de novo assembly strategy for definition of genome structure.
Aokic, Jun-ya; Kawase, Junya; Hamada, Kazuhisa; Fujimoto, Hiroshi; Yamamoto, Ikki; Usuki, Hironori
2018-01-01
Greater amberjack (Seriola dumerili) is distributed in tropical and temperate waters worldwide and is an important aquaculture fish. We carried out de novo sequencing of the greater amberjack genome to construct a reference genome sequence to identify single nucleotide polymorphisms (SNPs) for breeding amberjack by marker-assisted or gene-assisted selection as well as to identify functional genes for biological traits. We obtained 200 times coverage and constructed a high-quality genome assembly using next generation sequencing technology. The assembled sequences were aligned onto a yellowtail (Seriola quinqueradiata) radiation hybrid (RH) physical map by sequence homology. A total of 215 of the longest amberjack sequences, with a total length of 622.8 Mbp (92% of the total length of the genome scaffolds), were lined up on the yellowtail RH map. We resequenced the whole genomes of 20 greater amberjacks and mapped the resulting sequences onto the reference genome sequence. About 186,000 nonredundant SNPs were successfully ordered on the reference genome. Further, we found differences in the genome structural variations between two greater amberjack populations using BreakDancer. We also analyzed the greater amberjack transcriptome and mapped the annotated sequences onto the reference genome sequence. PMID:29785397
No evidence that sex and transposable elements drive genome size variation in evening primroses.
Ågren, J Arvid; Greiner, Stephan; Johnson, Marc T J; Wright, Stephen I
2015-04-01
Genome size varies dramatically across species, but despite an abundance of attention there is little agreement on the relative contributions of selective and neutral processes in governing this variation. The rate of sex can potentially play an important role in genome size evolution because of its effect on the efficacy of selection and transmission of transposable elements (TEs). Here, we used a phylogenetic comparative approach and whole genome sequencing to investigate the contribution of sex and TE content to genome size variation in the evening primrose (Oenothera) genus. We determined genome size using flow cytometry for 30 species that vary in genetic system and find that variation in sexual/asexual reproduction cannot explain the almost twofold variation in genome size. Moreover, using whole genome sequences of three species of varying genome sizes and reproductive system, we found that genome size was not associated with TE abundance; instead the larger genomes had a higher abundance of simple sequence repeats. Although it has long been clear that sexual reproduction may affect various aspects of genome evolution in general and TE evolution in particular, it does not appear to have played a major role in genome size evolution in the evening primroses. © 2015 The Author(s).
A high-resolution cattle CNV map by population-scale genome sequencing
USDA-ARS?s Scientific Manuscript database
Copy Number Variations (CNVs) are common genomic structural variations that have been linked to human diseases and phenotypic traits. Prior studies in cattle have produced low-resolution CNV maps. We constructed a draft, high-resolution map of cattle CNVs based on whole genome sequencing data from 7...
Maize HapMap2 identifies extant variation from a genome in flux
USDA-ARS?s Scientific Manuscript database
The maize genome is the largest, most diverse and complex plant genome sequenced to date. Using high-throughput sequencing to access genetic variation and a population genetics model to score the polymorphisms, we characterize and unite the diversity of the world’s key breeding germplasm, wild rela...
RSAT 2015: Regulatory Sequence Analysis Tools
Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A.; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M.; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques
2015-01-01
RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. PMID:25904632
Mapping and phasing of structural variation in patient genomes using nanopore sequencing.
Cretu Stancu, Mircea; van Roosmalen, Markus J; Renkens, Ivo; Nieboer, Marleen M; Middelkamp, Sjors; de Ligt, Joep; Pregno, Giulia; Giachino, Daniela; Mandrile, Giorgia; Espejo Valle-Inclan, Jose; Korzelius, Jerome; de Bruijn, Ewart; Cuppen, Edwin; Talkowski, Michael E; Marschall, Tobias; de Ridder, Jeroen; Kloosterman, Wigard P
2017-11-06
Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline-NanoSV. We demonstrate that nanopore long reads are superior to short reads with regard to detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine the parental origin of all de novo chromothripsis breakpoints and to resolve the structure of these complex rearrangements. Additionally, genome-wide surveillance of inherited SVs reveals novel variants, missed in short-read data sets, a large proportion of which are retrotransposon insertions. We provide a first exploration of patient genome sequencing with a nanopore sequencer and demonstrate the value of long-read sequencing in mapping and phasing of SVs for both clinical and research applications.
A survey of copy number variation in the porcine genome detected from whole-genome sequence
USDA-ARS?s Scientific Manuscript database
An important challenge to post-genomic biology is relating observed phenotypic variation to the underlying genotypic variation. Genome-wide association studies (GWAS) have made thousands of connections between single nucleotide polymorphisms (SNPs) and phenotypes, implicating regions of the genome t...
The Mouse Genomes Project: a repository of inbred laboratory mouse strain genomes.
Adams, David J; Doran, Anthony G; Lilue, Jingtao; Keane, Thomas M
2015-10-01
The Mouse Genomes Project was initiated in 2009 with the goal of using next-generation sequencing technologies to catalogue molecular variation in the common laboratory mouse strains, and a selected set of wild-derived inbred strains. The initial sequencing and survey of sequence variation in 17 inbred strains was completed in 2011 and included comprehensive catalogue of single nucleotide polymorphisms, short insertion/deletions, larger structural variants including their fine scale architecture and landscape of transposable element variation, and genomic sites subject to post-transcriptional alteration of RNA. From this beginning, the resource has expanded significantly to include 36 fully sequenced inbred laboratory mouse strains, a refined and updated data processing pipeline, and new variation querying and data visualisation tools which are available on the project's website ( http://www.sanger.ac.uk/resources/mouse/genomes/ ). The focus of the project is now the completion of de novo assembled chromosome sequences and strain-specific gene structures for the core strains. We discuss how the assembled chromosomes will power comparative analysis, data access tools and future directions of mouse genetics.
Setoh, Yin Xiang; Amarilla, Alberto A; Peng, Nias Y; Slonchak, Andrii; Periasamy, Parthiban; Figueiredo, Luiz T M; Aquino, Victor H; Khromykh, Alexander A
2018-01-01
Rocio virus (ROCV) is an arbovirus belonging to the genus Flavivirus, family Flaviviridae. We present an updated sequence of ROCV strain SPH 34675 (GenBank: AY632542.4), the only available full genome sequence prior to this study. Using next-generation sequencing of the entire genome, we reveal substantial sequence variation from the prototype sequence, with 30 nucleotide differences amounting to 14 amino acid changes, as well as significant changes to predicted 3'UTR RNA structures. Our results present an updated and corrected sequence of a potential emerging human-virulent flavivirus uniquely indigenous to Brazil (GenBank: MF461639).
Intra-isolate genome variation in arbuscular mycorrhizal fungi persists in the transcriptome.
Boon, E; Zimmerman, E; Lang, B F; Hijri, M
2010-07-01
Arbuscular mycorrhizal fungi (AMF) are heterokaryotes with an unusual genetic makeup. Substantial genetic variation occurs among nuclei within a single mycelium or isolate. AMF reproduce through spores that contain varying fractions of this heterogeneous population of nuclei. It is not clear whether this genetic variation on the genome level actually contributes to the AMF phenotype. To investigate the extent to which polymorphisms in nuclear genes are transcribed, we analysed the intra-isolate genomic and cDNA sequence variation of two genes, the large subunit ribosomal RNA (LSU rDNA) of Glomus sp. DAOM-197198 (previously known as G. intraradices) and the POL1-like sequence (PLS) of Glomus etunicatum. For both genes, we find high sequence variation at the genome and transcriptome level. Reconstruction of LSU rDNA secondary structure shows that all variants are functional. Patterns of PLS sequence polymorphism indicate that there is one functional gene copy, PLS2, which is preferentially transcribed, and one gene copy, PLS1, which is a pseudogene. This is the first study that investigates AMF intra-isolate variation at the transcriptome level. In conclusion, it is possible that, in AMF, multiple nuclear genomes contribute to a single phenotype.
Kumar, Pankaj; Chaitanya, Pasumarthy S; Nagarajaram, Hampapathalu A
2011-01-01
PSSRdb (Polymorphic Simple Sequence Repeats database) (http://www.cdfd.org.in/PSSRdb/) is a relational database of polymorphic simple sequence repeats (PSSRs) extracted from 85 different species of prokaryotes. Simple sequence repeats (SSRs) are the tandem repeats of nucleotide motifs of the sizes 1-6 bp and are highly polymorphic. SSR mutations in and around coding regions affect transcription and translation of genes. Such changes underpin phase variations and antigenic variations seen in some bacteria. Although SSR-mediated phase variation and antigenic variations have been well-studied in some bacteria there seems a lot of other species of prokaryotes yet to be investigated for SSR mediated adaptive and other evolutionary advantages. As a part of our on-going studies on SSR polymorphism in prokaryotes we compared the genome sequences of various strains and isolates available for 85 different species of prokaryotes and extracted a number of SSRs showing length variations and created a relational database called PSSRdb. This database gives useful information such as location of PSSRs in genomes, length variation across genomes, the regions harboring PSSRs, etc. The information provided in this database is very useful for further research and analysis of SSRs in prokaryotes.
USDA-ARS?s Scientific Manuscript database
Copy number variations (CNVs) are large insertions, deletions or duplications in the genome that vary between members of a species and are known to affect a wide variety of phenotypic traits. In this study, we identified CNVs in a population of bulls using low coverage next-generation sequence data....
VCGDB: a dynamic genome database of the Chinese population
2014-01-01
Background The data released by the 1000 Genomes Project contain an increasing number of genome sequences from different nations and populations with a large number of genetic variations. As a result, the focus of human genome studies is changing from single and static to complex and dynamic. The currently available human reference genome (GRCh37) is based on sequencing data from 13 anonymous Caucasian volunteers, which might limit the scope of genomics, transcriptomics, epigenetics, and genome wide association studies. Description We used the massive amount of sequencing data published by the 1000 Genomes Project Consortium to construct the Virtual Chinese Genome Database (VCGDB), a dynamic genome database of the Chinese population based on the whole genome sequencing data of 194 individuals. VCGDB provides dynamic genomic information, which contains 35 million single nucleotide variations (SNVs), 0.5 million insertions/deletions (indels), and 29 million rare variations, together with genomic annotation information. VCGDB also provides a highly interactive user-friendly virtual Chinese genome browser (VCGBrowser) with functions like seamless zooming and real-time searching. In addition, we have established three population-specific consensus Chinese reference genomes that are compatible with mainstream alignment software. Conclusions VCGDB offers a feasible strategy for processing big data to keep pace with the biological data explosion by providing a robust resource for genomics studies; in particular, studies aimed at finding regions of the genome associated with diseases. PMID:24708222
Genome-Wide Structural Variation Detection by Genome Mapping on Nanochannel Arrays.
Mak, Angel C Y; Lai, Yvonne Y Y; Lam, Ernest T; Kwok, Tsz-Piu; Leung, Alden K Y; Poon, Annie; Mostovoy, Yulia; Hastie, Alex R; Stedman, William; Anantharaman, Thomas; Andrews, Warren; Zhou, Xiang; Pang, Andy W C; Dai, Heng; Chu, Catherine; Lin, Chin; Wu, Jacob J K; Li, Catherine M L; Li, Jing-Woei; Yim, Aldrin K Y; Chan, Saki; Sibert, Justin; Džakula, Željko; Cao, Han; Yiu, Siu-Ming; Chan, Ting-Fung; Yip, Kevin Y; Xiao, Ming; Kwok, Pui-Yan
2016-01-01
Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against that derived from the reference human genome, and identified structural variations that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential of disrupting gene function or regulation. Copyright © 2016 by the Genetics Society of America.
Read clouds uncover variation in complex regions of the human genome
Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E.; West, Robert; Sidow, Arend; Batzoglou, Serafim
2015-01-01
Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies. PMID:26286554
Louis, Ed
2011-01-01
In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.
RSAT 2015: Regulatory Sequence Analysis Tools.
Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques
2015-07-01
RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Jakupciak, John P; Wells, Jeffrey M; Karalus, Richard J; Pawlowski, David R; Lin, Jeffrey S; Feldman, Andrew B
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations.
Jakupciak, John P.; Wells, Jeffrey M.; Karalus, Richard J.; Pawlowski, David R.; Lin, Jeffrey S.; Feldman, Andrew B.
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations. PMID:24455204
Reduced representation approaches to interrogate genome diversity in large repetitive plant genomes.
Hirsch, Cory D; Evans, Joseph; Buell, C Robin; Hirsch, Candice N
2014-07-01
Technology and software improvements in the last decade now provide methodologies to access the genome sequence of not only a single accession, but also multiple accessions of plant species. This provides a means to interrogate species diversity at the genome level. Ample diversity among accessions in a collection of species can be found, including single-nucleotide polymorphisms, insertions and deletions, copy number variation and presence/absence variation. For species with small, non-repetitive rich genomes, re-sequencing of query accessions is robust, highly informative, and economically feasible. However, for species with moderate to large sized repetitive-rich genomes, technical and economic barriers prevent en masse genome re-sequencing of accessions. Multiple approaches to access a focused subset of loci in species with larger genomes have been developed, including reduced representation sequencing, exome capture and transcriptome sequencing. Collectively, these approaches have enabled interrogation of diversity on a genome scale for large plant genomes, including crop species important to worldwide food security. © The Author 2014. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com.
A global reference for human genetic variation
2016-01-01
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies. PMID:26432245
Caporale, Lynn Helena
2012-09-01
This overview of a special issue of Annals of the New York Academy of Sciences discusses uneven distribution of distinct types of variation across the genome, the dependence of specific types of variation upon distinct classes of DNA sequences and/or the induction of specific proteins, the circumstances in which distinct variation-generating systems are activated, and the implications of this work for our understanding of evolution and of cancer. Also discussed is the value of non text-based computational methods for analyzing information carried by DNA, early insights into organizational frameworks that affect genome behavior, and implications of this work for comparative genomics. © 2012 New York Academy of Sciences.
Salmonella Typhi genomics: envisaging the future of typhoid eradication.
Yap, Kien-Pong; Thong, Kwai Lin
2017-08-01
Next-generation whole-genome sequencing has revolutionised the study of infectious diseases in recent years. The availability of genome sequences and its understanding have transformed the field of molecular microbiology, epidemiology, infection treatments and vaccine developments. We review the key findings of the publicly accessible genomes of Salmonella enterica serovar Typhi since the first complete genome to the most recent release of thousands of Salmonella Typhi genomes, which remarkably shape the genomic research of S. Typhi and other pathogens. Important new insights acquired from the genome sequencing of S. Typhi, pertaining to genomic variations, evolution, population structure, antibiotic resistance, virulence, pathogenesis, disease surveillance/investigation and disease control are discussed. As the numbers of sequenced genomes are increasing at an unprecedented rate, fine variations in the gene pool of S. Typhi are captured in high resolution, allowing deeper understanding of the pathogen's evolutionary trends and its pathogenesis, paving the way to bringing us closer to eradication of typhoid through effective vaccine/treatment development. © 2017 John Wiley & Sons Ltd.
Doddapaneni, Harshavardhan; Yao, Jiqiang; Lin, Hong; Walker, M Andrew; Civerolo, Edwin L
2006-01-01
Background The Gram-negative, xylem-limited phytopathogenic bacterium Xylella fastidiosa is responsible for causing economically important diseases in grapevine, citrus and many other plant species. Despite its economic impact, relatively little is known about the genomic variations among strains isolated from different hosts and their influence on the population genetics of this pathogen. With the availability of genome sequence information for four strains, it is now possible to perform genome-wide analyses to identify and categorize such DNA variations and to understand their influence on strain functional divergence. Results There are 1,579 genes and 194 non-coding homologous sequences present in the genomes of all four strains, representing a 76. 2% conservation of the sequenced genome. About 60% of the X. fastidiosa unique sequences exist as tandem gene clusters of 6 or more genes. Multiple alignments identified 12,754 SNPs and 14,449 INDELs in the 1528 common genes and 20,779 SNPs and 10,075 INDELs in the 194 non-coding sequences. The average SNP frequency was 1.08 × 10-2 per base pair of DNA and the average INDEL frequency was 2.06 × 10-2 per base pair of DNA. On an average, 60.33% of the SNPs were synonymous type while 39.67% were non-synonymous type. The mutation frequency, primarily in the form of external INDELs was the main type of sequence variation. The relative similarity between the strains was discussed according to the INDEL and SNP differences. The number of genes unique to each strain were 60 (9a5c), 54 (Dixon), 83 (Ann1) and 9 (Temecula-1). A sub-set of the strain specific genes showed significant differences in terms of their codon usage and GC composition from the native genes suggesting their xenologous origin. Tandem repeat analysis of the genomic sequences of the four strains identified associations of repeat sequences with hypothetical and phage related functions. Conclusion INDELs and strain specific genes have been identified as the main source of variations among strains, with individual strains showing different rates of genome evolution. Based on these genome comparisons, it appears that the Pierce's disease strain Temecula-1 genome represents the ancestral genome of the X. fastidiosa. Results of this analysis are publicly available in the form of a web database. PMID:16948851
Milos, Patrice
2008-04-01
Helicos BioSciences Corporation is a life sciences company developing revolutionary new single molecule sequencing technology to provide the path to the US$1000 genome. True Single Molecule Sequencing (tSMS) will drive advancements in pharmacogenomics that can enable a better understanding of an individual's susceptibility to disease, develop more effective disease diagnoses and differentiate response to disease therapies. During 2007, genome-wide disease-association studies, the encylopedia of DNA elements (ENCODE) and the published genome sequence of two individuals have revealed human genome variation far more extensive than originally believed. These also demonstrated that common variations explain only a fraction of the genetic basis of disease. Therefore, the capability to understand an individual genome is critical in setting the foundation for the next great revolution in healthcare. Helicos is committed to this vision and will provide cost-effective genome sequencing and comprehensive analysis of the transcribed genome that can unlock the era of personalized healthcare.
Clarke, Laura; Fairley, Susan; Zheng-Bradley, Xiangqun; Streeter, Ian; Perry, Emily; Lowy, Ernesto; Tassé, Anne-Marie; Flicek, Paul
2017-01-01
The International Genome Sample Resource (IGSR; http://www.internationalgenome.org) expands in data type and population diversity the resources from the 1000 Genomes Project. IGSR represents the largest open collection of human variation data and provides easy access to these resources. IGSR was established in 2015 to maintain and extend the 1000 Genomes Project data, which has been widely used as a reference set of human variation and by researchers developing analysis methods. IGSR has mapped all of the 1000 Genomes sequence to the newest human reference (GRCh38), and will release updated variant calls to ensure maximal usefulness of the existing data. IGSR is collecting new structural variation data on the 1000 Genomes samples from long read sequencing and other technologies, and will collect relevant functional data into a single comprehensive resource. IGSR is extending coverage with new populations sequenced by collaborating groups. Here, we present the new data and analysis that IGSR has made available. We have also introduced a new data portal that increases discoverability of our data—previously only browseable through our FTP site—by focusing on particular samples, populations or data sets of interest. PMID:27638885
Read clouds uncover variation in complex regions of the human genome.
Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E; West, Robert; Sidow, Arend; Batzoglou, Serafim
2015-10-01
Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies. © 2015 Bishara et al.; Published by Cold Spring Harbor Laboratory Press.
2010-01-01
Background The maturing field of genomics is rapidly increasing the number of sequenced genomes and producing more information from those previously sequenced. Much of this additional information is variation data derived from sampling multiple individuals of a given species with the goal of discovering new variants and characterising the population frequencies of the variants that are already known. These data have immense value for many studies, including those designed to understand evolution and connect genotype to phenotype. Maximising the utility of the data requires that it be stored in an accessible manner that facilitates the integration of variation data with other genome resources such as gene annotation and comparative genomics. Description The Ensembl project provides comprehensive and integrated variation resources for a wide variety of chordate genomes. This paper provides a detailed description of the sources of data and the methods for creating the Ensembl variation databases. It also explores the utility of the information by explaining the range of query options available, from using interactive web displays, to online data mining tools and connecting directly to the data servers programmatically. It gives a good overview of the variation resources and future plans for expanding the variation data within Ensembl. Conclusions Variation data is an important key to understanding the functional and phenotypic differences between individuals. The development of new sequencing and genotyping technologies is greatly increasing the amount of variation data known for almost all genomes. The Ensembl variation resources are integrated into the Ensembl genome browser and provide a comprehensive way to access this data in the context of a widely used genome bioinformatics system. All Ensembl data is freely available at http://www.ensembl.org and from the public MySQL database server at ensembldb.ensembl.org. PMID:20459805
Zhao, Min; Wang, Qingguo; Wang, Quan; Jia, Peilin; Zhao, Zhongming
2013-01-01
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) or genotyping arrays have been standard technologies to detect large regions subject to copy number changes in genomes until most recently high-resolution sequence data can be analyzed by next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
2013-01-01
Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) or genotyping arrays have been standard technologies to detect large regions subject to copy number changes in genomes until most recently high-resolution sequence data can be analyzed by next-generation sequencing (NGS). During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development. PMID:24564169
Hirose, Yusuke; Onuki, Mamiko; Tenjimbayashi, Yuri; Mori, Seiichiro; Ishii, Yoshiyuki; Takeuchi, Takamasa; Tasaka, Nobutaka; Satoh, Toyomi; Morisada, Tohru; Iwata, Takashi; Miyamoto, Shingo; Matsumoto, Koji; Sekizawa, Akihiko; Kukimoto, Iwao
2018-06-15
Persistent infection with oncogenic human papillomaviruses (HPVs) causes cervical cancer, accompanied by the accumulation of somatic mutations into the host genome. There are concomitant genetic changes in the HPV genome during viral infection; however, their relevance to cervical carcinogenesis is poorly understood. Here, we explored within-host genetic diversity of HPV by performing deep-sequencing analyses of viral whole-genome sequences in clinical specimens. The whole genomes of HPV types 16, 52, and 58 were amplified by type-specific PCR from total cellular DNA of cervical exfoliated cells collected from patients with cervical intraepithelial neoplasia (CIN) and invasive cervical cancer (ICC) and were deep sequenced. After constructing a reference viral genome sequence for each specimen, nucleotide positions showing changes with >0.5% frequencies compared to the reference sequence were determined for individual samples. In total, 1,052 positions of nucleotide variations were detected in HPV genomes from 151 samples (CIN1, n = 56; CIN2/3, n = 68; ICC, n = 27), with various numbers per sample. Overall, C-to-T and C-to-A substitutions were the dominant changes observed across all histological grades. While C-to-T transitions were predominantly detected in CIN1, their prevalence was decreased in CIN2/3 and fell below that of C-to-A transversions in ICC. Analysis of the trinucleotide context encompassing substituted bases revealed that TpCpN, a preferred target sequence for cellular APOBEC cytosine deaminases, was a primary site for C-to-T substitutions in the HPV genome. These results strongly imply that the APOBEC proteins are drivers of HPV genome mutation, particularly in CIN1 lesions. IMPORTANCE HPVs exhibit surprisingly high levels of genetic diversity, including a large repertoire of minor genomic variants in each viral genotype. Here, by conducting deep-sequencing analyses, we show for the first time a comprehensive snapshot of the within-host genetic diversity of high-risk HPVs during cervical carcinogenesis. Quasispecies harboring minor nucleotide variations in viral whole-genome sequences were extensively observed across different grades of CIN and cervical cancer. Among the within-host variations, C-to-T transitions, a characteristic change mediated by cellular APOBEC cytosine deaminases, were predominantly detected throughout the whole viral genome, most strikingly in low-grade CIN lesions. The results strongly suggest that within-host variations of the HPV genome are primarily generated through the interaction with host cell DNA-editing enzymes and that such within-host variability is an evolutionary source of the genetic diversity of HPVs. Copyright © 2018 American Society for Microbiology.
Zhang, Tongwu; Hu, Songnian; Zhang, Guangyu; Pan, Linlin; Zhang, Xiaowei; Al-Mssallem, Ibrahim S.; Yu, Jun
2012-01-01
Hassawi rice (Oryza sativa L.) is a landrace adapted to the climate of Saudi Arabia, characterized by its strong resistance to soil salinity and drought. Using high quality sequencing reads extracted from raw data of a whole genome sequencing project, we assembled both chloroplast (cp) and mitochondrial (mt) genomes of the wild-type Hassawi rice (Hassawi-1) and its dwarf hybrid (Hassawi-2). We discovered 16 InDels (insertions and deletions) but no SNP (single nucleotide polymorphism) is present between the two Hassawi cp genomes. We identified 48 InDels and 26 SNPs in the two Hassawi mt genomes and a new type of sequence variation, termed reverse complementary variation (RCV) in the rice cp genomes. There are two and four RCVs identified in Hassawi-1 when compared to 93–11 (indica) and Nipponbare (japonica), respectively. Microsatellite sequence analysis showed there are more SSRs in the genic regions of both cp and mt genomes in the Hassawi rice than in the other rice varieties. There are also large repeats in the Hassawi mt genomes, with the longest length of 96,168 bp and 96,165 bp in Hassawi-1 and Hassawi-2, respectively. We believe that frequent DNA rearrangement in the Hassawi mt and cp genomes indicate ongoing dynamic processes to reach genetic stability under strong environmental pressures. Based on sequence variation analysis and the breeding history, we suggest that both Hassawi-1 and Hassawi-2 originated from the Indonesian variety Peta since genetic diversity between the two Hassawi cultivars is very low albeit an unknown historic origin of the wild-type Hassawi rice. PMID:22870184
AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae
Song, Giltae; Dickins, Benjamin J. A.; Demeter, Janos; Engel, Stacia; Dunn, Barbara; Cherry, J. Michael
2015-01-01
The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community. PMID:25781462
Genome Sequencing of Steroid Producing Bacteria Using Ion Torrent Technology and a Reference Genome.
Sola-Landa, Alberto; Rodríguez-García, Antonio; Barreiro, Carlos; Pérez-Redondo, Rosario
2017-01-01
The Next-Generation Sequencing technology has enormously eased the bacterial genome sequencing and several tens of thousands of genomes have been sequenced during the last 10 years. Most of the genome projects are published as draft version, however, for certain applications the complete genome sequence is required.In this chapter, we describe the strategy that allowed the complete genome sequencing of Mycobacterium neoaurum NRRL B-3805, an industrial strain exploited for steroid production, using Ion Torrent sequencing reads and the genome of a close strain as the reference. This protocol can be applied to analyze the genetic variations between closely related strains; for example, to elucidate the point mutations between a parental strain and a random mutagenesis-derived mutant.
The African Genome Variation Project shapes medical genetics in Africa
NASA Astrophysics Data System (ADS)
Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O.; Choudhury, Ananyo; Ritchie, Graham R. S.; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N.; Young, Elizabeth H.; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P.; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A.; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S.
2015-01-01
Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.
The African Genome Variation Project shapes medical genetics in Africa.
Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O; Choudhury, Ananyo; Ritchie, Graham R S; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N; Young, Elizabeth H; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S
2015-01-15
Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.
Copy number variation of individual cattle genomes using next-generation sequencing
USDA-ARS?s Scientific Manuscript database
Copy number variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often intractable. Using a read depth approach based on next-generation sequencing, we examined genome-wide copy number differences among five taurine (three Angus, one ...
Copy number variation of individual cattle genomes using next-generation sequencing
USDA-ARS?s Scientific Manuscript database
Copy Number Variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often difficult to track. Using a read depth approach based on next generation sequencing, we examined genome-wide copy number differences among five taurine (three Angu...
Scanning the human genome at kilobase resolution.
Chen, Jun; Kim, Yeong C; Jung, Yong-Chul; Xuan, Zhenyu; Dworkin, Geoff; Zhang, Yanming; Zhang, Michael Q; Wang, San Ming
2008-05-01
Normal genome variation and pathogenic genome alteration frequently affect small regions in the genome. Identifying those genomic changes remains a technical challenge. We report here the development of the DGS (Ditag Genome Scanning) technique for high-resolution analysis of genome structure. The basic features of DGS include (1) use of high-frequent restriction enzymes to fractionate the genome into small fragments; (2) collection of two tags from two ends of a given DNA fragment to form a ditag to represent the fragment; (3) application of the 454 sequencing system to reach a comprehensive ditag sequence collection; (4) determination of the genome origin of ditags by mapping to reference ditags from known genome sequences; (5) use of ditag sequences directly as the sense and antisense PCR primers to amplify the original DNA fragment. To study the relationship between ditags and genome structure, we performed a computational study by using the human genome reference sequences as a model, and analyzed the ditags experimentally collected from the well-characterized normal human DNA GM15510 and the leukemic human DNA of Kasumi-1 cells. Our studies show that DGS provides a kilobase resolution for studying genome structure with high specificity and high genome coverage. DGS can be applied to validate genome assembly, to compare genome similarity and variation in normal populations, and to identify genomic abnormality including insertion, inversion, deletion, translocation, and amplification in pathological genomes such as cancer genomes.
NASA Astrophysics Data System (ADS)
Veglia, A. J.; Milford, C. R.; Marston, M.
2016-02-01
Viruses infecting marine Synechococcus are abundant in coastal marine environments and influence the community composition and abundance of their cyanobacterial hosts. In this study, we focused on the cyanopodoviruses which have smaller genomes and narrower host ranges relative to cyanomyoviruses. While previous studies have compared the genomes of diverse podoviruses, here we analyzed the genomic variation, host ranges, and infection kinetics of podoviruses within the same OTU. The genomes of fifty-five podoviral isolates from the coastal waters of New England were fully sequenced. Based on DNA polymerase gene sequences, these isolates fall into five discrete OTUs (termed RIP - Rhode Island Podovirus). Although all the isolates belonging to the same RIP have very similar DNA polymerase gene sequences (>98% sequence identity), differences in genome content, particularly in regions associated with tail fiber genes, were observed among isolates in the same RIP. Host range tests reveal variation both across and within RIPs. Notably within RIP1, isolates that had similar tail fiber regions also had similar host ranges. Isolates belonging to RIP4 do not contain the host-derived psbA photosynthesis gene, while isolates in the other four RIPs do possess a psbA gene. Nevertheless, infection kinetic experiments suggest that the latent period and burst size for RIP4 isolates are similar to RIP1 isolates. We are continuing to investigate the correlations among genome content, host range, and infection kinetics of isolates belonging to the same OTU. Our results to date suggest that there is substantial genomic variation within an OTU and that this variation likely influences cyanopodoviral - host interactions.
Non-codingRNA sequence variations in human chronic lymphocytic leukemia and colorectal cancer.
Wojcik, Sylwia E; Rossi, Simona; Shimizu, Masayoshi; Nicoloso, Milena S; Cimmino, Amelia; Alder, Hansjuerg; Herlea, Vlad; Rassenti, Laura Z; Rai, Kanti R; Kipps, Thomas J; Keating, Michael J; Croce, Carlo M; Calin, George A
2010-02-01
Cancer is a genetic disease in which the interplay between alterations in protein-coding genes and non-coding RNAs (ncRNAs) plays a fundamental role. In recent years, the full coding component of the human genome was sequenced in various cancers, whereas such attempts related to ncRNAs are still fragmentary. We screened genomic DNAs for sequence variations in 148 microRNAs (miRNAs) and ultraconserved regions (UCRs) loci in patients with chronic lymphocytic leukemia (CLL) or colorectal cancer (CRC) by Sanger technique and further tried to elucidate the functional consequences of some of these variations. We found sequence variations in miRNAs in both sporadic and familial CLL cases, mutations of UCRs in CLLs and CRCs and, in certain instances, detected functional effects of these variations. Furthermore, by integrating our data with previously published data on miRNA sequence variations, we have created a catalog of DNA sequence variations in miRNAs/ultraconserved genes in human cancers. These findings argue that ncRNAs are targeted by both germ line and somatic mutations as well as by single-nucleotide polymorphisms with functional significance for human tumorigenesis. Sequence variations in ncRNA loci are frequent and some have functional and biological significance. Such information can be exploited to further investigate on a genome-wide scale the frequency of genetic variations in ncRNAs and their functional meaning, as well as for the development of new diagnostic and prognostic markers for leukemias and carcinomas.
Non-codingRNA sequence variations in human chronic lymphocytic leukemia and colorectal cancer
Wojcik, Sylwia E.; Rossi, Simona; Shimizu, Masayoshi; Nicoloso, Milena S.; Cimmino, Amelia; Alder, Hansjuerg; Herlea, Vlad; Rassenti, Laura Z.; Rai, Kanti R.; Kipps, Thomas J.; Keating, Michael J.
2010-01-01
Cancer is a genetic disease in which the interplay between alterations in protein-coding genes and non-coding RNAs (ncRNAs) plays a fundamental role. In recent years, the full coding component of the human genome was sequenced in various cancers, whereas such attempts related to ncRNAs are still fragmentary. We screened genomic DNAs for sequence variations in 148 microRNAs (miRNAs) and ultraconserved regions (UCRs) loci in patients with chronic lymphocytic leukemia (CLL) or colorectal cancer (CRC) by Sanger technique and further tried to elucidate the functional consequences of some of these variations. We found sequence variations in miRNAs in both sporadic and familial CLL cases, mutations of UCRs in CLLs and CRCs and, in certain instances, detected functional effects of these variations. Furthermore, by integrating our data with previously published data on miRNA sequence variations, we have created a catalog of DNA sequence variations in miRNAs/ultraconserved genes in human cancers. These findings argue that ncRNAs are targeted by both germ line and somatic mutations as well as by single-nucleotide polymorphisms with functional significance for human tumorigenesis. Sequence variations in ncRNA loci are frequent and some have functional and biological significance. Such information can be exploited to further investigate on a genome-wide scale the frequency of genetic variations in ncRNAs and their functional meaning, as well as for the development of new diagnostic and prognostic markers for leukemias and carcinomas. PMID:19926640
Genome Analysis of the Domestic Dog (Korean Jindo) by Massively Parallel Sequencing
Kim, Ryong Nam; Kim, Dae-Soo; Choi, Sang-Haeng; Yoon, Byoung-Ha; Kang, Aram; Nam, Seong-Hyeuk; Kim, Dong-Wook; Kim, Jong-Joo; Ha, Ji-Hong; Toyoda, Atsushi; Fujiyama, Asao; Kim, Aeri; Kim, Min-Young; Park, Kun-Hyang; Lee, Kang Seon; Park, Hong-Seog
2012-01-01
Although pioneering sequencing projects have shed light on the boxer and poodle genomes, a number of challenges need to be met before the sequencing and annotation of the dog genome can be considered complete. Here, we present the DNA sequence of the Jindo dog genome, sequenced to 45-fold average coverage using Illumina massively parallel sequencing technology. A comparison of the sequence to the reference boxer genome led to the identification of 4 675 437 single nucleotide polymorphisms (SNPs, including 3 346 058 novel SNPs), 71 642 indels and 8131 structural variations. Of these, 339 non-synonymous SNPs and 3 indels are located within coding sequences (CDS). In particular, 3 non-synonymous SNPs and a 26-bp deletion occur in the TCOF1 locus, implying that the difference observed in cranial facial morphology between Jindo and boxer dogs might be influenced by those variations. Through the annotation of the Jindo olfactory receptor gene family, we found 2 unique olfactory receptor genes and 236 olfactory receptor genes harbouring non-synonymous homozygous SNPs that are likely to affect smelling capability. In addition, we determined the DNA sequence of the Jindo dog mitochondrial genome and identified Jindo dog-specific mtDNA genotypes. This Jindo genome data upgrade our understanding of dog genomic architecture and will be a very valuable resource for investigating not only dog genetics and genomics but also human and dog disease genetics and comparative genomics. PMID:22474061
Completed Genome Sequences of Strains from 36 Serotypes of Salmonella
Robertson, James; Yoshida, Catherine; Gurnik, Simone; Rankin, Marisa
2018-01-01
ABSTRACT We report here the completed closed genome sequences of strains representing 36 serotypes of Salmonella. These genome sequences will provide useful references for understanding the genetic variation between serotypes, particularly as references for mapping of raw reads or to create assemblies of higher quality, as well as to aid in studies of comparative genomics of Salmonella. PMID:29348347
The population genomics of rhesus macaques (Macaca mulatta) based on whole-genome sequences
Xue, Cheng; Raveendran, Muthuswamy; Harris, R. Alan; Fawcett, Gloria L.; Liu, Xiaoming; White, Simon; Dahdouli, Mahmoud; Rio Deiros, David; Below, Jennifer E.; Salerno, William; Cox, Laura; Fan, Guoping; Ferguson, Betsy; Horvath, Julie; Johnson, Zach; Kanthaswamy, Sree; Kubisch, H. Michael; Liu, Dahai; Platt, Michael; Smith, David G.; Sun, Binghua; Vallender, Eric J.; Wang, Feng; Wiseman, Roger W.; Chen, Rui; Muzny, Donna M.; Gibbs, Richard A.; Yu, Fuli; Rogers, Jeffrey
2016-01-01
Rhesus macaques (Macaca mulatta) are the most widely used nonhuman primate in biomedical research, have the largest natural geographic distribution of any nonhuman primate, and have been the focus of much evolutionary and behavioral investigation. Consequently, rhesus macaques are one of the most thoroughly studied nonhuman primate species. However, little is known about genome-wide genetic variation in this species. A detailed understanding of extant genomic variation among rhesus macaques has implications for the use of this species as a model for studies of human health and disease, as well as for evolutionary population genomics. Whole-genome sequencing analysis of 133 rhesus macaques revealed more than 43.7 million single-nucleotide variants, including thousands predicted to alter protein sequences, transcript splicing, and transcription factor binding sites. Rhesus macaques exhibit 2.5-fold higher overall nucleotide diversity and slightly elevated putative functional variation compared with humans. This functional variation in macaques provides opportunities for analyses of coding and noncoding variation, and its cellular consequences. Despite modestly higher levels of nonsynonymous variation in the macaques, the estimated distribution of fitness effects and the ratio of nonsynonymous to synonymous variants suggest that purifying selection has had stronger effects in rhesus macaques than in humans. Demographic reconstructions indicate this species has experienced a consistently large but fluctuating population size. Overall, the results presented here provide new insights into the population genomics of nonhuman primates and expand genomic information directly relevant to primate models of human disease. PMID:27934697
Parallel altitudinal clines reveal trends in adaptive evolution of genome size in Zea mays
Berg, Jeremy J.; Birchler, James A.; Grote, Mark N.; Lorant, Anne; Quezada, Juvenal
2018-01-01
While the vast majority of genome size variation in plants is due to differences in repetitive sequence, we know little about how selection acts on repeat content in natural populations. Here we investigate parallel changes in intraspecific genome size and repeat content of domesticated maize (Zea mays) landraces and their wild relative teosinte across altitudinal gradients in Mesoamerica and South America. We combine genotyping, low coverage whole-genome sequence data, and flow cytometry to test for evidence of selection on genome size and individual repeat abundance. We find that population structure alone cannot explain the observed variation, implying that clinal patterns of genome size are maintained by natural selection. Our modeling additionally provides evidence of selection on individual heterochromatic knob repeats, likely due to their large individual contribution to genome size. To better understand the phenotypes driving selection on genome size, we conducted a growth chamber experiment using a population of highland teosinte exhibiting extensive variation in genome size. We find weak support for a positive correlation between genome size and cell size, but stronger support for a negative correlation between genome size and the rate of cell production. Reanalyzing published data of cell counts in maize shoot apical meristems, we then identify a negative correlation between cell production rate and flowering time. Together, our data suggest a model in which variation in genome size is driven by natural selection on flowering time across altitudinal clines, connecting intraspecific variation in repetitive sequence to important differences in adaptive phenotypes. PMID:29746459
USDA-ARS?s Scientific Manuscript database
Human selection has reshaped crop genomes. Here we report an apple genome variation map generated through genome sequencing of 117 diverse accessions. A comprehensive model of apple speciation and domestication along the Silk Road was proposed based on evidence from diverse genomic analyses. Cultiva...
Joardar, Vinita; Abrams, Natalie F; Hostetler, Jessica; Paukstelis, Paul J; Pakala, Suchitra; Pakala, Suman B; Zafar, Nikhat; Abolude, Olukemi O; Payne, Gary; Andrianopoulos, Alex; Denning, David W; Nierman, William C
2012-12-12
The genera Aspergillus and Penicillium include some of the most beneficial as well as the most harmful fungal species such as the penicillin-producer Penicillium chrysogenum and the human pathogen Aspergillus fumigatus, respectively. Their mitochondrial genomic sequences may hold vital clues into the mechanisms of their evolution, population genetics, and biology, yet only a handful of these genomes have been fully sequenced and annotated. Here we report the complete sequence and annotation of the mitochondrial genomes of six Aspergillus and three Penicillium species: A. fumigatus, A. clavatus, A. oryzae, A. flavus, Neosartorya fischeri (A. fischerianus), A. terreus, P. chrysogenum, P. marneffei, and Talaromyces stipitatus (P. stipitatum). The accompanying comparative analysis of these and related publicly available mitochondrial genomes reveals wide variation in size (25-36 Kb) among these closely related fungi. The sources of genome expansion include group I introns and accessory genes encoding putative homing endonucleases, DNA and RNA polymerases (presumed to be of plasmid origin) and hypothetical proteins. The two smallest sequenced genomes (A. terreus and P. chrysogenum) do not contain introns in protein-coding genes, whereas the largest genome (T. stipitatus), contains a total of eleven introns. All of the sequenced genomes have a group I intron in the large ribosomal subunit RNA gene, suggesting that this intron is fixed in these species. Subsequent analysis of several A. fumigatus strains showed low intraspecies variation. This study also includes a phylogenetic analysis based on 14 concatenated core mitochondrial proteins. The phylogenetic tree has a different topology from published multilocus trees, highlighting the challenges still facing the Aspergillus systematics. The study expands the genomic resources available to fungal biologists by providing mitochondrial genomes with consistent annotations for future genetic, evolutionary and population studies. Despite the conservation of the core genes, the mitochondrial genomes of Aspergillus and Penicillium species examined here exhibit significant amount of interspecies variation. Most of this variation can be attributed to accessory genes and mobile introns, presumably acquired by horizontal gene transfer of mitochondrial plasmids and intron homing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gordon, Sean
2013-03-01
Sean Gordon of the USDA on Natural variation in Brachypodium disctachyon: Deep Sequencing of Highly Diverse Natural Accessions at the 8th Annual Genomics of Energy Environment Meeting on March 27, 2013 in Walnut Creek, CA.
West, Claire; James, Stephen A; Davey, Robert P; Dicks, Jo; Roberts, Ian N
2014-07-01
The ribosomal RNA encapsulates a wealth of evolutionary information, including genetic variation that can be used to discriminate between organisms at a wide range of taxonomic levels. For example, the prokaryotic 16S rDNA sequence is very widely used both in phylogenetic studies and as a marker in metagenomic surveys and the internal transcribed spacer region, frequently used in plant phylogenetics, is now recognized as a fungal DNA barcode. However, this widespread use does not escape criticism, principally due to issues such as difficulties in classification of paralogous versus orthologous rDNA units and intragenomic variation, both of which may be significant barriers to accurate phylogenetic inference. We recently analyzed data sets from the Saccharomyces Genome Resequencing Project, characterizing rDNA sequence variation within multiple strains of the baker's yeast Saccharomyces cerevisiae and its nearest wild relative Saccharomyces paradoxus in unprecedented detail. Notably, both species possess single locus rDNA systems. Here, we use these new variation datasets to assess whether a more detailed characterization of the rDNA locus can alleviate the second of these phylogenetic issues, sequence heterogeneity, while controlling for the first. We demonstrate that a strong phylogenetic signal exists within both datasets and illustrate how they can be used, with existing methodology, to estimate intraspecies phylogenies of yeast strains consistent with those derived from whole-genome approaches. We also describe the use of partial Single Nucleotide Polymorphisms, a type of sequence variation found only in repetitive genomic regions, in identifying key evolutionary features such as genome hybridization events and show their consistency with whole-genome Structure analyses. We conclude that our approach can transform rDNA sequence heterogeneity from a problem to a useful source of evolutionary information, enabling the estimation of highly accurate phylogenies of closely related organisms, and discuss how it could be extended to future studies of multilocus rDNA systems. [concerted evolution; genome hydridisation; phylogenetic analysis; ribosomal DNA; whole genome sequencing; yeast]. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
Clarke, Laura; Fairley, Susan; Zheng-Bradley, Xiangqun; Streeter, Ian; Perry, Emily; Lowy, Ernesto; Tassé, Anne-Marie; Flicek, Paul
2017-01-04
The International Genome Sample Resource (IGSR; http://www.internationalgenome.org) expands in data type and population diversity the resources from the 1000 Genomes Project. IGSR represents the largest open collection of human variation data and provides easy access to these resources. IGSR was established in 2015 to maintain and extend the 1000 Genomes Project data, which has been widely used as a reference set of human variation and by researchers developing analysis methods. IGSR has mapped all of the 1000 Genomes sequence to the newest human reference (GRCh38), and will release updated variant calls to ensure maximal usefulness of the existing data. IGSR is collecting new structural variation data on the 1000 Genomes samples from long read sequencing and other technologies, and will collect relevant functional data into a single comprehensive resource. IGSR is extending coverage with new populations sequenced by collaborating groups. Here, we present the new data and analysis that IGSR has made available. We have also introduced a new data portal that increases discoverability of our data-previously only browseable through our FTP site-by focusing on particular samples, populations or data sets of interest. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Human structural variation: mechanisms of chromosome rearrangements
Weckselblatt, Brooke; Rudd, M. Katharine
2015-01-01
Chromosome structural variation (SV) is a normal part of variation in the human genome, but some classes of SV can cause neurodevelopmental disorders. Analysis of the DNA sequence at SV breakpoints can reveal mutational mechanisms and risk factors for chromosome rearrangement. Large-scale SV breakpoint studies have become possible recently owing to advances in next-generation sequencing (NGS) including whole-genome sequencing (WGS). These findings have shed light on complex forms of SV such as triplications, inverted duplications, insertional translocations, and chromothripsis. Sequence-level breakpoint data resolve SV structure and determine how genes are disrupted, fused, and/or misregulated by breakpoints. Recent improvements in breakpoint sequencing have also revealed non-allelic homologous recombination (NAHR) between paralogous long interspersed nuclear element (LINE) or human endogenous retrovirus (HERV) repeats as a cause of deletions, duplications, and translocations. This review covers the genomic organization of simple and complex constitutional SVs, as well as the molecular mechanisms of their formation. PMID:26209074
Identification of structural variation in mouse genomes.
Keane, Thomas M; Wong, Kim; Adams, David J; Flint, Jonathan; Reymond, Alexandre; Yalcin, Binnaz
2014-01-01
Structural variation is variation in structure of DNA regions affecting DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with the recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variation and their respective association to human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of molecular architecture and functional consequences of structural variation.
Goossens, Dirk; Moens, Lotte N; Nelis, Eva; Lenaerts, An-Sofie; Glassee, Wim; Kalbe, Andreas; Frey, Bruno; Kopal, Guido; De Jonghe, Peter; De Rijk, Peter; Del-Favero, Jurgen
2009-03-01
We evaluated multiplex PCR amplification as a front-end for high-throughput sequencing, to widen the applicability of massive parallel sequencers for the detailed analysis of complex genomes. Using multiplex PCR reactions, we sequenced the complete coding regions of seven genes implicated in peripheral neuropathies in 40 individuals on a GS-FLX genome sequencer (Roche). The resulting dataset showed highly specific and uniform amplification. Comparison of the GS-FLX sequencing data with the dataset generated by Sanger sequencing confirmed the detection of all variants present and proved the sensitivity of the method for mutation detection. In addition, we showed that we could exploit the multiplexed PCR amplicons to determine individual copy number variation (CNV), increasing the spectrum of detected variations to both genetic and genomic variants. We conclude that our straightforward procedure substantially expands the applicability of the massive parallel sequencers for sequencing projects of a moderate number of amplicons (50-500) with typical applications in resequencing exons in positional or functional candidate regions and molecular genetic diagnostics. 2008 Wiley-Liss, Inc.
The complete chloroplast genome of Capsicum annuum var. glabriusculum using Illumina sequencing.
Raveendar, Sebastin; Na, Young-Wang; Lee, Jung-Ro; Shim, Donghwan; Ma, Kyung-Ho; Lee, Sok-Young; Chung, Jong-Wook
2015-07-20
Chloroplast (cp) genome sequences provide a valuable source for DNA barcoding. Molecular phylogenetic studies have concentrated on DNA sequencing of conserved gene loci. However, this approach is time consuming and more difficult to implement when gene organization differs among species. Here we report the complete re-sequencing of the cp genome of Capsicum pepper (Capsicum annuum var. glabriusculum) using the Illumina platform. The total length of the cp genome is 156,817 bp with a 37.7% overall GC content. A pair of inverted repeats (IRs) of 50,284 bp were separated by a small single copy (SSC; 18,948 bp) and a large single copy (LSC; 87,446 bp). The number of cp genes in C. annuum var. glabriusculum is the same as that in other Capsicum species. Variations in the lengths of LSC; SSC and IR regions were the main contributors to the size variation in the cp genome of this species. A total of 125 simple sequence repeat (SSR) and 48 insertions or deletions variants were found by sequence alignment of Capsicum cp genome. These findings provide a foundation for further investigation of cp genome evolution in Capsicum and other higher plants.
Barrick, Jeffrey E; Colburn, Geoffrey; Deatherage, Daniel E; Traverse, Charles C; Strand, Matthew D; Borges, Jordan J; Knoester, David B; Reba, Aaron; Meyer, Austin G
2014-11-29
Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events. We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold). Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.
Whole-genome sequencing and genetic variant analysis of a Quarter Horse mare.
Doan, Ryan; Cohen, Noah D; Sawyer, Jason; Ghaffari, Noushin; Johnson, Charlie D; Dindot, Scott V
2012-02-17
The catalog of genetic variants in the horse genome originates from a few select animals, the majority originating from the Thoroughbred mare used for the equine genome sequencing project. The purpose of this study was to identify genetic variants, including single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (INDELs), and copy number variants (CNVs) in the genome of an individual Quarter Horse mare sequenced by next-generation sequencing. Using massively parallel paired-end sequencing, we generated 59.6 Gb of DNA sequence from a Quarter Horse mare resulting in an average of 24.7X sequence coverage. Reads were mapped to approximately 97% of the reference Thoroughbred genome. Unmapped reads were de novo assembled resulting in 19.1 Mb of new genomic sequence in the horse. Using a stringent filtering method, we identified 3.1 million SNPs, 193 thousand INDELs, and 282 CNVs. Genetic variants were annotated to determine their impact on gene structure and function. Additionally, we genotyped this Quarter Horse for mutations of known diseases and for variants associated with particular traits. Functional clustering analysis of genetic variants revealed that most of the genetic variation in the horse's genome was enriched in sensory perception, signal transduction, and immunity and defense pathways. This is the first sequencing of a horse genome by next-generation sequencing and the first genomic sequence of an individual Quarter Horse mare. We have increased the catalog of genetic variants for use in equine genomics by the addition of novel SNPs, INDELs, and CNVs. The genetic variants described here will be a useful resource for future studies of genetic variation regulating performance traits and diseases in equids.
Macas, Jiří; Novák, Petr; Pellicer, Jaume; Čížková, Jana; Koblížková, Andrea; Neumann, Pavel; Fuková, Iva; Doležel, Jaroslav; Kelly, Laura J; Leitch, Ilia J
2015-01-01
The differential accumulation and elimination of repetitive DNA are key drivers of genome size variation in flowering plants, yet there have been few studies which have analysed how different types of repeats in related species contribute to genome size evolution within a phylogenetic context. This question is addressed here by conducting large-scale comparative analysis of repeats in 23 species from four genera of the monophyletic legume tribe Fabeae, representing a 7.6-fold variation in genome size. Phylogenetic analysis and genome size reconstruction revealed that this diversity arose from genome size expansions and contractions in different lineages during the evolution of Fabeae. Employing a combination of low-pass genome sequencing with novel bioinformatic approaches resulted in identification and quantification of repeats making up 55-83% of the investigated genomes. In turn, this enabled an analysis of how each major repeat type contributed to the genome size variation encountered. Differential accumulation of repetitive DNA was found to account for 85% of the genome size differences between the species, and most (57%) of this variation was found to be driven by a single lineage of Ty3/gypsy LTR-retrotransposons, the Ogre elements. Although the amounts of several other lineages of LTR-retrotransposons and the total amount of satellite DNA were also positively correlated with genome size, their contributions to genome size variation were much smaller (up to 6%). Repeat analysis within a phylogenetic framework also revealed profound differences in the extent of sequence conservation between different repeat types across Fabeae. In addition to these findings, the study has provided a proof of concept for the approach combining recent developments in sequencing and bioinformatics to perform comparative analyses of repetitive DNAs in a large number of non-model species without the need to assemble their genomes.
Complex multifractal nature in Mycobacterium tuberculosis genome
Mandal, Saurav; Roychowdhury, Tanmoy; Chirom, Keilash; Bhattacharya, Alok; Brojen Singh, R. K.
2017-01-01
The mutifractal and long range correlation (C(r)) properties of strings, such as nucleotide sequence can be a useful parameter for identification of underlying patterns and variations. In this study C(r) and multifractal singularity function f(α) have been used to study variations in the genomes of a pathogenic bacteria Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α) reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into different subgroups based on sensitivity to drugs, these are DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow one parameter scaling law. The richness in complexity of each subgroup can be quantified by the measures of multifractal parameters displaying a pattern in which XDR isolates have highest value and lowest for drug sensitive isolates. Therefore C(r) and multifractal functions can be useful parameters for analysis of genomic sequences. PMID:28440326
Complex multifractal nature in Mycobacterium tuberculosis genome
NASA Astrophysics Data System (ADS)
Mandal, Saurav; Roychowdhury, Tanmoy; Chirom, Keilash; Bhattacharya, Alok; Brojen Singh, R. K.
2017-04-01
The mutifractal and long range correlation (C(r)) properties of strings, such as nucleotide sequence can be a useful parameter for identification of underlying patterns and variations. In this study C(r) and multifractal singularity function f(α) have been used to study variations in the genomes of a pathogenic bacteria Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α) reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into different subgroups based on sensitivity to drugs, these are DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow one parameter scaling law. The richness in complexity of each subgroup can be quantified by the measures of multifractal parameters displaying a pattern in which XDR isolates have highest value and lowest for drug sensitive isolates. Therefore C(r) and multifractal functions can be useful parameters for analysis of genomic sequences.
RefSeq microbial genomes database: new representation and annotation strategy.
Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor
2014-01-01
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
Iso-Touru, T; Sahana, G; Guldbrandtsen, B; Lund, M S; Vilkki, J
2016-03-22
The Nordic Red Cattle consisting of three different populations from Finland, Sweden and Denmark are under a joint breeding value estimation system. The long history of recording of production and health traits offers a great opportunity to study production traits and identify causal variants behind them. In this study, we used whole genome sequence level data from 4280 progeny tested Nordic Red Cattle bulls to scan the genome for loci affecting milk, fat and protein yields. Using a genome-wise significance threshold, regions on Bos taurus chromosomes 5, 14, 23, 25 and 26 were associated with fat yield. Regions on chromosomes 5, 14, 16, 19, 20 and 25 were associated with milk yield and chromosomes 5, 14 and 25 had regions associated with protein yield. Significantly associated variations were found in 227 genes for fat yield, 72 genes for milk yield and 30 genes for protein yield. Ingenuity Pathway Analysis was used to identify networks connecting these genes displaying significant hits. When compared to previously mapped genomic regions associated with fertility, significantly associated variations were found in 5 genes common for fat yield and fertility, thus linking these two traits via biological networks. This is the first time when whole genome sequence data is utilized to study genomic regions affecting milk production in the Nordic Red Cattle population. Sequence level data offers the possibility to study quantitative traits in detail but still cannot unambiguously reveal which of the associated variations is causative. Linkage disequilibrium creates difficulties to pinpoint the causative genes and variations. One solution to overcome these difficulties is the identification of the functional gene networks and pathways to reveal important interacting genes as candidates for the observed effects. This information on target genomic regions may be exploited to improve genomic prediction.
Museum genomics: low-cost and high-accuracy genetic data from historical specimens.
Rowe, Kevin C; Singhal, Sonal; Macmanes, Matthew D; Ayroles, Julien F; Morelli, Toni Lyn; Rubidge, Emily M; Bi, Ke; Moritz, Craig C
2011-11-01
Natural history collections are unparalleled repositories of geographical and temporal variation in faunal conditions. Molecular studies offer an opportunity to uncover much of this variation; however, genetic studies of historical museum specimens typically rely on extracting highly degraded and chemically modified DNA samples from skins, skulls or other dried samples. Despite this limitation, obtaining short fragments of DNA sequences using traditional PCR amplification of DNA has been the primary method for genetic study of historical specimens. Few laboratories have succeeded in obtaining genome-scale sequences from historical specimens and then only with considerable effort and cost. Here, we describe a low-cost approach using high-throughput next-generation sequencing to obtain reliable genome-scale sequence data from a traditionally preserved mammal skin and skull using a simple extraction protocol. We show that single-nucleotide polymorphisms (SNPs) from the genome sequences obtained independently from the skin and from the skull are highly repeatable compared to a reference genome. © 2011 Blackwell Publishing Ltd.
Webb, Kristen M; Rosenthal, Benjamin M
2011-01-01
The mitochondrial genome's non-recombinant mode of inheritance and relatively rapid rate of evolution has promoted its use as a marker for studying the biogeographic history and evolutionary interrelationships among many metazoan species. A modest portion of the mitochondrial genome has been defined for 12 species and genotypes of parasites in the genus Trichinella, but its adequacy in representing the mitochondrial genome as a whole remains unclear, as the complete coding sequence has been characterized only for Trichinella spiralis. Here, we sought to comprehensively describe the extent and nature of divergence between the mitochondrial genomes of T. spiralis (which poses the most appreciable zoonotic risk owing to its capacity to establish persistent infections in domestic pigs) and Trichinella murrelli (which is the most prevalent species in North American wildlife hosts, but which poses relatively little risk to the safety of pork). Next generation sequencing methodologies and scaffold and de novo assembly strategies were employed. The entire protein-coding region was sequenced (13,917 bp), along with a portion of the highly repetitive non-coding region (1524 bp) of the mitochondrial genome of T. murrelli with a combined average read depth of 250 reads. The accuracy of base calling, estimated from coding region sequence was found to exceed 99.3%. Genome content and gene order was not found to be significantly different from that of T. spiralis. An overall inter-species sequence divergence of 9.5% was estimated. Significant variation was identified when the amount of variation between species at each gene is compared to the average amount of variation between species across the coding region. Next generation sequencing is a highly effective means to obtain previously unknown mitochondrial genome sequence. Particular to parasites, the extremely deep coverage achieved through this method allows for the detection of sequence heterogeneity between the multiple individuals that necessarily comprise such templates. Copyright © 2010 Elsevier B.V. All rights reserved.
Tedim, Ana P.; Lanza, Val F.; Manrique, Marina; Pareja, Eduardo; Ruiz-Garbajosa, Patricia; Cantón, Rafael; Baquero, Fernando; Tobes, Raquel
2017-01-01
ABSTRACT The emergence of nosocomial infections by multidrug-resistant sequence type 117 (ST117) Enterococcus faecium has been reported in several European countries. ST117 has been detected in Spanish hospitals as one of the main causes of bloodstream infections. We analyzed genome variations of ST117 strains isolated in Madrid and describe the first ST117 closed genome sequences. PMID:28360174
Keel, B N; Nonneman, D J; Rohrer, G A
2017-08-01
Genetic variants detected from sequence have been used to successfully identify causal variants and map complex traits in several organisms. High and moderate impact variants, those expected to alter or disrupt the protein coded by a gene and those that regulate protein production, likely have a more significant effect on phenotypic variation than do other types of genetic variants. Hence, a comprehensive list of these functional variants would be of considerable interest in swine genomic studies, particularly those targeting fertility and production traits. Whole-genome sequence was obtained from 72 of the founders of an intensely phenotyped experimental swine herd at the U.S. Meat Animal Research Center (USMARC). These animals included all 24 of the founding boars (12 Duroc and 12 Landrace) and 48 Yorkshire-Landrace composite sows. Sequence reads were mapped to the Sscrofa10.2 genome build, resulting in a mean of 6.1 fold (×) coverage per genome. A total of 22 342 915 high confidence SNPs were identified from the sequenced genomes. These included 21 million previously reported SNPs and 79% of the 62 163 SNPs on the PorcineSNP60 BeadChip assay. Variation was detected in the coding sequence or untranslated regions (UTRs) of 87.8% of the genes in the porcine genome: loss-of-function variants were predicted in 504 genes, 10 202 genes contained nonsynonymous variants, 10 773 had variation in UTRs and 13 010 genes contained synonymous variants. Approximately 139 000 SNPs were classified as loss-of-function, nonsynonymous or regulatory, which suggests that over 99% of the variation detected in our pigs could potentially be ignored, allowing us to focus on a much smaller number of functional SNPs during future analyses. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
Natural Allelic Variations in Highly Polyploidy Saccharum Complex
DOE Office of Scientific and Technical Information (OSTI.GOV)
Song, Jian; Yang, Xiping; Resende, Jr., Marcio F. R.
Sugarcane ( Saccharum spp.) is an important sugar and biofuel crop with high polyploid and complex genomes. The Saccharum complex, comprised of Saccharum genus and a few related genera, are important genetic resources for sugarcane breeding. A large amount of natural variation exists within the Saccharum complex. Though understanding their allelic variation has been challenging, it is critical to dissect allelic structure and to identify the alleles controlling important traits in sugarcane. To characterize natural variations in Saccharum complex, a target enrichment sequencing approach was used to assay 12 representative germplasm accessions. In total, 55,946 highly efficient probes were designedmore » based on the sorghum genome and sugarcane unigene set targeting a total of 6 Mb of the sugarcane genome. A pipeline specifically tailored for polyploid sequence variants and genotype calling was established. BWAmem and sorghum genome approved to be an acceptable aligner and reference for sugarcane target enrichment sequence analysis, respectively. Genetic variations including 1,166,066 non-redundant SNPs, 150,421 InDels, 919 gene copy number variations, and 1,257 gene presence/absence variations were detected. SNPs from three different callers (Samtools, Freebayes, and GATK) were compared and the validation rates were nearly 90%. Based on the SNP loci of each accession and their ploidy levels, 999,258 single dosage SNPs were identified and most loci were estimated as largely homozygotes. An average of 34,397 haplotype blocks for each accession was inferred. The highest divergence time among the Saccharum spp. was estimated as 1.2 million years ago (MYA). Saccharum spp. diverged from Erianthus and Sorghum approximately 5 and 6 MYA, respectively. Furthermore, the target enrichment sequencing approach provided an effective way to discover and catalog natural allelic variation in highly polyploid or heterozygous genomes.« less
Natural Allelic Variations in Highly Polyploidy Saccharum Complex
Song, Jian; Yang, Xiping; Resende, Jr., Marcio F. R.; ...
2016-06-08
Sugarcane ( Saccharum spp.) is an important sugar and biofuel crop with high polyploid and complex genomes. The Saccharum complex, comprised of Saccharum genus and a few related genera, are important genetic resources for sugarcane breeding. A large amount of natural variation exists within the Saccharum complex. Though understanding their allelic variation has been challenging, it is critical to dissect allelic structure and to identify the alleles controlling important traits in sugarcane. To characterize natural variations in Saccharum complex, a target enrichment sequencing approach was used to assay 12 representative germplasm accessions. In total, 55,946 highly efficient probes were designedmore » based on the sorghum genome and sugarcane unigene set targeting a total of 6 Mb of the sugarcane genome. A pipeline specifically tailored for polyploid sequence variants and genotype calling was established. BWAmem and sorghum genome approved to be an acceptable aligner and reference for sugarcane target enrichment sequence analysis, respectively. Genetic variations including 1,166,066 non-redundant SNPs, 150,421 InDels, 919 gene copy number variations, and 1,257 gene presence/absence variations were detected. SNPs from three different callers (Samtools, Freebayes, and GATK) were compared and the validation rates were nearly 90%. Based on the SNP loci of each accession and their ploidy levels, 999,258 single dosage SNPs were identified and most loci were estimated as largely homozygotes. An average of 34,397 haplotype blocks for each accession was inferred. The highest divergence time among the Saccharum spp. was estimated as 1.2 million years ago (MYA). Saccharum spp. diverged from Erianthus and Sorghum approximately 5 and 6 MYA, respectively. Furthermore, the target enrichment sequencing approach provided an effective way to discover and catalog natural allelic variation in highly polyploid or heterozygous genomes.« less
Mining sequence variations in representative polyploid sugarcane germplasm accessions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Xiping; Song, Jian; You, Qian
Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS), was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and the genome complexity reduction using restriction enzymes.more » To use GBS for high throughput genotyping highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variants callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, non-reference based universal network enabled analysis kit and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA. The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.« less
Mining sequence variations in representative polyploid sugarcane germplasm accessions
Yang, Xiping; Song, Jian; You, Qian; ...
2017-08-09
Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS), was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and the genome complexity reduction using restriction enzymes.more » To use GBS for high throughput genotyping highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variants callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, non-reference based universal network enabled analysis kit and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA. The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.« less
Coverage Bias and Sensitivity of Variant Calling for Four Whole-genome Sequencing Technologies
Lasitschka, Bärbel; Jones, David; Northcott, Paul; Hutter, Barbara; Jäger, Natalie; Kool, Marcel; Taylor, Michael; Lichter, Peter; Pfister, Stefan; Wolf, Stephan; Brors, Benedikt; Eils, Roland
2013-01-01
The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics. Currently, there are four commonly used whole-genome sequencing platforms on the market: Illumina’s HiSeq2000, Life Technologies’ SOLiD 4 and its completely redesigned 5500xl SOLiD, and Complete Genomics’ technology. A number of earlier studies have compared a subset of those sequencing platforms or compared those platforms with Sanger sequencing, which is prohibitively expensive for whole genome studies. Here we present a detailed comparison of the performance of all currently available whole genome sequencing platforms, especially regarding their ability to call SNVs and to evenly cover the genome and specific genomic regions. Unlike earlier studies, we base our comparison on four different samples, allowing us to assess the between-sample variation of the platforms. We find a pronounced GC bias in GC-rich regions for Life Technologies’ platforms, with Complete Genomics performing best here, while we see the least bias in GC-poor regions for HiSeq2000 and 5500xl. HiSeq2000 gives the most uniform coverage and displays the least sample-to-sample variation. In contrast, Complete Genomics exhibits by far the smallest fraction of bases not covered, while the SOLiD platforms reveal remarkable shortcomings, especially in covering CpG islands. When comparing the performance of the four platforms for calling SNPs, HiSeq2000 and Complete Genomics achieve the highest sensitivity, while the SOLiD platforms show the lowest false positive rate. Finally, we find that integrating sequencing data from different platforms offers the potential to combine the strengths of different technologies. In summary, our results detail the strengths and weaknesses of all four whole-genome sequencing platforms. It indicates application areas that call for a specific sequencing platform and disallow other platforms. This helps to identify the proper sequencing platform for whole genome studies with different application scopes. PMID:23776689
RSAT 2018: regulatory sequence analysis tools 20th anniversary.
Nguyen, Nga Thi Thuy; Contreras-Moreira, Bruno; Castro-Mondragon, Jaime A; Santana-Garcia, Walter; Ossio, Raul; Robles-Espinoza, Carla Daniela; Bahin, Mathieu; Collombet, Samuel; Vincens, Pierre; Thieffry, Denis; van Helden, Jacques; Medina-Rivera, Alejandra; Thomas-Chollier, Morgane
2018-05-02
RSAT (Regulatory Sequence Analysis Tools) is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. Six public servers jointly support 10 000 genomes from all kingdoms. Six novel or refactored programs have been added since the 2015 NAR Web Software Issue, including updated programs to analyse regulatory variants (retrieve-variation-seq, variation-scan, convert-variations), along with tools to extract sequences from a list of coordinates (retrieve-seq-bed), to select motifs from motif collections (retrieve-matrix), and to extract orthologs based on Ensembl Compara (get-orthologs-compara). Three use cases illustrate the integration of new and refactored tools to the suite. This Anniversary update gives a 20-year perspective on the software suite. RSAT is well-documented and available through Web sites, SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services, virtual machines and stand-alone programs at http://www.rsat.eu/.
Microfluidic droplet enrichment for targeted sequencing
Eastburn, Dennis J.; Huang, Yong; Pellegrino, Maurizio; Sciambi, Adam; Ptáček, Louis J.; Abate, Adam R.
2015-01-01
Targeted sequence enrichment enables better identification of genetic variation by providing increased sequencing coverage for genomic regions of interest. Here, we report the development of a new target enrichment technology that is highly differentiated from other approaches currently in use. Our method, MESA (Microfluidic droplet Enrichment for Sequence Analysis), isolates genomic DNA fragments in microfluidic droplets and performs TaqMan PCR reactions to identify droplets containing a desired target sequence. The TaqMan positive droplets are subsequently recovered via dielectrophoretic sorting, and the TaqMan amplicons are removed enzymatically prior to sequencing. We demonstrated the utility of this approach by generating an average 31.6-fold sequence enrichment across 250 kb of targeted genomic DNA from five unique genomic loci. Significantly, this enrichment enabled a more comprehensive identification of genetic polymorphisms within the targeted loci. MESA requires low amounts of input DNA, minimal prior locus sequence information and enriches the target region without PCR bias or artifacts. These features make it well suited for the study of genetic variation in a number of research and diagnostic applications. PMID:25873629
TUMOR HAPLOTYPE ASSEMBLY ALGORITHMS FOR CANCER GENOMICS
AGUIAR, DEREK; WONG, WENDY S.W.; ISTRAIL, SORIN
2014-01-01
The growing availability of inexpensive high-throughput sequence data is enabling researchers to sequence tumor populations within a single individual at high coverage. But, cancer genome sequence evolution and mutational phenomena like driver mutations and gene fusions are difficult to investigate without first reconstructing tumor haplotype sequences. Haplotype assembly of single individual tumor populations is an exceedingly difficult task complicated by tumor haplotype heterogeneity, tumor or normal cell sequence contamination, polyploidy, and complex patterns of variation. While computational and experimental haplotype phasing of diploid genomes has seen much progress in recent years, haplotype assembly in cancer genomes remains uncharted territory. In this work, we describe HapCompass-Tumor a computational modeling and algorithmic framework for haplotype assembly of copy number variable cancer genomes containing haplotypes at different frequencies and complex variation. We extend our polyploid haplotype assembly model and present novel algorithms for (1) complex variations, including copy number changes, as varying numbers of disjoint paths in an associated graph, (2) variable haplotype frequencies and contamination, and (3) computation of tumor haplotypes using simple cycles of the compass graph which constrain the space of haplotype assembly solutions. The model and algorithm are implemented in the software package HapCompass-Tumor which is available for download from http://www.brown.edu/Research/Istrail_Lab/. PMID:24297529
Whole-Genome Sequencing Reveals Genetic Variation in the Asian House Rat.
Teng, Huajing; Zhang, Yaohua; Shi, Chengmin; Mao, Fengbiao; Hou, Lingling; Guo, Hongling; Sun, Zhongsheng; Zhang, Jianxu
2016-07-07
Whole-genome sequencing of wild-derived rat species can provide novel genomic resources, which may help decipher the genetics underlying complex phenotypes. As a notorious pest, reservoir of human pathogens, and colonizer, the Asian house rat, Rattus tanezumi, is successfully adapted to its habitat. However, little is known regarding genetic variation in this species. In this study, we identified over 41,000,000 single-nucleotide polymorphisms, plus insertions and deletions, through whole-genome sequencing and bioinformatics analyses. Moreover, we identified over 12,000 structural variants, including 143 chromosomal inversions. Further functional analyses revealed several fixed nonsense mutations associated with infection and immunity-related adaptations, and a number of fixed missense mutations that may be related to anticoagulant resistance. A genome-wide scan for loci under selection identified various genes related to neural activity. Our whole-genome sequencing data provide a genomic resource for future genetic studies of the Asian house rat species and have the potential to facilitate understanding of the molecular adaptations of rats to their ecological niches. Copyright © 2016 Teng et al.
Human genetics and genomics a decade after the release of the draft sequence of the human genome.
Naidoo, Nasheen; Pawitan, Yudi; Soong, Richie; Cooper, David N; Ku, Chee-Seng
2011-10-01
Substantial progress has been made in human genetics and genomics research over the past ten years since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major developments have been made in cataloguing genetic variation, the International HapMap Project, and with respect to advances in genotyping technologies. These developments are vital for the emergence of genome-wide association studies in the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome sequencing and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade.
Human genetics and genomics a decade after the release of the draft sequence of the human genome
2011-01-01
Substantial progress has been made in human genetics and genomics research over the past ten years since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major developments have been made in cataloguing genetic variation, the International HapMap Project, and with respect to advances in genotyping technologies. These developments are vital for the emergence of genome-wide association studies in the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome sequencing and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade. PMID:22155605
A survey of tools for variant analysis of next-generation genome sequencing data
Pabinger, Stephan; Dander, Andreas; Fischer, Maria; Snajder, Rene; Sperk, Michael; Efremova, Mirjana; Krabichler, Birgit; Speicher, Michael R.; Zschocke, Johannes
2014-01-01
Recent advances in genome sequencing technologies provide unprecedented opportunities to characterize individual genomic landscapes and identify mutations relevant for diagnosis and therapy. Specifically, whole-exome sequencing using next-generation sequencing (NGS) technologies is gaining popularity in the human genetics community due to the moderate costs, manageable data amounts and straightforward interpretation of analysis results. While whole-exome and, in the near future, whole-genome sequencing are becoming commodities, data analysis still poses significant challenges and led to the development of a plethora of tools supporting specific parts of the analysis workflow or providing a complete solution. Here, we surveyed 205 tools for whole-genome/whole-exome sequencing data analysis supporting five distinct analytical steps: quality assessment, alignment, variant identification, variant annotation and visualization. We report an overview of the functionality, features and specific requirements of the individual tools. We then selected 32 programs for variant identification, variant annotation and visualization, which were subjected to hands-on evaluation using four data sets: one set of exome data from two patients with a rare disease for testing identification of germline mutations, two cancer data sets for testing variant callers for somatic mutations, copy number variations and structural variations, and one semi-synthetic data set for testing identification of copy number variations. Our comprehensive survey and evaluation of NGS tools provides a valuable guideline for human geneticists working on Mendelian disorders, complex diseases and cancers. PMID:23341494
Clan Genomics and the Complex Architecture of Human Disease
Belmont, John W.; Boerwinkle, Eric
2013-01-01
Human diseases are caused by alleles that encompass the full range of variant types, from single-nucleotide changes to copy-number variants, and these variations span a broad frequency spectrum, from the very rare to the common. The picture emerging from analysis of whole-genome sequences, the 1000 Genomes Project pilot studies, and targeted genomic sequencing derived from very large sample sizes reveals an abundance of rare and private variants. One implication of this realization is that recent mutation may have a greater influence on disease susceptibility or protection than is conferred by variations that arose in distant ancestors. PMID:21962505
A map of human genome variation from population-scale sequencing.
Abecasis, Gonçalo R; Altshuler, David; Auton, Adam; Brooks, Lisa D; Durbin, Richard M; Gibbs, Richard A; Hurles, Matt E; McVean, Gil A
2010-10-28
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
Zapata, Luis; Ding, Jia; Willing, Eva-Maria; Hartwig, Benjamin; Bezdan, Daniela; Jiao, Wen-Biao; Patel, Vipul; Velikkakam James, Geo; Koornneef, Maarten; Ossowski, Stephan; Schneeberger, Korbinian
2016-07-12
Resequencing or reference-based assemblies reveal large parts of the small-scale sequence variation. However, they typically fail to separate such local variation into colinear and rearranged variation, because they usually do not recover the complement of large-scale rearrangements, including transpositions and inversions. Besides the availability of hundreds of genomes of diverse Arabidopsis thaliana accessions, there is so far only one full-length assembled genome: the reference sequence. We have assembled 117 Mb of the A. thaliana Landsberg erecta (Ler) genome into five chromosome-equivalent sequences using a combination of short Illumina reads, long PacBio reads, and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inversions comprising ∼3.6 Mb, in addition to 4.1 Mb of nonreference sequence, mostly originating from duplications. Although rearranged regions are not different in local divergence from colinear regions, they are drastically depleted for meiotic recombination in heterozygotes. Using a 1.2-Mb inversion as an example, we show that such rearrangement-mediated reduction of meiotic recombination can lead to genetically isolated haplotypes in the worldwide population of A. thaliana Moreover, we found 105 single-copy genes, which were only present in the reference sequence or the Ler assembly, and 334 single-copy orthologs, which showed an additional copy in only one of the genomes. To our knowledge, this work gives first insights into the degree and type of variation, which will be revealed once complete assemblies will replace resequencing or other reference-dependent methods.
Lei, Yong-Liang; Wang, Xiao-Guang; Tao, Xiao-Yan; Li, Hao; Meng, Sheng-Li; Chen, Xiu-Ying; Liu, Fu-Ming; Ye, Bi-Feng; Tang, Qing
2010-01-01
Based on sequencing the full-length genomes of four Chinese Ferret-Badger and dog, we analyze the properties of rabies viruses genetic variation in molecular level, get the information about rabies viruses prevalence and variation in Zhejiang, and enrich the genome database of rabies viruses street strains isolated from China. Rabies viruses in suckling mice were isolated, overlapped fragments were amplified by RT-PCR and full-length genomes were assembled to analyze the nucleotide and deduced protein similarities and phylogenetic analyses from Chinese Ferret-Badger, dog, sika deer, vole, used vaccine strain were determined. The four full-length genomes were sequenced completely and had the same genetic structure with the length of 11, 923 nts or 11, 925 nts including 58 nts-Leader, 1353 nts-NP, 894 nts-PP, 609 nts-MP, 1575 nts-GP, 6386 nts-LP, and 2, 5, 5 nts- intergenic regions(IGRs), 423 nts-Pseudogene-like sequence (psi), 70 nts-Trailer. The four full-length genomes were in accordance with the properties of Rhabdoviridae Lyssa virus by BLAST and multi-sequence alignment. The nucleotide and amino acid sequences among Chinese strains had the highest similarity, especially among animals of the same species. Of the four full-length genomes, the similarity in amino acid level was dramatically higher than that in nucleotide level, so the nucleotide mutations happened in these four genomes were most synonymous mutations. Compared with the reference rabies viruses, the lengths of the five protein coding regions had no change, no recombination, only with a few point mutations. It was evident that the five proteins appeared to be stable. The variation sites and types of the four genomes were similar to the reference vaccine or street strains. And the four strains were genotype 1 according to the multi-sequence and phylogenetic analyses, which possessed the distinct district characteristics of China. Therefore, these four rabies viruses are likely to be street viruses already existing in the natural world.
Intra-specific variation in genome size in maize: cytological and phenotypic correlates
Realini, María Florencia; Poggio, Lidia; Cámara-Hernández, Julián; González, Graciela Esther
2016-01-01
Genome size variation accompanies the diversification and evolution of many plant species. Relationships between DNA amount and phenotypic and cytological characteristics form the basis of most hypotheses that ascribe a biological role to genome size. The goal of the present research was to investigate the intra-specific variation in the DNA content in maize populations from Northeastern Argentina and further explore the relationship between genome size and the phenotypic traits seed weight and length of the vegetative cycle. Moreover, cytological parameters such as the percentage of heterochromatin as well as the number, position and sequence composition of knobs were analysed and their relationships with 2C DNA values were explored. The populations analysed presented significant differences in 2C DNA amount, from 4.62 to 6.29 pg, representing 36.15 % of the inter-populational variation. Moreover, intra-populational genome size variation was found, varying from 1.08 to 1.63-fold. The variation in the percentage of knob heterochromatin as well as in the number, chromosome position and sequence composition of the knobs was detected among and within the populations. Although a positive relationship between genome size and the percentage of heterochromatin was observed, a significant correlation was not found. This confirms that other non-coding repetitive DNA sequences are contributing to the genome size variation. A positive relationship between DNA amount and the seed weight has been reported in a large number of species, this relationship was not found in the populations studied here. The length of the vegetative cycle showed a positive correlation with the percentage of heterochromatin. This result allowed attributing an adaptive effect to heterochromatin since the length of this cycle would be optimized via selection for an appropriate percentage of heterochromatin. PMID:26644343
Natarajan, Sathishkumar; Kim, Hoy-Taek; Thamilarasan, Senthil Kumar; Veerappan, Karpagam; Park, Jong-In; Nou, Ill-Sup
2016-01-01
Powdery mildew is one of the most common fungal diseases in the world. This disease frequently affects melon (Cucumis melo L.) and other Cucurbitaceous family crops in both open field and greenhouse cultivation. One of the goals of genomics is to identify the polymorphic loci responsible for variation in phenotypic traits. In this study, powdery mildew disease assessment scores were calculated for four melon accessions, 'SCNU1154', 'Edisto47', 'MR-1', and 'PMR5'. To investigate the genetic variation of these accessions, whole genome re-sequencing using the Illumina HiSeq 2000 platform was performed. A total of 754,759,704 quality-filtered reads were generated, with an average of 82.64% coverage relative to the reference genome. Comparisons of the sequences for the melon accessions revealed around 7.4 million single nucleotide polymorphisms (SNPs), 1.9 million InDels, and 182,398 putative structural variations (SVs). Functional enrichment analysis of detected variations classified them into biological process, cellular component and molecular function categories. Further, a disease-associated QTL map was constructed for 390 SNPs and 45 InDels identified as related to defense-response genes. Among them 112 SNPs and 12 InDels were observed in powdery mildew responsive chromosomes. Accordingly, this whole genome re-sequencing study identified SNPs and InDels associated with defense genes that will serve as candidate polymorphisms in the search for sources of resistance against powdery mildew disease and could accelerate marker-assisted breeding in melon.
Population Genomics of Paramecium Species.
Johri, Parul; Krenek, Sascha; Marinov, Georgi K; Doak, Thomas G; Berendonk, Thomas U; Lynch, Michael
2017-05-01
Population-genomic analyses are essential to understanding factors shaping genomic variation and lineage-specific sequence constraints. The dearth of such analyses for unicellular eukaryotes prompted us to assess genomic variation in Paramecium, one of the most well-studied ciliate genera. The Paramecium aurelia complex consists of ∼15 morphologically indistinguishable species that diverged subsequent to two rounds of whole-genome duplications (WGDs, as long as 320 MYA) and possess extremely streamlined genomes. We examine patterns of both nuclear and mitochondrial polymorphism, by sequencing whole genomes of 10-13 worldwide isolates of each of three species belonging to the P. aurelia complex: P. tetraurelia, P. biaurelia, P. sexaurelia, as well as two outgroup species that do not share the WGDs: P. caudatum and P. multimicronucleatum. An apparent absence of global geographic population structure suggests continuous or recent dispersal of Paramecium over long distances. Intergenic regions are highly constrained relative to coding sequences, especially in P. caudatum and P. multimicronucleatum that have shorter intergenic distances. Sequence diversity and divergence are reduced up to ∼100-150 bp both upstream and downstream of genes, suggesting strong constraints imposed by the presence of densely packed regulatory modules. In addition, comparison of sequence variation at non-synonymous and synonymous sites suggests similar recent selective pressures on paralogs within and orthologs across the deeply diverging species. This study presents the first genome-wide population-genomic analysis in ciliates and provides a valuable resource for future studies in evolutionary and functional genetics in Paramecium. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells.
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. © 2018 Han et al.; Published by Cold Spring Harbor Laboratory Press.
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. PMID:29208629
Ellis, Lisa L.; Huang, Wen; Quinn, Andrew M.; Ahuja, Astha; Alfrejd, Ben; Gomez, Francisco E.; Hjelmen, Carl E.; Moore, Kristi L.; Mackay, Trudy F. C.; Johnston, J. Spencer; Tarone, Aaron M.
2014-01-01
We determined female genome sizes using flow cytometry for 211 Drosophila melanogaster sequenced inbred strains from the Drosophila Genetic Reference Panel, and found significant conspecific and intrapopulation variation in genome size. We also compared several life history traits for 25 lines with large and 25 lines with small genomes in three thermal environments, and found that genome size as well as genome size by temperature interactions significantly correlated with survival to pupation and adulthood, time to pupation, female pupal mass, and female eclosion rates. Genome size accounted for up to 23% of the variation in developmental phenotypes, but the contribution of genome size to variation in life history traits was plastic and varied according to the thermal environment. Expression data implicate differences in metabolism that correspond to genome size variation. These results indicate that significant genome size variation exists within D. melanogaster and this variation may impact the evolutionary ecology of the species. Genome size variation accounts for a significant portion of life history variation in an environmentally dependent manner, suggesting that potential fitness effects associated with genome size variation also depend on environmental conditions. PMID:25057905
A high-resolution cattle CNV map by population-scale genome sequencing
USDA-ARS?s Scientific Manuscript database
Copy Number Variations (CNVs) are common genomic structural variations that have been linked to human diseases and phenotypic traits. CNVs represent an important type of genetic variation among cattle breeds and even individual animals; however, only low-resolution maps of cattle CNVs currently exis...
Typing and comparative genome analysis of Brucella melitensis isolated from Lebanon.
Abou Zaki, Natalia; Salloum, Tamara; Osman, Marwan; Rafei, Rayane; Hamze, Monzer; Tokajian, Sima
2017-10-16
Brucella melitensis is the main causative agent of the zoonotic disease brucellosis. This study aimed at typing and characterizing genetic variation in 33 Brucella isolates recovered from patients in Lebanon. Bruce-ladder multiplex PCR and PCR-RFLP of omp31, omp2a and omp2b were performed. Sixteen representative isolates were chosen for draft-genome sequencing and analyzed to determine variations in virulence, resistance, genomic islands, prophages and insertion sequences. Comparative whole-genome single nucleotide polymorphism analysis was also performed. The isolates were confirmed to be B. melitensis. Genome analysis revealed multiple virulence determinants and efflux pumps. Genome comparisons and single nucleotide polymorphisms divided the isolates based on geographical distribution but revealed high levels of similarity between the strains. Sequence divergence in B. melitensis was mainly due to lateral gene transfer of mobile elements. This is the first report of an in-depth genomic characterization of B. melitensis in Lebanon. © FEMS 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Bergman, Casey M.; Haddrill, Penelope R.
2015-01-01
To contribute to our general understanding of the evolutionary forces that shape variation in genome sequences in nature, we have sequenced genomes from 50 isofemale lines and six pooled samples from populations of Drosophila melanogaster on three continents. Analysis of raw and reference-mapped reads indicates the quality of these genomic sequence data is very high. Comparison of the predicted and experimentally-determined Wolbachia infection status of these samples suggests that strain or sample swaps are unlikely to have occurred in the generation of these data. Genome sequences are freely available in the European Nucleotide Archive under accession ERP009059. Isofemale lines can be obtained from the Drosophila Species Stock Center. PMID:25717372
Bergman, Casey M; Haddrill, Penelope R
2015-01-01
To contribute to our general understanding of the evolutionary forces that shape variation in genome sequences in nature, we have sequenced genomes from 50 isofemale lines and six pooled samples from populations of Drosophila melanogaster on three continents. Analysis of raw and reference-mapped reads indicates the quality of these genomic sequence data is very high. Comparison of the predicted and experimentally-determined Wolbachia infection status of these samples suggests that strain or sample swaps are unlikely to have occurred in the generation of these data. Genome sequences are freely available in the European Nucleotide Archive under accession ERP009059. Isofemale lines can be obtained from the Drosophila Species Stock Center.
Salleh, Mohd Zaki; Teh, Lay Kek; Lee, Lian Shien; Ismet, Rose Iszati; Patowary, Ashok; Joshi, Kandarp; Pasha, Ayesha; Ahmed, Azni Zain; Janor, Roziah Mohd; Hamzah, Ahmad Sazali; Adam, Aishah; Yusoff, Khalid; Hoh, Boon Peng; Hatta, Fazleen Haslinda Mohd; Ismail, Mohamad Izwan; Scaria, Vinod; Sivasubbu, Sridhar
2013-01-01
With a higher throughput and lower cost in sequencing, second generation sequencing technology has immense potential for translation into clinical practice and in the realization of pharmacogenomics based patient care. The systematic analysis of whole genome sequences to assess patient to patient variability in pharmacokinetics and pharmacodynamics responses towards drugs would be the next step in future medicine in line with the vision of personalizing medicine. Genomic DNA obtained from a 55 years old, self-declared healthy, anonymous male of Malay descent was sequenced. The subject's mother died of lung cancer and the father had a history of schizophrenia and deceased at the age of 65 years old. A systematic, intuitive computational workflow/pipeline integrating custom algorithm in tandem with large datasets of variant annotations and gene functions for genetic variations with pharmacogenomics impact was developed. A comprehensive pathway map of drug transport, metabolism and action was used as a template to map non-synonymous variations with potential functional consequences. Over 3 million known variations and 100,898 novel variations in the Malay genome were identified. Further in-depth pharmacogenetics analysis revealed a total of 607 unique variants in 563 proteins, with the eventual identification of 4 drug transport genes, 2 drug metabolizing enzyme genes and 33 target genes harboring deleterious SNVs involved in pharmacological pathways, which could have a potential role in clinical settings. The current study successfully unravels the potential of personal genome sequencing in understanding the functionally relevant variations with potential influence on drug transport, metabolism and differential therapeutic outcomes. These will be essential for realizing personalized medicine through the use of comprehensive computational pipeline for systematic data mining and analysis.
iMETHYL: an integrative database of human DNA methylation, gene expression, and genomic variation.
Komaki, Shohei; Shiwa, Yuh; Furukawa, Ryohei; Hachiya, Tsuyoshi; Ohmomo, Hideki; Otomo, Ryo; Satoh, Mamoru; Hitomi, Jiro; Sobue, Kenji; Sasaki, Makoto; Shimizu, Atsushi
2018-01-01
We launched an integrative multi-omics database, iMETHYL (http://imethyl.iwate-megabank.org). iMETHYL provides whole-DNA methylation (~24 million autosomal CpG sites), whole-genome (~9 million single-nucleotide variants), and whole-transcriptome (>14 000 genes) data for CD4 + T-lymphocytes, monocytes, and neutrophils collected from approximately 100 subjects. These data were obtained from whole-genome bisulfite sequencing, whole-genome sequencing, and whole-transcriptome sequencing, making iMETHYL a comprehensive database.
Ciotlos, Serban; Mao, Qing; Zhang, Rebecca Yu; Li, Zhenyu; Chin, Robert; Gulbahce, Natali; Liu, Sophie Jia; Drmanac, Radoje; Peters, Brock A
2016-01-01
The cell line BT-474 is a popular cell line for studying the biology of cancer and developing novel drugs. However, there is no complete, published genome sequence for this highly utilized scientific resource. In this study we sought to provide a comprehensive and useful data set for the scientific community by generating a whole genome sequence for BT-474. Five μg of genomic DNA, isolated from an early passage of the BT-474 cell line, was used to generate a whole genome sequence (114X coverage) using Complete Genomics' standard sequencing process. To provide additional variant phasing and structural variation data we also processed and analyzed two separate libraries of 5 and 6 individual cells to depths of 99X and 87X, respectively, using Complete Genomics' Long Fragment Read (LFR) technology. BT-474 is a highly aneuploid cell line with an extremely complex genome sequence. This ~300X total coverage genome sequence provides a more complete understanding of this highly utilized cell line at the genomic level.
Sequence variation of the feline immunodeficiency virus genome and its clinical relevance.
Stickney, A L; Dunowska, M; Cave, N J
2013-06-08
The ongoing evolution of feline immunodeficiency virus (FIV) has resulted in the existence of a diverse continuum of viruses. FIV isolates differ with regards to their mutation and replication rates, plasma viral loads, cell tropism and the ability to induce apoptosis. Clinical disease in FIV-infected cats is also inconsistent. Genomic sequence variation of FIV is likely to be responsible for some of the variation in viral behaviour. The specific genetic sequences that influence these key viral properties remain to be determined. With knowledge of the specific key determinants of pathogenicity, there is the potential for veterinarians in the future to apply this information for prognostic purposes. Genomic sequence variation of FIV also presents an obstacle to effective vaccine development. Most challenge studies demonstrate acceptable efficacy of a dual-subtype FIV vaccine (Fel-O-Vax FIV) against FIV infection under experimental settings; however, vaccine efficacy in the field still remains to be proven. It is important that we discover the key determinants of immunity induced by this vaccine; such data would compliment vaccine field efficacy studies and provide the basis to make informed recommendations on its use.
α satellite DNA variation and function of the human centromere
Sullivan, Lori L.; Chew, Kimberline
2017-01-01
ABSTRACT Genomic variation is a source of functional diversity that is typically studied in genic and non-coding regulatory regions. However, the extent of variation within noncoding portions of the human genome, particularly highly repetitive regions, and the functional consequences are not well understood. Satellite DNA, including α satellite DNA found at human centromeres, comprises up to 10% of the genome, but is difficult to study because its repetitive nature hinders contiguous sequence assemblies. We recently described variation within α satellite DNA that affects centromere function. On human chromosome 17 (HSA17), we showed that size and sequence polymorphisms within primary array D17Z1 are associated with chromosome aneuploidy and defective centromere architecture. However, HSA17 can counteract this instability by assembling the centromere at a second, “backup” array lacking variation. Here, we discuss our findings in a broader context of human centromere assembly, and highlight areas of future study to uncover links between genomic and epigenetic features of human centromeres. PMID:28406740
Genome and Transcriptome Sequencing of the Ostreid herpesvirus 1 From Tomales Bay, California
NASA Astrophysics Data System (ADS)
Burge, C. A.; Langevin, S.; Closek, C. J.; Roberts, S. B.; Friedman, C. S.
2016-02-01
Mass mortalities of larval and seed bivalve molluscs attributed to the Ostreid herpesvirus 1 (OsHV-1) occur globally. OsHV-1 was fully sequenced and characterized as a member of the Family Malacoherpesviridae. Multiple strains of OsHV-1 exist and may vary in virulence, i.e. OsHV-1 µvar. For most global variants of OsHV-1, sequence data is limited to PCR-based sequencing of segments, including two recent genomes. In the United States, OsHV-1 is limited to detection in adjacent embayments in California, Tomales and Drakes bays. Limited DNA sequence data of OsHV-1 infecting oysters in Tomales Bay indicates the virus detected in Tomales Bay is similar but not identical to any one global variant of OsHV-1. In order to better understand both strain variation and virulence of OsHV-1 infecting oysters in Tomales Bay, we used genomic and transcriptomic sequencing. Meta-genomic sequencing (Illumina MiSeq) was conducted from infected oysters (n=4 per year) collected in 2003, 2007, and 2014, where full OsHV-1 genome sequences and low overall microbial diversity were achieved from highly infected oysters. Increased microbial diversity was detected in three of four samples sequenced from 2003, where qPCR based genome copy numbers of OsHV-1 were lower. Expression analysis (SOLiD RNA sequencing) of OsHV-1 genes expressed in oyster larvae at 24 hours post exposure revealed a nearly complete transcriptome, with several highly expressed genes, which are similar to recent transcriptomic analyses of other OsHV-1 variants. Taken together, our results indicate that genome and transcriptome sequencing may be powerful tools in understanding both strain variation and virulence of non-culturable marine viruses.
Expanding probe repertoire and improving reproducibility in human genomic hybridization
Dorman, Stephanie N.; Shirley, Ben C.; Knoll, Joan H. M.; Rogan, Peter K.
2013-01-01
Diagnostic DNA hybridization relies on probes composed of single copy (sc) genomic sequences. Sc sequences in probe design ensure high specificity and avoid cross-hybridization to other regions of the genome, which could lead to ambiguous results that are difficult to interpret. We examine how the distribution and composition of repetitive sequences in the genome affects sc probe performance. A divide and conquer algorithm was implemented to design sc probes. With this approach, sc probes can include divergent repetitive elements, which hybridize to unique genomic targets under higher stringency experimental conditions. Genome-wide custom probe sets were created for fluorescent in situ hybridization (FISH) and microarray genomic hybridization. The scFISH probes were developed for detection of copy number changes within small tumour suppressor genes and oncogenes. The microarrays demonstrated increased reproducibility by eliminating cross-hybridization to repetitive sequences adjacent to probe targets. The genome-wide microarrays exhibited lower median coefficients of variation (17.8%) for two HapMap family trios. The coefficients of variations of commercial probes within 300 nt of a repetitive element were 48.3% higher than the nearest custom probe. Furthermore, the custom microarray called a chromosome 15q11.2q13 deletion more consistently. This method for sc probe design increases probe coverage for FISH and lowers variability in genomic microarrays. PMID:23376933
Alverson, Andrew J.; Wei, XiaoXin; Rice, Danny W.; Stern, David B.; Barry, Kerrie; Palmer, Jeffrey D.
2010-01-01
The mitochondrial genomes of seed plants are unusually large and vary in size by at least an order of magnitude. Much of this variation occurs within a single family, the Cucurbitaceae, whose genomes range from an estimated 390 to 2,900 kb in size. We sequenced the mitochondrial genomes of Citrullus lanatus (watermelon: 379,236 nt) and Cucurbita pepo (zucchini: 982,833 nt)—the two smallest characterized cucurbit mitochondrial genomes—and determined their RNA editing content. The relatively compact Citrullus mitochondrial genome actually contains more and longer genes and introns, longer segmental duplications, and more discernibly nuclear-derived DNA. The large size of the Cucurbita mitochondrial genome reflects the accumulation of unprecedented amounts of both chloroplast sequences (>113 kb) and short repeated sequences (>370 kb). A low mutation rate has been hypothesized to underlie increases in both genome size and RNA editing frequency in plant mitochondria. However, despite its much larger genome, Cucurbita has a significantly higher synonymous substitution rate (and presumably mutation rate) than Citrullus but comparable levels of RNA editing. The evolution of mutation rate, genome size, and RNA editing are apparently decoupled in Cucurbitaceae, reflecting either simple stochastic variation or governance by different factors. PMID:20118192
From Conventional to Next Generation Sequencing of Epstein-Barr Virus Genomes.
Kwok, Hin; Chiang, Alan Kwok Shing
2016-02-24
Genomic sequences of Epstein-Barr virus (EBV) have been of interest because the virus is associated with cancers, such as nasopharyngeal carcinoma, and conditions such as infectious mononucleosis. The progress of whole-genome EBV sequencing has been limited by the inefficiency and cost of the first-generation sequencing technology. With the advancement of next-generation sequencing (NGS) and target enrichment strategies, increasing number of EBV genomes has been published. These genomes were sequenced using different approaches, either with or without EBV DNA enrichment. This review provides an overview of the EBV genomes published to date, and a description of the sequencing technology and bioinformatic analyses employed in generating these sequences. We further explored ways through which the quality of sequencing data can be improved, such as using DNA oligos for capture hybridization, and longer insert size and read length in the sequencing runs. These advances will enable large-scale genomic sequencing of EBV which will facilitate a better understanding of the genetic variations of EBV in different geographic regions and discovery of potentially pathogenic variants in specific diseases.
Mercenaro, Luca; Nieddu, Giovanni; Porceddu, Andrea; Pezzotti, Mario; Camiolo, Salvatore
2017-01-01
The genetic diversity among grapevine (Vitis vinifera L.) cultivars that underlies differences in agronomic performance and wine quality reflects the accumulation of single nucleotide polymorphisms (SNPs) and small indels as well as larger genomic variations. A combination of high throughput sequencing and mapping against the grapevine reference genome allows the creation of comprehensive sequence variation maps. We used next generation sequencing and bioinformatics to generate an inventory of SNPs and small indels in four widely cultivated Sardinian grape cultivars (Bovale sardo, Cannonau, Carignano and Vermentino). More than 3,200,000 SNPs were identified with high statistical confidence. Some of the SNPs caused the appearance of premature stop codons and thus identified putative pseudogenes. The analysis of SNP distribution along chromosomes led to the identification of large genomic regions with uninterrupted series of homozygous SNPs. We used a digital comparative genomic hybridization approach to identify 6526 genomic regions with significant differences in copy number among the four cultivars compared to the reference sequence, including 81 regions shared between all four cultivars and 4953 specific to single cultivars (representing 1.2 and 75.9% of total copy number variation, respectively). Reads mapping at a distance that was not compatible with the insert size were used to identify a dataset of putative large deletions with cultivar Cannonau revealing the highest number. The analysis of genes mapping to these regions provided a list of candidates that may explain some of the phenotypic differences among the Bovale sardo, Cannonau, Carignano and Vermentino cultivars. PMID:28775732
Ding, Yanqiang; Fang, Yang; Guo, Ling; Li, Zhidan; He, Kaize; Zhao, Yun; Zhao, Hai
2017-01-01
Phylogenetic relationship within different genera of Lemnoideae, a kind of small aquatic monocotyledonous plants, was not well resolved, using either morphological characters or traditional markers. Given that rich genetic information in chloroplast genome makes them particularly useful for phylogenetic studies, we used chloroplast genomes to clarify the phylogeny within Lemnoideae. DNAs were sequenced with next-generation sequencing. The duckweeds chloroplast genomes were indirectly filtered from the total DNA data, or directly obtained from chloroplast DNA data. To test the reliability of assembling the chloroplast genome based on the filtration of the total DNA, two methods were used to assemble the chloroplast genome of Landoltia punctata strain ZH0202. A phylogenetic tree was built on the basis of the whole chloroplast genome sequences using MrBayes v.3.2.6 and PhyML 3.0. Eight complete duckweeds chloroplast genomes were assembled, with lengths ranging from 165,775 bp to 171,152 bp, and each contains 80 protein-coding sequences, four rRNAs, 30 tRNAs and two pseudogenes. The identity of L. punctata strain ZH0202 chloroplast genomes assembled through two methods was 100%, and their sequences and lengths were completely identical. The chloroplast genome comparison demonstrated that the differences in chloroplast genome sizes among the Lemnoideae primarily resulted from variation in non-coding regions, especially from repeat sequence variation. The phylogenetic analysis demonstrated that the different genera of Lemnoideae are derived from each other in the following order: Spirodela , Landoltia , Lemna , Wolffiella , and Wolffia . This study demonstrates potential of whole chloroplast genome DNA as an effective option for phylogenetic studies of Lemnoideae. It also showed the possibility of using chloroplast DNA data to elucidate those phylogenies which were not yet solved well by traditional methods even in plants other than duckweeds.
Phylogenic study of Lemnoideae (duckweeds) through complete chloroplast genomes for eight accessions
Ding, Yanqiang; Fang, Yang; Guo, Ling; Li, Zhidan; He, Kaize
2017-01-01
Background Phylogenetic relationship within different genera of Lemnoideae, a kind of small aquatic monocotyledonous plants, was not well resolved, using either morphological characters or traditional markers. Given that rich genetic information in chloroplast genome makes them particularly useful for phylogenetic studies, we used chloroplast genomes to clarify the phylogeny within Lemnoideae. Methods DNAs were sequenced with next-generation sequencing. The duckweeds chloroplast genomes were indirectly filtered from the total DNA data, or directly obtained from chloroplast DNA data. To test the reliability of assembling the chloroplast genome based on the filtration of the total DNA, two methods were used to assemble the chloroplast genome of Landoltia punctata strain ZH0202. A phylogenetic tree was built on the basis of the whole chloroplast genome sequences using MrBayes v.3.2.6 and PhyML 3.0. Results Eight complete duckweeds chloroplast genomes were assembled, with lengths ranging from 165,775 bp to 171,152 bp, and each contains 80 protein-coding sequences, four rRNAs, 30 tRNAs and two pseudogenes. The identity of L. punctata strain ZH0202 chloroplast genomes assembled through two methods was 100%, and their sequences and lengths were completely identical. The chloroplast genome comparison demonstrated that the differences in chloroplast genome sizes among the Lemnoideae primarily resulted from variation in non-coding regions, especially from repeat sequence variation. The phylogenetic analysis demonstrated that the different genera of Lemnoideae are derived from each other in the following order: Spirodela, Landoltia, Lemna, Wolffiella, and Wolffia. Discussion This study demonstrates potential of whole chloroplast genome DNA as an effective option for phylogenetic studies of Lemnoideae. It also showed the possibility of using chloroplast DNA data to elucidate those phylogenies which were not yet solved well by traditional methods even in plants other than duckweeds. PMID:29302399
An integrated approach to exploit linkage disequilibrium for ultra high dimensional genome-wide data
USDA-ARS?s Scientific Manuscript database
With the advent of recent DNA sequencing methods (determining molecule order) that quickly produce millions of DNA sequences, variation among sequences in a genome (all the DNA contained in chromosomes of an organism) can be tested for association with traits of economic interest on a relatively lar...
Deep whole-genome sequencing of 90 Han Chinese genomes.
Lan, Tianming; Lin, Haoxiang; Zhu, Wenjuan; Laurent, Tellier Christian Asker Melchior; Yang, Mengcheng; Liu, Xin; Wang, Jun; Wang, Jian; Yang, Huanming; Xu, Xun; Guo, Xiaosen
2017-09-01
Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (∼×80) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects. © The Authors 2017. Published by Oxford University Press.
The Past, Present, and Future of Human Centromere Genomics
Aldrup-MacDonald, Megan E.; Sullivan, Beth A.
2014-01-01
The centromere is the chromosomal locus essential for chromosome inheritance and genome stability. Human centromeres are located at repetitive alpha satellite DNA arrays that compose approximately 5% of the genome. Contiguous alpha satellite DNA sequence is absent from the assembled reference genome, limiting current understanding of centromere organization and function. Here, we review the progress in centromere genomics spanning the discovery of the sequence to its molecular characterization and the work done during the Human Genome Project era to elucidate alpha satellite structure and sequence variation. We discuss exciting recent advances in alpha satellite sequence assembly that have provided important insight into the abundance and complex organization of this sequence on human chromosomes. In light of these new findings, we offer perspectives for future studies of human centromere assembly and function. PMID:24683489
Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome
2013-01-01
Background There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys. Results A collection of 14 barley genotypes including eight cultivars and six wild barleys were used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley. Conclusions We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. We also identify potential mechanisms that can generate variation in copy number in plant genomes. PMID:23758725
Re-sequencing and genetic variation identification of a rice line with ideal plant architecture.
Li, Shuangcheng; Xie, Kailong; Li, Wenbo; Zou, Ting; Ren, Yun; Wang, Shiquan; Deng, Qiming; Zheng, Aiping; Zhu, Jun; Liu, Huainian; Wang, Lingxia; Ai, Peng; Gao, Fengyan; Huang, Bin; Cao, Xuemei; Li, Ping
2012-12-01
The ideal plant architecture (IPA) includes several important characteristics such as low tiller numbers, few or no unproductive tillers, more grains per panicle, and thick and sturdy stems. We have developed an indica restorer line 7302R that displays the IPA phenotype in terms of tiller number, grain number, and stem strength. However, its mechanism had to be clarified. We performed re-sequencing and genome-wide variation analysis of 7302R using the Solexa sequencing technology. With the genomic sequence of the indica cultivar 9311 as reference, 307 627 SNPs, 57 372 InDels, and 3 096 SVs were identified in the 7302R genome. The 7302R-specific variations were investigated via the synteny analysis of all the SNPs of 7302R with those of the previous sequenced none-IPA-type lines IR24, MH63, and SH527. Moreover, we found 178 168 7302R-specific SNPs across the whole genome and 30 239 SNPs in the predicted mRNA regions, among which 8 517 were Non-syn CDS. In addition, 263 large-effect SNPs that were expected to affect the integrity of encoded proteins were identified from the 7302R-specific SNPs. SNPs of several important previously cloned rice genes were also identified by aligning the 7302R sequence with other sequence lines. Our results provided several candidates account for the IPA phenotype of 7302R. These results therefore lay the groundwork for long-term efforts to uncover important genes and alleles for rice plant architecture construction, also offer useful data resources for future genetic and genomic studies in rice.
Relating Human Genetic Variation to Variation in Drug Responses
Madian, Ashraf G.; Wheeler, Heather E.; Jones, Richard Baker; Dolan, M. Eileen
2012-01-01
Although sequencing a single human genome was a monumental effort a decade ago, more than one thousand genomes have now been sequenced. The task ahead lies in transforming this information into personalized treatment strategies that are tailored to the unique genetics of each individual. One important aspect of personalized medicine is patient-to-patient variation in drug response. Pharmacogenomics addresses this issue by seeking to identify genetic contributors to human variation in drug efficacy and toxicity. Here, we present a summary of the current status of this field, which has evolved from studies of single candidate genes to comprehensive genome-wide analyses. Additionally, we discuss the major challenges in translating this knowledge into a systems-level understanding of drug physiology with the ultimate goal of developing more effective personalized clinical treatment strategies. PMID:22840197
2014-01-01
Background Neisseria meningitidis expresses type four pili (Tfp) which are important for colonisation and virulence. Tfp have been considered as one of the most variable structures on the bacterial surface due to high frequency gene conversion, resulting in amino acid sequence variation of the major pilin subunit (PilE). Meningococci express either a class I or a class II pilE gene and recent work has indicated that class II pilins do not undergo antigenic variation, as class II pilE genes encode conserved pilin subunits. The purpose of this work was to use whole genome sequences to further investigate the frequency and variability of the class II pilE genes in meningococcal isolate collections. Results We analysed over 600 publically available whole genome sequences of N. meningitidis isolates to determine the sequence and genomic organization of pilE. We confirmed that meningococcal strains belonging to a limited number of clonal complexes (ccs, namely cc1, cc5, cc8, cc11 and cc174) harbour a class II pilE gene which is conserved in terms of sequence and chromosomal context. We also identified pilS cassettes in all isolates with class II pilE, however, our analysis indicates that these do not serve as donor sequences for pilE/pilS recombination. Furthermore, our work reveals that the class II pilE locus lacks the DNA sequence motifs that enable (G4) or enhance (Sma/Cla repeat) pilin antigenic variation. Finally, through analysis of pilin genes in commensal Neisseria species we found that meningococcal class II pilE genes are closely related to pilE from Neisseria lactamica and Neisseria polysaccharea, suggesting horizontal transfer among these species. Conclusions Class II pilins can be defined by their amino acid sequence and genomic context and are present in meningococcal isolates which have persisted and spread globally. The absence of G4 and Sma/Cla sequences adjacent to the class II pilE genes is consistent with the lack of pilin subunit variation in these isolates, although horizontal transfer may generate class II pilin diversity. This study supports the suggestion that high frequency antigenic variation of pilin is not universal in pathogenic Neisseria. PMID:24690385
Comparison and correlation of Simple Sequence Repeats distribution in genomes of Brucella species
Kiran, Jangampalli Adi Pradeep; Chakravarthi, Veeraraghavulu Praveen; Kumar, Yellapu Nanda; Rekha, Somesula Swapna; Kruti, Srinivasan Shanthi; Bhaskar, Matcha
2011-01-01
Computational genomics is one of the important tools to understand the distribution of closely related genomes including simple sequence repeats (SSRs) in an organism, which gives valuable information regarding genetic variations. The central objective of the present study was to screen the SSRs distributed in coding and non-coding regions among different human Brucella species which are involved in a range of pathological disorders. Computational analysis of the SSRs in the Brucella indicates few deviations from expected random models. Statistical analysis also reveals that tri-nucleotide SSRs are overrepresented and tetranucleotide SSRs underrepresented in Brucella genomes. From the data, it can be suggested that over expressed tri-nucleotide SSRs in genomic and coding regions might be responsible in the generation of functional variation of proteins expressed which in turn may lead to different pathogenicity, virulence determinants, stress response genes, transcription regulators and host adaptation proteins of Brucella genomes. Abbreviations SSRs - Simple Sequence Repeats, ORFs - Open Reading Frames. PMID:21738309
Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications.
Huang, Lei; Ma, Fei; Chapman, Alec; Lu, Sijia; Xie, Xiaoliang Sunney
2015-01-01
We present a survey of single-cell whole-genome amplification (WGA) methods, including degenerate oligonucleotide-primed polymerase chain reaction (DOP-PCR), multiple displacement amplification (MDA), and multiple annealing and looping-based amplification cycles (MALBAC). The key parameters to characterize the performance of these methods are defined, including genome coverage, uniformity, reproducibility, unmappable rates, chimera rates, allele dropout rates, false positive rates for calling single-nucleotide variations, and ability to call copy-number variations. Using these parameters, we compare five commercial WGA kits by performing deep sequencing of multiple single cells. We also discuss several major applications of single-cell genomics, including studies of whole-genome de novo mutation rates, the early evolution of cancer genomes, circulating tumor cells (CTCs), meiotic recombination of germ cells, preimplantation genetic diagnosis (PGD), and preimplantation genomic screening (PGS) for in vitro-fertilized embryos.
A High-Definition View of Functional Genetic Variation from Natural Yeast Genomes
Bergström, Anders; Simpson, Jared T.; Salinas, Francisco; Barré, Benjamin; Parts, Leopold; Zia, Amin; Nguyen Ba, Alex N.; Moses, Alan M.; Louis, Edward J.; Mustonen, Ville; Warringer, Jonas; Durbin, Richard; Liti, Gianni
2014-01-01
The question of how genetic variation in a population influences phenotypic variation and evolution is of major importance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metabolism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies. PMID:24425782
Molecular spectrum of somaclonal variation in regenerated rice revealed by whole-genome sequencing.
Miyao, Akio; Nakagome, Mariko; Ohnuma, Takako; Yamagata, Harumi; Kanamori, Hiroyuki; Katayose, Yuichi; Takahashi, Akira; Matsumoto, Takashi; Hirochika, Hirohiko
2012-01-01
Somaclonal variation is a phenomenon that results in the phenotypic variation of plants regenerated from cell culture. One of the causes of somaclonal variation in rice is the transposition of retrotransposons. However, many aspects of the mechanisms that result in somaclonal variation remain undefined. To detect genome-wide changes in regenerated rice, we analyzed the whole-genome sequences of three plants independently regenerated from cultured cells originating from a single seed stock. Many single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels) were detected in the genomes of the regenerated plants. The transposition of only Tos17 among 43 transposons examined was detected in the regenerated plants. Therefore, the SNPs and indels contribute to the somaclonal variation in regenerated rice in addition to the transposition of Tos17. The observed molecular spectrum was similar to that of the spontaneous mutations in Arabidopsis thaliana. However, the base change ratio was estimated to be 1.74 × 10(-6) base substitutions per site per regeneration, which is 248-fold greater than the spontaneous mutation rate of A. thaliana.
Single haplotype assembly of the human genome from a hydatidiform mole.
Steinberg, Karyn Meltz; Schneider, Valerie A; Graves-Lindsay, Tina A; Fulton, Robert S; Agarwala, Richa; Huddleston, John; Shiryev, Sergey A; Morgulis, Aleksandr; Surti, Urvashi; Warren, Wesley C; Church, Deanna M; Eichler, Evan E; Wilson, Richard K
2014-12-01
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content show this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicate misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly. © 2014 Steinberg et al.; Published by Cold Spring Harbor Laboratory Press.
Single haplotype assembly of the human genome from a hydatidiform mole
Steinberg, Karyn Meltz; Schneider, Valerie A.; Graves-Lindsay, Tina A.; Fulton, Robert S.; Agarwala, Richa; Huddleston, John; Shiryev, Sergey A.; Morgulis, Aleksandr; Surti, Urvashi; Warren, Wesley C.; Church, Deanna M.; Eichler, Evan E.; Wilson, Richard K.
2014-01-01
A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content show this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicate misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly. PMID:25373144
Walker, Joseph F; Zanis, Michael J; Emery, Nancy C
2014-04-01
Complete chloroplast genome studies can help resolve relationships among large, complex plant lineages such as Asteraceae. We present the first whole plastome from the Madieae tribe and compare its sequence variation to other chloroplast genomes in Asteraceae. We used high throughput sequencing to obtain the Lasthenia burkei chloroplast genome. We compared sequence structure and rates of molecular evolution in the small single copy (SSC), large single copy (LSC), and inverted repeat (IR) regions to those for eight Asteraceae accessions and one Solanaceae accession. The chloroplast sequence of L. burkei is 150 746 bp and contains 81 unique protein coding genes and 4 coding ribosomal RNA sequences. We identified three major inversions in the L. burkei chloroplast, all of which have been found in other Asteraceae lineages, and a previously unreported inversion in Lactuca sativa. Regions flanking inversions contained tRNA sequences, but did not have particularly high G + C content. Substitution rates varied among the SSC, LSC, and IR regions, and rates of evolution within each region varied among species. Some observed differences in rates of molecular evolution may be explained by the relative proportion of coding to noncoding sequence within regions. Rates of molecular evolution vary substantially within and among chloroplast genomes, and major inversion events may be promoted by the presence of tRNAs. Collectively, these results provide insight into different mechanisms that may promote intramolecular recombination and the inversion of large genomic regions in the plastome.
The Saccharomyces Genome Database Variant Viewer
Sheppard, Travis K.; Hitz, Benjamin C.; Engel, Stacia R.; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C.; Dalusag, Kyla S.; Demeter, Janos; Hellerstedt, Sage T.; Karra, Kalpana; Nash, Robert S.; Paskov, Kelley M.; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Cherry, J. Michael
2016-01-01
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer. PMID:26578556
Effective de novo assembly of fish genome using haploid larvae.
Iwasaki, Yuki; Nishiki, Issei; Nakamura, Yoji; Yasuike, Motoshige; Kai, Wataru; Nomura, Kazuharu; Yoshida, Kazunori; Nomura, Yousuke; Fujiwara, Atushi; Kobayashi, Takanori; Ototake, Mitsuru
2016-02-01
Recent improvements in next-generation sequencing technology have made it possible to do whole genome sequencing, on even non-model eukaryote species with no available reference genomes. However, de novo assembly of diploid genomes is still a big challenge because of allelic variation. The aim of this study was to determine the feasibility of utilizing the genome of haploid fish larvae for de novo assembly of whole-genome sequences. We compared the efficiency of assembly using the haploid genome of yellowtail (Seriola quinqueradiata) with that using the diploid genome obtained from the dam. De novo assembly from the haploid and the diploid sequence reads (100 million reads per each datasets) generated by the Ion Proton sequencer (200 bp) was done under two different assembly algorithms, namely overlap-layout-consensus (OLC) and de Bruijn graph (DBG). This revealed that the assembly of the haploid genome significantly reduced (approximately 22% for OLC, 9% for DBG) the total number of contigs (with longer average and N50 contig lengths) when compared to the diploid genome assembly. The haploid assembly also improved the quality of the scaffolds by reducing the number of regions with unassigned nucleotides (Ns) (total length of Ns; 45,331,916 bp for haploids and 67,724,360 bp for diploids) in OLC-based assemblies. It appears clear that the haploid genome assembly is better because the allelic variation in the diploid genome disrupts the extension of contigs during the assembly process. Our results indicate that utilizing the genome of haploid larvae leads to a significant improvement in the de novo assembly process, thus providing a novel strategy for the construction of reference genomes from non-model diploid organisms such as fish. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
Fokkema, Ivo F A C; den Dunnen, Johan T; Taschner, Peter E M
2005-08-01
The completion of the human genome project has initiated, as well as provided the basis for, the collection and study of all sequence variation between individuals. Direct access to up-to-date information on sequence variation is currently provided most efficiently through web-based, gene-centered, locus-specific databases (LSDBs). We have developed the Leiden Open (source) Variation Database (LOVD) software approaching the "LSDB-in-a-Box" idea for the easy creation and maintenance of a fully web-based gene sequence variation database. LOVD is platform-independent and uses PHP and MySQL open source software only. The basic gene-centered and modular design of the database follows the recommendations of the Human Genome Variation Society (HGVS) and focuses on the collection and display of DNA sequence variations. With minimal effort, the LOVD platform is extendable with clinical data. The open set-up should both facilitate and promote functional extension with scripts written by the community. The LOVD software is freely available from the Leiden Muscular Dystrophy pages (www.DMD.nl/LOVD/). To promote the use of LOVD, we currently offer curators the possibility to set up an LSDB on our Leiden server. (c) 2005 Wiley-Liss, Inc.
Lei, Yong-Liang; Wang, Xiao-Guang; Liu, Fu-Ming; Chen, Xiu-Ying; Ye, Bi-Feng; Mei, Jian-Hua; Lan, Jin-Quan; Tang, Qing
2009-08-01
Based on sequencing the full-length genomes of two Chinese Ferret-Badger, we analyzed the properties of rabies viruses genetic variation in molecular level to get information on prevalence and variation of rabies viruses in Zhejiang, and to enrich the genome database of rabies viruses street strains isolated from Chinese wildlife. Overlapped fragments were amplified by RT-PCR and full-length genomes were assembled to analyze the nucleotide and deduced protein similarities and phylogenetic analyses of the N genes from Chinese Ferret-Badger, sika deer, vole, dog. Vaccine strains were then determined. The two full-length genomes were completely sequenced to find out that they had the same genetic structure with 11 923 nts including 58 nts-Leader, 1353 nts-NP, 894 nts-PP, 609 nts-MP, 1575 nts-GP, 6386 nts-LP, and 2, 5, 5 nts- intergenic regions (IGRs), 423 nts-Pseudogene-like sequence (Psi), 70 nts-Trailer. The two full-length genomes were in accordance with the properties of Rhabdoviridae Lyssa virus by blast and multi-sequence alignment. The nucleotide and amino acid sequences among Chinese strains had the highest similarity, especially among animals of the same species. Of the two full-length genomes, the similarity in amino acid level was dramatically higher than that in nucleotide level, so that the nucleotide mutations happened in these two genomes were most probably as synonymous mutations. Compared to the referenced rabies viruses, the lengths of the five protein coding regions did not show any changes or recombination, but only with a few-point mutations. It was evident that the five proteins appeared to be stable. The variation sites and types of the two ferret badgers genomes were similar to the referenced vaccine or street strains. The two strains were genotype 1 according to the multi-sequence and phylogenetic analyses, which possessing the distinct geographyphic characteristics of China. All the evidence suggested a cue that these two ferret badgers rabies viruses were likely to be street virus that already circulating in wildlife.
Sequence of the tomato chloroplast DNA and evolutionary comparison of solanaceous plastid genomes.
Kahlau, Sabine; Aspinall, Sue; Gray, John C; Bock, Ralph
2006-08-01
Tomato, Solanum lycopersicum (formerly Lycopersicon esculentum), has long been one of the classical model species of plant genetics. More recently, solanaceous species have become a model of evolutionary genomics, with several EST projects and a tomato genome project having been initiated. As a first contribution toward deciphering the genetic information of tomato, we present here the complete sequence of the tomato chloroplast genome (plastome). The size of this circular genome is 155,461 base pairs (bp), with an average AT content of 62.14%. It contains 114 genes and conserved open reading frames (ycfs). Comparison with the previously sequenced plastid DNAs of Nicotiana tabacum and Atropa belladonna reveals patterns of plastid genome evolution in the Solanaceae family and identifies varying degrees of conservation of individual plastid genes. In addition, we discovered several new sites of RNA editing by cytidine-to-uridine conversion. A detailed comparison of editing patterns in the three solanaceous species highlights the dynamics of RNA editing site evolution in chloroplasts. To assess the level of intraspecific plastome variation in tomato, the plastome of a second tomato cultivar was sequenced. Comparison of the two genotypes (IPA-6, bred in South America, and Ailsa Craig, bred in Europe) revealed no nucleotide differences, suggesting that the plastomes of modern tomato cultivars display very little, if any, sequence variation.
Yi, Guoqiang; Qu, Lujiang; Liu, Jianfeng; Yan, Yiyuan; Xu, Guiyun; Yang, Ning
2014-11-07
Copy number variation (CNV) is important and widespread in the genome, and is a major cause of disease and phenotypic diversity. Herein, we performed a genome-wide CNV analysis in 12 diversified chicken genomes based on whole genome sequencing. A total of 8,840 CNV regions (CNVRs) covering 98.2 Mb and representing 9.4% of the chicken genome were identified, ranging in size from 1.1 to 268.8 kb with an average of 11.1 kb. Sequencing-based predictions were confirmed at a high validation rate by two independent approaches, including array comparative genomic hybridization (aCGH) and quantitative PCR (qPCR). The Pearson's correlation coefficients between sequencing and aCGH results ranged from 0.435 to 0.755, and qPCR experiments revealed a positive validation rate of 91.71% and a false negative rate of 22.43%. In total, 2,214 (25.0%) predicted CNVRs span 2,216 (36.4%) RefSeq genes associated with specific biological functions. Besides two previously reported copy number variable genes EDN3 and PRLR, we also found some promising genes with potential in phenotypic variation. Two genes, FZD6 and LIMS1, related to disease susceptibility/resistance are covered by CNVRs. The highly duplicated SOCS2 may lead to higher bone mineral density. Entire or partial duplication of some genes like POPDC3 may have great economic importance in poultry breeding. Our results based on extensive genetic diversity provide a more refined chicken CNV map and genome-wide gene copy number estimates, and warrant future CNV association studies for important traits in chickens.
Spuesens, Emiel B M; Oduber, Minoushka; Hoogenboezem, Theo; Sluijter, Marcel; Hartwig, Nico G; van Rossum, Annemarie M C; Vink, Cornelis
2009-07-01
The gene encoding major adhesin protein P1 of Mycoplasma pneumoniae, MPN141, contains two DNA sequence stretches, designated RepMP2/3 and RepMP4, which display variation among strains. This variation allows strains to be differentiated into two major P1 genotypes (1 and 2) and several variants. Interestingly, multiple versions of the RepMP2/3 and RepMP4 elements exist at other sites within the bacterial genome. Because these versions are closely related in sequence, but not identical, it has been hypothesized that they have the capacity to recombine with their counterparts within MPN141, and thereby serve as a source of sequence variation of the P1 protein. In order to determine the variation within the RepMP2/3 and RepMP4 elements, both within the bacterial genome and among strains, we analysed the DNA sequences of all RepMP2/3 and RepMP4 elements within the genomes of 23 M. pneumoniae strains. Our data demonstrate that: (i) recombination is likely to have occurred between two RepMP2/3 elements in four of the strains, and (ii) all previously described P1 genotypes can be explained by inter-RepMP recombination events. Moreover, the difference between the two major P1 genotypes was reflected in all RepMP elements, such that subtype 1 and 2 strains can be differentiated on the basis of sequence variation in each RepMP element. This implies that subtype 1 and subtype 2 strains represent evolutionarily diverged strain lineages. Finally, a classification scheme is proposed in which the P1 genotype of M. pneumoniae isolates can be described in a sequence-based, universal fashion.
An Overview of Genomic Sequence Variation Markup Language (GSVML)
Nakaya, Jun; Hiroi, Kaei; Ido, Keisuke; Yang, Woosung; Kimura, Michio
2006-01-01
Internationally accumulated genomic sequence variation data on human requires the interoperable data exchanging format. We developed the GSVML as the data exchanging format. The GSVML is human health oriented and has three categories. Analyses on the use case in human health domain and the investigation on the databases and markup languages were conducted. An interface ability to Health Level Seven Genotype Model was examined. GSVML provides a sharable platform for both clinical and research applications.
USDA-ARS?s Scientific Manuscript database
Deep sequencing of viruses isolated from infected hosts is an efficient way to measure population-genetic variation and can reveal patterns of dispersal and natural selection. In this study, we mined existing Illumina sequence reads to investigate single-nucleotide polymorphisms (SNPs) within two RN...
Translational genomics for analysis of complex traits in peanut and sorghum
USDA-ARS?s Scientific Manuscript database
The integration of sequencing and genotype data from natural variation studies (by whole genome resequencing [wgs] or genotype by sequencing [gbs]), transcriptome (RNA-seq) and mutant analysis (also by wgs) facilitated the development of DNA markers in the form of single nucleotide polymorphic (SNP)...
Integrated translational genomics for analysis of complex traits in sorghum
USDA-ARS?s Scientific Manuscript database
We will report on the integration of sequencing and genotype data from natural variation (by whole genome resequencing [wgs] or genotype by sequencing [gbs]), transcriptome (RNA-seq) and mutant analysis (also by wgs) with the goal of identifying genes controlling important agronomic traits and tran...
Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold
Nijkamp, Jurgen F.; Pop, Mihai; Reinders, Marcel J. T.; de Ridder, Dick
2013-01-01
Motivation: Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes. This hampers the interpretation of metagenomics sequencing datasets, which are increasingly acquired in research on the (human) microbiome, in environmental studies and in the study of processes in the production of foods and beverages. Existing algorithms often depend on the use of reference genomes, which pose a problem when a metagenome of a priori unknown strain composition is studied. In this article, we develop a method to perform reference-free detection and visual exploration of genomic variation, both within a single metagenome and between metagenomes. Results: We present the MaryGold algorithm and its implementation, which efficiently detects bubble structures in contig graphs using graph decomposition. These bubbles represent variable genomic regions in closely related strains in metagenomic samples. The variation found is presented in a condensed Circos-based visualization, which allows for easy exploration and interpretation of the found variation. We validated the algorithm on two simulated datasets containing three respectively seven Escherichia coli genomes and showed that finding allelic variation in these genomes improves assemblies. Additionally, we applied MaryGold to publicly available real metagenomic datasets, enabling us to find within-sample genomic variation in the metagenomes of a kimchi fermentation process, the microbiome of a premature infant and in microbial communities living on acid mine drainage. Moreover, we used MaryGold for between-sample variation detection and exploration by comparing sequencing data sampled at different time points for both of these datasets. Availability: MaryGold has been written in C++ and Python and can be downloaded from http://bioinformatics.tudelft.nl/software Contact: d.deridder@tudelft.nl PMID:24058058
Population genomics reveals a candidate gene involved in bumble bee pigmentation.
Pimsler, Meaghan L; Jackson, Jason M; Lozier, Jeffrey D
2017-05-01
Variation in bumble bee color patterns is well-documented within and between species. Identifying the genetic mechanisms underlying such variation may be useful in revealing evolutionary forces shaping rapid phenotypic diversification. The widespread North American species Bombus bifarius exhibits regional variation in abdominal color forms, ranging from red-banded to black-banded phenotypes and including geographically and phenotypically intermediate forms. Identifying genomic regions linked to this variation has been complicated by strong, near species level, genome-wide differentiation between red- and black-banded forms. Here, we instead focus on the closely related black-banded and intermediate forms that both belong to the subspecies B. bifarius nearcticus . We analyze an RNA sequencing (RNAseq) data set and identify a cluster of single nucleotide polymorphisms (SNPs) within one gene, Xanthine dehydrogenase/oxidase -like, that exhibit highly unusual differentiation compared to the rest of the sequenced genome. Homologs of this gene contribute to pigmentation in other insects, and results thus represent a strong candidate for investigating the genetic basis of pigment variation in B. bifarius and other bumble bee mimicry complexes.
The ‘thousand-dollar genome': an ethical exploration
Dondorp, Wybo J; de Wert, Guido M W R
2013-01-01
Sequencing an individual's complete genome is expected to be possible for a relatively low sum ‘one thousand dollars' within a few years. Sequencing refers to determining the order of base pairs that make up the genome. The result is a library of three billion letter combinations. Cheap whole-genome sequencing is of greatest importance to medical scientific research. Comparing individual complete genomes will lead to a better understanding of the contribution genetic variation makes to health and disease. As knowledge increases, the ‘thousand-dollar genome' will also become increasingly important to healthcare. The applications that come within reach raise a number of ethical questions. This monitoring report addresses the issue. PMID:23677179
The African Genome Variation Project shapes medical genetics in Africa
Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O.; Choudhury, Ananyo; Ritchie, Graham R. S.; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N.; Young, Elizabeth H.; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P.; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A.; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S.
2014-01-01
Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterisation of African genetic diversity is needed. The African Genome Variation Project (AGVP) provides a resource to help design, implement and interpret genomic studies in sub-Saharan Africa (SSA) and worldwide. The AGVP represents dense genotypes from 1,481 and whole genome sequences (WGS) from 320 individuals across SSA. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across SSA. We identify new loci under selection, including for malaria and hypertension. We show that modern imputation panels can identify association signals at highly differentiated loci across populations in SSA. Using WGS, we show further improvement in imputation accuracy supporting efforts for large-scale sequencing of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa, showing for the first time that such designs are feasible. PMID:25470054
Wyllie, David H; Sanderson, Nicholas; Myers, Richard; Peto, Tim; Robinson, Esther; Crook, Derrick W; Smith, E Grace; Walker, A Sarah
2018-06-06
Contact tracing requires reliable identification of closely related bacterial isolates. When we noticed the reporting of artefactual variation between M. tuberculosis isolates during routine next generation sequencing of Mycobacterium spp, we investigated its basis in 2,018 consecutive M. tuberculosis isolates. In the routine process used, clinical samples were decontaminated and inoculated into broth cultures; from positive broth cultures DNA was extracted, sequenced, reads mapped, and consensus sequences determined. We investigated the process of consensus sequence determination, which selects the most common nucleotide at each position. Having determined the high-quality read depth and depth of minor variants across 8,006 M. tuberculosis genomic regions, we quantified the relationship between the minor variant depth and the amount of non-Mycobacterial bacterial DNA, which originates from commensal microbes killed during sample decontamination. In the presence of non-Mycobacterial bacterial DNA, we found significant increases in minor variant frequencies of more than 1.5 fold in 242 regions covering 5.1% of the M. tuberculosis genome. Included within these were four high variation regions strongly influenced by the amount of non-Mycobacterial bacterial DNA. Excluding these four regions from pairwise distance comparisons reduced biologically implausible variation from 5.2% to 0% in an independent validation set derived from 226 individuals. Thus, we have demonstrated an approach identifying critical genomic regions contributing to clinically relevant artefactual variation in bacterial similarity searches. The approach described monitors the outputs of the complex multi-step laboratory and bioinformatics process, allows periodic process adjustments, and will have application to quality control of routine bacterial genomics. Copyright © 2018 Wyllie et al.
Genomic profiling of plastid DNA variation in the Mediterranean olive tree
2011-01-01
Background Characterisation of plastid genome (or cpDNA) polymorphisms is commonly used for phylogeographic, population genetic and forensic analyses in plants, but detecting cpDNA variation is sometimes challenging, limiting the applications of such an approach. In the present study, we screened cpDNA polymorphism in the olive tree (Olea europaea L.) by sequencing the complete plastid genome of trees with a distinct cpDNA lineage. Our objective was to develop new markers for a rapid genomic profiling (by Multiplex PCRs) of cpDNA haplotypes in the Mediterranean olive tree. Results Eight complete cpDNA genomes of Olea were sequenced de novo. The nucleotide divergence between olive cpDNA lineages was low and not exceeding 0.07%. Based on these sequences, markers were developed for studying two single nucleotide substitutions and length polymorphism of 62 regions (with variable microsatellite motifs or other indels). They were then used to genotype the cpDNA variation in cultivated and wild Mediterranean olive trees (315 individuals). Forty polymorphic loci were detected on this sample, allowing the distinction of 22 haplotypes belonging to the three Mediterranean cpDNA lineages known as E1, E2 and E3. The discriminating power of cpDNA variation was particularly low for the cultivated olive tree with one predominating haplotype, but more diversity was detected in wild populations. Conclusions We propose a method for a rapid characterisation of the Mediterranean olive germplasm. The low variation in the cultivated olive tree indicated that the utility of cpDNA variation for forensic analyses is limited to rare haplotypes. In contrast, the high cpDNA variation in wild populations demonstrated that our markers may be useful for phylogeographic and populations genetic studies in O. europaea. PMID:21569271
Microsatellite analysis in the genome of Acanthaceae: An in silico approach.
Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar
2015-01-01
Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. The microsatellites existing in the complete genome sequences would help to attain a direct role in the genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes. The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. The whole nucleotide sequences of Acanthaceae species were obtained from National Center for Biotechnology Information database and screened for the presence of SSRs. SSR Locator tool was used to predict the microsatellites and inbuilt Primer3 module was used for primer designing. Totally 110 repeats from 108 sequences of Acanthaceae family plant genomes were identified, and the occurrence of dinucleotide repeats was found to be abundant in the genome sequences. The essential amino acid isoleucine was found rich in all the sequences. We also designed the SSR-based primers/markers for 59 sequences of this family that contains microsatellite repeats in their genome. The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to Acanthaceae family in the future.
Integrated and translational genomics for analysis of complex traits in crops
USDA-ARS?s Scientific Manuscript database
We report here on integration of sequencing and genotype data from natural variation (by whole genome resequencing [wgs] or genotype by sequencing [gbs]), transcriptome (RNA-seq) and mutant analysis (also by wgs) with the goal of translating gems from these resources into useable DNA markers in the ...
USDA-ARS?s Scientific Manuscript database
Next-generation sequencing (NGS) technologies are revolutionizing both medical and biological research through generation of massive SNP data sets for identifying heritable genome variation underlying key traits, from rare human diseases to important agronomic phenotypes in crop species. We evaluate...
USDA-ARS?s Scientific Manuscript database
Current advances in sequencing technologies and bioinformatics allow to determine a nearly complete genomic background of rice, a staple food for the poor people. Consequently, comprehensive databases of variation among thousands of varieties is currently being assembled and released. Proper analysi...
Yu, Ziniu; Wei, Zhengpeng; Kong, Xiaoyu; Shi, Wei
2008-01-01
Background Mitochondrial DNA sequences are extensively used as genetic markers not only for studies of population or ecological genetics, but also for phylogenetic and evolutionary analyses. Complete mt-sequences can reveal information about gene order and its variation, as well as gene and genome evolution when sequences from multiple phyla are compared. Mitochondrial gene order is highly variable among mollusks, with bivalves exhibiting the most variability. Of the 41 complete mt genomes sequenced so far, 12 are from bivalves. We determined, in the current study, the complete mitochondrial DNA sequence of Crassostrea hongkongensis. We present here an analysis of features of its gene content and genome organization in comparison with two other Crassostrea species to assess the variation within bivalves and among main groups of mollusks. Results The complete mitochondrial genome of C. hongkongensis was determined using long PCR and a primer walking sequencing strategy with genus-specific primers. The genome is 16,475 bp in length and contains 12 protein-coding genes (the atp8 gene is missing, as in most bivalves), 22 transfer tRNA genes (including a suppressor tRNA gene), and 2 ribosomal RNA genes, all of which appear to be transcribed from the same strand. A striking finding of this study is that a DNA segment containing four tRNA genes (trnk1, trnC, trnQ1 and trnN) and two duplicated or split rRNA gene (rrnL5' and rrnS) are absent from the genome, when compared with that of two other extant Crassostrea species, which is very likely a consequence of loss of a single genomic region present in ancestor of C. hongkongensis. It indicates this region seem to be a "hot spot" of genomic rearrangements over the Crassostrea mt-genomes. The arrangement of protein-coding genes in C. hongkongensis is identical to that of Crassostrea gigas and Crassostrea virginica, but higher amino acid sequence identities are shared between C. hongkongensis and C. gigas than between other pairs. There exists significant codon bias, favoring codons ending in A or T and against those ending with C. Pair analysis of genome rearrangements showed that the rearrangement distance is great between C. gigas-C. hongkongensis and C. virginica, indicating a high degree of rearrangements within Crassostrea. The determination of complete mt-genome of C. hongkongensis has yielded useful insight into features of gene order, variation, and evolution of Crassostrea and bivalve mt-genomes. Conclusion The mt-genome of C. hongkongensis shares some similarity with, and interesting differences to, other Crassostrea species and bivalves. The absence of trnC and trnN genes and duplicated or split rRNA genes from the C. hongkongensis genome is a completely novel feature not previously reported in Crassostrea species. The phenomenon is likely due to the loss of a segment that is present in other Crassostrea species and was present in ancestor of C. hongkongensis, thus a case of "tandem duplication-random loss (TDRL)". The mt-genome and new feature presented here reveal and underline the high level variation of gene order and gene content in Crassostrea and bivalves, inspiring more research to gain understanding to mechanisms underlying gene and genome evolution in bivalves and mollusks. PMID:18847502
Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin
2018-01-01
Abstract Much of the within species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-Seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation. PMID:29361139
PopHuman: the human population genomics browser
Mulet, Roger; Villegas-Mirón, Pablo; Hervas, Sergi; Sanz, Esteve; Velasco, Daniel; Bertranpetit, Jaume; Laayouni, Hafid
2018-01-01
Abstract The 1000 Genomes Project (1000GP) represents the most comprehensive world-wide nucleotide variation data set so far in humans, providing the sequencing and analysis of 2504 genomes from 26 populations and reporting >84 million variants. The availability of this sequence data provides the human lineage with an invaluable resource for population genomics studies, allowing the testing of molecular population genetics hypotheses and eventually the understanding of the evolutionary dynamics of genetic variation in human populations. Here we present PopHuman, a new population genomics-oriented genome browser based on JBrowse that allows the interactive visualization and retrieval of an extensive inventory of population genetics metrics. Efficient and reliable parameter estimates have been computed using a novel pipeline that faces the unique features and limitations of the 1000GP data, and include a battery of nucleotide variation measures, divergence and linkage disequilibrium parameters, as well as different tests of neutrality, estimated in non-overlapping windows along the chromosomes and in annotated genes for all 26 populations of the 1000GP. PopHuman is open and freely available at http://pophuman.uab.cat. PMID:29059408
One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly.
Koren, Sergey; Phillippy, Adam M
2015-02-01
Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. However, recent sequencing technologies have favored lowering per-base cost at the expense of read length. This has dramatically reduced sequencing cost, but resulted in fragmented assemblies, which negatively affect downstream analyses and hinder the creation of finished (gapless, high-quality) genomes. In contrast, emerging long-read sequencing technologies can now produce reads tens of kilobases in length, enabling the automated finishing of microbial genomes for under $1000. This promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. We present an overview of these new technologies and the methods used to assemble long reads into complete genomes. Copyright © 2014 The Authors. Published by Elsevier Ltd.. All rights reserved.
Global Genomic Diversity of Oryza sativa Varieties Revealed by Comparative Physical Mapping
Wang, Xiaoming; Kudrna, David A.; Pan, Yonglong; Wang, Hao; Liu, Lin; Lin, Haiyan; Zhang, Jianwei; Song, Xiang; Goicoechea, Jose Luis; Wing, Rod A.; Zhang, Qifa; Luo, Meizhong
2014-01-01
Bacterial artificial chromosome (BAC) physical maps embedding a large number of BAC end sequences (BESs) were generated for Oryza sativa ssp. indica varieties Minghui 63 (MH63) and Zhenshan 97 (ZS97) and were compared with the genome sequences of O. sativa spp. japonica cv. Nipponbare and O. sativa ssp. indica cv. 93-11. The comparisons exhibited substantial diversities in terms of large structural variations and small substitutions and indels. Genome-wide BAC-sized and contig-sized structural variations were detected, and the shared variations were analyzed. In the expansion regions of the Nipponbare reference sequence, in comparison to the MH63 and ZS97 physical maps, as well as to the previously constructed 93-11 physical map, the amounts and types of the repeat contents, and the outputs of gene ontology analysis, were significantly different from those of the whole genome. Using the physical maps of four wild Oryza species from OMAP (http://www.omap.org) as a control, we detected many conserved and divergent regions related to the evolution process of O. sativa. Between the BESs of MH63 and ZS97 and the two reference sequences, a total of 1532 polymorphic simple sequence repeats (SSRs), 71,383 SNPs, 1767 multiple nucleotide polymorphisms, 6340 insertions, and 9137 deletions were identified. This study provides independent whole-genome resources for intra- and intersubspecies comparisons and functional genomics studies in O. sativa. Both the comparative physical maps and the GBrowse, which integrated the QTL and molecular markers from GRAMENE (http://www.gramene.org) with our physical maps and analysis results, are open to the public through our Web site (http://gresource.hzau.edu.cn/resource/resource.html). PMID:24424778
Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing
NASA Astrophysics Data System (ADS)
Ferreira, Pedro G.; Oti, Martin; Barann, Matthias; Wieland, Thomas; Ezquina, Suzana; Friedländer, Marc R.; Rivas, Manuel A.; Esteve-Codina, Anna; Estivill, Xavier; Guigó, Roderic; Dermitzakis, Emmanouil; Antonarakis, Stylianos; Meitinger, Thomas; Strom, Tim M.; Palotie, Aarno; François Deleuze, Jean; Sudbrak, Ralf; Lerach, Hans; Gut, Ivo; Syvänen, Ann-Christine; Gyllensten, Ulf; Schreiber, Stefan; Rosenstiel, Philip; Brunner, Han; Veltman, Joris; Hoen, Peter A. C. T.; Jan van Ommen, Gert; Carracedo, Angel; Brazma, Alvis; Flicek, Paul; Cambon-Thomsen, Anne; Mangion, Jonathan; Bentley, David; Hamosh, Ada; Rosenstiel, Philip; Strom, Tim M.; Lappalainen, Tuuli; Guigó, Roderic; Sammeth, Michael
2016-09-01
Recent advances in the cost-efficiency of sequencing technologies enabled the combined DNA- and RNA-sequencing of human individuals at the population-scale, making genome-wide investigations of the inter-individual genetic impact on gene expression viable. Employing mRNA-sequencing data from the Geuvadis Project and genome sequencing data from the 1000 Genomes Project we show that the computational analysis of DNA sequences around splice sites and poly-A signals is able to explain several observations in the phenotype data. In contrast to widespread assessments of statistically significant associations between DNA polymorphisms and quantitative traits, we developed a computational tool to pinpoint the molecular mechanisms by which genetic markers drive variation in RNA-processing, cataloguing and classifying alleles that change the affinity of core RNA elements to their recognizing factors. The in silico models we employ further suggest RNA editing can moonlight as a splicing-modulator, albeit less frequently than genomic sequence diversity. Beyond existing annotations, we demonstrate that the ultra-high resolution of RNA-Seq combined from 462 individuals also provides evidence for thousands of bona fide novel elements of RNA processing—alternative splice sites, introns, and cleavage sites—which are often rare and lowly expressed but in other characteristics similar to their annotated counterparts.
Cheng, Linzhao; Hansen, Nancy F.; Zhao, Ling; Du, Yutao; Zou, Chunlin; Donovan, Frank X.; Chou, Bin-Kuan; Zhou, Guangyu; Li, Shijie; Dowey, Sarah N.; Ye, Zhaohui; Chandrasekharappa, Settara C.; Yang, Huanming; Mullikin, James C.; Liu, P. Paul
2012-01-01
Summary The utility of induced pluripotent stem cells (iPSCs) as models to study diseases and as sources for cell therapy depends on the integrity of their genomes. Despite recent publications of DNA sequence variations in the iPSCs, the true scope of such changes for the entire genome is not clear. Here we report the whole-genome sequencing of three human iPSC lines derived from two cell types of an adult donor by episomal vectors. The vector sequence was undetectable in the deeply sequenced iPSC lines. We identified 1058–1808 heterozygous single nucleotide variants (SNVs), but no copy number variants, in each iPSC line. Six to twelve of these SNVs were within coding regions in each iPSC line, but ~50% of them are synonymous changes and the remaining are not selectively enriched for known genes associated with cancers. Our data thus suggest that episome-mediated reprogramming is not inherently mutagenic during integration-free iPSC induction. PMID:22385660
Organizational heterogeneity of vertebrate genomes.
Frenkel, Svetlana; Kirzhner, Valery; Korol, Abraham
2012-01-01
Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as "texts" using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter--GDM) allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences.
Whole-genome analyses of Korean native and Holstein cattle breeds by massively parallel sequencing.
Choi, Jung-Woo; Liao, Xiaoping; Stothard, Paul; Chung, Won-Hyong; Jeon, Heoyn-Jeong; Miller, Stephen P; Choi, So-Young; Lee, Jeong-Koo; Yang, Bokyoung; Lee, Kyung-Tai; Han, Kwang-Jin; Kim, Hyeong-Cheol; Jeong, Dongkee; Oh, Jae-Don; Kim, Namshin; Kim, Tae-Hun; Lee, Hak-Kyo; Lee, Sung-Jin
2014-01-01
A main goal of cattle genomics is to identify DNA differences that account for variations in economically important traits. In this study, we performed whole-genome analyses of three important cattle breeds in Korea--Hanwoo, Jeju Heugu, and Korean Holstein--using the Illumina HiSeq 2000 sequencing platform. We achieved 25.5-, 29.6-, and 29.5-fold coverage of the Hanwoo, Jeju Heugu, and Korean Holstein genomes, respectively, and identified a total of 10.4 million single nucleotide polymorphisms (SNPs), of which 54.12% were found to be novel. We also detected 1,063,267 insertions-deletions (InDels) across the genomes (78.92% novel). Annotations of the datasets identified a total of 31,503 nonsynonymous SNPs and 859 frameshift InDels that could affect phenotypic variations in traits of interest. Furthermore, genome-wide copy number variation regions (CNVRs) were detected by comparing the Hanwoo, Jeju Heugu, and previously published Chikso genomes against that of Korean Holstein. A total of 992, 284, and 1881 CNVRs, respectively, were detected throughout the genome. Moreover, 53, 65, 45, and 82 putative regions of homozygosity (ROH) were identified in Hanwoo, Jeju Heugu, Chikso, and Korean Holstein respectively. The results of this study provide a valuable foundation for further investigations to dissect the molecular mechanisms underlying variation in economically important traits in cattle and to develop genetic markers for use in cattle breeding.
Whole-Genome Analyses of Korean Native and Holstein Cattle Breeds by Massively Parallel Sequencing
Stothard, Paul; Chung, Won-Hyong; Jeon, Heoyn-Jeong; Miller, Stephen P.; Choi, So-Young; Lee, Jeong-Koo; Yang, Bokyoung; Lee, Kyung-Tai; Han, Kwang-Jin; Kim, Hyeong-Cheol; Jeong, Dongkee; Oh, Jae-Don; Kim, Namshin; Kim, Tae-Hun; Lee, Hak-Kyo; Lee, Sung-Jin
2014-01-01
A main goal of cattle genomics is to identify DNA differences that account for variations in economically important traits. In this study, we performed whole-genome analyses of three important cattle breeds in Korea—Hanwoo, Jeju Heugu, and Korean Holstein—using the Illumina HiSeq 2000 sequencing platform. We achieved 25.5-, 29.6-, and 29.5-fold coverage of the Hanwoo, Jeju Heugu, and Korean Holstein genomes, respectively, and identified a total of 10.4 million single nucleotide polymorphisms (SNPs), of which 54.12% were found to be novel. We also detected 1,063,267 insertions–deletions (InDels) across the genomes (78.92% novel). Annotations of the datasets identified a total of 31,503 nonsynonymous SNPs and 859 frameshift InDels that could affect phenotypic variations in traits of interest. Furthermore, genome-wide copy number variation regions (CNVRs) were detected by comparing the Hanwoo, Jeju Heugu, and previously published Chikso genomes against that of Korean Holstein. A total of 992, 284, and 1881 CNVRs, respectively, were detected throughout the genome. Moreover, 53, 65, 45, and 82 putative regions of homozygosity (ROH) were identified in Hanwoo, Jeju Heugu, Chikso, and Korean Holstein respectively. The results of this study provide a valuable foundation for further investigations to dissect the molecular mechanisms underlying variation in economically important traits in cattle and to develop genetic markers for use in cattle breeding. PMID:24992012
The genome sequence of four isolates from the family Lichtheimiaceae.
Chibucos, Marcus C; Etienne, Kizee A; Orvis, Joshua; Lee, Hongkyu; Daugherty, Sean; Lockhart, Shawn R; Ibrahim, Ashraf S; Bruno, Vincent M
2015-07-01
This study reports the release of draft genome sequences of two isolates of Lichtheimia corymbifera and two isolates of L. ramosa. Phylogenetic analyses indicate that the two L. corymbifera strains (CDC-B2541 and 008-049) are closely related to the previously sequenced L. corymbifera isolate (FSU 9682) while our two L. ramosa strains CDC-B5399 and CDC-B5792 cluster apart from them. These genome sequences will further the understanding of intraspecies and interspecies genetic variation within the Mucoraceae family of pathogenic fungi. © FEMS 2015. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
The Saccharomyces Genome Database Variant Viewer.
Sheppard, Travis K; Hitz, Benjamin C; Engel, Stacia R; Song, Giltae; Balakrishnan, Rama; Binkley, Gail; Costanzo, Maria C; Dalusag, Kyla S; Demeter, Janos; Hellerstedt, Sage T; Karra, Kalpana; Nash, Robert S; Paskov, Kelley M; Skrzypek, Marek S; Weng, Shuai; Wong, Edith D; Cherry, J Michael
2016-01-04
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Comparative structural analysis of Bru1 region homeologs in Saccharum spontaneum and S. officinarum
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Jisen; Sharma, Anupma; Yu, Qingyi
Here, sugarcane is a major sugar and biofuel crop, but genomic research and molecular breeding have lagged behind other major crops due to the complexity of auto-allopolyploid genomes. Sugarcane cultivars are frequently aneuploid with chromosome number ranging from 100 to 130, consisting of 70-80 % S. officinarum, 10-20 % S. spontaneum, and 10 % recombinants between these two species. Analysis of a genomic region in the progenitor autoploid genomes of sugarcane hybrid cultivars will reveal the nature and divergence of homologous chromosomes. As a result, to investigate the origin and evolution of haplotypes in the Bru1 genomic regions in sugarcanemore » cultivars, we identified two BAC clones from S. spontaneum and four from S. officinarum and compared to seven haplotype sequences from sugarcane hybrid R570. The results clarified the origin of seven homologous haplotypes in R570, four haplotypes originated from S. officinarum, two from S. spontaneum and one recombinant.. Retrotransposon insertions and sequences variations among the homologous haplotypes sequence divergence ranged from 18.2 % to 60.5 % with an average of 33. 7 %. Gene content and gene structure were relatively well conserved among the homologous haplotypes. Exon splitting occurred in haplotypes of the hybrid genome but not in its progenitor genomes. Tajima's D analysis revealed that S. spontaneum hapotypes in the Bru1 genomic regions were under strong directional selection. Numerous inversions, deletions, insertions and translocations were found between haplotypes within each genome. In conclusion, this is the first comparison among haplotypes of a modern sugarcane hybrid and its two progenitors. Tajima's D results emphasized the crucial role of this fungal disease resistance gene for enhancing the fitness of this species and indicating that the brown rust resistance gene in R570 is from S. spontaneum. Species-specific InDel, sequences similarity and phylogenetic analysis of homologous genes can be used for identifying the origin of S. spontaneum and S. officinarum haplotype in Saccharum hybrids. Comparison of exon splitting among the homologous haplotypes suggested that the genome rearrangements in Saccharum hybrids S. officinarum would be sufficient for proper genome assembly of this autopolyploid genome. Retrotransposon insertions and sequences variations among the homologous haplotypes sequence divergence may allow sequencing and assembling the autopolyploid Saccharum genomes and the auto-allopolyploid hybrid genomes using whole genome shotgun sequencing.« less
Comparative structural analysis of Bru1 region homeologs in Saccharum spontaneum and S. officinarum
Zhang, Jisen; Sharma, Anupma; Yu, Qingyi; ...
2016-06-10
Here, sugarcane is a major sugar and biofuel crop, but genomic research and molecular breeding have lagged behind other major crops due to the complexity of auto-allopolyploid genomes. Sugarcane cultivars are frequently aneuploid with chromosome number ranging from 100 to 130, consisting of 70-80 % S. officinarum, 10-20 % S. spontaneum, and 10 % recombinants between these two species. Analysis of a genomic region in the progenitor autoploid genomes of sugarcane hybrid cultivars will reveal the nature and divergence of homologous chromosomes. As a result, to investigate the origin and evolution of haplotypes in the Bru1 genomic regions in sugarcanemore » cultivars, we identified two BAC clones from S. spontaneum and four from S. officinarum and compared to seven haplotype sequences from sugarcane hybrid R570. The results clarified the origin of seven homologous haplotypes in R570, four haplotypes originated from S. officinarum, two from S. spontaneum and one recombinant.. Retrotransposon insertions and sequences variations among the homologous haplotypes sequence divergence ranged from 18.2 % to 60.5 % with an average of 33. 7 %. Gene content and gene structure were relatively well conserved among the homologous haplotypes. Exon splitting occurred in haplotypes of the hybrid genome but not in its progenitor genomes. Tajima's D analysis revealed that S. spontaneum hapotypes in the Bru1 genomic regions were under strong directional selection. Numerous inversions, deletions, insertions and translocations were found between haplotypes within each genome. In conclusion, this is the first comparison among haplotypes of a modern sugarcane hybrid and its two progenitors. Tajima's D results emphasized the crucial role of this fungal disease resistance gene for enhancing the fitness of this species and indicating that the brown rust resistance gene in R570 is from S. spontaneum. Species-specific InDel, sequences similarity and phylogenetic analysis of homologous genes can be used for identifying the origin of S. spontaneum and S. officinarum haplotype in Saccharum hybrids. Comparison of exon splitting among the homologous haplotypes suggested that the genome rearrangements in Saccharum hybrids S. officinarum would be sufficient for proper genome assembly of this autopolyploid genome. Retrotransposon insertions and sequences variations among the homologous haplotypes sequence divergence may allow sequencing and assembling the autopolyploid Saccharum genomes and the auto-allopolyploid hybrid genomes using whole genome shotgun sequencing.« less
Veerkamp, Roel F; Bouwman, Aniek C; Schrooten, Chris; Calus, Mario P L
2016-12-01
Whole-genome sequence data is expected to capture genetic variation more completely than common genotyping panels. Our objective was to compare the proportion of variance explained and the accuracy of genomic prediction by using imputed sequence data or preselected SNPs from a genome-wide association study (GWAS) with imputed whole-genome sequence data. Phenotypes were available for 5503 Holstein-Friesian bulls. Genotypes were imputed up to whole-genome sequence (13,789,029 segregating DNA variants) by using run 4 of the 1000 bull genomes project. The program GCTA was used to perform GWAS for protein yield (PY), somatic cell score (SCS) and interval from first to last insemination (IFL). From the GWAS, subsets of variants were selected and genomic relationship matrices (GRM) were used to estimate the variance explained in 2087 validation animals and to evaluate the genomic prediction ability. Finally, two GRM were fitted together in several models to evaluate the effect of selected variants that were in competition with all the other variants. The GRM based on full sequence data explained only marginally more genetic variation than that based on common SNP panels: for PY, SCS and IFL, genomic heritability improved from 0.81 to 0.83, 0.83 to 0.87 and 0.69 to 0.72, respectively. Sequence data also helped to identify more variants linked to quantitative trait loci and resulted in clearer GWAS peaks across the genome. The proportion of total variance explained by the selected variants combined in a GRM was considerably smaller than that explained by all variants (less than 0.31 for all traits). When selected variants were used, accuracy of genomic predictions decreased and bias increased. Although 35 to 42 variants were detected that together explained 13 to 19% of the total variance (18 to 23% of the genetic variance) when fitted alone, there was no advantage in using dense sequence information for genomic prediction in the Holstein data used in our study. Detection and selection of variants within a single breed are difficult due to long-range linkage disequilibrium. Stringent selection of variants resulted in more biased genomic predictions, although this might be due to the training population being the same dataset from which the selected variants were identified.
The study of human Y chromosome variation through ancient DNA.
Kivisild, Toomas
2017-05-01
High throughput sequencing methods have completely transformed the study of human Y chromosome variation by offering a genome-scale view on genetic variation retrieved from ancient human remains in context of a growing number of high coverage whole Y chromosome sequence data from living populations from across the world. The ancient Y chromosome sequences are providing us the first exciting glimpses into the past variation of male-specific compartment of the genome and the opportunity to evaluate models based on previously made inferences from patterns of genetic variation in living populations. Analyses of the ancient Y chromosome sequences are challenging not only because of issues generally related to ancient DNA work, such as DNA damage-induced mutations and low content of endogenous DNA in most human remains, but also because of specific properties of the Y chromosome, such as its highly repetitive nature and high homology with the X chromosome. Shotgun sequencing of uniquely mapping regions of the Y chromosomes to sufficiently high coverage is still challenging and costly in poorly preserved samples. To increase the coverage of specific target SNPs capture-based methods have been developed and used in recent years to generate Y chromosome sequence data from hundreds of prehistoric skeletal remains. Besides the prospects of testing directly as how much genetic change in a given time period has accompanied changes in material culture the sequencing of ancient Y chromosomes allows us also to better understand the rate at which mutations accumulate and get fixed over time. This review considers genome-scale evidence on ancient Y chromosome diversity that has recently started to accumulate in geographic areas favourable to DNA preservation. More specifically the review focuses on examples of regional continuity and change of the Y chromosome haplogroups in North Eurasia and in the New World.
CNV-seq, a new method to detect copy number variation using high-throughput sequencing.
Xie, Chao; Tammi, Martti T
2009-03-06
DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations. Here, we describe a method to detect copy number variation using shotgun sequencing, CNV-seq. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads is the key factor determining the resolution of detection. This favors the next-generation sequencing methods that rapidly produce large amount of short reads. Simulation of various sequencing methods with coverage between 0.1x to 8x show overall specificity between 91.7 - 99.9%, and sensitivity between 72.2 - 96.5%. We also show the results for assessment of CNV between two individual human genomes.
Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data
Degner, Jacob F.; Marioni, John C.; Pai, Athma A.; Pickrell, Joseph K.; Nkadori, Everlyne; Gilad, Yoav; Pritchard, Jonathan K.
2009-01-01
Motivation: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here, we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE). Results: We generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ∼5–10% of SNPs still have an inherent bias toward more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in genes previously known to harbor cis-regulatory variation or known to show uniparental imprinting. Our results have implications for a variety of applications involving detection of alternate alleles from short-read sequence data. Availability: Scripts, written in Perl and R, for simulating short reads, masking SNP variation in a reference genome and analyzing the simulation output are available upon request from JFD. Raw short read data were deposited in GEO (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE18156. Contact: jdegner@uchicago.edu; marioni@uchicago.edu; gilad@uchicago.edu; pritch@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19808877
Cho, Kwang-Soo; Yun, Bong-Kyoung; Yoon, Young-Ho; Hong, Su-Young; Mekapogu, Manjulatha; Kim, Kyung-Hee; Yang, Tae-Jin
2015-01-01
We report the chloroplast (cp) genome sequence of tartary buckwheat (Fagopyrum tataricum) obtained by next-generation sequencing technology and compared this with the previously reported common buckwheat (F. esculentum ssp. ancestrale) cp genome. The cp genome of F. tataricum has a total sequence length of 159,272 bp, which is 327 bp shorter than the common buckwheat cp genome. The cp gene content, order, and orientation are similar to those of common buckwheat, but with some structural variation at tandem and palindromic repeat frequencies and junction areas. A total of seven InDels (around 100 bp) were found within the intergenic sequences and the ycf1 gene. Copy number variation of the 21-bp tandem repeat varied in F. tataricum (four repeats) and F. esculentum (one repeat), and the InDel of the ycf1 gene was 63 bp long. Nucleotide and amino acid have highly conserved coding sequence with about 98% homology and four genes—rpoC2, ycf3, accD, and clpP—have high synonymous (Ks) value. PCR based InDel markers were applied to diverse genetic resources of F. tataricum and F. esculentum, and the amplicon size was identical to that expected in silico. Therefore, these InDel markers are informative biomarkers to practically distinguish raw or processed buckwheat products derived from F. tataricum and F. esculentum. PMID:25966355
Lee, Kevin C; Stott, Matthew B; Dunfield, Peter F; Huttenhower, Curtis; McDonald, Ian R; Morgan, Xochitl C
2016-06-15
Chthonomonas calidirosea T49(T) is a low-abundance, carbohydrate-scavenging, and thermophilic soil bacterium with a seemingly disorganized genome. We hypothesized that the C. calidirosea genome would be highly responsive to local selection pressure, resulting in the divergence of its genomic content, genome organization, and carbohydrate utilization phenotype across environments. We tested this hypothesis by sequencing the genomes of four C. calidirosea isolates obtained from four separate geothermal fields in the Taupō Volcanic Zone, New Zealand. For each isolation site, we measured physicochemical attributes and defined the associated microbial community by 16S rRNA gene sequencing. Despite their ecological and geographical isolation, the genome sequences showed low divergence (maximum, 1.17%). Isolate-specific variations included single-nucleotide polymorphisms (SNPs), restriction-modification systems, and mobile elements but few major deletions and no major rearrangements. The 50-fold variation in C. calidirosea relative abundance among the four sites correlated with site environmental characteristics but not with differences in genomic content. Conversely, the carbohydrate utilization profiles of the C. calidirosea isolates corresponded to the inferred isolate phylogenies, which only partially paralleled the geographical relationships among the sample sites. Genomic sequence conservation does not entirely parallel geographic distance, suggesting that stochastic dispersal and localized extinction, which allow for rapid population homogenization with little restriction by geographical barriers, are possible mechanisms of C. calidirosea distribution. This dispersal and extinction mechanism is likely not limited to C. calidirosea but may shape the populations and genomes of many other low-abundance free-living taxa. This study compares the genomic sequence variations and metabolisms of four strains of Chthonomonas calidirosea, a rare thermophilic bacterium from the phylum Armatimonadetes It additionally compares the microbial communities and chemistry of each of the geographically distinct sites from which the four C. calidirosea strains were isolated. C. calidirosea was previously reported to possess a highly disorganized genome, but it was unclear whether this reflected rapid evolution. Here, we show that each isolation site has a distinct chemistry and microbial community, but despite this, the C. calidirosea genome is highly conserved across all isolation sites. Furthermore, genomic sequence differences only partially paralleled geographic distance, suggesting that C. calidirosea genotypes are not primarily determined by adaptive evolution. Instead, the presence of C. calidirosea may be driven by stochastic dispersal and localized extinction. This ecological mechanism may apply to many other low-abundance taxa. Copyright © 2016 Lee et al.
Lee, Kevin C.; Stott, Matthew B.; Dunfield, Peter F.; Huttenhower, Curtis; McDonald, Ian R.
2016-01-01
ABSTRACT Chthonomonas calidirosea T49T is a low-abundance, carbohydrate-scavenging, and thermophilic soil bacterium with a seemingly disorganized genome. We hypothesized that the C. calidirosea genome would be highly responsive to local selection pressure, resulting in the divergence of its genomic content, genome organization, and carbohydrate utilization phenotype across environments. We tested this hypothesis by sequencing the genomes of four C. calidirosea isolates obtained from four separate geothermal fields in the Taupō Volcanic Zone, New Zealand. For each isolation site, we measured physicochemical attributes and defined the associated microbial community by 16S rRNA gene sequencing. Despite their ecological and geographical isolation, the genome sequences showed low divergence (maximum, 1.17%). Isolate-specific variations included single-nucleotide polymorphisms (SNPs), restriction-modification systems, and mobile elements but few major deletions and no major rearrangements. The 50-fold variation in C. calidirosea relative abundance among the four sites correlated with site environmental characteristics but not with differences in genomic content. Conversely, the carbohydrate utilization profiles of the C. calidirosea isolates corresponded to the inferred isolate phylogenies, which only partially paralleled the geographical relationships among the sample sites. Genomic sequence conservation does not entirely parallel geographic distance, suggesting that stochastic dispersal and localized extinction, which allow for rapid population homogenization with little restriction by geographical barriers, are possible mechanisms of C. calidirosea distribution. This dispersal and extinction mechanism is likely not limited to C. calidirosea but may shape the populations and genomes of many other low-abundance free-living taxa. IMPORTANCE This study compares the genomic sequence variations and metabolisms of four strains of Chthonomonas calidirosea, a rare thermophilic bacterium from the phylum Armatimonadetes. It additionally compares the microbial communities and chemistry of each of the geographically distinct sites from which the four C. calidirosea strains were isolated. C. calidirosea was previously reported to possess a highly disorganized genome, but it was unclear whether this reflected rapid evolution. Here, we show that each isolation site has a distinct chemistry and microbial community, but despite this, the C. calidirosea genome is highly conserved across all isolation sites. Furthermore, genomic sequence differences only partially paralleled geographic distance, suggesting that C. calidirosea genotypes are not primarily determined by adaptive evolution. Instead, the presence of C. calidirosea may be driven by stochastic dispersal and localized extinction. This ecological mechanism may apply to many other low-abundance taxa. PMID:27060125
Reuter, Miriam S.; Walker, Susan; Thiruvahindrapuram, Bhooma; Whitney, Joe; Cohn, Iris; Sondheimer, Neal; Yuen, Ryan K.C.; Trost, Brett; Paton, Tara A.; Pereira, Sergio L.; Herbrick, Jo-Anne; Wintle, Richard F.; Merico, Daniele; Howe, Jennifer; MacDonald, Jeffrey R.; Lu, Chao; Nalpathamkalam, Thomas; Sung, Wilson W.L.; Wang, Zhuozhi; Patel, Rohan V.; Pellecchia, Giovanna; Wei, John; Strug, Lisa J.; Bell, Sherilyn; Kellam, Barbara; Mahtani, Melanie M.; Bassett, Anne S.; Bombard, Yvonne; Weksberg, Rosanna; Shuman, Cheryl; Cohn, Ronald D.; Stavropoulos, Dimitri J.; Bowdin, Sarah; Hildebrandt, Matthew R.; Wei, Wei; Romm, Asli; Pasceri, Peter; Ellis, James; Ray, Peter; Meyn, M. Stephen; Monfared, Nasim; Hosseini, S. Mohsen; Joseph-George, Ann M.; Keeley, Fred W.; Cook, Ryan A.; Fiume, Marc; Lee, Hin C.; Marshall, Christian R.; Davies, Jill; Hazell, Allison; Buchanan, Janet A.; Szego, Michael J.; Scherer, Stephen W.
2018-01-01
BACKGROUND: The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information. We describe genomic variation identified in the initial recruitment cohort of 56 volunteers. METHODS: Volunteers were screened for eligibility and provided informed consent for open data sharing. Using blood DNA, we performed whole genome sequencing and identified all possible classes of DNA variants. A genetic counsellor explained the implication of the results to each participant. RESULTS: Whole genome sequencing of the first 56 participants identified 207 662 805 sequence variants and 27 494 copy number variations. We analyzed a prioritized disease-associated data set (n = 1606 variants) according to standardized guidelines, and interpreted 19 variants in 14 participants (25%) as having obvious health implications. Six of these variants (e.g., in BRCA1 or mosaic loss of an X chromosome) were pathogenic or likely pathogenic. Seven were risk factors for cancer, cardiovascular or neurobehavioural conditions. Four other variants — associated with cancer, cardiac or neurodegenerative phenotypes — remained of uncertain significance because of discrepancies among databases. We also identified a large structural chromosome aberration and a likely pathogenic mitochondrial variant. There were 172 recessive disease alleles (e.g., 5 individuals carried mutations for cystic fibrosis). Pharmacogenomics analyses revealed another 3.9 potentially relevant genotypes per individual. INTERPRETATION: Our analyses identified a spectrum of genetic variants with potential health impact in 25% of participants. When also considering recessive alleles and variants with potential pharmacologic relevance, all 56 participants had medically relevant findings. Although access is mostly limited to research, whole genome sequencing can provide specific and novel information with the potential of major impact for health care. PMID:29431110
Reuter, Miriam S; Walker, Susan; Thiruvahindrapuram, Bhooma; Whitney, Joe; Cohn, Iris; Sondheimer, Neal; Yuen, Ryan K C; Trost, Brett; Paton, Tara A; Pereira, Sergio L; Herbrick, Jo-Anne; Wintle, Richard F; Merico, Daniele; Howe, Jennifer; MacDonald, Jeffrey R; Lu, Chao; Nalpathamkalam, Thomas; Sung, Wilson W L; Wang, Zhuozhi; Patel, Rohan V; Pellecchia, Giovanna; Wei, John; Strug, Lisa J; Bell, Sherilyn; Kellam, Barbara; Mahtani, Melanie M; Bassett, Anne S; Bombard, Yvonne; Weksberg, Rosanna; Shuman, Cheryl; Cohn, Ronald D; Stavropoulos, Dimitri J; Bowdin, Sarah; Hildebrandt, Matthew R; Wei, Wei; Romm, Asli; Pasceri, Peter; Ellis, James; Ray, Peter; Meyn, M Stephen; Monfared, Nasim; Hosseini, S Mohsen; Joseph-George, Ann M; Keeley, Fred W; Cook, Ryan A; Fiume, Marc; Lee, Hin C; Marshall, Christian R; Davies, Jill; Hazell, Allison; Buchanan, Janet A; Szego, Michael J; Scherer, Stephen W
2018-02-05
The Personal Genome Project Canada is a comprehensive public data resource that integrates whole genome sequencing data and health information. We describe genomic variation identified in the initial recruitment cohort of 56 volunteers. Volunteers were screened for eligibility and provided informed consent for open data sharing. Using blood DNA, we performed whole genome sequencing and identified all possible classes of DNA variants. A genetic counsellor explained the implication of the results to each participant. Whole genome sequencing of the first 56 participants identified 207 662 805 sequence variants and 27 494 copy number variations. We analyzed a prioritized disease-associated data set ( n = 1606 variants) according to standardized guidelines, and interpreted 19 variants in 14 participants (25%) as having obvious health implications. Six of these variants (e.g., in BRCA1 or mosaic loss of an X chromosome) were pathogenic or likely pathogenic. Seven were risk factors for cancer, cardiovascular or neurobehavioural conditions. Four other variants - associated with cancer, cardiac or neurodegenerative phenotypes - remained of uncertain significance because of discrepancies among databases. We also identified a large structural chromosome aberration and a likely pathogenic mitochondrial variant. There were 172 recessive disease alleles (e.g., 5 individuals carried mutations for cystic fibrosis). Pharmacogenomics analyses revealed another 3.9 potentially relevant genotypes per individual. Our analyses identified a spectrum of genetic variants with potential health impact in 25% of participants. When also considering recessive alleles and variants with potential pharmacologic relevance, all 56 participants had medically relevant findings. Although access is mostly limited to research, whole genome sequencing can provide specific and novel information with the potential of major impact for health care. © 2018 Joule Inc. or its licensors.
Blake, Jonathon; Riddell, Andrew; Theiss, Susanne; Gonzalez, Alexis Perez; Haase, Bettina; Jauch, Anna; Janssen, Johannes W. G.; Ibberson, David; Pavlinic, Dinko; Moog, Ute; Benes, Vladimir; Runz, Heiko
2014-01-01
Balanced chromosome abnormalities (BCAs) occur at a high frequency in healthy and diseased individuals, but cost-efficient strategies to identify BCAs and evaluate whether they contribute to a phenotype have not yet become widespread. Here we apply genome-wide mate-pair library sequencing to characterize structural variation in a patient with unclear neurodevelopmental disease (NDD) and complex de novo BCAs at the karyotype level. Nucleotide-level characterization of the clinically described BCA breakpoints revealed disruption of at least three NDD candidate genes (LINC00299, NUP205, PSMD14) that gave rise to abnormal mRNAs and could be assumed as disease-causing. However, unbiased genome-wide analysis of the sequencing data for cryptic structural variation was key to reveal an additional submicroscopic inversion that truncates the schizophrenia- and bipolar disorder-associated brain transcription factor ZNF804A as an equally likely NDD-driving gene. Deep sequencing of fluorescent-sorted wild-type and derivative chromosomes confirmed the clinically undetected BCA. Moreover, deep sequencing further validated a high accuracy of mate-pair library sequencing to detect structural variants larger than 10 kB, proposing that this approach is powerful for clinical-grade genome-wide structural variant detection. Our study supports previous evidence for a role of ZNF804A in NDD and highlights the need for a more comprehensive assessment of structural variation in karyotypically abnormal individuals and patients with neurocognitive disease to avoid diagnostic deception. PMID:24625750
Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome
Johnston, Henry Richard; Hu, Yi-Juan; Gao, Jingjing; O’Connor, Timothy D.; Abecasis, Gonçalo R.; Wojcik, Genevieve L; Gignoux, Christopher R.; Gourraud, Pierre-Antoine; Lizee, Antoine; Hansen, Mark; Genuario, Rob; Bullis, Dave; Lawley, Cindy; Kenny, Eimear E.; Bustamante, Carlos; Beaty, Terri H.; Mathias, Rasika A.; Barnes, Kathleen C.; Qin, Zhaohui S.; Preethi Boorgula, Meher; Campbell, Monica; Chavan, Sameer; Ford, Jean G.; Foster, Cassandra; Gao, Li; Hansel, Nadia N.; Horowitz, Edward; Huang, Lili; Ortiz, Romina; Potee, Joseph; Rafaels, Nicholas; Ruczinski, Ingo; Scott, Alan F.; Taub, Margaret A.; Vergara, Candelaria; Levin, Albert M.; Padhukasahasram, Badri; Williams, L. Keoki; Dunston, Georgia M.; Faruque, Mezbah U.; Gietzen, Kimberly; Deshpande, Aniket; Grus, Wendy E.; Locke, Devin P.; Foreman, Marilyn G.; Avila, Pedro C.; Grammer, Leslie; Kim, Kwang-Youn A.; Kumar, Rajesh; Schleimer, Robert; De La Vega, Francisco M.; Shringarpure, Suyash S.; Musharoff, Shaila; Burchard, Esteban G.; Eng, Celeste; Hernandez, Ryan D.; Pino-Yanes, Maria; Torgerson, Dara G.; Szpiech, Zachary A.; Torres, Raul; Nicolae, Dan L.; Ober, Carole; Olopade, Christopher O; Olopade, Olufunmilayo; Oluwole, Oluwafemi; Arinola, Ganiyu; Song, Wei; Correa, Adolfo; Musani, Solomon; Wilson, James G.; Lange, Leslie A.; Akey, Joshua; Bamshad, Michael; Chong, Jessica; Fu, Wenqing; Nickerson, Deborah; Reiner, Alexander; Hartert, Tina; Ware, Lorraine B.; Bleecker, Eugene; Meyers, Deborah; Ortega, Victor E.; Maul, Pissamai; Maul, Trevor; Watson, Harold; Ilma Araujo, Maria; Riccio Oliveira, Ricardo; Caraballo, Luis; Marrugo, Javier; Martinez, Beatriz; Meza, Catherine; Ayestas, Gerardo; Francisco Herrera-Paz, Edwin; Landaverde-Torres, Pamela; Erazo, Said Omar Leiva; Martinez, Rosella; Mayorga, Alvaro; Mayorga, Luis F.; Mejia-Mejia, Delmy-Aracely; Ramos, Hector; Saenz, Allan; Varela, Gloria; Marina Vasquez, Olga; Ferguson, Trevor; Knight-Madden, Jennifer; Samms-Vaughan, Maureen; Wilks, Rainford J.; Adegnika, Akim; Ateba-Ngoa, Ulysse; Yazdanbakhsh, Maria
2017-01-01
A primary goal of The Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) is to develop an ‘African Diaspora Power Chip’ (ADPC), a genotyping array consisting of tagging SNPs, useful in comprehensively identifying African specific genetic variation. This array is designed based on the novel variation identified in 642 CAAPA samples of African ancestry with high coverage whole genome sequence data (~30× depth). This novel variation extends the pattern of variation catalogued in the 1000 Genomes and Exome Sequencing Projects to a spectrum of populations representing the wide range of West African genomic diversity. These individuals from CAAPA also comprise a large swath of the African Diaspora population and incorporate historical genetic diversity covering nearly the entire Atlantic coast of the Americas. Here we show the results of designing and producing such a microchip array. This novel array covers African specific variation far better than other commercially available arrays, and will enable better GWAS analyses for researchers with individuals of African descent in their study populations. A recent study cataloging variation in continental African populations suggests this type of African-specific genotyping array is both necessary and valuable for facilitating large-scale GWAS in populations of African ancestry. PMID:28429804
Repetitive sequences in plant nuclear DNA: types, distribution, evolution and function.
Mehrotra, Shweta; Goyal, Vinod
2014-08-01
Repetitive DNA sequences are a major component of eukaryotic genomes and may account for up to 90% of the genome size. They can be divided into minisatellite, microsatellite and satellite sequences. Satellite DNA sequences are considered to be a fast-evolving component of eukaryotic genomes, comprising tandemly-arrayed, highly-repetitive and highly-conserved monomer sequences. The monomer unit of satellite DNA is 150-400 base pairs (bp) in length. Repetitive sequences may be species- or genus-specific, and may be centromeric or subtelomeric in nature. They exhibit cohesive and concerted evolution caused by molecular drive, leading to high sequence homogeneity. Repetitive sequences accumulate variations in sequence and copy number during evolution, hence they are important tools for taxonomic and phylogenetic studies, and are known as "tuning knobs" in the evolution. Therefore, knowledge of repetitive sequences assists our understanding of the organization, evolution and behavior of eukaryotic genomes. Repetitive sequences have cytoplasmic, cellular and developmental effects and play a role in chromosomal recombination. In the post-genomics era, with the introduction of next-generation sequencing technology, it is possible to evaluate complex genomes for analyzing repetitive sequences and deciphering the yet unknown functional potential of repetitive sequences. Copyright © 2014 The Authors. Production and hosting by Elsevier Ltd.. All rights reserved.
Sloan, Daniel B; Müller, Karel; McCauley, David E; Taylor, Douglas R; Storchová, Helena
2012-12-01
In angiosperms, mitochondrial-encoded genes can cause cytoplasmic male sterility (CMS), resulting in the coexistence of female and hermaphroditic individuals (gynodioecy). We compared four complete mitochondrial genomes from the gynodioecious species Silene vulgaris and found unprecedented amounts of intraspecific diversity for plant mitochondrial DNA (mtDNA). Remarkably, only about half of overall sequence content is shared between any pair of genomes. The four mtDNAs range in size from 361 to 429 kb and differ in gene complement, with rpl5 and rps13 being intact in some genomes but absent or pseudogenized in others. The genomes exhibit essentially no conservation of synteny and are highly repetitive, with evidence of reciprocal recombination occurring even across short repeats (< 250 bp). Some mitochondrial genes exhibit atypically high degrees of nucleotide polymorphism, while others are invariant. The genomes also contain a variable number of small autonomously mapping chromosomes, which have only recently been identified in angiosperm mtDNA. Southern blot analysis of one of these chromosomes indicated a complex in vivo structure consisting of both monomeric circles and multimeric forms. We conclude that S. vulgaris harbors an unusually large degree of variation in mtDNA sequence and structure and discuss the extent to which this variation might be related to CMS. © 2012 The Authors. New Phytologist © 2012 New Phytologist Trust.
Proteogenomic Investigation of Strain Variation in Clinical Mycobacterium tuberculosis Isolates.
Heunis, Tiaan; Dippenaar, Anzaan; Warren, Robin M; van Helden, Paul D; van der Merwe, Ruben G; Gey van Pittius, Nicolaas C; Pain, Arnab; Sampson, Samantha L; Tabb, David L
2017-10-06
Mycobacterium tuberculosis consists of a large number of different strains that display unique virulence characteristics. Whole-genome sequencing has revealed substantial genetic diversity among clinical M. tuberculosis isolates, and elucidating the phenotypic variation encoded by this genetic diversity will be of the utmost importance to fully understand M. tuberculosis biology and pathogenicity. In this study, we integrated whole-genome sequencing and mass spectrometry (GeLC-MS/MS) to reveal strain-specific characteristics in the proteomes of two clinical M. tuberculosis Latin American-Mediterranean isolates. Using this approach, we identified 59 peptides containing single amino acid variants, which covered ∼9% of all coding nonsynonymous single nucleotide variants detected by whole-genome sequencing. Furthermore, we identified 29 distinct peptides that mapped to a hypothetical protein not present in the M. tuberculosis H37Rv reference proteome. Here, we provide evidence for the expression of this protein in the clinical M. tuberculosis SAWC3651 isolate. The strain-specific databases enabled confirmation of genomic differences (i.e., large genomic regions of difference and nonsynonymous single nucleotide variants) in these two clinical M. tuberculosis isolates and allowed strain differentiation at the proteome level. Our results contribute to the growing field of clinical microbial proteogenomics and can improve our understanding of phenotypic variation in clinical M. tuberculosis isolates.
Xu, Shuhua
2015-01-01
Noncoding DNA sequences (NCS) have attracted much attention recently due to their functional potentials. Here we attempted to reveal the functional roles of noncoding sequences from the point of view of natural selection that typically indicates the functional potentials of certain genomic elements. We analyzed nearly 37 million single nucleotide polymorphisms (SNPs) of Phase I data of the 1000 Genomes Project. We estimated a series of key parameters of population genetics and molecular evolution to characterize sequence variations of the noncoding genome within and between populations, and identified the natural selection footprints in NCS in worldwide human populations. Our results showed that purifying selection is prevalent and there is substantial constraint of variations in NCS, while positive selectionis more likely to be specific to some particular genomic regions and regional populations. Intriguingly, we observed larger fraction of non-conserved NCS variants with lower derived allele frequency in the genome, indicating possible functional gain of non-conserved NCS. Notably, NCS elements are enriched for potentially functional markers such as eQTLs, TF motif, and DNase I footprints in the genome. More interestingly, some NCS variants associated with diseases such as Alzheimer's disease, Type 1 diabetes, and immune-related bowel disorder (IBD) showed signatures of positive selection, although the majority of NCS variants, reported as risk alleles by genome-wide association studies, showed signatures of negative selection. Our analyses provided compelling evidence of natural selection forces on noncoding sequences in the human genome and advanced our understanding of their functional potentials that play important roles in disease etiology and human evolution. PMID:26053627
Xu, Qin; Xiong, Guanjun; Li, Pengbo; He, Fei; Huang, Yi; Wang, Kunbo; Li, Zhaohu; Hua, Jinping
2012-01-01
Background Cotton (Gossypium spp.) is a model system for the analysis of polyploidization. Although ascertaining the donor species of allotetraploid cotton has been intensively studied, sequence comparison of Gossypium chloroplast genomes is still of interest to understand the mechanisms underlining the evolution of Gossypium allotetraploids, while it is generally accepted that the parents were A- and D-genome containing species. Here we performed a comparative analysis of 13 Gossypium chloroplast genomes, twelve of which are presented here for the first time. Methodology/Principal Findings The size of 12 chloroplast genomes under study varied from 159,959 bp to 160,433 bp. The chromosomes were highly similar having >98% sequence identity. They encoded the same set of 112 unique genes which occurred in a uniform order with only slightly different boundary junctions. Divergence due to indels as well as substitutions was examined separately for genome, coding and noncoding sequences. The genome divergence was estimated as 0.374% to 0.583% between allotetraploid species and A-genome, and 0.159% to 0.454% within allotetraploids. Forty protein-coding genes were completely identical at the protein level, and 20 intergenic sequences were completely conserved. The 9 allotetraploids shared 5 insertions and 9 deletions in whole genome, and 7-bp substitutions in protein-coding genes. The phylogenetic tree confirmed a close relationship between allotetraploids and the ancestor of A-genome, and the allotetraploids were divided into four separate groups. Progenitor allotetraploid cotton originated 0.43–0.68 million years ago (MYA). Conclusion Despite high degree of conservation between the Gossypium chloroplast genomes, sequence variations among species could still be detected. Gossypium chloroplast genomes preferred for 5-bp indels and 1–3-bp indels are mainly attributed to the SSR polymorphisms. This study supports that the common ancestor of diploid A-genome species in Gossypium is the maternal source of extant allotetraploid species and allotetraploids have a monophyletic origin. G. hirsutum AD1 lineages have experienced more sequence variations than other allotetraploids in intergenic regions. The available complete nucleotide sequences of 12 Gossypium chloroplast genomes should facilitate studies to uncover the molecular mechanisms of compartmental co-evolution and speciation of Gossypium allotetraploids. PMID:22876273
Using an online genome resource to identify myostatin variation in U.S. sheep
USDA-ARS?s Scientific Manuscript database
We created a public, searchable DNA sequence resource for sheep that contained approximately 14x whole genome sequence of 96 rams. The animals represent 10 popular U.S. breeds and share minimal pedigree relationships, making the resource suitable for viewing gene variants in the user-friendly Integ...
Model-based quality assessment and base-calling for second-generation sequencing data.
Bravo, Héctor Corrada; Irizarry, Rafael A
2010-09-01
Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads-strings of A,C,G, or T's, between 30 and 100 characters long-which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance. © 2009, The International Biometric Society.
Comparative genomics of Enterococcus faecalis from healthy Norwegian infants
Solheim, Margrete; Aakra, Ågot; Snipen, Lars G; Brede, Dag A; Nes, Ingolf F
2009-01-01
Background Enterococcus faecalis, traditionally considered a harmless commensal of the intestinal tract, is now ranked among the leading causes of nosocomial infections. In an attempt to gain insight into the genetic make-up of commensal E. faecalis, we have studied genomic variation in a collection of community-derived E. faecalis isolated from the feces of Norwegian infants. Results The E. faecalis isolates were first sequence typed by multilocus sequence typing (MLST) and characterized with respect to antibiotic resistance and properties associated with virulence. A subset of the isolates was compared to the vancomycin resistant strain E. faecalis V583 (V583) by whole genome microarray comparison (comparative genomic hybridization (CGH)). Several of the putative enterococcal virulence factors were found to be highly prevalent among the commensal baby isolates. The genomic variation as observed by CGH was less between isolates displaying the same MLST sequence type than between isolates belonging to different evolutionary lineages. Conclusion The variations in gene content observed among the investigated commensal E. faecalis is comparable to the genetic variation previously reported among strains of various origins thought to be representative of the major E. faecalis lineages. Previous MLST analysis of E. faecalis have identified so-called high-risk enterococcal clonal complexes (HiRECC), defined as genetically distinct subpopulations, epidemiologically associated with enterococcal infections. The observed correlation between CGH and MLST presented here, may offer a method for the identification of lineage-specific genes, and may therefore add clues on how to distinguish pathogenic from commensal E. faecalis. In this work, information on the core genome of E. faecalis is also substantially extended. PMID:19393078
The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes
Liu, Shengyi; Liu, Yumei; Yang, Xinhua; Tong, Chaobo; Edwards, David; Parkin, Isobel A. P.; Zhao, Meixia; Ma, Jianxin; Yu, Jingyin; Huang, Shunmou; Wang, Xiyin; Wang, Junyi; Lu, Kun; Fang, Zhiyuan; Bancroft, Ian; Yang, Tae-Jin; Hu, Qiong; Wang, Xinfa; Yue, Zhen; Li, Haojie; Yang, Linfeng; Wu, Jian; Zhou, Qing; Wang, Wanxin; King, Graham J; Pires, J. Chris; Lu, Changxin; Wu, Zhangyan; Sampath, Perumal; Wang, Zhuo; Guo, Hui; Pan, Shengkai; Yang, Limei; Min, Jiumeng; Zhang, Dong; Jin, Dianchuan; Li, Wanshun; Belcram, Harry; Tu, Jinxing; Guan, Mei; Qi, Cunkou; Du, Dezhi; Li, Jiana; Jiang, Liangcai; Batley, Jacqueline; Sharpe, Andrew G; Park, Beom-Seok; Ruperao, Pradeep; Cheng, Feng; Waminal, Nomar Espinosa; Huang, Yin; Dong, Caihua; Wang, Li; Li, Jingping; Hu, Zhiyong; Zhuang, Mu; Huang, Yi; Huang, Junyan; Shi, Jiaqin; Mei, Desheng; Liu, Jing; Lee, Tae-Ho; Wang, Jinpeng; Jin, Huizhe; Li, Zaiyun; Li, Xun; Zhang, Jiefu; Xiao, Lu; Zhou, Yongming; Liu, Zhongsong; Liu, Xuequn; Qin, Rui; Tang, Xu; Liu, Wenbin; Wang, Yupeng; Zhang, Yangyong; Lee, Jonghoon; Kim, Hyun Hee; Denoeud, France; Xu, Xun; Liang, Xinming; Hua, Wei; Wang, Xiaowu; Wang, Jun; Chalhoub, Boulos; Paterson, Andrew H
2014-01-01
Polyploidization has provided much genetic variation for plant adaptive evolution, but the mechanisms by which the molecular evolution of polyploid genomes establishes genetic architecture underlying species differentiation are unclear. Brassica is an ideal model to increase knowledge of polyploid evolution. Here we describe a draft genome sequence of Brassica oleracea, comparing it with that of its sister species B. rapa to reveal numerous chromosome rearrangements and asymmetrical gene loss in duplicated genomic blocks, asymmetrical amplification of transposable elements, differential gene co-retention for specific pathways and variation in gene expression, including alternative splicing, among a large number of paralogous and orthologous genes. Genes related to the production of anticancer phytochemicals and morphological variations illustrate consequences of genome duplication and gene divergence, imparting biochemical and morphological variation to B. oleracea. This study provides insights into Brassica genome evolution and will underpin research into the many important crops in this genus. PMID:24852848
Progress in Understanding and Sequencing the Genome of Brassica rapa
Hong, Chang Pyo; Kwon, Soo-Jin; Kim, Jung Sun; Yang, Tae-Jin; Park, Beom-Seok; Lim, Yong Pyo
2008-01-01
Brassica rapa, which is closely related to Arabidopsis thaliana, is an important crop and a model plant for studying genome evolution via polyploidization. We report the current understanding of the genome structure of B. rapa and efforts for the whole-genome sequencing of the species. The tribe Brassicaceae, which comprises ca. 240 species, descended from a common hexaploid ancestor with a basic genome similar to that of Arabidopsis. Chromosome rearrangements, including fusions and/or fissions, resulted in the present-day “diploid” Brassica species with variation in chromosome number and phenotype. Triplicated genomic segments of B. rapa are collinear to those of A. thaliana with InDels. The genome triplication has led to an approximately 1.7-fold increase in the B. rapa gene number compared to that of A. thaliana. Repetitive DNA of B. rapa has also been extensively amplified and has diverged from that of A. thaliana. For its whole-genome sequencing, the Brassica rapa Genome Sequencing Project (BrGSP) consortium has developed suitable genomic resources and constructed genetic and physical maps. Ten chromosomes of B. rapa are being allocated to BrGSP consortium participants, and each chromosome will be sequenced by a BAC-by-BAC approach. Genome sequencing of B. rapa will offer a new perspective for plant biology and evolution in the context of polyploidization. PMID:18288250
Microsatellite analysis in the genome of Acanthaceae: An in silico approach
Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar
2015-01-01
Background: Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. The microsatellites existing in the complete genome sequences would help to attain a direct role in the genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes. Objective: The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. Materials and Methods: The whole nucleotide sequences of Acanthaceae species were obtained from National Center for Biotechnology Information database and screened for the presence of SSRs. SSR Locator tool was used to predict the microsatellites and inbuilt Primer3 module was used for primer designing. Results: Totally 110 repeats from 108 sequences of Acanthaceae family plant genomes were identified, and the occurrence of dinucleotide repeats was found to be abundant in the genome sequences. The essential amino acid isoleucine was found rich in all the sequences. We also designed the SSR-based primers/markers for 59 sequences of this family that contains microsatellite repeats in their genome. Conclusion: The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to Acanthaceae family in the future. PMID:25709226
Alkan, Can; Kavak, Pinar; Somel, Mehmet; Gokcumen, Omer; Ugurlu, Serkan; Saygi, Ceren; Dal, Elif; Bugra, Kuyas; Güngör, Tunga; Sahinalp, S Cenk; Özören, Nesrin; Bekpen, Cemalettin
2014-11-07
Turkey is a crossroads of major population movements throughout history and has been a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers. However, to date, there have been no studies to assess the genetic variation at the whole genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32×-48×). We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contribution from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish populations and European populations. Furthermore, we have analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found increased allele frequency of 31.25% for H1/H2 inversion polymorphism when compared to European populations that show about 25% of allele frequency. This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in analyzing natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey.
An Exploration into Fern Genome Space.
Wolf, Paul G; Sessa, Emily B; Marchant, Daniel Blaine; Li, Fay-Wei; Rothfels, Carl J; Sigel, Erin M; Gitzendanner, Matthew A; Visger, Clayton J; Banks, Jo Ann; Soltis, Douglas E; Soltis, Pamela S; Pryer, Kathleen M; Der, Joshua P
2015-08-26
Ferns are one of the few remaining major clades of land plants for which a complete genome sequence is lacking. Knowledge of genome space in ferns will enable broad-scale comparative analyses of land plant genes and genomes, provide insights into genome evolution across green plants, and shed light on genetic and genomic features that characterize ferns, such as their high chromosome numbers and large genome sizes. As part of an initial exploration into fern genome space, we used a whole genome shotgun sequencing approach to obtain low-density coverage (∼0.4X to 2X) for six fern species from the Polypodiales (Ceratopteris, Pteridium, Polypodium, Cystopteris), Cyatheales (Plagiogyria), and Gleicheniales (Dipteris). We explore these data to characterize the proportion of the nuclear genome represented by repetitive sequences (including DNA transposons, retrotransposons, ribosomal DNA, and simple repeats) and protein-coding genes, and to extract chloroplast and mitochondrial genome sequences. Such initial sweeps of fern genomes can provide information useful for selecting a promising candidate fern species for whole genome sequencing. We also describe variation of genomic traits across our sample and highlight some differences and similarities in repeat structure between ferns and seed plants. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads
Rebolledo-Mendez, Jovan; Hestand, Matthew S.; Coleman, Stephen J.; Zeng, Zheng; Orlando, Ludovic; MacLeod, James N.; Kalbfleisch, Ted
2015-01-01
The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight’s half brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference has accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions have become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects’ and Twilight’s genome or due to errors in the reference. EquCab2 is regarded as “The Twilight Assembly.” The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset comprised of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo's BAC end sequences. Our analysis identifies 720,843 homozygous discrepancies between new, high throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low coverage alignments. PMID:26107638
Konkel, Miriam K; Walker, Jerilyn A; Hotard, Ashley B; Ranck, Megan C; Fontenot, Catherine C; Storer, Jessica; Stewart, Chip; Marth, Gabor T; Batzer, Mark A
2015-08-29
The goal of the 1000 Genomes Consortium is to characterize human genome structural variation (SV), including forms of copy number variations such as deletions, duplications, and insertions. Mobile element insertions, particularly Alu elements, are major contributors to genomic SV among humans. During the pilot phase of the project we experimentally validated 645 (611 intergenic and 34 exon targeted) polymorphic "young" Alu insertion events, absent from the human reference genome. Here, we report high resolution sequencing of 343 (322 unique) recent Alu insertion events, along with their respective target site duplications, precise genomic breakpoint coordinates, subfamily assignment, percent divergence, and estimated A-rich tail lengths. All the sequenced Alu loci were derived from the AluY lineage with no evidence of retrotransposition activity involving older Alu families (e.g., AluJ and AluS). AluYa5 is currently the most active Alu subfamily in the human lineage, followed by AluYb8, and many others including three newly identified subfamilies we have termed AluYb7a3, AluYb8b1, and AluYa4a1. This report provides the structural details of 322 unique Alu variants from individual human genomes collectively adding about 100 kb of genomic variation. Many Alu subfamilies are currently active in human populations, including a surprising level of AluY retrotransposition. Human Alu subfamilies exhibit continuous evolution with potential drivers sprouting new Alu lineages. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Update on Genomic Databases and Resources at the National Center for Biotechnology Information.
Tatusova, Tatiana
2016-01-01
The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains are integrated with the analytical, search, and retrieval resources through the NCBI website, text-based search and retrieval system, provides a fast and easy way to navigate across diverse biological databases.Comparative genome analysis tools lead to further understanding of evolution processes quickening the pace of discovery. Recent technological innovations have ignited an explosion in genome sequencing that has fundamentally changed our understanding of the biology of living organisms. This huge increase in DNA sequence data presents new challenges for the information management system and the visualization tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data.
Mapping copy number variation by population-scale genome sequencing.
Mills, Ryan E; Walter, Klaudia; Stewart, Chip; Handsaker, Robert E; Chen, Ken; Alkan, Can; Abyzov, Alexej; Yoon, Seungtai Chris; Ye, Kai; Cheetham, R Keira; Chinwalla, Asif; Conrad, Donald F; Fu, Yutao; Grubert, Fabian; Hajirasouliha, Iman; Hormozdiari, Fereydoun; Iakoucheva, Lilia M; Iqbal, Zamin; Kang, Shuli; Kidd, Jeffrey M; Konkel, Miriam K; Korn, Joshua; Khurana, Ekta; Kural, Deniz; Lam, Hugo Y K; Leng, Jing; Li, Ruiqiang; Li, Yingrui; Lin, Chang-Yun; Luo, Ruibang; Mu, Xinmeng Jasmine; Nemesh, James; Peckham, Heather E; Rausch, Tobias; Scally, Aylwyn; Shi, Xinghua; Stromberg, Michael P; Stütz, Adrian M; Urban, Alexander Eckehart; Walker, Jerilyn A; Wu, Jiantao; Zhang, Yujun; Zhang, Zhengdong D; Batzer, Mark A; Ding, Li; Marth, Gabor T; McVean, Gil; Sebat, Jonathan; Snyder, Michael; Wang, Jun; Ye, Kenny; Eichler, Evan E; Gerstein, Mark B; Hurles, Matthew E; Lee, Charles; McCarroll, Steven A; Korbel, Jan O
2011-02-03
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
Cai, Na; Bigdeli, Tim B; Kretzschmar, Warren W; Li, Yihan; Liang, Jieqin; Hu, Jingchu; Peterson, Roseann E; Bacanu, Silviu; Webb, Bradley Todd; Riley, Brien; Li, Qibin; Marchini, Jonathan; Mott, Richard; Kendler, Kenneth S; Flint, Jonathan
2017-02-14
The China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology (CONVERGE) project on Major Depressive Disorder (MDD) sequenced 11,670 female Han Chinese at low-coverage (1.7X), providing the first large-scale whole genome sequencing resource representative of the largest ethnic group in the world. Samples are collected from 58 hospitals from 23 provinces around China. We are able to call 22 million high quality single nucleotide polymorphisms (SNP) from the nuclear genome, representing the largest SNP call set from an East Asian population to date. We use these variants for imputation of genotypes across all samples, and this has allowed us to perform a successful genome wide association study (GWAS) on MDD. The utility of these data can be extended to studies of genetic ancestry in the Han Chinese and evolutionary genetics when integrated with data from other populations. Molecular phenotypes, such as copy number variations and structural variations can be detected, quantified and analysed in similar ways.
Consensus generation and variant detection by Celera Assembler.
Denisov, Gennady; Walenz, Brian; Halpern, Aaron L; Miller, Jason; Axelrod, Nelson; Levy, Samuel; Sutton, Granger
2008-04-15
We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Being applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the total number 2,033311 detected regions of sequence variation. In 33,269 out of 460,373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with false positive rate 12%. The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/
Whole genome re-sequencing of date palms yields insights into diversification of a fruit tree crop.
Hazzouri, Khaled M; Flowers, Jonathan M; Visser, Hendrik J; Khierallah, Hussam S M; Rosas, Ulises; Pham, Gina M; Meyer, Rachel S; Johansen, Caryn K; Fresquez, Zoë A; Masmoudi, Khaled; Haider, Nadia; El Kadri, Nabila; Idaghdour, Youssef; Malek, Joel A; Thirkhill, Deborah; Markhand, Ghulam S; Krueger, Robert R; Zaid, Abdelouahhab; Purugganan, Michael D
2015-11-09
Date palms (Phoenix dactylifera) are the most significant perennial crop in arid regions of the Middle East and North Africa. Here, we present a comprehensive catalogue of approximately seven million single nucleotide polymorphisms in date palms based on whole genome re-sequencing of a collection of 62 cultivars. Population structure analysis indicates a major genetic divide between North Africa and the Middle East/South Asian date palms, with evidence of admixture in cultivars from Egypt and Sudan. Genome-wide scans for selection suggest at least 56 genomic regions associated with selective sweeps that may underlie geographic adaptation. We report candidate mutations for trait variation, including nonsense polymorphisms and presence/absence variation in gene content in pathways for key agronomic traits. We also identify a copia-like retrotransposon insertion polymorphism in the R2R3 myb-like orthologue of the oil palm virescens gene associated with fruit colour variation. This analysis documents patterns of post-domestication diversification and provides a genomic resource for this economically important perennial tree crop.
Whole genome re-sequencing of date palms yields insights into diversification of a fruit tree crop
Hazzouri, Khaled M.; Flowers, Jonathan M.; Visser, Hendrik J.; Khierallah, Hussam S. M.; Rosas, Ulises; Pham, Gina M.; Meyer, Rachel S.; Johansen, Caryn K.; Fresquez, Zoë A.; Masmoudi, Khaled; Haider, Nadia; El Kadri, Nabila; Idaghdour, Youssef; Malek, Joel A.; Thirkhill, Deborah; Markhand, Ghulam S.; Krueger, Robert R.; Zaid, Abdelouahhab; Purugganan, Michael D.
2015-01-01
Date palms (Phoenix dactylifera) are the most significant perennial crop in arid regions of the Middle East and North Africa. Here, we present a comprehensive catalogue of approximately seven million single nucleotide polymorphisms in date palms based on whole genome re-sequencing of a collection of 62 cultivars. Population structure analysis indicates a major genetic divide between North Africa and the Middle East/South Asian date palms, with evidence of admixture in cultivars from Egypt and Sudan. Genome-wide scans for selection suggest at least 56 genomic regions associated with selective sweeps that may underlie geographic adaptation. We report candidate mutations for trait variation, including nonsense polymorphisms and presence/absence variation in gene content in pathways for key agronomic traits. We also identify a copia-like retrotransposon insertion polymorphism in the R2R3 myb-like orthologue of the oil palm virescens gene associated with fruit colour variation. This analysis documents patterns of post-domestication diversification and provides a genomic resource for this economically important perennial tree crop. PMID:26549859
Kidd, Jeffrey M; Gravel, Simon; Byrnes, Jake; Moreno-Estrada, Andres; Musharoff, Shaila; Bryc, Katarzyna; Degenhardt, Jeremiah D; Brisbin, Abra; Sheth, Vrunda; Chen, Rong; McLaughlin, Stephen F; Peckham, Heather E; Omberg, Larsson; Bormann Chung, Christina A; Stanley, Sarah; Pearlstein, Kevin; Levandowsky, Elizabeth; Acevedo-Acevedo, Suehelay; Auton, Adam; Keinan, Alon; Acuña-Alonzo, Victor; Barquera-Lozano, Rodrigo; Canizales-Quinteros, Samuel; Eng, Celeste; Burchard, Esteban G; Russell, Archie; Reynolds, Andy; Clark, Andrew G; Reese, Martin G; Lincoln, Stephen E; Butte, Atul J; De La Vega, Francisco M; Bustamante, Carlos D
2012-10-05
Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas-70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago. Copyright © 2012 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Kidd, Jeffrey M.; Gravel, Simon; Byrnes, Jake; Moreno-Estrada, Andres; Musharoff, Shaila; Bryc, Katarzyna; Degenhardt, Jeremiah D.; Brisbin, Abra; Sheth, Vrunda; Chen, Rong; McLaughlin, Stephen F.; Peckham, Heather E.; Omberg, Larsson; Bormann Chung, Christina A.; Stanley, Sarah; Pearlstein, Kevin; Levandowsky, Elizabeth; Acevedo-Acevedo, Suehelay; Auton, Adam; Keinan, Alon; Acuña-Alonzo, Victor; Barquera-Lozano, Rodrigo; Canizales-Quinteros, Samuel; Eng, Celeste; Burchard, Esteban G.; Russell, Archie; Reynolds, Andy; Clark, Andrew G.; Reese, Martin G.; Lincoln, Stephen E.; Butte, Atul J.; De La Vega, Francisco M.; Bustamante, Carlos D.
2012-01-01
Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas—70% of the European ancestry in today’s African Americans dates back to European gene flow happening only 7–8 generations ago. PMID:23040495
Alverson, Andrew J; Zhuo, Shi; Rice, Danny W; Sloan, Daniel B; Palmer, Jeffrey D
2011-01-20
The mitochondrial genomes of seed plants are exceptionally fluid in size, structure, and sequence content, with the accumulation and activity of repetitive sequences underlying much of this variation. We report the first fully sequenced mitochondrial genome of a legume, Vigna radiata (mung bean), and show that despite its unexceptional size (401,262 nt), the genome is unusually depauperate in repetitive DNA and "promiscuous" sequences from the chloroplast and nuclear genomes. Although Vigna lacks the large, recombinationally active repeats typical of most other seed plants, a PCR survey of its modest repertoire of short (38-297 nt) repeats nevertheless revealed evidence for recombination across all of them. A set of novel control assays showed, however, that these results could instead reflect, in part or entirely, artifacts of PCR-mediated recombination. Consequently, we recommend that other methods, especially high-depth genome sequencing, be used instead of PCR to infer patterns of plant mitochondrial recombination. The average-sized but repeat- and feature-poor mitochondrial genome of Vigna makes it ever more difficult to generalize about the factors shaping the size and sequence content of plant mitochondrial genomes.
Munchel, Sarah; Hoang, Yen; Zhao, Yue; Cottrell, Joseph; Klotzle, Brandy; Godwin, Andrew K; Koestler, Devin; Beyerlein, Peter; Fan, Jian-Bing; Bibikova, Marina; Chien, Jeremy
2015-09-22
Current genomic studies are limited by the poor availability of fresh-frozen tissue samples. Although formalin-fixed diagnostic samples are in abundance, they are seldom used in current genomic studies because of the concern of formalin-fixation artifacts. Better characterization of these artifacts will allow the use of archived clinical specimens in translational and clinical research studies. To provide a systematic analysis of formalin-fixation artifacts on Illumina sequencing, we generated 26 DNA sequencing data sets from 13 pairs of matched formalin-fixed paraffin-embedded (FFPE) and fresh-frozen (FF) tissue samples. The results indicate high rate of concordant calls between matched FF/FFPE pairs at reference and variant positions in three commonly used sequencing approaches (whole genome, whole exome, and targeted exon sequencing). Global mismatch rates and C · G > T · A substitutions were comparable between matched FF/FFPE samples, and discordant rates were low (<0.26%) in all samples. Finally, low-pass whole genome sequencing produces similar pattern of copy number alterations between FF/FFPE pairs. The results from our studies suggest the potential use of diagnostic FFPE samples for cancer genomic studies to characterize and catalog variations in cancer genomes.
Aslam, Luqman; Beal, Kathryn; Ann Blomberg, Le; Bouffard, Pascal; Burt, David W.; Crasta, Oswald; Crooijmans, Richard P. M. A.; Cooper, Kristal; Coulombe, Roger A.; De, Supriyo; Delany, Mary E.; Dodgson, Jerry B.; Dong, Jennifer J.; Evans, Clive; Frederickson, Karin M.; Flicek, Paul; Florea, Liliana; Folkerts, Otto; Groenen, Martien A. M.; Harkins, Tim T.; Herrero, Javier; Hoffmann, Steve; Megens, Hendrik-Jan; Jiang, Andrew; de Jong, Pieter; Kaiser, Pete; Kim, Heebal; Kim, Kyu-Won; Kim, Sungwon; Langenberger, David; Lee, Mi-Kyung; Lee, Taeheon; Mane, Shrinivasrao; Marcais, Guillaume; Marz, Manja; McElroy, Audrey P.; Modise, Thero; Nefedov, Mikhail; Notredame, Cédric; Paton, Ian R.; Payne, William S.; Pertea, Geo; Prickett, Dennis; Puiu, Daniela; Qioa, Dan; Raineri, Emanuele; Ruffier, Magali; Salzberg, Steven L.; Schatz, Michael C.; Scheuring, Chantel; Schmidt, Carl J.; Schroeder, Steven; Searle, Stephen M. J.; Smith, Edward J.; Smith, Jacqueline; Sonstegard, Tad S.; Stadler, Peter F.; Tafer, Hakim; Tu, Zhijian (Jake); Van Tassell, Curtis P.; Vilella, Albert J.; Williams, Kelly P.; Yorke, James A.; Zhang, Liqing; Zhang, Hong-Bin; Zhang, Xiaojun; Zhang, Yang; Reed, Kent M.
2010-01-01
A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest. PMID:20838655
Werling, Donna M; Brand, Harrison; An, Joon-Yong; Stone, Matthew R; Zhu, Lingxue; Glessner, Joseph T; Collins, Ryan L; Dong, Shan; Layer, Ryan M; Markenscoff-Papadimitriou, Eirene; Farrell, Andrew; Schwartz, Grace B; Wang, Harold Z; Currall, Benjamin B; Zhao, Xuefang; Dea, Jeanselle; Duhn, Clif; Erdman, Carolyn A; Gilson, Michael C; Yadav, Rachita; Handsaker, Robert E; Kashin, Seva; Klei, Lambertus; Mandell, Jeffrey D; Nowakowski, Tomasz J; Liu, Yuwen; Pochareddy, Sirisha; Smith, Louw; Walker, Michael F; Waterman, Matthew J; He, Xin; Kriegstein, Arnold R; Rubenstein, John L; Sestan, Nenad; McCarroll, Steven A; Neale, Benjamin M; Coon, Hilary; Willsey, A Jeremy; Buxbaum, Joseph D; Daly, Mark J; State, Matthew W; Quinlan, Aaron R; Marth, Gabor T; Roeder, Kathryn; Devlin, Bernie; Talkowski, Michael E; Sanders, Stephan J
2018-05-01
Genomic association studies of common or rare protein-coding variation have established robust statistical approaches to account for multiple testing. Here we present a comparable framework to evaluate rare and de novo noncoding single-nucleotide variants, insertion/deletions, and all classes of structural variation from whole-genome sequencing (WGS). Integrating genomic annotations at the level of nucleotides, genes, and regulatory regions, we define 51,801 annotation categories. Analyses of 519 autism spectrum disorder families did not identify association with any categories after correction for 4,123 effective tests. Without appropriate correction, biologically plausible associations are observed in both cases and controls. Despite excluding previously identified gene-disrupting mutations, coding regions still exhibited the strongest associations. Thus, in autism, the contribution of de novo noncoding variation is probably modest in comparison to that of de novo coding variants. Robust results from future WGS studies will require large cohorts and comprehensive analytical strategies that consider the substantial multiple-testing burden.
Parallel or convergent evolution in human population genomic data revealed by genotype networks.
R Vahdati, Ali; Wagner, Andreas
2016-08-02
Genotype networks are representations of genetic variation data that are complementary to phylogenetic trees. A genotype network is a graph whose nodes are genotypes (DNA sequences) with the same broadly defined phenotype. Two nodes are connected if they differ in some minimal way, e.g., in a single nucleotide. We analyze human genome variation data from the 1,000 genomes project, and construct haploid genotype (haplotype) networks for 12,235 protein coding genes. The structure of these networks varies widely among genes, indicating different patterns of variation despite a shared evolutionary history. We focus on those genes whose genotype networks show many cycles, which can indicate homoplasy, i.e., parallel or convergent evolution, on the sequence level. For 42 genes, the observed number of cycles is so large that it cannot be explained by either chance homoplasy or recombination. When analyzing possible explanations, we discovered evidence for positive selection in 21 of these genes and, in addition, a potential role for constrained variation and purifying selection. Balancing selection plays at most a small role. The 42 genes with excess cycles are enriched in functions related to immunity and response to pathogens. Genotype networks are representations of genetic variation data that can help understand unusual patterns of genomic variation.
Schadt, Eric E.; Banerjee, Onureena; Fang, Gang; Feng, Zhixing; Wong, Wing H.; Zhang, Xuegong; Kislyuk, Andrey; Clark, Tyson A.; Luong, Khai; Keren-Paz, Alona; Chess, Andrew; Kumar, Vipin; Chen-Plotkin, Alice; Sondheimer, Neal; Korlach, Jonas; Kasarskis, Andrew
2013-01-01
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types. PMID:23093720
Schadt, Eric E; Banerjee, Onureena; Fang, Gang; Feng, Zhixing; Wong, Wing H; Zhang, Xuegong; Kislyuk, Andrey; Clark, Tyson A; Luong, Khai; Keren-Paz, Alona; Chess, Andrew; Kumar, Vipin; Chen-Plotkin, Alice; Sondheimer, Neal; Korlach, Jonas; Kasarskis, Andrew
2013-01-01
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
Empirical Validation of Pooled Whole Genome Population Re-Sequencing in Drosophila melanogaster
Zhu, Yuan; Bergland, Alan O.; González, Josefa; Petrov, Dmitri A.
2012-01-01
The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strain as a reference. We also estimated allele frequency of the SNPs using “pooled” data and compared them with “true” frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing of patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive. PMID:22848651
2012-01-01
Background Staphylococcus aureus Repeat (STAR) elements are a type of interspersed intergenic direct repeat. In this study the conservation and variation in these elements was explored by bioinformatic analyses of published staphylococcal genome sequences and through sequencing of specific STAR element loci from a large set of S. aureus isolates. Results Using bioinformatic analyses, we found that the STAR elements were located in different genomic loci within each staphylococcal species. There was no correlation between the number of STAR elements in each genome and the evolutionary relatedness of staphylococcal species, however higher levels of repeats were observed in both S. aureus and S. lugdunensis compared to other staphylococcal species. Unexpectedly, sequencing of the internal spacer sequences of individual repeat elements from multiple isolates showed conservation at the sequence level within deep evolutionary lineages of S. aureus. Whilst individual STAR element loci were demonstrated to expand and contract, the sequences associated with each locus were stable and distinct from one another. Conclusions The high degree of lineage and locus-specific conservation of these intergenic repeat regions suggests that STAR elements are maintained due to selective or molecular forces with some of these elements having an important role in cell physiology. The high prevalence in two of the more virulent staphylococcal species is indicative of a potential role for STAR elements in pathogenesis. PMID:23020678
Genome sequences of five Lactobacillus sp. isolates from traditional Turkish sourdough
USDA-ARS?s Scientific Manuscript database
A high level of variation in microflora can be observed in lactic acid bacteria (LAB) profiles of sourdoughs. Here, we present draft genome sequences of Lactobacillus reuteri E81, L. reuteri LR5A, L. rhamnosus LR2, L. plantarum PFC-311 and a novel Lactobacillus sp. PFC-70 isolated from traditional T...
Complete Genome Sequences of Four Isolates of Plutella xylostella Granulovirus.
Spence, Robert J; Noune, Christopher; Hauxwell, Caroline
2016-06-30
Granuloviruses are widespread pathogens of Plutella xylostella L. (diamondback moth) and potential biopesticides for control of this global insect pest. We report the complete genomes of four Plutella xylostella granulovirus isolates from China, Malaysia, and Taiwan exhibiting pairs of noncoding, homologous repeat regions with significant sequence variation but equivalent length. Copyright © 2016 Spence et al.
PopHuman: the human population genomics browser.
Casillas, Sònia; Mulet, Roger; Villegas-Mirón, Pablo; Hervas, Sergi; Sanz, Esteve; Velasco, Daniel; Bertranpetit, Jaume; Laayouni, Hafid; Barbadilla, Antonio
2018-01-04
The 1000 Genomes Project (1000GP) represents the most comprehensive world-wide nucleotide variation data set so far in humans, providing the sequencing and analysis of 2504 genomes from 26 populations and reporting >84 million variants. The availability of this sequence data provides the human lineage with an invaluable resource for population genomics studies, allowing the testing of molecular population genetics hypotheses and eventually the understanding of the evolutionary dynamics of genetic variation in human populations. Here we present PopHuman, a new population genomics-oriented genome browser based on JBrowse that allows the interactive visualization and retrieval of an extensive inventory of population genetics metrics. Efficient and reliable parameter estimates have been computed using a novel pipeline that faces the unique features and limitations of the 1000GP data, and include a battery of nucleotide variation measures, divergence and linkage disequilibrium parameters, as well as different tests of neutrality, estimated in non-overlapping windows along the chromosomes and in annotated genes for all 26 populations of the 1000GP. PopHuman is open and freely available at http://pophuman.uab.cat. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Wu, Ying; Liu, Fang; Yang, Dai-Gang; Li, Wei; Zhou, Xiao-Jian; Pei, Xiao-Yu; Liu, Yan-Gai; He, Kun-Lun; Zhang, Wen-Sheng; Ren, Zhong-Ying; Zhou, Ke-Hai; Ma, Xiong-Feng; Li, Zhong-Hu
2018-01-01
Cotton is one of the most economically important fiber crop plants worldwide. The genus Gossypium contains a single allotetraploid group (AD) and eight diploid genome groups (A–G and K). However, the evolution of repeat sequences in the chloroplast genomes and the phylogenetic relationships of Gossypium species are unclear. Thus, we determined the variations in the repeat sequences and the evolutionary relationships of 40 cotton chloroplast genomes, which represented the most diverse in the genus, including five newly sequenced diploid species, i.e., G. nandewarense (C1-n), G. armourianum (D2-1), G. lobatum (D7), G. trilobum (D8), and G. schwendimanii (D11), and an important semi-wild race of upland cotton, G. hirsutum race latifolium (AD1). The genome structure, gene order, and GC content of cotton species were similar to those of other higher plant plastid genomes. In total, 2860 long sequence repeats (>10 bp in length) were identified, where the F-genome species had the largest number of repeats (G. longicalyx F1: 108) and E-genome species had the lowest (G. stocksii E1: 53). Large-scale repeat sequences possibly enrich the genetic information and maintain genome stability in cotton species. We also identified 10 divergence hotspot regions, i.e., rpl33-rps18, psbZ-trnG (GCC), rps4-trnT (UGU), trnL (UAG)-rpl32, trnE (UUC)-trnT (GGU), atpE, ndhI, rps2, ycf1, and ndhF, which could be useful molecular genetic markers for future population genetics and phylogenetic studies. Site-specific selection analysis showed that some of the coding sites of 10 chloroplast genes (atpB, atpE, rps2, rps3, petB, petD, ccsA, cemA, ycf1, and rbcL) were under protein sequence evolution. Phylogenetic analysis based on the whole plastomes suggested that the Gossypium species grouped into six previously identified genetic clades. Interestingly, all 13 D-genome species clustered into a strong monophyletic clade. Unexpectedly, the cotton species with C, G, and K-genomes were admixed and nested in a large clade, which could have been due to their recent radiation, incomplete lineage sorting, and introgression hybridization among different cotton lineages. In conclusion, the results of this study provide new insights into the evolution of repeat sequences in chloroplast genomes and interspecific relationships in the genus Gossypium. PMID:29619041
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
The complete mitochondrial genome sequence of Malus hupehensis var. pinyiensis.
Duan, Naibin; Sun, Honghe; Wang, Nan; Fei, Zhangjun; Chen, Xuesen
2016-07-01
The complete mitochondrial genome sequence of Malus hupehensis var. pinyiensis, a widely used apple rootstock, was determined using the Illumina high-throughput sequencing approach. The genome is 422,555 bp in length and has a GC content of 45.21%. It is separated by a pair of inverted repeats of 32,504 bp, to form a large single copy region of 213,055 bp and a small single copy region of 144,492 bp. The genome contains 38 protein-coding genes, four pseudogenes, 25 tRNA genes, and three rRNA genes. The genome is 25,608 bp longer than that of M. domestica, and several structural variations between these two mitogenomes were detected.
Haimovich, Adrian D.; Muir, Paul; Isaacs, Farren J.
2016-01-01
Next-generation DNA sequencing has revealed the complete genome sequences of numerous organisms, establishing a fundamental and growing understanding of genetic variation and phenotypic diversity. Engineering at the gene, network and whole-genome scale aims to introduce targeted genetic changes both to explore emergent phenotypes and to introduce new functionalities. Expansion of these approaches into massively parallel platforms establishes the ability to generate targeted genome modifications, elucidating causal links between genotype and phenotype, as well as the ability to design and reprogramme organisms. In this Review, we explore techniques and applications in genome engineering, outlining key advances and defining challenges. PMID:26260262
Chen, Z; Nie, H; Grover, C E; Wang, Y; Li, P; Wang, M; Pei, H; Zhao, Y; Li, S; Wendel, J F; Hua, J
2017-05-01
Cotton (Gossypium spp.) is commonly grouped into eight diploid genomic groups, designated A-G and K, and an allotetraploid genomic group, AD. Gossypium raimondii (D 5 ) and G. arboreum (A 2 ) are the putative contributors to the progenitor of G. hirsutum (AD 1 ), the economically important fibre-producing cotton species. Mitochondrial DNA from week-old etiolated seedlings was extracted from isolated organelles using discontinuous sucrose density gradient method. Mitochondrial genomes were sequenced, assembled, annotated and analysed in orderly. Gossypium raimondii (D 5 ) and G. arboreum (A 2 ) mitochondrial genomes were provided in this study. The mitochondrial genomes of two diploid species harboured circular genome of 643,914 bp (D 5 ) and 687,482 bp (A 2 ), respectively. They differ in size and number of repeat sequences, both contain illuminating triplicate sequences with 7317 and 10,246 bp, respectively, demonstrating dynamic difference and rearranged genome organisations. Comparing the D 5 and A 2 mitogenomes with mitogenomes of tetraploid Gossypium species (AD 1 , G. hirsutum; AD 2 , G. barbadense), a shared 11 kbp fragment loss was detected in allotetraploid species, three regions shared by G. arboreum (A 2 ), G. hirsutum (AD 1 ) and G. barbadense (AD 2 ), while eight regions were specific to G. raimondii (D 5 ). The presence/absence variations and gene-based phylogeny supported that A-genome is a cytoplasmic donor to the progenitor of allotetraploid species G. hirsutum and G. barbadense. The results present structure variations and phylogeny of Gossypium mitochondrial genome evolution. © 2017 The Authors. Plant Biology published by John Wiley & Sons Ltd on behalf of German Botanical Society, Royal Dutch Botanical Society.
A physical map of the bovine genome
Snelling, Warren M; Chiu, Readman; Schein, Jacqueline E; Hobbs, Matthew; Abbey, Colette A; Adelson, David L; Aerts, Jan; Bennett, Gary L; Bosdet, Ian E; Boussaha, Mekki; Brauning, Rudiger; Caetano, Alexandre R; Costa, Marcos M; Crawford, Allan M; Dalrymple, Brian P; Eggen, André; Everts-van der Wind, Annelie; Floriot, Sandrine; Gautier, Mathieu; Gill, Clare A; Green, Ronnie D; Holt, Robert; Jann, Oliver; Jones, Steven JM; Kappes, Steven M; Keele, John W; de Jong, Pieter J; Larkin, Denis M; Lewin, Harris A; McEwan, John C; McKay, Stephanie; Marra, Marco A; Mathewson, Carrie A; Matukumalli, Lakshmi K; Moore, Stephen S; Murdoch, Brenda; Nicholas, Frank W; Osoegawa, Kazutoyo; Roy, Alice; Salih, Hanni; Schibler, Laurent; Schnabel, Robert D; Silveri, Licia; Skow, Loren C; Smith, Timothy PL; Sonstegard, Tad S; Taylor, Jeremy F; Tellam, Ross; Van Tassell, Curtis P; Williams, John L; Womack, James E; Wye, Natasja H; Yang, George; Zhao, Shaying
2007-01-01
Background Cattle are important agriculturally and relevant as a model organism. Previously described genetic and radiation hybrid (RH) maps of the bovine genome have been used to identify genomic regions and genes affecting specific traits. Application of these maps to identify influential genetic polymorphisms will be enhanced by integration with each other and with bacterial artificial chromosome (BAC) libraries. The BAC libraries and clone maps are essential for the hybrid clone-by-clone/whole-genome shotgun sequencing approach taken by the bovine genome sequencing project. Results A bovine BAC map was constructed with HindIII restriction digest fragments of 290,797 BAC clones from animals of three different breeds. Comparative mapping of 422,522 BAC end sequences assisted with BAC map ordering and assembly. Genotypes and pedigree from two genetic maps and marker scores from three whole-genome RH panels were consolidated on a 17,254-marker composite map. Sequence similarity allowed integrating the BAC and composite maps with the bovine draft assembly (Btau3.1), establishing a comprehensive resource describing the bovine genome. Agreement between the marker and BAC maps and the draft assembly is high, although discrepancies exist. The composite and BAC maps are more similar than either is to the draft assembly. Conclusion Further refinement of the maps and greater integration into the genome assembly process may contribute to a high quality assembly. The maps provide resources to associate phenotypic variation with underlying genomic variation, and are crucial resources for understanding the biology underpinning this important ruminant species so closely associated with humans. PMID:17697342
Dumas, Laura; Dickens, C Michael; Anderson, Nathan; Davis, Jonathan; Bennett, Beth; Radcliffe, Richard A; Sikela, James M
2014-06-01
It has been well documented that genetic factors can influence predisposition to develop alcoholism. While the underlying genomic changes may be of several types, two of the most common and disease associated are copy number variations (CNVs) and sequence alterations of protein coding regions. The goal of this study was to identify CNVs and single-nucleotide polymorphisms that occur in gene coding regions that may play a role in influencing the risk of an individual developing alcoholism. Toward this end, two mouse strains were used that have been selectively bred based on their differential sensitivity to alcohol: the Inbred long sleep (ILS) and Inbred short sleep (ISS) mouse strains. Differences in initial response to alcohol have been linked to risk for alcoholism, and the ILS/ISS strains are used to investigate the genetics of initial sensitivity to alcohol. Array comparative genomic hybridization (arrayCGH) and exome sequencing were conducted to identify CNVs and gene coding sequence differences, respectively, between ILS and ISS mice. Mouse arrayCGH was performed using catalog Agilent 1 × 244 k mouse arrays. Subsequently, exome sequencing was carried out using an Illumina HiSeq 2000 instrument. ArrayCGH detected 74 CNVs that were strain-specific (38 ILS/36 ISS), including several ISS-specific deletions that contained genes implicated in brain function and neurotransmitter release. Among several interesting coding variations detected by exome sequencing was the gain of a premature stop codon in the alpha-amylase 2B (AMY2B) gene specifically in the ILS strain. In total, exome sequencing detected 2,597 and 1,768 strain-specific exonic gene variants in the ILS and ISS mice, respectively. This study represents the most comprehensive and detailed genomic comparison of ILS and ISS mouse strains to date. The two complementary genome-wide approaches identified strain-specific CNVs and gene coding sequence variations that should provide strong candidates to contribute to the alcohol-related phenotypic differences associated with these strains.
Reliable Detection of Herpes Simplex Virus Sequence Variation by High-Throughput Resequencing.
Morse, Alison M; Calabro, Kaitlyn R; Fear, Justin M; Bloom, David C; McIntyre, Lauren M
2017-08-16
High-throughput sequencing (HTS) has resulted in data for a number of herpes simplex virus (HSV) laboratory strains and clinical isolates. The knowledge of these sequences has been critical for investigating viral pathogenicity. However, the assembly of complete herpesviral genomes, including HSV, is complicated due to the existence of large repeat regions and arrays of smaller reiterated sequences that are commonly found in these genomes. In addition, the inherent genetic variation in populations of isolates for viruses and other microorganisms presents an additional challenge to many existing HTS sequence assembly pipelines. Here, we evaluate two approaches for the identification of genetic variants in HSV1 strains using Illumina short read sequencing data. The first, a reference-based approach, identifies variants from reads aligned to a reference sequence and the second, a de novo assembly approach, identifies variants from reads aligned to de novo assembled consensus sequences. Of critical importance for both approaches is the reduction in the number of low complexity regions through the construction of a non-redundant reference genome. We compared variants identified in the two methods. Our results indicate that approximately 85% of variants are identified regardless of the approach. The reference-based approach to variant discovery captures an additional 15% representing variants divergent from the HSV1 reference possibly due to viral passage. Reference-based approaches are significantly less labor-intensive and identify variants across the genome where de novo assembly-based approaches are limited to regions where contigs have been successfully assembled. In addition, regions of poor quality assembly can lead to false variant identification in de novo consensus sequences. For viruses with a well-assembled reference genome, a reference-based approach is recommended.
Guimaraes, Ana M S; Toth, Balazs; Santos, Andrea P; do Nascimento, Naíla C; Kritchevsky, Janice E; Messick, Joanne B
2012-11-01
We report the complete genome sequence of "Candidatus Mycoplasma haemolamae," an endemic red-cell pathogen of camelids. The single, circular chromosome has 756,845 bp, a 39.3% G+C content, and 925 coding sequences (CDSs). A great proportion (49.1%) of these CDSs are organized into paralogous gene families, which can now be further explored with regard to antigenic variation.
Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns
2007-01-01
We have converted genome-encoded protein sequences into musical notes to reveal auditory patterns without compromising musicality. We derived a reduced range of 13 base notes by pairing similar amino acids and distinguishing them using variations of three-note chords and codon distribution to dictate rhythm. The conversion will help make genomic coding sequences more approachable for the general public, young children, and vision-impaired scientists. PMID:17477882
Sequencing thousands of single-cell genomes with combinatorial indexing.
Vitak, Sarah A; Torkenczy, Kristof A; Rosenkrantz, Jimi L; Fields, Andrew J; Christiansen, Lena; Wong, Melissa H; Carbone, Lucia; Steemers, Frank J; Adey, Andrew
2017-03-01
Single-cell genome sequencing has proven valuable for the detection of somatic variation, particularly in the context of tumor evolution. Current technologies suffer from high library construction costs, which restrict the number of cells that can be assessed and thus impose limitations on the ability to measure heterogeneity within a tissue. Here, we present single-cell combinatorial indexed sequencing (SCI-seq) as a means of simultaneously generating thousands of low-pass single-cell libraries for detection of somatic copy-number variants. We constructed libraries for 16,698 single cells from a combination of cultured cell lines, primate frontal cortex tissue and two human adenocarcinomas, and obtained a detailed assessment of subclonal variation within a pancreatic tumor.
Engel, Stacia R.; Cherry, J. Michael
2013-01-01
The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186
Wang, Edwin; Zou, Jinfeng; Zaman, Naif; Beitel, Lenore K; Trifiro, Mark; Paliouras, Miltiadis
2013-08-01
Recent tumor genome sequencing confirmed that one tumor often consists of multiple cell subpopulations (clones) which bear different, but related, genetic profiles such as mutation and copy number variation profiles. Thus far, one tumor has been viewed as a whole entity in cancer functional studies. With the advances of genome sequencing and computational analysis, we are able to quantify and computationally dissect clones from tumors, and then conduct clone-based analysis. Emerging technologies such as single-cell genome sequencing and RNA-Seq could profile tumor clones. Thus, we should reconsider how to conduct cancer systems biology studies in the genome sequencing era. We will outline new directions for conducting cancer systems biology by considering that genome sequencing technology can be used for dissecting, quantifying and genetically characterizing clones from tumors. Topics discussed in Part 1 of this review include computationally quantifying of tumor subpopulations; clone-based network modeling, cancer hallmark-based networks and their high-order rewiring principles and the principles of cell survival networks of fast-growing clones. Crown Copyright © 2013. Published by Elsevier Ltd. All rights reserved.
USDA-ARS?s Scientific Manuscript database
Single nucleotide polymorphisms (SNPs) are the most abundant DNA sequence variation in the genomes which can be used to associate genotypic variation to the phenotype. Therefore, availability of a high-density SNP array with uniform genome coverage can advance genetic studies and breeding applicatio...
Murat, Claude; Zampieri, Elisa; Vallino, Marta; Daghino, Stefania; Perotto, Silvia; Bonfante, Paola
2011-05-01
Characterization of genomic variation among different microbial species, or different strains of the same species, is a field of significant interest with a wide range of potential applications. We have investigated the genomic variation in mycorrhizal fungal genomes through genomic suppressive subtractive hybridization. The comparison was between phylogenetically distant and close truffle species (Tuber spp.), and between isolates of the ericoid mycorrhizal fungus Oidiodendron maius featuring different degrees of metal tolerance. In the interspecies experiment, almost all the sequences that were identified in the Tuber melanosporum genome and absent in Tuber borchii and Tuber indicum corresponded to transposable elements. In the intraspecies comparison, some specific sequences corresponded to regions coding for enzymes, among them a glutathione synthetase known to be involved in metal tolerance. This approach is a quick and rather inexpensive tool to develop molecular markers for mycorrhizal fungi tracking and barcoding, to identify functional genes and to investigate the genome plasticity, adaptation and evolution. © 2011 Federation of European Microbiological Societies. Published by Blackwell Publishing Ltd. All rights reserved.
Genome-wide association as a means to understanding the mammary gland
USDA-ARS?s Scientific Manuscript database
Next-generation sequencing and related technologies have facilitated the creation of enormous public databases that catalogue genomic variation. These databases have facilitated a variety of approaches to discover new genes that regulate normal biology as well as disease. Genome wide association (...
Horizontal gene transfer of chromosomal Type II toxin-antitoxin systems of Escherichia coli.
Ramisetty, Bhaskar Chandra Mohan; Santhosh, Ramachandran Sarojini
2016-02-01
Type II toxin-antitoxin systems (TAs) are small autoregulated bicistronic operons that encode a toxin protein with the potential to inhibit metabolic processes and an antitoxin protein to neutralize the toxin. Most of the bacterial genomes encode multiple TAs. However, the diversity and accumulation of TAs on bacterial genomes and its physiological implications are highly debated. Here we provide evidence that Escherichia coli chromosomal TAs (encoding RNase toxins) are 'acquired' DNA likely originated from heterologous DNA and are the smallest known autoregulated operons with the potential for horizontal propagation. Sequence analyses revealed that integration of TAs into the bacterial genome is unique and contributes to variations in the coding and/or regulatory regions of flanking host genome sequences. Plasmids and genomes encoding identical TAs of natural isolates are mutually exclusive. Chromosomal TAs might play significant roles in the evolution and ecology of bacteria by contributing to host genome variation and by moderation of plasmid maintenance. © FEMS 2015. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Sakai, Hiroaki; Kanamori, Hiroyuki; Arai-Kichise, Yuko; Shibata-Hatta, Mari; Ebana, Kaworu; Oono, Youko; Kurita, Kanako; Fujisawa, Hiroko; Katagiri, Satoshi; Mukai, Yoshiyuki; Hamada, Masao; Itoh, Takeshi; Matsumoto, Takashi; Katayose, Yuichi; Wakasa, Kyo; Yano, Masahiro; Wu, Jianzhong
2014-01-01
Having a deep genetic structure evolved during its domestication and adaptation, the Asian cultivated rice (Oryza sativa) displays considerable physiological and morphological variations. Here, we describe deep whole-genome sequencing of the aus rice cultivar Kasalath by using the advanced next-generation sequencing (NGS) technologies to gain a better understanding of the sequence and structural changes among highly differentiated cultivars. The de novo assembled Kasalath sequences represented 91.1% (330.55 Mb) of the genome and contained 35 139 expressed loci annotated by RNA-Seq analysis. We detected 2 787 250 single-nucleotide polymorphisms (SNPs) and 7393 large insertion/deletion (indel) sites (>100 bp) between Kasalath and Nipponbare, and 2 216 251 SNPs and 3780 large indels between Kasalath and 93-11. Extensive comparison of the gene contents among these cultivars revealed similar rates of gene gain and loss. We detected at least 7.39 Mb of inserted sequences and 40.75 Mb of unmapped sequences in the Kasalath genome in comparison with the Nipponbare reference genome. Mapping of the publicly available NGS short reads from 50 rice accessions proved the necessity and the value of using the Kasalath whole-genome sequence as an additional reference to capture the sequence polymorphisms that cannot be discovered by using the Nipponbare sequence alone. PMID:24578372
Ebolavirus comparative genomics
Jun, Se-Ran; Leuze, Michael R.; Nookaew, Intawat; ...
2015-07-14
The 2014 Ebola outbreak in West Africa is the largest documented for this virus. We examine the dynamics of this genome, comparing more than one hundred currently available ebolavirus genomes to each other and to other viral genomes. Based on oligomer frequency analysis, the family Filoviridae forms a distinct group from all other sequenced viral genomes. All filovirus genomes sequenced to date encode proteins with similar functions and gene order, although there is considerable divergence in sequences between the three genera Ebolavirus, Cuevavirus, and Marburgvirus within the family Filoviridae. Whereas all ebolavirus genomes are quite similar (multiple sequences of themore » same strain are often identical), variation is most common in the intergenic regions and within specific areas of the genes encoding the glycoprotein (GP), nucleoprotein (NP), and polymerase (L). We predict regions that could contain epitope-binding sites, which might be good vaccine targets. In conclusion, this information, combined with glycosylation sites and experimentally determined epitopes, can identify the most promising regions for the development of therapeutic strategies.« less
Genomic Definition of Hypervirulent and Multidrug-Resistant Klebsiella pneumoniae Clonal Groups
Bialek-Davenet, Suzanne; Criscuolo, Alexis; Ailloud, Florent; Passet, Virginie; Jones, Louis; Delannoy-Vieillard, Anne-Sophie; Garin, Benoit; Le Hello, Simon; Arlet, Guillaume; Nicolas-Chanoine, Marie-Hélène; Decré, Dominique
2014-01-01
Multidrug-resistant and highly virulent Klebsiella pneumoniae isolates are emerging, but the clonal groups (CGs) corresponding to these high-risk strains have remained imprecisely defined. We aimed to identify K. pneumoniae CGs on the basis of genome-wide sequence variation and to provide a simple bioinformatics tool to extract virulence and resistance gene data from genomic data. We sequenced 48 K. pneumoniae isolates, mostly of serotypes K1 and K2, and compared the genomes with 119 publicly available genomes. A total of 694 highly conserved genes were included in a core-genome multilocus sequence typing scheme, and cluster analysis of the data enabled precise definition of globally distributed hypervirulent and multidrug-resistant CGs. In addition, we created a freely accessible database, BIGSdb-Kp, to enable rapid extraction of medically and epidemiologically relevant information from genomic sequences of K. pneumoniae. Although drug-resistant and virulent K. pneumoniae populations were largely nonoverlapping, isolates with combined virulence and resistance features were detected. PMID:25341126
Liu, Tian-Jia; Li, Yong-Ping; Zhou, Jing-Jing; Hu, Chun-Gen; Zhang, Jin-Zhi
2018-03-01
The comprehensive genetic variation of two citrus species were analyzed at genome and transcriptome level. A total of 1090 differentially expressed genes were found during fruit development by RNA-sequencing. Fruit size (fruit equatorial diameter) and weight (fresh weight) are the two most important components determining yield and consumer acceptability for many horticultural crops. However, little is known about the genetic control of these traits. Here, we performed whole-genome resequencing to reveal the comprehensive genetic variation of the fruit development between kumquat (Citrus japonica) and Clementine mandarin (Citrus clementina). In total, 5,865,235 single-nucleotide polymorphisms (SNPs) and 414,447 insertions/deletions (InDels) were identified in the two citrus species. Based on integrative analysis of genome and transcriptome of fruit, 640,801 SNPs and 20,733 InDels were identified. The features, genomic distribution, functional effect, and other characteristics of these genetic variations were explored. RNA-sequencing identified 1090 differentially expressed genes (DEGs) during fruit development of kumquat and Clementine mandarin. Gene Ontology revealed that these genes were involved in various molecular functional and biological processes. In addition, the genetic variation of 939 DEGs and 74 multiple fruit development pathway genes from previous reports were also identified. A global survey identified 24,237 specific alternative splicing events in the two citrus species and showed that intron retention is the most prevalent pattern of alternative splicing. These genome variation data provide a foundation for further exploration of citrus diversity and gene-phenotype relationships and for future research on molecular breeding to improve kumquat, Clementine mandarin and related species.
Wallberg, Andreas; Han, Fan; Wellhagen, Gustaf; Dahle, Bjørn; Kawata, Masakado; Haddad, Nizar; Simões, Zilá Luz Paulino; Allsopp, Mike H; Kandemir, Irfan; De la Rúa, Pilar; Pirk, Christian W; Webster, Matthew T
2014-10-01
The honeybee Apis mellifera has major ecological and economic importance. We analyze patterns of genetic variation at 8.3 million SNPs, identified by sequencing 140 honeybee genomes from a worldwide sample of 14 populations at a combined total depth of 634×. These data provide insight into the evolutionary history and genetic basis of local adaptation in this species. We find evidence that population sizes have fluctuated greatly, mirroring historical fluctuations in climate, although contemporary populations have high genetic diversity, indicating the absence of domestication bottlenecks. Levels of genetic variation are strongly shaped by natural selection and are highly correlated with patterns of gene expression and DNA methylation. We identify genomic signatures of local adaptation, which are enriched in genes expressed in workers and in immune system- and sperm motility-related genes that might underlie geographic variation in reproduction, dispersal and disease resistance. This study provides a framework for future investigations into responses to pathogens and climate change in honeybees.
Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project
Horton, Roger; Gibson, Richard; Coggill, Penny; Miretti, Marcos; Allcock, Richard J.; Almeida, Jeff; Forbes, Simon; Gilbert, James G. R.; Halls, Karen; Harrow, Jennifer L.; Hart, Elizabeth; Howe, Kevin; Jackson, David K.; Palmer, Sophie; Roberts, Anne N.; Sims, Sarah; Stewart, C. Andrew; Traherne, James A.; Trevanion, Steve; Wilming, Laurens; Rogers, Jane; de Jong, Pieter J.; Elliott, John F.; Sawcer, Stephen; Todd, John A.; Trowsdale, John
2008-01-01
The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine. PMID:18193213
Cryptosporidium as a testbed for single cell genome characterization of unicellular eukaryotes.
Troell, Karin; Hallström, Björn; Divne, Anna-Maria; Alsmark, Cecilia; Arrighi, Romanico; Huss, Mikael; Beser, Jessica; Bertilsson, Stefan
2016-06-23
Infectious disease involving multiple genetically distinct populations of pathogens is frequently concurrent, but difficult to detect or describe with current routine methodology. Cryptosporidium sp. is a widespread gastrointestinal protozoan of global significance in both animals and humans. It cannot be easily maintained in culture and infections of multiple strains have been reported. To explore the potential use of single cell genomics methodology for revealing genome-level variation in clinical samples from Cryptosporidium-infected hosts, we sorted individual oocysts for subsequent genome amplification and full-genome sequencing. Cells were identified with fluorescent antibodies with an 80 % success rate for the entire single cell genomics workflow, demonstrating that the methodology can be applied directly to purified fecal samples. Ten amplified genomes from sorted single cells were selected for genome sequencing and compared both to the original population and a reference genome in order to evaluate the accuracy and performance of the method. Single cell genome coverage was on average 81 % even with a moderate sequencing effort and by combining the 10 single cell genomes, the full genome was accounted for. By a comparison to the original sample, biological variation could be distinguished and separated from noise introduced in the amplification. As a proof of principle, we have demonstrated the power of applying single cell genomics to dissect infectious disease caused by closely related parasite species or subtypes. The workflow can easily be expanded and adapted to target other protozoans, and potential applications include mapping genome-encoded traits, virulence, pathogenicity, host specificity and resistance at the level of cells as truly meaningful biological units.
MSeq-CNV: accurate detection of Copy Number Variation from Sequencing of Multiple samples.
Malekpour, Seyed Amir; Pezeshk, Hamid; Sadeghi, Mehdi
2018-03-05
Currently a few tools are capable of detecting genome-wide Copy Number Variations (CNVs) based on sequencing of multiple samples. Although aberrations in mate pair insertion sizes provide additional hints for the CNV detection based on multiple samples, the majority of the current tools rely only on the depth of coverage. Here, we propose a new algorithm (MSeq-CNV) which allows detecting common CNVs across multiple samples. MSeq-CNV applies a mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes. Each component in this mixture density applies a Binomial distribution for modeling the number of mate pairs with aberration in the insertion size and also a Poisson distribution for emitting the read counts, in each genomic position. MSeq-CNV is applied on simulated data and also on real data of six HapMap individuals with high-coverage sequencing, in 1000 Genomes Project. These individuals include a CEU trio of European ancestry and a YRI trio of Nigerian ethnicity. Ancestry of these individuals is studied by clustering the identified CNVs. MSeq-CNV is also applied for detecting CNVs in two samples with low-coverage sequencing in 1000 Genomes Project and six samples form the Simons Genome Diversity Project.
van der Ley, P
1988-11-01
Gonococci express a family of related outer membrane proteins designated protein II (P.II). These surface proteins are subject to both phase variation and antigenic variation. The P.II gene repertoire of Neisseria gonorrhoeae strain JS3 was found to consist of at least ten genes, eight of which were cloned. Sequence analysis and DNA hybridization studies revealed that one particular P.II-encoding sequence is present in three distinct, but almost identical, copies in the JS3 genome. These genes encode the P.II protein that was previously identified as P.IIc. Comparison of their sequences shows that the multiple copies of this P.IIc-encoding gene might have been generated by both gene conversion and gene duplication.
Limited Variation in BK Virus T-Cell Epitopes Revealed by Next-Generation Sequencing
Sahoo, Malaya K.; Tan, Susanna K.; Chen, Sharon F.; Kapusinszky, Beatrix; Concepcion, Katherine R.; Kjelson, Lynn; Mallempati, Kalyan; Farina, Heidi M.; Fernández-Viña, Marcelo; Tyan, Dolly; Grimm, Paul C.; Anderson, Matthew W.; Concepcion, Waldo
2015-01-01
BK virus (BKV) infection causing end-organ disease remains a formidable challenge to the hematopoietic cell transplant (HCT) and kidney transplant fields. As BKV-specific treatments are limited, immunologic-based therapies may be a promising and novel therapeutic option for transplant recipients with persistent BKV infection. Here, we describe a whole-genome, deep-sequencing methodology and bioinformatics pipeline that identify BKV variants across the genome and at BKV-specific HLA-A2-, HLA-B0702-, and HLA-B08-restricted CD8 T-cell epitopes. BKV whole genomes were amplified using long-range PCR with four inverse primer sets, and fragmentation libraries were sequenced on the Ion Torrent Personal Genome Machine (PGM). An error model and variant-calling algorithm were developed to accurately identify rare variants. A total of 65 samples from 18 pediatric HCT and kidney recipients with quantifiable BKV DNAemia underwent whole-genome sequencing. Limited genetic variation was observed. The median number of amino acid variants identified per sample was 8 (range, 2 to 37; interquartile range, 10), with the majority of variants (77%) detected at a frequency of <5%. When normalized for length, there was no statistical difference in the median number of variants across all genes. Similarly, the predominant virus population within samples harbored T-cell epitopes similar to the reference BKV strain that was matched for the BKV genotype. Despite the conservation of epitopes, low-level variants in T-cell epitopes were detected in 77.7% (14/18) of patients. Understanding epitope variation across the whole genome provides insight into the virus-immune interface and may help guide the development of protocols for novel immunologic-based therapies. PMID:26202116
van den Broek, M; Bolat, I; Nijkamp, J F; Ramos, E; Luttik, M A H; Koopman, F; Geertman, J M; de Ridder, D; Pronk, J T; Daran, J-M
2015-09-01
Lager brewing strains of Saccharomyces pastorianus are natural interspecific hybrids originating from the spontaneous hybridization of Saccharomyces cerevisiae and Saccharomyces eubayanus. Over the past 500 years, S. pastorianus has been domesticated to become one of the most important industrial microorganisms. Production of lager-type beers requires a set of essential phenotypes, including the ability to ferment maltose and maltotriose at low temperature, the production of flavors and aromas, and the ability to flocculate. Understanding of the molecular basis of complex brewing-related phenotypic traits is a prerequisite for rational strain improvement. While genome sequences have been reported, the variability and dynamics of S. pastorianus genomes have not been investigated in detail. Here, using deep sequencing and chromosome copy number analysis, we showed that S. pastorianus strain CBS1483 exhibited extensive aneuploidy. This was confirmed by quantitative PCR and by flow cytometry. As a direct consequence of this aneuploidy, a massive number of sequence variants was identified, leading to at least 1,800 additional protein variants in S. pastorianus CBS1483. Analysis of eight additional S. pastorianus strains revealed that the previously defined group I strains showed comparable karyotypes, while group II strains showed large interstrain karyotypic variability. Comparison of three strains with nearly identical genome sequences revealed substantial chromosome copy number variation, which may contribute to strain-specific phenotypic traits. The observed variability of lager yeast genomes demonstrates that systematic linking of genotype to phenotype requires a three-dimensional genome analysis encompassing physical chromosomal structures, the copy number of individual chromosomes or chromosomal regions, and the allelic variation of copies of individual genes. Copyright © 2015, van den Broek et al.
van den Broek, M.; Bolat, I.; Nijkamp, J. F.; Ramos, E.; Luttik, M. A. H.; Koopman, F.; Geertman, J. M.; de Ridder, D.; Pronk, J. T.
2015-01-01
Lager brewing strains of Saccharomyces pastorianus are natural interspecific hybrids originating from the spontaneous hybridization of Saccharomyces cerevisiae and Saccharomyces eubayanus. Over the past 500 years, S. pastorianus has been domesticated to become one of the most important industrial microorganisms. Production of lager-type beers requires a set of essential phenotypes, including the ability to ferment maltose and maltotriose at low temperature, the production of flavors and aromas, and the ability to flocculate. Understanding of the molecular basis of complex brewing-related phenotypic traits is a prerequisite for rational strain improvement. While genome sequences have been reported, the variability and dynamics of S. pastorianus genomes have not been investigated in detail. Here, using deep sequencing and chromosome copy number analysis, we showed that S. pastorianus strain CBS1483 exhibited extensive aneuploidy. This was confirmed by quantitative PCR and by flow cytometry. As a direct consequence of this aneuploidy, a massive number of sequence variants was identified, leading to at least 1,800 additional protein variants in S. pastorianus CBS1483. Analysis of eight additional S. pastorianus strains revealed that the previously defined group I strains showed comparable karyotypes, while group II strains showed large interstrain karyotypic variability. Comparison of three strains with nearly identical genome sequences revealed substantial chromosome copy number variation, which may contribute to strain-specific phenotypic traits. The observed variability of lager yeast genomes demonstrates that systematic linking of genotype to phenotype requires a three-dimensional genome analysis encompassing physical chromosomal structures, the copy number of individual chromosomes or chromosomal regions, and the allelic variation of copies of individual genes. PMID:26150454
Rapid diversification of five Oryza AA genomes associated with rice adaptation.
Zhang, Qun-Jie; Zhu, Ting; Xia, En-Hua; Shi, Chao; Liu, Yun-Long; Zhang, Yun; Liu, Yuan; Jiang, Wen-Kai; Zhao, You-Jie; Mao, Shu-Yan; Zhang, Li-Ping; Huang, Hui; Jiao, Jun-Ying; Xu, Ping-Zhen; Yao, Qiu-Yang; Zeng, Fan-Chun; Yang, Li-Li; Gao, Ju; Tao, Da-Yun; Wang, Yue-Ju; Bennetzen, Jeffrey L; Gao, Li-Zhi
2014-11-18
Comparative genomic analyses among closely related species can greatly enhance our understanding of plant gene and genome evolution. We report de novo-assembled AA-genome sequences for Oryza nivara, Oryza glaberrima, Oryza barthii, Oryza glumaepatula, and Oryza meridionalis. Our analyses reveal massive levels of genomic structural variation, including segmental duplication and rapid gene family turnover, with particularly high instability in defense-related genes. We show, on a genomic scale, how lineage-specific expansion or contraction of gene families has led to their morphological and reproductive diversification, thus enlightening the evolutionary process of speciation and adaptation. Despite strong purifying selective pressures on most Oryza genes, we documented a large number of positively selected genes, especially those genes involved in flower development, reproduction, and resistance-related processes. These diversifying genes are expected to have played key roles in adaptations to their ecological niches in Asia, South America, Africa and Australia. Extensive variation in noncoding RNA gene numbers, function enrichment, and rates of sequence divergence might also help account for the different genetic adaptations of these rice species. Collectively, these resources provide new opportunities for evolutionary genomics, numerous insights into recent speciation, a valuable database of functional variation for crop improvement, and tools for efficient conservation of wild rice germplasm.
Rapid diversification of five Oryza AA genomes associated with rice adaptation
Zhang, Qun-Jie; Zhu, Ting; Xia, En-Hua; Shi, Chao; Liu, Yun-Long; Zhang, Yun; Liu, Yuan; Jiang, Wen-Kai; Zhao, You-Jie; Mao, Shu-Yan; Zhang, Li-Ping; Huang, Hui; Jiao, Jun-Ying; Xu, Ping-Zhen; Yao, Qiu-Yang; Zeng, Fan-Chun; Yang, Li-Li; Gao, Ju; Tao, Da-Yun; Wang, Yue-Ju; Bennetzen, Jeffrey L.; Gao, Li-Zhi
2014-01-01
Comparative genomic analyses among closely related species can greatly enhance our understanding of plant gene and genome evolution. We report de novo-assembled AA-genome sequences for Oryza nivara, Oryza glaberrima, Oryza barthii, Oryza glumaepatula, and Oryza meridionalis. Our analyses reveal massive levels of genomic structural variation, including segmental duplication and rapid gene family turnover, with particularly high instability in defense-related genes. We show, on a genomic scale, how lineage-specific expansion or contraction of gene families has led to their morphological and reproductive diversification, thus enlightening the evolutionary process of speciation and adaptation. Despite strong purifying selective pressures on most Oryza genes, we documented a large number of positively selected genes, especially those genes involved in flower development, reproduction, and resistance-related processes. These diversifying genes are expected to have played key roles in adaptations to their ecological niches in Asia, South America, Africa and Australia. Extensive variation in noncoding RNA gene numbers, function enrichment, and rates of sequence divergence might also help account for the different genetic adaptations of these rice species. Collectively, these resources provide new opportunities for evolutionary genomics, numerous insights into recent speciation, a valuable database of functional variation for crop improvement, and tools for efficient conservation of wild rice germplasm. PMID:25368197
Tachibana, Shin-Ichiro; Sullivan, Steven A; Kawai, Satoru; Nakamura, Shota; Kim, Hyunjae R; Goto, Naohisa; Arisue, Nobuko; Palacpac, Nirianne M Q; Honma, Hajime; Yagi, Masanori; Tougan, Takahiro; Katakai, Yuko; Kaneko, Osamu; Mita, Toshihiro; Kita, Kiyoshi; Yasutomi, Yasuhiro; Sutton, Patrick L; Shakhbatyan, Rimma; Horii, Toshihiro; Yasunaga, Teruo; Barnwell, John W; Escalante, Ananias A; Carlton, Jane M; Tanabe, Kazuyuki
2012-09-01
P. cynomolgi, a malaria-causing parasite of Asian Old World monkeys, is the sister taxon of P. vivax, the most prevalent malaria-causing species in humans outside of Africa. Because P. cynomolgi shares many phenotypic, biological and genetic characteristics with P. vivax, we generated draft genome sequences for three P. cynomolgi strains and performed genomic analysis comparing them with the P. vivax genome, as well as with the genome of a third previously sequenced simian parasite, Plasmodium knowlesi. Here, we show that genomes of the monkey malaria clade can be characterized by copy-number variants (CNVs) in multigene families involved in evasion of the human immune system and invasion of host erythrocytes. We identify genome-wide SNPs, microsatellites and CNVs in the P. cynomolgi genome, providing a map of genetic variation that can be used to map parasite traits and study parasite populations. The sequencing of the P. cynomolgi genome is a critical step in developing a model system for P. vivax research and in counteracting the neglect of P. vivax.
Palacios-Flores, Kim; García-Sotelo, Jair; Castillo, Alejandra; Uribe, Carina; Aguilar, Luis; Morales, Lucía; Gómez-Romero, Laura; Reyes, José; Garciarubio, Alejandro; Boege, Margareta; Dávila, Guillermo
2018-01-01
We present a conceptually simple, sensitive, precise, and essentially nonstatistical solution for the analysis of genome variation in haploid organisms. The generation of a Perfect Match Genomic Landscape (PMGL), which computes intergenome identity with single nucleotide resolution, reveals signatures of variation wherever a query genome differs from a reference genome. Such signatures encode the precise location of different types of variants, including single nucleotide variants, deletions, insertions, and amplifications, effectively introducing the concept of a general signature of variation. The precise nature of variants is then resolved through the generation of targeted alignments between specific sets of sequence reads and known regions of the reference genome. Thus, the perfect match logic decouples the identification of the location of variants from the characterization of their nature, providing a unified framework for the detection of genome variation. We assessed the performance of the PMGL strategy via simulation experiments. We determined the variation profiles of natural genomes and of a synthetic chromosome, both in the context of haploid yeast strains. Our approach uncovered variants that have previously escaped detection. Moreover, our strategy is ideally suited for further refining high-quality reference genomes. The source codes for the automated PMGL pipeline have been deposited in a public repository. PMID:29367403
Adaptation of Organisms by Resonance of RNA Transcription with the Cellular Redox Cycle
NASA Technical Reports Server (NTRS)
Stolc, Viktor
2012-01-01
Sequence variation in organisms differs across the genome and the majority of mutations are caused by oxidation, yet its origin is not fully understood. It has also been shown that the reduction-oxidation reaction cycle is the fundamental biochemical cycle that coordinates the timing of all biochemical processes in that cell, including energy production, DNA replication, and RNA transcription. It is shown that the temporal resonance of transcriptome biosynthesis with the oscillating binary state of the reduction-oxidation reaction cycle serves as a basis for non-random sequence variation at specific genome-wide coordinates that change faster than by accumulation of chance mutations. This work demonstrates evidence for a universal, persistent and iterative feedback mechanism between the environment and heredity, whereby acquired variation between cell divisions can outweigh inherited variation.
Effective normalization for copy number variation detection from whole genome sequencing.
Janevski, Angel; Varadan, Vinay; Kamalakaran, Sitharthan; Banerjee, Nilanjana; Dimitrova, Nevenka
2012-01-01
Whole genome sequencing enables a high resolution view of the human genome and provides unique insights into genome structure at an unprecedented scale. There have been a number of tools to infer copy number variation in the genome. These tools, while validated, also include a number of parameters that are configurable to genome data being analyzed. These algorithms allow for normalization to account for individual and population-specific effects on individual genome CNV estimates but the impact of these changes on the estimated CNVs is not well characterized. We evaluate in detail the effect of normalization methodologies in two CNV algorithms FREEC and CNV-seq using whole genome sequencing data from 8 individuals spanning four populations. We apply FREEC and CNV-seq to a sequencing data set consisting of 8 genomes. We use multiple configurations corresponding to different read-count normalization methodologies in FREEC, and statistically characterize the concordance of the CNV calls between FREEC configurations and the analogous output from CNV-seq. The normalization methodologies evaluated in FREEC are: GC content, mappability and control genome. We further stratify the concordance analysis within genic, non-genic, and a collection of validated variant regions. The GC content normalization methodology generates the highest number of altered copy number regions. Both mappability and control genome normalization reduce the total number and length of copy number regions. Mappability normalization yields Jaccard indices in the 0.07 - 0.3 range, whereas using a control genome normalization yields Jaccard index values around 0.4 with normalization based on GC content. The most critical impact of using mappability as a normalization factor is substantial reduction of deletion CNV calls. The output of another method based on control genome normalization, CNV-seq, resulted in comparable CNV call profiles, and substantial agreement in variable gene and CNV region calls. Choice of read-count normalization methodology has a substantial effect on CNV calls and the use of genomic mappability or an appropriately chosen control genome can optimize the output of CNV analysis.
Yap, Kien-Pong; Ho, Wing S; Gan, Han M; Chai, Lay C; Thong, Kwai L
2016-01-01
Typhoid fever, caused by Salmonella enterica serovar Typhi, remains an important public health burden in Southeast Asia and other endemic countries. Various genotyping methods have been applied to study the genetic variations of this human-restricted pathogen. Multilocus sequence typing (MLST) is one of the widely accepted methods, and recently, there is a growing interest in the re-application of MLST in the post-genomic era. In this study, we provide the global MLST distribution of S. Typhi utilizing both publicly available 1,826 S. Typhi genome sequences in addition to performing conventional MLST on S. Typhi strains isolated from various endemic regions spanning over a century. Our global MLST analysis confirms the predominance of two sequence types (ST1 and ST2) co-existing in the endemic regions. Interestingly, S. Typhi strains with ST8 are currently confined within the African continent. Comparative genomic analyses of ST8 and other rare STs with genomes of ST1/ST2 revealed unique mutations in important virulence genes such as flhB, sipC, and tviD that may explain the variations that differentiate between seemingly successful (widespread) and unsuccessful (poor dissemination) S. Typhi populations. Large scale whole-genome phylogeny demonstrated evidence of phylogeographical structuring and showed that ST8 may have diverged from the earlier ancestral population of ST1 and ST2, which later lost some of its fitness advantages, leading to poor worldwide dissemination. In response to the unprecedented increase in genomic data, this study demonstrates and highlights the utility of large-scale genome-based MLST as a quick and effective approach to narrow the scope of in-depth comparative genomic analysis and consequently provide new insights into the fine scale of pathogen evolution and population structure.
ENGINES: exploring single nucleotide variation in entire human genomes.
Amigo, Jorge; Salas, Antonio; Phillips, Christopher
2011-04-19
Next generation ultra-sequencing technologies are starting to produce extensive quantities of data from entire human genome or exome sequences, and therefore new software is needed to present and analyse this vast amount of information. The 1000 Genomes project has recently released raw data for 629 complete genomes representing several human populations through their Phase I interim analysis and, although there are certain public tools available that allow exploration of these genomes, to date there is no tool that permits comprehensive population analysis of the variation catalogued by such data. We have developed a genetic variant site explorer able to retrieve data for Single Nucleotide Variation (SNVs), population by population, from entire genomes without compromising future scalability and agility. ENGINES (ENtire Genome INterface for Exploring SNVs) uses data from the 1000 Genomes Phase I to demonstrate its capacity to handle large amounts of genetic variation (>7.3 billion genotypes and 28 million SNVs), as well as deriving summary statistics of interest for medical and population genetics applications. The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample, while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen. ENGINES is capable of accessing large-scale variation data repositories in a fast and comprehensive manner. It allows quick browsing of whole genome variation, while providing statistical information for each variant site such as allele frequency, heterozygosity or FST values for genetic differentiation. Access to the data mart generating scripts and to the web interface is granted from http://spsmart.cesga.es/engines.php. © 2011 Amigo et al; licensee BioMed Central Ltd.
Nucleotide diversity maps reveal variation in diversity among wheat genomes and chromosomes
USDA-ARS?s Scientific Manuscript database
Technical Abstract: 20-75 CHARACTER LINES A strategy for a genome-wide assessment of nucleotide diversity in a polyploid species must minimize the inclusion of homoeologous sequences into diversity estimates and reliably allocate individual haplotypes into respective genomes. In this study, nucle...
Simmons, Sheri L; Dibartolo, Genevieve; Denef, Vincent J; Goltsman, Daniela S Aliaga; Thelen, Michael P; Banfield, Jillian F
2008-07-22
Deeply sampled community genomic (metagenomic) datasets enable comprehensive analysis of heterogeneity in natural microbial populations. In this study, we used sequence data obtained from the dominant member of a low-diversity natural chemoautotrophic microbial community to determine how coexisting closely related individuals differ from each other in terms of gene sequence and gene content, and to uncover evidence of evolutionary processes that occur over short timescales. DNA sequence obtained from an acid mine drainage biofilm was reconstructed, taking into account the effects of strain variation, to generate a nearly complete genome tiling path for a Leptospirillum group II species closely related to L. ferriphilum (sampling depth approximately 20x). The population is dominated by one sequence type, yet we detected evidence for relatively abundant variants (>99.5% sequence identity to the dominant type) at multiple loci, and a few rare variants. Blocks of other Leptospirillum group II types ( approximately 94% sequence identity) have recombined into one or more variants. Variant blocks of both types are more numerous near the origin of replication. Heterogeneity in genetic potential within the population arises from localized variation in gene content, typically focused in integrated plasmid/phage-like regions. Some laterally transferred gene blocks encode physiologically important genes, including quorum-sensing genes of the LuxIR system. Overall, results suggest inter- and intrapopulation genetic exchange involving distinct parental genome types and implicate gain and loss of phage and plasmid genes in recent evolution of this Leptospirillum group II population. Population genetic analyses of single nucleotide polymorphisms indicate variation between closely related strains is not maintained by positive selection, suggesting that these regions do not represent adaptive differences between strains. Thus, the most likely explanation for the observed patterns of polymorphism is divergence of ancestral strains due to geographic isolation, followed by mixing and subsequent recombination.
Denef, Vincent J; Goltsman, Daniela S. Aliaga; Thelen, Michael P; Banfield, Jillian F
2008-01-01
Deeply sampled community genomic (metagenomic) datasets enable comprehensive analysis of heterogeneity in natural microbial populations. In this study, we used sequence data obtained from the dominant member of a low-diversity natural chemoautotrophic microbial community to determine how coexisting closely related individuals differ from each other in terms of gene sequence and gene content, and to uncover evidence of evolutionary processes that occur over short timescales. DNA sequence obtained from an acid mine drainage biofilm was reconstructed, taking into account the effects of strain variation, to generate a nearly complete genome tiling path for a Leptospirillum group II species closely related to L. ferriphilum (sampling depth ∼20×). The population is dominated by one sequence type, yet we detected evidence for relatively abundant variants (>99.5% sequence identity to the dominant type) at multiple loci, and a few rare variants. Blocks of other Leptospirillum group II types (∼94% sequence identity) have recombined into one or more variants. Variant blocks of both types are more numerous near the origin of replication. Heterogeneity in genetic potential within the population arises from localized variation in gene content, typically focused in integrated plasmid/phage-like regions. Some laterally transferred gene blocks encode physiologically important genes, including quorum-sensing genes of the LuxIR system. Overall, results suggest inter- and intrapopulation genetic exchange involving distinct parental genome types and implicate gain and loss of phage and plasmid genes in recent evolution of this Leptospirillum group II population. Population genetic analyses of single nucleotide polymorphisms indicate variation between closely related strains is not maintained by positive selection, suggesting that these regions do not represent adaptive differences between strains. Thus, the most likely explanation for the observed patterns of polymorphism is divergence of ancestral strains due to geographic isolation, followed by mixing and subsequent recombination. PMID:18651792
HGVS Recommendations for the Description of Sequence Variants: 2016 Update.
den Dunnen, Johan T; Dalgleish, Raymond; Maglott, Donna R; Hart, Reece K; Greenblatt, Marc S; McGowan-Jordan, Jean; Roux, Anne-Francoise; Smith, Timothy; Antonarakis, Stylianos E; Taschner, Peter E M
2016-06-01
The consistent and unambiguous description of sequence variants is essential to report and exchange information on the analysis of a genome. In particular, DNA diagnostics critically depends on accurate and standardized description and sharing of the variants detected. The sequence variant nomenclature system proposed in 2000 by the Human Genome Variation Society has been widely adopted and has developed into an internationally accepted standard. The recommendations are currently commissioned through a Sequence Variant Description Working Group (SVD-WG) operating under the auspices of three international organizations: the Human Genome Variation Society (HGVS), the Human Variome Project (HVP), and the Human Genome Organization (HUGO). Requests for modifications and extensions go through the SVD-WG following a standard procedure including a community consultation step. Version numbers are assigned to the nomenclature system to allow users to specify the version used in their variant descriptions. Here, we present the current recommendations, HGVS version 15.11, and briefly summarize the changes that were made since the 2000 publication. Most focus has been on removing inconsistencies and tightening definitions allowing automatic data processing. An extensive version of the recommendations is available online, at http://www.HGVS.org/varnomen. © 2016 WILEY PERIODICALS, INC.
PGen: large-scale genomic variations analysis workflow and browser in SoyKB.
Liu, Yang; Khan, Saad M; Wang, Juexin; Rynge, Mats; Zhang, Yuanxun; Zeng, Shuai; Chen, Shiyuan; Maldonado Dos Santos, Joao V; Valliyodan, Babu; Calyam, Prasad P; Merchant, Nirav; Nguyen, Henry T; Xu, Dong; Joshi, Trupti
2016-10-06
With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. We have developed both a Linux version in GitHub ( https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow ) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), ( http://soykb.org/Pegasus/index.php ). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser ( http://soykb.org/NGS_Resequence/NGS_index.php ) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species.
Seal, B S; Neill, J D; Ridpath, J F
1994-07-01
Caliciviruses are nonenveloped with a polyadenylated genome of approximately 7.6 kb and a single capsid protein. The "RNA Fold" computer program was used to analyze 3'-terminal noncoding sequences of five feline calicivirus (FCV), rabbit hemorrhagic disease virus (RHDV), and two San Miguel sea lion virus (SMSV) isolates. The FCV 3'-terminal sequences are 40-46 nucleotides in length and 72-91% similar. The FCV sequences were predicted to contain two possible duplex structures and one stem-loop structure with free energies of -2.1 to -18.2 kcal/mole. The RHDV genomic 3'-terminal RNA sequences are 54 nucleotides in length and share 49% sequence similarity to homologous regions of the FCV genome. The RHDV sequence was predicted to form two duplex structures in the 3'-terminal noncoding region with a single stem-loop structure, resembling that of FCV. In contrast, the SMSV 1 and 4 genomic 3'-terminal noncoding sequences were 185 and 182 nucleotides in length, respectively. Ten possible duplex structures were predicted with an average structural free energy of -35 kcal/mole. Sequence similarity between the two SMSV isolates was 75%. Furthermore, extensive cloverleaflike structures are predicted in the 3' noncoding region of the SMSV genome, in contrast to the predicted single stem-loop structures of FCV or RHDV.
The complete chloroplast genome sequence of the medicinal plant Salvia miltiorrhiza.
Qian, Jun; Song, Jingyuan; Gao, Huanhuan; Zhu, Yingjie; Xu, Jiang; Pang, Xiaohui; Yao, Hui; Sun, Chao; Li, Xian'en; Li, Chuyuan; Liu, Juyan; Xu, Haibin; Chen, Shilin
2013-01-01
Salvia miltiorrhiza is an important medicinal plant with great economic and medicinal value. The complete chloroplast (cp) genome sequence of Salvia miltiorrhiza, the first sequenced member of the Lamiaceae family, is reported here. The genome is 151,328 bp in length and exhibits a typical quadripartite structure of the large (LSC, 82,695 bp) and small (SSC, 17,555 bp) single-copy regions, separated by a pair of inverted repeats (IRs, 25,539 bp). It contains 114 unique genes, including 80 protein-coding genes, 30 tRNAs and four rRNAs. The genome structure, gene order, GC content and codon usage are similar to the typical angiosperm cp genomes. Four forward, three inverted and seven tandem repeats were detected in the Salvia miltiorrhiza cp genome. Simple sequence repeat (SSR) analysis among the 30 asterid cp genomes revealed that most SSRs are AT-rich, which contribute to the overall AT richness of these cp genomes. Additionally, fewer SSRs are distributed in the protein-coding sequences compared to the non-coding regions, indicating an uneven distribution of SSRs within the cp genomes. Entire cp genome comparison of Salvia miltiorrhiza and three other Lamiales cp genomes showed a high degree of sequence similarity and a relatively high divergence of intergenic spacers. Sequence divergence analysis discovered the ten most divergent and ten most conserved genes as well as their length variation, which will be helpful for phylogenetic studies in asterids. Our analysis also supports that both regional and functional constraints affect gene sequence evolution. Further, phylogenetic analysis demonstrated a sister relationship between Salvia miltiorrhiza and Sesamum indicum. The complete cp genome sequence of Salvia miltiorrhiza reported in this paper will facilitate population, phylogenetic and cp genetic engineering studies of this medicinal plant.
A genome-wide SNP scan accelerates trait-regulatory genomic loci identification in chickpea
Kujur, Alice; Bajaj, Deepak; Upadhyaya, Hari D.; Das, Shouvik; Ranjan, Rajeev; Shree, Tanima; Saxena, Maneesha S.; Badoni, Saurabh; Kumar, Vinod; Tripathi, Shailesh; Gowda, C.L.L.; Sharma, Shivali; Singh, Sube; Tyagi, Akhilesh K.; Parida, Swarup K.
2015-01-01
We identified 44844 high-quality SNPs by sequencing 92 diverse chickpea accessions belonging to a seed and pod trait-specific association panel using reference genome- and de novo-based GBS (genotyping-by-sequencing) assays. A GWAS (genome-wide association study) in an association panel of 211, including the 92 sequenced accessions, identified 22 major genomic loci showing significant association (explaining 23–47% phenotypic variation) with pod and seed number/plant and 100-seed weight. Eighteen trait-regulatory major genomic loci underlying 13 robust QTLs were validated and mapped on an intra-specific genetic linkage map by QTL mapping. A combinatorial approach of GWAS, QTL mapping and gene haplotype-specific LD mapping and transcript profiling uncovered one superior haplotype and favourable natural allelic variants in the upstream regulatory region of a CesA-type cellulose synthase (Ca_Kabuli_CesA3) gene regulating high pod and seed number/plant (explaining 47% phenotypic variation) in chickpea. The up-regulation of this superior gene haplotype correlated with increased transcript expression of Ca_Kabuli_CesA3 gene in the pollen and pod of high pod/seed number accession, resulting in higher cellulose accumulation for normal pollen and pollen tube growth. A rapid combinatorial genome-wide SNP genotyping-based approach has potential to dissect complex quantitative agronomic traits and delineate trait-regulatory genomic loci (candidate genes) for genetic enhancement in crop plants, including chickpea. PMID:26058368
Plasmodium copy number variation scan: gene copy numbers evaluation in haploid genomes.
Beghain, Johann; Langlois, Anne-Claire; Legrand, Eric; Grange, Laura; Khim, Nimol; Witkowski, Benoit; Duru, Valentine; Ma, Laurence; Bouchier, Christiane; Ménard, Didier; Paul, Richard E; Ariey, Frédéric
2016-04-12
In eukaryotic genomes, deletion or amplification rates have been estimated to be a thousand more frequent than single nucleotide variation. In Plasmodium falciparum, relatively few transcription factors have been identified, and the regulation of transcription is seemingly largely influenced by gene amplification events. Thus copy number variation (CNV) is a major mechanism enabling parasite genomes to adapt to new environmental changes. Currently, the detection of CNVs is based on quantitative PCR (qPCR), which is significantly limited by the relatively small number of genes that can be analysed at any one time. Technological advances that facilitate whole-genome sequencing, such as next generation sequencing (NGS) enable deeper analyses of the genomic variation to be performed. Because the characteristics of Plasmodium CNVs need special consideration in algorithms and strategies for which classical CNV detection programs are not suited a dedicated algorithm to detect CNVs across the entire exome of P. falciparum was developed. This algorithm is based on a custom read depth strategy through NGS data and called PlasmoCNVScan. The analysis of CNV identification on three genes known to have different levels of amplification and which are located either in the nuclear, apicoplast or mitochondrial genomes is presented. The results are correlated with the qPCR experiments, usually used for identification of locus specific amplification/deletion. This tool will facilitate the study of P. falciparum genomic adaptation in response to ecological changes: drug pressure, decreased transmission, reduction of the parasite population size (transition to pre-elimination endemic area).
Toth, Balazs; Santos, Andrea P.; do Nascimento, Naíla C.; Kritchevsky, Janice E.
2012-01-01
We report the complete genome sequence of “Candidatus Mycoplasma haemolamae,” an endemic red-cell pathogen of camelids. The single, circular chromosome has 756,845 bp, a 39.3% G+C content, and 925 coding sequences (CDSs). A great proportion (49.1%) of these CDSs are organized into paralogous gene families, which can now be further explored with regard to antigenic variation. PMID:23105057
Frahry, Matthew Blake; Sun, Cheng; Chong, Rebecca A; Mueller, Rachel Lockridge
2015-02-01
Across the tree of life, species vary dramatically in nuclear genome size. Mutations that add or remove sequences from genomes-insertions or deletions, or indels-are the ultimate source of this variation. Differences in the tempo and mode of insertion and deletion across taxa have been proposed to contribute to evolutionary diversity in genome size. Among vertebrates, most of the largest genomes are found within the salamanders, an amphibian clade with genome sizes ranging from ~14 to ~120 Gb. Salamander genomes have been shown to experience slower rates of DNA loss through small (i.e., <30 bp) deletions than do other vertebrate genomes. However, no studies have addressed DNA loss from salamander genomes resulting from larger deletions. Here, we focus on one type of large deletion-ectopic-recombination-mediated removal of LTR retrotransposon sequences. In ectopic recombination, double-strand breaks are repaired using a "wrong" (i.e., ectopic, or non-allelic) template sequence-typically another locus of similar sequence. When breaks occur within the LTR portions of LTR retrotransposons, ectopic-recombination-mediated repair can produce deletions that remove the internal transposon sequence and the equivalent of one of the two LTR sequences. These deletions leave a signature in the genome-a solo LTR sequence. We compared levels of solo LTRs in the genomes of four salamander species with levels present in five vertebrates with smaller genomes. Our results demonstrate that salamanders have low levels of solo LTRs, suggesting that ectopic-recombination-mediated deletion of LTR retrotransposons occurs more slowly than in other vertebrates with smaller genomes.
Tang, Huiwu; Zheng, Xingmei; Li, Chuliang; Xie, Xianrong; Chen, Yuanling; Chen, Letian; Zhao, Xiucai; Zheng, Huiqi; Zhou, Jiajian; Ye, Shan; Guo, Jingxin; Liu, Yao-Guang
2017-01-01
New gene origination is a major source of genomic innovations that confer phenotypic changes and biological diversity. Generation of new mitochondrial genes in plants may cause cytoplasmic male sterility (CMS), which can promote outcrossing and increase fitness. However, how mitochondrial genes originate and evolve in structure and function remains unclear. The rice Wild Abortive type of CMS is conferred by the mitochondrial gene WA352c (previously named WA352) and has been widely exploited in hybrid rice breeding. Here, we reconstruct the evolutionary trajectory of WA352c by the identification and analyses of 11 mitochondrial genomic recombinant structures related to WA352c in wild and cultivated rice. We deduce that these structures arose through multiple rearrangements among conserved mitochondrial sequences in the mitochondrial genome of the wild rice Oryza rufipogon, coupled with substoichiometric shifting and sequence variation. We identify two expressed but nonfunctional protogenes among these structures, and show that they could evolve into functional CMS genes via sequence variations that could relieve the self-inhibitory potential of the proteins. These sequence changes would endow the proteins the ability to interact with the nucleus-encoded mitochondrial protein COX11, resulting in premature programmed cell death in the anther tapetum and male sterility. Furthermore, we show that the sequences that encode the COX11-interaction domains in these WA352c-related genes have experienced purifying selection during evolution. We propose a model for the formation and evolution of new CMS genes via a “multi-recombination/protogene formation/functionalization” mechanism involving gradual variations in the structure, sequence, copy number, and function. PMID:27725674
Deciphering the distance to antibiotic resistance for the pneumococcus using genome sequencing data
Mobegi, Fredrick M.; Cremers, Amelieke J. H.; de Jonge, Marien I.; Bentley, Stephen D.; van Hijum, Sacha A. F. T.; Zomer, Aldert
2017-01-01
Advances in genome sequencing technologies and genome-wide association studies (GWAS) have provided unprecedented insights into the molecular basis of microbial phenotypes and enabled the identification of the underlying genetic variants in real populations. However, utilization of genome sequencing in clinical phenotyping of bacteria is challenging due to the lack of reliable and accurate approaches. Here, we report a method for predicting microbial resistance patterns using genome sequencing data. We analyzed whole genome sequences of 1,680 Streptococcus pneumoniae isolates from four independent populations using GWAS and identified probable hotspots of genetic variation which correlate with phenotypes of resistance to essential classes of antibiotics. With the premise that accumulation of putative resistance-conferring SNPs, potentially in combination with specific resistance genes, precedes full resistance, we retrogressively surveyed the hotspot loci and quantified the number of SNPs and/or genes, which if accumulated would confer full resistance to an otherwise susceptible strain. We name this approach the ‘distance to resistance’. It can be used to identify the creep towards complete antibiotics resistance in bacteria using genome sequencing. This approach serves as a basis for the development of future sequencing-based methods for predicting resistance profiles of bacterial strains in hospital microbiology and public health settings. PMID:28205635
The genome sequence of taurine cattle: a window to ruminant biology and evolution.
Elsik, Christine G; Tellam, Ross L; Worley, Kim C; Gibbs, Richard A; Muzny, Donna M; Weinstock, George M; Adelson, David L; Eichler, Evan E; Elnitski, Laura; Guigó, Roderic; Hamernik, Debora L; Kappes, Steve M; Lewin, Harris A; Lynn, David J; Nicholas, Frank W; Reymond, Alexandre; Rijnkels, Monique; Skow, Loren C; Zdobnov, Evgeny M; Schook, Lawrence; Womack, James; Alioto, Tyler; Antonarakis, Stylianos E; Astashyn, Alex; Chapple, Charles E; Chen, Hsiu-Chuan; Chrast, Jacqueline; Câmara, Francisco; Ermolaeva, Olga; Henrichsen, Charlotte N; Hlavina, Wratko; Kapustin, Yuri; Kiryutin, Boris; Kitts, Paul; Kokocinski, Felix; Landrum, Melissa; Maglott, Donna; Pruitt, Kim; Sapojnikov, Victor; Searle, Stephen M; Solovyev, Victor; Souvorov, Alexandre; Ucla, Catherine; Wyss, Carine; Anzola, Juan M; Gerlach, Daniel; Elhaik, Eran; Graur, Dan; Reese, Justin T; Edgar, Robert C; McEwan, John C; Payne, Gemma M; Raison, Joy M; Junier, Thomas; Kriventseva, Evgenia V; Eyras, Eduardo; Plass, Mireya; Donthu, Ravikiran; Larkin, Denis M; Reecy, James; Yang, Mary Q; Chen, Lin; Cheng, Ze; Chitko-McKown, Carol G; Liu, George E; Matukumalli, Lakshmi K; Song, Jiuzhou; Zhu, Bin; Bradley, Daniel G; Brinkman, Fiona S L; Lau, Lilian P L; Whiteside, Matthew D; Walker, Angela; Wheeler, Thomas T; Casey, Theresa; German, J Bruce; Lemay, Danielle G; Maqbool, Nauman J; Molenaar, Adrian J; Seo, Seongwon; Stothard, Paul; Baldwin, Cynthia L; Baxter, Rebecca; Brinkmeyer-Langford, Candice L; Brown, Wendy C; Childers, Christopher P; Connelley, Timothy; Ellis, Shirley A; Fritz, Krista; Glass, Elizabeth J; Herzig, Carolyn T A; Iivanainen, Antti; Lahmers, Kevin K; Bennett, Anna K; Dickens, C Michael; Gilbert, James G R; Hagen, Darren E; Salih, Hanni; Aerts, Jan; Caetano, Alexandre R; Dalrymple, Brian; Garcia, Jose Fernando; Gill, Clare A; Hiendleder, Stefan G; Memili, Erdogan; Spurlock, Diane; Williams, John L; Alexander, Lee; Brownstein, Michael J; Guan, Leluo; Holt, Robert A; Jones, Steven J M; Marra, Marco A; Moore, Richard; Moore, Stephen S; Roberts, Andy; Taniguchi, Masaaki; Waterman, Richard C; Chacko, Joseph; Chandrabose, Mimi M; Cree, Andy; Dao, Marvin Diep; Dinh, Huyen H; Gabisi, Ramatu Ayiesha; Hines, Sandra; Hume, Jennifer; Jhangiani, Shalini N; Joshi, Vandita; Kovar, Christie L; Lewis, Lora R; Liu, Yih-Shin; Lopez, John; Morgan, Margaret B; Nguyen, Ngoc Bich; Okwuonu, Geoffrey O; Ruiz, San Juana; Santibanez, Jireh; Wright, Rita A; Buhay, Christian; Ding, Yan; Dugan-Rocha, Shannon; Herdandez, Judith; Holder, Michael; Sabo, Aniko; Egan, Amy; Goodell, Jason; Wilczek-Boney, Katarzyna; Fowler, Gerald R; Hitchens, Matthew Edward; Lozado, Ryan J; Moen, Charles; Steffen, David; Warren, James T; Zhang, Jingkun; Chiu, Readman; Schein, Jacqueline E; Durbin, K James; Havlak, Paul; Jiang, Huaiyang; Liu, Yue; Qin, Xiang; Ren, Yanru; Shen, Yufeng; Song, Henry; Bell, Stephanie Nicole; Davis, Clay; Johnson, Angela Jolivet; Lee, Sandra; Nazareth, Lynne V; Patel, Bella Mayurkumar; Pu, Ling-Ling; Vattathil, Selina; Williams, Rex Lee; Curry, Stacey; Hamilton, Cerissa; Sodergren, Erica; Wheeler, David A; Barris, Wes; Bennett, Gary L; Eggen, André; Green, Ronnie D; Harhay, Gregory P; Hobbs, Matthew; Jann, Oliver; Keele, John W; Kent, Matthew P; Lien, Sigbjørn; McKay, Stephanie D; McWilliam, Sean; Ratnakumar, Abhirami; Schnabel, Robert D; Smith, Timothy; Snelling, Warren M; Sonstegard, Tad S; Stone, Roger T; Sugimoto, Yoshikazu; Takasuga, Akiko; Taylor, Jeremy F; Van Tassell, Curtis P; Macneil, Michael D; Abatepaulo, Antonio R R; Abbey, Colette A; Ahola, Virpi; Almeida, Iassudara G; Amadio, Ariel F; Anatriello, Elen; Bahadue, Suria M; Biase, Fernando H; Boldt, Clayton R; Carroll, Jeffery A; Carvalho, Wanessa A; Cervelatti, Eliane P; Chacko, Elsa; Chapin, Jennifer E; Cheng, Ye; Choi, Jungwoo; Colley, Adam J; de Campos, Tatiana A; De Donato, Marcos; Santos, Isabel K F de Miranda; de Oliveira, Carlo J F; Deobald, Heather; Devinoy, Eve; Donohue, Kaitlin E; Dovc, Peter; Eberlein, Annett; Fitzsimmons, Carolyn J; Franzin, Alessandra M; Garcia, Gustavo R; Genini, Sem; Gladney, Cody J; Grant, Jason R; Greaser, Marion L; Green, Jonathan A; Hadsell, Darryl L; Hakimov, Hatam A; Halgren, Rob; Harrow, Jennifer L; Hart, Elizabeth A; Hastings, Nicola; Hernandez, Marta; Hu, Zhi-Liang; Ingham, Aaron; Iso-Touru, Terhi; Jamis, Catherine; Jensen, Kirsty; Kapetis, Dimos; Kerr, Tovah; Khalil, Sari S; Khatib, Hasan; Kolbehdari, Davood; Kumar, Charu G; Kumar, Dinesh; Leach, Richard; Lee, Justin C-M; Li, Changxi; Logan, Krystin M; Malinverni, Roberto; Marques, Elisa; Martin, William F; Martins, Natalia F; Maruyama, Sandra R; Mazza, Raffaele; McLean, Kim L; Medrano, Juan F; Moreno, Barbara T; Moré, Daniela D; Muntean, Carl T; Nandakumar, Hari P; Nogueira, Marcelo F G; Olsaker, Ingrid; Pant, Sameer D; Panzitta, Francesca; Pastor, Rosemeire C P; Poli, Mario A; Poslusny, Nathan; Rachagani, Satyanarayana; Ranganathan, Shoba; Razpet, Andrej; Riggs, Penny K; Rincon, Gonzalo; Rodriguez-Osorio, Nelida; Rodriguez-Zas, Sandra L; Romero, Natasha E; Rosenwald, Anne; Sando, Lillian; Schmutz, Sheila M; Shen, Libing; Sherman, Laura; Southey, Bruce R; Lutzow, Ylva Strandberg; Sweedler, Jonathan V; Tammen, Imke; Telugu, Bhanu Prakash V L; Urbanski, Jennifer M; Utsunomiya, Yuri T; Verschoor, Chris P; Waardenberg, Ashley J; Wang, Zhiquan; Ward, Robert; Weikard, Rosemarie; Welsh, Thomas H; White, Stephen N; Wilming, Laurens G; Wunderlich, Kris R; Yang, Jianqi; Zhao, Feng-Qi
2009-04-24
To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides a resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production.
Xu, Duo; Jaber, Yousef; Pavlidis, Pavlos; Gokcumen, Omer
2017-09-26
Constructing alignments and phylogenies for a given locus from large genome sequencing studies with relevant outgroups allow novel evolutionary and anthropological insights. However, no user-friendly tool has been developed to integrate thousands of recently available and anthropologically relevant genome sequences to construct complete sequence alignments and phylogenies. Here, we provide VCFtoTree, a user friendly tool with a graphical user interface that directly accesses online databases to download, parse and analyze genome variation data for regions of interest. Our pipeline combines popular sequence datasets and tree building algorithms with custom data parsing to generate accurate alignments and phylogenies using all the individuals from the 1000 Genomes Project, Neanderthal and Denisovan genomes, as well as reference genomes of Chimpanzee and Rhesus Macaque. It can also be applied to other phased human genomes, as well as genomes from other species. The output of our pipeline includes an alignment in FASTA format and a tree file in newick format. VCFtoTree fulfills the increasing demand for constructing alignments and phylogenies for a given loci from thousands of available genomes. Our software provides a user friendly interface for a wider audience without prerequisite knowledge in programming. VCFtoTree can be accessed from https://github.com/duoduoo/VCFtoTree_3.0.0 .
Proteome Studies of Filamentous Fungi
DOE Office of Scientific and Technical Information (OSTI.GOV)
Baker, Scott E.; Panisko, Ellen A.
2011-04-20
The continued fast pace of fungal genome sequence generation has enabled proteomic analysis of a wide breadth of organisms that span the breadth of the Kingdom Fungi. There is some phylogenetic bias to the current catalog of fungi with reasonable DNA sequence databases (genomic or EST) that could be analyzed at a global proteomic level. However, the rapid development of next generation sequencing platforms has lowered the cost of genome sequencing such that in the near future, having a genome sequence will no longer be a time or cost bottleneck for downstream proteomic (and transcriptomic) analyses. High throughput, non-gel basedmore » proteomics offers a snapshot of proteins present in a given sample at a single point in time. There are a number of different variations on the general method and technologies for identifying peptides in a given sample. We present a method that can serve as a “baseline” for proteomic studies of fungi.« less
Whole genome sequence and comparative analysis of Borrelia burgdorferi MM1
Jabbari, Neda; Reddy, Panga Jaipal; Hood, Leroy
2018-01-01
Lyme disease is caused by spirochaetes of the Borrelia burgdorferi sensu lato genospecies. Complete genome assemblies are available for fewer than ten strains of Borrelia burgdorferi sensu stricto, the primary cause of Lyme disease in North America. MM1 is a sensu stricto strain originally isolated in the midwestern United States. Aside from a small number of genes, the complete genome sequence of this strain has not been reported. Here we present the complete genome sequence of MM1 in relation to other sensu stricto strains and in terms of its Multi Locus Sequence Typing. Our results indicate that MM1 is a new sequence type which contains a conserved main chromosome and 15 plasmids. Our results include the first contiguous 28.5 kb assembly of lp28-8, a linear plasmid carrying the vls antigenic variation system, from a Borrelia burgdorferi sensu stricto strain. PMID:29889842
Proteome studies of filamentous fungi.
Baker, Scott E; Panisko, Ellen A
2011-01-01
The continued fast pace of fungal genome sequence generation has enabled proteomic analysis of a wide variety of organisms that span the breadth of the Kingdom Fungi. There is some phylogenetic bias to the current catalog of fungi with reasonable DNA sequence databases (genomic or EST) that could be analyzed at a global proteomic level. However, the rapid development of next generation sequencing platforms has lowered the cost of genome sequencing such that in the near future, having a genome sequence will no longer be a time or cost bottleneck for downstream proteomic (and transcriptomic) analyses. High throughput, nongel-based proteomics offers a snapshot of proteins present in a given sample at a single point in time. There are a number of variations on the general methods and technologies for identifying peptides in a given sample. We present a method that can serve as a "baseline" for proteomic studies of fungi.
Tempo and mode of genomic mutations unveil human evolutionary history.
Hara, Yuichiro
2015-01-01
Mutations that have occurred in human genomes provide insight into various aspects of evolutionary history such as speciation events and degrees of natural selection. Comparing genome sequences between human and great apes or among humans is a feasible approach for inferring human evolutionary history. Recent advances in high-throughput or so-called 'next-generation' DNA sequencing technologies have enabled the sequencing of thousands of individual human genomes, as well as a variety of reference genomes of hominids, many of which are publicly available. These sequence data can help to unveil the detailed demographic history of the lineage leading to humans as well as the explosion of modern human population size in the last several thousand years. In addition, high-throughput sequencing illustrates the tempo and mode of de novo mutations, which are producing human genetic variation at this moment. Pedigree-based human genome sequencing has shown that mutation rates vary significantly across the human genome. These studies have also provided an improved timescale of human evolution, because the mutation rate estimated from pedigree analysis is half that estimated from traditional analyses based on molecular phylogeny. Because of the dramatic reduction in sequencing cost, sequencing on-demand samples designed for specific studies is now also becoming popular. To produce data of sufficient quality to meet the requirements of the study, it is necessary to set an explicit sequencing plan that includes the choice of sample collection methods, sequencing platforms, and number of sequence reads.
WhopGenome: high-speed access to whole-genome variation and sequence data in R.
Wittelsbürger, Ulrich; Pfeifer, Bastian; Lercher, Martin J
2015-02-01
The statistical programming language R has become a de facto standard for the analysis of many types of biological data, and is well suited for the rapid development of new algorithms. However, variant call data from population-scale resequencing projects are typically too large to be read and processed efficiently with R's built-in I/O capabilities. WhopGenome can efficiently read whole-genome variation data stored in the widely used variant call format (VCF) file format into several R data types. VCF files can be accessed either on local hard drives or on remote servers. WhopGenome can associate variants with annotations such as those available from the UCSC genome browser, and can accelerate the reading process by filtering loci according to user-defined criteria. WhopGenome can also read other Tabix-indexed files and create indices to allow fast selective access to FASTA-formatted sequence files. The WhopGenome R package is available on CRAN at http://cran.r-project.org/web/packages/WhopGenome/. A Bioconductor package has been submitted. lercher@cs.uni-duesseldorf.de. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Liu, Hanmei; Wang, Xuewen; Wei, Bin; Wang, Yongbin; Liu, Yinghong; Zhang, Junjie; Hu, Yufeng; Yu, Guowu; Li, Jian; Xu, Zhanbin; Huang, Yubi
2016-01-01
In southwest China, some maize landraces have long been isolated geographically, and have phenotypes that differ from those of widely grown cultivars. These landraces may harbor rich genetic variation responsible for those phenotypes. Four-row Wax is one such landrace, with four rows of kernels on the cob. We resequenced the genome of Four-row Wax, obtaining 50.46 Gb sequence at 21.87× coverage, then identified and characterized 3,252,194 SNPs, 213,181 short InDels (1–5 bp) and 39,631 structural variations (greater than 5 bp). Of those, 312,511 (9.6%) SNPs were novel compared to the most detailed haplotype map (HapMap) SNP database of maize. Characterization of variations in reported kernel row number (KRN) related genes and KRN QTL regions revealed potential causal mutations in fea2, td1, kn1, and te1. Genome-wide comparisons revealed abundant genetic variations in Four-row Wax, which may be associated with environmental adaptation. The sequence and SNP variations described here enrich genetic resources of maize, and provide guidance into study of seed numbers for crop yield improvement. PMID:27242868
Mutation Detection with Next-Generation Resequencing through a Mediator Genome
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wurtzel, Omri; Dori-Bachash, Mally; Pietrokovski, Shmuel
2010-12-31
The affordability of next generation sequencing (NGS) is transforming the field of mutation analysis in bacteria. The genetic basis for phenotype alteration can be identified directly by sequencing the entire genome of the mutant and comparing it to the wild-type (WT) genome, thus identifying acquired mutations. A major limitation for this approach is the need for an a-priori sequenced reference genome for the WT organism, as the short reads of most current NGS approaches usually prohibit de-novo genome assembly. To overcome this limitation we propose a general framework that utilizes the genome of relative organisms as mediators for comparing WTmore » and mutant bacteria. Under this framework, both mutant and WT genomes are sequenced with NGS, and the short sequencing reads are mapped to the mediator genome. Variations between the mutant and the mediator that recur in the WT are ignored, thus pinpointing the differences between the mutant and the WT. To validate this approach we sequenced the genome of Bdellovibrio bacteriovorus 109J, an obligatory bacterial predator, and its prey-independent mutant, and compared both to the mediator species Bdellovibrio bacteriovorus HD100. Although the mutant and the mediator sequences differed in more than 28,000 nucleotide positions, our approach enabled pinpointing the single causative mutation. Experimental validation in 53 additional mutants further established the implicated gene. Our approach extends the applicability of NGS-based mutant analyses beyond the domain of available reference genomes.« less
SNP discovery by high-throughput sequencing in soybean
2010-01-01
Background With the advance of new massively parallel genotyping technologies, quantitative trait loci (QTL) fine mapping and map-based cloning become more achievable in identifying genes for important and complex traits. Development of high-density genetic markers in the QTL regions of specific mapping populations is essential for fine-mapping and map-based cloning of economically important genes. Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation existing between any diverse genotypes that are usually used for QTL mapping studies. The massively parallel sequencing technologies (Roche GS/454, Illumina GA/Solexa, and ABI/SOLiD), have been widely applied to identify genome-wide sequence variations. However, it is still remains unclear whether sequence data at a low sequencing depth are enough to detect the variations existing in any QTL regions of interest in a crop genome, and how to prepare sequencing samples for a complex genome such as soybean. Therefore, with the aims of identifying SNP markers in a cost effective way for fine-mapping several QTL regions, and testing the validation rate of the putative SNPs predicted with Solexa short sequence reads at a low sequencing depth, we evaluated a pooled DNA fragment reduced representation library and SNP detection methods applied to short read sequences generated by Solexa high-throughput sequencing technology. Results A total of 39,022 putative SNPs were identified by the Illumina/Solexa sequencing system using a reduced representation DNA library of two parental lines of a mapping population. The validation rates of these putative SNPs predicted with low and high stringency were 72% and 85%, respectively. One hundred sixty four SNP markers resulted from the validation of putative SNPs and have been selectively chosen to target a known QTL, thereby increasing the marker density of the targeted region to one marker per 42 K bp. Conclusions We have demonstrated how to quickly identify large numbers of SNPs for fine mapping of QTL regions by applying massively parallel sequencing combined with genome complexity reduction techniques. This SNP discovery approach is more efficient for targeting multiple QTL regions in a same genetic population, which can be applied to other crops. PMID:20701770
2012-01-01
Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding for components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains. PMID:22448915
Morrison, Cheryl L; Iwanowicz, Luke; Work, Thierry M; Fahsbender, Elizabeth; Breitbart, Mya; Adams, Cynthia; Iwanowicz, Deb; Sanders, Lakyn; Ackermann, Mathias; Cornman, Robert S
2018-01-01
Chelonid alphaherpesvirus 5 (ChHV5) is a herpesvirus associated with fibropapillomatosis (FP) in sea turtles worldwide. Single-locus typing has previously shown differentiation between Atlantic and Pacific strains of this virus, with low variation within each geographic clade. However, a lack of multi-locus genomic sequence data hinders understanding of the rate and mechanisms of ChHV5 evolutionary divergence, as well as how these genomic changes may contribute to differences in disease manifestation. To assess genomic variation in ChHV5 among five Hawaii and three Florida green sea turtles, we used high-throughput short-read sequencing of long-range PCR products amplified from tumor tissue using primers designed from the single available ChHV5 reference genome from a Hawaii green sea turtle. This strategy recovered sequence data from both geographic regions for approximately 75% of the predicted ChHV5 coding sequences. The average nucleotide divergence between geographic populations was 1.5%; most of the substitutions were fixed differences between regions. Protein divergence was generally low (average 0.08%), and ranged between 0 and 5.3%. Several atypical genes originally identified and annotated in the reference genome were confirmed in ChHV5 genomes from both geographic locations. Unambiguous recombination events between geographic regions were identified, and clustering of private alleles suggests the prevalence of recombination in the evolutionary history of ChHV5. This study significantly increased the amount of sequence data available from ChHV5 strains, enabling informed selection of loci for future population genetic and natural history studies, and suggesting the (possibly latent) co-infection of individuals by well-differentiated geographic variants.
Morrison, Cheryl L.; Iwanowicz, Luke R.; Work, Thierry M.; Fahsbender, Elizabeth; Breitbart, Mya; Adams, Cynthia; Iwanowicz, Deborah; Sanders, Lakyn; Ackermann, Mathias; Cornman, Robert S.
2018-01-01
Chelonid alphaherpesvirus 5 (ChHV5) is a herpesvirus associated with fibropapillomatosis (FP) in sea turtles worldwide. Single-locus typing has previously shown differentiation between Atlantic and Pacific strains of this virus, with low variation within each geographic clade. However, a lack of multi-locus genomic sequence data hinders understanding of the rate and mechanisms of ChHV5 evolutionary divergence, as well as how these genomic changes may contribute to differences in disease manifestation. To assess genomic variation in ChHV5 among five Hawaii and three Florida green sea turtles, we used high-throughput short-read sequencing of long-range PCR products amplified from tumor tissue using primers designed from the single available ChHV5 reference genome from a Hawaii green sea turtle. This strategy recovered sequence data from both geographic regions for approximately 75% of the predicted ChHV5 coding sequences. The average nucleotide divergence between geographic populations was 1.5%; most of the substitutions were fixed differences between regions. Protein divergence was generally low (average 0.08%), and ranged between 0 and 5.3%. Several atypical genes originally identified and annotated in the reference genome were confirmed in ChHV5 genomes from both geographic locations. Unambiguous recombination events between geographic regions were identified, and clustering of private alleles suggests the prevalence of recombination in the evolutionary history of ChHV5. This study significantly increased the amount of sequence data available from ChHV5 strains, enabling informed selection of loci for future population genetic and natural history studies, and suggesting the (possibly latent) co-infection of individuals by well-differentiated geographic variants.
Iwanowicz, Luke; Work, Thierry M.; Fahsbender, Elizabeth; Breitbart, Mya; Adams, Cynthia; Iwanowicz, Deb; Sanders, Lakyn; Ackermann, Mathias; Cornman, Robert S.
2018-01-01
Chelonid alphaherpesvirus 5 (ChHV5) is a herpesvirus associated with fibropapillomatosis (FP) in sea turtles worldwide. Single-locus typing has previously shown differentiation between Atlantic and Pacific strains of this virus, with low variation within each geographic clade. However, a lack of multi-locus genomic sequence data hinders understanding of the rate and mechanisms of ChHV5 evolutionary divergence, as well as how these genomic changes may contribute to differences in disease manifestation. To assess genomic variation in ChHV5 among five Hawaii and three Florida green sea turtles, we used high-throughput short-read sequencing of long-range PCR products amplified from tumor tissue using primers designed from the single available ChHV5 reference genome from a Hawaii green sea turtle. This strategy recovered sequence data from both geographic regions for approximately 75% of the predicted ChHV5 coding sequences. The average nucleotide divergence between geographic populations was 1.5%; most of the substitutions were fixed differences between regions. Protein divergence was generally low (average 0.08%), and ranged between 0 and 5.3%. Several atypical genes originally identified and annotated in the reference genome were confirmed in ChHV5 genomes from both geographic locations. Unambiguous recombination events between geographic regions were identified, and clustering of private alleles suggests the prevalence of recombination in the evolutionary history of ChHV5. This study significantly increased the amount of sequence data available from ChHV5 strains, enabling informed selection of loci for future population genetic and natural history studies, and suggesting the (possibly latent) co-infection of individuals by well-differentiated geographic variants. PMID:29479497
Roisin, S; Gaudin, C; De Mendonça, R; Bellon, J; Van Vaerenbergh, K; De Bruyne, K; Byl, B; Pouseele, H; Denis, O; Supply, P
2016-06-01
We used a two-step whole genome sequencing analysis for resolving two concurrent outbreaks in two neonatal services in Belgium, caused by exfoliative toxin A-encoding-gene-positive (eta+) methicillin-susceptible Staphylococcus aureus with an otherwise sporadic spa-type t209 (ST-109). Outbreak A involved 19 neonates and one healthcare worker in a Brussels hospital from May 2011 to October 2013. After a first episode interrupted by decolonization procedures applied over 7 months, the outbreak resumed concomitantly with the onset of outbreak B in a hospital in Asse, comprising 11 neonates and one healthcare worker from mid-2012 to January 2013. Pan-genome multilocus sequence typing, defined on the basis of 42 core and accessory reference genomes, and single-nucleotide polymorphisms mapped on an outbreak-specific de novo assembly were used to compare 28 available outbreak isolates and 19 eta+/spa-type t209 isolates identified by routine or nationwide surveillance. Pan-genome multilocus sequence typing showed that the outbreaks were caused by independent clones not closely related to any of the surveillance isolates. Isolates from only ten cases with overlapping stays in outbreak A, including four pairs of twins, showed no or only a single nucleotide polymorphism variation, indicating limited sequential transmission. Detection of larger genomic variation, even from the start of the outbreak, pointed to sporadic seeding from a pre-existing exogenous source, which persisted throughout the whole course of outbreak A. Whole genome sequencing analysis can provide unique fine-tuned insights into transmission pathways of complex outbreaks even at their inception, which, with timely use, could valuably guide efforts for early source identification. Copyright © 2016 European Society of Clinical Microbiology and Infectious Diseases. Published by Elsevier Ltd. All rights reserved.
Wang, Xihong; Zheng, Zhuqing; Cai, Yudong; Chen, Ting; Li, Chao; Fu, Weiwei; Jiang, Yu
2017-12-01
The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level. However, the growing sample sizes and the divergent complexity of nonhuman genomes challenge the efficiency and robustness of current human-oriented CNV detection methods. Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1-2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds. CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the nonhuman reference genome assembly. Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller achieved the best accuracy and sensitivity for detecting duplications. The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures. Therefore, CNVcaller promotes population genetic analyses of functional CNVs in more species. © The Authors 2017. Published by Oxford University Press.
Wang, Xihong; Zheng, Zhuqing; Cai, Yudong; Chen, Ting; Li, Chao; Fu, Weiwei
2017-01-01
Abstract Background The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level. However, the growing sample sizes and the divergent complexity of nonhuman genomes challenge the efficiency and robustness of current human-oriented CNV detection methods. Results Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1–2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds. CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the nonhuman reference genome assembly. Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller achieved the best accuracy and sensitivity for detecting duplications. Conclusions The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures. Therefore, CNVcaller promotes population genetic analyses of functional CNVs in more species. PMID:29220491
Zhang, Yanjun; Du, Liuwen; Liu, Ao; Chen, Jianjun; Wu, Li; Hu, Weiming; Zhang, Wei; Kim, Kyunghee; Lee, Sang-Choon; Yang, Tae-Jin; Wang, Ying
2016-01-01
Epimedium L. is a phylogenetically and economically important genus in the family Berberidaceae. We here sequenced the complete chloroplast (cp) genomes of four Epimedium species using Illumina sequencing technology via a combination of de novo and reference-guided assembly, which was also the first comprehensive cp genome analysis on Epimedium combining the cp genome sequence of E. koreanum previously reported. The five Epimedium cp genomes exhibited typical quadripartite and circular structure that was rather conserved in genomic structure and the synteny of gene order. However, these cp genomes presented obvious variations at the boundaries of the four regions because of the expansion and contraction of the inverted repeat (IR) region and the single-copy (SC) boundary regions. The trnQ-UUG duplication occurred in the five Epimedium cp genomes, which was not found in the other basal eudicotyledons. The rapidly evolving cp genome regions were detected among the five cp genomes, as well as the difference of simple sequence repeats (SSR) and repeat sequence were identified. Phylogenetic relationships among the five Epimedium species based on their cp genomes showed accordance with the updated system of the genus on the whole, but reminded that the evolutionary relationships and the divisions of the genus need further investigation applying more evidences. The availability of these cp genomes provided valuable genetic information for accurately identifying species, taxonomy and phylogenetic resolution and evolution of Epimedium, and assist in exploration and utilization of Epimedium plants. PMID:27014326
Istace, Benjamin; Friedrich, Anne; d'Agata, Léo; Faye, Sébastien; Payen, Emilie; Beluche, Odette; Caradec, Claudia; Davidas, Sabrina; Cruaud, Corinne; Liti, Gianni; Lemainque, Arnaud; Engelen, Stefan; Wincker, Patrick; Schacherer, Joseph; Aury, Jean-Marc
2017-02-01
Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer, that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long reads sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near complete genome assemblies can be generated and comprehensive population genomic analyses can be performed. Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than the Illumina-only assemblies and we obtained one or two long contigs for 65 % of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes. Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable elements insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology. © The Author 2017. Published by Oxford University Press.
Istace, Benjamin; Friedrich, Anne; d'Agata, Léo; Faye, Sébastien; Payen, Emilie; Beluche, Odette; Caradec, Claudia; Davidas, Sabrina; Cruaud, Corinne; Liti, Gianni; Lemainque, Arnaud; Engelen, Stefan; Wincker, Patrick; Schacherer, Joseph
2017-01-01
Abstract Background: Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer, that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long reads sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near complete genome assemblies can be generated and comprehensive population genomic analyses can be performed. Results: Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than the Illumina-only assemblies and we obtained one or two long contigs for 65 % of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes. Conclusion: Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable elements insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology. PMID:28369459
Dlugosch, Katrina M.; Lai, Zhao; Bonin, Aurélie; Hierro, José; Rieseberg, Loren H.
2013-01-01
Transcriptome sequences are becoming more broadly available for multiple individuals of the same species, providing opportunities to derive population genomic information from these datasets. Using the 454 Life Science Genome Sequencer FLX and FLX-Titanium next-generation platforms, we generated 11−430 Mbp of sequence for normalized cDNA for 40 wild genotypes of the invasive plant Centaurea solstitialis, yellow starthistle, from across its worldwide distribution. We examined the impact of sequencing effort on transcriptome recovery and overlap among individuals. To do this, we developed two novel publicly available software pipelines: SnoWhite for read cleaning before assembly, and AllelePipe for clustering of loci and allele identification in assembled datasets with or without a reference genome. AllelePipe is designed specifically for cases in which read depth information is not appropriate or available to assist with disentangling closely related paralogs from allelic variation, as in transcriptome or previously assembled libraries. We find that modest applications of sequencing effort recover most of the novel sequences present in the transcriptome of this species, including single-copy loci and a representative distribution of functional groups. In contrast, the coverage of variable sites, observation of heterozygosity, and overlap among different libraries are all highly dependent on sequencing effort. Nevertheless, the information gained from overlapping regions was informative regarding coarse population structure and variation across our small number of population samples, providing the first genetic evidence in support of hypothesized invasion scenarios. PMID:23390612
Batty, Elizabeth M; Chaemchuen, Suwittra; Blacksell, Stuart; Richards, Allen L; Paris, Daniel; Bowden, Rory; Chan, Caroline; Lachumanan, Ramkumar; Day, Nicholas; Donnelly, Peter; Chen, Swaine; Salje, Jeanne
2018-06-01
Orientia tsutsugamushi is a clinically important but neglected obligate intracellular bacterial pathogen of the Rickettsiaceae family that causes the potentially life-threatening human disease scrub typhus. In contrast to the genome reduction seen in many obligate intracellular bacteria, early genetic studies of Orientia have revealed one of the most repetitive bacterial genomes sequenced to date. The dramatic expansion of mobile elements has hampered efforts to generate complete genome sequences using short read sequencing methodologies, and consequently there have been few studies of the comparative genomics of this neglected species. We report new high-quality genomes of O. tsutsugamushi, generated using PacBio single molecule long read sequencing, for six strains: Karp, Kato, Gilliam, TA686, UT76 and UT176. In comparative genomics analyses of these strains together with existing reference genomes from Ikeda and Boryong strains, we identify a relatively small core genome of 657 genes, grouped into core gene islands and separated by repeat regions, and use the core genes to infer the first whole-genome phylogeny of Orientia. Complete assemblies of multiple Orientia genomes verify initial suggestions that these are remarkable organisms. They have larger genomes compared with most other Rickettsiaceae, with widespread amplification of repeat elements and massive chromosomal rearrangements between strains. At the gene level, Orientia has a relatively small set of universally conserved genes, similar to other obligate intracellular bacteria, and the relative expansion in genome size can be accounted for by gene duplication and repeat amplification. Our study demonstrates the utility of long read sequencing to investigate complex bacterial genomes and characterise genomic variation.
Sri, Tanu; Mayee, Pratiksha; Singh, Anandita
2015-09-01
Whole genome sequence analyses allow unravelling such evolutionary consequences of meso-triplication event in Brassicaceae (∼14-20 million years ago (MYA)) as differential gene fractionation and diversification in homeologous sub-genomes. This study presents a simple gene-centric approach involving microsynteny and natural genetic variation analysis for understanding SUPPRESSOR of OVEREXPRESSION of CONSTANS 1 (SOC1) homeolog evolution in Brassica. Analysis of microsynteny in Brassica rapa homeologous regions containing SOC1 revealed differential gene fractionation correlating to reported fractionation status of sub-genomes of origin, viz. least fractionated (LF), moderately fractionated 1 (MF1) and most fractionated (MF2), respectively. Screening 18 cultivars of 6 Brassica species led to the identification of 8 genomic and 27 transcript variants of SOC1, including splice-forms. Co-occurrence of both interrupted and intronless SOC1 genes was detected in few Brassica species. In silico analysis characterised Brassica SOC1 as MADS intervening, K-box, C-terminal (MIKC(C)) transcription factor, with highly conserved MADS and I domains relative to K-box and C-terminal domain. Phylogenetic analyses and multiple sequence alignments depicting shared pattern of silent/non-silent mutations assigned Brassica SOC1 homologs into groups based on shared diploid base genome. In addition, a sub-genome structure in uncharacterised Brassica genomes was inferred. Expression analysis of putative MF2 and LF (Brassica diploid base genome A (AA)) sub-genome-specific SOC1 homeologs of Brassica juncea revealed near identical expression pattern. However, MF2-specific homeolog exhibited significantly higher expression implying regulatory diversification. In conclusion, evidence for polyploidy-induced sequence and regulatory evolution in Brassica SOC1 is being presented wherein differential homeolog expression is implied in functional diversification.
Neptune: a bioinformatics tool for rapid discovery of genomic variation in bacterial populations
Marinier, Eric; Zaheer, Rahat; Berry, Chrystal; Weedmark, Kelly A.; Domaratzki, Michael; Mabon, Philip; Knox, Natalie C.; Reimer, Aleisha R.; Graham, Morag R.; Chui, Linda; Patterson-Fortin, Laura; Zhang, Jian; Pagotto, Franco; Farber, Jeff; Mahony, Jim; Seyer, Karine; Bekal, Sadjia; Tremblay, Cécile; Isaac-Renton, Judy; Prystajecky, Natalie; Chen, Jessica; Slade, Peter
2017-01-01
Abstract The ready availability of vast amounts of genomic sequence data has created the need to rethink comparative genomics algorithms using ‘big data’ approaches. Neptune is an efficient system for rapidly locating differentially abundant genomic content in bacterial populations using an exact k-mer matching strategy, while accommodating k-mer mismatches. Neptune’s loci discovery process identifies sequences that are sufficiently common to a group of target sequences and sufficiently absent from non-targets using probabilistic models. Neptune uses parallel computing to efficiently identify and extract these loci from draft genome assemblies without requiring multiple sequence alignments or other computationally expensive comparative sequence analyses. Tests on simulated and real datasets showed that Neptune rapidly identifies regions that are both sensitive and specific. We demonstrate that this system can identify trait-specific loci from different bacterial lineages. Neptune is broadly applicable for comparative bacterial analyses, yet will particularly benefit pathogenomic applications, owing to efficient and sensitive discovery of differentially abundant genomic loci. The software is available for download at: http://github.com/phac-nml/neptune. PMID:29048594
Are Escherichia coli Pathotypes Still Relevant in the Era of Whole-Genome Sequencing?
Robins-Browne, Roy M.; Holt, Kathryn E.; Ingle, Danielle J.; Hocking, Dianna M.; Yang, Ji; Tauschek, Marija
2016-01-01
The empirical and pragmatic nature of diagnostic microbiology has given rise to several different schemes to subtype E.coli, including biotyping, serotyping, and pathotyping. These schemes have proved invaluable in identifying and tracking outbreaks, and for prognostication in individual cases of infection, but they are imprecise and potentially misleading due to the malleability and continuous evolution of E. coli. Whole genome sequencing can be used to accurately determine E. coli subtypes that are based on allelic variation or differences in gene content, such as serotyping and pathotyping. Whole genome sequencing also provides information about single nucleotide polymorphisms in the core genome of E. coli, which form the basis of sequence typing, and is more reliable than other systems for tracking the evolution and spread of individual strains. A typing scheme for E. coli based on genome sequences that includes elements of both the core and accessory genomes, should reduce typing anomalies and promote understanding of how different varieties of E. coli spread and cause disease. Such a scheme could also define pathotypes more precisely than current methods. PMID:27917373
Are Escherichia coli Pathotypes Still Relevant in the Era of Whole-Genome Sequencing?
Robins-Browne, Roy M; Holt, Kathryn E; Ingle, Danielle J; Hocking, Dianna M; Yang, Ji; Tauschek, Marija
2016-01-01
The empirical and pragmatic nature of diagnostic microbiology has given rise to several different schemes to subtype E .coli, including biotyping, serotyping, and pathotyping. These schemes have proved invaluable in identifying and tracking outbreaks, and for prognostication in individual cases of infection, but they are imprecise and potentially misleading due to the malleability and continuous evolution of E. coli . Whole genome sequencing can be used to accurately determine E. coli subtypes that are based on allelic variation or differences in gene content, such as serotyping and pathotyping. Whole genome sequencing also provides information about single nucleotide polymorphisms in the core genome of E. coli , which form the basis of sequence typing, and is more reliable than other systems for tracking the evolution and spread of individual strains. A typing scheme for E. coli based on genome sequences that includes elements of both the core and accessory genomes, should reduce typing anomalies and promote understanding of how different varieties of E. coli spread and cause disease. Such a scheme could also define pathotypes more precisely than current methods.
Optimizing and evaluating the reconstruction of Metagenome-assembled microbial genomes.
Papudeshi, Bhavya; Haggerty, J Matthew; Doane, Michael; Morris, Megan M; Walsh, Kevin; Beattie, Douglas T; Pande, Dnyanada; Zaeri, Parisa; Silva, Genivaldo G Z; Thompson, Fabiano; Edwards, Robert A; Dinsdale, Elizabeth A
2017-11-28
Microbiome/host interactions describe characteristics that affect the host's health. Shotgun metagenomics includes sequencing a random subset of the microbiome to analyze its taxonomic and metabolic potential. Reconstruction of DNA fragments into genomes from metagenomes (called metagenome-assembled genomes) assigns unknown fragments to taxa/function and facilitates discovery of novel organisms. Genome reconstruction incorporates sequence assembly and sorting of assembled sequences into bins, characteristic of a genome. However, the microbial community composition, including taxonomic and phylogenetic diversity may influence genome reconstruction. We determine the optimal reconstruction method for four microbiome projects that had variable sequencing platforms (IonTorrent and Illumina), diversity (high or low), and environment (coral reefs and kelp forests), using a set of parameters to select for optimal assembly and binning tools. We tested the effects of the assembly and binning processes on population genome reconstruction using 105 marine metagenomes from 4 projects. Reconstructed genomes were obtained from each project using 3 assemblers (IDBA, MetaVelvet, and SPAdes) and 2 binning tools (GroopM and MetaBat). We assessed the efficiency of assemblers using statistics that including contig continuity and contig chimerism and the effectiveness of binning tools using genome completeness and taxonomic identification. We concluded that SPAdes, assembled more contigs (143,718 ± 124 contigs) of longer length (N50 = 1632 ± 108 bp), and incorporated the most sequences (sequences-assembled = 19.65%). The microbial richness and evenness were maintained across the assembly, suggesting low contig chimeras. SPAdes assembly was responsive to the biological and technological variations within the project, compared with other assemblers. Among binning tools, we conclude that MetaBat produced bins with less variation in GC content (average standard deviation: 1.49), low species richness (4.91 ± 0.66), and higher genome completeness (40.92 ± 1.75) across all projects. MetaBat extracted 115 bins from the 4 projects of which 66 bins were identified as reconstructed metagenome-assembled genomes with sequences belonging to a specific genus. We identified 13 novel genomes, some of which were 100% complete, but show low similarity to genomes within databases. In conclusion, we present a set of biologically relevant parameters for evaluation to select for optimal assembly and binning tools. For the tools we tested, SPAdes assembler and MetaBat binning tools reconstructed quality metagenome-assembled genomes for the four projects. We also conclude that metagenomes from microbial communities that have high coverage of phylogenetically distinct, and low taxonomic diversity results in highest quality metagenome-assembled genomes.
Insights from the complete chloroplast genome into the evolution of Sesamum indicum L.
Zhang, Haiyang; Li, Chun; Miao, Hongmei; Xiong, Songjin
2013-01-01
Sesame (Sesamum indicum L.) is one of the oldest oilseed crops. In order to investigate the evolutionary characters according to the Sesame Genome Project, apart from sequencing its nuclear genome, we sequenced the complete chloroplast genome of S. indicum cv. Yuzhi 11 (white seeded) using Illumina and 454 sequencing. Comparisons of chloroplast genomes between S. indicum and the 18 other higher plants were then analyzed. The chloroplast genome of cv. Yuzhi 11 contains 153,338 bp and a total of 114 unique genes (KC569603). The number of chloroplast genes in sesame is the same as that in Nicotiana tabacum, Vitis vinifera and Platanus occidentalis. The variation in the length of the large single-copy (LSC) regions and inverted repeats (IR) in sesame compared to 18 other higher plant species was the main contributor to size variation in the cp genome in these species. The 77 functional chloroplast genes, except for ycf1 and ycf2, were highly conserved. The deletion of the cp ycf1 gene sequence in cp genomes may be due either to its transfer to the nuclear genome, as has occurred in sesame, or direct deletion, as has occurred in Panax ginseng and Cucumis sativus. The sesame ycf2 gene is only 5,721 bp in length and has lost about 1,179 bp. Nucleotides 1-585 of ycf2 when queried in BLAST had hits in the sesame draft genome. Five repeats (R10, R12, R13, R14 and R17) were unique to the sesame chloroplast genome. We also found that IR contraction/expansion in the cp genome alters its rate of evolution. Chloroplast genes and repeats display the signature of convergent evolution in sesame and other species. These findings provide a foundation for further investigation of cp genome evolution in Sesamum and other higher plants.
Xing, Libo; Zhang, Dong; Song, Xiaomin; Weng, Kai; Shen, Yawen; Li, Youmei; Zhao, Caiping; Ma, Juanjuan; An, Na; Han, Mingyu
2016-01-01
Apple (Malus domestica Borkh.) is a commercially important fruit worldwide. Detailed information on genomic DNA polymorphisms, which are important for understanding phenotypic traits, is lacking for the apple. We re-sequenced two elite apple varieties, ‘Nagafu No. 2’ and ‘Qinguan,’ which have different characteristics. We identified many genomic variations, including 2,771,129 single nucleotide polymorphisms (SNPs), 82,663 structural variations (SVs), and 1,572,803 insertion/deletions (INDELs) in ‘Nagafu No. 2’ and 2,262,888 SNPs, 63,764 SVs, and 1,294,060 INDELs in ‘Qinguan.’ The ‘SNP,’ ‘INDEL,’ and ‘SV’ distributions were non-random, with variation-rich or -poor regions throughout the genomes. In ‘Nagafu No. 2’ and ‘Qinguan’ there were 171,520 and 147,090 non-synonymous SNPs spanning 23,111 and 21,400 genes, respectively; 3,963 and 3,196 SVs in 3,431 and 2,815 genes, respectively; and 1,834 and 1,451 INDELs in 1,681 and 1,345 genes, respectively. Genetic linkage maps of 190 flowering genes associated with multiple flowering pathways in ‘Nagafu No. 2,’ ‘Qinguan,’ and ‘Golden Delicious,’ identified complex regulatory mechanisms involved in floral induction, flower bud formation, and flowering characteristics, which might reflect the genetic variation of the flowering genes. Expression profiling of key flowering genes in buds and leaves suggested that the photoperiod and autonomous flowering pathways are major contributors to the different floral-associated traits between ‘Nagafu No. 2’ and ‘Qinguan.’ The genome variation data provided a foundation for the further exploration of apple diversity and gene–phenotype relationships, and for future research on molecular breeding to improve apple and related species. PMID:27446138
Lucas Lledó, José Ignacio; Cáceres, Mario
2013-01-01
One of the most used techniques to study structural variation at a genome level is paired-end mapping (PEM). PEM has the advantage of being able to detect balanced events, such as inversions and translocations. However, inversions are still quite difficult to predict reliably, especially from high-throughput sequencing data. We simulated realistic PEM experiments with different combinations of read and library fragment lengths, including sequencing errors and meaningful base-qualities, to quantify and track down the origin of false positives and negatives along sequencing, mapping, and downstream analysis. We show that PEM is very appropriate to detect a wide range of inversions, even with low coverage data. However, % of inversions located between segmental duplications are expected to go undetected by the most common sequencing strategies. In general, longer DNA libraries improve the detectability of inversions far better than increments of the coverage depth or the read length. Finally, we review the performance of three algorithms to detect inversions —SVDetect, GRIAL, and VariationHunter—, identify common pitfalls, and reveal important differences in their breakpoint precisions. These results stress the importance of the sequencing strategy for the detection of structural variants, especially inversions, and offer guidelines for the design of future genome sequencing projects. PMID:23637806
SNP discovery and genotyping using Genotyping-by-Sequencing in Pekin ducks.
Zhu, Feng; Cui, Qian-Qian; Hou, Zhuo-Cheng
2016-11-15
Genomic selection and genome-wide association studies need thousands to millions of SNPs. However, many non-model species do not have reference chips for detecting variation. Our goal was to develop and validate an inexpensive but effective method for detecting SNP variation. Genotyping by sequencing (GBS) can be a highly efficient strategy for genome-wide SNP detection, as an alternative to microarray chips. Here, we developed a GBS protocol for ducks and tested it to genotype 49 Pekin ducks. A total of 169,209 SNPs were identified from all animals, with a mean of 55,920 SNPs per individual. The average SNP density reached 1156 SNPs/MB. In this study, the first application of GBS to ducks, we demonstrate the power and simplicity of this method. GBS can be used for genetic studies in to provide an effective method for genome-wide SNP discovery.
Hehir-Kwa, Jayne Y; Marschall, Tobias; Kloosterman, Wigard P; Francioli, Laurent C; Baaijens, Jasmijn A; Dijkstra, Louis J; Abdellaoui, Abdel; Koval, Vyacheslav; Thung, Djie Tjwan; Wardenaar, René; Renkens, Ivo; Coe, Bradley P; Deelen, Patrick; de Ligt, Joep; Lameijer, Eric-Wubbo; van Dijk, Freerk; Hormozdiari, Fereydoun; Uitterlinden, André G; van Duijn, Cornelia M; Eichler, Evan E; de Bakker, Paul I W; Swertz, Morris A; Wijmenga, Cisca; van Ommen, Gert-Jan B; Slagboom, P Eline; Boomsma, Dorret I; Schönhuth, Alexander; Ye, Kai; Guryev, Victor
2016-10-06
Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, the majority of studies provide neither a global view of the full spectrum of these variants nor integrate them into reference panels of genetic variation. Here, we analyse whole genome sequencing data of 769 individuals from 250 Dutch families, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels, and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion are previously under reported variants sized between 21 and 100 bp. We detect 4 megabases of novel sequence, encoding 11 new transcripts. Finally, we show 191 known, trait-associated SNPs to be in strong linkage disequilibrium with SVs and demonstrate that our panel facilitates accurate imputation of SVs in unrelated individuals.
Albayrak, Levent; Khanipov, Kamil; Pimenova, Maria; Golovko, George; Rojas, Mark; Pavlidis, Ioannis; Chumakov, Sergei; Aguilar, Gerardo; Chávez, Arturo; Widger, William R; Fofanov, Yuriy
2016-12-12
Low-abundance mutations in mitochondrial populations (mutations with minor allele frequency ≤ 1%), are associated with cancer, aging, and neurodegenerative disorders. While recent progress in high-throughput sequencing technology has significantly improved the heteroplasmy identification process, the ability of this technology to detect low-abundance mutations can be affected by the presence of similar sequences originating from nuclear DNA (nDNA). To determine to what extent nDNA can cause false positive low-abundance heteroplasmy calls, we have identified mitochondrial locations of all subsequences that are common or similar (one mismatch allowed) between nDNA and mitochondrial DNA (mtDNA). Performed analysis revealed up to a 25-fold variation in the lengths of longest common and longest similar (one mismatch allowed) subsequences across the mitochondrial genome. The size of the longest subsequences shared between nDNA and mtDNA in several regions of the mitochondrial genome were found to be as low as 11 bases, which not only allows using these regions to design new, very specific PCR primers, but also supports the hypothesis of the non-random introduction of mtDNA into the human nuclear DNA. Analysis of the mitochondrial locations of the subsequences shared between nDNA and mtDNA suggested that even very short (36 bases) single-end sequencing reads can be used to identify low-abundance variation in 20.4% of the mitochondrial genome. For longer (76 and 150 bases) reads, the proportion of the mitochondrial genome where nDNA presence will not interfere found to be 44.5 and 67.9%, when low-abundance mutations at 100% of locations can be identified using 417 bases long single reads. This observation suggests that the analysis of low-abundance variations in mitochondria population can be extended to a variety of large data collections such as NCBI Sequence Read Archive, European Nucleotide Archive, The Cancer Genome Atlas, and International Cancer Genome Consortium.
Arif, Rabia; Akram, Faiza; Jamil, Tazeen; Lee, Siu Fai
2017-01-01
Posttranslational modifications (PTMs) occur in all essential proteins taking command of their functions. There are many domains inside proteins where modifications take place on side-chains of amino acids through various enzymes to generate different species of proteins. In this manuscript we have, for the first time, predicted posttranslational modifications of frequency clock and mating type a-1 proteins in Sordaria fimicola collected from different sites to see the effect of environment on proteins or various amino acids pickings and their ultimate impact on consensus sequences present in mating type proteins using bioinformatics tools. Furthermore, we have also measured and walked through genomic DNA of various Sordaria strains to determine genetic diversity by genotyping the short sequence repeats (SSRs) of wild strains of S. fimicola collected from contrasting environments of two opposing slopes (harsh and xeric south facing slope and mild north facing slope) of Evolution Canyon (EC), Israel. Based on the whole genome sequence of S. macrospora, we targeted 20 genomic regions in S. fimicola which contain short sequence repeats (SSRs). Our data revealed genetic variations in strains from south facing slope and these findings assist in the hypothesis that genetic variations caused by stressful environments lead to evolution. PMID:28717646
Arif, Rabia; Akram, Faiza; Jamil, Tazeen; Mukhtar, Hamid; Lee, Siu Fai; Saleem, Muhammad
2017-01-01
Posttranslational modifications (PTMs) occur in all essential proteins taking command of their functions. There are many domains inside proteins where modifications take place on side-chains of amino acids through various enzymes to generate different species of proteins. In this manuscript we have, for the first time, predicted posttranslational modifications of frequency clock and mating type a-1 proteins in Sordaria fimicola collected from different sites to see the effect of environment on proteins or various amino acids pickings and their ultimate impact on consensus sequences present in mating type proteins using bioinformatics tools. Furthermore, we have also measured and walked through genomic DNA of various Sordaria strains to determine genetic diversity by genotyping the short sequence repeats (SSRs) of wild strains of S. fimicola collected from contrasting environments of two opposing slopes (harsh and xeric south facing slope and mild north facing slope) of Evolution Canyon (EC), Israel. Based on the whole genome sequence of S. macrospora , we targeted 20 genomic regions in S. fimicola which contain short sequence repeats (SSRs). Our data revealed genetic variations in strains from south facing slope and these findings assist in the hypothesis that genetic variations caused by stressful environments lead to evolution.
Lu, Fu-Hao; McKenzie, Neil; Kettleborough, George; Heavens, Darren; Clark, Matthew D; Bevan, Michael W
2018-05-01
The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics. Here we use a Fosill 38-kb jumping library to assess medium and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values. Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.
Qin, Yanhong; Wang, Li; Zhang, Zhenchen; Qiao, Qi; Zhang, Desheng; Tian, Yuting; Wang, Shuang; Wang, Yongjiang; Yan, Zhaoling
2014-01-01
Background Sweet potato chlorotic stunt virus (family Closteroviridae, genus Crinivirus) features a large bipartite, single-stranded, positive-sense RNA genome. To date, only three complete genomic sequences of SPCSV can be accessed through GenBank. SPCSV was first detected from China in 2011, only partial genomic sequences have been determined in the country. No report on the complete genomic sequence and genome structure of Chinese SPCSV isolates or the genetic relation between isolates from China and other countries is available. Methodology/Principal Findings The complete genomic sequences of five isolates from different areas in China were characterized. This study is the first to report the complete genome sequences of SPCSV from whitefly vectors. Genome structure analysis showed that isolates of WA and EA strains from China have the same coding protein as isolates Can181-9 and m2-47, respectively. Twenty cp genes and four RNA1 partial segments were sequenced and analyzed, and the nucleotide identities of complete genomic, cp, and RNA1 partial sequences were determined. Results indicated high conservation among strains and significant differences between WA and EA strains. Genetic analysis demonstrated that, except for isolates from Guangdong Province, SPCSVs from other areas belong to the WA strain. Genome organization analysis showed that the isolates in this study lack the p22 gene. Conclusions/Significance We presented the complete genome sequences of SPCSV in China. Comparison of nucleotide identities and genome structures between these isolates and previously reported isolates showed slight differences. The nucleotide identities of different SPCSV isolates showed high conservation among strains and significant differences between strains. All nine isolates in this study lacked p22 gene. WA strains were more extensively distributed than EA strains in China. These data provide important insights into the molecular variation and genomic structure of SPCSV in China as well as genetic relationships among isolates from China and other countries. PMID:25170926
Genome Editing of Structural Variations: Modeling and Gene Correction.
Park, Chul-Yong; Sung, Jin Jea; Kim, Dong-Wook
2016-07-01
The analysis of chromosomal structural variations (SVs), such as inversions and translocations, was made possible by the completion of the human genome project and the development of genome-wide sequencing technologies. SVs contribute to genetic diversity and evolution, although some SVs can cause diseases such as hemophilia A in humans. Genome engineering technology using programmable nucleases (e.g., ZFNs, TALENs, and CRISPR/Cas9) has been rapidly developed, enabling precise and efficient genome editing for SV research. Here, we review advances in modeling and gene correction of SVs, focusing on inversion, translocation, and nucleotide repeat expansion. Copyright © 2016 Elsevier Ltd. All rights reserved.
Wong, Gerard; Leckie, Christopher; Gorringe, Kylie L; Haviv, Izhak; Campbell, Ian G; Kowalczyk, Adam
2010-04-15
High-density single nucleotide polymorphism (SNP) genotyping arrays are efficient and cost effective platforms for the detection of copy number variation (CNV). To ensure accuracy in probe synthesis and to minimize production costs, short oligonucleotide probe sequences are used. The use of short probe sequences limits the specificity of binding targets in the human genome. The specificity of these short probeset sequences has yet to be fully analysed against a normal reference human genome. Sequence similarity can artificially elevate or suppress copy number measurements, and hence reduce the reliability of affected probe readings. For the purpose of detecting narrow CNVs reliably down to the width of a single probeset, sequence similarity is an important issue that needs to be addressed. We surveyed the Affymetrix Human Mapping SNP arrays for probeset sequence similarity against the reference human genome. Utilizing sequence similarity results, we identified a collection of fine-scaled putative CNVs between gender from autosomal probesets whose sequence matches various loci on the sex chromosomes. To detect these variations, we utilized our statistical approach, Detecting REcurrent Copy number change using rank-order Statistics (DRECS), and showed that its performance was superior and more stable than the t-test in detecting CNVs. Through the application of DRECS on the HapMap population datasets with multi-matching probesets filtered, we identified biologically relevant SNPs in aberrant regions across populations with known association to physical traits, such as height, covered by the span of a single probe. This provided empirical confirmation of the existence of naturally occurring narrow CNVs as well as the sensitivity of the Affymetrix SNP array technology in detecting them. The MATLAB implementation of DRECS is available at http://ww2.cs.mu.oz.au/ approximately gwong/DRECS/index.html.
USDA-ARS?s Scientific Manuscript database
The genome sequence of the constricta strain of Potato yellow dwarf virus (CYDV) was determined to be 12,792 nucleotides long and organized into seven open reading frames with the gene order 3’-N-X-P-Y-M-G-L-5’, which encodes the nucleocapsid, phosphoprotein, movement, matrix, glycoprotein and RNA-d...
Current challenges in genome annotation through structural biology and bioinformatics.
Furnham, Nicholas; de Beer, Tjaart A P; Thornton, Janet M
2012-10-01
With the huge volume in genomic sequences being generated from high-throughout sequencing projects the requirement for providing accurate and detailed annotations of gene products has never been greater. It is proving to be a huge challenge for computational biologists to use as much information as possible from experimental data to provide annotations for genome data of unknown function. A central component to this process is to use experimentally determined structures, which provide a means to detect homology that is not discernable from just the sequence and permit the consequences of genomic variation to be realized at the molecular level. In particular, structures also form the basis of many bioinformatics methods for improving the detailed functional annotations of enzymes in combination with similarities in sequence and chemistry. Copyright © 2012. Published by Elsevier Ltd.
Zheng, X L; Zhou, J P; Zang, L L; Tang, A T; Liu, D Q; Deng, K J; Zhang, Y
2016-06-17
The narrow genetic variation present in common wheat (Triticum aestivum) varieties has greatly restricted the improvement of crop yield in modern breeding systems. Alien addition lines have proven to be an effective means to broaden the genetic diversity of common wheat. Wheat-rye addition lines, which are the direct bridge materials for wheat improvement, have been wildly used to produce new wheat cultivars carrying alien rye germplasm. In this study, we investigated the genetic and epigenetic alterations in two sets of wheat-rye disomic addition lines (1R-7R) and the corresponding triticales. We used expressed sequence tag-simple sequence repeat, amplified fragment length polymorphism, and methylation-sensitive amplification polymorphism analyses to analyze the effects of the introduction of alien chromosomes (either the entire genome or sub-genome) to wheat genetic background. We found obvious and diversiform variations in the genomic primary structure, as well as alterations in the extent and pattern of the genomic DNA methylation of the recipient. Meanwhile, these results also showed that introduction of different rye chromosomes could induce different genetic and epigenetic alterations in its recipient, and the genetic background of the parents is an important factor for genomic and epigenetic variation induced by alien chromosome addition.
Phylogenomic evidence for ancient hybridization in the genomes of living cats (Felidae)
Li, Gang; Davis, Brian W.; Eizirik, Eduardo; Murphy, William J.
2016-01-01
Inter-species hybridization has been recently recognized as potentially common in wild animals, but the extent to which it shapes modern genomes is still poorly understood. Distinguishing historical hybridization events from other processes leading to phylogenetic discordance among different markers requires a well-resolved species tree that considers all modes of inheritance and overcomes systematic problems due to rapid lineage diversification by sampling large genomic character sets. Here, we assessed genome-wide phylogenetic variation across a diverse mammalian family, Felidae (cats). We combined genotypes from a genome-wide SNP array with additional autosomal, X- and Y-linked variants to sample ∼150 kb of nuclear sequence, in addition to complete mitochondrial genomes generated using light-coverage Illumina sequencing. We present the first robust felid time tree that accounts for unique maternal, paternal, and biparental evolutionary histories. Signatures of phylogenetic discordance were abundant in the genomes of modern cats, in many cases indicating hybridization as the most likely cause. Comparison of big cat whole-genome sequences revealed a substantial reduction of X-linked divergence times across several large recombination cold spots, which were highly enriched for signatures of selection-driven post-divergence hybridization between the ancestors of the snow leopard and lion lineages. These results highlight the mosaic origin of modern felid genomes and the influence of sex chromosomes and sex-biased dispersal in post-speciation gene flow. A complete resolution of the tree of life will require comprehensive genomic sampling of biparental and sex-limited genetic variation to identify and control for phylogenetic conflict caused by ancient admixture and sex-biased differences in genomic transmission. PMID:26518481
Cheng, Hui; Li, Jinfeng; Zhang, Hong; Cai, Binhua; Gao, Zhihong
2017-01-01
Compared with other members of the family Rosaceae, the chloroplast genomes of Fragaria species exhibit low variation, and this situation has limited phylogenetic analyses; thus, complete chloroplast genome sequencing of Fragaria species is needed. In this study, we sequenced the complete chloroplast genome of F. × ananassa ‘Benihoppe’ using the Illumina HiSeq 2500-PE150 platform and then performed a combination of de novo assembly and reference-guided mapping of contigs to generate complete chloroplast genome sequences. The chloroplast genome exhibits a typical quadripartite structure with a pair of inverted repeats (IRs, 25,936 bp) separated by large (LSC, 85,531 bp) and small (SSC, 18,146 bp) single-copy (SC) regions. The length of the F. × ananassa ‘Benihoppe’ chloroplast genome is 155,549 bp, representing the smallest Fragaria chloroplast genome observed to date. The genome encodes 112 unique genes, comprising 78 protein-coding genes, 30 tRNA genes and four rRNA genes. Comparative analysis of the overall nucleotide sequence identity among ten complete chloroplast genomes confirmed that for both coding and non-coding regions in Rosaceae, SC regions exhibit higher sequence variation than IRs. The Ka/Ks ratio of most genes was less than 1, suggesting that most genes are under purifying selection. Moreover, the mVISTA results also showed a high degree of conservation in genome structure, gene order and gene content in Fragaria, particularly among three octoploid strawberries which were F. × ananassa ‘Benihoppe’, F. chiloensis (GP33) and F. virginiana (O477). However, when the sequences of the coding and non-coding regions of F. × ananassa ‘Benihoppe’ were compared in detail with those of F. chiloensis (GP33) and F. virginiana (O477), a number of SNPs and InDels were revealed by MEGA 7. Six non-coding regions (trnK-matK, trnS-trnG, atpF-atpH, trnC-petN, trnT-psbD and trnP-psaJ) with a percentage of variable sites greater than 1% and no less than five parsimony-informative sites were identified and may be useful for phylogenetic analysis of the genus Fragaria. PMID:29038765
Whole-genome CNV analysis: advances in computational approaches.
Pirooznia, Mehdi; Goes, Fernando S; Zandi, Peter P
2015-01-01
Accumulating evidence indicates that DNA copy number variation (CNV) is likely to make a significant contribution to human diversity and also play an important role in disease susceptibility. Recent advances in genome sequencing technologies have enabled the characterization of a variety of genomic features, including CNVs. This has led to the development of several bioinformatics approaches to detect CNVs from next-generation sequencing data. Here, we review recent advances in CNV detection from whole genome sequencing. We discuss the informatics approaches and current computational tools that have been developed as well as their strengths and limitations. This review will assist researchers and analysts in choosing the most suitable tools for CNV analysis as well as provide suggestions for new directions in future development.
2011-01-01
Background One of the key goals of oak genomics research is to identify genes of adaptive significance. This information may help to improve the conservation of adaptive genetic variation and the management of forests to increase their health and productivity. Deep-coverage large-insert genomic libraries are a crucial tool for attaining this objective. We report herein the construction of a BAC library for Quercus robur, its characterization and an analysis of BAC end sequences. Results The EcoRI library generated consisted of 92,160 clones, 7% of which had no insert. Levels of chloroplast and mitochondrial contamination were below 3% and 1%, respectively. Mean clone insert size was estimated at 135 kb. The library represents 12 haploid genome equivalents and, the likelihood of finding a particular oak sequence of interest is greater than 99%. Genome coverage was confirmed by PCR screening of the library with 60 unique genetic loci sampled from the genetic linkage map. In total, about 20,000 high-quality BAC end sequences (BESs) were generated by sequencing 15,000 clones. Roughly 5.88% of the combined BAC end sequence length corresponded to known retroelements while ab initio repeat detection methods identified 41 additional repeats. Collectively, characterized and novel repeats account for roughly 8.94% of the genome. Further analysis of the BESs revealed 1,823 putative genes suggesting at least 29,340 genes in the oak genome. BESs were aligned with the genome sequences of Arabidopsis thaliana, Vitis vinifera and Populus trichocarpa. One putative collinear microsyntenic region encoding an alcohol acyl transferase protein was observed between oak and chromosome 2 of V. vinifera. Conclusions This BAC library provides a new resource for genomic studies, including SSR marker development, physical mapping, comparative genomics and genome sequencing. BES analysis provided insight into the structure of the oak genome. These sequences will be used in the assembly of a future genome sequence for oak. PMID:21645357
Whole-genome analysis of a patient with early-stage small-cell lung cancer.
Han, J-Y; Lee, Y-S; Kim, B C; Lee, G K; Lee, S; Kim, E-H; Kim, H-M; Bhak, J
2014-12-01
We performed whole-genome sequencing (WGS) of a case of early-stage small-cell lung cancer (SCLC) to analyze the genomic features. WGS revealed a lot of single-nucleotide variations (SNVs), small insertion/deletions and chromosomal abnormality. Chromosomes 4p, 5q, 13q, 15q, 17p and 22q contained many block deletions. Especially, copy loss was observed in tumor suppressor genes RB1 and TP53, and copy gain in oncogene hTERT. Somatic mutations were found in TP53 and CREBBP. Novel nonsynonymous (ns) SNVs in C6ORF103 and SLC5A4 genes were also found. Sanger sequencing of the SLC5A4 gene in 23 independent SCLC samples showed another nsSNV in the SLC5A4 gene, indicating that nsSNVs in the SLC5A4 gene are recurrent in SCLC. WGS of an early-stage SCLC identified novel recurrent mutations and validated known variations, including copy number variations. These findings provide insight into the genomic landscape contributing to SCLC development.
The Diversity Present in 5140 Human Mitochondrial Genomes
Pereira, Luísa; Freitas, Fernando; Fernandes, Verónica; Pereira, Joana B.; Costa, Marta D.; Costa, Stephanie; Máximo, Valdemar; Macaulay, Vincent; Rocha, Ricardo; Samuels, David C.
2009-01-01
We analyzed the current status (as of the end of August 2008) of human mitochondrial genomes deposited in GenBank, amounting to 5140 complete or coding-region sequences, in order to present an overall picture of the diversity present in the mitochondrial DNA of the global human population. To perform this task, we developed mtDNA-GeneSyn, a computer tool that identifies and exhaustedly classifies the diversity present in large genetic data sets. The diversity observed in the 5140 human mitochondrial genomes was compared with all possible transitions and transversions from the standard human mitochondrial reference genome. This comparison showed that tRNA and rRNA secondary structures have a large effect in limiting the diversity of the human mitochondrial sequences, whereas for the protein-coding genes there is a bias toward less variation at the second codon positions. The analysis of the observed amino acid variations showed a tolerance of variations that convert between the amino acids V, I, A, M, and T. This defines a group of amino acids with similar chemical properties that can interconvert by a single transition. PMID:19426953
Ross, Caitlin R; DeFelice, Dominick S; Hunt, Greg J; Ihle, Kate E; Amdam, Gro V; Rueppell, Olav
2015-02-21
Meiotic recombination has traditionally been explained based on the structural requirement to stabilize homologous chromosome pairs to ensure their proper meiotic segregation. Competing hypotheses seek to explain the emerging findings of significant heterogeneity in recombination rates within and between genomes, but intraspecific comparisons of genome-wide recombination patterns are rare. The honey bee (Apis mellifera) exhibits the highest rate of genomic recombination among multicellular animals with about five cross-over events per chromatid. Here, we present a comparative analysis of recombination rates across eight genetic linkage maps of the honey bee genome to investigate which genomic sequence features are correlated with recombination rate and with its variation across the eight data sets, ranging in average marker spacing ranging from 1 Mbp to 120 kbp. Overall, we found that GC content explained best the variation in local recombination rate along chromosomes at the analyzed 100 kbp scale. In contrast, variation among the different maps was correlated to the abundance of microsatellites and several specific tri- and tetra-nucleotides. The combined evidence from eight medium-scale recombination maps of the honey bee genome suggests that recombination rate variation in this highly recombining genome might be due to the DNA configuration instead of distinct sequence motifs. However, more fine-scale analyses are needed. The empirical basis of eight differing genetic maps allowed for robust conclusions about the correlates of the local recombination rates and enabled the study of the relation between DNA features and variability in local recombination rates, which is particularly relevant in the honey bee genome with its exceptionally high recombination rate.
Motamayor, Juan C; Mockaitis, Keithanne; Schmutz, Jeremy; Haiminen, Niina; Livingstone, Donald; Cornejo, Omar; Findley, Seth D; Zheng, Ping; Utro, Filippo; Royaert, Stefan; Saski, Christopher; Jenkins, Jerry; Podicheti, Ram; Zhao, Meixia; Scheffler, Brian E; Stack, Joseph C; Feltus, Frank A; Mustiga, Guiliana M; Amores, Freddy; Phillips, Wilbert; Marelli, Jean Philippe; May, Gregory D; Shapiro, Howard; Ma, Jianxin; Bustamante, Carlos D; Schnell, Raymond J; Main, Dorrie; Gilbert, Don; Parida, Laxmi; Kuhn, David N
2013-06-03
Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.
2013-01-01
Background Galileo is a transposable element responsible for the generation of three chromosomal inversions in natural populations of Drosophila buzzatii. Although the most characteristic feature of Galileo is the long internally-repetitive terminal inverted repeats (TIRs), which resemble the Drosophila Foldback element, its transposase-coding sequence has led to its classification as a member of the P-element superfamily (Class II, subclass 1, TIR order). Furthermore, Galileo has a wide distribution in the genus Drosophila, since it has been found in 6 of the 12 Drosophila sequenced genomes. Among these species, D. mojavensis, the one closest to D. buzzatii, presented the highest diversity in sequence and structure of Galileo elements. Results In the present work, we carried out a thorough search and annotation of all the Galileo copies present in the D. mojavensis sequenced genome. In our set of 170 Galileo copies we have detected 5 Galileo subfamilies (C, D, E, F, and X) with different structures ranging from nearly complete, to only 2 TIR or solo TIR copies. Finally, we have explored the structural and length variation of the Galileo copies that point out the relatively frequent rearrangements within and between Galileo elements. Different mechanisms responsible for these rearrangements are discussed. Conclusions Although Galileo is a transposable element with an ancient history in the D. mojavensis genome, our data indicate a recent transpositional activity. Furthermore, the dynamism in sequence and structure, mainly affecting the TIRs, suggests an active exchange of sequences among the copies. This exchange could lead to new subfamilies of the transposon, which could be crucial for the long-term survival of the element in the genome. PMID:23374229
2013-01-01
Background Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. Results We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. Conclusions We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits. PMID:23731509
Marzo, Mar; Bello, Xabier; Puig, Marta; Maside, Xulio; Ruiz, Alfredo
2013-02-04
Galileo is a transposable element responsible for the generation of three chromosomal inversions in natural populations of Drosophila buzzatii. Although the most characteristic feature of Galileo is the long internally-repetitive terminal inverted repeats (TIRs), which resemble the Drosophila Foldback element, its transposase-coding sequence has led to its classification as a member of the P-element superfamily (Class II, subclass 1, TIR order). Furthermore, Galileo has a wide distribution in the genus Drosophila, since it has been found in 6 of the 12 Drosophila sequenced genomes. Among these species, D. mojavensis, the one closest to D. buzzatii, presented the highest diversity in sequence and structure of Galileo elements. In the present work, we carried out a thorough search and annotation of all the Galileo copies present in the D. mojavensis sequenced genome. In our set of 170 Galileo copies we have detected 5 Galileo subfamilies (C, D, E, F, and X) with different structures ranging from nearly complete, to only 2 TIR or solo TIR copies. Finally, we have explored the structural and length variation of the Galileo copies that point out the relatively frequent rearrangements within and between Galileo elements. Different mechanisms responsible for these rearrangements are discussed. Although Galileo is a transposable element with an ancient history in the D. mojavensis genome, our data indicate a recent transpositional activity. Furthermore, the dynamism in sequence and structure, mainly affecting the TIRs, suggests an active exchange of sequences among the copies. This exchange could lead to new subfamilies of the transposon, which could be crucial for the long-term survival of the element in the genome.
Global mapping of transposon location.
Gabriel, Abram; Dapprich, Johannes; Kunkel, Mark; Gresham, David; Pratt, Stephen C; Dunham, Maitreya J
2006-12-15
Transposable genetic elements are ubiquitous, yet their presence or absence at any given position within a genome can vary between individual cells, tissues, or strains. Transposable elements have profound impacts on host genomes by altering gene expression, assisting in genomic rearrangements, causing insertional mutations, and serving as sources of phenotypic variation. Characterizing a genome's full complement of transposons requires whole genome sequencing, precluding simple studies of the impact of transposition on interindividual variation. Here, we describe a global mapping approach for identifying transposon locations in any genome, using a combination of transposon-specific DNA extraction and microarray-based comparative hybridization analysis. We use this approach to map the repertoire of endogenous transposons in different laboratory strains of Saccharomyces cerevisiae and demonstrate that transposons are a source of extensive genomic variation. We also apply this method to mapping bacterial transposon insertion sites in a yeast genomic library. This unique whole genome view of transposon location will facilitate our exploration of transposon dynamics, as well as defining bases for individual differences and adaptive potential.
Zhang, Quan; Zhu, Feng; Liu, Long; Zheng, Chuan Wei; Wang, De He; Hou, Zhuo Cheng; Ning, Zhong Hua
2015-01-01
Eggshell damages lead to economic losses in the egg production industry and are a threat to human health. We examined 49-wk-old Rhode Island White hens (Gallus gallus) that laid eggs having shells with significantly different strengths and thicknesses. We used HiSeq 2000 (Illumina) sequencing to characterize the chicken transcriptome and whole genome to identify the key genes and genetic mutations associated with eggshell calcification. We identified a total of 14,234 genes expressed in the chicken uterus, representing 89% of all annotated chicken genes. A total of 889 differentially expressed genes were identified by comparing low eggshell strength (LES) and normal eggshell strength (NES) genomes. The DEGs are enriched in calcification-related processes, including calcium ion transport and calcium signaling pathways as revealed by gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis. Some important matrix proteins, such as OC-116, LTF and SPP1, were also expressed differentially between two groups. A total of 3,671,919 single-nucleotide polymorphisms (SNPs) and 508,035 Indels were detected in protein coding genes by whole-genome re-sequencing, including 1775 non-synonymous variations and 19 frame-shift Indels in DEGs. SNPs and Indels found in this study could be further investigated for eggshell traits. This is the first report to integrate the transcriptome and genome re-sequencing to target the genetic variations which decreased the eggshell qualities. These findings further advance our understanding of eggshell calcification in the chicken uterus.
Knoch, Tobias A; Wachsmuth, Malte; Kepper, Nick; Lesnussa, Michael; Abuseiris, Anis; Ali Imam, A M; Kolovos, Petros; Zuin, Jessica; Kockx, Christel E M; Brouwer, Rutger W W; van de Werken, Harmen J G; van IJcken, Wilfred F J; Wendt, Kerstin S; Grosveld, Frank G
2016-01-01
The dynamic three-dimensional chromatin architecture of genomes and its co-evolutionary connection to its function-the storage, expression, and replication of genetic information-is still one of the central issues in biology. Here, we describe the much debated 3D architecture of the human and mouse genomes from the nucleosomal to the megabase pair level by a novel approach combining selective high-throughput high-resolution chromosomal interaction capture ( T2C ), polymer simulations, and scaling analysis of the 3D architecture and the DNA sequence. The genome is compacted into a chromatin quasi-fibre with ~5 ± 1 nucleosomes/11 nm, folded into stable ~30-100 kbp loops forming stable loop aggregates/rosettes connected by similar sized linkers. Minor but significant variations in the architecture are seen between cell types and functional states. The architecture and the DNA sequence show very similar fine-structured multi-scaling behaviour confirming their co-evolution and the above. This architecture, its dynamics, and accessibility, balance stability and flexibility ensuring genome integrity and variation enabling gene expression/regulation by self-organization of (in)active units already in proximity. Our results agree with the heuristics of the field and allow "architectural sequencing" at a genome mechanics level to understand the inseparable systems genomic properties.
Palacios-Flores, Kim; García-Sotelo, Jair; Castillo, Alejandra; Uribe, Carina; Aguilar, Luis; Morales, Lucía; Gómez-Romero, Laura; Reyes, José; Garciarubio, Alejandro; Boege, Margareta; Dávila, Guillermo
2018-04-01
We present a conceptually simple, sensitive, precise, and essentially nonstatistical solution for the analysis of genome variation in haploid organisms. The generation of a Perfect Match Genomic Landscape (PMGL), which computes intergenome identity with single nucleotide resolution, reveals signatures of variation wherever a query genome differs from a reference genome. Such signatures encode the precise location of different types of variants, including single nucleotide variants, deletions, insertions, and amplifications, effectively introducing the concept of a general signature of variation. The precise nature of variants is then resolved through the generation of targeted alignments between specific sets of sequence reads and known regions of the reference genome. Thus, the perfect match logic decouples the identification of the location of variants from the characterization of their nature, providing a unified framework for the detection of genome variation. We assessed the performance of the PMGL strategy via simulation experiments. We determined the variation profiles of natural genomes and of a synthetic chromosome, both in the context of haploid yeast strains. Our approach uncovered variants that have previously escaped detection. Moreover, our strategy is ideally suited for further refining high-quality reference genomes. The source codes for the automated PMGL pipeline have been deposited in a public repository. Copyright © 2018 by the Genetics Society of America.
The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity
Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.
2015-01-01
Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, ncCADD and ncGWAVA, and find both scores are significantly predictive of human dosage sensitive genes and appear to carry information beyond conservation, as assessed by ncGERP. These results highlight that the intolerance of noncoding sequence stretches in the human genome can provide a critical complementary tool to other genome annotation approaches to help identify the parts of the human genome increasingly likely to harbor mutations that influence risk of disease. PMID:26332131
2011-01-01
Background Angiosperm mitochondrial genomes are more complex than those of other organisms. Analyses of the mitochondrial genome sequences of at least 11 angiosperm species have showed several common properties; these cannot easily explain, however, how the diverse mitotypes evolved within each genus or species. We analyzed the evolutionary relationships of Brassica mitotypes by sequencing. Results We sequenced the mitotypes of cam (Brassica rapa), ole (B. oleracea), jun (B. juncea), and car (B. carinata) and analyzed them together with two previously sequenced mitotypes of B. napus (pol and nap). The sizes of whole single circular genomes of cam, jun, ole, and car are 219,747 bp, 219,766 bp, 360,271 bp, and 232,241 bp, respectively. The mitochondrial genome of ole is largest as a resulting of the duplication of a 141.8 kb segment. The jun mitotype is the result of an inherited cam mitotype, and pol is also derived from the cam mitotype with evolutionary modifications. Genes with known functions are conserved in all mitotypes, but clear variation in open reading frames (ORFs) with unknown functions among the six mitotypes was observed. Sequence relationship analysis showed that there has been genome compaction and inheritance in the course of Brassica mitotype evolution. Conclusions We have sequenced four Brassica mitotypes, compared six Brassica mitotypes and suggested a mechanism for mitochondrial genome formation in Brassica, including evolutionary events such as inheritance, duplication, rearrangement, genome compaction, and mutation. PMID:21988783
USDA-ARS?s Scientific Manuscript database
In previous work, we reported on the isolation and genome sequence analysis of Bacillus cereus strain tsu1 NCBI accession number JPYN00000000. The 36 scaffolds in the assembled tsu1 genome were all aligned with B. cereus B4264 genome with variations. Genes encoding for xylanase and cellulase and the...
Genomic diversity of necrotic enteritis-associated strains of Clostridium perfringens: a review.
Lacey, Jake A; Johanesen, Priscilla A; Lyras, Dena; Moore, Robert J
2016-06-01
The investigation of genomic variation between Clostridium perfringens isolates from poultry has been an important tool to enhance our understanding of the genetic basis of strain pathogenicity and the epidemiology of virulent and avirulent strains within the context of necrotic enteritis (NE). The earliest studies used whole genome profiling techniques such as pulsed-field gel electrophoresis to differentiate isolates and determine their relative levels of relatedness. DNA sequencing has been used to investigate genetic variation in (a) individual genes, such as those encoding the alpha and NetB toxins; (b) panels of housekeeping genes for multi-locus sequence typing and (c) most recently whole genome sequencing to build a more complete picture of genomic differences between isolates. Conclusions drawn from these studies include: differential carriage of large conjugative plasmids accounts for a large proportion of inter-strain differences; plasmid-encoded genes are more highly conserved than chromosomal genes, perhaps indicating a relatively recent origin for the plasmids; isolates from NE-affected birds fall into three distinct sequence-based clades while non-pathogenic isolates from healthy birds tend to be more genomically diverse. Overall, the NE causing strains are closely related to C. perfringens isolates from other birds and other diseases whereas the non-pathogenic poultry strains are generally more remotely related to either the pathogenic strains or the strains from other birds. Genomic analysis has indicated that genes in addition to netB are associated with NE pathogenic isolates. Collectively, this work has resulted in a deeper understanding of the pathogenesis of this important poultry disease.
Dynamics of actin evolution in dinoflagellates.
Kim, Sunju; Bachvaroff, Tsvetan R; Handy, Sara M; Delwiche, Charles F
2011-04-01
Dinoflagellates have unique nuclei and intriguing genome characteristics with very high DNA content making complete genome sequencing difficult. In dinoflagellates, many genes are found in multicopy gene families, but the processes involved in the establishment and maintenance of these gene families are poorly understood. Understanding the dynamics of gene family evolution in dinoflagellates requires comparisons at different evolutionary scales. Studies of closely related species provide fine-scale information relative to species divergence, whereas comparisons of more distantly related species provides broad context. We selected the actin gene family as a highly expressed conserved gene previously studied in dinoflagellates. Of the 142 sequences determined in this study, 103 were from the two closely related species, Dinophysis acuminata and D. caudata, including full length and partial cDNA sequences as well as partial genomic amplicons. For these two Dinophysis species, at least three types of sequences could be identified. Most copies (79%) were relatively similar and in nucleotide trees, the sequences formed two bushy clades corresponding to the two species. In comparisons within species, only eight to ten nucleotide differences were found between these copies. The two remaining types formed clades containing sequences from both species. One type included the most similar sequences in between-species comparisons with as few as 12 nucleotide differences between species. The second type included the most divergent sequences in comparisons between and within species with up to 93 nucleotide differences between sequences. In all the sequences, most variation occurred in synonymous sites or the 5' UnTranslated Region (UTR), although there was still limited amino acid variation between most sequences. Several potential pseudogenes were found (approximately 10% of all sequences depending on species) with incomplete open reading frames due to frameshifts or early stop codons. Overall, variation in the actin gene family fits best with the "birth and death" model of evolution based on recent duplications, pseudogenes, and incomplete lineage sorting. Divergence between species was similar to variation within species, so that actin may be too conserved to be useful for phylogenetic estimation of closely related species.
Human Y chromosome copy number variation in the next generation sequencing era and beyond.
Massaia, Andrea; Xue, Yali
2017-05-01
The human Y chromosome provides a fertile ground for structural rearrangements owing to its haploidy and high content of repeated sequences. The methodologies used for copy number variation (CNV) studies have developed over the years. Low-throughput techniques based on direct observation of rearrangements were developed early on, and are still used, often to complement array-based or sequencing approaches which have limited power in regions with high repeat content and specifically in the presence of long, identical repeats, such as those found in human sex chromosomes. Some specific rearrangements have been investigated for decades; because of their effects on fertility, or their outstanding evolutionary features, the interest in these has not diminished. However, following the flourishing of large-scale genomics, several studies have investigated CNVs across the whole chromosome. These studies sometimes employ data generated within large genomic projects such as the DDD study or the 1000 Genomes Project, and often survey large samples of healthy individuals without any prior selection. Novel technologies based on sequencing long molecules and combinations of technologies, promise to stimulate the study of Y-CNVs in the immediate future.
Leveraging Large-Scale Cancer Genomics Datasets for Germline Discovery - TCGA
The session will review how data types have changed over time, focusing on how next-generation sequencing is being employed to yield more precise information about the underlying genomic variation that influences tumor etiology and biology.
Govindaraj, Mahalingam
2015-01-01
The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next generation sequencing methods. Databases have become an integral part of all aspects of science research, including basic and applied plant and animal sciences. The importance of databases keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, keeps expanding in recent years. The databases and associated web portals provide at a minimum a uniform set of tools and automated analysis across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in dealing with crop plant databases utilization in advancing genomic era. The utilization of databases for variation analysis with other comparative genomics tools, and data interpretation platforms are well described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement program is still being achieved widely; otherwise, the end for sequencing is not far away. PMID:25874133
Genome sequence of the progenitor of wheat A subgenome Triticum urartu.
Ling, Hong-Qing; Ma, Bin; Shi, Xiaoli; Liu, Hui; Dong, Lingli; Sun, Hua; Cao, Yinghao; Gao, Qiang; Zheng, Shusong; Li, Ye; Yu, Ying; Du, Huilong; Qi, Ming; Li, Yan; Lu, Hongwei; Yu, Hua; Cui, Yan; Wang, Ning; Chen, Chunlin; Wu, Huilan; Zhao, Yan; Zhang, Juncheng; Li, Yiwen; Zhou, Wenjuan; Zhang, Bairu; Hu, Weijuan; van Eijk, Michiel J T; Tang, Jifeng; Witsenboer, Hanneke M A; Zhao, Shancen; Li, Zhensheng; Zhang, Aimin; Wang, Daowen; Liang, Chengzhi
2018-05-09
Triticum urartu (diploid, AA) is the progenitor of the A subgenome of tetraploid (Triticum turgidum, AABB) and hexaploid (Triticum aestivum, AABBDD) wheat 1,2 . Genomic studies of T. urartu have been useful for investigating the structure, function and evolution of polyploid wheat genomes. Here we report the generation of a high-quality genome sequence of T. urartu by combining bacterial artificial chromosome (BAC)-by-BAC sequencing, single molecule real-time whole-genome shotgun sequencing 3 , linked reads and optical mapping 4,5 . We assembled seven chromosome-scale pseudomolecules and identified protein-coding genes, and we suggest a model for the evolution of T. urartu chromosomes. Comparative analyses with genomes of other grasses showed gene loss and amplification in the numbers of transposable elements in the T. urartu genome. Population genomics analysis of 147 T. urartu accessions from across the Fertile Crescent showed clustering of three groups, with differences in altitude and biostress, such as powdery mildew disease. The T. urartu genome assembly provides a valuable resource for studying genetic variation in wheat and related grasses, and promises to facilitate the discovery of genes that could be useful for wheat improvement.
Fajardo, Diego; Schlautman, Brandon; Steffan, Shawn; Polashock, James; Vorsa, Nicholi; Zalapa, Juan
2014-02-25
This is the first de novo assembly and annotation of a complete mitochondrial genome in the Ericales order from the American cranberry (Vaccinium macrocarpon Ait.). Moreover, only four complete Asterid mitochondrial genomes have been made publicly available. The cranberry mitochondrial genome was assembled and reconstructed from whole genome 454 Roche GS-FLX and Illumina shotgun sequences. Compared with other Asterids, the reconstruction of the genome revealed an average size mitochondrion (459,678 nt) with relatively little repetitive sequences and DNA of plastid origin. The complete mitochondrial genome of cranberry was annotated obtaining a total of 34 genes classified based on their putative function, plus three ribosomal RNAs, and 17 transfer RNAs. Maternal organellar cranberry inheritance was inferred by analyzing gene variation in the cranberry mitochondria and plastid genomes. The annotation of cranberry mitochondrial genome revealed the presence of two copies of tRNA-Sec and a selenocysteine insertion sequence (SECIS) element which were lost in plants during evolution. This is the first report of a land plant possessing selenocysteine insertion machinery at the sequence level. Published by Elsevier B.V.
Modeling genome coverage in single-cell sequencing
Daley, Timothy; Smith, Andrew D.
2014-01-01
Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. But because they begin with such small amounts of starting material, the amount of information that is obtained from single-cell sequencing experiment is highly sensitive to the choice of protocol employed and variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material. Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries. Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq. Contact: andrewds@usc.edu Supplementary information: Supplementary material is available at Bioinformatics online. PMID:25107873
The Genome Sequence of Taurine Cattle: A window to ruminant biology and evolution
Elsik, Christine G.; Tellam, Ross L.; Worley, Kim C.
2010-01-01
To understand the biology and evolution of ruminants, the cattle genome was sequenced to ∼7× coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1,217 are absent or undetected in non-eutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides an enabling resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production. PMID:19390049
Dissecting the human microbiome with single-cell genomics.
Tolonen, Andrew C; Xavier, Ramnik J
2017-06-14
Recent advances in genome sequencing of single microbial cells enable the assignment of functional roles to members of the human microbiome that cannot currently be cultured. This approach can reveal the genomic basis of phenotypic variation between closely related strains and can be applied to the targeted study of immunogenic bacteria in disease.
USDA-ARS?s Scientific Manuscript database
The comprehensive identification of genes underlying phenotypic variation of complex traits such as disease resistance remains one of the greatest challenges in biology despite having genome sequences and more powerful tools. Most genome-wide screens lack sufficient resolving power as they typically...
Aramrak, Attawan; Kidwell, Kimberlee K; Steber, Camille M; Burke, Ian C
2015-10-23
5-Enolpyruvylshikimate-3-phosphate synthase (EPSPS) is the sixth and penultimate enzyme in the shikimate biosynthesis pathway, and is the target of the herbicide glyphosate. The EPSPS genes of allohexaploid wheat (Triticum aestivum, AABBDD) have not been well characterized. Herein, the three homoeologous copies of the allohexaploid wheat EPSPS gene were cloned and characterized. Genomic and coding DNA sequences of EPSPS from the three related genomes of allohexaploid wheat were isolated using PCR and inverse PCR approaches from soft white spring "Louise'. Development of genome-specific primers allowed the mapping and expression analysis of TaEPSPS-7A1, TaEPSPS-7D1, and TaEPSPS-4A1 on chromosomes 7A, 7D, and 4A, respectively. Sequence alignments of cDNA sequences from wheat and wheat relatives served as a basis for phylogenetic analysis. The three genomic copies of wheat EPSPS differed by insertion/deletion and single nucleotide polymorphisms (SNPs), largely in intron sequences. RT-PCR analysis and cDNA cloning revealed that EPSPS is expressed from all three genomic copies. However, TaEPSPS-4A1 is expressed at much lower levels than TaEPSPS-7A1 and TaEPSPS-7D1 in wheat seedlings. Phylogenetic analysis of 1190-bp cDNA clones from wheat and wheat relatives revealed that: 1) TaEPSPS-7A1 is most similar to EPSPS from the tetraploid AB genome donor, T. turgidum (99.7 % identity); 2) TaEPSPS-7D1 most resembles EPSPS from the diploid D genome donor, Aegilops tauschii (100 % identity); and 3) TaEPSPS-4A1 resembles EPSPS from the diploid B genome relative, Ae. speltoides (97.7 % identity). Thus, EPSPS sequences in allohexaploid wheat are preserved from the most two recent ancestors. The wheat EPSPS genes are more closely related to Lolium multiflorum and Brachypodium distachyon than to Oryza sativa (rice). The three related EPSPS homoeologues of wheat exhibited conservation of the exon/intron structure and of coding region sequence, but contained significant sequence variation within intron regions. The genome-specific primers developed will enable future characterization of natural and induced variation in EPSPS sequence and expression. This can be useful in investigating new causes of glyphosate herbicide resistance.
BayesPI-BAR: a new biophysical model for characterization of regulatory sequence variations
Wang, Junbai; Batmanov, Kirill
2015-01-01
Sequence variations in regulatory DNA regions are known to cause functionally important consequences for gene expression. DNA sequence variations may have an essential role in determining phenotypes and may be linked to disease; however, their identification through analysis of massive genome-wide sequencing data is a great challenge. In this work, a new computational pipeline, a Bayesian method for protein–DNA interaction with binding affinity ranking (BayesPI-BAR), is proposed for quantifying the effect of sequence variations on protein binding. BayesPI-BAR uses biophysical modeling of protein–DNA interactions to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). The method includes two new parameters (TF chemical potentials or protein concentrations and direct TF binding targets) that are neglected by previous methods. The new method is verified on 67 known human regulatory SNPs, of which 47 (70%) have predicted true TFs ranked in the top 10. Importantly, the performance of BayesPI-BAR, which uses principal component analysis to integrate multiple predictions from various TF chemical potentials, is found to be better than that of existing programs, such as sTRAP and is-rSNP, when evaluated on the same SNPs. BayesPI-BAR is a publicly available tool and is able to carry out parallelized computation, which helps to investigate a large number of TFs or SNPs and to detect disease-associated regulatory sequence variations in the sea of genome-wide noncoding regions. PMID:26202972
VaDiR: an integrated approach to Variant Detection in RNA.
Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy
2018-02-01
Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA(VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
Li, Zhigang; Hu, Songnian; Yao, Nan; Dean, Ralph A.; Zhao, Wensheng; Shen, Mi; Zhang, Haiwang; Li, Chao; Liu, Liyuan; Cao, Lei; Xu, Xiaowen; Xing, Yunfei; Hsiang, Tom; Zhang, Ziding; Xu, Jin-Rong; Peng, You-Liang
2012-01-01
Rice blast caused by Magnaporthe oryzae is one of the most destructive diseases of rice worldwide. The fungal pathogen is notorious for its ability to overcome host resistance. To better understand its genetic variation in nature, we sequenced the genomes of two field isolates, Y34 and P131. In comparison with the previously sequenced laboratory strain 70-15, both field isolates had a similar genome size but slightly more genes. Sequences from the field isolates were used to improve genome assembly and gene prediction of 70-15. Although the overall genome structure is similar, a number of gene families that are likely involved in plant-fungal interactions are expanded in the field isolates. Genome-wide analysis on asynonymous to synonymous nucleotide substitution rates revealed that many infection-related genes underwent diversifying selection. The field isolates also have hundreds of isolate-specific genes and a number of isolate-specific gene duplication events. Functional characterization of randomly selected isolate-specific genes revealed that they play diverse roles, some of which affect virulence. Furthermore, each genome contains thousands of loci of transposon-like elements, but less than 30% of them are conserved among different isolates, suggesting active transposition events in M. oryzae. A total of approximately 200 genes were disrupted in these three strains by transposable elements. Interestingly, transposon-like elements tend to be associated with isolate-specific or duplicated sequences. Overall, our results indicate that gain or loss of unique genes, DNA duplication, gene family expansion, and frequent translocation of transposon-like elements are important factors in genome variation of the rice blast fungus. PMID:22876203
Comparative Analysis of Genome Sequences Covering the Seven Cronobacter Species
Cummings, Craig A.; Shih, Rita; Degoricija, Lovorka; Rico, Alain; Brzoska, Pius; Hamby, Stephen E.; Masood, Naqash; Hariri, Sumyya; Sonbol, Hana; Chuzhanova, Nadia; McClelland, Michael; Furtado, Manohar R.; Forsythe, Stephen J.
2012-01-01
Background Species of Cronobacter are widespread in the environment and are occasional food-borne pathogens associated with serious neonatal diseases, including bacteraemia, meningitis, and necrotising enterocolitis. The genus is composed of seven species: C. sakazakii, C. malonaticus, C. turicensis, C. dublinensis, C. muytjensii, C. universalis, and C. condimenti. Clinical cases are associated with three species, C. malonaticus, C. turicensis and, in particular, with C. sakazakii multilocus sequence type 4. Thus, it is plausible that virulence determinants have evolved in certain lineages. Methodology/Principal Findings We generated high quality sequence drafts for eleven Cronobacter genomes representing the seven Cronobacter species, including an ST4 strain of C. sakazakii. Comparative analysis of these genomes together with the two publicly available genomes revealed Cronobacter has over 6,000 genes in one or more strains and over 2,000 genes shared by all Cronobacter. Considerable variation in the presence of traits such as type six secretion systems, metal resistance (tellurite, copper and silver), and adhesins were found. C. sakazakii is unique in the Cronobacter genus in encoding genes enabling the utilization of exogenous sialic acid which may have clinical significance. The C. sakazakii ST4 strain 701 contained additional genes as compared to other C. sakazakii but none of them were known specific virulence-related genes. Conclusions/Significance Genome comparison revealed that pair-wise DNA sequence identity varies between 89 and 97% in the seven Cronobacter species, and also suggested various degrees of divergence. Sets of universal core genes and accessory genes unique to each strain were identified. These gene sequences can be used for designing genus/species specific detection assays. Genes encoding adhesins, T6SS, and metal resistance genes as well as prophages are found in only subsets of genomes and have contributed considerably to the variation of genomic content. Differences in gene content likely contribute to differences in the clinical and environmental distribution of species and sequence types. PMID:23166675
Concerted evolution at the population level: pupfish HindIII satellite DNA sequences.
Elder, J F; Turner, B J
1994-01-01
The canonical monomers (approximately 170 bp) of an abundant (1.9 x 10(6) copies per diploid genome) satellite DNA sequence family in the genome of Cyprinodon variegatus, a "pupfish" that ranges along the Atlantic coast from Cape Cod to central Mexico, are divergent in base sequence in 10 of 12 samples collected from natural populations. The divergence involves substitutions, deletions, and insertions, is marked in scope (mean pairwise sequence similarity = 61.6%; range = 35-95.9%), is largely confined to the 3' half of the monomer, and is not correlated with the distance among collecting sites. Repetitive cloning and direct genomic sequencing experiments failed to detect intrapopulation and intraindividual variation, suggesting high levels of sequence homogeneity within populations. The satellite sequence has therefore undergone "concerted evolution," at the level of the local population. Concerted evolution has previously almost always been discussed in terms of the divergence of species or higher taxa; its intraspecific occurrence apparently has not been reported previously. The generality of the observation is difficult to evaluate, for although satellite DNAs from a large number of organisms have been studied in detail, there appear to be little or no other data on their sequence variation in natural populations. The relationship (if any) between concerted, population level, satellite DNA divergence and the extent of gene flow/genetic isolation among conspecific natural populations remains to be established. Images PMID:8302879
A Laboratory Exercise for Genotyping Two Human Single Nucleotide Polymorphisms
ERIC Educational Resources Information Center
Fernando, James; Carlson, Bradley; LeBard, Timothy; McCarthy, Michael; Umali, Finianne; Ashton, Bryce; Rose, Ferrill F., Jr.
2016-01-01
The dramatic decrease in the cost of sequencing a human genome is leading to an era in which a wide range of students will benefit from having an understanding of human genetic variation. Since over 90% of sequence variation between humans is in the form of single nucleotide polymorphisms (SNPs), a laboratory exercise has been devised in order to…
VARiD: a variation detection framework for color-space and letter-space platforms.
Dalca, Adrian V; Rumble, Stephen M; Levy, Samuel; Brudno, Michael
2010-06-15
High-throughput sequencing (HTS) technologies are transforming the study of genomic variation. The various HTS technologies have different sequencing biases and error rates, and while most HTS technologies sequence the residues of the genome directly, generating base calls for each position, the Applied Biosystem's SOLiD platform generates dibase-coded (color space) sequences. While combining data from the various platforms should increase the accuracy of variation detection, to date there are only a few tools that can identify variants from color space data, and none that can analyze color space and regular (letter space) data together. We present VARiD--a probabilistic method for variation detection from both letter- and color-space reads simultaneously. VARiD is based on a hidden Markov model and uses the forward-backward algorithm to accurately identify heterozygous, homozygous and tri-allelic SNPs, as well as micro-indels. Our analysis shows that VARiD performs better than the AB SOLiD toolset at detecting variants from color-space data alone, and improves the calls dramatically when letter- and color-space reads are combined. The toolset is freely available at http://compbio.cs.utoronto.ca/varid.
An improved genome assembly uncovers prolific tandem repeats in Atlantic cod.
Tørresen, Ole K; Star, Bastiaan; Jentoft, Sissel; Reinar, William B; Grove, Harald; Miller, Jason R; Walenz, Brian P; Knight, James; Ekholm, Jenny M; Peluso, Paul; Edvardsen, Rolf B; Tooming-Klunderud, Ave; Skage, Morten; Lien, Sigbjørn; Jakobsen, Kjetill S; Nederbragt, Alexander J
2017-01-18
The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enable the generation of more contiguous genome assemblies. By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusual high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21% of the TRs across the assembly, 19% in the promoter regions and 12% in the coding sequences are heterozygous in the sequenced individual. The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicate a considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.
Blanchard, Adam M; Jolley, Keith A; Maiden, Martin C J; Coffey, Tracey J; Maboni, Grazieli; Staley, Ceri E; Bollard, Nicola J; Warry, Andrew; Emes, Richard D; Davies, Peers L; Tötemeyer, Sabine
2018-01-01
Dichelobacter nodosus ( D. nodosus ) is the causative pathogen of ovine footrot, a disease that has a significant welfare and financial impact on the global sheep industry. Previous studies into the phylogenetics of D. nodosus have focused on Australia and Scandinavia, meaning the current diversity in the United Kingdom (U.K.) population and its relationship globally, is poorly understood. Numerous epidemiological methods are available for bacterial typing; however, few account for whole genome diversity or provide the opportunity for future application of new computational techniques. Multilocus sequence typing (MLST) measures nucleotide variations within several loci with slow accumulation of variation to enable the designation of allele numbers to determine a sequence type. The usage of whole genome sequence data enables the application of MLST, but also core and whole genome MLST for higher levels of strain discrimination with a negligible increase in experimental cost. An MLST database was developed alongside a seven loci scheme using publically available whole genome data from the sequence read archive. Sequence type designation and strain discrimination was compared to previously published data to ensure reproducibility. Multiple D. nodosus isolates from U.K. farms were directly compared to populations from other countries. The U.K. isolates define new clades within the global population of D. nodosus and predominantly consist of serogroups A, B and H, however serogroups C, D, E, and I were also found. The scheme is publically available at https://pubmlst.org/dnodosus/.
Mobile element biology – new possibilities with high-throughput sequencing
Xing, Jinchuan; Witherspoon, David J.; Jorde, Lynn B.
2014-01-01
Mobile elements compose more than half of the human genome, but until recently their large-scale detection was time-consuming and challenging. With the development of new high-throughput sequencing technologies, the complete spectrum of mobile element variation in humans can now be identified and analyzed. Thousands of new mobile element insertions have been discovered, yielding new insights into mobile element biology, evolution, and genomic variation. We review several high-throughput methods, with an emphasis on techniques that specifically target mobile element insertions in humans, and we highlight recent applications of these methods in evolutionary studies and in the analysis of somatic alterations in human cancers. PMID:23312846
Xu, Teng; Qin, Song; Hu, Yongwu; Song, Zhijian; Ying, Jianchao; Li, Peizhen; Dong, Wei; Zhao, Fangqing; Yang, Huanming; Bao, Qiyu
2016-01-01
Arthrospira platensis is a multi-cellular and filamentous non-N2-fixing cyanobacterium that is capable of performing oxygenic photosynthesis. In this study, we determined the nearly complete genome sequence of A. platensis YZ. A. platensis YZ genome is a single, circular chromosome of 6.62 Mb in size. Phylogenetic and comparative genomic analyses revealed that A. platensis YZ was more closely related to A. platensis NIES-39 than Arthrospira sp. PCC 8005 and A. platensis C1. Broad gene gains were identified between A. platensis YZ and three other Arthrospira speices, some of which have been previously demonstrated that can be laterally transferred among different species, such as restriction-modification systems-coding genes. Moreover, unprecedented extensive chromosomal rearrangements among different strains were observed. The chromosomal rearrangements, particularly the chromosomal inversions, were analysed and estimated to be closely related to palindromes that involved long inverted repeat sequences and the extensively distributed type IIR restriction enzyme in the Arthrospira genome. In addition, species from genus Arthrospira unanimously contained the highest rate of repetitive sequence compared with the other species of order Oscillatoriales, suggested that sequence duplication significantly contributed to Arthrospira genome phylogeny. These results provided in-depth views into the genomic phylogeny and structural variation of A. platensis, as well as provide a valuable resource for functional genomics studies. PMID:27330141
Complete Chloroplast Genome of the Wollemi Pine (Wollemia nobilis): Structure and Evolution.
Yap, Jia-Yee S; Rohner, Thore; Greenfield, Abigail; Van Der Merwe, Marlien; McPherson, Hannah; Glenn, Wendy; Kornfeld, Geoff; Marendy, Elessa; Pan, Annie Y H; Wilton, Alan; Wilkins, Marc R; Rossetto, Maurizio; Delaney, Sven K
2015-01-01
The Wollemi pine (Wollemia nobilis) is a rare Southern conifer with striking morphological similarity to fossil pines. A small population of W. nobilis was discovered in 1994 in a remote canyon system in the Wollemi National Park (near Sydney, Australia). This population contains fewer than 100 individuals and is critically endangered. Previous genetic studies of the Wollemi pine have investigated its evolutionary relationship with other pines in the family Araucariaceae, and have suggested that the Wollemi pine genome contains little or no variation. However, these studies were performed prior to the widespread use of genome sequencing, and their conclusions were based on a limited fraction of the Wollemi pine genome. In this study, we address this problem by determining the entire sequence of the W. nobilis chloroplast genome. A detailed analysis of the structure of the genome is presented, and the evolution of the genome is inferred by comparison with the chloroplast sequences of other members of the Araucariaceae and the related family Podocarpaceae. Pairwise alignments of whole genome sequences, and the presence of unique pseudogenes, gene duplications and insertions in W. nobilis and Araucariaceae, indicate that the W. nobilis chloroplast genome is most similar to that of its sister taxon Agathis. However, the W. nobilis genome contains an unusually high number of repetitive sequences, and these could be used in future studies to investigate and conserve any remnant genetic diversity in the Wollemi pine.
Ensembl Genomes 2016: more genomes, more complexity.
Kersey, Paul Julian; Allen, James E; Armean, Irina; Boddu, Sanjay; Bolt, Bruce J; Carvalho-Silva, Denise; Christensen, Mikkel; Davis, Paul; Falin, Lee J; Grabmueller, Christoph; Humphrey, Jay; Kerhornou, Arnaud; Khobova, Julia; Aranganathan, Naveen K; Langridge, Nicholas; Lowy, Ernesto; McDowall, Mark D; Maheswari, Uma; Nuhn, Michael; Ong, Chuang Kee; Overduin, Bert; Paulini, Michael; Pedro, Helder; Perry, Emily; Spudich, Giulietta; Tapanari, Electra; Walts, Brandon; Williams, Gareth; Tello-Ruiz, Marcela; Stein, Joshua; Wei, Sharon; Ware, Doreen; Bolser, Daniel M; Howe, Kevin L; Kulesha, Eugene; Lawson, Daniel; Maslen, Gareth; Staines, Daniel M
2016-01-04
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ensembl Genomes 2016: more genomes, more complexity
Kersey, Paul Julian; Allen, James E.; Armean, Irina; Boddu, Sanjay; Bolt, Bruce J.; Carvalho-Silva, Denise; Christensen, Mikkel; Davis, Paul; Falin, Lee J.; Grabmueller, Christoph; Humphrey, Jay; Kerhornou, Arnaud; Khobova, Julia; Aranganathan, Naveen K.; Langridge, Nicholas; Lowy, Ernesto; McDowall, Mark D.; Maheswari, Uma; Nuhn, Michael; Ong, Chuang Kee; Overduin, Bert; Paulini, Michael; Pedro, Helder; Perry, Emily; Spudich, Giulietta; Tapanari, Electra; Walts, Brandon; Williams, Gareth; Tello–Ruiz, Marcela; Stein, Joshua; Wei, Sharon; Ware, Doreen; Bolser, Daniel M.; Howe, Kevin L.; Kulesha, Eugene; Lawson, Daniel; Maslen, Gareth; Staines, Daniel M.
2016-01-01
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces. PMID:26578574
Chen, Shen-Bo; Wang, Yue; Kassegne, Kokouvi; Xu, Bin; Shen, Hai-Mo; Chen, Jun-Hu
2017-02-06
Currently in China, the trend of Plasmodium vivax cases imported from Southeast Asia was increased especially in the China-Myanmar border area. Driven by the increase in P. vivax cases and stronger need for vaccine and drug development, several P. vivax isolates genome sequencing projects are underway. However, little is known about the genetic variability in this area until now. The sequencing of the first P. vivax isolate from China-Myanmar border area (CMB-1) generated 120 million paired-end reads. A percentage of 10.6 of the quality-evaluated reads were aligned onto 99.9% of the reference strain Sal I genome in 62-fold coverage with an average of 4.8 SNPs per kb. We present a 539-SNP marker data set for P. vivax that can identify different parasites from different geographic origins with high sensitivity. We also identified exceptionally high levels of genetic variability in members of multigene families such as RBP, SERA, vir, MSP3 and AP2. The de-novo assembly yielded a database composed of 8,409 contigs with N50 lengths of 6.6 kb and revealed 661 novel predicted genes including 78 vir genes, suggesting a greater functional variation in P. vivax from this area. Our result contributes to a better understanding of P. vivax genetic variation, and provides a fundamental basis for the geographic differentiation of vivax malaria from China-Myanmar border area using a direct sequencing approach without leukocyte depletion. This novel sequencing method can be used as an essential tool for the genomic research of P. vivax in the near future.
Furuta, Mayuko; Ueno, Masaki; Fujimoto, Akihiro; Hayami, Shinya; Yasukawa, Satoru; Kojima, Fumiyoshi; Arihiro, Koji; Kawakami, Yoshiiku; Wardell, Christopher P; Shiraishi, Yuichi; Tanaka, Hiroko; Nakano, Kaoru; Maejima, Kazuhiro; Sasaki-Oku, Aya; Tokunaga, Naoki; Boroevich, Keith A; Abe, Tetsuo; Aikata, Hiroshi; Ohdan, Hideki; Gotoh, Kunihito; Kubo, Michiaki; Tsunoda, Tatsuhiko; Miyano, Satoru; Chayama, Kazuaki; Yamaue, Hiroki; Nakagawa, Hidewaki
2017-02-01
Patients with hepatocellular carcinoma (HCC) have a high-risk of multi-centric (MC) tumor occurrence due to a strong carcinogenic background in the liver. In addition, they have a high risk of intrahepatic metastasis (IM). Liver tumors withIM or MC are profoundly different in their development and clinical outcome. However, clinically or pathologically discriminating between IM and MC can be challenging. This study investigated whether IM or MC could be diagnosed at the molecular level. We performed whole genome and RNA sequencing analyses of 49 tumors including two extra-hepatic metastases, and one nodule-in-nodule tumor from 23 HCC patients. Sequencing-based molecular diagnosis using somatic single nucleotide variation information showed higher sensitivity compared to previous techniques due to the inclusion of a larger number of mutation events. This proved useful in cases, which showed inconsistent clinical diagnoses. In addition, whole genome sequencing offered advantages in profiling of other genetic alterations, such as structural variations, copy number alterations, and variant allele frequencies, and helped to confirm the IM/MCdiagnosis. Divergent alterations between IM tumors with sorafenib treatment, long time-intervals, or tumor-in-tumor nodules indicated high intra-tumor heterogeneity, evolution, and clonal switching of liver cancers. It is important to analyze the differences between IM tumors, in addition to IM/MC diagnosis, before selecting a therapeutic strategy for multiple tumors in the liver. Whole genome sequencing of multiple liver tumors enabled the accuratediagnosis ofmulti-centric occurrence and intrahepatic metastasis using somatic single nucleotide variation information. In addition, genetic discrepancies between tumors help us to understand the physical changes during recurrence and cancer spread. Copyright © 2016 European Association for the Study of the Liver. Published by Elsevier B.V. All rights reserved.
Abo, Ryan P; Ducar, Matthew; Garcia, Elizabeth P; Thorner, Aaron R; Rojas-Rudilla, Vanesa; Lin, Ling; Sholl, Lynette M; Hahn, William C; Meyerson, Matthew; Lindeman, Neal I; Van Hummelen, Paul; MacConaill, Laura E
2015-02-18
Genomic structural variation (SV), a common hallmark of cancer, has important predictive and therapeutic implications. However, accurately detecting SV using high-throughput sequencing data remains challenging, especially for 'targeted' resequencing efforts. This is critically important in the clinical setting where targeted resequencing is frequently being applied to rapidly assess clinically actionable mutations in tumor biopsies in a cost-effective manner. We present BreaKmer, a novel approach that uses a 'kmer' strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants. Relative to four publically available algorithms, BreaKmer detected SV with increased sensitivity and limited calls in non-tumor samples, key features for variant analysis of tumor specimens in both the clinical and research settings. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Barley whole exome capture: a tool for genomic research in the genus Hordeum and beyond
Mascher, Martin; Richmond, Todd A; Gerhardt, Daniel J; Himmelbach, Axel; Clissold, Leah; Sampath, Dharanya; Ayling, Sarah; Steuernagel, Burkhard; Pfeifer, Matthias; D'Ascenzo, Mark; Akhunov, Eduard D; Hedley, Pete E; Gonzales, Ana M; Morrell, Peter L; Kilian, Benjamin; Blattner, Frank R; Scholz, Uwe; Mayer, Klaus FX; Flavell, Andrew J; Muehlbauer, Gary J; Waugh, Robbie; Jeddeloh, Jeffrey A; Stein, Nils
2013-01-01
Advanced resources for genome-assisted research in barley (Hordeum vulgare) including a whole-genome shotgun assembly and an integrated physical map have recently become available. These have made possible studies that aim to assess genetic diversity or to isolate single genes by whole-genome resequencing and in silico variant detection. However such an approach remains expensive given the 5 Gb size of the barley genome. Targeted sequencing of the mRNA-coding exome reduces barley genomic complexity more than 50-fold, thus dramatically reducing this heavy sequencing and analysis load. We have developed and employed an in-solution hybridization-based sequence capture platform to selectively enrich for a 61.6 megabase coding sequence target that includes predicted genes from the genome assembly of the cultivar Morex as well as publicly available full-length cDNAs and de novo assembled RNA-Seq consensus sequence contigs. The platform provides a highly specific capture with substantial and reproducible enrichment of targeted exons, both for cultivated barley and related species. We show that this exome capture platform provides a clear path towards a broader and deeper understanding of the natural variation residing in the mRNA-coding part of the barley genome and will thus constitute a valuable resource for applications such as mapping-by-sequencing and genetic diversity analyzes. PMID:23889683
GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations
Paila, Umadevi; Chapman, Brad A.; Kirchner, Rory; Quinlan, Aaron R.
2013-01-01
Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics. PMID:23874191
Miller, John J; Eackles, Michael S.; Stauffer, Jay R; King, Timothy L.
2015-01-01
We characterized variation within the mitochondrial genomes of the invasive silver carp (Hypophthalmichthys molitrix) and bighead carp (H. nobilis) from the Mississippi River drainage by mapping our Next-Generation sequences to their publicly available genomes. Variant detection resulted in 338 single-nucleotide polymorphisms for H. molitrix and 39 for H. nobilis. The much greater genetic variation in H. molitrix mitochondria relative to H. nobilis may be indicative of a greater North American female effective population size of the former. When variation was quantified by gene, many tRNA loci appear to have little or no variability based on our results whereas protein-coding regions were more frequently polymorphic. These results provide biologists with additional regions of DNA to be used as markers to study the invasion dynamics of these species.
Berg, Ingrid L; Neumann, Rita; Lam, Kwan-Wood G; Sarbajna, Shriparna; Odenthal-Hesse, Linda; May, Celia A; Jeffreys, Alec J
2010-10-01
PRDM9 has recently been identified as a likely trans regulator of meiotic recombination hot spots in humans and mice. PRDM9 contains a zinc finger array that, in humans, can recognize a short sequence motif associated with hot spots, with binding to this motif possibly triggering hot-spot activity via chromatin remodeling. We now report that human genetic variation at the PRDM9 locus has a strong effect on sperm hot-spot activity, even at hot spots lacking the sequence motif. Subtle changes within the zinc finger array can create hot-spot nonactivating or enhancing variants and can even trigger the appearance of a new hot spot, suggesting that PRDM9 is a major global regulator of hot spots in humans. Variation at the PRDM9 locus also influences aspects of genome instability-specifically, a megabase-scale rearrangement underlying two genomic disorders as well as minisatellite instability-implicating PRDM9 as a risk factor for some pathological genome rearrangements.
Berg, Ingrid L.; Neumann, Rita; Lam, Kwan-Wood G.; Sarbajna, Shriparna; Odenthal-Hesse, Linda; May, Celia A.; Jeffreys, Alec J.
2011-01-01
PRDM9 has recently been identified as a likely trans-regulator of meiotic recombination hot spots in humans and mice1-3. The protein contains a zinc finger array that in humans can recognise a short sequence motif associated with hot spots4, with binding to this motif possibly triggering hot-spot activity via chromatin remodelling5. We now show that variation in the zinc finger array in humans has a profound effect on sperm hot-spot activity, even at hot spots lacking the sequence motif. Very subtle changes within the array can create hot-spot non-activating and enhancing alleles, and even trigger the appearance of a new hot spot. PRDM9 thus appears to be the preeminent global regulator of hot spots in humans. Variation at this locus also influences aspects of genome instability, specifically a megabase-scale rearrangement underlying two genomic disorders6 as well as minisatellite instability7, implicating PRDM9 as a risk factor for some pathological genome rearrangements. PMID:20818382
Hitomi, Yuki; Tokunaga, Katsushi
2017-01-01
Human genome variation may cause differences in traits and disease risks. Disease-causal/susceptible genes and variants for both common and rare diseases can be detected by comprehensive whole-genome analyses, such as whole-genome sequencing (WGS), using next-generation sequencing (NGS) technology and genome-wide association studies (GWAS). Here, in addition to the application of an NGS as a whole-genome analysis method, we summarize approaches for the identification of functional disease-causal/susceptible variants from abundant genetic variants in the human genome and methods for evaluating their functional effects in human diseases, using an NGS and in silico and in vitro functional analyses. We also discuss the clinical applications of the functional disease causal/susceptible variants to personalized medicine.
LINE-1 Elements in Structural Variation and Disease
Beck, Christine R.; Garcia-Perez, José Luis; Badge, Richard M.; Moran, John V.
2014-01-01
The completion of the human genome reference sequence ushered in a new era for the study and discovery of human transposable elements. It now is undeniable that transposable elements, historically dismissed as junk DNA, have had an instrumental role in sculpting the structure and function of our genomes. In particular, long interspersed element-1 (LINE-1 or L1) and short interspersed elements (SINEs) continue to affect our genome, and their movement can lead to sporadic cases of disease. Here, we briefly review the types of transposable elements present in the human genome and their mechanisms of mobility. We next highlight how advances in DNA sequencing and genomic technologies have enabled the discovery of novel retrotransposons in individual genomes. Finally, we discuss how L1-mediated retrotransposition events impact human genomes. PMID:21801021
UCSC genome browser: deep support for molecular biomedical research.
Mangan, Mary E; Williams, Jennifer M; Lathe, Scott M; Karolchik, Donna; Lathe, Warren C
2008-01-01
The volume and complexity of genomic sequence data, and the additional experimental data required for annotation of the genomic context, pose a major challenge for display and access for biomedical researchers. Genome browsers organize this data and make it available in various ways to extract useful information to advance research projects. The UCSC Genome Browser is one of these resources. The official sequence data for a given species forms the framework to display many other types of data such as expression, variation, cross-species comparisons, and more. Visual representations of the data are available for exploration. Data can be queried with sequences. Complex database queries are also easily achieved with the Table Browser interface. Associated tools permit additional query types or access to additional data sources such as images of in situ localizations. Support for solving researcher's issues is provided with active discussion mailing lists and by providing updated training materials. The UCSC Genome Browser provides a source of deep support for a wide range of biomedical molecular research (http://genome.ucsc.edu).
Applications of the 1000 Genomes Project resources
Zheng-Bradley, Xiangqun
2017-01-01
Abstract The 1000 Genomes Project created a valuable, worldwide reference for human genetic variation. Common uses of the 1000 Genomes dataset include genotype imputation supporting Genome-wide Association Studies, mapping expression Quantitative Trait Loci, filtering non-pathogenic variants from exome, whole genome and cancer genome sequencing projects, and genetic analysis of population structure and molecular evolution. In this article, we will highlight some of the multiple ways that the 1000 Genomes data can be and has been utilized for genetic studies. PMID:27436001
Genetic Variation in the Acorn Barnacle from Allozymes to Population Genomics
Flight, Patrick A.; Rand, David M.
2012-01-01
Understanding the patterns of genetic variation within and among populations is a central problem in population and evolutionary genetics. We examine this question in the acorn barnacle, Semibalanus balanoides, in which the allozyme loci Mpi and Gpi have been implicated in balancing selection due to varying selective pressures at different spatial scales. We review the patterns of genetic variation at the Mpi locus, compare this to levels of population differentiation at mtDNA and microsatellites, and place these data in the context of genome-wide variation from high-throughput sequencing of population samples spanning the North Atlantic. Despite considerable geographic variation in the patterns of selection at the Mpi allozyme, this locus shows rather low levels of population differentiation at ecological and trans-oceanic scales (FST ∼ 5%). Pooled population sequencing was performed on samples from Rhode Island (RI), Maine (ME), and Southwold, England (UK). Analysis of more than 650 million reads identified approximately 335,000 high-quality SNPs in 19 million base pairs of the S. balanoides genome. Much variation is shared across the Atlantic, but there are significant examples of strong population differentiation among samples from RI, ME, and UK. An FST outlier screen of more than 22,000 contigs provided a genome-wide context for interpretation of earlier studies on allozymes, mtDNA, and microsatellites. FST values for allozymes, mtDNA and microsatellites are close to the genome-wide average for random SNPs, with the exception of the trans-Atlantic FST for mtDNA. The majority of FST outliers were unique between individual pairs of populations, but some genes show shared patterns of excess differentiation. These data indicate that gene flow is high, that selection is strong on a subset of genes, and that a variety of genes are experiencing diversifying selection at large spatial scales. This survey of polymorphism in S. balanoides provides a number of genomic tools that promise to make this a powerful model for ecological genomics of the rocky intertidal. PMID:22767487
Gold nanoparticles for high-throughput genotyping of long-range haplotypes
NASA Astrophysics Data System (ADS)
Chen, Peng; Pan, Dun; Fan, Chunhai; Chen, Jianhua; Huang, Ke; Wang, Dongfang; Zhang, Honglu; Li, You; Feng, Guoyin; Liang, Peiji; He, Lin; Shi, Yongyong
2011-10-01
Completion of the Human Genome Project and the HapMap Project has led to increasing demands for mapping complex traits in humans to understand the aetiology of diseases. Identifying variations in the DNA sequence, which affect how we develop disease and respond to pathogens and drugs, is important for this purpose, but it is difficult to identify these variations in large sample sets. Here we show that through a combination of capillary sequencing and polymerase chain reaction assisted by gold nanoparticles, it is possible to identify several DNA variations that are associated with age-related macular degeneration and psoriasis on significant regions of human genomic DNA. Our method is accurate and promising for large-scale and high-throughput genetic analysis of susceptibility towards disease and drug resistance.
GENOMIC BASIS OF AGING AND LIFE HISTORY EVOLUTION IN DROSOPHILA MELANOGASTER
Remolina, Silvia C.; Chang, Peter L.; Leips, Jeff; Nuzhdin, Sergey V.; Hughes, Kimberly A.
2015-01-01
Natural diversity in aging and other life history patterns is a hallmark of organismal variation. Related species, populations, and individuals within populations show genetically based variation in life span and other aspects of age-related performance. Population differences are especially informative because these differences can be large relative to within-population variation and because they occur in organisms with otherwise similar genomes. We used experimental evolution to produce populations divergent for life span and late-age fertility and then used deep genome sequencing to detect sequence variants with nucleotide-level resolution. Several genes and genome regions showed strong signatures of selection, and the same regions were implicated in independent comparisons, suggesting that the same alleles were selected in replicate lines. Genes related to oogenesis, immunity, and protein degradation were implicated as important modifiers of late-life performance. Expression profiling and functional annotation narrowed the list of strong candidate genes to 38, most of which are novel candidates for regulating aging. Life span and early-age fecundity were negatively correlated among populations; therefore the alleles we identified also are candidate regulators of a major life-history trade-off. More generally, we argue that hitchhiking mapping can be a powerful tool for uncovering the molecular bases of quantitative genetic variation. PMID:23106705
The BIG Data Center: from deposition to integration to translation
2017-01-01
Biological data are generated at unprecedentedly exponential rates, posing considerable challenges in big data deposition, integration and translation. The BIG Data Center, established at Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, provides a suite of database resources, including (i) Genome Sequence Archive, a data repository specialized for archiving raw sequence reads, (ii) Gene Expression Nebulas, a data portal of gene expression profiles based entirely on RNA-Seq data, (iii) Genome Variation Map, a comprehensive collection of genome variations for featured species, (iv) Genome Warehouse, a centralized resource housing genome-scale data with particular focus on economically important animals and plants, (v) Methylation Bank, an integrated database of whole-genome single-base resolution methylomes and (vi) Science Wikis, a central access point for biological wikis developed for community annotations. The BIG Data Center is dedicated to constructing and maintaining biological databases through big data integration and value-added curation, conducting basic research to translate big data into big knowledge and providing freely open access to a variety of data resources in support of worldwide research activities in both academia and industry. All of these resources are publicly available and can be found at http://bigd.big.ac.cn. PMID:27899658
DOE Office of Scientific and Technical Information (OSTI.GOV)
Muchero, Wellington; Labbe, Jessy L; Priya, Ranjan
2014-01-01
To date, Populus ranks among a few plant species with a complete genome sequence and other highly developed genomic resources. With the first genome sequence among all tree species, Populus has been adopted as a suitable model organism for genomic studies in trees. However, far from being just a model species, Populus is a key renewable economic resource that plays a significant role in providing raw materials for the biofuel and pulp and paper industries. Therefore, aside from leading frontiers of basic tree molecular biology and ecological research, Populus leads frontiers in addressing global economic challenges related to fuel andmore » fiber production. The latter fact suggests that research aimed at improving quality and quantity of Populus as a raw material will likely drive the pursuit of more targeted and deeper research in order to unlock the economic potential tied in molecular biology processes that drive this tree species. Advances in genome sequence-driven technologies, such as resequencing individual genotypes, which in turn facilitates large scale SNP discovery and identification of large scale polymorphisms are key determinants of future success in these initiatives. In this treatise we discuss implications of genome sequence-enable technologies on Populus genomic and genetic studies of complex and specialized-traits.« less
Lado, Bettina; Matus, Ivan; Rodríguez, Alejandra; Inostroza, Luis; Poland, Jesse; Belzile, François; del Pozo, Alejandro; Quincke, Martín; Castro, Marina; von Zitzewitz, Jarislav
2013-12-09
In crop breeding, the interest of predicting the performance of candidate cultivars in the field has increased due to recent advances in molecular breeding technologies. However, the complexity of the wheat genome presents some challenges for applying new technologies in molecular marker identification with next-generation sequencing. We applied genotyping-by-sequencing, a recently developed method to identify single-nucleotide polymorphisms, in the genomes of 384 wheat (Triticum aestivum) genotypes that were field tested under three different water regimes in Mediterranean climatic conditions: rain-fed only, mild water stress, and fully irrigated. We identified 102,324 single-nucleotide polymorphisms in these genotypes, and the phenotypic data were used to train and test genomic selection models intended to predict yield, thousand-kernel weight, number of kernels per spike, and heading date. Phenotypic data showed marked spatial variation. Therefore, different models were tested to correct the trends observed in the field. A mixed-model using moving-means as a covariate was found to best fit the data. When we applied the genomic selection models, the accuracy of predicted traits increased with spatial adjustment. Multiple genomic selection models were tested, and a Gaussian kernel model was determined to give the highest accuracy. The best predictions between environments were obtained when data from different years were used to train the model. Our results confirm that genotyping-by-sequencing is an effective tool to obtain genome-wide information for crops with complex genomes, that these data are efficient for predicting traits, and that correction of spatial variation is a crucial ingredient to increase prediction accuracy in genomic selection models.
Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies
Sundquist, Andreas; Ronaghi, Mostafa; Tang, Haixu; Pevzner, Pavel; Batzoglou, Serafim
2007-01-01
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology. PMID:17534434
[The human variome project and its progress].
Gao, Shan; Zhang, Ning; Zhang, Lei; Duan, Guang-You; Zhang, Tao
2010-11-01
The main goal of post genomics is to explain how the genome, the map of which has been constructed in the Human Genome Project, affacts activities of life. This leads to generate multiple "omics": structural genomics, functional genomics, proteomics, metabonomics, et al. In Jun. 2006, Melbourne, Australia, Human Genome Variation Society (HGVS) initiated the Human Variome Project (HVP) to collect all the sequence variation and polymorphism data worldwidely. HVP is to search and determine those mutations related with human diseases by association study between genetype and phenotype on the scale of genome level and other methods. Those results will be translated into clinical application. Considering the potential effects of this project on human health, this paper introduced its origin and main content in detail and discussed its meaning and prospect.
Xia, Shu; Kohli, Manish; Du, Meijun; Dittmar, Rachel L; Lee, Adam; Nandy, Debashis; Yuan, Tiezheng; Guo, Yongchen; Wang, Yuan; Tschannen, Michael R; Worthey, Elizabeth; Jacob, Howard; See, William; Kilari, Deepak; Wang, Xuexia; Hovey, Raymond L; Huang, Chiang-Ching; Wang, Liang
2015-06-30
Liquid biopsies, examinations of tumor components in body fluids, have shown promise for predicting clinical outcomes. To evaluate tumor-associated genomic and genetic variations in plasma cell-free DNA (cfDNA) and their associations with treatment response and overall survival, we applied whole genome and targeted sequencing to examine the plasma cfDNAs derived from 20 patients with advanced prostate cancer. Sequencing-based genomic abnormality analysis revealed locus-specific gains or losses that were common in prostate cancer, such as 8q gains, AR amplifications, PTEN losses and TMPRSS2-ERG fusions. To estimate tumor burden in cfDNA, we developed a Plasma Genomic Abnormality (PGA) score by summing the most significant copy number variations. Cox regression analysis showed that PGA scores were significantly associated with overall survival (p < 0.04). After androgen deprivation therapy or chemotherapy, targeted sequencing showed significant mutational profile changes in genes involved in androgen biosynthesis, AR activation, DNA repair, and chemotherapy resistance. These changes may reflect the dynamic evolution of heterozygous tumor populations in response to these treatments. These results strongly support the feasibility of using non-invasive liquid biopsies as potential tools to study biological mechanisms underlying therapy-specific resistance and to predict disease progression in advanced prostate cancer.
Streamlined Genome Sequence Compression using Distributed Source Coding
Wang, Shuang; Jiang, Xiaoqian; Chen, Feng; Cui, Lijuan; Cheng, Samuel
2014-01-01
We aim at developing a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require heavy client (encoder side) cannot be applied. To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol to meet the low-complexity need at the client side. Based on the variation between source and reference, our protocol will pick adaptively either syndrome coding or hash coding to compress subsequences of changing code length. Our experimental results showed promising performance of the proposed method when compared with the state-of-the-art algorithm (GRS). PMID:25520552
The 1000 Genomes Project: new opportunities for research and social challenges
2010-01-01
The 1000 Genomes Project, an international collaboration, is sequencing the whole genome of approximately 2,000 individuals from different worldwide populations. The central goal of this project is to describe most of the genetic variation that occurs at a population frequency greater than 1%. The results of this project will allow scientists to identify genetic variation at an unprecedented degree of resolution and will also help improve the imputation methods for determining unobserved genetic variants that are not represented on current genotyping arrays. By identifying novel or rare functional genetic variants, researchers will be able to pinpoint disease-causing genes in genomic regions initially identified by association studies. This level of detailed sequence information will also improve our knowledge of the evolutionary processes and the genomic patterns that have shaped the human species as we know it today. The new data will also lay the foundation for future clinical applications, such as prediction of disease susceptibility and drug response. However, the forthcoming availability of whole genome sequences at affordable prices will raise ethical concerns and pose potential threats to individual privacy. Nevertheless, we believe that these potential risks are outweighed by the benefits in terms of diagnosis and research, so long as rigorous safeguards are kept in place through legislation that prevents discrimination on the basis of the results of genetic testing. PMID:20193048
USDA-ARS?s Scientific Manuscript database
The identification of specific genes underlying phenotypic variation of complex traits remains one of the greatest challenges in biology despite having genome sequences and more powerful tools. Most genome-wide screens lack sufficient resolving power as they typically depend on linkage. One altern...
Centromere reference models for human chromosomes X and Y satellite arrays
Miga, Karen H.; Newton, Yulia; Jain, Miten; Altemose, Nicolas; Willard, Huntington F.; Kent, W. James
2014-01-01
The human genome sequence remains incomplete, with multimegabase-sized gaps representing the endogenous centromeres and other heterochromatic regions. Available sequence-based studies within these sites in the genome have demonstrated a role in centromere function and chromosome pairing, necessary to ensure proper chromosome segregation during cell division. A common genomic feature of these regions is the enrichment of long arrays of near-identical tandem repeats, known as satellite DNAs, which offer a limited number of variant sites to differentiate individual repeat copies across millions of bases. This substantial sequence homogeneity challenges available assembly strategies and, as a result, centromeric regions are omitted from ongoing genomic studies. To address this problem, we utilize monomer sequence and ordering information obtained from whole-genome shotgun reads to model two haploid human satellite arrays on chromosomes X and Y, resulting in an initial characterization of 3.83 Mb of centromeric DNA within an individual genome. To further expand the utility of each centromeric reference sequence model, we evaluate sites within the arrays for short-read mappability and chromosome specificity. Because satellite DNAs evolve in a concerted manner, we use these centromeric assemblies to assess the extent of sequence variation among 366 individuals from distinct human populations. We thus identify two satellite array variants in both X and Y centromeres, as determined by array length and sequence composition. This study provides an initial sequence characterization of a regional centromere and establishes a foundation to extend genomic characterization to these sites as well as to other repeat-rich regions within complex genomes. PMID:24501022
Zhou, Daniel X M; Chan, Paul K S; Zhang, Tiejun; Tully, Damien C; Tam, John S
2010-10-01
Studies on the association between sequence variability of the interferon sensitivity-determining region (ISDR) of hepatitis C virus and the outcome of treatment have reached conflicting results. In this study, 25 patients infected with HCV 6a who had received interferon-alpha/ribavirin combination treatment were analyzed for the sequence variations. 14 of them had the full genome sequences obtained from a previous study, whereas the other 11 samples were sequenced for the extended ISDR (eISDR). This eISDR fragment covers 192 bp (64 amino acids) upstream and 201 bp (67 amino acids) downstream from the ISDR previously defined for HCV 1b. The comparison between interferon-alpha resistance and response groups for the amino acid mutations located in the full genome (6 and 8 patients respectively) as well as the mutations located in the eISDR (10 and 15 patients respectively) showed that the mutations I2160V, I2256V, V2292I (P<0.05) within eISDR were significantly associated with resistance to treatment. However, the extent of amino acid variations within previously defined ISDR was not associated with resistance to treatment as previously reported. Four amino acid variations I248V (P=0.03-0.06) within E1, R445K (P=0.02-0.05) and S747T (P=0.03) within E2, I861V (P=0.01) within NS2 which located outside the eISDR may also associate with treatment outcome as identified by a prescreening of variations within 14 HCV 6a full genomes. (c) 2010 Elsevier B.V. All rights reserved.
High levels of variation in Salix lignocellulose genes revealed using poplar genomic resources
2013-01-01
Background Little is known about the levels of variation in lignin or other wood related genes in Salix, a genus that is being increasingly used for biomass and biofuel production. The lignin biosynthesis pathway is well characterized in a number of species, including the model tree Populus. We aimed to transfer the genomic resources already available in Populus to its sister genus Salix to assess levels of variation within genes involved in wood formation. Results Amplification trials for 27 gene regions were undertaken in 40 Salix taxa. Twelve of these regions were sequenced. Alignment searches of the resulting sequences against reference databases, combined with phylogenetic analyses, showed the close similarity of these Salix sequences to Populus, confirming homology of the primer regions and indicating a high level of conservation within the wood formation genes. However, all sequences were found to vary considerably among Salix species, mainly as SNPs with a smaller number of insertions-deletions. Between 25 and 176 SNPs per kbp per gene region (in predicted exons) were discovered within Salix. Conclusions The variation found is sizeable but not unexpected as it is based on interspecific and not intraspecific comparison; it is comparable to interspecific variation in Populus. The characterisation of genetic variation is a key process in pre-breeding and for the conservation and exploitation of genetic resources in Salix. This study characterises the variation in several lignocellulose gene markers for such purposes. PMID:23924375
Bouchez, Valérie; Guglielmini, Julien; Dazas, Mélody; Landier, Annie; Toubiana, Julie; Guillot, Sophie; Criscuolo, Alexis; Brisse, Sylvain
2018-06-01
Bordetella pertussis causes whooping cough, a highly contagious respiratory disease that is reemerging in many world regions. The spread of antigen-deficient strains may threaten acellular vaccine efficacy. Dynamics of strain transmission are poorly defined because of shortcomings in current strain genotyping methods. Our objective was to develop a whole-genome genotyping strategy with sufficient resolution for local epidemiologic questions and sufficient reproducibility to enable international comparisons of clinical isolates. We defined a core genome multilocus sequence typing scheme comprising 2,038 loci and demonstrated its congruence with whole-genome single-nucleotide polymorphism variation. Most cases of intrafamilial groups of isolates or of multiple isolates recovered from the same patient were distinguished from temporally and geographically cocirculating isolates. However, epidemiologically unrelated isolates were sometimes nearly undistinguishable. We set up a publicly accessible core genome multilocus sequence typing database to enable global comparisons of B. pertussis isolates, opening the way for internationally coordinated surveillance.
Eastman, Alexander W; Heinrichs, David E; Yuan, Ze-Chun
2014-10-03
Members of the genus Paenibacillus are important plant growth-promoting rhizobacteria that can serve as bio-reactors. Paenibacillus polymyxa promotes the growth of a variety of economically important crops. Our lab recently completed the genome sequence of Paenibacillus polymyxa CR1. As of January 2014, four P. polymyxa genomes have been completely sequenced but no comparative genomic analyses have been reported. Here we report the comparative and genetic analyses of four sequenced P. polymyxa genomes, which revealed a significantly conserved core genome. Complex metabolic pathways and regulatory networks were highly conserved and allow P. polymyxa to rapidly respond to dynamic environmental cues. Genes responsible for phytohormone synthesis, phosphate solubilization, iron acquisition, transcriptional regulation, σ-factors, stress responses, transporters and biomass degradation were well conserved, indicating an intimate association with plant hosts and the rhizosphere niche. In addition, genes responsible for antimicrobial resistance and non-ribosomal peptide/polyketide synthesis are present in both the core and accessory genome of each strain. Comparative analyses also reveal variations in the accessory genome, including large plasmids present in strains M1 and SC2. Furthermore, a considerable number of strain-specific genes and genomic islands are irregularly distributed throughout each genome. Although a variety of plant-growth promoting traits are encoded by all strains, only P. polymyxa CR1 encodes the unique nitrogen fixation cluster found in other Paenibacillus sp. Our study revealed that genomic loci relevant to host interaction and ecological fitness are highly conserved within the P. polymyxa genomes analysed, despite variations in the accessory genome. This work suggets that plant-growth promotion by P. polymyxa is mediated largely through phytohormone production, increased nutrient availability and bio-control mechanisms. This study provides an in-depth understanding of the genome architecture of this species, thus facilitating future genetic engineering and applications in agriculture, industry and medicine. Furthermore, this study highlights the current gap in our understanding of complex plant biomass metabolism in Gram-positive bacteria.
Genomic architecture of adaptive color pattern divergence and convergence in Heliconius butterflies
Supple, Megan A.; Hines, Heather M.; Dasmahapatra, Kanchon K.; Lewis, James J.; Nielsen, Dahlia M.; Lavoie, Christine; Ray, David A.; Salazar, Camilo; McMillan, W. Owen; Counterman, Brian A.
2013-01-01
Identifying the genetic changes driving adaptive variation in natural populations is key to understanding the origins of biodiversity. The mosaic of mimetic wing patterns in Heliconius butterflies makes an excellent system for exploring adaptive variation using next-generation sequencing. In this study, we use a combination of techniques to annotate the genomic interval modulating red color pattern variation, identify a narrow region responsible for adaptive divergence and convergence in Heliconius wing color patterns, and explore the evolutionary history of these adaptive alleles. We use whole genome resequencing from four hybrid zones between divergent color pattern races of Heliconius erato and two hybrid zones of the co-mimic Heliconius melpomene to examine genetic variation across 2.2 Mb of a partial reference sequence. In the intergenic region near optix, the gene previously shown to be responsible for the complex red pattern variation in Heliconius, population genetic analyses identify a shared 65-kb region of divergence that includes several sites perfectly associated with phenotype within each species. This region likely contains multiple cis-regulatory elements that control discrete expression domains of optix. The parallel signatures of genetic differentiation in H. erato and H. melpomene support a shared genetic architecture between the two distantly related co-mimics; however, phylogenetic analysis suggests mimetic patterns in each species evolved independently. Using a combination of next-generation sequencing analyses, we have refined our understanding of the genetic architecture of wing pattern variation in Heliconius and gained important insights into the evolution of novel adaptive phenotypes in natural populations. PMID:23674305
Liu, Long; Zheng, Chuan Wei; Wang, De He; Hou, Zhuo Cheng; Ning, Zhong Hua
2015-01-01
Eggshell damages lead to economic losses in the egg production industry and are a threat to human health. We examined 49-wk-old Rhode Island White hens (Gallus gallus) that laid eggs having shells with significantly different strengths and thicknesses. We used HiSeq 2000 (Illumina) sequencing to characterize the chicken transcriptome and whole genome to identify the key genes and genetic mutations associated with eggshell calcification. We identified a total of 14,234 genes expressed in the chicken uterus, representing 89% of all annotated chicken genes. A total of 889 differentially expressed genes were identified by comparing low eggshell strength (LES) and normal eggshell strength (NES) genomes. The DEGs are enriched in calcification-related processes, including calcium ion transport and calcium signaling pathways as reveled by gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis. Some important matrix proteins, such as OC-116, LTF and SPP1, were also expressed differentially between two groups. A total of 3,671,919 single-nucleotide polymorphisms (SNPs) and 508,035 Indels were detected in protein coding genes by whole-genome re-sequencing, including 1775 non-synonymous variations and 19 frame-shift Indels in DEGs. SNPs and Indels found in this study could be further investigated for eggshell traits. This is the first report to integrate the transcriptome and genome re-sequencing to target the genetic variations which decreased the eggshell qualities. These findings further advance our understanding of eggshell calcification in the chicken uterus. PMID:25974068
Australian wild rice reveals pre-domestication origin of polymorphism deserts in rice genome.
Krishnan S, Gopala; Waters, Daniel L E; Henry, Robert J
2014-01-01
Rice is a major source of human food with a predominantly Asian production base. Domestication involved selection of traits that are desirable for agriculture and to human consumers. Wild relatives of crop plants are a source of useful variation which is of immense value for crop improvement. Australian wild rices have been isolated from the impacts of domestication in Asia and represents a source of novel diversity for global rice improvement. Oryza rufipogon is a perennial wild progenitor of cultivated rice. Oryza meridionalis is a related annual species in Australia. We have examined the sequence of the genomes of AA genome wild rices from Australia that are close relatives of cultivated rice through whole genome re-sequencing. Assembly of the resequencing data to the O. sativa ssp. japonica cv. Nipponbare shows that Australian wild rices possess 2.5 times more single nucleotide polymorphisms than in the Asian wild rice and cultivated O. sativa ssp. indica. Analysis of the genome of domesticated rice reveals regions of low diversity that show very little variation (polymorphism deserts). Both the perennial and annual wild rice from Australia show a high degree of conservation of sequence with that found in cultivated rice in the same 4.58 Mbp region on chromosome 5, which suggests that some of the 'polymorphism deserts' in this and other parts of the rice genome may have originated prior to domestication due to natural selection. Analysis of genes in the 'polymorphism deserts' indicates that this selection may have been due to biotic or abiotic stress in the environment of early rice relatives. Despite having closely related sequences in these genome regions, the Australian wild populations represent an invaluable source of diversity supporting rice food security.
A 1000 Arab genome project to study the Emirati population.
Al-Ali, Mariam; Osman, Wael; Tay, Guan K; AlSafar, Habiba S
2018-04-01
Discoveries from the human genome, HapMap, and 1000 genome projects have collectively contributed toward the creation of a catalog of human genetic variations that has improved our understanding of human diversity. Despite the collegial nature of many of these genome study consortiums, which has led to the cataloging of genetic variations of different ethnic groups from around the world, genome data on the Arab population remains overwhelmingly underrepresented. The National Arab Genome project in the United Arab Emirates (UAE) aims to address this deficiency by using Next Generation Sequencing (NGS) technology to provide data to improve our understanding of the Arab genome and catalog variants that are unique to the Arab population of the UAE. The project was conceived to shed light on the similarities and differences between the Arab genome and those of the other ethnic groups.
Li, Chunmei; Yu, Zhilong; Fu, Yusi; Pang, Yuhong; Huang, Yanyi
2017-04-26
We develop a novel single-cell-based platform through digital counting of amplified genomic DNA fragments, named multifraction amplification (mfA), to detect the copy number variations (CNVs) in a single cell. Amplification is required to acquire genomic information from a single cell, while introducing unavoidable bias. Unlike prevalent methods that directly infer CNV profiles from the pattern of sequencing depth, our mfA platform denatures and separates the DNA molecules from a single cell into multiple fractions of a reaction mix before amplification. By examining the sequencing result of each fraction for a specific fragment and applying a segment-merge maximum likelihood algorithm to the calculation of copy number, we digitize the sequencing-depth-based CNV identification and thus provide a method that is less sensitive to the amplification bias. In this paper, we demonstrate a mfA platform through multiple displacement amplification (MDA) chemistry. When performing the mfA platform, the noise of MDA is reduced; therefore, the resolution of single-cell CNV identification can be improved to 100 kb. We can also determine the genomic region free of allelic drop-out with mfA platform, which is impossible for conventional single-cell amplification methods.
Aversano, Riccardo; Contaldi, Felice; Ercolano, Maria Raffaella; Grosso, Valentina; Iorizzo, Massimo; Tatino, Filippo; Xumerle, Luciano; Dal Molin, Alessandra; Avanzato, Carla; Ferrarini, Alberto; Delledonne, Massimo; Sanseverino, Walter; Cigliano, Riccardo Aiese; Capella-Gutierrez, Salvador; Gabaldón, Toni; Frusciante, Luigi; Bradeen, James M.; Carputo, Domenico
2015-01-01
Here, we report the draft genome sequence of Solanum commersonii, which consists of ∼830 megabases with an N50 of 44,303 bp anchored to 12 chromosomes, using the potato (Solanum tuberosum) genome sequence as a reference. Compared with potato, S. commersonii shows a striking reduction in heterozygosity (1.5% versus 53 to 59%), and differences in genome sizes were mainly due to variations in intergenic sequence length. Gene annotation by ab initio prediction supported by RNA-seq data produced a catalog of 1703 predicted microRNAs, 18,882 long noncoding RNAs of which 20% are shown to target cold-responsive genes, and 39,290 protein-coding genes with a significant repertoire of nonredundant nucleotide binding site-encoding genes and 126 cold-related genes that are lacking in S. tuberosum. Phylogenetic analyses indicate that domesticated potato and S. commersonii lineages diverged ∼2.3 million years ago. Three duplication periods corresponding to genome enrichment for particular gene families related to response to salt stress, water transport, growth, and defense response were discovered. The draft genome sequence of S. commersonii substantially increases our understanding of the domesticated germplasm, facilitating translation of acquired knowledge into advances in crop stability in light of global climate and environmental changes. PMID:25873387
Assembly: a resource for assembled genomes at NCBI
Kitts, Paul A.; Church, Deanna M.; Thibaud-Nissen, Françoise; Choi, Jinna; Hem, Vichet; Sapojnikov, Victor; Smith, Robert G.; Tatusova, Tatiana; Xiang, Charlie; Zherikov, Andrey; DiCuccio, Michael; Murphy, Terence D.; Pruitt, Kim D.; Kimchi, Avi
2016-01-01
The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site. PMID:26578580
Using genomics to characterize evolutionary potential for conservation of wild populations
Harrisson, Katherine A; Pavlova, Alexandra; Telonis-Scott, Marina; Sunnucks, Paul
2014-01-01
Genomics promises exciting advances towards the important conservation goal of maximizing evolutionary potential, notwithstanding associated challenges. Here, we explore some of the complexity of adaptation genetics and discuss the strengths and limitations of genomics as a tool for characterizing evolutionary potential in the context of conservation management. Many traits are polygenic and can be strongly influenced by minor differences in regulatory networks and by epigenetic variation not visible in DNA sequence. Much of this critical complexity is difficult to detect using methods commonly used to identify adaptive variation, and this needs appropriate consideration when planning genomic screens, and when basing management decisions on genomic data. When the genomic basis of adaptation and future threats are well understood, it may be appropriate to focus management on particular adaptive traits. For more typical conservations scenarios, we argue that screening genome-wide variation should be a sensible approach that may provide a generalized measure of evolutionary potential that accounts for the contributions of small-effect loci and cryptic variation and is robust to uncertainty about future change and required adaptive response(s). The best conservation outcomes should be achieved when genomic estimates of evolutionary potential are used within an adaptive management framework. PMID:25553064
Identification and Resolution of Microdiversity through Metagenomic Sequencing of Parallel Consortia
Maezato, Yukari; Wu, Yu-Wei; Romine, Margaret F.; Lindemann, Stephen R.
2015-01-01
To gain a predictive understanding of the interspecies interactions within microbial communities that govern community function, the genomic complement of every member population must be determined. Although metagenomic sequencing has enabled the de novo reconstruction of some microbial genomes from environmental communities, microdiversity confounds current genome reconstruction techniques. To overcome this issue, we performed short-read metagenomic sequencing on parallel consortia, defined as consortia cultivated under the same conditions from the same natural community with overlapping species composition. The differences in species abundance between the two consortia allowed reconstruction of near-complete (at an estimated >85% of gene complement) genome sequences for 17 of the 20 detected member species. Two Halomonas spp. indistinguishable by amplicon analysis were found to be present within the community. In addition, comparison of metagenomic reads against the consensus scaffolds revealed within-species variation for one of the Halomonas populations, one of the Rhodobacteraceae populations, and the Rhizobiales population. Genomic comparison of these representative instances of inter- and intraspecies microdiversity suggests differences in functional potential that may result in the expression of distinct roles in the community. In addition, isolation and complete genome sequence determination of six member species allowed an investigation into the sensitivity and specificity of genome reconstruction processes, demonstrating robustness across a wide range of sequence coverage (9× to 2,700×) within the metagenomic data set. PMID:26497460
Identification and Resolution of Microdiversity through Metagenomic Sequencing of Parallel Consortia
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nelson, William C.; Maezato, Yukari; Wu, Yu-Wei
2015-10-23
To gain a predictive understanding of the interspecies interactions within microbial communities that govern community function, the genomic complement of every member population must be determined. Although metagenomic sequencing has enabled thede novoreconstruction of some microbial genomes from environmental communities, microdiversity confounds current genome reconstruction techniques. To overcome this issue, we performed short-read metagenomic sequencing on parallel consortia, defined as consortia cultivated under the same conditions from the same natural community with overlapping species composition. The differences in species abundance between the two consortia allowed reconstruction of near-complete (at an estimated >85% of gene complement) genome sequences for 17 ofmore » the 20 detected member species. TwoHalomonasspp. indistinguishable by amplicon analysis were found to be present within the community. In addition, comparison of metagenomic reads against the consensus scaffolds revealed within-species variation for one of theHalomonaspopulations, one of theRhodobacteraceaepopulations, and theRhizobialespopulation. Genomic comparison of these representative instances of inter- and intraspecies microdiversity suggests differences in functional potential that may result in the expression of distinct roles in the community. In addition, isolation and complete genome sequence determination of six member species allowed an investigation into the sensitivity and specificity of genome reconstruction processes, demonstrating robustness across a wide range of sequence coverage (9× to 2,700×) within the metagenomic data set.« less
Beaudet, Denis; Nadimi, Maryam; Iffis, Bachir; Hijri, Mohamed
2013-01-01
Arbuscular mycorrhizal fungi (AMF) are common and important plant symbionts. They have coenocytic hyphae and form multinucleated spores. The nuclear genome of AMF is polymorphic and its organization is not well understood, which makes the development of reliable molecular markers challenging. In stark contrast, their mitochondrial genome (mtDNA) is homogeneous. To assess the intra- and inter-specific mitochondrial variability in closely related Glomus species, we performed 454 sequencing on total genomic DNA of Glomus sp. isolate DAOM-229456 and we compared its mtDNA with two G. irregulare isolates. We found that the mtDNA of Glomus sp. is homogeneous, identical in gene order and, with respect to the sequences of coding regions, almost identical to G. irregulare. However, certain genomic regions vary substantially, due to insertions/deletions of elements such as introns, mitochondrial plasmid-like DNA polymerase genes and mobile open reading frames. We found no evidence of mitochondrial or cytoplasmic plasmids in Glomus species, and mobile ORFs in Glomus are responsible for the formation of four gene hybrids in atp6, atp9, cox2, and nad3, which are most probably the result of horizontal gene transfer and are expressed at the mRNA level. We found evidence for substantial sequence variation in defined regions of mtDNA, even among closely related isolates with otherwise identical coding gene sequences. This variation makes it possible to design reliable intra- and inter-specific markers. PMID:23637766
Beaudet, Denis; Nadimi, Maryam; Iffis, Bachir; Hijri, Mohamed
2013-01-01
Arbuscular mycorrhizal fungi (AMF) are common and important plant symbionts. They have coenocytic hyphae and form multinucleated spores. The nuclear genome of AMF is polymorphic and its organization is not well understood, which makes the development of reliable molecular markers challenging. In stark contrast, their mitochondrial genome (mtDNA) is homogeneous. To assess the intra- and inter-specific mitochondrial variability in closely related Glomus species, we performed 454 sequencing on total genomic DNA of Glomus sp. isolate DAOM-229456 and we compared its mtDNA with two G. irregulare isolates. We found that the mtDNA of Glomus sp. is homogeneous, identical in gene order and, with respect to the sequences of coding regions, almost identical to G. irregulare. However, certain genomic regions vary substantially, due to insertions/deletions of elements such as introns, mitochondrial plasmid-like DNA polymerase genes and mobile open reading frames. We found no evidence of mitochondrial or cytoplasmic plasmids in Glomus species, and mobile ORFs in Glomus are responsible for the formation of four gene hybrids in atp6, atp9, cox2, and nad3, which are most probably the result of horizontal gene transfer and are expressed at the mRNA level. We found evidence for substantial sequence variation in defined regions of mtDNA, even among closely related isolates with otherwise identical coding gene sequences. This variation makes it possible to design reliable intra- and inter-specific markers.
Tranchida-Lombardo, Valentina; Aiese Cigliano, Riccardo; Anzar, Irantzu; Landi, Simone; Palombieri, Samuela; Colantuono, Chiara; Bostan, Hamed; Termolino, Pasquale; Aversano, Riccardo; Batelli, Giorgia; Cammareri, Maria; Carputo, Domenico; Chiusano, Maria Luisa; Conicella, Clara; Consiglio, Federica; D'Agostino, Nunzio; De Palma, Monica; Di Matteo, Antonio; Grandillo, Silvana; Sanseverino, Walter; Tucci, Marina; Grillo, Stefania
2017-11-14
Tomato is a high value crop and the primary model for fleshy fruit development and ripening. Breeding priorities include increased fruit quality, shelf life and tolerance to stresses. To contribute towards this goal, we re-sequenced the genomes of Corbarino (COR) and Lucariello (LUC) landraces, which both possess the traits of plant adaptation to water deficit, prolonged fruit shelf-life and good fruit quality. Through the newly developed pipeline Reconstructor, we generated the genome sequences of COR and LUC using datasets of 65.8 M and 56.4 M of 30-150 bp paired-end reads, respectively. New contigs including reads that could not be mapped to the tomato reference genome were assembled, and a total of 43, 054 and 44, 579 gene loci were annotated in COR and LUC. Both genomes showed novel regions with similarity to Solanum pimpinellifolium and Solanum pennellii. In addition to small deletions and insertions, 2, 000 and 1, 700 single nucleotide polymorphisms (SNPs) could exert potentially disruptive effects on 1, 371 and 1, 201 genes in COR and LUC, respectively. A detailed survey of the SNPs occurring in fruit quality, shelf life and stress tolerance related-genes identified several candidates of potential relevance. Variations in ethylene response components may concur in determining peculiar phenotypes of COR and LUC. © The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Lins, Luana S F; Trojahn, Shawn; Sockell, Alexandra; Yee, Muh-Ching; Tatarenkov, Andrey; Bustamante, Carlos D; Earley, Ryan L; Kelley, Joanna L
2018-04-01
The mangrove rivulus, Kryptolebias marmoratus, is one of only two self-fertilizing hermaphroditic fish species and inhabits mangrove forests. While selfing can be advantageous, it reduces heterozygosity and decreases genetic diversity. Studies using microsatellites found that there are variable levels of selfing among populations of K. marmoratus, but overall, there is a low rate of outcrossing and, therefore, low heterozygosity. In this study, we used whole-genome data to assess the levels of heterozygosity in different lineages of the mangrove rivulus and infer the phylogenetic relationships among those lineages. We sequenced whole genomes from 15 lineages that were completely homozygous at microsatellite loci and used single nucleotide polymorphisms (SNPs) to determine heterozygosity levels. More variation was uncovered than in studies using microsatellite data because of the resolution of full genome sequencing data. Moreover, missense polymorphisms were found most often in genes associated with immune function and reproduction. Inferred phylogenetic relationships suggest that lineages largely group by their geographic distribution. The use of whole-genome data provided further insight into genetic diversity in this unique species. Although this study was limited by the number of lineages that were available, these data suggest that there is previously undescribed variation within lineages of K. marmoratus that could have functional consequences and (or) inform us about the limits to selfing (e.g., genetic load, accumulation of deleterious mutations) and selection that might favor the maintenance of heterozygosity. These results highlight the need to sequence additional individuals within and among lineages.
RPAN: rice pan-genome browser for ∼3000 rice genomes.
Sun, Chen; Hu, Zhiqiang; Zheng, Tianqing; Lu, Kuangchen; Zhao, Yue; Wang, Wensheng; Shi, Jianxin; Wang, Chunchao; Lu, Jinyuan; Zhang, Dabing; Li, Zhikang; Wei, Chaochun
2017-01-25
A pan-genome is the union of the gene sets of all the individuals of a clade or a species and it provides a new dimension of genome complexity with the presence/absence variations (PAVs) of genes among these genomes. With the progress of sequencing technologies, pan-genome study is becoming affordable for eukaryotes with large-sized genomes. The Asian cultivated rice, Oryza sativa L., is one of the major food sources for the world and a model organism in plant biology. Recently, the 3000 Rice Genome Project (3K RGP) sequenced more than 3000 rice genomes with a mean sequencing depth of 14.3×, which provided a tremendous resource for rice research. In this paper, we present a genome browser, Rice Pan-genome Browser (RPAN), as a tool to search and visualize the rice pan-genome derived from 3K RGP. RPAN contains a database of the basic information of 3010 rice accessions, including genomic sequences, gene annotations, PAV information and gene expression data of the rice pan-genome. At least 12 000 novel genes absent in the reference genome were included. RPAN also provides multiple search and visualization functions. RPAN can be a rich resource for rice biology and rice breeding. It is available at http://cgm.sjtu.edu.cn/3kricedb/ or http://www.rmbreeding.cn/pan3k. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Development of resources and tools for mapping genetic sources of phenotypic variation
USDA-ARS?s Scientific Manuscript database
Commercial and experimental genetic resources were established and investigated for a range of reproductive and disease susceptibility phenotypes. The phenotyping efforts were accompanied with RNA and whole genome sequencing and novel assemblies of the swine genome. The efforts were complemented wit...
Gardner, Shea N; Wagner, Mark C
2005-01-01
Background Microbial forensics is important in tracking the source of a pathogen, whether the disease is a naturally occurring outbreak or part of a criminal investigation. Results A method and SPR Opt (SNP and PCR-RFLP Optimization) software to perform a comprehensive, whole-genome analysis to forensically discriminate multiple sequences is presented. Tools for the optimization of forensic typing using Single Nucleotide Polymorphism (SNP) and PCR-Restriction Fragment Length Polymorphism (PCR-RFLP) analyses across multiple isolate sequences of a species are described. The PCR-RFLP analysis includes prediction and selection of optimal primers and restriction enzymes to enable maximum isolate discrimination based on sequence information. SPR Opt calculates all SNP or PCR-RFLP variations present in the sequences, groups them into haplotypes according to their co-segregation across those sequences, and performs combinatoric analyses to determine which sets of haplotypes provide maximal discrimination among all the input sequences. Those set combinations requiring that membership in the fewest haplotypes be queried (i.e. the fewest assays be performed) are found. These analyses highlight variable regions based on existing sequence data. These markers may be heterogeneous among unsequenced isolates as well, and thus may be useful for characterizing the relationships among unsequenced as well as sequenced isolates. The predictions are multi-locus. Analyses of mumps and SARS viruses are summarized. Phylogenetic trees created based on SNPs, PCR-RFLPs, and full genomes are compared for SARS virus, illustrating that purported phylogenies based only on SNP or PCR-RFLP variations do not match those based on multiple sequence alignment of the full genomes. Conclusion This is the first software to optimize the selection of forensic markers to maximize information gained from the fewest assays, accepting whole or partial genome sequence data as input. As more sequence data becomes available for multiple strains and isolates of a species, automated, computational approaches such as those described here will be essential to make sense of large amounts of information, and to guide and optimize efforts in the laboratory. The software and source code for SPR Opt is publicly available and free for non-profit use at . PMID:15904493
The Human Genome Project: big science transforms biology and medicine.
Hood, Leroy; Rowen, Lee
2013-01-01
The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called 'big science' - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.
The Human Genome Project: big science transforms biology and medicine
2013-01-01
The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project. PMID:24040834
Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome.
Aljohi, Hasan Awad; Liu, Wanfei; Lin, Qiang; Zhao, Yuhui; Zeng, Jingyao; Alamer, Ali; Alanazi, Ibrahim O; Alawad, Abdullah O; Al-Sadi, Abdullah M; Hu, Songnian; Yu, Jun
2016-01-01
Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower as compared to that of the nuclear genome and neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provides the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants.
Ridge, Perry G; Maxwell, Taylor J; Foutz, Spencer J; Bailey, Matthew H; Corcoran, Christopher D; Tschanz, JoAnn T; Norton, Maria C; Munger, Ronald G; O'Brien, Elizabeth; Kerber, Richard A; Cawthon, Richard M; Kauwe, John S K
2014-01-01
The mitochondria are essential organelles and are the location of cellular respiration, which is responsible for the majority of ATP production. Each cell contains multiple mitochondria, and each mitochondrion contains multiple copies of its own circular genome. The ratio of mitochondrial genomes to nuclear genomes is referred to as mitochondrial copy number. Decreases in mitochondrial copy number are known to occur in many tissues as people age, and in certain diseases. The regulation of mitochondrial copy number by nuclear genes has been studied extensively. While mitochondrial variation has been associated with longevity and some of the diseases known to have reduced mitochondrial copy number, the role that the mitochondrial genome itself has in regulating mitochondrial copy number remains poorly understood. We analyzed the complete mitochondrial genomes from 1007 individuals randomly selected from the Cache County Study on Memory Health and Aging utilizing the inferred evolutionary history of the mitochondrial haplotypes present in our dataset to identify sequence variation and mitochondrial haplotypes associated with changes in mitochondrial copy number. Three variants belonging to mitochondrial haplogroups U5A1 and T2 were significantly associated with higher mitochondrial copy number in our dataset. We identified three variants associated with higher mitochondrial copy number and suggest several hypotheses for how these variants influence mitochondrial copy number by interacting with known regulators of mitochondrial copy number. Our results are the first to report sequence variation in the mitochondrial genome that causes changes in mitochondrial copy number. The identification of these variants that increase mtDNA copy number has important implications in understanding the pathological processes that underlie these phenotypes.
Population Genomics of Infectious and Integrated Wolbachia pipientis Genomes in Drosophila ananassae
Choi, Jae Young; Bubnell, Jaclyn E.; Aquadro, Charles F.
2015-01-01
Coevolution between Drosophila and its endosymbiont Wolbachia pipientis has many intriguing aspects. For example, Drosophila ananassae hosts two forms of W. pipientis genomes: One being the infectious bacterial genome and the other integrated into the host nuclear genome. Here, we characterize the infectious and integrated genomes of W. pipientis infecting D. ananassae (wAna), by genome sequencing 15 strains of D. ananassae that have either the infectious or integrated wAna genomes. Results indicate evolutionarily stable maternal transmission for the infectious wAna genome suggesting a relatively long-term coevolution with its host. In contrast, the integrated wAna genome showed pseudogene-like characteristics accumulating many variants that are predicted to have deleterious effects if present in an infectious bacterial genome. Phylogenomic analysis of sequence variation together with genotyping by polymerase chain reaction of large structural variations indicated several wAna variants among the eight infectious wAna genomes. In contrast, only a single wAna variant was found among the seven integrated wAna genomes examined in lines from Africa, south Asia, and south Pacific islands suggesting that the integration occurred once from a single infectious wAna genome and then spread geographically. Further analysis revealed that for all D. ananassae we examined with the integrated wAna genomes, the majority of the integrated wAna genomic regions is represented in at least two copies suggesting a double integration or single integration followed by an integrated genome duplication. The possible evolutionary mechanism underlying the widespread geographical presence of the duplicate integration of the wAna genome is an intriguing question remaining to be answered. PMID:26254486
Pandey, Ram Vinay; Kofler, Robert; Orozco-terWengel, Pablo; Nolte, Viola; Schlötterer, Christian
2011-03-02
The enormous potential of natural variation for the functional characterization of genes has been neglected for a long time. Only since recently, functional geneticists are starting to account for natural variation in their analyses. With the new sequencing technologies it has become feasible to collect sequence information for multiple individuals on a genomic scale. In particular sequencing pooled DNA samples has been shown to provide a cost-effective approach for characterizing variation in natural populations. While a range of software tools have been developed for mapping these reads onto a reference genome and extracting SNPs, linking this information to population genetic estimators and functional information still poses a major challenge to many researchers. We developed PoPoolation DB a user-friendly integrated database. Popoolation DB links variation in natural populations with functional information, allowing a wide range of researchers to take advantage of population genetic data. PoPoolation DB provides the user with population genetic parameters (Watterson's θ or Tajima's π), Tajima's D, SNPs, allele frequencies and indels in regions of interest. The database can be queried by gene name, chromosomal position, or a user-provided query sequence or GTF file. We anticipate that PoPoolation DB will be a highly versatile tool for functional geneticists as well as evolutionary biologists. PoPoolation DB, available at http://www.popoolation.at/pgt, provides an integrated platform for researchers to investigate natural polymorphism and associated functional annotations from UCSC and Flybase genome browsers, population genetic estimators and RNA-seq information.
Genomic analysis reveals major determinants of cis-regulatory variation in Capsella grandiflora
Steige, Kim A.; Laenen, Benjamin; Reimegård, Johan; Slotte, Tanja
2017-01-01
Understanding the causes of cis-regulatory variation is a long-standing aim in evolutionary biology. Although cis-regulatory variation has long been considered important for adaptation, we still have a limited understanding of the selective importance and genomic determinants of standing cis-regulatory variation. To address these questions, we studied the prevalence, genomic determinants, and selective forces shaping cis-regulatory variation in the outcrossing plant Capsella grandiflora. We first identified a set of 1,010 genes with common cis-regulatory variation using analyses of allele-specific expression (ASE). Population genomic analyses of whole-genome sequences from 32 individuals showed that genes with common cis-regulatory variation (i) are under weaker purifying selection and (ii) undergo less frequent positive selection than other genes. We further identified genomic determinants of cis-regulatory variation. Gene body methylation (gbM) was a major factor constraining cis-regulatory variation, whereas presence of nearby transposable elements (TEs) and tissue specificity of expression increased the odds of ASE. Our results suggest that most common cis-regulatory variation in C. grandiflora is under weak purifying selection, and that gene-specific functional constraints are more important for the maintenance of cis-regulatory variation than genome-scale variation in the intensity of selection. Our results agree with previous findings that suggest TE silencing affects nearby gene expression, and provide evidence for a link between gbM and cis-regulatory constraint, possibly reflecting greater dosage sensitivity of body-methylated genes. Given the extensive conservation of gbM in flowering plants, this suggests that gbM could be an important predictor of cis-regulatory variation in a wide range of plant species. PMID:28096395
Complete Chloroplast Genome of the Wollemi Pine (Wollemia nobilis): Structure and Evolution
Yap, Jia-Yee S.; Rohner, Thore; Greenfield, Abigail; Van Der Merwe, Marlien; McPherson, Hannah; Glenn, Wendy; Kornfeld, Geoff; Marendy, Elessa; Pan, Annie Y. H.; Wilkins, Marc R.; Rossetto, Maurizio; Delaney, Sven K.
2015-01-01
The Wollemi pine (Wollemia nobilis) is a rare Southern conifer with striking morphological similarity to fossil pines. A small population of W. nobilis was discovered in 1994 in a remote canyon system in the Wollemi National Park (near Sydney, Australia). This population contains fewer than 100 individuals and is critically endangered. Previous genetic studies of the Wollemi pine have investigated its evolutionary relationship with other pines in the family Araucariaceae, and have suggested that the Wollemi pine genome contains little or no variation. However, these studies were performed prior to the widespread use of genome sequencing, and their conclusions were based on a limited fraction of the Wollemi pine genome. In this study, we address this problem by determining the entire sequence of the W. nobilis chloroplast genome. A detailed analysis of the structure of the genome is presented, and the evolution of the genome is inferred by comparison with the chloroplast sequences of other members of the Araucariaceae and the related family Podocarpaceae. Pairwise alignments of whole genome sequences, and the presence of unique pseudogenes, gene duplications and insertions in W. nobilis and Araucariaceae, indicate that the W. nobilis chloroplast genome is most similar to that of its sister taxon Agathis. However, the W. nobilis genome contains an unusually high number of repetitive sequences, and these could be used in future studies to investigate and conserve any remnant genetic diversity in the Wollemi pine. PMID:26061691
Inference of Gorilla Demographic and Selective History from Whole-Genome Sequence Data
McManus, Kimberly F.; Kelley, Joanna L.; Song, Shiya; Veeramah, Krishna R.; Woerner, August E.; Stevison, Laurie S.; Ryder, Oliver A.; Ape Genome Project, Great; Kidd, Jeffrey M.; Wall, Jeffrey D.; Bustamante, Carlos D.; Hammer, Michael F.
2015-01-01
Although population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole-genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor approximately 261 ka, and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage approximately 68 ka. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of 1.4-fold around 970 ka and a recent 5.6-fold contraction in population size 23 ka. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection. PMID:25534031
The impact of next-generation sequencing on genomics
Zhang, Jun; Chiodini, Rod; Badr, Ahmed; Zhang, Genfa
2011-01-01
This article reviews basic concepts, general applications, and the potential impact of next-generation sequencing (NGS) technologies on genomics, with particular reference to currently available and possible future platforms and bioinformatics. NGS technologies have demonstrated the capacity to sequence DNA at unprecedented speed, thereby enabling previously unimaginable scientific achievements and novel biological applications. But, the massive data produced by NGS also presents a significant challenge for data storage, analyses, and management solutions. Advanced bioinformatic tools are essential for the successful application of NGS technology. As evidenced throughout this review, NGS technologies will have a striking impact on genomic research and the entire biological field. With its ability to tackle the unsolved challenges unconquered by previous genomic technologies, NGS is likely to unravel the complexity of the human genome in terms of genetic variations, some of which may be confined to susceptible loci for some common human conditions. The impact of NGS technologies on genomics will be far reaching and likely change the field for years to come. PMID:21477781
Allen, Upton D; Hu, Pingzhao; Pereira, Sergio L; Robinson, Joan L; Paton, Tara A; Beyene, Joseph; Khodai-Booran, Nasser; Dipchand, Anne; Hébert, Diane; Ng, Vicky; Nalpathamkalam, Thomas; Read, Stanley
2016-02-01
This study examines EBV strains from transplant patients and patients with IM by sequencing major EBV genes. We also used NGS to detect EBV DNA within total genomic DNA, and to evaluate its genetic variation. Sanger sequencing of major EBV genes was used to compare SNVs from samples taken from transplant patients vs. patients with IM. We sequenced EBV DNA from a healthy EBV-seropositive individual on a HiSeq 2000 instrument. Data were mapped to the EBV reference genomes (AG876 and B95-8). The number of EBNA2 SNVs was higher than for EBNA1 and the other genes sequenced within comparable reference coordinates. For EBNA2, there was a median of 15 SNV among transplant samples compared with 10 among IM samples (p = 0.036). EBNA1 showed little variation between samples. For NGS, we identified 640 and 892 variants at an unadjusted p value of 5 × 10(-8) for AG876 and B95-8 genomes, respectively. We used complementary sequence strategies to examine EBV genetic diversity and its application to transplantation. The results provide the framework for further characterization of EBV strains and related outcomes after organ transplantation. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Toward the 1,000 dollars human genome.
Bennett, Simon T; Barnes, Colin; Cox, Anthony; Davies, Lisa; Brown, Clive
2005-06-01
Revolutionary new technologies, capable of transforming the economics of sequencing, are providing an unparalleled opportunity to analyze human genetic variation comprehensively at the whole-genome level within a realistic timeframe and at affordable costs. Current estimates suggest that it would cost somewhere in the region of 30 million US dollars to sequence an entire human genome using Sanger-based sequencing, and on one machine it would take about 60 years. Solexa is widely regarded as a company with the necessary disruptive technology to be the first to achieve the ultimate goal of the so-called 1,000 dollars human genome - the conceptual cost-point needed for routine analysis of individual genomes. Solexa's technology is based on completely novel sequencing chemistry capable of sequencing billions of individual DNA molecules simultaneously, a base at a time, to enable highly accurate, low cost analysis of an entire human genome in a single experiment. When applied over a large enough genomic region, these new approaches to resequencing will enable the simultaneous detection and typing of known, as well as unknown, polymorphisms, and will also offer information about patterns of linkage disequilibrium in the population being studied. Technological progress, leading to the advent of single-molecule-based approaches, is beginning to dramatically drive down costs and increase throughput to unprecedented levels, each being several orders of magnitude better than that which is currently available. A new sequencing paradigm based on single molecules will be faster, cheaper and more sensitive, and will permit routine analysis at the whole-genome level.
Kim, Jeongwoon; Matsuba, Yuki; Ning, Jing; Schilmiller, Anthony L.; Hammar, Dagan; Jones, A. Daniel; Pichersky, Eran; Last, Robert L.
2014-01-01
Flavonoids are ubiquitous plant aromatic specialized metabolites found in a variety of cell types and organs. Methylated flavonoids are detected in secreting glandular trichomes of various Solanum species, including the cultivated tomato (Solanum lycopersicum). Inspection of the sequenced S. lycopersicum Heinz 1706 reference genome revealed a close homolog of Solanum habrochaites MOMT1 3′/5′ myricetin O-methyltransferase gene, but this gene (Solyc06g083450) is missing the first exon, raising the question of whether cultivated tomato has a distinct 3′ or 3′/5′ O-methyltransferase. A combination of mining genome and cDNA sequences from wild tomato species and S. lycopersicum cultivar M82 led to the identification of Sl-MOMT4 as a 3′ O-methyltransferase. In parallel, three independent ethyl methanesulfonate mutants in the S. lycopersicum cultivar M82 background were identified as having reduced amounts of di- and trimethylated myricetins and increased monomethylated myricetin. Consistent with the hypothesis that Sl-MOMT4 is a 3′ O-methyltransferase gene, all three myricetin methylation defective mutants were found to have defects in MOMT4 sequence, transcript accumulation, or 3′-O-methyltransferase enzyme activity. Surprisingly, no MOMT4 sequence is found in the Heinz 1706 reference genome sequence, and this cultivar accumulates 3-methyl myricetin and is deficient in 3′-methyl myricetins, demonstrating variation in this gene among cultivated tomato varieties. PMID:25128240
Genetical genomics of Populus leaf shape variation
Drost, Derek R.; Puranik, Swati; Novaes, Evandro; ...
2015-06-30
Leaf morphology varies extensively among plant species and is under strong genetic control. Mutagenic screens in model systems have identified genes and established molecular mechanisms regulating leaf initiation, development, and shape. However, it is not known whether this diversity across plant species is related to naturally occurring variation at these genes. Quantitative trait locus (QTL) analysis has revealed a polygenic control for leaf shape variation in different species suggesting that loci discovered by mutagenesis may only explain part of the naturally occurring variation in leaf shape. Here we undertook a genetical genomics study in a poplar intersectional pseudo-backcross pedigree tomore » identify genetic factors controlling leaf shape. Here, the approach combined QTL discovery in a genetic linkage map anchored to the Populus trichocarpa reference genome sequence and transcriptome analysis.« less
Whole genome resequencing of a laboratory-adapted Drosophila melanogaster population sample
Gilks, William P.; Pennell, Tanya M.; Flis, Ilona; Webster, Matthew T.; Morrow, Edward H.
2016-01-01
As part of a study into the molecular genetics of sexually dimorphic complex traits, we used high-throughput sequencing to obtain data on genomic variation in an outbred laboratory-adapted fruit fly ( Drosophila melanogaster) population. We successfully resequenced the whole genome of 220 hemiclonal females that were heterozygous for the same Berkeley reference line genome (BDGP6/dm6), and a unique haplotype from the outbred base population (LH M). The use of a static and known genetic background enabled us to obtain sequences from whole-genome phased haplotypes. We used a BWA-Picard-GATK pipeline for mapping sequence reads to the dm6 reference genome assembly, at a median depth-of coverage of 31X, and have made the resulting data publicly-available in the NCBI Short Read Archive (Accession number SRP058502). We used Haplotype Caller to discover and genotype 1,726,931 small genomic variants (SNPs and indels, <200bp). Additionally we detected and genotyped 167 large structural variants (1-100Kb in size) using GenomeStrip/2.0. Sequence and genotype data are publicly-available at the corresponding NCBI databases: Short Read Archive, dbSNP and dbVar (BioProject PRJNA282591). We have also released the unfiltered genotype data, and the code and logs for data processing and summary statistics ( https://zenodo.org/communities/sussex_drosophila_sequencing/). PMID:27928499
Manel, S; Perrier, C; Pratlong, M; Abi-Rached, L; Paganini, J; Pontarotti, P; Aurelle, D
2016-01-01
Genome scans represent powerful approaches to investigate the action of natural selection on the genetic variation of natural populations and to better understand local adaptation. This is very useful, for example, in the field of conservation biology and evolutionary biology. Thanks to Next Generation Sequencing, genomic resources are growing exponentially, improving genome scan analyses in non-model species. Thousands of SNPs called using Reduced Representation Sequencing are increasingly used in genome scans. Besides, genome sequences are also becoming increasingly available, allowing better processing of short-read data, offering physical localization of variants, and improving haplotype reconstruction and data imputation. Ultimately, genome sequences are also becoming the raw material for selection inferences. Here, we discuss how the increasing availability of such genomic resources, notably genome sequences, influences the detection of signals of selection. Mainly, increasing data density and having the information of physical linkage data expand genome scans by (i) improving the overall quality of the data, (ii) helping the reconstruction of demographic history for the population studied to decrease false-positive rates and (iii) improving the statistical power of methods to detect the signal of selection. Of particular importance, the availability of a high-quality reference genome can improve the detection of the signal of selection by (i) allowing matching the potential candidate loci to linked coding regions under selection, (ii) rapidly moving the investigation to the gene and function and (iii) ensuring that the highly variable regions of the genomes that include functional genes are also investigated. For all those reasons, using reference genomes in genome scan analyses is highly recommended. © 2015 John Wiley & Sons Ltd.
Single nucleotide variations: Biological impact and theoretical interpretation
Katsonis, Panagiotis; Koire, Amanda; Wilson, Stephen Joseph; Hsu, Teng-Kuei; Lua, Rhonald C; Wilkins, Angela Dawn; Lichtarge, Olivier
2014-01-01
Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics. PMID:25234433
Feltus, F A; Singh, H P; Lohithaswa, H C; Schulze, S R; Silva, T D; Paterson, A H
2006-04-01
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50-million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species.
Feltus, F.A.; Singh, H.P.; Lohithaswa, H.C.; Schulze, S.R.; Silva, T.D.; Paterson, A.H.
2006-01-01
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50-million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species. PMID:16607031
Widespread and evolutionary analysis of a MITE family Monkey King in Brassicaceae.
Dai, Shutao; Hou, Jinna; Long, Yan; Wang, Jing; Li, Cong; Xiao, Qinqin; Jiang, Xiaoxue; Zou, Xiaoxiao; Zou, Jun; Meng, Jinling
2015-06-19
Miniature inverted repeat transposable elements (MITEs) are important components of eukaryotic genomes, with hundreds of families and many copies, which may play important roles in gene regulation and genome evolution. However, few studies have investigated the molecular mechanisms involved. In our previous study, a Tourist-like MITE, Monkey King, was identified from the promoter region of a flowering time gene, BnFLC.A10, in Brassica napus. Based on this MITE, the characteristics and potential roles on gene regulation of the MITE family were analyzed in Brassicaceae. The characteristics of the Tourist-like MITE family Monkey King in Brassicaceae, including its distribution, copies and insertion sites in the genomes of major Brassicaceae species were analyzed in this study. Monkey King was actively amplified in Brassica after divergence from Arabidopsis, which was indicated by the prompt increase in copy number and by phylogenetic analysis. The genomic variations caused by Monkey King insertions, both intra- and inter-species in Brassica, were traced by PCR amplification. Genomic sequence analysis showed that most complete Monkey King elements are located in gene-rich regions, less than 3kb from genes, in both the B. rapa and A. thaliana genomes. Sixty-seven Brassica expressed sequence tags carrying Monkey King fragments were also identified from the NCBI database. Bisulfite sequencing identified specific DNA methylation of cytosine residues in the Monkey King sequence. A fragment containing putative TATA-box motifs in the MITE sequence could bind with nuclear protein(s) extracted from leaves of B. napus plants. A Monkey King-related microRNA, bna-miR6031, was identified in the microRNA database. In transgenic A. thaliana, when the Monkey King element was inserted upstream of 35S promoter, the promoter activity was weakened. Monkey King, a Brassicaceae Tourist-like MITE family, has amplified relatively recently and has induced intra- and inter-species genomic variations in Brassica. Monkey King elements are most abundant in the vicinity of genes and may have a substantial effect on genome-wide gene regulation in Brassicaceae. Monkey King insertions potentially regulate gene expression and genome evolution through epigenetic modification and new regulatory motif production.
Applications of the 1000 Genomes Project resources.
Zheng-Bradley, Xiangqun; Flicek, Paul
2017-05-01
The 1000 Genomes Project created a valuable, worldwide reference for human genetic variation. Common uses of the 1000 Genomes dataset include genotype imputation supporting Genome-wide Association Studies, mapping expression Quantitative Trait Loci, filtering non-pathogenic variants from exome, whole genome and cancer genome sequencing projects, and genetic analysis of population structure and molecular evolution. In this article, we will highlight some of the multiple ways that the 1000 Genomes data can be and has been utilized for genetic studies. © The Author 2016. Published by Oxford University Press.
A Glimpse into the Satellite DNA Library in Characidae Fish (Teleostei, Characiformes)
Utsunomia, Ricardo; Ruiz-Ruano, Francisco J.; Silva, Duílio M. Z. A.; Serrano, Érica A.; Rosa, Ivana F.; Scudeler, Patrícia E. S.; Hashimoto, Diogo T.; Oliveira, Claudio; Camacho, Juan Pedro M.; Foresti, Fausto
2017-01-01
Satellite DNA (satDNA) is an abundant fraction of repetitive DNA in eukaryotic genomes and plays an important role in genome organization and evolution. In general, satDNA sequences follow a concerted evolutionary pattern through the intragenomic homogenization of different repeat units. In addition, the satDNA library hypothesis predicts that related species share a series of satDNA variants descended from a common ancestor species, with differential amplification of different satDNA variants. The finding of a same satDNA family in species belonging to different genera within Characidae fish provided the opportunity to test both concerted evolution and library hypotheses. For this purpose, we analyzed here sequence variation and abundance of this satDNA family in ten species, by a combination of next generation sequencing (NGS), PCR and Sanger sequencing, and fluorescence in situ hybridization (FISH). We found extensive between-species variation for the number and size of pericentromeric FISH signals. At genomic level, the analysis of 1000s of DNA sequences obtained by Illumina sequencing and PCR amplification allowed defining 150 haplotypes which were linked in a common minimum spanning tree, where different patterns of concerted evolution were apparent. This also provided a glimpse into the satDNA library of this group of species. In consistency with the library hypothesis, different variants for this satDNA showed high differences in abundance between species, from highly abundant to simply relictual variants. PMID:28855916
Variation of 45S rDNA intergenic spacers in Arabidopsis thaliana.
Havlová, Kateřina; Dvořáčková, Martina; Peiro, Ramon; Abia, David; Mozgová, Iva; Vansáčová, Lenka; Gutierrez, Crisanto; Fajkus, Jiří
2016-11-01
Approximately seven hundred 45S rRNA genes (rDNA) in the Arabidopsis thaliana genome are organised in two 4 Mbp-long arrays of tandem repeats arranged in head-to-tail fashion separated by an intergenic spacer (IGS). These arrays make up 5 % of the A. thaliana genome. IGS are rapidly evolving sequences and frequent rearrangements inside the rDNA loci have generated considerable interspecific and even intra-individual variability which allows to distinguish among otherwise highly conserved rRNA genes. The IGS has not been comprehensively described despite its potential importance in regulation of rDNA transcription and replication. Here we describe the detailed sequence variation in the complete IGS of A. thaliana WT plants and provide the reference/consensus IGS sequence, as well as genomic DNA analysis. We further investigate mutants dysfunctional in chromatin assembly factor-1 (CAF-1) (fas1 and fas2 mutants), which are known to have a reduced number of rDNA copies, and plant lines with restored CAF-1 function (segregated from a fas1xfas2 genetic background) showing major rDNA rearrangements. The systematic rDNA loss in CAF-1 mutants leads to the decreased variability of the IGS and to the occurrence of distinct IGS variants. We present for the first time a comprehensive and representative set of complete IGS sequences, obtained by conventional cloning and by Pacific Biosciences sequencing. Our data expands the knowledge of the A. thaliana IGS sequence arrangement and variability, which has not been available in full and in detail until now. This is also the first study combining IGS sequencing data with RFLP analysis of genomic DNA.
Comparative Analysis of the Shared Sex-Determination Region (SDR) among Salmonid Fishes.
Faber-Hammond, Joshua J; Phillips, Ruth B; Brown, Kim H
2015-06-25
Salmonids present an excellent model for studying evolution of young sex-chromosomes. Within the genus, Oncorhynchus, at least six independent sex-chromosome pairs have evolved, many unique to individual species. This variation results from the movement of the sex-determining gene, sdY, throughout the salmonid genome. While sdY is known to define sexual differentiation in salmonids, the mechanism of its movement throughout the genome has remained elusive due to high frequencies of repetitive elements, rDNA sequences, and transposons surrounding the sex-determining regions (SDR). Despite these difficulties, bacterial artificial chromosome (BAC) library clones from both rainbow trout and Atlantic salmon containing the sdY region have been reported. Here, we report the sequences for these BACs as well as the extended sequence for the known SDR in Chinook gained through genome walking methods. Comparative analysis allowed us to study the overlapping SDRs from three unique salmonid Y chromosomes to define the specific content, size, and variation present between the species. We found approximately 4.1 kb of orthologous sequence common to all three species, which contains the genetic content necessary for masculinization. The regions contain transposable elements that may be responsible for the translocations of the SDR throughout salmonid genomes and we examine potential mechanistic roles of each one. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Sabir, Jamal; Schwarz, Erika; Ellison, Nicholas; Zhang, Jin; Baeshen, Nabih A; Mutwakil, Muhammed; Jansen, Robert; Ruhlman, Tracey
2014-08-01
Land plant plastid genomes (plastomes) provide a tractable model for evolutionary study in that they are relatively compact and gene dense. Among the groups that display an appropriate level of variation for structural features, the inverted-repeat-lacking clade (IRLC) of papilionoid legumes presents the potential to advance general understanding of the mechanisms of genomic evolution. Here, are presented six complete plastome sequences from economically important species of the IRLC, a lineage previously represented by only five completed plastomes. A number of characters are compared across the IRLC including gene retention and divergence, synteny, repeat structure and functional gene transfer to the nucleus. The loss of clpP intron 2 was identified in one newly sequenced member of IRLC, Glycyrrhiza glabra. Using deeply sequenced nuclear transcriptomes from two species helped clarify the nature of the functional transfer of accD to the nucleus in Trifolium, which likely occurred in the lineage leading to subgenus Trifolium. Legumes are second only to cereal crops in agricultural importance based on area harvested and total production. Genetic improvement via plastid transformation of IRLC crop species is an appealing proposition. Comparative analyses of intergenic spacer regions emphasize the need for complete genome sequences for developing transformation vectors for plastid genetic engineering of legume crops. © 2014 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd.
Martin, Guillaume; Baurens, Franc-Christophe; Droc, Gaëtan; Rouard, Mathieu; Cenci, Alberto; Kilian, Andrzej; Hastie, Alex; Doležel, Jaroslav; Aury, Jean-Marc; Alberti, Adriana; Carreel, Françoise; D'Hont, Angélique
2016-03-16
Recent advances in genomics indicate functional significance of a majority of genome sequences and their long range interactions. As a detailed examination of genome organization and function requires very high quality genome sequence, the objective of this study was to improve reference genome assembly of banana (Musa acuminata). We have developed a modular bioinformatics pipeline to improve genome sequence assemblies, which can handle various types of data. The pipeline comprises several semi-automated tools. However, unlike classical automated tools that are based on global parameters, the semi-automated tools proposed an expert mode for a user who can decide on suggested improvements through local compromises. The pipeline was used to improve the draft genome sequence of Musa acuminata. Genotyping by sequencing (GBS) of a segregating population and paired-end sequencing were used to detect and correct scaffold misassemblies. Long insert size paired-end reads identified scaffold junctions and fusions missed by automated assembly methods. GBS markers were used to anchor scaffolds to pseudo-molecules with a new bioinformatics approach that avoids the tedious step of marker ordering during genetic map construction. Furthermore, a genome map was constructed and used to assemble scaffolds into super scaffolds. Finally, a consensus gene annotation was projected on the new assembly from two pre-existing annotations. This approach reduced the total Musa scaffold number from 7513 to 1532 (i.e. by 80%), with an N50 that increased from 1.3 Mb (65 scaffolds) to 3.0 Mb (26 scaffolds). 89.5% of the assembly was anchored to the 11 Musa chromosomes compared to the previous 70%. Unknown sites (N) were reduced from 17.3 to 10.0%. The release of the Musa acuminata reference genome version 2 provides a platform for detailed analysis of banana genome variation, function and evolution. Bioinformatics tools developed in this work can be used to improve genome sequence assemblies in other species.
The genetic architecture of long QT syndrome: A critical reappraisal.
Giudicessi, John R; Wilde, Arthur A M; Ackerman, Michael J
2018-03-30
Collectively, the completion of the Human Genome Project and subsequent development of high-throughput next-generation sequencing methodologies have revolutionized genomic research. However, the rapid sequencing and analysis of thousands upon thousands of human exomes and genomes has taught us that most genes, including those known to cause heritable cardiovascular disorders such as long QT syndrome, harbor an unexpected background rate of rare, and presumably innocuous, non-synonymous genetic variation. In this Review, we aim to reappraise the genetic architecture underlying both the acquired and congenital forms of long QT syndrome by examining how the clinical phenotype associated with and background genetic variation in long QT syndrome-susceptibility genes impacts the clinical validity of existing gene-disease associations and the variant classification and reporting strategies that serve as the foundation for diagnostic long QT syndrome genetic testing. Copyright © 2018 Elsevier Inc. All rights reserved.
Genomics of crop wild relatives: expanding the gene pool for crop improvement.
Brozynska, Marta; Furtado, Agnelo; Henry, Robert J
2016-04-01
Plant breeders require access to new genetic diversity to satisfy the demands of a growing human population for more food that can be produced in a variable or changing climate and to deliver the high-quality food with nutritional and health benefits demanded by consumers. The close relatives of domesticated plants, crop wild relatives (CWRs), represent a practical gene pool for use by plant breeders. Genomics of CWR generates data that support the use of CWR to expand the genetic diversity of crop plants. Advances in DNA sequencing technology are enabling the efficient sequencing of CWR and their increased use in crop improvement. As the sequencing of genomes of major crop species is completed, attention has shifted to analysis of the wider gene pool of major crops including CWR. A combination of de novo sequencing and resequencing is required to efficiently explore useful genetic variation in CWR. Analysis of the nuclear genome, transcriptome and maternal (chloroplast and mitochondrial) genome of CWR is facilitating their use in crop improvement. Genome analysis results in discovery of useful alleles in CWR and identification of regions of the genome in which diversity has been lost in domestication bottlenecks. Targeting of high priority CWR for sequencing will maximize the contribution of genome sequencing of CWR. Coordination of global efforts to apply genomics has the potential to accelerate access to and conservation of the biodiversity essential to the sustainability of agriculture and food production. © 2015 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd.
Annotation and sequence diversity of transposable elements in common bean (Phaseolus vulgaris).
Gao, Dongying; Abernathy, Brian; Rohksar, Daniel; Schmutz, Jeremy; Jackson, Scott A
2014-01-01
Common bean (Phaseolus vulgaris) is an important legume crop grown and consumed worldwide. With the availability of the common bean genome sequence, the next challenge is to annotate the genome and characterize functional DNA elements. Transposable elements (TEs) are the most abundant component of plant genomes and can dramatically affect genome evolution and genetic variation. Thus, it is pivotal to identify TEs in the common bean genome. In this study, we performed a genome-wide transposon annotation in common bean using a combination of homology and sequence structure-based methods. We developed a 2.12-Mb transposon database which includes 791 representative transposon sequences and is available upon request or from www.phytozome.org. Of note, nearly all transposons in the database are previously unrecognized TEs. More than 5,000 transposon-related expressed sequence tags (ESTs) were detected which indicates that some transposons may be transcriptionally active. Two Ty1-copia retrotransposon families were found to encode the envelope-like protein which has rarely been identified in plant genomes. Also, we identified an extra open reading frame (ORF) termed ORF2 from 15 Ty3-gypsy families that was located between the ORF encoding the retrotransposase and the 3'LTR. The ORF2 was in opposite transcriptional orientation to retrotransposase. Sequence homology searches and phylogenetic analysis suggested that the ORF2 may have an ancient origin, but its function is not clear. These transposon data provide a useful resource for understanding the genome organization and evolution and may be used to identify active TEs for developing transposon-tagging system in common bean and other related genomes.
Lado, Bettina; Matus, Ivan; Rodríguez, Alejandra; Inostroza, Luis; Poland, Jesse; Belzile, François; del Pozo, Alejandro; Quincke, Martín; Castro, Marina; von Zitzewitz, Jarislav
2013-01-01
In crop breeding, the interest of predicting the performance of candidate cultivars in the field has increased due to recent advances in molecular breeding technologies. However, the complexity of the wheat genome presents some challenges for applying new technologies in molecular marker identification with next-generation sequencing. We applied genotyping-by-sequencing, a recently developed method to identify single-nucleotide polymorphisms, in the genomes of 384 wheat (Triticum aestivum) genotypes that were field tested under three different water regimes in Mediterranean climatic conditions: rain-fed only, mild water stress, and fully irrigated. We identified 102,324 single-nucleotide polymorphisms in these genotypes, and the phenotypic data were used to train and test genomic selection models intended to predict yield, thousand-kernel weight, number of kernels per spike, and heading date. Phenotypic data showed marked spatial variation. Therefore, different models were tested to correct the trends observed in the field. A mixed-model using moving-means as a covariate was found to best fit the data. When we applied the genomic selection models, the accuracy of predicted traits increased with spatial adjustment. Multiple genomic selection models were tested, and a Gaussian kernel model was determined to give the highest accuracy. The best predictions between environments were obtained when data from different years were used to train the model. Our results confirm that genotyping-by-sequencing is an effective tool to obtain genome-wide information for crops with complex genomes, that these data are efficient for predicting traits, and that correction of spatial variation is a crucial ingredient to increase prediction accuracy in genomic selection models. PMID:24082033
Cho, Yun Sung; Kim, Hyunho; Kim, Hak-Min; Jho, Sungwoong; Jun, JeHoon; Lee, Yong Joo; Chae, Kyun Shik; Kim, Chang Geun; Kim, Sangsoo; Eriksson, Anders; Edwards, Jeremy S.; Lee, Semin; Kim, Byung Chul; Manica, Andrea; Oh, Tae-Kwang; Church, George M.; Bhak, Jong
2016-01-01
Human genomes are routinely compared against a universal reference. However, this strategy could miss population-specific and personal genomic variations, which may be detected more efficiently using an ethnically relevant or personal reference. Here we report a hybrid assembly of a Korean reference genome (KOREF) for constructing personal and ethnic references by combining sequencing and mapping methods. We also build its consensus variome reference, providing information on millions of variants from 40 additional ethnically homogeneous genomes from the Korean Personal Genome Project. We find that the ethnically relevant consensus reference can be beneficial for efficient variant detection. Systematic comparison of human assemblies shows the importance of assembly quality, suggesting the necessity of new technologies to comprehensively map ethnic and personal genomic structure variations. In the era of large-scale population genome projects, the leveraging of ethnicity-specific genome assemblies as well as the human reference genome will accelerate mapping all human genome diversity. PMID:27882922
Diekmann, Kerstin; Hodkinson, Trevor R; Wolfe, Kenneth H; van den Bekerom, Rob; Dix, Philip J; Barth, Susanne
2009-06-01
Lolium perenne L. (perennial ryegrass) is globally one of the most important forage and grassland crops. We sequenced the chloroplast (cp) genome of Lolium perenne cultivar Cashel. The L. perenne cp genome is 135 282 bp with a typical quadripartite structure. It contains genes for 76 unique proteins, 30 tRNAs and four rRNAs. As in other grasses, the genes accD, ycf1 and ycf2 are absent. The genome is of average size within its subfamily Pooideae and of medium size within the Poaceae. Genome size differences are mainly due to length variations in non-coding regions. However, considerable length differences of 1-27 codons in comparison of L. perenne to other Poaceae and 1-68 codons among all Poaceae were also detected. Within the cp genome of this outcrossing cultivar, 10 insertion/deletion polymorphisms and 40 single nucleotide polymorphisms were detected. Two of the polymorphisms involve tiny inversions within hairpin structures. By comparing the genome sequence with RT-PCR products of transcripts for 33 genes, 31 mRNA editing sites were identified, five of them unique to Lolium. The cp genome sequence of L. perenne is available under Accession number AM777385 at the European Molecular Biology Laboratory, National Center for Biotechnology Information and DNA DataBank of Japan.
Kooke, Rik; Kruijer, Willem; Bours, Ralph; Becker, Frank; Kuhn, André; van de Geest, Henri; Buntjer, Jaap; Doeswijk, Timo; Guerra, José; Bouwmeester, Harro; Vreugdenhil, Dick; Keurentjes, Joost J B
2016-04-01
Quantitative traits in plants are controlled by a large number of genes and their interaction with the environment. To disentangle the genetic architecture of such traits, natural variation within species can be explored by studying genotype-phenotype relationships. Genome-wide association studies that link phenotypes to thousands of single nucleotide polymorphism markers are nowadays common practice for such analyses. In many cases, however, the identified individual loci cannot fully explain the heritability estimates, suggesting missing heritability. We analyzed 349 Arabidopsis accessions and found extensive variation and high heritabilities for different morphological traits. The number of significant genome-wide associations was, however, very low. The application of genomic prediction models that take into account the effects of all individual loci may greatly enhance the elucidation of the genetic architecture of quantitative traits in plants. Here, genomic prediction models revealed different genetic architectures for the morphological traits. Integrating genomic prediction and association mapping enabled the assignment of many plausible candidate genes explaining the observed variation. These genes were analyzed for functional and sequence diversity, and good indications that natural allelic variation in many of these genes contributes to phenotypic variation were obtained. For ACS11, an ethylene biosynthesis gene, haplotype differences explaining variation in the ratio of petiole and leaf length could be identified. © 2016 American Society of Plant Biologists. All Rights Reserved.
Molecular discrimination of tall fescue morphotypes in association with Festuca relatives
Chekhovskiy, Konstantin
2018-01-01
Tall fescue (Festuca arundinacea Schreb.) is an important cool-season perennial grass species used as forage and turf, and in conservation plantings. There are three morphotypes in hexaploid tall fescue: Continental, Mediterranean and Rhizomatous. This study was conducted to develop morphotype-specific molecular markers to distinguish Continental and Mediterranean tall fescues, and establish their relationships with other species of the Festuca genus for genomic inference. Chloroplast sequence variation and simple sequence repeat (SSR) polymorphism were explored in 12 genotypes of three tall fescue morphotypes and four Festuca species. Hypervariable chloroplast regions were retrieved by using 33 specifically designed primers followed by sequencing the PCR products. SSR polymorphism was studied using 144 tall fescue SSR primers. Four chloroplast (NFTCHL17, NFTCHL43, NFTCHL45 and NFTCHL48) and three SSR (nffa090, nffa204 and nffa338) markers were identified which can distinctly differentiate Continental and Mediterranean morphotypes. A primer pair, NFTCHL45, amplified a 47 bp deletion between the two morphotypes is being routinely used in the Noble Research Institute’s core facility for morphotype discrimination. Both chloroplast sequence variation and SSR diversity showed a close association between Rhizomatous and Continental morphotypes, while the Mediterranean morphotype was in a distant clade. F. pratensis and F. arundinacea var. glaucescens, the P and G1G2 genome donors, respectively, were grouped with the Continental clade, and F. mairei (M1M2 genome) grouped with the Mediterranean clade in chloroplast sequence variation, while both F. pratensis and F. mairei formed independent clade in SSR analysis. Age estimation based on chloroplast sequence variation indicated that the Continental and Mediterranean clades might have been colonized independently during 0.65 ± 0.06 and 0.96 ± 0.1 million years ago (Mya) respectively. The findings of the study will enhance tall fescue breeding for persistence and productivity. PMID:29342197
A wide extent of inter-strain diversity in virulent and vaccine strains of alphaherpesviruses.
Szpara, Moriah L; Tafuri, Yolanda R; Parsons, Lance; Shamim, S Rafi; Verstrepen, Kevin J; Legendre, Matthieu; Enquist, L W
2011-10-01
Alphaherpesviruses are widespread in the human population, and include herpes simplex virus 1 (HSV-1) and 2, and varicella zoster virus (VZV). These viral pathogens cause epithelial lesions, and then infect the nervous system to cause lifelong latency, reactivation, and spread. A related veterinary herpesvirus, pseudorabies (PRV), causes similar disease in livestock that result in significant economic losses. Vaccines developed for VZV and PRV serve as useful models for the development of an HSV-1 vaccine. We present full genome sequence comparisons of the PRV vaccine strain Bartha, and two virulent PRV isolates, Kaplan and Becker. These genome sequences were determined by high-throughput sequencing and assembly, and present new insights into the attenuation of a mammalian alphaherpesvirus vaccine strain. We find many previously unknown coding differences between PRV Bartha and the virulent strains, including changes to the fusion proteins gH and gB, and over forty other viral proteins. Inter-strain variation in PRV protein sequences is much closer to levels previously observed for HSV-1 than for the highly stable VZV proteome. Almost 20% of the PRV genome contains tandem short sequence repeats (SSRs), a class of nucleic acids motifs whose length-variation has been associated with changes in DNA binding site efficiency, transcriptional regulation, and protein interactions. We find SSRs throughout the herpesvirus family, and provide the first global characterization of SSRs in viruses, both within and between strains. We find SSR length variation between different isolates of PRV and HSV-1, which may provide a new mechanism for phenotypic variation between strains. Finally, we detected a small number of polymorphic bases within each plaque-purified PRV strain, and we characterize the effect of passage and plaque-purification on these polymorphisms. These data add to growing evidence that even plaque-purified stocks of stable DNA viruses exhibit limited sequence heterogeneity, which likely seeds future strain evolution.
Subtelomeric Rearrangements and Copy Number Variations in People with Intellectual Disabilities
ERIC Educational Resources Information Center
Christofolini, D. M.; De Paula Ramos, M. A.; Kulikowski, L. D.; Da Silva Bellucco, F. T.; Belangero, S. I. N.; Brunoni, D.; Melaragno, M. I.
2010-01-01
Background: The most prevalent type of structural variation in the human genome is represented by copy number variations that can affect transcription levels, sequence, structure and function of genes. Method: In the present study, we used the multiplex ligation-dependent probe amplification (MLPA) technique and quantitative PCR for the detection…
Hatono, Saki; Nishimura, Kaori; Murakami, Yoko; Tsujimura, Mai; Yamagishi, Hiroshi
2017-09-01
The complete sequence of the mitochondrial genome was determined for two cultivars of Brassica rapa . After determining the sequence of a Chinese cabbage variety, 'Oushou hakusai', the sequence of a mizuna variety, 'Chusei shiroguki sensuji kyomizuna', was mapped against the sequence of Chinese cabbage. The precise sequences where the two varieties demonstrated variation were ascertained by direct sequencing. It was found that the mitochondrial genomes of the two varieties are identical over 219,775 bp, with a single nucleotide polymorphism (SNP) between the genomes. Because B. rapa is the maternal species of an amphidiploid crop species, Brassica juncea , the distribution of the SNP was observed both in B. rapa and B. juncea . While the mizuna type SNP was restricted mainly to cultivars of mizuna (japonica group) in B. rapa , the mizuna type was widely distributed in B. juncea . The finding that the two Brassica species have these SNP types in common suggests that the nucleotide substitution occurred in wild B. rapa before both mitotypes were domesticated. It was further inferred that the interspecific hybridization between B. rapa and B. nigra took place twice and resulted in the two mitotypes of cultivated B. juncea .
Liu, Wen; Ghouri, Fozia; Yu, Hang; Li, Xiang; Yu, Shuhong; Shahid, Muhammad Qasim; Liu, Xiangdong
2017-01-01
Common wild rice (Oryza rufipogon Griff.) is an important germplasm for rice breeding, which contains many resistance genes. Re-sequencing provides an unprecedented opportunity to explore the abundant useful genes at whole genome level. Here, we identified the nucleotide-binding site leucine-rich repeat (NBS-LRR) encoding genes by re-sequencing of two wild rice lines (i.e. Huaye 1 and Huaye 2) that were developed from common wild rice. We obtained 128 to 147 million reads with approximately 32.5-fold coverage depth, and uniquely covered more than 89.6% (> = 1 fold) of reference genomes. Two wild rice lines showed high SNP (single-nucleotide polymorphisms) variation rate in 12 chromosomes against the reference genomes of Nipponbare (japonica cultivar) and 93-11 (indica cultivar). InDels (insertion/deletion polymorphisms) count-length distribution exhibited normal distribution in the two lines, and most of the InDels were ranged from -5 to 5 bp. With reference to the Nipponbare genome sequence, we detected a total of 1,209,308 SNPs, 161,117 InDels and 4,192 SVs (structural variations) in Huaye 1, and 1,387,959 SNPs, 180,226 InDels and 5,305 SVs in Huaye 2. A total of 44.9% and 46.9% genes exhibited sequence variations in two wild rice lines compared to the Nipponbare and 93-11 reference genomes, respectively. Analysis of NBS-LRR mutant candidate genes showed that they were mainly distributed on chromosome 11, and NBS domain was more conserved than LRR domain in both wild rice lines. NBS genes depicted higher levels of genetic diversity in Huaye 1 than that found in Huaye 2. Furthermore, protein-protein interaction analysis showed that NBS genes mostly interacted with the cytochrome C protein (Os05g0420600, Os01g0885000 and BGIOSGA038922), while some NBS genes interacted with heat shock protein, DNA-binding activity, Phosphoinositide 3-kinase and a coiled coil region. We explored abundant NBS-LRR encoding genes in two common wild rice lines through genome wide re-sequencing, which proved to be a useful tool to exploit elite NBS-LRR genes in wild rice. The data here provide a foundation for future work aimed at dissecting the genetic basis of disease resistance in rice, and the two wild rice lines will be useful germplasm for the molecular improvement of cultivated rice.
Yu, Hang; Li, Xiang; Yu, Shuhong; Shahid, Muhammad Qasim
2017-01-01
Common wild rice (Oryza rufipogon Griff.) is an important germplasm for rice breeding, which contains many resistance genes. Re-sequencing provides an unprecedented opportunity to explore the abundant useful genes at whole genome level. Here, we identified the nucleotide-binding site leucine-rich repeat (NBS-LRR) encoding genes by re-sequencing of two wild rice lines (i.e. Huaye 1 and Huaye 2) that were developed from common wild rice. We obtained 128 to 147 million reads with approximately 32.5-fold coverage depth, and uniquely covered more than 89.6% (> = 1 fold) of reference genomes. Two wild rice lines showed high SNP (single-nucleotide polymorphisms) variation rate in 12 chromosomes against the reference genomes of Nipponbare (japonica cultivar) and 93–11 (indica cultivar). InDels (insertion/deletion polymorphisms) count-length distribution exhibited normal distribution in the two lines, and most of the InDels were ranged from -5 to 5 bp. With reference to the Nipponbare genome sequence, we detected a total of 1,209,308 SNPs, 161,117 InDels and 4,192 SVs (structural variations) in Huaye 1, and 1,387,959 SNPs, 180,226 InDels and 5,305 SVs in Huaye 2. A total of 44.9% and 46.9% genes exhibited sequence variations in two wild rice lines compared to the Nipponbare and 93–11 reference genomes, respectively. Analysis of NBS-LRR mutant candidate genes showed that they were mainly distributed on chromosome 11, and NBS domain was more conserved than LRR domain in both wild rice lines. NBS genes depicted higher levels of genetic diversity in Huaye 1 than that found in Huaye 2. Furthermore, protein-protein interaction analysis showed that NBS genes mostly interacted with the cytochrome C protein (Os05g0420600, Os01g0885000 and BGIOSGA038922), while some NBS genes interacted with heat shock protein, DNA-binding activity, Phosphoinositide 3-kinase and a coiled coil region. We explored abundant NBS-LRR encoding genes in two common wild rice lines through genome wide re-sequencing, which proved to be a useful tool to exploit elite NBS-LRR genes in wild rice. The data here provide a foundation for future work aimed at dissecting the genetic basis of disease resistance in rice, and the two wild rice lines will be useful germplasm for the molecular improvement of cultivated rice. PMID:28700714
Haraksingh, Rajini R.; Abyzov, Alexej; Gerstein, Mark; Urban, Alexander E.; Snyder, Michael
2011-01-01
Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association type studies, cytogenetics research and diagnostics, and independent validation of CNVs identified from sequencing based technologies. Numerous, array-based platforms for CNV detection exist utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV focused arrays. Our results are important for cost effective CNV detection and validation for both basic and clinical applications. PMID:22140474
A comprehensively molecular haplotype-resolved genome of a European individual
Suk, Eun-Kyung; McEwen, Gayle K.; Duitama, Jorge; Nowick, Katja; Schulz, Sabrina; Palczewski, Stefanie; Schreiber, Stefan; Holloway, Dustin T.; McLaughlin, Stephen; Peckham, Heather; Lee, Clarence; Huebsch, Thomas; Hoehe, Margret R.
2011-01-01
Independent determination of both haplotype sequences of an individual genome is essential to relate genetic variation to genome function, phenotype, and disease. To address the importance of phase, we have generated the most complete haplotype-resolved genome to date, “Max Planck One” (MP1), by fosmid pool-based next generation sequencing. Virtually all SNPs (>99%) and 80,000 indels were phased into haploid sequences of up to 6.3 Mb (N50 ∼1 Mb). The completeness of phasing allowed determination of the concrete molecular haplotype pairs for the vast majority of genes (81%) including potential regulatory sequences, of which >90% were found to be constituted by two different molecular forms. A subset of 159 genes with potentially severe mutations in either cis or trans configurations exemplified in particular the role of phase for gene function, disease, and clinical interpretation of personal genomes (e.g., BRCA1). Extended genomic regions harboring manifold combinations of physically and/or functionally related genes and regulatory elements were resolved into their underlying “haploid landscapes,” which may define the functional genome. Moreover, the majority of genes and functional sequences were found to contain individual or rare SNPs, which cannot be phased from population data alone, emphasizing the importance of molecular phasing for characterizing a genome in its molecular individuality. Our work provides the foundation to understand that the distinction of molecular haplotypes is essential to resolve the (inherently individual) biology of genes, genomes, and disease, establishing a reference point for “phase-sensitive” personal genomics. MP1's annotated haploid genomes are available as a public resource. PMID:21813624
Functional genomics of physiological plasticity and local adaptation in killifish.
Whitehead, Andrew; Galvez, Fernando; Zhang, Shujun; Williams, Larissa M; Oleksiak, Marjorie F
2011-01-01
Evolutionary solutions to the physiological challenges of life in highly variable habitats can span the continuum from evolution of a cosmopolitan plastic phenotype to the evolution of locally adapted phenotypes. Killifish (Fundulus sp.) have evolved both highly plastic and locally adapted phenotypes within different selective contexts, providing a comparative system in which to explore the genomic underpinnings of physiological plasticity and adaptive variation. Importantly, extensive variation exists among populations and species for tolerance to a variety of stressors, and we exploit this variation in comparative studies to yield insights into the genomic basis of evolved phenotypic variation. Notably, species of Fundulus occupy the continuum of osmotic habitats from freshwater to marine and populations within Fundulus heteroclitus span far greater variation in pollution tolerance than across all species of fish. Here, we explore how transcriptome regulation underpins extreme physiological plasticity on osmotic shock and how genomic and transcriptomic variation is associated with locally evolved pollution tolerance. We show that F. heteroclitus quickly acclimate to extreme osmotic shock by mounting a dramatic rapid transcriptomic response including an early crisis control phase followed by a tissue remodeling phase involving many regulatory pathways. We also show that convergent evolution of locally adapted pollution tolerance involves complex patterns of gene expression and genome sequence variation, which is confounded with body-weight dependence for some genes. Similarly, exploiting the natural phenotypic variation associated with other established and emerging model organisms is likely to greatly accelerate the pace of discovery of the genomic basis of phenotypic variation.
Functional Genomics of Physiological Plasticity and Local Adaptation in Killifish
Galvez, Fernando; Zhang, Shujun; Williams, Larissa M.; Oleksiak, Marjorie F.
2011-01-01
Evolutionary solutions to the physiological challenges of life in highly variable habitats can span the continuum from evolution of a cosmopolitan plastic phenotype to the evolution of locally adapted phenotypes. Killifish (Fundulus sp.) have evolved both highly plastic and locally adapted phenotypes within different selective contexts, providing a comparative system in which to explore the genomic underpinnings of physiological plasticity and adaptive variation. Importantly, extensive variation exists among populations and species for tolerance to a variety of stressors, and we exploit this variation in comparative studies to yield insights into the genomic basis of evolved phenotypic variation. Notably, species of Fundulus occupy the continuum of osmotic habitats from freshwater to marine and populations within Fundulus heteroclitus span far greater variation in pollution tolerance than across all species of fish. Here, we explore how transcriptome regulation underpins extreme physiological plasticity on osmotic shock and how genomic and transcriptomic variation is associated with locally evolved pollution tolerance. We show that F. heteroclitus quickly acclimate to extreme osmotic shock by mounting a dramatic rapid transcriptomic response including an early crisis control phase followed by a tissue remodeling phase involving many regulatory pathways. We also show that convergent evolution of locally adapted pollution tolerance involves complex patterns of gene expression and genome sequence variation, which is confounded with body-weight dependence for some genes. Similarly, exploiting the natural phenotypic variation associated with other established and emerging model organisms is likely to greatly accelerate the pace of discovery of the genomic basis of phenotypic variation. PMID:20581107
The 1000 Genomes Project: data management and community access.
Clarke, Laura; Zheng-Bradley, Xiangqun; Smith, Richard; Kulesha, Eugene; Xiao, Chunlin; Toneva, Iliana; Vaughan, Brendan; Preuss, Don; Leinonen, Rasko; Shumway, Martin; Sherry, Stephen; Flicek, Paul
2012-04-27
The 1000 Genomes Project was launched as one of the largest distributed data collection and analysis projects ever undertaken in biology. In addition to the primary scientific goals of creating both a deep catalog of human genetic variation and extensive methods to accurately discover and characterize variation using new sequencing technologies, the project makes all of its data publicly available. Members of the project data coordination center have developed and deployed several tools to enable widespread data access.
Natural and Unanticipated Modifiers of RNAi Activity in Caenorhabditis elegans
Asad, Nadeem; Aw, Wen Yih; Timmons, Lisa
2012-01-01
Organisms used as model genomics systems are maintained as isogenic strains, yet evidence of sequence differences between independently maintained wild-type stocks has been substantiated by whole-genome resequencing data and strain-specific phenotypes. Sequence differences may arise from replication errors, transposon mobilization, meiotic gene conversion, or environmental or chemical assault on the genome. Low frequency alleles or mutations with modest effects on phenotypes can contribute to natural variation, and it has proven possible for such sequences to become fixed by adapted evolutionary enrichment and identified by resequencing. Our objective was to identify and analyze single locus genetic defects leading to RNAi resistance in isogenic strains of Caenorhabditis elegans. In so doing, we uncovered a mutation that arose de novo in an existing strain, which initially frustrated our phenotypic analysis. We also report experimental, environmental, and genetic conditions that can complicate phenotypic analysis of RNAi pathway defects. These observations highlight the potential for unanticipated mutations, coupled with genetic and environmental phenomena, to enhance or suppress the effects of known mutations and cause variation between wild-type strains. PMID:23209671
Spontaneous Mutation Rate in the Smallest Photosynthetic Eukaryotes
Krasovec, Marc; Eyre-Walker, Adam; Sanchez-Ferandin, Sophie
2017-01-01
Abstract Mutation is the ultimate source of genetic variation, and knowledge of mutation rates is fundamental for our understanding of all evolutionary processes. High throughput sequencing of mutation accumulation lines has provided genome wide spontaneous mutation rates in a dozen model species, but estimates from nonmodel organisms from much of the diversity of life are very limited. Here, we report mutation rates in four haploid marine bacterial-sized photosynthetic eukaryotic algae; Bathycoccus prasinos, Ostreococcus tauri, Ostreococcus mediterraneus, and Micromonas pusilla. The spontaneous mutation rate between species varies from μ = 4.4 × 10−10 to 9.8 × 10−10 mutations per nucleotide per generation. Within genomes, there is a two-fold increase of the mutation rate in intergenic regions, consistent with an optimization of mismatch and transcription-coupled DNA repair in coding sequences. Additionally, we show that deviation from the equilibrium GC content increases the mutation rate by ∼2% to ∼12% because of a GC bias in coding sequences. More generally, the difference between the observed and equilibrium GC content of genomes explains some of the inter-specific variation in mutation rates. PMID:28379581
Bosch, Jason; Noubiap, Jean Jacques N; Dandara, Collet; Makubalo, Nomlindo; Wright, Galen; Entfellner, Jean-Baka Domelevo; Tiffin, Nicki; Wonkam, Ambroise
2014-11-01
Mutations in the GJB2 gene, encoding connexin 26, could account for 50% of congenital, nonsyndromic, recessive deafness cases in some Caucasian/Asian populations. There is a scarcity of published data in sub-Saharan Africans. We Sanger sequenced the coding region of the GJB2 gene in 205 Cameroonian and Xhosa South Africans with congenital, nonsyndromic deafness; and performed bioinformatic analysis of variations in the GJB2 gene, incorporating data from the 1000 Genomes Project. Amongst Cameroonian patients, 26.1% were familial. The majority of patients (70%) suffered from sensorineural hearing loss. Ten GJB2 genetic variants were detected by sequencing. A previously reported pathogenic mutation, g.3741_3743delTTC (p.F142del), and a putative pathogenic mutation, g.3816G>A (p.V167M), were identified in single heterozygous samples. Amongst eight the remaining variants, two novel variants, g.3318-41G>A and g.3332G>A, were reported. There were no statistically significant differences in allele frequencies between cases and controls. Principal Components Analyses differentiated between Africans, Asians, and Europeans, but only explained 40% of the variation. The present study is the first to compare African GJB2 sequences with the data from the 1000 Genomes Project and have revealed the low variation between population groups. This finding has emphasized the hypothesis that the prevalence of mutations in GJB2 in nonsyndromic deafness amongst European and Asian populations is due to founder effects arising after these individuals migrated out of Africa, and not to a putative "protective" variant in the genomic structure of GJB2 in Africans. Our results confirm that mutations in GJB2 are not associated with nonsyndromic deafness in Africans.
Charles, Mathieu; Belcram, Harry; Just, Jérémy; Huneau, Cécile; Viollet, Agnès; Couloux, Arnaud; Segurens, Béatrice; Carter, Meredith; Huteau, Virginie; Coriton, Olivier; Appels, Rudi; Samain, Sylvie; Chalhoub, Boulos
2008-01-01
Transposable elements (TEs) constitute >80% of the wheat genome but their dynamics and contribution to size variation and evolution of wheat genomes (Triticum and Aegilops species) remain unexplored. In this study, 10 genomic regions have been sequenced from wheat chromosome 3B and used to constitute, along with all publicly available genomic sequences of wheat, 1.98 Mb of sequence (from 13 BAC clones) of the wheat B genome and 3.63 Mb of sequence (from 19 BAC clones) of the wheat A genome. Analysis of TE sequence proportions (as percentages), ratios of complete to truncated copies, and estimation of insertion dates of class I retrotransposons showed that specific types of TEs have undergone waves of differential proliferation in the B and A genomes of wheat. While both genomes show similar rates and relatively ancient proliferation periods for the Athila retrotransposons, the Copia retrotransposons proliferated more recently in the A genome whereas Gypsy retrotransposon proliferation is more recent in the B genome. It was possible to estimate for the first time the proliferation periods of the abundant CACTA class II DNA transposons, relative to that of the three main retrotransposon superfamilies. Proliferation of these TEs started prior to and overlapped with that of the Athila retrotransposons in both genomes. However, they also proliferated during the same periods as Gypsy and Copia retrotransposons in the A genome, but not in the B genome. As estimated from their insertion dates and confirmed by PCR-based tracing analysis, the majority of differential proliferation of TEs in B and A genomes of wheat (87 and 83%, respectively), leading to rapid sequence divergence, occurred prior to the allotetraploidization event that brought them together in Triticum turgidum and Triticum aestivum, <0.5 million years ago. More importantly, the allotetraploidization event appears to have neither enhanced nor repressed retrotranspositions. We discuss the apparent proliferation of TEs as resulting from their insertion, removal, and/or combinations of both evolutionary forces. PMID:18780739
Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data.
Bolser, Dan; Staines, Daniel M; Pritchard, Emily; Kersey, Paul
2016-01-01
Ensembl Plants ( http://plants.ensembl.org ) is an integrative resource presenting genome-scale information for a growing number of sequenced plant species (currently 33). Data provided includes genome sequence, gene models, functional annotation, and polymorphic loci. Various additional information are provided for variation data, including population structure, individual genotypes, linkage, and phenotype data. In each release, comparative analyses are performed on whole genome and protein sequences, and genome alignments and gene trees are made available that show the implied evolutionary history of each gene family. Access to the data is provided through a genome browser incorporating many specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These access routes are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests, and pollinators.Ensembl Plants is updated 4-5 times a year and is developed in collaboration with our international partners in the Gramene ( http://www.gramene.org ) and transPLANT projects ( http://www.transplantdb.org ).
Chao, Michael C.; Pritchard, Justin R.; Zhang, Yanjia J.; Rubin, Eric J.; Livny, Jonathan; Davis, Brigid M.; Waldor, Matthew K.
2013-01-01
The coupling of high-density transposon mutagenesis to high-throughput DNA sequencing (transposon-insertion sequencing) enables simultaneous and genome-wide assessment of the contributions of individual loci to bacterial growth and survival. We have refined analysis of transposon-insertion sequencing data by normalizing for the effect of DNA replication on sequencing output and using a hidden Markov model (HMM)-based filter to exploit heretofore unappreciated information inherent in all transposon-insertion sequencing data sets. The HMM can smooth variations in read abundance and thereby reduce the effects of read noise, as well as permit fine scale mapping that is independent of genomic annotation and enable classification of loci into several functional categories (e.g. essential, domain essential or ‘sick’). We generated a high-resolution map of genomic loci (encompassing both intra- and intergenic sequences) that are required or beneficial for in vitro growth of the cholera pathogen, Vibrio cholerae. This work uncovered new metabolic and physiologic requirements for V. cholerae survival, and by combining transposon-insertion sequencing and transcriptomic data sets, we also identified several novel noncoding RNA species that contribute to V. cholerae growth. Our findings suggest that HMM-based approaches will enhance extraction of biological meaning from transposon-insertion sequencing genomic data. PMID:23901011
The BIG Data Center: from deposition to integration to translation.
2017-01-04
Biological data are generated at unprecedentedly exponential rates, posing considerable challenges in big data deposition, integration and translation. The BIG Data Center, established at Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, provides a suite of database resources, including (i) Genome Sequence Archive, a data repository specialized for archiving raw sequence reads, (ii) Gene Expression Nebulas, a data portal of gene expression profiles based entirely on RNA-Seq data, (iii) Genome Variation Map, a comprehensive collection of genome variations for featured species, (iv) Genome Warehouse, a centralized resource housing genome-scale data with particular focus on economically important animals and plants, (v) Methylation Bank, an integrated database of whole-genome single-base resolution methylomes and (vi) Science Wikis, a central access point for biological wikis developed for community annotations. The BIG Data Center is dedicated to constructing and maintaining biological databases through big data integration and value-added curation, conducting basic research to translate big data into big knowledge and providing freely open access to a variety of data resources in support of worldwide research activities in both academia and industry. All of these resources are publicly available and can be found at http://bigd.big.ac.cn. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Genome Sequencing and Analysis of Geographically Diverse Clinical Isolates of Herpes Simplex Virus 2
Lamers, Susanna L.; Weiner, Brian; Ray, Stuart C.; Colgrove, Robert C.; Diaz, Fernando; Jing, Lichen; Wang, Kening; Saif, Sakina; Young, Sarah; Henn, Matthew; Laeyendecker, Oliver; Tobian, Aaron A. R.; Cohen, Jeffrey I.; Koelle, David M.; Quinn, Thomas C.; Knipe, David M.
2015-01-01
ABSTRACT Herpes simplex virus 2 (HSV-2), the principal causative agent of recurrent genital herpes, is a highly prevalent viral infection worldwide. Limited information is available on the amount of genomic DNA variation between HSV-2 strains because only two genomes have been determined, the HG52 laboratory strain and the newly sequenced SD90e low-passage-number clinical isolate strain, each from a different geographical area. In this study, we report the nearly complete genome sequences of 34 HSV-2 low-passage-number and laboratory strains, 14 of which were collected in Uganda, 1 in South Africa, 11 in the United States, and 8 in Japan. Our analyses of these genomes demonstrated remarkable sequence conservation, regardless of geographic origin, with the maximum nucleotide divergence between strains being 0.4% across the genome. In contrast, prior studies indicated that HSV-1 genomes exhibit more sequence diversity, as well as geographical clustering. Additionally, unlike HSV-1, little viral recombination between HSV-2 strains could be substantiated. These results are interpreted in light of HSV-2 evolution, epidemiology, and pathogenesis. Finally, the newly generated sequences more closely resemble the low-passage-number SD90e than HG52, supporting the use of the former as the new reference genome of HSV-2. IMPORTANCE Herpes simplex virus 2 (HSV-2) is a causative agent of genital and neonatal herpes. Therefore, knowledge of its DNA genome and genetic variability is central to preventing and treating genital herpes. However, only two full-length HSV-2 genomes have been reported. In this study, we sequenced 34 additional HSV-2 low-passage-number and laboratory viral genomes and initiated analysis of the genetic diversity of HSV-2 strains from around the world. The analysis of these genomes will facilitate research aimed at vaccine development, diagnosis, and the evaluation of clinical manifestations and transmission of HSV-2. This information will also contribute to our understanding of HSV evolution. PMID:26018166
Evolution and Diversity in Human Herpes Simplex Virus Genomes
Gatherer, Derek; Ochoa, Alejandro; Greenbaum, Benjamin; Dolan, Aidan; Bowden, Rory J.; Enquist, Lynn W.; Legendre, Matthieu; Davison, Andrew J.
2014-01-01
Herpes simplex virus 1 (HSV-1) causes a chronic, lifelong infection in >60% of adults. Multiple recent vaccine trials have failed, with viral diversity likely contributing to these failures. To understand HSV-1 diversity better, we comprehensively compared 20 newly sequenced viral genomes from China, Japan, Kenya, and South Korea with six previously sequenced genomes from the United States, Europe, and Japan. In this diverse collection of passaged strains, we found that one-fifth of the newly sequenced members share a gene deletion and one-third exhibit homopolymeric frameshift mutations (HFMs). Individual strains exhibit genotypic and potential phenotypic variation via HFMs, deletions, short sequence repeats, and single-nucleotide polymorphisms, although the protein sequence identity between strains exceeds 90% on average. In the first genome-scale analysis of positive selection in HSV-1, we found signs of selection in specific proteins and residues, including the fusion protein glycoprotein H. We also confirmed previous results suggesting that recombination has occurred with high frequency throughout the HSV-1 genome. Despite this, the HSV-1 strains analyzed clustered by geographic origin during whole-genome distance analysis. These data shed light on likely routes of HSV-1 adaptation to changing environments and will aid in the selection of vaccine antigens that are invariant worldwide. PMID:24227835
Champoiseau, P; Daugrois, J-H; Pieretti, I; Cociancich, S; Royer, M; Rott, P
2006-10-01
ABSTRACT Pathogenicity of 75 strains of Xanthomonas albilineans from Guadeloupe was assessed by inoculation of sugarcane cv. B69566, which is susceptible to leaf scald, and 19 of the strains were selected as representative of the variation in pathogenicity observed based on stalk colonization. In vitro production of albicidin varied among these 19 strains, but the restriction fragment length polymorphism pattern of their albicidin biosynthesis genes was identical. Similarly, no genomic variation was found among strains by pulsed-field gel electrophoresis. Some variation among strains was found by amplified fragment length polymorphism, but no relationship between this genetic variation and variation in pathogenicity was found. Only 3 (pilB, rpfA, and xpsE) of 40 genes involved in pathogenicity of bacterial species closely related to X. albilineans could be amplified by polymerase chain reaction from total genomic DNA of all nine strains tested of X. albilineans differing in pathogenicity in Guadeloupe. Nucleotide sequences of these genes were 100% identical among strains, and a phylogenetic study with these genes and housekeeping genes efp and ihfA suggested that X. albilineans is on an evolutionary road between the X. campestris group and Xylella fastidiosa, another vascular plant pathogen. Sequencing of the complete genome of Xanthomonas albilineans could be the next step in deciphering molecular mechanisms involved in pathogenicity of X. albilineans.
Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes.
Oyola, Samuel O; Otto, Thomas D; Gu, Yong; Maslen, Gareth; Manske, Magnus; Campino, Susana; Turner, Daniel J; Macinnis, Bronwyn; Kwiatkowski, Dominic P; Swerdlow, Harold P; Quail, Michael A
2012-01-03
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences. We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of sequence data generated, we show that our optimized conditions that involve a PCR additive (TMAC), produces amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC neutral templates. We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of either extremes of base composition. This development will greatly benefit sequencing clinical samples that often require amplification due to low mass of DNA starting material.
Diversity arrays technology: a generic genome profiling technology on open platforms.
Kilian, Andrzej; Wenzl, Peter; Huttner, Eric; Carling, Jason; Xia, Ling; Blois, Hélène; Caig, Vanessa; Heller-Uszynska, Katarzyna; Jaccoud, Damian; Hopper, Colleen; Aschenbrenner-Kilian, Malgorzata; Evers, Margaret; Peng, Kaiman; Cayla, Cyril; Hok, Puthick; Uszynski, Grzegorz
2012-01-01
In the last 20 years, we have observed an exponential growth of the DNA sequence data and simular increase in the volume of DNA polymorphism data generated by numerous molecular marker technologies. Most of the investment, and therefore progress, concentrated on human genome and genomes of selected model species. Diversity Arrays Technology (DArT), developed over a decade ago, was among the first "democratizing" genotyping technologies, as its performance was primarily driven by the level of DNA sequence variation in the species rather than by the level of financial investment. DArT also proved more robust to genome size and ploidy-level differences among approximately 60 organisms for which DArT was developed to date compared to other high-throughput genotyping technologies. The success of DArT in a number of organisms, including a wide range of "orphan crops," can be attributed to the simplicity of underlying concepts: DArT combines genome complexity reduction methods enriching for genic regions with a highly parallel assay readout on a number of "open-access" microarray platforms. The quantitative nature of the assay enabled a number of applications in which allelic frequencies can be estimated from DArT arrays. A typical DArT assay tests for polymorphism tens of thousands of genomic loci with the final number of markers reported (hundreds to thousands) reflecting the level of DNA sequence variation in the tested loci. Detailed DArT methods, protocols, and a range of their application examples as well as DArT's evolution path are presented.
Genome diversity in Brachypodium distachyon: deep sequencing of highly diverse inbred lines
USDA-ARS?s Scientific Manuscript database
Natural variation provides a powerful opportunity to study the genetic basis of biological traits. Brachypodium distachyon is a broadly distributed diploid model grass with a small genome and a large collection of diverse inbred lines. As a step towards understanding the genetic basis of the natura...
Dukić, Marinela; Berner, Daniel; Roesti, Marius; Haag, Christoph R; Ebert, Dieter
2016-10-13
Recombination rate is an essential parameter for many genetic analyses. Recombination rates are highly variable across species, populations, individuals and different genomic regions. Due to the profound influence that recombination can have on intraspecific diversity and interspecific divergence, characterization of recombination rate variation emerges as a key resource for population genomic studies and emphasises the importance of high-density genetic maps as tools for studying genome biology. Here we present such a high-density genetic map for Daphnia magna, and analyse patterns of recombination rate across the genome. A F2 intercross panel was genotyped by Restriction-site Associated DNA sequencing to construct the third-generation linkage map of D. magna. The resulting high-density map included 4037 markers covering 813 scaffolds and contigs that sum up to 77 % of the currently available genome draft sequence (v2.4) and 55 % of the estimated genome size (238 Mb). Total genetic length of the map presented here is 1614.5 cM and the genome-wide recombination rate is estimated to 6.78 cM/Mb. Merging genetic and physical information we consistently found that recombination rate estimates are high towards the peripheral parts of the chromosomes, while chromosome centres, harbouring centromeres in D. magna, show very low recombination rate estimates. Due to its high-density, the third-generation linkage map for D. magna can be coupled with the draft genome assembly, providing an essential tool for genome investigation in this model organism. Thus, our linkage map can be used for the on-going improvements of the genome assembly, but more importantly, it has enabled us to characterize variation in recombination rate across the genome of D. magna for the first time. These new insights can provide a valuable assistance in future studies of the genome evolution, mapping of quantitative traits and population genetic studies.
Understanding Neurodevelopmental Disorders: The Promise of Regulatory Variation in the 3'UTRome.
Wanke, Kai A; Devanna, Paolo; Vernes, Sonja C
2018-04-01
Neurodevelopmental disorders have a strong genetic component, but despite widespread efforts, the specific genetic factors underlying these disorders remain undefined for a large proportion of affected individuals. Given the accessibility of exome sequencing, this problem has thus far been addressed from a protein-centric standpoint; however, protein-coding regions only make up ∼1% to 2% of the human genome. With the advent of whole genome sequencing we are in the midst of a paradigm shift as it is now possible to interrogate the entire sequence of the human genome (coding and noncoding) to fill in the missing heritability of complex disorders. These new technologies bring new challenges, as the number of noncoding variants identified per individual can be overwhelming, making it prudent to focus on noncoding regions of known function, for which the effects of variation can be predicted and directly tested to assess pathogenicity. The 3'UTRome is a region of the noncoding genome that perfectly fulfills these criteria and is of high interest when searching for pathogenic variation related to complex neurodevelopmental disorders. Herein, we review the regulatory roles of the 3'UTRome as binding sites for microRNAs or RNA binding proteins, or during alternative polyadenylation. We detail existing evidence that these regions contribute to neurodevelopmental disorders and outline strategies for identification and validation of novel putatively pathogenic variation in these regions. This evidence suggests that studying the 3'UTRome will lead to the identification of new risk factors, new candidate disease genes, and a better understanding of the molecular mechanisms contributing to neurodevelopmental disorders. Copyright © 2017 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.
Zheng, Renhua; Xu, Haibin; Zhou, Yanwei; Li, Meiping; Lu, Fengjuan; Dong, Yini; Liu, Xin; Chen, Jinhui; Shi, Jisen
2016-01-01
Glyptostrobus pensilis, belonging to the monotypic genus Glyptostrobus (Family: Cupressaceae), is an ancient conifer that is naturally distributed in low-lying wet areas. Here, we report the complete chloroplast (cp) genome sequence (132,239 bp) of G. pensilis. The G. pensilis cp genome is similar in gene content, organization and genome structure to the sequenced cp genomes from other cupressophytes, especially with respect to the loss of the inverted repeat region A (IRA). Through phylogenetic analysis, we demonstrated that the genus Glyptostrobus is closely related to the genus Cryptomeria, supporting previous findings based on physiological characteristics. Since IRs play an important role in stabilize cp genome and conifer cp genomes lost different IR regions after splitting in two clades (cupressophytes and Pinaceae), we performed cp genome rearrangement analysis and found more extensive cp genome rearrangements among the species of cupressophytes relative to Pinaceae. Additional repeat analysis indicated that cupressophytes cp genomes contained less potential functional repeats, especially in Cupressaceae, compared with Pinaceae. These results suggested that dynamics of cp genome rearrangement in conifers differed since the two clades, Pinaceae and cupressophytes, lost IR copies independently and developed different repeats to complement the residual IRs. In addition, we identified 170 perfect simple sequence repeats that will be useful in future research focusing on the evolution of genetic diversity and conservation of genetic variation for this endangered species in the wild. PMID:27560965
Smoot, James C; Barbian, Kent D; Van Gompel, Jamie J; Smoot, Laura M; Chaussee, Michael S; Sylva, Gail L; Sturdevant, Daniel E; Ricklefs, Stacy M; Porcella, Stephen F; Parkins, Larye D; Beres, Stephen B; Campbell, David S; Smith, Todd M; Zhang, Qing; Kapur, Vivek; Daly, Judy A; Veasy, L George; Musser, James M
2002-04-02
Acute rheumatic fever (ARF), a sequelae of group A Streptococcus (GAS) infection, is the most common cause of preventable childhood heart disease worldwide. The molecular basis of ARF and the subsequent rheumatic heart disease are poorly understood. Serotype M18 GAS strains have been associated for decades with ARF outbreaks in the U.S. As a first step toward gaining new insight into ARF pathogenesis, we sequenced the genome of strain MGAS8232, a serotype M18 organism isolated from a patient with ARF. The genome is a circular chromosome of 1,895,017 bp, and it shares 1.7 Mb of closely related genetic material with strain SF370 (a sequenced serotype M1 strain). Strain MGAS8232 has 178 ORFs absent in SF370. Phages, phage-like elements, and insertion sequences are the major sources of variation between the genomes. The genomes of strain MGAS8232 and SF370 encode many of the same proven or putative virulence factors. Importantly, strain MGAS8232 has genes encoding many additional secreted proteins involved in human-GAS interactions, including streptococcal pyrogenic exotoxin A (scarlet fever toxin) and two uncharacterized pyrogenic exotoxin homologues, all phage-associated. DNA microarray analysis of 36 serotype M18 strains from diverse localities showed that most regions of variation were phages or phage-like elements. Two epidemics of ARF occurring 12 years apart in Salt Lake City, UT, were caused by serotype M18 strains that were genetically identical, or nearly so. Our analysis provides a critical foundation for accelerated research into ARF pathogenesis and a molecular framework to study the plasticity of GAS genomes.
Comparative pathogenomics of Clostridium tetani.
Cohen, Jonathan E; Wang, Rong; Shen, Rong-Fong; Wu, Wells W; Keller, James E
2017-01-01
Clostridium tetani and Clostridium botulinum produce two of the most potent neurotoxins known, tetanus neurotoxin and botulinum neurotoxin, respectively. Extensive biochemical and genetic investigation has been devoted to identifying and characterizing various C. botulinum strains. Less effort has been focused on studying C. tetani likely because recently sequenced strains of C. tetani show much less genetic diversity than C. botulinum strains and because widespread vaccination efforts have reduced the public health threat from tetanus. Our aim was to acquire genomic data on the U.S. vaccine strain of C. tetani to better understand its genetic relationship to previously published genomic data from European vaccine strains. We performed high throughput genomic sequence analysis on two wild-type and two vaccine C. tetani strains. Comparative genomic analysis was performed using these and previously published genomic data for seven other C. tetani strains. Our analysis focused on single nucleotide polymorphisms (SNP) and four distinct constituents of the mobile genome (mobilome): a hypervariable flagellar glycosylation island region, five conserved bacteriophage insertion regions, variations in three CRISPR (clustered regularly interspaced short palindromic repeats)-Cas (CRISPR-associated) systems, and a single plasmid. Intact type IA and IB CRISPR/Cas systems were within 10 of 11 strains. A type IIIA CRISPR/Cas system was present in two strains. Phage infection histories derived from CRISPR-Cas sequences indicate C. tetani encounters phages common among commensal gut bacteria and soil-borne organisms consistent with C. tetani distribution in nature. All vaccine strains form a clade distinct from currently sequenced wild type strains when considering variations in these mobile elements. SNP, flagellar glycosylation island, prophage content and CRISPR/Cas phylogenic histories provide tentative evidence suggesting vaccine and wild type strains share a common ancestor.
Iskow, Rebecca C.; Austermann, Christian; Scharer, Christopher D.; Raj, Towfique; Boss, Jeremy M.; Sunyaev, Shamil; Price, Alkes; Stranger, Barbara; Simon, Viviana; Lee, Charles
2013-01-01
Ancient population structure shaping contemporary genetic variation has been recently appreciated and has important implications regarding our understanding of the structure of modern human genomes. We identified a ∼36-kb DNA segment in the human genome that displays an ancient substructure. The variation at this locus exists primarily as two highly divergent haplogroups. One of these haplogroups (the NE1 haplogroup) aligns with the Neandertal haplotype and contains a 4.6-kb deletion polymorphism in perfect linkage disequilibrium with 12 single nucleotide polymorphisms (SNPs) across diverse populations. The other haplogroup, which does not contain the 4.6-kb deletion, aligns with the chimpanzee haplotype and is likely ancestral. Africans have higher overall pairwise differences with the Neandertal haplotype than Eurasians do for this NE1 locus (p<10−15). Moreover, the nucleotide diversity at this locus is higher in Eurasians than in Africans. These results mimic signatures of recent Neandertal admixture contributing to this locus. However, an in-depth assessment of the variation in this region across multiple populations reveals that African NE1 haplotypes, albeit rare, harbor more sequence variation than NE1 haplotypes found in Europeans, indicating an ancient African origin of this haplogroup and refuting recent Neandertal admixture. Population genetic analyses of the SNPs within each of these haplogroups, along with genome-wide comparisons revealed significant FST (p = 0.00003) and positive Tajima's D (p = 0.00285) statistics, pointing to non-neutral evolution of this locus. The NE1 locus harbors no protein-coding genes, but contains transcribed sequences as well as sequences with putative regulatory function based on bioinformatic predictions and in vitro experiments. We postulate that the variation observed at this locus predates Human–Neandertal divergence and is evolving under balancing selection, especially among European populations. PMID:23593015
Garita-Cambronero, Jerson; Palacio-Bielsa, Ana; López, María M; Cubero, Jaime
2016-01-01
Xanthomonas arboricola is a species in genus Xanthomonas which is mainly comprised of plant pathogens. Among the members of this taxon, X. arboricola pv. pruni, the causal agent of bacterial spot disease of stone fruits and almond, is distributed worldwide although it is considered a quarantine pathogen in the European Union. Herein, we report the draft genome sequence, the classification, the annotation and the sequence analyses of a virulent strain, IVIA 2626.1, and an avirulent strain, CITA 44, of X. arboricola associated with Prunus spp. The draft genome sequence of IVIA 2626.1 consists of 5,027,671 bp, 4,720 protein coding genes and 50 RNA encoding genes. The draft genome sequence of strain CITA 44 consists of 4,760,482 bp, 4,250 protein coding genes and 56 RNA coding genes. Initial comparative analyses reveals differences in the presence of structural and regulatory components of the type IV pilus, the type III secretion system, the type III effectors as well as variations in the number of the type IV secretion systems. The genome sequence data for these strains will facilitate the development of molecular diagnostics protocols that differentiate virulent and avirulent strains. In addition, comparative genome analysis will provide insights into the plant-pathogen interaction during the bacterial spot disease process.
Roach, David J.; Burton, Joshua N.; Lee, Choli; Stackhouse, Bethany; Butler-Wu, Susan M.; Cookson, Brad T.
2015-01-01
Bacterial whole genome sequencing holds promise as a disruptive technology in clinical microbiology, but it has not yet been applied systematically or comprehensively within a clinical context. Here, over the course of one year, we performed prospective collection and whole genome sequencing of nearly all bacterial isolates obtained from a tertiary care hospital’s intensive care units (ICUs). This unbiased collection of 1,229 bacterial genomes from 391 patients enables detailed exploration of several features of clinical pathogens. A sizable fraction of isolates identified as clinically relevant corresponded to previously undescribed species: 12% of isolates assigned a species-level classification by conventional methods actually qualified as distinct, novel genomospecies on the basis of genomic similarity. Pan-genome analysis of the most frequently encountered pathogens in the collection revealed substantial variation in pan-genome size (1,420 to 20,432 genes) and the rate of gene discovery (1 to 152 genes per isolate sequenced). Surprisingly, although potential nosocomial transmission of actively surveilled pathogens was rare, 8.7% of isolates belonged to genomically related clonal lineages that were present among multiple patients, usually with overlapping hospital admissions, and were associated with clinically significant infection in 62% of patients from which they were recovered. Multi-patient clonal lineages were particularly evident in the neonatal care unit, where seven separate Staphylococcus epidermidis clonal lineages were identified, including one lineage associated with bacteremia in 5/9 neonates. Our study highlights key differences in the information made available by conventional microbiological practices versus whole genome sequencing, and motivates the further integration of microbial genome sequencing into routine clinical care. PMID:26230489
Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don
Hirakawa, Hideki; Okada, Yoshihiro; Tabuchi, Hiroaki; Shirasawa, Kenta; Watanabe, Akiko; Tsuruoka, Hisano; Minami, Chiharu; Nakayama, Shinobu; Sasamoto, Shigemi; Kohara, Mitsuyo; Kishida, Yoshie; Fujishiro, Tsunakazu; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yoshinaga, Masaru; Takahata, Yasuhiro; Tanaka, Masaru; Tabata, Satoshi; Isobe, Sachiko N.
2015-01-01
Ipomoea trifida (H. B. K.) G. Don. is the most likely diploid ancestor of the hexaploid sweet potato, I. batatas (L.) Lam. To assist in analysis of the sweet potato genome, de novo whole-genome sequencing was performed with two lines of I. trifida, namely the selfed line Mx23Hm and the highly heterozygous line 0431-1, using the Illumina HiSeq platform. We classified the sequences thus obtained as either ‘core candidates’ (common to the two lines) or ‘line specific’. The total lengths of the assembled sequences of Mx23Hm (ITR_r1.0) was 513 Mb, while that of 0431-1 (ITRk_r1.0) was 712 Mb. Of the assembled sequences, 240 Mb (Mx23Hm) and 353 Mb (0431-1) were classified into core candidate sequences. A total of 62,407 (62.4 Mb) and 109,449 (87.2 Mb) putative genes were identified, respectively, in the genomes of Mx23Hm and 0431-1, of which 11,823 were derived from core sequences of Mx23Hm, while 28,831 were from the core candidate sequence of 0431-1. There were a total of 1,464,173 single-nucleotide polymorphisms and 16,682 copy number variations (CNVs) in the two assembled genomic sequences (under the condition of log2 ratio of >1 and CNV size >1,000 bases). The results presented here are expected to contribute to the progress of genomic and genetic studies of I. trifida, as well as studies of the sweet potato and the genus Ipomoea in general. PMID:25805887
Developing 100K Affymetrix Axiom SNP Array for Polyploid Sugarcane
USDA-ARS?s Scientific Manuscript database
Sugarcane genotyping or fingerprinting has long been a daunting task due to its high polyploidy level with large number of chromosomes. Single nucleotide polymorphisms (SNPs) are very abundant DNA sequence variations in the genomes. With the advance of next generation sequencing (NGS) technologies, ...
Population sequencing reveals breed and sub-species specific CNVs in cattle
USDA-ARS?s Scientific Manuscript database
Individualized copy number variation (CNV) maps have highlighted the need for population surveys of cattle to detect rare and common variants. While SNP and comparative genomic hybridization (CGH) arrays have provided preliminary data, next-generation sequence (NGS) data analysis offers an increased...
Erickson, Robert P
2016-01-01
The advent of next generation sequencing (NGS, which consists of massively parallel sequencing to perform TGS (total genome sequencing) or WES (whole exome sequencing)) has abundantly discovered many causative mutations in patients with pediatric neurological disease. A surprisingly high number of these are de novo mutations which have not been inherited from either parent. For epilepsy, autism spectrum disorders, and neuromotor disorders, including cerebral palsy, initial estimates put the frequency of causative de novo mutations at about 15% and about 10% of these are somatic. There are some shared mutated genes between these three classes of disease. Studies of copy number variation by comparative genomic hybridization (CGH) proceded the NGS approaches but they also detect de novo variation which is especially important for ASDs. There are interesting differences between the mutated genes detected by CGS and NGS. In summary, de novo mutations cause a very significant proportion of pediatric neurological disease. Copyright © 2015 Elsevier B.V. All rights reserved.
The complete chloroplast genome sequence of Actinidia arguta using the PacBio RS II platform
Lin, Miaomiao; Qi, Xiujuan; Chen, Jinyong; Sun, Leiming; Zhong, Yunpeng; Fang, Jinbao; Hu, Chungen
2018-01-01
Actinidia arguta is the most basal species in a phylogenetically and economically important genus in the family Actinidiaceae. To better understand the molecular basis of the Actinidia arguta chloroplast (cp), we sequenced the complete cp genome from A. arguta using Illumina and PacBio RS II sequencing technologies. The cp genome from A. arguta was 157,611 bp in length and composed of a pair of 24,232 bp inverted repeats (IRs) separated by a 20,463 bp small single copy region (SSC) and an 88,684 bp large single copy region (LSC). Overall, the cp genome contained 113 unique genes. The cp genomes from A. arguta and three other Actinidia species from GenBank were subjected to a comparative analysis. Indel mutation events and high frequencies of base substitution were identified, and the accD and ycf2 genes showed a high degree of variation within Actinidia. Forty-seven simple sequence repeats (SSRs) and 155 repetitive structures were identified, further demonstrating the rapid evolution in Actinidia. The cp genome analysis and the identification of variable loci provide vital information for understanding the evolution and function of the chloroplast and for characterizing Actinidia population genetics. PMID:29795601
Serres-Armero, Aitor; Povolotskaya, Inna S; Quilez, Javier; Ramirez, Oscar; Santpere, Gabriel; Kuderna, Lukas F K; Hernandez-Rodriguez, Jessica; Fernandez-Callejo, Marcos; Gomez-Sanchez, Daniel; Freedman, Adam H; Fan, Zhenxin; Novembre, John; Navarro, Arcadi; Boyko, Adam; Wayne, Robert; Vilà, Carles; Lorente-Galdos, Belen; Marques-Bonet, Tomas
2017-12-19
Whole genome re-sequencing data from dogs and wolves are now commonly used to study how natural and artificial selection have shaped the patterns of genetic diversity. Single nucleotide polymorphisms, microsatellites and variants in mitochondrial DNA have been interrogated for links to specific phenotypes or signals of domestication. However, copy number variation (CNV), despite its increasingly recognized importance as a contributor to phenotypic diversity, has not been extensively explored in canids. Here, we develop a new accurate probabilistic framework to create fine-scale genomic maps of segmental duplications (SDs), compare patterns of CNV across groups and investigate their role in the evolution of the domestic dog by using information from 34 canine genomes. Our analyses show that duplicated regions are enriched in genes and hence likely possess functional importance. We identify 86 loci with large CNV differences between dogs and wolves, enriched in genes responsible for sensory perception, immune response, metabolic processes, etc. In striking contrast to the observed loss of nucleotide diversity in domestic dogs following the population bottlenecks that occurred during domestication and breed creation, we find a similar proportion of CNV loci in dogs and wolves, suggesting that other dynamics are acting to particularly select for CNVs with potentially functional impacts. This work is the first comparison of genome wide CNV patterns in domestic and wild canids using whole-genome sequencing data and our findings contribute to study the impact of novel kinds of genetic changes on the evolution of the domestic dog.
Kim, Kyunghee; Lee, Sang-Choon; Lee, Junki; Lee, Hyun Oh; Joh, Ho Jun; Kim, Nam-Hoon; Park, Hyun-Seung; Yang, Tae-Jin
2015-01-01
We report complete sequences of chloroplast (cp) genome and 45S nuclear ribosomal DNA (45S nrDNA) for 11 Panax ginseng cultivars. We have obtained complete sequences of cp and 45S nrDNA, the representative barcoding target sequences for cytoplasm and nuclear genome, respectively, based on low coverage NGS sequence of each cultivar. The cp genomes sizes ranged from 156,241 to 156,425 bp and the major size variation was derived from differences in copy number of tandem repeats in the ycf1 gene and in the intergenic regions of rps16-trnUUG and rpl32-trnUAG. The complete 45S nrDNA unit sequences were 11,091 bp, representing a consensus single transcriptional unit with an intergenic spacer region. Comparative analysis of these sequences as well as those previously reported for three Chinese accessions identified very rare but unique polymorphism in the cp genome within P. ginseng cultivars. There were 12 intra-species polymorphisms (six SNPs and six InDels) among 14 cultivars. We also identified five SNPs from 45S nrDNA of 11 Korean ginseng cultivars. From the 17 unique informative polymorphic sites, we developed six reliable markers for analysis of ginseng diversity and cultivar authentication. PMID:26061692
Medici, Maria Cristina; Tummolo, Fabio; Martella, Vito; Arcangeletti, Maria Cristina; De Conto, Flora; Chezzi, Carlo; Fehér, Enikő; Marton, Szilvia; Calderaro, Adriana; Bányai, Krisztián
2016-08-01
Group C rotaviruses (RVC) are enteric pathogens of humans and animals. Whole-genome sequences are available only for few RVCs, leaving gaps in our knowledge about their genetic diversity. We determined the full-length genome sequence of two human RVCs (PR2593/2004 and PR713/2012), detected in Italy from hospital-based surveillance for rotavirus infection in 2004 and 2012. In the 11 RNA genomic segments, the two Italian RVCs segregated within separate intra-genotypic lineages showed variation ranging from 1.9 % (VP6) to 15.9 % (VP3) at the nucleotide level. Comprehensive analysis of human RVC sequences available in the databases allowed us to reveal the existence of at least two major genome configurations, defined as type I and type II. Human RVCs of type I were all associated with the M3 VP3 genotype, including the Italian strain PR2593/2004. Conversely, human RVCs of type II were all associated with the M2 VP3 genotype, including the Italian strain PR713/2012. Reassortant RVC strains between these major genome configurations were identified. Although only a few full-genome sequences of human RVCs, mostly of Asian origin, are available, the analysis of human RVC sequences retrieved from the databases indicates that at least two intra-genotypic RVC lineages circulate in European countries. Gathering more sequence data is necessary to develop a standardized genotype and intra-genotypic lineage classification system useful for epidemiological investigations and avoiding confusion in the literature.
Guo, Xiaosen; Brenner, Max; Zhang, Xuemei; Laragione, Teresina; Tai, Shuaishuai; Li, Yanhong; Bu, Junjie; Yin, Ye; Shah, Anish A.; Kwan, Kevin; Li, Yingrui; Jun, Wang; Gulko, Pércio S.
2013-01-01
DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and F344 (Fischer) are two inbred rat strains with differences in several phenotypes, including susceptibility to autoimmune disease models and inflammatory responses. While these strains have been extensively studied, little information is available about the DA and F344 genomes, as only the Brown Norway (BN) and spontaneously hypertensive rat strains have been sequenced to date. Here we report the sequencing of the DA and F344 genomes using next-generation Illumina paired-end read technology and the first de novo assembly of a rat genome. DA and F344 were sequenced with an average depth of 32-fold, covered 98.9% of the BN reference genome, and included 97.97% of known rat ESTs. New sequences could be assigned to 59 million positions with previously unknown data in the BN reference genome. Differences between DA, F344, and BN included 19 million positions in novel scaffolds, 4.09 million single nucleotide polymorphisms (SNPs) (including 1.37 million new SNPs), 458,224 short insertions and deletions, and 58,174 structural variants. Genetic differences between DA, F344, and BN, including high-impact SNPs and short insertions and deletions affecting >2500 genes, are likely to account for most of the phenotypic variation between these strains. The new DA and F344 genome sequencing data should facilitate gene discovery efforts in rat models of human disease. PMID:23695301
Guo, Xiaosen; Brenner, Max; Zhang, Xuemei; Laragione, Teresina; Tai, Shuaishuai; Li, Yanhong; Bu, Junjie; Yin, Ye; Shah, Anish A; Kwan, Kevin; Li, Yingrui; Jun, Wang; Gulko, Pércio S
2013-08-01
DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and F344 (Fischer) are two inbred rat strains with differences in several phenotypes, including susceptibility to autoimmune disease models and inflammatory responses. While these strains have been extensively studied, little information is available about the DA and F344 genomes, as only the Brown Norway (BN) and spontaneously hypertensive rat strains have been sequenced to date. Here we report the sequencing of the DA and F344 genomes using next-generation Illumina paired-end read technology and the first de novo assembly of a rat genome. DA and F344 were sequenced with an average depth of 32-fold, covered 98.9% of the BN reference genome, and included 97.97% of known rat ESTs. New sequences could be assigned to 59 million positions with previously unknown data in the BN reference genome. Differences between DA, F344, and BN included 19 million positions in novel scaffolds, 4.09 million single nucleotide polymorphisms (SNPs) (including 1.37 million new SNPs), 458,224 short insertions and deletions, and 58,174 structural variants. Genetic differences between DA, F344, and BN, including high-impact SNPs and short insertions and deletions affecting >2500 genes, are likely to account for most of the phenotypic variation between these strains. The new DA and F344 genome sequencing data should facilitate gene discovery efforts in rat models of human disease.
Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C.; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird
2016-01-01
The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission. PMID:27345719
DOE Office of Scientific and Technical Information (OSTI.GOV)
McLoughlin, Kevin
2016-01-11
This report describes the design and implementation of an algorithm for estimating relative microbial abundances, together with confidence limits, using data from metagenomic DNA sequencing. For the background behind this project and a detailed discussion of our modeling approach for metagenomic data, we refer the reader to our earlier technical report, dated March 4, 2014. Briefly, we described a fully Bayesian generative model for paired-end sequence read data, incorporating the effects of the relative abundances, the distribution of sequence fragment lengths, fragment position bias, sequencing errors and variations between the sampled genomes and the nearest reference genomes. A distinctive featuremore » of our modeling approach is the use of a Chinese restaurant process (CRP) to describe the selection of genomes to be sampled, and thus the relative abundances. The CRP component is desirable for fitting abundances to reads that may map ambiguously to multiple targets, because it naturally leads to sparse solutions that select the best representative from each set of nearly equivalent genomes.« less
Selection of a DNA barcode for Nectriaceae from fungal whole-genomes.
Zeng, Zhaoqing; Zhao, Peng; Luo, Jing; Zhuang, Wenying; Yu, Zhihe
2012-01-01
A DNA barcode is a short segment of sequence that is able to distinguish species. A barcode must ideally contain enough variation to distinguish every individual species and be easily obtained. Fungi of Nectriaceae are economically important and show high species diversity. To establish a standard DNA barcode for this group of fungi, the genomes of Neurospora crassa and 30 other filamentous fungi were compared. The expect value was treated as a criterion to recognize homologous sequences. Four candidate markers, Hsp90, AAC, CDC48, and EF3, were tested for their feasibility as barcodes in the identification of 34 well-established species belonging to 13 genera of Nectriaceae. Two hundred and fifteen sequences were analyzed. Intra- and inter-specific variations and the success rate of PCR amplification and sequencing were considered as important criteria for estimation of the candidate markers. Ultimately, the partial EF3 gene met the requirements for a good DNA barcode: No overlap was found between the intra- and inter-specific pairwise distances. The smallest inter-specific distance of EF3 gene was 3.19%, while the largest intra-specific distance was 1.79%. In addition, there was a high success rate in PCR and sequencing for this gene (96.3%). CDC48 showed sufficiently high sequence variation among species, but the PCR and sequencing success rate was 84% using a single pair of primers. Although the Hsp90 and AAC genes had higher PCR and sequencing success rates (96.3% and 97.5%, respectively), overlapping occurred between the intra- and inter-specific variations, which could lead to misidentification. Therefore, we propose the EF3 gene as a possible DNA barcode for the nectriaceous fungi.
Wang, Daxi; Young, Neil D; Koehler, Anson V; Tan, Patrick; Sohn, Woon-Mok; Korhonen, Pasi K; Gasser, Robin B
2017-07-01
Clonorchiasis is a neglected tropical disease that affects >35 million people mainly in China, Vietnam, South Korea and some parts of Russia. The disease-causing agent, Clonorchis sinensis, is a liver fluke of humans and other piscivorous animals, and has a complex aquatic life cycle involving snails and fish intermediate hosts. Chronic infection in humans causes liver disease and associated complications including malignant bile duct cancer. Central to control and to understanding the epidemiology of this disease is knowledge of the specific identity of the causative agent as well as genetic variation within and among populations of this parasite. Although most published molecular studies seem to suggest that C. sinensis represents a single species and that genetic variation within the species is limited, karyotypic variation within C. sinensis among China, Korea (2n=56) and Russian Far East (2n=14) suggests that this taxon might contain sibling species. Here, we assessed and applied a deep sequencing-bioinformatic approach to sequence and define a reference mitochondrial (mt) genome for a particular isolate of C. sinensis from Korea (Cs-k2), to confirm its specific identity, and compared this mt genome with homologous data sets available for this species. Comparative analyses revealed consistency in the number and structure of genes as well as in the lengths of protein-coding genes, and limited genetic variation among isolates of C. sinensis. Phylogenetic analyses of amino acid sequences predicted from mt genes showed that representatives of C. sinensis clustered together, with absolute nodal support, to the exclusion of other liver fluke representatives, but sub-structuring within C. sinensis was not well supported. The plan now is to proceed with the sequencing, assembly and annotation of a high quality draft nuclear genome of this defined isolate (Cs-k2) as a basis for a detailed investigation of molecular variation within C. sinensis from disparate geographical locations in parts of Asia and to prospect for cryptic species. Copyright © 2017 Elsevier B.V. All rights reserved.
Du, Qingzhang; Gong, Chenrui; Pan, Wei; Zhang, Deqiang
2013-02-01
Gene-derived simple sequence repeats (genic SSRs), also known as functional markers, are often preferred over random genomic markers because they represent variation in gene coding and/or regulatory regions. We characterized 544 genic SSR loci derived from 138 candidate genes involved in wood formation, distributed throughout the genome of Populus tomentosa, a key ecological and cultivated wood production species. Of these SSRs, three-quarters were located in the promoter or intron regions, and dinucleotide (59.7%) and trinucleotide repeat motifs (26.5%) predominated. By screening 15 wild P. tomentosa ecotypes, we identified 188 polymorphic genic SSRs with 861 alleles, 2-7 alleles for each marker. Transferability analysis of 30 random genic SSRs, testing whether these SSRs work in 26 genotypes of five genus Populus sections (outgroup, Salix matsudana), showed that 72% of the SSRs could be amplified in Turanga and 100% could be amplified in Leuce. Based on genotyping of these 26 genotypes, a neighbour-joining analysis showed the expected six phylogenetic groupings. In silico analysis of SSR variation in 220 sequences that are homologous between P. tomentosa and Populus trichocarpa suggested that genic SSR variations between relatives were predominantly affected by repeat motif variations or flanking sequence mutations. Inheritance tests and single-marker associations demonstrated the power of genic SSRs in family-based linkage mapping and candidate gene-based association studies, as well as marker-assisted selection and comparative genomic studies of P. tomentosa and related species.
CNV-TV: a robust method to discover copy number variation from short sequencing reads.
Duan, Junbo; Zhang, Ji-Gang; Deng, Hong-Wen; Wang, Yu-Ping
2013-05-02
Copy number variation (CNV) is an important structural variation (SV) in human genome. Various studies have shown that CNVs are associated with complex diseases. Traditional CNV detection methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution. The next generation sequencing (NGS) technique promises a higher resolution detection of CNVs and several methods were recently proposed for realizing such a promise. However, the performances of these methods are not robust under some conditions, e.g., some of them may fail to detect CNVs of short sizes. There has been a strong demand for reliable detection of CNVs from high resolution NGS data. A novel and robust method to detect CNV from short sequencing reads is proposed in this study. The detection of CNV is modeled as a change-point detection from the read depth (RD) signal derived from the NGS, which is fitted with a total variation (TV) penalized least squares model. The performance (e.g., sensitivity and specificity) of the proposed approach are evaluated by comparison with several recently published methods on both simulated and real data from the 1000 Genomes Project. The experimental results showed that both the true positive rate and false positive rate of the proposed detection method do not change significantly for CNVs with different copy numbers and lengthes, when compared with several existing methods. Therefore, our proposed approach results in a more reliable detection of CNVs than the existing methods.
Draft sequencing and analysis of the genome of pufferfish Takifugu flavidus.
Gao, Yang; Gao, Qiang; Zhang, Huan; Wang, Lingling; Zhang, Fuchong; Yang, Chuanyan; Song, Linsheng
2014-12-01
The pufferfish Takifugu flavidus is an important economic species due to its outstanding flavour and high market value. It has been regarded as an excellent model of genetic study for decades as well. In the present study, three mate-pair libraries of T. flavidus genome were sequenced by the SOLiD 4 next-generation sequencing platform, and the draft genome was constructed with the short reads using an assisted assembly strategy. The draft consists of 50,947 scaffolds with an N50 value of 305.7 kb, and the average GC content was 45.2%. The combined length of repetitive sequences was 26.5 Mb, which accounted for 6.87% of the genome, indicating that the compactness of T. flavidus genome was approximative with that of T. rubripes genome. A total of 1,253 non-coding RNA genes and 30,285 protein-encoding genes were assigned to the genome. There were 132,775 and 394 presumptive genes playing roles in the colour pattern variation, the relatively slow growth and the lipid metabolism, respectively. Among them, genes involved in the microtubule-dependent transport system, angiogenesis, decapentaplegic pathway and lipid mobilization were significantly expanded in the T. flavidus genome. This draft genome provides a valuable resource for understanding and improving both fundamental and applied research with pufferfish in the future. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Emerging Genomic Tools for Legume Breeding: Current Status and Future Prospects
Pandey, Manish K.; Roorkiwal, Manish; Singh, Vikas K.; Ramalingam, Abirami; Kudapa, Himabindu; Thudi, Mahendar; Chitikineni, Anu; Rathore, Abhishek; Varshney, Rajeev K.
2016-01-01
Legumes play a vital role in ensuring global nutritional food security and improving soil quality through nitrogen fixation. Accelerated higher genetic gains is required to meet the demand of ever increasing global population. In recent years, speedy developments have been witnessed in legume genomics due to advancements in next-generation sequencing (NGS) and high-throughput genotyping technologies. Reference genome sequences for many legume crops have been reported in the last 5 years. The availability of the draft genome sequences and re-sequencing of elite genotypes for several important legume crops have made it possible to identify structural variations at large scale. Availability of large-scale genomic resources and low-cost and high-throughput genotyping technologies are enhancing the efficiency and resolution of genetic mapping and marker-trait association studies. Most importantly, deployment of molecular breeding approaches has resulted in development of improved lines in some legume crops such as chickpea and groundnut. In order to support genomics-driven crop improvement at a fast pace, the deployment of breeder-friendly genomics and decision support tools seems appear to be critical in breeding programs in developing countries. This review provides an overview of emerging genomics and informatics tools/approaches that will be the key driving force for accelerating genomics-assisted breeding and ultimately ensuring nutritional and food security in developing countries. PMID:27199998
Naito, Mariko; Ogura, Yoshitoshi; Itoh, Takehiko; Shoji, Mikio; Okamoto, Masaaki; Hayashi, Tetsuya; Nakayama, Koji
2016-01-01
Prevotella intermedia is a pathogenic bacterium involved in periodontal diseases. Here, we present the complete genome sequence of a clinical strain, OMA14, of this bacterium along with the results of comparative genome analysis with strain 17 of the same species whose genome has also been sequenced, but not fully analysed yet. The genomes of both strains consist of two circular chromosomes: the larger chromosomes are similar in size and exhibit a high overall linearity of gene organizations, whereas the smaller chromosomes show a significant size variation and have undergone remarkable genome rearrangements. Unique features of the Pre. intermedia genomes are the presence of a remarkable number of essential genes on the second chromosomes and the abundance of conjugative and mobilizable transposons (CTns and MTns). The CTns/MTns are particularly abundant in the second chromosomes, involved in its extensive genome rearrangement, and have introduced a number of strain-specific genes into each strain. We also found a novel 188-bp repeat sequence that has been highly amplified in Pre. intermedia and are specifically distributed among the Pre. intermedia-related species. These findings expand our understanding of the genetic features of Pre. intermedia and the roles of CTns and MTns in the evolution of bacteria. PMID:26645327
Stafuzza, Nedenia Bonvino; Zerlotini, Adhemar; Lobo, Francisco Pereira; Yamagishi, Michel Eduardo Beleza; Chud, Tatiane Cristina Seleguim; Caetano, Alexandre Rodrigues; Munari, Danísio Prado; Garrick, Dorian J; Machado, Marco Antonio; Martins, Marta Fonseca; Carvalho, Maria Raquel; Cole, John Bruce; Barbosa da Silva, Marcos Vinicius Gualberto
2017-01-01
Whole-genome re-sequencing, alignment and annotation analyses were undertaken for 12 sires representing four important cattle breeds in Brazil: Guzerat (multi-purpose), Gyr, Girolando and Holstein (dairy production). A total of approximately 4.3 billion reads from an Illumina HiSeq 2000 sequencer generated for each animal 10.7 to 16.4-fold genome coverage. A total of 27,441,279 single nucleotide variations (SNVs) and 3,828,041 insertions/deletions (InDels) were detected in the samples, of which 2,557,670 SNVs and 883,219 InDels were novel. The submission of these genetic variants to the dbSNP database significantly increased the number of known variants, particularly for the indicine genome. The concordance rate between genotypes obtained using the Bovine HD BeadChip array and the same variants identified by sequencing was about 99.05%. The annotation of variants identified numerous non-synonymous SNVs and frameshift InDels which could affect phenotypic variation. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway was over represented in all four cattle breeds, while the ECM-receptor interaction pathway was over represented in Girolando and Guzerat breeds, the ABC transporters pathway was over represented only in Holstein breed, and the metabolic pathways was over represented only in Gyr breed. The genetic variants discovered here provide a rich resource to help identify potential genomic markers and their associated molecular mechanisms that impact economically important traits for Gyr, Girolando, Guzerat and Holstein breeding programs.
Lobo, Francisco Pereira; Yamagishi, Michel Eduardo Beleza; Chud, Tatiane Cristina Seleguim; Caetano, Alexandre Rodrigues; Munari, Danísio Prado; Garrick, Dorian J.; Machado, Marco Antonio; Martins, Marta Fonseca; Carvalho, Maria Raquel; Cole, John Bruce; Barbosa da Silva, Marcos Vinicius Gualberto
2017-01-01
Whole-genome re-sequencing, alignment and annotation analyses were undertaken for 12 sires representing four important cattle breeds in Brazil: Guzerat (multi-purpose), Gyr, Girolando and Holstein (dairy production). A total of approximately 4.3 billion reads from an Illumina HiSeq 2000 sequencer generated for each animal 10.7 to 16.4-fold genome coverage. A total of 27,441,279 single nucleotide variations (SNVs) and 3,828,041 insertions/deletions (InDels) were detected in the samples, of which 2,557,670 SNVs and 883,219 InDels were novel. The submission of these genetic variants to the dbSNP database significantly increased the number of known variants, particularly for the indicine genome. The concordance rate between genotypes obtained using the Bovine HD BeadChip array and the same variants identified by sequencing was about 99.05%. The annotation of variants identified numerous non-synonymous SNVs and frameshift InDels which could affect phenotypic variation. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway was over represented in all four cattle breeds, while the ECM-receptor interaction pathway was over represented in Girolando and Guzerat breeds, the ABC transporters pathway was over represented only in Holstein breed, and the metabolic pathways was over represented only in Gyr breed. The genetic variants discovered here provide a rich resource to help identify potential genomic markers and their associated molecular mechanisms that impact economically important traits for Gyr, Girolando, Guzerat and Holstein breeding programs. PMID:28323836
Graña-Miraglia, Lucía; Lozano, Luis F.; Velázquez, Consuelo; Volkow-Fernández, Patricia; Pérez-Oseguera, Ángeles; Cevallos, Miguel A.; Castillo-Ramírez, Santiago
2017-01-01
Genome sequencing has been useful to gain an understanding of bacterial evolution. It has been used for studying the phylogeography and/or the impact of mutation and recombination on bacterial populations. However, it has rarely been used to study gene turnover at microevolutionary scales. Here, we sequenced Mexican strains of the human pathogen Acinetobacter baumannii sampled from the same locale over a 3 year period to obtain insights into the microevolutionary dynamics of gene content variability. We found that the Mexican A. baumannii population was recently founded and has been emerging due to a rapid clonal expansion. Furthermore, we noticed that on average the Mexican strains differed from each other by over 300 genes and, notably, this gene content variation has accrued more frequently and faster than the accumulation of mutations. Moreover, due to its rapid pace, gene content variation reflects the phylogeny only at very short periods of time. Additionally, we found that the external branches of the phylogeny had almost 100 more genes than the internal branches. All in all, these results show that rapid gene turnover has been of paramount importance in producing genetic variation within this population and demonstrate the utility of genome sequencing to study alternative forms of genetic variation. PMID:28979253
Graña-Miraglia, Lucía; Lozano, Luis F; Velázquez, Consuelo; Volkow-Fernández, Patricia; Pérez-Oseguera, Ángeles; Cevallos, Miguel A; Castillo-Ramírez, Santiago
2017-01-01
Genome sequencing has been useful to gain an understanding of bacterial evolution. It has been used for studying the phylogeography and/or the impact of mutation and recombination on bacterial populations. However, it has rarely been used to study gene turnover at microevolutionary scales. Here, we sequenced Mexican strains of the human pathogen Acinetobacter baumannii sampled from the same locale over a 3 year period to obtain insights into the microevolutionary dynamics of gene content variability. We found that the Mexican A. baumannii population was recently founded and has been emerging due to a rapid clonal expansion. Furthermore, we noticed that on average the Mexican strains differed from each other by over 300 genes and, notably, this gene content variation has accrued more frequently and faster than the accumulation of mutations. Moreover, due to its rapid pace, gene content variation reflects the phylogeny only at very short periods of time. Additionally, we found that the external branches of the phylogeny had almost 100 more genes than the internal branches. All in all, these results show that rapid gene turnover has been of paramount importance in producing genetic variation within this population and demonstrate the utility of genome sequencing to study alternative forms of genetic variation.
Savary, Romain; Masclaux, Frédéric G; Wyss, Tania; Droh, Germain; Cruz Corella, Joaquim; Machado, Ana Paula; Morton, Joseph B; Sanders, Ian R
2018-01-01
Arbuscular mycorrhizal fungi (AMF; phylum Gomeromycota) associate with plants forming one of the most successful microbe-plant associations. The fungi promote plant diversity and have a potentially important role in global agriculture. Plant growth depends on both inter- and intra-specific variation in AMF. It was recently reported that an unusually large number of AMF taxa have an intercontinental distribution, suggesting long-distance gene flow for many AMF species, facilitated by either long-distance natural dispersal mechanisms or human-assisted dispersal. However, the intercontinental distribution of AMF species has been questioned because the use of very low-resolution markers may be unsuitable to detect genetic differences among geographically separated AMF, as seen with some other fungi. This has been untestable because of the lack of population genomic data, with high resolution, for any AMF taxa. Here we use phylogenetics and population genomics to test for intra-specific variation in Rhizophagus irregularis, an AMF species for which genome sequence information already exists. We used ddRAD sequencing to obtain thousands of markers distributed across the genomes of 81 R. irregularis isolates and related species. Based on 6 888 variable positions, we observed significant genetic divergence into four main genetic groups within R. irregularis, highlighting that previous studies have not captured underlying genetic variation. Despite considerable genetic divergence, surprisingly, the variation could not be explained by geographical origin, thus also supporting the hypothesis for at least one AMF species of widely dispersed AMF genotypes at an intercontinental scale. Such information is crucial for understanding AMF ecology, and how these fungi can be used in an environmentally safe way in distant locations.
Transcriptome sequencing reveals genome-wide variation in molecular evolutionary rate among ferns.
Grusz, Amanda L; Rothfels, Carl J; Schuettpelz, Eric
2016-08-30
Transcriptomics in non-model plant systems has recently reached a point where the examination of nuclear genome-wide patterns in understudied groups is an achievable reality. This progress is especially notable in evolutionary studies of ferns, for which molecular resources to date have been derived primarily from the plastid genome. Here, we utilize transcriptome data in the first genome-wide comparative study of molecular evolutionary rate in ferns. We focus on the ecologically diverse family Pteridaceae, which comprises about 10 % of fern diversity and includes the enigmatic vittarioid ferns-an epiphytic, tropical lineage known for dramatically reduced morphologies and radically elongated phylogenetic branch lengths. Using expressed sequence data for 2091 loci, we perform pairwise comparisons of molecular evolutionary rate among 12 species spanning the three largest clades in the family and ask whether previously documented heterogeneity in plastid substitution rates is reflected in their nuclear genomes. We then inquire whether variation in evolutionary rate is being shaped by genes belonging to specific functional categories and test for differential patterns of selection. We find significant, genome-wide differences in evolutionary rate for vittarioid ferns relative to all other lineages within the Pteridaceae, but we recover few significant correlations between faster/slower vittarioid loci and known functional gene categories. We demonstrate that the faster rates characteristic of the vittarioid ferns are likely not driven by positive selection, nor are they unique to any particular type of nucleotide substitution. Our results reinforce recently reviewed mechanisms hypothesized to shape molecular evolutionary rates in vittarioid ferns and provide novel insight into substitution rate variation both within and among fern nuclear genomes.
Alu repeat discovery and characterization within human genomes
Hormozdiari, Fereydoun; Alkan, Can; Ventura, Mario; Hajirasouliha, Iman; Malig, Maika; Hach, Faraz; Yorukoglu, Deniz; Dao, Phuong; Bakhshi, Marzieh; Sahinalp, S. Cenk; Eichler, Evan E.
2011-01-01
Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations. PMID:21131385
Draft genome of the mountain pine beetle, Dendroctonus ponderosae Hopkins, a major forest pest
2013-01-01
Background The mountain pine beetle, Dendroctonus ponderosae Hopkins, is the most serious insect pest of western North American pine forests. A recent outbreak destroyed more than 15 million hectares of pine forests, with major environmental effects on forest health, and economic effects on the forest industry. The outbreak has in part been driven by climate change, and will contribute to increased carbon emissions through decaying forests. Results We developed a genome sequence resource for the mountain pine beetle to better understand the unique aspects of this insect's biology. A draft de novo genome sequence was assembled from paired-end, short-read sequences from an individual field-collected male pupa, and scaffolded using mate-paired, short-read genomic sequences from pooled field-collected pupae, paired-end short-insert whole-transcriptome shotgun sequencing reads of mRNA from adult beetle tissues, and paired-end Sanger EST sequences from various life stages. We describe the cytochrome P450, glutathione S-transferase, and plant cell wall-degrading enzyme gene families important to the survival of the mountain pine beetle in its harsh and nutrient-poor host environment, and examine genome-wide single-nucleotide polymorphism variation. A horizontally transferred bacterial sucrose-6-phosphate hydrolase was evident in the genome, and its tissue-specific transcription suggests a functional role for this beetle. Conclusions Despite Coleoptera being the largest insect order with over 400,000 described species, including many agricultural and forest pest species, this is only the second genome sequence reported in Coleoptera, and will provide an important resource for the Curculionoidea and other insects. PMID:23537049
Gouveia, Juceli Gonzalez; Wolf, Ivan Rodrigo; de Moraes-Manécolo, Vivian Patrícia Oliveira; Bardella, Vanessa Belline; Ferracin, Lara Munique; Giuliano-Caetano, Lucia; da Rosa, Renata; Dias, Ana Lúcia
2016-12-01
Sequences of 5S ribosomal RNA (rRNA) are extensively used in fish cytogenomic studies, once they have a flexible organization at the chromosomal level, showing inter- and intra-specific variation in number and position in karyotypes. Sequences from the genome of Imparfinis schubarti (Heptapteridae) were isolated, aiming to understand the organization of 5S rDNA families in the fish genome. The isolation of 5S rDNA from the genome of I. schubarti was carried out by reassociation kinetics (C 0 t) and PCR amplification. The obtained sequences were cloned for the construction of a micro-library. The obtained clones were sequenced and hybridized in I. schubarti and Microglanis cottoides (Pseudopimelodidae) for chromosome mapping. An analysis of the sequence alignments with other fish groups was accomplished. Both methods were effective when using 5S rDNA for hybridization in I. schubarti genome. However, the C 0 t method enabled the use of a complete 5S rRNA gene, which was also successful in the hybridization of M. cottoides. Nevertheless, this gene was obtained only partially by PCR. The hybridization results and sequence analyses showed that intact 5S regions are more appropriate for the probe operation, due to conserved structure and motifs. This study contributes to a better understanding of the organization of multigene families in catfish's genomes.
Liu, Ying; Tang, Yuanman; Qin, Xiyun; Yang, Liang; Jiang, Gaofei; Li, Shili; Ding, Wei
2017-01-01
Ralstonia solanacearum, an agent of bacterial wilt, is a highly variable species with a broad host range and wide geographic distribution. As a species complex, it has extensive genetic diversity and its living environment is polymorphic like the lowland and the highland area, so more genomes are needed for studying population evolution and environment adaptation. In this paper, we reported the genome sequencing of R. solanacearum strain CQPS-1 isolated from wilted tobacco in Pengshui, Chongqing, China, a highland area with severely acidified soil and continuous cropping of tobacco more than 20 years. The comparative genomic analysis among different R. solanacearum strains was also performed. The completed genome size of CQPS-1 was 5.89 Mb and contained the chromosome (3.83 Mb) and the megaplasmid (2.06 Mb). A total of 5229 coding sequences were predicted (the chromosome and megaplasmid encoded 3573 and 1656 genes, respectively). A comparative analysis with eight strains from four phylotypes showed that there was some variation among the species, e.g., a large set of specific genes in CQPS-1. Type III secretion system gene cluster (hrp gene cluster) was conserved in CQPS-1 compared with the reference strain GMI1000. In addition, most genes coding core type III effectors were also conserved with GMI1000, but significant gene variation was found in the gene ripAA: the identity compared with strain GMI1000 was 75% and the hrpII box promoter in the upstream had significantly mutated. This study provided a potential resource for further understanding of the relationship between variation of pathogenicity factors and adaptation to the host environment. PMID:28620361
Fu, Yong-Bi; Peterson, Gregory W; Dong, Yibo
2016-04-07
Genotyping-by-sequencing (GBS) has emerged as a useful genomic approach for exploring genome-wide genetic variation. However, GBS commonly samples a genome unevenly and can generate a substantial amount of missing data. These technical features would limit the power of various GBS-based genetic and genomic analyses. Here we present software called IgCoverage for in silico evaluation of genomic coverage through GBS with an individual or pair of restriction enzymes on one sequenced genome, and report a new set of 21 restriction enzyme combinations that can be applied to enhance GBS applications. These enzyme combinations were developed through an application of IgCoverage on 22 plant, animal, and fungus species with sequenced genomes, and some of them were empirically evaluated with different runs of Illumina MiSeq sequencing in 12 plant species. The in silico analysis of 22 organisms revealed up to eight times more genome coverage for the new combinations consisted of pairing four- or five-cutter restriction enzymes than the commonly used enzyme combination PstI + MspI. The empirical evaluation of the new enzyme combination (HinfI + HpyCH4IV) in 12 plant species showed 1.7-6 times more genome coverage than PstI + MspI, and 2.3 times more genome coverage in dicots than monocots. Also, the SNP genotyping in 12 Arabidopsis and 12 rice plants revealed that HinfI + HpyCH4IV generated 7 and 1.3 times more SNPs (with 0-16.7% missing observations) than PstI + MspI, respectively. These findings demonstrate that these novel enzyme combinations can be utilized to increase genome sampling and improve SNP genotyping in various GBS applications. Copyright © 2016 Fu et al.
The business value and cost-effectiveness of genomic medicine.
Crawford, James M; Aspinall, Mara G
2012-05-01
Genomic medicine offers the promise of more effective diagnosis and treatment of human diseases. Genome sequencing early in the course of disease may enable more timely and informed intervention, with reduced healthcare costs and improved long-term outcomes. However, genomic medicine strains current models for demonstrating value, challenging efforts to achieve fair payment for services delivered, both for laboratory diagnostics and for use of molecular information in clinical management. Current models of healthcare reform stipulate that care must be delivered at equal or lower cost, with better patient and population outcomes. To achieve demonstrated value, genomic medicine must overcome many uncertainties: the clinical relevance of genomic variation; potential variation in technical performance and/or computational analysis; management of massive information sets; and must have available clinical interventions that can be informed by genomic analysis, so as to attain more favorable cost management of healthcare delivery and demonstrate improvements in cost-effectiveness.
Inference of gorilla demographic and selective history from whole-genome sequence data.
McManus, Kimberly F; Kelley, Joanna L; Song, Shiya; Veeramah, Krishna R; Woerner, August E; Stevison, Laurie S; Ryder, Oliver A; Ape Genome Project, Great; Kidd, Jeffrey M; Wall, Jeffrey D; Bustamante, Carlos D; Hammer, Michael F
2015-03-01
Although population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole-genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor approximately 261 ka, and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage approximately 68 ka. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of 1.4-fold around 970 ka and a recent 5.6-fold contraction in population size 23 ka. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Hand, Melanie L.; Spangenberg, German C.; Forster, John W.; Cogan, Noel O. I.
2013-01-01
Chloroplast genome sequences are of broad significance in plant biology, due to frequent use in molecular phylogenetics, comparative genomics, population genetics, and genetic modification studies. The present study used a second-generation sequencing approach to determine and assemble the plastid genomes (plastomes) of four representatives from the agriculturally important Lolium-Festuca species complex of pasture grasses (Lolium multiflorum, Festuca pratensis, Festuca altissima, and Festuca ovina). Total cellular DNA was extracted from either roots or leaves, was sequenced, and the output was filtered for plastome-related reads. A comparison between sources revealed fewer plastome-related reads from root-derived template but an increase in incidental bacterium-derived sequences. Plastome assembly and annotation indicated high levels of sequence identity and a conserved organization and gene content between species. However, frequent deletions within the F. ovina plastome appeared to contribute to a smaller plastid genome size. Comparative analysis with complete plastome sequences from other members of the Poaceae confirmed conservation of most grass-specific features. Detailed analysis of the rbcL–psaI intergenic region, however, revealed a “hot-spot” of variation characterized by independent deletion events. The evolutionary implications of this observation are discussed. The complete plastome sequences are anticipated to provide the basis for potential organelle-specific genetic modification of pasture grasses. PMID:23550121
The complete chloroplast genome sequence of date palm (Phoenix dactylifera L.).
Yang, Meng; Zhang, Xiaowei; Liu, Guiming; Yin, Yuxin; Chen, Kaifu; Yun, Quanzheng; Zhao, Duojun; Al-Mssallem, Ibrahim S; Yu, Jun
2010-09-15
Date palm (Phoenix dactylifera L.), a member of Arecaceae family, is one of the three major economically important woody palms--the two other palms being oil palm and coconut tree--and its fruit is a staple food among Middle East and North African nations, as well as many other tropical and subtropical regions. Here we report a complete sequence of the data palm chloroplast (cp) genome based on pyrosequencing. After extracting 369,022 cp sequencing reads from our whole-genome-shotgun data, we put together an assembly and validated it with intensive PCR-based verification, coupled with PCR product sequencing. The date palm cp genome is 158,462 bp in length and has a typical quadripartite structure of the large (LSC, 86,198 bp) and small single-copy (SSC, 17,712 bp) regions separated by a pair of inverted repeats (IRs, 27,276 bp). Similar to what has been found among most angiosperms, the date palm cp genome harbors 112 unique genes and 19 duplicated fragments in the IR regions. The junctions between LSC/IRs and SSC/IRs show different features of sequence expansion in evolution. We identified 78 SNPs as major intravarietal polymorphisms within the population of a specific cp genome, most of which were located in genes with vital functions. Based on RNA-sequencing data, we also found 18 polycistronic transcription units and three highly expression-biased genes--atpF, trnA-UGC, and rrn23. Unlike most monocots, date palm has a typical cp genome similar to that of tobacco--with little rearrangement and gene loss or gain. High-throughput sequencing technology facilitates the identification of intravarietal variations in cp genomes among different cultivars. Moreover, transcriptomic analysis of cp genes provides clues for uncovering regulatory mechanisms of transcription and translation in chloroplasts.
Variation resources at UC Santa Cruz.
Thomas, Daryl J; Trumbower, Heather; Kern, Andrew D; Rhead, Brooke L; Kuhn, Robert M; Haussler, David; Kent, W James
2007-01-01
The variation resources within the University of California Santa Cruz Genome Browser include polymorphism data drawn from public collections and analyses of these data, along with their display in the context of other genomic annotations. Primary data from dbSNP is included for many organisms, with added information including genomic alleles and orthologous alleles for closely related organisms. Display filtering and coloring is available by variant type, functional class or other annotations. Annotation of potential errors is highlighted and a genomic alignment of the variant's flanking sequence is displayed. HapMap allele frequencies and linkage disequilibrium (LD) are available for each HapMap population, along with non-human primate alleles. The browsing and analysis tools, downloadable data files and links to documentation and other information can be found at http://genome.ucsc.edu/.
Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome
Zhao, Yuhui; Zeng, Jingyao; Alamer, Ali; Alanazi, Ibrahim O.; Alawad, Abdullah O.; Al-Sadi, Abdullah M.; Hu, Songnian; Yu, Jun
2016-01-01
Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower as compared to that of the nuclear genome and neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provides the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants. PMID:27736909
The complete chloroplast genome sequence of Dodonaea viscosa: comparative and phylogenetic analyses.
Saina, Josphat K; Gichira, Andrew W; Li, Zhi-Zhong; Hu, Guang-Wan; Wang, Qing-Feng; Liao, Kuo
2018-02-01
The plant chloroplast (cp) genome is a highly conserved structure which is beneficial for evolution and systematic research. Currently, numerous complete cp genome sequences have been reported due to high throughput sequencing technology. However, there is no complete chloroplast genome of genus Dodonaea that has been reported before. To better understand the molecular basis of Dodonaea viscosa chloroplast, we used Illumina sequencing technology to sequence its complete genome. The whole length of the cp genome is 159,375 base pairs (bp), with a pair of inverted repeats (IRs) of 27,099 bp separated by a large single copy (LSC) 87,204 bp, and small single copy (SSC) 17,972 bp. The annotation analysis revealed a total of 115 unique genes of which 81 were protein coding, 30 tRNA, and four ribosomal RNA genes. Comparative genome analysis with other closely related Sapindaceae members showed conserved gene order in the inverted and single copy regions. Phylogenetic analysis clustered D. viscosa with other species of Sapindaceae with strong bootstrap support. Finally, a total of 249 SSRs were detected. Moreover, a comparison of the synonymous (Ks) and nonsynonymous (Ka) substitution rates in D. viscosa showed very low values. The availability of cp genome reported here provides a valuable genetic resource for comprehensive further studies in genetic variation, taxonomy and phylogenetic evolution of Sapindaceae family. In addition, SSR markers detected will be used in further phylogeographic and population structure studies of the species in this genus.
Population sequencing reveals breed and sub-species specific CNVs in cattle
USDA-ARS?s Scientific Manuscript database
Individualized copy number variation (CNV) maps have highlighted the need for population surveys of cattle to detect the rare and common variants. While SNP and comparative genomic hybridization (CGH) arrays have provided preliminary data, next-generation sequence (NGS) data analysis offers an incre...
Development and utilization of 100K SNP array in Saccharum Spp.
USDA-ARS?s Scientific Manuscript database
Sugarcane genotyping or fingerprinting has long been a daunting task due to its high polyploidy level with large number of chromosomes. Single nucleotide polymorphisms (SNPs) are very abundant DNA sequence variations in the genome. With the advance of next generation sequencing (NGS) technologies, m...
The Arab genome: Health and wealth.
Zayed, Hatem
2016-11-05
The 22 Arab nations have a unique genetic structure, which reflects both conserved and diverse gene pools due to the prevalent endogamous and consanguineous marriage culture and the long history of admixture among different ethnic subcultures descended from the Asian, European, and African continents. Human genome sequencing has enabled large-scale genomic studies of different populations and has become a powerful tool for studying disease predictions and diagnosis. Despite the importance of the Arab genome for better understanding the dynamics of the human genome, discovering rare genetic variations, and studying early human migration out of Africa, it is poorly represented in human genome databases, such as HapMap and the 1000 Genomes Project. In this review, I demonstrate the significance of sequencing the Arab genome and setting an Arab genome reference(s) for better understanding the molecular pathogenesis of genetic diseases, discovering novel/rare variants, and identifying a meaningful genotype-phenotype correlation for complex diseases. Copyright © 2016. Published by Elsevier B.V.