Hylind, Robyn; Smith, Maureen; Rasmussen-Torvik, Laura; Aufox, Sharon
2018-01-01
The management of secondary findings is a challenge to health-care providers relaying clinical genomic-sequencing results to patients. Understanding patients' expectations from non-diagnostic genomic sequencing could help guide this management. This study interviewed 14 individuals enrolled in the eMERGE (Electronic Medical Records and Genomics) study. Participants in eMERGE consent to undergo non-diagnostic genomic sequencing, receive results, and have results returned to their physicians. The interviews assessed expectations and intended use of results. The majority of interviewees were male (64%) and 43% identified as non-Caucasian. A unique theme identified was that many participants expressed uncertainty about the type of diseases they expected to receive results on, what results they wanted to learn about, and how they intended to use results. Participant uncertainty highlights the complex nature of deciding to undergo genomic testing and a deficiency in genomic knowledge. These results could help improve how genomic sequencing and secondary findings are discussed with patients.
Curated eutherian third party data gene data sets.
Premzl, Marko
2016-03-01
The freely available eutherian genomic sequence data sets advanced the scientific field of genomics. Of note, future revisions of gene data sets were expected, due to the incompleteness of public eutherian genomic sequence assemblies and potential genomic sequence errors. The eutherian comparative genomic analysis protocol was proposed as guidance in protection against potential genomic sequence errors in public eutherian genomic sequences. The protocol was applicable in updates of 7 major eutherian gene data sets, including 812 complete coding sequences deposited in the European Nucleotide Archive as curated third party data gene data sets.
Reducing assembly complexity of microbial genomes with single-molecule sequencing.
Koren, Sergey; Harhay, Gregory P; Smith, Timothy P L; Bono, James L; Harhay, Dayna M; Mcvey, Scott D; Radune, Diana; Bergman, Nicholas H; Phillippy, Adam M
2013-01-01
The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem. To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads. Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.
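The idea that longer reads simplify assembly by spanning repeats can be illustrated with a minimal sketch. The repeat lengths, read lengths, flank criterion, and the helper name `unresolved_repeats` are all hypothetical and are not taken from the study.

```python
def unresolved_repeats(repeat_lengths, read_length, min_flank=1):
    """Return the repeats that a read of the given length cannot span.

    Assumed criterion: a read spans a repeat only if it covers the whole
    repeat plus at least `min_flank` unique bases on each side.
    """
    return [rep for rep in repeat_lengths if rep + 2 * min_flank > read_length]


# Hypothetical repeat lengths (bp) in one microbial genome.
repeats = [1_300, 5_500, 7_000]

short = unresolved_repeats(repeats, read_length=150)     # short-read scenario
long_ = unresolved_repeats(repeats, read_length=10_000)  # long-read scenario
```

Under these assumed values, every repeat stays unresolved with 150 bp reads, while 10 kb reads span all of them.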
Detection of somatic, subclonal and mosaic CNVs from sequencing | Division of Cancer Prevention
Progress in technology has made individual genome sequencing a clinical reality, with partial genome sequencing already in use in clinical care. In fact, it is expected that within a few years whole genome sequencing will be a standard procedure that will allow discovering personal genomic variants of all types and thus greatly facilitate individualized medicine. However, fast
Puli'uvea, Christopher; Khan, Subuhi; Chang, Wee-Leong; Valmonte, Gardette; Pearson, Michael N; Higgins, Colleen M
2017-02-01
We present the first complete genome of vanilla mosaic virus (VanMV). The VanMV genomic structure is consistent with that of a potyvirus, containing a single open reading frame (ORF) encoding a polyprotein of 3139 amino acids. Motif analyses indicate the polyprotein can be cleaved into the expected ten individual proteins; other recognised potyvirus motifs are also present. As expected, the VanMV genome shows high sequence similarity to the published Dasheen mosaic virus (DsMV) genome sequences; comparisons with DsMV continue to support VanMV as a vanilla infecting strain of DsMV. Phylogenetic analyses indicate that VanMV and DsMV share a common ancestor, with VanMV having the closest relationship with DsMV strains from the South Pacific.
Rius, Nuria; Guillén, Yolanda; Delprat, Alejandra; Kapusta, Aurélie; Feschotte, Cédric; Ruiz, Alfredo
2016-05-10
Many new Drosophila genomes have been sequenced in recent years using new-generation sequencing platforms and assembly methods. Transposable elements (TEs), being repetitive sequences, are often misassembled, especially in the genomes sequenced with short reads. Consequently, the mobile fraction of many of the new genomes has not been analyzed in detail or compared with that of other genomes sequenced with different methods, which could shed light on genome and TE evolution. Here we compare the TE content of three genomes: D. buzzatii st-1, j-19, and D. mojavensis. We have sequenced a new D. buzzatii genome (j-19) that complements the D. buzzatii reference genome (st-1) already published, and compared their TE contents with that of D. mojavensis. We found an underestimation of TE sequences in Drosophila genus NGS-genomes when compared to Sanger-genomes. To be able to compare genomes sequenced with different technologies, we developed a coverage-based method and applied it to the D. buzzatii st-1 and j-19 genomes. Between 10.85 and 11.16 % of the D. buzzatii st-1 genome is made up of TEs, between 7 and 7.5 % of the D. buzzatii j-19 genome, while TEs represent 15.35 % of the D. mojavensis genome. Helitrons are the most abundant order in the three genomes. TEs in D. buzzatii are less abundant than in D. mojavensis, as expected according to the positive correlation between genome size and TE content. However, TEs alone do not explain the genome size difference. TEs accumulate in the dot chromosomes and proximal regions of D. buzzatii and D. mojavensis chromosomes. We also report a significantly higher TE density in D. buzzatii and D. mojavensis X chromosomes, which is not expected under the current models. Our easy-to-use correction method allowed us to identify recently active families in D. buzzatii st-1 belonging to the LTR-retrotransposon superfamily Gypsy.
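The abstract does not describe the coverage-based method itself, so the following is only one plausible form such a correction could take, not the authors' procedure. It assumes that a TE whose read depth is k times the genome-wide mean stands for k copies collapsed in the assembly. The function name and all numbers are hypothetical.

```python
def te_fraction_coverage_corrected(te_regions, genome_size, mean_coverage):
    """Estimate the TE fraction of a genome, scaling each TE by read depth.

    te_regions: list of (length_bp, mean_depth) for annotated TE copies.
    Assumption: depth / mean_coverage approximates the number of copies
    that the assembler collapsed into one annotated TE.
    """
    annotated_bp = sum(length for length, _ in te_regions)
    corrected_bp = sum(length * depth / mean_coverage for length, depth in te_regions)
    non_te_bp = genome_size - annotated_bp
    return corrected_bp / (non_te_bp + corrected_bp)


# Hypothetical toy assembly: 1,000 bp, two annotated TEs, 20x mean depth.
regions = [(50, 20.0), (50, 60.0)]  # the second TE has 3x the mean depth
naive = sum(length for length, _ in regions) / 1_000
corrected = te_fraction_coverage_corrected(regions, genome_size=1_000, mean_coverage=20.0)
```

In this toy case the naive estimate is 10 % and the depth-scaled estimate is 200/1100, about 18 %. This matches the direction of the underestimation reported for short-read assemblies.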
Power, Susan E.; Harris, Hugh M. B.; Bottacini, Francesca; Ross, R. Paul; O’Toole, Paul W.
2013-01-01
Here we report the 1.86-Mb draft genome sequence of Lactobacillus crispatus EM-LC1, a fecal isolate with antimicrobial activity. This genome sequence is expected to provide insights into the antimicrobial activity of L. crispatus and improve our knowledge of its potential probiotic traits. PMID:24356836
McCullough, Laurence B; Slashinski, Melody J; McGuire, Amy L; Street, Richard L; Eng, Christine M; Gibbs, Richard A; Parsons, D William; Plon, Sharon E
2016-03-01
It has been anticipated that physicians and parents will be ill prepared or unprepared for the clinical introduction of genome sequencing, making it ethically disruptive. As a part of the Baylor Advancing Sequencing in Childhood Cancer Care study, we conducted semistructured interviews with 16 pediatric oncologists and 40 parents of pediatric patients with cancer prior to the return of sequencing results. We elicited expectations and attitudes concerning the impact of sequencing on clinical decision making, clinical utility, and treatment expectations from both groups. Using accepted methods of qualitative research to analyze interview transcripts, we completed a thematic analysis to provide inductive insights into their views of sequencing. Our major findings reveal that neither pediatric oncologists nor parents anticipate sequencing to be an ethically disruptive technology, because they expect to be prepared to integrate sequencing results into their existing approaches to learning and using new clinical information for care. Pediatric oncologists do not expect sequencing results to be more complex than other diagnostic information and plan simply to incorporate these data into their evidence-based approach to clinical practice, although they were concerned about impact on parents. For parents, there is an urgency to protect their child's health and in this context they expect genomic information to better prepare them to participate in decisions about their child's care. Our data do not support the concern that introducing genome sequencing into childhood cancer care will be ethically disruptive, that is, leave physicians or parents ill prepared or unprepared to make responsible decisions about patient care. © 2015 Wiley Periodicals, Inc.
McCullough, Laurence B.; Slashinski, Melody J.; McGuire, Amy L.; Street, Richard L.; Eng, Christine M.; Gibbs, Richard A.; Parsons, D. William; Plon, Sharon E.
2016-01-01
Background: Some anticipate that physicians and parents will be ill-prepared or unprepared for the clinical introduction of genome sequencing, making it ethically disruptive. Procedure: As part of the Baylor Advancing Sequencing in Childhood Cancer Care (BASIC3) study, we conducted semi-structured interviews with 16 pediatric oncologists and 40 parents of pediatric patients with cancer prior to the return of sequencing results. We elicited expectations and attitudes concerning the impact of sequencing on clinical decision-making, clinical utility, and treatment expectations from both groups. Using accepted methods of qualitative research to analyze interview transcripts, we completed a thematic analysis to provide inductive insights into their views of sequencing. Results: Our major findings reveal that neither pediatric oncologists nor parents anticipate sequencing to be an ethically disruptive technology, because they expect to be prepared to integrate sequencing results into their existing approaches to learning and using new clinical information for care. Pediatric oncologists do not expect sequencing results to be more complex than other diagnostic information and plan simply to incorporate these data into their evidence-based approach to clinical practice, although they were concerned about impact on parents. For parents, there is an urgency to protect their child's health and in this context they expect genomic information to better prepare them to participate in decisions about their child's care. Conclusion: Our data do not support concern that introducing genome sequencing into childhood cancer care will be ethically disruptive, i.e., leave physicians or parents ill-prepared or unprepared to make responsible decisions about patient care. PMID:26505993
Genome Sequencing and Assembly by Long Reads in Plants
Li, Changsheng; Lin, Feng; An, Dong; Huang, Ruidong
2017-01-01
Plant genomes generated by Sanger and Next Generation Sequencing (NGS) have provided insight into species diversity and evolution. However, Sanger sequencing is limited in its applications due to high cost, labor intensity, and low throughput, while NGS reads are too short to resolve abundant repeats and polyploidy, leading to incomplete or ambiguous assemblies. The advent and improvement of long-read sequencing by Third Generation Sequencing (TGS) methods such as PacBio and Nanopore have shown promise in producing high-quality assemblies for complex genomes. Here, we review the development of sequencing, introducing the application as well as considerations of experimental design in TGS of plant genomes. We also introduce recent revolutionary scaffolding technologies including BioNano, Hi-C, and 10× Genomics. We expect that the informative guidance for genome sequencing and assembly by long reads will benefit the initiation of scientists’ projects. PMID:29283420
Understanding patient and provider perceptions and expectations of genomic medicine
Hall, Michael J; Forman, Andrea; Montgomery, Susan; Rainey, Kim; Daly, Mary B
2014-01-01
Advances in genome sequencing technology have fostered a new era of clinical genomic medicine. Genetic counselors, who have begun to support patients undergoing multi-gene panel testing for hereditary cancer risk, will review brief clinical vignettes, and discuss early experiences with clinical genomic testing. Their experiences will frame a discussion about how current testing may challenge patient understanding and expectations toward the evaluation of cancer risk and downstream preventive behaviors. PMID:24992205
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
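The counting step behind barcode sequencing can be sketched as follows. This is a simplified illustration, not the published Bar-seq pipeline. It assumes exact barcode matching and a pseudocounted, depth-normalized log2 ratio, and the barcodes, reads, and function names are hypothetical.

```python
import math
from collections import Counter


def count_barcodes(reads, known_barcodes):
    """Count reads that exactly match a known strain barcode."""
    known = set(known_barcodes)
    return Counter(read for read in reads if read in known)


def log2_ratios(control, treatment, pseudocount=1):
    """Per-barcode log2(control / treatment) after depth normalization.

    Positive values mean the strain is depleted under treatment.
    """
    barcodes = set(control) | set(treatment)
    c_total = sum(control.values()) + pseudocount * len(barcodes)
    t_total = sum(treatment.values()) + pseudocount * len(barcodes)
    return {
        bc: math.log2(((control[bc] + pseudocount) / c_total)
                      / ((treatment[bc] + pseudocount) / t_total))
        for bc in barcodes
    }


# Hypothetical 4-nt barcodes; "GGGG" is not a known barcode and is ignored.
barcodes = ["AAAA", "CCCC"]
control = count_barcodes(["AAAA"] * 7 + ["CCCC"] * 7 + ["GGGG"], barcodes)
treated = count_barcodes(["AAAA"] * 1 + ["CCCC"] * 13, barcodes)
scores = log2_ratios(control, treated)
```

Here the strain tagged `AAAA` drops from 7 to 1 reads and receives a positive score, which marks it as sensitive to the hypothetical treatment. A real pipeline would also need to tolerate sequencing errors in the barcodes, which the abstract's finding of barcodes that varied from expectation makes relevant.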
The UK’s 100,000 Genomes Project: manifesting policymakers’ expectations
Samuel, Gabrielle Natalie; Farsides, Bobbie
2017-01-01
The UK’s 100,000 Genomes Project has the aim of sequencing 100,000 genomes from UK National Health Service (NHS) patients while concomitantly transforming clinical care such that whole genome sequencing becomes routine clinical practice in the UK. Policymakers claim that the project will revolutionize NHS care. We wished to explore the 100,000 Genomes Project, and in particular, the extent to which policymaker claims have helped or hindered the work of those associated with Genomics England – the company established by the Department of Health to deliver the project. We interviewed 20 individuals linked to, or working for Genomics England. Interviewees had double-edged views about the context within which they were working. On the one hand, policymakers’ expectations attached to the venture were considered vacuous “genohype”; on the other hand, they were considered the impetus needed for those trying to advance genomic research into clinical practice. Findings should be considered for future genomes projects. PMID:29238265
The ‘thousand-dollar genome': an ethical exploration
Dondorp, Wybo J; de Wert, Guido M W R
2013-01-01
Sequencing an individual's complete genome is expected to be possible for a relatively low sum ‘one thousand dollars' within a few years. Sequencing refers to determining the order of base pairs that make up the genome. The result is a library of three billion letter combinations. Cheap whole-genome sequencing is of greatest importance to medical scientific research. Comparing individual complete genomes will lead to a better understanding of the contribution genetic variation makes to health and disease. As knowledge increases, the ‘thousand-dollar genome' will also become increasingly important to healthcare. The applications that come within reach raise a number of ethical questions. This monitoring report addresses the issue. PMID:23677179
De Groot, Anne S; Rappuoli, Rino
2004-02-01
Vaccine research entered a new era when the complete genome of a pathogenic bacterium was published in 1995. Since then, more than 97 bacterial pathogens have been sequenced and at least 110 additional projects are now in progress. Genome sequencing has also dramatically accelerated: high-throughput facilities can draft the sequence of an entire microbe (two to four megabases) in 1 to 2 days. Vaccine developers are using microarrays, immunoinformatics, proteomics and high-throughput immunology assays to reduce the truly unmanageable volume of information available in genome databases to a manageable size. Vaccines composed of novel antigens discovered from genome mining are already in clinical trials. Within 5 years we can expect to see a novel class of vaccines composed of genome-predicted, assembled and engineered T- and B-cell epitopes. This article addresses the convergence of three forces (microbial genome sequencing, computational immunology and new vaccine technologies) that are shifting genome mining for vaccines onto the forefront of immunology research.
Genomic sequencing of Pleistocene cave bears
DOE Office of Scientific and Technical Information (OSTI.GOV)
Noonan, James P.; Hofreiter, Michael; Smith, Doug
2005-04-01
Despite the information content of genomic DNA, ancient DNA studies to date have largely been limited to amplification of mitochondrial DNA due to technical hurdles such as contamination and degradation of ancient DNAs. In this study, we describe two metagenomic libraries constructed using unamplified DNA extracted from the bones of two 40,000-year-old extinct cave bears. Analysis of ~1 Mb of sequence from each library showed that, despite significant microbial contamination, 5.8 percent and 1.1 percent of clones in the libraries contain cave bear inserts, yielding 26,861 bp of cave bear genome sequence. Alignment of this sequence to the dog genome, the closest sequenced genome to cave bear in terms of evolutionary distance, revealed roughly the expected ratio of cave bear exons, repeats and conserved noncoding sequences. Only 0.04 percent of all clones sequenced were derived from contamination with modern human DNA. Comparison of cave bear with orthologous sequences from several modern bear species revealed the evolutionary relationship of these lineages. Using the metagenomic approach described here, we have recovered substantial quantities of mammalian genomic sequence more than twice as old as any previously reported, establishing the feasibility of ancient DNA genomic sequencing programs.
APPLaUD: access for patients and participants to individual level uninterpreted genomic data.
Thorogood, Adrian; Bobe, Jason; Prainsack, Barbara; Middleton, Anna; Scott, Erick; Nelson, Sarah; Corpas, Manuel; Bonhomme, Natasha; Rodriguez, Laura Lyman; Murtagh, Madeleine; Kleiderman, Erika
2018-02-17
There is a growing support for the stance that patients and research participants should have better and easier access to their raw (uninterpreted) genomic sequence data in both clinical and research contexts. We review legal frameworks and literature on the benefits, risks, and practical barriers of providing individuals access to their data. We also survey genomic sequencing initiatives that provide or plan to provide individual access. Many patients and research participants expect to be able to access their health and genomic data. Individuals have a legal right to access their genomic data in some countries and contexts. Moreover, increasing numbers of participatory research projects, direct-to-consumer genetic testing companies, and now major national sequencing initiatives grant individuals access to their genomic sequence data upon request. Drawing on current practice and regulatory analysis, we outline legal, ethical, and practical guidance for genomic sequencing initiatives seeking to offer interested patients and participants access to their raw genomic data.
Understanding patient and provider perceptions and expectations of genomic medicine.
Hall, Michael J; Forman, Andrea D; Montgomery, Susan V; Rainey, Kim L; Daly, Mary B
2015-01-01
Advances in genome sequencing technology have fostered a new era of clinical genomic medicine. Genetic counselors, who have begun to support patients undergoing multi-gene panel testing for hereditary cancer risk, will review brief clinical vignettes, and discuss early experiences with clinical genomic testing. Their experiences will frame a discussion about how current testing may challenge patient understanding and expectations toward the evaluation of cancer risk and downstream preventive behaviors. © 2014 Wiley Periodicals, Inc.
Enriching public descriptions of marine phages using the Genomic Standards Consortium MIGS standard
Duhaime, Melissa Beth; Kottmann, Renzo; Field, Dawn; Glöckner, Frank Oliver
2011-01-01
In any sequencing project, the possible depth of comparative analysis is determined largely by the amount and quality of the accompanying contextual data. The structure, content, and storage of this contextual data should be standardized to ensure consistent coverage of all sequenced entities and facilitate comparisons. The Genomic Standards Consortium (GSC) has developed the “Minimum Information about Genome/Metagenome Sequences (MIGS/MIMS)” checklist for the description of genomes and here we annotate all 30 publicly available marine bacteriophage sequences to the MIGS standard. These annotations build on existing International Nucleotide Sequence Database Collaboration (INSDC) records, and confirm, as expected that current submissions lack most MIGS fields. MIGS fields were manually curated from the literature and placed in XML format as specified by the Genomic Contextual Data Markup Language (GCDML). These “machine-readable” reports were then analyzed to highlight patterns describing this collection of genomes. Completed reports are provided in GCDML. This work represents one step towards the annotation of our complete collection of genome sequences and shows the utility of capturing richer metadata along with raw sequences. PMID:21677864
Yamagishi, Hiroshi; Tanaka, Yoshiyuki; Terachi, Toru
2014-11-01
Crop species of Brassica (Brassicaceae) consist of three monogenomic species and three amphidiploid species resulting from interspecific hybridizations among them. Until now, mitochondrial genome sequences were available for only five of these species. We sequenced the mitochondrial genome of the sixth species, Brassica nigra (nuclear genome constitution BB), and compared it with those of Brassica oleracea (CC) and Brassica carinata (BBCC). The genome was assembled into a 232 145 bp circular sequence that is slightly larger than that of B. oleracea (219 952 bp). The genome of B. nigra contained 33 protein-coding genes, 3 rRNA genes, and 17 tRNA genes. The cox2-2 gene present in B. oleracea was absent in B. nigra. Although the nucleotide sequences of 52 genes were identical between B. nigra and B. carinata, the second exon of rps3 showed differences including an insertion/deletion (indel) and nucleotide substitutions. A PCR test to detect the indel revealed intraspecific variation in rps3, and in one line of B. nigra it amplified a DNA fragment of the size expected for B. carinata. In addition, the B. carinata lines tested here produced DNA fragments of the size expected for B. nigra. The results indicate that at least two mitotypes of B. nigra were present in the maternal parents of B. carinata.
Complete Genome Sequence of a Putative New Bacterial Strain, I507, Isolated from the Indian Ocean
Wang, Shu-yan; Wei, Jia-qiang
2018-01-01
ABSTRACT Bacterial strain I507 was isolated from the central Indian Ocean and may be a potential novel species, according to the 16S rRNA gene sequence. Here, we present its complete genome sequence and expect that it will provide researchers with valuable information to further understand its classification and function in the future. PMID:29674539
Best, Megan; Newson, Ainsley J; Meiser, Bettina; Juraskova, Ilona; Goldstein, David; Tucker, Kathy; Ballinger, Mandy L; Hess, Dominique; Schlub, Timothy E; Biesecker, Barbara; Vines, Richard; Vines, Kate; Thomas, David; Young, Mary-Anne; Savard, Jacqueline; Jacobs, Chris; Butow, Phyllis
2018-04-23
Advances in genomics offer promise for earlier detection or prevention of cancer, by personalisation of medical care tailored to an individual's genomic risk status. However, genome sequencing can generate an unprecedented volume of results for the patient to process, with potential implications for their families and reproductive choices. This paper describes a protocol for a study (PiGeOn) that aims to explore how patients and their blood relatives experience germline genomic sequencing, to help guide the appropriate future implementation of genome sequencing into routine clinical practice. We have designed a mixed-methods, prospective, cohort sub-study of a germline genomic sequencing study that targets adults with cancer suggestive of a genetic aetiology. One thousand probands and 2000 of their blood relatives will undergo germline genomic sequencing as part of the parent study in Sydney, Australia between 2016 and 2020. Test results are expected within 12-15 months of recruitment. For the PiGeOn sub-study, participants will be invited to complete surveys at baseline, three months and twelve months after baseline using self-administered questionnaires, to assess the experience of long waits for results (despite being informed that results may not be returned) and expectations of receiving them. Subsets of both probands and blood relatives will be purposively sampled and invited to participate in three semi-structured qualitative interviews (at baseline and each follow-up) to triangulate the data. Ethical themes identified in the data will be used to inform critical revisions of normative ethical concepts or frameworks. This will be one of the first studies internationally to follow, over time, the psychosocial impact on probands and their blood relatives who undergo germline genome sequencing.
Study results will inform ongoing ethical debates on issues such as informed consent for genomic sequencing, and informing participants and their relatives of specific results. The study will also provide important outcome data concerning the psychological impact of prolonged waiting for germline genomic sequencing. These data are needed to ensure that when germline genomic sequencing is introduced into standard clinical settings, ethical concepts are embedded, and patients and their relatives are adequately prepared and supported during and after the testing process.
Fast neutron mutants database and web displays at SoyBase
USDA-ARS's Scientific Manuscript database
SoyBase, the USDA-ARS soybean genetics and genomics database, has been expanded to include data for the fast neutron mutants produced by Bolon, Vance, et al. In addition to the expected text and sequence homology searches and visualization of the indels in the context of the genome sequence viewer, ...
Quantitative phenotyping via deep barcode sequencing
Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey
2009-01-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793
Complete Genome Sequence of the Alfalfa Symbiont Sinorhizobium/Ensifer meliloti Strain GR4.
Martínez-Abarca, Francisco; Martínez-Rodríguez, Laura; López-Contreras, José Antonio; Jiménez-Zurdo, José Ignacio; Toro, Nicolás
2013-01-01
We present the complete nucleotide sequence of the multipartite genome of Sinorhizobium/Ensifer meliloti GR4, a predominant rhizobial strain in an agricultural field site. The genome (total size, 7.14 Mb) consists of five replicons: one chromosome, two expected symbiotic megaplasmids (pRmeGR4c and pRmeGR4d), and two accessory plasmids (pRmeGR4a and pRmeGR4b).
Extensive sequencing of seven human genomes to characterize benchmark reference materials
Zook, Justin M.; Catoe, David; McDaniel, Jennifer; Vang, Lindsay; Spies, Noah; Sidow, Arend; Weng, Ziming; Liu, Yuling; Mason, Christopher E.; Alexander, Noah; Henaff, Elizabeth; McIntyre, Alexa B.R.; Chandramohan, Dhruva; Chen, Feng; Jaeger, Erich; Moshrefi, Ali; Pham, Khoa; Stedman, William; Liang, Tiffany; Saghbini, Michael; Dzakula, Zeljko; Hastie, Alex; Cao, Han; Deikus, Gintaras; Schadt, Eric; Sebra, Robert; Bashir, Ali; Truty, Rebecca M.; Chang, Christopher C.; Gulbahce, Natali; Zhao, Keyan; Ghosh, Srinka; Hyland, Fiona; Fu, Yutao; Chaisson, Mark; Xiao, Chunlin; Trow, Jonathan; Sherry, Stephen T.; Zaranek, Alexander W.; Ball, Madeleine; Bobe, Jason; Estep, Preston; Church, George M.; Marks, Patrick; Kyriazopoulou-Panagiotopoulou, Sofia; Zheng, Grace X.Y.; Schnall-Levin, Michael; Ordonez, Heather S.; Mudivarti, Patrice A.; Giorda, Kristina; Sheng, Ying; Rypdal, Karoline Bjarnesdatter; Salit, Marc
2016-01-01
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly. PMID:27271295
USDA-ARS's Scientific Manuscript database
Genetic variants detected from sequence have been used to successfully identify causal variants and map complex traits in several organisms. High and moderate impact variants, those expected to alter or disrupt the protein coded by a gene and those that regulate protein production, likely have a mor...
Len Gen: The international lentil genome sequencing project
USDA-ARS's Scientific Manuscript database
We have been sequencing CDC Redberry using NGS of paired-end and mate-pair libraries over a wide range of sizes and technologies. The most recent draft (v0.7) of approximately 150x coverage produced scaffolds covering over half the genome (2.7 Gb of the expected 4.3 Gb). Long reads from PacBio sequ...
Complete Genome Sequence of the Alfalfa Symbiont Sinorhizobium/Ensifer meliloti Strain GR4
Martínez-Abarca, Francisco; Martínez-Rodríguez, Laura; López-Contreras, José Antonio; Jiménez-Zurdo, José Ignacio
2013-01-01
We present the complete nucleotide sequence of the multipartite genome of Sinorhizobium/Ensifer meliloti GR4, a predominant rhizobial strain in an agricultural field site. The genome (total size, 7.14 Mb) consists of five replicons: one chromosome, two expected symbiotic megaplasmids (pRmeGR4c and pRmeGR4d), and two accessory plasmids (pRmeGR4a and pRmeGR4b). PMID:23409262
2012-01-01
Background The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, and powerful computational tools for annotating the genome sequences are urgently needed. Results We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tool. The edited annotations can then be uploaded to CPGAVAS for repeated updating and re-analysis. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while having several superior functionalities. Conclusions CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas. PMID:23256920
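Because the annotations are exchanged in GFF3 format, a small reader can help when inspecting them outside an annotation editor. The following is a minimal sketch, assuming only the standard nine tab-separated GFF3 columns; the function name parse_gff3_line and the sample record are hypothetical illustrations and are not part of CPGAVAS.

```python
from urllib.parse import unquote

# The nine columns defined by the GFF3 format, in order.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper).

    Returns None for blank lines and for comment/directive lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated key=value pairs, percent-encoded.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up example record, for illustration only.
example = "chrC\texample\tgene\t100\t1500\t.\t+\t.\tID=gene1;Name=psbA"
feature = parse_gff3_line(example)
```

Multi-valued attributes (comma-separated in GFF3) are left as a single string here to keep the sketch short.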
The Yak genome database: an integrative database for studying yak biology and high-altitude adaption
2012-01-01
Background The yak (Bos grunniens) is a long-haired bovine that lives at high altitudes and is an important source of milk, meat, fiber and fuel. The recent sequencing, assembly and annotation of its genome are expected to further our understanding of the means by which it has adapted to life at high altitudes and its ecologically important traits. Description The Yak Genome Database (YGD) is an internet-based resource that provides access to genomic sequence data and predicted functional information concerning the genes and proteins of Bos grunniens. The curated data stored in the YGD includes genome sequences, predicted genes and associated annotations, non-coding RNA sequences, transposable elements, single nucleotide variants, and three-way whole-genome alignments between human, cattle and yak. YGD offers useful searching and data mining tools, including the ability to search for genes by name or using function keywords as well as GBrowse genome browsers and/or BLAST servers, which can be used to visualize genome regions and identify similar sequences. Sequence data from the YGD can also be downloaded to perform local searches. Conclusions A new yak genome database (YGD) has been developed to facilitate studies on high-altitude adaption and bovine genomics. The database will be continuously updated to incorporate new information such as transcriptome data and population resequencing data. The YGD can be accessed at http://me.lzu.edu.cn/yak. PMID:23134687
Schreck, Katharina; Herbold, Craig W.; Daims, Holger; Wagner, Michael; Loy, Alexander
2018-01-01
ABSTRACT The facultative anaerobic chemoorganoheterotrophic alphaproteobacterium Telmatospirillum siberiense 26-4b1 was isolated from a Siberian peatland. We report here a 6.20-Mbp near-complete high-quality draft genome sequence of T. siberiense that reveals expected and novel metabolic potential for the genus Telmatospirillum, including genes for sulfur oxidation. PMID:29371357
Efficient isolation method for high-quality genomic DNA from cicada exuviae.
Nguyen, Hoa Quynh; Kim, Ye Inn; Borzée, Amaël; Jang, Yikweon
2017-10-01
In recent years, animal ethics issues have led researchers to explore nondestructive methods to access materials for genetic studies. Cicada exuviae are among those materials because they are cast skins that individuals leave behind after molting and are easily collected. In this study, we aim to identify the most efficient extraction method to obtain high quantity and quality of DNA from cicada exuviae. We compared relative DNA yield and purity of six extraction protocols, including both manual protocols and available commercial kits, extracting from four different exoskeleton parts. Furthermore, amplification and sequencing of genomic DNA were evaluated in terms of availability of sequence at the expected genomic size. Both the choice of protocol and exuvia part significantly affected DNA yield and purity. Only samples that were extracted using the PowerSoil DNA Isolation kit generated gel bands of expected size as well as successful sequencing results. The failed attempts to extract DNA using other protocols could be explained partly by a low DNA yield from cicada exuviae and partly by contamination with humic acids that exist in the soil where cicada nymphs reside before emergence, as shown by spectroscopic measurements. Genomic DNA extracted from cicada exuviae could provide valuable information for species identification, allowing the investigation of genetic diversity across consecutive broods, or spatiotemporal variation among various populations. Consequently, we hope to provide a simple method to acquire pure genomic DNA applicable for multiple research purposes.
Analysis of Epstein-Barr Virus Genomes and Expression Profiles in Gastric Adenocarcinoma.
Borozan, Ivan; Zapatka, Marc; Frappier, Lori; Ferretti, Vincent
2018-01-15
Epstein-Barr virus (EBV) is a causative agent of a variety of lymphomas, nasopharyngeal carcinoma (NPC), and ∼9% of gastric carcinomas (GCs). An important question is whether particular EBV variants are more oncogenic than others, but conclusions are currently hampered by the lack of sequenced EBV genomes. Here, we contribute to this question by mining whole-genome sequences of 201 GCs to identify 13 EBV-positive GCs and by assembling 13 new EBV genome sequences, almost doubling the number of available GC-derived EBV genome sequences and providing the first non-Asian EBV genome sequences from GC. Whole-genome sequence comparisons of all EBV isolates sequenced to date (85 from tumors and 57 from healthy individuals) showed that most GC and NPC EBV isolates were closely related although American Caucasian GC samples were more distant, suggesting a geographical component. However, EBV GC isolates were found to contain some consistent changes in protein sequences regardless of geographical origin. In addition, transcriptome data available for eight of the EBV-positive GCs were analyzed to determine which EBV genes are expressed in GC. In addition to the expected latency proteins (EBNA1, LMP1, and LMP2A), specific subsets of lytic genes were consistently expressed that did not reflect a typical lytic or abortive lytic infection, suggesting a novel mechanism of EBV gene regulation in the context of GC. These results are consistent with a model in which a combination of specific latent and lytic EBV proteins promotes tumorigenesis. IMPORTANCE Epstein-Barr virus (EBV) is a widespread virus that causes cancer, including gastric carcinoma (GC), in a small subset of individuals. An important question is whether particular EBV variants are more cancer associated than others, but more EBV sequences are required to address this question. 
Here, we have generated 13 new EBV genome sequences from GC, almost doubling the number of EBV sequences from GC isolates and providing the first EBV sequences from non-Asian GC. We further identify sequence changes in some EBV proteins common to GC isolates. In addition, gene expression analysis of eight of the EBV-positive GCs showed consistent expression of both the expected latency proteins and a subset of lytic proteins that was not consistent with typical lytic or abortive lytic expression. These results suggest that novel mechanisms activate expression of some EBV lytic proteins and that their expression may contribute to oncogenesis. Copyright © 2018 American Society for Microbiology.
Nearing saturation of cancer driver gene discovery.
Hsiehchen, David; Hsieh, Antony
2018-06-15
Extensive sequencing efforts of cancer genomes such as The Cancer Genome Atlas (TCGA) have been undertaken to uncover bona fide cancer driver genes, which has enhanced our understanding of cancer and revealed therapeutic targets. However, the number of driver gene mutations is bounded, indicating that there must be a point when further sequencing efforts will be excessive. We found that there was a significant positive correlation between sample size and identified driver gene mutations across 33 cancers sequenced by the TCGA, which is expected if additional sequencing is still leading to the identification of more driver genes. However, the rate of new cancer driver genes being discovered with larger samples is declining rapidly. Our analysis provides a general guide for determining which cancer types would likely benefit from additional sequencing efforts, particularly those with relatively high rates of cancer driver gene discovery. Our results argue that past strategies of indiscriminately sequencing as many specimens as possible for all cancer types are becoming inefficient. In addition, without significant investments into applying our knowledge of cancer genomes, we risk sequencing more cancer genomes for the sake of sequencing rather than meaningful patient benefit.
[The principle and application of the single-molecule real-time sequencing technology].
Yanhu, Liu; Lu, Wang; Li, Yu
2015-03-01
The last decade witnessed the explosive development of third-generation sequencing strategies, including single-molecule real-time sequencing (SMRT), true single-molecule sequencing (tSMS™) and single-molecule nanopore DNA sequencing. In this review, we summarize the principle, performance and application of the SMRT sequencing technology. Compared with the traditional Sanger method and the next-generation sequencing (NGS) technologies, the SMRT approach has several advantages, including long read length, high speed, being PCR-free and the capability of direct detection of epigenetic modifications. However, its low accuracy, with most errors being insertions and deletions, is a notable disadvantage. The raw sequence data therefore need to be corrected before assembly. To date, SMRT is a good fit for de novo genomic sequencing and high-quality assemblies of small genomes. In the future, it is expected to play an important role in epigenetics, transcriptomic sequencing, and assemblies of large genomes.
Hausmann, Bela; Pjevac, Petra; Schreck, Katharina; Herbold, Craig W; Daims, Holger; Wagner, Michael; Loy, Alexander
2018-01-25
The facultative anaerobic chemoorganoheterotrophic alphaproteobacterium Telmatospirillum siberiense 26-4b1 was isolated from a Siberian peatland. We report here a 6.20-Mbp near-complete high-quality draft genome sequence of T. siberiense that reveals expected and novel metabolic potential for the genus Telmatospirillum, including genes for sulfur oxidation. Copyright © 2018 Hausmann et al.
Challenges in Whole-Genome Annotation of Pyrosequenced Eukaryotic Genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kuo, Alan; Grigoriev, Igor
2009-04-17
Pyrosequencing technologies such as 454/Roche and Solexa/Illumina vastly lower the cost of nucleotide sequencing compared to the traditional Sanger method, and thus promise to greatly expand the number of sequenced eukaryotic genomes. However, the new technologies also bring new challenges such as shorter reads and new kinds and higher rates of sequencing errors, which complicate genome assembly and gene prediction. At JGI we are deploying 454 technology for the sequencing and assembly of ever-larger eukaryotic genomes. Here we describe our first whole-genome annotation of a purely 454-sequenced fungal genome that is larger than a yeast (>30 Mbp). The pezizomycotine (filamentous ascomycote) Aspergillus carbonarius belongs to the Aspergillus section Nigri species complex, members of which are significant as platforms for bioenergy and bioindustrial technology, as members of soil microbial communities and players in the global carbon cycle, and as agricultural toxigens. Application of a modified version of the standard JGI Annotation Pipeline has so far predicted ~10k genes. ~12% of these preliminary annotations suffer a potential frameshift error, which is somewhat higher than the ~9% rate in the Sanger-sequenced and conventionally assembled and annotated genome of fellow Aspergillus section Nigri member A. niger. Also, >90% of A. niger genes have potential homologs in the A. carbonarius preliminary annotation. We conclude, and with further annotation and comparative analysis expect to confirm, that 454 sequencing strategies provide a promising substrate for annotation of modestly sized eukaryotic genomes. We will also present results of annotation of a number of other pyrosequenced fungal genomes of bioenergy interest.
Ma, Peng-Fei; Guo, Zhen-Hua; Li, De-Zhu
2012-01-01
Compared to their counterparts in animals, the mitochondrial (mt) genomes of angiosperms exhibit a number of unique features. However, unravelling their evolution is hindered by the few completed genomes, of which are essentially Sanger sequenced. While next-generation sequencing technologies have revolutionized chloroplast genome sequencing, they are just beginning to be applied to angiosperm mt genomes. Chloroplast genomes of grasses (Poaceae) have undergone episodic evolution and the evolutionary rate was suggested to be correlated between chloroplast and mt genomes in Poaceae. It is interesting to investigate whether correlated rate change also occurred in grass mt genomes as expected under lineage effects. A time-calibrated phylogenetic tree is needed to examine rate change. We determined a largely completed mt genome from a bamboo, Ferrocalamus rimosivaginus (Poaceae), through Illumina sequencing of total DNA. With combination of de novo and reference-guided assembly, 39.5-fold coverage Illumina reads were finally assembled into scaffolds totalling 432,839 bp. The assembled genome contains nearly the same genes as the completed mt genomes in Poaceae. For examining evolutionary rate in grass mt genomes, we reconstructed a phylogenetic tree including 22 taxa based on 31 mt genes. The topology of the well-resolved tree was almost identical to that inferred from chloroplast genome with only minor difference. The inconsistency possibly derived from long branch attraction in mtDNA tree. By calculating absolute substitution rates, we found significant rate change (∼4-fold) in mt genome before and after the diversification of Poaceae both in synonymous and nonsynonymous terms. Furthermore, the rate change was correlated with that of chloroplast genomes in grasses. Our result demonstrates that it is a rapid and efficient approach to obtain angiosperm mt genome sequences using Illumina sequencing technology. 
The parallel episodic evolution of mt and chloroplast genomes in grasses is consistent with lineage effects.
Ma, Peng-Fei; Guo, Zhen-Hua; Li, De-Zhu
2012-01-01
Background Compared to their counterparts in animals, the mitochondrial (mt) genomes of angiosperms exhibit a number of unique features. However, unravelling their evolution is hindered by the few completed genomes, of which are essentially Sanger sequenced. While next-generation sequencing technologies have revolutionized chloroplast genome sequencing, they are just beginning to be applied to angiosperm mt genomes. Chloroplast genomes of grasses (Poaceae) have undergone episodic evolution and the evolutionary rate was suggested to be correlated between chloroplast and mt genomes in Poaceae. It is interesting to investigate whether correlated rate change also occurred in grass mt genomes as expected under lineage effects. A time-calibrated phylogenetic tree is needed to examine rate change. Methodology/Principal Findings We determined a largely completed mt genome from a bamboo, Ferrocalamus rimosivaginus (Poaceae), through Illumina sequencing of total DNA. With combination of de novo and reference-guided assembly, 39.5-fold coverage Illumina reads were finally assembled into scaffolds totalling 432,839 bp. The assembled genome contains nearly the same genes as the completed mt genomes in Poaceae. For examining evolutionary rate in grass mt genomes, we reconstructed a phylogenetic tree including 22 taxa based on 31 mt genes. The topology of the well-resolved tree was almost identical to that inferred from chloroplast genome with only minor difference. The inconsistency possibly derived from long branch attraction in mtDNA tree. By calculating absolute substitution rates, we found significant rate change (∼4-fold) in mt genome before and after the diversification of Poaceae both in synonymous and nonsynonymous terms. Furthermore, the rate change was correlated with that of chloroplast genomes in grasses. 
Conclusions/Significance Our result demonstrates that it is a rapid and efficient approach to obtain angiosperm mt genome sequences using Illumina sequencing technology. The parallel episodic evolution of mt and chloroplast genomes in grasses is consistent with lineage effects. PMID:22272330
High-throughput sequencing of three Lemnoideae (duckweeds) chloroplast genomes from total DNA.
Wang, Wenqin; Messing, Joachim
2011-01-01
Chloroplast genomes provide a wealth of information for evolutionary and population genetic studies. Chloroplasts play a particularly important role in the adaption for aquatic plants because they float on water and their major surface is exposed continuously to sunlight. The subfamily of Lemnoideae represents such a collection of aquatic species that because of photosynthesis represents one of the fastest growing plant species on earth. We sequenced the chloroplast genomes from three different genera of Lemnoideae, Spirodela polyrhiza, Wolffiella lingulata and Wolffia australiana by high-throughput DNA sequencing of genomic DNA using the SOLiD platform. Unfractionated total DNA contains high copies of plastid DNA so that sequences from the nucleus and mitochondria can easily be filtered computationally. Remaining sequence reads were assembled into contiguous sequences (contigs) using SOLiD software tools. Contigs were mapped to a reference genome of Lemna minor and gaps, selected by PCR, were sequenced on the ABI3730xl platform. This combinatorial approach yielded whole genomic contiguous sequences in a cost-effective manner. Over 1,000-time coverage of chloroplast from total DNA were reached by the SOLiD platform in a single spot on a quadrant slide without purification. Comparative analysis indicated that the chloroplast genome was conserved in gene number and organization with respect to the reference genome of L. minor. However, higher nucleotide substitution, abundant deletions and insertions occurred in non-coding regions of these genomes, indicating a greater genomic dynamics than expected from the comparison of other related species in the Pooideae. Noticeably, there was no transition bias over transversion in Lemnoideae. The data should have immediate applications in evolutionary biology and plant taxonomy with increased resolution and statistical power.
High-Throughput Sequencing of Three Lemnoideae (Duckweeds) Chloroplast Genomes from Total DNA
Wang, Wenqin; Messing, Joachim
2011-01-01
Background Chloroplast genomes provide a wealth of information for evolutionary and population genetic studies. Chloroplasts play a particularly important role in the adaption for aquatic plants because they float on water and their major surface is exposed continuously to sunlight. The subfamily of Lemnoideae represents such a collection of aquatic species that because of photosynthesis represents one of the fastest growing plant species on earth. Methods We sequenced the chloroplast genomes from three different genera of Lemnoideae, Spirodela polyrhiza, Wolffiella lingulata and Wolffia australiana by high-throughput DNA sequencing of genomic DNA using the SOLiD platform. Unfractionated total DNA contains high copies of plastid DNA so that sequences from the nucleus and mitochondria can easily be filtered computationally. Remaining sequence reads were assembled into contiguous sequences (contigs) using SOLiD software tools. Contigs were mapped to a reference genome of Lemna minor and gaps, selected by PCR, were sequenced on the ABI3730xl platform. Conclusions This combinatorial approach yielded whole genomic contiguous sequences in a cost-effective manner. Over 1,000-time coverage of chloroplast from total DNA were reached by the SOLiD platform in a single spot on a quadrant slide without purification. Comparative analysis indicated that the chloroplast genome was conserved in gene number and organization with respect to the reference genome of L. minor. However, higher nucleotide substitution, abundant deletions and insertions occurred in non-coding regions of these genomes, indicating a greater genomic dynamics than expected from the comparison of other related species in the Pooideae. Noticeably, there was no transition bias over transversion in Lemnoideae. The data should have immediate applications in evolutionary biology and plant taxonomy with increased resolution and statistical power. PMID:21931804
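The remark about transition bias rests on classifying each substitution as a transition (purine to purine or pyrimidine to pyrimidine, i.e. A<->G or C<->T) or a transversion (purine to pyrimidine or the reverse). The following is a minimal sketch of that classification for two pre-aligned sequences; the function name count_ts_tv and the example sequences are hypothetical, and this is not the authors' analysis pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    Positions where either sequence has a gap or a non-ACGT symbol,
    and positions where the bases are identical, are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1   # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


# Made-up aligned fragments, for illustration only.
ts, tv = count_ts_tv("ACGTAC", "GCTTAA")
```

Since there are twice as many possible transversions as transitions, a ts:tv ratio near 0.5 is what unbiased substitution would give; most genomes show a higher ratio.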
A high-throughput Sanger strategy for human mitochondrial genome sequencing
2013-01-01
Background A population reference database of complete human mitochondrial genome (mtGenome) sequences is needed to enable the use of mitochondrial DNA (mtDNA) coding region data in forensic casework applications. However, the development of entire mtGenome haplotypes to forensic data quality standards is difficult and laborious. A Sanger-based amplification and sequencing strategy that is designed for automated processing, yet routinely produces high quality sequences, is needed to facilitate high-volume production of these mtGenome data sets. Results We developed a robust 8-amplicon Sanger sequencing strategy that regularly produces complete, forensic-quality mtGenome haplotypes in the first pass of data generation. The protocol works equally well on samples representing diverse mtDNA haplogroups and DNA input quantities ranging from 50 pg to 1 ng, and can be applied to specimens of varying DNA quality. The complete workflow was specifically designed for implementation on robotic instrumentation, which increases throughput and reduces both the opportunities for error inherent to manual processing and the cost of generating full mtGenome sequences. Conclusions The described strategy will assist efforts to generate complete mtGenome haplotypes which meet the highest data quality expectations for forensic genetic and other applications. Additionally, high-quality data produced using this protocol can be used to assess mtDNA data developed using newer technologies and chemistries. Further, the amplification strategy can be used to enrich for mtDNA as a first step in sample preparation for targeted next-generation sequencing. PMID:24341507
Minimal Absent Words in Four Human Genome Assemblies
Garcia, Sara P.; Pinho, Armando J.
2011-01-01
Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species. PMID:22220210
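For readers unfamiliar with the term, a minimal absent word of a sequence is a word that does not occur in it, while every proper substring of that word does occur. Equivalently, the word is absent but both its longest proper prefix and its longest proper suffix are present. The following is a brute-force sketch of that definition for short sequences only; the function name and the toy inputs are hypothetical, and genome-scale computation needs suffix-array or similar index structures rather than this approach.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Return the minimal absent words of seq with length 1..max_len.

    Brute-force illustration: memory grows with len(seq) * max_len.
    """
    # Collect every substring (factor) of seq up to length max_len.
    factors = {""}
    for k in range(1, max_len + 1):
        for i in range(len(seq) - k + 1):
            factors.add(seq[i:i + k])

    maws = set()
    # Extend each present factor by one letter; the prefix of the
    # candidate word is then present by construction.
    for prefix in factors:
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # Absent word whose longest proper suffix is also present.
            if word not in factors and word[1:] in factors:
                maws.add(word)
    return maws


# Toy example over a two-letter alphabet.
toy = minimal_absent_words("AAC", 3, alphabet="AC")
```

In the toy example, "AAA" is minimal because "AA" occurs in "AAC" but "AAA" does not.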
Rodriguez-R, Luis M; Gunturu, Santosh; Harvey, William T; Rosselló-Mora, Ramon; Tiedje, James M; Cole, James R; Konstantinidis, Konstantinos T
2018-06-14
The small subunit ribosomal RNA gene (16S rRNA) has been successfully used to catalogue and study the diversity of prokaryotic species and communities but it offers limited resolution at the species and finer levels, and cannot represent the whole-genome diversity and fluidity. To overcome these limitations, we introduced the Microbial Genomes Atlas (MiGA), a webserver that allows the classification of an unknown query genomic sequence, complete or partial, against all taxonomically classified taxa with available genome sequences, as well as comparisons to other related genomes including uncultivated ones, based on the genome-aggregate Average Nucleotide and Amino Acid Identity (ANI/AAI) concepts. MiGA integrates best practices in sequence quality trimming and assembly and allows input to be raw reads or assemblies from isolate genomes, single-cell sequences, and metagenome-assembled genomes (MAGs). Further, MiGA can take as input hundreds of closely related genomes of the same or closely related species (a so-called 'Clade Project') to assess their gene content diversity and evolutionary relationships, and calculate important clade properties such as the pangenome and core gene sets. Therefore, MiGA is expected to facilitate a range of genome-based taxonomic and diversity studies, and quality assessment across environmental and clinical settings. MiGA is available at http://microbial-genomes.org/.
Gene discovery by chemical mutagenesis and whole-genome sequencing in Dictyostelium.
Li, Cheng-Lin Frank; Santhanam, Balaji; Webb, Amanda Nicole; Zupan, Blaž; Shaulsky, Gad
2016-09-01
Whole-genome sequencing is a useful approach for identification of chemical-induced lesions, but previous applications involved tedious genetic mapping to pinpoint the causative mutations. We propose that saturation mutagenesis under low mutagenic loads, followed by whole-genome sequencing, should allow direct implication of genes by identifying multiple independent alleles of each relevant gene. We tested the hypothesis by performing three genetic screens with chemical mutagenesis in the social soil amoeba Dictyostelium discoideum. Through genome sequencing, we successfully identified mutant genes with multiple alleles in near-saturation screens, including resistance to intense illumination and strong suppressors of defects in an allorecognition pathway. We tested the causality of the mutations by comparison to published data and by direct complementation tests, finding both dominant and recessive causative mutations. Therefore, our strategy provides a cost- and time-efficient approach to gene discovery by integrating chemical mutagenesis and whole-genome sequencing. The method should be applicable to many microbial systems, and it is expected to revolutionize the field of functional genomics in Dictyostelium by greatly expanding the mutation spectrum relative to other common mutagenesis methods. © 2016 Li et al.; Published by Cold Spring Harbor Laboratory Press.
Whole genome sequencing data and de novo draft assemblies for 66 teleost species
Malmstrøm, Martin; Matschiner, Michael; Tørresen, Ole K.; Jakobsen, Kjetill S.; Jentoft, Sissel
2017-01-01
Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9–39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts. PMID:28094797
Comparative analysis of the small RNA transcriptomes of Pinus contorta and Oryza sativa
Morin, Ryan D.; Aksay, Gozde; Dolgosheina, Elena; Ebhardt, H. Alexander; Magrini, Vincent; Mardis, Elaine R.; Sahinalp, S. Cenk; Unrau, Peter J.
2008-01-01
The diversity of microRNAs and small-interfering RNAs has been extensively explored within angiosperms by focusing on a few key organisms such as Oryza sativa and Arabidopsis thaliana. A deeper division of the plants is defined by the radiation of the angiosperms and gymnosperms, with the latter comprising the commercially important conifers. The conifers are expected to provide important information regarding the evolution of highly conserved small regulatory RNAs. Deep sequencing provides the means to characterize and quantitatively profile small RNAs in understudied organisms such as these. Pyrosequencing of small RNAs from O. sativa revealed, as expected, ∼21- and ∼24-nt RNAs. The former contained known microRNAs, and the latter largely comprised intergenic-derived sequences likely representing heterochromatin siRNAs. In contrast, sequences from Pinus contorta were dominated by 21-nt small RNAs. Using a novel sequence-based clustering algorithm, we identified sequences belonging to 18 highly conserved microRNA families in P. contorta as well as numerous clusters of conserved small RNAs of unknown function. Using multiple methods, including expressed sequence folding and machine learning algorithms, we found a further 53 candidate novel microRNA families, 51 appearing specific to the P. contorta library. In addition, alignment of small RNA sequences to the O. sativa genome revealed six perfectly conserved classes of small RNA that included chloroplast transcripts and specific types of genomic repeats. The conservation of microRNAs and other small RNAs between the conifers and the angiosperms indicates that important RNA silencing processes were highly developed in the earliest spermatophytes. Genomic mapping of all sequences to the O. sativa genome can be viewed at http://microrna.bcgsc.ca/cgi-bin/gbrowse/rice_build_3/. PMID:18323537
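The contrast between the ~21- and ~24-nt classes in O. sativa and the 21-nt dominated P. contorta library comes down to a read-length profile. The following is a minimal sketch of how such a profile can be tabulated from a list of read sequences; the function name and the example reads are made up for illustration and do not reproduce the study's data or its clustering algorithm.

```python
from collections import Counter


def length_distribution(reads):
    """Return {read_length: fraction_of_reads}, sorted by length."""
    counts = Counter(len(read) for read in reads)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


# Made-up small RNA reads: two of 21 nt, one of 22 nt, one of 24 nt.
example_reads = [
    "A" * 21,
    "C" * 21,
    "G" * 22,
    "T" * 24,
]
profile = length_distribution(example_reads)
```

In practice the reads would first have adapters trimmed, since untrimmed reads all share the sequencer's fixed read length.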
Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle.
van Binsbergen, Rianne; Calus, Mario P L; Bink, Marco C A M; van Eeuwijk, Fred A; Schrooten, Chris; Veerkamp, Roel F
2015-09-17
In contrast to currently used single nucleotide polymorphism (SNP) panels, the use of whole-genome sequence data is expected to enable the direct estimation of the effects of causal mutations on a given trait. This could lead to higher reliabilities of genomic predictions compared to those based on SNP genotypes. Also, at each generation of selection, recombination events between a SNP and a mutation can cause decay in reliability of genomic predictions based on markers rather than on the causal variants. Our objective was to investigate the use of imputed whole-genome sequence genotypes versus high-density SNP genotypes on (the persistency of) the reliability of genomic predictions using real cattle data. Highly accurate phenotypes based on daughter performance and Illumina BovineHD Beadchip genotypes were available for 5503 Holstein Friesian bulls. The BovineHD genotypes (631,428 SNPs) of each bull were used to impute whole-genome sequence genotypes (12,590,056 SNPs) using the Beagle software. Imputation was done using a multi-breed reference panel of 429 sequenced individuals. Genomic estimated breeding values for three traits were predicted using a Bayesian stochastic search variable selection (BSSVS) model and a genome-enabled best linear unbiased prediction model (GBLUP). Reliabilities of predictions were based on 2087 validation bulls, while the other 3416 bulls were used for training. Prediction reliabilities ranged from 0.37 to 0.52. BSSVS performed better than GBLUP in all cases. Reliabilities of genomic predictions were slightly lower with imputed sequence data than with BovineHD chip data. Also, the reliabilities tended to be lower for both sequence data and BovineHD chip data when relationships between training animals were low. No increase in persistency of prediction reliability using imputed sequence data was observed. Compared to BovineHD genotype data, using imputed sequence data for genomic prediction produced no advantage. 
To investigate the putative advantage of genomic prediction using (imputed) sequence data, a training set with a larger number of individuals that are distantly related to each other and genomic prediction models that incorporate biological information on the SNPs or that apply stricter SNP pre-selection should be considered.
Budiman, Muhammad A.; Mao, Long; Wood, Todd C.; Wing, Rod A.
2000-01-01
Recently a new strategy using BAC end sequences as sequence-tagged connectors (STCs) was proposed for whole-genome sequencing projects. In this study, we present the construction and detailed characterization of a 15.0 haploid genome equivalent BAC library for the cultivated tomato, Lycopersicon esculentum cv. Heinz 1706. The library contains 129,024 clones with an average insert size of 117.5 kb and a chloroplast content of 1.11%. BAC end sequences from 1490 ends were generated and analyzed as a preliminary evaluation for using this library to develop an STC framework to sequence the tomato genome. A total of 1205 BAC end sequences (80.9%) were obtained, with an average length of 360 high-quality bases, and were searched against the GenBank database. Using a cutoff expectation value of <10^-6, and combining the results from BLASTN, BLASTX, and TBLASTX searches, 24.3% of the BAC end sequences were similar to known sequences, of which almost half (48.7%) share sequence similarities to retrotransposons and 7% to known genes. Some of the transposable element sequences were the first reported in tomato, such as sequences similar to maize transposon Activator (Ac) ORF and tobacco pararetrovirus-like sequences. Interestingly, there were no BAC end sequences similar to the highly repeated TGRI and TGRII elements. However, the majority (70.3%) of STCs did not share significant sequence similarities to any sequences in GenBank at either the DNA or predicted protein levels, indicating that a large portion of the tomato genome is still unknown. Our data demonstrate that this BAC library is suitable for developing an STC database to sequence the tomato genome. The advantages of developing an STC framework for whole-genome sequencing of tomato are discussed. [The BAC end sequences described in this paper have been deposited in the GenBank data library under accession nos. AQ367111–AQ368361.] PMID:10645957
The Public Goods Hypothesis for the evolution of life on Earth.
McInerney, James O; Pisani, Davide; Bapteste, Eric; O'Connell, Mary J
2011-08-23
It is becoming increasingly difficult to reconcile the observed extent of horizontal gene transfers with the central metaphor of a great tree uniting all evolving entities on the planet. In this manuscript we describe the Public Goods Hypothesis and show that it is appropriate in order to describe biological evolution on the planet. According to this hypothesis, nucleotide sequences (genes, promoters, exons, etc.) are simply seen as goods, passed from organism to organism through both vertical and horizontal transfer. Public goods sequences are defined by having the properties of being largely non-excludable (no organism can be effectively prevented from accessing these sequences) and non-rival (while such a sequence is being used by one organism it is also available for use by another organism). The universal nature of genetic systems ensures that such non-excludable sequences exist and non-excludability explains why we see a myriad of genes in different combinations in sequenced genomes. There are three features of the public goods hypothesis. Firstly, segments of DNA are seen as public goods, available for all organisms to integrate into their genomes. Secondly, we expect the evolution of mechanisms for DNA sharing and of defense mechanisms against DNA intrusion in genomes. Thirdly, we expect that we do not see a global tree-like pattern. Instead, we expect local tree-like patterns to emerge from the combination of a commonage of genes and vertical inheritance of genomes by cell division. Indeed, while genes are theoretically public goods, in reality, some genes are excludable, particularly, though not only, when they have variant genetic codes or behave as coalition or club goods, available for all organisms of a coalition to integrate into their genomes, and non-rival within the club. 
We view the Tree of Life hypothesis as a regionalized instance of the Public Goods hypothesis, just like classical mechanics and euclidean geometry are seen as regionalized instances of quantum mechanics and Riemannian geometry respectively. We argue for this change using an axiomatic approach that shows that the Public Goods hypothesis is a better accommodation of the observed data than the Tree of Life hypothesis. PMID:21861918
A little bit of sex matters for genome evolution in asexual plants.
Hojsgaard, Diego; Hörandl, Elvira
2015-01-01
Genome evolution in asexual organisms is theoretically expected to be shaped by various factors: first, hybrid origin, and polyploidy confer a genomic constitution of highly heterozygous genotypes with multiple copies of genes; second, asexuality confers a lack of recombination and variation in populations, which reduces the efficiency of selection against deleterious mutations; hence, the accumulation of mutations and a gradual increase in mutational load (Muller's ratchet) would lead to rapid extinction of asexual lineages; third, allelic sequence divergence is expected to result in rapid divergence of lineages (Meselson effect). Recent transcriptome studies on the asexual polyploid complex Ranunculus auricomus using single-nucleotide polymorphisms confirmed neutral allelic sequence divergence within a short time frame, but rejected a hypothesis of a genome-wide accumulation of mutations in asexuals compared to sexuals, except for a few genes related to reproductive development. We discuss a general model that the observed incidence of facultative sexuality in plants may unmask deleterious mutations with partial dominance and expose them efficiently to purging selection. A little bit of sex may help to avoid genomic decay and extinction.
Plant genome and transcriptome annotations: from misconceptions to simple solutions
Bolger, Marie E; Arsova, Borjana; Usadel, Björn
2018-01-01
Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plant sciences. Although genome and transcriptome sequencing has become orders of magnitude cheaper and more efficient, the functional annotation process often lags behind. This might be hampered by the lack of a comprehensive enumeration of simple-to-use tools available to the plant researcher. In this comprehensive review, we present (i) typical ontologies to be used in the plant sciences, (ii) useful databases and resources used for functional annotation, (iii) what to expect from an annotated plant genome, (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typical steps used to annotate plant genomes/transcriptomes using publicly available resources. PMID:28062412
Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome.
Aljohi, Hasan Awad; Liu, Wanfei; Lin, Qiang; Zhao, Yuhui; Zeng, Jingyao; Alamer, Ali; Alanazi, Ibrahim O; Alawad, Abdullah O; Al-Sadi, Abdullah M; Hu, Songnian; Yu, Jun
2016-01-01
Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in the tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653 bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower than that of the nuclear genome and the neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provide the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants.
Jordan, Daniel M; Do, Ron
2018-04-11
While sequence-based genetic tests have long been available for specific loci, especially for Mendelian disease, the rapidly falling costs of genome-wide genotyping arrays, whole-exome sequencing, and whole-genome sequencing are moving us toward a future where full genomic information might inform the prognosis and treatment of a variety of diseases, including complex disease. Similarly, the availability of large populations with full genomic information has enabled new insights about the etiology and genetic architecture of complex disease. Insights from the latest generation of genomic studies suggest that our categorization of diseases as complex may conceal a wide spectrum of genetic architectures and causal mechanisms that ranges from Mendelian forms of complex disease to complex regulatory structures underlying Mendelian disease. Here, we review these insights, along with advances in the prediction of disease risk and outcomes from full genomic information. Expected final online publication date for the Annual Review of Genomics and Human Genetics Volume 19 is August 31, 2018. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Sensitivity to sequencing depth in single-cell cancer genomics.
Alves, João M; Posada, David
2018-04-16
Querying cancer genomes at single-cell resolution is expected to provide a powerful framework to understand in detail the dynamics of cancer evolution. However, given the high costs currently associated with single-cell sequencing, together with the inevitable technical noise arising from single-cell genome amplification, cost-effective strategies that maximize the quality of single-cell data are critically needed. Taking advantage of previously published single-cell whole-genome and whole-exome cancer datasets, we studied the impact of sequencing depth and sampling effort towards single-cell variant detection. Five single-cell whole-genome and whole-exome cancer datasets were independently downscaled to 25, 10, 5, and 1× sequencing depth. For each depth level, ten technical replicates were generated, resulting in a total of 6280 single-cell BAM files. The sensitivity of variant detection, including structural and driver mutations, genotyping, clonal inference, and phylogenetic reconstruction to sequencing depth was evaluated using recent tools specifically designed for single-cell data. Altogether, our results suggest that for relatively large sample sizes (25 or more cells) sequencing single tumor cells at depths > 5× does not drastically improve somatic variant discovery, characterization of clonal genotypes, or estimation of single-cell phylogenies. We suggest that sequencing multiple individual tumor cells at a modest depth represents an effective alternative to explore the mutational landscape and clonal evolutionary patterns of cancer genomes.
Comparison and correlation of Simple Sequence Repeats distribution in genomes of Brucella species
Kiran, Jangampalli Adi Pradeep; Chakravarthi, Veeraraghavulu Praveen; Kumar, Yellapu Nanda; Rekha, Somesula Swapna; Kruti, Srinivasan Shanthi; Bhaskar, Matcha
2011-01-01
Computational genomics is an important tool for understanding the distribution of simple sequence repeats (SSRs) across closely related genomes, which gives valuable information regarding genetic variation. The central objective of the present study was to screen the SSRs distributed in coding and non-coding regions among different human Brucella species, which are involved in a range of pathological disorders. Computational analysis of the SSRs in Brucella indicates few deviations from expected random models. Statistical analysis also reveals that tri-nucleotide SSRs are overrepresented and tetra-nucleotide SSRs underrepresented in Brucella genomes. The data suggest that overrepresented tri-nucleotide SSRs in genomic and coding regions might be responsible for the generation of functional variation in expressed proteins, which in turn may lead to different pathogenicity, virulence determinants, stress response genes, transcription regulators and host adaptation proteins of Brucella genomes. Abbreviations: SSRs - Simple Sequence Repeats, ORFs - Open Reading Frames. PMID:21738309
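As an illustration of the kind of SSR screen described in this abstract, the following is a minimal Python sketch that finds perfect tandem repeats and tallies them by repeat-unit length. The function names, the minimum of three repeats and the maximum unit length of six are assumptions made for the example; the abstract does not state which tool or thresholds the authors used.

```python
import re
from collections import Counter


def find_ssrs(seq, min_repeats=3, max_unit=6):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    min_repeats and max_unit are illustrative thresholds, not values
    taken from the study.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of unit_len bases followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are just a shorter motif repeated (e.g. "AA").
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


def count_by_unit_length(hits):
    """Tally SSRs by unit length (1 = mono-, 3 = tri-nucleotide, ...)."""
    return Counter(len(unit) for _, unit, _ in hits)
```

Comparing such tallies against counts from shuffled sequences is one possible way to ask whether a repeat class is over- or underrepresented, although the random model used in the study is not specified here.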
Genome structure of bdelloid rotifers: shaped by asexuality or desiccation?
Gladyshev, Eugene A; Arkhipova, Irina R
2010-01-01
Bdelloid rotifers are microscopic invertebrate animals best known for their ancient asexuality and the ability to survive desiccation at any life stage. Both factors are expected to have a profound influence on their genome structure. Recent molecular studies demonstrated that, although the gene-rich regions of bdelloid genomes are organized as colinear pairs of closely related sequences and depleted in repetitive DNA, subtelomeric regions harbor diverse transposable elements and horizontally acquired genes of foreign origin. Although asexuality is expected to result in depletion of deleterious transposons, only desiccation appears to have the power to produce all the uncovered genomic peculiarities. Repair of desiccation-induced DNA damage would require the presence of a homologous template, maintaining colinear pairs in gene-rich regions and selecting against insertion of repetitive DNA that might cause chromosomal rearrangements. Desiccation may also induce a transient state of competence in recovering animals, allowing them to acquire environmental DNA. Even if bdelloids engage in rare or obscure forms of sexual reproduction, all these features could still be present. The relative contribution of asexuality and desiccation to genome organization may be clarified by analyzing whole-genome sequences and comparing foreign gene and transposon content in species which lost the ability to survive desiccation.
Clinical providers' experiences with returning results from genomic sequencing: an interview study.
Wynn, Julia; Lewis, Katie; Amendola, Laura M; Bernhardt, Barbara A; Biswas, Sawona; Joshi, Manasi; McMullen, Carmit; Scollon, Sarah
2018-05-08
Current medical practice includes the application of genomic sequencing (GS) in clinical and research settings. Despite expanded use of this technology, the process of disclosure of genomic results to patients and research participants has not been thoroughly examined and there are no established best practices. We conducted semi-structured interviews with 21 genetic and non-genetic clinicians returning results of GS as part of the NIH funded Clinical Sequencing Exploratory Research (CSER) Consortium projects. Interviews focused on the logistics of sessions, participant/patient reactions and factors influencing them, how the sessions changed with experience, and resources and training recommended to return genomic results. The length of preparation and disclosure sessions varied depending on the type and number of results and their implications. Internal and external databases, online resources and result review meetings were used to prepare. Respondents reported that participants' reactions were variable and ranged from enthusiasm and relief to confusion and disappointment. Factors influencing reactions were types of results, expectations and health status. A recurrent challenge was managing inflated expectations about GS. Other challenges included returning multiple, unanticipated and/or uncertain results and navigating a rare diagnosis. Methods to address these challenges included traditional genetic counseling techniques and modifying practice over time in order to provide anticipatory guidance and modulate expectations. Respondents made recommendations to improve access to genomic resources and genetic referrals to prepare future providers as the uptake of GS increases in both genetic and non-genetic settings. These findings indicate that returning genomic results is similar to return of results in traditional genetic testing but is magnified by the additional complexity and potential uncertainty of the results. 
Managing patient expectations, initially identified in studies of informed consent, remains an ongoing challenge and highlights the need to address this issue throughout the testing process. The results of this study will help to guide future providers in the disclosure of genomic results and highlight educational needs and resources necessary to prepare providers. Future research on the patient experience, understanding and follow-up of recommendations is needed to more fully understand the disclosure process.
Variation block-based genomics method for crop plants.
Kim, Yul Ho; Park, Hyang Mi; Hwang, Tae-Young; Lee, Seuk Ki; Choi, Man Soo; Jho, Sungwoong; Hwang, Seungwoo; Kim, Hak-Min; Lee, Dongwoo; Kim, Byoung-Chul; Hong, Chang Pyo; Cho, Yun Sung; Kim, Hyunmin; Jeong, Kwang Ho; Seo, Min Jung; Yun, Hong Tai; Kim, Sun Lim; Kwon, Young-Up; Kim, Wook Han; Chun, Hye Kyung; Lim, Sang Jong; Shin, Young-Ah; Choi, Ik-Young; Kim, Young Sun; Yoon, Ho-Sung; Lee, Suk-Ha; Lee, Sunghoon
2014-06-15
In contrast with wild species, cultivated crop genomes consist of reshuffled recombination blocks, which occurred by crossing and selection processes. Accordingly, recombination block-based genomics analysis can be an effective approach for the screening of target loci for agricultural traits. We propose the variation block method, which is a three-step process for recombination block detection and comparison. The first step is to detect variations by comparing the short-read DNA sequences of the cultivar to the reference genome of the target crop. Next, sequence blocks with variation patterns are examined and defined. The boundaries between the variation-containing sequence blocks are regarded as recombination sites. All the assumed recombination sites in the cultivar set are used to split the genomes, and the resulting sequence regions are termed variation blocks. Finally, the genomes are compared using the variation blocks. The variation block method identified recurring recombination blocks accurately and successfully represented block-level diversities in the publicly available genomes of 31 soybean and 23 rice accessions. The practicality of this approach was demonstrated by the identification of a putative locus determining soybean hilum color. We suggest that the variation block method is an efficient genomics method for the recombination block-level comparison of crop genomes. We expect that this method will facilitate the development of crop genomics by bringing genomics technologies to the field of crop breeding.
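The splitting step of the variation block method can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the recombination sites have already been detected for each cultivar as integer coordinates on one chromosome, and the function name and data layout are hypothetical.

```python
def variation_blocks(chromosome_length, sites_per_cultivar):
    """Split a chromosome at the union of all cultivars' recombination sites.

    sites_per_cultivar: iterable of iterables of integer positions
    (hypothetical input format). Returns half-open (start, end) intervals.
    """
    # Pool the assumed recombination sites from every cultivar in the set.
    pooled = sorted(
        {
            site
            for sites in sites_per_cultivar
            for site in sites
            if 0 < site < chromosome_length
        }
    )
    edges = [0] + pooled + [chromosome_length]
    # Each pair of neighbouring edges delimits one variation block.
    return list(zip(edges[:-1], edges[1:]))
```

For example, `variation_blocks(100, [[30, 70], [30, 50]])` yields four blocks, which could then be compared across cultivars block by block.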
Staton, Margaret; Zhebentyayeva, Tetyana; Olukolu, Bode; Fang, Guang Chen; Nelson, Dana; Carlson, John E; Abbott, Albert G
2015-10-05
Chinese chestnut (Castanea mollissima) has emerged as a model species for the Fagaceae family with extensive genomic resources including a physical map, a dense genetic map and quantitative trait loci (QTLs) for chestnut blight resistance. These resources enable comparative genomics analyses relative to model plants. We assessed the degree of conservation between the chestnut genome and other well annotated and assembled plant genomic sequences, focusing on the QTL regions of most interest to the chestnut breeding community. The integrated physical and genetic map of Chinese chestnut has been improved to now include 858 shared sequence-based markers. The utility of the integrated map has also been improved through the addition of 42,970 BAC (bacterial artificial chromosome) end sequences spanning over 26 million bases of the estimated 800 Mb chestnut genome. Synteny between chestnut and ten model plant species was assessed on a macro-syntenic scale using sequences from both individual probes and BAC end sequences across the chestnut physical map. Blocks of synteny with chestnut were found in all ten reference species, with the percent of the chestnut physical map that could be aligned ranging from 10 to 39 %. The integrated genetic and physical map was utilized to identify BACs that spanned the three previously identified QTL regions conferring blight resistance. The clones were pooled and sequenced, yielding 396 sequence scaffolds covering 13.9 Mbp. Comparative genomic analysis on a micro-syntenic scale, using the QTL-associated genomic sequence, identified synteny from chestnut to other plant genomes ranging from 5.4 to 12.9 % of the genome sequences aligning. On both the macro- and micro-synteny levels, the peach, grape and poplar genomes were found to be the most structurally conserved with chestnut. 
Interestingly, these results did not strictly follow the expectation that decreased phylogenetic distance would correspond to increased levels of genome preservation, but rather suggest the additional influence of life-history traits on preservation of synteny. The regions of synteny that were detected provide an important tool for defining and cataloging genes in the QTL regions for advancing chestnut blight resistance research.
A decade after the first full human genome sequencing: when will we understand our own genome?
Eisenhaber, Frank
2012-10-01
The contrast between the pomp of celebrating the first full human genome sequencing in 2000 and the cautious tone of recollections a decade thereafter could hardly be greater. The promises with regard to medical cures and biotechnology applications have not come close to being realized. Understanding the human genome means knowing the genes' and proteins' functions and their interconnectedness via biomolecular mechanisms. This article estimates how long it will take to achieve this goal if we extrapolate from the previous decade (indeed, a century!), and considers the possible disruptive trends in science, technology and society that may accelerate the pace of progress dramatically.
Park, Tae-Ho; Park, Beom-Seok; Kim, Jin-A; Hong, Joon Ki; Jin, Mina; Seol, Young-Joo; Mun, Jeong-Hwan
2011-01-01
As a part of the Multinational Genome Sequencing Project of Brassica rapa, linkage groups R9 and R3 were sequenced using a bacterial artificial chromosome (BAC) by BAC strategy. The current physical contigs are expected to cover approximately 90% of the euchromatin of both chromosomes. As the project progresses, BAC selection for sequence extension becomes more limited because BAC libraries are restriction enzyme-specific. To support the project, a randomly sheared fosmid library was constructed. The library consists of 97536 clones with an average insert size of approximately 40 kb, corresponding to seven genome equivalents, assuming a Chinese cabbage genome size of 550 Mb. The library was screened with primers designed at the ends of sequences at nine scaffold gaps where BAC clones could not be selected to extend the physical contigs. The selected positive clones were end-sequenced to check the overlap between the fosmid clones and the adjacent BAC clones. Nine fosmid clones were selected and fully sequenced. The sequences revealed two completed gap fills and seven sequence extensions, which can be used for further selection of BAC clones, confirming that the fosmid library will facilitate the sequence completion of B. rapa. Copyright © 2011. Published by Elsevier Ltd.
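The "seven genome equivalents" figure follows directly from the numbers given in the abstract. The short Python check below reproduces it; the second function applies the standard random-library approximation for the chance that a given locus is represented, which is added here for illustration and is not a calculation reported by the authors.

```python
def genome_equivalents(n_clones, mean_insert_kb, genome_size_mb):
    """Library coverage = total cloned DNA / haploid genome size."""
    return n_clones * mean_insert_kb / (genome_size_mb * 1000.0)


def locus_representation_probability(n_clones, mean_insert_kb, genome_size_mb):
    """P = 1 - (1 - f)^N, with f the fraction of the genome in one clone.

    Assumes clones are drawn uniformly at random from the genome.
    """
    fraction = mean_insert_kb / (genome_size_mb * 1000.0)
    return 1.0 - (1.0 - fraction) ** n_clones


# Values from the abstract: 97536 clones, ~40 kb inserts, 550 Mb genome.
coverage = genome_equivalents(97536, 40, 550)
probability = locus_representation_probability(97536, 40, 550)
```

With these inputs the coverage comes out at roughly 7.1 genome equivalents, consistent with the stated value.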
Comprehensive analysis of CpG islands in human chromosomes 21 and 22
NASA Astrophysics Data System (ADS)
Takai, Daiya; Jones, Peter A.
2002-03-01
CpG islands are useful markers for genes in organisms containing 5-methylcytosine in their genomes. In addition, CpG islands located in the promoter regions of genes can play important roles in gene silencing during processes such as X-chromosome inactivation, imprinting, and silencing of intragenomic parasites. The generally accepted definition of what constitutes a CpG island was proposed in 1987 by Gardiner-Garden and Frommer [Gardiner-Garden, M. & Frommer, M. (1987) J. Mol. Biol. 196, 261-282] as being a 200-bp stretch of DNA with a C+G content of 50% and an observed CpG/expected CpG in excess of 0.6. Any definition of a CpG island is somewhat arbitrary, and this one, which was derived before the sequencing of mammalian genomes, will include many sequences that are not necessarily associated with controlling regions of genes but rather are associated with intragenomic parasites. We have therefore used the complete genomic sequences of human chromosomes 21 and 22 to examine the properties of CpG islands in different sequence classes by using a search algorithm that we have developed. Regions of DNA of greater than 500 bp with a G+C equal to or greater than 55% and observed CpG/expected CpG of 0.65 were more likely to be associated with the 5' regions of genes and this definition excluded most Alu-repetitive elements. We also used genome sequences to show strong CpG suppression in the human genome and slight suppression in Drosophila melanogaster and Saccharomyces cerevisiae. This finding is compatible with the recent detection of 5-methylcytosine in Drosophila, and might suggest that S. cerevisiae has, or once had, CpG methylation.
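The two CpG island definitions compared in this abstract can be expressed as a small Python sketch. The observed/expected ratio uses the usual Gardiner-Garden and Frommer form, (number of CpG × length) / (number of C × number of G). The function names are hypothetical, the code tests a single window rather than scanning a chromosome, and treating the stricter thresholds as "at least 0.65" is an assumption, since the abstract says only "of 0.65".

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for one window."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Obs/Exp = (#CpG * N) / (#C * #G); defined as 0 when C or G is absent.
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


def is_island_1987(seq):
    """Gardiner-Garden and Frommer: >= 200 bp, G+C >= 50%, Obs/Exp > 0.6."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= 200 and gc_fraction >= 0.50 and obs_exp > 0.6


def is_island_strict(seq):
    """Stricter criteria from this abstract: > 500 bp, G+C >= 55%,
    Obs/Exp >= 0.65 (the ">=" is an assumption)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > 500 and gc_fraction >= 0.55 and obs_exp >= 0.65
```

A real search would slide such a window along the sequence and merge overlapping hits; that part of the authors' algorithm is not described here and is left out.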
Species-specific Typing of DNA Based on Palindrome Frequency Patterns
Lamprea-Burgunder, Estelle; Ludin, Philipp; Mäser, Pascal
2011-01-01
DNA in its natural, double-stranded form may contain palindromes, sequences which read the same from either side because they are identical to their reverse complement on the sister strand. Short palindromes are underrepresented in all kinds of genomes. The frequency distribution of short palindromes exhibits more than twice the inter-species variance of non-palindromic sequences, which renders palindromes optimally suited for the typing of DNA. Here, we show that based on palindrome frequency, DNA sequences can be discriminated to the level of species of origin. By plotting the ratios of actual occurrence to expectancy, we generate palindrome frequency patterns that make it possible to cluster different sequences of the same genome and to assign plasmids, and in some cases even viruses, to their respective host genomes. This finding will be of use in the growing field of metagenomics. PMID:21429991
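A palindrome frequency pattern of the kind described can be sketched in Python as below. The abstract does not say how the expectancy was computed, so this example assumes the simplest model, independent bases at the sequence's own base frequencies; the function names and the default word length of 4 are likewise illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word):
    """True if the word equals its own reverse complement (e.g. GAATTC)."""
    return word == reverse_complement(word)


def palindrome_ratios(seq, k=4):
    """Observed/expected occurrence for every palindromic k-mer.

    Expected counts assume independent bases at the sequence's own base
    frequencies (an assumption of this sketch).
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}
    base_freq = {base: seq.count(base) / len(seq) for base in "ACGT"}
    ratios = {}
    for word in map("".join, product("ACGT", repeat=k)):
        if not is_palindrome(word):
            continue
        # Count overlapping occurrences of the word.
        observed = sum(1 for i in range(n_windows) if seq[i:i + k] == word)
        expected = float(n_windows)
        for base in word:
            expected *= base_freq[base]
        if expected > 0:
            ratios[word] = observed / expected
    return ratios
```

The resulting dictionary of ratios is one vector per sequence; vectors from different sequences could then be compared or clustered, which is the step the abstract uses for typing.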
Nullomers and High Order Nullomers in Genomic Sequences
Vergni, Davide; Santoni, Daniele
2016-01-01
A nullomer is an oligomer that does not occur as a subsequence in a given DNA sequence, i.e. it is an absent word of that sequence. The importance of nullomers in several applications, from drug discovery to forensic practice, is now debated in the literature. Here, we investigated the nature of nullomers, asking whether their absence in genomes has just a statistical explanation or is a peculiar feature of genomic sequences. We introduced an extension of the notion of nullomer, namely high order nullomers, which are nullomers whose mutated sequences are still nullomers. We studied different aspects of them: comparison with nullomers of random sequences, CpG distribution and mean helical rise. In agreement with previous results we found that the number of nullomers in the human genome is much larger than expected by chance. Nevertheless antithetical results were found when considering a random DNA sequence preserving dinucleotide frequencies. The analysis of CpG frequencies in nullomers and high order nullomers revealed, as expected, a high CpG content but it also highlighted a strong dependence of CpG frequencies on the dinucleotide position, suggesting that nullomers have their own peculiar structure and are not simply sequences whose CpG frequency is biased. Furthermore, phylogenetic trees were built on eleven species based on both the similarities between the dinucleotide frequencies and the number of nullomers two species share, showing that nullomers are fairly conserved among close species. Finally the study of mean helical rise of nullomer sequences revealed significantly high mean rise values, reinforcing the hypothesis that those sequences have some peculiar structural features. The obtained results show that nullomers are the consequence of the peculiar structure of DNA (also including biased CpG frequency and CpG islands), so that the hypermutability model, even when taking CpG islands into account, seems insufficient to explain the nullomer phenomenon. 
Finally, high order nullomers could emphasize those features that already make simple nullomers useful in several applications. PMID:27906971
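The definition of a nullomer given above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and "mutated sequences" is assumed here to mean all single-base substitutions, which the abstract does not specify.

```python
from itertools import product

ALPHABET = "ACGT"


def find_nullomers(sequence: str, k: int) -> set[str]:
    """Return every k-mer over ACGT that never occurs in `sequence`."""
    seq = sequence.upper()
    # Collect all k-mers that are actually present in the sequence.
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    all_kmers = {"".join(p) for p in product(ALPHABET, repeat=k)}
    return all_kmers - present


def single_substitutions(kmer: str) -> set[str]:
    """All k-mers that differ from `kmer` at exactly one position."""
    neighbours = set()
    for i, base in enumerate(kmer):
        for other in ALPHABET:
            if other != base:
                neighbours.add(kmer[:i] + other + kmer[i + 1:])
    return neighbours


def find_high_order_nullomers(sequence: str, k: int) -> set[str]:
    """Nullomers whose single-substitution mutants are all nullomers too.

    Assumption: 'mutated' means one substituted base; other mutation
    models (indels, several substitutions) would need a different check.
    """
    nullomers = find_nullomers(sequence, k)
    return {n for n in nullomers if single_substitutions(n) <= nullomers}
```

For a real genome the set of present k-mers would normally come from a dedicated k-mer counter. The exhaustive enumeration above is only practical for small k.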
Robustness of Massively Parallel Sequencing Platforms
Kavak, Pınar; Yüksel, Bayram; Aksu, Soner; Kulekci, M. Oguzhan; Güngör, Tunga; Hach, Faraz; Şahinalp, S. Cenk; Alkan, Can; Sağıroğlu, Mahmut Şamil
2015-01-01
The improvements in high throughput sequencing technologies (HTS) made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there are significant improvements in accuracy and reproducibility of HTS based analyses, the usability of these types of data for diagnostic and prognostic applications necessitates near-perfect data generation. To assess the usability of a widely used HTS platform for accurate and reproducible clinical applications in terms of robustness, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed a significant number of discrepancies in the call sets. As expected, most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, albeit only a small fraction of the discordant variants were within the exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful for providing data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before use in clinical applications. PMID:26382624
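As a rough illustration of what comparing two call sets involves, the following Python sketch counts shared and discordant variants. It is not the pipeline used in the study. The tuple representation (chromosome, position, reference allele, alternate allele) and the function name are assumptions made for the example.

```python
# A variant is assumed to be a (chrom, pos, ref, alt) tuple; real call
# sets would be parsed from VCF files, which is not shown here.
Variant = tuple[str, int, str, str]


def compare_call_sets(calls_a: list[Variant], calls_b: list[Variant]) -> dict:
    """Summarise agreement between two variant call sets."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Fraction of all distinct calls that both sets agree on.
        "concordance": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    center_1 = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
    center_2 = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")]
    print(compare_call_sets(center_1, center_2))
```

Exact tuple matching is deliberately strict. Indels in particular can be represented in more than one way, so a real comparison would usually normalise variants first.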
Primer on Molecular Genetics; DOE Human Genome Program
DOE R&D Accomplishments Database
1992-04-01
This report is taken from the April 1992 draft of the DOE Human Genome 1991--1992 Program Report, which is expected to be published in May 1992. The primer is intended to be an introduction to basic principles of molecular genetics pertaining to the genome project. The material contained herein is not final and may be incomplete. Techniques of genetic mapping and DNA sequencing are described.
Conifer genomics and adaptation: at the crossroads of genetic diversity and genome function.
Prunier, Julien; Verta, Jukka-Pekka; MacKay, John J
2016-01-01
Conifers have been understudied at the genomic level despite their worldwide ecological and economic importance but the situation is rapidly changing with the development of next generation sequencing (NGS) technologies. With NGS, genomics research has simultaneously gained in speed, magnitude and scope. In just a few years, genomes of 20-24 gigabases have been sequenced for several conifers, with several others expected in the near future. Biological insights have resulted from recent sequencing initiatives as well as genetic mapping, gene expression profiling and gene discovery research over nearly two decades. We review the knowledge arising from conifer genomics research emphasizing genome evolution and the genomic basis of adaptation, and outline emerging questions and knowledge gaps. We discuss future directions in three areas with potential inputs from NGS technologies: the evolutionary impacts of adaptation in conifers based on the adaptation-by-speciation model; the contributions of genetic variability of gene expression in adaptation; and the development of a broader understanding of genetic diversity and its impacts on genome function. These research directions promise to sustain research aimed at addressing the emerging challenges of adaptation that face conifer trees. © 2015 The Authors. New Phytologist © 2015 New Phytologist Trust.
Bass, David; Moureau, Gregory; Tang, Shuoya; McAlister, Erica; Culverwell, C. Lorna; Glücksman, Edvard; Wang, Hui; Brown, T. David K.; Gould, Ernest A.; Harbach, Ralph E.; de Lamballerie, Xavier; Firth, Andrew E.
2013-01-01
We investigated whether small RNA (sRNA) sequenced from field-collected mosquitoes and chironomids (Diptera) can be used as a proxy signature of viral prevalence within a range of species and viral groups, using sRNAs sequenced from wild-caught specimens, to inform total RNA deep sequencing of samples of particular interest. Using this strategy, we sequenced from adult Anopheles maculipennis s.l. mosquitoes the apparently nearly complete genome of one previously undescribed virus related to chronic bee paralysis virus, and, from a pool of Ochlerotatus caspius and Oc. detritus mosquitoes, a nearly complete entomobirnavirus genome. We also reconstructed long sequences (1503-6557 nt) related to at least nine other viruses. Crucially, several of the sequences detected were reconstructed from host organisms highly divergent from those in which related viruses have been previously isolated or discovered. It is clear that viral transmission and maintenance cycles in nature are likely to be significantly more complex and taxonomically diverse than previously expected. PMID:24260463
Single-Cell Genomic Analysis in Plants
Hu, Haifei; Scheben, Armin; Edwards, David
2018-01-01
Individual cells in an organism are variable, which strongly impacts cellular processes. Advances in sequencing technologies have enabled single-cell genomic analysis to become widespread, addressing shortcomings of analyses conducted on populations of bulk cells. While the field of single-cell plant genomics is in its infancy, there is great potential to gain insights into cell lineage and functional cell types to help understand complex cellular interactions in plants. In this review, we discuss current approaches for single-cell plant genomic analysis, with a focus on single-cell isolation, DNA amplification, next-generation sequencing, and bioinformatics analysis. We outline the technical challenges of analysing material from a single plant cell, and then examine applications of single-cell genomics and the integration of this approach with genome editing. Finally, we indicate future directions we expect in the rapidly developing field of plant single-cell genomic analysis. PMID:29361790
The Qatar genome project: translation of whole-genome sequencing into clinical practice.
Zayed, Hatem
2016-10-01
The Qatar Genome Project was launched in 2013 with the intent to sequence the genome of each Qatari citizen in an effort to protect Qataris from the high rate of indigenous genetic diseases by allowing the mapping of disease-causing variants/rare variants and establishing a Qatari reference genome. Indeed, this project is expected to have numerous global benefits because of the elevated homogeneity of the Qatari population, which will make Qatar an excellent genetic laboratory, generating a wealth of data that will allow us to make sense of the genotype-phenotype correlations of many diseases, especially the complex multifactorial diseases, and will pave the way for changing the traditional medical practice of looking first at the phenotype rather than the genotype. © 2016 John Wiley & Sons Ltd.
Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don
Hirakawa, Hideki; Okada, Yoshihiro; Tabuchi, Hiroaki; Shirasawa, Kenta; Watanabe, Akiko; Tsuruoka, Hisano; Minami, Chiharu; Nakayama, Shinobu; Sasamoto, Shigemi; Kohara, Mitsuyo; Kishida, Yoshie; Fujishiro, Tsunakazu; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yoshinaga, Masaru; Takahata, Yasuhiro; Tanaka, Masaru; Tabata, Satoshi; Isobe, Sachiko N.
2015-01-01
Ipomoea trifida (H. B. K.) G. Don. is the most likely diploid ancestor of the hexaploid sweet potato, I. batatas (L.) Lam. To assist in analysis of the sweet potato genome, de novo whole-genome sequencing was performed with two lines of I. trifida, namely the selfed line Mx23Hm and the highly heterozygous line 0431-1, using the Illumina HiSeq platform. We classified the sequences thus obtained as either ‘core candidates’ (common to the two lines) or ‘line specific’. The total length of the assembled sequences of Mx23Hm (ITR_r1.0) was 513 Mb, while that of 0431-1 (ITRk_r1.0) was 712 Mb. Of the assembled sequences, 240 Mb (Mx23Hm) and 353 Mb (0431-1) were classified as core candidate sequences. A total of 62,407 (62.4 Mb) and 109,449 (87.2 Mb) putative genes were identified, respectively, in the genomes of Mx23Hm and 0431-1, of which 11,823 were derived from core sequences of Mx23Hm, while 28,831 were from the core candidate sequence of 0431-1. There were a total of 1,464,173 single-nucleotide polymorphisms and 16,682 copy number variations (CNVs) in the two assembled genomic sequences (under the condition of log2 ratio of >1 and CNV size >1,000 bases). The results presented here are expected to contribute to the progress of genomic and genetic studies of I. trifida, as well as studies of the sweet potato and the genus Ipomoea in general. PMID:25805887
Enhancing genomic laboratory reports from the patients' view: A qualitative analysis.
Stuckey, Heather; Williams, Janet L; Fan, Audrey L; Rahm, Alanna Kulchak; Green, Jamie; Feldman, Lynn; Bonhag, Michele; Zallen, Doris T; Segal, Michael M; Williams, Marc S
2015-10-01
The purpose of this study was to develop a family genomic laboratory report designed to communicate genome sequencing results to parents of children who were participating in a whole genome sequencing clinical research study. Semi-structured interviews were conducted with parents of children who participated in a whole genome sequencing clinical research study to address the elements, language and format of a sample family-directed genome laboratory report. The qualitative interviews were followed by two focus groups aimed at evaluating example presentations of information about prognosis and next steps related to the whole genome sequencing result. Three themes emerged from the qualitative data: (i) Parents described a continual search for valid information and resources regarding their child's condition, a need that prior reports did not meet for parents; (ii) Parents believed that the Family Report would help facilitate communication with physicians and family members; and (iii) Parents identified specific items they appreciated in a genomics Family Report: simplicity of language, logical flow, visual appeal, information on what to expect in the future and recommended next steps. Parents affirmed their desire for a family genomic results report designed for their use and reference. They articulated the need for clear, easy to understand language that provided information with temporal detail and specific recommendations regarding relevant findings consistent with that available to clinicians. © 2015 Wiley Periodicals, Inc.
Enhancing genomic laboratory reports from the patients' view: A qualitative analysis
Stuckey, Heather; Fan, Audrey L.; Rahm, Alanna Kulchak; Green, Jamie; Feldman, Lynn; Bonhag, Michele; Zallen, Doris T.; Segal, Michael M.; Williams, Marc S.
2015-01-01
The purpose of this study was to develop a family genomic laboratory report designed to communicate genome sequencing results to parents of children who were participating in a whole genome sequencing clinical research study. Semi‐structured interviews were conducted with parents of children who participated in a whole genome sequencing clinical research study to address the elements, language and format of a sample family‐directed genome laboratory report. The qualitative interviews were followed by two focus groups aimed at evaluating example presentations of information about prognosis and next steps related to the whole genome sequencing result. Three themes emerged from the qualitative data: (i) Parents described a continual search for valid information and resources regarding their child's condition, a need that prior reports did not meet for parents; (ii) Parents believed that the Family Report would help facilitate communication with physicians and family members; and (iii) Parents identified specific items they appreciated in a genomics Family Report: simplicity of language, logical flow, visual appeal, information on what to expect in the future and recommended next steps. Parents affirmed their desire for a family genomic results report designed for their use and reference. They articulated the need for clear, easy to understand language that provided information with temporal detail and specific recommendations regarding relevant findings consistent with that available to clinicians. PMID:26086630
Draft genome sequence of ramie, Boehmeria nivea (L.) Gaudich.
Luan, Ming-Bao; Jian, Jian-Bo; Chen, Ping; Chen, Jun-Hui; Chen, Jian-Hua; Gao, Qiang; Gao, Gang; Zhou, Ju-Hong; Chen, Kun-Mei; Guang, Xuan-Min; Chen, Ji-Kang; Zhang, Qian-Qian; Wang, Xiao-Fei; Fang, Long; Sun, Zhi-Min; Bai, Ming-Zhou; Fang, Xiao-Dong; Zhao, Shan-Cen; Xiong, He-Ping; Yu, Chun-Ming; Zhu, Ai-Guo
2018-05-01
Ramie, Boehmeria nivea (L.) Gaudich, family Urticaceae, is a plant native to eastern Asia, and one of the world's oldest fibre crops. It is also used as animal feed and for the phytoremediation of heavy metal-contaminated farmlands. Thus, the genome sequence of ramie was determined to explore the molecular basis of its fibre quality, protein content and phytoremediation. To further understand the ramie genome, different paired-end and mate-pair libraries were combined to generate 134.31 Gb of raw DNA sequences using the Illumina whole-genome shotgun sequencing approach. The highly heterozygous B. nivea genome was assembled using the Platanus Genome Assembler, which is an effective tool for the assembly of highly heterozygous genome sequences. The final length of the draft genome of this species was approximately 341.9 Mb (contig N50 = 22.62 kb, scaffold N50 = 1,126.36 kb). Based on ramie genome annotations, 30,237 protein-coding genes were predicted, and the repetitive element content was 46.3%. The completeness of the final assembly was evaluated by benchmarking universal single-copy orthologous genes (BUSCO); 90.5% of the 1,440 expected embryophytic genes were identified as complete, and 4.9% were identified as fragmented. Phylogenetic analysis based on single-copy gene families and one-to-one orthologous genes placed ramie with mulberry and cannabis, within the clade of urticalean rosids. Genome information of ramie will be a valuable resource for the conservation of endangered Boehmeria species and for future studies on the biogeography and characteristic evolution of members of Urticaceae. © 2018 John Wiley & Sons Ltd.
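The contig and scaffold N50 values quoted above follow a standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal Python sketch of that calculation follows. The function name and the toy lengths are illustrative only and are not taken from the ramie assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence downwards until half of the
    # total assembly length has been covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
    print(n50(toy_contigs))  # 10 + 9 + 8 = 27 reaches half of 54
```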
2013-01-01
Background Modern banana cultivars are primarily interspecific triploid hybrids of two species, Musa acuminata and Musa balbisiana, which respectively contribute the A- and B-genomes. The M. balbisiana genome has been associated with improved vigour and tolerance to biotic and abiotic stresses and is thus a target for Musa breeding programs. However, while a reference M. acuminata genome has recently been released (Nature 488:213–217, 2012), little sequence data is available for the corresponding B-genome. To address these problems we carried out Next Generation gDNA sequencing of the wild diploid M. balbisiana variety ‘Pisang Klutuk Wulung’ (PKW). Our strategy was to align PKW gDNA reads against the published A-genome and to extract the mapped consensus sequences for subsequent rounds of evaluation and gene annotation. Results The resulting B-genome is 79% the size of the A-genome, and contains 36,638 predicted functional gene sequences which is nearly identical to the 36,542 of the A-genome. There is substantial sequence divergence from the A-genome at a frequency of 1 homozygous SNP per 23.1 bp, and a high degree of heterozygosity corresponding to one heterozygous SNP per 55.9 bp. Using expressed small RNA data, a similar number of microRNA sequences were predicted in both A- and B-genomes, but additional novel miRNAs were detected, including some that are unique to each genome. The usefulness of this B-genome sequence was evaluated by mapping RNA-seq data from a set of triploid AAA and AAB hybrids simultaneously to both genomes. Results for the plantains demonstrated the expected 2:1 distribution of reads across the A- and B-genomes, but for the AAA genomes, results show they contain regions of significant homology to the B-genome supporting proposals that there has been a history of interspecific recombination between homeologous A and B chromosomes in Musa hybrids. 
Conclusions We have generated and annotated a draft reference Musa B-genome and demonstrate that this can be used for molecular genetic mapping of gene transcripts and small RNA expression data from several allopolyploid banana cultivars. This draft therefore represents a valuable resource to support the study of metabolism in inter- and intraspecific triploid Musa hybrids and to help direct breeding programs. PMID:24094114
Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome
Zhao, Yuhui; Zeng, Jingyao; Alamer, Ali; Alanazi, Ibrahim O.; Alawad, Abdullah O.; Al-Sadi, Abdullah M.; Hu, Songnian; Yu, Jun
2016-01-01
Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in the tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653 bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower than that of the nuclear genome and the neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provide the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants. PMID:27736909
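The Ti/Tv ratio mentioned above is the number of transitions (purine to purine, or pyrimidine to pyrimidine) divided by the number of transversions (purine to pyrimidine or the reverse). The Python sketch below shows the counting step only. The input format, a list of (reference base, alternate base) pairs, and the function name are assumptions for the example and do not reflect how the authors processed their data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_ratio(substitutions: list[tuple[str, str]]) -> float:
    """Transition/transversion ratio for single-base substitutions."""
    transitions = 0
    transversions = 0
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, ignore
        pair = {ref, alt}
        # A change within the same chemical class is a transition.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("inf")
    return transitions / transversions


if __name__ == "__main__":
    observed = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C")]
    print(ti_tv_ratio(observed))  # 3 transitions, 1 transversion
```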
Legault, Boris A; Lopez-Lopez, Arantxa; Alba-Casado, Jose Carlos; Doolittle, W Ford; Bolhuis, Henk; Rodriguez-Valera, Francisco; Papke, R Thane
2006-01-01
Background Mature saturated brine (crystallizers) communities are largely dominated (>80% of cells) by the square halophilic archaeon "Haloquadratum walsbyi". The recent cultivation of the strain HBSQ001 and the sequencing of its genome allows comparison with the metagenome of this taxonomically simplified environment. Similar studies carried out in other extreme environments have revealed very little diversity in gene content among the cell lineages present. Results The metagenome of the microbial community of a crystallizer pond has been analyzed by end sequencing a 2000 clone fosmid library and comparing the sequences obtained with the genome sequence of "Haloquadratum walsbyi". The genome of the sequenced strain was retrieved nearly complete within this environmental DNA library. However, many ORFs that could be ascribed to the "Haloquadratum" metapopulation by common genome characteristics or scaffolding to the strain genome were not present in the specific sequenced isolate. Particularly, three regions of the sequenced genome were associated with multiple rearrangements and the presence of different genes from the metapopulation. Many transposition and phage related genes were found within this pool which, together with the associated atypical GC content in these areas, supports lateral gene transfer mediated by these elements as the most probable genetic cause of this variability. Additionally, these sequences were highly enriched in putative regulatory and signal transduction functions. Conclusion These results point to a large pan-genome (total gene repertoire of the genus/species) even in this highly specialized extremophile and at a single geographic location. The extensive gene repertoire is what might be expected of a population that exploits a diverse nutrient pool, resulting from the degradation of biomass produced at lower salinities. PMID:16820057
Gambling on a shortcut to genome sequencing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roberts, L.
1991-06-21
Almost from the start of the Human Genome Project, a debate has been raging over whether to sequence the entire human genome, all 3 billion bases, or just the genes - a mere 2% or 3% of the genome, and by far the most interesting part. In England, Sydney Brenner convinced the Medical Research Council (MRC) to start with the expressed genes, or complementary DNAs. But the US stance has been that the entire sequence is essential if we are to understand the blueprint of man. Craig Venter of the National Institute of Neurological Disorders and Stroke says that focusing on the expressed genes may be even more useful than expected. His strategy involves randomly selecting clones from cDNA libraries which theoretically contain all the genes that are switched on at a particular time in a particular tissue. Then the researchers sequence just a short stretch of each clone, about 400 to 500 bases, to create an expressed sequence tag or EST. The sequences of these ESTs are then stored in a database. Using that information, other researchers can then recreate that EST by using polymerase chain reaction techniques.
MPD: a pathogen genome and metagenome database
Zhang, Tingting; Miao, Jiaojiao; Han, Na; Qiang, Yujun; Zhang, Wen
2018-01-01
Advances in high-throughput sequencing have led to unprecedented growth in the amount of available genome sequencing data, especially for bacterial genomes, which has been accompanied by a challenge for the storage and management of such huge datasets. To facilitate bacterial research and related studies, we have developed the Mypathogen database (MPD), which provides access to users for searching, downloading, storing and sharing bacterial genomics data. The MPD represents the first pathogenic database for microbial genomes and metagenomes, and currently covers pathogenic microbial genomes (6604 genera, 11 071 species, 41 906 strains) and metagenomic data from host, air, water and other sources (28 816 samples). The MPD also functions as a management system for statistical and storage data that can be used by different organizations, thereby facilitating data sharing among different organizations and research groups. A user-friendly local client tool is provided to maintain the steady transmission of big sequencing data. The MPD is a useful tool for analysis and management in genomic research, especially for clinical Centers for Disease Control and epidemiological studies, and is expected to contribute to advancing knowledge on pathogenic bacteria genomes and metagenomes. Database URL: http://data.mypathogen.org PMID:29917040
Sela, Noa; Lachman, Oded; Reingold, Victoria; Dombrovsky, Aviv
2013-10-01
A novel virus was detected in watermelon plants (Citrullus lanatus Thunb.) infected with Melon necrotic spot virus (MNSV) using SOLiD next-generation sequence analysis. In addition to the expected MNSV genome, two double-stranded RNA (dsRNA) segments of 1,312 and 1,118 bp were also identified and sequenced from the purified virus preparations. These two dsRNA segments encode two putative partitivirus-related proteins, an RNA-dependent RNA polymerase (RdRP) and a capsid protein, which were sequenced. Genomic-sequence analysis and analysis of phylogenetic relationships indicate that these two dsRNAs together make up the genome of a novel Partitivirus. This virus was found to be closely related to the Pepper cryptic virus 1 and Raphanus sativus cryptic virus. It is suggested that this novel virus, putatively named Citrullus lanatus cryptic virus, be considered a new member of the family Partitiviridae.
SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss
Di Génova, Alex; Aravena, Andrés; Zapata, Luis; González, Mauricio; Maass, Alejandro; Iturra, Patricia
2011-01-01
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar, Oncorhynchus mykiss and the whole genome sequence of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from GMOD project, GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, orthologs prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make the SalmonDB database a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near future, particularly with the S. salar genome sequencing project. Database URL: http://genomicasalmones.dim.uchile.cl/ PMID:22120661
SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss.
Di Génova, Alex; Aravena, Andrés; Zapata, Luis; González, Mauricio; Maass, Alejandro; Iturra, Patricia
2011-01-01
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar, Oncorhynchus mykiss and the whole genome sequence of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from GMOD project, GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, orthologs prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make the SalmonDB database a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near future, particularly with the S. salar genome sequencing project. Database URL: http://genomicasalmones.dim.uchile.cl/
Ramu, P; Kassahun, B; Senthilvel, S; Ashok Kumar, C; Jayashree, B; Folkertsma, R T; Reddy, L Ananda; Kuruvinashetti, M S; Haussmann, B I G; Hash, C T
2009-11-01
The sequencing and detailed comparative functional analysis of genomes of a number of select botanical models open new doors into comparative genomics among the angiosperms, with potential benefits for improvement of many orphan crops that feed large populations. In this study, a set of simple sequence repeat (SSR) markers was developed by mining the expressed sequence tag (EST) database of sorghum. Among the SSR-containing sequences, only those sharing considerable homology with rice genomic sequences across the lengths of the 12 rice chromosomes were selected. Thus, 600 SSR-containing sorghum EST sequences (50 homologous sequences on each of the 12 rice chromosomes) were selected, with the intention of providing coverage for corresponding homologous regions of the sorghum genome. Primer pairs were designed and polymorphism detection ability was assessed using parental pairs of two existing sorghum mapping populations. About 28% of these new markers detected polymorphism in this 4-entry panel. A subset of 55 polymorphic EST-derived SSR markers were mapped onto the existing skeleton map of a recombinant inbred population derived from cross N13 x E 36-1, which is segregating for Striga resistance and the stay-green component of terminal drought tolerance. These new EST-derived SSR markers mapped across all 10 sorghum linkage groups, mostly to regions expected based on prior knowledge of rice-sorghum synteny. The ESTs from which these markers were derived were then mapped in silico onto the aligned sorghum genome sequence, and 88% of the best hits corresponded to linkage-based positions. This study demonstrates the utility of comparative genomic information in targeted development of markers to fill gaps in linkage maps of related crop species for which sufficient genomic tools are not available.
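The first step described above, mining EST sequences for simple sequence repeats, can be sketched with a regular expression. This is only an illustration: the abstract does not state which tool or thresholds were used, so the motif lengths (2 to 6 bases), the minimum of four repeats, and the function name are all assumptions.

```python
import re


def find_ssrs(sequence: str, min_unit: int = 2, max_unit: int = 6,
              min_repeats: int = 4) -> list[tuple[int, str, int]]:
    """Find perfect tandem repeats; return (start, motif, repeat count).

    The thresholds are illustrative defaults, not those of the study.
    """
    # Lazy motif match so the shortest repeating unit is preferred,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


if __name__ == "__main__":
    est = "GGGACACACACACTTT"
    print(find_ssrs(est))
```

Primer design and the homology filter against rice chromosomes are separate steps that this sketch does not attempt.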
Guo, Yaqiong; Tang, Kevin; Rowe, Lori A; Li, Na; Roellig, Dawn M; Knipe, Kristine; Frace, Michael; Yang, Chunfu; Feng, Yaoyu; Xiao, Lihua
2015-04-18
Cryptosporidium hominis is a dominant species for human cryptosporidiosis. Within the species, IbA10G2 is the most virulent subtype responsible for all C. hominis-associated outbreaks in Europe and Australia, and is a dominant outbreak subtype in the United States. In recent years, IaA28R4 is becoming a major new subtype in the United States. In this study, we sequenced the genomes of two field specimens from each of the two subtypes and conducted a comparative genomic analysis of the obtained sequences with those from the only fully sequenced Cryptosporidium parvum genome. Altogether, 8.59-9.05 Mb of Cryptosporidium sequences in 45-767 assembled contigs were obtained from the four specimens, representing 94.36-99.47% coverage of the expected genome. These genomes had complete synteny in gene organization and 96.86-97.0% and 99.72-99.83% nucleotide sequence similarities to the published genomes of C. parvum and C. hominis, respectively. Several major insertions and deletions were seen between C. hominis and C. parvum genomes, involving mostly members of multicopy gene families near telomeres. The four C. hominis genomes were highly similar to each other and divergent from the reference IaA25R3 genome in some highly polymorphic regions. Major sequence differences among the four specimens sequenced in this study were in the 5' and 3' ends of chromosome 6 and the gp60 region, largely the result of genetic recombination. The sequence similarity among specimens of the two dominant outbreak subtypes and genetic recombination in chromosome 6, especially around the putative virulence determinant gp60 region, suggest that genetic recombination plays a potential role in the emergence of hyper-transmissible C. hominis subtypes. The high sequence conservation between C. parvum and C. hominis genomes and significant differences in copy numbers of MEDLE family secreted proteins and insulinase-like proteases indicate that telomeric gene duplications could potentially contribute to host expansion in C. parvum.
Finding the missing honey bee genes: lessons learned from a genome upgrade.
Elsik, Christine G; Worley, Kim C; Bennett, Anna K; Beye, Martin; Camara, Francisco; Childers, Christopher P; de Graaf, Dirk C; Debyser, Griet; Deng, Jixin; Devreese, Bart; Elhaik, Eran; Evans, Jay D; Foster, Leonard J; Graur, Dan; Guigo, Roderic; Hoff, Katharina Jasmin; Holder, Michael E; Hudson, Matthew E; Hunt, Greg J; Jiang, Huaiyang; Joshi, Vandita; Khetani, Radhika S; Kosarev, Peter; Kovar, Christie L; Ma, Jian; Maleszka, Ryszard; Moritz, Robin F A; Munoz-Torres, Monica C; Murphy, Terence D; Muzny, Donna M; Newsham, Irene F; Reese, Justin T; Robertson, Hugh M; Robinson, Gene E; Rueppell, Olav; Solovyev, Victor; Stanke, Mario; Stolle, Eckart; Tsuruda, Jennifer M; Vaerenbergh, Matthias Van; Waterhouse, Robert M; Weaver, Daniel B; Whitfield, Charles W; Wu, Yuanqing; Zdobnov, Evgeny M; Zhang, Lan; Zhu, Dianhui; Gibbs, Richard A
2014-01-30
The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.
Finding the missing honey bee genes: lessons learned from a genome upgrade
2014-01-01
Background: The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. Results: Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. Conclusions: Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination. PMID:24479613
A fully decompressed synthetic bacteriophage øX174 genome assembled and archived in yeast.
Jaschke, Paul R; Lieberman, Erica K; Rodriguez, Jon; Sierra, Adrian; Endy, Drew
2012-12-20
The 5386 nucleotide bacteriophage øX174 genome has a complicated architecture that encodes 11 gene products via overlapping protein coding sequences spanning multiple reading frames. We designed a 6302 nucleotide synthetic surrogate, øX174.1, that fully separates all primary phage protein coding sequences along with cognate translation control elements. To specify øX174.1f, a decompressed genome the same length as wild type, we truncated the gene F coding sequence. We synthesized DNA encoding fragments of øX174.1f and used a combination of in vitro- and yeast-based assembly to produce yeast vectors encoding natural or designer bacteriophage genomes. We isolated clonal preparations of yeast plasmid DNA and transfected E. coli C strains. We recovered viable øX174 particles containing the øX174.1f genome from E. coli C strains that independently express full-length gene F. We expect that yeast can serve as a genomic 'drydock' within which to maintain and manipulate clonal lineages of other obligate lytic phage. Copyright © 2012 Elsevier Inc. All rights reserved.
Adrian-Kalchhauser, Irene; Svensson, Ola; Kutschera, Verena E; Alm Rosenblad, Magnus; Pippel, Martin; Winkler, Sylke; Schloissnig, Siegfried; Blomberg, Anders; Burkhardt-Holm, Patricia
2017-02-16
Vertebrate mitochondrial genomes are optimized for fast replication and low cost of RNA expression. Accordingly, they are devoid of introns, are transcribed as polycistrons and contain very little intergenic sequence. Usually, vertebrate mitochondrial genomes measure between 16.5 and 17 kilobases (kb). During genome sequencing projects for two novel vertebrate models, the invasive round goby and the sand goby, we found that the sand goby mitochondrial genome is exceptionally small (16.4 kb), while the mitochondrial genome of the round goby is much larger than expected for a vertebrate. It is 19 kb in size and is thus one of the largest fish and even vertebrate mitochondrial genomes known to date. The expansion is attributable to a sequence insertion downstream of the putative transcriptional start site. This insertion carries traces of repeats from the control region, but is mostly novel. To get more information about this phenomenon, we gathered all available mitochondrial genomes of Gobiidae and of nine gobioid species, performed phylogenetic analyses, analysed gene arrangements, and compared gobiid mitochondrial genome sizes, ecological information and other species characteristics with respect to the mitochondrial phylogeny. This allowed us, among other things, to identify a unique arrangement of tRNAs among Ponto-Caspian gobies. Our results indicate that the round goby mitochondrial genome may contain novel features. Since mitochondrial genome organisation is tightly linked to energy metabolism, these features may be linked to its invasion success. Also, the unique tRNA arrangement among Ponto-Caspian gobies may be helpful in studying the evolution of this highly adaptive and invasive species group. Finally, we find that the phylogeny of gobiids can be further refined by the use of longer stretches of linked DNA sequence.
Paul, Sinu; Piontkivska, Helen
2009-01-01
Background: Studies have shown that in the genome of human immunodeficiency virus (HIV-1) regions responsible for interactions with the host's immune system, namely, cytotoxic T-lymphocyte (CTL) epitopes tend to cluster together in relatively conserved regions. On the other hand, "epitope-less" regions or regions with relatively low density of epitopes tend to be more variable. However, very little is known about relationships among epitopes from different genes, in other words, whether particular epitopes from different genes would occur together in the same viral genome. To identify CTL epitopes in different genes that co-occur in HIV genomes, association rule mining was used. Results: Using a set of 189 best-defined HIV-1 CTL/CD8+ epitopes from 9 different protein-coding genes, as described by Frahm, Linde & Brander (2007), we examined the complete genomic sequences of 62 reference HIV sequences (including 13 subtypes and sub-subtypes with approximately 4 representative sequences for each subtype or sub-subtype, and 18 circulating recombinant forms). The results showed that despite inclusion of recombinant sequences that would be expected to break up associations of epitopes in different genes when two different genomes are recombined, there exist particular combinations of epitopes (epitope associations) that occur repeatedly across the world-wide population of HIV-1. For example, Pol epitope LFLDGIDKA is found to be significantly associated with epitopes GHQAAMQML and FLKEKGGL from Gag and Nef, respectively, and this association rule is observed even among circulating recombinant forms. Conclusion: We have identified CTL epitope combinations co-occurring in HIV-1 genomes including different subtypes and recombinant forms. Such co-occurrence has important implications for design of complex vaccines (multi-epitope vaccines) and/or drugs that would target multiple HIV-1 regions at once and, thus, may be expected to overcome challenges associated with viral escape. PMID:19580659
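The association rule mining mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each genome is reduced to the set of epitopes present in it and scores pairwise rules by support and confidence. The function name, thresholds, and toy presence/absence data are hypothetical and are not taken from the study, whose actual software and parameters are not stated here.

```python
from itertools import combinations


def association_rules(genomes, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for epitope pairs.

    genomes: dict mapping a genome name to the set of epitopes found in it.
    """
    n = len(genomes)
    epitopes = sorted(set().union(*genomes.values()))
    # Number of genomes that contain each single epitope.
    count = {e: sum(e in s for s in genomes.values()) for e in epitopes}
    rules = []
    for a, b in combinations(epitopes, 2):
        both = sum(a in s and b in s for s in genomes.values())
        support = both / n  # fraction of genomes carrying both epitopes
        if support < min_support:
            continue
        # Evaluate the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = both / count[x]  # P(y present | x present)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Toy presence/absence data (hypothetical, NOT from the study).
toy = {
    "genome_1": {"LFLDGIDKA", "GHQAAMQML", "FLKEKGGL"},
    "genome_2": {"LFLDGIDKA", "GHQAAMQML", "FLKEKGGL"},
    "genome_3": {"LFLDGIDKA", "GHQAAMQML"},
    "genome_4": {"FLKEKGGL"},
}
rules = association_rules(toy)
```

With this toy input only the LFLDGIDKA/GHQAAMQML pair passes both thresholds, in both directions. Real analyses typically also test statistical significance and consider itemsets larger than pairs.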
Phylogenomics from Whole Genome Sequences Using aTRAM.
Allen, Julie M; Boyd, Bret; Nguyen, Nam-Phuong; Vachaspati, Pranjal; Warnow, Tandy; Huang, Daisie I; Grady, Patrick G S; Bell, Kayce C; Cronk, Quentin C B; Mugisha, Lawrence; Pittendrigh, Barry R; Leonardi, M Soledad; Reed, David L; Johnson, Kevin P
2017-09-01
Novel sequencing technologies are rapidly expanding the size of data sets that can be applied to phylogenetic studies. Currently the most commonly used phylogenomic approaches involve some form of genome reduction. While these approaches make assembling phylogenomic data sets more economical for organisms with large genomes, they reduce the genomic coverage and thereby the long-term utility of the data. Currently, for organisms with moderate to small genomes (<1000 Mbp) it is feasible to sequence the entire genome at modest coverage (10-30×). Computational challenges for handling these large data sets can be alleviated by assembling targeted reads, rather than assembling the entire genome, to produce a phylogenomic data matrix. Here we demonstrate the use of automated Target Restricted Assembly Method (aTRAM) to assemble 1107 single-copy ortholog genes from whole genome sequencing of sucking lice (Anoplura) and out-groups. We developed a pipeline to extract exon sequences from the aTRAM assemblies by annotating them with respect to the original target protein. We aligned these protein sequences with the inferred amino acids and then performed phylogenetic analyses on both the concatenated matrix of genes and on each gene separately in a coalescent analysis. Finally, we tested the limits of successful assembly in aTRAM by assembling 100 genes from close- to distantly related taxa at high to low levels of coverage. Both the concatenated analysis and the coalescent-based analysis produced the same tree topology, which was consistent with previously published results and resolved weakly supported nodes. These results demonstrate that this approach is successful at developing phylogenomic data sets from raw genome sequencing reads. Further, we found that with coverages above 5-10×, aTRAM was successful at assembling 80-90% of the contigs for both close and distantly related taxa. As sequencing costs continue to decline, we expect full genome sequencing will become more feasible for a wider array of organisms, and aTRAM will enable mining of these genomic data sets for an extensive variety of applications, including phylogenomics. [aTRAM; gene assembly; genome sequencing; phylogenomics.] © The Author(s) 2017. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
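The coverage figures quoted in the abstract above (10-30× for genomes under 1000 Mbp) follow from the usual relation between sequenced bases and genome size. The sketch below is only a back-of-the-envelope illustration. The function names and the 150 bp read length are assumptions, not values from the paper.

```python
def sequencing_depth(total_bases, genome_size_bp):
    """Average depth of coverage = total sequenced bases / genome size."""
    return total_bases / genome_size_bp


def reads_needed(target_depth, genome_size_bp, read_length_bp):
    """Smallest number of reads whose combined length reaches the target depth."""
    needed_bases = target_depth * genome_size_bp
    # Integer ceiling division, so a partial read still counts as one read.
    return (needed_bases + read_length_bp - 1) // read_length_bp


# Hypothetical example: a 1000 Mbp genome at 10x with 150 bp reads (assumed).
reads_at_10x = reads_needed(10, 1_000_000_000, 150)
depth = sequencing_depth(30_000_000_000, 1_000_000_000)
```

These are averages. Real coverage varies along the genome, so the assembly success rates reported above depend on more than the nominal depth.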
Walker, Michael B; King, Benjamin L; Paigen, Kenneth
2012-01-01
Arrangements of genes along chromosomes are a product of evolutionary processes, and we can expect that preferable arrangements will prevail over the span of evolutionary time, often being reflected in the non-random clustering of structurally and/or functionally related genes. Such non-random arrangements can arise by two distinct evolutionary processes: duplications of DNA sequences that give rise to clusters of genes sharing both sequence similarity and common sequence features and the migration together of genes related by function, but not by common descent. To provide a background for distinguishing between the two, which is important for future efforts to unravel the evolutionary processes involved, we here provide a description of the extent to which ancestrally related genes are found in proximity. Towards this purpose, we combined information from five genomic datasets, InterPro, SCOP, PANTHER, Ensembl protein families, and Ensembl gene paralogs. The results are provided in publicly available datasets (http://cgd.jax.org/datasets/clustering/paraclustering.shtml) describing the extent to which ancestrally related genes are in proximity beyond what is expected by chance (i.e. form paraclusters) in the human and nine other vertebrate genomes, as well as the D. melanogaster, C. elegans, A. thaliana, and S. cerevisiae genomes. With the exception of Saccharomyces, paraclusters are a common feature of the genomes we examined. In the human genome they are estimated to include at least 22% of all protein coding genes. Paraclusters are far more prevalent among some gene families than others, are highly species or clade specific and can evolve rapidly, sometimes in response to environmental cues. Altogether, they account for a large portion of the functional clustering previously reported in several genomes.
Veerkamp, Roel F; Bouwman, Aniek C; Schrooten, Chris; Calus, Mario P L
2016-12-01
Whole-genome sequence data is expected to capture genetic variation more completely than common genotyping panels. Our objective was to compare the proportion of variance explained and the accuracy of genomic prediction by using imputed sequence data or preselected SNPs from a genome-wide association study (GWAS) with imputed whole-genome sequence data. Phenotypes were available for 5503 Holstein-Friesian bulls. Genotypes were imputed up to whole-genome sequence (13,789,029 segregating DNA variants) by using run 4 of the 1000 bull genomes project. The program GCTA was used to perform GWAS for protein yield (PY), somatic cell score (SCS) and interval from first to last insemination (IFL). From the GWAS, subsets of variants were selected and genomic relationship matrices (GRM) were used to estimate the variance explained in 2087 validation animals and to evaluate the genomic prediction ability. Finally, two GRM were fitted together in several models to evaluate the effect of selected variants that were in competition with all the other variants. The GRM based on full sequence data explained only marginally more genetic variation than that based on common SNP panels: for PY, SCS and IFL, genomic heritability improved from 0.81 to 0.83, 0.83 to 0.87 and 0.69 to 0.72, respectively. Sequence data also helped to identify more variants linked to quantitative trait loci and resulted in clearer GWAS peaks across the genome. The proportion of total variance explained by the selected variants combined in a GRM was considerably smaller than that explained by all variants (less than 0.31 for all traits). When selected variants were used, accuracy of genomic predictions decreased and bias increased. Although 35 to 42 variants were detected that together explained 13 to 19% of the total variance (18 to 23% of the genetic variance) when fitted alone, there was no advantage in using dense sequence information for genomic prediction in the Holstein data used in our study. 
Detection and selection of variants within a single breed are difficult due to long-range linkage disequilibrium. Stringent selection of variants resulted in more biased genomic predictions, although this might be due to the training population being the same dataset from which the selected variants were identified.
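The genomic relationship matrices (GRM) referred to in the abstract above can be sketched as follows. The code uses one widely used standardized formulation, in which each variant is centred by twice its allele frequency and scaled by 2p(1-p). This is an assumption made for illustration: the exact GRM definition and software settings used in the study are not given here, and the function name and toy genotypes are hypothetical.

```python
def genomic_relationship_matrix(genotypes):
    """Build a GRM from allele counts.

    genotypes: list of individuals, each a list of allele counts (0, 1 or 2)
    with one entry per variant.
    """
    n_ind = len(genotypes)
    n_var = len(genotypes[0])
    # Frequency of the counted allele at each variant.
    freqs = [sum(ind[v] for ind in genotypes) / (2 * n_ind) for v in range(n_var)]
    # Only segregating variants (0 < p < 1) can be standardized.
    used = [v for v in range(n_var) if 0 < freqs[v] < 1]
    if not used:
        raise ValueError("no segregating variants")
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    for j in range(n_ind):
        for k in range(j, n_ind):
            total = 0.0
            for v in used:
                p = freqs[v]
                total += ((genotypes[j][v] - 2 * p) * (genotypes[k][v] - 2 * p)
                          / (2 * p * (1 - p)))
            # The matrix is symmetric, so fill both triangles at once.
            grm[j][k] = grm[k][j] = total / len(used)
    return grm


# Toy genotypes: 4 individuals x 3 variants (hypothetical data).
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
grm = genomic_relationship_matrix(toy_genotypes)
```

Building one matrix from all variants and a second from a selected subset, as the abstract describes, would amount to calling such a function twice on different column subsets.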
High-throughput physical mapping of chromosomes using automated in situ hybridization.
George, Phillip; Sharakhova, Maria V; Sharakhov, Igor V
2012-06-28
Projects to obtain whole-genome sequences for 10,000 vertebrate species and for 5,000 insect and related arthropod species are expected to take place over the next 5 years. For example, the sequencing of the genomes for 15 malaria mosquito species is currently being done using an Illumina platform. This Anopheles species cluster includes both vectors and non-vectors of malaria. When the genome assemblies become available, researchers will have the unique opportunity to perform comparative analysis for inferring evolutionary changes relevant to vector ability. However, it has proven difficult to use next-generation sequencing reads to generate high-quality de novo genome assemblies. Moreover, the existing genome assemblies for Anopheles gambiae, although obtained using the Sanger method, are gapped or fragmented. Success of comparative genomic analyses will be limited if researchers deal with numerous sequencing contigs, rather than with chromosome-based genome assemblies. Fragmented, unmapped sequences create problems for genomic analyses because: (i) unidentified gaps cause incorrect or incomplete annotation of genomic sequences; (ii) unmapped sequences lead to confusion between paralogous genes and genes from different haplotypes; and (iii) the lack of chromosome assignment and orientation of the sequencing contigs does not allow for reconstructing rearrangement phylogeny and studying chromosome evolution. Developing high-resolution physical maps for species with newly sequenced genomes is a timely and cost-effective investment that will facilitate genome annotation, evolutionary analysis, and re-sequencing of individual genomes from natural populations. Here, we present innovative approaches to chromosome preparation, fluorescent in situ hybridization (FISH), and imaging that facilitate rapid development of physical maps. Using An. gambiae as an example, we demonstrate that the development of physical chromosome maps can potentially improve genome assemblies and, thus, the quality of genomic analyses. First, we use a high-pressure method to prepare polytene chromosome spreads. This method, originally developed for Drosophila, allows the user to visualize more details on chromosomes than the regular squashing technique. Second, a fully automated, front-end system for FISH is used for high-throughput physical genome mapping. The automated slide staining system runs multiple assays simultaneously and dramatically reduces hands-on time. Third, an automatic fluorescent imaging system, which includes a motorized slide stage, automatically scans and photographs labeled chromosomes after FISH. This system is especially useful for identifying and visualizing multiple chromosomal plates on the same slide. In addition, the scanning process captures a more uniform FISH result. Overall, the automated high-throughput physical mapping protocol is more efficient than a standard manual protocol.
Cho, Kwang-Soo; Yun, Bong-Kyoung; Yoon, Young-Ho; Hong, Su-Young; Mekapogu, Manjulatha; Kim, Kyung-Hee; Yang, Tae-Jin
2015-01-01
We report the chloroplast (cp) genome sequence of tartary buckwheat (Fagopyrum tataricum), obtained by next-generation sequencing technology, and compare it with the previously reported common buckwheat (F. esculentum ssp. ancestrale) cp genome. The cp genome of F. tataricum has a total sequence length of 159,272 bp, which is 327 bp shorter than the common buckwheat cp genome. The cp gene content, order, and orientation are similar to those of common buckwheat, but with some structural variation at tandem and palindromic repeat frequencies and junction areas. A total of seven InDels (around 100 bp) were found within the intergenic sequences and the ycf1 gene. The copy number of the 21-bp tandem repeat differed between F. tataricum (four repeats) and F. esculentum (one repeat), and the InDel of the ycf1 gene was 63 bp long. Nucleotide and amino acid sequences of the coding regions are highly conserved, with about 98% homology, and four genes (rpoC2, ycf3, accD, and clpP) have high synonymous (Ks) values. PCR-based InDel markers were applied to diverse genetic resources of F. tataricum and F. esculentum, and the amplicon size was identical to that expected in silico. Therefore, these InDel markers are informative biomarkers to practically distinguish raw or processed buckwheat products derived from F. tataricum and F. esculentum. PMID:25966355
Extensive concerted evolution of rice paralogs and the road to regaining independence.
Wang, Xiyin; Tang, Haibao; Bowers, John E; Feltus, Frank A; Paterson, Andrew H
2007-11-01
Many genes duplicated by whole-genome duplications (WGDs) are more similar to one another than expected. We investigated whether concerted evolution through conversion and crossing over, well-known to affect tandem gene clusters, also affects dispersed paralogs. Genome sequences for two Oryza subspecies reveal appreciable gene conversion in the approximately 0.4 MY since their divergence, with a gradual progression toward independent evolution of older paralogs. Since divergence from subspecies indica, approximately 8% of japonica paralogs produced 5-7 MYA on chromosomes 11 and 12 have been affected by gene conversion and several reciprocal exchanges of chromosomal segments, while approximately 70-MY-old "paleologs" resulting from a genome duplication (GD) show much less conversion. Sequence similarity analysis in proximal gene clusters also suggests more conversion between younger paralogs. About 8% of paleologs may have been converted since rice-sorghum divergence approximately 41 MYA. Domain-encoding sequences are more frequently converted than nondomain sequences, suggesting a sort of circularity--that sequences conserved by selection may be further conserved by relatively frequent conversion. The higher level of concerted evolution in the 5-7 MY-old segmental duplication may reflect the behavior of many genomes within the first few million years after duplication or polyploidization.
Lo, Y M Dennis
2013-12-01
The discovery of cell-free fetal DNA in maternal plasma in 1997 has stimulated a rapid development of non-invasive prenatal testing. The recent advent of massively parallel sequencing has allowed the analysis of circulating cell-free fetal DNA to be performed with unprecedented sensitivity and precision. Fetal trisomies 21, 18 and 13 are now robustly detectable in maternal plasma and such analyses have been available clinically since 2011. Fetal genome-wide molecular karyotyping and whole-genome sequencing have now been demonstrated in a number of proof-of-concept studies. Genome-wide and targeted sequencing of maternal plasma has been shown to allow the non-invasive prenatal testing of β-thalassaemia and can potentially be generalized to other monogenic diseases. It is thus expected that plasma DNA-based non-invasive prenatal testing will play an increasingly important role in future obstetric care. It is thus timely and important that the ethical, social and legal issues of non-invasive prenatal testing be discussed actively by all parties involved in prenatal care. Copyright © 2013 Reproductive Healthcare Ltd. Published by Elsevier Ltd. All rights reserved.
Comparative genomic data of the Avian Phylogenomics Project.
Zhang, Guojie; Li, Bo; Li, Cai; Gilbert, M Thomas P; Jarvis, Erich D; Wang, Jun
2014-01-01
The evolutionary relationships of modern birds are among the most challenging to understand in systematic biology and have been debated for centuries. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders, and used the genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomics analyses (Jarvis et al. in press; Zhang et al. in press). Here we release assemblies and datasets associated with the comparative genome analyses, which include 38 newly sequenced avian genomes plus previously released or simultaneously released genomes of Chicken, Zebra finch, Turkey, Pigeon, Peregrine falcon, Duck, Budgerigar, Adelie penguin, Emperor penguin and the Medium Ground Finch. We hope that this resource will serve future efforts in phylogenomics and comparative genomics. The 38 bird genomes were sequenced using the Illumina HiSeq 2000 platform and assembled using a whole genome shotgun strategy. The 48 genomes were categorized into two groups according to the N50 scaffold size of the assemblies: a high depth group comprising 23 species sequenced at high coverage (>50X) with multiple insert size libraries resulting in N50 scaffold sizes greater than 1 Mb (except the White-throated Tinamou and Bald Eagle); and a low depth group comprising 25 species sequenced at a low coverage (~30X) with two insert size libraries resulting in an average N50 scaffold size of about 50 kb. Repetitive elements comprised 4%-22% of the bird genomes. The assembled scaffolds allowed the homology-based annotation of 13,000-17,000 protein coding genes in each avian genome relative to chicken, zebra finch and human, as well as comparative and sequence conservation analyses. Here we release full genome assemblies of 38 newly sequenced avian species, link to genome assembly downloads for 7 of the remaining 10 species, and provide a guide to the genomic data that has been generated and used in our Avian Phylogenomics Project. To the best of our knowledge, the Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date. The genomic data presented here is expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, developmental biology, and other related areas.
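The abstract above groups assemblies by their N50 scaffold size. As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 definition: the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The function name and example lengths are illustrative and are not data from the project.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Illustrative scaffold lengths (hypothetical), total 290: 80 + 70 = 150 covers half.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Because N50 depends only on the length distribution, a high value indicates contiguity but says nothing about whether the scaffolds are correctly assembled.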
Multiple hybrid de novo genome assembly of finger millet, an orphan allotetraploid crop
Hatakeyama, Masaomi; Aluri, Sirisha; Balachadran, Mathi Thumilan; Sivarajan, Sajeevan Radha; Patrignani, Andrea; Grüter, Simon; Poveda, Lucy; Shimizu-Inatsugi, Rie; Baeten, John; Francoijs, Kees-Jan; Nataraja, Karaba N; Reddy, Yellodu A Nanja; Phadnis, Shamprasad; Ravikumar, Ramapura L; Schlapbach, Ralph; Sreeman, Sheshshayee M; Shimizu, Kentaro K
2018-01-01
Finger millet (Eleusine coracana (L.) Gaertn) is an important crop for food security because of its tolerance to drought, which is expected to be exacerbated by global climate changes. Nevertheless, it is often classified as an orphan/underutilized crop because of the paucity of scientific attention. Among several small millets, finger millet is considered as an excellent source of essential nutrient elements, such as iron and zinc; hence, it has potential as an alternate coarse cereal. However, high-quality genome sequence data of finger millet are currently not available. One of the major problems encountered in the genome assembly of this species was its polyploidy, which hampers genome assembly compared with a diploid genome. To overcome this problem, we sequenced its genome using diverse technologies with sufficient coverage and assembled it via a novel multiple hybrid assembly workflow that combines next-generation with single-molecule sequencing, followed by whole-genome optical mapping using the Bionano Irys® system. The total number of scaffolds was 1,897 with an N50 length >2.6 Mb and detection of 96% of the universal single-copy orthologs. The majority of the homeologs were assembled separately. This indicates that the proposed workflow is applicable to the assembly of other allotetraploid genomes. PMID:28985356
Discovery and mapping of single feature polymorphisms in wheat using Affymetrix arrays
Bernardo, Amy N; Bradbury, Peter J; Ma, Hongxiang; Hu, Shengwa; Bowden, Robert L; Buckler, Edward S; Bai, Guihua
2009-01-01
Background: Wheat (Triticum aestivum L.) is a staple food crop worldwide. The wheat genome has not yet been sequenced due to its huge genome size (~17,000 Mb) and high levels of repetitive sequences; the whole genome sequence may not be expected in the near future. Available linkage maps have low marker density due to limitation in available markers; therefore new technologies that detect genome-wide polymorphisms are still needed to discover a large number of new markers for construction of high-resolution maps. A high-resolution map is a critical tool for gene isolation, molecular breeding and genomic research. Single feature polymorphism (SFP) is a new microarray-based type of marker that is detected by hybridization of DNA or cRNA to oligonucleotide probes. This study was conducted to explore the feasibility of using the Affymetrix GeneChip to discover and map SFPs in the large hexaploid wheat genome. Results: Six wheat varieties of diverse origins (Ning 7840, Clark, Jagger, Encruzilhada, Chinese Spring, and Opata 85) were analyzed for significant probe by variety interactions and 396 probe sets with SFPs were identified. A subset of 164 unigenes was sequenced and 54% showed polymorphism within probes. Microarray analysis of 71 recombinant inbred lines from the cross Ning 7840/Clark identified 955 SFPs and 877 of them were mapped together with 269 simple sequence repeat markers. The SFPs were randomly distributed within a chromosome but were unevenly distributed among different genomes. The B genome had the most SFPs, and the D genome had the least. Map positions of a selected set of SFPs were validated by mapping single nucleotide polymorphism using SNaPshot and comparing with expressed sequence tags mapping data. Conclusion: The Affymetrix array is a cost-effective platform for SFP discovery and SFP mapping in wheat. The new high-density map constructed in this study will be a useful tool for genetic and genomic research in wheat. PMID:19480702
Deep whole-genome sequencing of 100 southeast Asian Malays.
Wong, Lai-Ping; Ong, Rick Twee-Hee; Poh, Wan-Ting; Liu, Xuanyao; Chen, Peng; Li, Ruoying; Lam, Kevin Koi-Yau; Pillai, Nisha Esakimuthu; Sim, Kar-Seng; Xu, Haiyan; Sim, Ngak-Leng; Teo, Shu-Mei; Foo, Jia-Nee; Tan, Linda Wei-Lin; Lim, Yenly; Koo, Seok-Hwee; Gan, Linda Seo-Hwee; Cheng, Ching-Yu; Wee, Sharon; Yap, Eric Peng-Huat; Ng, Pauline Crystal; Lim, Wei-Yen; Soong, Richie; Wenk, Markus Rene; Aung, Tin; Wong, Tien-Yin; Khor, Chiea-Chuen; Little, Peter; Chia, Kee-Seng; Teo, Yik-Ying
2013-01-10
Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies. Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Deep Whole-Genome Sequencing of 100 Southeast Asian Malays
Wong, Lai-Ping; Ong, Rick Twee-Hee; Poh, Wan-Ting; Liu, Xuanyao; Chen, Peng; Li, Ruoying; Lam, Kevin Koi-Yau; Pillai, Nisha Esakimuthu; Sim, Kar-Seng; Xu, Haiyan; Sim, Ngak-Leng; Teo, Shu-Mei; Foo, Jia-Nee; Tan, Linda Wei-Lin; Lim, Yenly; Koo, Seok-Hwee; Gan, Linda Seo-Hwee; Cheng, Ching-Yu; Wee, Sharon; Yap, Eric Peng-Huat; Ng, Pauline Crystal; Lim, Wei-Yen; Soong, Richie; Wenk, Markus Rene; Aung, Tin; Wong, Tien-Yin; Khor, Chiea-Chuen; Little, Peter; Chia, Kee-Seng; Teo, Yik-Ying
2013-01-01
Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies. PMID:23290073
Primers for polymerase chain reaction to detect genomic DNA of Toxocara canis and T. cati.
Wu, Z; Nagano, I; Xu, D; Takahashi, Y
1997-03-01
Primers for polymerase chain reaction to amplify genomic DNA of both Toxocara canis and T. cati were constructed by adapting cloning and sequencing random amplified polymorphic DNA. The primers are expected to detect eggs and/or larvae of T. canis and T. cati, both of which are known to cause toxocariasis in humans.
The Origins of 168, W23, and Other Bacillus subtilis Legacy Strains
Zeigler, Daniel R.; Prágai, Zoltán; Rodriguez, Sabrina; Chevreux, Bastien; Muffler, Andrea; Albert, Thomas; Bai, Renyuan; Wyss, Markus; Perkins, John B.
2008-01-01
Bacillus subtilis is both a model organism for basic research and an industrial workhorse, yet there are major gaps in our understanding of the genomic heritage and provenance of many widely used strains. We analyzed 17 legacy strains dating to the early years of B. subtilis genetics. For three—NCIB 3610T, PY79, and SMY—we performed comparative genome sequencing. For the remainder, we used conventional sequencing to sample genomic regions expected to show sequence heterogeneity. Sequence comparisons showed that 168, its siblings (122, 160, and 166), and the type strains NCIB 3610 and ATCC 6051 are highly similar and are likely descendants of the original Marburg strain, although the 168 lineage shows genetic evidence of early domestication. Strains 23, W23, and W23SR are identical in sequence to each other but only 94.6% identical to the Marburg group in the sequenced regions. Strain 23, the probable W23 parent, likely arose from a contaminant in the mutagenesis experiments that produced 168. The remaining strains are all genomic hybrids, showing one or more “W23 islands” in a 168 genomic backbone. Each traces its origin to transformations of 168 derivatives with DNA from 23 or W23. The common prototrophic lab strain PY79 possesses substantial W23 islands at its trp and sac loci, along with large deletions that have reduced its genome 4.3%. SMY, reputed to be the parent of 168, is actually a 168-W23 hybrid that likely shares a recent ancestor with PY79. These data provide greater insight into the genomic history of these B. subtilis legacy strains. PMID:18723616
Thomsen, Martin Christen Frølund; Ahrenfeldt, Johanne; Cisneros, Jose Luis Bellod; Jurtz, Vanessa; Larsen, Mette Voldby; Hasman, Henrik; Aarestrup, Frank Møller; Lund, Ole
2016-01-01
Recent advances in whole genome sequencing have made the technology available for routine use in microbiological laboratories. However, a major obstacle for using this technology is the availability of simple and automatic bioinformatics tools. Based on previously published and already available web-based tools we developed a single pipeline for batch uploading of whole genome sequencing data from multiple bacterial isolates. The pipeline will automatically identify the bacterial species and, if applicable, assemble the genome, identify the multilocus sequence type, plasmids, virulence genes and antimicrobial resistance genes. A short printable report for each sample will be provided and an Excel spreadsheet containing all the metadata and a summary of the results for all submitted samples can be downloaded. The pipeline was benchmarked using datasets previously used to test the individual services. The reported results enable a rapid overview of the major results, and comparing that to the previously found results showed that the platform is reliable and able to correctly predict the species and find most of the expected genes automatically. In conclusion, a combined bioinformatics platform was developed and made publicly available, providing easy-to-use automated analysis of bacterial whole genome sequencing data. The platform may be of immediate relevance as a guide for investigators using whole genome sequencing for clinical diagnostics and surveillance. The platform is freely available at: https://cge.cbs.dtu.dk/services/CGEpipeline-1.1 and it is the intention that it will continue to be expanded with new features as these become available.
Targeted Analysis of Whole Genome Sequence Data to Diagnose Genetic Cardiomyopathy
Golbus, Jessica R.; Puckelwartz, Megan J.; Dellefave-Castillo, Lisa; ...
2014-09-01
Background—Cardiomyopathy is highly heritable but genetically diverse. At present, genetic testing for cardiomyopathy uses targeted sequencing to simultaneously assess the coding regions of more than 50 genes. New genes are routinely added to panels to improve the diagnostic yield. With the anticipated $1000 genome, it is expected that genetic testing will shift towards comprehensive genome sequencing accompanied by targeted gene analysis. Therefore, we assessed the reliability of whole genome sequencing and targeted analysis to identify cardiomyopathy variants in 11 subjects with cardiomyopathy. Methods and Results—Whole genome sequencing with an average of 37× coverage was combined with targeted analysis focused on 204 genes linked to cardiomyopathy. Genetic variants were scored using multiple prediction algorithms combined with frequency data from public databases. This pipeline yielded 1-14 potentially pathogenic variants per individual. Variants were further analyzed using clinical criteria and/or segregation analysis. Three of three previously identified primary mutations were detected by this analysis. In six subjects for whom the primary mutation was previously unknown, we identified mutations that segregated with disease, had clinical correlates, and/or had additional pathological correlation to provide evidence for causality. For two subjects with previously known primary mutations, we identified additional variants that may act as modifiers of disease severity. In total, we identified the likely pathological mutation in 9 of 11 (82%) subjects. We conclude that these pilot data demonstrate that ~30-40× coverage whole genome sequencing combined with targeted analysis is feasible and sensitive to identify rare variants in cardiomyopathy-associated genes.
2013-01-01
Background Cotton, one of the world’s leading crops, is important to the world’s textile and energy industries, and is a model species for studies of plant polyploidization, cellulose biosynthesis and cell wall biogenesis. Here, we report the construction of a plant-transformation-competent binary bacterial artificial chromosome (BIBAC) library and comparative genome sequence analysis of polyploid Upland cotton (Gossypium hirsutum L.) with one of its diploid putative progenitor species, G. raimondii Ulbr. Results We constructed the cotton BIBAC library in a vector competent for high-molecular-weight DNA transformation in different plant species through either Agrobacterium or particle bombardment. The library contains 76,800 clones with an average insert size of 135 kb, providing an approximate 99% probability of obtaining at least one positive clone from the library using a single-copy probe. The quality and utility of the library were verified by identifying BIBACs containing genes important for fiber development, fiber cellulose biosynthesis, seed fatty acid metabolism, cotton-nematode interaction, and bacterial blight resistance. In order to gain an insight into the Upland cotton genome and its relationship with G. raimondii, we sequenced nearly 10,000 BIBAC ends (BESs) randomly selected from the library, generating approximately one BES for every 250 kb along the Upland cotton genome. The retroelement Gypsy/DIRS1 family predominates in the Upland cotton genome, accounting for over 77% of all transposable elements. From the BESs, we identified 1,269 simple sequence repeats (SSRs), of which 1,006 were new, thus providing additional markers for cotton genome research. Surprisingly, comparative sequence analysis showed that Upland cotton is much more diverged from G. raimondii at the genomic sequence level than expected. There seems to be no significant difference between the relationships of the Upland cotton D- and A-subgenomes with the G. raimondii genome, even though G. raimondii contains a D genome (D5). Conclusions The library represents the first BIBAC library in cotton and related species, thus providing tools useful for integrative physical mapping, large-scale genome sequencing and large-scale functional analysis of the Upland cotton genome. Comparative sequence analysis provides insights into the Upland cotton genome, and a possible mechanism underlying the divergence and evolution of polyploid Upland cotton from its diploid putative progenitor species, G. raimondii. PMID:23537070
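The library statistics in the abstract above can be sanity-checked with the standard Clarke-Carbon estimate, P = 1 - (1 - I/G)^N. The sketch below is illustrative only: the abstract does not state a genome size, so the value used here (about 2.5 Gb) is an assumption inferred from "nearly 10,000" BESs at roughly one per 250 kb, and the function name is hypothetical.

```python
def prob_at_least_one_clone(n_clones, insert_kb, genome_kb):
    """Clarke-Carbon estimate: P = 1 - (1 - I/G)^N."""
    return 1.0 - (1.0 - insert_kb / genome_kb) ** n_clones

# ASSUMPTION: genome size is not given in the abstract; ~10,000 BESs at
# about one BES per 250 kb suggests roughly 2.5 Gb (2,500,000 kb).
ASSUMED_GENOME_KB = 10_000 * 250

N_CLONES = 76_800   # from the abstract
INSERT_KB = 135     # average insert size, from the abstract

p = prob_at_least_one_clone(N_CLONES, INSERT_KB, ASSUMED_GENOME_KB)
coverage = N_CLONES * INSERT_KB / ASSUMED_GENOME_KB  # genome equivalents

print(f"genome equivalents: {coverage:.2f}")
print(f"P(at least one clone): {p:.3f}")
```

Under this assumed genome size the library holds about four genome equivalents and the estimate comes out at roughly 98%, in the same range as the approximate 99% reported; a somewhat smaller genome size would move the estimate closer to 99%.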
GAP Final Technical Report 12-14-04
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andrew J. Bordner, PhD, Senior Research Scientist
2004-12-14
The Genomics Annotation Platform (GAP) was designed to develop new tools for high throughput functional annotation and characterization of protein sequences and structures resulting from genomics and structural proteomics, benchmarking and application of those tools. Furthermore, this platform integrated the genomic scale sequence and structural analysis and prediction tools with the advanced structure prediction and bioinformatics environment of ICM. The development of GAP was primarily oriented towards the annotation of new biomolecular structures using both structural and sequence data. Even though the amount of protein X-ray crystal data is growing exponentially, the volume of sequence data is growing even more rapidly. This trend was exploited by leveraging the wealth of sequence data to provide functional annotation for protein structures. The additional information provided by GAP is expected to assist the majority of the commercial users of ICM, who are involved in drug discovery, in identifying promising drug targets as well as in devising strategies for the rational design of therapeutics directed at the protein of interest. The GAP also provided valuable tools for biochemistry education, and structural genomics centers. In addition, GAP incorporates many novel prediction and analysis methods not available in other molecular modeling packages. This development led to signing the first Molsoft agreement in the structural genomics annotation area with the University of Oxford Structural Genomics Center. This commercial agreement validated the Molsoft efforts under the GAP project and provided the basis for further development of the large scale functional annotation platform.
Single cell genomics of uncultured marine alveolates shows paraphyly of basal dinoflagellates.
Strassert, Jürgen F H; Karnkowska, Anna; Hehenberger, Elisabeth; Del Campo, Javier; Kolisko, Martin; Okamoto, Noriko; Burki, Fabien; Janouškovec, Jan; Poirier, Camille; Leonard, Guy; Hallam, Steven J; Richards, Thomas A; Worden, Alexandra Z; Santoro, Alyson E; Keeling, Patrick J
2018-01-01
Marine alveolates (MALVs) are diverse and widespread early-branching dinoflagellates, but most knowledge of the group comes from a few cultured species that are generally not abundant in natural samples, or from diversity analyses of PCR-based environmental SSU rRNA gene sequences. To more broadly examine MALV genomes, we generated single cell genome sequences from seven individually isolated cells. Genes expected of heterotrophic eukaryotes were found, with interesting exceptions like presence of proteorhodopsin and vacuolar H+-pyrophosphatase. Phylogenetic analysis of concatenated SSU and LSU rRNA gene sequences provided strong support for the paraphyly of MALV lineages. Dinoflagellate viral nucleoproteins were found only in MALV groups that branched as sister to dinokaryotes. Our findings indicate that multiple independent origins of several characteristics early in dinoflagellate evolution, such as a parasitic life style, underlie the environmental diversity of MALVs, and suggest they have more varied trophic modes than previously thought.
Food Safety in the Age of Next Generation Sequencing, Bioinformatics, and Open Data Access.
Taboada, Eduardo N; Graham, Morag R; Carriço, João A; Van Domselaar, Gary
2017-01-01
Public health labs and food regulatory agencies globally are embracing whole genome sequencing (WGS) as a revolutionary new method that is positioned to replace numerous existing diagnostic and microbial typing technologies with a single new target: the microbial draft genome. The ability to cheaply generate large amounts of microbial genome sequence data, combined with emerging policies of food regulatory and public health institutions making their microbial sequences increasingly available and public, has served to open up the field to the general scientific community. This open data access policy shift has resulted in a proliferation of data being deposited into sequence repositories and of novel bioinformatics software designed to analyze these vast datasets. There also has been a more recent drive for improved data sharing to achieve more effective global surveillance, public health and food safety. Such developments have heightened the need for enhanced analytical systems in order to process and interpret this new type of data in a timely fashion. In this review we outline the emergence of genomics, bioinformatics and open data in the context of food safety. We also survey major efforts to translate genomics and bioinformatics technologies out of the research lab and into routine use in modern food safety labs. We conclude by discussing the challenges and opportunities that remain, including those expected to play a major role in the future of food safety science.
NASA Technical Reports Server (NTRS)
Everroad, R. Craig; Stuart, Rhona K.; Bebout, Brad M.; Detweiler, Angela M.; Lee, Jackson Zan; Woebken, Dagmar; Bebout, Leslie E.; Pett-Ridge, Jennifer
2016-01-01
The nonheterocystous filamentous cyanobacterium, strain ESFC-1, is a recently described member of the order Oscillatoriales within the Cyanobacteria. ESFC-1 has been shown to be a major diazotroph in the intertidal microbial mat system at Elkhorn Slough, CA, USA. Based on phylogenetic analyses of the 16S RNA gene, ESFC-1 appears to belong to a unique, genus-level divergence; the draft genome sequence of this strain has now been determined. Here we report features of this genome as they relate to the ecological functions and capabilities of strain ESFC-1. The 5,632,035 bp genome sequence encodes 4914 protein-coding genes and 92 RNA genes. One striking feature of this cyanobacterium is the apparent lack of either uptake or bi-directional hydrogenases typically expected within a diazotroph. Additionally, a large genomic island is found that contains numerous low GC-content genes and genes related to extracellular polysaccharide production and cell wall synthesis and maintenance.
Everroad, R. Craig; Stuart, Rhona K.; Bebout, Brad M.; ...
2016-08-24
The nonheterocystous filamentous cyanobacterium, strain ESFC-1, is a recently described member of the order Oscillatoriales within the Cyanobacteria. ESFC-1 has been shown to be a major diazotroph in the intertidal microbial mat system at Elkhorn Slough, CA, USA. Based on phylogenetic analyses of the 16S RNA gene, ESFC-1 appears to belong to a unique, genus-level divergence; the draft genome sequence of this strain has now been determined. Here we report features of this genome as they relate to the ecological functions and capabilities of strain ESFC-1. The 5,632,035 bp genome sequence encodes 4914 protein-coding genes and 92 RNA genes. One striking feature of this cyanobacterium is the apparent lack of either uptake or bi-directional hydrogenases typically expected within a diazotroph. In addition, a large genomic island is found that contains numerous low GC-content genes and genes related to extracellular polysaccharide production and cell wall synthesis and maintenance.
The COG database: an updated version includes eukaryotes
Tatusov, Roman L; Fedorova, Natalie D; Jackson, John D; Jacobs, Aviva R; Kiryutin, Boris; Koonin, Eugene V; Krylov, Dmitri M; Mazumder, Raja; Mekhedov, Sergei L; Nikolskaya, Anastasia N; Rao, B Sridhar; Smirnov, Sergei; Sverdlov, Alexander V; Vasudevan, Sona; Wolf, Yuri I; Yin, Jodie J; Natale, Darren A
2003-01-01
Background The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies. Results We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. 
This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes. Conclusion The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies. PMID:12969510
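The coverage percentages quoted in the abstract above follow directly from the protein counts it reports. A minimal check, using only those counts (the helper name is hypothetical):

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counts taken from the abstract.
cog_coverage = percent(138_458, 185_505)  # proteins in COGs / proteins in 66 genomes
kog_coverage = percent(59_838, 110_655)   # proteins in KOGs / eukaryotic gene products

print(f"COG coverage: {cog_coverage:.1f}%")
print(f"KOG coverage: {kog_coverage:.1f}%")
```

Rounded to whole percentages, these reproduce the 75% and ~54% figures given in the text.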
proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.
Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer
2017-01-04
The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Low-pass sequencing for microbial comparative genomics
Goo, Young Ah; Roach, Jared; Glusman, Gustavo; Baliga, Nitin S; Deutsch, Kerry; Pan, Min; Kennedy, Sean; DasSarma, Shiladitya; Victor Ng, Wailap; Hood, Leroy
2004-01-01
Background We studied four extremely halophilic archaea by low-pass shotgun sequencing: (1) the metabolically versatile Haloarcula marismortui; (2) the non-pigmented Natrialba asiatica; (3) the psychrophile Halorubrum lacusprofundi and (4) the Dead Sea isolate Halobaculum gomorrense. Approximately one thousand single pass genomic sequences per genome were obtained. The data were analyzed by comparative genomic analyses using the completed Halobacterium sp. NRC-1 genome as a reference. Low-pass shotgun sequencing is a simple, inexpensive, and rapid approach that can readily be performed on any cultured microbe. Results As expected, the four archaeal halophiles analyzed exhibit both bacterial and eukaryotic characteristics as well as uniquely archaeal traits. All five halophiles exhibit greater than sixty percent GC content and low isoelectric points (pI) for their predicted proteins. Multiple insertion sequence (IS) elements, often involved in genome rearrangements, were identified in H. lacusprofundi and H. marismortui. The core biological functions that govern cellular and genetic mechanisms of H. sp. NRC-1 appear to be conserved in these four other halophiles. Multiple TATA box binding protein (TBP) and transcription factor IIB (TFB) homologs were identified from most of the four shotgunned halophiles. The reconstructed molecular tree of all five halophiles shows a large divergence between these species, but with the closest relationship being between H. sp. NRC-1 and H. lacusprofundi. Conclusion Despite the diverse habitats of these species, all five halophiles share (1) high GC content and (2) low protein isoelectric points, which are characteristics associated with environmental exposure to UV radiation and hypersalinity, respectively. Identification of multiple IS elements in the genomes of H. lacusprofundi and H. marismortui suggests that genome structure and dynamic genome reorganization might be similar to that previously observed in the IS-element rich genome of H. sp. NRC-1. Identification of multiple TBP and TFB homologs in these four halophiles is consistent with the hypothesis that different types of complex transcriptional regulation may occur through multiple TBP-TFB combinations in response to rapidly changing environmental conditions. Low-pass shotgun sequence analyses of genomes permit extensive and diverse analyses, and should be generally useful for comparative microbial genomics. PMID:14718067
DOE Office of Scientific and Technical Information (OSTI.GOV)
Angelova, Angelina; Park, Sang-Hycuk; Kyndt, John
2013-09-01
With the increasing world demand for biofuel, a number of oleaginous algal species are being considered as renewable sources of oil. Chlorella protothecoides Krüger synthesizes triacylglycerols (TAGs) as storage compounds that can be converted into renewable fuel utilizing an anabolic pathway that is poorly understood. The paucity of algal chloroplast genome sequences has been an important constraint to chloroplast transformation and for studying gene expression in TAGs pathways. In this study, the intact chloroplasts were released from algal cells using sonication followed by sucrose gradient centrifugation, resulting in a 2.36-fold enrichment of chloroplasts from C. protothecoides, based on qPCR analysis. The C. protothecoides chloroplast genome (cpDNA) was determined using the Illumina HiSeq 2000 sequencing platform and found to be 84,576 Kb (8.57 Kb) in size, with a GC content of 30.8%. This is the first report of an optimized protocol that uses a sonication step, followed by sucrose gradient centrifugation, to release and enrich intact chloroplasts from a microalga (C. protothecoides) of sufficient quality to permit chloroplast genome sequencing with high coverage, while minimizing nuclear genome contamination. The approach is expected to guide chloroplast isolation from other oleaginous algal species for a variety of uses that benefit from enrichment of chloroplasts, ranging from biochemical analysis to genomics studies.
Re-sequencing and genetic variation identification of a rice line with ideal plant architecture.
Li, Shuangcheng; Xie, Kailong; Li, Wenbo; Zou, Ting; Ren, Yun; Wang, Shiquan; Deng, Qiming; Zheng, Aiping; Zhu, Jun; Liu, Huainian; Wang, Lingxia; Ai, Peng; Gao, Fengyan; Huang, Bin; Cao, Xuemei; Li, Ping
2012-12-01
The ideal plant architecture (IPA) includes several important characteristics such as low tiller numbers, few or no unproductive tillers, more grains per panicle, and thick and sturdy stems. We have developed an indica restorer line 7302R that displays the IPA phenotype in terms of tiller number, grain number, and stem strength. However, the underlying mechanism remained to be clarified. We performed re-sequencing and genome-wide variation analysis of 7302R using the Solexa sequencing technology. With the genomic sequence of the indica cultivar 9311 as reference, 307 627 SNPs, 57 372 InDels, and 3 096 SVs were identified in the 7302R genome. The 7302R-specific variations were investigated via the synteny analysis of all the SNPs of 7302R with those of the previously sequenced non-IPA-type lines IR24, MH63, and SH527. Moreover, we found 178 168 7302R-specific SNPs across the whole genome and 30 239 SNPs in the predicted mRNA regions, among which 8 517 were Non-syn CDS. In addition, 263 large-effect SNPs that were expected to affect the integrity of encoded proteins were identified from the 7302R-specific SNPs. SNPs of several important previously cloned rice genes were also identified by aligning the 7302R sequence with other sequenced lines. Our results provide several candidates that may account for the IPA phenotype of 7302R. These results therefore lay the groundwork for long-term efforts to uncover important genes and alleles for rice plant architecture construction, and also offer useful data resources for future genetic and genomic studies in rice.
Chloroplast DNA Structural Variation, Phylogeny, and Age of Divergence among Diploid Cotton Species.
Chen, Zhiwen; Feng, Kun; Grover, Corrinne E; Li, Pengbo; Liu, Fang; Wang, Yumei; Xu, Qin; Shang, Mingzhao; Zhou, Zhongli; Cai, Xiaoyan; Wang, Xingxing; Wendel, Jonathan F; Wang, Kunbo; Hua, Jinping
2016-01-01
The cotton genus (Gossypium spp.) contains 8 monophyletic diploid genome groups (A, B, C, D, E, F, G, K) and a single allotetraploid clade (AD). To gain insight into the phylogeny of Gossypium and molecular evolution of the chloroplast genome in this group, we performed a comparative analysis of 19 Gossypium chloroplast genomes, six reported here for the first time. Nucleotide distance in non-coding regions was about three times that of coding regions. As expected, distances were smaller within than among genome groups. Phylogenetic topologies based on nucleotide and indel data support the resolution of the 8 genome groups into 6 clades. Phylogenetic analysis of indel distribution among the 19 genomes demonstrates contrasting evolutionary dynamics in different clades, with a parallel genome downsizing in two genome groups and a biased accumulation of insertions in the clade containing the cultivated cottons leading to large (for Gossypium) chloroplast genomes. Divergence time estimates derived from the cpDNA sequence suggest that the major diploid clades had diverged approximately 10 to 11 million years ago. The complete nucleotide sequences of 6 cpDNA genomes are provided, offering a resource for cytonuclear studies in Gossypium.
Chloroplast DNA Structural Variation, Phylogeny, and Age of Divergence among Diploid Cotton Species
Li, Pengbo; Liu, Fang; Wang, Yumei; Xu, Qin; Shang, Mingzhao; Zhou, Zhongli; Cai, Xiaoyan; Wang, Xingxing; Wendel, Jonathan F.; Wang, Kunbo
2016-01-01
The cotton genus (Gossypium spp.) contains 8 monophyletic diploid genome groups (A, B, C, D, E, F, G, K) and a single allotetraploid clade (AD). To gain insight into the phylogeny of Gossypium and molecular evolution of the chloroplast genome in this group, we performed a comparative analysis of 19 Gossypium chloroplast genomes, six reported here for the first time. Nucleotide distance in non-coding regions was about three times that of coding regions. As expected, distances were smaller within than among genome groups. Phylogenetic topologies based on nucleotide and indel data support the resolution of the 8 genome groups into 6 clades. Phylogenetic analysis of indel distribution among the 19 genomes demonstrates contrasting evolutionary dynamics in different clades, with a parallel genome downsizing in two genome groups and a biased accumulation of insertions in the clade containing the cultivated cottons leading to large (for Gossypium) chloroplast genomes. Divergence time estimates derived from the cpDNA sequence suggest that the major diploid clades had diverged approximately 10 to 11 million years ago. The complete nucleotide sequences of 6 cpDNA genomes are provided, offering a resource for cytonuclear studies in Gossypium. PMID:27309527
Extensive Concerted Evolution of Rice Paralogs and the Road to Regaining Independence
Wang, Xiyin; Tang, Haibao; Bowers, John E.; Feltus, Frank A.; Paterson, Andrew H.
2007-01-01
Many genes duplicated by whole-genome duplications (WGDs) are more similar to one another than expected. We investigated whether concerted evolution through conversion and crossing over, well-known to affect tandem gene clusters, also affects dispersed paralogs. Genome sequences for two Oryza subspecies reveal appreciable gene conversion in the ∼0.4 MY since their divergence, with a gradual progression toward independent evolution of older paralogs. Since divergence from subspecies indica, ∼8% of japonica paralogs produced 5–7 MYA on chromosomes 11 and 12 have been affected by gene conversion and several reciprocal exchanges of chromosomal segments, while ∼70-MY-old “paleologs” resulting from a genome duplication (GD) show much less conversion. Sequence similarity analysis in proximal gene clusters also suggests more conversion between younger paralogs. About 8% of paleologs may have been converted since rice–sorghum divergence ∼41 MYA. Domain-encoding sequences are more frequently converted than nondomain sequences, suggesting a sort of circularity—that sequences conserved by selection may be further conserved by relatively frequent conversion. The higher level of concerted evolution in the 5–7 MY-old segmental duplication may reflect the behavior of many genomes within the first few million years after duplication or polyploidization. PMID:18039882
Multiple hybrid de novo genome assembly of finger millet, an orphan allotetraploid crop.
Hatakeyama, Masaomi; Aluri, Sirisha; Balachadran, Mathi Thumilan; Sivarajan, Sajeevan Radha; Patrignani, Andrea; Grüter, Simon; Poveda, Lucy; Shimizu-Inatsugi, Rie; Baeten, John; Francoijs, Kees-Jan; Nataraja, Karaba N; Reddy, Yellodu A Nanja; Phadnis, Shamprasad; Ravikumar, Ramapura L; Schlapbach, Ralph; Sreeman, Sheshshayee M; Shimizu, Kentaro K
2017-09-05
Finger millet (Eleusine coracana (L.) Gaertn) is an important crop for food security because of its tolerance to drought, which is expected to be exacerbated by global climate changes. Nevertheless, it is often classified as an orphan/underutilized crop because of the paucity of scientific attention. Among several small millets, finger millet is considered as an excellent source of essential nutrient elements, such as iron and zinc; hence, it has potential as an alternate coarse cereal. However, high-quality genome sequence data of finger millet are currently not available. One of the major problems encountered in the genome assembly of this species was its polyploidy, which hampers genome assembly compared with a diploid genome. To overcome this problem, we sequenced its genome using diverse technologies with sufficient coverage and assembled it via a novel multiple hybrid assembly workflow that combines next-generation with single-molecule sequencing, followed by whole-genome optical mapping using the Bionano Irys® system. The total number of scaffolds was 1,897 with an N50 length >2.6 Mb and detection of 96% of the universal single-copy orthologs. The majority of the homeologs were assembled separately. This indicates that the proposed workflow is applicable to the assembly of other allotetraploid genomes. © The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
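The abstract above reports assembly contiguity as an N50 length. For readers unfamiliar with the metric, the sketch below shows the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly. The scaffold lengths in the example are made-up toy values, not data from the finger millet assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Toy example (hypothetical lengths): total is 300, and the two longest
# scaffolds (80 + 70 = 150) already reach half of it.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A larger N50 for the same total length means the assembly is concentrated in fewer, longer scaffolds, which is why it is commonly reported alongside scaffold count and ortholog completeness.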
Analysis of MHC class I genes across horse MHC haplotypes
Tallmadge, Rebecca L.; Campbell, Julie A.; Miller, Donald C.; Antczak, Douglas F.
2010-01-01
The genomic sequences of 15 horse Major Histocompatibility Complex (MHC) class I genes and a collection of MHC class I homozygous horses of five different haplotypes were used to investigate the genomic structure and polymorphism of the equine MHC. A combination of conserved and locus-specific primers was used to amplify horse MHC class I genes with classical and non-classical characteristics. Multiple clones from each haplotype identified three to five classical sequences per homozygous animal, and two to three non-classical sequences. Phylogenetic analysis was applied to these sequences and groups were identified which appear to be allelic series, but some sequences were left ungrouped. Sequences determined from MHC class I heterozygous horses and previously described MHC class I sequences were then added, representing a total of ten horse MHC haplotypes. These results were consistent with those obtained from the MHC homozygous horses alone, and 30 classical sequences were assigned to four previously confirmed loci and three new provisional loci. The non-classical genes had few alleles and the classical genes had higher levels of allelic polymorphism. Alleles for two classical loci with the expected pattern of polymorphism were found in the majority of haplotypes tested, but alleles at two other commonly detected loci had more variation outside of the hypervariable region than within. Our data indicate that the equine Major Histocompatibility Complex is characterized by variation in the complement of class I genes expressed in different haplotypes in addition to the expected allelic polymorphism within loci. PMID:20099063
Keel, B N; Nonneman, D J; Rohrer, G A
2017-08-01
Genetic variants detected from sequence have been used to successfully identify causal variants and map complex traits in several organisms. High and moderate impact variants, those expected to alter or disrupt the protein coded by a gene and those that regulate protein production, likely have a more significant effect on phenotypic variation than do other types of genetic variants. Hence, a comprehensive list of these functional variants would be of considerable interest in swine genomic studies, particularly those targeting fertility and production traits. Whole-genome sequence was obtained from 72 of the founders of an intensely phenotyped experimental swine herd at the U.S. Meat Animal Research Center (USMARC). These animals included all 24 of the founding boars (12 Duroc and 12 Landrace) and 48 Yorkshire-Landrace composite sows. Sequence reads were mapped to the Sscrofa10.2 genome build, resulting in a mean of 6.1 fold (×) coverage per genome. A total of 22 342 915 high confidence SNPs were identified from the sequenced genomes. These included 21 million previously reported SNPs and 79% of the 62 163 SNPs on the PorcineSNP60 BeadChip assay. Variation was detected in the coding sequence or untranslated regions (UTRs) of 87.8% of the genes in the porcine genome: loss-of-function variants were predicted in 504 genes, 10 202 genes contained nonsynonymous variants, 10 773 had variation in UTRs and 13 010 genes contained synonymous variants. Approximately 139 000 SNPs were classified as loss-of-function, nonsynonymous or regulatory, which suggests that over 99% of the variation detected in our pigs could potentially be ignored, allowing us to focus on a much smaller number of functional SNPs during future analyses. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
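The "over 99%" figure follows directly from the counts given in the abstract. The short calculation below is only an arithmetic check. It treats "approximately 139 000" as exactly 139,000, which is an assumption, and the variable names are illustrative.

```python
# Counts quoted in the abstract.
total_snps = 22_342_915          # high-confidence SNPs
functional_snps = 139_000        # loss-of-function, nonsynonymous or regulatory (approximate)

functional_fraction = functional_snps / total_snps   # about 0.0062
ignorable_fraction = 1 - functional_fraction         # about 0.9938

# 79% of the 62 163 SNPs on the PorcineSNP60 BeadChip were among those detected.
chip_snps_detected = round(0.79 * 62_163)            # roughly 49,000 SNPs

print(f"functional: {functional_fraction:.2%}, remainder: {ignorable_fraction:.2%}")
```

Under these assumptions the functional class is well under 1% of the detected variation, consistent with the statement in the abstract.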
Shortt, Jonathan A; Card, Daren C; Schield, Drew R; Liu, Yang; Zhong, Bo; Castoe, Todd A; Carlton, Elizabeth J; Pollock, David D
2017-01-01
In areas where schistosomiasis control programs have been implemented, morbidity and prevalence have been greatly reduced. However, to sustain these reductions and move towards interruption of transmission, new tools for disease surveillance are needed. Genomic methods have the potential to help trace the sources of new infections, and allow us to monitor drug resistance. Large-scale genotyping efforts for schistosome species have been hindered by cost, limited numbers of established target loci, and the small amount of DNA obtained from miracidia, the life stage most readily acquired from humans. Here, we present a method using next generation sequencing to provide high-resolution genomic data from S. japonicum for population-based studies. We applied whole genome amplification followed by double digest restriction site associated DNA sequencing (ddRADseq) to individual S. japonicum miracidia preserved on Whatman FTA cards. We found that we could effectively and consistently survey hundreds of thousands of variants from 10,000 to 30,000 loci from archived miracidia as old as six years. An analysis of variation from eight miracidia obtained from three hosts in two villages in Sichuan showed clear population structuring by village and host even within this limited sample. This high-resolution sequencing approach yields three orders of magnitude more information than microsatellite genotyping methods that have been employed over the last decade, creating the potential to answer detailed questions about the sources of human infections and to monitor drug resistance. Costs per sample range from $50-$200, depending on the amount of sequence information desired, and we expect these costs can be reduced further given continued reductions in sequencing costs, improvement of protocols, and parallelization. This approach provides new promise for applying modern genome-scale sampling to S. japonicum surveillance, and could be applied to other schistosome species and other parasitic helminths.
Moon, Suyun; Lee, Hwa-Yong; Shim, Donghwan; Kim, Myungkil; Ka, Kang-Hyeon; Ryoo, Rhim; Ko, Han-Gyu; Koo, Chang-Duck; Chung, Jong-Wook; Ryu, Hojin
2017-06-01
Sixteen genomic DNA simple sequence repeat (SSR) markers of Lentinula edodes were developed from 205 SSR motifs present in 46.1-Mb long L. edodes genome sequences. The number of alleles ranged from 3-14 and the major allele frequency was distributed from 0.17-0.96. The values of observed and expected heterozygosity ranged from 0.00-0.76 and 0.07-0.90, respectively. The polymorphic information content value ranged from 0.07-0.89. A dendrogram, based on 16 SSR markers clustered by the paired hierarchical clustering method, showed that 33 shiitake cultivars could be divided into three major groups and successfully identified. These SSR markers will contribute to the efficient breeding of this species by providing diversity in shiitake varieties. Furthermore, the genomic information covered by the markers can provide a valuable resource for genetic linkage map construction, molecular mapping, and marker-assisted selection in the shiitake mushroom.
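The marker statistics reported above are functions of allele frequencies. The sketch below uses the standard textbook definitions of expected heterozygosity (one minus the sum of squared allele frequencies) and of polymorphic information content (Botstein's formula). The abstract does not state which estimators or sample-size corrections the authors used, so this is an assumption, and the allele frequencies are hypothetical.

```python
from itertools import combinations


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), with no sample-size correction."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical marker with three alleles; the major allele frequency is 0.5.
toy_freqs = [0.5, 0.3, 0.2]
toy_he = expected_heterozygosity(toy_freqs)   # 1 - 0.38 = 0.62
toy_pic = pic(toy_freqs)                      # 0.62 - 0.0722 = 0.5478
```

PIC is never larger than expected heterozygosity for the same frequencies, which matches the slightly lower upper bound reported for PIC (0.89) than for expected heterozygosity (0.90).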
Protocol matters: which methylome are you actually studying?
Robinson, Mark D; Statham, Aaron L; Speed, Terence P; Clark, Susan J
2011-01-01
The field of epigenetics is now capitalizing on the vast number of emerging technologies, largely based on second-generation sequencing, which interrogate DNA methylation status and histone modifications genome-wide. However, getting an exhaustive and unbiased view of a methylome at a reasonable cost is proving to be a significant challenge. In this article, we take a closer look at the impact of the DNA sequence and bias effects introduced to datasets by genome-wide DNA methylation technologies and, where possible, explore the bioinformatics tools that deconvolve them. There remains much to be learned about the performance of genome-wide technologies, the data we mine from these assays and how they reflect the actual biology. While there are several methods to interrogate the DNA methylation status genome-wide, our opinion is that no single technique suitably covers the minimum criteria of high coverage and high resolution at a reasonable cost. In fact, the fraction of the methylome that is studied currently depends entirely on the inherent biases of the protocol employed. There is promise for this to change, as the third generation of sequencing technologies is expected to again ‘revolutionize’ the way that we study genomes and epigenomes. PMID:21566704
Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation
Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.
2013-01-01
The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392
Rapid identification of sequences for orphan enzymes to power accurate protein annotation.
Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil; Watson, Douglas S; Bomar, Martha G; Galande, Amit K; Shearer, Alexander G
2013-01-01
The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology--"orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.
Atlas2 Cloud: a framework for personal genome analysis in the cloud
2012-01-01
Background Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation remains a serious bottleneck. To this end, the cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues. Results We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set. Conclusions We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms. PMID:23134663
Atlas2 Cloud: a framework for personal genome analysis in the cloud.
Evani, Uday S; Challis, Danny; Yu, Jin; Jackson, Andrew R; Paithankar, Sameer; Bainbridge, Matthew N; Jakkamsetti, Adinarayana; Pham, Peter; Coarfa, Cristian; Milosavljevic, Aleksandar; Yu, Fuli
2012-01-01
Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation remains a serious bottleneck. To this end, the cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues. We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set. We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
iDoComp: a compression scheme for assembled genomes
Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy
2015-01-01
Motivation: With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human genome has dropped to a mere $4000. Thus we are approaching a milestone in the history of sequencing, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is a crucial need for genome compression that is guaranteed to perform well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references. Results: We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H.sapiens data, when comparing with the best compression performance among the previously proposed algorithms. Availability: iDoComp is written in C and can be downloaded from: http://www.stanford.edu/~iochoa/iDoComp.html (We also provide a full explanation on how to run the program and an example with all the necessary files to run it.). Contact: iochoa@stanford.edu Supplementary information: Supplementary Data are available at Bioinformatics online. PMID:25344501
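The general idea of reference-based compression, where both the compressor and the decompressor hold the same reference, can be shown with a deliberately simplified sketch. This is not the iDoComp algorithm: it assumes the target and reference have equal length and differ only by substitutions, and the function names and sequences are hypothetical.

```python
def compress(reference, target):
    """Store only the positions where the target differs from the reference."""
    if len(reference) != len(target):
        raise ValueError("this toy scheme handles substitutions only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def decompress(reference, edits):
    """Rebuild the target by applying the stored substitutions to the reference."""
    bases = list(reference)
    for position, base in edits:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"
target = "ACGTTCGTAA"
edits = compress(reference, target)      # two substitutions: (4, "T") and (9, "A")
restored = decompress(reference, edits)  # identical to target
```

Because individuals of one species differ from the reference at a small fraction of positions, the edit list is far shorter than the genome itself. A practical scheme would also need to handle insertions, deletions and rearrangements.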
HLA Diversity in the 1000 Genomes Dataset
Gourraud, Pierre-Antoine; Khankhanian, Pouya; Cereb, Nezih; Yang, Soo Young; Feolo, Michael; Maiers, Martin; D. Rioux, John; Hauser, Stephen; Oksenberg, Jorge
2014-01-01
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define HLA alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the HLA-A, HLA-B, HLA-C, HLA-DRB1 and HLA-DQB1 genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the HLA phenotype can identify the major ancestry lineage, informed mainly by the most frequent HLA haplotypes. To some extent, regions of the genome with similar genetic or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of HLA-specific MHC variants, available on the dbMHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies. PMID:24988075
Phylogenomics databases for facilitating functional genomics in rice.
Jung, Ki-Hong; Cao, Peijian; Sharma, Rita; Jain, Rashmi; Ronald, Pamela C
2015-12-01
The completion of the whole genome sequence of rice (Oryza sativa) has significantly accelerated functional genomics studies. Prior to the release of the sequence, only a few genes were assigned a function each year. Since sequencing was completed in 2005, the rate has increased exponentially. As of 2014, 1,021 genes have been described and added to the collection at The Overview of functionally characterized Genes in Rice online database (OGRO). Despite this progress, that number is still very low compared with the total number of genes estimated in the rice genome. One limitation to progress is the presence of functional redundancy among members of the same rice gene family, which covers 51.6% of all non-transposable element-encoding genes. There remains a significant portion of rice genes that are not functionally redundant, as reflected in the recovery of loss-of-function mutants. To more accurately analyze functional redundancy in the rice genome, we have developed phylogenomics databases for six large gene families in rice, including those for glycosyltransferases, glycoside hydrolases, kinases, transcription factors, transporters, and cytochrome P450 monooxygenases. In this review, we introduce key features and applications of these databases. We expect that they will serve as a very useful guide in the post-genomics era of research.
Genomic Tools in Groundnut Breeding Program: Status and Perspectives
Janila, P.; Variath, Murali T.; Pandey, Manish K.; Desmae, Haile; Motagi, Babu N.; Okori, Patrick; Manohar, Surendra S.; Rathnakumar, A. L.; Radhakrishnan, T.; Liao, Boshou; Varshney, Rajeev K.
2016-01-01
Groundnut, a nutrient-rich food legume, is cultivated the world over. It is valued for its good quality cooking oil, energy and protein rich food, and nutrient-rich fodder. Globally, groundnut improvement programs have developed varieties to meet the preferences of farmers, traders, processors, and consumers. Enhanced yield, tolerance to biotic and abiotic stresses and quality parameters have been the target traits. A spurt in genetic information on groundnut was facilitated by the development of molecular markers and of genetic and physical maps, the generation of expressed sequence tags (EST), the discovery of genes, and the identification of quantitative trait loci (QTL) for some important biotic and abiotic stresses and quality traits. The first groundnut variety developed using marker assisted breeding (MAB) was registered in 2003. Since then, USA, China, Japan, and India have begun to use genomic tools in routine groundnut improvement programs. Introgression lines that combine foliar fungal disease resistance and early maturity were developed using MAB. Establishment of marker-trait associations (MTA) paved the way to integrate genomic tools in groundnut breeding for accelerated genetic gain. Genomic Selection (GS) tools are employed to improve drought tolerance and pod yield, which are governed by several minor effect QTLs. The draft genome sequence and low cost genotyping tools such as genotyping by sequencing (GBS) are expected to accelerate the use of genomic tools to enhance genetic gains for target traits in groundnut. PMID:27014312
Lucas Lledó, José Ignacio; Cáceres, Mario
2013-01-01
One of the most used techniques to study structural variation at a genome level is paired-end mapping (PEM). PEM has the advantage of being able to detect balanced events, such as inversions and translocations. However, inversions are still quite difficult to predict reliably, especially from high-throughput sequencing data. We simulated realistic PEM experiments with different combinations of read and library fragment lengths, including sequencing errors and meaningful base-qualities, to quantify and track down the origin of false positives and negatives along sequencing, mapping, and downstream analysis. We show that PEM is very appropriate to detect a wide range of inversions, even with low coverage data. However, % of inversions located between segmental duplications are expected to go undetected by the most common sequencing strategies. In general, longer DNA libraries improve the detectability of inversions far better than increments of the coverage depth or the read length. Finally, we review the performance of three algorithms to detect inversions —SVDetect, GRIAL, and VariationHunter—, identify common pitfalls, and reveal important differences in their breakpoint precisions. These results stress the importance of the sequencing strategy for the detection of structural variants, especially inversions, and offer guidelines for the design of future genome sequencing projects. PMID:23637806
The complete mitochondrial genome of the Aluterus monoceros.
Li, Wenshen; Zhang, Guoqing; Wen, Xin; Wang, Qian; Chen, Guohua
2016-07-01
The complete mitochondrial genome of Aluterus monoceros (A. monoceros) has been sequenced. The mitochondrial genome of A. monoceros is 16,429 bp in length, consisting of 22 tRNA genes, 2 rRNA genes, 13 protein-coding genes and a D-loop region (GenBank accession number KP637022). The A + T content of the mitochondrial genome is 63.25% (33.16% A and 30.09% T), and the C content is 20.74%. Twelve protein-coding genes start with the standard ATG initiation codon, except for COXI, which begins with GTG. Some termination codons are incomplete (T or TA), whereas ND1, COXI, ATP8, ND4L1, ND5 and ND6 stop with TAA. Phylogenetic trees constructed from the entire mitochondrial genome sequences of 14 Tetraodontiformes species suggest that A. monoceros is most closely related to Acreichthys tomentosus and Monacanthus chinensis, and they constitute a sister group.
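The base-composition figures can be cross-checked with simple arithmetic. The sketch below assumes that the four base percentages sum to 100% with no ambiguous bases, which the abstract does not state, so the G value is an inference rather than a reported result.

```python
# Percentages quoted in the abstract.
pct_a = 33.16
pct_t = 30.09
pct_c = 20.74

pct_at = pct_a + pct_t            # should reproduce the quoted A + T content of 63.25%
pct_g = 100.0 - pct_at - pct_c    # implied G content, about 16.01% (assumption: A+T+C+G = 100%)

print(f"A+T = {pct_at:.2f}%, implied G = {pct_g:.2f}%")
```

The A and T values add up to the stated A + T content, and the implied G share is the lowest of the four bases under this assumption.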
Pettengill, James B; Pightling, Arthur W; Baugher, Joseph D; Rand, Hugh; Strain, Errol
2016-01-01
The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne Salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genetic distances among a set of samples. Being able to do so is challenging due to both biological (evolutionary diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). When analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates) there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
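The point about k-mer distances treating absent data as informative can be illustrated with a minimal sketch of a Jaccard distance on k-mer sets. The function names, the tiny value of k and the sequences are hypothetical. The study's actual profiles and tools (for example Mash) work on much larger k and sketch the k-mer sets rather than comparing them exhaustively.

```python
def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=3):
    """1 - |A ∩ B| / |A ∪ B| computed on the two k-mer sets."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return 1.0 - len(set_a & set_b) / len(union)


full = "ACGTAC"
variant = "ACGTTT"      # differs from `full` at the last two bases
truncated = "ACGT"      # same bases as `full`, but the end is missing

d_variant = jaccard_distance(full, variant)      # 2 shared of 6 k-mers: about 0.667
d_truncated = jaccard_distance(full, truncated)  # 2 shared of 4 k-mers: 0.5
```

The truncated sequence has no nucleotide differences from the full one, yet its k-mer distance is far from zero. This is the behaviour that makes incomplete assemblies or contamination distort k-mer based distances relative to site-based ones.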
Satta, G; Lipman, M; Smith, G P; Arnold, C; Kon, O M; McHugh, T D
2018-06-01
Nearly two decades after completion of the genome sequence of Mycobacterium tuberculosis (MTB), and with the advent of next generation sequencing technologies (NGS), whole-genome sequencing (WGS) has been applied to a wide range of clinical scenarios. Starting in 2017, England is the first country in the world to pioneer its use on a national scale for the diagnosis of tuberculosis, detection of drug resistance, and typing of MTB. This narrative review critically analyses the current applications of WGS for MTB and explains how close we are to realizing its full potential as a diagnostic, epidemiologic, and research tool. We searched for reports (both original articles and reviews) published in English up to 31 May 2017, with combinations of the following keywords: whole-genome sequencing, Mycobacterium, and tuberculosis. MEDLINE, Embase, and Scopus were used as search engines. We included articles that covered different aspects of whole-genome sequencing in relation to MTB. This review focuses on three main themes: the role of WGS for the prediction of drug susceptibility, MTB outbreak investigation and genetic diversity, and research applications of NGS. Many of the original expectations have been accomplished, and we believe that with its unprecedented sensitivity and power, WGS has the potential to address many unanswered questions in the near future. However, caution is still needed when interpreting WGS data as there are some important limitations to be aware of, from correct interpretation of drug susceptibilities to the bioinformatic support needed. Copyright © 2017 European Society of Clinical Microbiology and Infectious Diseases. Published by Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Golbus, Jessica R.; Puckelwartz, Megan J.; Dellefave-Castillo, Lisa
Background: Cardiomyopathy is highly heritable but genetically diverse. At present, genetic testing for cardiomyopathy uses targeted sequencing to simultaneously assess the coding regions of more than 50 genes. New genes are routinely added to panels to improve the diagnostic yield. With the anticipated $1000 genome, it is expected that genetic testing will shift towards comprehensive genome sequencing accompanied by targeted gene analysis. Therefore, we assessed the reliability of whole genome sequencing and targeted analysis to identify cardiomyopathy variants in 11 subjects with cardiomyopathy. Methods and Results: Whole genome sequencing with an average of 37× coverage was combined with targeted analysis focused on 204 genes linked to cardiomyopathy. Genetic variants were scored using multiple prediction algorithms combined with frequency data from public databases. This pipeline yielded 1-14 potentially pathogenic variants per individual. Variants were further analyzed using clinical criteria and/or segregation analysis. Three of three previously identified primary mutations were detected by this analysis. In six subjects for whom the primary mutation was previously unknown, we identified mutations that segregated with disease, had clinical correlates, and/or had additional pathological correlation to provide evidence for causality. For two subjects with previously known primary mutations, we identified additional variants that may act as modifiers of disease severity. In total, we identified the likely pathological mutation in 9 of 11 (82%) subjects. We conclude that these pilot data demonstrate that ~30-40× coverage whole genome sequencing combined with targeted analysis is feasible and sensitive to identify rare variants in cardiomyopathy-associated genes.
Identification of copy number variants in whole-genome data using Reference Coverage Profiles
Glusman, Gustavo; Severson, Alissa; Dhankani, Varsha; Robinson, Max; Farrah, Terry; Mauldin, Denise E.; Stittrich, Anna B.; Ament, Seth A.; Roach, Jared C.; Brunkow, Mary E.; Bodian, Dale L.; Vockley, Joseph G.; Shmulevich, Ilya; Niederhuber, John E.; Hood, Leroy
2015-01-01
The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation. PMID:25741365
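The normalization idea in the abstract above can be illustrated with a minimal sketch. It divides a sample's per-bin depth by a reference profile, rescales by the median ratio, and reports runs of bins whose normalized ratio sits near 0.5, as expected for a hemizygous deletion. Note the substitution: this uses plain run-length thresholding, not the HMM segmentation the authors describe. The function name `call_deletions`, the thresholds, and the example depths are all hypothetical.

```python
from statistics import median


def call_deletions(sample_depth, reference_profile, low=0.25, high=0.75, min_bins=3):
    """Return (start, end) half-open bin ranges that look like single-copy loss.

    Hypothetical helper: thresholds and bin layout are illustrative only.
    """
    # Per-bin ratio of observed depth to the reference (expected diploid) depth.
    raw = [s / r for s, r in zip(sample_depth, reference_profile)]
    # Rescale so that a typical (diploid) bin has ratio 1.0.
    scale = median(raw)
    ratios = [x / scale for x in raw]

    calls, start = [], None
    # A trailing sentinel of 1.0 closes any run that reaches the last bin.
    for i, ratio in enumerate(ratios + [1.0]):
        if low <= ratio <= high:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_bins:
                calls.append((start, i))
            start = None
    return calls


reference = [40, 38, 45, 42, 40, 39, 41, 44, 40, 37]
sample = [40, 38, 45, 21, 20, 19.5, 20.5, 44, 40, 37]
print(call_deletions(sample, reference))
```

Median rescaling is used here so that a large deletion does not shift the baseline as much as a mean-based scale would; a real caller would also need to handle bins with zero reference depth.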
Noorani, Ayesha; Lynch, Andy G.; Achilleos, Achilleas; Eldridge, Matthew; Bower, Lawrence; Weaver, Jamie M.J.; Crawte, Jason; Ong, Chin-Ann; Shannon, Nicholas; MacRae, Shona; Grehan, Nicola; Nutzinger, Barbara; O'Donovan, Maria; Hardwick, Richard; Tavaré, Simon; Fitzgerald, Rebecca C.
2017-01-01
The scientific community has avoided using tissue samples from patients that have been exposed to systemic chemotherapy to infer the genomic landscape of a given cancer. Esophageal adenocarcinoma is a heterogeneous, chemoresistant tumor for which the availability and size of pretreatment endoscopic samples are limiting. This study compares whole-genome sequencing data obtained from chemo-naive and chemo-treated samples. The quality of whole-genomic sequencing data is comparable across all samples regardless of chemotherapy status. Inclusion of samples collected post-chemotherapy increased the proportion of late-stage tumors. When comparing matched pre- and post-chemotherapy samples from 10 cases, the mutational signatures, copy number, and SNV mutational profiles reflect the expected heterogeneity in this disease. Analysis of SNVs in relation to allele-specific copy-number changes pinpoints the common ancestor to a point prior to chemotherapy. For cases in which pre- and post-chemotherapy samples do show substantial differences, the timing of the divergence is near-synchronous with endoreduplication. Comparison across a large prospective cohort (62 treatment-naive, 58 chemotherapy-treated samples) reveals no significant differences in the overall mutation rate, mutation signatures, specific recurrent point mutations, or copy-number events in respect to chemotherapy status. In conclusion, whole-genome sequencing of samples obtained following neoadjuvant chemotherapy is representative of the genomic landscape of esophageal adenocarcinoma. Excluding these samples reduces the material available for cataloging and introduces a bias toward the earlier stages of cancer. PMID:28465312
Gusev, A.; Shah, M. J.; Kenny, E. E.; Ramachandran, A.; Lowe, J. K.; Salit, J.; Lee, C. C.; Levandowsky, E. C.; Weaver, T. N.; Doan, Q. C.; Peckham, H. E.; McLaughlin, S. F.; Lyons, M. R.; Sheth, V. N.; Stoffel, M.; De La Vega, F. M.; Friedman, J. M.; Breslow, J. L.
2012-01-01
Whole-genome sequencing in an isolated population with few founders directly ascertains variants from the population bottleneck that may be rare elsewhere. In such populations, shared haplotypes allow imputation of variants in unsequenced samples without resorting to complex statistical methods as in studies of outbred cohorts. We focus on an isolated population cohort from the Pacific Island of Kosrae, Micronesia, where we previously collected SNP array and rich phenotype data for the majority of the population. We report identification of long regions with haplotypes co-inherited between pairs of individuals and methodology to leverage such shared genetic content for imputation. Our estimates show that sequencing as few as 40 personal genomes allows for inference in up to 60% of the 3000-person cohort at the average locus. We ascertained a pilot data set of whole-genome sequences from seven Kosraean individuals, with average 5× coverage. This assay identified 5,735,306 unique sites of which 1,212,831 were previously unknown. Additionally, these variants are unusually enriched for alleles that are rare in other populations when compared to geographic neighbors (published Korean genome SJK). We used the presence of shared haplotypes between the seven Kosraean individuals to estimate expected imputation accuracy of known and novel homozygous variants at 99.6% and 97.3%, respectively. This study presents whole-genome analysis of a homogeneous isolate population with emphasis on optimal rare variant inference. PMID:22135348
Phylogenomic analyses data of the avian phylogenomics project.
Jarvis, Erich D; Mirarab, Siavash; Aberer, Andre J; Li, Bo; Houde, Peter; Li, Cai; Ho, Simon Y W; Faircloth, Brant C; Nabholz, Benoit; Howard, Jason T; Suh, Alexander; Weber, Claudia C; da Fonseca, Rute R; Alfaro-Núñez, Alonzo; Narula, Nitish; Liu, Liang; Burt, Dave; Ellegren, Hans; Edwards, Scott V; Stamatakis, Alexandros; Mindell, David P; Cracraft, Joel; Braun, Edward L; Warnow, Tandy; Jun, Wang; Gilbert, M Thomas Pius; Zhang, Guojie
2015-01-01
Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Li, Teng; Yang, Jie; Li, Yinwan; Cui, Ying; Xie, Qiang; Bu, Wenjun; Hillis, David M
2016-10-19
The Rhyparochromidae, the largest family of Lygaeoidea, encompasses more than 1,850 described species, but no mitochondrial genome has been sequenced to date. Here we describe the first mitochondrial genome for Rhyparochromidae: a complete mitochondrial genome of Panaorus albomaculatus (Scott, 1874). This mitochondrial genome is comprised of 16,345 bp, and contains the expected 37 genes and control region. The majority of the control region is made up of a large tandem-repeat region, which has a novel pattern not previously observed in other insects. The tandem-repeats region of P. albomaculatus consists of 53 tandem duplications (including one partial repeat), which is the largest number of tandem repeats among all the known insect mitochondrial genomes. Slipped-strand mispairing during replication is likely to have generated this novel pattern of tandem repeats. Comparative analysis of tRNA gene families in sequenced Pentatomomorpha and Lygaeoidea species shows that the pattern of nucleotide conservation is markedly higher on the J-strand. Phylogenetic reconstruction based on mitochondrial genomes suggests that Rhyparochromidae is not the sister group to all the remaining Lygaeoidea, and supports the monophyly of Lygaeoidea.
Design and Characterization of a 52K SNP Chip for Goats
Tosser-Klopp, Gwenola; Bardou, Philippe; Bouchez, Olivier; Cabau, Cédric; Crooijmans, Richard; Dong, Yang; Donnadieu-Tonon, Cécile; Eggen, André; Heuven, Henri C. M.; Jamli, Saadiah; Jiken, Abdullah Johari; Klopp, Christophe; Lawley, Cynthia T.; McEwan, John; Martin, Patrice; Moreno, Carole R.; Mulsant, Philippe; Nabihoudine, Ibouniyamine; Pailhoux, Eric; Palhière, Isabelle; Rupp, Rachel; Sarry, Julien; Sayre, Brian L.; Tircazes, Aurélie; Jun Wang; Wang, Wen; Zhang, Wenguang
2014-01-01
The success of Genome Wide Association Studies in the discovery of sequence variation linked to complex traits in humans has increased interest in high throughput SNP genotyping assays in livestock species. Primary goals are QTL detection and genomic selection. The purpose here was design of a 50–60,000 SNP chip for goats. The success of a moderate density SNP assay depends on reliable bioinformatic SNP detection procedures, the technological success rate of the SNP design, even spacing of SNPs on the genome and selection of Minor Allele Frequencies (MAF) suitable to use in diverse breeds. Through the federation of three SNP discovery projects consolidated as the International Goat Genome Consortium, we have identified approximately twelve million high quality SNP variants in the goat genome stored in a database together with their biological and technical characteristics. These SNPs were identified within and between six breeds (meat, milk and mixed): Alpine, Boer, Creole, Katjang, Saanen and Savanna, comprising a total of 97 animals. Whole genome and Reduced Representation Library sequences were aligned on >10 kb scaffolds of the de novo goat genome assembly. The 60,000 selected SNPs, evenly spaced on the goat genome, were submitted for oligo manufacturing (Illumina, Inc) and published in dbSNP along with flanking sequences and map position on goat assemblies (i.e. scaffolds and pseudo-chromosomes), sheep genome V2 and cattle UMD3.1 assembly. Ten breeds were then used to validate the SNP content and 52,295 loci could be successfully genotyped and used to generate a final cluster file. The combined strategy of using mainly whole genome Next Generation Sequencing and mapping on a contig genome assembly, complemented with Illumina design tools proved to be efficient in producing this GoatSNP50 chip. Advances in use of molecular markers are expected to accelerate goat genomic studies in coming years. PMID:24465974
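Two of the design criteria named in the abstract above, even spacing along the genome and a suitable minor allele frequency (MAF), can be illustrated with a small sketch. It is not the consortium's selection procedure: the function `pick_evenly_spaced`, the fixed-window strategy, and the MAF cutoff are assumptions made purely for illustration.

```python
def pick_evenly_spaced(snps, window, min_maf=0.05):
    """Keep at most one SNP per fixed-size window, preferring the most central one.

    snps: iterable of (position, maf) tuples on a single chromosome.
    Hypothetical helper; window size and MAF cutoff are illustrative.
    """
    best = {}
    for pos, maf in snps:
        if maf < min_maf:
            continue  # too rare to be informative across breeds
        idx = pos // window
        centre = idx * window + window / 2
        # Replace the current pick only if this SNP is closer to the window centre.
        if idx not in best or abs(pos - centre) < abs(best[idx][0] - centre):
            best[idx] = (pos, maf)
    return [best[i] for i in sorted(best)]


candidates = [(100, 0.30), (450, 0.20), (520, 0.01), (1400, 0.40), (1900, 0.10)]
print(pick_evenly_spaced(candidates, window=1000))
```

Picking the SNP nearest each window centre is only one way to approximate even spacing; windows without any qualifying SNP are simply left empty in this sketch.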
Reference-Free Comparative Genomics of 174 Chloroplasts
Kua, Chai-Shian; Ruan, Jue; Harting, John; Ye, Cheng-Xi; Helmus, Matthew R.; Yu, Jun; Cannon, Charles H.
2012-01-01
Direct analysis of unassembled genomic data could greatly increase the power of short read DNA sequencing technologies and allow comparative genomics of organisms without a completed reference available. Here, we compare 174 chloroplasts by analyzing the taxonomic distribution of short kmers across genomes [1]. We then assemble de novo contigs centered on informative variation. The localized de novo contigs can be separated into two major classes: tip = unique to a single genome and group = shared by a subset of genomes. Prior to assembly, we found that ∼18% of the chloroplast was duplicated in the inverted repeat (IR) region across a four-fold difference in genome sizes, from a highly reduced parasitic orchid [2] to a massive algal chloroplast [3], including gnetophytes [4] and cycads [5]. The conservation of this ratio between single copy and duplicated sequence was basal among green plants, independent of photosynthesis and mechanism of genome size change, and different in gymnosperms and lower plants. Major lineages in the angiosperm clade differed in the pattern of shared kmers and de novo contigs. For example, parasitic plants demonstrated an expected accelerated overall rate of evolution, while the hemi-parasitic genomes contained a great deal more novel sequence than holo-parasitic plants, suggesting different mechanisms at different stages of genomic contraction. Additionally, the legumes are diverging more quickly and in different ways than other major families. Small duplicated fragments of the rrn23 genes were deeply conserved among seed plants, including among several species without the IR regions, indicating a crucial functional role of this duplication. 
Localized de novo assembly of informative kmers greatly reduces the complexity of large comparative analyses by confining the analysis to a small partition of data and genomes relevant to the specific question, allowing direct analysis of next-gen sequence data from previously unstudied genomes and rapid discovery of informative candidate regions. PMID:23185288
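The tip/group distinction in the abstract above is defined for assembled contigs, but the same bookkeeping can be sketched at the level of individual kmers. The example below records which genomes contain each kmer and labels it "tip" (one genome) or "group" (a proper subset of genomes). The "core" label for kmers present in every genome, the function name `kmer_classes`, and the toy sequences are assumptions added for illustration and are not taken from the paper.

```python
from collections import defaultdict


def kmer_classes(genomes, k=4):
    """Label each kmer by how many of the input genomes contain it.

    genomes: dict mapping genome name -> sequence string.
    Hypothetical helper; 'core' is an extra label not used in the abstract.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    classes = {}
    for kmer, names in owners.items():
        if len(names) == 1:
            classes[kmer] = "tip"      # unique to a single genome
        elif len(names) < len(genomes):
            classes[kmer] = "group"    # shared by a subset of genomes
        else:
            classes[kmer] = "core"     # present in every genome
    return classes


toy = {"a": "ACGTA", "b": "ACGTT", "c": "ACGGG"}
print(kmer_classes(toy, k=3))
```

No alignment or reference genome is needed for this classification, which is the property the abstract emphasizes; a real analysis would also need to treat reverse complements and sequencing errors, both ignored here.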
Chapus, Charles; Edwards, Scott V
2009-07-14
With the publication of the draft chicken genome and the recent production of several BAC clone libraries from non-avian reptiles and birds, it is now possible to undertake more detailed comparative genomic studies in Reptilia. Of interest in particular are the genomic events that transformed the large, repeat-rich genomes of mammals and non-avian reptiles into the minimalist chicken genome. We have used paired BAC end sequences (BESs) from the American alligator (Alligator mississippiensis), painted turtle (Chrysemys picta) and emu (Dromaius novaehollandiae) to investigate patterns of sequence divergence, gene and retroelement content, and microsynteny between these species and chicken. From a total of 11,967 curated BESs, we successfully mapped 725, 773 and 2597 sequences in alligator, turtle, and emu, respectively, to sites in the draft chicken genome using a stringent BLAST protocol. Most commonly, sequences mapped to a single site in the chicken genome. Of 1675, 1828 and 2936 paired BESs obtained for alligator, turtle, and emu, respectively, a total of 34 (alligator, 2%), 24 (turtle, 1.3%) and 479 (emu, 16.3%) pairs were found to map with high confidence and in the correct orientation and with BAC-sized intermarker distances to single chicken chromosomes, including 25 such paired hits in emu mapping to the chicken Z chromosome. By determining the insert sizes of a subset of BAC clones from these three species, we also found a significant correlation between the intermarker distance in alligator and turtle and in chicken, with slopes as expected on the basis of the ratio of the genome sizes.
Our results suggest that a large number of small-scale chromosomal rearrangements and deletions in the lineage leading to chicken have drastically reduced the number of detected syntenies observed between the chicken and alligator, turtle, and emu genomes and imply that small deletions occurring widely throughout the genomes of reptilian and avian ancestors led to the ~50% reduction in genome size observed in birds compared to reptiles. We have also mapped and identified likely gene regions in hundreds of new BAC clones from these species. PMID:19607659
Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies.
DeMaere, Matthew Z; Darling, Aaron E
2018-02-01
Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing.
NASA Astrophysics Data System (ADS)
Dick, G. J.; Andersson, A.; Banfield, J. F.
2007-12-01
Our understanding of environmental microbiology has been greatly enhanced by community genome sequencing of DNA recovered directly from the environment. Community genomics provides insights into the diversity, community structure, metabolic function, and evolution of natural populations of uncultivated microbes, thereby revealing dynamics of how microorganisms interact with each other and their environment. Recent studies have demonstrated the potential for reconstructing near-complete genomes from natural environments while highlighting the challenges of analyzing community genomic sequence, especially from diverse environments. A major challenge of shotgun community genome sequencing is identification of DNA fragments from minor community members for which only low coverage of genomic sequence is present. We analyzed community genome sequence retrieved from biofilms in an acid mine drainage (AMD) system in the Richmond Mine at Iron Mountain, CA, with an emphasis on identification and assembly of DNA fragments from low-abundance community members. The Richmond Mine hosts an extensive, relatively low diversity subterranean chemolithoautotrophic community that is sustained entirely by oxidative dissolution of pyrite. The activity of these microorganisms greatly accelerates the generation of AMD. Previous and ongoing work in our laboratory has focused on reconstructing genomes of dominant community members, including several bacteria and archaea. We binned contigs from several samples (including one new sample and two that had been previously analyzed) by tetranucleotide frequency with clustering by Self-Organizing Maps (SOM). The binning, evaluated by comparison with information from the manually curated assembly of the dominant organisms, was found to be very effective: fragments were correctly assigned with 95% accuracy. Improperly assigned fragments often contained sequences that are either evolutionarily constrained (e.g. 16S rRNA genes) or mobile elements that are not expected to reflect the tetranucleotide frequency signature of the host genome. Four unknown tetranucleotide frequency clusters with significant sequence (6 Mb total) were noted and analyzed further. Based on phylogenetic markers and BLAST results, these clusters represent low abundance bacteria including Actinobacteria, Firmicutes, and Proteobacteria. Functional analysis of these clusters revealed that the low-abundance bacteria harbor genes that could potentially encode important ecosystem functions such as sulfur utilization (e.g. polysulfide reductase) and polymer degradation (e.g. chitinase and glycoside hydrolase). We conclude that ESOM clustering of tetranucleotide frequency patterns is an effective method for rapidly binning shotgun community genomic sequences and a valuable tool for analyzing minor community members, which despite their low abundance may play crucial ecological roles.
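The feature used for binning in the abstract above, a tetranucleotide frequency vector per contig, is straightforward to compute. The sketch below builds the 256-element vector and a Euclidean distance between two vectors. It is only the feature-extraction step under simplifying assumptions: reverse complements are not merged, the SOM/ESOM clustering itself is not shown, and the names `tetranucleotide_frequencies` and `euclidean` are hypothetical.

```python
from itertools import product

# All 256 possible tetranucleotides in a fixed order, so vectors are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each tetranucleotide in seq."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


contig_a = tetranucleotide_frequencies("ACGTACGTACGTACGT")
contig_b = tetranucleotide_frequencies("GGGGCCCCGGGGCCCC")
print(len(contig_a), round(euclidean(contig_a, contig_b), 3))
```

Contigs from the same genome are expected to give nearby vectors, which is what a clustering method would exploit; short contigs give noisy vectors, so practical pipelines usually impose a minimum length.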
ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.
Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P; Chu, Justin; Jackman, Shaun D; Birol, Inanc; Warren, René L
2018-06-20
The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. 
Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
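Two quantities mentioned in the abstract above have simple definitions that a short sketch can make concrete: the Jaccard index between two sets of barcodes, and the NG50 contiguity statistic. The code below is not ARKS's implementation and does not reproduce its distance model; the function names, barcode labels, and lengths are hypothetical.

```python
def jaccard(barcodes_a, barcodes_b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ng50(lengths, genome_size):
    """Length of the sequence at which the running total, taken from the
    longest sequence downwards, first reaches half of genome_size.

    Returns None when the assembly covers less than half the genome.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= genome_size / 2:
            return length
    return None


# Barcodes seen at the ends of two hypothetical contigs.
print(jaccard({"bc1", "bc2", "bc3"}, {"bc2", "bc3", "bc4"}))
# Scaffold lengths against an assumed genome size of 120.
print(ng50([50, 30, 20, 10], genome_size=120))
```

Unlike N50, NG50 is computed against the genome size rather than the assembly size, so values from assemblies of the same genome remain comparable even when their total lengths differ.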
Pilkington, Sarah M; Crowhurst, Ross; Hilario, Elena; Nardozza, Simona; Fraser, Lena; Peng, Yongyan; Gunaseelan, Kularajathevan; Simpson, Robert; Tahir, Jibran; Deroles, Simon C; Templeton, Kerry; Luo, Zhiwei; Davy, Marcus; Cheng, Canhong; McNeilage, Mark; Scaglione, Davide; Liu, Yifei; Zhang, Qiong; Datson, Paul; De Silva, Nihal; Gardiner, Susan E; Bassett, Heather; Chagné, David; McCallum, John; Dzierzon, Helge; Deng, Cecilia; Wang, Yen-Yi; Barron, Lorna; Manako, Kelvina; Bowen, Judith; Foster, Toshi M; Erridge, Zoe A; Tiffin, Heather; Waite, Chethi N; Davies, Kevin M; Grierson, Ella P; Laing, William A; Kirk, Rebecca; Chen, Xiuyin; Wood, Marion; Montefiori, Mirco; Brummell, David A; Schwinn, Kathy E; Catanach, Andrew; Fullerton, Christina; Li, Dawei; Meiyalaghan, Sathiyamoorthy; Nieuwenhuizen, Niels; Read, Nicola; Prakash, Roneel; Hunter, Don; Zhang, Huaibi; McKenzie, Marian; Knäbel, Mareike; Harris, Alastair; Allan, Andrew C; Gleave, Andrew; Chen, Angela; Janssen, Bart J; Plunkett, Blue; Ampomah-Dwamena, Charles; Voogd, Charlotte; Leif, Davin; Lafferty, Declan; Souleyre, Edwige J F; Varkonyi-Gasic, Erika; Gambi, Francesco; Hanley, Jenny; Yao, Jia-Long; Cheung, Joey; David, Karine M; Warren, Ben; Marsh, Ken; Snowden, Kimberley C; Lin-Wang, Kui; Brian, Lara; Martinez-Sanchez, Marcela; Wang, Mindy; Ileperuma, Nadeesha; Macnee, Nikolai; Campin, Robert; McAtee, Peter; Drummond, Revel S M; Espley, Richard V; Ireland, Hilary S; Wu, Rongmei; Atkinson, Ross G; Karunairetnam, Sakuntala; Bulley, Sean; Chunkath, Shayhan; Hanley, Zac; Storey, Roy; Thrimawithana, Amali H; Thomson, Susan; David, Charles; Testolin, Raffaele; Huang, Hongwen; Hellens, Roger P; Schaffer, Robert J
2018-04-16
Most published genome sequences are drafts, and most are dominated by computational gene prediction. Draft genomes typically incorporate considerable sequence data that are not assigned to chromosomes, and predicted genes without quality confidence measures. The current Actinidia chinensis (kiwifruit) 'Hongyang' draft genome has 164 Mb of sequences unassigned to pseudo-chromosomes, and omissions have been identified in the gene models. A second genome of an A. chinensis (genotype Red5) was fully sequenced. This new sequence resulted in a 554.0 Mb assembly with all but 6 Mb assigned to pseudo-chromosomes. Pseudo-chromosomal comparisons showed a considerable number of translocation events have occurred following a whole genome duplication (WGD) event some consistent with centromeric Robertsonian-like translocations. RNA sequencing data from 12 tissues and ab initio analysis informed a genome-wide manual annotation, using the WebApollo tool. In total, 33,044 gene loci represented by 33,123 isoforms were identified, named and tagged for quality of evidential support. Of these 3114 (9.4%) were identical to a protein within 'Hongyang' The Kiwifruit Information Resource (KIR v2). Some proportion of the differences will be varietal polymorphisms. However, as most computationally predicted Red5 models required manual re-annotation this proportion is expected to be small. The quality of the new gene models was tested by fully sequencing 550 cloned 'Hort16A' cDNAs and comparing with the predicted protein models for Red5 and both the original 'Hongyang' assembly and the revised annotation from KIR v2. Only 48.9% and 63.5% of the cDNAs had a match with 90% identity or better to the original and revised 'Hongyang' annotation, respectively, compared with 90.9% to the Red5 models. Our study highlights the need to take a cautious approach to draft genomes and computationally predicted genes. 
Our use of the manual annotation tool WebApollo facilitated manual checking and correction of gene models enabling improvement of computational prediction. This utility was especially relevant for certain types of gene families such as the EXPANSIN like genes. Finally, this high quality gene set will supply the kiwifruit and general plant community with a new tool for genomics and other comparative analysis.
Precise detection of chromosomal translocation or inversion breakpoints by whole-genome sequencing.
Suzuki, Toshifumi; Tsurusaki, Yoshinori; Nakashima, Mitsuko; Miyake, Noriko; Saitsu, Hirotomo; Takeda, Satoru; Matsumoto, Naomichi
2014-12-01
Structural variations (SVs), including translocations, inversions, deletions and duplications, are potentially associated with Mendelian diseases and contiguous gene syndromes. Determination of SV-related breakpoints at the nucleotide level is important to reveal the genetic causes of diseases. Whole-genome sequencing (WGS) by next-generation sequencers is expected to determine structural abnormalities more directly and efficiently than conventional methods. In this study, 14 SVs (9 balanced translocations, 1 inversion and 4 microdeletions) in 9 patients were analyzed by WGS with shallow (5×) to moderate (20×) read coverage. Among 28 breakpoints (as each SV has two breakpoints), 19 had been determined previously at the nucleotide level by other methods and 9 were uncharacterized. BreakDancer and Integrative Genomics Viewer determined 20 breakpoints (16 translocation, 2 inversion and 2 deletion breakpoints), but did not detect 8 breakpoints (2 translocation and 6 deletion breakpoints). These data indicate the efficacy of WGS for the precise determination of translocation and inversion breakpoints.
2011-01-01
Background Escherichia coli is a model prokaryote, an important pathogen, and a key organism for industrial biotechnology. E. coli W (ATCC 9637), one of four strains designated as safe for laboratory purposes, has not been sequenced. E. coli W is a fast-growing strain and is the only safe strain that can utilize sucrose as a carbon source. Lifecycle analysis has demonstrated that sucrose from sugarcane is a preferred carbon source for industrial bioprocesses. Results We have sequenced and annotated the genome of E. coli W. The chromosome is 4,900,968 bp and encodes 4,764 ORFs. Two plasmids, pRK1 (102,536 bp) and pRK2 (5,360 bp), are also present. W has unique features relative to other sequenced laboratory strains (K-12, B and Crooks): it has a larger genome and belongs to phylogroup B1 rather than A. W also grows on a much broader range of carbon sources than does K-12. A genome-scale reconstruction was developed and validated in order to interrogate metabolic properties. Conclusions The genome of W is more similar to commensal and pathogenic B1 strains than phylogroup A strains, and therefore has greater utility for comparative analyses with these strains. W should therefore be the strain of choice, or 'type strain' for group B1 comparative analyses. The genome annotation and tools created here are expected to allow further utilization and development of E. coli W as an industrial organism for sucrose-based bioprocesses. Refinements in our E. coli metabolic reconstruction allow it to more accurately define E. coli metabolism relative to previous models. PMID:21208457
Limited Genetic Diversity Preceded Extinction of the Tasmanian Tiger
Menzies, Brandon R.; Renfree, Marilyn B.; Heider, Thomas; Mayer, Frieder; Hildebrandt, Thomas B.; Pask, Andrew J.
2012-01-01
The Tasmanian tiger or thylacine was the largest carnivorous marsupial when Europeans first reached Australia. Sadly, the last known thylacine died in captivity in 1936. A recent analysis of the genome of the closely related and extant Tasmanian devil demonstrated limited genetic diversity between individuals. While a similar lack of diversity has been reported for the thylacine, this analysis was based on just two individuals. Here we report the sequencing of an additional 12 museum-archived specimens collected between 102 and 159 years ago. We examined a portion of the mitochondrial DNA hyper-variable control region and determined that all sequences were on average 99.5% identical at the nucleotide level. As a measure of accuracy we also sequenced mitochondrial DNA from a mother and two offspring. As expected, these samples were found to be 100% identical, validating our methods. We also used 454 sequencing to reconstruct 2.1 kilobases of the mitochondrial genome, which shared 99.91% identity with the two complete thylacine mitochondrial genomes published previously. Our thylacine genomic data also contained three highly divergent putative nuclear mitochondrial sequences, which grouped phylogenetically with the published thylacine mitochondrial homologs but contained 100-fold more polymorphisms than the conserved fragments. Together, our data suggest that the thylacine population in Tasmania had limited genetic diversity prior to its extinction, possibly as a result of their geographic isolation from mainland Australia approximately 10,000 years ago. PMID:22530022
Solving the problem of comparing whole bacterial genomes across different sequencing platforms.
Kaas, Rolf S; Leekitcharoenphon, Pimlapas; Aarestrup, Frank M; Lund, Ole
2014-01-01
Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology because each technology has a systematic bias, making integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies in WGS data across multiple platforms. The methods were evaluated on three bacterial data sets sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. It is concluded that the success of these new procedures is due to validation of all informative sites that are included in the analysis. The procedures are available as web tools.
Evidence for Widespread Reticulate Evolution within Human Duplicons
Jackson, Michael S.; Oliver, Karen; Loveland, Jane; Humphray, Sean; Dunham, Ian; Rocchi, Mariano; Viggiano, Luigi; Park, Jonathan P.; Hurles, Matthew E.; Santibanez-Koref, Mauro
2005-01-01
Approximately 5% of the human genome consists of segmental duplications that can cause genomic mutations and may play a role in gene innovation. Reticulate evolutionary processes, such as unequal crossing-over and gene conversion, are known to occur within specific duplicon families, but the broader contribution of these processes to the evolution of human duplications remains poorly characterized. Here, we use phylogenetic profiling to analyze multiple alignments of 24 human duplicon families that span >8 Mb of DNA. Our results indicate that none of them are evolving independently, with all alignments showing sharp discontinuities in phylogenetic signal consistent with reticulation. To analyze these results in more detail, we have developed a quartet method that estimates the relative contribution of nucleotide substitution and reticulate processes to sequence evolution. Our data indicate that most of the duplications show a highly significant excess of sites consistent with reticulate evolution, compared with the number expected by nucleotide substitution alone, with 15 of 30 alignments showing a >20-fold excess over that expected. Using permutation tests, we also show that at least 5% of the total sequence shares 100% sequence identity because of reticulation, a figure that includes 74 independent tracts of perfect identity >2 kb in length. Furthermore, analysis of a subset of alignments indicates that the density of reticulation events is as high as 1 every 4 kb. These results indicate that phylogenetic relationships within recently duplicated human DNA can be rapidly disrupted by reticulate evolution. This finding has important implications for efforts to finish the human genome sequence, complicates comparative sequence analysis of duplicon families, and could profoundly influence the tempo of gene-family evolution. PMID:16252241
Peng, Suotang; Xu, Qun; Yuan, Xiaoping; Feng, Yue; Yu, Hanyong; Wang, Yiping; Wei, Xinghua
2014-01-01
Wild species of Oryza are extremely valuable sources of genetic material that can be used to broaden the genetic background of cultivated rice, and to increase its resistance to abiotic and biotic stresses. Until recently, there was no sequence information for the BBCC Oryza genome; therefore, no special markers had been developed for this genome type. The lack of suitable markers made it difficult to search for valuable genes in the BBCC genome. The aim of this study was to develop microsatellite markers for the BBCC genome. We obtained 13,991 SSR-containing sequences and designed 14,508 primer pairs. The most abundant motif type was hexanucleotide (31.39%), followed by trinucleotide (27.67%) and dinucleotide (19.04%). A total of 600 markers were selected for validation in 23 accessions of Oryza species with the BBCC genome. A set of 495 markers produced clear amplified fragments of the expected sizes. The average number of alleles per locus (Na) was 2.5, ranging from 1 to 9. The genetic diversity per locus (He) ranged from 0 to 0.844 with a mean of 0.333. The mean polymorphism information content (PIC) was 0.290, and ranged from 0 to 0.825. Of the 495 markers, 12 were only found in the BB genome, 173 were unique to the CC genome, and 198 were also present in the AA genome. These microsatellite markers could be used to evaluate the phylogenetic relationships among different Oryza genomes, and to construct a genetic linkage map for locating and identifying valuable genes in the BBCC genome, and would also be useful for marker-assisted breeding programs that include accessions with the AA genome, especially Oryza sativa. PMID:24632997
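The per-locus statistics reported above (He and PIC) are commonly computed from allele frequencies. The sketch below uses the textbook definitions, He = 1 - Σ p_i² and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j²; whether the study used exactly these estimators is an assumption, and the function names and example frequencies are hypothetical.

```python
def expected_heterozygosity(freqs):
    """Genetic diversity per locus: He = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = 0.0
    for i in range(len(freqs) - 1):
        for j in range(i + 1, len(freqs)):
            cross_term += 2.0 * freqs[i] ** 2 * freqs[j] ** 2
    return 1.0 - homozygosity - cross_term


# Hypothetical allele frequencies at one microsatellite locus (must sum to 1).
locus = [0.5, 0.5]
print(expected_heterozygosity(locus))           # 0.5
print(polymorphism_information_content(locus))  # 0.375
```

A monomorphic locus (a single allele at frequency 1.0) gives 0 for both statistics, which matches the lower bound of the ranges quoted in the abstract.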
BlackOPs: increasing confidence in variant detection through mappability filtering.
Cabanski, Christopher R; Wilkerson, Matthew D; Soloway, Matthew; Parker, Joel S; Liu, Jinze; Prins, Jan F; Marron, J S; Perou, Charles M; Hayes, D Neil
2013-10-01
Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.
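The filtering step described above amounts to discarding any called variant whose position and allele appear on the blacklist. The following is a minimal sketch of that idea; the tuple layout (chromosome, position, alternate allele), the function name, and the example records are assumptions made for illustration and do not reflect the actual BlackOPs output format.

```python
def filter_against_blacklist(variants, blacklist):
    """Keep only variants whose (chrom, pos, alt) key is not blacklisted."""
    blocked = set(blacklist)  # set membership keeps each lookup O(1)
    return [v for v in variants if (v[0], v[1], v[2]) not in blocked]


# Hypothetical variant calls: (chromosome, 1-based position, alternate allele).
calls = [
    ("chr1", 10_500, "A"),
    ("chr1", 20_750, "T"),
    ("chr2", 5_000, "G"),
]

# Hypothetical blacklist entries produced by simulating reads from the reference.
blacklist = [("chr1", 20_750, "T")]

print(filter_against_blacklist(calls, blacklist))
```

Because the abstract notes that blacklists are specific to the aligner and read length, a blacklist should only be applied to calls produced with the same experimental setup.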
Gene Identification Algorithms Using Exploratory Statistical Analysis of Periodicity
NASA Astrophysics Data System (ADS)
Mukherjee, Shashi Bajaj; Sen, Pradip Kumar
2010-10-01
Studying periodic patterns would be expected to be a standard line of attack for recognizing DNA sequences in gene identification and similar problems. Surprisingly, very little significant work has been done in this direction. This paper studies statistical properties of DNA sequences of complete genomes using a new technique. A DNA sequence is converted to a numeric sequence using various types of mappings, and a standard Fourier technique is applied to study the periodicity. Distinct statistical behaviour of periodicity parameters is found in coding and non-coding sequences, which can be used to distinguish between these parts. Here, DNA sequences of Drosophila melanogaster were analyzed with significant accuracy.
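One way to picture the approach is to map the sequence to four binary indicator sequences (one per base) and evaluate the Fourier power at a chosen period. The sketch below does this for a single period; the indicator mapping is only one of the possible mappings and is an assumption here, as are the function name and the toy sequence. It is not the statistical procedure of the paper.

```python
import cmath


def spectral_power(sequence, period):
    """Summed Fourier power of the four base-indicator sequences at one period."""
    n = len(sequence)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * pos / period)
            for pos, symbol in enumerate(sequence)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total / n


# Toy sequence with a perfect three-base repeat (hypothetical, for illustration).
toy = "ATG" * 30

print(spectral_power(toy, 3))  # strong signal at period 3
print(spectral_power(toy, 4))  # much weaker signal at period 4
```

Coding regions often show an elevated period-3 component because of codon structure, which is the kind of difference between coding and non-coding sequence that the abstract refers to.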
Comprehensive molecular characterization of human colon and rectal cancer.
2012-07-18
To characterize somatic alterations in colorectal carcinoma, we conducted a genome-scale analysis of 276 samples, analysing exome sequence, DNA copy number, promoter methylation and messenger RNA and microRNA expression. A subset of these samples (97) underwent low-depth-of-coverage whole-genome sequencing. In total, 16% of colorectal carcinomas were found to be hypermutated: three-quarters of these had the expected high microsatellite instability, usually with hypermethylation and MLH1 silencing, and one-quarter had somatic mismatch-repair gene and polymerase ε (POLE) mutations. Excluding the hypermutated cancers, colon and rectum cancers were found to have considerably similar patterns of genomic alteration. Twenty-four genes were significantly mutated, and in addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9 and FAM123B. Recurrent copy-number alterations include potentially drug-targetable amplifications of ERBB2 and newly discovered amplification of IGF2. Recurrent chromosomal translocations include the fusion of NAV2 and WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive colorectal carcinoma and an important role for MYC-directed transcriptional activation and repression.
Ni, Guiyan; Cavero, David; Fangmann, Anna; Erbe, Malena; Simianer, Henner
2017-01-16
With the availability of next-generation sequencing technologies, genomic prediction based on whole-genome sequencing (WGS) data is now feasible in animal breeding schemes and was expected to lead to higher predictive ability, since such data may contain all genomic variants including causal mutations. Our objective was to compare prediction ability with high-density (HD) array data and WGS data in a commercial brown layer line with genomic best linear unbiased prediction (GBLUP) models using various approaches to weight single nucleotide polymorphisms (SNPs). A total of 892 chickens from a commercial brown layer line were genotyped with 336 K segregating SNPs (array data) that included 157 K genic SNPs (i.e. SNPs in or around a gene). For these individuals, genome-wide sequence information was imputed based on data from re-sequencing runs of 25 individuals, leading to 5.2 million (M) imputed SNPs (WGS data), including 2.6 M genic SNPs. De-regressed proofs (DRP) for eggshell strength, feed intake and laying rate were used as quasi-phenotypic data in genomic prediction analyses. Four weighting factors for building a trait-specific genomic relationship matrix were investigated: identical weights, -(log 10 P) from genome-wide association study results, squares of SNP effects from random regression BLUP, and variable selection based weights (known as BLUP|GA). Predictive ability was measured as the correlation between DRP and direct genomic breeding values in five replications of a fivefold cross-validation. Averaged over the three traits, the highest predictive ability (0.366 ± 0.075) was obtained when only genic SNPs from WGS data were used. Predictive abilities with genic SNPs and all SNPs from HD array data were 0.361 ± 0.072 and 0.353 ± 0.074, respectively. 
Prediction with -(log 10 P) or squares of SNP effects as weighting factors for building a genomic relationship matrix or BLUP|GA did not increase accuracy, compared to that with identical weights, regardless of the SNP set used. Our results show that little or no benefit was gained when using all imputed WGS data to perform genomic prediction compared to using HD array data regardless of the weighting factors tested. However, using only genic SNPs from WGS data had a positive effect on prediction ability.
Dabney, Jesse; Knapp, Michael; Glocke, Isabelle; Gansauge, Marie-Theres; Weihmann, Antje; Nickel, Birgit; Valdiosera, Cristina; García, Nuria; Pääbo, Svante; Arsuaga, Juan-Luis; Meyer, Matthias
2013-09-24
Although an inverse relationship is expected in ancient DNA samples between the number of surviving DNA fragments and their length, ancient DNA sequencing libraries are strikingly deficient in molecules shorter than 40 bp. We find that a loss of short molecules can occur during DNA extraction and present an improved silica-based extraction protocol that enables their efficient retrieval. In combination with single-stranded DNA library preparation, this method enabled us to reconstruct the mitochondrial genome sequence from a Middle Pleistocene cave bear (Ursus deningeri) bone excavated at Sima de los Huesos in the Sierra de Atapuerca, Spain. Phylogenetic reconstructions indicate that the U. deningeri sequence forms an early diverging sister lineage to all Western European Late Pleistocene cave bears. Our results prove that authentic ancient DNA can be preserved for hundreds of thousand years outside of permafrost. Moreover, the techniques presented enable the retrieval of phylogenetically informative sequences from samples in which virtually all DNA is diminished to fragments shorter than 50 bp. PMID:24019490
Finding similar nucleotide sequences using network BLAST searches.
Ladunga, Istvan
2009-06-01
The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge.
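Where the abstract mentions interpreting expected frequency (E-value) thresholds, the standard relation between a normalized bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), can be sketched as follows. The function name and the example lengths are illustrative assumptions; real BLAST reports use effective lengths corrected for edge effects, which this sketch ignores.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score.

    Uses E = m * n * 2 ** (-S'), where m and n are the query and database
    lengths and S' is the normalized (bit) score.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 500-nt query against a 1,000,000-nt database.
for score in (20, 40, 60):
    print(score, expect_value(score, 500, 1_000_000))
```

Each additional bit of score halves the expected number of chance hits, which is why weak hits with E-values near or above 1 are treated as hints for further analysis and not as evidence.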
Complete mitochondrial genome sequence of Urechis caupo, a representative of the phylum Echiura
Boore, Jeffrey L
2004-01-01
Background Mitochondria contain small genomes that are physically separate from those of nuclei. Their comparison serves as a model system for understanding the processes of genome evolution. Although hundreds of these genome sequences have been reported, the taxonomic sampling is highly biased toward vertebrates and arthropods, with many whole phyla remaining unstudied. This is the first description of a complete mitochondrial genome sequence of a representative of the phylum Echiura, that of the fat innkeeper worm, Urechis caupo. Results This mtDNA is 15,113 nts in length and 62% A+T. It contains the 37 genes that are typical for animal mtDNAs in an arrangement somewhat similar to that of annelid worms. All genes are encoded by the same DNA strand which is rich in A and C relative to the opposite strand. Codons ending with the dinucleotide GG are more frequent than would be expected from apparent mutational biases. The largest non-coding region is only 282 nts long, is 71% A+T, and has potential for secondary structures. Conclusions Urechis caupo mtDNA shares many features with those of the few studied annelids, including the common usage of ATG start codons, unusual among animal mtDNAs, as well as gene arrangements, tRNA structures, and codon usage biases. PMID:15369601
Ratmann, Oliver; Wymant, Chris; Colijn, Caroline; Danaviah, Siva; Essex, M; Frost, Simon D W; Gall, Astrid; Gaiseitsiwe, Simani; Grabowski, Mary; Gray, Ronald; Guindon, Stephane; von Haeseler, Arndt; Kaleebu, Pontiano; Kendall, Michelle; Kozlov, Alexey; Manasa, Justen; Minh, Bui Quang; Moyo, Sikhulile; Novitsky, Vladimir; Nsubuga, Rebecca; Pillay, Sureshnee; Quinn, Thomas C; Serwadda, David; Ssemwanga, Deogratius; Stamatakis, Alexandros; Trifinopoulos, Jana; Wawer, Maria; Leigh Brown, Andrew; de Oliveira, Tulio; Kellam, Paul; Pillay, Deenan; Fraser, Christophe
2017-05-25
To characterize HIV-1 transmission dynamics in regions where the burden of HIV-1 is greatest, the 'Phylogenetics and Networks for Generalised HIV Epidemics in Africa' consortium (PANGEA-HIV) is sequencing full-genome viral isolates from across sub-Saharan Africa. We report the first 3,985 PANGEA-HIV consensus sequences from four cohort sites (Rakai Community Cohort Study, n=2,833; MRC/UVRI Uganda, n=701; Mochudi Prevention Project, n=359; Africa Health Research Institute Resistance Cohort, n=92). Next-generation sequencing success rates varied: more than 80% of the viral genome from the gag to the nef genes could be determined for all sequences from South Africa, 75% of sequences from Mochudi, 60% of sequences from MRC/UVRI Uganda, and 22% of sequences from Rakai. Partial sequencing failure was primarily associated with low viral load, increased for amplicons closer to the 3' end of the genome, was not associated with subtype diversity except HIV-1 subtype D, and remained significantly associated with sampling location after controlling for other factors. We assessed the impact of the missing data patterns in PANGEA-HIV sequences on phylogeny reconstruction in simulations. We found a threshold in terms of taxon sampling below which the patchy distribution of missing characters in next-generation sequences has an excess negative impact on the accuracy of HIV-1 phylogeny reconstruction, which is attributable to tree reconstruction artifacts that accumulate when branches in viral trees are long. The large number of PANGEA-HIV sequences provides unprecedented opportunities for evaluating HIV-1 transmission dynamics across sub-Saharan Africa and identifying prevention opportunities. Molecular epidemiological analyses of these data must proceed cautiously because sequence sampling remains below the identified threshold and a considerable negative impact of missing characters on phylogeny reconstruction is expected. PMID:28540766
Gallego, Carlos J; Bennette, Caroline S; Heagerty, Patrick; Comstock, Bryan; Horike-Pyne, Martha; Hisama, Fuki; Amendola, Laura M; Bennett, Robin L; Dorschner, Michael O; Tarczy-Hornoch, Peter; Grady, William M; Fullerton, S Malia; Trinidad, Susan B; Regier, Dean A; Nickerson, Deborah A; Burke, Wylie; Patrick, Donald L; Jarvik, Gail P; Veenstra, David L
2014-09-01
Whole exome and whole genome sequencing are applications of next generation sequencing transforming clinical care, but there is little evidence whether these tests improve patient outcomes or if they are cost effective compared to current standard of care. These gaps in knowledge can be addressed by comparative effectiveness and patient-centered outcomes research. We designed a randomized controlled trial that incorporates these research methods to evaluate whole exome sequencing compared to usual care in patients being evaluated for hereditary colorectal cancer and polyposis syndromes. Approximately 220 patients will be randomized and followed for 12 months after return of genomic findings. Patients will receive findings associated with colorectal cancer in a first return of results visit, and findings not associated with colorectal cancer (incidental findings) during a second return of results visit. The primary outcome is efficacy to detect mutations associated with these syndromes; secondary outcomes include psychosocial impact, cost-effectiveness and comparative costs. The secondary outcomes will be obtained via surveys before and after each return visit. The expected challenges in conducting this randomized controlled trial include the relatively low prevalence of genetic disease, difficult interpretation of some genetic variants, and uncertainty about which incidental findings should be returned to patients. The approaches utilized in this study may help guide other investigators in clinical genomics to identify useful outcome measures and strategies to address comparative effectiveness questions about the clinical implementation of genomic sequencing in clinical care.
Pomel, Sébastien; Diogon, Marie; Bouchard, Philippe; Pradel, Lydie; Ravet, Viviane; Coffe, Gérard; Viguès, Bernard
2006-02-01
Previous attempts to identify the membrane skeleton of Paramecium cells have revealed a protein pattern that is both complex and specific. The most prominent structural elements, epiplasmic scales, are centered around ciliary units and are closely apposed to the cytoplasmic side of the inner alveolar membrane. We sought to characterize epiplasmic scale proteins (epiplasmins) at the molecular level. PCR approaches enabled the cloning and sequencing of two closely related genes by amplifications of sequences from a macronuclear genomic library. Using these two genes (EPI-1 and EPI-2), we have contributed to the annotation of the Paramecium tetraurelia macronuclear genome and identified 39 additional (paralogous) sequences. Two orthologous sequences were found in the Tetrahymena thermophila genome. Structural analysis of the 43 sequences indicates that the hallmark of this new multigenic family is a 79 aa domain flanked by two Q-, P- and V-rich stretches of sequence that are much more variable in amino-acid composition. Such features clearly distinguish members of the multigenic family from epiplasmic proteins previously sequenced in other ciliates. The expression of Green Fluorescent Protein (GFP)-tagged epiplasmin showed significant labeling of epiplasmic scales as well as oral structures. We expect that the GFP construct described herein will prove to be a useful tool for comparative subcellular localization of different putative epiplasmins in Paramecium.
Barry, Elizabeth G; Witherspoon, David J; Lampe, David J
2004-02-01
Transposons of the mariner family are widespread in animal genomes and have apparently infected them by horizontal transfer. Most species carry only old defective copies of particular mariner transposons that have diverged greatly from their active horizontally transferred ancestor, while a few contain young, very similar, and active copies. We report here the use of a whole-genome screen in bacteria to isolate somewhat diverged Famar1 copies from the European earwig, Forficula auricularia, that encode functional transposases. Functional and nonfunctional coding sequences of Famar1 and nonfunctional copies of Ammar1 from the European honey bee, Apis mellifera, were sequenced to examine their molecular evolution. No selection for sequence conservation was detected in any clade of a tree derived from these sequences, not even on branches leading to functional copies. This agrees with the current model for mariner transposon evolution that expects neutral evolution within particular hosts, with selection for function occurring only upon horizontal transfer to a new host. Our results further suggest that mariners are not finely tuned genetic entities and that a greater amount of sequence diversification than had previously been appreciated can occur in functional copies in a single host lineage. Finally, this method of isolating active copies can be used to isolate other novel active transposons without resorting to reconstruction of ancestral sequences.
Persea americana (avocado): bringing ancient flowers to fruit in the genomics era.
Chanderbali, André S; Albert, Victor A; Ashworth, Vanessa E T M; Clegg, Michael T; Litz, Richard E; Soltis, Douglas E; Soltis, Pamela S
2008-04-01
The avocado (Persea americana) is a major crop commodity worldwide. Moreover, avocado, a paleopolyploid, is an evolutionary "outpost" among flowering plants, representing a basal lineage (the magnoliid clade) near the origin of the flowering plants themselves. Following centuries of selective breeding, avocado germplasm has been characterized at the level of microsatellite and RFLP markers. Nonetheless, little is known beyond these general diversity estimates, and much work remains to be done to develop avocado as a major subtropical-zone crop. Among the goals of avocado improvement are to develop varieties with fruit that will "store" better on the tree, show uniform ripening and have better post-harvest storage. Avocado transcriptome sequencing, genome mapping and partial genomic sequencing will represent a major step toward the goal of sequencing the entire avocado genome, which is expected to aid in improving avocado varieties and production, as well as understanding the evolution of flowers from non-flowering seed plants (gymnosperms). Additionally, continued evolutionary and other comparative studies of flower and fruit development in different avocado strains can be accomplished at the gene expression level, including in comparison with avocado relatives, and these should provide important insights into the genetic regulation of fruit development in basal angiosperms.
Effector profiles distinguish formae speciales of Fusarium oxysporum.
van Dam, Peter; Fokkens, Like; Schmidt, Sarah M; Linmans, Jasper H J; Kistler, H Corby; Ma, Li-Jun; Rep, Martijn
2016-11-01
Formae speciales (ff.spp.) of the fungus Fusarium oxysporum are often polyphyletic within the species complex, making it impossible to identify them on the basis of conserved genes. However, sequences that determine host-specific pathogenicity may be expected to be similar between strains within the same forma specialis. Whole genome sequencing was performed on strains from five different ff.spp. (cucumerinum, niveum, melonis, radicis-cucumerinum and lycopersici). In each genome, genes for putative effectors were identified based on small size, secretion signal, and vicinity to a "miniature impala" transposable element. The candidate effector genes of all genomes were collected and the presence/absence patterns in each individual genome were clustered. Members of the same forma specialis turned out to group together, with cucurbit-infecting strains forming a supercluster separate from other ff.spp. Moreover, strains from different clonal lineages within the same forma specialis harbour identical effector gene sequences, supporting horizontal transfer of genetic material. These data offer new insight into the genetic basis of host specificity in the F. oxysporum species complex and show that (putative) effectors can be used to predict host specificity in F. oxysporum. © 2016 Society for Applied Microbiology and John Wiley & Sons Ltd.
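The presence/absence clustering described above can be illustrated with a minimal sketch. It assumes Jaccard distance between effector sets and average-linkage agglomerative clustering; the abstract does not state which distance or linkage was used, and the strain names and effector labels below are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of effector gene names."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles maps strain name -> set of candidate effectors present.
    Distance between clusters is the mean pairwise Jaccard distance.
    """
    clusters = [[name] for name in sorted(profiles)]

    def dist(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(profiles[x], profiles[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy profiles (not data from the study).
profiles = {
    "cucumerinum_1": {"E1", "E2", "E3"},
    "cucumerinum_2": {"E1", "E2", "E3", "E4"},
    "lycopersici_1": {"E5", "E6"},
    "lycopersici_2": {"E5", "E6", "E7"},
}
groups = average_linkage(profiles, 2)
```

In this toy input, strains sharing most effectors end up in the same group, which is the pattern the abstract reports for members of one forma specialis.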
Hilson, Pierre; Allemeersch, Joke; Altmann, Thomas; Aubourg, Sébastien; Avon, Alexandra; Beynon, Jim; Bhalerao, Rishikesh P.; Bitton, Frédérique; Caboche, Michel; Cannoot, Bernard; Chardakov, Vasil; Cognet-Holliger, Cécile; Colot, Vincent; Crowe, Mark; Darimont, Caroline; Durinck, Steffen; Eickhoff, Holger; de Longevialle, Andéol Falcon; Farmer, Edward E.; Grant, Murray; Kuiper, Martin T.R.; Lehrach, Hans; Léon, Céline; Leyva, Antonio; Lundeberg, Joakim; Lurin, Claire; Moreau, Yves; Nietfeld, Wilfried; Paz-Ares, Javier; Reymond, Philippe; Rouzé, Pierre; Sandberg, Goran; Segura, Maria Dolores; Serizet, Carine; Tabrett, Alexandra; Taconnat, Ludivine; Thareau, Vincent; Van Hummelen, Paul; Vercruysse, Steven; Vuylsteke, Marnik; Weingartner, Magdalena; Weisbeek, Peter J.; Wirta, Valtteri; Wittink, Floyd R.A.; Zabeau, Marc; Small, Ian
2004-01-01
Microarray transcript profiling and RNA interference are two new technologies crucial for large-scale gene function studies in multicellular eukaryotes. Both rely on sequence-specific hybridization between complementary nucleic acid strands, inciting us to create a collection of gene-specific sequence tags (GSTs) representing at least 21,500 Arabidopsis genes and which are compatible with both approaches. The GSTs were carefully selected to ensure that each of them shared no significant similarity with any other region in the Arabidopsis genome. They were synthesized by PCR amplification from genomic DNA. Spotted microarrays fabricated from the GSTs show good dynamic range, specificity, and sensitivity in transcript profiling experiments. The GSTs have also been transferred to bacterial plasmid vectors via recombinational cloning protocols. These cloned GSTs constitute the ideal starting point for a variety of functional approaches, including reverse genetics. We have subcloned GSTs on a large scale into vectors designed for gene silencing in plant cells. We show that in planta expression of GST hairpin RNA results in the expected phenotypes in silenced Arabidopsis lines. These versatile GST resources provide novel and powerful tools for functional genomics. PMID:15489341
Khafizov, Kamil; Madrid-Aliste, Carlos; Almo, Steven C.; Fiser, Andras
2014-01-01
The exponential growth of protein sequence data provides an ever-expanding body of unannotated and misannotated proteins. The National Institutes of Health-supported Protein Structure Initiative and related worldwide structural genomics efforts facilitate functional annotation of proteins through structural characterization. Recently there have been profound changes in the taxonomic composition of sequence databases, which are effectively redefining the scope and contribution of these large-scale structure-based efforts. The faster-growing bacterial genomic entries have overtaken the eukaryotic entries over the last 5 y, but also have become more redundant. Despite the enormous increase in the number of sequences, the overall structural coverage of proteins—including proteins for which reliable homology models can be generated—on the residue level has increased from 30% to 40% over the last 10 y. Structural genomics efforts contributed ∼50% of this new structural coverage, despite determining only ∼10% of all new structures. Based on current trends, it is expected that ∼55% structural coverage (the level required for significant functional insight) will be achieved within 15 y, whereas without structural genomics efforts, realizing this goal will take approximately twice as long. PMID:24567391
Khafizov, Kamil; Madrid-Aliste, Carlos; Almo, Steven C; Fiser, Andras
2014-03-11
The exponential growth of protein sequence data provides an ever-expanding body of unannotated and misannotated proteins. The National Institutes of Health-supported Protein Structure Initiative and related worldwide structural genomics efforts facilitate functional annotation of proteins through structural characterization. Recently there have been profound changes in the taxonomic composition of sequence databases, which are effectively redefining the scope and contribution of these large-scale structure-based efforts. The faster-growing bacterial genomic entries have overtaken the eukaryotic entries over the last 5 y, but also have become more redundant. Despite the enormous increase in the number of sequences, the overall structural coverage of proteins--including proteins for which reliable homology models can be generated--on the residue level has increased from 30% to 40% over the last 10 y. Structural genomics efforts contributed ∼50% of this new structural coverage, despite determining only ∼10% of all new structures. Based on current trends, it is expected that ∼55% structural coverage (the level required for significant functional insight) will be achieved within 15 y, whereas without structural genomics efforts, realizing this goal will take approximately twice as long.
FY11 Report on Metagenome Analysis using Pathogen Marker Libraries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gardner, Shea N.; Allen, Jonathan E.; McLoughlin, Kevin S.
2011-06-02
A method, sequence library, and software suite were invented to rapidly assess whether any member of a pre-specified list of threat organisms or their near neighbors is present in a metagenome. The system was designed to handle mega- to giga-bases of FASTA-formatted raw sequence reads from short or long read next generation sequencing platforms. The approach is to pre-calculate a viral and a bacterial "Pathogen Marker Library" (PML) containing sub-sequences specific to pathogens or their near neighbors. A list of expected matches comparing every bacterial or viral genome against the PML sequences is also pre-calculated. To analyze a metagenome, reads are compared to the PML, observed PML-metagenome matches are compared to the expected PML-genome matches, and the ratio of observed relative to expected matches is reported. In other words, a 3-way comparison among the PML, metagenome, and existing genome sequences is used to quickly assess which (if any) species included in the PML is likely to be present in the metagenome, based on available sequence data. Our tests showed that the species with the most PML matches correctly indicated the organism sequenced for empirical metagenomes consisting of a cultured, relatively pure isolate. These runs completed in 1 minute to 3 hours on 12 CPUs (1 thread/CPU), depending on the metagenome and PML. Using more threads on the same number of CPUs resulted in speed improvements roughly proportional to the number of threads. Simulations indicated that detection sensitivity depends on both sequencing coverage levels for a species and the size of the PML: species were correctly detected even at ~0.003x coverage by the large PMLs, and at ~0.03x coverage by the smaller PMLs. Matches to true positive species were 3-4 orders of magnitude higher than to false positives. Simulations with short reads (36 nt and ~260 nt) showed that species were usually detected for metagenome coverage above 0.005x and coverage in the PML above 0.05x, and detection probability appears to be a function of both coverages. Multiple species could be detected simultaneously in a simulated low-coverage, complex metagenome, and the largest PML gave no false negative species and no false positive genera. The presence of multiple species was predicted in a complex metagenome from a human gut microbiome with 1.9 GB of short reads (75 nt); the species predicted were reasonable gut flora and no biothreat agents were detected, showing the feasibility of PML analysis of empirical complex metagenomes.
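The observed-to-expected comparison at the core of this approach can be sketched as follows. This is an illustrative reading of the abstract, not the report's software: the function names, the dictionary layout, and the toy counts are all assumptions.

```python
def observed_expected_ratios(observed, expected):
    """Ratio of observed PML-metagenome matches to expected PML-genome matches.

    observed: species -> number of PML markers matched by metagenome reads
    expected: species -> number of PML markers the full genome would match
    Species with no expected matches are skipped (ratio undefined).
    """
    ratios = {}
    for species, exp in expected.items():
        if exp > 0:
            ratios[species] = observed.get(species, 0) / exp
    return ratios

def rank_species(observed, expected):
    """Species sorted from highest to lowest observed/expected ratio."""
    ratios = observed_expected_ratios(observed, expected)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical counts for two species in a marker library.
expected = {"species_A": 200, "species_B": 400}
observed = {"species_A": 150, "species_B": 4}
ranking = rank_species(observed, expected)
```

Normalizing by the expected count keeps a species with many markers in the library from outranking a genuinely present species simply because it has more chances to match.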
Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu
2016-04-01
Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html. Contact: fsun@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
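To make the notion of an expected word count under a Markov chain concrete, here is a minimal sketch for a first-order chain fitted to short reads. It is not the NGS.MC implementation: the estimators (pooled base frequencies as the initial distribution, pooled dinucleotide counts for transitions) and all function names are assumptions chosen for brevity.

```python
from collections import Counter

def fit_first_order_mc(reads):
    """Estimate base frequencies and first-order transition probabilities
    by pooling counts over all reads."""
    base_counts = Counter()
    pair_counts = Counter()
    for read in reads:
        base_counts.update(read)
        pair_counts.update(zip(read, read[1:]))
    total = sum(base_counts.values())
    freqs = {base: count / total for base, count in base_counts.items()}
    row_totals = Counter()
    for (first, _), count in pair_counts.items():
        row_totals[first] += count
    trans = {
        (first, second): count / row_totals[first]
        for (first, second), count in pair_counts.items()
    }
    return freqs, trans

def word_probability(word, freqs, trans):
    """P(word) = freq(w1) * product of transition probabilities."""
    prob = freqs.get(word[0], 0.0)
    for first, second in zip(word, word[1:]):
        prob *= trans.get((first, second), 0.0)
    return prob

def expected_word_count(word, reads, freqs, trans):
    """Expected occurrences summed over every start position in every read."""
    k = len(word)
    positions = sum(max(len(read) - k + 1, 0) for read in reads)
    return positions * word_probability(word, freqs, trans)

reads = ["ACGT", "ACGT"]  # toy reads, not real data
freqs, trans = fit_first_order_mc(reads)
expected_ac = expected_word_count("AC", reads, freqs, trans)
```

Comparing such expected counts with the observed counts is the ingredient that word-count based dissimilarity measures need; higher-order chains follow the same pattern with longer conditioning contexts.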
MycoCosm, an Integrated Fungal Genomics Resource
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shabalov, Igor; Grigoriev, Igor
2012-03-16
MycoCosm is a web-based interactive fungal genomics resource, which was first released in March 2010, in response to an urgent call from the fungal community for integration of all fungal genomes and analytical tools in one place (Pan-fungal data resources meeting, Feb 21-22, 2010, Alexandria, VA). MycoCosm integrates genomics data and analysis tools to navigate through over 100 fungal genomes sequenced at JGI and elsewhere. This resource allows users to explore fungal genomes in the context of both genome-centric analysis and comparative genomics, and promotes user community participation in data submission, annotation and analysis. MycoCosm has over 4500 unique visitors/month or 35000+ visitors/year as well as hundreds of registered users contributing their data and expertise to this resource. Its scalable architecture allows significant expansion of the data expected from the JGI Fungal Genomics Program and its users, and integration with external resources used by the fungal community.
Čížková, Dagmar; Baird, Stuart J E; Těšíková, Jana; Voigt, Sebastian; Ľudovít, Ďureje; Piálek, Jaroslav; Goüy de Bellocq, Joëlle
2018-06-09
Murine cytomegalovirus (MCMV) has been reported from house mice (Mus musculus) worldwide, but only recently from Eastern house mice (M. m. musculus), of particular interest because they form a semi-permeable species barrier in Europe with Western house mice, M. m. domesticus. Here we report genome sequences of EastMCMV (from Eastern mice), and set these in the context of MCMV genomes from genus Mus hosts. We show EastMCMV and WestMCMV are genetically distinct. Phylogeny splitting analyses show a genome wide (94%) pattern consistent with no West-East introgression, the major exception (3.8%) being a genome-terminal region of duplicated genes involved in host immune system evasion. As expected from its function, this is a region of maintenance of ancestral polymorphism: The lack of clear splitting signal cannot be interpreted as evidence of introgression. The EastMCMV genome sequences reported here can therefore serve as a well-described resource for exploration of murid MCMV diversity. Copyright © 2018 Elsevier Inc. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simon, Horst D.; Zorn, Manfred D.; Spengler, Sylvia J.
The pace of extraordinary advances in molecular biology has accelerated in the past decade due in large part to discoveries coming from genome projects on human and model organisms. The advances in the genome project so far, happening well ahead of schedule and under budget, have exceeded any dreams by its protagonists, let alone formal expectations. Biologists expect the next phase of the genome project to be even more startling in terms of dramatic breakthroughs in our understanding of human biology, the biology of health and of disease. Only today can biologists begin to envision the experimental, computational and theoretical steps necessary to exploit genome sequence information for its medical impact, its contribution to biotechnology and economic competitiveness, and its ultimate contribution to environmental quality. High performance computing has become one of the critical enabling technologies, which will help to translate this vision of future advances in biology into reality. Biologists are increasingly becoming aware of the potential of high performance computing. The goal of this tutorial is to introduce the exciting new developments in computational biology and genomics to the high performance computing community.
Identification of genomic indels and structural variations using split reads
2011-01-01
Background Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. Results We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. 
Conclusions Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful. PMID:21787423
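The calibration step can be illustrated with a short sketch. The abstract does not give the formula used to combine observed counts with the simulation-derived calibrations, so the correction below (observed calls scaled by positive predictive value and divided by sensitivity) is an assumption, as are the function names and the toy numbers.

```python
def calibrate(true_positives, false_positives, false_negatives):
    """Sensitivity and positive predictive value from a simulated call set."""
    sensitivity = true_positives / (true_positives + false_negatives)
    ppv = true_positives / (true_positives + false_positives)
    return sensitivity, ppv

def corrected_event_count(observed_calls, sensitivity, ppv):
    """Estimate the true number of events in one size class.

    observed_calls * ppv approximates the calls that are real;
    dividing by sensitivity accounts for real events that were missed.
    """
    return observed_calls * ppv / sensitivity

# Hypothetical simulation outcome for one event class.
sensitivity, ppv = calibrate(true_positives=90, false_positives=10, false_negatives=30)
estimate = corrected_event_count(1000, sensitivity, ppv)
```

Applying a separate sensitivity and PPV to each class of event (for example long deletions versus short insertions) is what lets the per-class biases be corrected independently before the counts are summed.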
Shortt, Jonathan A.; Card, Daren C.; Schield, Drew R.; Liu, Yang; Zhong, Bo; Castoe, Todd A.
2017-01-01
Background In areas where schistosomiasis control programs have been implemented, morbidity and prevalence have been greatly reduced. However, to sustain these reductions and move towards interruption of transmission, new tools for disease surveillance are needed. Genomic methods have the potential to help trace the sources of new infections, and allow us to monitor drug resistance. Large-scale genotyping efforts for schistosome species have been hindered by cost, limited numbers of established target loci, and the small amount of DNA obtained from miracidia, the life stage most readily acquired from humans. Here, we present a method using next generation sequencing to provide high-resolution genomic data from S. japonicum for population-based studies. Methodology/Principal Findings We applied whole genome amplification followed by double digest restriction site associated DNA sequencing (ddRADseq) to individual S. japonicum miracidia preserved on Whatman FTA cards. We found that we could effectively and consistently survey hundreds of thousands of variants from 10,000 to 30,000 loci from archived miracidia as old as six years. An analysis of variation from eight miracidia obtained from three hosts in two villages in Sichuan showed clear population structuring by village and host even within this limited sample. Conclusions/Significance This high-resolution sequencing approach yields three orders of magnitude more information than microsatellite genotyping methods that have been employed over the last decade, creating the potential to answer detailed questions about the sources of human infections and to monitor drug resistance. Costs per sample range from $50-$200, depending on the amount of sequence information desired, and we expect these costs can be reduced further given continued reductions in sequencing costs, improvement of protocols, and parallelization. This approach provides new promise for applying modern genome-scale sampling to S. japonicum surveillance, and could be applied to other schistosome species and other parasitic helminths. PMID:28107347
High resolution identity testing of inactivated poliovirus vaccines
Mee, Edward T.; Minor, Philip D.; Martin, Javier
2015-01-01
Background Definitive identification of poliovirus strains in vaccines is essential for quality control, particularly where multiple wild-type and Sabin strains are produced in the same facility. Sequence-based identification provides the ultimate in identity testing and would offer several advantages over serological methods. Methods We employed random RT-PCR and high throughput sequencing to recover full-length genome sequences from monovalent and trivalent poliovirus vaccine products at various stages of the manufacturing process. Results All expected strains were detected in previously characterised products and the method permitted identification of strains comprising as little as 0.1% of sequence reads. Highly similar Mahoney and Sabin 1 strains were readily discriminated on the basis of specific variant positions. Analysis of a product known to contain incorrect strains demonstrated that the method correctly identified the contaminants. Conclusion Random RT-PCR and shotgun sequencing provided high resolution identification of vaccine components. In addition to the recovery of full-length genome sequences, the method could also be easily adapted to the characterisation of minor variant frequencies and distinction of closely related products on the basis of distinguishing consensus and low frequency polymorphisms. PMID:26049003
Widespread Site-Dependent Buffering of Human Regulatory Polymorphism
Kutyavin, Tanya; Stamatoyannopoulos, John A.
2012-01-01
The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF–binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein–DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human–chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of “perfect” genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements. PMID:22457641
Pettengill, James B.; Pightling, Arthur W.; Baugher, Joseph D.; ...
2016-11-10
The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne Salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genetic distances among a set of samples. Being able to do so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). Finally, when analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates) there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
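The contrast between k-mer based and site-based distances can be shown with a small sketch. It uses exact k-mer sets rather than the MinHash sketches that Mash itself relies on, applies the usual Mash distance transform of the Jaccard index, and uses a simple count of differing aligned sites as a stand-in for a site-based measure; the toy sequences and function names are assumptions.

```python
import math

def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b, k):
    """1 - Jaccard index of the two exact k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def mash_distance(jaccard_index, k):
    """Mash-style transform: -(1/k) * ln(2j / (1 + j)); 1.0 when j is 0."""
    if jaccard_index <= 0:
        return 1.0
    return -math.log(2 * jaccard_index / (1 + jaccard_index)) / k

def site_differences(a, b, missing="-N"):
    """Count aligned positions that differ, ignoring missing data."""
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x not in missing and y not in missing
    )

genome = "ACGTACGGTCA"            # toy sequence
truncated = genome[:6]            # same isolate, incomplete assembly
aligned_truncated = truncated + "-" * (len(genome) - len(truncated))

kmer_dist = jaccard_distance(genome, truncated, 3)
site_dist = site_differences(genome, aligned_truncated)
```

Here the truncated assembly has no nucleotide differences from the full sequence, yet its k-mer distance is well above zero because the missing k-mers count against it, which is the "absent data as informative" behavior the abstract describes.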
Chang, Yun-Juan; Land, Miriam; Hauser, Loren; Chertkov, Olga; Del Rio, Tijana Glavina; Nolan, Matt; Copeland, Alex; Tice, Hope; Cheng, Jan-Fang; Lucas, Susan; Han, Cliff; Goodwin, Lynne; Pitluck, Sam; Ivanova, Natalia; Ovchinikova, Galina; Pati, Amrita; Chen, Amy; Palaniappan, Krishna; Mavromatis, Konstantinos; Liolios, Konstantinos; Brettin, Thomas; Fiebig, Anne; Rohde, Manfred; Abt, Birte; Göker, Markus; Detter, John C; Woyke, Tanja; Bristow, James; Eisen, Jonathan A; Markowitz, Victor; Hugenholtz, Philip; Kyrpides, Nikos C; Klenk, Hans-Peter; Lapidus, Alla
2011-10-15
Ktedonobacter racemifer corrig. Cavaletti et al. 2007 is the type species of the genus Ktedonobacter, which in turn is the type genus of the family Ktedonobacteraceae, the type family of the order Ktedonobacterales within the class Ktedonobacteria in the phylum 'Chloroflexi'. Although K. racemifer shares some morphological features with the actinobacteria, it is of special interest because it was the first cultivated representative of a deep branching unclassified lineage of otherwise uncultivated environmental phylotypes tentatively located within the phylum 'Chloroflexi'. The aerobic, filamentous, non-motile, spore-forming Gram-positive heterotroph was isolated from soil in Italy. The 13,661,586 bp long non-contiguous finished genome consists of ten contigs and is the first reported genome sequence from a member of the class Ktedonobacteria. With its 11,453 protein-coding and 87 RNA genes, it is the largest prokaryotic genome reported so far. It comprises a large number of over-represented COGs, particularly genes associated with transposons, which makes the genetic redundancy within the genome considerably larger than expected by chance. This work is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
Molecular Markers and Cotton Genetic Improvement: Current Status and Future Prospects
Malik, Waqas; Iqbal, Muhammad Zaffar; Ali Khan, Asif; Qayyum, Abdul; Ali Abid, Muhammad; Noor, Etrat; Qadir Ahmad, Muhammad; Hasan Abbasi, Ghulam
2014-01-01
The narrow genetic base and complex allotetraploid genome of cotton (Gossypium hirsutum L.) are stimulating efforts to obtain the polymorphism required for marker-based breeding. The availability of draft genome sequences of G. raimondii and G. arboreum, together with next generation sequencing (NGS) technologies, has facilitated the development of high-throughput marker technologies in cotton. The concepts of genetic diversity, QTL mapping, and marker assisted selection (MAS) are evolving into the more efficient concepts of linkage disequilibrium, association mapping, and genomic selection, respectively. The objective of the current review is to analyze the pace of evolution in molecular marker technologies in cotton during the last ten years in the following four areas: (i) comparative analysis of low- and high-throughput marker technologies available in cotton, (ii) genetic diversity in the available wild and improved gene pools of cotton, (iii) identification of the genomic regions within the cotton genome underlying economic traits, and (iv) marker-based selection methodologies. Moreover, the applications of marker technologies to enhance breeding efficiency in cotton are also summarized. The aforementioned genomic technologies and the integration of several other omics resources are expected to enhance cotton productivity and meet global fiber quantity and quality demands. PMID:25401149
Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures.
Kleftogiannis, Dimitrios; Kalnis, Panos; Bajic, Vladimir B
2013-01-01
A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.
Efficient CRISPR/Cas9-based gene knockout in watermelon.
Tian, Shouwei; Jiang, Linjian; Gao, Qiang; Zhang, Jie; Zong, Mei; Zhang, Haiying; Ren, Yi; Guo, Shaogui; Gong, Guoyi; Liu, Fan; Xu, Yong
2017-03-01
CRISPR/Cas9 system can precisely edit genomic sequence and effectively create knockout mutations in T0 generation watermelon plants. Genome editing offers great advantage to reveal gene function and generate agronomically important mutations to crops. Recently, RNA-guided genome editing system using the type II clustered regularly interspaced short palindromic repeats (CRISPR)-associated protein 9 (Cas9) has been applied to several plant species, achieving successful targeted mutagenesis. Here, we report the genome of watermelon, an important fruit crop, can also be precisely edited by CRISPR/Cas9 system. ClPDS, phytoene desaturase in watermelon, was selected as the target gene because its mutant bears evident albino phenotype. CRISPR/Cas9 system performed genome editing, such as insertions or deletions at the expected position, in transfected watermelon protoplast cells. More importantly, all transgenic watermelon plants harbored ClPDS mutations and showed clear or mosaic albino phenotype, indicating that CRISPR/Cas9 system has technically 100% of genome editing efficiency in transgenic watermelon lines. Furthermore, there were very likely no off-target mutations, indicated by examining regions that were highly homologous to sgRNA sequences. Our results show that CRISPR/Cas9 system is a powerful tool to effectively create knockout mutations in watermelon.
Diversity and Evolution of Mycobacterium tuberculosis: Moving to Whole-Genome-Based Approaches
Niemann, Stefan; Supply, Philip
2014-01-01
Genotyping of clinical Mycobacterium tuberculosis complex (MTBC) strains has become a standard tool for epidemiological tracing and for the investigation of the local and global strain population structure. Of special importance is the analysis of the expansion of multidrug (MDR) and extensively drug-resistant (XDR) strains. Classical genotyping and, more recently, whole-genome sequencing have revealed that the strains of the MTBC are more diverse than previously anticipated. Globally, several phylogenetic lineages can be distinguished whose geographical distribution is markedly variable. Strains of particular (sub)lineages, such as Beijing, seem to be more virulent and associated with enhanced resistance levels and fitness, likely fueling their spread in certain world regions. The upcoming generalization of whole-genome sequencing approaches will expectedly provide more comprehensive insights into the molecular and epidemiological mechanisms involved and lead to better diagnostic and therapeutic tools. PMID:25190252
Moreau, Pierrick; Faury, Nicole; Pepin, Jean-François; Segarra, Amélie; Webb, Stephen
2012-01-01
Although there are a number of ostreid herpesvirus 1 (OsHV-1) variants, it is expected that the true diversity of this virus will be known only after the analysis of significantly more data. To this end, we analyzed 72 OsHV-1 “specimens” collected mainly in France over an 18-year period, from 1993 to 2010. Additional samples were also collected in Ireland, the United States, China, Japan, and New Zealand. Three virus genome regions (open reading frame 4 [ORF4], ORF35, -36, -37, and -38, and ORF42 and -43) were selected for PCR analysis and sequencing. Although ORF4 appeared to be the most polymorphic genome area, distinguishing several genogroups, ORF35, -36, -37, and -38 and ORF42 and -43 also showed variations useful in grouping subpopulations of this virus. PMID:22419803
Liu, Shi; Gao, Peng; Zhu, Qianglong; Luan, Feishi; Davis, Angela R.; Wang, Xiaolu
2016-01-01
Cleaved amplified polymorphic sequence (CAPS) markers are useful tools for detecting single nucleotide polymorphisms (SNPs). This study detected and converted SNP sites into CAPS markers based on high-throughput re-sequencing data in watermelon, for linkage map construction and quantitative trait locus (QTL) analysis. Two inbred lines, Cream of Saskatchewan (COS) and LSW-177, were re-sequenced and analyzed with a custom Perl script for CAPS marker development. 88.7% and 78.5% of the assembled sequences of the two parental materials, respectively, could be mapped to the reference watermelon genome. Comparative assembled genome data analysis provided 225,693 and 19,268 SNPs and indels between the two materials. 532 pairs of CAPS markers were designed with 16 restriction enzymes, among which 271 pairs of primers gave distinct bands of the expected length and polymorphic bands, via PCR and enzyme digestion, with a polymorphic rate of 50.94%. Using the new CAPS markers, an initial CAPS-based genetic linkage map was constructed with the F2 population, spanning 1836.51 cM with 11 linkage groups and 301 markers. Twelve QTLs related to fruit flesh color, length, width, shape index, and brix content were detected. These new CAPS markers will be a valuable resource for breeding programs and genetic studies of watermelon. PMID:27162496
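The polymorphic rate quoted above follows directly from the two marker counts in the abstract. A minimal sketch of that arithmetic is given here; the function and variable names are illustrative and do not come from the study.

```python
# Counts taken from the abstract above; the names are illustrative only.
DESIGNED_PAIRS = 532      # CAPS primer pairs designed
POLYMORPHIC_PAIRS = 271   # pairs giving expected-length, polymorphic bands


def polymorphic_rate(polymorphic: int, designed: int) -> float:
    """Return the share of designed markers that were polymorphic, in percent."""
    if designed <= 0:
        raise ValueError("designed must be a positive count")
    return 100.0 * polymorphic / designed


rate = polymorphic_rate(POLYMORPHIC_PAIRS, DESIGNED_PAIRS)
print(f"{rate:.2f}%")  # prints 50.94%
```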
When are pathogen genome sequences informative of transmission events?
Ferguson, Neil; Jombart, Thibaut
2018-01-01
Recent years have seen the development of numerous methodologies for reconstructing transmission trees in infectious disease outbreaks from densely sampled whole genome sequence data. However, a fundamental and as of yet poorly addressed limitation of such approaches is the requirement for genetic diversity to arise on epidemiological timescales. Specifically, the position of infected individuals in a transmission tree can only be resolved by genetic data if mutations have accumulated between the sampled pathogen genomes. To quantify and compare the useful genetic diversity expected from genetic data in different pathogen outbreaks, we introduce here the concept of ‘transmission divergence’, defined as the number of mutations separating whole genome sequences sampled from transmission pairs. Using parameter values obtained by literature review, we simulate outbreak scenarios alongside sequence evolution using two models described in the literature to describe transmission divergence of ten major outbreak-causing pathogens. We find that while mean values vary significantly between the pathogens considered, their transmission divergence is generally very low, with many outbreaks characterised by large numbers of genetically identical transmission pairs. We describe the impact of transmission divergence on our ability to reconstruct outbreaks using two outbreak reconstruction tools, the R packages outbreaker and phybreak, and demonstrate that, in agreement with previous observations, genetic sequence data of rapidly evolving pathogens such as RNA viruses can provide valuable information on individual transmission events. Conversely, sequence data of pathogens with lower mean transmission divergence, including Streptococcus pneumoniae, Shigella sonnei and Clostridium difficile, provide little to no information about individual transmission events. Our results highlight the informational limitations of genetic sequence data in certain outbreak scenarios, and demonstrate the need to expand the toolkit of outbreak reconstruction tools to integrate other types of epidemiological data. PMID:29420641
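To make the idea of transmission divergence concrete, the following sketch simulates the number of mutations separating the genomes of a transmission pair. It is a simplified illustration and not either of the two models used in the study: it assumes mutations accumulate as a Poisson process with mean equal to substitution rate × genome length × time between samples, and every parameter value in the example call is hypothetical.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_divergence(rate_per_site_per_day: float, genome_length: int,
                        days_between_samples: float, n_pairs: int,
                        seed: int = 0) -> list:
    """Simulate mutation counts separating n_pairs transmission pairs.

    Assumption (not from the paper): mutations follow a Poisson process with
    mean = rate * genome length * time separating the two samples.
    """
    rng = random.Random(seed)
    lam = rate_per_site_per_day * genome_length * days_between_samples
    return [sample_poisson(lam, rng) for _ in range(n_pairs)]


def fraction_identical(divergences: list) -> float:
    """Share of transmission pairs whose genomes are genetically identical."""
    return sum(1 for d in divergences if d == 0) / len(divergences)


# Hypothetical parameter values, chosen only for illustration.
pairs = simulate_divergence(1e-6, 10_000, 10, n_pairs=1_000)
print(fraction_identical(pairs))
```

Under this simplified model the expected share of identical pairs is exp(-mean), so a low mean divergence translates directly into many pairs that genetic data cannot distinguish.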
Boutte, Julien; Aliaga, Benoît; Lima, Oscar; Ferreira de Carvalho, Julie; Ainouche, Abdelkader; Macas, Jiri; Rousseau-Gueutin, Mathieu; Coriton, Olivier; Ainouche, Malika; Salmon, Armel
2015-01-01
Gene and whole-genome duplications are widespread in plant nuclear genomes, resulting in sequence heterogeneity. Identification of duplicated genes may be particularly challenging in highly redundant genomes, especially when there are no diploid parents as a reference. Here, we developed a pipeline to detect the different copies in the ribosomal RNA gene family in the hexaploid grass Spartina maritima from next-generation sequencing (Roche-454) reads. The heterogeneity of the different domains of the highly repeated 45S unit was explored by identifying single nucleotide polymorphisms (SNPs) and assembling reads based on shared polymorphisms. SNPs were validated using comparisons with Illumina sequence data sets and by cloning and Sanger (re)sequencing. Using this approach, 29 validated polymorphisms and 11 validated haplotypes were reported (out of 34 and 20, respectively, that were initially predicted by our program). The rDNA domains of S. maritima have similar lengths as those found in other Poaceae, apart from the 5′-ETS, which is approximately two-times longer in S. maritima. Sequence homogeneity was encountered in coding regions and both internal transcribed spacers (ITS), whereas high intragenomic variability was detected in the intergenic spacer (IGS) and the external transcribed spacer (ETS). Molecular cytogenetic analysis by fluorescent in situ hybridization (FISH) revealed the presence of one pair of 45S rDNA signals on the chromosomes of S. maritima instead of three expected pairs for a hexaploid genome, indicating loss of duplicated homeologous loci through the diploidization process. The procedure developed here may be used at any ploidy level and using different sequencing technologies. PMID:26530424
Bohra, Abhishek; Pandey, Manish K; Jha, Uday C; Singh, Balwant; Singh, Indra P; Datta, Dibendu; Chaturvedi, Sushil K; Nadarajan, N; Varshney, Rajeev K
2014-06-01
Given recent advances in pulse molecular biology, genomics-driven breeding has emerged as a promising approach to address the issues of limited genetic gain and low productivity in various pulse crops. The global population is continuously increasing and is expected to reach nine billion by 2050. This huge population pressure will lead to severe shortage of food, natural resources and arable land. Such an alarming situation is most likely to arise in developing countries due to increase in the proportion of people suffering from protein and micronutrient malnutrition. Pulses being a primary and affordable source of proteins and minerals play a key role in alleviating the protein calorie malnutrition, micronutrient deficiencies and other undernourishment-related issues. Additionally, pulses are a vital source of livelihood generation for millions of resource-poor farmers practising agriculture in the semi-arid and sub-tropical regions. Limited success achieved through conventional breeding so far in most of the pulse crops will not be enough to feed the ever increasing population. In this context, genomics-assisted breeding (GAB) holds promise in enhancing the genetic gains. Though pulses have long been considered as orphan crops, recent advances in the area of pulse genomics are noteworthy, e.g. discovery of genome-wide genetic markers, high-throughput genotyping and sequencing platforms, high-density genetic linkage/QTL maps and, more importantly, the availability of whole-genome sequence. With genome sequence in hand, there is a great scope to apply genome-wide methods for trait mapping using association studies and to choose desirable genotypes via genomic selection. It is anticipated that GAB will speed up the progress of genetic improvement of pulses, leading to the rapid development of cultivars with higher yield, enhanced stress tolerance and wider adaptability.
Allegra, Carmen J.
2015-01-01
During the past decade, biomedical technologies have undergone an explosive evolution, from the publication of the first complete human genome in 2003 (after more than a decade of effort and at a cost of hundreds of millions of dollars) to the present time, where a complete genomic sequence can be available in less than a day and at a small fraction of the cost of the original sequence. The widespread availability of next generation genomic sequencing has opened the door to the development of precision oncology. The need to test multiple new targeted agents, both alone and in combination with other targeted therapies as well as classic cytotoxic agents, demands the development of novel therapeutic platforms (particularly Master Protocols) capable of efficiently and effectively testing multiple targeted agents or targeted therapeutic strategies in relatively small patient subpopulations. Here, we describe the Master Protocol concept, with a focus on the expected gains and complexities of the use of this design. An overview of Master Protocols currently active or in development is provided along with a more extensive discussion of the Lung Master Protocol (Lung-MAP study). PMID:26433553
Variability Studies of Two Prunus-Infecting Fabaviruses with the Aid of High-Throughput Sequencing
Sarkisova, Tatiana; Lenz, Ondřej; Přibylová, Jaroslava; Špak, Josef; Lotos, Leonidas; Beta, Christina; Katsiani, Asimina; Candresse, Thierry
2018-01-01
During their lifetime, perennial woody plants are expected to face multiple infection events. Furthermore, multiple genotypes of individual virus species may co-infect the same host. This may eventually lead to a situation where plants harbor complex communities of viral species/strains. Using high-throughput sequencing, we describe co-infection of sweet and sour cherry trees with diverse genomic variants of two closely related viruses, namely prunus virus F (PrVF) and cherry virus F (CVF). Both viruses are most homologous to members of the Fabavirus genus (Secoviridae family). The comparison of CVF and PrVF RNA2 genomic sequences suggests that the two viruses may significantly differ in their expression strategy. Indeed, similar to comoviruses, the smaller genomic segment of PrVF, RNA2, may be translated in two collinear proteins while CVF likely expresses only the shorter of these two proteins. Linked with the observation that identity levels between the coat proteins of these two viruses are significantly below the family species demarcation cut-off, these findings support the idea that CVF and PrVF represent two separate Fabavirus species. PMID:29670059
DOE Office of Scientific and Technical Information (OSTI.GOV)
Larsen, P. E.; Trivedi, G.; Sreedasyam, A.
2010-07-06
Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.
Homoplastic microinversions and the avian tree of life
2011-01-01
Background Microinversions are cytologically undetectable inversions of DNA sequences that accumulate slowly in genomes. Like many other rare genomic changes (RGCs), microinversions are thought to be virtually homoplasy-free evolutionary characters, suggesting that they may be very useful for difficult phylogenetic problems such as the avian tree of life. However, few detailed surveys of these genomic rearrangements have been conducted, making it difficult to assess this hypothesis or understand the impact of microinversions upon genome evolution. Results We surveyed non-coding sequence data from a recent avian phylogenetic study and found substantially more microinversions than expected based upon prior information about vertebrate inversion rates, although this is likely due to underestimation of these rates in previous studies. Most microinversions were lineage-specific or united well-accepted groups. However, some homoplastic microinversions were evident among the informative characters. Hemiplasy, which reflects differences between gene trees and the species tree, did not explain the observed homoplasy. Two specific loci were microinversion hotspots, with high numbers of inversions that included both the homoplastic as well as some overlapping microinversions. Neither stem-loop structures nor detectable sequence motifs were associated with microinversions in the hotspots. Conclusions Microinversions can provide valuable phylogenetic information, although power analysis indicates that large amounts of sequence data will be necessary to identify enough inversions (and similar RGCs) to resolve short branches in the tree of life. Moreover, microinversions are not perfect characters and should be interpreted with caution, just as with any other character type. Independent of their use for phylogenetic analyses, microinversions are important because they have the potential to complicate alignment of non-coding sequences. Despite their low rate of accumulation, they have clearly contributed to genome evolution, suggesting that active identification of microinversions will prove useful in future phylogenomic studies. PMID:21612607
van de Guchte, M; Penaud, S; Grimaldi, C; Barbe, V; Bryson, K; Nicolas, P; Robert, C; Oztas, S; Mangenot, S; Couloux, A; Loux, V; Dervyn, R; Bossy, R; Bolotin, A; Batto, J-M; Walunas, T; Gibrat, J-F; Bessières, P; Weissenbach, J; Ehrlich, S D; Maguin, E
2006-06-13
Lactobacillus delbrueckii ssp. bulgaricus (L. bulgaricus) is a representative of the group of lactic acid-producing bacteria, mainly known for its worldwide application in yogurt production. The genome sequence of this bacterium has been determined and shows the signs of ongoing specialization, with a substantial number of pseudogenes and incomplete metabolic pathways and relatively few regulatory functions. Several unique features of the L. bulgaricus genome support the hypothesis that the genome is in a phase of rapid evolution. (i) Exceptionally high numbers of rRNA and tRNA genes with regard to genome size may indicate that the L. bulgaricus genome has known a recent phase of important size reduction, in agreement with the observed high frequency of gene inactivation and elimination; (ii) a much higher GC content at codon position 3 than expected on the basis of the overall GC content suggests that the composition of the genome is evolving toward a higher GC content; and (iii) the presence of a 47.5-kbp inverted repeat in the replication termination region, an extremely rare feature in bacterial genomes, may be interpreted as a transient stage in genome evolution. The results indicate the adaptation of L. bulgaricus from a plant-associated habitat to the stable protein and lactose-rich milk environment through the loss of superfluous functions and protocooperation with Streptococcus thermophilus.
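Point (ii) above compares the GC content at the third codon position (often called GC3) with the overall GC content. The sketch below shows one simple way to compute both values for a single in-frame coding sequence; the function names and the toy sequence are illustrative assumptions, and the code does not reproduce the authors' genome-wide analysis.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(1 for base in seq if base in ("G", "C")) / len(seq)


def gc3_fraction(cds: str) -> float:
    """Fraction of G or C bases at the third position of each codon.

    Assumes cds is an in-frame coding sequence whose length is a multiple of 3.
    """
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")
    third_positions = cds[2::3]  # bases at index 2, 5, 8, ...
    return gc_fraction(third_positions)


# Toy coding sequence (hypothetical, three codons: ATG GCC AAG).
toy_cds = "ATGGCCAAG"
print(gc_fraction(toy_cds), gc3_fraction(toy_cds))
```

A GC3 value well above the overall GC fraction, averaged over many genes, is the kind of signal the abstract interprets as a genome drifting toward higher GC content.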
Spring-Pearson, Senanu M; Stone, Joshua K; Doyle, Adina; Allender, Christopher J; Okinaka, Richard T; Mayo, Mark; Broomall, Stacey M; Hill, Jessica M; Karavis, Mark A; Hubbard, Kyle S; Insalaco, Joseph M; McNew, Lauren A; Rosenzweig, C Nicole; Gibbons, Henry S; Currie, Bart J; Wagner, David M; Keim, Paul; Tuanyok, Apichai
2015-01-01
The pangenomic diversity in Burkholderia pseudomallei is high, with approximately 5.8% of the genome consisting of genomic islands. Genomic islands are known hotspots for recombination driven primarily by site-specific recombination associated with tRNAs. However, recombination rates in other portions of the genome are also high, a feature we expected to disrupt gene order. We analyzed the pangenome of 37 isolates of B. pseudomallei and demonstrate that the pangenome is 'open', with approximately 136 new genes identified with each new genome sequenced, and that the global core genome consists of 4568±16 homologs. Genes associated with metabolism were statistically overrepresented in the core genome, and genes associated with mobile elements, disease, and motility were primarily associated with accessory portions of the pangenome. The frequency distribution of genes present in between 1 and 37 of the genomes analyzed matches well with a model of genome evolution in which 96% of the genome has very low recombination rates but 4% of the genome recombines readily. Using homologous genes among pairs of genomes, we found that gene order was highly conserved among strains, despite the high recombination rates previously observed. High rates of gene transfer and recombination are incompatible with retaining gene order unless these processes are either highly localized to specific sites within the genome, or are characterized by symmetrical gene gain and loss. Our results demonstrate that both processes occur: localized recombination introduces many new genes at relatively few sites, and recombination throughout the genome generates the novel multi-locus sequence types previously observed while preserving gene order.
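The notions of a core genome and an 'open' pangenome can be illustrated with a gene presence/absence table. The following is a minimal sketch under simplifying assumptions: each genome is reduced to a set of homolog identifiers, and the isolate and gene names are made up for the example. It is not the pipeline used in the study.

```python
from collections import Counter


def classify_pangenome(genomes: dict):
    """Split genes into core (present in every genome) and accessory (the rest).

    genomes maps an isolate name to the set of homolog identifiers it carries.
    """
    n_genomes = len(genomes)
    counts = Counter(gene for genes in genomes.values() for gene in genes)
    core = {gene for gene, seen_in in counts.items() if seen_in == n_genomes}
    accessory = set(counts) - core
    return core, accessory


def new_genes_per_genome(gene_sets: list) -> list:
    """Count genes not seen before as each genome is added in the given order."""
    seen = set()
    new_counts = []
    for genes in gene_sets:
        new_counts.append(len(genes - seen))
        seen |= genes
    return new_counts


# Hypothetical toy data: three isolates and five homolog groups.
toy = {
    "isolate_A": {"g1", "g2", "g3"},
    "isolate_B": {"g1", "g2", "g4"},
    "isolate_C": {"g1", "g5"},
}
core, accessory = classify_pangenome(toy)
print(sorted(core), sorted(accessory))
print(new_genes_per_genome(list(toy.values())))
```

In this framing, a pangenome is called open when the count of new genes per added genome does not fall toward zero, which is the pattern the abstract reports with roughly 136 new genes per genome.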
de Souza, Flávio S.J.; Franchini, Lucía F.; Rubinstein, Marcelo
2013-01-01
Transposable elements (TEs) are mobile genetic sequences that can jump around the genome from one location to another, behaving as genomic parasites. TEs have been particularly effective in colonizing mammalian genomes, and such heavy TE load is expected to have conditioned genome evolution. Indeed, studies conducted both at the gene and genome levels have uncovered TE insertions that seem to have been co-opted—or exapted—by providing transcription factor binding sites (TFBSs) that serve as promoters and enhancers, leading to the hypothesis that TE exaptation is a major factor in the evolution of gene regulation. Here, we critically review the evidence for exaptation of TE-derived sequences as TFBSs, promoters, enhancers, and silencers/insulators both at the gene and genome levels. We classify the functional impact attributed to TE insertions into four categories of increasing complexity and argue that so far very few studies have conclusively demonstrated exaptation of TEs as transcriptional regulatory regions. We also contend that many genome-wide studies dealing with TE exaptation in recent lineages of mammals are still inconclusive and that the hypothesis of rapid transcriptional regulatory rewiring mediated by TE mobilization must be taken with caution. Finally, we suggest experimental approaches that may help attributing higher-order functions to candidate exapted TEs. PMID:23486611
Rapid diversification of five Oryza AA genomes associated with rice adaptation.
Zhang, Qun-Jie; Zhu, Ting; Xia, En-Hua; Shi, Chao; Liu, Yun-Long; Zhang, Yun; Liu, Yuan; Jiang, Wen-Kai; Zhao, You-Jie; Mao, Shu-Yan; Zhang, Li-Ping; Huang, Hui; Jiao, Jun-Ying; Xu, Ping-Zhen; Yao, Qiu-Yang; Zeng, Fan-Chun; Yang, Li-Li; Gao, Ju; Tao, Da-Yun; Wang, Yue-Ju; Bennetzen, Jeffrey L; Gao, Li-Zhi
2014-11-18
Comparative genomic analyses among closely related species can greatly enhance our understanding of plant gene and genome evolution. We report de novo-assembled AA-genome sequences for Oryza nivara, Oryza glaberrima, Oryza barthii, Oryza glumaepatula, and Oryza meridionalis. Our analyses reveal massive levels of genomic structural variation, including segmental duplication and rapid gene family turnover, with particularly high instability in defense-related genes. We show, on a genomic scale, how lineage-specific expansion or contraction of gene families has led to their morphological and reproductive diversification, thus enlightening the evolutionary process of speciation and adaptation. Despite strong purifying selective pressures on most Oryza genes, we documented a large number of positively selected genes, especially those genes involved in flower development, reproduction, and resistance-related processes. These diversifying genes are expected to have played key roles in adaptations to their ecological niches in Asia, South America, Africa and Australia. Extensive variation in noncoding RNA gene numbers, function enrichment, and rates of sequence divergence might also help account for the different genetic adaptations of these rice species. Collectively, these resources provide new opportunities for evolutionary genomics, numerous insights into recent speciation, a valuable database of functional variation for crop improvement, and tools for efficient conservation of wild rice germplasm.
Rapid diversification of five Oryza AA genomes associated with rice adaptation
Zhang, Qun-Jie; Zhu, Ting; Xia, En-Hua; Shi, Chao; Liu, Yun-Long; Zhang, Yun; Liu, Yuan; Jiang, Wen-Kai; Zhao, You-Jie; Mao, Shu-Yan; Zhang, Li-Ping; Huang, Hui; Jiao, Jun-Ying; Xu, Ping-Zhen; Yao, Qiu-Yang; Zeng, Fan-Chun; Yang, Li-Li; Gao, Ju; Tao, Da-Yun; Wang, Yue-Ju; Bennetzen, Jeffrey L.; Gao, Li-Zhi
2014-01-01
Comparative genomic analyses among closely related species can greatly enhance our understanding of plant gene and genome evolution. We report de novo-assembled AA-genome sequences for Oryza nivara, Oryza glaberrima, Oryza barthii, Oryza glumaepatula, and Oryza meridionalis. Our analyses reveal massive levels of genomic structural variation, including segmental duplication and rapid gene family turnover, with particularly high instability in defense-related genes. We show, on a genomic scale, how lineage-specific expansion or contraction of gene families has led to their morphological and reproductive diversification, thus enlightening the evolutionary process of speciation and adaptation. Despite strong purifying selective pressures on most Oryza genes, we documented a large number of positively selected genes, especially those genes involved in flower development, reproduction, and resistance-related processes. These diversifying genes are expected to have played key roles in adaptations to their ecological niches in Asia, South America, Africa and Australia. Extensive variation in noncoding RNA gene numbers, function enrichment, and rates of sequence divergence might also help account for the different genetic adaptations of these rice species. Collectively, these resources provide new opportunities for evolutionary genomics, numerous insights into recent speciation, a valuable database of functional variation for crop improvement, and tools for efficient conservation of wild rice germplasm. PMID:25368197
Hart, Elizabeth A; Caccamo, Mario; Harrow, Jennifer L; Humphray, Sean J; Gilbert, James GR; Trevanion, Steve; Hubbard, Tim; Rogers, Jane; Rothschild, Max F
2007-01-01
Background We describe here the sequencing, annotation and comparative analysis of an 8 Mb region of pig chromosome 17, which provides a useful test region to assess coverage and quality for the pig genome sequencing project. We report our findings comparing the annotation of draft sequence assembled at different depths of coverage. Results Within this region we annotated 71 loci, of which 53 are orthologous to human known coding genes. When compared to the syntenic regions in human (20q13.13-q13.33) and mouse (chromosome 2, 167.5 Mb-178.3 Mb), this region was found to be highly conserved with respect to gene order. The most notable difference between the three species is the presence of a large expansion of zinc finger coding genes and pseudogenes on mouse chromosome 2 between Edn3 and Phactr3 that is absent from pig and human. All of our annotation has been made publicly available in the Vertebrate Genome Annotation browser, VEGA. We assessed the impact of coverage on sequence assembly across this region and found, as expected, that increased sequence depth resulted in fewer, longer contigs. One-third of our annotated loci could not be fully re-aligned back to the low coverage version of the sequence, principally because the transcripts are fragmented over several contigs. Conclusion We have demonstrated the considerable advantages of sequencing at increased read depths and discuss the implications that lower coverage sequence may have on subsequent comparative and functional studies, particularly those involving complex loci such as GNAS. PMID:17705864
Ho, Wai Kuan; Muchugi, Alice; Muthemba, Samuel; Kariba, Robert; Mavenkeni, Busiso Olga; Hendre, Prasad; Song, Bo; Van Deynze, Allen; Massawe, Festo; Mayes, Sean
2016-06-01
Maximizing the research output from a limited investment is often the major challenge for minor and underutilized crops. However, such crops may be tolerant to biotic and abiotic stresses and are adapted to local, marginal, and low-input environments. Their development through breeding will provide an important resource for future agricultural system resilience and diversification in the context of changing climates and the need to achieve food security. The African Orphan Crops Consortium recognizes the values of genomic resources in facilitating the improvement of such crops. Prior to beginning genome sequencing there is a need for an assessment of line varietal purity and to estimate any residual heterozygosity. Here we present an example from bambara groundnut (Vigna subterranea (L.) Verdc.), an underutilized drought tolerant African legume. Two released varieties from Zimbabwe, identified as potential genotypes for whole genome sequencing (WGS), were genotyped with 20 species-specific SSR markers. The results indicate that the cultivars are actually a mix of related inbred genotypes, and the analysis allowed a strategy of single plant selection to be used to generate non-heterogeneous DNA for WGS. The markers also confirmed very low levels of heterozygosity within individual plants. The application of a pre-screen using co-dominant microsatellite markers is expected to substantially improve the genome assembly, compared to a cultivar bulking approach that could have been adopted.
Genetic screens in human cells using the CRISPR-Cas9 system.
Wang, Tim; Wei, Jenny J; Sabatini, David M; Lander, Eric S
2014-01-03
The bacterial clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 system for genome editing has greatly expanded the toolbox for mammalian genetics, enabling the rapid generation of isogenic cell lines and mice with modified alleles. Here, we describe a pooled, loss-of-function genetic screening approach suitable for both positive and negative selection that uses a genome-scale lentiviral single-guide RNA (sgRNA) library. sgRNA expression cassettes were stably integrated into the genome, which enabled a complex mutant pool to be tracked by massively parallel sequencing. We used a library containing 73,000 sgRNAs to generate knockout collections and performed screens in two human cell lines. A screen for resistance to the nucleotide analog 6-thioguanine identified all expected members of the DNA mismatch repair pathway, whereas another for the DNA topoisomerase II (TOP2A) poison etoposide identified TOP2A, as expected, and also cyclin-dependent kinase 6, CDK6. A negative selection screen for essential genes identified numerous gene sets corresponding to fundamental processes. Last, we show that sgRNA efficiency is associated with specific sequence motifs, enabling the prediction of more effective sgRNAs. Collectively, these results establish Cas9/sgRNA screens as a powerful tool for systematic genetic analysis in mammalian cells.
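Tracking a pooled mutant population by sequencing comes down to comparing sgRNA read counts before and after selection. The sketch below shows one common way to do this (depth normalization followed by a log2 ratio). It is an assumption for illustration, not the scoring procedure of the study; the function name, the guide names, and the pseudocount are hypothetical.

```python
import math

def log2_fold_changes(before, after, pseudocount=1.0):
    """Per-sgRNA log2 fold change between two count tables.

    Counts are scaled to reads per million so that differences in
    sequencing depth between the two samples do not dominate the ratio.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    lfc = {}
    for guide, count in before.items():
        rpm_before = count / total_before * 1e6
        rpm_after = after.get(guide, 0) / total_after * 1e6
        lfc[guide] = math.log2((rpm_after + pseudocount) / (rpm_before + pseudocount))
    return lfc

# Hypothetical counts: sgA is enriched after a positive selection.
before = {"sgA": 100, "sgB": 100, "sgC": 100}
after = {"sgA": 400, "sgB": 100, "sgC": 100}
print(log2_fold_changes(before, after))
```

In a positive selection (for example drug resistance) enriched guides have positive values; in a negative selection for essential genes the informative guides are the depleted ones.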
Next-Generation Sequencing and Genome Editing in Plant Virology
Hadidi, Ahmed; Flores, Ricardo; Candresse, Thierry; Barba, Marina
2016-01-01
Next-generation sequencing (NGS) has been applied to plant virology since 2009. NGS provides highly efficient, rapid, low-cost, high-throughput DNA or RNA sequencing of the genomes of plant viruses and viroids and of the specific small RNAs generated during the infection process. These small RNAs, which frequently cover the whole genome of the infectious agent, are 21–24 nt long and are known as vsRNAs for viruses and vd-sRNAs for viroids. NGS has been used in a number of studies in plant virology including, but not limited to, discovery of novel viruses and viroids as well as detection and identification of those pathogens already known, analysis of genome diversity and evolution, and study of pathogen epidemiology. The genome-editing method based on the clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 system has recently been used successfully to engineer resistance to DNA geminiviruses (family Geminiviridae) by targeting different viral genome sequences in infected Nicotiana benthamiana or Arabidopsis plants. The DNA viruses targeted include tomato yellow leaf curl virus and merremia mosaic virus (begomovirus); beet curly top virus and beet severe curly top virus (curtovirus); and bean yellow dwarf virus (mastrevirus). The technique has also been used against the RNA viruses zucchini yellow mosaic virus, papaya ringspot virus and turnip mosaic virus (potyvirus) and cucumber vein yellowing virus (ipomovirus, family Potyviridae) by targeting the translation initiation genes eIF4E in cucumber or Arabidopsis plants. Given these recent advances of major importance, it is expected that NGS and CRISPR-Cas technologies will play a significant role in the very near future in advancing the field of plant virology and connecting it with other related fields of biology. PMID:27617007
Detection of Emerging Vaccine-Related Polioviruses by Deep Sequencing.
Sahoo, Malaya K; Holubar, Marisa; Huang, ChunHong; Mohamed-Hadley, Alisha; Liu, Yuanyuan; Waggoner, Jesse J; Troy, Stephanie B; Garcia-Garcia, Lourdes; Ferreyra-Reyes, Leticia; Maldonado, Yvonne; Pinsky, Benjamin A
2017-07-01
Oral poliovirus vaccine can mutate to regain neurovirulence. To date, evaluation of these mutations has been performed primarily on culture-enriched isolates by using conventional Sanger sequencing. We therefore developed a culture-independent, deep-sequencing method targeting the 5' untranslated region (UTR) and P1 genomic region to characterize vaccine-related poliovirus variants. Error analysis of the deep-sequencing method demonstrated reliable detection of poliovirus mutations at levels of <1%, depending on read depth. Sequencing of viral nucleic acids from the stool of vaccinated, asymptomatic children and their close contacts collected during a prospective cohort study in Veracruz, Mexico, revealed no vaccine-derived polioviruses. This was expected given that the longest duration between sequenced sample collection and the end of the most recent national immunization week was 66 days. However, we identified many low-level variants (<5%) distributed across the 5' UTR and P1 genomic region in all three Sabin serotypes, as well as vaccine-related viruses with multiple canonical mutations associated with phenotypic reversion present at high levels (>90%). These results suggest that monitoring emerging vaccine-related poliovirus variants by deep sequencing may aid in the poliovirus endgame and efforts to ensure global polio eradication. Copyright © 2017 Sahoo et al.
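The statement that sub-1% variants are detectable "depending on read depth" can be made concrete with a simple binomial model. The sketch below is an illustration under stated assumptions, not the error analysis performed in the study: it ignores sequencing error and assumes each read samples the variant independently. The function name and the example thresholds are hypothetical.

```python
from math import comb

def detection_probability(depth, freq, min_reads=1):
    """Probability of seeing at least `min_reads` variant reads.

    Models the variant read count at one position as
    Binomial(depth, freq); sequencing error is not modelled.
    """
    p_below = sum(
        comb(depth, i) * freq**i * (1 - freq) ** (depth - i)
        for i in range(min_reads)
    )
    return 1.0 - p_below

# A 0.5% variant, requiring 5 supporting reads (illustrative threshold).
for depth in (500, 2000, 10000):
    print(depth, detection_probability(depth, 0.005, min_reads=5))
```

In practice the supporting-read threshold has to sit above the platform's background error rate, which is why the achievable detection limit depends on both depth and error analysis.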
A domain-centric solution to functional genomics via dcGO Predictor
2013-01-01
Background Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. 
This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era. PMID:23514627
Turner, Barbara; Paun, Ovidiu; Munzinger, Jérôme; Chase, Mark W.; Samuel, Rosabelle
2016-01-01
Background and Aims Some plant groups, especially on islands, have been shaped by strong ancestral bottlenecks and rapid, recent radiation of phenotypic characters. Single molecular markers are often not informative enough for phylogenetic reconstruction in such plant groups. Whole plastid genomes and nuclear ribosomal DNA (nrDNA) are viewed by many researchers as sources of information for phylogenetic reconstruction of groups in which expected levels of divergence in standard markers are low. Here we evaluate the usefulness of these data types to resolve phylogenetic relationships among closely related Diospyros species. Methods Twenty-two closely related Diospyros species from New Caledonia were investigated using whole plastid genomes and nrDNA data from low-coverage next-generation sequencing (NGS). Phylogenetic trees were inferred using maximum parsimony, maximum likelihood and Bayesian inference on separate plastid and nrDNA and combined matrices. Key Results The plastid and nrDNA sequences were, singly and together, unable to provide well supported phylogenetic relationships among the closely related New Caledonian Diospyros species. In the nrDNA, a 6-fold greater percentage of parsimony-informative characters compared with plastid DNA was found, but the total number of informative sites was greater for the much larger plastid DNA genomes. Combining the plastid and nuclear data improved resolution. Plastid results showed a trend towards geographical clustering of accessions rather than following taxonomic species. Conclusions In plant groups in which multiple plastid markers are not sufficiently informative, an investigation at the level of the entire plastid genome may also not be sufficient for detailed phylogenetic reconstruction. 
Sequencing of complete plastid genomes and nrDNA repeats seems to clarify some relationships among the New Caledonian Diospyros species, but the higher percentage of parsimony-informative characters in nrDNA compared with plastid DNA did not help to resolve the phylogenetic tree because the total number of variable sites was much lower than in the entire plastid genome. The geographical clustering of the individuals against a background of overall low sequence divergence could indicate transfer of plastid genomes due to hybridization and introgression following secondary contact. PMID:27098088
Identifying micro-inversions using high-throughput sequencing reads.
He, Feifei; Li, Yang; Tang, Yu-Hang; Ma, Jian; Zhu, Huaiqiu
2016-01-11
The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. It is acknowledged that MIs are an important form of genomic variation and may play roles in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore our tool is expected to be useful for the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID .
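To show what a micro-inversion looks like at the sequence level, here is a brute-force toy. It is explicitly not MID's dynamic programming path-finding algorithm: it assumes a read already placed against an equal-length reference window, with no mismatches or indels, and simply searches for a segment whose reverse complement explains the read. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_micro_inversion(ref, read, min_len=3):
    """Return (start, end) of a reference segment whose inversion yields `read`.

    Tries the shortest candidate segments first; returns None if the read
    equals the reference or no single inversion explains it.
    """
    if len(ref) != len(read) or ref == read:
        return None
    n = len(ref)
    for length in range(min_len, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            if ref[:start] + revcomp(ref[start:end]) + ref[end:] == read:
                return start, end
    return None

ref = "ACGTTGCAAGGCT"
read = ref[:4] + revcomp(ref[4:9]) + ref[9:]  # simulate a 5 bp inversion
print(find_micro_inversion(ref, read))
```

The returned interval need not be the one used to simulate the read, because different intervals can produce the same inverted sequence. That ambiguity is one reason real detectors need more careful scoring.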
Tennessen, Jacob A; Bollmann, Stephanie R; Blouin, Michael S
2017-07-05
The aquatic planorbid snail Biomphalaria glabrata is one of the most intensively-studied mollusks due to its role in the transmission of schistosomiasis. Its 916 Mb genome has recently been sequenced and annotated, but it remains poorly assembled. Here, we used targeted capture markers to map over 10,000 B. glabrata scaffolds in a linkage cross of 94 F1 offspring, generating 24 linkage groups (LGs). We added additional scaffolds to these LGs based on linkage disequilibrium (LD) analysis of targeted capture and whole-genome sequences of 96 unrelated snails. Our final linkage map consists of 18,613 scaffolds comprising 515 Mb, representing 56% of the genome and 75% of genic and nonrepetitive regions. There are 18 large (> 10 Mb) LGs, likely representing the expected 18 haploid chromosomes, and > 50% of the genome has been assigned to LGs of at least 17 Mb. Comparisons with other gastropod genomes reveal patterns of synteny and chromosomal rearrangements. Linkage relationships of key immune-relevant genes may help clarify snail-schistosome interactions. By focusing on linkage among genic and nonrepetitive regions, we have generated a useful resource for associating snail phenotypes with causal genes, even in the absence of a complete genome assembly. A similar approach could potentially improve numerous poorly-assembled genomes in other taxa. This map will facilitate future work on this host of a serious human parasite. Copyright © 2017 Tennessen et al.
Yang, S; Chen, S; Geng, X X; Yan, G; Li, Z Y; Meng, J L; Cowling, W A; Zhou, W J
2016-04-01
We present the first genetic map of an allohexaploid Brassica species, based on segregating microsatellite markers in a doubled haploid mapping population generated from a hybrid between two hexaploid parents. This study reports the first genetic map of trigenomic Brassica. A doubled haploid mapping population consisting of 189 lines was obtained via microspore culture from a hybrid H16-1 derived from a cross between two allohexaploid Brassica lines (7H170-1 and Y54-2). Simple sequence repeat primer pairs specific to the A genome (107), B genome (44) and C genome (109) were used to construct a genetic linkage map of the population. Twenty-seven linkage groups were resolved from 274 polymorphic loci on the A genome (109), B genome (49) and C genome (116) covering a total genetic distance of 3178.8 cM with an average distance between markers of 11.60 cM. This is the first genetic framework map for the artificially synthesized Brassica allohexaploids. The linkage groups represent the expected complement of chromosomes in the A, B and C genomes from the original diploid and tetraploid parents. This framework linkage map will be valuable for QTL analysis and future genetic improvement of a new allohexaploid Brassica species, and in improving our understanding of the genetic control of meiosis in new polyploids.
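The map statistics quoted above can be checked with a few lines of arithmetic. The locus counts, total map length, and number of linkage groups are taken from the abstract. The per-interval figure is an added assumption: it supposes that every locus lies on one of the 27 groups, so that the number of marker intervals is loci minus groups.

```python
# Values reported in the abstract
loci = {"A": 109, "B": 49, "C": 116}
total_cm = 3178.8
linkage_groups = 27

total_loci = sum(loci.values())
mean_per_locus = total_cm / total_loci

# Assumption: a group with k loci has k - 1 intervals, all loci mapped
mean_per_interval = total_cm / (total_loci - linkage_groups)

print(total_loci, round(mean_per_locus, 2), round(mean_per_interval, 2))
```

The reported 11.60 cM matches the total map length divided by the number of loci; dividing by the number of intervals instead would give a somewhat larger spacing.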
Sequence-Based Genotyping for Marker Discovery and Co-Dominant Scoring in Germplasm and Populations
Truong, Hoa T.; Ramos, A. Marcos; Yalcin, Feyruz; de Ruiter, Marjo; van der Poel, Hein J. A.; Huvenaars, Koen H. J.; Hogers, René C. J.; van Enckevort, Leonora. J. G.; Janssen, Antoine; van Orsouw, Nathalie J.; van Eijk, Michiel J. T.
2012-01-01
Conventional marker-based genotyping platforms are widely available, but not without their limitations. In this context, we developed Sequence-Based Genotyping (SBG), a technology for simultaneous marker discovery and co-dominant scoring, using next-generation sequencing. SBG offers users several advantages including a generic sample preparation method, a highly robust genome complexity reduction strategy to facilitate de novo marker discovery across entire genomes, and a uniform bioinformatics workflow strategy to achieve genotyping goals tailored to individual species, regardless of the availability of a reference sequence. The most distinguishing features of this technology are the ability to genotype any population structure, regardless whether parental data is included, and the ability to co-dominantly score SNP markers segregating in populations. To demonstrate the capabilities of SBG, we performed marker discovery and genotyping in Arabidopsis thaliana and lettuce, two plant species of diverse genetic complexity and backgrounds. Initially we obtained 1,409 SNPs for arabidopsis, and 5,583 SNPs for lettuce. Further filtering of the SNP dataset produced over 1,000 high quality SNP markers for each species. We obtained a genotyping rate of 201.2 genotypes/SNP and 58.3 genotypes/SNP for arabidopsis (n = 222 samples) and lettuce (n = 87 samples), respectively. Linkage mapping using these SNPs resulted in stable map configurations. We have therefore shown that the SBG approach presented provides users with the utmost flexibility in garnering high quality markers that can be directly used for genotyping and downstream applications. Until advances and costs will allow for routine whole-genome sequencing of populations, we expect that sequence-based genotyping technologies such as SBG will be essential for genotyping of model and non-model genomes alike. PMID:22662172
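The genotyping rates above are easier to compare across species when expressed as a fraction of samples. The sketch below assumes that "genotypes/SNP" means the average number of samples with a called genotype per SNP; that reading is an interpretation, not something the abstract states explicitly.

```python
def mean_call_rate(genotypes_per_snp, n_samples):
    """Average fraction of samples genotyped per SNP."""
    return genotypes_per_snp / n_samples

# Figures from the abstract
arabidopsis = mean_call_rate(201.2, 222)
lettuce = mean_call_rate(58.3, 87)
print(f"arabidopsis: {arabidopsis:.1%}, lettuce: {lettuce:.1%}")
```

Under this reading, roughly nine in ten arabidopsis samples and two in three lettuce samples were scored per SNP on average.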
A genetically anchored physical framework for Theobroma cacao cv. Matina 1-6
2011-01-01
Background The fermented dried seeds of Theobroma cacao (cacao tree) are the main ingredient in chocolate. World cocoa production was estimated to be 3 million tons in 2010 with an annual estimated average growth rate of 2.2%. The cacao bean production industry is currently under threat from a rise in fungal diseases including black pod, frosty pod, and witches' broom. In order to address these issues, genome-sequencing efforts have been initiated recently to facilitate identification of genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars. However, problems inherent with assembly and resolution of distal regions of complex eukaryotic genomes, such as gaps, chimeric joins, and unresolvable repeat-induced compressions, have been unavoidably encountered with the sequencing strategies selected. Results Here, we describe the construction of a BAC-based integrated genetic-physical map of the T. cacao cultivar Matina 1-6 which is designed to augment and enhance these sequencing efforts. Three BAC libraries, each comprised of 10× coverage, were constructed and fingerprinted. 230 genetic markers from a high-resolution genetic recombination map and 96 Arabidopsis-derived conserved ortholog set (COS) II markers were anchored using pooled overgo hybridization. A dense tile path consisting of 29,383 BACs was selected and end-sequenced. The physical map consists of 154 contigs and 4,268 singletons. Forty-nine contigs are genetically anchored and ordered to chromosomes for a total span of 307.2 Mbp. The unanchored contigs (105) span 67.4 Mbp and therefore the estimated genome size of T. cacao is 374.6 Mbp. A comparative analysis with A. thaliana, V. vinifera, and P. trichocarpa suggests that comparisons of the genome assemblies of these distantly related species could provide insights into genome structure, evolutionary history, conservation of functional sites, and improvements in physical map assembly. 
A comparison between the two T. cacao cultivars Matina 1-6 and Criollo indicates a high degree of collinearity in their genomes, yet rearrangements were also observed. Conclusions The results presented in this study are a stand-alone resource for functional exploitation and enhancement of Theobroma cacao but are also expected to complement and augment ongoing genome-sequencing efforts. This resource will serve as a template for refinement of the T. cacao genome through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays. PMID:21846342
A genetically anchored physical framework for Theobroma cacao cv. Matina 1-6.
Saski, Christopher A; Feltus, Frank A; Staton, Margaret E; Blackmon, Barbara P; Ficklin, Stephen P; Kuhn, David N; Schnell, Raymond J; Shapiro, Howard; Motamayor, Juan Carlos
2011-08-16
The fermented dried seeds of Theobroma cacao (cacao tree) are the main ingredient in chocolate. World cocoa production was estimated to be 3 million tons in 2010 with an annual estimated average growth rate of 2.2%. The cacao bean production industry is currently under threat from a rise in fungal diseases including black pod, frosty pod, and witches' broom. In order to address these issues, genome-sequencing efforts have been initiated recently to facilitate identification of genetic markers and genes that could be utilized to accelerate the release of robust T. cacao cultivars. However, problems inherent with assembly and resolution of distal regions of complex eukaryotic genomes, such as gaps, chimeric joins, and unresolvable repeat-induced compressions, have been unavoidably encountered with the sequencing strategies selected. Here, we describe the construction of a BAC-based integrated genetic-physical map of the T. cacao cultivar Matina 1-6 which is designed to augment and enhance these sequencing efforts. Three BAC libraries, each comprised of 10× coverage, were constructed and fingerprinted. 230 genetic markers from a high-resolution genetic recombination map and 96 Arabidopsis-derived conserved ortholog set (COS) II markers were anchored using pooled overgo hybridization. A dense tile path consisting of 29,383 BACs was selected and end-sequenced. The physical map consists of 154 contigs and 4,268 singletons. Forty-nine contigs are genetically anchored and ordered to chromosomes for a total span of 307.2 Mbp. The unanchored contigs (105) span 67.4 Mbp and therefore the estimated genome size of T. cacao is 374.6 Mbp. A comparative analysis with A. thaliana, V. vinifera, and P. trichocarpa suggests that comparisons of the genome assemblies of these distantly related species could provide insights into genome structure, evolutionary history, conservation of functional sites, and improvements in physical map assembly. A comparison between the two T. cacao cultivars Matina 1-6 and Criollo indicates a high degree of collinearity in their genomes, yet rearrangements were also observed. The results presented in this study are a stand-alone resource for functional exploitation and enhancement of Theobroma cacao but are also expected to complement and augment ongoing genome-sequencing efforts. This resource will serve as a template for refinement of the T. cacao genome through gap-filling, targeted re-sequencing, and resolution of repetitive DNA arrays.
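The genome-size estimate in this abstract is the sum of the anchored and unanchored contig spans. A short check using only the figures given in the abstract follows; the anchored fraction is a derived value that the abstract does not itself state.

```python
# Figures from the abstract
anchored = {"contigs": 49, "span_mbp": 307.2}
unanchored = {"contigs": 105, "span_mbp": 67.4}

total_contigs = anchored["contigs"] + unanchored["contigs"]
genome_mbp = anchored["span_mbp"] + unanchored["span_mbp"]

# Derived: share of the physical map that is anchored to chromosomes
anchored_fraction = anchored["span_mbp"] / genome_mbp

print(total_contigs, round(genome_mbp, 1), f"{anchored_fraction:.1%}")
```

The contig counts add up to the 154 contigs reported for the physical map, and about four fifths of the estimated genome span is genetically anchored.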
Farmery, James H R; Smith, Mike L; Lynch, Andy G
2018-01-22
Telomere length is a risk factor in disease and the dynamics of telomere length are crucial to our understanding of cell replication and vitality. The proliferation of whole genome sequencing represents an unprecedented opportunity to glean new insights into telomere biology on a previously unimaginable scale. To this end, a number of approaches for estimating telomere length from whole-genome sequencing data have been proposed. Here we present Telomerecat, a novel approach to the estimation of telomere length. Previous methods have been dependent on the number of telomeres present in a cell being known, which may be problematic when analysing aneuploid cancer data and non-human samples. Telomerecat is designed to be agnostic to the number of telomeres present, making it suited for the purpose of estimating telomere length in cancer studies. Telomerecat also accounts for interstitial telomeric reads and presents a novel approach to dealing with sequencing errors. We show that Telomerecat performs well at telomere length estimation when compared to leading experimental and computational methods. Furthermore, we show that it detects expected patterns in longitudinal data, repeated measurements, and cross-species comparisons. We also apply the method to cancer cell data, uncovering an interesting relationship with the underlying telomerase genotype.
Genetics of pediatric obesity.
Manco, Melania; Dallapiccola, Bruno
2012-07-01
The onset of obesity has shifted to earlier ages, and prevalence has dramatically increased worldwide over the past decades. Epidemic obesity is mainly attributable to modern lifestyle, but family studies demonstrate the significant role of genes in the individual's predisposition to obesity. Advances in genotyping technologies have raised great hope and expectations that genetic testing will pave the way to personalized medicine and that complex traits such as obesity will be prevented even before birth. Given the pressing offer of direct-to-consumer genetic testing services from private companies to estimate the individual's risk for complex phenotypes including obesity, the present review offers pediatricians an update on the state of the art in the genomics of childhood obesity. Discrepancies with respect to the genomics of adult obesity are discussed. After an appraisal of findings from genome-wide association studies in pediatric populations, the rare variant-common disease hypothesis, the theoretical basis for next-generation sequencing techniques, is discussed in contrast to the common disease-common variant hypothesis. Next-generation sequencing techniques are expected to fill the gap of "missing heritability" of obesity, identifying rare variants associated with the trait and clarifying the role of epigenetics in its heritability. Pediatric obesity emerges as a complex phenotype, modulated by unique gene-environment interactions that occur in periods of life and are "permissive" for the programming of adult obesity. With the advent of next-generation sequencing techniques and advances in the field of exposomics, developing sensitive and specific tools to predict the obesity risk as early as possible is the challenge for the next decade.
High resolution identity testing of inactivated poliovirus vaccines.
Mee, Edward T; Minor, Philip D; Martin, Javier
2015-07-09
Definitive identification of poliovirus strains in vaccines is essential for quality control, particularly where multiple wild-type and Sabin strains are produced in the same facility. Sequence-based identification provides the ultimate in identity testing and would offer several advantages over serological methods. We employed random RT-PCR and high throughput sequencing to recover full-length genome sequences from monovalent and trivalent poliovirus vaccine products at various stages of the manufacturing process. All expected strains were detected in previously characterised products and the method permitted identification of strains comprising as little as 0.1% of sequence reads. Highly similar Mahoney and Sabin 1 strains were readily discriminated on the basis of specific variant positions. Analysis of a product known to contain incorrect strains demonstrated that the method correctly identified the contaminants. Random RT-PCR and shotgun sequencing provided high resolution identification of vaccine components. In addition to the recovery of full-length genome sequences, the method could also be easily adapted to the characterisation of minor variant frequencies and distinction of closely related products on the basis of distinguishing consensus and low frequency polymorphisms. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.
Uliano-Silva, Marcela; Dondero, Francesco; Dan Otto, Thomas; Costa, Igor; Lima, Nicholas Costa Barroso; Americo, Juliana Alves; Mazzoni, Camila Junqueira; Prosdocimi, Francisco; Rebelo, Mauro de Freitas
2018-01-01
Abstract Background For more than 25 years, the golden mussel, Limnoperna fortunei, has aggressively invaded South American freshwaters, having travelled more than 5000 km upstream across 5 countries. Along the way, the golden mussel has outcompeted native species and economically harmed aquaculture, hydroelectric powers, and ship transit. We have sequenced the complete genome of the golden mussel to understand the molecular basis of its invasiveness and search for ways to control it. Findings We assembled the 1.6-Gb genome into 20 548 scaffolds with an N50 length of 312 Kb using a hybrid and hierarchical assembly strategy from short and long DNA reads and transcriptomes. A total of 60 717 coding genes were inferred from a customized transcriptome-trained AUGUSTUS run. We also compared predicted protein sets with those of complete molluscan genomes, revealing an exacerbation of protein-binding domains in L. fortunei. Conclusions We built one of the best bivalve genome assemblies available using a cost-effective approach using Illumina paired-end, mate-paired, and PacBio long reads. We expect that the continuous and careful annotation of L. fortunei’s genome will contribute to the investigation of bivalve genetics, evolution, and invasiveness, as well as to the development of biotechnological tools for aquatic pest control. PMID:29267857
Uliano-Silva, Marcela; Dondero, Francesco; Dan Otto, Thomas; Costa, Igor; Lima, Nicholas Costa Barroso; Americo, Juliana Alves; Mazzoni, Camila Junqueira; Prosdocimi, Francisco; Rebelo, Mauro de Freitas
2018-02-01
For more than 25 years, the golden mussel, Limnoperna fortunei, has aggressively invaded South American freshwaters, having travelled more than 5000 km upstream across 5 countries. Along the way, the golden mussel has outcompeted native species and economically harmed aquaculture, hydroelectric powers, and ship transit. We have sequenced the complete genome of the golden mussel to understand the molecular basis of its invasiveness and search for ways to control it. We assembled the 1.6-Gb genome into 20 548 scaffolds with an N50 length of 312 Kb using a hybrid and hierarchical assembly strategy from short and long DNA reads and transcriptomes. A total of 60 717 coding genes were inferred from a customized transcriptome-trained AUGUSTUS run. We also compared predicted protein sets with those of complete molluscan genomes, revealing an exacerbation of protein-binding domains in L. fortunei. We built one of the best bivalve genome assemblies available using a cost-effective approach using Illumina paired-end, mate-paired, and PacBio long reads. We expect that the continuous and careful annotation of L. fortunei's genome will contribute to the investigation of bivalve genetics, evolution, and invasiveness, as well as to the development of biotechnological tools for aquatic pest control.
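The scaffold N50 quoted for this assembly is a standard contiguity statistic: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the scaffold lengths are made-up toy values, not data from the golden mussel assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Toy scaffold lengths (hypothetical)
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 rewards a few long scaffolds and says nothing about correctness, so it is usually read alongside scaffold count and gene-content completeness.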
Richardson, Kris; Schnitzler, Gavin R; Lai, Chao-Qiang; Ordovas, Jose M
2015-12-01
Cardiovascular disease and type 2 diabetes mellitus represent overlapping diseases where a large portion of the variation attributable to genetics remains unexplained. An important player in their pathogenesis is peroxisome proliferator-activated receptor γ (PPARγ) that is involved in lipid and glucose metabolism and maintenance of metabolic homeostasis. We used a functional genomics methodology to interrogate human chromatin immunoprecipitation-sequencing, genome-wide association studies, and expression quantitative trait locus data to inform selection of candidate functional single nucleotide polymorphisms (SNPs) falling in PPARγ motifs. We derived 27 328 chromatin immunoprecipitation-sequencing peaks for PPARγ in human adipocytes through meta-analysis of 3 data sets. The PPARγ consensus motif showed greatest enrichment and mapped to 8637 peaks. We identified 146 SNPs in these motifs. This number was significantly less than would be expected by chance, and Inference of Natural Selection from Interspersed Genomically coHerent elemenTs analysis indicated that these motifs are under weak negative selection. A screen of these SNPs against genome-wide association studies for cardiometabolic traits revealed significant enrichment with 16 SNPs. A screen against the MuTHER expression quantitative trait locus data revealed 8 of these were significantly associated with altered gene expression in human adipose, more than would be expected by chance. Several SNPs fall close, or are linked by expression quantitative trait locus to lipid-metabolism loci including CYP26A1. We demonstrated the use of functional genomics to identify SNPs of potential function. Specifically, that SNPs within PPARγ motifs that bind PPARγ in adipocytes are significantly associated with cardiometabolic disease and with the regulation of transcription in adipose. 
This method may be used to uncover functional SNPs that do not reach significance thresholds in the agnostic approach of genome-wide association studies. © 2015 American Heart Association, Inc.
Varshney, Rajeev K; Saxena, Rachit K; Upadhyaya, Hari D; Khan, Aamir W; Yu, Yue; Kim, Changhoon; Rathore, Abhishek; Kim, Dongseon; Kim, Jihun; An, Shaun; Kumar, Vinay; Anuradha, Ghanta; Yamini, Kalinati Narasimhan; Zhang, Wei; Muniswamy, Sonnappa; Kim, Jong-So; Penmetsa, R Varma; von Wettberg, Eric; Datta, Swapan K
2017-07-01
Pigeonpea (Cajanus cajan), a tropical grain legume with low input requirements, is expected to continue to have an important role in supplying food and nutritional security in developing countries in Asia, Africa and the tropical Americas. From whole-genome resequencing of 292 Cajanus accessions encompassing breeding lines, landraces and wild species, we characterize genome-wide variation. On the basis of a scan for selective sweeps, we find several genomic regions that were likely targets of domestication and breeding. Using genome-wide association analysis, we identify associations between several candidate genes and agronomically important traits. Candidate genes for these traits in pigeonpea have sequence similarity to genes functionally characterized in other plants for flowering time control, seed development and pod dehiscence. Our findings will allow acceleration of genetic gains for key traits to improve yield and sustainability in pigeonpea.
BAC sequencing using pooled methods.
Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina
2015-01-01
Shotgun sequencing and assembly of a large, complex genome can be expensive, and accurately reconstructing the true genome sequence is challenging. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are the main factors that plague de novo genome sequencing projects, which typically yield highly fragmented assemblies from which it is difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences the genome fraction that encodes the relevant biological information, then it is possible to reduce overall sequencing costs and efforts that target a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
Vogiatzi, Emmanouella; Lagnel, Jacques; Pakaki, Victoria; Louro, Bruno; Canario, Adelino V M; Reinhardt, Richard; Kotoulas, Georgios; Magoulas, Antonios; Tsigenopoulos, Costas S
2011-06-01
We screened for simple sequence repeats (SSRs) found in ESTs derived from an EST-database development project ('Marine Genomics Europe' Network of Excellence). Different motifs of di-, tri-, tetra-, penta- and hexanucleotide SSRs were evaluated for variation in length and position in the expressed sequences, relative abundance and distribution in gilthead sea bream (Sparus aurata). We found 899 ESTs that harbor 997 SSRs (4.94%). On average, one SSR was found per 2.95 kb of EST sequence and the dinucleotide SSRs are the most abundant accounting for 47.6% of the total number. EST-SSRs were used as template for primer design. 664 primer pairs could be successfully identified and a subset of 206 pairs of primers was synthesized, PCR-tested and visualized on ethidium bromide stained agarose gels. The main objective was to further assess the potential of EST-SSRs as informative markers and investigate their cross-species amplification in sixteen teleost fish species: seven sparid species and nine other species from different families. Approximately 78% of the primer pairs gave PCR products of expected size in gilthead sea bream, and as expected, the rate of successful amplification of sea bream EST-SSRs was higher in sparids, lower in other perciforms and even lower in species of the Clupeiform and Gadiform orders. We finally determined the polymorphism and the heterozygosity of 63 markers in a wild gilthead sea bream population; fifty-eight loci were found to be polymorphic with the expected heterozygosity and the number of alleles ranging from 0.089 to 0.946 and from 2 to 27, respectively. These tools and markers are expected to enhance the available genetic linkage map in gilthead sea bream, to assist comparative mapping and genome analyses for this species and further with other model fish species and finally to help advance genetic analysis for cultivated and wild populations and accelerate breeding programs. Copyright © 2011 Elsevier B.V. All rights reserved.
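The kind of SSR screen described in this abstract can be sketched with a regular-expression scan for di- to hexanucleotide motifs. This is only an illustration: the minimum repeat counts, the function names, and the example sequence are assumptions, and the abstract does not state which tool or thresholds the authors used.

```python
import re

# Assumed minimum repeat counts per motif length (di- to hexanucleotide).
# These thresholds are illustrative, not the criteria used in the study.
DEFAULT_MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_compound(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'CACA', 'AAA')."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq, min_repeats=DEFAULT_MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # Group 2 captures the motif; \2{k,} requires at least k further copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            if is_compound(motif):
                continue  # already reported under its shorter repeat unit
            hits.append((match.start(), motif, len(match.group(1)) // motif_len))
    return sorted(hits)


# Hypothetical EST fragment containing a (CA)8 and an (AAG)5 repeat.
example = "TTG" + "CA" * 8 + "GGT" + "AAG" * 5 + "C"
for start, motif, repeats in find_ssrs(example):
    print(start, motif, repeats)
```

Dividing the total EST length screened by the number of hits would give a density figure of the kind reported above (one SSR per 2.95 kb), although the exact value depends on the thresholds chosen.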
Zhang, Jia; Yang, Ming-Kun; Zeng, Honghui; Ge, Feng
2016-11-01
Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genome remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed an open source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
Boyd, Bret M; Allen, Julie M; de Crécy-Lagard, Valérie; Reed, David L
2014-09-11
The obligate-heritable endosymbionts of insects possess some of the smallest known bacterial genomes. This is likely due to loss of genomic material during symbiosis. The mode and rate of this erosion may change over evolutionary time: faster in newly formed associations and slower in long-established ones. The endosymbionts of human and anthropoid primate lice present a unique opportunity to study genome erosion in newly established (or young) symbionts. This is because we have a detailed phylogenetic history of these endosymbionts with divergence dates for closely related species. This allows for genome evolution to be studied in detail and rates of change to be estimated in a phylogenetic framework. Here, we sequenced the genome of the chimpanzee louse endosymbiont (Candidatus Riesia pediculischaeffi) and compared it with the closely related genome of the human body louse endosymbiont. From this comparison, we found evidence for recent genome erosion leading to gene loss in these endosymbionts. Although gene loss was detected, it was not significantly greater than in older endosymbionts from aphids and ants. Additionally, we searched for genes associated with B-vitamin synthesis in the two louse endosymbiont genomes because these endosymbionts are believed to synthesize essential B vitamins absent in the louse's diet. All of the expected genes were present, except those involved in thiamin synthesis. We failed to find genes encoding for proteins involved in the biosynthesis of thiamin or any complete exogenous means of salvaging thiamin, suggesting there is an undescribed mechanism for the salvage of thiamin. Finally, genes encoding for the pantothenate de novo biosynthesis pathway were located on a plasmid in both taxa along with a heat shock protein. Movement of these genes onto a plasmid may be functionally and evolutionarily significant, potentially increasing production and guarding against the deleterious effects of mutation. 
These data add to a growing resource of obligate endosymbiont genomes and to our understanding of the rate and mode of genome erosion in obligate animal-associated bacteria. Ultimately sequencing additional louse p-endosymbiont genomes will provide a model system for studying genome evolution in obligate host associated bacteria. Copyright © 2014 Boyd et al.
Black, Michael; Moolhuijzen, Paula; Chapman, Brett; Barrero, Roberto; Howieson, John; Hungria, Mariangela; Bellgard, Matthew
2012-01-01
The symbiotic relationship between legumes and nitrogen fixing bacteria is critical for agriculture, as it may have profound impacts on lowering costs for farmers, on land sustainability, on soil quality, and on mitigation of greenhouse gas emissions. However, despite the importance of the symbioses to the global nitrogen cycling balance, very few rhizobial genomes have been sequenced so far, although there are some ongoing efforts in sequencing elite strains. In this study, the genomes of fourteen selected strains of the order Rhizobiales, all previously fully sequenced and annotated, were compared to assess differences between the strains and to investigate the feasibility of defining a core ‘symbiome’—the essential genes required by all rhizobia for nodulation and nitrogen fixation. Comparison of these whole genomes has revealed valuable information, such as several events of lateral gene transfer, particularly in the symbiotic plasmids and genomic islands that have contributed to a better understanding of the evolution of contrasting symbioses. Unique genes were also identified, as well as omissions of symbiotic genes that were expected to be found. Protein comparisons have also allowed the identification of a variety of similarities and differences in several groups of genes, including those involved in nodulation, nitrogen fixation, production of exopolysaccharides, Type I to Type VI secretion systems, among others, and identifying some key genes that could be related to host specificity and/or a better saprophytic ability. However, while several significant differences in the type and number of proteins were observed, the evidence presented suggests no simple core symbiome exists. A more abstract systems biology concept of nitrogen fixing symbiosis may be required. The results have also highlighted that comparative genomics represents a valuable tool for capturing specificities and generalities of each genome. PMID:24704847
Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures
Bechtel, Jason M; Wittenschlaeger, Thomas; Dwyer, Trisha; Song, Jun; Arunachalam, Sasi; Ramakrishnan, Sadeesh K; Shepard, Samuel; Fedorov, Alexei
2008-01-01
Background: Genomes possess different levels of non-randomness, in particular, an inhomogeneity in their nucleotide composition. Inhomogeneity is manifest from the short-range where neighboring nucleotides influence the choice of base at a site, to the long-range, commonly known as isochores, where a particular base composition can span millions of nucleotides. A separate genomic issue that has yet to be thoroughly elucidated is the role that RNA secondary structure (SS) plays in gene expression. Results: We present novel data and approaches that show that a mid-range inhomogeneity (~30 to 1000 nt) not only exists in mammalian genomes but is also significantly associated with strong RNA SS. A whole-genome bioinformatics investigation of local SS in a set of 11,315 non-redundant human pre-mRNA sequences has been carried out. Four distinct components of these molecules (5'-UTRs, exons, introns and 3'-UTRs) were considered separately, since they differ in overall nucleotide composition, sequence motifs and periodicities. For each pre-mRNA component, the abundance of strong local SS (< -25 kcal/mol) was a factor of two to ten greater than a random expectation model. The randomization process preserves the short-range inhomogeneity of the corresponding natural sequences, thus, eliminating short-range signals as possible contributors to any observed phenomena. Conclusion: We demonstrate that the excess of strong local SS in pre-mRNAs is linked to the little explored phenomenon of genomic mid-range inhomogeneity (MRI). MRI is an interdependence between nucleotide choice and base composition over a distance of 20–1000 nt. Additionally, we have created a public computational resource to support further study of genomic MRI. PMID:18549495
Alkan, Can; Kavak, Pinar; Somel, Mehmet; Gokcumen, Omer; Ugurlu, Serkan; Saygi, Ceren; Dal, Elif; Bugra, Kuyas; Güngör, Tunga; Sahinalp, S Cenk; Özören, Nesrin; Bekpen, Cemalettin
2014-11-07
Turkey is a crossroads of major population movements throughout history and has been a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers. However, to date, there have been no studies to assess the genetic variation at the whole genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32×-48×). We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contribution from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish populations and European populations. Furthermore, we have analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found an increased allele frequency of 31.25% for the H1/H2 inversion polymorphism, compared with about 25% in European populations. This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in analyzing natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey.
Genomic approaches for the elucidation of genes and gene networks underlying cardiovascular traits.
Adriaens, M E; Bezzina, C R
2018-06-22
Genome-wide association studies have shed light on the association between natural genetic variation and cardiovascular traits. However, linking a cardiovascular trait associated locus to a candidate gene or set of candidate genes for prioritization for follow-up mechanistic studies is anything but straightforward. Genomic technologies based on next-generation sequencing now offer multiple opportunities to dissect gene regulatory networks underlying genetic cardiovascular trait associations, thereby aiding in the identification of candidate genes at unprecedented scale. RNA sequencing in particular becomes a powerful tool when combined with genotyping to identify loci that modulate transcript abundance, known as expression quantitative trait loci (eQTL), or loci modulating transcript splicing known as splicing quantitative trait loci (sQTL). Additionally, the allele-specific resolution of RNA-sequencing technology enables estimation of allelic imbalance, a state where the two alleles of a gene are expressed at a ratio differing from the expected 1:1 ratio. When multiple high-throughput approaches are combined with deep phenotyping in a single study, a comprehensive elucidation of the relationship between genotype and phenotype comes into view, an approach known as systems genetics. In this review, we cover key applications of systems genetics in the broad cardiovascular field.
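The allelic imbalance described in this abstract, a departure from the expected 1:1 expression ratio, can be illustrated with an exact two-sided binomial test on allele-specific read counts at a heterozygous site. This is a minimal sketch under simplifying assumptions (independent reads, no reference-mapping bias, no overdispersion); the function name and the read counts are hypothetical and the review does not prescribe this particular test.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for H0: both alleles expressed 1:1.

    Under H0 each read comes from either allele with probability 0.5, so the
    distribution is symmetric and the two-sided p-value is twice the smaller tail.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("at least one read is required")
    k = min(ref_reads, alt_reads)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical read counts at one heterozygous SNP.
print(allelic_imbalance_pvalue(5, 5))    # balanced expression
print(allelic_imbalance_pvalue(10, 0))   # strongly imbalanced expression
```

A small p-value suggests that the observed allele ratio is unlikely under balanced expression; in practice many sites are tested, so multiple-testing correction would also be needed.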
Whole-genome sequencing for comparative genomics and de novo genome assembly.
Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C
2015-01-01
Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).
Actionable exomic incidental findings in 6503 participants: challenges of variant classification
Amendola, Laura M.; Dorschner, Michael O.; Robertson, Peggy D.; Salama, Joseph S.; Hart, Ragan; Shirts, Brian H.; Murray, Mitzi L.; Tokita, Mari J.; Gallego, Carlos J.; Kim, Daniel Seung; Bennett, James T.; Crosslin, David R.; Ranchalis, Jane; Jones, Kelly L.; Rosenthal, Elisabeth A.; Jarvik, Ella R.; Itsara, Andy; Turner, Emily H.; Herman, Daniel S.; Schleit, Jennifer; Burt, Amber; Jamal, Seema M.; Abrudan, Jenica L.; Johnson, Andrew D.; Conlin, Laura K.; Dulik, Matthew C.; Santani, Avni; Metterville, Danielle R.; Kelly, Melissa; Foreman, Ann Katherine M.; Lee, Kristy; Taylor, Kent D.; Guo, Xiuqing; Crooks, Kristy; Kiedrowski, Lesli A.; Raffel, Leslie J.; Gordon, Ora; Machini, Kalotina; Desnick, Robert J.; Biesecker, Leslie G.; Lubitz, Steven A.; Mulchandani, Surabhi; Cooper, Greg M.; Joffe, Steven; Richards, C. Sue; Yang, Yaoping; Rotter, Jerome I.; Rich, Stephen S.; O’Donnell, Christopher J.; Berg, Jonathan S.; Spinner, Nancy B.; Evans, James P.; Fullerton, Stephanie M.; Leppig, Kathleen A.; Bennett, Robin L.; Bird, Thomas; Sybert, Virginia P.; Grady, William M.; Tabor, Holly K.; Kim, Jerry H.; Bamshad, Michael J.; Wilfond, Benjamin; Motulsky, Arno G.; Scott, C. Ronald; Pritchard, Colin C.; Walsh, Tom D.; Burke, Wylie; Raskind, Wendy H.; Byers, Peter; Hisama, Fuki M.; Rehm, Heidi; Nickerson, Debbie A.; Jarvik, Gail P.
2015-01-01
Recommendations for laboratories to report incidental findings from genomic tests have stimulated interest in such results. In order to investigate the criteria and processes for assigning the pathogenicity of specific variants and to estimate the frequency of such incidental findings in patients of European and African ancestry, we classified potentially actionable pathogenic single-nucleotide variants (SNVs) in all 4300 European- and 2203 African-ancestry participants sequenced by the NHLBI Exome Sequencing Project (ESP). We considered 112 gene-disease pairs selected by an expert panel as associated with medically actionable genetic disorders that may be undiagnosed in adults. The resulting classifications were compared to classifications from other clinical and research genetic testing laboratories, as well as with in silico pathogenicity scores. Among European-ancestry participants, 30 of 4300 (0.7%) had a pathogenic SNV and six (0.1%) had a disruptive variant that was expected to be pathogenic, whereas 52 (1.2%) had likely pathogenic SNVs. For African-ancestry participants, six of 2203 (0.3%) had a pathogenic SNV and six (0.3%) had an expected pathogenic disruptive variant, whereas 13 (0.6%) had likely pathogenic SNVs. Genomic Evolutionary Rate Profiling mammalian conservation score and the Combined Annotation Dependent Depletion summary score of conservation, substitution, regulation, and other evidence were compared across pathogenicity assignments and appear to have utility in variant classification. This work provides a refined estimate of the burden of adult onset, medically actionable incidental findings expected from exome sequencing, highlights challenges in variant classification, and demonstrates the need for a better curated variant interpretation knowledge base. PMID:25637381
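The frequencies quoted in this abstract follow directly from the stated counts and cohort sizes. The short sketch below recomputes them; the counts are those given in the abstract, while the dictionary layout and key names are illustrative only.

```python
# Counts as reported in the abstract; key names are illustrative.
cohorts = {
    "European ancestry": {
        "n": 4300,
        "pathogenic": 30,
        "expected pathogenic (disruptive)": 6,
        "likely pathogenic": 52,
    },
    "African ancestry": {
        "n": 2203,
        "pathogenic": 6,
        "expected pathogenic (disruptive)": 6,
        "likely pathogenic": 13,
    },
}


def percentages(cohort):
    """Return each variant class as a percentage of participants, to one decimal."""
    n = cohort["n"]
    return {k: round(100 * v / n, 1) for k, v in cohort.items() if k != "n"}


for name, cohort in cohorts.items():
    print(name, percentages(cohort))
```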
Hartl, Daniel L.
2008-01-01
Simple models of molecular evolution assume that sequences evolve by a Poisson process in which nucleotide or amino acid substitutions occur as rare independent events. In these models, the expected ratio of the variance to the mean of substitution counts equals 1, and substitution processes with a ratio greater than 1 are called overdispersed. Comparing the genomes of 10 closely related species of Drosophila, we extend earlier evidence for overdispersion in amino acid replacements as well as in four-fold synonymous substitutions. The observed deviation from the Poisson expectation can be described as a linear function of the rate at which substitutions occur on a phylogeny, which implies that deviations from the Poisson expectation arise from gene-specific temporal variation in substitution rates. Amino acid sequences show greater temporal variation in substitution rates than do four-fold synonymous sequences. Our findings provide a general phenomenological framework for understanding overdispersion in the molecular clock. Also, the presence of substantial variation in gene-specific substitution rates has broad implications for work in phylogeny reconstruction and evolutionary rate estimation. PMID:18480070
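The variance-to-mean ratio mentioned in this abstract (often written R = Var(N) / E(N), equal to 1 under a Poisson process) can be computed as in the minimal sketch below. The function name and the toy substitution counts are hypothetical, and the sample variance is used as an assumption; the study's own estimator may differ.

```python
from statistics import mean, variance


def index_of_dispersion(counts):
    """Variance-to-mean ratio of substitution counts.

    A ratio near 1 matches the Poisson expectation; a ratio above 1 is
    overdispersed. Uses the sample (n - 1) variance.
    """
    if len(counts) < 2:
        raise ValueError("need at least two counts")
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; ratio undefined")
    return variance(counts) / m


# Hypothetical substitution counts for one gene across four lineages.
print(index_of_dispersion([0, 0, 10, 10]))
```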
Cantalupo, Paul G.; Katz, Joshua P.
2015-01-01
ABSTRACT: We searched The Cancer Genome Atlas (TCGA) database for viruses by comparing non-human reads present in transcriptome sequencing (RNA-Seq) and whole-exome sequencing (WXS) data to viral sequence databases. Human papillomavirus 18 (HPV18) is an etiologic agent of cervical cancer, and as expected, we found robust expression of HPV18 genes in cervical cancer samples. In agreement with previous studies, we also found HPV18 transcripts in non-cervical cancer samples, including those from the colon, rectum, and normal kidney. However, in each of these cases, HPV18 gene expression was low, and single-nucleotide variants and positions of genomic alignments matched the integrated portion of HPV18 present in HeLa cells. Chimeric reads that match a known virus-cell junction of HPV18 integrated in HeLa cells were also present in some samples. We hypothesize that HPV18 sequences in these non-cervical samples are due to nucleic acid contamination from HeLa cells. This finding highlights the problems that contamination presents in computational virus detection pipelines. IMPORTANCE: Viruses associated with cancer can be detected by searching tumor sequence databases. Several studies involving searches of the TCGA database have reported the presence of HPV18, a known cause of cervical cancer, in a small number of additional cancers, including those of the rectum, kidney, and colon. We have determined that the sequences related to HPV18 in non-cervical samples are due to nucleic acid contamination from HeLa cells. To our knowledge, this is the first report of the misidentification of viruses in next-generation sequencing data of tumors due to contamination with a cancer cell line. These results raise awareness of the difficulty of accurately identifying viruses in human sequence databases. PMID:25631090
Mlinarec, Jelena; Chester, Mike; Siljak-Yakovlev, Sonja; Papes, Drazena; Leitch, Andrew R; Besendorfer, Visnja
2009-01-01
The structure, abundance and location of repetitive DNA sequences on chromosomes can characterize the nature of higher plant genomes. Here we report on three new repeat DNA families isolated from Anemone hortensis L.; (i) AhTR1, a family of satellite DNA (stDNA) composed of a 554-561 bp long EcoRV monomer; (ii) AhTR2, a stDNA family composed of a 743 bp long HindIII monomer and; (iii) AhDR, a repeat family composed of a 945 bp long HindIII fragment that exhibits some sequence similarity to Ty3/gypsy-like retroelements. Fluorescence in-situ hybridization (FISH) to metaphase chromosomes of A. hortensis (2n = 16) revealed that both AhTR1 and AhTR2 sequences co-localized with DAPI-positive AT-rich heterochromatic regions. AhTR1 sequences occur at intercalary DAPI bands while AhTR2 sequences occur at 8-10 terminally located heterochromatic blocks. In contrast AhDR sequences are dispersed over all chromosomes as expected of a Ty3/gypsy-like element. AhTR2 and AhTR1 repeat families include polyA- and polyT-tracks, AT/TA-motifs and a pentanucleotide sequence (CAAAA) that may have consequences for chromatin packing and sequence homogeneity. AhTR2 repeats also contain TTTAGGG motifs and degenerate variants. We suggest that they arose by interspersion of telomeric repeats with subtelomeric repeats, before hybrid unit(s) amplified through the heterochromatic domain. The three repetitive DNA families together occupy approximately 10% of the A. hortensis genome. Comparative analyses of eight Anemone species revealed that the divergence of the A. hortensis genome was accompanied by considerable modification and/or amplification of repeats.
Díaz-Cárdenas, Carolina; López, Gina; Alzate-Ocampo, José David; González, Laura N; Shapiro, Nicole; Woyke, Tanja; Kyrpides, Nikos C; Restrepo, Silvia; Baena, Sandra
2017-01-01
A bacterium belonging to the phylum Synergistetes, genus Dethiosulfovibrio, was isolated in 2007 from a saline spring in Colombia. Dethiosulfovibrio salsuginis USBA 82T (DSM 21565T = KCTC 5659T) is a mesophilic, strictly anaerobic, slightly halophilic, Gram negative bacterium with a diderm cell envelope. The strain ferments peptides, amino acids and a few organic acids. Here we present the description of the complete genome sequencing and annotation of the type species Dethiosulfovibrio salsuginis USBA 82T. The genome consisted of 2.68 Mbp with a G + C content of 53.7%. A total of 2609 genes were predicted and of those, 2543 were protein coding genes and 66 were RNA genes. We detected in the USBA 82T genome six Synergistetes conserved signature indels (CSIs), specific for Jonquetella, Pyramidobacter and Dethiosulfovibrio. The genome of D. salsuginis contained, as expected, genes related to amino acid transport, amino acid metabolism and thiosulfate reduction. These genes represent the major gene groups of Synergistetes, related with their phenotypic traits, and interestingly, 11.8% of the genes in the genome belonged to the amino acid fermentation COG category. In addition, we identified in the genome some ammonification genes such as nitrate reductase genes. The presence of proline operon genes could be related to de novo synthesis of proline to protect the cell in response to high osmolarity. Our bioinformatics workflow included antiSMASH and BAGEL3, which allowed us to identify bacteriocin genes in the genome.
Development and characterization of genomic SSR markers in Cynodon transvaalensis Burtt-Davy.
Tan, Chengcheng; Wu, Yanqi; Taliaferro, Charles M; Bell, Greg E; Martin, Dennis L; Smith, Mike W
2014-08-01
Simple sequence repeat (SSR) markers are a major molecular tool for genetic and genomic research that have been extensively developed and used in major crops. However, few are available in African bermudagrass (Cynodon transvaalensis Burtt-Davy), an economically important warm-season turfgrass species. African bermudagrass is mainly used for hybridizations with common bermudagrass [C. dactylon var. dactylon (L.) Pers.] in the development of superior interspecific hybrid turfgrass cultivars. Accordingly, the major objective of this study was to develop and characterize a large set of SSR markers. Genomic DNA of C. transvaalensis '4200TN 24-2' from an Oklahoma State University (OSU) turf nursery was extracted for construction of four SSR genomic libraries enriched with [CA](n), [GA](n), [AAG](n), and [AAT](n) as core repeat motifs. A total of 3,064 clones were sequenced at the OSU core facility. The sequences were categorized into singletons and contiguous sequences to exclude redundancy. From the two sequence categories, 1,795 SSR loci were identified. After excluding duplicate SSRs by comparison with previously developed SSR markers using a nucleotide basic local alignment tool, 1,426 unique primer pairs (PPs) were designed. Out of the 1,426 designed PPs, 981 (68.8%) amplified alleles of the expected size in the donor DNA. Of the SSR PPs tested in eight C. transvaalensis plants, 93% were polymorphic, with 544 markers effective in all genotypes. Inheritance of the SSRs was examined in six F(1) progeny of African parents 'T577' × 'Uganda', indicating that 917 markers amplified heritable alleles. The SSR markers developed in the study are the first large set of co-dominant markers in African bermudagrass and should be highly valuable for molecular and traditional breeding research.
Hou, Wan-ru; Tang, Yun; Hou, Yi-ling; Song, Yan; Zhang, Tian; Wu, Guang-fu
2010-07-01
Eukaryotic initiation factor (eIF) EIF1 is a universally conserved translation factor that is involved in translation initiation site selection. The cDNA and the genomic sequences of EIF1 were cloned successfully from the giant panda (Ailuropoda melanoleuca) and the black bear (Ursus thibetanus mupinensis) using reverse transcription polymerase chain reaction (RT-PCR) technology and touchdown-polymerase chain reaction, respectively. The cDNAs of the EIF1 cloned from the giant panda and the black bear are 418 bp in size, containing an open reading frame (ORF) of 342 bp encoding 113 amino acids. The length of the genomic sequence of the giant panda is 1909 bp, which contains four exons and three introns. The length of the genomic sequence of the black bear is 1897 bp, which also contains four exons and three introns. Sequence alignment indicates a high degree of homology to those of Homo sapiens, Mus musculus, Rattus norvegicus, and Bos taurus at both amino acid and DNA levels. Topology prediction shows there are one N-glycosylation site, two casein kinase II phosphorylation sites, and an amidation site in the EIF1 protein of the giant panda and black bear. In addition, there is a protein kinase C phosphorylation site in EIF1 of the giant panda. The giant panda and the black bear EIF1 genes were overexpressed in E. coli BL21. The results indicated that both EIF1 fusion proteins, in N-terminally His-tagged form, gave rise to the accumulation of the two expected 19 kDa polypeptides. The expression products obtained could be used to purify the proteins and study their function further.
Prakash, Celine; Haeseler, Arndt Von
2017-03-01
RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.
Haeseler, Arndt Von
2017-01-01
Abstract RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment. PMID:27661099
Lasserre, Moira; Fresia, Pablo; Greif, Gonzalo; Iraola, Gregorio; Castro-Ramos, Miguel; Juambeltz, Arturo; Nuñez, Álvaro; Naya, Hugo; Robello, Carlos; Berná, Luisa
2018-01-02
Bovine tuberculosis (bTB) poses serious risks to animal welfare and the economy, as well as to public health as a zoonosis. Its etiological agent, Mycobacterium bovis, belongs to the Mycobacterium tuberculosis complex (MTBC), a group of genetically monomorphic organisms characterized by a remarkably high overall nucleotide identity (99.9%). This characteristic is a major concern for correct typing and for determining strain-specific traits based on sequence diversity. Due to its historical economic dependence on cattle production, Uruguay is deeply affected by the prevailing incidence of Mycobacterium bovis. With the world's highest number of cattle per human and its intensive cattle production, Uruguay is a particularly well-suited setting in which to evaluate genomic variability among isolates and the diversity traits associated with this pathogen. We compared 186 genomes from MTBC strains isolated worldwide and found a highly structured population in M. bovis. The analysis of 23 new M. bovis genomes, belonging to strains isolated in Uruguay, revealed three groups present in the country. Despite presenting the expected highly conserved genomic structure and sequence, these strains segregate in a clustered manner within the worldwide phylogeny. Analysis of the non-pe/ppe differential areas against a reference genome defined four main sources of variability: regions of difference (RD), variable genes, duplications and novel genes. RDs and variant analysis segregated the strains into clusters that are concordant with their spoligotype identities. Due to its high homoplasy rate, spoligotyping failed to reflect the true genomic diversity among worldwide representative strains; however, it remains a good indicator for closely related populations. This study introduces a comprehensive population structure analysis of worldwide M. bovis isolates. The incorporation and analysis of 23 novel Uruguayan M. bovis genomes sheds light on the genomic diversity of this pathogen and shows greater genetic variability among strains than previously contemplated.
Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.
Ragothaman, Anjani; Boddu, Sairam Chowdary; Kim, Nayong; Feinstein, Wei; Brylinski, Michal; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences even in small genomes such as those of prokaryotes is large, typically many thousands, which prohibits their application as a genome-wide structural systems biology tool. To increase its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such as those of prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics
Ragothaman, Anjani; Feinstein, Wei; Jha, Shantenu; Kim, Joohyun
2014-01-01
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences even in small genomes such as those of prokaryotes is large, typically many thousands, which prohibits their application as a genome-wide structural systems biology tool. To increase its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such as those of prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. PMID:24995285
An efficient approach to BAC based assembly of complex genomes.
Visendi, Paul; Berkman, Paul J; Hayashi, Satomi; Golicz, Agnieszka A; Bayer, Philipp E; Ruperao, Pradeep; Hurgobin, Bhavna; Montenegro, Juan; Chan, Chon-Kit Kenneth; Staňková, Helena; Batley, Jacqueline; Šimková, Hana; Doležel, Jaroslav; Edwards, David
2016-01-01
There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, to variable success. We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
Aguirre, Jacobo; Buldú, Javier M; Manrubia, Susanna C
2009-12-01
Networks of selectively neutral genotypes underlie the evolution of populations of replicators in constant environments. Previous theoretical analysis predicted that such populations will evolve toward highly connected regions of the genome space. We first study the evolution of populations of replicators on simple networks and quantify how the transient time to equilibrium depends on the initial distribution of sequences on the neutral network, on the topological properties of the latter, and on the mutation rate. Second, network neutrality is broken through the introduction of an energy for each sequence. This allows us to study the competition between two features (neutrality and energetic stability) that are relevant for survival and subject to different selective pressures. In cases where the two features are negatively correlated, the population experiences sudden migrations in the genome space for values of the relevant parameters that we calculate. The numerical study of larger networks indicates that the qualitative behavior to be expected in more realistic cases is already seen in representative examples of small networks.
Structure of a Trypanosoma Brucei Alpha/Beta--Hydrolase Fold Protein With Unknown Function
DOE Office of Scientific and Technical Information (OSTI.GOV)
Merritt, E.A.; Holmes, M.; Buckner, F.S.
2009-05-26
The structure of a structural genomics target protein, Tbru020260AAA from Trypanosoma brucei, has been determined to a resolution of 2.2 Å using multiple-wavelength anomalous diffraction at the Se K edge. This protein belongs to Pfam sequence family PF08538 and is only distantly related to previously studied members of the α/β-hydrolase fold family. Structural superposition onto representative α/β-hydrolase fold proteins of known function indicates that a possible catalytic nucleophile, Ser116 in the T. brucei protein, lies at the expected location. However, the present structure and by extension the other trypanosomatid members of this sequence family have neither sequence nor structural similarity at the location of other active-site residues typical for proteins with this fold. Together with the presence of an additional domain between strands β6 and β7 that is conserved in trypanosomatid genomes, this suggests that the function of these homologs has diverged from other members of the fold family.
Pritchard, Leighton; Holden, Nicola J; Bielaszewska, Martina; Karch, Helge; Toth, Ian K
2012-01-01
An Escherichia coli O104:H4 outbreak in Germany in summer 2011 caused 53 deaths, over 4000 individual infections across Europe, and considerable economic, social and political impact. This outbreak was the first in a position to exploit rapid, benchtop high-throughput sequencing (HTS) technologies and crowdsourced data analysis early in its investigation, establishing a new paradigm for rapid response to disease threats. We describe a novel strategy for design of diagnostic PCR primers that exploited this rapid draft bacterial genome sequencing to distinguish between E. coli O104:H4 outbreak isolates and other pathogenic E. coli isolates, including the historical hæmolytic uræmic syndrome (HUSEC) E. coli HUSEC041 O104:H4 strain, which possesses the same serotype as the outbreak isolates. Primers were designed using a novel alignment-free strategy against eleven draft whole genome assemblies of E. coli O104:H4 German outbreak isolates from the E. coli O104:H4 Genome Analysis Crowd-Sourcing Consortium website, and a negative sequence set containing 69 E. coli chromosome and plasmid sequences from public databases. Validation in vitro against 21 'positive' E. coli O104:H4 outbreak and 32 'negative' non-outbreak EHEC isolates indicated that individual primer sets exhibited 100% sensitivity for outbreak isolates, with false positive rates of between 9% and 22%. A minimal combination of two primers discriminated between outbreak and non-outbreak E. coli isolates with 100% sensitivity and 100% specificity. Draft genomes of isolates of disease outbreak bacteria enable high throughput primer design and enhanced diagnostic performance in comparison to traditional molecular assays. Future outbreak investigations will be able to harness HTS rapidly to generate draft genome sequences and diagnostic primer sets, greatly facilitating epidemiology and clinical diagnostics. 
We expect that high throughput primer design strategies will enable faster, more precise responses to future disease outbreaks of bacterial origin, and help to mitigate their societal impact.
Gene Discovery through Genomic Sequencing of Brucella abortus
Sánchez, Daniel O.; Zandomeni, Ruben O.; Cravero, Silvio; Verdún, Ramiro E.; Pierrou, Ester; Faccio, Paula; Diaz, Gabriela; Lanzavecchia, Silvia; Agüero, Fernán; Frasch, Alberto C. C.; Andersson, Siv G. E.; Rossetti, Osvaldo L.; Grau, Oscar; Ugalde, Rodolfo A.
2001-01-01
Brucella abortus is the etiological agent of brucellosis, a disease that affects bovines and humans. We generated random DNA sequences from the genome of B. abortus strain 2308 in order to characterize molecular targets that might be useful for developing immunological or chemotherapeutic strategies against this pathogen. The partial sequencing of 1,899 clones allowed the identification of 1,199 genomic sequence surveys (GSSs) with high homology (BLAST expect value < 10^-5) to sequences deposited in the GenBank databases. Among them, 925 represent putative novel genes for the Brucella genus. Out of 925 nonredundant GSSs, 470 were classified in 15 categories based on cellular function. Seven hundred GSSs showed no significant database matches and remain available for further studies in order to identify their function. A high number of GSSs with homology to Agrobacterium tumefaciens and Rhizobium meliloti proteins were observed, thus confirming their close phylogenetic relationship. Among them, several GSSs showed high similarity with genes related to nodule nitrogen fixation, synthesis of nod factors, nodulation protein symbiotic plasmid, and nodule bacteroid differentiation. We have also identified several B. abortus homologs of virulence and pathogenesis genes from other pathogens, including a homolog to both the Shda gene from Salmonella enterica serovar Typhimurium and the AidA-1 gene from Escherichia coli. Other GSSs displayed significant homologies to genes encoding components of the type III and type IV secretion machineries, suggesting that Brucella might also have an active type III secretion machinery. PMID:11159979
Pita, Sebastián; Mora, Pablo; Vela, Jesús; Palomeque, Teresa; Sánchez, Antonio; Panzera, Francisco; Lorite, Pedro
2018-04-24
Chagas disease, or American trypanosomiasis, affects six to seven million people worldwide, mostly in Latin America. This disease is transmitted by hematophagous insects known as "kissing bugs" (Hemiptera, Triatominae), with Triatoma infestans and Rhodnius prolixus being the two most important vector species. Despite the fact that both species present the same diploid chromosome number (2n = 22), they have remarkable differences in their total DNA content, chromosome structure and genome organization. Variations in genome size are expected to be due to differences in the amount of repetitive DNA sequences. The T. infestans genome-wide analysis revealed the existence of 42 satellite DNA families. BLAST searches of these sequences against the R. prolixus genome assembly revealed that only four of these satellite DNA families are shared between both species, suggesting a great differentiation between the Triatoma and Rhodnius genomes. Fluorescence in situ hybridization (FISH) location of these repetitive DNAs in both species showed that they are dispersed on the euchromatic regions of all autosomes and the X chromosome. Regarding the Y chromosome, these common satellite DNAs are absent in T. infestans but they are present in the R. prolixus Y chromosome. These results support a different origin and/or evolution in the Y chromosome of both species.
Wang, Sibao; Leclerque, Andreas; Pava-Ripoll, Monica; Fang, Weiguo; St Leger, Raymond J
2009-06-01
Many strains of Metarhizium anisopliae have broad host ranges, but others are specialists and adapted to particular hosts. Patterns of gene duplication, divergence, and deletion in three generalist and three specialist strains were investigated by heterologous hybridization of genomic DNA to genes from the generalist strain Ma2575. As expected, major life processes are highly conserved, presumably due to purifying selection. However, up to 7% of Ma2575 genes were highly divergent or absent in specialist strains. Many of these sequences are conserved in other fungal species, suggesting that there has been rapid evolution and loss in specialist Metarhizium genomes. Some poorly hybridizing genes in specialists were functionally coordinated, indicative of reductive evolution. These included several involved in toxin biosynthesis and sugar metabolism in root exudates, suggesting that specialists are losing genes required to live in alternative hosts or as saprophytes. Several components of mobile genetic elements were also highly divergent or lost in specialists. Exceptionally, the genome of the specialist cricket pathogen Ma443 contained extra insertion elements that might play a role in generating evolutionary novelty. This study throws light on the abundance of orphans in genomes, as 15% of orphan sequences were found to be rapidly evolving in the Ma2575 lineage.
Genotype calling from next-generation sequencing data using haplotype information of reads
Zhi, Degui; Wu, Jihua; Liu, Nianjun; Zhang, Kui
2012-01-01
Motivation: Low coverage sequencing provides an economic strategy for whole genome sequencing. When sequencing a set of individuals, genotype calling can be challenging due to low sequencing coverage. Linkage disequilibrium (LD) based refinement of genotyping calling is essential to improve the accuracy. Current LD-based methods use read counts or genotype likelihoods at individual potential polymorphic sites (PPSs). Reads that span multiple PPSs (jumping reads) can provide additional haplotype information overlooked by current methods. Results: In this article, we introduce a new Hidden Markov Model (HMM)-based method that can take into account jumping reads information across adjacent PPSs and implement it in the HapSeq program. Our method extends the HMM in Thunder and explicitly models jumping reads information as emission probabilities conditional on the states of adjacent PPSs. Our simulation results show that, compared to Thunder, HapSeq reduces the genotyping error rate by 30%, from 0.86% to 0.60%. The results from the 1000 Genomes Project show that HapSeq reduces the genotyping error rate by 12 and 9%, from 2.24% and 2.76% to 1.97% and 2.50% for individuals with European and African ancestry, respectively. We expect our program can improve genotyping qualities of the large number of ongoing and planned whole genome sequencing projects. Contact: dzhi@ms.soph.uab.edu; kzhang@ms.soph.uab.edu Availability: The software package HapSeq and its manual can be found and downloaded at www.ssg.uab.edu/hapseq/. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22285565
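The percentage reductions quoted in this abstract are relative reductions in error rate, which can be checked directly from the before and after values given in the text. The helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(before, after):
    """Fraction by which an error rate drops, relative to the starting rate."""
    return (before - after) / before

# Error rates (in percent) as quoted in the abstract: Thunder vs. HapSeq.
comparisons = [
    ("simulation", 0.86, 0.60),
    ("1000 Genomes, European ancestry", 2.24, 1.97),
    ("1000 Genomes, African ancestry", 2.76, 2.50),
]

for label, thunder, hapseq in comparisons:
    print(f"{label}: {relative_reduction(thunder, hapseq):.1%} reduction")
```

Rounded to whole percentages, the three values agree with the 30%, 12% and 9% stated in the abstract.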
Suárez, Gabriel A; Renda, Brian A; Dasgupta, Aurko; Barrick, Jeffrey E
2017-09-01
The genomes of most bacteria contain mobile DNA elements that can contribute to undesirable genetic instability in engineered cells. In particular, transposable insertion sequence (IS) elements can rapidly inactivate genes that are important for a designed function. We deleted all six copies of IS1236 from the genome of the naturally transformable bacterium Acinetobacter baylyi ADP1. The natural competence of ADP1 made it possible to rapidly repair deleterious point mutations that arose during strain construction. In the resulting ADP1-ISx strain, the rates of mutations inactivating a reporter gene were reduced by 7- to 21-fold. This reduction was higher than expected from the incidence of new IS1236 insertions found during a 300-day mutation accumulation experiment with wild-type ADP1 that was used to estimate spontaneous mutation rates in the strain. The extra improvement appears to be due in part to eliminating large deletions caused by IS1236 activity, as the point mutation rate was unchanged in ADP1-ISx. Deletion of an error-prone polymerase (dinP) and a DNA damage response regulator (umuDAb [the umuD gene of A. baylyi]) from the ADP1-ISx genome did not further reduce mutation rates. Surprisingly, ADP1-ISx exhibited increased transformability. This improvement may be due to less autolysis and aggregation of the engineered cells than of the wild type. Thus, deleting IS elements from the ADP1 genome led to a greater than expected increase in evolutionary reliability and unexpectedly enhanced other key strain properties, as has been observed for other clean-genome bacterial strains. ADP1-ISx is an improved chassis for metabolic engineering and other applications. IMPORTANCE Acinetobacter baylyi ADP1 has been proposed as a next-generation bacterial host for synthetic biology and genome engineering due to its ability to efficiently take up DNA from its environment during normal growth. 
We deleted transposable elements that are capable of copying themselves, inserting into other genes, and thereby inactivating them from the ADP1 genome. The resulting "clean-genome" ADP1-ISx strain exhibited larger reductions in the rates of inactivating mutations than expected from spontaneous mutation rates measured via whole-genome sequencing of lineages evolved under relaxed selection. Surprisingly, we also found that IS element activity reduces transformability and is a major cause of cell aggregation and death in wild-type ADP1 grown under normal laboratory conditions. More generally, our results demonstrate that domesticating a bacterial genome by removing mobile DNA elements that have accumulated during evolution in the wild can have unanticipated benefits. Copyright © 2017 American Society for Microbiology.
Human genetics and genomics a decade after the release of the draft sequence of the human genome.
Naidoo, Nasheen; Pawitan, Yudi; Soong, Richie; Cooper, David N; Ku, Chee-Seng
2011-10-01
Substantial progress has been made in human genetics and genomics research over the past ten years since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major developments have been made in cataloguing genetic variation, the International HapMap Project, and with respect to advances in genotyping technologies. These developments are vital for the emergence of genome-wide association studies in the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome sequencing and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade.
Human genetics and genomics a decade after the release of the draft sequence of the human genome
2011-01-01
Substantial progress has been made in human genetics and genomics research over the past ten years since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major developments have been made in cataloguing genetic variation, the International HapMap Project, and with respect to advances in genotyping technologies. These developments are vital for the emergence of genome-wide association studies in the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome sequencing and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade. PMID:22155605
Murray, Lee; Mobegi, Victor A; Duffy, Craig W; Assefa, Samuel A; Kwiatkowski, Dominic P; Laman, Eugene; Loua, Kovana M; Conway, David J
2016-05-12
In regions where malaria is endemic, individuals are often infected with multiple distinct parasite genotypes, a situation that may impact on evolution of parasite virulence and drug resistance. Most approaches to studying genotypic diversity have involved analysis of a modest number of polymorphic loci, although whole genome sequencing enables a broader characterisation of samples. PCR-based microsatellite typing of a panel of ten loci was performed on Plasmodium falciparum in 95 clinical isolates from a highly endemic area in the Republic of Guinea, to characterize within-isolate genetic diversity. Separately, single nucleotide polymorphism (SNP) data from genome-wide short-read sequences of the same samples were used to derive within-isolate fixation indices (Fws), an inverse measure of diversity within each isolate compared to overall local genetic diversity. The latter indices were compared with the microsatellite results, and also with indices derived by randomly sampling modest numbers of SNPs. As expected, the number of microsatellite loci with more than one allele in each isolate was highly significantly inversely correlated with the genome-wide Fws fixation index (r = -0.88, P < 0.001). However, the microsatellite analysis revealed that most isolates contained mixed genotypes, even those that had no detectable genome sequence heterogeneity. Random sampling of different numbers of SNPs showed that an Fws index derived from ten or more SNPs with minor allele frequencies of >10% had high correlation (r > 0.90) with the index derived using all SNPs. Different types of data give highly correlated indices of within-infection diversity, although PCR-based analysis detects low-level minority genotypes not apparent in bulk sequence analysis. When whole-genome data are not obtainable, quantitative assay of ten or more SNPs can yield a reasonably accurate estimate of the within-infection fixation index (Fws).
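As a rough illustration of the within-isolate fixation index described in this abstract, the sketch below uses the commonly cited form Fws = 1 - Hw/Hs, where Hw is the heterozygosity within one isolate and Hs the heterozygosity of the local population. It assumes biallelic SNPs and a simple mean over SNPs; published pipelines differ in detail (for example by binning SNPs by minor allele frequency), and the function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)

def fws(within_freqs, population_freqs):
    """Simplified Fws = 1 - Hw/Hs, averaging heterozygosity over SNPs.

    within_freqs: per-SNP allele frequencies among reads from one isolate.
    population_freqs: per-SNP allele frequencies in the local population.
    Assumes at least one SNP is polymorphic in the population (Hs > 0).
    """
    h_w = sum(heterozygosity(p) for p in within_freqs) / len(within_freqs)
    h_s = sum(heterozygosity(p) for p in population_freqs) / len(population_freqs)
    return 1 - h_w / h_s

population = [0.5, 0.3, 0.2]
print(fws([0.0, 1.0, 0.0], population))   # single-genotype isolate
print(fws([0.5, 0.3, 0.2], population))   # isolate as diverse as the population
```

An isolate with no within-sample heterogeneity scores 1, and one as diverse as the whole population scores 0, which is why the abstract calls the index an inverse measure of within-isolate diversity.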
Directing an artificial zinc finger protein to new targets by fusion to a non-DNA-binding domain.
Lim, Wooi F; Burdach, Jon; Funnell, Alister P W; Pearson, Richard C M; Quinlan, Kate G R; Crossley, Merlin
2016-04-20
Transcription factors are often regarded as having two separable components: a DNA-binding domain (DBD) and a functional domain (FD), with the DBD thought to determine target gene recognition. While this holds true for DNA binding in vitro, it appears that in vivo FDs can also influence genomic targeting. We fused the FD from the well-characterized transcription factor Krüppel-like Factor 3 (KLF3) to an artificial zinc finger (AZF) protein originally designed to target the Vascular Endothelial Growth Factor-A (VEGF-A) gene promoter. We compared genome-wide occupancy of the KLF3FD-AZF fusion to that observed with AZF. AZF bound to the VEGF-A promoter as predicted, but was also found to occupy approximately 25,000 other sites, a large number of which contained the expected AZF recognition sequence, GCTGGGGGC. Interestingly, addition of the KLF3 FD re-distributes the fusion protein to new sites, with total DNA occupancy detected at around 50,000 sites. A portion of these sites correspond to known KLF3-bound regions, while others contained sequences similar but not identical to the expected AZF recognition sequence. These results show that FDs can influence and may be useful in directing AZF DNA-binding proteins to specific targets and provide insights into how natural transcription factors operate. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
RefCNV: Identification of Gene-Based Copy Number Variants Using Whole Exome Sequencing.
Chang, Lun-Ching; Das, Biswajit; Lih, Chih-Jian; Si, Han; Camalier, Corinne E; McGregor, Paul M; Polley, Eric
2016-01-01
With rapid advances in DNA sequencing technologies, whole exome sequencing (WES) has become a popular approach for detecting somatic mutations in oncology studies. The initial intent of WES was to characterize single nucleotide variants, but it was observed that the number of sequencing reads that mapped to a genomic region correlated with the DNA copy number variants (CNVs). We propose a method RefCNV that uses a reference set to estimate the distribution of the coverage for each exon. The construction of the reference set includes an evaluation of the sources of variability in the coverage distribution. We observed that the processing steps had an impact on the coverage distribution. For each exon, we compared the observed coverage with the expected normal coverage. Thresholds for determining CNVs were selected to control the false-positive error rate. RefCNV prediction correlated significantly (r = 0.96-0.86) with CNV measured by digital polymerase chain reaction for MET (7q31), EGFR (7p12), or ERBB2 (17q12) in 13 tumor cell lines. The genome-wide CNV analysis showed a good overall correlation (Spearman's coefficient = 0.82) between RefCNV estimation and publicly available CNV data in Cancer Cell Line Encyclopedia. RefCNV also showed better performance than three other CNV estimation methods in genome-wide CNV analysis.
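The general idea of comparing an exon's observed coverage with the coverage expected from a reference set can be sketched as follows. This is not RefCNV's actual statistic, which the abstract does not specify: the z-score, the cutoff of 3 and the function name `call_exon` are assumptions made for illustration, and coverage values are assumed to be normalized for library size beforehand.

```python
from statistics import mean, stdev

def call_exon(observed, reference, z_cutoff=3.0):
    """Classify one exon by comparing its coverage with a reference set.

    observed: normalized coverage of the exon in the test sample.
    reference: normalized coverage of the same exon in normal samples
               (at least two values, not all identical).
    z_cutoff: hypothetical threshold; a real method would tune it to
              control the false-positive rate.
    """
    mu = mean(reference)
    sd = stdev(reference)
    z = (observed - mu) / sd
    if z > z_cutoff:
        return "gain"
    if z < -z_cutoff:
        return "loss"
    return "neutral"

reference = [98, 100, 102, 101, 99]
for coverage in (150, 100, 50):
    print(coverage, call_exon(coverage, reference))
```

Choosing the cutoff from the spread of the reference set is what ties the threshold to a false-positive rate; a tighter reference distribution allows smaller copy number changes to be called.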
Gene conversion as a secondary mechanism of short interspersed element (SINE) evolution
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kass, D.H.; Batzer, M.A.; Deininger, P.L.
The Alu repetitive family of short interspersed elements (SINEs) in primates can be subdivided into distinct subfamilies by specific diagnostic nucleotide changes. The older subfamilies are generally very abundant, while the younger subfamilies have fewer copies. Some of the youngest Alu elements are absent in the orthologous loci of nonhuman primates, indicative of recent retroposition events, the primary mode of SINE evolution. PCR analysis of one young Alu subfamily (Sb2) member found in the low-density lipoprotein receptor gene apparently revealed the presence of this element in the green monkey, orangutan, gorilla, and chimpanzee genomes, as well as the human genome. However, sequence analysis of these genomes revealed that a highly mutated, older, primate-specific Alu element was present at this position in the nonhuman primates. Comparison of the flanking DNA sequences upstream of this Alu insertion corresponded to evolution expected for standard primate phylogeny, but comparison of the Alu repeat sequences revealed that the human element departed from this phylogeny. The change in the human sequence apparently occurred by a gene conversion event only within the Alu element itself, converting it from one of the oldest to one of the youngest Alu subfamilies. Although gene conversions of Alu elements are clearly very rare, this finding shows that such events can occur and contribute to specific cases of SINE subfamily evolution.
Detection of microRNAs in color space.
Marco, Antonio; Griffiths-Jones, Sam
2012-02-01
Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs. Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3' end of the read such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs. A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at: http://www.mirbase.org/tools/seqtrimmap/ Contact: antonio.marco@manchester.ac.uk Supplementary data are available at Bioinformatics online.
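The color-space representation described above can be made concrete with a short sketch. It uses the standard two-base encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names are hypothetical, and anchoring the read on its own first base is a simplification assumed here; it is not taken from SeqTrimMap.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def encode_colors(seq):
    """Return the first base followed by one color (0-3) per dinucleotide."""
    seq = seq.upper()
    colors = [str(_CODE[a] ^ _CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)

def decode_colors(cs_read):
    """Invert encode_colors by walking the colors from the known first base."""
    bases = [cs_read[0]]
    for color in cs_read[1:]:
        bases.append(_BASE[_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)

read = encode_colors("ATGGC")
print(read)                    # A3103
print(decode_colors(read))     # ATGGC
print(decode_colors("A3203"))  # ATCCG: one miscalled color alters every later base
```

Because decoding depends on the preceding base, a single miscalled color corrupts the rest of the translated read, whereas a true single-nucleotide change alters two adjacent colors. This is the property behind the claimed ability to separate errors from polymorphisms, and it also suggests why losing the anchoring first nucleotide matters.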
Gao, Yangchun; Li, Shiguo; Zhan, Aibin
2018-04-01
Invasive species cause severe ecological, environmental and economic damage globally. A comprehensive understanding of invasion mechanisms, particularly the genetic bases of micro-evolutionary processes responsible for invasion success, is essential for reducing potential damage caused by invasive species. The golden star tunicate, Botryllus schlosseri, has become a model species in invasion biology, mainly owing to its highly invasive nature and small, well-sequenced genome. However, genome-wide genetic markers have not been well developed in this highly invasive species, thus limiting the comprehensive understanding of genetic mechanisms of invasion success. Using restriction site-associated DNA (RAD) tag sequencing, here we developed a high-quality resource of 14,119 out of 158,821 SNPs for B. schlosseri. These SNPs were relatively evenly distributed across each chromosome. SNP annotations showed that the majority of SNPs (63.20%) were located in intergenic regions, and 21.51% and 14.58% were located in introns and exons, respectively. In addition, the potential use of the developed SNPs for population genomics studies was preliminarily assessed through estimates of observed heterozygosity (HO), expected heterozygosity (HE), nucleotide diversity (π), Wright's inbreeding coefficient (FIS) and effective population size (Ne). Our SNP resource provides future studies with genome-wide genetic markers for genetic and genomic investigations, such as the genetic bases of micro-evolutionary processes responsible for invasion success.
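Three of the summary statistics named above have simple textbook forms for a single locus, sketched here under stated assumptions: diploid genotypes, HE computed as one minus the sum of squared allele frequencies without a small-sample correction, and FIS as 1 - HO/HE. The function name `locus_stats` and the toy genotypes are hypothetical and are not taken from the study.

```python
from collections import Counter

def locus_stats(genotypes):
    """Return (H_O, H_E, F_IS) for one SNP.

    genotypes -- list of (allele, allele) tuples, one per diploid individual
    """
    n = len(genotypes)
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    h_obs = sum(a != b for a, b in genotypes) / n
    # Allele frequencies from the 2n sampled chromosomes.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * n
    # Expected heterozygosity under Hardy-Weinberg proportions (uncorrected).
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    # Inbreeding coefficient: relative deficit of heterozygotes.
    f_is = 1.0 - h_obs / h_exp if h_exp > 0 else float("nan")
    return h_obs, h_exp, f_is

print(locus_stats([("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")]))
```

In practice these values are averaged over many SNPs and populations; the per-locus form is shown only to make the definitions explicit.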
Moraes, Luis E.; Blow, Matthew J.; Hawley, Erik R.; ...
2017-02-16
Cyanobacteria have the potential to produce bulk and fine chemicals, and members belonging to Nostoc sp. have received particular attention due to their relatively fast growth rate and the relative ease with which they can be harvested. Nostoc punctiforme is an aerobic, motile, Gram-negative, filamentous cyanobacterium that has been studied intensively to enhance our understanding of microbial carbon and nitrogen fixation. The genome of the type strain N. punctiforme ATCC 29133 was sequenced in 2001 and the scientific community has used these genome data extensively since then. Advances in bioinformatics tools for sequence annotation and the importance of this organism prompted us to resequence and reanalyze its genome and to make both the initial and the improved annotation available to the scientific community. The new draft genome has a total size of 9.1 Mbp and consists of 65 contiguous pieces of DNA with a GC content of 41.38% and 7664 protein-coding genes. Furthermore, the resequenced genome is slightly (5152 bp) larger and contains 987 more genes with functional prediction when compared to the previously published version. We deposited the annotation of both genomes in the Department of Energy’s IMG database to facilitate easy genome exploration by the scientific community without the need for in-depth bioinformatics skills. We expect that facilitated access and the ability to search the N. punctiforme ATCC 29133 genome for genes of interest will significantly facilitate metabolic engineering and genome prospecting efforts and ultimately the synthesis of biofuels and natural products from this keystone organism and closely related cyanobacteria.
Rizk, Francine; Laverdure, Sylvain; d'Alençon, Emmanuelle; Bossin, Hervé; Dupressoir, Thierry
2018-01-01
The Lepidopteran ambidensovirus 1 isolated from Junonia coenia (hereafter JcDV) is an invertebrate parvovirus considered as a viral transduction vector as well as a potential tool for the biological control of insect pests. Previous work showed that JcDV-based circular plasmids experimentally integrate into insect cell genomic DNA. In order to approach the natural conditions of infection and possible integration, we generated linear JcDV-gfp-based molecules which were transfected into non-permissive Spodoptera frugiperda (Sf9) cultured cells. Cells were monitored for the expression of green fluorescent protein (GFP) and DNA was analyzed for integration of transduced viral sequences. Non-structural protein modulation of the VP-gene cassette promoter activity was additionally assayed. We show that linear JcDV-derived molecules are capable of long-term genomic integration and sustained transgene expression in Sf9 cells. As expected, only the deletion of both inverted terminal repeats (ITR) or the polyadenylation signals of NS and VP genes dramatically impairs the global transduction/expression efficiency. However, all the integrated viral sequences we characterized appear "scrambled" whatever the viral content of the transfected vector. Despite strong GFP expression, we were unable to recover any full sequence of the original constructs and found rearranged viral and non-viral sequences as well. Cellular flanking sequences were identified as non-coding ones. On the other hand, the kinetics of GFP expression over time led us to investigate the apparent down-regulation by non-structural proteins of the VP-gene cassette promoter. Altogether, our results show that JcDV-derived sequences included in linear DNA molecules are able to drive efficiently the integration and expression of a foreign gene into the genome of insect cells, whatever their composition, provided that at least one ITR is present. However, the transfected sequences were extensively rearranged with cellular DNA during or after random integration in the host cell genome. Lastly, the non-structural proteins seem to participate in the regulation of p9 promoter activity rather than in the integration of viral sequences.
Genome sequence of Phytophthora ramorum: implications for management
Brett Tyler; Sucheta Tripathy; Nik Grunwald; Kurt Lamour; Kelly Ivors; Matteo Garbelotto; Daniel Rokhsar; Nik Putnam; Igor Grigoriev; Jeffrey Boore
2006-01-01
A draft genome sequence has been determined for Phytophthora ramorum, together with a draft sequence of the soybean pathogen Phytophthora sojae. The P. ramorum genome was sequenced to a depth of 7-fold coverage, while the P. sojae genome was sequenced to a depth of 9-fold coverage. The genome...
ERIC Educational Resources Information Center
Flowers, Susan K.; Easter, Carla; Holmes, Andrea; Cohen, Brian; Bednarski, April E.; Mardis, Elaine R.; Wilson, Richard K.; Elgin, Sarah C. R.
2005-01-01
Sequencing of the human genome has ushered in a new era of biology. The technologies developed to facilitate the sequencing of the human genome are now being applied to the sequencing of other genomes. In 2004, a partnership was formed between Washington University School of Medicine Genome Sequencing Center's Outreach Program and Washington…
Evolution of Functional Diversification within Quasispecies
Colizzi, Enrico Sandro; Hogeweg, Paulien
2014-01-01
According to quasispecies theory, high mutation rates limit the amount of information genomes can store (Eigen’s Paradox), whereas genomes with higher degrees of neutrality may be selected even at the expense of higher replication rates (the “survival of the flattest” effect). Introducing a complex genotype-to-phenotype map, such as RNA folding, epitomizes this effect because of the existence of neutral networks and their exploitation by evolution, affecting both population structure and genome composition. We reexamine these classical results in the light of an RNA-based system that can evolve its own ecology. Contrary to expectations, we find that quasispecies evolving at high mutation rates are steep and characterized by one master sequence. Importantly, the analysis of the system and the characterization of the evolved quasispecies reveal the emergence of functionalities as phenotypes of nonreplicating genotypes, whose presence is crucial for the overall viability and stability of the system. In other words, the master sequence codes for the information of the entire ecosystem, whereas the decoding happens, stochastically, through mutations. We show that this solution quickly outcompetes strategies based on genomes with a high degree of neutrality. In conclusion, individually coded but ecosystem-based diversity evolves and persists indefinitely close to the Information Threshold. PMID:25056399
Genome skimming identifies polymorphism in tern populations and species
2012-01-01
Background Terns (Charadriiformes: Sterninae) are a lineage of cosmopolitan shorebirds with a disputed evolutionary history that comprises several species of conservation concern. As a non-model system in genetics, previous study has left most of the nuclear genome unexplored, and population-level studies are limited to only 15% of the world's species of terns and noddies. Screening of polymorphic nuclear sequence markers is needed to enhance genetic resolution because of supposed low mitochondrial mutation rate, documentation of nuclear insertion of hypervariable mitochondrial regions, and limited success of microsatellite enrichment in terns. Here, we investigated the phylogenetic and population genetic utility for terns and relatives of a variety of nuclear markers previously developed for other birds and spanning the nuclear genome. Markers displaying a variety of mutation rates from both the nuclear and mitochondrial genome were tested and prioritized according to optimal cross-species amplification and extent of genetic polymorphism between (1) the main tern clades and (2) individual Royal Terns (Thalasseus maxima) breeding on the US East Coast. Results Results from this genome skimming effort yielded four new nuclear sequence-based markers for tern phylogenetics and 11 intra-specific polymorphic markers. Further, comparison between the two genomes indicated a phylogenetic conflict at the base of terns, involving the inclusion (mitochondrial) or exclusion (nuclear) of the Angel Tern (Gygis alba). Although limited mitochondrial variation was confirmed, both nuclear markers and a short tandem repeat in the mitochondrial control region indicated the presence of considerable genetic variation in Royal Terns at a regional scale. Conclusions These data document the value of intronic markers to the study of terns and allies. 
We expect that these and additional markers attained through next-generation sequencing methods will accurately map the genetic origin and species history of this group of birds. PMID:22333071
El-Mogharbel, Nisrine; Wakefield, Matthew; Deakin, Janine E; Tsend-Ayush, Enkhjargal; Grützner, Frank; Alsop, Amber; Ezaz, Tariq; Marshall Graves, Jennifer A
2007-01-01
We isolated and characterized a cluster of platypus DMRT genes and compared their arrangement, location, and sequence across vertebrates. The DMRT gene cluster on human 9p24.3 harbors, in order, DMRT1, DMRT3, and DMRT2, which share a DM domain. DMRT1 is highly conserved and involved in sexual development in vertebrates, and deletions in this region cause sex reversal in humans. Sequence comparisons of DMRT genes between species have been valuable in identifying exons, control regions, and conserved nongenic regions (CNGs). The addition of platypus sequences is expected to be particularly valuable, since monotremes fill a gap in the vertebrate genome coverage. We therefore isolated and fully sequenced platypus BAC clones containing DMRT3 and DMRT2 as well as DMRT1 and then generated multispecies alignments and ran prediction programs followed by experimental verification to annotate this gene cluster. We found that the three genes have 58-66% identity to their human orthologues, lie in the same order as in other vertebrates, and colocate on 1 of the 10 platypus sex chromosomes, X5. We also predict that optimal annotation of the newly sequenced platypus genome will be challenging. The analysis of platypus sequence revealed differences in structure and sequence of the DMRT gene cluster. Multispecies comparison was particularly effective for detecting CNGs, revealing several novel potential regulatory regions within DMRT3 and DMRT2 as well as DMRT1. RT-PCR indicated that platypus DMRT1 and DMRT3 are expressed specifically in the adult testis (and not ovary), but DMRT2 has a wider expression profile, as it does for other mammals. The platypus DMRT1 expression pattern, and its location on an X chromosome, suggests an involvement in monotreme sexual development.
Sun, Genlou; Komatsuda, Takao
2010-08-01
It is well known that Elymus arose through hybridization between representatives of different genera. Cytogenetic analyses show that all its members include the St genome in combination with one or more of four other genomes, the H, Y, P, and W genomes. The origins of the H, P, and W genomes are known, but the origin of the Y genome is not. We analyzed the single-copy nuclear gene coding for elongation factor G (EF-G) from 28 accessions of polyploid Elymus species and 45 accessions of diploid Triticeae species in order to investigate the origin of the Y genome and its relationship to other genomes in the tribe Triticeae. Sequence comparisons among the St, H, Y, P, W, and E genomes detected genome-specific polymorphisms at 66 nucleotide positions. The St and Y genomes are relatively dissimilar. The phylogeny of the Y genome sequences was investigated for the first time. They were most similar to the W genome sequences. The Y genome sequences were placed in two different groups. These two groups were included in an unresolved clade that included the W and E sequences as well as sequences from many annual species. The H genome sequences were in a clade with the F, P, and Ns genome sequences as sister groups. These two clades were more closely related to each other and to the L and Xp genomes than they were to the St genome sequences. These data support the hypothesis that the Y genome evolved in a diploid species and has a different origin from the St genome. Copyright 2010 Elsevier Inc. All rights reserved.
Company profile: Complete Genomics Inc.
Reid, Clifford
2011-02-01
Complete Genomics Inc. is a life sciences company that focuses on complete human genome sequencing. It is taking a completely different approach to DNA sequencing than other companies in the industry. Rather than building a general-purpose platform for sequencing all organisms and all applications, it has focused on a single application - complete human genome sequencing. The company's Complete Genomics Analysis Platform (CGA™ Platform) comprises an integrated package of biochemistry, instrumentation and software that sequences human genomes at the highest quality, lowest cost and largest scale available. Complete Genomics offers a turnkey service that enables customers to outsource their human genome sequencing to the company's genome sequencing center in Mountain View, CA, USA. Customers send in their DNA samples, the company does all the library preparation, DNA sequencing, assembly and variant analysis, and customers receive research-ready data that they can use for biological discovery.
[Genomics basis of Arthrobacter spp. environmental adaptability– A review].
Zhang, Xinjian; Zhang, Guangzhi; Yang, Hetong
2016-04-04
Arthrobacter species are ecologically diverse and can survive in various environments. Many strains of these species are metabolically versatile and can degrade many environmental pollutants. Arthrobacter species are thought to play important roles in the catabolism of environmental pollutants in nature. In recent years, the genomes of many Arthrobacter strains have been sequenced, which provides comprehensive information to clarify the molecular mechanisms related to the environmental adaptability of Arthrobacter species. These genomic findings revealed several features that are commonly observed in Arthrobacter strains and that allow for survival under stressful conditions. These include an array of genes associated with sigma factors and responses to oxidative, osmotic, starvation and temperature stresses. The genomic basis of their environmental adaptability is reviewed here, which is expected to provide useful information for applying Arthrobacter strains in pollution remediation and to shed some light on research into environmental adaptability in other bacteria.
Signatures of Long-Term Balancing Selection in Human Genomes
de Filippo, Cesare; Teixeira, João C; Schmidt, Joshua M; Kleinert, Philip; Meyer, Diogo; Andrés, Aida M
2018-01-01
Balancing selection maintains advantageous diversity in populations through various mechanisms. Although extensively explored from a theoretical perspective, an empirical understanding of its prevalence and targets lags behind our knowledge of positive selection. Here, we describe the Non-central Deviation (NCD), a simple yet powerful statistic to detect long-term balancing selection (LTBS) that quantifies how close frequencies are to expectations under LTBS, and provides the basis for a neutrality test. NCD can be applied to a single locus or genomic data, and can be implemented considering only polymorphisms (NCD1) or also considering fixed differences with respect to an outgroup (NCD2) species. Incorporating fixed differences improves power, and NCD2 has higher power to detect LTBS in humans under different frequencies of the balanced allele(s) than other available methods. Applied to genome-wide data from African and European human populations, in both cases using chimpanzee as an outgroup, NCD2 shows that, albeit not prevalent, LTBS affects a sizable portion of the genome: ∼0.6% of analyzed genomic windows and 0.8% of analyzed positions. Significant windows (P < 0.0001) contain 1.6% of SNPs in the genome, which disproportionately fall within exons and change protein sequence, but are not enriched in putatively regulatory sites. These windows overlap ∼8% of the protein-coding genes, and these have a larger number of transcripts than expected by chance even after controlling for gene length. Our catalog includes known targets of LTBS but a majority of them (90%) are novel. As expected, immune-related genes are among those with the strongest signatures, although most candidates are involved in other biological functions, suggesting that LTBS potentially influences diverse human phenotypes. PMID:29608730
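A statistic that "quantifies how close frequencies are to expectations under LTBS" can be sketched as a root-mean-square deviation of allele frequencies from a target frequency. The exact form below is an assumption made for illustration and is not quoted from the abstract: the function name `ncd`, the default target of 0.5, and the treatment of fixed differences as sites with frequency 0 are all hypothetical choices, so the original publication should be consulted for the authoritative definition.

```python
import math

def ncd(minor_allele_freqs, target_freq=0.5, n_fixed_differences=0):
    """Root-mean-square deviation of site frequencies from target_freq.

    minor_allele_freqs  -- minor allele frequency of each polymorphic site in a window
    target_freq         -- assumed equilibrium frequency under balancing selection
    n_fixed_differences -- 0 for a polymorphism-only (NCD1-style) value; a positive
                           count adds sites with frequency 0 (NCD2-style)
    """
    freqs = list(minor_allele_freqs) + [0.0] * n_fixed_differences
    if not freqs:
        raise ValueError("window contains no informative sites")
    return math.sqrt(sum((p - target_freq) ** 2 for p in freqs) / len(freqs))

window = [0.5, 0.5]
print(ncd(window))                         # 0.0: every site sits at the target
print(ncd(window, n_fixed_differences=2))  # larger once fixed differences count
```

Under this sketch, lower values indicate windows whose frequencies cluster near the target, and adding fixed differences penalizes windows that diverge from the outgroup without being polymorphic.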
Wawrousek, Karen; Noble, Scott; Korlach, Jonas; ...
2014-12-05
We report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of the carboxydotrophic lifestyle under anaerobic conditions, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzyme complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Collectively, the genome sequence and its analysis reveal the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.
Demographic history and rare allele sharing among human populations.
Gravel, Simon; Henn, Brenna M; Gutenkunst, Ryan N; Indap, Amit R; Marth, Gabor T; Clark, Andrew G; Yu, Fuli; Gibbs, Richard A; Bustamante, Carlos D
2011-07-19
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to exceed that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
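The site frequency spectrum that underlies the demographic inference above is a simple tally, sketched here for a single population. The function names and the toy input are hypothetical; the study itself works with joint spectra across populations and with corrections for low-coverage data that this sketch does not attempt.

```python
from collections import Counter

def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry k-1 is the number of sites whose derived allele
    appears on exactly k of the n sampled chromosomes, for k = 1..n-1.
    Monomorphic sites (count 0 or n) are ignored."""
    tally = Counter(derived_counts)
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]

def fold(sfs):
    """Folded SFS: combine classes k and n-k when the ancestral state is unknown."""
    n = len(sfs) + 1
    folded = []
    for k in range(1, n // 2 + 1):
        j = n - k
        folded.append(sfs[k - 1] if j == k else sfs[k - 1] + sfs[j - 1])
    return folded

# Hypothetical derived-allele counts at seven sites in a sample of six chromosomes.
counts = [1, 1, 1, 2, 5, 3, 1]
sfs = site_frequency_spectrum(counts, 6)
print(sfs)        # [4, 1, 1, 0, 1]
print(fold(sfs))  # [5, 1, 1]
```

An excess of entries in the lowest-frequency classes, as in this toy spectrum, is the kind of pattern the abstract summarizes when it reports that most variable sites are rare.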
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stewart, Jeffrey S.
The Human Genome Project has already generated nearly $1 trillion of U.S. economic activity. Lawrence Livermore National Laboratory (LLNL) was a co-leader in one of the biggest biological research efforts in history, the Human Genome Project. This ambitious research effort set out to sequence the approximately 3 billion nucleotides in the human genome, an effort many thought was nearly impossible. Deoxyribonucleic acid (DNA) was discovered in 1869, and by 1943 came the discovery that DNA was a molecule that encodes the genetic instructions used in the development and functioning of living organisms and many viruses. To make full use of the information, scientists needed to first sequence the billions of nucleotides to begin linking them to genetic traits and illnesses, and eventually more effective treatments. New medical discoveries and improved agricultural productivity were some of the expected benefits. While the potential benefits were vast, the timeline (over a decade) and cost ($3.8 billion) exceeded what the private sector would normally attempt, especially when this would only be the first phase on the path to new discoveries and market opportunities. The Department of Energy believed its best research laboratories could meet this Grand Challenge and soon convinced the National Institutes of Health to formally propose the Human Genome Project to the federal government. The U.S. government accepted the risk and challenge to potentially create new healthcare and food discoveries that could benefit the world and U.S. industry.
Demographic history and rare allele sharing among human populations
Gravel, Simon; Henn, Brenna M.; Gutenkunst, Ryan N.; Indap, Amit R.; Marth, Gabor T.; Clark, Andrew G.; Yu, Fuli; Gibbs, Richard A.; Bustamante, Carlos D.; Altshuler, David L.; Durbin, Richard M.; Abecasis, Gonçalo R.; Bentley, David R.; Chakravarti, Aravinda; Clark, Andrew G.; Collins, Francis S.; De La Vega, Francisco M.; Donnelly, Peter; Egholm, Michael; Flicek, Paul; Gabriel, Stacey B.; Gibbs, Richard A.; Knoppers, Bartha M.; Lander, Eric S.; Lehrach, Hans; Mardis, Elaine R.; McVean, Gil A.; Nickerson, Debbie A.; Peltonen, Leena; Schafer, Alan J.; Sherry, Stephen T.; Wang, Jun; Wilson, Richard K.; Gibbs, Richard A.; Deiros, David; Metzker, Mike; Muzny, Donna; Reid, Jeff; Wheeler, David; Wang, Jun; Li, Jingxiang; Jian, Min; Li, Guoqing; Li, Ruiqiang; Liang, Huiqing; Tian, Geng; Wang, Bo; Wang, Jian; Wang, Wei; Yang, Huanming; Zhang, Xiuqing; Zheng, Huisong; Lander, Eric S.; Altshuler, David L.; Ambrogio, Lauren; Bloom, Toby; Cibulskis, Kristian; Fennell, Tim J.; Gabriel, Stacey B.; Jaffe, David B.; Shefler, Erica; Sougnez, Carrie L.; Bentley, David R.; Gormley, Niall; Humphray, Sean; Kingsbury, Zoya; Koko-Gonzales, Paula; Stone, Jennifer; McKernan, Kevin J.; Costa, Gina L.; Ichikawa, Jeffry K.; Lee, Clarence C.; Sudbrak, Ralf; Lehrach, Hans; Borodina, Tatiana A.; Dahl, Andreas; Davydov, Alexey N.; Marquardt, Peter; Mertes, Florian; Nietfeld, Wilfiried; Rosenstiel, Philip; Schreiber, Stefan; Soldatov, Aleksey V.; Timmermann, Bernd; Tolzmann, Marius; Egholm, Michael; Affourtit, Jason; Ashworth, Dana; Attiya, Said; Bachorski, Melissa; Buglione, Eli; Burke, Adam; Caprio, Amanda; Celone, Christopher; Clark, Shauna; Conners, David; Desany, Brian; Gu, Lisa; Guccione, Lorri; Kao, Kalvin; Kebbel, Andrew; Knowlton, Jennifer; Labrecque, Matthew; McDade, Louise; Mealmaker, Craig; Minderman, Melissa; Nawrocki, Anne; Niazi, Faheem; Pareja, Kristen; Ramenani, Ravi; Riches, David; Song, Wanmin; Turcotte, Cynthia; Wang, Shally; Mardis, Elaine R.; Wilson, Richard K.; Dooling, 
David; Fulton, Lucinda; Fulton, Robert; Weinstock, George; Durbin, Richard M.; Burton, John; Carter, David M.; Churcher, Carol; Coffey, Alison; Cox, Anthony; Palotie, Aarno; Quail, Michael; Skelly, Tom; Stalker, James; Swerdlow, Harold P.; Turner, Daniel; De Witte, Anniek; Giles, Shane; Gibbs, Richard A.; Wheeler, David; Bainbridge, Matthew; Challis, Danny; Sabo, Aniko; Yu, Fuli; Yu, Jin; Wang, Jun; Fang, Xiaodong; Guo, Xiaosen; Li, Ruiqiang; Li, Yingrui; Luo, Ruibang; Tai, Shuaishuai; Wu, Honglong; Zheng, Hancheng; Zheng, Xiaole; Zhou, Yan; Li, Guoqing; Wang, Jian; Yang, Huanming; Marth, Gabor T.; Garrison, Erik P.; Huang, Weichun; Indap, Amit; Kural, Deniz; Lee, Wan-Ping; Leong, Wen Fung; Quinlan, Aaron R.; Stewart, Chip; Stromberg, Michael P.; Ward, Alistair N.; Wu, Jiantao; Lee, Charles; Mills, Ryan E.; Shi, Xinghua; Daly, Mark J.; DePristo, Mark A.; Altshuler, David L.; Ball, Aaron D.; Banks, Eric; Bloom, Toby; Browning, Brian L.; Cibulskis, Kristian; Fennell, Tim J.; Garimella, Kiran V.; Grossman, Sharon R.; Handsaker, Robert E.; Hanna, Matt; Hartl, Chris; Jaffe, David B.; Kernytsky, Andrew M.; Korn, Joshua M.; Li, Heng; Maguire, Jared R.; McCarroll, Steven A.; McKenna, Aaron; Nemesh, James C.; Philippakis, Anthony A.; Poplin, Ryan E.; Price, Alkes; Rivas, Manuel A.; Sabeti, Pardis C.; Schaffner, Stephen F.; Shefler, Erica; Shlyakhter, Ilya A.; Cooper, David N.; Ball, Edward V.; Mort, Matthew; Phillips, Andrew D.; Stenson, Peter D.; Sebat, Jonathan; Makarov, Vladimir; Ye, Kenny; Yoon, Seungtai C.; Bustamante, Carlos D.; Clark, Andrew G.; Boyko, Adam; Degenhardt, Jeremiah; Gravel, Simon; Gutenkunst, Ryan N.; Kaganovich, Mark; Keinan, Alon; Lacroute, Phil; Ma, Xin; Reynolds, Andy; Clarke, Laura; Flicek, Paul; Cunningham, Fiona; Herrero, Javier; Keenen, Stephen; Kulesha, Eugene; Leinonen, Rasko; McLaren, William M.; Radhakrishnan, Rajesh; Smith, Richard E.; Zalunin, Vadim; Zheng-Bradley, Xiangqun; Korbel, Jan O.; Stütz, Adrian M.; Humphray, Sean; Bauer, Markus; 
Cheetham, R. Keira; Cox, Tony; Eberle, Michael; James, Terena; Kahn, Scott; Murray, Lisa; Chakravarti, Aravinda; Ye, Kai; De La Vega, Francisco M.; Fu, Yutao; Hyland, Fiona C. L.; Manning, Jonathan M.; McLaughlin, Stephen F.; Peckham, Heather E.; Sakarya, Onur; Sun, Yongming A.; Tsung, Eric F.; Batzer, Mark A.; Konkel, Miriam K.; Walker, Jerilyn A.; Sudbrak, Ralf; Albrecht, Marcus W.; Amstislavskiy, Vyacheslav S.; Herwig, Ralf; Parkhomchuk, Dimitri V.; Sherry, Stephen T.; Agarwala, Richa; Khouri, Hoda M.; Morgulis, Aleksandr O.; Paschall, Justin E.; Phan, Lon D.; Rotmistrovsky, Kirill E.; Sanders, Robert D.; Shumway, Martin F.; Xiao, Chunlin; McVean, Gil A.; Auton, Adam; Iqbal, Zamin; Lunter, Gerton; Marchini, Jonathan L.; Moutsianas, Loukas; Myers, Simon; Tumian, Afidalina; Desany, Brian; Knight, James; Winer, Roger; Craig, David W.; Beckstrom-Sternberg, Steve M.; Christoforides, Alexis; Kurdoglu, Ahmet A.; Pearson, John V.; Sinari, Shripad A.; Tembe, Waibhav D.; Haussler, David; Hinrichs, Angie S.; Katzman, Sol J.; Kern, Andrew; Kuhn, Robert M.; Przeworski, Molly; Hernandez, Ryan D.; Howie, Bryan; Kelley, Joanna L.; Melton, S. Cord; Abecasis, Gonçalo R.; Li, Yun; Anderson, Paul; Blackwell, Tom; Chen, Wei; Cookson, William O.; Ding, Jun; Kang, Hyun Min; Lathrop, Mark; Liang, Liming; Moffatt, Miriam F.; Scheet, Paul; Sidore, Carlo; Snyder, Matthew; Zhan, Xiaowei; Zöllner, Sebastian; Awadalla, Philip; Casals, Ferran; Idaghdour, Youssef; Keebler, John; Stone, Eric A.; Zilversmit, Martine; Jorde, Lynn; Xing, Jinchuan; Eichler, Evan E.; Aksay, Gozde; Alkan, Can; Hajirasouliha, Iman; Hormozdiari, Fereydoun; Kidd, Jeffrey M.; Sahinalp, S. 
Cenk; Sudmant, Peter H.; Mardis, Elaine R.; Chen, Ken; Chinwalla, Asif; Ding, Li; Koboldt, Daniel C.; McLellan, Mike D.; Dooling, David; Weinstock, George; Wallis, John W.; Wendl, Michael C.; Zhang, Qunyuan; Durbin, Richard M.; Albers, Cornelis A.; Ayub, Qasim; Balasubramaniam, Senduran; Barrett, Jeffrey C.; Carter, David M.; Chen, Yuan; Conrad, Donald F.; Danecek, Petr; Dermitzakis, Emmanouil T.; Hu, Min; Huang, Ni; Hurles, Matt E.; Jin, Hanjun; Jostins, Luke; Keane, Thomas M.; Le, Si Quang; Lindsay, Sarah; Long, Quan; MacArthur, Daniel G.; Montgomery, Stephen B.; Parts, Leopold; Stalker, James; Tyler-Smith, Chris; Walter, Klaudia; Zhang, Yujun; Gerstein, Mark B.; Snyder, Michael; Abyzov, Alexej; Balasubramanian, Suganthi; Bjornson, Robert; Du, Jiang; Grubert, Fabian; Habegger, Lukas; Haraksingh, Rajini; Jee, Justin; Khurana, Ekta; Lam, Hugo Y. K.; Leng, Jing; Mu, Xinmeng Jasmine; Urban, Alexander E.; Zhang, Zhengdong; Li, Yingrui; Luo, Ruibang; Marth, Gabor T.; Garrison, Erik P.; Kural, Deniz; Quinlan, Aaron R.; Stewart, Chip; Stromberg, Michael P.; Ward, Alistair N.; Wu, Jiantao; Lee, Charles; Mills, Ryan E.; Shi, Xinghua; McCarroll, Steven A.; Banks, Eric; DePristo, Mark A.; Handsaker, Robert E.; Hartl, Chris; Korn, Joshua M.; Li, Heng; Nemesh, James C.; Sebat, Jonathan; Makarov, Vladimir; Ye, Kenny; Yoon, Seungtai C.; Degenhardt, Jeremiah; Kaganovich, Mark; Clarke, Laura; Smith, Richard E.; Zheng-Bradley, Xiangqun; Korbel, Jan O.; Humphray, Sean; Cheetham, R. 
Keira; Eberle, Michael; Kahn, Scott; Murray, Lisa; Ye, Kai; De La Vega, Francisco M.; Fu, Yutao; Peckham, Heather E.; Sun, Yongming A.; Batzer, Mark A.; Konkel, Miriam K.; Walker, Jerilyn A.; Xiao, Chunlin; Iqbal, Zamin; Desany, Brian; Blackwell, Tom; Snyder, Matthew; Xing, Jinchuan; Eichler, Evan E.; Aksay, Gozde; Alkan, Can; Hajirasouliha, Iman; Hormozdiari, Fereydoun; Kidd, Jeffrey M.; Chen, Ken; Chinwalla, Asif; Ding, Li; McLellan, Mike D.; Wallis, John W.; Hurles, Matt E.; Conrad, Donald F.; Walter, Klaudia; Zhang, Yujun; Gerstein, Mark B.; Snyder, Michael; Abyzov, Alexej; Du, Jiang; Grubert, Fabian; Haraksingh, Rajini; Jee, Justin; Khurana, Ekta; Lam, Hugo Y. K.; Leng, Jing; Mu, Xinmeng Jasmine; Urban, Alexander E.; Zhang, Zhengdong; Gibbs, Richard A.; Bainbridge, Matthew; Challis, Danny; Coafra, Cristian; Dinh, Huyen; Kovar, Christie; Lee, Sandy; Muzny, Donna; Nazareth, Lynne; Reid, Jeff; Sabo, Aniko; Yu, Fuli; Yu, Jin; Marth, Gabor T.; Garrison, Erik P.; Indap, Amit; Leong, Wen Fung; Quinlan, Aaron R.; Stewart, Chip; Ward, Alistair N.; Wu, Jiantao; Cibulskis, Kristian; Fennell, Tim J.; Gabriel, Stacey B.; Garimella, Kiran V.; Hartl, Chris; Shefler, Erica; Sougnez, Carrie L.; Wilkinson, Jane; Clark, Andrew G.; Gravel, Simon; Grubert, Fabian; Clarke, Laura; Flicek, Paul; Smith, Richard E.; Zheng-Bradley, Xiangqun; Sherry, Stephen T.; Khouri, Hoda M.; Paschall, Justin E.; Shumway, Martin F.; Xiao, Chunlin; McVean, Gil A.; Katzman, Sol J.; Abecasis, Gonçalo R.; Blackwell, Tom; Mardis, Elaine R.; Dooling, David; Fulton, Lucinda; Fulton, Robert; Koboldt, Daniel C.; Durbin, Richard M.; Balasubramaniam, Senduran; Coffey, Allison; Keane, Thomas M.; MacArthur, Daniel G.; Palotie, Aarno; Scott, Carol; Stalker, James; Tyler-Smith, Chris; Gerstein, Mark B.; Balasubramanian, Suganthi; Chakravarti, Aravinda; Knoppers, Bartha M.; Abecasis, Gonçalo R.; Bustamante, Carlos D.; Gharani, Neda; Gibbs, Richard A.; Jorde, Lynn; Kaye, Jane S.; Kent, Alastair; Li, Taosha; McGuire, 
Amy L.; McVean, Gil A.; Ossorio, Pilar N.; Rotimi, Charles N.; Su, Yeyang; Toji, Lorraine H.; Tyler-Smith, Chris; Brooks, Lisa D.; Felsenfeld, Adam L.; McEwen, Jean E.; Abdallah, Assya; Juenger, Christopher R.; Clemm, Nicholas C.; Collins, Francis S.; Duncanson, Audrey; Green, Eric D.; Guyer, Mark S.; Peterson, Jane L.; Schafer, Alan J.; Abecasis, Gonçalo R.; Altshuler, David L.; Auton, Adam; Brooks, Lisa D.; Durbin, Richard M.; Gibbs, Richard A.; Hurles, Matt E.; McVean, Gil A.
2011-01-01
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence. PMID:21730125
Draft genome of the reindeer (Rangifer tarandus).
Li, Zhipeng; Lin, Zeshan; Ba, Hengxing; Chen, Lei; Yang, Yongzhi; Wang, Kun; Qiu, Qiang; Wang, Wen; Li, Guangyu
2017-12-01
The reindeer (Rangifer tarandus) is the only fully domesticated species in the Cervidae family, and it is the only cervid with a circumpolar distribution. Unlike all other cervids, female reindeer, as well as males, regularly grow cranial appendages (antlers, the defining characteristic of cervids). Moreover, reindeer milk contains more protein and less lactose than bovids' milk. A high-quality reference genome of this species will assist efforts to elucidate these and other important features in the reindeer. We obtained 615 gigabases (Gb) of usable sequences by filtering the low-quality reads of the raw data generated from the Illumina HiSeq 4000 platform, and a 2.64-Gb final assembly, representing 95.7% of the estimated genome (2.76 Gb according to k-mer analysis), including 92.6% of expected genes according to BUSCO analysis. The contig N50 and scaffold N50 sizes were 89.7 kilobases (kb) and 0.94 megabases (Mb), respectively. We annotated 21 555 protein-coding genes and 1.07 Gb of repetitive sequences by de novo and homology-based prediction. Homology-based searches detected 159 rRNA, 547 miRNA, 1339 snRNA, and 863 tRNA sequences in the genome of R. tarandus. The divergence time between R. tarandus and ancestors of Bos taurus and Capra hircus is estimated to be about 29.5 million years ago. Our results provide the first high-quality reference genome for the reindeer and a valuable resource for studying the evolution, domestication, and other unusual characteristics of the reindeer. © The Authors 2017. Published by Oxford University Press.
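The contig and scaffold N50 values quoted in this abstract summarize assembly contiguity. The following is a minimal sketch of how such a statistic is commonly computed, and of the completeness arithmetic (2.64 Gb of an estimated 2.76 Gb). The function name `n50` and the toy contig lengths are illustrative assumptions, not data or code from the reindeer assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (in kb); NOT values from the paper.
toy_contigs = [100, 80, 60, 40, 20]
print("toy N50 (kb):", n50(toy_contigs))

# Completeness figure from the abstract: assembled size / k-mer estimate.
print("assembled fraction (%):", round(100 * 2.64 / 2.76, 1))
```

Sorting once and accumulating from the longest sequence keeps the computation at O(n log n), which is ample for assemblies with millions of contigs.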
Boitard, Simon; Rodríguez, Willy; Jay, Flora; Mona, Stefano; Austerlitz, Frédéric
2016-01-01
Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles. PMID:26943927
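The folded allele frequency spectrum mentioned in this abstract can be computed from unphased, unpolarized SNP data because it only needs the minor allele count at each site. The sketch below assumes diploid genotypes coded as 0, 1 or 2 copies of an arbitrarily chosen allele. The function name `folded_sfs` and the toy matrix are hypothetical, and this is not the PopSizeABC implementation.

```python
from collections import Counter


def folded_sfs(genotypes):
    """Folded site frequency spectrum from a SNP-by-individual matrix.

    genotypes: list of rows, one per SNP; each entry is the number of
    copies (0, 1 or 2) of an arbitrarily chosen allele in one diploid
    individual. Phase and ancestral state are not needed.
    Returns a list whose index k holds the number of SNPs with minor
    allele count k, for k = 0 .. n (n = number of individuals).
    """
    n_ind = len(genotypes[0])
    n_chrom = 2 * n_ind
    counts = Counter()
    for row in genotypes:
        if len(row) != n_ind:
            raise ValueError("all SNP rows must have the same length")
        allele_count = sum(row)
        # Folding: count the rarer of the two alleles at this site.
        minor = min(allele_count, n_chrom - allele_count)
        counts[minor] += 1
    return [counts[k] for k in range(n_ind + 1)]


# Hypothetical toy data: 5 SNPs typed in 3 diploid individuals.
toy = [
    [0, 1, 0],
    [2, 2, 1],
    [1, 1, 1],
    [2, 0, 0],
    [2, 2, 2],
]
print(folded_sfs(toy))
```

Index 0 of the returned list counts monomorphic sites, which a real analysis would normally drop or filter by a minimum minor allele frequency, in line with the abstract's remark about using SNPs with common alleles.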
Approaches for in silico finishing of microbial genome sequences
Kremer, Frederico Schmitt; McBride, Alan John Alexander; Pinto, Luciano da Silva
2017-01-01
The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes remained as “drafts”, incomplete representations of the whole genetic content. The previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing. PMID:28898352
2011-01-01
Background Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence. Results An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii, the diploid source of the wheat D genome, which has a genome size of 4.02 Gb, of which 90% is repetitive sequence. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome.
To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated. Conclusion An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (http://avena.pw.usda.gov/wheatD/agsnp.shtml). PMID:21266061
Chromatin accessibility prediction via a hybrid deep convolutional neural network.
Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui
2018-03-01
A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparison with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. 
Aokic, Jun-ya; Kawase, Junya; Hamada, Kazuhisa; Fujimoto, Hiroshi; Yamamoto, Ikki; Usuki, Hironori
2018-01-01
Greater amberjack (Seriola dumerili) is distributed in tropical and temperate waters worldwide and is an important aquaculture fish. We carried out de novo sequencing of the greater amberjack genome to construct a reference genome sequence to identify single nucleotide polymorphisms (SNPs) for breeding amberjack by marker-assisted or gene-assisted selection as well as to identify functional genes for biological traits. We obtained 200 times coverage and constructed a high-quality genome assembly using next generation sequencing technology. The assembled sequences were aligned onto a yellowtail (Seriola quinqueradiata) radiation hybrid (RH) physical map by sequence homology. A total of 215 of the longest amberjack sequences, with a total length of 622.8 Mbp (92% of the total length of the genome scaffolds), were lined up on the yellowtail RH map. We resequenced the whole genomes of 20 greater amberjacks and mapped the resulting sequences onto the reference genome sequence. About 186,000 nonredundant SNPs were successfully ordered on the reference genome. Further, we found differences in the genome structural variations between two greater amberjack populations using BreakDancer. We also analyzed the greater amberjack transcriptome and mapped the annotated sequences onto the reference genome sequence. PMID:29785397
Rapid and accurate pyrosequencing of angiosperm plastid genomes
Moore, Michael J; Dhingra, Amit; Soltis, Pamela S; Shaw, Regina; Farmerie, William G; Folta, Kevin M; Soltis, Douglas E
2006-01-01
Background Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae). Results More than 99.75% of each plastid genome was simultaneously obtained during two GS 20 sequence runs, to an average depth of coverage of 24.6× in Nandina and 17.3× in Platanus. The Nandina and Platanus plastid genomes shared essentially identical gene complements and possessed the typical angiosperm plastid structure and gene arrangement. To assess the accuracy of the GS 20 sequence, over 45 kilobases of sequence were generated for each genome using conventional sequencing. Overall error rates of 0.043% and 0.031% were observed in GS 20 sequence for Nandina and Platanus, respectively. More than 97% of all observed errors were associated with homopolymer runs, with ~60% of all errors associated with homopolymer runs of 5 or more nucleotides and ~50% of all errors associated with regions of extensive homopolymer runs. No substitution errors were present in either genome. Error rates were generally higher in the single-copy and noncoding regions of both plastid genomes relative to the inverted repeat and coding regions. Conclusion Highly accurate and essentially complete sequence information was obtained for the Nandina and Platanus plastid genomes using the GS 20 System. 
More importantly, the high accuracy observed in the GS 20 plastid genome sequence was generated for a significant reduction in time and cost over traditional shotgun-based genome sequencing techniques, although with approximately half the coverage of previously reported GS 20 de novo genome sequence. The GS 20 should be broadly applicable to angiosperm plastid genome sequencing, and therefore promises to expand the scale of plant genetic and phylogenetic research dramatically. PMID:16934154
Statistical methods for detecting periodic fragments in DNA sequence data
2011-01-01
Background Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~ 21% and ~ 19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. 
The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. Reviewers This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight. PMID:21527008
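To make the idea of Fourier-based period detection concrete, the sketch below encodes a DNA sequence as a 0/1 indicator of the NPS dinucleotides {AA, TT, TA} and measures spectral power at a single integer period. This is a plain Fourier projection written for illustration. It is not the IPDFT or Hybrid measure evaluated in the abstract, it includes no blockwise bootstrap, and the function names and the toy sequence are assumptions.

```python
import cmath

# Dinucleotides named in the abstract as nucleosome positioning signals.
NPS_DINUCLEOTIDES = {"AA", "TT", "TA"}


def dinucleotide_indicator(seq, dinucs=NPS_DINUCLEOTIDES):
    """Return a 0/1 list marking positions where a dinucleotide of interest starts."""
    seq = seq.upper()
    return [1 if seq[i:i + 2] in dinucs else 0 for i in range(len(seq) - 1)]


def power_at_period(signal, period):
    """Spectral power of the mean-centred signal at one period (in bases).

    Projects the signal onto a complex exponential with frequency
    1/period and returns the squared magnitude divided by the length.
    """
    n = len(signal)
    mean = sum(signal) / n
    coeff = sum(
        (x - mean) * cmath.exp(-2j * cmath.pi * t / period)
        for t, x in enumerate(signal)
    )
    return abs(coeff) ** 2 / n


# Hypothetical toy sequence with one AA every 10 bases; NOT real genomic data.
toy_seq = "AAGCGCGCGC" * 20
signal = dinucleotide_indicator(toy_seq)
for period in (7, 10, 11):
    print(period, round(power_at_period(signal, period), 4))
```

On a real sequence the raw power would still need a significance assessment, which is the role the abstract assigns to the blockwise bootstrap.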
Genome Sequencing of Steroid Producing Bacteria Using Ion Torrent Technology and a Reference Genome.
Sola-Landa, Alberto; Rodríguez-García, Antonio; Barreiro, Carlos; Pérez-Redondo, Rosario
2017-01-01
Next-generation sequencing technology has greatly eased bacterial genome sequencing, and several tens of thousands of genomes have been sequenced during the last 10 years. Most genome projects are published as draft versions; however, for certain applications the complete genome sequence is required. In this chapter, we describe the strategy that allowed the complete genome sequencing of Mycobacterium neoaurum NRRL B-3805, an industrial strain exploited for steroid production, using Ion Torrent sequencing reads and the genome of a closely related strain as the reference. This protocol can be applied to analyze the genetic variations between closely related strains; for example, to elucidate the point mutations between a parental strain and a random mutagenesis-derived mutant.
Staňková, Helena; Hastie, Alex R; Chan, Saki; Vrána, Jan; Tulpová, Zuzana; Kubaláková, Marie; Visendi, Paul; Hayashi, Satomi; Luo, Mingcheng; Batley, Jacqueline; Edwards, David; Doležel, Jaroslav; Šimková, Hana
2016-07-01
The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC-by-BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high-resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high-resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome-scale analysis of repetitive sequences and revealed a ~800-kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone-by-clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC-contig physical map and validate sequence assembly on a chromosome-arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome-by-chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules. © 2016 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Actionable exomic incidental findings in 6503 participants: challenges of variant classification.
Amendola, Laura M; Dorschner, Michael O; Robertson, Peggy D; Salama, Joseph S; Hart, Ragan; Shirts, Brian H; Murray, Mitzi L; Tokita, Mari J; Gallego, Carlos J; Kim, Daniel Seung; Bennett, James T; Crosslin, David R; Ranchalis, Jane; Jones, Kelly L; Rosenthal, Elisabeth A; Jarvik, Ella R; Itsara, Andy; Turner, Emily H; Herman, Daniel S; Schleit, Jennifer; Burt, Amber; Jamal, Seema M; Abrudan, Jenica L; Johnson, Andrew D; Conlin, Laura K; Dulik, Matthew C; Santani, Avni; Metterville, Danielle R; Kelly, Melissa; Foreman, Ann Katherine M; Lee, Kristy; Taylor, Kent D; Guo, Xiuqing; Crooks, Kristy; Kiedrowski, Lesli A; Raffel, Leslie J; Gordon, Ora; Machini, Kalotina; Desnick, Robert J; Biesecker, Leslie G; Lubitz, Steven A; Mulchandani, Surabhi; Cooper, Greg M; Joffe, Steven; Richards, C Sue; Yang, Yaoping; Rotter, Jerome I; Rich, Stephen S; O'Donnell, Christopher J; Berg, Jonathan S; Spinner, Nancy B; Evans, James P; Fullerton, Stephanie M; Leppig, Kathleen A; Bennett, Robin L; Bird, Thomas; Sybert, Virginia P; Grady, William M; Tabor, Holly K; Kim, Jerry H; Bamshad, Michael J; Wilfond, Benjamin; Motulsky, Arno G; Scott, C Ronald; Pritchard, Colin C; Walsh, Tom D; Burke, Wylie; Raskind, Wendy H; Byers, Peter; Hisama, Fuki M; Rehm, Heidi; Nickerson, Debbie A; Jarvik, Gail P
2015-03-01
Recommendations for laboratories to report incidental findings from genomic tests have stimulated interest in such results. In order to investigate the criteria and processes for assigning the pathogenicity of specific variants and to estimate the frequency of such incidental findings in patients of European and African ancestry, we classified potentially actionable pathogenic single-nucleotide variants (SNVs) in all 4300 European- and 2203 African-ancestry participants sequenced by the NHLBI Exome Sequencing Project (ESP). We considered 112 gene-disease pairs selected by an expert panel as associated with medically actionable genetic disorders that may be undiagnosed in adults. The resulting classifications were compared to classifications from other clinical and research genetic testing laboratories, as well as with in silico pathogenicity scores. Among European-ancestry participants, 30 of 4300 (0.7%) had a pathogenic SNV and six (0.1%) had a disruptive variant that was expected to be pathogenic, whereas 52 (1.2%) had likely pathogenic SNVs. For African-ancestry participants, six of 2203 (0.3%) had a pathogenic SNV and six (0.3%) had an expected pathogenic disruptive variant, whereas 13 (0.6%) had likely pathogenic SNVs. Genomic Evolutionary Rate Profiling mammalian conservation score and the Combined Annotation Dependent Depletion summary score of conservation, substitution, regulation, and other evidence were compared across pathogenicity assignments and appear to have utility in variant classification. This work provides a refined estimate of the burden of adult onset, medically actionable incidental findings expected from exome sequencing, highlights challenges in variant classification, and demonstrates the need for a better curated variant interpretation knowledge base. © 2015 Amendola et al.; Published by Cold Spring Harbor Laboratory Press.
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fleischmann, R.D.; Adams, M.D.; White, O.
1995-07-28
An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism. 46 refs., 4 figs., 4 tabs.
Suárez, Gabriel A.; Renda, Brian A.; Dasgupta, Aurko
2017-01-01
ABSTRACT The genomes of most bacteria contain mobile DNA elements that can contribute to undesirable genetic instability in engineered cells. In particular, transposable insertion sequence (IS) elements can rapidly inactivate genes that are important for a designed function. We deleted all six copies of IS1236 from the genome of the naturally transformable bacterium Acinetobacter baylyi ADP1. The natural competence of ADP1 made it possible to rapidly repair deleterious point mutations that arose during strain construction. In the resulting ADP1-ISx strain, the rates of mutations inactivating a reporter gene were reduced by 7- to 21-fold. This reduction was higher than expected from the incidence of new IS1236 insertions found during a 300-day mutation accumulation experiment with wild-type ADP1 that was used to estimate spontaneous mutation rates in the strain. The extra improvement appears to be due in part to eliminating large deletions caused by IS1236 activity, as the point mutation rate was unchanged in ADP1-ISx. Deletion of an error-prone polymerase (dinP) and a DNA damage response regulator (umuDAb [the umuD gene of A. baylyi]) from the ADP1-ISx genome did not further reduce mutation rates. Surprisingly, ADP1-ISx exhibited increased transformability. This improvement may be due to less autolysis and aggregation of the engineered cells than of the wild type. Thus, deleting IS elements from the ADP1 genome led to a greater than expected increase in evolutionary reliability and unexpectedly enhanced other key strain properties, as has been observed for other clean-genome bacterial strains. ADP1-ISx is an improved chassis for metabolic engineering and other applications. IMPORTANCE Acinetobacter baylyi ADP1 has been proposed as a next-generation bacterial host for synthetic biology and genome engineering due to its ability to efficiently take up DNA from its environment during normal growth. 
We deleted transposable elements that are capable of copying themselves, inserting into other genes, and thereby inactivating them from the ADP1 genome. The resulting “clean-genome” ADP1-ISx strain exhibited larger reductions in the rates of inactivating mutations than expected from spontaneous mutation rates measured via whole-genome sequencing of lineages evolved under relaxed selection. Surprisingly, we also found that IS element activity reduces transformability and is a major cause of cell aggregation and death in wild-type ADP1 grown under normal laboratory conditions. More generally, our results demonstrate that domesticating a bacterial genome by removing mobile DNA elements that have accumulated during evolution in the wild can have unanticipated benefits. PMID:28667117
Su, Fei; Ou, Hong-Yu; Tao, Fei; Tang, Hongzhi; Xu, Ping
2013-12-27
With genomic sequences of many closely related bacterial strains made available by deep sequencing, it is now possible to investigate trends in prokaryotic microevolution. Positive selection is a sub-process of microevolution, in which a particular mutation is favored, causing the allele frequency to continuously shift in one direction. Wide scanning of prokaryotic genomes has shown that positive selection at the molecular level is much more frequent than expected. Genes with significant positive selection may play key roles in bacterial adaption to different environmental pressures. However, selection pressure analyses are computationally intensive and awkward to configure. Here we describe an open access web server, which is designated as PSP (Positive Selection analysis for Prokaryotic genomes) for performing evolutionary analysis on orthologous coding genes, specially designed for rapid comparison of dozens of closely related prokaryotic genomes. Remarkably, PSP facilitates functional exploration at the multiple levels by assignments and enrichments of KO, GO or COG terms. To illustrate this user-friendly tool, we analyzed Escherichia coli and Bacillus cereus genomes and found that several genes, which play key roles in human infection and antibiotic resistance, show significant evidence of positive selection. PSP is freely available to all users without any login requirement at: http://db-mml.sjtu.edu.cn/PSP/. PSP ultimately allows researchers to do genome-scale analysis for evolutionary selection across multiple prokaryotic genomes rapidly and easily, and identify the genes undergoing positive selection, which may play key roles in the interactions of host-pathogen and/or environmental adaptation.
Smith, S; Joss, T; Stow, A
2011-10-01
The analysis of microsatellite loci has allowed significant advances in evolutionary biology and pest management. However, until very recently, the potential benefits have been compromised by the high costs of developing these neutral markers. High-throughput sequencing provides a solution to this problem. We describe the development of 13 microsatellite markers for the eusocial ambrosia beetle, Austroplatypus incompertus, a significant pest of forests in southeast Australia. The frequency of microsatellite repeats in the genome of A. incompertus was determined to be low, and previous attempts at microsatellite isolation using a traditional genomic library were problematic. Here, we utilised two protocols, microsatellite-enriched genomic library construction and high-throughput 454 sequencing and characterised 13 loci which were polymorphic among 32 individuals. Numbers of alleles per locus ranged from 2 to 17, and observed and expected heterozygosities from 0.344 to 0.767 and from 0.507 to 0.860, respectively. These microsatellites have the resolution required to analyse fine-scale colony and population genetic structure. Our work demonstrates the utility of next-generation 454 sequencing as a method for rapid and cost-effective acquisition of microsatellites where other techniques have failed, or for taxa where marker development has historically been both complicated and expensive.
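The observed and expected heterozygosities quoted above are standard per-locus summaries. As a rough illustration only, the sketch below computes both for one locus from diploid genotypes, using the textbook expected value 1 - sum(p_i^2). The function name and input layout are assumptions made for this example, and the abstract does not say which estimator (for instance, a small-sample-corrected one) the authors used.

```python
from collections import Counter


def heterozygosities(genotypes):
    """Return (observed, expected) heterozygosity for a single locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    This is an illustrative helper, not the method used in the study.
    """
    n = len(genotypes)
    # Observed heterozygosity: fraction of individuals with two different alleles.
    h_obs = sum(1 for a, b in genotypes if a != b) / n
    # Allele frequencies are taken over all 2n gene copies.
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = 2 * n
    # Expected heterozygosity under Hardy-Weinberg proportions: 1 - sum(p_i^2).
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return h_obs, h_exp
```

A locus with many alleles at even frequencies pushes the expected value toward 1, which is why the number of alleles per locus is reported alongside the heterozygosities.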
Defining personal utility in genomics: A Delphi study.
Kohler, J N; Turbitt, E; Lewis, K L; Wilfond, B S; Jamal, L; Peay, H L; Biesecker, L G; Biesecker, B B
2017-09-01
Individual genome sequencing results are valued by patients in ways distinct from clinical utility. Such outcomes have been described as components of "personal utility," a concept that broadly encompasses patient-endorsed benefits, that is operationally defined as non-clinical outcomes. No empirical delineation of these outcomes has been reported. To address this gap, we administered a Delphi survey to adult participants in a National Institutes of Health (NIH) clinical exome study to extract the most highly endorsed outcomes constituting personal utility. Forty research participants responded to a Delphi survey to rate 35 items identified by a systematic literature review of personal utility. Two rounds of ranking resulted in 24 items that represented 14 distinct elements of personal utility. Elements most highly endorsed by participants were: increased self-knowledge, knowledge of "the condition," altruism, and anticipated coping. Our findings represent the first systematic effort to delineate elements of personal utility that may be used to anticipate participant expectation and inform genetic counseling prior to sequencing. The 24 items reported need to be studied further in additional clinical genome sequencing studies to assess generalizability in other populations. Further research will help to understand motivations and to predict the meaning and use of results. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
NASA Astrophysics Data System (ADS)
Jiang, Qun; Li, Qi; Yu, Hong; Kong, Lingfeng
2011-06-01
The sea cucumber Apostichopus japonicus is a commercially and ecologically important species in China. A total of 3056 potential unigenes were generated after assembling 7597 A. japonicus expressed sequence tags (ESTs) downloaded from GenBank. Two hundred and fifty microsatellite-containing ESTs (8.18%) and 299 simple sequence repeats (SSRs) were detected. The average density of SSRs was 1 per 7.403 kb of EST after redundancy elimination. Di-nucleotide repeat motifs appeared to be the most abundant type with a percentage of 69.90%. Of the 126 primer pairs designed, 90 amplified the expected products and 43 showed polymorphism in 30 individuals tested. The number of alleles per locus ranged from 2 to 26 with an average of 7.0 alleles, and the observed and expected heterozygosities varied from 0.067 to 1.000 and from 0.066 to 0.959, respectively. These new EST-derived microsatellite markers would provide sufficient polymorphism for population genetic studies and genome mapping of this sea cucumber species.
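Detecting SSRs in EST sequences, as described above, amounts to scanning each sequence for short motifs repeated in tandem. The following is a minimal sketch of that idea using a regular expression per motif length. The minimum repeat counts in `MIN_REPEATS` and the function name are hypothetical choices for illustration; the abstract does not state the thresholds or the software the authors used.

```python
import re

# Hypothetical minimum repeat counts per motif length (not taken from the study).
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. AA or ACAC), so each locus is reported at its smallest unit.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)
```

The share of di-nucleotide motifs reported in the abstract would then correspond to the fraction of hits whose motif has length 2.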
Fungal genome sequencing: basic biology to biotechnology.
Sharma, Krishna Kant
2016-08-01
The genome sequences provide a first glimpse into the genomic basis of the biological diversity of filamentous fungi and yeast. The genome sequence of the budding yeast, Saccharomyces cerevisiae, with a small genome size, unicellular growth, and rich history of genetic and molecular analyses was a milestone of early genomics in the 1990s. The subsequent completion of fission yeast, Schizosaccharomyces pombe and genetic model, Neurospora crassa initiated a revolution in the genomics of the fungal kingdom. In due course of time, a substantial number of fungal genomes have been sequenced and publicly released, representing the widest sampling of genomes from any eukaryotic kingdom. An ambitious genome-sequencing program provides a wealth of data on metabolic diversity within the fungal kingdom, thereby enhancing research into medical science, agriculture science, ecology, bioremediation, bioenergy, and the biotechnology industry. Fungal genomics have higher potential to positively affect human health, environmental health, and the planet's stored energy. With a significant increase in sequenced fungal genomes, the known diversity of genes encoding organic acids, antibiotics, enzymes, and their pathways has increased exponentially. Currently, over a hundred fungal genome sequences are publicly available; however, no inclusive review has been published. This review is an initiative to address the significance of the fungal genome-sequencing program and provides the road map for basic and applied research.
The complete mitochondrial genome sequence of Aesopia cornuta (Pleuronectiformes: Soleidae).
Wang, Shu-Ying; Shi, Wei; Wang, Zhong-Ming; Gong, Li; Kong, Xiao-Yu
2015-02-01
Aesopia cornuta belongs to the family Soleidae of Pleuronectiformes, and its morphological characters are very similar to those of Zebrias. In this article, we sequenced, characterized, and compared the complete mitogenome of A. cornuta for the first time. The genome is 16,737 base pairs in length and consists of the typical 37 genes, including 13 protein-coding genes, two ribosomal RNAs, and 22 transfer RNAs, as well as a putative L-strand replication origin and a putative control region. The gene organization is identical to that of typical bony fishes. The overall base composition is 29.1, 28.3, 26.8 and 15.8% for C, A, T and G, respectively, with a slight AT bias of 55.1%. This result is expected to contribute to understanding the systematic evolution of the genus Aesopia and further taxonomic and phylogenetic studies of Soleidae and Pleuronectiformes.
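The AT bias quoted above is simply the sum of the A and T percentages (28.3 + 26.8 = 55.1). As a small sketch of how such figures are obtained from a sequence, the function below tallies the four bases and reports percentages; the function name is an assumption for this example, and ambiguous bases such as N are ignored here as a simplifying choice.

```python
from collections import Counter


def base_composition(seq):
    """Return percentages of A, C, G and T plus the combined A+T percentage."""
    seq = seq.upper()
    # Count only unambiguous bases; anything else (e.g. N) is skipped.
    counts = Counter(b for b in seq if b in "ACGT")
    total = sum(counts.values())
    pct = {b: 100.0 * counts[b] / total for b in "ACGT"}
    pct["AT"] = pct["A"] + pct["T"]
    return pct
```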
Interpreting short tandem repeat variations in humans using mutational constraint
Gymrek, Melissa; Willems, Thomas; Reich, David; Erlich, Yaniv
2017-01-01
Identifying regions of the genome that are depleted of mutations can reveal potentially deleterious variants. Short tandem repeats (STRs), also known as microsatellites, are among the largest contributors of de novo mutations in humans. However, per-locus studies of STR mutations have been limited to highly ascertained panels of several dozen loci. Here, we harnessed bioinformatics tools and a novel analytical framework to estimate mutation parameters for each STR in the human genome by correlating STR genotypes with local sequence heterozygosity. We applied our method to obtain robust estimates of the impact of local sequence features on mutation parameters and used this to create a framework for measuring constraint at STRs by comparing observed vs. expected mutation rates. Constraint scores identified known pathogenic variants with early onset effects. Our metric will provide a valuable tool for prioritizing pathogenic STRs in medical genetics studies. PMID:28892063
Vargas-Caro, Carolina; Bustamante, Carlos; Bennett, Michael B; Ovenden, Jennifer R
2016-01-01
The yellownose skate Zearaja chilensis is endemic to South America. The species is the target of a valuable commercial fishery in Chile, but is highly susceptible to over-exploitation. The complete mitochondrial genome was described from 694,593 sequences obtained using Ion Torrent Next Generation Sequencing. The total length of the mitogenome was 16,909 bp, comprising 2 rRNAs, 13 protein-coding genes, 22 tRNAs and 2 non-coding regions. Comparison between the proposed mitogenome and one previously described from "raw fish fillets from a skate speciality restaurant in Seoul, Korea" resulted in 97.4% similarity, rather than approaching 100% similarity as might be expected. The 2.6% dissimilarity may indicate the presence of two separate stocks or two different species of, ostensibly, Z. chilensis in South America and highlights the need for caution when using genetic resources without a taxonomic reference or a voucher specimen.
Uneven distribution of expressed sequence tag loci on maize pachytene chromosomes
Anderson, Lorinda K.; Lai, Ann; Stack, Stephen M.; Rizzon, Carene; Gaut, Brandon S.
2006-01-01
Examining the relationships among DNA sequence, meiotic recombination, and chromosome structure at a genome-wide scale has been difficult because only a few markers connect genetic linkage maps with physical maps. Here, we have positioned 1195 genetically mapped expressed sequence tag (EST) markers onto the 10 pachytene chromosomes of maize by using a newly developed resource, the RN-cM map. The RN-cM map charts the distribution of crossing over in the form of recombination nodules (RNs) along synaptonemal complexes (SCs, pachytene chromosomes) and allows genetic cM distances to be converted into physical micrometer distances on chromosomes. When this conversion is made, most of the EST markers used in the study are located distally on the chromosomes in euchromatin. ESTs are significantly clustered on chromosomes, even when only euchromatic chromosomal segments are considered. Gene density and recombination rate (as measured by EST and RN frequencies, respectively) are strongly correlated. However, crossover frequencies for telomeric intervals are much higher than was expected from their EST frequencies. For pachytene chromosomes, EST density is about fourfold higher in euchromatin compared with heterochromatin, while DNA density is 1.4 times higher in heterochromatin than in euchromatin. Based on DNA density values and the fraction of pachytene chromosome length that is euchromatic, we estimate that ∼1500 Mbp of the maize genome is in euchromatin. This overview of the organization of the maize genome will be useful in examining genome and chromosome evolution in plants. PMID:16339046
Genome Improvement at JGI-HAGSC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grimwood, Jane; Schmutz, Jeremy J.; Myers, Richard M.
Since the completion of the sequencing of the human genome, the Joint Genome Institute (JGI) has rapidly expanded its scientific goals in several DOE mission-relevant areas. At the JGI-HAGSC, we have kept pace with this rapid expansion of projects with our focus on assessing, assembling, improving and finishing eukaryotic whole genome shotgun (WGS) projects for which the shotgun sequence is generated at the Production Genomic Facility (JGI-PGF). We follow this by combining the draft WGS with genomic resources generated at JGI-HAGSC or in collaborator laboratories (including BAC end sequences, genetic maps and FLcDNA sequences) to produce an improved draft sequence. For eukaryotic genomes important to the DOE mission, we then add further information from directed experiments to produce reference genomic sequences that are publicly available for any scientific researcher. Also, we have continued our program for producing BAC-based finished sequence, both for adding information to JGI genome projects and for small BAC-based sequencing projects proposed through any of the JGI sequencing programs. We have now built our computational expertise in WGS assembly and analysis and have moved eukaryotic genome assembly from the JGI-PGF to JGI-HAGSC. We have concentrated our assembly development work on large plant genomes and complex fungal and algal genomes.
Using Partial Genomic Fosmid Libraries for Sequencing Complete Organellar Genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
McNeal, Joel R.; Leebens-Mack, James H.; Arumuganathan, K.
2005-08-26
Organellar genome sequences provide numerous phylogenetic markers and yield insight into organellar function and molecular evolution. These genomes are much smaller in size than their nuclear counterparts; thus, their complete sequencing is much less expensive than total nuclear genome sequencing, making broader phylogenetic sampling feasible. However, for some organisms it is challenging to isolate plastid DNA for sequencing using standard methods. To overcome these difficulties, we constructed partial genomic libraries from total DNA preparations of two heterotrophic and two autotrophic angiosperm species using fosmid vectors. We then used macroarray screening to isolate clones containing large fragments of plastid DNA. A minimum tiling path of clones comprising the entire genome sequence of each plastid was selected, and these clones were shotgun-sequenced and assembled into complete genomes. Although this method worked well for both heterotrophic and autotrophic plants, nuclear genome size had a dramatic effect on the proportion of screened clones containing plastid DNA and, consequently, the overall number of clones that must be screened to ensure full plastid genome coverage. This technique makes it possible to determine complete plastid genome sequences for organisms that defy other available organellar genome sequencing methods, especially those for which limited amounts of tissue are available.
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.
Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H
2017-04-15
Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. 
Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.
Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification
Sinclair, Robert M.; Ravantti, Janne J.
2017-01-01
ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. 
Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979
USDA-ARS's Scientific Manuscript database
Aegilops tauschii is the diploid progenitor of the D genome of hexaploid wheat and an important genetic resource for wheat. A reference-quality sequence for the Ae. tauschii genome was produced with a combination of ordered-clone sequencing, whole-genome shotgun sequencing, and BioNano optical geno...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Onda, M.; Kudo, S.; Fukuda, M.
Human glycophorin A, B, and E (GPA, GPB, and GPE) genes belong to a gene family located at the long arm of chromosome 4. These three genes are homologous from the 5'-flanking sequence to the Alu sequence, which is 1 kb downstream from the exon encoding the transmembrane domain. Analysis of the Alu sequence and flanking direct repeat sequences suggested that the GPA gene most closely resembles the ancestral gene, whereas the GPB and GPE genes arose by homologous recombination within the Alu sequence, acquiring 3' sequences from an unrelated precursor genomic segment. Here the authors describe the identification of this putative precursor genomic segment. A human genomic library was screened by using the sequence of the 3' region of the GPB gene as a probe. The genomic clones isolated were found to contain an Alu sequence that appeared to be involved in the recombination. Downstream from the Alu sequence, the nucleotide sequence of the precursor genomic segment is almost identical to that of the GPB or GPE gene. In contrast, the upstream sequence of the genomic segment differs entirely from that of the GPA, GPB, and GPE genes. Conservation of the direct repeats flanking the Alu sequence of the genomic segment strongly suggests that the sequence of this genomic segment has been maintained during evolution. This identified genomic segment was found to reside downstream from the GPA gene by both gene mapping and in situ chromosomal localization. The precursor genomic segment was also identified in the orangutan genome, which is known to lack GPB and GPE genes. These results indicate that one of the duplicated ancestral glycophorin genes acquired a unique 3' sequence by unequal crossing-over through its Alu sequence and the further downstream Alu sequence present in the duplicated gene. Further duplication and divergence of this gene yielded the GPB and GPE genes. 37 refs., 5 figs.
Genomic Organization of the Drosophila Telomere Retrotransposable Elements
DOE Office of Scientific and Technical Information (OSTI.GOV)
George, J.A.; DeBaryshe, P.G.; Traverse, K.L.
2006-10-16
The emerging sequence of the heterochromatic portion of the Drosophila melanogaster genome, with the most recent update of euchromatic sequence, gives the first genome-wide view of the chromosomal distribution of the telomeric retrotransposons, HeT-A, TART, and Tahre. As expected, these elements are entirely excluded from euchromatin, although sequence fragments of HeT-A and TART 3' untranslated regions are found in nontelomeric heterochromatin on the Y chromosome. The proximal ends of HeT-A/TART arrays appear to be a transition zone because only here do other transposable elements mix in the array. The sharp distinction between the distribution of telomeric elements and that of other transposable elements suggests that chromatin structure is important in telomere element localization. Measurements reported here show (1) D. melanogaster telomeres are very long, in the size range reported for inbred mouse strains (averaging 46 kb per chromosome end in Drosophila stock 2057). As in organisms with telomerase, their length varies depending on genotype. There is also slight under-replication in polytene nuclei. (2) Surprisingly, the relationship between the number of HeT-A and TART elements is not stochastic but is strongly correlated across stocks, supporting the idea that the two elements are interdependent. Although currently assembled portions of the HeT-A/TART arrays are from the most-proximal part of long arrays, approximately 61% of the total HeT-A sequence in these regions consists of intact, potentially active elements with little evidence of sequence decay, making it likely that the content of the telomere arrays turns over more extensively than has been thought.
Wong, Lai-Ping; Lai, Jason Kuan-Han; Saw, Woei-Yuh; Ong, Rick Twee-Hee; Cheng, Anthony Youzhi; Pillai, Nisha Esakimuthu; Liu, Xuanyao; Xu, Wenting; Chen, Peng; Foo, Jia-Nee; Tan, Linda Wei-Lin; Koo, Seok-Hwee; Soong, Richie; Wenk, Markus Rene; Lim, Wei-Yen; Khor, Chiea-Chuen; Little, Peter; Chia, Kee-Seng; Teo, Yik-Ying
2014-05-01
South Asia possesses a significant amount of genetic diversity due to considerable intergroup differences in culture and language. There have been numerous reports on the genetic structure of Asian Indians, although these have mostly relied on genotyping microarrays or targeted sequencing of the mitochondria and Y chromosomes. Asian Indians in Singapore are primarily descendants of immigrants from Dravidian-language-speaking states in south India, and 38 individuals from the general population underwent deep whole-genome sequencing with a target coverage of 30X as part of the Singapore Sequencing Indian Project (SSIP). The genetic structure and diversity of these samples were compared against samples from the Singapore Sequencing Malay Project and populations in Phase 1 of the 1,000 Genomes Project (1 KGP). SSIP samples exhibited greater intra-population genetic diversity and possessed higher heterozygous-to-homozygous genotype ratio than other Asian populations. When compared against a panel of well-defined Asian Indians, the genetic makeup of the SSIP samples was closely related to South Indians. However, even though the SSIP samples clustered distinctly from the Europeans in the global population structure analysis with autosomal SNPs, eight samples were assigned to mitochondrial haplogroups that were predominantly present in Europeans and possessed higher European admixture than the remaining samples. An analysis of the relative relatedness between SSIP with two archaic hominins (Denisovan, Neanderthal) identified higher ancient admixture in East Asian populations than in SSIP. The data resource for these samples is publicly available and is expected to serve as a valuable complement to the South Asian samples in Phase 3 of 1 KGP.
Historical Perspective, Development and Applications of Next-Generation Sequencing in Plant Virology
Barba, Marina; Czosnek, Henryk; Hadidi, Ahmed
2014-01-01
Next-generation high throughput sequencing technologies became available at the onset of the 21st century. They provide a highly efficient, rapid, and low cost DNA sequencing platform beyond the reach of the standard and traditional DNA sequencing technologies developed in the late 1970s. They are continually improved to become faster, more efficient and cheaper. They have been used in many fields of biology since 2004. In 2009, next-generation sequencing (NGS) technologies began to be applied to several areas of plant virology including virus/viroid genome sequencing, discovery and detection, ecology and epidemiology, replication and transcription. Identification and characterization of known and unknown viruses and/or viroids in infected plants are currently among the most successful applications of these technologies. It is expected that NGS will play very significant roles in many research and non-research areas of plant virology. PMID:24399207
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, Xiaofan; Peris, David; Kominek, Jacek; ...
2016-09-16
The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to give the user great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.
Abe, Takashi; Hamano, Yuta; Ikemura, Toshimichi
2014-01-01
A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method "BLSOM" for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.
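The oligonucleotide composition that serves as input to an alignment-free method of this kind can be sketched as follows. This is a minimal illustration, not the BLSOM implementation: the function name `kmer_composition` is hypothetical, only the frequency-vector step is shown, and the self-organizing-map clustering itself is left out.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=4):
    """Return normalized k-mer frequencies for one sequence fragment.

    Hypothetical helper for illustration; the vector is ordered
    lexicographically over the alphabet A, C, G, T.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid  # skip windows containing N etc.
    )
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in kmers]


# Tetranucleotide composition of a toy fragment (4**4 = 256 dimensions).
vec = kmer_composition("ACGTACGTACGT", k=4)
```

Each fragment becomes one fixed-length vector (256 values for tetranucleotides, 1024 for pentanucleotides), so fragments from different genomes can be compared without aligning them.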
The diploid genome sequence of an Asian individual
Wang, Jun; Wang, Wei; Li, Ruiqiang; Li, Yingrui; Tian, Geng; Goodman, Laurie; Fan, Wei; Zhang, Junqing; Li, Jun; Zhang, Juanbin; Guo, Yiran; Feng, Binxiao; Li, Heng; Lu, Yao; Fang, Xiaodong; Liang, Huiqing; Du, Zhenglin; Li, Dong; Zhao, Yiqing; Hu, Yujie; Yang, Zhenzhen; Zheng, Hancheng; Hellmann, Ines; Inouye, Michael; Pool, John; Yi, Xin; Zhao, Jing; Duan, Jinjie; Zhou, Yan; Qin, Junjie; Ma, Lijia; Li, Guoqing; Yang, Zhentao; Zhang, Guojie; Yang, Bin; Yu, Chang; Liang, Fang; Li, Wenjie; Li, Shaochuan; Li, Dawei; Ni, Peixiang; Ruan, Jue; Li, Qibin; Zhu, Hongmei; Liu, Dongyuan; Lu, Zhike; Li, Ning; Guo, Guangwu; Zhang, Jianguo; Ye, Jia; Fang, Lin; Hao, Qin; Chen, Quan; Liang, Yu; Su, Yeyang; san, A.; Ping, Cuo; Yang, Shuang; Chen, Fang; Li, Li; Zhou, Ke; Zheng, Hongkun; Ren, Yuanyuan; Yang, Ling; Gao, Yang; Yang, Guohua; Li, Zhuo; Feng, Xiaoli; Kristiansen, Karsten; Wong, Gane Ka-Shu; Nielsen, Rasmus; Durbin, Richard; Bolund, Lars; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian
2009-01-01
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics. PMID:18987735
Chen, Mianmian; Xu, Juntian; Yao, Huochun; Lu, Chengping; Zhang, Wei
2016-05-10
Avian pathogenic Escherichia coli (APEC) causes colibacillosis, which results in significant economic losses to the poultry industry worldwide. Due to the drug residues and increased antibiotic resistance caused by antibiotic use, bacteriophages and other alternative therapeutic agents are expected to control APEC infection in poultry. Two APEC phages, named P483 and P694, were isolated from the feces from the farmers market in China. We then studied their biological properties, and carried out high-throughput genome sequencing and homology analyses of these phages. Assembly results of high-throughput sequencing showed that the structures of both P483 and P694 genomes consist of linear and double-stranded DNA. Results of the electron microscopy and homology analysis revealed that both P483 and P694 belong to T7-like virus which is a member of the Podoviridae family of the Caudovirales order. Comparative genomic analysis showed that most of the predicted proteins of these two phages showed strongest sequence similarity to the Enterobacteria phages BA14 and 285P, Erwinia phage FE44, and Kluyvera phage Kvp1; however, some proteins such as gp0.6a, gp1.7 and gp17 showed lower similarity (<85%) with the homologs of other phages in the T7 subgroup. We also found some unique characteristics of P483 and P694, such as the two types of the genes of P694 and no lytic activity of P694 against its host bacteria in liquid medium. Our results serve to further our understanding of phage evolution of T7-like coliphages and provide the potential application of the phages as therapeutic agents for the treatment of diseases. Copyright © 2016 Elsevier B.V. All rights reserved.
Abraham, Paul E; Wang, Xiaojing; Ranjan, Priya; Nookaew, Intawat; Zhang, Bing; Tuskan, Gerald A; Hettich, Robert L
2015-12-04
Next-generation sequencing has transformed the ability to link genotypes to phenotypes and facilitates the dissection of genetic contribution to complex traits. However, it is challenging to link genetic variants with the perturbed functional effects on proteins encoded by such genes. Here we show how RNA sequencing can be exploited to construct genotype-specific protein sequence databases to assess natural variation in proteins, providing information about the molecular toolbox driving cellular processes. For this study, we used two natural genotypes selected from a recent genome-wide association study of Populus trichocarpa, an obligate outcrosser with tremendous phenotypic variation across the natural population. This strategy allowed us to comprehensively catalogue proteins containing single amino acid polymorphisms (SAAPs), as well as insertions and deletions. We profiled the frequency of 128 types of naturally occurring amino acid substitutions, including both expected (neutral) and unexpected (non-neutral) SAAPs, with a subset occurring in regions of the genome having strong polymorphism patterns consistent with recent positive and/or divergent selection. By zeroing in on the molecular signatures of these important regions that might have previously been uncharacterized, we now provide a high-resolution molecular inventory that should improve accessibility and subsequent identification of natural protein variants in future genotype-to-phenotype studies.
Snake Genome Sequencing: Results and Future Prospects
Kerkkamp, Harald M. I.; Kini, R. Manjunatha; Pospelov, Alexey S.; Vonk, Freek J.; Henkel, Christiaan V.; Richardson, Michael K.
2016-01-01
Snake genome sequencing is in its infancy—very much behind the progress made in sequencing the genomes of humans, model organisms and pathogens relevant to biomedical research, and agricultural species. We provide here an overview of some of the snake genome projects in progress, and discuss the biological findings, with special emphasis on toxinology, from the small number of draft snake genomes already published. We discuss the future of snake genomics, pointing out that new sequencing technologies will help overcome the problem of repetitive sequences in assembling snake genomes. Genome sequences are also likely to be valuable in examining the clustering of toxin genes on the chromosomes, in designing recombinant antivenoms and in studying the epigenetic regulation of toxin gene expression. PMID:27916957
2012-01-01
Background The feline genome is valuable to the veterinary and model organism genomics communities because the cat is an obligate carnivore and a model for endangered felids. The initial public release of the Felis catus genome assembly provided a framework for investigating the genomic basis of feline biology. However, the entire set of protein coding genes has not been elucidated. Results We identified and characterized 1227 protein coding feline sequences, of which 913 map to public sequences and 314 are novel. These sequences have been deposited into NCBI's GenBank database and complement public genomic resources by providing additional protein coding sequences that fill in some of the gaps in the feline genome assembly. Through functional and comparative genomic analyses, we gained an understanding of the role of these sequences in feline development, nutrition and health. Specifically, we identified 104 orthologs of human genes associated with Mendelian disorders. We detected negative selection within sequences with gene ontology annotations associated with intracellular trafficking, cytoskeleton and muscle functions. We detected relatively less negative selection on protein sequences encoding extracellular networks, apoptotic pathways and mitochondrial gene ontology annotations. Additionally, we characterized feline cDNA sequences that have mouse orthologs associated with clinical, nutritional and developmental phenotypes. Together, this analysis provides an overview of the value of our cDNA sequences and enhances our understanding of how the feline genome is similar to, and different from, other mammalian genomes. Conclusions The cDNA sequences reported here expand existing feline genomic resources by providing high-quality sequences annotated with comparative genomic information providing functional, clinical, nutritional and orthologous gene information. PMID:22257742
Kamada, Mayumi; Hase, Sumitaka; Sato, Kengo; Toyoda, Atsushi; Fujiyama, Asao; Sakakibara, Yasubumi
2014-01-01
De novo microbial genome sequencing reached a turning point with third-generation sequencing (TGS) platforms, and several microbial genomes have been improved by TGS long reads. Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and it has a function in the production of the traditional Japanese fermented food “natto.” The B. subtilis natto BEST195 genome was previously sequenced with short reads, but it included some incomplete regions. We resequenced the BEST195 genome using a PacBio RS sequencer, and we successfully obtained a complete genome sequence from one scaffold without any gaps, and we also applied Illumina MiSeq short reads to enhance quality. Compared with the previous BEST195 draft genome and Marburg 168 genome, we found that incomplete regions in the previous genome sequence were attributed to GC-bias and repetitive sequences, and we also identified some novel genes that are found only in the new genome. PMID:25329997
Sequencing intractable DNA to close microbial genomes.
Hurt, Richard A; Brown, Steven D; Podar, Mircea; Palumbo, Anthony V; Elias, Dwayne A
2012-01-01
Advancement in high throughput DNA sequencing technologies has supported a rapid proliferation of microbial genome sequencing projects, providing the genetic blueprint for in-depth studies. Oftentimes, difficult to sequence regions in microbial genomes are ruled "intractable" resulting in a growing number of genomes with sequence gaps deposited in databases. A procedure was developed to sequence such problematic regions in the "non-contiguous finished" Desulfovibrio desulfuricans ND132 genome (6 intractable gaps) and the Desulfovibrio africanus genome (1 intractable gap). The polynucleotides surrounding each gap formed GC rich secondary structures making the regions refractory to amplification and sequencing. Strand-displacing DNA polymerases used in concert with a novel ramped PCR extension cycle supported amplification and closure of all gap regions in both genomes. The developed procedures support accurate gene annotation, and provide a step-wise method that reduces the effort required for genome finishing.
First complete genome sequence of infectious laryngotracheitis virus
2011-01-01
Background Infectious laryngotracheitis virus (ILTV) is an alphaherpesvirus that causes acute respiratory disease in chickens worldwide. To date, only one complete genomic sequence of ILTV has been reported. This sequence was generated by concatenating partial sequences from six different ILTV strains. Thus, the full genomic sequence of a single (individual) strain of ILTV has not been determined previously. This study aimed to use high throughput sequencing technology to determine the complete genomic sequence of a live attenuated vaccine strain of ILTV. Results The complete genomic sequence of the Serva vaccine strain of ILTV was determined, annotated and compared to the concatenated ILTV reference sequence. The genome size of the Serva strain was 152,628 bp, with a G + C content of 48%. A total of 80 predicted open reading frames were identified. The Serva strain had 96.5% DNA sequence identity with the concatenated ILTV sequence. Notably, the concatenated ILTV sequence was found to lack four large regions of sequence, including 528 bp and 594 bp of sequence in the UL29 and UL36 genes, respectively, and two copies of a 1,563 bp sequence in the repeat regions. Considerable differences in the size of the predicted translation products of 4 other genes (UL54, UL30, UL37 and UL38) were also identified. More than 530 single-nucleotide polymorphisms (SNPs) were identified. Most SNPs were located within three genomic regions, corresponding to sequence from the SA-2 ILTV vaccine strain in the concatenated ILTV sequence. Conclusions This is the first complete genomic sequence of an individual ILTV strain. This sequence will facilitate future comparative genomic studies of ILTV by providing an appropriate reference sequence for the sequence analysis of other ILTV strains. PMID:21501528
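A summary statistic such as the G + C content reported above is simple to compute from an assembled sequence. The sketch below is illustrative only; `gc_content` is a hypothetical helper, and the decision to ignore ambiguous bases such as N is an assumption rather than anything stated in the abstract.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Hypothetical helper; ambiguous symbols such as N are ignored.
    """
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Toy example: 4 of 8 unambiguous bases are G or C.
toy_gc = gc_content("GGCCAATTNN")
```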
Expanded microbial genome coverage and improved protein family annotation in the COG database
Galperin, Michael Y.; Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.
2015-01-01
Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. 
The new version of the COGs is expected to become an important tool for microbial genomics. PMID:25428365
Read, Timothy D; Petit, Robert A; Joseph, Sandeep J; Alam, Md Tauqeer; Weil, M Ryan; Ahmad, Maida; Bhimani, Ravila; Vuong, Jocelyn S; Haase, Chad P; Webb, D Harry; Tan, Milton; Dove, Alistair D M
2017-07-14
The whale shark (Rhincodon typus) has by far the largest body size of any elasmobranch (shark or ray) species. Therefore, it is also the largest extant species of the paraphyletic assemblage commonly referred to as fishes. As both a phenotypic extreme and a member of the group Chondrichthyes - the sister group to the remaining gnathostomes, which includes all tetrapods and therefore also humans - its genome is of substantial comparative interest. Whale sharks are also listed as an endangered species on the International Union for Conservation of Nature's Red List of threatened species and are of growing popularity as both a target of ecotourism and as a charismatic conservation ambassador for the pelagic ecosystem. A genome map for this species would aid in defining effective conservation units and understanding global population structure. We characterised the nuclear genome of the whale shark using next generation sequencing (454, Illumina) and de novo assembly and annotation methods, based on material collected from the Georgia Aquarium. The data set consisted of 878,654,233 reads, which yielded a draft assembly of 1,213,200 contigs and 997,976 scaffolds. The estimated genome size was 3.44Gb. As expected, the proteome of the whale shark was most closely related to the only other complete genome of a cartilaginous fish, the holocephalan elephant shark. The whale shark contained a novel Toll-like-receptor (TLR) protein with sequence similarity to both the TLR4 and TLR13 proteins of mammals and TLR21 of teleosts. The data are publicly available on GenBank, FigShare, and from the NCBI Short Read Archive under accession number SRP044374. This represents the first shotgun elasmobranch genome and will aid studies of molecular systematics, biogeography, genetic differentiation, and conservation genetics in this and other shark species, as well as providing comparative data for studies of evolutionary biology and immunology across the jawed vertebrate lineages.
The missing graphical user interface for genomics.
Schatz, Michael C
2010-01-01
The Galaxy package empowers regular users to perform rich DNA sequence analysis through a much-needed and user-friendly graphical web interface. See research article http://genomebiology.com/2010/11/8/R86 RESEARCH HIGHLIGHT: With the advent of affordable and high-throughput DNA sequencing, sequencing is becoming an essential component in nearly every genetics lab. These data are being generated to probe sequence variations, to understand transcribed, regulated or methylated DNA elements, and to explore a host of other biological features across the tree of life and across a range of environments and conditions. Given this deluge of data, novices and experts alike are facing the daunting challenge of trying to analyze the raw sequence data computationally. With so many tools available and so many assays to analyze, how can one be expected to stay current with the state of the art? How can one be expected to learn to use each tool and construct robust end-to-end analysis pipelines, all while ensuring that input formats, command-line options, sequence databases and program libraries are set correctly? Finally, once the analysis is complete, how does one ensure the results are reproducible and transparent for others to scrutinize and study? In an article published in Genome Biology, Jeremy Goecks, Anton Nekrutenko, James Taylor and the rest of the Galaxy Team (Goecks et al. 1) make a great advance towards resolving these critical questions with the latest update to their Galaxy Project. The ambitious goal of Galaxy is to empower regular users to carry out their own computational analysis without having to be an expert in computational biology or computer science. Galaxy adds a desperately needed graphical user interface to genomics research, making data analysis universally accessible in a web browser, and freeing users from the minutiae of archaic command-line parameters, data formats and scripting languages.
Data inputs and computational steps are selected from dynamic graphical menus, and the results are displayed in intuitive plots and summaries that encourage interactive workflows and the exploration of hypotheses. The underlying data analysis tools can be almost any piece of software, written in any language, but all their complexity is neatly hidden inside of Galaxy, allowing users to focus on scientific rather than technical questions.
Insights from 20 years of bacterial genome sequencing
Land, Miriam L.; Hauser, Loren; Jun, Se-Ran; ...
2015-02-27
Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genome sequences is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. But the breadth is continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria by tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90% of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families.
Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.
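The pan- and core-genome calculation mentioned above reduces to set operations once each genome has been summarized as a set of gene-family identifiers. The following is a minimal sketch under that assumption; `pan_and_core`, the strain names and the family IDs are all hypothetical, and the hard part in practice (clustering genes into families) is not shown.

```python
def pan_and_core(genomes):
    """Compute pan- and core genomes from per-genome gene-family sets.

    genomes: dict mapping a genome name to a set of gene-family IDs
    (hypothetical input format, assumed for illustration).
    """
    family_sets = list(genomes.values())
    # Pan genome: every family seen in at least one genome.
    pan = set().union(*family_sets)
    # Core genome: families present in every genome.
    core = set.intersection(*family_sets) if family_sets else set()
    return pan, core


toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam5"},
}
pan, core = pan_and_core(toy)
```

In this toy input the core genome holds the two families shared by all three strains and the pan genome holds all five, mirroring on a small scale the contrast between about 3100 core families and about 89,000 total families reported for E. coli.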
RefSeq microbial genomes database: new representation and annotation strategy.
Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor
2014-01-01
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
Gene calling and bacterial genome annotation with BG7.
Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo
2015-01-01
New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa, but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key area of investigation in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and imperfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences, which are the elements directly related to biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. The tool tolerates sequence errors and retains its annotation capability for highly fragmented genomes and for mixed sequences coming from several genomes (such as those obtained from metagenomic samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).
Hendre, Prasad S.; Aggarwal, Ramesh K.
2014-01-01
Coffee breeding and improvement efforts can be greatly facilitated by the availability of a large repository of simple sequence repeat (SSR) based microsatellite markers, which provide efficiency and high resolution in genetic analyses. This study aimed to improve SSR availability in coffee by developing new genic and genomic SSR markers using in-silico bioinformatics and a streptavidin-biotin based enrichment approach, respectively. The expressed sequence tag (EST) based genic microsatellite markers (EST-SSRs) were developed using the publicly available dataset of 13,175 unigene ESTs, which showed a distribution of 1 SSR/3.4 kb of coffee transcriptome. Genomic SSRs, on the other hand, were developed from an SSR-enriched small-insert partial genomic library of robusta coffee. In total, 69 new SSRs (44 EST-SSRs and 25 genomic SSRs) were developed and validated as suitable genetic markers. Diversity analysis of selected coffee genotypes revealed these to be highly informative in terms of allelic diversity and PIC values, and eighteen of these markers (∼27%) could be mapped on a robusta linkage map. Notably, the markers described here also revealed a very high cross-species transferability. In addition to the validated markers, we have also designed primer pairs for 270 putative EST-SSRs, which are expected to provide another ca. 200 useful genetic markers considering the high success rate (88%) of marker conversion of similar pairs tested/validated in this study. PMID:25461752
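In-silico SSR discovery of the kind described above can be approximated with a regular expression that looks for a short motif repeated in tandem. This is a rough sketch, not the pipeline used in the study: `find_ssrs` and its thresholds (motifs of 2 to 6 bases, at least 4 tandem copies) are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Locate perfect tandem repeats (candidate SSRs) in a sequence.

    Hypothetical helper; returns (start, motif, copy_number) tuples.
    The lazy quantifier makes the regex prefer the shortest motif.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# A (GA)5 dinucleotide repeat embedded in a toy EST fragment.
ssrs = find_ssrs("TTGAGAGAGAGACCC")
```

Dedicated SSR-mining tools additionally handle mononucleotide runs, imperfect and compound repeats, and motif-specific copy-number thresholds, none of which this sketch attempts.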
Genome-wide nucleosome map and cytosine methylation levels of an ancient human genome
Pedersen, Jakob Skou; Valen, Eivind; Velazquez, Amhed M. Vargas; Parker, Brian J.; Rasmussen, Morten; Lindgreen, Stinus; Lilje, Berit; Tobin, Desmond J.; Kelly, Theresa K.; Vang, Søren; Andersson, Robin; Jones, Peter A.; Hoover, Cindi A.; Tikhonov, Alexei; Prokhortchouk, Egor; Rubin, Edward M.; Sandelin, Albin; Gilbert, M. Thomas P.; Krogh, Anders; Willerslev, Eske; Orlando, Ludovic
2014-01-01
Epigenetic information is available from contemporary organisms, but is difficult to track back in evolutionary time. Here, we show that genome-wide epigenetic information can be gathered directly from next-generation sequence reads of DNA isolated from ancient remains. Using the genome sequence data generated from hair shafts of a 4000-yr-old Paleo-Eskimo belonging to the Saqqaq culture, we generate the first ancient nucleosome map coupled with a genome-wide survey of cytosine methylation levels. The validity of both nucleosome map and methylation levels were confirmed by the recovery of the expected signals at promoter regions, exon/intron boundaries, and CTCF sites. The top-scoring nucleosome calls revealed distinct DNA positioning biases, attesting to nucleotide-level accuracy. The ancient methylation levels exhibited high conservation over time, clustering closely with modern hair tissues. Using ancient methylation information, we estimated the age at death of the Saqqaq individual and illustrate how epigenetic information can be used to infer ancient gene expression. Similar epigenetic signatures were found in other fossil material, such as 110,000- to 130,000-yr-old bones, supporting the contention that ancient epigenomic information can be reconstructed from a deep past. Our findings lay the foundation for extracting epigenomic information from ancient samples, allowing shifts in epialleles to be tracked through evolutionary time, as well as providing an original window into modern epigenomics. PMID:24299735
Vélez, Julián Reyes; Cameron, Marguerite; Rodríguez-Lecompte, Juan Carlos; Xia, Fangfang; Heider, Luke C.; Saab, Matthew; McClure, J. Trenton; Sánchez, Javier
2017-01-01
The objectives of this study were to determine the occurrence of antimicrobial resistance (AMR) genes using whole-genome sequences (WGS) of Streptococcus uberis (S. uberis) and Streptococcus dysgalactiae (S. dysgalactiae) isolates recovered from dairy cows in the Canadian Maritime Provinces. A secondary objective was to explore the association between phenotypic AMR and genomic characteristics (genome size, guanine–cytosine content, and occurrence of unique gene sequences). Initially, 91 isolates were sequenced, and of these isolates, 89 were assembled. Furthermore, 16 isolates were excluded due to larger than expected genomic sizes (>2.3 bp × 1,000 bp). In the final analysis, 73 isolates with complete WGS and minimum inhibitory concentration records were used; these were part of the previous phenotypic AMR study and represented 18 dairy herds from the Maritime region of Canada (1). A total of 23 unique AMR gene sequences were found in the bacterial genomes, with a mean number of 8.1 (minimum: 5; maximum: 13) per genome. Overall, there were 10 AMR genes [ANT(6), TEM-127, TEM-163, TEM-89, TEM-95, Linb, Lnub, Ermb, Ermc, and TetS] present only in S. uberis genomes and 2 genes (EF-TU and TEM-71) unique to the S. dysgalactiae genomes; 11 AMR genes [APH(3′), TEM-1, TEM-136, TEM-157, TEM-47, TetM, bl2b, gyrA, parE, phoP, and rpoB] were found in both bacterial species. Two-way tabulations showed an association between phenotypic susceptibility to lincosamides and the presence of the linB (P = 0.002) and lnuB (P < 0.001) genes, and between the presence of the tetM (P = 0.015) and tetS (P = 0.064) genes and phenotypic resistance to tetracyclines, only for the S. uberis isolates. The logistic model showed that the odds of resistance (to any of the phenotypically tested antimicrobials) were 4.35 times higher when there were >11 AMR genes present in the genome, compared with <7 AMR genes (P < 0.001). The odds of resistance were lower for S. dysgalactiae than S. uberis (P = 0.031). When the within-herd somatic cell count was >250,000 cells/mL, a trend toward higher odds of resistance compared with the baseline category of <150,000 cells/mL was observed. When the isolate corresponded to a post-mastitis sample, there were lower odds of resistance when compared with non-clinical isolates (P = 0.01). The results of this study showed the strength of associations between the phenotypic AMR of both mastitis pathogens and their genotypic resistome and other epidemiological characteristics. PMID:28589129
Gordon, Kacy L.; Arthur, Robert K.; Ruvinsky, Ilya
2015-01-01
Gene regulatory information guides development and shapes the course of evolution. To test conservation of gene regulation within the phylum Nematoda, we compared the functions of putative cis-regulatory sequences of four sets of orthologs (unc-47, unc-25, mec-3 and elt-2) from distantly-related nematode species. These species, Caenorhabditis elegans, its congeneric C. briggsae, and three parasitic species Meloidogyne hapla, Brugia malayi, and Trichinella spiralis, represent four of the five major clades in the phylum Nematoda. Despite the great phylogenetic distances sampled and the extensive sequence divergence of nematode genomes, all but one of the regulatory elements we tested are able to drive at least a subset of the expected gene expression patterns. We show that functionally conserved cis-regulatory elements have no more extended sequence similarity to their C. elegans orthologs than would be expected by chance, but they do harbor motifs that are important for proper expression of the C. elegans genes. These motifs are too short to be distinguished from the background level of sequence similarity, and while identical in sequence they are not conserved in orientation or position. Functional tests reveal that some of these motifs contribute to proper expression. Our results suggest that conserved regulatory circuitry can persist despite considerable turnover within cis elements. PMID:26020930
PCR Amplification Strategies towards full-length HIV-1 Genome sequencing.
Liu, Chao Chun; Ji, Hezhao
2018-06-26
The advent of next generation sequencing has enabled greater resolution of viral diversity and improved the feasibility of full viral genome sequencing, allowing routine HIV-1 full genome sequencing in both research and diagnostic settings. Regardless of the sequencing platform selected, successful PCR amplification of the HIV-1 genome is essential for sequencing template preparation. As such, full HIV-1 genome amplification is a crucial step that determines whether downstream sequencing is successful and reliable. Here we review existing PCR protocols leading to HIV-1 full genome sequencing. In addition to discussing basic considerations for the relevant PCR design, we review the advantages as well as the pitfalls of published protocols.
A robust TALENs system for highly efficient mammalian genome editing.
Feng, Yuanxi; Zhang, Siliang; Huang, Xin
2014-01-10
Recently, transcription activator-like effector nucleases (TALENs) have emerged as a highly effective tool for genomic editing. A pair of TALENs binds to two DNA recognition sites separated by a spacer sequence, and the dimerized FokI nucleases at the C terminus then cleave DNA in the spacer. Because of its modular design and capacity to precisely target almost any desired genomic locus, TALEN is a technology that can revolutionize the entire biomedical research field. Currently, for genomic editing in cultured cells, two plasmids encoding a pair of TALENs are co-transfected, followed by limiting dilution to isolate cell colonies with the intended genomic manipulation. However, uncertain transfection efficiency becomes a bottleneck, especially in hard-to-transfect cells, reducing the overall efficiency of genome editing. We have developed a robust TALENs system in which each TALEN plasmid also encodes a fluorescent protein. Thus, cells transfected with both TALEN plasmids, a prerequisite for genomic editing, can be isolated by fluorescence-activated cell sorting. Our improved TALENs system can be applied to all cultured cells to achieve highly efficient genomic editing. Furthermore, an optimized procedure for genomic editing using TALENs is also presented. We expect our system to be widely adopted by the scientific community.
Ciotlos, Serban; Mao, Qing; Zhang, Rebecca Yu; Li, Zhenyu; Chin, Robert; Gulbahce, Natali; Liu, Sophie Jia; Drmanac, Radoje; Peters, Brock A
2016-01-01
The cell line BT-474 is a popular cell line for studying the biology of cancer and developing novel drugs. However, there is no complete, published genome sequence for this highly utilized scientific resource. In this study we sought to provide a comprehensive and useful data set for the scientific community by generating a whole genome sequence for BT-474. Five μg of genomic DNA, isolated from an early passage of the BT-474 cell line, was used to generate a whole genome sequence (114X coverage) using Complete Genomics' standard sequencing process. To provide additional variant phasing and structural variation data we also processed and analyzed two separate libraries of 5 and 6 individual cells to depths of 99X and 87X, respectively, using Complete Genomics' Long Fragment Read (LFR) technology. BT-474 is a highly aneuploid cell line with an extremely complex genome sequence. This ~300X total coverage genome sequence provides a more complete understanding of this highly utilized cell line at the genomic level.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Podsiadlowski, L.; Carapelli, A.; Nardi, F.
2005-12-01
Mitochondrial genomes from two dipluran hexapods of the genus Campodea have been sequenced. Gene order is the same as in most other hexapods and crustaceans. Secondary structures of tRNAs reveal specific structural changes in tRNA-C, tRNA-R, tRNA-S1 and tRNA-S2. Comparative analyses of nucleotide and amino acid composition, as well as structural features of both ribosomal RNA subunits, reveal substantial differences among the analyzed taxa. Although the two Campodea species are morphologically highly uniform, genetic divergence is larger than expected, suggesting a long evolutionary history under stable ecological conditions.
Yang, Jun-Bo; Li, De-Zhu; Li, Hong-Tao
2014-09-01
Chloroplast genomes supply indispensable information that helps improve phylogenetic resolution and can even serve as organelle-scale barcodes. Next-generation sequencing technologies have helped promote sequencing of complete chloroplast genomes, but compared with the number of angiosperms, relatively few chloroplast genomes have been sequenced. There are two major reasons for the paucity of completely sequenced chloroplast genomes: (i) massive amounts of fresh leaves are needed for chloroplast sequencing and (ii) there are considerable gaps in the sequenced chloroplast genomes of many plants because of the difficulty of isolating high-quality chloroplast DNA, preventing complete chloroplast genomes from being assembled. To overcome these obstacles, all known angiosperm chloroplast genomes available to date were analysed, and nine universal primer pairs corresponding to the highly conserved regions were designed. Using these primers, angiosperm whole chloroplast genomes can be amplified using long-range PCR and sequenced using next-generation sequencing methods. The primers showed high universality, which was tested using 24 species representing major clades of angiosperms. To validate the functionality of the primers, the complete chloroplast genomes of eight species representing major groups of angiosperms, that is, early-diverging angiosperms, magnoliids, monocots, Saxifragales, fabids, malvids and asterids, were sequenced and assembled. In our trials, only 100 mg of fresh leaves was used. The results show that the universal primer set provided an easy, effective and feasible approach for sequencing whole chloroplast genomes in angiosperms. The designed universal primer pairs provide a possibility to accelerate genome-scale data acquisition and will therefore improve phylogenetic resolution and species identification in angiosperms.
Contrasting Patterns of rDNA Homogenization within the Zygosaccharomyces rouxii Species Complex
Chand Dakal, Tikam; Giudici, Paolo; Solieri, Lisa
2016-01-01
Arrays of repetitive ribosomal DNA (rDNA) sequences are generally expected to evolve as a coherent family, where repeats within such a family are more similar to each other than to orthologs in related species. The continuous homogenization of repeats within individual genomes is a recombination process termed concerted evolution. Here, we investigated the extent and the direction of concerted evolution in 43 yeast strains of the Zygosaccharomyces rouxii species complex (Z. rouxii, Z. sapae, Z. mellis), by analyzing two portions of the 35S rDNA cistron, namely the D1/D2 domains at the 5’ end of the 26S rRNA gene and the segment including the internal transcribed spacers (ITS) 1 and 2 (ITS regions). We demonstrate that intra-genomic rDNA sequence variation is unusually frequent in this clade and that rDNA arrays in single genomes consist of an intermixing of Z. rouxii, Z. sapae and Z. mellis-like sequences, putatively evolved by reticulate evolutionary events that involved repeated hybridization between lineages. The levels and distribution of sequence polymorphisms vary across rDNA repeats in different individuals, reflecting four patterns of rDNA evolution: I) rDNA repeats that are homogeneous within a genome but are chimeras derived from two parental lineages via recombination: Z. rouxii in the ITS region and Z. sapae in the D1/D2 region; II) intra-genomic rDNA repeats that retain polymorphisms only in ITS regions; III) rDNA repeats that vary only in their D1/D2 domains; IV) heterogeneous rDNA arrays that have both polymorphic ITS and D1/D2 regions. We argue that an ongoing process of homogenization following allodiplodization or incomplete lineage sorting gave rise to divergent evolutionary trajectories in different strains, depending upon temporal, structural and functional constraints. We discuss the consequences of these findings for Zygosaccharomyces species delineation and, more in general, for yeast barcoding. PMID:27501051
Scalabrin, Simone; Gilmore, Barbara; Lawley, Cynthia T.; Gasic, Ksenija; Micheletti, Diego; Rosyara, Umesh R.; Cattonaro, Federica; Vendramin, Elisa; Main, Dorrie; Aramini, Valeria; Blas, Andrea L.; Mockler, Todd C.; Bryant, Douglas W.; Wilhelm, Larry; Troggio, Michela; Sosinski, Bryon; Aranzana, Maria José; Arús, Pere; Iezzoni, Amy; Morgante, Michele; Peace, Cameron
2012-01-01
Although a large number of single nucleotide polymorphism (SNP) markers covering the entire genome are needed to enable molecular breeding efforts such as genome wide association studies, fine mapping, genomic selection and marker-assisted selection in peach [Prunus persica (L.) Batsch] and related Prunus species, only a limited number of genetic markers, including simple sequence repeats (SSRs), have been available to date. To address this need, an international consortium (The International Peach SNP Consortium; IPSC) has pursued a coordinated effort to perform genome-scale SNP discovery in peach using next generation sequencing platforms to develop and characterize a high-throughput Illumina Infinium® SNP genotyping array platform. We performed whole genome re-sequencing of 56 peach breeding accessions using the Illumina and Roche/454 sequencing technologies. Polymorphism detection algorithms identified a total of 1,022,354 SNPs. Validation with the Illumina GoldenGate® assay was performed on a subset of the predicted SNPs, verifying ∼75% of genic (exonic and intronic) SNPs, whereas only about a third of intergenic SNPs were verified. Conservative filtering was applied to arrive at a set of 8,144 SNPs that were included on the IPSC peach SNP array v1, distributed over all eight peach chromosomes with an average spacing of 26.7 kb between SNPs. Use of this platform to screen a total of 709 accessions of peach in two separate evaluation panels identified a total of 6,869 (84.3%) polymorphic SNPs. The almost 7,000 SNPs verified as polymorphic through extensive empirical evaluation represent an excellent source of markers for future studies in genetic relatedness, genetic mapping, and dissecting the genetic architecture of complex agricultural traits. The IPSC peach SNP array v1 is commercially available and we expect that it will be used worldwide for genetic studies in peach and related stone fruit and nut species. PMID:22536421
Sequencing and assembly of the 22-gb loblolly pine genome.
Zimin, Aleksey; Stevens, Kristian A; Crepeau, Marc W; Holtz-Morris, Ann; Koriabine, Maxim; Marçais, Guillaume; Puiu, Daniela; Roberts, Michael; Wegrzyn, Jill L; de Jong, Pieter J; Neale, David B; Salzberg, Steven L; Yorke, James A; Langley, Charles H
2014-03-01
Conifers are the predominant gymnosperms. The size and complexity of their genomes have presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well-suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data was aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.
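For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is commonly computed from a list of scaffold (or contig) lengths: sort the lengths in descending order and report the length at which the running sum first reaches half of the total. The function name n50 and the toy lengths are illustrative assumptions and are not taken from the loblolly pine assembly.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length

# Toy scaffold lengths (hypothetical).
print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends on the total used as the denominator, values are only comparable between assemblies of similar total size.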
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.
VanBuren, Robert; Bryant, Doug; Edger, Patrick P; Tang, Haibao; Burgess, Diane; Challabathula, Dinakar; Spittle, Kristi; Hall, Richard; Gu, Jenny; Lyons, Eric; Freeling, Michael; Bartels, Dorothea; Ten Hallers, Boudewijn; Hastie, Alex; Michael, Todd P; Mockler, Todd C
2015-11-26
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a 'near-complete' draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data
2014-01-01
Background: The rapid evolution in high-throughput sequencing (HTS) technologies has opened up new perspectives in several research fields and led to the production of large volumes of sequence data. A fundamental step in HTS data analysis is the mapping of reads onto reference sequences. Choosing a suitable mapper for a given technology and a given application is a subtle task because of the difficulty of evaluating mapping algorithms. Results: In this paper, we present a benchmark procedure to compare mapping algorithms used in HTS using both real and simulated datasets and considering four evaluation criteria: computational resource and time requirements, robustness of mapping, ability to report positions for reads in repetitive regions, and ability to retrieve true genetic variation positions. To measure robustness, we introduced a new definition for a correctly mapped read taking into account not only the expected start position of the read but also the end position and the number of indels and substitutions. We developed CuReSim, a new read simulator, that is able to generate customized benchmark data for any kind of HTS technology by adjusting parameters to the error types. CuReSim and CuReSimEval, a tool to evaluate the mapping quality of the CuReSim simulated reads, are freely available. We applied our benchmark procedure to evaluate 14 mappers in the context of whole genome sequencing of small genomes with Ion Torrent data for which such a comparison has not yet been established. Conclusions: A benchmark procedure to compare HTS data mappers is introduced with a new definition for the mapping correctness as well as tools to generate simulated reads and evaluate mapping quality. The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes has allowed us to validate our benchmark procedure and demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used. This benchmark procedure can be used to evaluate existing or in-development mappers as well as to optimize parameters of a chosen mapper for any application and any sequencing platform. PMID:24708189
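The idea of judging a mapped read by more than its start position can be sketched as follows. This is only an illustration of the kind of criterion the abstract describes, not the rule implemented in CuReSimEval: the Alignment record, the is_correctly_mapped function and the two tolerance parameters are hypothetical, and the abstract does not say how the published definition combines start, end and edit counts.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    start: int          # leftmost reference position of the read
    end: int            # rightmost reference position of the read
    indels: int         # number of insertions plus deletions
    substitutions: int  # number of mismatched bases

def is_correctly_mapped(observed, expected, pos_tol=0, edit_tol=0):
    """Compare a mapper's alignment with the simulated ground truth.

    pos_tol and edit_tol are assumed tolerances, for illustration only.
    """
    positions_ok = (abs(observed.start - expected.start) <= pos_tol
                    and abs(observed.end - expected.end) <= pos_tol)
    edits_ok = (abs(observed.indels - expected.indels) <= edit_tol
                and abs(observed.substitutions - expected.substitutions) <= edit_tol)
    return positions_ok and edits_ok

# Hypothetical simulated read and two candidate mappings.
truth = Alignment(start=100, end=199, indels=1, substitutions=2)
same_start_wrong_end = Alignment(start=100, end=205, indels=1, substitutions=2)
print(is_correctly_mapped(truth, truth))                 # identical to truth
print(is_correctly_mapped(same_start_wrong_end, truth))  # end position differs
```

A start-only check would accept the second mapping; including the end position and edit counts rejects alignments that begin in the right place but are otherwise inconsistent with the simulated read.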
Mahato, Ajay Kumar; Sharma, Nimisha; Singh, Akshay; Srivastav, Manish; Jaiprakash; Singh, Sanjay Kumar; Singh, Anand Kumar; Sharma, Tilak Raj; Singh, Nagendra Kumar
2016-01-01
Mango (Mangifera indica L.) is called “king of fruits” due to its sweetness, richness of taste, diversity, large production volume and a variety of end usage. Despite its huge economic importance genomic resources in mango are scarce and genetics of useful horticultural traits are poorly understood. Here we generated deep coverage leaf RNA sequence data for mango parental varieties ‘Neelam’, ‘Dashehari’ and their hybrid ‘Amrapali’ using next generation sequencing technologies. De-novo sequence assembly generated 27,528, 20,771 and 35,182 transcripts for the three genotypes, respectively. The transcripts were further assembled into a non-redundant set of 70,057 unigenes that were used for SSR and SNP identification and annotation. Total 5,465 SSR loci were identified in 4,912 unigenes with 288 type I SSR (n ≥ 20 bp). One hundred type I SSR markers were randomly selected of which 43 yielded PCR amplicons of expected size in the first round of validation and were designated as validated genic-SSR markers. Further, 22,306 SNPs were identified by aligning high quality sequence reads of the three mango varieties to the reference unigene set, revealing significantly enhanced SNP heterozygosity in the hybrid Amrapali. The present study on leaf RNA sequencing of mango varieties and their hybrid provides useful genomic resource for genetic improvement of mango. PMID:27736892
Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing.
Misra, Sanchit; Agrawal, Ankit; Liao, Wei-keng; Choudhary, Alok
2011-01-15
Recently, a number of programs have been proposed for mapping short reads to a reference genome. Many of them are heavily optimized for short-read mapping and hence are very efficient for shorter queries, but that makes them inefficient or not applicable for reads longer than 200 bp. However, many sequencers are already generating longer reads and more are expected to follow. For long read sequence mapping, there are limited options; BLAT, SSAHA2, FANGS and BWA-SW are among the popular ones. However, resequencing and personalized medicine need much faster software to map these long sequencing reads to a reference genome to identify SNPs or rare transcripts. We present AGILE (AliGnIng Long rEads), a hash table based high-throughput sequence mapping algorithm for longer 454 reads that uses diagonal multiple seed-match criteria, customized q-gram filtering and a dynamic incremental search approach among other heuristics to optimize every step of the mapping process. In our experiments, we observe that AGILE is more accurate than BLAT, and comparable to BWA-SW and SSAHA2. For practical error rates (< 5%) and read lengths (200-1000 bp), AGILE is significantly faster than BLAT, SSAHA2 and BWA-SW. Even for the other cases, AGILE is comparable to BWA-SW and several times faster than BLAT and SSAHA2. http://www.ece.northwestern.edu/~smi539/agile.html.
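The general hash-table approach to read mapping that this abstract builds on can be sketched in a few lines: index every k-mer of the reference in a hash table, look up each k-mer of the read, and let the seed hits vote for a diagonal (reference position minus read position). This is a generic seed-and-vote illustration under assumed names (build_index, best_diagonal) and a toy k; it is not AGILE's algorithm and omits its q-gram filtering, multiple seed-match criteria and incremental search.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def best_diagonal(read, index, k):
    """Return (diagonal, seed_count) for the most supported diagonal, or None.

    Seeds from the same ungapped placement of the read share one diagonal,
    so the diagonal with the most votes is a candidate mapping position.
    """
    votes = Counter()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]

# Toy reference and an error-free read taken from position 5 (hypothetical data).
reference = "GATTACAGATTTCCGGAAC"
index = build_index(reference, k=4)
print(best_diagonal(reference[5:15], index, k=4))
```

In a real mapper the winning diagonals would then be verified and extended by a full alignment step, since indels shift seeds onto neighbouring diagonals.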
The first genome sequences of human bocaviruses from Vietnam
Thanh, Tran Tan; Van, Hoang Minh Tu; Hong, Nguyen Thi Thu; Nhu, Le Nguyen Truc; Anh, Nguyen To; Tuan, Ha Manh; Hien, Ho Van; Tuong, Nguyen Manh; Kien, Trinh Trung; Khanh, Truong Huu; Nhan, Le Nguyen Thanh; Hung, Nguyen Thanh; Chau, Nguyen Van Vinh; Thwaites, Guy; van Doorn, H. Rogier; Tan, Le Van
2017-01-01
As part of an ongoing effort to generate complete genome sequences of hand, foot and mouth disease-causing enteroviruses directly from clinical specimens, two complete coding sequences and two partial genomic sequences of human bocavirus 1 (n=3) and 2 (n=1) were co-amplified and sequenced, representing the first genome sequences of human bocaviruses from Vietnam. The sequences may aid future study aiming at understanding the evolution of the virus. PMID:28090592
Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area.
Nakano, Kazuma; Shiroma, Akino; Shimoji, Makiko; Tamotsu, Hinako; Ashimine, Noriko; Ohki, Shun; Shinzato, Misuzu; Minami, Maiko; Nakanishi, Tetsuhiro; Teruya, Kuniko; Satou, Kazuhito; Hirano, Takashi
2017-07-01
PacBio RS II is the first commercialized third-generation DNA sequencer able to sequence a single molecule DNA in real-time without amplification. PacBio RS II's sequencing technology is novel and unique, enabling the direct observation of DNA synthesis by DNA polymerase. PacBio RS II confers four major advantages compared to other sequencing technologies: long read lengths, high consensus accuracy, a low degree of bias, and simultaneous capability of epigenetic characterization. These advantages surmount the obstacle of sequencing genomic regions such as high/low G+C, tandem repeat, and interspersed repeat regions. Moreover, PacBio RS II is ideal for whole genome sequencing, targeted sequencing, complex population analysis, RNA sequencing, and epigenetics characterization. With PacBio RS II, we have sequenced and analyzed the genomes of many species, from viruses to humans. Herein, we summarize and review some of our key genome sequencing projects, including full-length viral sequencing, complete bacterial genome and almost-complete plant genome assemblies, and long amplicon sequencing of a disease-associated gene region. We believe that PacBio RS II is not only an effective tool for use in the basic biological sciences but also in the medical/clinical setting.
Liu, Guozheng; Cao, Dandan; Li, Shuangshuang; Su, Aiguo; Geng, Jianing; Grover, Corrinne E.; Hu, Songnian; Hua, Jinping
2013-01-01
Background Mitochondria are the main manufacturers of cellular ATP in eukaryotes. The plant mitochondrial genome contains a large amount of foreign DNA and repeated sequences that frequently undergo intramolecular recombination. Upland cotton (Gossypium hirsutum L.) is one of the main natural fiber crops and also an important oil-producing plant. Sequencing of the cotton mitochondrial (mt) genome could be helpful for research on the evolution of plant mt genomes. Methodology/Principal Findings We used 454 technology for sequencing, combined with screening of a Fosmid library of the Gossypium hirsutum mt genome and sequencing of positive clones, and conducted a series of evolutionary analyses on the mt genomes of Cycas taitungensis and 24 angiosperms. After data assembly and contig joining, the complete mitochondrial genome sequence of G. hirsutum was obtained. The completed G. hirsutum mt genome is 621,884 bp in length and contains 68 genes, including 35 protein genes, four rRNA genes and 29 tRNA genes. Five gene clusters are conserved in all plant mt genomes; one and four clusters are specifically conserved in monocots and dicots, respectively. Homologous sequences are distributed along the plant mt genomes, and closely related species share the most homologous sequences. For species that have both mt and chloroplast genome sequences available, we checked the location of cp-like migration and found several fragments closely linked with mitochondrial genes. Conclusion The G. hirsutum mt genome possesses most of the common characters of higher plant mt genomes. The existence of syntenic gene clusters, as well as the conservation of some intergenic sequences and genic content among the plant mt genomes, suggests that evolution of mt genomes is consistent with plant taxonomy but independent among different species. PMID:23940520
Nagano, Soichiro; Shirasawa, Kenta; Hirakawa, Hideki; Maeda, Fumi; Ishikawa, Masami; Isobe, Sachiko N
2017-05-12
The strawberry, Fragaria × ananassa, is an allo-octoploid (2n = 8x = 56) and outcrossing species. Although it is the most widely consumed berry crop in the world, its complex genome structure has hindered its genetic and genomic analysis, and thus discrimination of subgenome-specific loci among the homoeologous chromosomes is needed. In the present study, we identified candidate subgenome-specific single nucleotide polymorphism (SNP) and simple sequence repeat (SSR) loci, and constructed a linkage map using an S1 mapping population of the cultivar 'Reikou' with an IStraw90 Axiom® SNP array and previously published SSR markers. The 'Reikou' linkage map consisted of 11,574 loci (11,002 SNPs and 572 SSR loci) spanning 2816.5 cM of 31 linkage groups. The 11,574 loci were located on 4738 unique positions (bin) on the linkage map. Of the mapped loci, 8999 (8588 SNPs and 411 SSR loci) showed a 1:2:1 segregation ratio of AA:AB:BB allele, which suggested the possibility of deriving loci from candidate subgenome-specific sequences. In addition, 2575 loci (2414 SNPs and 161 SSR loci) showed a 3:1 segregation of AB:BB allele, indicating they were derived from homoeologous genomic sequences. Comparative analysis of the homoeologous linkage groups revealed differences in genome structure among the subgenomes. Our results suggest that candidate subgenome-specific loci are randomly located across the genomes, and that there are small- to large-scale structural variations among the subgenomes. The mapped SNPs and SSR loci on the linkage map are expected to be seed points for the construction of pseudomolecules in the octoploid strawberry.
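The 1:2:1 and 3:1 segregation classes mentioned in the abstract above can be illustrated with a chi-square goodness-of-fit check. This is a minimal sketch and not the authors' pipeline: the function names and the genotype counts are hypothetical, and the critical values are the standard 5% thresholds for 1 and 2 degrees of freedom.

```python
# Chi-square goodness-of-fit test of genotype counts against a Mendelian ratio.
# All names and counts here are illustrative assumptions.

# Standard chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CRITICAL_0_05 = {1: 3.841, 2: 5.991}


def chi_square_stat(observed, ratio):
    """Return the chi-square statistic of observed counts versus an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_ratio(observed, ratio):
    """True if the counts are consistent with the ratio at the 5% level."""
    degrees_of_freedom = len(observed) - 1
    return chi_square_stat(observed, ratio) <= CRITICAL_0_05[degrees_of_freedom]


if __name__ == "__main__":
    # Hypothetical counts for one locus in a selfed (S1) population: AA, AB, BB.
    print(fits_ratio([24, 52, 24], (1, 2, 1)))
    # Hypothetical counts for a locus scored in two classes.
    print(fits_ratio([76, 24], (3, 1)))
```

A locus whose counts pass the 1:2:1 test would be a candidate for the first class described above; one that passes only the 3:1 test would fall in the second.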
The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences
2010-01-01
Background In today's age of genomic discovery, no attempt has been made to comprehensively sequence a gymnosperm genome. The largest genus in the coniferous family Pinaceae is Pinus, whose 110-120 species have extremely large genomes (c. 20-40 Gb, 2N = 24). The size and complexity of these genomes have prompted much speculation as to the feasibility of completing a conifer genome sequence. Conifer genomes are reputed to be highly repetitive, but there is little information available on the nature and identity of repetitive units in gymnosperms. The pines have extensive genetic resources, with approximately 329000 ESTs from eleven species and genetic maps in eight species, including a dense genetic map of the twelve linkage groups in Pinus taeda. Results We present here the Sanger sequence and annotation of ten P. taeda BAC clones and Genome Analyzer II whole genome shotgun (WGS) sequences representing 7.5% of the genome. Computational annotation of ten BACs predicts three putative protein-coding genes and at least fifteen likely pseudogenes in nearly one megabase of sequence. We found three conifer-specific LTR retroelements in the BACs, and tentatively identified at least 15 others based on evidence from the distantly related angiosperms. Alignment of WGS sequences to the BACs indicates that 80% of BAC sequences have similar copies (≥ 75% nucleotide identity) elsewhere in the genome, but only 23% have identical copies (99% identity). The three most common repetitive elements in the genome were identified and, when combined, represent less than 5% of the genome. Conclusions This study indicates that the majority of repeats in the P. taeda genome are 'novel' and will therefore require additional BAC or genomic sequencing for accurate characterization. The pine genome contains a very large number of diverged and probably defunct repetitive elements. This study also provides new evidence that sequencing a pine genome using a WGS approach is a feasible goal. 
PMID:20609256
O'Brien, Heath E; Gong, Yunchen; Fung, Pauline; Wang, Pauline W; Guttman, David S
2011-01-01
Next-generation genomic technology has both greatly accelerated the pace of genome research as well as increased our reliance on draft genome sequences. While groups such as the Genomics Standards Consortium have made strong efforts to promote genome standards there is still a general lack of uniformity among published draft genomes, leading to challenges for downstream comparative analyses. This lack of uniformity is a particular problem when using standard draft genomes that frequently have large numbers of low-quality sequencing tracts. Here we present a proposal for an "enhanced-quality draft" genome that identifies at least 95% of the coding sequences, thereby effectively providing a full accounting of the genic component of the genome. Enhanced-quality draft genomes are easily attainable through a combination of small- and large-insert next-generation, paired-end sequencing. We illustrate the generation of an enhanced-quality draft genome by re-sequencing the plant pathogenic bacterium Pseudomonas syringae pv. phaseolicola 1448A (Pph 1448A), which has a published, closed genome sequence of 5.93 Mbp. We use a combination of Illumina paired-end and mate-pair sequencing, and surprisingly find that de novo assemblies with 100x paired-end coverage and mate-pair sequencing with as low as 2-5x coverage are substantially better than assemblies based on higher coverage. The rapid and low-cost generation of large numbers of enhanced-quality draft genome sequences will be of particular value for microbial diagnostics and biosecurity, which rely on precise discrimination of potentially dangerous clones from closely related benign strains.
ERIC Educational Resources Information Center
Taylor, D. Leland; Campbell, A. Malcolm; Heyer, Laurie J.
2013-01-01
Next-generation sequencing technologies have greatly reduced the cost of sequencing genomes. With the current sequencing technology, a genome is broken into fragments and sequenced, producing millions of "reads." A computer algorithm pieces these reads together in the genome assembly process. PHAST is a set of online modules…
Exome-wide DNA capture and next generation sequencing in domestic and wild species.
Cosart, Ted; Beja-Pereira, Albano; Chen, Shanyuan; Ng, Sarah B; Shendure, Jay; Luikart, Gordon
2011-07-05
Gene-targeted and genome-wide markers are crucial to advance evolutionary biology, agriculture, and biodiversity conservation by improving our understanding of genetic processes underlying adaptation and speciation. Unfortunately, for eukaryotic species with large genomes it remains costly to obtain genome sequences and to develop genome resources such as genome-wide SNPs. A method is needed to allow gene-targeted, next-generation sequencing that is flexible enough to include any gene or number of genes, unlike transcriptome sequencing. Such a method would allow sequencing of many individuals, avoiding ascertainment bias in subsequent population genetic analyses. We demonstrate the usefulness of a recent technology, exon capture, for genome-wide, gene-targeted marker discovery in species with no genome resources. We use coding gene sequences from the domestic cow genome sequence (Bos taurus) to capture (enrich for), and subsequently sequence, thousands of exons of B. taurus, B. indicus, and Bison bison (wild bison). Our capture array has probes for 16,131 exons in 2,570 genes, including 203 candidate genes with known function and of interest for their association with disease and other fitness traits. We successfully sequenced and mapped exon sequences from across the 29 autosomes and X chromosome in the B. taurus genome sequence. Exon capture and high-throughput sequencing identified thousands of putative SNPs spread evenly across all reference chromosomes, in all three individuals, including hundreds of SNPs in our targeted candidate genes. This study shows exon capture can be customized for SNP discovery in many individuals and for non-model species without genomic resources. Our captured exome subset was small enough for affordable next-generation sequencing, and successfully captured exons from a divergent wild species using the domestic cow genome as reference.
Mosaic Graphs and Comparative Genomics in Phage Communities
Belcaid, Mahdi; Bergeron, Anne
2010-01-01
Comparing the genomes of two closely related viruses often produces mosaics where nearly identical sequences alternate with sequences that are unique to each genome. When several closely related genomes are compared, the unique sequences are likely to be shared with third genomes, leading to virus mosaic communities. Here we present comparative analysis of sets of Staphylococcus aureus phages that share large identical sequences with up to three other genomes, and with different partners along their genomes. We introduce mosaic graphs to represent these complex recombination events, and use them to illustrate the breadth and depth of sequence sharing: some genomes are almost completely made up of shared sequences, while genomes that share very large identical sequences can adopt alternate functional modules. Mosaic graphs also allow us to identify breakpoints that could eventually be used for the construction of recombination networks. These findings have several implications on phage metagenomics assembly, on the horizontal gene transfer paradigm, and more generally on the understanding of the composition and evolutionary dynamics of virus communities. PMID:20874413
Camillo, Julceia; Leão, André P; Alves, Alexandre A; Formighieri, Eduardo F; Azevedo, Ana LS; Nunes, Juliana D; de Capdeville, Guy; de A Mattos, Jean K; Souza, Manoel T
2014-01-01
Aiming at generating a comprehensive genomic database on Elaeis spp., our group is leading several R&D initiatives with Elaeis guineensis (African oil palm) and Elaeis oleifera (American oil palm), including the whole-genome sequencing of the latter. Genome size estimates currently available for this genus are controversial, as they indicate that the American oil palm genome is about half the size of the African oil palm genome and that the genome of the interspecific hybrid is bigger than both the parental species genomes. We estimated the genome size of three E. guineensis genotypes, five E. oleifera genotypes, and two interspecific hybrid genotypes. On average, the genome size of E. guineensis is 4.32 ± 0.173 pg, while that of E. oleifera is 4.43 ± 0.018 pg. This indicates that both genomes are similar in size, although the E. oleifera genome is in fact larger. As expected, the hybrid genome size is around the average of the two genomes, 4.40 ± 0.016 pg. Additionally, we demonstrate that both species present around 38% of GC content. As our results contradict the currently available data on Elaeis spp. genome sizes, we propose that the actual genome size of the Elaeis species is around 4 pg and that American oil palm possesses a larger genome than African oil palm. PMID:26203259
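Genome sizes reported in picograms, as in the abstract above, are often converted to base pairs for comparison with sequence assemblies. The sketch below uses the widely cited factor of about 978 Mbp per pg of DNA. It is an illustration only: the abstract does not say whether the reported values are 1C or 2C amounts, so the conversion is applied to the figures as given, and the function name is hypothetical.

```python
# Convert DNA content in picograms to megabase pairs.
# Assumption: 1 pg of DNA corresponds to roughly 978 Mbp.
MBP_PER_PG = 978.0


def pg_to_mbp(picograms):
    """Return the approximate number of megabase pairs in the given DNA mass."""
    return picograms * MBP_PER_PG


if __name__ == "__main__":
    # Mean values quoted in the abstract, in pg.
    reported = {"E. guineensis": 4.32, "E. oleifera": 4.43, "hybrid": 4.40}
    for name, pg in reported.items():
        print(f"{name}: {pg} pg is about {pg_to_mbp(pg):.0f} Mbp")
```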
Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding.
Lan, Freeman; Demaree, Benjamin; Ahmed, Noorsher; Abate, Adam R
2017-07-01
The application of single-cell genome sequencing to large cell populations has been hindered by technical challenges in isolating single cells during genome preparation. Here we present single-cell genomic sequencing (SiC-seq), which uses droplet microfluidics to isolate, fragment, and barcode the genomes of single cells, followed by Illumina sequencing of pooled DNA. We demonstrate ultra-high-throughput sequencing of >50,000 cells per run in a synthetic community of Gram-negative and Gram-positive bacteria and fungi. The sequenced genomes can be sorted in silico based on characteristic sequences. We use this approach to analyze the distributions of antibiotic-resistance genes, virulence factors, and phage sequences in microbial communities from an environmental sample. The ability to routinely sequence large populations of single cells will enable the de-convolution of genetic heterogeneity in diverse cell populations.
Nowrousian, Minou; Stajich, Jason E.; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D.; Pöggeler, Stefanie; Read, Nick D.; Seiler, Stephan; Smith, Kristina M.; Zickler, Denise; Kück, Ulrich; Freitag, Michael
2010-01-01
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30–90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ∼4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. 
Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology. PMID:20386741
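The N50 values quoted in the abstract above (117 kb, rising to 498 kb) summarize assembly contiguity: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly. A minimal sketch of that calculation follows; the contig lengths in the usage example are made up for illustration and do not come from the S. macrospora assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the assembly is the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in kb.
    print(n50([80, 70, 50, 40, 30, 20]))
```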
Newborn Sequencing in Genomic Medicine and Public Health
Agrawal, Pankaj B.; Bailey, Donald B.; Beggs, Alan H.; Brenner, Steven E.; Brower, Amy M.; Cakici, Julie A.; Ceyhan-Birsoy, Ozge; Chan, Kee; Chen, Flavia; Currier, Robert J.; Dukhovny, Dmitry; Green, Robert C.; Harris-Wai, Julie; Holm, Ingrid A.; Iglesias, Brenda; Joseph, Galen; Kingsmore, Stephen F.; Koenig, Barbara A.; Kwok, Pui-Yan; Lantos, John; Leeder, Steven J.; Lewis, Megan A.; McGuire, Amy L.; Milko, Laura V.; Mooney, Sean D.; Parad, Richard B.; Pereira, Stacey; Petrikin, Joshua; Powell, Bradford C.; Powell, Cynthia M.; Puck, Jennifer M.; Rehm, Heidi L.; Risch, Neil; Roche, Myra; Shieh, Joseph T.; Veeraraghavan, Narayanan; Watson, Michael S.; Willig, Laurel; Yu, Timothy W.; Urv, Tiina; Wise, Anastasia L.
2017-01-01
The rapid development of genomic sequencing technologies has decreased the cost of genetic analysis to the extent that it seems plausible that genome-scale sequencing could have widespread availability in pediatric care. Genomic sequencing provides a powerful diagnostic modality for patients who manifest symptoms of monogenic disease and an opportunity to detect health conditions before their development. However, many technical, clinical, ethical, and societal challenges should be addressed before such technology is widely deployed in pediatric practice. This article provides an overview of the Newborn Sequencing in Genomic Medicine and Public Health Consortium, which is investigating the application of genome-scale sequencing in newborns for both diagnosis and screening. PMID:28096516
Bai, Yu; Iwasaki, Yuki; Kanaya, Shigehiko; Zhao, Yue; Ikemura, Toshimichi
2014-01-01
With the remarkable increase in genomic sequence data for a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on one map. By modifying the conventional SOM, we have previously developed Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes and then the compositions in the human and mouse genomes in order to investigate an efficient method for detecting differences between the closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," and the regions specifically enriched in transcription-factor-binding sequences. Because the classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
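The input to an oligonucleotide-composition analysis such as the one described above is a frequency vector per sequence window. The sketch below computes a pentanucleotide frequency vector for one sequence. It only illustrates the feature-extraction step, not BLSOM itself; the function name, the handling of ambiguous bases, and the choice not to merge reverse complements are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=5):
    """Return k-mer frequencies of seq as a list ordered over all 4**k k-mers.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTACGGTTAACCGGN")
    print(len(vector), sum(vector))
```

In practice each 100 kb window of a genome would yield one such 1024-dimensional vector, and the vectors would then be passed to a clustering method.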
2005-01-01
Sequencing of the human genome has ushered in a new era of biology. The technologies developed to facilitate the sequencing of the human genome are now being applied to the sequencing of other genomes. In 2004, a partnership was formed between Washington University School of Medicine Genome Sequencing Center's Outreach Program and Washington University Department of Biology Science Outreach to create a video tour depicting the processes involved in large-scale sequencing. “Sequencing a Genome: Inside the Washington University Genome Sequencing Center” is a tour of the laboratory that follows the steps in the sequencing pipeline, interspersed with animated explanations of the scientific procedures used at the facility. Accompanying interviews with the staff illustrate different entry levels for a career in genome science. This video project serves as an example of how research and academic institutions can provide teachers and students with access and exposure to innovative technologies at the forefront of biomedical research. Initial feedback on the video from undergraduate students, high school teachers, and high school students provides suggestions for use of this video in a classroom setting to supplement present curricula. PMID:16341256
From Conventional to Next Generation Sequencing of Epstein-Barr Virus Genomes.
Kwok, Hin; Chiang, Alan Kwok Shing
2016-02-24
Genomic sequences of Epstein-Barr virus (EBV) have been of interest because the virus is associated with cancers, such as nasopharyngeal carcinoma, and conditions such as infectious mononucleosis. The progress of whole-genome EBV sequencing has been limited by the inefficiency and cost of the first-generation sequencing technology. With the advancement of next-generation sequencing (NGS) and target enrichment strategies, an increasing number of EBV genomes have been published. These genomes were sequenced using different approaches, either with or without EBV DNA enrichment. This review provides an overview of the EBV genomes published to date, and a description of the sequencing technology and bioinformatic analyses employed in generating these sequences. We further explored ways through which the quality of sequencing data can be improved, such as using DNA oligos for capture hybridization, and longer insert size and read length in the sequencing runs. These advances will enable large-scale genomic sequencing of EBV which will facilitate a better understanding of the genetic variations of EBV in different geographic regions and discovery of potentially pathogenic variants in specific diseases.
Initial sequencing and comparative analysis of the mouse genome
DOE Office of Scientific and Technical Information (OSTI.GOV)
Waterston, Robert H.; Lindblad-Toh, Kerstin; Birney, Ewan
2002-12-15
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
Tapping the promise of genomics in species with complex, nonmodel genomes.
Hirsch, Candice N; Buell, C Robin
2013-01-01
Genomics is enabling a renaissance in all disciplines of plant biology. However, many plant genomes are complex and remain recalcitrant to current genomic technologies. The complexities of these nonmodel plant genomes are attributable to gene and genome duplication, heterozygosity, ploidy, and/or repetitive sequences. Methods are available to simplify the genome and reduce these barriers, including inbreeding and genome reduction, making these species amenable to current sequencing and assembly methods. Some, but not all, of the complexities in nonmodel genomes can be bypassed by sequencing the transcriptome rather than the genome. Additionally, comparative genomics approaches, which leverage phylogenetic relatedness, can aid in the interpretation of complex genomes. Although there are limitations in accessing complex nonmodel plant genomes using current sequencing technologies, genome manipulation and resourceful analyses can allow access to even the most recalcitrant plant genomes.
Genome-wide characterization of centromeric satellites from multiple mammalian genomes.
Alkan, Can; Cardone, Maria Francesca; Catacchio, Claudia Rita; Antonacci, Francesca; O'Brien, Stephen J; Ryder, Oliver A; Purgato, Stefania; Zoli, Monica; Della Valle, Giuliano; Eichler, Evan E; Ventura, Mario
2011-01-01
Despite its importance in cell biology and evolution, the centromere has remained the final frontier in genome assembly and annotation due to its complex repeat structure. However, isolation and characterization of the centromeric repeats from newly sequenced species are necessary for a complete understanding of genome evolution and function. In recent years, various genomes have been sequenced, but the characterization of the corresponding centromeric DNA has lagged behind. Here, we present a computational method (RepeatNet) to systematically identify higher-order repeat structures from unassembled whole-genome shotgun sequence and test whether these sequence elements correspond to functional centromeric sequences. We analyzed genome datasets from six species of mammals representing the diversity of the mammalian lineage, namely, horse, dog, elephant, armadillo, opossum, and platypus. We define candidate monomer satellite repeats and demonstrate centromeric localization for five of the six genomes. Our analysis revealed the greatest diversity of centromeric sequences in horse and dog in contrast to elephant and armadillo, which showed high-centromeric sequence homogeneity. We could not isolate centromeric sequences within the platypus genome, suggesting that centromeres in platypus are not enriched in satellite DNA. Our method can be applied to the characterization of thousands of other vertebrate genomes anticipated for sequencing in the near future, providing an important tool for annotation of centromeres.
Parents' interest in whole-genome sequencing of newborns.
Goldenberg, Aaron J; Dodson, Daniel S; Davis, Matthew M; Tarini, Beth A
2014-01-01
The aim of this study was to assess parents' interest in whole-genome sequencing for newborns. We conducted a survey of a nationally representative sample of 1,539 parents about their interest in whole-genome sequencing of newborns. Participants were randomly presented with one of two scenarios that differed in the venue of testing: one offered whole-genome sequencing through a state newborn screening program, whereas the other offered whole-genome sequencing in a pediatrician's office. Overall interest in having future newborns undergo whole-genome sequencing was generally high among parents. If whole-genome sequencing were offered through a state's newborn-screening program, 74% of parents were either definitely or somewhat interested in utilizing this technology. If offered in a pediatrician's office, 70% of parents were either definitely or somewhat interested. Parents in both groups most frequently identified test accuracy and the ability to prevent a child from developing a disease as "very important" in making a decision to have a newborn's whole genome sequenced. These data may help health departments and children's health-care providers anticipate parents' level of interest in genomic screening for newborns. As whole-genome sequencing is integrated into clinical and public health services, these findings may inform the development of educational strategies and outreach messages for parents.
Trotta, Edoardo
2016-05-17
The three stop codons UAA, UAG, and UGA signal the termination of mRNA translation. As a result of a mechanism that is not adequately understood, they are normally used with unequal frequencies. In this work, we showed that selective forces and mutational biases drive stop codon usage in the human genome. We found that, with respect to sense codons, stop codon usage was affected by stronger selective forces but was less influenced by neutral mutational biases. UGA is the most frequent termination codon in the human genome. However, UAA was the preferred stop codon in genes with high breadth of expression, high level of expression, AT-rich coding sequences, housekeeping functions, and in gene ontology categories with the largest deviation from expected stop codon usage. Selective forces associated with the breadth and the level of expression favoured AT-rich sequences in the mRNA region including the stop site and its proximal 3'-UTR, but acted with scarce effects on sense codons, generating two regions, upstream and downstream of the stop codon, with strongly different base composition. By favouring low levels of GC-content, selection promoted labile local secondary structures at the stop site and its proximal 3'-UTR. The compositional and structural context favoured by selection was surprisingly emphasized in the class of ribosomal proteins and was consistent with sequence elements that increase the efficiency of translational termination. Stop codons were also heterogeneously distributed among chromosomes by a mechanism that was strongly correlated with the GC-content of coding sequences. In the human genome, the nucleotide composition and the thermodynamic stability of the stop codon site and its proximal 3'-UTR are correlated with the GC-content of coding sequences and with the breadth and the level of gene expression. In highly expressed genes stop codon usage is compositionally and structurally consistent with highly efficient translation termination signals.
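Stop codon usage of the kind discussed above can be tabulated directly from a set of coding sequences by inspecting the last three bases of each one. The sketch below is a toy illustration under that assumption: the function name is hypothetical, the example sequences are invented, and sequences that do not end in a recognized stop codon are simply ignored.

```python
from collections import Counter

# Stop codons in DNA notation; RNA input is handled by replacing U with T.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_usage(coding_sequences):
    """Return the fraction of coding sequences ending in each stop codon."""
    counts = Counter({codon: 0 for codon in STOP_CODONS})
    for cds in coding_sequences:
        last_codon = cds.upper().replace("U", "T")[-3:]
        if last_codon in STOP_CODONS:
            counts[last_codon] += 1
    total = sum(counts.values())
    return {codon: (counts[codon] / total if total else 0.0) for codon in STOP_CODONS}


if __name__ == "__main__":
    # Invented coding sequences, each ending in a stop codon.
    toy_cds = ["ATGAAATGA", "ATGCCCTAA", "ATGGGGUGA", "ATGTTTTAG"]
    print(stop_codon_usage(toy_cds))
```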
Taverniers, Isabel; Van Bockstaele, Erik; De Loose, Marc
2004-03-01
Analytical real-time PCR technology is a powerful tool for implementation of the GMO labeling regulations enforced in the EU. The quality of analytical measurement data obtained by quantitative real-time PCR depends on the correct use of calibrator and reference materials (RMs). For GMO methods of analysis, the choice of appropriate RMs is currently under debate. So far, genomic DNA solutions from certified reference materials (CRMs) are most often used as calibrators for GMO quantification by means of real-time PCR. However, due to some intrinsic features of these CRMs, errors may be expected in the estimations of DNA sequence quantities. In this paper, two new real-time PCR methods are presented for Roundup Ready soybean, in which two types of plasmid DNA fragments are used as calibrators. Single-target plasmids (STPs) diluted in a background of genomic DNA were used in the first method. Multiple-target plasmids (MTPs) containing both sequences in one molecule were used as calibrators for the second method. Both methods simultaneously detect a promoter 35S sequence as GMO-specific target and a lectin gene sequence as endogenous reference target in a duplex PCR. For the estimation of relative GMO percentages both "delta C(T)" and "standard curve" approaches are tested. Delta C(T) methods are based on direct comparison of measured C(T) values of both the GMO-specific target and the endogenous target. Standard curve methods measure absolute amounts of target copies or haploid genome equivalents. A duplex delta C(T) method with STP calibrators performed at least as well as a similar method with genomic DNA calibrators from commercial CRMs. Besides this, high quality results were obtained with a standard curve method using MTP calibrators. This paper demonstrates that plasmid DNA molecules containing either one or multiple target sequences form perfect alternative calibrators for GMO quantification and are especially suitable for duplex PCR reactions.
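The delta C(T) idea can be sketched numerically. The example below assumes perfect doubling per cycle (amplification efficiency of 2) and a calibrator of known GMO percentage; both are assumptions made for illustration, and the function and parameter names are hypothetical rather than the authors' exact calculation.

```python
def relative_gmo_percent(ct_gmo, ct_ref, ct_gmo_cal, ct_ref_cal,
                         cal_percent=100.0, efficiency=2.0):
    """Estimate GMO percentage from C(T) values by a delta-delta C(T) comparison.

    ct_gmo, ct_ref         : C(T) of the GMO-specific and endogenous targets in the sample
    ct_gmo_cal, ct_ref_cal : the same two C(T) values measured on the calibrator
    cal_percent            : known GMO percentage of the calibrator (assumed)
    efficiency             : fold increase per cycle (2.0 means perfect doubling, assumed)
    """
    delta_sample = ct_gmo - ct_ref          # delta C(T) of the sample
    delta_cal = ct_gmo_cal - ct_ref_cal     # delta C(T) of the calibrator
    return cal_percent * efficiency ** -(delta_sample - delta_cal)

# Hypothetical readings: the sample lags the 1% calibrator by three cycles
estimate = relative_gmo_percent(30.0, 24.0, 27.0, 24.0, cal_percent=1.0)
```

A standard curve method would instead convert each C(T) to an absolute copy number through a fitted calibration line before taking the ratio.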
Bielaszewska, Martina; Karch, Helge; Toth, Ian K.
2012-01-01
Background An Escherichia coli O104:H4 outbreak in Germany in summer 2011 caused 53 deaths, over 4000 individual infections across Europe, and considerable economic, social and political impact. This outbreak was the first in a position to exploit rapid, benchtop high-throughput sequencing (HTS) technologies and crowdsourced data analysis early in its investigation, establishing a new paradigm for rapid response to disease threats. We describe a novel strategy for design of diagnostic PCR primers that exploited this rapid draft bacterial genome sequencing to distinguish between E. coli O104:H4 outbreak isolates and other pathogenic E. coli isolates, including the historical hæmolytic uræmic syndrome (HUSEC) E. coli HUSEC041 O104:H4 strain, which possesses the same serotype as the outbreak isolates. Methodology/Principal Findings Primers were designed using a novel alignment-free strategy against eleven draft whole genome assemblies of E. coli O104:H4 German outbreak isolates from the E. coli O104:H4 Genome Analysis Crowd-Sourcing Consortium website, and a negative sequence set containing 69 E. coli chromosome and plasmid sequences from public databases. Validation in vitro against 21 ‘positive’ E. coli O104:H4 outbreak and 32 ‘negative’ non-outbreak EHEC isolates indicated that individual primer sets exhibited 100% sensitivity for outbreak isolates, with false positive rates of between 9% and 22%. A minimal combination of two primers discriminated between outbreak and non-outbreak E. coli isolates with 100% sensitivity and 100% specificity. Conclusions/Significance Draft genomes of isolates of disease outbreak bacteria enable high throughput primer design and enhanced diagnostic performance in comparison to traditional molecular assays. Future outbreak investigations will be able to harness HTS rapidly to generate draft genome sequences and diagnostic primer sets, greatly facilitating epidemiology and clinical diagnostics. 
We expect that high throughput primer design strategies will enable faster, more precise responses to future disease outbreaks of bacterial origin, and help to mitigate their societal impact. PMID:22496820
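The validation figures above are sensitivity and specificity computed over isolates of known status. The sketch below shows that calculation and one plausible way two primer sets could be combined, calling an isolate positive only when both amplify. The AND rule, the function names and the toy results are assumptions for illustration and are not taken from the paper.

```python
def sensitivity_specificity(calls, truth):
    """Return (sensitivity, specificity) for boolean calls against boolean truth."""
    tp = sum(1 for c, t in zip(calls, truth) if c and t)
    fn = sum(1 for c, t in zip(calls, truth) if not c and t)
    tn = sum(1 for c, t in zip(calls, truth) if not c and not t)
    fp = sum(1 for c, t in zip(calls, truth) if c and not t)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

def combine_and(calls_a, calls_b):
    """Call an isolate positive only if both primer sets amplify (assumed rule)."""
    return [a and b for a, b in zip(calls_a, calls_b)]

# Toy results (hypothetical): three outbreak isolates followed by four non-outbreak isolates
truth    = [True, True, True, False, False, False, False]
primer_a = [True, True, True, True,  False, False, False]
primer_b = [True, True, True, False, True,  False, False]

single = sensitivity_specificity(primer_a, truth)
paired = sensitivity_specificity(combine_and(primer_a, primer_b), truth)
```

In this toy case each primer set alone gives one false positive, while the pair gives none, which mirrors the pattern reported in the abstract.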
Genome signature analysis of thermal virus metagenomes reveals Archaea and thermophilic signatures
Pride, David T; Schoenfeld, Thomas
2008-01-01
Background Metagenomic analysis provides a rich source of biological information for otherwise intractable viral communities. However, study of viral metagenomes has been hampered by its nearly complete reliance on BLAST algorithms for identification of DNA sequences. We sought to develop algorithms for examination of viral metagenomes to identify the origin of sequences independent of BLAST algorithms. We chose viral metagenomes obtained from two hot springs, Bear Paw and Octopus, in Yellowstone National Park, as they represent simple microbial populations where comparatively large contigs were obtained. Thermal spring metagenomes have high proportions of sequences without significant Genbank homology, which has hampered identification of viruses and their linkage with hosts. To analyze each metagenome, we developed a method to classify DNA fragments using genome signature-based phylogenetic classification (GSPC), where metagenomic fragments are compared to a database of oligonucleotide signatures for all previously sequenced Bacteria, Archaea, and viruses. Results From both Bear Paw and Octopus hot springs, each assembled contig had more similarity to other metagenome contigs than to any sequenced microbial genome based on GSPC analysis, suggesting a genome signature common to each of these extreme environments. While viral metagenomes from Bear Paw and Octopus share some similarity, the genome signatures from each locale are largely unique. GSPC using a microbial database predicts most of the Octopus metagenome has archaeal signatures, while bacterial signatures predominate in Bear Paw; a finding consistent with those of Genbank BLAST. When using a viral database, the majority of the Octopus metagenome is predicted to belong to archaeal virus Families Globuloviridae and Fuselloviridae, while none of the Bear Paw metagenome is predicted to belong to archaeal viruses. 
As expected, when microbial and viral databases are combined, each of the Octopus and Bear Paw metagenomic contigs are predicted to belong to viruses rather than to any Bacteria or Archaea, consistent with the apparent viral origin of both metagenomes. Conclusion That BLAST searches identify no significant homologs for most metagenome contigs, while GSPC suggests their origin as archaeal viruses or bacteriophages, indicates GSPC provides a complementary approach in viral metagenomic analysis. PMID:18798991
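A genome signature is commonly represented as a vector of oligonucleotide frequencies. The sketch below builds such a vector and assigns a fragment to the nearest reference by Euclidean distance. It is a generic illustration under that assumption, not the GSPC method itself; the function names, the choice of distance and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product
import math

def kmer_signature(seq, k=4):
    """Return normalized frequencies of all 4**k oligonucleotides in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignores k-mers with ambiguous bases
    return [counts[m] / total if total else 0.0 for m in kmers]

def signature_distance(a, b):
    """Euclidean distance between two signature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(fragment, references, k=4):
    """Return the name of the reference whose signature is closest to the fragment."""
    sig = kmer_signature(fragment, k)
    return min(
        references,
        key=lambda name: signature_distance(sig, kmer_signature(references[name], k)),
    )

# Toy references (hypothetical sequences); dinucleotides are used because the inputs are short
references = {
    "at_rich": "ATATATATATATTTAAATAT",
    "gc_rich": "GCGCGGCCGCGCGGGCCCGC",
}
label = classify("ATATTTATAAATAT", references, k=2)
```

Because the comparison uses composition rather than alignment, a fragment with no BLAST homolog can still be placed near the references it most resembles.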
Genome signature analysis of thermal virus metagenomes reveals Archaea and thermophilic signatures.
Pride, David T; Schoenfeld, Thomas
2008-09-17
Metagenomic analysis provides a rich source of biological information for otherwise intractable viral communities. However, study of viral metagenomes has been hampered by its nearly complete reliance on BLAST algorithms for identification of DNA sequences. We sought to develop algorithms for examination of viral metagenomes to identify the origin of sequences independent of BLAST algorithms. We chose viral metagenomes obtained from two hot springs, Bear Paw and Octopus, in Yellowstone National Park, as they represent simple microbial populations where comparatively large contigs were obtained. Thermal spring metagenomes have high proportions of sequences without significant Genbank homology, which has hampered identification of viruses and their linkage with hosts. To analyze each metagenome, we developed a method to classify DNA fragments using genome signature-based phylogenetic classification (GSPC), where metagenomic fragments are compared to a database of oligonucleotide signatures for all previously sequenced Bacteria, Archaea, and viruses. From both Bear Paw and Octopus hot springs, each assembled contig had more similarity to other metagenome contigs than to any sequenced microbial genome based on GSPC analysis, suggesting a genome signature common to each of these extreme environments. While viral metagenomes from Bear Paw and Octopus share some similarity, the genome signatures from each locale are largely unique. GSPC using a microbial database predicts most of the Octopus metagenome has archaeal signatures, while bacterial signatures predominate in Bear Paw; a finding consistent with those of Genbank BLAST. When using a viral database, the majority of the Octopus metagenome is predicted to belong to archaeal virus Families Globuloviridae and Fuselloviridae, while none of the Bear Paw metagenome is predicted to belong to archaeal viruses. 
As expected, when microbial and viral databases are combined, each of the Octopus and Bear Paw metagenomic contigs are predicted to belong to viruses rather than to any Bacteria or Archaea, consistent with the apparent viral origin of both metagenomes. That BLAST searches identify no significant homologs for most metagenome contigs, while GSPC suggests their origin as archaeal viruses or bacteriophages, indicates GSPC provides a complementary approach in viral metagenomic analysis.
Coding Complete Genome for the Mogiana Tick Virus, a Jingmenvirus Isolated from Ticks in Brazil
2017-05-04
sequences for all four genome segments. We downloaded the raw Illumina sequence reads from the NCBI Short Read Archive (GenBank...MGTV genome segments through sequence similarity (BLASTN) to the published genome of Jingmen tick virus (JMTV) isolate SY84 (GenBank: KJ001579-KJ001582...2014. Standards for sequencing viral genomes in the era of high-throughput sequencing. MBio 5:e01360–14. 8. Bankevich A, Nurk S, Antipov
A one-page summary report of genome sequencing for the healthy adult.
Vassy, Jason L; McLaughlin, Heather M; McLaughlin, Heather L; MacRae, Calum A; Seidman, Christine E; Lautenbach, Denise; Krier, Joel B; Lane, William J; Kohane, Isaac S; Murray, Michael F; McGuire, Amy L; Rehm, Heidi L; Green, Robert C
2015-01-01
As genome sequencing technologies increasingly enter medical practice, genetics laboratories must communicate sequencing results effectively to nongeneticist physicians. We describe the design and delivery of a clinical genome sequencing report, including a one-page summary suitable for interpretation by primary care physicians. To illustrate our preliminary experience with this report, we summarize the genomic findings from 10 healthy participants in a study of genome sequencing in primary care. © 2015 S. Karger AG, Basel.
A One-Page Summary Report of Genome Sequencing for the Healthy Adult
Vassy, Jason L.; McLaughlin, Heather M.; MacRae, Calum A.; Seidman, Christine E.; Lautenbach, Denise; Krier, Joel B.; Lane, William J.; Kohane, Isaac S.; Murray, Michael F.; McGuire, Amy L.; Rehm, Heidi L.; Green, Robert C.
2015-01-01
As genome sequencing technologies increasingly enter medical practice, genetics laboratories must communicate sequencing results effectively to non-geneticist physicians. We describe the design and delivery of a clinical genome sequencing report, including a one-page summary suitable for interpretation by primary care physicians. To illustrate our preliminary experience with this report, we summarize the genomic findings from ten healthy patient participants in a study of genome sequencing in primary care. PMID:25612602
Froenicke, Lutz; Lavelle, Dean; Martineau, Belinda; Perroud, Bertrand; Michelmore, Richard
2013-01-01
Several applications of high throughput genome and transcriptome sequencing would benefit from a reduction of the high-copy-number sequences in the libraries being sequenced and analyzed, particularly when applied to species with large genomes. We adapted and analyzed the consequences of a method that utilizes a thermostable duplex-specific nuclease for reducing the high-copy components in transcriptomic and genomic libraries prior to sequencing. This reduces the time, cost, and computational effort of obtaining informative transcriptomic and genomic sequence data for both fully sequenced and non-sequenced genomes. It also reduces contamination from organellar DNA in preparations of nuclear DNA. Hybridization in the presence of 3 M tetramethylammonium chloride (TMAC), which equalizes the rates of hybridization of GC and AT nucleotide pairs, reduced the bias against sequences with high GC content. Consequences of this method on the reduction of high-copy and enrichment of low-copy sequences are reported for Arabidopsis and lettuce. PMID:23409088
Matvienko, Marta; Kozik, Alexander; Froenicke, Lutz; Lavelle, Dean; Martineau, Belinda; Perroud, Bertrand; Michelmore, Richard
2013-01-01
Several applications of high throughput genome and transcriptome sequencing would benefit from a reduction of the high-copy-number sequences in the libraries being sequenced and analyzed, particularly when applied to species with large genomes. We adapted and analyzed the consequences of a method that utilizes a thermostable duplex-specific nuclease for reducing the high-copy components in transcriptomic and genomic libraries prior to sequencing. This reduces the time, cost, and computational effort of obtaining informative transcriptomic and genomic sequence data for both fully sequenced and non-sequenced genomes. It also reduces contamination from organellar DNA in preparations of nuclear DNA. Hybridization in the presence of 3 M tetramethylammonium chloride (TMAC), which equalizes the rates of hybridization of GC and AT nucleotide pairs, reduced the bias against sequences with high GC content. Consequences of this method on the reduction of high-copy and enrichment of low-copy sequences are reported for Arabidopsis and lettuce.
Personal Genome Sequencing in Ostensibly Healthy Individuals and the PeopleSeq Consortium
Linderman, Michael D.; Nielsen, Daiva E.; Green, Robert C.
2016-01-01
Thousands of ostensibly healthy individuals have had their exome or genome sequenced, but a much smaller number of these individuals have received any personal genomic results from that sequencing. We term those projects in which ostensibly healthy participants can receive sequencing-derived genetic findings and may also have access to their genomic data as participatory predispositional personal genome sequencing (PPGS). Here we are focused on genome sequencing applied in a pre-symptomatic context and so define PPGS to exclude diagnostic genome sequencing intended to identify the molecular cause of suspected or diagnosed genetic disease. In this report we describe the design of completed and underway PPGS projects, briefly summarize the results reported to date and introduce the PeopleSeq Consortium, a newly formed collaboration of PPGS projects designed to collect much-needed longitudinal outcome data. PMID:27023617
Research progress of plant population genomics based on high-throughput sequencing.
Wang, Yun-sheng
2016-08-01
Population genomics, a new paradigm for population genetics, combines the concepts and techniques of genomics with the theoretical framework of population genetics and improves our understanding of microevolution by identifying site-specific and genome-wide effects through genotyping of genome-wide polymorphic sites. With the emergence and improvement of next-generation high-throughput sequencing technology, the number of plant species with complete genome sequences has increased rapidly, and large-scale resequencing has also been carried out in recent years. Parallel sequencing has also been done in some plant species without complete genome sequences. These studies have greatly promoted the development of population genomics and deepened our understanding of the genetic diversity, level of linkage disequilibrium, selection effects, demographic history and molecular mechanisms of complex traits of the relevant plant populations at a genomic level. In this review, I briefly introduce the concept and research methods of population genomics and summarize the research progress of plant population genomics based on high-throughput sequencing. I also discuss the prospects and existing problems of plant population genomics in order to provide references for related studies.
Genomic Diversity and Evolution of the Lyssaviruses
Delmas, Olivier; Holmes, Edward C.; Talbi, Chiraz; Larrous, Florence; Dacheux, Laurent; Bouchier, Christiane; Bourhy, Hervé
2008-01-01
Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as ‘Lagos Bat’. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses. PMID:18446239
Fungal Genomics for Energy and Environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grigoriev, Igor V.
2013-03-11
Genomes of fungi relevant to energy and environment are the focus of the Fungal Genomic Program at the US Department of Energy Joint Genome Institute (JGI). One of its projects, the Genomics Encyclopedia of Fungi, targets fungi related to plant health (symbionts, pathogens, and biocontrol agents) and biorefinery processes (cellulose degradation, sugar fermentation, industrial hosts) by means of genome sequencing and analysis. New chapters of the Encyclopedia can be opened with user proposals to the JGI Community Sequencing Program (CSP). Another JGI project, the 1000 fungal genomes, explores fungal diversity at the genome level at scale and is open for users to nominate new species for sequencing. Over 200 fungal genomes have been sequenced by JGI to date and released through MycoCosm (www.jgi.doe.gov/fungi), a fungal web portal that integrates sequence and functional data with genome analysis tools for the user community. Sequence analysis supported by functional genomics leads to the development of parts lists for complex systems ranging from ecosystems of biofuel crops to biorefineries. Recent examples of such parts suggested by comparative genomics and functional analysis in these areas are presented here.
Serendipitous discovery of Wolbachia genomes in multiple Drosophila species.
Salzberg, Steven L; Dunning Hotopp, Julie C; Delcher, Arthur L; Pop, Mihai; Smith, Douglas R; Eisen, Michael B; Nelson, William C
2005-01-01
The Trace Archive is a repository for the raw, unanalyzed data generated by large-scale genome sequencing projects. The existence of this data offers scientists the possibility of discovering additional genomic sequences beyond those originally sequenced. In particular, if the source DNA for a sequencing project came from a species that was colonized by another organism, then the project may yield substantial amounts of genomic DNA, including near-complete genomes, from the symbiotic or parasitic organism. By searching the publicly available repository of DNA sequencing trace data, we discovered three new species of the bacterial endosymbiont Wolbachia pipientis in three different species of fruit fly: Drosophila ananassae, D. simulans, and D. mojavensis. We extracted all sequences with partial matches to a previously sequenced Wolbachia strain and assembled those sequences using customized software. For one of the three new species, the data recovered were sufficient to produce an assembly that covers more than 95% of the genome; for a second species the data produce the equivalent of a 'light shotgun' sampling of the genome, covering an estimated 75-80% of the genome; and for the third species the data cover approximately 6-7% of the genome. The results of this study reveal an unexpected benefit of depositing raw data in a central genome sequence repository: new species can be discovered within this data. The differences between these three new Wolbachia genomes and the previously sequenced strain revealed numerous rearrangements and insertions within each lineage and hundreds of novel genes. The three new genomes, with annotation, have been deposited in GenBank.
Eastman, Alexander W.; Yuan, Ze-Chun
2015-01-01
Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing despite the large numbers of published genomes. Shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, all located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangement generated during the initial in silico assembly of sequencing reads. Our approach reduces the required effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provide a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing projects. 
PMID:25653642
Why Assembling Plant Genome Sequences Is So Challenging
Claros, Manuel Gonzalo; Bautista, Rocío; Guerrero-Fernández, Darío; Benzerki, Hicham; Seoane, Pedro; Fernández-Pozo, Noé
2012-01-01
In spite of the biological and economic importance of plants, relatively few plant species have been sequenced. Only the genome sequence of plants with relatively small genomes, most of them angiosperms, in particular eudicots, has been determined. The arrival of next-generation sequencing technologies has allowed the rapid and efficient development of new genomic resources for non-model or orphan plant species. But the sequencing pace of plants is far from that of animals and microorganisms. This review focuses on the typical challenges of plant genomes that can explain why plant genomics is less developed than animal genomics. Explanations about the impact of some confounding factors emerging from the nature of plant genomes are given. As a result of these challenges and confounding factors, the correct assembly and annotation of plant genomes is hindered, genome drafts are produced, and advances in plant genomics are delayed. PMID:24832233
Insights from Human/Mouse genome comparisons
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pennacchio, Len A.
2003-03-30
Large-scale public genomic sequencing efforts have provided a wealth of vertebrate sequence data poised to provide insights into mammalian biology. These include deep genomic sequence coverage of human, mouse, rat, zebrafish, and two pufferfish (Fugu rubripes and Tetraodon nigroviridis) (Aparicio et al. 2002; Lander et al. 2001; Venter et al. 2001; Waterston et al. 2002). In addition, a high priority has been placed on determining the genomic sequence of chimpanzee, dog, cow, frog, and chicken (Boguski 2002). While only recently available, whole genome sequence data have provided the unique opportunity to globally compare complete genome contents. Furthermore, the shared evolutionary ancestry of vertebrate species has allowed the development of comparative genomic approaches to identify ancient conserved sequences with functionality. Accordingly, this review focuses on the initial comparison of available mammalian genomes and describes various insights derived from such analysis.
Emerging Science and Technology Trends: 2017-2047
2017-11-21
genomics, coupled with the exponentially declining cost of gene editing techniques such as CRISPR, has created fertile ground for rapid technological...sequences from scratch. Falling costs and new gene editing tools like CRISPR are accelerating progress, and the global market is expected to reach...by the Bill & Melinda Gates Foundation, is reengineering the bacteria found in the human gut to fight disease. eGenesis is using CRISPR gene
Katz, Lee S.; Sharma, Nitya V.; Harcourt, Brian H.; Thomas, Jennifer Dolan; Wang, Xin; Mayer, Leonard W.; Jordan, I. King
2011-01-01
Neisseria meningitidis is one of the main agents of bacterial meningitis, causing substantial morbidity and mortality worldwide. However, most of the time N. meningitidis is carried as a commensal not associated with invasive disease. The genomic basis of the difference between disease-associated and carried isolates of N. meningitidis may provide critical insight into mechanisms of virulence, yet it has remained elusive. Here, we have taken a comparative genomics approach to interrogate the difference between disease-associated and carried isolates of N. meningitidis at the level of individual nucleotide variations (i.e., single nucleotide polymorphisms [SNPs]). We aligned complete genome sequences of 8 disease-associated and 4 carried isolates of N. meningitidis to search for SNPs that show mutually exclusive patterns of variation between the two groups. We found 63 SNPs that distinguish the 8 disease-associated genomes from the 4 carried genomes of N. meningitidis, which is far more than can be expected by chance alone given the level of nucleotide variation among the genomes. The putative list of SNPs that discriminate between disease-associated and carriage genomes may be expected to change with increased sampling or changes in the identities of the isolates being compared. Nevertheless, we show that these discriminating SNPs are more likely to reflect phenotypic differences than shared evolutionary history. Discriminating SNPs were mapped to genes, and the functions of the genes were evaluated for possible connections to virulence mechanisms. A number of overrepresented functional categories related to virulence were uncovered among SNP-associated genes, including genes related to the category “symbiosis, encompassing mutualism through parasitism.” PMID:21622743
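The search for SNPs with mutually exclusive patterns between two groups can be sketched as a column scan over aligned sequences. The example below treats a site as discriminating when the two groups share no allele at that position; that reading of "mutually exclusive", the function name and the toy alignment are assumptions for illustration, not the authors' pipeline.

```python
def discriminating_sites(group_a, group_b, skip=("-", "N")):
    """Return 0-based alignment positions where the two groups share no allele.

    group_a, group_b : lists of equal-length aligned sequences
    skip             : characters (gaps, ambiguous bases) that disqualify a column
    """
    length = len(group_a[0])
    sites = []
    for i in range(length):
        alleles_a = {s[i].upper() for s in group_a}
        alleles_b = {s[i].upper() for s in group_b}
        if any(ch in alleles_a | alleles_b for ch in skip):
            continue  # ignore columns with gaps or ambiguous calls
        if alleles_a.isdisjoint(alleles_b):
            sites.append(i)
    return sites

# Toy alignment (hypothetical): three "disease" and two "carriage" sequences
disease  = ["ACGTA", "ACGTA", "ACGAA"]
carriage = ["ATGTC", "ATGTC"]
sites = discriminating_sites(disease, carriage)
```

With few genomes per group some discriminating sites arise by chance, which is why the abstract compares the observed count with the number expected from the overall level of nucleotide variation.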
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
DOE Office of Scientific and Technical Information (OSTI.GOV)
VanBuren, Robert; Bryant, Doug; Edger, Patrick P.
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. As a result, the Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
VanBuren, Robert; Bryant, Doug; Edger, Patrick P.; ...
2015-11-11
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE). Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a ‘near-complete’ draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. As a result, the Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
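The N50 statistic quoted above is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths, or 0 for an empty list."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly (hypothetical contig lengths, in kilobases)
toy_n50 = n50([2, 3, 4, 5, 6])
```

A higher N50 for the same total length indicates that the assembly is less fragmented, which is the sense in which the 2.4 megabase figure is reported.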
It’s More Than Stamp Collecting: How Genome Sequencing Can Unify Biological Research
Richards, Stephen
2015-01-01
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, whilst the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to “Big Science” survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. PMID:26003218
It's more than stamp collecting: how genome sequencing can unify biological research.
Richards, Stephen
2015-07-01
The availability of reference genome sequences, especially the human reference, has revolutionized the study of biology. However, while the genomes of some species have been fully sequenced, a wide range of biological problems still cannot be effectively studied for lack of genome sequence information. Here, I identify neglected areas of biology and describe how both targeted species sequencing and more broad taxonomic surveys of the tree of life can address important biological questions. I enumerate the significant benefits that would accrue from sequencing a broader range of taxa, as well as discuss the technical advances in sequencing and assembly methods that would allow for wide-ranging application of whole-genome analysis. Finally, I suggest that in addition to 'big science' survey initiatives to sequence the tree of life, a modified infrastructure-funding paradigm would better support reference genome sequence generation for research communities most in need. Copyright © 2015 Elsevier Ltd. All rights reserved.
2016-10-27
Institute of Infectious Diseases, Fort Detrick, Frederick, Maryland, USA Running head: Complete Genome Sequence of Y. pestis strain Cadman...Complete Genome Sequence of Pigmentation Negative Yersinia pestis strain Cadman Sean Lovett, Kitty Chase, Galina Koroleva, Gustavo...we report the genome sequence of Yersinia pestis strain Cadman, an attenuated strain lacking the pgm locus. Y. pestis is the causative agent of
MIPS: a database for genomes and protein sequences.
Mewes, H W; Heumann, K; Kaps, A; Mayer, K; Pfeiffer, F; Stocker, S; Frishman, D
1999-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database. PMID:9847138
Droege, Marcus; Hill, Brendon
2008-08-31
The Genome Sequencer FLX System (GS FLX), powered by 454 Sequencing, is a next-generation DNA sequencing technology featuring a unique mix of long reads, exceptional accuracy, and ultra-high throughput. It has been proven to be the most versatile of all currently available next-generation sequencing technologies, supporting many high-profile studies in over seven application categories. GS FLX users have pursued innovative research in de novo sequencing, re-sequencing of whole genomes and target DNA regions, metagenomics, and RNA analysis. 454 Sequencing is a powerful tool for human genetics research: it has recently re-sequenced the genome of an individual human and detected low-frequency somatic mutations linked to cancer, and it is currently re-sequencing the complete human exome and targeted genomic regions using the NimbleGen sequence capture process.
Newborn Sequencing in Genomic Medicine and Public Health.
Berg, Jonathan S; Agrawal, Pankaj B; Bailey, Donald B; Beggs, Alan H; Brenner, Steven E; Brower, Amy M; Cakici, Julie A; Ceyhan-Birsoy, Ozge; Chan, Kee; Chen, Flavia; Currier, Robert J; Dukhovny, Dmitry; Green, Robert C; Harris-Wai, Julie; Holm, Ingrid A; Iglesias, Brenda; Joseph, Galen; Kingsmore, Stephen F; Koenig, Barbara A; Kwok, Pui-Yan; Lantos, John; Leeder, Steven J; Lewis, Megan A; McGuire, Amy L; Milko, Laura V; Mooney, Sean D; Parad, Richard B; Pereira, Stacey; Petrikin, Joshua; Powell, Bradford C; Powell, Cynthia M; Puck, Jennifer M; Rehm, Heidi L; Risch, Neil; Roche, Myra; Shieh, Joseph T; Veeraraghavan, Narayanan; Watson, Michael S; Willig, Laurel; Yu, Timothy W; Urv, Tiina; Wise, Anastasia L
2017-02-01
The rapid development of genomic sequencing technologies has decreased the cost of genetic analysis to the extent that it seems plausible that genome-scale sequencing could have widespread availability in pediatric care. Genomic sequencing provides a powerful diagnostic modality for patients who manifest symptoms of monogenic disease and an opportunity to detect health conditions before their development. However, many technical, clinical, ethical, and societal challenges should be addressed before such technology is widely deployed in pediatric practice. This article provides an overview of the Newborn Sequencing in Genomic Medicine and Public Health Consortium, which is investigating the application of genome-scale sequencing in newborns for both diagnosis and screening. Copyright © 2017 by the American Academy of Pediatrics.
Nowrousian, Minou; Würtz, Christian; Pöggeler, Stefanie; Kück, Ulrich
2004-03-01
One of the most challenging parts of large scale sequencing projects is the identification of functional elements encoded in a genome. Recently, studies of genomes of up to six different Saccharomyces species have demonstrated that a comparative analysis of genome sequences from closely related species is a powerful approach to identify open reading frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423 (2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that due to the high degree of sequence similarity and conservation of overall genomic organization, S. macrospora sequence information can be used to simplify the annotation of the N. crassa genome.
Breeding and Genetics Symposium: networks and pathways to guide genomic selection.
Snelling, W M; Cushman, R A; Keele, J W; Maltecca, C; Thomas, M G; Fortes, M R S; Reverter, A
2013-02-01
Many traits affecting profitability and sustainability of meat, milk, and fiber production are polygenic, with no single gene having an overwhelming influence on observed variation. No knowledge of the specific genes controlling these traits has been needed to make substantial improvement through selection. Significant gains have been made through phenotypic selection enhanced by pedigree relationships and continually improving statistical methodology. Genomic selection, recently enabled by assays for dense SNP located throughout the genome, promises to increase selection accuracy and accelerate genetic improvement by emphasizing the SNP most strongly correlated to phenotype although the genes and sequence variants affecting phenotype remain largely unknown. These genomic predictions theoretically rely on linkage disequilibrium (LD) between genotyped SNP and unknown functional variants, but familial linkage may increase effectiveness when predicting individuals related to those in the training data. Genomic selection with functional SNP genotypes should be less reliant on LD patterns shared by training and target populations, possibly allowing robust prediction across unrelated populations. Although the specific variants causing polygenic variation may never be known with certainty, a number of tools and resources can be used to identify those most likely to affect phenotype. Associations of dense SNP genotypes with phenotype provide a 1-dimensional approach for identifying genes affecting specific traits; in contrast, associations with multiple traits allow defining networks of genes interacting to affect correlated traits. Such networks are especially compelling when corroborated by existing functional annotation and established molecular pathways. The SNP occurring within network genes, obtained from public databases or derived from genome and transcriptome sequences, may be classified according to expected effects on gene products. 
As illustrated by functionally informed genomic predictions being more accurate than naive whole-genome predictions of beef tenderness, coupling evidence from livestock genotypes, phenotypes, gene expression, and genomic variants with existing knowledge of gene functions and interactions may provide greater insight into the genes and genomic mechanisms affecting polygenic traits and facilitate functional genomic selection for economically important traits.
Multiplexed fragaria chloroplast genome sequencing
W. Njuguna; A. Liston; R. Cronn; N.V. Bassil
2010-01-01
A method to sequence multiple chloroplast genomes using ultra high throughput sequencing technologies was recently described. Complete chloroplast genome sequences can resolve phylogenetic relationships at low taxonomic levels and identify informative point mutations and indels. The objective of this research was to sequence multiple Fragaria...
Pseudoexon activation increases phenotype severity in a Becker muscular dystrophy patient.
Greer, Kane; Mizzi, Kayla; Rice, Emily; Kuster, Lukas; Barrero, Roberto A; Bellgard, Matthew I; Lynch, Bryan J; Foley, Aileen Reghan; O Rathallaigh, Eoin; Wilton, Steve D; Fletcher, Sue
2015-07-01
We report a dystrophinopathy patient with an in-frame deletion of DMD exons 45-47, and therefore a genetic diagnosis of Becker muscular dystrophy, who presented with a more severe than expected phenotype. Analysis of the patient DMD mRNA revealed an 82 bp pseudoexon, derived from intron 44, that disrupts the reading frame and is expected to yield a nonfunctional dystrophin. Since the sequence of the pseudoexon and canonical splice sites does not differ from the reference sequence, we concluded that the genomic rearrangement promoted recognition of the pseudoexon, causing a severe dystrophic phenotype. We characterized the deletion breakpoints and identified motifs that might influence selection of the pseudoexon. We concluded that the donor splice site was strengthened by juxtaposition of intron 47, and loss of intron 44 silencer elements, normally located downstream of the pseudoexon donor splice site, further enhanced pseudoexon selection and inclusion in the DMD transcript in this patient.
Human Genome Sequencing in Health and Disease
Gonzaga-Jauregui, Claudia; Lupski, James R.; Gibbs, Richard A.
2013-01-01
Following the “finished,” euchromatic, haploid human reference genome sequence, the rapid development of novel, faster, and cheaper sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal genome information, and describes the application of whole-genome and exome sequencing to identify the cause of genetic diseases and to suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing information that assists in the formulation of a differential diagnosis. We outline herein some of the remaining challenges. PMID:22248320
2009-01-01
Background Conifers are a large group of gymnosperm trees which are separated from the angiosperms by more than 300 million years of independent evolution. Conifer genomes are extremely large and contain considerable amounts of repetitive DNA. Currently, conifer sequence resources exist predominantly as expressed sequence tags (ESTs) and full-length (FL)cDNAs. There is no genome sequence available for a conifer or any other gymnosperm. Conifer defence-related genes often group into large families with closely related members. The goals of this study are to assess the feasibility of targeted isolation and sequence assembly of conifer BAC clones containing specific genes from two large gene families, and to characterize large segments of genomic DNA sequence for the first time from a conifer. Results We used a PCR-based approach to identify BAC clones for two target genes, a terpene synthase (3-carene synthase; 3CAR) and a cytochrome P450 (CYP720B4) from a non-arrayed genomic BAC library of white spruce (Picea glauca). Shotgun genomic fragments isolated from the BAC clones were sequenced to a depth of 15.6- and 16.0-fold coverage, respectively. Assembly and manual curation yielded sequence scaffolds of 172 kbp (3CAR) and 94 kbp (CYP720B4) long. Inspection of the genomic sequences revealed the intron-exon structures, the putative promoter regions and putative cis-regulatory elements of these genes. Sequences related to transposable elements (TEs), high complexity repeats and simple repeats were prevalent and comprised approximately 40% of the sequenced genomic DNA. An in silico simulation of the effect of sequencing depth on the quality of the sequence assembly provides direction for future efforts of conifer genome sequencing. Conclusion We report the first targeted cloning, sequencing, assembly, and annotation of large segments of genomic DNA from a conifer. 
We demonstrate that genomic BAC clones for individual members of multi-member gene families can be isolated in a gene-specific fashion. The results of the present work provide important new information about the structure and content of conifer genomic DNA that will guide future efforts to sequence and assemble conifer genomes. PMID:19656416
Hamberger, Björn; Hall, Dawn; Yuen, Mack; Oddy, Claire; Hamberger, Britta; Keeling, Christopher I; Ritland, Carol; Ritland, Kermit; Bohlmann, Jörg
2009-08-06
Conifers are a large group of gymnosperm trees which are separated from the angiosperms by more than 300 million years of independent evolution. Conifer genomes are extremely large and contain considerable amounts of repetitive DNA. Currently, conifer sequence resources exist predominantly as expressed sequence tags (ESTs) and full-length (FL)cDNAs. There is no genome sequence available for a conifer or any other gymnosperm. Conifer defence-related genes often group into large families with closely related members. The goals of this study are to assess the feasibility of targeted isolation and sequence assembly of conifer BAC clones containing specific genes from two large gene families, and to characterize large segments of genomic DNA sequence for the first time from a conifer. We used a PCR-based approach to identify BAC clones for two target genes, a terpene synthase (3-carene synthase; 3CAR) and a cytochrome P450 (CYP720B4) from a non-arrayed genomic BAC library of white spruce (Picea glauca). Shotgun genomic fragments isolated from the BAC clones were sequenced to a depth of 15.6- and 16.0-fold coverage, respectively. Assembly and manual curation yielded sequence scaffolds of 172 kbp (3CAR) and 94 kbp (CYP720B4) long. Inspection of the genomic sequences revealed the intron-exon structures, the putative promoter regions and putative cis-regulatory elements of these genes. Sequences related to transposable elements (TEs), high complexity repeats and simple repeats were prevalent and comprised approximately 40% of the sequenced genomic DNA. An in silico simulation of the effect of sequencing depth on the quality of the sequence assembly provides direction for future efforts of conifer genome sequencing. We report the first targeted cloning, sequencing, assembly, and annotation of large segments of genomic DNA from a conifer. We demonstrate that genomic BAC clones for individual members of multi-member gene families can be isolated in a gene-specific fashion. 
The results of the present work provide important new information about the structure and content of conifer genomic DNA that will guide future efforts to sequence and assemble conifer genomes.
Whole-genome sequencing in bacteriology: state of the art
Dark, Michael J
2013-01-01
Over the last ten years, genome sequencing capabilities have expanded exponentially. There have been tremendous advances in sequencing technology, DNA sample preparation, genome assembly, and data analysis. This has led to advances in a number of facets of bacterial genomics, including metagenomics, clinical medicine, bacterial archaeology, and bacterial evolution. This review examines the strengths and weaknesses of techniques in bacterial genome sequencing, upcoming technologies, and assembly techniques, and highlights recent studies that demonstrate new applications for bacterial genomics. PMID:24143115
USDA-ARS's Scientific Manuscript database
The ARS Microbial Genome Sequence Database (http://199.133.98.43), a web-based database server, was established utilizing the BIGSdb (Bacterial Isolate Genomics Sequence Database) software package, developed at Oxford University, as a tool to manage multi-locus sequence data for the family Streptomy...
USDA-ARS's Scientific Manuscript database
We report the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1 isolated in Minnesota, USA. The R1-1 genome, generated by de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies....
Draft Genome Sequence of a Rare Smut Relative, Tilletiaria anomala UBC 951
Toome, Merje; Kuo, Alan; Henrissat, Bernard; ...
2014-06-12
We present the draft genome sequence of the smut fungus Tilletiaria anomala UBC 951 (Basidiomycota, Ustilaginomycotina). The sequenced genome size is 18.7 Mb, consisting of 289 scaffolds and a total of 6,810 predicted genes. This is the first genome sequence published for a fungus in the order Georgefisheriales (Exobasidiomycetes).
Qualitative thematic analysis of consent forms used in cancer genome sequencing.
Allen, Clarissa; Foulkes, William D
2011-07-19
Large-scale whole genome sequencing (WGS) studies promise to revolutionize cancer research by identifying targets for therapy and by discovering molecular biomarkers to aid early diagnosis, to better determine prognosis and to improve treatment response prediction. Such projects raise a number of ethical, legal, and social (ELS) issues that should be considered. In this study, we set out to discover how these issues are being handled across different jurisdictions. We examined informed consent (IC) forms from 30 cancer genome sequencing studies to assess (1) stated purpose of sample collection, (2) scope of consent requested, (3) data sharing protocols, (4) privacy protection measures, (5) described risks of participation, (6) subject re-contacting, and (7) protocol for withdrawal. There is a high degree of similarity in how cancer researchers engaged in WGS are protecting participant privacy. We observed a strong trend towards both using samples for additional, unspecified research and sharing data with other investigators. IC forms varied in how they discussed re-contacting participants, returning results and facilitating participant withdrawal. Contrary to expectation, no consistent trends emerged over the eight-year period from which forms were collected. Examining IC forms from WGS studies elucidates how investigators are handling ELS challenges posed by this research. This information is important for ensuring that while the public benefits of research are maximized, the rights of participants are also being appropriately respected.
Insights into Conifer Giga-Genomes
De La Torre, Amanda R.; Birol, Inanc; Bousquet, Jean; Ingvarsson, Pär K.; Jansson, Stefan; Jones, Steven J.M.; Keeling, Christopher I.; MacKay, John; Nilsson, Ove; Ritland, Kermit; Street, Nathaniel; Yanchuk, Alvin; Zerbe, Philipp; Bohlmann, Jörg
2014-01-01
Insights from sequenced genomes of major land plant lineages have advanced research in almost every aspect of plant biology. Until recently, however, assembled genome sequences of gymnosperms have been missing from this picture. Conifers of the pine family (Pinaceae) are a group of gymnosperms that dominate large parts of the world’s forests. Despite their ecological and economic importance, conifers seemed long out of reach for complete genome sequencing, due in part to their enormous genome size (20–30 Gb) and the highly repetitive nature of their genomes. Technological advances in genome sequencing and assembly enabled the recent publication of three conifer genomes: white spruce (Picea glauca), Norway spruce (Picea abies), and loblolly pine (Pinus taeda). These genome sequences revealed distinctive features compared with other plant genomes and may represent a window into the past of seed plant genomes. This Update highlights recent advances, remaining challenges, and opportunities in light of the publication of the first conifer and gymnosperm genomes. PMID:25349325
Insights into conifer giga-genomes.
De La Torre, Amanda R; Birol, Inanc; Bousquet, Jean; Ingvarsson, Pär K; Jansson, Stefan; Jones, Steven J M; Keeling, Christopher I; MacKay, John; Nilsson, Ove; Ritland, Kermit; Street, Nathaniel; Yanchuk, Alvin; Zerbe, Philipp; Bohlmann, Jörg
2014-12-01
Insights from sequenced genomes of major land plant lineages have advanced research in almost every aspect of plant biology. Until recently, however, assembled genome sequences of gymnosperms have been missing from this picture. Conifers of the pine family (Pinaceae) are a group of gymnosperms that dominate large parts of the world's forests. Despite their ecological and economic importance, conifers seemed long out of reach for complete genome sequencing, due in part to their enormous genome size (20-30 Gb) and the highly repetitive nature of their genomes. Technological advances in genome sequencing and assembly enabled the recent publication of three conifer genomes: white spruce (Picea glauca), Norway spruce (Picea abies), and loblolly pine (Pinus taeda). These genome sequences revealed distinctive features compared with other plant genomes and may represent a window into the past of seed plant genomes. This Update highlights recent advances, remaining challenges, and opportunities in light of the publication of the first conifer and gymnosperm genomes. © 2014 American Society of Plant Biologists. All Rights Reserved.
Mobile elements reveal small population size in the ancient ancestors of Homo sapiens.
Huff, Chad D; Xing, Jinchuan; Rogers, Alan R; Witherspoon, David; Jorde, Lynn B
2010-02-02
The genealogies of different genetic loci vary in depth. The deeper the genealogy, the greater the chance that it will include a rare event, such as the insertion of a mobile element. Therefore, the genealogy of a region that contains a mobile element is on average older than that of the rest of the genome. In a simple demographic model, the expected time to most recent common ancestor (TMRCA) is doubled if a rare insertion is present. We test this expectation by examining single nucleotide polymorphisms around polymorphic Alu insertions from two completely sequenced human genomes. The estimated TMRCA for regions containing a polymorphic insertion is two times larger than the genomic average (P << 10^-30), as predicted. Because genealogies that contain polymorphic mobile elements are old, they are shaped largely by the forces of ancient population history and are insensitive to recent demographic events, such as bottlenecks and expansions. Remarkably, the information in just two human DNA sequences provides substantial information about ancient human population size. By comparing the likelihood of various demographic models, we estimate that the effective population size of human ancestors living before 1.2 million years ago was 18,500, and we can reject all models where the ancient effective population size was larger than 26,000. This result implies an unusually small population for a species spread across the entire Old World, particularly in light of the effective population sizes of chimpanzees (21,000) and gorillas (25,000), which each inhabit only one part of a single continent.
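The doubling of the expected TMRCA stated in the abstract above can be illustrated with a small simulation. This is only a sketch under assumptions that the abstract does not spell out: a pair of sequences, a constant population size (so that coalescence times are exponentially distributed), and a rare insertion that lands on a genealogy with probability proportional to its depth. The function name and parameters are hypothetical.

```python
import random


def mean_tmrca(n_draws=200_000, seed=0):
    """Compare the average TMRCA overall with the average among
    genealogies that carry a rare insertion (illustrative model only)."""
    rng = random.Random(seed)
    # Assumed model: pairwise coalescence times are exponential,
    # measured here in units of their own mean.
    times = [rng.expovariate(1.0) for _ in range(n_draws)]
    overall = sum(times) / n_draws
    # A rare insertion is more likely on a deeper genealogy, so the
    # genealogies that carry one are a depth-weighted sample:
    # their mean depth is sum(t * t) / sum(t).
    with_insertion = sum(t * t for t in times) / sum(times)
    return overall, with_insertion


overall, with_insertion = mean_tmrca()
print(f"mean TMRCA, all loci:            {overall:.3f}")
print(f"mean TMRCA, loci with insertion: {with_insertion:.3f}")
print(f"ratio:                           {with_insertion / overall:.3f}")
```

Under these assumptions the ratio should come out close to two, because depth-weighting an exponential distribution doubles its mean. The study itself may rest on a more detailed model.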
Draft genome sequence of an aflatoxigenic Aspergillus species, A. bombycis
USDA-ARS's Scientific Manuscript database
The genome of the A. bombycis Type strain was sequenced using a Personal Genome Machine, followed by annotation of its predicted genes. The genome size for A. bombycis was found to be approximately 37 Mb and contained 12,266 genes. This announcement introduces a sequenced genome for an aflatoxigenic...
Sato, Kengo; Kuroki, Yoko; Kumita, Wakako; Fujiyama, Asao; Toyoda, Atsushi; Kawai, Jun; Iriki, Atsushi; Sasaki, Erika; Okano, Hideyuki; Sakakibara, Yasubumi
2015-11-20
The first draft of the common marmoset (Callithrix jacchus) genome was published by the Marmoset Genome Sequencing and Analysis Consortium (MGSAC). The draft was based on whole-genome shotgun sequencing, and the current assembly version is Callithrix_jacches-3.2.1, but the draft genome still contains 187,214 undetermined gap regions, as well as supercontigs and relatively short contigs that are unmapped to chromosomes. We performed resequencing and assembly of the genome of common marmoset by deep sequencing with high-throughput sequencing technology. Several different sequence runs using Illumina sequencing platforms were executed, and 181 Gbp of high-quality bases including mate-pairs with long insert lengths of 3, 8, 20, and 40 Kbp were obtained, that is, approximately 60× coverage. The resequencing significantly improved the MGSAC draft genome sequence. The N50 of the contigs, which is a statistical measure used to evaluate assembly quality, doubled. As a result, 51% of the contigs (total length: 299 Mbp) that were unmapped to chromosomes in the MGSAC draft were merged with chromosomal contigs. The improved genome sequence helped to detect 5,288 new genes that are homologous to human cDNAs, and the gaps in 5,187 transcripts of the Ensembl gene annotations were completely filled.
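The two assembly statistics used in the abstract above, contig N50 and fold coverage, can be sketched as follows. This is a minimal illustration and not the authors' pipeline. The function names and the toy contig lengths are hypothetical, and the 3 Gbp genome size is an assumption implied by the reported figures (181 Gbp at roughly 60×) rather than a value stated in the abstract.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths: the largest length L
    such that contigs of length >= L together contain at least half of
    all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Hypothetical toy assembly with 300 bases in total.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", n50(toy_contigs))

# 181 Gbp of reads over an assumed ~3 Gbp genome.
print("coverage:", round(fold_coverage(181e9, 3.0e9)))
```

A larger N50 means that more of the assembly sits in long contigs, which is why a doubling of N50 is reported as an improvement.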
Chapple, Stephanie N. J.; Sarovich, Derek S.; Holden, Matthew T. G.; Peacock, Sharon J.; Buller, Nicky; Golledge, Clayton; Mayo, Mark; Currie, Bart J.
2016-01-01
Melioidosis, caused by the highly recombinogenic bacterium Burkholderia pseudomallei, is a disease with high mortality. Tracing the origin of melioidosis outbreaks and understanding how the bacterium spreads and persists in the environment are essential to protecting public and veterinary health and reducing mortality associated with outbreaks. We used whole-genome sequencing to compare isolates from a historical quarter-century outbreak that occurred between 1966 and 1991 in the Avon Valley, Western Australia, a region far outside the known range of B. pseudomallei endemicity. All Avon Valley outbreak isolates shared the same multilocus sequence type (ST-284), which has not been identified outside this region. We found substantial genetic diversity among isolates based on a comparison of genome-wide variants, with no clear correlation between genotypes and temporal, geographical or source data. We observed little evidence of recombination in the outbreak strains, indicating that genetic diversity among these isolates has primarily accrued by mutation. Phylogenomic analysis demonstrated that the isolates confidently grouped within the Australian B. pseudomallei clade, thereby ruling out introduction from a melioidosis-endemic region outside Australia. Collectively, our results point to B. pseudomallei ST-284 being present in the Avon Valley for longer than previously recognized, with its persistence and genomic diversity suggesting long-term, low-prevalence endemicity in this temperate region. Our findings provide a concerning demonstration of the potential for environmental persistence of B. pseudomallei far outside the conventional endemic regions. An expected increase in extreme weather events may reactivate latent B. pseudomallei populations in this region. PMID:28348862
Investigation of terpene diversification across multiple sequenced plant genomes
Boutanaev, Alexander M.; Moses, Tessa; Zi, Jiachen; Nelson, David R.; Mugford, Sam T.; Peters, Reuben J.; Osbourn, Anne
2015-01-01
Plants produce an array of specialized metabolites, including chemicals that are important as medicines, flavors, fragrances, pigments and insecticides. The vast majority of this metabolic diversity is untapped. Here we take a systematic approach toward dissecting genetic components of plant specialized metabolism. Focusing on the terpenes, the largest class of plant natural products, we investigate the basis of terpene diversity through analysis of multiple sequenced plant genomes. The primary drivers of terpene diversification are terpenoid synthase (TS) “signature” enzymes (which generate scaffold diversity), and cytochromes P450 (CYPs), which modify and further diversify these scaffolds, so paving the way for further downstream modifications. Our systematic search of sequenced plant genomes for all TS and CYP genes reveals that distinct TS/CYP gene pairs are found together far more commonly than would be expected by chance, and that certain TS/CYP pairings predominate, providing signals for key events that are likely to have shaped terpene diversity. We recover TS/CYP gene pairs for previously characterized terpene metabolic gene clusters and demonstrate new functional pairing of TSs and CYPs within previously uncharacterized clusters. Unexpectedly, we find evidence for different mechanisms of pathway assembly in eudicots and monocots; in the former, microsyntenic blocks of TS/CYP gene pairs duplicate and provide templates for the evolution of new pathways, whereas in the latter, new pathways arise by mixing and matching of individual TS and CYP genes through dynamic genome rearrangements. This is, to our knowledge, the first documented observation of the unique pattern of TS and CYP assembly in eudicots and monocots. PMID:25502595
Lee, Patrick K H; Men, Yujie; Wang, Shanquan; He, Jianzhong; Alvarez-Cohen, Lisa
2015-02-03
Dehalococcoides mccartyi are functionally important bacteria that catalyze the reductive dechlorination of chlorinated ethenes. However, these anaerobic bacteria are fastidious to isolate, making downstream genomic characterization challenging. In order to facilitate genomic analysis, a fluorescence-activated cell sorting (FACS) method was developed in this study to separate D. mccartyi cells from a microbial community, and the DNA of the isolated cells was processed by whole genome amplification (WGA) and hybridized onto a D. mccartyi microarray for comparative genomics against four sequenced strains. First, FACS was successfully applied to a D. mccartyi isolate as positive control, and then microarray results verified that WGA from 10^6 cells or ∼1 ng of genomic DNA yielded high-quality coverage detecting nearly all genes across the genome. As expected, some inter- and intrasample variability in WGA was observed, but these biases were minimized by performing multiple parallel amplifications. Subsequent application of the FACS and WGA protocols to two enrichment cultures containing ∼10% and ∼1% D. mccartyi cells successfully enabled genomic analysis. As proof of concept, this study demonstrates that coupling FACS with WGA and microarrays is a promising tool to expedite genomic characterization of target strains in environmental communities where the relative concentrations are low.
Bowers, Robert M.; Kyrpides, Nikos C.; Stepanauskas, Ramunas; ...
2017-08-08
Here, we present two standards developed by the Genomic Standards Consortium (GSC) for reporting bacterial and archaeal genome sequences. Both are extensions of the Minimum Information about Any (x) Sequence (MIxS). The standards are the Minimum Information about a Single Amplified Genome (MISAG) and the Minimum Information about a Metagenome-Assembled Genome (MIMAG), including, but not limited to, assembly quality, and estimates of genome completeness and contamination. These standards can be used in combination with other GSC checklists, including the Minimum Information about a Genome Sequence (MIGS), Minimum Information about a Metagenomic Sequence (MIMS), and Minimum Information about a Marker Gene Sequence (MIMARKS). Community-wide adoption of MISAG and MIMAG will facilitate more robust comparative genomic analyses of bacterial and archaeal diversity.
Analysis of expressed sequence tags generated from full-length enriched cDNA libraries of melon
2011-01-01
Background Melon (Cucumis melo), an economically important vegetable crop, belongs to the Cucurbitaceae family which includes several other important crops such as watermelon, cucumber, and pumpkin. It has served as a model system for sex determination and vascular biology studies. However, genomic resources currently available for melon are limited. Result We constructed eleven full-length enriched and four standard cDNA libraries from fruits, flowers, leaves, roots, cotyledons, and calluses of four different melon genotypes, and generated 71,577 and 22,179 ESTs from full-length enriched and standard cDNA libraries, respectively. These ESTs, together with ~35,000 ESTs available in public domains, were assembled into 24,444 unigenes, which were extensively annotated by comparing their sequences to different protein and functional domain databases, assigning them Gene Ontology (GO) terms, and mapping them onto metabolic pathways. Comparative analysis of melon unigenes and other plant genomes revealed that 75% to 85% of melon unigenes had homologs in other dicot plants, while approximately 70% had homologs in monocot plants. The analysis also identified 6,972 gene families that were conserved across dicot and monocot plants, and 181, 1,192, and 220 gene families specific to fleshy fruit-bearing plants, the Cucurbitaceae family, and melon, respectively. Digital expression analysis identified a total of 175 tissue-specific genes, which provides a valuable gene sequence resource for future genomics and functional studies. Furthermore, we identified 4,068 simple sequence repeats (SSRs) and 3,073 single nucleotide polymorphisms (SNPs) in the melon EST collection. Finally, we obtained a total of 1,382 melon full-length transcripts through the analysis of full-length enriched cDNA clones that were sequenced from both ends. 
Analysis of these full-length transcripts indicated that sizes of melon 5' and 3' UTRs were similar to those of tomato, but longer than many other dicot plants. Codon usages of melon full-length transcripts were largely similar to those of Arabidopsis coding sequences. Conclusion The collection of melon ESTs generated from full-length enriched and standard cDNA libraries is expected to play significant roles in annotating the melon genome. The ESTs and associated analysis results will be useful resources for gene discovery, functional analysis, marker-assisted breeding of melon and closely related species, comparative genomic studies and for gaining insights into gene expression patterns. PMID:21599934
Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data
Lemoine, Frédéric; Lespinet, Olivier; Labedan, Bernard
2007-01-01
Background Comparison of completely sequenced microbial genomes has revealed how fluid these genomes are. Detecting synteny blocks requires reliable methods for determining the orthologs among the whole set of homologs detected by exhaustive comparisons between each pair of completely sequenced genomes. This is a complex and difficult problem in the field of comparative genomics but will help to better understand the way prokaryotic genomes are evolving. Results We have developed a suite of programs that automate three essential steps to study conservation of gene order, and validated them with a set of 107 bacteria and archaea that cover the majority of the prokaryotic taxonomic space. We identified the whole set of shared homologs between two or more species and computed the evolutionary distance separating each pair of homologs. We applied two strategies to extract from the set of homologs a collection of valid orthologs shared by at least two genomes. The first computes the Reciprocal Smallest Distance (RSD) using the PAM distances separating pairs of homologs. The second method groups homologs in families and reconstructs each family's evolutionary tree, distinguishing bona fide orthologs as well as paralogs created after the last speciation event. Although the phylogenetic tree method often succeeds where RSD fails, the reverse could occasionally be true. Accordingly, we used the data obtained with either method, or their intersection, to count the orthologs that are adjacent in each pair of genomes, the Positional Orthologous Genes (POGs), and to further study their properties. Once all these synteny blocks had been detected, we showed that POGs are subject to more evolutionary constraints than orthologs outside synteny groups, whatever the taxonomic distance separating the compared organisms. 
Conclusion The suite of programs described in this paper allows reliable detection of orthologs and is useful for evaluating gene order conservation in prokaryotes whatever their taxonomic distance. Thus, our approach will facilitate the rapid identification of POGs in the next few years, as we expect to be inundated with thousands of completely sequenced microbial genomes. PMID:18047665
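The reciprocal-smallest-distance idea in the abstract above can be illustrated with a small sketch. This is not the authors' program suite: the function name, the gene labels, and the toy distance values are all hypothetical, and the distances are assumed to have been computed already (the abstract uses PAM distances). A pair is kept only when each gene is the other's closest homolog in the opposite genome.

```python
from typing import Dict, Tuple


def reciprocal_smallest_distance(
    dist: Dict[Tuple[str, str], float]
) -> Dict[str, str]:
    """Return pairs (a -> b) where b is a's closest homolog in genome B
    and a is b's closest homolog in genome A.

    `dist` maps (gene_in_A, gene_in_B) to a precomputed evolutionary
    distance; how that distance is obtained is outside this sketch.
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), d in dist.items():
        # Track the smallest distance seen so far in each direction.
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep only the pairs that are each other's best hit.
    return {a: b for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a}


# Hypothetical toy distances between three genes of genome A and two of genome B.
toy_distances = {
    ("A1", "B1"): 12.0,
    ("A1", "B2"): 40.0,
    ("A2", "B1"): 35.0,
    ("A2", "B2"): 18.0,
    ("A3", "B2"): 25.0,
}
orthologs = reciprocal_smallest_distance(toy_distances)
print(orthologs)
```

In this toy input, A3 is closest to B2, but B2 is closer to A2, so A3 is left without a putative ortholog. That is the kind of case where, according to the abstract, a tree-based method may do better.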
Windsor, Aaron J.; Schranz, M. Eric; Formanová, Nataša; Gebauer-Jung, Steffi; Bishop, John G.; Schnabelrauch, Domenica; Kroymann, Juergen; Mitchell-Olds, Thomas
2006-01-01
Comparative genomics provides insight into the evolutionary dynamics that shape discrete sequences as well as whole genomes. To advance comparative genomics within the Brassicaceae, we have end sequenced 23,136 medium-sized insert clones from Boechera stricta, a wild relative of Arabidopsis (Arabidopsis thaliana). A significant proportion of these sequences, 18,797, are nonredundant and display highly significant similarity (BLASTn e-value ≤ 10^−30) to low copy number Arabidopsis genomic regions, including more than 9,000 annotated coding sequences. We have used this dataset to identify orthologous gene pairs in the two species and to perform a global comparison of DNA regions 5′ to annotated coding regions. On average, the 500 nucleotides upstream to coding sequences display 71.4% identity between the two species. In a similar analysis, 61.4% identity was observed between 5′ noncoding sequences of Brassica oleracea and Arabidopsis, indicating that regulatory regions are not as diverged among these lineages as previously anticipated. By mapping the B. stricta end sequences onto the Arabidopsis genome, we have identified nearly 2,000 conserved blocks of microsynteny (bracketing 26% of the Arabidopsis genome). A comparison of fully sequenced B. stricta inserts to their homologous Arabidopsis genomic regions indicates that indel polymorphisms >5 kb contribute substantially to the genome size difference observed between the two species. Further, we demonstrate that microsynteny inferred from end-sequence data can be applied to the rapid identification and cloning of genomic regions of interest from nonmodel species. These results suggest that among diploid relatives of Arabidopsis, small- to medium-scale shotgun sequencing approaches can provide rapid and cost-effective benefits to evolutionary and/or functional comparative genomic frameworks. PMID:16607030
Zill, Oliver A; Banks, Kimberly C; Fairclough, Stephen R; Mortimer, Stefanie; Vowles, James V; Mokhtari, Reza; Gandara, David R; Mack, Philip C; Odegaard, Justin I; Nagy, Rebecca J; Baca, Arthur M; Eltoukhy, Helmy; Chudova, Darya I; Lanman, Richard B; Talasaz, AmirAli
2018-05-18
Cell-free DNA (cfDNA) sequencing provides a non-invasive method for obtaining actionable genomic information to guide personalized cancer treatment, but the presence of multiple alterations in circulation related to treatment and tumor heterogeneity complicates the interpretation of the observed variants. Experimental Design: We describe the somatic mutation landscape of 70 cancer genes from cfDNA deep-sequencing analysis of 21,807 patients with treated, late-stage cancers across >50 cancer types. To facilitate interpretation of the genomic complexity of circulating tumor DNA in advanced, treated cancer patients, we developed methods to identify cfDNA copy-number driver alterations and cfDNA clonality. Patterns and prevalence of cfDNA alterations in major driver genes for non-small cell lung, breast, and colorectal cancer largely recapitulated those from tumor tissue sequencing compendia (TCGA and COSMIC; r=0.90-0.99), with the principal differences in alteration prevalence being due to patient treatment. This highly sensitive cfDNA sequencing assay revealed numerous subclonal tumor-derived alterations, expected as a result of clonal evolution, but leading to an apparent departure from mutual exclusivity in treatment-naïve tumors. Upon applying novel cfDNA clonality and copy-number driver identification methods, robust mutual exclusivity was observed among predicted truncal driver cfDNA alterations (FDR = 5×10^−7 for EGFR and ERBB2), in effect distinguishing tumor-initiating alterations from secondary alterations. Treatment-associated resistance, including both novel alterations and parallel evolution, was common in the cfDNA cohort and was enriched in patients with targetable driver alterations (>18.6% patients). Together these retrospective analyses of a large cfDNA sequencing data set reveal subclonal structures and emerging resistance in advanced solid tumors. Copyright ©2018, American Association for Cancer Research.
Song, B; Hou, Y L; Ding, X; Wang, T; Wang, F; Zhong, J C; Xu, T; Zhong, J; Hou, W R; Shuai, S R
2014-02-20
Fatty acid binding proteins (FABPs) are a family of small, highly conserved cytoplasmic proteins that bind long-chain fatty acids and other hydrophobic ligands. In this study, cDNA and genomic sequences of FABP4 and FABP5 were cloned successfully from the giant panda (Ailuropoda melanoleuca) using reverse transcription polymerase chain reaction (RT-PCR) technology and touchdown-PCR. The cDNAs of FABP4 and FABP5 cloned from the giant panda were 400 and 413 bp in length, containing an open reading frame of 399 and 408 bp, encoding 132 and 135 amino acids, respectively. The genomic sequences of FABP4 and FABP5 were 3976 and 3962 bp, respectively, which each contained four exons and three introns. Sequence alignment indicated a high degree of homology with reported FABP sequences of other mammals at both the amino acid and DNA levels. Topology prediction revealed seven protein kinase C phosphorylation sites, two casein kinase II phosphorylation sites, two N-myristoylation sites, and one cytosolic fatty acid-binding protein signature in the FABP4 protein, and three N-glycosylation sites, three protein kinase C phosphorylation sites, one casein kinase II phosphorylation site, one N-myristoylation site, one amidation site, and one cytosolic fatty acid-binding protein signature in the FABP5 protein. The FABP4 and FABP5 genes were overexpressed in Escherichia coli BL21 and they produced the expected 16.8- and 17.0-kDa polypeptides. The results obtained in this study provide information for further in-depth research of this system, which is of both theoretical and practical significance.
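As a quick consistency check on the ORF lengths and protein sizes quoted above, the arithmetic can be sketched as follows. The sketch assumes that the reported ORF length includes the terminal stop codon, which the abstract does not state explicitly; the function name is illustrative only.

```python
def encoded_residues(orf_bp: int) -> int:
    """Number of amino acids encoded by an ORF of `orf_bp` base pairs.

    Assumption: the ORF length includes one stop codon, so the protein
    has one residue fewer than the number of codons.
    """
    if orf_bp < 3 or orf_bp % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    return orf_bp // 3 - 1


# ORF lengths as reported in the abstract.
for gene, orf_bp in {"FABP4": 399, "FABP5": 408}.items():
    print(gene, orf_bp, "bp ->", encoded_residues(orf_bp), "aa")
```

Under that assumption, both results agree with the 132 and 135 amino acids reported.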
Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay
2013-01-01
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different-sized genomes using graph-based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
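To make the depth figures above concrete, the usual relation depth = (number of reads × read length) / genome size can be rearranged to estimate how many reads a target depth implies. The sketch below is illustrative: the 100 bp read length is an assumption (the abstract does not give one), and the genome sizes are the rounded values quoted above.

```python
import math


def reads_for_depth(genome_bp: int, depth: float, read_len: int) -> int:
    """Reads needed so that reads * read_len / genome_bp >= depth."""
    if genome_bp <= 0 or read_len <= 0 or depth < 0:
        raise ValueError("genome_bp and read_len must be positive, depth non-negative")
    return math.ceil(depth * genome_bp / read_len)


# Genome sizes from the abstract, in base pairs.
genomes = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}
READ_LEN = 100  # assumed read length in bp, not taken from the abstract

for name, size in genomes.items():
    print(name, reads_for_depth(size, 50, READ_LEN), "reads for 50X")
```

The same function can be used when planning multiplexing: dividing a run's expected read yield by the per-sample read requirement gives a rough upper bound on samples per run.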
Large-scale contamination of microbial isolate genomes by Illumina PhiX control.
Mukherjee, Supratim; Huntemann, Marcel; Ivanova, Natalia; Kyrpides, Nikos C; Pati, Amrita
2015-01-01
With the rapid growth and development of sequencing technologies, genomes have become the new go-to for exploring solutions to some of the world's biggest challenges such as searching for alternative energy sources and exploration of genomic dark matter. However, progress in sequencing has been accompanied by its share of errors that can occur during template or library preparation, sequencing, imaging or data analysis. In this study we screened over 18,000 publicly available microbial isolate genome sequences in the Integrated Microbial Genomes database and identified more than 1000 genomes that are contaminated with PhiX, a control frequently used during Illumina sequencing runs. Approximately 10% of these genomes have been published in literature and 129 contaminated genomes were sequenced under the Human Microbiome Project. Raw sequence reads are prone to contamination from various sources and are usually eliminated during downstream quality control steps. Detection of PhiX contaminated genomes indicates a lapse in either the application or effectiveness of proper quality control measures. The presence of PhiX contamination in several publicly available isolate genomes can result in additional errors when such data are used in comparative genomics analyses. Such contamination of public databases has far-reaching consequences in the form of erroneous data interpretation and analyses, and necessitates better measures to proofread raw sequences before releasing them to the broader scientific community.
Wang, Xumin; Deng, Xin; Zhang, Xiaowei; Hu, Songnian; Yu, Jun
2012-01-01
The complete nucleotide sequences of the chloroplast (cp) and mitochondrial (mt) genomes of the resurrection plant Boea hygrometrica (Bh, Gesneriaceae) have been determined with the lengths of 153,493 bp and 510,519 bp, respectively. The smaller chloroplast genome contains more genes (147) with a 72% coding sequence, and the larger mitochondrial genome has fewer genes (65) with a coding fraction of 12%. Similar to other seed plants, the Bh cp genome has a typical quadripartite organization with a conserved gene in each region. The Bh mt genome has three recombinant sequence repeats of 222 bp, 843 bp, and 1474 bp in length, which divide the genome into a single master circle (MC) and four isomeric molecules. Compared to other angiosperms, one remarkable feature of the Bh mt genome is the frequent transfer of genetic material from the cp genome during recent Bh evolution. We also analyzed organellar genome evolution in general regarding genome features as well as compositional dynamics of sequence and gene structure/organization, providing clues for the understanding of the evolution of organellar genomes in plants. The cp-derived sequences including tRNAs found in angiosperm mt genomes support the conclusion that frequent gene transfer events may have begun early in the land plant lineage. PMID:22291979
Dessimoz, Christophe; Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro
2011-09-01
Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references.
Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro
2011-01-01
Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references. PMID:21712341
Kantarski, Traci; Larson, Steve; Zhang, Xiaofei; DeHaan, Lee; Borevitz, Justin; Anderson, James; Poland, Jesse
2017-01-01
Development of the first consensus genetic map of intermediate wheatgrass gives insight into the genome and tools for molecular breeding. Intermediate wheatgrass (Thinopyrum intermedium) has been identified as a candidate for domestication and improvement as a perennial grain, forage, and biofuel crop and is actively being improved by several breeding programs. To accelerate this process using genomics-assisted breeding, efficient genotyping methods and genetic marker reference maps are needed. We present here the first consensus genetic map for intermediate wheatgrass (IWG), which confirms the species' allohexaploid nature (2n = 6x = 42) and homology to Triticeae genomes. Genotyping-by-sequencing was used to identify markers that fit expected segregation ratios and construct genetic maps for 13 heterogeneous parents of seven full-sib families. These maps were then integrated using a linear programming method to produce a consensus map with 21 linkage groups containing 10,029 markers, 3601 of which were present in at least two populations. Each of the 21 linkage groups contained between 237 and 683 markers, cumulatively covering 5061 cM (2891 cM--Kosambi) with an average distance of 0.5 cM between each pair of markers. Through mapping the sequence tags to the diploid (2n = 2x = 14) barley reference genome, we observed high colinearity and synteny between these genomes, with three homoeologous IWG chromosomes corresponding to each of the seven barley chromosomes, and mapped translocations that are known in the Triticeae. The consensus map is a valuable tool for wheat breeders to map important disease-resistance genes within intermediate wheatgrass. These genomic tools can help lead to rapid improvement of IWG and development of high-yielding cultivars of this perennial grain that would facilitate the sustainable intensification of agricultural systems.
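The abstract above reports one map length labeled Kosambi. As background, the Kosambi mapping function converts a recombination fraction r into a map distance, d = 25 ln((1 + 2r) / (1 − 2r)) in centimorgans. The sketch below shows that conversion and its inverse. It is a generic illustration with hypothetical function names, not a reconstruction of how the consensus map was built.

```python
import math


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm: float) -> float:
    """Inverse mapping: recombination fraction for a Kosambi distance in cM."""
    if d_cm < 0.0:
        raise ValueError("map distance must be non-negative")
    return 0.5 * math.tanh(d_cm / 50.0)


for r in (0.01, 0.10, 0.30):
    d = kosambi_cm(r)
    print(f"r = {r:.2f} -> {d:.2f} cM -> r = {kosambi_r(d):.2f}")
```

For small r the distance in cM is close to 100·r, and it grows faster as r approaches 0.5, which is why summed map lengths depend on the mapping function used.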
Evolutionary paths of streptococcal and staphylococcal superantigens
2012-01-01
Background Streptococcus pyogenes (GAS) harbors several superantigens (SAgs) in the prophage region of its genome, although speG and smez are not located in this region. The diversity of SAgs is thought to arise during horizontal transfer, but their evolutionary pathways have not yet been determined. We recently completed sequencing the entire genome of S. dysgalactiae subsp. equisimilis (SDSE), the closest relative of GAS. Although speG is the only SAg gene of SDSE, speG was present in only 50% of clinical SDSE strains and smez in none. In this study, we analyzed the evolutionary paths of streptococcal and staphylococcal SAgs. Results We compared the sequences of the 12–60 kb speG regions of nine SDSE strains, five speG+ and four speG–. We found that the synteny of this region was highly conserved, whether or not the speG gene was present. Synteny analyses based on genome-wide comparisons of GAS and SDSE indicated that speG is the direct descendant of a common ancestor of streptococcal SAgs, whereas smez was deleted from SDSE after SDSE and GAS split from a common ancestor. Cumulative nucleotide skew analysis of SDSE genomes suggested that speG was located outside segments of steeper slopes than the stable region in the genome, whereas the region flanking smez was unstable, as expected from the results of GAS. We also detected a previously undescribed staphylococcal SAg gene, selW, and a staphylococcal SAg-like gene, ssl, in the core genomes of all Staphylococcus aureus strains sequenced. Amino acid substitution analyses, based on dN/dS window analysis of the products encoded by speG, selW and ssl suggested that all three genes have been subjected to strong positive selection. Evolutionary analysis based on the Bayesian Markov chain Monte Carlo method showed that each clade included at least one direct descendant. Conclusions Our findings reveal a plausible model for the comprehensive evolutionary pathway of streptococcal and staphylococcal SAgs. PMID:22900646
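The cumulative nucleotide skew analysis mentioned above can be pictured with a minimal sketch. It assumes GC skew, (G − C) / (G + C), computed in fixed non-overlapping windows and summed along the sequence. The abstract does not specify which skew or window scheme was used, so the choice, the function name, and the toy sequence are all hypothetical.

```python
from typing import List


def cumulative_gc_skew(seq: str, window: int) -> List[float]:
    """Cumulative sum of per-window GC skew, (G - C) / (G + C).

    Windows are non-overlapping; a trailing partial window is ignored.
    Windows with no G or C contribute a skew of 0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    cumulative: List[float] = []
    total = 0.0
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        total += (g - c) / (g + c) if (g + c) else 0.0
        cumulative.append(total)
    return cumulative


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGA" * 3 + "CCCA" * 3
print(cumulative_gc_skew(toy, 4))
```

In a plot of such a curve, changes in slope mark regions whose strand composition differs from their surroundings, which is the kind of signal the abstract refers to when it contrasts steeper segments with the stable region.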
Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.
Staats, Martijn; Erkens, Roy H J; van de Vossenberg, Bart; Wieringa, Jan J; Kraaijeveld, Ken; Stielow, Benjamin; Geml, József; Richardson, James E; Bakker, Freek T
2013-01-01
Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2% and 71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contaminations were observed in 2 of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but will at least generate vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses, that become increasingly more horizontal. 
Furthermore, NGS of historical DNA enables recovering crucial genetic information from old type specimens that to date have remained mostly unutilized and, thus, opens up a new frontier for taxonomic research as well.
Draft Sequences of the Radish (Raphanus sativus L.) Genome
Kitashiba, Hiroyasu; Li, Feng; Hirakawa, Hideki; Kawanabe, Takahiro; Zou, Zhongwei; Hasegawa, Yoichi; Tonosaki, Kaoru; Shirasawa, Sachiko; Fukushima, Aki; Yokoi, Shuji; Takahata, Yoshihito; Kakizaki, Tomohiro; Ishida, Masahiko; Okamoto, Shunsuke; Sakamoto, Koji; Shirasawa, Kenta; Tabata, Satoshi; Nishio, Takeshi
2014-01-01
Radish (Raphanus sativus L., n = 9) is one of the major vegetables in Asia. Since the genomes of Brassica and related species including radish underwent genome rearrangement, it is quite difficult to perform functional analysis based on the reported genomic sequence of Brassica rapa. Therefore, we performed genome sequencing of radish. Short reads of genomic sequences of 191.1 Gb were obtained by next-generation sequencing (NGS) for a radish inbred line, and 76,592 scaffolds of ≥300 bp were constructed along with the bacterial artificial chromosome-end sequences. Finally, the whole draft genomic sequence of 402 Mb spanning 75.9% of the estimated genomic size and containing 61,572 predicted genes was obtained. Subsequently, 221 single nucleotide polymorphism markers and 768 PCR-RFLP markers were used together with the 746 markers produced in our previous study for the construction of a linkage map. The map was combined further with another radish linkage map constructed mainly with expressed sequence tag-simple sequence repeat markers into a high-density integrated map of 1,166 cM with 2,553 DNA markers. A total of 1,345 scaffolds were assigned to the linkage map, spanning 116.0 Mb. Bulked PCR products amplified by 2,880 primer pairs were sequenced by NGS, and SNPs in eight inbred lines were identified. PMID:24848699
Detection of a divergent variant of grapevine virus F by next-generation sequencing.
Molenaar, Nicholas; Burger, Johan T; Maree, Hans J
2015-08-01
The complete genome sequence of a South African isolate of grapevine virus F (GVF) is presented. It was first detected by metagenomic next-generation sequencing of field samples and validated through direct Sanger sequencing. The genome sequence of GVF isolate V5 consists of 7539 nucleotides and contains a poly(A) tail. It has a typical vitivirus genome arrangement that comprises five open reading frames (ORFs), which share only 88.96 % nucleotide sequence identity with the existing complete GVF genome sequence (JX105428).
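The identity figure above is a pairwise nucleotide identity. A minimal way to compute such a value from two already-aligned sequences is sketched below. The abstract does not say how gaps were treated, so this sketch assumes that columns containing a gap are excluded from the denominator. The function name and the toy sequences are illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns, skipping columns with a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


print(percent_identity("ACGTACGT", "ACGTTCGT"))  # one mismatch in eight columns
print(percent_identity("AC-GT", "ACTGA"))        # one gapped column skipped
```

Other conventions (for example counting gaps as mismatches, or dividing by the shorter sequence length) give different percentages for the same alignment, so reported identities are only comparable when the convention is known.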
Dissection of the Octoploid Strawberry Genome by Deep Sequencing of the Genomes of Fragaria Species
Hirakawa, Hideki; Shirasawa, Kenta; Kosugi, Shunichi; Tashiro, Kosuke; Nakayama, Shinobu; Yamada, Manabu; Kohara, Mistuyo; Watanabe, Akiko; Kishida, Yoshie; Fujishiro, Tsunakazu; Tsuruoka, Hisano; Minami, Chiharu; Sasamoto, Shigemi; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yanagi, Tomohiro; Guoxin, Qin; Maeda, Fumi; Ishikawa, Masami; Kuhara, Satoru; Sato, Shusei; Tabata, Satoshi; Isobe, Sachiko N.
2014-01-01
Cultivated strawberry (Fragaria x ananassa) is octoploid and shows allogamous behaviour. The present study aims at dissecting this octoploid genome through comparison with its wild relatives, F. iinumae, F. nipponica, F. nubicola, and F. orientalis by de novo whole-genome sequencing on Illumina and Roche 454 platforms. The total length of the assembled Illumina genome sequences obtained was 698 Mb for F. x ananassa, and ∼200 Mb each for the four wild species. Subsequently, a virtual reference genome termed FANhybrid_r1.2 was constructed by integrating the sequences of the four homoeologous subgenomes of F. x ananassa, from which heterozygous regions in the Roche 454 and Illumina genome sequences were eliminated. The total length of FANhybrid_r1.2 thus created was 173.2 Mb with an N50 length of 5137 bp. The Illumina-assembled genome sequences of F. x ananassa and the four wild species were then mapped onto the reference genome, along with the previously published F. vesca genome sequence to establish the subgenomic structure of F. x ananassa. The strategy adopted in this study has turned out to be successful in dissecting the genome of octoploid F. x ananassa and appears promising when applied to the analysis of other polyploid plant species. PMID:24282021
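The N50 statistic quoted above summarizes assembly contiguity: it is the length L such that sequences of length at least L together contain at least half of the assembled bases. A minimal sketch of the usual computation follows. The contig lengths are made-up toy values, not data from the strawberry assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [100, 80, 50, 30, 20, 10]  # hypothetical contig lengths in bp
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness. A misassembled but long scaffold raises N50 just as a correct one does.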
The first genome sequence of a metatherian herpesvirus: Macropodid herpesvirus 1.
Vaz, Paola K; Mahony, Timothy J; Hartley, Carol A; Fowler, Elizabeth V; Ficorilli, Nino; Lee, Sang W; Gilkerson, James R; Browning, Glenn F; Devlin, Joanne M
2016-01-22
While many placental herpesvirus genomes have been fully sequenced, the complete genome of a marsupial herpesvirus has not been described. Here we present the first genome sequence of a metatherian herpesvirus, Macropodid herpesvirus 1 (MaHV-1). The MaHV-1 viral genome was sequenced using an Illumina MiSeq sequencer, de novo assembly was performed and the genome was annotated. The MaHV-1 genome was 140 kbp in length and clustered phylogenetically with the primate simplexviruses, sharing 67% nucleotide sequence identity with Human herpesviruses 1 and 2. The MaHV-1 genome contained 66 predicted open reading frames (ORFs) homologous to those in other herpesvirus genomes, but lacked homologues of UL3, UL4, UL56 and glycoprotein J. This is the first alphaherpesvirus genome that has been found to lack the UL3 and UL4 homologues. We identified six novel ORFs and confirmed their transcription by RT-PCR. This is the first genome sequence of a herpesvirus that infects metatherians, a taxonomically unique mammalian clade. Members of the Simplexvirus genus are remarkably conserved, so the absence of ORFs otherwise retained in eutherian and avian alphaherpesviruses contributes to our understanding of the Alphaherpesvirinae. Further study of metatherian herpesvirus genetics and pathogenesis provides a unique approach to understanding herpesvirus-mammalian interactions.
Hemipteran Mitochondrial Genomes: Features, Structures and Implications for Phylogeny
Wang, Yuan; Chen, Jing; Jiang, Li-Yun; Qiao, Ge-Xia
2015-01-01
The study of Hemipteran mitochondrial genomes (mitogenomes) began with the Chagas disease vector, Triatoma dimidiata, in 2001. At present, 90 complete Hemipteran mitogenomes have been sequenced and annotated. This review examines the history of Hemipteran mitogenome research and summarizes their main features, including genome organization, nucleotide composition, protein-coding genes, tRNAs and rRNAs, and non-coding regions. Special attention is given to the comparative analysis of repeat regions. Gene rearrangements are an additional data type for a few families, and most mitogenomes are arranged in the same order as the proposed ancestral insect. We also discuss and provide insights on the phylogenetic analyses of a variety of taxonomic levels. This review is expected to further expand our understanding of research in this field and serve as a valuable reference resource. PMID:26039239
Lu, You; Samac, Deborah A.; Glazebrook, Jane
2015-01-01
We report here the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1, isolated in Minnesota, USA. The R1-1 genome, generated by a de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies. PMID:25953184
Complete Genome Sequences of Two Vesicular Stomatitis Virus Isolates Collected in Mexico.
Velazquez-Salinas, Lauro; Isa, Pavel; Pauszek, Steven J; Rodriguez, Luis L
2017-09-14
We report two full-genome sequences of vesicular stomatitis New Jersey virus (VSNJV) obtained by Illumina next-generation sequencing of RNA isolated from epithelial suspensions of cattle naturally infected in Mexico. These genomes represent the first full-genome sequences of vesicular stomatitis New Jersey viruses circulating in Mexico deposited in the GenBank database.
Genome Sequences of Pseudomonas spp. Isolated from Cereal Crops
Stiller, Jiri; Covarelli, Lorenzo; Lindeberg, Magdalen; Shivas, Roger G.; Manners, John M.
2013-01-01
Compared to those of dicot-infecting bacteria, the available genome sequences of bacteria that infect wheat and barley are limited. Herein, we report the draft genome sequences of four pseudomonads originally isolated from these cereals. These genome sequences provide a useful resource for comparative analyses within the genus and for cross-kingdom analyses of plant pathogenesis. PMID:23661484
USDA-ARS's Scientific Manuscript database
A reassociation kinetics-based approach was used to reduce the complexity of genomic DNA from the Deutsch laboratory strain of the cattle tick, Rhipicephalus microplus, to facilitate genome sequencing. Selected genomic DNA (Cot value = 660) was sequenced using 454 GS FLX technology, resulting in 356...
Complete genome sequence of the Antarctic Halorubrum lacusprofundi type strain ACAM 34
Anderson, Iain J.; DasSarma, Priya; Lucas, Susan; ...
2016-09-10
Halorubrum lacusprofundi is an extreme halophile within the archaeal phylum Euryarchaeota. The type strain ACAM 34 was isolated from Deep Lake, Antarctica. H. lacusprofundi is of phylogenetic interest because it is distantly related to the haloarchaea that have previously been sequenced. It is also of interest because of its psychrotolerance. We report here the complete genome sequence of H. lacusprofundi type strain ACAM 34 and its annotation. In conclusion, this genome is part of a 2006 Joint Genome Institute Community Sequencing Program project to sequence genomes of diverse Archaea.
Complete genome sequence of the Antarctic Halorubrum lacusprofundi type strain ACAM 34
DOE Office of Scientific and Technical Information (OSTI.GOV)
Anderson, Iain J.; DasSarma, Priya; Lucas, Susan
Halorubrum lacusprofundi is an extreme halophile within the archaeal phylum Euryarchaeota. The type strain ACAM 34 was isolated from Deep Lake, Antarctica. H. lacusprofundi is of phylogenetic interest because it is distantly related to the haloarchaea that have previously been sequenced. It is also of interest because of its psychrotolerance. We report here the complete genome sequence of H. lacusprofundi type strain ACAM 34 and its annotation. In conclusion, this genome is part of a 2006 Joint Genome Institute Community Sequencing Program project to sequence genomes of diverse Archaea.
FlyBase: genes and gene models
Drysdale, Rachel A.; Crosby, Madeline A.
2005-01-01
FlyBase (http://flybase.org) is the primary repository of genetic and molecular data of the insect family Drosophilidae. For the most extensively studied species, Drosophila melanogaster, a wide range of data are presented in integrated formats. Data types include mutant phenotypes, molecular characterization of mutant alleles and aberrations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models and molecular classification of gene product functions. There is a growing body of data for other Drosophila species; this is expected to increase dramatically over the next year, with the completion of draft-quality genomic sequences of an additional 11 Drosophila species. PMID:15608223
USDA-ARS's Scientific Manuscript database
The size and repetitive nature of the Rhipicephalus microplus genome makes obtaining a full genome sequence difficult. Cot filtration/selection techniques were used to reduce the repetitive fraction of the tick genome and enrich for the fraction of DNA with gene-containing regions. The Cot-selected ...
USDA-ARS's Scientific Manuscript database
Genomic structural variations are an important source of genetic diversity. Copy number variations (CNVs), gains and losses of large regions of genomic sequence between individuals of a species, are known to be associated with both diseases and phenotypic traits. Deeply sequenced genomes are often u...
A new strategy for genome assembly using short sequence reads and reduced representation libraries.
Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H
2010-02-01
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need for a reference sequence. As proof of concept, we applied this approach to sequence and assemble the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the currently available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
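The idea of digesting the same genome independently with two restriction enzymes, so that fragments from one library span the cut sites of the other, can be illustrated with a small in silico digest. This is only a sketch under stated assumptions: the abstract does not name the enzymes, so EcoRI (G^AATTC) and BamHI (G^GATCC) are used purely as familiar examples, and the toy sequence and function name are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of a recognition site.

    cut_offset is the position inside the site where the strand is cut
    (for example 1 for G^AATTC). Returns the fragments in order.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


# Hypothetical toy sequence containing one site for each example enzyme
toy_genome = "AAGAATTCTTGGATCCAA"
library_a = digest(toy_genome, "GAATTC", 1)  # EcoRI, assumed for illustration
library_b = digest(toy_genome, "GGATCC", 1)  # BamHI, assumed for illustration
```

In this toy case the second fragment of `library_a` contains the BamHI site and the first fragment of `library_b` contains the EcoRI site, which is the overlap property that lets the two libraries be stitched back together.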
Mutations that Cause Human Disease: A Computational/Experimental Approach
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beernink, P; Barsky, D; Pesavento, B
International genome sequencing projects have produced billions of nucleotides (letters) of DNA sequence data, including the complete genome sequences of 74 organisms. These genome sequences have created many new scientific opportunities, including the ability to identify sequence variations among individuals within a species. These genetic differences, which are known as single nucleotide polymorphisms (SNPs), are particularly important in understanding the genetic basis for disease susceptibility. Since the report of the complete human genome sequence, over two million human SNPs have been identified, including a large-scale comparison of an entire chromosome from twenty individuals. Of the protein coding SNPs (cSNPs), approximately half lead to a single amino acid change in the encoded protein (non-synonymous coding SNPs). Most of these changes are functionally silent, while the remainder negatively impact the protein and sometimes cause human disease. To date, over 550 SNPs have been found to cause single locus (monogenic) diseases and many others have been associated with polygenic diseases. SNPs have been linked to specific human diseases, including late-onset Parkinson disease, autism, rheumatoid arthritis and cancer. The ability to predict accurately the effects of these SNPs on protein function would represent a major advance toward understanding these diseases. To date several attempts have been made toward predicting the effects of such mutations. The most successful of these is a computational approach called "Sorting Intolerant From Tolerant" (SIFT). This method uses sequence conservation among many similar proteins to predict which residues in a protein are functionally important. However, this method suffers from several limitations. First, a query sequence must have a sufficient number of relatives to infer sequence conservation. Second, this method does not make use of or provide any information on protein structure, which can be used to understand how an amino acid change affects the protein. The experimental methods that provide the most detailed structural information on proteins are X-ray crystallography and NMR spectroscopy. However, these methods are labor intensive and currently cannot be carried out on a genomic scale. Nonetheless, Structural Genomics projects are being pursued by more than a dozen groups and consortia worldwide and as a result the number of experimentally determined structures is rising exponentially. Based on the expectation that protein structures will continue to be determined at an ever-increasing rate, reliable structure prediction schemes will become increasingly valuable, leading to information on protein function and disease for many different proteins. Given known genetic variability and experimentally determined protein structures, can we accurately predict the effects of single amino acid substitutions? An objective assessment of this question would involve comparing predicted and experimentally determined structures, which thus far has not been rigorously performed. The completed research leveraged existing expertise at LLNL in computational and structural biology, as well as significant computing resources, to address this question.
Herrmann, Luise; Felbinger, Christine; Haase, Ilka; Rudolph, Barbara; Biermann, Bernhard; Fischer, Markus
2015-05-13
The cocoa type "Colección Castro Naranjal 51" (CCN-51) is known for its resistance to specific climate conditions and its high yield, but it shows a weaker flavor profile and therefore is marketed as bulk cocoa. In a previous study, the two cocoa types Arriba and CCN-51 could easily be distinguished, but differences among the CCN-51 samples were observed. This was unexpected, as CCN-51 is reported to be a clone. To confirm whether CCN-51 is a pure clone, 10 simple sequence repeats (SSR) located on the nuclear genome were used to analyze various CCN-51 samples in comparison to the cocoa varieties Arriba and Criollo. As expected, there are differences in the SSR pattern among CCN-51, Arriba, and Criollo, but variability within the CCN-51 sample set was detected as well. The previously described sequence variation in the chloroplast genome was confirmed by variability in the microsatellite loci of the nuclear genome for a comprehensive cultivar collection of CCN-51 of both bean and leaf samples. In summary, besides somaclonal variation, misidentification of plant collections and also sexual reproduction of CCN-51 can be suggested.
The genomic landscape shaped by selection on transposable elements across 18 mouse strains.
Nellåker, Christoffer; Keane, Thomas M; Yalcin, Binnaz; Wong, Kim; Agam, Avigail; Belgard, T Grant; Flint, Jonathan; Adams, David J; Frankel, Wayne N; Ponting, Chris P
2012-06-15
Transposable element (TE)-derived sequence dominates the landscape of mammalian genomes and can modulate gene function by dysregulating transcription and translation. Our current knowledge of TEs in laboratory mouse strains is limited primarily to those present in the C57BL/6J reference genome, with most mouse TEs being drawn from three distinct classes, namely short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs) and the endogenous retrovirus (ERV) superfamily. Despite their high prevalence, the different genomic and gene properties controlling whether TEs are preferentially purged from, or are retained by, genetic drift or positive selection in mammalian genomes remain poorly defined. Using whole genome sequencing data from 13 classical laboratory and 4 wild-derived mouse inbred strains, we developed a comprehensive catalogue of 103,798 polymorphic TE variants. We employ this extensive data set to characterize TE variants across the Mus lineage, and to infer neutral and selective processes that have acted over 2 million years. Our results indicate that the majority of TE variants are introduced through the male germline and that only a minority of TE variants exert detectable changes in gene expression. However, among genes with differential expression across the strains there are twice as many TE variants identified as being putative causal variants as expected. Most TE variants that cause gene expression changes appear to be purged rapidly by purifying selection. Our findings demonstrate that past TE insertions have often been highly deleterious, and help to prioritize TE variants according to their likely contribution to gene expression or phenotype variation.
A probabilistic model to predict clinical phenotypic traits from genome sequencing.
Chen, Yun-Ching; Douville, Christopher; Wang, Cheng; Niknafs, Noushin; Yeo, Grace; Beleva-Guthrie, Violeta; Carter, Hannah; Stenson, Peter D; Cooper, David N; Li, Biao; Mooney, Sean; Karchin, Rachel
2014-09-01
Genetic screening is becoming possible on an unprecedented scale. However, its utility remains controversial. Although most variant genotypes cannot be easily interpreted, many individuals nevertheless attempt to interpret their genetic information. Initiatives such as the Personal Genome Project (PGP) and Illumina's Understand Your Genome are sequencing thousands of adults, collecting phenotypic information and developing computational pipelines to identify the most important variant genotypes harbored by each individual. These pipelines consider database and allele frequency annotations and bioinformatics classifications. We propose that the next step will be to integrate these different sources of information to estimate the probability that a given individual has specific phenotypes of clinical interest. To this end, we have designed a Bayesian probabilistic model to predict the probability of dichotomous phenotypes. When applied to a cohort from PGP, predictions of Gilbert syndrome, Graves' disease, non-Hodgkin lymphoma, and various blood groups were accurate, as individuals manifesting the phenotype in question exhibited the highest, or among the highest, predicted probabilities. Thirty-eight PGP phenotypes (26%) were predicted with area-under-the-ROC curve (AUC)>0.7, and 23 (15.8%) of these were statistically significant, based on permutation tests. Moreover, in a Critical Assessment of Genome Interpretation (CAGI) blinded prediction experiment, the models were used to match 77 PGP genomes to phenotypic profiles, generating the most accurate prediction of 16 submissions, according to an independent assessor. Although the models are currently insufficiently accurate for diagnostic utility, we expect their performance to improve with growth of publicly available genomics data and model refinement by domain experts.
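The abstract above scores its predictions by area under the ROC curve (AUC). As a reminder of what that number measures, the following sketch computes AUC directly as the probability that a randomly chosen individual with the phenotype receives a higher predicted probability than a randomly chosen individual without it, counting ties as one half. This is a generic illustration and not the authors' pipeline; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted probabilities; labels: 1 for phenotype present,
    0 for absent. Ties between a positive and a negative count as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four individuals
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported threshold of 0.7 should be read.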
Lindsey, Rebecca L; Garcia-Toledo, L; Fasulo, D; Gladney, L M; Strockbine, N
2017-09-01
Escherichia coli, Escherichia albertii, and Escherichia fergusonii are closely related bacteria that can cause illness in humans, such as bacteremia, urinary tract infections and diarrhea. Current identification strategies for these three species vary in complexity and typically rely on the use of multiple phenotypic and genetic tests. To facilitate their rapid identification, we developed a multiplex PCR assay targeting conserved, species-specific genes. We used the Daydreamer™ (Pattern Genomics, USA) software platform to concurrently analyze whole genome sequence assemblies (WGS) from 150 Enterobacteriaceae genomes (107 E. coli, 5 Shigella spp., 21 E. albertii, 12 E. fergusonii and 5 other species) and design primers for the following species-specific regions: a 212bp region of the cyclic di-GMP regulator gene (cdgR, AW869_22935 from genome K-12 MG1655, CP014225) for E. coli/Shigella; a 393bp region of the DNA-binding transcriptional activator of cysteine biosynthesis gene (EAKF1_ch4033 from genome KF1, CP007025) for E. albertii; and a 575bp region of the palmitoleoyl-acyl carrier protein (ACP)-dependent acyltransferase (EFER_0790 from genome ATCC 35469, CU928158) for E. fergusonii. We incorporated the species-specific primers into a conventional multiplex PCR assay and assessed its performance with a collection of 97 Enterobacteriaceae strains. The assay was 100% sensitive and specific for detecting the expected species and offers a quick and accurate strategy for identifying E. coli, E. albertii, and E. fergusonii in either a single reaction or by in silico PCR with sequence assemblies. Published by Elsevier B.V.
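The abstract mentions that the assay can also be run as in silico PCR on sequence assemblies. A minimal sketch of that idea is given below, assuming exact primer matches on a single strand of a linear template. Real tools also handle mismatches, both strands and multiple hits. The primer and template sequences are hypothetical and are not the published primers.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_pcr(template, forward, reverse):
    """Return the predicted amplicon, or None if either primer has no hit.

    The forward primer must match the template as given; the reverse primer
    must match the opposite strand downstream of the forward primer.
    """
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Hypothetical template and primers for illustration only
toy_template = "TTT" + "ACGTAC" + "GGGGG" + "CCCC" + "ATTTAGC" + "AAA"
toy_amplicon = in_silico_pcr(toy_template, "ACGTAC", "GCTAAAT")
```

Comparing the length of the returned amplicon with the expected product size for each species is the in silico counterpart of reading band sizes on a gel.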
Decoding the complex genetic causes of heart diseases using systems biology.
Djordjevic, Djordje; Deshpande, Vinita; Szczesnik, Tomasz; Yang, Andrian; Humphreys, David T; Giannoulatou, Eleni; Ho, Joshua W K
2015-03-01
The pace of disease gene discovery is still much slower than expected, even with the use of cost-effective DNA sequencing and genotyping technologies. It is increasingly clear that many inherited heart diseases have a more complex polygenic aetiology than previously thought. Understanding the role of gene-gene interactions, epigenetics, and non-coding regulatory regions is becoming increasingly critical in predicting the functional consequences of genetic mutations identified by genome-wide association studies and whole-genome or exome sequencing. A systems biology approach is now being widely employed to systematically discover genes that are involved in heart diseases in humans or relevant animal models through bioinformatics. The overarching premise is that the integration of high-quality causal gene regulatory networks (GRNs), genomics, epigenomics, transcriptomics and other genome-wide data will greatly accelerate the discovery of the complex genetic causes of congenital and complex heart diseases. This review summarises state-of-the-art genomic and bioinformatics techniques that are used in accelerating the pace of disease gene discovery in heart diseases. Accompanying this review, we provide an interactive web-resource for systems biology analysis of mammalian heart development and diseases, CardiacCode ( http://CardiacCode.victorchang.edu.au/ ). CardiacCode features a dataset of over 700 pieces of manually curated genetic or molecular perturbation data, which enables the inference of a cardiac-specific GRN of 280 regulatory relationships between 33 regulator genes and 129 target genes. We believe this growing resource will fill an urgent unmet need to fully realise the true potential of predictive and personalised genomic medicine in tackling human heart disease.
USDA-ARS's Scientific Manuscript database
Next-generation sequencing technologies were used to rapidly and efficiently sequence the genome of the domestic turkey (Meleagris gallopavo). The current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to chromosomes. Innate heterozygosity of the sequenced bird allowed discovery of...
Jakupciak, John P.; Wells, Jeffrey M.; Karalus, Richard J.; Pawlowski, David R.; Lin, Jeffrey S.; Feldman, Andrew B.
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations. PMID:24455204
Viral quasispecies inference from 454 pyrosequencing
2013-01-01
Background Many potentially life-threatening infectious viruses are highly mutable in nature. Characterizing the fittest variants within a quasispecies from infected patients is expected to allow unprecedented opportunities to investigate the relationship between quasispecies diversity and disease epidemiology. The advent of next-generation sequencing technologies has allowed the study of virus diversity with high-throughput sequencing, although these methods come with higher rates of errors which can artificially increase diversity. Results Here we introduce a novel computational approach that incorporates base quality scores from next-generation sequencers for reconstructing viral genome sequences that simultaneously infers the number of variants within a quasispecies that are present. Comparisons on simulated and clinical data on dengue virus suggest that the novel approach provides a more accurate inference of the underlying number of variants within the quasispecies, which is vital for clinical efforts in mapping the within-host viral diversity. Sequence alignments generated by our approach are also found to exhibit lower rates of error. Conclusions The ability to infer the viral quasispecies colony that is present within a human host provides the potential for a more accurate classification of the viral phenotype. Understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate these viral infectious diseases, but also in learning about the innate protection in the human host against the viruses. PMID:24308284
Genome-wide comparative analysis of four Indian Drosophila species.
Mohanty, Sujata; Khanna, Radhika
2017-12-01
Comparative analysis of multiple genomes of closely or distantly related Drosophila species allows evolutionary biologists to explore genomic changes from an ecological and evolutionary perspective. We present herewith the de novo assembled whole genome sequences of four Drosophila species of Indian origin, D. bipectinata, D. takahashii, D. biarmipes and D. nasuta, generated using Next Generation Sequencing technology on an Illumina platform, along with their detailed assembly statistics. Comparative genomics analyses, e.g. gene predictions and annotations, functional and orthogroup analysis of coding sequences and genome-wide SNP distribution, were performed. The whole genome of Zaprionus indianus of Indian origin published earlier by us and the genome sequences of 12 previously sequenced Drosophila species available in the NCBI database were included in the analysis. The present work is a part of our ongoing genomics project of Indian Drosophila species.
Qin, Yanhong; Wang, Li; Zhang, Zhenchen; Qiao, Qi; Zhang, Desheng; Tian, Yuting; Wang, Shuang; Wang, Yongjiang; Yan, Zhaoling
2014-01-01
Background Sweet potato chlorotic stunt virus (SPCSV; family Closteroviridae, genus Crinivirus) features a large bipartite, single-stranded, positive-sense RNA genome. To date, only three complete genomic sequences of SPCSV can be accessed through GenBank. SPCSV was first detected in China in 2011, but only partial genomic sequences have been determined in the country. No report on the complete genomic sequence and genome structure of Chinese SPCSV isolates or the genetic relation between isolates from China and other countries is available. Methodology/Principal Findings The complete genomic sequences of five isolates from different areas in China were characterized. This study is the first to report the complete genome sequences of SPCSV from whitefly vectors. Genome structure analysis showed that isolates of WA and EA strains from China have the same coding protein as isolates Can181-9 and m2-47, respectively. Twenty cp genes and four RNA1 partial segments were sequenced and analyzed, and the nucleotide identities of complete genomic, cp, and RNA1 partial sequences were determined. Results indicated high conservation among strains and significant differences between WA and EA strains. Genetic analysis demonstrated that, except for isolates from Guangdong Province, SPCSVs from other areas belong to the WA strain. Genome organization analysis showed that the isolates in this study lack the p22 gene. Conclusions/Significance We presented the complete genome sequences of SPCSV in China. Comparison of nucleotide identities and genome structures between these isolates and previously reported isolates showed slight differences. The nucleotide identities of different SPCSV isolates showed high conservation among strains and significant differences between strains. All nine isolates in this study lacked the p22 gene. WA strains were more extensively distributed than EA strains in China. These data provide important insights into the molecular variation and genomic structure of SPCSV in China as well as genetic relationships among isolates from China and other countries. PMID:25170926
Hernandez-Valladares, Maria; Vaudel, Marc; Selheim, Frode; Berven, Frode; Bruserud, Øystein
2017-08-01
Mass spectrometry (MS)-based proteomics has become an indispensable tool for the characterization of the proteome and its post-translational modifications (PTM). In addition to standard protein sequence databases, proteogenomics strategies search the spectral data against the theoretical spectra obtained from customized protein sequence databases. To date, there are no published proteogenomics studies on acute myeloid leukemia (AML) samples. Areas covered: Proteogenomics involves the understanding of genomic and proteomic data. The intersection of both data types requires advanced bioinformatics skills. A standard proteogenomics workflow that could be used for the study of AML samples is described. The generation of customized protein sequence databases as well as bioinformatics tools and pipelines commonly used in proteogenomics are discussed in detail. Expert commentary: Drawing on evidence from recent cancer proteogenomics studies and taking into account the public availability of AML genomic data, the interpretation of present and future MS-based AML proteomic data using AML-specific protein sequence databases could discover new biological mechanisms and targets in AML. However, proteogenomics workflows including bioinformatics guidelines can be challenging for the wider AML research community. It is expected that further automation and simplification of the bioinformatics procedures might attract AML investigators to adopt the proteogenomics strategy.
Lin, Chia-Chi; Yang, Zhi-Wei; Iang, Shan-Bei; Chao, Mei
2015-01-02
Hepatitis delta virus (HDV) replication is carried out by host RNA polymerases. Since homologous inter-genotypic RNA recombination is known to occur in HDV, possibly via a replication-dependent process, we hypothesized that the degree of sequence homology and the replication level should be related to the recombination frequency in cells co-expressing two HDV sequences. To confirm this, we separately co-transfected cells with three different pairs of HDV genomic RNAs and analyzed the obtained recombinants by RT-PCR followed by restriction fragment length polymorphism and sequencing analyses. The sequence divergence between the clones ranged from 24% to less than 0.1%, and the difference in replication levels was as high as 100-fold. As expected, significant differences were observed in the recombination frequencies, which ranged from 0.5% to 47.5%. Furthermore, varying the relative amounts of parental RNA altered the dominant recombinant species produced, suggesting that template switching occurs frequently during the synthesis of genomic HDV RNA. Taken together, these data suggest that during the host RNA polymerase-driven RNA recombination of HDV, both inter- and intra-genotypic recombination events are important in shaping the genetic diversity of HDV. Copyright © 2014 Elsevier B.V. All rights reserved.
Whole-genome sequencing to determine Neisseria gonorrhoeae transmission: an observational study
Cole, Kevin; Cole, Michelle J; Cresswell, Fiona; Dean, Gillian; Dave, Jayshree; Thomas, Daniel Rh; Foster, Kirsty; Waldram, Alison; Wilson, Daniel J; Didelot, Xavier; Grad, Yonatan H; Crook, Derrick W; Peto, Tim EA; Walker, A Sarah
2016-01-01
Background: New approaches are urgently required to address increasing rates of gonorrhoea and the emergence and global spread of antibiotic-resistant Neisseria gonorrhoeae. Whole genome sequencing (WGS) can be applied to study transmission and track resistance. Methods: We performed WGS on 1659 isolates from Brighton, UK, and 217 additional isolates from other UK locations. We included WGS data (n=196) from the USA. Estimated mutation rates, plus diversity observed within patients across anatomical sites and probable transmission pairs, were used to fit a coalescent model to determine the number of single nucleotide polymorphisms (SNPs) expected between sequences related by direct/indirect transmission, depending on the time between samples. Findings: We detected extensive local transmission. 281/1061 (26%) Brighton cases were indistinguishable (0 SNPs) to ≥1 previous case(s), and 786 (74%) had evidence of a sampled direct or indirect Brighton source. There was evidence of sustained transmission of some lineages. We observed multiple related samples across geographic locations. Of 1273 infections in Brighton, 225 (18%) were linked to another case from elsewhere in the UK, and 115 (9%) to a case from the USA. Four lineages initially identified in Brighton could be linked to 70 USA sequences, including 61 from a lineage carrying the mosaic penA XXXIV associated with reduced cefixime susceptibility. Interpretation: We present a WGS-based tool for genomic contact tracing of N. gonorrhoeae and demonstrate local, national and international transmission. WGS can be applied across geographical boundaries to investigate gonorrhoea transmission and to track antimicrobial resistance. Funding: Oxford NIHR Health Protection Research Unit and Biomedical Research Centre. PMID:27427203
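The SNP comparison described in the abstract above can be illustrated with a minimal Python sketch that counts pairwise differences between aligned sequences and reports pairs at or below a cut-off. The function names, the toy sequences and the fixed threshold are assumptions made for illustration only; the study itself fitted a coalescent model with time-dependent SNP expectations, which this sketch does not reproduce.

```python
from itertools import combinations

def snp_distance(a, b):
    """Count differing positions between two aligned sequences.

    Positions where either sequence has an ambiguous base (N) or a gap (-)
    are ignored, a common but not universal convention.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"N", "-"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in ignore and y not in ignore
    )

def linked_pairs(seqs, max_snps):
    """Return (id1, id2, distance) for every pair within max_snps differences."""
    out = []
    for (i, s), (j, t) in combinations(sorted(seqs.items()), 2):
        d = snp_distance(s, t)
        if d <= max_snps:
            out.append((i, j, d))
    return out

# Hypothetical toy alignment; real inputs would be whole-genome alignments.
toy = {
    "case1": "ACGTACGT",
    "case2": "ACGTACGT",
    "case3": "ACGTTCGA",
    "case4": "ANGTACGT",
}
print(linked_pairs(toy, max_snps=0))
```

With `max_snps=0` the sketch reports only indistinguishable pairs; a larger cut-off would need to be justified by a model of mutation rate and sampling interval, as in the study.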
Arakawa, C.K.; Deering, R.E.; Higman, K.H.; Oshima, K.H.; O'Hara, P.J.; Winton, J.R.
1990-01-01
The polymerase chain reaction (PCR) was used to amplify a portion of the nucleoprotein (N) gene of infectious hematopoietic necrosis virus (IHNV). Using a published sequence for the Round Butte isolate of IHNV, a pair of PCR primers was synthesized that spanned a 252 nucleotide region of the N gene from residue 319 to residue 570 of the open reading frame. This region included a 30 nucleotide target sequence for a synthetic oligonucleotide probe developed for detection of IHNV N gene messenger RNA. After 25 cycles of amplification of either messenger or genomic RNA, the PCR product (DNA) of the expected size was easily visible on agarose gels stained with ethidium bromide. The specificity of the amplified DNA was confirmed by Southern and dot-blot analysis using the biotinylated oligonucleotide probe. The PCR was able to amplify the N gene sequence of purified genomic RNA from isolates of IHNV representing 5 different electropherotypes. Using the IHNV primer set, no PCR product was obtained from viral hemorrhagic septicemia virus RNA, but 2 higher molecular weight products were synthesized from hirame rhabdovirus RNA that did not hybridize with the biotinylated probe. The PCR could be efficiently performed with all IHNV genomic RNA template concentrations tested (1 ng to 1 pg). The lowest level of sensitivity was not determined. The PCR was used to amplify RNA extracted from infected cell cultures and selected tissues of infected rainbow trout. The combination of PCR and nucleic acid probe promises to provide a detection method for IHNV that is rapid, highly specific, and sensitive.
Yohn, Chris T; Jiang, Zhaoshi; McGrath, Sean D; Hayden, Karen E; Khaitovich, Philipp; Johnson, Matthew E; Eichler, Marla Y; McPherson, John D; Zhao, Shaying; Pääbo, Svante; Eichler, Evan E
2005-04-01
Retroviral infections of the germline have the potential to episodically alter gene function and genome structure during the course of evolution. Horizontal transmissions between species have been proposed, but little evidence exists for such events in the human/great ape lineage of evolution. Based on analysis of finished BAC chimpanzee genome sequence, we characterize a retroviral element (Pan troglodytes endogenous retrovirus 1 [PTERV1]) that has become integrated in the germline of African great ape and Old World monkey species but is absent from humans and Asian ape genomes. We unambiguously map 287 retroviral integration sites and determine that approximately 95.8% of the insertions occur at non-orthologous regions between closely related species. Phylogenetic analysis of the endogenous retrovirus reveals that the gorilla and chimpanzee elements share a monophyletic origin with a subset of the Old World monkey retroviral elements, but that the average sequence divergence exceeds neutral expectation for a strictly nuclear inherited DNA molecule. Within the chimpanzee, there is a significant integration bias against genes, with only 14 of these insertions mapping within intronic regions. Six out of ten of these genes, for which there are expression data, show significant differences in transcript expression between human and chimpanzee. Our data are consistent with a retroviral infection that bombarded the genomes of chimpanzees and gorillas independently and concurrently, 3-4 million years ago. We speculate on the potential impact of such recent events on the evolution of humans and great apes.
The genome revolution and its role in understanding complex diseases.
Hofker, Marten H; Fu, Jingyuan; Wijmenga, Cisca
2014-10-01
The completion of the human genome sequence in 2003 clearly marked the beginning of a new era for biomedical research. It spurred technological progress that was unprecedented in the life sciences, including the development of high-throughput technologies to detect genetic variation and gene expression. The study of genetics has become "big data science". One of the current goals of genetic research is to use genomic information to further our understanding of common complex diseases. An essential first step made towards this goal was by the identification of thousands of single nucleotide polymorphisms showing robust association with hundreds of different traits and diseases. As insight into common genetic variation has expanded enormously and the technology to identify more rare variation has become available, we can utilize these advances to gain a better understanding of disease etiology. This will lead to developments in personalized medicine and P4 healthcare. Here, we review some of the historical events and perspectives before and after the completion of the human genome sequence. We also describe the success of large-scale genetic association studies and how these are expected to yield more insight into complex disorders. We show how we can now combine gene-oriented research and systems-based approaches to develop more complex models to help explain the etiology of common diseases. This article is part of a Special Issue entitled: From Genome to Function. Copyright © 2014 Elsevier B.V. All rights reserved.
Whole-genome sequencing of Atacama skeleton shows novel mutations linked with dysplasia
Bhattacharya, Sanchita; Li, Jian; Sockell, Alexandra; Kan, Matthew J.; Bava, Felice A.; Chen, Shann-Ching; Ávila-Arcos, María C.; Ji, Xuhuai; Smith, Emery; Asadi, Narges B.; Lachman, Ralph S.; Lam, Hugo Y.K.; Bustamante, Carlos D.; Butte, Atul J.; Nolan, Garry P.
2018-01-01
Over a decade ago, the Atacama humanoid skeleton (Ata) was discovered in the Atacama region of Chile. The Ata specimen carried a strange phenotype—6-in stature, fewer than expected ribs, elongated cranium, and accelerated bone age—leading to speculation that this was a preserved nonhuman primate, human fetus harboring genetic mutations, or even an extraterrestrial. We previously reported that it was human by DNA analysis with an estimated bone age of about 6–8 yr at the time of demise. To determine the possible genetic drivers of the observed morphology, DNA from the specimen was subjected to whole-genome sequencing using the Illumina HiSeq platform with an average 11.5× coverage of 101-bp, paired-end reads. In total, 3,356,569 single nucleotide variations (SNVs) were found as compared to the human reference genome, 518,365 insertions and deletions (indels), and 1047 structural variations (SVs) were detected. Here, we present the detailed whole-genome analysis showing that Ata is a female of human origin, likely of Chilean descent, and its genome harbors mutations in genes (COL1A1, COL2A1, KMT2D, FLNB, ATR, TRIP11, PCNT) previously linked with diseases of small stature, rib anomalies, cranial malformations, premature joint fusion, and osteochondrodysplasia (also known as skeletal dysplasia). Together, these findings provide a molecular characterization of Ata's peculiar phenotype, which likely results from multiple known and novel putative gene mutations affecting bone development and ossification. PMID:29567674
The $100 Genome: Implications for the DoD
2010-12-15
Phenotypes Influenced by the Microbiome In a recent study (Vijay-Kumar et al. 2010), gut microbes from obese mice were transferred to thin mice whose gut ... extraordinary genetic diversity of the gut microbiome, it is expected that these organisms (which usually have the first chance to interact with orally ... Human Microbiome. In some cases it may be useful to sequence metagenomic samples of the microbiomes that colonize the human gut, oral cavity or other
Campo, D; Lehmann, K; Fjeldsted, C; Souaiaia, T; Kao, J; Nuzhdin, S V
2013-10-01
The prevailing demographic model for Drosophila melanogaster suggests that the colonization of North America occurred very recently from a subset of European flies that rapidly expanded across the continent. This model implies a sudden population growth and range expansion consistent with very low or no population subdivision. As flies adapt to new environments, local adaptation events may be expected. To describe demographic and selective events during North American colonization, we have generated a data set of 35 individual whole-genome sequences from inbred lines of D. melanogaster from a west coast US population (Winters, California, USA) and compared them with a public genome data set from Raleigh (Raleigh, North Carolina, USA). We analysed nuclear and mitochondrial genomes and described levels of variation and divergence within and between these two North American D. melanogaster populations. Both populations exhibit negative values of Tajima's D across the genome, a common signature of demographic expansion. We also detected a low but significant level of genome-wide differentiation between the two populations, as well as multiple allele surfing events, which can be the result of gene drift in local subpopulations on the edge of an expansion wave. In contrast to this genome-wide pattern, we uncovered a 50-kilobase segment in chromosome arm 3L that showed all the hallmarks of a soft selective sweep in both populations. A comparison of allele frequencies within this divergent region among six populations from three continents allowed us to cluster these populations in two differentiated groups, providing evidence for the action of natural selection on a global scale. © 2013 John Wiley & Sons Ltd.
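The abstract above uses negative Tajima's D as a signature of demographic expansion. As a minimal sketch of how that statistic is computed from aligned haplotypes, the Python function below follows the standard formulation (mean pairwise differences compared with the number of segregating sites scaled by a harmonic sum). The function name and the toy haplotypes are assumptions for illustration; the study's own pipeline and filtering are not reproduced here, and each polymorphic column is counted as a single segregating site.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length haplotype strings."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences (the variance term is zero below that)")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) sites
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if S == 0:
        return float("nan")  # D is undefined without variation

    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # Constants that depend only on the sample size n
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))

# Hypothetical toy data: an excess of singletons, as expected after expansion.
expanding = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA"]
# Hypothetical toy data: two balanced haplotype classes.
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(expanding), tajimas_d(balanced))
```

An excess of rare variants pulls the mean pairwise difference below the segregating-site estimate and gives a negative value, while intermediate-frequency variants give a positive one.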
Auernik, Kathryne S; Maezato, Yukari; Blum, Paul H; Kelly, Robert M
2008-02-01
Despite their taxonomic description, not all members of the order Sulfolobales are capable of oxidizing reduced sulfur species, which, in addition to iron oxidation, is a desirable trait of biomining microorganisms. However, the complete genome sequence of the extremely thermoacidophilic archaeon Metallosphaera sedula DSM 5348 (2.2 Mb, approximately 2,300 open reading frames [ORFs]) provides insights into biologically catalyzed metal sulfide oxidation. Comparative genomics was used to identify pathways and proteins involved (directly or indirectly) with bioleaching. As expected, the M. sedula genome contains genes related to autotrophic carbon fixation, metal tolerance, and adhesion. Also, terminal oxidase cluster organization indicates the presence of hybrid quinol-cytochrome oxidase complexes. Comparisons with the mesophilic biomining bacterium Acidithiobacillus ferrooxidans ATCC 23270 indicate that the M. sedula genome encodes at least one putative rusticyanin, involved in iron oxidation, and a putative tetrathionate hydrolase, implicated in sulfur oxidation. The fox gene cluster, involved in iron oxidation in the thermoacidophilic archaeon Sulfolobus metallicus, was also identified. These iron- and sulfur-oxidizing components are missing from genomes of nonleaching members of the Sulfolobales, such as Sulfolobus solfataricus P2 and Sulfolobus acidocaldarius DSM 639. Whole-genome transcriptional response analysis showed that 88 ORFs were up-regulated twofold or more in M. sedula upon addition of ferrous sulfate to yeast extract-based medium; these included genes for components of terminal oxidase clusters predicted to be involved with iron oxidation, as well as genes predicted to be involved with sulfur metabolism. Many hypothetical proteins were also differentially transcribed, indicating that aspects of the iron and sulfur metabolism of M. sedula remain to be identified and characterized.
Genome editing in sea urchin embryos by using a CRISPR/Cas9 system.
Lin, Che-Yi; Su, Yi-Hsien
2016-01-15
Sea urchin embryos are a useful model system for investigating early developmental processes and the underlying gene regulatory networks. Most functional studies using sea urchin embryos rely on antisense morpholino oligonucleotides to knockdown gene functions. However, major concerns related to this technique include off-target effects, variations in morpholino efficiency, and potential morpholino toxicity; furthermore, such problems are difficult to discern. Recent advances in genome editing technologies have introduced the prospect of not only generating sequence-specific knockouts, but also providing genome-engineering applications. Two genome editing tools, zinc-finger nuclease (ZFN) and transcription activator-like effector nucleases (TALENs), have been utilized in sea urchin embryos, but the resulting efficiencies are far from satisfactory. The CRISPR (clustered regularly interspaced short palindromic repeat)-Cas9 (CRISPR-associated nuclease 9) system serves as an easy and efficient method with which to edit the genomes of several established and emerging model organisms in the field of developmental biology. Here, we apply the CRISPR/Cas9 system to the sea urchin embryo. We designed six guide RNAs (gRNAs) against the well-studied nodal gene and discovered that five of the gRNAs induced the expected phenotype in 60-80% of the injected embryos. In addition, we developed a simple method for isolating genomic DNA from individual embryos, enabling phenotype to be precisely linked to genotype, and revealed that the mutation rates were 67-100% among the sequenced clones. Of the two potential off-target sites we examined, no off-target effects were observed. The detailed procedures described herein promise to accelerate the usage of CRISPR/Cas9 system for genome editing in sea urchin embryos. Copyright © 2015 Elsevier Inc. All rights reserved.
Genome-Based Taxonomic Classification of Bacteroidetes
Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina; ...
2016-12-20
The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogenetic analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved.
Auernik, Kathryne S.; Maezato, Yukari; Blum, Paul H.; Kelly, Robert M.
2008-01-01
Despite their taxonomic description, not all members of the order Sulfolobales are capable of oxidizing reduced sulfur species, which, in addition to iron oxidation, is a desirable trait of biomining microorganisms. However, the complete genome sequence of the extremely thermoacidophilic archaeon Metallosphaera sedula DSM 5348 (2.2 Mb, ∼2,300 open reading frames [ORFs]) provides insights into biologically catalyzed metal sulfide oxidation. Comparative genomics was used to identify pathways and proteins involved (directly or indirectly) with bioleaching. As expected, the M. sedula genome contains genes related to autotrophic carbon fixation, metal tolerance, and adhesion. Also, terminal oxidase cluster organization indicates the presence of hybrid quinol-cytochrome oxidase complexes. Comparisons with the mesophilic biomining bacterium Acidithiobacillus ferrooxidans ATCC 23270 indicate that the M. sedula genome encodes at least one putative rusticyanin, involved in iron oxidation, and a putative tetrathionate hydrolase, implicated in sulfur oxidation. The fox gene cluster, involved in iron oxidation in the thermoacidophilic archaeon Sulfolobus metallicus, was also identified. These iron- and sulfur-oxidizing components are missing from genomes of nonleaching members of the Sulfolobales, such as Sulfolobus solfataricus P2 and Sulfolobus acidocaldarius DSM 639. Whole-genome transcriptional response analysis showed that 88 ORFs were up-regulated twofold or more in M. sedula upon addition of ferrous sulfate to yeast extract-based medium; these included genes for components of terminal oxidase clusters predicted to be involved with iron oxidation, as well as genes predicted to be involved with sulfur metabolism. Many hypothetical proteins were also differentially transcribed, indicating that aspects of the iron and sulfur metabolism of M. sedula remain to be identified and characterized. PMID:18083856
Genome-Based Taxonomic Classification of Bacteroidetes
Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina; Mukherjee, Supratim; Huntemann, Marcel; Ivanova, Natalia N.; Woyke, Tanja; Kyrpides, Nikos C.; Klenk, Hans-Peter; Göker, Markus
2016-01-01
The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogenetic analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved. PMID:28066339
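The abstract above contrasts G+C content reported in species descriptions with values calculated directly from genome sequences. A minimal Python sketch of that direct calculation follows; the function name and the toy contigs are assumptions for illustration, and ambiguous bases are simply excluded from the denominator, which is one common convention rather than the authors' stated procedure.

```python
def gc_content(contigs):
    """Percent G+C over all contigs, counting only unambiguous A/C/G/T bases."""
    gc = at = 0
    for seq in contigs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / (gc + at)

# Hypothetical toy contigs; a real genome would be read from a FASTA file.
print(round(gc_content(["GGCCAATT", "GCNNAT"]), 1))
```

Computing the value over the whole assembly, rather than averaging per-contig percentages, keeps long contigs from being under-weighted.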
GFinisher: a new strategy to refine and finish bacterial genome assemblies
NASA Astrophysics Data System (ADS)
Guizelini, Dieval; Raittz, Roberto T.; Cruz, Leonardo M.; Souza, Emanuel M.; Steffens, Maria B. R.; Pedrosa, Fabio O.
2016-10-01
Despite developments in DNA sequencing technology that have improved the number and length of reads, reconstructing complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented in contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating Fuzzy GC skew graphs for each contig in an assembly and then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to datasets from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
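The GC skew signal that the abstract above builds on can be sketched in a few lines of Python. This is the plain windowed statistic (G - C) / (G + C), not GFinisher's Fuzzy GC skew, whose details are not given in the abstract; the function name, window size and toy sequence are assumptions for illustration.

```python
def gc_skew(seq, window=1000, step=None):
    """Return (start, skew) per window, where skew = (G - C) / (G + C)."""
    step = step or window
    seq = seq.upper()
    out = []
    # A sequence shorter than one window is treated as a single window.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        out.append((start, (g - c) / (g + c) if g + c else 0.0))
    return out

# Hypothetical toy contig: a G-rich half followed by a C-rich half.
print(gc_skew("GGGGCCCC", window=4))
```

In many bacterial chromosomes the sign of the skew switches near the replication origin and terminus, so an unexpected sign change inside a contig is one kind of pattern that can hint at a misassembly.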
Hamilton, John P; Neeno-Eckwall, Eric C; Adhikari, Bishwo N; Perna, Nicole T; Tisserat, Ned; Leach, Jan E; Lévesque, C André; Buell, C Robin
2011-01-01
The Comprehensive Phytopathogen Genomics Resource (CPGR) provides a web-based portal for plant pathologists and diagnosticians to view the genome and transcriptome sequence status of 806 bacterial, fungal, oomycete, nematode, viral and viroid plant pathogens. Tools are available to search and analyze annotated genome sequences of 74 bacterial, fungal and oomycete pathogens. Oomycete and fungal genomes are obtained directly from GenBank, whereas bacterial genome sequences are downloaded from the A Systematic Annotation Package (ASAP) database that provides curation of genomes using comparative approaches. Curated lists of bacterial genes relevant to pathogenicity and avirulence are also provided. The Plant Pathogen Transcript Assemblies Database provides annotated assemblies of the transcribed regions of 82 eukaryotic genomes from publicly available single pass Expressed Sequence Tags. Data-mining tools are provided along with tools to create candidate diagnostic markers, an emerging use for genomic sequence data in plant pathology. The Plant Pathogen Ribosomal DNA (rDNA) database is a resource for pathogens that lack genome or transcriptome data sets and contains 131 755 rDNA sequences from GenBank for 17 613 species identified as plant pathogens and related genera. Database URL: http://cpgr.plantbiology.msu.edu.
GFinisher: a new strategy to refine and finish bacterial genome assemblies.
Guizelini, Dieval; Raittz, Roberto T; Cruz, Leonardo M; Souza, Emanuel M; Steffens, Maria B R; Pedrosa, Fabio O
2016-10-10
Despite developments in DNA sequencing technology that have improved the number and length of reads, reconstructing complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented in contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating Fuzzy GC skew graphs for each contig in an assembly and then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to datasets from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
Genome Analysis of the Domestic Dog (Korean Jindo) by Massively Parallel Sequencing
Kim, Ryong Nam; Kim, Dae-Soo; Choi, Sang-Haeng; Yoon, Byoung-Ha; Kang, Aram; Nam, Seong-Hyeuk; Kim, Dong-Wook; Kim, Jong-Joo; Ha, Ji-Hong; Toyoda, Atsushi; Fujiyama, Asao; Kim, Aeri; Kim, Min-Young; Park, Kun-Hyang; Lee, Kang Seon; Park, Hong-Seog
2012-01-01
Although pioneering sequencing projects have shed light on the boxer and poodle genomes, a number of challenges need to be met before the sequencing and annotation of the dog genome can be considered complete. Here, we present the DNA sequence of the Jindo dog genome, sequenced to 45-fold average coverage using Illumina massively parallel sequencing technology. A comparison of the sequence to the reference boxer genome led to the identification of 4 675 437 single nucleotide polymorphisms (SNPs, including 3 346 058 novel SNPs), 71 642 indels and 8131 structural variations. Of these, 339 non-synonymous SNPs and 3 indels are located within coding sequences (CDS). In particular, 3 non-synonymous SNPs and a 26-bp deletion occur in the TCOF1 locus, implying that the difference observed in cranial facial morphology between Jindo and boxer dogs might be influenced by those variations. Through the annotation of the Jindo olfactory receptor gene family, we found 2 unique olfactory receptor genes and 236 olfactory receptor genes harbouring non-synonymous homozygous SNPs that are likely to affect smelling capability. In addition, we determined the DNA sequence of the Jindo dog mitochondrial genome and identified Jindo dog-specific mtDNA genotypes. This Jindo genome data upgrade our understanding of dog genomic architecture and will be a very valuable resource for investigating not only dog genetics and genomics but also human and dog disease genetics and comparative genomics. PMID:22474061
Nakamura, Kosuke; Kondo, Kazunari; Akiyama, Hiroshi; Ishigaki, Takumi; Noguchi, Akio; Katsumata, Hiroshi; Takasaki, Kazuto; Futo, Satoshi; Sakata, Kozue; Fukuda, Nozomi; Mano, Junichi; Kitta, Kazumi; Tanaka, Hidenori; Akashi, Ryo; Nishimaki-Mogami, Tomoko
2016-08-15
Identification of transgenic sequences in an unknown genetically modified (GM) papaya (Carica papaya L.) by whole genome sequence analysis was demonstrated. Whole genome sequence data were generated for a GM-positive fresh papaya fruit commodity detected in monitoring using real-time polymerase chain reaction (PCR). The sequences obtained were mapped against an open database for papaya genome sequence. Transgenic construct- and event-specific sequences were identified as a GM papaya developed to resist infection from a Papaya ringspot virus. Based on the transgenic sequences, a specific real-time PCR detection method for GM papaya applicable to various food commodities was developed. Whole genome sequence analysis enabled identifying unknown transgenic construct- and event-specific sequences in GM papaya and development of a reliable method for detecting them in papaya food commodities. Copyright © 2016 Elsevier Ltd. All rights reserved.
Coverage Bias and Sensitivity of Variant Calling for Four Whole-genome Sequencing Technologies
Lasitschka, Bärbel; Jones, David; Northcott, Paul; Hutter, Barbara; Jäger, Natalie; Kool, Marcel; Taylor, Michael; Lichter, Peter; Pfister, Stefan; Wolf, Stephan; Brors, Benedikt; Eils, Roland
2013-01-01
The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics. Currently, there are four commonly used whole-genome sequencing platforms on the market: Illumina’s HiSeq2000, Life Technologies’ SOLiD 4 and its completely redesigned 5500xl SOLiD, and Complete Genomics’ technology. A number of earlier studies have compared a subset of those sequencing platforms or compared those platforms with Sanger sequencing, which is prohibitively expensive for whole genome studies. Here we present a detailed comparison of the performance of all currently available whole genome sequencing platforms, especially regarding their ability to call SNVs and to evenly cover the genome and specific genomic regions. Unlike earlier studies, we base our comparison on four different samples, allowing us to assess the between-sample variation of the platforms. We find a pronounced GC bias in GC-rich regions for Life Technologies’ platforms, with Complete Genomics performing best here, while we see the least bias in GC-poor regions for HiSeq2000 and 5500xl. HiSeq2000 gives the most uniform coverage and displays the least sample-to-sample variation. In contrast, Complete Genomics exhibits by far the smallest fraction of bases not covered, while the SOLiD platforms reveal remarkable shortcomings, especially in covering CpG islands. When comparing the performance of the four platforms for calling SNPs, HiSeq2000 and Complete Genomics achieve the highest sensitivity, while the SOLiD platforms show the lowest false positive rate. Finally, we find that integrating sequencing data from different platforms offers the potential to combine the strengths of different technologies. In summary, our results detail the strengths and weaknesses of all four whole-genome sequencing platforms. It indicates application areas that call for a specific sequencing platform and disallow other platforms. 
This helps to identify the proper sequencing platform for whole genome studies with different application scopes. PMID:23776689
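The sensitivity and false-positive comparison described in this abstract can be illustrated with a small sketch. The abstract does not say how the metrics were computed, so the function below rests on assumptions: calls and a trusted reference set are represented as sets of (chromosome, position, alternate allele) tuples, and the proportion of calls absent from the reference set stands in for the "false positive rate". The function name and the toy data are hypothetical.

```python
def snv_call_metrics(called, truth):
    """Compare SNV calls against a trusted reference set (illustrative only)."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # calls confirmed by the reference set
    false_pos = len(called - truth)   # calls absent from the reference set
    false_neg = len(truth - called)   # reference variants that were missed
    # Sensitivity: fraction of reference variants that were recovered.
    sensitivity = true_pos / len(truth) if truth else float("nan")
    # Assumed stand-in for "false positive rate": fraction of calls that are wrong.
    false_call_fraction = false_pos / len(called) if called else float("nan")
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "sensitivity": sensitivity,
        "false_call_fraction": false_call_fraction,
    }


# Toy data, invented for illustration.
truth_set = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr2", 75, "C")}
platform_calls = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G"), ("chr3", 10, "A")}
metrics = snv_call_metrics(platform_calls, truth_set)
print(metrics)
```

Running the same function on the call set of each platform, against one shared reference set, would give directly comparable figures.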
Bertolini, Francesca; Scimone, Concetta; Geraci, Claudia; Schiavo, Giuseppina; Utzeri, Valerio Joe; Chiofalo, Vincenzo; Fontanesi, Luca
2015-01-01
Few studies have so far investigated the donkey (Equus asinus) at the whole-genome level. Here, we sequenced the genome of two male donkeys using a next generation semiconductor based sequencing platform (the Ion Proton sequencer) and compared the sequence information obtained with the available donkey draft genome (and the Illumina reads from which it was derived) and with the EquCab2.0 assembly of the horse genome. Moreover, the Ion Torrent Personal Genome Analyzer was used to sequence reduced representation libraries (RRL) obtained from a DNA pool including donkeys of different breeds (Grigio Siciliano, Ragusano and Martina Franca). The number of next generation sequencing reads aligned with the EquCab2.0 horse genome was larger than the number aligned with the draft donkey genome. This was due to the larger N50 for contigs and scaffolds of the horse genome. Nucleotide divergence between E. caballus and E. asinus was estimated to be ~0.52-0.57%. Regions with low nucleotide divergence were identified in several autosomal chromosomes and in the whole chromosome X. These regions might be evolutionarily important in equids. Comparing Y-chromosome regions, we identified variants that could be useful to track donkey paternal lineages. Moreover, about 4.8 million single nucleotide polymorphisms (SNPs) in the donkey genome were identified and annotated by combining sequencing data from Ion Proton (whole genome sequencing) and Ion Torrent (RRL) runs with Illumina reads. A higher density of SNPs was present in regions homologous to horse chromosome 12, in which several studies reported a high frequency of copy number variants. The SNPs we identified constitute a first resource useful to describe variability at the population genomic level in E. asinus and to establish monitoring systems for the conservation of donkey genetic resources. PMID:26151450
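The N50 statistic mentioned in this abstract can be sketched as follows. This is a minimal illustration using the common definition (the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the assembly). The abstract does not state which variant the authors used, and the lengths below are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy assembly (hypothetical lengths in bp): total 300, half 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A more fragmented assembly of the same total size yields a smaller N50, and shorter target sequences offer fewer places where a read can align along its full length.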
2011-01-01
Background One of the key goals of oak genomics research is to identify genes of adaptive significance. This information may help to improve the conservation of adaptive genetic variation and the management of forests to increase their health and productivity. Deep-coverage large-insert genomic libraries are a crucial tool for attaining this objective. We report herein the construction of a BAC library for Quercus robur, its characterization and an analysis of BAC end sequences. Results The EcoRI library generated consisted of 92,160 clones, 7% of which had no insert. Levels of chloroplast and mitochondrial contamination were below 3% and 1%, respectively. Mean clone insert size was estimated at 135 kb. The library represents 12 haploid genome equivalents, and the likelihood of finding a particular oak sequence of interest is greater than 99%. Genome coverage was confirmed by PCR screening of the library with 60 unique genetic loci sampled from the genetic linkage map. In total, about 20,000 high-quality BAC end sequences (BESs) were generated by sequencing 15,000 clones. Roughly 5.88% of the combined BAC end sequence length corresponded to known retroelements while ab initio repeat detection methods identified 41 additional repeats. Collectively, characterized and novel repeats account for roughly 8.94% of the genome. Further analysis of the BESs revealed 1,823 putative genes suggesting at least 29,340 genes in the oak genome. BESs were aligned with the genome sequences of Arabidopsis thaliana, Vitis vinifera and Populus trichocarpa. One putative collinear microsyntenic region encoding an alcohol acyl transferase protein was observed between oak and chromosome 2 of V. vinifera. Conclusions This BAC library provides a new resource for genomic studies, including SSR marker development, physical mapping, comparative genomics and genome sequencing. BES analysis provided insight into the structure of the oak genome.
These sequences will be used in the assembly of a future genome sequence for oak. PMID:21645357
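The link between library coverage and the chance of finding a given sequence can be checked with a short calculation. The abstract does not say which formula was used. The sketch below assumes the standard approximation P = 1 - e^(-c) for a random library with c genome equivalents, and the function names are hypothetical.

```python
import math


def clones_with_insert(total_clones, empty_fraction):
    """Number of clones that actually carry an insert."""
    return round(total_clones * (1.0 - empty_fraction))


def probability_sequence_present(genome_equivalents):
    """Assumed model: P = 1 - exp(-c) for a randomly sampled library."""
    return 1.0 - math.exp(-genome_equivalents)


def coverage_needed(target_probability):
    """Genome equivalents needed to reach a target probability (same model)."""
    return -math.log(1.0 - target_probability)


usable_clones = clones_with_insert(92_160, 0.07)  # figures from the abstract
p_found = probability_sequence_present(12)        # 12 genome equivalents
print(usable_clones, p_found, coverage_needed(0.99))
```

Under this model, about 4.6 genome equivalents already give a 99% chance, so a 12-fold library sits comfortably above the threshold quoted in the abstract.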
The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now
Engel, Stacia R.; Dietrich, Fred S.; Fisk, Dianna G.; Binkley, Gail; Balakrishnan, Rama; Costanzo, Maria C.; Dwight, Selina S.; Hitz, Benjamin C.; Karra, Kalpana; Nash, Robert S.; Weng, Shuai; Wong, Edith D.; Lloyd, Paul; Skrzypek, Marek S.; Miyasato, Stuart R.; Simison, Matt; Cherry, J. Michael
2014-01-01
The genome of the budding yeast Saccharomyces cerevisiae was the first completely sequenced from a eukaryote. It was released in 1996 as the work of a worldwide effort of hundreds of researchers. In the time since, the yeast genome has been intensively studied by geneticists, molecular biologists, and computational scientists all over the world. Maintenance and annotation of the genome sequence have long been provided by the Saccharomyces Genome Database, one of the original model organism databases. To deepen our understanding of the eukaryotic genome, the S. cerevisiae strain S288C reference genome sequence was updated recently in its first major update since 1996. The new version, called “S288C 2010,” was determined from a single yeast colony using modern sequencing technologies and serves as the anchor for further innovations in yeast genomic science. PMID:24374639
Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C
2003-01-01
Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626
Real-time, portable genome sequencing for Ebola surveillance.
Quick, Joshua; Loman, Nicholas J; Duraffour, Sophie; Simpson, Jared T; Severi, Ettore; Cowley, Lauren; Bore, Joseph Akoi; Koundouno, Raymond; Dudas, Gytis; Mikhail, Amy; Ouédraogo, Nobila; Afrough, Babak; Bah, Amadou; Baum, Jonathan Hj; Becker-Ziaja, Beate; Boettcher, Jan-Peter; Cabeza-Cabrerizo, Mar; Camino-Sanchez, Alvaro; Carter, Lisa L; Doerrbecker, Juiliane; Enkirch, Theresa; Dorival, Isabel Graciela García; Hetzelt, Nicole; Hinzmann, Julia; Holm, Tobias; Kafetzopoulou, Liana Eleni; Koropogui, Michel; Kosgey, Abigail; Kuisma, Eeva; Logue, Christopher H; Mazzarelli, Antonio; Meisel, Sarah; Mertens, Marc; Michel, Janine; Ngabo, Didier; Nitzsche, Katja; Pallash, Elisa; Patrono, Livia Victoria; Portmann, Jasmine; Repits, Johanna Gabriella; Rickett, Natasha Yasmin; Sachse, Andrea; Singethan, Katrin; Vitoriano, Inês; Yemanaberhan, Rahel L; Zekeng, Elsa G; Trina, Racine; Bello, Alexander; Sall, Amadou Alpha; Faye, Ousmane; Faye, Oumar; Magassouba, N'Faly; Williams, Cecelia V; Amburgey, Victoria; Winona, Linda; Davis, Emily; Gerlach, Jon; Washington, Franck; Monteil, Vanessa; Jourdain, Marine; Bererd, Marion; Camara, Alimou; Somlare, Hermann; Camara, Abdoulaye; Gerard, Marianne; Bado, Guillaume; Baillet, Bernard; Delaune, Déborah; Nebie, Koumpingnin Yacouba; Diarra, Abdoulaye; Savane, Yacouba; Pallawo, Raymond Bernard; Gutierrez, Giovanna Jaramillo; Milhano, Natacha; Roger, Isabelle; Williams, Christopher J; Yattara, Facinet; Lewandowski, Kuiama; Taylor, Jamie; Rachwal, Philip; Turner, Daniel; Pollakis, Georgios; Hiscox, Julian A; Matthews, David A; O'Shea, Matthew K; Johnston, Andrew McD; Wilson, Duncan; Hutley, Emma; Smit, Erasmus; Di Caro, Antonino; Woelfel, Roman; Stoecker, Kilian; Fleischmann, Erna; Gabriel, Martin; Weller, Simon A; Koivogui, Lamine; Diallo, Boubacar; Keita, Sakoba; Rambaut, Andrew; Formenty, Pierre; Gunther, Stephan; Carroll, Miles W
2016-02-11
The Ebola virus disease epidemic in West Africa is the largest on record, responsible for over 28,599 cases and more than 11,299 deaths. Genome sequencing in viral outbreaks is desirable to characterize the infectious agent and determine its evolutionary rate. Genome sequencing also allows the identification of signatures of host adaptation, identification and monitoring of diagnostic targets, and characterization of responses to vaccines and treatments. The Ebola virus (EBOV) genome substitution rate in the Makona strain has been estimated at between 0.87 × 10^-3 and 1.42 × 10^-3 mutations per site per year. This is equivalent to 16-27 mutations in each genome, meaning that sequences diverge rapidly enough to identify distinct sub-lineages during a prolonged epidemic. Genome sequencing provides a high-resolution view of pathogen evolution and is increasingly sought after for outbreak surveillance. Sequence data may be used to guide control measures, but only if the results are generated quickly enough to inform interventions. Genomic surveillance during the epidemic has been sporadic owing to a lack of local sequencing capacity coupled with practical difficulties transporting samples to remote sequencing facilities. To address this problem, here we devise a genomic surveillance system that utilizes a novel nanopore DNA sequencing instrument. In April 2015 this system was transported in standard airline luggage to Guinea and used for real-time genomic surveillance of the ongoing epidemic. We present sequence data and analysis of 142 EBOV samples collected during the period March to October 2015. We were able to generate results less than 24 h after receiving an Ebola-positive sample, with the sequencing process taking as little as 15-60 min. We show that real-time genomic surveillance is possible in resource-limited settings and can be established rapidly to monitor outbreaks.
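The conversion from a per-site substitution rate to mutations per genome is a single multiplication, sketched below. The genome length is not given in the abstract. The value used here, roughly 19,000 nucleotides, is an assumption, and the names are hypothetical.

```python
# Assumption: approximate EBOV genome length in nucleotides (not stated in the abstract).
EBOV_GENOME_LENGTH_NT = 19_000


def substitutions_per_genome_per_year(rate_per_site_per_year, genome_length_nt):
    """Expected substitutions accumulated across the whole genome in one year."""
    return rate_per_site_per_year * genome_length_nt


low = substitutions_per_genome_per_year(0.87e-3, EBOV_GENOME_LENGTH_NT)
high = substitutions_per_genome_per_year(1.42e-3, EBOV_GENOME_LENGTH_NT)
print(f"{low:.1f} to {high:.1f} substitutions per genome per year")
```

With this assumed length, the two rate bounds land inside the 16-27 range quoted in the abstract, read as mutations per genome per year.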
Whole-genome sequencing and genetic variant analysis of a Quarter Horse mare.
Doan, Ryan; Cohen, Noah D; Sawyer, Jason; Ghaffari, Noushin; Johnson, Charlie D; Dindot, Scott V
2012-02-17
The catalog of genetic variants in the horse genome originates from a few select animals, the majority originating from the Thoroughbred mare used for the equine genome sequencing project. The purpose of this study was to identify genetic variants, including single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (INDELs), and copy number variants (CNVs) in the genome of an individual Quarter Horse mare sequenced by next-generation sequencing. Using massively parallel paired-end sequencing, we generated 59.6 Gb of DNA sequence from a Quarter Horse mare resulting in an average of 24.7X sequence coverage. Reads were mapped to approximately 97% of the reference Thoroughbred genome. Unmapped reads were de novo assembled resulting in 19.1 Mb of new genomic sequence in the horse. Using a stringent filtering method, we identified 3.1 million SNPs, 193 thousand INDELs, and 282 CNVs. Genetic variants were annotated to determine their impact on gene structure and function. Additionally, we genotyped this Quarter Horse for mutations of known diseases and for variants associated with particular traits. Functional clustering analysis of genetic variants revealed that most of the genetic variation in the horse's genome was enriched in sensory perception, signal transduction, and immunity and defense pathways. This is the first sequencing of a horse genome by next-generation sequencing and the first genomic sequence of an individual Quarter Horse mare. We have increased the catalog of genetic variants for use in equine genomics by the addition of novel SNPs, INDELs, and CNVs. The genetic variants described here will be a useful resource for future studies of genetic variation regulating performance traits and diseases in equids.
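The coverage figure in this abstract follows from the usual relation: average depth equals total sequenced bases divided by genome length. The sketch below rearranges that relation to show the genome length implied by the two reported numbers. The abstract does not state which reference length was used as the denominator, so the result is only the implied value, and the function names are hypothetical.

```python
def average_depth(total_bases_gb, genome_length_gb):
    """Average sequencing depth (X) from total yield and genome length."""
    return total_bases_gb / genome_length_gb


def implied_genome_length_gb(total_bases_gb, depth):
    """Genome length implied by a reported yield and average depth."""
    return total_bases_gb / depth


implied_gb = implied_genome_length_gb(59.6, 24.7)  # figures from the abstract
print(f"implied genome length: {implied_gb:.2f} Gb")
print(f"round trip depth: {average_depth(59.6, implied_gb):.1f}X")
```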
The complete chloroplast genome sequence of the medicinal plant Salvia miltiorrhiza.
Qian, Jun; Song, Jingyuan; Gao, Huanhuan; Zhu, Yingjie; Xu, Jiang; Pang, Xiaohui; Yao, Hui; Sun, Chao; Li, Xian'en; Li, Chuyuan; Liu, Juyan; Xu, Haibin; Chen, Shilin
2013-01-01
Salvia miltiorrhiza is an important medicinal plant with great economic and medicinal value. The complete chloroplast (cp) genome sequence of Salvia miltiorrhiza, the first sequenced member of the Lamiaceae family, is reported here. The genome is 151,328 bp in length and exhibits a typical quadripartite structure of the large (LSC, 82,695 bp) and small (SSC, 17,555 bp) single-copy regions, separated by a pair of inverted repeats (IRs, 25,539 bp). It contains 114 unique genes, including 80 protein-coding genes, 30 tRNAs and four rRNAs. The genome structure, gene order, GC content and codon usage are similar to the typical angiosperm cp genomes. Four forward, three inverted and seven tandem repeats were detected in the Salvia miltiorrhiza cp genome. Simple sequence repeat (SSR) analysis among the 30 asterid cp genomes revealed that most SSRs are AT-rich, which contribute to the overall AT richness of these cp genomes. Additionally, fewer SSRs are distributed in the protein-coding sequences compared to the non-coding regions, indicating an uneven distribution of SSRs within the cp genomes. Entire cp genome comparison of Salvia miltiorrhiza and three other Lamiales cp genomes showed a high degree of sequence similarity and a relatively high divergence of intergenic spacers. Sequence divergence analysis discovered the ten most divergent and ten most conserved genes as well as their length variation, which will be helpful for phylogenetic studies in asterids. Our analysis also supports that both regional and functional constraints affect gene sequence evolution. Further, phylogenetic analysis demonstrated a sister relationship between Salvia miltiorrhiza and Sesamum indicum. The complete cp genome sequence of Salvia miltiorrhiza reported in this paper will facilitate population, phylogenetic and cp genetic engineering studies of this medicinal plant.
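The quadripartite structure described above can be checked arithmetically: the total length should equal the large single-copy region plus the small single-copy region plus two copies of the inverted repeat. A minimal sketch using only the figures in the abstract (the variable names are hypothetical):

```python
# Region lengths in bp, as reported in the abstract.
LSC_BP = 82_695      # large single-copy region
SSC_BP = 17_555      # small single-copy region
IR_BP = 25_539       # one inverted repeat; the genome carries two copies
REPORTED_TOTAL_BP = 151_328

computed_total_bp = LSC_BP + SSC_BP + 2 * IR_BP
ir_fraction = 2 * IR_BP / computed_total_bp

# Gene counts, as reported: protein-coding genes, tRNAs, rRNAs.
unique_genes = 80 + 30 + 4

print(computed_total_bp, computed_total_bp == REPORTED_TOTAL_BP)
print(f"inverted repeats make up {ir_fraction:.1%} of the genome")
print(unique_genes)
```

Both the region lengths and the gene counts add up to the reported totals of 151,328 bp and 114 unique genes.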
Jaeckisch, Nina; Yang, Ines; Wohlrab, Sylke; Glöckner, Gernot; Kroymann, Juergen; Vogel, Heiko; Cembella, Allan; John, Uwe
2011-01-01
Many dinoflagellate species are notorious for the toxins they produce and ecological and human health consequences associated with harmful algal blooms (HABs). Dinoflagellates are particularly refractory to genomic analysis due to the enormous genome size, lack of knowledge about their DNA composition and structure, and peculiarities of gene regulation, such as spliced leader (SL) trans-splicing and mRNA transposition mechanisms. Alexandrium ostenfeldii is known to produce macrocyclic imine toxins, described as spirolides. We characterized the genome of A. ostenfeldii using a combination of transcriptomic data and random genomic clones for comparison with other dinoflagellates, particularly Alexandrium species. Examination of SL sequences revealed similar features as in other dinoflagellates, including Alexandrium species. SL sequences in decay indicate frequent retro-transposition of mRNA species. This probably contributes to overall genome complexity by generating additional gene copies. Sequencing of several thousand fosmid and bacterial artificial chromosome (BAC) ends yielded a wealth of simple repeats and tandemly repeated longer sequence stretches which we estimated to comprise more than half of the whole genome. Surprisingly, the repeats comprise a very limited set of 79–97 bp sequences; in part the genome is thus a relatively uniform sequence space interrupted by coding sequences. Our genomic sequence survey (GSS) represents the largest genomic data set of a dinoflagellate to date. Alexandrium ostenfeldii is a typical dinoflagellate with respect to its transcriptome and mRNA transposition but demonstrates Alexandrium-like stop codon usage. The large portion of repetitive sequences and the organization within the genome is in agreement with several other studies on dinoflagellates using different approaches. 
It remains to be determined whether this unusual composition is directly correlated with the exceptional genome organization of dinoflagellates with a low amount of histones and histone-like proteins. PMID:22164224
The Douglas-fir genome sequence reveals specialization of the photosynthetic apparatus in Pinaceae
David B. Neale; Patrick E. McGuire; Nicholas C. Wheeler; Kristian A. Stevens; Marc W. Crepeau; Charis Cardeno; Aleksey V. Zimin; Daniela Puiu; Geo M. Pertea; U. Uzay Sezen; Claudio Casola; Tomasz E. Koralewski; Robin Paul; Daniel Gonzalez-Ibeas; Sumaira Zaman; Richard Cronn; Mark Yandell; Carson Holt; Charles H. Langley; James A. Yorke; Steven L. Salzberg; Jill L. Wegrzyn
2017-01-01
A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceeds that of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ghodhbane-Gtari, Faten; Beauchemin, Nicholas; Bruce, David
2013-01-01
We report here the genome sequence of Frankia sp. strain CN3, which was isolated from Coriaria nepalensis. This genome sequence is the first from the fourth Frankia lineage, whose members are unable to re-infect actinorhizal plants. At 10 Mb, it represents the largest Frankia genome sequenced to date.
Lu, You; Samac, Deborah A; Glazebrook, Jane; Ishimaru, Carol A
2015-05-07
We report here the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1, isolated in Minnesota, USA. The R1-1 genome, generated by a de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies. Copyright © 2015 Lu et al.