Shafiee, Mohammad Javad; Chung, Audrey G; Khalvati, Farzad; Haider, Masoom A; Wong, Alexander
2017-10-01
While lung cancer is the second most diagnosed form of cancer in men and women, a sufficiently early diagnosis can be pivotal in patient survival rates. Imaging-based, or radiomics-driven, detection methods have been developed to aid diagnosticians, but largely rely on hand-crafted features that may not fully encapsulate the differences between cancerous and healthy tissue. Recently, the concept of discovery radiomics was introduced, where custom abstract features are discovered from readily available imaging data. We propose an evolutionary deep radiomic sequencer discovery approach based on evolutionary deep intelligence. Motivated by patient privacy concerns and the idea of operational artificial intelligence, the evolutionary deep radiomic sequencer discovery approach organically evolves increasingly more efficient deep radiomic sequencers that produce significantly more compact yet similarly descriptive radiomic sequences over multiple generations. As a result, this framework improves operational efficiency and enables diagnosis to be run locally at the radiologist's computer while maintaining detection accuracy. We evaluated the evolved deep radiomic sequencer (EDRS) discovered via the proposed evolutionary deep radiomic sequencer discovery framework against state-of-the-art radiomics-driven and discovery radiomics methods using clinical lung CT data with pathologically proven diagnostic data from the LIDC-IDRI dataset. The EDRS shows improved sensitivity (93.42%), specificity (82.39%), and diagnostic accuracy (88.78%) relative to previous radiomics approaches.
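The sensitivity, specificity, and diagnostic accuracy quoted above follow the standard confusion-matrix definitions. A minimal sketch of those definitions is given below; the function name and the counts in the usage line are hypothetical and are not taken from the LIDC-IDRI evaluation.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: cancerous cases classified correctly/incorrectly.
    tn/fp: non-cancerous cases classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # overall fraction correct
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```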
DeepBase: annotation and discovery of microRNAs and other noncoding RNAs from deep-sequencing data.
Yang, Jian-Hua; Qu, Liang-Hu
2012-01-01
Recent advances in high-throughput deep-sequencing technology have produced large numbers of short and long RNA sequences and enabled the detection and profiling of known and novel microRNAs (miRNAs) and other noncoding RNAs (ncRNAs) at unprecedented sensitivity and depth. In this chapter, we describe the use of deepBase, a database that we have developed to integrate all public deep-sequencing data and to facilitate the comprehensive annotation and discovery of miRNAs and other ncRNAs from these data. deepBase provides an integrative, interactive, and versatile web graphical interface to evaluate miRBase-annotated miRNA genes and other known ncRNAs, explore the expression patterns of miRNAs and other ncRNAs, and discover novel miRNAs and other ncRNAs from deep-sequencing data. deepBase also provides a deepView genome browser to comparatively analyze these data at multiple levels. deepBase is available at http://deepbase.sysu.edu.cn/.
VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs
USDA-ARS's Scientific Manuscript database
Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived siRNAs has proven to be a highly efficient approach for virus discovery. However, to date no computational tools specifically designed for both k...
SNP discovery through de novo deep sequencing using the next generation of DNA sequencers
USDA-ARS's Scientific Manuscript database
The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....
Speth, Daan R; Lagkouvardos, Ilias; Wang, Yong; Qian, Pei-Yuan; Dutilh, Bas E; Jetten, Mike S M
2017-07-01
Several recent studies have indicated that members of the phylum Planctomycetes are abundantly present at the brine-seawater interface (BSI) above multiple brine pools in the Red Sea. Planctomycetes include bacteria capable of anaerobic ammonium oxidation (anammox). Here, we investigated the possibility of anammox at BSI sites using metagenomic shotgun sequencing of DNA obtained from the BSI above the Discovery Deep brine pool. Analysis of sequencing reads matching the 16S rRNA and hzsA genes confirmed presence of anammox bacteria of the genus Scalindua. Phylogenetic analysis of the 16S rRNA gene indicated that this Scalindua sp. belongs to a distinct group, separate from the anammox bacteria in the seawater column, that contains mostly sequences retrieved from high-salt environments. Using coverage- and composition-based binning, we extracted and assembled the draft genome of the dominant anammox bacterium. Comparative genomic analysis indicated that this Scalindua species uses compatible solutes for osmoadaptation, in contrast to other marine anammox bacteria that likely use a salt-in strategy. We propose the name Candidatus Scalindua rubra for this novel species, alluding to its discovery in the Red Sea.
Dissecting enzyme function with microfluidic-based deep mutational scanning.
Romero, Philip A; Tran, Tuan M; Abate, Adam R
2015-06-09
Natural enzymes are incredibly proficient catalysts, but engineering them to have new or improved functions is challenging due to the complexity of how an enzyme's sequence relates to its biochemical properties. Here, we present an ultrahigh-throughput method for mapping enzyme sequence-function relationships that combines droplet microfluidic screening with next-generation DNA sequencing. We apply our method to map the activity of millions of glycosidase sequence variants. Microfluidic-based deep mutational scanning provides a comprehensive and unbiased view of the enzyme function landscape. The mapping displays expected patterns of mutational tolerance and a strong correspondence to sequence variation within the enzyme family, but also reveals previously unreported sites that are crucial for glycosidase function. We modified the screening protocol to include a high-temperature incubation step, and the resulting thermotolerance landscape allowed the discovery of mutations that enhance enzyme thermostability. Droplet microfluidics provides a general platform for enzyme screening that, when combined with DNA-sequencing technologies, enables high-throughput mapping of enzyme sequence space.
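Deep mutational scanning data of this kind are commonly summarized as a per-variant enrichment score that compares sequencing counts before and after screening. The sketch below uses a log2 frequency ratio with an optional pseudocount; this scoring scheme, the function name, and the variant labels and counts are illustrative assumptions, not the procedure reported in the abstract.

```python
from math import log2


def enrichment_scores(pre_counts, post_counts, pseudocount=0.5):
    """Log2 enrichment of each variant after screening (assumed scoring scheme).

    pre_counts / post_counts map a variant label to its read count in the
    unsorted and sorted libraries. A positive score means the variant became
    more frequent after screening.
    """
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    scores = {}
    for variant, n_pre in pre_counts.items():
        # The pseudocount avoids log(0) for variants absent after sorting.
        f_pre = (n_pre + pseudocount) / pre_total
        f_post = (post_counts.get(variant, 0) + pseudocount) / post_total
        scores[variant] = log2(f_post / f_pre)
    return scores


# Hypothetical variant labels and counts, for illustration only.
pre = {"WT": 1000, "A10G": 500, "L55P": 500}
post = {"WT": 1200, "A10G": 780, "L55P": 20}
for variant, score in enrichment_scores(pre, post).items():
    print(variant, round(score, 2))
```

In this toy input the variant labelled L55P is strongly depleted, which is the pattern one would expect for a mutation at a site that is crucial for function.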
Comprehensive discovery of noncoding RNAs in acute myeloid leukemia cell transcriptomes.
Zhang, Jin; Griffith, Malachi; Miller, Christopher A; Griffith, Obi L; Spencer, David H; Walker, Jason R; Magrini, Vincent; McGrath, Sean D; Ly, Amy; Helton, Nichole M; Trissal, Maria; Link, Daniel C; Dang, Ha X; Larson, David E; Kulkarni, Shashikant; Cordes, Matthew G; Fronick, Catrina C; Fulton, Robert S; Klco, Jeffery M; Mardis, Elaine R; Ley, Timothy J; Wilson, Richard K; Maher, Christopher A
2017-11-01
To detect diverse and novel RNA species comprehensively, we compared deep small RNA and RNA sequencing (RNA-seq) methods applied to a primary acute myeloid leukemia (AML) sample. We were able to discover previously unannotated small RNAs by deep sequencing a library prepared with broader insert size selection. We analyzed the long noncoding RNA (lncRNA) landscape in AML by comparing deep sequencing from multiple RNA-seq library construction methods for the sample that we studied and then integrating RNA-seq data from 179 AML cases. This identified lncRNAs that are completely novel, differentially expressed, and associated with specific AML subtypes. Our study revealed the complexity of the noncoding RNA transcriptome through a combined strategy of strand-specific small RNA and total RNA-seq. This dataset will serve as an invaluable resource for future RNA-based analyses. Copyright © 2017 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Zhang, Xiao-Yong; Wang, Guang-Hua; Xu, Xin-Ya; Nong, Xu-Hua; Wang, Jie; Amin, Muhammad; Qi, Shu-Hua
2016-10-01
The present study investigated the fungal diversity in four different deep-sea sediments from the Okinawa Trough using high-throughput Illumina sequencing of the nuclear ribosomal internal transcribed spacer-1 (ITS1). A total of 40,297 fungal ITS1 sequences clustered into 420 operational taxonomic units (OTUs) at 97% sequence similarity, and 170 taxa were recovered from these sediments. Most ITS1 sequences (78%) belonged to the phylum Ascomycota, followed by Basidiomycota (17.3%), Zygomycota (1.5%) and Chytridiomycota (0.8%), and a small proportion (2.4%) belonged to unassigned fungal phyla. Compared with previous studies of fungal diversity in deep-sea sediments that used culture-dependent approaches and clone library analysis, the present results suggest that Illumina sequencing is dramatically accelerating the discovery of fungal communities in deep-sea sediments. Furthermore, our results revealed that Sordariomycetes was the most diverse and abundant fungal class in this study, challenging the traditional view that the diversity of Sordariomycetes phylotypes is low in deep-sea environments. In addition, more than 12 taxa, accounting for 21.5% of the sequences, have rarely been reported as deep-sea fungi, suggesting that the deep-sea sediments from the Okinawa Trough harbor a plethora of fungal communities that differ from those of other deep-sea environments. To our knowledge, this study is the first exploration of the fungal diversity in deep-sea sediments from the Okinawa Trough using high-throughput Illumina sequencing.
Vernick, Kenneth D.
2017-01-01
Metavisitor is a software package that allows biologists and clinicians without specialized bioinformatics expertise to detect and assemble viral genomes from deep sequence datasets. The package is composed of a set of modular bioinformatic tools and workflows that are implemented in the Galaxy framework. Using the graphical Galaxy workflow editor, users with minimal computational skills can use existing Metavisitor workflows or adapt them to suit specific needs by adding or modifying analysis modules. Metavisitor works with DNA, RNA or small RNA sequencing data over a range of read lengths and can use a combination of de novo and guided approaches to assemble genomes from sequencing reads. We show that the software has the potential for quick diagnosis as well as discovery of viruses from a vast array of organisms. Importantly, we provide here executable Metavisitor use cases, which increase the accessibility and transparency of the software, ultimately enabling biologists or clinicians to focus on biological or medical questions. PMID:28045932
USDA-ARS's Scientific Manuscript database
The soybean Consensus Map 4.0 facilitated the anchoring of 95.6% of the soybean whole genome sequence developed by the Joint Genome Institute, Department of Energy but only properly oriented 66% of the sequence scaffolds. To find additional single nucleotide polymorphism (SNP) markers for additiona...
miRBase: integrating microRNA annotation and deep-sequencing data.
Kozomara, Ana; Griffiths-Jones, Sam
2011-01-01
miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15,000 microRNA gene loci in over 140 species, and over 17,000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/.
Zhang, Hanyuan; Vieira Resende E Silva, Bruno; Cui, Juan
2018-05-01
Small RNA sequencing is the most widely used tool for microRNA (miRNA) discovery, and shows great potential for the efficient study of miRNA cross-species transport, i.e., by detecting the presence of exogenous miRNA sequences in the host species. Because of the increased appreciation of dietary miRNAs and their far-reaching implication in human health, research interests are currently growing with regard to exogenous miRNAs bioavailability, mechanisms of cross-species transport and miRNA function in cellular biological processes. In this article, we present microRNA Discovery (miRDis), a new small RNA sequencing data analysis pipeline for both endogenous and exogenous miRNA detection. Specifically, we developed and deployed a Web service that supports the annotation and expression profiling data of known host miRNAs and the detection of novel miRNAs, other noncoding RNAs, and the exogenous miRNAs from dietary species. As a proof-of-concept, we analyzed a set of human plasma sequencing data from a milk-feeding study where 225 human miRNAs were detected in the plasma samples and 44 show elevated expression after milk intake. By examining the bovine-specific sequences, data indicate that three bovine miRNAs (bta-miR-378, -181* and -150) are present in human plasma possibly because of the dietary uptake. Further evaluation based on different sets of public data demonstrates that miRDis outperforms other state-of-the-art tools in both detection and quantification of miRNA from either animal or plant sources. The miRDis Web server is available at: http://sbbi.unl.edu/miRDis/index.php.
Samad, Abdul Fatah A; Nazaruddin, Nazaruddin; Murad, Abdul Munir Abdul; Jani, Jaeyres; Zainal, Zamri; Ismail, Ismanizan
2018-03-01
In the current era, the majority of microRNAs (miRNAs) are discovered through computational approaches, which are mostly confined to model plants. Here, for the first time, we describe the identification and characterization of novel miRNAs in a non-model plant, Persicaria minor (P. minor), using a computational approach. Unannotated sequences from deep sequencing were analyzed based on previously well-established parameters. Around 24 putative novel miRNAs were identified from 6,417,780 reads of the unannotated sequence, which represented 11 unique putative miRNA sequences. The PsRobot target prediction tool was deployed to identify the target transcripts of the putative novel miRNAs. Most of the predicted target transcripts (mRNAs) were known to be involved in plant development and stress responses. Gene ontology showed that the majority of the putative novel miRNA targets were involved in cellular component (69.07%), followed by molecular function (30.08%) and biological process (0.85%). Out of 11 unique putative miRNAs, 7 were validated through semi-quantitative PCR. These novel miRNA discoveries in P. minor may expand and update the current public miRNA database.
Verhoeven, Joost Theo Petra; Canuti, Marta; Munro, Hannah J; Dufour, Suzanne C; Lang, Andrew S
2018-04-19
High-throughput sequencing (HTS) technologies are becoming increasingly important within microbiology research, but aspects of library preparation, such as high cost per sample or strict input requirements, make HTS difficult to implement in some niche applications and for research groups on a budget. To address these needs, we developed ViDiT, a customizable, PCR-based, extremely low-cost (<5 US dollars per sample) and versatile library preparation method, and CACTUS, an analysis pipeline designed to rely on cloud computing power to generate high-quality data from ViDiT-based experiments without the need for expensive servers. We demonstrate here the versatility and utility of these methods within three fields of microbiology: virus discovery, amplicon-based viral genome sequencing and microbiome profiling. ViDiT-CACTUS allowed the identification of viral fragments from 25 different viral families from 36 oropharyngeal-cloacal swabs collected from wild birds, the sequencing of three almost complete genomes of avian influenza A viruses (>90% coverage), and the characterization and functional profiling of the complete microbial diversity (bacteria, archaea, viruses) within a deep-sea carnivorous sponge. ViDiT-CACTUS demonstrated its validity in a wide range of microbiology applications, and its simplicity and modularity make it easy to implement in any molecular biology laboratory for various research goals.
Single-Cell Sequencing for Drug Discovery and Drug Development.
Wu, Hongjin; Wang, Charles; Wu, Shixiu
2017-01-01
Next-generation sequencing (NGS), particularly single-cell sequencing, has revolutionized the scale and scope of genomic and biomedical research. Recent technological advances in NGS and single-cell studies have made deep whole-genome (DNA-seq), whole-epigenome and whole-transcriptome sequencing (RNA-seq) at the single-cell level feasible. NGS at the single-cell level expands our view of the genome, epigenome and transcriptome and allows the genome, epigenome and transcriptome of any organism to be explored without a priori assumptions and with unprecedented throughput, and it does so with single-nucleotide resolution. NGS is also a very powerful tool for drug discovery and drug development. In this review, we describe the current state of single-cell sequencing techniques, which can provide a new, more powerful and precise approach for analyzing the effects of drugs on treated cells and tissues. Our review discusses single-cell whole genome/exome sequencing (scWGS/scWES), single-cell transcriptome sequencing (scRNA-seq), single-cell bisulfite sequencing (scBS), and multiple omics of single-cell sequencing. We also highlight the advantages and challenges of each of these approaches. Finally, we describe, elaborate and speculate on the potential applications of single-cell sequencing for drug discovery and drug development. Copyright © Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Buschmann, Tilo; Zhang, Rong; Brash, Douglas E; Bystrykh, Leonid V
2014-08-07
DNA barcodes are short unique sequences used to label DNA or RNA-derived samples in multiplexed deep sequencing experiments. During the demultiplexing step, barcodes must be detected and their position identified. In some cases (e.g., with PacBio SMRT), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives. For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements. In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. This problem became more acute when the length of the barcode sequence decreased and the number of barcodes in the set increased. The method presented in this paper controls the tail area-based false discovery rate to distinguish between barcoded and unbarcoded reads. This method helps to establish the highest acceptable minimal distance between reads and barcode sequences. In a proof of concept experiment we correctly detected barcodes in 83% of the reads with a precision of 89%. Sensitivity improved to 99% at 99% precision when the adjacent primer sequence was incorporated in the analysis. The analysis was further improved using a paired end strategy. Following an analysis of the data for sequence variants induced in the Atp1a1 gene of C57BL/6 murine melanocytes by ultraviolet light and conferring resistance to ouabain, we found no evidence of cross-contamination of DNA material between samples.
Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. It is based on the false discovery rate statistics that allows a proper trade-off between sensitivity and precision to be chosen.
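The notion of a minimal distance between a read and a barcode, with a cut-off above which a read is treated as unbarcoded, can be illustrated with a short sketch. The version below slides each barcode along the read and uses Hamming distance; the distance metric, the function names, the barcode set, and the threshold are all assumptions made for illustration. It does not implement the tail-area FDR procedure described in the abstract, which is what would be used to choose the threshold.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_barcode_distance(read, barcode):
    """Smallest Hamming distance between the barcode and any window of the read.

    The barcode position is treated as unknown, so every window is checked.
    Reads shorter than the barcode get the maximum possible distance.
    """
    k = len(barcode)
    if len(read) < k:
        return k
    return min(hamming(read[i:i + k], barcode) for i in range(len(read) - k + 1))


def assign_barcode(read, barcodes, max_dist):
    """Return (barcode_name, distance), or (None, distance) if above max_dist."""
    distances = {name: min_barcode_distance(read, seq) for name, seq in barcodes.items()}
    best = min(distances, key=distances.get)
    d = distances[best]
    return (best, d) if d <= max_dist else (None, d)


# Hypothetical barcode set and reads, for illustration only.
barcodes = {"bc1": "ACGTAC", "bc2": "TTGGCC"}
print(assign_barcode("GGACGTACGG", barcodes, max_dist=1))
print(assign_barcode("AAAAAAAAAA", barcodes, max_dist=1))
```

A larger `max_dist` raises sensitivity at the cost of precision, which is the trade-off that the FDR-based choice of threshold is meant to control.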
Evolutionary process of deep-sea bathymodiolus mussels.
Miyazaki, Jun-Ichi; de Oliveira Martins, Leonardo; Fujita, Yuko; Matsumoto, Hiroto; Fujiwara, Yoshihiro
2010-04-27
Since the discovery of deep-sea chemosynthesis-based communities, much work has been done to clarify their organismal and environmental aspects. However, major topics remain to be resolved, including when and how organisms invade and adapt to deep-sea environments; whether strategies for invasion and adaptation are shared by different taxa or unique to each taxon; how organisms extend their distribution and diversity; and how they become isolated to speciate in continuous waters. Deep-sea mussels are one of the dominant organisms in chemosynthesis-based communities, thus investigations of their origin and evolution contribute to resolving questions about life in those communities. We investigated worldwide phylogenetic relationships of deep-sea Bathymodiolus mussels and their mytilid relatives by analyzing nucleotide sequences of the mitochondrial cytochrome c oxidase subunit I (COI) and NADH dehydrogenase subunit 4 (ND4) genes. Phylogenetic analysis of the concatenated sequence data showed that mussels of the subfamily Bathymodiolinae from vents and seeps were divided into four groups, and that mussels of the subfamily Modiolinae from sunken wood and whale carcasses assumed the outgroup position and shallow-water modioline mussels were positioned more distantly to the bathymodioline mussels. We provisionally hypothesized the evolutionary history of Bathymodiolus mussels by estimating evolutionary time under a relaxed molecular clock model. Diversification of bathymodioline mussels was initiated in the early Miocene, and subsequently diversification of the groups occurred in the early to middle Miocene. The phylogenetic relationships support the "Evolutionary stepping stone hypothesis," in which mytilid ancestors exploited sunken wood and whale carcasses in their progressive adaptation to deep-sea environments.
This hypothesis is also supported by the evolutionary transition of symbiosis in that nutritional adaptation to the deep sea proceeded from extracellular to intracellular symbiotic states in whale carcasses. The estimated evolutionary time suggests that the mytilid ancestors were able to exploit whales during adaptation to the deep sea.
Castro, Rosario; Navelsaker, Sofie; Krasnov, Aleksei; Du Pasquier, Louis; Boudinot, Pierre
2017-10-01
During the last decades, gene and cDNA cloning identified TCR and Ig genes across vertebrates; genome sequencing of TCR and Ig loci in many species revealed the different organizations selected during evolution under the pressure of generating diverse repertoires of Ag receptors. By detecting clonotypes over a wide range of frequency, deep sequencing of Ig and TCR transcripts provides a new way to compare the structure of expressed repertoires in species of various sizes, at different stages of development, with different physiologies, and displaying multiple adaptations to the environment. In this review, we provide a short overview of the technologies currently used to produce global description of immune repertoires, describe how they have already been used in comparative immunology, and we discuss the future potential of such approaches. The development of these methodologies in new species holds promise for new discoveries concerning particular adaptations. As an example, understanding the development of adaptive immunity across metamorphosis in frogs has been made possible by such approaches. Repertoire sequencing is now widely used, not only in basic research but also in the context of immunotherapy and vaccination. Analysis of fish responses to pathogens and vaccines has already benefited from these methods. Finally, we also discuss potential advances based on repertoire sequencing of multigene families of immune sensors and effectors in invertebrates. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ethnobotany and Medicinal Plant Biotechnology: From Tradition to Modern Aspects of Drug Development.
Kayser, Oliver
2018-05-24
Secondary natural products from plants are important leads for the development of new drug candidates for rational clinical therapy; they exhibit a variety of biological activities in experimental pharmacology and serve as structural templates in medicinal chemistry. The exploration of plants and the discovery of natural compounds based on ethnopharmacology, in combination with highly sophisticated analytics, is still an important drug discovery approach for characterizing and validating potential leads. Due to structural complexity, low abundance in biological material, and high costs of chemical synthesis, alternative production routes such as plant cell cultures, heterologous biosynthesis, and synthetic biotechnology are applied. The basis for any biotechnological process is deep knowledge of the genetic regulation of pathways and protein expression with regard to today's "omics" technologies. The large number of genetic techniques has allowed the implementation of combinatorial biosynthesis and wide genome sequencing. Consequently, genetics has allowed the functional expression of biosynthetic cascades from plants and the reconstitution of low-performing pathways in more productive heterologous microorganisms. Thus, de novo biosynthesis in heterologous hosts requires a fundamental understanding of pathway reconstruction and of the multitude of genes in a foreign organism. Here, current concepts and strategies are discussed for pathway reconstruction, genome sequencing techniques, and cloning tools to bridge the gap between ethnopharmaceutical drug discovery and industrial biotechnology. Georg Thieme Verlag KG Stuttgart · New York.
You, Ronghui; Huang, Xiaodi; Zhu, Shanfeng
2018-06-06
As of April 2018, UniProtKB has collected more than 115 million protein sequences. Less than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. The previous studies conclude that sequence homology based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. Other than sequences, alternative information sources such as text, however, may be useful for AFP as well. Instead of using BOW (bag of words) representation in traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on the benchmark dataset extracted from UniProt/SwissProt have demonstrated that DeepText2GO significantly outperformed both text-based and sequence-based methods, validating its superiority. Copyright © 2018 Elsevier Inc. All rights reserved.
De novo characterization of Lentinula edodes C(91-3) transcriptome by deep Solexa sequencing.
Zhong, Mintao; Liu, Ben; Wang, Xiaoli; Liu, Lei; Lun, Yongzhi; Li, Xingyun; Ning, Anhong; Cao, Jing; Huang, Min
2013-02-01
Lentinula edodes has been used as food as well as in popular medicine; moreover, extracts isolated from its mycelium and fruiting body have shown several therapeutic properties. Yet little is understood about the genes involved in these properties, and the absence of L. edodes genomes has been a barrier to the development of functional genomics research. However, high-throughput sequencing technologies are now being widely applied to non-model species. To facilitate research on L. edodes, we leveraged Solexa sequencing technology for de novo assembly of the L. edodes C(91-3) transcriptome. In a single run, we produced more than 57 million sequencing reads. These reads were assembled into 28,923 unigene sequences (mean size = 689 bp), including 18,120 unigenes with coding sequence (CDS). Based on similarity search with known proteins, assembled unigene sequences were annotated with gene descriptions, gene ontology (GO) and clusters of orthologous group (COG) terms. Our data provide the first comprehensive sequence resource available for functional genomics studies in L. edodes, and demonstrate the utility of Illumina/Solexa sequencing for de novo transcriptome characterization and gene discovery in a non-model mushroom. Copyright © 2012 Elsevier Inc. All rights reserved.
Discovery of a large-scale clumpy structure around the Lynx supercluster at z ~ 1.27
NASA Astrophysics Data System (ADS)
Nakata, Fumiaki; Kodama, Tadayuki; Shimasaku, Kazuhiro; Doi, Mamoru; Furusawa, Hisanori; Hamabe, Masaru; Kimura, Masahiko; Komiyama, Yutaka; Miyazaki, Satoshi; Okamura, Sadanori; Ouchi, Masami; Sekiguchi, Maki; Ueda, Yoshihiro; Yagi, Masafumi; Yasuda, Naoki
2005-03-01
We report the discovery of a probable large-scale structure composed of many galaxy clumps around the known twin clusters at z = 1.26 and 1.27 in the Lynx region. Our analysis is based on deep, panoramic, and multicolour imaging, 26.4 × 24.1 arcmin² in VRi'z' bands with the Suprime-Cam on the 8.2-m Subaru telescope. This unique, deep and wide-field imaging data set allows us for the first time to map out the galaxy distribution in the highest-redshift supercluster known. We apply a photometric redshift technique to extract plausible cluster members at z ~ 1.27 down to i' = 26.15 (5σ) corresponding to ~M* + 2.5 at this redshift. From the two-dimensional distribution of these photometrically selected galaxies, we newly identify seven candidates of galaxy groups or clusters where the surface density of red galaxies is significantly high (>5σ), in addition to the two known clusters. These candidates show clear red colour-magnitude sequences consistent with a passive evolution model, which suggests the existence of additional high-density regions around the Lynx superclusters.
Huang, Shunmou; Yang, Hongli; Zhan, Gaomiao; Wang, Xinfa; Liu, Guihua; Wang, Hanzhong
2012-01-01
Background: Single nucleotide polymorphisms (SNPs) are an important class of genetic marker for target gene mapping. As of yet, there is no rapid and effective method to identify SNPs linked with agronomic traits in rapeseed and other crop species. Methodology/Principal Findings: We demonstrate a novel method for identifying SNP markers in rapeseed by deep sequencing a representative library and performing bulk segregant analysis. With this method, SNPs associated with rapeseed pod shatter-resistance were discovered. Firstly, a reduced representation of the rapeseed genome was used. Genomic fragments ranging from 450–550 bp were prepared from the susceptible bulk (ten F2 plants with the silique shattering resistance index, SSRI <0.10) and the resistance bulk (ten F2 plants with SSRI >0.90), and Solexa sequencing produced 90 bp reads. Approximately 50 million of these sequence reads were assembled into contigs to a depth of 20-fold coverage. Secondly, 60,396 ‘simple SNPs’ were identified, and the statistical significance was evaluated using Fisher's exact test. The 70 associated SNPs with a –log10 p value over 16 were selected for further analysis. These SNPs formed a tight cluster, which consisted of 14 associated SNPs within a 396 kb region on chromosome A09. Our evidence indicates that this region contains a major quantitative trait locus (QTL). Finally, two associated SNPs from this region were mapped on a major QTL region. Conclusions/Significance: 70 associated SNPs were discovered and a major QTL for rapeseed pod shatter-resistance was found on chromosome A09 using our novel method. The associated SNP markers were used for mapping of the QTL, and may be useful for improving pod shatter-resistance in rapeseed through marker-assisted selection and map-based cloning. This approach will accelerate the discovery of major QTLs and the cloning of functional genes for important agronomic traits in rapeseed and other crop species.
PMID:22529909
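The association test named in the rapeseed abstract above can be illustrated with a minimal sketch. It assumes that, for each SNP, the reads carrying each allele are tallied separately for the resistant and susceptible bulks in a 2x2 table; the function name and the read counts are hypothetical and not taken from the study. Only the threshold of 16 on the -log10 p value comes from the abstract.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical allele read counts for one SNP (illustrative values only):
# resistant bulk: 40 reads of allele A, 0 of allele G
# susceptible bulk: 0 reads of allele A, 40 of allele G
p_value = fisher_exact_two_sided(40, 0, 0, 40)
score = -log10(p_value)
print(f"-log10(p) = {score:.2f}, passes threshold of 16: {score > 16}")
```

A SNP whose alleles are fully separated between the two bulks at this read depth clears the threshold, whereas a SNP with equal allele counts in both bulks gives a p value of 1 and would be discarded.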
Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard
2009-05-01
The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.
Domain-specific Web Service Discovery with Service Class Descriptions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rocco, D; Caverlee, J; Liu, L
2005-02-14
This paper presents DynaBot, a domain-specific web service discovery system. The core idea of the DynaBot service discovery system is to use domain-specific service class descriptions powered by an intelligent Deep Web crawler. In contrast to current registry-based service discovery systems, such as the several available UDDI registries, DynaBot promotes focused crawling of the Deep Web of services and discovers candidate services that are relevant to the domain of interest. It uses intelligent filtering algorithms to match services found by focused crawling with the domain-specific service class descriptions. We demonstrate the capability of DynaBot through the BLAST service discovery scenario and describe our initial experience with DynaBot.
Han, R; Rai, A; Nakamura, M; Suzuki, H; Takahashi, H; Yamazaki, M; Saito, K
2016-01-01
The study of the transcriptome, the entire pool of transcripts in an organism or single cells at a certain physiological or pathological stage, is indispensable in unraveling the connection and regulation between DNA and protein. Before the advent of deep sequencing, the microarray was the main approach to handling transcripts. Despite obvious shortcomings, including limited dynamic range and difficulty in comparing results from distinct experiments, microarrays were widely applied. During the past decade, next-generation sequencing (NGS) has revolutionized our understanding of genomics in a fast, high-throughput, cost-effective, and tractable manner. The adoption of NGS has profoundly enhanced the efficiency and outcomes of efforts to elucidate the genes responsible for producing active compounds in medicinal plants. The whole process involves several steps, from plant material sampling, to cDNA library preparation, to deep sequencing, after which bioinformatics takes over to assemble enormous yet fragmentary data from which to comb and extract information. The unprecedentedly rapid development of such technologies provides so many choices that selecting a suitable methodology for a specific purpose can be confusing. Here, we review the general approaches for deep transcriptome analysis and then focus on their application in discovering biosynthetic pathways of medicinal plants that produce important secondary metabolites. © 2016 Elsevier Inc. All rights reserved.
iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder.
Xu, Zhao-Chun; Wang, Peng; Qiu, Wang-Ren; Xiao, Xuan
2017-08-15
Gene splicing is one of the most significant biological processes in eukaryotic gene expression, such as RNA splicing, which can cause a pre-mRNA to produce one or more mature messenger RNAs containing the coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant for both bio-medical research and the discovery of new drugs. However, relying only on experimental techniques is expensive and time consuming, so new computational methods are needed. To identify the splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on minimum error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. In this paper, five-fold cross-validation test results based on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can also be studied by this approach. To implement classification accurately and quickly, an easy-to-use web-server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
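The auto-covariance and cross-covariance transformations mentioned in the iSS-PC abstract can be sketched as below. This is a minimal illustration under one commonly used definition (lagged covariance of a per-dinucleotide property along the sequence), not the authors' implementation. The two property tables are toy stand-ins computed from base composition; they are assumptions made for this example and are not the twelve physical-chemical properties used in the paper.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]

# Toy properties (assumptions for illustration, not real physical-chemical data)
GC_FRACTION = {d: sum(base in "GC" for base in d) / 2 for d in DINUCLEOTIDES}
PURINE_FRACTION = {d: sum(base in "AG" for base in d) / 2 for d in DINUCLEOTIDES}


def property_track(seq, prop):
    """Map each overlapping dinucleotide of seq to its property value."""
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def cross_covariance(seq, prop_u, prop_v, lag):
    """Lagged covariance between property u at position i and property v at i + lag."""
    u = property_track(seq, prop_u)
    v = property_track(seq, prop_v)
    n = len(u) - lag
    if n <= 0:
        raise ValueError("sequence too short for this lag")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((u[i] - mean_u) * (v[i + lag] - mean_v) for i in range(n)) / n


def auto_covariance(seq, prop, lag):
    """Auto-covariance is the cross-covariance of a property with itself."""
    return cross_covariance(seq, prop, prop, lag)


def feature_vector(seq, props, max_lag):
    """Concatenate auto-covariances, then cross-covariances of ordered property pairs."""
    features = [auto_covariance(seq, p, lag)
                for p in props for lag in range(1, max_lag + 1)]
    features += [cross_covariance(seq, p, q, lag)
                 for p in props for q in props if p is not q
                 for lag in range(1, max_lag + 1)]
    return features


vector = feature_vector("ACGTACGTAC", [GC_FRACTION, PURINE_FRACTION], max_lag=2)
print(len(vector))
```

The result is a fixed-length vector regardless of sequence length, which is what makes this kind of encoding usable as input to a classifier such as an auto-encoder.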
Schilmiller, Anthony L; Miner, Dennis P; Larson, Matthew; McDowell, Eric; Gang, David R; Wilkerson, Curtis; Last, Robert L
2010-07-01
Shotgun proteomics analysis allows hundreds of proteins to be identified and quantified from a single sample at relatively low cost. Extensive DNA sequence information is a prerequisite for shotgun proteomics, and it is ideal to have sequence for the organism being studied rather than from related species or accessions. While this requirement has limited the set of organisms that are candidates for this approach, next generation sequencing technologies make it feasible to obtain deep DNA sequence coverage from any organism. As part of our studies of specialized (secondary) metabolism in tomato (Solanum lycopersicum) trichomes, 454 sequencing of cDNA was combined with shotgun proteomics analyses to obtain in-depth profiles of genes and proteins expressed in leaf and stem glandular trichomes of 3-week-old plants. The expressed sequence tag and proteomics data sets combined with metabolite analysis led to the discovery and characterization of a sesquiterpene synthase that produces beta-caryophyllene and alpha-humulene from E,E-farnesyl diphosphate in trichomes of leaf but not of stem. This analysis demonstrates the utility of combining high-throughput cDNA sequencing with proteomics experiments in a target tissue. These data can be used for dissection of other biochemical processes in these specialized epidermal cells.
Schilmiller, Anthony L.; Miner, Dennis P.; Larson, Matthew; McDowell, Eric; Gang, David R.; Wilkerson, Curtis; Last, Robert L.
2010-01-01
Shotgun proteomics analysis allows hundreds of proteins to be identified and quantified from a single sample at relatively low cost. Extensive DNA sequence information is a prerequisite for shotgun proteomics, and it is ideal to have sequence for the organism being studied rather than from related species or accessions. While this requirement has limited the set of organisms that are candidates for this approach, next generation sequencing technologies make it feasible to obtain deep DNA sequence coverage from any organism. As part of our studies of specialized (secondary) metabolism in tomato (Solanum lycopersicum) trichomes, 454 sequencing of cDNA was combined with shotgun proteomics analyses to obtain in-depth profiles of genes and proteins expressed in leaf and stem glandular trichomes of 3-week-old plants. The expressed sequence tag and proteomics data sets combined with metabolite analysis led to the discovery and characterization of a sesquiterpene synthase that produces β-caryophyllene and α-humulene from E,E-farnesyl diphosphate in trichomes of leaf but not of stem. This analysis demonstrates the utility of combining high-throughput cDNA sequencing with proteomics experiments in a target tissue. These data can be used for dissection of other biochemical processes in these specialized epidermal cells. PMID:20431087
Deep hierarchies in the primate visual cortex: what can we learn for computer vision?
Krüger, Norbert; Janssen, Peter; Kalkan, Sinan; Lappe, Markus; Leonardis, Ales; Piater, Justus; Rodríguez-Sánchez, Antonio J; Wiskott, Laurenz
2013-08-01
Computational modeling of the primate visual system yields insights of potential relevance to some of the challenges that computer vision is facing, such as object recognition and categorization, motion detection and activity recognition, or vision-based navigation and manipulation. This paper reviews some functional principles and structures that are generally thought to underlie the primate visual cortex, and attempts to extract biological principles that could further advance computer vision research. Organized for a computer vision audience, we present functional principles of the processing hierarchies present in the primate visual system considering recent discoveries in neurophysiology. The hierarchical processing in the primate visual system is characterized by a sequence of different levels of processing (on the order of 10) that constitute a deep hierarchy in contrast to the flat vision architectures predominantly used in today's mainstream computer vision. We hope that the functional description of the deep hierarchies realized in the primate visual system provides valuable insights for the design of computer vision algorithms, fostering increasingly productive interaction between biological and computer vision research.
Emerging pathogens in the fish farming industry and sequencing-based pathogen discovery.
Tengs, Torstein; Rimstad, Espen
2017-10-01
The use of large-scale DNA/RNA sequencing has become an integral part of biomedical research. Reduced sequencing costs and the availability of efficient computational resources have led to a revolution in how problems concerning genomics and transcriptomics are addressed. Sequencing-based pathogen discovery represents one example of how genetic data can now be used in ways that were previously considered infeasible. Emerging pathogens affect both human and animal health due to a multitude of factors, including globalization, a shifting environment and an increasing human population. Fish farming represents a relevant, interesting and challenging system to study emerging pathogens. This review summarizes recent progress in pathogen discovery using sequence data, with particular emphasis on viruses in Atlantic salmon (Salmo salar). Copyright © 2017 Elsevier Ltd. All rights reserved.
Ribosome profiling reveals the what, when, where and how of protein synthesis.
Brar, Gloria A; Weissman, Jonathan S
2015-11-01
Ribosome profiling, which involves the deep sequencing of ribosome-protected mRNA fragments, is a powerful tool for globally monitoring translation in vivo. The method has facilitated discovery of the regulation of gene expression underlying diverse and complex biological processes, of important aspects of the mechanism of protein synthesis, and even of new proteins, by providing a systematic approach for experimental annotation of coding regions. Here, we introduce the methodology of ribosome profiling and discuss examples in which this approach has been a key factor in guiding biological discovery, including its prominent role in identifying thousands of novel translated short open reading frames and alternative translation products.
Next-generation sequencing in clinical virology: Discovery of new viruses.
Datta, Sibnarayan; Budhauliya, Raghvendra; Das, Bidisha; Chatterjee, Soumya; Vanlalhmuaka; Veer, Vijay
2015-08-12
Viruses are a cause of significant health problems worldwide, especially in developing nations. Due to different anthropological activities, human populations are exposed to different viral pathogens, many of which emerge as outbreaks. In such situations, discovery of novel viruses is of the utmost importance for deciding prevention and treatment strategies. Since the last century, a number of different virus discovery methods, based on cell culture inoculation and sequence-independent PCR, have been used for identification of a variety of viruses. However, the recent emergence and commercial availability of next-generation sequencers (NGS) has entirely changed the field of virus discovery. These massively parallel sequencing platforms can sequence a mixture of genetic materials from a very heterogeneous mix, with high sensitivity. Moreover, these platforms work in a sequence-independent manner, making them ideal tools for virus discovery. However, for their application in clinics, sample preparation or enrichment is necessary to detect low abundance virus populations. A number of techniques have also been developed for enrichment of viral nucleic acids. In this manuscript, we review the evolution of sequencing, the NGS technologies available today, and widely used virus enrichment technologies. We also discuss the challenges associated with their application in clinical virus discovery.
Genetics of impulsive behaviour
Bevilacqua, Laura; Goldman, David
2013-01-01
Impulsivity, defined as the tendency to act without foresight, comprises a multitude of constructs and is associated with a variety of psychiatric disorders. Dissecting different aspects of impulsive behaviour and relating these to specific neurobiological circuits would improve our understanding of the etiology of complex behaviours for which impulsivity is key, and advance genetic studies in this behavioural domain. In this review, we will discuss the heritability of some impulsivity constructs and their possible use as endophenotypes (heritable, disease-associated intermediate phenotypes). Several functional genetic variants associated with impulsive behaviour have been identified by the candidate gene approach and re-sequencing, and whole genome strategies can be implemented for discovery of novel rare and common alleles influencing impulsivity. Via deep sequencing an uncommon HTR2B stop codon, common in one population, was discovered, with implications for understanding impulsive behaviour in both humans and rodents and for future gene discovery. PMID:23440466
USDA-ARS's Scientific Manuscript database
Squash mosaic virus (SqMV), a seed-borne virus belonging to the genus Comovirus in the family Comoviridae, can cause serious yield loss in cucurbit crops worldwide. SqMV has a bipartite single-stranded ribonucleic acid (RNA) genome (RNA-1 and RNA-2) encapsidated separately with two capsid prote...
Jain, Mukesh; Chevala, V V S Narayana; Garg, Rohini
2014-11-01
MicroRNAs (miRNAs) are essential components of complex gene regulatory networks that orchestrate plant development. Although several genomic resources have been developed for the legume crop chickpea, miRNAs have not been discovered until now. For genome-wide discovery of miRNAs in chickpea (Cicer arietinum), we sequenced the small RNA content from seven major tissues/organs employing Illumina technology. About 154 million reads were generated, which represented more than 20 million distinct small RNA sequences. We identified a total of 440 conserved miRNAs in chickpea based on sequence similarity with known miRNAs in other plants. In addition, 178 novel miRNAs were identified using a miRDeep pipeline with plant-specific scoring. Some of the conserved and novel miRNAs with significant sequence similarity were grouped into families. The chickpea miRNAs targeted a wide range of mRNAs involved in diverse cellular processes, including transcriptional regulation (transcription factors), protein modification and turnover, signal transduction, and metabolism. Our analysis revealed several miRNAs with differential spatial expression. Many of the chickpea miRNAs were expressed in a tissue-specific manner. The conserved and differential expression of members of the same miRNA family in different tissues was also observed. Some of the same family members were predicted to target different chickpea mRNAs, which suggested the specificity and complexity of miRNA-mediated developmental regulation. This study, for the first time, reveals a comprehensive set of conserved and novel miRNAs along with their expression patterns and putative targets in chickpea, and provides a framework for understanding regulation of developmental processes in legumes. © The Author 2014. Published by Oxford University Press on behalf of the Society for Experimental Biology.
Sánchez, Cecilia Castaño; Smith, Timothy P L; Wiedmann, Ralph T; Vallejo, Roger L; Salem, Mohamed; Yao, Jianbo; Rexroad, Caird E
2009-11-25
To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. 
In addition, 2% of the sequences from the validated markers were associated with rainbow trout transcripts. The use of reduced representation libraries and pyrosequencing technology proved to be an effective strategy for the discovery of a high number of putative SNPs in rainbow trout; however, modifications to the technique to decrease the false discovery rate resulting from the evolutionarily recent genome duplication would be desirable.
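The figures quoted in the rainbow trout abstract can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers, assuming that each read samples one fragment end and that every read is usable; the variable names are introduced here for illustration.

```python
reads = 2_000_000
fragment_ends = 300_000      # two ends for each of the 150,000 unique fragments
genotyped = 384              # putative SNPs selected for validation
validated = 183
mapped = 167                 # validated markers placed on the linkage map

reads_per_end = reads / fragment_ends
validation_rate = validated / genotyped
mapping_rate = mapped / validated

print(f"reads per fragment end: {reads_per_end:.2f}")
print(f"validation rate: {validation_rate:.1%}")
print(f"validated markers mapped: {mapping_rate:.1%}")
```

Under these assumptions the upper bound on coverage is a little under 7 reads per fragment end, which is in line with the reported average of about 6-fold, and 183 of 384 corresponds to the reported validation rate of approximately 48%.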
Accurate identification of RNA editing sites from primitive sequence with deep neural networks.
Ouyang, Zhangyi; Liu, Feng; Zhao, Chenghui; Ren, Chao; An, Gaole; Mei, Chuan; Bo, Xiaochen; Shu, Wenjie
2018-04-16
RNA editing is a post-transcriptional RNA sequence alteration. Current methods have identified editing sites and facilitated research but require sufficient genomic annotations and prior-knowledge-based filtering steps, resulting in a cumbersome, time-consuming identification process. Moreover, these methods have limited generalizability and applicability in species with insufficient genomic annotations or in conditions of limited prior knowledge. We developed DeepRed, a deep learning-based method that identifies RNA editing from primitive RNA sequences without prior-knowledge-based filtering steps or genomic annotations. DeepRed achieved 98.1% and 97.9% area under the curve (AUC) in training and test sets, respectively. We further validated DeepRed using experimentally verified U87 cell RNA-seq data, achieving 97.9% positive predictive value (PPV). We demonstrated that DeepRed offers better prediction accuracy and computational efficiency than current methods with large-scale, mass RNA-seq data. We used DeepRed to assess the impact of multiple factors on editing identification with RNA-seq data from the Association of Biomolecular Resource Facilities and Sequencing Quality Control projects. We explored developmental RNA editing pattern changes during human early embryogenesis and evolutionary patterns in Drosophila species and the primate lineage using DeepRed. Our work illustrates DeepRed's state-of-the-art performance; it may decipher the hidden principles behind RNA editing, making editing detection convenient and effective.
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
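The counting step behind a barcode-sequencing assay such as the one described above can be sketched as follows. This is a simplified illustration, not the published Bar-seq pipeline: it assumes each read contains a common priming sequence immediately followed by a strain barcode, requires exact barcode matches, and scores each strain by a log2 ratio of treated to control counts with a pseudocount. The primer, barcodes, strain names and reads are all hypothetical.

```python
from collections import Counter
from math import log2


def count_barcodes(reads, primer, barcode_len, barcode_to_strain):
    """Count reads per strain by reading the barcode that follows the primer."""
    counts = Counter()
    for read in reads:
        start = read.find(primer)
        if start == -1:
            continue  # read does not contain the common priming sequence
        barcode_start = start + len(primer)
        barcode = read[barcode_start:barcode_start + barcode_len]
        strain = barcode_to_strain.get(barcode)
        if strain is not None:
            counts[strain] += 1
    return counts


def log2_ratios(control_counts, treated_counts, pseudocount=1):
    """Log2(treated / control) per strain; negative values suggest reduced fitness."""
    strains = set(control_counts) | set(treated_counts)
    return {
        s: log2((treated_counts.get(s, 0) + pseudocount)
                / (control_counts.get(s, 0) + pseudocount))
        for s in strains
    }


# Hypothetical primer, barcodes and reads (illustrative values only)
PRIMER = "GATGTC"
BARCODES = {"ACGTAC": "strainA", "TTGCAA": "strainB"}
control_reads = ["GATGTCACGTAC"] * 3 + ["GATGTCTTGCAA"] * 3
treated_reads = ["GATGTCACGTAC"] * 3 + ["GATGTCTTGCAA"] + ["GATGTCGGGGGG"]

control = count_barcodes(control_reads, PRIMER, 6, BARCODES)
treated = count_barcodes(treated_reads, PRIMER, 6, BARCODES)
print(log2_ratios(control, treated))
```

Exact matching is the simplest choice; the abstract's observation that roughly 20% of barcodes and priming sequences differed from expectation is a reminder that the lookup table must reflect the re-sequenced barcodes, or matching must tolerate mismatches.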
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Chen, Xin; Wu, Qiong; Sun, Ruimin; Zhang, Louxin
2012-01-01
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach. In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as SNP - MSP, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as SNP - MSQ, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the SNP - MSP problem and also show that the SNP - MSQ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem. We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
2011-01-01
Background Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence. Results An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii-the diploid source of the wheat D genome, and with a genome size of 4.02 Gb, of which 90% is repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 was sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. 
To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated. Conclusion An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (http://avena.pw.usda.gov/wheatD/agsnp.shtml). PMID:21266061
In-Silico Identification Of Micro-Loops In Myelodysplastic Syndromes
NASA Astrophysics Data System (ADS)
Beck, Dominik; Brandl, Miriam; Pham, Tuan D.; Chang, Chung-Che; Zhou, Xiaobo
2011-06-01
Micro-loops are regulatory network motifs that leverage transcriptional and posttranscriptional control to effectively regulate the transcriptome. In this paper a regulatory network for Myelodysplastic Syndromes (MDSs) was constructed from the literature and publicly available data sources. The network was filtered using data from deep-sequencing of small RNAs, exon and microarrays. Motif discovery showed that micro-loops might exist in MDS. We further used the identified micro-loops and performed basic network analysis to identify the known disease gene RUNX1/AML, as well as miRNA family hsa-mir-181. This suggested that the concept of micro-loops can be applied to enhance disease gene identification and biomarker discovery.
Sequence-specific bias correction for RNA-seq data using recurrent neural networks.
Zhang, Yao-Zhong; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru
2017-01-25
The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bear on a host of bioinformatical problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data using Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training an RNN-based nucleotide sequence model is efficient and RNN-based bias correction methods compare well with the state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. RNNs provide an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures.
Emerging Concepts and Methodologies in Cancer Biomarker Discovery.
Lu, Meixia; Zhang, Jinxiang; Zhang, Lanjing
2017-01-01
Cancer biomarker discovery is a critical part of cancer prevention and treatment. Despite decades of effort, only a small number of cancer biomarkers have been identified for and validated in clinical settings. Conceptual and methodological breakthroughs may help accelerate the discovery of additional cancer biomarkers, particularly their use for diagnostics. In this review, we survey the emerging concepts in cancer biomarker discovery, including real-world evidence, open access data, and data paucity in rare or uncommon cancers. We also summarize the recent methodological progress in cancer biomarker discovery, such as high-throughput sequencing, liquid biopsy, big data, artificial intelligence (AI), and deep learning and neural networks. Much attention has been given to the methodological details and comparison of the methodologies. Notably, these concepts and methodologies interact with each other and will likely lead to synergistic effects when carefully combined. Newer, more innovative concepts and methodologies are emerging as the current emerging ones become mainstream and widely applied in the field. Some future challenges are also discussed. This review contributes to the development of future theoretical frameworks and technologies in cancer biomarker discovery and will contribute to the discovery of more useful cancer biomarkers.
Fungal diversity from deep marine subsurface sediments (IODP 317, Canterbury Basin, New Zealand)
NASA Astrophysics Data System (ADS)
Redou, V.; Arzur, D.; Burgaud, G.; Barbier, G.
2012-12-01
Recent years have seen a growing interest regarding micro-eukaryotic communities in extreme environments as a third microbial domain after Bacteria and Archaea. However, knowledge is still scarce and the diversity of micro-eukaryotes in such environments remains hidden and their ecological role unknown. Our research program is based on the deep sedimentary layers of the Canterbury Basin in New Zealand (IODP 317) from the subsurface to the record depth of 1884 meters below seafloor. The objectives of our study are (i) to assess the genetic diversity of fungi in deep-sea sediments and (ii) to identify the functional part in order to better understand the origin and the ecological role of fungal communities in this extreme ecosystem. Fingerprinting-based methods using capillary electrophoresis single-strand conformation polymorphism and denaturing high-performance liquid chromatography were used as a first step toward our objectives. Molecular fungal diversity was assessed using amplification of ITS1 (Internal Transcribed Spacer 1) as a biomarker on 11 sediment samples from 3.76 to 1884 meters below seafloor. Fungal molecular signatures were detected throughout the sediment core. The phyla Ascomycota and Basidiomycota were revealed with DNA as well as cDNA. Most of the phylotypes are affiliated to environmental sequences and some to common fungal cultured species. The discovery of a present and metabolically active fungal component in this unique ecosystem allows some interesting first hypotheses that will be further combined with culture-based methods and deeper molecular methods (454 pyrosequencing) to highlight essential information regarding the physiology and ecological role of fungal communities in deep marine sediments.
Deep machine learning provides state-of-the-art performance in image-based plant phenotyping.
Pound, Michael P; Atkinson, Jonathan A; Townsend, Alexandra J; Wilson, Michael H; Griffiths, Marcus; Jackson, Aaron S; Bulat, Adrian; Tzimiropoulos, Georgios; Wells, Darren M; Murchie, Erik H; Pridmore, Tony P; French, Andrew P
2017-10-01
In plant phenotyping, it has become important to be able to measure many features on large image sets in order to aid genetic discovery. The size of the datasets, now often captured robotically, often precludes manual inspection, hence the motivation for finding a fully automated approach. Deep learning is an emerging field that promises unparalleled results on many data analysis problems. Building on artificial neural networks, deep approaches have many more hidden layers in the network, and hence have greater discriminative and predictive power. We demonstrate the use of such approaches as part of a plant phenotyping pipeline. We show the success offered by such techniques when applied to the challenging problem of image-based plant phenotyping and demonstrate state-of-the-art results (>97% accuracy) for root and shoot feature identification and localization. We use fully automated trait identification using deep learning to identify quantitative trait loci in root architecture datasets. The majority (12 out of 14) of manually identified quantitative trait loci were also discovered using our automated approach based on deep learning detection to locate plant features. We have shown deep learning-based phenotyping to have very good detection and localization accuracy in validation and testing image sets. We have shown that such features can be used to derive meaningful biological traits, which in turn can be used in quantitative trait loci discovery pipelines. This process can be completely automated. We predict a paradigm shift in image-based phenotyping brought about by such deep learning approaches, given sufficient training sets. © The Authors 2017. Published by Oxford University Press.
Computational functional genomics-based approaches in analgesic drug discovery and repurposing.
Lippmann, Catharina; Kringel, Dario; Ultsch, Alfred; Lötsch, Jörn
2018-06-01
Persistent pain is a major healthcare problem affecting a fifth of adults worldwide with still limited treatment options. The search for new analgesics increasingly includes the novel research area of functional genomics, which combines data derived from various processes related to DNA sequence, gene expression or protein function and uses advanced methods of data mining and knowledge discovery with the goal of understanding the relationship between the genome and the phenotype. Its use in drug discovery and repurposing for analgesic indications has so far been performed using knowledge discovery in gene function and drug target-related databases; next-generation sequencing; and functional proteomics-based approaches. Here, we discuss recent efforts in functional genomics-based approaches to analgesic drug discovery and repurposing and highlight the potential of computational functional genomics in this field including a demonstration of the workflow using a novel R library 'dbtORA'.
Diverse deep-sea fungi from the South China Sea and their antimicrobial activity.
Zhang, Xiao-Yong; Zhang, Yun; Xu, Xin-Ya; Qi, Shu-Hua
2013-11-01
We investigated the diversity of fungal communities in nine different deep-sea sediment samples of the South China Sea by culture-dependent methods followed by analysis of fungal internal transcribed spacer (ITS) sequences. Although 14 out of 27 identified species were reported in a previous study, 13 species were isolated from sediments of deep-sea environments for the first time. Moreover, the ITS sequences of six isolates shared 84-92 % similarity with their closest matches in GenBank, which suggested that they might be novel phylotypes of genera Ajellomyces, Podosordaria, Torula, and Xylaria. The antimicrobial activities of these fungal isolates were explored using a double-layer technique. A relatively high proportion (56 %) of fungal isolates exhibited antimicrobial activity against at least one pathogenic bacterium or fungus among four marine pathogenic microbes (Micrococcus luteus, Pseudoalteromonas piscicida, Aspergillus versicolor, and A. sydowii). Out of these antimicrobial fungi, the genera Arthrinium, Aspergillus, and Penicillium exhibited antibacterial and antifungal activities, while genus Aureobasidium displayed only antibacterial activity, and genera Acremonium, Cladosporium, Geomyces, and Phaeosphaeriopsis displayed only antifungal activity. To our knowledge, this is the first report to investigate the diversity and antimicrobial activity of culturable deep-sea-derived fungi in the South China Sea. These results suggest that diverse deep-sea fungi from the South China Sea are a potential source for antibiotic discovery and further increase the pool of fungi available for natural bioactive product screening.
Bigot, Diane; Atyame, Célestine M; Weill, Mylène; Justy, Fabienne
2018-01-01
In the global context of arboviral emergence, deep sequencing unlocks the discovery of new mosquito-borne viruses. Mosquitoes of the species Culex pipiens, C. torrentium, and C. hortensis were sampled from 22 locations worldwide for transcriptomic analyses. A virus discovery pipeline was used to analyze the dataset of 0.7 billion reads comprising 22 individual transcriptomes. Two closely related 6.8 kb viral genomes were identified in C. pipiens and named as Culex pipiens associated tunisia virus (CpATV) strains Ayed and Jedaida. The CpATV genome contained four ORFs. ORF1 possessed helicase and RNA-dependent RNA polymerase (RdRp) domains related to new viral sequences recently found mainly in dipterans. ORF2 and 4 contained a capsid protein domain showing strong homology with Virgaviridae plant viruses. ORF3 displayed similarities with eukaryotic Rhoptry domain and a merozoite surface protein (MSP7) domain only found in mosquito-transmitted Plasmodium, suggesting possible interactions between CpATV and vertebrate cells. Estimation of a strong purifying selection exerted on each ORF and the presence of a polymorphism maintained in the coding region of ORF3 suggested that both CpATV sequences are genuine functional viruses. CpATV is part of an entirely new and highly diversified group of viruses recently found in insects, and that bears the genomic hallmarks of a new viral family. PMID:29340209
Quantitative phenotyping via deep barcode sequencing
Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey
2009-01-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793
Limitations and potentials of current motif discovery algorithms
Hu, Jianjun; Li, Bin; Kihara, Daisuke
2005-01-01
Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6–45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them. PMID:16284194
Next-generation libraries for robust RNA interference-based genome-wide screens
Kampmann, Martin; Horlbeck, Max A.; Chen, Yuwen; Tsai, Jordan C.; Bassik, Michael C.; Gilbert, Luke A.; Villalta, Jacqueline E.; Kwon, S. Chul; Chang, Hyeshik; Kim, V. Narry; Weissman, Jonathan S.
2015-01-01
Genetic screening based on loss-of-function phenotypes is a powerful discovery tool in biology. Although the recent development of clustered regularly interspaced short palindromic repeats (CRISPR)-based screening approaches in mammalian cell culture has enormous potential, RNA interference (RNAi)-based screening remains the method of choice in several biological contexts. We previously demonstrated that ultracomplex pooled short-hairpin RNA (shRNA) libraries can largely overcome the problem of RNAi off-target effects in genome-wide screens. Here, we systematically optimize several aspects of our shRNA library, including the promoter and microRNA context for shRNA expression, selection of guide strands, and features relevant for postscreen sample preparation for deep sequencing. We present next-generation high-complexity libraries targeting human and mouse protein-coding genes, which we grouped into 12 sublibraries based on biological function. A pilot screen suggests that our next-generation RNAi library performs comparably to current CRISPR interference (CRISPRi)-based approaches and can yield complementary results with high sensitivity and high specificity. PMID:26080438
[Current applications of high-throughput DNA sequencing technology in antibody drug research].
Yu, Xin; Liu, Qi-Gang; Wang, Ming-Rong
2012-03-01
Since the publication in 2005 of a high-throughput DNA sequencing technology based on PCR carried out in oil emulsions, high-throughput DNA sequencing platforms have evolved into a robust technology for sequencing genomes and diverse DNA libraries. Antibody libraries with vast numbers of members currently serve as a foundation for discovering novel antibody drugs, and high-throughput DNA sequencing technology makes it possible to rapidly identify functional antibody variants with desired properties. Herein we present a review of current applications of high-throughput DNA sequencing technology in the analysis of antibody library diversity, sequencing of CDR3 regions, identification of potent antibodies based on sequence frequency, discovery of functional genes, and combination with various display technologies, so as to provide an alternative approach to the discovery and development of antibody drugs.
A deep learning method for lincRNA detection using auto-encoder algorithm.
Yu, Ning; Yu, Zeng; Pan, Yi
2017-12-06
RNA sequencing technique (RNA-seq) enables scientists to develop novel data-driven methods for discovering more unidentified lincRNAs. Meanwhile, knowledge-based technologies are experiencing a potential revolution ignited by the new deep learning methods. By scanning the newly found data set from RNA-seq, scientists have found that: (1) the expression of lincRNAs appears to be regulated, that is, the relevance exists along the DNA sequences; (2) lincRNAs contain some conserved patterns/motifs tethered together by non-conserved regions. These two lines of evidence provide the rationale for adopting knowledge-based deep learning methods in lincRNA detection. Similar to coding region transcription, non-coding regions are split at transcriptional sites. However, regulatory RNAs rather than messenger RNAs are generated. That is, the transcribed RNAs participate in biological processes as regulatory units instead of generating proteins. Identifying these transcriptional regions from non-coding regions is the first step towards lincRNA recognition. The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show the excellent performance of the predictive deep neural network on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, it is validated through the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has extensive ability for lincRNA prediction. The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data. Subsequently, a two-layer deep neural network is developed for the lincRNA detection, which adopts the auto-encoder algorithm and utilizes different encoding schemes to obtain the best performance over intergenic DNA sequence data.
Driven by those newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can exert their capability in knowledge learning in order to capture the useful features and the information correlation along DNA genome sequences for lincRNA detection. To our knowledge, this is the first application to adopt deep learning techniques for identifying lincRNA transcription sequences.
Zhou, Wen-Zhao; Zhang, Yan-Mei; Lu, Jun-Ying; Li, Jun-Feng
2012-01-01
To provide a resource of sisal-specific expressed sequence data and facilitate this powerful approach in new gene research, the preparation of normalized cDNA libraries enriched with full-length sequences is necessary. Four libraries were produced with RNA pooled from Agave sisalana multiple tissues to increase efficiency of normalization and maximize the number of independent genes by SMART™ method and the duplex-specific nuclease (DSN). This procedure kept the proportion of full-length cDNAs in the subtracted/normalized libraries and dramatically enhanced the discovery of new genes. Sequencing of 3875 cDNA clones of libraries revealed 3320 unigenes with an average insert length about 1.2 kb, indicating that the non-redundancy of libraries was about 85.7%. These unigene functions were predicted by comparing their sequences to functional domain databases and extensively annotated with Gene Ontology (GO) terms. Comparative analysis of sisal unigenes and other plant genomes revealed that four putative MADS-box genes and knotted-like homeobox (knox) gene were obtained from a total of 1162 full-length transcripts. Furthermore, real-time PCR showed that the characteristics of their transcripts mainly depended on the tight expression regulation of a number of genes during the leaf and flower development. Analysis of individual library sequence data indicated that the pooled-tissue approach was highly effective in discovering new genes and preparing libraries for efficient deep sequencing. PMID:23202944
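The non-redundancy figure quoted in the abstract above can be reproduced with a one-line calculation. The sketch below assumes that "non-redundancy" means the number of unigenes divided by the number of sequenced clones; the abstract does not state the formula explicitly, and the function name is hypothetical.

```python
def non_redundancy(unigenes: int, clones: int) -> float:
    """Percentage of sequenced clones that correspond to distinct unigenes.

    Assumption: non-redundancy = unigenes / clones, expressed in percent.
    """
    if clones <= 0:
        raise ValueError("clones must be positive")
    return 100.0 * unigenes / clones


# Figures taken from the abstract: 3320 unigenes from 3875 sequenced clones.
rate = non_redundancy(3320, 3875)
print(f"{rate:.1f}%")  # prints 85.7%
```

Under this reading, the reported 85.7% is consistent with the clone and unigene counts given.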
Biosynthesis and genetic encoding of phosphothreonine through parallel selection and deep sequencing
Huguenin-Dezot, Nicolas; Liang, Alexandria D.; Schmied, Wolfgang H.; Rogerson, Daniel T.; Chin, Jason W.
2017-01-01
The phosphorylation of threonine residues in proteins regulates diverse processes in eukaryotic cells, and thousands of threonine phosphorylations have been identified. An understanding of how threonine phosphorylation regulates biological function will be accelerated by general methods to bio-synthesize defined phospho-proteins. Here we address limitations in current methods for discovering aminoacyl-tRNA synthetase/tRNA pairs for incorporating non-natural amino acids into proteins, by combining parallel positive selections with deep sequencing and statistical analysis, to create a rapid approach for directly discovering aminoacyl-tRNA synthetase/tRNA pairs that selectively incorporate non-natural substrates. Our approach is scalable and enables the direct discovery of aminoacyl-tRNA synthetase/tRNA pairs with mutually orthogonal substrate specificity. We biosynthesize phosphothreonine in cells, and use our new selection approach to discover a phosphothreonyl-tRNA synthetase/tRNACUA pair. By combining these advances we create an entirely biosynthetic route to incorporating phosphothreonine in proteins and biosynthesize several phosphoproteins; enabling phosphoprotein structure determination and synthetic protein kinase activation. PMID:28553966
Lim, Chun Shen; Brown, Chris M.
2018-01-01
Structured RNA elements may control virus replication, transcription and translation, and their distinct features are being exploited by novel antiviral strategies. Viral RNA elements continue to be discovered using combinations of experimental and computational analyses. However, the wealth of sequence data, notably from deep viral RNA sequencing, viromes, and metagenomes, necessitates computational approaches being used as an essential discovery tool. In this review, we describe practical approaches being used to discover functional RNA elements in viral genomes. In addition to success stories in new and emerging viruses, these approaches have revealed some surprising new features of well-studied viruses e.g., human immunodeficiency virus, hepatitis C virus, influenza, and dengue viruses. Some notable discoveries were facilitated by new comparative analyses of diverse viral genome alignments. Importantly, comparative approaches for finding RNA elements embedded in coding and non-coding regions differ. With the exponential growth of computer power we have progressed from stem-loop prediction on single sequences to cutting edge 3D prediction, and from command line to user friendly web interfaces. Despite these advances, many powerful, user friendly prediction tools and resources are underutilized by the virology community. PMID:29354101
Bass, David; Moureau, Gregory; Tang, Shuoya; McAlister, Erica; Culverwell, C. Lorna; Glücksman, Edvard; Wang, Hui; Brown, T. David K.; Gould, Ernest A.; Harbach, Ralph E.; de Lamballerie, Xavier; Firth, Andrew E.
2013-01-01
We investigated whether small RNA (sRNA) sequenced from field-collected mosquitoes and chironomids (Diptera) can be used as a proxy signature of viral prevalence within a range of species and viral groups, using sRNAs sequenced from wild-caught specimens, to inform total RNA deep sequencing of samples of particular interest. Using this strategy, we sequenced from adult Anopheles maculipennis s.l. mosquitoes the apparently nearly complete genome of one previously undescribed virus related to chronic bee paralysis virus, and, from a pool of Ochlerotatus caspius and Oc. detritus mosquitoes, a nearly complete entomobirnavirus genome. We also reconstructed long sequences (1503-6557 nt) related to at least nine other viruses. Crucially, several of the sequences detected were reconstructed from host organisms highly divergent from those in which related viruses have been previously isolated or discovered. It is clear that viral transmission and maintenance cycles in nature are likely to be significantly more complex and taxonomically diverse than previously expected. PMID:24260463
Oral Microbiome of Deep and Shallow Dental Pockets In Chronic Periodontitis
Ge, Xiuchun; Rodriguez, Rafael; Trinh, My; Gunsolley, John; Xu, Ping
2013-01-01
We examined the subgingival bacterial biodiversity in untreated chronic periodontitis patients by sequencing 16S rRNA genes. The primary purpose of the study was to compare the oral microbiome in deep (diseased) and shallow (healthy) sites. A secondary purpose was to evaluate the influences of smoking, race and dental caries on this relationship. A total of 88 subjects from two clinics were recruited. Paired subgingival plaque samples were taken from each subject, one from a probing site depth >5 mm (deep site) and the other from a probing site depth ≤3 mm (shallow site). A universal primer set was designed to amplify the V4–V6 region for oral microbial 16S rRNA sequences. Differences in genera and species attributable to deep and shallow sites were determined by statistical analysis using a two-part model and false discovery rate. Fifty-one of 170 genera and 200 of 746 species were found to differ significantly in abundance between shallow and deep sites. Besides previously identified periodontal disease-associated bacterial species, additional species were found markedly changed in diseased sites. Cluster analysis revealed that the microbiome difference between deep and shallow sites was influenced by patient-level effects such as clinic location, race and smoking. The differences between clinic locations may be influenced by racial distribution, in that all of the African American subjects were seen at the same clinic. Our results suggested that there were influences from the microbiome for caries and periodontal disease and these influences are independent. PMID:23762384
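The abstract above mentions false discovery rate control without naming a procedure. As an illustration only, the following sketch assumes the widely used Benjamini-Hochberg step-up procedure; the function name, the example p-values, and the 0.05 level are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure; the abstract does not
    say which FDR method the authors used.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical per-genus p-values comparing deep and shallow sites.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

The step-up rule means a p-value that misses its own threshold can still be rejected when a larger p-value passes a later threshold, which is why the largest passing rank is found first.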
Sequence-Based Genotyping for Marker Discovery and Co-Dominant Scoring in Germplasm and Populations
Truong, Hoa T.; Ramos, A. Marcos; Yalcin, Feyruz; de Ruiter, Marjo; van der Poel, Hein J. A.; Huvenaars, Koen H. J.; Hogers, René C. J.; van Enckevort, Leonora. J. G.; Janssen, Antoine; van Orsouw, Nathalie J.; van Eijk, Michiel J. T.
2012-01-01
Conventional marker-based genotyping platforms are widely available, but not without their limitations. In this context, we developed Sequence-Based Genotyping (SBG), a technology for simultaneous marker discovery and co-dominant scoring, using next-generation sequencing. SBG offers users several advantages including a generic sample preparation method, a highly robust genome complexity reduction strategy to facilitate de novo marker discovery across entire genomes, and a uniform bioinformatics workflow strategy to achieve genotyping goals tailored to individual species, regardless of the availability of a reference sequence. The most distinguishing features of this technology are the ability to genotype any population structure, regardless of whether parental data is included, and the ability to co-dominantly score SNP markers segregating in populations. To demonstrate the capabilities of SBG, we performed marker discovery and genotyping in Arabidopsis thaliana and lettuce, two plant species of diverse genetic complexity and backgrounds. Initially we obtained 1,409 SNPs for arabidopsis, and 5,583 SNPs for lettuce. Further filtering of the SNP dataset produced over 1,000 high quality SNP markers for each species. We obtained a genotyping rate of 201.2 genotypes/SNP and 58.3 genotypes/SNP for arabidopsis (n = 222 samples) and lettuce (n = 87 samples), respectively. Linkage mapping using these SNPs resulted in stable map configurations. We have therefore shown that the SBG approach presented provides users with the utmost flexibility in garnering high quality markers that can be directly used for genotyping and downstream applications. Until advances and costs allow for routine whole-genome sequencing of populations, we expect that sequence-based genotyping technologies such as SBG will be essential for genotyping of model and non-model genomes alike. PMID:22662172
Korotcov, Alexandru; Tkachenko, Valery; Russo, Daniel P; Ekins, Sean
2017-12-04
Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint type molecular descriptors paired with Bayesian methods resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has found considerable traction for many artificial intelligence applications. We have previously suggested the need for a comparison of different machine learning methods with deep learning across an array of varying data sets that is applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we have used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole cell screens, individual proteins, physicochemical properties as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient and others. Based on ranked normalized scores for the metrics or data sets, Deep Neural Networks (DNN) ranked higher than SVM, which in turn was ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar type plots indicates when models are inferior or perhaps overtrained.
These results also suggest the need for assessing deep learning further using multiple metrics with much larger scale comparisons, prospective testing as well as assessment of different fingerprints and DNN architectures beyond those used.
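Three of the metrics named in the abstract above (F1 score, Cohen's kappa, and the Matthews correlation coefficient) can be computed directly from a binary confusion matrix. The sketch below uses the standard textbook definitions; the counts are invented for illustration and do not come from the study, and the function name is hypothetical. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """F1, Cohen's kappa and MCC from binary confusion-matrix counts.

    Assumes every row and column total is nonzero, so no denominator
    below is zero.
    """
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    # Matthews correlation coefficient.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator

    return {"f1": f1, "kappa": kappa, "mcc": mcc}


# Hypothetical counts for one model on one test set.
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

Reporting several such metrics side by side, as the abstract describes, guards against a model looking strong on one measure only because of class imbalance.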
deepTools: a flexible platform for exploring deep-sequencing data.
Ramírez, Fidel; Dündar, Friederike; Diehl, Sarah; Grüning, Björn A; Manke, Thomas
2014-07-01
We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Swenson, Luke C; Moores, Andrew; Low, Andrew J; Thielen, Alexander; Dong, Winnie; Woods, Conan; Jensen, Mark A; Wynhoven, Brian; Chan, Dennison; Glascock, Christopher; Harrigan, P Richard
2010-08-01
Tropism testing should rule out CXCR4-using HIV before treatment with CCR5 antagonists. Currently, the recombinant phenotypic Trofile assay (Monogram) is most widely utilized; however, genotypic tests may represent alternative methods. Independent triplicate amplifications of the HIV gp120 V3 region were made from either plasma HIV RNA or proviral DNA. These underwent standard, population-based sequencing with an ABI3730 (RNA n = 63; DNA n = 40), or "deep" sequencing with a Roche/454 Genome Sequencer-FLX (RNA n = 12; DNA n = 12). Position-specific scoring matrices (PSSMX4/R5) (-6.96 cutoff) and geno2pheno[coreceptor] (5% false-positive rate) inferred tropism from V3 sequence. These methods were then independently validated with a separate, blinded dataset (n = 278) of screening samples from the maraviroc MOTIVATE trials. Standard sequencing of HIV RNA with PSSM yielded 69% sensitivity and 91% specificity, relative to Trofile. The validation dataset gave 75% sensitivity and 83% specificity. Proviral DNA plus PSSM gave 77% sensitivity and 71% specificity. "Deep" sequencing of HIV RNA detected >2% inferred-CXCR4-using virus in 8/8 samples called non-R5 by Trofile, and <2% in 4/4 samples called R5. Triplicate analyses of V3 standard sequence data detect greater proportions of CXCR4-using samples than previously achieved. Sequencing proviral DNA and "deep" V3 sequencing may also be useful tools for assessing tropism.
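The decision rule and the accuracy figures described in the abstract above can be sketched in a few lines. The PSSM cutoff (-6.96) and the 2% threshold are taken from the abstract; the direction of the score comparison (scores at or above the cutoff treated as CXCR4-using), the function names, and the example scores are assumptions made for illustration.

```python
PSSM_CUTOFF = -6.96      # cutoff quoted in the abstract
MIN_X4_FRACTION = 0.02   # 2% threshold quoted in the abstract


def call_tropism(read_scores):
    """Call a deep-sequenced sample 'non-R5' or 'R5'.

    Assumption: a read scoring at or above the cutoff is inferred to be
    CXCR4-using, and a sample is non-R5 when more than 2% of reads are.
    """
    x4_reads = sum(1 for score in read_scores if score >= PSSM_CUTOFF)
    fraction = x4_reads / len(read_scores)
    return "non-R5" if fraction > MIN_X4_FRACTION else "R5"


def sensitivity_specificity(calls, reference, positive="non-R5"):
    """Sensitivity and specificity of calls against a reference assay."""
    pairs = list(zip(calls, reference))
    tp = sum(1 for c, r in pairs if r == positive and c == positive)
    fn = sum(1 for c, r in pairs if r == positive and c != positive)
    tn = sum(1 for c, r in pairs if r != positive and c != positive)
    fp = sum(1 for c, r in pairs if r != positive and c == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical sample: 3 of 100 reads score above the cutoff.
print(call_tropism([-10.0] * 97 + [-5.0] * 3))
```

Here the reference labels would come from the phenotypic assay, so sensitivity is the share of reference non-R5 samples that the genotypic call also flags.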
Roden, Suzanne E; Dutton, Peter H; Morin, Phillip A
2009-01-01
The green sea turtle, Chelonia mydas, was used as a case study for single nucleotide polymorphism (SNP) discovery in a species that has little genetic sequence information available. As green turtles have a complex population structure, additional nuclear markers other than microsatellites could add to our understanding of their complex life history. Amplified fragment length polymorphism technique was used to generate sets of random fragments of genomic DNA, which were then electrophoretically separated with precast gels, stained with SYBR green, excised, and directly sequenced. It was possible to perform this method without the use of polyacrylamide gels, radioactive or fluorescent labeled primers, or hybridization methods, reducing the time, expense, and safety hazards of SNP discovery. Within 13 loci, 2547 base pairs were screened, resulting in the discovery of 35 SNPs. Using this method, it was possible to yield a sufficient number of loci to screen for SNP markers without the availability of prior sequence information.
Siam, Rania; Mustafa, Ghada A.; Sharaf, Hazem; Moustafa, Ahmed; Ramadan, Adham R.; Antunes, Andre; Bajic, Vladimir B.; Stingl, Uli; Marsis, Nardine G. R.; Coolen, Marco J. L.; Sogin, Mitchell; Ferreira, Ari J. S.; Dorry, Hamza El
2012-01-01
The seafloor is a unique environment, which allows insights into how geochemical processes affect the diversity of biological life. Among its diverse ecosystems are deep-sea brine pools - water bodies characterized by a unique combination of extreme conditions. The ‘polyextremophiles’ that constitute the microbial assemblage of these deep hot brines have not been comprehensively studied. We report a comparative taxonomic analysis of the prokaryotic communities of the sediments directly below the Red Sea brine pools, namely, Atlantis II, Discovery, Chain Deep, and an adjacent brine-influenced site. Analyses of sediment samples and high-throughput pyrosequencing of PCR-amplified environmental 16S ribosomal RNA genes (16S rDNA) revealed that one sulfur (S)-rich Atlantis II and one nitrogen (N)-rich Discovery Deep section contained distinct microbial populations that differed from those found in the other sediment samples examined. Proteobacteria, Actinobacteria, Cyanobacteria, Deferribacteres, and Euryarchaeota were the most abundant bacterial and archaeal phyla in both the S- and N-rich sections. Relative abundance-based hierarchical clustering of the 16S rDNA pyrotags assigned to major taxonomic groups allowed us to categorize the archaeal and bacterial communities into three major and distinct groups; group I was unique to the S-rich Atlantis II section (ATII-1), group II was characteristic for the N-rich Discovery sample (DD-1), and group III reflected the composition of the remaining sediments. Many of the groups detected in the S-rich Atlantis II section are likely to play a dominant role in the cycling of methane and sulfur due to their phylogenetic affiliations with bacteria and archaea involved in anaerobic methane oxidation and sulfate reduction. PMID:22916172
Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin
2018-01-01
Much of the within-species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation. PMID:29361139
Haplotag: Software for Haplotype-Based Genotyping-by-Sequencing Analysis
Tinker, Nicholas A.; Bekele, Wubishet A.; Hattori, Jiro
2016-01-01
Genotyping-by-sequencing (GBS), and related methods, are based on high-throughput short-read sequencing of genomic complexity reductions followed by discovery of single nucleotide polymorphisms (SNPs) within sequence tags. This provides a powerful and economical approach to whole-genome genotyping, facilitating applications in genomics, diversity analysis, and molecular breeding. However, due to the complexity of analyzing large data sets, applications of GBS may require substantial time, expertise, and computational resources. Haplotag, the novel GBS software described here, is freely available, and operates with minimal user-investment on widely available computer platforms. Haplotag is unique in fulfilling the following set of criteria: (1) operates without a reference genome; (2) can be used in a polyploid species; (3) provides a discovery mode, and a production mode; (4) discovers polymorphisms based on a model of tag-level haplotypes within sequenced tags; (5) reports SNPs as well as haplotype-based genotypes; and (6) provides an intuitive visual “passport” for each inferred locus. Haplotag is optimized for use in a self-pollinating plant species. PMID:26818073
FIRST RESULTS FROM Z-FOURGE: DISCOVERY OF A CANDIDATE CLUSTER AT z = 2.2 IN COSMOS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Spitler, Lee R.; Glazebrook, Karl; Poole, Gregory B.
2012-04-01
We report the first results from the Z-FOURGE survey: the discovery of a candidate galaxy cluster at z = 2.2 consisting of two compact overdensities with red galaxies detected at ≳20σ above the mean surface density. The discovery was made possible by a new deep (K_s ≲ 24.8 AB, 5σ) Magellan/FOURSTAR near-IR imaging survey with five custom medium-bandwidth filters. The filters pinpoint the location of the Balmer/4000 Å break in evolved stellar populations at 1.5 < z < 3.5, yielding significantly more accurate photometric redshifts than possible with broadband imaging alone. The overdensities are within 1' of each other in the COSMOS field and appear to be embedded in a larger structure that contains at least one additional overdensity (~10σ). Considering the global properties of the overdensities, the z = 2.2 system appears to be the most distant example of a galaxy cluster with a population of red galaxies. A comparison to a large ΛCDM simulation suggests that the system may consist of merging subclusters, with properties in between those of z > 2 protoclusters with more diffuse distributions of blue galaxies and the lower-redshift galaxy clusters with prominent red sequences. The structure is completely absent in public optical catalogs in COSMOS and only weakly visible in a shallower near-IR survey. The discovery showcases the potential of deep near-IR surveys with medium-band filters to advance the understanding of environment and galaxy evolution at z > 1.5.
Bashir, Ali; Bansal, Vikas; Bafna, Vineet
2010-06-18
Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with its own characteristics, poses a number of design challenges regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimation of mRNA transcript abundance. For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability. Together, our results form a generic framework for many design considerations related to high-throughput sequencing.
We provide software tools http://bix.ucsd.edu/projects/NGS-DesignTools to derive platform independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
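The depth question raised in this abstract can be illustrated with a minimal sketch. It assumes a much simpler model than the authors' corrected expression distribution: each read independently comes from a given transcript with a fixed probability p, so the chance of seeing the transcript at least once in N reads is 1 - (1 - p)^N. The function names and the example abundance are hypothetical and are not taken from the NGS-DesignTools software.

```python
import math

def detection_probability(p, n_reads):
    """Probability of sampling a transcript at least once in n_reads,
    assuming each read is an independent draw with probability p (assumption)."""
    # 1 - (1 - p)^n, computed with log1p/expm1 for numerical stability at small p
    return -math.expm1(n_reads * math.log1p(-p))

def reads_needed(p, target=0.95):
    """Smallest read count whose detection probability reaches the target."""
    return math.ceil(math.log1p(-target) / math.log1p(-p))

# Hypothetical example: a transcript making up one in a thousand reads
depth = reads_needed(1e-3, target=0.95)
print(depth, detection_probability(1e-3, depth))
```

Under this simplified model the required depth scales roughly as 3/p for a 95% detection target, which is why rare transcripts dominate the depth budget.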
Relationship between aging and T1 relaxation time in deep gray matter: A voxel-based analysis.
Okubo, Gosuke; Okada, Tomohisa; Yamamoto, Akira; Fushimi, Yasutaka; Okada, Tsutomu; Murata, Katsutoshi; Togashi, Kaori
2017-09-01
To investigate age-related changes in T1 relaxation time in deep gray matter structures in healthy volunteers using magnetization-prepared 2 rapid acquisition gradient echoes (MP2RAGE). In all, 70 healthy volunteers (aged 20-76, mean age 42.6 years) were scanned at 3T magnetic resonance imaging (MRI). A MP2RAGE sequence was employed to quantify T1 relaxation times. After the spatial normalization of T1 maps with the diffeomorphic anatomical registration using the exponentiated Lie algebra algorithm, voxel-based regression analysis was conducted. In addition, linear and quadratic regression analyses of regions of interest (ROIs) were also performed. With aging, voxel-based analysis (VBA) revealed significant T1 value decreases in the ventral-inferior putamen, nucleus accumbens, and amygdala, whereas T1 values significantly increased in the thalamus and white matter as well (P < 0.05 at cluster level, false discovery rate). ROI analysis revealed that T1 values in the nucleus accumbens linearly decreased with aging (P = 0.0016), supporting the VBA result. T1 values in the thalamus (P < 0.0001), substantia nigra (P = 0.0003), and globus pallidus (P < 0.0001) had a best fit to quadratic curves, with the minimum T1 values observed between 30 and 50 years of age. Age-related changes in T1 relaxation time vary by location in deep gray matter. Level of Evidence: 2. Technical Efficacy: Stage 2. J. Magn. Reson. Imaging 2017;46:724-731. © 2017 International Society for Magnetic Resonance in Medicine.
Wang, Ruijia; Nambiar, Ram; Zheng, Dinghai
2018-01-01
PolyA_DB is a database cataloging cleavage and polyadenylation sites (PASs) in several genomes. Previous versions were based mainly on expressed sequence tags (ESTs), which were limited in number and could lead to inaccurate PAS identification due to the presence of internal A-rich sequences in transcripts. Here, we present an updated version of the database based solely on deep sequencing data. First, PASs are mapped by the 3′ region extraction and deep sequencing (3′READS) method, ensuring unequivocal PAS identification. Second, a large volume of data based on diverse biological samples increases PAS coverage by 3.5-fold over the EST-based version and provides PAS usage information. Third, strand-specific RNA-seq data are used to extend annotated 3′ ends of genes to obtain more thorough annotations of alternative polyadenylation (APA) sites. Fourth, conservation information of PAS across mammals sheds light on the significance of APA sites. The database (URL: http://www.polya-db.org/v3) currently holds PASs in human, mouse, rat and chicken, and has links to the UCSC genome browser for further visualization and for integration with other genomic data. PMID:29069441
Antibody Engineering and Therapeutics
Almagro, Juan Carlos; Gilliland, Gary L; Breden, Felix; Scott, Jamie K; Sok, Devin; Pauthner, Matthias; Reichert, Janice M; Helguera, Gustavo; Andrabi, Raiees; Mabry, Robert; Bléry, Mathieu; Voss, James E; Laurén, Juha; Abuqayyas, Lubna; Barghorn, Stefan; Ben-Jacob, Eshel; Crowe, James E; Huston, James S; Johnston, Stephen Albert; Krauland, Eric; Lund-Johansen, Fridtjof; Marasco, Wayne A; Parren, Paul WHI; Xu, Kai Y
2014-01-01
The 24th Antibody Engineering & Therapeutics meeting brought together a broad range of participants who were updated on the latest advances in antibody research and development. Organized by IBC Life Sciences, the gathering is the annual meeting of The Antibody Society, which serves as the scientific sponsor. Preconference workshops on 3D modeling and delineation of clonal lineages were featured, and the conference included sessions on a wide variety of topics relevant to researchers, including systems biology; antibody deep sequencing and repertoires; the effects of antibody gene variation and usage on antibody response; directed evolution; knowledge-based design; antibodies in a complex environment; polyreactive antibodies and polyspecificity; the interface between antibody therapy and cellular immunity in cancer; antibodies in cardiometabolic medicine; antibody pharmacokinetics, distribution and off-target toxicity; optimizing antibody formats for immunotherapy; polyclonals, oligoclonals and bispecifics; antibody discovery platforms; and antibody-drug conjugates. PMID:24589717
Jeanne, Nicolas; Saliou, Adrien; Carcenac, Romain; Lefebvre, Caroline; Dubois, Martine; Cazabat, Michelle; Nicot, Florence; Loiseau, Claire; Raymond, Stéphanie; Izopet, Jacques; Delobel, Pierre
2015-01-01
HIV-1 coreceptor usage must be accurately determined before starting CCR5 antagonist-based treatment as the presence of undetected minor CXCR4-using variants can cause subsequent virological failure. Ultra-deep pyrosequencing of HIV-1 V3 env allows the detection of low levels of CXCR4-using variants that current genotypic approaches miss. However, processing the mass of sequence data and identifying true minor variants while excluding artifactual sequences generated during amplification and ultra-deep pyrosequencing are rate-limiting. Arbitrary fixed cut-offs below which minor variants are discarded are currently used, but the errors generated during ultra-deep pyrosequencing are sequence-dependent rather than random. We have developed an automated processing of HIV-1 V3 env ultra-deep pyrosequencing data that uses biological filters to discard artifactual or non-functional V3 sequences followed by statistical filters to determine position-specific sensitivity thresholds, rather than arbitrary fixed cut-offs. It retains authentic sequences with point mutations at V3 positions of interest and discards artifactual ones with accurate sensitivity thresholds. PMID:26585833
Jeanne, Nicolas; Saliou, Adrien; Carcenac, Romain; Lefebvre, Caroline; Dubois, Martine; Cazabat, Michelle; Nicot, Florence; Loiseau, Claire; Raymond, Stéphanie; Izopet, Jacques; Delobel, Pierre
2015-11-20
HIV-1 coreceptor usage must be accurately determined before starting CCR5 antagonist-based treatment as the presence of undetected minor CXCR4-using variants can cause subsequent virological failure. Ultra-deep pyrosequencing of HIV-1 V3 env allows the detection of low levels of CXCR4-using variants that current genotypic approaches miss. However, processing the mass of sequence data and identifying true minor variants while excluding artifactual sequences generated during amplification and ultra-deep pyrosequencing are rate-limiting. Arbitrary fixed cut-offs below which minor variants are discarded are currently used, but the errors generated during ultra-deep pyrosequencing are sequence-dependent rather than random. We have developed an automated processing of HIV-1 V3 env ultra-deep pyrosequencing data that uses biological filters to discard artifactual or non-functional V3 sequences followed by statistical filters to determine position-specific sensitivity thresholds, rather than arbitrary fixed cut-offs. It retains authentic sequences with point mutations at V3 positions of interest and discards artifactual ones with accurate sensitivity thresholds.
Evolution of coreceptor utilization to escape CCR5 antagonist therapy.
Zhang, Jie; Gao, Xiang; Martin, John; Rosa, Bruce; Chen, Zheng; Mitreva, Makedonka; Henrich, Timothy; Kuritzkes, Daniel; Ratner, Lee
2016-07-01
The HIV-1 envelope interacts with coreceptors CCR5 and CXCR4 in a dynamic, multi-step process, its molecular details not clearly delineated. Use of CCR5 antagonists results in tropism shift and therapeutic failure. Here we describe a novel approach using full-length patient-derived gp160 quasispecies libraries cloned into HIV-1 molecular clones, their separation based on phenotypic tropism in vitro, and deep sequencing of the resultant variants for structure-function analyses. Analysis of functionally validated envelope sequences from patients who failed CCR5 antagonist therapy revealed determinants strongly associated with coreceptor specificity, especially at the gp120-gp41 and gp41-gp41 interaction surfaces that invite future research on the roles of subunit interaction and envelope trimer stability in coreceptor usage. This study identifies important structure-function relationships in HIV-1 envelope, and demonstrates proof of concept for a new integrated analysis method that facilitates laboratory discovery of resistant mutants to aid in development of other therapeutic agents. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
Quasispecies Analyses of the HIV-1 Near-full-length Genome With Illumina MiSeq
Ode, Hirotaka; Matsuda, Masakazu; Matsuoka, Kazuhiro; Hachiya, Atsuko; Hattori, Junko; Kito, Yumiko; Yokomaku, Yoshiyuki; Iwatani, Yasumasa; Sugiura, Wataru
2015-01-01
Human immunodeficiency virus type-1 (HIV-1) exhibits high between-host genetic diversity and within-host heterogeneity, recognized as quasispecies. Because HIV-1 quasispecies fluctuate in terms of multiple factors, such as antiretroviral exposure and host immunity, analyzing the HIV-1 genome is critical for selecting effective antiretroviral therapy and understanding within-host viral coevolution mechanisms. Here, to obtain HIV-1 genome sequence information that includes minority variants, we sought to develop a method for evaluating quasispecies throughout the HIV-1 near-full-length genome using the Illumina MiSeq benchtop deep sequencer. To ensure the reliability of minority mutation detection, we applied an analysis method of sequence read mapping onto a consensus sequence derived from de novo assembly followed by iterative mapping and subsequent unique error correction. Deep sequencing analyses of an HIV-1 clone showed that the analysis method reduced erroneous base prevalence below 1% in each sequence position and discarded only < 1% of all collected nucleotides, maximizing the usage of the collected genome sequences. Further, we designed primer sets to amplify the HIV-1 near-full-length genome from clinical plasma samples. Deep sequencing of 92 samples in combination with the primer sets and our analysis method provided sufficient coverage to identify >1%-frequency sequences throughout the genome. When we evaluated sequences of pol genes from 18 treatment-naïve patients' samples, the deep sequencing results were in agreement with Sanger sequencing and identified numerous additional minority mutations. The results suggest that our deep sequencing method would be suitable for identifying within-host viral population dynamics throughout the genome. PMID:26617593
Carissimo, Guillaume; Eiglmeier, Karin; Reveillaud, Julie; Holm, Inge; Diallo, Mawlouth; Diallo, Diawo; Vantaux, Amélie; Kim, Saorin; Ménard, Didier; Siv, Sovannaroth; Belda, Eugeni; Bischoff, Emmanuel; Antoniewski, Christophe; Vernick, Kenneth D.
2016-01-01
Mosquitoes of the Anopheles gambiae complex display strong preference for human bloodmeals and are major malaria vectors in Africa. However, their interaction with viruses or role in arbovirus transmission during epidemics has been little examined, with the exception of O’nyong-nyong virus, closely related to Chikungunya virus. Deep-sequencing has revealed different RNA viruses in natural insect viromes, but none have been previously described in the Anopheles gambiae species complex. Here, we describe two novel insect RNA viruses, a Dicistrovirus and a Cypovirus, found in laboratory colonies of An. gambiae taxa using small-RNA deep sequencing. Sequence analysis was done with Metavisitor, an open-source bioinformatic pipeline for virus discovery and de novo genome assembly. Wild-collected Anopheles from Senegal and Cambodia were positive for the Dicistrovirus and Cypovirus, displaying high sequence identity to the laboratory-derived virus. Thus, the Dicistrovirus (Anopheles C virus, AnCV) and Cypovirus (Anopheles Cypovirus, AnCPV) are components of the natural virome of at least some anopheline species. Their possible influence on mosquito immunity or transmission of other pathogens is unknown. These natural viruses could be developed as models for the study of Anopheles-RNA virus interactions in low security laboratory settings, in an analogous manner to the use of rodent malaria parasites for studies of mosquito anti-parasite immunity. PMID:27138938
BayesMotif: de novo protein sorting motif discovery from impure datasets.
Hu, Jianjun; Zhang, Fan
2010-01-18
Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors.
Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs which may help to overcome the limitations of PWM (position weight matrix) motif model.
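For readers unfamiliar with the PWM (position weight matrix) motif model that this abstract contrasts BayesMotif with, the following is a minimal sketch of the baseline model only, not of BayesMotif itself. It assumes a uniform background over the 20 amino acids and a fixed pseudocount; the function names, the example motif instances, and the example sequence are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(instances, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds PWM from equal-length motif instances,
    assuming a uniform background distribution (assumption)."""
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(len(instances[0])):
        counts = {a: pseudocount for a in alphabet}
        for inst in instances:
            counts[inst[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log2((counts[a] / total) / background) for a in alphabet})
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(col[ch] for col, ch in zip(pwm, window))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))

# Hypothetical instances of a C-terminal-like signal and a toy sequence
pwm = build_pwm(["KDEL", "KDEL", "HDEL", "KEEL"])
print(best_hit(pwm, "MKTAYIAKDELGG"))
```

Because each column is scored independently, a PWM cannot express features such as overall hydrophobicity or charge, which is the limitation the abstract refers to.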
Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.
Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A
2018-02-01
The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of information about a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use different stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the area of analysis must be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_Compute Unified Device Architecture (CUDA) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
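To make the notion of a degenerate motif in the 15-letter IUPAC code concrete, here is a small CPU-side sketch of matching one such motif against a DNA sequence. It is not the Argo_CUDA algorithm, which exhaustively enumerates candidate motifs on a GPU; the function names and the example motif and sequence are hypothetical.

```python
# The 15-letter IUPAC nucleotide code: each symbol stands for a set of bases
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def motif_matches(motif, window):
    """True if every base in window is allowed by the motif symbol at that position."""
    return len(motif) == len(window) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, window)
    )

def find_motif(motif, seq):
    """Return the 0-based start positions where the degenerate motif matches seq."""
    width = len(motif)
    return [i for i in range(len(seq) - width + 1)
            if motif_matches(motif, seq[i:i + width])]

# Hypothetical example: W matches A or T, R matches A or G
print(find_motif("TATAWR", "GGTATAAAGCTATATGC"))  # [2, 10]
```

An exhaustive search would repeat this kind of match for every candidate motif of the chosen length, which is the workload that motivates a GPU implementation.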
Avsec, Žiga; Cheng, Jun; Gagneur, Julien
2018-01-01
Motivation: Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its strength to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact: avsec@in.tum.de or gagneur@in.tum.de. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:29155928
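The idea of modeling a distance with a smooth spline can be sketched in plain Python as a weighted sum of B-spline basis functions, where the weights would be the trainable parameters. This is an illustrative assumption about how such a module might look, not the CONCISE implementation; the knot vector, the weights, and the function names are hypothetical.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x with the Cox-de Boor recursion.
    Uses half-open intervals, so x must be below the last knot."""
    # Degree-0 basis: indicator of the knot interval containing x
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (knots[i + d + 1] - x) / right_den * basis[i + 1] if right_den > 0 else 0.0
            new_basis.append(left + right)
        basis = new_basis
    return basis

def spline_transform(x, knots, weights, degree=3):
    """Smooth function of x: weighted sum of B-spline basis functions.
    In a neural network the weights would be learned (assumption)."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))

# Hypothetical clamped cubic knot vector on [0, 4), e.g. for a log-scaled distance
KNOTS = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]
WEIGHTS = [0.0, 0.5, 1.0, 0.2, -0.3, 0.1, 0.0]  # one weight per basis function
print(spline_transform(1.5, KNOTS, WEIGHTS))
```

Inside the knot range the basis functions sum to one and vary smoothly with x, so the fitted curve changes smoothly with the weights, in contrast to the kinks of a piecewise linear function built from rectified linear units.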
Burkholder, William F.; Newell, Evan W.; Poidinger, Michael; Chen, Swaine; Fink, Katja
2017-01-01
The inaugural workshop “Deep Sequencing in Infectious Diseases: Immune and Pathogen Repertoires for the Improvement of Patient Outcomes” was held in Singapore on 13–14 October 2016. The aim of the workshop was to discuss the latest trends in using high-throughput sequencing, bioinformatics, and allied technologies to analyze immune and pathogen repertoires and their interplay within the host, bringing together key international players in the field and Singapore-based researchers and clinician-scientists. The focus was in particular on the application of these technologies for the improvement of patient diagnosis, prognosis and treatment, and for other broad public health outcomes. The presentations by scientists and clinicians showed the potential of deep sequencing technology to capture the coevolution of adaptive immunity and pathogens. For clinical applications, some key challenges remain, such as the long turnaround time and relatively high cost of deep sequencing for pathogen identification and characterization and the lack of international standardization in immune repertoire analysis. PMID:28620372
Zhang, Yinan; Samee, Md. Abul Hassan; Halfon, Marc S.; Sinha, Saurabh
2014-01-01
Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like “long germband” development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250–350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as “training data” to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. 
Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution. PMID:25173756
Kazemian, Majid; Suryamohan, Kushal; Chen, Jia-Yu; Zhang, Yinan; Samee, Md Abul Hassan; Halfon, Marc S; Sinha, Saurabh
2014-09-01
Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like "long germband" development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250-350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as "training data" to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. 
Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution. © The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
The promises and pitfalls of RNA-interference-based therapeutics
Castanotto, Daniela; Rossi, John J.
2009-01-01
The discovery that gene expression can be controlled by the Watson–Crick base-pairing of small RNAs with messenger RNAs containing complementary sequence — a process known as RNA interference — has markedly advanced our understanding of eukaryotic gene regulation and function. The ability of short RNA sequences to modulate gene expression has provided a powerful tool with which to study gene function and is set to revolutionize the treatment of disease. Remarkably, despite being just one decade from its discovery, the phenomenon is already being used therapeutically in human clinical trials, and biotechnology companies that focus on RNA-interference-based therapeutics are already publicly traded. PMID:19158789
Gehring, Philip-Sebastian; Tolley, Krystal A; Eckhardt, Falk Sebastian; Townsend, Ted M; Ziegler, Thomas; Ratsoavina, Fanomezana; Glaw, Frank; Vences, Miguel
2012-01-01
We conducted a comprehensive molecular phylogenetic study for a group of chameleons from Madagascar (Chamaeleonidae: Calumma nasutum group, comprising seven nominal species) to examine the genetic and species diversity in this widespread genus. Based on DNA sequences of the mitochondrial gene (ND2) from 215 specimens, we reconstructed the phylogeny using a Bayesian approach. Our results show deep divergences among several unnamed mitochondrial lineages that are difficult to identify morphologically. We evaluated lineage diversification using a number of statistical phylogenetic methods (general mixed Yule-coalescent model; SpeciesIdentifier; net p-distances) to objectively delimit lineages that we here consider as operational taxonomic units (OTUs), and for which the taxonomic status remains largely unknown. In addition, we compared molecular and morphological differentiation in detail for one particularly diverse clade (the C. boettgeri complex) from northern Madagascar. To assess the species boundaries within this group we used an integrative taxonomic approach, combining evidence from two independent molecular markers (ND2 and CMOS), together with genital and other external morphological characters, and conclude that some of the newly discovered OTUs are separate species (confirmed candidate species, CCS), while others should best be considered as deep conspecific lineages (DCLs). Our analysis supports a total of 33 OTUs, of which seven correspond to described species, suggesting that the taxonomy of the C. nasutum group is in need of revision. PMID:22957155
Yildirim, Özal
2018-05-01
Long short-term memory networks (LSTMs), which have recently emerged in sequential data analysis, are the most widely used type of recurrent neural network (RNN) architecture. Progress on the topic of deep learning includes successful adaptations of deep versions of these architectures. In this study, a new model for deep bidirectional LSTM network-based wavelet sequences called DBLSTM-WS was proposed for classifying electrocardiogram (ECG) signals. For this purpose, a new wavelet-based layer is implemented to generate ECG signal sequences. The ECG signals were decomposed into frequency sub-bands at different scales in this layer. These sub-bands are used as sequences for the input of LSTM networks. New network models that include unidirectional (ULSTM) and bidirectional (BLSTM) structures are designed for performance comparisons. Experimental studies have been performed for five different types of heartbeats obtained from the MIT-BIH arrhythmia database. These five types are Normal Sinus Rhythm (NSR), Ventricular Premature Contraction (VPC), Paced Beat (PB), Left Bundle Branch Block (LBBB), and Right Bundle Branch Block (RBBB). The results show that the DBLSTM-WS model gives a high recognition performance of 99.39%. It has been observed that the wavelet-based layer proposed in the study significantly improves the recognition performance of conventional networks. This proposed network structure is an important approach that can be applied to similar signal processing problems. Copyright © 2018 Elsevier Ltd. All rights reserved.
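The decomposition of a signal into sub-bands at different scales, as described in the abstract above, can be illustrated with a minimal sketch. The abstract does not name the wavelet family or the number of levels, so the Haar wavelet, the function names `haar_step` and `wavelet_subbands`, and the even-length input are all assumptions made purely for illustration; this is not the authors' layer.

```python
import math

def haar_step(signal):
    """One level of a Haar transform: returns (approximation, detail), each half as long.

    Assumes an even number of samples (illustrative simplification).
    """
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_subbands(signal, levels):
    """Decompose into [detail_1, ..., detail_levels, approx_levels].

    Each entry is one sub-band; a network could consume these as input sequences.
    """
    bands = []
    current = list(signal)
    for _ in range(levels):
        current, detail = haar_step(current)
        bands.append(detail)
    bands.append(current)
    return bands

# Hypothetical 8-sample segment standing in for part of an ECG beat.
bands = wavelet_subbands([4, 6, 10, 12, 8, 6, 5, 5], levels=2)
```

Each level halves the length of the band, so coarser scales yield shorter sequences. The orthonormal scaling by 1/sqrt(2) keeps the total energy of the sub-bands equal to that of the input.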
A DNA Barcode Library for North American Pyraustinae (Lepidoptera: Pyraloidea: Crambidae).
Yang, Zhaofu; Landry, Jean-François; Hebert, Paul D N
2016-01-01
Although members of the crambid subfamily Pyraustinae are frequently important crop pests, their identification is often difficult because many species lack conspicuous diagnostic morphological characters. DNA barcoding employs sequence diversity in a short standardized gene region to facilitate specimen identifications and species discovery. This study provides a DNA barcode reference library for North American pyraustines based upon the analysis of 1589 sequences recovered from 137 nominal species, 87% of the fauna. Data from 125 species were barcode compliant (>500bp, <1% n), and 99 of these taxa formed a distinct cluster that was assigned to a single BIN. The other 26 species were assigned to 56 BINs, reflecting frequent cases of deep intraspecific sequence divergence and a few instances of barcode sharing, creating a total of 155 BINs. Two systems for OTU designation, ABGD and BIN, were examined to check the correspondence between current taxonomy and sequence clusters. The BIN system performed better than ABGD in delimiting closely related species, while OTU counts with ABGD were influenced by the value employed for relative gap width. Different species with low or no interspecific divergence may represent cases of unrecognized synonymy, whereas those with high intraspecific divergence require further taxonomic scrutiny as they may involve cryptic diversity. The barcode library developed in this study will also help to advance understanding of relationships among species of Pyraustinae.
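The barcode-compliance criterion quoted above (>500 bp, <1% n) can be written as a small check. This is a minimal sketch under stated assumptions: the function name `is_barcode_compliant` is hypothetical, ambiguous bases are taken to be the letter N only, and the N fraction is computed over the full sequence length; the abstract does not specify these details.

```python
def is_barcode_compliant(seq, min_len=500, max_n_frac=0.01):
    """Return True if seq is longer than min_len bases and fewer than
    max_n_frac of its positions are ambiguous 'N' calls (assumed definition)."""
    seq = seq.upper()
    if len(seq) <= min_len:
        return False
    return seq.count("N") / len(seq) < max_n_frac

# Hypothetical sequences used only to exercise the thresholds.
long_clean = "ACGT" * 150             # 600 bp, no N
too_short = "ACGT" * 100              # 400 bp
too_many_n = "ACGT" * 150 + "N" * 10  # 10 of 610 positions are N (about 1.6%)
print(is_barcode_compliant(long_clean), is_barcode_compliant(too_short),
      is_barcode_compliant(too_many_n))
```

The counts reported in the abstract are also internally consistent: 99 + 26 = 125 compliant species, and 99 + 56 = 155 BINs.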
Focused Crawling of the Deep Web Using Service Class Descriptions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rocco, D; Liu, L; Critchlow, T
2004-06-21
Dynamic Web data sources--sometimes known collectively as the Deep Web--increase the utility of the Web by providing intuitive access to data repositories anywhere that Web access is available. Deep Web services provide access to real-time information, like entertainment event listings, or present a Web interface to large databases or other data repositories. Recent studies suggest that the size and growth rate of the dynamic Web greatly exceed that of the static Web, yet dynamic content is often ignored by existing search engine indexers owing to the technical challenges that arise when attempting to search the Deep Web. To address these challenges, we present DynaBot, a service-centric crawler for discovering and clustering Deep Web sources offering dynamic content. DynaBot has three unique characteristics. First, DynaBot utilizes a service class model of the Web implemented through the construction of service class descriptions (SCDs). Second, DynaBot employs a modular, self-tuning system architecture for focused crawling of the Deep Web using service class descriptions. Third, DynaBot incorporates methods and algorithms for efficient probing of the Deep Web and for discovering and clustering Deep Web sources and services through SCD-based service matching analysis. Our experimental results demonstrate the effectiveness of the service class discovery, probing, and matching algorithms and suggest techniques for efficiently managing service discovery in the face of the immense scale of the Deep Web.
Starrett, James; Derkarabetian, Shahan; Richart, Casey H.; Cabrero, Allan; Hedin, Marshal
2016-01-01
The monotypic genus Cryptomaster Briggs, 1969 was described based on individuals from a single locality in southwestern Oregon. The described species Cryptomaster leviathan Briggs, 1969 was named for its large body size compared to most travunioid Laniatores. However, as the generic name suggests, Cryptomaster are notoriously difficult to find, and few subsequent collections have been recorded for this genus. Here, we increase sampling of Cryptomaster to 15 localities, extending their known range from the Coast Range northeast to the western Cascade Mountains of southern Oregon. Phylogenetic analyses of mitochondrial and nuclear DNA sequence data reveal deep phylogenetic breaks consistent with independently evolving lineages. We use discovery and validation species delimitation approaches to generate and test species hypotheses, including a coalescent species delimitation method to test multi-species hypotheses. For delimited species, we use light microscopy and SEM to discover diagnostic morphological characters. Although Cryptomaster has a small geographic distribution, this taxon is consistent with other short-range endemics in having deep phylogenetic breaks indicative of species level divergences. Herein we describe Cryptomaster behemoth sp. n., and provide morphological diagnostic characters for identifying Cryptomaster leviathan and Cryptomaster behemoth. PMID:26877685
Loeffler 4.0: Diagnostic Metagenomics.
Höper, Dirk; Wylezich, Claudia; Beer, Martin
2017-01-01
A new world of possibilities for "virus discovery" was opened up with high-throughput sequencing becoming available in the last decade. While metagenomic analysis was scientifically established before the start of the era of high-throughput sequencing, the availability of the first second-generation sequencers was the kick-off for diagnosticians to use sequencing for the detection of novel pathogens. Today, diagnostic metagenomics is becoming the standard procedure for the detection and genetic characterization of new viruses or novel virus variants. Here, we provide an overview of technical considerations of high-throughput sequencing-based diagnostic metagenomics together with selected examples of "virus discovery" for animal diseases or zoonoses and metagenomics for food safety or basic veterinary research. © 2017 Elsevier Inc. All rights reserved.
LookSeq: a browser-based viewer for deep sequencing data.
Manske, Heinrich Magnus; Kwiatkowski, Dominic P
2009-11-01
Sequencing a genome to great depth can be highly informative about heterogeneity within an individual or a population. Here we address the problem of how to visualize the multiple layers of information contained in deep sequencing data. We propose an interactive AJAX-based web viewer for browsing large data sets of aligned sequence reads. By enabling seamless browsing and fast zooming, the LookSeq program assists the user to assimilate information at different levels of resolution, from an overview of a genomic region to fine details such as heterogeneity within the sample. A specific problem, particularly if the sample is heterogeneous, is how to depict information about structural variation. LookSeq provides a simple graphical representation of paired sequence reads that is more revealing about potential insertions and deletions than are conventional methods.
VIP: an integrated pipeline for metagenomics of virus identification and discovery
Li, Yang; Wang, Hao; Nie, Kai; Zhang, Chen; Zhang, Yi; Wang, Ji; Niu, Peihua; Ma, Xuejun
2016-01-01
Identification and discovery of viruses using next-generation sequencing (NGS) technology is a fast-developing area with potentially wide application in clinical diagnostics, public health monitoring and novel virus discovery. However, the tremendous volume of sequence data from NGS studies poses great challenges for both the accuracy and the speed of NGS-based applications. Here we describe VIP (“Virus Identification Pipeline”), a one-touch computational pipeline for virus identification and discovery from metagenomic NGS data. VIP performs the following steps to achieve its goal: (i) mapping and filtering out background-related reads, (ii) extensive classification of reads on the basis of nucleotide and remote amino acid homology, and (iii) multiple k-mer based de novo assembly and phylogenetic analysis to provide evolutionary insight. We validated the feasibility and veracity of this pipeline with sequencing results of various types of clinical samples and public datasets. VIP has also contributed to timely virus diagnosis (~10 min) in acutely ill patients, demonstrating its potential in unbiased NGS-based clinical studies that demand a short turnaround time. VIP is released under GPLv3 and is available for free download at: https://github.com/keylabivdc/VIP. PMID:27026381
The Revolution Continues: Newly Discovered Systems Expand the CRISPR-Cas Toolkit.
Murugan, Karthik; Babu, Kesavan; Sundaresan, Ramya; Rajan, Rakhi; Sashital, Dipali G
2017-10-05
CRISPR-Cas systems defend prokaryotes against bacteriophages and mobile genetic elements and serve as the basis for revolutionary tools for genetic engineering. Class 2 CRISPR-Cas systems use single Cas endonucleases paired with guide RNAs to cleave complementary nucleic acid targets, enabling programmable sequence-specific targeting with minimal machinery. Recent discoveries of previously unidentified CRISPR-Cas systems have uncovered a deep reservoir of potential biotechnological tools beyond the well-characterized Type II Cas9 systems. Here we review the current mechanistic understanding of newly discovered single-protein Cas endonucleases. Comparison of these Cas effectors reveals substantial mechanistic diversity, underscoring the phylogenetic divergence of related CRISPR-Cas systems. This diversity has enabled further expansion of CRISPR-Cas biotechnological toolkits, with wide-ranging applications from genome editing to diagnostic tools based on various Cas endonuclease activities. These advances highlight the exciting prospects for future tools based on the continually expanding set of CRISPR-Cas systems. Copyright © 2017 Elsevier Inc. All rights reserved.
MicroRNA-based biotechnology for plant improvement.
Zhang, Baohong; Wang, Qinglian
2015-01-01
MicroRNAs (miRNAs) are an extensive class of newly discovered endogenous small RNAs, which negatively regulate gene expression at the post-transcriptional level. With the application of next-generation deep sequencing and advanced bioinformatics, miRNA-related study has been expanded to non-model plant species, and the number of identified miRNAs has dramatically increased in the past years. miRNAs play a critical role in almost all biological and metabolic processes, and provide a unique strategy for plant improvement. Here, we first briefly review the discovery, history, and biogenesis of miRNAs, then focus more on the application of miRNAs in plant breeding and on future directions. Increased plant biomass through controlling plant development and phase change has been one achievement of miRNA-based biotechnology; plant tolerance to abiotic and biotic stress was also significantly enhanced by regulating the expression of an individual miRNA. Both endogenous and artificial miRNAs may serve as important tools for plant improvement. © 2014 Wiley Periodicals, Inc.
Discovery and small RNA profile of Pecan mosaic-associated virus, a novel potyvirus of pecan trees.
Su, Xiu; Fu, Shuai; Qian, Yajuan; Zhang, Liqin; Xu, Yi; Zhou, Xueping
2016-05-26
A novel potyvirus was discovered in pecan (Carya illinoensis) showing leaf mosaic symptoms through the use of deep sequencing of small RNAs. The complete genome of this virus was determined to comprise 9,310 nucleotides (nt), and shared 24.0% to 58.9% nucleotide similarity with those of other Potyviridae viruses. The genome was deduced to encode a single open reading frame (polyprotein) on the plus strand. Phylogenetic analysis based on the whole genome sequence and coat protein amino acid sequence showed that this virus is most closely related to Lettuce mosaic virus. Using electron microscopy, the typical Potyvirus filamentous particles were identified in infected pecan leaves with mosaic symptoms. Our results clearly show that this virus is a new member of the genus Potyvirus in the family Potyviridae. The virus is tentatively named Pecan mosaic-associated virus (PMaV). Additionally, profiling of the PMaV-derived small RNA (PMaV-sRNA) showed that the most abundant PMaV-sRNAs were 21-nt in length. There are several hotspots for small RNA production along the PMaV genome; two 21-nt PMaV-sRNAs starting at 811 nt and 610 nt of the minus-strand genome were highly repeated.
Discovery and small RNA profile of Pecan mosaic-associated virus, a novel potyvirus of pecan trees
Su, Xiu; Fu, Shuai; Qian, Yajuan; Zhang, Liqin; Xu, Yi; Zhou, Xueping
2016-01-01
A novel potyvirus was discovered in pecan (Carya illinoensis) showing leaf mosaic symptoms through the use of deep sequencing of small RNAs. The complete genome of this virus was determined to comprise 9,310 nucleotides (nt), and shared 24.0% to 58.9% nucleotide similarity with those of other Potyviridae viruses. The genome was deduced to encode a single open reading frame (polyprotein) on the plus strand. Phylogenetic analysis based on the whole genome sequence and coat protein amino acid sequence showed that this virus is most closely related to Lettuce mosaic virus. Using electron microscopy, the typical Potyvirus filamentous particles were identified in infected pecan leaves with mosaic symptoms. Our results clearly show that this virus is a new member of the genus Potyvirus in the family Potyviridae. The virus is tentatively named Pecan mosaic-associated virus (PMaV). Additionally, profiling of the PMaV-derived small RNA (PMaV-sRNA) showed that the most abundant PMaV-sRNAs were 21-nt in length. There are several hotspots for small RNA production along the PMaV genome; two 21-nt PMaV-sRNAs starting at 811 nt and 610 nt of the minus-strand genome were highly repeated. PMID:27226228
Providing Multi-Page Data Extraction Services with XWRAPComposer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Ling; Zhang, Jianjun; Han, Wei
2008-04-30
Dynamic Web data sources – sometimes known collectively as the Deep Web – increase the utility of the Web by providing intuitive access to data repositories anywhere that Web access is available. Deep Web services provide access to real-time information, like entertainment event listings, or present a Web interface to large databases or other data repositories. Recent studies suggest that the size and growth rate of the dynamic Web greatly exceed that of the static Web, yet dynamic content is often ignored by existing search engine indexers owing to the technical challenges that arise when attempting to search the Deep Web. To address these challenges, we present DYNABOT, a service-centric crawler for discovering and clustering Deep Web sources offering dynamic content. DYNABOT has three unique characteristics. First, DYNABOT utilizes a service class model of the Web implemented through the construction of service class descriptions (SCDs). Second, DYNABOT employs a modular, self-tuning system architecture for focused crawling of the Deep Web using service class descriptions. Third, DYNABOT incorporates methods and algorithms for efficient probing of the Deep Web and for discovering and clustering Deep Web sources and services through SCD-based service matching analysis. Our experimental results demonstrate the effectiveness of the service class discovery, probing, and matching algorithms and suggest techniques for efficiently managing service discovery in the face of the immense scale of the Deep Web.
NASA Astrophysics Data System (ADS)
Chen, Xinyuan; Song, Li; Yang, Xiaokang
2016-09-01
Video denoising can be described as the problem of mapping a sequence of noisy frames of a specific length to a clean one. We propose a deep architecture based on recurrent neural networks (RNNs) for video denoising. The model learns a patch-based end-to-end mapping between the clean and noisy video sequences. It takes the corrupted video sequences as the input and outputs the clean one. Our deep network, which we refer to as deep Recurrent Neural Networks (deep RNNs or DRNNs), stacks RNN layers where each layer receives the hidden state of the previous layer as input. Experiments show that (i) the recurrent architecture extracts motion information through the temporal domain, which benefits video denoising; (ii) the deep architecture has enough capacity to express the mapping between corrupted videos as input and clean videos as output; and (iii) the model is general enough to learn different mappings from videos corrupted by different types of noise (e.g., Poisson-Gaussian noise). By training on large video databases, we are able to compete with some existing video denoising methods.
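The stacking described above, where each layer receives the hidden state of the previous layer as input, can be sketched in a few lines. This is an illustrative forward pass only, not the authors' model: the tanh cell, the random initialisation, the layer sizes, and the names `make_layer`, `rnn_cell` and `stacked_rnn` are all assumptions, and the output layer that would map the top hidden state back to a clean patch is omitted.

```python
import math
import random

def make_layer(n_in, n_hidden, rng):
    """Small random weights for one tanh RNN layer (hypothetical initialisation)."""
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_hidden)]
    return {"w_in": w_in, "w_rec": w_rec, "bias": [0.0] * n_hidden}

def rnn_cell(layer, x, h_prev):
    """One step: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(layer["w_in"][j], x))
            + sum(w * hi for w, hi in zip(layer["w_rec"][j], h_prev))
            + layer["bias"][j]
        )
        for j in range(len(layer["bias"]))
    ]

def stacked_rnn(layers, frames):
    """Run a sequence through the stack; return the top-layer hidden state per step."""
    states = [[0.0] * len(layer["bias"]) for layer in layers]
    outputs = []
    for x in frames:
        inp = x
        for k, layer in enumerate(layers):
            states[k] = rnn_cell(layer, inp, states[k])
            inp = states[k]  # this layer's hidden state is the next layer's input
        outputs.append(inp)
    return outputs

rng = random.Random(0)
layers = [make_layer(4, 3, rng), make_layer(3, 2, rng)]  # two stacked layers
frames = [[0.1, 0.2, 0.3, 0.4] for _ in range(5)]        # five flattened toy "patches"
top_states = stacked_rnn(layers, frames)
```

Each layer keeps its own recurrent state across time steps, so temporal information is carried within a layer while depth is obtained by feeding hidden states upward.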
Messina, Enzo; Sorokin, Dimitry Y; Kublanov, Ilya V; Toshchakov, Stepan; Lopatina, Anna; Arcadi, Erika; Smedile, Francesco; La Spada, Gina; La Cono, Violetta; Yakimov, Michail M
2016-01-01
Strain M27-SA2 was isolated from the deep-sea salt-saturated anoxic lake Medee, which represents one of the most hostile extreme environments on our planet. On the basis of physiological studies and phylogenetic positioning this extremely halophilic euryarchaeon belongs to a novel genus 'Halanaeroarchaeum' within the family Halobacteriaceae. All members of this genus cultivated so far are strict anaerobes using acetate as the sole carbon and energy source and elemental sulfur as electron acceptor. Here we report the complete genome sequence of the strain M27-SA2 which is composed of a 2,129,244-bp chromosome and a 124,256-bp plasmid. This is the second complete genome sequence within the genus Halanaeroarchaeum. We demonstrate that genome of 'Halanaeroarchaeum sulfurireducens' M27-SA2 harbors complete metabolic pathways for acetate and sulfur catabolism and for de novo biosynthesis of 19 amino acids. The genomic analysis also reveals that 'Halanaeroarchaeum sulfurireducens' M27-SA2 harbors two prophage loci and one CRISPR locus, highly similar to that of Kulunda Steppe (Altai, Russia) isolate 'H. sulfurireducens' HSR2(T). The discovery of sulfur-respiring acetate-utilizing haloarchaeon in deep-sea hypersaline anoxic lakes has certain significance for understanding the biogeochemical functioning of these harsh ecosystems, which are incompatible with life for common organisms. Moreover, isolations of Halanaeroarchaeum members from geographically distant salt-saturated sites of different origin suggest a high degree of evolutionary success in their adaptation to this type of extreme biotopes around the world.
Drilling through the Messinian evaporites: the beginning of a new adventure?
NASA Astrophysics Data System (ADS)
Bassetti, M. A.; Lofi, J.
2009-04-01
The sensitivity of past environments tells us a lot about the nature of changes, whether of climatic or geodynamic origin. In this respect, the Mediterranean basin represents the ideal natural laboratory for studying the interaction between deep processes, tectonics, sedimentary fluxes and sea-level oscillation that are at the origin of the sedimentary records. A spectacular example of the reactivity of this system occurred less than 6 Myr ago, when the pan-Mediterranean realm underwent rapid and abrupt changes of paleo-environmental parameters that led to the well-known Messinian Salinity Crisis (MSC, Hsü et al., 1973). This event, short-term at the geological scale (~5.96-5.32 Ma), resulted from the progressive closure of the two-way connection between the Atlantic Ocean and the Mediterranean Sea. The most important characteristics of this event are: (1) a reduction of the Atlantic water supply, leading to increased salinity and the precipitation of thick evaporites within shallow-water marginal basins (presently disconnected from the deep basins); (2) a subsequent major sea-level fall exceeding 2000 m, resulting in the massive erosion of the margins and the development of deep subaerial canyons; (3) the accumulation of the products of this erosion in the downslope domain of the margins; (4) the deposition of thick evaporites (up to 3000 m thick) above the deep Mediterranean abyssal plains; and (5) a very rapid refilling of the Mediterranean basin during the Latest Miocene/Lower Pliocene, following the reconnection between the Atlantic and the Mediterranean through the Strait of Gibraltar. The timing, causes and chronology of the MSC are not yet fully understood, but different scenarios have been proposed to explain in detail the modalities of this catastrophic event. Certainly, the ongoing discussion about not fully conclusive interpretations is mainly linked to the fact that, so far, only the deepest, buried Mediterranean basins might offer the most complete sequence from the Messinian to the Quaternary. Anywhere else, the MSC mostly generated a sedimentary/time lag corresponding to a widespread erosion surface extending from onshore down to the lower slopes of the margins. Onland, Messinian outcrops (e.g. Morocco, Cyprus, Spain, Italy…) are all incomplete and pre-date the drawdown phase and/or are tectonically/geometrically disconnected from the deep basin sequence. Correlations with the offshore depositional units are thus complex, preventing the construction of a coherent scenario of the MSC linking the outcropping evaporites, the erosion of the margins, and the deposition of clastics and deep evaporites in the abyssal plains. The discovery of the Messinian evaporites in the Mediterranean is probably one of the major achievements of the DSDP program. Unfortunately, the Joides Resolution never drilled through the evaporites because this was technically impossible (non-riser drilling vessel). Only the upper few meters of the pinch-out of the deep basin sequence have been recovered. Thus, all hypotheses are based on onland outcropping evaporites and offshore seismic data interpretations. Improved quality of seismic data has allowed some important advances in the recognition and understanding of Messinian markers (erosion surfaces, depositional units and bounding surfaces), but without the recovery of the full succession, all interpretations lack lithological and stratigraphical calibrations. At present, several basic questions are still open:
- What is the true nature of the deep basin depositional units? What are their ages and chronologies?
- What was the water depth before, during and after halite deposition in the deep basin? Did the basin(s) completely dry out? What are the associated amplitude and dynamics of the base-level changes?
- Did the desiccation impact the regional climate and river run-off? What about climatic variability during the drawdown phase?
- What was the balance between erosion and sedimentation during the crisis? What are the vertical movements (tectonic/isostatic responses) associated with margin unloading and basin loading?
- What are the present-day fluid dynamics related to the salt layer? What is their impact on the deep biosphere?
The answers to all of these questions can only come from drilling through the complete Messinian succession. It would represent an outstanding opportunity to unravel the history of extreme environmental changes during the Messinian and a unique chance to constrain the age, nature and paleo-environment of deposition of the deep-basin Messinian sequence. For that reason, in the framework of the IODP drilling program, we propose to sample and log two different sites in the western and eastern Mediterranean basins, with the new scientific riser drillship Chikyu, which is perfectly adapted to overcome all safety problems. In order to obtain a continuous sedimentary record of the MSC, starting from the pre-crisis paleo-environmental changes, the sites should be drilled in areas where the Messinian salt is tabular and free of significant tectonic influence. A complete set of integrated studies (sedimentology, geochemistry, micropaleontology, bio- and cyclostratigraphy) should be carried out. This project opens the perspective of a new intellectual and scientific adventure that we expect to be as rich and exciting as the discovery of this unusual event was.
The Expanding Family of Virophages.
Bekliz, Meriem; Colson, Philippe; La Scola, Bernard
2016-11-23
Virophages replicate with giant viruses in the same eukaryotic cells. They are a major component of the specific mobilome of mimiviruses. Since their discovery in 2008, five other representatives have been isolated, 18 new genomes have been described, two of which are nearly completely sequenced, and they have been classified in a new viral family, Lavidaviridae. Virophages are small viruses with icosahedral capsids approximately 35-74 nm in size and double-stranded DNA genomes of 17-29 kbp containing 16-34 genes, among which a very small set is shared with giant viruses. Virophages have been isolated or detected in various locations and in a broad range of habitats worldwide, including the deep ocean and inland. Humans, therefore, could be commonly exposed to virophages, although currently limited evidence exists of their presence in humans based on serology and metagenomics. The distribution of virophages, the consequences of their infection and the interactions with their giant viral hosts within eukaryotic cells deserve further research.
Sarmiento-Vizcaíno, Aida; González, Verónica; Braña, Alfredo F; Palacios, Juan J; Otero, Luis; Fernández, Jonathan; Molina, Axayacatl; Kulik, Andreas; Vázquez, Fernando; Acuña, José L; García, Luis A; Blanco, Gloria
2017-02-01
Marine Actinobacteria are emerging as an unexplored source for natural product discovery. Eighty-seven deep-sea coral reef invertebrates were collected during an oceanographic expedition at the submarine Avilés Canyon (Asturias, Spain) at depths ranging from 1500 to 4700 m. From these, 18 cultivable bioactive Actinobacteria were isolated, mainly from corals, phylum Cnidaria, and some specimens of phyla Echinodermata, Porifera, Annelida, Arthropoda, Mollusca and Sipuncula. As determined by 16S rRNA sequencing and phylogenetic analyses, all isolates belong to the phylum Actinobacteria, mainly to the Streptomyces genus and also to Micromonospora, Pseudonocardia and Myceligenerans. Production of bioactive compounds of pharmacological interest was investigated by high-performance liquid chromatography (HPLC) and gas chromatography-mass spectrometry (GC-MS) techniques and subsequent database comparison. Results reveal that Actinobacteria isolated from the deep sea display a wide repertoire of secondary metabolite production with a high chemical diversity. Most identified products (both diffusible and volatile) are known for their proven antibiotic or antitumor activities. Bioassays with ethyl acetate extracts from isolates displayed strong antibiotic activities against a panel of important resistant clinical pathogens, including Gram-positive and Gram-negative bacteria, as well as fungi, all of them isolated at two main hospitals (HUCA and Cabueñes) from the same geographical region. The identity of the active extract components of these producing Actinobacteria is currently being investigated, given their potential for the discovery of pharmaceuticals and other products of biotechnological interest.
NASA Technical Reports Server (NTRS)
Wissler, Steven S.; Maldague, Pierre; Rocca, Jennifer; Seybold, Calina
2006-01-01
The Deep Impact mission was ambitious and challenging. JPL's well proven, easily adaptable multi-mission sequence planning tools combined with integrated spacecraft subsystem models enabled a small operations team to develop, validate, and execute extremely complex sequence-based activities within very short development times. This paper focuses on the core planning tool used in the mission, APGEN. It shows how the multi-mission design and adaptability of APGEN made it possible to model spacecraft subsystems as well as ground assets throughout the lifecycle of the Deep Impact project, starting with models of initial, high-level mission objectives, and culminating in detailed predictions of spacecraft behavior during mission-critical activities.
Zhang, Lu; Tan, Jianjun; Han, Dan; Zhu, Hao
2017-11-01
Machine intelligence, which is normally presented as artificial intelligence, refers to the intelligence exhibited by computers. In the history of rational drug discovery, various machine intelligence approaches have been applied to guide traditional experiments, which are expensive and time-consuming. Over the past several decades, machine-learning tools, such as quantitative structure-activity relationship (QSAR) modeling, were developed that can identify potential biological active molecules from millions of candidate compounds quickly and cheaply. However, when drug discovery moved into the era of 'big' data, machine learning approaches evolved into deep learning approaches, which are a more powerful and efficient way to deal with the massive amounts of data generated from modern drug discovery approaches. Here, we summarize the history of machine learning and provide insight into recently developed deep learning approaches and their applications in rational drug discovery. We suggest that this evolution of machine intelligence now provides a guide for early-stage drug design and discovery in the current big data era. Copyright © 2017 Elsevier Ltd. All rights reserved.
Applications of Deep Learning in Biomedicine.
Mamoshina, Polina; Vieira, Armando; Putin, Evgeny; Zhavoronkov, Alex
2016-05-02
Increases in throughput and installed base of biomedical research equipment led to a massive accumulation of -omics data known to be highly variable, high-dimensional, and sourced from multiple often incompatible data platforms. While this data may be useful for biomarker identification and drug discovery, the bulk of it remains underutilized. Deep neural networks (DNNs) are efficient algorithms based on the use of compositional layers of neurons, with advantages well matched to the challenges -omics data presents. While achieving state-of-the-art results and even surpassing human accuracy in many challenging tasks, the adoption of deep learning in biomedicine has been comparatively slow. Here, we discuss key features of deep learning that may give this approach an edge over other machine learning methods. We then consider limitations and review a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility.
Discovery of asphalt seeps in the deep Southwest Atlantic off Brazil
NASA Astrophysics Data System (ADS)
Fujikura, Katsunori; Yamanaka, Toshiro; Sumida, Paulo Y. G.; Bernardino, Angelo F.; Pereira, Olivia S.; Kanehara, Toshiyuki; Nagano, Yuriko; Nakayama, Cristina R.; Nobrega, Marcos; Pellizari, Vivian H.; Shigeno, Shuichi; Yoshida, Takao; Zhang, Jing; Kitazato, Hiroshi
2017-12-01
The discovery and description of cold seeps with deep-sea chemosynthetic communities in the Southwest Atlantic Ocean are still incomplete, despite the large proven oil and gas reserves off the coast of Brazil. In the southeastern Brazilian continental margin, where over 71% of the country's oil and gas production takes place, there is previous geological and qualitative biological evidence of seep biota associated with pockmarks on the upper slope of the Santos Basin. In order to further study seep ecosystems on the Brazilian margin, a deep-sea investigation named Iatá-Piúna cruise was conducted using the human-occupied vehicle Shinkai 6500 off Brazil's southeast continental margin. Asphalt seeps were discovered on the seafloor of the North São Paulo Plateau from depths of 2652-2752 m, representing only the third discovery of this type of seep worldwide, following those in the Gulf of Mexico and off Angola. Video and isotopic analyses indicated a number of megabenthic animals in the asphalt seeps in the North São Paulo Plateau and revealed typical deep-sea heterotrophic and photosynthesis-based fauna occupying hard substrates provided by the asphalt seep. There was no evidence of chemosynthesis-based megabenthic fauna such as vesicomyid clams, Bathymodiolus mussels, and siboglinid tube worms, or any sediment bacterial mats, gas seepage, and carbonate rock in/around the seeps. The benthic fauna was composed mainly of sponges (ca. 15 species), such as the hexactinellids Caulophacus sp., Poliopogon amadou, Saccocalyx pedunculatus, Farrea occa and cf. Chonelasma choanoides, in addition to typical deep-sea isidid octocorals, brisingid starfishes and galatheid crabs. The δ13C values of poriferan sponges suggested a heterotrophic and pelagic nutrition. Geochemical analyses of asphalt revealed a heavy biodegradation of hydrocarbon molecules, supported by the depletion of light n-alkanes and other labile compounds. 
This advanced asphalt biodegradation is the likely reason for the absence of chemosynthetic communities at these seep sites.
Guiding principles for peptide nanotechnology through directed discovery.
Lampel, A; Ulijn, R V; Tuttle, T
2018-05-21
Life's diverse molecular functions are largely based on only a small number of highly conserved building blocks - the twenty canonical amino acids. These building blocks are chemically simple, but when they are organized in three-dimensional structures of tremendous complexity, new properties emerge. This review explores recent efforts in the directed discovery of functional nanoscale systems and materials based on these same amino acids, but that are not guided by copying or editing biological systems. The review summarises insights obtained using three complementary approaches of searching the sequence space to explore sequence-structure relationships for assembly, reactivity and complexation, namely: (i) strategic editing of short peptide sequences; (ii) computational approaches to predicting and comparing assembly behaviours; (iii) dynamic peptide libraries that explore the free energy landscape. These approaches give rise to guiding principles on controlling order/disorder, complexation and reactivity by peptide sequence design.
DOE Office of Scientific and Technical Information (OSTI.GOV)
With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.
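The abstract above says only that the algorithm is based on k-mer analysis. As an illustration of how k-mers can expose SNPs without alignment, the following minimal Python sketch groups odd-length k-mers by their flanking bases and reports a candidate SNP wherever genomes share the flanks but differ at the centre base. The function name `kmer_snps`, the default `k`, and the rule of discarding flanks that are ambiguous within a genome are assumptions made for this example; they are not taken from the described tool.

```python
from collections import defaultdict


def kmer_snps(genomes, k=5):
    """Return {(left_flank, right_flank): {genome_name: centre_base}}.

    A locus is reported when at least two genomes contain the same
    flanking sequence but disagree on the centre base of the k-mer.
    Hypothetical helper for illustration only.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that each k-mer has a centre base")
    half = k // 2

    # flank -> genome name -> set of centre bases observed in that genome
    loci = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            flank = (kmer[:half], kmer[half + 1:])
            loci[flank][name].add(kmer[half])

    snps = {}
    for flank, per_genome in loci.items():
        # Skip flanks that occur with several centre bases inside one genome.
        if any(len(bases) != 1 for bases in per_genome.values()):
            continue
        alleles = {name: next(iter(bases)) for name, bases in per_genome.items()}
        if len(alleles) >= 2 and len(set(alleles.values())) > 1:
            snps[flank] = alleles
    return snps
```

For two toy sequences that differ at one position, `kmer_snps({"g1": "ACGTACC", "g2": "ACGAACC"}, k=5)` reports the single locus flanked by `CG` and `AC`. A realistic tool would also need to handle reverse complements and much longer k-mers, which this sketch omits.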
The third annual BRDS on research and development of nucleic acid-based nanomedicines
Chaudhary, Amit Kumar
2017-01-01
The completion of the Human Genome Project, the decrease in sequencing cost, and the correlation of genome sequencing data with specific diseases have led to an exponential rise in nucleic acid-based therapeutic approaches. In the third annual Biopharmaceutical Research and Development Symposium (BRDS) held at the Center for Drug Discovery and Lozier Center for Pharmacy Sciences and Education at the University of Nebraska Medical Center (UNMC), we highlighted the remarkable features of the nucleic acid-based nanomedicines, their significance, NIH funding opportunities on nanomedicines and gene therapy research, challenges and opportunities in the clinical translation of nucleic acids into therapeutics, and the role of intellectual property (IP) in drug discovery and development. PMID:27848223
Discovering discovery patterns with Predication-based Semantic Indexing.
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W; Davies, Peter; Rindflesch, Thomas C
2012-12-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as "discovery patterns", such as "drug x INHIBITS substance y, substance y CAUSES disease z" that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. Copyright © 2012 Elsevier Inc. All rights reserved.
Discovering discovery patterns with predication-based Semantic Indexing
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W.; Davies, Peter; Rindflesch, Thomas C.
2012-01-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as “discovery patterns”, such as “drug x INHIBITS substance y, substance y CAUSES disease z” that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. PMID:22841748
DOE Office of Scientific and Technical Information (OSTI.GOV)
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna
Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. In this paper, we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains, including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated the identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. Finally, these efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches.
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna; ...
2015-04-09
Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. In this paper, we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains, including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated the identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. Finally, these efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches.
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna; Sarkar, Anindita; Li, Jie; Ziemert, Nadine; Wang, Mingxun; Bandeira, Nuno; Moore, Bradley S.; Dorrestein, Pieter C.; Jensen, Paul R.
2015-01-01
Summary Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. Here we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated the identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. These efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches. PMID:25865308
Geoseq: a tool for dissecting deep-sequencing datasets.
Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi
2010-10-12
Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
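The tiling step described above (cut the reference into tiles, then count each tile in a read library via a suffix array) can be sketched in a few lines of Python. This is an illustrative toy, not Geoseq's implementation: the function names, the `$` read separator, the tile length and step, and the naive suffix-array construction are all assumptions, and the construction shown is only practical for very small inputs.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern in text by binary search on the suffix array."""
    m = len(pattern)

    def bound(upper):
        # upper=False: first suffix whose m-char prefix is >= pattern
        # upper=True:  first suffix whose m-char prefix is >  pattern
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (upper and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def tile_coverage(reference, reads, tile_len=4, step=2):
    """Return [(tile_start, hits)] for each tile of the reference."""
    # '$' separates reads so that a tile cannot match across a read boundary.
    text = "$".join(reads) + "$"
    sa = build_suffix_array(text)
    coverage = []
    for start in range(0, len(reference) - tile_len + 1, step):
        tile = reference[start:start + tile_len]
        coverage.append((start, count_occurrences(text, sa, tile)))
    return coverage
```

Calling `tile_coverage("ACGTACGT", ["ACGTAC", "GTACGT", "ACGTTT"])` returns one hit count per tile start, which is the kind of per-tile coverage profile that could then be plotted along the reference.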
Gibson, Richard M.; Meyer, Ashley M.; Winner, Dane; Archer, John; Feyertag, Felix; Ruiz-Mateos, Ezequiel; Leal, Manuel; Robertson, David L.; Schmotzer, Christine L.
2014-01-01
With 29 individual antiretroviral drugs available from six classes that are approved for the treatment of HIV-1 infection, a combination of different phenotypic and genotypic tests is currently needed to monitor HIV-infected individuals. In this study, we developed a novel HIV-1 genotypic assay based on deep sequencing (DeepGen HIV) to simultaneously assess HIV-1 susceptibilities to all drugs targeting the three viral enzymes and to predict HIV-1 coreceptor tropism. Patient-derived gag-p2/NCp7/p1/p6/pol-PR/RT/IN- and env-C2V3 PCR products were sequenced using the Ion Torrent Personal Genome Machine. Reads spanning the 3′ end of the Gag, protease (PR), reverse transcriptase (RT), integrase (IN), and V3 regions were extracted, truncated, translated, and assembled for genotype and HIV-1 coreceptor tropism determination. DeepGen HIV consistently detected both minority drug-resistant viruses and non-R5 HIV-1 variants from clinical specimens with viral loads of ≥1,000 copies/ml and from B and non-B subtypes. Additional mutations associated with resistance to PR, RT, and IN inhibitors, previously undetected by standard (Sanger) population sequencing, were reliably identified at frequencies as low as 1%. DeepGen HIV results correlated with phenotypic (original Trofile, 92%; enhanced-sensitivity Trofile assay [ESTA], 80%; TROCAI, 81%; and VeriTrop, 80%) and genotypic (population sequencing/Geno2Pheno with a 10% false-positive rate [FPR], 84%) HIV-1 tropism test results. DeepGen HIV (83%) and Trofile (85%) showed similar concordances with the clinical response following an 8-day course of maraviroc monotherapy (MCT). In summary, this novel all-inclusive HIV-1 genotypic and coreceptor tropism assay, based on deep sequencing of the PR, RT, IN, and V3 regions, permits simultaneous multiplex detection of low-level drug-resistant and/or non-R5 viruses in up to 96 clinical samples. 
This comprehensive test, the first of its class, will be instrumental in the development of new antiretroviral drugs and, more importantly, will aid in the treatment and management of HIV-infected individuals. PMID:24468782
Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions
2014-01-01
Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads. PMID:24428920
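One simple way to frame low-frequency variant detection against sequencing error is a binomial tail test: given the read depth at a site and an assumed per-base error rate, how likely is it that errors alone produced at least the observed number of variant reads? The Python sketch below illustrates that idea only; it is not a method taken from the review, and the function name, the uniform error rate, and the example numbers are assumptions.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(depth, alt_count, error_rate):
    """P(X >= alt_count) for X ~ Binomial(depth, error_rate), with 0 < error_rate < 1.

    Terms are evaluated in log space so that large depths do not overflow.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    total = 0.0
    for k in range(alt_count, depth + 1):
        log_term = (
            lgamma(depth + 1) - lgamma(k + 1) - lgamma(depth - k + 1)
            + k * log(error_rate)
            + (depth - k) * log1p(-error_rate)
        )
        total += exp(log_term)
    return min(total, 1.0)


# Hypothetical site: 1,000 reads, 25 carry the variant, assumed 1% error rate.
p_value = binomial_tail(1000, 25, 0.01)
```

A small tail probability suggests the variant reads are unlikely to be explained by error alone. Real callers typically also account for base qualities, strand bias, and multiple testing across sites, none of which is modelled here.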
Counting of oligomers in sequences generated by markov chains for DNA motif discovery.
Shan, Gao; Zheng, Wei-Mou
2009-02-01
By means of the technique of the imbedded Markov chain, an efficient algorithm is proposed to exactly calculate the first and second moments of word counts and the probability for a word to occur at least once in random texts generated by a Markov chain. A generating function is introduced directly from the imbedded Markov chain to derive asymptotic approximations for the problem. Two Z-scores, one based on the number of sequences with hits and the other on the total number of word hits in a set of sequences, are examined for discovery of motifs on a set of promoter sequences extracted from the A. thaliana genome. Source code is available at http://www.itp.ac.cn/zheng/oligo.c.
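To make the quantity "probability for a word to occur at least once" concrete, the Python sketch below computes it exactly for a first-order Markov chain by dynamic programming over a pattern-matching automaton, tracking the probability mass that has not yet seen the word. This is a generic textbook-style construction written for illustration; it is not the code at the URL above, and the function names and the dictionary format for the initial and transition probabilities are assumptions.

```python
def kmp_automaton(word, alphabet):
    """delta[j][c]: longest prefix of word that is a suffix of word[:j] + c."""
    m = len(word)
    delta = [dict() for _ in range(m)]
    for j in range(m):
        for c in alphabet:
            s = word[:j] + c
            k = min(m, len(s))
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[j][c] = k
    return delta


def prob_at_least_once(word, n, init, trans):
    """P(word occurs at least once in a length-n text from a Markov chain).

    init[c] is the probability of the first character c;
    trans[c][d] is the probability that d follows c.
    """
    m = len(word)
    if n < m:
        return 0.0
    alphabet = list(init)
    delta = kmp_automaton(word, alphabet)

    # state[(j, c)] = P(no hit so far, j characters matched, last character c)
    state = {}
    for c in alphabet:
        j = delta[0][c]
        if j < m:
            state[(j, c)] = state.get((j, c), 0.0) + init[c]

    for _ in range(n - 1):
        nxt = {}
        for (j, c), p in state.items():
            for d in alphabet:
                k = delta[j][d]
                if k < m:  # k == m means the word just occurred; drop that mass
                    nxt[(k, d)] = nxt.get((k, d), 0.0) + p * trans[c][d]
        state = nxt

    return 1.0 - sum(state.values())
```

With a uniform, independent model over A, C, G and T (every entry of `init` and `trans` equal to 0.25), the word `AA` in a text of length 3 gives 7/64, which matches a direct inclusion-exclusion count and serves as a quick sanity check.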
Deep Molecular Diversity of Mammalian Synapses: Why It Matters and How to Measure It
O’Rourke, Nancy A.; Weiler, Nick C.; Micheva, Kristina D.; Smith, Stephen J
2013-01-01
Summary Pioneering studies during the middle of the twentieth century revealed substantial diversity amongst mammalian chemical synapses and led to a widely accepted synapse type classification based on neurotransmitter molecule identity. Subsequently, powerful new physiological, genetic and structural methods have enabled the discovery of much deeper functional and molecular diversity within each traditional neurotransmitter type. Today, this deep diversity continues to pose both daunting challenges and exciting new opportunities for neuroscience. Our growing understanding of deep synapse diversity may transform how we think about and study neural circuit development, structure and function. PMID:22573027
Fungal diversity in deep-sea sediments of a hydrothermal vent system in the Southwest Indian Ridge
NASA Astrophysics Data System (ADS)
Xu, Wei; Gong, Lin-feng; Pang, Ka-Lai; Luo, Zhu-Hua
2018-01-01
Deep-sea hydrothermal sediment is known to support remarkably diverse microbial consortia. In deep sea environments, fungal communities remain less studied despite their known taxonomic and functional diversity. High-throughput sequencing methods have augmented our capacity to assess eukaryotic diversity and their functions in microbial ecology. Here we provide the first description of the fungal community diversity found in deep sea sediments collected at the Southwest Indian Ridge (SWIR) using culture-dependent and high-throughput sequencing approaches. A total of 138 fungal isolates were cultured from seven different sediment samples using various nutrient media, and these isolates were identified to 14 fungal taxa, including 11 Ascomycota taxa (7 genera) and 3 Basidiomycota taxa (2 genera) based on internal transcribed spacers (ITS1, ITS2 and 5.8S) of rDNA. Using Illumina HiSeq sequencing, a total of 757,467 fungal ITS2 tags were recovered from the samples and clustered into 723 operational taxonomic units (OTUs) belonging to 79 taxa (Ascomycota and Basidiomycota contributed to 99% of all samples) based on 97% sequence similarity. Results from both approaches suggest that there is a high fungal diversity in the deep-sea sediments collected in the SWIR and fungal communities were shown to be slightly different by location, although all were collected from adjacent sites at the SWIR. This study provides baseline data of the fungal diversity and biogeography, and a glimpse into the microbial ecology associated with the deep-sea sediments of the hydrothermal vent system of the Southwest Indian Ridge.
Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning.
Teng, Haotian; Cao, Minh Duc; Hall, Michael B; Duarte, Tania; Wang, Sheng; Coin, Lachlan J M
2018-05-01
Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology that offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling and directly translate the raw signal to DNA sequence without the error-prone segmentation step. Trained with only a small set of 4,000 reads, we show that our model provides state-of-the-art basecalling accuracy, even on previously unseen species. Chiron achieves basecalling speeds of more than 2,000 bases per second using desktop computer graphics processing units.
2016 Year in Review Video- NASA’s Marshall Space Flight Center
2016-12-22
The work underway today at NASA’s Marshall Space Flight Center is making it possible to send humans beyond Earth’s orbit and into deep space on bold new missions of space exploration. Marshall teams are designing and building NASA’s Space Launch System, the most powerful rocket ever built and the only launch vehicle capable of launching human explorers to Mars. Using the International Space Station’s orbiting lab, Marshall flight controllers provided round-the-clock oversight of science experiments, supporting the first-ever DNA sequencing in space, pioneering 3-D printing capabilities and advancing human health research. Several successful New Frontiers deep-space robotic missions including OSIRIS-REx, New Horizons and Juno, made new discoveries and refined theories of the solar system. And Marshall collaborations with outside partners are yielding innovative technologies and solving technical challenges that are making the Journey to Mars a reality.
Low Data Drug Discovery with One-Shot Learning.
Altae-Tran, Han; Ramsundar, Bharath; Pappu, Aneesh S; Pande, Vijay
2017-04-26
Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds (Ma, J. et al. J. Chem. Inf. 2015, 55, 263-274). However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the iterative refinement long short-term memory, that, when combined with graph convolutional neural networks, significantly improves learning of meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery (Ramsundar, B. deepchem.io. https://github.com/deepchem/deepchem, 2016).
Discovery of a Novel Periodontal Disease-Associated Bacterium.
Torres, Pedro J; Thompson, John; McLean, Jeffrey S; Kelley, Scott T; Edlund, Anna
2018-06-02
One of the world's most common infectious diseases, periodontitis (PD), derives from largely uncharacterized communities of oral bacteria growing as biofilms (a.k.a. plaque) on teeth and gum surfaces in periodontal pockets. Bacteria associated with periodontal disease trigger inflammatory responses in immune cells, which in later stages of the disease cause loss of both soft and hard tissue structures supporting teeth. Thus far, only a handful of bacteria have been characterized as infectious agents of PD. Although deep sequencing technologies, such as whole community shotgun sequencing, have the potential to capture a detailed picture of highly complex bacterial communities in any given environment, we still lack major reference genomes for the oral microbiome associated with PD and other diseases. In recent work, by using a combination of supervised machine learning and genome assembly, we identified a genome from a novel member of the Bacteroidetes phylum in periodontal samples. Here, by applying a comparative metagenomics read-classification approach, including 272 metagenomes from various human body sites, and our previously assembled draft genome of the uncultivated Candidatus Bacteroides periocalifornicus (CBP) bacterium, we show CBP's ubiquitous distribution in dental plaque, as well as its strong association with the well-known pathogenic "red complex" that resides in deep periodontal pockets.
Performance Evaluation of an Expanded Range XIPS Ion Thruster System for NASA Science Missions
NASA Technical Reports Server (NTRS)
Oh, David Y.; Goebel, Dan M.
2006-01-01
This paper examines the benefit that a solar electric propulsion (SEP) system based on the 5 kW Xenon Ion Propulsion System (XIPS) could have for NASA's Discovery class deep space missions. The relative cost and performance of the commercial heritage XIPS system is compared to NSTAR ion thruster based systems on three Discovery class reference missions: 1) a Near Earth Asteroid Sample Return, 2) a Comet Rendezvous and 3) a Main Belt Asteroid Rendezvous. It is found that systems utilizing a single operating XIPS thruster provide significant performance advantages over a single operating NSTAR thruster. In fact, XIPS performs as well as systems utilizing two operating NSTAR thrusters, and still costs less than the NSTAR system with a single operating thruster. This makes XIPS based SEP a competitive and attractive candidate for Discovery class science missions.
Lasko, Thomas A; Denny, Joshua C; Levy, Mia A
2013-01-01
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don't think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data - Electronic Medical Records - typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
Lasko, Thomas A.; Denny, Joshua C.; Levy, Mia A.
2013-01-01
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don’t think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data – Electronic Medical Records – typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies. PMID:23826094
Zhang, Xi
2016-01-01
Neurotransmitter ligand-gated ion channels (LGICs) are widespread and pivotal in brain functions. Unveiling their structure-function mechanisms is crucial to drive drug discovery, and demands robust proteomic quantitation of expression, post-translational modifications (PTMs) and dynamic structures. Yet unbiased digestion of these modified transmembrane proteins—at high efficiency and peptide reproducibility—poses the obstacle. Targeting both enzyme-substrate contacts and PTMs for peptide formation and detection, we devised flow-and-detergent-facilitated protease and de-PTM digestions for deep sequencing (FDD) method that combined omni-compatible detergent, tandem immobilized protease/PNGase columns, and Cys-selective reduction/alkylation, to achieve streamlined ultradeep peptide preparation within minutes not days, at high peptide reproducibility and low abundance-bias. FDD transformed enzyme-protein contacts into equal catalytic travel paths through enzyme-excessive columns regardless of protein abundance, removed products instantly preventing inhibition, tackled intricate structures via sequential multiple micro-digestions along the flow, and precisely controlled peptide formation by flow rate. Peptide-stage reactions reduced steric bias; low contamination deepened MS/MS scan; distinguishing disulfide from M oxidation and avoiding gain/loss artifacts unmasked protein-endogenous oxidation states. Using a recent interactome of 285-kDa human GABA type A receptor, this pilot study validated FDD platform's applicability to deep sequencing (up to 99% coverage), H/D-exchange and TMT-based structural mapping. FDD discovered novel subunit-specific PTM signatures, including unusual nontop-surface N-glycosylations, that may drive subunit biases in human Cys-loop LGIC assembly and pharmacology, by redefining subunit/ligand interfaces and connecting function domains. PMID:27073180
Adler, Adam S; Bedinger, Daniel; Adams, Matthew S; Asensio, Michael A; Edgar, Robert C; Leong, Renee; Leong, Jackson; Mizrahi, Rena A; Spindler, Matthew J; Bandi, Srinivasa Rao; Huang, Haichun; Tawde, Pallavi; Brams, Peter; Johnson, David S
2018-04-01
Deep sequencing and single-chain variable fragment (scFv) yeast display methods are becoming more popular for discovery of therapeutic antibody candidates in mouse B cell repertoires. In this study, we compare a deep sequencing and scFv display method that retains native heavy and light chain pairing with a related method that randomly pairs heavy and light chain. We performed the studies in a humanized mouse, using interleukin 21 receptor (IL-21R) as a test immunogen. We identified 44 high-affinity binder scFv with the native pairing method and 100 high-affinity binder scFv with the random pairing method. 30% of the natively paired scFv binders were also discovered with the randomly paired method, and 13% of the randomly paired binders were also discovered with the natively paired method. Additionally, 33% of the scFv binders discovered only in the randomly paired library were initially present in the natively paired pre-sort library. Thus, a significant proportion of "randomly paired" scFv were actually natively paired. We synthesized and produced 46 of the candidates as full-length antibodies and subjected them to a panel of binding assays to characterize their therapeutic potential. 87% of the antibodies were verified as binding IL-21R by at least one assay. We found that antibodies with native light chains were more likely to bind IL-21R than antibodies with non-native light chains, suggesting a higher false positive rate for antibodies from the randomly paired library. Additionally, the randomly paired method failed to identify nearly half of the true natively paired binders, suggesting a higher false negative rate. We conclude that natively paired libraries have critical advantages in sensitivity and specificity for antibody discovery programs.
Adler, Adam S.; Bedinger, Daniel; Adams, Matthew S.; Asensio, Michael A.; Edgar, Robert C.; Leong, Renee; Leong, Jackson; Mizrahi, Rena A.; Spindler, Matthew J.; Bandi, Srinivasa Rao; Huang, Haichun; Brams, Peter; Johnson, David S.
2018-01-01
Deep sequencing and single-chain variable fragment (scFv) yeast display methods are becoming more popular for discovery of therapeutic antibody candidates in mouse B cell repertoires. In this study, we compare a deep sequencing and scFv display method that retains native heavy and light chain pairing with a related method that randomly pairs heavy and light chain. We performed the studies in a humanized mouse, using interleukin 21 receptor (IL-21R) as a test immunogen. We identified 44 high-affinity binder scFv with the native pairing method and 100 high-affinity binder scFv with the random pairing method. 30% of the natively paired scFv binders were also discovered with the randomly paired method, and 13% of the randomly paired binders were also discovered with the natively paired method. Additionally, 33% of the scFv binders discovered only in the randomly paired library were initially present in the natively paired pre-sort library. Thus, a significant proportion of “randomly paired” scFv were actually natively paired. We synthesized and produced 46 of the candidates as full-length antibodies and subjected them to a panel of binding assays to characterize their therapeutic potential. 87% of the antibodies were verified as binding IL-21R by at least one assay. We found that antibodies with native light chains were more likely to bind IL-21R than antibodies with non-native light chains, suggesting a higher false positive rate for antibodies from the randomly paired library. Additionally, the randomly paired method failed to identify nearly half of the true natively paired binders, suggesting a higher false negative rate. We conclude that natively paired libraries have critical advantages in sensitivity and specificity for antibody discovery programs. PMID:29376776
DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations.
Yuan, Yuchen; Shi, Yi; Li, Changyang; Kim, Jinman; Cai, Weidong; Han, Zeguang; Feng, David Dagan
2016-12-23
With the developments of DNA sequencing technology, large amounts of sequencing data have become available in recent years and provide unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). However in existing SMCC methods, issues like high data sparsity, small volume of sample size, and the application of simple linear classifiers, are major obstacles in improving the classification performance. To address the obstacles in existing SMCC studies, we propose DeepGene, an advanced deep neural network (DNN) based classifier, that consists of three steps: firstly, the clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; secondly, the indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR is fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, which is a reformulated subset of the TCGA dataset containing 12 selected types of cancer, show that CGF, ISR and DNN all contribute in improving the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene has at least 24% performance improvement in terms of testing accuracy. Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier, which addresses the obstacles in existing SMCC studies. 
Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module that is able to extract the high level features between combinatorial somatic point mutations and cancer types.
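The indexed sparsity reduction (ISR) step described above, which replaces a sparse gene vector with the indexes of its non-zero elements, can be illustrated with a minimal Python sketch. The function name, the 1-based indexing, the fixed output width, and the zero padding are all assumptions made for illustration; the abstract does not specify these details, and this is not the DeepGene implementation.

```python
def indexed_sparsity_reduction(mutations, width):
    """Hypothetical ISR sketch: list the positions of non-zero entries.

    Assumptions (not from the abstract): indexes are 1-based so that 0 can
    serve as padding, and the output is truncated or padded to `width`.
    """
    indexes = [i + 1 for i, value in enumerate(mutations) if value != 0]
    indexes = indexes[:width]                       # truncate if too many mutated genes
    return indexes + [0] * (width - len(indexes))   # pad if too few


# Toy mutation vector over 8 genes (1 = mutated, 0 = not mutated).
sample = [0, 0, 1, 0, 1, 0, 0, 1]
reduced = indexed_sparsity_reduction(sample, 5)
print(reduced)
```

In this sketch a mostly-zero vector of length 8 becomes a dense vector of length 5, which is the sense in which the representation suppresses sparsity.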
Ou-Yang, Si-sheng; Lu, Jun-yan; Kong, Xiang-qian; Liang, Zhong-jie; Luo, Cheng; Jiang, Hualiang
2012-01-01
Computational drug discovery is an effective strategy for accelerating and economizing drug discovery and development process. Because of the dramatic increase in the availability of biological macromolecule and small molecule information, the applicability of computational drug discovery has been extended and broadly applied to nearly every stage in the drug discovery and development workflow, including target identification and validation, lead discovery and optimization and preclinical tests. Over the past decades, computational drug discovery methods such as molecular docking, pharmacophore modeling and mapping, de novo design, molecular similarity calculation and sequence-based virtual screening have been greatly improved. In this review, we present an overview of these important computational methods, platforms and successful applications in this field. PMID:22922346
Deep Learning and Its Applications in Biomedicine.
Cao, Chensi; Liu, Feng; Tan, Hai; Song, Deshou; Shu, Wenjie; Li, Weizhong; Zhou, Yiming; Bo, Xiaochen; Xie, Zhi
2018-02-01
Advances in biological and medical technologies have been providing us explosive volumes of biological and physiological data, such as medical images, electroencephalography, genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural network and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples are demonstrated for deep learning applications, including medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Finally, we offer our perspectives for the future directions in the field of deep learning. Copyright © 2018. Production and hosting by Elsevier B.V.
Unified Deep Learning Architecture for Modeling Biology Sequence.
Wu, Hongjie; Cao, Chengyuan; Xia, Xiaoyan; Lu, Qiang
2017-10-09
Prediction of the spatial structure or function of biological macromolecules based on their sequence remains an important challenge in bioinformatics. When modeling biological sequences using traditional sequencing models, characteristics, such as long-range interactions between basic units, the complicated and variable output of labeled structures, and the variable length of biological sequences, usually lead to different solutions on a case-by-case basis. This study proposed the use of bidirectional recurrent neural networks based on long short-term memory or a gated recurrent unit to capture long-range interactions by designing the optional reshape operator to adapt to the diversity of the output labels and implementing a training algorithm to support the training of sequence models capable of processing variable-length sequences. Additionally, the merge and pooling operators enhanced the ability to capture short-range interactions between basic units of biological sequences. The proposed deep-learning model and its training algorithm might be capable of solving currently known biological sequence-modeling problems through the use of a unified framework. We validated our model on one of the most difficult biological sequence-modeling problems currently known, with our results indicating the ability of the model to obtain predictions of protein residue interactions that exceeded the accuracy of current popular approaches by 10% based on multiple benchmarks.
An Integrated Microfluidic Processor for DNA-Encoded Combinatorial Library Functional Screening
2017-01-01
DNA-encoded synthesis is rekindling interest in combinatorial compound libraries for drug discovery and in technology for automated and quantitative library screening. Here, we disclose a microfluidic circuit that enables functional screens of DNA-encoded compound beads. The device carries out library bead distribution into picoliter-scale assay reagent droplets, photochemical cleavage of compound from the bead, assay incubation, laser-induced fluorescence-based assay detection, and fluorescence-activated droplet sorting to isolate hits. DNA-encoded compound beads (10-μm diameter) displaying a photocleavable positive control inhibitor pepstatin A were mixed (1920 beads, 729 encoding sequences) with negative control beads (58 000 beads, 1728 encoding sequences) and screened for cathepsin D inhibition using a biochemical enzyme activity assay. The circuit sorted 1518 hit droplets for collection following 18 min incubation over a 240 min analysis. Visual inspection of a subset of droplets (1188 droplets) yielded a 24% false discovery rate (1166 pepstatin A beads; 366 negative control beads). Using template barcoding strategies, it was possible to count hit collection beads (1863) using next-generation sequencing data. Bead-specific barcodes enabled replicate counting, and the false discovery rate was reduced to 2.6% by only considering hit-encoding sequences that were observed on >2 beads. This work represents a complete distributable small molecule discovery platform, from microfluidic miniaturized automation to ultrahigh-throughput hit deconvolution by sequencing. PMID:28199790
An Integrated Microfluidic Processor for DNA-Encoded Combinatorial Library Functional Screening.
MacConnell, Andrew B; Price, Alexander K; Paegel, Brian M
2017-03-13
DNA-encoded synthesis is rekindling interest in combinatorial compound libraries for drug discovery and in technology for automated and quantitative library screening. Here, we disclose a microfluidic circuit that enables functional screens of DNA-encoded compound beads. The device carries out library bead distribution into picoliter-scale assay reagent droplets, photochemical cleavage of compound from the bead, assay incubation, laser-induced fluorescence-based assay detection, and fluorescence-activated droplet sorting to isolate hits. DNA-encoded compound beads (10-μm diameter) displaying a photocleavable positive control inhibitor pepstatin A were mixed (1920 beads, 729 encoding sequences) with negative control beads (58 000 beads, 1728 encoding sequences) and screened for cathepsin D inhibition using a biochemical enzyme activity assay. The circuit sorted 1518 hit droplets for collection following 18 min incubation over a 240 min analysis. Visual inspection of a subset of droplets (1188 droplets) yielded a 24% false discovery rate (1166 pepstatin A beads; 366 negative control beads). Using template barcoding strategies, it was possible to count hit collection beads (1863) using next-generation sequencing data. Bead-specific barcodes enabled replicate counting, and the false discovery rate was reduced to 2.6% by only considering hit-encoding sequences that were observed on >2 beads. This work represents a complete distributable small molecule discovery platform, from microfluidic miniaturized automation to ultrahigh-throughput hit deconvolution by sequencing.
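The 24% false discovery rate quoted above follows from the visually inspected bead counts. The short Python sketch below reproduces that arithmetic, assuming the rate is computed as negative control beads divided by all inspected hit beads; the function name is illustrative and does not come from the paper.

```python
def false_discovery_rate(true_hits, false_hits):
    """Fraction of collected hits that are false positives."""
    return false_hits / (true_hits + false_hits)


# Counts from the abstract: 1166 pepstatin A beads, 366 negative control beads.
fdr = false_discovery_rate(true_hits=1166, false_hits=366)
print(f"{fdr:.1%}")
```

With these counts the ratio is 366/1532, which rounds to the reported 24%.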
Comparative Single-Cell Genomics of Chloroflexi from the Okinawa Trough Deep-Subsurface Biosphere.
Fullerton, Heather; Moyer, Craig L
2016-05-15
Chloroflexi small-subunit (SSU) rRNA gene sequences are frequently recovered from subseafloor environments, but the metabolic potential of the phylum is poorly understood. The phylum Chloroflexi is represented by isolates with diverse metabolic strategies, including anoxic phototrophy, fermentation, and reductive dehalogenation; therefore, function cannot be attributed to these organisms based solely on phylogeny. Single-cell genomics can provide metabolic insights into uncultured organisms, like the deep-subsurface Chloroflexi. Nine SSU rRNA gene sequences were identified from single-cell sorts of whole-round core material collected from the Okinawa Trough at Iheya North hydrothermal field as part of Integrated Ocean Drilling Program (IODP) expedition 331 (Deep Hot Biosphere). Previous studies of subsurface Chloroflexi single amplified genomes (SAGs) suggested heterotrophic or lithotrophic metabolisms and provided no evidence for growth by reductive dehalogenation. Our nine Chloroflexi SAGs (seven of which are from the order Anaerolineales) indicate that, in addition to genes for the Wood-Ljungdahl pathway, exogenous carbon sources can be actively transported into cells. At least one subunit for pyruvate ferredoxin oxidoreductase was found in four of the Chloroflexi SAGs. This protein can provide a link between the Wood-Ljungdahl pathway and other carbon anabolic pathways. Finally, one of the seven Anaerolineales SAGs contains a distinct reductive dehalogenase homologous (rdhA) gene. Through the use of single amplified genomes (SAGs), we have extended the metabolic potential of an understudied group of subsurface microbes, the Chloroflexi. These microbes are frequently detected in the subsurface biosphere, though their metabolic capabilities have remained elusive. 
In contrast to previously examined Chloroflexi SAGs, our genomes (several are from the order Anaerolineales) were recovered from a hydrothermally driven system and therefore provide a unique window into the metabolic potential of this type of habitat. In addition, a reductive dehalogenase gene (rdhA) has been directly linked to marine subsurface Chloroflexi, suggesting that reductive dehalogenation is not limited to the class Dehalococcoidia. This discovery expands the nutrient-cycling and metabolic potential present within the deep subsurface and provides functional gene information relating to this enigmatic group. Copyright © 2016 Fullerton and Moyer.
Chen, Muyan; Zhang, Xiumei; Liu, Jianning; Storey, Kenneth B.
2013-01-01
The regulatory role of miRNA in gene expression is an emerging hot new topic in the control of hypometabolism. Sea cucumber aestivation is a complicated physiological process that includes obvious hypometabolism as evidenced by a decrease in the rates of oxygen consumption and ammonia nitrogen excretion, as well as a serious degeneration of the intestine into a very tiny filament. To determine whether miRNAs play regulatory roles in this process, the present study analyzed profiles of miRNA expression in the intestine of the sea cucumber (Apostichopus japonicus), using Solexa deep sequencing technology. We identified 308 sea cucumber miRNAs, including 18 novel miRNAs specific to sea cucumber. Animals sampled during deep aestivation (DA) after at least 15 days of continuous torpor, were compared with animals from a non-aestivation (NA) state (animals that had passed through aestivation and returned to the active state). We identified 42 differentially expressed miRNAs [RPM (reads per million) >10, |FC| (|fold change|) ≥1, FDR (false discovery rate) <0.01] during aestivation, which were validated by two other miRNA profiling methods: miRNA microarray and real-time PCR. Among the most prominent miRNA species, miR-200-3p, miR-2004, miR-2010, miR-22, miR-252a, miR-252a-3p and miR-92 were significantly over-expressed during deep aestivation compared with non-aestivation animals. Preliminary analyses of their putative target genes and GO analysis suggest that these miRNAs could play important roles in global transcriptional depression and cell differentiation during aestivation. High-throughput sequencing data and microarray data have been submitted to GEO database. PMID:24143179
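The three selection thresholds quoted above (RPM >10, |FC| ≥1, FDR <0.01) amount to a simple conjunctive filter. The Python sketch below shows one way to apply it; the record layout and the example miRNA names and values are hypothetical, and the fold-change field is taken at face value from the abstract without assuming a particular scale.

```python
from dataclasses import dataclass


@dataclass
class MirnaResult:
    name: str           # hypothetical identifier
    rpm: float          # reads per million
    fold_change: float  # signed fold-change value, as reported
    fdr: float          # false discovery rate


def is_differentially_expressed(result):
    """Apply the thresholds stated in the abstract: RPM > 10, |FC| >= 1, FDR < 0.01."""
    return result.rpm > 10 and abs(result.fold_change) >= 1 and result.fdr < 0.01


# Made-up example records, for illustration only.
results = [
    MirnaResult("miR-A", rpm=250.0, fold_change=1.8, fdr=0.001),   # passes all three
    MirnaResult("miR-B", rpm=5.0, fold_change=2.5, fdr=0.001),     # fails RPM
    MirnaResult("miR-C", rpm=90.0, fold_change=-1.2, fdr=0.004),   # passes (down-regulated)
    MirnaResult("miR-D", rpm=40.0, fold_change=0.4, fdr=0.0001),   # fails |FC|
    MirnaResult("miR-E", rpm=40.0, fold_change=3.0, fdr=0.05),     # fails FDR
]
selected = [r.name for r in results if is_differentially_expressed(r)]
print(selected)
```

Taking the absolute value of the fold change keeps both over- and under-expressed miRNAs in the selected set.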
Kravatsky, Yuri; Chechetkin, Vladimir; Fedoseeva, Daria; Gorbacheva, Maria; Kravatskaya, Galina; Kretova, Olga; Tchurikov, Nickolai
2017-11-23
The efficient development of antiviral drugs, including efficient antiviral small interfering RNAs (siRNAs), requires continuous monitoring of the strict correspondence between a drug and the related highly variable viral DNA/RNA target(s). Deep sequencing is able to provide an assessment of both the general target conservation and the frequency of particular mutations in the different target sites. The aim of this study was to develop a reliable bioinformatic pipeline for the analysis of millions of short, deep sequencing reads corresponding to selected highly variable viral sequences that are drug target(s). The suggested bioinformatic pipeline combines the available programs and the ad hoc scripts based on an original algorithm of the search for the conserved targets in the deep sequencing data. We also present the statistical criteria for the threshold of reliable mutation detection and for the assessment of variations between corresponding data sets. These criteria are robust against the possible sequencing errors in the reads. As an example, the bioinformatic pipeline is applied to the study of the conservation of RNA interference (RNAi) targets in human immunodeficiency virus 1 (HIV-1) subtype A. The developed pipeline is freely available to download at the website http://virmut.eimb.ru/. Brief comments and comparisons between VirMut and other pipelines are also presented.
TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets.
Dang, Louis T; Tondl, Markus; Chiu, Man Ho H; Revote, Jerico; Paten, Benedict; Tano, Vincent; Tokolyi, Alex; Besse, Florence; Quaife-Ryan, Greg; Cumming, Helen; Drvodelic, Mark J; Eichenlaub, Michael P; Hallab, Jeannette C; Stolper, Julian S; Rossello, Fernando J; Bogoyevitch, Marie A; Jans, David A; Nim, Hieu T; Porrello, Enzo R; Hudson, James E; Ramialison, Mirana
2018-04-05
A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57-74, 2012; Nat 507:462-70, 2014; Nat 507:455-61, 2014; Nat 518:317-30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563-5, 2007; Nat Protoc 5:323-34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. 
TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au .
Molecular biology and immunology of head and neck cancer.
Guo, Theresa; Califano, Joseph A
2015-07-01
In recent years, our knowledge and understanding of head and neck squamous cell carcinoma (HNSCC) has expanded dramatically. New high-throughput sequencing technologies have accelerated these discoveries since the first reports of whole-exome sequencing of HNSCC tumors in 2011. In addition, the discovery of human papillomavirus in relationship with oropharyngeal squamous cell carcinoma has shifted our molecular understanding of the disease. New investigation into the role of immune evasion in HNSCC has also led to potential novel therapies based on immune-specific systemic therapies. Copyright © 2015 Elsevier Inc. All rights reserved.
Adhikari, Badri; Hou, Jie; Cheng, Jianlin
2018-03-01
In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignment to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. And the correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66. © 2017 Wiley Periodicals, Inc.
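The "top L/5 long-range" precision metric used above can be made concrete with a small Python sketch. It assumes the common convention that long-range contacts are residue pairs separated by at least 24 positions in sequence and that L is the sequence length; the function name, argument layout, and toy data are illustrative and are not taken from the MULTICOM code.

```python
def top_l5_long_range_precision(predictions, true_contacts, length, min_sep=24):
    """Precision of the L/5 highest-scoring long-range predicted contacts.

    predictions   : iterable of (i, j, probability) with residue indexes i, j
    true_contacts : set of (i, j) pairs with i < j
    length        : sequence length L
    min_sep       : minimum |i - j| for a long-range pair (24 is an assumed convention)
    """
    long_range = [(i, j, p) for i, j, p in predictions if abs(i - j) >= min_sep]
    long_range.sort(key=lambda item: item[2], reverse=True)
    top = long_range[: max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


# Toy example: L = 15, so L/5 = 3; min_sep is lowered to 6 only to fit the toy size.
predictions = [
    (1, 10, 0.90),  # long-range, true contact
    (2, 14, 0.80),  # long-range, not a contact
    (3, 12, 0.70),  # long-range, true contact
    (4, 13, 0.60),  # long-range, true contact, but ranked outside the top 3
    (5, 6, 0.95),   # short-range, ignored despite its high score
]
true_contacts = {(1, 10), (3, 12), (4, 13), (5, 6)}
precision = top_l5_long_range_precision(predictions, true_contacts, length=15, min_sep=6)
print(round(precision, 3))
```

Here two of the three top-ranked long-range predictions are true contacts, so the precision is 2/3.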
DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data.
Arango-Argoty, Gustavo; Garner, Emily; Pruden, Amy; Heath, Lenwood S; Vikesland, Peter; Zhang, Liqing
2018-02-01
Growing concerns about increasing rates of antibiotic resistance call for expanded and comprehensive global monitoring. Advancing methods for monitoring of environmental media (e.g., wastewater, agricultural waste, food, and water) is especially needed for identifying potential resources of novel antibiotic resistance genes (ARGs), hot spots for gene exchange, and as pathways for the spread of ARGs and human exposure. Next-generation sequencing now enables direct access and profiling of the total metagenomic DNA pool, where ARGs are typically identified or predicted based on the "best hits" of sequence searches against existing databases. Unfortunately, this approach produces a high rate of false negatives. To address such limitations, we propose here a deep learning approach, taking into account a dissimilarity matrix created using all known categories of ARGs. Two deep learning models, DeepARG-SS and DeepARG-LS, were constructed for short read sequences and full gene length sequences, respectively. Evaluation of the deep learning models over 30 antibiotic resistance categories demonstrates that the DeepARG models can predict ARGs with both high precision (> 0.97) and recall (> 0.90). The models displayed an advantage over the typical best hit approach, yielding consistently lower false negative rates and thus higher overall recall (> 0.9). As more data become available for under-represented ARG categories, the DeepARG models' performance can be expected to be further enhanced due to the nature of the underlying neural networks. Our newly developed ARG database, DeepARG-DB, encompasses ARGs predicted with a high degree of confidence and extensive manual inspection, greatly expanding current ARG repositories. The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice. DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs. 
The DeepARG models and database are available as a command line version and as a Web service at http://bench.cs.vt.edu/deeparg .
2010-01-01
Background: Suppression subtractive hybridization is a popular technique for gene discovery from non-model organisms without an annotated genome sequence, such as cowpea (Vigna unguiculata (L.) Walp). We aimed to use this method to enrich for genes expressed during drought stress in a drought tolerant cowpea line. However, current methods were inefficient in screening libraries and management of the sequence data, and thus there was a need to develop software tools to facilitate the process. Results: Forward and reverse cDNA libraries enriched for cowpea drought response genes were screened on microarrays, and the R software package SSHscreen 2.0.1 was developed (i) to normalize the data effectively using spike-in control spot normalization, and (ii) to select clones for sequencing based on the calculation of enrichment ratios with associated statistics. Enrichment ratio 3 values for each clone showed that 62% of the forward library and 34% of the reverse library clones were significantly differentially expressed by drought stress (adjusted p value < 0.05). Enrichment ratio 2 calculations showed that > 88% of the clones in both libraries were derived from rare transcripts in the original tester samples, thus supporting the notion that suppression subtractive hybridization enriches for rare transcripts. A set of 118 clones were chosen for sequencing, and drought-induced cowpea genes were identified, the most interesting encoding a late embryogenesis abundant Lea5 protein, a glutathione S-transferase, a thaumatin, a universal stress protein, and a wound induced protein. A lipid transfer protein and several components of photosynthesis were down-regulated by the drought stress. Reverse transcriptase quantitative PCR confirmed the enrichment ratio values for the selected cowpea genes. SSHdb, a web-accessible database, was developed to manage the clone sequences and combine the SSHscreen data with sequence annotations derived from BLAST and Blast2GO. 
The self-BLAST function within SSHdb grouped redundant clones together and illustrated that the SSHscreen plots are a useful tool for choosing anonymous clones for sequencing, since redundant clones cluster together on the enrichment ratio plots. Conclusions: We developed the SSHscreen-SSHdb software pipeline, which greatly facilitates gene discovery using suppression subtractive hybridization by improving the selection of clones for sequencing after screening the library on a small number of microarrays. Annotation of the sequence information and collaboration was further enhanced through a web-based SSHdb database, and we illustrated this through identification of drought responsive genes from cowpea, which can now be investigated in gene function studies. SSH is a popular and powerful gene discovery tool, and therefore this pipeline will have application for gene discovery in any biological system, particularly non-model organisms. SSHscreen 2.0.1 and a link to SSHdb are available from http://microarray.up.ac.za/SSHscreen. PMID:20359330
Molecular Evolution in Historical Perspective.
Suárez-Díaz, Edna
2016-12-01
In the 1960s, advances in protein chemistry and molecular genetics provided new means for the study of biological evolution. Amino acid sequencing, nucleic acid hybridization, zone gel electrophoresis, and immunochemistry were some of the experimental techniques that brought about new perspectives to the study of the patterns and mechanisms of evolution. New concepts, such as the molecular evolutionary clock, and the discovery of unexpected molecular phenomena, like the presence of repetitive sequences in eukaryotic genomes, eventually led to the realization that evolution might occur at a different pace at the organismic and the molecular levels, and according to different mechanisms. These developments sparked important debates between defenders of the molecular and organismic approaches. The most vocal confrontations focused on the relation between primates and humans, and the neutral theory of molecular evolution. By the 1980s and 1990s, the construction of large protein and DNA sequence databases, and the development of computer-based statistical tools, facilitated the coming together of molecular and evolutionary biology. Although in its contemporary form the field of molecular evolution can be traced back to the last five decades, the field has deep roots in twentieth century experimental life sciences. For historians of science, the origins and consolidation of molecular evolution provide a privileged field for the study of scientific debates, the relation between technological advances and scientific knowledge, and the connection between science and broader social concerns.
RSAT: regulatory sequence analysis tools.
Thomas-Chollier, Morgane; Sand, Olivier; Turatsinze, Jean-Valéry; Janky, Rekin's; Defrance, Matthieu; Vervisch, Eric; Brohée, Sylvain; van Helden, Jacques
2008-07-01
The regulatory sequence analysis tools (RSAT, http://rsat.ulb.ac.be/rsat/) is a software suite that integrates a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. The suite includes programs for sequence retrieval, pattern discovery, phylogenetic footprint detection, pattern matching, genome scanning and feature map drawing. Random controls can be performed with random gene selections or by generating random sequences according to a variety of background models (Bernoulli, Markov). Beyond the original word-based pattern-discovery tools (oligo-analysis and dyad-analysis), we recently added a battery of tools for matrix-based detection of cis-acting elements, with some original features (adaptive background models, Markov-chain estimation of P-values) that do not exist in other matrix-based scanning tools. The web server offers an intuitive interface, where each program can be accessed either separately or connected to the other tools. In addition, the tools are now available as web services, enabling their integration in programmatic workflows. Genomes are regularly updated from various genome repositories (NCBI and EnsEMBL) and 682 organisms are currently supported. Since 1998, the tools have been used by several hundreds of researchers from all over the world. Several predictions made with RSAT were validated experimentally and published.
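The word-based pattern discovery mentioned above compares how often each oligonucleotide occurs in a set of sequences with how often it would be expected under a background model. The following is a minimal sketch of that idea, assuming a Bernoulli background with independent base frequencies; the function names are hypothetical and this is not RSAT's implementation.

```python
from collections import Counter
from math import prod


def count_kmers(seqs, k):
    """Count overlapping k-mers (single strand) across all sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def bernoulli_expected(kmer, base_freqs, positions):
    """Expected occurrences of kmer when each base is drawn independently.

    positions is the number of k-mer start positions that were scanned.
    """
    return positions * prod(base_freqs[b] for b in kmer)


# Illustrative input, not data from the abstract.
seqs = ["ACGACG"]
observed = count_kmers(seqs, 3)
positions = sum(len(s) - 3 + 1 for s in seqs)
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
expected_acg = bernoulli_expected("ACG", uniform, positions)
```

A word whose observed count far exceeds its expected count is a candidate motif; a real analysis would also attach a significance score and could use a Markov background instead.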
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Hong; Yang, Yanling; Li, Yuxin
2015-02-06
Development of high resolution liquid chromatography (LC) is essential for improving the sensitivity and throughput of mass spectrometry (MS)-based proteomics. Here we present systematic optimization of a long gradient LC-MS/MS platform to enhance protein identification from a complex mixture. The platform employed an in-house fabricated, reverse phase column (100 μm x 150 cm) coupled with Q Exactive MS. The column was capable of achieving a peak capacity of approximately 700 in a 720 min gradient of 10-45% acetonitrile. The optimal loading level was about 6 micrograms of peptides, although the column allowed loading as many as 20 micrograms. Gas phase fractionation of peptide ions further increased the number of peptide identification by ~10%. Moreover, the combination of basic pH LC pre-fractionation with the long gradient LC-MS/MS platform enabled the identification of 96,127 peptides and 10,544 proteins at 1% protein false discovery rate in a postmortem brain sample of Alzheimer’s disease. As deep RNA sequencing of the same specimen suggested that ~16,000 genes were expressed, current analysis covered more than 60% of the expressed proteome. Further improvement strategies of the LC/LC-MS/MS platform were also discussed.
Indel variant analysis of short-read sequencing data with Scalpel
Fang, Han; Bergmann, Ewa A; Arora, Kanika; Vacic, Vladimir; Zody, Michael C; Iossifov, Ivan; O’Rawe, Jason A; Wu, Yiyang; Barron, Laura T Jimenez; Rosenbaum, Julie; Ronemus, Michael; Lee, Yoon-ha; Wang, Zihua; Dikoglu, Esra; Jobanputra, Vaidehi; Lyon, Gholson J; Wigler, Michael; Schatz, Michael C; Narzisi, Giuseppe
2017-01-01
As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ~5 h after read mapping. PMID:27854363
Jazaeri Farsani, Seyed Mohammad; Deijs, Martin; Dijkman, Ronald; Molenkamp, Richard; Jeeninga, Rienk E; Ieven, Margareta; Goossens, Herman; van der Hoek, Lia
2015-01-01
Background Currently, virus discovery is mainly based on molecular techniques. Here, we propose a method that relies on virus culturing combined with state-of-the-art sequencing techniques. The most natural ex vivo culture system was used to enable replication of respiratory viruses. Method Three respiratory clinical samples were tested on well-differentiated pseudostratified tracheobronchial human airway epithelial (HAE) cultures grown at an air–liquid interface, which resemble the airway epithelium. Cells were stained with convalescent serum of the patients to identify infected cells and apical washes were analyzed by VIDISCA-454, a next-generation sequencing virus discovery technique. Results Infected cells were observed for all three samples. Sequencing subsequently indicated that the cells were infected by either human coronavirus OC43, influenzavirus B, or influenzavirus A. The sequence reads covered a large part of the genome (52%, 82%, and 57%, respectively). Conclusion We present here a new method for virus discovery that requires a virus culture on primary cells and an antibody detection. The virus in the harvest can be used to characterize the viral genome sequence and cell tropism, but also provides progeny virus to initiate experiments to fulfill the Koch's postulates. PMID:25482367
Exome Sequencing and the Management of Neurometabolic Disorders.
Tarailo-Graovac, Maja; Shyr, Casper; Ross, Colin J; Horvath, Gabriella A; Salvarinova, Ramona; Ye, Xin C; Zhang, Lin-Hua; Bhavsar, Amit P; Lee, Jessica J Y; Drögemöller, Britt I; Abdelsayed, Mena; Alfadhel, Majid; Armstrong, Linlea; Baumgartner, Matthias R; Burda, Patricie; Connolly, Mary B; Cameron, Jessie; Demos, Michelle; Dewan, Tammie; Dionne, Janis; Evans, A Mark; Friedman, Jan M; Garber, Ian; Lewis, Suzanne; Ling, Jiqiang; Mandal, Rupasri; Mattman, Andre; McKinnon, Margaret; Michoulas, Aspasia; Metzger, Daniel; Ogunbayo, Oluseye A; Rakic, Bojana; Rozmus, Jacob; Ruben, Peter; Sayson, Bryan; Santra, Saikat; Schultz, Kirk R; Selby, Kathryn; Shekel, Paul; Sirrs, Sandra; Skrypnyk, Cristina; Superti-Furga, Andrea; Turvey, Stuart E; Van Allen, Margot I; Wishart, David; Wu, Jiang; Wu, John; Zafeiriou, Dimitrios; Kluijtmans, Leo; Wevers, Ron A; Eydoux, Patrice; Lehman, Anna M; Vallance, Hilary; Stockler-Ipsiroglu, Sylvia; Sinclair, Graham; Wasserman, Wyeth W; van Karnebeek, Clara D
2016-06-09
Whole-exome sequencing has transformed gene discovery and diagnosis in rare diseases. Translation into disease-modifying treatments is challenging, particularly for intellectual developmental disorder. However, the exception is inborn errors of metabolism, since many of these disorders are responsive to therapy that targets pathophysiological features at the molecular or cellular level. To uncover the genetic basis of potentially treatable inborn errors of metabolism, we combined deep clinical phenotyping (the comprehensive characterization of the discrete components of a patient's clinical and biochemical phenotype) with whole-exome sequencing analysis through a semiautomated bioinformatics pipeline in consecutively enrolled patients with intellectual developmental disorder and unexplained metabolic phenotypes. We performed whole-exome sequencing on samples obtained from 47 probands. Of these patients, 6 were excluded, including 1 who withdrew from the study. The remaining 41 probands had been born to predominantly nonconsanguineous parents of European descent. In 37 probands, we identified variants in 2 genes newly implicated in disease, 9 candidate genes, 22 known genes with newly identified phenotypes, and 9 genes with expected phenotypes; in most of the genes, the variants were classified as either pathogenic or probably pathogenic. Complex phenotypes of patients in five families were explained by coexisting monogenic conditions. We obtained a diagnosis in 28 of 41 probands (68%) who were evaluated. A test of a targeted intervention was performed in 18 patients (44%). Deep phenotyping and whole-exome sequencing in 41 probands with intellectual developmental disorder and unexplained metabolic abnormalities led to a diagnosis in 68%, the identification of 11 candidate genes newly implicated in neurometabolic disease, and a change in treatment beyond genetic counseling in 44%. (Funded by BC Children's Hospital Foundation and others.).
Exome Sequencing and the Management of Neurometabolic Disorders
Tarailo-Graovac, M.; Shyr, C.; Ross, C.J.; Horvath, G.A.; Salvarinova, R.; Ye, X.C.; Zhang, L.-H.; Bhavsar, A.P.; Lee, J.J.Y.; Drögemöller, B.I.; Abdelsayed, M.; Alfadhel, M.; Armstrong, L.; Baumgartner, M.R.; Burda, P.; Connolly, M.B.; Cameron, J.; Demos, M.; Dewan, T.; Dionne, J.; Evans, A.M.; Friedman, J.M.; Garber, I.; Lewis, S.; Ling, J.; Mandal, R.; Mattman, A.; McKinnon, M.; Michoulas, A.; Metzger, D.; Ogunbayo, O.A.; Rakic, B.; Rozmus, J.; Ruben, P.; Sayson, B.; Santra, S.; Schultz, K.R.; Selby, K.; Shekel, P.; Sirrs, S.; Skrypnyk, C.; Superti-Furga, A.; Turvey, S.E.; Van Allen, M.I.; Wishart, D.; Wu, J.; Wu, J.; Zafeiriou, D.; Kluijtmans, L.; Wevers, R.A.; Eydoux, P.; Lehman, A.M.; Vallance, H.; Stockler-Ipsiroglu, S.; Sinclair, G.; Wasserman, W.W.; van Karnebeek, C.D.
2016-01-01
BACKGROUND Whole-exome sequencing has transformed gene discovery and diagnosis in rare diseases. Translation into disease-modifying treatments is challenging, particularly for intellectual developmental disorder. However, the exception is inborn errors of metabolism, since many of these disorders are responsive to therapy that targets pathophysiological features at the molecular or cellular level. METHODS To uncover the genetic basis of potentially treatable inborn errors of metabolism, we combined deep clinical phenotyping (the comprehensive characterization of the discrete components of a patient’s clinical and biochemical phenotype) with whole-exome sequencing analysis through a semiautomated bioinformatics pipeline in consecutively enrolled patients with intellectual developmental disorder and unexplained metabolic phenotypes. RESULTS We performed whole-exome sequencing on samples obtained from 47 probands. Of these patients, 6 were excluded, including 1 who withdrew from the study. The remaining 41 probands had been born to predominantly nonconsanguineous parents of European descent. In 37 probands, we identified variants in 2 genes newly implicated in disease, 9 candidate genes, 22 known genes with newly identified phenotypes, and 9 genes with expected phenotypes; in most of the genes, the variants were classified as either pathogenic or probably pathogenic. Complex phenotypes of patients in five families were explained by coexisting monogenic conditions. We obtained a diagnosis in 28 of 41 probands (68%) who were evaluated. A test of a targeted intervention was performed in 18 patients (44%). CONCLUSIONS Deep phenotyping and whole-exome sequencing in 41 probands with intellectual developmental disorder and unexplained metabolic abnormalities led to a diagnosis in 68%, the identification of 11 candidate genes newly implicated in neurometabolic disease, and a change in treatment beyond genetic counseling in 44%. 
(Funded by BC Children’s Hospital Foundation and others.) PMID:27276562
SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery
Jewell, Erica; Robinson, Andrew; Savage, David; Erwin, Tim; Love, Christopher G.; Lim, Geraldine A. C.; Li, Xi; Batley, Jacqueline; Spangenberg, German C.; Edwards, David
2006-01-01
Simple sequence repeat (SSR) molecular genetic markers have become important tools for a broad range of applications such as genome mapping and genetic diversity studies. SSRs are readily identified within DNA sequence data and PCR primers can be designed for their amplification. These PCR primers frequently cross amplify within related species. We report a web-based tool, SSR Primer, that integrates SPUTNIK, an SSR repeat finder, with Primer3, a primer design program, within one pipeline. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. Results are then parsed to Primer3 for locus specific primer design. We have applied this tool for the discovery of SSRs within the complete GenBank database, and have designed PCR amplification primers for over 13 million SSRs. The SSR Taxonomy Tree server provides web-based searching and browsing of species and taxa for the visualisation and download of these SSR amplification primers. These tools are available at http://bioinformatics.pbcbasc.latrobe.edu.au/ssrdiscovery.html. PMID:16845092
SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery.
Jewell, Erica; Robinson, Andrew; Savage, David; Erwin, Tim; Love, Christopher G; Lim, Geraldine A C; Li, Xi; Batley, Jacqueline; Spangenberg, German C; Edwards, David
2006-07-01
Simple sequence repeat (SSR) molecular genetic markers have become important tools for a broad range of applications such as genome mapping and genetic diversity studies. SSRs are readily identified within DNA sequence data and PCR primers can be designed for their amplification. These PCR primers frequently cross amplify within related species. We report a web-based tool, SSR Primer, that integrates SPUTNIK, an SSR repeat finder, with Primer3, a primer design program, within one pipeline. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. Results are then parsed to Primer3 for locus specific primer design. We have applied this tool for the discovery of SSRs within the complete GenBank database, and have designed PCR amplification primers for over 13 million SSRs. The SSR Taxonomy Tree server provides web-based searching and browsing of species and taxa for the visualisation and download of these SSR amplification primers. These tools are available at http://bioinformatics.pbcbasc.latrobe.edu.au/ssrdiscovery.html.
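The first stage of the pipeline described above, screening a sequence for SSRs, can be illustrated with a short sketch. It assumes a simple regular-expression definition of an SSR (a 2 to 5 base unit repeated at least four times in tandem); the thresholds and names are hypothetical, and SPUTNIK itself uses its own scoring scheme rather than this exact-match approach.

```python
import re
from typing import Iterator, NamedTuple


class SSR(NamedTuple):
    start: int   # 0-based start of the repeat run
    motif: str   # repeated unit, e.g. "AG"
    copies: int  # number of tandem copies


def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 5,
              min_copies: int = 4) -> Iterator[SSR]:
    """Yield perfect tandem repeats found by a left-to-right regex scan.

    The lazy quantifier prefers the shortest unit, so "AGAGAGAG" is
    reported as AG x 4 rather than AGAG x 2. Imperfect repeats are not
    detected, and a homopolymer run may be reported with a unit such
    as "AA".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        yield SSR(m.start(), motif, len(m.group(0)) // len(motif))


# Illustrative sequence, not taken from GenBank.
hits = list(find_ssrs("CCTT" + "AG" * 6 + "CCT"))
```

In a pipeline like the one described, the coordinates of each hit would then be handed to a primer design program so that primers flank the repeat.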
Graw, Michael F.; D'Angelo, Grace; Borchers, Matthew; Thurber, Andrew R.; Johnson, Joel E.; Zhang, Chuanlun; Liu, Haodong; Colwell, Frederick S.
2018-01-01
The deep marine subsurface is a heterogeneous environment in which the assembly of microbial communities is thought to be controlled by a combination of organic matter deposition, electron acceptor availability, and sedimentology. However, the relative importance of these factors in structuring microbial communities in marine sediments remains unclear. The South China Sea (SCS) experiences significant variability in sedimentation across the basin and features discrete changes in sedimentology as a result of episodic deposition of turbidites and volcanic ashes within lithogenic clays and siliceous or calcareous ooze deposits throughout the basin's history. Deep subsurface microbial communities were recently sampled by the International Ocean Discovery Program (IODP) at three locations in the SCS with sedimentation rates of 5, 12, and 20 cm per thousand years. Here, we used Illumina sequencing of the 16S ribosomal RNA gene to characterize deep subsurface microbial communities from distinct sediment types at these sites. Communities across all sites were dominated by several poorly characterized taxa implicated in organic matter degradation, including Atribacteria, Dehalococcoidia, and Aerophobetes. Sulfate-reducing bacteria comprised only 4% of the community across sulfate-bearing sediments from multiple cores and did not change in abundance in sediments from the methanogenic zone at the site with the lowest sedimentation rate. Microbial communities were significantly structured by sediment age and the availability of sulfate as an electron acceptor in pore waters. However, microbial communities demonstrated no partitioning based on the sediment type they inhabited. These results indicate that microbial communities in the SCS are structured by the availability of electron donors and acceptors rather than sedimentological characteristics. PMID:29696012
Boosting compound-protein interaction prediction by deep learning.
Tian, Kai; Shao, Mingyu; Wang, Yang; Guan, Jihong; Zhou, Shuigeng
2016-11-01
The identification of interactions between compounds and proteins plays an important role in network pharmacology and drug discovery. However, experimentally identifying compound-protein interactions (CPIs) is generally expensive and time-consuming, so computational approaches have been introduced. Among these, machine-learning based methods have achieved considerable success. However, due to the nonlinear and imbalanced nature of biological data, many machine learning approaches have their own limitations. Recently, deep learning techniques have shown advantages over many state-of-the-art machine learning methods in some applications. In this study, we aim at improving the performance of CPI prediction based on deep learning, and propose a method called DL-CPI (the abbreviation of Deep Learning for Compound-Protein Interactions prediction), which employs a deep neural network (DNN) to effectively learn the representations of compound-protein pairs. Extensive experiments show that DL-CPI can learn useful features of compound-protein pairs by a layerwise abstraction, and thus achieves better prediction performance than existing methods on both balanced and imbalanced datasets. Copyright © 2016 Elsevier Inc. All rights reserved.
Iftikhar, Romana; Ashfaq, Muhammad; Rasool, Akhtar; Hebert, Paul D N
2016-01-01
Although thrips are globally important crop pests and vectors of viral disease, species identifications are difficult because of their small size and inconspicuous morphological differences. Sequence variation in the mitochondrial COI-5' (DNA barcode) region has proven effective for the identification of species in many groups of insect pests. We analyzed barcode sequence variation among 471 thrips from various plant hosts in north-central Pakistan. The Barcode Index Number (BIN) system assigned these sequences to 55 BINs, while the Automatic Barcode Gap Discovery detected 56 partitions, a count that coincided with the number of monophyletic lineages recognized by Neighbor-Joining analysis and Bayesian inference. Congeneric species showed an average of 19% sequence divergence (range = 5.6% - 27%) at COI, while intraspecific distances averaged 0.6% (range = 0.0% - 7.6%). BIN analysis suggested that all intraspecific divergence >3.0% actually involved a species complex. In fact, sequences for three major pest species (Haplothrips reuteri, Thrips palmi, Thrips tabaci), and one predatory thrips (Aeolothrips intermedius) showed deep intraspecific divergences, providing evidence that each is a cryptic species complex. The study compiles the first barcode reference library for the thrips of Pakistan, and examines global haplotype diversity in four important pest thrips.
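The divergence figures above are pairwise distances between aligned barcode sequences. The sketch below shows one simple way to compute such a distance and to apply the 3% intraspecific threshold mentioned in the abstract. It assumes the uncorrected p-distance (the fraction of differing sites) on pre-aligned sequences; the study may have used a corrected distance model, and the function names are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def is_deep_divergence(distance: float, threshold: float = 0.03) -> bool:
    """True when an intraspecific distance exceeds the chosen threshold."""
    return distance > threshold


# Toy 10-base sequences for illustration; real barcodes are much longer.
d = p_distance("ACGTACGTAC", "ACGTACGTAA")
```

With real data the function would be applied to every pair of sequences assigned to the same species, and pairs flagged by `is_deep_divergence` would be examined as possible cryptic species.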
BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone.
Yang, Bite; Liu, Feng; Ren, Chao; Ouyang, Zhangyi; Xie, Ziwei; Bo, Xiaochen; Shu, Wenjie
2017-07-01
Enhancer elements are noncoding stretches of DNA that play key roles in controlling gene expression programmes. Despite major efforts to develop accurate enhancer prediction methods, identifying enhancer sequences continues to be a challenge in the annotation of mammalian genomes. One of the major issues is the lack of a large, sufficiently comprehensive set of experimentally validated enhancers for humans or other species. Thus, the development of computational methods based on limited experimentally validated enhancers and deciphering the transcriptional regulatory code encoded in the enhancer sequences is urgent. We present a deep-learning-based hybrid architecture, BiRen, which predicts enhancers using the DNA sequence alone. Our results demonstrate that BiRen can learn common enhancer patterns directly from the DNA sequence and exhibits superior accuracy, robustness and generalizability in enhancer prediction relative to other state-of-the-art enhancer predictors based on sequence characteristics. BiRen will enable researchers to acquire a deeper understanding of the regulatory code of enhancer sequences. BiRen can be freely accessed at https://github.com/wenjiegroup/BiRen. Contact: shuwj@bmi.ac.cn or boxc@bmi.ac.cn. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved.
Modelling and enhanced molecular dynamics to steer structure-based drug discovery.
Kalyaanamoorthy, Subha; Chen, Yi-Ping Phoebe
2014-05-01
The ever-increasing gap between the availabilities of the genome sequences and the crystal structures of proteins remains one of the significant challenges to the modern drug discovery efforts. The knowledge of structure-dynamics-functionalities of proteins is important in order to understand several key aspects of structure-based drug discovery, such as drug-protein interactions, drug binding and unbinding mechanisms and protein-protein interactions. This review presents a brief overview on the different state of the art computational approaches that are applied for protein structure modelling and molecular dynamics simulations of biological systems. We give an essence of how different enhanced sampling molecular dynamics approaches, together with regular molecular dynamics methods, assist in steering the structure based drug discovery processes. Copyright © 2013 Elsevier Ltd. All rights reserved.
Deep learning for neuroimaging: a validation study.
Plis, Sergey M; Hjelm, Devon R; Salakhutdinov, Ruslan; Allen, Elena A; Bockholt, Henry J; Long, Jeffrey D; Johnson, Hans J; Paulsen, Jane S; Turner, Jessica A; Calhoun, Vince D
2014-01-01
Deep learning methods have recently made notable advances in the tasks of classification and representation learning. These tasks are important for brain imaging and neuroscience discovery, making the methods attractive for porting to a neuroimager's toolbox. Success of these methods is, in part, explained by the flexibility of deep learning models. However, this flexibility makes the process of porting to new areas a difficult parameter optimization problem. In this work we demonstrate our results (and feasible parameter ranges) in application of deep learning methods to structural and functional brain imaging data. These methods include deep belief networks and their building block the restricted Boltzmann machine. We also describe a novel constraint-based approach to visualizing high dimensional data. We use it to analyze the effect of parameter choices on data transformations. Our results show that deep learning methods are able to learn physiologically important representations and detect latent relations in neuroimaging data.
Low Data Drug Discovery with One-Shot Learning
2017-01-01
Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds (Ma, J. et al. J. Chem. Inf. Model. 2015, 55, 263–274; PMID:25635324). However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the iterative refinement long short-term memory, that, when combined with graph convolutional neural networks, significantly improves learning of meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery (Ramsundar, B. deepchem.io. https://github.com/deepchem/deepchem, 2016). PMID:28470045
Is Multitask Deep Learning Practical for Pharma?
Ramsundar, Bharath; Liu, Bowen; Wu, Zhenqin; Verras, Andreas; Tudor, Matthew; Sheridan, Robert P; Pande, Vijay
2017-08-28
Multitask deep learning has emerged as a powerful tool for computational drug discovery. However, despite a number of preliminary studies, multitask deep networks have yet to be widely deployed in the pharmaceutical and biotech industries. This lack of acceptance stems from both software difficulties and lack of understanding of the robustness of multitask deep networks. Our work aims to resolve both of these barriers to adoption. We introduce a high-quality open-source implementation of multitask deep networks as part of the DeepChem open-source platform. Our implementation enables simple Python scripts to construct, fit, and evaluate sophisticated deep models. We use our implementation to analyze the performance of multitask deep networks and related deep models on four collections of pharmaceutical data (three of which have not previously been analyzed in the literature). We split these data sets into train/valid/test using time and neighbor splits to test multitask deep learning performance under challenging conditions. Our results demonstrate that multitask deep networks are surprisingly robust and can offer strong improvement over random forests. Our analysis and open-source implementation in DeepChem provide an argument that multitask deep networks are ready for widespread use in commercial drug discovery.
Deep Recurrent Neural Networks for Human Activity Recognition
Murad, Abdulmajid
2017-01-01
Adopting deep learning methods for human activity recognition has been effective in extracting discriminative features from raw input sequences acquired from body-worn sensors. Although human movements are encoded in a sequence of successive samples in time, typical machine learning methods perform recognition tasks without exploiting the temporal correlations between input data samples. Convolutional neural networks (CNNs) address this issue by using convolutions across a one-dimensional temporal sequence to capture dependencies among input data. However, the size of convolutional kernels restricts the captured range of dependencies between data samples. As a result, typical models are unadaptable to a wide range of activity-recognition configurations and require fixed-length input windows. In this paper, we propose the use of deep recurrent neural networks (DRNNs) for building recognition models that are capable of capturing long-range dependencies in variable-length input sequences. We present unidirectional, bidirectional, and cascaded architectures based on long short-term memory (LSTM) DRNNs and evaluate their effectiveness on miscellaneous benchmark datasets. Experimental results show that our proposed models outperform methods employing conventional machine learning, such as support vector machine (SVM) and k-nearest neighbors (KNN). Additionally, the proposed models yield better performance than other deep learning techniques, such as deep belief networks (DBNs) and CNNs. PMID:29113103
Deep Recurrent Neural Networks for Human Activity Recognition.
Murad, Abdulmajid; Pyun, Jae-Young
2017-11-06
Adopting deep learning methods for human activity recognition has been effective in extracting discriminative features from raw input sequences acquired from body-worn sensors. Although human movements are encoded in a sequence of successive samples in time, typical machine learning methods perform recognition tasks without exploiting the temporal correlations between input data samples. Convolutional neural networks (CNNs) address this issue by using convolutions across a one-dimensional temporal sequence to capture dependencies among input data. However, the size of convolutional kernels restricts the captured range of dependencies between data samples. As a result, typical models are unadaptable to a wide range of activity-recognition configurations and require fixed-length input windows. In this paper, we propose the use of deep recurrent neural networks (DRNNs) for building recognition models that are capable of capturing long-range dependencies in variable-length input sequences. We present unidirectional, bidirectional, and cascaded architectures based on long short-term memory (LSTM) DRNNs and evaluate their effectiveness on miscellaneous benchmark datasets. Experimental results show that our proposed models outperform methods employing conventional machine learning, such as support vector machine (SVM) and k-nearest neighbors (KNN). Additionally, the proposed models yield better performance than other deep learning techniques, such as deep belief networks (DBNs) and CNNs.
DCO-VIVO: A Collaborative Data Platform for the Deep Carbon Science Communities
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, Y.; West, P.; Erickson, J. S.; Ma, X.; Fox, P. A.
2014-12-01
Deep Carbon Observatory (DCO) is a decade-long scientific endeavor to understand carbon in the complex deep Earth system. Thousands of DCO scientists from institutions across the globe are organized into communities representing four domains of exploration: Extreme Physics and Chemistry, Reservoirs and Fluxes, Deep Energy, and Deep Life. Cross-community and cross-disciplinary collaboration is one of the most distinctive features in DCO's flexible research framework. VIVO is an open-source Semantic Web platform that facilitates cross-institutional researcher and research discovery. It includes a number of standard ontologies that interconnect people, organizations, publications, activities, locations, and other entities of research interest to enable browsing, searching, visualizing, and generating Linked Open (research) Data. The DCO-VIVO solution expedites research collaboration between DCO scientists and communities. Based on DCO's specific requirements, the DCO Data Science team developed a series of extensions to the VIVO platform including extending the VIVO information model, extended query over the semantic information within VIVO, integration with other open source collaborative environments and data management systems, using single sign-on, assigning of unique Handles to DCO objects, and publication and dataset ingesting extensions using existing publication systems. We present here the iterative development of these requirements that are now in daily use by the DCO community of scientists for research reporting, information sharing, and resource discovery in support of research activities and program management.
Deep-sea vent chemoautotrophs: diversity, biochemistry and ecological significance.
Nakagawa, Satoshi; Takai, Ken
2008-07-01
Deep-sea vents support productive ecosystems driven primarily by chemoautotrophs. Chemoautotrophs are organisms that are able to fix inorganic carbon using chemical energy obtained through the oxidation of reduced compounds. Since the discovery of deep-sea vent ecosystems in 1977, it has become increasingly clear that deep-sea vent chemoautotrophs display remarkable physiological and phylogenetic diversity. Cultivation-dependent and -independent studies have led to an emerging view that the majority of deep-sea vent chemoautotrophs have the ability to derive energy from a variety of redox couples other than the conventional sulfur-oxygen couple, and fix inorganic carbon via the reductive tricarboxylic acid cycle. In addition, recent genomic, metagenomic and postgenomic studies have considerably accelerated the comprehensive understanding of molecular mechanisms of deep-sea vent chemoautotrophy, even in yet uncultivable endosymbionts of vent fauna. Genomic analysis also suggested that there are previously unrecognized evolutionary links between deep-sea vent chemoautotrophs and important human/animal pathogens. This review summarizes chemoautotrophy in deep-sea vents, highlighting recent biochemical and genomic discoveries.
Transcription profile of boar spermatozoa as revealed by RNA-sequencing
USDA-ARS's Scientific Manuscript database
High-throughput RNA sequencing (RNA-Seq) overcomes the limitations of the current hybridization-based techniques to detect the actual pool of RNA transcripts in spermatozoa. The application of this technology in livestock can speed the discovery of potential predictors of male fertility. As a first ...
USDA-ARS's Scientific Manuscript database
Background: Vertebrate immune systems generate diverse repertoires of antibodies capable of mediating response to a variety of antigens. Next generation sequencing methods provide unique approaches to a number of immuno-based research areas including antibody discovery and engineering, disease surve...
Next generation sequencing applications for microRNA biomarker discovery in toxicological studies
Next Generation Sequencing (NGS) technology will be reviewed for its base pair resolution, wide dynamic range, and insights into the genome and transcriptome, with special focus upon the biomarker potential of microRNAs (miRNAs). The first part of this presentation reviews commo...
SNPServer: a real-time SNP discovery tool.
Savage, David; Batley, Jacqueline; Erwin, Tim; Logan, Erica; Love, Christopher G; Lim, Geraldine A C; Mongin, Emmanuel; Barker, Gary; Spangenberg, German C; Edwards, David
2005-07-01
SNPServer is a real-time flexible tool for the discovery of SNPs (single nucleotide polymorphisms) within DNA sequence data. The program uses BLAST, to identify related sequences, and CAP3, to cluster and align these sequences. The alignments are parsed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels). Alternatively, lists of related sequences or pre-assembled sequences may be entered for SNP discovery. SNPServer and autoSNP use redundancy to differentiate between candidate SNPs and sequence errors. For each candidate SNP, two measures of confidence are calculated, the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. SNPServer is available at http://hornbill.cspp.latrobe.edu.au/snpdiscovery.html.
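The redundancy criterion described above can be illustrated with a small sketch. This is not the autoSNP implementation: the function name `candidate_snps`, the `min_redundancy` threshold, and the toy alignment are all assumptions made for illustration. A column is reported as a candidate SNP only when at least two different bases each occur in at least `min_redundancy` aligned sequences, so a base seen only once is treated as a likely sequencing error.

```python
from collections import Counter


def candidate_snps(alignment, min_redundancy=2):
    """Return (column, {base: count}) for columns with >= 2 well-supported bases.

    alignment: list of equal-length aligned sequences; '-' marks a gap.
    min_redundancy: hypothetical support threshold, not a value from the source.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for col, bases in enumerate(zip(*alignment)):
        # Count each base in this column, ignoring gap characters.
        counts = Counter(b for b in bases if b != "-")
        # Keep only bases seen often enough to be trusted.
        supported = {b: n for b, n in counts.items() if n >= min_redundancy}
        if len(supported) >= 2:
            hits.append((col, supported))
    return hits


reads = [
    "ACGTAC",
    "ACGTAC",
    "ACTTAC",
    "ACTTAC",
    "ACGTAG",  # the lone G in the last column has no redundancy
]
print(candidate_snps(reads))
```

In this toy alignment only the third column passes, because both G and T are seen at least twice there; the single G in the last column is discarded. The co-segregation measure mentioned in the abstract is not modeled here.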
Direct AUC optimization of regulatory motifs.
Zhu, Lin; Zhang, Hong-Bo; Huang, De-Shuang
2017-07-15
The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the discovery power of the large accumulated body of high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences. We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Meanwhile, preliminary results show that CDAUC may also be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequence specificities. CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8. Contact: dshuang@tongji.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
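To make the AUC criterion concrete, here is a minimal sketch of how an AUC can be computed from motif scores of bound (positive) and unbound (negative) sequences. It is not the CDAUC algorithm; the function name `auc` and the example scores are assumptions. Because the value depends only on how the scores are ordered, moving a single score changes the AUC only when it crosses another score, which gives an intuition for why a coordinate-wise sub-problem can be piece-wise constant.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


neg = [0.5, 0.1]
print(auc([0.9, 0.40], neg))  # second positive sits below the 0.5 negative
print(auc([0.9, 0.45], neg))  # moved, but no negative score was crossed
print(auc([0.9, 0.60], neg))  # now it has crossed the 0.5 negative
```

The first two calls give the same value and the third is higher, since only the third change alters the ordering. The quadratic pair loop is chosen for readability; a rank-based computation would be preferable for large datasets.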
Making sense of deep sequencing
Goldman, D.; Domschke, K.
2016-01-01
This review, the first of an occasional series, tries to make sense of the concepts and uses of deep sequencing of polynucleic acids (DNA and RNA). Deep sequencing, synonymous with next-generation sequencing, high-throughput sequencing and massively parallel sequencing, includes whole genome sequencing but is more often and diversely applied to specific parts of the genome captured in different ways, for example the highly expressed portion of the genome known as the exome and portions of the genome that are epigenetically marked either by DNA methylation, the binding of proteins including histones, or that are in different configurations and thus more or less accessible to enzymes that cleave DNA. Deep sequencing of RNA (RNASeq) reverse-transcribed to complementary DNA is invaluable for measuring RNA expression and detecting changes in RNA structure. Important concepts in deep sequencing include the length and depth of sequence reads, mapping and assembly of reads, sequencing error, haplotypes, and the propensity of deep sequencing, as with other types of ‘big data’, to generate large numbers of errors, requiring monitoring for methodologic biases and strategies for replication and validation. Deep sequencing yields a unique genetic fingerprint that can be used to identify a person, and a trove of predictors of genetic medical diseases. Deep sequencing to identify epigenetic events including changes in DNA methylation and RNA expression can reveal the history and impact of environmental exposures. Because of the power of sequencing to identify and deliver biomedically significant information about a person and their blood relatives, it creates ethical dilemmas and practical challenges in research and clinical care, for example the decision and procedures to report incidental findings that will increasingly and frequently be discovered. PMID:24925306
MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping
Lee, Wan-Ping; Stromberg, Michael P.; Ward, Alistair; Stewart, Chip; Garrison, Erik P.; Marth, Gabor T.
2014-01-01
MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me). PMID:24599324
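As an illustration of the Smith-Waterman step named above, the following sketch computes the best local alignment score between a read and a reference window. It is not MOSAIK's code: the function name `smith_waterman_score`, the linear gap penalty, and the scoring values are assumptions, and the hash clustering that selects candidate reference regions is not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative defaults, not MOSAIK's parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))       # read found inside a longer window
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))  # one extra base in the reference
```

Keeping only two rows of the table is enough when just the score is needed; recovering the aligned positions and the indels themselves would require storing the full table and tracing back.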
Molecular Mapping of Restriction-Site Associated DNA Markers In Allotetraploid Upland Cotton.
Wang, Yangkun; Ning, Zhiyuan; Hu, Yan; Chen, Jiedan; Zhao, Rui; Chen, Hong; Ai, Nijiang; Guo, Wangzhen; Zhang, Tianzhen
2015-01-01
Upland cotton (Gossypium hirsutum L., 2n = 52, AADD) is an allotetraploid, therefore the discovery of single nucleotide polymorphism (SNP) markers is difficult. The recent emergence of genome complexity reduction technologies based on the next-generation sequencing (NGS) platform has greatly expedited SNP discovery in crops with highly repetitive and complex genomes. Here we applied restriction-site associated DNA (RAD) sequencing technology for de novo SNP discovery in allotetraploid cotton. We identified 21,109 SNPs between the two parents and used these for genotyping of 161 recombinant inbred lines (RILs). Finally, a high-density linkage map comprising 4,153 loci over 3500-cM was developed based on the previous result. Using this map, quantitative trait loci (QTLs) conferring fiber strength and Verticillium Wilt (VW) resistance were mapped to a more accurate region in comparison to the 1576-cM interval determined using the simple sequence repeat (SSR) genetic map. This suggests that the newly constructed map has more power and resolution than the previous SSR map. It will pave the way for the rapid identification of markers for marker-assisted selection in cotton breeding and for the cloning of QTLs for traits of interest.
Motif-based analysis of large nucleotide data sets using MEME-ChIP
Ma, Wenxiu; Noble, William S; Bailey, Timothy L
2014-01-01
MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by cLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. This protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods. PMID:24853928
Identifying active foraminifera in the Sea of Japan using metatranscriptomic approach
NASA Astrophysics Data System (ADS)
Lejzerowicz, Franck; Voltsky, Ivan; Pawlowski, Jan
2013-02-01
Metagenetics represents an efficient and rapid tool to describe environmental diversity patterns of microbial eukaryotes based on ribosomal DNA sequences. However, the results of metagenetic studies are often biased by the presence of extracellular DNA molecules that are persistent in the environment, especially in deep-sea sediment. As an alternative, short-lived RNA molecules constitute a good proxy for the detection of active species. Here, we used a metatranscriptomic approach based on RNA-derived (cDNA) sequences to study the diversity of the deep-sea benthic foraminifera and compared it to the metagenetic approach. We analyzed 257 ribosomal DNA and cDNA sequences obtained from seven sediment samples collected in the Sea of Japan at depths ranging from 486 to 3665 m. The DNA and RNA-based approaches gave a similar view of the taxonomic composition of foraminiferal assemblage, but differed in some important points. First, the cDNA dataset was dominated by sequences of rotaliids and robertiniids, suggesting that these calcareous species, some of which have been observed in Rose Bengal stained samples, are the most active component of foraminiferal community. Second, the richness of monothalamous (single-chambered) foraminifera was particularly high in DNA extracts from the deepest samples, confirming that this group of foraminifera is abundant but not necessarily very active in the deep-sea sediments. Finally, the high divergence of undetermined sequences in cDNA dataset indicates the limits of our database and lack of knowledge about some active but possibly rare species. Our study demonstrates the capability of the metatranscriptomic approach to detect active foraminiferal species and prompts its use in future high-throughput sequencing-based environmental surveys.
McAlpine, James B
2009-03-27
Over the past decade major changes have occurred in the access to genome sequences that encode the enzymes responsible for the biosynthesis of secondary metabolites, knowledge of how those sequences translate into the final structure of the metabolite, and the ability to alter the sequence to obtain predicted products via both homologous and heterologous expression. Novel genera have been discovered leading to new chemotypes, but more surprisingly several instances have been uncovered where the apparently general rules of modular translation have not applied. Several new biosynthetic pathways have been unearthed, and our general knowledge grows rapidly. This review aims to highlight some of the more striking discoveries and advances of the decade.
Zeng, Cong; Thomas, Leighton J; Kelly, Michelle; Gardner, Jonathan P A
2016-05-01
The complete mitochondrial genome of a New Zealand specimen of the deep-sea sponge Poecillastra laminaris (Sollas, 1886) (Astrophorida, Vulcanellidae), from the Colville Ridge, New Zealand, was sequenced using the 454 Life Science pyrosequencing system. To identify homologous mitochondrial sequences, the 454 reads were mapped to the complete mitochondrial genome sequence of Geodia neptuni (GenBank No. NC_006990). The P. laminaris genome is 18,413 bp in length and includes 14 protein-coding genes, 24 transfer RNA genes and 2 ribosomal RNA genes. Gene order resembled that of other demosponges. The base composition of the genome is A (29.1%), T (35.2%), C (14.0%) and G (21.7%). This is the second published mitogenome for a sponge of the order Astrophorida and will be useful in future phylogenetic analysis of deep-sea sponges.
Drug Discovery in Fish, Flies, and Worms
Strange, Kevin
2016-01-01
Nonmammalian model organisms such as the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the zebrafish Danio rerio provide numerous experimental advantages for drug discovery including genetic and molecular tractability, amenability to high-throughput screening methods and reduced experimental costs and increased experimental throughput compared to traditional mammalian models. An interdisciplinary approach that strategically combines the study of nonmammalian and mammalian animal models with diverse experimental tools has provided and will continue to provide deep molecular and genetic understanding of human disease and will significantly enhance the discovery and application of new therapies to treat those diseases. This review will provide an overview of C. elegans, Drosophila, and zebrafish biology and husbandry and will discuss how these models are being used for phenotype-based drug screening and for identification of drug targets and mechanisms of action. The review will also describe how these and other nonmammalian model organisms are uniquely suited for the discovery of drug-based regenerative medicine therapies. PMID:28053067
Strain-Level Diversity of Secondary Metabolism in Streptomyces albus
Seipke, Ryan F.
2015-01-01
Streptomyces spp. are robust producers of medicinally-, industrially- and agriculturally-important small molecules. Increased resistance to antibacterial agents and the lack of new antibiotics in the pipeline have led to a renaissance in natural product discovery. This endeavor has benefited from inexpensive high quality DNA sequencing technology, which has generated more than 140 genome sequences for taxonomic type strains and environmental Streptomyces spp. isolates. Many of the sequenced streptomycetes belong to the same species. For instance, Streptomyces albus has been isolated from diverse environmental niches and seven strains have been sequenced, consequently this species has been sequenced more than any other streptomycete, allowing valuable analyses of strain-level diversity in secondary metabolism. Bioinformatics analyses identified a total of 48 unique biosynthetic gene clusters harboured by Streptomyces albus strains. Eighteen of these gene clusters specify the core secondary metabolome of the species. Fourteen of the gene clusters are contained by one or more strain and are considered auxiliary, while 16 of the gene clusters encode the production of putative strain-specific secondary metabolites. Analysis of Streptomyces albus strains suggests that each strain of a Streptomyces species likely harbours at least one strain-specific biosynthetic gene cluster. Importantly, this implies that deep sequencing of a species will not exhaust gene cluster diversity and will continue to yield novelty. PMID:25635820
A fortran program for Monte Carlo simulation of oil-field discovery sequences
Bohling, Geoffrey C.; Davis, J.C.
1993-01-01
We have developed a program for performing Monte Carlo simulation of oil-field discovery histories. A synthetic parent population of fields is generated as a finite sample from a distribution of specified form. The discovery sequence then is simulated by sampling without replacement from this parent population in accordance with a probabilistic discovery process model. The program computes a chi-squared deviation between synthetic and actual discovery sequences as a function of the parameters of the discovery process model, the number of fields in the parent population, and the distributional parameters of the parent population. The program employs the three-parameter log gamma model for the distribution of field sizes and employs a two-parameter discovery process model, allowing the simulation of a wide range of scenarios. © 1993.
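The sampling scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not the published Fortran program: a lognormal parent population stands in for the three-parameter log gamma model, the hypothetical exponent `beta` stands in for the two-parameter discovery process model (whose form is not given here), and the chi-squared comparison with an actual discovery history is omitted.

```python
import random


def simulate_discovery(sizes, beta=1.0, rng=None):
    """Order fields by sampling without replacement, P(next) proportional to size**beta.

    beta is a hypothetical discoverability exponent: 0 gives a purely random
    order, larger values make big fields more likely to be found early.
    """
    rng = rng or random.Random()
    remaining = list(sizes)
    order = []
    while remaining:
        weights = [s ** beta for s in remaining]
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        order.append(remaining.pop(idx))  # a discovered field leaves the pool
    return order


rng = random.Random(42)
# Assumed lognormal parent population of 200 field sizes (arbitrary units).
parent = [rng.lognormvariate(3.0, 1.0) for _ in range(200)]
sequence = simulate_discovery(parent, beta=1.0, rng=rng)

first, last = sequence[:50], sequence[-50:]
print(sum(first) / len(first), sum(last) / len(last))
```

With size-biased sampling the early discoveries tend to be much larger than the late ones, which is the declining size-with-order pattern that such simulations are used to study. Repeating the run over many seeds and parameter values would be needed before comparing against a real discovery sequence.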
De novo peptide sequencing by deep learning
Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming
2017-01-01
De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7–22.9% higher accuracy at the amino acid level and 38.1–64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5–100% coverage and 97.2–99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming. PMID:28720701
The Expanding Family of Virophages
Bekliz, Meriem; Colson, Philippe; La Scola, Bernard
2016-01-01
Virophages replicate with giant viruses in the same eukaryotic cells. They are a major component of the specific mobilome of mimiviruses. Since their discovery in 2008, five other representatives have been isolated, 18 new genomes have been described, two of which being nearly completely sequenced, and they have been classified in a new viral family, Lavidaviridae. Virophages are small viruses with approximately 35–74 nm large icosahedral capsids and 17–29 kbp large double-stranded DNA genomes with 16–34 genes, among which a very small set is shared with giant viruses. Virophages have been isolated or detected in various locations and in a broad range of habitats worldwide, including the deep ocean and inland. Humans, therefore, could be commonly exposed to virophages, although currently limited evidence exists of their presence in humans based on serology and metagenomics. The distribution of virophages, the consequences of their infection and the interactions with their giant viral hosts within eukaryotic cells deserve further research. PMID:27886075
Less is More: Membrane Protein Digestion Beyond Urea–Trypsin Solution for Next-level Proteomics
Zhang, Xi
2015-01-01
The goal of next-level bottom-up membrane proteomics is protein function investigation, via high-coverage high-throughput peptide-centric quantitation of expression, modifications and dynamic structures at systems scale. Yet efficient digestion of mammalian membrane proteins presents a daunting barrier, and prevalent day-long urea–trypsin in-solution digestion proved insufficient to reach this goal. Many efforts contributed incremental advances over past years, but involved protein denaturation that disconnected measurement from functional states. Beyond denaturation, the recent discovery of structure/proteomics omni-compatible detergent n-dodecyl-β-d-maltopyranoside, combined with pepsin and PNGase F columns, enabled breakthroughs in membrane protein digestion: a 2010 DDM-low-TCEP (DLT) method for H/D-exchange (HDX) using human G protein-coupled receptor, and a 2015 flow/detergent-facilitated protease and de-PTM digestions (FDD) for integrative deep sequencing and quantitation using full-length human ion channel complex. Distinguishing protein solubilization from denaturation, protease digestion reliability from theoretical specificity, and reduction from alkylation, these methods shifted day(s)-long paradigms into minutes, and afforded fully automatable (HDX)-protein-peptide-(tandem mass tag)-HPLC pipelines to instantly measure functional proteins at deep coverage, high peptide reproducibility, low artifacts and minimal leakage. Promoting—not destroying—structures and activities harnessed membrane proteins for the next-level streamlined functional proteomics. This review analyzes recent advances in membrane protein digestion methods and highlights critical discoveries for future proteomics. PMID:26081834
Discovery sequence and the nature of low permeability gas accumulations
Attanasi, E.D.
2005-01-01
There is an ongoing discussion regarding the geologic nature of accumulations that host gas in low-permeability sandstone environments. This note examines the discovery sequence of the accumulations in low permeability sandstone plays that were classified as continuous-type by the U.S. Geological Survey for the 1995 National Oil and Gas Assessment. It compares the statistical character of historical discovery sequences of accumulations associated with continuous-type sandstone gas plays to those of conventional plays. The seven sandstone plays with sufficient data exhibit declining size with sequence order, on average, and in three of the seven the trend is statistically significant. Simulation experiments show that both a skewed endowment size distribution and a discovery process that mimics sampling proportional to size are necessary to generate a discovery sequence that consistently produces a statistically significant negative size order relationship. The empirical findings suggest that discovery sequence could be used to constrain assessed gas in untested areas. The plays examined represent 134 of the 265 trillion cubic feet of recoverable gas assessed in undeveloped areas of continuous-type gas plays in low permeability sandstone environments reported in the 1995 National Assessment. © 2005 International Association for Mathematical Geology.
Integrating functional genomics to accelerate mechanistic personalized medicine.
Tyner, Jeffrey W
2017-03-01
The advent of deep sequencing technologies has resulted in the deciphering of tremendous amounts of genetic information. These data have led to major discoveries, and many anecdotes now exist of individual patients whose clinical outcomes have benefited from novel, genetically guided therapeutic strategies. However, the majority of genetic events in cancer are currently undrugged, leading to a biological gap between understanding of tumor genetic etiology and translation to improved clinical approaches. Functional screening has made tremendous strides in recent years with the development of new experimental approaches to studying ex vivo and in vivo drug sensitivity. Numerous discoveries and anecdotes also exist for translation of functional screening into novel clinical strategies; however, the current clinical application of functional screening remains largely confined to small clinical trials at specific academic centers. The intersection between genomic and functional approaches represents an ideal modality to accelerate our understanding of drug sensitivities as they relate to specific genetic events and further understand the full mechanisms underlying drug sensitivity patterns.
Deep Space Mission Applications for NEXT: NASA's Evolutionary Xenon Thruster
NASA Technical Reports Server (NTRS)
Oh, David; Benson, Scott; Witzberger, Kevin; Cupples, Michael
2004-01-01
NASA's Evolutionary Xenon Thruster (NEXT) is designed to address a need for advanced ion propulsion systems on certain future NASA deep space missions. This paper surveys seven potential missions that have been identified as being able to take advantage of the unique capabilities of NEXT. Two conceptual missions to Titan and Neptune are analyzed, and it is shown that ion thrusters could decrease launch mass and shorten trip time to Titan compared to chemical propulsion. A potential Mars Sample Return mission is described, and a comparison is made between a chemical mission and a NEXT-based mission. Four possible near-term applications to New Frontiers and Discovery class missions are described, and comparisons are made to chemical systems or existing NSTAR ion propulsion system performance. The results show that NEXT has potential performance and cost benefits for missions in the Discovery, New Frontiers, and larger mission classes.
Liu, Jun-Jun; Xiang, Yu
2011-01-01
WRKY transcription factors are key regulators of numerous biological processes in plant growth and development, as well as plant responses to abiotic and biotic stresses. Research on biological functions of plant WRKY genes has focused in the past on model plant species or species with largely characterized transcriptomes. However, a variety of non-model plants, such as forest conifers, are essential as feed, biofuel, and wood or for sustainable ecosystems. Identification of WRKY genes in these non-model plants is equally important for understanding the evolutionary and function-adaptive processes of this transcription factor family. Because of limited genomic information, the rarity of regulatory gene mRNAs in transcriptomes, and the sequence divergence to model organism genes, identification of transcription factors in non-model plants using methods similar to those generally used for model plants is difficult. This chapter describes a gene family discovery strategy for identification of WRKY transcription factors in conifers by a combination of in silico-based prediction and PCR-based experimental approaches. Compared to traditional cDNA library screening or EST sequencing at transcriptome scales, this integrated gene discovery strategy provides fast, simple, reliable, and specific methods to unveil the WRKY gene family at both genome and transcriptome levels in non-model plants.
He, Bifang; Tjhung, Katrina F; Bennett, Nicholas J; Chou, Ying; Rau, Andrea; Huang, Jian; Derda, Ratmir
2018-01-19
Understanding the composition of a genetically-encoded (GE) library is instrumental to the success of ligand discovery. In this manuscript, we investigate the bias in GE-libraries of linear, macrocyclic and chemically post-translationally modified (cPTM) tetrapeptides displayed on the M13KE platform, which are produced via trinucleotide cassette synthesis (19 codons) and NNK-randomized codon. Differential enrichment of synthetic DNA {S}, ligated vector {L} (extension and ligation of synthetic DNA into the vector), naïve libraries {N} (transformation of the ligated vector into the bacteria followed by expression of the library for 4.5 hours to yield a "naïve" library), and libraries chemically modified by aldehyde ligation and cysteine macrocyclization {M} characterized by paired-end deep sequencing, detected a significant drop in diversity in {L} → {N}, but only a minor compositional difference in {S} → {L} and {N} → {M}. Libraries expressed at the N-terminus of phage protein pIII censored positively charged amino acids Arg and Lys; libraries expressed between pIII domains N1 and N2 overcame Arg/Lys-censorship but introduced new bias towards Gly and Ser. Interrogation of biases arising from cPTM by aldehyde ligation and cysteine macrocyclization unveiled censorship of sequences with Ser/Phe. Analogous analysis can be used to explore library diversity in new display platforms and optimize cPTM of these libraries.
The Deepwater Horizon Oil Spill: Ecogenomics of the Deep-Sea Plume
NASA Astrophysics Data System (ADS)
Hazen, T. C.
2012-12-01
The explosion on April 20, 2010 at the BP-leased Deepwater Horizon drilling rig in the Gulf of Mexico off the coast of Louisiana resulted in oil and gas rising to the surface and the oil coming ashore in many parts of the Gulf. It also resulted in the dispersal of an immense oil plume 4,000 feet below the surface of the water. Despite spanning more than 600 feet in the water column and extending more than 10 miles from the wellhead, the dispersed oil plume was gone within weeks after the wellhead was capped, degraded and diluted to undetectable levels. Furthermore, this degradation took place without significant oxygen depletion. Ecogenomics enabled discovery of new and unclassified species of oil-eating bacteria that apparently live in the deep Gulf where oil seeps are common. Using 16S microarrays, functional gene arrays, clone libraries, lipid analysis and a variety of hydrocarbon and micronutrient analyses we were able to characterize the oil degraders. Metagenomic sequence data was obtained for the deep-water samples using the Illumina platform. In addition, single cells were sorted and sequenced for some of the most dominant bacteria that were represented in the oil plume, namely uncultivated representatives of Colwellia and Oceanospirillum. In addition, we performed laboratory microcosm experiments using uncontaminated water collected from the Gulf at the depth of the oil plume to which we added oil and COREXIT. These samples were characterized by 454 pyrotag. The results provide information about the key players and processes involved in degradation of oil, with and without COREXIT, in different impacted environments in the Gulf of Mexico. We are also extending these studies to explore dozens of deep sediment samples that were also collected after the oil spill around the wellhead. This data suggests that a great potential for intrinsic bioremediation of oil plumes exists in the deep sea and other environs in the Gulf of Mexico.
Zhang, Xi
2016-12-01
Neurotransmitter ligand-gated ion channels (LGICs) are widespread and pivotal in brain functions. Unveiling their structure-function mechanisms is crucial to drive drug discovery, and demands robust proteomic quantitation of expression, post-translational modifications (PTMs) and dynamic structures. Yet unbiased digestion of these modified transmembrane proteins-at high efficiency and peptide reproducibility-poses the obstacle. Targeting both enzyme-substrate contacts and PTMs for peptide formation and detection, we devised flow-and-detergent-facilitated protease and de-PTM digestions for deep sequencing (FDD) method that combined omni-compatible detergent, tandem immobilized protease/PNGase columns, and Cys-selective reduction/alkylation, to achieve streamlined ultradeep peptide preparation within minutes not days, at high peptide reproducibility and low abundance-bias. FDD transformed enzyme-protein contacts into equal catalytic travel paths through enzyme-excessive columns regardless of protein abundance, removed products instantly preventing inhibition, tackled intricate structures via sequential multiple micro-digestions along the flow, and precisely controlled peptide formation by flow rate. Peptide-stage reactions reduced steric bias; low contamination deepened MS/MS scan; distinguishing disulfide from M oxidation and avoiding gain/loss artifacts unmasked protein-endogenous oxidation states. Using a recent interactome of 285-kDa human GABA type A receptor, this pilot study validated FDD platform's applicability to deep sequencing (up to 99% coverage), H/D-exchange and TMT-based structural mapping. FDD discovered novel subunit-specific PTM signatures, including unusual nontop-surface N-glycosylations, that may drive subunit biases in human Cys-loop LGIC assembly and pharmacology, by redefining subunit/ligand interfaces and connecting function domains. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
Patterns of DNA barcode variation in Canadian marine molluscs.
Layton, Kara K S; Martel, André L; Hebert, Paul D N
2014-01-01
Molluscs are the most diverse marine phylum and this high diversity has resulted in considerable taxonomic problems. Because the number of species in Canadian oceans remains uncertain, there is a need to incorporate molecular methods into species identifications. A 648 base pair segment of the cytochrome c oxidase subunit I gene has proven useful for the identification and discovery of species in many animal lineages. While the utility of DNA barcoding in molluscs has been demonstrated in other studies, this is the first effort to construct a DNA barcode registry for marine molluscs across such a large geographic area. This study examines patterns of DNA barcode variation in 227 species of Canadian marine molluscs. Intraspecific sequence divergences ranged from 0-26.4% and a barcode gap existed for most taxa. Eleven cases of relatively deep (>2%) intraspecific divergence were detected, suggesting the possible presence of overlooked species. Structural variation was detected in COI with indels found in 37 species, mostly bivalves. Some indels were present in divergent lineages, primarily in the region of the first external loop, suggesting certain areas are hotspots for change. Lastly, mean GC content varied substantially among orders (24.5%-46.5%), and showed a significant positive correlation with nearest neighbour distances. DNA barcoding is an effective tool for the identification of Canadian marine molluscs and for revealing possible cases of overlooked species. Some species with deep intraspecific divergence showed a biogeographic partition between lineages on the Atlantic, Arctic and Pacific coasts, suggesting the role of Pleistocene glaciations in the subdivision of their populations. Indels were prevalent in the barcode region of the COI gene in bivalves and gastropods. This study highlights the efficacy of DNA barcoding for providing insights into sequence variation across a broad taxonomic group on a large geographic scale.
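The two sequence statistics reported in this abstract, pairwise divergence and GC content, can be illustrated with a minimal Python sketch. It assumes pre-aligned barcode sequences and uses the uncorrected p-distance for simplicity; the abstract does not state which distance model the authors used, and the function names `gc_content` and `p_distance` are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "GC" for b in counted) / len(counted)


def p_distance(a: str, b: str) -> float:
    """Uncorrected pairwise distance between two aligned sequences.

    Only sites where both sequences carry A, C, G or T are compared,
    so gaps and ambiguity codes are skipped. This is an assumption of
    the sketch, not a description of the study's method.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)
```

Under this sketch, a pair of conspecific sequences with `p_distance` above 0.02 would correspond to the "deep (>2%)" intraspecific divergence flagged in the abstract.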
Fu, Shuyue; Liu, Xiang; Luo, Maochao; Xie, Ke; Nice, Edouard C; Zhang, Haiyuan; Huang, Canhua
2017-04-01
Chemoresistance is a major obstacle for current cancer treatment. Proteogenomics is a powerful multi-omics research field that uses customized protein sequence databases generated by genomic and transcriptomic information to identify novel genes (e.g. noncoding, mutation and fusion genes) from mass spectrometry-based proteomic data. By identifying aberrations that are differentially expressed between tumor and normal pairs, this approach can also be applied to validate protein variants in cancer, which may reveal the response to drug treatment. Areas covered: In this review, we will present recent advances in proteogenomic investigations of cancer drug resistance with an emphasis on integrative proteogenomic pipelines and the biomarker discovery which contributes to achieving the goal of using precision/personalized medicine for cancer treatment. Expert commentary: The discovery and comprehensive understanding of potential biomarkers help identify the cohort of patients who may benefit from particular treatments, and will assist real-time clinical decision-making to maximize therapeutic efficacy and minimize adverse effects. With the development of MS-based proteomics and NGS-based sequencing, a growing number of proteogenomic tools are being developed specifically to investigate cancer drug resistance.
Joint deep shape and appearance learning: application to optic pathway glioma segmentation
NASA Astrophysics Data System (ADS)
Mansoor, Awais; Li, Ien; Packer, Roger J.; Avery, Robert A.; Linguraru, Marius George
2017-03-01
Automated tissue characterization is one of the major applications of computer-aided diagnosis systems. Deep learning techniques have recently demonstrated impressive performance for the image patch-based tissue characterization. However, existing patch-based tissue classification techniques struggle to exploit the useful shape information. Local and global shape knowledge such as the regional boundary changes, diameter, and volumetrics can be useful in classifying the tissues especially in scenarios where the appearance signature does not provide significant classification information. In this work, we present a deep neural network-based method for the automated segmentation of the tumors referred to as optic pathway gliomas (OPG) located within the anterior visual pathway (AVP; optic nerve, chiasm or tracts) using joint shape and appearance learning. Voxel intensity values of commonly used MRI sequences are generally not indicative of OPG. To be considered an OPG, current clinical practice dictates that some portion of AVP must demonstrate shape enlargement. The method proposed in this work integrates multiple sequence magnetic resonance image (T1, T2, and FLAIR) along with local boundary changes to train a deep neural network. For training and evaluation purposes, we used a dataset of multiple sequence MRI obtained from 20 subjects (10 controls, 10 NF1+OPG). To the best of our knowledge, this is the first deep representation learning-based approach designed to merge shape and multi-channel appearance data for the glioma detection. In our experiments, mean misclassification errors of 2.39% and 0.48% were observed respectively for glioma and control patches extracted from the AVP. Moreover, an overall Dice similarity coefficient of 0.87 ± 0.13 (0.93 ± 0.06 for healthy tissue, 0.78 ± 0.18 for glioma tissue) demonstrates the potential of the proposed method in the accurate localization and early detection of OPG.
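The Dice similarity coefficient quoted above is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal Python sketch follows; the function name `dice_coefficient` and the flat-list mask representation are illustrative assumptions, not the authors' code.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat sequences.

    Any truthy value counts as foreground. Two empty masks are treated
    as perfect agreement (1.0), which is a convention of this sketch.
    """
    pred = [bool(v) for v in pred]
    truth = [bool(v) for v in truth]
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

A 3D segmentation volume would be flattened to one value per voxel before being passed in.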
USDA-ARS's Scientific Manuscript database
Modern day genomics holds the promise of solving the complexities of basic plant sciences, and of catalyzing practical advances in plant breeding. While contiguous, "base perfect" deep sequencing is a key module of any genome project, recent advances in parallel next generation sequencing technologi...
Erwin, Douglas H
2017-10-13
Eric Davidson had a deep and abiding interest in the role developmental mechanisms played in generating evolutionary patterns documented in deep time, from the origin of the euechinoids to the processes responsible for the morphological architectures of major animal clades. Although not an evolutionary biologist, Davidson's interests long preceded the current excitement over comparative evolutionary developmental biology. Here I discuss three aspects at the intersection between his research and evolutionary patterns in deep time: First, understanding the mechanisms of body plan formation, particularly those associated with the early diversification of major metazoan clades. Second, a critique of early claims about ancestral metazoans based on the discoveries of highly conserved genes across bilaterian animals. Third, Davidson's own involvement in paleontology through a collaborative study of the fossil embryos from the Ediacaran Doushantuo Formation in south China.
NASA Astrophysics Data System (ADS)
Lobecker, E.; McKenna, L.; Sowers, D.; Elliott, K.; Kennedy, B.
2014-12-01
NOAA Ship Okeanos Explorer, the only U.S. federal vessel dedicated to global ocean exploration, made several important discoveries in U.S. waters of the North Atlantic Ocean and Gulf of Mexico during the 2014 field season. Based on input received from a broad group of marine scientists and resource managers, over 100,000 square kilometers of seafloor and associated water column were systematically explored using advanced mapping sonars. 39 ROV dives were conducted, leading to new discoveries that will further our understanding of biologic, geologic, and underwater-cultural heritage secrets hidden within the oceans. In the Atlantic, season highlights include completion of a multi-year submarine canyons mapping effort of the continental shelf break from North Carolina to the U.S.-Canada maritime border; new information on the ephemerality of recently discovered and geographically extensive cold water seeps; continued exploration of the New England Seamount chain; and mapping of two potential historically significant World War II wreck sites. In the Gulf of Mexico, season highlights include completion of a multi-year mapping effort of the West Florida Escarpment providing new insight into submarine landslides and detachment zones; the discovery of at least two asphalt volcanoes, or 'tar lilies'; range extensions of deep-sea corals; discovery of two potential new species of crinoids; identification of at least 300 potential cold water seeps; and ROV exploration of three historically significant 19th century shipwrecks. In both regions, high-resolution mapping led to new insight into the geological context in which deep sea corals develop, while ROV dives provided valuable observations of deep sea coral habitats and their associated organisms, and chemosynthetic habitats. All mapping and ROV data is freely available to the public in usable data formats and maintained in national geophysical and oceanographic data archives.
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements.
De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
2015-12-01
The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays. BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller. Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements
De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
2015-01-01
Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. Availability and implementation: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26254488
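As an illustration of what "words over the IUPAC alphabet" means for motif matching, here is a minimal Python sketch that locates occurrences of a degenerate motif in a DNA sequence. It is not BLSSpeller's implementation (which is in Java and uses MapReduce) and it does not compute the branch length score; the names `IUPAC`, `matches` and `find_motif` are hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif: str, word: str) -> bool:
    """True if every base of `word` is allowed by the IUPAC symbol at that position."""
    return len(motif) == len(word) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, word)
    )


def find_motif(motif: str, seq: str) -> list[int]:
    """Return 0-based start positions where `motif` matches `seq` (forward strand only)."""
    motif, seq = motif.upper(), seq.upper()
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if matches(motif, seq[i:i + k])]
```

For example, `find_motif("TGAY", "ATGACTGATT")` reports the two windows `TGAC` and `TGAT`. A comparative method would then ask in how many orthologous promoters such a motif occurs before scoring its conservation.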
Chambers, E Anne; Hebert, Paul D N
2016-01-01
High rates of species discovery and loss have led to the urgent need for more rapid assessment of species diversity in the herpetofauna. DNA barcoding allows for the preliminary identification of species based on sequence divergence. Prior DNA barcoding work on reptiles and amphibians has revealed higher biodiversity counts than previously estimated due to cases of cryptic and undiscovered species. Past studies have provided DNA barcodes for just 14% of the North American herpetofauna, revealing the need for expanded coverage. This study extends the DNA barcode reference library for North American herpetofauna, assesses the utility of this approach in aiding species delimitation, and examines the correspondence between current species boundaries and sequence clusters designated by the BIN system. Sequences were obtained from 730 specimens, representing 274 species (43%) from the North American herpetofauna. Mean intraspecific divergences were 1% and 3%, while average congeneric sequence divergences were 16% and 14% in amphibians and reptiles, respectively. BIN assignments corresponded with current species boundaries in 79% of amphibians, 100% of turtles, and 60% of squamates. Deep divergences (>2%) were noted in 35% of squamate and 16% of amphibian species, and low divergences (<2%) occurred in 12% of reptiles and 23% of amphibians, patterns reflected in BIN assignments. Sequence recovery declined with specimen age, and variation in recovery success was noted among collections. Within collections, barcodes effectively flagged seven mislabeled tissues, and barcode fragments were recovered from five formalin-fixed specimens. This study demonstrates that DNA barcodes can effectively flag errors in museum collections, while BIN splits and merges reveal taxa belonging to deeply diverged or hybridizing lineages. This study is the first effort to compile a reference library of DNA barcodes for herpetofauna on a continental scale.
Chambers, E. Anne; Hebert, Paul D. N.
2016-01-01
Background High rates of species discovery and loss have led to the urgent need for more rapid assessment of species diversity in the herpetofauna. DNA barcoding allows for the preliminary identification of species based on sequence divergence. Prior DNA barcoding work on reptiles and amphibians has revealed higher biodiversity counts than previously estimated due to cases of cryptic and undiscovered species. Past studies have provided DNA barcodes for just 14% of the North American herpetofauna, revealing the need for expanded coverage. Methodology/Principal Findings This study extends the DNA barcode reference library for North American herpetofauna, assesses the utility of this approach in aiding species delimitation, and examines the correspondence between current species boundaries and sequence clusters designated by the BIN system. Sequences were obtained from 730 specimens, representing 274 species (43%) from the North American herpetofauna. Mean intraspecific divergences were 1% and 3%, while average congeneric sequence divergences were 16% and 14% in amphibians and reptiles, respectively. BIN assignments corresponded with current species boundaries in 79% of amphibians, 100% of turtles, and 60% of squamates. Deep divergences (>2%) were noted in 35% of squamate and 16% of amphibian species, and low divergences (<2%) occurred in 12% of reptiles and 23% of amphibians, patterns reflected in BIN assignments. Sequence recovery declined with specimen age, and variation in recovery success was noted among collections. Within collections, barcodes effectively flagged seven mislabeled tissues, and barcode fragments were recovered from five formalin-fixed specimens. Conclusions/Significance This study demonstrates that DNA barcodes can effectively flag errors in museum collections, while BIN splits and merges reveal taxa belonging to deeply diverged or hybridizing lineages. 
This study is the first effort to compile a reference library of DNA barcodes for herpetofauna on a continental scale. PMID:27116180
The chronostratigraphy of the Haua Fteah cave (Cyrenaica, northeast Libya).
Douka, Katerina; Jacobs, Zenobia; Lane, Christine; Grün, Rainer; Farr, Lucy; Hunt, Chris; Inglis, Robyn H; Reynolds, Tim; Albert, Paul; Aubert, Maxime; Cullen, Victoria; Hill, Evan; Kinsley, Leslie; Roberts, Richard G; Tomlinson, Emma L; Wulf, Sabine; Barker, Graeme
2014-01-01
The 1950s excavations by Charles McBurney in the Haua Fteah, a large karstic cave on the coast of northeast Libya, revealed a deep sequence of human occupation. Most subsequent research on North African prehistory refers to his discoveries and interpretations, but the chronology of its archaeological and geological sequences has been based on very early age determinations. This paper reports on the initial results of a comprehensive multi-method dating program undertaken as part of new work at the site, involving radiocarbon dating of charcoal, land snails and marine shell, cryptotephra investigations, optically stimulated luminescence (OSL) dating of sediments, and electron spin resonance (ESR) dating of tooth enamel. The dating samples were collected from the newly exposed and cleaned faces of the upper 7.5 m of the ∼14.0 m-deep McBurney trench, which contain six of the seven major cultural phases that he identified. Despite problems of sediment transport and reworking, using a Bayesian statistical model the new dating program establishes a robust framework for the five major lithostratigraphic units identified in the stratigraphic succession, and for the major cultural units. The age of two anatomically modern human mandibles found by McBurney in Layer XXXIII near the base of his Levalloiso-Mousterian phase can now be estimated to between 73 and 65 ka (thousands of years ago) at the 95.4% confidence level, within Marine Isotope Stage (MIS) 4. McBurney's Layer XXV, associated with Upper Palaeolithic Dabban blade industries, has a clear stratigraphic relationship with Campanian Ignimbrite tephra. Microlithic Oranian technologies developed following the climax of the Last Glacial Maximum and the more microlithic Capsian in the Younger Dryas. Neolithic pottery and perhaps domestic livestock were used in the cave from the mid Holocene but there is no certain evidence for plant cultivation until the Graeco-Roman period. Copyright © 2013 Elsevier Ltd. 
All rights reserved.
Patterns of homoeologous gene expression shown by RNA sequencing in hexaploid bread wheat
2014-01-01
Background Bread wheat (Triticum aestivum) has a large, complex and hexaploid genome consisting of A, B and D homoeologous chromosome sets. Therefore each wheat gene potentially exists as a trio of A, B and D homoeoloci, each of which may contribute differentially to wheat phenotypes. We describe a novel approach combining wheat cytogenetic resources (chromosome substitution ‘nullisomic-tetrasomic’ lines) with next generation deep sequencing of gene transcripts (RNA-Seq), to directly and accurately identify homoeologue-specific single nucleotide variants and quantify the relative contribution of individual homoeoloci to gene expression. Results We discover, based on a sample comprising ~5-10% of the total wheat gene content, that at least 45% of wheat genes are expressed from all three distinct homoeoloci. Most of these genes show strikingly biased expression patterns in which expression is dominated by a single homoeolocus. The remaining ~55% of wheat genes are expressed from either one or two homoeoloci only, through a combination of extensive transcriptional silencing and homoeolocus loss. Conclusions We conclude that wheat is tending towards functional diploidy, through a variety of mechanisms causing single homoeoloci to become the predominant source of gene transcripts. This discovery has profound consequences for wheat breeding and our understanding of wheat evolution. PMID:24726045
deepTools2: a next generation web server for deep-sequencing data analysis.
Ramírez, Fidel; Ryan, Devon P; Grüning, Björn; Bhardwaj, Vivek; Kilpert, Fabian; Richter, Andreas S; Heyne, Steffen; Dündar, Friederike; Manke, Thomas
2016-07-08
We present an update to our Galaxy-based web server for processing and visualizing deeply sequenced data. Its core tool set, deepTools, allows users to perform complete bioinformatic workflows ranging from quality controls and normalizations of aligned reads to integrative analyses, including clustering and visualization approaches. Since we first described our deepTools Galaxy server in 2014, we have implemented new solutions for many requests from the community and our users. Here, we introduce significant enhancements and new tools to further improve data visualization and interpretation. deepTools continues to be open to all users and freely available as a web service at deeptools.ie-freiburg.mpg.de. The new deepTools2 suite can be easily deployed within any Galaxy framework via the toolshed repository, and we also provide source code for command line usage under Linux and Mac OS X. A public and documented API for access to deepTools functionality is also available. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Active subsurface cellular function in the Baltic Sea Basin, IODP Exp 347
NASA Astrophysics Data System (ADS)
Reese, B. K.; Zinke, L. A.; Bird, J. T.; Lloyd, K. G.; Marshall, I.; Amend, J.; Jørgensen, B. B.
2016-12-01
The Baltic Sea Basin is a unique depositional setting that has experienced periods of glaciation and deglaciation as a result of global temperature fluctuations over the course of several hundred thousand years. This has resulted in laminated sediments formed during periods with strong permanent salinity stratification. The high sedimentation rates (100-500 cm/1000 y) make this an ideal setting to understand the microbial structure of a deep biosphere community in a high-organic matter environment. The responses of deep sediment microbial communities to variations in conditions during and after deposition are poorly understood. Samples were collected through scientific drilling during the International Ocean Discovery Program (IODP) Expedition 347 on board the Greatship Manisha, September-November 2013. We examined the active microbial community structure using the 16S rRNA gene transcript and active functional genes through metatranscriptome sequencing. Major biogeochemical shifts have been observed in response to the depositional history between the limnic, brackish, and marine phases. The microbial community structure in the BSB is diverse and reflective of the unique changes in the geochemical profile. These data further define the existence of life in the deep subsurface and the survival mechanisms required for this extreme environment.
Anonymization of electronic medical records for validating genome-wide association studies
Loukides, Grigorios; Gkoulalas-Divanis, Aris; Malin, Bradley
2010-01-01
Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks. PMID:20385806
Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity.
Kim, Hui Kwon; Min, Seonwoo; Song, Myungjae; Jung, Soobin; Choi, Jae Woo; Kim, Younggwang; Lee, Sangeun; Yoon, Sungroh; Kim, Hyongbum Henry
2018-03-01
We present two algorithms to predict the activity of AsCpf1 guide RNAs. Indel frequencies for 15,000 target sequences were used in a deep-learning framework based on a convolutional neural network to train Seq-deepCpf1. We then incorporated chromatin accessibility information to create the better-performing DeepCpf1 algorithm for cell lines for which such information is available and show that both algorithms outperform previous machine learning algorithms on our own and published data sets.
Overview of Petroleum Settings in Deep Waters of the Brazilian South Atlantic Margin
NASA Astrophysics Data System (ADS)
Anjos, Sylvia; Penteado, Henrique; Oliveira, Carlos M. M.
2015-04-01
The objective of this work is to present an overall view of the tectonic and stratigraphic evolution of the western South Atlantic with focus on the Brazilian marginal basins. It includes the structural evolution, stratigraphic sequences, depositional environments and petroleum systems model along the Brazilian marginal basins. In addition, a description of the main petroleum provinces and selected plays including the pre-salt carbonates and post-salt turbidite reservoirs is presented. Source-rock ages and types, trap styles, main reservoir characteristics, petroleum compositions, and recent exploration results are discussed. Finally, an outlook and general assessment of the impact of the large pre-salt discoveries on the present-day and future production curves are given.
USDA-ARS's Scientific Manuscript database
In recent years, next generation sequencing (NGS) based bulked segregant analysis (BSA) has become a powerful approach for allele discovery in non-model plant species. However, challenges remain, particularly for out-crossing species with complex genomes. Here, the genetic control of a weeping bran...
Scaling up discovery of hidden diversity in fungi: impacts of barcoding approaches.
Yahr, Rebecca; Schoch, Conrad L; Dentinger, Bryn T M
2016-09-05
The fungal kingdom is a hyperdiverse group of multicellular eukaryotes with profound impacts on human society and ecosystem function. The challenge of documenting and describing fungal diversity is exacerbated by their typically cryptic nature, their ability to produce seemingly unrelated morphologies from a single individual and their similarity in appearance to distantly related taxa. This multiplicity of hurdles resulted in the early adoption of DNA-based comparisons to study fungal diversity, including linking curated DNA sequence data to expertly identified voucher specimens. DNA-barcoding approaches in fungi were first applied in specimen-based studies for identification and discovery of taxonomic diversity, but are now widely deployed for community characterization based on sequencing of environmental samples. Collectively, fungal barcoding approaches have yielded important advances across biological scales and research applications, from taxonomic, ecological, industrial and health perspectives. A major outstanding issue is the growing problem of 'sequences without names' that are somewhat uncoupled from the traditional framework of fungal classification based on morphology and preserved specimens. This review summarizes some of the most significant impacts of fungal barcoding, its limitations, and progress towards the challenge of effective utilization of the exponentially growing volume of data gathered from high-throughput sequencing technologies. This article is part of the themed issue 'From DNA barcodes to biomes'. © 2016 The Authors.
Overview Article: Identifying transcriptional cis-regulatory modules in animal genomes
Suryamohan, Kushal; Halfon, Marc S.
2014-01-01
Gene expression is regulated through the activity of transcription factors and chromatin modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily-identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods has led to an explosion of both computational and empirical methods for CRM discovery in model and non-model organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against transcription factors or histone post-translational modifications, identification of nucleosome-depleted “open” chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted transcription factor binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed. PMID:25704908
Harris, Katherine E; Aldred, Shelley Force; Davison, Laura M; Ogana, Heather Anne N; Boudreau, Andrew; Brüggemann, Marianne; Osborn, Michael; Ma, Biao; Buelow, Benjamin; Clarke, Starlynn C; Dang, Kevin H; Iyer, Suhasini; Jorgensen, Brett; Pham, Duy T; Pratap, Payal P; Rangaswamy, Udaya S; Schellenberger, Ute; van Schooten, Wim C; Ugamraj, Harshad S; Vafa, Omid; Buelow, Roland; Trinklein, Nathan D
2018-01-01
We created a novel transgenic rat that expresses human antibodies comprising a diverse repertoire of heavy chains with a single common rearranged kappa light chain (IgKV3-15-JK1). This fixed light chain animal, called OmniFlic, presents a unique system for human therapeutic antibody discovery and a model to study heavy chain repertoire diversity in the context of a constant light chain. The purpose of this study was to analyze heavy chain variable gene usage, clonotype diversity, and to describe the sequence characteristics of antigen-specific monoclonal antibodies (mAbs) isolated from immunized OmniFlic animals. Using next-generation sequencing antibody repertoire analysis, we measured heavy chain variable gene usage and the diversity of clonotypes present in the lymph node germinal centers of 75 OmniFlic rats immunized with 9 different protein antigens. Furthermore, we expressed 2,560 unique heavy chain sequences sampled from a diverse set of clonotypes as fixed light chain antibody proteins and measured their binding to antigen by ELISA. Finally, we measured patterns and overall levels of somatic hypermutation in the full B-cell repertoire and in the 2,560 mAbs tested for binding. The results demonstrate that OmniFlic animals produce an abundance of antigen-specific antibodies with heavy chain clonotype diversity that is similar to what has been described with unrestricted light chain use in mammals. In addition, we show that sequence-based discovery is a highly effective and efficient way to identify a large number of diverse monoclonal antibodies to a protein target of interest.
Webster, Nicole S; Taylor, Michael W; Behnam, Faris; Lücker, Sebastian; Rattei, Thomas; Whalan, Stephen; Horn, Matthias; Wagner, Michael
2010-08-01
Marine sponges contain complex bacterial communities of considerable ecological and biotechnological importance, with many of these organisms postulated to be specific to sponge hosts. Testing this hypothesis in light of the recent discovery of the rare microbial biosphere, we investigated three Australian sponges by massively parallel 16S rRNA gene tag pyrosequencing. Here we show bacterial diversity that is unparalleled in an invertebrate host, with more than 250,000 sponge-derived sequence tags being assigned to 23 bacterial phyla and revealing up to 2996 operational taxonomic units (95% sequence similarity) per sponge species. Of the 33 previously described 'sponge-specific' clusters that were detected in this study, 48% were found exclusively in adults and larvae - implying vertical transmission of these groups. The remaining taxa, including 'Poribacteria', were also found at very low abundance among the 135,000 tags retrieved from surrounding seawater. Thus, members of the rare seawater biosphere may serve as seed organisms for widely occurring symbiont populations in sponges and their host association might have evolved much more recently than previously thought. © 2009 Society for Applied Microbiology and Blackwell Publishing Ltd.
Webster, Nicole S; Taylor, Michael W; Behnam, Faris; Lücker, Sebastian; Rattei, Thomas; Whalan, Stephen; Horn, Matthias; Wagner, Michael
2010-01-01
Marine sponges contain complex bacterial communities of considerable ecological and biotechnological importance, with many of these organisms postulated to be specific to sponge hosts. Testing this hypothesis in light of the recent discovery of the rare microbial biosphere, we investigated three Australian sponges by massively parallel 16S rRNA gene tag pyrosequencing. Here we show bacterial diversity that is unparalleled in an invertebrate host, with more than 250 000 sponge-derived sequence tags being assigned to 23 bacterial phyla and revealing up to 2996 operational taxonomic units (95% sequence similarity) per sponge species. Of the 33 previously described ‘sponge-specific’ clusters that were detected in this study, 48% were found exclusively in adults and larvae – implying vertical transmission of these groups. The remaining taxa, including ‘Poribacteria’, were also found at very low abundance among the 135 000 tags retrieved from surrounding seawater. Thus, members of the rare seawater biosphere may serve as seed organisms for widely occurring symbiont populations in sponges and their host association might have evolved much more recently than previously thought. PMID:21966903
Jones, Darryl R; Thomas, Dallas; Alger, Nicholas; Ghavidel, Ata; Inglis, G Douglas; Abbott, D Wade
2018-01-01
Deposition of new genetic sequences in online databases is expanding at an unprecedented rate. As a result, sequence identification continues to outpace functional characterization of carbohydrate active enzymes (CAZymes). In this paradigm, the discovery of enzymes with novel functions is often hindered by high volumes of uncharacterized sequences, particularly when the enzyme sequence belongs to a family that exhibits diverse functional specificities (i.e., polyspecificity). Therefore, to direct sequence-based discovery and characterization of new enzyme activities we have developed an automated in silico pipeline entitled: Sequence Analysis and Clustering of CarboHydrate Active enzymes for Rapid Informed prediction of Specificity (SACCHARIS). This pipeline streamlines the selection of uncharacterized sequences for discovery of new CAZyme or CBM specificity from families currently maintained on the CAZy website or within user-defined datasets. SACCHARIS was used to generate a phylogenetic tree of GH43, a CAZyme family with defined subfamily designations. This analysis confirmed that large datasets can be organized into sequence clusters of manageable sizes that possess related functions. Seeding this tree with a GH43 sequence from Bacteroides dorei DSM 17855 (BdGH43b) revealed that it partitioned as a single sequence within the tree. This pattern was consistent with it possessing a unique enzyme activity for GH43, as BdGH43b is the first α-glucanase described for this family. The capacity of SACCHARIS to extract and cluster characterized carbohydrate binding module sequences was demonstrated using family 6 CBMs (i.e., CBM6s). This CBM family displays a polyspecific ligand binding profile and contains many structurally determined members. Using SACCHARIS to identify a cluster of divergent sequences, a CBM6 sequence from a unique clade was demonstrated to bind yeast mannan, which represents the first description of an α-mannan binding CBM.
Additionally, we have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate "CAZome fingerprints", which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.
Chen, Muyan; Storey, Kenneth B
2014-02-01
The sea cucumber Apostichopus japonicus withstands high water temperatures in the summer by suppressing its metabolic rate and entering a state of aestivation. We hypothesized that changes in the expression of miRNAs could provide important post-transcriptional regulation of gene expression during hypometabolism via control over mRNA translation. The present study analyzed profiles of miRNA expression in the sea cucumber respiratory tree using Solexa deep sequencing technology. We identified 279 sea cucumber miRNAs, including 15 novel miRNAs specific to sea cucumber. Animals sampled during deep aestivation (DA; after at least 15 days of continuous torpor) were compared with animals from a non-aestivation (NA) state (animals that had passed through aestivation and returned to an active state). We identified 30 differentially expressed miRNAs ([RPM (reads per million) >10, |FC| (|fold change|)≥1, FDR (false discovery rate)<0.01]) during aestivation, which were validated by two other miRNA profiling methods: miRNA microarray and real-time PCR. Among the most prominent miRNA species, miR-124, miR-124-3p, miR-79, miR-9 and miR-2010 were significantly over-expressed during deep aestivation compared with non-aestivation animals, suggesting that these miRNAs may play important roles in metabolic rate suppression during aestivation. High-throughput sequencing data and microarray data have been submitted to the GEO database with accession number: 16902695. Copyright © 2014 Elsevier B.V. All rights reserved.
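The selection criteria quoted above (RPM > 10, |FC| ≥ 1, FDR < 0.01) can be sketched as a simple filter. This is a minimal sketch under stated assumptions, not the authors' pipeline: it reads FC as a log2 fold change, applies the RPM cut-off to the larger of the two conditions, adds a pseudocount of 1, and takes per-miRNA p-values as given. The Benjamini-Hochberg step and all function names are illustrative choices.

```python
import math

def reads_per_million(counts):
    """Scale raw read counts so the library sums to one million."""
    total = sum(counts.values())
    return {name: c * 1_000_000 / total for name, c in counts.items()}

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_mirnas(rpm_da, rpm_na, pvalues,
                  min_rpm=10.0, min_abs_lfc=1.0, max_fdr=0.01):
    """Keep miRNAs passing the abundance, fold-change and FDR cut-offs."""
    names = sorted(rpm_da)
    fdr = dict(zip(names, benjamini_hochberg([pvalues[n] for n in names])))
    selected = []
    for name in names:
        da, na = rpm_da[name], rpm_na[name]
        if max(da, na) <= min_rpm:
            continue  # assumption: abundance filter on the higher condition
        lfc = math.log2((da + 1.0) / (na + 1.0))  # pseudocount avoids log(0)
        if abs(lfc) >= min_abs_lfc and fdr[name] < max_fdr:
            selected.append((name, lfc))
    return selected
```

Here `rpm_da` and `rpm_na` would hold normalized counts for deep-aestivation and non-aestivation animals; the hypothetical `pvalues` mapping would come from whatever statistical test is used upstream.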
Brain Tumor Segmentation Using Deep Belief Networks and Pathological Knowledge.
Zhan, Tianming; Chen, Yi; Hong, Xunning; Lu, Zhenyu; Chen, Yunjie
2017-01-01
In this paper, we propose an automatic brain tumor segmentation method based on Deep Belief Networks (DBNs) and pathological knowledge. The proposed method targets gliomas (both low and high grade) in multi-sequence magnetic resonance images (MRIs). First, a novel deep architecture is proposed that combines multi-sequence intensity feature extraction with classification to obtain the classification probabilities of each voxel. Then, graph cut based optimization is executed on the classification probabilities to strengthen the spatial relationships of voxels. Finally, pathological knowledge of gliomas is applied to remove some false positives. Our method was validated on the Brain Tumor Segmentation Challenge 2012 and 2013 databases (BRATS 2012, 2013). The segmentation results demonstrate that our proposal provides a solution competitive with state-of-the-art methods. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Ancient orphan crop joins modern era: gene-based SNP discovery and mapping in lentil.
Sharpe, Andrew G; Ramsay, Larissa; Sanderson, Lacey-Anne; Fedoruk, Michael J; Clarke, Wayne E; Li, Rong; Kagale, Sateesh; Vijayan, Perumal; Vandenberg, Albert; Bett, Kirstin E
2013-03-18
The genus Lens comprises a range of closely related species within the galegoid clade of the Papilionoideae family. The clade includes other important crops (e.g. chickpea and pea) as well as a sequenced model legume (Medicago truncatula). Lentil is a global food crop increasing in importance in the Indian sub-continent and elsewhere due to its nutritional value and quick cooking time. Despite this importance there has been a dearth of genetic and genomic resources for the crop and this has limited the application of marker-assisted selection strategies in breeding. We describe here the development of a deep and diverse transcriptome resource for lentil using next generation sequencing technology. The generation of data in multiple cultivated (L. culinaris) and wild (L. ervoides) genotypes together with the utilization of a bioinformatics workflow enabled the identification of a large collection of SNPs and the subsequent development of a genotyping platform that was used to establish the first comprehensive genetic map of the L. culinaris genome. Extensive collinearity with M. truncatula was evident on the basis of sequence homology between mapped markers and the model genome and large translocations and inversions relative to M. truncatula were identified. An estimate for the time divergence of L. culinaris from L. ervoides and of both from M. truncatula was also calculated. The availability of the genomic and derived molecular marker resources presented here will help change lentil breeding strategies and lead to increased genetic gain in the future.
Wang, Duolin; Zeng, Shuai; Xu, Chunhui; Qiu, Wangren; Liang, Yanchun; Joshi, Trupti; Xu, Dong
2017-12-15
Computational methods for phosphorylation site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning as the cutting-edge machine learning method has the ability to automatically discover complex representations of phosphorylation patterns from the raw sequences, and hence it provides a powerful tool for improvement of phosphorylation site prediction. We present MusiteDeep, the first deep-learning framework for predicting general and kinase-specific phosphorylation sites. MusiteDeep takes raw sequence data as input and uses convolutional neural networks with a novel two-dimensional attention mechanism. It achieves over a 50% relative improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtains competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. MusiteDeep is provided as an open-source tool available at https://github.com/duolinwang/MusiteDeep. xudong@missouri.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Deep Sequencing to Identify the Causes of Viral Encephalitis
Chan, Benjamin K.; Wilson, Theodore; Fischer, Kael F.; Kriesel, John D.
2014-01-01
Deep sequencing allows for a rapid, accurate characterization of microbial DNA and RNA sequences in many types of samples. Deep sequencing (also called next generation sequencing or NGS) is being developed to assist with the diagnosis of a wide variety of infectious diseases. In this study, seven frozen brain samples from deceased subjects with recent encephalitis were investigated. RNA from each sample was extracted, randomly reverse transcribed and sequenced. The sequence analysis was performed in a blinded fashion and confirmed with pathogen-specific PCR. This analysis successfully identified measles virus sequences in two brain samples and herpes simplex virus type-1 sequences in three brain samples. No pathogen was identified in the other two brain specimens. These results were concordant with pathogen-specific PCR and partially concordant with prior neuropathological examinations, demonstrating that deep sequencing can accurately identify viral infections in frozen brain tissue. PMID:24699691
Benthic protists and fungi of Mediterranean deep hypsersaline anoxic basin redoxcline sediments.
Bernhard, Joan M; Kormas, Konstantinos; Pachiadaki, Maria G; Rocke, Emma; Beaudoin, David J; Morrison, Colin; Visscher, Pieter T; Cobban, Alec; Starczak, Victoria R; Edgcomb, Virginia P
2014-01-01
Some of the most extreme marine habitats known are the Mediterranean deep hypersaline anoxic basins (DHABs; water depth ∼3500 m). Brines of DHABs are nearly saturated with salt, leading many to suspect they are uninhabitable for eukaryotes. While diverse bacterial and protistan communities are reported from some DHAB water-column haloclines and brines, the existence and activity of benthic DHAB protists have rarely been explored. Here, we report findings regarding protists and fungi recovered from sediments of three DHAB (Discovery, Urania, L' Atalante) haloclines, and compare these to communities from sediments underlying normoxic waters of typical Mediterranean salinity. Halocline sediments, where the redoxcline impinges the seafloor, were studied from all three DHABs. Microscopic cell counts suggested that halocline sediments supported denser protist populations than those in adjacent control sediments. Pyrosequencing analysis based on ribosomal RNA detected eukaryotic ribotypes in the halocline sediments from each of the three DHABs, most of which were fungi. Sequences affiliated with Ustilaginomycotina Basidiomycota were the most abundant eukaryotic signatures detected. Benthic communities in these DHABs appeared to differ, as expected, due to differing brine chemistries. Microscopy indicated that only a low proportion of protists appeared to bear associated putative symbionts. In a considerable number of cases, when prokaryotes were associated with a protist, DAPI staining did not reveal presence of any nuclei, suggesting that at least some protists were carcasses inhabited by prokaryotic scavengers.
Benthic protists and fungi of Mediterranean deep hypsersaline anoxic basin redoxcline sediments
Bernhard, Joan M.; Kormas, Konstantinos; Pachiadaki, Maria G.; Rocke, Emma; Beaudoin, David J.; Morrison, Colin; Visscher, Pieter T.; Cobban, Alec; Starczak, Victoria R.; Edgcomb, Virginia P.
2014-01-01
Some of the most extreme marine habitats known are the Mediterranean deep hypersaline anoxic basins (DHABs; water depth ∼3500 m). Brines of DHABs are nearly saturated with salt, leading many to suspect they are uninhabitable for eukaryotes. While diverse bacterial and protistan communities are reported from some DHAB water-column haloclines and brines, the existence and activity of benthic DHAB protists have rarely been explored. Here, we report findings regarding protists and fungi recovered from sediments of three DHAB (Discovery, Urania, L’ Atalante) haloclines, and compare these to communities from sediments underlying normoxic waters of typical Mediterranean salinity. Halocline sediments, where the redoxcline impinges the seafloor, were studied from all three DHABs. Microscopic cell counts suggested that halocline sediments supported denser protist populations than those in adjacent control sediments. Pyrosequencing analysis based on ribosomal RNA detected eukaryotic ribotypes in the halocline sediments from each of the three DHABs, most of which were fungi. Sequences affiliated with Ustilaginomycotina Basidiomycota were the most abundant eukaryotic signatures detected. Benthic communities in these DHABs appeared to differ, as expected, due to differing brine chemistries. Microscopy indicated that only a low proportion of protists appeared to bear associated putative symbionts. In a considerable number of cases, when prokaryotes were associated with a protist, DAPI staining did not reveal presence of any nuclei, suggesting that at least some protists were carcasses inhabited by prokaryotic scavengers. PMID:25452749
NASA Astrophysics Data System (ADS)
Sato, K. Y.; Tomko, D. L.; Levine, H. G.; Quincy, C. D.; Rayl, N. A.; Sowa, M. B.; Taylor, E. M.; Sun, S. C.; Kundrot, C. E.
2018-02-01
Model organisms are foundational for conducting physiological and systems biology research to define how life responds to the deep space environment. The organisms, areas of research, and Deep Space Gateway capabilities needed will be presented.
Natural Products from Deep-Sea-Derived Fungi - A New Source of Novel Bioactive Compounds?
Daletos, Georgios; Ebrahim, Weaam; Ancheeva, Elena; El-Neketi, Mona; Song, Weiguo; Lin, Wenhan; Proksch, Peter
2018-01-01
Over the last two decades, deep-sea-derived fungi have been considered a new source of pharmacologically active secondary metabolites for drug discovery, mainly based on the underlying assumption that the uniqueness of the deep sea will give rise to equally unprecedented natural products. Indeed, up to now over 200 new metabolites have been identified from deep-sea fungi, which supports this assumption. This review summarizes the new and/or bioactive compounds reported from deep-sea-derived fungi in the last six years (2010 - October 2016) and critically evaluates whether the data published so far really support the notion that these fungi are a promising source of new bioactive chemical entities. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Exploring Dance Movement Data Using Sequence Alignment Methods
Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico
2015-01-01
Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers. PMID:26181435
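The three steps described above (symbol sequences, pairwise alignment, hierarchical grouping) can be sketched as follows. This is a minimal sketch under stated assumptions: plain edit distance stands in for the alignment score, average linkage is used for merging, and the short strings of "-", "0", "+" symbols are invented stand-ins for QTC relation sequences rather than real dance data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def agglomerate(sequences, n_clusters):
    """Average-linkage agglomerative clustering on edit distances."""
    n = len(sequences)
    dist = {(i, j): edit_distance(sequences[i], sequences[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical relation sequences for four movement recordings.
recordings = ["--00++", "--0+++", "++00--", "++0---"]
groups = agglomerate(recordings, n_clusters=2)
```

A real analysis would likely use a substitution matrix that reflects how close two qualitative relations are, rather than the uniform cost used here.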
Lu, Zen H; Brown, Alexander; Wilson, Alison D; Calvert, Jay G; Balasch, Monica; Fuentes-Utrilla, Pablo; Loecherbach, Julia; Turner, Frances; Talbot, Richard; Archibald, Alan L; Ait-Ali, Tahar
2014-03-04
Porcine Reproductive and Respiratory Syndrome (PRRS) is a disease of major economic impact worldwide. The etiologic agent of this disease is the PRRS virus (PRRSV). Increasing evidence suggests that microevolution within a coexisting quasispecies population can give rise to high sequence heterogeneity in PRRSV. We developed a pipeline based on the ultra-deep next generation sequencing approach to first construct the complete genome of a European PRRSV, strain Olot/91, cultured on macrophages and then capture the rare variants representative of the mixed quasispecies population. Olot/91 differs from the reference Lelystad strain by about 5% and a total of 88 variants, with frequencies as low as 1%, were detected in the mixed population. These variants included 16 non-synonymous variants concentrated in the genes encoding structural and nonstructural proteins, including Glycoprotein 2a and 5. Using an ultra-deep sequencing methodology, the complete genome of Olot/91 was constructed without any prior knowledge of the sequence. Rare variants that constitute minor fractions of the heterogeneous PRRSV population could successfully be detected to allow further exploration of microevolutionary events.
Computational methods in drug discovery
Leelananda, Sumudu P
2016-01-01
The process for drug discovery and development is challenging, time consuming and expensive. Computer-aided drug discovery (CADD) tools can act as a virtual shortcut, assisting in the expedition of this long process and potentially reducing the cost of research and development. Today CADD has become an effective and indispensable tool in therapeutic development. The human genome project has made available a substantial amount of sequence data that can be used in various drug discovery projects. Additionally, increasing knowledge of biological structures, as well as increasing computer power have made it possible to use computational methods effectively in various phases of the drug discovery and development pipeline. The importance of in silico tools is greater than ever before and has advanced pharmaceutical research. Here we present an overview of computational methods used in different facets of drug discovery and highlight some of the recent successes. In this review, both structure-based and ligand-based drug discovery methods are discussed. Advances in virtual high-throughput screening, protein structure prediction methods, protein–ligand docking, pharmacophore modeling and QSAR techniques are reviewed. PMID:28144341
Computational methods in drug discovery.
Leelananda, Sumudu P; Lindert, Steffen
2016-01-01
The process for drug discovery and development is challenging, time consuming and expensive. Computer-aided drug discovery (CADD) tools can act as a virtual shortcut, assisting in the expedition of this long process and potentially reducing the cost of research and development. Today CADD has become an effective and indispensable tool in therapeutic development. The human genome project has made available a substantial amount of sequence data that can be used in various drug discovery projects. Additionally, increasing knowledge of biological structures, as well as increasing computer power have made it possible to use computational methods effectively in various phases of the drug discovery and development pipeline. The importance of in silico tools is greater than ever before and has advanced pharmaceutical research. Here we present an overview of computational methods used in different facets of drug discovery and highlight some of the recent successes. In this review, both structure-based and ligand-based drug discovery methods are discussed. Advances in virtual high-throughput screening, protein structure prediction methods, protein-ligand docking, pharmacophore modeling and QSAR techniques are reviewed.
A non-parametric peak calling algorithm for DamID-Seq.
Li, Renhua; Hempel, Leonie U; Jiang, Tingbo
2015-01-01
Protein-DNA interactions play a significant role in gene regulation and expression. In order to identify transcription factor binding sites (TFBS) of doublesex (DSX), an important transcription factor in sex determination, we applied the DNA adenine methylation identification (DamID) technology to the fat body tissue of Drosophila, followed by deep sequencing (DamID-Seq). One feature of DamID-Seq data is that induced adenine methylation signals are not assured to be symmetrically distributed at TFBS, which renders the existing peak calling algorithms for ChIP-Seq, including SPP and MACS, inappropriate for DamID-Seq data. This challenged us to develop a new algorithm for peak calling. A challenge in peak calling based on sequence data is estimating the averaged behavior of background signals. We applied a bootstrap resampling method to short sequence reads in the control (Dam only). After data quality check and mapping reads to a reference genome, the peak calling procedure comprises the following steps: 1) reads resampling; 2) reads scaling (normalization) and computing signal-to-noise fold changes; 3) filtering; 4) calling peaks based on a statistically significant threshold. This is a non-parametric method for peak calling (NPPC). We also used irreproducible discovery rate (IDR) analysis, as well as ChIP-Seq data to compare the peaks called by the NPPC. We identified approximately 6,000 peaks for DSX, which point to 1,225 genes related to the fat body tissue difference between female and male Drosophila. Statistical evidence from IDR analysis indicated that these peaks are reproducible across biological replicates. In addition, these peaks are comparable to those identified by use of ChIP-Seq on S2 cells, in terms of peak number, location, and peak width.
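The four steps listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the published NPPC implementation: reads are assumed to be already mapped and reduced to genomic bin indices, scaling is by library size, the threshold is an empirical quantile of bootstrap fold changes, and every name and default value is hypothetical.

```python
import random

def bin_counts(reads, n_bins):
    """Count reads per genomic bin; each read is given as a bin index."""
    counts = [0] * n_bins
    for b in reads:
        counts[b] += 1
    return counts

def call_peaks(sample_reads, control_reads, n_bins,
               n_boot=200, min_reads=5, alpha=0.01, pseudo=1.0, seed=0):
    rng = random.Random(seed)
    # 1) Resample the control (Dam-only) reads with replacement.
    replicates = [
        bin_counts(rng.choices(control_reads, k=len(control_reads)), n_bins)
        for _ in range(n_boot)
    ]
    background = [sum(rep[i] for rep in replicates) / n_boot
                  for i in range(n_bins)]
    # 2) Scale the control to the sample library size, then compute
    #    signal-to-noise fold changes with a pseudocount.
    scale = len(sample_reads) / len(control_reads)

    def fold(counts, factor):
        return [(c * factor + pseudo) / (bg * scale + pseudo)
                for c, bg in zip(counts, background)]

    sample = bin_counts(sample_reads, n_bins)
    sample_fc = fold(sample, 1.0)
    # Empirical null: fold changes of each bootstrap replicate
    # against the averaged background.
    null_fc = sorted(fc for rep in replicates for fc in fold(rep, scale))
    threshold = null_fc[min(len(null_fc) - 1, int((1 - alpha) * len(null_fc)))]
    # 3) Filter low-coverage bins; 4) call bins above the threshold.
    return [i for i in range(n_bins)
            if sample[i] >= min_reads and sample_fc[i] > threshold]
```

Because the threshold comes from resampled control data rather than a fitted distribution, no symmetric peak shape is assumed, which is the property the abstract highlights for DamID-Seq.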
Lenert, L.; Lopez-Campos, G.
2014-01-01
Summary Objectives Given the quickening speed of discovery of variant disease drivers from combined patient genotype and phenotype data, the objective is to provide methodology using big data technology to support the definition of deep phenotypes in medical records. Methods As the vast stores of genomic information increase with next generation sequencing, the importance of deep phenotyping increases. The growth of genomic data and adoption of Electronic Health Records (EHR) in medicine provide a unique opportunity to integrate phenotype and genotype data into medical records. The method by which collections of clinical findings and other health related data are leveraged to form meaningful phenotypes is an active area of research. Longitudinal data stored in EHRs provide a wealth of information that can be used to construct phenotypes of patients. We focus on a practical problem around data integration for deep phenotype identification within EHR data. The use of big data approaches is described that enable scalable markup of EHR events that can be used for semantic and temporal similarity analysis to support the identification of phenotype and genotype relationships. Conclusions Stead and colleagues' 2005 concept of using light standards to increase the productivity of software systems by riding on the wave of hardware/processing power is described as a harbinger for designing future healthcare systems. The big data solution, using flexible markup, provides a route to improved utilization of processing power for organizing patient records in genotype and phenotype research. PMID:25123744
Biomarker Discovery and Mechanistic Studies of Prostate Cancer using Targeted Proteomic Approaches
2012-07-01
basigin in Drosophila) tightly regulates cytoskeleton rearrangement in Drosophila melanogaster [23]. Based on the present results and the existing...from OligoEngine according to the manufacturer's instruction. Plasmids were amplified in DH5a cell and confirmed by sequencing. Subconfluent cell...electrophoresis and the results are shown in Figure 1 (Panel C). The RT-PCR products were cloned and subjected to DNA sequencing. The sequencing
Leighton, Philip A; Schusser, Benjamin; Yi, Henry; Glanville, Jacob; Harriman, William
2015-01-01
Chicken immune responses to human proteins are often more robust than rodent responses because of the phylogenetic relationship between the different species. For discovery of a diverse panel of unique therapeutic antibody candidates, chickens therefore represent an attractive host for human-derived targets. Recent advances in monoclonal antibody technology, specifically new methods for the molecular cloning of antibody genes directly from primary B cells, have ushered in a new era of generating monoclonal antibodies from non-traditional host animals that were previously inaccessible through hybridoma technology. However, such monoclonals still require post-discovery humanization in order to be developed as therapeutics. To obviate the need for humanization, a modified strain of chickens could be engineered to express a human-sequence immunoglobulin variable region repertoire. Here, human variable genes introduced into the chicken immunoglobulin loci through gene targeting were evaluated for their ability to be recognized and diversified by the native chicken recombination machinery that is present in the B-lineage cell line DT40. After expansion in culture, the DT40 population accumulated genetic mutants that were detected via deep sequencing. Bioinformatic analysis revealed that the human targeted constructs perform as expected in the cell culture system, and provides a measure of confidence that they will be functional in transgenic animals.
NASA Astrophysics Data System (ADS)
McKenzie, Judith A.; Aloisi, Giovanni; Anjos, Sylvia; Latgé, Ricardo; Matsuda, Nilo; Bontognali, Tomaso; Vasconcelos, Crisogono
2015-04-01
Sedimentologic and stratigraphic studies of the Lower Cretaceous sequence, deposited in the economically important Campos Basin, southeast Brazil, document the occurrence of ~20-m-thick dolomite intervals overlying the "massive salt" megasequences of the Lagoa Feia Formation. This stratigraphic succession marks the Aptian/Albian transition from extreme evaporitic conditions of the Lagoa Feia Formation to shallow marine conditions of the Macaé Formation, related to the early opening of the South Atlantic. The facies change from evaporites to dolomite is interpreted as a product of dolomitization resulting from the refluxing of hypersaline fluids from shallow embayments with intense evaporation (Latgé, 2001). Although the reflux model provides a mechanism to produce fluids with geochemical composition favorable for dolomite precipitation, it cannot account for all of the factors required to promote dolomite precipitation. In this study, we propose a different model to explain the post-evaporite deposition of massive dolomite based on the study of sequences deposited at the end of the Messinian Salinity Crisis, which were recovered from the deep basins of the Mediterranean Sea during DSDP/ODP drilling campaigns. At most of these deep-water sites, the cored interval contained unusual dolomite deposits overlying the uppermost evaporite sections. For example, the upper Messinian sedimentary sequence at DSDP Site 374 comprises non-fossiliferous dolomitic mudstone overlying dolomitic mudstone/gypsum cycles, which in turn overlie anhydrite and halite (Hsü, Montadert et al., 1978). We postulate that the end Messinian dolomite is a product of microbial activity under extreme hypersaline conditions.
In the last 20 years, research into the factors controlling dolomite precipitation under Earth surface conditions has led to the development of new models involving the metabolism of microorganisms and associated biofilms to overcome the kinetic inhibitions associated with primary dolomite precipitation. Furthermore, based on the limited pore-water geochemical data obtained during drilling at DSDP Site 374: Messina Abyssal Plain, the dolomitic mudstones of the uppermost Messinian evaporite complex represent an ideal candidate for such an extensive study in a "natural laboratory". In fact, the data suggest that microbial diagenesis and perhaps dolomite precipitation may still be occurring. Thus, to increase our understanding of the biogeochemical processes associated with ancient massive dolomite formation, a major new drilling campaign to study the sub-seafloor Messinian evaporite complex in the deep Mediterranean basins, using greatly enhanced drilling technology currently available within the new International Ocean Discovery Program (IODP), would be timely. Hsü, K., Montadert, L. et al., 1978. Initial Reports of the Deep Sea Drilling Project, Volume 42, Part 1: Washington (U.S. Government Printing Office). Latgé, M. A. R., 2001. O Albiano no Atlântico Sul: estratigrafia, Paleoceanografia e Relações Globais. PhD thesis, Universidade Federal do Rio Grande do Sul, pp. 257.
A biological compression model and its applications.
Cao, Minh Duc; Dix, Trevor I; Allison, Lloyd
2011-01-01
A biological compression model, expert model, is presented which is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, the model provides a framework for knowledge discovery from biological data. It can be used for repeat element discovery, sequence alignment and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences where conventional knowledge discovery tools often fail.
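The link between compression and knowledge discovery mentioned in this abstract can be illustrated with a much simpler stand-in. The sketch below is not the expert model; it is a plain adaptive order-k Markov model with Laplace smoothing, and the function name `bits_per_base` and the parameter `k` are hypothetical. It only shows the general idea that a repetitive sequence costs far fewer bits per base than the 2 bits of a uniform DNA code.

```python
import math
import random
from collections import defaultdict

def bits_per_base(seq, k=2, alphabet="ACGT"):
    """Average code length (bits/base) under an adaptive order-k Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - k):i]
        seen = counts[context]
        # Laplace-smoothed probability from what has been seen so far,
        # so a decoder could reproduce the same estimates.
        p = (seen[base] + 1) / (sum(seen.values()) + len(alphabet))
        total_bits += -math.log2(p)
        seen[base] += 1
    return total_bits / len(seq)

repetitive = "ACGT" * 50
rng = random.Random(0)
shuffled = "".join(rng.choice("ACGT") for _ in range(200))

print(bits_per_base(repetitive))  # well below 2 bits per base
print(bits_per_base(shuffled))    # close to or above 2 bits per base
```

Regions where the per-base cost drops sharply are candidates for repeats or other structure, which is the sense in which a compression model can serve as a discovery tool.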
Crane, Paul K; Foroud, Tatiana; Montine, Thomas J; Larson, Eric B
2017-12-01
The Alzheimer's Disease Sequencing Project (ADSP) used different criteria for assigning case and control status from the discovery and replication phases of the project. We considered data from a community-based prospective cohort study with autopsy follow-up where participants could be categorized as case, control, or neither by both definitions and compared the two sets of criteria. We used data from the Adult Changes in Thought (ACT) study including Diagnostic and Statistical Manual-IV criteria for dementia status, McKhann et al. criteria for clinical Alzheimer's disease, and Braak and Consortium to Establish a Registry for AD findings on neurofibrillary tangles and neuritic plaques to categorize the 621 ACT participants of European ancestry who died and came to autopsy. We applied ADSP discovery and replication definitions to identify controls, cases, and people who were neither controls nor cases. There was some agreement between the discovery and replication definitions. Major areas of discrepancy included the finding that only 40% of the discovery sample controls had sufficiently low levels of neurofibrillary tangles and neuritic plaques to be considered controls by the replication criteria and the finding that 16% of the replication phase cases were diagnosed with non-AD dementia during life and thus were excluded as cases for the discovery phase. These findings should inform interpretation of genetic association findings from the ADSP. Differences in genetic association findings between the two phases of the study may reflect these different phenotype definitions from the discovery and replication phase of the ADSP. Copyright © 2017 the Alzheimer's Association. Published by Elsevier Inc. All rights reserved.
Sohlberg, Elina; Bomberg, Malin; Miettinen, Hanna; Nyyssönen, Mari; Salavirta, Heikki; Vikman, Minna; Itävaara, Merja
2015-01-01
The diversity and functional role of fungi, one of the ecologically most important groups of eukaryotic microorganisms, remains largely unknown in deep biosphere environments. In this study we investigated fungal communities in packer-isolated bedrock fractures in Olkiluoto, Finland at depths ranging from 296 to 798 m below surface level. DNA- and cDNA-based high-throughput amplicon sequencing analysis of the fungal internal transcribed spacer (ITS) gene markers was used to examine the total fungal diversity and to identify the active members in deep fracture zones at different depths. Results showed that fungi were present in fracture zones at all depths and fungal diversity was higher than expected. Most of the observed fungal sequences belonged to the phylum Ascomycota. Phyla Basidiomycota and Chytridiomycota were only represented as a minor part of the fungal community. Dominating fungal classes in the deep bedrock aquifers were Sordariomycetes, Eurotiomycetes, and Dothideomycetes from the Ascomycota phylum and classes Microbotryomycetes and Tremellomycetes from the Basidiomycota phylum, which are the most frequently detected fungal taxa reported also from deep sea environments. In addition some fungal sequences represented potentially novel fungal species. Active fungi were detected in most of the fracture zones, which proves that fungi are able to maintain cellular activity in these oligotrophic conditions. Possible roles of fungi and their origin in deep bedrock groundwater can only be speculated in the light of current knowledge but some species may be specifically adapted to deep subsurface environment and may play important roles in the utilization and recycling of nutrients and thus sustaining the deep subsurface microbial community.
Wang, Zheng Jia; Huang, Jian Qin; Huang, You Jun; Li, Zheng; Zheng, Bing Song
2012-08-01
Hickory (Carya cathayensis Sarg.) is an economically important woody plant in China, but its long juvenile phase delays yield. MicroRNAs (miRNAs) are critical regulators of genes and important for normal plant development and physiology, including flower development. We used Solexa technology to sequence two small RNA libraries from two floral differentiation stages in hickory to identify miRNAs related to flower development. We identified 39 conserved miRNA sequences from 114 loci belonging to 23 families as well as two novel and ten potential novel miRNAs belonging to nine families. Moreover, 35 conserved miRNA*s and two novel miRNA*s were detected. Twenty miRNA sequences from 49 loci belonging to 11 families were differentially expressed; all were up-regulated at the later stage of flower development in hickory. Quantitative real-time PCR of 12 conserved miRNA sequences, five novel miRNA families, and two novel miRNA*s validated that all were expressed during hickory flower development, and the expression patterns were similar to those detected with Solexa sequencing. Finally, a total of 146 targets of the novel and conserved miRNAs were predicted. This study identified a diverse set of miRNAs that were closely related to hickory flower development and that could help in plant floral induction.
Hwang, Kyu-Baek; Lee, In-Hee; Park, Jin-Ho; Hambuch, Tina; Choe, Yongjoon; Kim, MinHyeok; Lee, Kyungjoon; Song, Taemin; Neu, Matthew B; Gupta, Neha; Kohane, Isaac S; Green, Robert C; Kong, Sek Won
2014-08-01
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false-positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but doing so is costly. Here, we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false-negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery in NA12878, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and ensemble genotyping would be essential to minimize false-positive DNM candidates. © 2014 WILEY PERIODICALS, INC.
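The percentages quoted for DNM discovery follow directly from the counts in the abstract, as the short calculation below shows. The precision figures at the end are not reported in the abstract; they rest on the assumption, made here only for illustration, that the 107,167 false positives and 937 true positives together form the complete candidate set.

```python
# Counts quoted in the abstract for de novo mutation (DNM) discovery.
false_pos_total, false_pos_excluded = 107_167, 105_080
true_pos_total, true_pos_retained = 937, 897

def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

fp_excluded_pct = percent(false_pos_excluded, false_pos_total)  # about 98.05
tp_retained_pct = percent(true_pos_retained, true_pos_total)    # about 95.73

# Assumption (not from the abstract): these counts cover all candidates,
# so precision is true positives over all remaining candidates.
fp_remaining = false_pos_total - false_pos_excluded
precision_before = percent(true_pos_total, true_pos_total + false_pos_total)
precision_after = percent(true_pos_retained, true_pos_retained + fp_remaining)

print(f"false positives excluded: {fp_excluded_pct:.2f}%")
print(f"true positives retained:  {tp_retained_pct:.2f}%")
print(f"precision before/after:   {precision_before:.2f}% / {precision_after:.2f}%")
```

Under that assumption the candidate list shrinks from roughly one true DNM in a hundred calls to roughly one in three, which is why the filtering matters for follow-up validation.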
Hwang, Kyu-Baek; Lee, In-Hee; Park, Jin-Ho; Hambuch, Tina; Choi, Yongjoon; Kim, MinHyeok; Lee, Kyungjoon; Song, Taemin; Neu, Matthew B.; Gupta, Neha; Kohane, Isaac S.; Green, Robert C.; Kong, Sek Won
2014-01-01
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but doing so is costly. Here we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous SNVs; 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and ensemble genotyping would be essential to minimize false positive DNM candidates. PMID:24829188
Agrawal, Neeraj J; Dykstra, Andrew; Yang, Jane; Yue, Hai; Nguyen, Xichdao; Kolvenbach, Carl; Angell, Nicolas
2018-05-01
Methionine oxidation in therapeutic antibodies can impact the product's stability, clinical efficacy, and safety and hence it is desirable to address the methionine oxidation liability during antibody discovery and development phase. Although the current experimental approaches can identify the oxidation-labile methionine residues, their application is limited mostly to the development phase. We demonstrate an in silico method that can be used to predict oxidation-labile residues based solely on the antibody sequence and structure information. Since antibody sequence information is available in the discovery phase, the in silico method can be applied very early on to identify the oxidation-labile methionine residues and subsequently address the oxidation liability. We believe that the in silico method for methionine oxidation liability assessment can aid in antibody discovery and development phase to address the liability in a more rational way. Copyright © 2018 American Pharmacists Association®. Published by Elsevier Inc. All rights reserved.
Maurer-Stroh, Sebastian; Gao, He; Han, Hao; Baeten, Lies; Schymkowitz, Joost; Rousseau, Frederic; Zhang, Louxin; Eisenhaber, Frank
2013-02-01
Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable unearthed potential for the discovery of sequence motif--structural motif--function relationships as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, a HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL (http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/).
The epistellar body and what followed from its discovery.
Young, J Z
1990-07-01
The sequence of discoveries that has followed the investigation of this small yellow spot shows the value of studies begun out of "mere curiosity". The spot occurs on the stellate ganglion of octopods. It proved to be an enclosed sac, perhaps a gland. The search for it in squids and cuttlefishes led to the discovery of the giant nerve fibres. At first they were thought to be veins but we soon showed that they were nerve fibres concerned with jet propulsion. Their action potentials, membranes and synapses have been used for thousands of studies, including those that led to the Hodgkin-Huxley equations. They have been the basis of much of modern neuroscience. The epistellar body itself proved not to be a gland but a photoreceptor. Comparable photosensitive vesicles are especially large in the heads of deep-sea squids. In the mesopelagic ones they allow the squid to conceal itself by counterillumination, matching its own light output to the light coming from above. In bathypelagic squids the vesicles are enormous and probably keep the animals in the dark, where they breed. The function of the epistellar body, lying within the mantle of octopods, is still unknown. It may act in the transparent larval stage to trigger the ejection of luminous plankton, which would be a hazard.
Wu, Jieying; Gao, Weimin; Zhang, Weiwen; Meldrum, Deirdre R
2011-01-01
Limitation in sample quality and quantity is one of the big obstacles for applying metatranscriptomic technologies to explore gene expression and functionality of microbial communities in natural environments. In this study, several amplification methods were evaluated for whole-transcriptome amplification of deep-sea microbial samples, which are of low cell density and high impurity. The best amplification method was identified and incorporated into a complete protocol to isolate and amplify deep-sea microbial samples. In the protocol, total RNA was first isolated by a modified method combining Trizol (Invitrogen, CA) and RNeasy (QIAGEN, CA) method, amplified with a WT-Ovation™ Pico RNA Amplification System (NuGEN, CA), and then converted to double-strand DNA from single-strand cDNA with a WT-Ovation™ Exon Module (NuGEN, CA). The products from the whole-transcriptome amplification of deep-sea microbial samples were assessed first through random clone library sequencing. The BLAST search results showed that marine-based sequences are dominant in the libraries, consistent with the ecological source of the samples. The products were then used for next-generation Roche GS FLX Titanium sequencing to obtain metatranscriptome data. Preliminary analysis of the metatranscriptomic data showed good sequencing quality. Although the protocol was designed and demonstrated to be effective for deep-sea microbial samples, it should be applicable to similar samples from other extreme environments in exploring community structure and functionality of microbial communities. Copyright © 2010 Elsevier B.V. All rights reserved.
Less is More: Membrane Protein Digestion Beyond Urea-Trypsin Solution for Next-level Proteomics.
Zhang, Xi
2015-09-01
The goal of next-level bottom-up membrane proteomics is protein function investigation, via high-coverage high-throughput peptide-centric quantitation of expression, modifications and dynamic structures at systems scale. Yet efficient digestion of mammalian membrane proteins presents a daunting barrier, and prevalent day-long urea-trypsin in-solution digestion proved insufficient to reach this goal. Many efforts contributed incremental advances over past years, but involved protein denaturation that disconnected measurement from functional states. Beyond denaturation, the recent discovery of structure/proteomics omni-compatible detergent n-dodecyl-β-d-maltopyranoside, combined with pepsin and PNGase F columns, enabled breakthroughs in membrane protein digestion: a 2010 DDM-low-TCEP (DLT) method for H/D-exchange (HDX) using human G protein-coupled receptor, and a 2015 flow/detergent-facilitated protease and de-PTM digestions (FDD) for integrative deep sequencing and quantitation using full-length human ion channel complex. Distinguishing protein solubilization from denaturation, protease digestion reliability from theoretical specificity, and reduction from alkylation, these methods shifted day(s)-long paradigms into minutes, and afforded fully automatable (HDX)-protein-peptide-(tandem mass tag)-HPLC pipelines to instantly measure functional proteins at deep coverage, high peptide reproducibility, low artifacts and minimal leakage. Promoting-not destroying-structures and activities harnessed membrane proteins for the next-level streamlined functional proteomics. This review analyzes recent advances in membrane protein digestion methods and highlights critical discoveries for future proteomics. © 2015 by The American Society for Biochemistry and Molecular Biology, Inc.
ADEPT, a dynamic next generation sequencing data error-detection program with trimming
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feng, Shihai; Lo, Chien-Chi; Li, Po-E
Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method that uses the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read, and compares these to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
ADEPT, a dynamic next generation sequencing data error-detection program with trimming
Feng, Shihai; Lo, Chien-Chi; Li, Po-E; ...
2016-02-29
Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method that uses the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read, and compares these to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the true positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.
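The idea of judging a base against both its neighbors and the position-specific quality distribution of the whole run can be sketched as below. This is not ADEPT's actual decision rule, which the abstract does not spell out; the function names and the cutoffs `z_cut` and `neighbor_drop` are hypothetical, and reads are assumed to have equal length with Phred scores already decoded to integers.

```python
from statistics import mean, pstdev

def position_stats(run_quals):
    """Per-position (mean, std) of quality scores across equal-length reads."""
    return [(mean(col), pstdev(col)) for col in zip(*run_quals)]

def flag_suspect_bases(quals, stats, z_cut=-1.5, neighbor_drop=10):
    """Indices whose quality is low for that position AND low versus neighbors."""
    flagged = []
    for i, q in enumerate(quals):
        mu, sd = stats[i]
        # How unusual is this score for this position in this run?
        z = (q - mu) / sd if sd > 0 else 0.0
        # How far does it fall below the adjacent bases in the same read?
        neighbors = [quals[j] for j in (i - 1, i + 1) if 0 <= j < len(quals)]
        drop = mean(neighbors) - q
        if z <= z_cut and drop >= neighbor_drop:
            flagged.append(i)
    return flagged

run = [
    [38, 38, 37, 36, 30],
    [37, 38, 38, 35, 28],
    [38, 37, 12, 36, 31],  # one isolated low-quality base mid-read
    [38, 38, 37, 35, 29],
    [37, 38, 38, 36, 30],
    [38, 38, 37, 36, 29],
]
stats = position_stats(run)
print([flag_suspect_bases(read, stats) for read in run])
```

Using the position-specific distribution means the generally lower scores at the 3' end of every read are not flagged merely for being low, which is the position-dependent bias the abstract refers to.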
The rise of deep learning in drug discovery.
Chen, Hongming; Engkvist, Ola; Wang, Yinhai; Olivecrona, Marcus; Blaschke, Thomas
2018-06-01
Over the past decade, deep learning has achieved remarkable success in various artificial intelligence research areas. Evolved from the previous research on artificial neural networks, this technology has shown superior performance to other machine learning algorithms in areas such as image and voice recognition and natural language processing, among others. The first wave of applications of deep learning in pharmaceutical research has emerged in recent years, and its utility has gone beyond bioactivity predictions and has shown promise in addressing diverse problems in drug discovery. Examples will be discussed covering bioactivity prediction, de novo molecular design, synthesis prediction and biological image analysis. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
The sequence of sequencers: The history of sequencing DNA
Heather, James M.; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. PMID:26554401
ComplexContact: a web server for inter-protein contact prediction using deep learning.
Zeng, Hong; Wang, Sheng; Zhou, Tianming; Zhao, Feifeng; Li, Xiufeng; Wu, Qing; Xu, Jinbo
2018-05-22
ComplexContact (http://raptorx2.uchicago.edu/ComplexContact/) is a web server for sequence-based interfacial residue-residue contact prediction of a putative protein complex. Interfacial residue-residue contacts are critical for understanding how proteins form complexes and interact at the residue level. When receiving a pair of protein sequences, ComplexContact first searches for their sequence homologs and builds two paired multiple sequence alignments (MSAs); it then applies co-evolution analysis and a CASP-winning deep learning (DL) method to predict interfacial contacts from the paired MSAs and visualizes the prediction as an image. The DL method was originally developed for intra-protein contact prediction and performed the best in CASP12. Our large-scale experimental test further shows that ComplexContact greatly outperforms pure co-evolution methods for inter-protein contact prediction, regardless of the species.
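The co-evolution step named in this abstract can be illustrated with the simplest such statistic, mutual information between one alignment column from each protein. This is a generic sketch and not the method used by ComplexContact, whose co-evolution analysis is not specified here; the function name and the toy columns are hypothetical, and each string position is assumed to come from the same paired row (for example the same species) in both MSAs.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (bits) between two paired alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi

# Column from protein A, and two candidate columns from protein B.
col_a = "AAKKAAKK"
col_b_coupled = "DDEEDDEE"      # changes whenever col_a changes
col_b_independent = "DEDEDEDE"  # varies independently of col_a

print(mutual_information(col_a, col_b_coupled))
print(mutual_information(col_a, col_b_independent))
```

Column pairs with high scores are candidate interfacial contacts; in practice raw mutual information is confounded by phylogeny and indirect couplings, which is one reason more elaborate co-evolution and learning-based methods are used.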
Coffey, Lark L; Page, Brady L; Greninger, Alexander L; Herring, Belinda L; Russell, Richard C; Doggett, Stephen L; Haniotis, John; Wang, Chunlin; Deng, Xutao; Delwart, Eric L
2014-01-05
Viral metagenomics characterizes known and identifies unknown viruses based on sequence similarities to any previously sequenced viral genomes. A metagenomics approach was used to identify virus sequences in Australian mosquitoes causing cytopathic effects in inoculated mammalian cell cultures. Sequence comparisons revealed strains of Liao Ning virus (Reovirus, Seadornavirus), previously detected only in China, livestock-infecting Stretch Lagoon virus (Reovirus, Orbivirus), two novel dimarhabdoviruses, named Beaumont and North Creek viruses, and two novel orthobunyaviruses, named Murrumbidgee and Salt Ash viruses. The novel virus proteomes diverged by ≥ 50% relative to their closest previously genetically characterized viral relatives. Deep sequencing also generated genomes of Warrego and Wallal viruses, orbiviruses linked to kangaroo blindness, whose genomes had not been fully characterized. This study highlights viral metagenomics in concert with traditional arbovirus surveillance to characterize known and new arboviruses in field-collected mosquitoes. Follow-up epidemiological studies are required to determine whether the novel viruses infect humans. © 2013 Elsevier Inc. All rights reserved.
Schmidt, Olga; Hausmann, Axel; Cancian de Araujo, Bruno; Sutrisno, Hari; Peggie, Djunijanti; Schmidt, Stefan
2017-01-01
Here we present a general collecting and preparation protocol for DNA barcoding of Lepidoptera as part of large-scale rapid biodiversity assessment projects, and a comparison with alternative preserving and vouchering methods. About 98% of the sequenced specimens processed using the present collecting and preparation protocol yielded sequences with more than 500 base pairs. The study is based on the first outcomes of the Indonesian Biodiversity Discovery and Information System (IndoBioSys). IndoBioSys is a German-Indonesian research project that is conducted by the Museum für Naturkunde in Berlin and the Zoologische Staatssammlung München, in close cooperation with the Research Center for Biology - Indonesian Institute of Sciences (RCB-LIPI, Bogor).
USDA-ARS's Scientific Manuscript database
Butyrate is a nutritional element with strong epigenetic regulatory activity as an inhibitor of histone deacetylases (HDACs). Based on the analysis of differentially expressed genes induced by butyrate in the bovine epithelial cell using deep RNA-sequencing technology (RNA-seq), a set of unique gen...
NASA Astrophysics Data System (ADS)
Medialdea, T.; Somoza, L.; González, F. J.; Vázquez, J. T.; de Ignacio, C.; Sumino, H.; Sánchez-Guillamón, O.; Orihashi, Y.; León, R.; Palomino, D.
2017-08-01
New seismic profiles, bathymetric data, and sediment-rock sampling document for the first time the discovery of hydrothermal vent complexes and volcanic cones at 4800-5200 m depth related to recent volcanic and intrusive activity in an unexplored area of the Canary Basin (Eastern Atlantic Ocean, 500 km west of the Canary Islands). A complex of sill intrusions is imaged on seismic profiles showing saucer-shaped, parallel, or inclined geometries. Three main types of structures are related to these intrusions. Type I consists of cone-shaped depressions developed above inclined sills interpreted as hydrothermal vents. Type II is the most abundant and is represented by isolated or clustered hydrothermal domes bounded by faults rooted at the tips of saucer-shaped sills. Domes are interpreted as seabed expressions of reservoirs of CH4 and CO2-rich fluids formed by degassing and contact metamorphism of organic-rich sediments around sill intrusions. Type III are hydrothermal-volcanic complexes originated above stratified or branched inclined sills connected by a chimney to the seabed volcanic edifice. Parallel sills sourced from the magmatic chimney formed also domes surrounding the volcanic cones. Core and dredges revealed that these volcanoes, which must be among the deepest in the world, are constituted by OIB-type, basanites with an outer ring of blue-green hydrothermal Al-rich smectite muds. Magmatic activity is dated, based on lava samples, at 0.78 ± 0.05 and 1.61 ± 0.09 Ma (K/Ar methods) and on tephra layers within cores at 25-237 ky. The Subvent hydrothermal-volcanic complex constitutes the first modern system reported in deep water oceanic basins related to intraplate hotspot activity.
AUC-Maximized Deep Convolutional Neural Fields for Protein Sequence Labeling.
Wang, Sheng; Sun, Siqi; Xu, Jinbo
2016-09-01
Deep Convolutional Neural Networks (DCNN) have shown excellent performance in a variety of machine learning tasks. This paper presents Deep Convolutional Neural Fields (DeepCNF), an integration of DCNN with Conditional Random Field (CRF), for sequence labeling with an imbalanced label distribution. The widely-used training methods, such as maximum-likelihood and maximum labelwise accuracy, do not work well on imbalanced data. To handle this, we present a new training algorithm called maximum-AUC for DeepCNF. That is, we train DeepCNF by directly maximizing the empirical Area Under the ROC Curve (AUC), which is an unbiased measurement for imbalanced data. To fulfill this, we formulate AUC in a pairwise ranking framework, approximate it by a polynomial function and then apply a gradient-based procedure to optimize it. Our experimental results confirm that maximum-AUC greatly outperforms the other two training methods on 8-state secondary structure prediction and disorder prediction, since their label distributions are highly imbalanced, and performs similarly to the other two training methods on solvent accessibility prediction, which has three equally-distributed labels. Furthermore, our experimental results show that our AUC-trained DeepCNF models greatly outperform existing popular predictors of these three tasks. The data and software related to this paper are available at https://github.com/realbigws/DeepCNF_AUC.
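The pairwise ranking view of AUC mentioned in the abstract above can be illustrated with a small sketch. This is not the DeepCNF training code: it only computes the empirical AUC as the fraction of (positive, negative) pairs that a scorer ranks correctly, which is the step-function quantity that, per the abstract, the authors approximate by a polynomial before applying gradient-based optimization. The function name `pairwise_auc`, the binary 0/1 label encoding, and the half-credit rule for ties are assumptions made for illustration.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical AUC as the fraction of correctly ranked (positive, negative) pairs.

    Assumes binary labels encoded as 1 (positive) and 0 (negative).
    Tied pairs receive half credit (an illustrative convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # pair ranked correctly
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical toy data: two positives, three negatives.
    toy_scores = [0.9, 0.8, 0.35, 0.6, 0.1]
    toy_labels = [1, 0, 1, 0, 0]
    print(pairwise_auc(toy_scores, toy_labels))
```

Because each pair contributes independently of how many positives and negatives there are, the measure is not dominated by the majority class, which is why it suits imbalanced label distributions.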
AUC-Maximized Deep Convolutional Neural Fields for Protein Sequence Labeling
Wang, Sheng; Sun, Siqi
2017-01-01
Deep Convolutional Neural Networks (DCNN) have shown excellent performance in a variety of machine learning tasks. This paper presents Deep Convolutional Neural Fields (DeepCNF), an integration of DCNN with Conditional Random Field (CRF), for sequence labeling with an imbalanced label distribution. The widely-used training methods, such as maximum-likelihood and maximum labelwise accuracy, do not work well on imbalanced data. To handle this, we present a new training algorithm called maximum-AUC for DeepCNF. That is, we train DeepCNF by directly maximizing the empirical Area Under the ROC Curve (AUC), which is an unbiased measurement for imbalanced data. To fulfill this, we formulate AUC in a pairwise ranking framework, approximate it by a polynomial function and then apply a gradient-based procedure to optimize it. Our experimental results confirm that maximum-AUC greatly outperforms the other two training methods on 8-state secondary structure prediction and disorder prediction, since their label distributions are highly imbalanced, and performs similarly to the other two training methods on solvent accessibility prediction, which has three equally-distributed labels. Furthermore, our experimental results show that our AUC-trained DeepCNF models greatly outperform existing popular predictors of these three tasks. The data and software related to this paper are available at https://github.com/realbigws/DeepCNF_AUC. PMID:28884168
Fungal diversity in deep-sea sediments associated with asphalt seeps at the Sao Paulo Plateau
NASA Astrophysics Data System (ADS)
Nagano, Yuriko; Miura, Toshiko; Nishi, Shinro; Lima, Andre O.; Nakayama, Cristina; Pellizari, Vivian H.; Fujikura, Katsunori
2017-12-01
We investigated the fungal diversity in a total of 20 deep-sea sediment samples (of which 14 samples were associated with natural asphalt seeps and 6 samples were not associated) collected from two different sites at the Sao Paulo Plateau off Brazil by Ion Torrent PGM sequencing targeting the ITS region of ribosomal RNA. Our results suggest that diverse fungi (113 operational taxonomic units (OTUs) based on clustering at 97% sequence similarity, assigned to 9 classes and 31 genera) are present in deep-sea sediment samples collected at the Sao Paulo Plateau, dominated by Ascomycota (74.3%), followed by Basidiomycota (11.5%), unidentified fungi (7.1%), and sequences with no affiliation to any organisms in the public database (7.1%). However, it was revealed that only three species, namely Penicillium sp., Cadophora malorum and Rhodosporidium diobovatum, were dominant, with the majority of OTUs remaining a minor community. Unexpectedly, there was no significant difference in major fungal community structure between the asphalt seep and non-asphalt seep sites, despite the presence of mass hydrocarbon deposits and the high abundance of macroorganisms surrounding the asphalt seeps. However, there were some differences in the minor fungal communities, with possible asphalt-degrading fungi present specifically in the asphalt seep sites. In contrast, some differences were found between the two different sampling sites. Classification of OTUs revealed that only 47 (41.6%) fungal OTUs exhibited >97% sequence similarity to pre-existing ITS sequences in public databases, indicating that a majority of deep-sea inhabiting fungal taxa still remain undescribed. Although our knowledge of fungi and their role in deep-sea environments is still limited, this study increases our understanding of fungal diversity and community structure in deep-sea environments.
Covington, Brett C; McLean, John A; Bachmann, Brian O
2017-01-04
Covering: 2000 to 2016. The labor-intensive process of microbial natural product discovery is contingent upon identifying discrete secondary metabolites of interest within complex biological extracts, which contain inventories of all extractable small molecules produced by an organism or consortium. Historically, compound isolation prioritization has been driven by observed biological activity and/or relative metabolite abundance and followed by dereplication via accurate mass analysis. Decades of discovery using variants of these methods have generated the natural pharmacopeia but also contribute to recent high rediscovery rates. However, genomic sequencing reveals substantial untapped potential in previously mined organisms, and can provide useful prescience of potentially new secondary metabolites that ultimately enables isolation. Recently, advances in comparative metabolomics analyses have been coupled to secondary metabolic predictions to accelerate bioactivity- and abundance-independent discovery workflows. In this review we will discuss the various analytical and computational techniques that enable MS-based metabolomic applications to natural product discovery and discuss the future prospects for comparative metabolomics in natural product discovery.
Geology and biology of North Pacific cold seep communities
NASA Astrophysics Data System (ADS)
Robison, Bruce H.; Greene, H. Gary
Because of crushing pressure, low temperature, and stygian darkness, the floor of the deep sea is one of the most hostile habitats on Earth. Until recently it was widely believed that the base of the food chain for all deep-sea communities was plant life in the ocean's sunlit upper layer. With the discovery of hydrothermal vent and cold-seep communities, which are based on chemical rather than solar energy, those beliefs were overturned. New studies focused on the animals that inhabit cold seep regions have begun to throw light on the geological basis of chemosynthetic communities. The initial results suggest a strong relationship between geologically determined fluid flux, and the diversity and abundance of animals at the seeps.
Anaerobic consortia of fungi and sulfate reducing bacteria in deep granite fractures.
Drake, Henrik; Ivarsson, Magnus; Bengtson, Stefan; Heim, Christine; Siljeström, Sandra; Whitehouse, Martin J; Broman, Curt; Belivanova, Veneta; Åström, Mats E
2017-07-04
The deep biosphere is one of the least understood ecosystems on Earth. Although most microbiological studies in this system have focused on prokaryotes and neglected microeukaryotes, recent discoveries have revealed the existence of fossil and active fungi in marine sediments and sub-seafloor basalts, with proposed importance for the subsurface energy cycle. However, studies of fungi in deep continental crystalline rocks are surprisingly few. Consequently, the characteristics and processes of fungi and fungus-prokaryote interactions in this vast environment remain enigmatic. Here we report the first findings of partly organically preserved and partly mineralized fungi at great depth in fractured crystalline rock (-740 m). Based on environmental parameters and mineralogy the fungi are interpreted as anaerobic. Synchrotron-based techniques and stable isotope microanalysis confirm a coupling between the fungi and sulfate reducing bacteria. The cryptoendolithic fungi have significantly weathered neighboring zeolite crystals and thus have implications for storage of toxic wastes using zeolite barriers. Deep subsurface microorganisms play an important role in nutrient cycling, yet little is known about deep continental fungal communities. Here, the authors show organically preserved and partly mineralized fungi at 740 m depth, and find evidence of an anaerobic consortium of fungi and sulfate reducing bacteria.
Pan, Xiaoyong; Shen, Hong-Bin
2018-05-02
RNA-binding proteins (RBPs) account for 5–10% of the eukaryotic proteome and play key roles in many biological processes, e.g. gene regulation. Experimental detection of RBP binding sites is still time-intensive and costly. Instead, computational prediction of RBP binding sites using patterns learned from existing annotation knowledge is a fast approach. From the biological point of view, the local structure context derived from local sequences will be recognized by specific RBPs. However, in computational modeling using deep learning, to the best of our knowledge, only global representations of entire RNA sequences are employed. So far, the local sequence information is ignored in the deep model construction process. In this study, we present a computational method, iDeepE, to predict RNA-protein binding sites from RNA sequences by combining global and local convolutional neural networks (CNNs). For the global CNN, we pad the RNA sequences to the same length. For the local CNN, we split an RNA sequence into multiple overlapping fixed-length subsequences, where each subsequence is a signal channel of the whole sequence. Next, we train deep CNNs for multiple subsequences and the padded sequences to learn high-level features, respectively. Finally, the outputs from local and global CNNs are combined to improve the prediction. iDeepE demonstrates better performance than state-of-the-art methods on two large-scale datasets derived from CLIP-seq. We also find that the local CNN runs 1.8 times faster than the global CNN with comparable performance when using GPUs. Our results show that iDeepE has captured experimentally verified binding motifs. Availability: https://github.com/xypan1232/iDeepE. Contact: xypan172436@gmail.com or hbshen@sjtu.edu.cn. Supplementary data are available at Bioinformatics online.
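The two input preparations described in this abstract (padding whole sequences to a common length for the global CNN, and cutting each sequence into overlapping fixed-length subsequences for the local CNN) can be sketched as follows. This is a minimal illustration, not the iDeepE code: the function names, the pad character "N", right-side padding, and the window and overlap sizes in the usage example are all assumptions, since the abstract does not state them.

```python
from typing import List


def pad_sequence(seq: str, target_len: int, pad_char: str = "N") -> str:
    """Right-pad (or truncate) a sequence so that it has exactly target_len characters.

    The pad character and the padding side are illustrative assumptions.
    """
    return seq[:target_len].ljust(target_len, pad_char)


def split_overlapping(seq: str, window: int, overlap: int,
                      pad_char: str = "N") -> List[str]:
    """Split a sequence into overlapping windows of fixed length.

    Consecutive windows share `overlap` characters; a short final window
    is padded so every window has the same length.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")

    step = window - overlap
    pieces: List[str] = []
    start = 0
    while True:
        pieces.append(pad_sequence(seq[start:start + window], window, pad_char))
        if start + window >= len(seq):   # this window reached the end of the sequence
            break
        start += step
    return pieces


if __name__ == "__main__":
    rna = "ACGUACGUAC"                      # hypothetical toy sequence
    print(pad_sequence(rna, 12))            # input shape for a global model
    print(split_overlapping(rna, 4, 2))     # inputs for a local model
```

Each returned window would then be encoded (for example one-hot) and treated as one channel of the local model's input, which is how the abstract describes the subsequences being used.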
2012-01-01
Background Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research. Results We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to make additional insights. In all cases, we were able to support our findings with experimental work from the literature. Conclusions Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesize 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/. PMID:23181585
Deep Sequencing Analysis of Apple Infecting Viruses in Korea
Cho, In-Sook; Igori, Davaajargal; Lim, Seungmo; Choi, Gug-Seoun; Hammond, John; Lim, Hyoun-Sub; Moon, Jae Sun
2016-01-01
Deep sequencing generated 52 contigs derived from five viruses: Apple chlorotic leaf spot virus (ACLSV), Apple stem grooving virus (ASGV), Apple stem pitting virus (ASPV), Apple green crinkle associated virus (AGCaV), and Apricot latent virus (ApLV), which were identified from eight apple samples showing small leaves and/or growth retardation. Nucleotide (nt) sequence identity of the assembled contigs ranged from 68% to 99% compared to the reference sequences of the five respective viral genomes. Sequences of ASPV and ASGV were the most abundantly represented among the 52 assembled contigs. The presence of the five viruses in the samples was confirmed by RT-PCR using specific primers based on the sequences of each assembled contig. All five viruses were detected in three of the samples, and all samples had mixed infections with at least two viruses. The most frequently detected virus was ASPV, followed by ASGV, ApLV, ACLSV, and AGCaV, all of which were found in mixed infections in the tested samples. AGCaV was identified in assembled contigs ID 1012480 and 93549, which showed 82% and 78% nt sequence identity with ORF1 of AGCaV isolate Aurora-1. ApLV was identified in three assembled contigs, ID 65587, 1802365, and 116777, which showed 77%, 78%, and 76% nt sequence identity, respectively, with ORF1 of ApLV isolate LA2. The deep sequencing assay was shown to be a valuable and powerful tool for detection and identification of the known and unknown virome in infected apple trees, here identifying ApLV and AGCaV in commercial orchards in Korea for the first time. PMID:27721694
Gasc, Cyrielle; Peyretaillade, Eric
2016-01-01
Abstract The recent expansion of next-generation sequencing has significantly improved biological research. Nevertheless, deep exploration of genomes or metagenomic samples remains difficult because of the sequencing depth and the associated costs required. Therefore, different partitioning strategies have been developed to sequence informative subsets of studied genomes. Among these strategies, hybridization capture has proven to be an innovative and efficient tool for targeting and enriching specific biomarkers in complex DNA mixtures. It has been successfully applied in numerous areas of biology, such as exome resequencing for the identification of mutations underlying Mendelian or complex diseases and cancers, and its usefulness has been demonstrated in the agronomic field through the linking of genetic variants to agricultural phenotypic traits of interest. Moreover, hybridization capture has provided access to underexplored, but relevant fractions of genomes through its ability to enrich defined targets and their flanking regions. Finally, on the basis of restricted genomic information, this method has also allowed the expansion of knowledge of nonreference species and ancient genomes and provided a better understanding of metagenomic samples. In this review, we present the major advances and discoveries permitted by hybridization capture and highlight the potency of this approach in all areas of biology. PMID:27105841
Gasc, Cyrielle; Peyretaillade, Eric; Peyret, Pierre
2016-06-02
The recent expansion of next-generation sequencing has significantly improved biological research. Nevertheless, deep exploration of genomes or metagenomic samples remains difficult because of the sequencing depth and the associated costs required. Therefore, different partitioning strategies have been developed to sequence informative subsets of studied genomes. Among these strategies, hybridization capture has proven to be an innovative and efficient tool for targeting and enriching specific biomarkers in complex DNA mixtures. It has been successfully applied in numerous areas of biology, such as exome resequencing for the identification of mutations underlying Mendelian or complex diseases and cancers, and its usefulness has been demonstrated in the agronomic field through the linking of genetic variants to agricultural phenotypic traits of interest. Moreover, hybridization capture has provided access to underexplored, but relevant fractions of genomes through its ability to enrich defined targets and their flanking regions. Finally, on the basis of restricted genomic information, this method has also allowed the expansion of knowledge of nonreference species and ancient genomes and provided a better understanding of metagenomic samples. In this review, we present the major advances and discoveries permitted by hybridization capture and highlight the potency of this approach in all areas of biology. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
MicroRNA repertoire for functional genome research in tilapia identified by deep sequencing.
Yan, Biao; Wang, Zhen-Hua; Zhu, Chang-Dong; Guo, Jin-Tao; Zhao, Jin-Liang
2014-08-01
The Nile tilapia (Oreochromis niloticus; Cichlidae) is an economically important species in aquaculture and occupies a prominent position in the aquaculture industry. MicroRNAs (miRNAs) are a class of noncoding RNAs that post-transcriptionally regulate gene expression involved in diverse biological and metabolic processes. To increase the repertoire of miRNAs characterized in tilapia, we used the Illumina/Solexa sequencing technology to sequence a small RNA library using a pooled RNA sample isolated from the different developmental stages of tilapia. Bioinformatic analyses suggest that 197 conserved and 27 novel miRNAs are expressed in tilapia. Sequence alignments indicate that all tested miRNAs and miRNAs* are highly conserved across many species. In addition, we characterized the tissue expression patterns of five miRNAs using real-time quantitative PCR. We found that miR-1/206, miR-7/9, and miR-122 are abundantly expressed in muscle, brain, and liver, respectively, implying a potential role in the regulation of tissue differentiation or the maintenance of tissue identity. Overall, our results expand the number of tilapia miRNAs, and the discovery of miRNAs in the tilapia genome contributes to a better understanding of the role of miRNAs in regulating diverse biological processes.
Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network.
Zhang, Buzhong; Li, Linqing; Lü, Qiang
2018-05-25
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson's correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
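The evaluation described in this abstract (mean absolute error and Pearson's correlation on continuous relative solvent accessibility, followed by thresholding into binary states) can be sketched in plain Python. This is an illustration of the metrics only, not the authors' pipeline: the function names, the 0/1 encoding of buried versus exposed, and the 0.25 cut-off are assumptions, since the abstract only says that predefined thresholds were used.

```python
import math
from typing import List, Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference between predicted and observed values."""
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def to_binary_states(rsa: Sequence[float], threshold: float = 0.25) -> List[int]:
    """Map relative solvent accessibility to 1 (exposed) or 0 (buried).

    The 0.25 default is an illustrative assumption, not a value from the abstract.
    """
    return [1 if value >= threshold else 0 for value in rsa]


if __name__ == "__main__":
    predicted = [0.10, 0.40, 0.80]   # hypothetical per-residue predictions
    observed = [0.20, 0.40, 0.60]    # hypothetical reference values
    print(mean_absolute_error(predicted, observed))
    print(pearson_r(predicted, observed))
    print(to_binary_states(predicted))
```

Reporting both metrics is useful because they capture different things: mean absolute error measures how far predictions are from the reference values, while the correlation measures whether residues are ordered correctly from buried to exposed.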
Faggionato, Davide; Serb, Jeanne M
2017-08-01
The rise of high-throughput RNA sequencing (RNA-seq) and de novo transcriptome assembly has had a transformative impact on how we identify and study genes in the phototransduction cascade of non-model organisms. But the advantage provided by the nearly automated annotation of RNA-seq transcriptomes may at the same time hinder the possibility for gene discovery and the discovery of new gene functions. For example, standard functional annotation based on domain homology to known protein families can only confirm group membership, not identify the emergence of new biochemical function. In this study, we show the importance of developing a strategy that circumvents the limitations of semiautomated annotation and apply this workflow to photosensitivity as a means to discover non-opsin photoreceptors. We hypothesize that non-opsin G-protein-coupled receptor (GPCR) proteins may have chromophore-binding lysines in locations that differ from opsin. Here, we provide the first case study describing non-opsin light-sensitive GPCRs based on tissue-specific RNA-seq data of the common bay scallop Argopecten irradians (Lamarck, 1819). Using a combination of sequence analysis and three-dimensional protein modeling, we identified two candidate proteins. We tested their photochemical properties and provide evidence showing that these two proteins incorporate 11-cis and/or all-trans retinal and react to light photochemically. Based on this case study, we demonstrate that there is potential for the discovery of new light-sensitive GPCRs, and we have developed a workflow that starts from RNA-seq assemblies to the discovery of new non-opsin, GPCR-based photopigments.
Lahuerta, Juan J.; Pepin, François; González, Marcos; Barrio, Santiago; Ayala, Rosa; Puig, Noemí; Montalban, María A.; Paiva, Bruno; Weng, Li; Jiménez, Cristina; Sopena, María; Moorhead, Martin; Cedena, Teresa; Rapado, Immaculada; Mateos, María Victoria; Rosiñol, Laura; Oriol, Albert; Blanchard, María J.; Martínez, Rafael; Bladé, Joan; San Miguel, Jesús; Faham, Malek; García-Sanz, Ramón
2014-01-01
We assessed the prognostic value of minimal residual disease (MRD) detection in multiple myeloma (MM) patients using a sequencing-based platform in bone marrow samples from 133 MM patients in at least very good partial response (VGPR) after front-line therapy. Deep sequencing was carried out in patients in whom a high-frequency myeloma clone was identified and MRD was assessed using the IGH-VDJH, IGH-DJH, and IGK assays. The results were contrasted with those of multiparametric flow cytometry (MFC) and allele-specific oligonucleotide polymerase chain reaction (ASO-PCR). The applicability of deep sequencing was 91%. Concordance between sequencing and MFC and ASO-PCR was 83% and 85%, respectively. Patients who were MRD– by sequencing had a significantly longer time to tumor progression (TTP) (median 80 vs 31 months; P < .0001) and overall survival (median not reached vs 81 months; P = .02), compared with patients who were MRD+. When stratifying patients by different levels of MRD, the respective TTP medians were: MRD ≥10⁻³, 27 months; MRD 10⁻³ to 10⁻⁵, 48 months; and MRD <10⁻⁵, 80 months (P = .003 to .0001). Ninety-two percent of VGPR patients were MRD+. In complete response patients, the TTP remained significantly longer for MRD– compared with MRD+ patients (131 vs 35 months; P = .0009). PMID:24646471
Devlin, Joseph C; Battaglia, Thomas; Blaser, Martin J; Ruggles, Kelly V
2018-06-25
Exploration of large data sets, such as shotgun metagenomic sequence or expression data, by biomedical experts and medical professionals remains a major bottleneck in the scientific discovery process. Although tools for this purpose exist for 16S ribosomal RNA sequencing analysis, there is a growing but still insufficient number of user-friendly interactive visualization workflows for easy data exploration and figure generation. The development of such platforms is necessary to accelerate and streamline microbiome laboratory research. We developed the Workflow Hub for Automated Metagenomic Exploration (WHAM!) as a web-based interactive tool capable of user-directed data visualization and statistical analysis of annotated shotgun metagenomic and metatranscriptomic data sets. WHAM! includes exploratory and hypothesis-based gene and taxa search modules for visualizing differences in microbial taxa and gene family expression across experimental groups, and for creating publication quality figures without the need for a command line interface or in-house bioinformatics. WHAM! is an interactive and customizable tool for downstream metagenomic and metatranscriptomic analysis, providing a user-friendly interface that allows easy data exploration by microbiome and ecological experts to facilitate discovery in multi-dimensional and large-scale data sets.
VarDetect: a nucleotide sequence variation exploratory tool
Ngamphiw, Chumpol; Kulawonganunchai, Supasak; Assawamakin, Anunchai; Jenwitheesuk, Ekachai; Tongsima, Sissades
2008-01-01
Background Single nucleotide polymorphisms (SNPs) are the most commonly studied units of genetic variation. The discovery of such variation may help to identify causative gene mutations in monogenic diseases and SNPs associated with predisposing genes in complex diseases. Accurate detection of SNPs requires software that can correctly interpret chromatogram signals to nucleotides. Results We present VarDetect, a stand-alone nucleotide variation exploratory tool that automatically detects nucleotide variation from fluorescence based chromatogram traces. Accurate SNP base-calling is achieved using pre-calculated peak content ratios, and is enhanced by rules which account for common sequence reading artifacts. The proposed software tool is benchmarked against four other well-known SNP discovery software tools (PolyPhred, novoSNP, Genalys and Mutation Surveyor) using fluorescence based chromatograms from 15 human genes. These chromatograms were obtained from sequencing 16 two-pooled DNA samples; a total of 32 individual DNA samples. In this comparison of automatic SNP detection tools, VarDetect achieved the highest detection efficiency. Availability VarDetect is compatible with most major operating systems such as Microsoft Windows, Linux, and Mac OSX. The current version of VarDetect is freely available at . PMID:19091032
Zhang, Xiao-yong; Tang, Gui-ling; Xu, Xin-ya; Nong, Xu-hua; Qi, Shu-Hua
2014-01-01
The fungal diversity in deep-sea environments has recently gained increasing attention. Our knowledge and understanding of the true fungal diversity and the role it plays in deep-sea environments, however, is still limited. We investigated the fungal community structure in five sediments from a depth of ∼4000 m in the East Indian Ocean using a combination of targeted environmental sequencing and traditional cultivation. This approach resulted in the recovery of a total of 45 fungal operational taxonomic units (OTUs) and 20 culturable fungal phylotypes. This finding indicates that there is a great amount of fungal diversity in the deep-sea sediments collected in the East Indian Ocean. Three fungal OTUs and one culturable phylotype demonstrated high divergence (89%–97%) from the existing sequences in GenBank. Moreover, 44.4% of fungal OTUs and 30% of culturable fungal phylotypes are new reports for deep-sea sediments. These results suggest that the deep-sea sediments from the East Indian Ocean can serve as habitats for new fungal communities compared with other deep-sea environments. In addition, different fungal communities could be detected when using targeted environmental sequencing compared with traditional cultivation in this study, which suggests that a combination of targeted environmental sequencing and traditional cultivation will generate a more diverse fungal community in deep-sea environments than using either targeted environmental sequencing or traditional cultivation alone. This study is the first to report new insights into the fungal communities in deep-sea sediments from the East Indian Ocean, which increases our knowledge and understanding of the fungal diversity in deep-sea environments. PMID:25272044
Brody, Thomas; Yavatkar, Amarendra S; Kuzin, Alexander; Kundu, Mukta; Tyson, Leonard J; Ross, Jermaine; Lin, Tzu-Yang; Lee, Chi-Hon; Awasaki, Takeshi; Lee, Tzumin; Odenwald, Ward F
2012-01-01
Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and analysis that takes full advantage of the sequence information within enhancer CSCs. Results: We have generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable the discovery of functionally related enhancers. The program first identifies conserved repeat elements within an input enhancer and then searches the database for CSCs that score highly against the input CSC. Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the balance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, providing insights into the mode of integration of multiple regulating transcription factors. The database and accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any developmental process. Developmental Dynamics 241:169–189, 2012. © 2011 Wiley Periodicals, Inc. Key findings: A genome-wide catalog of Drosophila conserved DNA sequence clusters. cis-Decoder discovers functionally related enhancers. Functionally related enhancers share balanced sequence element copy numbers. Many enhancers function during multiple phases of development. PMID:22174086
NASA Astrophysics Data System (ADS)
Morono, Y.; Hauer, V. B.; Inagaki, F.; Kubo, Y.; Maeda, L.; Scientists, E.
2017-12-01
Expedition 370 of the International Ocean Discovery Program (IODP) aimed to explore the limits of life in the deep subseafloor biosphere at a location where elevated heat flow lets temperature increase with sediment depth beyond the known maximum of microbial life (∼120°C) at 1.2 km below the seafloor. Such conditions are met in the protothrust zone of the Nankai Trough off Cape Muroto, Japan, where Site C0023 was established in the vicinity of ODP Sites 808 and 1174 at a water depth of 4776 m using the drilling vessel DV Chikyu. Hole C0023A was cored down to a total depth of 1180 meters below seafloor; offshore sampling and research were combined with simultaneous shore-based investigations at the Kochi Core Center (KCC), and long-term temperature observations were started (Heuer et al., 2017). The primary scientific objectives of Expedition 370 are (a) to detect and investigate the presence or absence of life and biological processes at the biotic-abiotic transition of the deep subseafloor with unprecedented analytical sensitivity and precision; (b) to comprehensively study the factors that control biomass, activity, and diversity of microbial communities; and (c) to elucidate if continuous or episodic flow of fluids containing thermogenic and/or geogenic nutrients and energy substrates supports subseafloor microbial communities in the Nankai Trough accretionary complex (Hinrichs et al., 2016). This contribution will highlight the scientific approach of our field-work and preliminary expedition results by shipboard and shorebased activities. Hinrichs K-U, Inagaki F, Heuer VB, Kinoshita M, Morono Y, Kubo Y (2016) Expedition 370 Scientific Prospectus: T-Limit of the Deep Biosphere off Muroto (T-Limit). International Ocean Discovery Program. http://dx.doi.org/10.14379/iodp.sp.370.2016 Heuer VB, Inagaki F, Morono Y, Kubo Y, Maeda L, the Expedition 370 Scientists (2017) Expedition 370 Preliminary Report: Temperature Limit of the Deep Biosphere off Muroto.
International Ocean Discovery Program. http://dx.doi.org/10.14379/iodp.pr.370.2017
Xiong, Dapeng; Zeng, Jianyang; Gong, Haipeng
2017-09-01
Residue-residue contacts are of great value for protein structure prediction, since contact information, especially from long-range residue pairs, can significantly reduce the complexity of conformational sampling in practice. Despite progress in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfactory, and methodologies for these hard targets need further improvement. We present a computational program, DeepConPred, which includes a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) as well as a contact refinement step, to improve the prediction of long-range residue contacts from primary sequences. Compared with previous prediction approaches, our framework employs an effective scheme to identify optimal and important features for contact prediction, and is trained only with coevolutionary information derived from a limited number of homologous sequences to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets. All source data and codes are available at http://166.111.152.91/Downloads.html. Contact: hgong@tsinghua.edu.cn or zengjy321@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
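The "top L/5" and "top L/10" figures quoted above are precision values over the highest-scoring predicted contacts, where L is the protein length. The sketch below shows one way such a number could be computed. It is an illustration only, not the DeepConPred evaluation code: the function name, the toy scores, and the 24-residue cutoff for "long-range" pairs are assumptions introduced here.

```python
def top_k_precision(scores, true_contacts, k, min_separation=24):
    """Fraction of the k highest-scoring long-range pairs that are true contacts.

    scores: dict mapping (i, j) residue index pairs to a predicted score.
    true_contacts: set of (i, j) pairs observed in the native structure.
    min_separation: assumed sequence-separation cutoff for "long-range".
    """
    # Keep only pairs whose sequence separation qualifies as long-range.
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    # Rank by predicted score, highest first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    top_pairs = [pair for _, pair in candidates[:k]]
    if not top_pairs:
        return 0.0
    hits = sum(1 for pair in top_pairs if pair in true_contacts)
    return hits / len(top_pairs)


# Hypothetical toy data; for a protein of length L, k would be L // 5 or L // 10.
toy_scores = {(0, 30): 0.9, (1, 40): 0.8, (2, 45): 0.7, (3, 10): 0.95, (5, 35): 0.6}
toy_truth = {(0, 30), (2, 45)}
print(top_k_precision(toy_scores, toy_truth, k=2))
```

The pair (3, 10) has the highest score in the toy data but is ignored because its separation falls below the assumed long-range cutoff.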
The sequence of sequencers: The history of sequencing DNA.
Heather, James M; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Dell'Anno, Antonio; Carugati, Laura; Corinaldesi, Cinzia; Riccioni, Giulia; Danovaro, Roberto
2015-01-01
Nematodes inhabiting benthic deep-sea ecosystems account for >90% of the total metazoan abundances and they have been hypothesised to be hyper-diverse, but their biodiversity is still largely unknown. Metabarcoding could facilitate the census of biodiversity, especially for those tiny metazoans for which morphological identification is difficult. We compared, for the first time, different DNA extraction procedures based on the use of two commercial kits and a previously published laboratory protocol and tested their suitability for sequencing analyses of 18S rDNA of marine nematodes. We also investigated the reliability of Roche 454 sequencing analyses for assessing the biodiversity of deep-sea nematode assemblages previously morphologically identified. Finally, intra-genomic variation in 18S rRNA gene repeats was investigated by Illumina MiSeq in different deep-sea nematode morphospecies to assess the influence of polymorphisms on nematode biodiversity estimates. Our results indicate that the two commercial kits should be preferred for the molecular analysis of biodiversity of deep-sea nematodes since they consistently provide amplifiable DNA suitable for sequencing. We report that the morphological identification of deep-sea nematodes matches the results obtained by metabarcoding analysis only at the order-family level and that a large portion of Operational Clustered Taxonomic Units (OCTUs) was not assigned. We also show that independently from the cut-off criteria and bioinformatic pipelines used, the number of OCTUs largely exceeds the number of individuals and that 18S rRNA gene of different morpho-species of nematodes displayed intra-genomic polymorphisms. Our results indicate that metabarcoding is an important tool to explore the diversity of deep-sea nematodes, but still fails in identifying most of the species due to limited number of sequences deposited in the public databases, and in providing quantitative data on the species encountered. 
These aspects should be carefully taken into account before using metabarcoding in quantitative ecological research and monitoring programmes of marine biodiversity. PMID:26701112
Zhang, Yiming; Jin, Quan; Wang, Shuting; Ren, Ren
2011-05-01
The mobility behavior of 1481 peptides in ion mobility spectrometry (IMS), generated by protease digestion of the Drosophila melanogaster proteome, is modeled and predicted using two different types of characterization methods: a sequence-based approach and a structure-based approach. The sequence-based approach considers both the amino acid composition of a peptide and the local environment profile of each amino acid in the peptide; the structure-based approach is performed with the CODESSA protocol, which regards a peptide as a common organic compound and generates more than 200 statistically significant variables to characterize the whole structure profile of a peptide molecule. Subsequently, the nonlinear support vector machine (SVM) and Gaussian process (GP) as well as linear partial least squares (PLS) regression are employed to correlate the structural parameters of the characterizations with the IMS drift times of these peptides. The obtained quantitative structure-spectrum relationship (QSSR) models are evaluated rigorously and investigated systematically via both one-deep and two-deep cross-validations as well as Monte Carlo cross-validation (MCCV). We also give a comprehensive comparison of the statistics arising from the different combinations of variable types with modeling methods and find that the sequence-based approach gives QSSR models with better fitting ability and predictive power but worse interpretability than the structure-based approach. In addition, because QSSR modeling with the sequence-based approach does not require preparing minimized peptide structures before modeling, it is considerably more efficient than modeling with the structure-based approach. Copyright © 2011 Elsevier Ltd. All rights reserved.
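Monte Carlo cross-validation, mentioned above as one of the evaluation schemes, repeatedly draws a random train/test partition, fits on the training part and scores on the held-out part. The sketch below illustrates the idea under loud assumptions: a one-variable least-squares line stands in for the SVM, GP and PLS models of the study, and the descriptor and drift-time values are made-up toy numbers, not data from the paper.

```python
import random
from math import sqrt


def monte_carlo_splits(n_samples, n_rounds, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for each random partition."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    indices = list(range(n_samples))
    for _ in range(n_rounds):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mccv_rmse(xs, ys, n_rounds=100, test_fraction=0.2, seed=0):
    """Average held-out RMSE over all Monte Carlo rounds."""
    round_errors = []
    for train, test in monte_carlo_splits(len(xs), n_rounds, test_fraction, seed):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        round_errors.append(sqrt(sum(squared) / len(squared)))
    return sum(round_errors) / len(round_errors)


# Hypothetical toy data: one descriptor per peptide and a noiseless linear response.
descriptor = list(range(5, 25))
drift_time = [2.0 * x + 1.0 for x in descriptor]
print(mccv_rmse(descriptor, drift_time))
```

Unlike k-fold cross-validation, the test sets of different rounds may overlap, so the number of rounds can be chosen independently of the test fraction.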
Romer, Katherine A.; Kayombya, Guy-Richard; Fraenkel, Ernest
2007-01-01
WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs. PMID:17584794
Hausmann, Axel; Cancian de Araujo, Bruno; Sutrisno, Hari; Peggie, Djunijanti; Schmidt, Stefan
2017-01-01
Here we present a general collecting and preparation protocol for DNA barcoding of Lepidoptera as part of large-scale rapid biodiversity assessment projects, and a comparison with alternative preserving and vouchering methods. About 98% of the sequenced specimens processed using the present collecting and preparation protocol yielded sequences with more than 500 base pairs. The study is based on the first outcomes of the Indonesian Biodiversity Discovery and Information System (IndoBioSys). IndoBioSys is a German-Indonesian research project that is conducted by the Museum für Naturkunde in Berlin and the Zoologische Staatssammlung München, in close cooperation with the Research Center for Biology – Indonesian Institute of Sciences (RCB-LIPI, Bogor). PMID:29134041
A multi-model approach to nucleic acid-based drug development.
Gautherot, Isabelle; Sodoyer, Regís
2004-01-01
With the advent of functional genomics and the shift of interest towards sequence-based therapeutics, the past decades have witnessed intense research efforts on nucleic acid-mediated gene regulation technologies. Today, RNA interference is emerging as a groundbreaking discovery, holding promise for development of genetic modulators of unprecedented potency. Twenty-five years after the discovery of antisense RNA and ribozymes, gene control therapeutics are still facing developmental difficulties, with only one US FDA-approved antisense drug currently available in the clinic. Limited predictability of target site selection models is recognized as one major stumbling block that is shared by all of the so-called complementary technologies, slowing the progress towards a commercial product. Currently employed in vitro systems for target site selection include RNAse H-based mapping, antisense oligonucleotide microarrays, and functional screening approaches using libraries of catalysts with randomized target-binding arms to identify optimal ribozyme/DNAzyme cleavage sites. Individually, each strategy has its drawbacks from a drug development perspective. Utilization of message-modulating sequences as therapeutic agents requires that their action on a given target transcript meets criteria of potency and selectivity in the natural physiological environment. In addition to sequence-dependent characteristics, other factors will influence annealing reactions and duplex stability, as well as nucleic acid-mediated catalysis. Parallel consideration of physiological selection systems thus appears essential for screening for nucleic acid compounds proposed for therapeutic applications. Cellular message-targeting studies face issues relating to efficient nucleic acid delivery and appropriate analysis of response. 
For reliability and simplicity, prokaryotic systems can provide a rapid and cost-effective means of studying message targeting under pseudo-cellular conditions, but such approaches also have limitations. To streamline nucleic acid drug discovery, we propose a multi-model strategy integrating high-throughput-adapted bacterial screening, followed by reporter-based and/or natural cellular models and potentially also in vitro assays for characterization of the most promising candidate sequences, before final in vivo testing.
A renaissance of neural networks in drug discovery.
Baskin, Igor I; Winkler, David; Tetko, Igor V
2016-08-01
Neural networks are becoming a very popular method for solving machine learning and artificial intelligence problems. The variety of neural network types and their application to drug discovery requires expert knowledge to choose the most appropriate approach. In this review, the authors discuss traditional and newly emerging neural network approaches to drug discovery. Their focus is on backpropagation neural networks and their variants, self-organizing maps and associated methods, and a relatively new technique, deep learning. The most important technical issues are discussed, including overfitting and its prevention through regularization, ensemble and multitask modeling, model interpretation, and estimation of applicability domain. Different aspects of using neural networks in drug discovery are considered: building structure-activity models with respect to various targets; predicting drug selectivity, toxicity profiles, ADMET and physicochemical properties; characteristics of drug-delivery systems and virtual screening. Neural networks continue to grow in importance for drug discovery. Recent developments in deep learning suggest further improvements may be gained in the analysis of large chemical data sets. It is anticipated that neural networks will be more widely used in drug discovery in the future, and applied in non-traditional areas such as drug delivery systems, biologically compatible materials, and regenerative medicine.
Regulatory sequence analysis tools.
van Helden, Jacques
2003-07-01
The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.
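Pattern matching with a position-specific scoring matrix, one of the motif representations mentioned above, can be illustrated with a short sketch. This is not RSAT code and does not reproduce its scoring scheme: the function names, the pseudocount, the uniform background frequency, the threshold and the example binding sites are all assumptions chosen for illustration, and sequences are assumed to contain only uppercase A, C, G and T.

```python
from math import log2


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build log-odds scores per position from aligned sites of equal length."""
    width = len(sites[0])
    pssm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def scan(sequence, pssm, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned sites and target sequence.
example_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
example_pssm = build_pssm(example_sites)
for start, window, score in scan("GGCTATAATGCC", example_pssm, threshold=5.0):
    print(start, window, round(score, 2))
```

A string pattern corresponds to the special case in which each position allows exactly one base; the matrix form lets positions tolerate variation in proportion to what was seen in the aligned sites.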
High Class-Imbalance in pre-miRNA Prediction: A Novel Approach Based on deepSOM.
Stegmayer, Georgina; Yones, Cristian; Kamenetzky, Laura; Milone, Diego H
2017-01-01
The computational prediction of novel microRNA within a full genome involves identifying sequences having the highest chance of being a miRNA precursor (pre-miRNA). These sequences are usually named candidates to miRNA. The well-known pre-miRNAs are usually only a few in comparison to the hundreds of thousands of potential candidates to miRNA that have to be analyzed, which makes this task a high class-imbalance classification problem. The classical way of approaching it has been training a binary classifier in a supervised manner, using well-known pre-miRNAs as positive class and artificially defining the negative class. However, although the selection of positive labeled examples is straightforward, it is very difficult to build a set of negative examples in order to obtain a good set of training samples for a supervised method. In this work, we propose a novel and effective way of approaching this problem using machine learning, without the definition of negative examples. The proposal is based on clustering unlabeled sequences of a genome together with well-known miRNA precursors for the organism under study, which allows for the quick identification of the best candidates to miRNA as those sequences clustered with known precursors. Furthermore, we propose a deep model to overcome the problem of having very few positive class labels. They are always maintained in the deep levels as positive class while less likely pre-miRNA sequences are filtered level after level. Our approach has been compared with other methods for pre-miRNAs prediction in several species, showing effective predictivity of novel miRNAs. Additionally, we will show that our approach has a lower training time and allows for a better graphical navigability and interpretation of the results. A web-demo interface to try deepSOM is available at http://fich.unl.edu.ar/sinc/web-demo/deepsom/.
DeepSig: deep learning improves signal peptide detection in proteins.
Savojardo, Castrense; Martelli, Pier Luigi; Fariselli, Piero; Casadio, Rita
2018-05-15
The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. DeepSig is available as both standalone program and web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website. Contact: pierluigi.martelli@unibo.it. Supplementary data are available at Bioinformatics online.
A framework genetic map for Miscanthus sinensis from RNAseq-based markers shows recent tetraploidy
2012-01-01
Background Miscanthus (subtribe Saccharinae, tribe Andropogoneae, family Poaceae) is a genus of temperate perennial C4 grasses whose high biomass production makes it, along with its close relatives sugarcane and sorghum, attractive as a biofuel feedstock. The base chromosome number of Miscanthus (x = 19) is different from that of other Saccharinae and approximately twice that of the related Sorghum bicolor (x = 10), suggesting large-scale duplications may have occurred in recent ancestors of Miscanthus. Owing to the complexity of the Miscanthus genome and the complications of self-incompatibility, a complete genetic map with a high density of markers has not yet been developed. Results We used deep transcriptome sequencing (RNAseq) from two M. sinensis accessions to define 1536 single nucleotide variants (SNVs) for a GoldenGate™ genotyping array, and found that simple sequence repeat (SSR) markers defined in sugarcane are often informative in M. sinensis. A total of 658 SNP and 210 SSR markers were validated via segregation in a full sibling F1 mapping population. Using 221 progeny from this mapping population, we constructed a genetic map for M. sinensis that resolves into 19 linkage groups, the haploid chromosome number expected from cytological evidence. Comparative genomic analysis documents a genome-wide duplication in Miscanthus relative to Sorghum bicolor, with subsequent insertional fusion of a pair of chromosomes. The utility of the map is confirmed by the identification of two paralogous C4-pyruvate, phosphate dikinase (C4-PPDK) loci in Miscanthus, at positions syntenic to the single orthologous gene in Sorghum. Conclusions The genus Miscanthus experienced an ancestral tetraploidy and chromosome fusion prior to its diversification, but after its divergence from the closely related sugarcane clade. 
The recent timing of this tetraploidy complicates discovery and mapping of genetic markers for Miscanthus species, since alleles and fixed differences between paralogs are comparable. These difficulties can be overcome by careful analysis of segregation patterns in a mapping population and genotyping of doubled haploids. The genetic map for Miscanthus will be useful in biological discovery and breeding efforts to improve this emerging biofuel crop, and also provide a valuable resource for understanding genomic responses to tetraploidy and chromosome fusion. PMID:22524439
Desjardin, Dennis E; Hemmes, Don E; Perry, Brian A
2014-01-01
Pseudobaeospora wipapatiae is described as new based on material collected in alien wet habitats on the island of Hawaii. Unique features of this beautiful species include deep ruby-colored basidiomes with two-spored basidia, amyloid cheilocystidia and a hymeniderm pileipellis with abundant pileocystidia that is initially deep ruby in KOH then changes to lilac gray. Phylogenetic analysis of nuclear large ribosomal subunit sequence data suggest a close relationship between Pseudobaeospora and Tricholoma. BLAST comparisons of internal transcribed spacer and 5.8S nuclear ribosomal subunit regions sequence data reveal greatest similarity with existing sequences of Pseudobaeospora species. A comprehensive description, color photograph, illustrations of salient micromorphological features and comparisons with phenetically similar taxa are provided. © 2014 by The Mycological Society of America.
The impact of genetics on future drug discovery in schizophrenia.
Matsumoto, Mitsuyuki; Walton, Noah M; Yamada, Hiroshi; Kondo, Yuji; Marek, Gerard J; Tajinda, Katsunori
2017-07-01
Failures of investigational new drugs (INDs) for schizophrenia have left huge unmet medical needs for patients. Given the recent lackluster results, it is imperative that new drug discovery approaches (and resultant drug candidates) target pathophysiological alterations that are shared in specific, stratified patient populations that are selected based on pre-identified biological signatures. One path to implementing this paradigm is achievable by leveraging recent advances in genetic information and technologies. Genome-wide exome sequencing and meta-analysis of single nucleotide polymorphism (SNP)-based association studies have already revealed rare deleterious variants and SNPs in patient populations. Areas covered: Herein, the authors review the impact that genetics have on the future of schizophrenia drug discovery. The high polygenicity of schizophrenia strongly indicates that this disease is biologically heterogeneous so the identification of unique subgroups (by patient stratification) is becoming increasingly necessary for future investigational new drugs. Expert opinion: The authors propose a pathophysiology-based stratification of genetically-defined subgroups that share deficits in particular biological pathways. Existing tools, including lower-cost genomic sequencing and advanced gene-editing technology render this strategy ever more feasible. Genetically complex psychiatric disorders such as schizophrenia may also benefit from synergistic research with simpler monogenic disorders that share perturbations in similar biological pathways.
Sokol, Martin; Jessen, Karen Margrethe; Pedersen, Finn Skou
2016-01-01
Several studies have shown that human endogenous retroviruses and endogenous retrovirus-like repeats (here collectively HERVs) impose direct regulation on human genes through enhancer and promoter motifs present in their long terminal repeats (LTRs). Although chimeric transcription in which novel gene isoforms containing retroviral and human sequence are transcribed from viral promoters is commonly associated with disease, regulation by HERVs is beneficial in other settings; for example, in human testis chimeric isoforms of TP63 induced by an ERV9 LTR protect the male germ line upon DNA damage by inducing apoptosis, whereas in the human globin locus the γ- and β-globin switch during normal hematopoiesis is mediated by complex interactions of an ERV9 LTR and surrounding human sequence. The advent of deep sequencing or next-generation sequencing (NGS) has revolutionized the way researchers solve important scientific questions and develop novel hypotheses in relation to human genome regulation. We recently applied next-generation paired-end RNA-sequencing (RNA-seq) together with chromatin immunoprecipitation with sequencing (ChIP-seq) to examine ERV9 chimeric transcription in human reference cell lines from Encyclopedia of DNA Elements (ENCODE). This led to the discovery of advanced regulation mechanisms by ERV9s and other HERVs across numerous human loci including transcription of large gene-unannotated genomic regions, as well as cooperative regulation by multiple HERVs and non-LTR repeats such as Alu elements. In this article, well-established examples of human gene regulation by HERVs are reviewed, followed by a description of paired-end RNA-seq and its application in identifying chimeric transcription genome-wide. Based on integrative analyses of RNA-seq and ChIP-seq data, we then present novel examples of regulation by ERV9s of tumor suppressor genes CADM2 and SEMA3A, as well as transcription of an unannotated region. 
Taken together, this article highlights the high suitability of contemporary sequencing methods in future analyses of human biology in relation to evolutionarily acquired retroviruses in the human genome. © 2016 APMIS. Published by John Wiley & Sons Ltd.
Kwon, Andrew T.; Chou, Alice Yi; Arenillas, David J.; Wasserman, Wyeth W.
2011-01-01
We performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated C2C12 myotubes. A subset of 19 CRMs was validated as functional in the assay. The rate of predictive success reveals striking limitations of computational regulatory sequence analysis methods for CRM discovery. Motif-based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences indicates that nucleotide sequence composition can be an important characteristic to incorporate in future methods for improved predictive specificity. Muscle-related TFBSs predicted within the functional sequences display greater sequence conservation than non-TFBS flanking regions. Comparison with recent MyoD and histone modification ChIP-Seq data supports the validity of the functional regions. PMID:22144875
Kaplan, Oktay I; Berber, Burak; Hekim, Nezih; Doluca, Osman
2016-11-02
Many studies show that short non-coding sequences are widely conserved among regulatory elements. More and more conserved sequences are being discovered since the development of next generation sequencing technology. A common approach to identify conserved sequences with regulatory roles relies on topological changes such as hairpin formation at the DNA or RNA level. G-quadruplexes, non-canonical nucleic acid topologies with little established biological roles, are increasingly considered for conserved regulatory element discovery. Since the tertiary structure of G-quadruplexes is strongly dependent on the loop sequence which is disregarded by the generally accepted algorithm, we hypothesized that G-quadruplexes with similar topology and, indirectly, similar interaction patterns, can be determined using phylogenetic clustering based on differences in the loop sequences. Phylogenetic analysis of 52 G-quadruplex forming sequences in the Escherichia coli genome revealed two conserved G-quadruplex motifs with a potential regulatory role. Further analysis revealed that both motifs tend to form hairpins and G quadruplexes, as supported by circular dichroism studies. The phylogenetic analysis as described in this work can greatly improve the discovery of functional G-quadruplex structures and may explain unknown regulatory patterns. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
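The abstract above refers to a "generally accepted algorithm" that locates putative G-quadruplexes while disregarding loop content, and proposes comparing the loops instead. The sketch below shows how loop sequences could be pulled out of candidate motifs with a regular expression. The pattern used here (four runs of at least three guanines separated by loops of 1 to 7 bases) is a commonly used convention and is an assumption, since the abstract does not state the exact pattern; the function name and test sequence are hypothetical, and the downstream phylogenetic clustering is not shown.

```python
import re

# Assumed motif: G-run, loop, G-run, loop, G-run, loop, G-run.
# Each G-run has 3 or more guanines; each captured loop has 1 to 7 bases.
G4_PATTERN = re.compile(
    r"G{3,}([ACGT]{1,7})G{3,}([ACGT]{1,7})G{3,}([ACGT]{1,7})G{3,}"
)


def find_g4_loops(sequence):
    """Return (start, (loop1, loop2, loop3)) for each non-overlapping match.

    Only the given strand is scanned; searching the reverse complement
    (C-rich motifs) would be a separate pass.
    """
    return [
        (match.start(), match.groups())
        for match in G4_PATTERN.finditer(sequence.upper())
    ]


# Hypothetical example sequence containing one candidate motif.
print(find_g4_loops("TTGGGAGGGTTGGGCAAGGGTT"))
```

Because the quantifiers are greedy and long G-runs can be split in more than one way, the loop boundaries returned for a given region are one valid parse rather than the only one; a study comparing loops would need to fix a convention for this.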
2010-01-01
Background Bathymodiolus azoricus is a deep-sea hydrothermal vent mussel found in association with large faunal communities living in chemosynthetic environments at the bottom of the sea floor near the Azores Islands. Investigation of the exceptional physiological reactions that vent mussels have adopted in their habitat, including responses to environmental microbes, remains a difficult challenge for deep-sea biologists. In an attempt to reveal genes potentially involved in the deep-sea mussel innate immunity, we carried out a high-throughput sequence analysis of the freshly collected B. azoricus transcriptome, using gill tissue as the primary source of immune transcripts given its strategic role in filtering potentially infectious waterborne microorganisms from the surroundings. Additionally, a substantial EST data set was produced, from which a comprehensive collection of genes coding for putative proteins was organized in a dedicated database, "DeepSeaVent", the first deep-sea vent animal transcriptome database based on 454 pyrosequencing technology. Results A normalized cDNA library from gill tissue was sequenced in a full 454 GS-FLX run, producing 778,996 sequencing reads. Assembly of the high quality reads resulted in 75,407 contigs of which 3,071 were singletons. A total of 39,425 transcripts were conceptually translated into amino acid sequences, of which 22,023 matched known proteins in the NCBI non-redundant protein database, 15,839 revealed conserved protein domains through InterPro functional classification and 9,584 were assigned Gene Ontology terms. Queries conducted within the database enabled the identification of genes putatively involved in immune and inflammatory reactions which had not been previously evidenced in the vent mussel. Their physical counterpart was confirmed by semi-quantitative Reverse Transcription-Polymerase Chain Reaction (RT-PCR) and their RNA transcription level by quantitative PCR (qPCR) experiments.
Conclusions We have established the first tissue transcriptional analysis of a deep-sea hydrothermal vent animal and generated a searchable catalog of genes that provides a direct method of identifying and retrieving vast numbers of novel coding sequences which can be applied in gene expression profiling experiments from a non-conventional model organism. This provides the most comprehensive sequence resource for identifying novel genes currently available for a deep-sea vent organism, in particular, genes putatively involved in immune and inflammatory reactions in vent mussels. The characterization of the B. azoricus transcriptome will facilitate research into biological processes underlying physiological adaptations to hydrothermal vent environments and will provide a basis for expanding our understanding of genes putatively involved in adaptation processes during post-capture long term acclimatization experiments, at "sea-level" conditions, using B. azoricus as a model organism. PMID:20937131
Slack, J.F.; Grenne, Tor; Bekker, A.; Rouxel, O.J.; Lindberg, P.A.
2007-01-01
A current model for the evolution of Proterozoic deep seawater composition involves a change from anoxic sulfide-free to sulfidic conditions 1.8 Ga. In an earlier model the deep ocean became oxic at that time. Both models are based on the secular distribution of banded iron formation (BIF) in shallow marine sequences. We here present a new model based on rare earth elements, especially redox-sensitive Ce, in hydrothermal silica-iron oxide sediments from deeper-water, open-marine settings related to volcanogenic massive sulfide (VMS) deposits. In contrast to Archean, Paleozoic, and modern hydrothermal iron oxide sediments, 1.74 to 1.71 Ga hematitic chert (jasper) and iron formation in central Arizona, USA, show moderate positive to small negative Ce anomalies, suggesting that the redox state of the deep ocean then was at a transitional, suboxic state with low concentrations of dissolved O2 but no H2S. The presence of jasper and/or iron formation related to VMS deposits in other volcanosedimentary sequences ca. 1.79-1.69 Ga, 1.40 Ga, and 1.24 Ga also reflects oxygenated and not sulfidic deep ocean waters during these time periods. Suboxic conditions in the deep ocean are consistent with the lack of shallow-marine BIF ∼1.8 to 0.8 Ga, and likely limited nutrient concentrations in seawater and, consequently, may have constrained biological evolution. © 2006 Elsevier B.V. All rights reserved.
Virus Identification in Unknown Tropical Febrile Illness Cases Using Deep Sequencing
Balmaseda, Angel; Harris, Eva; DeRisi, Joseph L.
2012-01-01
Dengue virus is an emerging infectious agent that infects an estimated 50–100 million people annually worldwide, yet current diagnostic practices cannot detect an etiologic pathogen in ∼40% of dengue-like illnesses. Metagenomic approaches to pathogen detection, such as viral microarrays and deep sequencing, are promising tools to address emerging and non-diagnosable disease challenges. In this study, we used the Virochip microarray and deep sequencing to characterize the spectrum of viruses present in human sera from 123 Nicaraguan patients presenting with dengue-like symptoms but testing negative for dengue virus. We utilized a barcoding strategy to simultaneously deep sequence multiple serum specimens, generating on average over 1 million reads per sample. We then implemented a stepwise bioinformatic filtering pipeline to remove the majority of human and low-quality sequences to improve the speed and accuracy of subsequent unbiased database searches. By deep sequencing, we were able to detect virus sequence in 37% (45/123) of previously negative cases. These included 13 cases with Human Herpesvirus 6 sequences. Other samples contained sequences with similarity to sequences from viruses in the Herpesviridae, Flaviviridae, Circoviridae, Anelloviridae, Asfarviridae, and Parvoviridae families. In some cases, the putative viral sequences were virtually identical to known viruses, and in others they diverged, suggesting that they may derive from novel viruses. These results demonstrate the utility of unbiased metagenomic approaches in the detection of known and divergent viruses in the study of tropical febrile illness. PMID:22347512
Manivannan, Abinaya; Kim, Jin-Hee; Yang, Eun-Young; Ahn, Yul-Kyun; Lee, Eun-Su; Choi, Sena; Kim, Do-Sun
2018-01-01
Pepper is an economically important horticultural plant that has been widely used for its pungency and spicy taste in cuisines worldwide. Therefore, the domestication of pepper has been carried out since antiquity. To meet the growing demand for pepper with high quality, organoleptic properties, nutraceutical content, and disease tolerance, genomics-assisted breeding techniques can be incorporated to develop novel pepper varieties with desired traits. The application of next-generation sequencing (NGS) approaches has reformed plant breeding technology, especially in the area of molecular marker-assisted breeding. The availability of genomic information aids in the deeper understanding of several molecular mechanisms behind vital physiological processes. In addition, NGS methods facilitate the genome-wide discovery of DNA-based markers linked to key genes involved in important biological phenomena. Among the molecular markers, single nucleotide polymorphisms (SNPs) offer various benefits in comparison with other existing DNA-based markers. The present review concentrates on the impact of NGS approaches on the discovery of useful SNP markers associated with pungency and disease resistance in pepper. The information provided here can be utilized for the betterment of pepper breeding in the future.
NASA Astrophysics Data System (ADS)
Wang, K.; Sun, T.; Hino, R.; Iinuma, T.; Tomita, F.; Kido, M.
2017-12-01
Numerous observations pertaining to the M=9.0 2011 Tohoku-oki earthquake have led to new understanding of subduction zone earthquakes. By synthesizing published research results and our own findings, we explore what has been learned about fault behavior and Earth rheology from geodetic imaging of crustal deformation before and after the earthquake. Before the earthquake, megathrust locking models based on land-based geodetic observations correctly outlined the along-strike location of the future rupture zone, showing that land-based observations are capable of resolving along-strike variations in locking and creep at wavelengths comparable to distances from the network. But they predicted a locked zone that was much deeper than the actual rupture in 2011. The incorrect definition of the locking pattern in the dip direction demonstrates not only the need for seafloor geodesy but also the importance of modeling interseismic viscoelastic stress relaxation and stress shadowing. The discovery of decade-long accelerated slip downdip of the future rupture zone raises new questions on fault mechanics. After the earthquake, seafloor geodetic discovery of opposing motion offshore provided unambiguous evidence for the dominance of viscoelastic relaxation in short-term postseismic deformation. There is little deep afterslip in the fault area where the decade-long pre-earthquake slip acceleration is observed. The complementary spatial distribution of pre-slip and afterslip calls for new scientific research. However, the near absence of deep afterslip directly downdip of the main rupture is perceived to be controversial because some viscoelastic models do predict large afterslip here, although less than predicted by purely elastic models. We show that the large afterslip in these models is largely an artefact due to the use of a layered Earth model without a subducting slab. The slab acts as an "anchor" in the mantle and retards landward motion following a subduction earthquake. 
Neglecting the slab causes fast landward motion of the trench area that has to be prevented by using a high value of mantle viscosity. The incorrect high viscosity, however, slows down the seaward motion of the coastal area, which has to be compensated by introducing deep afterslip.
Wilson, M R; Zimmermann, L L; Crawford, E D; Sample, H A; Soni, P R; Baker, A N; Khan, L M; DeRisi, J L
2017-03-01
Solid organ transplant patients are vulnerable to suffering neurologic complications from a wide array of viral infections and can be sentinels in the population who are first to get serious complications from emerging infections like the recent waves of arboviruses, including West Nile virus, Chikungunya virus, Zika virus, and Dengue virus. The diverse and rapidly changing landscape of possible causes of viral encephalitis poses great challenges for traditional candidate-based infectious disease diagnostics that already fail to identify a causative pathogen in approximately 50% of encephalitis cases. We present the case of a 14-year-old girl on immunosuppression for a renal transplant who presented with acute meningoencephalitis. Traditional diagnostics failed to identify an etiology. RNA extracted from her cerebrospinal fluid was subjected to unbiased metagenomic deep sequencing, enhanced with the use of a Cas9-based technique for host depletion. This analysis identified West Nile virus (WNV). Convalescent serum serologies subsequently confirmed WNV seroconversion. These results support a clear clinical role for metagenomic deep sequencing in the setting of suspected viral encephalitis, especially in the context of the high-risk transplant patient population. © 2016 The Authors. American Journal of Transplantation published by Wiley Periodicals, Inc. on behalf of American Society of Transplant Surgeons.
Promoter Sequences Prediction Using Relational Association Rule Mining
Czibula, Gabriela; Bocicor, Maria-Iuliana; Czibula, Istvan Gergely
2012-01-01
In this paper we approach, from a computational perspective, the problem of promoter sequence prediction, an important problem within the field of bioinformatics. As the conditions for a DNA sequence to function as a promoter are not known, machine learning based classification models are still being developed to approach the problem of promoter identification in DNA. We propose a classification model based on relational association rule mining. Relational association rules are a particular type of association rule and describe numerical orderings between attributes that commonly occur over a data set. Our classifier is based on the discovery of relational association rules for predicting whether or not a DNA sequence contains a promoter region. An experimental evaluation of the proposed model and a comparison with similar existing approaches are provided. The obtained results show that our classifier outperforms the existing techniques for identifying promoter sequences, confirming the potential of our proposal. PMID:22563233
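To make the notion of "numerical orderings between attributes that commonly occur over a data set" concrete, here is a minimal sketch that scans numeric records for attribute pairs whose ordering holds in most records. The function name, the support threshold and the restriction to binary `<` and `>` relations are assumptions for illustration; the abstract does not describe the authors' mining algorithm or how DNA sequences are turned into numeric attributes.

```python
from itertools import combinations


def ordering_rules(records, min_support=0.9):
    """Find attribute pairs (i, j) whose ordering holds in most records.

    records: list of equal-length tuples of numbers.
    Returns tuples (i, relation, j, support) with support >= min_support.
    """
    n_records = len(records)
    n_attrs = len(records[0])
    rules = []
    for i, j in combinations(range(n_attrs), 2):
        # Fraction of records in which attribute i is below / above attribute j.
        less = sum(r[i] < r[j] for r in records) / n_records
        greater = sum(r[i] > r[j] for r in records) / n_records
        if less >= min_support:
            rules.append((i, "<", j, less))
        if greater >= min_support:
            rules.append((i, ">", j, greater))
    return rules


if __name__ == "__main__":
    data = [(1, 2, 0), (2, 3, 1), (0, 5, 9)]
    print(ordering_rules(data, min_support=0.9))
```

A classifier in this spirit could mine one rule set per class and score a new record by how many rules of each class it satisfies; that scoring step is likewise an assumption, not something stated in the abstract.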
Systematic and fully automated identification of protein sequence patterns.
Hart, R K; Royyuru, A K; Stolovitzky, G; Califano, A
2000-01-01
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. In addition, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
Deep whole-genome sequencing of 90 Han Chinese genomes.
Lan, Tianming; Lin, Haoxiang; Zhu, Wenjuan; Laurent, Tellier Christian Asker Melchior; Yang, Mengcheng; Liu, Xin; Wang, Jun; Wang, Jian; Yang, Huanming; Xu, Xun; Guo, Xiaosen
2017-09-01
Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (∼×80) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. 
Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects. © The Authors 2017. Published by Oxford University Press.
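The low-frequency criterion quoted above (minor allele frequency < 5%) can be written out as a short calculation. This is a minimal sketch assuming diploid autosomal genotypes and a simple alternate-allele count per site; the function names are hypothetical and nothing here reproduces the authors' variant-calling pipeline.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the rarer of the two alleles at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def is_low_frequency(alt_count, n_individuals, threshold=0.05):
    """Apply the MAF < 5% definition, assuming two alleles per individual."""
    return minor_allele_frequency(alt_count, 2 * n_individuals) < threshold


if __name__ == "__main__":
    # With 90 diploid individuals there are 180 alleles per site.
    for alt in (1, 8, 10, 175):
        print(alt, is_low_frequency(alt, 90))
```

With 180 alleles, an alternate allele seen 8 times has a frequency of about 4.4% and counts as low frequency, while one seen 10 times (about 5.6%) does not.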
Carbon from Crust to Core: A history of deep carbon science
NASA Astrophysics Data System (ADS)
Mitton, Simon
2017-04-01
As an academic historian of science, I am writing a history of the discovery of the interior workings of our dynamic planet. I am preparing a book, titled Carbon from Crust to Core: A Chronicle of Deep Carbon Science, in which I will present the first history of deep carbon science. I will identify and document key discoveries, the impact of new knowledge, and the roles of deep carbon scientists and their institutions from the 1400s to the present. This innovative book will set down the engaging human story of many remarkable scientists from whom we have learned about Earth's interior, and particularly the fascinating story of carbon in Earth. I will describe a great journey of discovery that has led to a better understanding of the physical, chemical, and biological behaviour of carbon in the vast majority of Earth's interior. My poster has a list of remarkable Deep Carbon Explorers, from Georgius Agricola (1494-1555) to Claude ZoBell (1904-1989). Come along to my poster and add to my compilation: choose pioneers from history, or nominate your colleagues, or even add a selfie! As a biographer, I am keen to add researchers who may have been overlooked in the standard histories of geology and geophysics. And I am always on the lookout for standout stories and personal recollections. I am equipped to do oral history interviews. What's your story? Cambridge University Press will publish the book in 2019.
NASA Astrophysics Data System (ADS)
Gan, Wen-Cong; Shu, Fu-Wen
The quantum many-body problem with exponentially large degrees of freedom can be reduced to a tractable computational form by the neural network method [G. Carleo and M. Troyer, Science 355 (2017) 602, arXiv:1606.02318]. The power of the deep neural network (DNN) based on deep learning is clarified by mapping it to the renormalization group (RG), which may shed light on the holographic principle by identifying a sequence of RG transformations with the AdS geometry. In this paper, we show that any network which reflects the RG process has intrinsic hyperbolic geometry, and discuss the structure of entanglement encoded in the graph of the DNN. We find that the entanglement structure of the DNN is of Ryu-Takayanagi form. Based on these facts, we argue that the emergence of holographic gravitational theory is related to the deep learning process of the quantum field theory.
DSAP: deep-sequencing small RNA analysis pipeline.
Huang, Po-Jung; Liu, Yi-Chung; Lee, Chi-Ching; Lin, Wei-Chen; Gan, Richie Ruei-Chi; Lyu, Ping-Chiang; Tang, Petrus
2010-07-01
DSAP is an automated multiple-task web service designed to provide a total solution to analyzing deep-sequencing small RNA datasets generated by next-generation sequencing technology. DSAP uses a tab-delimited file as an input format, which holds the unique sequence reads (tags) and their corresponding number of copies generated by the Solexa sequencing platform. The input data will go through four analysis steps in DSAP: (i) cleanup: removal of adaptors and poly-A/T/C/G/N nucleotides; (ii) clustering: grouping of cleaned sequence tags into unique sequence clusters; (iii) non-coding RNA (ncRNA) matching: sequence homology mapping against a transcribed sequence library from the ncRNA database Rfam (http://rfam.sanger.ac.uk/); and (iv) known miRNA matching: detection of known miRNAs in miRBase (http://www.mirbase.org/) based on sequence homology. The expression levels corresponding to matched ncRNAs and miRNAs are summarized in multi-color clickable bar charts linked to external databases. DSAP is also capable of displaying miRNA expression levels from different jobs using a log(2)-scaled color matrix. Furthermore, a cross-species comparative function is also provided to show the distribution of identified miRNAs in different species as deposited in miRBase. DSAP is available at http://dsap.cgu.edu.tw.
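The first two steps described above (cleanup, then clustering of cleaned tags) can be sketched for a tab-delimited tag/count input as follows. This is a minimal illustration, not DSAP's code: the adaptor string is a made-up placeholder, and the rules used here (cut at the first adaptor occurrence, discard single-nucleotide reads, merge identical cleaned tags) are simplifying assumptions.

```python
from collections import Counter

ADAPTER = "TCGTATGCC"  # hypothetical 3' adaptor fragment, for illustration only


def clean_tag(seq, adapter=ADAPTER):
    """Trim from the first adaptor occurrence; drop empty or poly-A/T/C/G/N reads."""
    seq = seq.upper()
    cut = seq.find(adapter)
    if cut != -1:
        seq = seq[:cut]
    # A read made of one repeated nucleotide (or nothing) carries no signal.
    if not seq or len(set(seq)) == 1:
        return None
    return seq


def cluster_tags(lines, adapter=ADAPTER):
    """Merge copy numbers of tags that become identical after cleanup."""
    clusters = Counter()
    for line in lines:
        if not line.strip():
            continue
        seq, count = line.rstrip("\n").split("\t")
        cleaned = clean_tag(seq, adapter)
        if cleaned is not None:
            clusters[cleaned] += int(count)
    return clusters


if __name__ == "__main__":
    example = [
        "TGAGGTAGTAGGTTGTTCGTATGCC\t10",
        "TGAGGTAGTAGGTTGT\t5",
        "AAAAAAAAAA\t7",
    ]
    print(cluster_tags(example))
```

The resulting cluster counts are what the later matching steps against Rfam and miRBase would consume.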
Xiao, Chuan-Le; Mai, Zhi-Biao; Lian, Xin-Lei; Zhong, Jia-Yong; Jin, Jing-Jie; He, Qing-Yu; Zhang, Gong
2014-01-01
Correct and bias-free interpretation of deep sequencing data is inevitably dependent on the complete mapping of all mappable reads to the reference sequence, especially for quantitative RNA-seq applications. Seed-based algorithms are generally slow but robust, while Burrows-Wheeler Transform (BWT) based algorithms are fast but less robust. To have both advantages, we developed an algorithm, FANSe2, with an iterative mapping strategy based on the statistics of real-world sequencing error distribution to substantially accelerate the mapping without compromising the accuracy. Its sensitivity and accuracy are higher than those of the BWT-based algorithms in tests using both prokaryotic and eukaryotic sequencing datasets. The gene identification results of FANSe2 are experimentally validated, while the previous algorithms have false positives and false negatives. FANSe2 showed remarkably better consistency with the microarray than most other algorithms in terms of gene expression quantification. We implemented a scalable and almost maintenance-free parallelization method that can utilize the computational power of multiple office computers, a novel feature not present in any other mainstream algorithm. With three normal office computers, we demonstrated that FANSe2 mapped an RNA-seq dataset generated from an entire Illumina HiSeq 2000 flowcell (8 lanes, 608 M reads) to the masked human genome within 4.1 hours with higher sensitivity than Bowtie/Bowtie2. FANSe2 thus provides robust accuracy, full indel sensitivity, fast speed, versatile compatibility and economical computational utilization, making it a useful and practical tool for deep sequencing applications. FANSe2 is freely available at http://bioinformatics.jnu.edu.cn/software/fanse2/.
A visual tracking method based on deep learning without online model updating
NASA Astrophysics Data System (ADS)
Tang, Cong; Wang, Yicheng; Feng, Yunsong; Zheng, Chao; Jin, Wei
2018-02-01
The paper proposes a visual tracking method based on deep learning without online model updating. To exploit the advantages of deep learning in feature representation, the deep model SSD (Single Shot Multibox Detector) is used as the object extractor in the tracking model. The color histogram feature and the HOG (Histogram of Oriented Gradient) feature are combined to select the tracking object. During tracking, a multi-scale object searching map is built to improve both the detection performance of the deep detection model and the tracking efficiency. In experiments on eight tracking video sequences from the baseline dataset, compared with six state-of-the-art methods, the proposed method is more robust to challenging tracking factors such as deformation, scale variation, rotation variation, illumination variation, and background clutter, and its overall performance is better than that of the other six tracking methods.
Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg
2014-01-01
Reliable functional annotation of genomic data is the key step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophiles' genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxa of bacteria and archaea was selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable if at least two relevant descriptors (GO-terms and/or consensus patterns) are present.
Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629
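The final reliability rule of the PPM strategy (a hit counts as reliable when at least two relevant descriptors, GO-terms and/or consensus patterns, are present) can be sketched as a set operation. The data layout, function name and example identifiers below are illustrative assumptions; the authors' actual scripts are those distributed through the INDIGO website.

```python
def reliable_genes(profile_hits, pattern_hits, min_descriptors=2):
    """Keep genes supported by at least `min_descriptors` distinct descriptors.

    profile_hits: dict mapping gene id -> set of matched GO-terms.
    pattern_hits: dict mapping gene id -> set of matched PROSITE IDs.
    """
    reliable = {}
    for gene in set(profile_hits) | set(pattern_hits):
        # Pool descriptors from the profile filter and the pattern filter.
        descriptors = set(profile_hits.get(gene, ())) | set(pattern_hits.get(gene, ()))
        if len(descriptors) >= min_descriptors:
            reliable[gene] = descriptors
    return reliable


if __name__ == "__main__":
    # Gene ids and descriptor ids are placeholders chosen for the example.
    profile = {"gene_a": {"GO:0004252"}, "gene_b": {"GO:0004252"}}
    pattern = {"gene_a": {"PS00134"}, "gene_c": {"PS00134", "PS00135"}}
    print(sorted(reliable_genes(profile, pattern)))
```

Requiring two independent descriptors trades some sensitivity for fewer false positives, which matches the stated goal of filtering before labor-intensive expression work.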
Adapting to the Deep Sea: A Fun Activity with Bioluminescence
ERIC Educational Resources Information Center
Rife, Gwynne
2006-01-01
Over the past decade, much has been learned about the ocean's secrets and especially about the creatures of the deep sea. The deepest parts of the oceans are currently the focus of many new discoveries in both the physical and biological sciences. Middle school students find the deep sea fascinating and especially seem to enjoy its mysterious and…
novPTMenzy: a database for enzymes involved in novel post-translational modifications
Khater, Shradha; Mohanty, Debasisa
2015-01-01
With the recent discoveries of novel post-translational modifications (PTMs) which play important roles in signaling and biosynthetic pathways, identification of such PTM catalyzing enzymes by genome mining has been an area of major interest. Unlike well-known PTMs like phosphorylation, glycosylation, SUMOylation, no bioinformatics resources are available for enzymes associated with novel and unusual PTMs. Therefore, we have developed the novPTMenzy database which catalogs information on the sequence, structure, active site and genomic neighborhood of experimentally characterized enzymes involved in five novel PTMs, namely AMPylation, Eliminylation, Sulfation, Hydroxylation and Deamidation. Based on a comprehensive analysis of the sequence and structural features of these known PTM catalyzing enzymes, we have created Hidden Markov Model profiles for the identification of similar PTM catalyzing enzymatic domains in genomic sequences. We have also created predictive rules for grouping them into functional subfamilies and deciphering their mechanistic details by structure-based analysis of their active site pockets. These analytical modules have been made available as user friendly search interfaces of novPTMenzy database. It also has a specialized analysis interface for some PTMs like AMPylation and Eliminylation. The novPTMenzy database is a unique resource that can aid in discovery of unusual PTM catalyzing enzymes in newly sequenced genomes. Database URL: http://www.nii.ac.in/novptmenzy.html PMID:25931459
Distinguishing friends, foes, and freeloaders in giant genomes.
Bennetzen, Jeffrey L; Park, Minkyu
2018-04-01
Most annotations of large eukaryotic genomes initially find transposable elements (TEs) and other repeats, then mask them so that subsequent efforts can be concentrated on the annotation and study of non-TE genes. However, TEs often contribute to host biology, and their community biologies are of intrinsic interest. This review discusses the challenges, rationale and technologies for comprehensive TE annotation in the commonly giant genomes of animals and plants. Complete discovery of the TEs in a fully sequenced genome is laborious, but feasible, with current strategies in the hands of a careful researcher. These deep TE studies have begun to provide important perspectives on how genomes evolve and the degree to which genome changes do and do not affect eukaryotic biology. Copyright © 2018 The Authors. Published by Elsevier Ltd. All rights reserved.
An optimized protocol for generation and analysis of Ion Proton sequencing reads for RNA-Seq.
Yuan, Yongxian; Xu, Huaiqian; Leung, Ross Ka-Kit
2016-05-26
Previous studies compared running cost, time and other performance measures of popular sequencing platforms. However, a comprehensive assessment of library construction and analysis protocols for the Proton sequencing platform is still lacking. Unlike Illumina sequencing platforms, Proton reads are heterogeneous in length and quality. When sequencing data from different platforms are combined, this can result in reads of various lengths. Whether the commonly used software performs satisfactorily on such data is unknown. Using universal human reference RNA as the starting material, RNaseIII and chemical fragmentation methods in library construction showed similar results in the number of genes and junctions discovered and in the accuracy of expression level estimates. In contrast, sequencing quality, read length and the choice of software affected mapping rate to a much larger extent. The unspliced aligner TMAP attained the highest mapping rate (97.27% to genome, 86.46% to transcriptome), though 47.83% of mapped reads were clipped. Long reads could paradoxically reduce mapping in junctions. With reference annotation as a guide, the mapping rate of TopHat2 significantly increased from 75.79% to 92.09%, especially for long (>150 bp) reads. Sailfish, a k-mer based gene expression quantifier, attained results highly consistent with those of the TaqMan array and the highest sensitivity. We provide, for the first time, reference statistics of library preparation methods, gene detection and quantification, and junction discovery for RNA-Seq on the Ion Proton platform. Chemical fragmentation performed as well as the enzyme-based method. The optimal Ion Proton sequencing options and analysis software have been evaluated.
Osca, David; Templado, José; Zardoya, Rafael
2014-09-01
The complete nucleotide sequence of the mitochondrial (mt) genome of the deep-sea vent snail Ifremeria nautilei (Gastropoda: Abyssochrysoidea) was determined. The double stranded circular molecule is 15,664 bp in length and encodes the typical 37 metazoan mitochondrial genes. The gene arrangement of the Ifremeria mt genome is most similar to the genome organization of caenogastropods and differs only in the relative position of the trnW gene. The deduced amino acid sequences of the mt protein coding genes of the Ifremeria mt genome were aligned with orthologous sequences from representatives of the main lineages of gastropods and phylogenetic relationships were inferred. The reconstructed phylogeny supports that Ifremeria belongs to Caenogastropoda and that it is closely related to hypsogastropod superfamilies. Results were compared with a reconstructed nuclear-based phylogeny. Moreover, a relaxed molecular-clock timetree calibrated with fossils dated the divergence of Abyssochrysoidea in the Late Jurassic-Early Cretaceous, indicating a relatively modern colonization of deep-sea environments by these snails. Copyright © 2014 Elsevier B.V. All rights reserved.
DNA Cryptography and Deep Learning using Genetic Algorithm with NW algorithm for Key Generation.
Kalsi, Shruti; Kaur, Harleen; Chang, Victor
2017-12-05
Cryptography is the science of applying complex mathematics and logic to design strong methods both to hide data, called encryption, and to retrieve the original data, called decryption. The purpose of cryptography is to transmit a message between a sender and a receiver such that an eavesdropper is unable to comprehend it. To accomplish this, we need not only a strong algorithm but also a strong key and a strong concept for the encryption and decryption process. We introduce the concept of DNA Deep Learning Cryptography, defined as a technique for concealing data in terms of DNA sequences and deep learning. In this cryptographic technique, each letter of the alphabet is converted into a different combination of the four bases, namely Adenine (A), Cytosine (C), Guanine (G) and Thymine (T), which make up human deoxyribonucleic acid (DNA). Actual implementations with DNA do not go beyond the laboratory level and are expensive. To bring DNA computing to a digital level, easy and effective algorithms are proposed in this paper. In the proposed work we introduce, first, a method and its implementation for key generation based on the theory of natural selection, using a Genetic Algorithm with the Needleman-Wunsch (NW) algorithm, and second, a method for implementing encryption and decryption based on DNA computing using the biological operations Transcription, Translation, DNA Sequencing and Deep Learning.
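The letter-to-base conversion described above can be illustrated with a minimal sketch. The abstract does not give the authors' conversion table, so the 2-bit mapping below (A=00, C=01, G=10, T=11) and the function names `text_to_dna` and `dna_to_text` are assumptions chosen only for illustration; key generation and the deep learning step are not covered.

```python
# Hypothetical 2-bit mapping: A=00, C=01, G=10, T=11.
# This is an illustrative choice, not the table used in the paper.
BASES = "ACGT"


def text_to_dna(text: str) -> str:
    """Encode the UTF-8 bytes of `text` as DNA bases, four bases per byte."""
    out = []
    for byte in text.encode("utf-8"):
        # Emit the byte two bits at a time, most significant pair first.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna: str) -> str:
    """Invert text_to_dna: every four bases are packed back into one byte."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")
```

On its own such a mapping is only an encoding and provides no secrecy; in the scheme described above the security would have to come from the key and the subsequent operations.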
Cai, Congbo; Wang, Chao; Zeng, Yiqing; Cai, Shuhui; Liang, Dong; Wu, Yawen; Chen, Zhong; Ding, Xinghao; Zhong, Jianhui
2018-04-24
An end-to-end deep convolutional neural network (CNN) based on the deep residual network (ResNet) was proposed to efficiently reconstruct reliable T2 mapping from single-shot overlapping-echo detachment (OLED) planar imaging. The training dataset was obtained from simulations carried out with the SPROM (Simulation with PRoduct Operator Matrix) software developed by our group. The relationship between the original OLED image containing two echo signals and the corresponding T2 mapping was learned by ResNet training. After the ResNet was trained, it was applied to reconstruct the T2 mapping from simulation and in vivo human brain data. Although the ResNet was trained entirely on simulated data, the trained network generalized well to real human brain data. The results from simulation and in vivo human brain experiments show that the proposed method significantly outperforms the echo-detachment-based method. Reliable T2 mapping with higher accuracy is achieved within 30 ms after the network has been trained, while the echo-detachment-based OLED reconstruction method took approximately 2 min. The proposed method will facilitate real-time dynamic and quantitative MR imaging via the OLED sequence, and deep convolutional neural networks have the potential to reconstruct maps from complex MRI sequences efficiently. © 2018 International Society for Magnetic Resonance in Medicine.
[GNU Pattern: open source pattern hunter for biological sequences based on SPLASH algorithm].
Xu, Ying; Li, Yi-xue; Kong, Xiang-yin
2005-06-01
To construct a high-performance open source software engine based on the IBM SPLASH algorithm for later research on pattern discovery. Gpat, which is based on the SPLASH algorithm, was developed using open source software. The GNU Pattern (Gpat) software was developed, and it efficiently implements the core part of the SPLASH algorithm. The full source code of Gpat is available for other researchers to modify under the GNU license. Gpat is a successful implementation of the SPLASH algorithm and can be used as a basic framework for later research on pattern recognition in biological sequences.
The ICDP Dead Sea deep drill cores: records of climate change and tectonics in the Levant
NASA Astrophysics Data System (ADS)
Goldstein, S. L.; Stein, M.; Ben-Avraham, Z.; Agnon, A.; Ariztegui, D.; Brauer, A.; Haug, G. H.; Ito, E.; Kitagawa, H.; Torfstein, A.
2012-12-01
The Dead Sea drainage basin sits at the boundary of the Mediterranean and the Saharan climate zones, and the basin is formed by the Dead Sea transform fault. The ICDP-funded Dead Sea Deep Drilling Project recovered the longest and most complete paleo-environmental and paleo-seismic record in the Middle East, drilling holes of ~450 and ~350 meters at deep (~300 m below the lake level) and shallow (~3 m) sites, respectively. The sediments record the evolving environmental conditions (e.g. droughts, rains, floods, dust-storms), as well as tectonics (earthquake layers). The cores can be dated using 14C on organic materials, U-Th on inorganic aragonite, stable isotopes, and layer counting. They were opened, described, and XRF-scanned during June to November 2011, the first sampling party took place in July 2012, and study is now underway. Some important conclusions can already be drawn. The stratigraphy reflects the climate conditions. During wet climate intervals the lithology is typically varve-like laminated aragonite and detritus (aad), reflecting summer and winter seasons, respectively, and sequences of mud. Gypsum layers reflect more arid climate, and salt (halite) indicates extreme aridity. The Dead Sea expands during glacials, and the portion of the core that corresponds to the last glacial Lisan Formation above the shoreline is easily recognized in the core based on the common lithological sequence, and this allows us to infer a broad scale age model. Interglacials show all the lithologic facies (aad, mud, gypsum, salt), reflecting extreme climate variability, while glacials contain the aad, mud, and gypsum but lack salt layers. Thus we estimate that the deep site hole extends into MIS 7 (to ~200,000 years). Thin (up to several cm thick) seismic layers occur throughout the core, but thick (up to several meters) landslide deposits only occur during glacial intervals.
The most dramatic discovery is evidence of an extreme dry interval during MIS 5 at the deep site. There is a ~40 cm thick interval of partly rounded pebbles in the core at ~235 m below the lake floor. It is the only clean pebbly unit in the core, and resembles a beach deposit. Below the layer there is ~45 meters of mainly salt. These observations indicate a severe dry interval during MIS 5. This observation has implications for the Middle East today, where the Dead Sea level is dropping at rates >1m/year, as all the countries in the area are using all the runoff. GCM models indicate a more arid future in the region. The core shows that the runoff nearly stopped during the last interglacial without human intervention. Dating is underway to constrain the timing of the extreme drydown.
Aliper, Alexander; Plis, Sergey; Artemov, Artem; Ulloa, Alvaro; Mamoshina, Polina; Zhavoronkov, Alex
2016-07-05
Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics, and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7, and PC-3 cell lines from the LINCS Project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both pathway and gene level classification, DNN achieved high classification accuracy and convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem, however, models based on pathway level data performed significantly better. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development.
Aliper, Alexander; Plis, Sergey; Artemov, Artem; Ulloa, Alvaro; Mamoshina, Polina; Zhavoronkov, Alex
2016-01-01
Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF‐7 and PC‐3 cell lines from the LINCS project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled dataset of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both gene and pathway level classification, DNN convincingly outperformed support vector machine (SVM) model on every multiclass classification problem, however, models based on a pathway level classification perform better. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development. PMID:27200455
[Artificial Intelligence in Drug Discovery].
Fujiwara, Takeshi; Kamada, Mayumi; Okuno, Yasushi
2018-04-01
With the increase in data generated by analytical instruments, the application of artificial intelligence (AI) technology in the medical field has become indispensable. In particular, practical application of AI technology is strongly required in "genomic medicine" and "genomic drug discovery", which conduct medical practice and novel drug development based on individual genomic information. In our laboratory, we have been developing a database to integrate genome data and clinical information obtained by clinical genome analysis, and a computational support system for clinical interpretation of variants using AI. In addition, with the aim of creating new therapeutic targets in genomic drug discovery, we have also been working on the development of a binding affinity prediction system for mutated proteins and drugs by molecular dynamics simulation using the supercomputer "Kei". We have also tackled problems in virtual drug screening. Our AI technology has successfully generated a virtual compound library, and a deep learning method has enabled us to predict interactions between compounds and target proteins.
Recurrent neural networks for breast lesion classification based on DCE-MRIs
NASA Astrophysics Data System (ADS)
Antropova, Natasha; Huynh, Benjamin; Giger, Maryellen
2018-02-01
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays a significant role in breast cancer screening, cancer staging, and monitoring response to therapy. Recently, deep learning methods have been rapidly incorporated into image-based breast cancer diagnosis and prognosis. However, most current deep learning methods make clinical decisions based on 2-dimensional (2D) or 3D images and are not well suited for temporal image data. In this study, we develop a deep learning methodology that enables integration of clinically valuable temporal components of DCE-MRIs into deep learning-based lesion classification. Our work is performed on a database of 703 DCE-MRI cases for the task of distinguishing benign and malignant lesions, and uses the area under the ROC curve (AUC) as the performance metric for that task. We train a recurrent neural network, specifically a long short-term memory network (LSTM), on sequences of image features extracted from the dynamic MRI sequences. These features are extracted with VGGNet, a convolutional neural network pre-trained on ImageNet, a large dataset of natural images. The features are obtained from various levels of the network to capture low-, mid-, and high-level information about the lesion. Compared to a classification method that takes as input only images at a single time-point (yielding an AUC = 0.81 (se = 0.04)), our LSTM method improves lesion classification with an AUC of 0.85 (se = 0.03).
SD-MSAEs: Promoter recognition in human genome based on deep feature extraction.
Xu, Wenxuan; Zhang, Li; Lu, Yaping
2016-06-01
The prediction and recognition of promoters in the human genome play an important role in DNA sequence analysis. Entropy, in the Shannon sense of information theory, is a versatile tool in detailed bioinformatic analysis. Relative entropy estimator methods based on statistical divergence (SD) are used to extract meaningful features that distinguish different regions of DNA sequences. In this paper, we choose context features and use a set of SD methods to select the most effective n-mers for distinguishing promoter regions from other DNA regions in the human genome. From the total possible combinations of n-mers, we obtain four sparse distributions based on promoter and non-promoter training samples. The informative n-mers are selected by optimizing the degree to which these distributions differ. Specifically, we combine the advantages of statistical divergence and multiple sparse auto-encoders (MSAEs) in deep learning to extract deep features for promoter recognition. We then apply multiple SVMs and a decision model to construct a human promoter recognition method called SD-MSAEs. The framework is flexible in that it can freely integrate new feature extraction or new classification models. Experimental results show that our method has high sensitivity and specificity. Copyright © 2016 Elsevier Inc. All rights reserved.
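As a rough illustration of divergence-based n-mer selection, the sketch below ranks n-mers by their per-term contribution to the Kullback-Leibler divergence between a promoter distribution and a background distribution. It is not the SD-MSAEs procedure: the abstract does not state which divergences or optimisation the authors use, and the function names, the pseudocount smoothing and the toy sequences are assumptions.

```python
import math
from collections import Counter
from itertools import product


def nmer_distribution(seqs, n, pseudocount=1.0):
    """Smoothed relative frequency of every possible DNA n-mer in `seqs`."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    nmers = ["".join(p) for p in product("ACGT", repeat=n)]
    # Only n-mers over A/C/G/T are counted; the pseudocount avoids zeros.
    total = sum(counts[m] for m in nmers) + pseudocount * len(nmers)
    return {m: (counts[m] + pseudocount) / total for m in nmers}


def kl_contributions(p, q):
    """Per-n-mer terms p*log2(p/q); their sum is the KL divergence D(p||q)."""
    return {m: p[m] * math.log2(p[m] / q[m]) for m in p}


def top_nmers(promoters, background, n, k):
    """Return the k n-mers that contribute most to D(promoter||background)."""
    p = nmer_distribution(promoters, n)
    q = nmer_distribution(background, n)
    contrib = kl_contributions(p, q)
    return sorted(contrib, key=contrib.get, reverse=True)[:k]
```

For example, `top_nmers(["TATATATA"] * 3, ["GCGCGCGC"] * 3, n=2, k=2)` ranks `TA` and `AT` first, since those dinucleotides are frequent in the first set and absent from the second.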
[The nineteenth century roots of the contemporary biological revolution].
Swynghedauw, Bernard
2006-01-01
The recent publication of the human genomic sequence is the most important progress in biology. It originates from four major watersheds between 1860 and 1865, namely biological evolution by Darwin in 1858, Mendel's laws of heredity in 1865, the basis of physiology established by Claude Bernard also in 1865, and the discoveries of microbacteria by Louis Pasteur around 1857. Before 1860, biology did not exist as a science. After 1860, Darwin's theory progressively became a law after the discovery of DNA polymorphism and of the mechanisms of genetic mixing. Mendel's laws were confirmed in parallel with the development of molecular genetics after the discovery of the DNA structure and the genetic code. The discovery of hormones is one example, amongst several, of how integrative physiology builds on Claude Bernard's basis. Finally, based on Pasteur's discovery and the Pasteur Institutes, microbiology became a tool for molecular biologists.
Fingerprints of Modified RNA Bases from Deep Sequencing Profiles.
Kietrys, Anna M; Velema, Willem A; Kool, Eric T
2017-11-29
Posttranscriptional modifications of RNA bases are not only found in many noncoding RNAs but have also recently been identified in coding (messenger) RNAs as well. They require complex and laborious methods to locate, and many still lack methods for localized detection. Here we test the ability of next-generation sequencing (NGS) to detect and distinguish between ten modified bases in synthetic RNAs. We compare ultradeep sequencing patterns of modified bases, including miscoding, insertions and deletions (indels), and truncations, to unmodified bases in the same contexts. The data show widely varied responses to modification, ranging from no response, to high levels of mutations, insertions, deletions, and truncations. The patterns are distinct for several of the modifications, and suggest the future use of ultradeep sequencing as a fingerprinting strategy for locating and identifying modifications in cellular RNAs.
Bidlingmaier, Scott; Ha, Kevin; Lee, Nam-Kyung; Su, Yang; Liu, Bin
2016-04-01
Although the bioactive sphingolipid ceramide is an important cell signaling molecule, relatively few direct ceramide-interacting proteins are known. We used an approach combining yeast surface cDNA display and deep sequencing technology to identify novel proteins binding directly to ceramide. We identified 234 candidate ceramide-binding protein fragments and validated binding for 20. Most (17) bound selectively to ceramide, although a few (3) bound to other lipids as well. Several novel ceramide-binding domains were discovered, including the EF-hand calcium-binding motif, the heat shock chaperonin-binding motif STI1, the SCP2 sterol-binding domain, and the tetratricopeptide repeat region motif. Interestingly, four of the verified ceramide-binding proteins (HPCA, HPCAL1, NCS1, and VSNL1) and an additional three candidate ceramide-binding proteins (NCALD, HPCAL4, and KCNIP3) belong to the neuronal calcium sensor family of EF hand-containing proteins. We used mutagenesis to map the ceramide-binding site in HPCA and to create a mutant HPCA that does not bind to ceramide. We demonstrated selective binding to ceramide by mammalian cell-produced wild type but not mutant HPCA. Intriguingly, we also identified a fragment from prostaglandin D2 synthase that binds preferentially to ceramide 1-phosphate. The wide variety of proteins and domains capable of binding to ceramide suggests that many of the signaling functions of ceramide may be regulated by direct binding to these proteins. Based on the deep sequencing data, we estimate that our yeast surface cDNA display library covers ∼60% of the human proteome and our selection/deep sequencing protocol can identify target-interacting protein fragments that are present at extremely low frequency in the starting library. Thus, the yeast surface cDNA display/deep sequencing approach is a rapid, comprehensive, and flexible method for the analysis of protein-ligand interactions, particularly for the study of non-protein ligands.
© 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update.
Huynh, Tien; Rigoutsos, Isidore
2004-07-01
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification--directly from sequence--of structural deviations from alpha-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Deep neural networks to enable real-time multimessenger astrophysics
NASA Astrophysics Data System (ADS)
George, Daniel; Huerta, E. A.
2018-02-01
Gravitational wave astronomy has set in motion a scientific revolution. To further enhance the science reach of this emergent field of research, there is a pressing need to increase the depth and speed of the algorithms used to enable these ground-breaking discoveries. We introduce Deep Filtering—a new scalable machine learning method for end-to-end time-series signal processing. Deep Filtering is based on deep learning with two deep convolutional neural networks, which are designed for classification and regression, to detect gravitational wave signals in highly noisy time-series data streams and also estimate the parameters of their sources in real time. Acknowledging that some of the most sensitive algorithms for the detection of gravitational waves are based on implementations of matched filtering, and that a matched filter is the optimal linear filter in Gaussian noise, the application of Deep Filtering using whitened signals in Gaussian noise is investigated in this foundational article. The results indicate that Deep Filtering outperforms conventional machine learning techniques, achieves similar performance compared to matched filtering, while being several orders of magnitude faster, allowing real-time signal processing with minimal resources. Furthermore, we demonstrate that Deep Filtering can detect and characterize waveform signals emitted from new classes of eccentric or spin-precessing binary black holes, even when trained with data sets of only quasicircular binary black hole waveforms. The results presented in this article, and the recent use of deep neural networks for the identification of optical transients in telescope data, suggests that deep learning can facilitate real-time searches of gravitational wave sources and their electromagnetic and astroparticle counterparts. In the subsequent article, the framework introduced herein is directly applied to identify and characterize gravitational wave events in real LIGO data.
2011-01-01
Background Readthrough fusions across adjacent genes in the genome, or transcription-induced chimeras (TICs), have been estimated using expressed sequence tag (EST) libraries to involve 4-6% of all genes. Deep transcriptional sequencing (RNA-Seq) now makes it possible to study the occurrence and expression levels of TICs in individual samples across the genome. Methods We performed single-end RNA-Seq on three human prostate adenocarcinoma samples and their corresponding normal tissues, as well as brain and universal reference samples. We developed two bioinformatics methods to specifically identify TIC events: a targeted alignment method using artificial exon-exon junctions within 200,000 bp from adjacent genes, and genomic alignment allowing splicing within individual reads. We performed further experimental verification and characterization of selected TIC and fusion events using quantitative RT-PCR and comparative genomic hybridization microarrays. Results Targeted alignment against artificial exon-exon junctions yielded 339 distinct TIC events, including 32 gene pairs with multiple isoforms. The false discovery rate was estimated to be 1.5%. Spliced alignment to the genome was less sensitive, finding only 18% of those found by targeted alignment in 33-nt reads and 59% of those in 50-nt reads. However, spliced alignment revealed 30 cases of TICs with intervening exons, in addition to distant inversions, scrambled genes, and translocations. Our findings increase the catalog of observed TIC gene pairs by 66%. We verified 6 of 6 predicted TICs in all prostate samples, and 2 of 5 predicted novel distant gene fusions, both private events among 54 prostate tumor samples tested. 
Expression of TICs correlates with that of the upstream gene, which can explain the prostate-specific pattern of some TIC events and the restriction of the SLC45A3-ELK4 e4-e2 TIC to ERG-negative prostate samples, as confirmed in 20 matched prostate tumor and normal samples and 9 lung cancer cell lines. Conclusions Deep transcriptional sequencing and analysis with targeted and spliced alignment methods can effectively identify TIC events across the genome in individual tissues. Prostate and reference samples exhibit a wide range of TIC events, involving more genes than estimated previously using ESTs. Tissue specificity of TIC events is correlated with expression patterns of the upstream gene. Some TIC events, such as MSMB-NCOA4, may play functional roles in cancer. PMID:21261984
Theoretical and observational planetary physics
NASA Technical Reports Server (NTRS)
Caldwell, J.
1986-01-01
This program supports NASA's deep space exploration missions, particularly those to the outer Solar System, and also NASA's Earth-orbital astronomy missions, using ground-based observations, primarily with the NASA IRTF at Mauna Kea, Hawaii, and also with such instruments as the Kitt Peak 4 meter Mayall telescope and the NRAO VLA facility in Socorro, New Mexico. An important component of the program is the physical interpretation of the observations. There were two major scientific discoveries resulting from 8 micrometer observations of Jupiter. The first is that at that wavelength there are two spots, one near each magnetic pole, which are typically the brightest and therefore warmest places on the planet. The effect is clearly due to precipitating high energy magnetospheric particles. A second ground-based discovery is that in 1985, Jupiter exhibited low latitude (+ or - 18 deg.) stratospheric wave structure.
Low-Latency Telerobotic Sample Return and Biomolecular Sequencing for Deep Space Gateway
NASA Astrophysics Data System (ADS)
Lupisella, M.; Bleacher, J.; Lewis, R.; Dworkin, J.; Wright, M.; Burton, A.; Rubins, K.; Wallace, S.; Stahl, S.; John, K.; Archer, D.; Niles, P.; Regberg, A.; Smith, D.; Race, M.; Chiu, C.; Russell, J.; Rampe, E.; Bywaters, K.
2018-02-01
Low-latency telerobotics, crew-assisted sample return, and biomolecular sequencing can be used to acquire and analyze lunar farside and/or Apollo landing site samples. Sequencing can also be used to monitor and study Deep Space Gateway environment and crew health.
Effects of hydrostatic pressure on yeasts isolated from deep-sea hydrothermal vents.
Burgaud, Gaëtan; Hué, Nguyen Thi Minh; Arzur, Danielle; Coton, Monika; Perrier-Cornet, Jean-Marie; Jebbar, Mohamed; Barbier, Georges
2015-11-01
Hydrostatic pressure plays a significant role in the distribution of life in the biosphere. Knowledge of deep-sea piezotolerant and (hyper)piezophilic bacteria and archaea diversity has been well documented, along with their specific adaptations to cope with high hydrostatic pressure (HHP). Recent investigations of deep-sea microbial community compositions have shown unexpected micro-eukaryotic communities, mainly dominated by fungi. Molecular methods such as next-generation sequencing have been used for SSU rRNA gene sequencing to reveal fungal taxa. Currently, a difficult but fascinating challenge for marine mycologists is to create deep-sea marine fungus culture collections and assess their ability to cope with pressure. Indeed, although there is no universal genetic marker for piezoresistance, physiological analyses provide concrete relevant data for estimating their adaptations and understanding the role of fungal communities in the abyss. The present study investigated morphological and physiological responses of fungi to HHP using a collection of deep-sea yeasts as a model. The aim was to determine whether deep-sea yeasts were able to tolerate different HHP and if they were metabolically active. Here we report an unexpected taxonomic-based dichotomic response to pressure with piezosensitve ascomycetes and piezotolerant basidiomycetes, and distinct morphological switches triggered by pressure for certain strains. Copyright © 2015 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved.
Deep Packet/Flow Analysis using GPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gong, Qian; Wu, Wenji; DeMar, Phil
Deep packet inspection (DPI) faces severe performance challenges in high-speed networks (40/100 GE) as it requires a large amount of raw computing power and high I/O throughput. Recently, researchers have tentatively used GPUs to address these issues and boost the performance of DPI. Typically, DPI applications involve highly complex operations at both the per-packet and per-flow data levels, often in real time. The parallel architecture of GPUs fits per-packet network traffic processing exceptionally well. However, for stateful network protocols such as TCP, data streams need to be reconstructed at the per-flow level to deliver a consistent content analysis. Since the flow-centric operations are naturally antiparallel and often require large memory space for buffering out-of-sequence packets, they can be problematic for GPUs, whose memory is normally limited to several gigabytes. In this work, we present a highly efficient GPU-based deep packet/flow analysis framework. The proposed design includes purely GPU-implemented flow tracking and TCP stream reassembly. Instead of buffering and waiting for TCP packets to become in sequence, our framework processes the packets in batches and uses a deterministic finite automaton (DFA) with a prefix-/suffix-tree method to detect patterns across out-of-sequence packets that happen to be located in different batches. In conclusion, evaluation shows that our code can reassemble and forward tens of millions of packets per second and conduct stateful signature-based deep packet inspection at 55 Gbit/s using an NVIDIA K40 GPU.
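The idea of matching a signature that straddles packet boundaries can be sketched on the CPU with a much simpler scheme: keep one DFA state per flow and resume from it when the next segment of that flow arrives. This is only an illustration under strong assumptions. It handles segments that arrive in order within a flow and a single signature, and it does not reproduce the GPU implementation or the prefix-/suffix-tree handling of out-of-sequence packets described above. The names `build_dfa` and `FlowScanner` are hypothetical.

```python
def build_dfa(pattern: bytes):
    """Build a Knuth-Morris-Pratt automaton: dfa[state][byte] -> next state.

    Missing entries mean "go back to state 0".
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached after feeding pattern[1:j] to the automaton
    for j in range(1, m + 1):
        dfa[j].update(dfa[restart])          # mismatch transitions
        if j < m:
            dfa[j][pattern[j]] = j + 1       # match transition
            restart = dfa[restart].get(pattern[j], 0)
    return dfa


class FlowScanner:
    """Scan per-flow payloads, remembering the DFA state between segments."""

    def __init__(self, pattern: bytes):
        self.dfa = build_dfa(pattern)
        self.accept = len(pattern)
        self.state = {}  # flow id -> DFA state after the last segment

    def feed(self, flow_id, payload: bytes) -> int:
        """Process one in-order segment; return matches that end inside it."""
        s = self.state.get(flow_id, 0)
        hits = 0
        for b in payload:
            s = self.dfa[s].get(b, 0)
            if s == self.accept:
                hits += 1
        self.state[flow_id] = s
        return hits
```

Because only an integer state is stored per flow, no payload bytes need to be buffered between in-order segments; handling reordered segments requires additional machinery such as the method the abstract describes.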
NASA Astrophysics Data System (ADS)
Pasquale, V.; Chiozzi, P.; Verdoya, M.
2013-05-01
Temperatures recorded in wells as deep as 6 km drilled for hydrocarbon prospecting were used together with geological information to depict the thermal regime of the sedimentary sequence of the eastern sector of the Po Plain. After correction for drilling disturbance, temperature data were analyzed through an inversion technique based on a laterally constant thermal gradient model. The obtained thermal gradient is quite low within the deep carbonate unit (14 mK m⁻¹), while it is larger (53 mK m⁻¹) in the overlying impermeable formations. In the uppermost sedimentary layers, the thermal gradient is close to the regional average (21 mK m⁻¹). We argue that such a vertical change cannot be ascribed to thermal conductivity variation within the sedimentary sequence, but to deep groundwater flow. Since the hydrogeological characteristics (including litho-stratigraphic sequence and structural setting) hardly permit forced convection, we suggest that thermal convection might occur within the deep carbonate aquifer. The potential of this mechanism was evaluated by means of the Rayleigh number analysis. It turned out that permeability required for convection to occur must be larger than 3 × 10⁻¹⁵ m². The average over-heat ratio is 0.45. The lateral variation of hydrothermal regime was tested by using temperature data representing the aquifer thermal conditions. We found that thermal convection might be more developed and variable at the Ferrara High and its surroundings, where widespread fracturing may have increased permeability.
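For orientation, a Rayleigh number analysis of this kind is commonly written in the following textbook form for a horizontal, fluid-saturated porous layer. The abstract does not give the authors' exact formulation or parameter values, so the symbols below are assumptions: \rho, \mu and \alpha are the fluid density, dynamic viscosity and thermal expansion coefficient, g is gravitational acceleration, k the permeability, H the layer thickness, \Delta T the temperature difference across the layer, and \kappa the effective thermal diffusivity.

```latex
\mathrm{Ra} = \frac{\rho \, g \, \alpha \, k \, H \, \Delta T}{\mu \, \kappa},
\qquad
\mathrm{Ra} > \mathrm{Ra}_c = 4\pi^2 \approx 39.5
\;\;\Longrightarrow\;\;
k > k_{\min} = \frac{4\pi^2 \, \mu \, \kappa}{\rho \, g \, \alpha \, H \, \Delta T}
```

The critical value 4\pi^2 holds for a layer with impermeable, isothermal top and bottom boundaries; other boundary conditions give a different threshold, so a minimum permeability derived this way depends on the assumed geometry as well as on the fluid and rock properties.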
Zhang, Likui; Kang, Manyu; Huang, Yangchao; Yang, Lixiang
2016-05-01
The diversity and ecological significance of bacteria and archaea in deep-sea environments have been thoroughly investigated, but eukaryotic microorganisms in these areas, such as fungi, are poorly understood. To elucidate fungal diversity in calcareous deep-sea sediments in the Southwest India Ridge (SWIR), the internal transcribed spacer (ITS) regions of rRNA genes from two sediment metagenomic DNA samples were amplified and sequenced using the Illumina sequencing platform. The results revealed that 58-63 % and 36-42 % of the ITS sequences (97 % similarity) belonged to Basidiomycota and Ascomycota, respectively. These findings suggest that Basidiomycota and Ascomycota are the predominant fungal phyla in the two samples. We also found that Agaricomycetes, Leotiomycetes, and Pezizomycetes were the major fungal classes in the two samples. At the species level, Thelephoraceae sp. and Phialocephala fortinii were major fungal species in the two samples. Despite the low relative abundance, unidentified fungal sequences were also observed in the two samples. Furthermore, we found that there were slight differences in fungal diversity between the two sediment samples, although both were collected from the SWIR. Thus, our results demonstrate that calcareous deep-sea sediments in the SWIR harbor diverse fungi, which augment the fungal groups in deep-sea sediments. This is the first report of fungal communities in calcareous deep-sea sediments in the SWIR revealed by Illumina sequencing.
Advancements in Aptamer Discovery Technologies.
Gotrik, Michael R; Feagin, Trevor A; Csordas, Andrew T; Nakamoto, Margaret A; Soh, H Tom
2016-09-20
Affinity reagents that specifically bind to their target molecules are invaluable tools in nearly every field of modern biomedicine. Nucleic acid-based aptamers offer many advantages in this domain, because they are chemically synthesized, stable, and economical. Despite these compelling features, aptamers are currently not widely used in comparison to antibodies. This is primarily because conventional aptamer-discovery techniques such as SELEX are time-consuming and labor-intensive and often fail to produce aptamers with comparable binding performance to antibodies. This Account describes a body of work from our laboratory in developing advanced methods for consistently producing high-performance aptamers with higher efficiency, fewer resources, and, most importantly, a greater probability of success. We describe our efforts in systematically transforming each major step of the aptamer discovery process: selection, analysis, and characterization. To improve selection, we have developed microfluidic devices (M-SELEX) that enable discovery of high-affinity aptamers after a minimal number of selection rounds by precisely controlling the target concentration and washing stringency. In terms of improving aptamer pool analysis, our group was the first to use high-throughput sequencing (HTS) for the discovery of new aptamers. We showed that tracking the enrichment trajectory of individual aptamer sequences enables the identification of high-performing aptamers without requiring full convergence of the selected aptamer pool. HTS is now widely used for aptamer discovery, and open-source software has become available to facilitate analysis. To improve binding characterization, we used HTS data to design custom aptamer arrays to measure the affinity and specificity of up to ∼10⁴ DNA aptamers in parallel as a means to rapidly discover high-quality aptamers. 
Most recently, our efforts have culminated in the invention of the "particle display" (PD) screening system, which transforms solution-phase aptamers into "aptamer particles" that can be individually screened at high-throughput via fluorescence-activated cell sorting. Using PD, we have shown the feasibility of rapidly generating aptamers with exceptional affinities, even for proteins that have previously proven intractable to aptamer discovery. We are confident that these advanced aptamer-discovery methods will accelerate the discovery of aptamer reagents with excellent affinities and specificities, perhaps even exceeding those of the best monoclonal antibodies. Since aptamers are reproducible, renewable, stable, and can be distributed as sequence information, we anticipate that these affinity reagents will become even more valuable tools for both research and clinical applications.
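The idea of tracking the enrichment of individual sequences between selection rounds can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: the function names, the pseudo-frequency used for sequences unseen in the earlier round, and the toy reads are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def frequencies(reads: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each distinct sequence in one sequenced round."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def fold_enrichment(early: Iterable[str], late: Iterable[str],
                    pseudo: float = 1e-6) -> Dict[str, float]:
    """Ratio of late-round to early-round frequency for each late-round sequence.

    `pseudo` is a hypothetical floor frequency used when a sequence was not
    observed in the early round, to avoid division by zero.
    """
    f_early = frequencies(early)
    f_late = frequencies(late)
    return {seq: f / f_early.get(seq, pseudo) for seq, f in f_late.items()}


if __name__ == "__main__":
    # Toy reads: "AAA" grows from 10% to 50% of the pool between rounds.
    round_2 = ["AAA"] * 1 + ["CCC"] * 9
    round_4 = ["AAA"] * 5 + ["CCC"] * 5
    enrichment = fold_enrichment(round_2, round_4)
    best = max(enrichment, key=enrichment.get)
    print(best, round(enrichment[best], 2))
```

Ranking by fold change rather than by final abundance is what allows candidates to be picked before the pool has converged.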
Transcriptome assembly and digital gene expression atlas of the rainbow trout
USDA-ARS's Scientific Manuscript database
Background: Transcriptome analysis is a preferred method for gene discovery, marker development and gene expression profiling in non-model organisms. Previously, we sequenced a transcriptome reference using Sanger-based and 454-pyrosequencing, however, a transcriptome assembly is still incomplete an...
HomozygosityMapper2012--bridging the gap between homozygosity mapping and deep sequencing.
Seelow, Dominik; Schuelke, Markus
2012-07-01
Homozygosity mapping is a common method to map recessive traits in consanguineous families. To facilitate these analyses, we have developed HomozygosityMapper, a web-based approach to homozygosity mapping. HomozygosityMapper allows researchers to directly upload the genotype files produced by the major genotyping platforms as well as deep sequencing data. It detects stretches of homozygosity shared by the affected individuals and displays them graphically. Users can interactively inspect the underlying genotypes, manually refine these regions and eventually submit them to our candidate gene search engine GeneDistiller to identify the most promising candidate genes. Here, we present the new version of HomozygosityMapper. The most striking new feature is the support of Next Generation Sequencing *.vcf files as input. Upon users' requests, we have implemented the analysis of common experimental rodents as well as of important farm animals. Furthermore, we have extended the options for single families and loss of heterozygosity studies. Another new feature is the export of *.bed files for targeted enrichment of the potential disease regions for deep sequencing strategies. HomozygosityMapper also generates files for conventional linkage analyses which are already restricted to the possible disease regions, hence superseding CPU-intensive genome-wide analyses. HomozygosityMapper is freely available at http://www.homozygositymapper.org/.
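The core step described here, finding stretches of homozygosity shared by all affected individuals, can be illustrated with a simplified sketch. It is not HomozygosityMapper's actual algorithm: the genotype encoding ("AA", "AB", "BB"), the `min_markers` threshold and the sample names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def shared_homozygous_runs(genotypes: Dict[str, List[str]],
                           min_markers: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) marker index ranges in which every
    individual is homozygous for the same allele at every marker."""
    samples = list(genotypes.values())
    n_markers = len(samples[0])
    runs: List[Tuple[int, int]] = []
    start = None
    for i in range(n_markers):
        calls = {sample[i] for sample in samples}
        shared = len(calls) == 1 and next(iter(calls)) in ("AA", "BB")
        if shared:
            if start is None:
                start = i  # open a new run
        else:
            if start is not None and i - start >= min_markers:
                runs.append((start, i - 1))
            start = None  # close any open run
    # Close a run that extends to the last marker.
    if start is not None and n_markers - start >= min_markers:
        runs.append((start, n_markers - 1))
    return runs


if __name__ == "__main__":
    affected = {
        "patient_1": ["AB", "AA", "AA", "BB", "AA", "AB", "BB", "BB"],
        "patient_2": ["AA", "AA", "AA", "BB", "AA", "AA", "BB", "AB"],
    }
    print(shared_homozygous_runs(affected))  # [(1, 4)]
```

A real analysis would work in genomic coordinates and tolerate genotyping errors; the index-based version above only shows the run-detection logic.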
Maximum entropy methods for extracting the learned features of deep neural networks.
Finnegan, Alex; Song, Jun S
2017-10-01
New architectures of multilayer artificial neural networks and new methods for training them are rapidly revolutionizing the application of machine learning in diverse fields, including business, social science, physical sciences, and biology. Interpreting deep neural networks, however, currently remains elusive, and a critical challenge lies in understanding which meaningful features a network is actually learning. We present a general method for interpreting deep neural networks and extracting network-learned features from input data. We describe our algorithm in the context of biological sequence analysis. Our approach, based on ideas from statistical physics, samples from the maximum entropy distribution over possible sequences, anchored at an input sequence and subject to constraints implied by the empirical function learned by a network. Using our framework, we demonstrate that local transcription factor binding motifs can be identified from a network trained on ChIP-seq data and that nucleosome positioning signals are indeed learned by a network trained on chemical cleavage nucleosome maps. Imposing a further constraint on the maximum entropy distribution also allows us to probe whether a network is learning global sequence features, such as the high GC content in nucleosome-rich regions. This work thus provides valuable mathematical tools for interpreting and extracting learned features from feed-forward neural networks.
Extracting DNA words based on the sequence features: non-uniform distribution and integrity.
Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
2016-01-25
DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms such as the motif discovery algorithms depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences. We considered that non-uniform distribution and integrity were two important features of a word, based on which we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used for consistency test of uniform distribution of DNA sequences, and the integrity was judged by the sequence and position alignment. Two random base sequences were adopted as negative control, and an English book was used as positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. The results provide strong evidences that the algorithm is a promising tool for ab initio building a DNA dictionary. Our method provides a fast way for large scale screening of important DNA elements and offers potential insights into the understanding of a genome.
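To make the uniformity check concrete, the sketch below computes a one-sample Kolmogorov-Smirnov statistic for the positions of a candidate word against a uniform distribution over the sequence. It is an assumed, simplified reading of the test named in the abstract, written with the standard library only; the authors' exact procedure, thresholds and integrity check are not reproduced.

```python
from typing import List


def word_positions(sequence: str, word: str) -> List[int]:
    """Start indices of all (possibly overlapping) occurrences of word."""
    positions = []
    i = sequence.find(word)
    while i != -1:
        positions.append(i)
        i = sequence.find(word, i + 1)
    return positions


def ks_statistic_uniform(positions: List[int], length: int) -> float:
    """One-sample KS statistic D of positions against Uniform(0, length).

    D is the largest gap between the empirical CDF of the scaled positions
    and the uniform CDF F(x) = x. Larger D means a less uniform placement.
    """
    xs = sorted(p / length for p in positions)
    n = len(xs)
    d = 0.0
    for rank, x in enumerate(xs, start=1):
        d = max(d, rank / n - x, x - (rank - 1) / n)
    return d


if __name__ == "__main__":
    evenly_spread = [0, 25, 50, 75]
    clustered = [0, 1, 2, 3]
    print(ks_statistic_uniform(evenly_spread, 100))  # 0.25
    print(ks_statistic_uniform(clustered, 100))      # 0.97
```

Under this reading, a word whose occurrences cluster (large D) departs from uniformity and would be a candidate functional unit.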
The web server of IBM's Bioinformatics and Pattern Discovery group.
Huynh, Tien; Rigoutsos, Isidore; Parida, Laxmi; Platt, Daniel; Shibuya, Tetsuo
2003-07-01
We herein present and discuss the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server is operational around the clock and provides access to a variety of methods that have been published by the group's members and collaborators. The available tools correspond to applications ranging from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences and the interactive annotation of amino acid sequences. Additionally, annotations for more than 70 archaeal, bacterial, eukaryotic and viral genomes are available on-line and can be searched interactively. The tools and code bundles can be accessed beginning at http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12824385
Morris, R. M.; Rappé, M. S.; Urbach, E.; Connon, S. A.; Giovannoni, S. J.
2004-01-01
Since their initial discovery in samples from the north Atlantic Ocean, 16S rRNA genes related to the environmental gene clone cluster known as SAR202 have been recovered from pelagic freshwater, marine sediment, soil, and deep subsurface terrestrial environments. Together, these clones form a major, monophyletic subgroup of the phylum Chloroflexi. While members of this diverse group are consistently identified in the marine environment, there are currently no cultured representatives, and very little is known about their distribution or abundance in the world's oceans. In this study, published and newly identified SAR202-related 16S rRNA gene sequences were used to further resolve the phylogeny of this cluster and to design taxon-specific oligonucleotide probes for fluorescence in situ hybridization. Direct cell counts from the Bermuda Atlantic time series study site in the north Atlantic Ocean, the Hawaii ocean time series site in the central Pacific Ocean, and along the Newport hydroline in eastern Pacific coastal waters showed that SAR202 cluster cells were most abundant below the deep chlorophyll maximum and that they persisted to 3,600 m in the Atlantic Ocean and to 4,000 m in the Pacific Ocean, the deepest samples used in this study. On average, members of the SAR202 group accounted for 10.2% (±5.7%) of all DNA-containing bacterioplankton between 500 and 4,000 m. PMID:15128540
Brancaccio, Rosario N; Robitaille, Alexis; Dutta, Sankhadeep; Cuenin, Cyrille; Santare, Daiga; Skenders, Girts; Leja, Marcis; Fischer, Nicole; Giuliano, Anna R; Rollison, Dana E; Grundhoff, Adam; Tommasino, Massimo; Gheit, Tarik
2018-05-07
With the advent of new molecular tools, the discovery of new papillomaviruses (PVs) has accelerated during the past decade, enabling the expansion of knowledge about the viral populations that inhabit the human body. Human PVs (HPVs) are etiologically linked to benign or malignant lesions of the skin and mucosa. The detection of HPV types can vary widely, depending mainly on the methodology and the quality of the biological sample. Next-generation sequencing is one of the most powerful tools, enabling the discovery of novel viruses in a wide range of biological material. Here, we report a novel protocol for the detection of known and unknown HPV types in human skin and oral gargle samples using improved PCR protocols combined with next-generation sequencing. We identified 105 putative new PV types in addition to 296 known types, thus providing important information about the viral distribution in the oral cavity and skin. Copyright © 2018. Published by Elsevier Inc.
Thakar, Sambhaji B; Ghorpade, Pradnya N; Kale, Manisha V; Sonawane, Kailas D
2015-01-01
Fern plants are known for their ethnomedicinal applications. A large amount of information on medicinal ferns is scattered across the literature in text form, so a database is an appropriate way to consolidate it. Given the importance of medicinally useful ferns, we developed a web-based database that contains information about several groups of ferns, their medicinal uses, chemical constituents, and protein/enzyme sequences isolated from different fern plants. The Fern ethnomedicinal plant database is a comprehensive, content-managed, web-based database system used to retrieve factual knowledge related to ethnomedicinal fern species. Most of the protein/enzyme sequences were extracted from the NCBI Protein sequence database. The fern species, family name, identification, NCBI taxonomy ID, geographical occurrence, trial for, plant parts used, ethnomedicinal importance, and morphological characteristics were collected from scientific literature and journals available in text form. Links to NCBI's BLAST, InterPro, phylogeny, and Clustal W web resources are also provided for future comparative studies, so users can find information related to fern plants and their medicinal applications in one place. The database includes information on 100 medicinal fern species. It should be useful for computational drug discovery and for botanists and those with botanical interests, pharmacologists, researchers, biochemists, plant biotechnologists, ayurvedic practitioners, doctors and pharmacists, traditional medicine users, farmers, agricultural students and teachers from universities and colleges, and fern enthusiasts. 
This effort should provide users with essential knowledge about applications in drug discovery and the conservation of fern species around the world, and help to create social awareness.
Ma, Chun-Lei; Jin, Ji-Qiang; Li, Chun-Fang; Wang, Rong-Kai; Zheng, Hong-Kun; Yao, Ming-Zhe; Chen, Liang
2015-01-01
Genetic maps are important tools in plant genomics and breeding. The present study reports the large-scale discovery of single nucleotide polymorphisms (SNPs) for genetic map construction in tea plant. We developed a total of 6,042 valid SNP markers using specific-locus amplified fragment sequencing (SLAF-seq), and subsequently mapped them into the previous framework map. The final map contained 6,448 molecular markers, distributing on fifteen linkage groups corresponding to the number of tea plant chromosomes. The total map length was 3,965 cM, with an average inter-locus distance of 1.0 cM. This map is the first SNP-based reference map of tea plant, as well as the most saturated one developed to date. The SNP markers and map resources generated in this study provide a wealth of genetic information that can serve as a foundation for downstream genetic analyses, such as the fine mapping of quantitative trait loci (QTL), map-based cloning, marker-assisted selection, and anchoring of scaffolds to facilitate the process of whole genome sequencing projects for tea plant. PMID:26035838
Kitahara, Marcelo V.; Cairns, Stephen D.; Stolarski, Jarosław; Blair, David; Miller, David J.
2010-01-01
Background Classical morphological taxonomy places the approximately 1400 recognized species of Scleractinia (hard corals) into 27 families, but many aspects of coral evolution remain unclear despite the application of molecular phylogenetic methods. In part, this may be a consequence of such studies focusing on the reef-building (shallow water and zooxanthellate) Scleractinia, and largely ignoring the large number of deep-sea species. To better understand broad patterns of coral evolution, we generated molecular data for a broad and representative range of deep sea scleractinians collected off New Caledonia and Australia during the last decade, and conducted the most comprehensive molecular phylogenetic analysis to date of the order Scleractinia. Methodology Partial (595 bp) sequences of the mitochondrial cytochrome oxidase subunit 1 (CO1) gene were determined for 65 deep-sea (azooxanthellate) scleractinians and 11 shallow-water species. These new data were aligned with 158 published sequences, generating a 234 taxon dataset representing 25 of the 27 currently recognized scleractinian families. Principal Findings/Conclusions There was a striking discrepancy between the taxonomic validity of coral families consisting predominantly of deep-sea or shallow-water species. Most families composed predominantly of deep-sea azooxanthellate species were monophyletic in both maximum likelihood and Bayesian analyses but, by contrast (and consistent with previous studies), most families composed predominantly of shallow-water zooxanthellate taxa were polyphyletic, although Acroporidae, Poritidae, Pocilloporidae, and Fungiidae were exceptions to this general pattern. One factor contributing to this inconsistency may be the greater environmental stability of deep-sea environments, effectively removing taxonomic “noise” contributed by phenotypic plasticity. 
Our phylogenetic analyses imply that the most basal extant scleractinians are azooxanthellate solitary corals from deep-water, their divergence predating that of the robust and complex corals. Deep-sea corals are likely to be critical to understanding anthozoan evolution and the origins of the Scleractinia. PMID:20628613
Shiba, Norio; Yoshida, Kenichi; Shiraishi, Yuichi; Okuno, Yusuke; Yamato, Genki; Hara, Yusuke; Nagata, Yasunobu; Chiba, Kenichi; Tanaka, Hiroko; Terui, Kiminori; Kato, Motohiro; Park, Myoung-Ja; Ohki, Kentaro; Shimada, Akira; Takita, Junko; Tomizawa, Daisuke; Kudo, Kazuko; Arakawa, Hirokazu; Adachi, Souichi; Taga, Takashi; Tawa, Akio; Ito, Etsuro; Horibe, Keizo; Sanada, Masashi; Miyano, Satoru; Ogawa, Seishi; Hayashi, Yasuhide
2016-11-01
Acute myeloid leukaemia (AML) is a molecularly and clinically heterogeneous disease. Targeted sequencing efforts have identified several mutations with diagnostic and prognostic values in KIT, NPM1, CEBPA and FLT3 in both adult and paediatric AML. In addition, massively parallel sequencing enabled the discovery of recurrent mutations (i.e. IDH1/2 and DNMT3A) in adult AML. In this study, whole-exome sequencing (WES) of 22 paediatric AML patients revealed mutations in components of the cohesin complex (RAD21 and SMC3), BCORL1 and ASXL2 in addition to previously known gene mutations. We also revealed intratumoural heterogeneities in many patients, implicating multiple clonal evolution events in the development of AML. Furthermore, targeted deep sequencing in 182 paediatric AML patients identified three major categories of recurrently mutated genes: cohesin complex genes [STAG2, RAD21 and SMC3 in 17 patients (8·3%)], epigenetic regulators [ASXL1/ASXL2 in 17 patients (8·3%), BCOR/BCORL1 in 7 patients (3·4%)] and signalling molecules. We also performed WES in four patients with relapsed AML. Relapsed AML evolved from one of the subclones at the initial phase and was accompanied by many additional mutations, including common driver mutations that were absent or existed only with lower allele frequency in the diagnostic samples, indicating a multistep process causing leukaemia recurrence. © 2016 John Wiley & Sons Ltd.
Metagenomics and novel gene discovery
Culligan, Eamonn P; Sleator, Roy D; Marchesi, Julian R; Hill, Colin
2014-01-01
Metagenomics provides a means of assessing the total genetic pool of all the microbes in a particular environment, in a culture-independent manner. It has revealed unprecedented diversity in microbial community composition, which is further reflected in the encoded functional diversity of the genomes, a large proportion of which consists of novel genes. Herein, we review both sequence-based and functional metagenomic methods to uncover novel genes and outline some of the associated problems of each type of approach, as well as potential solutions. Furthermore, we discuss the potential for metagenomic biotherapeutic discovery, with a particular focus on the human gut microbiome and finally, we outline how the discovery of novel genes may be used to create bioengineered probiotics. PMID:24317337
Analyzing Student Inquiry Data Using Process Discovery and Sequence Classification
ERIC Educational Resources Information Center
Emond, Bruno; Buffett, Scott
2015-01-01
This paper reports on results of applying process discovery mining and sequence classification mining techniques to a data set of semi-structured learning activities. The main research objective is to advance educational data mining to model and support self-regulated learning in heterogeneous environments of learning content, activities, and…
Preface to COAST 2016 innovators' workshop on personalized and precision orthodontic therapy.
Nickel, J C; Covell, D A; Frazier-Bowers, S A; Kapila, S; Huja, S S; Iwasaki, L R
2017-06-01
A second focused workshop explored how to transfer novel findings into clinical orthodontic practice. Participants met in West Palm Beach (Florida, USA), on 9-11 September 2016 for the Consortium for Orthodontic Advances in Science and Technology 2016 Innovators' Workshop (COAST). Approximately 65 registered attendees considered and discussed information from 27 to 34 speakers, 8 to 15 poster presenters and four lunch-hour focus group leaders. The innovators' workshops were organized according to five themed sessions. The aims of the discussion sessions were to identify the following: i) the strength and impact of the evidence-based discoveries, ii) required steps to enable further development and iii) required steps to translate these new discoveries into orthodontic practice. The role of gene-environment interactions that underlie complex craniofacial traits was the focus of several sessions. It was agreed that diverse approaches are called for, such as (i) large-scale collaborative efforts for future genetic studies of complex traits; (ii) deep genome sequencing to address the issues of isolated mutations; (iii) quantifying epigenetic-environmental variables in diverse areas such as myofascial pain, alveolar remodelling and mandibular growth. Common needs identified from the themed sessions were multiscale/multispecies modelling and experimentation using controlled and quantified mechanics and translation of the findings in bone biology between species. Panel discussions led to the consensus that a consortium approach to establish standards for intra-oral scanning and 3D imaging should be initiated. Current and emerging technologies still require supported research to translate new findings from the laboratory to orthodontic practice. © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Ries, David; Holtgräwe, Daniela; Viehöver, Prisca; Weisshaar, Bernd
2016-03-15
The combination of bulk segregant analysis (BSA) and next generation sequencing (NGS), also known as mapping by sequencing (MBS), has been shown to significantly accelerate the identification of causal mutations for species with a reference genome sequence. The usual approach is to cross homozygous parents that differ for the monogenic trait to address, to perform deep sequencing of DNA from F2 plants pooled according to their phenotype, and subsequently to analyze the allele frequency distribution based on a marker table for the parents studied. The method has been successfully applied for EMS induced mutations as well as natural variation. Here, we show that pooling genetically diverse breeding lines according to a contrasting phenotype also allows high resolution mapping of the causal gene in a crop species. The test case was the monogenic locus causing red vs. green hypocotyl color in Beta vulgaris (R locus). We determined the allele frequencies of polymorphic sequences using sequence data from two diverging phenotypic pools of 180 B. vulgaris accessions each. A single interval of about 31 kbp among the nine chromosomes was identified which indeed contained the causative mutation. By applying a variation of the mapping by sequencing approach, we demonstrated that phenotype-based pooling of diverse accessions from breeding panels and subsequent direct determination of the allele frequency distribution can be successfully applied for gene identification in a crop species. Our approach made it possible to identify a small interval around the causative gene. Sequencing of parents or individual lines was not necessary. Whenever the appropriate plant material is available, the approach described saves time compared to the generation of an F2 population. In addition, we provide clues for planning similar experiments with regard to pool size and the sequencing depth required.
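The allele-frequency comparison between two phenotypic pools can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' workflow: the per-marker read counts are invented, the pool names are hypothetical, and a fixed-size sliding window over marker indices stands in for whatever interval definition the study used.

```python
from typing import List, Tuple

ReadCounts = Tuple[int, int]  # (reads supporting the alternate allele, total reads)


def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Alternate-allele frequency at one marker; 0.0 if the marker has no reads."""
    return alt_reads / total_reads if total_reads else 0.0


def delta_allele_frequency(pool_a: List[ReadCounts],
                           pool_b: List[ReadCounts]) -> List[float]:
    """Absolute allele-frequency difference between two pools, per marker."""
    return [abs(allele_frequency(*a) - allele_frequency(*b))
            for a, b in zip(pool_a, pool_b)]


def peak_window(deltas: List[float], window: int = 3) -> Tuple[int, float]:
    """Start index and mean of the window with the largest mean difference.

    Returns (0, -1.0) if there are fewer markers than the window size.
    """
    best_start, best_mean = 0, -1.0
    for start in range(len(deltas) - window + 1):
        mean = sum(deltas[start:start + window]) / window
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


if __name__ == "__main__":
    # Hypothetical read counts at six ordered markers for each pool.
    red_pool = [(10, 20), (11, 20), (19, 20), (20, 20), (18, 20), (9, 20)]
    green_pool = [(10, 20), (9, 20), (2, 20), (0, 20), (3, 20), (10, 20)]
    deltas = delta_allele_frequency(red_pool, green_pool)
    start, mean = peak_window(deltas)
    print(f"candidate interval starts at marker {start}, mean delta {mean:.2f}")
```

Markers linked to the causal locus show frequencies diverging toward opposite extremes in the two pools, while unlinked markers stay close together, which is why the peak of the difference localizes the gene.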
Sarmady, Mahdi; Dampier, William; Tozeren, Aydin
2011-01-01
Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance, but the approach is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef, highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear to crowd the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk. PMID:21738584
Hellner, Karin; Miranda, Fabrizio; Fotso Chedom, Donatien; Herrero-Gonzalez, Sandra; Hayden, Daniel M; Tearle, Rick; Artibani, Mara; KaramiNejadRanjbar, Mohammad; Williams, Ruth; Gaitskell, Kezia; Elorbany, Samar; Xu, Ruoyan; Laios, Alex; Buiga, Petronela; Ahmed, Karim; Dhar, Sunanda; Zhang, Rebecca Yu; Campo, Leticia; Myers, Kevin A; Lozano, María; Ruiz-Miró, María; Gatius, Sónia; Mota, Alba; Moreno-Bueno, Gema; Matias-Guiu, Xavier; Benítez, Javier; Witty, Lorna; McVean, Gil; Leedham, Simon; Tomlinson, Ian; Drmanac, Radoje; Cazier, Jean-Baptiste; Klein, Robert; Dunne, Kevin; Bast, Robert C; Kennedy, Stephen H; Hassan, Bassim; Lise, Stefano; Garcia, María José; Peters, Brock A; Yau, Christopher; Sauka-Spengler, Tatjana; Ahmed, Ahmed Ashour
2016-08-01
Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in-vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2⁻¹⁶), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease. Copyright © 2016 The Ohio State University Wexner Medical Center. Published by Elsevier B.V. All rights reserved.
Rodrigues, Jorge L. M.; Serres, Margrethe H.; Tiedje, James M.
2011-01-01
The use of comparative genomics for the study of different microbiological species has increased substantially as sequence technologies become more affordable. However, efforts to fully link a genotype to its phenotype remain limited to the development of one mutant at a time. In this study, we provided a high-throughput alternative to this limiting step by coupling comparative genomics to the use of phenotype arrays for five sequenced Shewanella strains. Positive phenotypes were obtained for 441 nutrients (C, N, P, and S sources), with N-based compounds being the most utilized for all strains. Many genes and pathways predicted by genome analyses were confirmed with the comparative phenotype assay, and three degradation pathways believed to be missing in Shewanella were confirmed as missing. A number of previously unknown gene products were predicted to be parts of pathways or to have a function, expanding the number of gene targets for future genetic analyses. Ecologically, the comparative high-throughput phenotype analysis provided insights into niche specialization among the five different strains. For example, Shewanella amazonensis strain SB2B, isolated from the Amazon River delta, was capable of utilizing 60 C compounds, whereas Shewanella sp. strain W3-18-1, isolated from deep marine sediment, utilized only 25 of them. In spite of the large number of nutrient sources yielding positive results, our study indicated that except for the N sources, they were not sufficiently informative to predict growth phenotypes from increasing evolutionary distances. Our results indicate the importance of phenotypic evaluation for confirming genome predictions. This strategy will accelerate the functional discovery of genes and provide an ecological framework for microbial genome sequencing projects. PMID:21642407
Search and Discovery Strategies for Biotechnology: the Paradigm Shift
Bull, Alan T.; Ward, Alan C.; Goodfellow, Michael
2000-01-01
Profound changes are occurring in the strategies that biotechnology-based industries are deploying in the search for exploitable biology and to discover new products and develop new or improved processes. The advances that have been made in the past decade in areas such as combinatorial chemistry, combinatorial biosynthesis, metabolic pathway engineering, gene shuffling, and directed evolution of proteins have caused some companies to consider withdrawing from natural product screening. In this review we examine the paradigm shift from traditional biology to bioinformatics that is revolutionizing exploitable biology. We conclude that the reinvigorated means of detecting novel organisms, novel chemical structures, and novel biocatalytic activities will ensure that natural products will continue to be a primary resource for biotechnology. The paradigm shift has been driven by a convergence of complementary technologies, exemplified by DNA sequencing and amplification, genome sequencing and annotation, proteome analysis, and phenotypic inventorying, resulting in the establishment of huge databases that can be mined in order to generate useful knowledge such as the identity and characterization of organisms and the identity of biotechnology targets. Concurrently there have been major advances in understanding the extent of microbial diversity, how uncultured organisms might be grown, and how expression of the metabolic potential of microorganisms can be maximized. The integration of information from complementary databases presents a significant challenge. Such integration should facilitate answers to complex questions involving sequence, biochemical, physiological, taxonomic, and ecological information of the sort posed in exploitable biology. 
The paradigm shift which we discuss is not absolute in the sense that it will replace established microbiology; rather, it reinforces our view that innovative microbiology is essential for releasing the potential of microbial diversity for biotechnology penetration throughout industry. Various of these issues are considered with reference to deep-sea microbiology and biotechnology. PMID:10974127
Ansell, Brendan R E; Schnyder, Manuela; Deplazes, Peter; Korhonen, Pasi K; Young, Neil D; Hall, Ross S; Mangiola, Stefano; Boag, Peter R; Hofmann, Andreas; Sternberg, Paul W; Jex, Aaron R; Gasser, Robin B
2013-12-01
Angiostrongylus vasorum is a metastrongyloid nematode of dogs and other canids of major clinical importance in many countries. In order to gain first insights into the molecular biology of this worm, we conducted the first large-scale exploration of its transcriptome, and predicted essential molecules linked to metabolic and biological processes as well as host immune responses. We also predicted and prioritized drug targets and drug candidates. Following Illumina sequencing (RNA-seq), 52.3 million sequence reads representing adult A. vasorum were assembled and annotated. The assembly yielded 20,033 contigs, which encoded proteins with 11,505 homologues in Caenorhabditis elegans, and an additional 2252 homologues in various other parasitic helminths for which curated data sets were publicly available. Functional annotation was achieved for 11,752 (58.6%) proteins predicted for A. vasorum, including peptidases (4.5%) and peptidase inhibitors (1.6%), protein kinases (1.7%), G protein-coupled receptors (GPCRs) (1.5%) and phosphatases (1.2%). Contigs encoding excretory/secretory and immuno-modulatory proteins represented some of the most highly transcribed molecules, and encoded enzymes that digest haemoglobin were conserved between A. vasorum and other blood-feeding nematodes. Using an essentiality-based approach, drug targets, including neurotransmitter receptors, an important chemosensory ion channel and cysteine proteinase-3 were predicted in A. vasorum, as were associated small molecular inhibitors/activators. Future transcriptomic analyses of all developmental stages of A. vasorum should facilitate deep explorations of the molecular biology of this important parasitic nematode and support the sequencing of its genome. These advances will provide a foundation for exploring immuno-molecular aspects of angiostrongylosis and have the potential to underpin the discovery of new methods of intervention. © 2013.
NASA Astrophysics Data System (ADS)
Yakimov, Michail M.; Cono, Violetta La; Denaro, Renata
2009-05-01
The autotrophic and ammonia-oxidizing crenarchaeal assemblage at an offshore site in deep Mediterranean water (Tyrrhenian Sea, depth 3000 m) was studied by PCR amplification of the key functional genes involved in energy (ammonia mono-oxygenase alpha subunit, amoA) and central metabolism (acetyl-CoA carboxylase alpha subunit, accA). Using two recently annotated genomes of marine crenarchaeons, an initial set of primers targeting archaeal accA-like genes was designed. Approximately 300 clones were analyzed, of which 100% of the amoA library and almost 70% of the accA library were unambiguously related to the corresponding genes from marine Crenarchaeota. Even though the acetyl-CoA carboxylase is phylogenetically not well conserved and the remaining clones were affiliated to various bacterial acetyl-CoA/propionyl-CoA carboxylase genes, the pool of archaeal sequences was applied for development of quantitative PCR analysis of accA-like distribution using TaqMan® methodology. The archaeal accA gene fragments, together with alignable gene fragments from the Sargasso Sea and North Pacific Subtropical Gyre (ALOHA Station) metagenome databases, were analyzed by multiple sequence alignment. Two accA-like sequences, found in ALOHA Station at the depth of 4000 m, formed a deeply branched clade with 64% of all archaeal Tyrrhenian clones. No close relatives for the residual 36% of clones, except for those recovered from the Eastern Mediterranean, were found, suggesting the existence of a specific lineage of the crenarchaeal accA genes in deep Mediterranean water. Alignment of Mediterranean amoA sequences defined four cosmopolitan phylotypes of the Crenarchaeota putative ammonia mono-oxygenase subunit A gene occurring in the water sample from the 3000 m depth. Without exception, all phylotypes fell into the Deep Marine Group I cluster, which contains the vast majority of known sequences recovered from the global deep-sea environment.
Remarkably, three phylotypes accounted for 91% of all Mediterranean amoA clones and corresponded to the sequences retrieved from the less deep compartments of the world's ocean, most likely reflecting the higher temperature at the depth of the Mediterranean Sea. In order to verify whether these phylotypes might represent important Crenarchaeota in the functioning of the Mediterranean bathypelagic ecosystem, expression of the crenarchaeal amoA gene was monitored by direct RNA retrieval and subsequent analysis of amoA-related mRNA transcripts. Surprisingly, all mRNA-derived sequences formed a tight monophyletic group, which fell into the large Shallow Marine Group I cluster with sequences retrieved from shallow (up to 200 m) waters, sediments and corals. This group was not detected in the DNA-based clone library, evidently owing to the overwhelming dominance of the Deep Marine Group I. The failure to recover amoA transcripts related to Deep Marine Group I of Crenarchaeota was unanticipated and likely resulted from the physiology of these strongly adapted deep-sea organisms. Because all seawater samples were processed on board under atmospheric pressure and sunlight, decompression and/or photoinhibition likely affected their metabolic activity, followed by a strong decay of gene expression.
Deep Whole-Genome Sequencing to Detect Mixed Infection of Mycobacterium tuberculosis
Gan, Mingyu; Liu, Qingyun; Yang, Chongguang; Gao, Qian; Luo, Tao
2016-01-01
Mixed infection by multiple Mycobacterium tuberculosis (MTB) strains is associated with poor treatment outcome of tuberculosis (TB). Traditional genotyping methods have been used to detect mixed infections of MTB; however, their sensitivity and resolution are limited. Deep whole-genome sequencing (WGS) has proved highly sensitive and discriminative for studying population heterogeneity of MTB. Here, we developed a phylogenetic-based method to detect MTB mixed infections using WGS data. We collected published WGS data of 782 global MTB strains from public databases. We called homogeneous and heterogeneous single nucleotide variations (SNVs) of individual strains by mapping short reads to the ancestral MTB reference genome. We constructed a phylogenomic database based on 68,639 homogeneous SNVs of 652 MTB strains. Mixed infections were determined if multiple evolutionary paths were identified by mapping the SNVs of individual samples to the phylogenomic database. By simulation, our method could specifically detect mixed infections when the sequencing depth of minor strains was as low as 1× coverage, and when the genomic distance of two mixed strains was as small as 16 SNVs. By applying our method to all 782 samples, we detected 47 mixed infections, 45 of which were caused by locally endemic strains. The results indicate that our method is highly sensitive and discriminative for identifying mixed infections from deep WGS data of MTB isolates. PMID:27391214
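The "multiple evolutionary paths" criterion in this abstract can be pictured with a toy lineage tree. The following is a minimal sketch, not the authors' implementation: the `parent` dictionary, the lineage labels, and the function names `ancestors` and `is_mixed` are all hypothetical, and it assumes each SNV of a sample has already been mapped to the tree node (branch) it supports.

```python
def ancestors(node, parent):
    """Return the set holding `node` and every ancestor up to the root."""
    path = set()
    while node is not None:
        path.add(node)
        node = parent.get(node)  # None once the root is passed
    return path


def is_mixed(sample_nodes, parent):
    """Return True if the supported nodes do not fit one root-to-tip path.

    `sample_nodes` is the (hypothetical) collection of tree nodes supported
    by a sample's SNVs; `parent` maps each node to its parent (root -> None).
    """
    nodes = set(sample_nodes)
    if not nodes:
        return False
    # The deepest supported node has the longest ancestor chain; in a single
    # infection every other supported node must be one of its ancestors.
    deepest = max(nodes, key=lambda n: len(ancestors(n, parent)))
    return not nodes <= ancestors(deepest, parent)


# Toy tree with made-up lineage labels (assumption for illustration only).
toy_parent = {"root": None, "A": "root", "B": "root", "A.1": "A", "B.2": "B"}
single = is_mixed({"A", "A.1"}, toy_parent)    # one path: root -> A -> A.1
mixed = is_mixed({"A.1", "B.2"}, toy_parent)   # two diverging paths
```

A real pipeline would also have to weigh read support and allele frequency per SNV, which this sketch leaves out.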
Cancer genomics: technology, discovery, and translation.
Tran, Ben; Dancey, Janet E; Kamel-Reid, Suzanne; McPherson, John D; Bedard, Philippe L; Brown, Andrew M K; Zhang, Tong; Shaw, Patricia; Onetto, Nicole; Stein, Lincoln; Hudson, Thomas J; Neel, Benjamin G; Siu, Lillian L
2012-02-20
In recent years, the increasing awareness that somatic mutations and other genetic aberrations drive human malignancies has led us within reach of personalized cancer medicine (PCM). The implementation of PCM is based on the following premises: genetic aberrations exist in human malignancies; a subset of these aberrations drive oncogenesis and tumor biology; these aberrations are actionable (defined as having the potential to affect management recommendations based on diagnostic, prognostic, and/or predictive implications); and there are highly specific anticancer agents available that effectively modulate these targets. This article highlights the technology underlying cancer genomics and examines the early results of genome sequencing and the challenges met in the discovery of new genetic aberrations. Finally, drawing from experiences gained in a feasibility study of somatic mutation genotyping and targeted exome sequencing led by Princess Margaret Hospital-University Health Network and the Ontario Institute for Cancer Research, the processes, challenges, and issues involved in the translation of cancer genomics to the clinic are discussed.
Boja, Emily S; Fehniger, Thomas E; Baker, Mark S; Marko-Varga, György; Rodriguez, Henry
2014-12-05
Protein biomarker discovery and validation in the current omics era are vital for healthcare professionals to improve diagnosis, detect cancers at an early stage, identify the likelihood of cancer recurrence, stratify stages with differential survival outcomes, and monitor therapeutic responses. The success of such biomarkers would have a huge impact on how we improve the diagnosis and treatment of patients and alleviate the financial burden of healthcare systems. In the past, the genomics community (mostly through large-scale, deep genomic sequencing technologies) has been steadily improving our understanding of the molecular basis of disease, with a number of biomarker panels already authorized by the U.S. Food and Drug Administration (FDA) for clinical use (e.g., MammaPrint, two recently cleared devices using next-generation sequencing platforms to detect DNA changes in the cystic fibrosis transmembrane conductance regulator (CFTR) gene). Clinical proteomics, on the other hand, despite its ability to delineate the functional units of a cell that more likely drive the phenotypic differences of a disease (i.e., proteins and protein-protein interaction networks and signaling pathways underlying the disease), "staggers" to make a significant impact, with on average only ∼1.5 protein biomarkers per year approved by the FDA over the past 15-20 years. This statistic itself raises the concern that major roadblocks have been impeding an efficient transition of protein marker candidates in biomarker development despite major technological advances in proteomics in recent years.
Yang, Jian-Hua; Li, Jun-Hao; Jiang, Shan; Zhou, Hui; Qu, Liang-Hu
2013-01-01
Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) represent two classes of important non-coding RNAs in eukaryotes. Although these non-coding RNAs have been implicated in organismal development and in various human diseases, surprisingly little is known about their transcriptional regulation. Recent advances in chromatin immunoprecipitation with next-generation DNA sequencing (ChIP-Seq) have provided methods of detecting transcription factor binding sites (TFBSs) with unprecedented sensitivity. In this study, we describe ChIPBase (http://deepbase.sysu.edu.cn/chipbase/), a novel database that we have developed to facilitate the comprehensive annotation and discovery of transcription factor binding maps and transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. The current release of ChIPBase includes high-throughput sequencing data that were generated by 543 ChIP-Seq experiments in diverse tissues and cell lines from six organisms. By analysing millions of TFBSs, we identified tens of thousands of TF-lncRNA and TF-miRNA regulatory relationships. Furthermore, two web-based servers were developed to annotate and discover transcriptional regulatory relationships of lncRNAs and miRNAs from ChIP-Seq data. In addition, we developed two genome browsers, deepView and genomeView, to provide integrated views of multidimensional data. Moreover, our web implementation supports diverse query types and the exploration of TFs, lncRNAs, miRNAs, gene ontologies and pathways.
Nematoda from the terrestrial deep subsurface of South Africa.
Borgonie, G; García-Moyano, A; Litthauer, D; Bert, W; Bester, A; van Heerden, E; Möller, C; Erasmus, M; Onstott, T C
2011-06-02
Since its discovery over two decades ago, the deep subsurface biosphere has been considered to be the realm of single-cell organisms, extending over three kilometres into the Earth's crust and comprising a significant fraction of the global biosphere. The constraints of temperature, energy, dioxygen and space seemed to preclude the possibility of more-complex, multicellular organisms from surviving at these depths. Here we report species of the phylum Nematoda that have been detected in or recovered from 0.9-3.6-kilometre-deep fracture water in the deep mines of South Africa but have not been detected in the mining water. These subsurface nematodes, including a new species, Halicephalobus mephisto, tolerate high temperature, reproduce asexually and preferentially feed upon subsurface bacteria. Carbon-14 data indicate that the fracture water in which the nematodes reside is 3,000-12,000-year-old palaeometeoric water. Our data suggest that nematodes should be found in other deep hypoxic settings where temperature permits, and that they may control the microbial population density by grazing on fracture surface biofilm patches. Our results expand the known metazoan biosphere and demonstrate that deep ecosystems are more complex than previously accepted. The discovery of multicellular life in the deep subsurface of the Earth also has important implications for the search for subsurface life on other planets in our Solar System.
Spitzer Space Telescope Sequencing Operations Software, Strategies, and Lessons Learned
NASA Technical Reports Server (NTRS)
Bliss, David A.
2006-01-01
The Space Infrared Telescope Facility (SIRTF) was launched in August, 2003, and renamed to the Spitzer Space Telescope in 2004. Two years of observing the universe in the wavelength range from 3 to 180 microns has yielded enormous scientific discoveries. Since this magnificent observatory has a limited lifetime, maximizing science viewing efficiency (i.e., maximizing time spent executing activities directly related to science observations) was the key operational objective. The strategy employed for maximizing science viewing efficiency was to optimize spacecraft flexibility, adaptability, and use of observation time. The selected approach involved implementation of a multi-engine sequencing architecture coupled with nondeterministic spacecraft and science execution times. This approach, though effective, added much complexity to uplink operations and sequence development. The Jet Propulsion Laboratory (JPL) manages Spitzer's operations. As part of the uplink process, Spitzer's Mission Sequence Team (MST) was tasked with processing observatory inputs from the Spitzer Science Center (SSC) into efficiently integrated, constraint-checked, and modeled review and command products which accommodated the complexity of non-deterministic spacecraft and science event executions without increasing operations costs. The MST developed processes, scripts, and participated in the adaptation of multi-mission core software to enable rapid processing of complex sequences. The MST was also tasked with developing a Downlink Keyword File (DKF) which could instruct Deep Space Network (DSN) stations on how and when to configure themselves to receive Spitzer science data. As MST and uplink operations developed, important lessons were learned that should be applied to future missions, especially those missions which employ command-intensive operations via a multi-engine sequence architecture.
Danielsson, Frida; Wiking, Mikaela; Mahdessian, Diana; Skogs, Marie; Ait Blal, Hammou; Hjelmare, Martin; Stadler, Charlotte; Uhlén, Mathias; Lundberg, Emma
2013-01-04
One of the major challenges of a chromosome-centric proteome project is to explore in a systematic manner the potential proteins identified from the chromosomal genome sequence, but not yet characterized on a protein level. Here, we describe the use of RNA deep sequencing to screen human cell lines for RNA profiles and to use this information to select cell lines suitable for characterization of the corresponding gene product. In this manner, the subcellular localization of proteins can be analyzed systematically using antibody-based confocal microscopy. We demonstrate the usefulness of selecting cell lines with high expression levels of RNA transcripts to increase the likelihood of high quality immunofluorescence staining and subsequent successful subcellular localization of the corresponding protein. The results show a path to combine transcriptomics with affinity proteomics to characterize the proteins in a gene- or chromosome-centric manner.
Measuring Student Understanding of Geological Time
ERIC Educational Resources Information Center
Dodick, Jeff; Orion, Nir
2003-01-01
There have been few discoveries in geology more important than "deep time"--the understanding that the universe has existed for countless millennia, such that man's existence is confined to the last milliseconds of the metaphorical geological clock. The influence of deep time is felt in a variety of sciences including geology, cosmology,…
A Template-Based Protein Structure Reconstruction Method Using Deep Autoencoder Learning.
Li, Haiou; Lyu, Qiang; Cheng, Jianlin
2016-12-01
Protein structure prediction is an important problem in computational biology, and is widely applied to various biomedical problems such as protein function study, protein design, and drug design. In this work, we developed a novel deep learning approach based on a deeply stacked denoising autoencoder for protein structure reconstruction. We applied our approach to a template-based protein structure prediction using only the 3D structural coordinates of homologous template proteins as input. The templates were identified for a target protein by a PSI-BLAST search. 3DRobot (a program that automatically generates diverse and well-packed protein structure decoys) was used to generate initial decoy models for the target from the templates. A stacked denoising autoencoder was trained on the decoys to obtain a deep learning model for the target protein. The trained deep model was then used to reconstruct the final structural model for the target sequence. With target proteins that have highly similar template proteins as benchmarks, the GDT-TS score of the predicted structures is greater than 0.7, suggesting that the deep autoencoder is a promising method for protein structure reconstruction.
Zhong, Daibin; Lo, Eugenia; Wang, Xiaoming; Yewhalaw, Delenasaw; Zhou, Guofa; Atieli, Harrysone E; Githeko, Andrew; Hemming-Schroeder, Elizabeth; Lee, Ming-Chieh; Afrane, Yaw; Yan, Guiyun
2018-05-02
Parasite genetic diversity and multiplicity of infection (MOI) affect clinical outcomes, response to drug treatment and naturally-acquired or vaccine-induced immunity. Traditional methods often underestimate the frequency and diversity of multiclonal infections due to technical sensitivity and specificity. Next-generation sequencing techniques provide a novel opportunity to study complexity of parasite populations and molecular epidemiology. Symptomatic and asymptomatic Plasmodium vivax samples were collected from health centres/hospitals and schools, respectively, from 2011 to 2015 in Ethiopia. Similarly, both symptomatic and asymptomatic Plasmodium falciparum samples were collected, respectively, from hospitals and schools in 2005 and 2015 in Kenya. Finger-pricked blood samples were collected and dried on filter paper. Long amplicon (> 400 bp) deep sequencing of merozoite surface protein 1 (msp1) gene was conducted to determine multiplicity and molecular epidemiology of P. vivax and P. falciparum infections. The results were compared with those based on short amplicon (117 bp) deep sequencing. A total of 139 P. vivax and 222 P. falciparum samples were pyro-sequenced for pvmsp1 and pfmsp1, yielding a total of 21 P. vivax and 99 P. falciparum predominant haplotypes. The average MOI for P. vivax and P. falciparum were 2.16 and 2.68, respectively, which were significantly higher than that of microsatellite markers and short amplicon (117 bp) deep sequencing. Multiclonal infections were detected in 62.2% of the samples for P. vivax and 74.8% of the samples for P. falciparum. Four out of the five subjects with recurrent P. vivax malaria were found to be a relapse 44-65 days after clearance of parasites. No difference was observed in MOI among P. vivax patients of different symptoms, ages and genders. Similar patterns were also observed in P. falciparum except for one study site in Kenyan lowland areas with significantly higher MOI. 
The study used a novel method to evaluate Plasmodium MOI and molecular epidemiological patterns by long amplicon ultra-deep sequencing. The complexity of infections was similar among age groups, symptoms, genders, transmission settings (spatial heterogeneity), as well as over years (pre- vs. post-scale-up interventions). This study demonstrated that long amplicon deep sequencing is a useful tool to investigate multiplicity and molecular epidemiology of Plasmodium parasite infections.
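The two summary statistics reported in this abstract, average MOI and the share of multiclonal infections, follow directly from per-sample haplotype counts. The sketch below is illustrative only: the function name `summarize_moi` and the example counts are hypothetical, and it assumes MOI is taken as the number of distinct haplotypes detected in a sample.

```python
def summarize_moi(haplotypes_per_sample):
    """Return (mean MOI, fraction of multiclonal samples).

    `haplotypes_per_sample` holds one positive integer per sample: the number
    of distinct haplotypes detected (assumed here to equal the sample's MOI).
    """
    counts = list(haplotypes_per_sample)
    if not counts:
        raise ValueError("at least one sample is required")
    if any(c < 1 for c in counts):
        raise ValueError("each infected sample needs at least one haplotype")
    mean_moi = sum(counts) / len(counts)
    # A sample is multiclonal when more than one haplotype is present.
    multiclonal_fraction = sum(1 for c in counts if c > 1) / len(counts)
    return mean_moi, multiclonal_fraction


# Made-up counts for four samples, not data from the study.
example_mean, example_fraction = summarize_moi([1, 2, 3, 2])
```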
Pharmacological screening technologies for venom peptide discovery.
Prashanth, Jutty Rajan; Hasaballah, Nojod; Vetter, Irina
2017-12-01
Venomous animals occupy one of the most successful evolutionary niches and occur on nearly every continent. They deliver venoms via biting and stinging apparatuses with the aim to rapidly incapacitate prey and deter predators. This has led to the evolution of venom components that act at a number of biological targets - including ion channels, G-protein coupled receptors, transporters and enzymes - with exquisite selectivity and potency, making venom-derived components attractive pharmacological tool compounds and drug leads. In recent years, plate-based pharmacological screening approaches have been introduced to accelerate venom-derived drug discovery. A range of assays are amenable to this purpose, including high-throughput electrophysiology, fluorescence-based functional and binding assays. However, despite these technological advances, the traditional activity-guided fractionation approach is time-consuming and resource-intensive. The combination of screening techniques suitable for miniaturization with sequence-based discovery approaches - supported by advanced proteomics, mass spectrometry, chromatography as well as synthesis and expression techniques - promises to further improve venom peptide discovery. Here, we discuss practical aspects of establishing a pipeline for venom peptide drug discovery with a particular emphasis on pharmacology and pharmacological screening approaches. This article is part of the Special Issue entitled 'Venom-derived Peptides as Pharmacological Tools.' Copyright © 2017 Elsevier Ltd. All rights reserved.
Genome survey sequencing of red swamp crayfish Procambarus clarkii.
Shi, Linlin; Yi, Shaokui; Li, Yanhe
2018-06-21
Red swamp crayfish, Procambarus clarkii, is presently an important commercial aquatic species in China. The crayfish is an active research focus, and its genetic improvement is urgently needed for crayfish aquaculture in China. However, knowledge of its genomic landscape is limited. In this study, the P. clarkii genome was surveyed using Illumina's Solexa sequencing platform. Its genome size was also estimated using flow cytometry. Interestingly, the estimated genome size is about 8.50 Gb by flow cytometry and 1.86 Gb by genome survey sequencing. Based on the assembled genome sequences, a total of 136,962 genes and 152,268 exons were predicted, and the predicted genes ranged from 150 to 12,807 bp in length. The survey sequences could help accelerate gene discovery for genetic diversity and evolutionary analyses, even though they could not be successfully applied to estimating the P. clarkii genome size.
Suyama, Yoshihisa; Matsuki, Yu
2015-01-01
Restriction-enzyme (RE)-based next-generation sequencing methods have revolutionized marker-assisted genetic studies; however, the use of REs has limited their widespread adoption, especially in field samples with low-quality DNA and/or small quantities of DNA. Here, we developed a PCR-based procedure to construct reduced representation libraries without RE digestion steps, representing de novo single-nucleotide polymorphism discovery, and its genotyping using next-generation sequencing. Using multiplexed inter-simple sequence repeat (ISSR) primers, thousands of genome-wide regions were amplified effectively from a wide variety of genomes, without prior genetic information. We demonstrated: 1) Mendelian gametic segregation of the discovered variants; 2) reproducibility of genotyping by checking its applicability for individual identification; and 3) applicability in a wide variety of species by checking standard population genetic analysis. This approach, called multiplexed ISSR genotyping by sequencing, should be applicable to many marker-assisted genetic studies with a wide range of DNA qualities and quantities. PMID:26593239
VaDiR: an integrated approach to Variant Detection in RNA.
Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy
2018-02-01
Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
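The tiering described in this abstract can be pictured with plain set operations. This is only a sketch under stated assumptions, not the VaDiR code: the `Variant` key, the function name `integrate_calls`, and the example calls are hypothetical, and reading the higher-recall set as "Tier 1 plus variants called by both MuTect2 and SNPiR" is one interpretation of the abstract's wording.

```python
from typing import NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical key identifying a call by position and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def integrate_calls(
    snpir: Set[Variant], rvboost: Set[Variant], mutect2: Set[Variant]
) -> Tuple[Set[Variant], Set[Variant]]:
    """Return (tier1, extended) call sets.

    tier1: variants reported by all three callers (highest precision).
    extended: tier1 plus variants shared by MuTect2 and SNPiR (assumed
    reading of the higher-recall combination in the abstract).
    """
    tier1 = snpir & rvboost & mutect2
    extended = tier1 | (snpir & mutect2)
    return tier1, extended


# Made-up calls for illustration only.
v1 = Variant("chr1", 100, "A", "G")
v2 = Variant("chr2", 250, "C", "T")
v3 = Variant("chr3", 999, "G", "A")
tier1, extended = integrate_calls({v1, v2}, {v1, v3}, {v1, v2, v3})
```

In practice the calls would come from each tool's VCF output and need normalization before comparison, a step this sketch skips.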
GWATCH: a web platform for automated gene association discovery analysis.
Svitin, Anton; Malov, Sergey; Cherkasov, Nikolay; Geerts, Paul; Rotkevich, Mikhail; Dobrynin, Pavel; Shevchenko, Andrey; Guan, Li; Troyer, Jennifer; Hendrickson, Sher; Dilks, Holli Hutcheson; Oleksyk, Taras K; Donfield, Sharyne; Gomperts, Edward; Jabs, Douglas A; Sezgin, Efe; Van Natta, Mark; Harrigan, P Richard; Brumme, Zabrina L; O'Brien, Stephen J
2014-01-01
As genome-wide sequence analyses for complex human disease determinants are expanding, it is increasingly necessary to develop strategies to promote discovery and validation of potential disease-gene associations. Here we present a dynamic web-based platform - GWATCH - that automates and facilitates four steps in genetic epidemiological discovery: 1) Rapid gene association search and discovery analysis of large genome-wide datasets; 2) Expanded visual display of gene associations for genome-wide variants (SNPs, indels, CNVs), including Manhattan plots, 2D and 3D snapshots of any gene region, and a dynamic genome browser illustrating gene association chromosomal regions; 3) Real-time validation/replication of candidate or putative genes suggested from other sources, limiting Bonferroni genome-wide association study (GWAS) penalties; 4) Open data release and sharing by eliminating privacy constraints (The National Human Genome Research Institute (NHGRI) Institutional Review Board (IRB), informed consent, The Health Insurance Portability and Accountability Act (HIPAA) of 1996 etc.) on unabridged results, which allows for open access comparative and meta-analysis. GWATCH is suitable for both GWAS and whole genome sequence association datasets. We illustrate the utility of GWATCH with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.
USDA-ARS's Scientific Manuscript database
Deep sequencing of viruses isolated from infected hosts is an efficient way to measure population-genetic variation and can reveal patterns of dispersal and natural selection. In this study, we mined existing Illumina sequence reads to investigate single-nucleotide polymorphisms (SNPs) within two RN...
Detection of Emerging Vaccine-Related Polioviruses by Deep Sequencing.
Sahoo, Malaya K; Holubar, Marisa; Huang, ChunHong; Mohamed-Hadley, Alisha; Liu, Yuanyuan; Waggoner, Jesse J; Troy, Stephanie B; Garcia-Garcia, Lourdes; Ferreyra-Reyes, Leticia; Maldonado, Yvonne; Pinsky, Benjamin A
2017-07-01
Oral poliovirus vaccine can mutate to regain neurovirulence. To date, evaluation of these mutations has been performed primarily on culture-enriched isolates by using conventional Sanger sequencing. We therefore developed a culture-independent, deep-sequencing method targeting the 5' untranslated region (UTR) and P1 genomic region to characterize vaccine-related poliovirus variants. Error analysis of the deep-sequencing method demonstrated reliable detection of poliovirus mutations at levels of <1%, depending on read depth. Sequencing of viral nucleic acids from the stool of vaccinated, asymptomatic children and their close contacts collected during a prospective cohort study in Veracruz, Mexico, revealed no vaccine-derived polioviruses. This was expected given that the longest duration between sequenced sample collection and the end of the most recent national immunization week was 66 days. However, we identified many low-level variants (<5%) distributed across the 5' UTR and P1 genomic region in all three Sabin serotypes, as well as vaccine-related viruses with multiple canonical mutations associated with phenotypic reversion present at high levels (>90%). These results suggest that monitoring emerging vaccine-related poliovirus variants by deep sequencing may aid in the poliovirus endgame and efforts to ensure global polio eradication. Copyright © 2017 Sahoo et al.
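The remark that sub-1% detection depends on read depth can be made concrete with a binomial sampling model. The sketch below is an assumption-laden illustration, not the error analysis used in the study: it ignores sequencing and amplification error, treats reads as independent draws, and the function name `prob_detect` and the example numbers are hypothetical.

```python
from math import comb


def prob_detect(depth: int, freq: float, min_reads: int) -> float:
    """Probability of seeing at least `min_reads` variant reads.

    Assumes each of `depth` reads independently carries the variant with
    probability `freq` (binomial sampling, no sequencing error).
    """
    if depth < 0 or min_reads < 0:
        raise ValueError("depth and min_reads must be non-negative")
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    # Sum P(X = i) for every count below the calling threshold.
    p_fewer = sum(
        comb(depth, i) * freq**i * (1.0 - freq) ** (depth - i)
        for i in range(min(min_reads, depth + 1))
    )
    return 1.0 - p_fewer


# Same 1% variant and 5-read threshold at two illustrative depths.
shallow = prob_detect(100, 0.01, 5)
deep = prob_detect(1000, 0.01, 5)
```

Under these assumptions the deeper run is far more likely to reach the read threshold, which is the qualitative point the abstract makes.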
NASA Astrophysics Data System (ADS)
Eyles, Nicholas; Mullins, Henry T.; Hine, Albert C.
1991-09-01
This paper presents the first detailed data regarding the newly discovered deep infill of Okanagan Lake. Okanagan Lake (50°00'N, 119°30'W) is 120 km long, ~3-5 km wide and occupies a glacially overdeepened bedrock basin in the southern interior of British Columbia. This basin, and other elongate lakes of the region (e.g. Shuswap, Kootenay, Kalamalka, Canim and Mahood lakes), mark the site of westward flowing ice streams within successive Cordilleran ice sheets. An air gun seismic survey of Okanagan Lake shows that the bedrock floor is nearly 650 m below sea-level, more than 2000 m below the rim of the surrounding plateau. The maximum thickness of Pleistocene sediment in Okanagan Lake basin approaches 800 m. Forty-six seismic reflection traverses and an axial profile show a relatively simple stratigraphy composed of three seismic sequences argued to be no older than the last glacial cycle (< 30 ka). A discontinuous basal unit (sequence I) characterized by large-scale diffractions, and up to 460 m thick, infills the narrow, V-shaped bedrock floor of the basin and is interpreted as a boulder gravel deposited by subglacial meltwaters. Overlying seismic sequence II is composed of two sub-sequences. Sub-sequence IIa is a chaotic to massive facies up to 736 m thick. Lakeshore exposures close to where this unit reaches lake level show deformed and chaotically-bedded glaciolacustrine silts containing gravel lenses and large ice-rafted boulders. The surface topography of this sub-sequence is irregular and in general mimics the form of the underlying bedrock as a result of compaction. This sequence passes laterally into stratified facies (sub-sequence IIb) at the northern end of the basin. Seismic sequence II appears to record rapid ice-proximal dumping of glaciolacustrine silt as the Okanagan glacier backwasted upvalley in a deep lake. A thin (60 m max.)
laminated seismic sequence (III) drapes the hummocky surface of sequence II and represents postglacial sedimentation from fan-deltas. The extreme thickness of sequences I and II in Okanagan Lake reflects the focussing of large volumes of meltwater and sediment into the basin during deglaciation; pre-existing sediments that pre-date the last glacial cycle appear to have been completely eroded. Glaciological conditions during sedimentation may have been similar to marine-based outlet glaciers calving in deep water in fiord basins. In contrast to marine settings where ice bergs are free to disperse, large volumes of dead ice were trapped within the basin; structural evidence for sedimentation around dead ice blocks has been previously used to argue that the Cordilleran Ice Sheet downwasted in situ. We emphasize in contrast, the trapping of dead ice left behind by rapidly calving lake-based outlet glaciers.
Rational Protein Engineering Guided by Deep Mutational Scanning
Shin, HyeonSeok; Cho, Byung-Kwan
2015-01-01
Sequence–function relationship in a protein is commonly determined by the three-dimensional protein structure followed by various biochemical experiments. However, with the explosive increase in the number of genome sequences, facilitated by recent advances in sequencing technology, the gap between protein sequences available and three-dimensional structures is rapidly widening. A recently developed method termed deep mutational scanning explores the functional phenotype of thousands of mutants via massive sequencing. Coupled with a highly efficient screening system, this approach assesses the phenotypic changes made by the substitution of each amino acid sequence that constitutes a protein. Such an informational resource provides the functional role of each amino acid sequence, thereby providing sufficient rationale for selecting target residues for protein engineering. Here, we discuss the current applications of deep mutational scanning and consider experimental design. PMID:26404267
Biochemical mechanisms of cisplatin cytotoxicity.
Cepeda, Victoria; Fuertes, Miguel A; Castilla, Josefina; Alonso, Carlos; Quevedo, Celia; Pérez, Jose M
2007-01-01
Since the discovery by Rosenberg and collaborators of the antitumor activity of cisplatin 35 years ago, three platinum antitumor drugs (cisplatin, carboplatin and oxaliplatin) have enjoyed huge clinical and commercial success. Ever since the initial discovery of the anticancer activity of cisplatin, major efforts have been devoted to elucidating the biochemical mechanisms of antitumor activity of cisplatin in order to be able to rationally design novel platinum-based drugs with superior pharmacological profiles. In this report we attempt to provide a current picture of the known facts pertaining to the mechanism of action of the drug, including those involved in drug uptake, DNA damage signal transduction, and cell death through apoptosis or necrosis. A deep knowledge of the biochemical mechanisms that are triggered in the tumor cell in response to cisplatin injury may not only lead to the design of more efficient platinum antitumor drugs but also provide new therapeutic strategies based on the biochemical modulation of cisplatin activity.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pharhizgar, K.D.; Lunce, S.E.
1994-12-31
Development of knowledge-based technological acquisition techniques and customers' information profiles are known as assimilative integrated discovery systems (AIDS) in modern organizations. These systems have access through processing to both deep and broad domains of information in modern societies. Through these systems organizations and individuals can predict future trend probabilities and events concerning their customers. AIDSs are new techniques which produce new information which informants can use without the help of the knowledge sources because of the existence of highly sophisticated computerized networks. This paper has analyzed the danger and side effects of misuse of information through the illegal, unethical and immoral access to the data-base in an integrated and assimilative information system as described above. Cognivistic mapping, pragmatistic informational design gathering, and holistic classifiable and distributive techniques are potentially abusive systems whose outputs can be easily misused by businesses when researching the firm's customers.
Braña, Alfredo F; Fiedler, Hans-Peter; Nava, Herminio; González, Verónica; Sarmiento-Vizcaíno, Aida; Molina, Axayacatl; Acuña, José L; García, Luis A; Blanco, Gloria
2015-04-01
Streptomycetes are widely distributed in the marine environment, although only a few studies on their associations with algae and coral ecosystems have been reported. Using a culture-dependent approach, we have isolated antibiotic-active Streptomyces species associated with diverse intertidal marine macroalgae (Phylum Heterokontophyta, Rhodophyta, and Chlorophyta), from the central Cantabrian Sea. Two strains, with diverse antibiotic and cytotoxic activities, were found to inhabit these coastal environments, being widespread and persistent over a 3-year observation time frame. Based on 16S rRNA sequence analysis, the strains were identified as Streptomyces cyaneofuscatus M-27 and Streptomyces carnosus M-40. Similar isolates to these two strains were also associated with corals and other invertebrates from deep-sea coral reef ecosystems (Phylum Cnidaria, Echinodermata, Arthropoda, Sipuncula, and Annelida) living up to 4,700-m depth in the submarine Avilés Canyon, thus revealing their barotolerant feature. These two strains were also found to colonize terrestrial lichens and have been repeatedly isolated from precipitations from tropospheric clouds. Compounds with antibiotic and cytotoxic activities produced by these strains were identified by high-performance liquid chromatography (HPLC) and database comparison. Antitumor compounds with antibacterial activities and members of the anthracycline family (daunomycin, cosmomycin B, galtamycin B), antifungals (maltophilins), and anti-inflammatory molecules also with antituberculosis properties (lobophorins) were identified in this work. Many other compounds produced by the studied strains still remain unidentified, suggesting that Streptomyces associated with algae and coral ecosystems might represent an underexplored promising source for pharmaceutical drug discovery.
Scala, Giovanni; Affinito, Ornella; Palumbo, Domenico; Florio, Ermanno; Monticelli, Antonella; Miele, Gennaro; Chiariotti, Lorenzo; Cocozza, Sergio
2016-11-25
CpG sites in an individual molecule may exist in a binary state (methylated or unmethylated) and each individual DNA molecule, containing a certain number of CpGs, is a combination of these states defining an epihaplotype. Classic quantification based approaches to study DNA methylation are intrinsically unable to fully represent the complexity of the underlying methylation substrate. Epihaplotype based approaches, on the other hand, allow methylation profiles of cell populations to be studied at the single molecule level. For such investigations, next-generation sequencing techniques can be used, both for quantitative and for epihaplotype analysis. Currently available tools for methylation analysis lack output formats that explicitly report CpG methylation profiles at the single molecule level and that have suited statistical tools for their interpretation. Here we present ampliMethProfiler, a python-based pipeline for the extraction and statistical epihaplotype analysis of amplicons from targeted deep bisulfite sequencing of multiple DNA regions. ampliMethProfiler tool provides an easy and user friendly way to extract and analyze the epihaplotype composition of reads from targeted bisulfite sequencing experiments. ampliMethProfiler is written in python language and requires a local installation of BLAST and (optionally) QIIME tools. It can be run on Linux and OS X platforms. The software is open source and freely available at http://amplimethprofiler.sourceforge.net .
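The distinction between quantification-based and epihaplotype-based views can be made concrete with a minimal sketch. It does not use or reproduce ampliMethProfiler; the function names and the two toy read populations are hypothetical, and each read is assumed to be already reduced to a string of binary CpG states.

```python
from collections import Counter


def average_methylation(reads):
    """Per-CpG methylation fraction across reads (the classic quantitative view).

    Each read is a string over {'0', '1'}: '1' = methylated, '0' = unmethylated.
    """
    n_sites = len(reads[0])
    return [sum(r[i] == "1" for r in reads) / len(reads) for i in range(n_sites)]


def epihaplotype_counts(reads):
    """Number of molecules carrying each full methylation pattern (epihaplotype)."""
    return Counter(reads)


# Two hypothetical cell populations, two CpG sites per molecule.
pop_a = ["11", "00", "11", "00"]  # sites methylated together
pop_b = ["10", "01", "10", "01"]  # sites methylated in a mutually exclusive way

print(average_methylation(pop_a), average_methylation(pop_b))
print(epihaplotype_counts(pop_a), epihaplotype_counts(pop_b))
```

Both populations show the same per-site averages, yet their epihaplotype compositions differ, which is the information a purely quantitative summary cannot represent.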
Yam, Alice Wei Yee; Colmant, Agathe M. G.; McLean, Breeanna J.; Prow, Natalie A.; Watterson, Daniel; Hall-Mendelin, Sonja; Warrilow, David; Ng, Mah-Lee; Khromykh, Alexander A.; Hall, Roy A.
2015-01-01
Mosquito-borne viruses encompass a range of virus families, comprising a number of significant human pathogens (e.g., dengue viruses, West Nile virus, Chikungunya virus). Virulent strains of these viruses are continually evolving and expanding their geographic range, thus rapid and sensitive screening assays are required to detect emerging viruses and monitor their prevalence and spread in mosquito populations. Double-stranded RNA (dsRNA) is produced during the replication of many of these viruses as either an intermediate in RNA replication (e.g., flaviviruses, togaviruses) or the double-stranded RNA genome (e.g., reoviruses). Detection and discovery of novel viruses from field and clinical samples usually relies on recognition of antigens or nucleotide sequences conserved within a virus genus or family. However, due to the wide antigenic and genetic variation within and between viral families, many novel or divergent species can be overlooked by these approaches. We have developed two monoclonal antibodies (mAbs) which show co-localised staining with proteins involved in viral RNA replication in immunofluorescence assay (IFA), suggesting specific reactivity to viral dsRNA. By assessing binding against a panel of synthetic dsRNA molecules, we have shown that these mAbs recognise dsRNA greater than 30 base pairs in length in a sequence-independent manner. IFA and enzyme-linked immunosorbent assay (ELISA) were employed to demonstrate detection of a panel of RNA viruses from several families, in a range of cell types. These mAbs, termed monoclonal antibodies to viral RNA intermediates in cells (MAVRIC), have now been incorporated into a high-throughput, economical ELISA-based screening system for the detection and discovery of viruses from mosquito populations. 
Our results have demonstrated that this simple system enables the efficient detection and isolation of a range of known and novel viruses in cells inoculated with field-caught mosquito samples, and represents a rapid, sequence-independent, and cost-effective approach to virus discovery. PMID:25799391
kpLogo: positional k-mer analysis reveals hidden specificity in biological sequences
2017-01-01
Motifs of only 1–4 letters can play important roles when present at key locations within macromolecules. Because existing motif-discovery tools typically miss these position-specific short motifs, we developed kpLogo, a probability-based logo tool for integrated detection and visualization of position-specific ultra-short motifs from a set of aligned sequences. kpLogo also overcomes the limitations of conventional motif-visualization tools in handling positional interdependencies and utilizing ranked or weighted sequences increasingly available from high-throughput assays. kpLogo can be found at http://kplogo.wi.mit.edu/. PMID:28460012
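The idea of a position-specific k-mer can be sketched by simply tallying which k-mers start at each column of a set of aligned sequences. This is only the raw counting step, not kpLogo's probability-based scoring; the function name and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def positional_kmer_counts(seqs, k):
    """Map each start position to a Counter of the k-mers observed there.

    Assumes all sequences are aligned, so position i means the same column
    in every sequence.
    """
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[i][s[i:i + k]] += 1
    return counts


# Hypothetical aligned sequences of equal length.
aligned = ["ACGTA", "ACGGA", "TCGTA", "ACCTA"]
counts = positional_kmer_counts(aligned, k=2)

for pos in sorted(counts):
    print(pos, counts[pos].most_common(1))
```

A real tool would compare such counts against a background model to decide whether a k-mer is enriched at a position; that statistical step is deliberately left out here.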
Rathe, Susan K; Moriarity, Branden S; Stoltenberg, Christopher B; Kurata, Morito; Aumann, Natalie K; Rahrmann, Eric P; Bailey, Natashay J; Melrose, Ellen G; Beckmann, Dominic A; Liska, Chase R; Largaespada, David A
2014-08-13
The evolution from microarrays to transcriptome deep-sequencing (RNA-seq) and from RNA interference to gene knockouts using Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) and Transcription Activator-Like Effector Nucleases (TALENs) has provided a new experimental partnership for identifying and quantifying the effects of gene changes on drug resistance. Here we describe the results from deep-sequencing of RNA derived from two cytarabine (Ara-C) resistant acute myeloid leukemia (AML) cell lines, and present CRISPR and TALEN based methods for accomplishing complete gene knockout (KO) in AML cells. We found protein modifying loss-of-function mutations in Dck in both Ara-C resistant cell lines. CRISPR and TALEN-based KO of Dck dramatically increased the IC₅₀ of Ara-C and introduction of a DCK overexpression vector into Dck KO clones resulted in a significant increase in Ara-C sensitivity. This effort demonstrates the power of using transcriptome analysis and CRISPR/TALEN-based KOs to identify and verify genes associated with drug resistance.
RNAbrowse: RNA-Seq de novo assembly results browser.
Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe
2014-01-01
Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs, are usually bulky and quite diverse. This article presents a user-oriented storage and visualisation environment that allows the data to be explored in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart and is easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse.
Open discovery: An integrated live Linux platform of Bioinformatics tools.
Vetrivel, Umashankar; Pilla, Kalabharath
2008-01-01
Historically, live Linux distributions for Bioinformatics have paved the way for portability of the Bioinformatics workbench in a platform-independent manner. However, most of the existing live Linux distributions limit their usage to sequence analysis and basic molecular visualization programs and are devoid of data persistence. Hence, Open Discovery, a live Linux distribution, has been developed with the capability to perform complex tasks like molecular modeling, docking and molecular dynamics in a swift manner. Furthermore, it is also equipped with a complete sequence analysis environment and is capable of running Windows executable programs in a Linux environment. Open Discovery portrays the advanced customizable configuration of Fedora, with data persistence accessible via USB drive or DVD. Open Discovery is distributed free under the Academic Free License (AFL) and can be downloaded from http://www.OpenDiscovery.org.in.
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.
Li, Yifeng; Shi, Wenqiang; Wasserman, Wyeth W
2018-05-31
In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. The developments of high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES, based on supervised deep learning approaches, for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates the potential of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
Zhao, Zijian; Voros, Sandrine; Weng, Ying; Chang, Faliang; Li, Ruijian
2017-12-01
Worldwide propagation of minimally invasive surgeries (MIS) is hindered by their drawback of indirect observation and manipulation, while monitoring of surgical instruments moving in the operated body required by surgeons is a challenging problem. Tracking of surgical instruments by vision-based methods is quite attractive, due to its flexible implementation via software-based control with no need to modify instruments or surgical workflow. A MIS instrument is conventionally split into a shaft and end-effector portions, and a 2D/3D tracking-by-detection framework is proposed, which performs the shaft tracking followed by the end-effector one. The former portion is described by line features via the RANSAC scheme, while the latter is depicted by special image features based on deep learning through a well-trained convolutional neural network. The method verification in 2D and 3D formulation is performed through experiments on ex-vivo video sequences, while qualitative validation on in-vivo video sequences is obtained. The proposed method provides robust and accurate tracking, which is confirmed by the experimental results: its 3D performance in ex-vivo video sequences exceeds that of the available state-of-the-art methods. Moreover, the experiments on in-vivo sequences demonstrate that the proposed method can tackle the difficult condition of tracking with unknown camera parameters. Further refinements of the method will address occlusion and multi-instrument MIS applications.
Shahinas, Dea; Silverman, Michael; Sittler, Taylor; Chiu, Charles; Kim, Peter; Allen-Vercoe, Emma; Weese, Scott; Wong, Andrew; Low, Donald E.; Pillai, Dylan R.
2012-01-01
Fecal microbiome transplantation by low-volume enema is an effective, safe, and inexpensive alternative to antibiotic therapy for patients with chronic relapsing Clostridium difficile infection (CDI). We explored the microbial diversity of pre- and posttransplant stool specimens from CDI patients (n = 6) using deep sequencing of the 16S rRNA gene. While interindividual variability in microbiota change occurs with fecal transplantation and vancomycin exposure, in this pilot study we note that clinical cure of CDI is associated with an increase in diversity and richness. Genus- and species-level analysis may reveal a cocktail of microorganisms or products thereof that will ultimately be used as a probiotic to treat CDI. PMID:23093385
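"Diversity" and "richness" are commonly summarized from per-taxon read counts. The abstract does not say which indices were used, so the sketch below assumes observed richness and the Shannon index, and the two count vectors are invented purely for illustration.

```python
import math


def richness(counts):
    """Observed richness: number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical 16S read counts per taxon (not data from the study).
pre_transplant = [97, 2, 1, 0, 0]       # dominated by a single taxon
post_transplant = [30, 25, 20, 15, 10]  # more taxa, more evenly distributed

for label, counts in (("pre", pre_transplant), ("post", post_transplant)):
    print(label, richness(counts), round(shannon_diversity(counts), 3))
```

With these invented counts, both richness and the Shannon index are higher for the post-transplant sample, mirroring the direction of change the abstract reports.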
Petitjean, Céline; Deschamps, Philippe; López-García, Purificación; Moreira, David
2014-12-19
The first 16S rRNA-based phylogenies of the Archaea showed a deep division between two groups, the kingdoms Euryarchaeota and Crenarchaeota. This bipartite classification has been challenged by the recent discovery of new deeply branching lineages (e.g., Thaumarchaeota, Aigarchaeota, Nanoarchaeota, Korarchaeota, Parvarchaeota, Aenigmarchaeota, Diapherotrites, and Nanohaloarchaeota) which have also been given the same taxonomic status of kingdoms. However, the phylogenetic position of some of these lineages is controversial. In addition, phylogenetic analyses of the Archaea have often been carried out without outgroup sequences, making it difficult to determine if these taxa actually define lineages at the same level as the Euryarchaeota and Crenarchaeota. We have addressed the question of the position of the root of the Archaea by reconstructing rooted archaeal phylogenetic trees using bacterial sequences as outgroup. These trees were based on commonly used conserved protein markers (32 ribosomal proteins) as well as on 38 new markers identified through phylogenomic analysis. We thus gathered a total of 70 conserved markers that we analyzed as a concatenated data set. In contrast with previous analyses, our trees consistently placed the root of the archaeal tree between the Euryarchaeota (including the Nanoarchaeota and other fast-evolving lineages) and the rest of archaeal species, which we propose to class within the new kingdom Proteoarchaeota. This implies the relegation of several groups previously classified as kingdoms (e.g., Crenarchaeota, Thaumarchaeota, Aigarchaeota, and Korarchaeota) to a lower taxonomic rank. In addition to taxonomic implications, this profound reorganization of the archaeal phylogeny has also consequences on our appraisal of the nature of the last archaeal ancestor, which most likely was a complex organism with a gene-rich genome. © The Author(s) 2014. 
Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Petitjean, Céline; Deschamps, Philippe; López-García, Purificación; Moreira, David
2015-01-01
The first 16S rRNA-based phylogenies of the Archaea showed a deep division between two groups, the kingdoms Euryarchaeota and Crenarchaeota. This bipartite classification has been challenged by the recent discovery of new deeply branching lineages (e.g., Thaumarchaeota, Aigarchaeota, Nanoarchaeota, Korarchaeota, Parvarchaeota, Aenigmarchaeota, Diapherotrites, and Nanohaloarchaeota) which have also been given the same taxonomic status of kingdoms. However, the phylogenetic position of some of these lineages is controversial. In addition, phylogenetic analyses of the Archaea have often been carried out without outgroup sequences, making it difficult to determine if these taxa actually define lineages at the same level as the Euryarchaeota and Crenarchaeota. We have addressed the question of the position of the root of the Archaea by reconstructing rooted archaeal phylogenetic trees using bacterial sequences as outgroup. These trees were based on commonly used conserved protein markers (32 ribosomal proteins) as well as on 38 new markers identified through phylogenomic analysis. We thus gathered a total of 70 conserved markers that we analyzed as a concatenated data set. In contrast with previous analyses, our trees consistently placed the root of the archaeal tree between the Euryarchaeota (including the Nanoarchaeota and other fast-evolving lineages) and the rest of archaeal species, which we propose to class within the new kingdom Proteoarchaeota. This implies the relegation of several groups previously classified as kingdoms (e.g., Crenarchaeota, Thaumarchaeota, Aigarchaeota, and Korarchaeota) to a lower taxonomic rank. In addition to taxonomic implications, this profound reorganization of the archaeal phylogeny has also consequences on our appraisal of the nature of the last archaeal ancestor, which most likely was a complex organism with a gene-rich genome. PMID:25527841
Rudnick, Paul A.; Markey, Sanford P.; Roth, Jeri; Mirokhin, Yuri; Yan, Xinjian; Tchekhovskoi, Dmitrii V.; Edwards, Nathan J.; Thangudu, Ratna R.; Ketchum, Karen A.; Kinsinger, Christopher R.; Mesri, Mehdi; Rodriguez, Henry; Stein, Stephen E.
2016-01-01
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has produced large proteomics datasets from the mass spectrometric interrogation of tumor samples previously analyzed by The Cancer Genome Atlas (TCGA) program. The availability of the genomic and proteomic data is enabling proteogenomic study for both reference (i.e., contained in major sequence databases) and non-reference markers of cancer. The CPTAC labs have focused on colon, breast, and ovarian tissues in the first round of analyses; spectra from these datasets were produced from 2D LC-MS/MS analyses and represent deep coverage. To reduce the variability introduced by disparate data analysis platforms (e.g., software packages, versions, parameters, sequence databases, etc.), the CPTAC Common Data Analysis Platform (CDAP) was created. The CDAP produces both peptide-spectrum-match (PSM) reports and gene-level reports. The pipeline processes raw mass spectrometry data according to the following: (1) Peak-picking and quantitative data extraction, (2) database searching, (3) gene-based protein parsimony, and (4) false discovery rate (FDR)-based filtering. The pipeline also produces localization scores for the phosphopeptide enrichment studies using the PhosphoRS program. Quantitative information for each of the datasets is specific to the sample processing, with PSM and protein reports containing the spectrum-level or gene-level (“rolled-up”) precursor peak areas and spectral counts for label-free or reporter ion log-ratios for 4plex iTRAQ™. The reports are available in simple tab-delimited formats and, for the PSM-reports, in mzIdentML. The goal of the CDAP is to provide standard, uniform reports for all of the CPTAC data, enabling comparisons between different samples and cancer types as well as across the major ‘omics fields. PMID:26860878
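Step (4), FDR-based filtering, can be sketched with the widely used target-decoy estimate. The abstract does not state how the CDAP computes its FDR, so the decoy-based estimate, the function name, and the toy scores below are all assumptions for illustration, not the CDAP implementation.

```python
def filter_by_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the lowest score cutoff whose estimated FDR
    (decoys / targets among accepted PSMs) does not exceed max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    best_cut = 0
    targets = decoys = 0
    for i, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate is still acceptable.
        if targets and decoys / targets <= max_fdr:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p[1]]


# Hypothetical PSMs: (score, is_decoy).
psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(filter_by_fdr(psms, max_fdr=0.25))
```

Scanning the whole ranked list, rather than stopping at the first violation, lets the cutoff recover when later target hits bring the running estimate back under the threshold.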
Rudnick, Paul A; Markey, Sanford P; Roth, Jeri; Mirokhin, Yuri; Yan, Xinjian; Tchekhovskoi, Dmitrii V; Edwards, Nathan J; Thangudu, Ratna R; Ketchum, Karen A; Kinsinger, Christopher R; Mesri, Mehdi; Rodriguez, Henry; Stein, Stephen E
2016-03-04
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has produced large proteomics data sets from the mass spectrometric interrogation of tumor samples previously analyzed by The Cancer Genome Atlas (TCGA) program. The availability of the genomic and proteomic data is enabling proteogenomic study for both reference (i.e., contained in major sequence databases) and nonreference markers of cancer. The CPTAC laboratories have focused on colon, breast, and ovarian tissues in the first round of analyses; spectra from these data sets were produced from 2D liquid chromatography-tandem mass spectrometry analyses and represent deep coverage. To reduce the variability introduced by disparate data analysis platforms (e.g., software packages, versions, parameters, sequence databases, etc.), the CPTAC Common Data Analysis Platform (CDAP) was created. The CDAP produces both peptide-spectrum-match (PSM) reports and gene-level reports. The pipeline processes raw mass spectrometry data according to the following: (1) peak-picking and quantitative data extraction, (2) database searching, (3) gene-based protein parsimony, and (4) false-discovery rate-based filtering. The pipeline also produces localization scores for the phosphopeptide enrichment studies using the PhosphoRS program. Quantitative information for each of the data sets is specific to the sample processing, with PSM and protein reports containing the spectrum-level or gene-level ("rolled-up") precursor peak areas and spectral counts for label-free or reporter ion log-ratios for 4plex iTRAQ. The reports are available in simple tab-delimited formats and, for the PSM-reports, in mzIdentML. The goal of the CDAP is to provide standard, uniform reports for all of the CPTAC data to enable comparisons between different samples and cancer types as well as across the major omics fields.
NASA Technical Reports Server (NTRS)
Elardo, S. M.; Shearer, C. K.; McCubbin, F. M.
2017-01-01
The lunar magnesian-suite, or Mg-suite, is a series of ancient plutonic rocks from the lunar crust. They have received a considerable amount of attention from lunar scientists since their discovery for three primary reasons: 1) their ages and geochemistry indicate they represent pristine magmatic samples that crystallized very soon after the formation of the Moon; 2) their ages often overlap with ages of the ferroan anorthosite (FAN) crust; and 3) planetary-scale processes are needed in formation models to account for their unique geochemical features. Taken as a whole, the Mg-suite samples, as magmatic cumulate rocks, approximate a fractional crystallization sequence in the low-pressure forsterite-anorthite-silica system, and thus these samples are generally thought to be derived from layered mafic intrusions which crystallized very slowly from magmas that intruded the anorthositic crust. However, no direct linkages have been established between different Mg-suite samples based either on field relationships or geochemistry. The model for the origin of the Mg-suite, which best fits the limited available data, is one where Mg-suite magmas form from melting of a hybrid cumulate package consisting of deep mantle dunite, crustal anorthosite, and KREEP (potassium-rare earth elements-phosphorus) at the base of the crust under the Procellarum KREEP Terrane (PKT). In this model, these three LMO (Lunar Magma Ocean) cumulate components are brought into close proximity by the cumulate overturn process. Deep mantle dunitic cumulates with an Mg number of approximately 90 rise to the base of the anorthositic crust due to their buoyancy relative to colder, more dense Fe- and Ti-rich cumulates. This hybridized source rock melts to form Mg-suite magmas, saturated in Mg-rich olivine and anorthitic plagioclase, that have a substantial KREEP component.
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
A Quaternary paleolake in a sinkhole at Cassis (SE France) : a geomorphology and geophysical study
NASA Astrophysics Data System (ADS)
Romey, C.; Rochette, P.; Vella, C.; Arfib, B.; Champollion, C.; Dussouillez, P.; Hermitte, D.; Parisot, J.-C.
2012-04-01
Lower Provence and the Massif des Calanques, near Marseille, are a key area for understanding the mechanisms of evolution of the Mediterranean climate and for studying human impact on the local environment during the Quaternary. However, a continuous continental record of paleoenvironment in coastal Provence was not previously available. Looking for such a record, we discovered in a coastal alluvial plain a small paleolake filling a sinkhole that formed in a marl sequence topping pure limestones at an altitude of 80 m and a distance of 2 km from the sea. The sinkhole is close to the outlet of a small catchment area of about 8 km2. The limestone is massive but highly fractured and therefore suitable for the development of karst. The 50-m drilled sedimentary sequence results mainly from the weathering of Cretaceous marls. It consists of 5 meters of oxidized brown clay deposit which covers 45 meters of laminated lacustrine gray clay with sandy layers. Cretaceous marls are at the base of the sequence. The presence of marl pebbles in the last meters of the sequence reflects the collapse of the sinkhole. The lacustrine clay was probably deposited during isotope stages 2 to 4 (48 ± 3 ka C14 date at 23 meters depth), whereas the brown clay deposit was interpreted as a Holocene paleosol. The combination of surface observation, drilling and geophysical studies (gravimetry and Electrical Resistivity Tomography) constrains the geometry of the paleo-polje that formed during the glacial period. The lake diameter was likely of the order of 200 m. It evolved from a deep lake to a swamp (probably Holocene, dating in progress) and it was drained in Roman times for agriculture. Locally, this discovery has implications for the understanding of karst processes and water resources. 
The relationship between the sinkhole, rooted at circa 100 m below the surface according to gravimetric modeling, and the underground karstic river of Bestouan is strongly suggested by underwater exploration and hydrogeologic investigations.
Lake Vostok: An earthly analogue for the geomicrobiology on Europa
NASA Astrophysics Data System (ADS)
Priscu, J. C.; Christner, B. C.
2007-12-01
The recent discovery of more than 150 subglacial lakes beneath the Antarctic ice sheet has important implications for our search for liquid water and associated life on other icy worlds. The largest of these lakes is Lake Vostok, which has a surface area of 14000 square km and a depth of 1000 m, making it one of the largest lakes on Earth. Although we have yet to sample directly the liquid water from any of the Antarctic subglacial lakes, refrozen lakewater (accretion ice) has been sampled just above the surface of Lake Vostok. Genomic and geochemical analysis of this ice reveals that the surface lake water supports a microbial assemblage with a density approaching 1000 cells per milliliter. Sequencing and phylogenetic analysis of the 900 to 1000 base pair small subunit rRNA gene sequences obtained revealed a low diversity of clones that classify within the beta, gamma and delta subdivisions of the phylum Proteobacteria. Nearest phylogenetic neighbor analysis of these gene sequences implies that the lake contains an aerobic and anaerobic consortium of bacteria with metabolisms dedicated to iron and sulfur respiration or oxidation, indicating that these elements play a role in the bioenergetics of microorganisms that occur in Lake Vostok. Sequence analysis further revealed that heterotrophic life in the lake can be sustained by chemolithotrophic production of new carbon supplemented by dissolved organic carbon released from the overlying ice sheet. Data obtained from orbiters have revealed that a deep ocean of liquid water lies under a thick chaotic ice cover on Europa, where organic matter derived from comets and oxidants provided by radiation from Jupiter's magnetosphere may provide a habitat for life and a reservoir of endogenous and exogenous substances much like we observe in Lake Vostok. Future studies of Antarctic subglacial lake environments will play a crucial role in our understanding of life on Europa and other frozen worlds.
Petroleum geology and resources of the North Caspian Basin, Kazakhstan and Russia
Ulmishek, Gregory F.
2001-01-01
The North Caspian basin is a petroleum-rich but lightly explored basin located in Kazakhstan and Russia. It occupies the shallow northern portion of the Caspian Sea and a large plain to the north of the sea between the Volga and Ural Rivers and farther east to the Mugodzhary Highland, which is the southern continuation of the Ural foldbelt. The basin is bounded by the Paleozoic carbonate platform of the Volga-Ural province to the north and west and by the Ural, South Emba, and Karpinsky Hercynian foldbelts to the east and south. The basin originated from pre-Late Devonian rifting and subsequent spreading that opened the oceanic crust, but the precise time of these tectonic events is not known. The sedimentary succession of the basin is more than 20 km thick in the central areas. The drilled Upper Devonian to Tertiary part of this succession includes a prominent thick Kungurian (uppermost Lower Permian) salt formation that separates strata into the subsalt and suprasalt sequences and played an important role in the formation of oil and gas fields. Shallow-shelf carbonate formations that contain various reefs and alternate with clastic wedges compose the subsalt sequence on the basin margins. Basinward, these rocks grade into deep-water anoxic black shales and turbidites. The Kungurian salt formation is strongly deformed into domes and intervening depressions. The most active halokinesis occurred during Late Permian–Triassic time, but growth of salt domes continued later and some of them are exposed on the present-day surface. The suprasalt sequence is mostly composed of clastic rocks that are several kilometers thick in depressions between salt domes. A single total petroleum system is defined in the North Caspian basin. Discovered reserves are about 19.7 billion barrels of oil and natural gas liquids and 157 trillion cubic feet of gas. Much of the reserves are concentrated in the supergiant Tengiz, Karachaganak, and Astrakhan fields. 
A recent new oil discovery on the Kashagan structure offshore in the Caspian Sea is probably also of the supergiant status. Major oil and gas reserves are located in carbonate reservoirs in reefs and structural traps of the subsalt sequence. Substantially smaller reserves are located in numerous fields in the suprasalt sequence. These suprasalt fields are largely in shallow Jurassic and Cretaceous clastic reservoirs in salt dome-related traps. Petroleum source rocks are poorly identified by geochemical methods. However, geologic data indicate that the principal source rocks are Upper Devonian to Lower Permian deep-water black-shale facies stratigraphically correlative to shallow-shelf carbonate platforms on the basin margins. The main stage of hydrocarbon generation was probably in Late Permian and Triassic time, during deposition of thick orogenic clastics. Generated hydrocarbons migrated laterally into adjacent subsalt reservoirs and vertically, through depressions between Kungurian salt domes where the salt is thin or absent, into suprasalt clastic reservoirs. Six assessment units have been identified in the North Caspian basin. Four of them include Paleozoic subsalt rocks of the basin margins, and a fifth unit, which encompasses the entire total petroleum system area, includes the suprasalt sequence. All five of these assessment units are underexplored and have significant potential for new discoveries. Most undiscovered petroleum resources are expected in Paleozoic subsalt carbonate rocks. The assessment unit in subsalt rocks with the greatest undiscovered potential occupies the south basin margin. Petroleum potential of suprasalt rocks is lower; however, discoveries of many small to medium size fields are expected. The sixth identified assessment unit embraces subsalt rocks of the central basin areas. The top of subsalt rocks in these areas occurs at depths ranging from 7 to 10 kilometers and has not been reached by wells. 
Undiscovered resources of this unit did not rec
Pan, Xiaoyong; Shen, Hong-Bin
2017-02-28
RNAs play key roles in cells through their interactions with proteins known as RNA-binding proteins (RBPs), and their binding motifs are crucial to understanding the post-transcriptional regulation of RNAs. How RBPs correctly recognize their target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this process. Although many automatic tools have been developed to predict RNA-protein binding sites from the rapidly growing multi-resource data, e.g. sequence and structure, their domain-specific features and formats have posed significant computational challenges. One current difficulty is that the common knowledge shared across sources sits at a higher abstraction level than the observed data, so direct integration of observed data across domains is inefficient. The other difficulty is how to interpret the prediction results. Existing approaches tend to terminate after outputting the potential discrete binding sites on the sequences, but how to assemble them into meaningful binding motifs is a topic worthy of further investigation. In view of these challenges, we propose a deep learning-based framework (iDeep) that uses a novel hybrid convolutional neural network and deep belief network to predict RBP interaction sites and motifs on RNAs. This new protocol transforms the original observed data into a high-level abstraction feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. 
To validate our iDeep method, we performed experiments on 31 large-scale CLIP-seq datasets, and our results show that by integrating multiple sources of data, the average AUC can be improved by 8% compared to the best single-source-based predictor; and through cross-domain knowledge integration at an abstraction level, it outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with the experimentally verified results, suggesting iDeep is a promising approach in real-world applications. The iDeep framework not only achieves better performance than the state-of-the-art predictors, but also easily captures interpretable binding motifs. iDeep is available at http://www.csbio.sjtu.edu.cn/bioinf/iDeep.
A Sensitive Assay for Virus Discovery in Respiratory Clinical Samples
de Vries, Michel; Deijs, Martin; Canuti, Marta; van Schaik, Barbera D. C.; Faria, Nuno R.; van de Garde, Martijn D. B.; Jachimowski, Loes C. M.; Jebbink, Maarten F.; Jakobs, Marja; Luyf, Angela C. M.; Coenjaerts, Frank E. J.; Claas, Eric C. J.; Molenkamp, Richard; Koekkoek, Sylvie M.; Lammens, Christine; Leus, Frank; Goossens, Herman; Ieven, Margareta; Baas, Frank; van der Hoek, Lia
2011-01-01
In 5–40% of respiratory infections in children, the diagnostics remain negative, suggesting that the patients might be infected with a yet unknown pathogen. Virus discovery cDNA-AFLP (VIDISCA) is a virus discovery method based on recognition of restriction enzyme cleavage sites, ligation of adaptors and subsequent amplification by PCR. However, direct discovery of unknown pathogens in nasopharyngeal swabs is difficult due to the high concentration of ribosomal RNA (rRNA) that acts as competitor. In the current study we optimized VIDISCA by adjusting the reverse transcription enzymes and decreasing rRNA amplification in the reverse transcription, using hexamer oligonucleotides that do not anneal to rRNA. Residual cDNA synthesis on rRNA templates was further reduced with oligonucleotides that anneal to rRNA but can not be extended due to 3′-dideoxy-C6-modification. With these modifications >90% reduction of rRNA amplification was established. Further improvement of the VIDISCA sensitivity was obtained by high throughput sequencing (VIDISCA-454). Eighteen nasopharyngeal swabs were analysed, all containing known respiratory viruses. We could identify the proper virus in the majority of samples tested (11/18). The median load in the VIDISCA-454 positive samples was 7.2 E5 viral genome copies/ml (ranging from 1.4 E3–7.7 E6). Our results show that optimization of VIDISCA and subsequent high-throughput-sequencing enhances sensitivity drastically and provides the opportunity to perform virus discovery directly in patient material. PMID:21283679
Lonardi, Stefano; Mirebrahim, Hamid; Wanamaker, Steve; Alpert, Matthew; Ciardo, Gianfranco; Duma, Denisa; Close, Timothy J
2015-09-15
Since the invention of DNA sequencing in the 70s, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing. We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones (in the context of the combinatorial pooling design we have recently proposed), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on 'divide and conquer': we 'slice' a large dataset into smaller samples of optimal size, decode each slice independently, and then merge the results. Experimental results on over 15 000 barley BACs and over 4000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data. Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software Hashfilter can be downloaded from http://goo.gl/MIyZHs. Contact: stelo@cs.ucr.edu or timothy.close@ucr.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
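The 'slice, decode, merge' idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: `decode_slice` is a hypothetical callback standing in for the authors' decoder (not reproduced here), the slice size is an arbitrary parameter rather than the 'optimal size' they determine, and the merge step is a plain dictionary union with no conflict resolution.

```python
def slice_decode_merge(reads, slice_size, decode_slice):
    """Split reads into consecutive slices, decode each slice on its own,
    and merge the per-slice results into one dictionary.

    decode_slice is a hypothetical callback: it takes a list of reads and
    returns a dict mapping each read to its decoded label.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    merged = {}
    for start in range(0, len(reads), slice_size):
        chunk = reads[start:start + slice_size]
        # Each slice is decoded independently of the others.
        merged.update(decode_slice(chunk))
    return merged


def toy_decoder(chunk):
    """Toy stand-in decoder: label each read by its first base."""
    return {read: read[0] for read in chunk}


result = slice_decode_merge(["ACGT", "CGTA", "GTAC", "TACG", "AACC"], 2, toy_decoder)
```

Because every read falls into exactly one slice, the union is unambiguous here; a real pipeline would need the conflict-resolution step the abstract mentions.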
Simbolo, Michele; Mafficini, Andrea; Sikora, Katarzyna O; Fassan, Matteo; Barbi, Stefano; Corbo, Vincenzo; Mastracci, Luca; Rusev, Borislav; Grillo, Federica; Vicentini, Caterina; Ferrara, Roberto; Pilotto, Sara; Davini, Federico; Pelosi, Giuseppe; Lawlor, Rita T; Chilosi, Marco; Tortora, Giampaolo; Bria, Emilio; Fontanini, Gabriella; Volante, Marco; Scarpa, Aldo
2017-03-01
Next-generation sequencing (NGS) was applied to 148 lung neuroendocrine tumours (LNETs) comprising the four World Health Organization classification categories: 53 typical carcinoid (TCs), 35 atypical carcinoid (ACs), 27 large-cell neuroendocrine carcinomas, and 33 small-cell lung carcinomas. A discovery screen was conducted on 46 samples by the use of whole-exome sequencing and high-coverage targeted sequencing of 418 genes. Eighty-eight recurrently mutated genes from both the discovery screen and current literature were verified in the 46 cases of the discovery screen, and validated on additional 102 LNETs by targeted NGS; their prevalence was then evaluated on the whole series. Thirteen of these 88 genes were also evaluated for copy number alterations (CNAs). Carcinoids and carcinomas shared most of the altered genes but with different prevalence rates. When mutations and copy number changes were combined, MEN1 alterations were almost exclusive to carcinoids, whereas alterations of TP53 and RB1 cell cycle regulation genes and PI3K/AKT/mTOR pathway genes were significantly enriched in carcinomas. Conversely, mutations in chromatin-remodelling genes, including those encoding histone modifiers and members of SWI-SNF complexes, were found at similar rates in carcinoids (45.5%) and carcinomas (55.0%), suggesting a major role in LNET pathogenesis. One AC and one TC showed a hypermutated profile associated with a POLQ damaging mutation. There were fewer CNAs in carcinoids than in carcinomas; however ACs showed a hybrid pattern, whereby gains of TERT, SDHA, RICTOR, PIK3CA, MYCL and SRC were found at rates similar to those in carcinomas, whereas the MEN1 loss rate mirrored that of TCs. Multivariate survival analysis revealed RB1 mutation (p = 0.0005) and TERT copy gain (p = 0.016) as independent predictors of poorer prognosis. MEN1 mutation was associated with poor prognosis in AC (p = 0.0045), whereas KMT2D mutation correlated with longer survival in SCLC (p = 0.0022). 
In conclusion, molecular profiling may complement histology for better diagnostic definition and prognostic stratification of LNETs. © 2016 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of Pathological Society of Great Britain and Ireland.
DNA Replication Profiling Using Deep Sequencing.
Saayman, Xanita; Ramos-Pérez, Cristina; Brown, Grant W
2018-01-01
Profiling of DNA replication during progression through S phase allows a quantitative snap-shot of replication origin usage and DNA replication fork progression. We present a method for using deep sequencing data to profile DNA replication in S. cerevisiae.
Diversity of Bacillus-like organisms isolated from deep-sea hypersaline anoxic sediments
Sass, Andrea M; McKew, Boyd A; Sass, Henrik; Fichtel, Jörg; Timmis, Kenneth N; McGenity, Terry J
2008-01-01
Background: The deep-sea, hypersaline anoxic brine lakes in the Mediterranean are among the most extreme environments on earth, and in one of them, the MgCl2-rich Discovery basin, the presence of active microbes is equivocal. However, thriving microbial communities have been detected especially in the chemocline between deep seawater and three NaCl-rich brine lakes, l'Atalante, Bannock and Urania. By contrast, the microbiota of these brine-lake sediments remains largely unexplored. Results: Eighty-nine isolates were obtained from the sediments of four deep-sea, hypersaline anoxic brine lakes in the Eastern Mediterranean Sea: l'Atalante, Bannock, Discovery and Urania basins. This culture collection was dominated by representatives of the genus Bacillus and close relatives (90% of all isolates) that were investigated further. Physiological characterization of representative strains revealed large versatility with respect to enzyme activities or substrate utilization. Two thirds of the isolates did not grow at in-situ salinities and were presumably present as endospores. This is supported by high numbers of endospores in Bannock, Discovery and Urania basins ranging from 3.8 × 10⁵ to 1.2 × 10⁶ g⁻¹ dw sediment. However, the remaining isolates were highly halotolerant, growing at salinities of up to 30% NaCl. Some of the novel isolates affiliating with the genus Pontibacillus grew well under anoxic conditions in sulfidic medium by fermentation or anaerobic respiration using dimethylsulfoxide or trimethylamine N-oxide as electron acceptor. Conclusion: Some of the halophilic, facultatively anaerobic relatives of Bacillus appear well adapted to life in this hostile environment and suggest the presence of actively growing microbial communities in the NaCl-rich, deep-sea brine-lake sediments. PMID:18541011
Theories of the Earth and the Nature of Science.
ERIC Educational Resources Information Center
Williams, James
1991-01-01
Describes the history of the science of geology. The author expounds upon the discovery of deep time and plate tectonics, explaining how the theory of deep time influenced the development of Darwin and Wallace's theory of evolution. Describes how the history of earth science helps students understand the nature of science. (PR)
Jiang, Haojun; Xie, Yifan; Li, Xuchao; Ge, Huijuan; Deng, Yongqiang; Mu, Haofang; Feng, Xiaoli; Yin, Lu; Du, Zhou; Chen, Fang; He, Nongyue
2016-01-01
Short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs) have already been used to perform noninvasive prenatal paternity testing from maternal plasma DNA. The technologies most frequently used were PCR followed by capillary electrophoresis and SNP typing arrays, respectively. Here, we developed a noninvasive prenatal paternity test (NIPAT) based on SNP typing with maternal plasma DNA sequencing. We evaluated the influencing factors (minor allele frequency (MAF), the total number of SNPs, fetal fraction and effective sequencing depth) and designed three different selective SNP panels in order to verify the performance in clinical cases. Combining targeted deep sequencing of selected SNPs with an informative bioinformatics pipeline, we calculated the combined paternity index (CPI) of 17 cases to determine paternity. Sequencing-based NIPAT results fully agreed with the invasive prenatal paternity test using an STR multiplex system. Our study showed that maternal plasma DNA sequencing-based technology is feasible and accurate in determining paternity, and may provide an alternative for forensic applications in the future.
Inagaki, F; Takai, K; Komatsu, T; Kanamatsu, T; Fujioka, K; Horikoshi, K
2001-12-01
A record of the history of the Earth is hidden in the Earth's crust, like the annual rings of an old tree. From very limited records retrieved from deep underground, one can infer the geographical, geological, and biological events that occurred throughout Earth's history. Here we report the discovery of vertically shifted community structures of Archaea in a typical oceanic subseafloor core sample (1410 cm long) recovered from the West Philippine Basin at a depth of 5719 m. Beneath a surface community of ubiquitous deep-sea archaea (marine crenarchaeotic group I; MGI), an unusual archaeal community consisting of extremophilic archaea, such as extreme halophiles and hyperthermophiles, was present. These organisms could not be cultivated, and may be microbial relicts more than 2 million years old. Our discovery of archaeal rDNA in this core sample, probably associated with the past terrestrial volcanic and submarine hydrothermal activities surrounding the West Philippine Basin, serves as potential geomicrobiological evidence reflecting novel records of geologic thermal events in the Pleistocene period concealed in the deep-sea subseafloor.
DISCOVERY OF A FAINT QUASAR AT z ∼ 6 AND IMPLICATIONS FOR COSMIC REIONIZATION
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Yongjung; Im, Myungshin; Jeon, Yiseul
2015-11-10
Recent studies suggest that faint active galactic nuclei may be responsible for the reionization of the universe. Confirmation of this scenario requires spectroscopic identification of faint quasars (M_1450 > −24 mag) at z ≳ 6, but only a very small number of such quasars have been spectroscopically identified so far. Here, we report the discovery of a faint quasar IMS J220417.92+011144.8 at z ∼ 6 in a 12.5 deg² region of the SA22 field of the Infrared Medium-deep Survey (IMS). The spectrum of the quasar shows a sharp break at ∼8443 Å, with emission lines redshifted to z = 5.944 ± 0.002 and rest-frame ultraviolet continuum magnitude M_1450 = −23.59 ± 0.10 AB mag. The discovery of IMS J220417.92+011144.8 is consistent with the expected number of quasars at z ∼ 6 estimated from quasar luminosity functions based on previous observations of spectroscopically identified low-luminosity quasars. This suggests that the number of M_1450 ∼ −23 mag quasars at z ∼ 6 may not be high enough to fully account for the reionization of the universe. In addition, our study demonstrates that faint quasars in the early universe can be identified effectively with a moderately wide and deep near-infrared survey such as the IMS.
nRC: non-coding RNA Classifier based on structural features.
Fiannaca, Antonino; La Rosa, Massimo; La Paglia, Laura; Rizzo, Riccardo; Urso, Alfonso
2017-01-01
Non-coding RNAs (ncRNAs) are small non-coding sequences involved in gene expression regulation of many biological processes and diseases. The recent discovery of a large set of different ncRNAs with biologically relevant roles has opened the way to developing methods able to discriminate between the different ncRNA classes. Moreover, the lack of knowledge about the complete mechanisms in regulative processes, together with the development of high-throughput technologies, has called for bioinformatics tools that give biologists and clinicians a deeper understanding of the functional roles of ncRNAs. In this work, we introduce a new ncRNA classification tool, nRC (non-coding RNA Classifier). Our approach is based on feature extraction from the ncRNA secondary structure together with a supervised classification algorithm implementing a deep learning architecture based on convolutional neural networks. We tested our approach on the classification of 13 different ncRNA classes and obtained classification scores using the most common statistical measures. In particular, we reach an accuracy and sensitivity score of about 74%. The proposed method outperforms other similar classification methods based on secondary structure features and machine learning algorithms, including the RNAcon tool that, to date, is the reference classifier. The nRC tool is freely available as a docker image at https://hub.docker.com/r/tblab/nrc/. The source code of the nRC tool is also available at https://github.com/IcarPA-TBlab/nrc.
Dutta, Sutapa; Kumawat, Giriraj; Singh, Bikram P; Gupta, Deepak K; Singh, Sangeeta; Dogra, Vivek; Gaikwad, Kishor; Sharma, Tilak R; Raje, Ranjeet S; Bandhopadhya, Tapas K; Datta, Subhojit; Singh, Mahendra N; Bashasab, Fakrudin; Kulwal, Pawan; Wanjari, K B; K Varshney, Rajeev; Cook, Douglas R; Singh, Nagendra K
2011-01-20
Pigeonpea [Cajanus cajan (L.) Millspaugh], one of the most important food legumes of semi-arid tropical and subtropical regions, has limited genomic resources, particularly expressed sequence based (genic) markers. We report a comprehensive set of validated genic simple sequence repeat (SSR) markers using deep transcriptome sequencing, and its application in genetic diversity analysis and mapping. In this study, 43,324 transcriptome shotgun assembly unigene contigs were assembled from 1.696 million 454 GS-FLX sequence reads of separate pooled cDNA libraries prepared from leaf, root, stem and immature seed of two pigeonpea varieties, Asha and UPAS 120. A total of 3,771 genic-SSR loci, excluding homopolymeric and compound repeats, were identified; of which 2,877 PCR primer pairs were designed for marker development. Dinucleotide was the most common repeat motif with a frequency of 60.41%, followed by tri- (34.52%), hexa- (2.62%), tetra- (1.67%) and pentanucleotide (0.76%) repeat motifs. Primers were synthesized and tested for 772 of these loci with repeat lengths of ≥ 18 bp. Of these, 550 markers were validated for consistent amplification in eight diverse pigeonpea varieties; 71 were found to be polymorphic on agarose gel electrophoresis. Genetic diversity analysis was done on 22 pigeonpea varieties and eight wild species using 20 highly polymorphic genic-SSR markers. The number of alleles at these loci ranged from 4-10 and the polymorphism information content values ranged from 0.46 to 0.72. Neighbor-joining dendrogram showed distinct separation of the different groups of pigeonpea cultivars and wild species. Deep transcriptome sequencing of the two parental lines helped in silico identification of polymorphic genic-SSR loci to facilitate the rapid development of an intra-species reference genetic map, a subset of which was validated for expected allelic segregation in the reference mapping population. 
We developed 550 validated genic-SSR markers in pigeonpea using deep transcriptome sequencing. From these, 20 highly polymorphic markers were used to evaluate the genetic relationship among species of the genus Cajanus. A comprehensive set of genic-SSR markers was developed as an important genomic resource for diversity analysis and genetic mapping in pigeonpea. PMID:21251263
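The SSR screening criteria reported in the abstract above (di- to hexanucleotide motifs, homopolymers excluded, repeat length of at least 18 bp) can be illustrated with a regular-expression scan. This is a minimal sketch, not the pipeline the authors used: the function names are hypothetical, the 18 bp threshold is applied to every motif size, and compound repeats are not detected or filtered.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit.
    This rejects homopolymers (AA, AAA) and disguised shorter motifs (AGAG)."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif[:d] * (n // d) == motif:
            return False
    return True


def find_ssrs(seq, min_len=18):
    """Return sorted (start, motif, repeat_count) tuples for perfect
    di- to hexanucleotide repeats spanning at least min_len bases."""
    seq = seq.upper()
    hits = []
    for unit in range(2, 7):
        min_repeats = -(-min_len // unit)  # ceiling division
        # One captured motif followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


example = "TTT" + "AG" * 10 + "CCC" + "A" * 20 + "GGG" + "ATC" * 6 + "T"
ssrs = find_ssrs(example)
```

In the example sequence, the run of 20 adenines is skipped as a homopolymer, while the (AG)10 and (ATC)6 tracts are reported once each rather than again as tetra- or hexanucleotide repeats.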
Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments.
Canver, Matthew C; Haeussler, Maximilian; Bauer, Daniel E; Orkin, Stuart H; Sanjana, Neville E; Shalem, Ophir; Yuan, Guo-Cheng; Zhang, Feng; Concordet, Jean-Paul; Pinello, Luca
2018-05-01
CRISPR (clustered regularly interspaced short palindromic repeats) genome-editing experiments offer enormous potential for the evaluation of genomic loci using arrayed single guide RNAs (sgRNAs) or pooled sgRNA libraries. Numerous computational tools are available to help design sgRNAs with optimal on-target efficiency and minimal off-target potential. In addition, computational tools have been developed to analyze deep-sequencing data resulting from genome-editing experiments. However, these tools are typically developed in isolation and oftentimes are not readily translatable into laboratory-based experiments. Here, we present a protocol that describes in detail both the computational and benchtop implementation of an arrayed and/or pooled CRISPR genome-editing experiment. This protocol provides instructions for sgRNA design with CRISPOR (computational tool for the design, evaluation, and cloning of sgRNA sequences), experimental implementation, and analysis of the resulting high-throughput sequencing data with CRISPResso (computational tool for analysis of genome-editing outcomes from deep-sequencing data). This protocol allows for design and execution of arrayed and pooled CRISPR experiments in 4-5 weeks by non-experts, as well as computational data analysis that can be performed in 1-2 d by both computational and noncomputational biologists alike using web-based and/or command-line versions.
Research and Teaching About the Deep Earth
NASA Astrophysics Data System (ADS)
Williams, Michael L.; Mogk, David W.; McDaris, John
2010-08-01
Understanding the Deep Earth: Slabs, Drips, Plumes and More; Virtual Workshop, 17-19 February and 24-26 February 2010. Images and models of active faults, subducting plates, mantle drips, and rising plumes are spurring new excitement about deep-Earth processes and connections between Earth's internal systems and plate tectonics. The new results and the steady progress of EarthScope's USArray across the country are also providing a special opportunity to reach students and the general public. The pace of discoveries about the deep Earth is accelerating due to advances in experimental, modeling, and sensing technologies; new data processing capabilities; and installation of new networks, especially the EarthScope facility. EarthScope is an interdisciplinary program that combines geology and geophysics to study the structure and evolution of the North American continent. To explore the current state of deep-Earth science and ways in which it can be brought into the undergraduate classroom, 40 professors attended a virtual workshop given by On the Cutting Edge, a program that strives to improve undergraduate geoscience education through an integrated cooperative series of workshops and Web-based resources. The 6-day two-part workshop consisted of plenary talks, large and small group discussions, and development and review of new classroom and laboratory activities.
DNA barcoding for species identification in deep-sea clams (Mollusca: Bivalvia: Vesicomyidae).
Liu, Jun; Zhang, Haibin
2018-01-15
Deep-sea clams (Bivalvia: Vesicomyidae) have been found in reduced environments over the world oceans, but taxonomy of this group remains confusing at species and supraspecific levels due to their high morphological similarity and plasticity. In the present study, we collected mitochondrial COI sequences to evaluate the utility of DNA barcoding for identifying vesicomyid species. The COI dataset identified 56 well-supported putative species/operational taxonomic units (OTUs), covering approximately half of the extant vesicomyid species. One species (OTU2) was detected for the first time and may represent a new species. Average distances between species ranged from 1.65 to 29.64%, generally higher than average intraspecific distances (0-1.41%) when excluding Pliocardia sp.10 cf. venusta (average intraspecific distance 1.91%). A local barcoding gap existed in 33 of the 35 species when comparing maximum intraspecific and minimum interspecific distances, with two exceptions (Abyssogena southwardae and Calyptogena rectimargo-starobogatovi). The barcode index number (BIN) system determined 41 of the 56 species/OTUs, each with a unique BIN, indicating their validity. Three species were found to have two BINs, together with their high level of intraspecific variation, implying cryptic diversity within them. Although fewer 16S sequences were collected, similar results were obtained. Nineteen putative species were determined and no overlap was observed between intra- and inter-specific variation. Implications of DNA barcoding for the Vesicomyidae taxonomy were then discussed. Findings of this study will provide important evidence for taxonomic revision in this problematic clam group, and accelerate the discovery of new vesicomyid species in the future.
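The local barcoding gap test mentioned in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the function name `barcoding_gaps`, the sequence identifiers, and the toy distances below are all hypothetical, and the sketch assumes a precomputed table of pairwise genetic distances. A species is taken to have a local gap when its largest within-species distance is smaller than its smallest distance to any other species.

```python
from itertools import combinations


def barcoding_gaps(distances, species_of):
    """Return {species: (max_intra, min_inter, has_gap)}.

    distances:  dict mapping frozenset({id_a, id_b}) -> pairwise distance
    species_of: dict mapping sequence id -> species label
    """
    max_intra = {}
    min_inter = {}
    for a, b in combinations(sorted(species_of), 2):
        d = distances[frozenset((a, b))]
        sa, sb = species_of[a], species_of[b]
        if sa == sb:
            # Same species: track the largest within-species distance.
            max_intra[sa] = max(max_intra.get(sa, 0.0), d)
        else:
            # Different species: track the nearest neighbour of each.
            for s in (sa, sb):
                min_inter[s] = min(min_inter.get(s, float("inf")), d)
    result = {}
    for s in set(species_of.values()):
        intra = max_intra.get(s, 0.0)  # singletons have no intraspecific pair
        inter = min_inter.get(s, float("inf"))
        result[s] = (intra, inter, intra < inter)
    return result


# Hypothetical toy data: three species, five sequences.
species_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
pairs = [
    ("a1", "a2", 0.01), ("b1", "b2", 0.03),
    ("a1", "b1", 0.02), ("a1", "b2", 0.05),
    ("a2", "b1", 0.04), ("a2", "b2", 0.05),
    ("a1", "c1", 0.20), ("a2", "c1", 0.21),
    ("b1", "c1", 0.19), ("b2", "c1", 0.22),
]
distances = {frozenset((a, b)): d for a, b, d in pairs}

gaps = barcoding_gaps(distances, species_of)
for name in sorted(gaps):
    print(name, gaps[name])
```

In the toy data, species B has no local gap because one of its within-species distances (0.03) exceeds its distance to the nearest sequence of species A (0.02), which is the kind of exception the abstract reports for two species.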
Leung, Preston; Eltahla, Auda A; Lloyd, Andrew R; Bull, Rowena A; Luciani, Fabio
2017-07-15
With the advent of affordable deep sequencing technologies, detection of low frequency variants within genetically diverse viral populations can now be achieved with unprecedented depth and efficiency. The high-resolution data provided by next generation sequencing technologies is currently recognised as the gold standard in estimation of viral diversity. In the analysis of rapidly mutating viruses, longitudinal deep sequencing datasets from viral genomes during individual infection episodes, as well as at the epidemiological level during outbreaks, now allow for more sophisticated analyses such as statistical estimates of the impact of complex mutation patterns on the evolution of the viral populations both within and between hosts. These analyses are revealing more accurate descriptions of the evolutionary dynamics that underpin the rapid adaptation of these viruses to the host response, and to drug therapies. This review assesses recent developments in methods and provides informative research examples using deep sequencing data generated from rapidly mutating viruses infecting humans, particularly hepatitis C virus (HCV), human immunodeficiency virus (HIV), Ebola virus and influenza virus, to understand the evolution of viral genomes and to explore the relationship between viral mutations and the host adaptive immune response. Finally, we discuss limitations in current technologies, and future directions that take advantage of publicly available large deep sequencing datasets. Copyright © 2016 Elsevier B.V. All rights reserved.
DEEP MOTIF DASHBOARD: VISUALIZING AND UNDERSTANDING GENOMIC SEQUENCES USING DEEP NEURAL NETWORKS.
Lanchantin, Jack; Singh, Ritambhara; Wang, Beilun; Qi, Yanjun
2017-01-01
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard) which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method is finding a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering that recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs and dependencies among them.
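The saliency-map idea described above (first-order derivatives of the prediction with respect to each input nucleotide) can be sketched on a toy model. This is not the DeMo Dashboard code: a single logistic unit over a one-hot encoded sequence stands in for the deep network, its gradient is written out analytically instead of being obtained by backpropagation, and reading off the gradient at the observed base is one common convention assumed here. The names `one_hot`, `score`, and `saliency` are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return x


def score(x, w, b):
    """Toy stand-in for a DNN: sigmoid of a weighted sum over the input."""
    return 1.0 / (1.0 + np.exp(-(np.sum(w * x) + b)))


def gradient(x, w, b):
    """Analytic d(score)/d(x) for the toy model, shape (L, 4)."""
    s = score(x, w, b)
    return s * (1.0 - s) * w


def saliency(x, w, b):
    """Per-position saliency: |gradient| at the base actually observed."""
    return np.abs(np.sum(gradient(x, w, b) * x, axis=1))


# Hypothetical weights: position 1 rewards G, position 2 penalises T.
w = np.zeros((4, 4))
w[1, BASES.index("G")] = 2.0
w[2, BASES.index("T")] = -1.0
b = 0.0

x = one_hot("AGTC")
sal = saliency(x, w, b)
print(np.round(sal, 4))
```

For a real convolutional or recurrent model the same per-nucleotide values would come from automatic differentiation in the chosen framework; only the gradient computation changes, not the way the map is read.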
Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.
Williams, Philip H; Eyles, Rod; Weiller, Georg
2012-01-01
MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA(∗) duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
Rogers, Alex D.; Tyler, Paul A.; Connelly, Douglas P.; Copley, Jon T.; James, Rachael; Larter, Robert D.; Linse, Katrin; Mills, Rachel A.; Garabato, Alfredo Naveira; Pancost, Richard D.; Pearce, David A.; Polunin, Nicholas V. C.; German, Christopher R.; Shank, Timothy; Boersch-Supan, Philipp H.; Alker, Belinda J.; Aquilina, Alfred; Bennett, Sarah A.; Clarke, Andrew; Dinley, Robert J. J.; Graham, Alastair G. C.; Green, Darryl R. H.; Hawkes, Jeffrey A.; Hepburn, Laura; Hilario, Ana; Huvenne, Veerle A. I.; Marsh, Leigh; Ramirez-Llodra, Eva; Reid, William D. K.; Roterman, Christopher N.; Sweeting, Christopher J.; Thatje, Sven; Zwirglmaier, Katrin
2012-01-01
Since the first discovery of deep-sea hydrothermal vents along the Galápagos Rift in 1977, numerous vent sites and endemic faunal assemblages have been found along mid-ocean ridges and back-arc basins at low to mid latitudes. These discoveries have suggested the existence of separate biogeographic provinces in the Atlantic and the North West Pacific, the existence of a province including the South West Pacific and Indian Ocean, and a separation of the North East Pacific, North East Pacific Rise, and South East Pacific Rise. The Southern Ocean is known to be a region of high deep-sea species diversity and centre of origin for the global deep-sea fauna. It has also been proposed as a gateway connecting hydrothermal vents in different oceans but is little explored because of extreme conditions. Since 2009 we have explored two segments of the East Scotia Ridge (ESR) in the Southern Ocean using a remotely operated vehicle. In each segment we located deep-sea hydrothermal vents hosting high-temperature black smokers up to 382.8°C and diffuse venting. The chemosynthetic ecosystems hosted by these vents are dominated by a new yeti crab (Kiwa n. sp.), stalked barnacles, limpets, peltospiroid gastropods, anemones, and a predatory sea star. Taxa abundant in vent ecosystems in other oceans, including polychaete worms (Siboglinidae), bathymodiolid mussels, and alvinocaridid shrimps, are absent from the ESR vents. These groups, except the Siboglinidae, possess planktotrophic larvae, rare in Antarctic marine invertebrates, suggesting that the environmental conditions of the Southern Ocean may act as a dispersal filter for vent taxa. Evidence from the distinctive fauna, the unique community structure, and multivariate analyses suggest that the Antarctic vent ecosystems represent a new vent biogeographic province. 
However, multivariate analyses of species present at the ESR and at other deep-sea hydrothermal vents globally indicate that vent biogeography is more complex than previously recognised. PMID:22235194
NASA Technical Reports Server (NTRS)
Armus, L.; Matthews, K.; Neugebauer, G.; Soifer, B. T.
1998-01-01
In the last several years, the combination of new wavelength dropout discovery techniques with the incredible power of deep imaging of the Hubble Space Telescope and the spectroscopic capabilities of a new generation of large ground-based telescopes has led to an astonishing blossoming of the study of galaxies at redshifts of z=2-4, when the Universe was less than 10-20% of its current age.
Culture-independent discovery of natural products from soil metagenomes.
Katz, Micah; Hover, Bradley M; Brady, Sean F
2016-03-01
Bacterial natural products have proven to be invaluable starting points in the development of many currently used therapeutic agents. Unfortunately, traditional culture-based methods for natural product discovery have been deemphasized by pharmaceutical companies due in large part to high rediscovery rates. Culture-independent, or "metagenomic," methods, which rely on the heterologous expression of DNA extracted directly from environmental samples (eDNA), have the potential to provide access to metabolites encoded by a large fraction of the earth's microbial biosynthetic diversity. As soil is both ubiquitous and rich in bacterial diversity, it is an appealing starting point for culture-independent natural product discovery efforts. This review provides an overview of the history of soil metagenome-driven natural product discovery studies and elaborates on the recent development of new tools for sequence-based, high-throughput profiling of environmental samples used in discovering novel natural product biosynthetic gene clusters. We conclude with several examples of these new tools being employed to facilitate the recovery of novel secondary metabolite encoding gene clusters from soil metagenomes and the subsequent heterologous expression of these clusters to produce bioactive small molecules.
Song, Yuhyun; Leman, Scotland; Monteil, Caroline L.; Heath, Lenwood S.; Vinatzer, Boris A.
2014-01-01
A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today’s speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research. PMID:24586551
SNP discovery by high-throughput sequencing in soybean
2010-01-01
Background With the advance of new massively parallel genotyping technologies, quantitative trait loci (QTL) fine mapping and map-based cloning become more achievable in identifying genes for important and complex traits. Development of high-density genetic markers in the QTL regions of specific mapping populations is essential for fine-mapping and map-based cloning of economically important genes. Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation existing between any diverse genotypes that are usually used for QTL mapping studies. The massively parallel sequencing technologies (Roche GS/454, Illumina GA/Solexa, and ABI/SOLiD) have been widely applied to identify genome-wide sequence variations. However, it still remains unclear whether sequence data at a low sequencing depth are enough to detect the variations existing in any QTL regions of interest in a crop genome, and how to prepare sequencing samples for a complex genome such as soybean. Therefore, with the aims of identifying SNP markers in a cost-effective way for fine-mapping several QTL regions, and testing the validation rate of the putative SNPs predicted with Solexa short sequence reads at a low sequencing depth, we evaluated a pooled DNA fragment reduced representation library and SNP detection methods applied to short read sequences generated by Solexa high-throughput sequencing technology. Results A total of 39,022 putative SNPs were identified by the Illumina/Solexa sequencing system using a reduced representation DNA library of two parental lines of a mapping population. The validation rates of these putative SNPs predicted with low and high stringency were 72% and 85%, respectively. One hundred sixty-four SNP markers resulted from the validation of putative SNPs and have been selectively chosen to target a known QTL, thereby increasing the marker density of the targeted region to one marker per 42 kbp. 
Conclusions We have demonstrated how to quickly identify large numbers of SNPs for fine mapping of QTL regions by applying massively parallel sequencing combined with genome complexity reduction techniques. This SNP discovery approach is more efficient for targeting multiple QTL regions in the same genetic population, and it can be applied to other crops. PMID:20701770
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.
Bauer, Markus; Klau, Gunnar W; Reinert, Knut
2007-07-27
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.
PhAST: pharmacophore alignment search tool.
Hähnke, Volker; Hofmann, Bettina; Grgat, Tomislav; Proschak, Ewgenij; Steinhilber, Dieter; Schneider, Gisbert
2009-04-15
We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of molecules. This string representation opens the opportunity for revealing functional similarity between molecules by sequence alignment techniques in analogy to homology searching in protein or nucleic acid sequence databases. We favorably compared PhAST with other current ligand-based virtual screening methods in a retrospective analysis using the BEDROC metric. In a prospective application, PhAST identified two novel inhibitors of 5-lipoxygenase product formation with minimal experimental effort. This outcome demonstrates the applicability of PhAST to drug discovery projects and provides an innovative concept of sequence-based compound screening with substantial scaffold hopping potential. 2008 Wiley Periodicals, Inc.
Delivery and detection of dietary plant-based miRNAs in animal tissues
USDA-ARS's Scientific Manuscript database
It has been proposed that genetic material, namely microRNAs (miRNAs), consumed in plant-based diets can affect animal gene expression. Though deep sequencing reveals the low-level presence of plant miRNAs in animal tissues, many groups have been thus far unable to replicate the finding that a rice ...
Ou, Hong-Yu; He, Xinyi; Harrison, Ewan M.; Kulasekara, Bridget R.; Thani, Ali Bin; Kadioglu, Aras; Lory, Stephen; Hinton, Jay C. D.; Barer, Michael R.; Rajakumar, Kumar
2007-01-01
MobilomeFINDER (http://mml.sjtu.edu.cn/MobilomeFINDER) is an interactive online tool that facilitates bacterial genomic island or ‘mobile genome’ (mobilome) discovery; it integrates the ArrayOme and tRNAcc software packages. ArrayOme utilizes a microarray-derived comparative genomic hybridization input data set to generate ‘inferred contigs’ produced by merging adjacent genes classified as ‘present’. Collectively these ‘fragments’ represent a hypothetical ‘microarray-visualized genome (MVG)’. ArrayOme permits recognition of discordances between physical genome and MVG sizes, thereby enabling identification of strains rich in microarray-elusive novel genes. Individual tRNAcc tools facilitate automated identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites and other integration hotspots in closely related sequenced genomes. Accessory tools facilitate design of hotspot-flanking primers for in silico and/or wet-science-based interrogation of cognate loci in unsequenced strains and analysis of islands for features suggestive of foreign origins; island-specific and genome-contextual features are tabulated and represented in schematic and graphical forms. To date we have used MobilomeFINDER to analyse several Enterobacteriaceae, Pseudomonas aeruginosa and Streptococcus suis genomes. MobilomeFINDER enables high-throughput island identification and characterization through increased exploitation of emerging sequence data and PCR-based profiling of unsequenced test strains; subsequent targeted yeast recombination-based capture permits full-length sequencing and detailed functional studies of novel genomic islands. PMID:17537813
Toma, Tudor; Bosman, Robert-Jan; Siebes, Arno; Peek, Niels; Abu-Hanna, Ameen
2010-08-01
An important problem in intensive care is how to predict, on a given day of stay, the eventual hospital mortality of a specific patient. A recent approach to solve this problem suggested the use of frequent temporal sequences (FTSs) as predictors. Methods following this approach were evaluated in the past by inducing a model from a training set and validating the prognostic performance on an independent test set. Although this evaluative approach addresses the validity of the specific models induced in an experiment, it falls short of evaluating the inductive method itself. To achieve this, one must account for the inherent sources of variation in the experimental design. The main aim of this work is to demonstrate a procedure based on bootstrapping, specifically the .632 bootstrap procedure, for evaluating inductive methods that discover patterns, such as FTSs. A second aim is to apply this approach to find out whether a recently suggested inductive method that discovers FTSs of organ functioning status is superior to a traditional method that does not use temporal sequences when compared on each successive day of stay at the Intensive Care Unit. The use of bootstrapping with logistic regression using pre-specified covariates is known in the statistical literature. Using inductive methods of prognostic models based on temporal sequence discovery within the bootstrap procedure is, however, novel, at least for predictive models in intensive care. Our results of applying the bootstrap-based evaluative procedure demonstrate the superiority of the FTS-based inductive method over the traditional method in terms of discrimination as well as accuracy. In addition, we illustrate the insights gained by the analyst into the discovered FTSs from the bootstrap samples. Copyright 2010 Elsevier Inc. All rights reserved.
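The .632 bootstrap named in this abstract can be sketched as follows. This is a generic illustration and not the authors' FTS procedure: the whole inductive method is represented by a user-supplied `fit` function that is re-run on every bootstrap sample, a one-dimensional threshold classifier stands in for pattern discovery plus model fitting, and the data are synthetic. The estimate combines the apparent (resubstitution) error with the out-of-bag error as 0.368 * apparent + 0.632 * out-of-bag. All function names are hypothetical.

```python
import numpy as np


def bootstrap_632(X, y, fit, predict, n_boot=200, seed=0):
    """Return (err_632, err_apparent, err_oob) for an inductive method."""
    rng = np.random.default_rng(seed)
    n = len(y)

    # Apparent error: induce on all data, evaluate on the same data.
    err_app = np.mean(predict(fit(X, y), X) != y)

    # Out-of-bag error: each case is scored only by models whose
    # bootstrap sample did not contain it.
    err_sum = np.zeros(n)
    count = np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # cases left out
        if oob.size == 0:
            continue
        model = fit(X[idx], y[idx])               # rerun the whole method
        err_sum[oob] += predict(model, X[oob]) != y[oob]
        count[oob] += 1
    seen = count > 0
    err_oob = np.mean(err_sum[seen] / count[seen])

    return 0.368 * err_app + 0.632 * err_oob, err_app, err_oob


def fit_threshold(X, y):
    """Toy inductive method: midpoint between the two class means."""
    if not (y == 0).any():
        return -np.inf   # only class 1 seen: always predict 1
    if not (y == 1).any():
        return np.inf    # only class 0 seen: always predict 0
    return 0.5 * (X[y == 0].mean() + X[y == 1].mean())


def predict_threshold(threshold, X):
    return (X > threshold).astype(int)


# Synthetic data: two overlapping one-dimensional classes.
data_rng = np.random.default_rng(1)
X = np.concatenate([data_rng.normal(0.0, 1.0, 20), data_rng.normal(2.0, 1.0, 20)])
y = np.array([0] * 20 + [1] * 20)

err_632, err_app, err_oob = bootstrap_632(X, y, fit_threshold, predict_threshold)
print(err_632, err_app, err_oob)
```

Because `fit` is called inside the loop, any data-driven step of the method (such as discovering patterns) is repeated on each bootstrap sample, which is what lets the procedure assess the inductive method and not just one induced model.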
Verde, Ignazio; Jenkins, Jerry; Dondini, Luca; Micali, Sabrina; Pagliarani, Giulia; Vendramin, Elisa; Paris, Roberta; Aramini, Valeria; Gazza, Laura; Rossini, Laura; Bassi, Daniele; Troggio, Michela; Shu, Shengqiang; Grimwood, Jane; Tartarini, Stefano; Dettori, Maria Teresa; Schmutz, Jeremy
2017-03-11
The availability of the peach genome sequence has fostered relevant research in peach and related Prunus species enabling the identification of genes underlying important horticultural traits as well as the development of advanced tools for genetic and genomic analyses. The first release of the peach genome (Peach v1.0) represented a high-quality WGS (Whole Genome Shotgun) chromosome-scale assembly with high contiguity (contig L50 214.2 kb), large portions of mapped sequences (96%) and high base accuracy (99.96%). The aim of this work was to improve the quality of the first assembly by increasing the portion of mapped and oriented sequences, correcting misassemblies and improving the contiguity and base accuracy using high-throughput linkage mapping and deep resequencing approaches. Four linkage maps with 3,576 molecular markers were used to improve the portion of mapped and oriented sequences (from 96.0% and 85.6% of Peach v1.0 to 99.2% and 98.2% of v2.0, respectively) and enabled a more detailed identification of discernible misassemblies (10.4 Mb in total). The deep resequencing approach fixed 859 homozygous SNPs (Single Nucleotide Polymorphisms) and 1347 homozygous indels. Moreover, the assembled NGS contigs enabled the closing of 212 gaps with an improvement in the contig L50 of 19.2%. The improved high quality peach genome assembly (Peach v2.0) represents a valuable tool for the analysis of the genetic diversity, domestication, and as a vehicle for genetic improvement of peach and related Prunus species. Moreover, the important phylogenetic position of peach and the absence of recent whole genome duplication (WGD) events make peach a pivotal species for comparative genomics studies aiming at elucidating plant speciation and diversification processes.
Open discovery: An integrated live Linux platform of Bioinformatics tools
Vetrivel, Umashankar; Pilla, Kalabharath
2008-01-01
Historically, live Linux distributions for Bioinformatics have paved the way for portability of the Bioinformatics workbench in a platform-independent manner. However, most of the existing live Linux distributions limit their usage to sequence analysis and basic molecular visualization programs and are devoid of data persistence. Hence, Open Discovery, a live Linux distribution, has been developed with the capability to perform complex tasks like molecular modeling, docking and molecular dynamics in a swift manner. Furthermore, it is also equipped with a complete sequence analysis environment and is capable of running Windows executable programs in a Linux environment. Open Discovery portrays the advanced customizable configuration of Fedora, with data persistency accessible via USB drive or DVD. Availability: Open Discovery is distributed free under the Academic Free License (AFL) and can be downloaded from http://www.OpenDiscovery.org.in PMID:19238235
Namouchi, Amine; Cimino, Mena; Favre-Rochex, Sandrine; Charles, Patricia; Gicquel, Brigitte
2017-07-13
Tuberculosis (TB) is caused by Mycobacterium tuberculosis and represents one of the major challenges facing drug discovery initiatives worldwide. The considerable rise in bacterial drug resistance in recent years has led to the need for new drugs and drug regimens. Model systems are regularly used to speed up the drug discovery process and circumvent biosafety issues associated with manipulating M. tuberculosis. These include the use of strains such as Mycobacterium smegmatis and Mycobacterium marinum that can be handled in biosafety level 2 facilities, making high-throughput screening feasible. However, each of these model species has its own limitations. We report and describe the first complete genome sequence of Mycobacterium aurum ATCC23366, an environmental mycobacterium that can also grow in the gut of humans and animals as part of the microbiota. This species shows a comparable resistance profile to that of M. tuberculosis for several anti-TB drugs. The aims of this study were to (i) determine the drug resistance profile of a recently proposed model species, Mycobacterium aurum, strain ATCC23366, for anti-TB drug discovery as well as Mycobacterium smegmatis and Mycobacterium marinum; (ii) sequence and annotate the complete genome sequence of this species obtained using Pacific Bioscience technology; (iii) perform comparative genomics analyses of the various surrogate strains with M. tuberculosis; and (iv) discuss how the choice of the surrogate model used for drug screening can affect the drug discovery process. We describe the complete genome sequence of M. aurum, a surrogate model for anti-tuberculosis drug discovery. Most of the genes already reported to be associated with drug resistance are shared between all the surrogate strains and M. tuberculosis. We consider that M. aurum might be used in high-throughput screening for tuberculosis drug discovery. We also highly recommend the use of different model species during the drug discovery screening process.
NASA Astrophysics Data System (ADS)
Okay, Aral I.; Altiner, Demir
2016-10-01
The Haymana region in Central Anatolia is located in the southern part of the Pontides close to the İzmir-Ankara suture. During the Cretaceous, the region formed part of the south-facing active margin of Eurasia. The area preserves a nearly complete record of the Cretaceous system. Shallow marine carbonates of earliest Cretaceous age are overlain by a 700-m-thick Cretaceous sequence, dominated by deep marine limestones. Three unconformity-bounded pelagic carbonate sequences of Berriasian, Albian-Cenomanian and Turonian-Santonian ages are recognized: Each depositional sequence is preceded by a period of tilting and submarine erosion during the Berriasian, early Albian and late Cenomanian, which corresponds to phases of local extension in the active continental margin. Carbonate breccias mark the base of the sequences and each carbonate sequence steps down on older units. The deep marine carbonate deposition ended in the late Santonian followed by tilting, erosion and folding during the Campanian. Deposition of thick siliciclastic turbidites started in the late Campanian and continued into the Tertiary. Unlike most forearc basins, the Haymana region was a site of deep marine carbonate deposition until the Campanian. This was because the Pontide arc was extensional and the volcanic detritus was trapped in the intra-arc basins and did not reach the forearc or the trench. The extensional nature of the arc is also shown by the opening of the Black Sea as a backarc basin in the Turonian-Santonian. The carbonate sedimentation in an active margin is characterized by synsedimentary vertical displacements, which result in submarine erosion, carbonate breccias and lateral discontinuity of the sequences, and differs from blanket-like carbonate deposition in passive margins.
RAD tag sequencing as a source of SNP markers in Cynara cardunculus L
2012-01-01
Background The globe artichoke (Cynara cardunculus L. var. scolymus) genome is relatively poorly explored, especially compared to those of the other major Asteraceae crops sunflower and lettuce. No SNP markers are in the public domain. We have combined the recently developed restriction-site associated DNA (RAD) approach with the Illumina DNA sequencing platform to effect the rapid and mass discovery of SNP markers for C. cardunculus. Results RAD tags were sequenced from the genomic DNA of three C. cardunculus mapping population parents, generating 9.7 million reads, corresponding to ~1 Gbp of sequence. An assembly based on paired ends produced ~6.0 Mbp of genomic sequence, separated into ~19,000 contigs (mean length 312 bp), of which ~21% were fragments of putative coding sequence. The shared sequences allowed for the discovery of ~34,000 SNPs and nearly 800 indels, equivalent to a SNP frequency of 5.6 per 1,000 nt, and an indel frequency of 0.2 per 1,000 nt. A sample of heterozygous SNP loci was mapped by CAPS assays and this exercise provided validation of our mining criteria. The repetitive fraction of the genome had a high representation of retrotransposon sequence, followed by simple repeats, AT-low complexity regions and mobile DNA elements. The genomic k-mer distribution and CpG rate of C. cardunculus, compared with data derived from three whole genome-sequenced dicot species, provided further evidence of the random representation of the C. cardunculus genome generated by RAD sampling. Conclusion The RAD tag sequencing approach is a cost-effective and rapid method to develop SNP markers in a highly heterozygous species. Our approach permitted the generation of a large and robust SNP dataset through the adoption of optimized filtering criteria. PMID:22214349
Chen, Zhao; Moran, Kimberly; Richards-Yutz, Jennifer; Toorens, Erik; Gerhart, Daniel; Ganguly, Tapan; Shields, Carol L; Ganguly, Arupa
2014-03-01
Sporadic retinoblastoma (RB) is caused by de novo mutations in the RB1 gene. Often, these mutations are present as mosaic mutations that cannot be detected by Sanger sequencing. Next-generation deep sequencing allows unambiguous detection of the mosaic mutations in lymphocyte DNA. Deep sequencing of the RB1 gene on lymphocyte DNA from 20 bilateral and 70 unilateral RB cases was performed, where Sanger sequencing excluded the presence of mutations. The individual exons of the RB1 gene from each sample were amplified, pooled, ligated to barcoded adapters, and sequenced using semiconductor sequencing on an Ion Torrent Personal Genome Machine. Six low-level mosaic mutations were identified in bilateral RB and four in unilateral RB cases. The incidence of low-level mosaic mutation was estimated to be 30% and 6%, respectively, in sporadic bilateral and unilateral RB cases, previously classified as mutation negative. The frequency of point mutations detectable in lymphocyte DNA increased from 96% to 97% for bilateral RB and from 13% to 18% for unilateral RB. The use of deep sequencing technology increased the sensitivity of the detection of low-level germline mosaic mutations in the RB1 gene. This finding has significant implications for improved clinical diagnosis, genetic counseling, surveillance, and management of RB. © 2013 WILEY PERIODICALS, INC.
Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan
2017-03-15
The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation. Copyright © 2017 Elsevier Inc. All rights reserved.
Discovery of a large-scale clumpy structure of the Lynx supercluster at z∼1.27
NASA Astrophysics Data System (ADS)
Nakata, Fumiaki; Kodama, Tadayuki; Shimasaku, Kazuhiro; Doi, Mamoru; Furusawa, Hisanori; Hamabe, Masaru; Kimura, Masahiko; Komiyama, Yutaka; Miyazaki, Satoshi; Okamura, Sadanori; Ouchi, Masami; Sekiguchi, Maki; Yagi, Masafumi; Yasuda, Naoki
2004-07-01
We report the discovery of a probable large-scale structure composed of many galaxy clumps around the known twin clusters at z=1.26 and z=1.27 in the Lynx region. Our analysis is based on deep, panoramic, and multi-colour imaging with the Suprime-Cam on the 8.2 m Subaru telescope. We apply a photometric redshift technique to extract plausible cluster members at z∼1.27 down to ∼M*+2.5. From the 2-D distribution of these photometrically selected galaxies, we newly identify seven candidates of galaxy groups or clusters where the surface density of red galaxies is significantly high (>5σ), in addition to the two known clusters, comprising the largest, most distant supercluster ever identified.
El Enshasy, Hesham; Elsayed, Elsayed A.; Aziz, Ramlan; Wadaan, Mohamad A.
2013-01-01
The ethnopharmaceutical approach is important for the discovery and development of natural product research and requires a deep understanding not only of biometabolite discovery and profiling but also of cultural and social science. For millennia, epigeous macrofungi (mushrooms) and hypogeous macrofungi (truffles) were considered precious food in many cultures because of their high nutritional value and characteristic pleasant aroma. In African and Middle Eastern cultures, macrofungi have a long history as highly nutritional food and were widely applied in folk medicine. The purpose of this review is to summarize the available information related to the nutritional and medicinal value of African and Middle Eastern macrofungi and to highlight their application in complementary folk medicine in this part of the world. PMID:24348710
The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update
Huynh, Tien; Rigoutsos, Isidore
2004-01-01
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification—directly from sequence—of structural deviations from α-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/. PMID:15215340
Sequencing Data Discovery and Integration for Earth System Science with MetaSeek
NASA Astrophysics Data System (ADS)
Hoarfrost, A.; Brown, N.; Arnosti, C.
2017-12-01
Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.
Semantically-enabled Knowledge Discovery in the Deep Carbon Observatory
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, Y.; Ma, X.; Erickson, J. S.; West, P.; Fox, P. A.
2013-12-01
The Deep Carbon Observatory (DCO) is a decadal effort aimed at transforming scientific and public understanding of carbon in the complex deep earth system from the perspectives of Deep Energy, Deep Life, Extreme Physics and Chemistry, and Reservoirs and Fluxes. Over the course of the decade DCO scientific activities will generate a massive volume of data across a variety of disciplines, presenting significant challenges in terms of data integration, management, analysis and visualization, and ultimately limiting the ability of scientists across disciplines to make insights and unlock new knowledge. The DCO Data Science Team (DCO-DS) is applying Semantic Web methodologies to construct a knowledge representation focused on the DCO Earth science disciplines, and use it together with other technologies (e.g. natural language processing and data mining) to create a more expressive representation of the distributed corpus of DCO artifacts including datasets, metadata, instruments, sensors, platforms, deployments, researchers, organizations, funding agencies, grants and various awards. The embodiment of this knowledge representation is the DCO Data Science Infrastructure, in which unique entities within the DCO domain and the relations between them are recognized and explicitly identified. The DCO-DS Infrastructure will serve as a platform for more efficient and reliable searching, discovery, access, and publication of information and knowledge for the DCO scientific community and beyond.
Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana
2012-03-27
Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.
Barrett, Nolan H.; McCarthy, Peter J.
2017-01-01
ABSTRACT The proteobacterium Alteromonas sp. strain V450 was isolated from the Atlantic deep-sea sponge Leiodermatium sp. Here, we report the draft genome sequence of this strain, with a genome size of approx. 4.39 Mb and a G+C content of 44.01%. The results will aid deep-sea microbial ecology, evolution, and sponge-microbe association studies. PMID:28153886
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fliedner Theodor M.; Feinendegen Ludwig E.; Meineke Viktor
2005-02-28
First results of this feasibility study showed that evaluation of the stored material of the chronically irradiated dogs with modern molecular biological techniques proved to be successful and extremely promising. Therefore an in-depth analysis of at least part of the huge amount of remaining material is of utmost interest. The methods applied in this feasibility study were pathological evaluation with different staining methods, protein analysis by means of immunohistochemistry, strand break analysis with the TdT-assay, DNA- and RNA-analysis as well as genomic examination by gene array. Overall more than 50% of the investigated material could be used. In particular, the finding of increased stimulation of the immune system in the dogs of the 3 mSv group, compared to both the control and higher dose groups, has implications for the in-depth study of the cellular events occurring in the context of low dose radiation. Based on the findings of this study, further evaluation and statistical analysis of more material can help to identify promising biomarkers for low dose radiation. A systematic evaluation of the correlation between dose rates and strand breaks within the dog tissue might moreover help to explain mechanisms of tolerance to IR. One central problem is that most sequences for dog-specific primers are not yet known. The discovery of the dog genome is still in progress. In this study the isolation of RNA within the dog tissue was successful. But up to now there are no gene arrays or gene chips commercially available, tested and adapted for canine tissue. The uncritical use of untested genomic test systems for canine tissue seems to be time consuming and ineffective at the moment. Next steps in the investigation of genomic changes after IR within the stored dog tissue should be limited to quantitative RT-PCR of tested primer sequences for the dog. A collaboration with institutions working in the field of the discovery of the dog genome could have synergistic effects.
Wei, Ran; Yan, Yue-Hong; Harris, AJ; Kang, Jong-Soo; Shen, Hui; Zhang, Xian-Chun
2017-01-01
Abstract The eupolypods II ferns represent a classic case of evolutionary radiation and, simultaneously, exhibit high substitution rate heterogeneity. These factors have been proposed to contribute to the contentious resolutions among clades within this fern group in multilocus phylogenetic studies. We investigated the deep phylogenetic relationships of eupolypod II ferns by sampling all major families and using 40 plastid genomes, or plastomes, of which 33 were newly sequenced with next-generation sequencing technology. We performed model-based analyses to evaluate the diversity of molecular evolutionary rates for these ferns. Our plastome data, with more than 26,000 informative characters, yielded good resolution for deep relationships within eupolypods II and unambiguously clarified the position of Rhachidosoraceae and the monophyly of Athyriaceae. Results of rate heterogeneity analysis revealed approximately 33 significant rate shifts in eupolypod II ferns, with the most heterogeneous rates (both accelerations and decelerations) occurring in two phylogenetically difficult lineages, that is, the Rhachidosoraceae–Aspleniaceae and Athyriaceae clades. These observations support the hypothesis that rate heterogeneity has previously constrained the deep phylogenetic resolution in eupolypods II. According to the plastome data, we propose that 14 chloroplast markers are particularly phylogenetically informative for eupolypods II both at the familial and generic levels. Our study demonstrates the power of a character-rich plastome data set and high-throughput sequencing for resolving the recalcitrant lineages, which have undergone rapid evolutionary radiation and dramatic changes in substitution rates. PMID:28854625
Ancient origin of the modern deep-sea fauna.
Thuy, Ben; Gale, Andy S; Kroh, Andreas; Kucera, Michal; Numberger-Thuy, Lea D; Reich, Mike; Stöhr, Sabine
2012-01-01
The origin and possible antiquity of the spectacularly diverse modern deep-sea fauna has been debated since the beginning of deep-sea research in the mid-nineteenth century. Recent hypotheses, based on biogeographic patterns and molecular clock estimates, support a latest Mesozoic or early Cenozoic date for the origin of key groups of the present deep-sea fauna (echinoids, octopods). This relatively young age is consistent with hypotheses that argue for extensive extinction during Jurassic and Cretaceous Oceanic Anoxic Events (OAEs) and the mid-Cenozoic cooling of deep-water masses, implying repeated re-colonization by immigration of taxa from shallow-water habitats. Here we report on a well-preserved echinoderm assemblage from deep-sea (1000-1500 m paleodepth) sediments of the NE-Atlantic of Early Cretaceous age (114 Ma). The assemblage is strikingly similar to that of extant bathyal echinoderm communities in composition, including families and genera found exclusively in modern deep-sea habitats. A number of taxa found in the assemblage have no fossil record at shelf depths postdating the assemblage, which precludes the possibility of deep-sea recolonization from shallow habitats following episodic extinction at least for those groups. Our discovery provides the first key fossil evidence that a significant part of the modern deep-sea fauna is considerably older than previously assumed. As a consequence, most major paleoceanographic events had far less impact on the diversity of deep-sea faunas than has been implied. It also suggests that deep-sea biota are more resilient to extinction events than shallow-water forms, and that the unusual deep-sea environment, indeed, provides evolutionary stability which is very rarely punctuated on macroevolutionary time scales.
From genomics to functional markers in the era of next-generation sequencing.
Salgotra, R K; Gupta, B B; Stewart, C N
2014-03-01
The availability of complete genome sequences, along with other genomic resources for Arabidopsis, rice, pigeon pea, soybean and other crops, has revolutionized our understanding of the genetic make-up of plants. Next-generation DNA sequencing (NGS) has facilitated single nucleotide polymorphism discovery in plants. Functionally characterized sequences can be identified and functional markers (FMs) for important traits can be developed with ever-increasing ease. FMs are derived from sequence polymorphisms found in allelic variants of a functional gene. Linkage disequilibrium-based association mapping and homologous recombinants have been developed for identification of "perfect" markers for use in crop improvement practices. Compared with many other molecular markers, FMs derived from functionally characterized genes using NGS techniques provide opportunities to develop high-yielding plant genotypes resistant to various stresses at a fast pace.
Moreira, Rebeca; Balseiro, Pablo; Planas, Josep V.; Fuste, Berta; Beltran, Sergi; Novoa, Beatriz; Figueras, Antonio
2012-01-01
Background The Manila clam (Ruditapes philippinarum) is a worldwide cultured bivalve species with important commercial value. Diseases affecting this species can result in large economic losses. Because knowledge of the molecular mechanisms of the immune response in bivalves, especially clams, is scarce and fragmentary, we sequenced RNA from immune-stimulated R. philippinarum hemocytes by 454-pyrosequencing to identify genes involved in their immune defense against infectious diseases. Methodology and Principal Findings High-throughput deep sequencing of R. philippinarum using 454 pyrosequencing technology yielded 974,976 high-quality reads with an average read length of 250 bp. The reads were assembled into 51,265 contigs, and 44.7% of the nucleotide sequences translated into protein were annotated successfully. The 35 most frequently found contigs included a large number of immune-related genes, and a more detailed analysis showed the presence of putative members of several immune pathways and processes such as apoptosis, the Toll-like signaling pathway and the complement cascade. We have found sequences from molecules never described in bivalves before, especially in the complement pathway, where almost all the components are present. Conclusions This study represents the first transcriptome analysis using 454-pyrosequencing conducted on R. philippinarum focused on its immune system. Our results will provide a rich source of data to discover and identify new genes, which will serve as a basis for microarray construction and the study of gene expression as well as for the identification of genetic markers. The discovery of new immune sequences was very productive and resulted in a large variety of contigs that may play a role in the defense mechanisms of Ruditapes philippinarum. PMID:22536348
Robustness of disaggregate oil and gas discovery forecasting models
Attanasi, E.D.; Schuenemeyer, J.H.
1989-01-01
The trend in forecasting oil and gas discoveries has been to develop and use models that allow forecasts of the size distribution of future discoveries. From such forecasts, exploration and development costs can more readily be computed. Two classes of these forecasting models are the Arps-Roberts type models and the 'creaming method' models. This paper examines the robustness of the forecasts made by these models when the historical data on which the models are based have been subject to economic upheavals or when historical discovery data are aggregated from areas having widely differing economic structures. Model performance is examined in the context of forecasting discoveries for offshore Texas State and Federal areas. The analysis shows how the model forecasts are limited by information contained in the historical discovery data. Because the Arps-Roberts type models require more regularity in discovery sequence than the creaming models, prior information had to be introduced into the Arps-Roberts models to accommodate the influence of economic changes. The creaming methods captured the overall decline in discovery size but did not easily allow introduction of exogenous information to compensate for incomplete historical data. Moreover, the predictive log normal distribution associated with the creaming model methods appears to understate the importance of the potential contribution of small fields. © 1989.
Predicting discovery rates of genomic features.
Gravel, Simon
2014-06-01
Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types. Copyright © 2014 by the Genetics Society of America.
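To make the idea of a jackknife estimator concrete, the sketch below implements the classical first-order jackknife for the total number of distinct items in a sampled population (observed count plus a correction driven by items seen in exactly one sample). This is a textbook estimator chosen for illustration; it is an assumption that it resembles the simplest case of the approach described above, and it is not the estimator developed in the paper. The function name and the toy data are hypothetical.

```python
def jackknife1(carrier_counts, n_samples):
    """First-order jackknife estimate of the total number of variants.

    carrier_counts: for each observed variant, the number of sequenced
    samples that carry it (each value between 1 and n_samples).
    n_samples: number of samples sequenced in the pilot study.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    observed = len(carrier_counts)
    # Variants seen in exactly one sample drive the correction term.
    singletons = sum(1 for c in carrier_counts if c == 1)
    return observed + singletons * (n_samples - 1) / n_samples


# Hypothetical pilot: 5 samples, 7 observed variants, 3 of them singletons.
toy_counts = [1, 1, 1, 2, 3, 5, 5]
estimate = jackknife1(toy_counts, n_samples=5)
print(f"observed: {len(toy_counts)}, jackknife estimate: {estimate:.1f}")
```

The correction grows with the number of singletons, which reflects the intuition that many rare variants in the pilot imply many more still unseen.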
NASA Astrophysics Data System (ADS)
Cantwell, K. L.; Kennedy, B. R.; Quattrini, A.; Cheadle, M. J.; Sowers, D.; Lobecker, E.; Ford, M.; Garcia-Moliner, G.; Gray, L. M.; Chaytor, J. D.; Demopoulos, A. W.
2016-02-01
From February to April 2015, NOAA Ship Okeanos Explorer, America's Ship for Ocean Exploration, surveyed unknown deep-sea ecosystems and potential geohazards off the coast of Puerto Rico and the US Virgin Islands. Over 37,500 km² of high-resolution multibeam sonar data was collected, revealing rugged canyons along shelf breaks, intricate incised channels, and large slumps and slope failures. Twelve remotely operated vehicle (ROV) dives surveyed seamounts, escarpments, and submarine canyons at depths of 300-6,000 m. Additional ROV exploration of the water column occurred at depths of 800-1,200 m. Dives included three of the deepest dives ever conducted in the Puerto Rico Trench and the first exploration of Exocet and Whiting seamounts. Discoveries included assemblages of deep-sea corals (>50 species), and observations of several rare and new species. For example, the seastar Laetmaster spectabilis had not been documented since its original description in 1881 and a new species of benthopelagic cydippid ctenophore was observed at 3,900 m in the Arecibo Amphitheater. Other expedition highlights included two rarely observed blind octopods (Cirrothauma murrayi); novel observation of a symbiotic association between predatory tunicates with polychaete associates; and approximately 75 species of demersal fishes, including a new species of wrasse and the first records of Shaefer's anglerfish and the ateleopodid jellynose in Puerto Rican waters. ROV dives traversed elements of the complete geological succession from 1 km deep into the Cretaceous volcanic arc basement, across the carbonate platform sequence unconformity and into the uppermost Pliocene carbonates. Highlights included spectacular slope failure headwall scarps and sub-aerial karstic weathering of the youngest carbonates. All data collected during Océano Profundo 2015 are now publicly available through the National Archives and are awaiting further analysis by the scientific community.
Using the TIGR gene index databases for biological discovery.
Lee, Yuandan; Quackenbush, John
2003-11-01
The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.
Identification of Prostate Cancer-Specific microDNAs
2016-02-01
circular DNA by rolling circle amplification (RCA) and then amplified DNA fragments were subject to deep sequencing. Deep sequencing of the...demonstrate the existence of microDNAs in prostate cancer. We adopted multiple displacement amplification (MDA) with random 2 primers for enriched...prostate cancer cells through multiple displacement amplification and next generation sequencing.
The eclipsing AM Herculis variable H1907 + 690
NASA Technical Reports Server (NTRS)
Remillard, R. A.; Silber, A.; Stroozas, B. A.; Tapia, S.
1991-01-01
The discovery is reported of an eclipsing cataclysmic variable that exhibits up to 10 percent circular polarization at optical wavelengths, securing its classification as an AM Herculis type binary. The object, H1907 + 690, was located with the guidance of X-ray positions from the HEAO 1 survey. Optical CCD photometry exhibits deep eclipses, from which is derived a precise orbital period of 1.743750 hr. The eclipse duration suggests an inclination angle of about 80 deg for a main-sequence secondary star. The optical flux has been persistently faint during observations spanning 1987-1990, while the X-ray measurements suggest long-term X-ray variability. The polarization and photometric light curves can be interpreted with a geometric model in which most of the accretion is directed toward a single magnetic pole, with an accretion spot displaced about 17 deg in longitude from the projection of the secondary star on the white dwarf surface.
Day-Williams, Aaron G.; McLay, Kirsten; Drury, Eleanor; Edkins, Sarah; Coffey, Alison J.; Palotie, Aarno; Zeggini, Eleftheria
2011-01-01
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size. PMID:22069447
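The limitation on allele frequency estimation mentioned above can be illustrated with the simplest possible estimator: the fraction of reads carrying the alternate allele at a site. The sketch below is an assumption-laden illustration, not the authors' pipeline. It treats reads as independent binomial draws (ignoring unequal DNA contribution per individual and sequencing error), and the function names are hypothetical.

```python
import math


def pooled_allele_frequency(alt_reads: int, total_reads: int):
    """Naive allele frequency estimate from pooled reads at one site.

    Returns (estimate, standard_error) under a simple binomial read-sampling
    model; real pools add further variance that this ignores.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    p = alt_reads / total_reads
    se = math.sqrt(p * (1.0 - p) / total_reads)
    return p, se


def min_nonzero_frequency(pool_size: int) -> float:
    """Smallest nonzero allele frequency possible in a pool of diploids."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 1.0 / (2 * pool_size)


# Hypothetical site: 10 alternate reads out of 200 in a pool of 50 people.
freq, se = pooled_allele_frequency(10, 200)
floor = min_nonzero_frequency(50)
print(f"estimate {freq:.3f} +/- {se:.3f}; single-copy frequency {floor:.3f}")
```

In this toy case the read-sampling standard error is larger than the frequency of a single allele copy in the pool, which shows why variant presence is easier to establish than an accurate frequency.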
Neo-sex Chromosomes in the Monarch Butterfly, Danaus plexippus
Mongue, Andrew J.; Nguyen, Petr; Voleníková, Anna; Walters, James R.
2017-01-01
We report the discovery of a neo-sex chromosome in the monarch butterfly, Danaus plexippus, and several of its close relatives. Z-linked scaffolds in the D. plexippus genome assembly were identified via sex-specific differences in Illumina sequencing coverage. Additionally, a majority of the D. plexippus genome assembly was assigned to chromosomes based on counts of one-to-one orthologs relative to the butterfly Melitaea cinxia (with replication using two other lepidopteran species), in which genome scaffolds have been mapped to linkage groups. Sequencing coverage-based assessments of Z linkage combined with homology-based chromosomal assignments provided strong evidence for a Z-autosome fusion in the Danaus lineage, involving the autosome homologous to chromosome 21 in M. cinxia. Coverage analysis also identified three notable assembly errors resulting in chimeric Z-autosome scaffolds. Cytogenetic analysis further revealed a large W chromosome that is partially euchromatic, consistent with being a neo-W chromosome. The discovery of a neo-Z and the provisional assignment of chromosome linkage for >90% of D. plexippus genes lays the foundation for novel insights concerning sex chromosome evolution in this female-heterogametic model species for functional and evolutionary genomics. PMID:28839116
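The coverage-based identification of Z-linked scaffolds can be sketched as follows. In a female-heterogametic (ZW) species, males carry two Z copies and females one, so a Z-linked scaffold is expected to show roughly twice the normalized coverage in males, a log2 male:female ratio near 1, while autosomal scaffolds sit near 0. The threshold of 0.5, the function name, and the toy values are illustrative assumptions, not parameters from the study.

```python
import math


def classify_scaffolds(male_cov, female_cov, threshold=0.5):
    """Label scaffolds by the log2 ratio of male to female coverage.

    male_cov, female_cov: dicts mapping scaffold name to mean coverage,
    each already normalized by that sample's genome-wide median coverage.
    """
    calls = {}
    for name, m in male_cov.items():
        f = female_cov.get(name, 0.0)
        if m <= 0 or f <= 0:
            # A ratio cannot be computed without coverage in both sexes.
            calls[name] = "undetermined"
            continue
        ratio = math.log2(m / f)
        calls[name] = "Z-linked" if ratio > threshold else "autosomal"
    return calls


# Hypothetical normalized coverages for three scaffolds.
male = {"scaf1": 1.0, "scaf2": 1.0, "scaf3": 0.0}
female = {"scaf1": 1.0, "scaf2": 0.5, "scaf3": 0.2}
calls = classify_scaffolds(male, female)
print(calls)
```

A scaffold whose ratio switches between the two regimes along its length would be a candidate chimeric assembly, which a per-window version of the same calculation could flag.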
Smeele, Zoe E; Ainley, David G; Varsani, Arvind
2018-01-02
The Antarctic, sub-Antarctic islands and surrounding sea-ice provide a unique environment for the existence of organisms. Nonetheless, birds and seals of a variety of species inhabit them, particularly during their breeding seasons. Early research on Antarctic wildlife health, using serology-based assays, showed exposure to viruses in the families Birnaviridae, Flaviviridae, Herpesviridae, Orthomyxoviridae and Paramyxoviridae circulating in seals (Phocidae), penguins (Spheniscidae), petrels (Procellariidae) and skuas (Stercorariidae). It is only during the last decade or so that polymerase chain reaction-based assays have been used to characterize viruses associated with Antarctic animals. Furthermore, it is only during the last five years that full/whole genomes of viruses (adenoviruses, anelloviruses, orthomyxoviruses, a papillomavirus, paramyxoviruses, polyomaviruses and a togavirus) have been sequenced using Sanger sequencing or high throughput sequencing (HTS) approaches. This review summarizes the knowledge of animal Antarctic virology and discusses potential future directions with the advent of HTS in virus discovery and ecology. Copyright © 2017 Elsevier B.V. All rights reserved.
RNAbrowse: RNA-Seq De Novo Assembly Results Browser
Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe
2014-01-01
Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs, are usually bulky and quite diverse. This article presents a user-oriented storage and visualisation environment that allows the data to be explored in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart and is easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse. PMID:24823498
NASA Astrophysics Data System (ADS)
Hammond, S. R.; Baker, E. T.; Embley, R. W.
2015-12-01
Inspiration for the Vents program arose from two serendipitous events: the discovery of seafloor spreading-center hydrothermal venting on the Galápagos Rift in 1977, and NOAA's deployment of the first US civilian research multibeam bathymetric sonar on the NOAA Ship Surveyor in 1979. Multibeam mapping in the NE Pacific revealed an unprecedented and revolutionary perspective of the Gorda and Juan de Fuca spreading centers, thus stimulating a successful exploration for volcanic and hydrothermal activity at numerous locations along both. After the 1986 discovery of the first "megaplume," quickly recognized as the water column manifestation of a deep submarine volcanic eruption, the Vents program embarked on a multi-decadal effort to discover and understand local-, regional-, and, ultimately, global-scale physical, chemical, and biological ocean environmental impacts of submarine volcanism and hydrothermal venting. The Vents program made scores of scientific discoveries, many of which owed their success to the program's equally innovative and productive technological prowess. These discoveries were documented in hundreds of peer-reviewed papers by Vents researchers and their colleagues around the world. An emblematic success was the internationally recognized, first-ever detection, location, and study of an active deep volcanic eruption in 1993. To continue the Vents mission and further enhance its effectiveness in marine science and technology innovation, the program was reorganized in 2014 into two distinct, but closely linked, programs: Earth-Oceans Interactions and Acoustics. Both are currently engaged in expeditions and projects that maintain the Vents tradition of pioneering ocean exploration and research.
Neptune: a bioinformatics tool for rapid discovery of genomic variation in bacterial populations
Marinier, Eric; Zaheer, Rahat; Berry, Chrystal; Weedmark, Kelly A.; Domaratzki, Michael; Mabon, Philip; Knox, Natalie C.; Reimer, Aleisha R.; Graham, Morag R.; Chui, Linda; Patterson-Fortin, Laura; Zhang, Jian; Pagotto, Franco; Farber, Jeff; Mahony, Jim; Seyer, Karine; Bekal, Sadjia; Tremblay, Cécile; Isaac-Renton, Judy; Prystajecky, Natalie; Chen, Jessica; Slade, Peter
2017-01-01
The ready availability of vast amounts of genomic sequence data has created the need to rethink comparative genomics algorithms using ‘big data’ approaches. Neptune is an efficient system for rapidly locating differentially abundant genomic content in bacterial populations using an exact k-mer matching strategy, while accommodating k-mer mismatches. Neptune’s loci discovery process identifies sequences that are sufficiently common to a group of target sequences and sufficiently absent from non-targets using probabilistic models. Neptune uses parallel computing to efficiently identify and extract these loci from draft genome assemblies without requiring multiple sequence alignments or other computationally expensive comparative sequence analyses. Tests on simulated and real datasets showed that Neptune rapidly identifies regions that are both sensitive and specific. We demonstrate that this system can identify trait-specific loci from different bacterial lineages. Neptune is broadly applicable for comparative bacterial analyses, yet will particularly benefit pathogenomic applications, owing to efficient and sensitive discovery of differentially abundant genomic loci. The software is available for download at: http://github.com/phac-nml/neptune. PMID:29048594
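The inclusion/exclusion idea in the abstract above can be illustrated with a small sketch. This is not Neptune's implementation: it uses plain set logic on exact k-mers, ignores reverse complements and mismatches, and replaces the probabilistic models with a simple fraction threshold. The function names and the `min_target_fraction` parameter are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(targets, non_targets, k, min_target_fraction=1.0):
    """Return k-mers found in enough targets and in no non-target.

    Assumption: a hard threshold stands in for the statistical
    "sufficiently common / sufficiently absent" criteria.
    """
    # Count in how many target sequences each k-mer occurs.
    counts = Counter()
    for seq in targets:
        counts.update(kmers(seq, k))

    # Collect every k-mer seen in any non-target sequence.
    excluded = set()
    for seq in non_targets:
        excluded |= kmers(seq, k)

    needed = min_target_fraction * len(targets)
    return {km for km, n in counts.items() if n >= needed and km not in excluded}
```

Under these assumptions, calling `signature_kmers(["ACGTACGGA", "TTACGGAAC"], ["ACGTACGT"], k=4)` keeps only the 4-mers shared by both targets and missing from the non-target.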
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bohacs, K.M.
1990-05-01
Deep basinal rocks of the Monterey Formation can be allocated to different depositional environments based on an integration of bedding, facies stacking patterns, lithology, biofacies, and inorganic and organic chemistry. These rocks show evidence of systematic changes in depositional environments that can be related to eustatic sea level change and basin evolution. Even deep-basinal environments are affected by changing sea level through changes in circulation patterns and intensities, nutrient budgets and dispersal patterns, and location and intensity of the oceanic oxygen minimum. The sequence-stratigraphic framework was constructed based on the physical expression of the outcrop strata and confirmed by tying the outcrop sections to an integrated well-log/seismic grid through outcrop gamma-ray-spectral profiles. Interpretation of a sequence boundary was based on increased proportions of hemipelagic facies, evidence of increased bottom-energy levels above the boundary, and local erosion and relief on the surface. The proportion of shallower water and reworked dinoflagellates increased to a local maximum above the boundary. Downlap surfaces exhibited increased proportions of pelagic facies around the surface, evidence of decreased bottom-energy levels and terrigenous sedimentation rates, and little or no significant erosion on the surface. The proportion of deeper water dinoflagellates increased to a local maximum at or near the downlap surface; there was no evidence of reworked individuals. The detailed sequence-stratigraphic framework makes it possible to relate the rock properties to genetic processes for construction of predictive models.
Diversity of Pico- to Mesoplankton along the 2000 km Salinity Gradient of the Baltic Sea
Hu, Yue O. O.; Karlson, Bengt; Charvet, Sophie; Andersson, Anders F.
2016-01-01
Microbial plankton form the productive base of both marine and freshwater ecosystems and are key drivers of global biogeochemical cycles of carbon and nutrients. Plankton diversity is immense with representations from all major phyla within the three domains of life. So far, plankton monitoring has mainly been based on microscopic identification, which has limited sensitivity and reproducibility, not least because of the numerical majority of plankton being unidentifiable under the light microscope. High-throughput sequencing of taxonomic marker genes offers a means to identify taxa inaccessible by traditional methods; thus, recent studies have unveiled an extensive previously unknown diversity of plankton. Here, we conducted ultra-deep Illumina sequencing (average 10^5 sequences/sample) of rRNA gene amplicons of surface water eukaryotic and bacterial plankton communities sampled in summer along a 2000 km transect following the salinity gradient of the Baltic Sea. Community composition was strongly correlated with salinity for both bacterial and eukaryotic plankton assemblages, highlighting the importance of salinity for structuring the biodiversity within this ecosystem. In contrast, no clear trends in alpha-diversity for bacterial or eukaryotic communities could be detected along the transect. The distribution of major planktonic taxa followed expected patterns as observed in monitoring programs, but groups novel to the Baltic Sea were also identified, such as relatives to the coccolithophore Emiliania huxleyi detected in the northern Baltic Sea. This study provides the first ultra-deep sequencing-based survey on eukaryotic and bacterial plankton biogeography in the Baltic Sea. PMID:27242706
NASA Astrophysics Data System (ADS)
Mobarhan, Kamran S.
2007-06-01
Every year large sums of taxpayers' money are used to fund scientific research at various universities. The result is outstanding new discoveries which are published in scientific journals. However, more often than not, once the funding for these research programs ends, the results of these new discoveries are buried deep within old issues of technical journals which are archived in university libraries and are consequently forgotten. Ideally, these scientific discoveries and technological advances generated at our academic institutions should lead to the creation of new jobs for our graduating students and emerging scientists and professionals. In this fashion, the students who worked hard to produce these new discoveries and technological advances can continue their good work at companies that they helped launch and establish. This article explores some of the issues related to new business development activities at academic institutions. Included is a discussion of possible ways of helping graduating students create jobs for themselves, and for their fellow students, through the creation of new companies based on the work that they did during their university studies.
Dystonia: an update on phenomenology, classification, pathogenesis and treatment.
Balint, Bettina; Bhatia, Kailash P
2014-08-01
This article will highlight recent advances in dystonia with focus on clinical aspects such as the new classification, syndromic approach, new gene discoveries and genotype-phenotype correlations. Broadening of phenotype of some of the previously described hereditary dystonias and environmental risk factors and trends in treatment will be covered. Based on phenomenology, a new consensus update on the definition, phenomenology and classification of dystonia and a syndromic approach to guide diagnosis have been proposed. Terminology has changed and 'isolated dystonia' is used wherein dystonia is the only motor feature apart from tremor, and the previously called heredodegenerative dystonias and dystonia plus syndromes are now subsumed under 'combined dystonia'. The recently discovered genes ANO3, GNAL and CIZ1 appear not to be a common cause of adult-onset cervical dystonia. Clinical and genetic heterogeneity underlie myoclonus-dystonia, dopa-responsive dystonia and deafness-dystonia syndrome. ALS2 gene mutations are a newly recognized cause for combined dystonia. The phenotypic and genotypic spectra of ATP1A3 mutations have considerably broadened. Two new genome-wide association studies identified new candidate genes. A retrospective analysis suggested complicated vaginal delivery as a modifying risk factor in DYT1. Recent studies confirm lasting therapeutic effects of deep brain stimulation in isolated dystonia, good treatment response in myoclonus-dystonia, and suggest that early treatment correlates with a better outcome. Phenotypic classification continues to be important to recognize particular forms of dystonia and this includes syndromic associations. There are a number of genes underlying isolated or combined dystonia and there will be further new discoveries with the advances in genetic technologies such as exome and whole-genome sequencing. 
The identification of new genes will facilitate better elucidation of pathogenetic mechanisms and possible corrective therapies.
Accurate read-based metagenome characterization using a hierarchical suite of unique signatures
Freitas, Tracey Allen K.; Li, Po-E; Scholz, Matthew B.; Chain, Patrick S. G.
2015-01-01
A major challenge in the field of shotgun metagenomics is the accurate identification of organisms present within a microbial community, based on classification of short sequence reads. Though existing microbial community profiling methods have attempted to rapidly classify the millions of reads output from modern sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, errors and biases in sequencing technologies, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here, we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly and consistently smaller FDR than any other available method. Our algorithm circumvents false positives using a series of non-redundant signature databases and examines Genomic Origins Through Taxonomic CHAllenge (GOTTCHA). GOTTCHA was tested and validated on 20 synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools. PMID:25765641
NASA Technical Reports Server (NTRS)
Woese, C. R.; Achenbach, L.; Rouviere, P.; Mandelco, L.
1991-01-01
A major and too little recognized source of artifact in phylogenetic analysis of molecular sequence data is compositional difference among sequences. The problem becomes particularly acute when alignments contain ribosomal RNAs from both mesophilic and thermophilic species. Among prokaryotes the latter are considerably higher in G + C content than the former, which often results in artificial clustering of thermophilic lineages and their being placed artificially deep in phylogenetic trees. In this communication we review archaeal phylogeny in the light of this consideration, focusing in particular on the phylogenetic position of the sulfate reducing species Archaeoglobus fulgidus, using both 16S rRNA and 23S rRNA sequences. The analysis shows clearly that the previously reported deep branching of the A. fulgidus lineage (very near the base of the euryarchaeal side of the archaeal tree) is incorrect, and that the lineage actually groups with a previously recognized unit that comprises the Methanomicrobiales and extreme halophiles.
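The compositional bias discussed above is usually screened for by comparing G + C content across the aligned sequences before tree building. The helper below is a minimal sketch of that check, not the authors' procedure; the function name is hypothetical, and the choice to ignore gaps and ambiguity codes is an assumption.

```python
def gc_content(seq):
    """Return the G + C fraction of a DNA or RNA sequence.

    Assumption: only A, C, G, T and U are counted, so alignment gaps
    and ambiguity codes such as N do not affect the result.
    """
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)
```

A large spread in `gc_content` between mesophile and thermophile rRNA sequences in the same alignment would be a warning sign for the artificial clustering the abstract describes.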
Detection of non-coding RNA in bacteria and archaea using the DETR'PROK Galaxy pipeline.
Toffano-Nioche, Claire; Luo, Yufei; Kuchly, Claire; Wallon, Claire; Steinbach, Delphine; Zytnicki, Matthias; Jacq, Annick; Gautheret, Daniel
2013-09-01
RNA-seq experiments are now routinely used for the large scale sequencing of transcripts. In bacteria or archaea, such deep sequencing experiments typically produce 10-50 million fragments that cover most of the genome, including intergenic regions. In this context, the precise delineation of the non-coding elements is challenging. Non-coding elements include untranslated regions (UTRs) of mRNAs, independent small RNA genes (sRNAs) and transcripts produced from the antisense strand of genes (asRNA). Here we present a computational pipeline (DETR'PROK: detection of ncRNAs in prokaryotes) based on the Galaxy framework that takes as input a mapping of deep sequencing reads and performs successive steps of clustering, comparison with existing annotation and identification of transcribed non-coding fragments classified into putative 5' UTRs, sRNAs and asRNAs. We provide a step-by-step description of the protocol using real-life example data sets from Vibrio splendidus and Escherichia coli. Copyright © 2013 The Authors. Published by Elsevier Inc. All rights reserved.
Yu, Dongliang; Meng, Yijun; Zuo, Ziwei; Xue, Jie; Wang, Huizhong
2016-01-01
Nat-siRNAs (small interfering RNAs originated from natural antisense transcripts) are a class of functional small RNA (sRNA) species discovered in both plants and animals. These siRNAs are highly enriched within the annealed regions of the NAT (natural antisense transcript) pairs. To date, considerable research effort has gone into the systematic identification of NATs in various organisms. However, researchers still strongly need a freely available and easy-to-use program for NAT prediction. Here, we proposed an integrative pipeline named NATpipe for systematic discovery of NATs from de novo assembled transcriptomes. By utilizing sRNA sequencing data, the pipeline also allowed users to search for phase-distributed nat-siRNAs within the perfectly annealed regions of the NAT pairs. Additionally, more reliable nat-siRNA loci could be identified based on degradome sequencing data. A case study on the non-model plant Dendrobium officinale was performed to illustrate the utility of NATpipe. Finally, we hope that NATpipe will be a useful tool for NAT prediction, nat-siRNA discovery, and related functional studies. NATpipe is available at www.bioinfolab.cn/NATpipe/NATpipe.zip. PMID:26858106
NASA Astrophysics Data System (ADS)
Stewart, Kent D.; Steffy, Kevin; Harris, Kevin; Harlan, John E.; Stoll, Vincent S.; Huth, Jeffrey R.; Walter, Karl A.; Gramling-Evans, Emily; Mendoza, Renaldo R.; Severin, Jean M.; Richardson, Paul L.; Barrett, Leo W.; Matayoshi, Edmund D.; Swift, Kerry M.; Betz, Stephen F.; Muchmore, Steve W.; Kempf, Dale J.; Molla, Akhter
2007-01-01
Two new proteins of approximately 70 amino acids in length, corresponding to an unnaturally-linked N- and C-helix of the ectodomain of the gp41 protein from the human immunodeficiency virus (HIV) type 1, were designed and characterized. A designed tripeptide links the C-terminus of the C-helix with the N-terminus of the N-helix in a circular permutation so that the C-helix precedes the N-helix in sequence. In addition to the artificial peptide linkage, the C-helix is truncated at its N-terminus to expose a region of the N-helix known as the "Trp-Trp-Ile" binding pocket. Sedimentation, crystallographic, and nuclear magnetic resonance studies confirmed that the protein had the desired trimeric structure with an unoccupied binding site. Spectroscopic and centrifugation studies demonstrated that the engineered protein had ligand binding characteristics similar to previously reported constructs. Unlike previous constructs which expose additional, shallow, non-conserved, and undesired binding pockets, only the single deep and conserved Trp-Trp-Ile pocket is exposed in the proteins of this study. This engineered version of gp41 protein will be potentially useful in research programs aimed at discovery of new drugs for therapy of HIV-infection in humans.
An, Xiaoping; Fan, Hang; Ma, Maijuan; Anderson, Benjamin D.; Jiang, Jiafu; Liu, Wei; Cao, Wuchun; Tong, Yigang
2014-01-01
This paper explored our hypothesis that the sRNA (18∼30 bp) deep sequencing technique can be used as an efficient strategy to identify microorganisms other than viruses, such as prokaryotic and eukaryotic pathogens. In the study, the clean reads derived from the sRNA deep sequencing data of wild-caught ticks and mosquitoes were compared against the NCBI nucleotide collection (non-redundant nt database) using Blastn. The blast results were then analyzed with in-house Python scripts. An empirical formula was proposed to identify the putative pathogens. Results showed that not only viruses but also prokaryotic and eukaryotic species of interest can be screened out and were subsequently confirmed with experiments. Specifically, a novel Rickettsia spp. was indicated to exist in Haemaphysalis longicornis ticks collected in Beijing. Our study demonstrated that the reuse of sRNA deep sequencing data has the potential to trace the origin of pathogens or discover novel agents of emerging/re-emerging infectious diseases. PMID:24618575
Xiao, Bingbing; Niu, Xiaoxi; Han, Na; Wang, Ben; Du, Pengcheng; Na, Risu; Chen, Chen; Liao, Qinping
2016-06-02
Bacterial vaginosis (BV) is a highly prevalent disease in women and increases the risk of pelvic inflammatory disease. It has received wide attention because of its high recurrence rate. Traditional microscopy-based diagnostic methods provide limited information on the vaginal microbiota, which makes it difficult to trace the development of the disease under conditions of bacterial resistance. In this study, we used deep-sequencing technology to observe dynamic variation of the vaginal microbiota at three major time points during treatment: D0 (before treatment), D7 (when antibiotic use stopped) and D30 (the 30-day follow-up visit). Sixty-five patients with BV were enrolled (48 were cured and 17 were not cured), and the bacterial composition of their vaginal microbiota was compared. Interestingly, we identified 9 patients who might experience recurrence. We also introduced a new measurement point at D7, although the microbiota at that point were significantly inhibited by the antibiotic and hard to observe by traditional methods. The vaginal microbiota, as viewed by deep sequencing, showed a strong correlation with the final outcome. Thus, by coupling detailed individual bioinformatics analysis with deep-sequencing technology, we may produce a more accurate map of the vaginal microbiota of BV patients, which provides a new opportunity to reduce the rate of recurrence of BV.
Brouilette, Scott; Kuersten, Scott; Mein, Charles; Bozek, Monika; Terry, Anna; Dias, Kerith-Rae; Bhaw-Rosun, Leena; Shintani, Yasunori; Coppen, Steven; Ikebe, Chiho; Sawhney, Vinit; Campbell, Niall; Kaneko, Masahiro; Tano, Nobuko; Ishida, Hidekazu; Suzuki, Ken; Yashiro, Kenta
2012-10-01
Deep sequencing of single cell-derived cDNAs offers novel insights into oncogenesis and embryogenesis. However, traditional library preparation for RNA-seq analysis requires multiple steps with consequent sample loss and stochastic variation at each step significantly affecting output. Thus, a simpler and better protocol is desirable. The recently developed hyperactive Tn5-mediated library preparation, which brings high quality libraries, is likely one of the solutions. Here, we tested the applicability of hyperactive Tn5-mediated library preparation to deep sequencing of single cell cDNA, optimized the protocol, and compared it with the conventional method based on sonication. This new technique does not require any expensive or special equipment, which secures wider availability. A library was constructed from only 100 ng of cDNA, which enables the saving of precious specimens. Only a few steps of robust enzymatic reaction resulted in saved time, enabling more specimens to be prepared at once, and with a more reproducible size distribution among the different specimens. The obtained RNA-seq results were comparable to the conventional method. Thus, this Tn5-mediated preparation is applicable for anyone who aims to carry out deep sequencing for single cell cDNAs. Copyright © 2012 Wiley Periodicals, Inc.
Guo, Feng; Wang, Zhi-Ping; Yu, Ke; Zhang, T.
2015-01-01
Foaming of activated sludge (AS) causes adverse impacts on wastewater treatment operation and hygiene. In this study, we investigated the microbial communities of foam, foaming AS and non-foaming AS in a sewage treatment plant via deep-sequencing of the taxonomic marker genes 16S rRNA and mycobacterial rpoB and a metagenomic approach. In addition to Actinobacteria, many genera (e.g., Clostridium XI, Arcobacter, Flavobacterium) were more abundant in the foam than in the AS. On the other hand, deep-sequencing of rpoB did not detect any obligate pathogenic mycobacteria in the foam. We found that unknown factors other than the abundance of Gordonia sp. could determine the foaming process, because abundance of the same species was stable before and after a foaming event over six months. More interestingly, although the dominant Gordonia foam former was closest to G. amarae, it was identified as an undescribed Gordonia species by referring to the 16S rRNA gene, gyrB and, most convincingly, the reconstructed draft genome from metagenomic reads. Our results, based on metagenomics and deep sequencing, reveal that foams are derived from diverse taxa, which expands previous understanding and provides new insight into the underlying complications of the foaming phenomenon in AS. PMID:25560234
Gene discovery using next-generation pyrosequencing to develop ESTs for Phalaenopsis orchids
2011-01-01
Background: Orchids are one of the most diversified angiosperms, but few genomic resources are available for these non-model plants. In addition to its ecological significance, Phalaenopsis is considered economically important to the floriculture industry worldwide. We aimed to use massively parallel 454 pyrosequencing for a global characterization of the Phalaenopsis transcriptome. Results: To maximize sequence diversity, we pooled RNA from 10 samples of different tissues, various developmental stages, and biotic- or abiotic-stressed plants. We obtained 206,960 expressed sequence tags (ESTs) with an average read length of 228 bp. These reads were assembled into 8,233 contigs and 34,630 singletons. The unigenes were searched against the NCBI non-redundant (NR) protein database. Based on sequence similarity with known proteins, these analyses identified 22,234 different genes (E-value cutoff, e-7). Assembled sequences were annotated with Gene Ontology, Gene Family and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Among these annotations, over 780 unigenes encoding putative transcription factors were identified. Conclusion: Pyrosequencing was effective in identifying a large set of unigenes from Phalaenopsis. The informative EST dataset we developed constitutes a much-needed resource for discovery of genes involved in various biological processes in Phalaenopsis and other orchid species. These transcribed sequences will narrow the gap between study of model organisms with many genomic resources and species that are important for ecological and evolutionary studies. PMID:21749684
Simple sequence repeat marker loci discovery using SSR primer.
Robinson, Andrew J; Love, Christopher G; Batley, Jacqueline; Barker, Gary; Edwards, David
2004-06-12
Simple sequence repeats (SSRs) have become important molecular markers for a broad range of applications, such as genome mapping and characterization, phenotype mapping, marker assisted selection of crop plants and a range of molecular ecology and diversity studies. With the increase in the availability of DNA sequence information, an automated process to identify and design PCR primers for amplification of SSR loci would be a useful tool in plant breeding programs. We report an application that integrates SPUTNIK, an SSR repeat finder, with Primer3, a PCR primer design program, into one pipeline tool, SSR Primer. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. The results are parsed to Primer3 for locus-specific primer design. The script makes use of a Web-based interface, enabling remote use. This program has been written in PERL and is freely available for non-commercial users by request from the authors. The Web-based version may be accessed at http://hornbill.cspp.latrobe.edu.au/
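The screening step described above (find SSRs in each submitted sequence before designing primers) can be sketched with a regular expression. This is not SPUTNIK's algorithm, which scores imperfect repeats; the sketch below finds only perfect tandem repeats, and the function name and thresholds are hypothetical.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    Assumption: repeats must be exact, and units made of a single
    repeated base (e.g. "AA") are skipped as homopolymer runs.
    """
    # Lazy unit match so the shortest repeating unit is tried first;
    # the backreference \1 then requires further copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits
```

In a pipeline like the one described, each reported start position and repeat length would mark the region that a primer design tool is asked to flank.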
Wang, Shaolin; Yang, Zhongli; Ma, Jennie Z.; Payne, Thomas J.; Li, Ming D
2013-01-01
Through linkage analysis, candidate gene approach, and genome-wide association studies (GWAS), many genetic susceptibility factors for substance dependence have been discovered, such as the aldehyde dehydrogenase gene (ALDH2) for alcohol dependence (AD) and nicotinic acetylcholine receptor (nAChR) subunit variants on chromosomes 8 and 15 for nicotine dependence (ND). However, these confirmed genetic factors contribute only a small portion of the heritability responsible for each addiction. Among many potential factors, rare variants in those identified and unidentified susceptibility genes are supposed to contribute greatly to the missing heritability. Several studies focusing on rare variants have been conducted by taking advantage of next-generation sequencing technologies, which revealed that some rare variants of nAChR subunits are associated with ND in both genetic and functional studies. However, these studies investigated variants for only a small number of genes and need to be expanded to broad regions/genes in a larger population. This review presents an update on recently developed methods for rare-variant identification and association analysis and on studies focused on rare-variant discovery and function related to addictions. PMID:23990377
Effect of Next-Generation Exome Sequencing Depth for Discovery of Diagnostic Variants.
Kim, Kyung; Seong, Moon-Woo; Chung, Won-Hyong; Park, Sung Sup; Leem, Sangseob; Park, Won; Kim, Jihyun; Lee, KiYoung; Park, Rae Woong; Kim, Namshin
2015-06-01
Sequencing depth, which is directly related to the cost and time required for the generation, processing, and maintenance of next-generation sequencing data, is an important factor in the practical utilization of such data in clinical fields. Unfortunately, identifying an exome sequencing depth adequate for clinical use is a challenge that has not been addressed extensively. Here, we investigate the effect of exome sequencing depth on the discovery of sequence variants for clinical use. Toward this, we sequenced ten germ-line blood samples from breast cancer patients on the Illumina platform GAII(x) at a high depth of ~200×. We observed that most function-related diverse variants in the human exonic regions could be detected at a sequencing depth of 120×. Furthermore, investigation using a diagnostic gene set showed that the number of clinical variants identified using exome sequencing reached a plateau at an average sequencing depth of about 120×. Moreover, the phenomena were consistent across the breast cancer samples.
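To make the link between depth and cost concrete, mean depth can be estimated as (reads × read length × on-target fraction) / target size. The sketch below applies that standard back-of-envelope relation; the function names are hypothetical and the 50 Mb target and 100 bp reads in the usage note are illustrative assumptions, not figures from the study.

```python
import math


def mean_depth(n_reads, read_length, target_bases, on_target_fraction=1.0):
    """Estimate mean sequencing depth over a capture target."""
    return n_reads * read_length * on_target_fraction / target_bases


def reads_for_depth(depth, read_length, target_bases, on_target_fraction=1.0):
    """Estimate how many reads are needed to reach a given mean depth."""
    return math.ceil(depth * target_bases / (read_length * on_target_fraction))
```

Under these assumptions, reaching the ~120× plateau reported above on a hypothetical 50 Mb target with 100 bp reads would take about `reads_for_depth(120, 100, 50_000_000)` reads, before accounting for duplicates and off-target reads.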
Wang, Guojun; Barrett, Nolan H; McCarthy, Peter J
2017-02-02
The proteobacterium Alteromonas sp. strain V450 was isolated from the Atlantic deep-sea sponge Leiodermatium sp. Here, we report the draft genome sequence of this strain, with a genome size of approx. 4.39 Mb and a G+C content of 44.01%. The results will aid deep-sea microbial ecology, evolution, and sponge-microbe association studies. Copyright © 2017 Wang et al.
ERIC Educational Resources Information Center
Thumm, Walter
1975-01-01
Relates the story of Wilhelm Conrad Rontgen and presents one view of the extent to which the discovery of the x-ray was an accident. Reconstructs the sequence of events that led to the discovery and includes photographs of the lab where he worked and replicas of apparatus used. (GS)
Zhang, Ning; Wen, Jun; Zimmer, Elizabeth A.
2015-01-01
Vitaceae is well-known for having one of the most economically important fruits, i.e., the grape (Vitis vinifera). The deep phylogeny of the grape family was not resolved until a recent phylogenomic analysis of 417 nuclear genes from transcriptome data. However, it has been reported extensively that topologies based on nuclear and organellar genes may be incongruent due to differences in their evolutionary histories. Therefore, it is important to reconstruct a backbone phylogeny of the grape family using plastomes and mitochondrial genes. In this study, next-generation sequencing data sets of 27 species were obtained using genome skimming with total DNAs from silica-gel preserved tissue samples on an Illumina HiSeq 2500 instrument. Plastomes were assembled using the combination of de novo and reference genome (of V. vinifera) methods. Sixteen mitochondrial genes were also obtained via genome skimming using the reference genome of V. vinifera. Extensive phylogenetic analyses were performed using maximum likelihood and Bayesian methods. The topology based on either plastome data or mitochondrial genes is congruent with the one using hundreds of nuclear genes, indicating that the grape family did not exhibit significant reticulation at the deep level. The results showcase the power of genome skimming in capturing extensive phylogenetic data: especially from chloroplast and mitochondrial DNAs. PMID:26656830
Zhang, Ning; Wen, Jun; Zimmer, Elizabeth A
2015-01-01
Vitaceae is well-known for having one of the most economically important fruits, i.e., the grape (Vitis vinifera). The deep phylogeny of the grape family was not resolved until a recent phylogenomic analysis of 417 nuclear genes from transcriptome data. However, it has been reported extensively that topologies based on nuclear and organellar genes may be incongruent due to differences in their evolutionary histories. Therefore, it is important to reconstruct a backbone phylogeny of the grape family using plastomes and mitochondrial genes. In this study, next-generation sequencing data sets of 27 species were obtained using genome skimming with total DNAs from silica-gel preserved tissue samples on an Illumina NextSeq 500 instrument [corrected]. Plastomes were assembled using the combination of de novo and reference genome (of V. vinifera) methods. Sixteen mitochondrial genes were also obtained via genome skimming using the reference genome of V. vinifera. Extensive phylogenetic analyses were performed using maximum likelihood and Bayesian methods. The topology based on either plastome data or mitochondrial genes is congruent with the one using hundreds of nuclear genes, indicating that the grape family did not exhibit significant reticulation at the deep level. The results showcase the power of genome skimming in capturing extensive phylogenetic data: especially from chloroplast and mitochondrial DNAs.
Jenkins, Adam M; Waterhouse, Robert M; Muskavitch, Marc A T
2015-04-23
Long non-coding RNAs (lncRNAs) have been defined as mRNA-like transcripts longer than 200 nucleotides that lack significant protein-coding potential, and many of them constitute scaffolds for ribonucleoprotein complexes with critical roles in epigenetic regulation. Various lncRNAs have been implicated in the modulation of chromatin structure, transcriptional and post-transcriptional gene regulation, and regulation of genomic stability in mammals, Caenorhabditis elegans, and Drosophila melanogaster. The purpose of this study is to identify the lncRNA landscape in the malaria vector An. gambiae and assess the evolutionary conservation of lncRNAs and their secondary structures across the Anopheles genus. Using deep RNA sequencing of multiple Anopheles gambiae life stages, we have identified 2,949 lncRNAs and more than 300 previously unannotated putative protein-coding genes. The lncRNAs exhibit differential expression profiles across life stages and adult genders. We find that across the genus Anopheles, lncRNAs display much lower sequence conservation than protein-coding genes. Additionally, we find that lncRNA secondary structure is highly conserved within the Gambiae complex, but diverges rapidly across the rest of the genus Anopheles. This study offers one of the first lncRNA secondary structure analyses in vector insects. Our description of lncRNAs in An. gambiae offers the most comprehensive genome-wide insights to date into lncRNAs in this vector mosquito, and defines a set of potential targets for the development of vector-based interventions that may further curb the human malaria burden in disease-endemic countries.
Heuckmann, J M; Thomas, R K
2015-09-01
The identification of 'druggable' kinase gene alterations has revolutionized cancer treatment in the last decade by providing new and successfully targetable drug targets. Thus, genotyping tumors to match the right patients with the right drugs has become a clinical routine. Today, advances in sequencing technology and computational genome analyses enable the discovery of a constantly growing number of genome alterations relevant for clinical decision making. As a consequence, several technological approaches have emerged in order to deal with these rapidly increasing demands for clinical cancer genome analyses. Here, we describe challenges on the path to the broad introduction of diagnostic cancer genome analyses and the technologies that can be applied to overcome them. We define three generations of molecular diagnostics that are in clinical use. The latest generation of these approaches involves deep, and thus highly sensitive, sequencing of all therapeutically relevant types of genome alterations (mutations, copy number alterations and rearrangements/fusions) in a single assay. Such approaches therefore have substantial advantages (less time and less tissue required) over PCR-based methods that typically have to be combined with fluorescence in situ hybridization for detection of gene amplifications and fusions. Since these new technologies work reliably on routine diagnostic formalin-fixed, paraffin-embedded specimens, they can help expedite the broad introduction of personalized cancer therapy into the clinic by providing comprehensive, sensitive and accurate cancer genome diagnoses in 'real-time'. © The Author 2015. Published by Oxford University Press on behalf of the European Society for Medical Oncology.
Transcriptome sequences resolve deep relationships of the grape family.
Wen, Jun; Xiong, Zhiqiang; Nie, Ze-Long; Mao, Likai; Zhu, Yabing; Kan, Xian-Zhao; Ickert-Bond, Stefanie M; Gerrath, Jean; Zimmer, Elizabeth A; Fang, Xiao-Dong
2013-01-01
Previous phylogenetic studies of the grape family (Vitaceae) yielded poorly resolved deep relationships, thus impeding our understanding of the evolution of the family. Next-generation sequencing now offers access to protein coding sequences very easily, quickly and cost-effectively. To improve upon earlier work, we extracted 417 orthologous single-copy nuclear genes from the transcriptomes of 15 species of the Vitaceae, covering its phylogenetic diversity. The resulting transcriptome phylogeny provides robust support for the deep relationships, showing the phylogenetic utility of transcriptome data for plants over a time scale at least since the mid-Cretaceous. The pros and cons of transcriptome data for phylogenetic inference in plants are also evaluated.
Regional stratigraphy and petroleum potential, Ghadames basin, Algeria
DOE Office of Scientific and Technical Information (OSTI.GOV)
Emme, J.J.; Sunderland, B.L.
1991-03-01
The Ghadames basin in east-central Algeria extends over 65,000 km² (25,000 mi²), of which 90% is covered by dunes of the eastern Erg. This intracratonic basin consists of up to 6000 m (20,000 ft) of dominantly clastic Paleozoic through Mesozoic strata. The Ghadames basin is part of a larger, composite basin complex (Ilizzi-Ghadames-Triassic basins) where Paleozoic strata have been truncated during a Hercynian erosional event and subsequently overlain by a northward-thickening wedge of Mesozoic sediments. Major reservoir rocks include Triassic sandstones that produce oil, gas, and condensate in the western Ghadames basin, Siluro-Devonian sandstones that produce mostly oil in the shallower Ilizzi basin to the south, and Cambro-Ordovician orthoquartzites that produce oil at Hassi Messaoud to the northwest. Organic shales of the Silurian and Middle-Upper Devonian are considered primary source rocks. Paleozoic shales and Triassic evaporite/red bed sequences act as seals for hydrocarbon accumulations. The central Ghadames basin is underexplored, with less than one wildcat well/1700 km² (one well/420,000 ac). Recent Devonian and Triassic oil discoveries below 3500 m (11,500 ft) indicate that deep oil potential exists. Exploration to date has concentrated on structural traps. Subcrop and facies trends indicate that potential for giant stratigraphic or combination traps exists for both Siluro-Devonian and Triassic intervals. Modern seismic acquisition and processing techniques in high dune areas can be used to successfully identify critical unconformity-bound sequences with significant stratigraphic trap potential. Advances in seismic and drilling technology combined with creative exploration should result in major petroleum discoveries in the Ghadames basin.
Efficient exact motif discovery.
Marschall, Tobias; Rahmann, Sven
2009-06-15
The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.
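The two occurrence counts that the p-value scores above are built on can be illustrated with a minimal Python sketch. It is not the authors' Java implementation and computes no p-values; the function names are hypothetical, and the only external fact assumed is the standard IUPAC nucleotide code table.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Compile an IUPAC string pattern into a regex (hypothetical helper)."""
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    # The zero-width lookahead lets finditer report overlapping occurrences.
    return re.compile("(?=(" + body + "))")

def occurrence_counts(pattern, sequences):
    """Return (total occurrences, number of sequences with at least one)."""
    rx = iupac_to_regex(pattern)
    per_seq = [sum(1 for _ in rx.finditer(s.upper())) for s in sequences]
    return sum(per_seq), sum(1 for n in per_seq if n > 0)
```

The first returned value corresponds to the total number of motif occurrences and the second to the number of sequences with at least one occurrence; turning either into a p-value would additionally need the null model described in the abstract.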
Verbist, Bie; Clement, Lieven; Reumers, Joke; Thys, Kim; Vapirev, Alexander; Talloen, Willem; Wetzels, Yves; Meys, Joris; Aerssens, Jeroen; Bijnens, Luc; Thas, Olivier
2015-02-22
Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. 
Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.
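As a rough illustration of how a called base and a second best base follow from an intensity quadruplet, here is a minimal Python sketch. The channel order (A, C, G, T) and the function names are assumptions made for the example; they are not taken from ViVaMBC or from any base caller.

```python
BASES = "ACGT"  # assumed channel order; real instruments may differ

def call_bases(intensities):
    """Return (best, second_best) base for one 4-channel intensity tuple."""
    if len(intensities) != 4:
        raise ValueError("expected one intensity per nucleotide channel")
    # Rank channel indices from highest to lowest intensity.
    ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
    return BASES[ranked[0]], BASES[ranked[1]]

def call_read(quadruplets):
    """Return the called read and the read of second best bases."""
    calls = [call_bases(q) for q in quadruplets]
    return "".join(b for b, _ in calls), "".join(s for _, s in calls)
```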
Analysis of deep learning methods for blind protein contact prediction in CASP12.
Wang, Sheng; Sun, Siqi; Xu, Jinbo
2018-03-01
Here we present the results of protein contact prediction achieved in CASP12 by our RaptorX-Contact server, which is an early implementation of our deep learning method for contact prediction. On a set of 38 free-modeling target domains with a median family size of around 58 effective sequences, our server obtained an average top L/5 long- and medium-range contact accuracy of 47% and 44%, respectively (L = length). A complete implementation has an average accuracy of 59% and 57%, respectively. Our deep learning method formulates contact prediction as a pixel-level image labeling problem and simultaneously predicts all residue pairs of a protein using a combination of two deep residual neural networks, taking as input the residue conservation information, predicted secondary structure and solvent accessibility, contact potential, and coevolution information. Our approach differs from existing methods mainly in (1) formulating contact prediction as a pixel-level image labeling problem instead of an image-level classification problem; (2) simultaneously predicting all contacts of an individual protein to make effective use of contact occurrence patterns; and (3) integrating both one-dimensional and two-dimensional deep convolutional neural networks to effectively learn complex sequence-structure relationship including high-order residue correlation. This paper discusses the RaptorX-Contact pipeline, both contact prediction and contact-based folding results, and finally the strength and weakness of our method. © 2017 Wiley Periodicals, Inc.
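The top L/5 accuracy quoted above can be sketched as follows. This is an illustrative Python helper, not the CASP assessment code: the function name and input layout are hypothetical, and the sequence-separation cutoffs (commonly 12 to 23 residues for medium range and 24 or more for long range) are a convention assumed here rather than stated in the abstract.

```python
def top_l_over_k_accuracy(scores, contacts, length, min_sep, max_sep=None, k=5):
    """Fraction of the top length//k predicted pairs that are true contacts.

    scores   -- dict mapping (i, j) with i < j to a predicted probability
    contacts -- set of (i, j) pairs that are true contacts
    length   -- protein length L
    min_sep, max_sep -- inclusive bounds on sequence separation j - i
    """
    in_range = [
        (prob, pair) for pair, prob in scores.items()
        if pair[1] - pair[0] >= min_sep
        and (max_sep is None or pair[1] - pair[0] <= max_sep)
    ]
    # Highest predicted probability first.
    in_range.sort(key=lambda item: item[0], reverse=True)
    top = in_range[: max(1, length // k)]
    if not top:
        return 0.0
    hits = sum(1 for _, pair in top if pair in contacts)
    return hits / len(top)
```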
Draft Genome Sequence of Pseudomonas oceani DSM 100277T, a Deep-Sea Bacterium
2018-01-01
ABSTRACT Pseudomonas oceani DSM 100277T was isolated from deep seawater in the Okinawa Trough at 1390 m. P. oceani belongs to the Pseudomonas pertucinogena group. Here, we report the draft genome sequence of P. oceani, which has an estimated size of 4.1 Mb and exhibits 3,790 coding sequences, with a G+C content of 59.94 mol%. PMID:29650573
Natural product discovery: past, present, and future.
Katz, Leonard; Baltz, Richard H
2016-03-01
Microorganisms have provided abundant sources of natural products which have been developed as commercial products for human medicine, animal health, and plant crop protection. In the early years of natural product discovery from microorganisms (The Golden Age), new antibiotics were found with relative ease from low-throughput fermentation and whole cell screening methods. Later, molecular genetic and medicinal chemistry approaches were applied to modify and improve the activities of important chemical scaffolds, and more sophisticated screening methods were directed at target disease states. In the 1990s, the pharmaceutical industry moved to high-throughput screening of synthetic chemical libraries against many potential therapeutic targets, including new targets identified from the human genome sequencing project, largely to the exclusion of natural products, and discovery rates dropped dramatically. Nonetheless, natural products continued to provide key scaffolds for drug development. In the current millennium, it was discovered from genome sequencing that microbes with large genomes have the capacity to produce about ten times as many secondary metabolites as was previously recognized. Indeed, the most gifted actinomycetes have the capacity to produce around 30-50 secondary metabolites. With the precipitous drop in cost for genome sequencing, it is now feasible to sequence thousands of actinomycete genomes to identify the "biosynthetic dark matter" as sources for the discovery of new and novel secondary metabolites. Advances in bioinformatics, mass spectrometry, proteomics, transcriptomics, metabolomics and gene expression are driving the new field of microbial genome mining for applications in natural product discovery and development.
A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences.
Xue, Yun; Liao, Zhengling; Li, Meihang; Luo, Jie; Kuang, Qiuhua; Hu, Xiaohui; Li, Tiechen
2015-01-01
Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms that cannot reveal all OPSMs, since the problem is NP-complete. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to find all common subsequences (ACS) between every two row sequences, so that no deep OPSMs are missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to efficiently mine frequent sequential patterns. Finally, experiments were run on gene and synthetic datasets. The results demonstrated the effectiveness and efficiency of this method.
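The link between OPSMs and common subsequences can be made concrete with a small Python sketch. It assumes that each row is turned into a sequence of column indices sorted by increasing value (with distinct values per row), so that a column pattern is supported by a row exactly when it is a subsequence of that row's sequence. The helper names are hypothetical and this is not the paper's algorithm, only the support test it relies on.

```python
def row_sequence(row):
    """Column indices ordered by increasing value (values assumed distinct)."""
    return sorted(range(len(row)), key=lambda c: row[c])

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, not necessarily adjacent."""
    it = iter(sequence)
    # 'x in it' advances the iterator, so order is enforced.
    return all(x in it for x in pattern)

def supporting_rows(matrix, pattern):
    """Rows whose values increase along the given column pattern."""
    return [r for r, row in enumerate(matrix)
            if is_subsequence(pattern, row_sequence(row))]
```

A long pattern returned with only a few supporting rows would be a "deep" OPSM in the sense used above.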
Sahl, Jason W; Fairfield, Nathaniel; Harris, J Kirk; Wettergreen, David; Stone, William C; Spear, John R
2010-03-01
The deep phreatic thermal explorer (DEPTHX) is an autonomous underwater vehicle designed to navigate an unexplored environment, generate high-resolution three-dimensional (3-D) maps, collect biological samples based on an autonomous sampling decision, and return to its origin. In the spring of 2007, DEPTHX was deployed in Zacatón, a deep (approximately 318 m), limestone, phreatic sinkhole (cenote) in northeastern Mexico. As DEPTHX descended, it generated a 3-D map based on the processing of range data from 54 onboard sonars. The vehicle collected water column samples and wall biomat samples throughout the depth profile of the cenote. Post-expedition sample analysis via comparative analysis of 16S rRNA gene sequences revealed a wealth of microbial diversity. Traditional Sanger gene sequencing combined with a barcoded-amplicon pyrosequencing approach revealed novel, phylum-level lineages from the domains Bacteria and Archaea; in addition, several novel subphylum lineages were also identified. Overall, DEPTHX successfully navigated and mapped Zacatón, and collected biological samples based on an autonomous decision, which revealed novel microbial diversity in a previously unexplored environment.
Deep Motif Dashboard: Visualizing and Understanding Genomic Sequences Using Deep Neural Networks
Lanchantin, Jack; Singh, Ritambhara; Wang, Beilun; Qi, Yanjun
2018-01-01
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard) which provides a suite of visualization strategies to extract motifs, or sequence patterns from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method is finding a test sequence’s saliency map which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs as well as dependencies among them. PMID:27896980
Aftershock occurrence rate decay for individual sequences and catalogs
NASA Astrophysics Data System (ADS)
Nyffenegger, Paul A.
One of the earliest observations of the Earth's seismicity is that the rate of aftershock occurrence decays with time according to a power law commonly known as modified Omori-law (MOL) decay. However, the physical reasons for aftershock occurrence and the empirical decay in rate remain unclear despite numerous models that yield similar rate decay behavior. Key problems in relating the observed empirical relationship to the physical conditions of the mainshock and fault are the lack of studies including small magnitude mainshocks and the lack of uniformity between studies. We use simulated aftershock sequences to investigate the factors which influence the maximum likelihood (ML) estimate of the Omori-law p value, the parameter describing aftershock occurrence rate decay, for both individual aftershock sequences and "stacked" or superposed sequences. Generally the ML estimate of p is accurate, but since the ML estimated uncertainty is unaffected by whether the sequence resembles an MOL model, a goodness-of-fit test such as the Anderson-Darling statistic is necessary. While stacking aftershock sequences permits the study of entire catalogs and sequences with small aftershock populations, stacking introduces artifacts. The p value for stacked sequences is approximately equal to the mean of the individual sequence p values. We apply single-link cluster analysis to identify all aftershock sequences from eleven regional seismicity catalogs. We observe two new mathematically predictable empirical relationships for the distribution of aftershock sequence populations. The average properties of aftershock sequences are not correlated with tectonic environment, but aftershock populations and p values do show a depth dependence. The p values show great variability with time, and large values or changes in p sometimes precede major earthquakes.
Studies of teleseismic earthquake catalogs over the last twenty years have led seismologists to question seismicity models and aftershock sequence decay for deep sequences. For seven exceptional deep sequences, we conclude that MOL decay adequately describes these sequences, and little difference exists compared to shallow sequences. However, they do include larger aftershock populations compared to most deep sequences. These results imply that p values for deep sequences are larger than those for intermediate depth sequences.
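For readers unfamiliar with modified Omori-law decay, the following Python sketch writes out the usual textbook form of the rate, n(t) = K / (t + c)^p, together with its integral and the point-process log-likelihood that an ML estimate of p would maximize. The parameter names K, c and p follow that common convention; the function names are hypothetical, and nothing here reproduces the estimation procedure or the simulations used in the thesis.

```python
import math

def omori_rate(t, K, c, p):
    """Aftershock rate at time t after the mainshock: K / (t + c)**p."""
    return K / (t + c) ** p

def omori_expected_count(t_start, t_end, K, c, p):
    """Integral of the rate over [t_start, t_end]."""
    if abs(p - 1.0) < 1e-12:
        # The power-law antiderivative degenerates to a logarithm at p = 1.
        return K * math.log((t_end + c) / (t_start + c))
    return K * ((t_end + c) ** (1 - p) - (t_start + c) ** (1 - p)) / (1 - p)

def omori_log_likelihood(times, t_start, t_end, K, c, p):
    """Log-likelihood of event times under a Poisson process with MOL rate."""
    log_rates = sum(math.log(omori_rate(t, K, c, p)) for t in times)
    return log_rates - omori_expected_count(t_start, t_end, K, c, p)
```

Maximizing `omori_log_likelihood` over K, c and p (for example with a numerical optimizer) would give an ML estimate of p; as the abstract notes, a separate goodness-of-fit test is still needed to judge whether the sequence resembles an MOL model at all.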
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields.
Wang, Sheng; Peng, Jian; Ma, Jianzhu; Xu, Jinbo
2016-01-11
Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.
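The Q3 figure quoted above is a per-residue accuracy over three states. A minimal Python sketch follows; the 8-to-3 state reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is one commonly used convention and is an assumption here, as are the function names.

```python
# One common DSSP 8-state to 3-state reduction (an assumed convention).
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def reduce_to_three_states(ss8):
    """Map an 8-state string to helix (H), strand (E) or coil (C)."""
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in ss8)

def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted 3-state label matches."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("empty sequence")
    correct = sum(1 for a, b in zip(predicted, observed) if a == b)
    return correct / len(observed)
```

The same `q3_accuracy` function applied to unreduced 8-state strings gives a Q8-style accuracy.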
Deep learning methods for protein torsion angle prediction.
Li, Haiou; Hou, Jie; Adhikari, Badri; Lyu, Qiang; Cheng, Jianlin
2017-09-18
Deep learning is one of the most powerful machine learning methods and has achieved state-of-the-art performance in many domains. Since deep learning was introduced to the field of bioinformatics in 2012, it has achieved success in a number of areas such as protein residue-residue contact prediction, secondary structure prediction, and fold recognition. In this work, we developed deep learning methods to improve the prediction of torsion (dihedral) angles of proteins. We design four different deep learning architectures to predict protein torsion angles. The architectures are a deep neural network (DNN), a deep restricted Boltzmann machine (DRBM), a deep recurrent neural network (DRNN), and a deep recurrent restricted Boltzmann machine (DReRBM), since protein torsion angle prediction is a sequence-related problem. In addition to existing protein features, two new features (predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments) are used as input to each of the four deep learning architectures to predict phi and psi angles of protein backbone. The mean absolute error (MAE) of phi and psi angles predicted by DRNN, DReRBM, DRBM and DNN is about 20-21° and 29-30° on an independent dataset. The MAE of phi angle is comparable to the existing methods, but the MAE of psi angle is 29°, 2° lower than the existing methods. On the latest CASP12 targets, our methods also achieved performance better than or comparable to a state-of-the-art method. Our experiment demonstrates that deep learning is a valuable method for predicting protein torsion angles. The deep recurrent network architecture performs slightly better than the deep feed-forward architecture, and the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments are useful features for improving prediction accuracy.
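Because torsion angles are periodic, an MAE in degrees is usually computed on the wrapped difference, so that 179° and -179° count as 2° apart rather than 358°. The Python sketch below shows that computation; whether the paper wraps errors in exactly this way is an assumption, and the function names are hypothetical.

```python
def angular_error_deg(predicted, observed):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(predicted - observed) % 360.0
    return min(diff, 360.0 - diff)

def angular_mae_deg(predicted, observed):
    """Mean absolute error over paired angle lists, with wrap-around."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    errors = [angular_error_deg(p, o) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors)
```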
Estorninho, Megan; Gibson, Vivienne B; Kronenberg-Versteeg, Deborah; Liu, Yuk-Fun; Ni, Chester; Cerosaletti, Karen; Peakman, Mark
2013-12-01
Extensive diversity in the human repertoire of TCRs for Ag is both a cornerstone of effective adaptive immunity that enables host protection against a multiplicity of pathogens and a weakness that gives rise to potential pathological self-reactivity. The complexity arising from diversity makes detection and tracking of single Ag-specific CD4 T cells (ASTs) involved in these immune responses challenging. We report a tandem, multistep process to quantify rare TCRβ-chain variable sequences of ASTs in large polyclonal populations. The approach combines deep high-throughput sequencing (HTS) within functional CD4 T cell compartments, such as naive/memory cells, with shallow, multiple identifier-based HTS of ASTs identified by activation marker upregulation after short-term Ag stimulation in vitro. We find that clonotypes recognizing HLA class II-restricted epitopes of both pathogen-derived Ags and self-Ags are oligoclonal and typically private. Clonotype tracking within an individual reveals private AST clonotypes resident in the memory population, as would be expected, representing clonal expansions (identical nucleotide sequence; "ultraprivate"). Other AST clonotypes share CDR3β amino acid sequences through convergent recombination and are found in memory populations of multiple individuals. Tandem HTS-based clonotyping will facilitate studying AST dynamics, epitope spreading, and repertoire changes that arise postvaccination and following Ag-specific immunotherapies for cancer and autoimmune disease.
Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas
2007-02-01
The number of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and thus the extraction of the core information becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications, which can limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated Arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real world text mining application, offering high-precision relation extraction, while at the same time retaining a sufficient recall.
Directional genomic hybridization for chromosomal inversion discovery and detection.
Ray, F Andrew; Zimmerman, Erin; Robinson, Bruce; Cornforth, Michael N; Bedford, Joel S; Goodwin, Edwin H; Bailey, Susan M
2013-04-01
Chromosomal rearrangements are a source of structural variation within the genome that figure prominently in human disease, where the importance of translocations and deletions is well recognized. In principle, inversions (reversals in the orientation of DNA sequences within a chromosome) should have similar detrimental potential. However, the study of inversions has been hampered by traditional approaches used for their detection, which are not particularly robust. Even with significant advances in whole genome approaches, changes in the absolute orientation of DNA remain difficult to detect routinely. Consequently, our understanding of inversions is still surprisingly limited, as is our appreciation for their frequency and involvement in human disease. Here, we introduce the directional genomic hybridization methodology of chromatid painting (a whole new way of looking at structural features of the genome) that can be employed with high resolution on a cell-by-cell basis, and demonstrate its basic capabilities for genome-wide discovery and targeted detection of inversions. Bioinformatics enabled development of sequence- and strand-specific directional probe sets, which when coupled with single-stranded hybridization, greatly improved the resolution and ease of inversion detection. We highlight examples of the far-ranging applicability of this cytogenomics-based approach, which include confirmation of the alignment of the human genome database and evidence that individuals themselves share similar sequence directionality, as well as use in comparative and evolutionary studies for any species whose genome has been sequenced. In addition to applications related to basic mechanistic studies, the information obtainable with strand-specific hybridization strategies may ultimately enable novel gene discovery, thereby benefitting the diagnosis and treatment of a variety of human disease states and disorders including cancer, autism, and idiopathic infertility.
Landslide oil field, San Joaquin Valley, California
DOE Office of Scientific and Technical Information (OSTI.GOV)
Collins, B.P.; March, K.A.; Caballero, J.S.
1988-03-01
The Landslide field, located at the southern margin of the San Joaquin basin, was discovered in 1985 by a partnership headed by Channel Exploration Company, on a farm out from Tenneco Oil Company. Initial production from the Tenneco San Emidio 63X-30 was 2064 BOPD, making Landslide one of the largest onshore discoveries in California during the past decade. Current production is 7100 BOPD from a sandstone reservoir at 12,500 ft. Fifteen wells have been drilled in the field, six of which are water injectors. Production from the Landslide field occurs from a series of upper Miocene Stevens turbidite sandstones that lie obliquely across an east-plunging structural nose. These turbidite sandstones were deposited as channel-fill sequences within a narrowly bounded leveed channel complex. Both the Landslide field and the larger Yowlumne field, located 3 mi to the northwest, comprise a single channel-fan depositional system that developed in the restricted deep-water portion of the San Joaquin basin. Information from the open-hole logs, three-dimensional surveys, vertical seismic profiles, repeat formation tester data, cores, and pressure buildup tests allowed continuous drilling from the initial discovery to the final waterflood injector, without a single dry hole. In addition, the successful application of three-dimensional seismic data in the Landslide development program has helped correctly image channel-fan anomalies in the southern Maricopa basin, where data quality and severe velocity problems have hampered previous efforts. New exploration targets are currently being evaluated on the acreage surrounding the Landslide discovery and should lead to an interesting new round of drilling activity in the Maricopa basin.
Gerlt, John A
2017-08-22
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery of novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.