sequence variations predicting: Topics by Science.gov

Sample records for sequence variations predicting

BayesPI-BAR: a new biophysical model for characterization of regulatory sequence variations

PubMed Central

Wang, Junbai; Batmanov, Kirill

2015-01-01

Sequence variations in regulatory DNA regions are known to cause functionally important consequences for gene expression. DNA sequence variations may have an essential role in determining phenotypes and may be linked to disease; however, their identification through analysis of massive genome-wide sequencing data is a great challenge. In this work, a new computational pipeline, a Bayesian method for protein–DNA interaction with binding affinity ranking (BayesPI-BAR), is proposed for quantifying the effect of sequence variations on protein binding. BayesPI-BAR uses biophysical modeling of protein–DNA interactions to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). The method includes two new parameters (TF chemical potentials or protein concentrations and direct TF binding targets) that are neglected by previous methods. The new method is verified on 67 known human regulatory SNPs, of which 47 (70%) have predicted true TFs ranked in the top 10. Importantly, the performance of BayesPI-BAR, which uses principal component analysis to integrate multiple predictions from various TF chemical potentials, is found to be better than that of existing programs, such as sTRAP and is-rSNP, when evaluated on the same SNPs. BayesPI-BAR is a publicly available tool and is able to carry out parallelized computation, which helps to investigate a large number of TFs or SNPs and to detect disease-associated regulatory sequence variations in the sea of genome-wide noncoding regions. PMID:26202972
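As a rough illustration of how a biophysical model can turn a sequence change into a change in predicted TF binding, the sketch below scores a reference and a variant sequence with a position-specific energy matrix and a chemical potential. It is not the BayesPI-BAR implementation: the energy matrix, the Fermi-Dirac style occupancy formula, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical energy matrix for a 4-bp motif (arbitrary units, lower = tighter binding).
# The consensus site under this toy matrix is "ACGT".
ENERGY = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},
    {"A": 2.0, "C": 0.0, "G": 2.0, "T": 2.0},
    {"A": 2.0, "C": 2.0, "G": 0.0, "T": 2.0},
    {"A": 2.0, "C": 2.0, "G": 2.0, "T": 0.0},
]

def binding_energy(site):
    """Sum the per-position energies for one motif-length site."""
    return sum(ENERGY[i][base] for i, base in enumerate(site))

def occupancy(site, mu):
    """Assumed occupancy model: 1 / (1 + exp(E - mu)), with mu the TF chemical potential."""
    return 1.0 / (1.0 + math.exp(binding_energy(site) - mu))

def best_occupancy(seq, mu):
    """Highest occupancy over all motif-length windows of the sequence."""
    width = len(ENERGY)
    return max(occupancy(seq[i:i + width], mu) for i in range(len(seq) - width + 1))

def snp_effect(ref_seq, alt_seq, mu):
    """Change in best occupancy caused by the variant (negative = weaker binding)."""
    return best_occupancy(alt_seq, mu) - best_occupancy(ref_seq, mu)

# Toy example: a G>C change inside the consensus site weakens predicted binding.
delta = snp_effect("TTACGTTT", "TTACCTTT", mu=0.0)
```

Repeating the calculation over a range of `mu` values gives one prediction per assumed TF concentration, which is the kind of multi-prediction set the abstract describes integrating with principal component analysis.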
In Silico Detection of Sequence Variations Modifying Transcriptional Regulation

PubMed Central

Andersen, Malin C; Engström, Pär G; Lithwick, Stuart; Arenillas, David; Eriksson, Per; Lenhard, Boris; Wasserman, Wyeth W; Odeberg, Jacob

2008-01-01

Identification of functional genetic variation associated with increased susceptibility to complex diseases can elucidate genes and underlying biochemical mechanisms linked to disease onset and progression. For genes linked to genetic diseases, most identified causal mutations alter an encoded protein sequence. Technological advances for measuring RNA abundance suggest that a significant number of undiscovered causal mutations may alter the regulation of gene transcription. However, it remains a challenge to separate causal genetic variations from linked neutral variations. Here we present an in silico driven approach to identify possible genetic variation in regulatory sequences. The approach combines phylogenetic footprinting and transcription factor binding site prediction to identify variation in candidate cis-regulatory elements. The bioinformatics approach has been tested on a set of SNPs that are reported to have a regulatory function, as well as background SNPs. In the absence of additional information about an analyzed gene, the poor specificity of binding site prediction is prohibitive to its application. However, when additional data is available that can give guidance on which transcription factor is involved in the regulation of the gene, the in silico binding site prediction improves the selection of candidate regulatory polymorphisms for further analyses. The bioinformatics software generated for the analysis has been implemented as a Web-based application system entitled RAVEN (regulatory analysis of variation in enhancers). The RAVEN system is available at http://www.cisreg.ca for all researchers interested in the detection and characterization of regulatory sequence variation. PMID:18208319
Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq.

PubMed

Barrick, Jeffrey E; Colburn, Geoffrey; Deatherage, Daniel E; Traverse, Charles C; Strand, Matthew D; Borges, Jordan J; Knoester, David B; Reba, Aaron; Meyer, Austin G

2014-11-29

Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events. We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. 
In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold). Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.
Influence of Molecular Resolution on Sequence-Based Discovery of Ecological Diversity among Synechococcus Populations in an Alkaline Siliceous Hot Spring Microbial Mat

PubMed Central

Melendrez, Melanie C.; Lange, Rachel K.; Cohan, Frederick M.; Ward, David M.

2011-01-01

Previous research has shown that sequences of 16S rRNA genes and 16S-23S rRNA internal transcribed spacer regions may not have enough genetic resolution to define all ecologically distinct Synechococcus populations (ecotypes) inhabiting alkaline, siliceous hot spring microbial mats. To achieve higher molecular resolution, we studied sequence variation in three protein-encoding loci sampled by PCR from 60°C and 65°C sites in the Mushroom Spring mat (Yellowstone National Park, WY). Sequences were analyzed using the ecotype simulation (ES) and AdaptML algorithms to identify putative ecotypes. Between 4 and 14 times more putative ecotypes were predicted from variation in protein-encoding locus sequences than from variation in 16S rRNA and 16S-23S rRNA internal transcribed spacer sequences. The number of putative ecotypes predicted depended on the number of sequences sampled and the molecular resolution of the locus. Chao estimates of diversity indicated that few rare ecotypes were missed. Many ecotypes hypothesized by sequence analyses were different in their habitat specificities, suggesting different adaptations to temperature or other parameters that vary along the flow channel. PMID:21169433
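To make the idea of a Chao diversity estimate concrete, the sketch below computes the Chao1 richness estimator from a list of ecotype assignments. The abstract does not say which Chao estimator was used, so the choice of Chao1 (with its bias-corrected form when no doubletons are observed) and the function name are assumptions.

```python
from collections import Counter

def chao1(ecotype_labels):
    """Chao1 richness estimate from one ecotype label per sampled sequence.

    S_chao1 = S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of
    ecotypes seen exactly once and exactly twice. When F2 is zero, the
    bias-corrected form S_obs + F1 * (F1 - 1) / 2 is used instead.
    """
    counts = Counter(ecotype_labels)
    s_obs = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)

# Hypothetical sample: three ecotypes, one singleton and one doubleton.
estimate = chao1(["a", "a", "a", "b", "b", "c"])
```

An estimate close to the observed number of ecotypes is what supports a statement that few rare ecotypes were missed.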
Full genome sequence of Rocio virus reveal substantial variations from the prototype Rocio virus SPH 34675 sequence.

PubMed

Setoh, Yin Xiang; Amarilla, Alberto A; Peng, Nias Y; Slonchak, Andrii; Periasamy, Parthiban; Figueiredo, Luiz T M; Aquino, Victor H; Khromykh, Alexander A

2018-01-01

Rocio virus (ROCV) is an arbovirus belonging to the genus Flavivirus, family Flaviviridae. We present an updated sequence of ROCV strain SPH 34675 (GenBank: AY632542.4), the only available full genome sequence prior to this study. Using next-generation sequencing of the entire genome, we reveal substantial sequence variation from the prototype sequence, with 30 nucleotide differences amounting to 14 amino acid changes, as well as significant changes to predicted 3'UTR RNA structures. Our results present an updated and corrected sequence of a potential emerging human-virulent flavivirus uniquely indigenous to Brazil (GenBank: MF461639).
Polymorphism in the Eruption Sequence of Primary Dentition: A Cross-sectional Study

PubMed Central

Bhojraj, Nandlal; Narayanappa

2017-01-01

Introduction Primary teeth show wide variation in eruption time among different populations. Population-specific eruption ages are reported as means with standard deviations or as median ages with percentile ranges. These alone are insufficient for predicting tooth eruption sequence because they provide no information on the frequency of sequence variation within pairs of teeth. Norms of polymorphic variation in the eruption sequence can be more useful. Aim This study aims to provide norms for sequence polymorphism in primary teeth among children of the Mysore population. Materials and Methods A cross-sectional study was designed with 1392 children, recruited from December 2015 to June 2016 by simple random sampling. Each tooth was recorded as present or absent. Across every possible intra-quadrant tooth pair, cases of present-present, absent-absent, present-absent and absent-present were counted and computed as percentages. Results Sequence polymorphisms were more common in 82-84 pairs of teeth. Significant polymorphic reverse sequence was observed in 52-54 (9%) and 82-84 (35%) in males and in 82-84 (18%) in females. There was no polymorphism in the maxillary arch in females. Conclusion The present study provides baseline data for sequence variation in primary teeth eruption. To the best of the investigators' knowledge, there are no previous studies describing sequence polymorphism in primary teeth in the Indian population. The results of this study help in the assessment of eruption sequence problems in paediatric dentistry and in the evaluation and prediction of tooth eruption sequence in an individual child. PMID:28658912
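The pair-wise tally described in the methods can be sketched as follows. The record layout (one dictionary per child mapping a tooth number to an erupted flag) and the function name are assumptions made for illustration; the data shown are invented and are not from the study.

```python
from collections import Counter

STATES = ["present-present", "absent-absent", "present-absent", "absent-present"]

def pair_state_percentages(records, tooth_a, tooth_b):
    """Percentage of children in each of the four states for one tooth pair.

    records: list of dicts mapping tooth number -> True if erupted (assumed layout).
    A tooth missing from a record is treated as absent.
    """
    if not records:
        raise ValueError("at least one record is required")
    counts = Counter()
    for erupted in records:
        a = "present" if erupted.get(tooth_a, False) else "absent"
        b = "present" if erupted.get(tooth_b, False) else "absent"
        counts[f"{a}-{b}"] += 1
    return {state: 100.0 * counts[state] / len(records) for state in STATES}

# Invented example with four children, one in each state for the 82-84 pair.
children = [
    {82: True, 84: True},
    {82: True, 84: False},
    {82: False, 84: True},
    {82: False, 84: False},
]
percentages = pair_state_percentages(children, 82, 84)
```

Under the assumption that tooth 82 normally erupts before tooth 84, the absent-present percentage is the share of children showing the reverse sequence for that pair.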
BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers.

PubMed

Abo, Ryan P; Ducar, Matthew; Garcia, Elizabeth P; Thorner, Aaron R; Rojas-Rudilla, Vanesa; Lin, Ling; Sholl, Lynette M; Hahn, William C; Meyerson, Matthew; Lindeman, Neal I; Van Hummelen, Paul; MacConaill, Laura E

2015-02-18

Genomic structural variation (SV), a common hallmark of cancer, has important predictive and therapeutic implications. However, accurately detecting SV using high-throughput sequencing data remains challenging, especially for 'targeted' resequencing efforts. This is critically important in the clinical setting where targeted resequencing is frequently being applied to rapidly assess clinically actionable mutations in tumor biopsies in a cost-effective manner. We present BreaKmer, a novel approach that uses a 'kmer' strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants. Relative to four publicly available algorithms, BreaKmer detected SV with increased sensitivity and limited calls in non-tumor samples, key features for variant analysis of tumor specimens in both the clinical and research settings. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
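One simple way to see why k-mers help locate breakpoints is to collect the k-mers of a read that never occur in the reference: reads spanning a rearrangement produce such novel k-mers around the junction. The sketch below shows only this basic idea; it is a simplification and not BreaKmer's actual algorithm, and the function names and toy sequences are assumptions.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def novel_kmers(read, reference, k):
    """k-mers present in the read but absent from the reference sequence."""
    return kmers(read, k) - kmers(reference, k)

# Toy example: the read carries a deletion of "GGG" relative to the reference,
# so the k-mers that span the new junction do not occur in the reference.
reference = "AAACCCGGGTTT"
junction_read = "AAACCCTTT"
normal_read = "ACCCGGG"

junction_specific = novel_kmers(junction_read, reference, 4)
unchanged = novel_kmers(normal_read, reference, 4)
```

Reads sharing junction-specific k-mers could then be grouped and assembled into a consensus for realignment, which corresponds to the assembly step the abstract describes.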
Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance.

PubMed

Bashir, Ali; Bansal, Vikas; Bafna, Vineet

2010-06-18

Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with their unique characteristics, pose a number of design challenges, regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimating mRNA transcript abundance. For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 Million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability. Together, our results form a generic framework for many design considerations related to high-throughput sequencing. 
We provide software tools (http://bix.ucsd.edu/projects/NGS-DesignTools) to derive platform-independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next-generation sequencing.
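The depth question for rare transcripts can be illustrated with a deliberately simple model: if each read independently originates from a given transcript with probability p, the transcript is seen at least once in N reads with probability 1 - (1 - p)^N, so N must be at least ln(1 - P) / ln(1 - p) to reach detection probability P. This independence model and the function name are assumptions; the corrections derived in the paper are not reproduced here.

```python
import math

def reads_needed(p, detect_prob=0.95):
    """Smallest N with 1 - (1 - p)**N >= detect_prob under the independent-read model.

    p: assumed fraction of reads that come from the transcript of interest.
    """
    if not (0.0 < p < 1.0) or not (0.0 < detect_prob < 1.0):
        raise ValueError("p and detect_prob must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - detect_prob) / math.log(1.0 - p))

# A transcript contributing one read per million needs roughly three million reads
# for a 95% chance of being observed at least once under this model.
depth = reads_needed(1e-6)
```

The same function can be inverted in practice: given a fixed sequencing budget, it indicates the lowest expression fraction that can still be detected with the chosen probability.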
WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation

PubMed Central

2013-01-01

Background SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and for a given protein, its input comprises: the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probabilities to be associated to human diseases. Results The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO3d programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. Conclusions WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go. PMID:23819482
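The accuracy and correlation figures quoted above are standard summaries of a binary confusion matrix. The sketch below computes overall accuracy and the Matthews correlation coefficient from the four cell counts; treating the reported "correlation coefficient" as the Matthews coefficient is an assumption, and the counts used are invented for illustration.

```python
import math

def accuracy_and_mcc(tp, tn, fp, fn):
    """Overall accuracy and Matthews correlation coefficient for a binary classifier.

    tp/tn/fp/fn: true positives, true negatives, false positives, false negatives.
    MCC is reported as 0.0 when any marginal total is zero (a common convention).
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Invented balanced example: 50 disease-related and 50 neutral variants.
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
```

On a balanced dataset such as the one described, accuracy and the correlation coefficient move together, which is why both are usually reported alongside the threshold-free AUC.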
Genetic Variation in Cardiomyopathy and Cardiovascular Disorders.

PubMed

McNally, Elizabeth M; Puckelwartz, Megan J

2015-01-01

With the wider deployment of massively-parallel, next-generation sequencing, it is now possible to survey human genome data for research and clinical purposes. The reduced cost of producing short-read sequencing has now shifted the burden to data analysis. Analysis of genome sequencing remains challenged by the complexity of the human genome, including redundancy and the repetitive nature of genome elements and the large amount of variation in individual genomes. Public databases of human genome sequences greatly facilitate interpretation of common and rare genetic variation, although linking database sequence information to detailed clinical information is limited by privacy and practical issues. Genetic variation is a rich source of knowledge for cardiovascular disease because many, if not all, cardiovascular disorders are highly heritable. The role of rare genetic variation in predicting risk and complications of cardiovascular diseases has been well established for hypertrophic and dilated cardiomyopathy, where the number of genes that are linked to these disorders is growing. Bolstered by family data, where genetic variants segregate with disease, rare variation can be linked to specific genetic variation that offers profound diagnostic information. Understanding genetic variation in cardiomyopathy is likely to help stratify forms of heart failure and guide therapy. Ultimately, genetic variation may be amenable to gene correction and gene editing strategies.
Identification and characterization of transcript polymorphisms in soybean lines varying in oil composition and content.

PubMed

Goettel, Wolfgang; Xia, Eric; Upchurch, Robert; Wang, Ming-Li; Chen, Pengyin; An, Yong-Qiang Charles

2014-04-23

Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality.
Predicted stem-loop structures and variation in nucleotide sequence of 3' noncoding regions among animal calicivirus genomes.

PubMed

Seal, B S; Neill, J D; Ridpath, J F

1994-07-01

Caliciviruses are nonenveloped with a polyadenylated genome of approximately 7.6 kb and a single capsid protein. The "RNA Fold" computer program was used to analyze 3'-terminal noncoding sequences of five feline calicivirus (FCV), rabbit hemorrhagic disease virus (RHDV), and two San Miguel sea lion virus (SMSV) isolates. The FCV 3'-terminal sequences are 40-46 nucleotides in length and 72-91% similar. The FCV sequences were predicted to contain two possible duplex structures and one stem-loop structure with free energies of -2.1 to -18.2 kcal/mole. The RHDV genomic 3'-terminal RNA sequences are 54 nucleotides in length and share 49% sequence similarity to homologous regions of the FCV genome. The RHDV sequence was predicted to form two duplex structures in the 3'-terminal noncoding region with a single stem-loop structure, resembling that of FCV. In contrast, the SMSV 1 and 4 genomic 3'-terminal noncoding sequences were 185 and 182 nucleotides in length, respectively. Ten possible duplex structures were predicted with an average structural free energy of -35 kcal/mole. Sequence similarity between the two SMSV isolates was 75%. Furthermore, extensive cloverleaflike structures are predicted in the 3' noncoding region of the SMSV genome, in contrast to the predicted single stem-loop structures of FCV or RHDV.
Identification and characterization of transcript polymorphisms in soybean lines varying in oil composition and content

PubMed Central

2014-01-01

Background Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. Results In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. Conclusions As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality. PMID:24755115
Single nucleotide variations: Biological impact and theoretical interpretation

PubMed Central

Katsonis, Panagiotis; Koire, Amanda; Wilson, Stephen Joseph; Hsu, Teng-Kuei; Lua, Rhonald C; Wilkins, Angela Dawn; Lichtarge, Olivier

2014-01-01

Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics. PMID:25234433
The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity

PubMed Central

Wang, Quanli; Halvorsen, Matt; Han, Yujun; Weir, William H.; Allen, Andrew S.; Goldstein, David B.

2015-01-01

Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene’s proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene’s regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen’s Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. 
In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, ncCADD and ncGWAVA, and find both scores are significantly predictive of human dosage sensitive genes and appear to carry information beyond conservation, as assessed by ncGERP. These results highlight that the intolerance of noncoding sequence stretches in the human genome can provide a critical complementary tool to other genome annotation approaches to help identify the parts of the human genome increasingly likely to harbor mutations that influence risk of disease. PMID:26332131
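The comparison of observed and predicted variation behind an intolerance score can be sketched as a regression residual: regress the number of common variants in each gene's regulatory sequence on the total number of variants, and treat genes with fewer common variants than predicted (negative residuals) as more intolerant. The sketch uses raw ordinary least squares residuals on invented counts; the exact regression variables and any residual standardization used for ncRVIS are not given in the abstract, so these are assumptions.

```python
def residual_scores(total_variants, common_variants):
    """Residuals from a least-squares fit of common_variants on total_variants.

    Negative values mean fewer common variants than the fit predicts,
    which this sketch interprets as greater intolerance to variation.
    """
    n = len(total_variants)
    if n < 2 or n != len(common_variants):
        raise ValueError("need two equally sized lists with at least two genes")
    mean_x = sum(total_variants) / n
    mean_y = sum(common_variants) / n
    sxx = sum((x - mean_x) ** 2 for x in total_variants)
    if sxx == 0:
        raise ValueError("total_variants must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(total_variants, common_variants))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(total_variants, common_variants)]

# Invented counts for three genes; the second gene has fewer common variants than predicted.
scores = residual_scores([10, 20, 30], [2, 2, 5])
```

Ranking genes by such a score and comparing the ranks against curated gene lists is one way to test whether the score is predictive of dosage sensitivity.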
3D RNA and functional interactions from evolutionary couplings

PubMed Central

Weinreb, Caleb; Riesselman, Adam; Ingraham, John B.; Gross, Torsten; Sander, Chris; Marks, Debora S.

2016-01-01

Summary Non-coding RNAs are ubiquitous, but the discovery of new RNA gene sequences far outpaces research on their structure and functional interactions. We mine the evolutionary sequence record to derive precise information about function and structure of RNAs and RNA-protein complexes. As in protein structure prediction, we use maximum entropy global probability models of sequence co-variation to infer evolutionarily constrained nucleotide-nucleotide interactions within RNA molecules, and nucleotide-amino acid interactions in RNA-protein complexes. The predicted contacts allow all-atom blinded 3D structure prediction at good accuracy for several known RNA structures and RNA-protein complexes. For unknown structures, we predict contacts in 160 non-coding RNA families. Beyond 3D structure prediction, evolutionary couplings help identify important functional interactions, e.g., at switch points in riboswitches and at a complex nucleation site in HIV. Aided by accelerating sequence accumulation, evolutionary coupling analysis can accelerate the discovery of functional interactions and 3D structures involving RNA. PMID:27087444
The diploid genome sequence of an Asian individual

PubMed Central

Wang, Jun; Wang, Wei; Li, Ruiqiang; Li, Yingrui; Tian, Geng; Goodman, Laurie; Fan, Wei; Zhang, Junqing; Li, Jun; Zhang, Juanbin; Guo, Yiran; Feng, Binxiao; Li, Heng; Lu, Yao; Fang, Xiaodong; Liang, Huiqing; Du, Zhenglin; Li, Dong; Zhao, Yiqing; Hu, Yujie; Yang, Zhenzhen; Zheng, Hancheng; Hellmann, Ines; Inouye, Michael; Pool, John; Yi, Xin; Zhao, Jing; Duan, Jinjie; Zhou, Yan; Qin, Junjie; Ma, Lijia; Li, Guoqing; Yang, Zhentao; Zhang, Guojie; Yang, Bin; Yu, Chang; Liang, Fang; Li, Wenjie; Li, Shaochuan; Li, Dawei; Ni, Peixiang; Ruan, Jue; Li, Qibin; Zhu, Hongmei; Liu, Dongyuan; Lu, Zhike; Li, Ning; Guo, Guangwu; Zhang, Jianguo; Ye, Jia; Fang, Lin; Hao, Qin; Chen, Quan; Liang, Yu; Su, Yeyang; san, A.; Ping, Cuo; Yang, Shuang; Chen, Fang; Li, Li; Zhou, Ke; Zheng, Hongkun; Ren, Yuanyuan; Yang, Ling; Gao, Yang; Yang, Guohua; Li, Zhuo; Feng, Xiaoli; Kristiansen, Karsten; Wong, Gane Ka-Shu; Nielsen, Rasmus; Durbin, Richard; Bolund, Lars; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian

2009-01-01

Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics. PMID:18987735
The predicted secondary structures of class I fructose-bisphosphate aldolases.

PubMed Central

Sawyer, L; Fothergill-Gilmore, L A; Freemont, P S

1988-01-01

The results of several secondary-structure prediction programs were combined to produce an estimate of the regions of alpha-helix, beta-sheet and reverse turns for fructose-bisphosphate aldolases from human and rat muscle and liver, from Trypanosoma brucei and from Drosophila melanogaster. All the aldolase sequences gave essentially the same pattern of secondary-structure predictions despite having sequences up to 50% different. One exception to this pattern was an additional strongly predicted helix in the rat liver and Drosophila enzymes. Regions of relatively high sequence variation generally were predicted as reverse turns, and probably occur as surface loops. Most of the positions corresponding to exon boundaries are located between regions predicted to have secondary-structural elements consistent with a compact structure. The predominantly alternating alpha/beta structure predicted is consistent with the alpha/beta-barrel structure indicated by preliminary high-resolution X-ray diffraction studies on rabbit muscle aldolase [Sygusch, Beaudry & Allaire (1986) Biophys. J. 49, 287a]. PMID:3128269
Faster-X evolution of gene expression is driven by recessive adaptive cis-regulatory variation in Drosophila.

PubMed

Llopart, Ana

2018-05-01

The hemizygosity of the X (Z) chromosome fully exposes the fitness effects of mutations on that chromosome and has evolutionary consequences on the relative rates of evolution of X and autosomes. Specifically, several population genetics models predict increased rates of evolution in X-linked loci relative to autosomal loci. This prediction of faster-X evolution has been evaluated and confirmed for both protein coding sequences and gene expression. In the case of faster-X evolution for gene expression divergence, it is often assumed that variation in 5' noncoding sequences is associated with variation in transcript abundance between species but a formal, genomewide test of this hypothesis is still missing. Here, I use whole genome sequence data in Drosophila yakuba and D. santomea to evaluate this hypothesis and report positive correlations between sequence divergence at 5' noncoding sequences and gene expression divergence. I also examine polymorphism and divergence in 9,279 noncoding sequences located at the 5' end of annotated genes and detected multiple signals of positive selection. Notably, I used the traditional synonymous sites as neutral reference to test for adaptive evolution, but I also used bases 8-30 of introns <65 bp, which have been proposed to be a better neutral choice. X-linked genes with high degree of male-biased expression show the most extreme adaptive pattern at 5' noncoding regions, in agreement with faster-X evolution for gene expression divergence and a higher incidence of positively selected recessive mutations. © 2018 The Authors. Molecular Ecology Published by John Wiley & Sons Ltd.
UET: a database of evolutionarily-predicted functional determinants of protein sequences that cluster as functional sites in protein structures.

PubMed

Lua, Rhonald C; Wilson, Stephen J; Konecki, Daniel M; Wilkins, Angela D; Venner, Eric; Morgan, Daniel H; Lichtarge, Olivier

2016-01-04

The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Historical feature pattern extraction based network attack situation sensing algorithm.

PubMed

Zeng, Yong; Liu, Dacheng; Lei, Zhou

2014-01-01

The situation sequence contains a series of complicated, multivariate random trends that are sudden and uncertain, and whose underlying principles are difficult to recognize and describe with traditional algorithms. Estimating the parameters of a very long situation sequence is essential to addressing these problems but is very difficult, so this paper proposes a situation prediction method based on historical feature pattern extraction (HFPE). First, the HFPE algorithm seeks similar indications in the recorded history of the situation sequence and weighs the link intensity between each indication that occurred and its subsequent effect. It then calculates the probability that a certain effect reappears given the current indication and makes a prediction after weighting. The HFPE method also provides an evolution algorithm that derives the prediction deviation in terms of pattern and accuracy; this algorithm continuously improves the adaptability of HFPE through gradual fine-tuning. The method preserves the rules in the sequence as far as possible, needs no data preprocessing, and can continuously track and adapt to variation in the situation sequence.
Historical Feature Pattern Extraction Based Network Attack Situation Sensing Algorithm

PubMed Central

Zeng, Yong; Liu, Dacheng; Lei, Zhou

2014-01-01

The situation sequence contains a series of complicated, multivariate random trends that are sudden and uncertain, and whose underlying principles are difficult to recognize and describe with traditional algorithms. Estimating the parameters of a very long situation sequence is essential to addressing these problems but is very difficult, so this paper proposes a situation prediction method based on historical feature pattern extraction (HFPE). First, the HFPE algorithm seeks similar indications in the recorded history of the situation sequence and weighs the link intensity between each indication that occurred and its subsequent effect. It then calculates the probability that a certain effect reappears given the current indication and makes a prediction after weighting. The HFPE method also provides an evolution algorithm that derives the prediction deviation in terms of pattern and accuracy; this algorithm continuously improves the adaptability of HFPE through gradual fine-tuning. The method preserves the rules in the sequence as far as possible, needs no data preprocessing, and can continuously track and adapt to variation in the situation sequence. PMID:24892054
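
The prediction step described in this abstract (find earlier occurrences of the current indication, weigh each indication-effect link, then turn the weights into probabilities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hfpe_predict`, the exact-match rule for "similar" indications, the fixed `window` length and the exponential recency `decay` are all assumptions made for illustration, and the situation values are assumed to be already discretized into symbols.

```python
from collections import defaultdict


def hfpe_predict(history, window=2, decay=0.9):
    """Return {effect: probability} for the symbol that follows the last
    `window` symbols of `history`.

    Hypothetical simplification: an "indication" is the tuple of the last
    `window` symbols, two indications are "similar" only if identical, and
    the link intensity is a recency-weighted count.
    """
    current = tuple(history[-window:])
    weights = defaultdict(float)
    n = len(history)
    # Visit every past indication that is followed by an observed effect.
    for i in range(n - window):
        indication = tuple(history[i:i + window])
        if indication == current:
            effect = history[i + window]
            age = (n - window - 1) - i  # 0 for the most recent possible match
            weights[effect] += decay ** age
    total = sum(weights.values())
    if total == 0:
        return {}  # the current indication was never seen before
    return {effect: w / total for effect, w in weights.items()}


# Toy history of discretized situation levels.
toy_history = ["low", "high", "low", "high", "low"]
toy_prediction = hfpe_predict(toy_history, window=1)
```

A fuller version would replace the exact-match test with a similarity measure and would adjust `decay` from observed prediction errors, which is the role the abstract assigns to its evolution algorithm.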
Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences

PubMed Central

2018-01-01

Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ∼100% at 100% identity but ∼50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity. With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal. PMID:29682424
Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein-Friesian cattle.

PubMed

Veerkamp, Roel F; Bouwman, Aniek C; Schrooten, Chris; Calus, Mario P L

2016-12-01

Whole-genome sequence data is expected to capture genetic variation more completely than common genotyping panels. Our objective was to compare the proportion of variance explained and the accuracy of genomic prediction by using imputed sequence data or preselected SNPs from a genome-wide association study (GWAS) with imputed whole-genome sequence data. Phenotypes were available for 5503 Holstein-Friesian bulls. Genotypes were imputed up to whole-genome sequence (13,789,029 segregating DNA variants) by using run 4 of the 1000 bull genomes project. The program GCTA was used to perform GWAS for protein yield (PY), somatic cell score (SCS) and interval from first to last insemination (IFL). From the GWAS, subsets of variants were selected and genomic relationship matrices (GRM) were used to estimate the variance explained in 2087 validation animals and to evaluate the genomic prediction ability. Finally, two GRM were fitted together in several models to evaluate the effect of selected variants that were in competition with all the other variants. The GRM based on full sequence data explained only marginally more genetic variation than that based on common SNP panels: for PY, SCS and IFL, genomic heritability improved from 0.81 to 0.83, 0.83 to 0.87 and 0.69 to 0.72, respectively. Sequence data also helped to identify more variants linked to quantitative trait loci and resulted in clearer GWAS peaks across the genome. The proportion of total variance explained by the selected variants combined in a GRM was considerably smaller than that explained by all variants (less than 0.31 for all traits). When selected variants were used, accuracy of genomic predictions decreased and bias increased. Although 35 to 42 variants were detected that together explained 13 to 19% of the total variance (18 to 23% of the genetic variance) when fitted alone, there was no advantage in using dense sequence information for genomic prediction in the Holstein data used in our study. 
Detection and selection of variants within a single breed are difficult due to long-range linkage disequilibrium. Stringent selection of variants resulted in more biased genomic predictions, although this might be due to the training population being the same dataset from which the selected variants were identified.
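
As a rough illustration of the genomic relationship matrix (GRM) that the abstract relies on, the sketch below builds one from allele counts using the standardized cross-product form commonly associated with GCTA. It is an assumption-laden illustration, not the study's pipeline: the function name `grm` and the toy genotypes are hypothetical, monomorphic variants are simply skipped, and GCTA's own treatment of diagonal elements and its options differ from this simplified version.

```python
def grm(genotypes):
    """Genomic relationship matrix from allele counts.

    genotypes[j][i] is the count (0, 1 or 2) of the reference allele for
    animal j at variant i. Each entry is the mean over variants of
    (x_ij - 2p_i)(x_ik - 2p_i) / (2 p_i (1 - p_i)).
    """
    n_animals = len(genotypes)
    n_variants = len(genotypes[0])
    # Allele frequency of each variant, estimated from the sample itself.
    freqs = [
        sum(row[i] for row in genotypes) / (2 * n_animals)
        for i in range(n_variants)
    ]
    # Monomorphic variants carry no information and would divide by zero.
    used = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not used:
        raise ValueError("no segregating variants")
    matrix = [[0.0] * n_animals for _ in range(n_animals)]
    for j in range(n_animals):
        for k in range(j, n_animals):
            total = sum(
                (genotypes[j][i] - 2 * freqs[i])
                * (genotypes[k][i] - 2 * freqs[i])
                / (2 * freqs[i] * (1 - freqs[i]))
                for i in used
            )
            matrix[j][k] = matrix[k][j] = total / len(used)
    return matrix


# Three hypothetical animals genotyped at two variants.
toy_grm = grm([[0, 2], [2, 0], [1, 1]])
```

Building one such matrix from the preselected variants and another from all remaining variants corresponds to the two-GRM models mentioned in the abstract.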
Increased genomic prediction accuracy in wheat breeding through spatial adjustment of field trial data.

PubMed

Lado, Bettina; Matus, Ivan; Rodríguez, Alejandra; Inostroza, Luis; Poland, Jesse; Belzile, François; del Pozo, Alejandro; Quincke, Martín; Castro, Marina; von Zitzewitz, Jarislav

2013-12-09

In crop breeding, the interest of predicting the performance of candidate cultivars in the field has increased due to recent advances in molecular breeding technologies. However, the complexity of the wheat genome presents some challenges for applying new technologies in molecular marker identification with next-generation sequencing. We applied genotyping-by-sequencing, a recently developed method to identify single-nucleotide polymorphisms, in the genomes of 384 wheat (Triticum aestivum) genotypes that were field tested under three different water regimes in Mediterranean climatic conditions: rain-fed only, mild water stress, and fully irrigated. We identified 102,324 single-nucleotide polymorphisms in these genotypes, and the phenotypic data were used to train and test genomic selection models intended to predict yield, thousand-kernel weight, number of kernels per spike, and heading date. Phenotypic data showed marked spatial variation. Therefore, different models were tested to correct the trends observed in the field. A mixed-model using moving-means as a covariate was found to best fit the data. When we applied the genomic selection models, the accuracy of predicted traits increased with spatial adjustment. Multiple genomic selection models were tested, and a Gaussian kernel model was determined to give the highest accuracy. The best predictions between environments were obtained when data from different years were used to train the model. Our results confirm that genotyping-by-sequencing is an effective tool to obtain genome-wide information for crops with complex genomes, that these data are efficient for predicting traits, and that correction of spatial variation is a crucial ingredient to increase prediction accuracy in genomic selection models.
SNP discovery by high-throughput sequencing in soybean

PubMed Central

2010-01-01

Background With the advance of new massively parallel genotyping technologies, quantitative trait loci (QTL) fine mapping and map-based cloning become more achievable in identifying genes for important and complex traits. Development of high-density genetic markers in the QTL regions of specific mapping populations is essential for fine-mapping and map-based cloning of economically important genes. Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation existing between any diverse genotypes that are usually used for QTL mapping studies. The massively parallel sequencing technologies (Roche GS/454, Illumina GA/Solexa, and ABI/SOLiD) have been widely applied to identify genome-wide sequence variations. However, it remains unclear whether sequence data at a low sequencing depth are enough to detect the variations existing in any QTL regions of interest in a crop genome, and how to prepare sequencing samples for a complex genome such as soybean. Therefore, with the aims of identifying SNP markers in a cost-effective way for fine-mapping several QTL regions, and testing the validation rate of the putative SNPs predicted with Solexa short sequence reads at a low sequencing depth, we evaluated a pooled DNA fragment reduced representation library and SNP detection methods applied to short read sequences generated by Solexa high-throughput sequencing technology. Results A total of 39,022 putative SNPs were identified by the Illumina/Solexa sequencing system using a reduced representation DNA library of two parental lines of a mapping population. The validation rates of these putative SNPs predicted with low and high stringency were 72% and 85%, respectively. One hundred sixty-four SNP markers resulted from the validation of putative SNPs and were selected to target a known QTL, thereby increasing the marker density of the targeted region to one marker per 42 kbp. Conclusions We have demonstrated how to quickly identify large numbers of SNPs for fine mapping of QTL regions by applying massively parallel sequencing combined with genome complexity reduction techniques. This SNP discovery approach is more efficient for targeting multiple QTL regions in the same genetic population, and can be applied to other crops. PMID:20701770
RACER a Coarse-Grained RNA Model for Capturing Folding Free Energy in Molecular Dynamics Simulations

NASA Astrophysics Data System (ADS)

Cheng, Sara; Bell, David; Ren, Pengyu

RACER is a coarse-grained RNA model that can be used in molecular dynamics simulations to predict native structures and sequence-specific variation of free energy of various RNA structures. RACER is capable of accurate prediction of native structures of duplexes and hairpins (average RMSD of 4.15 angstroms), and RACER can capture sequence-specific variation of free energy in excellent agreement with experimentally measured stabilities (r-squared = 0.98). The RACER model implements a new effective non-bonded potential and re-parameterization of hydrogen bond and Debye-Hückel potentials. Insights from the RACER model include the importance of treating pairing and stacking interactions separately in order to distinguish folded and unfolded states, and the identification of hydrogen-bonding, base stacking, and electrostatic interactions as essential driving forces for RNA folding. Future applications of the RACER model include predicting free energy landscapes of more complex RNA structures and use of RACER for multiscale simulations.
Rebelling for a Reason: Protein Structural “Outliers”

PubMed Central

Arumugam, Gandhimathi; Nair, Anu G.; Hariharaputran, Sridhar; Ramanathan, Sowdhamini

2013-01-01

Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or ‘rebels’, are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities. PMID:24073209
High levels of variation in Salix lignocellulose genes revealed using poplar genomic resources

PubMed Central

2013-01-01

Background Little is known about the levels of variation in lignin or other wood related genes in Salix, a genus that is being increasingly used for biomass and biofuel production. The lignin biosynthesis pathway is well characterized in a number of species, including the model tree Populus. We aimed to transfer the genomic resources already available in Populus to its sister genus Salix to assess levels of variation within genes involved in wood formation. Results Amplification trials for 27 gene regions were undertaken in 40 Salix taxa. Twelve of these regions were sequenced. Alignment searches of the resulting sequences against reference databases, combined with phylogenetic analyses, showed the close similarity of these Salix sequences to Populus, confirming homology of the primer regions and indicating a high level of conservation within the wood formation genes. However, all sequences were found to vary considerably among Salix species, mainly as SNPs with a smaller number of insertions-deletions. Between 25 and 176 SNPs per kbp per gene region (in predicted exons) were discovered within Salix. Conclusions The variation found is sizeable but not unexpected as it is based on interspecific and not intraspecific comparison; it is comparable to interspecific variation in Populus. The characterisation of genetic variation is a key process in pre-breeding and for the conservation and exploitation of genetic resources in Salix. This study characterises the variation in several lignocellulose gene markers for such purposes. PMID:23924375
TEMPLE: analysing population genetic variation at transcription factor binding sites.

PubMed

Litovchenko, Maria; Laurent, Stefan

2016-11-01

Genetic variation occurring at the level of regulatory sequences can affect phenotypes and fitness in natural populations. This variation can be analysed in a population genetic framework to study how genetic drift and selection affect the evolution of these functional elements. However, doing this requires a good understanding of the location and nature of regulatory regions, which has long been a major hurdle. The current proliferation of genomewide profiling experiments of transcription factor occupancies greatly improves our ability to identify genomic regions involved in specific DNA-protein interactions. Although software exists for predicting transcription factor binding sites (TFBS), and the effects of genetic variants on TFBS specificity, there are no tools currently available for inferring this information jointly with the genetic variation at TFBS in natural populations. We developed the software Transcription Elements Mapping at the Population LEvel (TEMPLE), which predicts TFBS, evaluates the effects of genetic variants on TFBS specificity and summarizes the genetic variation occurring at TFBS in intraspecific sequence alignments. We demonstrate that TEMPLE's TFBS prediction algorithm gives identical results to PATSER, a software distribution commonly used in the field. We also illustrate the unique features of TEMPLE by analysing TFBS diversity for the TF Senseless (SENS) in one ancestral and one cosmopolitan population of the fruit fly Drosophila melanogaster. TEMPLE can be used to localize TFBS that are characterized by strong genetic differentiation across natural populations. This will be particularly useful for studies aiming to identify adaptive mutations. TEMPLE is a Java-based cross-platform software that easily maps the genetic diversity at predicted TFBSs using a graphical interface, or from the Unix command line. © 2016 John Wiley & Sons Ltd.
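
To make the idea of "evaluating the effect of a genetic variant on TFBS specificity" concrete, the sketch below scores a reference and an alternative allele with a log-odds position weight matrix and reports the score difference. It is a generic illustration, not the algorithm of TEMPLE or PATSER: the function names, the count matrix `COUNTS`, the uniform background, the pseudocount and the two sequences are all hypothetical, and only the forward strand is scanned.

```python
import math


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_site(sequence, pwm):
    """Return (score, offset) of the best-scoring window on the forward strand.

    Assumes the sequence contains only A, C, G, T and is at least as long
    as the motif.
    """
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical 4-bp motif with consensus ACGT, built from 8 made-up sites.
COUNTS = [{"A": 8}, {"C": 8}, {"G": 8}, {"T": 8}]
pwm = log_odds_pwm(COUNTS)

ref_score, ref_offset = best_site("TTACGTAA", pwm)  # reference allele
alt_score, alt_offset = best_site("TTACCTAA", pwm)  # G>C variant inside the site
delta = alt_score - ref_score  # negative: the variant weakens the predicted site
```

Repeating the comparison for every segregating variant in an alignment gives a per-site summary of how population variation changes predicted binding, which is the kind of joint view the abstract describes.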
Variation in Time of Flowering and Seed Dispersal of Eastern Cottonwood In the Lower Mississippi Valley

Treesearch

Robert E. Farmer

1966-01-01

Flowering of Populus deltoides Bartr. occurred from early March to early April; differences between trees within stands accounted for 98 percent of the significant variation in dates. High correlation (r = .91 to .96) between 1963 and 1964 dates of individual trees indicated that trees within stands flower in a predictable sequence. Seed dispersal...
Prognostic and predictive value of TP53 mutations in node-positive breast cancer patients treated with anthracycline- or anthracycline/taxane-based adjuvant therapy: results from the BIG 02-98 phase III trial

PubMed Central

2012-01-01

Abstract Introduction Pre-clinical data suggest p53-dependent anthracycline-induced apoptosis and p53-independent taxane activity. However, dedicated clinical research has not defined a predictive role for TP53 gene mutations. The aim of the current study was to retrospectively explore the prognosis and predictive values of TP53 somatic mutations in the BIG 02-98 randomized phase III trial in which women with node-positive breast cancer were treated with adjuvant doxorubicin-based chemotherapy with or without docetaxel. Methods The prognostic and predictive values of TP53 were analyzed in tumor samples by gene sequencing within exons 5 to 8. Patients were classified according to p53 protein status predicted from TP53 gene sequence, as wild-type (no TP53 variation or TP53 variations which are predicted not to modify p53 protein sequence) or mutant (p53 nonsynonymous mutations). Mutations were subcategorized according to missense or truncating mutations. Survival analyses were performed using the Kaplan-Meier method and log-rank test. Cox-regression analysis was used to identify independent predictors of outcome. Results TP53 gene status was determined for 18% (520 of 2887) of the women enrolled in BIG 02-98. TP53 gene variations were found in 17% (90 of 520). Nonsynonymous p53 mutations, found in 16.3% (85 of 520), were associated with older age, ductal morphology, higher grade and hormone-receptor negativity. Of the nonsynonymous mutations, 12.3% (64 of 520) were missense and 3.6% were truncating (19 of 520). Only truncating mutations showed significant independent prognostic value, with an increased recurrence risk compared to patients with non-modified p53 protein (hazard ratio = 3.21, 95% confidence interval = 1.740 to 5.935, P = 0.0002). p53 status had no significant predictive value for response to docetaxel. Conclusions p53 truncating mutations were uncommon but associated with poor prognosis. No significant predictive role for p53 status was detected. 
Trial registration ClinicalTrials.gov NCT00174655 PMID:22551440
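
The reported hazard ratio, confidence interval and P value for truncating mutations can be checked against one another with a short calculation. This sketch assumes a Wald-type interval that is symmetric on the log scale and a two-sided normal test, which the abstract does not state; the variable names are illustrative only.

```python
import math

# Values reported in the abstract for truncating mutations.
hazard_ratio = 3.21
ci_low, ci_high = 1.740, 5.935

# Assumption: the 95% CI is exp(log(HR) +/- 1.96 * SE).
standard_error = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
z_score = math.log(hazard_ratio) / standard_error

# Two-sided P value from the standard normal distribution.
p_two_sided = math.erfc(z_score / math.sqrt(2))

print(f"SE = {standard_error:.3f}, z = {z_score:.2f}, P = {p_two_sided:.4f}")
```

Under these assumptions the recomputed P value is of the same order as the reported P = 0.0002, so the three reported quantities are mutually consistent.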
Dissecting the relationship between protein structure and sequence variation

NASA Astrophysics Data System (ADS)

Shahmoradi, Amir; Wilke, Claus; Wilke Lab Team

2015-03-01

Over the past decade several independent works have shown that some structural properties of proteins are capable of predicting protein evolution. The strength and significance of these structure-sequence relations, however, appear to vary widely among different proteins, with absolute correlation strengths ranging from 0.1 to 0.8. Here we present the results from a comprehensive search for the potential biophysical and structural determinants of protein evolution by studying more than 200 structural and evolutionary properties in a dataset of 209 monomeric enzymes. We discuss the main protein characteristics responsible for the general patterns of protein evolution, and identify sequence divergence as the main determinant of the strengths of virtually all structure-evolution relationships, explaining ~10-30% of observed variation in sequence-structure relations. In addition to sequence divergence, we identify several protein structural properties that are moderately but significantly coupled with the strength of sequence-structure relations. In particular, proteins with more homogeneous backbone hydrogen bond energies, large fractions of helical secondary structures and low fraction of beta sheets tend to have the strongest sequence-structure relation. BEACON-NSF center for the study of evolution in action.
Increased Genomic Prediction Accuracy in Wheat Breeding Through Spatial Adjustment of Field Trial Data

PubMed Central

Lado, Bettina; Matus, Ivan; Rodríguez, Alejandra; Inostroza, Luis; Poland, Jesse; Belzile, François; del Pozo, Alejandro; Quincke, Martín; Castro, Marina; von Zitzewitz, Jarislav

2013-01-01

In crop breeding, the interest of predicting the performance of candidate cultivars in the field has increased due to recent advances in molecular breeding technologies. However, the complexity of the wheat genome presents some challenges for applying new technologies in molecular marker identification with next-generation sequencing. We applied genotyping-by-sequencing, a recently developed method to identify single-nucleotide polymorphisms, in the genomes of 384 wheat (Triticum aestivum) genotypes that were field tested under three different water regimes in Mediterranean climatic conditions: rain-fed only, mild water stress, and fully irrigated. We identified 102,324 single-nucleotide polymorphisms in these genotypes, and the phenotypic data were used to train and test genomic selection models intended to predict yield, thousand-kernel weight, number of kernels per spike, and heading date. Phenotypic data showed marked spatial variation. Therefore, different models were tested to correct the trends observed in the field. A mixed-model using moving-means as a covariate was found to best fit the data. When we applied the genomic selection models, the accuracy of predicted traits increased with spatial adjustment. Multiple genomic selection models were tested, and a Gaussian kernel model was determined to give the highest accuracy. The best predictions between environments were obtained when data from different years were used to train the model. Our results confirm that genotyping-by-sequencing is an effective tool to obtain genome-wide information for crops with complex genomes, that these data are efficient for predicting traits, and that correction of spatial variation is a crucial ingredient to increase prediction accuracy in genomic selection models. PMID:24082033
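
The abstract's best-fitting model uses moving means as a covariate to absorb spatial trends in the field. A minimal sketch of how such a covariate might be computed is given below; the rectangular plot layout, the square neighborhood of `radius` plots, the exclusion of the focal plot and the function name `moving_mean_covariate` are assumptions for illustration, not details taken from the study.

```python
def moving_mean_covariate(grid, radius=1):
    """For each plot, return the mean phenotype of its neighboring plots.

    grid[r][c] is the phenotype (for example yield) of the plot in row r and
    column c. The focal plot is excluded from its own mean so that the
    covariate reflects only the local field trend.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    covariate = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(n_cols, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            # A single isolated plot has no neighbors; fall back to its own value.
            covariate[r][c] = sum(neighbors) / len(neighbors) if neighbors else grid[r][c]
    return covariate


# Toy 2 x 2 field.
toy_covariate = moving_mean_covariate([[1.0, 2.0], [3.0, 4.0]])
```

The resulting values would enter the mixed model as a fixed covariate alongside the genotype effect, so that the adjusted phenotypes rather than the raw ones are used to train the genomic selection models.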
Plasma genetic and genomic abnormalities predict treatment response and clinical outcome in advanced prostate cancer.

PubMed

Xia, Shu; Kohli, Manish; Du, Meijun; Dittmar, Rachel L; Lee, Adam; Nandy, Debashis; Yuan, Tiezheng; Guo, Yongchen; Wang, Yuan; Tschannen, Michael R; Worthey, Elizabeth; Jacob, Howard; See, William; Kilari, Deepak; Wang, Xuexia; Hovey, Raymond L; Huang, Chiang-Ching; Wang, Liang

2015-06-30

Liquid biopsies, examinations of tumor components in body fluids, have shown promise for predicting clinical outcomes. To evaluate tumor-associated genomic and genetic variations in plasma cell-free DNA (cfDNA) and their associations with treatment response and overall survival, we applied whole genome and targeted sequencing to examine the plasma cfDNAs derived from 20 patients with advanced prostate cancer. Sequencing-based genomic abnormality analysis revealed locus-specific gains or losses that were common in prostate cancer, such as 8q gains, AR amplifications, PTEN losses and TMPRSS2-ERG fusions. To estimate tumor burden in cfDNA, we developed a Plasma Genomic Abnormality (PGA) score by summing the most significant copy number variations. Cox regression analysis showed that PGA scores were significantly associated with overall survival (p < 0.04). After androgen deprivation therapy or chemotherapy, targeted sequencing showed significant mutational profile changes in genes involved in androgen biosynthesis, AR activation, DNA repair, and chemotherapy resistance. These changes may reflect the dynamic evolution of heterozygous tumor populations in response to these treatments. These results strongly support the feasibility of using non-invasive liquid biopsies as potential tools to study biological mechanisms underlying therapy-specific resistance and to predict disease progression in advanced prostate cancer.
A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins.

PubMed

Sawle, Lucas; Ghosh, Kingshuk

2015-08-28

A general formalism to compute configurational properties of proteins and other heteropolymers with an arbitrary sequence of charges and non-uniform excluded volume interaction is presented. A variational approach is utilized to predict average distance between any two monomers in the chain. The presented analytical model, for the first time, explicitly incorporates the role of sequence charge distribution to determine relative sizes between two sequences that vary not only in total charge composition but also in charge decoration (even when charge composition is fixed). Furthermore, the formalism is general enough to allow variation in excluded volume interactions between two monomers. Model predictions are benchmarked against the all-atom Monte Carlo studies of Das and Pappu [Proc. Natl. Acad. Sci. U. S. A. 110, 13392 (2013)] for 30 different synthetic sequences of polyampholytes. These sequences possess an equal number of glutamic acid (E) and lysine (K) residues but differ in the patterning within the sequence. Without any fit parameter, the model captures the strong sequence dependence of the simulated values of the radius of gyration with a correlation coefficient of R^2 = 0.9. The model is then applied to real proteins to compare the unfolded state dimensions of 540 orthologous pairs of thermophilic and mesophilic proteins. The excluded volume parameters are assumed similar under denatured conditions, and only electrostatic effects encoded in the sequence are accounted for. With these assumptions, thermophilic proteins are found, with high statistical significance, to have more compact disordered ensembles than their mesophilic counterparts. The method presented here, due to its analytical nature, is capable of such high-throughput analysis of multiple proteins and will have broad applications in proteomic studies as well as in other heteropolymeric systems.
cgDNA: a software package for the prediction of sequence-dependent coarse-grain free energies of B-form DNA.

PubMed

Petkevičiūtė, D; Pasi, M; Gonzalez, O; Maddocks, J H

2014-11-10

cgDNA is a package for the prediction of sequence-dependent configuration-space free energies for B-form DNA at the coarse-grain level of rigid bases. For a fragment of any given length and sequence, cgDNA calculates the configuration of the associated free energy minimizer, i.e. the relative positions and orientations of each base, along with a stiffness matrix, which together govern differences in free energies. The model predicts non-local (i.e. beyond base-pair step) sequence dependence of the free energy minimizer. Configurations can be input or output in either the Curves+ definition of the usual helical DNA structural variables, or as a PDB file of coordinates of base atoms. We illustrate the cgDNA package by comparing predictions of free energy minimizers from (a) the cgDNA model, (b) time-averaged atomistic molecular dynamics (or MD) simulations, and (c) NMR or X-ray experimental observation, for (i) the Dickerson-Drew dodecamer and (ii) three oligomers containing A-tracts. The cgDNA predictions are rather close to those of the MD simulations, but many orders of magnitude faster to compute. Both the cgDNA and MD predictions are in reasonable agreement with the available experimental data. Our conclusion is that cgDNA can serve as a highly efficient tool for studying structural variations in B-form DNA over a wide range of sequences. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
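
The abstract states that the free energy minimizer and the stiffness matrix together govern differences in free energies. One natural reading, assumed here rather than quoted from the abstract, is a quadratic free energy about the minimizer. The symbols below are illustrative notation: $w$ is the vector of rigid-base configuration coordinates, $\hat{w}(S)$ the minimizer for sequence $S$, $K(S)$ the stiffness matrix, and $\hat{U}(S)$ a sequence-dependent constant.

```latex
U(w; S) \;=\; \tfrac{1}{2}\,\bigl(w - \hat{w}(S)\bigr)^{\mathsf T} K(S)\,\bigl(w - \hat{w}(S)\bigr) \;+\; \hat{U}(S)
```

Under this reading, the free energy difference between two configurations of the same fragment depends only on $\hat{w}(S)$ and $K(S)$, since the constant $\hat{U}(S)$ cancels.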
Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement

DOE Office of Scientific and Technical Information (OSTI.GOV)

Le Coq, Johanne; Ghosh, Partho

2012-06-19

Anticipatory ligand binding through massive protein sequence variation is rare in biological systems, having been observed only in the vertebrate adaptive immune response and in a phage diversity-generating retroelement (DGR). Earlier work has demonstrated that the prototypical DGR variable protein, major tropism determinant (Mtd), meets the demands of anticipatory ligand binding by novel means through the C-type lectin (CLec) fold. However, because of the low sequence identity among DGR variable proteins, it has remained unclear whether the CLec fold is a general solution for DGRs. We have addressed this problem by determining the structure of a second DGR variable protein, TvpA, from the pathogenic oral spirochete Treponema denticola. Despite its weak sequence identity to Mtd (∼16%), TvpA was found to also have a CLec fold, with predicted variable residues exposed in a ligand-binding site. However, this site in TvpA was markedly more variable than the one in Mtd, reflecting the unprecedented approximate 10^20 potential variability of TvpA. In addition, similarity between TvpA and Mtd with formylglycine-generating enzymes was detected. These results provide strong evidence for the conservation of the formylglycine-generating enzyme-type CLec fold among DGRs as a means of accommodating massive sequence variation.
Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement

PubMed Central

Le Coq, Johanne; Ghosh, Partho

2011-01-01

Anticipatory ligand binding through massive protein sequence variation is rare in biological systems, having been observed only in the vertebrate adaptive immune response and in a phage diversity-generating retroelement (DGR). Earlier work has demonstrated that the prototypical DGR variable protein, major tropism determinant (Mtd), meets the demands of anticipatory ligand binding by novel means through the C-type lectin (CLec) fold. However, because of the low sequence identity among DGR variable proteins, it has remained unclear whether the CLec fold is a general solution for DGRs. We have addressed this problem by determining the structure of a second DGR variable protein, TvpA, from the pathogenic oral spirochete Treponema denticola. Despite its weak sequence identity to Mtd (∼16%), TvpA was found to also have a CLec fold, with predicted variable residues exposed in a ligand-binding site. However, this site in TvpA was markedly more variable than the one in Mtd, reflecting the unprecedented approximate 10^20 potential variability of TvpA. In addition, similarity between TvpA and Mtd with formylglycine-generating enzymes was detected. These results provide strong evidence for the conservation of the formylglycine-generating enzyme-type CLec fold among DGRs as a means of accommodating massive sequence variation. PMID:21873231
Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement.

PubMed

Le Coq, Johanne; Ghosh, Partho

2011-08-30

Anticipatory ligand binding through massive protein sequence variation is rare in biological systems, having been observed only in the vertebrate adaptive immune response and in a phage diversity-generating retroelement (DGR). Earlier work has demonstrated that the prototypical DGR variable protein, major tropism determinant (Mtd), meets the demands of anticipatory ligand binding by novel means through the C-type lectin (CLec) fold. However, because of the low sequence identity among DGR variable proteins, it has remained unclear whether the CLec fold is a general solution for DGRs. We have addressed this problem by determining the structure of a second DGR variable protein, TvpA, from the pathogenic oral spirochete Treponema denticola. Despite its weak sequence identity to Mtd (∼16%), TvpA was found to also have a CLec fold, with predicted variable residues exposed in a ligand-binding site. However, this site in TvpA was markedly more variable than the one in Mtd, reflecting the unprecedented approximate 10^20 potential variability of TvpA. In addition, similarity between TvpA and Mtd with formylglycine-generating enzymes was detected. These results provide strong evidence for the conservation of the formylglycine-generating enzyme-type CLec fold among DGRs as a means of accommodating massive sequence variation.

Position specific variation in the rate of evolution in transcription factor binding sites

PubMed Central

Moses, Alan M; Chiang, Derek Y; Kellis, Manolis; Lander, Eric S; Eisen, Michael B

2003-01-01

Background: The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Results: Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. Conclusion: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA. PMID:12946282
Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents.

PubMed Central

Bergman, Casey M.; Haddrill, Penelope R.

2015-01-01

To contribute to our general understanding of the evolutionary forces that shape variation in genome sequences in nature, we have sequenced genomes from 50 isofemale lines and six pooled samples from populations of Drosophila melanogaster on three continents. Analysis of raw and reference-mapped reads indicates the quality of these genomic sequence data is very high. Comparison of the predicted and experimentally-determined Wolbachia infection status of these samples suggests that strain or sample swaps are unlikely to have occurred in the generation of these data. Genome sequences are freely available in the European Nucleotide Archive under accession ERP009059. Isofemale lines can be obtained from the Drosophila Species Stock Center. PMID:25717372
Strain-specific and pooled genome sequences for populations of Drosophila melanogaster from three continents.

PubMed

Bergman, Casey M; Haddrill, Penelope R

2015-01-01

To contribute to our general understanding of the evolutionary forces that shape variation in genome sequences in nature, we have sequenced genomes from 50 isofemale lines and six pooled samples from populations of Drosophila melanogaster on three continents. Analysis of raw and reference-mapped reads indicates the quality of these genomic sequence data is very high. Comparison of the predicted and experimentally-determined Wolbachia infection status of these samples suggests that strain or sample swaps are unlikely to have occurred in the generation of these data. Genome sequences are freely available in the European Nucleotide Archive under accession ERP009059. Isofemale lines can be obtained from the Drosophila Species Stock Center.
Genome-Wide Association Mapping and Genomic Prediction Elucidate the Genetic Architecture of Morphological Traits in Arabidopsis.

PubMed

Kooke, Rik; Kruijer, Willem; Bours, Ralph; Becker, Frank; Kuhn, André; van de Geest, Henri; Buntjer, Jaap; Doeswijk, Timo; Guerra, José; Bouwmeester, Harro; Vreugdenhil, Dick; Keurentjes, Joost J B

2016-04-01

Quantitative traits in plants are controlled by a large number of genes and their interaction with the environment. To disentangle the genetic architecture of such traits, natural variation within species can be explored by studying genotype-phenotype relationships. Genome-wide association studies that link phenotypes to thousands of single nucleotide polymorphism markers are nowadays common practice for such analyses. In many cases, however, the identified individual loci cannot fully explain the heritability estimates, suggesting missing heritability. We analyzed 349 Arabidopsis accessions and found extensive variation and high heritabilities for different morphological traits. The number of significant genome-wide associations was, however, very low. The application of genomic prediction models that take into account the effects of all individual loci may greatly enhance the elucidation of the genetic architecture of quantitative traits in plants. Here, genomic prediction models revealed different genetic architectures for the morphological traits. Integrating genomic prediction and association mapping enabled the assignment of many plausible candidate genes explaining the observed variation. These genes were analyzed for functional and sequence diversity, and good indications that natural allelic variation in many of these genes contributes to phenotypic variation were obtained. For ACS11, an ethylene biosynthesis gene, haplotype differences explaining variation in the ratio of petiole and leaf length could be identified. © 2016 American Society of Plant Biologists. All Rights Reserved.
Regional variations in the diversity and predicted metabolic potential of benthic prokaryotes in coastal northern Zhejiang, East China Sea

PubMed Central

Wang, Kai; Ye, Xiansen; Zhang, Huajun; Chen, Heping; Zhang, Demin; Liu, Lian

2016-01-01

Knowledge about the drivers of benthic prokaryotic diversity and metabolic potential in interconnected coastal sediments at regional scales is limited. We collected surface sediments across six zones covering ~200 km in coastal northern Zhejiang, East China Sea and combined 16S rRNA gene sequencing, community-level metabolic prediction, and sediment physicochemical measurements to investigate variations in prokaryotic diversity and metabolic gene composition with geographic distance and under local environmental conditions. Geographic distance was the most influential factor in prokaryotic β-diversity compared with major environmental drivers, including temperature, sediment texture, acid-volatile sulfide, and water depth, but a large unexplained variation in community composition suggested the potential effects of unmeasured abiotic/biotic factors and stochastic processes. Moreover, prokaryotic assemblages showed a biogeographic provincialism across the zones. The predicted metabolic gene composition shifted in parallel with taxonomic composition. Acid-volatile sulfide was strongly correlated with variation in metabolic gene composition. In the Yushan area, enrichment in the relative abundance of sulfate-reducing bacteria was observed, and enrichment in genes relevant to dissimilatory sulfate reduction was predicted. These results provide insights into the relative importance of geographic distance and environmental conditions in driving benthic prokaryotic diversity in coastal areas and predict specific biogeochemically-relevant genes for future studies. PMID:27917954
Genome-wide patterns of copy number variation in the diversified chicken genomes using next-generation sequencing.

PubMed

Yi, Guoqiang; Qu, Lujiang; Liu, Jianfeng; Yan, Yiyuan; Xu, Guiyun; Yang, Ning

2014-11-07

Copy number variation (CNV) is important and widespread in the genome, and is a major cause of disease and phenotypic diversity. Herein, we performed a genome-wide CNV analysis in 12 diversified chicken genomes based on whole genome sequencing. A total of 8,840 CNV regions (CNVRs) covering 98.2 Mb and representing 9.4% of the chicken genome were identified, ranging in size from 1.1 to 268.8 kb with an average of 11.1 kb. Sequencing-based predictions were confirmed at a high validation rate by two independent approaches, including array comparative genomic hybridization (aCGH) and quantitative PCR (qPCR). The Pearson's correlation coefficients between sequencing and aCGH results ranged from 0.435 to 0.755, and qPCR experiments revealed a positive validation rate of 91.71% and a false negative rate of 22.43%. In total, 2,214 (25.0%) predicted CNVRs span 2,216 (36.4%) RefSeq genes associated with specific biological functions. Besides two previously reported copy number variable genes EDN3 and PRLR, we also found some promising genes with potential in phenotypic variation. Two genes, FZD6 and LIMS1, related to disease susceptibility/resistance are covered by CNVRs. The highly duplicated SOCS2 may lead to higher bone mineral density. Entire or partial duplication of some genes like POPDC3 may have great economic importance in poultry breeding. Our results based on extensive genetic diversity provide a more refined chicken CNV map and genome-wide gene copy number estimates, and warrant future CNV association studies for important traits in chickens.
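The concordance between sequencing-based and aCGH copy number estimates above is reported as Pearson's correlation coefficient. A minimal sketch of that statistic follows; the function name and the toy log-ratio values are made up for illustration and are not data from the study.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-region log2 ratios from the two platforms
seq_log2 = [0.9, -1.1, 0.4, 1.3, -0.6]
acgh_log2 = [0.7, -0.8, 0.1, 1.0, -0.2]
r = pearson_r(seq_log2, acgh_log2)
```

A value near 1 indicates that regions called as gains or losses by sequencing show log ratios in the same direction and of proportional size on the array.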
LenVarDB: database of length-variant protein domains.

PubMed

Mutt, Eshita; Mathew, Oommen K; Sowdhamini, Ramanathan

2014-01-01

Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2 730 625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions-deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.
Hysteretic energy prediction method for mainshock-aftershock sequences

NASA Astrophysics Data System (ADS)

Zhai, Changhai; Ji, Duofa; Wen, Weiping; Li, Cuihua; Lei, Weidong; Xie, Lili

2018-04-01

Structures located in seismically active regions may be subjected to mainshock-aftershock (MSAS) sequences. Strong aftershocks significantly affect the hysteretic energy demand of structures. The hysteretic energy, E_{H,seq}, is normalized by mass m and expressed in terms of the equivalent velocity, V_{D,seq}, to quantitatively investigate aftershock effects on the hysteretic energy of structures. The equivalent velocity, V_{D,seq}, is computed by analyzing the response time-history of an inelastic single-degree-of-freedom (SDOF) system with a varying vibration period subjected to 309 MSAS sequences. The present study selected two kinds of MSAS sequences, with one aftershock and two aftershocks, respectively. The aftershocks are scaled to maintain different relative intensities. The variation of the equivalent velocity, V_{D,seq}, is studied for consideration of the ductility values, site conditions, relative intensities, number of aftershocks, hysteretic models, and damping ratios. The MSAS sequence with one aftershock exhibited a 10% to 30% hysteretic energy increase, whereas the MSAS sequence with two aftershocks presented a 20% to 40% hysteretic energy increase. Finally, a hysteretic energy prediction equation is proposed as a function of the vibration period, ductility value, and damping ratio to estimate hysteretic energy for mainshock-aftershock sequences.
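The abstract does not spell out how hysteretic energy is converted to an equivalent velocity. A common convention in energy-based seismic design, assumed here, is V = sqrt(2 E / m). The sketch below applies that convention; the function name and numbers are illustrative only and are not taken from the paper.

```python
import math


def equivalent_velocity(hysteretic_energy, mass):
    """Mass-normalized hysteretic energy expressed as a velocity,
    assuming the convention V = sqrt(2 * E / m)."""
    if mass <= 0:
        raise ValueError("mass must be positive")
    return math.sqrt(2.0 * hysteretic_energy / mass)


# Illustrative values in consistent units (e.g. joules and kilograms)
v_mainshock = equivalent_velocity(50.0, 4.0)
v_sequence = equivalent_velocity(50.0 * 1.3, 4.0)  # 30% more energy
```

Because the velocity scales with the square root of energy, a 30% increase in hysteretic energy corresponds to roughly a 14% increase in equivalent velocity under this convention.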
Development of a genotype-by-sequencing immunogenetic assay as exemplified by screening for variation in red fox with and without endemic rabies exposure.

PubMed

Donaldson, Michael E; Rico, Yessica; Hueffer, Karsten; Rando, Halie M; Kukekova, Anna V; Kyle, Christopher J

2018-01-01

Pathogens are recognized as major drivers of local adaptation in wildlife systems. By determining which gene variants are favored in local interactions among populations with and without disease, spatially explicit adaptive responses to pathogens can be elucidated. Much of our current understanding of host responses to disease comes from a small number of genes associated with an immune response. High-throughput sequencing (HTS) technologies, such as genotype-by-sequencing (GBS), facilitate expanded explorations of genomic variation among populations. Hybridization-based GBS techniques can be leveraged in systems not well characterized for specific variants associated with disease outcome to "capture" specific genes and regulatory regions known to influence expression and disease outcome. We developed a multiplexed, sequence capture assay for red foxes to simultaneously assess ~300-kbp of genomic sequence from 116 adaptive, intrinsic, and innate immunity genes of predicted adaptive significance and their putative upstream regulatory regions along with 23 neutral microsatellite regions to control for demographic effects. The assay was applied to 45 fox DNA samples from Alaska, where three arctic rabies strains are geographically restricted and endemic to coastal tundra regions, yet absent from the boreal interior. The assay provided 61.5% on-target enrichment with relatively even sequence coverage across all targeted loci and samples (mean = 50×), which allowed us to elucidate genetic variation across introns, exons, and potential regulatory regions (4,819 SNPs). Challenges remained in accurately describing microsatellite variation using this technique; however, longer-read HTS technologies should overcome these issues. We used these data to conduct preliminary analyses and detected genetic structure in a subset of red fox immune-related genes between regions with and without endemic arctic rabies. This assay provides a template to assess immunogenetic variation in wildlife disease systems.
Re-sequencing and genetic variation identification of a rice line with ideal plant architecture.

PubMed

Li, Shuangcheng; Xie, Kailong; Li, Wenbo; Zou, Ting; Ren, Yun; Wang, Shiquan; Deng, Qiming; Zheng, Aiping; Zhu, Jun; Liu, Huainian; Wang, Lingxia; Ai, Peng; Gao, Fengyan; Huang, Bin; Cao, Xuemei; Li, Ping

2012-12-01

The ideal plant architecture (IPA) includes several important characteristics such as low tiller numbers, few or no unproductive tillers, more grains per panicle, and thick and sturdy stems. We have developed an indica restorer line 7302R that displays the IPA phenotype in terms of tiller number, grain number, and stem strength. However, the underlying mechanism remained to be clarified. We performed re-sequencing and genome-wide variation analysis of 7302R using the Solexa sequencing technology. With the genomic sequence of the indica cultivar 9311 as reference, 307 627 SNPs, 57 372 InDels, and 3 096 SVs were identified in the 7302R genome. The 7302R-specific variations were investigated via the synteny analysis of all the SNPs of 7302R with those of the previously sequenced non-IPA-type lines IR24, MH63, and SH527. Moreover, we found 178 168 7302R-specific SNPs across the whole genome and 30 239 SNPs in the predicted mRNA regions, among which 8 517 were non-synonymous coding SNPs. In addition, 263 large-effect SNPs that were expected to affect the integrity of encoded proteins were identified from the 7302R-specific SNPs. SNPs of several important previously cloned rice genes were also identified by aligning the 7302R sequence with other sequenced lines. Our results provide several candidates that may account for the IPA phenotype of 7302R. These results therefore lay the groundwork for long-term efforts to uncover important genes and alleles for rice plant architecture construction, and also offer useful data resources for future genetic and genomic studies in rice.
The population genomics of rhesus macaques (Macaca mulatta) based on whole-genome sequences

PubMed Central

Xue, Cheng; Raveendran, Muthuswamy; Harris, R. Alan; Fawcett, Gloria L.; Liu, Xiaoming; White, Simon; Dahdouli, Mahmoud; Rio Deiros, David; Below, Jennifer E.; Salerno, William; Cox, Laura; Fan, Guoping; Ferguson, Betsy; Horvath, Julie; Johnson, Zach; Kanthaswamy, Sree; Kubisch, H. Michael; Liu, Dahai; Platt, Michael; Smith, David G.; Sun, Binghua; Vallender, Eric J.; Wang, Feng; Wiseman, Roger W.; Chen, Rui; Muzny, Donna M.; Gibbs, Richard A.; Yu, Fuli; Rogers, Jeffrey

2016-01-01

Rhesus macaques (Macaca mulatta) are the most widely used nonhuman primate in biomedical research, have the largest natural geographic distribution of any nonhuman primate, and have been the focus of much evolutionary and behavioral investigation. Consequently, rhesus macaques are one of the most thoroughly studied nonhuman primate species. However, little is known about genome-wide genetic variation in this species. A detailed understanding of extant genomic variation among rhesus macaques has implications for the use of this species as a model for studies of human health and disease, as well as for evolutionary population genomics. Whole-genome sequencing analysis of 133 rhesus macaques revealed more than 43.7 million single-nucleotide variants, including thousands predicted to alter protein sequences, transcript splicing, and transcription factor binding sites. Rhesus macaques exhibit 2.5-fold higher overall nucleotide diversity and slightly elevated putative functional variation compared with humans. This functional variation in macaques provides opportunities for analyses of coding and noncoding variation, and its cellular consequences. Despite modestly higher levels of nonsynonymous variation in the macaques, the estimated distribution of fitness effects and the ratio of nonsynonymous to synonymous variants suggest that purifying selection has had stronger effects in rhesus macaques than in humans. Demographic reconstructions indicate this species has experienced a consistently large but fluctuating population size. Overall, the results presented here provide new insights into the population genomics of nonhuman primates and expand genomic information directly relevant to primate models of human disease. PMID:27934697
On the Power and the Systematic Biases of the Detection of Chromosomal Inversions by Paired-End Genome Sequencing

PubMed Central

Lucas Lledó, José Ignacio; Cáceres, Mario

2013-01-01

One of the most used techniques to study structural variation at a genome level is paired-end mapping (PEM). PEM has the advantage of being able to detect balanced events, such as inversions and translocations. However, inversions are still quite difficult to predict reliably, especially from high-throughput sequencing data. We simulated realistic PEM experiments with different combinations of read and library fragment lengths, including sequencing errors and meaningful base-qualities, to quantify and track down the origin of false positives and negatives along sequencing, mapping, and downstream analysis. We show that PEM is very appropriate to detect a wide range of inversions, even with low coverage data. However, % of inversions located between segmental duplications are expected to go undetected by the most common sequencing strategies. In general, longer DNA libraries improve the detectability of inversions far better than increments of the coverage depth or the read length. Finally, we review the performance of three algorithms to detect inversions —SVDetect, GRIAL, and VariationHunter—, identify common pitfalls, and reveal important differences in their breakpoint precisions. These results stress the importance of the sequencing strategy for the detection of structural variants, especially inversions, and offer guidelines for the design of future genome sequencing projects. PMID:23637806
Predicting discovery rates of genomic features.

PubMed

Gravel, Simon

2014-06-01

Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types. Copyright © 2014 by the Genetics Society of America.
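To give a feel for jackknife-style extrapolation from a pilot sample, the sketch below implements the textbook first-order jackknife richness estimator, S = S_obs + f1 (n - 1) / n, where f1 is the number of variants seen in exactly one sampled individual. This is a swapped-in classical estimator chosen for illustration, not the jackknife or linear programming method developed in the paper, and the input format is an assumption.

```python
def jackknife1_total_variants(carrier_counts, n_samples):
    """First-order jackknife estimate of the total number of variants.

    carrier_counts: for each candidate site, how many of the n_samples
    sequenced individuals carry the variant (hypothetical input format).
    """
    s_obs = sum(1 for c in carrier_counts if c > 0)   # variants observed
    f1 = sum(1 for c in carrier_counts if c == 1)     # seen in one sample
    return s_obs + f1 * (n_samples - 1) / n_samples


# Toy pilot study: 10 individuals, six candidate sites
estimate = jackknife1_total_variants([1, 1, 2, 5, 0, 3], n_samples=10)
```

The intuition is that many variants seen only once in the pilot suggest many more that were not seen at all, so the estimate rises with the singleton count.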
A parametric approach to irregular fatigue prediction

NASA Technical Reports Server (NTRS)

Erismann, T. H.

1972-01-01

A parametric approach to irregular fatigue prediction is presented. The method proposed consists of two parts: empirical determination of certain characteristics of a material by means of a relatively small number of well-defined standard tests, and arithmetical application of the results obtained to arbitrary loading histories. The following groups of parameters are thus taken into account: (1) the variations of the mean stress, (2) the interaction of these variations and the superposed oscillating stresses, (3) the spectrum of the oscillating-stress amplitudes, and (4) the sequence of the oscillating-stress amplitudes. It is pointed out that only experimental verification can throw sufficient light upon possibilities and limitations of this (or any other) prediction method.
Genetic Variation and Its Reflection on Posttranslational Modifications in Frequency Clock and Mating Type a-1 Proteins in Sordaria fimicola

PubMed Central

Arif, Rabia; Akram, Faiza; Jamil, Tazeen; Lee, Siu Fai

2017-01-01

Posttranslational modifications (PTMs) occur in all essential proteins taking command of their functions. There are many domains inside proteins where modifications take place on side-chains of amino acids through various enzymes to generate different species of proteins. In this manuscript we have, for the first time, predicted posttranslational modifications of frequency clock and mating type a-1 proteins in Sordaria fimicola collected from different sites to see the effect of environment on proteins or various amino acids pickings and their ultimate impact on consensus sequences present in mating type proteins using bioinformatics tools. Furthermore, we have also measured and walked through genomic DNA of various Sordaria strains to determine genetic diversity by genotyping the short sequence repeats (SSRs) of wild strains of S. fimicola collected from contrasting environments of two opposing slopes (harsh and xeric south facing slope and mild north facing slope) of Evolution Canyon (EC), Israel. Based on the whole genome sequence of S. macrospora, we targeted 20 genomic regions in S. fimicola which contain short sequence repeats (SSRs). Our data revealed genetic variations in strains from south facing slope and these findings assist in the hypothesis that genetic variations caused by stressful environments lead to evolution. PMID:28717646
Genetic Variation and Its Reflection on Posttranslational Modifications in Frequency Clock and Mating Type a-1 Proteins in Sordaria fimicola.

PubMed

Arif, Rabia; Akram, Faiza; Jamil, Tazeen; Mukhtar, Hamid; Lee, Siu Fai; Saleem, Muhammad

2017-01-01

Posttranslational modifications (PTMs) occur in all essential proteins taking command of their functions. There are many domains inside proteins where modifications take place on side-chains of amino acids through various enzymes to generate different species of proteins. In this manuscript we have, for the first time, predicted posttranslational modifications of frequency clock and mating type a-1 proteins in Sordaria fimicola collected from different sites to see the effect of environment on proteins or various amino acids pickings and their ultimate impact on consensus sequences present in mating type proteins using bioinformatics tools. Furthermore, we have also measured and walked through genomic DNA of various Sordaria strains to determine genetic diversity by genotyping the short sequence repeats (SSRs) of wild strains of S. fimicola collected from contrasting environments of two opposing slopes (harsh and xeric south facing slope and mild north facing slope) of Evolution Canyon (EC), Israel. Based on the whole genome sequence of S. macrospora, we targeted 20 genomic regions in S. fimicola which contain short sequence repeats (SSRs). Our data revealed genetic variations in strains from south facing slope and these findings assist in the hypothesis that genetic variations caused by stressful environments lead to evolution.
Association of levels of fasting glucose and insulin with rare variants at the chromosome 11p11.2-MADD locus: Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium Targeted Sequencing Study.

PubMed

Cornes, Belinda K; Brody, Jennifer A; Nikpoor, Naghmeh; Morrison, Alanna C; Chu, Huan; Ahn, Byung Soo; Wang, Shuai; Dauriz, Marco; Barzilay, Joshua I; Dupuis, Josée; Florez, Jose C; Coresh, Josef; Gibbs, Richard A; Kao, W H Linda; Liu, Ching-Ti; McKnight, Barbara; Muzny, Donna; Pankow, James S; Reid, Jeffrey G; White, Charles C; Johnson, Andrew D; Wong, Tien Y; Psaty, Bruce M; Boerwinkle, Eric; Rotter, Jerome I; Siscovick, David S; Sladek, Robert; Meigs, James B

2014-06-01

Common variation at the 11p11.2 locus, encompassing MADD, ACP2, NR1H3, MYBPC3, and SPI1, has been associated in genome-wide association studies with fasting glucose and insulin (FI). In the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study, we sequenced 5 gene regions at 11p11.2 to identify rare, potentially functional variants influencing fasting glucose or FI levels. Sequencing (mean depth, 38×) across 16.1 kb in 3566 individuals without diabetes mellitus identified 653 variants, 79.9% of which were rare (minor allele frequency <1%) and novel. We analyzed rare variants in 5 gene regions with FI or fasting glucose using the sequence kernel association test. At NR1H3, 53 rare variants were jointly associated with FI (P=2.73×10⁻³); of these, 7 were predicted to have regulatory function and showed association with FI (P=1.28×10⁻³). Conditioning on 2 previously associated variants at MADD (rs7944584, rs10838687) did not attenuate this association, suggesting that there are >2 independent signals at 11p11.2. One predicted regulatory variant, chr11:47227430 (hg18; minor allele frequency=0.00068), contributed 20.6% to the overall sequence kernel association test score at NR1H3, lies in intron 2 of NR1H3, and is a predicted binding site for forkhead box A1 (FOXA1), a transcription factor associated with insulin regulation. In human HepG2 hepatoma cells, the rare chr11:47227430 A allele disrupted FOXA1 binding and reduced FOXA1-dependent transcriptional activity. Sequencing at 11p11.2-NR1H3 identified rare variation associated with FI. One variant, chr11:47227430, seems to be functional, with the rare A allele reducing transcription factor FOXA1 binding and FOXA1-dependent transcriptional activity. © 2014 American Heart Association, Inc.
Structural details (kinks and non-α conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors

PubMed Central

Rigoutsos, Isidore; Riek, Peter; Graham, Robert M.; Novotny, Jiri

2003-01-01

One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular α-helical character (i.e. π-helices, 3₁₀-helices and kinks). A ‘search engine’ derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above ‘non-canonical’ helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from α-helicity are encoded locally in sequence patterns only about 7–9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure–function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html. PMID:12888523
Structural details (kinks and non-alpha conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors.

PubMed

Rigoutsos, Isidore; Riek, Peter; Graham, Robert M; Novotny, Jiri

2003-08-01

One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular alpha-helical character (i.e. pi-helices, 3₁₀-helices and kinks). A 'search engine' derived from these motif descriptors correctly identifies, and discriminates amongst instances of the above 'non-canonical' helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from alpha-helicity are encoded locally in sequence patterns only about 7-9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure-function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html.
Targeted next generation sequencing of the entire vitamin D receptor gene reveals polymorphisms correlated with vitamin D deficiency among older Filipino women with and without fragility fracture.

PubMed

Zumaraga, Mark Pretzel; Medina, Paul Julius; Recto, Juan Miguel; Abrahan, Lauro; Azurin, Edelyn; Tanchoco, Celeste C; Jimeno, Cecilia A; Palmes-Saloma, Cynthia

2017-03-01

This study aimed to discover genetic variants in the entire 101-kb vitamin D receptor (VDR) gene for vitamin D deficiency in a group of postmenopausal Filipino women using a targeted next generation sequencing (TNGS) approach in a case-control study design. A total of 50 women with and without osteoporotic fracture seen at the Philippine Orthopedic Center were included. Blood samples were collected for determination of serum vitamin D, calcium, phosphorus, glucose, blood urea nitrogen, creatinine, aspartate aminotransferase, alanine aminotransferase and as the primary source for targeted VDR gene sequencing using the Ion Torrent Personal Genome Machine. The variant calling was based on the GATK best practice workflow and annotated using the Annovar tool. A total of 1496 unique variants in the whole 101-kb VDR gene were identified. Novel sequence variations not registered in the dbSNP database were found among cases and controls at a rate of 23.1% and 16.6% of total discovered variants, respectively. One disease-associated enhancer showed a statistically significant association with low serum 25-hydroxy vitamin D levels (Pearson chi-square P-value=0.009). The transcription factor binding site prediction program PROMO predicted the disruption of three transcription factor binding sites in this enhancer region. These findings show the power of TNGS in identifying sequence variations in a very large gene and the surprising results obtained in this study greatly expand the catalog of known VDR sequence variants that may represent an important clue in the emergence of vitamin D deficiency. Such information will also provide the additional guidance necessary toward personalized nutritional advice to reach sufficient vitamin D status. Copyright © 2016 Elsevier Inc. All rights reserved.

A mathematical model for computer image tracking.

PubMed

Legters, G R; Young, T Y

1982-06-01

A mathematical model using an operator formulation for a moving object in a sequence of images is presented. Time-varying translation and rotation operators are derived to describe the motion. A variational estimation algorithm is developed to track the dynamic parameters of the operators. The occlusion problem is alleviated by using a predictive Kalman filter to keep the tracking on course during severe occlusion. The tracking algorithm (variational estimation in conjunction with Kalman filter) is implemented to track moving objects with occasional occlusion in computer-simulated binary images.
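The role of the predictive Kalman filter during occlusion can be illustrated with a minimal sketch. It is not the authors' operator formulation or their variational estimator: it assumes a one-dimensional constant-velocity motion model, and the class name `Kalman1D`, the function `track`, and the noise values `q` and `r` are all hypothetical choices made for illustration. Occluded frames are represented by `None`, in which case only the prediction step runs.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    """Constant-velocity Kalman filter on one coordinate (illustrative assumption)."""
    pos: float = 0.0
    vel: float = 0.0
    # Covariance matrix P stored element-wise: [[p00, p01], [p10, p11]].
    p00: float = 1.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0
    q: float = 1e-3  # process noise added to each diagonal term (assumed value)
    r: float = 0.25  # measurement noise variance (assumed value)

    def predict(self, dt: float = 1.0) -> None:
        # State: x <- F x with F = [[1, dt], [0, 1]].
        self.pos += self.vel * dt
        # Covariance: P <- F P F^T + Q.
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z: float) -> None:
        # Only the position is observed: H = [1, 0].
        s = self.p00 + self.r            # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        y = z - self.pos                 # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        # Covariance: P <- (I - K H) P.
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


def track(measurements):
    """Return one position estimate per frame; None marks an occluded frame."""
    kf = Kalman1D()
    estimates = []
    for z in measurements:
        kf.predict()
        if z is not None:  # during occlusion the filter coasts on its prediction
            kf.update(z)
        estimates.append(kf.pos)
    return estimates
```

For example, feeding `track` ten visible frames of an object moving two units per frame followed by five `None` entries keeps the estimate moving along the learned velocity instead of freezing at the last observed position.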
Identification of Putative Transmembrane Proteins Involved in Salinity Tolerance in Chenopodium quinoa by Integrating Physiological Data, RNAseq, and SNP Analyses

PubMed Central

Schmöckel, Sandra M.; Lightfoot, Damien J.; Razali, Rozaimi; Tester, Mark; Jarvis, David E.

2017-01-01

Chenopodium quinoa (quinoa) is an emerging crop that produces nutritious grains with the potential to contribute to global food security. Quinoa can also grow on marginal lands, such as soils affected by high salinity. To identify candidate salt tolerance genes in the recently sequenced quinoa genome, we used a multifaceted approach integrating RNAseq analyses with comparative genomics and topology prediction. We identified 219 candidate genes by selecting those that were differentially expressed in response to salinity, were specific to or overrepresented in quinoa relative to other Amaranthaceae species, and had more than one predicted transmembrane domain. To determine whether these genes might underlie variation in salinity tolerance in quinoa and its close relatives, we compared the response to salinity stress in a panel of 21 Chenopodium accessions (14 C. quinoa, 5 C. berlandieri, and 2 C. hircinum). We found large variation in salinity tolerance, with one C. hircinum displaying the highest salinity tolerance. Using genome re-sequencing data from these accessions, we investigated single nucleotide polymorphisms and copy number variation (CNV) in the 219 candidate genes in accessions of contrasting salinity tolerance, and identified 15 genes that could contribute to the differences in salinity tolerance of these Chenopodium accessions. PMID:28680429
Ribosomal DNA sequence heterogeneity reflects intraspecies phylogenies and predicts genome structure in two contrasting yeast species.

PubMed

West, Claire; James, Stephen A; Davey, Robert P; Dicks, Jo; Roberts, Ian N

2014-07-01

The ribosomal RNA encapsulates a wealth of evolutionary information, including genetic variation that can be used to discriminate between organisms at a wide range of taxonomic levels. For example, the prokaryotic 16S rDNA sequence is very widely used both in phylogenetic studies and as a marker in metagenomic surveys and the internal transcribed spacer region, frequently used in plant phylogenetics, is now recognized as a fungal DNA barcode. However, this widespread use does not escape criticism, principally due to issues such as difficulties in classification of paralogous versus orthologous rDNA units and intragenomic variation, both of which may be significant barriers to accurate phylogenetic inference. We recently analyzed data sets from the Saccharomyces Genome Resequencing Project, characterizing rDNA sequence variation within multiple strains of the baker's yeast Saccharomyces cerevisiae and its nearest wild relative Saccharomyces paradoxus in unprecedented detail. Notably, both species possess single locus rDNA systems. Here, we use these new variation datasets to assess whether a more detailed characterization of the rDNA locus can alleviate the second of these phylogenetic issues, sequence heterogeneity, while controlling for the first. We demonstrate that a strong phylogenetic signal exists within both datasets and illustrate how they can be used, with existing methodology, to estimate intraspecies phylogenies of yeast strains consistent with those derived from whole-genome approaches. We also describe the use of partial Single Nucleotide Polymorphisms, a type of sequence variation found only in repetitive genomic regions, in identifying key evolutionary features such as genome hybridization events and show their consistency with whole-genome Structure analyses. 
We conclude that our approach can transform rDNA sequence heterogeneity from a problem to a useful source of evolutionary information, enabling the estimation of highly accurate phylogenies of closely related organisms, and discuss how it could be extended to future studies of multilocus rDNA systems. [concerted evolution; genome hybridisation; phylogenetic analysis; ribosomal DNA; whole genome sequencing; yeast]. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.
Tertiary alphabet for the observable protein structural universe.

PubMed

Mackenzie, Craig O; Zhou, Jianfu; Grigoryan, Gevorg

2016-11-22

Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, more rare geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence, a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure.
Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences.

PubMed

Bergman, C M; Kreitman, M

2001-08-01

Comparative genomic approaches to gene and cis-regulatory prediction are based on the principle that differential DNA sequence conservation reflects variation in functional constraint. Using this principle, we analyze noncoding sequence conservation in Drosophila for 40 loci with known or suspected cis-regulatory function encompassing >100 kb of DNA. We estimate the fraction of noncoding DNA conserved in both intergenic and intronic regions and describe the length distribution of ungapped conserved noncoding blocks. On average, 22%-26% of noncoding sequences surveyed are conserved in Drosophila, with median block length approximately 19 bp. We show that point substitution in conserved noncoding blocks exhibits transition bias as well as lineage effects in base composition, and occurs more than an order of magnitude more frequently than insertion/deletion (indel) substitution. Overall, patterns of noncoding DNA structure and evolution differ remarkably little between intergenic and intronic conserved blocks, suggesting that the effects of transcription per se contribute minimally to the constraints operating on these sequences. The results of this study have implications for the development of alignment and prediction algorithms specific to noncoding DNA, as well as for models of cis-regulatory DNA sequence evolution.
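The block-length statistics mentioned above can be sketched in a few lines. The abstract does not give the authors' exact block definition, so this example assumes the simplest one: a conserved block is a maximal run of alignment columns in which both sequences carry the same base and neither has a gap. The function name `conserved_blocks` and the `min_len` threshold are hypothetical.

```python
from statistics import median


def conserved_blocks(seq_a: str, seq_b: str, min_len: int = 1) -> list[int]:
    """Lengths of maximal ungapped runs of identical columns in a pairwise alignment.

    seq_a and seq_b are aligned sequences of equal length, with '-' marking gaps.
    Runs shorter than min_len are discarded (min_len is an assumed threshold).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    blocks = []
    run = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b and a != "-":
            run += 1          # identical, ungapped column extends the block
        else:
            if run >= min_len:
                blocks.append(run)
            run = 0           # a mismatch or gap ends the block
    if run >= min_len:
        blocks.append(run)    # close a block that reaches the alignment end
    return blocks


def conserved_fraction(seq_a: str, seq_b: str, min_len: int = 1) -> float:
    """Fraction of alignment columns that fall inside conserved blocks."""
    return sum(conserved_blocks(seq_a, seq_b, min_len)) / len(seq_a)
```

Passing the resulting list to `statistics.median` gives a median block length of the kind reported in the abstract, under the stated block definition.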
Software for optimization of SNP and PCR-RFLP genotyping to discriminate many genomes with the fewest assays

PubMed Central

Gardner, Shea N; Wagner, Mark C

2005-01-01

Background Microbial forensics is important in tracking the source of a pathogen, whether the disease is a naturally occurring outbreak or part of a criminal investigation. Results A method and SPR Opt (SNP and PCR-RFLP Optimization) software to perform a comprehensive, whole-genome analysis to forensically discriminate multiple sequences is presented. Tools for the optimization of forensic typing using Single Nucleotide Polymorphism (SNP) and PCR-Restriction Fragment Length Polymorphism (PCR-RFLP) analyses across multiple isolate sequences of a species are described. The PCR-RFLP analysis includes prediction and selection of optimal primers and restriction enzymes to enable maximum isolate discrimination based on sequence information. SPR Opt calculates all SNP or PCR-RFLP variations present in the sequences, groups them into haplotypes according to their co-segregation across those sequences, and performs combinatoric analyses to determine which sets of haplotypes provide maximal discrimination among all the input sequences. Those set combinations requiring that membership in the fewest haplotypes be queried (i.e. the fewest assays be performed) are found. These analyses highlight variable regions based on existing sequence data. These markers may be heterogeneous among unsequenced isolates as well, and thus may be useful for characterizing the relationships among unsequenced as well as sequenced isolates. The predictions are multi-locus. Analyses of mumps and SARS viruses are summarized. Phylogenetic trees created based on SNPs, PCR-RFLPs, and full genomes are compared for SARS virus, illustrating that purported phylogenies based only on SNP or PCR-RFLP variations do not match those based on multiple sequence alignment of the full genomes. Conclusion This is the first software to optimize the selection of forensic markers to maximize information gained from the fewest assays, accepting whole or partial genome sequence data as input. 
As more sequence data becomes available for multiple strains and isolates of a species, automated, computational approaches such as those described here will be essential to make sense of large amounts of information, and to guide and optimize efforts in the laboratory. The software and source code for SPR Opt are publicly available and free for non-profit use at . PMID:15904493
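The grouping and selection steps described in this abstract can be sketched as follows. This is not the SPR Opt code: it is a small illustration that assumes each marker is summarized as a tuple of allele states, one per isolate, treats markers with identical tuples as one co-segregating group, and then searches exhaustively for the smallest sets of groups that give every isolate a unique signature. The function names and the marker labels in the usage note are hypothetical, and the exhaustive search is only practical for small inputs.

```python
from itertools import combinations


def group_markers(profiles: dict) -> dict:
    """Group markers that co-segregate, i.e. share the same per-isolate pattern.

    profiles maps a marker name to a tuple of allele states, one per isolate.
    Returns a dict mapping each distinct pattern to the markers that show it.
    """
    groups = {}
    for marker, pattern in profiles.items():
        groups.setdefault(tuple(pattern), []).append(marker)
    return groups


def smallest_discriminating_sets(groups: dict, n_isolates: int) -> list:
    """Return all minimum-size combinations of patterns that separate every isolate."""
    patterns = list(groups)
    for k in range(1, len(patterns) + 1):
        hits = []
        for combo in combinations(patterns, k):
            # An isolate's signature is its allele state under each chosen pattern.
            signatures = {tuple(p[i] for p in combo) for i in range(n_isolates)}
            if len(signatures) == n_isolates:
                hits.append(combo)
        if hits:
            return hits  # stop at the first size that works: fewest assays
    return []
```

With four isolates and markers `snp1` and `snp2` both showing `(0, 0, 1, 1)`, `snp3` showing `(0, 1, 0, 1)`, and `snp4` showing `(0, 1, 1, 1)`, the first two markers collapse into one group and a single pair of groups is enough to tell all four isolates apart.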
The science of tobacco addiction and cessation.

PubMed

2014-01-01

Over the past decade, researchers have found genetic variations that affect how nicotine, the main addictive component of tobacco, interacts with cells in the brain and how fast the body metabolizes it. Carrying a high-risk variant predicts a person's ability to snuff out their cigarettes for good. Genetic testing could help predict which smokers might benefit from nicotine replacement therapy or other prescription medications, much like sequencing of malignant tumors can point to the most effective cancer treatment.
MICRA: an automatic pipeline for fast characterization of microbial genomes from high-throughput sequencing data.

PubMed

Caboche, Ségolène; Even, Gaël; Loywick, Alexandre; Audebert, Christophe; Hot, David

2017-12-19

The increase in available sequence data has advanced the field of microbiology; however, making sense of these data without bioinformatics skills is still problematic. We describe MICRA, an automatic pipeline, available as a web interface, for microbial identification and characterization through reads analysis. MICRA uses iterative mapping against reference genomes to identify genes and variations. Additional modules allow prediction of antibiotic susceptibility and resistance and comparing the results of several samples. MICRA is fast, producing few false-positive annotations and variant calls compared to current methods, making it a tool of great interest for fully exploiting sequencing data.
Saccharomyces cerevisiae: gene annotation and genome variability, state of the art through comparative genomics.

PubMed

Louis, Ed

2011-01-01

In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.
Computational Prediction of miRNA Genes from Small RNA Sequencing Data

PubMed Central

Kang, Wenjing; Friedländer, Marc R.

2015-01-01

Next-generation sequencing now for the first time allows researchers to gauge the depth and variation of entire transcriptomes. However, now that rare transcripts present in cells at single copies can be detected, more advanced computational tools are needed to accurately annotate and profile them. microRNAs (miRNAs) are 22-nucleotide small RNAs (sRNAs) that post-transcriptionally reduce the output of protein-coding genes. They have established roles in numerous biological processes, including cancers and other diseases. During miRNA biogenesis, the sRNAs are sequentially cleaved from precursor molecules that have a characteristic hairpin RNA structure. The vast majority of new miRNA genes that are discovered are mined from small RNA sequencing (sRNA-seq), which can detect more than a billion RNAs in a single run. However, given that many of the detected RNAs are degradation products from all types of transcripts, the accurate identification of miRNAs remains a non-trivial computational problem. Here, we review the tools available to predict animal miRNAs from sRNA sequencing data. We present tools for generalist and specialist use cases, including prediction from massively pooled data or in species without a reference genome. We also present wet-lab methods used to validate predicted miRNAs, and approaches to computationally benchmark prediction accuracy. For each tool, we reference validation experiments and benchmarking efforts. Last, we discuss the future of the field. PMID:25674563
Deciphering the distance to antibiotic resistance for the pneumococcus using genome sequencing data

PubMed Central

Mobegi, Fredrick M.; Cremers, Amelieke J. H.; de Jonge, Marien I.; Bentley, Stephen D.; van Hijum, Sacha A. F. T.; Zomer, Aldert

2017-01-01

Advances in genome sequencing technologies and genome-wide association studies (GWAS) have provided unprecedented insights into the molecular basis of microbial phenotypes and enabled the identification of the underlying genetic variants in real populations. However, utilization of genome sequencing in clinical phenotyping of bacteria is challenging due to the lack of reliable and accurate approaches. Here, we report a method for predicting microbial resistance patterns using genome sequencing data. We analyzed whole genome sequences of 1,680 Streptococcus pneumoniae isolates from four independent populations using GWAS and identified probable hotspots of genetic variation which correlate with phenotypes of resistance to essential classes of antibiotics. With the premise that accumulation of putative resistance-conferring SNPs, potentially in combination with specific resistance genes, precedes full resistance, we retrogressively surveyed the hotspot loci and quantified the number of SNPs and/or genes, which if accumulated would confer full resistance to an otherwise susceptible strain. We name this approach the ‘distance to resistance’. It can be used to identify the creep towards complete antibiotic resistance in bacteria using genome sequencing. This approach serves as a basis for the development of future sequencing-based methods for predicting resistance profiles of bacterial strains in hospital microbiology and public health settings. PMID:28205635
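The counting idea behind the 'distance to resistance' can be sketched as a set difference. The abstract does not specify how the authors score or weight individual SNPs and genes, so this example assumes the simplest reading: the distance is the number of elements of a resistance-associated profile that a strain has not yet acquired. The function name and the variant identifiers in the usage note are hypothetical placeholders, not real pneumococcal loci.

```python
def distance_to_resistance(strain_variants, resistance_profile):
    """Count the resistance-associated SNPs/genes a strain still lacks.

    strain_variants: iterable of variant or gene identifiers found in the strain.
    resistance_profile: iterable of identifiers assumed to confer full resistance
        once all of them have accumulated (an illustrative simplification).
    Returns (distance, missing) where missing is a sorted list of identifiers.
    """
    missing = sorted(set(resistance_profile) - set(strain_variants))
    return len(missing), missing


def rank_strains(strains: dict, resistance_profile) -> list:
    """Order strain names from closest to farthest from the resistance profile."""
    return sorted(
        strains,
        key=lambda name: distance_to_resistance(strains[name], resistance_profile)[0],
    )
```

For a made-up profile `{"locusA:101", "locusA:250", "geneX"}`, a strain carrying only `locusA:101` has a distance of 2, and a strain carrying all three elements has a distance of 0.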
A framework for establishing predictive relationships between specific bacterial 16S rRNA sequence abundances and biotransformation rates.

PubMed

Helbling, Damian E; Johnson, David R; Lee, Tae Kwon; Scheidegger, Andreas; Fenner, Kathrin

2015-03-01

The rates at which wastewater treatment plant (WWTP) microbial communities biotransform specific substrates can differ by orders of magnitude among WWTP communities. Differences in taxonomic compositions among WWTP communities may predict differences in the rates of some types of biotransformations. In this work, we present a novel framework for establishing predictive relationships between specific bacterial 16S rRNA sequence abundances and biotransformation rates. We selected ten WWTPs with substantial variation in their environmental and operational metrics and measured the in situ ammonia biotransformation rate constants in nine of them. We isolated total RNA from samples from each WWTP and analyzed 16S rRNA sequence reads. We then developed multivariate models between the measured abundances of specific bacterial 16S rRNA sequence reads and the ammonia biotransformation rate constants. We constructed model scenarios that systematically explored the effects of model regularization, model linearity and non-linearity, and aggregation of 16S rRNA sequences into operational taxonomic units (OTUs) as a function of sequence dissimilarity threshold (SDT). A large percentage (greater than 80%) of model scenarios resulted in well-performing and significant models at intermediate SDTs of 0.13-0.14 and 0.26. The 16S rRNA sequences consistently selected into the well-performing and significant models at those SDTs were classified as Nitrosomonas and Nitrospira groups. We then extend the framework by applying it to the biotransformation rate constants of ten micropollutants measured in batch reactors seeded with the ten WWTP communities. We identified phylogenetic groups that were robustly selected into all well-performing and significant models constructed with biotransformation rates of isoproturon, propachlor, ranitidine, and venlafaxine. 
These phylogenetic groups can be used as predictive biomarkers of WWTP microbial community activity towards these specific micropollutants. This work is an important step towards developing tools to predict biotransformation rates in WWTPs based on taxonomic composition. Copyright © 2014 Elsevier Ltd. All rights reserved.
Analysis of protein-coding genetic variation in 60,706 humans.

PubMed

Lek, Monkol; Karczewski, Konrad J; Minikel, Eric V; Samocha, Kaitlin E; Banks, Eric; Fennell, Timothy; O'Donnell-Luria, Anne H; Ware, James S; Hill, Andrew J; Cummings, Beryl B; Tukiainen, Taru; Birnbaum, Daniel P; Kosmicki, Jack A; Duncan, Laramie E; Estrada, Karol; Zhao, Fengmei; Zou, James; Pierce-Hoffman, Emma; Berghout, Joanne; Cooper, David N; Deflaux, Nicole; DePristo, Mark; Do, Ron; Flannick, Jason; Fromer, Menachem; Gauthier, Laura; Goldstein, Jackie; Gupta, Namrata; Howrigan, Daniel; Kiezun, Adam; Kurki, Mitja I; Moonshine, Ami Levy; Natarajan, Pradeep; Orozco, Lorena; Peloso, Gina M; Poplin, Ryan; Rivas, Manuel A; Ruano-Rubio, Valentin; Rose, Samuel A; Ruderfer, Douglas M; Shakir, Khalid; Stenson, Peter D; Stevens, Christine; Thomas, Brett P; Tiao, Grace; Tusie-Luna, Maria T; Weisburd, Ben; Won, Hong-Hee; Yu, Dongmei; Altshuler, David M; Ardissino, Diego; Boehnke, Michael; Danesh, John; Donnelly, Stacey; Elosua, Roberto; Florez, Jose C; Gabriel, Stacey B; Getz, Gad; Glatt, Stephen J; Hultman, Christina M; Kathiresan, Sekar; Laakso, Markku; McCarroll, Steven; McCarthy, Mark I; McGovern, Dermot; McPherson, Ruth; Neale, Benjamin M; Palotie, Aarno; Purcell, Shaun M; Saleheen, Danish; Scharf, Jeremiah M; Sklar, Pamela; Sullivan, Patrick F; Tuomilehto, Jaakko; Tsuang, Ming T; Watkins, Hugh C; Wilson, James G; Daly, Mark J; MacArthur, Daniel G

2016-08-18

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
DB2: a probabilistic approach for accurate detection of tandem duplication breakpoints using paired-end reads.

PubMed

Yavaş, Gökhan; Koyutürk, Mehmet; Gould, Meetha P; McMahon, Sarah; LaFramboise, Thomas

2014-03-05

With the advent of paired-end high throughput sequencing, it is now possible to identify various types of structural variation on a genome-wide scale. Although many methods have been proposed for structural variation detection, most do not provide precise boundaries for identified variants. In this paper, we propose a new method, Distribution Based detection of Duplication Boundaries (DB2), for accurate detection of tandem duplication breakpoints, an important class of structural variation, with high precision and recall. Our computational experiments on simulated data show that DB2 outperforms state-of-the-art methods in terms of finding breakpoints of tandem duplications, with a higher positive predictive value (precision) in calling the duplications' presence. In particular, DB2's prediction of tandem duplications is correct 99% of the time even for very noisy data, while narrowing down the space of possible breakpoints within a margin of 15 to 20 bps on the average. Most of the existing methods provide boundaries in ranges that extend to hundreds of bases with lower precision values. Our method is also highly robust to varying properties of the sequencing library and to the sizes of the tandem duplications, as shown by its stable precision, recall and mean boundary mismatch performance. We demonstrate our method's efficacy using both simulated paired-end reads, and those generated from a melanoma sample and two ovarian cancer samples. Newly discovered tandem duplications are validated using PCR and Sanger sequencing. Our method, DB2, uses discordantly aligned reads, taking into account the distribution of fragment length to predict tandem duplications along with their breakpoints on a donor genome. The proposed method fine tunes the breakpoint calls by applying a novel probabilistic framework that incorporates the empirical fragment length distribution to score each feasible breakpoint. 
DB2 is implemented in the Java programming language and is freely available at http://mendel.gene.cwru.edu/laframboiselab/software.php.
Aggregation of population‐based genetic variation over protein domain homologues and its potential use in genetic diagnostics

PubMed Central

Wiel, Laurens; Venselaar, Hanka; Veltman, Joris A.; Vriend, Gert

2017-01-01

Abstract Whole exomes of patients with a genetic disorder are nowadays routinely sequenced but interpretation of the identified genetic variants remains a major challenge. The increased availability of population‐based human genetic variation has given rise to measures of genetic tolerance that have been used, for example, to predict disease‐causing genes in neurodevelopmental disorders. Here, we investigated whether combining variant information from homologous protein domains can improve variant interpretation. For this purpose, we developed a framework that maps population variation and known pathogenic mutations onto 2,750 “meta‐domains.” These meta‐domains consist of 30,853 homologous Pfam protein domain instances that cover 36% of all human protein coding sequences. We find that genetic tolerance is consistent across protein domain homologues, and that patterns of genetic tolerance faithfully mimic patterns of evolutionary conservation. Furthermore, for a significant fraction (68%) of the meta‐domains high‐frequency population variation re‐occurs at the same positions across domain homologues more often than expected. In addition, we observe that the presence of pathogenic missense variants at an aligned homologous domain position is often paired with the absence of population variation and vice versa. The use of these meta‐domains can improve the interpretation of genetic variation. PMID:28815929
An unsupervised classification scheme for improving predictions of prokaryotic TIS.

PubMed

Tech, Maike; Meinicke, Peter

2006-03-09

Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes. We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum. On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool "TICO" (TIs COrrector) which is publicly available from our web site.
Glycosylation Focuses Sequence Variation in the Influenza A Virus H1 Hemagglutinin Globular Domain

PubMed Central

Hensley, Scott E.; Hurt, Darrell E.; Bennink, Jack R.; Yewdell, Jonathan W.

2010-01-01

Antigenic drift in the influenza A virus hemagglutinin (HA) is responsible for seasonal reformulation of influenza vaccines. Here, we address an important and largely overlooked issue in antigenic drift: how does the number and location of glycosylation sites affect HA evolution in man? We analyzed the glycosylation status of all full-length H1 subtype HA sequences available in the NCBI influenza database. We devised the “flow index” (FI), a simple algorithm that calculates the tendency for viruses to gain or lose consensus glycosylation sites. The FI predicts the predominance of glycosylation states among existing strains. Our analyses show that while the number of glycosylation sites in the HA globular domain does not influence the overall magnitude of variation in defined antigenic regions, variation focuses on those regions unshielded by glycosylation. This supports the conclusion that glycosylation generally shields HA from antibody-mediated neutralization, and implies that fitness costs in accommodating oligosaccharides limit virus escape via HA hyperglycosylation. PMID:21124818
Use of four next-generation sequencing platforms to determine HIV-1 coreceptor tropism.

PubMed

Archer, John; Weber, Jan; Henry, Kenneth; Winner, Dane; Gibson, Richard; Lee, Lawrence; Paxinos, Ellen; Arts, Eric J; Robertson, David L; Mimms, Larry; Quiñones-Mateu, Miguel E

2012-01-01

HIV-1 coreceptor tropism assays are required to rule out the presence of CXCR4-tropic (non-R5) viruses prior to treatment with CCR5 antagonists. Phenotypic (e.g., Trofile™, Monogram Biosciences) and genotypic (e.g., population sequencing linked to bioinformatic algorithms) assays are the most widely used. Although several next-generation sequencing (NGS) platforms are available, to date all published deep sequencing HIV-1 tropism studies have used the 454™ Life Sciences/Roche platform. In this study, HIV-1 coreceptor usage was predicted for twelve patients scheduled to start a maraviroc-based antiretroviral regimen. The V3 region of the HIV-1 env gene was sequenced using four NGS platforms: 454™, PacBio® RS (Pacific Biosciences), Illumina®, and Ion Torrent™ (Life Technologies). Cross-platform variation was evaluated, including number of reads, read length and error rates. HIV-1 tropism was inferred using Geno2Pheno, Web PSSM, and the 11/24/25 rule and compared with Trofile™ and virologic response to antiretroviral therapy. Error rates related to insertions/deletions (indels) and nucleotide substitutions introduced by the four NGS platforms were low compared to the actual HIV-1 sequence variation. Each platform detected all major virus variants within the HIV-1 population with similar frequencies. Identification of non-R5 viruses was comparable among the four platforms, with minor differences attributable to the algorithms used to infer HIV-1 tropism. All NGS platforms showed similar concordance with virologic response to the maraviroc-based regimen (75% to 80% range depending on the algorithm used), compared to Trofile (80%) and population sequencing (70%). In conclusion, all four NGS platforms were able to detect minority non-R5 variants at comparable levels, suggesting that any NGS-based method can be used to predict HIV-1 coreceptor usage.
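Of the three inference methods named in this abstract, the 11/24/25 rule is simple enough to sketch. As commonly stated, it calls a virus non-R5 when a positively charged residue (lysine or arginine) sits at position 11, 24, or 25 of the V3 loop. The snippet below is a minimal illustration under that assumption; the function name and the two example sequences are hypothetical, and it assumes an ungapped 35-residue V3 amino acid sequence, whereas real reads would first need alignment to a reference. It is not the implementation used in the study.

```python
# Sketch of the 11/24/25 rule as commonly stated (assumption, not the study's code):
# a K or R at V3 position 11, 24, or 25 suggests CXCR4 usage (non-R5).
# Positions are 1-based within an ungapped V3 amino acid sequence.

BASIC_RESIDUES = {"K", "R"}
RULE_POSITIONS = (11, 24, 25)


def predict_tropism(v3_aa: str) -> str:
    """Return 'non-R5' if any rule position carries a basic residue, else 'R5'."""
    seq = v3_aa.upper()
    if len(seq) < max(RULE_POSITIONS):
        raise ValueError("V3 sequence too short for the 11/24/25 rule")
    for pos in RULE_POSITIONS:
        if seq[pos - 1] in BASIC_RESIDUES:  # convert 1-based position to index
            return "non-R5"
    return "R5"


# Illustrative sequences only: the second differs from the first by S11R.
example_r5 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
example_x4 = "CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"

print(predict_tropism(example_r5))
print(predict_tropism(example_x4))
```

In a deep-sequencing setting a rule like this would be applied per read or per variant, which is one plausible reason the choice of algorithm, rather than the platform, accounts for the minor differences reported above.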
Analysis of sequence variation among smeDEF multi drug efflux pump genes and flanking DNA from defined 16S rRNA subgroups of clinical Stenotrophomonas maltophilia isolates.

PubMed

Gould, Virginia C; Okazaki, Aki; Howe, Robin A; Avison, Matthew B

2004-08-01

To determine the level of variation in the smeDEF efflux pump and smeT transcriptional regulator genes among three defined 16S rRNA sequence subgroups of clinical Stenotrophomonas maltophilia isolates. smeDEF sequencing used a PCR genome walking approach. Determination of the sequence surrounding smeDEF used a flanking primer PCR method and specific primers anchored in smeD or smeF together with random primers. smeDEF is chromosomal and located in the same position in the chromosome in all three subgroups of isolates. Flanking smeD is a gene, smeT, encoding a putative transcriptional repressor for smeDEF. Variation at these loci among the isolates is considerably lower (up to 10%) than at intrinsic beta-lactamase loci (up to 30%) in the same isolates, implying greater functional constraint. The smeD-smeT intergenic region contains a highly conserved section, which maps with previously predicted promoter/operator regions, and a hypervariable untranslated region, which can be used to subgroup clinical isolates. These data provide further evidence that it is possible to group clinical isolates of the inherently variable species, S. maltophilia, based on genotypic properties. Isolate D457, in which most work concerning smeDEF expression has been performed, does not fall into S. maltophilia subgroup A, which is the most typical.
A survey of single nucleotide polymorphisms identified from whole-genome sequencing and their functional effect in the porcine genome.

PubMed

Keel, B N; Nonneman, D J; Rohrer, G A

2017-08-01

Genetic variants detected from sequence have been used to successfully identify causal variants and map complex traits in several organisms. High and moderate impact variants, those expected to alter or disrupt the protein coded by a gene and those that regulate protein production, likely have a more significant effect on phenotypic variation than do other types of genetic variants. Hence, a comprehensive list of these functional variants would be of considerable interest in swine genomic studies, particularly those targeting fertility and production traits. Whole-genome sequence was obtained from 72 of the founders of an intensely phenotyped experimental swine herd at the U.S. Meat Animal Research Center (USMARC). These animals included all 24 of the founding boars (12 Duroc and 12 Landrace) and 48 Yorkshire-Landrace composite sows. Sequence reads were mapped to the Sscrofa10.2 genome build, resulting in a mean of 6.1 fold (×) coverage per genome. A total of 22 342 915 high confidence SNPs were identified from the sequenced genomes. These included 21 million previously reported SNPs and 79% of the 62 163 SNPs on the PorcineSNP60 BeadChip assay. Variation was detected in the coding sequence or untranslated regions (UTRs) of 87.8% of the genes in the porcine genome: loss-of-function variants were predicted in 504 genes, 10 202 genes contained nonsynonymous variants, 10 773 had variation in UTRs and 13 010 genes contained synonymous variants. Approximately 139 000 SNPs were classified as loss-of-function, nonsynonymous or regulatory, which suggests that over 99% of the variation detected in our pigs could potentially be ignored, allowing us to focus on a much smaller number of functional SNPs during future analyses. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
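The "over 99%" figure in this abstract follows directly from the two counts it reports. A short check, using only those numbers (the variable names are illustrative, and 139 000 is the abstract's own approximation):

```python
# Counts taken from the abstract above.
total_snps = 22_342_915    # high confidence SNPs identified
functional_snps = 139_000  # approx. loss-of-function, nonsynonymous or regulatory

functional_fraction = functional_snps / total_snps
ignorable_fraction = 1 - functional_fraction

print(f"functional: {functional_fraction:.2%}")  # functional: 0.62%
print(f"ignorable:  {ignorable_fraction:.2%}")   # ignorable:  99.38%
```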

PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions

PubMed Central

Brezovský, Jan

2016-01-01

An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2. PMID:27224906
PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions.

PubMed

Bendl, Jaroslav; Musil, Miloš; Štourač, Jan; Zendulka, Jaroslav; Damborský, Jiří; Brezovský, Jan

2016-05-01

An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools' predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
Predicting the binding preference of transcription factors to individual DNA k-mers.

PubMed

Alleyne, Trevis M; Peña-Castillo, Lourdes; Badis, Gwenael; Talukder, Shaheynoor; Berger, Michael F; Gehrke, Andrew R; Philippakis, Anthony A; Bulyk, Martha L; Morris, Quaid D; Hughes, Timothy R

2009-04-15

Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members. We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.
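The nearest-neighbour idea mentioned in this abstract can be sketched as follows: describe each homeodomain by the amino acids at a chosen set of functionally important positions, find the characterized protein that differs from the query at the fewest of those positions, and reuse its k-mer preference profile. The abstract does not specify the distance measure or the residue positions, so the Hamming distance, the five-residue strings, and the two-entry profiles below are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # 8-mer -> relative binding preference (toy values)


def residue_distance(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings of key residues."""
    if len(a) != len(b):
        raise ValueError("residue strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_profile(
    query: str, training: List[Tuple[str, Profile]]
) -> Profile:
    """Return the profile of the training protein closest to the query."""
    if not training:
        raise ValueError("training set is empty")
    _, profile = min(training, key=lambda item: residue_distance(query, item[0]))
    return profile


# Hypothetical training set: (key residues, measured 8-mer preferences).
training = [
    ("RVQNR", {"TAATTAAA": 0.9, "TGACAGCT": 0.1}),
    ("KIQNR", {"TAATTAAA": 0.2, "TGACAGCT": 0.8}),
]

# The query differs from the first entry at one position and from the second at three.
predicted = nearest_neighbour_profile("RVQNK", training)
print(predicted)
```

Restricting the comparison to a handful of positions, rather than the full protein sequence, is what "among functionally important residues" refers to here; which positions to use would have to come from structural or mutational evidence.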
AlloRep: A Repository of Sequence, Structural and Mutagenesis Data for the LacI/GalR Transcription Regulators.

PubMed

Sousa, Filipa L; Parente, Daniel J; Shis, David L; Hessman, Jacob A; Chazelle, Allen; Bennett, Matthew R; Teichmann, Sarah A; Swint-Kruse, Liskin

2016-02-22

Protein families evolve functional variation by accumulating point mutations at functionally important amino acid positions. Homologs in the LacI/GalR family of transcription regulators have evolved to bind diverse DNA sequences and allosteric regulatory molecules. In addition to playing key roles in bacterial metabolism, these proteins have been widely used as a model family for benchmarking structural and functional prediction algorithms. We have collected manually curated sequence alignments for >3000 sequences, in vivo phenotypic and biochemical data for >5750 LacI/GalR mutational variants, and noncovalent residue contact networks for 65 LacI/GalR homolog structures. Using this rich data resource, we compared the noncovalent residue contact networks of the LacI/GalR subfamilies to design and experimentally validate an allosteric mutant of a synthetic LacI/GalR repressor for use in biotechnology. The AlloRep database (freely available at www.AlloRep.org) is a key resource for future evolutionary studies of LacI/GalR homologs and for benchmarking computational predictions of functional change. Copyright © 2015 Elsevier Ltd. All rights reserved.
Tertiary alphabet for the observable protein structural universe

PubMed Central

Mackenzie, Craig O.; Zhou, Jianfu; Grigoryan, Gevorg

2016-01-01

Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, more rare geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence—a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure. PMID:27810958
Widespread Site-Dependent Buffering of Human Regulatory Polymorphism

PubMed Central

Kutyavin, Tanya; Stamatoyannopoulos, John A.

2012-01-01

The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF–binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein–DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human–chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of “perfect” genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements. PMID:22457641
Translating natural genetic variation to gene expression in a computational model of the Drosophila gap gene regulatory network

PubMed Central

Kozlov, Konstantin N.; Kulakovskiy, Ivan V.; Zubair, Asif; Marjoram, Paul; Lawrie, David S.; Nuzhdin, Sergey V.; Samsonova, Maria G.

2017-01-01

Annotating the genotype-phenotype relationship, and developing a proper quantitative description of the relationship, requires understanding the impact of natural genomic variation on gene expression. We apply a sequence-level model of gap gene expression in the early development of Drosophila to analyze single nucleotide polymorphisms (SNPs) in a panel of natural sequenced D. melanogaster lines. Using a thermodynamic modeling framework, we provide both analytical and computational descriptions of how single-nucleotide variants affect gene expression. The analysis reveals that the sequence variants increase (decrease) gene expression if located within binding sites of repressors (activators). We show that the sign of SNP influence (activation or repression) may change in time and space and elucidate the origin of this change in specific examples. The thermodynamic modeling approach predicts non-local and non-linear effects arising from SNPs, and combinations of SNPs, in individual fly genotypes. Simulation of individual fly genotypes using our model reveals that this non-linearity reduces to almost additive inputs from multiple SNPs. Further, we see signatures of the action of purifying selection in the gap gene regulatory regions. To infer the specific targets of purifying selection, we analyze the patterns of polymorphism in the data at two phenotypic levels: the strengths of binding and expression. We find that combinations of SNPs show evidence of being under selective pressure, while individual SNPs do not. The model predicts that SNPs appear to accumulate in the genotypes of the natural population in a way biased towards small increases in activating action on the expression pattern. Taken together, these results provide a systems-level view of how genetic variation translates to the level of gene regulatory networks via combinatorial SNP effects. PMID:28898266
Length variation and sequence divergence in mitochondrial control region of Schizothoracine (Teleostei: Cyprinidae) species.

PubMed

Syed, Mudasir Ahmad; Bhat, Farooz Ahmad; Balkhi, Masood-ul Hassan; Bhat, Bilal Ahmad

2016-01-01

Schizothoracine fish, commonly called snow trouts, inhabit the entire network of snow- and spring-fed cool waters of Kashmir, India. Of over 10 species reported earlier, only five have been found: Schizothorax niger, Schizothorax esocinus, Schizothorax plagiostomus, Schizothorax curvifrons and Schizothorax labiatus. The relationship between these species is contradictory. To understand the evolutionary relationship of these species, we examined the sequence information of the mitochondrial D-loop of 25 individuals representing the five species. Sequence alignment showed the D-loop region to be highly variable, and length variation was observed in the di-nucleotide (TA)n microsatellite between and within species. Interestingly, all these species have a (TA)n microsatellite not associated with longer tandem repeats at the 3' end of the mitochondrial control region and do not show heteroplasmy. Our analysis also indicates the presence of four conserved sequence blocks (CSB), CSB-D, CSB-1, CSB-II and CSB-III, four termination-associated sequence (TAS) motifs and a 15-bp pyrimidine block within the mitochondrial control region, which are highly conserved within the genus Schizothorax when compared with other species. The phylogenetic analyses carried out by maximum likelihood (ML), neighbor joining (NJ) and Bayesian inference (BI) generated almost identical results. The resultant BI tree showed a close genetic relationship among all five species and supports two distinct groupings of S. esocinus. Besides the species relation, the presence of length variation in tandem repeats is attributed to differences in predicting the stability of secondary structures. The role of CSBs and TASs, reported so far as main regulatory signals, would explain the conservation of these elements in evolution.
A mechanistic insight into the amyloidogenic structure of hIAPP peptide revealed from sequence analysis and molecular dynamics simulation.

PubMed

Chakraborty, Sandipan; Chatterjee, Barnali; Basu, Soumalee

2012-07-01

A collective approach of sequence analysis, phylogenetic tree and in silico prediction of amyloidogenicity using bioinformatics tools has been used to correlate the observed species-specific variations in IAPP sequences with the amyloid-forming propensity. Observed substitution patterns indicate that probable changes in local hydrophobicity are instrumental in altering the aggregation propensity of the peptide. In particular, residues at the 17th, 22nd and 23rd positions of the IAPP peptide are found to be crucial for amyloid formation. Proline25 primarily dictates the observed non-amyloidogenicity in rodents. Furthermore, an extensive molecular dynamics simulation of 0.24 μs has been carried out with human IAPP (hIAPP) fragment 19-27, the portion showing maximum sequence variation across different species, to understand the native folding characteristics of this region. Principal component analysis in combination with free energy landscape analysis illustrates a four-residue turn spanning residues 22 to 25. The results provide a structural insight into the intramolecular β-sheet structure of amylin, which probably is the template for nucleation of fibril formation and growth, a pathogenic feature of type II diabetes. Copyright © 2012 Elsevier B.V. All rights reserved.
A Glimpse into the Satellite DNA Library in Characidae Fish (Teleostei, Characiformes)

PubMed Central

Utsunomia, Ricardo; Ruiz-Ruano, Francisco J.; Silva, Duílio M. Z. A.; Serrano, Érica A.; Rosa, Ivana F.; Scudeler, Patrícia E. S.; Hashimoto, Diogo T.; Oliveira, Claudio; Camacho, Juan Pedro M.; Foresti, Fausto

2017-01-01

Satellite DNA (satDNA) is an abundant fraction of repetitive DNA in eukaryotic genomes and plays an important role in genome organization and evolution. In general, satDNA sequences follow a concerted evolutionary pattern through the intragenomic homogenization of different repeat units. In addition, the satDNA library hypothesis predicts that related species share a series of satDNA variants descended from a common ancestor species, with differential amplification of different satDNA variants. The finding of the same satDNA family in species belonging to different genera within Characidae fish provided the opportunity to test both the concerted evolution and library hypotheses. For this purpose, we analyzed sequence variation and abundance of this satDNA family in ten species by a combination of next generation sequencing (NGS), PCR and Sanger sequencing, and fluorescence in situ hybridization (FISH). We found extensive between-species variation in the number and size of pericentromeric FISH signals. At the genomic level, the analysis of thousands of DNA sequences obtained by Illumina sequencing and PCR amplification allowed us to define 150 haplotypes, which were linked in a common minimum spanning tree where different patterns of concerted evolution were apparent. This also provided a glimpse into the satDNA library of this group of species. Consistent with the library hypothesis, different variants of this satDNA showed large differences in abundance between species, from highly abundant to simply relictual variants. PMID:28855916
Balancing Selection on a Regulatory Region Exhibiting Ancient Variation That Predates Human–Neandertal Divergence

PubMed Central

Iskow, Rebecca C.; Austermann, Christian; Scharer, Christopher D.; Raj, Towfique; Boss, Jeremy M.; Sunyaev, Shamil; Price, Alkes; Stranger, Barbara; Simon, Viviana; Lee, Charles

2013-01-01

Ancient population structure shaping contemporary genetic variation has been recently appreciated and has important implications regarding our understanding of the structure of modern human genomes. We identified a ∼36-kb DNA segment in the human genome that displays an ancient substructure. The variation at this locus exists primarily as two highly divergent haplogroups. One of these haplogroups (the NE1 haplogroup) aligns with the Neandertal haplotype and contains a 4.6-kb deletion polymorphism in perfect linkage disequilibrium with 12 single nucleotide polymorphisms (SNPs) across diverse populations. The other haplogroup, which does not contain the 4.6-kb deletion, aligns with the chimpanzee haplotype and is likely ancestral. Africans have higher overall pairwise differences with the Neandertal haplotype than Eurasians do for this NE1 locus (p < 10⁻¹⁵). Moreover, the nucleotide diversity at this locus is higher in Eurasians than in Africans. These results mimic signatures of recent Neandertal admixture contributing to this locus. However, an in-depth assessment of the variation in this region across multiple populations reveals that African NE1 haplotypes, albeit rare, harbor more sequence variation than NE1 haplotypes found in Europeans, indicating an ancient African origin of this haplogroup and refuting recent Neandertal admixture. Population genetic analyses of the SNPs within each of these haplogroups, along with genome-wide comparisons revealed significant FST (p = 0.00003) and positive Tajima's D (p = 0.00285) statistics, pointing to non-neutral evolution of this locus. The NE1 locus harbors no protein-coding genes, but contains transcribed sequences as well as sequences with putative regulatory function based on bioinformatic predictions and in vitro experiments. We postulate that the variation observed at this locus predates Human–Neandertal divergence and is evolving under balancing selection, especially among European populations. PMID:23593015
Efficient and accurate causal inference with hidden confounders from genome-transcriptome variation data

PubMed Central

2017-01-01

Mapping gene expression as a quantitative trait using whole genome-sequencing and transcriptome analysis makes it possible to discover the functional consequences of genetic variation. We developed a novel method and ultra-fast software Findr for highly accurate causal inference between gene expression traits using cis-regulatory DNA variations as causal anchors, which improves current methods by taking into consideration hidden confounders and weak regulations. Findr outperformed existing methods on the DREAM5 Systems Genetics challenge and on the prediction of microRNA and transcription factor targets in human lymphoblastoid cells, while being nearly a million times faster. Findr is publicly available at https://github.com/lingfeiwang/findr. PMID:28821014
Ecology has contrasting effects on genetic variation within species versus rates of molecular evolution across species in water beetles.

PubMed

Fujisawa, Tomochika; Vogler, Alfried P; Barraclough, Timothy G

2015-01-22

Comparative analysis is a potentially powerful approach to study the effects of ecological traits on genetic variation and rate of evolution across species. However, the lack of suitable datasets means that comparative studies of correlates of genetic traits across an entire clade have been rare. Here, we use a large DNA-barcode dataset (5062 sequences) of water beetles to test the effects of species ecology and geographical distribution on genetic variation within species and rates of molecular evolution across species. We investigated species traits predicted to influence their genetic characteristics, such as surrogate measures of species population size, latitudinal distribution and habitat types, taking phylogeny into account. Genetic variation of cytochrome oxidase I in water beetles was positively correlated with occupancy (numbers of sites of species presence) and negatively with latitude, whereas substitution rates across species depended mainly on habitat types, and running water specialists had the highest rate. These results are consistent with theoretical predictions from nearly-neutral theories of evolution, and suggest that the comparative analysis using large databases can give insights into correlates of genetic variation and molecular evolution.
How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis.

PubMed

Tian, Pengfei; Best, Robert B

2017-10-17

Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance. Published by Elsevier Inc.
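The distinction between absolute and relative sequence capacity can be illustrated with a short calculation. This is a sketch only: the function name is hypothetical, and the example values (an absolute capacity of 10^24 and a 35-residue domain) are illustrative assumptions, not numbers reported in the paper. It assumes the normalization is by all sequences of the same length over the 20-letter amino acid alphabet.

```python
import math


def log10_relative_capacity(log10_absolute, length, alphabet_size=20):
    """log10 of (foldable sequences / all possible sequences of this length).

    Working in log10 avoids overflow, since alphabet_size ** length
    grows very quickly with protein length.
    """
    return log10_absolute - length * math.log10(alphabet_size)


# Hypothetical small domain: 10^24 foldable sequences, 35 residues.
small_domain = log10_relative_capacity(24.0, 35)
```

Under these assumed inputs the relative capacity is roughly 10^-21.5: an absolute count on the order of the Avogadro constant is still a vanishing fraction of sequence space, which is the contrast the abstract draws.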
Modeling read counts for CNV detection in exome sequencing data.

PubMed

Love, Michael I; Myšičková, Alena; Sun, Ruping; Kalscheuer, Vera; Vingron, Martin; Haas, Stefan A

2011-11-08

Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
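The idea of decoding copy number from read counts with a hidden Markov model can be sketched as follows. This is not exomeCopy's implementation: Poisson emissions, the three copy-number states, the uniform start distribution and the single `stay` probability are simplifying assumptions made here for illustration, and all names are hypothetical. The expected count in each window is taken to be the control (background) depth scaled by copy number over two.

```python
import math


def poisson_logpmf(k, mean):
    """Log probability of observing count k under a Poisson with this mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def viterbi_cnv(counts, background, copy_states=(1, 2, 3), stay=0.98):
    """Most likely copy-number path for per-window read counts.

    counts     -- observed read counts per window
    background -- expected diploid counts per window (all assumed > 0)
    """
    n_states = len(copy_states)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [[math.log(stay if i == j else switch)
                  for j in range(n_states)] for i in range(n_states)]

    def emit(t, s):
        # Expected depth scales linearly with copy number.
        return poisson_logpmf(counts[t], background[t] * copy_states[s] / 2.0)

    score = [math.log(1.0 / n_states) + emit(0, s) for s in range(n_states)]
    backpointers = []
    for t in range(1, len(counts)):
        new_score, pointers = [], []
        for j in range(n_states):
            best = max(range(n_states),
                       key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j] + emit(t, j))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [copy_states[s] for s in path]


# Hypothetical data: three windows at about half the control depth.
calls = viterbi_cnv([100, 98, 103, 52, 49, 47, 101, 99], [100] * 8)
```

With these made-up counts the three half-depth windows are decoded as a single-copy segment flanked by two-copy windows. Positional covariates such as GC content, which the abstract mentions, would enter through the expected depth rather than through the decoding step.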
Prediction of BRCA1 and BRCA2 mutation status using post-irradiation assays of lymphoblastoid cell lines is compromised by inter-cell-line phenotypic variability.

PubMed

Lovelock, Paul K; Wong, Ee Ming; Sprung, Carl N; Marsh, Anna; Hobson, Karen; French, Juliet D; Southey, Melissa; Sculley, Tom; Pandeya, Nirmala; Brown, Melissa A; Chenevix-Trench, Georgia; Spurdle, Amanda B; McKay, Michael J

2007-09-01

Assays to determine the pathogenicity of unclassified sequence variants in disease-associated genes include the analysis of lymphoblastoid cell lines (LCLs). We assessed the ability of several assays of LCLs to distinguish carriers of germline BRCA1 and BRCA2 gene mutations from mutation-negative controls to determine their utility for use in a diagnostic setting. Post-ionising radiation cell viability and micronucleus formation, and telomere length were assayed in LCLs carrying BRCA1 or BRCA2 mutations, and in unaffected mutation-negative controls. Post-irradiation cell viability and micronucleus induction assays of LCLs from individuals carrying pathogenic BRCA1 mutations, unclassified BRCA1 sequence variants or wildtype BRCA1 sequence showed significant phenotypic heterogeneity within each group. Responses were not consistent with predicted functional consequences of known pathogenic or normal sequences. Telomere length was also highly heterogeneous within groups of LCLs carrying pathogenic BRCA1 or BRCA2 mutations, and normal BRCA1 sequences, and was not predictive of mutation status. Given the significant degree of phenotypic heterogeneity of LCLs after gamma-irradiation, and the lack of association with BRCA1 or BRCA2 mutation status, we conclude that the assays evaluated in this study should not be used as a means of differentiating pathogenic and non-pathogenic sequence variants for clinical application. We suggest that a range of normal controls must be included in any functional assays of LCLs to ensure that any observed differences between samples reflect the genotype under investigation rather than generic inter-individual variation.
Genome-wide association analysis of milk yield traits in Nordic Red Cattle using imputed whole genome sequence variants.

PubMed

Iso-Touru, T; Sahana, G; Guldbrandtsen, B; Lund, M S; Vilkki, J

2016-03-22

The Nordic Red Cattle, consisting of three different populations from Finland, Sweden and Denmark, are under a joint breeding value estimation system. The long history of recording of production and health traits offers a great opportunity to study production traits and identify causal variants behind them. In this study, we used whole genome sequence level data from 4280 progeny tested Nordic Red Cattle bulls to scan the genome for loci affecting milk, fat and protein yields. Using a genome-wise significance threshold, regions on Bos taurus chromosomes 5, 14, 23, 25 and 26 were associated with fat yield. Regions on chromosomes 5, 14, 16, 19, 20 and 25 were associated with milk yield and chromosomes 5, 14 and 25 had regions associated with protein yield. Significantly associated variations were found in 227 genes for fat yield, 72 genes for milk yield and 30 genes for protein yield. Ingenuity Pathway Analysis was used to identify networks connecting these genes displaying significant hits. When compared to previously mapped genomic regions associated with fertility, significantly associated variations were found in 5 genes common for fat yield and fertility, thus linking these two traits via biological networks. This is the first time that whole genome sequence data have been used to study genomic regions affecting milk production in the Nordic Red Cattle population. Sequence level data offers the possibility to study quantitative traits in detail but still cannot unambiguously reveal which of the associated variations is causative. Linkage disequilibrium makes it difficult to pinpoint the causative genes and variations. One solution to overcome these difficulties is the identification of the functional gene networks and pathways to reveal important interacting genes as candidates for the observed effects. This information on target genomic regions may be exploited to improve genomic prediction.
Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data.

PubMed

He, Zihuai; Xu, Bin; Lee, Seunggeun; Ionita-Laza, Iuliana

2017-09-07

Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Insights into mechanisms of bacterial antigenic variation derived from the complete genome sequence of Anaplasma marginale.

PubMed

Palmer, Guy H; Futse, James E; Knowles, Donald P; Brayton, Kelly A

2006-10-01

Persistence of Anaplasma spp. in the animal reservoir host is required for efficient tick-borne transmission of these pathogens to animals and humans. Using A. marginale infection of its natural reservoir host as a model, persistent infection has been shown to reflect sequential cycles in which antigenic variants emerge, replicate, and are controlled by the immune system. Variation in the immunodominant outer-membrane protein MSP2 is generated by a process of gene conversion, in which unique hypervariable region sequences (HVRs) located in pseudogenes are recombined into a single operon-linked msp2 expression site. Although organisms expressing whole HVRs derived from pseudogenes emerge early in infection, long-term persistent infection is dependent on the generation of complex mosaics in which segments from different HVRs recombine into the expression site. The resulting combinatorial diversity generates the number of variants both predicted and shown to emerge during persistence.
A computational prediction of structure and function of novel homologue of Arabidopsis thaliana Vps51/Vps67 subunit in Corchorus olitorius.

PubMed

Zaman, Aubhishek; Fancy, Nurun Nahar

2012-12-01

Vps mediated vesicular transport is important for transferring macromolecules trapped inside a vesicle. Although highly abundant, Vps shows tremendous sequence variation among a diverse array of species. However, this difference in sequence, which seems to also translate into substantial functional variation, is hardly characterized in Corchorus spp. Here, our computational study investigates structural and functional features of one of the Vps subunits, namely Vps51/Vps67, in C. olitorius. Broad scale structural characterization revealed novel information about the overall Vps structure and binding sites. Moreover, functional analyses indicate interaction partners which were unexplored to date. Since membrane trafficking is essentially associated with nutrient uptake and chemical de-toxification, characterization of the Vps subunit can well provide us with better insight into important agronomic traits such as stress response, immune response and phytoremediation capacity.

Usage of mitochondrial D-loop variation to predict risk for Huntington disease.

PubMed

Mousavizadeh, Kazem; Rajabi, Peyman; Alaee, Mahsa; Dadgar, Sepideh; Houshmand, Massoud

2015-08-01

Huntington's disease (HD) is an inherited autosomal neurodegenerative disease caused by the abnormal expansion of the CAG repeats in the Huntingtin (Htt) gene. It has been proven that mitochondrial dysfunction contributes to the pathogenesis of Huntington's disease. The mitochondrial displacement loop (D-loop) is proven to accumulate mutations at a higher rate than other regions of mtDNA. Thus, we hypothesized that specific SNPs in the D-loop may contribute to the pathogenesis of Huntington's disease. In the present study, 30 patients with Huntington's disease and 463 healthy controls were evaluated for mitochondrial mutation sites within the D-loop region using a PCR-sequencing method. Sequence analysis revealed 35 variations from the Cambridge Mitochondrial Sequences in the HD group. A significant difference (p < 0.05) was seen between the patient and control groups in eight SNPs. Polymorphisms at C16069T, T16126C, T16189C, T16519C and C16223T were correlated with an increased risk of HD while SNPs at C16150T, T16086C and T16195C were associated with a decreased risk of Huntington's disease.
RNA-sequence data normalization through in silico prediction of reference genes: the bacterial response to DNA damage as case study.

PubMed

Berghoff, Bork A; Karlsson, Torgny; Källman, Thomas; Wagner, E Gerhart H; Grabherr, Manfred G

2017-01-01

Measuring how gene expression changes in the course of an experiment assesses how an organism responds on a molecular level. Sequencing of RNA molecules, and their subsequent quantification, aims to assess global gene expression changes on the RNA level (transcriptome). While advances in high-throughput RNA-sequencing (RNA-seq) technologies allow for inexpensive data generation, accurate post-processing and normalization across samples is required to eliminate any systematic noise introduced by the biochemical and/or technical processes. Existing methods thus either normalize on selected known reference genes that are invariant in expression across the experiment, assume that the majority of genes are invariant, or that the effects of up- and down-regulated genes cancel each other out during the normalization. Here, we present a novel method, moose2, which predicts invariant genes in silico through a dynamic programming (DP) scheme and applies a quadratic normalization based on this subset. The method allows for specifying a set of known or experimentally validated invariant genes, which guides the DP. We experimentally verified the predictions of this method in the bacterium Escherichia coli, and show how moose2 is able to (i) estimate the expression value distances between RNA-seq samples, (ii) reduce the variation of expression values across all samples, and (iii) to subsequently reveal new functional groups of genes during the late stages of DNA damage. We further applied the method to three eukaryotic data sets, on which its performance compares favourably to other methods. The software is implemented in C++ and is publicly available from http://grabherr.github.io/moose2/.
The proposed RNA-seq normalization method, moose2, is a valuable alternative to existing methods, with two major advantages: (i) in silico prediction of invariant genes provides a list of potential reference genes for downstream analyses, and (ii) non-linear artefacts in RNA-seq data are handled adequately to minimize variations between replicates.
Identification of a member of the catalase multigene family on wheat chromosome 7A associated with flour b* colour and biological significance of allelic variation.

PubMed

Li, Dora A; Walker, Esther; Francki, Michael G

2015-12-01

Carotenoids (especially lutein) are known to be the pigment source for flour b* colour in bread wheat. Flour b* colour variation is controlled by a quantitative trait locus (QTL) on wheat chromosome 7AL and one gene from the carotenoid pathway, phytoene synthase, was functionally associated with the QTL on 7AL in some, but not all, wheat genotypes. A SNP marker within a sequence similar to catalase (Cat3-A1snp) derived from full-length (FL) cDNA (AK332460), however, was consistently associated with the QTL on 7AL and implicated in regulating hydrogen peroxide (H2O2) to control carotenoid accumulation affecting flour b* colour. The number of catalase genes on chromosome 7AL was investigated in this study to identify which gene may be implicated in flour b* variation and two were identified through interrogation of the draft wheat genome survey sequence consisting of five exons and a further two members having eight exons identified through comparative analysis with the single catalase gene on rice chromosome 6, PCR amplification and sequencing. It was evident that the catalase genes on chromosome 7A had duplicated and diverged during evolution relative to its counterpart on rice chromosome 6. The detection of transcripts in seeds, the co-location with Cat3-A1snp marker and maximised alignment of FL-cDNA (AK332460) with cognate genomic sequence indicated that TaCat3-A1 was the member of the catalase gene family associated with flour b* colour variation. Re-sequencing identified three alleles from three wheat varieties, TaCat3-A1a, TaCat3-A1b and TaCat3-A1c, and their predicted protein identified differences in peroxisomal targeting signal tri-peptide domain in the carboxyl terminal end providing new insights into their potential role in regulating cellular H2O2 that contribute to flour b* colour variation.
The Solanum commersonii Genome Sequence Provides Insights into Adaptation to Stress Conditions and Genome Evolution of Wild Potato Relatives

PubMed Central

Aversano, Riccardo; Contaldi, Felice; Ercolano, Maria Raffaella; Grosso, Valentina; Iorizzo, Massimo; Tatino, Filippo; Xumerle, Luciano; Dal Molin, Alessandra; Avanzato, Carla; Ferrarini, Alberto; Delledonne, Massimo; Sanseverino, Walter; Cigliano, Riccardo Aiese; Capella-Gutierrez, Salvador; Gabaldón, Toni; Frusciante, Luigi; Bradeen, James M.; Carputo, Domenico

2015-01-01

Here, we report the draft genome sequence of Solanum commersonii, which consists of ∼830 megabases with an N50 of 44,303 bp anchored to 12 chromosomes, using the potato (Solanum tuberosum) genome sequence as a reference. Compared with potato, S. commersonii shows a striking reduction in heterozygosity (1.5% versus 53 to 59%), and differences in genome sizes were mainly due to variations in intergenic sequence length. Gene annotation by ab initio prediction supported by RNA-seq data produced a catalog of 1703 predicted microRNAs, 18,882 long noncoding RNAs of which 20% are shown to target cold-responsive genes, and 39,290 protein-coding genes with a significant repertoire of nonredundant nucleotide binding site-encoding genes and 126 cold-related genes that are lacking in S. tuberosum. Phylogenetic analyses indicate that domesticated potato and S. commersonii lineages diverged ∼2.3 million years ago. Three duplication periods corresponding to genome enrichment for particular gene families related to response to salt stress, water transport, growth, and defense response were discovered. The draft genome sequence of S. commersonii substantially increases our understanding of the domesticated germplasm, facilitating translation of acquired knowledge into advances in crop stability in light of global climate and environmental changes. PMID:25873387
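The N50 statistic quoted in this abstract summarizes assembly contiguity. A minimal sketch of the usual definition follows; the function name and the toy contig lengths are hypothetical and unrelated to the S. commersonii assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total is 54 bases, and 10 + 9 + 8 = 27 reaches half of it.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A larger N50 means that more of the assembly sits in long contigs or scaffolds, which is why the value is reported alongside total assembly size.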
Secondary structure prediction and structure-specific sequence analysis of single-stranded DNA.

PubMed

Dong, F; Allawi, H T; Anderson, T; Neri, B P; Lyamichev, V I

2001-08-01

DNA sequence analysis by oligonucleotide binding is often affected by interference with the secondary structure of the target DNA. Here we describe an approach that improves DNA secondary structure prediction by combining enzymatic probing of DNA by structure-specific 5'-nucleases with an energy minimization algorithm that utilizes the 5'-nuclease cleavage sites as constraints. The method can identify structural differences between two DNA molecules caused by minor sequence variations such as a single nucleotide mutation. It also demonstrates the existence of long-range interactions between DNA regions separated by >300 nt and the formation of multiple alternative structures by a 244 nt DNA molecule. The differences in the secondary structure of DNA molecules revealed by 5'-nuclease probing were used to design structure-specific probes for mutation discrimination that target the regions of structural, rather than sequence, differences. We also demonstrate the performance of structure-specific 'bridge' probes complementary to non-contiguous regions of the target molecule. The structure-specific probes do not require the high stringency binding conditions necessary for methods based on mismatch formation and permit mutation detection at temperatures from 4 to 37 degrees C. Structure-specific sequence analysis is applied for mutation detection in the Mycobacterium tuberculosis katG gene and for genotyping of the hepatitis C virus.
Allele-specific copy-number discovery from whole-genome and whole-exome sequencing

PubMed Central

Wang, WeiBo; Wang, Wei; Sun, Wei; Crowley, James J.; Szatkiewicz, Jin P.

2015-01-01

Copy-number variants (CNVs) are a major form of genetic variation and a risk factor for various human diseases, so it is crucial to accurately detect and characterize them. It is conceivable that allele-specific reads from high-throughput sequencing data could be leveraged to both enhance CNV detection and produce allele-specific copy number (ASCN) calls. Although statistical methods have been developed to detect CNVs using whole-genome sequence (WGS) and/or whole-exome sequence (WES) data, information from allele-specific read counts has not yet been adequately exploited. In this paper, we develop an integrated method, called AS-GENSENG, which incorporates allele-specific read counts in CNV detection and estimates ASCN using either WGS or WES data. To evaluate the performance of AS-GENSENG, we conducted extensive simulations, generated empirical data using existing WGS and WES data sets and validated predicted CNVs using an independent methodology. We conclude that AS-GENSENG not only predicts accurate ASCN calls but also improves the accuracy of total copy number calls, owing to its unique ability to exploit information from both total and allele-specific read counts while accounting for various experimental biases in sequence data. Our novel, user-friendly and computationally efficient method and a complete analytic protocol is freely available at https://sourceforge.net/projects/asgenseng/. PMID:25883151
Web of Objects Based Ambient Assisted Living Framework for Emergency Psychiatric State Prediction

PubMed Central

Alam, Md Golam Rabiul; Abedin, Sarder Fakhrul; Al Ameen, Moshaddique; Hong, Choong Seon

2016-01-01

Ambient assisted living can facilitate optimum health and wellness by aiding physical, mental and social well-being. In this paper, patients’ psychiatric symptoms are collected through lightweight biosensors and web-based psychiatric screening scales in a smart home environment and then analyzed through machine learning algorithms to provide ambient intelligence in a psychiatric emergency. The psychiatric states are modeled through a Hidden Markov Model (HMM), and the model parameters are estimated using a Viterbi path counting and scalable Stochastic Variational Inference (SVI)-based training algorithm. The most likely psychiatric state sequence of the corresponding observation sequence is determined, and an emergency psychiatric state is predicted through the proposed algorithm. Moreover, to enable personalized psychiatric emergency care, a web of objects-based service framework is proposed for a smart-home environment. In this framework, the biosensor observations and the psychiatric rating scales are objectified and virtualized in the web space. Then, the web of objects of sensor observations and psychiatric rating scores are used to assess the dweller’s mental health status and to predict an emergency psychiatric state. The proposed psychiatric state prediction algorithm reported 83.03 percent prediction accuracy in an empirical performance study. PMID:27608023
Topology of membrane proteins-predictions, limitations and variations.

PubMed

Tsirigos, Konstantinos D; Govindarajan, Sudha; Bassot, Claudio; Västermark, Åke; Lamb, John; Shu, Nanjiang; Elofsson, Arne

2017-10-26

Transmembrane proteins perform a variety of important biological functions necessary for the survival and growth of the cells. Membrane proteins are built up by transmembrane segments that span the lipid bilayer. The segments can either be in the form of hydrophobic alpha-helices or beta-sheets which create a barrel. A fundamental aspect of the structure of transmembrane proteins is the membrane topology, that is, the number of transmembrane segments, their position in the protein sequence and their orientation in the membrane. Along these lines, many predictive algorithms for the prediction of the topology of alpha-helical and beta-barrel transmembrane proteins exist. The newest algorithms obtain an accuracy close to 80% both for alpha-helical and beta-barrel transmembrane proteins. However, lately it has been shown that the simplified picture presented when describing a protein family by its topology is limited. To demonstrate this, we highlight examples where the topology is either not conserved in a protein superfamily or where the structure cannot be described solely by the topology of a protein. The prediction of these non-standard features from sequence alone was not successful until the recent revolutionary progress in 3D-structure prediction of proteins. Copyright © 2017 Elsevier Ltd. All rights reserved.
Geoscience technology application to optimize field development, Seligi Field, Malay Basin

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ahmed, M.S.; Wiggins, B.D.

1994-07-01

Integration of well log, core, 3-D seismic, and engineering data within a sequence stratigraphic framework, has enabled prediction of reservoir distribution and optimum development of Seligi field. Seligi is the largest field in the Malay Basin, with half of the reserves within lower Miocene Group J reservoirs. These reservoirs consist of shallow marine sandstones and estuarine sandstones predominantly within an incised valley. Variation in reservoir quality has been a major challenge in developing Seligi. Recognizing and mapping four sequences within the Group J incised valley fill has resulted in a geologic model for predicting the distribution of good quality estuarine reservoir units and intercalated low-permeability sand/shale units deposited during marine transgressions. These low-permeability units segregate the reservoir fluids, causing differential contact movement in response to production thus impacting completion strategy and well placement. Seismic calibration shows that a large impedance contrast exists between the low-permeability rock and adjacent good quality oil sand. Application of sequence stratigraphic/facies analysis coupled with the ability to identify the low-permeability units seismically is enabling optimum development of each of the four sequences at Seligi.
Whole-genome sequencing of a Plasmodium vivax clinical isolate exhibits geographical characteristics and high genetic variation in China-Myanmar border area.

PubMed

Chen, Shen-Bo; Wang, Yue; Kassegne, Kokouvi; Xu, Bin; Shen, Hai-Mo; Chen, Jun-Hu

2017-02-06

In China, the number of Plasmodium vivax cases imported from Southeast Asia is currently increasing, especially in the China-Myanmar border area. Driven by the increase in P. vivax cases and a stronger need for vaccine and drug development, several genome sequencing projects for P. vivax isolates are underway. However, little is known about the genetic variability in this area until now. The sequencing of the first P. vivax isolate from the China-Myanmar border area (CMB-1) generated 120 million paired-end reads. Of the quality-evaluated reads, 10.6% were aligned onto 99.9% of the reference strain Sal I genome at 62-fold coverage, with an average of 4.8 SNPs per kb. We present a 539-SNP marker data set for P. vivax that can identify different parasites from different geographic origins with high sensitivity. We also identified exceptionally high levels of genetic variability in members of multigene families such as RBP, SERA, vir, MSP3 and AP2. The de-novo assembly yielded a database composed of 8,409 contigs with N50 lengths of 6.6 kb and revealed 661 novel predicted genes including 78 vir genes, suggesting a greater functional variation in P. vivax from this area. Our results contribute to a better understanding of P. vivax genetic variation, and provide a fundamental basis for the geographic differentiation of vivax malaria from the China-Myanmar border area using a direct sequencing approach without leukocyte depletion. This novel sequencing method can be used as an essential tool for the genomic research of P. vivax in the near future.
Modeling genome coverage in single-cell sequencing

PubMed Central

Daley, Timothy; Smith, Andrew D.

2014-01-01

Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. But because such experiments begin with very small amounts of starting material, the amount of information that is obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol employed and variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material. Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries. Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq. Contact: andrewds@usc.edu Supplementary information: Supplementary material is available at Bioinformatics online. PMID:25107873
PredSTP: a highly accurate SVM based model to predict sequential cystine stabilized peptides.

PubMed

Islam, S M Ashiqul; Sajed, Tanvir; Kearney, Christopher Michel; Baker, Erich J

2015-07-05

Numerous organisms have evolved a wide range of toxic peptides for self-defense and predation. Their effective interstitial and macro-environmental use requires energetic and structural stability. One successful group of these peptides includes a tri-disulfide domain arrangement that offers toxicity and high stability. Sequential tri-disulfide connectivity variants create highly compact disulfide folds capable of withstanding a variety of environmental stresses. Their combination of toxicity and stability makes these peptides remarkably valuable for their potential as bio-insecticides, antimicrobial peptides and peptide drug candidates. However, the wide sequence variation, sources and modalities of group members impose serious limitations on our ability to rapidly identify potential members. As a result, there is a need for automated high-throughput member classification approaches that leverage their demonstrated tertiary and functional homology. We developed an SVM-based model to predict sequential tri-disulfide peptide (STP) toxins from peptide sequences. One optimized model, called PredSTP, predicted STPs from the training set with sensitivity, specificity, precision, accuracy and a Matthews correlation coefficient of 94.86%, 94.11%, 84.31%, 94.30% and 0.86, respectively, using 200-fold cross-validation. The same model outperforms existing prediction approaches on three independent out-of-sample test sets derived from PDB. PredSTP can accurately identify a wide range of cystine-stabilized peptide toxins directly from sequences in a species-agnostic fashion. The ability to rapidly filter sequences for potential bioactive peptides can greatly compress the time between peptide identification and testing of structural and functional properties for possible antimicrobial and insecticidal candidates. A web interface for predicting STP toxins is freely available at http://crick.ecs.baylor.edu/.
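The five performance figures reported for PredSTP all derive from a binary confusion matrix. The sketch below shows the standard definitions; the function name and the toy counts are hypothetical and are not the PredSTP results.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)              # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
toy_metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(toy_metrics)
```

Unlike accuracy, the Matthews correlation coefficient stays informative when positives and negatives are unbalanced, which may explain why it is reported alongside the other four values.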
Journal of Engineering Thermophysics (Selected Articles),

DTIC Science & Technology

1983-05-13

compressor, prediction of unsteady vibration, and prevention of unsteady vibration. This test was undergone on a turbojet engine. The paper stresses the...induce unsteady engine vibration. While studying the effect of inlet anomaly and variation of the first stage nozzle area of the turbine, the engine...constant revolution speed curve until unsteady vibration or stall appeared. In studying the influence of the starting sequence, starting was
UniDrug-target: a computational tool to identify unique drug targets in pathogenic bacteria.

PubMed

Chanumolu, Sree Krishna; Rout, Chittaranjan; Chauhan, Rajinder S

2012-01-01

Targeting conserved bacterial proteins with antibacterial medications has resulted both in the development of resistant strains and in changes to human health through the destruction of beneficial microbes, which eventually become breeding grounds for the evolution of resistance. Despite the availability of more than 800 genome sequences, 430 pathways, 4743 enzymes, 9257 metabolic reactions and three-dimensional (3D) protein structures in bacteria, no pathogen-specific computational drug target identification tool has been developed. A web server, UniDrug-Target, which combines bacterial biological information and computational methods to stringently identify pathogen-specific proteins as drug targets, has been designed. Besides predicting pathogen-specific protein essentiality, chokepoint property, etc., three new algorithms were developed and implemented using protein sequences, domains, structures, and metabolic reactions for construction of partial metabolic networks (PMNs), determination of conservation in critical residues, and variation analysis of residues forming similar cavities in protein sequences. First, PMNs are constructed to determine the extent of disturbance in metabolite production caused by targeting a protein as a drug target. Second, conservation of the critical residues of pathogen-specific proteins involved in cavity formation and biological function is determined at the domain level with low-matching sequences. Last, variation analysis of residues forming similar cavities in protein sequences from pathogenic versus non-pathogenic bacteria and humans is performed. The server is capable of predicting drug targets for any sequenced pathogenic bacterium having FASTA sequences and annotated information. The utility of the UniDrug-Target server was demonstrated for Mycobacterium tuberculosis (H37Rv). UniDrug-Target identified 265 mycobacterial pathogen-specific proteins, including 17 essential proteins that can be potential drug targets.
UniDrug-Target is expected to accelerate the identification of pathogen-specific drug targets, which should increase their success and durability, as drugs developed against them are less likely to induce resistance or to have an adverse impact on the environment. The server is freely available at http://117.211.115.67/UDT/main.html. The standalone application (source codes) is available at http://www.bioinformatics.org/ftp/pub/bioinfojuit/UDT.rar.
Positive selection in the SLC11A1 gene in the family Equidae.

PubMed

Bayerova, Zuzana; Janova, Eva; Matiasovic, Jan; Orlando, Ludovic; Horin, Petr

2016-05-01

Immunity-related genes are a suitable model for studying effects of selection at the genomic level. Some of them are highly conserved due to functional constraints and purifying selection, while others are variable and change quickly to cope with the variation of pathogens. The SLC11A1 gene encodes a transporter protein mediating antimicrobial activity of macrophages. Little is known about the patterns of selection shaping this gene during evolution. Although it is a typical evolutionarily conserved gene, functionally important polymorphisms associated with various diseases were identified in humans and other species. We analyzed the genomic organization, genetic variation, and evolution of the SLC11A1 gene in the family Equidae to identify patterns of selection within this important gene. Nucleotide SLC11A1 sequences were shown to be highly conserved in ten equid species, with more than 97% sequence identity across the family. Single nucleotide polymorphisms (SNPs) were found in the coding and noncoding regions of the gene. Seven codon sites were identified to be under strong purifying selection. Codons located in three regions, including the glycosylated extracellular loop, were shown to be under diversifying selection. A 3-bp indel resulting in a deletion of amino acid 321 in the predicted protein was observed in all horses, whereas the codon has been maintained in all other equid species. This codon, which lies within an N-glycosylation site, was found to be under positive selection. Interspecific variation in the presence of predicted N-glycosylation sites was observed.
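A figure such as "more than 97% sequence identity" is computed from aligned sequences. The sketch below shows one common convention, assumed here for illustration: columns where both sequences have a gap are skipped, and a gap opposite a base counts as a mismatch. The abstract does not say which convention the authors used, and the function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences carry a gap ('-') are ignored; a gap
    opposite a base is counted as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared


# Hypothetical aligned fragments, not SLC11A1 sequences.
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

Under this convention a 3-bp indel lowers identity by three mismatched columns, so the choice of gap handling matters when comparing values across studies.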
PGen: large-scale genomic variations analysis workflow and browser in SoyKB.

PubMed

Liu, Yang; Khan, Saad M; Wang, Juexin; Rynge, Mats; Zhang, Yuanxun; Zeng, Shuai; Chen, Shiyuan; Maldonado Dos Santos, Joao V; Valliyodan, Babu; Calyam, Prasad P; Merchant, Nirav; Nguyen, Henry T; Xu, Dong; Joshi, Trupti

2016-10-06

With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. We have developed both a Linux version in GitHub ( https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow ) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), ( http://soykb.org/Pegasus/index.php ). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser ( http://soykb.org/NGS_Resequence/NGS_index.php ) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. 
PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species.
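The distinction between SNPs and indels that the workflow reports can be illustrated by classifying reference/alternate allele pairs by length. This is a minimal sketch of the general idea, not PGen code; the function name and the toy allele pairs are hypothetical.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify a single REF/ALT allele pair by allele length."""
    if ref == alt:
        raise ValueError("REF and ALT are identical, so this is not a variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # same-length multi-base changes, e.g. multi-nucleotide substitutions


# Hypothetical (ref, alt) pairs, not taken from the soybean data.
toy_variants = [("A", "G"), ("C", "T"), ("AT", "A"), ("G", "GCC"), ("AC", "GT")]
variant_counts = Counter(classify_variant(ref, alt) for ref, alt in toy_variants)
print(variant_counts)
```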
Fine Analysis of Genetic Diversity of the tpr Gene Family among Treponemal Species, Subspecies and Strains

PubMed Central

Centurion-Lara, Arturo; Giacani, Lorenzo; Godornes, Charmie; Molini, Barbara J.; Brinck Reid, Tara; Lukehart, Sheila A.

2013-01-01

Background The pathogenic non-cultivable treponemes include three subspecies of Treponema pallidum (pallidum, pertenue, endemicum), T. carateum, T. paraluiscuniculi, and the unclassified Fribourg-Blanc treponeme (Simian isolate). These treponemes are morphologically indistinguishable and antigenically and genetically highly similar, yet cross-immunity is variable or non-existent. Although all of these organisms cause chronic, multistage skin and systemic disease, they have historically been classified by mode of transmission, clinical presentations and host ranges. Whole genome studies underscore the high degree of sequence identity among species, subspecies and strains, pinpointing a limited number of genomic regions for variation. Many of these “hot spots” include members of the tpr gene family, composed of 12 paralogs encoding candidate virulence factors. We hypothesize that the distinct clinical presentations, host specificity, and variable cross-immunity might reside on virulence factors such as the tpr genes. Methodology/Principal Findings Sequence analysis of 11 tpr loci (excluding tprK) from 12 strains demonstrated an impressive heterogeneity, including SNPs, indels, chimeric genes, truncated gene products and large deletions. Comparative analyses of sequences and 3D models of predicted proteins in Subfamily I highlight the striking co-localization of discrete variable regions with predicted surface-exposed loops. A hallmark of Subfamily II is the presence of chimeric genes in the tprG and J loci. Diversity in Subfamily III is limited to tprA and tprL. Conclusions/Significance An impressive sequence variability was found in tpr sequences among the Treponema isolates examined in this study, with most of the variation being consistent within subspecies or species, or between syphilis vs. non-syphilis strains. Variability was seen in the pallidum subspecies, which can be divided into 5 genogroups. 
These findings support a genetic basis for the classification of these organisms into their respective subspecies and species. Future functional studies will determine whether the identified genetic differences relate to cross-immunity, clinical differences, or host ranges. PMID:23696912
3D mechanical stratigraphy of a deformed multi-layer: Linking sedimentary architecture and strain partitioning

NASA Astrophysics Data System (ADS)

Cawood, Adam J.; Bond, Clare E.

2018-01-01

Stratigraphic influence on structural style and strain distribution in deformed sedimentary sequences is well established in models of 2D mechanical stratigraphy. In this study we attempt to refine existing models of stratigraphic-structure interaction by examining outcrop-scale 3D variations in sedimentary architecture and their effects on subsequent deformation. At Monkstone Point, Pembrokeshire, SW Wales, digital mapping and virtual scanline data from a high-resolution virtual outcrop have been combined with field observations, sedimentary logs and thin section analysis. Results show that significant variation in strain partitioning is controlled by changes, at a scale of tens of metres, in sedimentary architecture within Upper Carboniferous fluvio-deltaic deposits. Coupled vs uncoupled deformation of the sequence is defined by the composition and lateral continuity of mechanical units and unit interfaces. Where the sedimentary sequence is characterized by gradational changes in composition and grain size, we find that deformation structures are best characterized by patterns of distributed strain. In contrast, distinct compositional changes vertically and in laterally equivalent deposits result in highly partitioned deformation and strain. The mechanical stratigraphy of the study area is inherently 3D in nature, due to lateral and vertical compositional variability. Consideration should be given to 3D variations in mechanical stratigraphy, such as those outlined here, when predicting subsurface deformation in multi-layers.
Quantifying rare, deleterious variation in 12 human cytochrome P450 drug-metabolism genes in a large-scale exome dataset.

PubMed

Gordon, Adam S; Tabor, Holly K; Johnson, Andrew D; Snively, Beverly M; Assimes, Themistocles L; Auer, Paul L; Ioannidis, John P A; Peters, Ulrike; Robinson, Jennifer G; Sucheston, Lara E; Wang, Danxin; Sotoodehnia, Nona; Rotter, Jerome I; Psaty, Bruce M; Jackson, Rebecca D; Herrington, David M; O'Donnell, Christopher J; Reiner, Alexander P; Rich, Stephen S; Rieder, Mark J; Bamshad, Michael J; Nickerson, Deborah A

2014-04-15

The study of genetic influences on drug response and efficacy ('pharmacogenetics') has existed for over 50 years. Yet, we still lack a complete picture of how genetic variation, both common and rare, affects each individual's responses to medications. Exome sequencing is a promising alternative method for pharmacogenetic discovery as it provides information on both common and rare variation in large numbers of individuals. Using exome data from 2203 AA and 4300 Caucasian individuals through the NHLBI Exome Sequencing Project, we conducted a survey of coding variation within 12 Cytochrome P450 (CYP) genes that are collectively responsible for catalyzing nearly 75% of all known Phase I drug oxidation reactions. In addition to identifying many polymorphisms with known pharmacogenetic effects, we discovered over 730 novel nonsynonymous alleles across the 12 CYP genes of interest. These alleles include many with diverse functional effects such as premature stop codons, aberrant splice sites and mutations at conserved active site residues. Our analysis considering both novel, predicted functional alleles as well as known, actionable CYP alleles reveals that rare, deleterious variation contributes markedly to the overall burden of pharmacogenetic alleles within the populations considered, and that the contribution of rare variation to this burden is over three times greater in AA individuals as compared with Caucasians. While most of these impactful alleles are individually rare, 7.6-11.7% of individuals interrogated in the study carry at least one newly described, potentially deleterious allele in a major drug-metabolizing CYP.
Global and disease-associated genetic variation in the human Fanconi anemia gene family

PubMed Central

Rogers, Kai J.; Fu, Wenqing; Akey, Joshua M.; Monnat, Raymond J.

2014-01-01

Fanconi anemia (FA) is a human recessive genetic disease resulting from inactivating mutations in any of 16 FANC (Fanconi) genes. Individuals with FA are at high risk of developmental abnormalities, early bone marrow failure and leukemia. These are followed in the second and subsequent decades by a very high risk of carcinomas of the head and neck and anogenital region, and a small continuing risk of leukemia. In order to characterize base pair-level disease-associated (DA) and population genetic variation in FANC genes and the segregation of this variation in the human population, we identified 2948 unique FANC gene variants including 493 FA DA variants across 57 240 potential base pair variation sites in the 16 FANC genes. We then analyzed the segregation of this variation in the 7578 subjects included in the Exome Sequencing Project (ESP) and the 1000 Genomes Project (1KGP). There was a remarkably high frequency of FA DA variants in ESP/1KGP subjects: at least 1 FA DA variant was identified in 78.5% (5950 of 7578) individuals included in these two studies. Six widely used functional prediction algorithms correctly identified only a third of the known, DA FANC missense variants. We also identified FA DA variants that may be good candidates for different types of mutation-specific therapies. Our results demonstrate the power of direct DNA sequencing to detect, estimate the frequency of and follow the segregation of deleterious genetic variation in human populations. PMID:25104853

Prediction of phenotypes of missense mutations in human proteins from biological assemblies.

PubMed

Wei, Qiong; Xu, Qifang; Dunbrack, Roland L

2013-02-01

Single nucleotide polymorphisms (SNPs) are the most frequent variation in the human genome. Nonsynonymous SNPs that lead to missense mutations can be neutral or deleterious, and several computational methods have been presented that predict the phenotype of human missense mutations. These methods use sequence-based and structure-based features in various combinations, relying on different statistical distributions of these features for deleterious and neutral mutations. One structure-based feature that has not been studied significantly is the accessible surface area within biologically relevant oligomeric assemblies. These assemblies are different from the crystallographic asymmetric unit for more than half of X-ray crystal structures. We find that mutations in the core of proteins or in the interfaces in biological assemblies are significantly more likely to be disease-associated than those on the surface of the biological assemblies. For structures with more than one protein in the biological assembly (whether the same sequence or different), we find the accessible surface area from biological assemblies provides a statistically significant improvement in prediction over the accessible surface area of monomers from protein crystal structures (P = 6e-5). When adding this information to sequence-based features such as the difference between wildtype and mutant position-specific profile scores, the improvement from biological assemblies is statistically significant but much smaller (P = 0.018). Combining this information with sequence-based features in a support vector machine leads to 82% accuracy on a balanced dataset of 50% disease-associated mutations from SwissVar and 50% neutral mutations from human/primate sequence differences in orthologous proteins. Copyright © 2012 Wiley Periodicals, Inc.
Rapid evolution of cis-regulatory sequences via local point mutations

NASA Technical Reports Server (NTRS)

Stone, J. R.; Wray, G. A.

2001-01-01

Although the evolution of protein-coding sequences within genomes is well understood, the same cannot be said of the cis-regulatory regions that control transcription. Yet, changes in gene expression are likely to constitute an important component of phenotypic evolution. We simulated the evolution of new transcription factor binding sites via local point mutations. The results indicate that new binding sites appear and become fixed within populations on microevolutionary timescales under an assumption of neutral evolution. Even combinations of two new binding sites evolve very quickly. We predict that local point mutations continually generate considerable genetic variation that is capable of altering gene expression.
Evidence for large inversion polymorphisms in the human genome from HapMap data

PubMed Central

Bansal, Vikas; Bashir, Ali; Bafna, Vineet

2007-01-01

Knowledge about structural variation in the human genome has grown tremendously in the past few years. However, inversions represent a class of structural variation that remains difficult to detect. We present a statistical method to identify large inversion polymorphisms using unusual Linkage Disequilibrium (LD) patterns from high-density SNP data. The method is designed to detect chromosomal segments that are inverted (in a majority of the chromosomes) in a population with respect to the reference human genome sequence. We demonstrate the power of this method to detect such inversion polymorphisms through simulations done using the HapMap data. Application of this method to the data from the first phase of the International HapMap project resulted in 176 candidate inversions ranging from 200 kb to several megabases in length. Our predicted inversions include an 800-kb polymorphic inversion at 7p22, a 1.1-Mb inversion at 16p12, and a novel 1.2-Mb inversion on chromosome 10 that is supported by the presence of two discordant fosmids. Analysis of the genomic sequence around inversion breakpoints showed that 11 predicted inversions are flanked by pairs of highly homologous repeats in the inverted orientation. In addition, for three candidate inversions, the inverted orientation is represented in the Celera genome assembly. Although the power of our method to detect inversions is restricted because of inherently noisy LD patterns in population data, inversions predicted by our method represent strong candidates for experimental validation and analysis. PMID:17185644
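The method above relies on linkage disequilibrium between SNPs. As background, the sketch below computes the standard pairwise LD measure r² from phased haplotypes. It illustrates the quantity only and is not the inversion-detection statistic of the paper, which the abstract does not specify; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic SNPs.

    haplotypes is a list of (allele_at_snp_a, allele_at_snp_b) tuples,
    one per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    variance_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    if variance_product == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    return d * d / variance_product


# Hypothetical phased haplotypes: perfect association, then none.
print(ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))
print(ld_r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))
```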
Next generation sequencing to dissect the genetic architecture of KNG1 and F11 loci using factor XI levels as an intermediate phenotype of thrombosis.

PubMed

Martin-Fernandez, Laura; Gavidia-Bovadilla, Giovana; Corrales, Irene; Brunel, Helena; Ramírez, Lorena; López, Sonia; Souto, Juan Carlos; Vidal, Francisco; Soria, José Manuel

2017-01-01

Venous thromboembolism is a complex disease with a high heritability. There are significant associations among Factor XI (FXI) levels and SNPs in the KNG1 and F11 loci. Our aim was to identify the genetic variation of KNG1 and F11 that might account for the variability of FXI levels. The KNG1 and F11 loci were sequenced completely in 110 unrelated individuals from the GAIT-2 (Genetic Analysis of Idiopathic Thrombophilia 2) Project using Next Generation Sequencing on an Illumina MiSeq. The GAIT-2 Project is a study of 935 individuals in 35 extended Spanish families selected through a proband with idiopathic thrombophilia. Among the 110 individuals, a subset of 40 individuals was chosen as a discovery sample for identifying variants. A total of 762 genetic variants were detected. Several significant associations were established among common variants and low-frequency variants sets in KNG1 and F11 with FXI levels using the PLINK and SKAT packages. Among these associations, those of rs710446 and five low-frequency variant sets in KNG1 with FXI level variation were significant after multiple testing correction and permutation. Also, two putative pathogenic mutations related to high and low FXI levels were identified by data filtering and in silico predictions. This study of KNG1 and F11 loci should help to understand the connection between genotypic variation and variation in FXI levels. The functional genetic variants should be useful as markers of thromboembolic risk.
Upper Cretaceous sequences and sea-level history, New Jersey Coastal Plain

USGS Publications Warehouse

Miller, K.G.; Sugarman, P.J.; Browning, J.V.; Kominz, M.A.; Olsson, R.K.; Feigenson, M.D.; Hernandez, J.C.

2004-01-01

We developed a Late Cretaceous sea-level estimate from Upper Cretaceous sequences at Bass River and Ancora, New Jersey (ODP [Ocean Drilling Program] Leg 174AX). We dated 11-14 sequences by integrating Sr isotope and biostratigraphy (age resolution ±0.5 m.y.) and then estimated paleoenvironmental changes within the sequences from lithofacies and biofacies analyses. Sequences generally shallow upsection from middle-neritic to inner-neritic paleodepths, as shown by the transition from thin basal glauconite shelf sands (transgressive systems tracts [TST]), to medial-prodelta silty clays (highstand systems tracts [HST]), and finally to upper-delta-front quartz sands (HST). Sea-level estimates obtained by backstripping (accounting for paleodepth variations, sediment loading, compaction, and basin subsidence) indicate that large (>25 m) and rapid (≪1 m.y.) sea-level variations occurred during the Late Cretaceous greenhouse world. The fact that the timing of Upper Cretaceous sequence boundaries in New Jersey is similar to the sea-level lowering records of Exxon Production Research Company (EPR), northwest European sections, and Russian platform outcrops points to a global cause. Because backstripping, seismicity, seismic stratigraphic data, and sediment-distribution patterns all indicate minimal tectonic effects on the New Jersey Coastal Plain, we interpret that we have isolated a eustatic signature. The only known mechanism that can explain such global changes, glacio-eustasy, is consistent with foraminiferal δ¹⁸O data. Either continental ice sheets paced sea-level changes during the Late Cretaceous, or our understanding of causal mechanisms for global sea-level change is fundamentally flawed. Comparison of our eustatic history with published ice-sheet models and Milankovitch predictions suggests that small (5-10 × 10⁶ km³), ephemeral, and areally restricted Antarctic ice sheets paced the Late Cretaceous global sea-level change.
New Jersey and Russian eustatic estimates are typically one-half of the EPR amplitudes, though this difference varies through time, yielding markedly different eustatic curves. We conclude that New Jersey provides the best available estimate for Late Cretaceous sea-level variations. © 2004 Geological Society of America.
Novel determinants of mammalian primary microRNA processing revealed by systematic evaluation of hairpin-containing transcripts and human genetic variation

PubMed Central

Roden, Christine; Gaillard, Jonathan; Kanoria, Shaveta; Rennie, William; Barish, Syndi; Cheng, Jijun; Pan, Wen; Liu, Jun; Cotsapas, Chris; Ding, Ye; Lu, Jun

2017-01-01

Mature microRNAs (miRNAs) are processed from hairpin-containing primary miRNAs (pri-miRNAs). However, rules that distinguish pri-miRNAs from other hairpin-containing transcripts in the genome are incompletely understood. By developing a computational pipeline to systematically evaluate 30 structural and sequence features of mammalian RNA hairpins, we report several new rules that are preferentially utilized in miRNA hairpins and govern efficient pri-miRNA processing. We propose that a hairpin stem length of 36 ± 3 nt is optimal for pri-miRNA processing. We identify two bulge-depleted regions on the miRNA stem, located ∼16–21 nt and ∼28–32 nt from the base of the stem, that are less tolerant of unpaired bases. We further show that the CNNC primary sequence motif selectively enhances the processing of optimal-length hairpins. We predict that a small but significant fraction of human single-nucleotide polymorphisms (SNPs) alter pri-miRNA processing, and confirm several predictions experimentally including a disease-causing mutation. Our study enhances the rules governing mammalian pri-miRNA processing and suggests a diverse impact of human genetic variation on miRNA biogenesis. PMID:28087842
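Two of the rules stated above, the optimal stem length of 36 ± 3 nt and the CNNC primary sequence motif, are simple enough to express as checks. The sketch below is only an illustration of those two rules taken in isolation; it does not reproduce the authors' pipeline, it does not model where the motif must sit relative to the hairpin, and the function names and toy sequences are hypothetical.

```python
import re

OPTIMAL_STEM_NT = 36    # optimal stem length proposed in the abstract
STEM_TOLERANCE_NT = 3   # the "± 3 nt" window


def stem_length_is_optimal(stem_length_nt):
    """True if the stem length falls within 36 ± 3 nt."""
    return abs(stem_length_nt - OPTIMAL_STEM_NT) <= STEM_TOLERANCE_NT


def find_cnnc(sequence):
    """Return 0-based start positions of every CNNC match, overlaps included.

    DNA input is accepted: T is converted to U before matching.
    """
    rna = sequence.upper().replace("T", "U")
    # A zero-width lookahead lets overlapping motifs all be reported.
    return [match.start() for match in re.finditer(r"(?=C[ACGU]{2}C)", rna)]


# Hypothetical sequences for illustration only.
print(find_cnnc("AACUGCAA"))
print(stem_length_is_optimal(35))
```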
Genetic diversity based on 28S rDNA sequences among populations of Culex quinquefasciatus collected at different locations in Tamil Nadu, India.

PubMed

Sakthivelkumar, S; Ramaraj, P; Veeramani, V; Janarthanan, S

2015-09-01

The aim of the present study was to determine whether genetic variability exists among populations of Culex quinquefasciatus, knowledge of which would be a valuable tool in the management of mosquito control programmes. Populations of Cx. quinquefasciatus collected at different locations in Tamil Nadu were analyzed for genetic variation based on nucleotide sequences of the 28S rDNA D2 region. A high degree of genetic polymorphism was detected in the predicted secondary structures of the D2 region of 28S rDNA in spite of high nucleotide sequence similarity. The findings based on secondary structure using rDNA sequences suggest the existence of a complex genotypic diversity in the Cx. quinquefasciatus population collected at different locations in Tamil Nadu, India. This complexity of genetic diversity within a single mosquito population collected at different locations is considered an important issue with respect to its influence on the nature of the vector potential of these mosquitoes.
Communication variations and aircrew performance

NASA Technical Reports Server (NTRS)

Kanki, Barbara G.; Folk, Valerie G.; Irwin, Cheryl M.

1991-01-01

The relationship between communication variations and aircrew performance (high-error vs low-error performances) was investigated by analyzing the coded verbal transcripts derived from the videotape records of 18 two-person air transport crews who participated in a high-fidelity, full-mission flight simulation. The flight scenario included a task which involved abnormal operations and required the coordinated efforts of all crew members. It was found that the best-performing crews were characterized by nearly identical patterns of communication, whereas the midrange and poorer performing crews showed a great deal of heterogeneity in their speech patterns. Although some specific speech sequences can be interpreted as being more or less facilitative to the crew-coordination process, predictability appears to be the key ingredient for enhancing crew performance. Crews communicating in highly standard (hence predictable) ways were better able to coordinate their task, whereas crews characterized by multiple, nonstandard communication profiles were less effective in their performance.
Genetics of Inflammatory Bowel Diseases

PubMed Central

McGovern, Dermot; Kugathasan, Subra; Cho, Judy H.

2015-01-01

In this Review, we provide an update on genome-wide association studies (GWAS) in inflammatory bowel disease (IBD). In addition, we summarize progress in defining the functional consequences of associated alleles for coding and non-coding genetic variation. In the small minority of loci where major association signals correspond to non-synonymous variation, we summarize studies defining their functional effects and implications for therapeutic targeting. Importantly, the large majority of GWAS-associated loci involve non-coding variation, many of which modulate levels of gene expression. Recent expression quantitative trait loci (eQTL) studies have established that expression of the large majority of human genes is regulated by non-coding genetic variation. Significant advances in defining the epigenetic landscape have demonstrated that IBD GWAS signals are highly enriched within cell-specific active enhancer marks. Studies in European ancestry populations have dominated the landscape of IBD genetics studies, but increasingly, studies in Asian and African-American populations are being reported. Common variation accounts for only a modest fraction of the predicted heritability and the role of rare genetic variation of higher effects (i.e. odds ratios markedly deviating from one) is increasingly being identified through sequencing efforts. These sequencing studies have been particularly productive in very-early onset, more severe cases. A major challenge in IBD genetics will be harnessing the vast array of genetic discovery for clinical utility, through emerging precision medicine initiatives. We discuss the rapidly evolving area of direct to consumer genetic testing, as well as the current utility of clinical exome sequencing, especially in very early onset, severe IBD cases. We summarize recent progress in the pharmacogenetics of IBD with respect to partitioning patient responses to anti-TNF and thiopurine therapies. Highly collaborative studies across research centers and across subspecialties and disciplines will be required to fully realize the promise of genetic discovery in IBD. PMID:26255561
Systematics of Cladophora spp. (Chlorophyta) from North Carolina, USA, based upon morphology and DNA sequence data with a description of Cladophora subtilissima sp. nov.

PubMed

Taylor, Robin L; Bailey, Jeffrey Craig; Freshwater, David Wilson

2017-06-01

Identification of Cladophora species is challenging due to conservation of gross morphology, few discrete autapomorphies, and environmental influences on morphology. Twelve species of marine Cladophora were reported from North Carolina waters. Cladophora specimens were collected from inshore and offshore marine waters for DNA sequence and morphological analyses. The nuclear-encoded rRNA internal transcribed spacer regions (ITS) were sequenced for 105 specimens and used in molecular assisted identification. The ITS1 and ITS2 region was highly variable, and sequences were sorted into ITS Sets of Alignable Sequences (SASs). Sequencing of short hyper-variable ITS1 sections from Cladophora type specimens was used to positively identify species represented by SASs when the types were made available. Secondary structures for the ITS1 locus were also predicted for each specimen and compared to predicted structures from Cladophora sequences available in GenBank. Nine ITS SASs were identified and representative specimens chosen for phylogenetic analyses of 18S and 28S rRNA gene sequences to reveal relationships with other Cladophora species. Phylogenetic analyses indicated that marine Cladophorales were polyphyletic and separated into two clades, the Cladophora clade and the "Siphonocladales" clade. Morphological analyses were performed to assess the consistency of character states within species, and complement the DNA sequence analyses. These analyses revealed intra- and interspecific character state variation, and that combined molecular and morphological analyses were required for the identification of species. One new report, Cladophora dotyana, and one new species Cladophora subtilissima sp. nov., were revealed, and increased the biodiversity of North Carolina marine Cladophora to 14 species. © 2017 Phycological Society of America.
Transcriptome Profiling of Antimicrobial Resistance in Pseudomonas aeruginosa.

PubMed

Khaledi, Ariane; Schniederjans, Monika; Pohl, Sarah; Rainer, Roman; Bodenhofer, Ulrich; Xia, Boyang; Klawonn, Frank; Bruchmann, Sebastian; Preusse, Matthias; Eckweiler, Denitsa; Dötsch, Andreas; Häussler, Susanne

2016-08-01

Emerging resistance to antimicrobials and the lack of new antibiotic drug candidates underscore the need for optimization of current diagnostics and therapies to diminish the evolution and spread of multidrug resistance. As the antibiotic resistance status of a bacterial pathogen is defined by its genome, resistance profiling by applying next-generation sequencing (NGS) technologies may in the future accomplish pathogen identification, prompt initiation of targeted individualized treatment, and the implementation of optimized infection control measures. In this study, qualitative RNA sequencing was used to identify key genetic determinants of antibiotic resistance in 135 clinical Pseudomonas aeruginosa isolates from diverse geographic and infection site origins. By applying transcriptome-wide association studies, adaptive variations associated with resistance to the antibiotic classes fluoroquinolones, aminoglycosides, and β-lactams were identified. Besides potential novel biomarkers with a direct correlation to resistance, global patterns of phenotype-associated gene expression and sequence variations were identified by predictive machine learning approaches. Our research serves to establish genotype-based molecular diagnostic tools for the identification of the current resistance profiles of bacterial pathogens and paves the way for faster diagnostics for more efficient, targeted treatment strategies to also mitigate the future potential for resistance evolution. Copyright © 2016, American Society for Microbiology. All Rights Reserved.
Transcriptome Profiling of Antimicrobial Resistance in Pseudomonas aeruginosa

PubMed Central

Khaledi, Ariane; Schniederjans, Monika; Pohl, Sarah; Rainer, Roman; Bodenhofer, Ulrich; Xia, Boyang; Klawonn, Frank; Bruchmann, Sebastian; Preusse, Matthias; Eckweiler, Denitsa; Dötsch, Andreas

2016-01-01

Emerging resistance to antimicrobials and the lack of new antibiotic drug candidates underscore the need for optimization of current diagnostics and therapies to diminish the evolution and spread of multidrug resistance. As the antibiotic resistance status of a bacterial pathogen is defined by its genome, resistance profiling by applying next-generation sequencing (NGS) technologies may in the future accomplish pathogen identification, prompt initiation of targeted individualized treatment, and the implementation of optimized infection control measures. In this study, qualitative RNA sequencing was used to identify key genetic determinants of antibiotic resistance in 135 clinical Pseudomonas aeruginosa isolates from diverse geographic and infection site origins. By applying transcriptome-wide association studies, adaptive variations associated with resistance to the antibiotic classes fluoroquinolones, aminoglycosides, and β-lactams were identified. Besides potential novel biomarkers with a direct correlation to resistance, global patterns of phenotype-associated gene expression and sequence variations were identified by predictive machine learning approaches. Our research serves to establish genotype-based molecular diagnostic tools for the identification of the current resistance profiles of bacterial pathogens and paves the way for faster diagnostics for more efficient, targeted treatment strategies to also mitigate the future potential for resistance evolution. PMID:27216077
[Genetic characteristics of hemagglutinin in measles viruses isolated in Henan Province, China].

PubMed

Feng, Da-Xing; Seng, Ming-Hua; Liu, Qian; Zhang, Zhen-Ying

2014-03-01

This study aims to investigate the genetic characteristics of hemagglutinin in wild-type measles viruses in Henan Province, China and to provide a basis for measles control and elimination. Specimens were collected from suspected measles cases in Henan during 2008-2012. Cell culture was performed for virus isolation, and RT-PCR was used to amplify hemagglutinin gene. The PCR products were sequenced and analyzed, including construction of phylogenetic tree and analysis of the distance between the isolated virus and the reference virus; then, the variations in predicted amino acids were analyzed. The results showed that 12 measles viruses were isolated in Henan Province and identified as H1a genotype; the nucleotide and amino acid homologies were 98.0%-100% and 97.2%-99.8%, respectively. One glycosylation site changed in all the 12 sequences because of the amino acid mutation from serine to asparagine at the 240th site, as compared with Edmonston-wt. USA/54/A. Overall, the wild-type measles virus genotype circulating in Henan Province from 2008 to 2012 was H1a, with high homology between strains; there were some variations in amino acid sequences, resulting in glycosylation site deletion.
The genetic breakdown of sporophytic self-incompatibility in Tolpis coronopifolia (Asteraceae).

PubMed

Koseva, Boryana; Crawford, Daniel J; Brown, Keely E; Mort, Mark E; Kelly, John K

2017-12-01

Angiosperm diversity has been shaped by mating system evolution, with the most common transition from outcrossing to self-fertilizing. To investigate the genetic basis of this transition, we performed crosses between two species endemic to the Canary Islands, the self-compatible (SC) species Tolpis coronopifolia and its self-incompatible (SI) relative Tolpis santosii. We scored self-compatibility as self-seed set of recombinant plants within two F2 populations. To map and genetically characterize the breakdown of SI, we built a draft genome sequence of T. coronopifolia, genotyped F2 plants using multiplexed shotgun genotyping (MSG), and located MSG markers to the genome sequence. We identified a single quantitative trait locus (QTL) that explains nearly all variation in self-seed set in both F2 populations. To identify putative causal genetic variants within the QTL, we performed transcriptome sequencing on mature floral tissue from both SI and SC species, constructed a transcriptome for each species, and then located each predicted transcript to the T. coronopifolia genome sequence. We annotated each predicted gene within the QTL and found two strong candidates for SI breakdown. Each gene has a coding sequence insertion/deletion mutation within the SC species that produces a truncated protein. Homologs of each gene have been implicated in pollen development, pollen germination, and pollen tube growth in other species. © 2017 The Authors. New Phytologist © 2017 New Phytologist Trust.
Allele-specific copy-number discovery from whole-genome and whole-exome sequencing.

PubMed

Wang, WeiBo; Wang, Wei; Sun, Wei; Crowley, James J; Szatkiewicz, Jin P

2015-08-18

Copy-number variants (CNVs) are a major form of genetic variation and a risk factor for various human diseases, so it is crucial to accurately detect and characterize them. It is conceivable that allele-specific reads from high-throughput sequencing data could be leveraged to both enhance CNV detection and produce allele-specific copy number (ASCN) calls. Although statistical methods have been developed to detect CNVs using whole-genome sequence (WGS) and/or whole-exome sequence (WES) data, information from allele-specific read counts has not yet been adequately exploited. In this paper, we develop an integrated method, called AS-GENSENG, which incorporates allele-specific read counts in CNV detection and estimates ASCN using either WGS or WES data. To evaluate the performance of AS-GENSENG, we conducted extensive simulations, generated empirical data using existing WGS and WES data sets and validated predicted CNVs using an independent methodology. We conclude that AS-GENSENG not only predicts accurate ASCN calls but also improves the accuracy of total copy number calls, owing to its unique ability to exploit information from both total and allele-specific read counts while accounting for various experimental biases in sequence data. Our novel, user-friendly and computationally efficient method and a complete analytic protocol are freely available at https://sourceforge.net/projects/asgenseng/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Complete genome sequence and analysis of the industrial Saccharomyces cerevisiae strain N85 used in Chinese rice wine production.

PubMed

Zhang, Weiping; Li, Yudong; Chen, Yiwang; Xu, Sha; Du, Guocheng; Shi, Huidong; Zhou, Jingwen; Chen, Jian

2018-02-05

Chinese rice wine is a popular traditional alcoholic beverage in China, while its brewing processes have rarely been explored. We herein report the first gapless, near-finished genome sequence of the yeast strain Saccharomyces cerevisiae N85 for Chinese rice wine production. Several assembly methods were used to integrate Pacific Biosciences (PacBio) and Illumina sequencing data to achieve high-quality genome sequencing of the strain. The genome encodes more than 6,000 predicted proteins, and 238 long non-coding RNAs, which are validated by RNA-sequencing data. Moreover, our annotation predicts 171 novel genes that are not present in the reference S288c genome. We also identified 65,902 single nucleotide polymorphisms and small indels, many of which are located within genic regions. Dozens of larger copy-number variations and translocations were detected, mainly enriched in the subtelomeres, suggesting these regions may be related to genomic evolution. This study will serve as a milestone in the study of Chinese rice wine and related beverages in China and in other countries. It will help to develop more scientific and modern fermentation processes of Chinese rice wine, and explore metabolism pathways of desired and harmful components in Chinese rice wine to improve its taste and nutritional value. © The Author(s) 2018. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Extensive de novo mutation rate variation between individuals and across the genome of Chlamydomonas reinhardtii

PubMed Central

Ness, Rob W.; Morgan, Andrew D.; Vasanthakrishnan, Radhakrishnan B.; Colegrave, Nick; Keightley, Peter D.

2015-01-01

Describing the process of spontaneous mutation is fundamental for understanding the genetic basis of disease, the threat posed by declining population size in conservation biology, and much of evolutionary biology. Directly studying spontaneous mutation has been difficult, however, because new mutations are rare. Mutation accumulation (MA) experiments overcome this by allowing mutations to build up over many generations in the near absence of natural selection. Here, we sequenced the genomes of 85 MA lines derived from six genetically diverse strains of the green alga Chlamydomonas reinhardtii. We identified 6843 new mutations, more than any other study of spontaneous mutation. We observed sevenfold variation in the mutation rate among strains and that mutator genotypes arose, increasing the mutation rate approximately eightfold in some replicates. We also found evidence for fine-scale heterogeneity in the mutation rate, with certain sequence motifs mutating at much higher rates, and clusters of multiple mutations occurring at closely linked sites. There was little evidence, however, for mutation rate heterogeneity between chromosomes or over large genomic regions of 200 kbp. We generated a predictive model of the mutability of sites based on their genomic properties, including local GC content, gene expression level, and local sequence context. Our model accurately predicted the average mutation rate and natural levels of genetic diversity of sites across the genome. Notably, trinucleotides vary 17-fold in rate between the most and least mutable sites. Our results uncover a rich heterogeneity in the process of spontaneous mutation both among individuals and across the genome. PMID:26260971
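As a rough illustration of how a per-site, per-generation mutation rate is usually derived from a mutation accumulation experiment like the one in this record, the sketch below divides the number of new mutations by the total number of site-generations surveyed. The function name and the numbers in the usage line are hypothetical; the abstract does not report the generation counts or callable genome size needed to reproduce the study's estimates.

```python
def mutation_rate(n_mutations, n_lines, generations_per_line, callable_sites):
    """Estimate mutations per site per generation.

    Assumes every line was propagated for the same number of generations
    and has the same number of callable sites (a simplification).
    """
    site_generations = n_lines * generations_per_line * callable_sites
    if site_generations <= 0:
        raise ValueError("lines, generations and sites must all be positive")
    return n_mutations / site_generations


# Hypothetical inputs, chosen only to show the arithmetic.
example_rate = mutation_rate(10, 5, 100, 1000)
```

In practice each line contributes its own generation count and callable-site count, so the denominator becomes a sum over lines instead of a single product.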
Imperfect duplicate insertions type of mutations in plasmepsin V modulates binding properties of PEXEL motifs of export proteins in Indian Plasmodium vivax.

PubMed

Rawat, Manmeet; Vijay, Sonam; Gupta, Yash; Tiwari, Pramod Kumar; Sharma, Arun

2013-01-01

Plasmepsin V (PM-V) has functionally conserved orthologues across the Plasmodium genus, whose binding and antigenic processing at the PEXEL motifs for export of about 200-300 essential proteins is important for the virulence and viability of the causative Plasmodium species. This study was undertaken to determine P. vivax plasmepsin V Ind (PvPM-V-Ind) PEXEL motif export pathway for pathogenicity-related proteins/antigens export thereby altering plasmodium exportome during erythrocytic stages. We identify and characterize Plasmodium vivax plasmepsin-V-Ind (mutant) gene by cloning, sequence analysis, in silico bioinformatic protocols and structural modeling predictions based on docking studies on binding capacity with PEXEL motifs processing in terms of binding and accessibility of export proteins. Cloning and sequence analysis for genetic diversity demonstrates PvPM-V-Ind (mutant) gene is highly conserved among all isolates from different geographical regions of India. Imperfect duplicate insertion types of mutations (SVSE from 246-249 AA and SLSE from 266-269 AA) were identified among all Indian isolates in comparison to P. vivax Sal-1 (PvPM-V-Sal 1) isolate. In silico bioinformatics interaction studies of PEXEL peptide and active enzyme reveal that PvPM-V-Ind (mutant) is only active in endoplasmic reticulum lumen and membrane embedding is essential for activation of plasmepsin V. Structural modeling predictions based on docking studies with PEXEL motif show significant variation in substrate protein binding of these imperfect mutations with data mined PEXEL sequences. The predicted variation in the docking score and interacting amino acids of PvPM-V-Ind (mutant) proteins with PEXEL and lopinavir suggests a modulation in the activity of PvPM-V in terms of binding and accessibility at these sites. Our functional modeled validation of PvPM-V-Ind (mutant) imperfect duplicate insertions with data mined PEXEL sequences leading to altered binding and substrate accessibility of the enzyme makes it a plausible target to investigate export mechanisms for in silico virtual screening and novel pharmacophore designing.
Imperfect Duplicate Insertions Type of Mutations in Plasmepsin V Modulates Binding Properties of PEXEL Motifs of Export Proteins in Indian Plasmodium vivax

PubMed Central

Rawat, Manmeet; Vijay, Sonam; Gupta, Yash; Tiwari, Pramod Kumar; Sharma, Arun

2013-01-01

Introduction: Plasmepsin V (PM-V) has functionally conserved orthologues across the Plasmodium genus, whose binding and antigenic processing at the PEXEL motifs for export of about 200–300 essential proteins is important for the virulence and viability of the causative Plasmodium species. This study was undertaken to determine P. vivax plasmepsin V Ind (PvPM-V-Ind) PEXEL motif export pathway for pathogenicity-related proteins/antigens export thereby altering plasmodium exportome during erythrocytic stages. Method: We identify and characterize Plasmodium vivax plasmepsin-V-Ind (mutant) gene by cloning, sequence analysis, in silico bioinformatic protocols and structural modeling predictions based on docking studies on binding capacity with PEXEL motifs processing in terms of binding and accessibility of export proteins. Results: Cloning and sequence analysis for genetic diversity demonstrates PvPM-V-Ind (mutant) gene is highly conserved among all isolates from different geographical regions of India. Imperfect duplicate insertion types of mutations (SVSE from 246–249 AA and SLSE from 266–269 AA) were identified among all Indian isolates in comparison to P. vivax Sal-1 (PvPM-V-Sal 1) isolate. In silico bioinformatics interaction studies of PEXEL peptide and active enzyme reveal that PvPM-V-Ind (mutant) is only active in endoplasmic reticulum lumen and membrane embedding is essential for activation of plasmepsin V. Structural modeling predictions based on docking studies with PEXEL motif show significant variation in substrate protein binding of these imperfect mutations with data mined PEXEL sequences. The predicted variation in the docking score and interacting amino acids of PvPM-V-Ind (mutant) proteins with PEXEL and lopinavir suggests a modulation in the activity of PvPM-V in terms of binding and accessibility at these sites. Conclusion/Significance: Our functional modeled validation of PvPM-V-Ind (mutant) imperfect duplicate insertions with data mined PEXEL sequences leading to altered binding and substrate accessibility of the enzyme makes it a plausible target to investigate export mechanisms for in silico virtual screening and novel pharmacophore designing. PMID:23555891
Scrutinizing MHC-I binding peptides and their limits of variation.

PubMed

Koch, Christian P; Perna, Anna M; Pillong, Max; Todoroff, Nickolay K; Wrede, Paul; Folkers, Gerd; Hiss, Jan A; Schneider, Gisbert

2013-01-01

Designed peptides that bind to major histocompatibility protein I (MHC-I) allomorphs bear the promise of representing epitopes that stimulate a desired immune response. A rigorous bioinformatical exploration of sequence patterns hidden in peptides that bind to the mouse MHC-I allomorph H-2K(b) is presented. We exemplify and validate these motif findings by systematically dissecting the epitope SIINFEKL and analyzing the resulting fragments for their binding potential to H-2K(b) in a thermal denaturation assay. The results demonstrate that only fragments exclusively retaining the carboxy- or amino-terminus of the reference peptide exhibit significant binding potential, with the N-terminal pentapeptide SIINF as shortest ligand. This study demonstrates that sophisticated machine-learning algorithms excel at extracting fine-grained patterns from peptide sequence data and predicting MHC-I binding peptides, thereby considerably extending existing linear prediction models and providing a fresh view on the computer-based molecular design of future synthetic vaccines. The server for prediction is available at http://modlab-cadd.ethz.ch (SLiDER tool, MHC-I version 2012).

Stresses and deformations in cross-ply composite tubes subjected to a uniform temperature change: Elasticity and Approximate Solutions

NASA Technical Reports Server (NTRS)

Hyer, M. W.; Cooper, D. E.; Cohen, D.

1985-01-01

The effects of a uniform temperature change on the stresses and deformations of composite tubes are investigated. The accuracy of an approximate solution based on the principle of complementary virtual work is determined. Interest centers on tube response away from the ends and so a planar elasticity approach is used. For the approximate solution a piecewise linear variation of stresses with the radial coordinate is assumed. The results from the approximate solution are compared with the elasticity solution. The stress predictions agree well, particularly peak interlaminar stresses. Surprisingly, the axial deformations also agree well, despite the fact that the deformations predicted by the approximate solution do not satisfy the interface displacement continuity conditions required by the elasticity solution. The study shows that the axial thermal expansion coefficient of tubes with a specific number of axial and circumferential layers depends on the stacking sequence. This is in contrast to classical lamination theory, which predicts the expansion to be independent of the stacking arrangement. As expected, the sign and magnitude of the peak interlaminar stresses depend on stacking sequence.
Kernel-based whole-genome prediction of complex traits: a review.

PubMed

Morota, Gota; Gianola, Daniel

2014-01-01

Prediction of genetic values has been a focus of applied quantitative genetics since the beginning of the 20th century, with renewed interest following the advent of the era of whole genome-enabled prediction. Opportunities offered by the emergence of high-dimensional genomic data fueled by post-Sanger sequencing technologies, especially molecular markers, have driven researchers to extend Ronald Fisher and Sewall Wright's models to confront new challenges. In particular, kernel methods are gaining consideration as a regression method of choice for genome-enabled prediction. Complex traits are presumably influenced by many genomic regions working in concert with others (clearly so when considering pathways), thus generating interactions. Motivated by this view, a growing number of statistical approaches based on kernels attempt to capture non-additive effects, either parametrically or non-parametrically. This review centers on whole-genome regression using kernel methods applied to a wide range of quantitative traits of agricultural importance in animals and plants. We discuss various kernel-based approaches tailored to capturing total genetic variation, with the aim of arriving at an enhanced predictive performance in the light of available genome annotation information. Connections between prediction machines born in animal breeding, statistics, and machine learning are revisited, and their empirical prediction performance is discussed. Overall, while some encouraging results have been obtained with non-parametric kernels, recovering non-additive genetic variation in a validation dataset remains a challenge in quantitative genetics.
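To make the idea of a kernel on marker data concrete, the sketch below builds a Gaussian (radial basis function) kernel matrix from marker genotypes. The Gaussian kernel is one common choice in kernel-based genomic prediction, not necessarily the one emphasized in the review; the 0/1/2 genotype coding, the function name, and the bandwidth value are assumptions made for illustration.

```python
import math


def gaussian_kernel(genotypes, bandwidth):
    """Return the n x n Gaussian kernel matrix for n individuals.

    genotypes: list of equal-length lists of marker codes (assumed 0/1/2).
    bandwidth: positive smoothing parameter h in exp(-d^2 / h).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(genotypes)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between the two marker vectors.
            d2 = sum((a - b) ** 2 for a, b in zip(genotypes[i], genotypes[j]))
            kernel[i][j] = kernel[j][i] = math.exp(-d2 / bandwidth)
    return kernel


# Three hypothetical individuals genotyped at four markers.
markers = [[0, 1, 2, 0], [0, 1, 1, 0], [2, 0, 0, 2]]
K = gaussian_kernel(markers, bandwidth=4.0)
```

Individuals with similar marker profiles get kernel values near 1 and dissimilar ones values near 0; a matrix like this would then enter a regression on phenotypes, with the bandwidth typically tuned by cross-validation.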
Microsatellite analysis in the genome of Acanthaceae: An in silico approach.

PubMed

Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar

2015-01-01

Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. The microsatellites existing in the complete genome sequences would help to attain a direct role in the genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes. The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. The whole nucleotide sequences of Acanthaceae species were obtained from the National Center for Biotechnology Information database and screened for the presence of SSRs. The SSR Locator tool was used to predict the microsatellites and its inbuilt Primer3 module was used for primer designing. In total, 110 repeats from 108 sequences of Acanthaceae family plant genomes were identified, and the occurrence of dinucleotide repeats was found to be abundant in the genome sequences. The essential amino acid isoleucine was found rich in all the sequences. We also designed the SSR-based primers/markers for 59 sequences of this family that contain microsatellite repeats in their genome. The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to the Acanthaceae family in the future.
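The screening step in this record (scanning nucleotide sequences for SSRs) can be sketched with a regular expression that finds perfect tandem repeats. This is a minimal stand-in, not the SSR Locator algorithm; the function name, the default thresholds, and the example sequence are assumptions.

```python
import re


def find_ssrs(seq, min_repeats=5, max_motif=6):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    A motif of 1..max_motif bases must occur at least min_repeats times in
    a row. The lazy quantifier makes the shortest motif win, so a run of
    "AAAA..." is reported as a mononucleotide repeat, not as "AA" repeats.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Hypothetical sequence containing six copies of the dinucleotide AT.
ssr_hits = find_ssrs("GGC" + "AT" * 6 + "CCG")
```

Dedicated tools usually apply a different minimum repeat count per motif length and also handle imperfect or compound repeats, which this sketch ignores.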
Discovery of somatic mutations in the progression of chronic myeloid leukemia by whole-exome sequencing.

PubMed

Huang, Y; Zheng, J; Hu, J D; Wu, Y A; Zheng, X Y; Liu, T B; Chen, F L

2014-02-19

We performed whole-exome sequencing in samples representing accelerated phase (AP) and blastic crisis (BC) in a subject with chronic myeloid leukemia (CML). A total of 12.74 Gb clean data were generated, achieving a mean depth coverage of 64.45 and 69.53 for AP and BC samples, respectively, of the target region. A total of 148 somatic variants were detected, including 76 insertions and deletions (indels), 64 single-nucleotide variations (SNV), and 8 structural variations (SV). On the basis of annotation and functional prediction analysis, we identified 3 SNVs and 6 SVs that showed a potential association with CML progression. Among the genes that harbor the identified variants, GATA2 has previously been reported to play important roles in the progression from AP to BC in CML. Identification of these genes will allow us to gain a better understanding of the pathological mechanism of CML and represents a critical advance toward new molecular diagnostic tests for the development of potential therapies for CML.
Diversity of Pneumolysin and Pneumococcal Histidine Triad Protein D of Streptococcus pneumoniae Isolated from Invasive Diseases in Korean Children.

PubMed

Yun, Ki Wook; Lee, Hyunju; Choi, Eun Hwa; Lee, Hoan Jong

2015-01-01

Pneumolysin (Ply) and pneumococcal histidine triad protein D (PhtD) are candidate proteins for a next-generation pneumococcal vaccine. We aimed to analyze the genetic diversity and antigenic heterogeneity of Ply and PhtD for 173 pneumococci isolated from invasive diseases in Korean children. Allele was designated based on the variation of amino acid sequence. Antigenicity was predicted by the amino acid hydrophobicity of the region. There were seven and 39 allele types for the ply and phtD genes, respectively. The nucleotide sequence identity was 97.2%-99.9% for ply and 91.4%-98.0% for phtD gene. Only minor variations in hydrophobicity were noted among the antigenicity plots of Ply and PhtD. Overall, the allele types of the ply and phtD genes were remarkably homogeneous, and the antigenic diversity of the corresponding proteins was very limited. The Ply and PhtD could be useful antigens for universal pneumococcal vaccines.
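The sequence identity figures quoted in this record are percentages of matching positions between aligned sequences. A minimal sketch of that calculation follows, assuming the two sequences are already aligned to equal length and that positions with a gap character are skipped; the function name and the gap convention are assumptions, since the abstract does not state how identity was computed.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Two hypothetical 10-base alleles differing at one position.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
```

Whether gapped columns count as mismatches or are excluded changes the reported value, so published identities are only comparable when the same convention is used.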
VWF mutations and new sequence variations identified in healthy controls are more frequent in the African-American population.

PubMed

Bellissimo, Daniel B; Christopherson, Pamela A; Flood, Veronica H; Gill, Joan Cox; Friedman, Kenneth D; Haberichter, Sandra L; Shapiro, Amy D; Abshire, Thomas C; Leissinger, Cindy; Hoots, W Keith; Lusher, Jeanne M; Ragni, Margaret V; Montgomery, Robert R

2012-03-01

Diagnosis and classification of VWD is aided by molecular analysis of the VWF gene. Because VWF polymorphisms have not been fully characterized, we performed VWF laboratory testing and gene sequencing of 184 healthy controls with a negative bleeding history. The controls included 66 (35.9%) African Americans (AAs). We identified 21 new sequence variations, 13 (62%) of which occurred exclusively in AAs and 2 (G967D, T2666M) that were found in 10%-15% of the AA samples, suggesting they are polymorphisms. We identified 14 sequence variations reported previously as VWF mutations, the majority of which were type 1 mutations. These controls had VWF Ag levels within the normal range, suggesting that these sequence variations might not always reduce plasma VWF levels. Eleven mutations were found in AAs, and the frequency of M740I, H817Q, and R2185Q was 15%-18%. Ten AA controls had the 2N mutation H817Q; 1 was homozygous. The average factor VIII level in this group was 99 IU/dL, suggesting that this variation may confer little or no clinical symptoms. This study emphasizes the importance of sequencing healthy controls to understand ethnic-specific sequence variations so that asymptomatic sequence variations are not misidentified as mutations in other ethnic or racial groups.
Non-codingRNA sequence variations in human chronic lymphocytic leukemia and colorectal cancer.

PubMed

Wojcik, Sylwia E; Rossi, Simona; Shimizu, Masayoshi; Nicoloso, Milena S; Cimmino, Amelia; Alder, Hansjuerg; Herlea, Vlad; Rassenti, Laura Z; Rai, Kanti R; Kipps, Thomas J; Keating, Michael J; Croce, Carlo M; Calin, George A

2010-02-01

Cancer is a genetic disease in which the interplay between alterations in protein-coding genes and non-coding RNAs (ncRNAs) plays a fundamental role. In recent years, the full coding component of the human genome was sequenced in various cancers, whereas such attempts related to ncRNAs are still fragmentary. We screened genomic DNAs for sequence variations in 148 microRNAs (miRNAs) and ultraconserved regions (UCRs) loci in patients with chronic lymphocytic leukemia (CLL) or colorectal cancer (CRC) by Sanger technique and further tried to elucidate the functional consequences of some of these variations. We found sequence variations in miRNAs in both sporadic and familial CLL cases, mutations of UCRs in CLLs and CRCs and, in certain instances, detected functional effects of these variations. Furthermore, by integrating our data with previously published data on miRNA sequence variations, we have created a catalog of DNA sequence variations in miRNAs/ultraconserved genes in human cancers. These findings argue that ncRNAs are targeted by both germ line and somatic mutations as well as by single-nucleotide polymorphisms with functional significance for human tumorigenesis. Sequence variations in ncRNA loci are frequent and some have functional and biological significance. Such information can be exploited to further investigate on a genome-wide scale the frequency of genetic variations in ncRNAs and their functional meaning, as well as for the development of new diagnostic and prognostic markers for leukemias and carcinomas.
Non-codingRNA sequence variations in human chronic lymphocytic leukemia and colorectal cancer

PubMed Central

Wojcik, Sylwia E.; Rossi, Simona; Shimizu, Masayoshi; Nicoloso, Milena S.; Cimmino, Amelia; Alder, Hansjuerg; Herlea, Vlad; Rassenti, Laura Z.; Rai, Kanti R.; Kipps, Thomas J.; Keating, Michael J.

2010-01-01

Cancer is a genetic disease in which the interplay between alterations in protein-coding genes and non-coding RNAs (ncRNAs) plays a fundamental role. In recent years, the full coding component of the human genome was sequenced in various cancers, whereas such attempts related to ncRNAs are still fragmentary. We screened genomic DNAs for sequence variations in 148 microRNAs (miRNAs) and ultraconserved regions (UCRs) loci in patients with chronic lymphocytic leukemia (CLL) or colorectal cancer (CRC) by Sanger technique and further tried to elucidate the functional consequences of some of these variations. We found sequence variations in miRNAs in both sporadic and familial CLL cases, mutations of UCRs in CLLs and CRCs and, in certain instances, detected functional effects of these variations. Furthermore, by integrating our data with previously published data on miRNA sequence variations, we have created a catalog of DNA sequence variations in miRNAs/ultraconserved genes in human cancers. These findings argue that ncRNAs are targeted by both germ line and somatic mutations as well as by single-nucleotide polymorphisms with functional significance for human tumorigenesis. Sequence variations in ncRNA loci are frequent and some have functional and biological significance. Such information can be exploited to further investigate on a genome-wide scale the frequency of genetic variations in ncRNAs and their functional meaning, as well as for the development of new diagnostic and prognostic markers for leukemias and carcinomas. PMID:19926640
Receptor-like genes in the major resistance locus of lettuce are subject to divergent selection.

PubMed Central

Meyers, B C; Shen, K A; Rohani, P; Gaut, B S; Michelmore, R W

1998-01-01

Disease resistance genes in plants are often found in complex multigene families. The largest known cluster of disease resistance specificities in lettuce contains the RGC2 family of genes. We compared the sequences of nine full-length genomic copies of RGC2 representing the diversity in the cluster to determine the structure of genes within this family and to examine the evolution of its members. The transcribed regions range from at least 7.0 to 13.1 kb, and the cDNAs contain deduced open reading frames of approximately 5.5 kb. The predicted RGC2 proteins contain a nucleotide binding site and irregular leucine-rich repeats (LRRs) that are characteristic of resistance genes cloned from other species. Unique features of the RGC2 gene products include a bipartite LRR region with >40 repeats. At least eight members of this family are transcribed. The level of sequence diversity between family members varied in different regions of the gene. The ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitutions was lowest in the region encoding the nucleotide binding site, which is the presumed effector domain of the protein. The LRR-encoding region showed an alternating pattern of conservation and hypervariability. This alternating pattern of variation was also found in all comparisons within families of resistance genes cloned from other species. The Ka/Ks ratios indicate that diversifying selection has resulted in increased variation at these codons. The patterns of variation support the predicted structure of LRR regions with solvent-exposed hypervariable residues that are potentially involved in binding pathogen-derived ligands. PMID:9811792
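The Ka/Ks reasoning in the abstract above can be illustrated with a minimal sketch. It assumes Ka and Ks have already been estimated for a region; the function name, the tolerance band around 1, and the example values are hypothetical and do not come from the study.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.1) -> str:
    """Interpret a Ka/Ks ratio for one gene region.

    ka: nonsynonymous substitutions per nonsynonymous site
    ks: synonymous substitutions per synonymous site
    tolerance: hypothetical band around 1.0 treated as near-neutral
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return "diversifying"   # excess of amino-acid-changing substitutions
    if ratio < 1 - tolerance:
        return "purifying"      # amino-acid changes are being removed
    return "near-neutral"


# Illustrative values only (not taken from the RGC2 data).
print(classify_selection(0.30, 0.10))  # hypervariable, LRR-like codons
print(classify_selection(0.02, 0.10))  # conserved, binding-site-like region
```

A real analysis would estimate Ka and Ks with a codon-substitution model and test significance rather than apply a fixed cutoff.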
Lessons learned from whole exome sequencing in multiplex families affected by a complex genetic disorder, intracranial aneurysm.

PubMed

Farlow, Janice L; Lin, Hai; Sauerbeck, Laura; Lai, Dongbing; Koller, Daniel L; Pugh, Elizabeth; Hetrick, Kurt; Ling, Hua; Kleinloog, Rachel; van der Vlies, Pieter; Deelen, Patrick; Swertz, Morris A; Verweij, Bon H; Regli, Luca; Rinkel, Gabriel J E; Ruigrok, Ynte M; Doheny, Kimberly; Liu, Yunlong; Broderick, Joseph; Foroud, Tatiana

2015-01-01

Genetic risk factors for intracranial aneurysm (IA) are not yet fully understood. Genomewide association studies have been successful at identifying common variants; however, the role of rare variation in IA susceptibility has not been fully explored. In this study, we report the use of whole exome sequencing (WES) in seven densely-affected families (45 individuals) recruited as part of the Familial Intracranial Aneurysm study. WES variants were prioritized by functional prediction, frequency, predicted pathogenicity, and segregation within families. Using these criteria, 68 variants in 68 genes were prioritized across the seven families. Of the genes that were expressed in IA tissue, one gene (TMEM132B) was differentially expressed in aneurysmal samples (n=44) as compared to control samples (n=16) (false discovery rate adjusted p-value=0.023). We demonstrate that sequencing of densely affected families permits exploration of the role of rare variants in a relatively common disease such as IA, although there are important study design considerations for applying sequencing to complex disorders. In this study, we explore methods of WES variant prioritization, including the incorporation of unaffected individuals, multipoint linkage analysis, biological pathway information, and transcriptome profiling. Further studies are needed to validate and characterize the set of variants and genes identified in this study.
Conservation of Three-Dimensional Helix-Loop-Helix Structure through the Vertebrate Lineage Reopens the Cold Case of Gonadotropin-Releasing Hormone-Associated Peptide.

PubMed

Pérez Sirkin, Daniela I; Lafont, Anne-Gaëlle; Kamech, Nédia; Somoza, Gustavo M; Vissio, Paula G; Dufour, Sylvie

2017-01-01

GnRH-associated peptide (GAP) is the C-terminal portion of the gonadotropin-releasing hormone (GnRH) preprohormone. Although it was reported in mammals that GAP may act as a prolactin-inhibiting factor and can be co-secreted with GnRH into the hypophyseal portal blood, GAP has been practically out of the research circuit for about 20 years. Comparative studies highlighted the low conservation of GAP primary amino acid sequences among vertebrates, contributing to consider that this peptide only participates in the folding or carrying process of GnRH. Considering that the three-dimensional (3D) structure of a protein may define its function, the aim of this study was to evaluate if GAP sequences and 3D structures are conserved in the vertebrate lineage. GAP sequences from various vertebrates were retrieved from databases. Analysis of primary amino acid sequence identity and similarity, molecular phylogeny, and prediction of 3D structures were performed. Amino acid sequence comparison and phylogeny analyses confirmed the large variation of GAP sequences throughout vertebrate radiation. In contrast, prediction of the 3D structure revealed a striking conservation of the 3D structure of GAP1 (GAP associated with the hypophysiotropic type 1 GnRH), despite low amino acid sequence conservation. This GAP1 peptide presented a typical helix-loop-helix (HLH) structure in all the vertebrate species analyzed. This HLH structure could also be predicted for GAP2 in some but not all vertebrate species and in none of the GAP3 analyzed. These results allowed us to infer that selective pressures have maintained GAP1 HLH structure throughout the vertebrate lineage. The conservation of the HLH motif, known to confer biological activity to various proteins, suggests that GAP1 peptides may exert some hypophysiotropic biological functions across vertebrate radiation.
Conservation of Three-Dimensional Helix-Loop-Helix Structure through the Vertebrate Lineage Reopens the Cold Case of Gonadotropin-Releasing Hormone-Associated Peptide

PubMed Central

Pérez Sirkin, Daniela I.; Lafont, Anne-Gaëlle; Kamech, Nédia; Somoza, Gustavo M.; Vissio, Paula G.; Dufour, Sylvie

2017-01-01

GnRH-associated peptide (GAP) is the C-terminal portion of the gonadotropin-releasing hormone (GnRH) preprohormone. Although it was reported in mammals that GAP may act as a prolactin-inhibiting factor and can be co-secreted with GnRH into the hypophyseal portal blood, GAP has been practically out of the research circuit for about 20 years. Comparative studies highlighted the low conservation of GAP primary amino acid sequences among vertebrates, contributing to consider that this peptide only participates in the folding or carrying process of GnRH. Considering that the three-dimensional (3D) structure of a protein may define its function, the aim of this study was to evaluate if GAP sequences and 3D structures are conserved in the vertebrate lineage. GAP sequences from various vertebrates were retrieved from databases. Analysis of primary amino acid sequence identity and similarity, molecular phylogeny, and prediction of 3D structures were performed. Amino acid sequence comparison and phylogeny analyses confirmed the large variation of GAP sequences throughout vertebrate radiation. In contrast, prediction of the 3D structure revealed a striking conservation of the 3D structure of GAP1 (GAP associated with the hypophysiotropic type 1 GnRH), despite low amino acid sequence conservation. This GAP1 peptide presented a typical helix-loop-helix (HLH) structure in all the vertebrate species analyzed. This HLH structure could also be predicted for GAP2 in some but not all vertebrate species and in none of the GAP3 analyzed. These results allowed us to infer that selective pressures have maintained GAP1 HLH structure throughout the vertebrate lineage. The conservation of the HLH motif, known to confer biological activity to various proteins, suggests that GAP1 peptides may exert some hypophysiotropic biological functions across vertebrate radiation. PMID:28878737
The role of heterologous chloroplast sequence elements in transgene integration and expression.

PubMed

Ruhlman, Tracey; Verma, Dheeraj; Samson, Nalapalli; Daniell, Henry

2010-04-01

Heterologous regulatory elements and flanking sequences have been used in chloroplast transformation of several crop species, but their roles and mechanisms have not yet been investigated. Nucleotide sequence identity in the photosystem II protein D1 (psbA) upstream region is 59% across all taxa; similar variation was consistent across all genes and taxa examined. Secondary structure and predicted Gibbs free energy values of the psbA 5' untranslated region (UTR) among different families reflected this variation. Therefore, chloroplast transformation vectors were made for tobacco (Nicotiana tabacum) and lettuce (Lactuca sativa), with endogenous (Nt-Nt, Ls-Ls) or heterologous (Nt-Ls, Ls-Nt) psbA promoter, 5' UTR and 3' UTR, regulating expression of the anthrax protective antigen (PA) or human proinsulin (Pins) fused with the cholera toxin B-subunit (CTB). Unique lettuce flanking sequences were completely eliminated during homologous recombination in the transplastomic tobacco genomes but not unique tobacco sequences. Nt-Ls or Ls-Nt transplastomic lines showed reduction of 80% PA and 97% CTB-Pins expression when compared with endogenous psbA regulatory elements, which accumulated up to 29.6% total soluble protein PA and 72.0% total leaf protein CTB-Pins, 2-fold higher than Rubisco. Transgene transcripts were reduced by 84% in Ls-Nt-CTB-Pins and by 72% in Nt-Ls-PA lines. Transcripts containing endogenous 5' UTR were stabilized in nonpolysomal fractions. Stromal RNA-binding proteins were preferentially associated with endogenous psbA 5' UTR. A rapid and reproducible regeneration system was developed for lettuce commercial cultivars by optimizing plant growth regulators. These findings underscore the need for sequencing complete crop chloroplast genomes, utilization of endogenous regulatory elements and flanking sequences, as well as optimization of plant growth regulators for efficient chloroplast transformation.
The Role of Heterologous Chloroplast Sequence Elements in Transgene Integration and Expression

PubMed Central

Ruhlman, Tracey; Verma, Dheeraj; Samson, Nalapalli; Daniell, Henry

2010-01-01

Heterologous regulatory elements and flanking sequences have been used in chloroplast transformation of several crop species, but their roles and mechanisms have not yet been investigated. Nucleotide sequence identity in the photosystem II protein D1 (psbA) upstream region is 59% across all taxa; similar variation was consistent across all genes and taxa examined. Secondary structure and predicted Gibbs free energy values of the psbA 5′ untranslated region (UTR) among different families reflected this variation. Therefore, chloroplast transformation vectors were made for tobacco (Nicotiana tabacum) and lettuce (Lactuca sativa), with endogenous (Nt-Nt, Ls-Ls) or heterologous (Nt-Ls, Ls-Nt) psbA promoter, 5′ UTR and 3′ UTR, regulating expression of the anthrax protective antigen (PA) or human proinsulin (Pins) fused with the cholera toxin B-subunit (CTB). Unique lettuce flanking sequences were completely eliminated during homologous recombination in the transplastomic tobacco genomes but not unique tobacco sequences. Nt-Ls or Ls-Nt transplastomic lines showed reduction of 80% PA and 97% CTB-Pins expression when compared with endogenous psbA regulatory elements, which accumulated up to 29.6% total soluble protein PA and 72.0% total leaf protein CTB-Pins, 2-fold higher than Rubisco. Transgene transcripts were reduced by 84% in Ls-Nt-CTB-Pins and by 72% in Nt-Ls-PA lines. Transcripts containing endogenous 5′ UTR were stabilized in nonpolysomal fractions. Stromal RNA-binding proteins were preferentially associated with endogenous psbA 5′ UTR. A rapid and reproducible regeneration system was developed for lettuce commercial cultivars by optimizing plant growth regulators. These findings underscore the need for sequencing complete crop chloroplast genomes, utilization of endogenous regulatory elements and flanking sequences, as well as optimization of plant growth regulators for efficient chloroplast transformation. PMID:20130101
Predicting nuclear gene coalescence from mitochondrial data: the three-times rule.

PubMed

Palumbi, S R; Cipriano, F; Hare, M P

2001-05-01

Coalescence theory predicts when genetic drift at nuclear loci will result in fixation of sequence differences to produce monophyletic gene trees. However, the theory is difficult to apply to particular taxa because it hinges on genetically effective population size, which is generally unknown. Neutral theory also predicts that evolution of monophyly will be four times slower in nuclear than in mitochondrial genes primarily because genetic drift is slower at nuclear loci. Variation in mitochondrial DNA (mtDNA) within and between species has been studied extensively, but can these mtDNA data be used to predict coalescence in nuclear loci? Comparison of neutral theories of coalescence of mitochondrial and nuclear loci suggests a simple rule of thumb. The "three-times rule" states that, on average, most nuclear loci will be monophyletic when the branch length leading to the mtDNA sequences of a species is three times longer than the average mtDNA sequence diversity observed within that species. A test using mitochondrial and nuclear intron data from seven species of whales and dolphins suggests general agreement with predictions of the three-times rule. We define the coalescence ratio as the mitochondrial branch length for a species divided by intraspecific mtDNA diversity. We show that species with high coalescence ratios show nuclear monophyly, whereas species with low ratios have polyphyletic nuclear gene trees. As expected, species with intermediate coalescence ratios show a variety of patterns. Especially at very high or low coalescence ratios, the three-times rule predicts nuclear gene patterns that can help detect the action of selection. The three-times rule may be useful as an empirical benchmark for evaluating evolutionary processes occurring at multiple loci.
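The coalescence ratio defined in the abstract above reduces to a one-line calculation, sketched here. The function names and the input numbers are assumptions made for illustration; only the definition of the ratio and the factor of three come from the abstract.

```python
def coalescence_ratio(mt_branch_length: float, mt_within_diversity: float) -> float:
    """Mitochondrial branch length divided by intraspecific mtDNA diversity."""
    if mt_within_diversity <= 0:
        raise ValueError("within-species diversity must be positive")
    return mt_branch_length / mt_within_diversity


def meets_three_times_rule(mt_branch_length: float,
                           mt_within_diversity: float,
                           threshold: float = 3.0) -> bool:
    """True when the ratio reaches the threshold, i.e. most nuclear loci
    are expected, on average, to be monophyletic."""
    return coalescence_ratio(mt_branch_length, mt_within_diversity) >= threshold


# Hypothetical percent-divergence values, not data from the whale and dolphin test.
print(coalescence_ratio(4.5, 0.5))        # 9.0
print(meets_three_times_rule(4.5, 0.5))   # True
print(meets_three_times_rule(1.0, 0.5))   # False
```

Both inputs must be in the same units (for example, percent sequence divergence) for the ratio to be meaningful.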
Commonly-occurring polymorphisms in the COMT, DRD1 and DRD2 genes influence different aspects of motor sequence learning in humans.

PubMed

Baetu, Irina; Burns, Nicholas R; Urry, Kristi; Barbante, Girolamo Giovanni; Pitcher, Julia B

2015-11-01

Performing sequences of movements is a ubiquitous skill that involves dopamine transmission. However, it is unclear which components of the dopamine system contribute to which aspects of motor sequence learning. Here we used a genetic approach to investigate the relationship between different components of the dopamine system and specific aspects of sequence learning in humans. In particular, we investigated variations in genes that code for the catechol-O-methyltransferase (COMT) enzyme, the dopamine transporter (DAT) and dopamine D1 and D2 receptors (DRD1 and DRD2). COMT and the DAT regulate dopamine availability in the prefrontal cortex and the striatum, respectively, two key regions recruited during learning, whereas dopamine D1 and D2 receptors are thought to be involved in long-term potentiation and depression, respectively. We show that polymorphisms in the COMT, DRD1 and DRD2 genes differentially affect behavioral performance on a sequence learning task in 161 Caucasian participants. The DRD1 polymorphism predicted the ability to learn new sequences, the DRD2 polymorphism predicted the ability to perform a previously learnt sequence after performing interfering random movements, whereas the COMT polymorphism predicted the ability to switch flexibly between two sequences. We used computer simulations to explore potential mechanisms underlying these effects, which revealed that the DRD1 and DRD2 effects are possibly related to neuroplasticity. Our prediction-error algorithm estimated faster rates of connection strengthening in genotype groups with presumably higher D1 receptor densities, and faster rates of connection weakening in genotype groups with presumably higher D2 receptor densities. Consistent with current dopamine theories, these simulations suggest that D1-mediated neuroplasticity contributes to learning to select appropriate actions, whereas D2-mediated neuroplasticity is involved in learning to inhibit incorrect action plans. 
However, the learning algorithm did not account for the COMT effect, suggesting that prefrontal dopamine availability might affect sequence switching via other, non-learning, mechanisms. These findings provide insight into the function of the dopamine system, which is relevant to the development of treatments for disorders such as Parkinson's disease. Our results suggest that treatments targeting dopamine D1 receptors may improve learning of novel sequences, whereas those targeting dopamine D2 receptors may improve the ability to initiate previously learned sequences of movements. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Whole-Genome Sequencing and Concordance Between Antimicrobial Susceptibility Genotypes and Phenotypes of Bacterial Isolates Associated with Bovine Respiratory Disease

PubMed Central

Owen, Joseph R.; Noyes, Noelle; Young, Amy E.; Prince, Daniel J.; Blanchard, Patricia C.; Lehenbauer, Terry W.; Aly, Sharif S.; Davis, Jessica H.; O’Rourke, Sean M.; Abdo, Zaid; Belk, Keith; Miller, Michael R.; Morley, Paul; Van Eenennaam, Alison L.

2017-01-01

Extended laboratory culture and antimicrobial susceptibility testing timelines hinder rapid species identification and susceptibility profiling of bacterial pathogens associated with bovine respiratory disease, the most prevalent cause of cattle mortality in the United States. Whole-genome sequencing offers a culture-independent alternative to current bacterial identification methods, but requires a library of bacterial reference genomes for comparison. To contribute new bacterial genome assemblies and evaluate genetic diversity and variation in antimicrobial resistance genotypes, whole-genome sequencing was performed on bovine respiratory disease–associated bacterial isolates (Histophilus somni, Mycoplasma bovis, Mannheimia haemolytica, and Pasteurella multocida) from dairy and beef cattle. One hundred genomically distinct assemblies were added to the NCBI database, doubling the available genomic sequences for these four species. Computer-based methods identified 11 predicted antimicrobial resistance genes in three species, with none being detected in M. bovis. While computer-based analysis can identify antibiotic resistance genes within whole-genome sequences (genotype), it may not predict the actual antimicrobial resistance observed in a living organism (phenotype). Antimicrobial susceptibility testing on 64 H. somni, M. haemolytica, and P. multocida isolates had an overall concordance rate between genotype and phenotypic resistance to the associated class of antimicrobials of 72.7% (P < 0.001), showing substantial discordance. Concordance rates varied greatly among different antimicrobial, antibiotic resistance gene, and bacterial species combinations. 
This suggests that antimicrobial susceptibility phenotypes are needed to complement genomically predicted antibiotic resistance gene genotypes to better understand how the presence of antibiotic resistance genes within a given bacterial species could potentially impact optimal bovine respiratory disease treatment and morbidity/mortality outcomes. PMID:28739600
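The genotype-phenotype concordance rate discussed above can be sketched as a simple agreement fraction. This is an assumed formulation: each call pairs whether a resistance gene was predicted with whether phenotypic resistance was observed; the function name and the example calls are hypothetical, and the study's own scoring may differ.

```python
from typing import Iterable, Tuple


def concordance_rate(calls: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of (gene_predicted, phenotypically_resistant) pairs that agree."""
    calls = list(calls)
    if not calls:
        raise ValueError("at least one genotype/phenotype pair is required")
    agreeing = sum(1 for genotype, phenotype in calls if genotype == phenotype)
    return agreeing / len(calls)


# Hypothetical isolate/antimicrobial-class calls, not data from the study.
example_calls = [
    (True, True),    # gene found, isolate resistant: concordant
    (False, False),  # no gene, isolate susceptible: concordant
    (True, False),   # gene found, isolate susceptible: discordant
    (False, False),  # concordant
]
print(concordance_rate(example_calls))  # 0.75
```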
Whole-Genome Sequencing and Concordance Between Antimicrobial Susceptibility Genotypes and Phenotypes of Bacterial Isolates Associated with Bovine Respiratory Disease.

PubMed

Owen, Joseph R; Noyes, Noelle; Young, Amy E; Prince, Daniel J; Blanchard, Patricia C; Lehenbauer, Terry W; Aly, Sharif S; Davis, Jessica H; O'Rourke, Sean M; Abdo, Zaid; Belk, Keith; Miller, Michael R; Morley, Paul; Van Eenennaam, Alison L

2017-09-07

Extended laboratory culture and antimicrobial susceptibility testing timelines hinder rapid species identification and susceptibility profiling of bacterial pathogens associated with bovine respiratory disease, the most prevalent cause of cattle mortality in the United States. Whole-genome sequencing offers a culture-independent alternative to current bacterial identification methods, but requires a library of bacterial reference genomes for comparison. To contribute new bacterial genome assemblies and evaluate genetic diversity and variation in antimicrobial resistance genotypes, whole-genome sequencing was performed on bovine respiratory disease-associated bacterial isolates (Histophilus somni, Mycoplasma bovis, Mannheimia haemolytica, and Pasteurella multocida) from dairy and beef cattle. One hundred genomically distinct assemblies were added to the NCBI database, doubling the available genomic sequences for these four species. Computer-based methods identified 11 predicted antimicrobial resistance genes in three species, with none being detected in M. bovis. While computer-based analysis can identify antibiotic resistance genes within whole-genome sequences (genotype), it may not predict the actual antimicrobial resistance observed in a living organism (phenotype). Antimicrobial susceptibility testing on 64 H. somni, M. haemolytica, and P. multocida isolates had an overall concordance rate between genotype and phenotypic resistance to the associated class of antimicrobials of 72.7% (P < 0.001), showing substantial discordance. Concordance rates varied greatly among different antimicrobial, antibiotic resistance gene, and bacterial species combinations.
This suggests that antimicrobial susceptibility phenotypes are needed to complement genomically predicted antibiotic resistance gene genotypes to better understand how the presence of antibiotic resistance genes within a given bacterial species could potentially impact optimal bovine respiratory disease treatment and morbidity/mortality outcomes. Copyright © 2017 Owen et al.
Comparison of the Genome Sequence of the Poultry Pathogen Bordetella avium with Those of B. bronchiseptica, B. pertussis, and B. parapertussis Reveals Extensive Diversity in Surface Structures Associated with Host Interaction

PubMed Central

Sebaihia, Mohammed; Preston, Andrew; Maskell, Duncan J.; Kuzmiak, Holly; Connell, Terry D.; King, Natalie D.; Orndorff, Paul E.; Miyamoto, David M.; Thomson, Nicholas R.; Harris, David; Goble, Arlette; Lord, Angela; Murphy, Lee; Quail, Michael A.; Rutter, Simon; Squares, Robert; Squares, Steven; Woodward, John; Parkhill, Julian; Temple, Louise M.

2006-01-01

Bordetella avium is a pathogen of poultry and is phylogenetically distinct from Bordetella bronchiseptica, Bordetella pertussis, and Bordetella parapertussis, which are other species in the Bordetella genus that infect mammals. In order to understand the evolutionary relatedness of Bordetella species and further the understanding of pathogenesis, we obtained the complete genome sequence of B. avium strain 197N, a pathogenic strain that has been extensively studied. With 3,732,255 base pairs of DNA and 3,417 predicted coding sequences, it has the smallest genome and gene complement of the sequenced bordetellae. In this study, the presence or absence of previously reported virulence factors from B. avium was confirmed, and the genetic bases for growth characteristics were elucidated. Over 1,100 genes present in B. avium but not in B. bronchiseptica were identified, and most were predicted to encode surface or secreted proteins that are likely to define an organism adapted to the avian rather than the mammalian respiratory tracts. These include genes coding for the synthesis of a polysaccharide capsule, hemagglutinins, a type I secretion system adjacent to two very large genes for secreted proteins, and unique genes for both lipopolysaccharide and fimbrial biogenesis. Three apparently complete prophages are also present. The BvgAS virulence regulatory system appears to have polymorphisms at a poly(C) tract that is involved in phase variation in other bordetellae. A number of putative iron-regulated outer membrane proteins were predicted from the sequence, and this regulation was confirmed experimentally for five of these. PMID:16885469
Mitochondrial genomic comparison of Clonorchis sinensis from South Korea with other isolates of this species.

PubMed

Wang, Daxi; Young, Neil D; Koehler, Anson V; Tan, Patrick; Sohn, Woon-Mok; Korhonen, Pasi K; Gasser, Robin B

2017-07-01

Clonorchiasis is a neglected tropical disease that affects >35 million people mainly in China, Vietnam, South Korea and some parts of Russia. The disease-causing agent, Clonorchis sinensis, is a liver fluke of humans and other piscivorous animals, and has a complex aquatic life cycle involving snails and fish intermediate hosts. Chronic infection in humans causes liver disease and associated complications including malignant bile duct cancer. Central to control and to understanding the epidemiology of this disease is knowledge of the specific identity of the causative agent as well as genetic variation within and among populations of this parasite. Although most published molecular studies seem to suggest that C. sinensis represents a single species and that genetic variation within the species is limited, karyotypic variation within C. sinensis among China, Korea (2n=56) and Russian Far East (2n=14) suggests that this taxon might contain sibling species. Here, we assessed and applied a deep sequencing-bioinformatic approach to sequence and define a reference mitochondrial (mt) genome for a particular isolate of C. sinensis from Korea (Cs-k2), to confirm its specific identity, and compared this mt genome with homologous data sets available for this species. Comparative analyses revealed consistency in the number and structure of genes as well as in the lengths of protein-coding genes, and limited genetic variation among isolates of C. sinensis. Phylogenetic analyses of amino acid sequences predicted from mt genes showed that representatives of C. sinensis clustered together, with absolute nodal support, to the exclusion of other liver fluke representatives, but sub-structuring within C. sinensis was not well supported. The plan now is to proceed with the sequencing, assembly and annotation of a high quality draft nuclear genome of this defined isolate (Cs-k2) as a basis for a detailed investigation of molecular variation within C. sinensis from disparate geographical locations in parts of Asia and to prospect for cryptic species. Copyright © 2017 Elsevier B.V. All rights reserved.

Spatial Structure of the Mormon Cricket Gut Microbiome and its Predicted Contribution to Nutrition and Immune Function

PubMed Central

Smith, Chad C.; Srygley, Robert B.; Healy, Frank; Swaminath, Karthikeyan; Mueller, Ulrich G.

2017-01-01

The gut microbiome of insects plays an important role in their ecology and evolution, participating in nutrient acquisition, immunity, and behavior. Microbial community structure within the gut is heavily influenced by differences among gut regions in morphology and physiology, which determine the niches available for microbes to colonize. We present a high-resolution analysis of the structure of the gut microbiome in the Mormon cricket Anabrus simplex, an insect known for its periodic outbreaks in the western United States and nutrition-dependent mating system. The Mormon cricket microbiome was dominated by 11 taxa from the Lactobacillaceae, Enterobacteriaceae, and Streptococcaceae. While most of these were represented in all gut regions, there were marked differences in their relative abundance, with lactic-acid bacteria (Lactobacillaceae) more common in the foregut and midgut and enteric (Enterobacteriaceae) bacteria more common in the hindgut. Differences in community structure were driven by variation in the relative prevalence of three groups: a Lactobacillus in the foregut, Pediococcus lactic-acid bacteria in the midgut, and Pantoea agglomerans, an enteric bacterium, in the hindgut. These taxa have been shown to have beneficial effects on their hosts in insects and other animals by improving nutrition, increasing resistance to pathogens, and modulating social behavior. Using PICRUSt to predict gene content from our 16S rRNA sequences, we found enzymes that participate in carbohydrate metabolism and pathogen defense in other orthopterans. These were predominately represented in the hindgut and midgut, the most important sites for nutrition and pathogen defense. Phylogenetic analysis of 16S rRNA sequences from cultured isolates indicated low levels of divergence from sequences derived from plants and other insects, suggesting that these bacteria are likely to be exchanged between Mormon crickets and the environment. 
Our study shows strong spatial variation in microbiome community structure, which influences predicted gene content and thus the potential of the microbiome to influence host function. PMID:28553263
Analyses of Genotypic Diversity among North, South, and Central American Isolates of Sugarcane Yellow Leaf Virus: Evidence for Colombian Origins and for Intraspecific Spatial Phylogenetic Variation

PubMed Central

Moonan, Francis; Mirkov, T. Erik

2002-01-01

We have analyzed the genotypic diversity of Sugarcane yellow leaf virus (SCYLV) collected from North, South, and Central America by fingerprinting assays and selective cDNA cloning and sequencing. One group of isolates from Colombia, designated the C-population, has been identified as residing at the root node between a separable superpopulation structure of SCYLV and other members of the family Luteoviridae, indicating that the progenitor viruses of the North, South, and Central American isolates of the SCYLV superpopulation most likely arose from a C-population structure. From a model of intrafamilial evolution (F. Moonan et al., Virology 269:156–171, 2000), a prediction could be made that within the SCYLV species, the capacity of genomic sequence divergence would range from lowest in the capsid protein open reading frame 3 (ORF 3) to highest in a region spanning across the carboxy-terminal end of the RNA-dependent RNA polymerase ORF. We have demonstrated the validity and applicability of this intrafamilial model for the prediction of intraspecies SCYLV diversity. Analysis of spatial phylogenetic variation (SPV) within the SCYLV isolates could not be assessed by application of a “partial likelihoods assessed through optimization” (PLATO)-derived intraspecies model alone. However, application of a PLATO-derived intrafamilial model with the intraspecies-derived model allowed distinction of three forms of SPV. Two of the SPV forms identified correspond to the extremes in a continuum of sequence evolution displayed in a SCYLV superpopulation structure, and the third form was diagnostic of a C-population structure. The application of these types of models has value in terms of predicting the types of SCYLV intraspecies diversity that may exist worldwide, and in general, may be useful in application for more informed design of transgenes for use in the elicitation of homology-dependent virus resistance mechanisms in transgenic plants. PMID:11773408
Global and disease-associated genetic variation in the human Fanconi anemia gene family.

PubMed

Rogers, Kai J; Fu, Wenqing; Akey, Joshua M; Monnat, Raymond J

2014-12-20

Fanconi anemia (FA) is a human recessive genetic disease resulting from inactivating mutations in any of 16 FANC (Fanconi) genes. Individuals with FA are at high risk of developmental abnormalities, early bone marrow failure and leukemia. These are followed in the second and subsequent decades by a very high risk of carcinomas of the head and neck and anogenital region, and a small continuing risk of leukemia. In order to characterize base pair-level disease-associated (DA) and population genetic variation in FANC genes and the segregation of this variation in the human population, we identified 2948 unique FANC gene variants including 493 FA DA variants across 57,240 potential base pair variation sites in the 16 FANC genes. We then analyzed the segregation of this variation in the 7578 subjects included in the Exome Sequencing Project (ESP) and the 1000 Genomes Project (1KGP). There was a remarkably high frequency of FA DA variants in ESP/1KGP subjects: at least 1 FA DA variant was identified in 78.5% (5950 of 7578) individuals included in these two studies. Six widely used functional prediction algorithms correctly identified only a third of the known, DA FANC missense variants. We also identified FA DA variants that may be good candidates for different types of mutation-specific therapies. Our results demonstrate the power of direct DNA sequencing to detect, estimate the frequency of and follow the segregation of deleterious genetic variation in human populations. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
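The headline frequency in the abstract above can be checked with simple arithmetic, shown here as a short sketch. The helper name is hypothetical. The second figure (the share of unique variants that are disease-associated) is derived here from the reported counts and is not stated in the abstract.

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Counts as reported in the abstract.
carriers_pct = percent(5950, 7578)   # subjects with at least one FA DA variant
da_share_pct = percent(493, 2948)    # DA variants among unique FANC variants (derived)

print(carriers_pct)  # 78.5
print(da_share_pct)  # 16.7
```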
Genome-wide copy number variation in the bovine genome detected using low coverage sequence of popular beef breeds

USDA-ARS's Scientific Manuscript database

Genomic structural variations are an important source of genetic diversity. Copy number variations (CNVs), gains and losses of large regions of genomic sequence between individuals of a species, are known to be associated with both diseases and phenotypic traits. Deeply sequenced genomes are often u...
Genomic Analysis of Genotype-by-Social Environment Interaction for Drosophila melanogaster Aggressive Behavior.

PubMed

Rohde, Palle Duun; Gaertner, Bryn; Ward, Kirsty; Sørensen, Peter; Mackay, Trudy F C

2017-08-01

Human psychiatric disorders such as schizophrenia, bipolar disorder, and attention-deficit/hyperactivity disorder often include adverse behaviors including increased aggressiveness. Individuals with psychiatric disorders often exhibit social withdrawal, which can further increase the probability of conducting a violent act. Here, we used the inbred, sequenced lines of the Drosophila Genetic Reference Panel (DGRP) to investigate the genetic basis of variation in male aggressive behavior for flies reared in a socialized and socially isolated environment. We identified genetic variation for aggressive behavior, as well as significant genotype-by-social environmental interaction (GSEI); i.e., variation among DGRP genotypes in the degree to which social isolation affected aggression. We performed genome-wide association (GWA) analyses to identify genetic variants associated with aggression within each environment. We used genomic prediction to partition genetic variants into gene ontology (GO) terms and constituent genes, and identified GO terms and genes with high prediction accuracies in both social environments and for GSEI. The top predictive GO terms significantly increased the proportion of variance explained, compared to prediction models based on all segregating variants. We performed genomic prediction across environments, and identified genes in common between the social environments that turned out to be enriched for genome-wide associated variants. A large proportion of the associated genes have previously been associated with aggressive behavior in Drosophila and mice. Further, many of these genes have human orthologs that have been associated with neurological disorders, indicating partially shared genetic mechanisms underlying aggression in animal models and human psychiatric disorders. Copyright © 2017 by the Genetics Society of America.
Understanding Neurodevelopmental Disorders: The Promise of Regulatory Variation in the 3'UTRome.

PubMed

Wanke, Kai A; Devanna, Paolo; Vernes, Sonja C

2018-04-01

Neurodevelopmental disorders have a strong genetic component, but despite widespread efforts, the specific genetic factors underlying these disorders remain undefined for a large proportion of affected individuals. Given the accessibility of exome sequencing, this problem has thus far been addressed from a protein-centric standpoint; however, protein-coding regions only make up ∼1% to 2% of the human genome. With the advent of whole genome sequencing we are in the midst of a paradigm shift as it is now possible to interrogate the entire sequence of the human genome (coding and noncoding) to fill in the missing heritability of complex disorders. These new technologies bring new challenges, as the number of noncoding variants identified per individual can be overwhelming, making it prudent to focus on noncoding regions of known function, for which the effects of variation can be predicted and directly tested to assess pathogenicity. The 3'UTRome is a region of the noncoding genome that perfectly fulfills these criteria and is of high interest when searching for pathogenic variation related to complex neurodevelopmental disorders. Herein, we review the regulatory roles of the 3'UTRome as binding sites for microRNAs or RNA binding proteins, or during alternative polyadenylation. We detail existing evidence that these regions contribute to neurodevelopmental disorders and outline strategies for identification and validation of novel putatively pathogenic variation in these regions. This evidence suggests that studying the 3'UTRome will lead to the identification of new risk factors, new candidate disease genes, and a better understanding of the molecular mechanisms contributing to neurodevelopmental disorders. Copyright © 2017 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.
FPGA implementation of predictive degradation model for engine oil lifetime

NASA Astrophysics Data System (ADS)

Idros, M. F. M.; Razak, A. H. A.; Junid, S. A. M. Al; Suliman, S. I.; Halim, A. K.

2018-03-01

This paper presents the implementation of a linear regression model for degradation prediction at the register-transfer level (RTL) using Quartus II. A stationary model was identified in the degradation trend of a vehicle's engine oil using a time-series method. For the RTL implementation, the degradation model is written in Verilog HDL, and the input data are taken at set times. A clock divider was designed to support the timing sequence of the input data. After every five data points, a regression analysis is performed to determine the slope variation and to calculate the prediction. Only negative values are considered for prediction, which reduces the number of logic gates required. The least squares method is used to obtain the best linear model based on the mean values of the time-series data. The coded algorithm has been implemented on an FPGA for validation. The result shows the predicted time to change the engine oil.
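The windowed least-squares step described in this abstract can be illustrated in software. The following is a minimal sketch in Python rather than Verilog, and it is not the authors' RTL design: the function names, the `threshold` parameter, and the choice to skip non-negative slopes are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

WINDOW = 5  # the abstract runs a regression after every five data points


def least_squares_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = slope * t + intercept for sample times t = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def steps_until_threshold(ys: Sequence[float], threshold: float) -> Optional[float]:
    """Sample steps after the last reading until the fitted line reaches `threshold`.

    `threshold` is a hypothetical end-of-life level. Windows whose slope is not
    negative are skipped (assumed reading of "only negative values").
    """
    slope, intercept = least_squares_line(ys)
    if slope >= 0:
        return None
    t_cross = (threshold - intercept) / slope
    return t_cross - (len(ys) - 1)


if __name__ == "__main__":
    window = [100.0, 98.0, 96.0, 94.0, 92.0]  # made-up oil-quality readings
    assert len(window) == WINDOW
    print(least_squares_line(window))
    print(steps_until_threshold(window, threshold=80.0))
```

Centring the time axis on its mean keeps the slope formula to two accumulations, which is one plausible reason a small fixed window suits a hardware implementation.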
Generation and Characterization of HIV-1 Transmitted and Founder Virus Consensus Sequence from Intravenous Drug Users in Xinjiang, China.

PubMed

Li, Fan; Ma, Liying; Feng, Yi; Hu, Jing; Ni, Na; Ruan, Yuhua; Shao, Yiming

2017-06-01

HIV-1 transmission in intravenous drug users (IDUs) has been characterized by high genetic multiplicity, which suggests a greater challenge for blocking HIV-1 infection. We investigated a total of 749 sequences of the full-length gp160 gene obtained by single genome sequencing (SGS) from 22 early HIV-1-infected IDUs in Xinjiang province, northwest China, and generated a transmitted and founder virus (T/F virus) consensus sequence (IDU.CON). The T/F virus was classified as subtype CRF07_BC and predicted to be a CCR5-tropic virus. The variable regions (V1, V2, and V4 loops) of IDU.CON showed length variation compared with the heterosexual T/F virus consensus sequence (HSX.CON) and homosexual T/F virus consensus sequence (MSM.CON). A total of 26 N-linked glycosylation sites were discovered in the IDU.CON sequence, fewer than in MSM.CON and HSX.CON. Characterization of T/F virus from IDUs highlights the genetic make-up and complexity of the virus near the moment of transmission or in early infection preceding systemic dissemination and is important for the development of effective HIV-1 preventive methods, including vaccines.
TnSeq of Mycobacterium tuberculosis clinical isolates reveals strain-specific antibiotic liabilities

PubMed Central

Carey, Allison F.; Rock, Jeremy M.; Krieger, Inna V.; Gagneux, Sebastien; Sacchettini, James C.; Fortune, Sarah M.

2018-01-01

Once considered a phenotypically monomorphic bacterium, there is a growing body of work demonstrating heterogeneity among Mycobacterium tuberculosis (Mtb) strains in clinically relevant characteristics, including virulence and response to antibiotics. However, the genetic and molecular basis for most phenotypic differences among Mtb strains remains unknown. To investigate the basis of strain variation in Mtb, we performed genome-wide transposon mutagenesis coupled with next-generation sequencing (TnSeq) for a panel of Mtb clinical isolates and the reference strain H37Rv to compare genetic requirements for in vitro growth across these strains. We developed an analytic approach to identify quantitative differences in genetic requirements between these genetically diverse strains, which vary in genomic structure and gene content. Using this methodology, we found differences between strains in their requirements for genes involved in fundamental cellular processes, including redox homeostasis and central carbon metabolism. Among the genes with differential requirements were katG, which encodes the activator of the first-line antitubercular agent isoniazid, and glcB, which encodes malate synthase, the target of a novel small-molecule inhibitor. Differences among strains in their requirement for katG and glcB predicted differences in their response to these antimicrobial agents. Importantly, these strain-specific differences in antibiotic response could not be predicted by genetic variants identified through whole genome sequencing or by gene expression analysis. Our results provide novel insight into the basis of variation among Mtb strains and demonstrate that TnSeq is a scalable method to predict clinically important phenotypic differences among Mtb strains. PMID:29505613
Predictive genomics DNA profiling for athletic performance.

PubMed

Kambouris, Marios; Ntalouka, Foteini; Ziogas, Georgios; Maffulli, Nicola

2012-12-01

Genes control biological processes such as muscle, cartilage and bone formation, muscle energy production and metabolism (mitochondriogenesis, lactic acid removal), blood and tissue oxygenation (erythropoiesis, angiogenesis, vasodilatation), all essential in sport and athletic performance. DNA sequence variations in such genes confer genetic advantages that can be exploited, or genetic 'barriers' that could be overcome to achieve optimal athletic performance. Predictive Genomic DNA Profiling for athletic performance reveals genetic variations that may be associated with better suitability for endurance, strength and speed sports, vulnerability to sports-related injuries and individualized nutritional requirements. Knowledge of genetic 'suitability' in respect to endurance capacity or strength and speed would lead to appropriate sport and athletic activity selection. Knowledge of genetic advantages and barriers would 'direct' an individualized training program, nutritional plan and nutritional supplementation to achieving optimal performance, overcoming 'barriers' that result from intense exercise and pressure under competition, with minimum waste of time and energy and avoidance of health risks (hypertension, cardiovascular disease, inflammation, and musculoskeletal injuries) related to exercise, training and competition. Predictive Genomics DNA profiling for Athletics and Sports performance is developing into a tool for athletic activity and sport selection and for the formulation of individualized and personalized training and nutritional programs to optimize health and performance for the athlete. Human DNA sequences are patentable in some countries, while in others DNA testing methodologies [unless proprietary] are non-patentable. On the other hand, gene and variant selection, genotype interpretation and the risk and suitability assigning algorithms based on the specific Genomic variants used are amenable to patent protection.
Identification and Characterization of Novel Variations in Platelet G-Protein Coupled Receptor (GPCR) Genes in Patients Historically Diagnosed with Type 1 von Willebrand Disease.

PubMed

Stockley, Jacqueline; Nisar, Shaista P; Leo, Vincenzo C; Sabi, Essa; Cunningham, Margaret R; Eikenboom, Jeroen C; Lethagen, Stefan; Schneppenheim, Reinhard; Goodeve, Anne C; Watson, Steve P; Mundell, Stuart J; Daly, Martina E

2015-01-01

The clinical expression of type 1 von Willebrand disease may be modified by co-inheritance of other mild bleeding diatheses. We previously showed that mutations in the platelet P2Y12 ADP receptor gene (P2RY12) could contribute to the bleeding phenotype in patients with type 1 von Willebrand disease. Here we investigated whether variations in platelet G protein-coupled receptor genes other than P2RY12 also contributed to the bleeding phenotype. Platelet G protein-coupled receptor genes P2RY1, F2R, F2RL3, TBXA2R and PTGIR were sequenced in 146 index cases with type 1 von Willebrand disease and the potential effects of identified single nucleotide variations were assessed using in silico methods and heterologous expression analysis. Seven heterozygous single nucleotide variations were identified in 8 index cases. Two single nucleotide variations were detected in F2R; a novel c.-67G>C transversion which reduced F2R transcriptional activity and a rare c.1063C>T transition predicting a p.L355F substitution which did not interfere with PAR1 expression or signalling. Two synonymous single nucleotide variations were identified in F2RL3 (c.402C>G, p.A134 =; c.1029 G>C p.V343 =), both of which introduced less commonly used codons and were predicted to be deleterious, though neither of them affected PAR4 receptor expression. A third single nucleotide variation in F2RL3 (c.65 C>A; p.T22N) was co-inherited with a synonymous single nucleotide variation in TBXA2R (c.6680 C>T, p.S218 =). Expression and signalling of the p.T22N PAR4 variant was similar to wild-type, while the TBXA2R variation introduced a cryptic splice site that was predicted to cause premature termination of protein translation. The enrichment of single nucleotide variations in G protein-coupled receptor genes among type 1 von Willebrand disease patients supports the view of type 1 von Willebrand disease as a polygenic disorder.
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion.

PubMed

Nath, Abhigyan; Subbiah, Karthikeyan

2015-12-01

Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time-consuming process. Computational methods that rely on sequence similarity to assign putative members to this family are also unreliable because of the low sequence similarity among its members. Consequently, machine learning methods become a viable alternative for their prediction, using the underlying sequence- and structure-derived features as input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse types of input instances belonging to different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias predictions towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. 
Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K nearest neighbour algorithm (which produced greater specificity) to achieve better predictive performance than that of the individual base classifiers. The performance of the learned models trained on the Kmeans-preprocessed training set is far better than that of models trained on the randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and a sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results establish that diversifying the training set improves the performance of predictive models through superior generalization ability and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans-based sampling can be a more effective technique for improving generalization than the usual random splitting method. Copyright © 2015 Elsevier Ltd. All rights reserved.
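Two steps from this abstract lend themselves to a short illustration: drawing an equal number of patterns from each cluster, and fusing the probability outputs of two classifiers. The sketch below is a hedged example, not the authors' code. It assumes cluster labels have already been computed (for instance by a K-means run), and it uses a simple mean of the two probabilities as the fusion rule, which the abstract does not specify. All function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def balanced_sample(cluster_ids: Sequence[int], per_cluster: int, seed: int = 0) -> List[int]:
    """Return sorted indices with exactly `per_cluster` items drawn from every cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    chosen: List[int] = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        if len(members) < per_cluster:
            raise ValueError(f"cluster {cid} has only {len(members)} members")
        chosen.extend(rng.sample(members, per_cluster))
    return sorted(chosen)


def fuse_mean(prob_a: Sequence[float], prob_b: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Average two classifiers' positive-class probabilities (assumed fusion rule)."""
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have equal length")
    return [(a + b) / 2 >= threshold for a, b in zip(prob_a, prob_b)]


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 2, 2, 2, 2]   # made-up cluster assignments
    print(balanced_sample(clusters, per_cluster=2))
    print(fuse_mean([0.9, 0.2], [0.7, 0.4]))  # made-up classifier outputs
```

Sampling per cluster rather than from the pooled data is what keeps sparse regions of the input space represented in the training set.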
Lineage-specific evolutionary rate in plants: Contributions of a screening for Cereus (Cactaceae).

PubMed

Romeiro-Brito, Monique; Moraes, Evandro M; Taylor, Nigel P; Zappi, Daniela C; Franco, Fernando F

2016-01-01

Predictable chloroplast DNA (cpDNA) sequences have been listed for the shallowest taxonomic studies in plants. We investigated whether plastid regions that vary between closely allied species could be applied for intraspecific studies and compared the variation of these plastid segments with two nuclear regions. We screened 16 plastid and two nuclear intronic regions for species of the genus Cereus (Cactaceae) at three hierarchical levels (species from different clades, species of the same clade, and allopatric populations). Ten plastid regions presented interspecific variation, and six of them showed variation at the intraspecific level. The two nuclear regions showed both inter- and intraspecific variation, and in general they showed higher levels of variability in almost all hierarchical levels than the plastid segments. Our data suggest no correspondence between variation of plastid regions at the interspecific and intraspecific level, probably due to lineage-specific variation in cpDNA, which appears to have less effect in nuclear data. Despite the heterogeneity in evolutionary rates of cpDNA, we highlight three plastid segments that may be considered in initial screenings in plant phylogeographic studies.
Investigation of sequential properties of snoring episodes for obstructive sleep apnoea identification.

PubMed

Cavusoglu, M; Ciloglu, T; Serinagaoglu, Y; Kamasak, M; Erogul, O; Akcam, T

2008-08-01

In this paper, 'snore regularity' is studied in terms of the variations of snoring sound episode durations, separations and average powers in simple snorers and in obstructive sleep apnoea (OSA) patients. The goal was to explore the possibility of distinguishing among simple snorers and OSA patients using only sleep sound recordings of individuals and to ultimately eliminate the need for spending a whole night in the clinic for polysomnographic recording. Sequences that contain snoring episode durations (SED), snoring episode separations (SES) and average snoring episode powers (SEP) were constructed from snoring sound recordings of 30 individuals (18 simple snorers and 12 OSA patients) who were also under polysomnographic recording in Gülhane Military Medical Academy Sleep Studies Laboratory (GMMA-SSL), Ankara, Turkey. Snore regularity is quantified in terms of mean, standard deviation and coefficient of variation values for the SED, SES and SEP sequences. In all three of these sequences, OSA patients' data displayed a higher variation than those of simple snorers. To exclude the effects of slow variations in the base-line of these sequences, new sequences that contain the coefficient of variation of the sample values in a 'short' signal frame, i.e., short time coefficient of variation (STCV) sequences, were defined. The mean, the standard deviation and the coefficient of variation values calculated from the STCV sequences displayed a stronger potential to distinguish among simple snorers and OSA patients than those obtained from the SED, SES and SEP sequences themselves. Spider charts were used to jointly visualize the three parameters, i.e., the mean, the standard deviation and the coefficient of variation values of the SED, SES and SEP sequences, and the corresponding STCV sequences as two-dimensional plots. 
Our observations showed that the statistical parameters obtained from the SED and SES sequences, and the corresponding STCV sequences, possessed a strong potential to distinguish among simple snorers and OSA patients, both marginally, i.e., when the parameters are examined individually, and jointly. The parameters obtained from the SEP sequences and the corresponding STCV sequences, on the other hand, did not have a strong discrimination capability. However, the joint behaviour of these parameters showed some potential to distinguish among simple snorers and OSA patients.
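The regularity statistics in this abstract are straightforward to compute. The sketch below shows the coefficient of variation and a short-time coefficient of variation (STCV) sequence for a list of episode measurements. It rests on assumptions the abstract does not settle: frames are taken as non-overlapping, the population standard deviation is used, and the function names are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (population form, an assumption)."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return pstdev(values) / m


def stcv(values: Sequence[float], frame_len: int) -> List[float]:
    """Coefficient of variation within each short, non-overlapping frame."""
    if frame_len < 1:
        raise ValueError("frame_len must be positive")
    return [
        coefficient_of_variation(values[i:i + frame_len])
        for i in range(0, len(values) - frame_len + 1, frame_len)
    ]


if __name__ == "__main__":
    durations = [3.0, 3.0, 3.0, 3.0, 1.0, 3.0]  # made-up episode durations in seconds
    print(coefficient_of_variation(durations))
    print(stcv(durations, frame_len=2))
```

Because each frame is normalised by its own mean, a slow drift in the baseline changes the whole-sequence statistic far more than it changes the per-frame values, which matches the motivation given for the STCV sequences.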
CRISPRDetect: A flexible algorithm to define CRISPR arrays.

PubMed

Biswas, Ambarish; Staals, Raymond H J; Morales, Sergio E; Fineran, Peter C; Brown, Chris M

2016-05-17

CRISPR (clustered regularly interspaced short palindromic repeats) RNAs provide the specificity for noncoding RNA-guided adaptive immune defence systems in prokaryotes. CRISPR arrays consist of repeat sequences separated by specific spacer sequences. CRISPR arrays have previously been identified in a large proportion of prokaryotic genomes. However, currently available detection algorithms do not utilise recently discovered features regarding CRISPR loci. We have developed a new approach to automatically detect, predict and interactively refine CRISPR arrays. It is available as a web program and command line from bioanalysis.otago.ac.nz/CRISPRDetect. CRISPRDetect discovers putative arrays, extends the array by detecting additional variant repeats, corrects the direction of arrays, refines the repeat/spacer boundaries, and annotates different types of sequence variations (e.g. insertion/deletion) in near identical repeats. Due to these features, CRISPRDetect has significant advantages when compared to existing identification tools. As well as further support for small medium and large repeats, CRISPRDetect identified a class of arrays with 'extra-large' repeats in bacteria (repeats 44-50 nt). The CRISPRDetect output is integrated with other analysis tools. Notably, the predicted spacers can be directly utilised by CRISPRTarget to predict targets. CRISPRDetect enables more accurate detection of arrays and spacers and its gff output is suitable for inclusion in genome annotation pipelines and visualisation. It has been used to analyse all complete bacterial and archaeal reference genomes.
Breeding and Genetics Symposium: networks and pathways to guide genomic selection.

PubMed

Snelling, W M; Cushman, R A; Keele, J W; Maltecca, C; Thomas, M G; Fortes, M R S; Reverter, A

2013-02-01

Many traits affecting profitability and sustainability of meat, milk, and fiber production are polygenic, with no single gene having an overwhelming influence on observed variation. No knowledge of the specific genes controlling these traits has been needed to make substantial improvement through selection. Significant gains have been made through phenotypic selection enhanced by pedigree relationships and continually improving statistical methodology. Genomic selection, recently enabled by assays for dense SNP located throughout the genome, promises to increase selection accuracy and accelerate genetic improvement by emphasizing the SNP most strongly correlated to phenotype although the genes and sequence variants affecting phenotype remain largely unknown. These genomic predictions theoretically rely on linkage disequilibrium (LD) between genotyped SNP and unknown functional variants, but familial linkage may increase effectiveness when predicting individuals related to those in the training data. Genomic selection with functional SNP genotypes should be less reliant on LD patterns shared by training and target populations, possibly allowing robust prediction across unrelated populations. Although the specific variants causing polygenic variation may never be known with certainty, a number of tools and resources can be used to identify those most likely to affect phenotype. Associations of dense SNP genotypes with phenotype provide a 1-dimensional approach for identifying genes affecting specific traits; in contrast, associations with multiple traits allow defining networks of genes interacting to affect correlated traits. Such networks are especially compelling when corroborated by existing functional annotation and established molecular pathways. The SNP occurring within network genes, obtained from public databases or derived from genome and transcriptome sequences, may be classified according to expected effects on gene products. 
As illustrated by functionally informed genomic predictions being more accurate than naive whole-genome predictions of beef tenderness, coupling evidence from livestock genotypes, phenotypes, gene expression, and genomic variants with existing knowledge of gene functions and interactions may provide greater insight into the genes and genomic mechanisms affecting polygenic traits and facilitate functional genomic selection for economically important traits.
Genomics of gene banks: A case study in rice.

PubMed

McCouch, Susan R; McNally, Kenneth L; Wang, Wen; Sackville Hamilton, Ruaraidh

2012-02-01

Only a small fraction of the naturally occurring genetic diversity available in the world's germplasm repositories has been explored to date, but this is expected to change with the advent of affordable, high-throughput genotyping and sequencing technology. It is now possible to examine genome-wide patterns of natural variation and link sequence polymorphisms with downstream phenotypic consequences. In this paper, we discuss how dramatic changes in the cost and efficiency of sequencing and genotyping are revolutionizing the way gene bank scientists approach the responsibilities of their job. Sequencing technology provides a set of tools that can be used to enhance the quality, efficiency, and cost-effectiveness of gene bank operations, the depth of scientific knowledge of gene bank holdings, and the level of public interest in natural variation. As a result, gene banks have the chance to take on new life. Previously seen as "warehouses" where seeds were diligently maintained, but evolutionarily frozen in time, gene banks could transform into vibrant research centers that actively investigate the genetic potential of their holdings. In this paper, we will discuss how genotyping and sequencing can be integrated into the activities of a modern gene bank to revolutionize the way scientists document the genetic identity of their accessions; track seed lots, varieties, and alleles; identify duplicates; and rationalize active collections, and how the availability of genomics data are likely to motivate innovative collaborations with the larger research and breeding communities to engage in systematic and rigorous phenotyping and multilocation evaluation of the genetic resources in gene banks around the world. The objective is to understand and eventually predict how variation at the DNA level helps determine the phenotypic potential of an individual or population. 
Leadership and vision are needed to coordinate the characterization of collections and to integrate genotypic and phenotypic information in ways that will illuminate the value of these resources. Genotyping of collections represents a powerful starting point that will enable gene banks to become more effective as stewards of crop biodiversity.
Major histocompatibility complex variation in the endangered Przewalski's horse.

PubMed Central

Hedrick, P W; Parker, K M; Miller, E L; Miller, P S

1999-01-01

The major histocompatibility complex (MHC) is a fundamental part of the vertebrate immune system, and the high variability in many MHC genes is thought to play an essential role in recognition of parasites. The Przewalski's horse is extinct in the wild and all the living individuals descend from 13 founders, most of whom were captured around the turn of the century. One of the primary genetic concerns in endangered species is whether they have ample adaptive variation to respond to novel selective factors. In examining 14 Przewalski's horses that are broadly representative of the living animals, we found six different class II DRB major histocompatibility sequences. The sequences showed extensive nonsynonymous variation, concentrated in the putative antigen-binding sites, and little synonymous variation. Individuals had from two to four sequences as determined by single-stranded conformation polymorphism (SSCP) analysis. On the basis of the SSCP data, phylogenetic analysis of the nucleotide sequences, and segregation in a family group, we conclude that four of these sequences are from one gene (although one sequence codes for a nonfunctional allele because it contains a stop codon) and two other sequences are from another gene. The position of the stop codon is at the same amino-acid position as in a closely related sequence from the domestic horse. Because other organisms have extensive variation at homologous loci, the Przewalski's horse may have quite low variation in this important adaptive region. PMID:10430594
Using chaos to generate variations on movement sequences

NASA Astrophysics Data System (ADS)

Bradley, Elizabeth; Stuart, Joshua

1998-12-01

We describe a method for introducing variations into predefined motion sequences using a chaotic symbol-sequence reordering technique. A progression of symbols representing the body positions in a dance piece, martial arts form, or other motion sequence is mapped onto a chaotic trajectory, establishing a symbolic dynamics that links the movement sequence and the attractor structure. A variation on the original piece is created by generating a trajectory with slightly different initial conditions, inverting the mapping, and using special corpus-based graph-theoretic interpolation schemes to smooth any abrupt transitions. Sensitive dependence guarantees that the variation is different from the original; the attractor structure and the symbolic dynamics guarantee that the two resemble one another in both aesthetic and mathematical senses.
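The core idea, tying symbols to a chaotic trajectory and then reading them back along a nearby trajectory, can be shown with a toy example. This is a heavily simplified sketch and not the authors' system: it assumes the one-dimensional logistic map as the chaotic system, assigns each symbol to one point of the reference orbit, inverts the mapping by nearest neighbour, and omits the graph-theoretic interpolation step entirely. All names and parameter values are hypothetical.

```python
from typing import List, Sequence


def logistic_orbit(x0: float, n: int, r: float = 3.99) -> List[float]:
    """First n points of the logistic map x -> r * x * (1 - x), starting at x0."""
    xs = []
    x = x0
    for _ in range(n):
        xs.append(x)
        x = r * x * (1 - x)
    return xs


def chaotic_variation(symbols: Sequence[str], x0: float = 0.3, delta: float = 1e-6) -> List[str]:
    """Reorder `symbols` by following an orbit started at x0 + delta.

    Symbol i is attached to point i of the reference orbit; each point of the
    perturbed orbit emits the symbol of its nearest reference point.
    """
    ref = logistic_orbit(x0, len(symbols))
    new = logistic_orbit(x0 + delta, len(symbols))
    out = []
    for x in new:
        nearest = min(range(len(ref)), key=lambda i: abs(ref[i] - x))
        out.append(symbols[nearest])
    return out


if __name__ == "__main__":
    piece = ["stand", "plie", "turn", "leap", "land", "bow", "reach", "fold"]
    print(chaotic_variation(piece, delta=1e-3))
```

With `delta=0.0` the perturbed orbit coincides with the reference and the original sequence is returned. For a short sequence a very small `delta` may not diverge enough to change any symbol.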
Complete genome sequence and the expression pattern of plasmids of the model ethanologen Zymomonas mobilis ZM4 and its xylose-utilizing derivatives 8b and 2032.

PubMed

Yang, Shihui; Vera, Jessica M; Grass, Jeff; Savvakis, Giannis; Moskvin, Oleg V; Yang, Yongfu; McIlwain, Sean J; Lyu, Yucai; Zinonos, Irene; Hebert, Alexander S; Coon, Joshua J; Bates, Donna M; Sato, Trey K; Brown, Steven D; Himmel, Michael E; Zhang, Min; Landick, Robert; Pappas, Katherine M; Zhang, Yaoping

2018-01-01

Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4 and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion. 
The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.

Ebolavirus comparative genomics

DOE PAGES

Jun, Se-Ran; Leuze, Michael R.; Nookaew, Intawat; ...

2015-07-14

The 2014 Ebola outbreak in West Africa is the largest documented for this virus. We examine the dynamics of this genome, comparing more than one hundred currently available ebolavirus genomes to each other and to other viral genomes. Based on oligomer frequency analysis, the family Filoviridae forms a distinct group from all other sequenced viral genomes. All filovirus genomes sequenced to date encode proteins with similar functions and gene order, although there is considerable divergence in sequences between the three genera Ebolavirus, Cuevavirus, and Marburgvirus within the family Filoviridae. Whereas all ebolavirus genomes are quite similar (multiple sequences of the same strain are often identical), variation is most common in the intergenic regions and within specific areas of the genes encoding the glycoprotein (GP), nucleoprotein (NP), and polymerase (L). We predict regions that could contain epitope-binding sites, which might be good vaccine targets. In conclusion, this information, combined with glycosylation sites and experimentally determined epitopes, can identify the most promising regions for the development of therapeutic strategies.
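As an illustration of the oligomer frequency analysis mentioned in this abstract, the following minimal sketch counts overlapping k-mers and compares two frequency profiles. The function names, the default k = 4, and the Euclidean distance are assumptions made for illustration; the abstract does not specify the authors' actual procedure.

```python
from collections import Counter


def kmer_frequencies(seq, k=4):
    """Return relative frequencies of all overlapping k-mers in seq.

    Hypothetical helper: k = 4 is an arbitrary illustrative default.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def frequency_distance(profile_a, profile_b):
    """Euclidean distance between two k-mer frequency profiles (assumed metric)."""
    keys = set(profile_a) | set(profile_b)
    return sum(
        (profile_a.get(key, 0.0) - profile_b.get(key, 0.0)) ** 2 for key in keys
    ) ** 0.5
```

Genomes with similar oligomer usage give small distances, which is one simple way a family could appear as a distinct group.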
GAMES identifies and annotates mutations in next-generation sequencing projects.

PubMed

Sana, Maria Elena; Iascone, Maria; Marchetti, Daniela; Palatini, Jeff; Galasso, Marco; Volinia, Stefano

2011-01-01

Next-generation sequencing (NGS) methods have the potential to change the landscape of biomedical science, but at the same time pose several problems in analysis and interpretation. Currently, there are many commercial and public software packages that analyze NGS data. However, the limitations of these applications include output that is insufficiently annotated and difficult for end users to interpret functionally. We developed GAMES (Genomic Analysis of Mutations Extracted by Sequencing), a pipeline aiming to serve as an efficient middleman between data deluge and investigators. GAMES attains multiple levels of filtering and annotation, such as aligning the reads to a reference genome, performing quality control and mutational analysis, integrating results with genome annotations and sorting each mismatch/deletion according to a range of parameters. Variations are matched to known polymorphisms. The prediction of functional mutations is achieved by using different approaches. Overall, GAMES enables an effective complexity reduction in large-scale DNA-sequencing projects. GAMES is available free of charge to academic users and may be obtained from http://aqua.unife.it/GAMES.
Phylogeny and polymorphism in the long control regions E6, E7, and L1 of HPV Type 56 in women from southwest China

PubMed Central

Jing, Yaling; Wang, Tao; Chen, Zuyi; Ding, Xianping; Xu, Jianju; Mu, Xuemei; Cao, Man; Chen, Honghan

2018-01-01

Globally, human papillomavirus (HPV)-56 accounts for a small proportion of all high-risk HPV types; however, HPV-56 is detected at a higher rate in Asia, particularly in southwest China. The present study analyzed polymorphisms, intratypic variants, and genetic variability in the long control regions (LCR), E6, E7, and L1 of HPV-56 (n=75). The LCRs, E6, E7 and L1 were sequenced using a polymerase chain reaction and the sequences were submitted to GenBank. Maximum-likelihood trees were constructed using Kimura's two-parameter model, followed by secondary structure analysis and protein damaging prediction. Additionally, in order to assess the effect of variations in the LCR on putative binding sites for cellular proteins, the MATCH server was used. Finally, the selection pressures of the E6-E7 and L1 genes were estimated. A total of 18 point substitutions, a 42-bp deletion and a 19-bp deletion were identified in the LCR. Some of those mutations are embedded in the putative binding sites for transcription factors. Eighteen single nucleotide changes occurred in the E6-E7 sequence; 11/18 were non-synonymous substitutions and 7/18 were synonymous mutations. A total of 24 single nucleotide changes were identified in the L1 sequence, 6/24 being non-synonymous mutations and 18/24 synonymous mutations. Selective pressure analysis predicted that the majority of mutations of HPV-56 E6, E7 and L1 were under positive selection. The phylogenetic tree demonstrated that the isolates were distributed in two lineages. Data on the prevalence and genetic variation of HPV-56 types in southwest China may aid future studies on viral molecular mechanisms and contribute to future investigations of diagnostic probes and therapeutic vaccines. PMID:29568922
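Kimura's two-parameter model, used above for tree construction, distinguishes transitions from transversions. The sketch below computes the standard K2P distance d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)] for two pre-aligned sequences, where P and Q are the proportions of transitional and transversional differences. The function name and the handling of gaps are illustrative assumptions, not the software used in the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Assumption: positions with gaps or ambiguity codes are skipped.
    Raises ValueError (math domain) if the sequences are saturated.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```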
Brain Region-Specific Expression of Genes Mapped within Quantitative Trait Loci for Behavioral Responsiveness to Acute Stress in Fisher 344 and Wistar Kyoto Male Rats (Postprint)

DTIC Science & Technology

2018-03-12

Integrative Genomics Viewer (Broad Institute, Cambridge, Massachusetts), we identified the coding sequence variations between the F344 and WKY... abnormalities and disturbances in brain metabolism resembling those in depressive states [74]. Ifna2 is also known to induce memory, concentration, and...Variant and Chronic Interpersonal Stress Prospectively Predicts Social Anxiety and Depression Symptoms Over Six Years. Clinical psychological science
Hybridization properties of long nucleic acid probes for detection of variable target sequences, and development of a hybridization prediction algorithm

PubMed Central

Öhrmalm, Christina; Jobs, Magnus; Eriksson, Ronnie; Golbob, Sultan; Elfaitouri, Amal; Benachenhou, Farid; Strømme, Maria; Blomberg, Jonas

2010-01-01

One of the main problems in nucleic acid-based techniques for detection of infectious agents, such as influenza viruses, is that of nucleic acid sequence variation. DNA probes, 70-nt long, some including the nucleotide analog deoxyribose-Inosine (dInosine), were analyzed for hybridization tolerance to different amounts and distributions of mismatching bases, e.g. synonymous mutations, in target DNA. Microsphere-linked 70-mer probes were hybridized in 3M TMAC buffer to biotinylated single-stranded (ss) DNA for subsequent analysis in a Luminex® system. When mismatches interrupted contiguous matching stretches of 6 nt or longer, it had a strong impact on hybridization. Contiguous matching stretches are more important than the same number of matching nucleotides separated by mismatches into several regions. dInosine, but not 5-nitroindole, substitutions at mismatching positions stabilized hybridization remarkably well, comparable to N (4-fold) wobbles in the same positions. In contrast to shorter probes, 70-nt probes with judiciously placed dInosine substitutions and/or wobble positions were remarkably mismatch tolerant, with preserved specificity. An algorithm, NucZip, was constructed to model the nucleation and zipping phases of hybridization, integrating both local and distant binding contributions. It predicted hybridization more exactly than previous algorithms, and has the potential to guide the design of variation-tolerant yet specific probes. PMID:20864443
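The observation that contiguous matching stretches matter more than the total number of matches can be made concrete with a small helper that lists the lengths of uninterrupted matching runs. This is not the NucZip algorithm; the function name is hypothetical, and the sketch assumes the probe and target are already aligned, equal in length, and written in the same orientation.

```python
def matching_stretches(probe, target):
    """Return lengths of contiguous matching runs between two aligned sequences.

    Illustrative only: compares position by position and ignores
    wobble bases and dInosine, which the abstract treats specially.
    """
    if len(probe) != len(target):
        raise ValueError("probe and target must have equal length")
    runs = []
    current = 0
    for p, t in zip(probe.upper(), target.upper()):
        if p == t:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# One mismatch splits a 10-nt match into stretches of 4 and 5.
print(matching_stretches("ACGTACGTAC", "ACGTTCGTAC"))
```

Two targets with the same number of mismatches can therefore yield very different run-length profiles, which is the property the abstract highlights.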
Brain potentials predict learning, transmission and modification of an artificial symbolic system.

PubMed

Lumaca, Massimo; Baggio, Giosuè

2016-12-01

It has recently been argued that symbolic systems evolve while they are being transmitted across generations of learners, gradually adapting to the relevant brain structures and processes. In the context of this hypothesis, little is known about whether individual differences in neural processing capacity account for aspects of 'variation' observed in symbolic behavior and symbolic systems. We addressed this issue in the domain of auditory processing. We conducted a combined behavioral and EEG study on 2 successive days. On day 1, participants listened to standard and deviant five-tone sequences: as in previous oddball studies, a mismatch negativity (MMN) was elicited by deviant tones. On day 2, participants learned an artificial signaling system from a trained confederate of the experimenters in a coordination game in which five-tone sequences were associated with affective meanings (emotion-laden pictures of human faces). In a subsequent game with identical structure, participants transmitted and occasionally changed the signaling system learned during the first game. The MMN latency from day 1 predicted learning, transmission and structural modification of signaling systems on day 2. Our study introduces neurophysiological methods into research on cultural transmission and evolution, and relates aspects of variation in symbolic systems to individual differences in neural information processing. © The Author (2016). Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
[Hydrologic variability and sensitivity based on Hurst coefficient and Bartels statistic].

PubMed

Lei, Xu; Xie, Ping; Wu, Zi Yi; Sang, Yan Fang; Zhao, Jiang Yan; Li, Bin Bin

2018-04-01

Owing to global climate change and frequent human activities in recent years, the purely stochastic component of a hydrological sequence has become mixed with one or more variation components, including jump, trend, period, and dependency. It is urgent to clarify which indices should be used to quantify the degree of this variability. In this study, we defined hydrological variability based on the Hurst coefficient and the Bartels statistic, and used Monte Carlo statistical tests to analyze their sensitivity to the different variation components. When the hydrological sequence had jump or trend variation, both the Hurst coefficient and the Bartels statistic could reflect the variation, with the Hurst coefficient being more sensitive to weak jump or trend variation. When the sequence had a period, only the Bartels statistic could detect the variation. When the sequence had dependency, both the Hurst coefficient and the Bartels statistic could reflect the variation, and the latter could detect weaker dependent variations. For all four variations, both the Hurst variability and the Bartels variability increased as the variation range increased. Thus, they could be used to measure the variation intensity of a hydrological sequence. We analyzed the temperature series of different weather stations in the Lancang River basin. Results showed that the temperature at all stations showed an upward trend or jump, indicating that the entire basin had experienced warming in recent years and that the temperature variability in the upper and lower reaches was much higher. This case study showed the practicability of the proposed method.
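For readers unfamiliar with the Bartels statistic, the sketch below computes the rank version of the von Neumann ratio, RVN = sum (R_i - R_{i+1})^2 / sum (R_i - mean R)^2, on which the Bartels test is based. Values near 2 are expected for a random series, and smaller values suggest trend or positive dependence. The function names are hypothetical, and the variability index that the authors derive from this statistic is not reproduced here.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def bartels_rvn(series):
    """Rank von Neumann ratio of a series (fails if all values are equal)."""
    ranks = average_ranks(series)
    mean_rank = sum(ranks) / len(ranks)
    numerator = sum(
        (ranks[i] - ranks[i + 1]) ** 2 for i in range(len(ranks) - 1)
    )
    denominator = sum((r - mean_rank) ** 2 for r in ranks)
    return numerator / denominator
```

A monotonic series such as 1, 2, 3, 4, 5 gives a ratio well below 2, whereas a strongly alternating series gives a ratio above 2.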
A Modified LS+AR Model to Improve the Accuracy of the Short-term Polar Motion Prediction

NASA Astrophysics Data System (ADS)

Wang, Z. W.; Wang, Q. X.; Ding, Y. Q.; Zhang, J. J.; Liu, S. S.

2017-03-01

The LS (Least Squares)+AR (AutoRegressive) model has two problems in polar motion forecasting: the residuals within the LS fitting interval are reasonable, but the residuals of the LS extrapolation are poor; and the LS fitting residual sequence is non-linear. It is therefore unsuitable to establish an AR model for the residual sequence to be forecast based on the residual sequence before the forecast epoch. In this paper, we address these two problems in two steps. First, restrictions are added to the two endpoints of the LS fitting data to fix them on the LS fitting curve. Therefore, the fitting values next to the two endpoints are very close to the observation values. Second, we select the interpolation residual sequence of an inward LS fitting curve, which has a similar variation trend to the LS extrapolation residual sequence, as the modeling object of AR for the residual forecast. Calculation examples show that this solution can effectively improve the short-term polar motion prediction accuracy of the LS+AR model. In addition, comparisons with the forecast models RLS (Robustified Least Squares)+AR, RLS+ARIMA (AutoRegressive Integrated Moving Average), and LS+ANN (Artificial Neural Network) confirm the feasibility and effectiveness of the solution for polar motion forecasting. The results, especially for polar motion forecasts of 1-10 days, show that the forecast accuracy of the proposed model can reach the international level.
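The basic LS+AR idea, fitting a deterministic curve by least squares and then forecasting its residuals with an autoregressive model, can be sketched as follows. This is a deliberately reduced illustration under stated assumptions: a straight line stands in for the trend-plus-periodic terms normally used for polar motion, the AR order is fixed at 1, the samples are evenly spaced, and all function names are hypothetical. It does not implement the endpoint constraints or the inward-fitting residual selection proposed in the paper.

```python
def fit_line(t, y):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_t) ** 2 for a in t)
    sxy = sum((a - mean_t) * (b - mean_y) for a, b in zip(t, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t


def fit_ar1(residuals):
    """Least-squares AR(1) coefficient phi in r[i] = phi * r[i-1] + noise."""
    num = sum(residuals[i] * residuals[i - 1] for i in range(1, len(residuals)))
    den = sum(r * r for r in residuals[:-1])
    return num / den if den else 0.0


def ls_ar_forecast(t, y, steps):
    """Extrapolate the LS line and add the AR(1) forecast of its residuals."""
    slope, intercept = fit_line(t, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(t, y)]
    phi = fit_ar1(residuals)
    dt = t[-1] - t[-2]  # assumes evenly spaced epochs
    last = residuals[-1]
    forecast = []
    for k in range(1, steps + 1):
        last = phi * last  # AR(1) residual prediction decays geometrically
        forecast.append(slope * (t[-1] + k * dt) + intercept + last)
    return forecast
```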
Non-B-Form DNA Is Enriched at Centromeres

PubMed Central

Henikoff, Steven

2018-01-01

Abstract Animal and plant centromeres are embedded in repetitive “satellite” DNA, but are thought to be epigenetically specified. To define genetic characteristics of centromeres, we surveyed satellite DNA from diverse eukaryotes and identified variation in <10-bp dyad symmetries predicted to adopt non-B-form conformations. Organisms lacking centromeric dyad symmetries had binding sites for sequence-specific DNA-binding proteins with DNA-bending activity. For example, human and mouse centromeres are depleted for dyad symmetries, but are enriched for non-B-form DNA and are associated with binding sites for the conserved DNA-binding protein CENP-B, which is required for artificial centromere function but is paradoxically nonessential. We also detected dyad symmetries and predicted non-B-form DNA structures at neocentromeres, which form at ectopic loci. We propose that centromeres form at non-B-form DNA because of dyad symmetries or are strengthened by sequence-specific DNA binding proteins. This may resolve the CENP-B paradox and provide a general basis for centromere specification. PMID:29365169
A versatile palindromic amphipathic repeat coding sequence horizontally distributed among diverse bacterial and eucaryotic microbes

PubMed Central

2010-01-01

Background Intragenic tandem repeats occur throughout all domains of life and impart functional and structural variability to diverse translation products. Repeat proteins confer distinctive surface phenotypes to many unicellular organisms, including those with minimal genomes such as the wall-less bacterial monoderms, Mollicutes. One such repeat pattern in this clade is distributed in a manner suggesting its exchange by horizontal gene transfer (HGT). Expanding genome sequence databases reveal the pattern in a widening range of bacteria, and recently among eucaryotic microbes. We examined the genomic flux and consequences of the motif by determining its distribution, predicted structural features and association with membrane-targeted proteins. Results Using a refined hidden Markov model, we document a 25-residue protein sequence motif tandemly arrayed in variable-number repeats in ORFs lacking assigned functions. It appears sporadically in unicellular microbes from disparate bacterial and eucaryotic clades, representing diverse lifestyles and ecological niches that include host parasitic, marine and extreme environments. Tracts of the repeats predict a malleable configuration of recurring domains, with conserved hydrophobic residues forming an amphipathic secondary structure in which hydrophilic residues endow extensive sequence variation. Many ORFs with these domains also have membrane-targeting sequences that predict assorted topologies; others may comprise reservoirs of sequence variants. We demonstrate expressed variants among surface lipoproteins that distinguish closely related animal pathogens belonging to a subgroup of the Mollicutes. DNA sequences encoding the tandem domains display dyad symmetry. Moreover, in some taxa the domains occur in ORFs selectively associated with mobile elements. 
These features, a punctate phylogenetic distribution, and different patterns of dispersal in genomes of related taxa, suggest that the repeat may be disseminated by HGT and intra-genomic shuffling. Conclusions We describe novel features of PARCELs (Palindromic Amphipathic Repeat Coding ELements), a set of widely distributed repeat protein domains and coding sequences that were likely acquired through HGT by diverse unicellular microbes, further mobilized and diversified within genomes, and co-opted for expression in the membrane proteome of some taxa. Disseminated by multiple gene-centric vehicles, ORFs harboring these elements enhance accessory gene pools as part of the "mobilome" connecting genomes of various clades, in taxa sharing common niches. PMID:20626840
A coarse-grained biophysical model of sequence evolution and the population size dependence of the speciation rate

PubMed Central

Khatri, Bhavin S.; Goldstein, Richard A.

2015-01-01

Speciation is fundamental to understanding the huge diversity of life on Earth. Although still controversial, empirical evidence suggests that the rate of speciation is larger for smaller populations. Here, we explore a biophysical model of speciation by developing a simple coarse-grained theory of transcription factor-DNA binding and how their co-evolution in two geographically isolated lineages leads to incompatibilities. To develop a tractable analytical theory, we derive a Smoluchowski equation for the dynamics of binding energy evolution that accounts for the fact that natural selection acts on phenotypes, but variation arises from mutations in sequences; the Smoluchowski equation includes selection due to both gradients in fitness and gradients in sequence entropy, which is the logarithm of the number of sequences that correspond to a particular binding energy. This simple consideration predicts that smaller populations develop incompatibilities more quickly in the weak mutation regime; this trend arises as sequence entropy poises smaller populations closer to incompatible regions of phenotype space. These results suggest a generic coarse-grained approach to evolutionary stochastic dynamics, allowing realistic modelling at the phenotypic level. PMID:25936759
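The definition of sequence entropy given above (the logarithm of the number of sequences that correspond to a particular binding energy) can be illustrated with a toy model. The sketch assumes, purely for illustration, that binding energy depends only on the number of mismatches between a binding site of fixed length and its best-binding sequence; the mismatch model, the alphabet size, and the function name are assumptions, not the authors' formulation.

```python
import math


def sequence_entropy(length, mismatches, alphabet=4):
    """ln(number of sequences with exactly `mismatches` mismatches).

    Toy degeneracy: choose the mismatched positions, then any of the
    (alphabet - 1) non-matching letters at each of them.
    """
    count = math.comb(length, mismatches) * (alphabet - 1) ** mismatches
    return math.log(count)


# Entropy grows quickly with mismatch number for a 10-bp site,
# so many more sequences encode weak binding than strong binding.
for k in range(4):
    print(k, round(sequence_entropy(10, k), 3))
```

Because far more sequences correspond to weak binding, entropy alone pushes populations toward weaker binding, and selection must counteract it; this is the kind of entropy gradient the abstract describes.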
Energy hyperspace for stacking interaction in AU/AU dinucleotide step: Dispersion-corrected density functional theory study.

PubMed

Mukherjee, Sanchita; Kailasam, Senthilkumar; Bansal, Manju; Bhattacharyya, Dhananjay

2014-01-01

Double helical structures of DNA and RNA are mostly determined by base pair stacking interactions, which give them the base sequence-directed features, such as small roll values for the purine-pyrimidine steps. Earlier attempts to characterize stacking interactions were mostly restricted to calculations on fiber diffraction geometries or optimized structures using ab initio calculations, which lack the variation in geometry needed to comment on the rather unusual large roll values observed in the AU/AU base pair step in crystal structures of RNA double helices. We have generated a stacking energy hyperspace by modeling geometries with variations along the important degrees of freedom, roll and slide, which were chosen via statistical analysis as maximally sequence dependent. Corresponding energy contours were constructed by several quantum chemical methods including dispersion corrections. This analysis established the most suitable methods for stacked base pair systems, despite the limitation that the number of atoms in a base pair step imposes on the use of very high levels of theory. All the methods predict a negative roll value and near-zero slide to be most favorable for the purine-pyrimidine steps, in agreement with Calladine's steric clash based rule. Successive base pairs in RNA are always linked by a sugar-phosphate backbone with C3'-endo sugars, and this demands a C1'-C1' distance of about 5.4 Å along the chains. Adding an energy penalty term for deviation of the C1'-C1' distance from the mean value to the recent DFT-D functionals, specifically ωB97X-D, appears to predict a reliable energy contour for the AU/AU step. Such a distance-based penalty also improves energy contours for the other purine-pyrimidine sequences. © 2013 Wiley Periodicals, Inc. Biopolymers 101: 107-120, 2014. Copyright © 2013 Wiley Periodicals, Inc.
Penicillin-Binding Protein Transpeptidase Signatures for Tracking and Predicting β-Lactam Resistance Levels in Streptococcus pneumoniae

PubMed Central

Metcalf, Benjamin J.; Chochua, Sopio; Li, Zhongya; Gertz, Robert E.; Walker, Hollis; Hawkins, Paulina A.; Tran, Theresa; Whitney, Cynthia G.; McGee, Lesley; Beall, Bernard W.

2016-01-01

ABSTRACT β-Lactam antibiotics are the drugs of choice to treat pneumococcal infections. The spread of β-lactam-resistant pneumococci is a major concern in choosing an effective therapy for patients. Systematically tracking β-lactam resistance could benefit disease surveillance. Here we developed a classification system in which a pneumococcal isolate is assigned to a “PBP type” based on sequence signatures in the transpeptidase domains (TPDs) of the three critical penicillin-binding proteins (PBPs), PBP1a, PBP2b, and PBP2x. We identified 307 unique PBP types from 2,528 invasive pneumococcal isolates, which had known MICs to six β-lactams based on broth microdilution. We found that increased β-lactam MICs strongly correlated with PBP types containing divergent TPD sequences. The PBP type explained 94 to 99% of variation in MICs both before and after accounting for genomic backgrounds defined by multilocus sequence typing, indicating that genomic backgrounds made little independent contribution to β-lactam MICs at the population level. We further developed and evaluated predictive models of MICs based on PBP type. Compared to microdilution MICs, MICs predicted by PBP type showed essential agreement (MICs agree within 1 dilution) of >98%, category agreement (interpretive results agree) of >94%, a major discrepancy (sensitive isolate predicted as resistant) rate of <3%, and a very major discrepancy (resistant isolate predicted as sensitive) rate of <2% for all six β-lactams. Thus, the PBP transpeptidase signatures are robust indicators of MICs to different β-lactam antibiotics in clinical pneumococcal isolates and serve as an accurate alternative to phenotypic susceptibility testing. PMID:27302760
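The essential-agreement criterion quoted above (predicted and measured MICs agree within one dilution) is easy to state in code. The sketch below assumes MICs lie on a twofold dilution series, so "within one dilution" means the log2 values differ by at most 1. The function name and the example values are hypothetical and are not data from the study.

```python
import math


def essential_agreement(predicted, observed):
    """Fraction of isolates whose predicted MIC is within 1 twofold dilution.

    Assumes positive MIC values on a twofold dilution scale; a small
    tolerance absorbs floating-point error in the log2 difference.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        abs(math.log2(p) - math.log2(o)) <= 1 + 1e-9
        for p, o in zip(predicted, observed)
    )
    return hits / len(predicted)


# Hypothetical MICs in ug/mL: the third isolate differs by two dilutions.
print(essential_agreement([0.03, 0.06, 2, 8], [0.03, 0.12, 0.5, 8]))
```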
DNA barcode identification of Podocarpaceae--the second largest conifer family.

PubMed

Little, Damon P; Knopf, Patrick; Schulz, Christian

2013-01-01

We have generated matK, rbcL, and nrITS2 DNA barcodes for 320 specimens representing all 18 extant genera of the conifer family Podocarpaceae. The sample includes 145 of the 198 recognized species. Comparative analyses of sequence quality and species discrimination were conducted on the 159 individuals from which all three markers were recovered (representing 15 genera and 97 species). The vast majority of sequences were of high quality (B 30 = 0.596-0.989). Even the lowest quality sequences exceeded the minimum requirements of the BARCODE data standard. In the few instances that low quality sequences were generated, the responsible mechanism could not be discerned. There were no statistically significant differences in the discriminatory power of markers or marker combinations (p = 0.05). The discriminatory power of the barcode markers individually and in combination is low (56.7% of species at maximum). In some instances, species discrimination failed in spite of ostensibly useful variation being present (genotypes were shared among species), but in many cases there was simply an absence of sequence variation. Barcode gaps (maximum intraspecific p-distance > minimum interspecific p-distance) were observed in 50.5% of species when all three markers were considered simultaneously. The presence of a barcode gap was not predictive of discrimination success (p = 0.02) and there was no statistically significant difference in the frequency of barcode gaps among markers (p = 0.05). In addition, there was no correlation between number of individuals sampled per species and the presence of a barcode gap (p = 0.27).
Evolution of meiotic recombination genes in maize and teosinte.

PubMed

Sidhu, Gaganpreet K; Warzecha, Tomasz; Pawlowski, Wojciech P

2017-01-25

Meiotic recombination is a major source of genetic variation in eukaryotes. The role of recombination in evolution is recognized but little is known about how evolutionary forces affect the recombination pathway itself. Although the recombination pathway is fundamentally conserved across different species, genetic variation in recombination components and outcomes has been observed. Theoretical predictions and empirical studies suggest that changes in the recombination pathway are likely to provide adaptive abilities to populations experiencing directional or strong selection pressures, such as those occurring during species domestication. We hypothesized that adaptive changes in recombination may be associated with adaptive evolution patterns of genes involved in meiotic recombination. To examine how maize evolution and domestication affected meiotic recombination genes, we studied patterns of sequence polymorphism and divergence in eleven genes controlling key steps in the meiotic recombination pathway in a diverse set of maize inbred lines and several accessions of teosinte, the wild ancestor of maize. We discovered that, even though the recombination genes generally exhibited high sequence conservation expected in a pathway controlling a key cellular process, they showed substantial levels and diverse patterns of sequence polymorphism. Among others, we found differences in sequence polymorphism patterns between tropical and temperate maize germplasms. Several recombination genes displayed patterns of polymorphism indicative of adaptive evolution. Despite their ancient origin and overall sequence conservation, meiotic recombination genes can exhibit extensive and complex patterns of molecular evolution. Changes in these genes could affect the functioning of the recombination pathway, and may have contributed to the successful domestication of maize and its expansion to new cultivation areas.
DNA Barcode Identification of Podocarpaceae—The Second Largest Conifer Family

PubMed Central

Little, Damon P.; Knopf, Patrick; Schulz, Christian

2013-01-01

We have generated matK, rbcL, and nrITS2 DNA barcodes for 320 specimens representing all 18 extant genera of the conifer family Podocarpaceae. The sample includes 145 of the 198 recognized species. Comparative analyses of sequence quality and species discrimination were conducted on the 159 individuals from which all three markers were recovered (representing 15 genera and 97 species). The vast majority of sequences were of high quality (B 30 = 0.596–0.989). Even the lowest quality sequences exceeded the minimum requirements of the BARCODE data standard. In the few instances that low quality sequences were generated, the responsible mechanism could not be discerned. There were no statistically significant differences in the discriminatory power of markers or marker combinations (p = 0.05). The discriminatory power of the barcode markers individually and in combination is low (56.7% of species at maximum). In some instances, species discrimination failed in spite of ostensibly useful variation being present (genotypes were shared among species), but in many cases there was simply an absence of sequence variation. Barcode gaps (maximum intraspecific p–distance > minimum interspecific p–distance) were observed in 50.5% of species when all three markers were considered simultaneously. The presence of a barcode gap was not predictive of discrimination success (p = 0.02) and there was no statistically significant difference in the frequency of barcode gaps among markers (p = 0.05). In addition, there was no correlation between number of individuals sampled per species and the presence of a barcode gap (p = 0.27). PMID:24312258
Specificity determinants for the abscisic acid response element.

PubMed

Sarkar, Aditya Kumar; Lahiri, Ansuman

2013-01-01

Abscisic acid (ABA) response elements (ABREs) are a group of cis-acting DNA elements that have been identified from promoter analysis of many ABA-regulated genes in plants. We are interested in understanding the mechanism of binding specificity between ABREs and a class of bZIP transcription factors known as ABRE binding factors (ABFs). In this work, we have modeled the homodimeric structure of the bZIP domain of ABRE binding factor 1 from Arabidopsis thaliana (AtABF1) and studied its interaction with ACGT core motif-containing ABRE sequences. We have also examined the variation in the stability of the protein-DNA complex upon mutating ABRE sequences using the protein design algorithm FoldX. The high throughput free energy calculations successfully predicted the ability of ABF1 to bind to alternative core motifs like GCGT or AAGT and also rationalized the role of the flanking sequences in determining the specificity of the protein-DNA interaction.
Free Vibration of Uncertain Unsymmetrically Laminated Beams

NASA Technical Reports Server (NTRS)

Kapania, Rakesh K.; Goyal, Vijay K.

2001-01-01

Monte Carlo Simulation and Stochastic FEA are used to predict randomness in the free vibration response of thin unsymmetrically laminated beams. For the present study, it is assumed that randomness in the response is only caused by uncertainties in the ply orientations. The ply orientations may become random or uncertain during the manufacturing process. A new 16-dof beam element, based on the first-order shear deformation beam theory, is used to study the stochastic nature of the natural frequencies. Using variational principles, the element stiffness matrix and mass matrix are obtained through analytical integration. Using a random sequence, a large data set is generated, containing possible random ply-orientations. This data is assumed to be symmetric. The stochastic-based finite element model for free vibrations predicts the relation between the randomness in fundamental natural frequencies and the randomness in ply-orientation. The sensitivity derivatives are calculated numerically through an exact formulation. The squared fundamental natural frequencies are expressed in terms of deterministic and probabilistic quantities, making it possible to determine how sensitive they are to variations in ply angles. The predicted mean-valued fundamental natural frequency squared and the variance of the present model are in good agreement with Monte Carlo Simulation. Results also show that variations between plus or minus 5 degrees in ply-angles can affect the free vibration response of unsymmetrically and symmetrically laminated beams.
Microsatellite analysis in the genome of Acanthaceae: An in silico approach

PubMed Central

Kaliswamy, Priyadharsini; Vellingiri, Srividhya; Nathan, Bharathi; Selvaraj, Saravanakumar

2015-01-01

Background: Acanthaceae is one of the advanced and specialized families with conventionally used medicinal plants. Simple sequence repeats (SSRs) play a major role as molecular markers for genome analysis and plant breeding. The microsatellites existing in the complete genome sequences would help to attain a direct role in the genome organization, recombination, gene regulation, quantitative genetic variation, and evolution of genes. Objective: The current study reports the frequency of microsatellites and appropriate markers for the Acanthaceae family genome sequences. Materials and Methods: The whole nucleotide sequences of Acanthaceae species were obtained from the National Center for Biotechnology Information database and screened for the presence of SSRs. The SSR Locator tool was used to predict the microsatellites and its built-in Primer3 module was used for primer design. Results: In total, 110 repeats from 108 sequences of Acanthaceae family plant genomes were identified, and dinucleotide repeats were found to be abundant in the genome sequences. The essential amino acid isoleucine was found to be rich in all the sequences. We also designed SSR-based primers/markers for 59 sequences of this family that contain microsatellite repeats in their genome. Conclusion: The identified microsatellites and primers might be useful for breeding and genetic studies of plants that belong to the Acanthaceae family in the future. PMID:25709226
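To show what screening a sequence for dinucleotide SSRs involves, here is a minimal regular-expression sketch. It is not the SSR Locator tool used in the study; the function name and the minimum of six repeat units are assumptions chosen for illustration.

```python
import re


def find_dinucleotide_ssrs(seq, min_repeats=6):
    """Return (start, unit, repeat_count) for perfect dinucleotide repeats.

    Hypothetical helper: min_repeats is an arbitrary illustrative threshold.
    Homopolymer runs such as AAAAAA are skipped because their 2-nt "unit"
    is really a mononucleotide repeat.
    """
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[0] == unit[1]:
            continue
        hits.append((match.start(), unit, len(match.group(0)) // 2))
    return hits


# A toy sequence with seven AT units starting at position 3.
print(find_dinucleotide_ssrs("GGC" + "AT" * 7 + "CCG"))
```

The reported start positions could then be used to extract flanking sequence for primer design, which is the step the abstract delegates to Primer3.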
Mutations that Cause Human Disease: A Computational/Experimental Approach

DOE Office of Scientific and Technical Information (OSTI.GOV)

Beernink, P; Barsky, D; Pesavento, B

International genome sequencing projects have produced billions of nucleotides (letters) of DNA sequence data, including the complete genome sequences of 74 organisms. These genome sequences have created many new scientific opportunities, including the ability to identify sequence variations among individuals within a species. These genetic differences, which are known as single nucleotide polymorphisms (SNPs), are particularly important in understanding the genetic basis for disease susceptibility. Since the report of the complete human genome sequence, over two million human SNPs have been identified, including a large-scale comparison of an entire chromosome from twenty individuals. Of the protein coding SNPs (cSNPs), approximately half lead to a single amino acid change in the encoded protein (non-synonymous coding SNPs). Most of these changes are functionally silent, while the remainder negatively impact the protein and sometimes cause human disease. To date, over 550 SNPs have been found to cause single locus (monogenic) diseases and many others have been associated with polygenic diseases. SNPs have been linked to specific human diseases, including late-onset Parkinson disease, autism, rheumatoid arthritis and cancer. The ability to predict accurately the effects of these SNPs on protein function would represent a major advance toward understanding these diseases. To date several attempts have been made toward predicting the effects of such mutations. The most successful of these is a computational approach called "Sorting Intolerant From Tolerant" (SIFT). This method uses sequence conservation among many similar proteins to predict which residues in a protein are functionally important. However, this method suffers from several limitations. First, a query sequence must have a sufficient number of relatives to infer sequence conservation. 
Second, this method does not make use of or provide any information on protein structure, which can be used to understand how an amino acid change affects the protein. The experimental methods that provide the most detailed structural information on proteins are X-ray crystallography and NMR spectroscopy. However, these methods are labor intensive and currently cannot be carried out on a genomic scale. Nonetheless, Structural Genomics projects are being pursued by more than a dozen groups and consortia worldwide and as a result the number of experimentally determined structures is rising exponentially. Based on the expectation that protein structures will continue to be determined at an ever-increasing rate, reliable structure prediction schemes will become increasingly valuable, leading to information on protein function and disease for many different proteins. Given known genetic variability and experimentally determined protein structures, can we accurately predict the effects of single amino acid substitutions? An objective assessment of this question would involve comparing predicted and experimentally determined structures, which thus far has not been rigorously performed. The completed research leveraged existing expertise at LLNL in computational and structural biology, as well as significant computing resources, to address this question.

A genome-wide SNP-association study confirms a sequence variant (g.66493737C>T) in the equine myostatin (MSTN) gene as the most powerful predictor of optimum racing distance for Thoroughbred racehorses

PubMed Central

2010-01-01

Background: Thoroughbred horses have been selected for traits contributing to speed and stamina for centuries. It is widely recognized that inherited variation in physical and physiological characteristics is responsible for variation in individual aptitude for race distance, and that muscle phenotypes in particular are important. Results: A genome-wide SNP-association study for optimum racing distance was performed using the EquineSNP50 Bead Chip genotyping array in a cohort of n = 118 elite Thoroughbred racehorses divergent for race distance aptitude. In a cohort-based association test we evaluated genotypic variation at 40,977 SNPs between horses suited to short distance (≤ 8 f) and middle-long distance (> 8 f) races. The most significant SNP was located on chromosome 18: BIEC2-417495 ~690 kb from the gene encoding myostatin (MSTN) [P(unadj.) = 6.96 × 10⁻⁶]. Considering best race distance as a quantitative phenotype, a peak of association on chromosome 18 (chr18:65809482-67545806) comprising eight SNPs encompassing a 1.7 Mb region was observed. Again, similar to the cohort-based analysis, the most significant SNP was BIEC2-417495 (P(unadj.) = 1.61 × 10⁻⁹; P(Bonf.) = 6.58 × 10⁻⁵). In a candidate gene study we have previously reported a SNP (g.66493737C>T) in MSTN associated with best race distance in Thoroughbreds; however, its functional and genome-wide relevance were uncertain. Additional re-sequencing in the flanking regions of the MSTN gene revealed four novel 3' UTR SNPs and a 227 bp SINE insertion polymorphism in the 5' UTR promoter sequence. Linkage disequilibrium was highest between g.66493737C>T and BIEC2-417495 (r² = 0.86). Conclusions: Comparative association tests consistently demonstrated the g.66493737C>T SNP as the superior variant in the prediction of distance aptitude in racehorses (g.66493737C>T, P = 1.02 × 10⁻¹⁰; BIEC2-417495, P(unadj.) = 1.61 × 10⁻⁹). 
Functional investigations will be required to determine whether this polymorphism affects putative transcription-factor binding and gives rise to variation in gene and protein expression. Nonetheless, this study demonstrates that the g.66493737C>T SNP provides the most powerful genetic marker for prediction of race distance aptitude in Thoroughbreds. PMID:20932346
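The relation between the unadjusted and Bonferroni-adjusted P values quoted in the abstract above can be checked with a short Python sketch. It assumes the standard Bonferroni correction (multiply the unadjusted P value by the number of tests and cap the result at 1) and assumes that the number of tests equals the 40,977 SNPs mentioned in the abstract; the abstract does not state the exact multiplier the authors used. The function name bonferroni is illustrative only.

```python
def bonferroni(p_unadjusted, n_tests):
    """Bonferroni-adjusted P value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_unadjusted * n_tests)

n_snps = 40977       # SNPs evaluated, as reported in the abstract
p_unadj = 1.61e-9    # unadjusted P for BIEC2-417495 (quantitative analysis)

p_bonf = bonferroni(p_unadj, n_snps)
print(f"{p_bonf:.2e}")  # 6.60e-05
```

Under these assumptions the result is about 6.6 × 10⁻⁵, close to the reported 6.58 × 10⁻⁵; the small difference may come from rounding of the published unadjusted value or from a slightly different test count.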
A genome-wide SNP-association study confirms a sequence variant (g.66493737C>T) in the equine myostatin (MSTN) gene as the most powerful predictor of optimum racing distance for Thoroughbred racehorses.

PubMed

Hill, Emmeline W; McGivney, Beatrice A; Gu, Jingjing; Whiston, Ronan; Machugh, David E

2010-10-11

Thoroughbred horses have been selected for traits contributing to speed and stamina for centuries. It is widely recognized that inherited variation in physical and physiological characteristics is responsible for variation in individual aptitude for race distance, and that muscle phenotypes in particular are important. A genome-wide SNP-association study for optimum racing distance was performed using the EquineSNP50 Bead Chip genotyping array in a cohort of n = 118 elite Thoroughbred racehorses divergent for race distance aptitude. In a cohort-based association test we evaluated genotypic variation at 40,977 SNPs between horses suited to short distance (≤ 8 f) and middle-long distance (> 8 f) races. The most significant SNP was located on chromosome 18: BIEC2-417495 ~690 kb from the gene encoding myostatin (MSTN) [P(unadj.) = 6.96 x 10⁻⁶]. Considering best race distance as a quantitative phenotype, a peak of association on chromosome 18 (chr18:65809482-67545806) comprising eight SNPs encompassing a 1.7 Mb region was observed. Again, similar to the cohort-based analysis, the most significant SNP was BIEC2-417495 (P(unadj.) = 1.61 x 10⁻⁹; P(Bonf.) = 6.58 x 10⁻⁵). In a candidate gene study we have previously reported a SNP (g.66493737C>T) in MSTN associated with best race distance in Thoroughbreds; however, its functional and genome-wide relevance were uncertain. Additional re-sequencing in the flanking regions of the MSTN gene revealed four novel 3' UTR SNPs and a 227 bp SINE insertion polymorphism in the 5' UTR promoter sequence. Linkage disequilibrium was highest between g.66493737C>T and BIEC2-417495 (r² = 0.86). Comparative association tests consistently demonstrated the g.66493737C>T SNP as the superior variant in the prediction of distance aptitude in racehorses (g.66493737C>T, P = 1.02 x 10⁻¹⁰; BIEC2-417495, P(unadj.) = 1.61 x 10⁻⁹). 
Functional investigations will be required to determine whether this polymorphism affects putative transcription-factor binding and gives rise to variation in gene and protein expression. Nonetheless, this study demonstrates that the g.66493737C>T SNP provides the most powerful genetic marker for prediction of race distance aptitude in Thoroughbreds.
Genetic variations of the SLCO1B1 gene in the Chinese, Malay and Indian populations of Singapore.

PubMed

Ho, Woon Fei; Koo, Seok Hwee; Yee, Jie Yin; Lee, Edmund Jon Deoon

2008-01-01

OATP1B1 is a liver-specific transporter that mediates the uptake of various endogenous and exogenous compounds including many clinically used drugs from blood into hepatocytes. This study aims to identify genetic variations of SLCO1B1 gene in three distinct ethnic groups of the Singaporean population (n=288). The coding region of the gene encoding the transporter protein was screened for genetic variations in the study population by denaturing high-performance liquid chromatography and DNA sequencing. Twenty-five genetic variations of SLCO1B1, including 10 novel ones, were found: 13 in the coding exons (9 nonsynonymous and 4 synonymous variations), 6 in the introns, and 6 in the 3' untranslated region. Four novel nonsynonymous variations: 633A>G (Ile211Met), 875C>T (Ala292Val), 1837T>C (Cys613Arg), and 1877T>A (Leu626Stop) were detected as heterozygotes. Among the novel nonsynonymous variations, 633A>G, 1837T>C, and 1877T>A were predicted to be functionally significant. These data would provide fundamental and useful information for pharmacogenetic studies on drugs that are substrates of OATP1B1 in Asians.
The effects of processing and sequence organization on the timing of turn taking: a corpus study

PubMed Central

Roberts, Seán G.; Torreira, Francisco; Levinson, Stephen C.

2015-01-01

The timing of turn taking in conversation is extremely rapid given the cognitive demands on speakers to comprehend, plan and execute turns in real time. Findings from psycholinguistics predict that the timing of turn taking is influenced by demands on processing, such as word frequency or syntactic complexity. An alternative view comes from the field of conversation analysis, which predicts that the rules of turn-taking and sequence organization may dictate the variation in gap durations (e.g., the functional role of each turn in communication). In this paper, we estimate the role of these two different kinds of factors in determining the speed of turn-taking in conversation. We use the Switchboard corpus of English telephone conversation, already richly annotated for syntactic structure, speech act sequences, and segmental alignment. To this we add further information including Floor Transfer Offset (the amount of time between the end of one turn and the beginning of the next), word frequency, concreteness, and surprisal values. We then apply a novel statistical framework ("random forests") to show that these two dimensions are interwoven together with indexical properties of the speakers as explanatory factors determining the speed of response. We conclude that an explanation of the timing of turn taking will require insights from both processing and sequence organization. PMID:26029125
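The Floor Transfer Offset defined in the abstract above (time between the end of one turn and the beginning of the next) can be sketched in a few lines of Python. The Turn record, the function name, and the example timings are hypothetical and are not taken from the Switchboard corpus or from the authors' code. The sketch also assumes the common convention that a negative value denotes overlapping speech.

```python
from typing import List, NamedTuple

class Turn(NamedTuple):
    speaker: str
    start: float  # seconds from the start of the recording
    end: float

def floor_transfer_offsets(turns: List[Turn]) -> List[float]:
    """FTO at each speaker change: next turn's start minus previous turn's end.

    Positive values are gaps; negative values are overlaps.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    return [nxt.start - prev.end
            for prev, nxt in zip(ordered, ordered[1:])
            if prev.speaker != nxt.speaker]

# Made-up timings for illustration only.
turns = [
    Turn("A", 0.0, 1.5),
    Turn("B", 1.75, 3.0),   # starts 0.25 s after A finishes (gap)
    Turn("A", 2.5, 4.0),    # starts 0.5 s before B finishes (overlap)
]
print(floor_transfer_offsets(turns))  # [0.25, -0.5]
```

Consecutive turns by the same speaker are skipped, since no floor transfer takes place between them.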
Two combinatorial optimization problems for SNP discovery using base-specific cleavage and mass spectrometry.

PubMed

Chen, Xin; Wu, Qiong; Sun, Ruimin; Zhang, Louxin

2012-01-01

The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach. In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as SNP - MSP, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as SNP - MSQ, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the SNP - MSP problem and also show that the SNP - MSQ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem. We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

PubMed

Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

2017-01-01

Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.
A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

PubMed Central

Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

2017-01-01

Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687
LOVD: easy creation of a locus-specific sequence variation database using an "LSDB-in-a-box" approach.

PubMed

Fokkema, Ivo F A C; den Dunnen, Johan T; Taschner, Peter E M

2005-08-01

The completion of the human genome project has initiated, as well as provided the basis for, the collection and study of all sequence variation between individuals. Direct access to up-to-date information on sequence variation is currently provided most efficiently through web-based, gene-centered, locus-specific databases (LSDBs). We have developed the Leiden Open (source) Variation Database (LOVD) software approaching the "LSDB-in-a-Box" idea for the easy creation and maintenance of a fully web-based gene sequence variation database. LOVD is platform-independent and uses PHP and MySQL open source software only. The basic gene-centered and modular design of the database follows the recommendations of the Human Genome Variation Society (HGVS) and focuses on the collection and display of DNA sequence variations. With minimal effort, the LOVD platform is extendable with clinical data. The open set-up should both facilitate and promote functional extension with scripts written by the community. The LOVD software is freely available from the Leiden Muscular Dystrophy pages (www.DMD.nl/LOVD/). To promote the use of LOVD, we currently offer curators the possibility to set up an LSDB on our Leiden server. (c) 2005 Wiley-Liss, Inc.
Assembly-history dynamics of a pitcher-plant protozoan community in experimental microcosms.

PubMed

Kadowaki, Kohmei; Inouye, Brian D; Miller, Thomas E

2012-01-01

History drives community assembly through differences both in density (density effects) and in the sequence in which species arrive (sequence effects). Density effects arise from predictable population dynamics, which are free of history, but sequence effects are due to a density-free mechanism, arising solely from the order and timing of immigration events. Few studies have determined how components of immigration history (timing, number of individuals, frequency) alter local dynamics to determine community assembly, beyond addressing when immigration history produces historically contingent assembly. We varied density and sequence effects independently in a two-way factorial design to follow community assembly in a three-species aquatic protozoan community. A superior competitor, Colpoda steinii, mediated alternative community states; early arrival or high introduction density allowed this species to outcompete or suppress the other competitors (Poterioochromonas malhamensis and Eimeriidae gen. sp.). Multivariate analysis showed that density effects caused greater variation in community states, whereas sequence effects altered the mean community composition. A significant interaction between density and sequence effects suggests that we should refine our understanding of priority effects. These results highlight a practical need to understand not only the "ingredients" (species) in ecological communities but their "recipes" as well.
Next-generation sequencing: advances and applications in cancer diagnosis

PubMed Central

Serratì, Simona; De Summa, Simona; Pilato, Brunella; Petriella, Daniela; Lacalamita, Rosanna; Tommasi, Stefania; Pinto, Rosamaria

2016-01-01

Technological advances have led to the introduction of next-generation sequencing (NGS) platforms in cancer investigation. NGS allows massive parallel sequencing that affords maximal tumor genomic assessment. NGS approaches are different, and concern DNA and RNA analysis. DNA sequencing includes whole-genome, whole-exome, and targeted sequencing, which focuses on a selection of genes of interest for a specific disease. RNA sequencing facilitates the detection of alternative gene-spliced transcripts, posttranscriptional modifications, gene fusion, mutations/single-nucleotide polymorphisms, small and long noncoding RNAs, and changes in gene expression. Most applications are in the cancer research field, but lately NGS technology has been revolutionizing cancer molecular diagnostics, due to the many advantages it offers compared to traditional methods. There is greater knowledge on solid cancer diagnostics, and recent interest has been shown also in the field of hematologic cancer. In this review, we report the latest data on NGS diagnostic/predictive clinical applications in solid and hematologic cancers. Moreover, since the amount of NGS data produced is very large and their interpretation is very complex, we briefly discuss two bioinformatic aspects, variant-calling accuracy and copy-number variation detection, which are gaining a lot of importance in cancer-diagnostic assessment. PMID:27980425
Neocortical malformation as consequence of nonadaptive regulation of neuronogenetic sequence

NASA Technical Reports Server (NTRS)

Caviness, V. S. Jr; Takahashi, T.; Nowakowski, R. S.

2000-01-01

Variations in the structure of the neocortex induced by single gene mutations may be extreme or subtle. They differ from variations in neocortical structure encountered across and within species in that these "normal" structural variations are adaptive (both structurally and behaviorally), whereas those associated with disorders of development are not. Here we propose that they also differ in principle in that they represent disruptions of molecular mechanisms that are not normally regulatory to variations in the histogenetic sequence. We propose an algorithm for the operation of the neuronogenetic sequence in relation to the overall neocortical histogenetic sequence and highlight the restriction point of the G1 phase of the cell cycle as the master regulatory control point for normal coordinate structural variation across species and importantly within species. From considerations based on the anatomic evidence from neocortical malformation in humans, we illustrate in principle how this overall sequence appears to be disrupted by molecular biological linkages operating principally outside the control mechanisms responsible for the normal structural variation of the neocortex. MRDD Research Reviews 6:22-33, 2000. Copyright 2000 Wiley-Liss, Inc.
CRHR1 genotypes, neural circuits and the diathesis for anxiety and depression.

PubMed

Rogers, J; Raveendran, M; Fawcett, G L; Fox, A S; Shelton, S E; Oler, J A; Cheverud, J; Muzny, D M; Gibbs, R A; Davidson, R J; Kalin, N H

2013-06-01

The corticotrophin-releasing hormone (CRH) system integrates the stress response and is associated with stress-related psychopathology. Previous reports have identified interactions between childhood trauma and sequence variation in the CRH receptor 1 gene (CRHR1) that increase risk for affective disorders. However, the underlying mechanisms that connect variation in CRHR1 to psychopathology are unknown. To explore potential mechanisms, we used a validated rhesus macaque model to investigate association between genetic variation in CRHR1, anxious temperament (AT) and brain metabolic activity. In young rhesus monkeys, AT is analogous to the childhood risk phenotype that predicts the development of human anxiety and depressive disorders. Regional brain metabolism was assessed with (18)F-labeled fluoro-2-deoxyglucose (FDG) positron emission tomography in 236 young, normally reared macaques that were also characterized for AT. We show that single nucleotide polymorphisms (SNPs) affecting exon 6 of CRHR1 influence both AT and metabolic activity in the anterior hippocampus and amygdala, components of the neural circuit underlying AT. We also find evidence for association between SNPs in CRHR1 and metabolism in the intraparietal sulcus and precuneus. These translational data suggest that genetic variation in CRHR1 affects the risk for affective disorders by influencing the function of the neural circuit underlying AT and that differences in gene expression or the protein sequence involving exon 6 may be important. These results suggest that variation in CRHR1 may influence brain function before any childhood adversity and may be a diathesis for the interaction between CRHR1 genotypes and childhood trauma reported to affect human psychopathology.
Effects of N/C Ratio on Solidification Behaviors of Novel Nb-Bearing Austenitic Heat-Resistant Cast Steels for Exhaust Components of Gasoline Engines

NASA Astrophysics Data System (ADS)

Zhang, Yinhui; Li, Mei; Godlewski, Larry A.; Zindel, Jacob W.; Feng, Qiang

2017-03-01

In order to comply with more stringent environmental and fuel consumption regulations, novel Nb-bearing austenitic heat-resistant cast steels that withstand exhaust temperatures as high as 1,323 K (1,050 °C) are urgently demanded by the automotive industry. In the current research, the solidification behavior of these alloys with variations of N/C ratio is investigated. Directional solidification experiments were carried out to examine the microstructural development in mushy zones. Computational thermodynamic calculations under partial equilibrium conditions were performed to predict the solidification sequence of different phases. Microstructural characterization of the mushy zones indicates that the N/C ratio significantly influenced the stability of γ-austenite and the precipitation temperature of NbC/Nb(C,N), thereby altering the solidification path, as well as the morphology and distribution of NbC/Nb(C,N) and γ-ferrite. The solidification sequence of different phases predicted by thermodynamic software agreed well with the experimental results, except for the specific precipitation temperatures. The generated data and fundamental understanding will be helpful for the application of computational thermodynamic methods to predict the as-cast microstructure of Nb-bearing austenitic heat-resistant steels.
Stresses and deformations in cross-ply composite tubes subjected to a uniform temperature change

NASA Technical Reports Server (NTRS)

Hyer, M. W.; Cooper, D. E.; Cohen, D.

1986-01-01

This study investigates the effects of a uniform temperature change on the stresses and deformations of composite tubes and determines the accuracy of an approximate solution based on the principle of complementary virtual work. Interest centers on tube response away from the ends and so a planar elasticity approach is used. For the approximate solution a piecewise linear variation of stresses with the radial coordinate is assumed. The results from the approximate solution are compared with the elasticity solution. The stress predictions agree well, particularly peak interlaminar stresses. Surprisingly, the axial deformations also agree well, despite the fact that the deformations predicted by the approximate solution do not satisfy the interface displacement continuity conditions required by the elasticity solution. The study shows that the axial thermal expansion coefficient of tubes with a specific number of axial and circumferential layers depends on the stacking sequence. This is in contrast to classical lamination theory, which predicts that the expansion will be independent of the stacking arrangement. As expected, the sign and magnitude of the peak interlaminar stresses depend on stacking sequence. For tubes with a specific number of axial and circumferential layers, thermally induced interlaminar stresses can be controlled by altering stacking arrangement.
Genomic Sequence Variation Markup Language (GSVML).

PubMed

Nakaya, Jun; Kimura, Michio; Hiroi, Kaei; Ido, Keisuke; Yang, Woosung; Tanaka, Hiroshi

2010-02-01

With the aim of making good use of internationally accumulated genomic sequence variation data, which is increasing rapidly due to the explosive amount of genomic research at present, the development of an interoperable data exchange format and its international standardization are necessary. Genomic Sequence Variation Markup Language (GSVML) will focus on genomic sequence variation data and human health applications, such as gene based medicine or pharmacogenomics. We developed GSVML through eight steps, based on case analysis and domain investigations. By focusing on the design scope to human health applications and genomic sequence variation, we attempted to eliminate ambiguity and to ensure practicability. We intended to satisfy the requirements derived from the use case analysis of human-based clinical genomic applications. Based on database investigations, we attempted to minimize the redundancy of the data format, while maximizing the data covering range. We also attempted to ensure communication and interface ability with other Markup Languages, for exchange of omics data among various omics researchers or facilities. The interface ability with developing clinical standards, such as the Health Level Seven Genotype Information model, was analyzed. We developed the human health-oriented GSVML comprising variation data, direct annotation, and indirect annotation categories; the variation data category is required, while the direct and indirect annotation categories are optional. The annotation categories contain omics and clinical information, and have internal relationships. For designing, we examined 6 cases for three criteria as human health application and 15 data elements for three criteria as data formats for genomic sequence variation data exchange. The data format of five international SNP databases and six Markup Languages and the interface ability to the Health Level Seven Genotype Model in terms of 317 items were investigated. 
GSVML was developed as a potential data exchanging format for genomic sequence variation data exchange focusing on human health applications. The international standardization of GSVML is necessary, and is currently underway. GSVML can be applied to enhance the utilization of genomic sequence variation data worldwide by providing a communicable platform between clinical and research applications. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
Intra-Seasonal Rainfall Variations and Linkage with Kharif Crop Production: An Attempt to Evaluate Predictability of Sub-Seasonal Rainfall Events

NASA Astrophysics Data System (ADS)

Singh, Ankita; Ghosh, Kripan; Mohanty, U. C.

2018-03-01

The sub-seasonal variation of Indian summer monsoon rainfall affects Kharif crop production more strongly than the seasonal total rainfall does. The frequency and intensity of various rainfall events are found to be highly related to crop production, and the predictability of such events is therefore examined. Daily rainfall predictions are available from a coupled dynamical model, the National Centers for Environmental Prediction Climate Forecast System (NCEPCFS). Large errors in the simulated daily rainfall sequence call for bias correction, for which two approaches are used. The bias-corrected GCM is able to capture the inter-annual variability in rainfall events. The maximum prediction skill for the frequency of less rainfall (LR) events is observed during September, and a similar result is noticed for moderate rainfall events, with maximum skill over the central parts of the country. In addition, the impact of weekly rainfall intensity on Kharif rice production is evaluated. Weekly rainfall intensity during July is found to have a significant impact on Kharif rice production, but the corresponding skill in the GCM is very low. The GCM is able to simulate the frequency of less and moderate rainfall events with significant skill.
High-throughput sequencing of mGluR signaling pathway genes reveals enrichment of rare variants in autism.

PubMed

Kelleher, Raymond J; Geigenmüller, Ute; Hovhannisyan, Hayk; Trautman, Edwin; Pinard, Robert; Rathmell, Barbara; Carpenter, Randall; Margulies, David

2012-01-01

Identification of common molecular pathways affected by genetic variation in autism is important for understanding disease pathogenesis and devising effective therapies. Here, we test the hypothesis that rare genetic variation in the metabotropic glutamate-receptor (mGluR) signaling pathway contributes to autism susceptibility. Single-nucleotide variants in genes encoding components of the mGluR signaling pathway were identified by high-throughput multiplex sequencing of pooled samples from 290 non-syndromic autism cases and 300 ethnically matched controls on two independent next-generation platforms. This analysis revealed significant enrichment of rare functional variants in the mGluR pathway in autism cases. Higher burdens of rare, potentially deleterious variants were identified in autism cases for three pathway genes previously implicated in syndromic autism spectrum disorder, TSC1, TSC2, and SHANK3, suggesting that genetic variation in these genes also contributes to risk for non-syndromic autism. In addition, our analysis identified HOMER1, which encodes a postsynaptic density-localized scaffolding protein that interacts with Shank3 to regulate mGluR activity, as a novel autism-risk gene. Rare, potentially deleterious HOMER1 variants identified uniquely in the autism population affected functionally important protein regions or regulatory sequences and co-segregated closely with autism among children of affected families. We also identified rare ASD-associated coding variants predicted to have damaging effects on components of the Ras/MAPK cascade. Collectively, these findings suggest that altered signaling downstream of mGluRs contributes to the pathogenesis of non-syndromic autism.
High-Throughput Sequencing of mGluR Signaling Pathway Genes Reveals Enrichment of Rare Variants in Autism

PubMed Central

Hovhannisyan, Hayk; Trautman, Edwin; Pinard, Robert; Rathmell, Barbara; Carpenter, Randall; Margulies, David

2012-01-01

Identification of common molecular pathways affected by genetic variation in autism is important for understanding disease pathogenesis and devising effective therapies. Here, we test the hypothesis that rare genetic variation in the metabotropic glutamate-receptor (mGluR) signaling pathway contributes to autism susceptibility. Single-nucleotide variants in genes encoding components of the mGluR signaling pathway were identified by high-throughput multiplex sequencing of pooled samples from 290 non-syndromic autism cases and 300 ethnically matched controls on two independent next-generation platforms. This analysis revealed significant enrichment of rare functional variants in the mGluR pathway in autism cases. Higher burdens of rare, potentially deleterious variants were identified in autism cases for three pathway genes previously implicated in syndromic autism spectrum disorder, TSC1, TSC2, and SHANK3, suggesting that genetic variation in these genes also contributes to risk for non-syndromic autism. In addition, our analysis identified HOMER1, which encodes a postsynaptic density-localized scaffolding protein that interacts with Shank3 to regulate mGluR activity, as a novel autism-risk gene. Rare, potentially deleterious HOMER1 variants identified uniquely in the autism population affected functionally important protein regions or regulatory sequences and co-segregated closely with autism among children of affected families. We also identified rare ASD-associated coding variants predicted to have damaging effects on components of the Ras/MAPK cascade. Collectively, these findings suggest that altered signaling downstream of mGluRs contributes to the pathogenesis of non-syndromic autism. PMID:22558107
Complete genomic sequence of a Tobacco rattle virus isolate from Michigan-grown potatoes.

PubMed

Crosslin, James M; Hamm, Philip B; Kirk, William W; Hammond, Rosemarie W

2010-04-01

Tobacco rattle virus (TRV) causes stem mottle on potato leaves and necrotic arcs and rings in potato tubers, known as corky ringspot disease. Recently, TRV was reported in Michigan potato tubers cv. FL1879 exhibiting corky ringspot disease. Sequence analysis of the RNA-1-encoded 16-kDa gene of the Michigan isolate, designated MI-1, revealed homology to TRV isolates from Florida and Washington. Here, we report the complete genomic sequence of RNA-1 (6,791 nt) and RNA-2 (3,685 nt) of TRV MI-1. RNA-1 is predicted to contain four open reading frames, and the genome structure and phylogenetic analyses of the RNA-1 nucleotide sequence revealed significant homologies to the known sequences of other TRV-1 isolates. The relationships based on the full-length nucleotide sequence were different from those based on the 16-kDa gene encoded on genomic RNA-1 and reflect sequence variation within a 20-25-aa residue region of the 16-kDa protein. MI-1 RNA-2 is predicted to contain three ORFs, encoding the coat protein (CP), a 37.6-kDa protein (ORF 2b), and a 33.6-kDa protein (ORF 2c). In addition, it contains a region of similarity to the 3' terminus of RNA-1, including a truncated portion of the 16-kDa cistron. Phylogenetic analysis of RNA-2, based on a comparison of nucleotide sequences with other members of the genus Tobravirus, indicates that TRV MI-1 and other North American isolates cluster as a distinct group. TRV MI-1 is only the second North American isolate for which there is a complete sequence of the genome, and it is distinct from the North American isolate TRV ORY. The relationship of the TRV MI-1 isolate to other tobravirus isolates is discussed.
Oceanographic variation influences spatial genomic structure in the sea scallop, Placopecten magellanicus.

PubMed

Van Wyngaarden, Mallory; Snelgrove, Paul V R; DiBacco, Claudio; Hamilton, Lorraine C; Rodríguez-Ezpeleta, Naiara; Zhan, Luyao; Beiko, Robert G; Bradbury, Ian R

2018-03-01

Environmental factors can influence diversity and population structure in marine species and accurate understanding of this influence can both improve fisheries management and help predict responses to environmental change. We used 7163 SNPs derived from restriction site-associated DNA sequencing genotyped in 245 individuals of the economically important sea scallop, Placopecten magellanicus, to evaluate the correlations between oceanographic variation and a previously identified latitudinal genomic cline. Sea scallops span a broad latitudinal area (>10 degrees), and we hypothesized that climatic variation significantly drives clinal trends in allele frequency. Using a large environmental dataset, including temperature, salinity, chlorophyll a, and nutrient concentrations, we identified a suite of SNPs (285-621, depending on analysis and environmental dataset) potentially under selection through correlations with environmental variation. Principal components analysis of different outlier SNPs and environmental datasets revealed similar northern and southern clusters, with significant associations between the first axes of each (adjusted R² = .66-.79). Multivariate redundancy analysis of outlier SNPs and the environmental principal components indicated that environmental factors explained more than 32% of the variance. Similarly, multiple linear regressions and random-forest analysis identified winter average and minimum ocean temperatures as significant parameters in the link between genetic and environmental variation. This work indicates that oceanographic variation is associated with the observed genomic cline in this species and that seasonal periods of extreme cold may restrict gene flow along a latitudinal gradient in this marine benthic bivalve. Incorporating this finding into management may improve accuracy of management strategies and future predictions.

Genomic Features That Predict Allelic Imbalance in Humans Suggest Patterns of Constraint on Gene Expression Variation

PubMed Central

Fédrigo, Olivier; Haygood, Ralph; Mukherjee, Sayan; Wray, Gregory A.

2009-01-01

Variation in gene expression is an important contributor to phenotypic diversity within and between species. Although this variation often has a genetic component, identification of the genetic variants driving this relationship remains challenging. In particular, measurements of gene expression usually do not reveal whether the genetic basis for any observed variation lies in cis or in trans to the gene, a distinction that has direct relevance to the physical location of the underlying genetic variant, and which may also impact its evolutionary trajectory. Allelic imbalance measurements identify cis-acting genetic effects by assaying the relative contribution of the two alleles of a cis-regulatory region to gene expression within individuals. Identification of patterns that predict commonly imbalanced genes could therefore serve as a useful tool and also shed light on the evolution of cis-regulatory variation itself. Here, we show that sequence motifs, polymorphism levels, and divergence levels around a gene can be used to predict commonly imbalanced genes in a human data set. Reduction of this feature set to four factors revealed that only one factor significantly differentiated between commonly imbalanced and nonimbalanced genes. We demonstrate that these results are consistent between the original data set and a second published data set in humans obtained using different technical and statistical methods. Finally, we show that variation in the single allelic imbalance-associated factor is partially explained by the density of genes in the region of a target gene (allelic imbalance is less probable for genes in gene-dense regions), and, to a lesser extent, the evenness of expression of the gene across tissues and the magnitude of negative selection on putative regulatory regions of the gene. 
These results suggest that the genomic distribution of functional cis-regulatory variants in the human genome is nonrandom, perhaps due to local differences in evolutionary constraint. PMID:19506001
Cenozoic global sea level, sequences, and the New Jersey transect: Results from coastal plain and continental slope drilling

USGS Publications Warehouse

Miller, K.G.; Mountain, Gregory S.; Browning, J.V.; Kominz, M.; Sugarman, P.J.; Christie-Blick, N.; Katz, M.E.; Wright, J.D.

1998-01-01

The New Jersey Sea Level Transect was designed to evaluate the relationships among global sea level (eustatic) change, unconformity-bounded sequences, and variations in subsidence, sediment supply, and climate on a passive continental margin. By sampling and dating Cenozoic strata from coastal plain and continental slope locations, we show that sequence boundaries correlate (within ±0.5 myr) regionally (onshore-offshore) and interregionally (New Jersey-Alabama-Bahamas), implicating a global cause. Sequence boundaries correlate with δ18O increases for at least the past 42 myr, consistent with an ice volume (glacioeustatic) control, although a causal relationship is not required because of uncertainties in ages and correlations. Evidence for a causal connection is provided by preliminary Miocene data from slope Site 904 that directly link δ18O increases with sequence boundaries. We conclude that variation in the size of ice sheets has been a primary control on the formation of sequence boundaries since ~42 Ma. We speculate that prior to this, the growth and decay of small ice sheets caused small-amplitude sea level changes (<20 m) in this supposedly ice-free world because Eocene sequence boundaries also appear to correlate with minor δ18O increases. Subsidence estimates (backstripping) indicate amplitudes of short-term (million-year scale) lowerings that are consistent with estimates derived from δ18O studies (25-50 m in the Oligocene-middle Miocene and 10-20 m in the Eocene) and a long-term lowering of 150-200 m over the past 65 myr, consistent with estimates derived from volume changes on mid-ocean ridges. Although our results are consistent with the general number and timing of Paleocene to middle Miocene sequences published by workers at Exxon Production Research Company, our estimates of sea level amplitudes are substantially lower than theirs.
Lithofacies patterns within sequences follow repetitive, predictable patterns: (1) coastal plain sequences consist of basal transgressive sands overlain by regressive highstand silts and quartz sands; and (2) although slope lithofacies variations are subdued, reworked sediments constitute lowstand deposits, causing the strongest, most extensive seismic reflections. Despite a primary eustatic control on sequence boundaries, New Jersey sequences were also influenced by changes in tectonics, sediment supply, and climate. During the early to middle Eocene, low siliciclastic and high pelagic input associated with warm climates resulted in widespread carbonate deposition and thin sequences. Late middle Eocene and earliest Oligocene cooling events curtailed carbonate deposition in the coastal plain and slope, respectively, resulting in a switch to siliciclastic sedimentation. In onshore areas, Oligocene sequences are thin owing to low siliciclastic and pelagic input, and their distribution is patchy, reflecting migration or progradation of depocenters; in contrast, Miocene onshore sequences are thicker, reflecting increased sediment supply, and they are more complete downdip owing to simple tectonics. We conclude that the New Jersey margin provides a natural laboratory for unraveling complex interactions of eustasy, tectonics, changes in sediment supply, and climate change.
Association between SCO2 mutation and extreme myopia in Japanese patients.

PubMed

Wakazono, Tomotaka; Miyake, Masahiro; Yamashiro, Kenji; Yoshikawa, Munemitsu; Yoshimura, Nagahisa

2016-07-01

To investigate the role of SCO2 in extreme myopia of Japanese patients. In total, 101 Japanese patients with extreme myopia (axial length of ≥30 mm) OU at the Kyoto University Hospital were included in this study. Exon 2 of SCO2 was sequenced by conventional Sanger sequencing. The detected variants were assessed using in silico prediction programs: SIFT, PolyPhen-2 and MutationTaster. To determine the frequency of the mutations in normal subjects, we referred to the 1000 Genomes Project data and the Human Genetic Variation Database (HGVD) in the Human Genetic Variation Browser. The average age of the participants was 62.9 ± 12.7 years. There were 31 males (30.7 %) and 70 females. Axial lengths were 31.76 ± 1.17 mm OD and 31.40 ± 1.07 mm OS, and 176 eyes (87.6 %) out of 201 eyes had myopic maculopathy of grade 2 or more. Among the 101 extremely myopic patients, one mutation (c.290 C > T;p.Ala97Val) in SCO2 was detected. This mutation was not found in the 1000 Genomes Project data or HGVD data. Variant type of the mutation was nonsynonymous. Although the SIFT prediction score was 0.350, the PolyPhen-2 probability was 0.846, thus predicting its pathogenicity to be possibly damaging. MutationTaster PhyloP was 1.268, suggesting that the mutation is conserved. We identified one novel possibility of an extreme myopia-causing mutation in SCO2. No other disease-causing mutation was found in 101 extremely myopic Japanese patients, suggesting that SCO2 plays a limited role in Japanese extreme myopia. Further investigation is required for better understanding of extreme myopia.
Alt a 1 allergen homologs from Alternaria and related taxa: analysis of phylogenetic content and secondary structure.

PubMed

Hong, Soon Gyu; Cramer, Robert A; Lawrence, Christopher B; Pryor, Barry M

2005-02-01

A gene for the Alternaria major allergen, Alt a 1, was amplified from 52 species of Alternaria and related genera, and sequence information was used for phylogenetic study. Alt a 1 gene sequences evolved 3.8 times faster and contained 3.5 times more parsimony-informative sites than glyceraldehyde-3-phosphate dehydrogenase (gpd) sequences. Analyses of Alt a 1 gene and gpd exon sequences strongly supported grouping of Alternaria spp. and related taxa into several species-groups described in previous studies, especially the infectoria, alternata, porri, brassicicola, and radicina species-groups and the Embellisia group. The sonchi species-group was newly suggested in this study. Monophyly of the Nimbya group was moderately supported, and monophyly of the Ulocladium group was weakly supported. Relationships among species-groups and among closely related species of the same species-group were not fully resolved. However, higher resolution could be obtained using Alt a 1 sequences or a combined dataset than using gpd sequences alone. Despite high levels of variation in amino acid sequences, results of in silico prediction of protein secondary structure for Alt a 1 demonstrated a high degree of structural similarity for most of the species suggesting a conservation of function.
Satellite remote sensing data can be used to model marine microbial metabolite turnover

PubMed Central

Larsen, Peter E; Scott, Nicole; Post, Anton F; Field, Dawn; Knight, Rob; Hamada, Yuki; Gilbert, Jack A

2015-01-01

Sampling ecosystems, even at a local scale, at the temporal and spatial resolution necessary to capture natural variability in microbial communities is prohibitively expensive. We extrapolated marine surface microbial community structure and metabolic potential from 72 16S rRNA amplicon and 8 metagenomic observations using remotely sensed environmental parameters to create a system-scale model of marine microbial metabolism for 5904 grid cells (49 km²) in the Western English Channel, across 3 years of weekly averages. Thirteen environmental variables predicted the relative abundance of 24 bacterial Orders and 1715 unique enzyme-encoding genes that encode turnover of 2893 metabolites. The genes' predicted relative abundance was highly correlated (Pearson Correlation 0.72, P-value <10⁻⁶) with their observed relative abundance in sequenced metagenomes. Predictions of the relative turnover (synthesis or consumption) of CO2 were significantly correlated with observed surface CO2 fugacity. The spatial and temporal variation in the predicted relative abundances of genes coding for cyanase, carbon monoxide and malate dehydrogenase were investigated along with the predicted inter-annual variation in relative consumption or production of ∼3000 metabolites forming six significant temporal clusters. These spatiotemporal distributions could possibly be explained by the co-occurrence of anaerobic and aerobic metabolisms associated with localized plankton blooms or sediment resuspension, which facilitate the presence of anaerobic micro-niches. This predictive model provides a general framework for focusing future sampling and experimental design to relate biogeochemical turnover to microbial ecology. PMID:25072414
Satellite remote sensing data can be used to model marine microbial metabolite turnover

DOE Office of Scientific and Technical Information (OSTI.GOV)

Larsen, Peter E.; Scott, Nicole; Post, Anton F.

Sampling ecosystems, even at a local scale, at the temporal and spatial resolution necessary to capture natural variability in microbial communities is prohibitively expensive. We extrapolated marine surface microbial community structure and metabolic potential from 72 16S rRNA amplicon and 8 metagenomic observations using remotely sensed environmental parameters to create a system-scale model of marine microbial metabolism for 5904 grid cells (49 km²) in the Western English Channel, across 3 years of weekly averages. Thirteen environmental variables predicted the relative abundance of 24 bacterial Orders and 1715 unique enzyme-encoding genes that encode turnover of 2893 metabolites. The genes' predicted relative abundance was highly correlated (Pearson Correlation 0.72, P-value <10⁻⁶) with their observed relative abundance in sequenced metagenomes. Predictions of the relative turnover (synthesis or consumption) of CO2 were significantly correlated with observed surface CO2 fugacity. The spatial and temporal variation in the predicted relative abundances of genes coding for cyanase, carbon monoxide and malate dehydrogenase were investigated along with the predicted inter-annual variation in relative consumption or production of ~3000 metabolites forming six significant temporal clusters. These spatiotemporal distributions could possibly be explained by the co-occurrence of anaerobic and aerobic metabolisms associated with localized plankton blooms or sediment resuspension, which facilitate the presence of anaerobic micro-niches. This predictive model provides a general framework for focusing future sampling and experimental design to relate biogeochemical turnover to microbial ecology.
Prospects of Genomic Prediction in the USDA Soybean Germplasm Collection: Historical Data Creates Robust Models for Enhancing Selection of Accessions.

PubMed

Jarquin, Diego; Specht, James; Lorenz, Aaron

2016-08-09

The identification and mobilization of useful genetic variation from germplasm banks for use in breeding programs is critical for future genetic gain and protection against crop pests. Plummeting costs of next-generation sequencing and genotyping are revolutionizing the way in which researchers and breeders interface with plant germplasm collections. An example of this is the high density genotyping of the entire USDA Soybean Germplasm Collection. We assessed the usefulness of 50K single nucleotide polymorphism data collected on 18,480 domesticated soybean (Glycine max) accessions and vast historical phenotypic data for developing genomic prediction models for protein, oil, and yield. Resulting genomic prediction models explained an appreciable amount of the variation in accession performance in independent validation trials, with correlations between predicted and observed reaching up to 0.92 for oil and protein and 0.79 for yield. The optimization of training set design was explored using a series of cross-validation schemes. First, it was found that the target population and environment need to be well represented in the training set. Second, genomic prediction training sets appear to be robust to the presence of data from diverse geographical locations and genetic clusters. This finding, however, depends on the influence of shattering and lodging, and may be specific to soybean with its presence of maturity groups. The distribution of 7608 nonphenotyped accessions was examined through the application of genomic prediction models. The distribution of predictions of phenotyped accessions was representative of the distribution of predictions for nonphenotyped accessions, with no nonphenotyped accessions being predicted to fall far outside the range of predictions of phenotyped accessions. Copyright © 2016 Jarquin et al.
Unraveling Selection in the Mitochondrial Genome of Drosophila

PubMed Central

Ballard, JWO.; Kreitman, M.

1994-01-01

We examine mitochondrial DNA variation at the cytochrome b locus within and between three species of Drosophila to determine whether patterns of variation conform to the predictions of neutral molecular evolution. The entire 1137-bp cytochrome b locus was sequenced in 16 lines of Drosophila melanogaster, 18 lines of Drosophila simulans and 13 lines of Drosophila yakuba. Patterns of variation depart from neutrality by several test criteria. Analysis of the evolutionary clock hypothesis shows unequal rates of change along D. simulans lineages. A comparison within and between species of the ratio of amino acid replacement change to synonymous change reveals a relative excess of amino acid replacement polymorphism compared to the neutral prediction, suggestive of slightly deleterious or diversifying selection. There is evidence for excess homozygosity in our worldwide sample of D. melanogaster and D. simulans alleles, as well as a reduction in the number of segregating sites in D. simulans, indicative of selective sweeps. Furthermore, a test of neutrality for codon usage shows the direction of mutations at third positions differs among different topological regions of the gene tree. The analyses indicate that molecular variation and evolution of mtDNA are governed by many of the same selective forces that have been shown to govern nuclear genome evolution and suggest caution be taken in the use of mtDNA as a "neutral" molecular marker. PMID:7851772
Learning predictive models that use pattern discovery--a bootstrap evaluative approach applied in organ functioning sequences.

PubMed

Toma, Tudor; Bosman, Robert-Jan; Siebes, Arno; Peek, Niels; Abu-Hanna, Ameen

2010-08-01

An important problem in intensive care is how to predict, on a given day of stay, the eventual hospital mortality for a specific patient. A recent approach to solve this problem suggested the use of frequent temporal sequences (FTSs) as predictors. Methods following this approach were evaluated in the past by inducing a model from a training set and validating the prognostic performance on an independent test set. Although this evaluative approach addresses the validity of the specific models induced in an experiment, it falls short of evaluating the inductive method itself. To achieve this, one must account for the inherent sources of variation in the experimental design. The main aim of this work is to demonstrate a procedure based on bootstrapping, specifically the .632 bootstrap procedure, for evaluating inductive methods that discover patterns, such as FTSs. A second aim is to apply this approach to find out whether a recently suggested inductive method that discovers FTSs of organ functioning status is superior to a traditional method that does not use temporal sequences when compared on each successive day of stay at the Intensive Care Unit. The use of bootstrapping with logistic regression using pre-specified covariates is known in the statistical literature. Using inductive methods of prognostic models based on temporal sequence discovery within the bootstrap procedure is, however, novel, at least in predictive models in intensive care. Our results of applying the bootstrap-based evaluative procedure demonstrate the superiority of the FTS-based inductive method over the traditional method in terms of discrimination as well as accuracy. In addition, we illustrate the insights gained by the analyst into the discovered FTSs from the bootstrap samples. Copyright 2010 Elsevier Inc. All rights reserved.
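The .632 bootstrap mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the toy threshold classifier, the `(feature, outcome)` data format, and all function names are assumptions made for illustration, and the out-of-bag error is averaged per replicate, a simplification of the usual per-observation average. What the sketch does show is the key idea of refitting the whole inductive method inside every bootstrap replicate, then combining the apparent error and the out-of-bag error with weights 0.368 and 0.632.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature, binary outcome); hypothetical toy format


def fit_threshold(train: Sequence[Sample]) -> float:
    """Toy stand-in for an inductive method: midpoint between the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        # Degenerate resample with one class only: fall back to the overall mean.
        return sum(x for x, _ in train) / len(train)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def error_rate(threshold: float, data: Sequence[Sample]) -> float:
    """Fraction of samples misclassified by the rule 'predict 1 if x > threshold'."""
    wrong = sum(1 for x, y in data if (1 if x > threshold else 0) != y)
    return wrong / len(data)


def bootstrap_632(
    data: Sequence[Sample],
    fit: Callable[[Sequence[Sample]], float],
    n_boot: int = 200,
    seed: int = 0,
) -> float:
    """Return 0.368 * apparent error + 0.632 * mean out-of-bag error."""
    rng = random.Random(seed)
    n = len(data)
    apparent = error_rate(fit(data), data)  # error on the data used for fitting
    oob_errors: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        chosen = set(idx)
        oob = [data[i] for i in range(n) if i not in chosen]
        if not oob:
            continue  # no held-out samples in this replicate
        model = fit([data[i] for i in idx])  # re-run the method on the resample
        oob_errors.append(error_rate(model, oob))
    oob_mean = sum(oob_errors) / len(oob_errors)
    return 0.368 * apparent + 0.632 * oob_mean
```

Because the fitting function is passed in as an argument, the same loop could wrap a pattern-discovery step, so the estimate would reflect variation in which patterns are discovered and not only in the fitted coefficients.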
PySeqLab: an open source Python package for sequence labeling and segmentation.

PubMed

Allam, Ahmed; Krauthammer, Michael

2017-11-01

Text and genomic data are composed of sequential tokens, such as words and nucleotides, that give rise to higher-order syntactic constructs. In this work, we aim at providing a comprehensive Python library implementing conditional random fields (CRFs), a class of probabilistic graphical models, for robust prediction of these constructs from sequential data. Python Sequence Labeling (PySeqLab) is an open source package for performing supervised learning in structured prediction tasks. It implements CRF models, that is, discriminative models ranging from (i) first-order to higher-order linear-chain CRFs, and from (ii) first-order to higher-order semi-Markov CRFs (semi-CRFs). Moreover, it provides multiple learning algorithms for estimating model parameters, such as (i) stochastic gradient descent (SGD) and its multiple variations, (ii) structured perceptron with multiple averaging schemes supporting exact and inexact search using the 'violation-fixing' framework, (iii) a search-based probabilistic online learning algorithm (SAPO) and (iv) an interface for Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the limited-memory BFGS algorithms. Viterbi and Viterbi A* are used for inference and decoding of sequences. Using PySeqLab, we built models (classifiers) and evaluated their performance in three different domains: (i) biomedical natural language processing (NLP), (ii) predictive DNA sequence analysis and (iii) human activity recognition (HAR). State-of-the-art performance comparable to machine-learning based systems was achieved in the three domains without feature engineering or the use of knowledge sources. PySeqLab is available through https://bitbucket.org/A_2/pyseqlab with tutorials and documentation. Contact: ahmed.allam@yale.edu or michael.krauthammer@yale.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved.
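As an illustration of the Viterbi decoding step named in this abstract, the following is a minimal sketch for a first-order linear chain with additive (log-space) scores. It does not use or reproduce the PySeqLab API. The function name, the dictionary-based score tables, and the toy labels are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring label sequence under additive log-scores."""
    if not obs:
        return []
    # best[t][s]: score of the best path ending in state s at position t
    best = [{s: start[s] + emit[s][obs[0]] for s in states}]
    back: List[Dict[str, str]] = []  # back[t-1][s]: best predecessor of s at t
    for o in obs[1:]:
        prev = best[-1]
        col: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for s in states:
            # Pick the predecessor that maximizes path score plus transition.
            p = max(states, key=lambda q: prev[q] + trans[q][s])
            col[s] = prev[p] + trans[p][s] + emit[s][o]
            ptr[s] = p
        best.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

In a CRF the emission and transition scores would come from learned feature weights; here they are plain lookup tables so that the dynamic-programming recursion stays visible.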
Molecular Population Genetics of the Alcohol Dehydrogenase Gene Region of Drosophila melanogaster

PubMed Central

Aquadro, Charles F.; Desse, Susan F.; Bland, Molly M.; Langley, Charles H.; Laurie-Ahlberg, Cathy C.

1986-01-01

Variation in the DNA restriction map of a 13-kb region of chromosome II including the alcohol dehydrogenase structural gene (Adh) was examined in Drosophila melanogaster from natural populations. Detailed analysis of 48 D. melanogaster lines representing four eastern United States populations revealed extensive DNA sequence variation due to base substitutions, insertions and deletions. Cloning of this region from several lines allowed characterization of length variation as due to unique sequence insertions or deletions [nine sizes; 21–200 base pairs (bp)] or transposable element insertions (several sizes, 340 bp to 10.2 kb, representing four different elements). Despite this extensive variation in sequences flanking the Adh gene, only one length polymorphism is clearly associated with altered Adh expression (a copia element approximately 250 bp 5' to the distal transcript start site). Nonetheless, the frequency spectra of transposable elements within and between Drosophila species suggest they are slightly deleterious. Strong nonrandom associations are observed among Adh region sequence variants, ADH allozyme (Fast vs. Slow), ADH enzyme activity and the chromosome inversion In(2L)t. Phylogenetic analysis of restriction map haplotypes suggests that the major twofold component of ADH activity variation (high vs. low, typical of Fast and Slow allozymes, respectively) is due to sequence variation tightly linked to and possibly distinct from that underlying the allozyme difference. The patterns of nucleotide and haplotype variation for Fast and Slow allozyme lines are consistent with the recent increase in frequency and spread of the Fast haplotype associated with high ADH activity. These data emphasize the important role of evolutionary history and strong nonrandom associations among tightly linked sequence variation as determinants of the patterns of variation observed in natural populations. PMID:3026893
Variation, Repetition, And Choice

PubMed Central

Abreu-Rodrigues, Josele; Lattal, Kennon A; dos Santos, Cristiano V; Matos, Ricardo A

2005-01-01

Experiment 1 investigated the controlling properties of variability contingencies on choice between repeated and variable responding. Pigeons were exposed to concurrent-chains schedules with two alternatives. In the REPEAT alternative, reinforcers in the terminal link depended on a single sequence of four responses. In the VARY alternative, a response sequence in the terminal link was reinforced only if it differed from the n previous sequences (lag criterion). The REPEAT contingency generated low, constant levels of sequence variation whereas the VARY contingency produced levels of sequence variation that increased with the lag criterion. Preference for the REPEAT alternative tended to increase directly with the degree of variation required for reinforcement. Experiment 2 examined the potential confounding effects in Experiment 1 of immediacy of reinforcement by yoking the interreinforcer intervals in the REPEAT alternative to those in the VARY alternative. Again, preference for REPEAT was a function of the lag criterion. Choice between varying and repeating behavior is discussed with respect to obtained behavioral variability, probability of reinforcement, delay of reinforcement, and switching within a sequence. PMID:15828592
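The lag criterion described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `make_lag_checker`, the encoding of a four-response sequence as a string of L/R key pecks, and the choice to store every emitted sequence in the history (reinforced or not) are assumptions made only for illustration.

```python
from collections import deque


def make_lag_checker(lag):
    """Return a checker for a lag-n variability criterion (hypothetical helper).

    A sequence counts as reinforceable only if it differs from each of the
    `lag` most recently emitted sequences.
    """
    # Assumption: every emitted sequence enters the history, reinforced or not.
    history = deque(maxlen=lag)

    def check(sequence):
        seq = tuple(sequence)
        meets_criterion = seq not in history
        history.append(seq)
        return meets_criterion

    return check


# Lag 2: the third sequence repeats one of the two preceding sequences.
check = make_lag_checker(2)
results = [check(s) for s in ["LLRR", "LRLR", "LLRR", "RRLL", "LRLR"]]
print(results)
```

Raising `lag` makes repeats of recent sequences ineligible for longer, which is the sense in which the VARY contingency demands more variation as the lag criterion increases.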
Lineage-specific evolutionary rate in plants: Contributions of a screening for Cereus (Cactaceae)

PubMed Central

Romeiro-Brito, Monique; Moraes, Evandro M.; Taylor, Nigel P.; Zappi, Daniela C.; Franco, Fernando F.

2016-01-01

Premise of the study: Predictable chloroplast DNA (cpDNA) sequences have been listed for the shallowest taxonomic studies in plants. We investigated whether plastid regions that vary between closely allied species could be applied for intraspecific studies and compared the variation of these plastid segments with two nuclear regions. Methods: We screened 16 plastid and two nuclear intronic regions for species of the genus Cereus (Cactaceae) at three hierarchical levels (species from different clades, species of the same clade, and allopatric populations). Results: Ten plastid regions presented interspecific variation, and six of them showed variation at the intraspecific level. The two nuclear regions showed both inter- and intraspecific variation, and in general they showed higher levels of variability in almost all hierarchical levels than the plastid segments. Discussion: Our data suggest no correspondence between variation of plastid regions at the interspecific and intraspecific level, probably due to lineage-specific variation in cpDNA, which appears to have less effect in nuclear data. Despite the heterogeneity in evolutionary rates of cpDNA, we highlight three plastid segments that may be considered in initial screenings in plant phylogeographic studies. PMID:26819857
Three-dimensional spatial analysis of missense variants in RTEL1 identifies pathogenic variants in patients with Familial Interstitial Pneumonia.

PubMed

Sivley, R Michael; Sheehan, Jonathan H; Kropski, Jonathan A; Cogan, Joy; Blackwell, Timothy S; Phillips, John A; Bush, William S; Meiler, Jens; Capra, John A

2018-01-23

Next-generation sequencing of individuals with genetic diseases often detects candidate rare variants in numerous genes, but determining which are causal remains challenging. We hypothesized that the spatial distribution of missense variants in protein structures contains information about function and pathogenicity that can help prioritize variants of unknown significance (VUS) and elucidate the structural mechanisms leading to disease. To illustrate this approach in a clinical application, we analyzed 13 candidate missense variants in regulator of telomere elongation helicase 1 (RTEL1) identified in patients with Familial Interstitial Pneumonia (FIP). We curated pathogenic and neutral RTEL1 variants from the literature and public databases. We then used homology modeling to construct a 3D structural model of RTEL1 and mapped known variants into this structure. We next developed a pathogenicity prediction algorithm based on proximity to known disease-causing and neutral variants and evaluated its performance with leave-one-out cross-validation. We further validated our predictions with segregation analyses, telomere lengths, and mutagenesis data from the homologous XPD protein. Our algorithm for classifying RTEL1 VUS based on spatial proximity to pathogenic and neutral variation accurately distinguished 7 known pathogenic from 29 neutral variants (ROC AUC = 0.85) in the N-terminal domains of RTEL1. Pathogenic proximity scores were also significantly correlated with effects on ATPase activity (Pearson r = -0.65, p = 0.0004) in XPD, a related helicase. Applying the algorithm to 13 VUS identified from sequencing of RTEL1 from patients predicted five out of six disease-segregating VUS to be pathogenic. We provide structural hypotheses regarding how these mutations may disrupt RTEL1 ATPase and helicase function. Spatial analysis of missense variation accurately classified candidate VUS in RTEL1 and suggests how such variants cause disease. Incorporating spatial proximity analyses into other pathogenicity prediction tools may improve accuracy for other genes and genetic diseases.
A Sequence Mining Method to Predict the Bidding Strategy of Trading Agents

NASA Astrophysics Data System (ADS)

Nikolaidou, Vivia; Mitkas, Pericles A.

In this work, we describe the process used in order to predict the bidding strategy of trading agents. This was done in the context of the Reverse TAC, or CAT, game of the Trading Agent Competition. In this game, a set of trading agents, buyers or sellers, are provided by the server and they trade their goods in one of the markets operated by the competing agents. Better knowledge of the strategy of the trading agents will allow a market maker to adapt its incentives and attract more agents to its own market. Our prediction was based on the time series of the traders’ past bids, taking into account the variation of each bid compared to its history. The results proved to be of satisfactory accuracy, both in the game’s context and when compared to other existing approaches.
PSSRdb: a relational database of polymorphic simple sequence repeats extracted from prokaryotic genomes.

PubMed

Kumar, Pankaj; Chaitanya, Pasumarthy S; Nagarajaram, Hampapathalu A

2011-01-01

PSSRdb (Polymorphic Simple Sequence Repeats database) (http://www.cdfd.org.in/PSSRdb/) is a relational database of polymorphic simple sequence repeats (PSSRs) extracted from 85 different species of prokaryotes. Simple sequence repeats (SSRs) are tandem repeats of nucleotide motifs 1-6 bp in size and are highly polymorphic. SSR mutations in and around coding regions affect transcription and translation of genes. Such changes underpin the phase variation and antigenic variation seen in some bacteria. Although SSR-mediated phase variation and antigenic variation have been well studied in some bacteria, many other prokaryotic species have yet to be investigated for SSR-mediated adaptive and other evolutionary advantages. As part of our ongoing studies on SSR polymorphism in prokaryotes, we compared the genome sequences of various strains and isolates available for 85 different species of prokaryotes, extracted a number of SSRs showing length variations, and created a relational database called PSSRdb. This database gives useful information such as the location of PSSRs in genomes, length variation across genomes, the regions harboring PSSRs, etc. The information provided in this database is very useful for further research and analysis of SSRs in prokaryotes.
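The definition of an SSR given above (a tandem repeat of a 1-6 bp motif) can be sketched with a short regular-expression scan. This is only an illustration under stated assumptions, not the extraction method used to build PSSRdb: the function name `find_ssrs`, the `min_repeats` threshold, and the toy sequence are all hypothetical.

```python
import re


def find_ssrs(sequence, min_repeats=3):
    """Find perfect tandem repeats of 1-6 bp motifs (illustrative sketch).

    Returns (start, motif, copies) tuples with a 0-based start position.
    `min_repeats` must be at least 2; it is an assumed threshold, not one
    taken from the PSSRdb abstract.
    """
    # Lazy motif match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


ssrs = find_ssrs("GCATATATATGCCAAAAAAGT", min_repeats=3)
print(ssrs)
```

Running the same scan on the corresponding region of several strains and comparing the copy counts at each locus would be one simple way to flag length-polymorphic repeats of the kind the database catalogues.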
Pfhrp2 and pfhrp3 polymorphisms in Plasmodium falciparum isolates from Dakar, Senegal: impact on rapid malaria diagnostic tests

PubMed Central

2013-01-01

Background: An accurate diagnosis is essential for the rapid and appropriate treatment of malaria. The accuracy of the histidine-rich protein 2 (PfHRP2)-based rapid diagnostic test (RDT) Palutop+4® was assessed here. One possible factor contributing to the failure to detect malaria by this test is the diversity of the parasite PfHRP2 antigens. Methods: PfHRP2 detection with the Palutop+4® RDT was carried out. The pfhrp2 and pfhrp3 genes were amplified and sequenced from 136 isolates of Plasmodium falciparum that were collected in Dakar, Senegal from 2009 to 2011. The DNA sequences were determined and statistical analyses of the variation observed between these two genes were conducted. The potential impact of PfHRP2 and PfHRP3 sequence variation on malaria diagnosis was examined. Results: Seven P. falciparum isolates (5.9% of the total isolates, regardless of the parasitaemia; 10.7% of the isolates with parasitaemia ≤0.005% or ≤250 parasites/μl) were undetected by the PfHRP2 Palutop+4® RDT. Low parasite density is not sufficient to explain the PfHRP2 detection failure. Three of these seven samples showed pfhrp2 deletion (2.4%). The pfhrp3 gene was deleted in 12.8%. Of the 122 PfHRP2 sequences, 120 unique sequences were identified. Of the 109 PfHRP3 sequences, 64 unique sequences were identified. Using Baker's regression model, at least 7.4% of the P. falciparum isolates in Dakar were likely to be undetected by PfHRP2 at a parasite density of ≤250 parasites/μl (slightly lower than the evaluated prevalence of 10.7%). This predictive prevalence increased significantly between 2009 and 2011 (P = 0.0046). Conclusion: In the present work, 10.7% of the isolates with a parasitaemia ≤0.005% (≤250 parasites/μl) were undetected by the PfHRP2 Palutop+4® RDT (7.4% by Baker's predictive model). In addition, all of the parasites with pfhrp2 deletion (2.4% of the total samples) and 2.1% of the parasites with parasitaemia >0.005% and presence of pfhrp2 were not detected by PfHRP2 RDT. PfHRP2 is highly polymorphic in Senegal. Efforts should be made to more accurately determine the prevalence of parasites not detected by PfHRP2-based tests. PMID:23347727
Environmental metabarcoding reveals heterogeneous drivers of microbial eukaryote diversity in contrasting estuarine ecosystems

PubMed Central

Lallias, Delphine; Hiddink, Jan G; Fonseca, Vera G; Gaspar, John M; Sung, Way; Neill, Simon P; Barnes, Natalie; Ferrero, Tim; Hall, Neil; Lambshead, P John D; Packer, Margaret; Thomas, W Kelley; Creer, Simon

2015-01-01

Assessing how natural environmental drivers affect biodiversity underpins our understanding of the relationships between complex biotic and ecological factors in natural ecosystems. Of all ecosystems, anthropogenically important estuaries represent a ‘melting pot' of environmental stressors, typified by extreme salinity variations and associated biological complexity. Although existing models attempt to predict macroorganismal diversity over estuarine salinity gradients, attempts to model microbial biodiversity are limited for eukaryotes. Although diatoms commonly feature as bioindicator species, additional microbial eukaryotes represent a huge resource for assessing ecosystem health. Of these, meiofaunal communities may represent the optimal compromise between functional diversity that can be assessed using morphology and phenotype–environment interactions as compared with smaller life fractions. Here, using 454 Roche sequencing of the 18S nSSU barcode we investigate which of the local natural drivers are most strongly associated with microbial metazoan and sampled protist diversity across the full salinity gradient of the estuarine ecosystem. In order to investigate potential variation at the ecosystem scale, we compare two geographically proximate estuaries (Thames and Mersey, UK) with contrasting histories of anthropogenic stress. The data show that although community turnover is likely to be predictable, taxa are likely to respond to different environmental drivers and, in particular, hydrodynamics, salinity range and granulometry, according to varied life-history characteristics. At the ecosystem level, communities exhibited patterns of estuary-specific similarity within different salinity range habitats, highlighting the environmental sequencing biomonitoring potential of meiofauna, dispersal effects or both. PMID:25423027
Genomic evolution, recombination, and inter-strain diversity of chelonid alphaherpesvirus 5 from Florida and Hawaii green sea turtles with fibropapillomatosis.

PubMed

Morrison, Cheryl L; Iwanowicz, Luke; Work, Thierry M; Fahsbender, Elizabeth; Breitbart, Mya; Adams, Cynthia; Iwanowicz, Deb; Sanders, Lakyn; Ackermann, Mathias; Cornman, Robert S

2018-01-01

Chelonid alphaherpesvirus 5 (ChHV5) is a herpesvirus associated with fibropapillomatosis (FP) in sea turtles worldwide. Single-locus typing has previously shown differentiation between Atlantic and Pacific strains of this virus, with low variation within each geographic clade. However, a lack of multi-locus genomic sequence data hinders understanding of the rate and mechanisms of ChHV5 evolutionary divergence, as well as how these genomic changes may contribute to differences in disease manifestation. To assess genomic variation in ChHV5 among five Hawaii and three Florida green sea turtles, we used high-throughput short-read sequencing of long-range PCR products amplified from tumor tissue using primers designed from the single available ChHV5 reference genome from a Hawaii green sea turtle. This strategy recovered sequence data from both geographic regions for approximately 75% of the predicted ChHV5 coding sequences. The average nucleotide divergence between geographic populations was 1.5%; most of the substitutions were fixed differences between regions. Protein divergence was generally low (average 0.08%), and ranged between 0 and 5.3%. Several atypical genes originally identified and annotated in the reference genome were confirmed in ChHV5 genomes from both geographic locations. Unambiguous recombination events between geographic regions were identified, and clustering of private alleles suggests the prevalence of recombination in the evolutionary history of ChHV5. This study significantly increased the amount of sequence data available from ChHV5 strains, enabling informed selection of loci for future population genetic and natural history studies, and suggesting the (possibly latent) co-infection of individuals by well-differentiated geographic variants.
Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins

PubMed Central

Turnbaugh, Peter J.; Quince, Christopher; Faith, Jeremiah J.; McHardy, Alice C.; Yatsunenko, Tanya; Niazi, Faheem; Affourtit, Jason; Egholm, Michael; Henrissat, Bernard; Knight, Rob; Gordon, Jeffrey I.

2010-01-01

We deeply sampled the organismal, genetic, and transcriptional diversity in fecal samples collected from a monozygotic (MZ) twin pair and compared the results to 1,095 communities from the gut and other body habitats of related and unrelated individuals. Using a new scheme for noise reduction in pyrosequencing data, we estimated the total diversity of species-level bacterial phylotypes in the 1.2-1.5 million bacterial 16S rRNA reads obtained from each deeply sampled cotwin to be ~800 (35.9%, 49.1% detected in both). A combined 1.1 million read 16S rRNA dataset representing 281 shallowly sequenced fecal samples from 54 twin pairs and their mothers contained an estimated 4,018 species-level phylotypes, with each sample having a unique species assemblage (53.4 ± 0.6% and 50.3 ± 0.5% overlap with the deeply sampled cotwins). Of the 134 phylotypes with a relative abundance of >0.1% in the combined dataset, only 37 appeared in >50% of the samples, with one phylotype in the Lachnospiraceae family present in 99%. Nongut communities had significantly reduced overlap with the deeply sequenced twins’ fecal microbiota (18.3 ± 0.3%, 15.3 ± 0.3%). The MZ cotwins’ fecal DNA was deeply sequenced (3.8-6.3 Gbp/sample) and assembled reads were assigned to 25 genus-level phylogenetic bins. Only 17% of the genes in these bins were shared between the cotwins. Bins exhibited differences in their degree of sequence variation, gene content including the repertoire of carbohydrate active enzymes present within and between twins (e.g., predicted cellulases, dockerins), and transcriptional activities. These results provide an expanded perspective about features that make each of us unique life forms and directions for future characterization of our gut ecosystems. PMID:20363958
Genomic evolution, recombination, and inter-strain diversity of chelonid alphaherpesvirus 5 from Florida and Hawaii green sea turtles with fibropapillomatosis

PubMed Central

Iwanowicz, Luke; Work, Thierry M.; Fahsbender, Elizabeth; Breitbart, Mya; Adams, Cynthia; Iwanowicz, Deb; Sanders, Lakyn; Ackermann, Mathias; Cornman, Robert S.

2018-01-01

Chelonid alphaherpesvirus 5 (ChHV5) is a herpesvirus associated with fibropapillomatosis (FP) in sea turtles worldwide. Single-locus typing has previously shown differentiation between Atlantic and Pacific strains of this virus, with low variation within each geographic clade. However, a lack of multi-locus genomic sequence data hinders understanding of the rate and mechanisms of ChHV5 evolutionary divergence, as well as how these genomic changes may contribute to differences in disease manifestation. To assess genomic variation in ChHV5 among five Hawaii and three Florida green sea turtles, we used high-throughput short-read sequencing of long-range PCR products amplified from tumor tissue using primers designed from the single available ChHV5 reference genome from a Hawaii green sea turtle. This strategy recovered sequence data from both geographic regions for approximately 75% of the predicted ChHV5 coding sequences. The average nucleotide divergence between geographic populations was 1.5%; most of the substitutions were fixed differences between regions. Protein divergence was generally low (average 0.08%), and ranged between 0 and 5.3%. Several atypical genes originally identified and annotated in the reference genome were confirmed in ChHV5 genomes from both geographic locations. Unambiguous recombination events between geographic regions were identified, and clustering of private alleles suggests the prevalence of recombination in the evolutionary history of ChHV5. This study significantly increased the amount of sequence data available from ChHV5 strains, enabling informed selection of loci for future population genetic and natural history studies, and suggesting the (possibly latent) co-infection of individuals by well-differentiated geographic variants. PMID:29479497
An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster: the Adh region.

PubMed Central

Ashburner, M; Misra, S; Roote, J; Lewis, S E; Blazej, R; Davis, T; Doyle, C; Galle, R; George, R; Harris, N; Hartzell, G; Harvey, D; Hong, L; Houston, K; Hoskins, R; Johnson, G; Martin, C; Moshrefi, A; Palazzolo, M; Reese, M G; Spradling, A; Tsang, G; Wan, K; Whitelaw, K; Celniker, S

1999-01-01

A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species. "Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it." Milne 1926 PMID:10471707
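The densities and percentages quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the region length is the 2.9 Mb given in the record title (the abstract itself says only "nearly 3 Mb"); the variable names are illustrative.

```python
# Region length taken from the record title (assumption: 2.9 Mb = 2,900 kb).
REGION_KB = 2900

protein_coding_genes = 218
transposable_elements = 17

kb_per_gene = REGION_KB / protein_coding_genes
kb_per_te = REGION_KB / transposable_elements

# Shares of the 218 known and predicted genes reported in the abstract.
pct_est_match = 100 * 95 / protein_coding_genes
pct_protein_similarity = 100 * 144 / protein_coding_genes

print(f"one gene every {kb_per_gene:.1f} kb")
print(f"one transposable element every {kb_per_te:.1f} kb")
print(f"EST match: {pct_est_match:.1f}%, protein similarity: {pct_protein_similarity:.1f}%")
```

Rounded to whole numbers, these values agree with the abstract's figures of 13 kb per gene, 171 kb per element, 44%, and 66%.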
Potential and limits to unravel the genetic architecture and predict the variation of Fusarium head blight resistance in European winter wheat (Triticum aestivum L.).

PubMed

Jiang, Y; Zhao, Y; Rodemann, B; Plieske, J; Kollers, S; Korzun, V; Ebmeyer, E; Argillier, O; Hinze, M; Ling, J; Röder, M S; Ganal, M W; Mette, M F; Reif, J C

2015-03-01

Genome-wide mapping approaches in diverse populations are powerful tools to unravel the genetic architecture of complex traits. The main goals of our study were to investigate the potential and limits to unravel the genetic architecture and to identify the factors determining the accuracy of prediction of the genotypic variation of Fusarium head blight (FHB) resistance in wheat (Triticum aestivum L.) based on data collected with a diverse panel of 372 European varieties. The wheat lines were phenotyped in multi-location field trials for FHB resistance and genotyped with 782 simple sequence repeat (SSR) markers, and 9k and 90k single-nucleotide polymorphism (SNP) arrays. We applied genome-wide association mapping in combination with fivefold cross-validations and observed surprisingly high accuracies of prediction for marker-assisted selection based on the detected quantitative trait loci (QTLs). Using a random sample of markers not selected for marker-trait associations revealed only a slight decrease in prediction accuracy compared with marker-based selection exploiting the QTL information. The same picture was confirmed in a simulation study, suggesting that relatedness is a main driver of the accuracy of prediction in marker-assisted selection of FHB resistance. When the accuracy of prediction of three genomic selection models was contrasted for the three marker data sets, no significant differences in accuracies among marker platforms and genomic selection models were observed. Marker density impacted the accuracy of prediction only marginally. Consequently, genomic selection of FHB resistance can be implemented most cost-efficiently based on low- to medium-density SNP arrays.
Ovine mitochondrial DNA sequence variation and its association with production and reproduction traits within an Afec-Assaf flock.

PubMed

Reicher, S; Seroussi, E; Weller, J I; Rosov, A; Gootwine, E

2012-07-01

Polymorphisms in mitochondrial DNA (mtDNA) protein- and tRNA-coding genes were shown to be associated with various diseases in humans as well as with production and reproduction traits in livestock. Alignment of full-length mitochondrial sequences from the 5 known ovine haplogroups: HA (n = 3), HB (n = 5), HC (n = 3), HD (n = 2), and HE (n = 2; GenBank accession nos. HE577847-50 and 11 published complete ovine mitochondrial sequences) revealed sequence variation in 10 out of the 13 protein-coding mtDNA sequences. Twenty-six of the 245 variable sites found in the protein-coding sequences represent non-synonymous mutations. Sequence variation was also observed in 8 out of the 22 tRNA mtDNA sequences. On the basis of the mtDNA control region and cytochrome b partial sequences along with information on maternal lineages within an Afec-Assaf flock, 1,126 Afec-Assaf ewes were assigned to mitochondrial haplogroups HA, HB, and HC, with frequencies of 0.43, 0.43, and 0.14, respectively. Analysis of birth weight and growth rate records of lambs (n = 1,286) and productivity from 4,993 lambing records revealed no association between mitochondrial haplogroup affiliation and female longevity, lamb perinatal survival rate, birth weight, or daily growth rate of lambs up to 150 d, which averaged 1,664 d, 88.3%, 4.5 kg, and 320 g/d, respectively. However, significant (P < 0.0001) differences among the haplogroups were found for prolificacy of ewes, with prolificacies (mean ± SE) of 2.14 ± 0.04, 2.25 ± 0.04, and 2.30 ± 0.06 lambs born/ewe lambing for the HA, HB, and HC haplogroups, respectively. Our results highlight the ovine mitogenome genetic variation in protein- and tRNA-coding genes and suggest that sequence variation in ovine mtDNA is associated with variation in ewe prolificacy.
HIV-1 sequence variation between isolates from mother-infant transmission pairs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wike, C.M.; Daniels, M.R.; Furtado, M.

1991-12-31

To examine the sequence diversity of human immunodeficiency virus type 1 (HIV-1) between known transmission sets, sequences from the V3 and V4-V5 region of the env gene from 4 mother-infant pairs were analyzed. The mean interpatient sequence variation between isolates from linked mother-infant pairs was comparable to the sequence diversity found between isolates from other close contacts. The mean intrapatient variation was significantly less in the infants' isolates than in the isolates from both their mothers and other characterized intrapatient sequence sets. In addition, a distinct and characteristic difference in the glycosylation pattern preceding the V3 loop was found between each linked transmission pair. These findings indicate that selection of specific genotypic variants, which may play a role in some direct transmission sets, and the duration of infection are important factors in the degree of diversity seen between the sequence sets.
The Evolution of Mobile DNAs: When Will Transposons Create Phylogenies That Look As If There Is a Master Gene?

PubMed Central

Brookfield, John F. Y.; Johnson, Louise J.

2006-01-01

Some families of mammalian interspersed repetitive DNA, such as the Alu SINE sequence, appear to have evolved by the serial replacement of one active sequence with another, consistent with there being a single source of transposition: the “master gene.” Alternative models, in which multiple source sequences are simultaneously active, have been called “transposon models.” Transposon models differ in the proportion of elements that are active and in whether inactivation occurs at the moment of transposition or later. Here we examine the predictions of various types of transposon model regarding the patterns of sequence variation expected at an equilibrium between transposition, inactivation, and deletion. Under the master gene model, all bifurcations in the true tree of elements occur in a single lineage. We show that this property will also hold approximately for transposon models in which most elements are inactive and where at least some of the inactivation events occur after transposition. Such tree shapes are therefore not conclusive evidence for a single source of transposition. PMID:16790583
Mammoth and Mastodon collagen sequences; survival and utility

NASA Astrophysics Data System (ADS)

Buckley, M.; Larkin, N.; Collins, M.

2011-04-01

Near-complete collagen (I) sequences are proposed for elephantid and mammutid taxa, based upon available African elephant genomic data and supported with LC-MALDI-MS/MS and LC-ESI-MS/MS analyses of collagen digests from proboscidean bone. Collagen sequence coverage was investigated from several specimens of two extinct mammoths (Mammuthus trogontherii and Mammuthus primigenius), the extinct American mastodon (Mammut americanum), the extinct straight-tusked elephant (Elephas (Palaeoloxodon) antiquus) and extant Asian (Elephas maximus) and African (Loxodonta africana) elephants and compared between the two ionization techniques used. Two suspected mammoth fossils from the British Middle Pleistocene (Cromerian) deposits of the West Runton Forest Bed were analysed to investigate the potential use of peptide mass spectrometry for fossil identification. Despite the age of the fossils, sufficient peptides were obtained to identify these as elephantid, and sufficient sequence variation to discriminate elephantid and mammutid collagen (I). In-depth LC-MS analyses further failed to identify a peptide that could be used to reliably distinguish between the three genera of elephantids (Elephas, Loxodonta and Mammuthus), an observation consistent with predicted amino acid substitution rates between these species.
Molecular cloning of a cDNA coding for GTP cyclohydrolase I from Dictyostelium discoideum.

PubMed Central

Witter, K; Cahill, D J; Werner, T; Ziegler, I; Rödl, W; Bacher, A; Gütlich, M

1996-01-01

The GTP cyclohydrolase I (GTP-CH) gene of the cellular slime mould Dictyostelium discoideum has been cloned and sequenced. The 855 bp cDNA of this gene contains the open reading frame (ORF) encoding 232 amino acids with a predicted molecular mass of approx. 26 kDa. Southern blot analysis indicated the presence of a single gene for GTP-CH in Dictyostelium. PCR amplification of the ORF from chromosomal DNA and sequencing showed the existence of a 101 bp intron in the GTP-CH gene of Dictyostelium discoideum. The amino acid sequence has 47% and 49% positional identity to those of the human and yeast enzymes respectively. Most of the sequence variation between species is located in the N-terminal part of the protein. The overall identity with the E. coli protein is markedly lower. The enzyme was expressed in E. coli and purified as a 68 kDa fusion protein with the maltose-binding protein of E. coli. GTP-CH of Dictyostelium is heat-stable and showed maximal activity at 60 degrees C. The Km value for GTP is 50 microM. PMID:8870645
The design of strain-specific polymerase chain reactions for discrimination of the racoon rabies virus strain from indigenous rabies viruses of Ontario.

PubMed

Nadin-Davis, S A; Huang, W; Wandeler, A I

1996-03-01

Since its recognition as a discrete epizootic in Florida in the early 1950s, the raccoon strain of rabies virus (RV) has spread over almost the entire eastern seaboard of the US and now threatens to enter the southernmost regions of Canada. To characterise this RV strain in more detail, nucleotide sequencing of the N and G genes, encoding the nucleoprotein and glycoprotein, respectively, of representative isolates has been undertaken. This sequence information generated a conserved restriction map of the N gene, thereby permitting unequivocal identification of this strain by molecular techniques. Comparisons of the predicted nucleoprotein and glycoprotein products with those of other RV strains identified a number of amino acid sequence variations conserved only in the raccoon strain. This information was used to design strain-specific primers targeted to the N gene sequences encoding these residues. The incorporation of these primers into a multiplex polymerase chain reaction (PCR) protocol permitted easy and rapid discrimination between the raccoon RV strain and indigenous Ontario RVs.
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.

PubMed

Li, Yifeng; Shi, Wenqiang; Wasserman, Wyeth W

2018-05-31

In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. Developments in high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES based on supervised deep learning approaches for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates the potential of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases

PubMed Central

Assmus, Jens; Kleffe, Jürgen; Schmitt, Armin O.; Brockmann, Gudrun A.

2013-01-01

There is considerable interest in studying sequence variations. However, while the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions still poses problems. Each insertion and deletion causes a change of sequence. Yet, due to low complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels which differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two deletions are biologically equivalent. In such a case, it is impossible to identify the exact position of an indel merely based on sequence alignment. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic entries of insertions and deletions out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions of different functions like exons, introns or 5' and 3' UTRs. One and the same variation can be assigned to several different functional classifications of which only one is correct. We implemented an algorithm that determines for each indel database entry its complete set of equivalent indels which is uniquely characterized by the indel itself and a given interval of the reference sequence. PMID:23658777
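The equivalence described in this abstract can be illustrated with a minimal sketch for deletions. It is not the authors' algorithm. It relies only on the observation that shifting a deletion window by one base leaves the alternative sequence unchanged when the base entering the window equals the base leaving it. The function names and the 0-based coordinates are assumptions made for illustration.

```python
def equivalent_deletion_starts(ref: str, start: int, length: int) -> range:
    """Return every 0-based start position whose deletion of `length` bases
    gives the same alternative sequence as deleting ref[start:start+length]."""
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie within the reference")

    # Shift left while the base entering the window on the left
    # equals the base leaving it on the right.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shift right while the base leaving the window on the left
    # equals the base entering it on the right.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    return range(left, right + 1)


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Alternative sequence after removing `length` bases at `start`."""
    return ref[:start] + ref[start + length:]
```

For example, deleting one T from the homopolymer run in `GATTTTCA` can be annotated at any of four positions, and every choice produces `GATTTCA`. The contiguous range returned here corresponds to the contiguous region whose existence the abstract says the authors prove.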
Detection and quantitation of single nucleotide polymorphisms, DNA sequence variations, DNA mutations, DNA damage and DNA mismatches

DOEpatents

McCutchen-Maloney, Sandra L.

2002-01-01

DNA mutation binding proteins alone and as chimeric proteins with nucleases are used with solid supports to detect DNA sequence variations, DNA mutations and single nucleotide polymorphisms. The solid supports may be flow cytometry beads, DNA chips, glass slides or DNA dipsticks. DNA molecules are coupled to solid supports to form DNA-support complexes. Labeled DNA is used with unlabeled DNA mutation binding proteins such as TthMutS to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by binding which gives an increase in signal. Unlabeled DNA is utilized with labeled chimeras to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by nuclease activity of the chimera which gives a decrease in signal.
Complete genome sequence and the expression pattern of plasmids of the model ethanologen Zymomonas mobilis ZM4 and its xylose-utilizing derivatives 8b and 2032

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, Shihui; Vera, Jessica M.; Grass, Jeff

Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4 and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion.
The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Furthermore, plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.
Complete genome sequence and the expression pattern of plasmids of the model ethanologen Zymomonas mobilis ZM4 and its xylose-utilizing derivatives 8b and 2032

DOE PAGES

Yang, Shihui; Vera, Jessica M.; Grass, Jeff; ...

2018-05-02

Zymomonas mobilis is a natural ethanologen being developed and deployed as an industrial biofuel producer. To date, eight Z. mobilis strains have been completely sequenced and found to contain 2-8 native plasmids. However, systematic verification of predicted Z. mobilis plasmid genes and their contribution to cell fitness has not been hitherto addressed. Moreover, the precise number and identities of plasmids in Z. mobilis model strain ZM4 have been unclear. The lack of functional information about plasmid genes in ZM4 impedes ongoing studies for this model biofuel-producing strain. In this study, we determined the complete chromosome and plasmid sequences of ZM4 and its engineered xylose-utilizing derivatives 2032 and 8b. Compared to previously published and revised ZM4 chromosome sequences, the ZM4 chromosome sequence reported here contains 65 nucleotide sequence variations as well as a 2400-bp insertion. Four plasmids were identified in all three strains, with 150 plasmid genes predicted in strain ZM4 and 2032, and 153 plasmid genes predicted in strain 8b due to the insertion of heterologous DNA for expanded substrate utilization. Plasmid genes were then annotated using Blast2GO, InterProScan, and systems biology data analyses, and most genes were found to have apparent orthologs in other organisms or identifiable conserved domains. To verify plasmid gene prediction, RNA-Seq was used to map transcripts and also compare relative gene expression under various growth conditions, including anaerobic and aerobic conditions, or growth in different concentrations of biomass hydrolysates. Overall, plasmid genes were more responsive to varying hydrolysate concentrations than to oxygen availability. Additionally, our results indicated that although all plasmids were present in low copy number (about 1-2 per cell), the copy number of some plasmids varied under specific growth conditions or due to heterologous gene insertion.
The complete genome of ZM4 and two xylose-utilizing derivatives is reported in this study, with an emphasis on identifying and characterizing plasmid genes. Furthermore, plasmid gene annotation, validation, expression levels at growth conditions of interest, and contribution to host fitness are reported for the first time.
A Supervised Statistical Learning Approach for Accurate Legionella pneumophila Source Attribution during Outbreaks

PubMed Central

Buultjens, Andrew H.; Chua, Kyra Y. L.; Baines, Sarah L.; Kwong, Jason; Gao, Wei; Cutcher, Zoe; Adcock, Stuart; Ballard, Susan; Schultz, Mark B.; Tomita, Takehiro; Subasinghe, Nela; Carter, Glen P.; Pidot, Sacha J.; Franklin, Lucinda; Seemann, Torsten; Gonçalves Da Silva, Anders

2017-01-01

ABSTRACT Public health agencies are increasingly relying on genomics during Legionnaires' disease investigations. However, the causative bacterium (Legionella pneumophila) has an unusual population structure, with extreme temporal and spatial genome sequence conservation. Furthermore, Legionnaires' disease outbreaks can be caused by multiple L. pneumophila genotypes in a single source. These factors can confound cluster identification using standard phylogenomic methods. Here, we show that a statistical learning approach based on L. pneumophila core genome single nucleotide polymorphism (SNP) comparisons eliminates ambiguity for defining outbreak clusters and accurately predicts exposure sources for clinical cases. We illustrate the performance of our method by genome comparisons of 234 L. pneumophila isolates obtained from patients and cooling towers in Melbourne, Australia, between 1994 and 2014. This collection included one of the largest reported Legionnaires' disease outbreaks, which involved 125 cases at an aquarium. Using only sequence data from L. pneumophila cooling tower isolates and including all core genome variation, we built a multivariate model using discriminant analysis of principal components (DAPC) to find cooling tower-specific genomic signatures and then used it to predict the origin of clinical isolates. Model assignments were 93% congruent with epidemiological data, including the aquarium Legionnaires' disease outbreak and three other unrelated outbreak investigations. We applied the same approach to a recently described investigation of Legionnaires' disease within a UK hospital and observed a model predictive ability of 86%. We have developed a promising means to breach L. pneumophila genetic diversity extremes and provide objective source attribution data for outbreak investigations. 
IMPORTANCE Microbial outbreak investigations are moving to a paradigm where whole-genome sequencing and phylogenetic trees are used to support epidemiological investigations. It is critical that outbreak source predictions are accurate, particularly for pathogens, like Legionella pneumophila, which can spread widely and rapidly via cooling system aerosols, causing Legionnaires' disease. Here, by studying hundreds of Legionella pneumophila genomes collected over 21 years around a major Australian city, we uncovered limitations with the phylogenetic approach that could lead to a misidentification of outbreak sources. We implement instead a statistical learning technique that eliminates the ambiguity of inferring disease transmission from phylogenies. Our approach takes geolocation information and core genome variation from environmental L. pneumophila isolates to build statistical models that predict with high confidence the environmental source of clinical L. pneumophila during disease outbreaks. We show the versatility of the technique by applying it to unrelated Legionnaires' disease outbreaks in Australia and the UK. PMID:28821546
Amino acid sequence analysis of the annexin super-gene family of proteins.

PubMed

Barton, G J; Newman, R H; Freemont, P S; Crumpton, M J

1991-06-15

The annexins are a widespread family of calcium-dependent membrane-binding proteins. No common function has been identified for the family and, until recently, no crystallographic data existed for an annexin. In this paper we draw together 22 available annexin sequences consisting of 88 similar repeat units, and apply the techniques of multiple sequence alignment, pattern matching, secondary structure prediction and conservation analysis to the characterisation of the molecules. The analysis clearly shows that the repeats cluster into four distinct families and that greatest variation occurs within the repeat 3 units. Multiple alignment of the 88 repeats shows amino acids with conserved physicochemical properties at 22 positions, with only Gly at position 23 being absolutely conserved in all repeats. Secondary structure prediction techniques identify five conserved helices in each repeat unit and patterns of conserved hydrophobic amino acids are consistent with one face of a helix packing against the protein core in predicted helices a, c, d, e. Helix b is generally hydrophobic in all repeats, but contains a striking pattern of repeat-specific residue conservation at position 31, with Arg in repeats 4 and Glu in repeats 2, but unconserved amino acids in repeats 1 and 3. This suggests repeats 2 and 4 may interact via a buried saltbridge. The loop between predicted helices a and b of repeat 3 shows features distinct from the equivalent loop in repeats 1, 2 and 4, suggesting an important structural and/or functional role for this region. No compelling evidence emerges from this study for uteroglobin and the annexins sharing similar tertiary structures, or for uteroglobin representing a derivative of a primordial one-repeat structure that underwent duplication to give the present day annexins. The analyses performed in this paper are re-evaluated in the Appendix, in the light of the recently published X-ray structure for human annexin V. 
The structure confirms most of the predictions and shows the power of techniques for the determination of tertiary structural information from the amino acid sequences of an aligned protein family.
The 1000 Genomes Project: new opportunities for research and social challenges

PubMed Central

2010-01-01

The 1000 Genomes Project, an international collaboration, is sequencing the whole genome of approximately 2,000 individuals from different worldwide populations. The central goal of this project is to describe most of the genetic variation that occurs at a population frequency greater than 1%. The results of this project will allow scientists to identify genetic variation at an unprecedented degree of resolution and will also help improve the imputation methods for determining unobserved genetic variants that are not represented on current genotyping arrays. By identifying novel or rare functional genetic variants, researchers will be able to pinpoint disease-causing genes in genomic regions initially identified by association studies. This level of detailed sequence information will also improve our knowledge of the evolutionary processes and the genomic patterns that have shaped the human species as we know it today. The new data will also lay the foundation for future clinical applications, such as prediction of disease susceptibility and drug response. However, the forthcoming availability of whole genome sequences at affordable prices will raise ethical concerns and pose potential threats to individual privacy. Nevertheless, we believe that these potential risks are outweighed by the benefits in terms of diagnosis and research, so long as rigorous safeguards are kept in place through legislation that prevents discrimination on the basis of the results of genetic testing. PMID:20193048
Double Hits in Schizophrenia.

PubMed

Vorstman, Jacob A S; Olde Loohuis, Loes M; Kahn, René S; Ophoff, Roel A

2018-05-14

The co-occurrence of a Copy Number Variant (CNV) and a functional variant on the other allele may be a relevant genetic mechanism in schizophrenia. We hypothesized that the cumulative burden of such double hits - in particular those composed of a deletion and a coding single nucleotide variation (SNV) - is increased in patients with schizophrenia. We combined CNV data with coding variants data in 795 patients with schizophrenia and 474 controls. To limit false CNV detection, only CNVs called by two algorithms were included. CNV-affected genes were subsequently examined for coding SNVs, which we termed "CNV-SNVs". Correcting for total queried sequence, we assessed the CNV-SNV burden and the combined predicted deleterious effect. We estimated p-values by permutation of the phenotype. We detected 105 CNV-SNVs; 67 in duplicated and 38 in deleted genic sequence. While the difference in CNV-SNV rates was not significant, the combined deleteriousness inferred by CNV-SNVs in deleted sequence was almost fourfold higher in cases compared to controls (nominal p = 0.009). This effect may be driven by a higher number of CNV-SNVs and/or by a higher degree of predicted deleteriousness of CNV-SNVs. No such effect was observed for duplications. We provide early evidence that deletions co-occurring with a functional variant may be relevant, albeit of modest impact, for the genetic etiology of schizophrenia. Large-scale consortium studies are required to validate our findings. Sequence-based analyses would provide the best resolution for detection of CNVs as well as coding variants genome-wide.
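The phrase "p-values by permutation of the phenotype" can be made concrete with a small sketch. The abstract does not state the test statistic, so the one-sided difference in mean per-individual burden used here is an assumption, as are the function name, the number of permutations and the fixed seed.

```python
import random


def permutation_p_value(scores, is_case, n_perm=10000, seed=0):
    """One-sided permutation p-value for mean(cases) - mean(controls).

    scores  : per-individual burden scores (hypothetical input)
    is_case : booleans, True for cases and False for controls
    """
    n_total = len(scores)
    n_cases = sum(is_case)
    if n_total != len(is_case) or n_cases in (0, n_total):
        raise ValueError("need matching inputs with both cases and controls")

    total = sum(scores)

    def mean_diff(labels):
        case_sum = sum(s for s, c in zip(scores, labels) if c)
        return case_sum / n_cases - (total - case_sum) / (n_total - n_cases)

    observed = mean_diff(is_case)

    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)     # permute the phenotype, keep the scores fixed
        if mean_diff(labels) >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling the labels rather than the scores keeps each individual's burden intact and breaks only its link to case status, which is the null hypothesis being tested.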
Predicting residue-wise contact orders in proteins by support vector regression.

PubMed

Song, Jiangning; Burrage, Kevin

2006-10-03

The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
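The two evaluation measures quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between predicted and observed values, can be computed as in the sketch below. The function names are illustrative, and nothing here reproduces the authors' SVR pipeline or any normalization they may have applied to the RWCO values.

```python
import math


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))
```

The two measures answer different questions: CC reports how well the predicted profile tracks the shape of the observed one, while RMSE reports the typical size of the error in the units of the values themselves.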

Dictyocaulus viviparus genome, variome and transcriptome elucidate lungworm biology and support future intervention

PubMed Central

McNulty, Samantha N.; Strübe, Christina; Rosa, Bruce A.; Martin, John C.; Tyagi, Rahul; Choi, Young-Jun; Wang, Qi; Hallsworth Pepin, Kymberlie; Zhang, Xu; Ozersky, Philip; Wilson, Richard K.; Sternberg, Paul W.; Gasser, Robin B.; Mitreva, Makedonka

2016-01-01

The bovine lungworm, Dictyocaulus viviparus (order Strongylida), is an important parasite of livestock that causes substantial economic and production losses worldwide. Here we report the draft genome, variome, and developmental transcriptome of D. viviparus. The genome (161 Mb) is smaller than those of related bursate nematodes and encodes fewer proteins (14,171 total). In the first genome-wide assessment of genomic variation in any parasitic nematode, we found a high degree of sequence variability in proteins predicted to be involved in host-parasite interactions. Next, we used extensive RNA sequence data to track gene transcription across the life cycle of D. viviparus, and identified genes that might be important in nematode development and parasitism. Finally, we predicted genes that could be vital in host-parasite interactions, genes that could serve as drug targets, and putative RNAi effectors with a view to developing functional genomic tools. This extensive, well-curated dataset should provide a basis for developing new anthelmintics, vaccines, and improved diagnostic tests and serve as a platform for future investigations of drug resistance and epidemiology of the bovine lungworm and related nematodes. PMID:26856411
SNPnexus: assessing the functional relevance of genetic variation to facilitate the promise of precision medicine.

PubMed

Dayem Ullah, Abu Z; Oscanoa, Jorge; Wang, Jun; Nagano, Ai; Lemoine, Nicholas R; Chelala, Claude

2018-05-11

Broader functional annotation of genetic variation is a valuable means for prioritising phenotypically-important variants in further disease studies and large-scale genotyping projects. We developed SNPnexus to meet this need by assessing the potential significance of known and novel SNPs on the major transcriptome, proteome, regulatory and structural variation models. Since its previous release in 2012, we have made significant improvements to the annotation categories and updated the query and data viewing systems. The most notable changes include broader functional annotation of noncoding variants and expanding annotations to the most recent human genome assembly GRCh38/hg38. SNPnexus has now integrated rich resources from ENCODE and Roadmap Epigenomics Consortium to map and annotate the noncoding variants onto different classes of regulatory regions and noncoding RNAs as well as providing their predicted functional impact from eight popular non-coding variant scoring algorithms and computational methods. A novel functionality offered now is the support for neo-epitope predictions from leading tools to facilitate its use in immunotherapeutic applications. These updates to SNPnexus are in preparation for its future expansion towards a fully comprehensive computational workflow for disease-associated variant prioritization from sequencing data, placing its users at the forefront of translational research. SNPnexus is freely available at http://www.snp-nexus.org.
Application of a time-dependent coalescence process for inferring the history of population size changes from DNA sequence data.

PubMed

Polanski, A; Kimmel, M; Chakraborty, R

1998-05-12

Distribution of pairwise differences of nucleotides from data on a sample of DNA sequences from a given segment of the genome has been used in the past to draw inferences about the past history of population size changes. However, all earlier methods assume a given model of population size changes (such as sudden expansion), parameters of which (e.g., time and amplitude of expansion) are fitted to the observed distributions of nucleotide differences among pairwise comparisons of all DNA sequences in the sample. Our theory indicates that for any time-dependent population size, N(tau) (in which time tau is counted backward from present), a time-dependent coalescence process yields the distribution, p(tau), of the time of coalescence between two DNA sequences randomly drawn from the population. Prediction of p(tau) and N(tau) requires the use of a reverse Laplace transform known to be unstable. Nevertheless, simulated data obtained from three models of monotone population change (stepwise, exponential, and logistic) indicate that the pattern of a past population size change leaves its signature on the pattern of DNA polymorphism. Application of the theory to the published mtDNA sequences indicates that the current mtDNA sequence variation is not inconsistent with a logistic growth of the human population.
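The observed quantity this abstract starts from, the distribution of pairwise nucleotide differences in a sample, can be tabulated as in the sketch below. It assumes pre-aligned sequences of equal length and counts every mismatching column, with no special handling of gaps or ambiguity codes. The function name is illustrative, and the inference of N(tau) itself is not attempted here.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of each pairwise difference count.

    seqs : aligned DNA sequences of equal length (assumed input format)
    Returns a dict mapping number of differences -> fraction of pairs.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # Count mismatching columns for every unordered pair of sequences.
    counts = Counter(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    return {d: counts[d] / n_pairs for d in sorted(counts)}
```

With the three sequences `ACGT`, `ACGA` and `TCGA`, two of the three pairs differ at one site and one pair differs at two sites.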
Huangshan population of Chinese Zacco platypus (Teleostei, Cyprinidae) harbors diverse matrilines and high genetic diversity.

PubMed

Zheng, Xin; Zhou, Tian-Qi; Wan, Tao; Perdices, Anabel; Yang, Jin-Quan; Tang, Xin-Sheng; Wang, Zheng-Ping; Huang, Li-Qun; Huang, Song; He, Shun-Ping

2016-03-18

Six main mitochondrial DNA (mtDNA) lineages have been described in minnow (Zacco platypus) samples obtained from northern, western and southern China. Perdices et al. (2004) predicted that further sampling of other tributaries might discover more lineages of this species. In this study, we collected 26 Zacco platypus individuals in the Huangshan area of eastern China and determined the cytochrome b (cytb) sequence variations. Combined with reported data in GenBank, we identified ten matrilines (Zacco A-J) in a total of 169 samples, with relatively high molecular divergence found among them. The Huangshan population had the greatest genetic variation among all sampled regions and hosted six of the ten matrilines. Our results highlight the significance of the Huangshan area for the conservation of Zacco platypus.
Human genetics: international projects and personalized medicine.

PubMed

Apellaniz-Ruiz, Maria; Gallego, Cristina; Ruiz-Pinto, Sara; Carracedo, Angel; Rodríguez-Antona, Cristina

2016-03-01

In this article, we present the progress driven by recent technological advances and new massive sequencing technologies in the field of human genetics. We discuss this knowledge in relation to drug response prediction, from the germline genetic variation compiled in the 1000 Genomes Project or in the Genotype-Tissue Expression project, to the phenome-genome archives, the international cancer projects, such as The Cancer Genome Atlas or the International Cancer Genome Consortium, and the epigenetic variation and its influence on gene expression, including the regulation of drug metabolism. This review is based on the lectures presented by the speakers of the Symposium "Human Genetics: International Projects & New Technologies" from the VII Conference of the Spanish Pharmacogenetics and Pharmacogenomics Society, held on the 20th and 21st of April 2015.
The polymorphisms of LCR, E6, and E7 of HPV-58 isolates in Yunnan, Southwest China.

PubMed

Xi, Juemin; Chen, Junying; Xu, Miaoling; Yang, Hongying; Wen, Songjiao; Pan, Yue; Wang, Xiaodan; Ye, Chao; Qiu, Lijuan; Sun, Qiangming

2018-04-25

Variations in HPV LCR/E6/E7 have been shown to be associated with viral persistence and cervical cancer development. So far, there are few reports about the polymorphisms of the HPV-58 LCR/E6/E7 sequences in Southwest China. This study aims to characterize the gene polymorphisms of the HPV-58 LCR/E6/E7 sequences in women of Southwest China, and assess the effects of variations on the immune recognition of viral E6 and E7 antigens. Twelve LCR/E6/E7 of the HPV-58 isolates were amplified and sequenced. A neighbor-joining phylogenetic tree was constructed by MEGA 7.0, followed by the secondary structure prediction of the related proteins using PSIPRED v3.3. The selection pressure acting on the HPV-58 E6 and E7 coding regions was estimated by Bayes empirical Bayes analysis of PAML 4.8. Meanwhile, the MHC class-I and II binding peptides were predicted by the ProPred-I server and ProPred server. The transcription factor binding sites in the HPV-58 LCR were analyzed using the JASPAR database. Twenty-nine SNPs (20 in the LCR, 3 in the E6, 6 in the E7) were identified at 27 nucleotide sites across the HPV-58 LCR/E6/E7. From the most variable to the least variable, the nucleotide variations were LCR > E7 > E6. The combinations of all the SNPs resulted in 11 unique sequences, which were clustered into the A lineage (7 belong to A1, 2 belong to A2, and 2 belong to A3). An insertion (TGTCAGTTTCCT) was found between the nucleotide sites 7280 and 7281 in 2 variants, and a deletion (TTTAT) was found between 7429 and 7433 in 1 variant. The most common non-synonymous substitution V77A in the E7 was observed in the sequences encoding the α-helix. 63G in the E7 was determined to be the only positively selected site in the HPV-58 E6/E7 sequences. Six non-synonymous amino acid substitutions (including S71F and K93N in the E6, and T20I, G41R, G63S/D, and V77A in the E7) affected multiple putative epitopes for both CD4+ and CD8+ T-cells.
In the LCR, C7265G and C7266T were the most variable sites and were the potential binding sites for the transcription factor SOX10. These results provide an insight into the intrinsic geographical relatedness and biological differences of the HPV-58 variants, and contribute to further research on the HPV-58 epidemiology, carcinogenesis, and therapeutic vaccine development.
Genome-wide patterns of differentiation and spatially varying selection between postglacial recolonization lineages of Populus alba (Salicaceae), a widespread forest tree.

PubMed

Stölting, Kai N; Paris, Margot; Meier, Cécile; Heinze, Berthold; Castiglione, Stefano; Bartha, Denes; Lexer, Christian

2015-08-01

Studying the divergence continuum in plants is relevant to fundamental and applied biology because of the potential to reveal functionally important genetic variation. In this context, whole-genome sequencing (WGS) provides the necessary rigour for uncovering footprints of selection. We resequenced populations of two divergent phylogeographic lineages of Populus alba (n = 48), thoroughly characterized by microsatellites (n = 317), and scanned their genomes for regions of unusually high allelic differentiation and reduced diversity using > 1.7 million single nucleotide polymorphisms (SNPs) from WGS. Results were confirmed by Sanger sequencing. On average, 9134 high-differentiation (≥ 4 standard deviations) outlier SNPs were uncovered between populations, 848 of which were shared by ≥ three replicate comparisons. Annotation revealed that 545 of these were located in 437 predicted genes. Twelve percent of differentiation outlier genome regions exhibited significantly reduced genetic diversity. Gene ontology (GO) searches were successful for 327 high-differentiation genes, and these were enriched for 63 GO terms. Our results provide a snapshot of the roles of 'hard selective sweeps' vs divergent selection of standing genetic variation in distinct postglacial recolonization lineages of P. alba. Thus, this study adds to our understanding of the mechanisms responsible for the origin of functionally relevant variation in temperate trees. © 2015 The Authors. New Phytologist © 2015 New Phytologist Trust.
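The outlier criterion mentioned above (differentiation at least 4 standard deviations above the mean) can be illustrated with a minimal sketch. The abstract does not say which differentiation statistic or software was used, so the per-SNP values, the function name `outlier_indices`, and the use of the population standard deviation are assumptions made purely for illustration.

```python
from statistics import mean, pstdev

def outlier_indices(values, threshold_sd=4.0):
    """Return indices of values lying at least threshold_sd standard
    deviations above the mean (hypothetical helper, not the authors' code)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return []  # no spread, so nothing can be an outlier
    return [i for i, v in enumerate(values) if (v - mu) / sd >= threshold_sd]

# Toy per-SNP differentiation values: 99 background SNPs and one extreme SNP.
differentiation = [0.1] * 99 + [0.9]
outliers = outlier_indices(differentiation)
print(outliers)
```

In a real scan the threshold would be applied per population comparison, and only SNPs flagged in several replicate comparisons would be retained, as the abstract describes.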
Population-genomic variation within RNA viruses of the Western honey bee, Apis mellifera, inferred from deep sequencing

USDA-ARS's Scientific Manuscript database

Deep sequencing of viruses isolated from infected hosts is an efficient way to measure population-genetic variation and can reveal patterns of dispersal and natural selection. In this study, we mined existing Illumina sequence reads to investigate single-nucleotide polymorphisms (SNPs) within two RN...
Human Genome Sequencing in Health and Disease

PubMed Central

Gonzaga-Jauregui, Claudia; Lupski, James R.; Gibbs, Richard A.

2013-01-01

Following the “finished,” euchromatic, haploid human reference genome sequence, the rapid development of novel, faster, and cheaper sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal genome information, and describes the application of whole-genome and exome sequencing to identify the cause of genetic diseases and to suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing information that assists in the formulation of a differential diagnosis. We outline herein some of the remaining challenges. PMID:22248320
Association of Amine-Receptor DNA Sequence Variants with Associative Learning in the Honeybee.

PubMed

Lagisz, Malgorzata; Mercer, Alison R; de Mouzon, Charlotte; Santos, Luana L S; Nakagawa, Shinichi

2016-03-01

Octopamine- and dopamine-based neuromodulatory systems play a critical role in learning and learning-related behaviour in insects. To further our understanding of these systems and resulting phenotypes, we quantified DNA sequence variations at six loci coding for octopamine and dopamine receptors and their association with aversive and appetitive learning traits in a population of honeybees. We identified 79 polymorphic sequence markers (mostly SNPs and a few insertions/deletions) located within or close to six candidate genes. Intriguingly, we found that levels of sequence variation in the protein-coding regions studied were low, indicating that sequence variation in the coding regions of receptor genes critical to learning and memory is strongly selected against. Non-coding and upstream regions of the same genes, however, were less conserved, and sequence variations in these regions were weakly associated with between-individual differences in learning-related traits. While these associations do not directly imply a specific molecular mechanism, they suggest that the cross-talk between dopamine and octopamine signalling pathways may influence olfactory learning and memory in the honeybee.
Genome Sequencing of Ralstonia solanacearum CQPS-1, a Phylotype I Strain Collected from a Highland Area with Continuous Cropping of Tobacco

PubMed Central

Liu, Ying; Tang, Yuanman; Qin, Xiyun; Yang, Liang; Jiang, Gaofei; Li, Shili; Ding, Wei

2017-01-01

Ralstonia solanacearum, an agent of bacterial wilt, is a highly variable species with a broad host range and wide geographic distribution. As a species complex, it has extensive genetic diversity and occupies varied environments, such as lowland and highland areas, so more genomes are needed for studying population evolution and environmental adaptation. In this paper, we report the genome sequencing of R. solanacearum strain CQPS-1, isolated from wilted tobacco in Pengshui, Chongqing, China, a highland area with severely acidified soil and continuous cropping of tobacco for more than 20 years. A comparative genomic analysis among different R. solanacearum strains was also performed. The complete genome of CQPS-1 was 5.89 Mb and comprised the chromosome (3.83 Mb) and the megaplasmid (2.06 Mb). A total of 5229 coding sequences were predicted (the chromosome and megaplasmid encoded 3573 and 1656 genes, respectively). A comparative analysis with eight strains from four phylotypes showed that there was some variation among the species, e.g., a large set of specific genes in CQPS-1. The type III secretion system gene cluster (hrp gene cluster) was conserved in CQPS-1 compared with the reference strain GMI1000. In addition, most genes encoding core type III effectors were also conserved relative to GMI1000, but significant variation was found in the gene ripAA: its identity to the GMI1000 gene was 75%, and the upstream hrpII box promoter was substantially mutated. This study provides a potential resource for further understanding of the relationship between variation of pathogenicity factors and adaptation to the host environment. PMID:28620361
Sequence and functional characterization of hypoxia-inducible factors, HIF1α, HIF2αa, and HIF3α, from the estuarine fish, Fundulus heteroclitus

PubMed Central

Townley, Ian K.; Karchner, Sibel I.; Skripnikova, Elena; Wiese, Thomas E.; Hahn, Mark E.

2017-01-01

The hypoxia-inducible factor (HIF) family of transcription factors plays central roles in the development, physiology, pathology, and environmental adaptation of animals. Because many aquatic habitats are characterized by episodes of low dissolved oxygen, fish represent ideal models to study the roles of HIF in the response to aquatic hypoxia. The estuarine fish Fundulus heteroclitus is found in habitats prone to hypoxia. It responds to low oxygen via behavioral, physiological, and molecular changes, and one member of the HIF family, HIF2α, has been previously described. Herein, cDNA sequencing, phylogenetic analyses, and genomic approaches were used to determine other members of the HIFα family from F. heteroclitus and their relationships to HIFα subunits from other vertebrates. In vitro and cellular approaches demonstrated that full-length forms of HIF1α, HIF2α, and HIF3α independently formed complexes with the β-subunit, aryl hydrocarbon receptor nuclear translocator, to bind to hypoxia response elements and activate reporter gene expression. Quantitative PCR showed that HIFα mRNA abundance varied among organs of normoxic fish in an isoform-specific fashion. Analysis of the F. heteroclitus genome revealed a locus encoding a second HIF2α—HIF2αb—a predicted protein lacking oxygen sensing and transactivation domains. Finally, sequence analyses demonstrated polymorphism in the coding sequence of each F. heteroclitus HIFα subunit, suggesting that genetic variation in these transcription factors may play a role in the variation in hypoxia responses among individuals or populations. PMID:28039194
Sequence and functional characterization of hypoxia-inducible factors, HIF1α, HIF2αa, and HIF3α, from the estuarine fish, Fundulus heteroclitus.

PubMed

Townley, Ian K; Karchner, Sibel I; Skripnikova, Elena; Wiese, Thomas E; Hahn, Mark E; Rees, Bernard B

2017-03-01

The hypoxia-inducible factor (HIF) family of transcription factors plays central roles in the development, physiology, pathology, and environmental adaptation of animals. Because many aquatic habitats are characterized by episodes of low dissolved oxygen, fish represent ideal models to study the roles of HIF in the response to aquatic hypoxia. The estuarine fish Fundulus heteroclitus is found in habitats prone to hypoxia. It responds to low oxygen via behavioral, physiological, and molecular changes, and one member of the HIF family, HIF2α, has been previously described. Herein, cDNA sequencing, phylogenetic analyses, and genomic approaches were used to determine other members of the HIFα family from F. heteroclitus and their relationships to HIFα subunits from other vertebrates. In vitro and cellular approaches demonstrated that full-length forms of HIF1α, HIF2α, and HIF3α independently formed complexes with the β-subunit, aryl hydrocarbon receptor nuclear translocator, to bind to hypoxia response elements and activate reporter gene expression. Quantitative PCR showed that HIFα mRNA abundance varied among organs of normoxic fish in an isoform-specific fashion. Analysis of the F. heteroclitus genome revealed a locus encoding a second HIF2α (HIF2αb), a predicted protein lacking oxygen sensing and transactivation domains. Finally, sequence analyses demonstrated polymorphism in the coding sequence of each F. heteroclitus HIFα subunit, suggesting that genetic variation in these transcription factors may play a role in the variation in hypoxia responses among individuals or populations. Copyright © 2017 the American Physiological Society.
Comparative study of the hemagglutinin and neuraminidase genes of influenza A virus H3N2, H9N2, and H5N1 subtypes using bioinformatics techniques.

PubMed

Ahn, Insung; Son, Hyeon S

2007-07-01

To investigate the genomic patterns of influenza A virus subtypes, such as H3N2, H9N2, and H5N1, we collected 1842 sequences of the hemagglutinin and neuraminidase genes from the NCBI database and parsed them into 7 categories: accession number, host species, sampling year, country, subtype, gene name, and sequence. The sequences that were isolated from the human, avian, and swine populations were extracted and stored in a MySQL database for intensive analysis. The GC content and relative synonymous codon usage (RSCU) values were calculated using Java code. Correspondence analysis of the RSCU values yielded the unique codon usage pattern (CUP) of each subtype and revealed no extreme differences among the human, avian, and swine isolates. H5N1 subtype viruses exhibited little variation in CUPs compared with other subtypes, suggesting that the H5N1 CUP has not yet undergone significant changes within each host species. Moreover, some observations may be relevant to CUP variation that has occurred over time among the H3N2 subtype viruses isolated from humans. All the sequences were divided into 3 groups over time, and each group seemed to have preferred synonymous codon patterns for each amino acid, especially for arginine, glycine, leucine, and valine. The bioinformatics technique we introduce in this study may be useful in predicting the evolutionary patterns of pandemic viruses.
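The two quantities named above, GC content and relative synonymous codon usage (RSCU), can be sketched as follows. The authors' own code is not shown in the abstract, so this is an independent illustration in Python using the standard genetic code and the usual RSCU definition (observed codon count divided by the mean count of all synonymous codons for that amino acid). The function names and the toy sequence are assumptions.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def rscu(seq):
    """RSCU per codon for amino acids that occur in the in-frame sequence."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    families = defaultdict(list)  # amino acid -> its synonymous codons
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)  # count under uniform codon usage
        for c in family:
            values[c] = counts[c] / expected
    return values

toy = "GGTGGTGGCGGA"  # four glycine codons, purely illustrative
print(gc_content(toy))
print(rscu(toy))
```

An RSCU of 1.0 means a codon is used as often as expected under uniform usage of its synonymous codons; values above 1.0 indicate a preferred codon.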
Genomic Rearrangements in Arabidopsis Considered as Quantitative Traits.

PubMed

Imprialou, Martha; Kahles, André; Steffen, Joshua G; Osborne, Edward J; Gan, Xiangchao; Lempe, Janne; Bhomra, Amarjit; Belfield, Eric; Visscher, Anne; Greenhalgh, Robert; Harberd, Nicholas P; Goram, Richard; Hein, Jotun; Robert-Seilaniantz, Alexandre; Jones, Jonathan; Stegle, Oliver; Kover, Paula; Tsiantis, Miltos; Nordborg, Magnus; Rätsch, Gunnar; Clark, Richard M; Mott, Richard

2017-04-01

To understand the population genetics of structural variants and their effects on phenotypes, we developed an approach to mapping structural variants that segregate in a population sequenced at low coverage. We avoid calling structural variants directly. Instead, the evidence for a potential structural variant at a locus is indicated by variation in the counts of short reads that map anomalously to that locus. These structural variant traits are treated as quantitative traits and mapped genetically, analogously to a gene expression study. Association between a structural variant trait at one locus and genotypes at a distant locus indicates the origin and target of a transposition. Using ultra-low-coverage (0.3×) population sequence data from 488 recombinant inbred Arabidopsis thaliana genomes, we identified 6502 segregating structural variants. Remarkably, 25% of these were transpositions. While many structural variants cannot be delineated precisely, we validated 83% of 44 predicted transposition breakpoints by polymerase chain reaction. We show that specific structural variants may be causative for quantitative trait loci for germination and resistance to infection by the fungus Albugo laibachii, isolate Nc14. Further, we show that the phenotypic heritability attributable to read-mapping anomalies differs from, and, in the case of time to germination and bolting, exceeds that due to standard genetic variation. Genes within structural variants are also more likely to be silenced or dysregulated. This approach complements the prevalent strategy of structural variant discovery in fewer individuals sequenced at high coverage. It is generally applicable to large populations sequenced at low coverage, and is particularly suited to mapping transpositions. Copyright © 2017 by the Genetics Society of America.
A novel real-time duplex PCR assay for detecting penA and ponA genotypes in Neisseria gonorrhoeae: Comparison with phenotypes determined by the E-test.

PubMed

Vernel-Pauillac, Frédérique; Merien, Fabrice

2006-12-01

For many years, the pathogenic bacterium Neisseria gonorrhoeae, the etiologic agent of gonorrhea, was generally susceptible to penicillin, until the emergence of resistant strains. Well-characterized genetic variations in the penicillin resistance-determining region correlate with decreased susceptibility to penicillin. At least 5 genes (penA, penB, mtrR, ponA, and penC) are involved in the chromosomally mediated resistance to this antibiotic. To date, no multiplex PCR assays targeting a range of gonococcal genes and variations as a means of predicting antibiotic resistance have been reported. The aim of this study was to develop a duplex assay using DNA from isolated strains. We describe the development and evaluation on the LightCycler platform of a real-time duplex PCR assay (hybridization probe format) for rapid and specific detection of ponA and penA variations, predicting penicillin susceptibilities. The real-time duplex PCR assay successfully detected variations in ponA and penA genes by use of distinct melting temperatures from a total of 120 Neisseria gonorrhoeae isolates. Moreover, the variation profiles obtained with the real-time PCR and the melting analysis showed good correlation with the pattern of penicillin susceptibility generated with classical antibiograms. Nucleotide sequencing data were in complete agreement with multiplex assay results. The presented assay is suitable for the detection of chromosomally mediated resistant strains of Neisseria gonorrhoeae in genotyping studies and could be valuable for effective antimicrobial strategies against gonococci.
[A boy with Meier-Gorlin syndrome carrying a novel ORC6 mutation and uniparental disomy of chromosome 16].

PubMed

Li, Juan; Ding, Yu; Chang, Guoying; Cheng, Qing; Li, Xin; Wang, Jian; Wang, Xiumin; Shen, Yiping

2017-02-10

To identify the genetic cause for an 11-year-old Chinese boy with Meier-Gorlin syndrome (MGS). Chromosomal microarray analysis (CMA) was used to detect potential variations, while whole exome sequencing (WES) was used to identify sequence variants. Sanger sequencing was used to confirm the suspected variants. The boy featured short stature, microtia, small patella, slender body build, craniofacial anomalies, and small testes with normal gonadotropin. A complete uniparental disomy of chromosome 16 was revealed by CMA. WES identified a novel homozygous mutation, c.67A>G (p.Lys23Glu), in the ORC6 gene, which maps to chromosome 16. As predicted by Alamut functional software, the mutation may affect the function of a structural domain of the ORC6 protein. The patient is probably the first diagnosed MGS case in China, and carried a novel homozygous mutation of the ORC6 gene and uniparental disomy of chromosome 16. The effect of this novel mutation on growth and development needs to be further investigated.
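The consistency between the nucleotide change c.67A>G and the protein change p.Lys23Glu can be checked with a short worked example. The abstract does not give the reference codon, so the sketch below simply tests both lysine codons (AAA and AAG) under the standard genetic code. The helper name `codon_position` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def codon_position(c_pos):
    """Map a 1-based coding position to (codon number, position in codon 1..3)."""
    return (c_pos - 1) // 3 + 1, (c_pos - 1) % 3 + 1

codon_number, offset = codon_position(67)
print(codon_number, offset)  # codon 23, first base of the codon

# Both lysine codons start with A; replacing that first base with G
# gives a glutamate codon in either case.
results = {}
for lys_codon in ("AAA", "AAG"):
    mutated = "G" + lys_codon[1:]
    results[lys_codon] = (mutated, CODON_TABLE[mutated])
print(results)
```

Position 67 falls on the first base of codon 23, and an A>G change at that position converts either lysine codon into a glutamate codon, which matches the reported p.Lys23Glu.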
Candidate chemosensory ionotropic receptors in a Lepidoptera.

PubMed

Olivier, V; Monsempes, C; François, M-C; Poivet, E; Jacquin-Joly, E

2011-04-01

A new family of candidate chemosensory ionotropic receptors (IRs) related to ionotropic glutamate receptors (iGluRs) was recently discovered in Drosophila melanogaster. Through BLAST analyses of an expressed sequence tag library prepared from male antennae of the noctuid moth Spodoptera littoralis, we identified 12 unigenes encoding proteins related to D. melanogaster and Bombyx mori IRs. Their full-length sequences were obtained, and analyses of their expression patterns suggest that they were exclusively expressed or clearly enriched in chemosensory organs. The deduced protein sequences were more similar to B. mori and D. melanogaster IRs than to iGluRs and showed considerable variations in the predicted ligand-binding domains; none have the three glutamate-interacting residues found in iGluRs, suggesting different binding specificities. Our data suggest that we identified members of the insect IR chemosensory receptor family in S. littoralis, and we report here the first demonstration of IR expression in Lepidoptera. © 2010 The Authors. Insect Molecular Biology © 2010 The Royal Entomological Society.
The structure of the human interferon alpha/beta receptor gene.

PubMed

Lutfalla, G; Gardiner, K; Proudhon, D; Vielh, E; Uzé, G

1992-02-05

Using the cDNA coding for the human interferon alpha/beta receptor (IFNAR), the IFNAR gene has been physically mapped relative to the other loci of the chromosome 21q22.1 region. 32,906 base pairs covering the IFNAR gene have been cloned and sequenced. Primer extension and solution hybridization-ribonuclease protection have been used to determine that the transcription of the gene is initiated in a broad region of 20 base pairs. Some aspects of the polymorphism of the gene, including noncoding sequences, have been analyzed; some are allelic differences in the coding sequence that induce amino acid variations in the resulting protein. The exon structure of the IFNAR gene and of that of the available genes for the receptors of the cytokine/growth hormone/prolactin/interferon receptor family have been compared with the predictions for the secondary structure of those receptors. From this analysis, we postulate a common origin and propose an hypothesis for the divergence from the immunoglobulin superfamily.
Whole genome sequencing of a dizygotic twin suggests a role for the serotonin receptor HTR7 in autism spectrum disorder.

PubMed

Helsmoortel, Céline; Swagemakers, Sigrid M A; Vandeweyer, Geert; Stubbs, Andrew P; Palli, Ivo; Mortier, Geert; Kooy, R Frank; van der Spek, Peter J

2016-12-01

Whole genome sequencing of a severely affected dizygotic twin with an autism spectrum disorder and intellectual disability revealed a compound heterozygous mutation in the HTR7 gene as the only variation not detected in control databases. Each parent carries one allele of the mutation, which is not present in an unaffected stepsister. The HTR7 gene encodes the 5-HT7 serotonin receptor that is involved in brain development, synaptic transmission, and plasticity. The paternally inherited p.W60C variant is situated at an evolutionarily conserved nucleotide and is predicted to be damaging by Polyphen2. A mutation akin to the maternally inherited p.V286I mutation has been reported to significantly affect the binding characteristics of the receptor. Therefore, the observed sequence alterations provide a first suggestive link between a genetic abnormality in the HTR7 gene and a neurodevelopmental disorder. © 2016 Wiley Periodicals, Inc.

Reducing DNA context dependence in bacterial promoters

PubMed Central

Carr, Swati B.; Densmore, Douglas M.

2017-01-01

Variation in the DNA sequence upstream of bacterial promoters is known to affect the expression levels of the products they regulate, sometimes dramatically. While neutral synthetic insulator sequences have been found to buffer promoters from upstream DNA context, there are no established methods for designing effective insulator sequences with predictable effects on expression levels. We address this problem with Degenerate Insulation Screening (DIS), a novel method based on a randomized 36-nucleotide insulator library and a simple, high-throughput, flow-cytometry-based screen that randomly samples from a library of 4^36 potential insulated promoters. The results of this screen can then be compared against a reference uninsulated device to select a set of insulated promoters providing a precise level of expression. We verify this method by insulating the constitutive, inducible, and repressible promoters of a four-transcriptional-unit inverter (NOT-gate) circuit, finding both that order dependence is largely eliminated by insulation and that circuit performance is also significantly improved, with a 5.8-fold mean improvement in on/off ratio. PMID:28422998
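A fully randomized 36-nucleotide region has four possible bases at each position, so the number of distinct sequences is 4 raised to the 36th power. The short calculation below makes the scale concrete. The screen size of one million clones is a hypothetical figure chosen only to show why such a library can be sampled but never enumerated; it is not taken from the paper.

```python
INSULATOR_LENGTH = 36   # randomized positions, from the abstract
ALPHABET_SIZE = 4       # A, C, G, T

library_size = ALPHABET_SIZE ** INSULATOR_LENGTH
print(library_size)           # exact integer, roughly 4.7e21
print(f"{library_size:.2e}")

# Hypothetical screen size (an assumption for illustration only).
clones_screened = 10 ** 6
fraction_sampled = clones_screened / library_size
print(f"{fraction_sampled:.1e}")  # a vanishingly small fraction of the library
```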
The selenium content of SEPP1 versus selenium requirements in vertebrates

PubMed Central

Hamre, Kristin; Ellingsen, Ståle

2015-01-01

Selenoprotein P (SEPP1) distributes selenium (Se) throughout the body via the circulatory system. For vertebrates, the Se content of SEPP1 varies from 7 to 18 Se atoms depending on the species, but the reason for this variation remains unclear. Herein we provide evidence that vertebrate SEPP1 Sec content correlates positively with Se requirements. As the Se content of full length SEPP1 is genetically determined, this presents a unique case where a nutrient requirement can be predicted based on genomic sequence information. PMID:26734501
Guidelines for reporting and using prediction tools for genetic variation analysis.

PubMed

Vihinen, Mauno

2013-02-01

Computational prediction methods are widely used for the analysis of human genome sequence variants and their effects on gene/protein function, splice site aberration, pathogenicity, and disease risk. New methods are frequently developed. We believe that guidelines are essential for those writing articles about new prediction methods, as well as for those applying these tools in their research, so that the necessary details are reported. This will enable readers to gain the full picture of technical information, performance, and interpretation of results, and to facilitate comparisons of related methods. Here, we provide instructions on how to describe new methods, report datasets, and assess the performance of predictive tools. We also discuss what details of predictor implementation are essential for authors to understand. Similarly, these guidelines for the use of predictors provide instructions on what needs to be delineated in the text, as well as how researchers can avoid unwarranted conclusions. They are applicable to most prediction methods currently utilized. By applying these guidelines, authors will help reviewers, editors, and readers to more fully comprehend prediction methods and their use. © 2012 Wiley Periodicals, Inc.
Simultaneous mutation and copy number variation (CNV) detection by multiplex PCR-based GS-FLX sequencing.

PubMed

Goossens, Dirk; Moens, Lotte N; Nelis, Eva; Lenaerts, An-Sofie; Glassee, Wim; Kalbe, Andreas; Frey, Bruno; Kopal, Guido; De Jonghe, Peter; De Rijk, Peter; Del-Favero, Jurgen

2009-03-01

We evaluated multiplex PCR amplification as a front-end for high-throughput sequencing, to widen the applicability of massive parallel sequencers for the detailed analysis of complex genomes. Using multiplex PCR reactions, we sequenced the complete coding regions of seven genes implicated in peripheral neuropathies in 40 individuals on a GS-FLX genome sequencer (Roche). The resulting dataset showed highly specific and uniform amplification. Comparison of the GS-FLX sequencing data with the dataset generated by Sanger sequencing confirmed the detection of all variants present and proved the sensitivity of the method for mutation detection. In addition, we showed that we could exploit the multiplexed PCR amplicons to determine individual copy number variation (CNV), increasing the spectrum of detected variations to both genetic and genomic variants. We conclude that our straightforward procedure substantially expands the applicability of the massive parallel sequencers for sequencing projects of a moderate number of amplicons (50-500) with typical applications in resequencing exons in positional or functional candidate regions and molecular genetic diagnostics. © 2008 Wiley-Liss, Inc.
Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants.

PubMed

Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo

2015-08-24

Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets; however, none offers more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.
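The AUC values quoted above (0.64-0.71) measure how often a method scores a true risk variant above a non-risk variant. A minimal sketch of that calculation is given below, using the pairwise (Mann-Whitney) definition of AUC with ties counted as one half. The scores and labels are invented toy data, and the function is an illustration rather than any of the three compared procedures.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a risk variant, 0 for a non-risk variant.
    Returns the probability that a randomly chosen risk variant
    receives a higher score than a randomly chosen non-risk variant.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one variant of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy prioritization scores for four variants (illustrative only).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values of 0.64-0.71 indicate real but modest enrichment.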
Cytochrome C oxydase deficiency: SURF1 gene investigation in patients with Leigh syndrome.

PubMed

Maalej, Marwa; Kammoun, Thouraya; Alila-Fersi, Olfa; Kharrat, Marwa; Ammar, Marwa; Felhi, Rahma; Mkaouar-Rebai, Emna; Keskes, Leila; Hachicha, Mongia; Fakhfakh, Faiza

2018-03-18

Leigh syndrome (LS) is a rare progressive neurodegenerative disorder occurring in infancy. The most common clinical signs reported in LS are growth retardation, optic atrophy, ataxia, psychomotor retardation, dystonia, hypotonia, seizures and respiratory disorders. This paper reports the presentation of 3 Tunisian patients with LS. The aim of this study was to screen the MT-ATP6 and SURF1 genes in Tunisian patients affected with classical Leigh syndrome and to investigate computationally the effect of the detected mutations on protein structure and function. After clinical investigations, three Tunisian patients were tested for mutations in both MT-ATP6 and SURF1 genes by direct sequencing followed by in silico analyses to predict the effects of sequence variation. Mutational analysis revealed the absence of mitochondrial mutations in the MT-ATP6 gene and the presence of a known homozygous splice site mutation, c.516-517delAG, in sibling patients, as well as novel double heterozygous mutations in one LS patient (c.752-18A>C/c.751+16G>A). In silico analyses of these intronic variations showed that they could alter splicing processes as well as SURF1 protein translation. Copyright © 2018 Elsevier Inc. All rights reserved.
Effect of the sequence data deluge on the performance of methods for detecting protein functional residues.

PubMed

Garrido-Martín, Diego; Pazos, Florencio

2018-02-27

The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
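The two kinds of positions described above, fully conserved columns and columns with a family-dependent conservation pattern, can be illustrated on a toy alignment. The abstract does not name the five methods evaluated, and real methods typically infer the subfamilies themselves; in this sketch the subfamily labels are supplied by hand and both function names are hypothetical.

```python
def fully_conserved(alignment):
    """Indices of columns where every sequence has the same residue (no gaps)."""
    return [i for i, col in enumerate(zip(*alignment))
            if len(set(col)) == 1 and col[0] != "-"]

def family_dependent(alignment, families):
    """Indices of columns conserved within each subfamily but differing
    between subfamilies."""
    hits = []
    for i, col in enumerate(zip(*alignment)):
        by_family = {}
        for residue, fam in zip(col, families):
            by_family.setdefault(fam, set()).add(residue)
        conserved_within = all(len(r) == 1 for r in by_family.values())
        representatives = {next(iter(r)) for r in by_family.values()}
        if conserved_within and len(representatives) > 1 and "-" not in representatives:
            hits.append(i)
    return hits

# Toy alignment of four sequences in two hand-labelled subfamilies.
toy_alignment = ["MKAD", "MKAE", "MRGD", "MRGE"]
toy_families = ["A", "A", "B", "B"]
print(fully_conserved(toy_alignment))
print(family_dependent(toy_alignment, toy_families))
```

Here column 0 is conserved across the whole family, columns 1 and 2 discriminate the two subfamilies, and column 3 varies without any family pattern.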
Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation

NASA Astrophysics Data System (ADS)

Sheynkman, Gloria M.; Shortreed, Michael R.; Cesnik, Anthony J.; Smith, Lloyd M.

2016-06-01

Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation

PubMed Central

Sheynkman, Gloria M.; Shortreed, Michael R.; Cesnik, Anthony J.; Smith, Lloyd M.

2016-01-01

Mass spectrometry–based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications. PMID:27049631
A comprehensive analysis of rare genetic variation in amyotrophic lateral sclerosis in the UK.

PubMed

Morgan, Sarah; Shatunov, Aleksey; Sproviero, William; Jones, Ashley R; Shoai, Maryam; Hughes, Deborah; Al Khleifat, Ahmad; Malaspina, Andrea; Morrison, Karen E; Shaw, Pamela J; Shaw, Christopher E; Sidle, Katie; Orrell, Richard W; Fratta, Pietro; Hardy, John; Pittman, Alan; Al-Chalabi, Ammar

2017-06-01

Amyotrophic lateral sclerosis is a progressive neurodegenerative disease of motor neurons. About 25 genes have been verified as relevant to the disease process, with rare and common variation implicated. We used next generation sequencing and repeat sizing to comprehensively assay genetic variation in a panel of known amyotrophic lateral sclerosis genes in 1126 patient samples and 613 controls. About 10% of patients were predicted to carry a pathological expansion of the C9orf72 gene. We found an increased burden of rare variants in patients within the untranslated regions of known disease-causing genes, driven by SOD1, TARDBP, FUS, VCP, OPTN and UBQLN2. We found 11 patients (1%) carried more than one pathogenic variant (P = 0.001) consistent with an oligogenic basis of amyotrophic lateral sclerosis. These findings show that the genetic architecture of amyotrophic lateral sclerosis is complex and that variation in the regulatory regions of associated genes may be important in disease pathogenesis. © The Author (2017). Published by Oxford University Press on behalf of the Guarantors of Brain.
SEAN: SNP prediction and display program utilizing EST sequence clusters.

PubMed

Huntley, Derek; Baldo, Angela; Johri, Saurabh; Sergot, Marek

2006-02-15

SEAN is an application that predicts single nucleotide polymorphisms (SNPs) using multiple sequence alignments produced from expressed sequence tag (EST) clusters. The algorithm uses rules of sequence identity and SNP abundance to determine the quality of the prediction. A Java viewer is provided to display the EST alignments and predicted SNPs.
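As a rough illustration of abundance-based SNP calling from an aligned EST cluster, the sketch below flags alignment columns where a second allele is seen in at least a minimum number of reads. The function name, the `min_minor_count` threshold and the toy cluster are assumptions; SEAN's actual rules are not given in the abstract and may differ.

```python
from collections import Counter


def candidate_snps(aligned_ests, min_minor_count=2, gap="-"):
    """Return (position, major_allele, minor_allele, minor_count) for candidate SNP columns.

    Assumes all sequences come from one multiple alignment and have equal length.
    """
    snps = []
    for pos in range(len(aligned_ests[0])):
        # Count called bases only; gaps and ambiguous 'N' calls are ignored.
        counts = Counter(
            seq[pos] for seq in aligned_ests if seq[pos] not in (gap, "N")
        )
        if len(counts) < 2:
            continue
        ranked = counts.most_common()
        major, _ = ranked[0]
        minor, minor_n = ranked[1]
        # Require the minor allele in several ESTs to reduce sequencing-error calls.
        if minor_n >= min_minor_count:
            snps.append((pos, major, minor, minor_n))
    return snps


# Hypothetical toy cluster: position 2 is supported by two reads, position 5 by one.
toy_cluster = ["ACGTAC", "ACGTAC", "ACTTAC", "ACTTAC", "ACGTAG"]
print(candidate_snps(toy_cluster))
```

Requiring the minor allele in more than one EST is a common way to separate likely polymorphisms from single-read sequencing errors.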
Comparative microRNA-seq Analysis Depicts Candidate miRNAs Involved in Skin Color Differentiation in Red Tilapia.

PubMed

Wang, Lanmei; Zhu, Wenbin; Dong, Zaijie; Song, Feibiao; Dong, Juanjuan; Fu, Jianjun

2018-04-16

Differentiation and variation in body color has been a growing limitation to the commercial value of red tilapia. Limited microRNA (miRNA) information is available on skin color differentiation and variation in fish so far. In this study, high-throughput Illumina sequencing of sRNAs was conducted on three color varieties of red tilapia and 81,394,491 raw reads were generated. A total of 158 differentially expressed miRNAs (|log₂(fold change)| ≥ 1 and q-value ≤ 0.001) were identified. Target prediction and functional analysis of color-related miRNAs showed that a variety of putative target genes, including slc7a11, mc1r and asip, played potential roles in pigmentation. Moreover, the miRNA-mRNA regulatory network was illustrated to elucidate the pigmentation differentiation, in which miR-138-5p and miR-722 were predicted to play important roles in regulating the pigmentation process. These results advance our understanding of the molecular mechanisms of skin pigmentation differentiation in red tilapia.
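The selection criterion quoted above (|log₂(fold change)| ≥ 1 and q-value ≤ 0.001) can be written as a small filter. This is a sketch only: the function name, the pseudocount and the example values are assumptions, and the study's own normalization and testing pipeline is not described in the abstract.

```python
import math


def is_differentially_expressed(mean_a, mean_b, q_value,
                                min_abs_log2fc=1.0, max_q=0.001,
                                pseudocount=1.0):
    """Apply the |log2 fold change| and q-value thresholds to one miRNA.

    The pseudocount (an assumption) avoids division by zero for unexpressed miRNAs.
    """
    log2fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    return abs(log2fc) >= min_abs_log2fc and q_value <= max_q


# Hypothetical normalized expression values and q-values, for illustration only.
examples = {
    "mirna_a": (99.0, 199.0, 0.0005),   # two-fold change, significant
    "mirna_b": (100.0, 120.0, 0.0001),  # significant but small change
    "mirna_c": (10.0, 100.0, 0.01),     # large change, q-value too high
}
for name, (a, b, q) in examples.items():
    print(name, is_differentially_expressed(a, b, q))
```

Both conditions must hold, so a large fold change with a weak q-value is rejected just like a significant but small change.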
Unique LCR variations among lineages of HPV16, 18 and 45 isolates from women with normal cervical cytology in Ghana.

PubMed

Awua, Adolf K; Adanu, Richard M K; Wiredu, Edwin K; Afari, Edwin A; Zubuch, Vanessa A; Asmah, Richard H; Severini, Alberto

2017-04-21

In addition to being useful for classification, sequence variations of human Papillomavirus (HPV) genotypes have been implicated in differential oncogenic potential and a differential association with the different histological forms of invasive cervical cancer. These associations have also been indicated for HPV genotype lineages and sub-lineages. In order to better understand the potential implications of lineage variation in the occurrence of cervical cancers in Ghana, we studied the lineages of the three most prevalent HPV genotypes among women with normal cytology as baseline to further studies. Of previously collected self- and health personnel-collected cervical specimen, 54, which were positive for HPV16, 18 and 45, were selected and the long control region (LCR) of each HPV genotype was separately amplified by a nested PCR. DNA sequences of 41 isolates obtained with the forward and reverse primers by Sanger sequencing were analysed. Nucleotide sequence variations of the HPV16 genotypes were observed at 30 positions within the LCR (7460 - 7840). Of these, 19 were the known variations for the lineages B and C (African lineages), while the other 11 positions had variations unique to the HPV16 isolates of this study. For the HPV18 isolates, the variations were at 35 positions, 22 of which were known variations of Africa lineages and the other 13 were unique variations observed for the isolates obtained in this study (at positions 7799 and 7813). HPV45 isolates had variations at 35 positions and 2 (positions 7114 and 97) were unique to the isolates of this study. This study provides the first data on the lineages of HPV 16, 18 and 45 isolates from Ghana. Although the study did not obtain full genome sequence data for a comprehensive comparison with known lineages, these genotypes were predominately of the Africa lineages and had some unique sequence variations at positions that suggest potential oncogenic implications. 
These data will be useful for comparison with lineages of these genotypes from women with cervical lesion and all the forms of invasive cervical cancers.
SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.

PubMed

Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

2008-05-01

Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.
SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences

PubMed Central

Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

2008-01-01

Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616
Characterization of genetic sequence variation of 58 STR loci in four major population groups.

PubMed

Novroski, Nicole M M; King, Jonathan L; Churchill, Jennifer D; Seah, Lay Hong; Budowle, Bruce

2016-11-01

Massively parallel sequencing (MPS) can identify sequence variation within short tandem repeat (STR) alleles as well as their nominal allele lengths that traditionally have been obtained by capillary electrophoresis. Using the MiSeq FGx Forensic Genomics System (Illumina), STRait Razor, and in-house excel workbooks, genetic variation was characterized within STR repeat and flanking regions of 27 autosomal, 7 X-chromosome and 24 Y-chromosome STR markers in 777 unrelated individuals from four population groups. Seven hundred and forty six autosomal, 227 X-chromosome, and 324 Y-chromosome STR alleles were identified by sequence compared with 357 autosomal, 107 X-chromosome, and 189 Y-chromosome STR alleles that were identified by length. Within the observed sequence variation, 227 autosomal, 156 X-chromosome, and 112 Y-chromosome novel alleles were identified and described. One hundred and seventy six autosomal, 123 X-chromosome, and 93 Y-chromosome sequence variants resided within STR repeat regions, and 86 autosomal, 39 X-chromosome, and 20 Y-chromosome variants were located in STR flanking regions. Three markers, D18S51, DXS10135, and DYS385a-b had 1, 4, and 1 alleles, respectively, which contained both a novel repeat region variant and a flanking sequence variant in the same nucleotide sequence. There were 50 markers that demonstrated a relative increase in diversity with the variant sequence alleles compared with those of traditional nominal length alleles. These population data illustrate the genetic variation that exists in the commonly used STR markers in the selected population samples and provide allele frequencies for statistical calculations related to STR profiling with MPS data. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
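The reason sequencing reveals more alleles than length-based typing can be shown with a toy comparison: alleles of equal length collapse into one class under capillary electrophoresis but stay distinct when compared by sequence. The function name and the example repeat sequences below are hypothetical, and real nominal allele calling is more involved than counting nucleotides.

```python
def allele_counts(allele_sequences):
    """Return (alleles distinguishable by length, alleles distinguishable by sequence)."""
    by_sequence = set(allele_sequences)
    # Length-based typing cannot separate alleles that have the same size.
    by_length = {len(seq) for seq in by_sequence}
    return len(by_length), len(by_sequence)


# Hypothetical alleles: the first two have equal length but differ at one base.
toy_alleles = ["TCTATCTATCTA", "TCTATCTGTCTA", "TCTATCTA", "TCTATCTA"]
print(allele_counts(toy_alleles))
```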
Persistent Sub-Yearly Chromospheric Variations in Lower Main-Sequence Stars: Tau Booe and alpha Com

NASA Technical Reports Server (NTRS)

Maulik, Davesh; Donahue, Robert A.; Baliunas, Sallie L.

1997-01-01

The recent discoveries of extrasolar planetary systems around lower main-sequence stars such as tau Booe (HD 120136) have prompted further investigation into their stellar activity. A cursory analysis of tau Booe for cyclic chromospheric activity, based on its 30-yr record of Ca II H and K fluxes obtained as part of the HK Project from Mount Wilson Observatory, finds an intermediate, sub-yearly period (approximately 117 d) in chromospheric activity in addition to, and separate from, both its rotation (3.3 d) and long-term variability. As a persistent subyearly period in surface magnetic activity is unprecedented, we investigate this apparent anomaly further by examining chromospheric activity levels of other stars with similar mass, searching for variability in chromospheric activity with periods of less than one year, but longer than measured or predicted rotation. An examination of the time series of 40 mid-to-late F dwarfs yielded one other star for further analysis: alpha Com (HD 114378, P approximately 132 d). The variations for these two stars were checked for persistence and coherence. Based on these determinations, we eliminate the possibilities of rotation, long-term activity cycle, and the evolution of active regions as the cause of this variation in both stars. In particular, for tau Booe we infer that the phenomenon may be chromospheric in origin; however, beyond this, it is difficult to identify anything further regarding the cause of the activity variations, or even whether the observed modulation in the two stars has the same origin.
Spatial and Temporal Stress Drop Variations of the 2011 Tohoku Earthquake Sequence

NASA Astrophysics Data System (ADS)

Miyake, H.

2013-12-01

The 2011 Tohoku earthquake sequence consists of foreshocks, mainshock, aftershocks, and repeating earthquakes. Quantifying spatial and temporal stress drop variations is important for understanding M9-class megathrust earthquakes. The variability and the spatial and temporal pattern of stress drop are basic information for rupture dynamics and are also useful for source modeling. As pointed out in the ground motion prediction equations by Campbell and Bozorgnia [2008, Earthquake Spectra], mainshock-aftershock pairs often show a significant decrease in stress drop. Here we focus on strong motion records before and after the Tohoku earthquake, and analyze source spectral ratios considering azimuth and distance dependency [Miyake et al., 2001, GRL]. Due to the limitation of station locations on land, spatial and temporal stress drop variations are estimated by adjusting shifts from the omega-squared source spectral model. The adjustment is based on stochastic Green's function simulations of source spectra considering azimuth and distance dependency. Because we assume the same Green's functions for event pairs at each station, both the propagation path and site amplification effects cancel out. While precise studies of spatial and temporal stress drop variations have been performed [e.g., Allmann and Shearer, 2007, JGR], this study targets the relations between stress drop and the progression of slow slip prior to the Tohoku earthquake reported by Kato et al. [2012, Science], as well as plate structures. Acknowledgement: This study is partly supported by ERI Joint Research (2013-B-05). We used the JMA unified earthquake catalogue and K-NET, KiK-net, and F-net data provided by NIED.
Variation block-based genomics method for crop plants.

PubMed

Kim, Yul Ho; Park, Hyang Mi; Hwang, Tae-Young; Lee, Seuk Ki; Choi, Man Soo; Jho, Sungwoong; Hwang, Seungwoo; Kim, Hak-Min; Lee, Dongwoo; Kim, Byoung-Chul; Hong, Chang Pyo; Cho, Yun Sung; Kim, Hyunmin; Jeong, Kwang Ho; Seo, Min Jung; Yun, Hong Tai; Kim, Sun Lim; Kwon, Young-Up; Kim, Wook Han; Chun, Hye Kyung; Lim, Sang Jong; Shin, Young-Ah; Choi, Ik-Young; Kim, Young Sun; Yoon, Ho-Sung; Lee, Suk-Ha; Lee, Sunghoon

2014-06-15

In contrast with wild species, cultivated crop genomes consist of reshuffled recombination blocks, which occurred by crossing and selection processes. Accordingly, recombination block-based genomics analysis can be an effective approach for the screening of target loci for agricultural traits. We propose the variation block method, which is a three-step process for recombination block detection and comparison. The first step is to detect variations by comparing the short-read DNA sequences of the cultivar to the reference genome of the target crop. Next, sequence blocks with variation patterns are examined and defined. The boundaries between the variation-containing sequence blocks are regarded as recombination sites. All the assumed recombination sites in the cultivar set are used to split the genomes, and the resulting sequence regions are termed variation blocks. Finally, the genomes are compared using the variation blocks. The variation block method identified recurring recombination blocks accurately and successfully represented block-level diversities in the publicly available genomes of 31 soybean and 23 rice accessions. The practicality of this approach was demonstrated by the identification of a putative locus determining soybean hilum color. We suggest that the variation block method is an efficient genomics method for the recombination block-level comparison of crop genomes. We expect that this method will facilitate the development of crop genomics by bringing genomics technologies to the field of crop breeding.
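The splitting step described above, in which all assumed recombination sites across the cultivar set are pooled and used to cut the genome into variation blocks, can be sketched as follows. The function name, the half-open coordinate convention and the toy breakpoints are assumptions for illustration; the published method's detection of the sites themselves is not reproduced here.

```python
def variation_blocks(chrom_length, breakpoints_per_cultivar):
    """Split one chromosome into blocks at the union of all cultivars' breakpoints.

    Returns half-open (start, end) intervals that together cover [0, chrom_length).
    """
    # Pool the assumed recombination sites from every cultivar, dropping duplicates
    # and any site that falls on or outside the chromosome ends.
    sites = sorted({
        pos
        for positions in breakpoints_per_cultivar.values()
        for pos in positions
        if 0 < pos < chrom_length
    })
    edges = [0] + sites + [chrom_length]
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical breakpoints for two cultivars on a 1000-bp toy chromosome.
toy_breakpoints = {"cultivar_a": [200, 600], "cultivar_b": [600, 850]}
print(variation_blocks(1000, toy_breakpoints))
```

Because every cultivar is cut at the same pooled sites, the resulting blocks are directly comparable across genomes, which is what the final comparison step relies on.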
Comparative Sequence Analysis of the Plasmid-Encoded Regulator of Enteropathogenic Escherichia coli Strains

PubMed Central

Okeke, Iruka N.; Borneman, Jade A.; Shin, Sooan; Mellies, Jay L.; Quinn, Laura E.; Kaper, James B.

2001-01-01

Enteropathogenic Escherichia coli (EPEC) strains that carry the EPEC adherence factor (EAF) plasmid were screened for the presence of different EAF sequences, including those of the plasmid-encoded regulator (per). Considerable variation in gene content of EAF plasmids from different strains was seen. However, bfpA, the gene encoding the structural subunit for the bundle-forming pilus, bundlin, and per genes were found in 96.8% of strains. Sequence analysis of the per operon and its promoter region from 15 representative strains revealed that it is highly conserved. Most of the variation occurs in the 5′ two-thirds of the perA gene. In contrast, the C-terminal portion of the predicted PerA protein that contains the DNA-binding helix-turn-helix motif is 100% conserved in all strains that possess a full-length gene. In a minority of strains including the O119:H2 and canine isolates and in a subset of O128:H2 and O142:H6 strains, frameshift mutations in perA leading to premature truncation and consequent inactivation of the gene were identified. Cloned perA, -B, and -C genes from these strains, unlike those from strains with a functional operon, failed to activate the LEE1 operon and bfpA transcriptional fusions or to complement a per mutant in reference strain E2348/69. Furthermore, O119, O128, and canine strains that carry inactive per operons were deficient in virulence protein expression. The context in which the perABC operon occurs on the EAF plasmid varies. The sequence upstream of the per promoter region in EPEC reference strains E2348/69 and B171-8 was present in strains belonging to most serogroups. In a subset of O119:H2, O128:H2, and O142:H6 strains and in the canine isolate, this sequence was replaced by an IS1294-homologous sequence. PMID:11500429

Phylogenetically Structured Differences in rRNA Gene Sequence Variation among Species of Arbuscular Mycorrhizal Fungi and Their Implications for Sequence Clustering

PubMed Central

Ekanayake, Saliya; Ruan, Yang; Schütte, Ursel M. E.; Kaonongbua, Wittaya; Fox, Geoffrey; Ye, Yuzhen; Bever, James D.

2016-01-01

ABSTRACT Arbuscular mycorrhizal (AM) fungi form mutualisms with plant roots that increase plant growth and shape plant communities. Each AM fungal cell contains a large amount of genetic diversity, but it is unclear if this diversity varies across evolutionary lineages. We found that sequence variation in the nuclear large-subunit (LSU) rRNA gene from 29 isolates representing 21 AM fungal species generally assorted into genus- and species-level clades, with the exception of species of the genera Claroideoglomus and Entrophospora. However, there were significant differences in the levels of sequence variation across the phylogeny and between genera, indicating that it is an evolutionarily constrained trait in AM fungi. These consistent patterns of sequence variation across both phylogenetic and taxonomic groups pose challenges to interpreting operational taxonomic units (OTUs) as approximations of species-level groups of AM fungi. We demonstrate that the OTUs produced by five sequence clustering methods using 97% or equivalent sequence similarity thresholds failed to match the expected species of AM fungi, although OTUs from AbundantOTU, CD-HIT-OTU, and CROP corresponded better to species than did OTUs from mothur or UPARSE. This lack of OTU-to-species correspondence resulted both from sequences of one species being split into multiple OTUs and from sequences of multiple species being lumped into the same OTU. The OTU richness therefore will not reliably correspond to the AM fungal species richness in environmental samples. Conservatively, this error can overestimate species richness by 4-fold or underestimate richness by one-half, and the direction of this error will depend on the genera represented in the sample. IMPORTANCE Arbuscular mycorrhizal (AM) fungi form important mutualisms with the roots of most plant species. Individual AM fungi are genetically diverse, but it is unclear whether the level of this diversity differs among evolutionary lineages. 
We found that the amount of sequence variation in an rRNA gene that is commonly used to identify AM fungal species varied significantly between evolutionary groups that correspond to different genera, with the exception of two genera that are genetically indistinguishable from each other. When we clustered groups of similar sequences into operational taxonomic units (OTUs) using five different clustering methods, these patterns of sequence variation caused the number of OTUs to either over- or underestimate the actual number of AM fungal species, depending on the genus. Our results indicate that OTU-based inferences about AM fungal species composition from environmental sequences can be improved if they take these taxonomically structured patterns of sequence variation into account. PMID:27260357
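To make the 97% similarity threshold concrete, the sketch below clusters pre-aligned, equal-length sequences with a generic greedy centroid heuristic. It is not the algorithm of AbundantOTU, CD-HIT-OTU, CROP, mothur or UPARSE; the function names and toy sequences are assumptions for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold identity."""
    centroids, otus = [], []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            # No existing centroid is similar enough, so this sequence seeds a new OTU.
            centroids.append(seq)
            otus.append([seq])
    return otus


# Hypothetical 100-bp sequences: 2 mismatches (98%) and 4 mismatches (96%) vs. the first.
base = "ACGT" * 25
near = "TT" + base[2:]
far = "TTTTT" + base[5:]
print([len(otu) for otu in greedy_otus([base, near, far])])
```

A fixed threshold of this kind splits a species whose intraspecific variation exceeds 3% and lumps species that differ by less, which is the mismatch between OTUs and species that the abstract reports.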
Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins.

PubMed

Firman, Taylor; Ghosh, Kingshuk

2018-03-28

We present an analytical theory to compute conformations of heteropolymers-applicable to describe disordered proteins-as a function of temperature and charge sequence. The theory describes coil-globule transition for a given protein sequence when temperature is varied and has been benchmarked against the all-atom Monte Carlo simulation (using CAMPARI) of intrinsically disordered proteins (IDPs). In addition, the model quantitatively shows how subtle alterations of charge placement in the primary sequence-while maintaining the same charge composition-can lead to significant changes in conformation, even as drastic as a coil (swelled above a purely random coil) to globule (collapsed below a random coil) and vice versa. The theory provides insights on how to control (enhance or suppress) these changes by tuning the temperature (or solution condition) and charge decoration. As an application, we predict the distribution of conformations (at room temperature) of all naturally occurring IDPs in the DisProt database and notice significant size variation even among IDPs with a similar composition of positive and negative charges. Based on this, we provide a new diagram-of-states delineating the sequence-conformation relation for proteins in the DisProt database. Next, we study the effect of post-translational modification, e.g., phosphorylation, on IDP conformations. Modifications as little as two-site phosphorylation can significantly alter the size of an IDP with everything else being constant (temperature, salt concentration, etc.). However, not all possible modification sites have the same effect on protein conformations; there are certain "hot spots" that can cause maximal change in conformation. The location of these "hot spots" in the parent sequence can readily be identified by using a sequence charge decoration metric originally introduced by Sawle and Ghosh. 
The ability of our model to predict conformations (both expanded and collapsed states) of IDPs at a high-throughput level can provide valuable insights into the different mechanisms by which phosphorylation/charge mutation controls IDP function.
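The sequence charge decoration (SCD) metric referred to above is commonly written as SCD = (1/N) Σ_{m>n} q_m q_n |m − n|^{1/2}, where q_i is the charge of residue i and N is the sequence length. The sketch below computes it under simplifying assumptions that are not taken from the abstract: Lys and Arg carry +1, Asp and Glu carry −1, and all other residues (including His) are neutral.

```python
import math

# Assumed fixed residue charges; pH-dependent ionization is ignored.
RESIDUE_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def sequence_charge_decoration(sequence):
    """Compute SCD = (1/N) * sum over pairs m > n of q_m * q_n * sqrt(m - n)."""
    charges = [RESIDUE_CHARGE.get(residue, 0) for residue in sequence]
    n_residues = len(charges)
    if n_residues == 0:
        raise ValueError("sequence must not be empty")
    total = 0.0
    for m in range(1, n_residues):
        for n in range(m):
            total += charges[m] * charges[n] * math.sqrt(m - n)
    return total / n_residues


# Same composition, different charge placement: mixed vs. segregated charges.
print(sequence_charge_decoration("EKEK"))
print(sequence_charge_decoration("EEKK"))
```

The two toy sequences have identical charge composition, yet the segregated one gives a more negative SCD, which mirrors the abstract's point that charge placement alone can change the predicted conformation.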
The genetic landscape of a physical interaction

PubMed Central

Diss, Guillaume

2018-01-01

A key question in human genetics and evolutionary biology is how mutations in different genes combine to alter phenotypes. Efforts to systematically map genetic interactions have mostly made use of gene deletions. However, most genetic variation consists of point mutations of diverse and difficult to predict effects. Here, by developing a new sequencing-based protein interaction assay – deepPCA – we quantified the effects of >120,000 pairs of point mutations on the formation of the AP-1 transcription factor complex between the products of the FOS and JUN proto-oncogenes. Genetic interactions are abundant both in cis (within one protein) and trans (between the two molecules) and consist of two classes – interactions driven by thermodynamics that can be predicted using a three-parameter global model, and structural interactions between proximally located residues. These results reveal how physical interactions generate quantitatively predictable genetic interactions. PMID:29638215
FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences.

PubMed

Schiex, Thomas; Gouzy, Jérôme; Moisan, Annick; de Oliveira, Yannick

2003-07-01

We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in bacterial GC rich genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it very suitable for gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD also includes the ability to take into account protein similarity information both in its prediction and its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation, and offers the ability to learn new models for new organisms.
Global sequence variation in the histidine-rich proteins 2 and 3 of Plasmodium falciparum: implications for the performance of malaria rapid diagnostic tests

PubMed Central

2010-01-01

Background Accurate diagnosis is essential for prompt and appropriate treatment of malaria. While rapid diagnostic tests (RDTs) offer great potential to improve malaria diagnosis, the sensitivity of RDTs has been reported to be highly variable. One possible factor contributing to variable test performance is the diversity of parasite antigens. This is of particular concern for Plasmodium falciparum histidine-rich protein 2 (PfHRP2)-detecting RDTs since PfHRP2 has been reported to be highly variable in isolates of the Asia-Pacific region. Methods The pfhrp2 exon 2 fragment from 458 isolates of P. falciparum collected from 38 countries was amplified and sequenced. For a subset of 80 isolates, the exon 2 fragment of histidine-rich protein 3 (pfhrp3) was also amplified and sequenced. DNA sequence and statistical analysis of the variation observed in these genes was conducted. The potential impact of the pfhrp2 variation on RDT detection rates was examined by analysing the relationship between sequence characteristics of this gene and the results of the WHO product testing of malaria RDTs: Round 1 (2008), for 34 PfHRP2-detecting RDTs. Results Sequence analysis revealed extensive variations in the number and arrangement of various repeats encoded by the genes in parasite populations world-wide. However, no statistically robust correlation between gene structure and RDT detection rate for P. falciparum parasites at 200 parasites per microlitre was identified. Conclusions The results suggest that despite extreme sequence variation, diversity of PfHRP2 does not appear to be a major cause of RDT sensitivity variation. PMID:20470441
Sequence variations in RepMP2/3 and RepMP4 elements reveal intragenomic homologous DNA recombination events in Mycoplasma pneumoniae.

PubMed

Spuesens, Emiel B M; Oduber, Minoushka; Hoogenboezem, Theo; Sluijter, Marcel; Hartwig, Nico G; van Rossum, Annemarie M C; Vink, Cornelis

2009-07-01

The gene encoding major adhesin protein P1 of Mycoplasma pneumoniae, MPN141, contains two DNA sequence stretches, designated RepMP2/3 and RepMP4, which display variation among strains. This variation allows strains to be differentiated into two major P1 genotypes (1 and 2) and several variants. Interestingly, multiple versions of the RepMP2/3 and RepMP4 elements exist at other sites within the bacterial genome. Because these versions are closely related in sequence, but not identical, it has been hypothesized that they have the capacity to recombine with their counterparts within MPN141, and thereby serve as a source of sequence variation of the P1 protein. In order to determine the variation within the RepMP2/3 and RepMP4 elements, both within the bacterial genome and among strains, we analysed the DNA sequences of all RepMP2/3 and RepMP4 elements within the genomes of 23 M. pneumoniae strains. Our data demonstrate that: (i) recombination is likely to have occurred between two RepMP2/3 elements in four of the strains, and (ii) all previously described P1 genotypes can be explained by inter-RepMP recombination events. Moreover, the difference between the two major P1 genotypes was reflected in all RepMP elements, such that subtype 1 and 2 strains can be differentiated on the basis of sequence variation in each RepMP element. This implies that subtype 1 and subtype 2 strains represent evolutionarily diverged strain lineages. Finally, a classification scheme is proposed in which the P1 genotype of M. pneumoniae isolates can be described in a sequence-based, universal fashion.
Burst muscle performance predicts the speed, acceleration, and turning performance of Anna’s hummingbirds

PubMed Central

Segre, Paolo S; Dakin, Roslyn; Zordan, Victor B; Dickinson, Michael H; Straw, Andrew D; Altshuler, Douglas L

2015-01-01

Despite recent advances in the study of animal flight, the biomechanical determinants of maneuverability are poorly understood. It is thought that maneuverability may be influenced by intrinsic body mass and wing morphology, and by physiological muscle capacity, but this hypothesis has not yet been evaluated because it requires tracking a large number of free flight maneuvers from known individuals. We used an automated tracking system to record flight sequences from 20 Anna's hummingbirds flying solo and in competition in a large chamber. We found that burst muscle capacity predicted most performance metrics. Hummingbirds with higher burst capacity flew with faster velocities, accelerations, and rotations, and they used more demanding complex turns. In contrast, body mass did not predict variation in maneuvering performance, and wing morphology predicted only the use of arcing turns and high centripetal accelerations. Collectively, our results indicate that burst muscle capacity is a key predictor of maneuverability. DOI: http://dx.doi.org/10.7554/eLife.11159.001 PMID:26583753
Child Development and Structural Variation in the Human Genome

ERIC Educational Resources Information Center

Zhang, Ying; Haraksingh, Rajini; Grubert, Fabian; Abyzov, Alexej; Gerstein, Mark; Weissman, Sherman; Urban, Alexander E.

2013-01-01

Structural variation of the human genome sequence is the insertion, deletion, or rearrangement of stretches of DNA sequence sized from around 1,000 to millions of base pairs. Over the past few years, structural variation has been shown to be far more common in human genomes than previously thought. Very little is currently known about the effects…
Cloning, characterization, expression and comparative analysis of pig Golgi membrane sphingomyelin synthase 1.

PubMed

Guillén, Natalia; Navarro, María A; Surra, Joaquín C; Arnal, Carmen; Fernández-Juan, Marta; Cebrián-Pérez, Jose Alvaro; Osada, Jesús

2007-02-15

Pig sphingomyelin synthase 1 (SMS1) cDNA was cloned, characterized and compared to the human ortholog. The porcine protein consists of 413 amino acids and displays 97% sequence identity with the human protein. A phylogenetic tree of proteins reveals that porcine SMS1 is more closely related to bovine and rodent proteins than to human. The measured protein mass was higher than the theoretical prediction based on the amino acid sequence, suggesting some kind of posttranslational modification. Quantification of tissue distribution by real-time RT-PCR showed that the gene was widely expressed, although levels varied considerably among organs. The cardiovascular system, especially the heart, showed the highest value of all the tissues studied. Regional differences in expression were observed in the central nervous system and intestinal tract. Analysis of hepatic SMS1 mRNA and protein expression following turpentine treatment revealed a progressive decrease in the former, paralleled by a decrease in the protein concentration. These findings suggest that the variation in expression among tissues may reflect different requirements for Golgi sphingomyelin for the specific function of each organ, and that the enzyme is regulated in response to turpentine-induced hepatic injury.
Effects of age, sex, and genotype on high-sensitivity metabolomic profiles in the fruit fly, Drosophila melanogaster

PubMed Central

Hoffman, Jessica M; Soltow, Quinlyn A; Li, Shuzhao; Sidik, Alfire; Jones, Dean P; Promislow, Daniel E L

2014-01-01

Researchers have used whole-genome sequencing and gene expression profiling to identify genes associated with age, in the hope of understanding the underlying mechanisms of senescence. But there is a substantial gap from variation in gene sequences and expression levels to variation in age or life expectancy. In an attempt to bridge this gap, here we describe the effects of age, sex, genotype, and their interactions on high-sensitivity metabolomic profiles in the fruit fly, Drosophila melanogaster. Among the 6800 features analyzed, we found that over one-quarter of all metabolites were significantly associated with age, sex, genotype, or their interactions, and multivariate analysis shows that individual metabolomic profiles are highly predictive of these traits. Using a metabolomic equivalent of gene set enrichment analysis, we identified numerous metabolic pathways that were enriched among metabolites associated with age, sex, and genotype, including pathways involving sugar and glycerophospholipid metabolism, neurotransmitters, amino acids, and the carnitine shuttle. Our results suggest that high-sensitivity metabolomic studies have excellent potential not only to reveal mechanisms that lead to senescence, but also to help us understand differences in patterns of aging among genotypes and between males and females. PMID:24636523
Selective sweep at the Drosophila melanogaster Suppressor of Hairless locus and its association with the In(2L)t inversion polymorphism.

PubMed Central

Depaulis, F; Brazier, L; Veuille, M

1999-01-01

The hitchhiking model of population genetics predicts that an allele favored by Darwinian selection can replace haplotypes from the same locus previously established at a neutral mutation-drift equilibrium. This process, known as "selective sweep," was studied by comparing molecular variation between the polymorphic In(2L)t inversion and the standard chromosome. Sequence variation was recorded at the Suppressor of Hairless (Su[H]) gene in an African population of Drosophila melanogaster. We found 47 nucleotide polymorphisms among 20 sequences of 1.2 kb. Neutrality tests were nonsignificant at the nucleotide level. However, these sites were strongly associated, because 290 out of 741 observed pairwise combinations between them were in significant linkage disequilibrium. We found only seven haplotypes, two occurring in the 9 In(2L)t chromosomes, and five in the 11 standard chromosomes, with no shared haplotype. Two haplotypes, one in each chromosome arrangement, made up two-thirds of the sample. This low haplotype diversity departed from neutrality in a haplotype test. This pattern supports a selective sweep hypothesis for the Su(H) chromosome region. PMID:10388820
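The association between polymorphic sites described in this abstract is commonly quantified with pairwise linkage disequilibrium statistics. The following is a minimal sketch, assuming phased biallelic haplotypes encoded as strings of "0" and "1"; the function name `pairwise_ld` and the toy data are hypothetical and are not taken from the study. As a purely arithmetic aside, 741 equals the number of unordered pairs among 39 sites; the abstract does not say which sites entered the test, so this is an observation only.

```python
from math import comb

def pairwise_ld(haplotypes, i, j):
    """Return (D, r2) for biallelic sites i and j across phased haplotypes.

    haplotypes: list of equal-length strings of "0"/"1" (toy encoding, assumed).
    """
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n          # allele frequency at site i
    p_j = sum(h[j] == "1" for h in haplotypes) / n          # allele frequency at site j
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j                                    # disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    r2 = d * d / denom if denom > 0 else 0.0                # squared correlation
    return d, r2

# Toy example: two perfectly associated sites.
toy = ["11", "11", "00", "00"]
d, r2 = pairwise_ld(toy, 0, 1)

# Number of unordered site pairs for a given number of sites.
pairs_among_39 = comb(39, 2)
```

A significance test (for example Fisher's exact test on the 2x2 haplotype table) would normally follow; it is omitted here to keep the sketch dependency-free.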
Whole exome sequencing to estimate alloreactivity potential between donors and recipients in stem cell transplantation

PubMed Central

Sampson, Juliana K.; Sheth, Nihar U.; Koparde, Vishal N.; Scalora, Allison F.; Serrano, Myrna G.; Lee, Vladimir; Roberts, Catherine H.; Jameson-Lee, Max; Ferreira-Gonzalez, Andrea; Manjili, Masoud H.; Buck, Gregory A.; Neale, Michael C.; Toor, Amir A.

2016-01-01

Summary Whole exome sequencing (WES) was performed on stem cell transplant donor-recipient (D-R) pairs to determine the extent of potential antigenic variation at a molecular level. In a small cohort of D-R pairs, a high frequency of sequence variation was observed between the donor and recipient exomes independent of human leucocyte antigen (HLA) matching. Nonsynonymous, nonconservative single nucleotide polymorphisms were approximately twice as frequent in HLA-matched unrelated, compared with related D-R pairs. When mapped to individual chromosomes, these polymorphic nucleotides were uniformly distributed across the entire exome. In conclusion, WES reveals extensive nucleotide sequence variation in the exomes of HLA-matched donors and recipients. PMID:24749631
Transposon variation by order during allopolyploidisation between Brassica oleracea and Brassica rapa.

PubMed

An, Z; Tang, Z; Ma, B; Mason, A S; Guo, Y; Yin, J; Gao, C; Wei, L; Li, J; Fu, D

2014-07-01

Although many studies have shown that transposable element (TE) activation is induced by hybridisation and polyploidisation in plants, much less is known on how different types of TE respond to hybridisation, and the impact of TE-associated sequences on gene function. We investigated the frequency and regularity of putative transposon activation for different types of TE, and determined the impact of TE-associated sequence variation on the genome during allopolyploidisation. We designed different types of TE primers and adopted the Inter-Retrotransposon Amplified Polymorphism (IRAP) method to detect variation in TE-associated sequences during the process of allopolyploidisation between Brassica rapa (AA) and Brassica oleracea (CC), and in successive generations of self-pollinated progeny. In addition, fragments with TE insertions were used to perform Blast2GO analysis to characterise the putative functions of the fragments with TE insertions. Ninety-two primers amplifying 548 loci were used to detect variation in sequences associated with four different orders of TE sequences. TEs could be classed in ascending frequency into LTR-REs, TIRs, LINEs, SINEs and unknown TEs. The frequency of novel variation (putative activation) detected for the four orders of TEs was highest from the F1 to F2 generations, and lowest from the F2 to F3 generations. Functional annotation of sequences with TE insertions showed that genes with TE insertions were mainly involved in metabolic processes and binding, and preferentially functioned in organelles. TE variation in our study severely disturbed the genetic compositions of the different generations, resulting in inconsistencies in genetic clustering. Different types of TE showed different patterns of variation during the process of allopolyploidisation. © 2013 German Botanical Society and The Royal Botanical Society of the Netherlands.
Reverse Transcription Errors and RNA-DNA Differences at Short Tandem Repeats.

PubMed

Fungtammasan, Arkarachai; Tomaszkiewicz, Marta; Campos-Sánchez, Rebeca; Eckert, Kristin A; DeGiorgio, Michael; Makova, Kateryna D

2016-10-01

Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. It has only been evaluated for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately 1 order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Prediction of constitutive A-to-I editing sites from human transcriptomes in the absence of genomic sequences

PubMed Central

2013-01-01

Background Adenosine-to-inosine (A-to-I) RNA editing is recognized as a cellular mechanism for generating both RNA and protein diversity. Inosine base pairs with cytidine during reverse transcription and therefore appears as guanosine during sequencing of cDNA. Current approaches of RNA editing identification largely depend on the comparison between transcriptomes and genomic DNA (gDNA) sequencing datasets from the same individuals, and it has been challenging to identify editing candidates from transcriptomes in the absence of gDNA information. Results We have developed a new strategy to accurately predict constitutive RNA editing sites from publicly available human RNA-seq datasets in the absence of relevant genomic sequences. Our approach establishes new parameters to increase the ability to map mismatches and to minimize sequencing/mapping errors and unreported genome variations. We identified 695 novel constitutive A-to-I editing sites that appear in clusters (named “editing boxes”) in multiple samples and which exhibit spatial and dynamic regulation across human tissues. Some of these editing boxes are enriched in non-repetitive regions lacking inverted repeat structures and contain an extremely high conversion frequency of As to Is. We validated a number of editing boxes in multiple human cell lines and confirmed that ADAR1 is responsible for the observed promiscuous editing events in non-repetitive regions, further expanding our knowledge of the catalytic substrate of A-to-I RNA editing by ADAR enzymes. Conclusions The approach we present here provides a novel way of identifying A-to-I RNA editing events by analyzing only RNA-seq datasets. This method has allowed us to gain new insights into RNA editing and should also aid in the identification of more constitutive A-to-I editing sites from additional transcriptomes. PMID:23537002
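Because inosine is read as guanosine in cDNA, an edited site shows up as an A-to-G mismatch between reads and a reference. The following is a minimal sketch of that signal only, assuming reads have already been aligned without gaps to a reference string; the function names, thresholds, and toy data are hypothetical, and this is not the authors' pipeline, which adds further filters for sequencing and mapping errors and for unreported genome variations.

```python
from collections import Counter

def a_site_profiles(reference, aligned_reads):
    """Collect observed bases at reference 'A' positions.

    aligned_reads: list of (start, sequence) ungapped alignments, 0-based (assumed format).
    Returns {position: Counter of observed bases}.
    """
    profiles = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < len(reference) and reference[pos] == "A":
                profiles.setdefault(pos, Counter())[base] += 1
    return profiles

def candidate_sites(reference, aligned_reads, min_depth=3, min_fraction=0.5):
    """Return sorted positions where the G fraction at a reference A passes both thresholds."""
    hits = []
    for pos, counts in a_site_profiles(reference, aligned_reads).items():
        depth = sum(counts.values())
        if depth >= min_depth and counts["G"] / depth >= min_fraction:
            hits.append(pos)
    return sorted(hits)

# Toy example: position 1 is read as G in three of four reads; position 3 is never changed.
reference = "CATAG"
reads = [(0, "CGTAG"), (0, "CGTAG"), (1, "GTAG"), (1, "ATAG")]
sites = candidate_sites(reference, reads)
```

A genomic SNP would produce the same A-to-G signal, which is why the abstract stresses recurrence across multiple samples and clustering of sites as additional evidence.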
Mining sequence variations in representative polyploid sugarcane germplasm accessions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yang, Xiping; Song, Jian; You, Qian

Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS), was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and the genome complexity reduction using restriction enzymes. To use GBS for high throughput genotyping highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variants callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, non-reference based universal network enabled analysis kit and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA. 
The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.
Mining sequence variations in representative polyploid sugarcane germplasm accessions

DOE PAGES

Yang, Xiping; Song, Jian; You, Qian; ...

2017-08-09

Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to the high polyploid level and complex genome of sugarcane, it has been a huge challenge to investigate genomic sequence variations, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS), was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next generation sequencing (NGS) and the genome complexity reduction using restriction enzymes. To use GBS for high throughput genotyping highly polyploid sugarcane, the GBS analysis pipelines in 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variants callers, and sequence depth for single nucleotide polymorphism (SNP) filtering. By using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, non-reference based universal network enabled analysis kit and Stacks de novo called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentages of single dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much more than the portions of multiple dosage SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationship among the 14 accessions. The results showed that the divergence time between the Erianthus genus and the Saccharum genus was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors ranging from 0.19 to 1.65 MYA. 
The GBS pipelines including the reference sequences, alignment methods, sequence variant callers, and sequence depth were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.
Deep sequencing reveals cell-type-specific patterns of single-cell transcriptome variation.

PubMed

Dueck, Hannah; Khaladkar, Mugdha; Kim, Tae Kyung; Spaethling, Jennifer M; Francis, Chantal; Suresh, Sangita; Fisher, Stephen A; Seale, Patrick; Beck, Sheryl G; Bartfai, Tamas; Kuhn, Bernhard; Eberwine, James; Kim, Junhyong

2015-06-09

Differentiation of metazoan cells requires execution of different gene expression programs, but recent single-cell transcriptome profiling has revealed considerable variation among cells of seemingly identical phenotype. This brings into question the relationship between transcriptome states and cell phenotypes. Additionally, single-cell transcriptomics presents unique analysis challenges that need to be addressed to answer this question. We present high-quality, deep read-depth single-cell RNA sequencing for 91 cells from five mouse tissues and 18 cells from two rat tissues, along with 30 control samples of bulk RNA diluted to single-cell levels. We find that transcriptomes differ globally across tissues with regard to the number of genes expressed, the average expression patterns, and within-cell-type variation patterns. We develop methods to filter genes for reliable quantification and to calibrate biological variation. All cell types include genes with high variability in expression, in a tissue-specific manner. We also find evidence that single-cell variability of neuronal genes in mice is correlated with that in rats, consistent with the hypothesis that levels of variation may be conserved. Single-cell RNA-sequencing data provide a unique view of transcriptome function; however, careful analysis is required in order to use single-cell RNA-sequencing measurements for this purpose. Technical variation must be considered in single-cell RNA-sequencing studies of expression variation. For a subset of genes, biological variability within each cell type appears to be regulated in order to perform dynamic functions, rather than reflecting solely molecular noise.
Sequence variations of the alpha-globin genes: scanning of high CG content genes with DHPLC and DG-DGGE.

PubMed

Lacerra, Giuseppina; Fiorito, Mirella; Musollino, Gennaro; Di Noce, Francesca; Esposito, Maria; Nigro, Vincenzo; Gaudiano, Carlo; Carestia, Clementina

2004-10-01

The alpha-globin chains are encoded by two duplicated genes (HBA2 and HBA1, 5'-3') showing overall sequence homology >96% and average CG content >60%. alpha-Thalassemia, the most prevalent worldwide autosomal recessive disorder, is a hereditary anemia caused by sequence variations of these genes in about 25% of carriers. We evaluated the overall sensitivity and suitability of DHPLC and DG-DGGE in scanning both the alpha-globin genes by carrying out a retrospective analysis of 19 variant alleles in 29 genotypes. The HBA2 alleles c.1A>G, c.79G>A, and c.281T>G, and the HBA1 allele c.475C>A were new. Three pathogenic sequence variations were associated in cis with nonpathogenic variations in all families studied; they were the HBA2 variation c.2T>C associated with c.-24C>G, and the HBA2 variations c.391G>C and c.427T>C, both associated with c.565G>A. We set up original experimental conditions for DHPLC and DG-DGGE and analyzed 10 normal subjects, 46 heterozygotes, seven homozygotes, seven compound heterozygotes, and six compound heterozygotes for a hybrid gene. Both the methodologies gave reproducible results and no false-positive was detected. DHPLC showed 100% sensitivity and DG-DGGE nearly 90%. About 100% of the sequence from the cap site to the polyA addition site could be scanned by DHPLC, about 87% by DG-DGGE. It is noteworthy that the three most common pathogenic sequence variations (HBA2 alleles c.2T>C, c.95+2_95+6del, and c.523A>G) were unambiguously detected by both the methodologies. Genotype diagnosis must be confirmed with PCR sequencing of single amplicons or with an allele-specific method. This study can be helpful for scanning genes with high CG content and offers a model suitable for duplicated genes with high homology. Copyright 2004 Wiley-Liss, Inc.
GASP: Gapped Ancestral Sequence Prediction for proteins

PubMed Central

Edwards, Richard J; Shields, Denis C

2004-01-01

Background The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments. Results Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy. Conclusions GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike. PMID:15350199

Read clouds uncover variation in complex regions of the human genome

PubMed Central

Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E.; West, Robert; Sidow, Arend; Batzoglou, Serafim

2015-01-01

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies. PMID:26286554
Variation analysis of the severe acute respiratory syndrome coronavirus putative non-structural protein 2 gene and construction of three-dimensional model.

PubMed

Lu, Jia-hai; Zhang, Ding-mei; Wang, Guo-ling; Guo, Zhong-min; Zhang, Chuan-hai; Tan, Bing-yan; Ouyang, Li-ping; Lin, Li; Liu, Yi-min; Chen, Wei-qing; Ling, Wen-hua; Yu, Xin-bing; Zhong, Nan-shan

2005-05-05

The rapid transmission and high mortality rate made severe acute respiratory syndrome (SARS) a global threat for which no efficacious therapy is currently available. Without sufficient knowledge about the SARS coronavirus (SARS-CoV), it is impossible to define candidate anti-SARS targets. The putative non-structural protein 2 (nsp2) (3CL(pro), following the nomenclature by Gao et al, also known as nsp5 in Snijder et al) of SARS-CoV plays an important role in viral transcription and replication, and is an attractive target for anti-SARS drug development, so we carried out this study to gain insight into the putative polymerase nsp2 of the SARS-CoV Guangdong (GD) strain. The SARS-CoV strain was isolated from a SARS patient in Guangdong, China, and cultured in Vero E6 cells. The nsp2 gene was amplified by reverse transcription-polymerase chain reaction (RT-PCR) and cloned into the eukaryotic expression vector pCI-neo (pCI-neo/nsp2). The recombinant eukaryotic expression vector pCI-neo/nsp2 was then transfected into COS-7 cells using lipofectin reagent to express the nsp2 protein. The expressed SARS-CoV nsp2 protein was analyzed by 7% sodium dodecylsulfate polyacrylamide gel electrophoresis (SDS-PAGE). The nucleotide sequence and protein sequence of GD nsp2 were compared with those of other SARS-CoV strains by nucleotide-nucleotide basic local alignment search tool (BLASTN) and protein-protein basic local alignment search tool (BLASTP) to investigate its variation trend during transmission. The secondary structure of the GD strain and that of other strains were predicted by Garnier-Osguthorpe-Robson (GOR) Secondary Structure Prediction. The Three-dimensional-PSSM Protein Fold Recognition (Threading) Server was employed to construct the three-dimensional model of the nsp2 protein. The putative polymerase nsp2 gene of the GD strain was amplified by RT-PCR. The eukaryotic expression vector (pCI-neo/nsp2) was constructed and expressed the protein in COS-7 cells successfully. 
The results of sequencing and sequence comparison with other SARS-CoV strains showed that the nsp2 gene was relatively conserved during transmission: a total of five base sites mutated in about 100 strains investigated. Three of these, in the early and middle phases, were synonymous mutations, and the other two base-site variations, in the late phase, resulted in amino acid substitutions and secondary structure changes. The three-dimensional structure of the nsp2 protein was successfully constructed. The results suggest that polymerase nsp2 was relatively stable during the epidemic. The amino acid and secondary structure changes may be important for viral infection. The fact that the majority of single nucleotide variations (SNVs) are predicted to be synonymous, together with the low mutation rate of the nsp2 gene during the epidemic, indicates that nsp2 is conserved and could be a target for anti-SARS drugs. The three-dimensional structure result indicates that the nsp2 protein of the GD strain is highly homologous with 3CL(pro) of the SARS-CoV Urbani strain, 3CL(pro) of transmissible gastroenteritis virus and 3CL(pro) of human coronavirus 229E strain, which further suggests that the nsp2 protein of the GD strain possesses the activity of 3CL(pro).
ACTG: novel peptide mapping onto gene models.

PubMed

Choi, Seunghyuk; Kim, Hyunwoo; Paek, Eunok

2017-04-15

In many proteogenomic applications, mapping peptide sequences onto genome sequences can be very useful, because it allows us to understand the origins of the gene products. Existing software tools either take the genomic position of a peptide start site as an input or assume that the peptide sequence exactly matches the coding sequence of a given gene model. In the case of novel peptides resulting from genomic variations, especially structural variations such as alternative splicing, these existing tools cannot be directly applied unless users supply information about the variant, either its genomic position or its transcription model. Mapping potentially novel peptides to genome sequences, while allowing certain genomic variations, requires introducing novel gene models when aligning peptide sequences to gene structures. We have developed a new tool called ACTG (Amino aCids To Genome), which maps peptides to the genome, assuming all possible single exon skipping, junction variation allowing three edit distances from the original splice sites, exon extension and frame shift. In addition, it can also consider SNVs (single nucleotide variations) during the mapping phase if a user provides a VCF (variant call format) file as an input. Available at http://prix.hanyang.ac.kr/ACTG/search.jsp . Contact: eunokpaek@hanyang.ac.kr. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
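The idea of introducing novel gene models while mapping can be illustrated with a minimal Python sketch. It is not the ACTG implementation: it assumes a hypothetical gene given as a list of forward-strand exon sequences, considers only the annotated model plus every single internal exon skip, and searches the three forward reading frames. Junction variation, exon extension, frame shifts and VCF-supplied SNVs are left out, and all function names are illustrative.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def candidate_transcripts(exons):
    """Yield (label, sequence) for the annotated model and each single exon skip."""
    yield "annotated", "".join(exons)
    for skip in range(1, len(exons) - 1):  # internal exons only
        kept = (exon for i, exon in enumerate(exons) if i != skip)
        yield f"skip_exon_{skip + 1}", "".join(kept)


def map_peptide(peptide, exons):
    """Return (model label, reading frame, residue offset) for every match."""
    hits = []
    for label, sequence in candidate_transcripts(exons):
        for frame in range(3):
            offset = translate(sequence[frame:]).find(peptide)
            if offset != -1:
                hits.append((label, frame, offset))
    return hits


# Hypothetical three-exon gene; the peptide AKF only appears when exon 2 is skipped.
exons = ["ATGGCT", "GGGGGG", "AAATTT"]
print(map_peptide("AKF", exons))
```

A peptide that matches only a skipped-exon model is the kind of case that tools requiring an exact match to the annotated coding sequence would miss.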
Discrete sequence prediction and its applications

NASA Technical Reports Server (NTRS)

Laird, Philip

1992-01-01

Learning from experience to predict sequences of discrete symbols is a fundamental problem in machine learning with many applications. We apply sequence prediction using a simple and practical sequence-prediction algorithm, called TDAG. The TDAG algorithm is first tested by comparing its performance with some common data compression algorithms. Then it is adapted to the detailed requirements of dynamic program optimization, with excellent results.
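The abstract does not describe how TDAG works internally, so the following Python sketch is not TDAG. It is a generic order-k context predictor, offered only to make the task concrete: learn from the symbols seen so far, then predict the next one from the longest context that has been observed. The class name, the `max_order` parameter and the back-off rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class ContextPredictor:
    """Predict the next symbol from counts of what followed recent contexts."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-symbol counts

    def update(self, history, symbol):
        """Record that `symbol` followed `history`, for every context length."""
        for k in range(min(self.max_order, len(history)) + 1):
            context = tuple(history[len(history) - k:])
            self.counts[context][symbol] += 1

    def predict(self, history):
        """Return the most frequent successor of the longest known context."""
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            successors = self.counts.get(context)
            if successors:
                return successors.most_common(1)[0][0]
        return None  # nothing has been learned yet


def train(predictor, sequence):
    """Feed a whole sequence to the predictor, one symbol at a time."""
    for i, symbol in enumerate(sequence):
        predictor.update(sequence[:i], symbol)


predictor = ContextPredictor(max_order=2)
train(predictor, "abcabcabc")
print(predictor.predict("ab"))  # the context ('a', 'b') was always followed by 'c'
```

The same interface (update on each observed symbol, predict from the current history) is what a compression-style benchmark would exercise, since a better predictor implies shorter codes.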
Identifying Preserved Storm Events on Beaches from Trenches and Cores

NASA Astrophysics Data System (ADS)

Wadman, H. M.; Gallagher, E. L.; McNinch, J.; Reniers, A.; Koktas, M.

2014-12-01

Recent research suggests that even small-scale variations in grain size in the shallow stratigraphy of sandy beaches can significantly influence large-scale morphology change. However, few quantitative studies of variations in shallow stratigraphic layers, as differentiated by variations in mean grain size, have been conducted, in no small part due to the difficulty of collecting undisturbed sediment cores in the energetic lower beach and swash zone. Due to this lack of quantitative stratigraphic grain size data, most coastal morphology models assume that uniform grain sizes dominate sandy beaches, allowing for little to no temporal or spatial variations in grain size heterogeneity. In a first-order attempt to quantify small-scale, temporal and spatial variations in beach stratigraphy, thirty-five vibracores were collected at the USACE Field Research Facility (FRF), Duck, NC, in March-April of 2014 using the FRF's Coastal Research and Amphibious Buggy (CRAB). Vibracores were collected at set locations along a cross-shore profile from the toe of the dune to a water depth of ~1 m in the surf zone. Vibracores were repeatedly collected from the same locations throughout a tidal cycle, as well as before and after a nor'easter event. In addition, two ~1.5 m deep trenches were dug in the cross-shore and along-shore directions (each ~14 m in length) after coring was completed to allow better interpretation of the stratigraphic sequences observed in the vibracores. The elevations of coherent stratigraphic layers, as revealed in vibracore-based fence diagrams and trench data, are used to relate specific observed stratigraphic sequences to individual storm events observed at the FRF. These data provide a first-order, quantitative examination of the small-scale temporal and spatial variability of shallow grain size along an open, sandy coastline. The data will be used to refine morphological model predictions to include variations in grain size and associated shallow stratigraphy.
Increase in membrane thickness during development compensates for eggshell thinning due to calcium uptake by the embryo in falcons

NASA Astrophysics Data System (ADS)

Castilla, Aurora M.; van Dongen, Stefan; Herrel, Anthony; Francesch, Amadeu; Martínez de Aragón, Juan; Malone, Jim; José Negro, Juan

2010-02-01

We compared membrane thickness of fully developed eggs with that of non-developed eggs in different endangered falcon taxa. To our knowledge, membrane thickness variation during development has never been examined before in falcons or any other wild bird. Yet, the egg membrane constitutes an important protective barrier for the developing embryo. Because eggshell thinning is a general process that occurs during bird development, caused by calcium uptake by the embryo, eggs are expected to be less protected and vulnerable to breakage near the end of development. Thus, egg membranes could play an important protective role in the later stages of development by getting relatively thicker. We used linear mixed models to explore the variation in membrane thickness (n = 378 eggs) in relation to developmental stage, taxon, female age, mass and identity (73 females), egg-laying sequence (105 clutches) and the study zone. Our results are consistent with the prediction that egg membranes are thicker in fully developed eggs than in non-developed eggs, suggesting that the increase in membrane thickness during development may compensate for eggshell thinning. In addition, our data show that thicker membranes are associated with larger, heavier and relatively wider eggs, as well as with eggs that had thinner eggshells. Egg-laying sequence, female age and the study zone did not explain the observed variation of membrane thickness in the falcon taxa studied. As we provide quantitative data on membrane thickness variation during development in falcons not subjected to contamination or food limitation (i.e. bred under captive conditions), our data may be used as a reference for studies on eggs from natural populations. Considering the large variation in membrane thickness, the multiple factors affecting it, and its importance in the protection of the embryo, we encourage other researchers to include measurements on membranes in studies exploring eggshell thickness variation.
Validation of Skeletal Muscle cis-Regulatory Module Predictions Reveals Nucleotide Composition Bias in Functional Enhancers

PubMed Central

Kwon, Andrew T.; Chou, Alice Yi; Arenillas, David J.; Wasserman, Wyeth W.

2011-01-01

We performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated C2C12 myotubes. A subset of 19 CRMs validated as functional in the assay. The rate of predictive success reveals striking limitations of computational regulatory sequence analysis methods for CRM discovery. Motif-based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences identifies nucleotide sequence composition as a potentially important characteristic to incorporate in future methods for improved predictive specificity. Muscle-related TFBSs predicted within the functional sequences display greater sequence conservation than non-TFBS flanking regions. Comparison with recent MyoD and histone modification ChIP-Seq data supports the validity of the functional regions. PMID:22144875
Colorectal Cancer and the Human Gut Microbiome: Reproducibility with Whole-Genome Shotgun Sequencing

PubMed Central

Hua, Xing; Zeller, Georg; Sunagawa, Shinichi; Voigt, Anita Y.; Hercog, Rajna; Goedert, James J.; Shi, Jianxin; Bork, Peer; Sinha, Rashmi

2016-01-01

Accumulating evidence indicates that the gut microbiota affects colorectal cancer development, but previous studies have varied in population, technical methods, and associations with cancer. Understanding these variations is needed for comparisons and for potential pooling across studies. Therefore, we performed whole-genome shotgun sequencing on fecal samples from 52 pre-treatment colorectal cancer cases and 52 matched controls from Washington, DC. We compared findings from a previously published 16S rRNA study to the metagenomics-derived taxonomy within the same population. In addition, metagenome-predicted genes, modules, and pathways in the Washington, DC cases and controls were compared to cases and controls recruited in France whose specimens were processed using the same platform. Associations between the presence of fecal Fusobacteria, Fusobacterium, and Porphyromonas with colorectal cancer detected by 16S rRNA were reproduced by metagenomics, whereas higher relative abundance of Clostridia in cancer cases based on 16S rRNA was merely borderline based on metagenomics. This demonstrated that within the same sample set, most, but not all taxonomic associations were seen with both methods. Considering significant cancer associations with the relative abundance of genes, modules, and pathways in a recently published French metagenomics dataset, statistically significant associations in the Washington, DC population were detected for four out of 10 genes, three out of nine modules, and seven out of 17 pathways. In total, colorectal cancer status in the Washington, DC study was associated with 39% of the metagenome-predicted genes, modules, and pathways identified in the French study. More within and between population comparisons are needed to identify sources of variation and disease associations that can be reproduced despite these variations. 
Future studies should have larger sample sizes or pool data across studies to have sufficient power to detect associations that are reproducible and significant after correction for multiple testing. PMID:27171425
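The pooled 39% figure follows directly from the three replication counts given in the abstract; a short Python check of the arithmetic (the variable names are illustrative only):

```python
# Replicated associations out of those reported in the French dataset,
# as stated in the abstract: (replicated, tested).
replicated = {"genes": (4, 10), "modules": (3, 9), "pathways": (7, 17)}

hits = sum(found for found, _ in replicated.values())
total = sum(tested for _, tested in replicated.values())
overall = 100 * hits / total

for feature, (found, tested) in replicated.items():
    print(f"{feature}: {found}/{tested} = {100 * found / tested:.0f}%")
print(f"overall: {hits}/{total} = {overall:.0f}%")
```

Pooling the counts gives 14 of 36, which rounds to the 39% reported.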
Chromosomal Mapping of Canine-Derived BAC Clones to the Red Fox and American Mink Genomes

PubMed Central

Vorobieva, Nadegda V.; Beklemisheva, Violetta R.; Johnson, Jennifer L.; Temnykh, Svetlana V.; Yudkin, Dmitry V.; Trut, Lyudmila N.; Andre, Catherine; Galibert, Francis; Aguirre, Gustavo D.; Acland, Gregory M.; Graphodatsky, Alexander S.

2009-01-01

High-quality sequencing of the dog (Canis lupus familiaris) genome has enabled enormous progress in genetic mapping of canine phenotypic variation. The red fox (Vulpes vulpes), another canid species, also exhibits a wide range of variation in coat color, morphology, and behavior. Although the fox genome has not yet been sequenced, canine genomic resources have been used to construct a meiotic linkage map of the red fox genome and begin genetic mapping in foxes. However, a more detailed gene-specific comparative map between the dog and fox genomes is required to establish gene order within homologous regions of dog and fox chromosomes and to refine breakpoints between homologous chromosomes of the 2 species. In the current study, we tested whether canine-derived gene–containing bacterial artificial chromosome (BAC) clones can be routinely used to build a gene-specific map of the red fox genome. Forty canine BAC clones were mapped to the red fox genome by fluorescence in situ hybridization (FISH). Each clone was uniquely assigned to a single fox chromosome, and the locations of 38 clones agreed with cytogenetic predictions. These results clearly demonstrate the utility of FISH mapping for construction of a whole-genome gene-specific map of the red fox. The further possibility of using canine BAC clones to map genes in the American mink (Mustela vison) genome was also explored. Much lower success was obtained for this more distantly related farm-bred species, although a few BAC clones were mapped to the predicted chromosomal locations. PMID:19546120
Chromosomal mapping of canine-derived BAC clones to the red fox and American mink genomes.

PubMed

Kukekova, Anna V; Vorobieva, Nadegda V; Beklemisheva, Violetta R; Johnson, Jennifer L; Temnykh, Svetlana V; Yudkin, Dmitry V; Trut, Lyudmila N; Andre, Catherine; Galibert, Francis; Aguirre, Gustavo D; Acland, Gregory M; Graphodatsky, Alexander S

2009-01-01

High-quality sequencing of the dog (Canis lupus familiaris) genome has enabled enormous progress in genetic mapping of canine phenotypic variation. The red fox (Vulpes vulpes), another canid species, also exhibits a wide range of variation in coat color, morphology, and behavior. Although the fox genome has not yet been sequenced, canine genomic resources have been used to construct a meiotic linkage map of the red fox genome and begin genetic mapping in foxes. However, a more detailed gene-specific comparative map between the dog and fox genomes is required to establish gene order within homologous regions of dog and fox chromosomes and to refine breakpoints between homologous chromosomes of the 2 species. In the current study, we tested whether canine-derived gene-containing bacterial artificial chromosome (BAC) clones can be routinely used to build a gene-specific map of the red fox genome. Forty canine BAC clones were mapped to the red fox genome by fluorescence in situ hybridization (FISH). Each clone was uniquely assigned to a single fox chromosome, and the locations of 38 clones agreed with cytogenetic predictions. These results clearly demonstrate the utility of FISH mapping for construction of a whole-genome gene-specific map of the red fox. The further possibility of using canine BAC clones to map genes in the American mink (Mustela vison) genome was also explored. Much lower success was obtained for this more distantly related farm-bred species, although a few BAC clones were mapped to the predicted chromosomal locations.
Variation of clinical expression in patients with Stargardt dystrophy and sequence variations in the ABCR gene.

PubMed

Fishman, G A; Stone, E M; Grover, S; Derlacki, D J; Haines, H L; Hockey, R R

1999-04-01

To report the spectrum of ophthalmic findings in patients with Stargardt dystrophy or fundus flavimaculatus who have a specific sequence variation in the ABCR gene. Twenty-nine patients with Stargardt dystrophy or fundus flavimaculatus from different pedigrees were identified with possible disease-causing sequence variations in the ABCR gene from a group of 66 patients who were screened for sequence variations in this gene. Patients underwent a routine ocular examination, including slitlamp biomicroscopy and a dilated fundus examination. Fluorescein angiography was performed on 22 patients, and electroretinographic measurements were obtained on 24 of 29 patients. Kinetic visual fields were measured with a Goldmann perimeter in 26 patients. Single-strand conformation polymorphism analysis and DNA sequencing were used to identify variations in coding sequences of the ABCR gene. Three clinical phenotypes were observed among these 29 patients. In phenotype I, 9 of 12 patients had a sequence change in exon 42 of the ABCR gene in which the amino acid glutamic acid was substituted for glycine (Gly1961Glu). In only 4 of these 9 patients was a second possible disease-causing mutation found on the other ABCR allele. In addition to an atrophic-appearing macular lesion, phenotype I was characterized by localized perifoveal yellowish white flecks, the absence of a dark choroid, and normal electroretinographic amplitudes. Phenotype II consisted of 10 patients who showed a dark choroid and more diffuse yellowish white flecks in the fundus. None exhibited the Gly1961Glu change. Phenotype III consisted of 7 patients who showed extensive atrophic-appearing changes of the retinal pigment epithelium. Electroretinographic cone and rod amplitudes were reduced. One patient showed the Gly1961Glu change. A wide variation in clinical phenotype can occur in patients with sequence changes in the ABCR gene. 
In individual patients, a certain phenotype seems to be associated with the presence of a Gly1961Glu change in exon 42 of the ABCR gene. The identification of correlations between specific mutations in the ABCR gene and clinical phenotypes will better facilitate the counseling of patients on their visual prognosis. This information will also likely be important for future therapeutic trials in patients with Stargardt dystrophy.
VaDiR: an integrated approach to Variant Detection in RNA.

PubMed

Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy

2018-02-01

Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered, but its cost currently limits its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
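The precision/recall trade-off between caller combinations can be sketched in Python with plain set operations. This is not the VaDiR code: the variant tuples are invented toy data, and the looser rule shown (variants called by both MuTect2 and SNPiR) is only one possible reading of the second combination, which the abstract does not spell out.

```python
def precision_recall(called, truth):
    """Precision and recall of a call set against DNA-validated variants."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy variants as (chromosome, position, ref, alt); all values are made up.
A, B, C, D = ("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), \
             ("chr2", 75, "G", "A"), ("chr2", 900, "T", "C")
E, F, G = ("chr3", 10, "A", "T"), ("chr3", 55, "C", "G"), ("chr4", 5, "G", "T")

snpir = {A, B, C, D}
rvboost = {A, B, E}
mutect2 = {A, B, C, D, F}
truth = {A, B, C, G}  # hypothetical variants confirmed at the DNA level

tier1 = snpir & rvboost & mutect2  # called by all three tools
looser = mutect2 & snpir           # assumed looser rule: two specific callers agree

print("Tier 1:", precision_recall(tier1, truth))
print("Looser:", precision_recall(looser, truth))
```

Requiring agreement of all three callers keeps only the best-supported calls (high precision, lower recall), while relaxing the rule recovers more true variants at the cost of admitting some false ones.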
Intra-isolate genome variation in arbuscular mycorrhizal fungi persists in the transcriptome.

PubMed

Boon, E; Zimmerman, E; Lang, B F; Hijri, M

2010-07-01

Arbuscular mycorrhizal fungi (AMF) are heterokaryotes with an unusual genetic makeup. Substantial genetic variation occurs among nuclei within a single mycelium or isolate. AMF reproduce through spores that contain varying fractions of this heterogeneous population of nuclei. It is not clear whether this genetic variation on the genome level actually contributes to the AMF phenotype. To investigate the extent to which polymorphisms in nuclear genes are transcribed, we analysed the intra-isolate genomic and cDNA sequence variation of two genes, the large subunit ribosomal RNA (LSU rDNA) of Glomus sp. DAOM-197198 (previously known as G. intraradices) and the POL1-like sequence (PLS) of Glomus etunicatum. For both genes, we find high sequence variation at the genome and transcriptome level. Reconstruction of LSU rDNA secondary structure shows that all variants are functional. Patterns of PLS sequence polymorphism indicate that there is one functional gene copy, PLS2, which is preferentially transcribed, and one gene copy, PLS1, which is a pseudogene. This is the first study that investigates AMF intra-isolate variation at the transcriptome level. In conclusion, it is possible that, in AMF, multiple nuclear genomes contribute to a single phenotype.
Whole-Genome Sequence Variation among Multiple Isolates of Pseudomonas aeruginosa

PubMed Central

Spencer, David H.; Kas, Arnold; Smith, Eric E.; Raymond, Christopher K.; Sims, Elizabeth H.; Hastings, Michele; Burns, Jane L.; Kaul, Rajinder; Olson, Maynard V.

2003-01-01

Whole-genome shotgun sequencing was used to study the sequence variation of three Pseudomonas aeruginosa isolates, two from clonal infections of cystic fibrosis patients and one from an aquatic environment, relative to the genomic sequence of reference strain PAO1. The majority of the PAO1 genome is represented in these strains; however, at least three prominent islands of PAO1-specific sequence are apparent. Conversely, ∼10% of the sequencing reads derived from each isolate fail to align with the PAO1 backbone. While average sequence variation among all strains is roughly 0.5%, regions of pronounced differences were evident in whole-genome scans of nucleotide diversity. We analyzed two such divergent loci, the pyoverdine and O-antigen biosynthesis regions, by complete resequencing. A thorough analysis of isolates collected over time from one of the cystic fibrosis patients revealed independent mutations resulting in the loss of O-antigen synthesis alternating with a mucoid phenotype. Overall, we conclude that most of the PAO1 genome represents a core P. aeruginosa backbone sequence while the strains addressed in this study possess additional genetic material that accounts for at least 10% of their genomes. Approximately half of these additional sequences are novel. PMID:12562802
Genetic variability among Schistosoma japonicum isolates from the Philippines, Japan and China revealed by sequence analysis of three mitochondrial genes.

PubMed

Chen, Fen; Li, Juan; Sugiyama, Hiromu; Zhou, Dong-Hui; Song, Hui-Qun; Zhao, Guang-Hui; Zhu, Xing-Quan

2015-02-01

The present study examined sequence variability in the mitochondrial (mt) protein-coding genes cytochrome b (cytb) and NADH dehydrogenase subunits 2 and 6 (nad2 and nad6) among 24 isolates of Schistosoma japonicum from different endemic regions in the Philippines, Japan and China. The complete cytb, nad2 and nad6 genes were amplified and sequenced separately from individual schistosomes. Sequence variations for isolates from the Philippines were 0-0.5% for cytb, 0-0.6% for nad2, and 0-0.9% for nad6. Variation was 0-0.5%, 0.1-0.8% and 0-0.7% for the corresponding genes in schistosome samples from mainland China. For worms from Japan, genetic variations were 0-0.2%, 0.1-0.2% and 0 for the three genes, respectively. Sequence variations were 0-1.0%, 0-1.8% and 0-1.1% for cytb, nad2 and nad6, respectively, among schistosome isolates from different geographical strains in the Philippines, Japan and China. Among the three countries, the lowest sequence variation in the three mtDNA genes was found between isolates from mainland China and the Philippines, and the highest between Japan and the Philippines. Phylogenetic analyses based on the combined sequences of cytb, nad2 and nad6 revealed that all isolates from the Philippines clustered together as sister to samples from Yunnan and Zhejiang provinces in China, while isolates from Yamanashi in Japan formed a solitary clade. These results demonstrate the usefulness of the combined three mtDNA sequences for studying genetic diversity and population structure among S. japonicum isolates from the Philippines, China and Japan.
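Ranges such as "0-0.5%" are typically the minimum and maximum pairwise difference within a group of aligned sequences. The abstract does not state which distance measure was used, so the Python sketch below assumes the simplest one, the uncorrected proportion of differing sites, and uses invented toy sequences; function names are illustrative.

```python
from itertools import combinations


def percent_difference(a, b):
    """Percentage of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped sites to compare")
    differences = sum(x != y for x, y in compared)
    return 100 * differences / len(compared)


def variation_range(sequences):
    """Smallest and largest pairwise percent difference within a group."""
    values = [percent_difference(a, b) for a, b in combinations(sequences, 2)]
    return min(values), max(values)


# Toy alignment of three isolates (not real cytb data).
isolates = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
low, high = variation_range(isolates)
print(f"{low:g}-{high:g}%")
```

Applying `variation_range` separately to each country's isolates, and then to all isolates pooled, would reproduce the within-country and overall ranges in the form the abstract reports them.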
Functional genomics to assess biological responses to marine pollution at physiological and evolutionary timescales: toward a vision of predictive ecotoxicology.

PubMed

Reid, Noah M; Whitehead, Andrew

2016-09-01

Marine pollution is ubiquitous, and is one of the key factors influencing contemporary marine biodiversity worldwide. To protect marine biodiversity, how do we surveil, document and predict the short- and long-term impacts of pollutants on at-risk species? Modern genomics tools offer high-throughput, information-rich and increasingly cost-effective approaches for characterizing biological responses to environmental stress, and are important tools within an increasingly sophisticated kit for surveilling and assessing impacts of pollutants on marine species. Through the lens of recent research in marine killifish, we illustrate how genomics tools may be useful for screening chemicals and pollutants for biological activity and for revealing specific mechanisms of action. The high dimensionality of transcriptomic responses enables their use as highly specific fingerprints of exposure, and these fingerprints can be used to diagnose environmental problems. We also emphasize that molecular pathways recruited to respond at physiological timescales are the same pathways that may be targets for natural selection during chronic exposure to pollutants. Gene complement and sequence variation in those pathways can be related to variation in sensitivity to environmental pollutants within and among species. Furthermore, allelic variation associated with evolved tolerance in those pathways could be tracked to estimate the pace of environmental health decline and recovery. We finish by integrating these paradigms into a vision of how genomics approaches could anchor a modernized framework for advancing the predictive capacity of environmental and ecotoxicological science. © The Author 2015. Published by Oxford University Press.
PANTHER-PSEP: predicting disease-causing genetic variants using position-specific evolutionary preservation.

PubMed

Tang, Haiming; Thomas, Paul D

2016-07-15

PANTHER-PSEP is a new software tool for predicting non-synonymous genetic variants that may play a causal role in human disease. Several previous variant pathogenicity prediction methods have been proposed that quantify evolutionary conservation among homologous proteins from different organisms. PANTHER-PSEP employs a related but distinct metric based on 'evolutionary preservation': homologous proteins are used to reconstruct the likely sequences of ancestral proteins at nodes in a phylogenetic tree, and the history of each amino acid can be traced back in time from its current state to estimate how long that state has been preserved in its ancestors. Here, we describe the PSEP tool, and assess its performance on standard benchmarks for distinguishing disease-associated from neutral variation in humans. On these benchmarks, PSEP outperforms not only previous tools that utilize evolutionary conservation, but also several highly used tools that include multiple other sources of information as well. For predicting pathogenic human variants, the trace back of course starts with a human 'reference' protein sequence, but the PSEP tool can also be applied to predicting deleterious or pathogenic variants in reference proteins from any of the ∼100 other species in the PANTHER database. PANTHER-PSEP is freely available on the web at http://pantherdb.org/tools/csnpScoreForm.jsp . Users can also download the command-line based tool at ftp://ftp.pantherdb.org/cSNP_analysis/PSEP/ . Contact: pdthomas@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
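The trace-back idea can be sketched in a few lines of Python. This is an illustration of the concept only, not the PSEP scoring code: it assumes the ancestral residues at one alignment position have already been reconstructed, that the lineage is supplied from the most recent ancestor back toward the root, and that branch lengths are in arbitrary time units. The function name and input layout are hypothetical.

```python
def preservation_time(current_residue, lineage):
    """Sum branch lengths back in time while the ancestral residue is unchanged.

    `lineage` is a list of (ancestral_residue, branch_length) pairs ordered
    from the most recent ancestor to the root; `branch_length` is the length
    of the branch connecting that ancestor to its descendant on the lineage.
    """
    total = 0.0
    for residue, branch_length in lineage:
        if residue != current_residue:
            break  # the state changed here, so preservation ends
        total += branch_length
    return total


# Toy lineage for one position of a reference protein (values are invented).
lineage = [("R", 90.0), ("R", 200.0), ("K", 150.0), ("R", 300.0)]
print(preservation_time("R", lineage))
```

Note that the trace stops at the first change: the older 'R' at the root does not count, because the state was not continuously preserved. A longer preservation time at a position would suggest that a substitution there is more likely to be deleterious.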
Cleavage of nucleic acids

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann D.; Dahlberg, James E.

2007-12-11

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.
Invasive cleavage of nucleic acids

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann D.; Dahlberg, James E.

1999-01-01

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.
Invasive cleavage of nucleic acids

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann D.; Dahlberg, James E.

2002-01-01

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.

Cleavage of nucleic acids

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann D.; Dahlberg, James E.

2010-11-09

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.
Cleavage of nucleic acids

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann D.; Dahlberg, James E.

2000-01-01

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.
Nucleic acid detection assays

DOEpatents

Prudent, James R.; Hall, Jeff G.; Lyamichev, Victor I.; Brow, Mary Ann; Dahlberg, James E.

2005-04-05

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The structure-specific nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof.
Mitochondrial DNA variation and phylogenetic relationships among five tuna species based on sequencing of D-loop region.

PubMed

Kumar, Girish; Kocour, Martin; Kunal, Swaraj Priyaranjan

2016-05-01

To assess DNA sequence variation and phylogenetic relationships among five tuna species (Auxis thazard, Euthynnus affinis, Katsuwonus pelamis, Thunnus tonggol, and T. albacares) representing all four tuna genera, partial sequences of the mitochondrial DNA (mtDNA) D-loop region were analyzed. The estimated intra-specific sequence variation in the studied species was low, ranging from 0.027 to 0.080 [Kimura's two-parameter distance (K2P)], whereas inter-specific variation ranged from 0.049 to 0.491. The longtail tuna (T. tonggol) and yellowfin tuna (T. albacares) were found to share a close relationship (K2P = 0.049), while skipjack tuna (K. pelamis) was the most divergent of the studied species. Phylogenetic analysis using Maximum-Likelihood (ML) and Neighbor-Joining (NJ) methods supported the monophyletic origin of Thunnus species. Similarly, the phylogeny of Auxis and Euthynnus species substantiates their monophyly. However, the results showed an origin of K. pelamis distinct from the genus Thunnus as well as from Auxis and Euthynnus. Thus, the mtDNA D-loop region sequence data support the polyphyletic origin of tuna species.
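Kimura's two-parameter distance corrects the observed difference between two aligned sequences for multiple substitutions, treating transitions (A<->G, C<->T) and transversions separately: d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions. A minimal Python sketch of that formula follows; the toy sequences are invented and sites with gaps or ambiguous bases are simply skipped, which is an assumption rather than the study's stated procedure.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]  # skip gaps and ambiguity codes
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = sum(
        1 for x, y in pairs
        if x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)
    )
    transversions = sum(1 for x, y in pairs if x != y) - transitions
    p = transitions / len(pairs)    # proportion of transition sites
    q = transversions / len(pairs)  # proportion of transversion sites
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: the raw difference is 0.10, K2P is slightly larger.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

For closely related sequences the corrected distance stays near the raw proportion of differences, which is why small values such as 0.049 can be read almost directly as about 5% divergence, while large values such as 0.491 reflect a substantial correction.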
The Mouse Genomes Project: a repository of inbred laboratory mouse strain genomes.

PubMed

Adams, David J; Doran, Anthony G; Lilue, Jingtao; Keane, Thomas M

2015-10-01

The Mouse Genomes Project was initiated in 2009 with the goal of using next-generation sequencing technologies to catalogue molecular variation in the common laboratory mouse strains, and a selected set of wild-derived inbred strains. The initial sequencing and survey of sequence variation in 17 inbred strains was completed in 2011 and included a comprehensive catalogue of single nucleotide polymorphisms, short insertion/deletions, larger structural variants including their fine scale architecture and landscape of transposable element variation, and genomic sites subject to post-transcriptional alteration of RNA. From this beginning, the resource has expanded significantly to include 36 fully sequenced inbred laboratory mouse strains, a refined and updated data processing pipeline, and new variation querying and data visualisation tools which are available on the project's website (http://www.sanger.ac.uk/resources/mouse/genomes/). The focus of the project is now the completion of de novo assembled chromosome sequences and strain-specific gene structures for the core strains. We discuss how the assembled chromosomes will power comparative analysis, data access tools and future directions of mouse genetics.
Whole exome sequencing to estimate alloreactivity potential between donors and recipients in stem cell transplantation.

PubMed

Sampson, Juliana K; Sheth, Nihar U; Koparde, Vishal N; Scalora, Allison F; Serrano, Myrna G; Lee, Vladimir; Roberts, Catherine H; Jameson-Lee, Max; Ferreira-Gonzalez, Andrea; Manjili, Masoud H; Buck, Gregory A; Neale, Michael C; Toor, Amir A

2014-08-01

Whole exome sequencing (WES) was performed on stem cell transplant donor-recipient (D-R) pairs to determine the extent of potential antigenic variation at a molecular level. In a small cohort of D-R pairs, a high frequency of sequence variation was observed between the donor and recipient exomes independent of human leucocyte antigen (HLA) matching. Nonsynonymous, nonconservative single nucleotide polymorphisms were approximately twice as frequent in HLA-matched unrelated, compared with related D-R pairs. When mapped to individual chromosomes, these polymorphic nucleotides were uniformly distributed across the entire exome. In conclusion, WES reveals extensive nucleotide sequence variation in the exomes of HLA-matched donors and recipients. © 2014 John Wiley & Sons Ltd.
Bioinformatic Analyses of Unique (Orphan) Core Genes of the Genus Acidithiobacillus: Functional Inferences and Use As Molecular Probes for Genomic and Metagenomic/Transcriptomic Interrogation

PubMed Central

González, Carolina; Lazcano, Marcelo; Valdés, Jorge; Holmes, David S.

2016-01-01

Using phylogenomic and gene compositional analyses, five highly conserved gene families have been detected in the core genome of the phylogenetically coherent genus Acidithiobacillus of the class Acidithiobacillia. These core gene families are absent in the closest extant genus Thermithiobacillus tepidarius that subtends the Acidithiobacillus genus and roots the deepest in this class. The predicted proteins encoded by these core gene families are not detected by a BLAST search in the NCBI non-redundant database of more than 90 million proteins using a relaxed cut-off of 1.0e−5. None of the five families has a clear functional prediction. However, bioinformatic scrutiny, using pI prediction, motif/domain searches, cellular location predictions, genomic context analyses, and chromosome topology studies together with previously published transcriptomic and proteomic data, suggests that some may have functions associated with membrane remodeling during cell division perhaps in response to pH stress. Despite the high level of amino acid sequence conservation within each family, there is sufficient nucleotide variation of the respective genes to permit the use of the DNA sequences to distinguish different species of Acidithiobacillus, making them useful additions to the armamentarium of tools for phylogenetic analysis. Since the protein families are unique to the Acidithiobacillus genus, they can also be leveraged as probes to detect the genus in environmental metagenomes and metatranscriptomes, including industrial biomining operations, and acid mine drainage (AMD). PMID:28082953
Bioinformatic Analyses of Unique (Orphan) Core Genes of the Genus Acidithiobacillus: Functional Inferences and Use As Molecular Probes for Genomic and Metagenomic/Transcriptomic Interrogation.

PubMed

González, Carolina; Lazcano, Marcelo; Valdés, Jorge; Holmes, David S

2016-01-01

Using phylogenomic and gene compositional analyses, five highly conserved gene families have been detected in the core genome of the phylogenetically coherent genus Acidithiobacillus of the class Acidithiobacillia. These core gene families are absent in the closest extant genus Thermithiobacillus tepidarius that subtends the Acidithiobacillus genus and roots the deepest in this class. The predicted proteins encoded by these core gene families are not detected by a BLAST search in the NCBI non-redundant database of more than 90 million proteins using a relaxed cut-off of 1.0e−5. None of the five families has a clear functional prediction. However, bioinformatic scrutiny, using pI prediction, motif/domain searches, cellular location predictions, genomic context analyses, and chromosome topology studies together with previously published transcriptomic and proteomic data, suggests that some may have functions associated with membrane remodeling during cell division perhaps in response to pH stress. Despite the high level of amino acid sequence conservation within each family, there is sufficient nucleotide variation of the respective genes to permit the use of the DNA sequences to distinguish different species of Acidithiobacillus, making them useful additions to the armamentarium of tools for phylogenetic analysis. Since the protein families are unique to the Acidithiobacillus genus, they can also be leveraged as probes to detect the genus in environmental metagenomes and metatranscriptomes, including industrial biomining operations, and acid mine drainage (AMD).
Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping

PubMed Central

2011-01-01

Background: Integration of genomic variation with phenotypic information is an effective approach for uncovering genotype-phenotype associations. This requires an accurate identification of the different types of variation in individual genomes. Results: We report the integration of the whole genome sequence of a single Holstein Friesian bull with data from single nucleotide polymorphism (SNP) and comparative genomic hybridization (CGH) array technologies to determine a comprehensive spectrum of genomic variation. The performance of resequencing SNP detection was assessed by combining SNPs that were identified to be either in identity by descent (IBD) or in copy number variation (CNV) with results from SNP array genotyping. Coding insertions and deletions (indels) were found to be enriched for size in multiples of 3 and were located near the N- and C-termini of proteins. For larger indels, a combination of split-read and read-pair approaches proved to be complementary in finding different signatures. CNVs were identified on the basis of the depth of sequenced reads, and by using SNP and CGH arrays. Conclusions: Our results provide high resolution mapping of diverse classes of genomic variation in an individual bovine genome and demonstrate that structural variation surpasses sequence variation as the main component of genomic variability. Better accuracy of SNP detection was achieved with little loss of sensitivity when algorithms that implemented mapping quality were used. IBD regions were found to be instrumental for calculating resequencing SNP accuracy, while SNP detection within CNVs tended to be less reliable. CNV discovery was affected dramatically by platform resolution and coverage biases. The combined data for this study showed that at a moderate level of sequencing coverage, an ensemble of platforms and tools can be applied together to maximize the accurate detection of sequence and structural variants. PMID:22082336
Sequence alterations in RX in patients with microphthalmia, anophthalmia, and coloboma

PubMed Central

London, Nikolas J.S.; Kessler, Patricia; Williams, Bryan; Pauer, Gayle J.; Hagstrom, Stephanie A.

2009-01-01

Purpose: Microphthalmia, anophthalmia, and coloboma are ocular malformations with a significant genetic component. Rx is a homeobox gene expressed early in the developing retina and is important in retinal cell fate specification as well as stem cell proliferation. We screened a group of 24 patients with microphthalmia, coloboma, and/or anophthalmia for RX mutations. Methods: We used standard PCR and automated sequencing techniques to amplify and sequence each of the three RX exons. Patients’ charts were reviewed for clinical information. The pathologic impact of the identified sequence variant was analyzed by computational methods using PolyPhen and PMut algorithms. Results: In addition to the polymorphisms, we identified a single patient with coloboma having a heterozygous nucleotide change (g.197G>C) in the first exon that results in a missense mutation of arginine to threonine at amino acid position 66 (R66T). In silico analysis predicted R66T to be a deleterious mutation. Conclusions: Sequence variations in RX are uncommon in patients with congenital ocular malformations, but may play a role in disease pathogenesis. We observed a missense mutation in RX in a patient with a small, typical chorioretinal coloboma, and postulate that the mutation is responsible for the patient’s phenotype. PMID:19158959
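The arithmetic linking the nucleotide change at position 197 to amino acid 66 can be sketched as follows. The sketch assumes that position 1 is the first base of the start codon and that the coding sequence is uninterrupted up to position 197; neither is stated in the record. The actual codon is not given either, so AGG is used purely as a hypothetical arginine codon; a G>C change at the second codon position turns arginine into threonine only for AGA or AGG.

```python
# Minimal codon table covering only the codons used in this illustration.
CODON_TO_AA = {"AGA": "R", "AGG": "R", "ACA": "T", "ACG": "T"}


def codon_of(position):
    """Map a 1-based coding position to (codon number, position within codon)."""
    return (position - 1) // 3 + 1, (position - 1) % 3 + 1


def substitute(codon, pos_in_codon, alt_base):
    """Return the codon after replacing the base at the 1-based pos_in_codon."""
    bases = list(codon)
    bases[pos_in_codon - 1] = alt_base
    return "".join(bases)


codon_number, pos_in_codon = codon_of(197)  # codon 66, second base
reference = "AGG"                           # hypothetical arginine codon
mutant = substitute(reference, pos_in_codon, "C")
print(codon_number, pos_in_codon,
      CODON_TO_AA[reference], "->", CODON_TO_AA[mutant])
```

Under these assumptions, position 197 falls on the second base of codon 66, which is consistent with the R66T change reported in the abstract.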
Assessing randomness and complexity in human motion trajectories through analysis of symbolic sequences

PubMed Central

Peng, Zhen; Genewein, Tim; Braun, Daniel A.

2014-01-01

Complexity is a hallmark of intelligent behavior consisting both of regular patterns and random variation. To quantitatively assess the complexity and randomness of human motion, we designed a motor task in which we translated subjects' motion trajectories into strings of symbol sequences. In the first part of the experiment participants were asked to perform self-paced movements to create repetitive patterns, copy pre-specified letter sequences, and generate random movements. To investigate whether the degree of randomness can be manipulated, in the second part of the experiment participants were asked to perform unpredictable movements in the context of a pursuit game, where they received feedback from an online Bayesian predictor guessing their next move. We analyzed symbol sequences representing subjects' motion trajectories with five common complexity measures: predictability, compressibility, approximate entropy, Lempel-Ziv complexity, as well as effective measure complexity. We found that subjects' self-created patterns were the most complex, followed by drawing movements of letters and self-paced random motion. We also found that participants could change the randomness of their behavior depending on context and feedback. Our results suggest that humans can adjust both complexity and regularity in different movement types and contexts and that this can be assessed with information-theoretic measures of the symbolic sequences generated from movement trajectories. PMID:24744716
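One of the complexity measures named in this record, Lempel-Ziv complexity, can be illustrated with a simple phrase-counting sketch. The version below uses an LZ78-style parse (each new phrase is the shortest prefix of the remaining sequence not yet seen as a phrase); this is an assumption chosen for brevity and may differ from the exact variant the authors applied. The function name is hypothetical.

```python
def lz_phrase_count(symbols):
    """Count phrases in an LZ78-style parse of a symbol sequence.

    Regular sequences reuse earlier phrases and yield few, long phrases;
    irregular sequences yield many short ones.
    """
    seen = set()
    current = ""
    count = 0
    for symbol in symbols:
        current += symbol
        if current not in seen:
            # Shortest unseen phrase found: record it and start a new one.
            seen.add(current)
            count += 1
            current = ""
    if current:
        # Trailing fragment that repeats an earlier phrase still counts once.
        count += 1
    return count


print(lz_phrase_count("aaaaaaaa"))  # constant sequence
print(lz_phrase_count("abababab"))  # periodic sequence
print(lz_phrase_count("abcdefgh"))  # no repeated symbols
```

Raw counts depend on sequence length, so comparing sequences of different lengths would require some normalization, which this sketch does not attempt.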
Sequence-Based Genotyping of Expressed Swine Leukocyte Antigen Class I Alleles by Next-Generation Sequencing Reveal Novel Swine Leukocyte Antigen Class I Haplotypes and Alleles in Belgian, Danish, and Kenyan Fattening Pigs and Göttingen Minipigs.

PubMed

Sørensen, Maria Rathmann; Ilsøe, Mette; Strube, Mikael Lenz; Bishop, Richard; Erbs, Gitte; Hartmann, Sofie Bruun; Jungersen, Gregers

2017-01-01

The need for typing of the swine leukocyte antigen (SLA) is increasing with the expanded use of pigs as models for human diseases and organ-transplantation experiments, their use in infection studies, and for design of veterinary vaccines. Knowledge of SLA sequences is furthermore a prerequisite for the prediction of epitope binding in pigs. The low number of known SLA class I alleles and the limited knowledge of their prevalence in different pig breeds emphasize the need for efficient SLA typing methods. This study utilizes an SLA class I-typing method based on next-generation sequencing of barcoded PCR amplicons. The amplicons were generated with universal primers and predicted to resolve 68-88% of all known SLA class I alleles, depending on amplicon size. We analyzed the SLA profiles of 72 pigs from four different pig populations: Göttingen minipigs and Belgian, Kenyan, and Danish fattening pigs. We identified 67 alleles, nine previously described haplotypes and 15 novel haplotypes. The highest variation in SLA class I profiles was observed in the Danish pigs and the lowest among the Göttingen minipig population, which also had the highest percentage of homozygous individuals. Highlighting the fact that there are still numerous unknown SLA class I alleles to be discovered, a total of 12 novel SLA class I alleles were identified. Overall, we present new information about known and novel alleles and haplotypes and their prevalence in the tested pig populations.
Whole-Genome Sequences of Thirteen Isolates of Borrelia burgdorferi

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schutzer S. E.; Dunn J.; Fraser-Liggett, C. M.

2011-02-01

Borrelia burgdorferi is a causative agent of Lyme disease in North America and Eurasia. The first complete genome sequence of B. burgdorferi strain 31, available for more than a decade, has assisted research on the pathogenesis of Lyme disease. Because a single genome sequence is not sufficient to understand the relationship between genotypic and geographic variation and disease phenotype, we determined the whole-genome sequences of 13 additional B. burgdorferi isolates that span the range of natural variation. These sequences should allow improved understanding of pathogenesis and provide a foundation for novel detection, diagnosis, and prevention strategies.
A novel mutation in TFL1 homolog affecting determinacy in cowpea (Vigna unguiculata).

PubMed

Dhanasekar, P; Reddy, K S

2015-02-01

Mutations in the widely conserved Arabidopsis Terminal Flower 1 (TFL1) gene and its homologs have been demonstrated to result in determinacy across genera, the knowledge of which is lacking in cowpea. Understanding the molecular events leading to determinacy of apical meristems could hasten development of cowpea varieties with suitable ideotypes. Isolation and characterization of a novel mutation in cowpea TFL1 homolog (VuTFL1) affecting determinacy is reported here for the first time. Cowpea TFL1 homolog was amplified using primers designed based on conserved sequences in related genera and sequence variation was analysed in three gamma ray-induced determinate mutants, their indeterminate parent "EC394763" and two indeterminate varieties. The analyses of sequence variation exposed a novel SNP distinguishing the determinate mutants from the indeterminate types. The non-synonymous point mutation in exon 4 at position 1,176 resulted from transversion of cytosine (C) to adenine (A) leading to an amino acid change (Pro-136 to His) in determinate mutants. The effect of the mutation on protein function and stability was predicted to be detrimental using different bioinformatics/computational tools. The functionally significant novel substitution mutation is hypothesized to affect determinacy in the cowpea mutants. Development of suitable regeneration protocols in this hitherto recalcitrant crop and subsequent complementation assay in mutants or over-expressing assay in parents could decisively conclude the role of the SNP in regulating determinacy in these cowpea mutants.
Distinguishing functional polymorphism from random variation in the sequences of >10,000 HLA-A, -B and -C alleles.

PubMed

Robinson, James; Guethlein, Lisbeth A; Cereb, Nezih; Yang, Soo Young; Norman, Paul J; Marsh, Steven G E; Parham, Peter

2017-06-01

HLA class I glycoproteins contain the functional sites that bind peptide antigens and engage lymphocyte receptors. Recently, clinical application of sequence-based HLA typing has uncovered an unprecedented number of novel HLA class I alleles. Here we define the nature and extent of the variation in 3,489 HLA-A, 4,356 HLA-B and 3,111 HLA-C alleles. This analysis required development of suites of methods, having general applicability, for comparing and analyzing large numbers of homologous sequences. At least three amino-acid substitutions are present at every position in the polymorphic α1 and α2 domains of HLA-A, -B and -C. A minority of positions have an incidence >1% for the 'second' most frequent nucleotide, comprising 70 positions in HLA-A, 85 in HLA-B and 54 in HLA-C. The majority of these positions have three or four alternative nucleotides. These positions were subject to positive selection and correspond to binding sites for peptides and receptors. Most alleles of HLA class I (>80%) are very rare, often identified in one person or family, and they differ by point mutation from older, more common alleles. These alleles with single nucleotide polymorphisms reflect the germ-line mutation rate. Their frequency predicts the human population harbors 8-9 million HLA class I variants. The common alleles of human populations comprise 42 core alleles, which represent all selected polymorphism, and recombinants that have assorted this polymorphism.
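The criterion described in this record, positions where the second most frequent nucleotide has an incidence above 1%, can be sketched for a set of aligned sequences as follows. The function name, the toy alignment, and the treatment of every column character as a nucleotide are assumptions for illustration; the authors' own suite of methods is not described in enough detail here to reproduce.

```python
from collections import Counter


def polymorphic_positions(alignment, min_second_freq=0.01):
    """Return 0-based columns whose second most frequent symbol exceeds
    min_second_freq as a fraction of all sequences.

    `alignment` is a list of equal-length strings (hypothetical input format).
    """
    n_sequences = len(alignment)
    positions = []
    for index, column in enumerate(zip(*alignment)):
        ranked = Counter(column).most_common()
        if len(ranked) >= 2 and ranked[1][1] / n_sequences > min_second_freq:
            positions.append(index)
    return positions


toy_alignment = ["ACGT", "ACGA", "ACGA", "TCGA"]
print(polymorphic_positions(toy_alignment))        # default 1% threshold
print(polymorphic_positions(toy_alignment, 0.30))  # stricter threshold
```

In the toy alignment the first and last columns each carry a minority nucleotide in one of four sequences (25%), so they pass the 1% threshold but not a 30% one.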
Distinguishing functional polymorphism from random variation in the sequences of >10,000 HLA-A, -B and -C alleles

PubMed Central

Cereb, Nezih; Yang, Soo Young; Marsh, Steven G. E.; Parham, Peter

2017-01-01

HLA class I glycoproteins contain the functional sites that bind peptide antigens and engage lymphocyte receptors. Recently, clinical application of sequence-based HLA typing has uncovered an unprecedented number of novel HLA class I alleles. Here we define the nature and extent of the variation in 3,489 HLA-A, 4,356 HLA-B and 3,111 HLA-C alleles. This analysis required development of suites of methods, having general applicability, for comparing and analyzing large numbers of homologous sequences. At least three amino-acid substitutions are present at every position in the polymorphic α1 and α2 domains of HLA-A, -B and -C. A minority of positions have an incidence >1% for the ‘second’ most frequent nucleotide, comprising 70 positions in HLA-A, 85 in HLA-B and 54 in HLA-C. The majority of these positions have three or four alternative nucleotides. These positions were subject to positive selection and correspond to binding sites for peptides and receptors. Most alleles of HLA class I (>80%) are very rare, often identified in one person or family, and they differ by point mutation from older, more common alleles. These alleles with single nucleotide polymorphisms reflect the germ-line mutation rate. Their frequency predicts the human population harbors 8–9 million HLA class I variants. The common alleles of human populations comprise 42 core alleles, which represent all selected polymorphism, and recombinants that have assorted this polymorphism. PMID:28650991
Preparation of Meloidogyne javanica near-isogenic lines virulent and avirulent against the tomato resistance gene Mi and preliminary analyses of the genetic variation between the two lines.

PubMed

Xu, Jian-Hua; Narabu, Takashi; Li, Hong-Mei; Fu, Peng

2002-01-01

Meloidogyne javanica, reproducing by mitotic parthenogenesis, is an economically important pathogen of a wide range of crops. A pair of near-isogenic lines virulent and avirulent toward the tomato resistance gene Mi were prepared for M. javanica by continuously selecting an avirulent population on the resistant tomato cultivar Momotaro over 19 generations. Random amplified polymorphic DNA (RAPD) analysis with 102 primers revealed that RAPD patterns were highly conserved between the virulent and avirulent lines, confirming that the two lines were genomically very similar. Nevertheless, with one of the primers a distinct polymorphic fragment, specific for the avirulent lines, was amplified. Southern hybridization results indicated that the polymorphic fragment and its homologs were deleted from the genome of the virulent line during the process of virulence acquisition. Sequence analysis and homology searches of public data bases, however, revealed no published sequences significantly similar to the sequence of the fragment, precluding a prediction of the potential function of the sequence. The successful preparation of the near-isogenic Mi-virulent and avirulent lines laid a firm foundation for the further identification and isolation of virulence-related genes in M. javanica.
Diff-seq: A high throughput sequencing-based mismatch detection assay for DNA variant enrichment and discovery

PubMed Central

Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin

2018-01-01

Much of the within-species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation. PMID:29361139
RSAT 2015: Regulatory Sequence Analysis Tools

PubMed Central

Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A.; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M.; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques

2015-01-01

RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. PMID:25904632
Characterization of Dermanyssus gallinae (Acarina: Dermanissydae) by sequence analysis of the ribosomal internal transcribed spacer regions.

PubMed

Potenza, L; Cafiero, M A; Camarda, A; La Salandra, G; Cucchiarini, L; Dachà, M

2009-10-01

In the present work, mites previously identified as Dermanyssus gallinae De Geer (Acari, Mesostigmata) using morphological keys were investigated by molecular tools. The complete internal transcribed spacer 1 (ITS1), 5.8S ribosomal DNA, and ITS2 region of the ribosomal DNA from mites were amplified and sequenced to examine the level of sequence variation and to explore the feasibility of using this region in the identification of this mite. Conserved primers located at the 3' end of the 18S and at the 5' start of the 28S rRNA genes were used first, and amplified fragments were sequenced. Sequence analyses showed no variation in the 5.8S and ITS2 regions, while slight intraspecific variations involving substitutions as well as deletions were concentrated in the ITS1 region. Based on the sequence analyses, a nested PCR of the ITS2 region followed by RFLP analyses has been set up in an attempt to provide a rapid molecular diagnostic tool for D. gallinae.

Mapping and phasing of structural variation in patient genomes using nanopore sequencing.

PubMed

Cretu Stancu, Mircea; van Roosmalen, Markus J; Renkens, Ivo; Nieboer, Marleen M; Middelkamp, Sjors; de Ligt, Joep; Pregno, Giulia; Giachino, Daniela; Mandrile, Giorgia; Espejo Valle-Inclan, Jose; Korzelius, Jerome; de Bruijn, Ewart; Cuppen, Edwin; Talkowski, Michael E; Marschall, Tobias; de Ridder, Jeroen; Kloosterman, Wigard P

2017-11-06

Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline, NanoSV. We demonstrate that nanopore long reads are superior to short reads with regard to detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine the parental origin of all de novo chromothripsis breakpoints and to resolve the structure of these complex rearrangements. Additionally, genome-wide surveillance of inherited SVs reveals novel variants, missed in short-read data sets, a large proportion of which are retrotransposon insertions. We provide a first exploration of patient genome sequencing with a nanopore sequencer and demonstrate the value of long-read sequencing in mapping and phasing of SVs for both clinical and research applications.
Comparative Analysis of the Genomes of Two Field Isolates of the Rice Blast Fungus Magnaporthe oryzae

PubMed Central

Li, Zhigang; Hu, Songnian; Yao, Nan; Dean, Ralph A.; Zhao, Wensheng; Shen, Mi; Zhang, Haiwang; Li, Chao; Liu, Liyuan; Cao, Lei; Xu, Xiaowen; Xing, Yunfei; Hsiang, Tom; Zhang, Ziding; Xu, Jin-Rong; Peng, You-Liang

2012-01-01

Rice blast caused by Magnaporthe oryzae is one of the most destructive diseases of rice worldwide. The fungal pathogen is notorious for its ability to overcome host resistance. To better understand its genetic variation in nature, we sequenced the genomes of two field isolates, Y34 and P131. In comparison with the previously sequenced laboratory strain 70-15, both field isolates had a similar genome size but slightly more genes. Sequences from the field isolates were used to improve genome assembly and gene prediction of 70-15. Although the overall genome structure is similar, a number of gene families that are likely involved in plant-fungal interactions are expanded in the field isolates. Genome-wide analysis of nonsynonymous to synonymous nucleotide substitution rates revealed that many infection-related genes underwent diversifying selection. The field isolates also have hundreds of isolate-specific genes and a number of isolate-specific gene duplication events. Functional characterization of randomly selected isolate-specific genes revealed that they play diverse roles, some of which affect virulence. Furthermore, each genome contains thousands of loci of transposon-like elements, but less than 30% of them are conserved among different isolates, suggesting active transposition events in M. oryzae. A total of approximately 200 genes were disrupted in these three strains by transposable elements. Interestingly, transposon-like elements tend to be associated with isolate-specific or duplicated sequences. Overall, our results indicate that gain or loss of unique genes, DNA duplication, gene family expansion, and frequent translocation of transposon-like elements are important factors in genome variation of the rice blast fungus. PMID:22876203
Molecular organization and phylogenetic analysis of 5S rDNA in crustaceans of the genus Pollicipes reveal birth-and-death evolution and strong purifying selection.

PubMed

Perina, Alejandra; Seoane, David; González-Tizón, Ana M; Rodríguez-Fariña, Fernanda; Martínez-Lage, Andrés

2011-10-17

In higher eukaryotes, the 5S ribosomal DNA (5S rDNA) is organized in tandem arrays with repeat units that consist of a transcribed region (5S) and a variable nontranscribed spacer (NTS). Until recently the 5S rDNA was thought to be subject to concerted evolution; however, in several taxa, sequence divergence levels between the 5S and the NTS were found to be higher than expected under this model. Accordingly, many studies have shown that birth-and-death processes and selection can drive the evolution of 5S rDNA. Analyses of 5S rDNA evolution have found several 5S rDNA types in the genome, with low levels of nucleotide variation in the 5S and a highly divergent spacer region. Molecular organization and nucleotide sequence of the 5S ribosomal DNA multigene family (5S rDNA) were investigated in three Pollicipes species in an evolutionary context. The nucleotide sequence variation revealed that several 5S rDNA variants occur in Pollicipes genomes. They are clustered in up to seven different types based on differences in their nontranscribed spacers (NTS). Five different units of 5S rDNA were characterized in P. pollicipes and two different units in P. elegans and P. polymerus. Analysis of these sequences showed that identical types were shared among species and that two pseudogenes were present. We predicted the secondary structure and characterized the upstream and downstream conserved elements. Phylogenetic analysis showed an among-species clustering pattern of 5S rDNA types. These results suggest that the evolution of Pollicipes 5S rDNA is driven by birth-and-death processes with strong purifying selection.
Molecular organization and phylogenetic analysis of 5S rDNA in crustaceans of the genus Pollicipes reveal birth-and-death evolution and strong purifying selection

PubMed Central

2011-01-01

Background: In higher eukaryotes, the 5S ribosomal DNA (5S rDNA) is organized in tandem arrays with repeat units that consist of a transcribed region (5S) and a variable nontranscribed spacer (NTS). Until recently the 5S rDNA was thought to be subject to concerted evolution; however, in several taxa, sequence divergence levels between the 5S and the NTS were found to be higher than expected under this model. Accordingly, many studies have shown that birth-and-death processes and selection can drive the evolution of 5S rDNA. Analyses of 5S rDNA evolution have found several 5S rDNA types in the genome, with low levels of nucleotide variation in the 5S and a highly divergent spacer region. Molecular organization and nucleotide sequence of the 5S ribosomal DNA multigene family (5S rDNA) were investigated in three Pollicipes species in an evolutionary context. Results: The nucleotide sequence variation revealed that several 5S rDNA variants occur in Pollicipes genomes. They are clustered in up to seven different types based on differences in their nontranscribed spacers (NTS). Five different units of 5S rDNA were characterized in P. pollicipes and two different units in P. elegans and P. polymerus. Analysis of these sequences showed that identical types were shared among species and that two pseudogenes were present. We predicted the secondary structure and characterized the upstream and downstream conserved elements. Phylogenetic analysis showed an among-species clustering pattern of 5S rDNA types. Conclusions: These results suggest that the evolution of Pollicipes 5S rDNA is driven by birth-and-death processes with strong purifying selection. PMID:22004418
Community detection in sequence similarity networks based on attribute clustering

DOE PAGES

Chowdhary, Janamejaya; Loeffler, Frank E.; Smith, Jeremy C.

2017-07-24

Networks are powerful tools for the presentation and analysis of interactions in multi-component systems. A commonly studied mesoscopic feature of networks is their community structure, which arises from grouping together similar nodes into one community and dissimilar nodes into separate communities. In this paper, the community structure of protein sequence similarity networks is determined with a new method: Attribute Clustering Dependent Communities (ACDC). Sequence similarity has hitherto typically been quantified by the alignment score or its expectation value. However, pair alignments with the same score or expectation value cannot thus be differentiated. To overcome this deficiency, the method constructs, for pair alignments, an extended alignment metric, the link attribute vector, which includes the score and other alignment characteristics. Rescaling components of the attribute vectors qualitatively identifies a systematic variation of sequence similarity within protein superfamilies. The problem of community detection is then mapped to clustering the link attribute vectors, selection of an optimal subset of links and community structure refinement based on the partition density of the network. ACDC-predicted communities are found to be in good agreement with gold standard sequence databases for which the "ground truth" community structures (or families) are known. ACDC is therefore a community detection method for sequence similarity networks based entirely on pair similarity information. A serial implementation of ACDC is available from https://cmb.ornl.gov/resources/developments
Community detection in sequence similarity networks based on attribute clustering

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chowdhary, Janamejaya; Loeffler, Frank E.; Smith, Jeremy C.

Networks are powerful tools for the presentation and analysis of interactions in multi-component systems. A commonly studied mesoscopic feature of networks is their community structure, which arises from grouping together similar nodes into one community and dissimilar nodes into separate communities. In this paper, the community structure of protein sequence similarity networks is determined with a new method: Attribute Clustering Dependent Communities (ACDC). Sequence similarity has hitherto typically been quantified by the alignment score or its expectation value. However, pair alignments with the same score or expectation value cannot thus be differentiated. To overcome this deficiency, the method constructs, for pair alignments, an extended alignment metric, the link attribute vector, which includes the score and other alignment characteristics. Rescaling components of the attribute vectors qualitatively identifies a systematic variation of sequence similarity within protein superfamilies. The problem of community detection is then mapped to clustering the link attribute vectors, selection of an optimal subset of links and community structure refinement based on the partition density of the network. ACDC-predicted communities are found to be in good agreement with gold standard sequence databases for which the "ground truth" community structures (or families) are known. ACDC is therefore a community detection method for sequence similarity networks based entirely on pair similarity information. A serial implementation of ACDC is available from https://cmb.ornl.gov/resources/developments
Identification and Resolution of Microdiversity through Metagenomic Sequencing of Parallel Consortia

PubMed Central

Maezato, Yukari; Wu, Yu-Wei; Romine, Margaret F.; Lindemann, Stephen R.

2015-01-01

To gain a predictive understanding of the interspecies interactions within microbial communities that govern community function, the genomic complement of every member population must be determined. Although metagenomic sequencing has enabled the de novo reconstruction of some microbial genomes from environmental communities, microdiversity confounds current genome reconstruction techniques. To overcome this issue, we performed short-read metagenomic sequencing on parallel consortia, defined as consortia cultivated under the same conditions from the same natural community with overlapping species composition. The differences in species abundance between the two consortia allowed reconstruction of near-complete (at an estimated >85% of gene complement) genome sequences for 17 of the 20 detected member species. Two Halomonas spp. indistinguishable by amplicon analysis were found to be present within the community. In addition, comparison of metagenomic reads against the consensus scaffolds revealed within-species variation for one of the Halomonas populations, one of the Rhodobacteraceae populations, and the Rhizobiales population. Genomic comparison of these representative instances of inter- and intraspecies microdiversity suggests differences in functional potential that may result in the expression of distinct roles in the community. In addition, isolation and complete genome sequence determination of six member species allowed an investigation into the sensitivity and specificity of genome reconstruction processes, demonstrating robustness across a wide range of sequence coverage (9× to 2,700×) within the metagenomic data set. PMID:26497460
Barley whole exome capture: a tool for genomic research in the genus Hordeum and beyond

PubMed Central

Mascher, Martin; Richmond, Todd A; Gerhardt, Daniel J; Himmelbach, Axel; Clissold, Leah; Sampath, Dharanya; Ayling, Sarah; Steuernagel, Burkhard; Pfeifer, Matthias; D'Ascenzo, Mark; Akhunov, Eduard D; Hedley, Pete E; Gonzales, Ana M; Morrell, Peter L; Kilian, Benjamin; Blattner, Frank R; Scholz, Uwe; Mayer, Klaus FX; Flavell, Andrew J; Muehlbauer, Gary J; Waugh, Robbie; Jeddeloh, Jeffrey A; Stein, Nils

2013-01-01

Advanced resources for genome-assisted research in barley (Hordeum vulgare) including a whole-genome shotgun assembly and an integrated physical map have recently become available. These have made possible studies that aim to assess genetic diversity or to isolate single genes by whole-genome resequencing and in silico variant detection. However, such an approach remains expensive given the 5 Gb size of the barley genome. Targeted sequencing of the mRNA-coding exome reduces barley genomic complexity more than 50-fold, thus dramatically reducing this heavy sequencing and analysis load. We have developed and employed an in-solution hybridization-based sequence capture platform to selectively enrich for a 61.6 megabase coding sequence target that includes predicted genes from the genome assembly of the cultivar Morex as well as publicly available full-length cDNAs and de novo assembled RNA-Seq consensus sequence contigs. The platform provides a highly specific capture with substantial and reproducible enrichment of targeted exons, both for cultivated barley and related species. We show that this exome capture platform provides a clear path towards a broader and deeper understanding of the natural variation residing in the mRNA-coding part of the barley genome and will thus constitute a valuable resource for applications such as mapping-by-sequencing and genetic diversity analyses. PMID:23889683
Identification and Resolution of Microdiversity through Metagenomic Sequencing of Parallel Consortia

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nelson, William C.; Maezato, Yukari; Wu, Yu-Wei

2015-10-23

To gain a predictive understanding of the interspecies interactions within microbial communities that govern community function, the genomic complement of every member population must be determined. Although metagenomic sequencing has enabled the de novo reconstruction of some microbial genomes from environmental communities, microdiversity confounds current genome reconstruction techniques. To overcome this issue, we performed short-read metagenomic sequencing on parallel consortia, defined as consortia cultivated under the same conditions from the same natural community with overlapping species composition. The differences in species abundance between the two consortia allowed reconstruction of near-complete (at an estimated >85% of gene complement) genome sequences for 17 of the 20 detected member species. Two Halomonas spp. indistinguishable by amplicon analysis were found to be present within the community. In addition, comparison of metagenomic reads against the consensus scaffolds revealed within-species variation for one of the Halomonas populations, one of the Rhodobacteraceae populations, and the Rhizobiales population. Genomic comparison of these representative instances of inter- and intraspecies microdiversity suggests differences in functional potential that may result in the expression of distinct roles in the community. In addition, isolation and complete genome sequence determination of six member species allowed an investigation into the sensitivity and specificity of genome reconstruction processes, demonstrating robustness across a wide range of sequence coverage (9× to 2,700×) within the metagenomic data set.
Experimental and statistical post-validation of positive example EST sequences carrying peroxisome targeting signals type 1 (PTS1)

PubMed Central

Lingner, Thomas; Kataya, Amr R. A.; Reumann, Sigrun

2012-01-01

We recently developed the first algorithms specifically for plants to predict proteins carrying peroxisome targeting signals type 1 (PTS1) from genome sequences. As validated experimentally, the prediction methods are able to correctly predict unknown peroxisomal Arabidopsis proteins and to infer novel PTS1 tripeptides. The high prediction performance is primarily determined by the large number and sequence diversity of the underlying positive example sequences, which mainly derived from EST databases. However, a few constructs remained cytosolic in experimental validation studies, indicating sequencing errors in some ESTs. To identify erroneous sequences, we validated subcellular targeting of additional positive example sequences in the present study. Moreover, we analyzed the distribution of prediction scores separately for each orthologous group of PTS1 proteins, which generally resembled normal distributions with group-specific mean values. The cytosolic sequences commonly represented outliers of low prediction scores and were located at the very tail of a fitted normal distribution. Three statistical methods for identifying outliers were compared in terms of sensitivity and specificity. Their combined application allows elimination of erroneous ESTs from positive example data sets. This new post-validation method will further improve the prediction accuracy of both PTS1 and PTS2 protein prediction models for plants, fungi, and mammals. PMID:22415050
Experimental and statistical post-validation of positive example EST sequences carrying peroxisome targeting signals type 1 (PTS1).

PubMed

Lingner, Thomas; Kataya, Amr R A; Reumann, Sigrun

2012-02-01

We recently developed the first algorithms specifically for plants to predict proteins carrying peroxisome targeting signals type 1 (PTS1) from genome sequences. As validated experimentally, the prediction methods are able to correctly predict unknown peroxisomal Arabidopsis proteins and to infer novel PTS1 tripeptides. The high prediction performance is primarily determined by the large number and sequence diversity of the underlying positive example sequences, which mainly derived from EST databases. However, a few constructs remained cytosolic in experimental validation studies, indicating sequencing errors in some ESTs. To identify erroneous sequences, we validated subcellular targeting of additional positive example sequences in the present study. Moreover, we analyzed the distribution of prediction scores separately for each orthologous group of PTS1 proteins, which generally resembled normal distributions with group-specific mean values. The cytosolic sequences commonly represented outliers of low prediction scores and were located at the very tail of a fitted normal distribution. Three statistical methods for identifying outliers were compared in terms of sensitivity and specificity. Their combined application allows elimination of erroneous ESTs from positive example data sets. This new post-validation method will further improve the prediction accuracy of both PTS1 and PTS2 protein prediction models for plants, fungi, and mammals.
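The idea of flagging sequences whose prediction scores fall in the low tail of a fitted normal distribution can be sketched as follows. This is only an illustration: the abstract does not name the three outlier methods that were compared, so the z-score rule, the function name `low_tail_outliers`, and the cut-off of -2.0 are all assumptions rather than the authors' procedure.

```python
from statistics import mean, stdev

def low_tail_outliers(scores, z_cutoff=-2.0):
    """Return indices of scores lying far below the group mean.

    A normal distribution is fitted to one orthologous group's scores
    via the sample mean and sample standard deviation; scores with a
    z-score below z_cutoff (a hypothetical threshold) are flagged.
    """
    if len(scores) < 2:
        return []  # a standard deviation needs at least two values
    mu = mean(scores)
    sigma = stdev(scores)
    if sigma == 0:
        return []  # all scores identical, so nothing stands out
    return [i for i, s in enumerate(scores) if (s - mu) / sigma < z_cutoff]
```

In this sketch each orthologous group would be passed separately, which mirrors the group-specific mean values described in the abstract.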
Human structural variation: mechanisms of chromosome rearrangements

PubMed Central

Weckselblatt, Brooke; Rudd, M. Katharine

2015-01-01

Chromosome structural variation (SV) is a normal part of variation in the human genome, but some classes of SV can cause neurodevelopmental disorders. Analysis of the DNA sequence at SV breakpoints can reveal mutational mechanisms and risk factors for chromosome rearrangement. Large-scale SV breakpoint studies have become possible recently owing to advances in next-generation sequencing (NGS) including whole-genome sequencing (WGS). These findings have shed light on complex forms of SV such as triplications, inverted duplications, insertional translocations, and chromothripsis. Sequence-level breakpoint data resolve SV structure and determine how genes are disrupted, fused, and/or misregulated by breakpoints. Recent improvements in breakpoint sequencing have also revealed non-allelic homologous recombination (NAHR) between paralogous long interspersed nuclear element (LINE) or human endogenous retrovirus (HERV) repeats as a cause of deletions, duplications, and translocations. This review covers the genomic organization of simple and complex constitutional SVs, as well as the molecular mechanisms of their formation. PMID:26209074
Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition

PubMed Central

2012-01-01

Background: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied by feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. Results: This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows high propensity of α-helix structure and thermophilic proteins to be soluble. Conclusions: The propensities of individual dipeptides to be soluble vary for proteins under altered experimental conditions. For accurately predicting protein solubility using SCM, it is better to customize the score card of dipeptide propensities by using a training data set under the same specified experimental conditions. The proposed method SCM with solubility scores and dipeptide propensities can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. Availability: The datasets used, source codes of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/. PMID:23282103
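The scoring step described above can be sketched in a few lines. The sketch assumes one plausible reading of "weighted sum": each dipeptide's propensity score is weighted by that dipeptide's fraction in the sequence. The function names and the `propensity` dictionary are hypothetical, and the genetic-algorithm optimization of the score card is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """Fraction of each dipeptide among the overlapping pairs of seq."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}

def solubility_score(seq, propensity):
    """Sum of propensity scores weighted by dipeptide composition.

    propensity maps dipeptide -> score (a hypothetical score card);
    dipeptides missing from it contribute zero.
    """
    comp = dipeptide_composition(seq)
    return sum(propensity.get(d, 0.0) * comp[d] for d in DIPEPTIDES)
```

A sequence would then be called soluble when its score exceeds a threshold chosen on the training data; that threshold is likewise an assumption of this sketch.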
Read clouds uncover variation in complex regions of the human genome.

PubMed

Bishara, Alex; Liu, Yuling; Weng, Ziming; Kashef-Haghighi, Dorna; Newburger, Daniel E; West, Robert; Sidow, Arend; Batzoglou, Serafim

2015-10-01

Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies. © 2015 Bishara et al.; Published by Cold Spring Harbor Laboratory Press.
Epigenetic Variance, Performing Cooperative Structure with Genetics, Is Associated with Leaf Shape Traits in Widely Distributed Populations of Ornamental Tree Prunus mume

PubMed Central

Ma, Kaifeng; Sun, Lidan; Cheng, Tangren; Pan, Huitang; Wang, Jia; Zhang, Qixiang

2018-01-01

Increasing evidence shows that epigenetics plays an important role in phenotypic variance. However, little is known about epigenetic variation in the important ornamental tree Prunus mume. We used amplified fragment length polymorphism (AFLP) and methylation-sensitive amplified polymorphism (MSAP) techniques, and association analysis and sequencing to investigate epigenetic variation and its relationships with genetic variance, environment factors, and traits. By performing leaf sampling, the relative total methylation level (29.80%) was detected in 96 accessions of P. mume. And the relative hemi-methylation level (15.77%) was higher than the relative full methylation level (14.03%). The epigenetic diversity (I∗ = 0.575, h∗ = 0.393) was higher than the genetic diversity (I = 0.484, h = 0.319). The cultivated population displayed greater epigenetic diversity than the wild populations in both southwest and southeast China. We found that epigenetic variance and genetic variance, and environmental factors performed cooperative structures, respectively. In particular, leaf length, width and area were positively correlated with relative full methylation level and total methylation level, indicating that the DNA methylation level played a role in trait variation. In total, 203 AFLP and 423 MSAP associated markers were detected and 68 of them were sequenced. Homologous analysis and functional prediction suggested that the candidate marker-linked genes were essential for leaf morphology development and metabolism, implying that these markers play critical roles in the establishment of leaf length, width, area, and ratio of length to width. PMID:29441078
Epigenetic Variance, Performing Cooperative Structure with Genetics, Is Associated with Leaf Shape Traits in Widely Distributed Populations of Ornamental Tree Prunus mume.

PubMed

Ma, Kaifeng; Sun, Lidan; Cheng, Tangren; Pan, Huitang; Wang, Jia; Zhang, Qixiang

2018-01-01

Increasing evidence shows that epigenetics plays an important role in phenotypic variance. However, little is known about epigenetic variation in the important ornamental tree Prunus mume. We used amplified fragment length polymorphism (AFLP) and methylation-sensitive amplified polymorphism (MSAP) techniques, and association analysis and sequencing to investigate epigenetic variation and its relationships with genetic variance, environment factors, and traits. By performing leaf sampling, the relative total methylation level (29.80%) was detected in 96 accessions of P. mume. And the relative hemi-methylation level (15.77%) was higher than the relative full methylation level (14.03%). The epigenetic diversity (I∗ = 0.575, h∗ = 0.393) was higher than the genetic diversity (I = 0.484, h = 0.319). The cultivated population displayed greater epigenetic diversity than the wild populations in both southwest and southeast China. We found that epigenetic variance and genetic variance, and environmental factors performed cooperative structures, respectively. In particular, leaf length, width and area were positively correlated with relative full methylation level and total methylation level, indicating that the DNA methylation level played a role in trait variation. In total, 203 AFLP and 423 MSAP associated markers were detected and 68 of them were sequenced. Homologous analysis and functional prediction suggested that the candidate marker-linked genes were essential for leaf morphology development and metabolism, implying that these markers play critical roles in the establishment of leaf length, width, area, and ratio of length to width.
A detailed gene expression study of the Miscanthus genus reveals changes in the transcriptome associated with the rejuvenation of spring rhizomes.

PubMed

Barling, Adam; Swaminathan, Kankshita; Mitros, Therese; James, Brandon T; Morris, Juliette; Ngamboma, Ornella; Hall, Megan C; Kirkpatrick, Jessica; Alabady, Magdy; Spence, Ashley K; Hudson, Matthew E; Rokhsar, Daniel S; Moose, Stephen P

2013-12-09

The Miscanthus genus of perennial C4 grasses contains promising biofuel crops for temperate climates. However, few genomic resources exist for Miscanthus, which limits understanding of its interesting biology and future genetic improvement. A comprehensive catalog of expressed sequences was generated from a variety of Miscanthus species and tissue types, with an emphasis on characterizing gene expression changes in spring compared to fall rhizomes. Illumina short read sequencing technology was used to produce transcriptome sequences from different tissues and organs during distinct developmental stages for multiple Miscanthus species, including Miscanthus sinensis, Miscanthus sacchariflorus, and their interspecific hybrid Miscanthus × giganteus. More than fifty billion base-pairs of Miscanthus transcript sequence were produced. Overall, 26,230 Sorghum gene models (i.e., ~96% of predicted Sorghum genes) had at least five Miscanthus reads mapped to them, suggesting that a large portion of the Miscanthus transcriptome is represented in this dataset. The Miscanthus × giganteus data was used to identify genes preferentially expressed in a single tissue, such as the spring rhizome, using Sorghum bicolor as a reference. Quantitative real-time PCR was used to verify examples of preferential expression predicted via RNA-Seq. Contiguous consensus transcript sequences were assembled for each species and annotated using InterProScan. Sequences from the assembled transcriptome were used to amplify genomic segments from a doubled haploid Miscanthus sinensis and from Miscanthus × giganteus to further disentangle the allelic and paralogous variations in genes. This large expressed sequence tag collection creates a valuable resource for the study of Miscanthus biology by providing detailed gene sequence information and tissue-preferred expression patterns. We have successfully generated a database of transcriptome assemblies and demonstrated its use in the study of genes of interest. Analysis of gene expression profiles revealed biological pathways that exhibit altered regulation in spring compared to fall rhizomes, which are consistent with their different physiological functions. The expression profiles of the subterranean rhizome provide a better understanding of the biological activities of the underground stem structures that are essential for perenniality and the storage or remobilization of carbon and nutrient resources.
Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins

NASA Astrophysics Data System (ADS)

Firman, Taylor; Ghosh, Kingshuk

2018-03-01

We present an analytical theory to compute conformations of heteropolymers (applicable to describing disordered proteins) as a function of temperature and charge sequence. The theory describes coil-globule transition for a given protein sequence when temperature is varied and has been benchmarked against the all-atom Monte Carlo simulation (using CAMPARI) of intrinsically disordered proteins (IDPs). In addition, the model quantitatively shows how subtle alterations of charge placement in the primary sequence, while maintaining the same charge composition, can lead to significant changes in conformation, even as drastic as a coil (swelled above a purely random coil) to globule (collapsed below a random coil) and vice versa. The theory provides insights on how to control (enhance or suppress) these changes by tuning the temperature (or solution condition) and charge decoration. As an application, we predict the distribution of conformations (at room temperature) of all naturally occurring IDPs in the DisProt database and notice significant size variation even among IDPs with a similar composition of positive and negative charges. Based on this, we provide a new diagram-of-states delineating the sequence-conformation relation for proteins in the DisProt database. Next, we study the effect of post-translational modification, e.g., phosphorylation, on IDP conformations. Modifications as little as two-site phosphorylation can significantly alter the size of an IDP with everything else being constant (temperature, salt concentration, etc.). However, not all possible modification sites have the same effect on protein conformations; there are certain "hot spots" that can cause maximal change in conformation. The location of these "hot spots" in the parent sequence can readily be identified by using a sequence charge decoration metric originally introduced by Sawle and Ghosh. The ability of our model to predict conformations (both expanded and collapsed states) of IDPs at a high-throughput level can provide valuable insights into the different mechanisms by which phosphorylation/charge mutation controls IDP function.
Genomic Prediction and Association Mapping of Curd-Related Traits in Gene Bank Accessions of Cauliflower.

PubMed

Thorwarth, Patrick; Yousef, Eltohamy A A; Schmid, Karl J

2018-02-02

Genetic resources are an important source of genetic variation for plant breeding. Genome-wide association studies (GWAS) and genomic prediction greatly facilitate the analysis and utilization of useful genetic diversity for improving complex phenotypic traits in crop plants. We explored the potential of GWAS and genomic prediction for improving curd-related traits in cauliflower (Brassica oleracea var. botrytis) by combining 174 randomly selected cauliflower gene bank accessions from two different gene banks. The collection was genotyped with genotyping-by-sequencing (GBS) and phenotyped for six curd-related traits at two locations and three growing seasons. A GWAS analysis based on 120,693 single-nucleotide polymorphisms identified a total of 24 significant associations for curd-related traits. The potential for genomic prediction was assessed with a genomic best linear unbiased prediction model and BayesB. Prediction abilities ranged from 0.10 to 0.66 for different traits and did not differ between prediction methods. Imputation of missing genotypes only slightly improved prediction ability. Our results demonstrate that GWAS and genomic prediction in combination with GBS and phenotyping of highly heritable traits can be used to identify useful quantitative trait loci and genotypes among genetically diverse gene bank material for subsequent utilization as genetic resources in cauliflower breeding. Copyright © 2018 Thorwarth et al.
αIIbβ3 variants defined by next-generation sequencing: Predicting variants likely to cause Glanzmann thrombasthenia

PubMed Central

Buitrago, Lorena; Rendon, Augusto; Liang, Yupu; Simeoni, Ilenia; Negri, Ana; Filizola, Marta; Ouwehand, Willem H.; Coller, Barry S.; Alessi, Marie-Christine; Ballmaier, Matthias; Bariana, Tadbir; Bellissimo, Daniel; Bertoli, Marta; Bray, Paul; Bury, Loredana; Carrell, Robin; Cattaneo, Marco; Collins, Peter; French, Deborah; Favier, Remi; Freson, Kathleen; Furie, Bruce; Germeshausen, Manuela; Ghevaert, Cedric; Gomez, Keith; Goodeve, Anne; Gresele, Paolo; Guerrero, Jose; Hampshire, Dan J.; Hadinnapola, Charaka; Heemskerk, Johan; Henskens, Yvonne; Hill, Marian; Hogg, Nancy; Johnsen, Jill; Kahr, Walter; Kerr, Ron; Kunishima, Shinji; Laffan, Michael; Natwani, Amit; Neerman-Arbez, Marguerite; Nurden, Paquita; Nurden, Alan; Ormiston, Mark; Othman, Maha; Ouwehand, Willem; Perry, David; Vilk, Shoshana Ravel; Reitsma, Pieter; Rondina, Matthew; Simeoni, Ilenia; Smethurst, Peter; Stephens, Jonathan; Stevenson, William; Szkotak, Artur; Turro, Ernest; Van Geet, Christel; Vries, Minka; Ward, June; Waye, John; Westbury, Sarah; Whiteheart, Sidney; Wilcox, David; Zhang, Bi

2015-01-01

Next-generation sequencing is transforming our understanding of human genetic variation but assessing the functional impact of novel variants presents challenges. We analyzed missense variants in the integrin αIIbβ3 receptor subunit genes ITGA2B and ITGB3 identified by whole-exome or -genome sequencing in the ThromboGenomics project, comprising ∼32,000 alleles from 16,108 individuals. We analyzed the results in comparison with 111 missense variants in these genes previously reported as being associated with Glanzmann thrombasthenia (GT), 20 associated with alloimmune thrombocytopenia, and 5 associated with aniso/macrothrombocytopenia. We identified 114 novel missense variants in ITGA2B (affecting ∼11% of the amino acids) and 68 novel missense variants in ITGB3 (affecting ∼9% of the amino acids). Of the variants, 96% had minor allele frequencies (MAF) < 0.1%, indicating their rarity. Based on sequence conservation, MAF, and location on a complete model of αIIbβ3, we selected three novel variants that affect amino acids previously associated with GT for expression in HEK293 cells. αIIb P176H and β3 C547G severely reduced αIIbβ3 expression, whereas αIIb P943A partially reduced αIIbβ3 expression and had no effect on fibrinogen binding. We used receiver operating characteristic curves of combined annotation-dependent depletion, Polyphen 2-HDIV, and sorting intolerant from tolerant to estimate the percentage of novel variants likely to be deleterious. At optimal cut-off values, which had 69–98% sensitivity in detecting GT mutations, between 27% and 71% of the novel αIIb or β3 missense variants were predicted to be deleterious. Our data have implications for understanding the evolutionary pressure on αIIbβ3 and highlight the challenges in predicting the clinical significance of novel missense variants. PMID:25827233

Monogenic diabetes syndromes: Locus‐specific databases for Alström, Wolfram, and Thiamine‐responsive megaloblastic anemia

PubMed Central

Astuti, Dewi; Sabir, Ataf; Fulton, Piers; Zatyka, Malgorzata; Williams, Denise; Hardy, Carol; Milan, Gabriella; Favaretto, Francesca; Yu‐Wai‐Man, Patrick; Rohayem, Julia; López de Heredia, Miguel; Hershey, Tamara; Tranebjaerg, Lisbeth; Chen, Jian‐Hua; Chaussenot, Annabel; Nunes, Virginia; Marshall, Bess; McAfferty, Susan; Tillmann, Vallo; Maffei, Pietro; Paquis‐Flucklinger, Veronique; Geberhiwot, Tarekign; Mlynarski, Wojciech; Parkinson, Kay; Picard, Virginie; Bueno, Gema Esteban; Dias, Renuka; Arnold, Amy; Richens, Caitlin; Paisey, Richard; Urano, Fumihiko; Semple, Robert; Sinnott, Richard

2017-01-01

We developed a variant database for diabetes syndrome genes, using the Leiden Open Variation Database platform, containing observed phenotypes matched to the genetic variations. We populated it with 628 published disease‐associated variants (December 2016) for: WFS1 (n = 309), CISD2 (n = 3), ALMS1 (n = 268), and SLC19A2 (n = 48) for Wolfram type 1, Wolfram type 2, Alström, and Thiamine‐responsive megaloblastic anemia syndromes, respectively; and included 23 previously unpublished novel germline variants in WFS1 and 17 variants in ALMS1. We then investigated genotype–phenotype relations for the WFS1 gene. The presence of biallelic loss‐of‐function variants predicted Wolfram syndrome defined by insulin‐dependent diabetes and optic atrophy, with a sensitivity of 79% (95% CI 75%–83%) and specificity of 92% (83%–97%). The presence of minor loss‐of‐function variants in WFS1 predicted isolated diabetes, isolated deafness, or isolated congenital cataracts without development of the full syndrome (sensitivity 100% [93%–100%]; specificity 78% [73%–82%]). The ability to provide a prognostic prediction based on genotype will lead to improvements in patient care and counseling. The development of the database as a repository for monogenic diabetes gene variants will allow prognostic predictions for other diabetes syndromes as next‐generation sequencing expands the repertoire of genotypes and phenotypes. The database is publicly available online at https://lovd.euro-wabb.org. PMID:28432734
LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites

PubMed Central

Grievink, Liat Shavit; Penny, David; Hendy, Mike D; Holland, Barbara R

2009-01-01

Correction to Shavit Grievink L, Penny D, Hendy MD, Holland BR: LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites. BMC Evol Biol 2008, 8(1):317.
The mathematical limits of genetic prediction for complex chronic disease.

PubMed

Keyes, Katherine M; Smith, George Davey; Koenen, Karestan C; Galea, Sandro

2015-06-01

Attempts at predicting individual risk of disease based on common germline genetic variation have largely been disappointing. The present paper formalises why genetic prediction at the individual level has, and will continue to have, limited utility given the aetiological architecture of most common complex diseases. Data were simulated on one million populations with 10 000 individuals in each population, with varying prevalences of a genetic risk factor, an interacting environmental factor and the background rate of disease. The determinant risk ratio and risk difference magnitude for the association between a gene variant and disease are a function of the prevalence of the interacting factors that activate the gene, and the background rate of disease. The risk ratio and total excess cases due to the genetic factor increase as the prevalence of interacting factors increases, and decrease as the background rate of disease increases. Germline genetic variations have high predictive capacity for individual disease only under conditions of high heritability of particular genetic sequences, plausible only under rare variant hypotheses. Under a model of common germline genetic variants that interact with other genes and/or environmental factors in order to cause disease, the predictive capacity of common genetic variants is determined by the prevalence of the factors that interact with the variant and the background rate. A focus on estimating genetic associations for the purpose of prediction without explicitly grounding such work in an understanding of modifiable (including environmentally influenced) factors will be limited in its ability to yield important insights about the risk of disease. Published by the BMJ Publishing Group Limited.
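The dependence of the risk ratio and risk difference on the interacting-factor prevalence and the background rate can be illustrated with a toy calculation. The sketch below is not the authors' simulation; it assumes a simple model in which a carrier develops disease either through the background rate or because an independent interacting factor is present, and all function names are hypothetical.

```python
def carrier_risk(background: float, interacting_prevalence: float) -> float:
    """Disease risk among variant carriers under the assumed toy model.

    Assumption: a carrier is affected if the background cause occurs, or if
    the (independent) interacting factor that activates the variant is present.
    """
    return background + (1.0 - background) * interacting_prevalence


def risk_ratio(background: float, interacting_prevalence: float) -> float:
    """Risk in carriers divided by risk in non-carriers (background only)."""
    return carrier_risk(background, interacting_prevalence) / background


def risk_difference(background: float, interacting_prevalence: float) -> float:
    """Excess risk in carriers relative to non-carriers."""
    return carrier_risk(background, interacting_prevalence) - background
```

Under these assumptions the risk ratio rises with the prevalence of the interacting factor and falls as the background rate grows, which matches the qualitative pattern described in the abstract.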
TARGET Research Goals

Cancer.gov

TARGET researchers use various sequencing and array-based methods to examine the genomes, transcriptomes, and for some diseases epigenomes of select childhood cancers. This “multi-omic” approach generates a comprehensive profile of molecular alterations for each cancer type. Alterations are changes in DNA or RNA, such as rearrangements in chromosome structure or variations in gene expression, respectively. Through computational analyses and assays to validate biological function, TARGET researchers predict which alterations disrupt the function of a gene or pathway and promote cancer growth, progression, and/or survival. Researchers identify candidate therapeutic targets and/or prognostic markers from the cancer-associated alterations.
Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.

PubMed

Yehdego, Daniel T; Zhang, Boyu; Kodimala, Vikram K R; Johnson, Kyle L; Taufer, Michela; Leung, Ming-Ying

2013-05-01

Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structures of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and optimized methods. Each step of searching for inversions, chunking, and predictions can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are possible computationally. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
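The notion of an inversion, a stretch of nucleotides followed closely by its inverse complementary sequence, can be made concrete with a brute-force search. This is only a sketch of the definition given in the abstract, not the authors' inversion-excursion scoring or their Hadoop implementation; the function names are hypothetical, only Watson-Crick pairs (A-U, G-C) are assumed, and `stem_len` and `max_gap` stand in for the stem length and gap size parameters mentioned above.

```python
# Watson-Crick complement table for RNA (G-U wobble pairs are ignored here).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(segment: str) -> str:
    """Return the inverse complementary sequence of an RNA segment."""
    return segment.translate(COMPLEMENT)[::-1]


def find_inversions(seq: str, stem_len: int, max_gap: int) -> list[tuple[int, int]]:
    """Find (left_start, right_start) pairs where a stem of length stem_len
    is followed, after at most max_gap nucleotides, by its reverse complement."""
    hits = []
    n = len(seq)
    for i in range(n - 2 * stem_len + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for gap in range(max_gap + 1):
            j = i + stem_len + gap
            if j + stem_len > n:
                break  # right arm would run past the end of the sequence
            if seq[j:j + stem_len] == target:
                hits.append((i, j))
    return hits
```

Because each starting position is examined independently, a search of this shape splits naturally across workers, which is in the spirit of the parallel execution the abstract describes.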
Thermodynamic characterization of tandem mismatches found in naturally occurring RNA

PubMed Central

Christiansen, Martha E.; Znosko, Brent M.

2009-01-01

Although all sequence symmetric tandem mismatches and some sequence asymmetric tandem mismatches have been thermodynamically characterized and a model has been proposed to predict the stability of previously unmeasured sequence asymmetric tandem mismatches [Christiansen,M.E. and Znosko,B.M. (2008) Biochemistry, 47, 4329–4336], experimental thermodynamic data for frequently occurring tandem mismatches is lacking. Since experimental data is preferred over a predictive model, the thermodynamic parameters for 25 frequently occurring tandem mismatches were determined. These new experimental values, on average, are 1.0 kcal/mol different from the values predicted for these mismatches using the previous model. The data for the sequence asymmetric tandem mismatches reported here were then combined with the data for 72 sequence asymmetric tandem mismatches that were published previously, and the parameters used to predict the thermodynamics of previously unmeasured sequence asymmetric tandem mismatches were updated. The average absolute difference between the measured values and the values predicted using these updated parameters is 0.5 kcal/mol. This updated model improves the prediction for tandem mismatches that were predicted rather poorly by the previous model. This new experimental data and updated predictive model allow for more accurate calculations of the free energy of RNA duplexes containing tandem mismatches, and, furthermore, should allow for improved prediction of secondary structure from sequence. PMID:19509311
Integrating sequence stratigraphy and rock-physics to interpret seismic amplitudes and predict reservoir quality

NASA Astrophysics Data System (ADS)

Dutta, Tanima

This dissertation focuses on the link between seismic amplitudes and reservoir properties. Prediction of reservoir properties, such as sorting, sand/shale ratio, and cement-volume from seismic amplitudes improves by integrating knowledge from multiple disciplines. The key contribution of this dissertation is to improve the prediction of reservoir properties by integrating sequence stratigraphy and rock physics. Sequence stratigraphy has been successfully used for qualitative interpretation of seismic amplitudes to predict reservoir properties. Rock physics modeling allows quantitative interpretation of seismic amplitudes. However, there is often uncertainty about selecting a geologically appropriate rock physics model and its input parameters away from the wells. In the present dissertation, we exploit the predictive power of sequence stratigraphy to extract the spatial trends of sedimentological parameters that control seismic amplitudes. These spatial trends of sedimentological parameters can serve as valuable constraints in rock physics modeling, especially away from the wells. Consequently, rock physics modeling, integrated with the trends from sequence stratigraphy, becomes useful for interpreting observed seismic amplitudes away from the wells in terms of underlying sedimentological parameters. We illustrate this methodology using a comprehensive dataset from channelized turbidite systems, deposited in minibasin settings offshore Equatorial Guinea, West Africa. First, we present a practical recipe for using closed-form expressions of effective medium models to predict seismic velocities in unconsolidated sandstones. We use an effective medium model that combines perfectly rough and smooth grains (the extended Walton model), and use that model to derive coordination number, porosity, and pressure relations for P and S wave velocities from experimental data.
Our recipe provides reasonable fits to other experimental and borehole data, and specifically improves the predictions of shear wave velocities. In addition, we provide empirical relations on normal compaction depth trends of porosity, velocities, and VP/VS ratio for shale and clean sands in shallow, supra-salt sediments in the Gulf of Mexico. Next, we identify probable spatial trends of sand/shale ratio and sorting as predicted by the conventional sequence stratigraphic model in minibasin settings (spill-and-fill model). These spatial trends are evaluated using well data from offshore West Africa, and the same well data are used to calibrate rock physics models (modified soft-sand model) that provide links between P-impedance and quartz/clay ratio, and sorting. The spatial increase in sand/shale ratio and sorting corresponds to an overall increase in P-impedance, and AVO intercept and gradient. The results are used as a guide to interpret sedimentological parameters from seismic attributes, away from the well locations. We present a quantitative link between carbonate cement and seismic attributes by combining stratigraphic cycles and the rock physics model (modified differential effective medium model). The variation in carbonate cement volume in West Africa can be linked with two distinct stratigraphic cycles: the coarsening-upward cycles and the fining-upward cycles. Cemented sandstones associated with these cycles exhibit distinct signatures on P-impedance vs. porosity and AVO intercept vs. gradient crossplots. These observations are important for assessing reservoir properties in West Africa as well as in other analogous depositional environments. Finally, we investigate the relationship between seismic velocities and time temperature index (TTI) using basin and petroleum system modeling at Rio Muni basin, West Africa. We find that both VP and VS increase exponentially with TTI.
The results can be applied to predict TTI, and thereby thermal maturity, from observed velocities.
Predicting DNA hybridization kinetics from sequence

NASA Astrophysics Data System (ADS)

Zhang, Jinny X.; Fang, John Z.; Duan, Wei; Wu, Lucia R.; Zhang, Angela W.; Dalchau, Neil; Yordanov, Boyan; Petersen, Rasmus; Phillips, Andrew; Zhang, David Yu

2018-01-01

Hybridization is a key molecular process in biology and biotechnology, but so far there is no predictive model for accurately determining hybridization rate constants based on sequence information. Here, we report a weighted neighbour voting (WNV) prediction algorithm, in which the hybridization rate constant of an unknown sequence is predicted based on similarity reactions with known rate constants. To construct this algorithm we first performed 210 fluorescence kinetics experiments to observe the hybridization kinetics of 100 different DNA target and probe pairs (36 nt sub-sequences of the CYCS and VEGF genes) at temperatures ranging from 28 to 55 °C. Automated feature selection and weighting optimization resulted in a final six-feature WNV model, which can predict hybridization rate constants of new sequences to within a factor of 3 with ∼91% accuracy, based on leave-one-out cross-validation. Accurate prediction of hybridization kinetics allows the design of efficient probe sequences for genomics research.
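The idea of weighted neighbour voting can be sketched as a similarity-weighted average over reactions with known rate constants. The abstract does not give the six features or the weighting function, so everything specific here is an assumption made for illustration: the feature vectors, the weighted Euclidean distance, the exponential decay of votes, and the function name `wnv_predict` are hypothetical and are not the published model.

```python
import math


def wnv_predict(query, known, feature_weights, decay=1.0):
    """Predict a log10 rate constant for `query` from labelled neighbours.

    query           -- feature vector of the new sequence (hypothetical features)
    known           -- list of (feature_vector, log10_rate_constant) pairs
    feature_weights -- per-feature weights used in the distance
    decay           -- assumed rate at which a neighbour's vote falls with distance
    """
    if not known:
        raise ValueError("at least one reaction with a known rate constant is required")
    weighted_sum = 0.0
    total_vote = 0.0
    for features, log_rate in known:
        # Weighted Euclidean distance between the query and this neighbour.
        distance = math.sqrt(
            sum(w * (q - f) ** 2 for w, q, f in zip(feature_weights, query, features))
        )
        vote = math.exp(-decay * distance)  # closer neighbours vote more strongly
        weighted_sum += vote * log_rate
        total_vote += vote
    return weighted_sum / total_vote
```

Working in log10 units is a design choice suggested by the abstract's "within a factor of 3" criterion, since a multiplicative error corresponds to an additive error on a log scale.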
Stratigraphy and structure of coalbed methane reservoirs in the United States: an overview

USGS Publications Warehouse

Pashin, J.C.

1998-01-01

Stratigraphy and geologic structure determine the shape, continuity and permeability of coal and are therefore critical considerations for designing exploration and production strategies for coalbed methane. Coal in the United States is dominantly of Pennsylvanian, Cretaceous and Tertiary age, and to date, more than 90% of the coalbed methane produced is from Pennsylvanian and Cretaceous strata of the Black Warrior and San Juan Basins. Investigations of these basins establish that sequence stratigraphy is a promising approach for regional characterization of coalbed methane reservoirs. Local stratigraphic variation within these strata is the product of sedimentologic and tectonic processes and is a consideration for selecting completion zones. Coalbed methane production in the United States is mainly from foreland and intermontane basins containing diverse compressional and extensional structures. Balanced structural models can be used to construct and validate cross sections as well as to quantify layer-parallel strain and predict the distribution of fractures. Folds and faults influence gas and water production in diverse ways. However, interwell heterogeneity related to fractures and shear structures makes the performance of individual wells difficult to predict.
Environmental Sensing of Expert Knowledge in a Computational Evolution System for Complex Problem Solving in Human Genetics

NASA Astrophysics Data System (ADS)

Greene, Casey S.; Hill, Douglas P.; Moore, Jason H.

The relationship between interindividual variation in our genomes and variation in our susceptibility to common diseases is expected to be complex with multiple interacting genetic factors. A central goal of human genetics is to identify which DNA sequence variations predict disease risk in human populations. Our success in this endeavour will depend critically on the development and implementation of computational intelligence methods that are able to embrace, rather than ignore, the complexity of the genotype to phenotype relationship. To this end, we have developed a computational evolution system (CES) to discover genetic models of disease susceptibility involving complex relationships between DNA sequence variations. The CES approach is hierarchically organized and is capable of evolving operators of any arbitrary complexity. The ability to evolve operators distinguishes this approach from artificial evolution approaches using fixed operators such as mutation and recombination. Our previous studies have shown that a CES that can utilize expert knowledge about the problem in evolved operators significantly outperforms a CES unable to use this knowledge. This environmental sensing of external sources of biological or statistical knowledge is important when the search space is both rugged and large as in the genetic analysis of complex diseases. We show here that the CES is also capable of evolving operators which exploit one of several sources of expert knowledge to solve the problem. This is important for both the discovery of highly fit genetic models and because the particular source of expert knowledge used by evolved operators may provide additional information about the problem itself. This study brings us a step closer to a CES that can solve complex problems in human genetics in addition to discovering genetic models of disease.
Length Variation, Heteroplasmy and Sequence Divergence in the Mitochondrial DNA of Four Species of Sturgeon (Acipenser)

PubMed Central

Brown, J. R.; Beckenbach, K.; Beckenbach, A. T.; Smith, M. J.

1996-01-01

The extent of mtDNA length variation and heteroplasmy as well as DNA sequences of the control region and two tRNA genes were determined for four North American sturgeon species: Acipenser transmontanus, A. medirostris, A. fulvescens and A. oxyrhnychus. Across the Continental Divide, a division in the occurrence of length variation and heteroplasmy was observed that was concordant with species biogeography as well as with phylogenies inferred from restriction fragment length polymorphisms (RFLP) of whole mtDNA and pairwise comparisons of unique sequences of the control region. In all species, mtDNA length variation was due to repeated arrays of 78-82-bp sequences each containing a D-loop strand synthesis termination associated sequence (TAS). Individual repeats showed greater sequence conservation within individuals and species rather than between species, which is suggestive of concerted evolution. Differences in the frequencies of multiple copy genomes and heteroplasmy among the four species may be ascribed to differences in the rates of recurrent mutation. A mechanism that may offset the high rate of mutation for increased copy number is suggested on the basis that an increase in the number of functional TAS motifs might reduce the frequency of successfully initiated H-strand replications. PMID:8852850
Translation efficiency of heterologous proteins is significantly affected by the genetic context of RBS sequences in engineered cyanobacterium Synechocystis sp. PCC 6803.

PubMed

Thiel, Kati; Mulaku, Edita; Dandapani, Hariharan; Nagy, Csaba; Aro, Eva-Mari; Kallio, Pauli

2018-03-02

Photosynthetic cyanobacteria have been studied as potential host organisms for direct solar-driven production of different carbon-based chemicals from CO2 and water, as part of the development of sustainable future biotechnological applications. The engineering approaches, however, are still limited by the lack of comprehensive information on optimal expression strategies and validated species-specific genetic elements which are essential for increasing the intricacy, predictability and efficiency of the systems. This study focused on the systematic evaluation of the key translational control elements, ribosome binding sites (RBS), in the cyanobacterial host Synechocystis sp. PCC 6803, with the objective of expanding the palette of tools for more rigorous engineering approaches. An expression system was established for the comparison of 13 selected RBS sequences in Synechocystis, using several alternative reporter proteins (sYFP2, codon-optimized GFPmut3 and ethylene forming enzyme) as quantitative indicators of the relative translation efficiencies. The set-up was shown to yield highly reproducible expression patterns in independent analytical series with low variation between biological replicates, thus allowing statistical comparison of the activities of the different RBSs in vivo. While the RBSs covered a relatively broad overall expression level range, the downstream gene sequence was demonstrated in a rigorous manner to have a clear impact on the resulting translational profiles. This was expected to reflect interfering sequence-specific mRNA-level interaction between the RBS and the coding region, yet correlation between potential secondary structure formation and observed translation levels could not be resolved with existing in silico prediction tools. The study expands our current understanding of the potential and limitations associated with the regulation of protein expression at translational level in engineered cyanobacteria.
The acquired information can be used for selecting appropriate RBSs for optimizing over-expression constructs or multicistronic pathways in Synechocystis, while underlining the complications in predicting the activity due to gene-specific interactions which may reduce the translational efficiency for a given RBS-gene combination. Ultimately, the findings emphasize the need for additional characterized insulator sequence elements to decouple the interaction between the RBS and the coding region for future engineering approaches.
Comprehensive Analysis of Non-Synonymous Natural Variants of G Protein-Coupled Receptors.

PubMed

Kim, Hee Ryung; Duc, Nguyen Minh; Chung, Ka Young

2018-03-01

G protein-coupled receptors (GPCRs) are the largest superfamily of transmembrane receptors and have vital signaling functions in various organs. Because of their critical roles in physiology and pathology, GPCRs are the most commonly used therapeutic target. It has been suggested that GPCRs undergo massive genetic variations such as genetic polymorphisms and DNA insertions or deletions. Among these genetic variations, non-synonymous natural variations change the amino acid sequence and could thus alter GPCR functions such as expression, localization, signaling, and ligand binding, which may be involved in disease development and altered responses to GPCR-targeting drugs. Despite the clinical importance of GPCRs, studies on the genotype-phenotype relationship of GPCR natural variants have been limited to a few GPCRs such as β-adrenergic receptors and opioid receptors. Comprehensive understanding of non-synonymous natural variations within GPCRs would help to predict the unknown genotype-phenotype relationship and yet-to-be-discovered natural variants. Here, we analyzed the non-synonymous natural variants of all non-olfactory GPCRs available from a public database, UniProt. The results suggest that non-synonymous natural variations occur extensively within the GPCR superfamily, especially in the N-terminus and transmembrane domains. Within the transmembrane domains, natural variations were observed more frequently in the conserved residues, which leads to disruption of the receptor function. Our analysis also suggests that only a few non-synonymous natural variations have been studied in efforts to link the variations with functional consequences.
Mitochondrial DNA Sequence Variation in North Atlantic Long-Finned Pilot Whales, Globicephala melas

DTIC Science & Technology

1994-06-01

Delphinapterus leucas: mitochondrial DNA sequence variation within and among North American populations. M.Sc. thesis. McMaster University. Brown, G.G... Delphinapterus leucas) (Brennin 1992), minke whales (Balaenoptera acutorostrata) (Wada et al. 1991), bottlenose dolphins (Tursiops truncatus) (Dowling & Brown
Widespread Transient Hoogsteen Base-Pairs in Canonical Duplex DNA with Variable Energetics

PubMed Central

Alvey, Heidi S.; Gottardo, Federico L.; Nikolova, Evgenia N.; Al-Hashimi, Hashim M.

2015-01-01

Hoogsteen base-pairing involves a 180 degree rotation of the purine base relative to Watson-Crick base-pairing within DNA duplexes, creating alternative DNA conformations that can play roles in recognition, damage induction, and replication. Here, using Nuclear Magnetic Resonance R1ρ relaxation dispersion, we show that transient Hoogsteen base-pairs occur across more diverse sequence and positional contexts than previously anticipated. We observe sequence-specific variations in Hoogsteen base-pair energetic stabilities that are comparable to variations in Watson-Crick base-pair stability, with Hoogsteen base-pairs being more abundant for energetically less favorable Watson-Crick base-pairs. Our results suggest that the variations in Hoogsteen stabilities and rates of formation are dominated by variations in Watson-Crick base-pair stability, suggesting a late transition state for the Watson-Crick to Hoogsteen conformational switch. The occurrence of sequence- and position-dependent Hoogsteen base-pairs provides a new potential mechanism for achieving sequence-dependent DNA transactions. PMID:25185517
CNV-seq, a new method to detect copy number variation using high-throughput sequencing.

PubMed

Xie, Chao; Tammi, Martti T

2009-03-06

DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations. Here, we describe a method to detect copy number variation using shotgun sequencing, CNV-seq. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads, is the key factor determining the resolution of detection. This favors the next-generation sequencing methods that rapidly produce large amounts of short reads. Simulation of various sequencing methods with coverage between 0.1x and 8x shows overall specificity between 91.7% and 99.9%, and sensitivity between 72.2% and 96.5%. We also show the results for assessment of CNV between two individual human genomes.
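Why read count rather than read length drives resolution can be seen in a generic read-depth comparison, where each genomic window is scored by the ratio of mapped reads between two samples. The sketch below illustrates that general idea only; it is not the CNV-seq statistical model and it computes no confidence values. The function name, the fixed non-overlapping windows, and the representation of reads as mapped start positions are all assumptions.

```python
import math
from collections import Counter


def window_log2_ratios(test_positions, ref_positions, window_size):
    """Log2 ratio of test to reference read counts per fixed-size window.

    test_positions, ref_positions -- mapped read start coordinates (integers)
    window_size                   -- window width in base pairs
    Windows lacking reads in either sample are skipped to avoid log(0).
    """
    if not test_positions or not ref_positions:
        raise ValueError("both samples need at least one mapped read")
    test_counts = Counter(p // window_size for p in test_positions)
    ref_counts = Counter(p // window_size for p in ref_positions)
    # Normalise for different sequencing depth between the two samples.
    scale = len(ref_positions) / len(test_positions)
    ratios = {}
    for window in sorted(set(test_counts) & set(ref_counts)):
        ratios[window] = math.log2(test_counts[window] * scale / ref_counts[window])
    return ratios
```

In a scheme like this, a ratio near zero suggests equal copy number in that window, while the number of reads per window limits how small a window can be scored reliably.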
Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle.

PubMed

van Binsbergen, Rianne; Calus, Mario P L; Bink, Marco C A M; van Eeuwijk, Fred A; Schrooten, Chris; Veerkamp, Roel F

2015-09-17

In contrast to currently used single nucleotide polymorphism (SNP) panels, the use of whole-genome sequence data is expected to enable the direct estimation of the effects of causal mutations on a given trait. This could lead to higher reliabilities of genomic predictions compared to those based on SNP genotypes. Also, at each generation of selection, recombination events between a SNP and a mutation can cause decay in reliability of genomic predictions based on markers rather than on the causal variants. Our objective was to investigate the use of imputed whole-genome sequence genotypes versus high-density SNP genotypes on (the persistency of) the reliability of genomic predictions using real cattle data. Highly accurate phenotypes based on daughter performance and Illumina BovineHD Beadchip genotypes were available for 5503 Holstein Friesian bulls. The BovineHD genotypes (631,428 SNPs) of each bull were used to impute whole-genome sequence genotypes (12,590,056 SNPs) using the Beagle software. Imputation was done using a multi-breed reference panel of 429 sequenced individuals. Genomic estimated breeding values for three traits were predicted using a Bayesian stochastic search variable selection (BSSVS) model and a genome-enabled best linear unbiased prediction model (GBLUP). Reliabilities of predictions were based on 2087 validation bulls, while the other 3416 bulls were used for training. Prediction reliabilities ranged from 0.37 to 0.52. BSSVS performed better than GBLUP in all cases. Reliabilities of genomic predictions were slightly lower with imputed sequence data than with BovineHD chip data. Also, the reliabilities tended to be lower for both sequence data and BovineHD chip data when relationships between training animals were low. No increase in persistency of prediction reliability using imputed sequence data was observed. Compared to BovineHD genotype data, using imputed sequence data for genomic prediction produced no advantage. 
To investigate the putative advantage of genomic prediction using (imputed) sequence data, a training set with a larger number of individuals that are distantly related to each other and genomic prediction models that incorporate biological information on the SNPs or that apply stricter SNP pre-selection should be considered.
Independent studies using deep sequencing resolve the same set of core bacterial species dominating gut communities of honey bees.

PubMed

Sabree, Zakee L; Hansen, Allison K; Moran, Nancy A

2012-01-01

Starting in 2003, numerous studies using culture-independent methodologies to characterize the gut microbiota of honey bees have retrieved a consistent and distinctive set of eight bacterial species, based on near identity of the 16S rRNA gene sequences. A recent study [Mattila HR, Rios D, Walker-Sperling VE, Roeselers G, Newton ILG (2012) Characterization of the active microbiotas associated with honey bees reveals healthier and broader communities when colonies are genetically diverse. PLoS ONE 7(3): e32962], using pyrosequencing of the V1-V2 hypervariable region of the 16S rRNA gene, reported finding entirely novel bacterial species in honey bee guts, and used taxonomic assignments from these reads to predict metabolic activities based on known metabolisms of cultivable species. To better understand this discrepancy, we analyzed the Mattila et al. pyrotag dataset. In contrast to the conclusions of Mattila et al., we found that the large majority of pyrotag sequences belonged to clusters for which representative sequences were identical to sequences from previously identified core species of the bee microbiota. On average, they represent 95% of the bacteria in each worker bee in the Mattila et al. dataset, a slightly lower value than that found in other studies. Some colonies contain small proportions of other bacteria, mostly species of Enterobacteriaceae. Reanalysis of the Mattila et al. dataset also did not support a relationship between abundances of Bifidobacterium and of putative pathogens or a significant difference in gut communities between colonies from queens that were singly or multiply mated. Additionally, consistent with previous studies, the dataset supports the occurrence of considerable strain variation within core species, even within single colonies. The roles of these bacteria within bees, or the implications of the strain variation, are not yet clear.
Molecular mechanisms of epigenetic variation in plants.

PubMed

Fujimoto, Ryo; Sasaki, Taku; Ishikawa, Ryo; Osabe, Kenji; Kawanabe, Takahiro; Dennis, Elizabeth S

2012-01-01

Natural variation is defined as the phenotypic variation caused by spontaneous mutations. In general, mutations are associated with changes of nucleotide sequence, and many mutations in genes that can cause changes in plant development have been identified. Epigenetic change, which does not involve alteration to the nucleotide sequence, can also cause changes in gene activity by changing the structure of chromatin through DNA methylation or histone modifications. Now there is evidence based on induced or spontaneous mutants that epigenetic changes can alter plant phenotypes. Epigenetic changes have occurred frequently in plants, and some are heritable or metastable, causing variation in epigenetic status within or between species. Therefore, heritable epigenetic variation as well as genetic variation has the potential to drive natural variation.
IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples

PubMed Central

Hintzsche, Jennifer; Kim, Jihye; Yadav, Vinod; Amato, Carol; Robinson, Steven E; Seelenfreund, Eric; Shellman, Yiqun; Wisell, Joshua; Applegate, Allison; McCarter, Martin; Box, Neil; Tentler, John; De, Subhajyoti

2016-01-01

Objective Currently, there is a disconnect between finding a patient’s relevant molecular profile and predicting actionable therapeutics. Here we develop and implement the Integrating Molecular Profiles with Actionable Therapeutics (IMPACT) analysis pipeline, linking variants detected from whole-exome sequencing (WES) to actionable therapeutics. Methods and materials The IMPACT pipeline contains 4 analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against deleterious variants, and analyzing tumor heterogeneity. We tested the IMPACT pipeline on whole-exome sequencing data in The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples with known EGFR mutations. We also used IMPACT to analyze melanoma patient tumor samples before treatment, after BRAF-inhibitor treatment, and after BRAF- and MEK-inhibitor treatment. Results IMPACT correctly identified known EGFR mutations in the TCGA lung adenocarcinoma samples. IMPACT linked these EGFR mutations to the appropriate Food and Drug Administration (FDA)-approved EGFR inhibitors. For the melanoma patient samples, we identified NRAS p.Q61K as an acquired resistance mutation to BRAF-inhibitor treatment. We also identified CDKN2A deletion as a novel acquired resistance mutation to BRAFi/MEKi inhibition. The IMPACT analysis pipeline links these somatic variants to actionable therapeutics. We observed the clonal dynamics in the tumor samples after various treatments. We showed that IMPACT not only helped in successful prioritization of clinically relevant variants but also linked these variations to possible targeted therapies. Conclusion IMPACT provides a new bioinformatics strategy to delineate candidate somatic variants and actionable therapies. This approach can be applied to other patient tumor samples to discover effective drug targets for personalized medicine. IMPACT is publicly available at http://tanlab.ucdenver.edu/IMPACT. PMID:27026619

IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples.

PubMed

Hintzsche, Jennifer; Kim, Jihye; Yadav, Vinod; Amato, Carol; Robinson, Steven E; Seelenfreund, Eric; Shellman, Yiqun; Wisell, Joshua; Applegate, Allison; McCarter, Martin; Box, Neil; Tentler, John; De, Subhajyoti; Robinson, William A; Tan, Aik Choon

2016-07-01

Currently, there is a disconnect between finding a patient's relevant molecular profile and predicting actionable therapeutics. Here we develop and implement the Integrating Molecular Profiles with Actionable Therapeutics (IMPACT) analysis pipeline, linking variants detected from whole-exome sequencing (WES) to actionable therapeutics. The IMPACT pipeline contains 4 analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against deleterious variants, and analyzing tumor heterogeneity. We tested the IMPACT pipeline on whole-exome sequencing data in The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples with known EGFR mutations. We also used IMPACT to analyze melanoma patient tumor samples before treatment, after BRAF-inhibitor treatment, and after BRAF- and MEK-inhibitor treatment. IMPACT correctly identified known EGFR mutations in the TCGA lung adenocarcinoma samples. IMPACT linked these EGFR mutations to the appropriate Food and Drug Administration (FDA)-approved EGFR inhibitors. For the melanoma patient samples, we identified NRAS p.Q61K as an acquired resistance mutation to BRAF-inhibitor treatment. We also identified CDKN2A deletion as a novel acquired resistance mutation to BRAFi/MEKi inhibition. The IMPACT analysis pipeline links these somatic variants to actionable therapeutics. We observed the clonal dynamics in the tumor samples after various treatments. We showed that IMPACT not only helped in successful prioritization of clinically relevant variants but also linked these variations to possible targeted therapies. IMPACT provides a new bioinformatics strategy to delineate candidate somatic variants and actionable therapies. This approach can be applied to other patient tumor samples to discover effective drug targets for personalized medicine. IMPACT is publicly available at http://tanlab.ucdenver.edu/IMPACT. © The Author 2016. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.

PubMed

Park, Byungkyu; Im, Jinyong; Tuvshinjargal, Narankhuu; Lee, Wook; Han, Kyungsook

2014-11-01

As many structures of protein-DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3%, and an MCC of 0.324. The other SVM model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5%, and an MCC of 0.329. Both in cross-validation and independent testing, the second SVM model that used both DNA and protein sequence data showed better performance than the first model that used DNA sequence data only. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
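The three measures reported in this abstract (accuracy, F-measure, and MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical, and it assumes the F-measure is computed for the protein-binding (positive) class, which the abstract does not state.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F-measure, and MCC for a binary confusion matrix.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs here).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall for the positive (binding) class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Illustrative counts only; these are not data from the study.
example = binary_metrics(tp=50, fp=10, tn=30, fn=10)
```

Unlike accuracy, the MCC uses all four cells of the matrix, which is presumably why it is reported alongside accuracy for a task where binding and non-binding nucleotides may be unevenly represented.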
The study of human Y chromosome variation through ancient DNA.

PubMed

Kivisild, Toomas

2017-05-01

High throughput sequencing methods have completely transformed the study of human Y chromosome variation by offering a genome-scale view on genetic variation retrieved from ancient human remains in context of a growing number of high coverage whole Y chromosome sequence data from living populations from across the world. The ancient Y chromosome sequences are providing us the first exciting glimpses into the past variation of male-specific compartment of the genome and the opportunity to evaluate models based on previously made inferences from patterns of genetic variation in living populations. Analyses of the ancient Y chromosome sequences are challenging not only because of issues generally related to ancient DNA work, such as DNA damage-induced mutations and low content of endogenous DNA in most human remains, but also because of specific properties of the Y chromosome, such as its highly repetitive nature and high homology with the X chromosome. Shotgun sequencing of uniquely mapping regions of the Y chromosomes to sufficiently high coverage is still challenging and costly in poorly preserved samples. To increase the coverage of specific target SNPs capture-based methods have been developed and used in recent years to generate Y chromosome sequence data from hundreds of prehistoric skeletal remains. Besides the prospects of testing directly as how much genetic change in a given time period has accompanied changes in material culture the sequencing of ancient Y chromosomes allows us also to better understand the rate at which mutations accumulate and get fixed over time. This review considers genome-scale evidence on ancient Y chromosome diversity that has recently started to accumulate in geographic areas favourable to DNA preservation. More specifically the review focuses on examples of regional continuity and change of the Y chromosome haplogroups in North Eurasia and in the New World.
Detection of nucleic acid sequences by invader-directed cleavage

DOEpatents

Brow, Mary Ann D.; Hall, Jeff Steven Grotelueschen; Lyamichev, Victor; Olive, David Michael; Prudent, James Robert

1999-01-01

The present invention relates to means for the detection and characterization of nucleic acid sequences, as well as variations in nucleic acid sequences. The present invention also relates to methods for forming a nucleic acid cleavage structure on a target sequence and cleaving the nucleic acid cleavage structure in a site-specific manner. The 5' nuclease activity of a variety of enzymes is used to cleave the target-dependent cleavage structure, thereby indicating the presence of specific nucleic acid sequences or specific variations thereof. The present invention further relates to methods and devices for the separation of nucleic acid molecules based on charge.
Rare variation facilitates inferences of fine-scale population structure in humans.

PubMed

O'Connor, Timothy D; Fu, Wenqing; Mychaleckyj, Josyf C; Logsdon, Benjamin; Auer, Paul; Carlson, Christopher S; Leal, Suzanne M; Smith, Joshua D; Rieder, Mark J; Bamshad, Michael J; Nickerson, Deborah A; Akey, Joshua M

2015-03-01

Understanding the genetic structure of human populations has important implications for the design and interpretation of disease mapping studies and reconstructing human evolutionary history. To date, inferences of human population structure have primarily been made with common variants. However, recent large-scale resequencing studies have shown an abundance of rare variation in humans, which may be particularly useful for making inferences of fine-scale population structure. To this end, we used an information theory framework and extensive coalescent simulations to rigorously quantify the informativeness of rare and common variation to detect signatures of fine-scale population structure. We show that rare variation affords unique insights into patterns of recent population structure. Furthermore, to empirically assess our theoretical findings, we analyzed high-coverage exome sequences in 6,515 European and African American individuals. As predicted, rare variants are more informative than common polymorphisms in revealing a distinct cluster of European-American individuals, and subsequent analyses demonstrate that these individuals are likely of Ashkenazi Jewish ancestry. Our results provide new insights into the population structure using rare variation, which will be an important factor to account for in rare variant association studies. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Improved Model for Predicting the Free Energy Contribution of Dinucleotide Bulges to RNA Duplex Stability.

PubMed

Tomcho, Jeremy C; Tillman, Magdalena R; Znosko, Brent M

2015-09-01

Predicting the secondary structure of RNA is an intermediate in predicting RNA three-dimensional structure. Commonly, determining RNA secondary structure from sequence uses free energy minimization and nearest neighbor parameters. Current algorithms utilize a sequence-independent model to predict free energy contributions of dinucleotide bulges. To determine if a sequence-dependent model would be more accurate, short RNA duplexes containing dinucleotide bulges with different sequences and nearest neighbor combinations were optically melted to derive thermodynamic parameters. These data suggested energy contributions of dinucleotide bulges were sequence-dependent, and a sequence-dependent model was derived. This model assigns free energy penalties based on the identity of nucleotides in the bulge (3.06 kcal/mol for two purines, 2.93 kcal/mol for two pyrimidines, 2.71 kcal/mol for 5'-purine-pyrimidine-3', and 2.41 kcal/mol for 5'-pyrimidine-purine-3'). The predictive model also includes a 0.45 kcal/mol penalty for an A-U pair adjacent to the bulge and a -0.28 kcal/mol bonus for a G-U pair adjacent to the bulge. The new sequence-dependent model results in predicted values within, on average, 0.17 kcal/mol of experimental values, a significant improvement over the sequence-independent model. This model and new experimental values can be incorporated into algorithms that predict RNA stability and secondary structure from sequence.
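The sequence-dependent penalties quoted in this abstract can be written as a small lookup. The sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, and it assumes that the A-U penalty and G-U bonus are applied once for each pair adjacent to the bulge, which the abstract does not spell out.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}

# Values in kcal/mol, taken from the abstract.
AU_ADJACENT_PENALTY = 0.45
GU_ADJACENT_BONUS = -0.28


def bulge_class_penalty(bulge):
    """Penalty for a two-nucleotide bulge written 5'->3', e.g. "AG"."""
    bulge = bulge.upper()
    if len(bulge) != 2 or any(b not in PURINES | PYRIMIDINES for b in bulge):
        raise ValueError("bulge must be two RNA nucleotides (A, C, G, U)")
    first, second = bulge
    if first in PURINES and second in PURINES:
        return 3.06  # two purines
    if first in PYRIMIDINES and second in PYRIMIDINES:
        return 2.93  # two pyrimidines
    if first in PURINES:
        return 2.71  # 5'-purine-pyrimidine-3'
    return 2.41  # 5'-pyrimidine-purine-3'


def dinucleotide_bulge_energy(bulge, adjacent_pairs):
    """Sum the bulge penalty and adjacent-pair terms (assumed per pair).

    adjacent_pairs is an iterable of two-letter base pairs such as
    "AU", "UA", "GU", "UG", or "GC" flanking the bulge.
    """
    energy = bulge_class_penalty(bulge)
    for pair in adjacent_pairs:
        bases = set(pair.upper())
        if bases == {"A", "U"}:
            energy += AU_ADJACENT_PENALTY
        elif bases == {"G", "U"}:
            energy += GU_ADJACENT_BONUS
    return energy
```

For example, under these assumptions a 5'-CA-3' bulge flanked by one A-U pair and one G-C pair would be assigned 2.41 + 0.45 = 2.86 kcal/mol.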
Prediction of protein secondary structure content for the twilight zone sequences.

PubMed

Homaeian, Leila; Kurgan, Lukasz A; Ruan, Jishou; Cios, Krzysztof J; Chen, Ke

2007-11-15

Secondary protein structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. A significant majority of successful methods for prediction of the secondary structure are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. The PSSC-core is shown to provide statistically significantly better results when compared with the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods. At the same time, it also provides useful insight into the design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure. (c) 2007 Wiley-Liss, Inc.
Identification of structural variation in mouse genomes.

PubMed

Keane, Thomas M; Wong, Kim; Adams, David J; Flint, Jonathan; Reymond, Alexandre; Yalcin, Binnaz

2014-01-01

Structural variation is variation in structure of DNA regions affecting DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with the recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variation and their respective association to human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of molecular architecture and functional consequences of structural variation.
Functional region prediction with a set of appropriate homologous sequences-an index for sequence selection by integrating structure and sequence information with spatial statistics

PubMed Central

2012-01-01

Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. 
The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026
Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure.

PubMed

Song, Jiangning; Yuan, Zheng; Tan, Hao; Huber, Thomas; Burrage, Kevin

2007-12-01

Disulfide bonds are primary covalent crosslinks between two cysteine residues in proteins that play critical roles in stabilizing the protein structures and are commonly found in extracytoplasmatic or secreted proteins. In protein folding prediction, the localization of disulfide bonds can greatly reduce the search in conformational space. Therefore, there is a great need to develop computational methods capable of accurately predicting disulfide connectivity patterns in proteins that could have potentially important applications. We have developed a novel method to predict disulfide connectivity patterns from protein primary sequence, using a support vector regression (SVR) approach based on multiple sequence feature vectors and secondary structure predicted by the PSIPRED program. The results indicate that our method could achieve prediction accuracies of 74.4% and 77.9%, measured at the protein and cysteine-pair levels respectively, when averaged over proteins with two to five disulfide bridges using 4-fold cross-validation on a well-defined non-homologous dataset. We assessed the effects of different sequence encoding schemes on the prediction performance of disulfide connectivity. It has been shown that the sequence encoding scheme based on multiple sequence feature vectors coupled with predicted secondary structure can significantly improve the prediction accuracy, thus enabling our method to outperform most other currently available predictors. Our work provides a complementary approach to the current algorithms that should be useful in computationally assigning disulfide connectivity patterns and helps in the annotation of protein sequences generated by large-scale whole-genome projects. The prediction web server and Supplementary Material are accessible at http://foo.maths.uq.edu.au/~huber/disulfide
Relationships between physical properties and sequence in silkworm silks

PubMed Central

Malay, Ali D.; Sato, Ryota; Yazawa, Kenjiro; Watanabe, Hiroe; Ifuku, Nao; Masunaga, Hiroyasu; Hikima, Takaaki; Guan, Juan; Mandal, Biman B.; Damrongsakkul, Siriporn; Numata, Keiji

2016-01-01

Silk has attracted widespread attention due to its superlative material properties and promising applications. However, the determinants behind the variations in material properties among different types of silk are not well understood. We analysed the physical properties of silk samples from a variety of silkmoth cocoons, including domesticated Bombyx mori varieties and several species from Saturniidae. Tensile deformation tests, thermal analyses, and investigations on crystalline structure and orientation of the fibres were performed. The results showed that saturniid silks produce more highly-defined structural transitions compared to B. mori, as seen in the yielding and strain hardening events during tensile deformation and in the changes observed during thermal analyses. These observations were analysed in terms of the constituent fibroin sequences, which in B. mori are predicted to produce heterogeneous structures, whereas the strictly modular repeats of the saturniid sequences are hypothesized to produce structures that respond in a concerted manner. Within saturniid fibroins, thermal stability was found to correlate with the abundance of poly-alanine residues, whereas differences in fibre extensibility can be related to varying ratios of GGX motifs versus bulky hydrophobic residues in the amorphous phase. PMID:27279149
The Genome of the Obligately Intracellular Bacterium Ehrlichia canis Reveals Themes of Complex Membrane Structure and Immune Evasion Strategies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mavromatis, K; Doyle, C Kuyler; Lykidis, A

2006-01-01

Ehrlichia canis, a small obligately intracellular, tick-transmitted, gram-negative, α-proteobacterium, is the primary etiologic agent of globally distributed canine monocytic ehrlichiosis. Complete genome sequencing revealed that the E. canis genome consists of a single circular chromosome of 1,315,030 bp predicted to encode 925 proteins, 40 stable RNA species, 17 putative pseudogenes, and a substantial proportion of noncoding sequence (27%). Interesting genome features include a large set of proteins with transmembrane helices and/or signal sequences and a unique serine-threonine bias associated with the potential for O glycosylation that was prominent in proteins associated with pathogen-host interactions. Furthermore, two paralogous protein families associated with immune evasion were identified, one of which contains poly(G-C) tracts, suggesting that they may play a role in phase variation and facilitation of persistent infections. Genes associated with pathogen-host interactions were identified, including a small group encoding proteins (n = 12) with tandem repeats and another group encoding proteins with eukaryote-like ankyrin domains (n = 7).
A rare variant in APOC3 is associated with plasma triglyceride and VLDL levels in Europeans.

PubMed

Timpson, Nicholas J; Walter, Klaudia; Min, Josine L; Tachmazidou, Ioanna; Malerba, Giovanni; Shin, So-Youn; Chen, Lu; Futema, Marta; Southam, Lorraine; Iotchkova, Valentina; Cocca, Massimiliano; Huang, Jie; Memari, Yasin; McCarthy, Shane; Danecek, Petr; Muddyman, Dawn; Mangino, Massimo; Menni, Cristina; Perry, John R B; Ring, Susan M; Gaye, Amadou; Dedoussis, George; Farmaki, Aliki-Eleni; Burton, Paul; Talmud, Philippa J; Gambaro, Giovanni; Spector, Tim D; Smith, George Davey; Durbin, Richard; Richards, J Brent; Humphries, Steve E; Zeggini, Eleftheria; Soranzo, Nicole

2014-09-16

The analysis of rich catalogues of genetic variation from population-based sequencing provides an opportunity to screen for functional effects. Here we report a rare variant in APOC3 (rs138326449-A, minor allele frequency ~0.25% (UK)) associated with plasma triglyceride (TG) levels (-1.43 s.d. (s.e.=0.27) per minor allele; P-value=8.0 × 10⁻⁸) discovered in 3,202 individuals with low read-depth, whole-genome sequence. We replicate this in 12,831 participants from five additional samples of Northern and Southern European origin (-1.0 s.d. (s.e.=0.173), P-value=7.32 × 10⁻⁹). This is consistent with an effect between 0.5 and 1.5 mmol l⁻¹ dependent on population. We show that a single predicted splice donor variant is responsible for association signals and is independent of known common variants. Analyses suggest an independent relationship between rs138326449 and high-density lipoprotein (HDL) levels. This represents one of the first examples of a rare, large effect variant identified from whole-genome sequencing at a population scale.
The genome of obligately intracellular Ehrlichia canis reveals themes of complex membrane structure and immune evasion strategies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mavromatis, K.; Kuyler Doyle, C.; Lykidis, A.

2005-09-01

Ehrlichia canis, a small obligately intracellular, tick-transmitted, gram-negative, α-proteobacterium, is the primary etiologic agent of globally distributed canine monocytic ehrlichiosis. Complete genome sequencing revealed that the E. canis genome consists of a single circular chromosome of 1,315,030 bp predicted to encode 925 proteins, 40 stable RNA species, 17 putative pseudogenes, and a substantial proportion of non-coding sequence (27 percent). Interesting genome features include a large set of proteins with transmembrane helices and/or signal sequences, and a unique serine-threonine bias associated with the potential for O-glycosylation that was prominent in proteins associated with pathogen-host interactions. Furthermore, two paralogous protein families associated with immune evasion were identified, one of which contains poly G:C tracts, suggesting that they may play a role in phase variation and facilitation of persistent infections. Proteins associated with pathogen-host interactions were identified including a small group of proteins (12) with tandem repeats and another with eukaryotic-like ankyrin domains (7).
Relationships between physical properties and sequence in silkworm silks

NASA Astrophysics Data System (ADS)

Malay, Ali D.; Sato, Ryota; Yazawa, Kenjiro; Watanabe, Hiroe; Ifuku, Nao; Masunaga, Hiroyasu; Hikima, Takaaki; Guan, Juan; Mandal, Biman B.; Damrongsakkul, Siriporn; Numata, Keiji

2016-06-01

Silk has attracted widespread attention due to its superlative material properties and promising applications. However, the determinants behind the variations in material properties among different types of silk are not well understood. We analysed the physical properties of silk samples from a variety of silkmoth cocoons, including domesticated Bombyx mori varieties and several species from Saturniidae. Tensile deformation tests, thermal analyses, and investigations on crystalline structure and orientation of the fibres were performed. The results showed that saturniid silks produce more highly-defined structural transitions compared to B. mori, as seen in the yielding and strain hardening events during tensile deformation and in the changes observed during thermal analyses. These observations were analysed in terms of the constituent fibroin sequences, which in B. mori are predicted to produce heterogeneous structures, whereas the strictly modular repeats of the saturniid sequences are hypothesized to produce structures that respond in a concerted manner. Within saturniid fibroins, thermal stability was found to correlate with the abundance of poly-alanine residues, whereas differences in fibre extensibility can be related to varying ratios of GGX motifs versus bulky hydrophobic residues in the amorphous phase.
A novel LPL intronic variant: g.18704C>A identified by re-sequencing Kuwaiti Arab samples is associated with high-density lipoprotein, very low-density lipoprotein and triglyceride lipid levels.

PubMed

Al-Bustan, Suzanne A; Al-Serri, Ahmad; Annice, Babitha G; Alnaqeeb, Majed A; Al-Kandari, Wafa Y; Dashti, Mohammed

2018-01-01

The role interethnic genetic differences play in plasma lipid level variation across populations is a global health concern. Several genes involved in lipid metabolism and transport are strong candidates for genetic association with lipid level variation, especially lipoprotein lipase (LPL). The objective of this study was to re-sequence the full LPL gene in Kuwaiti Arabs, analyse the sequence variation and identify variants that could contribute to variation in plasma lipid levels for further genetic association. Samples (n = 100) of an Arab ethnic group from Kuwait were analysed for sequence variation by Sanger sequencing across the 30 Kb LPL gene and its flanking sequences. A total of 293 variants including 252 single nucleotide polymorphisms (SNPs) and 39 insertions/deletions (InDels) were identified among which 47 variants (32 SNPs and 15 InDels) were novel to Kuwaiti Arabs. This study is the first to report sequence data and analysis of frequencies of variants at the LPL gene locus in an Arab ethnic group with a novel "rare" variant (LPL:g.18704C>A) significantly associated with HDL (B = -0.181; 95% CI (-0.357, -0.006); p = 0.043), TG (B = 0.134; 95% CI (0.004-0.263); p = 0.044) and VLDL (B = 0.131; 95% CI (-0.001-0.263); p = 0.043) levels. Sequence variation in Kuwaiti Arabs was compared to other populations and was found to be similar with regards to the number of SNPs, InDels and distribution of the number of variants across the LPL gene locus and minor allele frequency (MAF). Moreover, comparison of the identified variants and their MAF with other reports provided a list of 46 potential variants across the LPL gene to be considered for future genetic association studies. The findings warrant further investigation into the association of g.18704C>A with lipid levels in other ethnic groups and with clinical manifestations of dyslipidemia.
A novel LPL intronic variant: g.18704C>A identified by re-sequencing Kuwaiti Arab samples is associated with high-density lipoprotein, very low-density lipoprotein and triglyceride lipid levels

PubMed Central

Al-Serri, Ahmad; Annice, Babitha G.; Alnaqeeb, Majed A.; Al-Kandari, Wafa Y.; Dashti, Mohammed

2018-01-01

The role interethnic genetic differences play in plasma lipid level variation across populations is a global health concern. Several genes involved in lipid metabolism and transport are strong candidates for the genetic association with lipid level variation especially lipoprotein lipase (LPL). The objective of this study was to re-sequence the full LPL gene in Kuwaiti Arabs, analyse the sequence variation and identify variants that could attribute to variation in plasma lipid levels for further genetic association. Samples (n = 100) of an Arab ethnic group from Kuwait were analysed for sequence variation by Sanger sequencing across the 30 Kb LPL gene and its flanking sequences. A total of 293 variants including 252 single nucleotide polymorphisms (SNPs) and 39 insertions/deletions (InDels) were identified among which 47 variants (32 SNPs and 15 InDels) were novel to Kuwaiti Arabs. This study is the first to report sequence data and analysis of frequencies of variants at the LPL gene locus in an Arab ethnic group with a novel “rare” variant (LPL:g.18704C>A) significantly associated to HDL (B = -0.181; 95% CI (-0.357, -0.006); p = 0.043), TG (B = 0.134; 95% CI (0.004–0.263); p = 0.044) and VLDL (B = 0.131; 95% CI (-0.001–0.263); p = 0.043) levels. Sequence variation in Kuwaiti Arabs was compared to other populations and was found to be similar with regards to the number of SNPs, InDels and distribution of the number of variants across the LPL gene locus and minor allele frequency (MAF). Moreover, comparison of the identified variants and their MAF with other reports provided a list of 46 potential variants across the LPL gene to be considered for future genetic association studies. The findings warrant further investigation into the association of g.18704C>A with lipid levels in other ethnic groups and with clinical manifestations of dyslipidemia. PMID:29438437
A statistical approach to detection of copy number variations in PCR-enriched targeted sequencing data.

PubMed

Demidov, German; Simakova, Tamara; Vnuchkova, Julia; Bragin, Anton

2016-10-22

Multiplex polymerase chain reaction (PCR) is a common enrichment technique for targeted massive parallel sequencing (MPS) protocols. MPS is widely used in biomedical research and clinical diagnostics as a fast and accurate tool for the detection of short genetic variations. However, identification of larger variations such as structural variants and copy number variations (CNVs) remains a challenge for targeted MPS. Some approaches and tools for structural variant detection have been proposed, but they have limitations and often require datasets of a certain type, size and expected number of amplicons affected by CNVs. In the paper, we describe a novel algorithm for high-resolution germinal CNV detection in PCR-enriched targeted sequencing data and present an accompanying tool. We have developed a machine learning algorithm for the detection of large duplications and deletions in targeted sequencing data generated with a PCR-based enrichment step. We have performed verification studies and established the algorithm's sensitivity and specificity. We have compared the developed tool with other available methods applicable to the described data and found that it performs better. We showed that our method has high specificity and sensitivity for high-resolution copy number detection in targeted sequencing data using a large cohort of samples.
Characterization of Trichuris trichiura from humans and T. suis from pigs in China using internal transcribed spacers of nuclear ribosomal DNA.

PubMed

Liu, G H; Zhou, W; Nisbet, A J; Xu, M J; Zhou, D H; Zhao, G H; Wang, S K; Song, H Q; Lin, R Q; Zhu, X Q

2014-03-01

Trichuris trichiura and Trichuris suis parasitize (at the adult stage) the caeca of humans and pigs, respectively, causing trichuriasis. Despite these parasites being of human and animal health significance, causing considerable socio-economic losses globally, little is known of the molecular characteristics of T. trichiura and T. suis from China. In the present study, the entire first and second internal transcribed spacer (ITS-1 and ITS-2) regions of nuclear ribosomal DNA (rDNA) of T. trichiura and T. suis from China were amplified by polymerase chain reaction (PCR), the representative amplicons were cloned and sequenced, and sequence variation in the ITS rDNA was examined. The ITS rDNA sequences for the T. trichiura and T. suis samples were 1222-1267 bp and 1339-1353 bp in length, respectively. Sequence analysis revealed that the ITS-1, 5.8S and ITS-2 rDNAs of both whipworms were 600-627 bp and 655-661 bp, 154 bp, and 468-486 bp and 530-538 bp in size, respectively. Sequence variation in ITS rDNA within and among T. trichiura and T. suis was examined. Excluding nucleotide variations in the simple sequence repeats, the intra-species sequence variation in the ITS-1 was 0.2-1.7% within T. trichiura, and 0-1.5% within T. suis. For ITS-2 rDNA, the intra-species sequence variation was 0-1.3% within T. trichiura and 0.2-1.7% within T. suis. The inter-species sequence differences between the two whipworms were 60.7-65.3% for ITS-1 and 59.3-61.5% for ITS-2. These results demonstrated that the ITS rDNA sequences provide additional genetic markers for the characterization and differentiation of the two whipworms. These data should be useful for studying the epidemiology and population genetics of T. trichiura and T. suis, as well as for the diagnosis of trichuriasis in humans and pigs.
Mapping HLA-A2, -A3 and -B7 supertype-restricted T-cell epitopes in the ebolavirus proteome.

PubMed

Lim, Wan Ching; Khan, Asif M

2018-01-19

Ebolavirus (EBOV) is responsible for one of the most fatal diseases encountered by mankind. Cellular T-cell responses have been implicated to be important in providing protection against the virus. Antigenic variation can result in viral escape from immune recognition. Mapping targets of immune responses among the sequence of viral proteins is, thus, an important first step towards understanding the immune responses to viral variants and can aid in the identification of vaccine targets. Herein, we performed a large-scale, proteome-wide mapping and diversity analyses of putative HLA supertype-restricted T-cell epitopes of Zaire ebolavirus (ZEBOV), the most pathogenic species among the EBOV family. All publicly available ZEBOV sequences (14,098) for each of the nine viral proteins were retrieved, removed of irrelevant and duplicate sequences, and aligned. The overall proteome diversity of the non-redundant sequences was studied by use of Shannon's entropy. The sequences were predicted, by use of the NetCTLpan server, for HLA-A2, -A3, and -B7 supertype-restricted epitopes, which are relevant to African and other ethnicities and provide for large (~86%) population coverage. The predicted epitopes were mapped to the alignment of each protein for analyses of antigenic sequence diversity and relevance to structure and function. The putative epitopes were validated by comparison with experimentally confirmed epitopes. ZEBOV proteome was generally conserved, with an average entropy of 0.16. The 185 HLA supertype-restricted T-cell epitopes predicted (82 (A2), 37 (A3) and 66 (B7)) mapped to 125 alignment positions and covered ~24% of the proteome length. Many of the epitopes showed a propensity to co-localize at select positions of the alignment. Thirty (30) of the mapped positions were completely conserved and may be attractive for vaccine design. The remaining (95) positions had one or more epitopes, with or without non-epitope variants. 
A significant number (24) of the putative epitopes matched reported experimentally validated HLA ligands/T-cell epitopes of A2, A3 and/or B7 supertype representative allele restrictions. The epitopes generally corresponded to functional motifs/domains and there was no correlation to localization on the protein 3D structure. These data and the epitope map provide important insights into the interaction between EBOV and the host immune system.
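The diversity measure named in this abstract, Shannon's entropy, can be illustrated with a minimal sketch. The per-column formulation, the function names and the toy alignment below are assumptions made for illustration; the abstract does not state the window size or software the authors used.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over the observed residues; 0.0 for a fully conserved column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(sequences):
    """Entropy at each position of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


# Hypothetical toy alignment, not ZEBOV data.
toy_alignment = ["MKTA", "MKSA", "MRTA", "MKTA"]
print(alignment_entropies(toy_alignment))
```

A completely conserved position scores 0, and the score grows as more variants share the position more evenly, which is why low-entropy positions are the ones flagged as candidate vaccine targets.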

Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information.

PubMed

Song, Jiangning; Burrage, Kevin; Yuan, Zheng; Huber, Thomas

2006-03-09

The majority of peptide bonds in proteins are found to occur in the trans conformation. However, for proline residues, a considerable fraction of Prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function. In this paper, we propose a new approach to predict the proline cis/trans isomerization in proteins using support vector machine (SVM). The preliminary results indicated that using Radial Basis Function (RBF) kernels could lead to better prediction performance than that of polynomial and linear kernel functions. We used single sequence information of different local window sizes, amino acid compositions of different local sequences, multiple sequence alignment obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes in order to investigate their effects on the prediction performance. The training and testing of this approach was performed on a newly enlarged dataset of 2424 non-homologous proteins determined by X-Ray diffraction method using 5-fold cross-validation. Selecting the window size 11 provided the best performance for determining the proline cis/trans isomerization based on the single amino acid sequence. It was found that using multiple sequence alignments in the form of PSI-BLAST profiles could significantly improve the prediction performance, the prediction accuracy increased from 62.8% with single sequence to 69.8% and Matthews Correlation Coefficient (MCC) improved from 0.26 with single local sequence to 0.40. 
Furthermore, if coupled with the predicted secondary structure information by PSIPRED, our method yielded a prediction accuracy of 71.5% and MCC of 0.43, 9% and 0.17 higher than the accuracy achieved based on the single sequence information, respectively. A new method has been developed to predict the proline cis/trans isomerization in proteins based on support vector machine, which used the single amino acid sequence with different local window sizes, the amino acid compositions of local sequence flanking centered proline residues, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of SVM approach in this study reinforced that SVM is a powerful tool in predicting proline cis/trans isomerization in proteins and biological sequence analysis.
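The two performance figures quoted in this abstract, prediction accuracy and the Matthews Correlation Coefficient (MCC), follow from a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a cis (positive) versus trans (negative) classifier.
print(accuracy(tp=40, tn=30, fp=20, fn=10))
print(mcc(tp=40, tn=30, fp=20, fn=10))
```

MCC is usually reported alongside accuracy for problems like this one because cis bonds are much rarer than trans bonds, and accuracy alone can look good for a classifier that rarely predicts the minority class.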
Proteogenomic characterization of human colon and rectal cancer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhang, Bing; Wang, Jing; Wang, Xiaojing

2014-09-18

We analyzed proteomes of colon and rectal tumors previously characterized by the Cancer Genome Atlas (TCGA) and performed integrated proteogenomic analyses. Protein sequence variants encoded by somatic genomic variations displayed reduced expression compared to protein variants encoded by germline variations. mRNA transcript abundance did not reliably predict protein expression differences between tumors. Proteomics identified five protein expression subtypes, two of which were associated with the TCGA "MSI/CIMP" transcriptional subtype, but had distinct mutation and methylation patterns and associated with different clinical outcomes. Although CNAs showed strong cis- and trans-effects on mRNA expression, relatively few of these extend to the protein level. Thus, proteomics data enabled prioritization of candidate driver genes. Our analyses identified HNF4A, a novel candidate driver gene in tumors with chromosome 20q amplifications. Integrated proteogenomic analysis provides functional context to interpret genomic abnormalities and affords novel insights into cancer biology.
Prediction of multi-drug resistance transporters using a novel sequence analysis method [version 2; referees: 2 approved]

DOE PAGES

McDermott, Jason E.; Bruillard, Paul; Overall, Christopher C.; ...

2015-03-09

There are many examples of groups of proteins that have similar function, but the determinants of functional specificity may be hidden by lack of sequence similarity, or by large groups of similar sequences with different functions. Transporters are one such protein group in that the general function, transport, can be easily inferred from the sequence, but the substrate specificity can be impossible to predict from sequence with current methods. In this paper we describe a linguistic-based approach to identify functional patterns from groups of unaligned protein sequences and its application to predict multi-drug resistance transporters (MDRs) from bacteria. We first show that our method can recreate known patterns from PROSITE for several motifs from unaligned sequences. We then show that the method, MDRpred, can predict MDRs with greater accuracy and positive predictive value than a collection of currently available family-based models from the Pfam database. Finally, we apply MDRpred to a large collection of protein sequences from an environmental microbiome study to make novel predictions about drug resistance in a potential environmental reservoir.
Sequence variation in mitochondrial cox1 and nad1 genes of ascaridoid nematodes in cats and dogs from Iran.

PubMed

Mikaeili, F; Mirhendi, H; Mohebali, M; Hosseini, M; Sharbatkhori, M; Zarei, Z; Kia, E B

2015-07-01

The study was conducted to determine the sequence variation in two mitochondrial genes, namely cytochrome c oxidase 1 (pcox1) and NADH dehydrogenase 1 (pnad1) within and among isolates of Toxocara cati, Toxocara canis and Toxascaris leonina. Genomic DNA was extracted from 32 isolates of T. cati, 9 isolates of T. canis and 19 isolates of T. leonina collected from cats and dogs in different geographical areas of Iran. Mitochondrial genes were amplified by polymerase chain reaction (PCR) and sequenced. Sequence data were aligned using the BioEdit software and compared with published sequences in GenBank. Phylogenetic analysis was performed using Bayesian inference and maximum likelihood methods. Based on pairwise comparison, intra-species genetic diversity within Iranian isolates of T. cati, T. canis and T. leonina amounted to 0-2.3%, 0-1.3% and 0-1.0% for pcox1 and 0-2.0%, 0-1.7% and 0-2.6% for pnad1, respectively. Inter-species sequence variation among the three ascaridoid nematodes was significantly higher, being 9.5-16.6% for pcox1 and 11.9-26.7% for pnad1. Sequence and phylogenetic analysis of the pcox1 and pnad1 genes indicated that there is significant genetic diversity within and among isolates of T. cati, T. canis and T. leonina from different areas of Iran, and these genes can be used for studying genetic variation of ascaridoid nematodes.
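The intra- and inter-species percentages reported in this abstract come from pairwise comparison of aligned sequences. A minimal sketch of such a comparison follows; skipping gapped columns and the function name are assumptions for illustration, and the abstract does not say how the authors treated gaps.

```python
def percent_difference(seq_a, seq_b):
    """Percent of differing positions between two aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') are skipped; this is an
    assumption made for this sketch, not a rule taken from the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
                if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    mismatches = sum(1 for a, b in compared if a != b)
    return 100.0 * mismatches / len(compared)


# Hypothetical aligned fragments, not real pcox1 data.
print(percent_difference("ACGTACGT", "ACGTACGA"))
```

Applying the function to every pair of isolates within a species gives the intra-species range, and to pairs drawn from different species the inter-species range.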
Single-strand conformation polymorphism (SSCP)-based mutation scanning approaches to fingerprint sequence variation in ribosomal DNA of ascaridoid nematodes.

PubMed

Zhu, X Q; Gasser, R B

1998-06-01

In this study, we assessed single-strand conformation polymorphism (SSCP)-based approaches for their capacity to fingerprint sequence variation in ribosomal DNA (rDNA) of ascaridoid nematodes of veterinary and/or human health significance. The second internal transcribed spacer region (ITS-2) of rDNA was utilised as the target region because it is known to provide species-specific markers for this group of parasites. ITS-2 was amplified by PCR from genomic DNA derived from individual parasites and subjected to analysis. Direct SSCP analysis of amplicons from seven taxa (Toxocara vitulorum, Toxocara cati, Toxocara canis, Toxascaris leonina, Baylisascaris procyonis, Ascaris suum and Parascaris equorum) showed that the single-strand (ss) ITS-2 patterns produced allowed their unequivocal identification to species. While no variation in SSCP patterns was detected in the ITS-2 within four species for which multiple samples were available, the method allowed the direct display of four distinct sequence types of ITS-2 among individual worms of T. cati. Comparison of SSCP/sequencing with the methods of dideoxy fingerprinting (ddF) and restriction endonuclease fingerprinting (REF) revealed that also ddF allowed the definition of the four sequence types, whereas REF displayed three of four. The findings indicate the usefulness of the SSCP-based approaches for the identification of ascaridoid nematodes to species, the direct display of sequence variation in rDNA and the detection of population variation. The ability to fingerprint microheterogeneity in ITS-2 rDNA using such approaches also has implications for studying fundamental aspects relating to mutational change in rDNA.
Gene and translation initiation site prediction in metagenomic sequences

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hyatt, Philip Douglas; LoCascio, Philip F; Hauser, Loren John

2012-01-01

Gene prediction in metagenomic sequences remains a difficult problem. Current sequencing technologies do not achieve sufficient coverage to assemble the individual genomes in a typical sample; consequently, sequencing runs produce a large number of short sequences whose exact origin is unknown. Since these sequences are usually smaller than the average length of a gene, algorithms must make predictions based on very little data. We present MetaProdigal, a metagenomic version of the gene prediction program Prodigal, that can identify genes in short, anonymous coding sequences with a high degree of accuracy. The novel value of the method consists of enhanced translation initiation site identification, ability to identify sequences that use alternate genetic codes and confidence values for each gene call. We compare the results of MetaProdigal with other methods and conclude with a discussion of future improvements.
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.

PubMed

Höps, Wolfram; Jeffryes, Matt; Bateman, Alex

2018-01-01

We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
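The abstract names stop codons within the presumed translation of homologous DNA as the most informative feature. The sketch below only illustrates that one feature, counting in-frame stop codons under the standard genetic code; it is not Spurio's implementation, which relies on tblastn matches and a Gaussian process model, and the function name is hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; some prokaryotes
# use alternative translation tables).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_in_frame_stops(dna, frame=0):
    """Count stop codons in one forward reading frame (0, 1 or 2) of a DNA string."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    dna = dna.upper()
    # Step through complete codons only; a trailing partial codon is ignored.
    return sum(1 for i in range(frame, len(dna) - 2, 3)
               if dna[i:i + 3] in STOP_CODONS)


# Hypothetical fragment: ATG TAA GGG TGA
print(count_in_frame_stops("ATGTAAGGGTGA"))
```

A homologous region that is interrupted by stops in the frame matching the query protein is evidence that the query may not be a real translated product.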
Copy number variation of individual cattle genomes using next-generation sequencing

USDA-ARS's Scientific Manuscript database

Copy number variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often intractable. Using a read depth approach based on next-generation sequencing, we examined genome-wide copy number differences among five taurine (three Angus, one ...
Copy number variation of individual cattle genomes using next-generation sequencing

USDA-ARS's Scientific Manuscript database

Copy Number Variations (CNVs) affect a wide range of phenotypic traits; however, CNVs in or near segmental duplication regions are often difficult to track. Using a read depth approach based on next generation sequencing, we examined genome-wide copy number differences among five taurine (three Angu...
A high-resolution cattle CNV map by population-scale genome sequencing

USDA-ARS's Scientific Manuscript database

Copy Number Variations (CNVs) are common genomic structural variations that have been linked to human diseases and phenotypic traits. Prior studies in cattle have produced low-resolution CNV maps. We constructed a draft, high-resolution map of cattle CNVs based on whole genome sequencing data from 7...
Maize HapMap2 identifies extant variation from a genome in flux

USDA-ARS's Scientific Manuscript database

The maize genome is the largest, most diverse and complex plant genome sequenced to date. Using high-throughput sequencing to access genetic variation and a population genetics model to score the polymorphisms, we characterize and unite the diversity of the world’s key breeding germplasm, wild rela...
RSAT 2015: Regulatory Sequence Analysis Tools.

PubMed

Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques

2015-07-01

RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Large Scale Comparative Visualisation of Regulatory Networks with TRNDiff

DOE PAGES

Chua, Xin-Yi; Buckingham, Lawrence; Hogan, James M.; ...

2015-06-01

The advent of Next Generation Sequencing (NGS) technologies has seen explosive growth in genomic datasets, and dense coverage of related organisms, supporting study of subtle, strain-specific variations as a determinant of function. Such data collections present fresh and complex challenges for bioinformatics, those of comparing models of complex relationships across hundreds and even thousands of sequences. Transcriptional Regulatory Network (TRN) structures document the influence of regulatory proteins called Transcription Factors (TFs) on associated Target Genes (TGs). TRNs are routinely inferred from model systems or iterative search, and analysis at these scales requires simultaneous displays of multiple networks well beyond those of existing network visualisation tools [1]. In this paper we describe TRNDiff, an open source system supporting the comparative analysis and visualization of TRNs (and similarly structured data) from many genomes, allowing rapid identification of functional variations within species. The approach is demonstrated through a small scale multiple TRN analysis of the Fur iron-uptake system of Yersinia, suggesting a number of candidate virulence factors; and through a larger study exploiting integration with the RegPrecise database (http://regprecise.lbl.gov; [2]) - a collection of hundreds of manually curated and predicted transcription factor regulons drawn from across the entire spectrum of prokaryotic organisms.
ShatterProof: operational detection and quantification of chromothripsis.

PubMed

Govind, Shaylan K; Zia, Amin; Hennings-Yeomans, Pablo H; Watson, John D; Fraser, Michael; Anghel, Catalina; Wyatt, Alexander W; van der Kwast, Theodorus; Collins, Colin C; McPherson, John D; Bristow, Robert G; Boutros, Paul C

2014-03-19

Chromothripsis, a newly discovered type of complex genomic rearrangement, has been implicated in the evolution of several types of cancers. To date, it has been described in bone cancer, SHH-medulloblastoma and acute myeloid leukemia, amongst others, however there are still no formal or automated methods for detecting or annotating it in high throughput sequencing data. As such, findings of chromothripsis are difficult to compare and many cases likely escape detection altogether. We introduce ShatterProof, a software tool for detecting and quantifying chromothriptic events. ShatterProof takes structural variation calls (translocations, copy-number variations, short insertions and loss of heterozygosity) produced by any algorithm and using an operational definition of chromothripsis performs robust statistical tests to accurately predict the presence and location of chromothriptic events. Validation of our tool was conducted using clinical data sets including matched normal, prostate cancer samples in addition to the colorectal cancer and SCLC data sets used in the original description of chromothripsis. ShatterProof is computationally efficient, having low memory requirements and near linear computation time. This allows it to become a standard component of sequencing analysis pipelines, enabling researchers to routinely and accurately assess samples for chromothripsis. Source code and documentation can be found at http://search.cpan.org/~sgovind/Shatterproof.
Long period astronomical cycles from the Triassic to Jurassic bedded chert sequence (Inuyama, Japan); Geologic evidences for the chaotic behavior of solar planets

NASA Astrophysics Data System (ADS)

Ikeda, Masayuki; Tada, Ryuji

2013-04-01

Astronomical theory predicts that the ~2 Myr eccentricity cycle has changed its periodicity and amplitude through time because of the chaotic behavior of solar planets, especially the Earth-Mars secular resonance. Although the ~2 Myr eccentricity cycle has occasionally been recognized in geological records, its frequency transitions have never been reported. To explore the frequency evolution of the ~2 Myr eccentricity cycle, we used the bedded chert sequence in Inuyama, Japan, whose rhythms were proven to be of astronomical origin and which covers a ~30 Myr span from the Triassic to the Jurassic. The frequency modulation of the ~2 Myr cycle between ~1.6 and ~1.8 Myr periodicity, detected by wavelet analysis of chert bed thickness variation, is the first geologic record of a chaotic transition of the Earth-Mars secular resonance. This frequency modulation will provide new constraints for orbital models. Additionally, the ~8 Myr cycle detected in chert bed thickness variation and its amplitude modulation of the ~2 Myr cycle may be related to the amplitude modulation of the ~2 Myr eccentricity cycle through non-linear process(es) of Earth system dynamics, suggesting a possible impact of the chaotic behavior of solar planets on climate change.
Transfer of genetic therapy across human populations: molecular targets for increasing patient coverage in repeat expansion diseases

PubMed Central

Varela, Miguel A; Curtis, Helen J; Douglas, Andrew GL; Hammond, Suzan M; O'Loughlin, Aisling J; Sobrido, Maria J; Scholefield, Janine; Wood, Matthew JA

2016-01-01

Allele-specific gene therapy aims to silence expression of mutant alleles through targeting of disease-linked single-nucleotide polymorphisms (SNPs). However, SNP linkage to disease varies between populations, making such molecular therapies applicable only to a subset of patients. Moreover, not all SNPs have the molecular features necessary for potent gene silencing. Here we provide knowledge to allow the maximisation of patient coverage by building a comprehensive understanding of SNPs ranked according to their predicted suitability toward allele-specific silencing in 14 repeat expansion diseases: amyotrophic lateral sclerosis and frontotemporal dementia, dentatorubral-pallidoluysian atrophy, myotonic dystrophy 1, myotonic dystrophy 2, Huntington's disease and several spinocerebellar ataxias. Our systematic analysis of DNA sequence variation shows that most annotated SNPs are not suitable for potent allele-specific silencing across populations because of suboptimal sequence features and low variability (>97% in HD). We suggest maximising patient coverage by selecting SNPs with high heterozygosity across populations, and preferentially targeting SNPs that lead to purine:purine mismatches in wild-type alleles to obtain potent allele-specific silencing. We therefore provide fundamental knowledge on strategies for optimising patient coverage of therapeutics for microsatellite expansion disorders by linking analysis of population genetic variation to the selection of molecular targets. PMID:25990798
Transfer of genetic therapy across human populations: molecular targets for increasing patient coverage in repeat expansion diseases.

PubMed

Varela, Miguel A; Curtis, Helen J; Douglas, Andrew G L; Hammond, Suzan M; O'Loughlin, Aisling J; Sobrido, Maria J; Scholefield, Janine; Wood, Matthew J A

2016-02-01

Allele-specific gene therapy aims to silence expression of mutant alleles through targeting of disease-linked single-nucleotide polymorphisms (SNPs). However, SNP linkage to disease varies between populations, making such molecular therapies applicable only to a subset of patients. Moreover, not all SNPs have the molecular features necessary for potent gene silencing. Here we provide knowledge to allow the maximisation of patient coverage by building a comprehensive understanding of SNPs ranked according to their predicted suitability toward allele-specific silencing in 14 repeat expansion diseases: amyotrophic lateral sclerosis and frontotemporal dementia, dentatorubral-pallidoluysian atrophy, myotonic dystrophy 1, myotonic dystrophy 2, Huntington's disease and several spinocerebellar ataxias. Our systematic analysis of DNA sequence variation shows that most annotated SNPs are not suitable for potent allele-specific silencing across populations because of suboptimal sequence features and low variability (>97% in HD). We suggest maximising patient coverage by selecting SNPs with high heterozygosity across populations, and preferentially targeting SNPs that lead to purine:purine mismatches in wild-type alleles to obtain potent allele-specific silencing. We therefore provide fundamental knowledge on strategies for optimising patient coverage of therapeutics for microsatellite expansion disorders by linking analysis of population genetic variation to the selection of molecular targets.
Acceleration techniques and their impact on arterial input function sampling: Non-accelerated versus view-sharing and compressed sensing sequences.

PubMed

Benz, Matthias R; Bongartz, Georg; Froehlich, Johannes M; Winkel, David; Boll, Daniel T; Heye, Tobias

2018-07-01

The aim was to investigate the variation of the arterial input function (AIF) within and between various DCE MRI sequences. A dynamic flow-phantom and steady signal reference were scanned on a 3T MRI using fast low angle shot (FLASH) 2d, FLASH3d (parallel imaging factor (P) = P0, P2, P4), volumetric interpolated breath-hold examination (VIBE) (P = P0, P3, P2 × 2, P2 × 3, P3 × 2), golden-angle radial sparse parallel imaging (GRASP), and time-resolved imaging with stochastic trajectories (TWIST). Signal over time curves were normalized and quantitatively analyzed by full width half maximum (FWHM) measurements to assess variation within and between sequences. The coefficient of variation (CV) for the steady signal reference ranged from 0.07-0.8%. The non-accelerated gradient echo FLASH2d, FLASH3d, and VIBE sequences showed low within sequence variation with 2.1%, 1.0%, and 1.6%. The maximum FWHM CV was 3.2% for parallel imaging acceleration (VIBE P2 × 3), 2.7% for GRASP and 9.1% for TWIST. The FWHM CV between sequences ranged from 8.5-14.4% for most non-accelerated/accelerated gradient echo sequences except 6.2% for FLASH3d P0 and 0.3% for FLASH3d P2; GRASP FWHM CV was 9.9% versus 28% for TWIST. MRI acceleration techniques vary in reproducibility and quantification of the AIF. Incomplete coverage of the k-space with TWIST as a representative of view-sharing techniques showed the highest variation within sequences and might be less suited for reproducible quantification of the AIF. Copyright © 2018 Elsevier B.V. All rights reserved.
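The two quantities reported above, the full width at half maximum (FWHM) of a signal-over-time curve and the coefficient of variation (CV) of repeated FWHM measurements, can be sketched as follows. This is a minimal illustration, not the authors' processing pipeline: it assumes a single-peaked curve with a positive peak, linear interpolation between samples, and the sample standard deviation for the CV. The function names are hypothetical.

```python
from statistics import mean, stdev


def _interp(t0, s0, t1, s1, level):
    """Time at which a straight line through (t0, s0) and (t1, s1) reaches `level`."""
    return t0 + (level - s0) * (t1 - t0) / (s1 - s0)


def fwhm(times, signal):
    """Full width at half maximum of a single-peaked curve (linear interpolation)."""
    half = max(signal) / 2.0
    i_peak = signal.index(max(signal))

    # Walk outwards from the peak until the curve is at or below half maximum.
    left = i_peak
    while left > 0 and signal[left] > half:
        left -= 1
    right = i_peak
    while right < len(signal) - 1 and signal[right] > half:
        right += 1
    if signal[left] > half or signal[right] > half:
        raise ValueError("curve does not fall to half maximum on both sides")

    t_rise = _interp(times[left], signal[left], times[left + 1], signal[left + 1], half)
    t_fall = _interp(times[right - 1], signal[right - 1], times[right], signal[right], half)
    return t_fall - t_rise


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)
```

Applying `fwhm` to each repeated acquisition of one sequence and passing the resulting widths to `cv_percent` gives a within-sequence CV of the kind quoted in the abstract.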
A pyrosequencing assay for the quantitative methylation analysis of the PCDHB gene cluster, the major factor in neuroblastoma methylator phenotype.

PubMed

Banelli, Barbara; Brigati, Claudio; Di Vinci, Angela; Casciano, Ida; Forlani, Alessandra; Borzì, Luana; Allemanni, Giorgio; Romani, Massimo

2012-03-01

Epigenetic alterations are hallmarks of cancer and powerful biomarkers, whose clinical utilization is made difficult by the absence of standardization and of common methods of data interpretation. The coordinate methylation of many loci in cancer is defined as 'CpG island methylator phenotype' (CIMP) and identifies clinically distinct groups of patients. In neuroblastoma (NB), CIMP is defined by a methylation signature, which includes different loci, but its predictive power on outcome is entirely recapitulated by the PCDHB cluster alone. We have developed a robust and cost-effective pyrosequencing-based assay that could facilitate the clinical application of CIMP in NB. This assay permits the unbiased simultaneous amplification and sequencing of 17 out of 19 genes of the PCDHB cluster for quantitative methylation analysis, taking into account all the sequence variations. As some of these variations were at CpG doublets, we bypassed the data interpretation conducted by the methylation analysis software to assign the corrected methylation value at these sites. The final result of the assay is the mean methylation level of 17 gene fragments in the protocadherin B (PCDHB) cluster. We have utilized this assay to compare the methylation levels of the PCDHB cluster between high-risk and very low-risk NB patients, confirming the predictive value of CIMP. Our results demonstrate that the pyrosequencing-based assay herein described is a powerful instrument for the analysis of this gene cluster that may simplify the data comparison between different laboratories and, in perspective, could facilitate its clinical application. Furthermore, our results demonstrate that, in principle, pyrosequencing can be efficiently utilized for the methylation analysis of gene clusters with high internal homologies.
Human milk peptides differentiate between the preterm and term infant and across varying lactational stages.

PubMed

Dingess, Kelly A; de Waard, Marita; Boeren, Sjef; Vervoort, Jacques; Lambers, Tim T; van Goudoever, Johannes B; Hettinga, Kasper

2017-10-18

Variations in endogenous peptide profiles, functionality, and the enzymes responsible for the formation of these peptides in human milk are understudied. Additionally, there is a lack of knowledge regarding peptides in donor human milk, which is used to feed preterm infants when mother's own milk is not (sufficiently) available. To assess this, 29 human milk samples from the Dutch Human Milk Bank were analyzed as three groups, preterm late lactation stage (LS) (n = 12), term early (n = 8) and term late LS (n = 9). Gestational age (GA) groups were defined as preterm (24-36 weeks) and term (≥37 weeks). LS was determined as days postpartum as early (16-36 days) or late (55-88 days). Peptides, analyzed by LC-MS/MS, and parent proteins (proteins from matched peptide sequences) were identified and quantified, after which peptide functionality and the enzymes responsible for protein cleavage were determined. A total of 16 different parent proteins were identified from human milk, with no differences by GA or LS. We identified 1104 endogenous peptides, of which the majority were from the parent proteins β-casein, polymeric immunoglobulin receptor, αs1-casein, osteopontin, and κ-casein. The absolute number of peptides differed by GA and LS with 30 and 41 differing sequences respectively (p < 0.05). Odds likelihood tests determined that 32 peptides had a predicted bioactive functionality, with no significant differences between groups. Enzyme prediction analysis showed that plasmin/trypsin enzymes most likely cleaved the identified human milk peptides. These results explain some of the variation in endogenous peptides in human milk, leading to future targets that may be studied for functionality.

Evaluation of targeted exome sequencing for 28 protein-based blood group systems, including the homologous gene systems, for blood group genotyping.

PubMed

Schoeman, Elizna M; Lopez, Genghis H; McGowan, Eunike C; Millard, Glenda M; O'Brien, Helen; Roulis, Eileen V; Liew, Yew-Wah; Martin, Jacqueline R; McGrath, Kelli A; Powley, Tanya; Flower, Robert L; Hyland, Catherine A

2017-04-01

Blood group single nucleotide polymorphism genotyping probes for a limited range of polymorphisms. This study investigated whether massively parallel sequencing (also known as next-generation sequencing), with a targeted exome strategy, provides an extended blood group genotype and the extent to which massively parallel sequencing correctly genotypes in homologous gene systems, such as RH and MNS. Donor samples (n = 28) that were extensively phenotyped and genotyped using single nucleotide polymorphism typing, were analyzed using the TruSight One Sequencing Panel and MiSeq platform. Genes for 28 protein-based blood group systems, GATA1, and KLF1 were analyzed. Copy number variation analysis was used to characterize complex structural variants in the GYPC and RH systems. The average sequencing depth per target region was 66.2 ± 39.8. Each sample harbored on average 43 ± 9 variants, of which 10 ± 3 were used for genotyping. For the 28 samples, massively parallel sequencing variant sequences correctly matched expected sequences based on single nucleotide polymorphism genotyping data. Copy number variation analysis defined the Rh C/c alleles and complex RHD hybrids. Hybrid RHD*D-CE-D variants were correctly identified, but copy number variation analysis did not confidently distinguish between D and CE exon deletion versus rearrangement. The targeted exome sequencing strategy employed extended the range of blood group genotypes detected compared with single nucleotide polymorphism typing. This single-test format included detection of complex MNS hybrid cases and, with copy number variation analysis, defined RH hybrid genes along with the RHCE*C allele hitherto difficult to resolve by variant detection. The approach is economical compared with whole-genome sequencing and is suitable for a red blood cell reference laboratory setting. © 2017 AABB.
Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

PubMed

Yin, Changchuan

2015-04-01

To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences must first be mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal-to-noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction using a Short-Time Fourier Transform (STFT) approach. The results show that the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
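The 3-periodicity signal that the abstract refers to can be illustrated with the baseline 4D binary representation, in which each nucleotide gets its own 0/1 indicator sequence. The sketch below is an assumption-laden illustration, not the GCC method: it takes the SNR to be the summed power at frequency N/3 divided by the mean power over all non-zero frequencies, which is one common definition, and it does not reproduce the GCC mapping or its optimization. The function names are hypothetical, and the direct DFT is only practical for short sequences.

```python
import cmath


def power_at(seq, k):
    """Summed DFT power at frequency index k over the four binary indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k * pos / n)
            for pos, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total


def three_periodicity_snr(seq):
    """Power at frequency N/3 divided by the mean power over k = 1..N-1 (assumed definition)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    spectrum = [power_at(seq, k) for k in range(1, n)]
    mean_power = sum(spectrum) / len(spectrum)
    if mean_power == 0.0:
        return 0.0
    return spectrum[n // 3 - 1] / mean_power


periodic = three_periodicity_snr("ATG" * 10)          # perfectly codon-periodic toy input
aperiodic = three_periodicity_snr("ACGT" * 7 + "AC")  # period-4 toy input, same length
```

A perfectly 3-periodic toy sequence concentrates all non-DC power at N/3 and 2N/3, so its SNR reaches the upper bound (N - 1)/2, whereas the period-4 sequence has very little power at N/3.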
Transcription of TP0126, Treponema pallidum Putative OmpW Homolog, Is Regulated by the Length of a Homopolymeric Guanosine Repeat

PubMed Central

Brandt, Stephanie L.; Ke, Wujian; Reid, Tara B.; Molini, Barbara J.; Iverson-Cabral, Stefanie; Ciccarese, Giulia; Drago, Francesco; Lukehart, Sheila A.; Centurion-Lara, Arturo

2015-01-01

An effective mechanism for introduction of phenotypic diversity within a bacterial population exploits changes in the length of repetitive DNA elements located within gene promoters. This phenomenon, known as phase variation, causes rapid activation or silencing of gene expression and fosters bacterial adaptation to new or changing environments. Phase variation often occurs in surface-exposed proteins, and in Treponema pallidum subsp. pallidum, the syphilis agent, it was reported to affect transcription of three putative outer membrane protein (OMP)-encoding genes. When the T. pallidum subsp. pallidum Nichols strain genome was initially annotated, the TP0126 open reading frame was predicted to include a poly(G) tract and did not appear to have a predicted signal sequence that might suggest the possibility of its being an OMP. Here we show that the initial annotation was incorrect, that this poly(G) is instead located within the TP0126 promoter, and that it varies in length in vivo during experimental syphilis. Additionally, we show that TP0126 transcription is affected by changes in the poly(G) length consistent with regulation by phase variation. In silico analysis of the TP0126 open reading frame based on the experimentally identified transcriptional start site shortens this hypothetical protein by 69 amino acids, reveals a predicted cleavable signal peptide, and suggests structural homology with the OmpW family of porins. Circular dichroism of recombinant TP0126 supports structural homology to OmpW. Together with the evidence that TP0126 is fully conserved among T. pallidum subspecies and strains, these data suggest an important role for TP0126 in T. pallidum biology and syphilis pathogenesis. PMID:25802057
The coalescent process in models with selection and recombination.

PubMed

Hudson, R R; Kaplan, N L

1988-11-01

The statistical properties of the process describing the genealogical history of a random sample of genes at a selectively neutral locus which is linked to a locus at which natural selection operates are investigated. It is found that the equations describing this process are simple modifications of the equations describing the process assuming that the two loci are completely linked. Thus, the statistical properties of the genealogical process for a random sample at a neutral locus linked to a locus with selection follow from the results obtained for the selected locus. Sequence data from the alcohol dehydrogenase (Adh) region of Drosophila melanogaster are examined and compared to predictions based on the theory. It is found that the spatial distribution of nucleotide differences between Fast and Slow alleles of Adh is very similar to the spatial distribution predicted if balancing selection operates to maintain the allozyme variation at the Adh locus. The spatial distribution of nucleotide differences between different Slow alleles of Adh does not match the predictions of this simple model very well.
Characterization of four species of Trichuris (Nematoda: Enoplida) by their second internal transcribed spacer ribosomal DNA sequence.

PubMed

Oliveros, R; Cutillas, C; De Rojas, M; Arias, P

2000-12-01

Adult worms of Trichuris ovis and T. globulosa were collected from Ovis aries (sheep) and Capra hircus (goats). T. suis was isolated from Sus scrofa domestica (swine) and T. leporis was isolated from Lepus europaeus (rabbits) in Spain. Genomic DNA was isolated and a ribosomal internal transcribed spacer (ITS2) was amplified and sequenced using polymerase-chain-reaction (PCR) techniques. The ITS2 of T. ovis and T. globulosa was 407 nucleotides in length and had a GC content of about 62%. Furthermore, the ITS2 of T. suis and T. leporis was 534 and 418 nucleotides in length and had a GC content of about 64.8% and 62.4%, respectively. There was evidence of slight variation in the sequence within individuals of all species analyzed, indicating intraindividual variation in the sequence of different copies of the ribosomal DNA. Furthermore, low-level intraspecific variation was detected. Sequence analyses of ITS2 products of T. ovis and T. globulosa demonstrated no sequence difference between them. Nevertheless, differences were detected between the ITS2 sequences of T. suis, T. leporis, and T. ovis, indicating that Trichuris species can reliably be differentiated by their ITS2 sequences and PCR-linked restriction-fragment-length polymorphism (RFLP).
A global reference for human genetic variation

PubMed Central

2016-01-01

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies. PMID:26432245
Characterization of the ligand-binding site of the transferrin receptor in Trypanosoma brucei demonstrates a structural relationship with the N-terminal domain of the variant surface glycoprotein.

PubMed

Salmon, D; Hanocq-Quertier, J; Paturiaux-Hanocq, F; Pays, A; Tebabi, P; Nolan, D P; Michel, A; Pays, E

1997-12-15

The Trypanosoma brucei transferrin (Tf) receptor is a heterodimer encoded by ESAG7 and ESAG6, two genes contained in the different polycistronic transcription units of the variant surface glycoprotein (VSG) gene. The sequence of ESAG7/6 differs slightly between different units, so that receptors with different affinities for Tf are expressed alternatively following transcriptional switching of VSG expression sites during antigenic variation of the parasite. Based on the sequence homology between pESAG7/6 and the N-terminal domain of VSGs, it can be predicted that the four blocks containing the major sequence differences between pESAG7 and pESAG6 form surface-exposed loops and generate the ligand-binding site. The exchange of a few amino acids in this region between pESAG6s encoded by different VSG units greatly increased the affinity for bovine Tf. Similar changes in other regions were ineffective, while mutations predicted to alter the VSG-like structure abolished the binding. Chimeric proteins containing the N-terminal dimerization domain of VSG and the C-terminal half of either pESAG7 or pESAG6, which contains the ligand-binding domain, can form heterodimers that bind Tf. Taken together, these data provided evidence that the T.brucei Tf receptor is structurally related to the N-terminal domain of the VSG and that the ligand-binding site corresponds to the exposed surface loops of the protein.
Genetic spectrum of low density lipoprotein receptor gene variations in South Indian population.

PubMed

ArulJothi, K N; Suruthi Abirami, B; Devi, Arikketh

2018-03-01

Low density lipoprotein receptor (LDLR) is a membrane-bound receptor that maintains cholesterol homeostasis along with Apolipoprotein B (APOB), Proprotein Convertase Subtilisin/Kexin type 9 (PCSK9) and other genes of lipid metabolism. Any pathogenic variation in these genes alters the function of the receptor and leads to Familial Hypercholesterolemia (FH) and other cardiovascular diseases. This study aimed to screen the LDLR, APOB and PCSK9 genes in hypercholesterolemic patients to define the genetic spectrum of FH in the Indian population. Familial Hypercholesterolemia patients (n=78) of the South Indian Tamil population with LDL cholesterol and total cholesterol levels above 4.9 mmol/l and 7.5 mmol/l, respectively, and with a family history of myocardial infarction were included. DNA was isolated from blood samples by the organic extraction method, and LDLR, APOB and PCSK9 gene exons were amplified using primers that cover exon-intron boundaries. The amplicons were screened using High Resolution Melt (HRM) analysis and the screened samples were sequenced after purification. This study reports 20 variations in the South Indian population for the first time. Of these, 9 are novel and 11 have also been reported in other studies. In silico analysis of all the variations detected in this study was performed to predict their probable effect on the pathogenicity of FH. This study adds 9 novel variations and 11 recurrent variations to the spectrum of LDLR gene mutations in the Indian population. All these variations are reported for the first time in the Indian population. This spectrum of variations was different from the variations of previous Indian reports. Copyright © 2017 Elsevier B.V. All rights reserved.
Diversity and evolutionary patterns of immune genes in free-ranging Namibian leopards (Panthera pardus pardus).

PubMed

Castro-Prieto, Aines; Wachter, Bettina; Melzheimer, Joerg; Thalwitzer, Susanne; Sommer, Simone

2011-01-01

The genes of the major histocompatibility complex (MHC) are a key component of the mammalian immune system and have become important molecular markers for fitness-related genetic variation in wildlife populations. Currently, no information about the MHC sequence variation and constitution in African leopards exists. In this study, we isolated and characterized genetic variation at the adaptively most important region of MHC class I and MHC class II-DRB genes in 25 free-ranging African leopards from Namibia and investigated the mechanisms that generate and maintain MHC polymorphism in the species. Using single-stranded conformation polymorphism analysis and direct sequencing, we detected 6 MHC class I and 6 MHC class II-DRB sequences, which likely correspond to at least 3 MHC class I and 3 MHC class II-DRB loci. Amino acid sequence variation in both MHC classes was higher or similar in comparison to other reported felids. We found signatures of positive selection shaping the diversity of MHC class I and MHC class II-DRB loci during the evolutionary history of the species. A comparison of MHC class I and MHC class II-DRB sequences of the leopard to those of other felids revealed a trans-species mode of evolution. In addition, the evolutionary relationships of MHC class II-DRB sequences between African and Asian leopard subspecies are discussed.
HomoSAR: bridging comparative protein modeling with quantitative structural activity relationship to design new peptides.

PubMed

Borkar, Mahesh R; Pissurlenkar, Raghuvir R S; Coutinho, Evans C

2013-11-15

Peptides play significant roles in the biological world. To optimize activity for a specific therapeutic target, peptide library synthesis is inevitable, which is time-consuming and expensive. Computational approaches provide a promising way to elucidate the structural basis in the design of new peptides. Earlier, we proposed a novel methodology termed HomoSAR to gain insight into the structure activity relationships underlying peptides. Based on an integrated approach, HomoSAR uses the principles of homology modeling in conjunction with the quantitative structural activity relationship formalism to predict and design new peptide sequences with the optimum activity. In the present study, we establish that the HomoSAR methodology can be universally applied to all classes of peptides irrespective of sequence length by studying HomoSAR on three peptide datasets viz., angiotensin-converting enzyme inhibitory peptides, CAMEL-s antibiotic peptides, and hAmphiphysin-1 SH3 domain binding peptides, using a set of descriptors related to the hydrophobic, steric, and electronic properties of the 20 natural amino acids. Models generated for all three datasets have statistically significant correlation coefficients (r²), predictive r² (r²(pred)) and cross-validated coefficients (q²(LOO)). The elegance of this technique lies in its simplicity and ability to extract all the information contained in the peptides to elucidate the underlying structure activity relationships. The difficulties of correlating both sequence diversity and variation in length of the peptides with their biological activity can be addressed. The study has been able to identify the preferred or detrimental nature of amino acids at specific positions in the peptide sequences. Copyright © 2013 Wiley Periodicals, Inc.
KinView: A visual comparative sequence analysis tool for integrated kinome research

PubMed Central

McSkimming, Daniel Ian; Dastgheib, Shima; Baffi, Timothy R.; Byrne, Dominic P.; Ferries, Samantha; Scott, Steven Thomas; Newton, Alexandra C.; Eyers, Claire E.; Kochut, Krzysztof J.; Eyers, Patrick A.

2017-01-01

Multiple sequence alignments (MSAs) are a fundamental analysis tool used throughout biology to investigate relationships between protein sequence, structure, function, evolutionary history, and patterns of disease-associated variants. However, their widespread application in systems biology research is currently hindered by the lack of user-friendly tools to simultaneously visualize, manipulate and query the information conceptualized in large sequence alignments, and the challenges in integrating MSAs with multiple orthogonal data such as cancer variants and post-translational modifications, which are often stored in heterogeneous data sources and formats. Here, we present the Multiple Sequence Alignment Ontology (MSAOnt), which represents a profile or consensus alignment in an ontological format. Subsets of the alignment are easily selected through the SPARQL Protocol and RDF Query Language for downstream statistical analysis or visualization. We have also created the Kinome Viewer (KinView), an interactive integrative visualization that places eukaryotic protein kinase cancer variants in the context of natural sequence variation and experimentally determined post-translational modifications, which play central roles in the regulation of cellular signaling pathways. Using KinView, we identified differential phosphorylation patterns between tyrosine and serine/threonine kinases in the activation segment, a major kinase regulatory region that is often mutated in proliferative diseases. We discuss cancer variants that disrupt phosphorylation sites in the activation segment, and show how KinView can be used as a comparative tool to identify differences and similarities in natural variation, cancer variants and post-translational modifications between kinase groups, families and subfamilies. Based on KinView comparisons, we identify and experimentally characterize a regulatory tyrosine (Y177 in PLK4) in the PLK4 C-terminal activation segment region termed the P+1 loop. To further demonstrate the application of KinView in hypothesis generation and testing, we formulate and validate a hypothesis explaining a novel predicted loss-of-function variant (D523N in PKCβ) in the regulatory spine of PKCβ, a recently identified tumor suppressor kinase. KinView provides a novel, extensible interface for performing comparative analyses between subsets of kinases and for integrating multiple types of residue-specific annotations in user-friendly formats. PMID:27731453
Adaptive diversification of growth allometry in the plant Arabidopsis thaliana.

PubMed

Vasseur, François; Exposito-Alonso, Moises; Ayala-Garay, Oscar J; Wang, George; Enquist, Brian J; Vile, Denis; Violle, Cyrille; Weigel, Detlef

2018-03-27

Seed plants vary tremendously in size and morphology; however, variation and covariation in plant traits may be governed, at least in part, by universal biophysical laws and biological constants. Metabolic scaling theory (MST) posits that whole-organismal metabolism and growth rate are under stabilizing selection that minimizes the scaling of hydrodynamic resistance and maximizes the scaling of resource uptake. This constrains variation in physiological traits and in the rate of biomass accumulation, so that they can be expressed as mathematical functions of plant size with near-constant allometric scaling exponents across species. However, the observed variation in scaling exponents calls into question the evolutionary drivers and the universality of allometric equations. We have measured growth scaling and fitness traits of 451 Arabidopsis thaliana accessions with sequenced genomes. Variation among accessions around the scaling exponent predicted by MST was correlated with relative growth rate, seed production, and stress resistance. Genomic analyses indicate that growth allometry is affected by many genes associated with local climate and abiotic stress response. The gene with the strongest effect, PUB4, has molecular signatures of balancing selection, suggesting that intraspecific variation in growth scaling is maintained by opposing selection on the trade-off between seed production and abiotic stress resistance. Our findings suggest that variation in allometry contributes to local adaptation to contrasting environments. Our results help reconcile past debates on the origin of allometric scaling in biology and begin to link adaptive variation in allometric scaling to specific genes. Copyright © 2018 the Author(s). Published by PNAS.
Adaptive diversification of growth allometry in the plant Arabidopsis thaliana

PubMed Central

Vasseur, François; Ayala-Garay, Oscar J.; Wang, George; Enquist, Brian J.; Vile, Denis; Violle, Cyrille

2018-01-01

Seed plants vary tremendously in size and morphology; however, variation and covariation in plant traits may be governed, at least in part, by universal biophysical laws and biological constants. Metabolic scaling theory (MST) posits that whole-organismal metabolism and growth rate are under stabilizing selection that minimizes the scaling of hydrodynamic resistance and maximizes the scaling of resource uptake. This constrains variation in physiological traits and in the rate of biomass accumulation, so that they can be expressed as mathematical functions of plant size with near-constant allometric scaling exponents across species. However, the observed variation in scaling exponents calls into question the evolutionary drivers and the universality of allometric equations. We have measured growth scaling and fitness traits of 451 Arabidopsis thaliana accessions with sequenced genomes. Variation among accessions around the scaling exponent predicted by MST was correlated with relative growth rate, seed production, and stress resistance. Genomic analyses indicate that growth allometry is affected by many genes associated with local climate and abiotic stress response. The gene with the strongest effect, PUB4, has molecular signatures of balancing selection, suggesting that intraspecific variation in growth scaling is maintained by opposing selection on the trade-off between seed production and abiotic stress resistance. Our findings suggest that variation in allometry contributes to local adaptation to contrasting environments. Our results help reconcile past debates on the origin of allometric scaling in biology and begin to link adaptive variation in allometric scaling to specific genes. PMID:29540570
Copy Number Variations in Tilapia Genomes.

PubMed

Li, Bi Jun; Li, Hong Lian; Meng, Zining; Zhang, Yong; Lin, Haoran; Yue, Gen Hua; Xia, Jun Hong

2017-02-01

Discovering the nature and pattern of genome variation is fundamental in understanding phenotypic diversity among populations. Although several million single nucleotide polymorphisms (SNPs) have been discovered in tilapia, the genome-wide characterization of larger structural variants, such as copy number variation (CNV) regions, has not yet been carried out. We conducted a genome-wide scan for CNVs in 47 individuals from three tilapia populations. Based on 254 Gb of high-quality paired-end sequencing reads, we identified 4642 distinct high-confidence CNVs. These CNVs account for 1.9% (12.411 Mb) of the Nile tilapia reference genome used. A total of 1100 predicted CNVs were found to overlap with exon regions of protein-coding genes. Further association analysis based on linear model regression found 85 CNVs ranging between 300 and 27,000 base pairs significantly associated with population types (R² > 0.9 and P > 0.001). Our study provides first insights into genome-wide CNVs in tilapia. These CNVs among and within tilapia populations may have functional effects on phenotypes and specific adaptation to particular environments.
POPISK: T-cell reactivity prediction using support vector machines and string kernels

PubMed Central

2011-01-01

Background: Accurate prediction of peptide immunogenicity and characterization of the relation between peptide sequences and peptide immunogenicity will be greatly helpful for vaccine design and understanding of the immune system. In contrast to the prediction of antigen processing and presentation pathway, the prediction of subsequent T-cell reactivity is a much harder topic. Previous studies of identifying T-cell receptor (TCR) recognition positions were based on small-scale analyses using only a few peptides and concluded different recognition positions such as positions 4, 6 and 8 of peptides with length 9. Large-scale analyses are necessary to better characterize the effect of peptide sequence variations on T-cell reactivity and design predictors of a peptide's T-cell reactivity (and thus immunogenicity). The identification and characterization of important positions influencing T-cell reactivity will provide insights into the underlying mechanism of immunogenicity. Results: This work establishes a large dataset by collecting immunogenicity data from three major immunology databases. In order to consider the effect of MHC restriction, peptides are classified by their associated MHC alleles. Subsequently, a computational method (named POPISK) using a support vector machine with a weighted degree string kernel is proposed to predict T-cell reactivity and identify important recognition positions. POPISK yields a mean 10-fold cross-validation accuracy of 68% in predicting T-cell reactivity of HLA-A2-binding peptides. POPISK is capable of predicting immunogenicity with scores that can also correctly predict the change in T-cell reactivity related to point mutations in epitopes reported in previous studies using crystal structures. Thorough analyses of the prediction results identify the important positions 4, 6, 8 and 9, and yield insights into the molecular basis for TCR recognition. Finally, we relate this finding to physicochemical properties and structural features of the MHC-peptide-TCR interaction. Conclusions: A computational method, POPISK, is proposed to predict immunogenicity with scores which are useful for predicting immunogenicity changes made by single-residue modifications. The web server of POPISK is freely available at http://iclab.life.nctu.edu.tw/POPISK. PMID:22085524
POPISK: T-cell reactivity prediction using support vector machines and string kernels.

PubMed

Tung, Chun-Wei; Ziehm, Matthias; Kämper, Andreas; Kohlbacher, Oliver; Ho, Shinn-Ying

2011-11-15

Accurate prediction of peptide immunogenicity and characterization of relation between peptide sequences and peptide immunogenicity will be greatly helpful for vaccine designs and understanding of the immune system. In contrast to the prediction of antigen processing and presentation pathway, the prediction of subsequent T-cell reactivity is a much harder topic. Previous studies of identifying T-cell receptor (TCR) recognition positions were based on small-scale analyses using only a few peptides and concluded different recognition positions such as positions 4, 6 and 8 of peptides with length 9. Large-scale analyses are necessary to better characterize the effect of peptide sequence variations on T-cell reactivity and design predictors of a peptide's T-cell reactivity (and thus immunogenicity). The identification and characterization of important positions influencing T-cell reactivity will provide insights into the underlying mechanism of immunogenicity. This work establishes a large dataset by collecting immunogenicity data from three major immunology databases. In order to consider the effect of MHC restriction, peptides are classified by their associated MHC alleles. Subsequently, a computational method (named POPISK) using support vector machine with a weighted degree string kernel is proposed to predict T-cell reactivity and identify important recognition positions. POPISK yields a mean 10-fold cross-validation accuracy of 68% in predicting T-cell reactivity of HLA-A2-binding peptides. POPISK is capable of predicting immunogenicity with scores that can also correctly predict the change in T-cell reactivity related to point mutations in epitopes reported in previous studies using crystal structures. Thorough analyses of the prediction results identify the important positions 4, 6, 8 and 9, and yield insights into the molecular basis for TCR recognition. 
Finally, we relate this finding to physicochemical properties and structural features of the MHC-peptide-TCR interaction. A computational method POPISK is proposed to predict immunogenicity with scores which are useful for predicting immunogenicity changes made by single-residue modifications. The web server of POPISK is freely available at http://iclab.life.nctu.edu.tw/POPISK.
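The weighted degree string kernel named in the POPISK abstract can be illustrated with a minimal sketch. It uses the commonly cited form of the kernel, in which matching substrings of length d at the same position contribute a weight of 2(D - d + 1) / (D(D + 1)); the maximum degree, the function names, and the example peptides are assumptions made for illustration and are not taken from the POPISK paper.

```python
def wd_kernel(x: str, y: str, max_degree: int = 3) -> float:
    """Weighted degree string kernel for two equal-length sequences.

    For each substring length d up to max_degree, count the positions where
    the length-d substrings of x and y are identical, and weight that count
    by beta_d = 2 * (D - d + 1) / (D * (D + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    D = max_degree
    total = 0.0
    for d in range(1, D + 1):
        beta = 2.0 * (D - d + 1) / (D * (D + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


def gram_matrix(peptides, max_degree: int = 3):
    """Pairwise kernel values for a list of equal-length peptides."""
    return [[wd_kernel(a, b, max_degree) for b in peptides] for a in peptides]


# Illustrative 9-mers: a peptide and a hypothetical point mutant at position 4.
wild_type = "GILGFVFTL"
mutant = "GILAFVFTL"
print(wd_kernel(wild_type, wild_type), wd_kernel(wild_type, mutant))
```

A Gram matrix built this way could be passed to any support vector machine implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Because the kernel compares substrings position by position, a single-residue change lowers the similarity only through the substrings that overlap the changed position.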
Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) Version 3.0 User Guide

EPA Science Inventory

User Guide to describe the complete functionality of the Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) Version 3.0 online tool. The US Environmental Protection Agency Sequence Alignment to Predict Across Species Susceptibility tool (SeqAPASS; https://seqa...
The evolution of transcriptional regulation in eukaryotes

NASA Technical Reports Server (NTRS)

Wray, Gregory A.; Hahn, Matthew W.; Abouheif, Ehab; Balhoff, James P.; Pizer, Margaret; Rockman, Matthew V.; Romano, Laura A.

2003-01-01

Gene expression is central to the genotype-phenotype relationship in all organisms, and it is an important component of the genetic basis for evolutionary change in diverse aspects of phenotype. However, the evolution of transcriptional regulation remains understudied and poorly understood. Here we review the evolutionary dynamics of promoter, or cis-regulatory, sequences and the evolutionary mechanisms that shape them. Existing evidence indicates that populations harbor extensive genetic variation in promoter sequences, that a substantial fraction of this variation has consequences for both biochemical and organismal phenotype, and that some of this functional variation is sorted by selection. As with protein-coding sequences, rates and patterns of promoter sequence evolution differ considerably among loci and among clades for reasons that are not well understood. Studying the evolution of transcriptional regulation poses empirical and conceptual challenges beyond those typically encountered in analyses of coding sequence evolution: promoter organization is much less regular than that of coding sequences, and sequences required for the transcription of each locus reside at multiple other loci in the genome. Because of the strong context-dependence of transcriptional regulation, sequence inspection alone provides limited information about promoter function. Understanding the functional consequences of sequence differences among promoters generally requires biochemical and in vivo functional assays. Despite these challenges, important insights have already been gained into the evolution of transcriptional regulation, and the pace of discovery is accelerating.
Parallel gene analysis with allele-specific padlock probes and tag microarrays

PubMed Central

Banér, Johan; Isaksson, Anders; Waldenström, Erik; Jarvius, Jonas; Landegren, Ulf; Nilsson, Mats

2003-01-01

Parallel, highly specific analysis methods are required to take advantage of the extensive information about DNA sequence variation and of expressed sequences. We present a scalable laboratory technique suitable to analyze numerous target sequences in multiplexed assays. Sets of padlock probes were applied to analyze single nucleotide variation directly in total genomic DNA or cDNA for parallel genotyping or gene expression analysis. All reacted probes were then co-amplified and identified by hybridization to a standard tag oligonucleotide array. The technique was illustrated by analyzing normal and pathogenic variation within the Wilson disease-related ATP7B gene, both at the level of DNA and RNA, using allele-specific padlock probes. PMID:12930977
Sequence, distribution and chromosomal context of class I and class II pilin genes of Neisseria meningitidis identified in whole genome sequences

PubMed Central

2014-01-01

Background Neisseria meningitidis expresses type four pili (Tfp) which are important for colonisation and virulence. Tfp have been considered as one of the most variable structures on the bacterial surface due to high frequency gene conversion, resulting in amino acid sequence variation of the major pilin subunit (PilE). Meningococci express either a class I or a class II pilE gene and recent work has indicated that class II pilins do not undergo antigenic variation, as class II pilE genes encode conserved pilin subunits. The purpose of this work was to use whole genome sequences to further investigate the frequency and variability of the class II pilE genes in meningococcal isolate collections. Results We analysed over 600 publicly available whole genome sequences of N. meningitidis isolates to determine the sequence and genomic organization of pilE. We confirmed that meningococcal strains belonging to a limited number of clonal complexes (ccs, namely cc1, cc5, cc8, cc11 and cc174) harbour a class II pilE gene which is conserved in terms of sequence and chromosomal context. We also identified pilS cassettes in all isolates with class II pilE; however, our analysis indicates that these do not serve as donor sequences for pilE/pilS recombination. Furthermore, our work reveals that the class II pilE locus lacks the DNA sequence motifs that enable (G4) or enhance (Sma/Cla repeat) pilin antigenic variation. Finally, through analysis of pilin genes in commensal Neisseria species we found that meningococcal class II pilE genes are closely related to pilE from Neisseria lactamica and Neisseria polysaccharea, suggesting horizontal transfer among these species. Conclusions Class II pilins can be defined by their amino acid sequence and genomic context and are present in meningococcal isolates which have persisted and spread globally. 
The absence of G4 and Sma/Cla sequences adjacent to the class II pilE genes is consistent with the lack of pilin subunit variation in these isolates, although horizontal transfer may generate class II pilin diversity. This study supports the suggestion that high frequency antigenic variation of pilin is not universal in pathogenic Neisseria. PMID:24690385

Draft versus finished sequence data for DNA and protein diagnostic signature development

PubMed Central

Gardner, Shea N.; Lam, Marisa W.; Smith, Jason R.; Torres, Clinton L.; Slezak, Tom R.

2005-01-01

Sequencing pathogen genomes is costly, demanding careful allocation of limited sequencing resources. We built a computational Sequencing Analysis Pipeline (SAP) to guide decisions regarding the amount of genomic sequencing necessary to develop high-quality diagnostic DNA and protein signatures. SAP uses simulations to estimate the number of target genomes and close phylogenetic relatives (near neighbors or NNs) to sequence. We use SAP to assess whether draft data are sufficient or finished sequencing is required using Marburg and variola virus sequences. Simulations indicate that intermediate to high-quality draft with error rates of 10⁻³ to 10⁻⁵ (∼8× coverage) of target organisms is suitable for DNA signature prediction. Low-quality draft with error rates of ∼1% (3× to 6× coverage) of target isolates is inadequate for DNA signature prediction, although low-quality draft of NNs is sufficient, as long as the target genomes are of high quality. For protein signature prediction, sequencing errors in target genomes substantially reduce the detection of amino acid sequence conservation, even if the draft is of high quality. In summary, high-quality draft of target and low-quality draft of NNs appears to be a cost-effective investment for DNA signature prediction, but may lead to underestimation of predicted protein signatures. PMID:16243783
Understanding the mechanisms of protein-DNA interactions

NASA Astrophysics Data System (ADS)

Lavery, Richard

2004-03-01

Structural, biochemical and thermodynamic data on protein-DNA interactions show that specific recognition cannot be reduced to a simple set of binary interactions between the partners (such as hydrogen bonds, ion pairs or steric contacts). The mechanical properties of the partners also play a role and, in the case of DNA, variations in both conformation and flexibility as a function of base sequence can be a significant factor in guiding a protein to the correct binding site. All-atom molecular modeling offers a means of analyzing the role of different binding mechanisms within protein-DNA complexes of known structure. This however requires estimating the binding strengths for the full range of sequences with which a given protein can interact. Since this number grows exponentially with the length of the binding site it is necessary to find a method to accelerate the calculations. We have achieved this by using a multi-copy approach (ADAPT) which allows us to build a DNA fragment with a variable base sequence. The results obtained with this method correlate well with experimental consensus binding sequences. They enable us to show that indirect recognition mechanisms involving the sequence dependent properties of DNA play a significant role in many complexes. This approach also offers a means of predicting protein binding sites on the basis of binding energies, which is complementary to conventional lexical techniques.
Domain-general sequence learning deficit in specific language impairment.

PubMed

Lukács, Agnes; Kemény, Ferenc

2014-05-01

Grammar-specific accounts of specific language impairment (SLI) have been challenged by recent claims that language problems are a consequence of impairments in domain-general mechanisms of learning that also play a key role in the process of language acquisition. Our studies were designed to test the generality and nature of this learning deficit by focusing on both sequential and nonsequential, and on verbal and nonverbal, domains. Twenty-nine children with SLI were compared with age-matched typically developing (TD) control children using (a) a serial reaction time task (SRT), testing the learning of motor sequences; (b) an artificial grammar learning (AGL) task, testing the extraction of regularities from auditory sequences; and (c) a weather prediction task (WP), testing probabilistic category learning in a nonsequential task. For the 2 sequence learning tasks, a significantly smaller proportion of children showed evidence of learning in the SLI than in the TD group (χ² tests, p < .001 for the SRT task, p < .05 for the AGL task), whereas the proportion of learners on the WP task was the same in the 2 groups. The level of learning for SLI learners was comparable with that of TD children on all tasks (with great individual variation). Taken together, these findings suggest that domain-general processes of implicit sequence learning tend to be impaired in SLI. Further research is needed to clarify the relationship of deficits in implicit learning and language.
On the normalization of the minimum free energy of RNAs by sequence length.

PubMed

Trotta, Edoardo

2014-01-01

The minimum free energy (MFE) of ribonucleic acids (RNAs) increases at an apparent linear rate with sequence length. Simple indices, obtained by dividing the MFE by the number of nucleotides, have been used for a direct comparison of the folding stability of RNAs of various sizes. Although this normalization procedure has been used in several studies, the relationship between normalized MFE and length has not yet been investigated in detail. Here, we demonstrate that the variation of MFE with sequence length is not linear and is significantly biased by the mathematical formula used for the normalization procedure. For this reason, the normalized MFEs strongly decrease as hyperbolic functions of length and produce unreliable results when applied for the comparison of sequences with different sizes. We also propose a simple modification of the normalization formula that corrects the bias enabling the use of the normalized MFE for RNAs longer than 40 nt. Using the new corrected normalized index, we analyzed the folding free energies of different human RNA families showing that most of them present an average MFE density more negative than expected for a typical genomic sequence. Furthermore, we found that a well-defined and restricted range of MFE density characterizes each RNA family, suggesting the use of our corrected normalized index to improve RNA prediction algorithms. Finally, in coding and functional human RNAs the MFE density appears scarcely correlated with sequence length, consistent with a negligible role of thermodynamic stability demands in determining RNA size.
On the Normalization of the Minimum Free Energy of RNAs by Sequence Length

PubMed Central

Trotta, Edoardo

2014-01-01

The minimum free energy (MFE) of ribonucleic acids (RNAs) increases at an apparent linear rate with sequence length. Simple indices, obtained by dividing the MFE by the number of nucleotides, have been used for a direct comparison of the folding stability of RNAs of various sizes. Although this normalization procedure has been used in several studies, the relationship between normalized MFE and length has not yet been investigated in detail. Here, we demonstrate that the variation of MFE with sequence length is not linear and is significantly biased by the mathematical formula used for the normalization procedure. For this reason, the normalized MFEs strongly decrease as hyperbolic functions of length and produce unreliable results when applied for the comparison of sequences with different sizes. We also propose a simple modification of the normalization formula that corrects the bias enabling the use of the normalized MFE for RNAs longer than 40 nt. Using the new corrected normalized index, we analyzed the folding free energies of different human RNA families showing that most of them present an average MFE density more negative than expected for a typical genomic sequence. Furthermore, we found that a well-defined and restricted range of MFE density characterizes each RNA family, suggesting the use of our corrected normalized index to improve RNA prediction algorithms. Finally, in coding and functional human RNAs the MFE density appears scarcely correlated with sequence length, consistent with a negligible role of thermodynamic stability demands in determining RNA size. PMID:25405875
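A small numerical sketch may help show how dividing the MFE by length can by itself introduce a hyperbolic length dependence. It rests on a toy assumption that is not taken from the paper: the MFE is modeled as affine in length, MFE = slope × (L − L0), with a hypothetical slope and offset. The corrected formula proposed by the author is not reproduced here; the shifted index below is only an illustration of the general idea.

```python
# Toy model parameters (assumptions for illustration, not fitted to any data).
SLOPE = -0.3  # hypothetical kcal/mol per nucleotide
L0 = 10       # hypothetical length offset in nucleotides


def toy_mfe(length: int) -> float:
    """Hypothetical MFE that is affine in sequence length."""
    return SLOPE * (length - L0)


def naive_index(mfe: float, length: int) -> float:
    """MFE divided by the number of nucleotides."""
    return mfe / length


def shifted_index(mfe: float, length: int, offset: int = L0) -> float:
    """MFE divided by the length minus an offset (illustrative correction)."""
    return mfe / (length - offset)


for length in (20, 50, 100, 500):
    mfe = toy_mfe(length)
    # naive_index equals SLOPE * (1 - L0 / length), a hyperbolic function of length
    print(length, round(naive_index(mfe, length), 4), round(shifted_index(mfe, length), 4))
```

Under this toy model the naive index drifts toward the slope as length grows, so two RNAs with identical folding behaviour per nucleotide would receive different scores purely because of their sizes, while the shifted index stays constant.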
Statistical physics of interacting neural networks

NASA Astrophysics Data System (ADS)

Kinzel, Wolfgang; Metzler, Richard; Kanter, Ido

2001-12-01

Recent results on the statistical physics of time series generation and prediction are presented. A neural network is trained on quasi-periodic and chaotic sequences, and its overlap with the sequence generator as well as its prediction errors are calculated numerically. For each network there exists a sequence for which it completely fails to make predictions. Two interacting networks show a transition to perfect synchronization. A pool of interacting networks shows good coordination in the minority game, a model of competition in a closed market. Finally, as a demonstration, a perceptron predicts bit sequences produced by human beings.
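The closing demonstration, a perceptron predicting bit sequences, can be sketched as follows. This is a minimal online perceptron under stated assumptions: the window size, learning rate, and error-driven update rule are illustrative choices and are not taken from the paper.

```python
def predict_bits(bits, window: int = 4, lr: float = 1.0) -> float:
    """Predict each bit from the previous `window` bits with an online perceptron.

    Bits are mapped to -1/+1, the prediction is the sign of the weighted sum,
    and the weights are updated only after a wrong prediction.
    Returns the fraction of correct predictions.
    """
    weights = [0.0] * window
    hits = 0
    trials = 0
    for t in range(window, len(bits)):
        x = [1.0 if b else -1.0 for b in bits[t - window:t]]
        target = 1.0 if bits[t] else -1.0
        field = sum(w * xi for w, xi in zip(weights, x))
        guess = 1.0 if field >= 0 else -1.0
        if guess == target:
            hits += 1
        else:
            # Perceptron rule: move the weights toward the missed target.
            weights = [w + lr * target * xi for w, xi in zip(weights, x)]
        trials += 1
    return hits / trials if trials else 0.0


# A strictly alternating sequence is learned after a single mistake.
print(predict_bits([0, 1] * 50))
```

On a regular sequence the hit rate approaches 1, whereas on an unpredictable sequence it should stay near 0.5; the interest of the human-generated case is that people typically fall somewhere in between.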
Predicting success on the certification examinations of the American Board of Anesthesiology.

PubMed

McClintock, Joseph C; Gravlee, Glenn P

2010-01-01

Currently, residency programs lack objective predictors for passing the sequenced American Board of Anesthesiology (ABA) certification examinations on the first attempt. Our hypothesis was that performance on the ABA/American Society of Anesthesiologists In-Training Examination (ITE) and other variables can predict combined success on the ABA Part 1 and Part 2 examinations. The authors studied 2,458 subjects who took the ITE immediately after completing the first year of clinical anesthesia training and took the ABA Part 1 examination for primary certification immediately after completing residency training 2 yr later. ITE scores and other variables were used to predict which residents would complete the certification process (passing the ABA Part 1 and Part 2 examinations) in the shortest possible time after graduation. ITE scores alone accounted for most of the explained variation in the desired outcome of certification in the shortest possible time. In addition, almost half of the observed variation and most of the explained variance in ABA Part 1 scores was accounted for by ITE scores. A combined model using ITE scores, residency program accreditation cycle length, country of medical school, and gender best predicted which residents would complete the certification examinations in the shortest possible time. The principal implication of this study is that higher ABA/American Society of Anesthesiologists ITE scores taken at the end of the first clinical anesthesia year serve as a significant and moderately strong predictor of high performance on the ABA Part 1 (written) examination, and a significant predictor of success in completing both the Part 1 and Part 2 examinations within the calendar year after the year of graduation from residency. Future studies may identify other predictors, and it would be helpful to identify factors that predict clinical performance as well.
Adaptive Anchoring Model: How Static and Dynamic Presentations of Time Series Influence Judgments and Predictions.

PubMed

Kusev, Petko; van Schaik, Paul; Tsaneva-Atanasova, Krasimira; Juliusson, Asgeir; Chater, Nick

2018-01-01

When attempting to predict future events, people commonly rely on historical data. One psychological characteristic of judgmental forecasting of time series, established by research, is that when people make forecasts from series, they tend to underestimate future values for upward trends and overestimate them for downward ones, so-called trend-damping (modeled by anchoring on, and insufficient adjustment from, the average of recent time series values). Events in a time series can be experienced sequentially (dynamic mode), or they can also be retrospectively viewed simultaneously (static mode), not experienced individually in real time. In one experiment, we studied the influence of presentation mode (dynamic and static) on two sorts of judgment: (a) predictions of the next event (forecast) and (b) estimation of the average value of all the events in the presented series (average estimation). Participants' responses in dynamic mode were anchored on more recent events than in static mode for all types of judgment but with different consequences; hence, dynamic presentation improved prediction accuracy, but not estimation. These results are not anticipated by existing theoretical accounts; we develop and present an agent-based model, the adaptive anchoring model (ADAM), to account for the difference between processing sequences of dynamically and statically presented stimuli (visually presented data). ADAM captures how variation in presentation mode produces variation in responses (and the accuracy of these responses) in both forecasting and judgment tasks. ADAM's model predictions for the forecasting and judgment tasks fit better with the response data than a linear-regression time series model. Moreover, ADAM outperformed autoregressive-integrated-moving-average (ARIMA) and exponential-smoothing models, while neither of these models accounts for people's responses on the average estimation task. Copyright © 2017 The Authors. Cognitive Science published by Wiley Periodicals, Inc. on behalf of Cognitive Science Society.
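The trend-damping description in the abstract (anchoring on the average of recent values, followed by insufficient adjustment) can be made concrete with a short sketch. This is not the ADAM model itself; the window size, the adjustment factor, and the use of a naive linear extrapolation as the adjustment target are assumptions chosen only to illustrate the mechanism.

```python
def damped_forecast(series, window: int = 3, adjustment: float = 0.5) -> float:
    """Forecast the next value by anchoring and partial adjustment.

    anchor       : mean of the last `window` observations
    extrapolated : naive continuation of the most recent step
    The forecast moves only a fraction `adjustment` of the way from the
    anchor to the extrapolated value, which damps the trend.
    """
    if len(series) < max(window, 2):
        raise ValueError("series is too short for the chosen window")
    anchor = sum(series[-window:]) / window
    extrapolated = series[-1] + (series[-1] - series[-2])
    return anchor + adjustment * (extrapolated - anchor)


upward = [1, 2, 3, 4, 5]    # linear continuation would be 6
downward = [5, 4, 3, 2, 1]  # linear continuation would be 0
print(damped_forecast(upward), damped_forecast(downward))
```

With these settings the upward series yields a forecast below the linear continuation and the downward series yields one above it, which is the pattern of under- and overestimation the abstract refers to as trend-damping.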
Somatic Genetic Variation in Solid Pseudopapillary Tumor of the Pancreas by Whole Exome Sequencing

PubMed Central

Guo, Meng; Luo, Guopei; Jin, Kaizhou; Long, Jiang; Cheng, He; Lu, Yu; Wang, Zhengshi; Yang, Chao; Xu, Jin; Ni, Quanxing; Yu, Xianjun; Liu, Chen

2017-01-01

Solid pseudopapillary tumor of the pancreas (SPT) is a rare pancreatic disease with a unique clinical manifestation. Although CTNNB1 gene mutations had been universally reported, genetic variation profiles of SPT are largely unidentified. We conducted whole exome sequencing in nine SPT patients to probe the SPT-specific insertions and deletions (indels) and single nucleotide polymorphisms (SNPs). In total, 54 SNPs and 41 indels of prominent variations were demonstrated through parallel exome sequencing. We detected that CTNNB1 mutations presented throughout all patients studied (100%), and a higher count of SNPs was particularly detected in patients with older age, larger tumor, and metastatic disease. By aggregating 95 detected variation events and viewing the interconnections among each of the genes with variations, CTNNB1 was identified as the core portion in the network, which might collaborate with other events such as variations of USP9X, EP400, HTT, MED12, and PKD1 to regulate tumorigenesis. Pathway analysis showed that the events involved in other cancers had the potential to influence the progression of the SNPs count. Our study revealed an insight into the variation of the gene encoding region underlying solid-pseudopapillary neoplasm tumorigenesis. The detection of these variations might partly reflect the potential molecular mechanism. PMID:28054945
Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

PubMed

Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

2015-01-01

Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.
Extensive variation at MHC DRB in the New Zealand sea lion (Phocarctos hookeri) provides evidence for balancing selection

PubMed Central

Osborne, A J; Zavodna, M; Chilvers, B L; Robertson, B C; Negro, S S; Kennedy, M A; Gemmell, N J

2013-01-01

Marine mammals are often reported to possess reduced variation of major histocompatibility complex (MHC) genes compared with their terrestrial counterparts. We evaluated diversity at two MHC class II B genes, DQB and DRB, in the New Zealand sea lion (Phocarctos hookeri, NZSL) a species that has suffered high mortality owing to bacterial epizootics, using Sanger sequencing and haplotype reconstruction, together with next-generation sequencing. Despite this species' prolonged history of small population size and highly restricted distribution, we demonstrate extensive diversity at MHC DRB with 26 alleles, whereas MHC DQB is dimorphic. We identify four DRB codons, predicted to be involved in antigen binding, that are evolving under adaptive evolution. Our data suggest diversity at DRB may be maintained by balancing selection, consistent with the role of this locus as an antigen-binding region and the species' recent history of mass mortality during a series of bacterial epizootics. Phylogenetic analyses of DQB and DRB sequences from pinnipeds and other carnivores revealed significant allelic diversity, but little phylogenetic depth or structure among pinniped alleles; thus, we could neither confirm nor refute the possibility of trans-species polymorphism in this group. The phylogenetic pattern observed however, suggests some significant evolutionary constraint on these loci in the recent past, with the pattern consistent with that expected following an epizootic event. These data may help further elucidate some of the genetic factors underlying the unusually high susceptibility to bacterial infection of the threatened NZSL, and help us to better understand the extent and pattern of MHC diversity in pinnipeds. PMID:23572124
Numeric stratigraphic modeling: Testing sequence stratigraphic concepts using high resolution geologic examples

DOE Office of Scientific and Technical Information (OSTI.GOV)

Armentrout, J.M.; Smith-Rouch, L.S.; Bowman, S.A.

1996-08-01

Numeric simulations based on integrated data sets enhance our understanding of depositional geometry and facilitate quantification of depositional processes. Numeric values tested against well-constrained geologic data sets can then be used in iterations testing each variable, and in predicting lithofacies distributions under various depositional scenarios using the principles of sequence stratigraphic analysis. The stratigraphic modeling software provides a broad spectrum of techniques for modeling and testing elements of the petroleum system. Using well-constrained geologic examples, variations in depositional geometry and lithofacies distributions between different tectonic settings (passive vs. active margin) and climate regimes (hothouse vs. icehouse) can provide insight to potential source rock and reservoir rock distribution, maturation timing, migration pathways, and trap formation. Two data sets are used to illustrate such variations: both include a seismic reflection profile calibrated by multiple wells. The first is a Pennsylvanian mixed carbonate-siliciclastic system in the Paradox basin, and the second a Pliocene-Pleistocene siliciclastic system in the Gulf of Mexico. Numeric simulations result in geometry and facies distributions consistent with those interpreted using the integrated stratigraphic analysis of the calibrated seismic profiles. An exception occurs in the Gulf of Mexico study where the simulated sediment thickness from 3.8 to 1.6 Ma within an upper slope minibasin was less than that mapped using a regional seismic grid. Regional depositional patterns demonstrate that this extra thickness was probably sourced from out of the plane of the modeled transect, illustrating the necessity for three-dimensional constraints on two-dimensional modeling.
Novel mutation of GATA4 gene in Kurdish population of Iran with nonsyndromic congenital heart septals defects.

PubMed

Soheili, Fariborz; Jalili, Zahra; Rahbar, Mahtab; Khatooni, Zahed; Mashayekhi, Amir; Jafari, Hossein

2018-03-01

Mutations in the GATA4 gene induce inherited atrial and ventricular septation defects, which are the most frequent forms of congenital heart defects (CHDs), constituting about half of all cases. We performed high-resolution melting (HRM) mutation scanning of GATA4 coding exons in 100 nonsyndromic patients as a case group, including 39 with atrial septal defects (ASD), 57 with ventricular septal defects (VSD) and four with both defects, and in 50 healthy individuals as a control group. Samples were categorized according to their HRM graphs. Genome sequencing was done for 15 control samples and for 25 patient samples whose HRM analysis was similar to that of healthy subjects for each exon. PolyPhen-2 and MUpro were used to predict the causative potential and the structural stability effects of GATA4 sequence variation. The HRM curve analysis showed that 21 patients and 3 normal samples had deviated curves for GATA4 coding exons. Sequencing analysis revealed 12 nonsynonymous mutations; while all of them resulted in stability structure of protein, 10 of them are pathogenic and 2 are benign. We also found two nucleotide deletions, one of which was novel, one new indel mutation resulting in a frameshift, and 4 synonymous variations or polymorphisms in 6 patients and 3 normal individuals. Six (about 50%) of these nonsynonymous mutations have not been previously reported. Our results show that there is a spectrum of GATA4 mutations resulting in septal defects. © 2018 Wiley Periodicals, Inc.
Inference of Gorilla Demographic and Selective History from Whole-Genome Sequence Data

PubMed Central

McManus, Kimberly F.; Kelley, Joanna L.; Song, Shiya; Veeramah, Krishna R.; Woerner, August E.; Stevison, Laurie S.; Ryder, Oliver A.; Great Ape Genome Project; Kidd, Jeffrey M.; Wall, Jeffrey D.; Bustamante, Carlos D.; Hammer, Michael F.

2015-01-01

Although population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole-genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor approximately 261 ka, and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage approximately 68 ka. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of 1.4-fold around 970 ka and a recent 5.6-fold contraction in population size 23 ka. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection. PMID:25534031
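The site frequency spectrum that the demographic inference above is fitted to is a simple summary of variation, and a minimal sketch of how it is tabulated may be useful. The function name and the example counts are hypothetical; the sketch only builds the unfolded spectrum from per-site derived allele counts and does not attempt the diffusion approximation itself.

```python
def site_frequency_spectrum(derived_counts, n_chromosomes: int):
    """Tabulate the unfolded site frequency spectrum.

    derived_counts : for each site, the number of sampled chromosomes
                     carrying the derived allele
    n_chromosomes  : number of sampled chromosomes (2 per diploid individual)
    Entry k of the result is the number of sites whose derived allele is
    present on exactly k chromosomes.
    """
    sfs = [0] * (n_chromosomes + 1)
    for count in derived_counts:
        if not 0 <= count <= n_chromosomes:
            raise ValueError("derived allele count outside 0..n_chromosomes")
        sfs[count] += 1
    return sfs


# Hypothetical counts for six sites in a sample of two diploid individuals.
counts = [1, 1, 2, 4, 1, 3]
print(site_frequency_spectrum(counts, n_chromosomes=4))
```

Demographic events leave characteristic distortions in this vector, for example an excess of rare variants after an expansion, which is why a model of population size changes can be fitted to it.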
Thermal and acid tolerant beta-xylosidases, genes encoding, related organisms, and methods

DOEpatents

Thompson, David N [Idaho Falls, ID]; Thompson, Vicki S [Idaho Falls, ID]; Schaller, Kastli D [Ammon, ID]; Apel, William A [Jackson, WY]; Lacey, Jeffrey A [Idaho Falls, ID]; Reed, David W [Idaho Falls, ID]

2011-04-12

Isolated and/or purified polypeptides and nucleic acid sequences encoding polypeptides from Alicyclobacillus acidocaldarius and variations thereof are provided. Further provided are methods of at least partially degrading xylotriose and/or xylobiose using isolated and/or purified polypeptides and nucleic acid sequences encoding polypeptides from Alicyclobacillus acidocaldarius and variations thereof.
Genetic variation and biological activity of isolates of Lymantria dispar multiple nucleopolyhedrovirus from North America, Europe, and Asia

USDA-ARS's Scientific Manuscript database

Little is known about genetic variation of Lymantria dispar multiple nucleopolyhedrovirus (LdMNPV; Baculoviridae: Alphabaculovirus) at the nucleotide sequence level. To obtain a more comprehensive view of genetic diversity among isolates of LdMNPV, partial sequences of the lef-8 gene were generated...
Natural Variation in Brachypodium distachyon: Deep Sequencing of Highly Diverse Natural Accessions (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gordon, Sean

2013-03-01

Sean Gordon of the USDA on Natural Variation in Brachypodium distachyon: Deep Sequencing of Highly Diverse Natural Accessions, presented at the 8th Annual Genomics of Energy and Environment Meeting on March 27, 2013 in Walnut Creek, CA.
Sequence variation of the feline immunodeficiency virus genome and its clinical relevance.

PubMed

Stickney, A L; Dunowska, M; Cave, N J

2013-06-08

The ongoing evolution of feline immunodeficiency virus (FIV) has resulted in the existence of a diverse continuum of viruses. FIV isolates differ with regard to their mutation and replication rates, plasma viral loads, cell tropism and the ability to induce apoptosis. Clinical disease in FIV-infected cats is also inconsistent. Genomic sequence variation of FIV is likely to be responsible for some of the variation in viral behaviour. The specific genetic sequences that influence these key viral properties remain to be determined. With knowledge of the specific key determinants of pathogenicity, there is the potential for veterinarians in the future to apply this information for prognostic purposes. Genomic sequence variation of FIV also presents an obstacle to effective vaccine development. Most challenge studies demonstrate acceptable efficacy of a dual-subtype FIV vaccine (Fel-O-Vax FIV) against FIV infection under experimental settings; however, vaccine efficacy in the field still remains to be proven. It is important that we discover the key determinants of immunity induced by this vaccine; such data would complement vaccine field efficacy studies and provide the basis to make informed recommendations on its use.
SNBRFinder: A Sequence-Based Hybrid Algorithm for Enhanced Prediction of Nucleic Acid-Binding Residues.

PubMed

Yang, Xiaoxia; Wang, Jia; Sun, Jun; Liu, Rong

2015-01-01

Protein-nucleic acid interactions are central to various fundamental biological processes. Automated methods capable of reliably identifying DNA- and RNA-binding residues in protein sequence are assuming ever-increasing importance. The majority of current algorithms rely on feature-based prediction, but their accuracy remains to be further improved. Here we propose a sequence-based hybrid algorithm SNBRFinder (Sequence-based Nucleic acid-Binding Residue Finder) by merging a feature predictor SNBRFinderF and a template predictor SNBRFinderT. SNBRFinderF was established using the support vector machine whose inputs include sequence profile and other complementary sequence descriptors, while SNBRFinderT was implemented with the sequence alignment algorithm based on profile hidden Markov models to capture the weakly homologous template of query sequence. Experimental results show that SNBRFinderF was clearly superior to the commonly used sequence profile-based predictor and SNBRFinderT can achieve comparable performance to the structure-based template methods. Leveraging the complementary relationship between these two predictors, SNBRFinder reasonably improved the performance of both DNA- and RNA-binding residue predictions. More importantly, the sequence-based hybrid prediction reached competitive performance relative to our previous structure-based counterpart. Our extensive and stringent comparisons show that SNBRFinder has obvious advantages over the existing sequence-based prediction algorithms. The value of our algorithm is highlighted by establishing an easy-to-use web server that is freely accessible at http://ibi.hzau.edu.cn/SNBRFinder.
Selection of a DNA barcode for Nectriaceae from fungal whole-genomes.

PubMed

Zeng, Zhaoqing; Zhao, Peng; Luo, Jing; Zhuang, Wenying; Yu, Zhihe

2012-01-01

A DNA barcode is a short segment of sequence that is able to distinguish species. A barcode must ideally contain enough variation to distinguish every individual species and be easily obtained. Fungi of Nectriaceae are economically important and show high species diversity. To establish a standard DNA barcode for this group of fungi, the genomes of Neurospora crassa and 30 other filamentous fungi were compared. The expect value was treated as a criterion to recognize homologous sequences. Four candidate markers, Hsp90, AAC, CDC48, and EF3, were tested for their feasibility as barcodes in the identification of 34 well-established species belonging to 13 genera of Nectriaceae. Two hundred and fifteen sequences were analyzed. Intra- and inter-specific variations and the success rate of PCR amplification and sequencing were considered as important criteria for estimation of the candidate markers. Ultimately, the partial EF3 gene met the requirements for a good DNA barcode: No overlap was found between the intra- and inter-specific pairwise distances. The smallest inter-specific distance of EF3 gene was 3.19%, while the largest intra-specific distance was 1.79%. In addition, there was a high success rate in PCR and sequencing for this gene (96.3%). CDC48 showed sufficiently high sequence variation among species, but the PCR and sequencing success rate was 84% using a single pair of primers. Although the Hsp90 and AAC genes had higher PCR and sequencing success rates (96.3% and 97.5%, respectively), overlapping occurred between the intra- and inter-specific variations, which could lead to misidentification. Therefore, we propose the EF3 gene as a possible DNA barcode for the nectriaceous fungi.
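The "no overlap" criterion in this abstract is often called a barcode gap, and it reduces to comparing two extremes. The sketch below checks it using the two EF3 figures quoted above (largest intra-specific distance 1.79%, smallest inter-specific distance 3.19%). The other distance values and the function names are hypothetical placeholders, not data from the study.

```python
def has_barcode_gap(intra_distances, inter_distances):
    """True if every within-species distance is smaller than every
    between-species distance (no overlap between the two sets)."""
    return max(intra_distances) < min(inter_distances)


def gap_width(intra_distances, inter_distances):
    """Size of the gap in the same units as the inputs (negative if overlapping)."""
    return min(inter_distances) - max(intra_distances)


# Pairwise distances in percent. Only 1.79 and 3.19 come from the abstract;
# the remaining numbers are made up to fill out the example.
ef3_intra = [0.0, 0.42, 1.79]
ef3_inter = [3.19, 6.5, 12.0]

print(has_barcode_gap(ef3_intra, ef3_inter))
print(round(gap_width(ef3_intra, ef3_inter), 2))
```

A marker whose two distributions overlap, as reported here for Hsp90 and AAC, would return False and a negative width.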

Discriminative prediction of mammalian enhancers from DNA sequence

PubMed Central

Lee, Dongwon; Karchin, Rachel; Beer, Michael A.

2011-01-01

Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers. PMID:21875935
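To illustrate what "using only genomic sequence" can mean in practice, the sketch below turns a DNA string into a feature vector of k-mer frequencies, counting each k-mer together with its reverse complement. The abstract speaks only of "general sequence features", so the choice of k-mers here is an assumption for illustration, and no classifier is trained. The names `revcomp` and `kmer_features` are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_features(seq, k):
    """Map each canonical k-mer to its relative frequency in seq.

    A k-mer and its reverse complement are merged under whichever of the
    two sorts first, so the features do not depend on strand.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[min(kmer, revcomp(kmer))] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


features = kmer_features("ACGT", 2)
print(features)
```

Vectors of this kind, computed for bound and unbound regions, are the sort of input a support vector machine could be trained on.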
Construction of a large collection of small genome variations in French dairy and beef breeds using whole-genome sequences.

PubMed

Boussaha, Mekki; Michot, Pauline; Letaief, Rabia; Hozé, Chris; Fritz, Sébastien; Grohs, Cécile; Esquerré, Diane; Duchesne, Amandine; Philippe, Romain; Blanquet, Véronique; Phocas, Florence; Floriot, Sandrine; Rocha, Dominique; Klopp, Christophe; Capitan, Aurélien; Boichard, Didier

2016-11-15

In recent years, several bovine genome sequencing projects were carried out with the aim of developing genomic tools to improve dairy and beef production efficiency and sustainability. In this study, we describe the first French cattle genome variation dataset obtained by sequencing 274 whole genomes representing several major dairy and beef breeds. This dataset contains over 28 million single nucleotide polymorphisms (SNPs) and small insertions and deletions. Comparisons between sequencing results and SNP array genotypes revealed a very high genotype concordance rate, which indicates the good quality of our data. To our knowledge, this is the first large-scale catalog of small genomic variations in French dairy and beef cattle. This resource will contribute to the study of gene functions and population structure and also help to improve traits through genotype-guided selection.
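The genotype concordance rate used above as a quality check can be sketched as the fraction of shared, non-missing calls that agree between the two platforms. The encoding (alternate-allele counts, with None for a missing call), the marker names and the function name are assumptions for illustration; the abstract does not describe how the authors computed the rate.

```python
def genotype_concordance(seq_calls, array_calls):
    """Fraction of markers called on both platforms whose genotypes agree.

    Both arguments map marker name -> genotype (0, 1, 2) or None if missing.
    """
    shared = [
        marker for marker in seq_calls
        if seq_calls[marker] is not None and array_calls.get(marker) is not None
    ]
    if not shared:
        raise ValueError("no markers called on both platforms")
    matches = sum(seq_calls[m] == array_calls[m] for m in shared)
    return matches / len(shared)


# Hypothetical calls for one animal.
sequencing = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": None}
array = {"snp1": 0, "snp2": 1, "snp3": 1, "snp4": 2, "snp5": 0}

print(genotype_concordance(sequencing, array))
```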
MHC diversity in two Acrocephalus species: the outbred Great reed warbler and the inbred Seychelles warbler.

PubMed

Richardson, David S; Westerdahl, Helena

2003-12-01

The Great reed warbler (GRW) and the Seychelles warbler (SW) are congeners with markedly different demographic histories. The GRW is a normal outbred bird species while the SW population remains isolated and inbred after undergoing a severe population bottleneck. We examined variation at Major Histocompatibility Complex (MHC) class I exon 3 using restriction fragment length polymorphism, denaturing gradient gel electrophoresis and DNA sequencing. Although genetic variation was higher in the GRW, considerable variation has been maintained in the SW. The ten exon 3 sequences found in the SW were as diverged from each other as was a random sub-sample of the 67 sequences from the GRW. There was evidence for balancing selection in both species, and the phylogenetic analysis, which showed that the exon 3 sequences did not separate according to species, was consistent with trans-species evolution of the MHC.
Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives.

PubMed

Zhao, Min; Wang, Qingguo; Wang, Quan; Jia, Peilin; Zhao, Zhongming

2013-01-01

Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) and genotyping arrays were the standard technologies for detecting large regions subject to copy number changes in genomes until recently, when next-generation sequencing (NGS) made high-resolution sequence data available for analysis. During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development.
Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives

PubMed Central

2013-01-01

Copy number variation (CNV) is a prevalent form of critical genetic variation that leads to an abnormal number of copies of large genomic regions in a cell. Microarray-based comparative genome hybridization (arrayCGH) and genotyping arrays were the standard technologies for detecting large regions subject to copy number changes in genomes until recently, when next-generation sequencing (NGS) made high-resolution sequence data available for analysis. During the last several years, NGS-based analysis has been widely applied to identify CNVs in both healthy and diseased individuals. Correspondingly, the strong demand for NGS-based CNV analyses has fuelled development of numerous computational methods and tools for CNV detection. In this article, we review the recent advances in computational methods pertaining to CNV detection using whole genome and whole exome sequencing data. Additionally, we discuss their strengths and weaknesses and suggest directions for future development. PMID:24564169
Variation in the Oxytocin Receptor Gene Predicts Brain Region Specific Expression and Social Attachment

PubMed Central

King, Lanikea B.; Walum, Hasse; Inoue, Kiyoshi; Eyrich, Nicholas W.; Young, Larry J.

2015-01-01

Background: Oxytocin (OXT) modulates several aspects of social behavior. Intranasal OXT is a leading candidate for treating social deficits in autism spectrum disorder (ASD), and common genetic variants in the human oxytocin receptor (OXTR) are associated with emotion recognition, relationship quality and ASD. Animal models have revealed that individual differences in Oxtr expression in the brain drive social behavior variation. Our understanding of how genetic variation contributes to brain OXTR expression is very limited. Methods: We investigated Oxtr expression in monogamous prairie voles, which have a well-characterized OXT system. We quantified brain region-specific levels of Oxtr mRNA and OXTR protein with established neuroanatomical methods. We used pyrosequencing to investigate allelic imbalance of Oxtr mRNA, a molecular signature of polymorphic genetic regulatory elements. We performed next-generation sequencing to discover variants in and near the Oxtr gene. We investigated social attachment using the partner preference test. Results: Our allelic imbalance data demonstrate that genetic variants contribute to individual differences in Oxtr expression, but only in particular brain regions, including the nucleus accumbens (NAcc), where OXTR signaling facilitates social attachment. Next-generation sequencing identified one polymorphism in the Oxtr intron, near a putative cis-regulatory element, explaining 74% of the variance in striatal Oxtr expression specifically. Males homozygous for the high-expressing allele display enhanced social attachment. Discussion: Taken together, these findings provide convincing evidence for robust genetic influence on Oxtr expression and provide novel insights into how non-coding polymorphisms in the OXTR might influence individual differences in human social cognition and behavior. PMID:26893121
The genome of the vervet (Chlorocebus aethiops sabaeus)

PubMed Central

Warren, Wesley C.; Jasinska, Anna J.; García-Pérez, Raquel; Svardal, Hannes; Tomlinson, Chad; Rocchi, Mariano; Archidiacono, Nicoletta; Capozzi, Oronzo; Minx, Patrick; Montague, Michael J.; Kyung, Kim; Hillier, LaDeana W.; Kremitzki, Milinn; Graves, Tina; Chiang, Colby; Hughes, Jennifer; Tran, Nam; Huang, Yu; Ramensky, Vasily; Choi, Oi-wa; Jung, Yoon J.; Schmitt, Christopher A.; Juretic, Nikoleta; Wasserscheid, Jessica; Turner, Trudy R.; Wiseman, Roger W.; Tuscher, Jennifer J.; Karl, Julie A.; Schmitz, Jörn E.; Zahn, Roland; O'Connor, David H.; Redmond, Eugene; Nisbett, Alex; Jacquelin, Béatrice; Müller-Trutwin, Michaela C.; Brenchley, Jason M.; Dione, Michel; Antonio, Martin; Schroth, Gary P.; Kaplan, Jay R.; Jorgensen, Matthew J.; Thomas, Gregg W.C.; Hahn, Matthew W.; Raney, Brian J.; Aken, Bronwen; Nag, Rishi; Schmitz, Juergen; Churakov, Gennady; Noll, Angela; Stanyon, Roscoe; Webb, David; Thibaud-Nissen, Francoise; Nordborg, Magnus; Marques-Bonet, Tomas; Dewar, Ken; Weinstock, George M.; Wilson, Richard K.; Freimer, Nelson B.

2015-01-01

We describe a genome reference of the African green monkey or vervet (Chlorocebus aethiops). This member of the Old World monkey (OWM) superfamily is uniquely valuable for genetic investigations of simian immunodeficiency virus (SIV), for which it is the most abundant natural host species, and of a wide range of health-related phenotypes assessed in Caribbean vervets (C. a. sabaeus), whose numbers have expanded dramatically since Europeans introduced small numbers of their ancestors from West Africa during the colonial era. We use the reference to characterize the genomic relationship between vervets and other primates, the intra-generic phylogeny of vervet subspecies, and genome-wide structural variations of a pedigreed C. a. sabaeus population. Through comparative analyses with human and rhesus macaque, we characterize at high resolution the unique chromosomal fission events that differentiate the vervets and their close relatives from most other catarrhine primates, in whom karyotype is highly conserved. We also provide a summary of transposable elements and contrast these with the rhesus macaque and human. Analysis of sequenced genomes representing each of the main vervet subspecies supports previously hypothesized relationships between these populations, which range across most of sub-Saharan Africa, while uncovering high levels of genetic diversity within each. Sequence-based analyses of major histocompatibility complex (MHC) polymorphisms reveal extremely low diversity in Caribbean C. a. sabaeus vervets, compared to vervets from putatively ancestral West African regions. In the C. a. sabaeus research population, we discover the first structural variations that are, in some cases, predicted to have a deleterious effect; future studies will determine the phenotypic impact of these variations. PMID:26377836
A novel selection signature in stearoyl-coenzyme A desaturase (SCD) gene for enhanced milk fat content in Bubalus bubalis.

PubMed

Maryam, J; Babar, M E; Bao, Zhang; Nadeem, A

2016-10-01

Modern molecular interventions are dynamic gears for breeding animals with superior genetic make-up. These scientific efforts lead us toward sustainable dairy herds with improved milk production in terms of yield and quality. Many candidate genes have been dissected at the molecular level, and suitable genetic markers have been identified in cattle, but this work has not been validated in buffaloes so far. Stearoyl-coenzyme A desaturase (SCD) has been a potential candidate gene for fat content of milk. Genomic analysis of SCD revealed a total of six variations that were identified through DNA sequencing of animals with lower and higher butter fat percentage. After statistical analysis, genotype AB of p.K158I could be associated (P value <0.0001) with higher milk fat percentage (10.5 ± 0.5464). This SNP was validated on a larger data set by cleaved amplified polymorphic sequences (CAPS) using DdeI. To scrutinize the functional consequences of p.K158I, the 3D protein structure of SCD was predicted by homology modeling, and this variation was found to be located in the vicinity of the functional domain and in a part of a transmembrane helix of this membrane-integrated protein. This is the first report on genetic screening of the SCD gene at the molecular level in buffalo. This report illustrates the implication of the SCD gene, and in particular the p.K158I variation, in milk fat percentage, which can be targeted in the selection of superior dairy buffaloes.
In silico Derivation of HLA-Specific Alloreactivity Potential from Whole Exome Sequencing of Stem-Cell Transplant Donors and Recipients: Understanding the Quantitative Immunobiology of Allogeneic Transplantation

PubMed Central

Jameson-Lee, Max; Koparde, Vishal; Griffith, Phil; Scalora, Allison F.; Sampson, Juliana K.; Khalid, Haniya; Sheth, Nihar U.; Batalo, Michael; Serrano, Myrna G.; Roberts, Catherine H.; Hess, Michael L.; Buck, Gregory A.; Neale, Michael C.; Manjili, Masoud H.; Toor, Amir Ahmed

2014-01-01

Donor T-cell mediated graft versus host (GVH) effects may result from the aggregate alloreactivity to minor histocompatibility antigens (mHA) presented by the human leukocyte antigen (HLA) molecules in each donor–recipient pair undergoing stem-cell transplantation (SCT). Whole exome sequencing has previously demonstrated a large number of non-synonymous single nucleotide polymorphisms (SNP) present in HLA-matched recipients of SCT donors (GVH direction). The nucleotide sequence flanking each of these SNPs was obtained and the amino acid sequence determined. All the possible nonameric peptides incorporating the variant amino acid resulting from these SNPs were interrogated in silico for their likelihood to be presented by the HLA class I molecules using the Immune Epitope Database stabilized matrix method (SMM) and NetMHCpan algorithms. The SMM algorithm predicted that a median of 18,396 peptides weakly bound HLA class I molecules in individual SCT recipients, and 2,254 peptides displayed strong binding. A similar library of presented peptides was identified when the data were interrogated using the NetMHCpan algorithm. The bioinformatic algorithm presented here demonstrates that there may be a high level of mHA variation in HLA-matched individuals, constituting a HLA-specific alloreactivity potential. PMID:25414699
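The step "all the possible nonameric peptides incorporating the variant amino acid" is a sliding-window enumeration, sketched below. Only the windowing is shown; the binding predictions (SMM, NetMHCpan) are not reproduced. The protein string, the substitution and the function name `variant_peptides` are hypothetical.

```python
def variant_peptides(protein, pos, variant_aa, length=9):
    """List every peptide of the given length that covers a substituted residue.

    protein:    reference amino acid sequence (one-letter codes)
    pos:        0-based index of the residue that is replaced
    variant_aa: the amino acid present in the variant
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos is outside the protein")
    mutated = protein[:pos] + variant_aa + protein[pos + 1:]
    first = max(0, pos - length + 1)        # leftmost window that still covers pos
    last = min(pos, len(mutated) - length)  # rightmost window that fits in the protein
    return [mutated[start:start + length] for start in range(first, last + 1)]


# Made-up 20-residue protein with M at index 10 replaced by A.
protein = "ACDEFGHIKLMNPQRSTVWY"
peptides = variant_peptides(protein, 10, "A")
for peptide in peptides:
    print(peptide)
```

Away from the protein ends a single substitution yields up to nine nonamers, which is why a moderate number of non-synonymous SNPs can expand into a large peptide library.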
In silico Derivation of HLA-Specific Alloreactivity Potential from Whole Exome Sequencing of Stem-Cell Transplant Donors and Recipients: Understanding the Quantitative Immunobiology of Allogeneic Transplantation.

PubMed

Jameson-Lee, Max; Koparde, Vishal; Griffith, Phil; Scalora, Allison F; Sampson, Juliana K; Khalid, Haniya; Sheth, Nihar U; Batalo, Michael; Serrano, Myrna G; Roberts, Catherine H; Hess, Michael L; Buck, Gregory A; Neale, Michael C; Manjili, Masoud H; Toor, Amir Ahmed

2014-01-01

Donor T-cell mediated graft versus host (GVH) effects may result from the aggregate alloreactivity to minor histocompatibility antigens (mHA) presented by the human leukocyte antigen (HLA) molecules in each donor-recipient pair undergoing stem-cell transplantation (SCT). Whole exome sequencing has previously demonstrated a large number of non-synonymous single nucleotide polymorphisms (SNP) present in HLA-matched recipients of SCT donors (GVH direction). The nucleotide sequence flanking each of these SNPs was obtained and the amino acid sequence determined. All the possible nonameric peptides incorporating the variant amino acid resulting from these SNPs were interrogated in silico for their likelihood to be presented by the HLA class I molecules using the Immune Epitope Database stabilized matrix method (SMM) and NetMHCpan algorithms. The SMM algorithm predicted that a median of 18,396 peptides weakly bound HLA class I molecules in individual SCT recipients, and 2,254 peptides displayed strong binding. A similar library of presented peptides was identified when the data were interrogated using the NetMHCpan algorithm. The bioinformatic algorithm presented here demonstrates that there may be a high level of mHA variation in HLA-matched individuals, constituting a HLA-specific alloreactivity potential.
Screening strategies for a highly polymorphic gene: DHPLC analysis of the Fanconi anemia group A gene.

PubMed

Rischewski, J; Schneppenheim, R

2001-01-30

Patients with Fanconi anemia (Fanc) are at risk of developing leukemia. Mutations of the group A gene (FancA) are most common. A multitude of polymorphisms and mutations within the 43 exons of the gene are described. To examine the role of heterozygosity as a risk factor for malignancies, a partially automated screening method to identify aberrations was needed. We report on our experience with DHPLC (WAVE (Transgenomic)). PCR amplification of all 43 exons from one individual was performed on one microtiter plate on a gradient thermocycler. DHPLC analysis conditions were established via melting curves, prediction software, and test runs with aberrant samples. PCR products were analyzed twice: native, and after adding a WT-PCR product. Retention patterns were compared with previously identified polymorphic PCR products or mutants. We have defined the mutation screening conditions for all 43 exons of FancA using DHPLC. So far, 40 different sequence variations have been detected in more than 100 individuals. The native analysis identifies heterozygous individuals, and the second run detects homozygous aberrations. Retention patterns are specific for the underlying sequence aberration, thus reducing sequencing demand and costs. DHPLC is a valuable tool for reproducible recognition of known sequence aberrations and screening for unknown mutations in the highly polymorphic FancA gene.
Genome sequencing of adzuki bean (Vigna angularis) provides insight into high starch and low fat accumulation and domestication.

PubMed

Yang, Kai; Tian, Zhixi; Chen, Chunhai; Luo, Longhai; Zhao, Bo; Wang, Zhuo; Yu, Lili; Li, Yisong; Sun, Yudong; Li, Weiyu; Chen, Yan; Li, Yongqiang; Zhang, Yueyang; Ai, Danjiao; Zhao, Jinyang; Shang, Cheng; Ma, Yong; Wu, Bin; Wang, Mingli; Gao, Li; Sun, Dongjing; Zhang, Peng; Guo, Fangfang; Wang, Weiwei; Li, Yuan; Wang, Jinlong; Varshney, Rajeev K; Wang, Jun; Ling, Hong-Qing; Wan, Ping

2015-10-27

Adzuki bean (Vigna angularis), an important legume crop, is grown in more than 30 countries of the world. The seed of adzuki bean, as an important source of starch, digestible protein, mineral elements, and vitamins, is a widely used food for at least a billion people. Here, we generated a high-quality draft genome sequence of adzuki bean by whole-genome shotgun sequencing. The assembled contig sequences reached 450 Mb (83% of the genome) with an N50 of 38 kb, and the total scaffold sequences were 466.7 Mb with an N50 of 1.29 Mb. Of them, 372.9 Mb of scaffold sequences were assigned to the 11 chromosomes of adzuki bean by using a single nucleotide polymorphism genetic map. A total of 34,183 protein-coding genes were predicted. Functional analysis revealed that significant differences in starch and fat content between adzuki bean and soybean were likely due to transcriptional abundance, rather than copy number variations, of the genes related to starch and oil synthesis. We detected strong selection signals in domestication by the population analysis of 50 accessions including 11 wild, 11 semiwild, 17 landraces, and 11 improved varieties. In addition, the semiwild accessions were shown to have a closer relationship to the cultigen accessions than the wild type, suggesting that the semiwild adzuki bean might be a preliminary landrace and play some role in adzuki bean domestication. The genome sequence of adzuki bean will facilitate the identification of agronomically important genes and accelerate the improvement of adzuki bean.
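The N50 values quoted for contigs and scaffolds follow the usual definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. A minimal sketch, with made-up lengths and a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # first length at which half the bases are covered
            return length
    raise ValueError("no sequence lengths given")


# Toy assembly of five contigs (20 bases in total).
print(n50([2, 3, 4, 5, 6]))
```

Because longer sequences are accumulated first, joining contigs into scaffolds raises the N50 even when the total assembled length changes little, which is the pattern reported here (38 kb for contigs, 1.29 Mb for scaffolds).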
Construction of a high-density genetic map for grape using next generation restriction-site associated DNA sequencing

PubMed Central

2012-01-01

Background: Genetic mapping and QTL detection are powerful methodologies in plant improvement and breeding. Construction of a high-density and high-quality genetic map would be of great benefit in the production of superior grapes to meet human demand. High throughput and low cost of the recently developed next generation sequencing (NGS) technology have resulted in its wide application in genome research. Sequencing restriction-site associated DNA (RAD) might be an efficient strategy to simplify genotyping. Combining NGS with RAD has proven to be powerful for single nucleotide polymorphism (SNP) marker development. Results: An F1 population of 100 individual plants was developed. In silico digestion-site prediction was used to select an appropriate restriction enzyme for construction of a RAD sequencing library. Next generation RAD sequencing was applied to genotype the F1 population and its parents. Applying a cluster strategy for SNP modulation, a total of 1,814 high-quality SNP markers were developed: 1,121 of these were mapped to the female genetic map, 759 to the male map, and 1,646 to the integrated map. A comparison of the genetic maps to the published Vitis vinifera genome revealed both conservation and variations. Conclusions: The applicability of next generation RAD sequencing for genotyping a grape F1 population was demonstrated, leading to the successful development of a genetic map with high density and quality using our designed SNP markers. Detailed analysis revealed that this newly developed genetic map can be used for a variety of genome investigations, such as QTL detection, sequence assembly and genome comparison. PMID:22908993
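In silico digestion-site prediction, as used above to choose an enzyme, amounts to locating recognition sites in a reference sequence and looking at the resulting fragments. The sketch below does this for one strand of a linear sequence. The abstract does not name the enzyme, so EcoRI (site GAATTC, cut after the first G) is used purely as a familiar example, and the toy sequence and function name are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Fragment lengths produced by cutting a linear sequence at every site.

    site:       recognition sequence, e.g. "GAATTC"
    cut_offset: number of bases into the site at which the cut falls
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Toy 20-base sequence with two EcoRI sites.
toy_genome = "AAGAATTCCCGGGAATTCTT"
fragments = digest(toy_genome, "GAATTC", 1)
print(len(fragments) - 1, "sites; fragment lengths:", fragments)
```

Running this for several candidate enzymes against a reference genome gives the number of expected RAD tags per enzyme, which is one plausible basis for the selection.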
Genome sequencing of adzuki bean (Vigna angularis) provides insight into high starch and low fat accumulation and domestication

PubMed Central

Yang, Kai; Tian, Zhixi; Chen, Chunhai; Luo, Longhai; Zhao, Bo; Wang, Zhuo; Yu, Lili; Li, Yisong; Sun, Yudong; Li, Weiyu; Chen, Yan; Li, Yongqiang; Zhang, Yueyang; Ai, Danjiao; Zhao, Jinyang; Shang, Cheng; Ma, Yong; Wu, Bin; Wang, Mingli; Gao, Li; Sun, Dongjing; Zhang, Peng; Guo, Fangfang; Wang, Weiwei; Li, Yuan; Wang, Jinlong; Varshney, Rajeev K.; Wang, Jun; Ling, Hong-Qing; Wan, Ping

2015-01-01

Adzuki bean (Vigna angularis), an important legume crop, is grown in more than 30 countries of the world. The seed of adzuki bean, as an important source of starch, digestible protein, mineral elements, and vitamins, is a widely used food for at least a billion people. Here, we generated a high-quality draft genome sequence of adzuki bean by whole-genome shotgun sequencing. The assembled contig sequences reached 450 Mb (83% of the genome) with an N50 of 38 kb, and the total scaffold sequences were 466.7 Mb with an N50 of 1.29 Mb. Of them, 372.9 Mb of scaffold sequences were assigned to the 11 chromosomes of adzuki bean by using a single nucleotide polymorphism genetic map. A total of 34,183 protein-coding genes were predicted. Functional analysis revealed that significant differences in starch and fat content between adzuki bean and soybean were likely due to transcriptional abundance, rather than copy number variations, of the genes related to starch and oil synthesis. We detected strong selection signals in domestication by the population analysis of 50 accessions including 11 wild, 11 semiwild, 17 landraces, and 11 improved varieties. In addition, the semiwild accessions were shown to have a closer relationship to the cultigen accessions than the wild type, suggesting that the semiwild adzuki bean might be a preliminary landrace and play some role in adzuki bean domestication. The genome sequence of adzuki bean will facilitate the identification of agronomically important genes and accelerate the improvement of adzuki bean. PMID:26460024
Efficient analysis of mouse genome sequences reveal many nonsense variants

PubMed Central

Steeland, Sophie; Timmermans, Steven; Van Ryckeghem, Sara; Hulpiau, Paco; Saeys, Yvan; Van Montagu, Marc; Vandenbroucke, Roosmarijn E.; Libert, Claude

2016-01-01

Genetic polymorphisms in coding genes play an important role when using mouse inbred strains as research models. They have been shown to influence research results, explain phenotypical differences between inbred strains, and increase the amount of interesting gene variants present in the many available inbred lines. SPRET/Ei is an inbred strain derived from Mus spretus that has ∼1% sequence difference with the C57BL/6J reference genome. We obtained a listing of all SNPs and insertions/deletions (indels) present in SPRET/Ei from the Mouse Genomes Project (Wellcome Trust Sanger Institute) and processed these data to obtain an overview of all transcripts having nonsynonymous coding sequence variants. We identified 8,883 unique variants affecting 10,096 different transcripts from 6,328 protein-coding genes, which is about 28% of all coding genes. Because only a subset of these variants results in drastic changes in proteins, we focused on variations that are nonsense mutations that ultimately resulted in a gain of a stop codon. These genes were identified by in silico changing the C57BL/6J coding sequences to the SPRET/Ei sequences, converting them to amino acid (AA) sequences, and comparing the AA sequences. All variants and transcripts affected were also stored in a database, which can be browsed using a SPRET/Ei M. spretus variants web tool (www.spretus.org), including a manual. We validated the tool by demonstrating the loss of function of three proteins predicted to be severely truncated, namely Fas, IRAK2, and IFNγR1. PMID:27147605
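The procedure described above (swap in the strain's coding sequence, translate, compare the amino acid sequences) can be sketched in a few lines. The example assumes the standard genetic code, substitutions only (so reference and variant coding sequences share coordinates), and made-up sequences; `translate` and `gained_stop` are hypothetical names, not the authors' tool.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def gained_stop(ref_cds, alt_cds):
    """Return the codon index of a premature stop present only in alt_cds,
    or None if the variant sequence does not terminate earlier."""
    ref_stop = translate(ref_cds).find("*")
    alt_stop = translate(alt_cds).find("*")
    if alt_stop == -1:
        return None
    if ref_stop == -1 or alt_stop < ref_stop:
        return alt_stop
    return None


# Hypothetical reference and variant: TGG (Trp) changed to TGA (stop).
ref = "ATGAAATGGTAA"
alt = "ATGAAATGATAA"
print(translate(ref), translate(alt), gained_stop(ref, alt))
```

Indels would shift the reading frame and need separate handling, which this sketch deliberately leaves out.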
Identification of the sequence variations of 15 autosomal STR loci in a Chinese population.

PubMed

Chen, Wenjing; Cheng, Jianding; Ou, Xueling; Chen, Yong; Tong, Dayue; Sun, Hongyu

2014-01-01

DNA sequence variation, including base changes, insertions, or deletions in the primer binding region, may cause a null allele and, if it shifts the length of the amplified fragment outside the allelic ladder, off-ladder (OL) alleles may be detected. In order to provide accurate and reliable DNA evidence for forensic DNA analysis, it is essential to clarify sequence variations in prevalently used STR loci. Suspected null alleles and OL alleles of the PowerPlex® 16 System from 21,934 unrelated Chinese individuals were verified by alternative systems and sequenced. A total of 17 cases with null alleles were identified, including 12 kinds of point mutations in 16 cases and a 19-base deletion in one case. The total frequency of null alleles was 7.751 × 10^-4. Eight hundred and forty-four OL alleles classified as being of 97 different kinds were observed at 15 STR loci of the PowerPlex® 16 System except vWA. All the frequencies of OL alleles were under 0.01. Null alleles should be confirmed by alternative primers and OL alleles should be named appropriately. Particular attention should be paid to sequence variation, since incorrect designation could lead to false conclusions.
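The reported null-allele frequency can be checked against the counts given in the abstract above. The abstract does not state the denominator; the sketch below assumes the frequency is the number of null-allele cases divided by the number of individuals tested, which reproduces the published figure.

```python
# Counts quoted in the abstract; the choice of denominator is an assumption.
null_allele_cases = 17
individuals_tested = 21_934

frequency = null_allele_cases / individuals_tested
print(f"{frequency:.3e}")  # scientific notation with three decimals
```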
Sequence characterization of S100A8 gene reveals structural differences of protein and transcriptional factor binding sites in water buffalo and yak.

PubMed

Kathiravan, P; Goyal, S; Kataria, R S; Mishra, B P; Jayakumar, S; Joshi, B K

2011-01-01

The present study was undertaken to characterize the structure of S100A8 gene and its promoter in water buffalo and yak. Sequence data of 2.067 kb, 2.071 kb, and 2.052 kb with respect to complete S100A8 gene including 5' flanking region was generated in river buffalo, swamp buffalo, and yak, respectively. BLAST analysis of coding DNA sequences (CDS) of S100A8 gene revealed 95% homology of buffalo sequence with cattle, 85% with pig and horse, 83% with dog, 72-73% with murines, and around 79% with primates and humans. Phylogenetic analysis of predicted CDS revealed distinct clustering of murines, primates, and domestic animals with bovines and bubalines forming a subcluster among farm animals. In silico translation of predicted CDS revealed a sequence of 89 amino acids with 7 amino acid changes between cattle and buffalo and 2 changes between cattle and yak. The search for Pfam family revealed the N-terminal calcium binding domain and the noncanonical EF hand domain in the carboxy terminus, with more variations being observed in the N-terminal domain among different species. Two amino acid changes observed in carboxy terminal EF hand domain resulted in altered secondary structure of yak S100A8 protein. Analysis of S100A8 gene promoter revealed 14 putative motifs for transcriptional factor binding sites. Two putative motifs viz. C/EBP and v-Myb were found to be absent in swamp buffalo as compared to river buffalo and cattle. Differences in the structure of S100A8 protein and the transcriptional factor binding sites identified in the present study need to be analyzed further for their functional significance in yak and swamp buffalo respectively. Copyright © Taylor & Francis Group, LLC
Assembly and comparison of two closely related Brassica napus genomes.

PubMed

Bayer, Philipp E; Hurgobin, Bhavna; Golicz, Agnieszka A; Chan, Chon-Kit Kenneth; Yuan, Yuxuan; Lee, HueyTyng; Renton, Michael; Meng, Jinling; Li, Ruiyuan; Long, Yan; Zou, Jun; Bancroft, Ian; Chalhoub, Boulos; King, Graham J; Batley, Jacqueline; Edwards, David

2017-12-01

As an increasing number of plant genome sequences become available, it is clear that gene content varies between individuals, and the challenge arises to predict the gene content of a species. However, genome comparison is often confounded by variation in assembly and annotation. Differentiating between true gene absence and variation in assembly or annotation is essential for the accurate identification of conserved and variable genes in a species. Here, we present the de novo assembly of the B. napus cultivar Tapidor and comparison with an improved assembly of the Brassica napus cultivar Darmor-bzh. Both cultivars were annotated using the same method to allow comparison of gene content. We identified genes unique to each cultivar and differentiated these from artefacts due to variation in the assembly and annotation. We demonstrate that using a common annotation pipeline can result in different gene predictions, even for closely related cultivars, and that repeat regions which collapse during assembly impact whole genome comparison. After accounting for differences in assembly and annotation, we demonstrate that the genome of Darmor-bzh contains a greater number of genes than the genome of Tapidor. Our results are the first step towards comparison of the true differences between B. napus genomes and highlight the potential sources of error in future production of a B. napus pangenome. © 2017 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Genetic variation of the Borrelia burgdorferi gene vlsE involves cassette-specific, segmental gene conversion.

PubMed

Zhang, J R; Norris, S J

1998-08-01

The Lyme disease spirochete Borrelia burgdorferi possesses 15 silent vls cassettes and a vls expression site (vlsE) encoding a surface-exposed lipoprotein. Segments of the silent vls cassettes have been shown to recombine with the vlsE cassette region in the mammalian host, resulting in combinatorial antigenic variation. Despite promiscuous recombination within the vlsE cassette region, the 5' and 3' coding sequences of vlsE that flank the cassette region are not subject to sequence variation during these recombination events. The segments of the silent vls cassettes recombine in the vlsE cassette region through a unidirectional process such that the sequence and organization of the silent vls loci are not affected. As a result of recombination, the previously expressed segments are replaced by incoming segments and apparently degraded. These results provide evidence for a gene conversion mechanism in VlsE antigenic variation.
Method for Constructing Composite Response Surfaces by Combining Neural Networks with Polynomial Interpolation or Estimation Techniques

NASA Technical Reports Server (NTRS)

Rai, Man Mohan (Inventor); Madavan, Nateri K. (Inventor)

2007-01-01

A method and system for data modeling that incorporates the advantages of both traditional response surface methodology (RSM) and neural networks is disclosed. The invention partitions the parameters into a first set of s simple parameters, where observable data are expressible as low order polynomials, and c complex parameters that reflect more complicated variation of the observed data. Variation of the data with the simple parameters is modeled using polynomials; and variation of the data with the complex parameters at each vertex is analyzed using a neural network. Variations with the simple parameters and with the complex parameters are expressed using a first sequence of shape functions and a second sequence of neural network functions. The first and second sequences are multiplicatively combined to form a composite response surface, dependent upon the parameter values, that can be used to identify an accurate mode
