Genetic imputation has become standard practice in modern genetic studies. However, several important issues have not been adequately addressed, including the utility of study-specific reference, performance in admixed populations, and quality for less common (minor allele frequency [MAF] 0.005–0.05) and rare (MAF < 0.005) variants. These issues only recently became addressable with genome-wide association studies (GWAS) follow-up studies using dense genotyping or sequencing in large samples of non-European individuals. In this work, we constructed a study-specific reference panel of 3,924 haplotypes using African Americans in the Women’s Health Initiative (WHI) genotyped on both the Metabochip and the Affymetrix 6.0 GWAS platform. We used this reference panel to impute into 6,459 WHI SNP Health Association Resource (SHARe) study subjects with only GWAS genotypes. Our analysis confirmed the imputation quality metric Rsq (estimated r2, specific to each SNP) as an effective post-imputation filter. We recommend different Rsq thresholds for different MAF categories such that the average (across SNPs) Rsq is above the desired dosage r2 (squared Pearson correlation between imputed and experimental genotypes). With a desired dosage r2 of 80%, 99.9% (97.5%, 83.6%, 52.0%, 20.5%) of SNPs with MAF > 0.05 (0.03–0.05, 0.01–0.03, 0.005–0.01, and 0.001–0.005) passed the post-imputation filter. The average dosage r2 for these SNPs is 94.7%, 92.1%, 89.0%, 83.1%, and 79.7%, respectively. These results suggest that, for African Americans, imputation of Metabochip SNPs from GWAS data, including low frequency SNPs with MAF 0.005–0.05, is feasible and worthwhile for power increase in downstream association analysis, provided a sizable reference panel is available.
Liu, Eric Yi; Buyske, Steven; Aragaki, Aaron K.; Peters, Ulrike; Boerwinkle, Eric; Carlson, Chris; Carty, Cara; Crawford, Dana C.; Haessler, Jeff; Hindorff, Lucia A.; Marchand, Loic Le; Manolio, Teri A.; Matise, Tara; Wang, Wei; Kooperberg, Charles; North, Kari E.; Li, Yun
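The two quantities in this abstract can be illustrated with a minimal sketch. It assumes dosages and genotypes are plain Python lists of numbers; the function names `dosage_r2` and `pick_rsq_threshold` and the toy Rsq values are hypothetical and are not taken from the study. The threshold search follows the stated recommendation (choose a cutoff per MAF category so that the average Rsq of retained SNPs reaches the desired dosage r2), but the authors' exact procedure is not given in the abstract.

```python
from statistics import mean


def dosage_r2(imputed, observed):
    """Squared Pearson correlation between imputed dosages and observed genotypes."""
    mx, my = mean(imputed), mean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(imputed, observed))
    sxx = sum((x - mx) ** 2 for x in imputed)
    syy = sum((y - my) ** 2 for y in observed)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a SNP with no variation is scored as 0
    return (sxy * sxy) / (sxx * syy)


def pick_rsq_threshold(rsq_values, target):
    """Smallest Rsq cutoff whose retained SNPs have mean Rsq >= target.

    Returns None when no cutoff reaches the target.
    Intended to be run separately for each MAF category.
    """
    ordered = sorted(rsq_values)
    for i, cutoff in enumerate(ordered):
        if mean(ordered[i:]) >= target:
            return cutoff
    return None


# Toy Rsq values for one hypothetical MAF category (not data from the study).
rsq_bin = [0.2, 0.5, 0.9, 0.95]
cutoff = pick_rsq_threshold(rsq_bin, target=0.8)
kept = [r for r in rsq_bin if r >= cutoff]
```

Because dropping the lowest-Rsq SNPs can only raise the mean of those that remain, scanning the sorted values from the bottom finds the least stringent cutoff that meets the target.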
Background Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to a) increase the power to detect strong or weak genotype effects or b) as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which eventually are time-consuming. Results Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. 
Conclusions YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs for missing markers to avoid unnecessary power loss. MA with YAMAS can be readily conducted as YAMAS provides a generic parser for heterogeneous tabulated file formats within the GWAS field and avoids cumbersome setups. In this way, it supplements the meta-analysis process.
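The idea of combining a SNP's effect estimates with those of its proxy SNPs can be sketched as follows. This is not YAMAS code: it assumes a generic inverse-variance fixed-effects meta-analysis, per-study results stored as dictionaries of `(beta, se)` pairs, and a hypothetical proxy list in which each proxy carries a sign (+1 if its effect allele is in phase with the target SNP's effect allele, -1 otherwise). All names and numbers are illustrative.

```python
import math


def fixed_effects_meta(estimates):
    """Inverse-variance weighted pooling of (beta, se) pairs.

    Returns the pooled beta, its standard error, and the z statistic.
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def effect_for(study, snp, proxies):
    """(beta, se) for `snp` in one study, falling back to the first available proxy.

    `proxies` is a list of (proxy_id, sign) pairs; the sign flips the proxy's
    beta when its effect allele is out of phase with the target allele.
    """
    if snp in study:
        return study[snp]
    for proxy_id, sign in proxies:
        if proxy_id in study:
            beta, se = study[proxy_id]
            return sign * beta, se
    return None  # neither the SNP nor any proxy was typed in this study


# Toy input: study 2 lacks snpA but has its out-of-phase proxy snpB.
studies = [{"snpA": (0.2, 0.1)}, {"snpB": (-0.4, 0.1)}]
proxies_for_snpA = [("snpB", -1)]
usable = [effect_for(s, "snpA", proxies_for_snpA) for s in studies]
pooled_beta, pooled_se, z = fixed_effects_meta([e for e in usable if e is not None])
```

A naive meta-analysis would drop the second study for this SNP; the proxy lookup keeps it in the pooled estimate.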
Genotype imputation is now an essential tool in the analysis of genomewide association scans. The technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation increases power of genomewide association scans and is particularly useful for combining the association scan results across studies that rely on different genotyping platforms. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several actual gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.
Li, Yun; Willer, Cristen; Sanna, Serena; Abecasis, Goncalo
Although genomic selection offers the prospect of improving the rate of genetic gain in meat, wool and dairy sheep breeding programs, the key constraint is likely to be the cost of genotyping. Potentially, this constraint can be overcome by genotyping selection candidates for a low density (low cost) panel of SNPs with sparse genotype coverage, imputing a much higher density of SNP genotypes using a densely genotyped reference population. These imputed genotypes would then be used with a prediction equation to produce genomic estimated breeding values. In the future, it may also be desirable to impute very dense marker genotypes or even whole genome re-sequence data from moderate density SNP panels. Such a strategy could lead to an accurate prediction of genomic estimated breeding values across breeds, for example. We used genotypes from 48,640 (50K) SNPs genotyped in four sheep breeds to investigate both the accuracy of imputation of the 50K SNPs from low density SNP panels, as well as prospects for imputing very dense or whole genome re-sequence data from the 50K SNPs (by leaving out a small number of the 50K SNPs at random). Accuracy of imputation was low if the sparse panel had fewer than 5000 (5K) markers. Across breeds, it was clear that the accuracy of imputing from sparse marker panels to 50K was higher if the genetic diversity within a breed was lower, such that relationships among animals in that breed were higher. The accuracy of imputation from sparse genotypes to 50K genotypes was higher when the imputation was performed within breed rather than when pooling all the data, despite the fact that the pooled reference set was much larger. For Border Leicesters, Poll Dorsets and White Suffolks, 5K sparse genotypes were sufficient to impute 50K with 80% accuracy. For Merinos, the accuracy of imputing 50K from 5K was lower at 71%, despite a large number of animals with full genotypes (2215) being used as a reference.
For all breeds, the relationship of individuals to the reference explained up to 64% of the variation in accuracy of imputation, demonstrating that accuracy of imputation can be increased if sires and other ancestors of the individuals to be imputed are included in the reference population. The accuracy of imputation could also be increased if pedigree information was available and was used in tracking inheritance of large chromosome segments within families. In our study, we only considered methods of imputation based on population-wide linkage disequilibrium (largely because the pedigree for some of the populations was incomplete). Finally, in the scenarios designed to mimic imputation of high density or whole genome re-sequence data from the 50K panel, the accuracy of imputation was much higher (86-96%). This is promising, suggesting that in silico genome re-sequencing is possible in sheep if a suitable pool of key ancestors is sequenced for each breed. PMID:22221027
Hayes, B J; Bowman, P J; Daetwyler, H D; Kijas, J W; van der Werf, J H J
Imputation of genome-wide single-nucleotide polymorphism (SNP) arrays to a larger known reference panel of SNPs has become a standard and an essential part of genome-wide association studies. However, little is known about the behavior of imputation in African Americans with respect to the different imputation algorithms, the reference population(s) and the reference SNP panels used. Genome-wide SNP data (Affymetrix 6.0) from 3207 African American samples in the Atherosclerosis Risk in Communities Study (ARIC) was used to systematically evaluate imputation quality and yield. Imputation was performed with the imputation algorithms MACH, IMPUTE and BEAGLE using several combinations of three reference panels of HapMap III (ASW, YRI and CEU) and 1000 Genomes Project (pilot 1 YRI June 2010 release, EUR and AFR August 2010 and June 2011 releases) panels with SNP data on chromosomes 18, 20 and 22. About 10% of the directly genotyped SNPs from each chromosome were masked, and SNPs common between the reference panels were used for evaluating the imputation quality using two statistical metrics: concordance accuracy and Cohen’s kappa (κ) coefficient. The dependencies of these metrics on the minor allele frequencies (MAF) and specific genotype categories (minor allele homozygotes, heterozygotes and major allele homozygotes) were thoroughly investigated to determine the best panel and method for imputation in African Americans. In addition, the power to detect imputed SNPs associated with simulated phenotypes was studied using the mean genotype of each masked SNP in the imputed data. Our results indicate that the genotype concordances after stratification into each genotype category and Cohen’s κ coefficient are considerably better equipped to differentiate imputation performance compared with the traditionally used total concordance statistic, and both statistics improved with increasing MAF irrespective of the imputation method.
We also find that both MACH and IMPUTE performed equally well and consistently better than BEAGLE irrespective of the reference panel used. Of the various combinations of reference panels, for both HapMap III and 1000 Genomes Project reference panels, the multi-ethnic panels had better imputation accuracy than those containing only single ethnic samples. The most recent 1000 Genomes Project release June 2011 had substantially higher number of imputed SNPs than HapMap III and performed as well or better than the best combined HapMap III reference panels and previous releases of the 1000 Genomes Project.
Chanda, Pritam; Yuhki, Naoya; Li, Man; Bader, Joel S; Hartz, Alex; Boerwinkle, Eric; Kao, WH Linda; Arking, Dan E
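The contrast between total concordance and Cohen's κ can be seen in a small sketch. It assumes best-guess genotypes coded as 0, 1 or 2 copies of the minor allele; the function names and the toy data are illustrative and are not taken from the ARIC analysis.

```python
from collections import Counter


def concordance(true, imputed):
    """Fraction of genotypes where the imputed call equals the true call."""
    return sum(t == i for t, i in zip(true, imputed)) / len(true)


def cohens_kappa(true, imputed):
    """Concordance corrected for the agreement expected by chance."""
    n = len(true)
    observed = concordance(true, imputed)
    true_counts, imputed_counts = Counter(true), Counter(imputed)
    expected = sum(true_counts[g] * imputed_counts[g] for g in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # assumption: kappa is undefined here and is reported as 0
    return (observed - expected) / (1.0 - expected)


# Toy rare SNP: 98 major homozygotes and 2 heterozygotes,
# with every sample imputed as a major homozygote.
true_calls = [0] * 98 + [1, 1]
imputed_calls = [0] * 100
total_concordance = concordance(true_calls, imputed_calls)
kappa = cohens_kappa(true_calls, imputed_calls)
```

In this toy case the total concordance is 0.98 even though no minor allele was recovered, whereas κ is 0, which is the kind of difference that makes chance-corrected metrics more informative at low MAF.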
Imputation, the practice of 'filling in' missing data with plausible values, has long been recognized as an attractive approach to analysing incomplete data. For decades, survey statisticians have been imputing large databases by often elaborate means.1 From an operational standpoint, imputation solves the missing-data problem at the outset, enabling the analyst to proceed without further hindrance. From a statistical standpoint,
Joseph L Schafer
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of geno-typed SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
Hancock, Dana B.; Levy, Joshua L.; Gaddis, Nathan C.; Saccone, Nancy L.; Bierut, Laura J.; Page, Grier P.
Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina’s HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (EUR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes, for all imputed SNPs). Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%–93%), but IMPUTE2 had the highest IQS (81%–83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels.
While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL panel for African American studies requires careful interpretation of the population specificity and imputation quality of low frequency SNPs.
Hancock, Dana B.; Levy, Joshua L.; Gaddis, Nathan C.; Bierut, Laura J.; Saccone, Nancy L.; Page, Grier P.; Johnson, Eric O.
Most current genotype imputation methods are model-based and computationally intensive, taking days to impute one chromosome pair on 1000 people. We describe an efficient genotype imputation method based on matrix completion. Our matrix completion method is implemented in MATLAB and tested on real data from HapMap 3, simulated pedigree data, and simulated low-coverage sequencing data derived from the 1000 Genomes Project. Compared with leading imputation programs, the matrix completion algorithm embodied in our program MENDEL-IMPUTE achieves comparable imputation accuracy while reducing run times significantly. Implementation in a lower-level language such as Fortran or C is apt to further improve computational efficiency.
Chi, Eric C.; Zhou, Hua; Chen, Gary K.; Del Vecchyo, Diego Ortega; Lange, Kenneth
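To make the matrix completion idea concrete, here is a deliberately small sketch. It is not the MENDEL-IMPUTE algorithm, which the abstract does not describe in detail: it assumes a rank-1 model fitted by alternating least squares on the observed entries only, with missing genotypes marked as `None`. The function name and the toy matrix are hypothetical.

```python
def complete_rank1(matrix, iterations=200):
    """Fill None entries using a rank-1 fit u[i] * v[j] to the observed entries.

    Alternates closed-form least-squares updates of the row factors u
    and the column factors v; observed entries are returned unchanged.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            denom = sum(v[j] ** 2 for j in obs)
            if denom > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / denom
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            denom = sum(u[i] ** 2 for i in obs)
            if denom > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / denom
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy matrix that is exactly rank 1; the masked entry's true value is 6.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, None]]
filled = complete_rank1(toy)
```

Real genotype matrices are far from rank 1, so a practical method would use a higher rank, local windows of markers, and regularization; the sketch only shows how low-rank structure lets observed entries determine missing ones.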
Background Genotype imputation is commonly used in genetic association studies to test untyped variants using information on linkage disequilibrium (LD) with typed markers. Imputing genotypes requires a suitable reference population in which the LD pattern is known, most often one selected from HapMap. However, some populations, such as American Indians, are not represented in HapMap. In the present study, we assessed accuracy of imputation using HapMap reference populations in a genome-wide association study in Pima Indians. Results Data from six randomly selected chromosomes were used. Genotypes in the study population were masked (either 1% or 20% of SNPs available for a given chromosome). The masked genotypes were then imputed using the software Markov Chain Haplotyping Algorithm. Using four HapMap reference populations, average genotype error rates ranged from 7.86% for Mexican Americans to 22.30% for Yoruba. In contrast, use of the original Pima Indian data as a reference resulted in an average error rate of 1.73%. Conclusions Our results suggest that the use of HapMap reference populations results in substantial inaccuracy in the imputation of genotypes in American Indians. A possible solution would be to densely genotype or sequence a reference American Indian population.
Malhotra, Alka; Kobes, Sayuko; Bogardus, Clifton; Knowler, William C.; Baier, Leslie J.; Hanson, Robert L.
Background Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777 609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Methods Five thousand one hundred and fifty three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Results Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend to discard these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. 
More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. No improvement was observed when multi-breed imputation was performed. Conclusion In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations.
Genotype imputation has the potential to assess human genetic variation at a lower cost than assaying the variants using laboratory techniques. The performance of imputation for rare variants has not been comprehensively studied. We utilized 8865 human samples with high depth resequencing data for the exons and flanking regions of 202 genes and Genome-Wide Association Study (GWAS) data to characterize the performance of genotype imputation for rare variants. We evaluated reference sets ranging from 100 to 3713 subjects for imputing into samples typed for the Affymetrix (500K and 6.0) and Illumina 550K GWAS panels. The proportion of variants that could be well imputed (true r2>0.7) with a reference panel of 3713 individuals was: 31% (Illumina 550K) or 25% (Affymetrix 500K) with MAF (Minor Allele Frequency) less than or equal 0.001, 48% or 35% with 0.001
Li, Li; Li, Yun; Browning, Sharon R.; Browning, Brian L.; Slater, Andrew J.; Kong, Xiangyang; Aponte, Jennifer L.; Mooser, Vincent E.; Chissoe, Stephanie L.; Whittaker, John C.; Nelson, Matthew R.; Ehm, Margaret Gelder
Background Imputation of genotypes from low-density to higher density chips is a cost-effective method to obtain high-density genotypes for many animals, based on genotypes of only a relatively small subset of animals (reference population) on the high-density chip. Several factors influence the accuracy of imputation and our objective was to investigate the effects of the size of the reference population used for imputation and of the imputation method used and its parameters. Imputation of genotypes was carried out from 50 000 (moderate-density) to 777 000 (high-density) SNPs (single nucleotide polymorphisms). Methods The effect of reference population size was studied in two datasets: one with 548 and one with 1289 Holstein animals, genotyped with the Illumina BovineHD chip (777 k SNPs). A third dataset included the 548 animals genotyped with the 777 k SNP chip and 2200 animals genotyped with the Illumina BovineSNP50 chip. In each dataset, 60 animals were chosen as validation animals, for which all high-density genotypes were masked, except for the Illumina BovineSNP50 markers. Imputation was studied in a subset of six chromosomes, using the imputation software programs Beagle and DAGPHASE. Results Imputation with DAGPHASE and Beagle resulted in 1.91% and 0.87% allelic imputation error rates in the dataset with 548 high-density genotypes, when scale and shift parameters were 2.0 and 0.1, and 1.0 and 0.0, respectively. When Beagle was used alone, the imputation error rate was 0.67%. If the information obtained by Beagle was subsequently used in DAGPHASE, imputation error rates were slightly higher (0.71%). When 2200 moderate-density genotypes were added and Beagle was used alone, imputation error rates were slightly lower (0.64%). The least imputation errors were obtained with Beagle in the reference set with 1289 high-density genotypes (0.41%). 
Conclusions For imputation of genotypes from the 50 k to the 777 k SNP chip, Beagle gave the lowest allelic imputation error rates. Imputation error rates decreased with increasing size of the reference population. For applications for which computing time is limiting, DAGPHASE using information from Beagle can be considered as an alternative, since it reduces computation time and increases imputation error rates only slightly.
Multiple imputation for nonresponse in public-use files replaces each missing value by two or more plausible values. The values can be chosen to represent both uncertainty about which values to impute assuming the reasons for nonresponse are known and uncertainty about the reasons for nonresponse. The theoretical underpinnings and several examples are given in Rubin (1987). This presentation illustrates the dramatic
Donald B. Rubin
Rubin has offered multiple imputation as a general approach to inference from survey data sets with missing values filled in through imputation. In spite of the considerable scope of work on the subject, the literature on multiple imputation has failed to produce a set of clear and sufficient conditions for the validity of multiple imputation that would justify many of
Robert E. Fay
Missing data imputation is an important issue in machine learning and data mining. In this paper, we propose a new and efficient imputation method for a kind of missing data: semi-parametric data. Our imputation method aims at making an optimal evaluation about Root Mean Square Error (RMSE), distribution function and quantile after missing-data are imputed. We evaluate our
Yongsong Qin; Shichao Zhang; Xiaofeng Zhu; Jilian Zhang; Chengqi Zhang
The aim of this study was to evaluate the impact of genotype imputation on the performance of the GBLUP and Bayesian methods for genomic prediction. A total of 10,309 Holstein bulls were genotyped on the BovineSNP50 BeadChip (50 k). Five low density single nucleotide polymorphism (SNP) panels, containing 6,177, 2,480, 1,536, 768 and 384 SNPs, were simulated from the 50 k panel. A fraction of 0%, 33% and 66% of the animals were randomly selected from the training sets to have low density genotypes which were then imputed into 50 k genotypes. A GBLUP and a Bayesian method were used to predict direct genomic values (DGV) for validation animals using imputed or their actual 50 k genotypes. Traits studied included milk yield, fat percentage, protein percentage and somatic cell score (SCS). Results showed that performance of both GBLUP and Bayesian methods was influenced by imputation errors. For traits affected by a few large QTL, the Bayesian method resulted in greater reductions of accuracy due to imputation errors than GBLUP. Including SNPs with largest effects in the low density panel substantially improved the accuracy of genomic prediction for the Bayesian method. Including genotypes imputed from the 6 k panel achieved almost the same accuracy of genomic prediction as that of using the 50 k panel even when 66% of the training population was genotyped on the 6 k panel. These results justified the application of the 6 k panel for genomic prediction. Imputations from lower density panels were more prone to errors and resulted in lower accuracy of genomic prediction. But for animals that have close relationship to the reference set, genotype imputation may still achieve a relatively high accuracy.
Chen, Liuhong; Li, Changxi; Sargolzaei, Mehdi; Schenkel, Flavio
Imputation and multiple-imputation procedures have been used in practice to handle the problem of ignorable nonresponse in sample surveys. We examine the large-sample properties of these procedures where covariates are available for the case when the complete-data analysis is based on least squares. The results provide a formal justification for the inference procedures discussed by Rubin and Schenker for the
Nathaniel Schenker; A. H. Welsh
Background Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Results Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels, suggests that a 2K SNP panel would represent good value for money. Conclusions Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses. 
In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are starting to use the recently developed equine 70K SNP chip. However, more work is needed to evaluate the impact of between-breed differences on imputation accuracy.
Background: The activity of thiopurine methyltransferase (TPMT) is subject to genetic variation. Loss-of-function alleles are associated with various degrees of myelosuppression after treatment with thiopurine drugs, thus genotype-based dosing recommendations currently exist. The aim of this study was to evaluate the potential utility of leveraging genomic data from large biorepositories in the identification of individuals with TPMT defective alleles. Material and methods: TPMT variants were imputed using the 1000 Genomes Project reference panel in 87,979 samples from the biobank at The Children's Hospital of Philadelphia. Population ancestry was determined by principal component analysis using HapMap3 samples as reference. Frequencies of the TPMT imputed alleles, genotypes and the associated phenotype were determined across the different populations. A sample of 630 subjects with genotype data from Sanger sequencing (N = 59) and direct genotyping (N = 583) (12 samples overlapping in the two groups) was used to check the concordance between the imputed and observed genotypes, as well as the sensitivity, specificity and positive and negative predictive values of the imputation. Results: Two SNPs (rs1800460 and rs1142345) that represent three TPMT alleles (*3A, *3B, and *3C) were imputed with adequate quality. Frequency for the associated enzyme activity varied across populations and 89.36–94.58% were predicted to have normal TPMT activity, 5.3–10.31% intermediate and 0.12–0.34% poor activities. Overall, 98.88% of individuals (623/630) were correctly imputed into carrying no risk alleles (553/553), heterozygous (45/46) and homozygous (25/31). Sensitivity, specificity and predictive values of imputation were over 90% in all cases except for the sensitivity of imputing homozygous subjects that was 80.64%. 
Conclusion: Imputation of TPMT alleles from existing genomic data can be used as a first step in the screening of individuals at risk of developing serious adverse events secondary to thiopurine drugs.
Almoguera, Berta; Vazquez, Lyam; Connolly, John J.; Bradfield, Jonathan; Sleiman, Patrick; Keating, Brendan; Hakonarson, Hakon
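The concordance figures reported above can be reproduced from the counts given in the abstract. The sketch below assumes only those counts (correctly imputed over observed, per genotype class); the dictionary and function names are illustrative. Specificity and predictive values would need the full confusion matrix, which the abstract does not provide, so they are not computed here.

```python
# Counts from the abstract: correctly imputed / observed, per genotype class.
correct = {"no risk allele": 553, "heterozygous": 45, "homozygous": 25}
observed = {"no risk allele": 553, "heterozygous": 46, "homozygous": 31}


def per_class_sensitivity(correct, observed):
    """Share of truly observed genotypes in each class that were imputed correctly."""
    return {label: correct[label] / observed[label] for label in observed}


sensitivity = per_class_sensitivity(correct, observed)
overall = sum(correct.values()) / sum(observed.values())  # 623 / 630
```

The homozygous class gives 25/31, about 0.806, and the overall rate is 623/630, about 0.989; the abstract's 80.64% and 98.88% match these values when truncated rather than rounded to two decimals.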
The analysis of less common variants in genome-wide association studies promises to elucidate complex trait genetics but is hampered by low power to reliably detect association. We show that addition of population-specific exome sequence data to global reference data allows more accurate imputation, particularly of less common SNPs (minor allele frequency 1-10%) in two very different European populations. The imputation improvement corresponds to an increase in effective sample size of 28-38%, for SNPs with a minor allele frequency in the range 1-3%. PMID:23874685
Joshi, Peter K; Prendergast, James; Fraser, Ross M; Huffman, Jennifer E; Vitart, Veronique; Hayward, Caroline; McQuillan, Ruth; Glodzik, Dominik; Polašek, Ozren; Hastie, Nicholas D; Rudan, Igor; Campbell, Harry; Wright, Alan F; Haley, Chris S; Wilson, James F; Navarro, Pau
The method of multiple imputation (MI) is used increasingly for analyzing datasets with missing observations. Two sets of tasks are required in order to implement the method: (a) generating multiple complete datasets in which missing values have been imputed by simulating from an appropriate probability distribution and (b) analyzing the multiple imputed datasets and combining complete data inferences from them
John B. Carlin; Ning Li; Philip Greenwood; Carolyn Coffey
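Task (b) in the abstract above, combining the complete-data inferences, is commonly done with Rubin's rules. The abstract does not say which combining rule its software uses, so the following Python sketch is an assumption on that point; the function name is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m complete-data estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance). The function name is
    hypothetical; the formulas are the standard ones.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)           # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


if __name__ == "__main__":
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations.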
The problem of missing data is often addressed with imputation. Traditional single imputation methods, such as ratio imputation, multiple regression imputation, nearest neighbor imputation, respondent mean imputation or hot deck imputation, have been widely used to compensate for non-response. Nonparametric regression methods have been recently applied to the estimation of the regression function in a wide range of settings
Ismael Sánchez-Borrego; María del Mar Rueda; Juan Muñoz
The objective of this study was to evaluate, using three different genotype density panels, the accuracy of imputation from lower- to higher-density genotypes in dairy and beef cattle. High-density genotypes consisting of 777,962 single-nucleotide polymorphisms (SNPs) were available on 3122 animals comprising 269, 196, 710, 234, 719, 730 and 264 Angus, Belgian Blue, Charolais, Hereford, Holstein-Friesian, Limousin and Simmental bulls, respectively. Three different genotype densities were generated: low density (LD; 6501 autosomal SNPs), medium density (50K; 47,770 autosomal SNPs) and high density (HD; 735,151 autosomal SNPs). Imputation from lower- to higher-density genotype platforms was undertaken within and across breeds, exploiting population-wide linkage disequilibrium. The mean allele concordance rate per breed from LD to HD, when undertaken using a single-breed or multiple-breed reference population, varied from 0.956 to 0.974 and from 0.947 to 0.967, respectively. The mean allele concordance rate per breed from 50K to HD, when undertaken using a single-breed or multiple-breed reference population, varied from 0.987 to 0.994 and from 0.987 to 0.993, respectively. The accuracy of imputation was generally greater when the reference population consisted solely of the breed to be imputed than when it comprised multiple breeds, although the impact was smaller when imputing from 50K to HD than when imputing from LD. PMID:24906026
Berry, D P; McClure, M C; Mullen, M P
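The allele concordance rate quoted in the cattle abstract above can be illustrated with a short Python sketch. It assumes genotypes are coded as counts of one allele (0, 1 or 2) and that each genotype contributes two alleles to the comparison; the study's exact definition is not given in the abstract, so this is one common reading rather than the authors' procedure.

```python
def allele_concordance(observed, imputed):
    """Share of alleles that agree between observed and imputed genotypes.

    Genotypes are allele counts in {0, 1, 2}; a heterozygote imputed as a
    homozygote counts as one matching allele out of two. This coding is an
    assumption, not taken from the abstract.
    """
    if not observed or len(observed) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")
    matched = sum(2 - abs(o - i) for o, i in zip(observed, imputed))
    return matched / (2 * len(observed))


if __name__ == "__main__":
    print(allele_concordance([0, 1, 2, 2], [0, 1, 1, 0]))
```

Under this coding a genotype-level mismatch is penalised in proportion to how many alleles differ, which is why allele concordance is usually higher than genotype concordance on the same data.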
We describe composite likelihood-based analysis of a genome-wide breast cancer case-control sample from the Cancer Genetic Markers of Susceptibility project. We determine 14,380 genome regions of fixed size on a linkage disequilibrium (LD) map, which delimit comparable levels of LD. Although the numbers of single-nucleotide polymorphisms (SNPs) are highly variable, each region contains an average of ~35 SNPs and an average of ~69 after imputation of missing genotypes. Composite likelihood association mapping yields a single P-value for each region, established by a permutation test, along with a maximum likelihood disease location, SE and information weight. For single SNP analysis, the nominal P-value for the most significant SNP (msSNP) requires substantial correction given the number of SNPs in the region. Therefore, imputing genotypes may not always be advantageous for the msSNP test, in contrast to composite likelihood. For the region containing FGFR2 (a known breast cancer gene) the largest χ² is obtained under composite likelihood with imputed genotypes (χ²₂ increases from 20.6 to 22.7), and compares with a single SNP-based χ²₂ of 19.9 after correction. Imputation of additional genotypes in this region reduces the size of the 95% confidence interval for location of the disease gene by ~40%. Among the highest ranked regions, SNPs in the NTSR1 gene would be worthy of examination in additional samples. Meta-analysis, which combines weighted evidence from composite likelihood in different samples, and refines putative disease locations, is facilitated through defining fixed regions on an underlying LD map. PMID:20959865
Politopoulos, Ioannis; Gibson, Jane; Tapper, William; Ennis, Sarah; Eccles, Diana; Collins, Andrew
We present full-genome genotype imputations for 100 classical laboratory mouse strains, using a novel method. Using genotypes at 549,683 SNP loci obtained with the Mouse Diversity Array, we partitioned the genome of 100 mouse strains into 40,647 intervals that exhibit no evidence of historical recombination. For each of these intervals we inferred a local phylogenetic tree. We combined these data with 12 million loci with sequence variations recently discovered by whole-genome sequencing in a common subset of 12 classical laboratory strains. For each phylogenetic tree we identified strains sharing a leaf node with one or more of the sequenced strains. We then imputed high- and medium-confidence genotypes for each of 88 nonsequenced genomes. Among inbred strains, we imputed 92% of SNPs genome-wide, with 71% in high-confidence regions. Our method produced 977 million new genotypes with an estimated per-SNP error rate of 0.083% in high-confidence regions and 0.37% genome-wide. Our analysis identified which of the 88 nonsequenced strains would be the most informative for improving full-genome imputation, as well as which additional strain sequences will reveal more new genetic variants. Imputed sequences and quality scores can be downloaded and visualized online.
Wang, Jeremy R.; de Villena, Fernando Pardo-Manuel; Lawson, Heather A.; Cheverud, James M.; Churchill, Gary A.; McMillan, Leonard
Multiple imputation was designed to handle the problem of missing data in public-use data bases where the data-base constructor and the ultimate user are distinct entities. The objective is valid frequency inference for ultimate users who in general have access only to complete-data software and possess limited knowledge of specific reasons and models for nonresponse. For this situation and objective,
Donald B. Rubin
Missing values, common in epidemiologic studies, are a major issue in obtaining valid estimates. Simulation studies have suggested that multiple imputation is an attractive method for imputing missing values, but it is relatively complex and requires specialized software. For each of 28 studies in the Asia Pacific Cohort Studies Collaboration, a comparison of eight imputation procedures (unconditional and conditional mean,
Federica Barzi; Mark Woodward
Summary: We propose a multiple imputation estimator for parameter estimation in a quantile regression model when some covariates are missing at random. The estimation procedure fully utilizes the entire dataset to achieve increased efficiency, and the resulting coefficient estimators are root-n consistent and asymptotically normal. To protect against possible model misspecification, we further propose a shrinkage estimator, which automatically adjusts for possible bias. The finite sample performance of our estimator is investigated in a simulation study. Finally, we apply our methodology to part of the Eating at America’s Table Study data, investigating the association between two measures of dietary intake.
Wei, Ying; Ma, Yanyuan; Carroll, Raymond J.
When using multiple imputation in the analysis of incomplete data, a prominent guideline suggests that more than 10 imputed data values are seldom needed. This article calls into question the optimism of this guideline and illustrates that important quantities (e.g., p values, confidence interval half-widths, and estimated fractions of missing…
Bodner, Todd E.
Multiple imputation is a technique for handling data sets with missing values. The method fills in the missing values several times, creating several completed data sets for analysis. Each data set is analyzed separately using techniques designed for complete data, and the results are then combined in such a way that the variability due to imputation may be incorporated. Methods
Nathaniel Schenker; Jeremy M. G. Taylor
Multiple imputation provides a useful strategy for dealing with data sets with missing values. Instead of filling in a single value for each missing value, Rubin's (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. These multiply imputed data sets are then analyzed by using
Yang C. Yuan
In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation, and the missing-indicator method) produce biased results, whereas imputation techniques yield valid results without complicating the analysis once the imputations are carried out. Imputation techniques are based on the idea that any subject in a study sample can be replaced by a new
A. Rogier T. Donders; Geert J. M. G. van der Heijden; Theo Stijnen; Karel G. M. Moons
Many imputation techniques and imputation software packages have been developed over the years to deal with missing data. Different methods may work well under different circumstances, and it is advisable to conduct a sensitivity analysis when choosing an imputation method for a particular survey. This study reviewed about 30 imputation methods…
Hu, Ming-xiu; Salvucci, Sameena
Rubin has offered multiple imputation as a general approach to inference from survey data sets with missing values filled in through imputation. In many situations the multiple imputation variance estimator is consistent. In turn, this observation has lent support to a number of complex applications. In fact, however, the multiple imputation variance estimator is inconsistent under some simple conditions. This
Robert E. Fay
In this paper we propose a latent class based multiple imputation approach for analyzing missing categorical covariate data in a highly stratified data model. In this approach, we impute the missing data assuming a latent class imputation model and we use likelihood methods to analyze the imputed data. Via extensive simulations, we study its statistical properties and make comparisons with
Mulugeta Gebregziabher; Stacia M. DeSantis
Multiple imputation (MI) and full information maximum likelihood (FIML) are the two most common approaches to missing data analysis. In theory, MI and FIML are equivalent when identical models are tested using the same variables, and when m, the number of imputations performed with MI, approaches infinity. However, it is important to know how many imputations are necessary before MI
John W. Graham; Allison E. Olchowski; Tamika D. Gilreath
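One classical way to reason about the number of imputations m is Rubin's relative efficiency, 1 / (1 + γ/m), where γ is the fraction of missing information. The abstract above is truncated before it states its own criterion, so the Python sketch below only illustrates this textbook formula and should not be read as the authors' recommendation; the function name is hypothetical.

```python
def relative_efficiency(gamma, m):
    """Efficiency of an estimate from m imputations relative to m -> infinity.

    gamma is the fraction of missing information (0 <= gamma <= 1).
    Textbook formula; the function name is hypothetical.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 1.0 / (1.0 + gamma / m)


if __name__ == "__main__":
    for m in (3, 5, 10, 20, 40):
        print(m, round(relative_efficiency(0.5, m), 4))
```

Relative efficiency approaches 1 quickly as m grows, which is the basis of older guidance that a handful of imputations suffices; it measures the precision of the point estimate only, not the stability of p values or power.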
This paper discusses a real-time imputation method for sparse floating car data (FCD). Floating cars are an effective way to collect traffic information; however, because the number of floating cars is limited, a large amount of FCD is missing. In an effort to address this problem, we previously proposed a new imputation method based on feature space projection. The method consists of three major processes: (i) determination of a feature space from past FCD history; (ii) feature space projection of current FCD; and (iii) estimation of missing data performed by inverse projection from the feature space. Since estimation is achieved on each feature space axis that represents the spatially correlated component of FCD, it performs an accurate imputation and enlarges the information coverage area. However, differences in correlation among multiple road-links sometimes cause a trade-off between accuracy and coverage. Therefore, we developed an additional function to filter out the road-links that have low correlation with the others. The function uses spectral factorization as a filtering index, which is suitable for evaluating correlation on the multidimensional feature space. Combined use of the imputation method and the filtering function decreases the maximum estimation error rate from 0.39 to 0.24, while keeping 60% coverage area with sparse FCD of 15% observations.
Kumagai, Masatoshi; Hiruta, Tomoaki; Fushiki, Takumi; Yokota, Takayoshi
In genome-wide association studies (GWAS), it is a common practice to impute the genotypes of untyped single nucleotide polymorphisms (SNPs) by exploiting the linkage disequilibrium structure among SNPs. The use of imputed genotypes improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on different platforms. A popular way of using imputed data is the "expectation-substitution" method, which treats the imputed dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are usually combined using an inverse variance weighting (IVM) scheme in meta-analysis. However, the IVM is not optimal as the estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to the inverse variance and the expected value of the effect size estimates. We show both theoretically and numerically that the bias of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the expectation-substitution method, and shows the inverse variance is a good approximation of the optimal weight. Through simulation, we compared the power of the IVM method with several methods including the optimal weight, the regular z-score meta-analysis and a recently proposed "imputation aware" meta-analysis method (Zaitlen and Eskin Genet Epidemiol 34:537-542). Our results show that the performance of the inverse variance weight is always indistinguishable from the optimal weight and similar to or better than the other two methods. PMID:21769935
Jiao, Shuo; Hsu, Li; Hutter, Carolyn M; Peters, Ulrike
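The inverse variance weighting scheme discussed in the abstract above is the standard fixed-effect combination of per-study estimates. A minimal Python sketch follows; the function name and input layout are assumptions, and the optimal weights derived by the authors are not reproduced here.

```python
from math import sqrt


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect meta-analysis with weights w_k = 1 / se_k**2.

    Returns (pooled effect, pooled standard error, z statistic).
    The function name and argument layout are illustrative assumptions.
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se, pooled / pooled_se


if __name__ == "__main__":
    print(inverse_variance_meta([0.1, 0.3], [0.1, 0.1]))
```

In a GWAS setting the per-study betas would be the expectation-substitution estimates obtained by regressing the trait on imputed dosage.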
Objectives: Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. Design: Retrospective cohort analysis of two large data sets. Setting: A tertiary level care institution in Ann Arbor, Michigan. Participants: The Cirrhosis cohort had 446 patients and the Inflammatory Bowel Disease cohort had 395 patients. Methods: Non-missing laboratory data were randomly removed with varying frequencies from two large data sets, and we then compared the ability of four methods (missForest, mean imputation, nearest neighbour imputation and multivariate imputation by chained equations [MICE]) to impute the simulated missing data. We characterised the accuracy of the imputation and the effect of the imputation on predictive ability in two large data sets. Results: MissForest had the least imputation error for both continuous and categorical variables at each frequency of missingness, and it had the smallest prediction difference when models used imputed laboratory values. In both data sets, MICE had the second least imputation error and prediction difference, followed by the nearest neighbour and mean imputation. Conclusions: MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.
Waljee, Akbar K; Mukherjee, Ashin; Singal, Amit G; Zhang, Yiwei; Warren, Jeffrey; Balis, Ulysses; Marrero, Jorge; Zhu, Ji; Higgins, Peter DR
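The evaluation design in the abstract above (remove known values, impute them, then score the imputations against the truth) can be sketched for the simplest of the four methods, mean imputation. The Python below is an illustration only: the function names are hypothetical, missing values are represented by None, and root mean squared error is an assumed error measure, since the abstract does not name the one used.

```python
from math import sqrt
from statistics import mean


def mean_impute(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def rmse_on_masked(truth, masked, imputed):
    """Root mean squared error, scored only where values were masked out."""
    errors = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    if not errors:
        raise ValueError("nothing was masked")
    return sqrt(mean(errors))


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0, 6.0]
    masked = [1.0, None, 3.0, None]   # two values removed for the experiment
    imputed = mean_impute(masked)
    print(imputed, rmse_on_masked(truth, masked, imputed))
```

Scoring only the masked positions keeps the error measure from being diluted by values that were never missing.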
Background/Aims: An individual’s genotypes at a group of Single Nucleotide Polymorphisms (SNPs) can be used to predict that individual’s ethnicity, or ancestry. In medical studies, knowledge of a subject’s ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. Our goal is to select a small subset of SNPs, from the millions already identified in the human genome, that can predict ancestry with a minimal error rate. Methods: The general form for this variable selection procedure is to estimate the expected error rates for sets of SNPs using a training dataset and consider those sets with the lowest error rates given their size. The quality of the estimate for the error rate determines the quality of the resulting SNPs. As the apparent error rate performs poorly when either the number of SNPs or the number of populations is large, we propose a new estimate, the Improved Bayesian Estimate. Conclusions: We demonstrate that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. We also provide a list of the 100 optimal SNPs for identifying ancestry. R functions are available at http://bioinformatics.med.yale.edu/group/josh/index.html.
Sampson, Joshua; Kidd, Kenneth K; Kidd, Judith R; Zhao, Hongyu
The role of rare variants has become a focus in the search for association with complex traits. Imputation is a powerful and cost-efficient tool to access variants that have not been directly typed, but there are several challenges when imputing rare variants, most notably reference panel selection. Extensions to rare variant association tests to incorporate genotype uncertainty from imputation are discussed, as well as the use of imputed low frequency and rare variants in the study of population isolates.
Asimit, Jennifer L; Zeggini, Eleftheria
Abstract Multiple imputation provides a useful strategy for dealing with data sets with missing values. Instead of filling in a single value for each missing value, Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. These multiply imputed data sets are then analyzed by using standard procedures for com-
Current inverse optimization-based treatment planning for radiotherapy requires a set of complex DVH objectives to be simultaneously minimized. This process, known as multi-objective optimization, is challenging due to non-convexity in individual objectives and insufficient knowledge of the tradeoffs among the objective set. As such, clinical practice involves numerous iterations of human intervention that are costly and often inconsistent. In this work, we propose to address treatment planning with convex imputing, a new data-mining technique that explores the existence of a latent convex objective whose optimizer reflects the DVH and dose-shaping properties of previously optimized cases. Using ten clinical prostate cases as the basis for comparison, we imputed a simple least-squares problem from the optimized solutions of the prostate cases, and show that the imputed plans are more consistent than their clinical counterparts in achieving planning goals.
Sayre, G. A.; Ruan, D.
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…
Lee, Taehun; Cai, Li
Although structural equation modeling software packages use maximum likelihood estimation by default, there are situations where one might prefer to use multiple imputation to handle missing data rather than maximum likelihood estimation (e.g., when incorporating auxiliary variables). The selection of variables is one of the nuances associated with implementing multiple imputation, because the imputer must take special care to preserve
Craig K. Enders; Amanda C. Gottschall
To help cattle producers transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification, we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could accurately be imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes by MS genotyping animals whose imputed MS alleles fail parentage verification, and then incorporating those animals into the reference dataset.
McClure, Matthew C.; Sonstegard, Tad S.; Wiggans, George R.; Van Eenennaam, Alison L.; Weber, Kristina L.; Penedo, Cecilia T.; Berry, Donagh P.; Flynn, John; Garcia, Jose F.; Carmo, Adriana S.; Regitano, Luciana C. A.; Albuquerque, Milla; Silva, Marcos V. G. B.; Machado, Marco A.; Coffey, Mike; Moore, Kirsty; Boscher, Marie-Yvonne; Genestout, Lucie; Mazza, Raffaele; Taylor, Jeremy F.; Schnabel, Robert D.; Simpson, Barry; Marques, Elisa; McEwan, John C.; Cromie, Andrew; Coutinho, Luiz L.; Kuehn, Larry A.; Keele, John W.; Piper, Emily K.; Cook, Jim; Williams, Robert; Van Tassell, Curtis P.
All About Circuits is a website that "provides a series of online textbooks covering electricity and electronics." Written by Tony R. Kuphaldt, the textbooks available here are wonderful resources for students, teachers, and anyone who is interested in learning more about electronics. This specific section, Filters, is the eighth chapter in Volume II: Alternating Current (AC). A few of the topics covered in this chapter include: Low-pass filters, High-pass filters, Band-pass filters, Band-stop filters, and Resonant filters. Diagrams and detailed descriptions of concepts are included throughout the chapter to provide users with a comprehensive lesson. Visitors to the site are also encouraged to discuss concepts and topics using the All About Circuits discussion forums (registration with the site is required to post materials).
Kuphaldt, Tony R.
Multiple imputation by chained equations is a flexible and practical approach to handling missing data. We describe the principles of the method and show how to impute categorical and quantitative variables, including skewed variables. We give guidance on how to specify the imputation model and how many imputations are needed. We describe the practical analysis of multiply imputed data, including model building and model checking. We stress the limitations of the method and discuss the possible pitfalls. We illustrate the ideas using a data set in mental health, giving Stata code fragments. PMID:21225900
White, Ian R; Royston, Patrick; Wood, Angela M
Different data imputation techniques that are useful for equipercentile equating are discussed, and empirical data are used to evaluate the accuracy of these techniques as compared with chained equipercentile equating. The kernel estimator, the EM algorithm, the EB model, and the iterative moment estimator are considered. (SLD)
Liou, Michelle; Cheng, Philip E.
Background: We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step vs. two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance and (d) imputation performance in common vs. rare markers. Methods: The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy. Results: The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1000 Genomes reference sample produced similar concordance rates to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample substantially increased the proportion of non-missing genotypes (94.7% vs. 90.1%). Rare variants (<1%) had lower imputation accuracy and efficacy than common markers. Conclusions: The program IMPUTE had an excellent imputation performance for common alleles in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were higher than 98.4% using all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. The best balance of imputation accuracy and efficiency was obtained with the 1000 Genomes panel. Rare variants were not captured effectively by any of the available panels, emphasizing the need to be cautious in the interpretation of association results for imputed rare variants.
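The two measures used in the Mexico City abstract above, concordance among called genotypes (accuracy) and the proportion of non-missing genotypes (efficacy), can be computed together. The Python sketch below assumes that an imputed genotype failing the calling threshold is represented as None; the function name and genotype strings are illustrative and not taken from the study.

```python
def accuracy_and_efficacy(observed, imputed):
    """Return (concordance among called genotypes, proportion called).

    imputed entries are genotype strings, or None where no call was made.
    The representation and function name are assumptions for illustration.
    """
    if not observed or len(observed) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")
    called = [(o, i) for o, i in zip(observed, imputed) if i is not None]
    efficacy = len(called) / len(observed)
    if not called:
        return float("nan"), efficacy
    concordance = sum(o == i for o, i in called) / len(called)
    return concordance, efficacy


if __name__ == "__main__":
    observed = ["AA", "AG", "GG", "AG"]
    imputed = ["AA", "AG", None, "AA"]   # third genotype left uncalled
    print(accuracy_and_efficacy(observed, imputed))
```

Reporting both numbers matters because a stricter calling threshold tends to raise concordance while lowering the proportion of genotypes called, the trade-off the abstract describes between the single-step and two-step approaches.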
We have used linkage disequilibrium (LD) to identify single nucleotide polymorphisms (SNPs) on the Illumina Equine SNP50 BeadChip, which may be incorrectly positioned on the genome map. A total of 1201 Thoroughbred horses were genotyped using the Illumina Equine SNP50 BeadChip. LD was evaluated in a pairwise fashion between all autosomal SNPs, both within and across chromosomes. Filters were then applied to the data, firstly to identify SNPs that may have been mapped to the wrong chromosome and secondly to identify SNPs that may have been incorrectly positioned within chromosomes. We identified a single SNP on ECA28, which showed low LD with neighbouring SNPs but considerable LD with a group of SNPs on ECA10. Furthermore, a cluster of SNPs on ECA5 showed unusually low LD with surrounding SNPs. A total of 39 SNPs met the criteria for unusual within-chromosome LD. The results of this study indicate that some SNPs may be misplaced. This finding is significant, as misplaced SNPs may lead to difficulties in the application of genomic methods, such as homozygosity mapping, for which SNP order is important. PMID:22486508
Corbin, L J; Blott, S C; Swinburne, J E; Vaudin, M; Bishop, S C; Woolliams, J A
In the U.S. Census of Manufactures, the Census Bureau imputes missing values using a combination of mean imputation, ratio imputation, and conditional mean imputation. It is well-known that imputations based on these methods can result in underestimation ...
A. Petrin, J. P. Reiter, T. K. White
In the design of common-item equating, two groups of examinees are administered separate test forms, and each test form contains a common subset of items. We consider test equating under this situation as an incomplete data problem—that is, examinees have observed scores on one test form and missing scores on the other. Through the use of statistical data-imputation techniques, the
Michelle Liou; Philip E. Cheng
In medical research, it is common to have doubly censored survival data: origin time and event time are both subject to censoring. In this paper, we review simple and probability-based methods that are used to impute interval censored origin time and compare the performance of these methods through extensive simulations in the one-sample problem, two-sample problem and Cox regression model problem. The use of a bootstrap procedure for inference is demonstrated. PMID:21304834
Zhang, Wei; Zhang, Ying; Chaloner, Kathryn; Stapleton, Jack T
Recent technological breakthroughs have enabled high-throughput quantitative measurements of hundreds of thousands of genetic interactions among hundreds of genes in Saccharomyces cerevisiae. However, these assays often fail to measure the genetic interactions among up to 40% of the studied gene pairs. Here we present a novel method, which combines genetic interaction data together with diverse genomic data, to quantitatively impute these missing interactions. We also present data on almost 190,000 novel interactions.
This article describes a substantial update to mvis, which brings it more closely in line with the feature set of S. van Buuren and C. G. M. Oudshoorn’s implementation of the MICE system in R and S-PLUS (for details, see http://www.multiple-imputation.com). To make a clear distinction from mvis, the principal program of the new Stata release is called ice. I will
...2013-10-01 false Determining imputed cost of money. 1830.7002-4 Section 1830.7002-4...1830.7002-4 Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction,...
Many studies in human genetics compare informativeness of single-nucleotide polymorphisms (SNPs) and microsatellites (single sequence repeats; SSR) in genome scans, but it is difficult to transfer the results directly to livestock because of different population structures. The aim of this study was to determine the number of SNPs needed to obtain the same differentiation power as with a given standard set of microsatellites. Eight chicken breeds were genotyped for 29 SSRs and 9216 SNPs. After filtering, only 2931 SNPs remained. The differentiation power was evaluated using two methods: partitioning of the Euclidean distance matrix based on a principal component analysis (PCA) and a Bayesian model-based clustering approach. Generally, with PCA-based partitioning, 70 SNPs provide a comparable resolution to 29 SSRs. In model-based clustering, the similarity coefficient showed significantly higher values between repeated runs for SNPs compared to SSRs. For the membership coefficients, reflecting the proportion to which a fraction segment of the genome belongs to the ith cluster, the highest values were obtained for 29 SSRs and 100 SNPs respectively. With a low number of loci (29 SSRs or ?100 SNPs), neither marker types could detect the admixture in the Gödöllö Nhx population. Using more than 250 SNPs allowed a more detailed insight into the genetic architecture. Thus, the admixed population could be detected. It is concluded that breed differentiation studies will substantially gain power even with moderate numbers of SNPs. PMID:22497629
Gärke, C; Ytournel, F; Bed'hom, B; Gut, I; Lathrop, M; Weigend, S; Simianer, H
Genotype imputation is an important tool in human genetics studies, which uses reference sets with known genotypes and prior knowledge on linkage disequilibrium and recombination rates to infer un-typed alleles for human genetic variations at a low cost. The reference sets used by current imputation approaches are based on HapMap data, and/or based on recently available next-generation sequencing (NGS) data such as data generated by the 1000 Genomes Project. However, with different coverage and call rates for different NGS data sets, how to integrate NGS data sets of different accuracy as well as previously available reference data as references in imputation is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation for both simulated data and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, we considered three strategies: strategy 1 uses one NGS data set as a reference; strategy 2 imputes samples by using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, selecting the samples based on the high-accuracy reference when overlap occurs; and strategy 3 combines multiple available data sets as a single reference after imputing each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) for assessing the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we have investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding the application of imputation methods in next-generation association analyses.
Chen, Jun; Zhang, Ji-Gang; Li, Jian; Pei, Yu-Fang; Deng, Hong-Wen
Imputation methods are popular for the handling of missing data in psychology. The methods generally consist of predicting missing data based on observed data, yielding a complete data set that is amenable to standard statistical analyses. In the context of Bayesian factor analysis, this article compares imputation under an unrestricted…
Merkle, Edgar C.
This paper investigates the development of shareholder clienteles in response to the introduction of the Dividend Imputation (Integrated Tax System) into the Australian capital market. It is found that companies paying franked dividends have significantly increased dividend payments relative to companies paying little or no imputation tax credit. It is also shown that the use of dividend reinvestment plans has
David E Bellamy
Multiple imputations for latent variables are constructed so that analyses treating them as true variables have the correct expectations for population characteristics. Analyzing multiple imputations in accordance with their construction yields correct estimates of population characteristics, whereas analyzing them as multiple indicators generally…
Mislevy, Robert J.
Smoking is a leading global cause of disease and mortality. We performed a genomewide meta-analytic association study of smoking-related behavioral traits in a total sample of 41,150 individuals drawn from 20 disease, population, and control cohorts. Our analysis confirmed an effect on smoking quantity (SQ) at a locus on 15q25 (P=9.45e-19) that includes three genes encoding neuronal nicotinic acetylcholine receptor subunits (CHRNA5, CHRNA3, CHRNB4). We used data from the 1000 Genomes project to investigate the region using imputation, which allowed analysis of virtually all common variants in the region and offered a five-fold increase in coverage over the HapMap. This increased the spectrum of potentially causal single nucleotide polymorphisms (SNPs), which included a novel SNP that showed the highest significance, rs55853698, located within the promoter region of CHRNA5. Conditional analysis also identified a secondary locus (rs6495308) in CHRNA3.
Liu, Jason Z.; Tozzi, Federica; Waterworth, Dawn M.; Pillai, Sreekumar G.; Muglia, Pierandrea; Middleton, Lefkos; Berrettini, Wade; Knouff, Christopher W.; Yuan, Xin; Waeber, Gerard; Vollenweider, Peter; Preisig, Martin; Wareham, Nicholas J; Zhao, Jing Hua; Loos, Ruth J.F.; Barroso, Ines; Khaw, Kay-Tee; Grundy, Scott; Barter, Philip; Mahley, Robert; Kesaniemi, Antero; McPherson, Ruth; Vincent, John B.; Strauss, John; Kennedy, James L.; Farmer, Anne; McGuffin, Peter; Day, Richard; Matthews, Keith; Bakke, Per; Gulsvik, Amund; Lucae, Susanne; Ising, Marcus; Brueckl, Tanja; Horstmann, Sonja; Wichmann, H.-Erich; Rawal, Rajesh; Dahmen, Norbert; Lamina, Claudia; Polasek, Ozren; Zgaga, Lina; Huffman, Jennifer; Campbell, Susan; Kooner, Jaspal; Chambers, John C; Burnett, Mary Susan; Devaney, Joseph M.; Pichard, Augusto D.; Kent, Kenneth M.; Satler, Lowell; Lindsay, Joseph M.; Waksman, Ron; Epstein, Stephen; Wilson, James F.; Wild, Sarah H.; Campbell, Harry; Vitart, Veronique; Reilly, Muredach P.; Li, Mingyao; Qu, Liming; Wilensky, Robert; Matthai, William; Hakonarson, Hakon H.; Rader, Daniel J.; Franke, Andre; Wittig, Michael; Schafer, Arne; Uda, Manuela; Terracciano, Antonio; Xiao, Xiangjun; Busonero, Fabio; Scheet, Paul; Schlessinger, David; St Clair, David; Rujescu, Dan; Abecasis, Goncalo R.; Grabe, Hans Jorgen; Teumer, Alexander; Volzke, Henry; Petersmann, Astrid; John, Ulrich; Rudan, Igor; Hayward, Caroline; Wright, Alan F.; Kolcic, Ivana; Wright, Benjamin J; Thompson, John R; Balmforth, Anthony J.; Hall, Alistair S.; Samani, Nilesh J.; Anderson, Carl A.; Ahmad, Tariq; Mathew, Christopher G.; Parkes, Miles; Satsangi, Jack; Caulfield, Mark; Munroe, Patricia B.; Farrall, Martin; Dominiczak, Anna; Worthington, Jane; Thomson, Wendy; Eyre, Steve; Barton, Anne; Mooser, Vincent; Francks, Clyde; Marchini, Jonathan
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry
Summary Multiple imputation is a practically useful approach to handling incompletely observed data in statistical analysis. Parameter estimation and inference based on imputed full data have been made easy by Rubin's rule for result combination. However, creating proper imputation that accommodates flexible models for statistical analysis in practice can be very challenging. We propose an imputation framework that uses conditional semiparametric odds ratio models to impute the missing values. The proposed imputation framework is more flexible and robust than the imputation approach based on the normal model. It is a compatible framework in comparison to the approach based on fully conditionally specified models. The proposed algorithms for multiple imputation through the Monte Carlo Markov Chain sampling approach can be straightforwardly carried out. Simulation studies demonstrate that the proposed approach performs better than existing, commonly used imputation approaches. The proposed approach is applied to imputing missing values in bone fracture data.
Chen, Hua Yun; Xie, Hui; Qian, Yi
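The abstract above relies on Rubin's rule for combining results across imputed data sets. As a minimal sketch of that standard pooling step only (not of the semiparametric odds ratio imputation model the authors propose), the following assumes each imputed data set has already produced a point estimate and its variance. The function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)          # average of the point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates and variances from three imputed data sets.
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using a finite number of imputations.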
Current genome-wide association studies (GWAS) use commercial genotyping microarrays that can assay over a million single nucleotide polymorphisms (SNPs). The number of SNPs is further boosted by advanced statistical genotype-imputation algorithms and large SNP databases for reference human populations. The testing of a huge number of SNPs needs to be taken into account in the interpretation of statistical significance in such genome-wide studies, but this is complicated by the non-independence of SNPs because of linkage disequilibrium (LD). Several previous groups have proposed the use of the effective number of independent markers (M(e)) for the adjustment of multiple testing, but current methods of calculation for M(e) are limited in accuracy or computational speed. Here, we report a more robust and fast method to calculate M(e). Applying this efficient method [implemented in a free software tool named Genetic type 1 error calculator (GEC)], we systematically examined the M(e), and the corresponding p-value thresholds required to control the genome-wide type 1 error rate at 0.05, for 13 Illumina or Affymetrix genotyping arrays, as well as for HapMap Project and 1000 Genomes Project datasets which are widely used in genotype imputation as reference panels. Our results suggested the use of a p-value threshold of ~10^-7 as the criterion for genome-wide significance for early commercial genotyping arrays, but slightly more stringent p-value thresholds ~5 × 10^-8 for current or merged commercial genotyping arrays, ~10^-8 for all common SNPs in the 1000 Genomes Project dataset and ~5 × 10^-8 for the common SNPs only within genes. PMID:22143225
Li, Miao-Xin; Yeung, Juilian M Y; Cherny, Stacey S; Sham, Pak C
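The relationship between an effective number of independent markers and a genome-wide p-value threshold can be illustrated with a Bonferroni-style division. This is a rough sketch under the assumption that the threshold is simply the target type 1 error rate divided by M(e). It does not reproduce how GEC estimates M(e), and the value of one million effective tests is a hypothetical input chosen for illustration.

```python
def bonferroni_threshold(alpha, effective_tests):
    """Per-test p-value threshold that keeps the family-wise error rate near alpha."""
    if effective_tests <= 0:
        raise ValueError("effective_tests must be positive")
    return alpha / effective_tests


# Hypothetical effective number of independent markers (not a value from the paper).
threshold = bonferroni_threshold(0.05, 1_000_000)
```

With these assumed inputs the threshold is on the order of 5 × 10^-8, the same order of magnitude as the thresholds quoted in the abstract.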
Osteoarthritis (OA) risk is widely recognized to be heritable but few loci have been identified. Observational studies have identified higher systemic bone mineral density (BMD) to be associated with an increased risk of radiographic knee osteoarthritis. With this in mind, we sought to evaluate whether well-established genetic loci for variance in BMD are associated with risk for radiographic OA in the Osteoarthritis Initiative (OAI) and the Johnston County Osteoarthritis (JoCo) Project. Cases had at least one knee with definite radiographic OA, defined as the presence of definite osteophytes with or without joint space narrowing (Kellgren-Lawrence [KL] grade ≥ 2) and controls were absent for definite radiographic OA in both knees (KL grade ≤ 1 bilaterally). There were 2014 and 658 Caucasian cases, respectively, in the OAI and JoCo Studies, and 953 and 823 controls. Single nucleotide polymorphisms (SNPs) were identified for association analysis from the literature. Genotyping was carried out on Illumina 2.5M and 1M arrays in Genetic Components of Knee OA (GeCKO) and JoCo, respectively and imputation was done. Association analyses were carried out separately in each cohort with adjustments for age, body mass index (BMI), and sex, and then parameter estimates were combined across the two cohorts by meta-analysis. We identified four SNPs significantly associated with prevalent radiographic knee OA. The strongest signal (p = 0.0009; OR = 1.22; 95% CI, 1.08-1.37) maps to 12q3, which contains a gene coding for SP7. Additional loci map to 7p14.1 (TXNDC3), 11q13.2 (LRP5), and 11p14.1 (LIN7C). For all four loci the allele associated with higher BMD was associated with higher odds of OA. A BMD risk allele score was not significantly associated with OA risk. This meta-analysis demonstrates that several genomewide association studies (GWAS)-identified BMD SNPs are nominally associated with prevalent radiographic knee OA and further supports the hypothesis that BMD, or its determinants, may be a risk factor contributing to OA development. © 2014 American Society for Bone and Mineral Research. PMID:24339167
Yerges-Armstrong, Laura M; Yau, Michelle S; Liu, Youfang; Krishnan, Subha; Renner, Jordan B; Eaton, Charles B; Kwoh, C Kent; Nevitt, Michael C; Duggan, David J; Mitchell, Braxton D; Jordan, Joanne M; Hochberg, Marc C; Jackson, Rebecca D
By applying an imputation strategy based on the 1000 Genomes project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses that had been conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus where twelve SNPs exhibited association p-values lower than 2.23 × 10^-11 and the use of independent case-control samples demonstrated that the culprit variant was a rare variant located ~1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and even not well imputed in the original GWAS samples. This variant was in fact the rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings. PMID:22675575
Germain, Marine; Saut, Noémie; Oudot-Mellakh, Tiphaine; Letenneur, Luc; Dupuy, Anne-Marie; Bertrand, Marion; Alessi, Marie-Christine; Lambert, Jean-Charles; Zelenika, Diana; Emmerich, Joseph; Tiret, Laurence; Cambien, Francois; Lathrop, Mark; Amouyel, Philippe; Morange, Pierre-Emmanuel; Trégouët, David-Alexandre
Genome-wide association studies (GWASs) have identified thousands of single nucleotide polymorphisms (SNPs) associated with human traits and diseases. But because the vast majority of these SNPs are located in the noncoding regions of the genome their risk promoting mechanisms are elusive. Employing a new methodology combining cistromics, epigenomics and genotype imputation we annotate the noncoding regions of the genome in breast cancer cells and systematically identify the functional nature of SNPs associated with breast cancer risk. Our results demonstrate that breast cancer risk-associated SNPs are enriched in the cistromes of FOXA1 and ESR1 and the epigenome of H3K4me1 in a cancer and cell-type-specific manner. Furthermore, the majority of these risk-associated SNPs modulate the affinity of chromatin for FOXA1 at distal regulatory elements, which results in allele-specific gene expression, exemplified by the effect of the rs4784227 SNP on the TOX3 gene found within the 16q12.1 risk locus.
Cowper-Sal·lari, Richard; Zhang, Xiaoyang; Wright, Jason B.; Bailey, Swneke D.; Cole, Michael D.; Eeckhoute, Jerome; Moore, Jason H.; Lupien, Mathieu
The imputation of unknown or missing data is a crucial task in the analysis of biomedical datasets. There are several situations where it is necessary to classify or identify instances given incomplete vectors, and the existence of missing values can greatly degrade the performance of the algorithms used for classification or recognition. The task of learning accurately from incomplete data raises a number of issues, some of which have not been completely solved in machine learning applications. In this sense, effective missing value estimation methods are required. Different methods for missing data imputation exist, but most of the time the selection of the appropriate technique involves testing several methods, comparing them and choosing the right one. Furthermore, applying these methods is, in most cases, not straightforward, as they involve several technical details, and in particular cases, such as when dealing with microarray datasets, the application of the methods requires huge computational resources. As far as we know, there is no public software application that can provide the computing capabilities required for carrying out the task of data imputation. This paper presents a new public tool for missing data imputation that is attached to a computer cluster in order to execute computationally demanding tasks. The software WIMP (Web IMPutation) is a publicly available web site where registered users can create, execute, analyze and store their simulations related to missing data imputation. PMID:23017251
Urda, D; Subirats, J L; García-Laencina, P J; Franco, L; Sancho-Gómez, J L; Jerez, J M
Multiple imputation is commonly used to impute missing covariates in the Cox semiparametric regression setting. It fills in each missing value with plausible values, via a Gibbs sampling procedure, specifying an imputation model for each missing variable. This imputation method is implemented in several software packages that offer imputation models steered by the shape of the variable to be imputed, but all these imputation models assume that covariate effects are linear. However, this assumption is often not verified in practice, as the covariates can have a nonlinear effect. Such a linear assumption can lead to a misleading conclusion because the imputation model should be constructed to reflect the true distributional relationship between the missing values and the observed values. To estimate nonlinear effects of continuous time-invariant covariates in the imputation model, we propose a method based on B-spline functions. To assess the performance of this method, we conducted a simulation study, where we compared the multiple imputation method using a Bayesian splines imputation model with multiple imputation using a Bayesian linear imputation model in a survival analysis setting. We evaluated the proposed method on the motivating data set collected in HIV-infected patients enrolled in an observational cohort study in Senegal, which contains several incomplete variables. We found that our method performs well in estimating hazard ratios compared with the linear imputation methods, when data are missing completely at random or missing at random. PMID:23712767
Mbougua, Jules Brice Tchatchueng; Laurent, Christian; Ndoye, Ibra; Delaporte, Eric; Gwet, Henri; Molinari, Nicolas
In cluster randomized trials, intact social units such as schools, worksites or medical practices - rather than individuals themselves - are randomly allocated to intervention and control conditions, while the outcomes of interest are then observed on individuals within each cluster. Such trials are becoming increasingly common in the fields of health promotion and health services research. Attrition is a common occurrence in randomized trials, and a standard approach for dealing with the resulting missing values is imputation. We consider imputation strategies for missing continuous outcomes, focusing on trials with a completely randomized design in which fixed cohorts from each cluster are enrolled prior to random assignment. We compare five different imputation strategies with respect to Type I and Type II error rates of the adjusted two-sample t-test for the intervention effect. Cluster mean imputation is compared with multiple imputation, using either within-cluster data or data pooled across clusters in each intervention group. In the case of pooling across clusters, we distinguish between standard multiple imputation procedures which do not account for intracluster correlation and a specialized procedure which does account for intracluster correlation but is not yet available in standard statistical software packages. A simulation study is used to evaluate the influence of cluster size, number of clusters, degree of intracluster correlation, and variability among cluster follow-up rates. We show that cluster mean imputation yields valid inferences and given its simplicity, may be an attractive option in some large community intervention trials which are subject to individual-level attrition only; however, it may yield less powerful inferences than alternative procedures which pool across clusters especially when the cluster sizes are small and cluster follow-up rates are highly variable.
When pooling across clusters, the imputation procedure should generally take intracluster correlation into account to obtain valid inferences; however, as long as the intracluster correlation coefficient is small, we show that standard multiple imputation procedures may yield acceptable type I error rates; moreover, these procedures may yield more powerful inferences than a specialized procedure, especially when the number of available clusters is small. Within-cluster multiple imputation is shown to be the least powerful among the procedures considered. PMID:18537126
Taljaard, Monica; Donner, Allan; Klar, Neil
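Of the strategies compared in the abstract above, cluster mean imputation is the simplest to illustrate. The following is a minimal sketch, assuming missing continuous outcomes are marked as `None` and each missing value is replaced by the mean of the observed outcomes in the same cluster. The function name, the data layout, and the example values are hypothetical, and the adjusted t-test used in the paper is not shown.

```python
from statistics import mean


def impute_cluster_means(outcomes_by_cluster):
    """Replace each missing outcome (None) with its own cluster's observed mean."""
    imputed = {}
    for cluster, values in outcomes_by_cluster.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # No observed outcome in this cluster: nothing to average, leave as-is.
            imputed[cluster] = list(values)
            continue
        fill = mean(observed)
        imputed[cluster] = [fill if v is None else v for v in values]
    return imputed


# Hypothetical outcomes for three clusters with individual-level attrition.
example = {
    "school_a": [2.0, None, 4.0],
    "school_b": [1.0, 1.0, None],
    "school_c": [None],
}
result = impute_cluster_means(example)
```

Because every filled value equals the cluster mean, this approach leaves cluster means unchanged while shrinking within-cluster variability, which is one reason the analysis in the abstract uses an adjusted test.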
Imputation in admixed populations is an important problem but challenging due to the complex linkage disequilibrium (LD) pattern. The emergence of large reference panels such as that from the 1,000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approach. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women's Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and in genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experiment results with large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants where we observe up to 5.1% information gain with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison. PMID:23074066
Liu, Eric Yi; Li, Mingyao; Wang, Wei; Li, Yun
Although large-scale genetic association studies involving hundreds to thousands of SNPs have become feasible, the associated cost is substantial. Even with the increased efficiency introduced by the use of tagSNPs, researchers are often seeking ways to maximize resource utilization given a set of SNP-based gene-mapping goals. We have developed a web server named QuickSNP in order to provide cost-effective selection of SNPs, and to fill in some of the gaps in existing SNP selection tools. One useful feature of QuickSNP is the option to select only gene-centric SNPs from a chromosomal region in an automated fashion. Other useful features include automated selection of coding non-synonymous SNPs, SNP filtering based on inter-SNP distances and information regarding the availability of genotyping assays for SNPs and whether they are present on whole genome chips. The program produces user-friendly summary tables and results, and a link to a UCSC Genome Browser track illustrating the position of the selected tagSNPs in relation to genes and other genomic features. We hope the unique combination of features of this server will be useful for researchers aiming to select markers for their genotyping studies. The server is freely available and can be accessed at the URL http://bioinformoodics.jhmi.edu/quickSNP.pl. PMID:17517769
Grover, Deepak; Woodfield, Alonzo S; Verma, Ranjana; Zandi, Peter P; Levinson, Douglas F; Potash, James B
The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the “most diverse reference panel”, defined as the subset with the maximal “phylogenetic diversity”, thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.
Zhang, Peng; Zhan, Xiaowei; Rosenberg, Noah A.; Zollner, Sebastian
... Imputation of conflicts of interest; General rule. 11.110 Section... Imputation of conflicts of interest; General rule. (a) While practitioners...prohibition is based on a personal interest of the disqualified...
Genome-wide association studies (GWAS) have successfully identified many genetic variants associated with complex diseases and traits. However, the functional consequences of the genetic variants studied in GWAS have not yet been fully investigated, which hinders the application of GWAS. We therefore performed a systematic functional analysis of HapMap SNPs, which have been most commonly used as the reference panel for GWAS. Our study highlights several characteristics of HapMap SNPs and identifies subsets of genetic variants with interesting functional implications. The results show that HapMap SNPs have good coverage within RefSeq genes, especially within known disease-related genes. On the other hand, only a small percentage of SNPs are non-synonymous, while many SNPs are located in gene deserts. Moreover, many functionally important variants have not yet been interrogated. A redesigned SNP reference panel with additional functionally important variants would be useful for identifying disease-causal variants in future genome-wide studies. PMID:23041558
Liu, Ching-Ti; Lin, Houwei; Lin, Honghuang
Longitudinal studies of microbial water quality are subject to missing observations. This study evaluates multiple imputation (MI) against data deletion and mean or median imputation for replacing missing microbial water quality data. The specific context is data collected in the Chicago Area Waterway System (2007-2009), where 45% of Escherichia coli and 53% of enterococci densities were missing owing to sample analysis deficiencies. Imputation methods were compared by performing a simulation study using complete observations with introduced missing values, and were subsequently compared using the original data with missing observations. Coefficients for E. coli densities in linear regression models predicting somatic coliphage density show that MI introduces the least bias of the methods compared while controlling Type I error. Further exploration of different MI implementations is recommended to address the influence of the missing percentage on MI performance and to explore sensitivity to the degree of violation of the missing completely at random assumption. PMID:24705739
Nieh, Chiping; Dorevitch, Samuel; Liu, Li C; Jones, Rachael M
This paper used an event study approach to examine the impact of dividend reinvestment plans (DRPs) on shareholder returns in the pre- and post-imputation environment. The daily share return behaviour indicated that the announcement to introduce a DRP was received indifferently by the market prior to imputation, but was valued positively afterwards. The results support the suggestion that under imputation the
Keith Chan; D McColough; Michael Skully
The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmarks, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…
van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas
A well-known problem in the analysis of test and questionnaire data is that some item scores may be missing. Advanced methods for the imputation of missing data are available, such as multiple imputation under the multivariate normal model and imputation under the saturated logistic model (Schafer, 1997). Accompanying software was made available…
van Ginkel, Joost R.; van der Ark, L. Andries
Multiple imputation is applied to a demographic data set with coarse age measurements for Tanzanian children. The heaped ages are multiply imputed with plausible true ages using (a) a simple naive model and (b) a new, relatively complex model that relates true age to the observed values of heaped age, sex, and anthropometric variables. The imputed true ages are used
Daniel F. Heitjan; Donald B. Rubin
Several multiple imputation techniques are described for simple random samples with ignorable nonresponse on a scalar outcome variable. The methods are compared using both analytic and Monte Carlo results concerning coverages of the resulting intervals for the population mean. Using m = 2 imputations per missing value gives accurate coverages in common cases and is clearly superior to single imputation
Donald B. Rubin; Nathaniel Schenker
Objective Thousands of complex-disease single-nucleotide polymorphisms (SNPs) have been discovered in genome-wide association studies (GWAS). However, these intragenic SNPs have not been collectively mined to unveil the genetic architecture between complex clinical traits. The authors hypothesize that biological annotations of host genes of trait-associated SNPs may reveal the biomolecular modularity across complex-disease traits and offer insights for drug repositioning. Methods Trait-to-polymorphism (SNPs) associations confirmed in GWAS were used. A novel method to quantify trait–trait similarity anchored in Gene Ontology annotations of human proteins and information theory was developed. The results were then validated with the shortest paths of physical protein interactions between biologically similar traits. Results A network was constructed consisting of 280 significant intertrait similarities among 177 disease traits, which covered 1438 well-validated disease-associated SNPs. Thirty-nine percent of intertrait connections were confirmed by curators, and the following additional studies demonstrated the validity of a proportion of the remainder. On a phenotypic trait level, higher Gene Ontology similarity between proteins correlated with smaller ‘shortest distance’ in protein interaction networks of complexly inherited diseases (Spearman p<2.2×10−16). Further, ‘cancer traits’ were similar to one another, as were ‘metabolic syndrome traits’ (Fisher's exact test p=0.001 and 3.5×10−7, respectively). Conclusion An imputed disease network by information-anchored functional similarity from GWAS trait-associated SNPs is reported. It is also demonstrated that small shortest paths of protein interactions correlate with complex-disease function. Taken together, these findings provide the framework for investigating drug targets with unbiased functional biomolecular networks rather than worn-out single-gene and subjective canonical pathway approaches.
Li, Haiquan; Lee, Younghee; Chen, James L; Rebman, Ellen; Li, Jianrong
SNPs are the most abundant form of genetic variation among species, and association studies between complex diseases and SNPs or haplotypes have received great attention. However, these studies are restricted by the cost of genotyping all SNPs; thus, it is necessary to find smaller subsets, or tag SNPs, representing the rest of the SNPs. The existing tag SNP selection algorithms are notoriously time-consuming. An efficient algorithm for tag SNP selection is presented and applied to the HapMap YRI data. The experimental results show that the proposed algorithm can achieve better performance than the existing tag SNP selection algorithms; in most cases, it is at least ten times faster than the existing methods. In many cases, when the redundant ratio of the block is high, the proposed algorithm can even be thousands of times faster than the previously known methods. Tools and web services for haplotype block analysis integrated with the Hadoop MapReduce framework are also developed using the proposed algorithm as the computation kernel. PMID:24212035
Chen, Wen-Pei; Hung, Che-Lun; Tsai, Suh-Jen Jane; Lin, Yaw-Ling
Background Imputation of genotypes for ungenotyped individuals could enable the use of valuable phenotypes created before the genomic era in analyses that require genotypes. The objective of this study was to investigate the accuracy of imputation of non-genotyped individuals using genotype information from relatives. Methods Genotypes were simulated for all individuals in the pedigree of a real (historical) dataset of phenotyped dairy cows and with part of the pedigree genotyped. The software AlphaImpute was used for imputation in its standard settings but also without phasing, i.e. using basic inheritance rules and segregation analysis only. Different scenarios were evaluated i.e.: (1) the real data scenario, (2) addition of genotypes of sires and maternal grandsires of the ungenotyped individuals, and (3) addition of one, two, or four genotyped offspring of the ungenotyped individuals to the reference population. Results The imputation accuracy using AlphaImpute in its standard settings was lower than without phasing. Including genotypes of sires and maternal grandsires in the reference population improved imputation accuracy, i.e. the correlation of the true genotypes with the imputed genotype dosages, corrected for mean gene content, across all animals increased from 0.47 (real situation) to 0.60. Including one, two and four genotyped offspring increased the accuracy of imputation across all animals from 0.57 (no offspring) to 0.73, 0.82, and 0.92, respectively. Conclusions At present, the use of basic inheritance rules and segregation analysis appears to be the best imputation method for ungenotyped individuals. Comparison of our empirical animal-specific imputation accuracies to predictions based on selection index theory suggested that not correcting for mean gene content considerably overestimates the true accuracy. 
Imputation of ungenotyped individuals can help to include valuable phenotypes for genome-wide association studies or for genomic prediction, especially when the ungenotyped individuals have genotyped offspring.
US studies have consistently reported that the relationship between beta and return is less steeply sloped than that implied by the simple CAPM. The introduction of a dividend imputation tax system in Australia and other tax law differences suggest the relationship between beta and return may be more steeply sloped in this country. Empirical evidence subsequent to the introduction of
Robert Faff; David Hillier; Justin Wood
Conducting sample surveys, imputing incomplete observations, and analyzing the resulting data are three indispensable phases of modern practice with public-use data files and with many other statistical applications. Each phase inherits different input, including the information preceding it and the intellectual assessments available, and aims to provide output that is one step closer to arriving at statistical inferences with scientific
Two algorithms for producing multiple imputations for missing data are evaluated with simulated data. Software using a propensity score classifier with the approximate Bayesian bootstrap produces badly biased estimates of regression coefficients when data on predictor variables are missing at random or missing completely at random. On the other hand, a regression-based method employing the data augmentation algorithm produces estimates with little or no
Paul D. Allison
Recent survey validation studies suggest that measurement error in earnings data is pervasive and violates classical measurement error assumptions and, therefore, may bias estimation of cross-section and longitudinal earnings models. The authors model the structure of earnings measurement error using data from the Panel Study of Income Dynamics Validation Study (PSIDVS). They then use Donald B. Rubin's (1987) multiple imputation
David Brownstone; Robert G. Valletta
The performance of multiple imputation in questionnaire data has been studied in various simulation studies. However, in practice, questionnaire data are usually more complex than simulated data. For example, items may be counterindicative or may have unacceptably low factor loadings on every subscale, or completely missing subscales may…
Van Ginkel, Joost R.
A technique is presented to transform incomplete categorical data into complete data by imputing appropriate scores into missing cells. A solution of the optimization problem is suggested, and relevant psychometric theory is discussed. The average correlation should be at least 0.50 before the method becomes practical. (SLD)
van Buuren, Stef; van Rijckevorsel, Jan L. A.
Background Multiple imputation is a commonly used method for handling incomplete covariates as it can provide valid inference when data are missing at random. This depends on being able to correctly specify the parametric model used to impute missing values, which may be difficult in many realistic settings. Imputation by predictive mean matching (PMM) borrows an observed value from a donor with a similar predictive mean; imputation by local residual draws (LRD) instead borrows the donor’s residual. Both methods relax some assumptions of parametric imputation, promising greater robustness when the imputation model is misspecified. Methods We review development of PMM and LRD and outline the various forms available, and aim to clarify some choices about how and when they should be used. We compare performance to fully parametric imputation in simulation studies, first when the imputation model is correctly specified and then when it is misspecified. Results In using PMM or LRD we strongly caution against using a single donor, the default value in some implementations, and instead advocate sampling from a pool of around 10 donors. We also clarify which matching metric is best. Among the current MI software there are several poor implementations. Conclusions PMM and LRD may have a role for imputing covariates (i) which are not strongly associated with outcome, and (ii) when the imputation model is thought to be slightly but not grossly misspecified. Researchers should spend efforts on specifying the imputation model correctly, rather than expecting predictive mean matching or local residual draws to do the work.
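The donor-pool idea in the abstract above can be illustrated with a minimal sketch. It assumes a single predictor, a plain least-squares fit, and no draw of the regression parameters, so it is a simplification and not any of the implementations the authors reviewed. The names `fit_line` and `pmm_impute` and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def pmm_impute(xs, ys, k=10, rng=None):
    """Fill each missing y (None) with the observed y of a random donor
    drawn from the k cases whose predicted means are closest."""
    rng = rng or random.Random(0)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in obs], [y for _, y in obs])
    donors = [(a + b * x, y) for x, y in obs]  # (predicted mean, observed y)
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = a + b * x
        pool = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(pool)[1])  # borrow an observed value
    return completed

# Hypothetical toy data with two missing outcomes
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 for x in xs]
ys[3] = None
ys[12] = None
completed = pmm_impute(xs, ys, k=10)
```

Setting `k=1` reproduces the single-donor behaviour the abstract cautions against; a pool of around 10 keeps some between-imputation variability.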
Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm.
Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies. PMID:21903159
Hoffmann, Thomas J; Zhan, Yiping; Kvale, Mark N; Hesselson, Stephanie E; Gollub, Jeremy; Iribarren, Carlos; Lu, Yontao; Mei, Gangwu; Purdy, Matthew M; Quesenberry, Charles; Rowell, Sarah; Shapero, Michael H; Smethurst, David; Somkin, Carol P; Van den Eeden, Stephen K; Walter, Larry; Webster, Teresa; Whitmer, Rachel A; Finn, Andrea; Schaefer, Catherine; Kwok, Pui-Yan; Risch, Neil
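As a rough illustration of the greedy pairwise SNP selection step mentioned in the abstract above, the sketch below repeatedly picks the SNP that tags the most still-untagged SNPs at a pairwise r2 threshold. The threshold of 0.8, the function name `greedy_tags`, and the SNP identifiers are assumptions for illustration. The imputation-based removal step of the hybrid method is not shown.

```python
def greedy_tags(r2, threshold=0.8):
    """r2 maps each SNP to {other SNP: pairwise r^2}.
    Returns tag SNPs in the order they were chosen."""
    # Each SNP covers itself plus every SNP it is correlated with at >= threshold.
    covers = {
        snp: {snp} | {other for other, value in nbrs.items() if value >= threshold}
        for snp, nbrs in r2.items()
    }
    candidates = sorted(r2)  # sorted for deterministic tie-breaking
    untagged = set(r2)
    tags = []
    while untagged:
        best = max(candidates, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Hypothetical r^2 values among four SNPs
r2 = {
    "rs1": {"rs2": 0.90, "rs3": 0.85},
    "rs2": {"rs1": 0.90, "rs3": 0.50},
    "rs3": {"rs1": 0.85, "rs2": 0.50},
    "rs4": {},
}
tags = greedy_tags(r2)
```

Here `rs1` tags `rs2` and `rs3`, so only `rs1` and the uncorrelated `rs4` need to be placed on the array.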
The default method of dealing with missing data in statistical analyses is to only use the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but can be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. There are statistical options available to adjust for data that are MAR, including multiple imputation (MI) which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine statistical distributions for each piece of missing data. Then multiple datasets are created by randomly drawing on the distributions for each piece of missing data. Since MI is efficient, only a limited number, usually less than 20, of imputed datasets are required to get stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and then results are combined to get overall estimates of effect. A simulation study will be demonstrated to show the results of using the default complete case analysis, and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches. Each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.
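The final combining step described in the abstract above is not spelled out there. A minimal sketch, assuming the standard Rubin's rules for pooling, is given below. The function name `pool_rubin` and the five estimates are hypothetical and are not results from the study.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical regression slopes and squared SEs from m = 5 imputed datasets
estimates = [0.70, 0.66, 0.73, 0.69, 0.72]
variances = [0.010, 0.012, 0.011, 0.010, 0.012]
pooled, pooled_se = pool_rubin(estimates, variances)
```

The between-imputation term is what separates multiple imputation from single imputation: it carries the uncertainty about the missing values into the pooled standard error.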
We present a method of analyzing a series of independent cross-sectional surveys in which some questions are not answered in some surveys and some respondents do not answer some of the questions posed. The method is also applicable to a single survey in which different questions are asked, or different sampling methods used, in different strata or clusters. Our method involves multiply-imputing the missing
Andrew Gelman; Gary King; Chuanhai Liu
Background There is considerable interest in developing high-throughput genotyping with single nucleotide polymorphisms (SNPs) for the identification of genes affecting important ecological or economical traits. SNPs are evenly distributed throughout the genome and are likely to be functionally relevant. In rainbow trout, in silico screening of EST databases represents an attractive approach for de novo SNP identification. Nevertheless, EST sequencing errors and assembly of EST paralogous sequences can lead to the identification of false positive SNPs, which renders the reliability of EST-derived SNPs relatively low. Further validation of EST-derived SNPs is therefore required. The objective of this work was to assess the quality of and to validate a large number of rainbow trout EST-derived SNPs. Results A panel of 1,152 EST-derived SNPs was selected from the INRA Sigenae SNP database and was genotyped in standard and double haploid individuals from several populations using the Illumina GoldenGate BeadXpress assay. High-quality genotyping data were obtained for 958 SNPs, representing a genotyping success rate of 83.2%, out of which 350 SNPs (36.5%) were polymorphic in at least one population and were designated as true SNPs. They also proved to be a potential tool to investigate the genetic diversity of the species, as the set of SNPs successfully sorted individuals into three main groups using STRUCTURE software. Functional annotations revealed 28 non-synonymous SNPs, out of which four substitutions were predicted to affect protein functions. A subset of 223 true SNPs were polymorphic in the two INRA mapping reference families and were integrated into the INRA microsatellite-based linkage map. Conclusions Our results represent the first study of EST-derived SNP validation in rainbow trout, a species whose genome sequence is not yet available. We designed several specific filters in order to improve the genotyping yield.
Nevertheless, our selection criteria should be further improved in order to reduce the observed high rate of false positive SNPs which results from the occurrence of whole genome duplications.
SUMMARY Biomedical research is plagued with problems of missing data, especially in clinical trials of medical and behavioral therapies adopting longitudinal design. After a literature review on modeling incomplete longitudinal data based on full-likelihood functions, this paper proposes a set of imputation-based strategies for implementing selection, pattern-mixture, and shared-parameter models for handling intermittent missing values and dropouts that are potentially nonignorable according to various criteria. Within the framework of multiple partial imputation, intermittent missing values are first imputed several times; then, each partially imputed data set is analyzed to deal with dropouts with or without further imputation. Depending on the choice of imputation model or measurement model, there exist various strategies that can be jointly applied to the same set of data to study the effect of treatment or intervention from multi-faceted perspectives. For illustration, the strategies were applied to a data set with continuous repeated measures from a smoking cessation clinical trial.
Yang, Xiaowei; Li, Jinhui; Shoptaw, Steven
Background Single Nucleotide Polymorphisms (SNPs) are one of the largest sources of new data in biology. In most papers, SNPs between individuals are visualized with Principal Component Analysis (PCA), an older method for this purpose. Principal Findings We compare PCA, an aging method for this purpose, with a newer method, t-Distributed Stochastic Neighbor Embedding (t-SNE) for the visualization of large SNP datasets. We also propose a set of key figures for evaluating these visualizations; in all of these t-SNE performs better. Significance To transform data PCA remains a reasonably good method, but for visualization it should be replaced by a method from the subfield of dimension reduction. To evaluate the performance of visualization, we propose key figures of cross-validation with machine learning methods, as well as indices of cluster validity.
The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and include any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully
Stef van Buuren
Methods for data imputation applicable to air quality data sets were evaluated in the context of univariate (linear, spline and nearest neighbour interpolation), multivariate (regression-based imputation (REGEM), nearest neighbour (NN), self-organizing map (SOM), multi-layer perceptron (MLP)), and hybrid methods of the previous by using simulated missing data patterns. Additionally, a multiple imputation procedure was considered in order to make comparison
Heikki Junninen; Harri Niska; Kari Tuppurainen; Juhani Ruuskanen; Mikko Kolehmainen
... Table of Contents Joining and Switching Medicare SNPs 15 . . . . . . . . . . . . . . . . . . . . When Can I Join or Switch a Medicare ... and coordinate their different Medicare and Medicaid services. 15 Joining and Switching Medicare SNPs When Can I ...
Objective To examine sequential and simultaneous approaches to multiple imputation of missing data in a longitudinal dataset where losses due to death were common. Method Comparison of results from analyses and simulations of time to incident difficulty of activities of daily living (ADL) in the Cardiovascular Health Study when missing data were imputed simultaneously or sequentially. Results Results differed with imputation methods. The largest proportional differences in 12 risk factor parameter estimates were: heart failure by 106%, social support by 33%, and arthritis by 27%. Conclusions Decedents’ final characteristics were influential on future imputations of those with missing values.
Ning, Yuming; McAvay, Gail; Sarwat, I. Chaudhry; Arnold, Alice M.; Allore, Heather G.
A common method of handling the problem of missing variances in meta-analysis of continuous response is through imputation. However, the performance of imputation techniques may be influenced by the type of model utilised. In this article, we examine through a simulation study the effects of the techniques of imputation of the missing SDs and type of models used on the overall meta-analysis estimates. The results suggest that imputation should be adopted to estimate the overall effect size, irrespective of the model used. However, the accuracy of the estimates of the corresponding standard error (SE) is influenced by the imputation techniques. For estimates based on the fixed effects model, mean imputation provides better estimates than multiple imputations, while those based on the random effects model responds more robustly to the type of imputation techniques. The results showed that although imputation is good in reducing the bias in point estimates, it is more likely to produce coverage probability which is higher than the nominal value.
Idris, N. R. N.; Abdullah, M. H.; Tolos, S. M.
Latent class regression (LCR) is a popular method for analyzing multiple categorical outcomes. While non-response to the manifest items is a common complication, inferences of LCR can be evaluated using maximum likelihood, multiple imputation, and two-stage multiple imputation. Under similar missing data assumptions, the estimates and variances from all three procedures are quite close. However, multiple imputation and two-stage multiple imputation can provide additional information: estimates for the rates of missing information. The methodology is illustrated using an example from a study on racial and ethnic disparities in breast cancer severity.
Harel, Ofer; Chung, Hwan; Miglioretti, Diana
We propose a semiparametric approach incorporating principles of multiple imputation under the normality assumption, multivariate number generation, and computation of empirical cumulative distribution function (eCDF) values to impute continuous data with variables following any marginal distribution. This method involves mapping the data to normally distributed values, imputing these values, and back-transforming the data onto the scale of the original data. The transformations associated with eCDF computations constitute the nonparametric portion of our algorithm, while imputation under the normality assumption constitutes the parametric portion. Application of this method to simulated and real data leads to promising results. PMID:24605974
Helenowski, Irene B; Demirtas, Hakan
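The map, impute, and back-transform sequence in the abstract above can be sketched for a single variable. This is a deliberately simplified univariate version under stated assumptions: normal scores use rank / (n + 1), and the "imputation under normality" is just a draw from a normal distribution fitted to those scores, whereas the authors' method is multivariate. All function names and the toy data are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Map each value to a normal score through its eCDF rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def back_transform(z, observed_sorted):
    """Map a normal score back to the original scale via the empirical quantile."""
    n = len(observed_sorted)
    p = STD_NORMAL.cdf(z)
    idx = min(n - 1, max(0, round(p * (n + 1)) - 1))
    return observed_sorted[idx]

def impute_ecdf(values, rng=None):
    """Replace None entries by normal draws mapped back to the data scale."""
    rng = rng or random.Random(0)
    observed = [v for v in values if v is not None]
    scores = to_normal_scores(observed)
    mu, sd = mean(scores), stdev(scores)
    observed_sorted = sorted(observed)
    return [
        v if v is not None else back_transform(rng.gauss(mu, sd), observed_sorted)
        for v in values
    ]

# Hypothetical right-skewed data with two missing entries
data = [0.1, 0.4, None, 0.2, 3.5, 9.0, None, 0.7, 1.2]
completed = impute_ecdf(data)
```

Because the back-transform reads values off the empirical quantiles, imputed values stay within the observed range and follow the skewed marginal shape instead of a symmetric normal one.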
There is an increasing interest in single nucleotide polymorphism (SNP) typing in the forensic field, not only for the usefulness of SNPs for defining Y chromosome or mtDNA haplogroups or for analyzing the geographical origin of samples, but also for the potential applications of autosomal SNPs. The interest of forensic researchers in autosomal SNPs has been attracted due to the
Beatriz Sobrino; María Brión; Angel Carracedo
Background Multiple imputation (MI) is becoming increasingly popular as a strategy for handling missing data, but there is a scarcity of tools for checking the adequacy of imputation models. The Kolmogorov-Smirnov (KS) test has been identified as a potential diagnostic method for assessing whether the distribution of imputed data deviates substantially from that of the observed data. The aim of this study was to evaluate the performance of the KS test as an imputation diagnostic. Methods Using simulation, we examined whether the KS test could reliably identify departures from assumptions made in the imputation model. To do this we examined how the p-values from the KS test behaved when skewed and heavy-tailed data were imputed using a normal imputation model. We varied the amount of missing data, the missing data models and the amount of skewness, and evaluated the performance of KS test in diagnosing issues with the imputation models under these different scenarios. Results The KS test was able to flag differences between the observations and imputed values; however, these differences did not always correspond to problems with MI inference for the regression parameter of interest. When there was a strong missing at random dependency, the KS p-values were very small, regardless of whether or not the MI estimates were biased; so that the KS test was not able to discriminate between imputed variables that required further investigation, and those that did not. The p-values were also sensitive to sample size and the proportion of missing data, adding to the challenge of interpreting the results from the KS test. Conclusions Given our study results, it is difficult to establish guidelines or recommendations for using the KS test as a diagnostic tool for MI. The investigation of other imputation diagnostics and their incorporation into statistical software are important areas for future research.
Genotyping by sequencing (GBS) recently has emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns are not lacking about the uniquely large unbalance in GBS genotype data. Although some genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data, with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed for missing genotypes with three map-independent imputation methods. Estimating heterozygosity and inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias from assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of interested groups generally were reduced with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than those analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of the genetic diversity analysis with respect to large missing data and genotype imputation but also are instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data.
... May the African Development Foundation impute conduct of one person to another... Foreign Relations AFRICAN DEVELOPMENT FOUNDATION GOVERNMENTWIDE DEBARMENT AND SUSPENSION... 630 May the African Development Foundation impute conduct of one person to...
We performed a discovery genome-wide association study to identify genetic factors associated with variation in plasma estradiol (E2) concentrations using DNA from 772 postmenopausal women with estrogen receptor (ER)-positive breast cancer prior to the initiation of aromatase inhibitor therapy. Association analyses showed that the single nucleotide polymorphism (SNP) with the lowest P value, rs1864729 (P = 3.49E-08), mapped to chromosome 8 near TSPYL5. We also identified 17 imputed SNPs in or near TSPYL5 with P values < 5E-08, one of which, rs2583506, created a functional estrogen response element. We then used a panel of lymphoblastoid cell lines (LCLs) stably transfected with ERα with known genome-wide SNP genotypes to demonstrate that TSPYL5 expression increased after E2 exposure of cells heterozygous for variant TSPYL5 SNP genotypes, but not in those homozygous for wild-type alleles. TSPYL5 knockdown decreased, and overexpression increased, aromatase (CYP19A1) expression in MCF-7 cells, LCLs, and adipocytes through the skin/adipose (I.4) promoter. Chromatin immunoprecipitation assay showed that TSPYL5 bound to the CYP19A1 I.4 promoter. A putative TSPYL5 binding motif was identified in 43 genes, and TSPYL5 appeared to function as a transcription factor for most of those genes. In summary, genome-wide significant SNPs in TSPYL5 were associated with elevated plasma E2 in postmenopausal breast cancer patients. SNP rs2583506 created a functional estrogen response element, and LCLs with variant SNP genotypes displayed increased E2-dependent TSPYL5 expression. TSPYL5 induced CYP19A1 expression and that of many other genes. These studies have revealed a novel mechanism for regulating aromatase expression and plasma E2 concentrations in postmenopausal women with ER(+) breast cancer.
Liu, Mohan; Ingle, James N.; Fridley, Brooke L.; Buzdar, Aman U.; Robson, Mark E.; Kubo, Michiaki; Wang, Liewei; Batzler, Anthony; Jenkins, Gregory D.; Pietrzak, Tracy L.; Carlson, Erin E.; Goetz, Matthew P.; Northfelt, Donald W.; Perez, Edith A.; Williard, Clark V.; Schaid, Daniel J.; Nakamura, Yusuke
Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG, HLA Imputation using attribute BAGging, that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with the other two leading methods, HLA*IMP and BEAGLE. This method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need to have access to large training data sets. PMID:23712092
Zheng, X; Shen, J; Cox, C; Wakefield, J C; Ehm, M G; Nelson, M R; Weir, B S
Background Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate to impute missing data from CRTs since they assume independent data. In this paper, under the assumption of missing completely at random and covariate dependent missing, we compared six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and complete case analysis approach using a simulation study. Method We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are logistic regression method, propensity score method, and Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are propensity score method, random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT) which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. Results The estimated treatment effect and its 95% confidence interval (CI) from generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76, 1.70). When 30% of binary outcomes are missing completely at random, a simulation study shows that the estimated treatment effects and the corresponding 95% CIs from GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression which does not account for clustering is used.
Conclusion When the percentage of missing data is low or intra-cluster correlation coefficient is small, different approaches for handling missing binary outcome data generate quite similar results. When the percentage of missing data is large, standard MI strategies, which do not take into account the intra-cluster correlation, underestimate the variance of the treatment effect. Within-cluster and across-cluster MI strategies (except for random-effects logistic regression MI strategy), which take the intra-cluster correlation into account, seem to be more appropriate to handle the missing outcome from CRTs. Under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from GEE and RE logistic regression models are similar.
Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision, comparability of microarrays, and other various microarray analysis methods. However, to date no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures in Affymetrix microarrays. Results. We evaluated several cutting edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using imputation tests. We performed 1000 simulations and averaged the results. The results for both 5% and 10% deletion are similar. Among the imputation methods, we observe the local least squares method with k = 4 is most accurate under the error measures considered. The k-nearest neighbor method with k = 1 has the highest error rate among imputation methods and error measures. Conclusions. We conclude for imputing missing values in Affymetrix microarray datasets, using the MAS 5.0 preprocessing scheme, the local least squares method with k = 4 has the best overall performance and k-nearest neighbor method with k = 1 has the worst overall performance. These results hold true for both 5% and 10% missing values.
Rao, Sreevidya Sadananda Sadasiva; Shepherd, Lori A.; Bruno, Andrew E.; Liu, Song; Miecznikowski, Jeffrey C.
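The evaluation design described in the abstract above (delete a fraction of values at random, impute, score the error) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the row-wise k-nearest-neighbor rule, the distance definition, and all function names are assumptions, and root mean squared error stands in for the unspecified error measures.

```python
import math
import random


def mask_at_random(matrix, fraction, seed=0):
    """Hide a given fraction of cells; returns the masked copy and the hidden cell set."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, int(round(fraction * len(cells)))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, hidden


def _distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(masked, k):
    """Fill each missing cell with the mean of that column in the k nearest rows."""
    filled = [row[:] for row in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted((_distance(row, other), other[j])
                            for h, other in enumerate(masked)
                            if h != i and other[j] is not None)
            nearest = [v for dist, v in donors[:k] if dist < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def rmse(truth, filled, hidden):
    """Error over the hidden cells that received an imputed value."""
    errs = [(truth[i][j] - filled[i][j]) ** 2
            for i, j in hidden if filled[i][j] is not None]
    return math.sqrt(sum(errs) / len(errs))
```

Repeating `mask_at_random` with different seeds and averaging `rmse` mirrors the repeated-simulation setup; comparing `k = 1` against larger `k` shows how a single noisy neighbor can dominate the imputed value.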
This article reviews multiple imputation, describes assumptions that it requires, and reviews software packages that implement this procedure. We apply the methods and compare the results using two examples: a child psychopathology dataset with missing outcomes and an artificial dataset with missing covariates. We conclude with some discussion of the strengths and weaknesses of these implementations as well as advantages and limitations of imputation
Nicholas J. Horton; Stuart R. Lipsitz
Missing data are nearly always a problem in research, and missing values represent a serious threat to the validity of inferences drawn from findings. Increasingly, social science researchers are turning to multiple imputation to handle missing data. Multiple imputation, in which missing values are replaced by values repeatedly drawn from…
Rose, Roderick A.; Fraser, Mark W.
When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…
Wolkowitz, Amanda A.; Skorupski, William P.
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian,…
Si, Yajuan; Reiter, Jerome P.
Principal components analysis revealed four patterns of nonresponse on children's psychosocial adjustment, lifetime poverty experiences, and family history. Results from examining latent growth curve models using listwise deletion and multiple imputation indicated that multiple imputation corrected for selective nonresponse, providing less-biased…
Davey, Adam; Shanahan, Michael J.; Schafer, Joseph L.
Multiple imputation is now a well-established technique for analysing data sets where some units have incomplete observations. Provided that the imputation model is correct, the resulting estimates are consistent. An alternative, weighting by the inverse probability of observing complete data on a unit, is conceptually simple and involves fewer modelling assumptions, but it is known to be both inefficient (relative
James R. Carpenter; Michael G. Kenward; Stijn Vansteelandt
In real-life situations, we often encounter data sets containing missing observations. Statistical methods that address missingness have been extensively studied in recent years. One of the more popular approaches involves imputation of the missing values prior to the analysis, thereby rendering the data complete. Imputation broadly encompasses an entire scope of techniques that have been developed to make inferences about
José Cortiñas Abrahantes; Cristina Sotto; Geert Molenberghs; Geert Vromman; Bart Bierinckx
Multiple imputation (MI) is a commonly used technique for handling missing data in large-scale medical and public health studies. However, variable selection on multiply-imputed data remains an important and longstanding statistical problem. If a variable selection method is applied to each imputed dataset separately, it may select different variables for different imputed datasets, which makes it difficult to interpret the final model or draw scientific conclusions. In this paper, we propose a novel multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) variable selection method as an extension of the least absolute shrinkage and selection operator (LASSO) method to multiply-imputed data. The MI-LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty to yield a consistent variable selection across multiple-imputed datasets. We use a simulation study to demonstrate the advantage of the MI-LASSO method compared with the alternatives. We also apply the MI-LASSO method to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that are associated with human serum dioxin concentration in Midland, Michigan. PMID:23526243
Chen, Qixuan; Wang, Sijian
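The grouping idea in the abstract above can be written compactly. The following is one way to express a group LASSO criterion over D imputed datasets, with notation chosen here for illustration (it is an assumption, not copied from the paper): y_{d,i} and x_{d,ij} are the outcome and the j-th covariate of subject i in imputed dataset d, and β_{d,j} is the coefficient of variable j in dataset d.

```latex
\min_{\beta}\;
\sum_{d=1}^{D} \sum_{i=1}^{n}
  \Bigl( y_{d,i} - \beta_{d,0} - \sum_{j=1}^{p} x_{d,ij}\,\beta_{d,j} \Bigr)^{2}
\;+\;
\lambda \sum_{j=1}^{p} \sqrt{\sum_{d=1}^{D} \beta_{d,j}^{2}}
```

Because the penalty acts on the Euclidean norm of the group (β_{1,j}, ..., β_{D,j}), the coefficients of variable j are shrunk to zero in all imputed datasets together or in none, which is what gives a single selected variable set across imputations.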
Summary Often a binary variable is generated by dichotomizing an underlying continuous variable measured at a specific time point according to a prespecified threshold value. In the event that the underlying continuous measurements are from a longitudinal study, one can use a repeated measures model to impute missing data on responder status as a result of subject drop-out and apply a logistic regression model on the observed or otherwise imputed responder status. Standard Bayesian multiple imputation techniques (Rubin, 1987, Multiple Imputation for Nonresponse in Surveys), which draw the parameters for the imputation model from the posterior distribution and construct the variance of parameter estimates for the analysis model as a combination of within- and between-imputation variances, are found to be conservative. The frequentist multiple imputation approach, which fixes the parameters for the imputation model at the maximum likelihood estimates and constructs the variance of parameter estimates for the analysis model using the results of Robins and Wang (2000, Biometrika 87, 113–124), is shown to be more efficient. We propose to apply the Kenward and Roger (1997, Biometrics 53, 983–997) degrees of freedom to account for the uncertainty associated with variance-covariance parameter estimates for the repeated measures model.
Lu, Kaifeng; Jiang, Liqiu; Tsiatis, Anastasios A.
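The "combination of within- and between-imputation variances" mentioned in the abstract above refers to Rubin's combining rules. A minimal sketch for a scalar estimate is given below; the function name is hypothetical, and the frequentist (Robins and Wang) variance that the abstract favors is not shown.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, within, between, total
```

For example, three imputations with estimates 1.0, 2.0, 3.0 and a common variance of 0.5 give a between-imputation variance of 1.0 and a total variance of 0.5 + (4/3)(1.0).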
Background Accurately estimating missing values in microarray data is an important pre-processing step, because complete datasets are required in numerous expression profile analyses in bioinformatics. Although several methods have been suggested, their performance is not satisfactory for datasets with high missing percentages. Results The paper explores the feasibility of doing missing value imputation with the help of gene regulatory mechanisms. An imputation framework called histone acetylation information aided imputation method (HAIimpute method) is presented. It incorporates the histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least squares) imputation algorithms for final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve on widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Meanwhile, the genes imputed by HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, which is one of the existing related methods that use functional similarity as the external information. Conclusion We demonstrated that the use of histone acetylation information could greatly improve the performance of the imputation, especially at high missing percentages. This idea can be generalized to various imputation methods to improve their performance. Moreover, with more knowledge accumulated on gene regulatory mechanisms in addition to histone acetylation, the performance of our approach can be further improved and verified.
Xiang, Qian; Dai, Xianhua; Deng, Yangyang; He, Caisheng; Wang, Jiang; Feng, Jihua; Dai, Zhiming
This article estimates a model of self-reported financial well-being (FWB) using primary data collected for a Southwestern U.S. city. Missing data are estimated using multiple imputation. Model estimates show how FWB depends on home ownership, the number of children, health insurance, age, and income. Multiple imputation results differ somewhat from complete case results.
While fusion can be accomplished at multiple levels in a multibiometric system, score level fusion is commonly used as it offers a good trade-off between fusion complexity and data availability. However, missing scores affect the implementation of several biometric fusion rules. While there are several techniques for handling missing data, the imputation scheme - which replaces missing values with predicted values - is preferred since this scheme can be followed by a standard fusion scheme designed for complete data. This paper compares the performance of three imputation methods: Imputation via Maximum Likelihood Estimation (MLE), Multiple Imputation (MI) and Random Draw Imputation through Gaussian Mixture Model estimation (RD GMM). A novel method called Hot-deck GMM is also introduced and exhibits markedly better performance than the other methods because of its ability to preserve the local structure of the score distribution. Experiments on the MSU dataset indicate the robustness of the schemes in handling missing scores at various missing data rates.
Ding, Yaohui; Ross, Arun
Genome-wide association studies are usually accompanied by imputation techniques to complement genome-wide SNP chip genotypes. Current imputation approaches separate the phasing of study data from imputing, which makes the phasing independent from the reference data. The two-step approach allows for updating the imputation for a new reference panel without repeating the tedious phasing step. This advantage, however, no longer holds when the build of the study data differs from the build of the reference data. In this case, the current approach is to harmonize the study data annotation with the reference data (prephasing lift-over), requiring rephasing and re-imputing. As a novel approach, we propose to harmonize study haplotypes with reference haplotypes (postphasing lift-over). This allows for updating imputed study data for new reference panels without requiring rephasing. With continuously updated reference panels, our approach can save considerable computing time of up to 1 month per re-imputation. We evaluated the rephasing and postphasing lift-over approaches by using data from 1,644 unrelated individuals imputed by both approaches and comparing them with directly typed genotypes. On average, both approaches perform equally well, with mean concordances of 93% between imputed and typed genotypes. Also, imputation qualities are similar (mean difference in RSQ < 0.1%). We demonstrate that our novel postphasing lift-over approach is a practical and time-saving alternative to the prephasing lift-over. This might encourage study partners to accommodate updated reference builds and ultimately improve the information content of study data. Our novel approach is implemented in the software PhaseLift. PMID:24962562
Gorski, Mathias; Winkler, Thomas W; Stark, Klaus; Müller-Nurasyid, Martina; Ried, Janina S; Grallert, Harald; Weber, Bernhard H F; Heid, Iris M
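The concordance between imputed and directly typed genotypes reported in the abstract above can be computed along the following lines. This is a hedged sketch with hypothetical function names and genotypes coded as 0, 1, 2 copies of an allele. The second function is the squared Pearson correlation between imputed dosages and typed genotypes, which is related to, but not the same as, the RSQ quality estimate that imputation software reports without access to true genotypes.

```python
def concordance(imputed, typed):
    """Fraction of best-guess imputed genotypes that match the typed genotypes."""
    pairs = [(a, b) for a, b in zip(imputed, typed)
             if a is not None and b is not None]  # skip missing calls
    return sum(a == b for a, b in pairs) / len(pairs)


def dosage_r2(dosages, typed):
    """Squared Pearson correlation between imputed dosages and typed genotypes."""
    n = len(typed)
    mx, my = sum(dosages) / n, sum(typed) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, typed))
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in typed)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a monomorphic marker
    return sxy * sxy / (sxx * syy)
```

Concordance is dominated by the major allele at low-frequency variants, so a correlation-based measure is often reported alongside it.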
Widely recognized in many fields including economics, engineering, epidemiology, health sciences, technology and wildlife management, length-biased sampling generates biased and right-censored data but often provides the best information available for statistical inference. Different from traditional right-censored data, length-biased data have unique aspects resulting from their sampling procedures. We exploit these unique aspects and propose a general imputation-based estimation method for analyzing length-biased data under a class of flexible semiparametric transformation models. We present new computational algorithms that can jointly estimate the regression coefficients and the baseline function semiparametrically. The imputation-based method under the transformation model provides an unbiased estimator regardless of whether the censoring is independent of the covariates. We establish large-sample properties using the empirical processes method. Simulation studies show that under small to moderate sample sizes, the proposed procedure has smaller mean square errors than two existing estimation procedures. Finally, we demonstrate the estimation procedure by a real data example.
Liu, Hao; Qin, Jing; Shen, Yu
Recent studies have shown that chromosomal recombination only takes place at some narrow hotspots. Within the chromosomal region between these hotspots (called haplotype block), little or even no recombination occurs, and a small subset of SNPs (called tag SNPs) is sufficient to capture the haplotype pattern of the block. In reality, the tag SNPs may be genotyped as missing
Yao-ting Huang; Kui Zhang; Ting Chen; Kun-mao Chao
Background Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used. Results We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods. Conclusion The results demonstrate that, while missing values are still severely complicating microarray data analysis, their impact on the discovery of biologically meaningful gene groups can – up to a certain degree – be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
Tuikkala, Johannes; Elo, Laura L; Nevalainen, Olli S; Aittokallio, Tero
Related individuals share potentially long chromosome segments that trace to a common ancestor. We describe a phasing algorithm (ChromoPhase) that utilizes this characteristic of finite populations to phase large sections of a chromosome. In addition to phasing, our method imputes missing genotypes in individuals genotyped at lower marker density when more densely genotyped relatives are available. ChromoPhase uses a pedigree to collect an individual's (the proband) surrogate parents and offspring and uses genotypic similarity to identify its genomic surrogates. The algorithm then cycles through the relatives and genomic surrogates one at a time to find shared chromosome segments. Once a segment has been identified, any missing information in the proband is filled in with information from the relative. We tested ChromoPhase in a simulated population consisting of 400 individuals at a marker density of 1500/M, which is approximately equivalent to a 50K bovine single nucleotide polymorphism chip. In simulated data, 99.9% of loci were correctly phased and, when imputing from 100 to 1500 markers, more than 87% of missing genotypes were correctly imputed. Performance increased when the number of generations available in the pedigree increased, but was reduced when the sparse genotype contained fewer loci. However, in simulated data, ChromoPhase correctly imputed at least 12% more genotypes than fastPHASE, depending on sparse marker density. We also tested the algorithm in a real Holstein cattle data set to impute 50K genotypes in animals with a sparse 3K genotype. In these data 92% of genotypes were correctly imputed in animals with a genotyped sire. We evaluated the accuracy of genomic predictions with the dense, sparse, and imputed simulated data sets and show that the reduction in genomic evaluation accuracy is modest even with imperfectly imputed genotype data.
Our results demonstrate that imputation of missing genotypes, and potentially full genome sequence, using long-range phasing is feasible. PMID:21705746
Daetwyler, Hans D; Wiggans, George R; Hayes, Ben J; Woolliams, John A; Goddard, Mike E
Background The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low density single nucleotide polymorphism (SNP) panels, i.e. 3K or 7K, to a high density panel with 50K SNP. No pedigree information was used. Methods Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied by using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained by using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. Results In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90% and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, computing time required by the partial least squares regression method was on average around 10 times lower than computing time required by Beagle. Using the partial least squares regression method in the multi-breed analysis resulted in lower imputation accuracies than using single-breed data. The impact of the SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Conclusions Results of the present work suggested that the partial least squares regression imputation method could be useful to impute SNP genotypes when pedigree information is not available.
Multivariate imputation by chained equations (MICE) has emerged as a principled method of dealing with missing data. Despite properties that make MICE particularly useful for large imputation procedures and advances in software development that now make it accessible to many researchers, many psychiatric researchers have not been trained in these methods and few practical resources exist to guide researchers in the implementation of this technique. This paper provides an introduction to the MICE method with a focus on practical aspects and challenges in using this method. A brief review of software programs available to implement MICE and then analyze multiply imputed data is also provided.
Azur, Melissa J.; Stuart, Elizabeth A.; Frangakis, Constantine; Leaf, Philip J.
Background Multiple imputation (MI) was developed as a method to enable valid inferences to be obtained in the presence of missing data rather than to re-create the missing values. Within the applied setting, it remains unclear how important it is that imputed values should be plausible for individual observations. One variable type for which MI may lead to implausible values is a limited-range variable, where imputed values may fall outside the observable range. The aim of this work was to compare methods for imputing limited-range variables, with a focus on those that restrict the range of the imputed values. Methods Using data from a study of adolescent health, we consider three variables based on responses to the General Health Questionnaire (GHQ), a tool for detecting minor psychiatric illness. These variables, based on different scoring methods for the GHQ, resulted in three continuous distributions with mild, moderate and severe positive skewness. In an otherwise complete dataset, we set 33% of the GHQ observations to missing completely at random or missing at random; repeating this process to create 1000 datasets with incomplete data for each scenario. For each dataset, we imputed values on the raw scale and following a zero-skewness log transformation using: univariate regression with no rounding; post-imputation rounding; truncated normal regression; and predictive mean matching. We estimated the marginal mean of the GHQ and the association between the GHQ and a fully observed binary outcome, comparing the results with complete data statistics. Results Imputation with no rounding performed well when applied to data on the raw scale. Post-imputation rounding and imputation using truncated normal regression produced higher marginal means than the complete data estimate when data had a moderate or severe skew, and this was associated with under-coverage of the complete data estimate. 
Predictive mean matching also produced under-coverage of the complete data estimate. For the estimate of association, all methods produced similar estimates to the complete data. Conclusions For data with a limited range, multiple imputation using techniques that restrict the range of imputed values can result in biased estimates for the marginal mean when data are highly skewed.
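Of the range-restricting methods compared in the abstract above, predictive mean matching is the easiest to illustrate, because every imputed value is borrowed from an observed case and therefore cannot fall outside the observed range. The sketch below assumes a single fully observed predictor and a simple least-squares line. The function names are hypothetical, and the step of drawing regression parameters from their posterior, which a proper multiple-imputation implementation would include, is deliberately omitted.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def pmm_impute(xs, ys, k=3, seed=0):
    """Replace each missing y (None) with the observed y of a randomly chosen
    donor among the k observed cases whose predicted values are closest."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    predicted = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x
        donors = sorted(predicted, key=lambda p: abs(p[0] - target))[:k]
        completed.append(rng.choice(donors)[1])  # borrow an observed value
    return completed
```

Calling `pmm_impute` with several different seeds yields several completed datasets. A case whose predictor lies far beyond the observed data still receives a value from the edge of the observed range, which illustrates how restricting imputations to plausible values can pull estimates such as the marginal mean away from the truth when the data are skewed.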
Autosomal DNA polymorphisms can provide new information and understanding of both the origins of and relationships among modern Native American populations. At the same time that autosomal markers can be highly informative, they are also susceptible to ascertainment biases in the selection of the markers to use. Identifying markers that can be used for ancestry inference among Native American populations can be considered separate from identifying markers to further the quest for history. In the current study we are using data on nine Native American populations to compare the results based on a large haplotype-based dataset with relatively small independent sets of SNPs. We are interested in what types of limited datasets an individual laboratory might be able to collect are best for addressing two different questions of interest. First, how well can we differentiate the Native American populations and/or infer ancestry by assigning an individual to her population(s) of origin? Second, how well can we infer the historical/evolutionary relationships among Native American populations and their Eurasian origins. We conclude that only a large comprehensive dataset involving multiple autosomal markers on multiple populations will be able to answer both questions; different small sets of markers are able to answer only one or the other of these questions. Using our largest dataset we see a general increasing distance from Old World populations from North to South in the New World except for an unexplained close relationship between our Maya and Quechua samples.
Kidd, Judith R.; Friedlaender, Francoise; Pakstis, Andrew J.; Furtado, Manohar; Fang, Rixun; Wang, Xudong; Nievergelt, Caroline M.; Kidd, Kenneth K.
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes
Pierre Nicolas; Fengzhu Sun; Lei M. Li
When statistical agencies release microdata to the public, a major concern is the control of disclosure risk, while ensuring utility in the released data. Often some statistical disclosure control methods such as data swapping, multiple imputation, top co...
B. Sinha, M. Klein, T. Mathew
We propose a method for comparing survival distributions when cause-of-failure information is missing for some individuals. We use multiple imputation to impute missing causes of failure, where the probability that a missing cause is that of interest may depend on auxiliary covariates, and combine log-rank statistics computed from several 'completed' datasets into a test statistic that achieves asymptotically the nominal
Anastasios A. Tsiatis
Summary. In problems with missing or latent data, a standard approach is to first impute the unobserved data, then perform all statistical analyses on the completed dataset (the observed data together with the imputed unobserved data) using standard procedures for complete-data inference. Here, we extend this approach to model checking by demonstrating the advantages of the use of completed-data model diagnostics
Andrew Gelman; Iven Van Mechelen; Geert Verbeke; Daniel F. Heitjan; Michel Meulders
Missing data is a common drawback in many real-life pattern classification scenarios. One of the most popular solutions is missing data imputation by the K nearest neighbours (KNN) algorithm. In this article, we propose a novel KNN imputation procedure using a feature-weighted distance metric based on mutual information (MI). This method provides a missing data estimation aimed at solving the
Pedro J. García-laencina; José-luis Sancho-gómez; Aníbal R. Figueiras-vidal; Michel Verleysen
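The feature-weighted KNN idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract is truncated and does not say how the mutual-information weights are computed, so the sketch takes the per-feature `weights` as a given input. The function names `weighted_distance` and `knn_impute_row` are hypothetical, and donors are assumed to be complete rows.

```python
from math import sqrt
from typing import List, Optional, Sequence


def weighted_distance(row: Sequence[Optional[float]],
                      donor: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted Euclidean distance over the features observed in `row`."""
    total = 0.0
    for x, y, w in zip(row, donor, weights):
        if x is not None:  # skip features that are missing in the target row
            total += w * (x - y) ** 2
    return sqrt(total)


def knn_impute_row(row: Sequence[Optional[float]],
                   donors: Sequence[Sequence[float]],
                   weights: Sequence[float],
                   k: int) -> List[float]:
    """Fill each missing entry with the mean of that feature over the k nearest donors."""
    if k < 1 or not donors:
        raise ValueError("need k >= 1 and at least one complete donor row")
    ranked = sorted(donors, key=lambda d: weighted_distance(row, d, weights))
    nearest = ranked[:k]
    return [
        x if x is not None else sum(d[j] for d in nearest) / len(nearest)
        for j, x in enumerate(row)
    ]
```

A larger weight makes a feature count for more when neighbours are ranked; with all weights equal to 1 the sketch reduces to ordinary KNN mean imputation.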
Missing values in a medical database present a problem when trying to develop a prediction model for a broad range of patients, if the data are not missing at random. We present a data imputation approach for physiologic parameters that incorporates individualized case information into the imputed values. We replaced missing values in a neonatal intensive care unit (NICU) database with relevant data by integrating aspects of artificial neural networks (ANNs) and case-based reasoning (CBR). PMID:19163673
Ennett, Colleen M; Frize, Monique; Walker, C
In most longitudinal clinical trials, some patients drop out before the end of the planned follow-up, and, in order to allow an all-patient intent-to-treat analysis to be performed, it is common practice to use some method of imputation to estimate values for missing data. However, different imputation methods may provide different results, and it is essential to investigate the sensitivity
L. W. Huson; J. Chung; M. Salgo
Several approaches exist for handling missing covariates in the Cox proportional hazards model. The multiple imputation (MI) is relatively easy to implement with various software available and results in consistent estimates if the imputation model is correct. On the other hand, the fully augmented weighted estimators (FAWEs) recover a substantial proportion of the efficiency and have the doubly robust property. In this paper, we compare the FAWEs and the MI through a comprehensive simulation study. For the MI, we consider the multiple imputation by chained equation (MICE) and focus on two imputation methods: Bayesian linear regression imputation and predictive mean matching. Simulation results show that the imputation methods can be rather sensitive to model misspecification and may have large bias when the censoring time depends on the missing covariates. In contrast, the FAWEs allow the censoring time to depend on the missing covariates and are remarkably robust as long as getting either the conditional expectations or the selection probability correct due to the doubly robust property. The comparison suggests that the FAWEs show the potential for being a competitive and attractive tool for tackling the analysis of survival data with missing covariates.
Qi, Lihong; Wang, Ying-Fang; He, Yulei
The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often either apply simple rounding techniques for ordinal variables or create a distinct ‘missing’ category and/or disregard the nominal variable from the imputation phase. All of these practices can potentially lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology calibrated to preserve observed distributions to multiply impute missing categorical covariates. The major attractiveness of this method is its flexibility to use any ‘working’ imputation software, particularly those based on MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on a widely used National Survey of Children with Special Health Care Needs where incomplete values on race posed a valid threat on inferences pertaining to disparities.
Yucel, Recai M.; He, Yulei; Zaslavsky, Alan M.
Although mapping quantitative traits in inbred strains is simpler than mapping the analogous traits in humans, classical inbred crosses suffer from reduced genetic diversity compared to experimental designs involving outbred animal populations. Multiple crosses, for example the Complex Trait Consortium's eight-way cross, circumvent these difficulties. However, complex mating schemes and systematic inbreeding raise substantial computational difficulties. Here we present a method for locally imputing the strain origins of each genotyped animal along its genome. Imputed origins then serve as mean effects in a multivariate Gaussian model for testing association between trait levels and local genomic variation. Imputation is a combinatorial process that assigns the maternal and paternal strain origin of each animal on the basis of observed genotypes and prior pedigree information. Without smoothing, imputation is likely to be ill-defined or jump erratically from one strain to another as an animal's genome is traversed. In practice, one expects to see long stretches where strain origins are invariant. Smoothing can be achieved by penalizing strain changes from one marker to the next. A dynamic programming algorithm then solves the strain imputation process in one quick pass through the genome of an animal. Imputation accuracy exceeds 99% in practical examples and leads to high-resolution mapping in simulated and real data. The previous fastest quantitative trait loci (QTL) mapping software for dense genome scans reduced compute times to hours. Our implementation further reduces compute times from hours to minutes with no loss in statistical power. Indeed, power is enhanced for full pedigree data.
Zhou, Jin J.; Ghazalpour, Anatole; Sobel, Eric M.; Sinsheimer, Janet S.; Lange, Kenneth
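The smoothing step described in the abstract above (penalizing strain changes from one marker to the next and solving by dynamic programming in one pass) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it tracks a single strain origin per marker rather than a maternal/paternal pair, it ignores pedigree priors, and the per-marker mismatch `costs` and the uniform `switch_penalty` are hypothetical inputs.

```python
from typing import List, Sequence


def smooth_strain_path(costs: Sequence[Sequence[float]],
                       switch_penalty: float) -> List[int]:
    """Return one strain index per marker, minimizing the summed mismatch
    cost plus `switch_penalty` for every change of strain between markers."""
    if not costs:
        return []
    n_states = len(costs[0])
    best = list(costs[0])          # best[s]: cheapest path ending in strain s
    back: List[List[int]] = []     # back-pointers, one list per later marker
    for row in costs[1:]:
        # With a uniform penalty, the best switch always comes from the
        # overall cheapest previous state.
        prev_state = min(range(n_states), key=best.__getitem__)
        prev_min = best[prev_state]
        new_best, pointers = [], []
        for s in range(n_states):
            stay, switch = best[s], prev_min + switch_penalty
            if stay <= switch:
                new_best.append(stay + row[s])
                pointers.append(s)
            else:
                new_best.append(switch + row[s])
                pointers.append(prev_state)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final state.
    state = min(range(n_states), key=best.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a large penalty the path stays on one strain across an isolated mismatching marker; with a small penalty it jumps to the locally best strain, which is the erratic behaviour the abstract says smoothing is meant to avoid.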
Non-coding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes (T2D) [1-4], myocardial infarction (MI) [5-7], aneurysm [8], vertical cup disc ratio [9], and at least five cancers [10-16]. We compared approaches to more comprehensively assess genetic variation in the region. We performed targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into T2D and MI cohorts directly from targeted sequencing, from a genotyped reference panel derived from sequencing, and from 1000 Genomes low-coverage data. Common polymorphisms were captured similarly by all strategies. Imputation of intermediate frequency polymorphisms required a higher density of tag SNPs in disease samples than available on first generation Genome Wide Association Study (GWAS) arrays. Association analyses identified more comprehensive sets of variants demonstrating equivalent statistical association to T2D or MI, but did not identify stronger associations than the original GWAS signals.
Shea, Jessica; Agarwala, Vineeta; Philippakis, Anthony A.; Maguire, Jared; Banks, Eric; DePristo, Mark; Thomson, Brian; Guiducci, Candace; Kathiresan, Sekar; Gabriel, Stacey; Burtt, Noel P; Daly, Mark J.; Groop, Leif; Altshuler, David
The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables, which formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. Copyright © 2014 John Wiley & Sons, Ltd. PMID:24634315
Siddique, Juned; Harel, Ofer; Crespi, Catherine M; Hedeker, Donald
The availability of genomic resources can facilitate progress in plant breeding through the application of advanced molecular technologies for crop improvement. This is particularly important in the case of less researched crops such as cassava, a staple and food security crop for more than 800 million people. Here, expressed sequence tags (ESTs) were generated from five drought stressed and well-watered cassava varieties. Two cDNA libraries were developed: one from root tissue (CASR), the other from leaf, stem and stem meristem tissue (CASL). Sequencing generated 706 contigs and 3,430 singletons. These sequences were combined with those from two other EST sequencing initiatives and filtered based on the sequence quality. Quality sequences were aligned using CAP3 and embedded in a Windows browser called HarvEST:Cassava which is made available. HarvEST:Cassava consists of a Unigene set of 22,903 quality sequences. A total of 2,954 putative SNPs were identified. Of these 1,536 SNPs from 1,170 contigs and 53 cassava genotypes were selected for SNP validation using Illumina's GoldenGate assay. As a result 1,190 SNPs were validated technically and biologically. The location of validated SNPs on scaffolds of the cassava genome sequence (v.4.1) is provided. A diversity assessment of 53 cassava varieties reveals some sub-structure based on the geographical origin, greater diversity in the Americas as opposed to Africa, and similar levels of diversity in West Africa and southern, eastern and central Africa. The resources presented allow for improved genetic dissection of economically important traits and the application of modern genomics-based approaches to cassava breeding and conservation. PMID:22069119
Ferguson, Morag E; Hearne, Sarah J; Close, Timothy J; Wanamaker, Steve; Moskal, William A; Town, Christopher D; de Young, Joe; Marri, Pradeep Reddy; Rabbi, Ismail Yusuf; de Villiers, Etienne P
Background Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. The studies about the performance comparison of different algorithms are still incomprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. Results In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether the datasets from different species have different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of a dataset but not on the species where the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. Conclusions In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. 
Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R² = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R² = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. 
On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation. PMID:24531728
Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P
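As a rough illustration of how an imputation accuracy figure of this kind can be computed, the sketch below takes the squared Pearson correlation between imputed allele dosages and observed genotypes for one SNP. That definition is an assumption here: the abstract above reports R² values without defining them. The function name `dosage_r2` is hypothetical.

```python
from math import nan
from typing import Sequence


def dosage_r2(imputed: Sequence[float], observed: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed dosages and observed genotypes."""
    n = len(imputed)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(imputed) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, observed))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in observed)
    if sxx == 0 or syy == 0:
        return nan  # undefined when either vector has no variation (monomorphic SNP)
    return (sxy * sxy) / (sxx * syy)
```

Masking known genotypes, imputing them, and averaging this quantity across SNPs is one common way to summarize imputation quality, although the exact evaluation protocol of the study is not given in the abstract.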
The aim of this study was to evaluate different-density genotyping panels for genotype imputation and genomic prediction. Genotypes from customized Golden Gate Bovine3K BeadChip [LD3K; low-density (LD) 3,000-marker (3K); Illumina Inc., San Diego, CA] and BovineLD BeadChip [LD6K; 6,000-marker (6K); Illumina Inc.] panels were imputed to the BovineSNP50v2 BeadChip [50K; 50,000-marker; Illumina Inc.]. In addition, LD3K, LD6K, and 50K genotypes were imputed to a BovineHD BeadChip [HD; high-density 800,000-marker (800K) panel], and predictive ability was subsequently evaluated and compared. Comparisons of prediction accuracy were carried out using Random boosting and genomic BLUP. Four traits under selection in the Spanish Holstein population were used: milk yield, fat percentage (FP), somatic cell count, and days open (DO). Training sets at 50K density for imputation and prediction included 1,632 genotypes. Testing sets for imputation from LD to 50K contained 834 genotypes and testing sets for genomic evaluation included 383 bulls. The reference population genotyped at HD included 192 bulls. Imputation using BEAGLE software (http://faculty.washington.edu/browning/beagle/beagle.html) was effective for reconstruction of dense 50K and HD genotypes, even when a small reference population was used, with 98.3% of SNP correctly imputed. Random boosting outperformed genomic BLUP in terms of prediction reliability, mean squared error, and selection effectiveness of top animals in the case of FP. For other traits, however, no clear differences existed between methods. No differences were found between imputed LD and 50K genotypes, whereas evaluation of genotypes imputed to HD was, on average across data set, method, and trait, 4% more accurate than 50K prediction, and showed smaller (2%) mean squared error of predictions. 
Similar bias in regression coefficients was found across data sets but regressions were 0.32 units closer to unity for DO when genotypes were imputed to HD density. Imputation to HD genotypes might produce higher stability in the genomic proofs of young candidates. Regarding selection effectiveness of top animals, more (2%) top bulls were classified correctly with imputed LD6K genotypes than with LD3K. When the original 50K genotypes were used, correct classification of top bulls increased by 1%, and when those genotypes were imputed to HD, 3% more top bulls were detected. Selection effectiveness could be slightly enhanced for certain traits such as FP, somatic cell count, or DO when genotypes are imputed to HD. Genetic evaluation units may consider a trait-dependent strategy in terms of method and genotype density for use in the genome-enhanced evaluations. PMID:23810591
Jiménez-Montero, J A; Gianola, D; Weigel, K; Alenda, R; González-Recio, O
Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations.
This article presents a multiple imputation method for sensitivity analyses of time-to-event data with possibly informative censoring. The imputed time for censored values is drawn from the failure time distribution conditional on the time of follow-up discontinuation. A variety of specifications regarding the post-discontinuation tendency of having events can be incorporated in the imputation through a hazard ratio parameter for discontinuation versus continuation of follow-up. Multiple-imputed data sets are analyzed with the primary analysis method, and the results are then combined using the methods of Rubin. An illustrative example is provided.
Zhao, Yue; Herring, Amy H.; Zhou, Haibo; Ali, Mirza W.; Koch, Gary G.
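The abstract above says that results from the multiply imputed data sets are combined using the methods of Rubin. A minimal sketch of the standard combining rules for a scalar estimate follows: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` is hypothetical, and the sketch omits the degrees-of-freedom calculation used for confidence intervals.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float]:
    """Pool m per-imputation estimates and their squared standard errors.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

For example, three imputations with estimates 1.0, 2.0, 3.0 and a common variance of 0.5 give a pooled estimate of 2.0 and a total variance of 0.5 + (4/3)(1.0).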
Background Missing values commonly occur in the microarray data, which usually contain more than 5% missing values with up to 90% of genes affected. Inaccurate missing value estimation results in reducing the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, the regression-based methods are very popular and have been shown to perform better than the other types of methods in many testing microarray datasets. Results To further improve the performances of the regression-based methods, we propose shrinkage regression-based methods. Our methods take the advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. Besides, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation in six testing microarray datasets than the existing regression-based methods do. Conclusions Imputation of missing values is a very important aspect of microarray data analyses because most of the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
Despite significant improvements in recent years, the proteomic datasets currently available still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values for proteins experimentally undetected, using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured only for the 45- and 90-min samples. With the goal of elucidating the relationship between temporal gene expression and protein abundance data, and then using it to impute missing proteomic values for the 45-min sample (which does not have cognate transcriptomic data) and the 90-min sample, we initially used nonlinear Smoothing Splines Curve Fitting (SSCF) to identify temporal relationships among transcriptomic data at different time points and then imputed missing gene expression measurements for the sample at 45 min. After the imputation was validated by biological constraints (i.e. operons), we used a data-driven Gradient Boosted Trees (GBT) model to uncover possible non-linear relationships between temporal transcriptomic and proteomic data, and to impute protein abundance for the proteins experimentally undetected in the 45- and 90-min samples, based on relevant predictors such as temporal mRNA gene expression data, cellular roles, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. The imputed protein values were validated using biological constraints such as operon, regulon and pathway information. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
Torres-García, Wandaliz [Arizona State University]; Brown, Steven D [ORNL]; Johnson, Roger [Arizona State University]; Zhang, Weiwen [Arizona State University]; Runger, George [Arizona State University]; Meldrum, Deirdre [Arizona State University]
Recent emergence of the common-disease-rare-variant hypothesis has renewed interest in the use of large pedigrees for identifying rare causal variants. Genotyping with modern sequencing platforms is increasingly common in the search for such variants but remains expensive and often is limited to only a few subjects per pedigree. In population-based samples, genotype imputation is widely used so that additional genotyping is not needed. We now introduce an analogous approach that enables computationally efficient imputation in large pedigrees. Our approach samples inheritance vectors (IVs) from a Markov Chain Monte Carlo sampler by conditioning on genotypes from a sparse set of framework markers. Missing genotypes are probabilistically inferred from these IVs along with observed dense genotypes that are available on a subset of subjects. We implemented our approach in the Genotype Imputation Given Inheritance (GIGI) program and evaluated the approach on both simulated and real large pedigrees. With a real pedigree, we also compared imputed results obtained from this approach with those from the population-based imputation program BEAGLE. We demonstrated that our pedigree-based approach imputes many alleles with high accuracy. It is much more accurate for calling rare alleles than is population-based imputation and does not require an outside reference sample. We also evaluated the effect of varying other parameters, including the marker type and density of the framework panel, threshold for calling genotypes, and population allele frequencies. By leveraging information from existing genotypes already assayed on large pedigrees, our approach can facilitate cost-effective use of sequence data in the pursuit of rare causal variants.
Cheung, Charles Y.K.; Thompson, Elizabeth A.; Wijsman, Ellen M.
This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas multilayer perceptron type neural network and multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are computationally intensive ones. In addition, we propose a modification on the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis which takes spatio-temporal dependencies into account for evaluating imputation performances. Depending on the detailed graphical and quantitative analysis, it can be said that although computational methods, particularly EM-MCMC method, are computationally inefficient, they seem favorable for imputation of meteorological time series with respect to different missingness periods considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation particularly with computational methods since it gives more precise results in meteorological time series.
Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci
A variable is 'systematically missing' if it is missing for all individuals within particular studies in an individual participant data meta-analysis. When a systematically missing variable is a potential confounder in observational epidemiology, standard methods either fail to adjust the exposure-disease association for the potential confounder or exclude studies where it is missing. We propose a new approach to adjust for systematically missing confounders based on multiple imputation by chained equations. Systematically missing data are imputed via multilevel regression models that allow for heterogeneity between studies. A simulation study compares various choices of imputation model. An illustration is given using data from eight studies estimating the association between carotid intima media thickness and subsequent risk of cardiovascular events. Results are compared with standard methods and also with an extension of a published method that exploits the relationship between fully adjusted and partially adjusted estimated effects through a multivariate random effects meta-analysis model. We conclude that multiple imputation provides a practicable approach that can handle arbitrary patterns of systematic missingness. Bias is reduced by including sufficient between-study random effects in the imputation model. PMID:23857554
Resche-Rigon, Matthieu; White, Ian R; Bartlett, Jonathan W; Peters, Sanne A E; Thompson, Simon G
In a typical randomized clinical trial, a continuous variable of interest (e.g., bone density) is measured at baseline and fixed postbaseline time points. The resulting longitudinal data, often incomplete due to dropouts and other reasons, are commonly analyzed using parametric likelihood-based methods that assume multivariate normality of the response vector. If the normality assumption is deemed untenable, then semiparametric methods such as (weighted) generalized estimating equations are considered. We propose an alternate approach in which the missing data problem is tackled using multiple imputation, and each imputed dataset is analyzed using robust regression (M-estimation; Huber, 1973, Annals of Statistics 1, 799-821.) to protect against potential non-normality/outliers in the original or imputed dataset. The robust analysis results from each imputed dataset are combined for overall estimation and inference using either the simple Rubin (1987, Multiple Imputation for Nonresponse in Surveys, New York: Wiley) method, or the more complex but potentially more accurate Robins and Wang (2000, Biometrika 87, 113-124.) method. We use simulations to show that our proposed approach performs at least as well as the standard methods under normality, but is notably better under both elliptically symmetric and asymmetric non-normal distributions. A clinical trial example is used for illustration. PMID:22994905
Mehrotra, Devan V; Li, Xiaoming; Liu, Jiajun; Lu, Kaifeng
Background Variant Creutzfeldt-Jakob disease is an infectious, neurodegenerative, protein-misfolding disease, of the prion disease family, originally acquired through ingestion of meat products contaminated with bovine spongiform encephalopathy (BSE). Public health concern was increased by the discovery of human-to-human transmission via blood transfusion. This study has verified a novel genetic marker linked to disease risk. Methods SNP imputation and association testing indicated those genes that had significant linkage to disease risk and one gene was investigated further with Sanger resequencing. Results from variant Creutzfeldt-Jakob disease were compared with those from sporadic (idiopathic) Creutzfeldt-Jakob disease and published controls. Results The most significant disease risk, in addition to the prion protein gene, was for the phosphatidylinositol-specific phospholipase C, X domain containing 3 (PLCXD3) gene. Sanger resequencing of CJD patients across a region of PLCXD3 with known variants confirmed three SNPs associated with variant and sporadic CJD. Conclusions These data provide the first highly significant confirmation of SNP allele frequencies for a novel CJD candidate gene providing new avenues for investigating these neurodegenerative prion diseases. The phospholipase PLCXD3 is primarily expressed in the brain and is associated with lipid catabolism and signal transduction.
...2009-04-01 true May the Inter-American Foundation impute conduct of one person to another...Foreign Relations INTER-AMERICAN FOUNDATION GOVERNMENTWIDE DEBARMENT AND SUSPENSION...1006.630 May the Inter-American Foundation impute conduct of one person to...
Populus species are currently being domesticated through intensive time- and resource-dependent programs for utilization in phytoremediation, wood and paper products, and conversion to biofuels. Poplar leaf rust disease can greatly reduce wood volume. Genetic resistance is effective in reducing economic losses but major resistance loci have been race-specific and can be readily defeated by the pathogen. Developing durable disease resistance requires the identification of non-race-specific loci. In the presented study, area under the disease progress curve was calculated from natural infection of Melampsora ×columbiana in three consecutive years. Association analysis was performed using 412 P. trichocarpa clones genotyped with 29,355 SNPs covering 3,543 genes. We found 40 SNPs within 26 unique genes significantly associated (permutated P<0.05) with poplar rust severity. Moreover, two SNPs were repeated in all three years suggesting non-race-specificity and three additional SNPs were differentially expressed in other poplar rust interactions. These five SNPs were found in genes that have orthologs in Arabidopsis with functionality in pathogen induced transcriptome reprogramming, Ca2+/calmodulin and salicylic acid signaling, and tolerance to reactive oxygen species. The additive effect of non-R gene functional variants may constitute high levels of durable poplar leaf rust resistance. Therefore, these findings are of significance for speeding the genetic improvement of this long-lived, economically important organism.
La Mantia, Jonathan; Klapste, Jaroslav; El-Kassaby, Yousry A.; Azam, Shofiul; Guy, Robert D.; Douglas, Carl J.; Mansfield, Shawn D.; Hamelin, Richard
Methods for data imputation applicable to air quality data sets were evaluated in the context of univariate (linear, spline and nearest neighbour interpolation), multivariate (regression-based imputation (REGEM), nearest neighbour (NN), self-organizing map (SOM), multi-layer perceptron (MLP)), and hybrid methods of the previous by using simulated missing data patterns. Additionally, a multiple imputation procedure was considered in order to compare single and multiple imputation schemes. Four statistical criteria were adopted: the index of agreement, the squared correlation coefficient (R2), the root mean square error and the mean absolute error with bootstrapped standard errors. The results showed that the performance of interpolation with respect to the length of gaps could be estimated separately for each air quality variable by calculating a gradient and an exponent (the Hurst exponent). This can be further utilised in a hybrid approach in which the imputation is performed either by interpolation or by a multivariate method, depending on the length of gaps and the variable under study. Among the multivariate methods, SOM and MLP performed slightly better than REGEM and NN methods. The advantage of SOM over the others was that it was less dependent on the actual location of the missing values. If priority is given to computational speed, however, NN can be recommended. In general, the results showed that a slight improvement in the performance of multivariate methods can be achieved by hybridisation, and a more substantial one by multiple imputation, where the final estimate is composed of the outputs of several multivariate fill-in methods.
Junninen, Heikki; Niska, Harri; Tuppurainen, Kari; Ruuskanen, Juhani; Kolehmainen, Mikko
Although mapping quantitative traits in inbred strains is simpler than mapping the analogous traits in humans, classical inbred crosses suffer from reduced genetic diversity compared to experimental designs involving outbred animal populations. Multiple crosses, for example the Complex Trait Consortium's eight-way cross, circumvent these difficulties. However, complex mating schemes and systematic inbreeding raise substantial computational difficulties. Here we present a method for locally imputing the strain origins of each genotyped animal along its genome. Imputed origins then serve as mean effects in a multivariate Gaussian model for testing association between trait levels and local genomic variation. Imputation is a combinatorial process that assigns the maternal and paternal strain origin of each animal on the basis of observed genotypes and prior pedigree information. Without smoothing, imputation is likely to be ill-defined or jump erratically from one strain to another as an animal's genome is traversed. In practice, one expects to see long stretches where strain origins are invariant. Smoothing can be achieved by penalizing strain changes from one marker to the next. A dynamic programming algorithm then solves the strain imputation process in one quick pass through the genome of an animal. Imputation accuracy exceeds 99% in practical examples and leads to high-resolution mapping in simulated and real data. The previous fastest quantitative trait loci (QTL) mapping software for dense genome scans reduced compute times to hours. Our implementation further reduces compute times from hours to minutes with no loss in statistical power. Indeed, power is enhanced for full pedigree data. PMID:22143921
Zhou, Jin J; Ghazalpour, Anatole; Sobel, Eric M; Sinsheimer, Janet S; Lange, Kenneth
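The smoothing idea in the abstract above, penalizing strain changes from one marker to the next and solving by dynamic programming in one pass, can be illustrated with a generic Viterbi-style sketch. This is not the authors' algorithm: it tracks a single origin per marker rather than maternal and paternal pairs, ignores pedigree priors, and the names `smooth_origins`, `costs` and `switch_penalty` are hypothetical.

```python
def smooth_origins(costs, switch_penalty):
    """Return the minimum-cost sequence of strain indices along the markers.

    costs[t][s]    -- mismatch cost of assigning strain s at marker t
    switch_penalty -- non-negative cost added whenever the strain changes
                      between adjacent markers
    """
    n_states = len(costs[0])
    best = list(costs[0])      # best[s]: cheapest path cost ending in strain s
    back = []                  # back-pointers, one list per marker after the first

    for row in costs[1:]:
        # The cheapest way to switch into any strain starts from the overall minimum.
        cheapest = min(range(n_states), key=best.__getitem__)
        switch_cost = best[cheapest] + switch_penalty
        pointers, new_best = [], []
        for s in range(n_states):
            if best[s] <= switch_cost:      # staying in s is at least as cheap
                pointers.append(s)
                new_best.append(best[s] + row[s])
            else:                           # switching in from the cheapest strain
                pointers.append(cheapest)
                new_best.append(switch_cost + row[s])
        back.append(pointers)
        best = new_best

    # Trace back from the cheapest final strain.
    state = min(range(n_states), key=best.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because every switch costs the same amount, each marker needs only one comparison per strain, so the pass is linear in the number of markers times the number of strains. A larger `switch_penalty` yields longer invariant stretches; a penalty of zero reproduces the erratic marker-by-marker assignment the abstract warns about.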
Background Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by hierarchical algorithm. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we have evaluated twelve different usable methods, and their influence on the quality of gene clustering. We used several datasets, from both kinetic and non-kinetic experiments in yeast and human. Results We underline the excellent efficiency of approaches proposed and implemented by Bo and co-workers and especially one based on expected maximization (EM_array). These improvements have also been observed for the imputation of extreme values, the most difficult values to predict. We showed that the imputed MVs still have important effects on the stability of the gene clusters. The improvement in the clustering obtained by hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and the gene cluster stability. Even if the comparison between clustering algorithms is a complex task, we observed that the k-means approach is more efficient at conserving gene associations. Conclusions More than 6,000,000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our last study. The EM_array approach constitutes one efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs even at a low rate is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and thus for dedicated benchmarks. A noticeable point is the specific influence of some biological datasets.
Diagnosis of patent foramen ovale (PFO) is commonly made by echocardiography with contrast injection. PFO can be responsible for transient right-to-left shunting with paroxysmal dyspnoea, but spot measurements of oxygen saturation may fail to detect arterial desaturations. Thus, attributing dyspnoeic symptoms to PFO remains difficult. We report the case of a 64-year-old man presenting with intermittent disabling dyspnoea, in whom pulse oximetry monitoring made it possible to impute the symptoms to right-to-left shunting through the PFO and influenced the decision to perform percutaneous closure. PMID:16869459
Cohen, Remy; Laperche, Thierry; Royer, Thierry
Background Single nucleotide polymorphisms (SNPs) are an abundant form of genetic variation in the genome of every species and are useful for gene mapping and association studies. Of particular interest are non-synonymous SNPs, which may alter protein function and phenotype. We therefore examined bovine expressed sequences for non-synonymous SNPs and validated and tested selected SNPs for their association with measured traits. Results Over 500,000 public bovine expressed sequence tagged (EST) sequences were used to search for coding SNPs (cSNPs). A total of 15,353 SNPs were detected in the transcribed sequences studied, of which 6,325 were predicted to be coding SNPs with the remaining 9,028 SNPs presumed to be in untranslated regions. Of the cSNPs detected, 2,868 were predicted to result in a change in the amino acid encoded. In order to determine the actual number of non-synonymous polymorphic SNPs we designed assays for 920 of the putative SNPs. These SNPs were then genotyped through a panel of cattle DNA pools using chip-based MALDI-TOF mass spectrometry. Of the SNPs tested, 29% were found to be polymorphic with a minor allele frequency >10%. A subset of the SNPs was genotyped through animal resources in order to look for association with age of puberty, facial eczema resistance or meat yield. Three SNPs were nominally associated with resistance to the disease facial eczema (P < 0.01). Conclusion We have identified 15,353 putative SNPs in or close to bovine genes and 2,868 of these SNPs were predicted to be non-synonymous. Approximately 29% of the non-synonymous SNPs were polymorphic and common with a minor allele frequency >10%. Of the SNPs detected in this study, 99% have not been previously reported. These novel SNPs will be useful for association studies or gene mapping.
Lee, Michael A; Keane, Orla M; Glass, Belinda C; Manley, Tim R; Cullen, Neil G; Dodds, Ken G; McCulloch, Alan F; Morris, Chris A; Schreiber, Mark; Warren, Jonathan; Zadissa, Amonida; Wilson, Theresa; McEwan, John C
Large-scale genetic screens in zebrafish have identified thousands of mutations in hundreds of essential genes. The genetic mapping of these mutations is necessary to link DNA sequences to the gene functions defined by mutant phenotypes. Here, we report two advances that will accelerate the mapping of zebrafish mutations: (1) The construction of a first generation single nucleotide polymorphism (SNP) map of the zebrafish genome comprising 2035 SNPs and 178 small insertions/deletions, and (2) the development of a method for mapping mutations in which hundreds of SNPs can be scored in parallel with an oligonucleotide microarray. We have demonstrated the utility of the microarray technique in crosses with haploid and diploid embryos by mapping two known mutations to their previously identified locations. We have also used this approach to localize four previously unmapped mutations. We expect that mapping with SNPs and oligonucleotide microarrays will accelerate the molecular analysis of zebrafish mutations. PMID:12466297
Stickney, Heather L; Schmutz, Jeremy; Woods, Ian G; Holtzer, Caleb C; Dickson, Mark C; Kelly, Peter D; Myers, Richard M; Talbot, William S
Many panels of ancestry informative single nucleotide polymorphisms have been proposed in recent years for various purposes including detecting stratification in biomedical studies and determining an individual's ancestry in a forensic context. All of the panels have limitations in their generality and efficiency for routine forensic work. Some panels have used only a few populations to validate them. Some panels are based on very large numbers of SNPs thereby limiting the ability of others to test different populations. We have been working toward an efficient and globally useful panel of ancestry informative markers that is comprised of a small number of highly informative SNPs. We have developed a panel of 55 SNPs analyzed on 73 populations from around the world. We present the details of the panel and discuss its strengths and limitations. PMID:24508742
Kidd, Kenneth K; Speed, William C; Pakstis, Andrew J; Furtado, Manohar R; Fang, Rixun; Madbouly, Abeer; Maiers, Martin; Middha, Mridu; Friedlaender, Françoise R; Kidd, Judith R
Recently, five thyroid cancer significantly associated genetic variants (rs965513, rs944289, rs116909374, rs966423, and rs2439302) have been discovered and validated in two independent GWAS and numerous case-control studies, which were conducted in different populations. We genotyped the above five single nucleotide polymorphisms (SNPs) in Han Chinese populations and performed thyroid cancer-risk predictions with nine machine learning methods. We found that four SNPs were significantly associated with thyroid cancer in Han Chinese population, while no polymorphism was observed for rs116909374. Small familial relative risks (1.02-1.05) and limited power to predict thyroid cancer (AUCs: 0.54-0.60) indicate limited clinical potential. Four significant SNPs have limited prediction ability for thyroid cancer. PMID:24591304
Guo, Shicheng; Wang, Yu-Long; Li, Yi; Jin, Li; Xiong, Momiao; Ji, Qing-Hai; Wang, Jiucun
Computational biology has the opportunity to play an important role in the identification of functional single nucleotide polymorphisms (SNPs) discovered in large-scale genotyping studies, ultimately yielding new drug targets and biomarkers. The medical genetics and molecular biology communities are increasingly turning to computational biology methods to prioritize interesting SNPs found in linkage and association studies. Many such methods are now available through web interfaces, but the interested user is confronted with an array of predictive results that are often in disagreement with each other. Many tools today produce results that are difficult to understand without bioinformatics expertise, are biased towards non-synonymous SNPs, and do not necessarily reflect up-to-date versions of their source bioinformatics resources, such as public SNP repositories. Here, I assess the utility of the current generation of webservers; and suggest improvements for the next generation of webservers to better deliver value to medical geneticists and molecular biologists.
SNPs are useful for genome-wide mapping and the study of disease genes. Previous studies have focused on SNPs in specific genes or SNPs pooled from a variety of different sources. Here, a systematic approach to the analysis of SNPs in relation to various features on a genome-wide scale, with emphasis on protein features and pseudogenes, is presented. We have performed
Suganthi Balasubramanian; Paul Harrison; Hedi Hegyi; Paul Bertone; Nicholas Luscombe; Nathaniel Echols; Patrick McGarvey; ZhaoLei Zhang; Mark Gerstein
Genetic susceptibility to alcoholic cirrhosis (AC) exists. We previously demonstrated hepatic mitochondrial DNA (mtDNA) damage in patients with AC compared with chronic alcoholics without cirrhosis. Mitochondrial transcription factor A (mtTFA) is central to mtDNA expression regulation and repair; however, it is unclear whether there are specific mtTFA single nucleotide polymorphisms (SNPs) in patients with AC and whether they affect mtDNA repair. In the present study, we screened mtTFA SNPs in patients with AC and analyzed their impact on the copy number of mtDNA in AC. A total of 50 patients with AC, 50 alcoholics without AC and 50 normal subjects were enrolled in the study. SNPs of full-length mtTFA were analyzed using the polymerase chain reaction (PCR) combined with gene sequencing. The hepatic mtTFA mRNA and mtDNA copy numbers were measured using quantitative PCR (qPCR), and mtTFA protein was measured using western blot analysis. A total of 18 mtTFA SNPs specific to patients with AC with frequencies >10% were identified. Two were located in the coding region and 16 were identified in non-coding regions. Conversely, there were five SNPs that were only present in patients with AC and normal subjects and had a frequency >10%. In the AC group, the hepatic mtTFA mRNA and protein levels were significantly lower than those in the other two groups. Moreover, the hepatic mtDNA copy number was significantly lower in the AC group than in the controls and alcoholics without AC. Based on these data, we conclude that AC-specific mtTFA SNPs may be responsible for the observed reductions in mtTFA mRNA, protein levels and mtDNA copy number and they may also increase the susceptibility to AC.
Tang, Chun; Liu, Hongming; Tang, Yongliang; Guo, Yong; Liang, Xianchun; Guo, Liping; Pi, Ruxian; Yang, Juntao
Allele-specific silencing using small interfering RNAs targeting heterozygous single-nucleotide polymorphisms (SNPs) is a promising therapy for human trinucleotide repeat diseases such as Huntington's disease. Linking SNP identities to the two HTT alleles, normal and disease-causing, is a prerequisite for allele-specific RNA interference. Here we describe a method, SNP linkage by circularization (SLiC), to identify linkage between CAG repeat length and nucleotide identity of heterozygous SNPs using Huntington's disease patient peripheral blood samples. PMID:18931668
Liu, Wanzhao; Kennington, Lori A; Rosas, H Diana; Hersch, Steven; Cha, Jang-Ho; Zamore, Phillip D; Aronin, Neil
Genome-wide association studies (GWAS) have been widely applied to identify informative SNPs associated with common and complex diseases. Besides single-SNP analysis, the interaction between SNPs is believed to play an important role in disease risk due to the complex networking of genetic regulations. While many approaches have been proposed for detecting SNP interactions, the relative performance and merits of these methods in practice are largely unclear. In this paper, a ground-truth based comparative study is reported involving 9 popular SNP detection methods using realistic simulation datasets. The results provide general characteristics and guidelines on these methods that may be informative to the biological investigators. PMID:21151836
Chen, Li; Yu, Guoqiang; Miller, David J; Song, Lei; Langefeld, Carl; Herrington, David; Liu, Yongmei; Wang, Yue
In this work, we have analyzed the genetic variation that can alter the expression and the function of the BRCA2 gene using computational methods. Out of the total 534 SNPs, 101 were found to be non-synonymous (nsSNPs). Among the 7 SNPs in the untranslated region, 3 SNPs were found in 5′ and 4 SNPs were found in 3′ untranslated regions
R Rajasekaran; C George Priya Doss; C Sudandiradoss; K Ramanathan; Purohit Rituraj; Sethumadhavan Rao
Breast cancer is the most common cancer among women, affecting up to one third of them during their lifespans. Increased expression of some genes due to polymorphisms increases the risk of breast cancer incidence. Since mutations that are recognized to increase breast cancer risk within families are quite rare, identification of these SNPs is very important. The most important loci that harbor mutations are BRCA1, BRCA2, PTEN, ATM, TP53, CHEK2, PPM1D, CDH1, MLH1, MRE11, MSH2, MSH6, MUTYH, NBN, PMS1, PMS2, BRIP1, RAD50, RAD51C, STK11 and BARD1. The presence of SNPs in these genes increases the risk of breast cancer, and the associated diagnostic markers are among the most reliable for assessing the prognosis of breast cancer. In this article, we review the hereditary genes of breast cancer and the SNPs associated with increased breast cancer risk that were recently reported from candidate gene, meta-analysis and GWAS studies. SNPs of genes associated with breast cancer can be used as a potential tool for improving cancer diagnosis and treatment planning. PMID:23886119
Mahdi, Kooshyar Mohammad; Nassiri, Mohammad Reza; Nasiri, Khadijeh
Single-nucleotide polymorphisms (SNPs) are the most frequent DNA sequence variations, and they have become increasingly popular markers for association studies. Allelic discrimination of the mostly binary SNPs has been reported for diploid species, mainly the human, but not for polyploid genomes such as the agriculturally important crops. In the present study, we analyzed the applicability of pyrosequencing to genotyping SNPs in tetraploid potatoes. Out of 94 polymorphic loci tested, 76 (81%) proved to be amenable to allelic discrimination by pyrosequencing. An additional locus could be genotyped by the addition of an ssDNA binding protein to the pyrosequencing reaction. Of the remaining 17 loci, two failed because of the presence of paralogs in the genome, while in the other cases, self-annealing of the primer or template at the low reaction temperature (28 degrees C) employed in pyrosequencing rendered allelic discrimination impossible. The quantitative precision of pyrosequencing was found to be similar to that of conventional dideoxy sequencing and single-nucleotide primer extension. Except for some sequence-specific limitations, pyrosequencing appears to be an appropriate method for genotyping SNPs in polyploid species because it is possible to distinguish not only between homo- and heterozygosity but also between the different heterozygous states. PMID:11911662
Rickert, Andreas M; Premstaller, Andreas; Gebhardt, Christiane; Oefner, Peter J
The key ideas of multiple imputation for multivariate missing data problems are reviewed. Software programs available for this analysis are described, and their use is illustrated with data from the Adolescent Alcohol Prevention Trial (W. Hansen and J. Graham, 1991). (SLD)
Schafer, Joseph L.; Olsen, Maren K.
Principled techniques for incomplete-data problems are increasingly part of mainstream statistical practice. Among the many techniques proposed so far, inference by multiple imputation (MI) has emerged as one of the most popular. While many strategies leading to inference by MI are available in cross-sectional settings, the same richness does not exist in multilevel applications. The limited methods available for multilevel applications rely on the multivariate adaptations of mixed-effects models. This approach preserves the mean structure across clusters and incorporates distinct variance components into the imputation process. In this paper, I add to these methods by considering a random covariance structure and develop computational algorithms. The attraction of this new imputation modeling strategy is that it correctly reflects the mean and variance structure of the joint distribution of the data and allows the covariances to differ across the clusters. Using Markov Chain Monte Carlo techniques, a predictive distribution of missing data given observed data is simulated, leading to the creation of multiple imputations. To circumvent the large sample size requirement to support independent covariance estimates for the level-1 error term, I consider distributional impositions mimicking random-effects distributions assigned a priori. These techniques are illustrated in an example exploring relationships between victimization and individual and contextual level factors that raise the risk of violent crime.
Yucel, Recai M.
This paper used an event study approach to examine the impact of dividend reinvestment plans on shareholders' returns in the pre- and post-imputation environments. The daily share return behaviour indicated that the announcement to introduce a dividend reinvestment plan (DRP) was received indifferently by the market before 1 July 1988, but was valued positively after superannuation funds were able to
Keith K. W. Chan; Damien W. McColough; Michael T. Skully
Microarrays are able to measure the patterns of expression of thousands of genes in a genome to give profiles that facilitate much faster analysis of biological processes for diagnosis, prognosis and tailored drug discovery. Microarrays, however, commonly have missing values which can result in erroneous downstream analysis. To impute these missing values, various algorithms have been proposed including Collateral Missing
Muhammad Shoaib B. Sehgal; Iqbal Gondal; Laurence S. Dooley; Ross L. Coppel
The bootstrap and multiple imputations are two techniques that can enhance the accuracy of estimated confidence bands and critical values. Although they are computationally intensive, relying on repeated sampling from empirical data sets and associated estimates, modern computing power enables their application in a wide and growing number of econometric settings. We provide an intuitive overview of how to apply
David Brownstone; Robert Valletta
Missing data represent a general problem in many scientific fields, above all in environmental research. Several methods have been proposed in the literature for handling missing data, and the choice of an appropriate method depends, among other things, on the missing data pattern and on the missing-data mechanism. One approach to the problem is to impute them to yield a complete data set. The goal of this paper is to propose a new single imputation method and to compare its performance to other single and multiple imputation methods known in the literature. Considering a data set of concentration measured every 2 h by eight monitoring stations distributed over the metropolitan area of Palermo, Sicily, during 2003, simulated incomplete data have been generated, and the performance of the imputation methods has been compared on the correlation coefficient, the index of agreement (d), the root mean square deviation (RMSD) and the mean absolute deviation (MAD). All the performance indicators agree in ranking the proposed method as the best among those compared, independently of the gap length and of the number of stations with missing data.
Plaia, A.; Bondì, A. L.
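Three of the performance indicators named in the abstract above (index of agreement, RMSD and MAD) can be computed as in the following minimal sketch. It assumes Willmott's standard definition of the index of agreement and the usual definitions of RMSD and MAD as root mean square and mean absolute differences between imputed and held-out observed values; the paper's exact formulas are not given in the abstract, and the function name `imputation_scores` is hypothetical.

```python
from math import sqrt


def imputation_scores(observed, imputed):
    """Compare imputed values against the true values that were masked out.

    Returns a dict with the index of agreement d (1.0 is perfect agreement),
    the root mean square deviation and the mean absolute deviation.
    """
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n

    sq_err = sum((p - o) ** 2 for o, p in zip(observed, imputed))
    abs_err = sum(abs(p - o) for o, p in zip(observed, imputed))

    # Willmott's "potential error": spread of both series around the observed mean.
    potential = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                    for o, p in zip(observed, imputed))
    if potential == 0:
        raise ValueError("index of agreement is undefined for constant data")

    return {
        "d": 1.0 - sq_err / potential,
        "rmsd": sqrt(sq_err / n),
        "mad": abs_err / n,
    }
```

In a simulation like the one described, `observed` would hold the values deliberately removed from the complete series and `imputed` the values an imputation method filled in for them.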
Statistical analyses of recurrent event data have typically been based on the missing at random assumption. One implication of this is that, if data are collected only when patients are on their randomized treatment, the resulting de jure estimator of treatment effect corresponds to the situation in which the patients adhere to this regime throughout the study. For confirmatory analysis of clinical trials, sensitivity analyses are required to investigate alternative de facto estimands that depart from this assumption. Recent publications have described the use of multiple imputation methods based on pattern mixture models for continuous outcomes, where imputation for the missing data for one treatment arm (e.g. the active arm) is based on the statistical behaviour of outcomes in another arm (e.g. the placebo arm). This has been referred to as controlled imputation or reference-based imputation. In this paper, we use the negative multinomial distribution to apply this approach to analyses of recurrent events and other similar outcomes. The methods are illustrated by a trial in severe asthma where the primary endpoint was rate of exacerbations and the primary analysis was based on the negative binomial model. Copyright © 2014 John Wiley & Sons, Ltd. PMID:24931317
Keene, Oliver N; Roger, James H; Hartley, Benjamin F; Kenward, Michael G
This article provides a comprehensive review of multiple imputation (MI), a technique for analyzing data sets with missing values. Formally, MI is the process of replacing each missing data point with a set of m > 1 plausible values to generate m complete data sets. These complete data sets are then analyzed by standard statistical software, and the results combined,
Sandip Sinharay; Hal S. Stern; Daniel Russell
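The "analyze each completed data set, then combine" step described in the abstract above is conventionally done with Rubin's rules. The sketch below pools m point estimates and their squared standard errors; the function name `pool_rubin` is hypothetical, and the code covers only the pooled estimate and total variance, not the degrees-of-freedom adjustment used for confidence intervals.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool results from m multiply imputed data sets using Rubin's rules.

    estimates -- point estimate from each completed data set
    variances -- squared standard error of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m > 1 estimates with matching variances")

    pooled = mean(estimates)         # average of the m point estimates
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)

    # The total variance inflates the between part by (1 + 1/m) to account
    # for using a finite number of imputations.
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty due to the missing values into the final standard error.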
The problem of missing data is frequently encountered in observational studies. We compared approaches to dealing with missing data. Three multiple imputation methods were compared with a method of enhancing a clinical database through merging with administrative data. The clinical database used for comparison contained information collected from 6,065 cardiac care patients in 1995 in the province of Alberta, Canada.
Peter D Faris; William A Ghali; Rollin Brant; Colleen M Norris; P. Diane Galbraith; Merril L Knudtson
Statistically matched files are created in an attempt to solve the practical problem that exists when no single file has the full set of variables needed for drawing important inferences. Previous methods of file matching are reviewed, and the method of file concatenation with adjusted weights and multiple imputations is described and illustrated on an artificial example. A major benefit
Donald B. Rubin
This study examines how sample attrition and missing partner data influence studies of cohabitors’ union transitions. We rely on data from both waves of the National Survey of Families and Households (NSFH). Cohabitors with missing partner information or who were lost-to-follow-up have significantly fewer years of schooling and lower yearly earnings than cohabitors with complete data. Multiple imputation techniques are
Sharon Sassler; James McNally
Analyses of multivariate data are frequently hampered by missing values. Until recently, the only missing-data methods available to most data analysts have been relatively ad hoc practices such as listwise deletion. Recent dramatic advances in theoretical and computational statistics, however, have produced a new generation of flexible procedures with a sound statistical basis. These procedures involve multiple imputation (Rubin, 1987), a simulation technique that replaces
Joseph L. Schafer; Maren K. Olsen
Background. Nonresponse bias is a concern in any epidemiologic survey in which a subset of selected individuals declines to participate. Methods. We reviewed multiple imputation, a widely applicable and easy to implement Bayesian methodology to adjust for nonresponse bias. To illustrate the method, we used data from the Canadian Multicentre Osteoporosis Study, a large cohort study of 9423
Andrew Kmetic; Lawrence Joseph; Claudie Berger; Alan Tenenhouse
Describes and assesses missing data methods currently used to analyze data from matrix sampling designs implemented by the National Assessment of Educational Progress. Several improved methods are developed, and these models are evaluated using an EM algorithm to obtain maximum likelihood estimates followed by multiple imputation of complete data…
Thomas, Neal; Gan, Nianci
Several statistical agencies use, or are considering the use of, multiple imputation to limit the risk of disclosing respondents' identities or sensitive attributes in public use data files. For example, agencies can release partially synthetic datasets, comprising the units originally surveyed with some collected values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced
Jerome P. Reiter
Almost universally, forest inventory and monitoring databases are incomplete, ranging from missing data for only a few records and a few variables, common for small land areas, to missing data for many observations and many variables, common for large land areas. For a wide variety of applications, nearest neighbor (NN) imputation methods have been developed to fill in observations of
Bianca N. I. Eskelson; Hailemariam Temesgen; Valerie Lemay; Tara M. Barrett; Nicholas L. Crookston; Andrew T. Hudak
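Illustrative note: a minimal sketch of single nearest neighbor imputation, the general idea named in the abstract above. It assumes each record has auxiliary variables observed everywhere and a target value observed only on reference records, and it uses Euclidean distance, which is one of several possible distance measures. The function name and the example values are hypothetical.

```python
import math


def nn_impute(reference, targets):
    """Impute a value for each target from its nearest reference record.

    reference: list of (auxiliary_vector, observed_value) pairs
    targets:   list of auxiliary vectors whose value is missing
    Returns the list of imputed values, in target order.
    """
    if not reference:
        raise ValueError("at least one reference record is required")
    imputed = []
    for aux in targets:
        # Pick the reference record closest in auxiliary-variable space
        nearest = min(reference, key=lambda rec: math.dist(rec[0], aux))
        imputed.append(nearest[1])
    return imputed


# Hypothetical plots: two auxiliary variables, one measured attribute
reference_plots = [((0.0, 0.0), 10.0), ((5.0, 5.0), 50.0)]
print(nn_impute(reference_plots, [(1.0, 0.5), (4.0, 6.0)]))  # [10.0, 50.0]
```

Because the imputed value is copied from a real record, the result is always a value that was actually observed somewhere in the reference data.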
Ensemble methods often produce effective classifiers by learning a set of base classifiers from a diverse collection of the training sets. In this paper, we present a system, voting on classifications from imputed learning sets (VCI), that produces those diverse training sets by randomly removing a small percentage of attribute values from the original training set, and then using an
Xiaoyuan Su; Taghi M. Khoshgoftaar; Russell Greiner
Excess zeros exhibited by dental caries data require special attention when multiple imputation is applied to such data. Objective To demonstrate a simple technique using a zero-inflated Poisson (ZIP) regression model, to perform multiple imputation for missing caries data. Methods The technique is demonstrated using data (N=24,403) from a medical office-based preventive dental program in North Carolina, where 27.2% of children (N=6,637) were missing information on physician-identified count of carious teeth. We first estimate a ZIP regression model using the non-missing caries data (N=17,766). The coefficients from the ZIP model are then used to predict the missing caries data. Results This technique results in imputed caries counts that are similar to the non-missing caries data in their distribution, especially with respect to the excess zeros in the non-missing caries data. Conclusion This technique can be easily applied to impute missing dental caries data.
Pahel, Bhavna T.; Preisser, John S.; Stearns, Sally C.; Rozier, R. Gary
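Illustrative note: a minimal sketch of how a missing count could be drawn from a zero-inflated Poisson distribution once its two parameters are known. In the study above those parameters would come from the fitted ZIP regression for each child; here `p_zero` and `lam` are hypothetical fixed values, and the regression step itself is not shown.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw a Poisson count with mean lam (Knuth's method, fine for small means)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def impute_zip_count(p_zero, lam, rng):
    """Draw one count from a zero-inflated Poisson distribution.

    p_zero: probability of a structural (excess) zero
    lam:    mean of the Poisson component
    """
    if rng.random() < p_zero:
        return 0                      # structural zero
    return draw_poisson(lam, rng)     # may still be zero by chance


rng = random.Random(2024)             # seeded so the sketch is repeatable
imputed = [impute_zip_count(0.6, 2.0, rng) for _ in range(10)]
print(imputed)
```

Drawing from the mixture, rather than from a plain Poisson, is what lets the imputed counts reproduce the excess zeros described in the abstract.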
BACKGROUND: Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by a hierarchical algorithm. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we have evaluated
Magalie Celton; Alain Malpertuy; Gaëlle Lelandais; Alexandre G. de Brevern
BACKGROUND: The outcome of in vitro fertilization (IVF) has been widely investigated over the last 30 years, but evaluation was mostly based on pregnancy rate per oocyte retrieval. Our objective was to estimate the cumulative live birth rate after four IVF aspirations, using multiple imputation that takes into account treatment interruptions. METHODS: We analysed data from 3037 couples beginning IVF
N. Soullier; J. Bouyer; J. L. Pouly; J. Guibert
Summary. In recent years there has been considerable research devoted to the development of methods for the analysis of incomplete data in longitudinal studies. Despite these advances, the methods used in practice have changed relatively little, particularly in the reporting of pharmaceutical trials. In this setting, perhaps the most widely adopted strategy for dealing with incomplete longitudinal data is imputation
Richard J. Cook; Leilei Zeng; Grace Y. Yi
DNA repair genes are important for maintaining genomic stability and limiting carcinogenesis. We analyzed all single nucleotide polymorphisms (SNPs) of 125 DNA repair genes covered by the Illumina HumanHap300 (v1.1) BeadChips in a previously conducted genome-wide association study (GWAS) of 1,154 lung cancer cases and 1,137 controls and replicated the top-hits of XRCC4 SNPs in an independent set of 597 cases and 611 controls in Texas populations. We found that six of 20 XRCC4 SNPs were associated with a decreased risk of lung cancer with a P value of 0.01 or lower in the discovery dataset, of which the most significant SNP was rs10040363 (P for allelic test = 4.89 × 10⁻⁴). Moreover, the data in this region allowed us to impute a potentially functional SNP rs2075685 (imputed P for allelic test = 1.3 × 10⁻³). A luciferase reporter assay demonstrated that the rs2075685G>T change in the XRCC4 promoter increased expression of the gene. In the replication study of rs10040363, rs1478486, rs9293329, and rs2075685, however, only rs10040363 achieved a borderline association with a decreased risk of lung cancer in a dominant model (adjusted OR = 0.80, 95% CI = 0.62–1.03, P = 0.079). In the final combined analysis of both the Texas GWAS discovery and replication datasets, the strength of the association was increased for rs10040363 (adjusted OR = 0.77, 95% CI = 0.66–0.89, Pdominant = 5 × 10⁻⁴ and P for trend = 5 × 10⁻⁴) and rs1478486 (adjusted OR = 0.82, 95% CI = 0.71–0.94, Pdominant = 6 × 10⁻³ and P for trend = 3.5 × 10⁻³). Finally, we conducted a meta-analysis of these XRCC4 SNPs with available data from published GWA studies of lung cancer with a total of 12,312 cases and 47,921 controls, in which none of these XRCC4 SNPs was associated with lung cancer risk. It appeared that rs2075685, although associated with increased expression of a reporter gene and lung cancer risk in the Texas populations, did not have an effect on lung cancer risk in other populations.
This study underscores the importance of replication using published data in larger populations. PMID:21296624
Yu, Hongping; Zhao, Hui; Wang, Li-E; Han, Younghun; Chen, Wei V.; Amos, Christopher I.; Rafnar, Thorunn; Sulem, Patrick; Stefansson, Kari; Landi, Maria Teresa; Caporaso, Neil; Albanes, Demetrius; Thun, Michael; McKay, James D.; Brennan, Paul; Wang, Yufei; Houlston, Richard S.; Spitz, Margaret R.; Wei, Qingyi
Estimation of narrow-sense heritability, h2, from genome-wide SNPs genotyped in unrelated individuals has recently attracted interest and offers several advantages over traditional pedigree-based methods. With the use of this approach, it has been estimated that over half the heritability of human height can be attributed to the ~300,000 SNPs on a genome-wide genotyping array. In comparison, only 5%–10% can be explained by SNPs reaching genome-wide significance. We investigated via simulation the validity of several key assumptions underpinning the mixed-model analysis used in SNP-based h2 estimation. Although we found that the method is reasonably robust to violations of four key assumptions, it can be highly sensitive to uneven linkage disequilibrium (LD) between SNPs: contributions to h2 are overestimated from causal variants in regions of high LD and are underestimated in regions of low LD. The overall direction of the bias can be up or down depending on the genetic architecture of the trait, but it can be substantial in realistic scenarios. We propose a modified kinship matrix in which SNPs are weighted according to local LD. We show that this correction greatly reduces the bias and increases the precision of h2 estimates. We demonstrate the impact of our method on the first seven diseases studied by the Wellcome Trust Case Control Consortium. Our LD adjustment revises downward the h2 estimate for immune-related diseases, as expected because of high LD in the major-histocompatibility region, but increases it for some nonimmune diseases. To calculate our revised kinship matrix, we developed LDAK, software for computing LD-adjusted kinships.
Speed, Doug; Hemani, Gibran; Johnson, Michael R.; Balding, David J.
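Illustrative note: a minimal sketch of a kinship matrix in which each SNP carries a weight, the general idea described in the abstract above. It assumes the common form K = X W Xᵀ / Σw with genotypes standardized per SNP. The weights are supplied by the caller, since how LDAK derives them from local LD is not described in the abstract. All names and values are hypothetical, and this is not the LDAK implementation.

```python
from statistics import mean, pstdev


def standardize(column):
    """Centre and scale one SNP's allele counts across individuals."""
    mu, sd = mean(column), pstdev(column)
    if sd == 0:
        raise ValueError("monomorphic SNP cannot be standardized")
    return [(g - mu) / sd for g in column]


def weighted_kinship(genotypes, weights):
    """Return the n x n kinship matrix with per-SNP weights.

    genotypes: one row per individual, one allele count (0, 1, 2) per SNP
    weights:   one non-negative weight per SNP (assumed given)
    """
    n, p = len(genotypes), len(weights)
    columns = [standardize([row[j] for row in genotypes]) for j in range(p)]
    total = sum(weights)
    return [
        [sum(weights[j] * columns[j][a] * columns[j][b] for j in range(p)) / total
         for b in range(n)]
        for a in range(n)
    ]


# Hypothetical data: 3 individuals, 2 SNPs, second SNP down-weighted
kinship = weighted_kinship([[0, 2], [1, 1], [2, 0]], [1.0, 0.5])
for row in kinship:
    print(row)
```

Setting every weight to 1 recovers the unweighted kinship matrix, so the weights are the only place where an LD adjustment would enter.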
Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses, including admixture mapping, identifying regions of identity by descent, and imputation. Traditionally, haplotypes are inferred from genotype data obtained from microarrays using information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method that combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype, which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods. PMID:23383995
He, Dan; Han, Buhm; Eskin, Eleazar
Recent advances in sequencing technologies promise better diagnostics for many diseases as well as better understanding of evolution of microbial populations. Single Nucleotide Polymorphisms (SNPs) are established genetic markers that aid in the identification of loci affecting quantitative traits and/or disease in a wide variety of eukaryotic species. With today's technological capabilities, it is possible to re-sequence a large set of appropriate candidate genes in individuals with a given disease and then screen for causative mutations. In addition, SNPs have been used extensively in efforts to study the evolution of microbial populations, and the recent application of random shotgun sequencing to environmental samples makes possible more extensive SNP analysis of co-occurring and co-evolving microbial populations. The program is available at http://genome.lbl.gov/vista/snpvista.
Shah, Nameeta; Teplitsky, Michael V.; Pennacchio, Len A.; Hugenholtz, Philip; Hamann, Bernd; Dubchak, Inna L.
We report the results of a high-density attempted suicide association study of the X chromosome, which genotyped 23,141 SNPs on 983 attempters and 1143 non-attempters and generated modest evidence for association for SH3KBP1 (P=1.07 × 10⁻⁴) and GRIA3 (P=4.01 × 10⁻⁴). These findings highlight the need for larger sample sets and meta-analytic approaches.
Jancic, Dubravka; Seifuddin, Fayaz; Zandi, Peter P.; Potash, James B.; Willour, Virginia L.
BACKGROUND: Desmoglein 1 (DSG1) is the target protein in the skin disease exudative epidermitis in pigs caused by virulent strains of Staphylococcus hyicus. The exfoliative toxins produced by S. hyicus digest the porcine desmoglein 1 (PIG)DSG1 by a very specific reaction. This study investigated the location of single nucleotide polymorphisms (SNPs) in the porcine desmoglein 1 gene (PIG)DSG1 in correlation
Lise Daugaard; Lars Ole Andresen; Merete Fredholm
The U.S. has been providing national-scale estimates of forest carbon (C) stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC) reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.'s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon) and spatial scales (e.g., sub-county to biome). Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood) is often highest in forests suffering from recent mortality events such as those in the northern Rocky Mountains (e.g., beetle infestations). In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area), with weaker agreement for detrital pools (e.g., standing dead trees). 
Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC) and regional scales (e.g., Reducing Emissions from Deforestation and Forest Degradation projects) while allowing timely incorporation of empirical data (e.g., annual forest inventory). PMID:23305341
Wilson, Barry Tyler; Woodall, Christopher W; Griffith, Douglas M
7 Agriculture 15 2010-01-01: May the Department of Agriculture impute conduct of one person to another? Section 3017.630, Agriculture, Regulations of the Department of Agriculture...
Large-scale genetic screens in zebrafish have identified thousands of mutations in hundreds of essential genes. The genetic mapping of these mutations is necessary to link DNA sequences to the gene functions defined by mutant phenotypes. Here, we report two advances that will accelerate the mapping of zebrafish mutations: (1) The construction of a first generation single nucleotide polymorphism (SNP) map of the zebrafish genome comprising 2035 SNPs and 178 small insertions/deletions, and (2) the development of a method for mapping mutations in which hundreds of SNPs can be scored in parallel with an oligonucleotide microarray. We have demonstrated the utility of the microarray technique in crosses with haploid and diploid embryos by mapping two known mutations to their previously identified locations. We have also used this approach to localize four previously unmapped mutations. We expect that mapping with SNPs and oligonucleotide microarrays will accelerate the molecular analysis of zebrafish mutations. [Supplemental material is available online at www.genome.org. The sequence data described in this paper have been submitted to dbSNP under accession nos. 5103507–5105537. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: J. Postlethwait, C.-B. Chien, C. Kimmel, L. Maves, and M. Westerfield.]
Stickney, Heather L.; Schmutz, Jeremy; Woods, Ian G.; Holtzer, Caleb C.; Dickson, Mark C.; Kelly, Peter D.; Myers, Richard M.; Talbot, William S.
Single genetic variants discovered so far have been only weakly associated with melanoma. This study aims to use multiple single nucleotide polymorphisms (SNPs) jointly to obtain a larger genetic effect and to improve the predictive value of a conventional phenotypic model. We analyzed 11 SNPs that were associated with melanoma risk in previous studies and were genotyped in MD Anderson Cancer Center (MDACC) and Harvard Medical School investigations. Participants with ≥15 risk alleles were 5-fold more likely to have melanoma compared to those carrying ≤6. Compared to a model using the most significant single variant rs12913832, the increase in predictive value for the model using a polygenic risk score (PRS) comprised of 11 SNPs was 0.07 (95% CI, 0.05-0.07). The overall predictive value of the PRS together with conventional phenotypic factors in the MDACC population was 0.69 (95% CI, 0.64-0.69). PRS significantly improved the risk prediction and reclassification in melanoma as compared with the conventional model. Our study suggests that a polygenic profile can improve the predictive value of an individual gene polymorphism and may be able to significantly improve the predictive value beyond conventional phenotypic melanoma risk factors.
Fang, Shenying; Han, Jiali; Zhang, Mingfeng; Wang, Li-e; Wei, Qingyi; Amos, Christopher I.; Lee, Jeffrey E.
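Illustrative note: a minimal sketch of an unweighted risk-allele count, the quantity the abstract above uses to group participants. It assumes each genotype is given as a two-letter string and that one risk allele is designated per SNP. The SNP identifiers, alleles, and function name are hypothetical placeholders, not the 11 SNPs analyzed in the study, and the study's PRS may have been constructed differently.

```python
def risk_allele_count(genotype, risk_alleles):
    """Count risk alleles carried across a panel of SNPs.

    genotype:     dict mapping SNP id -> two-letter genotype, e.g. "AG"
    risk_alleles: dict mapping SNP id -> the risk allele, e.g. "G"
    Each SNP contributes 0, 1, or 2 to the total.
    """
    missing = [snp for snp in risk_alleles if snp not in genotype]
    if missing:
        raise KeyError(f"genotype missing for: {missing}")
    return sum(genotype[snp].count(allele) for snp, allele in risk_alleles.items())


# Hypothetical three-SNP panel
panel = {"snp1": "G", "snp2": "A", "snp3": "T"}
person = {"snp1": "GG", "snp2": "AT", "snp3": "CC"}
print(risk_allele_count(person, panel))  # 3
```

With 11 SNPs the count ranges from 0 to 22, which is the scale on which the thresholds quoted in the abstract are defined.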
Background The very recent availability of fully sequenced individual human genomes is a major revolution in biology which is certainly going to provide new insights into genetic diseases and genomic rearrangements. Results We mapped the insertions, deletions and SNPs (single nucleotide polymorphisms) that are present in Craig Venter's genome, more precisely on chromosomes 17 to 22, and compared them with the human reference genome hg17. Our results show that insertions and deletions are almost absent in L1 and generally scarce in L2 isochore families (GC-poor L1+L2 isochores represent slightly over half of the human genome), whereas they increase in GC-rich isochores, largely paralleling the densities of genes, retroviral integrations and Alu sequences. The distributions of insertions/deletions are in striking contrast with those of SNPs which exhibit almost the same density across all isochore families with, however, a trend for lower concentrations in gene-rich regions. Conclusions Our study strongly suggests that the distribution of insertions/deletions is due to the structure of chromatin which is mostly open in gene-rich, GC-rich isochores, and largely closed in gene-poor, GC-poor isochores. The different distributions of insertions/deletions and SNPs are clearly related to the two different responsible mechanisms, namely recombination and point mutations.
Costantini, Maria; Bernardi, Giorgio
The recent release of the Bovine HapMap dataset represents the most detailed survey of bovine genetic diversity to date, providing an important resource for the design and development of livestock production. We studied this dataset, comprising more than 30,000 Single Nucleotide Polymorphisms (SNPs) for 19 breeds (13 taurine, three zebu, and three hybrid breeds), seeking to identify small panels of genetic markers that can be used to trace the breed of unknown cattle samples. Taking advantage of the power of Principal Components Analysis and algorithms that we have recently described for the selection of Ancestry Informative Markers from genomewide datasets, we present a decision-tree which can be used to accurately infer the origin of individual cattle. In doing so, we present a thorough examination of population genetic structure in modern bovine breeds. Performing extensive cross-validation experiments, we demonstrate that 250-500 carefully selected SNPs suffice in order to achieve close to 100% prediction accuracy of individual ancestry, when this particular set of 19 breeds is considered. Our methods, coupled with the dense genotypic data that is becoming increasingly available, have the potential to become a valuable tool and have considerable impact in worldwide livestock production. They can be used to inform the design of studies of the genetic basis of economically important traits in cattle, as well as breeding programs and efforts to conserve biodiversity. Furthermore, the SNPs that we have identified can provide a reliable solution for the traceability of breed-specific branded products.
Lewis, Jamey; Abas, Zafiris; Dadousis, Christos; Lykidis, Dimitrios; Paschou, Peristera; Drineas, Petros
We report a novel algorithm, iBLUP, to impute missing genotypes by simultaneously and comprehensively using identity by descent and linkage disequilibrium information. The simulation studies showed that the algorithm exhibited drastically greater tolerance to high missing rates, especially for rare variants, than other common imputation methods, e.g. BEAGLE and fastPHASE. At a missing rate of 70%, the accuracy of BEAGLE and fastPHASE dropped to 0.82 and 0.74 respectively while iBLUP retained an accuracy of 0.95. For minor alleles, the accuracy of BEAGLE and fastPHASE decreased to ?0.1 and 0.03, while iBLUP still had an accuracy of 0.61. We implemented the algorithm in a publicly available software package also named iBLUP. The application of iBLUP for processing real sequencing data in an outbred pig population was demonstrated.
Chen, Qiang; Liao, Rongrong; Zhang, Xiangzhe; Yang, Hongjie; Zheng, Youmin; Zhang, Zhiwu; Pan, Yuchun
Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate. Genet. Epidemiol. 35:853-860, 2011. © 2011 Wiley Periodicals, Inc.
Palin, Kimmo; Campbell, Harry; Wright, Alan F; Wilson, James F; Durbin, Richard
We consider two difficulties with standard multiple imputation methods for missing data based on Rubin's t method for confidence intervals: their often excessive width, and their instability. These problems are present most often when the number of copies is small, as is often the case when a data collection organization is making multiple completed datasets available for analysis. We suggest using mixtures of normals as an alternative to Rubin's t. We also examine the performance of improper imputation methods as an alternative to generating copies from the true posterior distribution for the missing observations. We report the results of simulation studies and analyses of data on health-related quality of life in which the methods suggested here gave narrower confidence intervals and more stable inferences, especially with small numbers of copies or non-normal posterior distributions of parameter estimates. A free R software package called MImix that implements our methods is available from CRAN.
Steele, Russell J.; Wang, Naisyin; Raftery, Adrian E.
This article presents a new methodology for solving problems resulting from missing data in large-scale item performance behavioral databases. Useful statistics corrected for missing data are described, and a new method of imputation for missing data is proposed. This methodology is applied to the Dutch Lexicon Project database recently published by Keuleers, Diependaele, and Brysbaert (Frontiers in Psychology, 1, 174, 2010), which allows us to conclude that this database fulfills the conditions of use of the method recently proposed by Courrieu, Brand-D'Abrescia, Peereman, Spieler, and Rey (2011) for testing item performance models. Two application programs in MATLAB code are provided for the imputation of missing data in databases and for the computation of corrected statistics to test models. PMID:21424187
Courrieu, Pierre; Rey, Arnaud
Microarray data is used in a large number of applications ranging from diagnosis through to drug discovery. Such data, however, often contains multiple missing genetic expressions which are generally ignored, thus degrading the reliability of inferred results. This paper presents an innovative and robust imputation framework that more accurately estimates missing values leading subsequently to better gene selection and class
Muhammad Shoaib B. Sehgal; Iqbal Gondal; Laurence Dooley
Background Genetic markers are widely used to understand the biology and population dynamics of disease vectors, but often markers are limited in the resolution they provide. In particular, the delineation of population structure, fine scale movement and patterns of relatedness are often obscured unless numerous markers are available. To address this issue in the major arbovirus vector, the yellow fever mosquito (Aedes aegypti), we used double digest Restriction-site Associated DNA (ddRAD) sequencing for the discovery of genome-wide single nucleotide polymorphisms (SNPs). We aimed to characterize the new SNP set and to test the resolution against previously described microsatellite markers in detecting broad and fine-scale genetic patterns in Ae. aegypti. Results We developed bioinformatics tools that support the customization of restriction enzyme-based protocols for SNP discovery. We showed that our approach for RAD library construction achieves unbiased genome representation that reflects true evolutionary processes. In Ae. aegypti samples from three continents we identified more than 18,000 putative SNPs. They were widely distributed across the three Ae. aegypti chromosomes, with 47.9% found in intergenic regions and 17.8% in exons of over 2,300 genes. Pattern of their imputed effects in ORFs and UTRs were consistent with those found in a recent transcriptome study. We demonstrated that individual mosquitoes from Indonesia, Australia, Vietnam and Brazil can be assigned with a very high degree of confidence to their region of origin using a large SNP panel. We also showed that familial relatedness of samples from a 0.4 km2 area could be confidently established with a subset of SNPs. Conclusions Using a cost-effective customized RAD sequencing approach supported by our bioinformatics tools, we characterized over 18,000 SNPs in field samples of the dengue fever mosquito Ae. aegypti. The variants were annotated and positioned onto the three Ae. aegypti chromosomes. 
The new SNP set provided much greater resolution in detecting population structure and estimating fine-scale relatedness than a set of polymorphic microsatellites. RAD-based markers demonstrate great potential to advance our understanding of mosquito population processes, critical for implementing new control measures against this major disease vector.
We present an approach that uses latent variable modeling and multiple imputation to correct rater bias when one group of raters tends to be more lenient in assigning a diagnosis than another. Our method assumes there exists an unobserved moderate category of patient that is assigned a positive diagnosis by one type of rater and a negative diagnosis by the other type. We present a Bayesian random effects censored ordinal probit model which allows us to calibrate the diagnoses across rater types by identifying and multiply imputing “case” or “non-case” status for patients in the moderate category. A Markov chain Monte Carlo algorithm is presented to estimate the posterior distribution of the model parameters and generate multiple imputations. Our method enables the calibrated diagnosis variable to be used in subsequent analyses while also preserving uncertainty in true diagnosis. We apply our model to diagnoses of posttraumatic stress disorder (PTSD) from a depression study where nurse practitioners were twice as likely as clinical psychologists to diagnose PTSD despite the fact that participants were randomly assigned to either a nurse or a psychologist. Our model appears to balance PTSD rates across raters, provides a good fit to the data, and preserves between-rater variability. After calibrating the diagnoses of PTSD across rater types, we perform an analysis looking at the effects of comorbid PTSD on changes in depression scores over time. Results are compared to an analysis that uses the original diagnoses and show that calibrating the PTSD diagnoses can yield different inferences.
Siddique, Juned; Crespi, Catherine M.; Gibbons, Robert D.; Green, Bonnie L.
High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms, when low coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.
Flannick, Jason; Korn, Joshua M.; Fontanillas, Pierre; Grant, George B.; Banks, Eric; Depristo, Mark A.; Altshuler, David
Background The most common application of imputation is to infer genotypes of a high-density panel of markers on animals that are genotyped for a low-density panel. However, the increase in accuracy of genomic predictions resulting from an increase in the number of markers tends to reach a plateau beyond a certain density. Another application of imputation is to increase the size of the training set with un-genotyped animals. This strategy can be particularly successful when a set of closely related individuals are genotyped. Methods Imputation on completely un-genotyped dams was performed using known genotypes from the sire of each dam, one offspring and the offspring’s sire. Two methods were applied based on either allele or haplotype frequencies to infer genotypes at ambiguous loci. Results of these methods and of two available software packages were compared. Quality of imputation under different population structures was assessed. The impact of using imputed dams to enlarge training sets on the accuracy of genomic predictions was evaluated for different populations, heritabilities and sizes of training sets. Results Imputation accuracy ranged from 0.52 to 0.93 depending on the population structure and the method used. The method that used allele frequencies performed better than the method based on haplotype frequencies. Accuracy of imputation was higher for populations with higher levels of linkage disequilibrium and with larger proportions of markers with more extreme allele frequencies. Inclusion of imputed dams in the training set increased the accuracy of genomic predictions. Gains in accuracy ranged from close to zero to 37.14%, depending on the simulated scenario. Generally, the larger the accuracy already obtained with the genotyped training set, the lower the increase in accuracy achieved by adding imputed dams. 
Conclusions Whenever a reference population resembling the family configuration considered here is available, imputation can be used to achieve an extra increase in accuracy of genomic predictions by enlarging the training set with completely un-genotyped dams. This strategy was shown to be particularly useful for populations with lower levels of linkage disequilibrium, for genomic selection on traits with low heritability, and for species or breeds for which the size of the reference population is limited.
One difficult question facing researchers is how to prioritize SNPs detected from genetic association studies for functional studies. Often a list of the top M SNPs is determined based solely on the p-value from an association analysis, where M is determined by financial/time constraints. For many studies of complex diseases, multiple analyses have been completed and integrating these multiple sets of results may be difficult. One may also wish to incorporate biological knowledge, such as whether the SNP is in the exon of a gene or a regulatory region, into the selection of markers to follow up. In this manuscript, we propose a Bayesian latent variable model (BLVM) for incorporating “features” about a SNP to estimate a latent “quality score”, with SNPs prioritized based on the posterior probability distribution of the rankings of these quality scores. We illustrate the method using data from an ovarian cancer genome-wide association study (GWAS). In addition to the application of the BLVM to the ovarian GWAS, we applied the BLVM to simulated data that mimic the setting involving the prioritization of markers across multiple GWAS for related diseases/traits. The top-ranked SNP by BLVM for the ovarian GWAS ranked 2nd and 7th based on p-values from analyses of all invasive and invasive serous cases. The top SNP based on serous case analysis p-value (which ranked 197th for invasive case analysis) was ranked 8th based on the posterior probability of being in the top 5 markers (0.13). In summary, the application of the BLVM allows for the systematic integration of multiple SNP “features” for the prioritization of loci for fine-mapping or functional studies, taking into account the uncertainty in ranking.
Fridley, Brooke L.; Iversen, Ed; Tsai, Ya-Yu; Jenkins, Gregory D.; Goode, Ellen L.; Sellers, Thomas A.
Two cohorts of women with PCOS (400 probands and affected sisters in 365 families and a case-control group including 395 women with PCOS and 171 healthy women with regular menstrual cycles) were studied to determine whether SNPs identified as susceptibility loci in genome-wide association studies of type 2 diabetes are also associated with PCOS. None of the 18 allelic variants in ten genes previously shown to be associated with type 2 diabetes were found to be associated with PCOS, but some were associated with indices of beta cell function.
Ewens, Kathryn G.; Jones, Michelle R.; Ankener, Wendy; Stewart, Douglas R.; Urbanek, Margrit; Dunaif, Andrea; Legro, Richard S.; Chua, Angela; Azziz, Ricardo; Spielman, Richard S.; Goodarzi, Mark O.; Strauss, Jerome F.
Background The Focused Assessment with Sonography for Trauma (FAST) exam is an important variable in many retrospective trauma studies. The purpose of this study was to devise an imputation method to overcome missing data for the FAST exam. Due to variability in patients’ injuries and trauma care, these data are unlikely to be missing completely at random (MCAR), raising concern for validity when analyses exclude patients with missing values. Methods Imputation was conducted under a less restrictive, more plausible missing at random (MAR) assumption. Patients with missing FAST exams had available data on alternate, clinically relevant elements that were strongly associated with FAST results in complete cases, especially when considered jointly. Subjects with missing data (32.7%) were divided into eight mutually exclusive groups based on selected variables that both described the injury and were associated with missing FAST values. Additional variables were selected within each group to classify missing FAST values as positive or negative, and correct FAST exam classification based on these variables was determined for patients with non-missing FAST values. Results Severe head/neck injury (odds ratio, OR=2.04), severe extremity injury (OR=4.03), severe abdominal injury (OR=1.94), no injury (OR=1.94), other abdominal injury (OR=0.47), other head/neck injury (OR=0.57) and other extremity injury (OR=0.45) groups had significant ORs for missing data; the other group odds ratio was not significant (OR=0.84). All 407 missing FAST values were imputed, with 109 classified as positive. Correct classification of non-missing FAST results using the alternate variables was 87.2%. Conclusions Purposeful imputation for missing FAST exams based on interactions among selected variables assessed by simple stratification may be a useful adjunct to sensitivity analysis in the evaluation of imputation strategies under different missing data mechanisms. 
This approach has the potential for widespread application in clinical and translational research and validation is warranted. Level of Evidence Level II Prognostic or Epidemiological
Fuchs, Paul A.; del Junco, Deborah J.; Fox, Erin E.; Holcomb, John B.; Rahbar, Mohammad H.; Wade, Charles A.; Alarcon, Louis H.; Brasel, Karen J.; Bulger, Eileen M.; Cohen, Mitchell J.; Myers, John G.; Muskat, Peter; Phelan, Herb A.; Schreiber, Martin A.; Cotton, Bryan A.
Single nucleotide polymorphisms (SNPs) have become the marker of choice for genome-wide association studies in many species. High-throughput sequencing of RNA was developed primarily to analyze global gene expression, but it is also an efficient way to discover SNPs in expressed genes. In this study, we sequenced the gill transcriptome of Takifugu rubripes on the Illumina HiSeq 2000 platform to identify gene-associated SNPs. A total of 27,085,235 uniquely mapped reads were obtained from 55,061,524 raw reads. A total of 56,972 putative SNPs were discovered, located in 11,327 genes. Of these, 35,839 SNPs were transitions (Ts) and 21,074 were transversions (Tv), and 88.1% of the 56,972 SNPs were assigned to the 22 chromosomes. The average minor allele frequency (MAF) of the SNPs was 0.26. GO and KEGG pathway analyses were conducted to analyze the genes containing SNPs. Validation of selected SNPs revealed that 63.4% of SNPs (34/52) were true SNPs. RNA-Seq is a cost-effective way to discover gene-associated SNPs. In this study, a large number of SNPs were identified, and these data will be useful resources for population genetic studies, evolutionary analysis, resource assessment, genetic linkage analysis and genome-wide association studies. The results of our study can also provide molecular markers to help select and cultivate T. rubripes. PMID:24747987
Cui, Jun; Wang, Hongdi; Liu, Shikai; Qiu, Xuemei; Jiang, Zhiqiang; Wang, Xiuli
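The transition/transversion split reported for the T. rubripes SNPs can be illustrated with a short sketch. The function names below are hypothetical and are not taken from the study's pipeline; only the two SNP counts come from the abstract. A substitution is a transition when both alleles are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref: str, alt: str) -> str:
    """Label a biallelic SNP as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid single-base substitution: {ref}>{alt}")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) => transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(n_transitions: int, n_transversions: int) -> float:
    """Ratio of transition count to transversion count."""
    return n_transitions / n_transversions


# Counts as reported in the abstract above.
reported_ratio = ts_tv_ratio(35_839, 21_074)
```

With the counts given in the abstract, the Ts/Tv ratio works out to roughly 1.7.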
Atlantic salmon of Eastern Canada were once of considerable importance to aboriginal, recreational, and commercial fisheries, yet many populations are now in decline, particularly those of the inner Bay of Fundy (iBoF), which were recently listed as endangered. We investigated whether nonneutral SNPs could be used to assign individual Atlantic salmon accurately to either the iBoF or the outer Bay of Fundy (oBoF) metapopulations because this has been difficult with existing neutral markers. We first searched for markers under diversifying selection by genotyping eight captively bred Bay of Fundy (BoF) populations for 320 SNP loci with the Sequenom MassARRAY™ system and then analysed the data set with four different F(ST) outlier detection programs. Three outlier loci were identified by both BayesFST and BayeScan whereas seven outlier loci, including the three previously mentioned, were identified by both Fdist and Arlequin. A subset of 14 nonneutral SNPs was more accurate (85% accuracy) than a subset of 67 neutral SNPs (75% accuracy) at assigning individual salmon back to their metapopulation. We then chose a subset of nine outlier SNP markers and used them to inexpensively genotype archived DNA samples from seven wild BoF populations using Invader™ chemistry. Hierarchical AMOVA of these independent wild samples corroborated our previous findings of significant genetic differentiation between iBoF and oBoF salmon metapopulations. Our research shows that identifying and using outlier loci is an important step towards achieving the goal of consistently and accurately distinguishing iBoF from oBoF Atlantic salmon, which will aid in their conservation. PMID:21429179
Freamo, Heather; O'Reilly, Patrick; Berg, Paul R; Lien, Sigbjørn; Boulding, Elizabeth G
An electric disk filter provides a high efficiency at high temperature. A hollow outer filter of fibrous stainless steel forms the ground electrode. A refractory filter material is placed between the outer electrode and the inner electrically isolated high voltage electrode. Air flows through the outer filter surfaces through the electrified refractory filter media and between the high voltage electrodes and is removed from a space in the high voltage electrode.
A number of studies have demonstrated that stress is involved in all aspects of smoking behavior, including initiation, maintenance and relapse. The mineralocorticoid (MR) and glucocorticoid (GR) receptors are expressed in several brain areas and play a key role in negative feedback of the hypothalamic-pituitary-adrenal (HPA) axis. As nicotine increases the activation of the HPA axis, we wondered if functional SNPs (single nucleotide polymorphisms) in MR and GR coding genes (NR3C2 rs5522 and NR3C1 rs6198, respectively) may be involved in smoking susceptibility. The sample included 627 volunteers, of which 514 were never-smokers and 113 lifetime smokers. We report an interaction effect between rs5522 and rs6198 SNPs. The odds ratio (OR) for the presence of the NR3C2 rs5522 Val allele in NR3C1 rs6198 G carriers was 0.18 (P = 0.007), while in rs6198 G noncarriers the OR was 1.83 (P = 0.027). We also found main effects of the NR3C1 rs6198 G allele on number of cigarettes smoked per day (P = 0.027) and in total score of the Fagerström Test for Nicotine Dependence (P = 0.007). These findings are consistent with a possible link between NR3C2 and NR3C1 polymorphisms and smoking behavior and provide a first partial replication for a nominally significant GWAS finding between NR3C2 and tobacco smoking. PMID:23543128
Rovaris, Diego L; Mota, Nina R; de Azeredo, Lucas A; Cupertino, Renata B; Bertuzzi, Guilherme P; Polina, Evelise R; Contini, Verônica; Kortmann, Gustavo L; Vitola, Eduardo S; Grevet, Eugenio H; Grassi-Oliveira, Rodrigo; Callegari-Jacques, Sidia M; Bau, Claiton H D
CD14 is a monocytic differentiation antigen that regulates innate immune responses to pathogens. Here, we show that murine Cd14 SNPs regulate the length of Cd14 mRNA and CD14 protein translation efficiency, and consequently the basal level of soluble CD14 (sCD14) and type I IFN production by murine macrophages. This has substantial downstream consequences for the innate immune response; the level of expression of at least 40 IFN-responsive murine genes was altered by this mechanism. We also observed that there was substantial variation in the length of human CD14 mRNAs and in their translation efficiency. sCD14 increased cytokine production by human dendritic cells (DCs), and sCD14-primed DCs augmented human CD4T cell proliferation. These findings may provide a mechanism for exploring the complex relationship between CD14 SNPs, serum sCD14 levels, and susceptibility to human infectious and allergic diseases. PMID:22445606
Liu, Hong-Hsing; Hu, Yajing; Zheng, Ming; Suhoski, Megan M; Engleman, Edgar G; Dill, David L; Hudnall, Matt; Wang, Jianmei; Spolski, Rosanne; Leonard, Warren J; Peltz, Gary
Evolution by natural selection acts on natural populations amidst migration, gene-by-environmental interactions, constraints, and tradeoffs, which affect the rate and frequency of adaptive change. We asked how many and how rapidly loci change in populations subject to severe, recent environmental changes. To address these questions, we used genomic approaches to identify randomly selected single nucleotide polymorphisms (SNPs) with evolutionarily significant patterns in three natural populations of Fundulus heteroclitus that inhabit and have adapted to highly polluted Superfund sites. Three statistical tests identified 1.4-2.5% of SNPs that were significantly different from the neutral model in each polluted population. These nonneutral patterns in populations adapted to highly polluted environments suggest that these loci or closely linked loci are evolving by natural selection. One SNP identified in all polluted populations using all tests is in the gene for the xenobiotic metabolizing enzyme, cytochrome P4501A (CYP1A), which has been identified previously as being refractory to induction in the three highly polluted populations. Extrapolating across the genome, these data suggest that rapid evolutionary change in natural populations can involve hundreds of loci, a few of which will be shared in independent events. PMID:21220761
Williams, Larissa M; Oleksiak, Marjorie F
Summary Background Genome-wide association studies (GWAS) for Parkinson's disease have linked two loci (MAPT and SNCA) to risk of Parkinson's disease. We aimed to identify novel risk loci for Parkinson's disease. Methods We did a meta-analysis of datasets from five Parkinson's disease GWAS from the USA and Europe to identify loci associated with Parkinson's disease (discovery phase). We then did replication analyses of significantly associated loci in an independent sample series. Estimates of population-attributable risk were calculated from estimates from the discovery and replication phases combined, and risk-profile estimates for loci identified in the discovery phase were calculated. Findings The discovery phase consisted of 5333 case and 12,019 control samples, with genotyped and imputed data at 7,689,524 SNPs. The replication phase consisted of 7053 case and 9007 control samples. We identified 11 loci that surpassed the threshold for genome-wide significance (p<5×10^-8). Six were previously identified loci (MAPT, SNCA, HLA-DRB5, BST1, GAK and LRRK2) and five were newly identified loci (ACMSD, STK39, MCCC1/LAMP3, SYT11, and CCDC62/HIP1R). The combined population-attributable risk was 60·3% (95% CI 43·7–69·3). In the risk-profile analysis, the odds ratio in the highest quintile of disease risk was 2·51 (95% CI 2·23–2·83) compared with 1·00 in the lowest quintile of disease risk. Interpretation These data provide an insight into the genetics of Parkinson's disease and the molecular cause of the disease and could provide future targets for therapies. Funding Wellcome Trust, National Institute on Aging, and US Department of Defense.
Background Schizophrenia is a severe brain disorder, and SNPs (single nucleotide polymorphisms) in schizophrenia-associated miRNAs are believed to be an important cause of dysregulation that might alter gene expression and ultimately result in the disease. Identifying causal SNPs in associated miRNAs may help in understanding the mechanism of schizophrenia. Results For this purpose, a method based on detecting free energy change is proposed for identifying causal SNPs in schizophrenia-associated miRNAs. A miRNA is first segmented, and the free energy change is computed after introducing an SNP into a segment. The method successfully identifies 6 of 32 known SNPs, along with some artificial SNPs, that could cause a significant change in free energy; these 6 known SNPs are supposed to be responsible for most cases of schizophrenia in the population. Conclusions The proposed method is not only a convenient way to discover causal SNPs in schizophrenia-associated miRNAs without any biochemical assay or sample comparison between cases and controls, but it also has high resolution for causal SNPs, even those that have not been reported because they are very rare in the population. Moreover, the method can be applied to discover causal SNPs in miRNAs associated with other diseases.
A vast number of the SNPs derived from genome-wide association studies are non-coding, which heightens the need for effective identification of regulatory SNPs (rSNPs) among them. However, this task remains challenging because the regulatory part of the human genome is annotated much more poorly than the coding regions. Here we describe an approach that aggregates the whole set of ENCODE ChIP-seq data in order to search for rSNPs, and we provide experimental evidence of its efficiency. The algorithm is based on the assumption that enrichment of a genomic region with transcription factor binding loci (ChIP-seq peaks) indicates a regulatory function, so that SNPs located in this region are more likely to influence transcription regulation. To ensure that the approach preferentially selects functionally meaningful SNPs, we performed enrichment analysis of several human SNP datasets associated with phenotypic manifestations. All samples were significantly enriched with SNPs falling into regions of multiple ChIP-seq peaks compared with randomly selected SNPs. For experimental verification, 40 SNPs falling into overlapping regions of at least 7 TF binding loci were selected from OMIM. The effect of the SNPs on the binding of the DNA fragments containing them to nuclear proteins from four human cell lines (HepG2, HeLaS3, HCT-116, and K562) was tested by EMSA. A radical change in the binding pattern was observed for 29 SNPs, and 6 more SNPs showed less pronounced changes. Taken together, the results demonstrate an effective way to search for potential rSNPs with the aid of ChIP-seq data provided by the ENCODE project.
Matveeva, Marina Yu.; Shilov, Alexander G.; Kashina, Elena V.; Mordvinov, Viatcheslav A.; Merkulova, Tatyana I.
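The selection rule described above (keep SNPs that fall where at least 7 transcription factor binding loci overlap) can be sketched as a simple interval count. This is an illustration only, not the authors' implementation: the `Peak` layout, the half-open coordinate convention and the function names are assumptions, and a real analysis would additionally track which factor each peak belongs to so that 7 peaks means 7 distinct binding loci.

```python
from typing import List, Tuple

# Hypothetical peak record: (chromosome, start, end), half-open interval [start, end).
Peak = Tuple[str, int, int]


def count_overlapping_peaks(chrom: str, pos: int, peaks: List[Peak]) -> int:
    """Count ChIP-seq peaks on `chrom` whose interval contains position `pos`."""
    return sum(1 for c, start, end in peaks if c == chrom and start <= pos < end)


def is_candidate_rsnp(chrom: str, pos: int, peaks: List[Peak], min_peaks: int = 7) -> bool:
    """Flag a SNP as a candidate rSNP when enough peaks overlap its position.

    The default threshold of 7 mirrors the cut-off quoted in the abstract.
    """
    return count_overlapping_peaks(chrom, pos, peaks) >= min_peaks


# Toy data: seven identical peaks on chr1 and one on chr2.
toy_peaks = [("chr1", 100, 200)] * 7 + [("chr2", 100, 200)]
```

For large peak sets, a linear scan per SNP is slow; an interval tree or sorted-coordinate sweep would be the usual replacement.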
The Aquaspace H2OME Guardian Water Filter, available through Western Water International, Inc., reduces lead in water supplies. The filter is mounted on the faucet and the filter cartridge is placed in the "dead space" between sink and wall. This filter is one of several new filtration devices using the Aquaspace compound filter media, which combines company developed and NASA technology. Aquaspace filters are used in industrial, commercial, residential, and recreational environments as well as by developing nations where water is highly contaminated.
Background Accurately estimating the period of time that individuals are exposed to online intervention content is important for understanding program engagement. This can be calculated from time-stamped data reflecting navigation to and from individual webpages. Prolonged periods of inactivity are commonly handled with a time-out feature and assigned a prespecified exposure duration. Unfortunately, this practice can lead to biased results describing program exposure. Objective The aim of the study was to describe how multiple imputations can be used to better account for the time spent viewing webpages that result in a prolonged period of inactivity or a time-out. Methods To illustrate this method, we present data on time-outs collected from the Q2 randomized smoking cessation trial. For this analysis, we evaluate the effects on intervention exposure of receiving content written in a prescriptive versus motivational tone. Using multiple imputations, we created five complete datasets in which the time spent viewing webpages that resulted in a time-out were replaced with values estimated with imputation models. We calculated standard errors using Rubin’s formulas to account for the variability due to the imputations. We also illustrate how current methods of accounting for time-outs (excluding timed-out page views or assigning an arbitrary viewing time) can influence conclusions about participant engagement. Results A total of 63.00% (1175/1865) of participants accessed the online intervention in the Q2 trial. Of the 6592 unique page views, 683 (10.36%, 683/6592) resulted in a time-out. The median time spent viewing webpages that did not result in a time-out was 1.07 minutes. Assuming participants did not spend any time viewing a webpage that resulted in a time-out, no difference between the two message tones was observed (ratio of mean time online: 0.87, 95% CI 0.75-1.02). 
Assigning 30 minutes of viewing time to all page views that resulted in a time-out concludes that participants who received content in a motivational tone spent less time viewing content (ratio of mean time online: 0.86, 95% CI 0.77-0.98) than those participants who received content in a prescriptive tone. Using multiple imputations to account for time-outs concludes that there is no difference in participant engagement between the two message tones (ratio of mean time online: 0.87; 95% CI 0.75-1.01). Conclusions The analytic technique chosen can significantly affect conclusions about online intervention engagement. We propose a standardized methodology in which time spent viewing webpages that result in a time-out is treated as missing information and corrected with multiple imputations. Trial Registration Clinicaltrials.gov NCT00992264; http://clinicaltrials.gov/ct2/show/NCT00992264 (Archived by WebCite at http://www.webcitation.org/6Kw5m8EkP).
Bogart, Andy; McClure, Jennifer B
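The pooling step mentioned in the Methods (standard errors computed with Rubin's formulas across the imputed datasets) can be sketched as follows. The function name and the numbers in the example are hypothetical and are not data from the Q2 trial; the sketch also assumes the estimate is pooled on whatever scale it is approximately normal (for a ratio of means this would usually be the log scale, which is an assumption here rather than something the abstract states).

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Combine per-imputation estimates and squared standard errors.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Made-up log-ratio estimates from five imputed datasets, for illustration only.
example_estimates = [-0.14, -0.15, -0.13, -0.16, -0.12]
example_variances = [0.006, 0.006, 0.007, 0.006, 0.007]
pooled_estimate, pooled_se = pool_rubin(example_estimates, example_variances)
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using a finite number of imputations, which is why the pooled standard error is never smaller than the square root of the average within-imputation variance.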
Background The weighted estimators generally used for analyzing case-cohort studies are not fully efficient and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI). Methods We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks. Results When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model. Conclusions MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.
Up-scaling from sparse measurements to a continuous raster of estimated values is a common problem in Earth System Science. We present a new general-purpose empirical imputation method based on associative clustering, which associates sparse measurements of dependent variables with particular multivariate clustered combinations of the independent variables, and then uses several methods to estimate values for unmeasured clusters, based on directional proximity in multidimensional data space, at both the cluster and map cell levels of resolution. We demonstrate this new imputation tool on tree species range distribution maps, which describe the suitable extent and expected growth performance of a particular tree species over a wide area. Range maps having continuous estimates of tree growth performance are more useful than more classical tree range maps that simply show binary occurrence suitability. The USDA Forest Service Forest Inventory and Analysis (FIA) plots provide information about the occurrence and growth performance for various tree species across the US, but such measurements are limited to FIA plots. Using Associative Clustering, we scale up the discontinuous FIA growth measurements into continuous maps that show the expected growth and suitability for individual tree species covering the Continental United States. A multivariate cluster analysis was applied to global output from a General Circulation Model (GCM) consisting of 17 variables downscaled to 4 km2 resolution. Present global growing conditions were divided into 30 thousand relatively homogeneous ecoregions describing climatic and topographic conditions. At every map cell a multi-linear regression was applied in 17-dimensional hyperspace to derive the suitability of a tree species where not measured using the forest inventory data. The continuous species distribution maps obtained were compared and validated against existing tree range suitability maps.
Associative Clustering is intended to be a general-purpose imputation tool, is model-free, and can be used to derive tree growth for future conditions that have no present-day analog.
Hargrove, W. W.; Kumar, J.; Hoffman, F. M.; Potter, K. M.; Mills, R. T.
Background Identifying recombination events and the chromosomal segments that constitute a gamete is useful for a number of applications in genomic analyses. In livestock, genotypic data are commonly available for half-sib families. We propose a straightforward but computationally efficient method to use single nucleotide polymorphism marker genotypes on half-sibs to reconstruct the recombination and segregation events that occurred during meiosis in a sire to form the haplotypes observed in its offspring. These meiosis events determine a block structure in paternal haplotypes of the progeny and this can be used to phase the genotypes of individuals in single half-sib families, to impute haplotypes of the sire if they are not genotyped or to impute the paternal strand of the offspring’s sequence based on sequence data of the sire. Methods The hsphase algorithm exploits information from opposing homozygotes among half-sibs to identify recombination events, and the chromosomal regions from the paternal and maternal strands of the sire (blocks) that were inherited by its progeny. This information is then used to impute the sire’s genotype, which, in turn, is used to phase the half-sib family. Accuracy (defined as R2) and performance of this approach were evaluated by using simulated and real datasets. Phasing results for the half-sibs were benchmarked against other commonly used phasing programs: AlphaPhase, BEAGLE and PedPhase 3. Results Using a simulated dataset with 20 markers per cM, and for half-sib family sizes of 4 and 40, the accuracy of block detection was 0.58 and 0.96, respectively. The accuracy of inferring sire genotypes was 0.75 and 1.00, respectively, and the accuracy of phasing was around 0.97. hsphase was more robust to genotyping errors than PedPhase 3, AlphaPhase and BEAGLE. Computationally, hsphase was much faster than AlphaPhase and BEAGLE.
Conclusions In half-sib families of size 8 and above, hsphase can accurately detect block structure of paternal haplotypes, impute genotypes of ungenotyped sires and reconstruct haplotypes in progeny. The method is much faster and more accurate than other widely used population-based phasing programs. A program implementing the method is freely available as an R package (hsphase).
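The opposing-homozygote idea in the abstract above can be illustrated with a minimal Python sketch. This is not the hsphase package's API. The function names, the 0/1/2 genotype coding (copies of one allele) and the missing code 9 are assumptions made for illustration, and genotyping errors are ignored. If two half-sibs are homozygous for different alleles at a marker, they must have received different alleles from their common sire, so the sire is heterozygous there.

```python
from itertools import combinations

MISSING = 9  # assumed missing-genotype code for this sketch


def opposing_homozygotes(geno_a, geno_b):
    """Indices of markers where two half-sibs are opposite homozygotes (0 vs 2)."""
    return [i for i, (a, b) in enumerate(zip(geno_a, geno_b)) if {a, b} == {0, 2}]


def sire_heterozygous_sites(family):
    """Markers where at least one pair of half-sibs shows opposing homozygotes.

    At such markers the two sibs inherited different paternal alleles,
    so (absent genotyping errors) the sire must be heterozygous.
    """
    sites = set()
    for geno_a, geno_b in combinations(family, 2):
        sites.update(opposing_homozygotes(geno_a, geno_b))
    return sorted(sites)


# Hypothetical genotypes for three half-sibs at six markers.
family = [
    [0, 1, 2, 2, 0, 1],
    [2, 1, 2, 0, 0, MISSING],
    [0, 0, 2, 1, 0, 2],
]
print(sire_heterozygous_sites(family))  # [0, 3]
```

Detecting the recombination blocks themselves needs more than this pairwise check. The sketch only shows which raw signal the method starts from.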
Testing for trend is an important problem, especially when one is dealing with environmental time series. The tests considered here are the usual t-test and the Mann-Kendall test, a nonparametric version widely used because it requires fewer assumptions. The aim is to assess the performance of two trend tests in time series with autocorrelation after an imputation method is applied to estimate the missing observations. The performance of the trend tests will be illustrated for some well-known data sets existing in R software.
Ramos, M. Rosário; Cordeiro, Clara
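As a rough illustration of the Mann-Kendall test named in the abstract above, the following Python sketch computes the S statistic, a normal-approximation Z score and a two-sided p-value. It assumes a complete series with no tied values and independent observations, so it does not include the tie or autocorrelation corrections that the study is concerned with. The function name is hypothetical.

```python
import math


def mann_kendall(series):
    """Return (S, Z, two-sided p) for the Mann-Kendall trend test.

    Assumes no ties and no autocorrelation (simplifying assumptions).
    """
    n = len(series)
    # S counts concordant minus discordant pairs over all i < j.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S without ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return s, z, p


s, z, p = mann_kendall([1.0, 2.0, 3.0, 4.0, 5.0])
print(s, round(z, 3), round(p, 3))
```

With missing observations, the series would first have to be completed by an imputation method, and a positive autocorrelation would make this uncorrected p-value too small.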
An electrical filter is described for removing noise from voice communications signals. Filtering is accomplished by adding balanced, with respect to a midpoint sample, spaced pairs of the sampled signal values, and then multiplying each pair by a selecte...
T. R. Edwards; H. W. Zeanah
Presents the 1978 literature review of wastewater treatment. The review is concerned with biological filters, and it covers: (1) trickling filters; (2) rotating biological contactors; and (3) miscellaneous reactors. A list of 14 references is also presented. (HM)
Klemetson, S. L.
A single nucleotide polymorphism (SNP) is a mutation in which a single base in the DNA differs from the usual base at that position. SNPs are the marker of choice in genetic analysis and are also useful in locating genes associated with diseases. SNPs are important and frequently occurring point mutations in genomes and have many practical implications. In the post-genomic era, in silico methods make it easy to study the SNPs that occur in known genomes or sequences of a species of interest. There are many online and stand-alone tools to analyse SNPs. We intend to guide the reader through software details such as algorithmic background, file requirements, operating system specificity and species specificity, if any, for SNP detection tools in plants and animals. We also list many databases and resources available today that describe SNPs in a wide range of organisms. PMID:24794070
Seal, Abhik; Gupta, Arun; Mahalaxmi, M; Aykkal, Riju; Singh, Tiratha Raj; Arunachalam, Vadivel
With the advance of sequencing technologies, whole exome sequencing has increasingly been used to identify mutations that cause human diseases, especially rare Mendelian diseases. Among the analysis steps, functional prediction (of being deleterious) plays an important role in filtering or prioritizing nonsynonymous SNP (NS) for further analysis. Unfortunately, different prediction algorithms use different information and each has its own strength and weakness. It has been suggested that investigators should use predictions from multiple algorithms instead of relying on a single one. However, querying predictions from different databases/Web-servers for different algorithms is both tedious and time consuming, especially when dealing with a huge number of NSs identified by exome sequencing. To facilitate the process, we developed dbNSFP (database for nonsynonymous SNPs' functional predictions). It compiles prediction scores from four new and popular algorithms (SIFT, Polyphen2, LRT, and MutationTaster), along with a conservation score (PhyloP) and other related information, for every potential NS in the human genome (a total of 75,931,005). It is the first integrated database of functional predictions from multiple algorithms for the comprehensive collection of human NSs. dbNSFP is freely available for download at http://sites.google.com/site/jpopgen/dbNSFP. Hum Mutat 32:894–899, 2011. © 2011 Wiley-Liss, Inc.
Liu, Xiaoming; Jian, Xueqiu; Boerwinkle, Eric
The improvement of accuracy in using the smoothing filter instead of the Kalman filter is discussed. Factors of improvement for velocity errors of up to four are shown for position measurements. Smoothing equations are presented, and it is shown that smoothing equations for the smoothing filter appear to be stable.
Lear, W. H.
Selecting an informative subset of SNPs, generally referred to as tag SNPs, to genotype and analyze is considered to be an essential step toward effective disease association studies. However, while the selected informative tag SNPs may characterize the allele information of a target genomic region, they are not necessarily the ones directly associated with disease or with functional impairment. To
Phil Hyoun Lee; Hagit Shatkay
Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. 
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
Paschou, Peristera; Ziv, Elad; Burchard, Esteban G; Choudhry, Shweta; Rodriguez-Cintron, William; Mahoney, Michael W; Drineas, Petros
Genomic selection requires a large reference population to accurately estimate single nucleotide polymorphism (SNP) effects. In some Canadian dairy breeds, the available reference populations are not large enough for accurate estimation of SNP effects for traits of interest. If marker phase is highly consistent across multiple breeds, it is theoretically possible to increase the accuracy of genomic prediction for one or all breeds by pooling several breeds into a common reference population. This study investigated the extent of linkage disequilibrium (LD) in 5 major dairy breeds using a 50,000 (50K) SNP panel and 3 of the same breeds using the 777,000 (777K) SNP panel. Correlation of pair-wise SNP phase was also investigated on both panels. The level of LD was measured using the squared correlation of alleles at 2 loci (r(2)), and the consistency of SNP gametic phases was correlated using the signed square root of these values. Because of the high cost of the 777K panel, the accuracy of imputation from lower density marker panels [6,000 (6K) or 50K] was examined both within breed and using a multi-breed reference population in Holstein, Ayrshire, and Guernsey. Imputation was carried out using FImpute V2.2 and Beagle 3.3.2 software. Imputation accuracies were then calculated as both the proportion of correct SNP filled in (concordance rate) and allelic R(2). Computation time was also explored to determine the efficiency of the different algorithms for imputation. Analysis showed that LD values >0.2 were found in all breeds at distances at or shorter than the average adjacent pair-wise distance between SNP on the 50K panel. Correlations of r-values, however, did not reach high levels (<0.9) at these distances. High correlation values of SNP phase between breeds were observed (>0.94) when the average pair-wise distances using the 777K SNP panel were examined. 
High concordance rate (0.968-0.995) and allelic R(2) (0.946-0.991) were found for all breeds when imputation was carried out with FImpute from 50K to 777K. Imputation accuracy for Guernsey and Ayrshire was slightly lower when using the imputation method in Beagle. Computing time was significantly greater when using Beagle software, with all comparable procedures being 9 to 13 times less efficient, in terms of time, compared with FImpute. These findings suggest that use of a multi-breed reference population might increase prediction accuracy using the 777K SNP panel and that 777K genotypes can be efficiently and effectively imputed using the lower density 50K SNP panel. PMID:24582440
Larmer, S G; Sargolzaei, M; Schenkel, F S
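The quantities used in the abstract above (LD as the squared correlation of alleles at two loci, its signed square root for phase consistency, and the concordance rate of imputed genotypes) can be sketched in Python as follows. This is an illustration under stated assumptions, not the computation performed by FImpute or Beagle. It assumes phased haplotypes coded as (a, b) pairs of 0/1 alleles, and the function names are hypothetical.

```python
import math


def ld_r(haplotypes):
    """Signed correlation r between alleles at two loci.

    `haplotypes` is a list of (a, b) pairs with alleles coded 0/1.
    Returns NaN when either locus is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d / math.sqrt(denom)


def concordance_rate(true_genotypes, imputed_genotypes):
    """Proportion of imputed genotypes that match the true genotypes."""
    matches = sum(t == g for t, g in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


coupling = [(1, 1), (1, 1), (0, 0), (0, 0)]   # alleles always travel together
repulsion = [(1, 0), (1, 0), (0, 1), (0, 1)]  # alleles never travel together
print(ld_r(coupling) ** 2, ld_r(repulsion) ** 2)  # both have r^2 = 1
print(ld_r(coupling), ld_r(repulsion))            # signs differ: phase differs
```

The two toy populations have identical r² but opposite signs of r, which is why the signed value, not r², is the quantity to correlate across breeds when checking phase consistency.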
In this work, we analyzed the genetic variation that can alter the expression and function of the BRCA2 gene using computational methods. Of the 534 SNPs in total, 101 were found to be nonsynonymous (nsSNPs). Of the 7 SNPs in the untranslated regions (UTRs), 3 were found in the 5' UTR and 4 in the 3' UTR. Of the 101 nsSNPs investigated, 20.7% were found to be damaging by both the SIFT and PolyPhen servers. The UTR resource tool suggested that 2 SNPs in the 5' UTR and 4 SNPs in the 3' UTR might change protein expression levels. The mutation from asparagine to isoleucine at position 3124 of the native BRCA2 protein was the most deleterious according to both the SIFT and PolyPhen servers. A structural analysis of the mutated and native proteins gave an RMSD value of 0.301 nm. Based on this work, we propose that this most deleterious nsSNP, with SNP ID rs28897759, is an important candidate for the cause of breast cancer by the BRCA2 gene. PMID:18724707
Rajasekaran, R; Doss, George Priya; Sudandiradoss, C; Ramanathan, K; Rituraj, Purohit; Sethumadhavan, Rao
Motivation: The NCBI dbSNP database lists over 9 million SNPs in the human genome, but currently contains limited annotation information. SNPs that result in amino-acid residue changes (nsSNPs) are of critical importance in variation between individuals, including disease and drug sensitivity. Results: We have developed LS-SNP, a genomic-scale software pipeline to annotate nsSNPs. LS-SNP comprehensively maps nsSNPs onto
Rachel Karchin; Mark Diekhans; Libusha Kelly; Daryl J. Thomas; Ursula Pieper; Narayanan Eswar; David Haussler; Andrej Sali
Basic aspects in the handling of fatty acid-data have remained largely underexposed. Of these, we aimed to address three statistical methodological issues, by quantitatively exemplifying their imminent confounding impact on analytical outcomes: (1) presenting results as relative percentages or absolute concentrations, (2) handling of missing/non-detectable values, and (3) using structural indices for data-reduction. Therefore, we reanalyzed an example dataset containing erythrocyte fatty acid-concentrations of 137 recurrently depressed patients and 73 controls. First, correlations between data presented as percentages and concentrations varied for different fatty acids, depending on their correlation with the total fatty acid-concentration. Second, multiple imputation of non-detects resulted in differences in significance compared to zero-substitution or omission of non-detects. Third, patients' chain length-, unsaturation-, and peroxidation-indices were significantly lower compared to controls, which corresponded with patterns interpreted from individual fatty acid tests. In conclusion, results from our example dataset show that statistical methodological choices can have a significant influence on outcomes of fatty acid analysis, which emphasizes the relevance of: (1) hypothesis-based fatty acid-presentation (percentages or concentrations), (2) multiple imputation, preventing bias introduced by non-detects; and (3) the possibility of using (structural) indices, to delineate fatty acid-patterns thereby preventing multiple testing. PMID:22446846
Mocking, Roel J T; Assies, Johanna; Lok, Anja; Ruhé, Henricus G; Koeter, Maarten W J; Visser, Ieke; Bockting, Claudi L H; Schene, Aart H
Three experiments evaluated an imputed pitch velocity model of the auditory kappa effect. Listeners heard 3-tone sequences and judged the timing of the middle (target) tone relative to the timing of the 1st and 3rd (bounding) tones. Experiment 1 held pitch constant but varied the time (T) interval between bounding tones (T = 728, 1,000, or 1,600 ms) in order to establish baseline performance levels for the 3 values of T. Experiments 2 and 3 combined the values of T tested in Experiment 1 with a pitch manipulation in order to create fast (8 semitones/728 ms), medium (8 semitones/1,000 ms), and slow (8 semitones/1,600 ms) velocity conditions. Consistent with an auditory motion hypothesis, distortions in perceived timing were larger for fast than for slow velocity conditions for both ascending sequences (Experiment 2) and descending sequences (Experiment 3). Overall, results supported the proposed imputed pitch velocity model of the auditory kappa effect. PMID:19331507
Henry, Molly J; McAuley, J Devin
What is the link, if any, between the patterns of connections in the brain and the behavioural effects of localized brain lesions? We explored this question in four related ways. First, we investigated the distribution of activity decrements that followed simulated damage to elements of the thalamocortical network, using integrative mechanisms that have recently been used to successfully relate connection data to information on the spread of activation, and to account simultaneously for a variety of lesion effects. Second, we examined the consequences of the patterns of decrement seen in the simulation for each type of inference that has been employed to impute function to structure on the basis of the effects of brain lesions. Every variety of conventional inference, including double dissociation, readily misattributed function to structure. Third, we tried to derive a more reliable framework of inference for imputing function to structure, by clarifying concepts of function, and exploring a more formal framework, in which knowledge of connectivity is necessary but insufficient, based on concepts capable of mathematical specification. Fourth, we applied this framework to inferences about function relating to a simple network that reproduces intact, lesioned and paradoxically restored orientating behaviour. Lesion effects could be used to recover detailed and reliable information on which structures contributed to particular functions in this simple network. Finally, we explored how the effects of brain lesions and this formal approach could be used in conjunction with information from multiple neuroscience methodologies to develop a practical and reliable approach to inferring the functional roles of brain structures.
Young, M P; Hilgetag, C C; Scannell, J W
Summary Reproductive hormone levels are highly variable among premenopausal women during the menstrual cycle. Accurate timing of hormone measurement is essential, especially when investigating day- or phase-specific effects. The BioCycle Study used daily urine home fertility monitors to help detect the luteinising hormone (LH) surge in order to schedule visits with biologically relevant windows of hormonal variability. However, as the LH surge is brief and cycles vary in length, relevant hormonal changes may not align with scheduled visits even when fertility monitors are used. Using monitor data, measurements were reclassified according to biological phase of the menstrual cycle to more accurate cycle phase categories. Longitudinal multiple imputation methods were applied after reclassification if no visit occurred during a given menstrual cycle phase. Reclassified cycles had more clearly defined hormonal profiles, with higher mean peak hormones (up to 141%) and reduced variability (up to 71%). We demonstrate the importance of realigning visits to biologically relevant windows when assessing phase- or day-specific effects and the feasibility of applying longitudinal multiple imputation methods. Our method has applications in settings where missing data may occur over time, where daily blood sampling for hormonal measurements is not feasible, and in other areas where timing is essential.
Mumford, Sunni L.; Schisterman, Enrique F.; Gaskins, Audrey J.; Pollack, Anna Z.; Perkins, Neil J.; Whitcomb, Brian W.; Ye, Aijun; Wactawski-Wende, Jean
Background Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes. Results Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80–120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. 
Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins. Conclusions Before availability of pepper genome sequence, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of Sanger-EST and Illumina transcriptome assemblies. These and other information have been curated in a database that we have dedicated for pepper project.
Background High-throughput re-sequencing, new genotyping technologies and the availability of reference genomes allow the extensive characterization of Single Nucleotide Polymorphisms (SNPs) and insertion/deletion events (indels) in many plant species. The rapidly increasing amount of re-sequencing and genotyping data generated by large-scale genetic diversity projects requires the development of integrated bioinformatics tools able to efficiently manage, analyze, and combine these genetic data with genome structure and external data. Results In this context, we developed SNiPlay, a flexible, user-friendly and integrative web-based tool dedicated to polymorphism discovery and analysis. It integrates: 1) a pipeline, freely accessible through the internet, combining existing softwares with new tools to detect SNPs and to compute different types of statistical indices and graphical layouts for SNP data. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNPs and indels events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation (genomic position, intron/exon location, synonymous/non-synonymous substitutions), SNP frequency determination in user-defined groups, haplotype reconstruction and network, linkage disequilibrium evaluation, and diversity analysis (Pi, Watterson's Theta, Tajima's D). Furthermore, the pipeline allows the use of external data (such as phenotype, geographic origin, taxa, stratification) to define groups and compare statistical indices. 2) a database storing polymorphisms, genotyping data and grapevine sequences released by public and private projects. 
It allows the user to retrieve SNPs using various filters (such as genomic position, missing data, polymorphism type, allele frequency), to compare SNP patterns between populations, and to export genotyping data or sequences in various formats. Conclusions Our experiments on grapevine genetic projects showed that SNiPlay allows geneticists to rapidly obtain advanced results in several key research areas of plant genetic diversity. Both the management and treatment of large amounts of SNP data are rendered considerably easier for end-users through automation and integration. Current developments are taking into account new advances in high-throughput technologies. SNiPlay is available at: http://sniplay.cirad.fr/.
It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex diseases and variable drug responses. A major stumbling block to the successful design and execution of genome-wide disease association studies using single-nucleotide polymorphisms (SNPs) and linkage disequilibrium is the enormous number of SNPs in the
Bjarni V. Halldorsson; Vineet Bafna; Ross Lippert; Russell Schwartz; Francisco M. De La Vega; Andrew G. Clark; Sorin Istrail
Background Rhesus macaques serve a critical role in the study of human biomedical research. While both Indian and Chinese rhesus macaques are commonly used, genetic differences between these two subspecies affect aspects of their behavior and physiology, including response to simian immunodeficiency virus (SIV) infection. Single nucleotide polymorphisms (SNPs) can play an important role in both establishing ancestry and in identifying genes involved in complex diseases. We sequenced the 3' end of rhesus macaque genes in an effort to identify gene-based SNPs that could distinguish between Indian and Chinese rhesus macaques and aid in association analysis. Results We surveyed the 3' end of 94 genes in 20 rhesus macaque animals. The study included 10 animals each of Indian and Chinese ancestry. We identified a total of 661 SNPs, 457 of which appeared exclusively in one or the other population. Seventy-nine additional animals were genotyped at 44 of the population-exclusive SNPs. Of those, 38 SNPs were confirmed as being population-specific. Conclusion This study demonstrates that the 3' end of genes is rich in sequence polymorphisms and is suitable for the efficient discovery of gene-linked SNPs. In addition, the results show that the genomic sequences of Indian and Chinese rhesus macaque are remarkably divergent, and include numerous population-specific SNPs. These ancestral SNPs could be used for the rapid scanning of rhesus macaques, both to establish animal ancestry and to identify gene alleles that may contribute to the phenotypic differences observed in these populations.
Ferguson, Betsy; Street, Summer L; Wright, Hollis; Pearson, Carlo; Jia, Yibing; Thompson, Shaun L; Allibone, Patrick; Dubay, Christopher J; Spindel, Eliot; Norgren, Robert B
Non-synonymous SNPs (nsSNPs), also known as Single Amino acid Polymorphisms (SAPs) account for the majority of human inherited diseases. It is important to distinguish the deleterious SAPs from neutral ones. Most traditional computational methods to classify SAPs are based on sequential or structural features. However, these features cannot fully explain the association between a SAP and the observed pathophysiological phenotype.
Tao Huang; Ping Wang; Zhi-Qiang Ye; Heng Xu; Zhisong He; Kai-Yan Feng; LeLe Hu; WeiRen Cui; Kai Wang; Xiao Dong; Lu Xie; Xiangyin Kong; Yu-Dong Cai; Yixue Li
Background The BARD1 gene encodes for the BRCA1-associated RING domain (BARD1) protein. Germ line and somatic mutations in BARD1 are found in sporadic breast, ovarian and uterine cancers. There is a plethora of single nucleotide polymorphisms (SNPs) which may or may not be involved in the onset of female cancers. Hence, before planning a larger population study, it is advisable to sort out the possible functional SNPs. To accomplish this goal, data available in the dbSNP database and different computer programs can be used. To the best of our knowledge, until now there has been no such study on record for the BARD1 gene. Therefore, this study was undertaken to find the functional nsSNPs in BARD1. Result 2.85% of all SNPs in the dbSNP database were present in the coding regions. SIFT predicted 11 out of 50 nsSNPs as not tolerable and PolyPhen assessed 27 out of 50 nsSNPs as damaging. FastSNP revealed that the rs58253676 SNP in the 3' UTR may have splicing regulator and enhancer functions. In the 5' UTR, rs17489363 and rs17426219 may alter the transcriptional binding site. The intronic region SNP rs67822872 may have a medium-high risk level. The protein structures 1JM7, 3C5R and 2NTE were predicted by PDBSum and shared 100% similarity with the BARD1 amino acid sequence. Among the predicted nsSNPs, rs4986841, rs111367604, rs13389423 and rs139785364 were identified as deleterious and damaging by the SIFT and PolyPhen programs. Additionally, I-Mutant showed a decrease in stability for these nsSNPs upon mutation. Finally, the ExPASy-PROSIT program revealed that the predicted deleterious mutations are contained in the ankyrin ring and BRCT domains. Conclusion Using the available bioinformatics tools and the data present in the dbSNP database, the four nsSNPs, rs4986841, rs111367604, rs13389423 and rs139785364, were identified as deleterious, reducing the protein stability of BARD1. Hence, these SNPs can be used for the larger population-based studies of female cancers.
Alshatwi, Ali A.; Hasan, Tarique N.; Syed, Naveed A.; Shafi, Gowhat; Grace, B. Leena
Missing not at random (MNAR) post-dropout missing data from a longitudinal clinical trial result in the collection of “biased data”, which leads to biased estimators and tests of corrupted hypotheses. In a full rank linear model analysis the model equation, $E[Y] = X\beta$, leads to the definition of the primary parameter $\beta = (X'X)^{-1}X'E[Y]$, and the definition of linear secondary parameters of the form $\theta = L\beta = L(X'X)^{-1}X'E[Y]$, including for example, a parameter representing a “treatment effect”. These parameters depend explicitly on $E[Y]$, which raises the questions: what is $E[Y]$ when some elements of the incomplete random vector $Y$ are not observed and MNAR, or when such a $Y$ is “completed” via imputation? We develop a rigorous, readily interpretable definition of $E[Y]$ in this context that leads directly to definitions of $\beta$, $\mathrm{Bias}(\hat{\beta}) = E[\hat{\beta}] - \beta$, $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - L\beta$, and the extent of hypothesis corruption. These definitions provide a basis for evaluating, comparing, and removing biases induced by various linear imputation methods for MNAR incomplete data from longitudinal clinical trials. Linear imputation methods use earlier data from a subject to impute values for post-dropout missing values and include “Last Observation Carried Forward” (LOCF) and “Baseline Observation Carried Forward” (BOCF), among others. We illustrate the methods of evaluating, comparing, and removing biases and the effects of testing corresponding corrupted hypotheses via a hypothetical, but very realistic longitudinal analgesic clinical trial.
Helms, Ronald W.; Helms-Reece, Laura; Helms, Russell W.; Helms, Mary W.
This report describes the imputation procedures used to deal with missing data in the National Education Longitudinal Study of 1988 (NELS:88), the only current National Center for Education Statistics (NCES) dataset that contains scores from cognitive tests given the same set of students at multiple time points. As is inevitable, cognitive test…
Bokossa, Maxime C.; Huang, Gary G.
We describe methods used to create a new Census data base that can be used to study comparability of industry and occupation classification systems. This project represents the most extensive application of multiple imputation to date, and the modeling effort was considerable as well—hundreds of logistic regressions were estimated. One goal of this article is to summarize the strategies used
Clifford C. Clogg; Donald B. Rubin; Nathaniel Schenker; Bradley Schultz; Lynn Weidman
We consider genomic imputation for low-coverage genotyping-by-sequencing data with high levels of missing data. We compensate for this loss of information by utilizing family relationships in multiparental experimental crosses. This nearly quadruples the number of usable markers when applied to a large rice Multiparent Advanced Generation InterCross (MAGIC) study. PMID:24583583
Huang, B Emma; Raghavan, Chitra; Mauleon, Ramil; Broman, Karl W; Leung, Hei
Validation of a sterilizing filtration process is critical since it is impossible with currently available technology to measure the sterility of each filled container; therefore, sterility assurance of the filtered product must be achieved through validation of the filtration process. Validating a pharmaceutical sterile filtration process involves three things: determining the effect of the liquid on the filter, determining the effect of the filter on the liquid, and demonstrating that the filter removes all microorganisms from the liquid under actual processing conditions. PMID:16570864
Madsen, Russell E
Background Environmental and biomedical researchers frequently encounter laboratory data constrained by a lower limit of detection (LOD). Commonly used methods to address these left-censored data, such as simple substitution of a constant for all values < LOD, may bias parameter estimation. In contrast, multiple imputation (MI) methods yield valid and robust parameter estimates and explicit imputed values for variables that can be analyzed as outcomes or predictors. Objective In this article we expand distribution-based MI methods for left-censored data to a bivariate setting, specifically, a longitudinal study with biological measures at two points in time. Methods We have presented the likelihood function for a bivariate normal distribution taking into account values < LOD as well as missing data assumed missing at random, and we use the estimated distributional parameters to impute values < LOD and to generate multiple plausible data sets for analysis by standard statistical methods. We conducted a simulation study to evaluate the sampling properties of the estimators, and we illustrate a practical application using data from the Community Participatory Approach to Measuring Farmworker Pesticide Exposure (PACE3) study to estimate associations between urinary acephate (APE) concentrations (indicating pesticide exposure) at two points in time and self-reported symptoms. Results Simulation study results demonstrated that imputed and observed values together were consistent with the assumed and estimated underlying distribution. Our analysis of PACE3 data using MI to impute APE values < LOD showed that urinary APE concentration was significantly associated with potential pesticide poisoning symptoms. Results based on simple substitution methods were substantially different from those based on the MI method. 
Conclusions The distribution-based MI method is a valid and feasible approach to analyze bivariate data with values < LOD, especially when explicit values for the nondetections are needed. We recommend the use of this approach in environmental and biomedical research.
Chen, Haiying; Quandt, Sara A.; Grzywacz, Joseph G.; Arcury, Thomas A.
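The imputation step described in the abstract above can be sketched in a simplified univariate form. This is only an illustration, not the authors' bivariate likelihood method: it assumes the distribution parameters `mu` and `sigma` have already been estimated (that step is omitted), that the data are on a scale where normality is plausible, and that non-detects are coded as `None`. The function name and example values are hypothetical.

```python
import random
from statistics import NormalDist

def impute_below_lod(values, lod, mu, sigma, rng):
    """Replace None (non-detect) entries with draws from N(mu, sigma) truncated above at lod."""
    dist = NormalDist(mu, sigma)
    p_lod = dist.cdf(lod)  # probability mass below the limit of detection
    completed = []
    for v in values:
        if v is None:
            # Inverse-CDF sampling restricted to the region below the LOD.
            u = rng.uniform(1e-12, p_lod)
            completed.append(dist.inv_cdf(u))
        else:
            completed.append(v)
    return completed

# Hypothetical measurements; None marks a value below the LOD of -1.0.
rng = random.Random(0)
observed = [0.3, None, -0.2, None, 1.1]
datasets = [impute_below_lod(observed, -1.0, 0.0, 1.0, rng) for _ in range(5)]
```

Each of the five completed data sets can then be analyzed with standard methods and the results combined, in contrast to substituting one constant for every non-detect.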
In this engineering activity, challenge learners to invent a water filter that cleans dirty water. Learners construct a filter device out of a 2-liter bottle and then experiment with different materials like gravel, sand, and cotton balls to see which is the most effective. Safety note: An adult's help is needed for this activity.
Students learn how CCD cameras use color filters to create astronomical images in this Moveable Museum unit. The four-page PDF guide includes suggested general background readings for educators, activity notes, and step-by-step directions. Students look at black-and-white photos to understand gray scale and construct simple red and green cellophane filters and observe magazine images through them.
Missing data are a pervasive problem in health investigations. We describe some background of missing data analysis and criticize ad-hoc methods which are prone to serious problems. We then focus on multiple imputation, in which missing cases are first filled in by several sets of plausible values to create multiple completed datasets, then standard complete-data procedures are applied to each completed dataset, and finally the multiple sets of results are combined to yield a single inference. We introduce the basic concepts and general methodology, and provide some guidance for application. For illustration, we use a study assessing the effect of cardiovascular diseases on hospice discussion for late stage lung cancer patients.
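The final combining step described above is commonly done with Rubin's rules. The following is a minimal sketch under that assumption; the function name and the three example estimates are hypothetical and are not taken from the study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical coefficient estimates and variances from m = 3 completed data sets.
pooled_estimate, total_variance = pool_rubin([1.0, 1.2, 1.4], [0.04, 0.05, 0.06])
```

The total variance exceeds the average within-imputation variance because it also carries the uncertainty due to the missing values.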
Objective To determine the effect of using Euclidean measurements and zip-code centroid geo-imputation versus more precise spatial analytical techniques in health care research. Data Sources Commercially insured members from a southeastern managed care organization. Study Design Distance from admitting inpatient facility to member's home and zip-code centroid (geographic placement) was compared using Euclidean straight-line and shortest-path drive distances (measurement technique). Data Collection Administrative claims from October 2005 to September 2006. Principal Findings Measurement technique had a greater impact on distance values compared with geographic placement. Drive distance from the geocoded address was highly correlated (r=0.99) with the Euclidean distance from the zip-code centroid. Conclusions Actual differences were relatively small. Researchers without capabilities to produce drive distance measurements and/or address geocoding techniques could rely on simple linear regressions to estimate correction factors with a high degree of confidence.
Jones, Stephen G; Ashby, Avery J; Momin, Soyal R; Naidoo, Allen
Genome wide disease association analysis using SNPs is being explored as a method for dissecting complex genetic traits and a vast number of SNPs have been generated for this purpose. As there are cost and throughput limitations of genotyping large numbers of SNPs and statistical issues regarding the large number of dependent tests on the same data set, to make association analysis practical it has been proposed that SNPs should be prioritized based on likely functional importance. The most easily identifiable functional SNPs are coding SNPs (cSNPs) and accordingly cSNPs have been screened in a number of studies. SNPs in gene regulatory sequences embedded in noncoding DNA are another class of SNPs suggested for prioritization due to their predicted quantitative impact on gene expression. The main challenge in evaluating these SNPs, in contrast to cSNPs is a lack of robust algorithms and databases for recognizing regulatory sequences in noncoding DNA. Approaches that have been previously used to delineate noncoding sequences with gene regulatory activity include cross-species sequence comparisons and the search for sequences recognized by transcription factors. We combined these two methods to sift through mouse human genomic sequences to identify putative gene regulatory elements and subsequently localized SNPs within these sequences in a 1 Megabase (Mb) region of human chromosome 5q31, orthologous to mouse chromosome 11 containing the Interleukin cluster.
Banerjee, Poulabi; Bahlo, Melanie; Schwartz, Jody R.; Loots, Gabriela G.; Houston, Kathryn A.; Dubchak, Inna; Speed, Terence P.; Rubin, Edward M.
A radiation source emits a beam of penetrating radiation toward an examination object. A protective filter, fabricated of yttrium foil attached to a bakelite card, is positioned in the path of the radiation beam between the source and the examination object. The yttrium filter has a preselected critical absorption edge operable to obstruct from the beam photon energy below 20 keV and permit a filtered beam having a photon energy above 20 keV to pass through the examination object. The filtered radiation emerging from the examination object is detected by preselected means, such as illuminated film, an X-ray intensifier, a CT scanner, or the like. The detector generates an output signal corresponding to the intensity of the emerging filtered radiation. An image processor converts the output signals to a radiographic image displaying the examination object.
In the logistic regression analysis of a small-sized, case-control study on Alzheimer's disease, some of the risk factors exhibited missing values, motivating the use of multiple imputation. Usually, Rubin's rules (RR) for combining point estimates and variances would then be used to estimate (symmetric) confidence intervals (CIs), on the assumption that the regression coefficients were distributed normally. Yet, rarely is this assumption tested, with or without transformation. In analyses of small, sparse, or nearly separated data sets, such symmetric CI may not be reliable. Thus, RR alternatives have been considered, for example, Bayesian sampling methods, but not yet those that combine profile likelihoods, particularly penalized profile likelihoods, which can remove first order biases and guarantee convergence of parameter estimation. To fill the gap, we consider the combination of penalized likelihood profiles (CLIP) by expressing them as posterior cumulative distribution functions (CDFs) obtained via a chi-squared approximation to the penalized likelihood ratio statistic. CDFs from multiple imputations can then easily be averaged into a combined CDF_c, allowing confidence limits for a parameter β at level 1 − α to be identified as those β* and β** that satisfy CDF_c(β*) = α/2 and CDF_c(β**) = 1 − α/2. We demonstrate that the CLIP method outperforms RR in analyzing both simulated data and data from our motivating example. CLIP can also be useful as a confirmatory tool, should it show that the simpler RR are adequate for extended analysis. We also compare the performance of CLIP to Bayesian sampling methods using Markov chain Monte Carlo. CLIP is available in the R package logistf. PMID:23873477
Heinze, Georg; Ploner, Meinhard; Beyea, Jan
Background: The National Trauma Data Bank (NTDB) is plagued by the problem of missing physiological data. The Glasgow Coma Scale score, Respiratory Rate and Systolic Blood Pressure are an essential part of risk adjustment strategies for trauma system evaluation and clinical research. Missing data on these variables may compromise the feasibility and the validity of trauma group comparisons. Aims: To evaluate the validity of Multiple Imputation (MI) for completing missing physiological data in the National Trauma Data Bank (NTDB), by assessing the impact of MI on 1) frequency distributions, 2) associations with mortality, and 3) risk adjustment. Methods: Analyses were based on 170,956 NTDB observations with complete physiological data (observed data set). Missing physiological data were artificially imposed on this data set and then imputed using MI (MI data set). To assess the impact of MI on risk adjustment, 100 pairs of hospitals were randomly selected with replacement and compared using adjusted Odds Ratios (OR) of mortality. OR generated by the observed data set were then compared to those generated by the MI data set. Results: Frequency distributions and associations with mortality were preserved following MI. The median absolute difference between adjusted OR of mortality generated by the observed data set and by the MI data set was 3.6% (inter-quartile range: 2.4%-6.1%). Conclusions: This study suggests that, provided it is implemented with care, MI of missing physiological data in the NTDB leads to valid frequency distributions, preserves associations with mortality, and does not compromise risk adjustment in inter-hospital comparisons of mortality.
Moore, Lynne; Hanley, James A; Lavoie, Andre; Turgeon, Alexis
Identifying signatures of selection can provide valuable insight about the genes or genomic regions that are or have been under selective pressure, which can lead to a better understanding of genotype-phenotype relationships. A common strategy for selection signature detection is to compare samples from several populations and search for genomic regions with outstanding genetic differentiation. Wright's fixation index, FST, is a useful index for evaluation of genetic differentiation between populations. The aim of this study was to detect selective signatures between different chicken groups based on SNP-wise FST calculation. A total of 96 individuals of three commercial layer breeds and 14 non-commercial fancy breeds were genotyped with three different 600K SNP-chips. After filtering, a total of 1 million SNPs were available for FST calculation. Averages of FST values were calculated for overlapping windows. Comparisons of these were then conducted between commercial egg layers and non-commercial fancy breeds, as well as between white egg layers and brown egg layers. Comparing non-commercial and commercial breeds resulted in the detection of 630 selective signatures, while 656 selective signatures were detected in the comparison between the commercial egg-layer breeds. Annotation of selection signature regions revealed various genes corresponding to production traits, for which layer breeds were selected. Among them were NCOA1, SREBF2 and RALGAPA1 associated with reproductive traits, broodiness and egg production. Furthermore, several of the detected genes were associated with growth and carcass traits, including POMC, PRKAB2, SPP1, IGF2, CAPN1, TGFb2 and IGFBP2. Our approach demonstrates that including different populations with a specific breeding history can provide a unique opportunity for a better understanding of farm animal selection.
Gholami, Mahmood; Erbe, Malena; Garke, Christian; Preisinger, Rudolf; Weigend, Annett; Weigend, Steffen; Simianer, Henner
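The SNP-wise FST calculation and the averaging over overlapping windows can be illustrated as follows. This sketch assumes two equally weighted populations and a simple heterozygosity-based Wright-style estimator; the study may well have used a different estimator, and the allele frequencies, window size, and step are hypothetical.

```python
def snp_fst(p1, p2):
    """Wright-style FST for one biallelic SNP from the allele frequencies of two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    if h_total == 0:
        return 0.0                     # monomorphic SNP carries no differentiation signal
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

def window_means(values, size, step):
    """Average values over windows of `size` SNPs; windows overlap whenever step < size."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, step)]

# Hypothetical allele frequencies of five consecutive SNPs in two groups.
freqs_a = [0.10, 0.50, 0.90, 0.20, 0.00]
freqs_b = [0.10, 0.60, 0.10, 0.80, 1.00]
fst_values = [snp_fst(a, b) for a, b in zip(freqs_a, freqs_b)]
windows = window_means(fst_values, size=3, step=1)
```

Windows with unusually high mean FST would be the candidate regions for selection signatures.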
Single-nucleotide polymorphisms (SNPs) have been emerging out of the efforts to research human diseases and ethnic disparities. A semantic network is needed for in-depth understanding of the impacts of SNPs, because phenotypes are modulated by complex networks, including biochemical and physiological pathways. We identified ethnicity-specific SNPs by eliminating overlapped SNPs from HapMap samples, and the ethnicity-specific SNPs were mapped to the UCSC RefGene lists. Ethnicity-specific genes were identified as follows: 22 genes in the USA (CEU) individuals, 25 genes in the Japanese (JPT) individuals, and 332 genes in the African (YRI) individuals. To analyze the biologically functional implications for ethnicity-specific SNPs, we focused on constructing a semantic network model. Entities for the network represented by "Gene," "Pathway," "Disease," "Chemical," "Drug," "ClinicalTrials," "SNP," and relationships between entity-entity were obtained through curation. Our semantic modeling for ethnicity-specific SNPs showed interesting results in the three categories, including three diseases ("AIDS-associated nephropathy," "Hypertension," and "Pelvic infection"), one drug ("Methylphenidate"), and five pathways ("Hemostasis," "Systemic lupus erythematosus," "Prostate cancer," "Hepatitis C virus," and "Rheumatoid arthritis"). We found ethnicity-specific genes using the semantic modeling, and the majority of our findings was consistent with the previous studies - that an understanding of genetic variability explained ethnicity-specific disparities.
Kim, HyoYoung; Yoo, Won Gi; Park, Junhyung; Kim, Heebal
Selection of genetic variants is a crucial first step in the rational design of studies aimed at explaining individual differences in susceptibility to complex human diseases or health intervention outcomes; for example, in the emerging fields of pharmacogenomics, nutrigenomics, and vaccinomics. While single nucleotide polymorphisms (SNPs) are frequently employed in these studies, the cost of genotyping a huge number of SNPs remains a limiting factor, particularly in low and middle income countries. Therefore, it is important to detect a subset of SNPs to represent the rest of SNPs with maximum possible accuracy. The present study introduces a new method, CLONTagger with parameter optimization, which uses Support Vector Machine (SVM) to predict the rest of SNPs and Clonal Selection Algorithm (CLONALG) to select tag SNPs. Furthermore, the Particle Swarm Optimization algorithm is preferred for the optimization of C and γ parameters of the Support Vector Machine. Additionally, using many datasets, we compared the proposed new method with the tag SNP selection algorithms present in literature. Our results suggest that the CLONTagger with parameter optimization can identify tag SNPs with better prediction accuracy than other methods. Application-oriented studies are warranted to evaluate the utility of this method in future research in human genetics and study of the genetic components of variable responses to drugs, nutrition, and vaccines. PMID:23758474
Ilhan, Ilhan; Tezel, Gülay
We propose an improvement of an information filtering system with independent components selection. The independent components are obtained by Independent Component Analysis and considered as topics. Selecting similar topics by focusing on their meaning is effective for improving the accuracy of information filtering. To achieve this, we select the topics by Maximum Distance Algorithm with Jensen-Shannon divergence. In addition, document vectors are represented by the selected topics. We create a user profile from transformed data with a relevance feedback. Finally, we recommend documents by the user profile and evaluate the accuracy by imputation precision. We carry out an evaluation experiment to confirm availability of the proposed method and also consider the meaning of components in this experiment.
Yokoi, Takeru; Yanagimoto, Hidekazu; Omatu, Sigeru
The Porcine SNP database has a huge number of SNPs, but these SNPs are mostly found by computer data-mining procedures and have not been well characterized. We re-sequenced 1,439 porcine public SNPs from four commercial pig breeds and one Korean domestic breed (Korean Native pig, KNP) by using two DNA pools from eight unrelated animals in each breed. These SNPs
Xiaoping Li; Sang-Wook Kim; Kyoung-Tag Do; You-Kyoung Ha; Yun-Mi Lee; Suk-Hee Yoon; Hee-Bal Kim; Jong-Joo Kim; Bong-Hwan Choi; Kwan-Suk Kim
Summary Documentation of Selected Activities from the National Household Survey on Drug Abuse. Machine Editing; Imputation; Sampling Weight Calibration; Small Area Estimation; Table Production; Disclosure.
This document summarizes the major project operations for the 2001 National Household Survey on Drug Abuse (NHSDA), in the following areas: Machine editing, imputation, sampling weight calibration, small area estimation, table production, and disclosure. Topics pre...
Multiple imputation based on chained equations (MICE) is an alternative missing genotype method that can use genetic and nongenetic auxiliary data to inform the imputation process. Previously, MICE was successfully tested on strongly linked genetic data. We have now tested it on data of the HBA2 gene which, by the experimental design used in a malaria association study in Tanzania, shows a high missing data percentage and is weakly linked with the remaining genetic markers in the data set. We constructed different imputation models and studied their performance under different missing data conditions. Overall, MICE failed to accurately predict the true genotypes. However, using the best imputation model for the data, we obtained unbiased estimates for the genetic effects, and association signals of the HBA2 gene on malaria positivity. When the whole data set was analyzed with the same imputation model, the association signal increased from 0.80 to 2.70 before and after imputation, respectively. Conversely, postimputation estimates for the genetic effects remained the same in relation to the complete case analysis but showed increased precision. We argue that these postimputation estimates are reasonably unbiased, as a result of a good study design based on matching key socio-environmental factors. PMID:24942080
Sepúlveda, Nuno; Manjurano, Alphaxard; Drakeley, Chris; Clark, Taane G
We report a genome-wide assessment of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) in schizophrenia. We investigated SNPs using 871 patients and 863 controls, following up the top hits in four independent cohorts comprising 1,460 patients and 12,995 controls, all of European origin. We found no genome-wide significant associations, nor could we provide support for any previously reported
Anna C. Need; Dongliang Ge; Michael E. Weale; Jessica Maia; Sheng Feng; Erin L. Heinzen; Kevin V. Shianna; Woohyun Yoon; Dalia Kasperavičiūtė; Massimo Gennarelli; Warren J. Strittmatter; Cristian Bonvicini; Giuseppe Rossi; Karu Jayathilake; Philip A. Cola; Joseph P. McEvoy; Richard S. E. Keefe; Elizabeth M. C. Fisher; Pamela L. St. Jean; Ina Giegling; Annette M. Hartmann; Hans-Jürgen Möller; Andreas Ruppert; Gillian Fraser; Caroline Crombie; Lefkos T. Middleton; David St. Clair; Allen D. Roses; Pierandrea Muglia; Clyde Francks; Dan Rujescu; Herbert Y. Meltzer; David B. Goldstein
Patterns of population structure provide insights into evolutionary processes and help identify groups of individuals for genotype-phenotype association studies. With increasing availability of polymorphic molecular markers across genomes, the examination of population structure using large numbers of unlinked loci has become a common practice in evolutionary biology and human genetics. The two classes of molecular variation most widely used for this purpose, short tandem repeat polymorphisms (STRPs) and single-nucleotide polymorphisms (SNPs), differ in mutational properties expected to affect population structure. To measure the relative ability of these loci to describe population structure, we compared diversity at neighboring STRPs and SNPs from 720 genomic regions in the four populations that comprise the Human HapMap. Comparing loci from the same genomic regions allowed us to focus on the contribution of mutational differences (rather than variation in genealogical history) to disparities in population structure between STRPs and SNPs. Relative to average values for SNPs from the same regions, STRPs had lower F(st), but higher G(st)' and I(n) values. STRP-SNP correlations in population structure across genomic regions were statistically significant but weak in magnitude. Separate analyses by repeat type showed that these correlations were driven primarily by tetranucleotide and trinucleotide STRPs; measures of population structure at dinucleotides and SNPs were not significantly correlated. Pairwise comparisons among populations revealed effects of divergence time on differences in population structure between STRPs and SNPs. Collectively, these results confirm that individual STRPs can provide more information about population structure than individual SNPs, but suggest that the difference in structure at STRPs and SNPs depends on local genealogical history. 
Our study motivates theoretical comparisons of population structure at loci with different mutational properties. PMID:19289600
Payseur, Bret A; Jing, Peicheng
We estimate and partition genetic variation for height, body mass index (BMI), von Willebrand factor and QT interval (QTi) using 586,898 SNPs genotyped on 11,586 unrelated individuals. We estimate that ~45%, ~17%, ~25% and ~21% of the variance in height, BMI, von Willebrand factor and QTi, respectively, can be explained by all autosomal SNPs and a further ~0.5–1% can be
Jian Yang; Teri A Manolio; Louis R Pasquale; Eric Boerwinkle; Neil Caporaso; Julie M Cunningham; Mariza de Andrade; Bjarke Feenstra; Eleanor Feingold; M Geoffrey Hayes; William G Hill; Maria Teresa Landi; Alvaro Alonso; Guillaume Lettre; Peng Lin; Hua Ling; William Lowe; Rasika A Mathias; Mads Melbye; Elizabeth Pugh; Marilyn C Cornelis; Bruce S Weir; Michael E Goddard; Peter M Visscher
Single nucleotide polymorphisms (SNPs) on chromosome 9p21 are associated with coronary artery disease, diabetes, and multiple cancers. Risk SNPs are mainly non-coding, suggesting that they influence expression and may act in cis. We examined the association between 56 SNPs in this region and peripheral blood expression of the three nearest genes CDKN2A, CDKN2B, and ANRIL using total and allelic expression in two populations of healthy volunteers: 177 British Caucasians and 310 mixed-ancestry South Africans. Total expression of the three genes was correlated (P<0.05), suggesting that they are co-regulated. SNP associations mapped by allelic and total expression were similar (r = 0.97, P = 4.8×10^-99), but the power to detect effects was greater for allelic expression. The proportion of expression variance attributable to cis-acting effects was 8% for CDKN2A, 5% for CDKN2B, and 20% for ANRIL. SNP associations were similar in the two populations (r = 0.94, P = 10^-72). Multiple SNPs were independently associated with expression of each gene (P<0.05 after correction for multiple testing), suggesting that several sites may modulate disease susceptibility. Individual SNPs correlated with changes in expression up to 1.4-fold for CDKN2A, 1.3-fold for CDKN2B, and 2-fold for ANRIL. Risk SNPs for coronary disease, stroke, diabetes, melanoma, and glioma were all associated with allelic expression of ANRIL (all P<0.05 after correction for multiple testing), while association with the other two genes was only detectable for some risk SNPs. SNPs had an inverse effect on ANRIL and CDKN2B expression, supporting a role of antisense transcription in CDKN2B regulation. Our study suggests that modulation of ANRIL expression mediates susceptibility to several important human diseases.
Cunnington, Michael S.; Santibanez Koref, Mauro; Mayosi, Bongani M.; Burn, John; Keavney, Bernard
Recent results indicate that genome-wide association studies (GWAS) have the potential to explain much of the heritability of common complex phenotypes, but methods are lacking to reliably identify the remaining associated single nucleotide polymorphisms (SNPs). We applied stratified False Discovery Rate (sFDR) methods to leverage genic enrichment in GWAS summary statistics data to uncover new loci likely to replicate in independent samples. Specifically, we use linkage disequilibrium-weighted annotations for each SNP in combination with nominal p-values to estimate the True Discovery Rate (TDR = 1-FDR) for strata determined by different genic categories. We show a consistent pattern of enrichment of polygenic effects in specific annotation categories across diverse phenotypes, with the greatest enrichment for SNPs tagging regulatory and coding genic elements, little enrichment in introns, and negative enrichment for intergenic SNPs. Stratified enrichment directly leads to increased TDR for a given p-value, mirrored by increased replication rates in independent samples. We show this in independent Crohn's disease GWAS, where we find a hundredfold variation in replication rate across genic categories. Applying a well-established sFDR methodology we demonstrate the utility of stratification for improving power of GWAS in complex phenotypes, with increased rejection rates from 20% in height to 300% in schizophrenia with traditional FDR and sFDR both fixed at 0.05. Our analyses demonstrate an inherent stratification among GWAS SNPs with important conceptual implications that can be leveraged by statistical methods to improve the discovery of loci. PMID:23637621
Schork, Andrew J; Thompson, Wesley K; Pham, Phillip; Torkamani, Ali; Roddey, J Cooper; Sullivan, Patrick F; Kelsoe, John R; O'Donovan, Michael C; Furberg, Helena; Schork, Nicholas J; Andreassen, Ole A; Dale, Anders M
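The effect of stratification on discovery can be illustrated with a much simpler procedure than the one in the abstract: the Benjamini-Hochberg step-up rule applied separately within each annotation stratum. This is an assumption-laden sketch and not the authors' sFDR method, which uses linkage disequilibrium-weighted annotations; the p-values and stratum labels are hypothetical.

```python
from collections import defaultdict

def bh_reject(pvalues, alpha):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank  # largest rank whose p-value is under its threshold
    return set(order[:k])

def stratified_reject(pvalues, strata, alpha):
    """Apply Benjamini-Hochberg separately within each stratum and merge the rejections."""
    groups = defaultdict(list)
    for i, label in enumerate(strata):
        groups[label].append(i)
    rejected = set()
    for members in groups.values():
        local = bh_reject([pvalues[i] for i in members], alpha)
        rejected |= {members[j] for j in local}
    return rejected

# Hypothetical p-values: the first three SNPs sit in an enriched "coding" stratum.
pvals = [0.001, 0.012, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]
labels = ["coding"] * 3 + ["intergenic"] * 5
pooled = bh_reject(pvals, 0.05)
stratified = stratified_reject(pvals, labels, 0.05)
```

With these made-up inputs, the pooled analysis rejects the first two SNPs while the stratified analysis also rejects the third, because the enriched stratum is tested against its own, less stringent thresholds.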
Environmental and biomedical research often produces data below the limit of detection (LOD), or left-censored data. Imputing explicit values for values < LOD in a multivariate setting, such as with longitudinal data, is difficult using a likelihood-based approach. A Bayesian multiple imputation (MI) method is introduced to handle left-censored multivariate data. A Gibbs sampler, which uses an iterative process, is employed to simulate the target multivariate distribution within a Bayesian framework. Following convergence, multiple plausible data sets are generated for analysis by standard statistical methods outside of a Bayesian framework. With explicit imputed values available variables can be analyzed as outcomes or predictors. We illustrate a practical application using longitudinal data from the Community Participatory Approach to Measuring Farmworker Pesticide Exposure (PACE3) study to evaluate the association between urinary acephate concentrations (indicating pesticide exposure) and self-reported potential pesticide poisoning symptoms. Additionally, a simulation study is used to evaluate the sampling property of the estimators for distributional parameters as well as regression coefficients estimated with the generalized estimating equation (GEE) approach. Results demonstrated that the Bayesian MI estimates performed well in most settings, and we recommend the use of this valid and feasible approach to analyze multivariate data with values < LOD.
Chen, Haiying; Quandt, Sara A.; Grzywacz, Joseph G.; Arcury, Thomas A.
Background The high levels of variation characterising the mitochondrial DNA (mtDNA) molecule are due ultimately to its high average mutation rate; moreover, mtDNA variation is deeply structured in different populations and ethnic groups. There is growing interest in selecting a reduced number of mtDNA single nucleotide polymorphisms (mtSNPs) that account for the maximum level of discrimination power in a given population. Applications of the selected mtSNP panel range from anthropologic and medical studies to forensic genetic casework. Methodology/Principal Findings This study proposes a new simulation-based method that explores the ability of different mtSNP panels to yield the maximum levels of discrimination power. The method explores subsets of mtSNPs of different sizes randomly chosen from a preselected panel of mtSNPs based on frequency. More than 2,000 complete genomes representing three main continental human population groups (Africa, Europe, and Asia) and two admixed populations (“African-Americans” and “Hispanics”) were collected from GenBank and the literature, and were used as training sets. Haplotype diversity was measured for each combination of mtSNP and compared with existing mtSNP panels available in the literature. The data indicates that only a reduced number of mtSNPs ranging from six to 22 are needed to account for 95% of the maximum haplotype diversity of a given population sample. However, only a small proportion of the best mtSNPs are shared between populations, indicating that there is not a perfect set of “universal” mtSNPs suitable for all population contexts. The discrimination power provided by these mtSNPs is much higher than the power of the mtSNP panels proposed in the literature to date. Some mtSNP combinations also yield high diversity values in admixed populations. 
Conclusions/Significance The proposed computational approach for exploring combinations of mtSNPs that optimise the discrimination power of a given set of mtSNPs is more efficient than previous empirical approaches. In contrast to precedent findings, the results seem to indicate that only few mtSNPs are needed to reach high levels of discrimination power in a population, independently of its ancestral background.
Salas, Antonio; Amigo, Jorge
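The random-subset exploration described in the abstract can be sketched as below. The sketch assumes haplotype diversity is computed with the usual unbiased estimator H = n/(n-1) · (1 − Σ p_i²) and that sequences are aligned strings of equal length; the function names, toy sequences, and trial count are hypothetical and the real method may differ in detail.

```python
import random
from collections import Counter

def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

def best_random_panel(sequences, candidate_sites, panel_size, trials, rng):
    """Try random panels of sites and keep the one giving the highest diversity."""
    best_sites, best_h = None, -1.0
    for _ in range(trials):
        sites = tuple(sorted(rng.sample(candidate_sites, panel_size)))
        # Reduce every sequence to the alleles at the sampled sites only.
        reduced = [tuple(seq[s] for s in sites) for seq in sequences]
        h = haplotype_diversity(reduced)
        if h > best_h:
            best_sites, best_h = sites, h
    return best_sites, best_h

# Toy aligned sequences; positions 0 to 3 are the candidate sites.
toy_sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
panel, diversity = best_random_panel(toy_sequences, [0, 1, 2, 3], 2, 50, random.Random(1))
```

Repeating the search for several panel sizes shows how quickly diversity approaches the value obtained from all candidate sites.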
Background The computational analysis of regulatory SNPs (rSNPs) is an essential step in the elucidation of the structure and function of regulatory networks at the cellular level. In this work we focus in particular on SNPs that potentially affect a Transcription Factor Binding Site (TFBS) to a significant extent, possibly resulting in changes to gene expression patterns or alternative splicing. The application described here is based on the MAPPER platform, a previously developed web-based system for the computational detection of TFBSs in DNA sequences. Methods rSNP-MAPPER is a computational tool that analyzes SNPs lying within predicted TFBSs and determines whether the allele substitution results in a significant change in the TFBS predictive score. The application's simple and intuitive interface supports several usage modes. For example, the user may search for potential rSNPs in the promoters of one or more genes, specified as a list of identifiers or chosen among the members of a pathway. Alternatively, the user may specify a set of SNPs to be analyzed by uploading a list of SNP identifiers or providing the coordinates of a genomic region. Finally, the user can provide two alternative sequences (wildtype and mutant), and the system will determine the location of variants to be analyzed by comparing them. Results In this paper we outline the architecture of rSNP-MAPPER, describing its intuitive and powerful user interface in detail. We then present several examples of the use of rSNP-MAPPER to reproduce and confirm experimental studies aimed at identifying regulatory SNPs in human genes, that show how rSNP-MAPPER is able to detect and characterize rSNPs with high accuracy. Results are richly annotated and can be displayed online or downloaded in a number of different formats. 
Conclusions rSNP-MAPPER is optimized for large scale work, allowing for the efficient annotation of thousands of SNPs, and is designed to assist in the genome-wide investigation of transcriptional regulatory networks, prioritizing potential rSNPs for subsequent experimental validation. rSNP-MAPPER is freely available at http://genome.ufl.edu/mapper/.
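The scoring step described in the abstract above (comparing the TFBS predictive score of the two alleles) can be illustrated with a minimal sketch. This is not rSNP-MAPPER's actual model or code: the 4-position weight matrix, the uniform background, the function names, and the `min_abs_delta` threshold are all assumptions made for illustration only.

```python
import math

# Hypothetical position weight matrix (PWM): one dict of base probabilities
# per motif position. A real matrix would come from a TFBS model library.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def site_score(site):
    """Log-odds score (bits) of a candidate site against the background."""
    if len(site) != len(PWM):
        raise ValueError("site length must match the PWM length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def allele_score_change(site, offset, alt_base):
    """Return (ref_score, alt_score, delta) for a substitution at `offset`."""
    alt_site = site[:offset] + alt_base + site[offset + 1:]
    ref_score = site_score(site)
    alt_score = site_score(alt_site)
    return ref_score, alt_score, alt_score - ref_score


def is_candidate_rsnp(site, offset, alt_base, min_abs_delta=2.0):
    """Flag the SNP when the score shifts by at least `min_abs_delta` bits.

    The threshold is an illustrative assumption, not a published cutoff.
    """
    _, _, delta = allele_score_change(site, offset, alt_base)
    return abs(delta) >= min_abs_delta
```

For example, `is_candidate_rsnp("AGCT", 1, "A")` compares the reference site with the site carrying the alternate allele at the second position. A production tool would also scan both strands and use model-specific significance thresholds.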
Seeking a more effective method of filtering highly contaminated potable water, Mike Pedersen, founder of Western Water International (WWI), learned that NASA had conducted extensive research into methods of purifying water on board manned spacecraft. The key is Aquaspace Compound, a proprietary WWI formula that blends various types of granular activated charcoal with other active and inert ingredients. Aquaspace systems remove some substances, such as chlorine, by atomic adsorption, other types of organic chemicals by mechanical filtration, and still others by catalytic reaction. Aquaspace filters are finding wide acceptance in industrial, commercial, residential and recreational applications in the U.S. and abroad.
Prior work found the APOL1, 2 and 4 genes, located on chromosome 22q12.3-q13.1, to be upregulated in brains of schizophrenic patients. We performed a family-based association study using 130 SNPs tagging the APOL gene family (APOL1-6). The subjects were 112 African-American (AA), 114 European-American (EA), 109 Chinese (Ch) and 42 Japanese (Jp) families with schizophrenia (377 families, 1171 genotyped members and 647 genotyped affecteds in total). Seven SNPs had p-values < 0.05 in the APOL1, 2 and 4 regions for the AA, EA and combined (AA and EA) samples. In the AA sample, two SNPs, rs9610449 and rs6000200, showed low p-values; and a haplotype which comprised these two SNPs yielded a p-value of 0.00029 using the global test (GT) and the allele specific test (AST). The two SNPs and the haplotype were associated with risk for schizophrenia in African-Americans. In the combined (AA and EA) sample, two SNPs, rs2003813 and rs2157249, showed low p-values; and a three SNP haplotype including these two SNPs was significant using the GT (p = 0.0013) and the AST (p = 0.000090). The association of this haplotype with schizophrenia was significant for the entire (AA, EA, Ch and Jp) sample using the GT (p = 0.00054) and the AST (p = 0.00011). Although our study is not definitive, it suggests that the APOL genes should be more extensively studied in schizophrenia.
Takahashi, Sakae; Cui, Yu-hu; Han, Yong-hua; Fagerness, Jesen A.; Galloway, Brian; Shen, Yu-cun; Kojima, Takuya; Uchiyama, Makoto; Faraone, Stephen V.; Tsuang, Ming T.
Nonsense SNPs introduce premature termination codons into genes and can result in the absence of a gene product or in a truncated and potentially harmful protein, so they are often considered disadvantageous and are associated with disease susceptibility. As such, we might expect the disrupted allele to be rare and, in healthy people, observed only in a heterozygous state. However, some, like those in the CASP12 and ACTN3 genes, are known to be present at high frequencies and to occur often in a homozygous state and seem to have been advantageous in recent human evolution. To evaluate the selective forces acting on nonsense SNPs as a class, we have carried out a large-scale experimental survey of nonsense SNPs in the human genome by genotyping 805 of them (plus control synonymous SNPs) in 1,151 individuals from 56 worldwide populations. We identified 169 genes containing nonsense SNPs that were variable in our samples, of which 99 were found with both copies inactivated in at least one individual. We found that the sampled humans differ on average by 24 genes (out of about 20,000) because of these nonsense SNPs alone. As might be expected, nonsense SNPs as a class were found to be slightly disadvantageous over evolutionary timescales, but a few nevertheless showed signs of being possibly advantageous, as indicated by unusually high levels of population differentiation, long haplotypes, and/or high frequencies of derived alleles. This study underlines the extent of variation in gene content within humans and emphasizes the importance of understanding this type of variation.
Yngvadottir, Bryndis; Xue, Yali; Searle, Steve; Hunt, Sarah; Delgado, Marcos; Morrison, Jonathan; Whittaker, Pamela; Deloukas, Panos; Tyler-Smith, Chris
Endurance training-induced changes in hemodynamic traits are heritable. However, few genes associated with heart rate training responses have been identified. The purpose of our study was to perform a genome-wide association study to uncover DNA sequence variants associated with submaximal exercise heart rate training responses in the HERITAGE Family Study. Heart rate was measured during steady-state exercise at 50 W (HR50) on 2 separate days before and after a 20-wk endurance training program in 483 white subjects from 99 families. Illumina HumanCNV370-Quad v3.0 BeadChips were genotyped using the Illumina BeadStation 500GX platform. After quality control procedures, 320,000 single-nucleotide polymorphisms (SNPs) were available for the genome-wide association study analyses, which were performed using the MERLIN software package (single-SNP analyses and conditional heritability tests) and standard regression models (multivariate analyses). The strongest associations for HR50 training response adjusted for age, sex, body mass index, and baseline HR50 were detected with SNPs at the YWHAQ locus on chromosome 2p25 (P = 8.1 × 10⁻⁷), the RBPMS locus on chromosome 8p12 (P = 3.8 × 10⁻⁶), and the CREB1 locus on chromosome 2q34 (P = 1.6 × 10⁻⁵). In addition, 37 other SNPs showed P values <9.9 × 10⁻⁵. After removal of redundant SNPs, the 10 most significant SNPs explained 35.9% of the ΔHR50 variance in a multivariate regression model. Conditional heritability tests showed that nine of these SNPs (all intragenic) accounted for 100% of the ΔHR50 heritability. Our results indicate that SNPs in nine genes related to cardiomyocyte and neuronal functions, as well as cardiac memory formation, fully account for the heritability of the submaximal heart rate training response. PMID:22174390
Rankinen, Tuomo; Sung, Yun Ju; Sarzynski, Mark A; Rice, Treva K; Rao, D C; Bouchard, Claude
DNA-based parentage determination accelerates genetic improvement in sheep by increasing pedigree accuracy. Single nucleotide polymorphism (SNP) markers can be used for determining parentage and to provide unique molecular identifiers for tracing sheep products to their source. However, the utility of a particular "parentage SNP" varies by breed depending on its minor allele frequency (MAF) and its sequence context. Our aims were to identify parentage SNPs with exceptional qualities for use in globally diverse breeds and to develop a subset for use in North American sheep. Starting with genotypes from 2,915 sheep and 74 breed groups provided by the International Sheep Genomics Consortium (ISGC), we analyzed 47,693 autosomal SNPs by multiple criteria and selected 163 with desirable properties for parentage testing. On average, each of the 163 SNPs was highly informative (MAF ≥ 0.3) in 48±5 breed groups. Nearby polymorphisms that could otherwise confound genetic testing were identified by whole genome and Sanger sequencing of 166 sheep from 54 breed groups. A genetic test with 109 of the 163 parentage SNPs was developed for matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry. The scoring rates and accuracies for these 109 SNPs were greater than 99% in a panel of North American sheep. In a blinded set of 96 families (sire, dam, and non-identical twin lambs), each parent of every lamb was identified without using the other parent's genotype. In 74 ISGC breed groups, the median estimates for probability of a coincidental match between two animals (PI), and the fraction of potential adults excluded from parentage (PE) were 1.1×10⁻³⁹ and 0.999987, respectively, for the 109 SNPs combined. The availability of a well-characterized set of 163 parentage SNPs facilitates the development of high-throughput genetic technologies for implementing accurate and economical parentage testing and traceability in many of the world's sheep breeds. PMID:24740156
Heaton, Michael P; Leymaster, Kreg A; Kalbfleisch, Theodore S; Kijas, James W; Clarke, Shannon M; McEwan, John; Maddox, Jillian F; Basnayake, Veronica; Petrik, Dustin T; Simpson, Barry; Smith, Timothy P L; Chitko-McKown, Carol G
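The sheep parentage abstract above reports a combined probability of a coincidental match (PI) across a SNP panel. As a rough illustration of how such a figure depends on minor allele frequency, the sketch below uses the textbook probability-of-identity formula for a biallelic locus, assuming Hardy-Weinberg proportions, unrelated animals, and independent loci. These assumptions and the function names are ours; the abstract does not state how its estimates were computed.

```python
def locus_pi(maf):
    """Probability that two unrelated individuals share a genotype at one
    biallelic SNP, assuming Hardy-Weinberg proportions.

    PI = p^4 + (2pq)^2 + q^4, where q is the minor allele frequency.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must be in [0, 0.5]")
    p, q = 1.0 - maf, maf
    return p ** 4 + (2 * p * q) ** 2 + q ** 4


def combined_pi(mafs):
    """Multiply per-locus PI values, assuming the loci are independent."""
    result = 1.0
    for maf in mafs:
        result *= locus_pi(maf)
    return result
```

Under these assumptions a SNP with MAF 0.5 gives the lowest per-locus PI (0.375), which is why markers with high MAF across many breeds are the most useful for identity and parentage panels.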
Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.
Band, Gavin; Le, Quang Si; Jostins, Luke; Pirinen, Matti; Kivinen, Katja; Jallow, Muminatou; Sisay-Joof, Fatoumatta; Bojang, Kalifa; Pinder, Margaret; Sirugo, Giorgio; Conway, David J.; Nyirongo, Vysaul; Kachala, David; Molyneux, Malcolm; Taylor, Terrie; Ndila, Carolyne; Peshu, Norbert; Marsh, Kevin; Williams, Thomas N.; Alcock, Daniel; Andrews, Robert; Edkins, Sarah; Gray, Emma; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Schuldt, Kathrin; Clark, Taane G.; Small, Kerrin S.; Teo, Yik Ying; Kwiatkowski, Dominic P.; Rockett, Kirk A.; Barrett, Jeffrey C.; Spencer, Chris C. A.
Disclosure limitation is an important consideration in the release of public use data sets. It is particularly challenging for longitudinal data sets, since information about an individual accumulates with repeated measures over time. Research on disclosure limitation methods for longitudinal data has been very limited. We consider here problems created by high ages in cohort studies. Because of the risk of disclosure, ages of very old respondents can often not be released; in particular this is a specific stipulation of the Health Insurance Portability and Accountability Act (HIPAA) for the release of health data for individuals. Top-coding of individuals beyond a certain age is a standard way of dealing with this issue, and it may be adequate for cross-sectional data, when a modest number of cases are affected. However, this approach leads to serious loss of information in longitudinal studies when individuals have been followed for many years. We propose and evaluate an alternative to top-coding for this situation based on multiple imputation (MI). This MI method is applied to a survival analysis of simulated data, and data from the Charleston Heart Study (CHS), and is shown to work well in preserving the relationship between hazard and covariates.
An, Di; Little, Roderick J.A.; McNally, James W.
When measurements of values that are less than the limit of detection are reported as not detected, the data are referred to as censored. The non-recording of values below the limit of detection is common in soil science research, although modelling data affected by censoring can be problematic. This paper develops and tests a modified version of Spatial Simulated Annealing, called Simulated Annealing by Variogram and Histogram form (SAVH), for drawing values for censored points given a mixed set of observed and censored data. The algorithm aims to maximise the goodness of fit between the experimental and theoretical variograms (by allowing variation in its parameters) while the imputed values are constrained to a target histogram form. In practice, the experimental histogram is estimated by transforming the available data (interval and exact observations) to quantiles and fitting a plausible distribution. The theoretical distribution of the data is used to constrain the variogram fitting. The proposed simulated annealing method is designed to find the optimal spatial arrangement of values, given by the lowest errors in variogram and histogram fitting and kriging prediction. The accuracy of the proposed method is assessed on a simulated data set in which the censored point values are known, and is compared with that of the Spatial Simulated Annealing algorithm. According to the results obtained, the SAVH approach can be recommended as a useful tool for the analysis of spatially distributed data with censoring.
Sedda, L.; Atkinson, P. M.; Barca, E.; Passarella, G.
Haplotype phasing is one of the most important problems in population genetics, as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information, which is a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a Hidden Markov Model (HMM) or assemble haplotypes by overlapping sequencing reads. A recent algorithm, Hap-seq, uses both sequencing data and reference haplotypes; it is a hybrid of a dynamic programming algorithm and an HMM, and is shown to be optimal. However, the algorithm requires an extremely large amount of memory, which is not practical for whole genome datasets. The current algorithm saves intermediate results to disk and reads them back when needed, which significantly affects its practicality. In this work, we propose an expedited version of the algorithm, Hap-seqX, which addresses the memory issue by using a posterior probability to select the records that should be saved in memory. We show that Hap-seqX can save all the intermediate results in memory and improves the execution time of the algorithm dramatically. Using this strategy, Hap-seqX is able to predict haplotypes from whole genome sequencing data. PMID:23269365
He, Dan; Eskin, Eleazar
A notch filter for the selective attenuation of a narrow band of frequencies out of a larger band was developed. A helical resonator is connected to an input circuit and an output circuit through discrete and equal capacitors, and a resistor is connected between the input and the output circuits.
Shelton, G. B. (inventor)
Highlights: ► Proper dataset partition can improve the prediction of deleterious nsSNPs. ► Partition according to original residue type at the nsSNP site is a good criterion. ► A similar strategy is expected to be promising in other machine learning problems. -- Abstract: Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been applied to train classifiers for sorting disease-associated nsSNPs from neutral ones. The continuously accumulating nsSNP data allow us to further explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either original or substituted amino acid type at the nsSNP site. Using support vector machine (SVM), training classification models on each subset resulted in an overall accuracy of 76.3% or 74.9% depending on the two different partition criteria, while training on the whole dataset obtained an accuracy of only 72.6%. Moreover, the dataset was also randomly divided into 20 subsets, but the corresponding accuracy was only 73.2%. Our results demonstrated that partitioning the whole training dataset into subsets properly, i.e., according to the residue type at the nsSNP site, will improve the performance of the trained classifiers significantly, which should be valuable in developing better tools for predicting the disease-association of nsSNPs.
Yang, Jing; Li, Yuan-Yuan [School of Biotechnology, East China University of Science and Technology, Shanghai 200237 (China); Shanghai Center for Bioinformation Technology, Shanghai 200235 (China)]; Li, Yi-Xue [School of Biotechnology, East China University of Science and Technology, Shanghai 200237 (China); Shanghai Center for Bioinformation Technology, Shanghai 200235 (China)]; Ye, Zhi-Qiang [Laboratory of Chemical Genomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen 518055 (China); Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031 (China)]
Isolated population groups are useful in conducting association studies of complex diseases to avoid various pitfalls, including those arising from population stratification. Since DNA resequencing is expensive, it is recommended that genotyping be carried out at tagSNP (tSNP) loci. For this, tSNPs identified in one isolated population need to be used in another. Unless tSNPs are highly portable across populations this strategy may result in loss of information in association studies. We examined the issue of tSNP portability by sampling individuals from 10 isolated ethnic groups from India. We generated DNA resequencing data pertaining to 3 genomic regions and identified tSNPs in each population. We defined an index of tSNP portability and showed that portability is low across isolated Indian ethnic groups. The extent of portability did not significantly correlate with genetic similarity among the populations studied here. We also analyzed our data with sequence data from individuals of African and European descent. Our results indicated that it may be necessary to carry out resequencing in a small number of individuals to discover SNPs and identify tSNPs in the specific isolated population in which a disease association study is to be conducted. PMID:17627800
Sarkar Roy, N; Farheen, S; Roy, N; Sengupta, S; Majumder, P P
Single nucleotide polymorphisms (SNPs) can play a direct or indirect role in phenotypic expression in food allergy pathogenesis. Our goal was to quantitate the expression of SNPs in relevant cytokines that were expressed in food allergic patients. SNPs in cytokine genes IL-4 and IL-10 are known to be important in IgE generation and regulation. We examined IL-4 (C-590T), IL-4Rα (1652A/G) and IL-10 (C-627A) SNPs using real-time PCR followed by restriction fragment length polymorphism (RFLP) analysis. Our results show that the AA, AG and GG genotypes for IL-4Rα (1652A/G) polymorphisms were statistically different in radioallergosorbent test (RAST) positive versus negative patients, and although no statistically significant differences were observed between genotypes in the IL-4 (C-590T) and IL-10 (C-627A) SNPs, we observed a significant decrease in IL-4 (C-590T) gene expression and increase in IL-4Rα (1652A/G) and IL-10 (C-627A) gene expression between RAST+ versus RAST− patients, respectively. We also observed significant modulation in the protein expression of IL-4 and IL-10 in the serum samples of the RAST+ patients as compared to the RAST− patients, indicating that changes in SNP expression resulted in altered phenotypic response in these patients.
Brown, Paula; Nair, Bindukumar; Sykes, Donald E.; Rich, Gary; Reynolds, Jessica L.; Aalinkeel, Ravikumar; Wheeler, John; Schwartz, Stanley A.
Graves' disease, the production of thyroid-stimulating hormone receptor-stimulating antibodies leading to hyperthyroidism, is one of the most common forms of human autoimmune disease. It is widely agreed that complex diseases are not controlled simply by an individual gene or DNA variation but by their combination. Single nucleotide polymorphisms (SNPs), which are the most common form of DNA variation, have great potential as a medical diagnostic tool. In this paper, the P-value is used as a SNP pre-selection criterion, and a wrapper algorithm with binary particle swarm optimization is used to find the rule for discriminating between affected and control subjects. We analyzed the association between combinations of SNPs and Graves' disease by investigating 108 SNPs in 384 cases and 652 controls. We evaluated our method by differentiating between cases and controls in a five-fold cross validation test, and it achieved a 72.9% prediction accuracy with a combination of 17 SNPs. The experimental results showed that SNPs, even those with a high P-value, have a greater effect on Graves' disease when acting in a combination. PMID:21318483
Wei, Bin; Peng, QinKe; Zhang, QuanWei; Li, ChenYao
Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach.
Hu, Bo; Xu, Yaomin
Use of silver nanoparticles (SNPs) is increasing in a large number of consumer products. Thus, the possible build-up of the nanoparticles in the environment is becoming a major concern. Aeromonas punctata isolated from sewage showed tolerance to 200 µg/ml SNPs. The growth kinetics data for A. punctata treated with nanoparticles were similar to those in the absence of nanoparticles. There was a reduction in the amount of exopolysaccharides (EPS) in bacterial culture supernatant after nanoparticle-supernatant interaction. EPS capping of the nanoparticles was confirmed by UV-visible, XRD and comparative FTIR analysis. The EPS-capped SNPs showed less toxicity to Escherichia coli, Staphylococcus aureus and Micrococcus luteus compared to the uncapped ones. The study suggests capping of nanoparticles by bacterially produced EPS as a probable physiological defense mechanism. PMID:21077112
Sudheer Khan, S; Bharath Kumar, E; Mukherjee, Amitava; Chandrasekaran, N
Multiple sclerosis (MS) is a complex disease with underlying genetic and environmental factors. Although alleles within the major histocompatibility complex (MHC) are known to exert strong effects on MS risk, much remains to be learned about the contributions of loci with more modest effects identified by genome-wide association studies (GWASs), as well as loci that remain undiscovered. We use a recently developed method to estimate the proportion of variance in disease liability explained by 475,806 single nucleotide polymorphisms (SNPs) genotyped in 1,854 MS cases and 5,164 controls. We reveal that ~30% of MS genetic liability is explained by SNPs in this dataset, the majority of which is accounted for by common variants. These results suggest that the unaccounted for proportion could be explained by variants that are in imperfect linkage disequilibrium with common GWAS SNPs, highlighting the potential importance of rare variants in the susceptibility to MS.
Watson, Corey T.; Disanto, Giulio; Breden, Felix; Giovannoni, Gavin; Ramagopalan, Sreeram V.
We describe a method for simultaneous amplification of 49 autosomal single nucleotide polymorphisms (SNPs) by multiplex PCR and detection of the SNP alleles by single base extension (SBE) and capillary electrophoresis. All the SNPs may be amplified from only 100 pg of genomic DNA, and the amplicon lengths range from 65 to 115 bp. The high sensitivity and the short amplicon sizes make the assay very suitable for typing of degraded DNA samples, and the low mutation rate of SNPs makes the assay very useful for relationship testing. Combined, these advantages make the assay well suited for disaster victim identifications, where the DNA from the victims may be highly degraded and the victims are identified via investigation of their relatives. The assay was validated according to the ISO 17025 standard and used for routine case work in our laboratory. PMID:22139655
Børsting, Claus; Tomas, Carmen; Morling, Niels
Background Asthma genome-wide association studies (GWAS) have identified several asthma susceptibility genes with confidence; however the relative contribution of these genetic variants or single nucleotide polymorphisms (SNPs) to clinical endpoints (as opposed to disease diagnosis) remains largely unknown. Thus the aim of this study was to firstly bridge this gap in knowledge and secondly investigate whether these SNPs or those that are in linkage disequilibrium are likely to be functional candidates with respect to regulation of gene expression, using reported data from the ENCODE project. Methods Eleven of the key SNPs identified in eight loci from recent asthma GWAS were evaluated for association with asthma and clinical outcomes, including percent predicted FEV1, bronchial hyperresponsiveness (BHR) to methacholine, severity defined by British Thoracic Society steps and positive response to skin prick test, using the family based association test additive model in a well characterised UK cohort consisting of 370 families with at least two asthmatic children. Results GSDMB SNP rs2305480 (Ser311Pro) was associated with asthma diagnosis (p = 8.9×10⁻⁴), BHR (p = 8.2×10⁻⁴) and severity (p = 1.5×10⁻⁴) with supporting evidence from a second GSDMB SNP rs11078927 (intronic). SNPs evaluated in IL33, IL18R1, IL1RL1, SMAD3, IL2RB, PDE4D, CRB1 and RAD50 did not show association with any phenotype tested when corrected for multiple testing. Analysis using ENCODE data provides further insight into the functional relevance of these SNPs. Conclusions Our results provide further support for the role of GSDMB SNPs in determining multiple asthma related phenotypes in childhood asthma including associations with lung function and disease severity.
Biomedical Optical Company of America's suntiger lenses eliminate more than 99% of harmful light wavelengths. NASA derived lenses make scenes more vivid in color and also increase the wearer's visual acuity. Distant objects, even on hazy days, appear crisp and clear; mountains seem closer, glare is greatly reduced, clouds stand out. Daytime use protects the retina from bleaching in bright light, thus improving night vision. Filtering helps prevent a variety of eye disorders, in particular cataracts and age related macular degeneration.
Background Candidate single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWASs) were often selected for validation based on their functional annotation, which was inadequate and biased. We propose to use the more than 200,000 microarray studies in the Gene Expression Omnibus to systematically prioritize candidate SNPs from GWASs. Results We analyzed all human microarray studies from the Gene Expression Omnibus, and calculated
Rong Chen; Alex A Morgan; Joel Dudley; Tarangini Deshpande; Li Li; Keiichi Kodama; Annie P Chiang; Atul J Butte
PrimerZ (http://genepipe.ngc.sinica.edu.tw/primerz/) is a web application dedicated primarily to primer design for genes and human SNPs. PrimerZ accepts genes by gene name or Ensembl accession code, and SNPs by dbSNP rs or AFFY_Probe IDs. The promoter and exon sequence information of all gene transcripts fetched from the Ensembl database (http://www.ensembl.org/) are processed before being passed on to Primer3 (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi)
Ming-fang Tsai; Yi-jung Lin; Yu-chang Cheng; Kuo-hsi Lee; Cheng-chih Huang; Yuan-tsong Chen; Adam Yao
The typing of single nucleotide polymorphisms (SNPs) located throughout the mitochondrial genome (mtGenome) can help resolve individuals with an identical HV1/HV2 mitotype. A set of 11 SNPs selected for distinguishing individuals of the most common Caucasian HV1/HV2 mitotype were incorporated in an allele specific primer extension assay. The assay was optimized for multiplex detection of SNPs at positions 3010, 4793,
Peter M. Vallone; Rebecca S. Just; Michael D. Coble; John M. Butler; Thomas J. Parsons
Background Gene variants within regulatory regions are thought to be major contributors to the variation of complex traits/diseases. Genome-wide association studies (GWAS) have identified scores of genetic variants that appear to contribute to human disease risk. However, most of these variants do not appear to be functional. Thus, the significance of the association may arise from still unknown mechanisms or from linkage disequilibrium (LD) with functional polymorphisms. In the present study, focused on functional variants related to the binding of microRNAs (miR), we utilized SNP data, including newly released 1000 Genomes Project data, to perform a genome-wide scan of SNPs that abrogate or create miR recognition element (MRE) seed sites (MRESS). Results We identified 2723 SNPs disrupting, and 22295 SNPs creating MRESSs. We estimated the percent of SNPs falling within both validated (5%) and predicted conserved MRESSs (3%). We determined 87 of these MRESS SNPs were listed in GWAS association studies, or in strong LD with a GWAS SNP, and may represent the functional variants of identified GWAS SNPs. Furthermore, 39 of these have evidence of co-expression of target mRNA and the predicted miR. We also gathered previously published eQTL data supporting a functional role for four of these SNPs shown to associate with disease phenotypes. Comparison of FST statistics (a measure of population subdivision) for predicted MRESS SNPs against non MRESS SNPs revealed a significantly higher (P = 0.0004) degree of subdivision among MRESS SNPs, suggesting a role for these SNPs in environmentally driven selection. Conclusions We have demonstrated the potential of publicly available resources to identify high priority candidate SNPs for functional studies and for disease risk prediction.
Background Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK’s recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone. Results The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes. Conclusions The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
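The two quality metrics quoted in this abstract, the Ti/Tv ratio and genotype concordance with array data, can be illustrated with a minimal sketch. The function names and the data layout (plain tuples and lists) are assumptions made for illustration; they are not taken from GATK or from the study's pipeline.

```python
# Illustrative sketch only: helper names and input layout are hypothetical.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(variants):
    """Compute transitions / transversions for (ref, alt) single-base pairs."""
    ti = tv = 0
    for ref, alt in variants:
        if ref == alt:
            continue  # not a substitution, ignore
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def concordance(seq_calls, array_calls):
    """Fraction of genotypes that agree, over sites called on both platforms.

    Genotypes are assumed to be coded 0/1/2, with None for a missing call.
    """
    compared = [
        (s, a) for s, a in zip(seq_calls, array_calls)
        if s is not None and a is not None
    ]
    if not compared:
        return float("nan")
    return sum(s == a for s, a in compared) / len(compared)
```

Under these assumptions, a filter would be judged by recomputing both quantities before and after it is applied.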
Next generation sequencing (NGS) has been widely used to study genomic variation in a variety of prokaryotes. Single nucleotide polymorphisms (SNPs) resulting from genomic comparisons need to be annotated for their functional impact on the coding sequences. We have developed a program, TRAMS, for functional annotation of genomic SNPs which is available to download as a single file executable for WINDOWS users with limited computational experience and as a Python script for Mac OS and Linux users. TRAMS needs a tab delimited text file containing SNP locations, reference nucleotide and SNPs in variant strains along with a reference genome sequence in GenBank or EMBL format. SNPs are annotated as synonymous, nonsynonymous or nonsense. Nonsynonymous SNPs in start and stop codons are separated as non-start and non-stop SNPs, respectively. SNPs in multiple overlapping features are annotated separately for each feature and multiple nucleotide polymorphisms within a codon are combined before annotation. We have also developed a workflow for Galaxy, a highly used tool for analysing NGS data, to map short reads to a reference genome and extract and annotate the SNPs. TRAMS is a simple program for rapid and accurate annotation of SNPs that will be very useful for microbiologists in analysing genomic diversity in microbial populations. PMID:23828175
Reumerman, Richard A; Tucker, Nicholas P; Herron, Paul R; Hoskisson, Paul A; Sangal, Vartul
Studies have suggested that residential exposure to extremely low frequency (50Hz) electromagnetic fields (ELF-EMF) from high voltage cables, overhead power lines, electricity substations or towers is associated with reduced birth weight and may be associated with adverse birth outcomes or even miscarriages. We previously conducted a study of 140,356 singleton live births between 2004 and 2008 in Northwest England, which suggested that close residential proximity (≤50m) to ELF-EMF sources was associated with reduced average birth weight of 212g (95%CI: -395 to -29g) but not with statistically significant increased risks for other adverse perinatal outcomes. However, the cohort was limited by missing data for most potentially confounding variables including maternal smoking during pregnancy, which was only available for a small subgroup, and residual confounding could not be excluded. This study, using the same cohort, was conducted to minimize the effects of these problems using multiple imputation to address missing data and propensity score matching to minimize residual confounding. Missing data were imputed by multiple imputation with chained equations to generate five datasets. For each dataset 115 exposed women (residing ≤50m from a residential ELF-EMF source) were propensity score matched to 1150 unexposed women. After doubly robust confounder adjustment, close proximity to a residential ELF-EMF source remained associated with a reduction in birth weight of -116g (95% confidence interval: -224:-7g). No effect was found for proximity ≤100m compared to women living further away. These results indicate that although the effect size was about half of the effect previously reported, close maternal residential proximity to sources of ELF-EMF remained associated with suboptimal fetal growth. PMID:24815339
de Vocht, Frank; Lee, Brian
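When an analysis is repeated on each of several imputed datasets (five in the abstract above), the per-dataset estimates are commonly combined with Rubin's rules. The abstract does not state which pooling method was used, so the following is only a sketch of that standard approach; the function name is hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                       # pooled point estimate
    within = sum(variances) / m                      # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between           # total variance
    return q_bar, math.sqrt(total)
```

The between-imputation term is what carries the extra uncertainty due to the missing data into the final standard error.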
Background Identification of recombination events and which chromosomal segments contributed to an individual is useful for a number of applications in genomic analyses including haplotyping, imputation, signatures of selection, and improved estimates of relationship and probability of identity by descent. Genotypic data on half-sib family groups are widely available in livestock genomics. This structure makes it possible to identify recombination events accurately even with only a few individuals and it lends itself well to a range of applications such as parentage assignment and pedigree verification. Results Here we present hsphase, an R package that exploits the genetic structure found in half-sib livestock data to identify and count recombination events, impute and phase un-genotyped sires and phase its offspring. The package also allows reconstruction of family groups (pedigree inference), identification of pedigree errors and parentage assignment. Additional functions in the package allow identification of genomic mapping errors, imputation of paternal high density genotypes from low density genotypes, evaluation of phasing results either from hsphase or from other phasing programs. Various diagnostic plotting functions permit rapid visual inspection of results and evaluation of datasets. Conclusion The hsphase package provides a suite of functions for analysis and visualization of genomic structures in half-sib family groups implemented in the widely used R programming environment. Low level functions were implemented in C++ and parallelized to improve performance. hsphase was primarily designed for use with high density SNP array data but it is fast enough to run directly on sequence data once they become more widely available. The package is available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm.
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets. PMID:11252602
Hopke, P K; Liu, C; Rubin, D B
BACKGROUND: Genetic linkage maps are necessary for mapping of mendelian traits and quantitative trait loci (QTLs). To identify the actual genes, which control these traits, a map based on gene-associated single nucleotide polymorphism (SNP) markers is highly valuable. In this study, the SNPs were genotyped in a large family material comprising more than 5,000 piglets derived from 12 Duroc boars
Rikke KK Vingborg; ViviR R Gregersen; Bujie Zhan; Frank Panitz; Anette Høj; Kirsten K Sørensen; Lone B Madsen; Knud Larsen; Henrik Hornshøj; Xuefei Wang; Christian Bendixen
The large number of published meta-analyses on the associations between single-nucleotide polymorphisms (SNPs) and suicidal behavior mirrors the enormous research interest in this topic. Although meta-analytic evidence is abundant and certain patterns are apparent, those have not been integrated into a general framework as of yet. In a systematic review, genetic association studies between SNPs and suicidal behavior were identified. Previously published meta-analyses for eight SNPs were updated and the results of the different meta-analyses were compared. Meta-analyses for 15 SNPs, which had not been subjected to meta-analysis before, were conducted. The present meta-analytical field synopsis showed five major similarities between new and published analyses: 1) Summary effect sizes were small and rarely statistically significant, 2) heterogeneity between studies was often substantial, 3) there were no time trends, 4) effects were easily swayed and were largely dependent on individual studies, and 5) publication bias does not play a role in this field of research. Meta-analytic data show once more that major contributions of single genes are unlikely. However, association studies and corresponding meta-analyses have been an important and necessary stepping stone in the development of modern and more complex approaches in the genetics of suicidal behavior. PMID:23831262
Schild, Anne H E; Pietschnig, Jakob; Tran, Ulrich S; Voracek, Martin
DNA markers used for individual identification in forensic sciences are based on repeat sequences in nuclear DNA and the mitochondrial DNA hypervariable regions 1 and 2. An alternative to these markers is the use of single nucleotide polymorphisms (SNPs). These have a particular advantage in the analysis of degraded or poor samples, which are often all that is available in
Elizabet Petkovski; Christine Keyser-Tracqui; Rémi Hienne; Bertrand Ludes
High-density SNP arrays developed for humans and their companion species provide a rapid and convenient tool for generating SNP data in closely-related non-model organisms, but have not yet been widely applied to phylogenetically divergent taxa. Consequently, we used the CanineHD BeadChip to genotype 24 Antarctic fur seal (Arctocephalus gazella) individuals. Despite seals and dogs having diverged around 44 million years ago, 33,324 out of 173,662 loci (19.2%) could be genotyped, of which 173 were polymorphic and clearly interpretable. Two SNPs were validated using KASP genotyping assays, with the resulting genotypes being 100% concordant with those obtained from the high-density array. Two loci were also confirmed through in silico visualisation after mapping them to the fur seal transcriptome. Polymorphic SNPs were distributed broadly throughout the dog genome and did not differ significantly in proximity to genes from either monomorphic SNPs or those that failed to cross-amplify in seals. However, the nearest genes to polymorphic SNPs were significantly enriched for functional annotations relating to energy metabolism, suggesting a possible bias towards conserved regions of the genome.
Hoffman, Joseph I.; Thorne, Michael A. S.; McEwing, Rob; Forcada, Jaume; Ogden, Rob
A general question for linkage disequilibrium-based association studies is how power to detect an association is compromised when tag SNPs are chosen from data in one population sample and then deployed in another sample. Specifically, it is important to know how well tags picked from the HapMap DNA samples capture the variation in other samples. To address this, we collected
Paul I W de Bakker; Noël P Burtt; Robert R Graham; Candace Guiducci; Roman Yelensky; Jared A Drake; Todd Bersaglieri; Kathryn L Penney; Johannah Butler; Stanton Young; Robert C Onofrio; Helen N Lyon; Daniel O Stram; Christopher A Haiman; Matthew L Freedman; Xiaofeng Zhu; Richard Cooper; Leif Groop; Laurence N Kolonel; Brian E Henderson; Mark J Daly; Joel N Hirschhorn; David Altshuler
BACKGROUND: Cystic fibrosis (CF) lung disease manifest by impaired chloride secretion leads to eventual respiratory failure. Candidate genes that may modify CF lung disease severity include alternative chloride channels. The objectives of this study are to identify single nucleotide polymorphisms (SNPs) in the airway epithelial chloride channel, CLC-2, and correlate these polymorphisms with CF lung disease. METHODS: The CLC-2 promoter,
Carol J Blaisdell; Timothy D Howard; Augustus Stern; Penelope Bamford; Eugene R Bleecker; O Colin Stine
Background The least absolute shrinkage and selection operator (LASSO) can be used to predict SNP effects. This operator has the desirable feature of including in the model only a subset of explanatory SNPs, which can be useful both in QTL detection and GWS studies. LASSO solutions can be obtained by the least angle regression (LARS) algorithm. The big issue with this procedure is to define the best constraint (t), i.e. the upper bound of the sum of absolute value of the SNP effects which roughly corresponds to the number of SNPs to be selected. Usai et al. (2009) dealt with this problem by a cross-validation approach and defined t as the average number of selected SNPs over all replications. Nevertheless, in small size populations, such an estimator could give underestimated values of t. Here we propose two alternative ways to define t and compare them with the "classical" one. Methods The first (strategy 1) was based on 1,000 cross-validations carried out by randomly splitting the reference population (2,000 individuals with performance) into two halves. The value of t was the number of SNPs which occurred in more than 5% of replications. The second (strategy 2), which did not use cross-validations, was based on the minimization of the Cp-type selection criterion which depends on the number of selected SNPs and the expected residual variance. Results The size of the subset of selected SNPs was 46, 189 and 64 for the classical approach, strategy 1 and 2 respectively. Classical and strategy 2 gave similar results and indicated quite clearly the regions where QTL with additive effects were located. Strategy 1 confirmed such regions and added further positions which gave a less clear scenario. Correlations between GEBVs estimated with the three strategies and TBVs in progenies without phenotypes were 0.9237, 0.9000 and 0.9240 for classical, strategy 1 and 2 respectively.
Conclusions This suggests that the Cp-type selection criterion is a valid alternative to the cross-validations to define the best constraint for selecting subsets of predicting SNPs by LASSO-LARS procedure.
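The counting step of strategy 1 (t defined as the number of SNPs selected in more than 5% of cross-validation replications) can be sketched as follows. The LASSO-LARS fit itself is not shown; the input is assumed to be one collection of selected SNP identifiers per replication, and the function name is hypothetical.

```python
from collections import Counter


def stable_snp_count(selected_per_replicate, min_fraction=0.05):
    """Count SNPs selected in more than `min_fraction` of replications.

    selected_per_replicate: list of iterables of SNP identifiers, one per
    cross-validation replication (assumed to come from a LASSO-LARS fit).
    Returns (t, sorted list of the SNPs that passed).
    """
    n_rep = len(selected_per_replicate)
    if n_rep == 0:
        return 0, []
    counts = Counter()
    for selected in selected_per_replicate:
        counts.update(set(selected))  # count each SNP once per replication
    stable = sorted(
        snp for snp, c in counts.items() if c / n_rep > min_fraction
    )
    return len(stable), stable
```

With 1,000 replications, as in the abstract, a SNP would need to be selected at least 51 times to be counted.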
MicroRNAs (miRNAs) are small regulatory RNAs that modulate the expression of approximately half of all human genes. Small changes in miRNA expression have been associated with several psychiatric and neurological disorders, but whether the polymorphisms in genes involved in the processing of miRNAs into maturity influence the susceptibility of a person to schizophrenia (SZ) has not yet been elucidated. In this study, we investigated the association between SZ risk and single-nucleotide polymorphisms (SNPs) in microRNA machinery genes. We assessed the associations between SZ risk and six potentially functional SNPs from five miRNA processing genes (DROSHA, DGCR8, DICER, AGO1, and GEMIN4) in a case-control study of 256 Chinese SZ patients and 252 frequency-matched (age, gender, and ethnicity) controls. All the SNPs (rs10719, rs3757, rs3742330, rs636832, rs7813, and rs3744741) were genotyped by the high resolution melting method. We found that two SNPs in the DGCR8 and DICER genes were significantly associated with altered SZ risk. The genotype or allele frequency of rs3742330 in DICER was significantly different in patients and controls. Moreover, the recessive model of rs3757 in DGCR8 (AA vs. GA/GG) exhibited a significantly increased risk with an odds ratio (OR) of 3.73 [95% confidence interval (CI), 1.03-13.52, P = 0.032]; the dominant model of rs3742330 in DICER (AA vs. AG/GG) exhibited a significantly increased risk with OR of 1.49 (95% CI, 1.04-2.13; P = 0.028). Other SNPs and the haplotype of GEMIN4 (rs3744741 and rs7813) did not show any association with SZ. Our results suggested that the specific genetic variants in microRNA machinery genes may affect SZ susceptibility. PMID:23015298
Zhou, Yi; Wang, Jun; Lu, Xiaojun; Song, Xingbo; Ye, Yuanxin; Zhou, Juan; Ying, Binwu; Wang, Lanlan
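Odds ratios with 95% confidence intervals, as reported above for the recessive and dominant models, are often computed from a 2x2 genotype table with the log (Woolf) method. The abstract does not say which interval method was used and does not give the underlying counts, so the sketch below is an assumption-laden illustration with a hypothetical function name.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) confidence interval for a 2x2 table.

    a: cases with the risk genotype      b: cases without it
    c: controls with the risk genotype   d: controls without it
    z: normal quantile (1.96 gives an approximate 95% interval)
    Returns (odds_ratio, lower, upper).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this method")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

A wide interval such as 1.03-13.52 typically reflects a small count in at least one cell, which inflates the standard error of the log odds ratio.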
Background The p21 codon 31 single nucleotide polymorphism (SNP), rs1801270, has been linked to cervical cancer but with controversial results. The aims of this study were to investigate the role of p21 SNP-rs1801270 and other untested p21 SNPs in the risk of cervical cancer in a Chinese population. Methods We genotyped five p21 SNPs (rs762623, rs2395655, rs1801270, rs3176352, and rs1059234) using peripheral blood DNA from 393 cervical cancer patients and 434 controls. Results The frequency of the rs1801270 A allele in patients (0.421) was significantly lower than that in controls (0.494, p = 0.003). The frequency of the rs3176352 C allele in cases (0.319) was significantly lower than that in controls (0.417, p < 0.001). The allele frequencies of the other three p21 SNPs were not significantly different between patients and controls. The rs1801270 AA genotype was associated with a decreased risk for the development of cervical cancer (OR = 0.583, 95%CI: 0.399 - 0.853, P = 0.005). We observed that three p21 SNPs (rs1801270, rs3176352, and rs1059234) were in linkage disequilibrium (LD) and thus haplotype analysis was performed. The AGT haplotype (which includes the rs1801270A allele) was the most frequent haplotype among all subjects, and both homozygosity and heterozygosity for the AGT haplotype provided a protective effect from development of cervical cancer. Conclusions We show an association between the p21 SNP rs1801270A allele and a decreased risk for cervical cancer in a population of Chinese women. The AGT haplotype formed by three p21 SNPs in LD (rs1801270, rs3176352 and rs1059234) also provided a protective effect in development of cervical cancer in this population.
A backwash filter system includes a filter array comprising a plurality of filter units each having a housing, a filter element unit and a process liquid inlet port and a filtered liquid outlet port connected respectively with process liquid inlet and filtered liquid outlet sides of the filter element unit for filtering process liquid. A process liquid inlet header connects to the process liquid inlet ports of the filter units. A filtered liquid outlet header connects to the filtered liquid outlet ports of the filter units. A first valve unit on the process liquid inlet header is switchable for connecting same alternatively to a process liquid source and a backwash liquid drain. The filtered liquid outlet header can communicate with a filtered liquid receiver.
This article reviews the types and capabilities of birefringent filters. The general operating principles of Lyot (perfect polarizers), partial polarizing, and Solc (no internal polarizers) filters are introduced. Appropriate techniques for tuning each filter type are presented. Field of view of birefringent filters is discussed and is compared to Fabry-Perot and interference filters. The transmission and throughput advantages of birefringent filters are shown. Finally, the current state of the art in practical filters is reviewed.
Title, A. M.; Rosenberg, W. J.
Objectives. Since subjects may have been diagnosed before cohort entry, analysis of late HIV diagnosis (LD) is usually restricted to the newly diagnosed. We estimate the magnitude and risk factors of LD in a cohort of seroprevalent individuals by imputing seroconversion dates. Methods. Multicenter cohort of HIV-positive subjects who were treatment naive at entry, in Spain, 2004–2008. Multiple-imputation techniques were used. Subjects with times to HIV diagnosis longer than 4.19 years were considered LD. Results. Median time to HIV diagnosis was 2.8 years in the whole cohort of 3,667 subjects. Factors significantly associated with LD were: male sex; Sub-Saharan African, Latin-American origin compared to Spaniards; and older age. In 2,928 newly diagnosed subjects, median time to diagnosis was 3.3 years, and LD was more common in injecting drug users. Conclusions. Estimates of the magnitude and risk factors of LD for the whole cohort differ from those obtained for new HIV diagnoses.
Sobrino-Vegas, Paz; Perez-Hoyos, Santiago; Geskus, Ronald; Padilla, Belen; Segura, Ferran; Rubio, Rafael; del Romero, Jorge; Santos, Jesus; Moreno, Santiago; del Amo, Julia
Single nucleotide polymorphisms within microRNA (miRNA) binding sites comprise a novel genre of cancer biomarkers. Since miRNA regulation is dependent on sequence complementarity between the mRNA transcript and the miRNA, even single nucleotide aberrations can have significant effects. Over the past few years, many examples of these functional miRNA binding site SNPs have been identified as cancer biomarkers. While most of the research to date focuses on associations with cancer risk, more and more studies are linking these SNPs to cancer prognosis and response to treatment as well. This review summarizes the state of the field and draws importance to this rapidly expanding area of cancer biomarkers.
Preskill, Carina; Weidhaas, Joanne B.
Large-scale transcriptome profiling in clinical studies often involves assaying multiple samples of a patient to monitor disease progression, treatment effect, and host response in multiple tissues. Such profiling is prone to human error, which often results in mislabeled samples. Here, we present a method to detect mislabeled sample outliers using coding single nucleotide polymorphisms (cSNPs) specifically designed on the microarray and demonstrate that the mislabeled samples can be efficiently identified by either simple clustering of allele-specific expression scores or Mahalanobis distance-based outlier detection method. Based on our results, we recommend the incorporation of cSNPs into future transcriptome array designs as intrinsic markers for sample tracking. PMID:22668418
Xu, Weihong; Gao, Hong; Seok, Junhee; Wilhelmy, Julie; Mindrinos, Michael N; Davis, Ronald W; Xiao, Wenzhong
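The Mahalanobis distance-based outlier detection mentioned in this abstract can be illustrated in two dimensions. This is a generic sketch, not the authors' procedure: each sample is assumed to be summarized by two hypothetical scores, and the function name is invented for the example.

```python
import math


def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean.

    points: list of (x, y) score pairs, one per sample.
    Uses the sample covariance matrix and its closed-form 2x2 inverse.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances
```

Samples whose distance is far larger than the rest would be the candidates for mislabeling; the cutoff is a design choice that the abstract does not specify.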
A filter assembly for fluid containing microbial agents comprises a base defining a first broad filter supporting surface for receiving an appropriate microbial filter membrane, and at least one fluid drain disposed therebeneath to conduct the fluid carry...
Monolithic, macroscopic, nanoporous nanotube filters are fabricated having radially aligned carbon nanotube walls. The freestanding filters have diameters and lengths up to several centimeters. A single-step filtering process was demonstrated in two impor...
A. Srivastava; O. N. Srivastava; P. M. Ajayan; R. Vajtal; S. Talapatra
The genome-wide association study by Herbert and colleagues identified the INSIG2 single nucleotide polymorphism (SNP) rs7566605 as contributing to increased BMI in ethnically distinct cohorts. The present study sought to further clarify by testing whether SNPs of INSIG2 influenced quantitative adiposity or glucose homeostasis traits in Hispanics of the Insulin Resistance Atherosclerosis Family Study (IRASFS). Using a tagging SNP approach, rs7566605 and 31 additional SNPs were genotyped in 1425 IRASFS Hispanics. SNPs were tested for association with six adiposity measures: BMI, waist circumference (WAIST), waist to hip ratio (WHR), subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT), and VAT to SAT ratio (VSR). SNPs were also tested for association with fasting glucose (GFAST), fasting insulin (FINS), and three measures obtained from the frequently sampled intravenous glucose tolerance test: insulin sensitivity (SI), acute insulin response (AIR), and disposition index (DI). Most prominent association was observed with direct CT-measured adiposity phenotypes, including VAT, SAT, and VSR (P-values range from 0.007 to 0.044 for rs17586756, rs17047718, rs17047731, rs9308762, rs12623648, and rs11673900). Multiple SNP associations were observed with all glucose homeostasis traits (P-values range from 0.001 to 0.031 for rs17047718, rs17047731, rs2161829, rs10490625, rs889904, and rs12623648). Using BMI as a covariate in evaluation of glucose homeostasis traits slightly reduced their association. However, association with adiposity and glucose homeostasis phenotypes is not significant following multiple comparisons adjustment. Trending association after multiple comparisons adjustment remains suggestive of a role for genetic variation of INSIG2 in obesity, but these results require validation.
Talbert, Matthew E; Langefeld, Carl D; Ziegler, Julie; Haffner, Steven M; Norris, Jill M; Bowden, Donald W
This paper assesses the use of single nucleotide polymorphisms (SNPs) for forensic analysis. It demonstrates that relatively small arrays of approx. 50 loci are comparable to existing short tandem repeat (STR) multiplexes. A quantitative test, however, is a prerequisite for mixture interpretation. In addition, as the mixture proportion becomes low, it will be necessary to distinguish between the allele and
A novel approach for simultaneous genotyping of multiple SNPs by ligation of universal probes on a 3D polyacrylamide gel DNA microarray is reported in this study. In the proof-of-principle experiment, synthetic single-stranded DNA (ssDNA) templates simulating 3 SNP loci, modified with an acrylamide group at the 5'-terminal, were copolymerized with the acrylamide monomer to be immobilized onto one chip and
Jie Ma; Sheng Ning; Pengfeng Xiao
Genetic variants in GABRA2 have previously been shown to be associated with alcohol measures, EEG β waves, and impulsiveness-related traits. Impulsiveness is a behavioral risk factor for alcohol and other substance abuse. Here, we tested association between 11 variants in GABRA2 with NEO-impulsiveness and problem drinking. Our sample of 295 unrelated adult subjects was from a community of families with at least one male with DSM-IV Alcohol use diagnosis, and from a socioeconomically comparable control group. Ten GABRA2 SNPs were associated with NEO-impulsiveness (p < 0.03). The alleles associated with higher impulsiveness correspond to the minor alleles identified in previous alcohol dependence studies. All ten SNPs are in LD with each other and represent one effect on impulsiveness. Four SNPs and the corresponding haplotype from intron 3 to intron 4 were also associated with Lifetime Alcohol Problems Score (LAPS, p < 0.03) (not corrected for multiple testing). Impulsiveness partially mediates (22.6% average) this relation between GABRA2 and LAPS. Our results suggest that GABRA2 variation in the region between introns 3 and 4 is associated with impulsiveness and this effect partially influences the development of alcohol problems, but a direct effect of GABRA2 on problem drinking remains. A potential functional SNP rs279827, located next to a splice site, is located in the most significant region for both impulsiveness and LAPS. The high degree of LD among nine of these SNPs and the conditional analyses we have performed suggest that all variants represent one signal.
Villafuerte, Sandra; Strumba, Viktorya; Stoltenberg, Scott F.; Zucker, Robert A.; Burmeister, Margit
Background Genome-wide scans of hundreds of thousands of single-nucleotide polymorphisms (SNPs) have resulted in the identification of new susceptibility variants to common diseases and are providing new insights into the genetic structure and relationships of human populations. Moreover, genome-wide data can be used to search for signals of recent positive selection, thereby providing new insights into the genetic adaptations that occurred as modern humans spread out of Africa and around the world. Methodology We genotyped approximately 500,000 SNPs in 255 individuals (5 individuals from each of 51 worldwide populations) from the Human Genome Diversity Panel (HGDP-CEPH). When merged with non-overlapping SNPs typed previously in 250 of these same individuals, the resulting data consist of over 950,000 SNPs. We then analyzed the genetic relationships and ancestry of individuals without assigning them to populations, and we also identified candidate regions of recent positive selection at both the population and regional (continental) level. Conclusions Our analyses both confirm and extend previous studies; in particular, we highlight the impact of various dispersals, and the role of substructure in Africa, on human genetic diversity. We also identified several novel candidate regions for recent positive selection, and a gene ontology (GO) analysis identified several GO groups that were significantly enriched for such candidate genes, including immunity and defense related genes, sensory perception genes, membrane proteins, signal receptors, lipid binding/metabolism genes, and genes involved in the nervous system. Among the novel candidate genes identified are two genes involved in the thyroid hormone pathway that show signals of selection in African Pygmies that may be related to their short stature.
Theunert, Christoph; Pugach, Irina; Li, Jing; Nandineni, Madhusudan R.; Gross, Arnd; Scholz, Markus; Stoneking, Mark
The combination of whole-genome sequencing efforts and emerging high-throughput genotyping techniques has made single nucleotide polymorphisms (SNPs) a marker of choice for molecular genetic analyses in model organisms. This class of marker holds great promise for resolving questions of phylogeny, population structure, introgression, and adaptive genetic variation. Fifty-five polymerase chain reaction primer pairs were used to target variable regions of
A. E. Sprowles; M. R. Stephens; N. W. Clipperton; B. P. May
Xinong Saanen (SN, n=323) and Guanzhong (GZ, n=197) goat breeds were used to detect single nucleotide polymorphisms (SNPs) in the coding regions with their intron-exon boundaries of prolactin receptor (PRLR) gene by DNA sequencing, primer-introduced restriction analysis-polymerase chain reaction (PIRA-PCR) and PCR-restriction fragment length polymorphism (PCR-RFLP). Four novel SNPs (g.40452T>C, g.40471G>A, g.61677G>A and g.61865G>A) were identified. The g.61677G>A and g.61865G>A SNPs caused amino acid variations p.Ser485Asn and p.Val548Met, respectively. Both g.40452T>C and g.40471G>A loci were closely linked in SN and GZ goat breeds (r² > 0.33). In addition, there was also a close linkage between g.61677G>A and g.61865G>A loci in both goat breeds. Statistical results indicated that the g.40452T>C, g.61677G>A and g.61865G>A SNPs were significantly associated with milk production traits in SN and GZ breeds. Further analysis revealed that combinative genotype C1 (TTAAGGGG) was better than the others for milk yield in SN and GZ goat breeds. These results are consistent with the regulatory function of PRLR in mammary gland development, milk secretion, and expression of milk protein genes, and extend the spectrum of genetic variation of the caprine PRLR gene, which might contribute to goat genetic resources and breeding. PMID:23954220
Hou, J X; An, X P; Song, Y X; Wang, J G; Ma, T; Han, P; Fang, F; Cao, B Y
Several multiple, large-scale, genetic studies on autoimmune-disease-associated SNPs have been reported recently: peptidylarginine deiminase type 4 (PADI4) in rheumatoid arthritis (RA); solute carrier family 22 members 4 and 5 (SLC22A4 and 5) in RA and Crohn’s disease (CD); programmed cell death 1 (PDCD1) in systemic lupus erythematosus (SLE), type 1 diabetes mellitus (T1D), and RA; and protein tyrosine phosphatase nonreceptor
Mikako Mori; Ryo Yamada; Kyoko Kobayashi; Reimi Kawaida; Kazuhiko Yamamoto
Identifying the genetic cis associations between DNA variants (single-nucleotide polymorphisms (SNPs)) and gene expression in brain tissue may be a promising approach to find functionally relevant pathways that contribute to the etiology of psychiatric disorders. In this study, we examined the association between genetic variations and gene expression in prefrontal cortex, hippocampus, temporal cortex, thalamus and cerebellum in subjects with psychiatric disorders and in normal controls. We identified cis associations between 648 transcripts and 6725 SNPs in the various brain regions. Several SNPs showed brain regional-specific associations. The expression level of only one gene, PDE4DIP, was associated with a SNP, rs12124527, in all the brain regions tested here. From our data, we generated a list of brain cis expression quantitative trait loci (eQTL) genes that we compared with a list of schizophrenia candidate genes downloaded from the Schizophrenia Forum (SZgene) database (http://www.szgene.org/). Of the SZgene candidate genes, we found that the expression levels of four genes, HTR2A, PLXNA2, SRR and TCF4, were significantly associated with cis SNPs in at least one brain region tested. One gene, SRR, was also involved in a coexpression module that we found to be associated with disease status. In addition, a substantial number of cis eQTL genes were also involved in the module, suggesting eQTL analysis of brain tissue may identify more reliable susceptibility genes for schizophrenia than case–control genetic association analyses. In an attempt to facilitate the identification of genetic variations that may underlie the etiology of major psychiatric disorders, we have integrated the brain eQTL results into a public and online database, Stanley Neuropathology Consortium Integrative Database (SNCID; http://sncid.stanleyresearch.org).
Kim, S; Cho, H; Lee, D; Webster, M J
TILLING (Targeting Induced Local Lesions IN Genomes) exploits the fact that CEL I endonuclease cleaves heteroduplexes at positions of single nucleotide or small indel mismatches. To detect single nucleotide polymorphisms (SNPs) across a population, DNA pools are created and a target locus under query is PCR-amplified and subjected to heteroduplex formation, followed by CEL I cleavage. Currently, the common method
Chitra Raghavan; Ma. Elizabeth B. Naredo; Hehe Wang; Genelou Atienza; Bin Liu; Fulin Qiu; Kenneth L. McNally; Hei Leung
Alzheimer's disease (AD) is a complex and multifactorial disease with the possible involvement of several genes. With the exception of the APOE gene as a susceptibility marker, no other genes have been shown consistently to be associated with late-onset AD (LOAD). A recent genome-wide association study of 17,343 gene-based putative functional single nucleotide polymorphisms (SNPs) found 19 significant variants, including 3 linked to APOE, showing association with LOAD (Hum Mol Genet 2007; 16:865–873). We have set out to replicate the 16 new significant associations in a large case-control cohort of American Whites. Additionally, we examined six variants present in positional and/or biological candidate genes for AD. We genotyped the 22 SNPs in up to 1,009 Caucasian Americans with LOAD and up to 1,010 age-matched healthy Caucasian Americans, using 5′ nuclease assays. We did not observe a statistically significant association between the SNPs and the risk of AD, either individually or stratified by APOE. Our data suggest that the association of the studied variants with LOAD risk, if it exists, is not statistically significant in our sample.
Figgins, Jessica A.; Minster, Ryan L.; Demirci, F. Yesim; DeKosky, Steven T.; Kamboh, M. Ilyas
Structural characteristics are essential for the functioning of many noncoding RNAs and cis-regulatory elements of mRNAs. SNPs may disrupt these structures, interfere with their molecular function, and hence cause a phenotypic effect. RNA folding algorithms can provide detailed insights into structural effects of SNPs. The global measures employed so far suffer from limited accuracy of folding programs on large RNAs and are computationally too demanding for genome-wide applications. Here, we present a strategy that focuses on the local regions of maximal structural change between mutant and wild-type. These local regions are approximated in a “screening mode” that is intended for genome-wide applications. Furthermore, localized regions are identified as those with maximal discrepancy. The mutation effects are quantified in terms of empirical P values. To this end, the RNAsnp software uses extensive precomputed tables of the distribution of SNP effects as function of length and GC content. RNAsnp thus achieves both a noise reduction and speed-up of several orders of magnitude over shuffling-based approaches. On a data set comprising 501 SNPs associated with human-inherited diseases, we predict 54 to have significant local structural effect in the untranslated region of mRNAs. RNAsnp is available at http://rth.dk/resources/rnasnp.
Sabarinathan, Radhakrishnan; Tafer, Hakim; Seemann, Stefan E; Hofacker, Ivo L; Stadler, Peter F; Gorodkin, Jan
The dopamine D4 receptor (DRD4) gene has been frequently studied in relation to attention deficit hyperactivity disorder (ADHD) but little is known about the contribution of single nucleotide polymorphisms (SNPs) of the DRD4 gene to the development and persistence of ADHD. In the present study, we examined the association between two SNPs in DRD4 (rs1800955, rs916455) and adult ADHD persistence in a Chinese sample. Subjects (n=193) were diagnosed with ADHD in childhood and reassessed in young adulthood at an affiliated clinic of Peking University Sixth Hospital. Kaplan-Meier survival analyses and Cox proportional hazard models were used to test the association between ADHD remission and alleles of the two SNPs. DRD4 rs916455 C allele carriers were more likely to have persistent ADHD symptoms in adulthood. No significant association was found between rs1800955 allele and the course of ADHD. These newly detected associations between DRD4 polymorphisms and ADHD prognosis in adulthood may help to predict the persistence of childhood ADHD into adulthood. PMID:23031802
Li, Yueling; Baker-Ericzen, Mary; Ji, Ning; Chang, Weili; Guan, Lili; Qian, Qiujin; Zhang, Yujuan; Faraone, Stephen V; Wang, Yufeng
An increasing number of single nucleotide polymorphisms (SNPs) on the Y chromosome are being identified. To utilize the full potential of the SNP markers in population genetic studies, new genotyping methods with high throughput are required. We describe a microarray system based on the minisequencing single nucleotide primer extension principle for multiplex genotyping of Y-chromosomal SNP markers. The system was applied for screening a panel of 25 Y-chromosomal SNPs in a unique collection of samples representing five Finno-Ugric populations. The specific minisequencing reaction provides 5-fold to infinite discrimination between the Y-chromosomal genotypes, and the microarray format of the system allows parallel and simultaneous analysis of large numbers of SNPs and samples. In addition to the SNP markers, five Y-chromosomal microsatellite loci were typed. Altogether 10,000 genotypes were generated to assess the genetic diversity in these population samples. Six of the 25 SNP markers (M9, Tat, SRY10831, M17, M12, 92R7) were polymorphic in the analyzed populations, yielding six distinct SNP haplotypes. The microsatellite data were used to study the genetic structure of two major SNP haplotypes in the Finns and the Saami in more detail. We found that the most common haplotypes are shared between the Finns and the Saami, and that the SNP haplotypes show regional differences within the Finns and the Saami, which supports the hypothesis of two separate settlement waves to Finland. PMID:11230171
Raitio, M; Lindroos, K; Laukkanen, M; Pastinen, T; Sistonen, P; Sajantila, A; Syvänen, A C
ALOX5AP (5-lipoxygenase-activating protein) has been recognized as a susceptibility gene for stroke. Using a case-control design, the whole coding and adjoining intronic regions of ALOX5AP were sequenced to study the role of SNPs and their interplay with other risk factors in Greek patients with stroke. Patients (n=213) were classified by the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Their mean age was 58.9±14.64 years, and 145 were male. The control group consisted of 210 subjects with no stroke history, matched for ethnicity, sex and age. Risk factors (hyperlipidemia, hypertension, atrial fibrillation, migraine, CAD, diabetes, smoking and alcohol consumption) were assessed as confounding factors and comparisons were done using logistic regression analysis. SNPs rs4769055, rs202068154 and rs3803277, which are located in intronic regions of the gene and, according to the in silico programs EX_SKIP and HSF, possibly affect splicing of exons 1 and 2 of ALOX5AP, showed significantly different frequencies between patients and controls. The genotype frequencies of rs4769055: AA, of rs202068154: AC and of rs3803277: CA were significantly higher (p<0.001, 0.058) in controls than in patients. The results were indicative of a protective role of the three SNPs either in homozygosity or heterozygosity for MAF and more specifically rs3803277: CA/AA genotypes were protective against SVO stroke subtype. PMID:25010723
Papapostolou, Apostolis; Spengos, Kostas; Fylaktou, Irene; Poulou, Myrto; Gountas, Ilias; Kitsiou-Tzeli, Sophia; Kanavakis, Emmanuel; Tzetis, Maria
Ewing sarcoma is an important tumor of bone and soft tissue. The SNPs Arg72Pro of TP53 and T309G of MDM2 have been associated with many cancer types and are differently distributed among populations worldwide. Based on a case-control design, this study aimed to assess the role of these SNPs in 24 Ewing sarcoma patients, compared to 91 control individuals. DNA samples were extracted from blood and genotyped for both SNPs by PCR-RFLP and confirmed by DNA sequencing. The results showed an association between the G allele of the T309G and Ewing sarcoma (P=0.02). Compared with the TT carriers, the risk of G allele carriers was 3.35 (95% CI=1.22-9.21) with P=0.02. At the genotypic level, an association of the TT genotype with the control group (P=0.03) was found. Compared with the TT genotype, the risk of TG and GG was 2.97 (95% CI=1.03-8.58) with P=0.04 and 5.00 (95% CI=1.23-20.34) with P=0.02, respectively. No associations regarding the Arg72Pro SNP were found. Considering that the T309G has been associated with several types of cancer, including sarcomas, our results indicate that this SNP may also be important to Ewing sarcoma predisposition. PMID:23661019
Thurow, Helena S; Hartwig, Fernando P; Alho, Clarice S; Silva, Deborah S B S; Roesler, Rafael; Abujamra, Ana Lucia; de Farias, Caroline Brunetto; Brunetto, Algemir Lunardi; Horta, Bernardo L; Dellagostin, Odir A; Collares, Tiago; Seixas, Fabiana K
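The odds ratios and confidence intervals quoted in the Ewing sarcoma abstract above can be sanity-checked with a few lines of arithmetic. The sketch below assumes (the abstract does not say so) that the intervals are Wald-type intervals computed on the log-odds scale, in which case the point estimate should equal the geometric mean of the two bounds. The function names `implied_or` and `implied_se` are illustrative only.

```python
import math

# (odds ratio, lower 95% bound, upper 95% bound) as quoted in the abstract
reported = {
    "G carriers vs TT": (3.35, 1.22, 9.21),
    "TG vs TT": (2.97, 1.03, 8.58),
    "GG vs TT": (5.00, 1.23, 20.34),
}

def implied_or(lower, upper):
    """Point estimate implied by an interval that is symmetric on the log scale.

    Assumption: the interval is exp(log(OR) +/- z * SE), so OR = sqrt(lower * upper).
    """
    return math.sqrt(lower * upper)

def implied_se(lower, upper, z=1.96):
    """Standard error of log(OR) implied by the width of a 95% Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

for label, (odds_ratio, lower, upper) in reported.items():
    print(
        f"{label}: reported OR={odds_ratio:.2f}, "
        f"implied OR={implied_or(lower, upper):.2f}, "
        f"SE(log OR)={implied_se(lower, upper):.2f}"
    )
```

Under that assumption, all three reported estimates agree with their intervals to two decimal places. The wide intervals reflect the small case group (24 patients).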
Background Genetic association study is currently the primary vehicle for identification and characterization of disease-predisposing variant(s), and it usually involves multiple single-nucleotide polymorphisms (SNPs). However, SNP-wise association tests raise concerns over multiple testing. Haplotype-based methods have the advantage of being able to account for correlations between neighbouring SNPs, yet the assumption of Hardy-Weinberg equilibrium (HWE) and a potentially large number of degrees of freedom can harm their statistical power and robustness. Approaches based on principal component analysis (PCA) are preferable in this regard but their performance varies with the method of extracting principal components (PCs). Results A PCA-based bootstrap confidence interval test (PCA-BCIT), which directly uses the PC scores to assess gene-disease association, was developed and evaluated for three ways of extracting PCs, i.e., cases only (CAES), controls only (COES) and cases and controls combined (CES). Extraction of PCs with COES is preferred to that with CAES and CES. Performance of the test was examined via simulations as well as analyses of data on rheumatoid arthritis and heroin addiction; the test maintained the nominal level under the null hypothesis and showed comparable performance with the permutation test. Conclusions PCA-BCIT is a valid and powerful method for assessing gene-disease association involving multiple SNPs.
Hairs are complex structures that form a protective layer and serve different biological functions. TRPS1, a transcription factor, is one of the candidate genes causing congenital hypertrichosis, an excessive hair growth at inappropriate body parts. SNPs of TRPS1 were retrieved from dbSNP and screened by the SIFT and PolyPhen servers based on their functional impacts. Out of the screened SNPs, rs181507248 and rs146506752 were predicted as intolerant and damaging by both servers. The predicted tertiary structure of the native TRPS1, which had not previously been predicted, was refined, validated and submitted to the Protein Model Database, where it was assigned PMDB ID PM0077843. Structure-based analysis showed that the SNPs rs181507248 and rs146506752 caused significant changes in the secondary and tertiary structures as well as the physicochemical properties of the TRPS1 protein. It can thus be concluded that the changed properties due to these single nucleotide polymorphisms affect the interactions of TRPS1, which results in congenital hypertrichosis. PMID:22553388
Waheed, Rabiya; Khan, Mohammad Haroon; Bano, Raisa; Rashid, Hamid
Purpose The cytochrome p450 family 1 subfamily B (CYP1B1) gene is a well known cause of autosomal recessive primary congenital glaucoma. It has also been postulated as a modifier of disease severity in primary open angle glaucoma (POAG), particularly in juvenile onset families. However, the role of common variation in the gene in relation to POAG has not been thoroughly explored. Methods Seven tag single nucleotide polymorphisms (SNPs), including two coding variants (L432V and N543S), were genotyped in 860 POAG cases and 898 examined normal controls. Each SNP and haplotype was assessed for association with disease. In addition, a subset of 396 severe cases and 452 elderly controls were analyzed separately. Results There was no association of any individual SNP in the full data set. Two SNPs (rs162562 and rs10916) were nominally associated under a dominant model in the severe cases (p<0.05). A common haplotype (AGCAGCC) was also found to be nominally associated in both the full data set (p=0.048, OR [95%CI]=0.83 [0.69–0.90]) and more significantly in the severe cases (p=0.004, OR [95%CI]=0.68 [0.52–0.89]) which survives correction for multiple testing. Conclusions Although no major effect of common variation at the CYP1B1 locus on POAG was found, there could be an effect of SNPs tagged by rs162562 and represented on the AGCAGCC haplotype.
Hewitt, Alex W.; Mackey, David A.; Mitchell, Paul; Craig, Jamie E.
Background: Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. The presence of missing data challenges the practice of model development. Several studies have suggested that the performance of imputation methods is acceptable when the missing rate is moderate. One issue that has received less attention, and is addressed here, is the role of the pattern of missing data. Methods: We used data on 2720 prisoners. Results from fitting a regression model to the complete data served as the gold standard. Missing data were then generated so that 10%, 20% and 50% of data were lost. In scenario 1, we generated missing values, at the above rates, in one variable that was significant in the gold-standard model (age). In scenario 2, a small proportion of each independent variable was dropped. Four imputation methods, under different Event Per Variable (EPV) values, were compared in terms of selection of important variables and parameter estimation. Results: In scenario 2, bias in estimates was low and the performances of all methods for handling missing data were similar. All methods at all missing rates were able to detect the significance of age. In scenario 1, biases in estimates increased, in particular at the 50% missing rate. Here, at EPVs of 10 and 5, imputation methods failed to capture the effect of age. Conclusion: In scenario 2, all imputation methods at all missing rates were able to detect age as being significant. This was not the case in scenario 1. Our results showed that the performance of imputation methods depends on the pattern of missing data. PMID:24596839
Haji-Maghsoudi, Saiedeh; Haghdoost, Ali-Akbar; Rastegari, Azam; Baneshi, Mohammad Reza
Because of their high variability, microsatellites are still considered the marker of choice for studies on parentage and kinship in wild populations. Nevertheless, single nucleotide polymorphisms (SNPs) are becoming increasingly popular in many areas of molecular ecology, owing to their high throughput, easy transferability between laboratories and low genotyping error. An ongoing discussion concerns the relative power of SNPs compared to microsatellites, that is, how many SNP loci are needed to replace a panel of microsatellites? Here, we evaluate the assignment power of 80 SNPs (H(E) = 0.30, 80 independent alleles) and 11 microsatellites (H(E) = 0.85, 192 independent alleles) in a wild population of about 400 sockeye salmon with two commonly used software packages (Cervus3, Colony2) and, for SNPs only, a newly developed software (SNPPIT). Assignment success was higher for SNPs than for microsatellites, especially for parent pairs, irrespective of the method used. Colony2 assigned a larger proportion of offspring to at least one parent than the other methods, although Cervus and SNPPIT detected more parent pairs. Identification of full-sib groups without parental information from relatedness measures was possible using both marker systems, although explicit reconstruction of such groups in Colony2 was impossible for SNPs because of computation time. Our results confirm the applicability of SNPs for parentage analyses and refute the predictability of assignment success from the number of independent alleles. PMID:21429171
Hauser, Lorenz; Baird, Melissa; Hilborn, Ray; Seeb, Lisa W; Seeb, James E
Objective We previously reported an analysis of single nucleotide polymorphisms (SNPs) in three validated European rheumatoid arthritis (RA) susceptibility loci, TAGAP, TNFAIP3, and CCR6 in African-Americans with RA. Unexpectedly, the disease-associated alleles were different in African-Americans than in Europeans. In an effort to better define their contribution, we performed additional SNP genotyping in these genes. Methods Seven SNPs were genotyped in 446 African Americans with RA and 733 African American controls. Differences in minor allele frequency between cases and controls were analyzed after controlling for global proportion of European admixture, and pairwise linkage disequilibrium (LD) was estimated among the SNPs. Results Three SNPs were significantly associated with RA: TNFAIP3 rs719149 A allele, OR (95% CI) 1.22 (1.03–1.44), p=0.02; TAGAP rs1738074 G allele, OR 0.75 (0.63–0.89), p=0.0012; and TAGAP rs4709267 G allele, OR 0.74 (0.60–0.91), p=0.004. Pairwise LD between the TAGAP SNPs was low (R2=0.034). The haplotype containing minor alleles for both TAGAP SNPs was uncommon (4.5%). After conditional analysis on each TAGAP SNP, its counterpart remained significantly associated with RA (rs1738074 for rs4709267 p=0.00001; rs4709267 for rs1738074 p=0.00005), suggesting independent effects. Conclusions SNPs in regulatory regions of TAGAP and an intronic SNP (TNFAIP3) are potential susceptibility loci in African Americans. Pairwise LD, haplotype analysis, and SNP conditioning analysis suggest that these two SNPs in TAGAP are independent susceptibility alleles. Additional fine mapping of this gene and functional genomic studies of these SNPs should provide additional insight into the role of these genes in RA.
Perkins, Elizabeth A.; Landis, Dawn; Causey, Zenoria L.; Edberg, Yuanqing; Reynolds, Richard J.; Hughes, Laura B.; Gregersen, Peter K.; Kimberly, Robert P.; Edberg, Jeffrey C.; Bridges, S. Louis
Background There are few genomic tools available in melon (Cucumis melo L.), a member of the Cucurbitaceae, despite its importance as a crop. Among these tools, genetic maps have been constructed mainly using marker types such as simple sequence repeats (SSR), restriction fragment length polymorphisms (RFLP) and amplified fragment length polymorphisms (AFLP) in different mapping populations. There is a growing need for saturating the genetic map with single nucleotide polymorphisms (SNP), more amenable for high throughput analysis, especially if these markers are located in gene coding regions, to provide functional markers. Expressed sequence tags (ESTs) from melon are available in public databases, and resequencing ESTs or validating SNPs detected in silico are excellent ways to discover SNPs. Results EST-based SNPs were discovered after resequencing ESTs between the parental lines of the PI 161375 (SC) × 'Piel de sapo' (PS) genetic map or using in silico SNP information from EST databases. In total 200 EST-based SNPs were mapped in the melon genetic map using a bin-mapping strategy, increasing the map density to 2.35 cM/marker. A subset of 45 SNPs was used to study variation in a panel of 48 melon accessions covering a wide range of the genetic diversity of the species. SNP analysis correctly reflected the genetic relationships compared with other marker systems, being able to distinguish all the accessions and cultivars. Conclusion This is the first example of a genetic map in a cucurbit species that includes a major set of SNP markers discovered using ESTs. The PI 161375 × 'Piel de sapo' melon genetic map has around 700 markers, of which more than 500 are gene-based markers (SNP, RFLP and SSR). This genetic map will be a central tool for the construction of the melon physical map, the step prior to sequencing the complete genome. 
Using the set of SNP markers, it was possible to define the genetic relationships within a collection of forty-eight melon accessions as efficiently as with SSR markers, and these markers may also be useful for cultivar identification in Occidental melon varieties.
Deleu, Wim; Esteras, Cristina; Roig, Cristina; Gonzalez-To, Mireia; Fernandez-Silva, Iria; Gonzalez-Ibeas, Daniel; Blanca, Jose; Aranda, Miguel A; Arus, Pere; Nuez, Fernando; Monforte, Antonio J; Pico, Maria Belen; Garcia-Mas, Jordi
Somatic mutations and dysregulation by microRNAs (miRNAs) may have a pivotal role in congenital heart defects (CHDs). The purpose of the study was to assess both somatic and germline mutations in the GATA4 and NKX2.5 genes as well as to identify 3'UTR single nucleotide polymorphisms (SNPs) in the miRNA target sites. We enrolled 30 patients (13 males; 13.4±8.3 years) with non-syndromic CHD. GATA4 and NKX2.5 genes were screened in cardiac tissue of sporadic and in blood samples of familial cases. Computational methods were used to detect putative miRNAs in the 3'UTR region and to assess the Minimum Free Energy of hybridization (MFE, kcal/mol). Difference of MFEs (ΔMFE) ≥4 kcal/mol between alleles was considered biologically relevant on miRNA binding. The sum of all ΔMFEs (|ΔMFEtot|=Σ|ΔMFE|) was calculated in order to predict the biological importance of SNPs binding more miRNAs. No evidence of novel GATA4 and NKX2.5 mutations was found both in sporadic and familial patients. Bioinformatic analysis revealed 27 putative miRNAs binding to identified SNPs in the 3'UTR of GATA4. ΔMFE ≥4 kcal/mol between alleles was obtained for the +354A>C (miR-4299), +587A>G (miR-604), +1355G>A (miR-548v, miR-139-5p) and +1521C>G (miR-583, miR-3125, miR-3928) SNPs. The +1521C>G SNP showed the highest ΔMFEtot (21.66 kcal/mol). Luciferase reporter assays indicated that miR-583 was dose-dependently effective in regulating +1521 C allele compared with +1521 G allele. Based on the analysis of 100 CHD cases and 204 healthy newborns, the +1521 G allele was also associated with a lower risk of CHD (OR=0.5, 95% CI 0.3-0.9, p=0.03), likely due to the relatively low binding of the miRNA and high levels of protein. These results suggest that common SNPs in the 3'UTR of GATA4 alter miRNA gene regulation contributing to the pathogenesis of CHDs. PMID:23583740
Sabina, Saverio; Pulignani, Silvia; Rizzo, Milena; Cresci, Monica; Vecoli, Cecilia; Foffa, Ilenia; Ait-Ali, Lamia; Pitto, Letizia; Andreassi, Maria Grazia
The Porcine SNP database has a huge number of SNPs, but these SNPs are mostly found by computer data-mining procedures and have not been well characterized. We re-sequenced 1,439 porcine public SNPs from four commercial pig breeds and one Korean domestic breed (Korean Native pig, KNP) by using two DNA pools from eight unrelated animals in each breed. These SNPs were from 419 protein-coding genes covering the 18 autosomes, and the re-sequencing in breeds confirmed 690 public SNPs (47.9%) and 226 novel mutations (173 SNPs and 53 insertions/deletions). Thus, totally, 916 variations were found from our study. Of the 916 variations, 148 SNPs (16.2%) were found across all the five breeds, and 199 SNPs (21.7%) were breed specific polymorphisms. According to the SNP locations in the gene sequences, these 916 variations were categorized into 802 non-coding SNPs (785 in intron, 17 in 3'-UTR) and 114 coding SNPs (86 synonymous SNPs, 28 non-synonymous SNPs). The nucleotide substitution analyses for these SNPs revealed that 70.2% were from transitions, 20.0% from transversions, and the remaining 5.79% were deletions or insertions. Subsequently, we genotyped 261 SNPs from 180 genes in an experimental KNP × Landrace F2 cross by the Sequenom MassARRAY system. A total of 33 traits including growth, carcass composition and meat quality were analyzed for the phenotypic association tests using the 132 SNPs in 108 genes with minor allele frequency (MAF)>0.2. The association results showed that five marker-trait combinations were significant at the 5% experiment-wise level (ADCK4 for rear leg, MYH3 for rear leg, Hunter B, Loin weight and Shearforce) and four at the 10% experiment-wise level (DHX38 for average daily gain at live weight, LGALS9 for crude lipid, NGEF for front leg and LIFR for pH at 24 h). In addition, 49 SNPs in 44 genes showing significant association with the traits were detected at the 1% comparison-wise level. 
A large number of genes that function as enzymes, transcription factors or signalling molecules were considered as genetic markers for pig growth (RNF103, TSPAN31, DHX38, ABCF1, ABCC10, SCD5, KIAA0999 and FKBP10), muscling (HSPA5, PTPRM, NUP88, ADCK4, PLOD1, DLX1 and GRM8), fatness (PTGIS, IDH3B, RYR2 and NOL4) and meat quality traits (DUSP4, LIFR, NGEF, EWSR1, ACTN2, PLXND1, DLX3, LGALS9, ENO3, EPRS, TRIM29, EHMT2, RBM42, SESN2 and RAB4B). The SNPs or genes reported here may be beneficial to future marker assisted selection breeding in pigs. PMID:21107721
Li, Xiaoping; Kim, Sang-Wook; Do, Kyoung-Tag; Ha, You-Kyoung; Lee, Yun-Mi; Yoon, Suk-Hee; Kim, Hee-Bal; Kim, Jong-Joo; Choi, Bong-Hwan; Kim, Kwan-Suk
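The counts in the porcine SNP re-sequencing abstract above form a small set of nested totals, and the quoted percentages can be recomputed from them. The following is a minimal sketch that uses only numbers copied from the abstract; the variable names and the `pct` helper are illustrative assumptions.

```python
# Counts copied from the abstract
public_tested = 1439          # public SNPs that were re-sequenced
confirmed_public = 690        # public SNPs confirmed in the breeds
novel_snps, novel_indels = 173, 53
novel = novel_snps + novel_indels            # reported as 226 novel mutations
total = confirmed_public + novel             # reported as 916 variations

def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

# Percentages recomputed from the counts, next to the values quoted in the text
computed_pct = {
    "confirmed public SNPs": pct(confirmed_public, public_tested),
    "shared by all five breeds": pct(148, total),
    "breed specific": pct(199, total),
    "insertions/deletions": pct(novel_indels, total),
}
reported_pct = {
    "confirmed public SNPs": 47.9,
    "shared by all five breeds": 16.2,
    "breed specific": 21.7,
    "insertions/deletions": 5.79,
}

# Location categories should partition the 916 variations
non_coding = 785 + 17         # intron + 3'-UTR, reported as 802
coding = 86 + 28              # synonymous + non-synonymous, reported as 114

for name, value in computed_pct.items():
    print(f"{name}: computed {value:.2f}%, reported {reported_pct[name]}%")
print("non-coding + coding =", non_coding + coding)

# Substitution classes exactly as printed in the abstract
print("transitions + transversions + indels =", round(70.2 + 20.0 + 5.79, 2), "%")
```

The recomputed percentages match the quoted ones after rounding, and the location categories add up to 916. The three substitution-class percentages as printed sum to roughly 96% rather than 100%; the sketch only reports that sum and does not attempt to say which figure is off.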
Background: The dissection of complex traits of economic importance to the pig industry requires the availability of a significant number of genetic markers, such as single nucleotide polymorphisms (SNPs). This study was conducted to discover several hundreds of thousands of porcine SNPs using next generation sequencing technologies and use these SNPs, as well as others from different public sources, to
Antonio M. Ramos; Richard P. M. A. Crooijmans; Nabeel A. Affara; Andreia J. Amaral; H. H. D. Kerstens; H. J. W. C. Megens; M. A. M. Groenen; Carol Churcher; Richard Clark; Patrick Dehais; Mark S. Hansen; Jakob Hedegaard; Zhi-Liang Hu; Andy S. Law; Hendrik-Jan Megens; Denis Milan; Danny J. Nonneman; Gary A. Rohrer; Max F. Rothschild; Tim P. L. Smith; Robert D. Schnabel; Curt P. Van Tassell; Jeremy F. Taylor; Ralph T. Wiedmann; Lawrence B. Schook
Background The dissection of complex traits of economic importance to the pig industry requires the availability of a significant number of genetic markers, such as single nucleotide polymorphisms (SNPs). This study was conducted to discover several hundreds of thousands of porcine SNPs using next generation sequencing technologies and use these SNPs, as well as others from different public sources, to design
Antonio M. Ramos; Richard P. M. A. Crooijmans; Nabeel A. Affara; Andreia J. Amaral; Alan L. Archibald; Jonathan E. Beever; Christian Bendixen; Carol Churcher; Richard Clark; Patrick Dehais; Mark S. Hansen; Jakob Hedegaard; Zhi-Liang Hu; Hindrik H. Kerstens; Andy S. Law; Hendrik-Jan Megens; Denis Milan; Danny J. Nonneman; Gary A. Rohrer; Max F. Rothschild; Tim P. L. Smith; Robert D. Schnabel; Curt P. van Tassell; Jeremy F. Taylor; Ralph T. Wiedmann; Lawrence B. Schook; Martien A. M. Groenen; Laszlo Orban
In this paper, we propose a novel type of explicit image filter - guided filter. Derived from a local linear model, the guided filter generates the filtering output by considering the content of a guidance image, which can be the input image itself or another different image. The guided filter can perform as an edge-preserving smoothing operator like
Kaiming He; Jian Sun; Xiaoou Tang
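The local linear model mentioned in the guided filter abstract above can be sketched in one dimension with plain Python. This is an illustrative reduction, not the authors' implementation: the excerpt does not state the window shape, the regularisation term, or any parameter values, so the box window, the names radius and eps, and the 1-D setting are assumptions. Within each window the output is modelled as a * guide + b, with a and b fitted by regularised least squares and then averaged over all windows covering a sample.

```python
def box_mean(values, radius):
    """Mean over the window [i - radius, i + radius], clipped at the ends."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def guided_filter_1d(guide, signal, radius=2, eps=1e-3):
    """1-D guided filter sketch: output = mean(a) * guide + mean(b).

    radius and eps are assumed parameters; a larger eps gives more smoothing.
    """
    if len(guide) != len(signal):
        raise ValueError("guide and signal must have the same length")
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(signal, radius)
    corr_ip = box_mean([g * s for g, s in zip(guide, signal)], radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    # Per-window linear coefficients: a = cov(I, p) / (var(I) + eps).
    a = [(cip - mi * mp) / (cii - mi * mi + eps)
         for cip, cii, mi, mp in zip(corr_ip, corr_ii, mean_i, mean_p)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_i)]
    # Average the coefficients of every window that covers each sample.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]
```

Passing the same sequence as both guide and signal gives the self-guided case: flat regions (low variance, a near 0) are averaged, while a step with variance much larger than eps (a near 1) is left largely intact, which is the edge-preserving behaviour the abstract refers to.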
A process is described for dissolution of spent high efficiency particulate air (HEPA) filters and then combining the complexed filter solution with other radioactive wastes prior to calcining the mixed and blended waste feed. The process is an alternate to a prior method of acid leaching the spent filters which is an inefficient method of treating spent HEPA filters for disposal. 4 figures.
Brewer, K.N.; Murphy, J.A.
A study was made of the performance of fibrous filters with the specific objective of obtaining design information for high-velocity air filters. Experimental data on pressure drop, collection efficiency, and life for fibrous filters were obtained with two supercooled liquid aerosols (0.3 and 1.4 microns diameter) and one solid aerosol (1.2 micron diameter). Filter fiber size
T. E. Wright; R. J. Stasny; C. E. Lapple
Agafonov's (1970) method for the synthesis of polynomial microwave filters is extended to filters whose circuits cannot be divided into individual parts. Filter design relationships are obtained for three classes of circuits. The proposed method can be applied to circuits with lumped and distributed constants. It is particularly effective for the synthesis of complex filters, e.g., those based on coupled microstrip lines.
Agafonov, V. M.
This review summarizes the research progress made so far on electret air filters used for separation of airborne particles from complex air stream. A set of different categories of these filters are delineated and the methods of manufacturing of these filters are described. The principles and mechanisms of filtration and modeling of pressure drop by these filters are analyzed. The
Rashmi Thakur; Dipayan Das; Apurba Das
Leibniz filters play a prominent role in the theory of protoalgebraic logics. In  the problem of the definability of Leibniz filters is considered. Here we study the definability of Leibniz filters with parameters. The main result of the paper says that a protoalgebraic logic S has its strong version weakly algebraizable iff it has its Leibniz filters explicitly definable
This report documents progress through May 16, 1990 in the marketing of the Mobile K' filter. This air filter traps fine particulates. A total number of 167 of the filter units have been sold. An effort to increase sales by lowering the cost of the units by delivering the filters unassembled is under way. (GHH)
An electric air filter cartridge has a cylindrical inner high voltage electrode, a layer of filter material, and an outer ground electrode formed of a plurality of segments moveably connected together. The outer electrode can be easily opened to remove or insert filter material. Air flows through the two electrodes and the filter material and is exhausted from the center of the inner electrode.
A major challenge in the analysis of human genetic variation is to distinguish functional from nonfunctional SNPs. Discovering these functional SNPs is one of the main goals of modern genetics and genomics studies. There is a need to effectively and efficiently identify functionally important non-synonymous SNPs (nsSNPs) which may be deleterious or disease causing and to identify their molecular effects. The prediction of the phenotype of nsSNPs by computational analysis may provide a good way to explore the function of nsSNPs and their relationship with susceptibility to disease. In this context, we surveyed and compared variation databases along with in silico prediction programs to assess the effects of deleterious functional variants on protein functions. We also applied these methods as a first-pass filter to identify the deleterious substitutions worth pursuing in further experimental research. In this analysis, we used the existing computational methods to explore the mutation-structure-function relationship in the HGD gene, which causes alkaptonuria.
Magesh, R.; George Priya Doss, C.
The Haplotype Map (HapMap) project recently generated genotype data for more than 1 million single-nucleotide polymorphisms (SNPs) in four population samples. The main application of the data is in the selection of tag single-nucleotide polymorphisms (tSNPs) to use in association studies. The usefulness of this selection process needs to be verified in populations outside those used for the HapMap project. In addition, it is not known how well the data represent the general population, as only 90-120 chromosomes were used for each population and since the genotyped SNPs were selected so as to have high frequencies. In this study, we analyzed more than 1,000 individuals from Estonia. The population of this northern European country has been influenced by many different waves of migrations from Europe and Russia. We genotyped 1,536 randomly selected SNPs from two 500-kbp ENCODE regions on Chromosome 2. We observed that the tSNPs selected from the CEPH (Centre d'Etude du Polymorphisme Humain) from Utah (CEU) HapMap samples (derived from US residents with northern and western European ancestry) captured most of the variation in the Estonia sample. (Between 90% and 95% of the SNPs with a minor allele frequency of more than 5% have an r2 of at least 0.8 with one of the CEU tSNPs.) Using the reverse approach, tags selected from the Estonia sample could almost equally well describe the CEU sample. Finally, we observed that the sample size, the allelic frequency, and the SNP density in the dataset used to select the tags each have important effects on the tagging performance. Overall, our study supports the use of HapMap data in other Caucasian populations, but the SNP density and the bias towards high-frequency SNPs have to be taken into account when designing association studies. PMID:16532062
Montpetit, Alexandre; Nelis, Mari; Laflamme, Philippe; Magi, Reedik; Ke, Xiayi; Remm, Maido; Cardon, Lon; Hudson, Thomas J; Metspalu, Andres
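The tagging criterion in the HapMap/Estonia abstract above (a SNP counts as captured if it has an r2 of at least 0.8 with some tag SNP) can be sketched as follows. This is a simplified illustration, not the study's method: r2 is computed here as the squared Pearson correlation of 0/1/2 genotype counts, whereas the study may have used haplotype-based estimates, and the function names and the dictionary layout are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("genotype vectors must be non-empty and equal length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP cannot be tagged or act as a tag
    return (sxy * sxy) / (sxx * syy)


def tagging_coverage(genotypes, tag_ids, threshold=0.8):
    """Fraction of SNPs whose best r2 with any tag SNP reaches the threshold.

    genotypes maps a SNP id to its per-individual minor-allele counts (0/1/2).
    """
    if not tag_ids or not genotypes:
        raise ValueError("need at least one SNP and one tag")
    captured = 0
    for snp_genotypes in genotypes.values():
        best = max(pearson_r2(snp_genotypes, genotypes[t]) for t in tag_ids)
        if best >= threshold:
            captured += 1
    return captured / len(genotypes)
```

To mirror the comparison described in the abstract, the tags would be chosen in one sample and the coverage evaluated on genotypes from the other sample, restricted to SNPs above the chosen minor allele frequency.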
A large number of genome-wide association studies have been performed during the past five years to identify associations between SNPs and human complex diseases and traits. The assignment of a functional role for the identified disease-associated SNP is not straight-forward. Genome-wide expression quantitative trait locus (eQTL) analysis is frequently used as the initial step to define a function while allele-specific gene expression (ASE) analysis has not yet gained a wide-spread use in disease mapping studies. We compared the power to identify cis-acting regulatory SNPs (cis-rSNPs) by genome-wide allele-specific gene expression (ASE) analysis with that of traditional expression quantitative trait locus (eQTL) mapping. Our study included 395 healthy blood donors for whom global gene expression profiles in circulating monocytes were determined by Illumina BeadArrays. ASE was assessed in a subset of these monocytes from 188 donors by quantitative genotyping of mRNA using a genome-wide panel of SNP markers. The performance of the two methods for detecting cis-rSNPs was evaluated by comparing associations between SNP genotypes and gene expression levels in sample sets of varying size. We found that up to 8-fold more samples are required for eQTL mapping to reach the same statistical power as that obtained by ASE analysis for the same rSNPs. The performance of ASE is insensitive to SNPs with low minor allele frequencies and detects a larger number of significantly associated rSNPs using the same sample size as eQTL mapping. An unequivocal conclusion from our comparison is that ASE analysis is more sensitive for detecting cis-rSNPs than standard eQTL mapping. Our study shows the potential of ASE mapping in tissue samples and primary cells which are difficult to obtain in large numbers.
Almlof, Jonas Carlsson; Lundmark, Per; Lundmark, Anders; Ge, Bing; Maouche, Seraya; Goring, Harald H. H.; Liljedahl, Ulrika; Enstrom, Camilla; Brocheton, Jessy; Proust, Carole; Godefroy, Tiphaine; Sambrook, Jennifer G.; Jolley, Jennifer; Crisp-Hihn, Abigail; Foad, Nicola; Lloyd-Jones, Heather; Stephens, Jonathan; Gwilliam, Rhian; Rice, Catherine M.; Hengstenberg, Christian; Samani, Nilesh J.; Erdmann, Jeanette; Schunkert, Heribert; Pastinen, Tomi; Deloukas, Panos; Goodall, Alison H.; Ouwehand, Willem H.; Cambien, Francois; Syvanen, Ann-Christine
Background The single nucleotide polymorphisms (SNPs) in matrix metalloproteinase 1 (MMP-1) play important roles in some cancers. This study examined the associations between individual SNPs or haplotypes in MMP-1 and susceptibility, clinicopathological parameters and prognosis of gastric cancer in a large sample of the Han population in northern China. Methods In this case–controlled study, there were 404 patients with gastric cancer and 404 healthy controls. Seven SNPs were genotyped using the MALDI-TOF MS system. Then, SPSS software, Haploview 4.2 software, Haplo.stats software and THESIAS software were used to estimate the association between individual SNPs or haplotypes of MMP-1 and gastric cancer susceptibility, progression and prognosis. Results Among the seven SNPs, no individual SNP was correlated with gastric cancer risk. Moreover, only the rs470206 genotype was correlated with histologic grade: the patients with GA/AA had better cell differentiation than the patients with genotype GG (OR=0.573; 95%CI: 0.353–0.929; P=0.023). Then, we constructed a four-marker haplotype block that contained 4 common haplotypes: TCCG, GCCG, TTCG and TTTA. However, none of the four common haplotypes was correlated with gastric cancer risk, and we did not find any relationship between these haplotypes and clinicopathological parameters in gastric cancer. Furthermore, neither individual SNPs nor haplotypes were associated with the survival of patients with gastric cancer. Conclusions This study evaluated polymorphisms of the MMP-1 gene in gastric cancer with a MALDI-TOF MS method in a large northern Chinese case-controlled cohort. Our results indicated that these seven SNPs of MMP-1 might not be useful as significant markers to predict gastric cancer susceptibility, progression or prognosis, at least in the Han population in northern China.
Wang, Zhen-Ning; Gao, Peng; Li, Ai-Lin; Liang, Ji-Wang; Zhu, Jin-Liang; Xu, Ying-Ying; Xu, Hui-Mian
Background Genome-wide association studies (GWAS) have successfully identified a large number of single nucleotide polymorphisms (SNPs) that are associated with a wide range of human diseases. However, many of these disease-associated SNPs are located in non-coding regions and have remained largely unexplained. Recent findings indicate that disease-associated SNPs in human large intergenic non-coding RNA (lincRNA) may lead to susceptibility to diseases through their effects on lincRNA expression. There is, therefore, a need to specifically record these SNPs and annotate them as potential candidates for disease. Description We have built LincSNP, an integrated database, to identify and annotate disease-associated SNPs in human lincRNAs. The current release of LincSNP contains approximately 140,000 disease-associated SNPs (or linkage disequilibrium SNPs), which can be mapped to around 5,000 human lincRNAs, together with their comprehensive functional annotations. The database also contains annotated, experimentally supported SNP-lincRNA-disease associations and disease-associated lincRNAs. It provides flexible search options for data extraction and searches can be performed by disease/phenotype name, SNP ID, lincRNA name and chromosome region. In addition, we provide users with a link to download all the data from LincSNP and have developed a web interface for the submission of novel identified SNP-lincRNA-disease associations. Conclusions The LincSNP database aims to integrate disease-associated SNPs and human lincRNAs, which will be an important resource for the investigation of the functions and mechanisms of lincRNAs in human disease. The database is available at http://bioinfo.hrbmu.edu.cn/LincSNP.
BACKGROUND: Phosphorylation is a reversible post-translational modification that affects the intrinsic properties of proteins, such as structure and function. Non-synonymous single nucleotide polymorphisms (nsSNPs) result in the substitution of the encoded amino acids and thus are likely to alter the phosphorylation motifs in the proteins. METHODS: In this study, we used the web-based NetPhos tool to predict candidate nsSNPs that
Sevtap Savas; Hilmi Ozcelik
New methods were investigated of using optical interference coatings to produce bandpass filters for the spectral region 110 nm to 200 nm. The types of filter are: triple cavity metal dielectric filters; all dielectric reflection filters; and all dielectric Fabry Perot type filters. The latter two types use thorium fluoride and either cryolite films or magnesium fluoride films in the stacks. The optical properties of the thorium fluoride were also measured.
Baumeister, P. W.
Coronary artery disease (CAD) is one of the leading causes of death worldwide and is influenced by both environmental and genetic factors. Several recent genome-wide association studies (GWAS) have reported the association of multiple single nucleotide polymorphisms (SNPs), mainly in the 9p21 region, with CAD. However, the association of these SNPs with CAD has not been rigorously tested in the Indian population, which accounts for the highest incidence of CAD in the world. Herein, we genotyped six such SNPs (rs10116277, rs10757274, rs1333040, rs2383206, rs2383207 and rs1994016) identified through GWAS, in 754 individuals (311 angiography-confirmed CAD patients and 443 treadmill test controls) recruited mainly from North India to evaluate whether these SNPs were associated with CAD. The minor allele frequency of these six SNPs was comparable to that reported in the respective GWAS. We found that three of these SNPs (rs10116277, rs1333040 and rs2383206) present at the locus 9p21 were significantly associated with CAD even after controlling for confounding factors such as age, sex, body mass index, homocysteine, hypertension, diabetes, smoking, diet, etc. In conclusion, the locus 9p21, found to be significantly associated with cardiovascular diseases in Caucasian populations, seems to be important in the North Indian population as well. PMID:20718794