Hi, I'm interested in ASE in plant hybrids, of which I have an F2 backcross. F2 hybrid progeny can have both homospecific (or intra) or heterospecific (inter) haplotype combinations. At the same SNP, I will have some individuals that are both heterospecific, and homospecific, most often in unequal proportions. For every SNP, I want to statistically compare changes in allelic ratios (ref/alt+ref) between these haplotype combos while accounting for variation within individuals.
Note that this design is fundamentally different from traditional unpaired condition-specific DE ASE analysis (control vs treatment), since the 'condition' (homo or hetero haplotype combo) is not the same for all SNPs in an individual (due to recombination, some haplotype combinations may be heterospecific, while others may be homospecific). I create a toy example of data from a single SNP so that you guys can determine if I'm on the right track.
Here's the data with counts for each indivuals's reference and alternate allele counts in successive order
hap <- c(rep("het",6),rep("hom",4))
ind <- factor(rep(1:5,each = 2))
allele <- factor(rep(c("ref","alt"),5))
counts <- c(9,19,2,4,10,11,13,16,13,20)
Create counts matrix and sample data
counts_matrix <- t(matrix(counts)) samps <- cbind.data.frame(ind,hap,allele)
Create DESeq data structure with "dummy" design
dds <- DESeqDataSetFromMatrix(counts_matrix,samps,design = ~ind + allele)
Add a within-group identifier as suggested to prevent confounding
dds$ind.n <- factor(c(rep(1:3,each = 2),rep(1:2,each = 2)))
So now our coldata looks like this:
colData(dds) DataFrame with 10 rows and 4 columns ind hap allele ind.n <factor> <factor> <factor> <integer> 1 1 het ref 1 2 1 het alt 1 3 2 het ref 2 4 2 het alt 2 5 3 het ref 3 6 3 het alt 3 7 4 hom ref 1 8 4 hom alt 1 9 5 hom ref 2 10 5 hom alt 2
Create custom design matrix and remove all-zero columns to get full rank. I'm not sure if I should add or remove the intercept. Would you guys say this design is correct to determine differences in allelic ratios between homo/het groups while controlling for uninteresting within individual variation?
mm <- model.matrix(~ 0 + hap + hap:ind.n + hap:allele, colData(dds))
all.zero <- apply(mm, MARGIN = 2, function(x) all(x==0)) #detect nonzero cols idx <- which(all.zero) #index nonzero cols mm <- mm[,-idx] #remove all-zero cols
Replace dds design with correct formula used to create custom mm
design(dds) <- ~0 + hap+ hap:ind.n + hap:allele
Set size factors to 1 (specific to ASE analysis)
sizeFactors(dds) <- rep(1,ncol(counts_matrix))
Manually run all steps of DESeq. Dispersion estimation is not possible, since I can't process multiple SNPs, genes at once. Is this a shortfoll that can be fixed?
dds <- estimateDispersionsGeneEst(dds,modelMatrix = mm) dispersions(dds) <- mcols(dds)$dispGeneEst dds <- nbinomWaldTest(dds, betaPrior = FALSE, modelMatrix = mm) #I've seen people use LRT, but I think this appropriate
These are what the results look like:
results(dds) log2 fold change (MLE): haphom.alleleref Wald test p-value: haphom.alleleref DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 1 11.7 -0.469485 0.3713064 -1.264414 0.2060815 0.2060815
As described by Michael Love, I convert LFC to allelic ratios (ref/alt+ref) in the two groups, as well as the contrast between groups as follows (described previously by Michael Love in the link below):
https://rpubs.com/mikelove/ase
hom_results <- results(dds, name = "haphet.alleleref") hom_ref.vs.alt <- 2^hom_results$log2FoldChange hom_ref.vs.tot <- hom_ref.vs.alt/(1+hom_ref.vs.alt) hom_p <- hom_results[,"pvalue"] het_results <- results(dds, name = "haphom.alleleref") het_ref.vs.alt <- 2^het_results$log2FoldChange het_ref.vs.tot <- het_ref.vs.alt/(1+het_ref.vs.alt) het_p <- het_results[,"pvalue"] comp_results <- results(dds,contrast = list("haphom.alleleref","haphet.alleleref")) diff <- 2^comp_results$log2FoldChange * (1/(1 + hom_ref.vs.alt)) / (1/(1 + het_ref.vs.alt)) #can you check that this is correct? diff_p <- comp_results[,"pvalue"]
Then I create a vector with calculated values: ASE in the respective groups and difference in ASE between groups. The vector will be appended to a larger data structure using an lapply over all SNPs from which I can p.adjust("fdr") for multiple testing. What do you guys think?
out <- c(homAR = hom_ref.vs.tot, hom_p = hom_p, hetAR = het_ref.vs.tot,het_p = het_p,diff = diff, diff_p = diff_p) out homAR hom_p hetAR het_p diff diff_p 0.38181824 0.08255042 0.41935489 0.20608150 1.24490400 0.67943076
Thanks a lot in advance. And thanks for the really clear manuals and answers to questions. I really has helped me a long way to using DE methods to detect ASE.
Any help, comments, suggestions, flaws, and criticism is highly appreciated!
Stephan Engelbrecht
sessionInfo() R version 3.3.3 (2017-03-06) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X Yosemite 10.10.3 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.28.21 BiocInstaller_1.22.3 stringr_1.2.0 gridExtra_2.3 reshape2_1.4.2 [6] ggplot2_2.2.1 DESeq2_1.12.4 SummarizedExperiment_1.2.3 Biobase_2.32.0 GenomicRanges_1.24.3 [11] GenomeInfoDb_1.8.7 IRanges_2.6.1 S4Vectors_0.10.3 BiocGenerics_0.18.0 bindrcpp_0.2 [16] tidyr_0.7.2 dplyr_0.7.4 loaded via a namespace (and not attached): [1] bit64_0.9-7 splines_3.3.3 Formula_1.2-2 assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.0 [7] RSQLite_2.0 backports_1.1.1 lattice_0.20-35 glue_1.2.0 digest_0.6.12 RColorBrewer_1.1-2 [13] XVector_0.12.1 checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6 Matrix_1.2-11 plyr_1.8.4 [19] XML_3.98-1.9 pkgconfig_2.0.1 genefilter_1.54.2 zlibbioc_1.18.0 purrr_0.2.4 xtable_1.8-2 [25] scales_0.5.0 BiocParallel_1.6.6 htmlTable_1.9 tibble_1.3.4 annotate_1.50.1 nnet_7.3-12 [31] lazyeval_0.2.1 survival_2.41-3 magrittr_1.5 memoise_1.1.0 foreign_0.8-69 tools_3.3.3 [37] data.table_1.10.4-3 munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.6 AnnotationDbi_1.34.4 rlang_0.1.4 [43] grid_3.3.3 RCurl_1.95-4.8 htmlwidgets_0.9 bitops_1.0-6 base64enc_0.1-3 labeling_0.3 [49] gtable_0.2.0 DBI_0.7 R6_2.2.2 knitr_1.17 bit_1.1-12 bindr_0.1 [55] Hmisc_4.0-3 stringi_1.1.5 Rcpp_0.12.13 geneplotter_1.50.0 rpart_4.1-11 acepack_1.4.1 [61] tidyselect_0.2.3
Hi Michael
Thank you for your rapid response! Unfortunately I'm unable to rbind 'dds' experiments together, since the number of samples assayed for each haplotype 'condition' varies across genes/SNPs. Rbinding SummarizedExperiments requires that objects must have the same number of samples.
I don't think that there is any workaround for this, unless you know of anything, in which case some code would be extremely helpful.
Again, thanks for going over this!
Ah, then it would take more hacking. You could really run estimateDispersionsFit() on a dummy dataset, it just looks at mcols(dds)... then distribute the dispersionFunction() to the split apart dataset. But this seems like a lot of hacking, and there may be specialized software that handles this for you.
Hi Michael!
Yes, I think the hacking is going to more effort than what it's worth. I'm pretty new at this, but what would you think of specifying the design matrix and counts data to glm.nb(), and a maximum likelihood approach can then be used to estimate the overdispersion parameter from all allelic counts across individuals for a particular SNP, which can then be appended to dds? I'll add some code if you think this is feasible.
I realise this is fundamentally different from the DESeq2 approach which 'borrows' information from counts of other genes, but I will have a large number of samples in each group (>10), and should therefore be able to get an accurate estimate? What do you think?
Thanks again, you are really helping me out alot!
hi Stephan,
I'm not sure about the point of glm.nb. You can use estimateDispersionsGeneEst() to quickly obtain the MLE dispersions. These are stored in mcols(dds)$dispGeneEst, and you can set them to be final with dispersions(dds) <- x. n=10 per group is ok, depending on how many groups, you may get a decent estimator for the dispersion using the MLE.