SNP-level Allele-specific expression detection using DESeq2 in an unpaired experimental design
2
1
Entering edit mode
@engelbrechtstephan-14355
Last seen 5.9 years ago

Hi, I'm interested in ASE in plant hybrids, of which I have an F2 backcross. F2 hybrid progeny can have both homospecific (or intra) or heterospecific (inter) haplotype combinations. At the same SNP, I will have some individuals that are both heterospecific, and homospecific, most often in unequal proportions. For every SNP, I want to statistically compare changes in allelic ratios (ref/alt+ref) between these haplotype combos while accounting for variation within individuals.

Note that this design is fundamentally different from traditional unpaired condition-specific DE ASE analysis (control vs treatment), since the 'condition' (homo or hetero haplotype combo) is not the same for all SNPs in an individual (due to recombination, some haplotype combinations may be heterospecific, while others may be homospecific). I create a toy example of data from a single SNP so that you guys can determine if I'm on the right track.

Here's the data with counts for each indivuals's reference and alternate allele counts in successive order

hap <- c(rep("het",6),rep("hom",4))
ind <- factor(rep(1:5,each = 2))
allele <- factor(rep(c("ref","alt"),5))
counts <- c(9,19,2,4,10,11,13,16,13,20)

Create counts matrix and sample data

counts_matrix <- t(matrix(counts))
samps <- cbind.data.frame(ind,hap,allele)

Create DESeq data structure with "dummy" design

dds <- DESeqDataSetFromMatrix(counts_matrix,samps,design = ~ind + allele)

Add a within-group identifier as suggested to prevent confounding

dds$ind.n <- factor(c(rep(1:3,each = 2),rep(1:2,each = 2)))

So now our coldata looks like this:

colData(dds)

DataFrame with 10 rows and 4 columns
        ind      hap   allele     ind.n
   <factor> <factor> <factor> <integer>
1         1      het      ref         1
2         1      het      alt         1
3         2      het      ref         2
4         2      het      alt         2
5         3      het      ref         3
6         3      het      alt         3
7         4      hom      ref         1
8         4      hom      alt         1
9         5      hom      ref         2
10        5      hom      alt         2

Create custom design matrix and remove all-zero columns to get full rank. I'm not sure if I should add or remove the intercept. Would you guys say this design is correct to determine differences in allelic ratios between homo/het groups while controlling for uninteresting within individual variation?

mm <- model.matrix(~ 0 + hap + hap:ind.n + hap:allele, colData(dds))
all.zero <- apply(mm, MARGIN = 2, function(x) all(x==0)) #detect nonzero cols
idx <- which(all.zero) #index nonzero cols
mm <- mm[,-idx] #remove all-zero cols

Replace dds design with correct formula used to create custom mm

design(dds) <- ~0 + hap+ hap:ind.n + hap:allele

Set size factors to 1 (specific to ASE analysis)

sizeFactors(dds) <- rep(1,ncol(counts_matrix))

Manually run all steps of DESeq. Dispersion estimation is not possible, since I can't process multiple SNPs, genes at once. Is this a shortfoll that can be fixed?

dds <- estimateDispersionsGeneEst(dds,modelMatrix = mm) 
dispersions(dds) <- mcols(dds)$dispGeneEst   
dds <- nbinomWaldTest(dds, betaPrior = FALSE, modelMatrix = mm) #I've seen people use LRT, but I think this appropriate

These are what the results look like:

results(dds)

log2 fold change (MLE): haphom.alleleref 
Wald test p-value: haphom.alleleref 
DataFrame with 1 row and 6 columns
   baseMean log2FoldChange     lfcSE      stat    pvalue      padj
  <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
1      11.7      -0.469485 0.3713064 -1.264414 0.2060815 0.2060815

As described by Michael Love, I convert LFC to allelic ratios (ref/alt+ref) in the two groups, as well as the contrast between groups as follows (described previously by Michael Love in the link below):

https://rpubs.com/mikelove/ase

hom_results <- results(dds, name = "haphet.alleleref")
hom_ref.vs.alt <- 2^hom_results$log2FoldChange
hom_ref.vs.tot <- hom_ref.vs.alt/(1+hom_ref.vs.alt)
hom_p <- hom_results[,"pvalue"]

het_results <- results(dds, name = "haphom.alleleref")
het_ref.vs.alt <- 2^het_results$log2FoldChange
het_ref.vs.tot <- het_ref.vs.alt/(1+het_ref.vs.alt)
het_p <- het_results[,"pvalue"]

comp_results <- results(dds,contrast = list("haphom.alleleref","haphet.alleleref"))
diff <- 2^comp_results$log2FoldChange * (1/(1 + hom_ref.vs.alt)) / (1/(1 + het_ref.vs.alt)) #can you check that this is correct?
diff_p <- comp_results[,"pvalue"]

 

Then I create a vector with calculated values: ASE in the respective groups and difference in ASE between groups. The vector will be appended to a larger data structure using an lapply over all SNPs from which I can p.adjust("fdr") for multiple testing. What do you guys think?

out <- c(homAR = hom_ref.vs.tot, hom_p = hom_p, hetAR = het_ref.vs.tot,het_p =  het_p,diff = diff, diff_p = diff_p)

out
     homAR      hom_p      hetAR      het_p       diff     diff_p 
0.38181824 0.08255042 0.41935489 0.20608150 1.24490400 0.67943076 

Thanks a lot in advance. And thanks for the really clear manuals and answers to questions. I really has helped me a long way to using DE methods to detect ASE.

Any help, comments, suggestions, flaws, and criticism is highly appreciated!

Stephan Engelbrecht

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.28.21              BiocInstaller_1.22.3       stringr_1.2.0              gridExtra_2.3              reshape2_1.4.2         
[6] ggplot2_2.2.1              DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0             GenomicRanges_1.24.3   
[11] GenomeInfoDb_1.8.7         IRanges_2.6.1              S4Vectors_0.10.3           BiocGenerics_0.18.0        bindrcpp_0.2           
[16] tidyr_0.7.2                dplyr_0.7.4               

loaded via a namespace (and not attached):
[1] bit64_0.9-7          splines_3.3.3        Formula_1.2-2        assertthat_0.2.0     latticeExtra_0.6-28  blob_1.1.0        
[7] RSQLite_2.0          backports_1.1.1      lattice_0.20-35      glue_1.2.0           digest_0.6.12        RColorBrewer_1.1-2
[13] XVector_0.12.1       checkmate_1.8.5      colorspace_1.3-2     htmltools_0.3.6      Matrix_1.2-11        plyr_1.8.4        
[19] XML_3.98-1.9         pkgconfig_2.0.1      genefilter_1.54.2    zlibbioc_1.18.0      purrr_0.2.4          xtable_1.8-2      
[25] scales_0.5.0         BiocParallel_1.6.6   htmlTable_1.9        tibble_1.3.4         annotate_1.50.1      nnet_7.3-12       
[31] lazyeval_0.2.1       survival_2.41-3      magrittr_1.5         memoise_1.1.0        foreign_0.8-69       tools_3.3.3       
[37] data.table_1.10.4-3  munsell_0.4.3        locfit_1.5-9.1       cluster_2.0.6        AnnotationDbi_1.34.4 rlang_0.1.4       
[43] grid_3.3.3           RCurl_1.95-4.8       htmlwidgets_0.9      bitops_1.0-6         base64enc_0.1-3      labeling_0.3      
[49] gtable_0.2.0         DBI_0.7              R6_2.2.2             knitr_1.17           bit_1.1-12           bindr_0.1         
[55] Hmisc_4.0-3          stringi_1.1.5        Rcpp_0.12.13         geneplotter_1.50.0   rpart_4.1-11         acepack_1.4.1     
[61] tidyselect_0.2.3    
deseq2 rnaseq differential gene expression • 1.5k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen just now
United States

hi Stephan,

"Is this a shortfall that can be fixed?" This is a good question. The step that estimates the parameters for the Bayesian prior on dispersions doesn't need a model matrix. So I think you could rbind() the dds together, then run estimateDispersionsFit(), then split them apart again for estimateDispersionMAP() and nbinomWaldTest(), both with custom model matrices. This way you would avoid having to use MLE estimates of dispersion here:

dispersions(dds) <- mcols(dds)$dispGeneEst   

Does that make sense? Let me know if you need more code, or if there are any hitches. But I think this approach should work. You can use Wald or LRT, depending on if you want to test individual contrasts or test the null hypothesis for multiple coefficients at once. This is how we make use of Wald and LRT in DESeq2 at least.

 

ADD COMMENT
0
Entering edit mode

Hi Michael

Thank you for your rapid response! Unfortunately I'm unable to rbind 'dds' experiments together, since the number of samples assayed for each haplotype 'condition' varies across genes/SNPs. Rbinding SummarizedExperiments requires that objects must have the same number of samples. 

I don't think that there is any workaround for this, unless you know of anything, in which case some code would be extremely helpful.

Again, thanks for going over this!

ADD REPLY
1
Entering edit mode

Ah, then it would take more hacking. You could really run estimateDispersionsFit() on a dummy dataset, it just looks at mcols(dds)... then distribute the dispersionFunction() to the split apart dataset. But this seems like a lot of hacking, and there may be specialized software that handles this for you.

ADD REPLY
0
Entering edit mode

Hi Michael!

Yes, I think the hacking is going to more effort than what it's worth. I'm pretty new at this, but what would you think of specifying the design matrix and counts data to glm.nb(), and a maximum likelihood approach can then be used to estimate the overdispersion parameter from all allelic counts across individuals for a particular SNP, which can then be appended to dds? I'll add some code if you think this is feasible. 

I realise this is fundamentally different from the DESeq2 approach which 'borrows' information from counts of other genes, but I will have a large number of samples in each group (>10), and should therefore be able to get an accurate estimate? What do you think?

Thanks again, you are really helping me out alot!

ADD REPLY
0
Entering edit mode

hi Stephan,

I'm not sure about the point of glm.nb. You can use estimateDispersionsGeneEst() to quickly obtain the MLE dispersions. These are stored in mcols(dds)$dispGeneEst, and you can set them to be final with dispersions(dds) <- x. n=10 per group is ok, depending on how many groups, you may get a decent estimator for the dispersion using the MLE.

ADD REPLY
0
Entering edit mode
@engelbrechtstephan-14355
Last seen 5.9 years ago

Hi Michael. I apologise, I was a bit confused by your suggestion. but after going through the DESeq2 manual I came up with the following solution for obtaining MAP dispersion estimates:

I created a placeholder dataset with matrix counts generated by makeExampleDESeqDataset, to which dispersion estimates and other relevant data is appended to from different DESeq objects (each corresponding to a particular SNP). Since we are only interested in gene metadata to which dispersion estimates can be added, the number of samples (and counts generated for them) don't really matter, but the number of genes has to equal true number of genes in the dataset. My procedure is as follows:

dds <- estimateDispersionsGeneEst(dds, modelMatrix = mm) # for every DESeq object to obtain gene-wise dispersion estimates.

The dispersion estimates here replace dispersion estimates in the placeholder dataset (as well as baseMean and baseVar). Custom model matrices are saved in a list by SNP name, which are to be accessed in later steps (MAP estimation, nbinomWaldTest). 

placeholder <- estimateDispersionsFit(placeholder) #DispersionFit is determined for all SNPs

After which dispersion function is distributed to all DESeq objects with lapply()

dispersionFunction(dds) <- dispersionFunction(placeholder)

Now the rest of the steps can be run on each DESeq object

dds <- estimateDispersionsMAP(dds, modelMatrix = mm) #same matrix used as before, obtained from list

For a lot of dds's (each containing only counts of a single SNP) I receive the message: 

all genes have dispersion estimates < 1e-07, returning disp = 1e-07
dds <- nbinomWaldTest(dds, modelMatrix = mm) #Run Wald test after final dispersions are obtained

After this, I also add MAP dispersions of each DESeq2 object (and outlier status) to the placeholder dataset so that I can use:

plotDispEsts(dds) #to look at the dispersion

If MAP dispersion could not be estimated because MLE dispersion were below the required threshold, I returned an NA to the placeholder, and set final dispersion to the estimated MLE dispersion.

https://imgur.com/TZ2rkTB - a link to obtained dispersion plot

There are some observations I made in this plot (10 000 SNPs), which I would like your comment on:

1: The fitted line looks different than those of other common dispersion plots. The dispersion is lower at very low read coverage. Could this be because dispersion is small at very low read counts at the heterozygous SNP (<20 reads)? I don't plan to use observations < 10

2. A large proportion of SNPs showed very small MLE dispersion, could this also possibly be due to lower read counts at a single SNP compared to larger read counts of an entire gene?

3. Separation between final MAP dispersion estimates and outliers are less clear compared to standard dispersion curves. I think this is because the model matrices, and number of samples differ across SNPs?

I would very much like to know what you think about my implementation of your suggestion Michael!

Regards,

Stephan

ADD COMMENT
0
Entering edit mode

Hi Stephan,

Kudos for the initiative to make this work. I think there is an issue I forget before, which is, when you do 

dispersionFunction(dds) <- ...

The default behavior is to re-estimate the spread of the dispersion, which is used in the MAP calculation. This is the way the software works because a user can specify an arbitrary dispersion function and the software would then need to estimate the variance of MLE dispersions around the fit. However, here it causes problems because the subset data doesn't have enough points to estimate the variance. So I forgot to say, you need to add:

dispersionFunction(dds, estimateVar=FALSE) <- ...

And this will use the variance of dispersion values around the fit that was previously calculated, when you ran estimateDispersionsFit().

This should fix the issue with the labelled outliers you see in the plot.

Re: (1) and (2), the dispersion values may be different than what you see with a typical (non-ASE) RNA-seq counts. I don't see a problem with that, the methods are broadly applicable: use genes with similar average signal to construct an empirical Bayes prior on the dispersion estimates, then calculate the posterior mode, and then use this for Wald test or LRT...

ADD REPLY
0
Entering edit mode

Hi Michael

Thanks, I though it had to be something like that, I guess I just (still) don't completely grasp the input and output of every function :).

This makes sense, but I run into the following problem - It seems that adding estimateVar = FALSE means dispFit is not added to the metadata of dds.  Obviously estimateDispersionsMAP requires dispFit values (unless dispersion is smaller than 1x10-e08). So what do you think is a workaround? Do you think I can add these from the dispFit values in the aggregated (placeholder) dataset?

I'm close, I can feel it! Thanks for your continued assistance.

Stephan

ADD REPLY
0
Entering edit mode

Thanks for reporting that. That is an out of the way bug. I just added a fix for this to the devel branch, but you can also add this line to make it work in release:

dispersionFunction(dds, estimateVar=FALSE) <- dispersionFunction(placeholder)
mcols(dds)$dispFit <- dispersionFunction(dds)(mcols(dds)$baseMean)

It looks a little funny, but since 'dispersionFunction(dds)' returns a function, you can straightaway apply it to the means to get the values for dispFit.

ADD REPLY

Login before adding your answer.

Traffic: 1266 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6