Hi all,
We have a 2-factor, 2 levels per factor experimental design with 2 replicates per covariate combination (so 8 samples total, n=2 per cell) that we are analysing using DESeq2.
Normally DESeq2 generates a Cook's distance for each replicate within a cell (by "cell" I mean unique combination of covariates) and flags those observations with a Cook's distance above a certain threshold. Now, I understand that with n=2 you can't tell which replicate is an outlier when you see a high Cook's distance, but this doesn't really matter for us since we are usually flagging and removing entire genes (not just individual observations) from the analysis, so we just need to detect whether one of the 2 reps is an outlier. (My understanding is that if n >= 7 the default behaviour changes to a more precise replacement of the individual observation with the high Cook's distance, but we are not aiming for this.) I see that DESeq2 still generates Cook's distances when n=2, and these roughly correlate with gene-wise dispersion. My question is can we legitimately use these to filter out genes containing an outlier?
If so, what threshold to use? From the vignette:
The default Cook’s distance cutoff for the two behaviors described above depends on the sample size and number of parameters to be estimated. The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples).
But I would not know how to calculate that threshold value.
If not, is there any other acceptable strategy for flagging genes with a likely outlier observation when n=2?
Thank you Michael. However, I found that
mcols(dds)$maxCooks
was all NAs by default when n=2, so I added it with:I hope this is equivalent to the normal calculation.
Best,
Stuart