I am trying to determine how best to manually set a cooksCutoff value for outlier detection, and have made the following observations:
quantiles:
0% 25% 50% 75% 90% 95% 99% 100%
0.00360426 0.08825488 0.21268380 0.73935583 2.970987 5.582727 13.544483 135.17274114
I've made a plot of maxCooks vs p-adj which I'm having trouble including in this post (but will continue trying to get it on here if someone more knowledgeable than I would like to take a look), but the graph essentially demonstrates that large maxCooks values do not necessarily correspond to low/significant p-adj values.
Additionally, if it makes any difference, my MA plot looks much different than what I think is to be expected. The vast majority of points, and colored points, are above the midline (me including this fact might just reveal how truly out of my depth I'm in).
When I inspect the number of significant genes I get, a lot of them seem to be influenced by the presence of a count outlier that seems to always be from the same sample. Is there a quantitative way to determine a cutoff, or am I left to eyeball it? Should I start by setting the cutoff to the top x% of maxCooks values and widdle away from there, or is there a more empirical way of approaching this? I'm pretty stumped, so any and all advice would be much appreciated.
Update: Setting a cooksCutoff value of 5 yields 3 significant genes that still seem to be influenced by a count outlier from the same sample
Thank you for your quick response. I am still waiting for the rlog transform function to finish running and will get the PCA plot in the morning. My experiment is looking at the difference in the change in gene expression between men and women as a result of a treatment. I have 113 paired samples, and used STAR to align and featureCounts to count genes. My design is ~ sex + sex:nested + sex:condition. To get the results, I first set the model matrix: apply(model, 2, function(x) all(x == 0)), get my dds: dds <- DESeq(dds, full = model, betaPrior = FALSE), and then make the contrast: sex_diff = results(results_source, contrast = list("sexFemale.conditionpost", "sexMale.conditionpost"))
My colData looks something like the following:
Any help is much appreciated, and thank you for being so generous with your time.
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] IHW_1.2.0 DESeq2_1.14.1 SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.26.2 GenomeInfoDb_1.10.2 IRanges_2.8.1 S4Vectors_0.12.1
[9] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] genefilter_1.56.0 locfit_1.5-9.1 slam_0.1-40 splines_3.3.1 lattice_0.20-34
[6] colorspace_1.3-2 htmltools_0.3.5 base64enc_0.1-3 survival_2.40-1 XML_3.98-1.5
[11] foreign_0.8-67 DBI_0.5-1 BiocParallel_1.8.1 RColorBrewer_1.1-2 plyr_1.8.4
[16] stringr_1.1.0 zlibbioc_1.20.0 munsell_0.4.3 gtable_0.2.0 htmlwidgets_0.8
[21] memoise_1.0.0 latticeExtra_0.6-28 knitr_1.15.1 geneplotter_1.52.0 fdrtool_1.2.15
[26] AnnotationDbi_1.36.2 htmlTable_1.9 Rcpp_0.12.9 acepack_1.4.1 xtable_1.8-2
[31] backports_1.0.5 scales_0.4.1 checkmate_1.8.2 Hmisc_4.0-2 annotate_1.52.1
[36] XVector_0.14.0 lpsymphony_1.2.0 gridExtra_2.2.1 ggplot2_2.2.1 digest_0.6.12
[41] stringi_1.1.2 grid_3.3.1 tools_3.3.1 bitops_1.0-6 magrittr_1.5
[46] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.2 RSQLite_1.1-2 Formula_1.2-1
[51] cluster_2.0.5 Matrix_1.2-8 data.table_1.10.2 assertthat_0.1 rpart_4.1-10
[56] nnet_7.3-12
The following links are to the images I have referenced:
boxplot of maxCooks: http://i.imgur.com/KnPaiwV.png
maxCooks vs. padj: http://i.imgur.com/MkXwc7Q.png
example of gene with outlier at cooks ~5: http://i.imgur.com/2Yrd410.png
strange looking MA plot: http://i.imgur.com/IXlUJLI.png
PCA plot: http://i.imgur.com/6eF6dXe.png
PCA plot with original PCA outleirs removed: http://i.imgur.com/fq0Z4mn.png
The samples that appear to be outliers in both PCA plots are not the ones that appear to always be the outlier, such as the one in the 3rd image.
Any help is as always much appreciated
Regarding the MA plot, this looks fine to me, I would suspect that the female treatment effect is often stronger than the male treatment effect (as in the case of MPP4, which looks like an obvious significant gene to me), and the contrast is simply picking this up. Do you have a reason to suspect this is not the case?
You mentioned it is often a single sample which has outlier counts. Can you identify this sample by looking at the boxplots of assays(dds)[["cooks"]] ? If so, I would exclude a sample if it consistently has outliers. You have plenty of replicates, it won't hurt your power at all. If you do decide to remove this sample, I would remove both the pre and post sample for that one sample.
Because you have a model with more than 113 coefficients (you have a coefficient in the model for each pair of samples), it's easy for a single outlier to pull up the Cook's cutoff, because it is affecting the coefficient related to that pair of samples, but it's not a problem for inference on the treatment:sex terms, which is what you care about. I would just turn off Cook's related functions: cooksCutoff=FALSE, and look by eye at the genes with top mcols(dds)$maxCooks. If you don't see cases where the significance seems to be driven by a single outlier, I wouldn't worry about Cook's for this dataset. Note that, I don't think the outlier is affecting MPP4 at all. That just seems to be a gene with a true difference in treatment effect across sex.
Thank you for your thoughtful response. It's definitely reassuring to hear the MA plot is fine because we do suspect the female treatment effect is stronger.
When I remove the one sample (pre and post) in MPP4, it is no longer significant with a padj < 0.05, which makes me think that an outlier is influencing the results, although it sure looks like it ought to be a significant change without the outlier: http://i.imgur.com/83SbtoG.png
Additionally, when I remove the pre and post for that one sample, the number of differentially expressed genes goes from 716 to 18 at a p-adj <= 0.05, which also makes me think that an outlier is influencing the results.
Of these 18, there still appears to be significance driven by one or two samples (http://i.imgur.com/wGE53DH.png)
Given all of this, should I stick with these 18 genes, or stick with the original 716 and accept that one point hanging in the upper right corner? I cannot express how grateful I am that you're being so generous with your time in helping me out!