Hello!
So, I ran DESeq2 on RNA-seq data from female samples with a dichotomous outcome ('yes', 'no'). I discovered that some of the samples at the end of my sample table had no value listed for condition, i.e.:
10XXX | yes |
59XX | no |
XX13 | no |
8XX7 | |
1XXX9 | |
96XX | |
1XX21 | |
XXX10 | |
My results for this set were staggering:
out of 20332 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 8004, 39%
LFC < 0 (down) : 2280, 11%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> sum(res$padj < 0.1, na.rm=TRUE)
[1] 10284
> sum(res$padj < 0.05, na.rm=TRUE)
[1] 8302
> sum(res$padj < 0.001, na.rm=TRUE)
So I want to know how DESeq2 treats those samples. They obviously weren't caught by the sanity check all(rownames(colData)==colnames(data)), so this is clearly my mistake, but I would have thought DESeq2 would drop them or treat their condition as NULL or NA. When I run DESeq2 with those samples removed, I get drastically different results (after setting independent filtering to false and selecting a threshold from a screwy-looking rejection plot):
> sum(res$padj < 0.1, na.rm=TRUE)
[1] 4
Thanks!
So the top is just my count matrix. I had a few samples at the bottom that had a sample name but no value in the actual column under 'condition'. The results table was just the first:
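The character string "" is treated as its own level when R builds a factor (DESeq2 uses factor variables and the model.matrix function to build design matrices).
It's as if you had "yes", "no", and a third option, "missing". The character string "" sorts alphabetically first, so it becomes the reference level, and you will have coefficients contrasting "no" with "" and "yes" with "". Called without a contrast, results() reports the last level against the reference, so your table compared "yes" with the unlabelled samples rather than with "no".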
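A minimal base-R sketch of this behaviour may help. The six-sample condition vector below is made up for illustration and is not the poster's data:

```r
# Hypothetical condition column: two labelled groups plus two blank entries
condition <- factor(c("yes", "no", "no", "yes", "", ""))

# The empty string is a level of its own and sorts first,
# so it becomes the reference level
levels(condition)

# The design matrix has an intercept for the "" group plus one column
# each for "no" and "yes", both measured against ""
design <- model.matrix(~ condition)
colnames(design)
```

Printing `levels(condition)` should show three levels with "" first, and the design matrix should have three columns rather than the two expected for a yes/no comparison.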
It would be better to give these samples an explicit value such as "missing", so it is clearer when other people look over your analysis. How to deal with these samples is then up to you. If you want to compare yes with no, I might remove these samples from the DESeqDataSet:
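A sketch of one way to do this in base R, on a toy sample table and count matrix. The object names `coldata` and `counts` and all values are hypothetical. A DESeqDataSet can be subset by column in the same way, for example `dds[, dds$condition != ""]`, followed by `droplevels` on the condition column:

```r
# Toy sample table with two blank condition entries (hypothetical data)
coldata <- data.frame(
  condition = c("yes", "no", "no", "", ""),
  row.names = c("s1", "s2", "s3", "s4", "s5"),
  stringsAsFactors = TRUE
)
# Toy count matrix with columns in the same order as the sample table
counts <- matrix(1:15, nrow = 3,
                 dimnames = list(paste0("gene", 1:3), rownames(coldata)))

# Flag samples whose condition is blank or NA; a check on sample names
# alone does not catch these
unlabelled <- is.na(coldata$condition) | coldata$condition == ""

# Keep only labelled samples, then drop the now-unused "" level
coldata <- coldata[!unlabelled, , drop = FALSE]
coldata$condition <- droplevels(coldata$condition)
counts <- counts[, rownames(coldata), drop = FALSE]

levels(coldata$condition)
stopifnot(all(rownames(coldata) == colnames(counts)))
```

Dropping the unused level matters: if "" stays in the factor levels after the samples are removed, the design matrix still gets a column for a group with no samples.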
I did remove the samples for the actual analysis. I just wanted an explanation for why they drove up the number of differentially expressed genes, which you answered. Thanks very much!