Hi,
I'm analysing RNA-seq data and I switched from edgeR to DESeq2 because most of my differentially expressed genes had an outlier sample, which was not an outlier in the entire dataset (judged by PCA and overall Cook's distance).
My question is related to the criteria used by DESeq2 to count the number of replicates in my experiment.
I have 3 experimental groups, A,B and C, each of them with 16 samples. If I run a LRT model with no batch, the model considers I have over 7 replicates per group and replaces counts with large Cook’s distance with the trimmed mean.
dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel,
design = ~exp)
dgeszLRT <- DESeq(dgeszexp, test="LRT", reduced=~1)
dgeszLRTres <- results(dgeszLRT)
I get:
-- replacing outliers and refitting for 240 genes
outliers [1] : 0, 0%
If I include sex and week of experiment as covariates in my model, the model considers I have less than 7 but more than 3 replicates per group (I have 4 samples same sex, same week, same exposure) and removes the genes with outliers instead instead of replacing the counts.
dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel1,
design = ~sex+week+exp)
dgeszLRT4 <- DESeq(dgeszexp4, test="LRT", reduced=~sex + week)
dgeszLRTres4 <- results(dgeszLRT4)
I get:
outliers [1] : 11, 0.057%
In my opinion, I believe I still have 16 replicates per group and I'm only trying to control for the fact that I have different sex and the experiment was run in two weeks.
This gets worse if I try to include normalisation factors estimated by RUV-seq. Since each factor is different for each sample, DESeq2 will assume I only have 1 sample per group and it won't deal with my outliers.
dgeszLRT <- DESeq(dgesznowat, test="LRT", reduced=~sex + week + ruv)
Any ideas on how to get around this?
Thanks in advance!
I don't want to go on a tangent here, but I did use the quasi likelihood framework from edgeR, which did not solve the problem. My guess is that these genes do not have outlying dispersions.
I got around it by running the model without covariates, extract the replaces counts and re-run the model with the covariates. Since in my model with RUV estimates there will be absolutely no replacement of outliers of exclusion of genes I feel this is ok.
But I think something should be added to the reference manual to warn users about the fact that adding batch might change the way DESeq2 deals with the number of replicates - this wasn't intuitive for me.
Thanks for you help!