Question

DESeq2: Outliers, covariates and number of replicates

0

Entering edit mode

j.viana • 0

@jviana-11863

Last seen 8.4 years ago

Hi,

I'm analysing RNA-seq data and I switched from edgeR to DESeq2 because most of my differentially expressed genes had an outlier sample, which was not an outlier in the entire dataset (judged by PCA and overall Cook's distance).

My question is related to the criteria used by DESeq2 to count the number of replicates in my experiment.

I have 3 experimental groups, A,B and C, each of them with 16 samples. If I run a LRT model with no batch, the model considers I have over 7 replicates per group and replaces counts with large Cook’s distance with the trimmed mean.

dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel,
design = ~exp)

dgeszLRT <- DESeq(dgeszexp, test="LRT", reduced=~1)
dgeszLRTres <- results(dgeszLRT)

I get:

-- replacing outliers and refitting for 240 genes

outliers [1] : 0, 0%

If I include sex and week of experiment as covariates in my model, the model considers I have less than 7 but more than 3 replicates per group (I have 4 samples same sex, same week, same exposure) and removes the genes with outliers instead instead of replacing the counts.

dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel1,
design = ~sex+week+exp)

dgeszLRT4 <- DESeq(dgeszexp4, test="LRT", reduced=~sex + week)
dgeszLRTres4 <- results(dgeszLRT4)

I get:

outliers [1] : 11, 0.057%

In my opinion, I believe I still have 16 replicates per group and I'm only trying to control for the fact that I have different sex and the experiment was run in two weeks.

This gets worse if I try to include normalisation factors estimated by RUV-seq. Since each factor is different for each sample, DESeq2 will assume I only have 1 sample per group and it won't deal with my outliers.

dgeszLRT <- DESeq(dgesznowat, test="LRT", reduced=~sex + week + ruv)

Any ideas on how to get around this?

Thanks in advance!

deseq2 ruvseq cookscutoff covariates batch • 3.0k views

ADD COMMENT • link 8.4 years ago j.viana • 0

0

Entering edit mode

I don't want to go on a tangent here, but I did use the quasi likelihood framework from edgeR, which did not solve the problem. My guess is that these genes do not have outlying dispersions.

ADD REPLY • link 8.4 years ago j.viana • 0

0

Entering edit mode

I got around it by running the model without covariates, extract the replaces counts and re-run the model with the covariates. Since in my model with RUV estimates there will be absolutely no replacement of outliers of exclusion of genes I feel this is ok.

But I think something should be added to the reference manual to warn users about the fact that adding batch might change the way DESeq2 deals with the number of replicates - this wasn't intuitive for me.

Thanks for you help!

ADD REPLY • link 8.4 years ago j.viana • 0

score 0 · Answer 1 · 2016-11-16

I don't have a comment regarding your DESeq2 question, but I don't understand your reasoning for switching between edgeR and DESeq2.

I mean, using either is a fine choice but it sounds like you think there is a reason that forced your switch. Your 'outlier sample' scenario should be rather nicely taken care of when you use edgeR's quasi likelihood framework. If you're not familiar with it, have a read through their f1000research paper.

score 0 · Answer 2 · 2016-11-16

Outliers are defined with respect to the fitted coefficients and the expected values, which change when you introduce covariates to account for more differences. So there's not really a way around this. What you could do is run the model with just the three groups and minReplicatesForReplace=Inf, and keep track of the genes that have relatively high maximum Cook's distance per gene, mcols(dds)$maxCooks. Then you could still run the final model with sex, week and RUV factors, just using the maxCooks information from the previous analysis to flag genes as potential outliers which you inspect by eye with plotCounts().