Question

skewed differentially expressed gene results - DESeq2

CE ▴ 20

Hi,

I'm not sure if this is a question of outliers, since it is happening with more than one sample in a group. I am seeing genes reported as differentially expressed when they are only obviously different in 3-4 of the 16 samples in one group, compared with 18 samples in a control group. I am using DESeq2 with default settings. Is there a way to change the settings in DESeq2 to prioritize genes that show similar expression within a group? I want to find genes that differ for most samples within a group, rather than for only about 1/4 of them.

Thanks!

deseq2 differential gene expression outliers • 2.1k views

ADD COMMENT • link updated 6.3 years ago by Ryan C. Thompson ★ 7.9k • written 6.3 years ago by CE ▴ 20

score 0 · Answer 1 · 2018-08-03

Michael Love 43k

Can you include a plotCounts() plot for one of these genes that you are less interested in?
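For reference, a minimal sketch of how such a plot can be produced. It uses DESeq2's simulated example data (`makeExampleDESeqDataSet()`) rather than the poster's data, so the `condition` column and the choice of gene are assumptions for illustration only.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real experiment (assumption: two groups in `condition`)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds)
res <- results(dds)

# Pick one gene to inspect; here simply the one with the smallest adjusted p-value
gene_id <- rownames(res)[which.min(res$padj)]

# Plot normalized counts per sample, grouped by condition
plotCounts(dds, gene = gene_id, intgroup = "condition")

# The same values as a data frame, convenient for custom plots
d <- plotCounts(dds, gene = gene_id, intgroup = "condition", returnData = TRUE)
head(d)
```

With real data, `gene_id` would be the ID of one of the genes in question and `intgroup` the name of the grouping column in `colData(dds)`.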

Also, I'd suggest trying lfcShrink() to provide better LFCs for ranking. You can combine subsetting to genes with a small adjusted p-value, and then ranking by the absolute value of the shrunken LFC.
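The filter-then-rank idea could look roughly like the sketch below. It runs on simulated data from `makeExampleDESeqDataSet()`, and the 0.05 cutoff and `type = "normal"` are assumptions chosen to keep the example free of extra dependencies; `type = "apeglm"` or `"ashr"` are alternatives if those packages are installed.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real experiment (assumption)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds)
res <- results(dds)

# Shrink the LFC of the second coefficient (the condition effect here;
# check resultsNames(dds) on real data)
res_shrunk <- lfcShrink(dds, coef = 2, res = res, type = "normal")

# Step 1: keep genes with a small adjusted p-value (NA padj is dropped by which())
keep <- which(res$padj < 0.05)

# Step 2: rank the remaining genes by absolute shrunken LFC, largest first
sig <- res_shrunk[keep, ]
ranked <- sig[order(abs(sig$log2FoldChange), decreasing = TRUE), ]
head(ranked)
```

The adjusted p-value decides which genes enter the list, while the shrunken LFC decides their order, so genes with small or poorly supported effects end up lower in the list.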

ADD COMMENT • link 6.3 years ago Michael Love 43k

Thanks for the quick reply!

Here is an example...

3 samples from the 'yes' group are clearly skewing the results. This gene has a padj of 0.001, a LFC of 1.22 before shrinking and 1.17 LFC after shrinking.

Most of our DE genes do not have very large fold changes and we are dealing with very noisy human data.

ADD REPLY • link 6.3 years ago CE ▴ 20

I don't think these results are obviously being skewed by the 3 highest samples in the "yes" group. Even if you ignore these, the "yes" group still has a higher average normalized count than the "no" group. It might not be as significant without the 3 highest samples, but I wouldn't say that this gene is an unambiguous false positive.

ADD REPLY • link 6.3 years ago Ryan C. Thompson ★ 7.9k

score 0 · Answer 2 · 2018-08-03

Ryan C. Thompson ★ 7.9k

Is it possible that the same 3 or 4 samples are outliers in many genes? If so, it might help if you redo your analysis with limma using voomWithQualityWeights. This will hopefully identify and down-weight the samples that are consistently outliers across many genes. You can also inspect the weights to determine which samples the method believes to be outliers - these will be the samples with the lowest weights.
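A rough sketch of that workflow follows. The count matrix is simulated (hypothetical 18 "no" and 16 "yes" samples, with extra gene-wise noise injected into three samples), and names such as `counts`, `group` and `sw` are illustrative assumptions, not part of the original analysis.

```r
library(limma)
library(edgeR)

set.seed(1)
n_genes <- 1000
# Hypothetical simulated counts: 18 "no" samples followed by 16 "yes" samples
counts <- matrix(rnbinom(n_genes * 34, mu = 100, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:34)))
# Make the last three samples noisier than the rest (assumption, for illustration)
counts[, 32:34] <- round(counts[, 32:34] * exp(rnorm(n_genes * 3, sd = 1)))

group <- factor(rep(c("no", "yes"), c(18, 16)))
design <- model.matrix(~ group)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)

# voom with additional sample-level quality weights
v <- voomWithQualityWeights(dge, design)
fit <- eBayes(lmFit(v, design))
top <- topTable(fit, coef = "groupyes", number = Inf)

# Per-sample weights; the lowest values flag candidate outlier samples
sw <- setNames(v$targets$sample.weights, colnames(v))
head(sort(sw))
```

Samples that are noisy across many genes should receive low weights and therefore contribute less to every gene's test, without being removed outright.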

ADD COMMENT • link 6.3 years ago Ryan C. Thompson ★ 7.9k

To tack on to Ryan's answer: you can check for outliers in a PCA plot, see the vignette.

Of course, if 3 samples always have higher counts, then it would be picked up and corrected by size factors. And if they are only "outliers" on some genes, I'm not sure I'd want to downweight them. How to approach this definitely depends on the analyst, but looking at the above plot, I would say it's a good example of DE, and the LFC seems reasonable, and I wouldn't downweight the top 3 samples in "yes".
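Both points can be checked with a short sketch like the one below, again on simulated data. Tripling the counts of the first sample is an artificial assumption, used only to show that a uniformly higher sample is absorbed by its size factor.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Artificially triple every count in sample 1 (assumption, for illustration)
cts <- counts(dds)
cts[, 1] <- cts[, 1] * 3L
counts(dds) <- cts

# A sample that is uniformly higher gets a correspondingly larger size factor
dds <- estimateSizeFactors(dds)
sizeFactors(dds)

# PCA on variance-stabilized data to look for outlying samples
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")

# The PCA coordinates as a data frame, for custom plotting
pca <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
head(pca)
```

Samples that sit far from their own group in the PCA plot are the ones worth a closer look; samples that differ only on a subset of genes will usually not stand out there.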

ADD REPLY • link 6.3 years ago Michael Love 43k

You are exactly right: when I plot a heatmap of the top differentially expressed genes, most of them appear to be different mainly in these same samples. They show a very similar pattern to this gene when I use plotCounts(). I have almost 600 genes with padj < 0.05, which is a lot to sift through to see which genes are showing up mostly because of these same samples.

Maybe it would make sense to filter for low within-group variance to narrow down my genes of interest?

Thanks for the advice to try limma voomWithQualityWeights. I'll give it a try and see how things look.

ADD REPLY • link 6.3 years ago CE ▴ 20

There is no point in filtering for low within-group variance. DESeq2 already accounts for within-group variability, through its per-gene dispersion estimates, when it assesses the significance of each gene and computes a p-value and adjusted p-value. If these outlier samples are to blame for what you believe to be false positive genes, then the problem is not within-group variance. If the effect appears systematic across many genes, another option (which is compatible with DESeq2) is to use surrogate variable analysis (sva) to estimate the systematic effects and include them in the design.
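A sketch of the sva route, under stated assumptions: the data are simulated with `makeExampleDESeqDataSet()`, the number of surrogate variables (`n.sv = 2`) is fixed arbitrarily, and the names `SV1`, `SV2` and `dds_sva` are illustrative.

```r
library(DESeq2)
library(sva)

set.seed(1)
# Simulated stand-in for the real experiment (assumption)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- estimateSizeFactors(dds)

# svaseq works on normalized counts; drop genes with very low counts first
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# Full model (with the condition of interest) and null model (intercept only)
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Estimate two surrogate variables (the number is an assumption here)
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Add the surrogate variables to the design and rerun DESeq2
dds_sva <- dds
dds_sva$SV1 <- svseq$sv[, 1]
dds_sva$SV2 <- svseq$sv[, 2]
design(dds_sva) <- ~ SV1 + SV2 + condition
dds_sva <- DESeq(dds_sva)
res_sva <- results(dds_sva)
```

Keeping the condition of interest last in the design means `results()` reports the condition effect by default, adjusted for the estimated surrogate variables.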

Of course, the "nuclear option" is to discard the samples entirely, but I don't think that is likely to be justified in this case. And even then you need to be wary of the potential for bias if you are discarding samples until you get the result you want to see.

ADD REPLY • link 6.3 years ago Ryan C. Thompson ★ 7.9k