Question

Normalisation in DESeq2

0

thjnant ▴ 10

@thjnant-23566

Last seen 4.9 years ago

Hello,

I have two groups (they are two different species rather than control vs. treatment group). I did the following steps in DESeq2 for normalization:

dds <- estimateSizeFactors(dds)
sizeFactors(dds)
normalized_counts <- log2(counts(dds, normalized=TRUE)+1)

I then created a scatterplot matrix to evaluate the normalization and to see whether data points are symmetrical around the x = y line.
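For readers following along, the scatterplot matrix described above can be sketched roughly as below. This is only an illustration: the count matrix is simulated (the name `norm_counts` is a hypothetical stand-in for `counts(dds, normalized=TRUE)`), and the plotting style is an assumption rather than what was actually used.

```r
set.seed(1)

# Simulated stand-in for counts(dds, normalized = TRUE): 2000 genes x 4 samples
gene_means <- rexp(2000, rate = 1 / 200)
norm_counts <- matrix(rnbinom(2000 * 4, mu = gene_means, size = 5), ncol = 4,
                      dimnames = list(NULL, paste0("sample", 1:4)))
log_counts <- log2(norm_counts + 1)

# Scatterplot matrix with the x = y line drawn in every panel
pairs(log_counts, pch = ".",
      panel = function(x, y, ...) {
        points(x, y, ...)
        abline(0, 1, col = "red")
      })
```

Points scattered evenly on both sides of the red line in each panel would be the symmetric pattern mentioned in the question.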

I have two questions about my data:

1. Some samples have a pronounced streak of highly-expressed genes. How should I deal with such genes in these samples?
2. One sample shows a large amount of variation, not only in the comparison with samples of another group but also in comparisons of samples from the same group. How would this affect the differential gene expression analysis?

Thank you in advance!

deseq2 rna-seq • 4.2k views

ADD COMMENT • link updated 4.9 years ago by Kevin Blighe ★ 4.0k • written 4.9 years ago by thjnant ▴ 10

score 1 · Answer 1 · 2020-05-22

1

Kevin Blighe ★ 4.0k

@kevin

Last seen 18 days ago

Republic of Ireland

Hi, I would not log [base 2] transform the normalised counts. Please transform via variance-stabilisation (vst()) or regularised log transform (rlog()), and then re-generate the scatterplot.
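A minimal sketch of the two transformations, assuming the DESeq2 package is installed. The dataset here is simulated with `makeExampleDESeqDataSet()` purely as a stand-in for the real `dds`; the gene and sample numbers are arbitrary assumptions.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real dataset: 3000 genes, 10 samples (5 per group)
dds <- makeExampleDESeqDataSet(n = 3000, m = 10)

# blind = TRUE ignores the design, which suits sample-level QC plots
vsd <- vst(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)

# assay() returns the transformed matrix (genes x samples)
vst_mat <- assay(vsd)
rlog_mat <- assay(rld)

# Re-generate the scatterplot matrix on the transformed values
pairs(vst_mat[, 1:4], pch = ".")
```

Either matrix can then be passed to the same plotting code that was used for the log2 values.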

One sample shows a large amount of variation, not only in the comparison with samples of another group but also in comparisons of samples from the same group. How would this affect the differential gene expression analysis?

Genes affected by that sample may be flagged by the Cook's distance outlier check and / or removed at the independent filtering step, and be assigned a p-value or adjusted p-value (q-value) of NA, respectively. If you could share some of your actual data, that may help.
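One way to see how many genes end up with NA values is to count them in the results table. The sketch below assumes DESeq2 is installed and again uses simulated data from `makeExampleDESeqDataSet()` in place of the real `dds`; the variable names `cooks_flagged` and `indep_filtered` are just illustrative.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for the real dataset: 3000 genes, 10 samples (5 per group)
dds <- makeExampleDESeqDataSet(n = 3000, m = 10)
dds <- DESeq(dds)
res <- results(dds)

# Expressed genes whose p-value was set to NA (Cook's distance outlier flag)
cooks_flagged <- sum(res$baseMean > 0 & is.na(res$pvalue))

# Genes with a p-value but an NA adjusted p-value (independent filtering)
indep_filtered <- sum(!is.na(res$pvalue) & is.na(res$padj))

c(cooks_flagged = cooks_flagged, indep_filtered = indep_filtered)
summary(res)
```

`summary(res)` also reports the outlier and low-count totals, so the two counts above are mainly useful for pulling out the specific genes.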

If in doubt, check the very detailed and helpful DESeq2 vignette.

Kevin

ADD COMMENT • link 4.9 years ago Kevin Blighe ★ 4.0k

0

Thank you very much for your reply. I used rlog() and vst() as well. The same pattern as described in the question is observed with these transformations. I tried to attach my graph, but it looks like it's not possible to attach an image here. Should I post my raw counts? There are more than 13,000 genes; how can I share them effectively here? Thank you again.

ADD REPLY • link 4.9 years ago thjnant ▴ 10

0

If you have not pre-filtered your raw count matrix for low count genes, then that could also produce the observed behaviour.

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

0

I filtered for a minimum of 10 reads across all samples. Should I try filtering for larger numbers? But what I see is a pronounced streak of highly-expressed genes for one sample, so there is a large number of such genes in that sample. I'd love to share my data here if that is possible; I just don't know how I could do that.

ADD REPLY • link 4.9 years ago thjnant ▴ 10

0

I just uploaded my graphs to my GitHub page:

With log2:

https://github.com/Homap/datascience/blob/master/docs/hteosteo.pdf

With rlog():

https://github.com/Homap/datascience/blob/master/docs/hteosteo_rld.pdf

ADD REPLY • link 4.9 years ago thjnant ▴ 10

1

Thanks for sharing.

I filtered for minimum of 10 reads across all samples. Should I try filtering for larger numbers?

You mean that each gene just required a minimum of 10 summed reads across all samples? Perhaps using the mean is better, so, mean > 10.
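To make the difference between the two filters concrete, here is a rough sketch in base R. The matrix `raw_counts` is simulated and only stands in for `counts(dds)`; the distribution parameters are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in for counts(dds): 1000 genes x 10 samples
raw_counts <- matrix(rnbinom(1000 * 10, mu = rexp(1000, rate = 1 / 20), size = 2),
                     ncol = 10)

# Filter as described in the reply: at least 10 reads summed over all samples
keep_sum <- rowSums(raw_counts) >= 10

# Suggested alternative: mean count per gene above 10
keep_mean <- rowMeans(raw_counts) > 10

# Number of genes retained by each filter
c(sum_filter = sum(keep_sum), mean_filter = sum(keep_mean))

# With a DESeqDataSet, the mean filter could be applied as:
# dds <- dds[rowMeans(counts(dds)) > 10, ]
```

With 10 samples, a mean above 10 corresponds to more than 100 reads in total, so the mean filter is the stricter of the two.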

These artifacts in scatter and MA plots are not entirely unusual. I am going to guess that your sample size is small and that the genes represented by those straight lines have 0 values (and / or constant variance) in one or the other of your sample groups.
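A quick way to test that guess is to look for genes with all-zero counts in exactly one group. The sketch below uses simulated counts with 20 such genes planted deliberately; `raw_counts`, `group`, and the species labels are hypothetical stand-ins for the real count matrix and sample annotation.

```r
set.seed(1)

# Hypothetical design: 5 samples per species
group <- factor(rep(c("speciesA", "speciesB"), each = 5))

# Simulated stand-in for counts(dds): 500 genes x 10 samples
raw_counts <- matrix(rnbinom(500 * 10, mu = 50, size = 2), ncol = 10)

# Plant 20 genes that are silent in one species, for illustration only
raw_counts[1:20, group == "speciesB"] <- 0

# For each group, flag genes whose counts are all zero in that group
all_zero_in_group <- sapply(levels(group), function(g) {
  rowSums(raw_counts[, group == g, drop = FALSE]) == 0
})

# Genes that are all-zero in exactly one of the two groups
zero_in_one_group <- rowSums(all_zero_in_group) == 1
which(zero_in_one_group)
```

If the genes forming the streaks in the scatterplots turn up in this list, that would support the explanation above.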

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

1

Hi Kevin,

Oh, I see. Definitely, mean > 10 sounds much more reasonable. Yes, I only have 5 replicates per group. Great suggestion, I will check those genes for their read counts and variance.

Thanks a lot again.

ADD REPLY • link 4.9 years ago thjnant ▴ 10

0

Sure, let me know if you get anywhere with it. I would start by literally outputting (to the terminal) the rows from your data that contain the lowest expression values based on mean.
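Something along these lines would print those rows. It is only a sketch on simulated data: `raw_counts` and the gene and sample names are hypothetical, and the choice of 10 rows is arbitrary.

```r
set.seed(1)

# Simulated stand-in for the raw count matrix: 200 genes x 10 samples
raw_counts <- matrix(rnbinom(200 * 10, mu = rexp(200, rate = 1 / 30), size = 2),
                     ncol = 10,
                     dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))

# Mean count per gene, then the 10 genes with the lowest means
gene_means <- rowMeans(raw_counts)
lowest <- head(order(gene_means), 10)

# Print their counts alongside the mean
print(cbind(raw_counts[lowest, ], mean = round(gene_means[lowest], 2)))
```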

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k