Dear Michael,
I have a raw count dataset with three conditions (A,B,C) and each condition has 2-3 biological replicates. I calculated the normalized counts (not log2 transformed) in DESeq2, and for instance the following lists the normalized count of a gene for each replicate.
# A1 A2 A3 B1 B2 B3 C1 C2
#108.74237 86.54679 109.39501 562.57695 570.28352 481.73985 1537.51446 1571.32569
I tried to estimate the fold change between C and A simply by calculating the ratio. That is, the mean of normalized counts in C divided by the mean of normalized counts in A is 15.31.
I also tried transforming the raw counts through the rlog method (set blind = FALSE), and the following gives the rlog values for the same gene in each replicate:
# A1 A2 A3 B1 B2 B3 C1 C2
# 7.956540 7.925518 7.983866 9.034676 9.043481 8.896545 10.052179 10.091817
I estimated the log2 fold change (C vs A) based on the rlog values, that, the mean of rlog values in C minus that in A. The resulting fold change estimate will be 4.34, much less than 15.31 above.
I understand that DESeq2 shrinks the log fold change to take care of genes of low counts, so it’s not simply the ratio of normalized counts. I also calculated the shrunken fold change (that is, 15.24) in DESeq2 and compared that to the one (15.31) I calculated previously, and they were very close.
After looking at the manual and paper, I thought log fold change shrinkage and the rlog method were conceptually consistent since they both tried to minimize the differences among samples of genes with low counts. But I am wondering why the fold change estimate using the rlog values will turn out to be so different. I would appreciate it very much that you would give any comments or correct my thought.
Also, the goal of my analysis to identify expression patterns of genes among the three conditions. I think transformed data from either rlog or vst would be a better choice to be used for the hierarchical clustering of the genes as the dependence of variance on the mean is removed and this might lead to more accurate clustering than using the normalized counts (without variance stabilization). Wondering if my understanding is correct.
Thanks so much for your help in advance!
Thanks Michael! Sorry for the typo. Yes, I meant the subtraction and corrected the original text. When using rlog values, the fold change estimate is 4.34, very different from the shrunken fold change (15.24) or the estimate from the ratio of normalized counts (15.31). Wondering if it’s expected that the rlog transformation will largely change the fold change of each gene even with moderate or high counts but the expression patterns (trend) across samples will always be reserved no matter what transformation (rlog or vst) is applied. I thought that the estimate of the fold change from rlog values or normalized counts should be similar, but I was wrong. I will surely use
lfcshrink
to calculate the log2 fold change for each gene, but just curious about the discrepancy in the example I provided.Right, so the LFC is ~3.9 and the difference in rlog values is ~2.
Right, we don't recommend the rlog for LFC calculation, this and the VST (VST is now the recommended transformation generally) are for computing sample distances.
What do you get with VST by the way?