I’m working on RNAseq data using the DEseq2 R package. Most of the analysis will be differential expression between 2 (or 3) groups of samples with at least 3 biological replicates.
Before displaying the differential expression of certain groups of genes, I would like to plot a heatmap of the, for example 50, most expressed genes in group 1 and showing the expression of those genes in the other group(s). There are multiple ways to normalize the gene counts per sample (like % of reads), however to account for differences in total mapped reads between samples I use the default method for normalization in the DEseq2 package (the estimateSizeFactors function with default arguments, where a size-factor per sample is calculated as the median of the ratios: gene-count/geometric mean of that gene, if I understand correctly).
This seems to give robust results and if I adjust the counts for a couple of samples by multiplying or dividing the gene counts for a few of my samples by a factor of 10, the normalized gene counts remain similar.
My question is, what would be the best way to describe these values. For example these are the normalized results for the top2 genes:
Group1-sample1 | Group1-sample2 | Group1-sample3 | Group2-sample1 | Group2-sample2 | Group2-sample3 | |
Gene1 | 46774.604 | 51353.7355 | 64484.2654 | 617.716637 | 1074.286180 | 662.753292 |
Gene2 | 26206.402 | 32438.9899 | 43765.2866 | 37326.099729 | 23284.689905 | 23229.030093 |
Total mapped reads per sample (after removing low-count genes):
Group1-sample1 | Group1-sample2 | Group1-sample3 | Group2-sample1 | Group2-sample2 | Group2-sample3 | |
Mapped reads | 4061637 | 4445092 | 3618525 | 3159137 | 3143315 | 3884281 |
It is obvious that the difference between group 1 and 2 for Gene1 is much larger than for gene2. But when showing (or heatmap plotting) this, the exact interpretation of the values remains a challenge. I can describe it as “read-counts normalized for total mapped reads per sample”. However, I am afraid that people will make those exact values more important than they really are.. Alternatively, I could use the normalized counts and divide them by the total mapped reads (or make it normalized reads per 1 million mapped reads) but that may undo the normalization if large differences between total mapped read counts between samples are present. Alternatively would be to divide by the colSums value for the normalized counts. A last option would be the log2(n+1) transformation but it will still be difficult to explain the value of that number.
How would you use and plot these values?
Thank you so much. I had read those sections, but after re-reading them and reading the DEseq2 paper it made more sense. For clustering of samples and displaying the highest expressed genes I will use the rlog transformed counts.