Hi,
I understand that the presence of negative log-transformed count values is a common occurrence in RNA-seq data analysis and these values are not necessarily problematic for statistical analyses. I used TMM
method > log transformed and used limma for identifying significant gene expression (see below). Then, I extracted few genes for (line plot) that is high expressed between the groups of interest, but I see negative log (CPM) values on y-axis (though the expression pattern on the line plot shows the difference). My collaborators were not convinced with this y-axis scale and asked me if there is a way to get the values on the positive scale.
Then, I increased prior.count
(started increasing from +150 --> + 250), the minimum values increase to positive value scale. Meaning the log negative values now transformed to the positive values. However, I am not sure if adding larger prior.count value would affect downstream analysis like differential analysis in general. Does, choosing a large prior.count for visualization only seems like a good strategy because with prior.count = 1, the scale is negative.
Alternatively, I tried Deseq2 (VST) as it did not produce log negative counts. I also have TPM values and observed log of TPM also does not produce negative values. But I do not want to use these methods again for the analysis just to justify the visualization needs.
TMM:
y <- calcNormFactors(y, method = "TMM")
logCPM <- cpm(y, prior.count=1, log=TRUE)
Use limma to analyse multi-factor and varaibles
Thank you
Gordon Smyth thanks. Yes, indeed, plotting CPM values would be an alternative. As discussed, I was discussing about few biologically interesting genes (see below), the question about negative logCPM scale was raised and it was at the moment interpreted that, though the gene is high expressed statistically in particular condition, negative scale might perhaps indicate gene is too low to be really expressed.
Another related question, in addition to the raw count values which I use in edgeR as as input, my RNA-Seq pipeline also generates TPM values. Does it make sense to use to log TPM values only for visualization? Basically, all the analysis would remain same using edgeR and limma for differential expression except for visualization use log TPM values instead of logCPM values.
Yes, negative logCPMs do mean that the gene is expressed at a low level. On your plots, a logCPM of -7 probably means the gene is not expressed at all.
I don't see how manipulating the data to avoid negative values would solve the problem. It doesn't make the gene any more highly expressed but might give the false impression that it is. My aim is always to communicate correct information.
That's entirely up to you, but I don't generally do that. It depends on your pipeline and the purpose of your study. I prefer to plot quantities that accurately reflect the DE analysis. Given that the edgeR analysis is based on counts, I prefer to plot quantities that are directly related to those counts so I know the DE analysis and plots are in agreement. The logCPM values are counts normalized for library size and then converted to log-scale with minimal correction to avoid infinite values.
You could perhaps plot both logCPM and logTPM to check they agree, but I have less faith in TPM values than many other people on the internet because TPMs are typically estimated with higher technical uncertainty than genewise counts and because TPM don't correctly reflect the TMM normalized library sizes.
Gordon Smyth noted. Related to the graphs above, the table below is the output from
limma
, and compound B has 2 statistically significant genes that are up-regulated but the logCPM are on negative scale which makes difficult to explain that the genes are indeed up-regulated.Furthermore, I tried plotting with log2(TPM), now the y-axis shows positive scale of values compared to log2(CPM) which shows negative scale, however the expression trend of these genes in the graphs looks similar.
Sorry, but some of the things you say are not correct.
No, there are no significant genes in the table you show. Significance is judged by adjusted p-values and the table only shows raw p-values.
It appears that you are plotting genes that are not actually significant. If you plot these genes and claim that they are DE, then that would explain why the plots appear to be at odds with the DE results.
logTPM are just as likely to be negative as logCPM values if the two are computed in equivalent ways.
Gordon Smyth Apologies, I used a wrong term, I would have mentioned raw
p-value
instead. Unfortunately, only 3 gene passed FDR (BH) in the analysis. None of the genes I plotted for instance did not pass the FDR < 0.05 , but are either directly or in-directly associated with research question we are exploring. Ideally, as a soft raw p-value cut-off In sometimes considerp<0.05
, for FDR, in the least case, can I consider maybeFDR<0.05
orFDR<0.1
.Regarding TPM, I plotted
Gene_4
for instance with two subgroups this time, below is the plot.Sorry, I was wrong to say that log2(TPM) is always negative, but log2(TPM) is not more likely that log2(CPM) to be positive if both are computed in equivalent ways. I don't know exactly which groups your plots relate to, but it seems impossible for TPM to disagree with CPM to the extent shown in your plots for Gene 4. If CPM = 0, as it seems to be for some groups in a previous plot for Gene 4, then you must have TPM=0 as well.
Anyway, I can only give you help with edgeR and limma usage, and the software seems to be working correctly.
You can visualize whatever is appropriate for your story. If negative logCPMs are a problem (unintuitive I agree) then for visualization simply get
log2(cpm+1)
and problem solved.ATpoint OK, are you referring to the
log2(cpm+1)
orlog2(tpm+1)
Applies to both.