Regularized log transformation: loss of zero values for sparsely expressed genes
@longwoodsequencer-11269

I'm using the rlog function in the DESeq2 package and I notice a quirk in the transformed data that I do not know what to make of. For genes that are expressed in a small proportion of samples (say, for gene X, 10 of 300 samples have non-zero raw counts), the transformed dataset has no zero values at all; instead, the majority of samples have some other value that is either negative or positive. Negative count doesn't make sense, so I could, I suppose, deal with that by zeroing all counts less than 1 in the transformed dataset. But I don't know what to do about the cases where most samples have a positive value, say 3.5, and a small proportion have higher values; it's as if the zero level for that gene is shifted to a small positive number. This is seen only with genes expressed in a small proportion of samples, and the amount of shift, positive or negative, varies across genes. I notice the same with the variance-stabilizing transformation, and regardless of whether I set blind=FALSE or not.

Have others noticed this in their datasets? If so, how did you deal with it? I don't know how much of an impact this would have on the results of clustering-type exploratory analyses, but I am also not comfortable seeing that a gene that should not be expressed at all in most samples has positive counts for all of them.


Keep in mind that the rlog transformation and VST are both log-like transformations, which means that they can theoretically return any value from -Inf to +Inf, and zero is not a special number in any way.
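To illustrate the point with a simpler stand-in, here is a minimal Python sketch that uses a plain log2(count + pseudocount) transform rather than rlog or VST themselves. The function name `log2_pseudo` and the pseudocount values are assumptions chosen for illustration. It shows that a raw count of zero simply lands at log2(pseudocount), which can be negative, zero, or positive depending on the offset.

```python
import math

def log2_pseudo(count, pseudocount):
    """Plain log2 transform with a pseudocount (a stand-in, not DESeq2's rlog or VST)."""
    return math.log2(count + pseudocount)

# A raw count of zero maps to log2(pseudocount), so where "zero" ends up
# on the transformed scale depends entirely on the chosen offset.
for pc in (0.5, 1.0, 8.0):
    print(f"pseudocount={pc}: zero count -> {log2_pseudo(0, pc):+.2f}")
```

Only with a pseudocount of exactly 1 does a zero count stay at zero; any other log-like transformation is free to place it elsewhere.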

@mikelove

"Negative count doesn't make sense"

First, Ryan is correct that rlog and VST return log2-like values, so negative values are normal, and simply indicate an expected count less than 1. Many samples will have expected counts less than 1 in a very sparse dataset.
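As a rough sketch of that reading, and assuming a transformed value can be treated approximately as a log2 expected count, raising 2 to that value gives a feel for the count scale. The helper name `approx_expected_count` is hypothetical and is not a DESeq2 function; the value 3.5 is the example from the question.

```python
def approx_expected_count(log2_like_value):
    """Rough back-transform from a log2-like scale to the count scale."""
    return 2.0 ** log2_like_value

# Negative values correspond to expected counts below 1;
# zero corresponds to an expected count of exactly 1.
for value in (-2.0, 0.0, 3.5):
    count = approx_expected_count(value)
    print(f"{value:+.1f} on the log2 scale ~ {count:.2f} on the count scale")
```

This is only an approximation for interpretation, since rlog and VST are log2-like rather than exact log2 transforms.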

Secondly, the rlog and VST may not be optimal for very sparse data. If I were you I would compare to other transformations and pick based on properties such as the stabilization of variance over the mean (see vignette) and preservation of signal (seen for example through a PCA plot).


Thank you Ryan and Michael for your quick responses. I see how negative values are possible in the transformation, but those are easier to deal with since they can be interpreted as '<1'. I am bummed about the positive values though. I've uploaded images with an example of this and also a comparison of log2, VST and rlog in stabilizing variance over means. I have ~18000 genes and ~300 samples.

I realized after reading your comment that I do have a large number of sparse genes in this dataset, so I will try next with less sparse genes. But could you elaborate on what you mean by 'preservation of signal (seen for example through a PCA plot)'?

 

http://imgur.com/a/p1c9d

 


Here VST and rlog are much better at stabilizing variance than log2(x+1). You might try a higher pseudocount for log2 as well while you are making comparisons.
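One way to get a feel for the pseudocount comparison is a small simulation. The sketch below is an illustration under loud assumptions: counts are simulated as Poisson (real RNA-seq counts are typically overdispersed), the means and the pseudocounts 1 and 8 are arbitrary, and all function names are hypothetical. It does not reproduce rlog or VST; it only measures how the standard deviation of log2(count + pseudocount) changes with the mean.

```python
import math
import random
import statistics

def sample_poisson(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sd_by_mean(pseudocount, means, n_samples=2000, seed=1):
    """SD of log2(count + pseudocount) at each simulated mean."""
    rng = random.Random(seed)
    result = {}
    for mean in means:
        values = [math.log2(sample_poisson(rng, mean) + pseudocount)
                  for _ in range(n_samples)]
        result[mean] = statistics.pstdev(values)
    return result

def spread_ratio(sds):
    """Largest SD divided by smallest SD; closer to 1 means a flatter trend."""
    return max(sds.values()) / min(sds.values())

means = (1, 4, 16, 64)
for pc in (1, 8):
    sds = sd_by_mean(pc, means)
    summary = ", ".join(f"mean {m}: {sd:.2f}" for m, sd in sds.items())
    print(f"pseudocount={pc}: {summary} (max/min = {spread_ratio(sds):.2f})")
```

In this toy setting the larger pseudocount gives a flatter SD across means, at the cost of compressing differences among low counts, which is the trade-off to weigh when comparing against VST and rlog on real data.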

What I meant by preservation of signal is to inspect whether you have biologically meaningful separation of groups in the PCA plot. While a transformation may not be able to bring this signal out if it does not exist in the data, you would want a good transformation and visualization to make biological signal prominent.

 
