Are transformed values from rlog/vst log2 normalized counts?
Eva ▴ 10
@ae923a5a
Spain

I am trying to understand the vst/rlog transformations in DESeq2. Section 4.2 of the vignette, where vst and rlog are explained, has this paragraph:

Both vst and rlog return a DESeqTransform object which is based on the SummarizedExperiment class. The transformed values are no longer counts, and are stored in the assay slot.

What does it mean that they are no longer counts? Does it mean that the transformed values are not stored in the "counts" slot, as you would get from counts(dds, normalized=TRUE), or is it something else?

It is clear that the magnitude of the values you get from vst/rlog is not the same as from counts(dds, normalized=TRUE), but is that because vst/rlog output is on a log2 scale? (Of course, there is also a variance-stabilizing transformation, but are the results on a log2 scale?) So would this output be log2 normalized and transformed counts?

The reason for this question is that I am wondering whether I should save those transformed counts as "normalized_transformed" counts for future use. I used to save counts(dds, normalized=TRUE) and use those for downstream analyses, but now that I have discovered (and read more about) the vst/rlog transformations, I will have to change the way I work. I am quite worried about the paragraph above, which says that they are no longer counts, and I don't know if I understand everything properly.

Thanks in advance

Regards

Normalization DESeq2 rlog vst • 262 views
@james-w-macdonald-5106
United States

They are not counts because counts are integers.

> library(DESeq2)
> z <- makeExampleDESeqDataSet()
## These are counts
> head(assay(z))
      sample1 sample2 sample3
gene1      55      23      46
gene2       7       9       0
gene3       6      45      79
gene4       1       1       0
gene5       0       1       2
gene6     206     461     187
      sample4 sample5 sample6
gene1      51      16      34
gene2       2      14       4
gene3       9      20      14
gene4       2       0       0
gene5       0       8      11
gene6     428     277     200
      sample7 sample8 sample9
gene1      48      56      11
gene2       9       1       7
gene3      17      17      17
gene4       0       6       9
gene5       0       4       3
gene6     270     157     481
      sample10 sample11 sample12
gene1       25       29       91
gene2        4        9        8
gene3       32       19        4
gene4        0        0        0
gene5        5        0        7
gene6      257      140      385

## These are not counts!
> head(assay(rlog(z)))
       sample1  sample2   sample3
gene1 5.508012 4.732097 5.2791903
gene2 2.532313 2.639302 1.9586419
gene3 3.507028 4.889190 5.3750272
gene4 0.283636 0.281633 0.1745515
gene5 1.261563 1.350452 1.4254712
gene6 7.727871 8.605355 7.5497238
        sample4   sample5   sample6
gene1 5.4296431 4.4794192 5.0229586
gene2 2.1618863 2.8903103 2.3107258
gene3 3.7026089 4.2296479 3.9381367
gene4 0.3784741 0.1766241 0.1749336
gene5 1.2612043 1.8266005 1.9365913
gene6 8.5333462 8.0548245 7.6361029
        sample7   sample8   sample9
gene1 5.3835566 5.5187194 4.2431675
gene2 2.6497815 2.0684901 2.5366996
gene3 4.1119435 4.1044945 4.1169204
gene4 0.1766834 0.6869912 0.8742537
gene5 1.2617262 1.5861930 1.5178622
gene6 8.0286134 7.4289983 8.6872464
       sample10  sample11  sample12
gene1 4.7867908 4.9386086 5.9969793
gene2 2.3177758 2.6500157 2.5894217
gene3 4.5787006 4.1931967 3.3466693
gene4 0.1755969 0.1766985 0.1764068
gene5 1.6452014 1.2617493 1.7717412
gene6 7.9341562 7.3198482 8.4157297

And these data are not intended for analysis using any count-based method. From ?varianceStabilizingTransformation:

Description:

     This function calculates a variance stabilizing transformation
     (VST) from the fitted dispersion-mean relation(s) and then
     transforms the count data (normalized by division by the size
     factors or normalization factors), yielding a matrix of values
     which are now approximately homoskedastic (having constant
     variance along the range of mean values). The transformation also
     normalizes with respect to library size. The 'rlog' is less
     sensitive to size factors, which can be an issue when size factors
     vary widely. These transformations are useful when checking for
     outliers or as input for machine learning techniques such as
     clustering or linear discriminant analysis.

So if you want to plot the data or do other downstream analyses such as clustering, you could use rlog or vst, but neither is meant to be used as input to a count-based analysis of the data.
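
As a rough illustration of the point above, the sketch below puts the raw counts, the size-factor-normalized counts, a plain log2(normalized count + 1), and the two transformations side by side for simulated data. It assumes DESeq2 is installed; the object names (dds, ntd, vsd, rld) are only illustrative, and varianceStabilizingTransformation() is called directly rather than through the vst() wrapper because the default example data set is small.

```r
library(DESeq2)

set.seed(1)  # the example data are simulated, so fix the seed
dds <- makeExampleDESeqDataSet()
dds <- estimateSizeFactors(dds)

raw  <- counts(dds)                     # raw integer counts
norm <- counts(dds, normalized = TRUE)  # counts / size factor, not log-scaled

ntd <- normTransform(dds)               # log2(normalized count + 1)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)

is.integer(raw)          # storage type of the raw counts
is.integer(assay(vsd))   # transformed values are real numbers
is.integer(assay(rld))

# Compare the scales for the most highly expressed gene
top <- which.max(rowMeans(norm))
round(rbind(
  normalized = norm[top, 1:3],
  log2p1     = assay(ntd)[top, 1:3],
  vst        = assay(vsd)[top, 1:3],
  rlog       = assay(rld)[top, 1:3]
), 2)
```

For highly expressed genes the three transformed rows should be close to one another and far from the normalized counts, which is consistent with the vignette describing vst and rlog values as being on a log2 scale. For low counts the transformations are expected to diverge from a plain log2.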


Many thanks for your detailed answer, I really appreciate it.

Could you give me an example (or more, if possible) of an analysis that uses a count-based method, please? I am not sure I know any. I want to be clear about when I can use this type of data and when I cannot (the sources I found explain no more than "it is okay for downstream analyses and/or plots", which is quite frustrating).

Regarding this sentence:

"The 'rlog' is less sensitive to size factors, which can be an issue when size factors vary widely."

I don't really understand it, because I found in this paper that vst was the one that has problems with size factors. Did I misunderstand, or have I mixed up different concepts?

while the VST is also effective at stabilizing variance, it does not directly take into account differences in size factors; and in datasets with large variation in sequencing depth (dynamic range of size factors >~ 4) we observed undesirable artifacts in the performance of the VST.

Thanks in advance.


The DESeq2 package is meant to analyze count data! Most packages intended for RNA-Seq data expect the raw counts, because they fit generalized linear models with a negative binomial distribution (and a log link). That's what DESeq2 does, as does edgeR; these are the two main Bioconductor packages for analyzing RNA-Seq data.
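
To make "count-based" concrete, here is a minimal sketch of such an analysis with DESeq2, run on the simulated example data rather than on real data. The raw integer counts go into DESeq(); the last step, which tries to build a DESeqDataSet from rlog values, is only there to show that transformed values are not valid input. The names rld_mat and bad are hypothetical.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet()   # simulated raw counts, design ~ condition

# Count-based analysis: negative binomial GLM fitted to the raw counts
dds <- DESeq(dds)
res <- results(dds)
head(res[order(res$pvalue), ])

# Transformed values are not counts, so they should be rejected as input
rld_mat <- assay(rlog(dds, blind = TRUE))
bad <- tryCatch(
  DESeqDataSetFromMatrix(countData = rld_mat,
                         colData   = colData(dds),
                         design    = ~ condition),
  error = function(e) e
)
if (inherits(bad, "error")) message(conditionMessage(bad))
```

The same idea applies to edgeR: the model is fitted to the raw counts, and normalization enters through size factors or offsets instead of through pre-transformed values.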

I believe you are misinterpreting the first quote in your post (the one about rlog). What that sentence means is that rlog is the better choice when you have large differences in sequencing depth (i.e., size factors that vary widely), because it takes the size factors into account directly, whereas vst does not directly account for differences in size factors and can therefore have problems when sequencing depth varies widely across an experiment.
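
One way to check whether this matters for a given data set is to look at the spread of the estimated size factors. The sketch below simulates deliberately uneven sequencing depth through the sizeFactors argument of makeExampleDESeqDataSet(), then applies the rough threshold from the passage quoted above (a dynamic range of about 4). Treating that number as a cutoff is an assumption made for illustration, not a rule enforced by DESeq2, and true_sf, sf_range and trans are hypothetical names.

```r
library(DESeq2)

set.seed(1)
# Simulate 12 samples with very different sequencing depths (illustrative values)
true_sf <- rep(c(0.25, 1, 4), each = 4)
dds <- makeExampleDESeqDataSet(m = 12, sizeFactors = true_sf)

dds <- estimateSizeFactors(dds)
sf  <- sizeFactors(dds)
sf_range <- max(sf) / min(sf)   # dynamic range of the estimated size factors
sf_range

# Heuristic only: prefer rlog when the size factors vary widely
trans <- if (sf_range > 4) {
  rlog(dds, blind = TRUE)
} else {
  varianceStabilizingTransformation(dds, blind = TRUE)
}
dim(assay(trans))
```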


I understood exactly what you said, but I got confused by the documentation and what it says about rlog, because the two sources seem contradictory (rlog is sensitive according to this documentation, but robust according to this one).

Anyway, many thanks again for the answer and your help! :)
