Question

Using the correct counts with lm function linear model

0

Entering edit mode

Bine ▴ 50

@bine-23912

Last seen 11 months ago

UK

Dear all,

So now I have done my DESeq2 Analysis with RAW gene expression counts and realised that I should not use any longer RAW counts for the linear models via lm function in R. I am thinking to use the transcripts per million (TPM) counts instead, but I read somewhere that it would make sense to take the log2 of it as well?

Do you agree to this approach?

See below article where I read it:

"However, even if this is true (i.e. the normalization has worked well), the CPM/TPM values do not have stable variance. Specifically, as the size of the values increases, so does the variance. This feature of the data (heteroskedacity, or asymmetric, heavy-tailed distributions) is problematic for statistical analysis methods that assume homoskedacity, that is that there is no relationship between the mean of expression values and their variance (i.e. just about anything that uses a Gaussian error model).

As such, we should apply a variance stabilizing transformation to these data so that we can use standard statistical methods like linear regression and PCA with confidence. Developing a thoroughly effective variance stabilizing transformation is a challenge, so almost universally a log transformation (typically log2) is applied to the CPM/TPM values (the logcounts slot in a SingleCellExperiment object is expected to contain (normalized) log2-scale CPM/TPM values). "

Thanks so much for any input!

Best, Bine

lm DESeq2 • 3.2k views

ADD COMMENT • link updated 4.3 years ago by josephbrown ▴ 10 • written 4.3 years ago by Bine ▴ 50

1

Entering edit mode

I will permit that Gordon or Mike respond with an answer, but, if you must use a linear model subsequent to performing the standard differential expression analysis via DESeq2, which implements a negative binomial GLM, then I would use the variance-stabilised expression levels. A question back to you, though, would be why you need to do this at all?

ADD REPLY • link 4.3 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

So basically what I want to do i want to plot in linear model some of the genes identified via Differential Expression Analysis. I could go to the beginning and use the TPM values instead of the raw values used for the DESeq2 and create a completely new dataset, since I understand, at least for the lm() function, i cannot use the raw counts?! When you say variance-stabilised expression levels you mean the log2 of the TPM values?

ADD REPLY • link 4.3 years ago Bine ▴ 50

0

Entering edit mode

There are 2 functions in the DESeq2 package (vst and rlog) that will take your raw expression data and transform them into a log scale while stabilizing the difference in variance seen across the range of expression of count data.

ADD REPLY • link 4.3 years ago josephbrown ▴ 10

0

Entering edit mode

Thank you, yes, I used the vst before for my exploratory analysis as part of the differential expression analysis. But for the linear model via lm() i think I cant use the DESeq2 object.. So I am still not really sure.. but I think I just take the TPM as described above and take the log2 of it..

ADD REPLY • link 4.3 years ago Bine ▴ 50

0

Entering edit mode

The issue there, is that the variance isn't the same throughout the expression range, so after a log transformation, the low expression genes will have way more variance and this could affect lm() and any other downstream analyses. vst and rlog take care of that. Also be sure to use the correct blind=TRUE or blind=FALSE so the transformation will handle the variance from your experimental design in whatever way is appropriate for your analysis.

You can access the matrix (df?) of the transformed TPM with assay(vst_object) to use with lm() the same way you would if you had taken the log2 of TPM yourself.

Check out these two pages from a tutorial by a group with a bunch of other very help tutorials in their github:

That second page has info about using the transforms.

I am not a statistician though, so take what I say with a huge grain of salt.

ADD REPLY • link 4.3 years ago josephbrown ▴ 10

score 0 · Answer 1 · 2021-01-08

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 13 hours ago

United States

In DESeq2, for association testing we recommend to use the NB GLM via DESeq(). You can read about the methods in the DESeq2 publication. This allows for testing of the log of expected counts against a linear combination of predictors. The equation is in the vignette also.

ADD COMMENT • link 4.3 years ago Michael Love 43k

0

Entering edit mode

So it is like lm () but I can use the DESeq2 generated objects? Okay I will read about it!Can you recommend an article for that?

ADD REPLY • link 4.3 years ago Bine ▴ 50

0

Entering edit mode

I have a kind of related question. My group was doing some method testing, and we used a set of samples which were a mix of two different tissue types in different ratios. I wanted to use DESeq to determine which genes were following the expected linear expression related to dose, but the logging of the counts made the relationship not linear. Was this a totally inappropriate use of DESeq? WAs there a better way to have used DESeq for this?

ADD REPLY • link 4.3 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

If you have a covariate that you think is linear with the counts, I believe you could take the log of the covariate. The inference will be on the log scale however, but you can transform e.g. CI back to the original scale by exponentiating.

ADD REPLY • link 4.3 years ago Michael Love 43k

0

Entering edit mode

When I graph normalized counts versus tissue A percentage, I get linear graphs, no matter whether the gene counts increase or decrease with tissue A percentage. With logged counts, the relationship is only linear if the logged percentage of the tissue with more of the gene is used. I guess I could filter based on the sign of the results, and do it twice.

ADD REPLY • link 4.3 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Yeah I don't have much further ideas.

In DESeq2::unmix we estimate mixing proportions for counts, where we believe the relationship is linear on the original scale. The way we do the linear regression is by finding linear coefficients but computing loss on the VST scale. The point there is not for inference, but just to estimate proportions (e.g. we don't provide SEs).

ADD REPLY • link 4.3 years ago Michael Love 43k