Hi folks,
We are using pseudobulk DESeq2 (DESeq2_1.26.0) approach to perform DGE testing on single cell datasets.
We are interested in assessing condition induced differences, while controlling for other variables such as RIN, PMI, Age and tissue (shown in metadata).
Age RIN PMI Tissue Condition
1 45 9.0 16 A Disease
2 71 8.8 35 A Control
3 47 9.3 25 A Disease
4 38 7.3 14 A Disease
5 58 7.4 21 A Disease
6 85 8.8 26 A Control
7 56 7.3 18 A Disease
8 53 9.2 27 A Disease
9 64 7.6 20 A Control
10 51 9.4 26 A Disease
11 54 9.3 15 A Disease
12 67 7.8 23 A Control
13 37 9.2 17 A Disease
14 43 7.9 32 A Disease
15 38 7.1 27 A Control
16 43 8.2 32 A Disease
17 37 8.6 32 A Control
18 57 8.8 11 A Control
19 42 8.2 21 A Control
20 47 8.1 24 A Control
21 53 9.2 27 A Disease
22 67 7.2 23 B Control
23 58 8.2 21 B Disease
24 38 7.5 27 B Control
25 67 7.2 12 B Disease
26 64 7.3 20 B Control
27 58 8.2 29 B Disease
28 91 7.0 25 C Control
29 89 7.4 34 C Control
30 64 7.2 20 C Control
31 81 6.9 10 C Disease
32 74 6.8 29 C Disease
33 84 7.6 28 C Disease
However, when using:
dds <- DESeqDataSetFromMatrix(pseudobulk, sample_table, ~Age + RIN +PMI+tissue+disease)
following warning is shown:
converting counts to integer mode
the design formula contains one or more numeric variables with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert
this variable to a factor using the factor() function
the design formula contains one or more numeric variables that have mean or
standard deviation larger than 5 (an arbitrary threshold to trigger this message).
it is generally a good idea to center and scale numeric variables in the design
to improve GLM convergence.
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
Going further with this approach, we were able to obtain some results
dds <- DESeq(dds, test = "LRT", reduced = ~Age+ RIN +PMI+tissue, sfType = "poscounts", useT = T, minmu = 1e-6, minReplicatesForReplace=Inf)
As a second approach, in order to avoid above warning, we converted numeric variables to factors using factor()
and call the same function
dds <- DESeqDataSetFromMatrix(pseudobulk, sample_table, ~Age + RIN +PMI+tissue+disease)
following error is thrown:
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
In the first approach where some variables were numeric, the warning was given but it worked and we obtained some results.
In the second approach it didn’t work at all. Now we are not quite sure, why there is no linear combination when dealing with numerics, but it appears when converting them to factors.
So the question is, is it right, to convert the numeric values here as factors and if so, how we can check if there is really a linear combination? As we are dealing with more then 30 replicates, it is not obvious to check, if there are some, as most of the numeric variables are highly diverse between the replicates.
The second question is, is it ok to regress so many different variables to test only condition induced differences?
Thanks a lot for your help!
Hi Kevin,
thanks a lot for your answer.
I'm sorry if this is maybe basic, but does DESeq take into account how similar numeric values are? E.g. having 3 replicates with age 80,79 and 20, will the first two be treated somehow more similar? Then I understand, why scaling makes sense. I assumed, regression in DESeq just handles values, if they are same or not, so the difference between the three replicates in the example is always 1. This is, why I thought, having own categories to ages is good, so only exactly same aged replicates will be treated together.
I tried running it with scaling all numeric variables and it worked without warnings, thanks!
And regarding your question: Thats a good question. I saw in some related papers, that people are regressing everything, what can be extracted from the metadata. We have sex of the replicates as another value. When not regressing that, genes like XIST, which are clearly sex related, come up. I assume, that biological values like age and sex, quality ones like RIN and batch, and also tissue difference can somehow confound the analysis. So actually I'm not quite sure, what is the best approach. I tried with your scale suggestion now, regressing nothing (using Wald test), and then add different metadata columns. Overall the results are somehow pretty similar, however, using everything to reduce, genes like XIST come up again, so probably there are too many variables to fit then. But I'm not sure, how to weigh then the importance, what to regress and what not. Guess, I need to try a bit and compare results. Thanks again!
DESEq2 will just take whatever is the result of
factor()
applied to the continuous variable; so, even, for example, 20.2 will be different from 20.22.You could dichotomise your continuous variables into meaningful groups, if desired. For example:
-------------
Yes,
sex
can confound certain RNA-seq studies, with XIST and TSIX being among the primary genes that drive this. So, includingsex
is generally regarded as fine, if necessary. I am not too sure about including RIN in the model - I have heard mixed opinions on this, and have no opinion of my own.With regard to tissue, sometimes, tissue-specific differences can be too great such that stratifying an analysis by tissue may be more appropriate. For example, I did a previous study where it became too difficult to analyse blood and synovial tissue together due to their vast transcriptomal differences. The globin genes will be the markers of blood tissue, unless these are depleted in the wet-lab protocol (blood samples in red, with haemoglobin alpha-1 subunit expression plotted):
Thanks again for your answer! One last question then, from your first answer, I got the impression that you suggest using scaling for numeric variables with high range (like age), instead of factorizing it. And now you had this suggestion of making categories (I was already thinking of something similar). Do you have a hint, what might be more appropriate here?
I agree, sometimes the differences between tissues can be too high, I also experienced that. Thanks!
From my position, I cannot really say which is more appropriate for your data and study area. I would check how the samples are distributed via a PCA bi-plot based on age, though. In doing this, it may be inferred that age is indeed something for which we must control by including it in the design formula.
You could also simply regress age against your outcome to see if there is a statistically significant association, which would again merit the inclusion of age in the design formula. e.g., outside DESeq2:
Perfect, thanks very much Kevin!
Hello kevin lets say I have a single factor in my design such as age how do I interpret the result do we need to do this "You could dichotomise your continuous variables into meaningful groups, if desired. " in case of numerical variable?