Factor vs Character in design
United Kingdom

I have no code to post. This is a question about the differing results I get when I change a comparison variable in the design formula from character to factor.

I am comparing differential expression across age groups.

The data has a variable 'age' with these values: 20, 25, 30, 35, 40, 45, 50, with 20 as the base comparison level.

When I run results with 'age' as a factor I get:

 Gene   baseMean   log2FoldChange  lfcSE       stat         pvalue        padj
 GeneZ  2.0324404  -0.0230828518   0.17758857  -0.12997938  0.8965827428  0.96129754

But when I run it with 'age' as a character I get:

 Gene   baseMean   log2FoldChange  lfcSE       stat         pvalue        padj
 GeneZ  2.0324404  -0.013965354    0.17827642  -0.07833539  9.375613e-01  9.842875e-01

Is R treating the factor data as numerical ordinal?

So, which should I use?

(A single gene is shown as an example; note the padj values.)

Many thanks.

United States

Using a character age vs a factor age won't make a difference. R will convert the character age to a factor, with the same level order, and then proceed. As an example:

> fakeo <- data.frame(vals = rnorm(70), age = as.character(rep(seq(20,50,5), each = 10)))
> summary(lm(vals~age, fakeo))

lm(formula = vals ~ age, data = fakeo)

    Min      1Q  Median      3Q     Max 
-2.8111 -0.5374  0.1354  0.5590  2.1417 

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01985    0.31871  -0.062    0.951
age25       -0.26316    0.45072  -0.584    0.561
age30       -0.52770    0.45072  -1.171    0.246
age35       -0.27076    0.45072  -0.601    0.550
age40        0.39448    0.45072   0.875    0.385
age45        0.33408    0.45072   0.741    0.461
age50        0.11166    0.45072   0.248    0.805

Residual standard error: 1.008 on 63 degrees of freedom
Multiple R-squared:  0.0978,    Adjusted R-squared:  0.01188 
F-statistic: 1.138 on 6 and 63 DF,  p-value: 0.3509

> fakeo$age <- factor(fakeo$age)
> fakeo$age
 [1] 20 20 20 20 20 20 20 20 20 20 25 25 25 25 25 25 25 25 25 25 30 30 30 30 30 30 30 30 30 30 35 35 35 35 35 35 35
[38] 35 35 35 40 40 40 40 40 40 40 40 40 40 45 45 45 45 45 45 45 45 45 45 50 50 50 50 50 50 50 50 50 50
Levels: 20 25 30 35 40 45 50
> summary(lm(vals~age, fakeo))

lm(formula = vals ~ age, data = fakeo)

    Min      1Q  Median      3Q     Max 
-2.8111 -0.5374  0.1354  0.5590  2.1417 

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01985    0.31871  -0.062    0.951
age25       -0.26316    0.45072  -0.584    0.561
age30       -0.52770    0.45072  -1.171    0.246
age35       -0.27076    0.45072  -0.601    0.550
age40        0.39448    0.45072   0.875    0.385
age45        0.33408    0.45072   0.741    0.461
age50        0.11166    0.45072   0.248    0.805

Residual standard error: 1.008 on 63 degrees of freedom
Multiple R-squared:  0.0978,    Adjusted R-squared:  0.01188 
F-statistic: 1.138 on 6 and 63 DF,  p-value: 0.3509

Same results, regardless.
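
One way to see why the two fits agree is to compare the design matrices directly. The sketch below uses base R only and made-up data; the names toy, age_chr, age_fac and age_num are hypothetical and not from the original post. It also shows what would change if age were truly treated as numeric: the design would then have a single slope column instead of one column per age group.

```r
# Toy data for illustration only (hypothetical names, random values).
set.seed(1)
ages <- rep(seq(20, 50, 5), each = 3)
toy <- data.frame(
  vals    = rnorm(length(ages)),
  age_chr = as.character(ages),  # age stored as character
  age_fac = factor(ages),        # age stored as factor
  age_num = ages,                # age stored as numeric
  stringsAsFactors = FALSE
)

# Design matrices built from each representation of age.
X_chr <- model.matrix(~ age_chr, toy)
X_fac <- model.matrix(~ age_fac, toy)
X_num <- model.matrix(~ age_num, toy)

# Character and factor versions should hold the same numbers
# (column names differ only because the variable names differ).
same_design <- isTRUE(all.equal(unname(X_chr), unname(X_fac),
                                check.attributes = FALSE))
print(same_design)

# Number of columns: intercept plus one per non-base level,
# versus intercept plus a single slope for numeric age.
print(ncol(X_chr))
print(ncol(X_num))
```

If the character and factor designs match like this, any difference in results has to come from somewhere else, such as the level order or the contrast being extracted.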

Put a different way, it's likely that you are computing a different contrast somehow, rather than this having anything to do with how R handles numeric-looking characters.

The only thing I can imagine is that when using a factor, 20 is not the base level, whereas when using a character the internal conversion makes 20 the base level. There are posts here that show factor level order can make a slight difference in how DESeq2 estimates model parameters.
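
A quick way to check this is to look at the first level of the factor and reset it if needed. This is a minimal base R sketch on invented values; f_default, f_reordered, f_fixed and odd are hypothetical names, and the reordered levels are just one assumed way a factor could end up with a base level other than 20.

```r
# Invented age labels, stored as character.
age_chr <- as.character(rep(seq(20, 50, 5), each = 2))

# Default conversion: levels are the sorted unique strings.
f_default <- factor(age_chr)
print(levels(f_default))

# A factor created with an explicit, different level order
# (assumed scenario: "50" was listed first somewhere upstream).
f_reordered <- factor(age_chr,
                      levels = c("50", "20", "25", "30", "35", "40", "45"))
base_before <- levels(f_reordered)[1]

# relevel() moves the chosen reference level to the front.
f_fixed <- relevel(f_reordered, ref = "20")
base_after <- levels(f_fixed)[1]
print(c(before = base_before, after = base_after))

# Caveat: strings are sorted as text, not as numbers, so labels with
# different digit counts may not end up in numeric order.
odd <- factor(c("5", "20", "100"))
print(levels(odd))
```

Comparing levels() on the factor version of age against the order produced by factor() on the character version would show whether the two runs used the same base level.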

That's possible as well, although OP says 20 is the baseline.

That would seem a logical and reasonable explanation of what I'm seeing.

United Kingdom

Excellent explanation - thank you very much.


