DESeq2 notice about GLM convergence
1
0
Entering edit mode
@stephenhartley-9155
Last seen 3.2 years ago
United States

I was running DESeq2 and I noticed a message/warning that I don't think I've seen before:

the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). it is generally a good idea to center and scale numeric variables in the design to improve GLM convergence.

I just want to make sure I understand correctly what it wants, because it's not mentioned in the vignette. So if I had a variable like age, which ranges from, say, 1 to 100, I should transform it like this:

age.rescaled <- ( age - mean(age) ) / sd(age)

Is this correct?

deseq2 RNASeq • 1.1k views
ADD COMMENT
0
Entering edit mode

Sometimes I get a very different p-value when I do a simple linear transform of an independent variable.

That . . . doesn't seem right...

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 18 hours ago
United States

Yes, that works well.

I added this recently because I noticed a user had a dataset with many continuous covariates with widely varying range, e.g. some were 0-1, otherwise were 1,000-2,000, etc. It doesn't matter typically, but it's just better to standardize these to help out the model fitting.

ADD COMMENT
0
Entering edit mode

Sometimes I get a very different p-value when I do a simple linear transform of an independent variable.

That . . . doesn't seem right...

What's going on here?

# **Minimal example:**

#Take the pasilla dataset from the vignette: (code copy pasted from vignette):

library("pasilla",verbose=F,quietly=T)

library("DESeq2",verbose=F,quietly=T,warn.conflicts=F)

pasCts <- system.file("extdata", "pasilla_gene_counts.tsv",package="pasilla", mustWork=TRUE)

pasAnno <- system.file("extdata", "pasilla_sample_annotation.csv",  package="pasilla", mustWork=TRUE)

cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))

coldata <- read.csv(pasAnno, row.names=1)

coldata <- coldata[,c("condition","type")]


#Make a randomized continuous X variable:

set.seed(1348948086)

coldata$RANDX.RAW   <- ifelse( coldata$condition == "treated",
                               runif(nrow(coldata)/2,0,3),
                               runif(nrow(coldata)/2,2,4)  )

#Make a second variable that is just the same variable, multiplied by 10:

coldata$RANDX.x10   <- coldata$RANDX.RAW * 10

#(More code copy-pasted from the vignette:)

rownames(coldata) <- sub("fb", "", rownames(coldata))

cts <- cts[, rownames(coldata)]

all(rownames(coldata) == colnames(cts))

dds.RAW <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,  design = ~ RANDX.RAW)

dds.x10 <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,  design = ~ RANDX.x10)

dds.RAW <- (DESeq(dds.RAW))

dds.x10 <- (DESeq(dds.x10))

res.RAW <- as.data.frame(results(dds.RAW))

res.x10 <- as.data.frame(results(dds.x10))

res.RAW$LP <- log10(res.RAW$pvalue)

res.x10$LP <- log10(res.x10$pvalue)

names(res.RAW) <- paste0("RAW.",names(res.RAW));

names(res.x10) <- paste0("x10.",names(res.x10));

res.ALL <- cbind.data.frame(res.RAW,res.x10);

#Check if the two p-values are more an order of magnitude apart: (This should always be FALSE)

table( abs( res.ALL$RAW.LP - res.ALL$x10.LP ) > 1 )

problem.idx <- which( abs( res.ALL$RAW.LP - res.ALL$x10.LP ) > 1)

res.ALL[ problem.idx , ]

# RESULT:

> t(res.ALL[problem.idx, ])

                   FBgn0030880
    RAW.baseMean        13.0066400
    RAW.log2FoldChange   4.7633803
    RAW.lfcSE            2.2081207
    RAW.stat             2.1572101
    RAW.pvalue           0.0309893
    RAW.padj             0.4687206
    RAW.LP              -1.5087882
    x10.baseMean        13.0066400
    x10.log2FoldChange   0.1345846
    x10.lfcSE            0.2252476
    x10.stat             0.5974963
    x10.pvalue           0.5501761
    x10.padj             0.9674915
    x10.LP              -0.2594983



#########Just for replicability:
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.10 (Final)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DESeq2_1.26.0               SummarizedExperiment_1.16.1
 [3] DelayedArray_0.12.2         BiocParallel_1.20.1        
 [5] matrixStats_0.55.0          Biobase_2.46.0             
 [7] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
 [9] IRanges_2.20.2              S4Vectors_0.24.3           
[11] BiocGenerics_0.32.0         pasilla_1.14.0             

... (cut off because it has to be < 5000 characters)

ADD REPLY
0
Entering edit mode

This gene has 5 0's and then a sample with 103, and your simulated covariate then causes issues. glm.nb also fails to converge for this gene and with the covariate you generated. You have a repeated value of 3.668005, one time associated with 103 and one time associated with 0, so that will lead to a super high dispersion estimate, but then the rest of the counts are four 0's and a 1. It's more of a problem with the simulation I think.

ADD REPLY
0
Entering edit mode

Is there an easy way to tell when the GLM failed to converge? Is there a way I could flag such p-values as being problematic?

ADD REPLY
0
Entering edit mode

Yes, it is labelled in the mcols, but in this case, I should say that only glm.nb failed to converge. This is just a weird simulation gene, because you have a repeated value of your covariate, once with 0 and once with a large count, whereas the other values are all 0. Better to filter out genes where you have only 2 samples with non-zero counts.

ADD REPLY
0
Entering edit mode

So on my real dataset, I ran the same model twice, once with the age variable reduced by 100. The majority of the results were almost identical, but in 142 genes they changed by an order of magnitude or more. A few even achieved genome wide significance in one version but not the other...

I've tried raising the inclusion thresholds. The new run requires that the mean count greater than 10 and that 10 samples have more than 10 reads.

Which variable indicates it failed to converge? Should I drop genes where the glm.nb failed to converge?

ADD REPLY
0
Entering edit mode

Do these have the property of many 0’s?

Are you including a large range covariate with others that are small in range? In this case, you should just follow the instruction and reduce the range for better fit.

ADD REPLY

Login before adding your answer.

Traffic: 658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6