lmFit very slow if there are missing values
@frederik-ziebell-14676
Heidelberg, Germany

Having only a single missing value slows lmFit down by over an order of magnitude:

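library("limma")
library("tictoc")

n_genes <- 10^6

# Simulate gene-wise standard deviations and a 6-sample expression matrix
sd <- 0.3 * sqrt(4 / rchisq(n_genes, df = 4))
y <- matrix(rnorm(n_genes * 6, sd = sd), n_genes, 6)
# Make the first two genes differentially expressed in samples 4 to 6
y[1:2, 4:6] <- y[1:2, 4:6] + 2
# Two groups of three samples each
design <- cbind(Grp1 = 1, Grp2vs1 = c(0, 0, 0, 1, 1, 1))

# Same data, but with a single missing value
y_NA <- y
y_NA[1, 1] <- NA

# Time the fit without missing values
tic()
fit <- lmFit(y, design)
toc()

# Time the fit with one missing value
tic()
fit <- lmFit(y_NA, design)
toc()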

While the first fit takes about 1.1 seconds, the second takes over a minute. Is this a bug?

limma • 1.2k views
@gordon-smyth
WEHI, Melbourne, Australia

The timings you give show that lmFit is actually very fast, especially so when there are no NAs. You are the first person ever to view that as a "bug".

lmFit does an initial scan for NAs and weights. If both are absent, it runs a special super-fast algorithm that only works when there are no NAs.

Thank you for the clarification. The actual dataset I have contains many conditions, so lmFit takes about half an hour, which is why I initially viewed it as a bug. Just out of curiosity, what is the super-fast algorithm that only works if there are no NAs?

If there are no weights or NAs then the same QR decomposition can be applied to all genes.

Even with NAs, lmFit should still be about 20 times faster than looping through the rows with lm() and summary().
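
The difference between the two cases can be sketched in base R. This is only an illustration under assumptions, not limma's internal code: the object names (qr_design, coef_shared, coef_by_row, max_diff) are made up for the example, and the data are a small simulated matrix in the style of the original post. Without NAs, the design matrix is factored once and all genes are solved in a single call. With NAs, each gene can have a different set of observed samples, so each row needs its own fit.

```r
# Illustration only: base R, not limma's internal implementation.
set.seed(1)
n_genes <- 200
design <- cbind(Grp1 = 1, Grp2vs1 = c(0, 0, 0, 1, 1, 1))
y <- matrix(rnorm(n_genes * 6), n_genes, 6)

# No NAs: factor the design once and solve for every gene in one call.
# qr.coef() accepts a matrix right-hand side, here one column per gene.
qr_design <- qr(design)
coef_shared <- t(qr.coef(qr_design, t(y)))

# With NAs: each gene may keep a different subset of samples, so each row
# gets its own least-squares fit on the columns that were observed.
y_NA <- y
y_NA[1, 1] <- NA
coef_by_row <- t(apply(y_NA, 1, function(row) {
  obs <- !is.na(row)
  lm.fit(design[obs, , drop = FALSE], row[obs])$coefficients
}))

# For genes without NAs, both approaches give the same coefficients.
max_diff <- max(abs(coef_shared[-1, ] - coef_by_row[-1, ]))
```

The per-row version repeats a matrix factorization for every gene, which is why a single NA anywhere in the matrix can change the running time so much once the shared-factorization shortcut no longer applies.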
