Question

lmFit very slow if there are missing values


Frederik Ziebell ▴ 30

@frederik-ziebell-14676


Heidelberg, Germany

Having only a single missing value slows lmFit down by over an order of magnitude:

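library("limma")
library("tictoc")

n_genes <- 10^6

# Simulate gene-wise standard deviations and a 6-sample expression matrix
sd <- 0.3*sqrt(4/rchisq(n_genes,df=4))
y <- matrix(rnorm(n_genes*6,sd=sd),n_genes,6)
# Make the first two genes differentially expressed in samples 4-6
y[1:2,4:6] <- y[1:2,4:6] + 2
# Two-group design: intercept plus group 2 vs group 1 effect
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))

# Same data with a single missing value
y_NA <- y
y_NA[1,1] <- NA

# Time the fit without missing values
tic()
fit <- lmFit(y,design)
toc()

# Time the fit with one missing value
tic()
fit <- lmFit(y_NA,design)
toc()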

While the first fit takes about 1.1 sec, the second takes over a minute. Is this a bug?

limma • 1.2k views

ADD COMMENT • link updated 5.4 years ago by Gordon Smyth 52k • written 5.4 years ago by Frederik Ziebell ▴ 30

score 0 · Answer 1 · 2019-08-04


Gordon Smyth 52k

@gordon-smyth


WEHI, Melbourne, Australia

The timings you give show that lmFit is actually very fast, especially so when there are no NAs. You are the first person ever to view that as a "bug".

lmFit does an initial scan for NAs or weights and, if they are absent, then it runs a special super-fast algorithm that only works when there are no NAs.

ADD COMMENT • link 5.4 years ago Gordon Smyth 52k


Thank you for the clarification. The actual dataset I have contains many conditions, so lmFit takes about half an hour, which is why I initially viewed it as a bug. Just out of curiosity, what's the super-fast algorithm that only works if there are no NAs?

ADD REPLY • link 5.4 years ago Frederik Ziebell ▴ 30


If there are no weights or NAs then the same QR decomposition can be applied to all genes.

Even with NAs, lmFit should still be about 20 times faster than looping through the rows with lm() and summary().

ADD REPLY • link 5.4 years ago Gordon Smyth 52k
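
The difference between the two code paths described in this thread can be illustrated with a minimal base R sketch. This is not limma's actual implementation; the helper fit_by_gene() and the variable names are hypothetical, and the data are simulated. It only shows why one QR decomposition of the design can serve every gene when nothing is missing, whereas a gene with an NA needs its own fit on the observed samples.

```r
# Illustrative sketch in base R only; this is NOT limma's source code.
set.seed(1)
n_genes <- 1000
design <- cbind(Grp1 = 1, Grp2vs1 = c(0, 0, 0, 1, 1, 1))
y <- matrix(rnorm(n_genes * 6), n_genes, 6)

# Fast path: no NAs, so one QR decomposition of the design serves every gene.
qr_design <- qr(design)
coef_shared <- t(qr.coef(qr_design, t(y)))  # n_genes x 2 coefficient matrix

# Slow path (hypothetical helper): fit each gene separately,
# dropping that gene's missing observations from y and from the design.
fit_by_gene <- function(y, design) {
  coefs <- matrix(NA_real_, nrow(y), ncol(design),
                  dimnames = list(NULL, colnames(design)))
  for (i in seq_len(nrow(y))) {
    obs <- !is.na(y[i, ])
    coefs[i, ] <- lm.fit(design[obs, , drop = FALSE], y[i, obs])$coefficients
  }
  coefs
}
coef_by_gene <- fit_by_gene(y, design)

# Both paths give the same coefficients when nothing is missing.
stopifnot(isTRUE(all.equal(unname(coef_shared), unname(coef_by_gene))))
```

The gene-wise loop also accepts a matrix containing NAs, for example fit_by_gene(y_NA, design) after setting y_NA <- y and y_NA[1,1] <- NA, at the cost of one decomposition per gene instead of one in total.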