Entering edit mode
Having only a single missing value slows lmFit down by over an order of magnitude:
library("limma")
library("tictoc")
n_genes <- 10^6
sd <- 0.3*sqrt(4/rchisq(n_genes,df=4))
y <- matrix(rnorm(n_genes*6,sd=sd),n_genes,6)
y[1:2,4:6] <- y[1:2,4:6] + 2
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
y_NA <- y
y_NA[1,1] <- NA
tic()
fit <- lmFit(y,design)
toc()
tic()
fit <- lmFit(y_NA,design)
toc()
While the first fit takes about 1.1sec, the second needs over a minute. Is this a bug?
Thank you for the clarification. The actual dataset I have contains many conditions and so lmFit takes about half an hour, that's why I initially viewed it as a bug. Just out of curiosity, what's the super-fast algorithm that only works if there are no NAs?
If there are no weights or NAs then the same QR decomposition can be applied to all genes.
Even with NAs,
lmFit
should still be about 20 times faster than looping through the rows withlm()
andsummary()
.