I am trying to apply the limma package to a dataset that is much larger than a traditional microarray or RNA-seq experiment: 96 samples, but ~11,127,676 observations. I have previously been able to run limma on this dataset to make different comparisons between samples; however, now that I am trying to incorporate blocking by subject, my jobs time out before completing. The documentation says the following:
"The function may take long time to execute as it fits a mixed linear model for each gene for an iterative algorithm."
Taken from a previous question to demonstrate the design setup:
# For simplicity, we'll just use 2 replicates
# per tissue and time-point combination.
tissue <- rep(c("M", "C"), each=10)
time <- rep(rep(c(0, 6, 24, 72, 168), each=2), 2)
# Assume one subject contributes to all time points
# for simplicity. I know this isn't actually the case,
# but this is fine provided that most subjects contribute
# to at least two samples.
subject <- c(rep(1:2, 5), rep(3:4, 5))
The obvious design matrix is:
g <- factor(paste0(tissue, time))
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)
# Block on subject (eset holds the expression data;
# subject is the blocking vector defined above)
corfit <- duplicateCorrelation(eset, design, block = subject)
corfit$consensus
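For reference, the consensus correlation then feeds into lmFit(); a minimal sketch of the downstream steps, assuming eset holds the expression data (the contrast name is illustrative, built from the design columns above):
library(limma)

# Fit the model, using the consensus correlation to account
# for within-subject dependence
fit <- lmFit(eset, design, block = subject, correlation = corfit$consensus)

# Example contrast: tissue M vs C at time 0
# (column names come from colnames(design) set above)
cm <- makeContrasts(M0vsC0 = M0 - C0, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2, coef = "M0vsC0")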
Timing duplicateCorrelation() on my actual data with different numbers of observations, I get the following:
#1250: 5.088 sec elapsed
#2500: 9.374 sec elapsed
#5000: 19.896 sec elapsed
#10000: 38.284 sec elapsed
#20000: 74.756 sec elapsed
#40000: 147.583 sec elapsed
#80000: 286.273 sec elapsed
#160000: 697.932 sec elapsed
#320000: 1744.34 sec elapsed
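The "sec elapsed" output suggests the tictoc package; a sketch of how these timings could be collected, assuming eset is the full data and using the subset sizes above:
library(tictoc)
library(limma)

# Time duplicateCorrelation on increasing numbers of rows
sizes <- c(1250, 2500, 5000, 10000, 20000, 40000, 80000, 160000, 320000)
for (n in sizes) {
  tic(n)
  corfit <- duplicateCorrelation(eset[seq_len(n), ], design, block = subject)
  toc()
}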
If I plot the timings, growth looks roughly linear at first but becomes super-linear for the largest inputs. A linear fit to all but the largest subset gives:
y <- c(5.088, 9.374, 19.896, 38.284, 74.756, 147.583, 286.273, 697.932)
x <- c(1250, 2500, 5000, 10000, 20000, 40000, 80000, 160000)
lm(y~x)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
  -42.32812      0.00533
I ran these timings on my laptop (8 GB RAM, 2.7 GHz i5), but the actual job runs on a computing cluster with 128 GB RAM and 16 x 2.6 GHz Intel Xeon E5-2670 CPUs. Assuming this "linear" trend holds, the full computation on my laptop should take (0.00533 × 11,127,676)/3600 ≈ 16.47 hours. However, when running on the cluster with a 48-hour window it does not complete.
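The extrapolation, spelled out in R (slope from the fit, in seconds per row, times the full number of rows):
fit <- lm(y ~ x)
# predicted hours for all ~11.1M rows (~16.5)
unname(coef(fit)["x"]) * 11127676 / 3600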
Question: Is there a way to speed up duplicateCorrelation() (e.g. run it in parallel), or an alternative way to block on subject with the current design while still performing contrasts as in the previous post (117545)?
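One direction I have been considering is to chunk the rows and pool the results myself. This is an untested sketch: it assumes the per-gene atanh correlations returned in duplicateCorrelation's atanh.correlations component can be pooled across row chunks with the same 15%-trimmed mean the function uses internally to form its consensus:
library(parallel)
library(limma)

# Split rows into 16 chunks and run duplicateCorrelation on each in parallel
# (mclapply forks, so this will not run multicore on Windows)
chunks <- split(seq_len(nrow(eset)), cut(seq_len(nrow(eset)), 16))
fits <- mclapply(chunks, function(idx) {
  duplicateCorrelation(eset[idx, ], design, block = subject)
}, mc.cores = 16)

# Pool the per-gene correlations on the atanh scale, mirroring
# how duplicateCorrelation forms its consensus
all.atanh <- unlist(lapply(fits, `[[`, "atanh.correlations"))
consensus <- tanh(mean(all.atanh, trim = 0.15, na.rm = TRUE))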
Update: I randomly chose 10,000 rows 1,000 times from the entire dataset (since it was ordered in a specific way), ran duplicateCorrelation on each subset, and got a normally distributed set of corfit$consensus values, so I used the median as the parameter for my actual experiment. Thanks for all the help!

Reply: That's OK, but there is nothing to be gained by subsetting 1,000 times. It would be better to simply subset 100,000 rows once, although the result probably won't be very different from what you already have.
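In code, that one-shot suggestion would look something like this (a sketch; the seed and subset size are illustrative):
# Estimate the correlation once from a single random subset of rows
set.seed(123)
idx <- sample(nrow(eset), 100000)
corfit <- duplicateCorrelation(eset[idx, ], design, block = subject)
corfit$consensus  # use this value in the full lmFit() call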