Question

limma models out of memory issue on Slurm HPC

0

Entering edit mode

Jitendra ▴ 10

@nabiyogesh-11718

Last seen 5 months ago

United Kingdom

I am trying to run these below regression models using limma R package, but getting an issue due to out of memory even though I have mentioned more memory than this job script needed to run. First three models run successfully but then shows out of memory issue on SLURM HPC. Is there any other way to run these models using MPI or by using reduced memory?

#model matrix

var1<-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking, data=targets2)

var2<-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + Batch+ Sentrix_Position, data=targets2)

var3 <-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + PC1 + PC2 + PC3, data=targets2)

var4 <-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + PC1 + PC2 + PC3+Glucose, data=targets2)

var5 <-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + PC1 + PC2 + PC3+eGFR, data=targets2)

var6 <-model.matrix(~Analysis5 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + PC1 + PC2 + PC3+Chol, data=targets2)







##############
fit1<-lmFit(mval,var1)


fit1<-eBayes(fit1,trend=TRUE, robust=TRUE)

probe1<-topTable(fit1,adjust="BH",coef=2,num=Inf)

sig.probe1<-probe1[which(probe1$adj.P.Val<=0.05),]

write.table(sig.probe1,file="limma..model1.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe1,file="limma..model1..output.txt",sep="\t",quote=TRUE)
#####################################
#model2

########

#model2

fit2<-lmFit(mval,var2)


fit2<-eBayes(fit2,trend=TRUE, robust=TRUE)

probe2<-topTable(fit2,adjust="BH",coef=2,num=Inf)

sig.probe2<-probe2[which(probe2$adj.P.Val<=0.05),]

write.table(sig.probe2,file="limma..model2.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe2,file="limma..model2..output.txt",sep="\t",quote=TRUE)



###########################


fit3<-lmFit(mval,var3)


fit3<-eBayes(fit3,trend=TRUE, robust=TRUE)

probe3<-topTable(fit3,adjust="BH",coef=2,num=Inf)

sig.probe3<-probe3[which(probe3$adj.P.Val<=0.05),]

write.table(sig.probe3,file="limma..model3.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe3,file="limma..model3..output.txt",sep="\t",quote=TRUE)


#######################################
fit4<-lmFit(mval,var4)


fit4<-eBayes(fit4,trend=TRUE, robust=TRUE)

probe4<-topTable(fit4,adjust="BH",coef=2,num=Inf)

sig.probe4<-probe4[which(probe4$adj.P.Val<=0.05),]

write.table(sig.probe4,file="limma..model4.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe4,file="limma..model4..output.txt",sep="\t",quote=TRUE)
##########
fit5<-lmFit(mval,var5)


fit5<-eBayes(fit5,trend=TRUE, robust=TRUE)

probe5<-topTable(fit5,adjust="BH",coef=2,num=Inf)

sig.probe5<-probe5[which(probe5$adj.P.Val<=0.05),]

write.table(sig.probe5,file="limma..model5.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe5,file="limma..model5..output.txt",sep="\t",quote=TRUE)


##########

fit6<-lmFit(mval,var6)


fit6<-eBayes(fit6,trend=TRUE, robust=TRUE)

probe6<-topTable(fit6,adjust="BH",coef=2,num=Inf)

sig.probe6<-probe6[which(probe6$adj.P.Val<=0.05),]

write.table(sig.probe6,file="limma..model6.significant..output.txt",sep="\t",quote=TRUE)
write.table(probe6,file="limma..model6..output.txt",sep="\t",quote=TRUE)

Many thanks,

Regression limma STATISITICS HPC • 1.8k views

ADD COMMENT • link updated 8 months ago by Gordon Smyth 52k • written 8 months ago by Jitendra ▴ 10

0

Entering edit mode

These models sound using a lot of variables to me. Recently Gordon said "limma and voom work with any linear model. There can be any combination of continous covariates and categorical factors. The only requirement is that there should be enough samples to estimate all the model coefficients with some degrees of freedom left over to estimate the variance.". How many samples are there in this experiment?

ADD REPLY • link 8 months ago SamGG ▴ 360

0

Entering edit mode

Thanks, around 2000 samples and we are using two input files here; 1) sample phenotypes files (size= 442k) and methylation epic array data files (25G).

ADD REPLY • link 8 months ago Jitendra ▴ 10

score 2 · Accepted Answer · 2024-03-25

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Your code seems to be very memory inefficient because you are storing all six linear model fits in memory at the same time without any need to do so. You could evidently run this work within the available memory if you simply used the same name fit for each linear model fit (instead of fit1 to fit6). Altnernatively you could run each linear model fit (var1 to var6) in a separate batch run, and again that would presumably work.

Your code also stores 12 different data.frames of results in memory at the same time. Again that is unnecessary -- only one data.frame object is actually necessary.

Added later: final resolution

After some simple checking, it turns out that you have made a programming error, converting the numeric covariate Glucose into a factor with a huge number of levels, almost as many levels as there are samples. Once that error is corrected, this analysis would run on even on my laptop computer with 32Gb of memory. I've just confirmed that the same limma analysis runs on my laptop in a minute or two on simulated data with same number of rows and columns as your data.

ADD COMMENT • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks Gordon, I am thinking to run individually these models or is there any other way to make this code memory efficient.

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

I don't see how limma could be causing any serious memory problems. I can run limma code similar to yours on my laptop with 2000 samples, 100000 probes and 14 covariates in the design matrix. In my experience, there's no need for huge memory or HPC or MP1, even with much larger problems than 2000 samples.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks Gordon, but our methylation epic array input file is quite big and we can't read it in R by just using laptop memory so we need to use HPC. I have tried to use same name fit for each linear model but still it is not working. thanks for your help. Just to let you know I am only able to run first 3 models but after that it shows memory issue. Many thanks, Jitendra

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

Alternatively you could run each linear model fit (var1 to var6) in a separate batch run, and again that would presumably work.

I think you should try this suggestion from Gordon, as each batch will run in a "clean" R session.

The Infinium methylation EPIC array allows the interrogation of methylation patterns at the genome-wide level, covering more than 850,000 methylation sites.

If possible, try to increase the initial filtering to reduce the number of sites to before analysis.

Nevertheless, 850000 2000 8/1024^3 = 12.66 GB, so this dataset should occupy 13 GB. I checked that r = matrix(rnorm(85000*2000),85000,2000) (1/10 of such a dataset) occupies 1.4 GB on a Windows PC. The number of elements is lower than 2^31, so no problem on that side.

Ultimately, in the event of blocking and with such a large number of "probes", I would try to randomly divide the data into 2 or 4 and group the results at the end... unless others say this is totally insane.

ADD REPLY • link 8 months ago SamGG ▴ 360

0

Entering edit mode

Thanks, do you think if I installed latest limma package then it can able to handle this datasets. currently I am using this limma R version; "limma 3.56.2". I am not sure how to update it to latest limma version.

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

No, updating limma would make no difference. limma does not offer any alternative version that uses less memory.

If you want more help, please give more information about how big your problem is. We would need to see output from

dim(mval)
class(mval)
dim(var4)

Also, how are you reading mval into your R session? I wonder how much memory is being used before you even start limma.

As I said above, I can't see how limma can be using so much memory to be causing problems on the HPC system of the size you say it is in your email to me.

Have you considered initial filtering of the mval matrix, as suggested by SamGG? Filtering is a standard step in most limma analyses, but you seem to be omitting it.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks, I have attached here some of the outputs for one of the analysis, I can't filter any probes here as we want to run association analysis with all genome wide CpG probes.

> dim(mval)
[1] 764875   1455
> class(mval)
[1] "data.frame"

> dim(var)

[1] 1455 1124

I am reading methylation data file using fread, but later convert it to data frame; like shown below;

> betaval<- fread("beta.filter.csv")
|--------------------------------------------------|
|==================================================|
Warning message:
In fread("beta.filter.csv") :
  Detected 1456 column names but the data has 1457 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
> head(betaval, 10) [,1:5]
       V1         ID    N.1.0093        N.1.0095     N.1.0097
    <int>     <char>       <num>       <num>       <num>
 1:     1 cg26928153  0.93939394  0.94126350  0.92441664
 2:     2 cg16269199  0.90022785  0.90201061  0.87172497
 3:     3 cg13869341  0.93593791  0.92627636  0.92429134
 4:     4 cg24669183  0.89818676  0.90142927  0.86383383
 5:     5 cg26679879  0.44733850  0.36922633  0.42294328
 6:     6 cg22519184  0.45240135  0.39721049  0.44954496
 7:     7 cg15560884  0.83876023  0.89411495  0.90817973
 8:     8 cg01014490  0.01167777  0.02401764  0.01091026
 9:     9 cg10692041  0.79697215  0.82967048  0.89695114
10:    10 cg02339369  0.82484170  0.86534621  0.77615458
> #targets2<-targets[,-1]
> betaval<-betaval[,-1]
> head(betaval, 10) [,1:5]
            ID        N.1.0093   N.1.0095       N.1.0097   N.1.0098
        <char>       <num>       <num>       <num>       <num>
 1: cg26928153  0.93939394  0.94126350  0.92441664  0.91881627
 2: cg16269199  0.90022785  0.90201061  0.87172497  0.89605784
 3: cg13869341  0.93593791  0.92627636  0.92429134  0.88896369
 4: cg24669183  0.89818676  0.90142927  0.86383383  0.91781553
 5: cg26679879  0.44733850  0.36922633  0.42294328  0.45145460
 6: cg22519184  0.45240135  0.39721049  0.44954496  0.44428988
 7: cg15560884  0.83876023  0.89411495  0.90817973  0.80325979
 8: cg01014490  0.01167777  0.02401764  0.01091026  0.02289879
 9: cg10692041  0.79697215  0.82967048  0.89695114  0.86189893
10: cg02339369  0.82484170  0.86534621  0.77615458  0.87268879
> #set first column header as blank
> names(betaval)[1] <- ""
> #convert into dataframe
> betaval <- data.frame(betaval)

Many thanks,

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

What is var? There is no variable by that name in your question.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

Just running everthing again so named as var ; model matrix; I also sent screenshot on your email.

#model.matrix
    > var <-model.matrix(~Analysis8 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking + PC1 + PC2 + PC3+Glucose, data=targets2)

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

And now what is Analysis8? Again, there was no variable by that name in your question.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

sorry, I am running several analysis/models; so Analysis8 is just another categorical variable (case/control for disease condition) for which we are running association analysis against CpG probes.

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

I don't understand how you can possibly have a design matrix with 1124 columns, which your var matrix has. That can't possibly give reliable results. Which factor in the model.matrix formula has >1000 levels?

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks Gordon for your time and help; how we can get below information.

Which factor in the model.matrix formula has >1000 levels?

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

This is basic checking that you need to be able to do for yourself.

Try typing colnames(var)[1:50] for example.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks, it is coming like this;

colnames(var)[1:50]
 [1] "(Intercept)"      "Analysis8control" "sex"              "Age"             
 [5] "CD8T"             "CD4T"             "NK"               "Bcell"           
 [9] "Mono"             "smoking"          "PC1"              "PC2"             
[13] "PC3"              "Glucose100.03"    "Glucose100.04"    "Glucose100.06"   
[17] "Glucose100.10"    "Glucose100.12"    "Glucose100.14"    "Glucose100.19"   
[21] "Glucose100.23"    "Glucose100.25"    "Glucose100.26"    "Glucose100.31"   
[25] "Glucose100.33"    "Glucose100.34"    "Glucose100.36"    "Glucose100.37"   
[29] "Glucose100.39"    "Glucose100.42"    "Glucose100.45"    "Glucose100.48"   
[33] "Glucose100.49"    "Glucose100.53"    "Glucose100.54"    "Glucose100.55"   
[37] "Glucose100.57"    "Glucose100.61"    "Glucose100.64"    "Glucose100.66"   
[41] "Glucose100.70"    "Glucose100.71"    "Glucose100.75"    "Glucose100.81"   
[45] "Glucose100.87"    "Glucose100.89"    "Glucose100.92"    "Glucose100.97"   
[49] "Glucose101.03"    "Glucose101.09"

I don't know why it is showing Glucose100, I don't have this variable in my phenotype file; I just have Glucose;

colnames(targets2)
 [1] "barcode"          "Sentrix_Position" "PC1"              "PC2"             
 [5] "PC3"              "PC4"              "PC5"              "Batch"           
 [9] "Age"              "Gender"           "sex"              "eGFR"            
[13] "CKDstage"         "creat"            "glucose"          "CD8T"            
[17] "CD4T"             "NK"               "Bcell"            "Mono"            
[21] "Gran"             "BMI"              "smoking"          "Analysis8"       
[25] "Chol"             "Glucose"

ADD REPLY • link 8 months ago Jitendra ▴ 10

2

Entering edit mode

This shows that you have made an error in reading the Glucose variable. Instead of being a numeric covariate, as it is supposed to be, it has somehow been converted into a categorical factor with more than 1100 distinct values. Every distinct numeric Glucose value has become a factor level. When you run model.matrix with this incorrect variable, it creates an incorrect design matrix with an absurdly large number of columns. This is also why you are running out of memory.

One way that this error can occur would be if the data file containing the Glucose values incorrectly contained a character value somewhere in the Glucose column, which will cause the whole column to be turned into a character vector. Alternatively, the problem would arise if you typed target2$Glucose <- factor(targets2$Glucose).

Anyway, this shows that the whole problem is a data processing error. It is not an issue of how limma uses memory.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much Gordon, I will check this and will update you. thanks again for your patience and time to resolve this issue.

Many thanks,

ADD REPLY • link 8 months ago Jitendra ▴ 10

1

Entering edit mode

If I may say so, you do need to become more familiar with factors and covariates in R. The reason why you see coefficient Glucose100.03 etc is that the numeric value of 100.03 has been converted into a categorical factor level.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

If I don't add Glucose then it comes like this and this works very well;

var<-model.matrix(~Analysis8 + sex + Age +CD8T +CD4T +NK + Bcell +Mono+smoking, data=targets2)
> dim(var)
[1] 1455   10

ADD REPLY • link 8 months ago Jitendra ▴ 10

0

Entering edit mode

I can't help adding that it would have been logical for you to check the Glucose variable for yourself, which would have saved lots of time for both you and me. When you posted this question, you said that the problem occurred for the fourth model, and the only change in the fourth model was that Glucose was added. So you already had enough information right at the start to suspect that the Glucose variable might be the problem.

A little bit of checking like tail(targets2) or apply(targets2,class) or summary(targets2) might diagnose possible problems.

ADD REPLY • link 8 months ago Gordon Smyth 52k

0

Entering edit mode

thanks Gordon, I will check these things for each regression analyses. Just one more thing, here we have coded sex as numerical variable ( 0 for female and 1 for male) and we are using it as covariate in the regression model. do you think should I convert this as factor or categorical variable or this is ok to use as numerical covariates.