Question

Can the limma package be applied to Log2 RUV-normalized data?

1

Entering edit mode

pg45863 ▴ 10

@e44d727a

Last seen 18 months ago

Portugal

So I have a dataset that consists of the batch correction through RUV-normalization of several microarray datasets containing tumoral and non-tumoral samples. The data is in Log2 RUV-normalized expression. I want to perform differential expression analysis. Is the limma package in R fit for this?

From what I've read the limma package expects Log2 expression data without normalization, but some tutorials I also find use normalized data.

Thank you very much to all!

RUVnormalizeData MicroarrayData limma • 2.5k views

ADD COMMENT • link updated 16 months ago by Shaimaa Gamal • 0 • written 22 months ago by pg45863 ▴ 10

score 1 · Answer 1 · 2023-05-23

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

Yes, you can use limma to analyse RUV-normalized microarray data if the number of samples in the dataset is large.

It isn't a matter of whether limma is fit or not for this sort of data. limma handles general linear models so, if the data isn't suitable for limma, then it won't be suitable for any other statistical tools either.

The reason why we recommend that data not be batch-corrected before limma analyses is that batch-corrected data is artifically homogeneous compared to original data without batch effects, so there is a risk of underestimating the between-sample variability leading to false discoveries. However, if the dataset is large and you use RUV, then this effect may be relatively minor.

If the dataset is not large, then it might be better to use RUV to generate surrogate variables and to include the surrogate variables in the limma design matrix.

ADD COMMENT • link 22 months ago Gordon Smyth 52k

1

Entering edit mode

Hi Gordon,

If the dataset is not large, then it might be better to use RUV to generate surrogate variables and to include the surrogate variables in the limma design matrix.

What is the difference between "batch-corrected data" and "surrogate variables" (generated by RUV)? Could you elaborate the design with those surrogate or point to an example please ?

Best, Samuel

ADD REPLY • link 22 months ago SamGG ▴ 360

0

Entering edit mode

Hello Gordon.Thank you for your answer. Yes, my dataset is large so I think I'm good, but have to have into account those characteristics you mentiones.

My next step was to perform GSEA. However I only get signifcant p-values and no significant adjusted p-values. Do you think it may be because of that artificial homogeinity you refered?

ADD REPLY • link 22 months ago pg45863 ▴ 10

0

Entering edit mode

No, that is not the reason. Homogeneity from batch correction tends to produce too many DE results rather than too few.

ADD REPLY • link 22 months ago Gordon Smyth 52k

0

Entering edit mode

I used the ClusterProfiler package to perform the GSEA analysis as follows. Do you find something wrong with it? I have a panel of 630 genes and I want to do the GSEA with the ImmuneSigDB gene collection.

> library(clusterProfiler)
> library(msigdbr)
> library(openxlsx)

> hs_immune_df <- msigdbr(species = "Homo sapiens",
+                           category = "C7",
+                           subcategory = "IMMUNESIGDB")

> data <- read.xlsx("SHH_Fchange_for_volcano.xlsx")
> data$genesymbol <- data[,1]
> data <- data[,-1]
> head(data)

avg.SHH avg.NonTumor   LOG2Fc   statistic$p statistic$p.adj genesymbol
1 1.8764427  -0.32140825 2.197851 8.971148e-116        9.0e-116      CXCR4
2 1.2022301  -0.61802680 1.820257  1.393914e-70         1.4e-70     COL3A1
3 1.7586427   0.08035498 1.678288  4.435748e-80         4.4e-80      TGFB2
4 0.4601373  -0.82755567 1.287693  9.799204e-55         9.8e-55        PBK
5 0.3916605  -0.87553505 1.267196  1.712509e-22         1.7e-22       CD24
6 1.0681264  -0.06568419 1.133811  4.272034e-76         4.3e-76     NFATC1

> lfc_vector <- data %>%
+   # Extract a vector of `log2FoldChange` named by `gene_symbol`
+   dplyr::pull(LOG2Fc, name = genesymbol)

> # Look at first entries of the log2 fold change vector
> head(lfc_vector)
   CXCR4   COL3A1    TGFB2      PBK     CD24   NFATC1 
2.197851 1.820257 1.678288 1.287693 1.267196 1.133811 

> # Look at the last entries of the log2 fold change vector
> tail(lfc_vector)
    CXCL14      TRAF3       RORA      S100B       NEFL      EOMES 
-0.8077164 -0.8486950 -0.9762321 -1.1426779 -1.2737958 -2.1757762

> gsea_results <- GSEA(geneList = lfc_vector,  # ordered ranked gene list
+                      minGSSize = 25,  # minimum gene set size
+                      maxGSSize = 500,  # maximum gene set set
+                      pvalueCutoff = 0.05,
+                      pAdjustMethod = "fdr",  # correction for multiple hypothesis testing
+                      TERM2GENE = dplyr::select(hs_immune_df,
+                                                gs_name,
+                                                gene_symbol))

My batch-corrected data looks like this:

Patients     A2M   ABCB1   ABCF1   ACKR2     ADA  ADGRE2     AGK   AICDA    AIRE
1  Patient0  0.2895  0.5023  0.0848 -0.0767 -0.0226  0.1158 -0.4757 -0.0283  0.2186
2  Patient1  1.2463  1.2426  0.5559  0.2744 -0.0375  0.1684  0.1734  0.2079  0.1558
3  Patient2 -0.1309 -0.5671  0.1182  0.3907  0.7206 -0.2642  0.0771  0.0282 -0.1934
4  Patient3 -0.8332 -0.5815  0.0412 -0.3495 -0.0117  0.1162  0.0308 -0.0611  0.2594
5  Patient4 -0.5526 -0.1716 -0.0118  0.0176 -0.3979 -0.3081  0.0494  0.1550 -0.0663
6  Patient5  0.0017  1.1230  0.1299  0.2393 -0.1945  0.0663 -0.1494  0.1685  0.0566
7  Patient6  0.6156  0.3743  0.2574  0.0155  0.1537 -0.1182 -0.2632  0.2076  0.0945
8  Patient7  0.2293  0.0532 -0.0514 -0.4302  0.5099  0.0827 -0.1653  0.7120  0.1221
9  Patient8 -0.4436  0.1204 -0.1118 -0.1493 -0.3487 -0.0533 -0.5002 -0.1257 -0.0079
10 Patient9 -0.5870  1.1584  0.1633 -0.5265  0.4070 -0.5457  0.3359  0.1575 -0.1248

ADD REPLY • link 22 months ago pg45863 ▴ 10

0

Entering edit mode

Hi Gordon, On what basis we can determine whether the dataset is large or not? I am working on a dataset of 515 samples and 5440 genes. I was wondering if the dataset is large enough to not be affected by RUV correction because I will use it for limma voom analysis?

ADD REPLY • link 16 months ago Shaimaa Gamal • 0

2

Entering edit mode

515 samples is certainly large enough.

ADD REPLY • link 16 months ago Gordon Smyth 52k

0

Entering edit mode

I see. Do you recommend a certain method to get norm.factors through calcNormFactors function espcially if the downstream analysis includes limma voom.

calcNormFactors (object, method = c("TMM","TMMwsp","RLE","upperquartile","none")

I tried the TMM method and it didn't perform well. On the contrary, the TMMwsp performed better.

ADD REPLY • link 16 months ago Shaimaa Gamal • 0