I have a dataset produced by batch-correcting (RUV normalization) several microarray datasets containing tumour and non-tumour samples. The data are log2 RUV-normalized expression values. I want to perform differential expression analysis. Is the limma package in R fit for this?
From what I've read, limma expects log2 expression data without normalization, but some tutorials I find also use normalized data.
Yes, you can use limma to analyse RUV-normalized microarray data if the number of samples in the dataset is large.
It isn't a matter of whether limma is fit for this sort of data or not. limma handles general linear models, so if the data isn't suitable for limma, then it won't be suitable for any other statistical tool either.
The reason why we recommend that data not be batch-corrected before limma analyses is that batch-corrected data is artificially homogeneous compared to the original data without batch effects, so there is a risk of underestimating the between-sample variability, leading to false discoveries. However, if the dataset is large and you use RUV, then this effect may be relatively minor.
If the dataset is not large, then it might be better to use RUV to generate surrogate variables and to include the surrogate variables in the limma design matrix.
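In case it helps later readers, a minimal sketch (not from this thread) of what "including the surrogate variables in the design matrix" can look like in limma. It assumes `expr` is the uncorrected log2 expression matrix, `group` is the tumour/non-tumour factor, and `W` is a matrix of unwanted-variation factors returned by the RUV step; all three object names are placeholders.

library(limma)

# Adjust for unwanted variation in the model instead of correcting the data matrix
design <- model.matrix(~ group + W)
fit <- lmFit(expr, design)
fit <- eBayes(fit)
topTable(fit, coef = 2)   # coefficient 2 is the tumour vs non-tumour effect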
Hi Gordon,
What is the difference between "batch-corrected data" and "surrogate variables" (generated by RUV)? Could you elaborate on the design with those surrogate variables, or point to an example, please?
Best, Samuel
Hello Gordon. Thank you for your answer. Yes, my dataset is large so I think I'm good, but I have to take into account those characteristics you mentioned.
My next step was to perform GSEA. However, I only get significant p-values and no significant adjusted p-values. Do you think it may be because of that artificial homogeneity you referred to?
I used the clusterProfiler package to perform the GSEA analysis as follows. Do you see anything wrong with it? I have a panel of 630 genes and I want to do the GSEA with the ImmuneSigDB gene collection.
> lfc_vector <- data %>%
+ # Extract a vector of log2 fold changes (column `LOG2Fc`) named by `genesymbol`
+ dplyr::pull(LOG2Fc, name = genesymbol)
> # Look at first entries of the log2 fold change vector
> head(lfc_vector)
CXCR4 COL3A1 TGFB2 PBK CD24 NFATC1
2.197851 1.820257 1.678288 1.287693 1.267196 1.133811
> # Look at the last entries of the log2 fold change vector
> tail(lfc_vector)
CXCL14 TRAF3 RORA S100B NEFL EOMES
-0.8077164 -0.8486950 -0.9762321 -1.1426779 -1.2737958 -2.1757762
> gsea_results <- GSEA(geneList = lfc_vector, # ordered ranked gene list
+ minGSSize = 25, # minimum gene set size
+ maxGSSize = 500, # maximum gene set size
+ pvalueCutoff = 0.05,
+ pAdjustMethod = "fdr", # correction for multiple hypothesis testing
+ TERM2GENE = dplyr::select(hs_immune_df,
+ gs_name,
+ gene_symbol))
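A note for later readers: the object `hs_immune_df` is not shown above. A typical (hypothetical) way to build it for the ImmuneSigDB (C7) collection is via msigdbr, and clusterProfiler's GSEA() also expects the ranked vector to be sorted in decreasing order (the head/tail output above suggests it already is):

library(msigdbr)

# ImmuneSigDB is the C7 collection; argument names differ slightly across msigdbr versions
hs_immune_df <- msigdbr(species = "Homo sapiens",
                        category = "C7",
                        subcategory = "IMMUNESIGDB")

lfc_vector <- sort(lfc_vector, decreasing = TRUE)  # GSEA() requires a decreasing ranked list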
No, that is not the reason. Homogeneity from batch correction tends to produce too many DE results rather than too few.
Hi Gordon, on what basis can we determine whether the dataset is large or not? I am working on a dataset of 515 samples and 5440 genes. I was wondering if the dataset is large enough not to be affected by the RUV correction, because I will use it for limma-voom analysis.
515 samples is certainly large enough.
I see. Do you recommend a certain method for getting norm.factors through the calcNormFactors function, especially if the downstream analysis includes limma-voom?
calcNormFactors(object, method = c("TMM", "TMMwsp", "RLE", "upperquartile", "none"))
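For context, a minimal sketch (with placeholder object names, not from this thread) of how those normalization factors would typically feed into a limma-voom analysis, assuming `counts` is a raw count matrix and `group` is the condition factor:

library(edgeR)
library(limma)

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge, method = "TMMwsp")   # or "TMM", "RLE", "upperquartile"

design <- model.matrix(~ group)
v <- voom(dge, design)             # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2)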
I tried the TMM method and it didn't perform well; the TMMwsp method performed better.