Question

how plotSA shall look like for good or bad data, any example of good or bad data shall look like in plotSA regarding whether the data is OK for limma analysis or not

2

Entering edit mode

Ming ▴ 380

@ming-yi-6363

Last seen 2.6 years ago

United States

Hi, there:

I have a question about how plotSA shall look like for good or bad data, any example of good or bad data shall look like in plotSA regarding whether the data is OK for limma analysis or not?

I indeed looked through web or the user guide of limma or the limma method paper, I could not find example plots of good or bad plotSA plots shall look like. So hereby I wonder if one can offer examples on what kind of plotSA plots shall look like or what criteria to determine regarding to whether the data is OK for limma analysis or not. Since if plotSA plot can be a diagnosis tool to determine if limma can be applied to the data as discussed initially and partially at URL: old discussion, it is better to know how and what is OK and what is not OK look like. I attached example of my plotSA plot here with also plotMD plots for reference here and code is below. The plotSA plot for my data look kind of bent and skewed, although plotMD look great.

enter image description here

Thanks in advance! Best

Ming

Code should be placed in three backticks as shown below


> show(mydata[1:5,1:3]);
                 COLON.Mut.LS513 COLON.Mut.CL11 COLON.Mut.LS1034
A1BG..1.             0.009157821   -0.001640884      -0.05054882
A1CF..29974.        -0.021848711    0.036079619       0.05448993
A2M..2.              0.099182798    0.105247983       0.04387373
A2ML1..144568.       0.107724902    0.043404811       0.02711932
A3GALT2..127550.    -0.045207710   -0.146916234      -0.37110402
> show(tar[1:5,c(8,7,3)])
                         TypesLines RAS Types
COLON.Mut.LS513     COLON.Mut.LS513 Mut COLON
COLON.Mut.CL11       COLON.Mut.CL11 Mut COLON
COLON.Mut.LS1034   COLON.Mut.LS1034 Mut COLON
COLON.Mut.SNU1033 COLON.Mut.SNU1033 Mut COLON
COLON.Mut.COLO678 COLON.Mut.COLO678 Mut COLON


eset <- ExpressionSet(assayData = as.matrix(mydata))

group1<-paste(tar$Types,tar$RAS,sep=".");
group<-factor(group1,levels=c("COLON.Mut","COLON.WT","LUNG.Mut","LUNG.WT"));
design <- model.matrix(~0+group)
colnames(design) <- levels(group)

fit <- lmFit(eset, design)

> show(fit)
An object of class "MArrayLM"
$coefficients
                    COLON.Mut    COLON.WT    LUNG.Mut     LUNG.WT
A1BG..1.         -0.006184935 -0.01589437 -0.01650645 -0.03327325
A1CF..29974.     -0.036653106 -0.05470048 -0.03705100 -0.05318098
A2M..2.           0.059910009  0.05123221  0.08906358  0.07615483
A2ML1..144568.    0.062222203  0.12934582  0.10457048  0.12544309
A3GALT2..127550. -0.142014114 -0.12313766 -0.15668482 -0.17035788
17640 more rows ...
.......

expName<-paste("Test_PlotSA_LungColon_V2", "",sep="");;
pngF<-paste(outPath,paste(expName,".png",sep=""),sep="/");
png(file= pngF, width = 1000, height = 400);
par(mfrow=c(1, 3))

cont.matrix<-makeContrasts(COLON.Mut_WT=COLON.Mut-COLON.WT, LUNG.Mut_WT=LUNG.Mut-LUNG.WT,levels=design);

fit2 <- contrasts.fit(fit, cont.matrix)
fit3 <- eBayes(fit2)

> show(fit3)
An object of class "MArrayLM"
$coefficients
                  Contrasts
                   COLON.Mut_WT LUNG.Mut_WT
  A1BG..1.          0.009709434  0.01676680
  A1CF..29974.      0.018047372  0.01612999
  A2M..2.           0.008677799  0.01290875
  A2ML1..144568.   -0.067123620 -0.02087261
  A3GALT2..127550. -0.018876454  0.01367307
17640 more rows ...
.......

plotMD(fit3, coef=1,main=paste0("Mean-Difference Plot:", colnames(fit3$coefficients)[1]));
abline(0,0,col="blue")
plotMD(fit3, coef=2,main=paste0("Mean-Difference Plot:", colnames(fit3$coefficients)[2]));
abline(0,0,col="blue")
plotSA(fit3, main="Residual standard deviation vs. average log expression")

dev.off()

I atttached the image of the created plots here for your view

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] calibrate_1.7.7     MASS_7.3-54         Biobase_2.50.0
[4] BiocGenerics_0.38.0 limma_3.46.0        reshape_0.8.8
[7] reshape2_1.4.4      ggrepel_0.9.1       ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.6.2     compiler_4.0.2   plyr_1.8.6
 [5] tools_4.0.2      digest_0.6.27    lifecycle_1.0.0  tibble_3.1.3
 [9] gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.11     DBI_1.1.1
[13] withr_2.4.2      dplyr_1.0.7      stringr_1.4.0    generics_0.1.0
[17] vctrs_0.3.8      grid_4.0.2       tidyselect_1.1.1 glue_1.4.2
[21] R6_2.5.0         fansi_0.5.0      farver_2.1.0     purrr_0.3.4
[25] magrittr_2.0.1   scales_1.1.1     ellipsis_0.3.2   assertthat_0.2.1
[29] colorspace_2.0-2 labeling_0.4.2   utf8_1.2.2       stringi_1.7.3
[33] munsell_0.5.0    crayon_1.4.1

plotSA limma • 3.6k views

ADD COMMENT • link 3.1 years ago Ming ▴ 380

score 0 · Answer 1 · 2022-03-04

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

I already answered your question 4 months ago (Is there a need or general principle/rule to assess whether the data types/distribution etc is suitable for limma R package to derive differential features between interested contrasting groups?). I already told you that that I see no evidence that limma would not work on your data but your data looks to be poor quality and you could improve the analysis if you stabilized the variance better. You don't seem to have followed my hints but have instead repeated the same question both on this forum and in a series of emails sent to me. This is the 5th time now that you have asked me the same question about the same plots.

You have been demanding a simple test of whether data is OK for a limma analysis. In my opinion, your campaign for a universal test that can be applied for any type of data, any technology platform and any preprocessing practice is unrealistic and also misguided. I think your questions are based on some mistaken premises.

You ask whether the data is suitable for a limma analysis, but suitable compared to what alternative? If you decide the data is not suitable, what will you do then? Throw the data out? That's not very constructive. Apply a t-test or a Wilcoxon test? That will just make problems worse. I am not aware of generally applicable alternatives that will make fewer assumptions than limma.
limma can be applied to data from most genomic technologies provided the data is quantitative and is summarized on a scale for which linear modeling is reasonable. The challenge with each new technology is to preprocess and normalize the data to make the linear model assumptions as good as possible. There is no cutoff between good and bad, just a continuum of better or worse preprocessing practices. Figuring out the best preprocessing for each new technology is one of the key concerns of statistical bioinformatics research and is the topic of many published journal articles.
You give no information about the genomic technology that you are using, but almost all the data values appear to be close to zero with a few negative values. The data seem to have an upper bound of some sort, so it is not like log-expression. You say that the data has been scaled to be between -2 and 1, which seems strange. Rather than trying to shift the onus to the DE analysis tool, you should be trouble-shooting the data quality and improving the data preprocessing.
I did not suggest plotSA as a universal test but rather as a way to examine whether you had stabilized the variance properly. Frankly, it shows the shortcomings of your data pretty clearly. I would never put an example like this in the limma User's Guide because I would never process data like this.
Have you looked at the SA plot closely? Your data shows a distinct mean-variance trend so the curve fitted by limma looks ok, or at least as good as can be done. The uptick of the trend-line on the far right of the plot is a region where there almost no genes, so it will have little or no influence on your analysis. limma is doing as well as can be done. What more do you expect??

ADD COMMENT • link 3.1 years ago Gordon Smyth 52k

0

Entering edit mode

Dr, Smyth:

Thanks so much for taking time respond to my questions. I apologize about being persistent and even a bit stubborn to keep asking seemingly a same question. One reason is: probably in contrast to what you might have thought, indeed, I was so impressed and even a bit overwhelmed by how well limma has performed in my datasets as disucssed to reflect what exatcly I thought in terms of making great biological sense of the data even more than expected, to the extent almost too good to be true. Of course, as limma paper stated, I understood limma borrows information across genes and leverages information from the within-group replicates to obtain much higher significance levels than the native t-test etc, which is why we have limma method developed by you as one of the greatest methods I ever experienced. I even did lots of simulations to compare randomly selected contrasts with million trials of limma analyses to validate my observations, which all come great. This left me just one big question as I kept asking you: how to evaluate and what kind of data limma method can be applied for? and is limma applicable to my dataset given the plotSA I got with bent and skewed curve, which was asked by my colleagues. I do not know the prerequisite of limma and did not see examples of plotSA in user guide, which is why I kept asking. Now thank you for being critical on our data quality, but it is cell assay-based high throughput data, which has been processed in very complicated procedures for experimental procedure-based reasons, and for other purposes as well as proprietary reasons, and is not under my control. I just happened to think about using limma to deal with the processed data for the questions that I am asking about. Another reason is from this experience and past ones, I trully appreciate the power of limma method and its wide potential applications in many types of high-thoourghput datasets, and that is why I wish to have a way to judge on my own to make sure limma can be applicable to many potential datasets not just common genomic array or q-PCR data etc. Thank you for your last point, which seems giving new cues in terms of the SA plot and sounds encouraging. Your 4th point mentioned a possible way of improving the data by stabilizing the variance properly, which we have touched a bit in earlier discussion. Do you mind suggesting an actual way or R package/function that I can try that can optimize the data as input for limma? Again, I am very grateful for all your kind help and certainly will cite limma and acknowldege your help if my this work can be ever published later. Thank you!

ADD REPLY • link 3.1 years ago Ming ▴ 380

0

Entering edit mode

Can I suggest a way to optimize your data? No I can't. Data preprocessing is technology specific. No one can tell you how to preprocess your data because we don't have examples of data from your assay technology. We don't even know what your technology is. Please read the last sentence of my point 2 above. You have already explained that the data is proprietary and not under you control so it is very difficult indeed to understand what you are expecting me to suggest.

ADD REPLY • link 3.1 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the input! What I am asking is not to optimize the whole data preprocessing, since it has been done and it is techonology dependent, and no need and would be more issues to do anything on that side. I am just wondering about any potential way of stabilizing the variance of processed data properly as observed in plotSA and as you suggested.

ADD REPLY • link 3.1 years ago Ming ▴ 380

0

Entering edit mode

I have no idea how to improve your data. I've never seen anything similar to it. If you want info about standard variance stabiliizing transformations, do a Google search (as I suggested 4 months ago). But I have no expectation that any of the standard transformations will be appropriate for your data. The need for variance stabilization generally needs to be incorporated at early stages of the data processing steps. It is not always possible to fix the data by tacking on a transformation at the end.

It seems from your comments that you already have perfectly satisfactory results and that the limma package has performed correctly. If you still need more assistance, the logical step would be to collaborate with a statistical bioinformatician at your own institution.