Dear all,
I am using the R package VSN for normalising proteomics data (abundances) that then goes in limma for contrast analysis. I know that VSN was not designed to work with proteomics data but actually vsn2 is doing a great job and clearly stabilised the variance, see below
The issue I have, if that the results are slightly different depending on which platform I am running the R code.
Example MacOS
logFC AveExpr t P.Value adj.P.Val B
1 1.5389102 12.82511 7.730600 2.590139e-06 0.002297454 4.74513335
2 1.7838522 17.67135 5.358662 1.149218e-04 0.047924259 1.44324118
3 1.0633257 13.62943 5.164849 1.620888e-04 0.047924259 1.13231059
4 1.5435723 12.16677 4.682596 3.896566e-04 0.086406353 0.33353763
5 -1.0275168 12.40099 -4.508445 5.386750e-04 0.095560937 0.03691621
Normalized abundances
control_24_3 control_24_4 control_24_5 control_24_6 control_24_7
[1,] 13.07592 12.64651 12.24371 12.45800 12.00618
[2,] 13.44991 14.09624 14.22835 13.85055 14.05933
[3,] 16.24021 16.82461 16.19956 16.47579 17.68974
[4,] 10.97001 10.35783 11.89993 10.56033 11.50507
[5,] 14.25346 14.69884 14.11004 14.68608 13.98822
Debian 7
logFC AveExpr t P.Value adj.P.Val B
1 1.5389434 12.82507 7.730738 2.589577e-06 0.002296955 4.74535815
2 1.7838631 17.67135 5.358692 1.149147e-04 0.047919273 1.44330565
3 1.0633463 13.62941 5.164902 1.620719e-04 0.047919273 1.13241108
4 1.5435961 12.16673 4.682718 3.895658e-04 0.086386209 0.33375116
5 -1.0275537 12.40094 -4.508525 5.385908e-04 0.095546006 0.03705773
the normalised abundances look slightly different
control_24_3 control_24_4 control_24_5 control_24_6 control_24_7 [1,] 13.07588 12.64647 12.24365 12.45795 12.00612 [2,] 13.44988 14.09622 14.22836 13.85052 14.05931 [3,] 16.24018 16.82459 16.19959 16.47577 17.68974 [4,] 10.96996 10.35768 11.89984 10.56020 11.50499 [5,] 14.25343 14.69882 14.11004 14.68606 13.98820
When the normalisation is disabled the results are identical on both GNU/Linux and MacOS. That’s why I am suspecting a rounding issue since the abundances are integers when they go in limma. However, the t-tests, logFC and avgExp in limma are completely wrong as the data have a huge variance.
I asked this question to Wolfgang Huber who advice me to check the point FAQ 7.31 (http://cran.r-project.org/doc/FAQ/R-FAQ.html ) and pointe out that VSN is using an iterative algorithm for a nonlinear optimisation.
For info, under MacOS:
sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-apple-darwin14.3.0 (64-bit) Running under: OS X 10.10.3 (Yosemite) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] qvalue_2.0.0 limma_3.24.10 vsn_3.36.0 Biobase_2.28.0 BiocGenerics_0.14.0 argparse_1.0.1 [7] proto_0.3-10 magrittr_1.5 dplyr_0.4.2 tidyr_0.2.0 reshape2_1.4.1 BiocInstaller_1.18.3 loaded via a namespace (and not attached): [1] Rcpp_0.11.6 plyr_1.8.3 tools_3.2.1 zlibbioc_1.14.0 digest_0.6.8 [6] preprocessCore_1.30.0 gtable_0.1.2 lattice_0.20-31 DBI_0.3.1 findpython_1.0.1 [11] stringr_1.0.0 grid_3.2.1 getopt_1.20.0 R6_2.0.1 ggplot2_1.0.1 [16] scales_0.2.5 MASS_7.3-40 splines_3.2.1 assertthat_0.1 colorspace_1.2-6 [21] stringi_0.5-2 affy_1.46.1 munsell_0.4.2 rjson_0.2.15 affyio_1.36.0
Debian, packages and R are of same versions
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
other attached packages:
[1] qvalue_2.0.0 limma_3.24.10 vsn_3.36.0
[4] Biobase_2.28.0 BiocGenerics_0.14.0 argparse_1.0.1
[7] proto_0.3-10 magrittr_1.5 dplyr_0.4.2
[10] tidyr_0.2.0 reshape2_1.4.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 BiocInstaller_1.18.3 plyr_1.8.3
[4] tools_3.2.1 zlibbioc_1.14.0 digest_0.6.8
[7] preprocessCore_1.30.0 gtable_0.1.2 lattice_0.20-31
[10] DBI_0.3.1 findpython_1.0.1 stringr_1.0.0
[13] grid_3.2.1 getopt_1.20.0 R6_2.0.1
[16] ggplot2_1.0.1 scales_0.2.5 MASS_7.3-41
[19] splines_3.2.1 assertthat_0.1 colorspace_1.2-6
[22] stringi_0.4-1 affy_1.46.1 munsell_0.4.2
[25] rjson_0.2.15 affyio_1.36.0
If you have experienced this, and have some comments / suggestions, I will be happy to update, correct my script.
Many thanks in advance,
Aurelien
Dear Wolfgang,
thanks again for your help. In the tables I first provided, both were processed with vsn2, the numbers are indeed very close, but at the scale of the whole dataset, when we applied a cut-off on the q-values we did not retain the same number of proteins depending on the platform.
When we don't normalized, the limma output is just weird to me. For example
A test case code, together with a dataset sample are available at github here https://github.com/ginolhac/vsn_proteomics for anyone who wants to try it out.
Best wishes,
Aurelien