It is not usually statistically appropriate to use Wilcoxon tests to compare read counts in a SummarizedExperiment object. James has mentioned that the Wilcoxon test loses statistical power. While that is true, there are even more fundamental objections to using the Wilcoxon test. The Wilcoxon test assumes that all the counts are independent and identically distributed under the null hypothesis, which is very unlikely to be true for sequencing data where the sequencing depth typically varies from the one library to another.
Bioconductor provides a number of very sophisticated and effective statistical procedures for comparing read counts between groups. You don't explain what sort of integer counts you have (RNA-seq? GWAS? ChIP-seq?) but, whatever they are, it would probably be more productive for you to learn about what the Bioconductor packages can provide in terms of statistical analysis of your data, rather than trying to make up a simple statistical analysis for yourself. Even if you are a top statistics expert in your own right, it would still be best to find out what Bioconductor provides first, before trying to figure out whether you can improve it yourself.
Although I am very sceptical that this is a good idea for your data, performing Wilcoxon tests in R is very easy. The rankSumTestWithCorrelation
function in the limma package is pretty fast. If y
is a matrix:
library(limma)
ngenes <- 1000
nsamples <- 10
y <- matrix(rpois(ngenes*nsamples,lambda=5), ngenes, nsamples)
and your samples are in two groups, with columns j
being the first group, for example
j <- 1:5
if the first 5 columns are group 1. Then row-wise Wilcoxon tests can be done by:
PValues <- matrix(1, ngenes, 2)
for (i in 1:nrow(y)) PValues[i,] <- rankSumTestWithCorrelation(j, y[i,])
This takes about a second on my laptop for 10,000 genes and 10 samples.
Thanks for your reply.
I have some fungi DNA SNPs data. I created a SummarizedExperiment object from it. The assay is used to count non-synonymous SNPs per Gene. I want to find whether there are count differences per Gene between Group A samples and group V samples. From my understanding (I may be wrong), my case is different from RNA-seq / GWAS / ChIP-seq.
As a beginner, I am not familiar with sophisticated statistics models and most Bioconductor packages. I just want to try many ways to do some preliminary data analysis. Wilcoxon test is one of them. I think there should be a simple way in Bioconductor to use Wilcoxon test to do it. I tried to figure out it myself. But it’s not easy for me now.
Is it somehow correct to use Wilcoxon test for my case? If it is correct, how to do it in a simple way in Bioconductor?
Thanks!
I don't know enough about your experiment to advise whether the Wilcoxon test is valid for your data, although I'm pretty sure it is not terribly good. Anyway, I have edited my answer to tell you how to do Wilcoxon tests.
Thank you very much! Your code will be helpful to me.
Hi Gordon Smyth,
Thank you for the answer and R code on how to perform a Wilcoxon test. I am wondering how to interpret the left and right tail p-values from the output generated with
rankSumTestWithCorrelation
. Is this the probability that say, the gene MMP14 is greater in the "mutant" group compared the "wild-type" group (if looking at the right tail p-value), and if so, it is statistically significant assuming alpha = 0.05? I am unsure how to interpret results, relative to the base rwilcox.test
. Any clarification you may be able to give is greatly appreciated.Best, Jennifer