Dear all,
I have a question on the correct way to use camera with a filtered gene expression matrix.
Let's assume that I have an n * m expression matrix (array or RNA-Seq, it doesn't matter) and that, after filtering out the features below a certain intensity/cpm, I am left with an n' * m matrix, where k = n - n' is the number of features removed. Presumably most of these filtered features will not be significantly associated with the phenotype.
Now, if I understand correctly, when I use 'ids2indices' to map the elements of a given gene set to the features of the expression matrix, the elements with no match will contain an NA and will be excluded from the rest of the analysis. This means that if I have a gene set where only 10% of the genes are present in the filtered expression matrix, the gene set that is actually tested will consist of that 10% only. In my (very possibly incorrect) understanding, this makes perfect sense if the non-matching features are genuinely not testable (for example, if the array does not contain probe sets mapping to them). However, in the case of filtered features I am a bit confused. In the example above, if that 10% of the genes in the gene set were associated with the phenotype, and the remaining 90% were removed by the filter, I would probably see a significant association of the gene set with the phenotype. If, instead, I kept in the gene set that 90% of genes that are not significantly associated with the phenotype, I would probably obtain a non-significant result. My questions therefore are:
1. Is my understanding correct?
2. If yes, what would be the best way to retain the information from the k weak, (mostly) non-significant features in the analysis?
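To make the scenario concrete, here is a minimal sketch in base R of the matching step as I understand it. The identifiers and the gene set are made up for illustration, and the `which(... %in% ...)` line is only my assumption about the kind of matching that 'ids2indices' performs, not a copy of its code.

```r
# Hypothetical identifiers: 10 features before filtering, 4 left after filtering.
all_ids      <- paste0("g", 1:10)
filtered_ids <- c("g1", "g2", "g7", "g9")

# Hypothetical gene set with 5 members, only one of which survives the filter.
gene_set <- c("g1", "g3", "g4", "g5", "g6")

# Row positions (in the filtered matrix) of the gene set members that are present.
# Members removed by the filter contribute no index at all.
idx <- which(filtered_ids %in% gene_set)

# Fraction of the original gene set that would actually be tested.
mapped_fraction <- length(idx) / length(gene_set)

# Number of set members lost to the filter rather than absent from the platform.
lost_to_filter <- sum(gene_set %in% setdiff(all_ids, filtered_ids))
```

Here only one of the five members is tested (`mapped_fraction` is 0.2), and the other four were on the platform but removed by the filter, which is exactly the situation I am unsure about.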
Apologies for the somewhat lengthy question, and many thanks in advance.
Dear Gordon, thank you so much for your kind and clear answer. I was asking because I recently had a gene set that was, by design, significantly associated with a data set (it consists of genes found to be up-regulated in an independent, identical experiment under the same conditions). Although it ranked near the top of the camera output, it was ranked below another gene set in which fewer than 20% of the genes were mappable to the expression matrix and which was, from a biological point of view, quite unlikely to be associated with the phenotype. In such cases, would you recommend filtering out the gene sets below a certain fraction of mappable ids, or rather keeping all the gene sets and interpreting the results a posteriori?
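For the first option, this is roughly what I have in mind: a minimal base R sketch with made-up gene sets, where the 0.5 cutoff (`min_fraction`) is an arbitrary value chosen for illustration and not a recommendation.

```r
# Hypothetical gene sets and hypothetical row names of the filtered matrix.
gene_sets <- list(
  setA = c("g1", "g2", "g3", "g4"),
  setB = c("g1", "g5", "g6", "g7", "g8", "g9")
)
filtered_ids <- c("g1", "g2", "g3", "g10")

# Fraction of each gene set's (unique) members found in the filtered matrix.
mapped_fraction <- vapply(
  gene_sets,
  function(s) mean(unique(s) %in% filtered_ids),
  numeric(1)
)

# Keep only the gene sets whose mapped fraction reaches the (arbitrary) cutoff.
min_fraction <- 0.5
kept_sets <- gene_sets[mapped_fraction >= min_fraction]
```

The retained list (`kept_sets`) would then be passed to 'ids2indices' and camera as usual, while `mapped_fraction` could alternatively be kept alongside the results to help with the a posteriori interpretation.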
Many thanks again for your answer and for your code!