Hello,
I use GOseq quite often for RNAseq analyses, including the length bias correction. My question is how to use it properly for data without length bias, e.g., for microarrays.
Normally, with RNAseq data, I would use something like:
pwf <- nullp(gene.vector, "hg19", "geneSymbol", bias.data = lengthhg19) GO.wall <- goseq(pwf, "hg19", "geneSymbol", use_genes_without_cat = TRUE)
But what to change for microarray data? Would it be enough to add method = "Hypergeometric"
to the goseq
function? Or do I also have to feed the nullp
function with dummy values (let's say 100 for each gene)?
Thanks for your advice!
Ben
Do you mean proceed with:
Take a minute to read through the documentation from
?goseq
. Specifically I'm referring to this section:CAUTION: "Hypergeometric" should NEVER be used for producing results for biological interpretation. If there is genuinly no bias in power to detect DE in your experiment, the PWF will reflect this and the other methods will produce accuracte results.
"Hypergeometric" assumes there is no bias in power to detect differential expression at all and calculates the p-values using a standard hypergeometric distribution. Useful if you wish to test the effect of selection bias on your results.
So, the lack of bias in DE detection should be reflected in the plot of your
pwf
and you can use the defaultmethod
(Wallenius
) when callinggoseq
. That having been said, theyHyperGeometric
option will ensure that no information from yourpwf
is used at all.Lastly, it might be that there is no length bias from the microarray data, but there could very well be bias coming from another factor. You could imagine calculating your
pwf
from abias.data
vector that consists of GC content of the probes used for each expression readout. Not sure what that would look like, but might be interesting to see.Thanks Steve for the nice explanation. I guess it would be safest to just use the Hypergeometric method for not microarray data.
Did you mean to say "I guess it would be safest to just use the Hypergeometric method for microarray data"?
You had a "not" in there, so wasn't sure. In any event, I wasn't suggesting that it's "safer" to do so. If I were doing the analysis and given that I have a tool at hand (goseq) that lets me ask questions about bias in DE quite easily, I'd just try a few different
bias.data
vectors (eg. length, gc content) just to see if there is bias I didn't expect.Either way, people have been using HyperGeoemtric tests for GO enrichment with microarray data for a long time now, so you can just go that route if it suits your purpose.
Yes, the "not" slipped in there by accident. Sorry about that.
I am actually writing some additional functions to ease up (or automate) my GO analysis with goseq, and at the moment I don't have actual microarray studies, but just in case ;-)
Thanks again!
Wouldn't it be easiest to simply run goseq with a vector of dummy gene length, say
rep(1000, number.of.genes)
?Random dummy lengths would work, but using the same value (1000) for all genes would trow an error. I tried that.
True. Something like
bias <- jitter(rep(1000, length(genes))
should do the trick I guess, giving a completely flat fit.