Question

How to use GOseq without length bias

0

Entering edit mode

b.nota ▴ 370

@bnota-7379

Last seen 4.4 years ago

Netherlands

Hello,

I use GOseq quite often for RNAseq analyses, including the length bias correction. My question is how to use it properly for data without length bias, e.g., for microarrays.

Normally, with RNAseq data, I would use something like:

pwf <- nullp(gene.vector, "hg19", "geneSymbol", bias.data = lengthhg19)

GO.wall <- goseq(pwf, "hg19", "geneSymbol", use_genes_without_cat = TRUE)

But what to change for microarray data? Would it be enough to add method = "Hypergeometric" to the goseq function? Or do I also have to feed the nullp function with dummy values (let's say 100 for each gene)?

Thanks for your advice!

Ben

goseq • 2.9k views

ADD COMMENT • link updated 9.1 years ago by Steve Lianoglou ★ 13k • written 9.1 years ago by b.nota ▴ 370

score 1 · Answer 1 · 2016-01-21

1

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 23 months ago

United States

One option would be to still use the real gene lengths and pass those into the nullp function. This would presumably create a "flat line" plot (i.e. no bias) and you can proceed from there as normal.

ADD COMMENT • link 9.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Do you mean proceed with:

GO.wall <- goseq(pwf, "hg19", "geneSymbol", use_genes_without_cat = TRUE, method = "Hypergeometric")

ADD REPLY • link 9.1 years ago b.nota ▴ 370

2

Entering edit mode

Take a minute to read through the documentation from ?goseq. Specifically I'm referring to this section:

CAUTION: "Hypergeometric" should NEVER be used for producing results for biological interpretation. If there is genuinly no bias in power to detect DE in your experiment, the PWF will reflect this and the other methods will produce accuracte results.

"Hypergeometric" assumes there is no bias in power to detect differential expression at all and calculates the p-values using a standard hypergeometric distribution. Useful if you wish to test the effect of selection bias on your results.

So, the lack of bias in DE detection should be reflected in the plot of your pwf and you can use the default method (Wallenius) when calling goseq. That having been said, they HyperGeometric option will ensure that no information from your pwf is used at all.

Lastly, it might be that there is no length bias from the microarray data, but there could very well be bias coming from another factor. You could imagine calculating your pwf from a bias.data vector that consists of GC content of the probes used for each expression readout. Not sure what that would look like, but might be interesting to see.

ADD REPLY • link 9.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks Steve for the nice explanation. I guess it would be safest to just use the Hypergeometric method for not microarray data.

ADD REPLY • link 9.1 years ago b.nota ▴ 370

0

Entering edit mode

Did you mean to say "I guess it would be safest to just use the Hypergeometric method for microarray data"?

You had a "not" in there, so wasn't sure. In any event, I wasn't suggesting that it's "safer" to do so. If I were doing the analysis and given that I have a tool at hand (goseq) that lets me ask questions about bias in DE quite easily, I'd just try a few different bias.data vectors (eg. length, gc content) just to see if there is bias I didn't expect.

Either way, people have been using HyperGeoemtric tests for GO enrichment with microarray data for a long time now, so you can just go that route if it suits your purpose.

ADD REPLY • link 9.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Yes, the "not" slipped in there by accident. Sorry about that.

I am actually writing some additional functions to ease up (or automate) my GO analysis with goseq, and at the moment I don't have actual microarray studies, but just in case ;-)

Thanks again!

ADD REPLY • link 9.1 years ago b.nota ▴ 370

0

Entering edit mode

Wouldn't it be easiest to simply run goseq with a vector of dummy gene length, say rep(1000, number.of.genes)?

ADD REPLY • link 4.4 years ago ATpoint ★ 4.6k

0

Entering edit mode

Random dummy lengths would work, but using the same value (1000) for all genes would trow an error. I tried that.

ADD REPLY • link 4.4 years ago b.nota ▴ 370

0

Entering edit mode

True. Something like bias <- jitter(rep(1000, length(genes)) should do the trick I guess, giving a completely flat fit.

ADD REPLY • link 4.4 years ago ATpoint ★ 4.6k