Question

Goseq with small numbers of genes: minimum number?

matt.arno • 0

@mattarno-8491

Last seen 9.3 years ago

United Kingdom

Hi - I have some relatively small gene lists (around 10-20 significant genes, padj < 0.05) and tried goseq to look for over-represented GO terms and KEGG pathways. I also ran the 'sampling' method as a negative control, but this gave very similar results to the real test (similar p-values and terms):

> head(GO.samp.MF.LowvNon) # this is the sampling method control 
       category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
1769 GO:0016362             0.001998002                        1          1        2
2627 GO:0034711             0.001998002                        1          1        3
1377 GO:0008466             0.003996004                        1          1        1
2009 GO:0017002             0.003996004                        1          1        7
2762 GO:0038023             0.003996004                        1          4      579
3172 GO:0048185             0.005994006                        1          1       11
                                        term ontology
1769      activin receptor activity, type II       MF
2627                         inhibin binding       MF
1377 glycogenin glucosyltransferase activity       MF
2009     activin-activated receptor activity       MF
2762             signaling receptor activity       MF
3172                         activin binding       MF

> head(GO.MF.LowvNon) # this is the real test
       category over_represented_pvalue under_represented_pvalue numDEInCat numInCat
1377 GO:0008466             0.001197658                1.0000000          1        1
1769 GO:0016362             0.002340668                0.9999987          1        2
2627 GO:0034711             0.003514714                0.9999962          1        3
18   GO:0000155             0.003516708                0.9999962          1        3
2762 GO:0038023             0.003856110                0.9996154          4      579
730  GO:0004673             0.004728336                0.9999922          1        4
                                        term ontology
1377 glycogenin glucosyltransferase activity       MF
1769      activin receptor activity, type II       MF
2627                         inhibin binding       MF
18       phosphorelay sensor kinase activity       MF
2762             signaling receptor activity       MF
730        protein histidine kinase activity       MF

My question is this: is this likely to be due to putting too few genes into the analysis?

I think my code is OK, as I've done this before with larger lists and got some good p-values for the real test, while the sampling p-values were close to 1.
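
To make the question concrete, here is a rough base-R sketch of what a short gene list does to very small categories. It uses the plain hypergeometric test (so no length-bias correction, unlike goseq's default), and the counts `n_total` and `n_de` are made-up assumptions, not my actual data:

```r
# Hypothetical sizes (assumptions, not taken from the real experiment)
n_total <- 15000          # genes assayed
n_de    <- 15             # significant genes
cat_sizes <- c(1, 2, 3, 4)  # number of genes annotated to each tiny category

# P(at least one DE gene falls in the category) under a plain
# hypergeometric null: m = genes in category, n = genes outside it,
# k = number of DE genes drawn.
p_over <- phyper(0, m = cat_sizes, n = n_total - cat_sizes, k = n_de,
                 lower.tail = FALSE)

data.frame(numInCat = cat_sizes, over_represented_pvalue = p_over)
```

Under these assumptions a single hit in a one-gene category already gives a p-value of n_de / n_total, so any category with 1-4 genes looks "significant" before multiple-testing correction as soon as one of its genes is in the list.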

Cheers for any insight.

matt

goseq • 1.6k views

ADD COMMENT • link 9.3 years ago matt.arno • 0

...I think I've got the wrong end of the stick with this: method="Sampling" just means that the null distribution is generated by random sampling instead of the Wallenius approximation. For some reason I thought it was a background analysis or negative control to compare the real thing to...
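
For anyone else who lands here, a minimal sketch of the two calls side by side. The helper name `compare_goseq_methods`, the input vector `de`, the default genome/ID values and the `repcnt` value are all assumptions for illustration; `de` should be a named 0/1 vector over all assayed genes, and the goseq package plus its annotation data need to be installed for the function to actually run:

```r
# Hypothetical helper: runs the same test with two ways of building the null.
# `de` is assumed to be a named integer vector (1 = significant, 0 = not),
# with gene IDs as names; genome and id must match your own annotation.
compare_goseq_methods <- function(de, genome = "hg19", id = "ensGene",
                                  repcnt = 1000) {
  # Fit the length-bias weighting once and reuse it for both tests
  pwf <- goseq::nullp(de, genome, id, plot.fit = FALSE)

  # Default: Wallenius approximation to the null distribution
  wall <- goseq::goseq(pwf, genome, id, test.cats = "GO:MF",
                       method = "Wallenius")

  # Same test, but the null is estimated by random resampling
  samp <- goseq::goseq(pwf, genome, id, test.cats = "GO:MF",
                       method = "Sampling", repcnt = repcnt)

  list(wallenius = wall, sampling = samp)
}
```

Both calls test the same hypothesis on the same gene list, so similar terms and similar p-values are what you would expect. As an aside, the sampling p-values in my output above are multiples of 1/1001 (0.001998002 = 2/1001), which looks consistent with an estimate built from about 1000 resamples.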

...it must be getting late...

matt

ADD REPLY • link 9.3 years ago matt.arno • 0