Question

comparing two GOstats result lists

0

Entering edit mode

sven.schenk ▴ 30

@svenschenk-15837

Last seen 5.9 years ago

Vienna

Hi,

I am trying to compare two result lists from GOstats. The reason for doing this is that I did a proteomics experiment in which we established a new sample preparation protocol and I now want to kind of show that with this new method we do not enrich/deplete for certain proteins/protein functions. The obvious way, comparing the protein lists and check the overlap isn`t very telling as there is redundancy in the proteome, i.e. protein X does not neccessarily have the same ID in both sets, and further, both sets contain different amount of entries. So I thought doing a GO enrichment analysis to check whether certain catergories are enriched in one list versus the other would be appropriate.

So now I have the lists of the two sets and I wanted to do a simple statistical test to compare them GO term by GO term. As this is a count based list I wanted to use a GTest (or Chi square) . But as the input lists have different sizes, I would naturally expect less Counts per GO term in the smaller data set (GO.list.1) compared to the larger list (GO.list.2), meaning that the simple GTest:

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"

>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))))$pvalue)

will bias my result as it assumes equal proportions.

As the GOstats result file give you all kind of information (based on the input data) I thought I`d use these informations to calculate my expected proportions. My first thought now was to normalise the counts per GO term on the relative proportions of the ExpCounts given in the GOstats result file, i.e. columns 4 and 13 of the merged result file.

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"

>p1<-GO.list$ExpCount.x/(GO.list$ExpCount.y+GO.list$ExpCount.x)
>p2<-GO.list$ExpCount.y/(GO.list$ExpCount.y+GO.list$ExpCount.x)

>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))), p=c(p1,p2))$pvalue)

But I don`t think this is the correct way to do it either, because if I run GOstats on a merged list (i.e. merging my two protein in put lists) then the ExpCounts are (somewhat expectedly) not the sum of the two input lists.

So I was wondering if I could use the Count columns as a proxy for calculating the expected proportions:

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"

>p1<-((sum(GO.list$Count.x)/sum(GO.list$Count.y))*0.5)/(((sum(GO.list$Count.x)/sum(GO.list$Count.y))*0.5)+0.5) 
>p2<-((sum(GO.list$Count.y)/sum(GO.list$Count.x))*0.5)/(((sum(GO.list$Count.y)/sum(GO.list$Count.x))*0.5)+0.5)

#in my case: p1: 0.4176183
#p2: 0.5823817

>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))), p=c(p1,p2))$pvalue)

Does anybody have any suggestion if this is any way correct, or should I do something completely different?

Cheers

Sven

gostats statistics • 1.4k views

ADD COMMENT • link updated 6.7 years ago by James W. MacDonald 68k • written 6.7 years ago by sven.schenk ▴ 30

score 0 · Answer 1 · 2018-06-07

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

I wouldn't do it that way. In one set of data you have some evidence that a set of proteins that make up a group are being perturbed, and you want to test for evidence that the same group of proteins is (not) being affected in a different experiment. The gene set testing framework is the canonical way of doing that, and it has the added benefit that you don't necessarily need to measure the exact same things in each experiment (although you would like a pretty good overlap). The limma package has a bunch of different tests, but for your purposes I would imagine that roast would be applicable.

ADD COMMENT • link 6.7 years ago James W. MacDonald 68k

0

Entering edit mode

Hi thanks for your answer.

Yes, I think roast would do exactly what I'd want, but for that I'd have to have replicates which I don't have. I have only the Counts from one enrichment analysis the Counts from the second one which I want to compare to each other.

I then found that I could use geneSetTest :

>pvalue<-geneSetTest(as.vector(unlist(GO.list[,c(5)])),as.vector(unlist(GO.list[,c(14)])), ranks.only=FALSE)
>pvalue
[1] 1e-04

So now I get a pvalue, but I hoestly don`t know what it means, in the documentation it reads:

geneSetTest performs a competitive test in the sense that genes in the test set are compared to other genes (Goeman and Buhlmann, 2007). If the statistic is a genewise test statistic for differential expression, then geneSetTest tests whether genes in the set are more differentially expressed than genes not in the set.

But it doesn`t tell me what the null hypothesis is. Is the null that the genes are more differently, or is it they are not differently?

ADD REPLY • link 6.7 years ago sven.schenk ▴ 30