comparing two GOstats result lists
1
0
Entering edit mode
sven.schenk ▴ 30
@svenschenk-15837
Last seen 5.7 years ago
Vienna

Hi,

I am trying to compare two result lists from GOstats. The reason for doing this is that I did a proteomics experiment in which we established a new sample preparation protocol and I now want to kind of show that with this new method we do not enrich/deplete for certain proteins/protein functions. The obvious way, comparing the protein lists and check the overlap isn`t very telling as there is redundancy in the proteome, i.e. protein X does not neccessarily have the same ID in both sets, and further, both sets contain different amount of entries. So I thought doing a GO enrichment analysis to check whether certain catergories are enriched in one list versus the other would be appropriate.

So now I have the lists of the two sets and I wanted to do a simple statistical test to compare them GO term by GO term. As this is a count based list I wanted to use a GTest (or Chi square) . But as the input lists have different sizes, I would naturally expect less Counts per GO term in the smaller data set (GO.list.1) compared to the larger list (GO.list.2), meaning that the simple GTest:

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"   
>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))))$pvalue)

will bias my result as it assumes equal proportions.

As the GOstats result file give you all kind of information (based on the input data) I thought I`d use these informations to calculate my expected proportions. My first thought now was to normalise the counts per GO term on the relative proportions of the ExpCounts given in the GOstats result file, i.e. columns 4  and 13 of the merged result file.

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"   
>p1<-GO.list$ExpCount.x/(GO.list$ExpCount.y+GO.list$ExpCount.x)
>p2<-GO.list$ExpCount.y/(GO.list$ExpCount.y+GO.list$ExpCount.x)

>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))), p=c(p1,p2))$pvalue)

But I don`t think this is the correct way to do it either, because if I run GOstats on a merged list (i.e. merging my two protein in put lists) then the ExpCounts are (somewhat expectedly) not the sum of the two input lists.

So I was wondering if I could use the Count columns as a proxy for calculating the expected proportions:

>GO.list<-merge(GO.list.1, GO.list.2, by=2, all=F)
>colnames(GO.list)
 [1] "GOMFID"      "Pvalue.x"    "OddsRatio.x" "ExpCount.x"  "Count.x"    "Size.x"
 [7] "Term.x"     "test.x"      "ontology.x"  "subset.x"  "Pvalue.y"    "OddsRatio.y"  
[13] "ExpCount.y"  "Count.y"     "Size.y"   "Term.y"      "test.y"      "ontology.y"  
[19]  "subset.y"   
>p1<-((sum(GO.list$Count.x)/sum(GO.list$Count.y))*0.5)/(((sum(GO.list$Count.x)/sum(GO.list$Count.y))*0.5)+0.5) 
>p2<-((sum(GO.list$Count.y)/sum(GO.list$Count.x))*0.5)/(((sum(GO.list$Count.y)/sum(GO.list$Count.x))*0.5)+0.5)
#in my case: p1: 0.4176183
#p2: 0.5823817
>GO.list$pvalue<-sapply(1:nrow(GO.list), 
 function(i) GTest(x=as.numeric(unlist(GO.list[i,c(5,14)]))), p=c(p1,p2))$pvalue)

Does anybody have any suggestion if this is any way correct, or should I do something completely different?

Cheers

Sven

             
             

 

gostats statistics • 1.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 18 hours ago
United States

I wouldn't do it that way. In one set of data you have some evidence that a set of proteins that make up a group are being perturbed, and you want to test for evidence that the same group of proteins is (not) being affected in a different experiment. The gene set testing framework is the canonical way of doing that, and it has the added benefit that you don't necessarily need to measure the exact same things in each experiment (although you would like a pretty good overlap). The limma package has a bunch of different tests, but for your purposes I would imagine that roast would be applicable.

 

ADD COMMENT
0
Entering edit mode

Hi thanks for your answer.

Yes, I think roast would do exactly what I'd want, but for that I'd have to have replicates which I don't have. I have only the Counts from one enrichment analysis the Counts from the second one which I want to compare to each other.

I then found that I could use geneSetTest :

>pvalue<-geneSetTest(as.vector(unlist(GO.list[,c(5)])),as.vector(unlist(GO.list[,c(14)])), ranks.only=FALSE)
>pvalue
[1] 1e-04

So now I get a pvalue, but I hoestly don`t know what it means, in the documentation it reads:

geneSetTest performs a competitive test in the sense that genes in the test set are compared to other genes (Goeman and Buhlmann, 2007). If the statistic is a genewise test statistic for differential expression, then geneSetTest tests whether genes in the set are more differentially expressed than genes not in the set.

But it doesn`t tell me what the null hypothesis is. Is the null that the genes are more differently, or is it they are not differently?

 

 

ADD REPLY

Login before adding your answer.

Traffic: 547 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6