Question

Problem when using topGO for RNAseq data: The number of annotated genes for GO terms changes according to the number of genes in the gene-background-LIst.

0

Entering edit mode

colaneri ▴ 30

@colaneri-7770

Last seen 5.6 years ago

United States

I’am running topGO, to perform GO analysis in a RNA-seq data set. I have an small data set of 104 significant genes that I called “sigG”.

Firstly, I used genefilter to find genes that have similar level of expression than my “sigG”. For each sigGene I got 10 genes which make a background set of 1040 genes (backG).

I run topGO but I found results that make suspect that something is going wrong. See the head of these table:

Using 10 background genes per sig gene:

1-For example in my last row I see 127 sig genes mapping to GO:0016740, but I only use a set of 104 genes.

2-When I include more backG (50 instead of 10 per sigG) I notice that the number of annotated genes for each GO term changed (increased for most of them)

Finally, if I took my sigGenes and I analyzed them with a variety of online GO tools I always get nice enrichment in very specific terms, independently of the tool. However not of those terms appear in this topGO analysis. I cannot get an explanation for these results. I will appreciate if someone can give me a hint of what is going one. Below you can find the R script used for this analysis

The code I used to generate these tables is:

# resSxT is a result DESeq2 object
# > class(resSxT)
# [1] "DESeqResults"
#attr(,"package")
#[1] "DESeq2"


#get the list of genes considered to be significant for the interaction test
sigGenesSxT <-rownames(subset(resSxT, pvalue < 0.01))
#get average expressions
overallBaseMeanSxT <- as.matrix(resSxT[, "baseMean", drop = F])
backG_SxT <- genefinder(overallBaseMeanSxT, sigGenesSxT, 50, method = "manhattan")

# get identified similar genes
backG_SxTwithSIG <- rownames(overallBaseMeanSxT)[as.vector(sapply(backG_SxT, function(x)x$indices))]

## remove Diff Expressed genes from background
backG_SxT <- setdiff(backG_SxTwithSIG, sigGenesSxT)

##RUNNING topGO on resSxT

onts = c( "MF", "BP", "CC" )
geneIDs = rownames(overallBaseMeanSxT)
inUniverse = geneIDs %in% c(sigGenesSxT, backG_SxT)
inSelection = geneIDs %in% backG_SxT
algSxT <- factor( as.integer( inSelection[inUniverse] ) )
names(algSxT) <- geneIDs[inUniverse]

tab = as.list(onts)
names(tab) = onts
### test all three top level ontologies
for(i in 1:3){
        ## prepare data
        tgdSxT <- new( "topGOdata", ontology=onts[i], allGenes = algSxT, nodeSize=5, annot=annFUN.org, mapping="org.Hs.eg.db", ID = "SYMBOL" )
        ## run tests
        resultTopGO_SxT.elim <- runTest(tgdSxT, algorithm = "elim", statistic = "Fisher" )
        resultTopGO_SxT.classic <- runTest(tgdSxT, algorithm = "classic", statistic = "Fisher" )
        ## look at results
        tab[[i]] <- GenTable( tgdSxT, Fisher.elim = resultTopGO_SxT.elim,
                              Fisher.classic = resultTopGO_SxT.classic,
                              orderBy = "Fisher.classic" , topNodes = 200)
}    
topGOResultsSxT <- as.data.frame(do.call(rbind,tab))
write.csv(topGOResultsSxT, file = "SxT_topGOResults.csv")

topGO gene ontolology rnaseq • 1.9k views

ADD COMMENT • link 7.7 years ago colaneri ▴ 30