Filtering genes for GOstats
3
0
Entering edit mode
@noncodinggene-7018
Last seen 6.3 years ago

Hi

I'm using GOstats, I've created the universe for the Saccharomyces cerevisiae

frame <- toTable(org.Sc.sgdGO)
goframeData <- data.frame(frame$go_id, frame$Evidence, frame$systematic_name)

goFrame <- GOFrame(goframeData, organism = "Saccharomyces cerevisiae")
goAllFrame <- GOAllFrame(goFrame)
Sc.gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
setwd("/path/")
save(Sc.gsc, file = "Sc.gsc.Rda")

Now, if I'm not wrong I must select the universe of genes present at my experiment. I have a list of characters with the genes that are present in my experiment and have a value (this means the value is not NA).  The problem is that I'm not being able to properly filter the genes on my experiment (variable filtered_universe).

universe = Lkeys(org.Sc.sgdGO)
genes = universe[1:500]
params <- GSEAGOHyperGParams(name="My Custom GSEA based annot Params",
+ geneSetCollection=Sc.gsc,
+ geneIds = filtered_universe,
+ universeGeneIds = universe,
+ ontology = "MF",
+ pvalueCutoff = 0.05,
+ conditional = FALSE,
+ testDirection = "over")

 

I've tried this, but I'm getting an error on the last line, and I'm not really sure this is how this should be done. gso is the original unverse, and fn0 is the list of genes

subsettingGeneSet <- function(gs0, fn0){
  geneIds(gs0) <- geneIds(gs0)[is.element(geneIds(gs0), fn0)]
}

gsc2 <- sapply(Sc.gsc, subsettingGeneSet, fn0 = alk_name)
gsc2 <- GeneSetCollection(gsc2)
 
gostats • 2.5k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 15 hours ago
United States

Please note that S. cerevisiae IS a supported organism, and you can use the 'usual' GOstats pipeline directly.

> library(org.Sc.sgd.db)
> univ <- unique(keys(org.Sc.sgd.db))
> genes <- univ[sample(1:length(univ), 250)]
> p <- new("GOHyperGParams", geneIds = genes, universeGeneIds = univ, ontology = "BP", annotation = "org.Sc.sgd.db")
> hyp <- hyperGTest(p)
> head(summary(hyp))
      GOBPID      Pvalue OddsRatio   ExpCount Count Size
1 GO:0006643 0.001352049  4.156876 2.17207334     8   70
2 GO:0009306 0.002152682 15.838462 0.27926657     3    9
3 GO:1901977 0.002815495 63.081633 0.09308886     2    3
4 GO:2000001 0.002815495 63.081633 0.09308886     2    3
5 GO:2000002 0.002815495 63.081633 0.09308886     2    3
6 GO:0006188 0.004038588 11.875000 0.34132581     3   11
                                          Term
1             membrane lipid metabolic process
2                            protein secretion
3 negative regulation of cell cycle checkpoint
4          regulation of DNA damage checkpoint
5 negative regulation of DNA damage checkpoint
6                     IMP biosynthetic process

 

ADD COMMENT
0
Entering edit mode

I have understood that your universe of genes (geneIds, the ones you analyzed from your experiment and are statiscally significative) , should match the universe of genes (universeGeneIds) so the statistics work properly.

In this case the universe is the whole universe of gene from Yeast, instead of the universe of genes that are present in your experiment. Is this right?

ADD REPLY
0
Entering edit mode

No. The "universe" you use in GO enrichment analyses is the set of genes your assay could have potentially measured.

Imagine if you ran an experiment that used a targeted assay which (for whatever reason) only measured expression/whatever of the kinases in a given organism. If you set the Universe to include all the possible genes from your target organism, then no matter what actually happens in the experiment, you're results would always return significant hits for things related to kinase activity.

 

ADD REPLY
0
Entering edit mode
@noncodinggene-7018
Last seen 6.3 years ago

I'm trying to create the universeGeneIds

Alk.entrezid <- unlist(mget(alk_names, org.Sc.sgdENTREZID))

Where alk_names is a list of genes present in my array that at least have one expression value, so all genes with NA in it's row are removed.

But I get the following error:

Error en unlist(mget(alk_names, org.Sc.sgdENTREZID)) :
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error en .checkKeys(value, Lkeys(x), x@ifnotfound) :
  value for "YAR044W" not found

If I remove that gene, I get the same error with a different gene placed a few positions further in the list (about 200 positions, this is no important I guess).

ADD COMMENT
0
Entering edit mode

That's an old school way of getting things, and no longer recommended (but note that it would work if you added the argument ifnotfound = NA to your mget() call).

Instead you would be better off using select()

tmp <- select(org.Sc.sgd.db, alk_names, "ENTREZID")

You might get duplicates, and you want unique Entrez IDs, and no NA values, so you have to subset

genes <- as.character(tmp$ENTREZID)[!is.na(tmp$ENTREZID) & !duplicated(tmp$ENTREZID)]


 

ADD REPLY
0
Entering edit mode

Thanks I solved it the old fashioned way, and I've noticed that if you do it the old way yo get duplicated, meanwhile if you do it the new way you get non duplicated.

ADD REPLY
0
Entering edit mode

That should be exactly opposite. If you do it the old fashioned way, there are no duplicates, whereas you do get duplicates using select().

This is because mget() will return NA for any probeset that is annotated to more than one gene, whereas select() will return both genes. There are ways to use mget() that will return all the genes that a given probeset might interrogate, but it is just easier to use select(), which is why I recommend you do so.

ADD REPLY
0
Entering edit mode

Thanks, I was using the wrong function and I was doing it the wrong way.

ADD REPLY
0
Entering edit mode
@noncodinggene-7018
Last seen 6.3 years ago

Still getting some errors, this time is when I run the hyperGTest.

Alk.entrezid <- select(org.Sc.sgd.db, alk_names, "ENTREZID") #New fashioned way
anyDuplicated(Alk.entrezid$ENTREZID)
dim(Alk.entrezid)

#Removing duplicated -mainly NA in ENTREZID column-
Alk.entrezid <- as.character(Alk.entrezid$ENTREZID)[!is.na(Alk.entrezid$ENTREZID) & !duplicated(Alk.entrezid$ENTREZID)]
anyDuplicated(Alk.entrezid)
length(Alk.entrezid)
rm(alk_names)

#The same with genes that are significative
up.Alk.entrezid <- select(org.Sc.sgd.db, up.alk, "ENTREZID") 
anyDuplicated(up.Alk.entrezid$ENTREZID)

up.Alk.entrezid <- as.character(up.Alk.entrezid$ENTREZID)[!is.na(up.Alk.entrezid$ENTREZID) & !duplicated(up.Alk.entrezid$ENTREZID)]
anyDuplicated(up.Alk.entrezid)

#Running test
params <- new("GOHyperGParams", geneIds = up.Alk.entrezid, universeGeneIds = Alk.entrezid, ontology = "BP", annotation = "org.Sc.sgd.db")
overRepresented <- hyperGTest(params)

 

Error:

Error en getUniverseHelper(probes, datPkg, entrezIds) :
  After filtering, there are no valid IDs that can be used as the Gene universe.
  Check input values to confirm they are the same type as the central ID used by your annotation package.
  For chip packages, this will still mean the central GENE identifier used by the package (NOT the probe IDs).
ADD COMMENT
0
Entering edit mode

Indeed. And the error message is intended to be helpful (and if you look back at my original answer, you can get a hint there). The org.Sc.scd.db doesn't (as the name suggests) use Entrez Gene IDs, but SCD IDs.

So you can get the universe as I have already done, or better yet, generate a universe vector based on the genes on your array (using SCD IDs), and do the same for the differentially expressed genes as well.

 

ADD REPLY

Login before adding your answer.

Traffic: 889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6