Question

Gene set enrichment analysis

0

Entering edit mode

kaushal Raj Chaudhary ▴ 10

@kaushal-raj-chaudhary-6165

Last seen 10.0 years ago

United States

Hello Forum,

I want to perform gene set enrichment analysis of whole genome. Can it be done or I have to have some selective genes? Thanks for any input in this regard.

gostat gsea gseabase topgo • 7.0k views

ADD COMMENT • link updated 10.4 years ago by Levi Waldron ★ 1.1k • written 10.4 years ago by kaushal Raj Chaudhary ▴ 10

0

Entering edit mode

Could you please update your question with more detail? Minimally adding information on what type of data do you have, for instance, would be a good start ... what experiment was run, under what conditions, etc.

It's hard to understand what you really want to do as the question currently stands.

ADD REPLY • link 10.4 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hello Steve,

I have gene expression data 28000 probes ( Affymetrix-rat) and 11 samples (case and control). I did run differential gene expression analysis using limma package but did not get significance based on the FDR values. Now I want to perform gene set enrichment analysis rather than individual gene analysis. My question is how to select the list of genes since I did not get any significant genes. Or if I could select all genes for performing GSEA?

Thanks !!!

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

score 2 · Accepted Answer · 2014-12-01

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

For GSEA you don't specify a set of differentially expressed genes. Instead, you specify sets of genes and then decide if those genes appear in aggregate to be differentially expressed in your samples, where 'differentially expressed in aggregate' can take at least two forms; competitive or self-contained. You can also test a whole battery of gene sets (like the Broad c1-c6 gene sets) or just a few.

If you are already using limma, then you can do a competitive GSEA for a battery of gene sets using romer() or for a few sets using camera(), and the analogous self-contained GSEA using mroast() and roast(). Or basic GSEA using geneSetTest(). See 10.GeneSetTests for more information, as well as http://bioinformatics.oxfordjournals.org/content/23/8/980.short for more information about the various types of GSEA.

ADD COMMENT • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

Hi Jim,

I am trying to use geneSetTest() with t-value as statistics from top table.

geneSetTest(index=, statistics=abs(tt$t), alternative="mixed", ranks.only=TRUE, nsim=9999)

What values goes in index argument? The documentation says it is selected gene sets but I don't know which gene sets are available.

Thank you.

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

1

Entering edit mode

The index argument is a vector of TRUE/FALSE, equal to the number of genes in your experiment, where the index is TRUE for genes that are in your gene set. How you get that vector is dependent on how you are defining your gene sets, etc. Without knowing more about the gene sets that you are interested in, I can't give much more help.

ADD REPLY • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks Jim !!!

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

0

Entering edit mode

Hi Jim,

I am mostly interested in PI3K/AKT singnaling pathway. Would that be a gene set than can be tested against background? I don't have any other known gene set that I can test.

Thanks !!!

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

1

Entering edit mode

There are a whole host of genesets "curated" for you over at MSigDB, which the folks at WEHI have parsed into an R consumable form here.

ADD REPLY • link 10.4 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve,

It does not seem they have rat genes. How to map rat genes to human genes or mouse genes so I can use it geneSetTest() function?

Thanks !!!

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

2

Entering edit mode

One way is to use the idConverter() function from AnnotationDbi, which will map from say human to rat. I wrote an Rscript a while ago that can be used to do human to rat or mouse:

## usage:  R CMD Rscript parseBroad_nonhuman.R  thedir thespecies wheretoput
## where thedir is the dir containing the broad entrez-gene based data (broad_by_eg)
## thespecies is mouse or rat (currently)
## wheretoput is the dir that the output should be saved in - no need to create first.
## the args are space delimited

args <- commandArgs(TRUE)

if(length(args) != 3) stop(paste("\nUsage: R CMD Rscript parseBroad_nonhuman.R",
         " <thedir> <thespecies> <wheretoput>\n",
         "<thedir> = the directory containing Entrez Gene based Gene sets",
         "(currently /data2/romer/broad_by_eg)\n",
         "<thespecies> = the species to map to, currently either mouse or rat\n",
         "<wheretoput> = the directory in which to output the mapped gene sets\n\n",
         "Note that these arguments are space-delimited, and the function is best run",
         " under nohup. Example:\n",
         "nohup R CMD Rscript parseBroad_nonhuman.R broad_by_eg mouse mouse2 &\n\n", sep = ""),
         call. = FALSE)

makeGeneSets <- function(thedir, species, wheretoput){
    mapper <- c("MUSMU","RATNO")
    names(mapper) <- c("mouse","rat")
    thespecies <- mapper[species]
    require("AnnotationDbi", quietly = TRUE, character.only = TRUE)
    
    
    fls <- paste(thedir, dir(thedir, pattern = "gmt$"), sep = "/")
    thenames <- c("Positional", "Chemical_genetic_perturbations","Biocarta",
                  "KEGG", "Reactome","Canonical_pathway","microRNA",
                  "Transcription_factor_targets","Cancer_gene_neighborhoods","Cancer_modules",
                  "GO_biological_process","GO_cellular_component", "GO_molecular_function",
                  "Oncogenic")
    
    parseBroad <- function(thefile){
        len <- as.numeric(strsplit(system(paste("wc -l", thefile), intern=TRUE), " ")[[1]][1])
        con <- file(thefile, "r")

        lst <- list(length = len)
        nam <- vector(length = len)

        for(i in seq_len(len)){
            theline <- readLines(con, 1L)
            theline <- strsplit(theline, "\t")
            nam[i] <- theline[[1]][1]
            lst[[i]] <- theline[[1]][-c(1:2)]
        }
        close(con)
        names(lst) <- nam
        return(lst)

    }

    convertIt <- function(x, species){
        
        tryCatch(idConverter(x, "HOMSA", species, "EG", "SYMBOL"), error = function(e) NULL)
    }

    broad <- lapply(fls, parseBroad)
    ## the canonical pathways include reactome, biocarta and KEGG, so subset
    ind <- !names(broad[[6]]) %in% c(names(broad[[3]]), names(broad[[4]]), names(broad[[5]]))
    broad[[6]] <- broad[[6]][ind]

    names(broad) <- thenames
    broad <- lapply(broad, function(x) lapply(x, convertIt, species = thespecies))
    ## remove empty gene sets
    broad <- lapply(broad, function(x) x[!sapply(x, is.null)])
    if(!file.exists(wheretoput)) dir.create(wheretoput)
    save(broad, file = paste(wheretoput, "/broad_", species, "_byset.Rdata", sep = ""))
}

makeGeneSets(args[1], args[2], args[3])

If you have one or more of the gmt files from the Broad, you can use that script to make an Rdata file for rat. I ran it yesterday, and the genes for the PI3K/AKT gene set for rat are:

structure(c("Gsk3a", "Mapkap1", "Rictor", "Tbc1d4", "Raf1", "Casp9",
"Map3k5", "Foxo3", "Prkdc", "Ywhaq", "Ywhah", "Mtor", "Foxo1",
"Ywhag", "Cdkn1a", "Ywhae", "Chuk", "Bad", "Kpna1", "Ywhaz",
"Ywhab", "Akt1", "Akt2", "Gsk3b", "Mlst8"), .Names = c("2931",
"79109", "253260", "9882", "5894", "842", "4217", "2309", "5591",
"10971", "7533", "2475", "2308", "7532", "1026", "7531", "1147",
"572", "3836", "7534", "7529", "207", "208", "2932", "64223"))

Which you can paste into an R session to create a named vector (names are Entrez Gene IDs, values are HUGO symbols). This has 25 of the 38 genes found in the human version of this pathway (REACTOME_PI3K_AKT_ACTIVATION).

ADD REPLY • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

Nice.

Have you seen docopt yet? I've been having fun using that with my R based command line scripts. I also use the #!/usr/bin/env Rscript shebang as opposed to running through R CMD Rscript ... is there any reason why you prefer the latter?

ADD REPLY • link 10.4 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

I hadn't seen docopt. Thanks for the pointer.

I tend to keep lots of previous versions of R/Bioc laying about, for that special PI who invariably comes back like two years later, wanting me to add 'just one more thing' to their analysis. Rather than using the current R/Bioc, I usually opt to use whatever iteration was used for their original analysis (which I know because I always drop a sessionInfo() at the end of the Sweave/knitr document), so the only thing that will change will be the 'one more thing' that they wanted me to add.

Anyway, because I have lots of different R versions installed, I tend not to hard code anything to point to the current or system-wide installation of R because I may well need to use a particular version. So most of my scripts either use R CMD Rscript (or e.g., R-2.15.1 CMD Rscript), or I just write .R files and use R-3.0.1 --vanilla < mysuperl33tscript.R, etc.

ADD REPLY • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

Hi Jim,

How did you come up with that genes? Can I use those gene list in index argument of geneSetTest() function?

Thank you.

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

1

Entering edit mode

I thought I was pretty clear, but perhaps not. I went here (http://www.broadinstitute.org/gsea/msigdb/index.jsp), and having registered already, went to the downloads page, and downloaded all the entrez.gmt files, for each of the gene sets (I actually did this a while ago - I have the 3.1 versions, whereas the current ones at the Broad are 4.0).

I then ran the script that I supplied in an earlier comment (see above), to convert those human Entrez Gene IDs to rat Entrez Gene IDs and symbols. This script does all of the gene sets at once, creating a .Rdata file that I loaded into R and then extracted the converted genes for the REACTOME_PI3K_AKT_ACTIVATION gene set, which is, I believe, the one you are interested in.

And to answer your question, no you cannot use these genes directly. As I mentioned in a (much) earlier comment, you need a logical (TRUE/FALSE) vector that is the same length as the number of rows in your MArrayLM object, where each TRUE or FALSE indicates whether or not the corresponding gene (in the MArrayLM object) is in (TRUE) or not in (FALSE) the gene set of interest.

So your homework is to annotate the genes in your MArrayLM object with either their Entrez Gene or symbol, and then use the genes I have supplied to generate the logical vector, which you can then use as input to geneSetTest() or camera().

ADD REPLY • link 10.4 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks Jim.

ADD REPLY • link 10.4 years ago kaushal Raj Chaudhary ▴ 10

score 1 · Accepted Answer · 2014-12-02

1

Entering edit mode

Levi Waldron ★ 1.1k

@levi-waldron-3429

Last seen 3 months ago

CUNY Graduate School of Public Health a…

The GeneSetEnrichment biocViews shows relevant packages, although James's curated list is probably more helpful:

http://www.bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment

I think you will have to map to human Entrez or HGNC gene symbols, unless you know of rat gene sets that you want to test.

ADD COMMENT • link 10.4 years ago Levi Waldron ★ 1.1k