Gene enrichment question
1
0
Entering edit mode
@michael-salbaum-5309
Last seen 10.2 years ago
I'm not a statistician, but interested in the question. Wouldn't it be warranted to ask the question how 'membership' of DE genes in the stem cell gene set would compare to 'membership' in a randomly drawn gene set of the same size as the stem cell set? Fisher's Exact test to evaluate, then bootstrap? J. Michael Salbaum, Ph.D. Associate Professor Pennington Biomedical Research Center Louisiana State University System 6400 Perkins Road Baton Rouge, LA 70808 (225) 763-2782 -----Original Message----- From: bioconductor-bounces@r-project.org on behalf of Aliaksei Holik Sent: Wed 8/15/2012 8:51 AM Cc: bioconductor@r-project.org Subject: [BioC] Gene enrichment question Dear listers, Apologies if my question is not strictly related to Bioconductor, though one never knows, maybe there's a package that does what I need. I am analysing a list of differentially expressed genes from an Illumina microarray. In particular I'm trying to compare the list of differentially expressed genes to an existing list of genes preferentially expressed in the stem cell population (stem cell signature). When I do so, 10% of DE genes belong to the stem cell signature. What I'd like to do now is to find out, how likely that would happen by chance, i.e. put a p value on it. At the moment I know: There're 17119 unique genes in my dataset. Of them 86 are differentially expressed. The stem cell signature contains 510 genes. It is combined from several platforms, which makes it hard to establish the total number of unique genes, but it's at least 20819 (the size of the largest platform). There are 9 overlapping genes between DE genes and the stem cell signature. So I wonder: 1) If there's an accepted way to calculate a p value using these data. For instance could I run a like of a chi squared test? E.g. stem cell specific genes represent 510/20819=2.4% of total dataset. So expected number of the stem cell genes in my DE genes would be 86x2.4%=2. So my chi squared test would be based on 9 observed vs 2 expected. 2) Or do I have to generate a geneset based on the stem cell signature and go through GSEA algorithms to calculate enrichment and significance. Any pointers in the right direction would be much appreciated. Many thanks for your time and help! Aliaksei. _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
GO GO • 1.8k views
ADD COMMENT
0
Entering edit mode
@aliaksei-holik-4992
Last seen 8.8 years ago
Spain/Barcelona/Centre for Genomic Regu…
Dear all, Thank you for your answers and suggestions. Rather than replying to each of you, I'm just going to summarise. I've looked into hypergeometric test suggested by Alex, but admittedly couldn't get my head round the jargon. So I focused on other methods suggested. However, Alex's suggestion to remove 'stem cell genes' not present in my dataset was most helpful, as indeed some 56 genes were missing from my array. I haven't tried going into GSEA as suggested by Steve, but I do have expression values for the genes and will try to have a go at it later. I have then looked into Fisher's exact test and bootstrapping suggested by Michael and Steve. I couldn't figure out, how to use 'boot' function to get it to sample a limited size sample (86) from a given population (17119), so I ended up doing it "manually". In fact, I tried both permutation and bootstrapping, by using sampling without and with replacement, but couldn't see much difference in distribution. Any comments on bootstrapping vs permutation and the ideal number of replications are much appreciated. I'm still not quite sure if I calculated final p value correctly and certain that the whole task could have been accomplished more elegantly. So if anyone is kind enough to read through all my coding, and correct me, I would be very grateful. I just hope the script is readable enough. Here it is: ########## # Following character vectors contain names of: # all.genes - all genes in dataset (17119) # de.genes - differentially expressed genes in dataset (86) # scs.genes - stem cell signature genes present in dataset (454) # de.vs.scs - DE genes present in stem cell signature (9) # Vectors with similar names, but with prefix 'n' (n.de.genes) contain # numerical vectors of length of each character vector ## Run a Fisher's exact test on DE genes in SCS # Generate a contingency table c.table <- matrix(c(n.all.genes, n.de.genes, + n.scs.genes, n.de.vs.scs), nrow = 2) # Run Fisher's test and assign observed p value to 'z.obs' z.obs <- fisher.test(c.table)$p.value ## Write a function to generate a random sample of DE genes from all ## the genes in array and calculate Fisher's statistics on it boot.f.test <- function(k){ # Where k is randomly drawn sample # of 86 genes from 17119 gene pool n <- length(intersect(k, array.vs.SCS)) # Compare sampled 86 genes to the ones in SCS c.table <- matrix(c(n.all.genes, n.de.genes, + n.scs.genes, n), nrow = 2) fisher.test(c.table) } ## Resample DE genes 1000 times and calculate Fisher's exact test z <- vector() for (i in 1:1000){ # Sample 86 genes from the pool of 17119 boot.de.genes <- sample(all.genes.names, size=length(de.genes.names), replace = F ) # Calculate Fisher's test on the sample z[i] <- boot.f.test(boot.de.genes)$p.value } # Calculate p value as a proportion of simulated p values equal or # smaller than observed p value boot.p.value <- length(which(z <= z.obs))/length(z) boot.p.value ### End of Script Many thanks! Aliaksei.
ADD COMMENT
0
Entering edit mode
On 16.08.2012 00:57, Aliaksei Holik wrote: > Dear all, > > Thank you for your answers and suggestions. Rather than replying to > each of you, I'm just going to summarise. > > I've looked into hypergeometric test suggested by Alex, but > admittedly couldn't get my head round the jargon. So I focused on > other methods suggested. However, Alex's suggestion to remove 'stem > cell genes' not present in my dataset was most helpful, as indeed > some > 56 genes were missing from my array. > > I haven't tried going into GSEA as suggested by Steve, but I do have > expression values for the genes and will try to have a go at it > later. > > I have then looked into Fisher's exact test and bootstrapping > suggested by Michael and Steve. I couldn't figure out, how to use > 'boot' function to get it to sample a limited size sample (86) from a > given population (17119), so I ended up doing it "manually". In fact, > I tried both permutation and bootstrapping, by using sampling without > and with replacement, but couldn't see much difference in > distribution. Any comments on bootstrapping vs permutation and the > ideal number of replications are much appreciated. Just FYI, the hypergeometric test is equivalent to Fisher's exact (in the case of a 2x2 contingency table), so the P value (before bootstrapping) should be identical (http://en.wikipedia.org/wiki/Hypergeometric_distribution%23Relationsh ip_to_Fisher.27s_exact_test). I don't know your exact research question, but I would probably argue that stats has taken you as far as you need in this case - the Fisher test shows you that yes you do indeed have (slightly) more overlapping genes than you would expect by chance. Other methods are unlikely to change that fundamental conclusion. The real question is what are those overlapping genes and do they have any real relevance for the biology you are studying (omics derived gene lists are often filled with stuff of tangential interest to the question in hand in my experience!). If I understand right then given there are only 9 of these genes in your set, this can probably be done fairly quickly by hand via Pubmed and/or a friendly local expert. -- Alex Gutteridge
ADD REPLY

Login before adding your answer.

Traffic: 795 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6