By 'self-contained' I actually meant 'code that somebody other than you could run'. Perhaps I wasn't explicit enough.
Anyway, note that enrichDO is just fitting a Fisher's exact test, which is based on a 2x2 table, where the cells of the table are
|
Genes in term |
Genes not in term |
Gene is significant |
|
|
Gene not significant |
|
|
And if you only use half of your genes, you may affect the count of genes in all four of those cells. Say you have 7 'influenza' genes in your set, and you select all 7 regardless of whether or not you use 50 or 100 genes.
|
Genes in term |
Genes not in term |
Gene is significant |
7 |
43 |
Gene not significant |
Inf.genes - 7 |
Univ - 50 |
But then you add in the other 50 genes
|
Genes in term |
Genes not in term |
Gene is significant |
7 |
93 |
Gene not significant |
Inf.genes - 7 |
Univ - 100
|
In that situation, three of the cells used to compute the statistic have now changed, which, depending on the number of influenza genes in the universe, and the number in the universe, may affect the statistic sufficiently for it to no longer be significant.
We can make a stupid example to test this, using the example data for enrichDO
:
> example(enrichDO)
<snip>
> z <- summary(yy)
fun <- function(gene, sigresult, B){
rslt <- vector(length = B)
for(i in seq(B)){
gene2 <- gene[sample(1:525, 212)] # Randomly select half the genes
tmp <- summary(enrichDO(gene2))
rslt[i] <- sum(!row.names(tmp) %in% row.names(sigresult))
}
rslt
}
> y <- fun(gene, z, 100)
> y
[1] 0 37 7 104 0 0 15 3 108 70 20 158 72 7 31 59 1 9
[19] 72 88 47 156 71 34 109 90 74 91 47 87 56 29 38 110 3 48
[37] 33 102 29 0 95 0 5 80 10 74 82 3 82 32 36 181 19 2
[55] 0 51 69 50 6 54 75 6 112 64 13 77 30 6 1 32 24 17
[73] 34 14 112 33 68 83 26 49 36 23 105 12 53 12 59 83 26 8
[91] 7 0 142 74 1 17 16 93 19 17
So we get from 0 to edit - 181 terms that appear when we use a random subset of half the original 525 genes used in the example, as compared to the terms we got in the first place.
You need to show some self-contained code that others can run. Otherwise, it's not clear exactly what you are doing, nor what the problem is.
My code:
library(DOSE)
library(biomaRt)
ensembl=useEnsembl(biomart="ensembl",dataset="hsapiens_gene_ensembl",version=89)
geneId<-readLines("First_100_Selected_Gene_name.txt") try(entrez_gene<-getBM(attributes=c('entrezgene'),filters='hgnc_symbol',values=geneId,mart=ensembl))
write.table(entrez_gene,file="entrezgene.txt",sep="\t")
entrez_gene1<-entrez_gene[,1]
try(x<-enrichDO(gene=entrez_gene1,ont="DO",pvalueCutoff=0.05,pAdjustMethod="BH",minGSSize=5,readable=FALSE))
y<-as.data.frame(x)
write.table(y,file="disease_ontology_output.txt",sep="\t"
If I use "First_50_Selected_Gene_name.txt", I can get a significant term but if I use "First_100_Selected_Gene_name.txt", that term is missing. My question is, "Is it possible that subset can have a term but whole set can not ?"