Hello. I am running ensemble gene set testing with EGSEA (Ensemble of Gene Set Enrichment Analyses) on edgeR output. I have a list of Ensembl gene IDs from edgeR (the dge variable). After the voom transformation, I need to build an index for each gene set collection using the EGSEA indexing functions (buildCustomIdx), which rely on Entrez gene IDs only. I tried the following, but got the error below, which I take to mean that many of my Ensembl IDs (around 50% of my genes) have no corresponding Entrez IDs:
> v <- voom(dge, design, plot=FALSE)
> library(EGSEA)
> library(EGSEAdata)
> egsea.data("human")
> info = egsea.data("human", returnInfo = TRUE)
> gsets = list(info$msigdb$info$collections[c(3, 6)])
> gs.annots <- buildCustomIdx(geneIDs=v$genes$gene.id, gsets=gsets, species = "human")
Error in data.frame(ID = paste0(label, seq(1, length(gsets.idx))), GeneSet = gsets.names) : arguments imply differing number of rows: 2, 0
The class of v$genes$gene.id is "character" and the class of gsets = list(info$msigdb$info$collections[c(3, 6)]) is "list".
I know that different databases use different gene notations, so I do not expect annotations for every gene. I also tried converting my Ensembl gene IDs to Entrez gene IDs with the biomaRt library, as below, but got a similar error:
> library(biomaRt)
> v$genes <- gsub('\\..+$', '', v$genes$gene.id)
> ensembl.genes <- v$genes
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> genes <- getBM(filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id","entrezgene_id"),
values = ensembl.genes,
mart = mart)
> library(EGSEA)
> entrez_id <- data.frame(genes$entrezgene_id)
> gs.annots = buildIdx(entrezIDs= entrez_id, species="human",
msigdb.gsets=c("c2", "c5"), go.part = TRUE)
Error in data.frame(ID = gsets.ids, GeneSet = gsets.names, NumGenes = paste0(sapply(gsets, : arguments imply differing number of rows: 0, 1
I have searched a lot but could not find how to fix this error. I would greatly appreciate any help with building an index from my list of Ensembl gene IDs so that I can run EGSEA. Many thanks.
Hi fgol,
I had the same problem and ended up building an Ensembl-to-Entrez ID master table. I couldn't find a quick way around it. I checked every duplicated entry to see whether there was a genuinely correct mapping (i.e. the duplicate came from bad input) or whether the entries really were duplicated. In the latter case I deleted the one that looked "less correct", e.g. an entry labelled as a readthrough of two genes in one system that is a single gene in the other. It sounds confusing, but if you try working through it you will see what I mean.
For your biomaRt attempt, if I recall correctly, getBM sometimes gives you two kinds of problem entries: a blank Entrez ID for an Ensembl ID, or the same Entrez ID for two Ensembl IDs. You need to sort these out first.
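To make that clean-up concrete, here is a minimal base R sketch. It assumes a mapping table shaped like a getBM result with the ensembl_gene_id and entrezgene_id columns; the IDs in it are made up for illustration. The "keep the first match" rule is only a stand-in for the manual checking described above, not a recommendation for which duplicate is correct.

```r
# Toy stand-in for a getBM() result; every ID here is invented for illustration.
map <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  entrezgene_id   = c(101L,     NA,       202L,     303L,     303L,     NA),
  stringsAsFactors = FALSE
)

# 1. Drop rows with a blank (NA) Entrez ID.
map <- map[!is.na(map$entrezgene_id), ]

# 2. One Ensembl ID mapped to several Entrez IDs: keep the first row.
#    (Assumption: "first" is arbitrary; manual curation may choose differently.)
map <- map[!duplicated(map$ensembl_gene_id), ]

# 3. Several Ensembl IDs mapped to the same Entrez ID: keep the first row.
map <- map[!duplicated(map$entrezgene_id), ]

# Plain character vector of Entrez IDs, one per retained gene.
entrez_ids <- as.character(map$entrezgene_id)
print(map)
```

As far as I know the EGSEA indexing functions expect a plain vector of IDs, so it is probably safer to pass something like entrez_ids than a data frame. You would also need to subset the rows of your expression object to the retained Ensembl IDs so that both stay in the same order.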
Also, for people starting from Ensembl IDs, it works better to use gene symbols and build the annotation from an MSigDB GMT file, since fewer entries seem to be lost to missing values and other causes.
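As a rough sketch of the GMT route: a GMT file is tab-separated, one gene set per line, with the set name first, a description second, and the gene symbols after that. The helper read_gmt below is a hypothetical name of mine, and the file contents are toy data written to a temporary file so the example runs on its own; with a real MSigDB download you would point it at that file instead.

```r
# Write a tiny GMT file with invented set names and gene symbols.
gmt_file <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_ONE\tdescription one\tGENE1\tGENE2\tGENE3",
  "SET_TWO\tdescription two\tGENE2\tGENE4"
), gmt_file)

# Hypothetical helper: parse a GMT file into a named list of character vectors.
read_gmt <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  # Fields 1 and 2 are the set name and description; the rest are genes.
  sets <- lapply(fields, function(x) unique(x[-(1:2)]))
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

gsets <- read_gmt(gmt_file)

# Restrict each set to the symbols present in your own data (toy vector here).
my_symbols <- c("GENE1", "GENE2", "GENE5")
gsets_in_data <- lapply(gsets, intersect, my_symbols)
print(gsets_in_data)
```

A named list of character vectors like gsets is, I believe, the shape that the gsets argument of buildCustomIdx is meant to take, with geneIDs given in the same ID type as the sets. Please check that against the EGSEA documentation before relying on it.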
It's not exactly an answer to your question so I'll just leave it here as a comment. Hope it helps.
I have come across the same issue. My count matrix contains about 65,000 Ensembl gene IDs, and when I convert them to Entrez IDs (NCBI gene IDs) I end up with only about 20,000 entries. For the majority of my Ensembl gene IDs (about 42,000) no Entrez ID is available. Of the remaining 23,000 Ensembl gene IDs, many share the same Entrez ID, which brings the final number of 'unique' Entrez IDs down to about 20,000.
Although the proportion of genes I 'lose' during the conversion (over 65%) surprised and concerns me, it may just be the nature of the beast. The Entrez gene annotation database is, to my knowledge, much more conservative than Ensembl (it contains mainly well-annotated, protein-coding genes), whereas Ensembl is more progressive in annotating novel and non-coding genes. However, since functional information is not yet available for many of those novel genes, they are not yet included in many functional gene sets or pathway databases, so it is probably not such a big issue for a gene set enrichment analysis?
Re @kentfung's comment on using HUGO gene symbols: I made the same observation. Fewer entries are 'lost' during conversion than with Entrez IDs, which brings my final number to about 37,000 genes based on symbols.
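For anyone wanting to tally the same quantities on their own data, this is roughly how the counts above can be computed in base R. The mapping table is a toy one with invented IDs, and the column names are assumed to follow the biomaRt attributes used earlier in the thread.

```r
# Toy mapping table; IDs are invented. NA means no Entrez ID was returned.
map <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  entrezgene_id   = c(101L,     NA,       303L,     303L,     NA),
  stringsAsFactors = FALSE
)

has_entrez <- !is.na(map$entrezgene_id)

# Total number of distinct Ensembl IDs.
n_total <- length(unique(map$ensembl_gene_id))

# Ensembl IDs with at least one Entrez ID, and those with none.
n_mapped   <- length(unique(map$ensembl_gene_id[has_entrez]))
n_unmapped <- n_total - n_mapped

# Distinct Entrez IDs left once shared mappings are collapsed.
n_unique_entrez <- length(unique(map$entrezgene_id[has_entrez]))

cat("total:", n_total, "unmapped:", n_unmapped,
    "mapped:", n_mapped, "unique Entrez:", n_unique_entrez, "\n")
```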