Question

Biomart cannot get multiple genes?

Mike ▴ 10

I am trying to get gene information for the MCA genes.

I used:

library(biomaRt)

human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mca_filter <- mca@var.genes
attr <- c("ensembl_gene_id", "hgnc_symbol", "chromosome_name",
          "entrezgene", "start_position", "end_position")
Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = mca_filter,
              mart = human)

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The problem is that I cannot get information for multiple genes at once with biomaRt (querying a single gene works fine).

For example, mca_filter contains the gene "Selenop":

> which(mca_filter == "Selenop")
[1] 400

and I can retrieve its information with this code:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = "Selenop",
              mart = human)

which gives this result:

> Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000250722     SELENOP               5       6414       42799880     42887392

However, if I pass mca_filter instead of a single gene:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = mca_filter,
              mart = human)

I cannot find the information for many of the individual genes:

> which(Info$hgnc_symbol == "Selenop")
integer(0)

Do you know why? Please let me know. Thank you!

R biomart • 1.3k views

ADD COMMENT • link 6.0 years ago Mike ▴ 10

Before digging any deeper, can you check that this isn't due to case-sensitive matching? The command which(Info$hgnc_symbol == "Selenop") will only match entries spelled exactly Selenop, but your query returns SELENOP. Your first example fails this test too:

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = c("Selenop"),
              mart = human)

> which(Info$hgnc_symbol == "Selenop")
integer(0)

You can use a function like grep to perform a case-insensitive search, e.g.

Info <- getBM(attributes = attr,
              filters = "hgnc_symbol",
              values = c("Selenop", "CDC6"),
              mart = human)

> Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000094804        CDC6              17        990       40287633     40304657
2 ENSG00000250722     SELENOP               5       6414       42799880     42887392
> grep(x = Info$hgnc_symbol, pattern = 'Selenop', ignore.case = TRUE)
[1] 2
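One caveat: grep matches substrings, so the pattern 'Selenop' would also hit any longer symbol that contains it. A minimal sketch of an exact, case-insensitive comparison is below. The symbols vector is made-up illustration data standing in for Info$hgnc_symbol, and "SELENOP2" is a hypothetical symbol included only to show the difference.

```r
# Toy stand-in for Info$hgnc_symbol (hypothetical values, for illustration only)
symbols <- c("CDC6", "SELENOP", "SELENOP2")

# grep matches substrings, so it also reports the longer third element
grep_hits <- grep(pattern = "Selenop", x = symbols, ignore.case = TRUE)

# Exact match after converting both sides to upper case
exact_hits <- which(toupper(symbols) == toupper("Selenop"))

print(grep_hits)
print(exact_hits)
```

Converting both sides with toupper (or tolower) keeps the comparison exact while ignoring capitalisation.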

If this doesn't resolve the issue then please include the output of is(mca_filter) and head(mca_filter) so we can see examples of what values are present.

ADD REPLY • link 6.0 years ago Mike Smith ★ 6.6k

Thanks for the comment, Mike. I am afraid it is not a case-sensitivity issue.

mca_filter contains the "Selenop" gene, so I tried both values = mca_filter and values = c("Selenop").

But only values = c("Selenop") gives the correct result.

=================================================================

> Info <- getBM(attributes = attr,
+               filters = "hgnc_symbol",
+               values = c("Selenop", "CDC6"),
+               mart = human)
> Info
  ensembl_gene_id hgnc_symbol chromosome_name entrezgene start_position end_position
1 ENSG00000094804        CDC6              17        990       40287633     40304657
2 ENSG00000250722     SELENOP               5       6414       42799880     42887392

This also works for me, but when I pass mca_filter rather than one or a few gene names, the result seems to contain information for only one of my genes.

> intersect(Info$hgnc_symbol, mca_filter)
[1] "H19"

This means it only gets the "H19" gene information.

> length(Info$hgnc_symbol)
[1] 697
> length(mca_filter)
[1] 1000

These are the numbers of entries in the result and in my input list.

==========================================================

I will also give you the information that you asked for.

> is(mca_filter)
[1] "character" "vector" "data.frameRowLabels" "SuperClassMethod" "index" "atomicVector" "kfunction"
[8] "EnumerationValue" "characterORconnection" "characterORMIAME" "character_OR_NULL" "atomic" "listI" "output"
[15] "vector_OR_factor"

> head(mca_filter, 10)
[1] "Spink1" "Gast" "Sbp" "Wap" "Csn1s2a" "Ins2" "Igha" "Igkc" "Sftpc" "Scgb1a1"

=============================================================================

For the pbmc data from Seurat it worked perfectly fine, but with the mca data from Seurat (https://satijalab.org/seurat/mca_loom.html) it doesn't work. I took mca_filter from hv.genes as defined on that page.

Hence,

mca_filter <- hv.genes

================================================================

Thank you so much again.

ADD REPLY • link 6.0 years ago Mike ▴ 10

score 0 · Answer 1 · 2018-11-29

I still think this might be a case-sensitivity issue. The lines I've reproduced below show that your query does return multiple values. It's not the full 1000, but 697 gene symbols are matched by your query.

> length(Info$hgnc_symbol)
[1] 697
> length(mca_filter)
[1] 1000

It is normal for biomaRt to return nothing if it doesn't find a match for an element in your values vector, and presumably here 303 elements of mca_filter return nothing.
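To see which elements came back empty, the query vector can be compared with the returned symbols while ignoring case. The sketch below uses small made-up vectors in place of mca_filter and Info$hgnc_symbol. It also counts distinct symbols rather than rows, because a result can hold the same symbol on more than one row (for example when it maps to several Ensembl IDs), so 303 should be read as an estimate.

```r
# Toy stand-ins (hypothetical values) for mca_filter and Info$hgnc_symbol
query_symbols    <- c("H19", "Selenop", "Spink1", "Wap")
returned_symbols <- c("H19", "SELENOP", "SPINK1", "SELENOP")

# Count distinct symbols, not rows: one symbol can occupy several rows
n_matched <- length(unique(returned_symbols))

# Query values with no counterpart in the result, ignoring case
unmatched <- query_symbols[!toupper(query_symbols) %in% toupper(returned_symbols)]

print(n_matched)
print(unmatched)
```

With the real objects, the same two lines would be applied to mca_filter and Info$hgnc_symbol.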

I suspect intersect(Info$hgnc_symbol, mca_filter) gives only one result because the comparison is case sensitive. The example below demonstrates the same behaviour:

> intersect(c("H19", "Selenop", "Cdc6"), c("H19", "SELENOP", "CDC6"))
[1] "H19"
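For comparison, here is a minimal sketch of a case-insensitive intersect using the same two vectors. The variable names are chosen for illustration only.

```r
query_symbols  <- c("H19", "Selenop", "Cdc6")
result_symbols <- c("H19", "SELENOP", "CDC6")

# Compare in a single case so that capitalisation no longer matters
shared_upper <- intersect(toupper(query_symbols), toupper(result_symbols))

# Recover the original spelling used in the query vector
shared_original <- query_symbols[toupper(query_symbols) %in% shared_upper]

print(shared_upper)
print(shared_original)
```

All three symbols are now reported as shared, where the case-sensitive call found only H19.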

This leaves two questions:

Why are we only finding 697 hits?
Why does the capitalization change between mca_filter and the biomaRt results?

I think both of these are because the MCA data are from mouse, but you are querying the human dataset at Ensembl. The match is incomplete because not every mouse gene symbol has a human gene with the same name, and it is standard for mouse gene symbols to be stylised like Selenop and for human symbols to be all capitals, e.g. SELENOP.
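If the aim is to annotate the mouse symbols directly, one option is to query the mouse dataset instead. The following is a sketch under some assumptions: get_mouse_gene_info is a hypothetical helper name, "mmusculus_gene_ensembl" and "mgi_symbol" are the Ensembl mouse dataset and symbol filter as I understand them, and the Entrez attribute is left out because its name may differ between Ensembl releases (listAttributes() on the mart shows what is available).

```r
# Sketch only: get_mouse_gene_info is a hypothetical helper, not part of biomaRt.
get_mouse_gene_info <- function(symbols) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this query.")
  }
  # Connect to the mouse dataset rather than hsapiens_gene_ensembl
  mouse <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  # Filter on MGI symbols, which follow the mouse capitalisation (e.g. Selenop)
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
                   "start_position", "end_position"),
    filters = "mgi_symbol",
    values = symbols,
    mart = mouse
  )
}

# Usage (needs an internet connection):
# Info <- get_mouse_gene_info(mca_filter)
```

The returned mgi_symbol column should then use the same capitalisation as mca_filter, so a plain intersect would be expected to behave as intended.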