I am trying to use biomaRt to retrieve the gene symbols for a small list (126 entries) of SNPs referenced by rsID, chromosome, and bp position, and to return the gene symbols together with the SNP rsIDs as a data frame. Because I will want to do some manual checking, my code also retrieves the Ensembl ID along with the other two features.
My query uses the 'snp' mart because of the 'snp_filter' (rsID) filtering option and because I thought I had identified the gene attributes that return exactly what I was looking for ('ensembl_gene_stable_id' and 'associated_gene'), but the results are quite disappointing: there are very few hits in the associated_gene column.
At first I thought that perhaps this had to do with a lack of gene symbol association in the ensembl database but I have checked a few of the ensembl IDs that return no associated_gene symbol and they do show a gene name.
Because it was a fairly small list I was hoping I could do it with biomaRt since I have already managed to get familiar with it and I have some time constraints, but if anyone can suggest an alternative way to get the gene symbols list I am listening!
Here is my bit of code to provide some context.
snp = useMart("ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")
results<-c()
for (i in 1:dim(trim_SSNP_W)[1]){
temp <- getBM(attributes = c('refsnp_id', 'ensembl_gene_stable_id', 'associated_gene'),
filters = c('snp_filter'),
values = list(trim_SSNP_W[i,1]),
mart = snp,
uniqueRows = TRUE)
results <- rbind(results,temp)
}
I know that a for loop is not exactly standard coding in R but I am just learning and a bit more used to other programming languages (Java and Python) and I am struggling a bit with the compact coding style of R.
Thanks in advance, any suggestions are hugely welcome! Alejandra
Hey Alejandra (and I saw your post on Biostars: https://www.biostars.org/p/336724/#453535), could you share some of the rs IDs that return no associated gene? I am not sure, but running this as a for loop could result in your IP address being blacklisted due to repeated and rapid requests to the Ensembl servers. You can pass all IDs as a vector to values, and then just run getBM() once.
Oh, sorry, I did not know that was a thing =S I will try and look how to do that...
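For reference, the single-call version suggested above might look roughly like the sketch below. It assumes the rsIDs sit in the first column of trim_SSNP_W, as in the original loop; fetch_snp_annotations is a hypothetical helper name, not part of biomaRt, and the query itself needs network access to the Ensembl servers.

```r
# Hypothetical helper: send every rsID to the SNP mart in one getBM() request
# instead of one request per row.
fetch_snp_annotations <- function(rs_ids, mart) {
  # getBM() accepts a vector for `values`, so duplicates are dropped up front.
  rs_ids <- unique(as.character(rs_ids))
  biomaRt::getBM(
    attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene"),
    filters    = "snp_filter",
    values     = rs_ids,
    mart       = mart,
    uniqueRows = TRUE
  )
}

# Usage, with the mart object from the question (assumed to be named `snp`):
# results <- fetch_snp_annotations(trim_SSNP_W[, 1], snp)
```

The result is a single data frame, so the rbind() accumulation from the loop is no longer needed.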
Here is the output I obtain for some of the rsIDs:
Those appear to be RSIDs that do return an associated gene. It would be helpful to have a vector of those that do not.
I think these are examples that are missing a value for the associated_gene attribute.
Ah, I get it. But the associated_gene is the 'Associated gene with phenotype' attribute, which appears to be information that links a given variant with a particular gene based on a particular study, unless I misunderstand the Phenotype Annotation section on Biomart.
If the OP is actually trying to get the HUGO gene symbol, I am not sure you can get that from the SNP mart, can you?
Or do you have to do it the hard way:
Yep, I think this is an example where the attribute name is pretty ambiguous taken on its own, and you really need the full text name to get a little more insight.
This two-step approach is what I'd do too.
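A minimal sketch of what such a two-step lookup could look like, in case it helps: first map rsIDs to Ensembl gene IDs in the SNP mart, then map those gene IDs to HGNC symbols in the gene mart, and join the two tables. The function names merge_snp_genes and rsids_to_symbols are hypothetical, the host is taken from the question (depending on the biomaRt version it may need an https:// prefix), and the queries need network access.

```r
# Join SNP-to-gene hits with gene-to-symbol hits.
# SNPs whose gene has no symbol (or that have no gene) keep NA in hgnc_symbol.
merge_snp_genes <- function(snp_hits, gene_hits) {
  merge(snp_hits, gene_hits,
        by.x = "ensembl_gene_stable_id", by.y = "ensembl_gene_id",
        all.x = TRUE)
}

# Hypothetical wrapper around the two biomaRt queries.
rsids_to_symbols <- function(rs_ids, host = "grch37.ensembl.org") {
  snp_mart  <- biomaRt::useMart("ENSEMBL_MART_SNP",
                                dataset = "hsapiens_snp", host = host)
  gene_mart <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                                dataset = "hsapiens_gene_ensembl", host = host)

  # Step 1: rsID -> Ensembl gene ID, all rsIDs in a single request.
  snp_hits <- biomaRt::getBM(
    attributes = c("refsnp_id", "ensembl_gene_stable_id"),
    filters    = "snp_filter",
    values     = unique(as.character(rs_ids)),
    mart       = snp_mart
  )

  # Step 2: Ensembl gene ID -> HGNC symbol, skipping empty gene IDs.
  gene_ids <- unique(snp_hits$ensembl_gene_stable_id)
  gene_ids <- gene_ids[!is.na(gene_ids) & nzchar(gene_ids)]
  gene_hits <- biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol"),
    filters    = "ensembl_gene_id",
    values     = gene_ids,
    mart       = gene_mart
  )

  merge_snp_genes(snp_hits, gene_hits)
}

# Usage, assuming the rsIDs are in the first column of trim_SSNP_W:
# symbols <- rsids_to_symbols(trim_SSNP_W[, 1])
```

Keeping the join in its own small function makes it easy to check on toy data frames before sending anything to the Ensembl servers.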
Oh, I see the issue there now, thank you so much for that!! I absolutely misunderstood what associated_gene did! I thought it was somewhat analogous to "hgnc_symbol"... I will be taking the hard way, as that is exactly the output that I want. I was not sure if I could extract the HUGO gene symbol (I am pretty new at this and just learnt today that it's called HUGO...)
Thanks!!