Hi Elodie,
Not that this will help you figure out why your call to biomaRt::getBM()
doesn't return all the data you expect, but I would recommend a different approach. With the GenomicFeatures package you can import all the gene, transcript, exon, and CDS info into a TxDb object first. It's easy and takes only a few minutes (about 5) to make the object. The benefit is that querying the TxDb object is fast (now the data are local) and convenient, and you can save it and re-use it later, or share it with your collaborators:
library(GenomicFeatures)
date <- "jan2013"
host <- paste(date,"archive.ensembl.org",sep=".")
txdb <- makeTxDbFromBiomart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host=host)
Save the object with saveDb()
so you can re-use it later (load it with loadDb()
).
If you use the devel version of Bioconductor (BioC 3.1, you'll need R-devel for that), it happens that I just added the transcriptLengths()
function a couple of days ago that seems to be exactly what you need here:
tx_lengths <- transcriptLengths(txdb, with.cds_len=TRUE)
head(tx_lengths)
# tx_id tx_name gene_id nexon tx_len cds_len
# 1 1 ENST00000456328 ENSG00000223972 3 1657 0
# 2 2 ENST00000515242 ENSG00000223972 3 1653 0
# 3 3 ENST00000518655 ENSG00000223972 4 1483 0
# 4 4 ENST00000450305 ENSG00000223972 6 632 0
# 5 5 ENST00000473358 ENSG00000243485 3 712 0
# 6 6 ENST00000469289 ENSG00000243485 2 535 0
To see the transcript and CDS lengths for gene ENSG00000116171:
tx_lengths[tx_lengths$gene_id == "ENSG00000116171", ]
# tx_id tx_name gene_id nexon tx_len cds_len
# 2986 2986 ENST00000371514 ENSG00000116171 16 2811 1644
# 2987 2987 ENST00000478631 ENSG00000116171 17 5204 1101
# 2988 2988 ENST00000528311 ENSG00000116171 15 1896 1401
# 2989 2989 ENST00000371509 ENSG00000116171 15 1889 1512
# 2990 2990 ENST00000407246 ENSG00000116171 15 2239 1572
# 2991 2991 ENST00000371513 ENSG00000116171 11 2003 969
# 2992 2992 ENST00000528809 ENSG00000116171 7 570 0
# 2996 2996 ENST00000529363 ENSG00000116171 8 671 671
# 2997 2997 ENST00000473584 ENSG00000116171 6 582 0
# 2999 2999 ENST00000430330 ENSG00000116171 5 938 423
# 3000 3000 ENST00000408941 ENSG00000116171 4 1298 180
# 3001 3001 ENST00000478274 ENSG00000116171 4 969 421
# 3002 3002 ENST00000484100 ENSG00000116171 4 671 235
# 3003 3003 ENST00000533119 ENSG00000116171 4 742 97
# 3004 3004 ENST00000435345 ENSG00000116171 5 894 432
# 3005 3005 ENST00000488965 ENSG00000116171 3 3926 180
To get the transcript and CDS lengths for the list of genes you have in annot$ensembl_gene_id
, you would do:
tx_lengths[tx_lengths$gene_id %in% annot$ensembl_gene_id, ]
See ?transcriptLengths
in the GenomicFeatures package for more information.
Hope this helps,
H.
What is the value of
unique(annot$ensembl_gene_id)
?