I am using a dataset that was annotated using Ensembl version 87 of human reference genome version 38. I wrote an R script to retrieve certain values for that data (FASTA sequences for the protein-coding genes). However, it appears that biomaRt does not have access to an archived version of that Ensembl release. Is there a way around this, such as another source for the annotated reference genome? I see that version 92 has been archived, but I am hesitant to use it because I don't want my transcripts to be mistaken for other transcripts. Here is my code.
library(biomaRt)

# data[1] is the output file name; the remaining elements are Ensembl transcript IDs
getFASTAseq <- function(data) {
  ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "87")
  # gets the peptide sequences for the transcripts
  sequences <- getSequence(id = data[-1], type = "ensembl_transcript_id", seqType = "peptide", mart = ensembl)
  exportFASTA(sequences, file = data[1])
}
You could use the EnsDb annotation resource for Ensembl 87. EnsDb annotation databases from the ensembldb package contain all annotations (genes, transcripts, exons, proteins) for a particular Ensembl release and are fully compatible with the TxDb databases from the GenomicFeatures package. You can get these data resources easily from AnnotationHub.

To get the annotation resource for Ensembl 87:
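A minimal sketch of that lookup, assuming the search terms below match the Ensembl 87 human record in AnnotationHub and that the first hit is the one you want. Check the printed titles before relying on `hits[[1]]`; the variable names `ah`, `hits` and `edb` are just illustrative.

```r
library(AnnotationHub)
library(ensembldb)

# Connect to AnnotationHub (downloads and caches the hub metadata on first use)
ah <- AnnotationHub()

# Search for human EnsDb databases; "87" is meant to narrow this to Ensembl 87.
# Assumption: these terms are specific enough. Inspect the hits to be sure.
hits <- query(ah, c("EnsDb", "Homo sapiens", "87"))
hits

# Assumption: the first hit is the Ensembl 87 database
edb <- hits[[1]]

# Confirm which release and organism were actually loaded
ensemblVersion(edb)
organism(edb)
```

If `ensemblVersion(edb)` does not report 87, pick the right record from `hits` by its AnnotationHub ID instead of by position.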
You can then query this edb object to retrieve e.g. protein sequences for all genes.

If you want to get the sequences only for certain genes:
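Both queries could look roughly like the sketch below. It assumes `edb` is loaded from AnnotationHub as described above (repeated here so the block runs on its own), uses two Ensembl gene IDs (ZBTB16 and BCL2) purely as examples, and writes to a hypothetical file name `proteins_ens87.fa`.

```r
library(AnnotationHub)
library(ensembldb)
library(Biostrings)

# Load the EnsDb; assumption: the first hit is the Ensembl 87 human database
ah <- AnnotationHub()
edb <- query(ah, c("EnsDb", "Homo sapiens", "87"))[[1]]

# Protein sequences for all genes, returned as an AAStringSet
prts_all <- proteins(edb, return.type = "AAStringSet")

# Protein sequences for selected genes only (example gene IDs)
prts_sel <- proteins(
  edb,
  filter = ~ gene_id %in% c("ENSG00000109906", "ENSG00000171791"),
  return.type = "AAStringSet"
)

# Write the selected sequences to a FASTA file (hypothetical file name)
writeXStringSet(prts_sel, filepath = "proteins_ens87.fa")
```

The sequences are named by protein ID, and the transcript each protein belongs to is available in the `tx_id` column of `mcols(prts_sel)`, which helps when matching back to the transcripts in your dataset.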
If you want to filter on transcript identifiers, you would simply use filter = ~ tx_id %in% ... in the query above.

Hope this helps.
cheers, jo