Biomart - upstream region of 'coding'
Hi everyone,

I want to perform a motif enrichment analysis, so I need the 500 bp upstream of the coding region plus the first 100 bp of the coding region itself. I think I am set for the coding region, but I wanted to confirm that using coding_gene_flank for the upstream part is correct. Currently, I am doing this:

seq1 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding_gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
seq2 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "coding", mart = ensembl, verbose = TRUE)


Thanks in advance!

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

[1] LC_COLLATE=Portuguese_Portugal.utf8  LC_CTYPE=Portuguese_Portugal.utf8   
[3] LC_MONETARY=Portuguese_Portugal.utf8 LC_NUMERIC=C                        
[5] LC_TIME=Portuguese_Portugal.utf8    

time zone: Europe/Lisbon
tzcode source: internal

Mike Smith
EMBL Heidelberg

This looks like a good start. However, I think you need to consider that your seq2 object might contain multiple sequences. That's because seqType="coding" returns a sequence per transcript, rather than per gene. Given that the transcripts can start in different places, it might not make sense to paste a single upstream flank onto all of them. If you want to do this on a per-transcript basis, you probably want to use type="ensembl_transcript_id" in the first query.
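To illustrate the per-transcript idea, here is a minimal sketch that pairs each flank with the coding sequence of the same transcript. It does not query Ensembl: the two data frames are made-up stand-ins shaped like getSequence() output (one sequence column named after the seqType, one ID column named after the type), and the transcript IDs, sequences, and the n_coding value are all hypothetical.

```r
# Toy stand-ins for two getSequence() results queried with
# type = "ensembl_transcript_id" (IDs and sequences are made up).
flank <- data.frame(
  coding_transcript_flank = c("AAAAACCCCC", "GGGGGTTTTT"),
  ensembl_transcript_id   = c("ENST_A", "ENST_B"),
  stringsAsFactors = FALSE
)
coding <- data.frame(
  coding                = c("ATGGCCTTTAAA", "ATGCCCGGGTGA"),
  ensembl_transcript_id = c("ENST_B", "ENST_A"),  # deliberately in a different order
  stringsAsFactors = FALSE
)

# Pair each flank with the coding sequence of the same transcript,
# rather than relying on the two results being in the same row order.
paired <- merge(flank, coding, by = "ensembl_transcript_id")

# Keep the whole flank plus the first n_coding bases of the coding sequence.
n_coding <- 6  # 100 in the real analysis; small here so the toy data is readable
paired$region <- paste0(
  paired$coding_transcript_flank,
  substr(paired$coding, 1, n_coding)
)
```

Merging on the ID column is the design choice worth keeping: it stays correct even if the two queries return rows in different orders or one of them drops a transcript.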

Thanks, Mike Smith! What I am doing right now is checking whether the upstream + coding sequence appears in the gene sequence (I added an upstream buffer to the gene sequence as well):

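seq3 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "gene_exon_intron", mart = ensembl, verbose = TRUE)
seq4 = biomaRt::getSequence(id = i, type = "ensembl_gene_id", seqType = "gene_flank", upstream = 500, mart = ensembl, verbose = TRUE)
# getSequence() returns data frames, so paste the sequence columns,
# with the upstream flank first and the gene sequence after it
full = paste0(seq4$gene_flank, seq3$gene_exon_intron)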

Then, if the upstream + coding sequence is found in this full region, I save the combination; otherwise, I discard it.
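The check described above could be sketched roughly as follows. This is only an illustration on made-up strings, not the exact code used; the variable names upstream_coding and gene_with_flank are hypothetical, and in the real analysis the strings would come from the getSequence() results.

```r
# Hypothetical toy strings; in the real analysis these come from getSequence().
upstream_coding <- "CCCCCATGGCC"          # coding flank + start of the coding sequence
gene_with_flank <- "TTCCCCCATGGCCGTAAGT"  # gene flank + exon/intron gene sequence

# fixed = TRUE matches the string literally instead of as a regular expression.
keep <- grepl(upstream_coding, gene_with_flank, fixed = TRUE)

# A candidate that does not occur in the genomic sequence is rejected.
reject <- grepl("CCCCCATGAAA", gene_with_flank, fixed = TRUE)
```

One caveat with this kind of check: the "coding" sequence is spliced, so if the first coding exon is shorter than the stretch being tested, the match against the unspliced gene sequence will fail even though the flank itself is correct.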


