Hi,
I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:
library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)
However, some regions overlap with coding regions upstream (other genes).
I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).
I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.
example:
upstream region:
========================(TSS)gene
overlapping gene:
======
========================(TSS)gene
If I remove it, I am left with:
===== =============(TSS)gene
And I want to retrieve only:
=============
Thanks in advance!
And FWIW here is the fast version of it:
This version is semantically equivalent to my previous "naive" version but only takes 0.062 s on my laptop so is about 38000 times faster ;-)
Cheers,
H.
Wow! thank you so much for your help! it worked perfectly!