Hello,
I'm new to Bioconductor and Biostrings and would greatly appreciate any help with the task I'm trying to complete. I'd like to identify potential transcription factor binding sites for a few transcription factors using a position weight matrix representing the TF's binding motifs. I want to do this analysis on genes in the human genome, but only consider 1000 bp upstream and 500 bp downstream of the transcription start sites (TSS).
I've been trying to use the matchPWM function but am not sure how to curate a genome dataset so that it only contains the sequence strings I'm interested in. I installed and loaded the TxDb.Hsapiens.UCSC.hg19.knownGene because I think that's a genome database that's getting at what I'm interested in, but am not sure how to proceed from here to narrow the database down.
Thanks,
Duncan
Hi Duncan,
Here are a couple of things that could explain the differences you got between the method used in the generegulation workflow and your own code:
1) By default
promoters()
will extract the regions located 2000 bp upstream and 200 bp downstream of the TSS (quickly check this withargs(promoters)
). Since you said you wanted 1000 bp upstream and 500 bp downstream of the TSS, make sure you specify this in your call topromoters()
:2) I guess the purpose of your last step is to keep only the hits returned by
matchPWM()
that are in the promoter regions. Instead of usingintersect()
for this (whose exact semantic can be a little bit tricky to grasp), I would suggest you do:Erratum: need to use
%within%
or%over%
instead of%in%
in the above code. Edited it to use%within%
.Unlike
interesect()
, this won't truncate TF binding sites that are not fully contained inside a promoter region, or merge TF binding sites that overlap. These are rare events but they can happen. In other words, by subsettinggenome_hit
, the ranges inpromoter_hit
are guaranteed to be a subset of the ranges ingenome_hit
, and with no alteration of the ranges.Anyway, since your ultimate goal is to get the list of transcripts/genes for your TF of interest, some of these steps can be avoided. Also, since the string matching step is the expensive one, it's worth trying to minimize that cost by calling
matchPWM()
/countPWM()
on the promoter regions only (instead of the entire genome). ThegetTFtranscripts()
function below does that:Finally, also please note that what you pass to
getTFtranscripts()
(or tomatchPWM()
, or tocountPWM()
) should be a Position Weight Matrix (PWM) and not a Position Frequency Matrix (PFM).matchPWM()
andcountPWM()
are really supposed to be used with a PWM. Note that a PFM can easily be turned into a PWM by just calling thePWM()
function on it. See?PWM
for more information about this. The PFM -> PWM transformation performed byPWM()
is based on the method described in the Wasserman & Sandelin paper (see?PWM
for the reference to the paper). It's probably not a big deal since the results you will get when using a PWM instead of a PFM will generally be very close or even the same. But sometimes they won't.Cheers,
H.
2 more things:
(1) I meant to use
%within%
instead of%in%
in my previous answer. I just edited it.(2) Here is a version of
getTFtranscripts()
that is a little bit more efficient by taking advantage of the fact thatprom
typically contains many duplicated ranges (because transcripts that belong to the same gene often share the same TSS):Anyway, even with that trick, calling
countPWM()
in ansapply
loop remains a major bottleneck. If we hadvmatchPWM()
andvcountPWM()
(like we do forv/matchPattern()
andv/countPattern()
), that bottleneck would go away, and that could makegetTFtranscripts()
50x faster or more. I'll add these functions to Biostrings and post here when it's ready.H.
Hey H.,
Fantastic! Thanks for the help. With this I was able to create the gene sets I'm interested in.
-Duncan
Hey H.,
Fantastic! Thanks for the help. With this I was able to create the gene sets I'm interested in.
-Duncan