Question

How to find TF binding sites for particular TF's within +- 1000 bases of gene TSS for all genes in human genome?

2

Entering edit mode

duncp92 ▴ 20

@duncp92-8104

Last seen 9.9 years ago

United States

Hello,

I'm new to Bioconductor and Biostrings and would greatly appreciate any help with the task I'm trying to complete. I'd like to identify potential transcription factor binding sites for a few transcription factors using a position weight matrix representing the TF's binding motifs. I want to do this analysis on genes in the human genome, but only consider 1000 bp upstream and 500 bp downstream of the transcription start sites (TSS).

I've been trying to use the matchPWM function but am not sure how to curate a genome dataset so that it only contains the sequence strings I'm interested in. I installed and loaded the TxDb.Hsapiens.UCSC.hg19.knownGene because I think that's a genome database that's getting at what I'm interested in, but am not sure how to proceed from here to narrow the database down.

Thanks,

Duncan

bsgenome transcription factor binding site enrichment analysis matchpwm • 4.2k views

ADD COMMENT • link 9.9 years ago duncp92 ▴ 20

score 2 · Answer 1 · 2015-06-09

Hi Duncan,

A good place to start (if you have not already done so) is to read through our Bioc workflow, "Finding Candidate Binding Sites for Known Transcription Factors via Sequence Matching".

http://bioconductor.org/help/workflows/generegulation/

As mentioned there -- but not yet demonstrated -- DNASE I footprints and ChIP-seq can be very useful in limiting your search.

We will be glad to provide further help once you have extracted what benefit you can from the workflow.

- Paul

score 0 · Answer 2 · 2015-06-09

0

Entering edit mode

duncp92 ▴ 20

@duncp92-8104

Last seen 9.9 years ago

United States

Paul,

Thanks for the response. I had looked through that but reviewed it and now the steps have become more clear to me. What's interesting is that through the method listed there I get about 151 hits for my TF of interest, but another method I tried gave me 88. Here are those steps.

human_promoters <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)

genome_hit <- matchPWM(pcm.rest,Hsapiens,"85%")

promoter_hit <- intersect(genome_hit,human_promoters)

I think I'll stick with the method in the link you provided because I might be doing something incorrect here.

Anyways, I've been able to establish a list of promoter hits which is good. Now I'd like to be able to work backwards from this list and create a gene list from the respective promoters. This is so I can do this step with multiple transcription factors and be able to create gene sets from their promoter hits and compare the gene sets to find common regulatory targets. Any ideas?

-Duncan

ADD COMMENT • link 9.9 years ago duncp92 ▴ 20

2

Entering edit mode

Hi Duncan,

Here are a couple of things that could explain the differences you got between the method used in the generegulation workflow and your own code:

1) By default promoters() will extract the regions located 2000 bp upstream and 200 bp downstream of the TSS (quickly check this with args(promoters)). Since you said you wanted 1000 bp upstream and 500 bp downstream of the TSS, make sure you specify this in your call to promoters():

human_promoters <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene,
                             upstream=1000, downstream=500)

2) I guess the purpose of your last step is to keep only the hits returned by matchPWM() that are in the promoter regions. Instead of using intersect() for this (whose exact semantic can be a little bit tricky to grasp), I would suggest you do:

promoter_hit <- genome_hit[genome_hit %within% human_promoters]

Erratum: need to use %within% or %over% instead of %in% in the above code. Edited it to use %within%.

Unlike interesect(), this won't truncate TF binding sites that are not fully contained inside a promoter region, or merge TF binding sites that overlap. These are rare events but they can happen. In other words, by subsetting genome_hit, the ranges in promoter_hit are guaranteed to be a subset of the ranges in genome_hit, and with no alteration of the ranges.

Anyway, since your ultimate goal is to get the list of transcripts/genes for your TF of interest, some of these steps can be avoided. Also, since the string matching step is the expensive one, it's worth trying to minimize that cost by calling matchPWM()/countPWM() on the promoter regions only (instead of the entire genome). The getTFtranscripts() function below does that:

library(BSgenome)

## Returns a data.frame with 1 row per transcript and 2 columns (tx_name
## and gene_id).
getTFtranscripts <- function(pwm, txdb, genome,
                             upstream=2000, downstream=200, min.score="80%")
{
    prom <- suppressWarnings(
        promoters(txdb, upstream=upstream, downstream=downstream,
                        columns=c("tx_name", "gene_id")))
    prom <- trim(prom)
    genome <- getBSgenome(genome)
    prom_seqs <- getSeq(genome, prom)
    hits_per_prom <- suppressWarnings(sapply(
        seq_along(prom_seqs),
        function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
    ))
    ans <- mcols(prom)[hits_per_prom != 0L, ]
    ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
    as.data.frame(ans)
}

## Then:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
pcm.rest.transcripts <- getTFtranscripts(pcm.rest, txdb, "hg19",
                            upstream=1000, downstream=500, min.score="85%")

Finally, also please note that what you pass to getTFtranscripts() (or to matchPWM(), or to countPWM()) should be a Position Weight Matrix (PWM) and not a Position Frequency Matrix (PFM). matchPWM() and countPWM() are really supposed to be used with a PWM. Note that a PFM can easily be turned into a PWM by just calling the PWM() function on it. See ?PWM for more information about this. The PFM -> PWM transformation performed by PWM() is based on the method described in the Wasserman & Sandelin paper (see ?PWM for the reference to the paper). It's probably not a big deal since the results you will get when using a PWM instead of a PFM will generally be very close or even the same. But sometimes they won't.

Cheers,

H.

ADD REPLY • link 9.9 years ago Hervé Pagès 16k

1

Entering edit mode

2 more things:

(1) I meant to use %within% instead of %in% in my previous answer. I just edited it.

(2) Here is a version of getTFtranscripts() that is a little bit more efficient by taking advantage of the fact that prom typically contains many duplicated ranges (because transcripts that belong to the same gene often share the same TSS):

#.count_pwm_hits_per_promoter <- function(pwm, prom, genome, min.score)
#{
#    genome <- getBSgenome(genome)
#    prom_seqs <- getSeq(genome, prom)
#    suppressWarnings(sapply(
#        seq_along(prom_seqs),
#        function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score)
#    ))
#}

## Same as .count_pwm_hits_per_promoter() above but takes advantage of
## the fact that 'prom' typically contains many duplicated ranges
## (because transcripts that belong to the same gene often share the
## same TSS).
.count_pwm_hits_per_promoter <- function(pwm, prom, genome, min.score)
{
    sm <- selfmatch(prom)
    idx <- which(sm == seq_along(sm))
    uprom <- prom[idx]

    genome <- getBSgenome(genome)
    uprom_seqs <- getSeq(genome, uprom)
    nhit_per_uprom <- suppressWarnings(sapply(
        seq_along(uprom_seqs),
        function(i) countPWM(pwm, uprom_seqs[[i]], min.score=min.score)
    ))

    nhit_per_prom <- integer(length(prom))
    nhit_per_prom[idx] <- nhit_per_uprom
    nhit_per_prom[sm]
}

getTFtranscripts <- function(pwm, txdb, genome,
                        upstream=2000, downstream=200, min.score="80%")
{
    prom <- suppressWarnings(
        promoters(txdb, upstream=upstream, downstream=downstream,
                        columns=c("tx_name", "gene_id")))
    prom <- trim(prom)
    nhit_per_prom <- .count_pwm_hits_per_promoter(pwm, prom, genome,
                                                  min.score)
    ans <- mcols(prom)[nhit_per_prom != 0L, ]
    ans[ , "gene_id"] <- as.character(ans[ , "gene_id"])
    as.data.frame(ans)
}

Anyway, even with that trick, calling countPWM() in an sapply loop remains a major bottleneck. If we had vmatchPWM() and vcountPWM() (like we do for v/matchPattern() and v/countPattern()), that bottleneck would go away, and that could make getTFtranscripts() 50x faster or more. I'll add these functions to Biostrings and post here when it's ready.

H.

ADD REPLY • link 9.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hey H.,

Fantastic! Thanks for the help. With this I was able to create the gene sets I'm interested in.

-Duncan

ADD REPLY • link 9.9 years ago duncp92 ▴ 20

0

Entering edit mode

Hey H.,

Fantastic! Thanks for the help. With this I was able to create the gene sets I'm interested in.

-Duncan

ADD REPLY • link 9.9 years ago duncp92 ▴ 20