Dear all,
I would like to generate a network between human TFs and their predicted targets using sequence motif information. I have been following the workflow described here (https://www.bioconductor.org/help/workflows/gene-regulation-tfbs/) with additional methodology here (https://support.bioconductor.org/p/68495/#68597). See MWE at end of this email.
However, I’m a bit confused about some of the results that I’m getting. When I look for targets of MYC (a transcription factor known to regulate thousands of genes) with this method, I get 2553 unique gene IDs. Seems reasonable.
But when I do the same thing with TP53 (also known to transcriptionally regulate thousands of genes) and I only get 4? Seems not reasonable.
I have also tried to increase the upstream region to 5kb in case TP53 motif binding sites are particularly distant, but then the numbers of targets identified for TP53 and MYC are 12 and 7009 respectively.
Can anyone spot an obvious mistake? Or am I missing a biological issue specific to TP53?
Thanks in advance,
Lisa
=====================================================
library(MotifDb) library(GenomicFeatures) library(org.Hs.eg.db) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) getTFtranscripts <- function(pwm, txdb, genome, upstream=2000, downstream=200, min.score="80%") { prom <- suppressWarnings( promoters(txdb, upstream=upstream, downstream=downstream, columns=c("tx_name", "gene_id"))) prom <- trim(prom) genome <- getBSgenome(genome) prom_seqs <- getSeq(genome, prom) hits_per_prom <- suppressWarnings(sapply( seq_along(prom_seqs), function(i) countPWM(pwm, prom_seqs[[i]], min.score=min.score) )) ans <- mcols(prom)[hits_per_prom != 0L, ] ans[ , "gene_id"] <- as.character(ans[ , "gene_id"]) as.data.frame(ans) } TP53.jaspar <- query(query(MotifDb,"TP53"),"JASPAR_CORE")[[1]] TP53.targets <- getTFtranscripts(TP53.jaspar, TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19", upstream=1000, downstream=250, min.score="85%") print( TP53.targets ) print( length( unique(TP53.targets$gene_id[!is.na(TP53.targets$gene_id)] ) ) ) MYC.jaspar <- query(query(MotifDb,"MYC"),"JASPAR_CORE")[[1]] MYC.targets <- getTFtranscripts(MYC.jaspar, TxDb.Hsapiens.UCSC.hg19.knownGene, "hg19", upstream=1000, downstream=250, min.score="85%") print( MYC.targets ) print( length( unique(MYC.targets$gene_id[!is.na(MYC.targets$gene_id)] ) ) )
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.38.0
[4] rtracklayer_1.30.1 org.Hs.eg.db_3.2.3 RSQLite_1.0.0
[7] DBI_0.3.1 GenomicFeatures_1.22.0 AnnotationDbi_1.32.0
[10] Biobase_2.30.0 GenomicRanges_1.22.0 GenomeInfoDb_1.6.0
[13] MotifDb_1.12.0 Biostrings_2.38.0 XVector_0.10.0
[16] IRanges_2.4.1 S4Vectors_0.8.0 BiocGenerics_0.16.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 GenomicAlignments_1.6.1 BiocParallel_1.4.0 tools_3.2.2
[5] SummarizedExperiment_1.0.0 lambda.r_1.1.7 futile.logger_1.4.1 futile.options_1.0.0
[9] bitops_1.0-6 biomaRt_2.26.0 RCurl_1.95-4.7 Rsamtools_1.22.0
[13] XML_3.98-1.3
Thanks for your response Paul, I'm glad my question was clear.
Ahh, didn't clock the length of the motif. That would explain it. I will consult with the people at JASPAR on that one (and other experts locally here), but would also appreciate any answers with respect to trimming the motif in this thread also.