I tried to use csaw to compare samples from MeDIP data. My dataset has 4 conditions namely group1, group2, group3 and group4 with one ChIP and Input sample for each condition (unfortunately, we have only a single replicate per condition for now; because of this, I used glmFit function with a dispersion value of 0.05 as suggested in the user manual). But then, after analyzing the data as outlined below, I have a few questions. It would be really helpful if someone could share their insight on this.
- I‘ve seen that the authors suggested a window width of 10 for transcription factor experiments and 150 for histone modification data. Is there any ‘specified’ width threshold for MeDIP data?
- Are there any specific guidelines for comparing ChIP samples to Input samples (detection of methylated regions) in csaw? As I would be looking for only regions enriched in ChIP sample compared to Input sample, I guess, the regions with more number of windows satisfying LogFC < -0.5 would represent noisy regions. I filtered out all windows that have log fold change less than zero (showing enrichment in input samples) before merging windows to regions using ‘mergeWindows’ function. Is this a valid approach?
max <- 650
width <- 150
filter <- 20
pet.param <- readParam(max.frag=max, pe="both",dedup=TRUE)
petBAMFile <- c("group1_sorted.bam","group1I_sorted.bam","group2_sorted.bam","group2I_sorted.bam","group3_sorted.bam","group3I_sorted.bam","group4_sorted.bam","group4I_sorted.bam")
data <- windowCounts(petBAMFile, param=pet.param,width=width,filter=filter,ext=250)
binned <- windowCounts(petBAMFile, bin=TRUE, width=10000, param=pet.param,ext=250)
normfacs <- normalize(binned)
sampleinfo <- data.frame(c("group1","group1I","group2","group2I","group3","group3I","group4","group4I"),c("group1","group1I","group2","group2I","group3","group3I","group4","group4I"))
names(sampleinfo) <- c("sampleID","group")
myfactor <- factor(sampleinfo$group)
y <- asDGEList(data,norm.factors=normfacs,group=myfactor)
design <- model.matrix(~0+group, data=y$samples)
fit.norep <- glmFit(y, design, dispersion=0.05)
For comparing ChIP to input sample, I used only windows with a positive fold change (ChIP versus Input) for merging into regions as below.
results_group1 <- glmLRT(fit.norep, contrast=c(1,-1,0,0,0,0,0,0))
up_group1 <- results_group1$table$logFC > 0
results_up_group1 <- results_group1$table[up_group1,]
data_group1 <- data[up_group1]
merged_group1 <- mergeWindows(rowRanges(data_group1), tol=1000L)
tabcom_group1 <- combineTests(merged_group1$id, results_up_group1)
anno_group1 <- detailRanges(merged_group1$region, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene,orgdb=org.Hs.eg.db, promoter=c(3000, 1000), dist=10000)
write.table(data.frame(as.data.frame(merged_group1$region)[,1:3], tabcom_group1, anno_group1),file="group1.regions", row.names=FALSE, quote=FALSE, sep="\t")
For comparison across conditions, windows were not filtered based on fold change.
results_group1_group3 <- glmLRT(fit.norep, contrast=c(1,0,0,0,-1,0,0,0))
merged <- mergeWindows(rowRanges(data), tol=1000L)
tabcom_group1_group3 <- combineTests(merged$id, results_group1_group3$table)
anno <- detailRanges(merged$region, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene,orgdb=org.Hs.eg.db, promoter=c(3000, 1000), dist=10000)
write.table(data.frame(as.data.frame(merged$region)[,1:3], tabcom_group1_group3, anno),file="group1_group3.regions", row.names=FALSE, quote=FALSE, sep="\t")
> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.22.13
[3] org.Hs.eg.db_3.2.3
[4] RSQLite_1.0.0
[5] DBI_0.3.1
[6] AnnotationDbi_1.32.3
[7] edgeR_3.12.0
[8] limma_3.26.9
[9] csaw_1.4.1
[10] SummarizedExperiment_1.0.2
[11] Biobase_2.30.0
[12] GenomicRanges_1.22.4
[13] GenomeInfoDb_1.6.3
[14] IRanges_2.4.8
[15] S4Vectors_0.8.11
[16] BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] XVector_0.10.0 zlibbioc_1.16.0 GenomicAlignments_1.6.3
[4] BiocParallel_1.4.3 tools_3.2.4 lambda.r_1.1.7
[7] futile.logger_1.4.1 rtracklayer_1.30.4 futile.options_1.0.0
[10] bitops_1.0-6 biomaRt_2.26.1 RCurl_1.95-4.8
[13] Rsamtools_1.22.0 Biostrings_2.38.4 XML_3.98-1.4