Hi,
I have a question about interpreting my cross-correlation plots from native ChIP-seq data with 100bp SE reads aligned uniquely to the genome using bowtie2. The bam files have duplicates marked and removed for this step.
To estimate fragment length I used correlateReads()
from the csaw package then plotted a very different graph than the smooth one showed in the user guide... I still see a prominent peak occurring at my expected fragment size of 147 bp for histone data, but it's hard to interpret what the other peaks are. I don't have a lot of data-sets to test this on, but how critical is the smoothness of this plot? From the commentary and other readings I understand that a cross correlation plot will have a second peak at the read-length if the efficiency was poor or duplicates not removed, but there is no mention about jaggedness. Based on the browser tracks and peak-calling results I always assumed my efficiency of pull down was very good, as the data is very clean and clear.
At first I thought the shape may be due to read-trimming and processing, but raw and processed .bam files give the same result.
Any help with interpretation is appreciated.
Cross correlation plot:
Code:
library(rtracklayer) ch <- import.chain("~/ChIPdata/reference/ChainFiles/mm9ToMm10.over.chain") original <- import("~/ChIPdata/reference/ChainFiles/mm9-blacklist.bed") blacklist <- liftOver(x=original, chain=ch) blacklist <- unlist(blacklist) library(csaw) x <- correlateReads(bamfile, max.dist=500, param=readParam(minq=20, discard=blacklist, dedup=T)) plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
Hi Aaron, thanks for your answer. Sorry to not mention that this is H3K4me3 data from mouse sperm cells which are a-typical and have only ~1% of the nucleosomes present in somatic cells. These nucleosomes are mainly retained at TSS and preferentially enrich at CpG dense regions. Median peak width is around 2000 bp. Here's a browser shot of a couple peak regions... To me it looks bimodal, but there are some broad regions throughout as well.
Re: duplicates, I worded it wrong, I marked them and only set
readParam(dedup=T)
for this plot as per the documentation... All that changed between the two was the presence of another wobbly peak at the 100bp size of my read length.I like your thinking about the double helix and MNase sensitivity. I'll look into that.
Thanks again
Those peaks are on the larger side, so perhaps it's not surprising to fail to see any strand bimodality. Generally, it seems that as soon as peaks get above 1 kbp, the bimodality becomes less pronounced; you're basically looking for a 100 bp interval of strand-specific coverage on the flanks of the bound region, which will be harder to spot as the region size increases. In the tracks you've shown, there's not much strand bimodality, assuming the bars represent reads coloured by strand - I would have expected a pure red block adjacent to a pure blue block, rather than the colours being mixed together.
In short, I wouldn't worry about the strange cross-correlation plot. Yes, it's odd, but so is your biological system, and as long as you're confident that the enriched regions aren't caused by something else (e.g., repeats), then you should just proceed with your analysis. In fact, for large regions, the fragment length doesn't really matter because the count is mainly determined by the window width, e.g., if you're using windows of 2 kbp, then it doesn't make much difference to give or take 100 bp.
Also, try to get in the habit of typing out
TRUE
in its entirety. Otherwise you can get surprises like this:Will do. Thanks a lot!