Hi,
When using summarizeOverlaps to count reads overlapping a particular feature, what is the minimum number of bases required to overlap in order to be called a hit (ie. 1 base, 2 bases, 3 bases)? I'm using mode="Union", inter.feature="FALSE" (ie. type="any"), ignore.strand=FALSE.
I'm curious because I am also using bedtools (v 2.25) coverage command to determine the %sequence coverage of the particular feature, and the read counts I get from bedtools is different from summarizeOverlaps for many loci (some are actually the same). By default bedtools coverage uses 1 base overlap necessary for a hit to be called. I've also made sure that bedtools coverage is assessing the exact same loci as summarizeOverlaps.
Thanks,
Thomas
bedtools is reporting different counts based on percent overlap. That is not the case with mode=Union. By default 'Union' drops reads that overlap more than one feature. If you don't want to drop reads that overlap multiple features set inter.feature=FALSE. If you do this you'll likely see the counts go up. This is not related to how much (percent) a single read overlaps a single feature but instead to a single read overlapping more than one feature.
The 'Union' overlap situations are shown in the graphic in the vignette.
Valerie
"I'm using mode="Union", inter.feature="FALSE" (ie. type="any"), ignore.strand=FALSE."
I appreciate your explanation, but I feel that we are still not on the same page. The quote above is from my original posting on this topic. I have been using ignore.strand=FALSE. I understand how this function works, which is also why I stated it as type = any (if one were thinking about countOverlaps()). I learned this from an earlier exchange you had on this forum with Thomas Girke. It's really helpful.
summarizeOverlaps mode ignoring inter feature overlaps
I appreciate you continuing to make efforts to explain the difference, but setting inter.feature=FALSE does not change my point at all, because I have always had this setting in effect.
edit: For extra clarity, I should also say that when I use bedtools coverage, I also use the flag -s to enfore standedness (same as ignore.strand=FALSE in summarizeOverlaps()). I've made every effort to keep the settings consistent as possible, but the outputs only come close when I add the flag -F 0.10 to bedtools coverage.
From http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html we have
instead of 3, 1, 0 reported on that page. This makes sense when interpreting ranges as 0-based, half-closed, versus the Bioc 1-based, closed intervals. I believe rtracklayer::import() does the right thing in terms of BED input. It might therefore help to understand where the features are coming from, to explore different scenarios using Union() and simple GRanges like the example above, and to explore input of specific regions using readGAlignments() (or readGAlignmentList()).
Thank you so much! That's a good idea. I'll give that a shot with my data and see how things shake out.