BSgenome.Mmusculus.UCSC.mm10
and BSgenome.Mmusculus.UCSC.mm10.masked
contain the same information, with unknown nucleotides coded as either 'N'
or '#'
:
bsgenome <- BSgenome.Mmusculus.UCSC.mm10::BSgenome.Mmusculus.UCSC.mm10
maskedbs <- BSgenome.Mmusculus.UCSC.mm10.masked::BSgenome.Mmusculus.UCSC.mm10.masked
BSgenome::alphabetFrequency(bsgenome$chr1)
BSgenome::alphabetFrequency(maskedbs$chr1)
So it comes as no surprise that matching BSgenome.Mmusculus.UCSC.mm10
gives identical results as matching BSgenome.Mmusculus.UCSC.mm10.masked
:
cas9seq <- "CTTATATTGTCTCCAGCAGAAGG"
Biostrings::countPattern(cas9seq, bsgenome$chr1) # 59
Biostrings::countPattern(cas9seq, maskedbs$chr1) # 59
The matching performance also seems to be very similar:
benchmark <- microbenchmark::microbenchmark
benchmark(Biostrings::countPattern(cas9seq, bsgenome$chr1), times = 50L)
benchmark(Biostrings::countPattern(cas9seq, maskedbs$chr1), times = 50L)
benchmark(Biostrings::vcountPDict(cas9seq, bsgenome, min.mismatch = 0, max.mismatch = 0), times = 1)
benchmark(Biostrings::vcountPDict(cas9seq, maskedbs, min.mismatch = 0, max.mismatch = 0), times = 1)
Which makes one wonder: what is actually the use of a masked BSgenome? :-D
It is not true in general that a "naked" BSgenome object and its corresponding "masked" BSgenome object contain the same information, with unknown nucleotides encoded as either
N
or#
. That would be true if the latter only contained the mask of assembly gaps (AGAPS masks) and intra-contig ambiguities (AMB masks), which are indeed encoded asN
in the "naked" version. (To be precise, in some rare occasions, the intra-contig ambiguities are encoded with other IUPAC ambiguity letters, e.g. BSgenome.Hsapiens.UCSC.hg18 has someR
's andM
's on chr3.)However most "masked" BSgenome objects also contain the RepeatMasker (RM) and Tandem Repeats Finder (TRF) masks which are NOT encoded as
N
in the "naked" version.One important bit here is that masks can be activated or deactivated at the chromosome level. For example to activate the RepeatMasker (RM) mask on Human chr1, do something like:
Note that only the mask of assembly gaps (AGAPS) and intra-contig ambiguities (AMB) are active by default. So by default, yes, matching gives identical results on the "naked" and "masked" BSgenome objects but not if you activate the other masks.
The BSgenome.Mmusculus.UCSC.mm10.masked case is special because this "masked" BSgenome only contains the AGAPS and AMB masks. Furthermore, the AMB masks are empty (in other words, the assembled sequences don't contain ambiguity letters inside the contigs), so could in theory be removed but that's another story. In this particular case, the "naked" and "masked" objects contain the same information. A minor advantage of using the latter is speed, even though, as you noticed, the speed gain is somewhat limited. Another minor advantage is the possibility to do fuzzy matching (by specifying
fixed=FALSE
) without getting zillions of meaningless matches to the AGAPS regions (i.e. inter-contig) that you would get if you were using the "naked" BSgenome. Although this could also be prevented by usingfixed="subject"
on the "naked" BSgenome but this means that you know aboutfixed="subject"
. This is starting to be pretty advanced stuff so I should probably stop here.Hope this helps,
H.
Thank you Herve. As with your other posts, very illuminating.