Hi,
I've used Bismark v0.14.1 (http://www.bioinformatics.babraham.ac.uk/projects/bismark/) to align bs-seq reads and call methylated C's. Now I'd like to perform a differential methylation analysis using bsseq v1.6.0.
I've generated the coverage files using bismark_methylation_extractor
but I don't understand whether I should use the 1-based files (.cov) or the 0-based ones (.zero.cov).
The bsseq::read.bismark()
help says:
• The genomic co-ordinates of the Bismark output file may be zero-based or one-based depending on whether the ‘--zero_based’ argument is used. Furthermore, the default co-ordinate system varies by version of Bismark. bsseq makes no assumptions about the basis of the genomic co-ordinates and it is left to the user to ensure that the appropriate basis is used in the analysis of their data.
which I find a bit ambiguous.
Should I use 1-based or 0-based coverage files from Bismark to perform a differential methylation analysis in bsseq?
Thank you.
Thank you. I don't really have a preference: I'm just worried that bsseq might interpret the two formats differently. The coverage files are converted to GRanges internally, which are 1-based. My question basically is: Is bsseq assuming that the coverage files are 1-based when converting them to GRanges?
This shouldn't be a mystery, given that we can inspect the code. In
read.bismark()
we have this:And read.bismarkCovRaw() will almost certainly be hidden in the namespace, so we use the
:::
function to get it out.Which shows pretty clearly what assumptions are being made.
Thanks! From that code it looks like both 0-based and 1-based files should be imported correctly as only the start coordinate is used.
I still think the docs could have been clearer, as shown by the fact that we had to look at a hidden function to find out - I'll ask Peter to clarify.
Thanks, Jim. Yes, we left it to the user to decide whether they wish to use 0-based or 1-based co-ordinates. Moreover, we simply construct a
GRanges
object using the given co-ordinates, without trying to adjust for whether the co-ordinates are 0-based or 1-based. The reason is thatbsseq::read.bismark()
cannot infer from the.cov
file alone whether it is 0-based or 1-based (some versions of Bismark encode this in the file name, but not all as I recall). As noted in the documentation, the default basis of the co-ordinate system has varied over different versions of Bismark.FWIW, I use 1-based co-ordinates in my analyses, as is the convention for much Bioconductor infrastructure as Jim notes.
Pete (author of
bsseq::read.bismark()
)Hi Peter, thanks for chiming in!
Any chance the help for that function could be clearer? I understand that both 0-based and 1-based files are imported correctly: if so, is it worth saying that more explicitly?
Cheers,
I think if you look at the code and think about it a bit, you will see that the zero-based data are NOT imported 'correctly', if I interpret that word to mean 'imported as zero-based, half open'. They are instead imported as zero-based, closed, which is, IMO as correct as possible, given that
In other words, zero-based, half open would say that a SNP at the first position of chr1 is at (0,1), whereas one-based, closed, would say it is at (1,1). So if you import zero-based data you get
And if the data are one-based, closed, you get
So in the former case you have to shift everything to the right by one position to be consistent with, like everything else in Bioconductor. So this is correct inasmuch as it does exactly what the help page says it will do, but not correct if you are thinking you can just blindly import zero-based data and have everything work out for you.
@Peter - Am I wrong to think that you could use the end position from the Bismark output, regardless of the counting scheme used, and thus not need to care if it is zero or one based? I think that would work, but may well be missing some important caveat.
@Jim That's an excellent idea. I'll check that bismark's zero-based co-ordinates were always half-open (I have a sketchy recollection that some versions didn't do this properly) and implement it for the next Bioconductor release.
To close this, I investigated the option of using the end co-ordinate as per Jim's suggestions. Unfortunately, the use of open vs. closed, zero-based vs. one-based co-ordinates has not been consistent across the development of bismark. Just by way of example:
v0.7.9
bedGraph
output is closed, 0-based, e.g., [665, 665]v0.12.3
bedGraph
output is half-open, zero-based, e.g., (665, 666]v0.12.3
cov
output is closed, 1-based, e.g., [666, 666]v0.12.3
cov
output with--zero_based
flag is half-open, 0-based, e.g., (665, 666]It's very difficult to support all these options, especially since there is no version information in the files we import. So we will continue to leave this decision to the user and have added some comments based on this thread to the docs in the BioC 3.3 release version.
Understood. Thanks Peter!
Got it now, thanks.
So, to answer my initial question, one should use 1-based coverage files to ensure compatibility with GRanges and other Bioconductor components, as 0-based coverage files will result in an inaccurate genomic range (unless you want to manually shift all coordinates after creating the GRanges objects).
I second the option to use the end coordinate from the Bismark output in the future as it may be less problematic (if compatible with older Bismark versions obviously).
I'm happy to clarify the documentation and suggestions for a sentence or two are most welcome! Like Jim, I don't find the current message ambiguous but I'm biased due my familiarity with the code.
In light of the discussion above, I would simply remind the user that Bioconductor and GRanges use a 1-based system and that is the preferred format to import.
See also my own answer: A: How to correctly import Bismark coverage files into bsseq?
Thanks!