Is there any thought of introducing the windowed mode into the subread software featureCounts. Perhaps this is the wrong forum to ask :) Of course defining the windows isn't too problematic.
Anyhow, what I really would appreciate would be a constructor parsing an featureCounts file and returning a RangedSummarizedExperiment object that could be used with csaw.
The literal answer to your question is that it's very easy to construct a RangedSummarizedExperiment from the output of featureCounts (at least, if you're calling the latter from Rsubread):
I assume you did counting without meta-feature summarization (as that wouldn't make much sense for windows) so Start and End should just be numeric values rather than comma-separated lists.
Now, for the actual answer to your question, some extra fields have to be added to rse to make it behave like the output of windowCounts when you use the downstream functions in csaw. At the very least, you'll need to fill in rse$totals to specify the sequencing depth of each library, and also metadata(rse)$rlen to specify the final extension length for each read (if you performed directional extension, otherwise set it to the read length). I would say that, if you want to perform an analysis in csaw, it's probably easier to stick with the csaw functions unless you know what you're doing. windowCounts mightn't be the fastest, but it does the job.
I think library(GenomicRanges); as(reg.out, "GRanges") will work (makeGRangesFromDataFrame() if one wants more control) and might be a shorter way to create win.coords.
I only wanted to count start sites, and hence I kept the ext-parameters to 1. The above works, and takes me through the pipeline with satisfying results at the end. That's about as much I can say about the correctness :) For instance, with another setup I assume the param object has to be modified accordingly.
Well, if you keep ext to 1, that implies you're only considering the 5' end of each read during counting (as in the literal 5' base position of the read, and nothing else). By default, featureCounts should count any overlaps between the read and the window, so if you're using the default settings you'd be better off setting ext to the read length. Obviously, if you did any extension with featureCounts, then this should be reflected in the value of ext and final.ext. This value is important for scaling the threshold (computed based on binned counts) during filtering of the relatively smaller windows.
Also, setting totals to the column sums assumes you count each read exactly once into any window, otherwise it won't be quite correct. Storage of incorrect totals will be problematic if you're computing normalization factors based on bin counts to eliminate composition biases, as this relies on the totals being the same across different counting schemes.
I'm still a bit bemused about why you wouldn't want to use windowCounts - it would certainly save you more time trying to figure out whether or not the rest of the code is interacting properly with your custom RangedSummarizedExperiment object, and it's not like you can't save the returned object with saveRDS for use in future R sessions. If you insist on doing it your own way, I can only advise you to proceed with caution. Mind you, if you don't do any csaw-specific filtering or normalization, then you might as well skip directly to creating a DGEList for use with edgeR, and avoid the middleman altogether.
I think
library(GenomicRanges); as(reg.out, "GRanges")
will work (makeGRangesFromDataFrame()
if one wants more control) and might be a shorter way to createwin.coords
.Thanks for the answer,
I figured out how to generate the objects, I think. I didn't go through Rsubread in the end
I only wanted to count start sites, and hence I kept the ext-parameters to 1. The above works, and takes me through the pipeline with satisfying results at the end. That's about as much I can say about the correctness :) For instance, with another setup I assume the param object has to be modified accordingly.
Well, if you keep
ext
to 1, that implies you're only considering the 5' end of each read during counting (as in the literal 5' base position of the read, and nothing else). By default,featureCounts
should count any overlaps between the read and the window, so if you're using the default settings you'd be better off settingext
to the read length. Obviously, if you did any extension withfeatureCounts
, then this should be reflected in the value ofext
andfinal.ext
. This value is important for scaling the threshold (computed based on binned counts) during filtering of the relatively smaller windows.Also, setting
totals
to the column sums assumes you count each read exactly once into any window, otherwise it won't be quite correct. Storage of incorrect totals will be problematic if you're computing normalization factors based on bin counts to eliminate composition biases, as this relies on thetotals
being the same across different counting schemes.I'm still a bit bemused about why you wouldn't want to use
windowCounts
- it would certainly save you more time trying to figure out whether or not the rest of the code is interacting properly with your customRangedSummarizedExperiment
object, and it's not like you can't save the returned object withsaveRDS
for use in future R sessions. If you insist on doing it your own way, I can only advise you to proceed with caution. Mind you, if you don't do any csaw-specific filtering or normalization, then you might as well skip directly to creating aDGEList
for use with edgeR, and avoid the middleman altogether.