Hi,
When I'm using the import.bw
from rtracklayer
, if two adjacent positions have the same scores, they will be put into a single GRanges, i.e., position 1 and 2 in this case
> gr1
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr2 1-2 * | 1
[2] chr2 4 * | 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
But the GRanges I wish to get back is like this one,
> gr0
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chr2 1 * | 1
[2] chr2 2 * | 1
[3] chr2 4 * | 1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The reason is I want to sum the scores across all sites, base by base, while
> sum(gr0$score)
[1] 3
> sum(gr1$score)
[1] 2
I tried to find parameters in the function to control this behaviour but get no luck. Maybe I miss something.
I do know a function makeGRangesBRG
from BRGenomics
can help me with that but I'm wondering if there is a more native and direct way?
Thanks,
Yixin
Thanks for your reply and sorry for the unclarity, James! I'm experimenting it a couple of times more and now realize actually it depends on how the BigWig files are written. It's good to know the behaviour of rtracklayer actually works as expected. See the following reproducible case,
Since I cannot control how the BigWig files are written, I guess my real problem is, is there any way to force rtracklayer to read gr1 as gr0, or maybe convert gr1 to gr0 using some common packages like
GenomicRanges
? Imagine it's genomic data so I have ~100M positions in reality.I get what you want to do, but I don't understand why. A wiggle file is meant to define a genomic region with a 'score', which can then be used to plot a rectangle on a genome, where the base of the rectangle is defined by the position and width and the height is defined by the score. So if you have a wiggle file (or a bigWig of the same), and there is a region on chr2 from 1-2, with a score of 1, that means there is a genomic region that spans both those bases, the height of which is 1. If the first range was chr2:1-1e6, with a score of 1, would you want a
GRanges
with 1M rows, each of which having a score of 1? To what end? I mean if you plotted that, you know what it would look like? A box with height 1, from 1 - 1e6 on chr2, which is the same exact thing you would get if you didn't expand to 1M rows.But maybe I completely misunderstand your use case, so if you can explain it?
Thanks James, I can see your point! The reason is the data I'm dealing with is quite sparse, most of time I have regions with width 1 and a score corresponding to a single position. However, sometimes the width could be 2, 3, or a few more bases and they may have same scores. I previously didn't notice this fact, so
sum(gr0$score)
(which is 3) is not equal tosum(gr1$score)
(which is 2), and it eventually messes up my analysis.Based on what you are saying, I think one way is to convert gr1 to gr0 so my sum is something I wish to get. Alternatively, now I think I can do
sum(width(gr1) * gr1$score)
, which is also 3 in total.I really appreciate your questions and answers, which help me figure out which steps cause the issues and make me re-think what I actually need!