Question

Group GRanges and keep metadata

1

Entering edit mode

likeeatingpizza ▴ 10

@likeeatingpizza-10790

Last seen 8.9 years ago

This question comes from a quiz exercise of the Coursera Bioconductor course. Since over there the class has been totally abandoned from the start by the tutors/mentors (supposing there are any) and apparently Mr Kasper Hansen has other priorities than checking the discussion board and helping his students, I have no other option than to post here.

I have a GRanges of H3K27me3 histone modification for the chr22 of H1 cell line that looks like this

> H1.chr22

GRanges object with 4068 ranges and 6 metadata columns:
         seqnames               ranges strand   |        name     score signalValue
            <Rle>            <IRanges>  <Rle>   | <character> <numeric>   <numeric>
     [1]    chr22 [16554493, 16554606]      *   | Rank_121761        23     2.36555   
     [2]    chr22 [16867445, 16867558]      *   |  Rank_98479        35     3.13762   
     [3]    chr22 [17270830, 17270985]      *   |  Rank_98480        35     3.13762   
     [4]    chr22 [17275375, 17275501]      *   |  Rank_78285        43     3.55442   
     [5]    chr22 [17325983, 17326150]      *   |  Rank_84671        39     3.22958

I have retrieved a vector of fold-change enrichment of ChIP signal for the same chr and cell line, that looks like this:

> fc_signal
GRanges object with 341236 ranges and 1 metadata column:
           seqnames               ranges strand   |            score
              <Rle>            <IRanges>  <Rle>   |        <numeric>
       [1]    chr22 [16554493, 16554502]      *   | 3.24963998794556
       [2]    chr22 [16554503, 16554530]      *   | 4.06205987930298
       [3]    chr22 [16554531, 16554542]      *   | 3.24963998794556
       [4]    chr22 [16554543, 16554544]      *   | 2.04167008399963
       [5]    chr22 [16554545, 16554565]      *   |          1.53125

As you can see the fc_signal ranges are much shorter (in width) but altogether they cover the same intervals as H1.chr22. Indeed their difference is zero

> setdiff(fc_signal, H1.chr22)
GRanges object with 0 ranges and 0 metadata columns:

I have to compute the average score in fc.signal for each region in the H1.chr22 GRanges. It was suggest to compute this using “Views”.

In other words, I need to find a way to "group" o "map" the fc_signal short ranges on the bigger ones, and getting the corresponding average value of their score. I have tried to do both subsetbyOverlaps or intersect, but other than the fact that I just get back the same query GRanges, I also loose the metadata values, so no chance to average them. I also don't see how could I create a Views with these two GRanges. Should I map them on the chr22 sequence (DNAString)?

I hope the question is clear

granges metadata coursera • 2.3k views

ADD COMMENT • link updated 8.9 years ago by Michael Lawrence ★ 11k • written 8.9 years ago by likeeatingpizza ▴ 10

score 2 · Accepted Answer · 2016-06-06

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.4 years ago

United States

The fc_signal object is a run-length encoding of a vector using a GRanges. You can convert that to an Rle and then form views on that to aggregate:

rle <- coverage(fc_signal, weight="score")
viewMeans(Views(rle, H1.chr22))

As an alternative to using Views, you could use Lists:

mean(rle[H1.chr22])

ADD COMMENT • link 8.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

THANK YOU Sir, that's exactly what I wanted to do. Should have thought of converting into an Rle... too many new different data classes/container in Bioconductor, I still have to get comfortable using them properly.

ADD REPLY • link 8.9 years ago likeeatingpizza ▴ 10

0

Entering edit mode

To be precise, to construct a Views on a GRanges this needs to be coerced as RangesList otherwise it will throw an error. This is the code that works:

vi <- Views(rle$chr22, as(H1.chr22, "RangesList")$chr22)

ADD REPLY • link 8.9 years ago likeeatingpizza ▴ 10

0

Entering edit mode

Thanks for catching that mistake, yes, you need to convert to RangesList, but you can do that across the entire genome:

vi <- Views(rle, as(H1.chr22, "RangesList"))

That returns a ViewsList, and stuff like viewMeans() works directly on those.

ADD REPLY • link 8.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Wasn't there talk at some point about modifying Views to support GRanges? That would allow convenient things like this:

mcols(H1.chr22)$mean_score <- viewMeans(Views(rle, H1.chr22))

Currently this is a bit of a pain because coercing a GRanges into a RangesList modifies the order.

ADD REPLY • link 8.9 years ago Jeff Johnston ▴ 90

0

Entering edit mode

If order is an issue, the List approach in the answer is probably the way to go for now.

ADD REPLY • link 8.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Absolutely. Note that you can already do this (i.e. specify views with a GRanges) on a BSgenome object. But we should also be able to do this on a RleList object. I'm raising the priority for this in my TODO list. Thanks for the reminder.

H.

ADD REPLY • link 8.9 years ago Hervé Pagès 16k