Compute average score across multiple bed files
2
0
Entering edit mode
jinseok • 0
@jinseok-16651
Last seen 6.3 years ago

I would like to compute the average score across multiple bed files (say from ChIP-seq) for a specific genomic region.

Let's say I have the following three bed files and I wish to compute the average score for chr1:10-20

<caption>a.bed</caption>
chr start end score
chr1 10 20 3

 

<caption>b.bed</caption>
chr start end score
chr1 12 14 3

 

<caption>c.bed</caption>
chr start end score
chr1 16 18 3

 

The desired output I wish to obtain is the following

<caption>desired output</caption>
chr start end mean.score
chr1 10 11 1
chr1 12 14 2
chr1 15 15 1
chr1 16 18 2
chr1 19 20 1

 

What is the best way or computationally inexpensive way of computing this without creating a data frame with the score column populated for every single bp for every single file?

chip-seq granges R bed files • 2.5k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode
lee.s ▴ 70
@lees-15179
Last seen 5.1 years ago

Updated to reflect Michael's comments:

Here's a way with plyranges and one with GenomicRanges.

suppressPackageStartupMessages(library(plyranges))

a <- GRanges("chr1:10-20", score = 3)
b <- GRanges("chr1:12-14", score = 3)
c <- GRanges("chr1:16-18", score = 3)

bind_ranges(a,b,c) %>% 
  disjoin_ranges(score = sum(score) / 3L)
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <numeric>
#>   [1]     chr1     10-11      * |         1
#>   [2]     chr1     12-14      * |         2
#>   [3]     chr1        15      * |         1
#>   [4]     chr1     16-18      * |         2
#>   [5]     chr1     19-20      * |         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# with vanilla GenomicRanges
abc <- c(a,b,c)
r <- disjoin(c(a,b,c), with.revmap = TRUE, ignore.strand = TRUE)
r$score <- sum(extractList(score(abc), r$revmap))/3L
r
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames    ranges strand |        revmap     score
#>          <Rle> <IRanges>  <Rle> | <IntegerList> <numeric>
#>   [1]     chr1     10-11      * |             1         1
#>   [2]     chr1     12-14      * |           1,2         2
#>   [3]     chr1        15      * |             1         1
#>   [4]     chr1     16-18      * |           1,3         2
#>   [5]     chr1     19-20      * |             1         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2018-07-26 by the reprex package (v0.2.0).

ADD COMMENT
0
Entering edit mode

Shouldn't this average the scores, not count the ranges?

ADD REPLY
0
Entering edit mode

In their example it looks like they want a count since all their scores are equal to three

ADD REPLY
0
Entering edit mode

Sure, since the scores are always 3, so that happens to work, but I doubt all the scores are actually 3.

ADD REPLY
0
Entering edit mode

Here is one way to do it:

abc <- c(a,b,c)
r <- disjoin(abc, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

It really seems like you would want the zero ranges in this data though. You could have a zero-score GRanges for the entire genome:

z <- keepStandardChromosomes(GRanges(Seqinfo(genome="hg38")),
                             pruning.mode="coarse")
score(z) <- 0
abcz <- c(a,b,c,z)
r <- disjoin(abcz, with.revmap = TRUE, ignore.strand = TRUE)
r$mean.score <- sum(extractList(score(abc), r$revmap)) / 3L

Here is how one can do it without loading all of the data into memory, in case that is an issue:

files <- c("a.bed", "b.bed", "c.bed")
si <- Seqinfo(genome="hg38")
z <- keepStandardChromosomes(GRanges(si), pruning.mode="coarse")
score(z) <- 0
r <- Reduce(function(left, right) {
    gr <- c(left, import(right, seqinfo=si))
    d <- disjoin(gr, with.revmap=TRUE)
    d$score <- sum(extractList(score(gr), d$revmap))
    d
}, files, init=z)
r$mean.score <- r$score / length(files)

 

 

ADD REPLY

Login before adding your answer.

Traffic: 856 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6