Question

Understanding group_by, reduce_ranges,

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 16 months ago

Italy

Hello Bioconductor,

I'm am trying to understand how the functions of group_by and reduce in plyranges package work. My purpose is to create a Granges object that has all the genomic ranges from BAM files and reduce them, to find the common / union in all samples.

example:

# toy samples taken/modified from BiocWorkshops/fluent-genomic-data-analysis-with-plyranges
set.seed(42)
pos <- sample(1:10000, size = 100)
size <- sample(10:50, size = 100, replace = TRUE)
strand <- sample(c("+","-"), size = 100, replace = TRUE)
rep1 <- data.frame(seqnames = "chr1", 
              start = pos,
              width = size,
              strand =strand, 
              counts = round(rnorm(100, mean = 100,sd = 20)))
rep1 <- as_granges(rep1) 
seqlengths(rep1) <- 248956422
genome(rep1) <- "hg38"
rep2 <- rep1 %>% anchor_center %>% mutate(width = width / 2, counts = counts + 100 ) %>% shift_downstream(8L)
rep3 <- rep1 %>% shift_upstream(11L) %>% mutate(counts = counts / 2)
rep4 <- rep3 %>% anchor_start() %>% mutate(width = width * 1.3, counts = counts + 1000 ) %>% shift_downstream(7L)
rep4[c(1 , 50:70)] <- rep2[c(1, 50:70)] %>% shift_downstream(10L)
rep5 <- rep4 %>% anchor_center() %>% mutate(width = width * 1.5)
rep5[c(10:30)] <- rep2[c(20:40)] %>% anchor_center() %>% mutate(width = width / 2)
intensities <- bind_ranges(rep1, rep2, rep3, rep4, .id = "replicate") %>%
arrange(start)
r1 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges_directed() %>% as_tibble
r2 <- intensities %>% group_by(strand) %>%  reduce_ranges_directed() %>% as_tibble
r3 <- intensities %>% reduce_ranges_directed() %>% as_tibble
r4 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges() %>% as_tibble
r5 <- intensities %>% group_by(strand) %>% reduce_ranges()%>% as_tibble
r6 <- intensities %>%  reduce_ranges()%>% as_tibble
sapply(list(r1,r2,r3,r4,r5,r6), nrow)
[1] 349  81  81 349  81  63
identical(r2$end,r3$end)
[1] TRUE

r1 and r4 are identical, as r2, r3, r5. Although we start from 500 ranges with grouping by strand and replicate it keeps 457 ranges. In my specific case, I have the information about the strand and using whichever of the "r2,r3,r5" ways to collapse the Granges probably it doesn't change anything. In case of a Granges object with multiple chromosomes do we need to groupby(seqnames) or just by using reduceranges_directed() would be enough?

Then I would like to find the overlaps from each sample with the collapsed grange object in order to acquire a matrix with the counts:

r3 <- intensities %>% reduce_ranges_directed()
rep1 <- rep1  %>%  select(rep_1_counts = "counts",everything()) %>% arrange(start)
rep2 <- rep2  %>%  select(rep_2_counts = "counts",everything()) %>% arrange(start)
r3 %>% join_overlap_left_directed( rep1 ) %>% 
mutate(rep_1_counts = replace_na(rep_1_counts,0)) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(rep_2_counts = replace_na(rep_2_counts,0))



      GRanges object with 164 ranges and 2 metadata columns:
        seqnames    ranges strand | rep_1_counts rep_2_counts
           <Rle> <IRanges>  <Rle> |    <numeric>    <numeric>
    [1]     chr1      5-63      + |          131          231
    [2]     chr1     80-89      + |            0            0
    [3]     chr1    91-168      + |          109          209
    [4]     chr1    91-168      + |          109          214
    [5]     chr1    91-168      + |          109          212
    ...      ...       ...    ... .          ...          ...
  [160]     chr1 9290-9333      - |          109          209
  [161]     chr1 9467-9515      - |          112          212
  [162]     chr1 9544-9585      - |          142          242
  [163]     chr1 9770-9834      - |           69          169
  [164]     chr1 9954-9995      - |          118          218
  -------
  seqinfo: 2 sequences from 2 genomes (NA, hg38)

Why it has now an NA in genomes and how can I remove it? In these toy data, there is the issue of having the same region e.g.: 91-168 with several entries, how is it possible to address it if it will arrise in real data?

Thank you for your time

plyranges granges • 2.8k views

ADD COMMENT • link updated 5.4 years ago by lee.s ▴ 70 • written 5.4 years ago by Konstantinos Yeles ▴ 80

score 2 · Accepted Answer · 2019-11-07

2

Entering edit mode

lee.s ▴ 70

@lees-15179

Last seen 5.4 years ago

Hi Konstantinos,

r1 and r4 are identical, as r2, r3, r5. Although we start from 500 ranges with grouping by strand and replicate it keeps 457 ranges. In my specific case, I have the information about the strand and using whichever of the "r2,r3,r5" ways to collapse the Granges probably it doesn't change anything. In case of a Granges object with multiple chromosomes do we need to groupby(seqnames) or just by using reduceranges_directed() would be enough?

The purpose of reduce_ranges() is to merge overlapping ranges (and optionally apply some transformation based on those ranges metadata). What is considered as overlapping ranges, depends on the groupings and whether you are considering strand (i.e. using directed). So when you do groupby(replicate) then apply reduceranges, ranges within each replicate will be reduce. You don't need to group by seqnames when using reduceranges or reduceranges_directed as ranges on different seqnames are distinct.

Why it has now an NA in genomes and how can I remove it? In these toy data, there is the issue of having the same region e.g.: 91-168 with several entries, how is it possible to address it if it will arrise in real data?

This is a bug in left overlap join that should be fixed in plyranges 1.6.2 (once it gets onto the servers). For ranges, where there were multiple hits you could possibly use groupby + summarise + asgranges or group_by + reduce to aggregate the counts.

Here's an example - we use selfmatch as a short cut for doing group_by(seqnames, start, end, strand), also note that you can use the dplyr scoped verbs to save some type when replacing the NA values.

ans <- r3 %>% 
  join_overlap_left_directed( rep1 ) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(revmap = selfmatch(.)) %>%
  dplyr::mutate_at(dplyr::vars(dplyr::starts_with('rep')), 
                   function(.) ifelseis.na(.), 0, .)
  ) %>% 
  group_by(revmap) %>% 
  reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )

ADD COMMENT • link 5.4 years ago lee.s ▴ 70

0

Entering edit mode

Hello lee.s, thank you for your answer. I run the code but I get this:

ans <- r3 %>% 
  join_overlap_left_directed( rep1 ) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(revmap = selfmatch(.)) %>%
  dplyr::mutate_at(dplyr::vars(dplyr::starts_with('rep')), 
                   function(.) ifelseis.na(.), 0, .) %>% 
  group_by(revmap) %>% 
  reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )

Error in UseMethod("tblvars") : no applicable method for 'tblvars' applied to an object of class "quosures"

ADD REPLY • link 5.4 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

what version of plyranges are you using? could you post your sessionInfo or provide a reprex?

ADD REPLY • link 5.4 years ago lee.s ▴ 70

0

Entering edit mode

I found the problem a bracket was missing...

 function(.) ifelseis.na(.), 0, .) %>% 
                   ^
 function(.) ifelse/is.na(.), 0, .) %>%

Regarding reduce_ranges_directed is there a way to use the mean not manually? And when we have more seqlevels the correct way to perform reduce is :

r3 <- intensities %>% reduce_ranges_directed()

without using any grouping?

ADD REPLY • link 5.4 years ago Konstantinos Yeles ▴ 80

1

Entering edit mode

it seems like the support site code editor doesn't show the bracket - how strange! What do you mean by the second part?

ADD REPLY • link 5.4 years ago lee.s ▴ 70

0

Entering edit mode

By "manually way" I mean something to use like apply or map to be done for each of the samples instead of writing for each sample rep_1_counts = mean(rep_1_counts, na.rm = TRUE), but in real data, it is very time consuming (just using 3 samples it took 15mins).

reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )

For that reason, I was searching a way to perform it faster and it seems is better to do it to each sample as tibble instead to a plyrange

ADD REPLY • link 5.4 years ago Konstantinos Yeles ▴ 80