Hello Bioconductor,
I'm am trying to understand how the functions of group_by and reduce in plyranges package work. My purpose is to create a Granges object that has all the genomic ranges from BAM files and reduce them, to find the common / union in all samples.
example:
# toy samples taken/modified from BiocWorkshops/fluent-genomic-data-analysis-with-plyranges
set.seed(42)
pos <- sample(1:10000, size = 100)
size <- sample(10:50, size = 100, replace = TRUE)
strand <- sample(c("+","-"), size = 100, replace = TRUE)
rep1 <- data.frame(seqnames = "chr1",
start = pos,
width = size,
strand =strand,
counts = round(rnorm(100, mean = 100,sd = 20)))
rep1 <- as_granges(rep1)
seqlengths(rep1) <- 248956422
genome(rep1) <- "hg38"
rep2 <- rep1 %>% anchor_center %>% mutate(width = width / 2, counts = counts + 100 ) %>% shift_downstream(8L)
rep3 <- rep1 %>% shift_upstream(11L) %>% mutate(counts = counts / 2)
rep4 <- rep3 %>% anchor_start() %>% mutate(width = width * 1.3, counts = counts + 1000 ) %>% shift_downstream(7L)
rep4[c(1 , 50:70)] <- rep2[c(1, 50:70)] %>% shift_downstream(10L)
rep5 <- rep4 %>% anchor_center() %>% mutate(width = width * 1.5)
rep5[c(10:30)] <- rep2[c(20:40)] %>% anchor_center() %>% mutate(width = width / 2)
intensities <- bind_ranges(rep1, rep2, rep3, rep4, .id = "replicate") %>%
arrange(start)
r1 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges_directed() %>% as_tibble
r2 <- intensities %>% group_by(strand) %>% reduce_ranges_directed() %>% as_tibble
r3 <- intensities %>% reduce_ranges_directed() %>% as_tibble
r4 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges() %>% as_tibble
r5 <- intensities %>% group_by(strand) %>% reduce_ranges()%>% as_tibble
r6 <- intensities %>% reduce_ranges()%>% as_tibble
sapply(list(r1,r2,r3,r4,r5,r6), nrow)
[1] 349 81 81 349 81 63
identical(r2$end,r3$end)
[1] TRUE
r1 and r4 are identical, as r2, r3, r5. Although we start from 500 ranges with grouping by strand and replicate it keeps 457 ranges. In my specific case, I have the information about the strand and using whichever of the "r2,r3,r5" ways to collapse the Granges probably it doesn't change anything. In case of a Granges object with multiple chromosomes do we need to groupby(seqnames) or just by using reduceranges_directed() would be enough?
Then I would like to find the overlaps from each sample with the collapsed grange object in order to acquire a matrix with the counts:
r3 <- intensities %>% reduce_ranges_directed()
rep1 <- rep1 %>% select(rep_1_counts = "counts",everything()) %>% arrange(start)
rep2 <- rep2 %>% select(rep_2_counts = "counts",everything()) %>% arrange(start)
r3 %>% join_overlap_left_directed( rep1 ) %>%
mutate(rep_1_counts = replace_na(rep_1_counts,0)) %>%
join_overlap_left_directed( rep2 ) %>%
mutate(rep_2_counts = replace_na(rep_2_counts,0))
GRanges object with 164 ranges and 2 metadata columns:
seqnames ranges strand | rep_1_counts rep_2_counts
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 5-63 + | 131 231
[2] chr1 80-89 + | 0 0
[3] chr1 91-168 + | 109 209
[4] chr1 91-168 + | 109 214
[5] chr1 91-168 + | 109 212
... ... ... ... . ... ...
[160] chr1 9290-9333 - | 109 209
[161] chr1 9467-9515 - | 112 212
[162] chr1 9544-9585 - | 142 242
[163] chr1 9770-9834 - | 69 169
[164] chr1 9954-9995 - | 118 218
-------
seqinfo: 2 sequences from 2 genomes (NA, hg38)
Why it has now an NA in genomes and how can I remove it? In these toy data, there is the issue of having the same region e.g.: 91-168 with several entries, how is it possible to address it if it will arrise in real data?
Thank you for your time
Hello lee.s, thank you for your answer. I run the code but I get this:
what version of plyranges are you using? could you post your sessionInfo or provide a reprex?
I found the problem a bracket was missing...
Regarding
reduce_ranges_directed
is there a way to use the mean not manually? And when we have more seqlevels the correct way to perform reduce is :without using any grouping?
it seems like the support site code editor doesn't show the bracket - how strange! What do you mean by the second part?
By "manually way" I mean something to use like apply or map to be done for each of the samples instead of writing for each sample
rep_1_counts = mean(rep_1_counts, na.rm = TRUE)
, but in real data, it is very time consuming (just using 3 samples it took 15mins).For that reason, I was searching a way to perform it faster and it seems is better to do it to each sample as tibble instead to a plyrange