I have a matrix of read counts representing the number of reads aligned to bins along the genome. There are millions of bins, and they are not all the same length. For each bin, I would like to count the number of reads within 5,000 bp on either side of the bin's centre. Reads should only be counted from bins that lie entirely within that 10,000 bp window.
So far, I've taken my bins (queryRanges) and created two vectors giving the first (firstHits) and last (lastHits) bin that falls entirely within each 10,000 bp search window (subjectRanges):
    ## Show query ranges
    > queryRanges
    GRanges object with 3913116 ranges and 0 metadata columns:
                seqnames               ranges strand
                   <Rle>            <IRanges>  <Rle>
            [1]     chr1   [3000193, 3000814]      *
            [2]     chr1   [3001121, 3001796]      *
            [3]     chr1   [3003603, 3004014]      *
            [4]     chr1   [3004127, 3004339]      *
            [5]     chr1   [3004340, 3005438]      *
            ...      ...                  ...    ...
      [3913112]     chrY [90818058, 90821170]      *
      [3913113]     chrY [90823995, 90828869]      *
      [3913114]     chrY [90828938, 90829083]      *
      [3913115]     chrY [90829136, 90829941]      *
      [3913116]     chrY [90831793, 90832280]      *
      -------
      seqinfo: 21 sequences from mm10 genome

    ## Create window-sized subject ranges
    > windowSize <- 10000
    > subjectRanges <- suppressWarnings(trim(resize(queryRanges, windowSize, fix="center")))

    ## Show subjectRanges
    > subjectRanges
    GRanges object with 3913116 ranges and 0 metadata columns:
                seqnames               ranges strand
                   <Rle>            <IRanges>  <Rle>
            [1]     chr1   [2995504, 3005503]      *
            [2]     chr1   [2996459, 3006458]      *
            [3]     chr1   [2998809, 3008808]      *
            [4]     chr1   [2999233, 3009232]      *
            [5]     chr1   [2999889, 3009888]      *
            ...      ...                  ...    ...
      [3913112]     chrY [90814614, 90824613]      *
      [3913113]     chrY [90821432, 90831431]      *
      [3913114]     chrY [90824011, 90834010]      *
      [3913115]     chrY [90824539, 90834538]      *
      [3913116]     chrY [90827037, 90837036]      *
      -------
      seqinfo: 21 sequences from mm10 genome

    ## Get first query within subject ranges
    > firstHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="first")

    ## Replace NA hits with index of query bin
    > indexNA <- which(is.na(firstHits))
    > firstHits[indexNA] <- indexNA

    ## Get last query within subject ranges
    > lastHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="last")

    ## Replace NA hits with index of query bin
    > indexNA <- which(is.na(lastHits))
    > lastHits[indexNA] <- indexNA

    ## Show index of first query bin within each subject bin
    > head(firstHits, n=25)
     [1]  1  1  1  1  1  2  3  3  3  3  4  4  5  5  6  6  7  7  9 10 11 12 15 15 16

    ## Show index of last query bin within each subject bin
    > head(lastHits, n=25)
     [1]  5  6 10 12 14 15 18 18 20 20 21 22 22 23 23 26 26 27 27 27 27 27 28 29 30
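For what it's worth, one way to use these firstHits/lastHits vectors without per-window subsetting is a cumulative-sum trick. This is only a sketch: `countMat` is an assumed name for the bin x sample count matrix (rows in the same order as queryRanges), and it relies on the assumption that the bins inside each window always form a contiguous run of row indices (true when both the starts and the ends of the sorted bins are non-decreasing).

```r
## Hypothetical sketch: sum counts over column[firstHits[i]:lastHits[i]]
## for every window i, using one cumulative sum per sample column.
windowSums <- function(column, firstHits, lastHits) {
  cs <- c(0, cumsum(as.numeric(column)))  # prepend 0 so firstHits - 1 = 0 works
  cs[lastHits + 1L] - cs[firstHits]       # sum over the run [firstHits, lastHits]
}

## Apply to every sample column of the (assumed) count matrix:
# newMat <- apply(countMat, 2, windowSums,
#                 firstHits = firstHits, lastHits = lastHits)
```

Each column then costs one pass for the cumsum plus one vectorised subtraction, rather than millions of small subset-and-sum operations.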
From here I could take each column of my count matrix, use these indexes to subset the column, sum the counts, and build a new matrix. However, this seems very slow and memory-inefficient. Is there a better alternative to this approach? Are there any features of GRanges transformations/structures I could take advantage of? I could also just use the 10,000 bp bins to count the reads from the BAM files with Rsubread's featureCounts, but since I already have the bin-level counts I wondered whether I could get my result faster...
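One possible alternative, sketched here under the assumption that `countMat` is the bin x sample count matrix (name assumed, rows ordered as queryRanges): keep the full Hits object from findOverlaps and fold bin counts into windows with a sparse 0/1 incidence matrix from the Matrix package, so the whole aggregation becomes a single sparse matrix product.

```r
library(Matrix)
library(GenomicRanges)

## All (bin, window) pairs where the bin lies entirely within the window.
hits <- findOverlaps(queryRanges, subjectRanges, type = "within")

## Sparse incidence matrix: rows = windows, columns = bins, 1 where the
## bin contributes to the window.
inc <- sparseMatrix(i = subjectHits(hits), j = queryHits(hits), x = 1,
                    dims = c(length(subjectRanges), length(queryRanges)))

## One matrix product gives the window x sample matrix of summed counts:
# windowCounts <- as.matrix(inc %*% countMat)
```

Unlike an index-run approach, this does not assume the bins inside a window are contiguous, though the Hits object for millions of ranges can itself be large.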
Thanks Aaron, I'll stick with featureCounts to generate the window counts.