as a sequel to my subsetting GRanges object based on gene IDs, I was wondering how multiple mapping can be handled. I have two GRanges objects. One is with intensities per position, the other is for two genes, I would like to overlap with.
gr1 <- GRanges("VI", IRanges(c(3320:3321,3330:3331,3341:3342), c(3320:3321,3330:3331,3341:3342) )) gr2 <- GRanges("VI", IRanges(c(3322, 3030), c(3846, 3338), names=c("YFL064C", "YFL065C")), gene_id=c("YFL064C", "YFL065C"))
I know, I can handle multi-mapping using the CharacterList(split())
-combination. When doing so this is what I get:
ranges <- subsetByOverlaps(gr1,gr2) hits <- findOverlaps(gr1, gr2) geneids <- CharacterList(split(gr2$gene_id[subjectHits(hits)], queryHits(hits))) mcols(ranges) <- DataFrame(mcols(ranges), geneids)
and the results looks like that:
> ranges GRanges object with 6 ranges and 1 metadata column: seqnames ranges strand | geneids <Rle> <IRanges> <Rle> | <CharacterList> [1] VI [3320, 3320] * | YFL065C [2] VI [3321, 3321] * | YFL065C [3] VI [3330, 3330] * | YFL065C,YFL064C [4] VI [3331, 3331] * | YFL065C,YFL064C [5] VI [3341, 3341] * | YFL064C [6] VI [3342, 3342] * | YFL064C ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
This is what I expect to see, as the first and last two positions are unique to the the genes YFL065C and YFL064C respectively, while the two positions in between overlap both genes genes.
But I would like to know if it is possible to separate these multiple mapping into separate row, so that I would be able later on to split the GRanges into a list based on the gene_id column.
I would like it to be like that
> ranges GRanges object with 6 ranges and 1 metadata column: seqnames ranges strand | geneids <Rle> <IRanges> <Rle> | <CharacterList> [1] VI [3320, 3320] * | YFL065C [2] VI [3321, 3321] * | YFL065C [3] VI [3330, 3330] * | YFL065C [4] VI [3330, 3330] * | YFL064C [5] VI [3331, 3331] * | YFL065C [6] VI [3331, 3331] * | YFL064C [7] VI [3341, 3341] * | YFL064C [8] VI [3342, 3342] * | YFL064C ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
so that the multi-mapped positions are separated into multiple rows, each with a different gene id.
thanks Assa
I managed to do it in a very indirect way by exporting the
GRanges
object to adata.frame
, converting theCharacterList
column into character vectors using theunstrsplit()
command fromIRanges
package. I than expanded the list based on the geneids column with myexpand.delimited
function (s. below for the script).This was followed by merging the two newly created data frames into one and re-organizing the columns. The data was than coerced back into a GRanges object.
I am sure there is a better and more efficient way of doing it. I would appreciate your help.
this was the workflow: