Question

subsetByOverlaps with Multimapping

0

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 7 weeks ago

Germany

as a sequel to my subsetting GRanges object based on gene IDs, I was wondering how multiple mapping can be handled. I have two GRanges objects. One is with intensities per position, the other is for two genes, I would like to overlap with.

gr1 <- GRanges("VI", IRanges(c(3320:3321,3330:3331,3341:3342), c(3320:3321,3330:3331,3341:3342) ))
gr2 <- GRanges("VI", IRanges(c(3322, 3030), c(3846, 3338), names=c("YFL064C", "YFL065C")), gene_id=c("YFL064C", "YFL065C"))

I know, I can handle multi-mapping using the CharacterList(split())-combination. When doing so this is what I get:

ranges <- subsetByOverlaps(gr1,gr2)
hits <- findOverlaps(gr1, gr2)
geneids <- CharacterList(split(gr2$gene_id[subjectHits(hits)], queryHits(hits)))
mcols(ranges) <- DataFrame(mcols(ranges), geneids)

and the results looks like that:

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * | YFL065C,YFL064C
  [4]       VI [3331, 3331]      * | YFL065C,YFL064C
  [5]       VI [3341, 3341]      * |         YFL064C
  [6]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

This is what I expect to see, as the first and last two positions are unique to the the genes YFL065C and YFL064C respectively, while the two positions in between overlap both genes genes.

But I would like to know if it is possible to separate these multiple mapping into separate row, so that I would be able later on to split the GRanges into a list based on the gene_id column.

I would like it to be like that

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * |         YFL065C
  [4]       VI [3330, 3330]      * |         YFL064C
  [5]       VI [3331, 3331]      * |         YFL065C
  [6]       VI [3331, 3331]      * |         YFL064C
  [7]       VI [3341, 3341]      * |         YFL064C
  [8]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

so that the multi-mapped positions are separated into multiple rows, each with a different gene id.

thanks Assa

genomicranges subsetbyoverlaps granges findoverlaps • 2.3k views

ADD COMMENT • link 7.3 years ago Assa Yeroslaviz ★ 1.5k

0

Entering edit mode

I managed to do it in a very indirect way by exporting the GRanges object to a data.frame, converting the CharacterList column into character vectors using the unstrsplit() command from IRanges package. I than expanded the list based on the geneids column with my expand.delimited function (s. below for the script).

This was followed by merging the two newly created data frames into one and re-organizing the columns. The data was than coerced back into a GRanges object.

I am sure there is a better and more efficient way of doing it. I would appreciate your help.

this was the workflow:

ranges$geneids <- unstrsplit(ranges$geneids, sep=",") # converting the CharacterList into Character
df <- as.data.frame(ranges)

df.2 <- expand.delimited(x=df, col1=2, col2=6) # expanding the data frames

df.merged <- merge(x=df.2, y=df, by.x=1, by.y=2,all.y=TRUE) # merging
df.merged <- df.merged[,c("seqnames", "start", "end", "width", "strand", "geneids.x")] # reorgenizing

as(df.merged, "GRanges") #coercing back to GRanges

expand.delimited <- function(x, col1=1, col2=2, sep=",") {
  rnum <- 1
  expand_row <- function(y) {
    factr <- y[col1]
    strng <- toString(y[col2])
    expand <- strsplit(strng, sep)[[1]]
    num <- length(expand)
    factor <- rep(factr,num)
    return(as.data.frame(cbind(factor,expand),row.names=seq(rnum:(rnum+num)-1)))
    rnum <- (rnum+num)-1
  }
  expanded <- apply(x,1,expand_row)
  df <- do.call("rbind", expanded)
  names(df) <- c(names(x)[col1],names(x)[col2])
  return(df)
}

ADD REPLY • link 7.3 years ago Assa Yeroslaviz ★ 1.5k

score 2 · Accepted Answer · 2017-09-06

Hi Assa,

I would just do:

hits <- findOverlaps(gr2, gr1)
setNames(extractList(gr1, as(hits, "List")), names(gr2))
# GRangesList object of length 2:
# $YFL064C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#          <Rle>    <IRanges>  <Rle>
#   [1]       VI [3330, 3330]      *
#   [2]       VI [3331, 3331]      *
#   [3]       VI [3341, 3341]      *
#   [4]       VI [3342, 3342]      *
#
# $YFL065C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]       VI [3320, 3320]      *
#   [2]       VI [3321, 3321]      *
#   [3]       VI [3330, 3330]      *
#   [4]       VI [3331, 3331]      *
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

This produces a GRangesList object parallel to gr2 i.e. with one list-element per gene. You can then unlist() it if you want a GRanges object, and add the gene_id metadata column (using the names for that):

gr1b <- unlist(setNames(extractList(gr1, as(hits, "List")), names(gr2)))
mcols(gr1b)$gene_id <- names(gr1b)
gr1b
# GRanges object with 8 ranges and 1 metadata column:
#           seqnames       ranges strand |     gene_id
#              <Rle>    <IRanges>  <Rle> | <character>
#   YFL064C       VI [3330, 3330]      * |     YFL064C
#   YFL064C       VI [3331, 3331]      * |     YFL064C
#   YFL064C       VI [3341, 3341]      * |     YFL064C
#   YFL064C       VI [3342, 3342]      * |     YFL064C
#   YFL065C       VI [3320, 3320]      * |     YFL065C
#   YFL065C       VI [3321, 3321]      * |     YFL065C
#   YFL065C       VI [3330, 3330]      * |     YFL065C
#   YFL065C       VI [3331, 3331]      * |     YFL065C
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Cheers,

H.