subsetByOverlaps with Multimapping
1
0
Entering edit mode
Assa Yeroslaviz ★ 1.5k
@assa-yeroslaviz-1597
Last seen 1 day ago
Germany

as a sequel to my subsetting GRanges object based on gene IDs, I was wondering how multiple mapping can be handled. I have two GRanges objects. One is with intensities per position, the other is for two genes, I would like to overlap with. 

gr1 <- GRanges("VI", IRanges(c(3320:3321,3330:3331,3341:3342), c(3320:3321,3330:3331,3341:3342) ))
gr2 <- GRanges("VI", IRanges(c(3322, 3030), c(3846, 3338), names=c("YFL064C", "YFL065C")), gene_id=c("YFL064C", "YFL065C"))

I know, I can handle multi-mapping using the CharacterList(split())-combination. When doing so this is what I get:

ranges <- subsetByOverlaps(gr1,gr2)
hits <- findOverlaps(gr1, gr2)
geneids <- CharacterList(split(gr2$gene_id[subjectHits(hits)], queryHits(hits)))
mcols(ranges) <- DataFrame(mcols(ranges), geneids)

 

and the results looks like that:

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * | YFL065C,YFL064C
  [4]       VI [3331, 3331]      * | YFL065C,YFL064C
  [5]       VI [3341, 3341]      * |         YFL064C
  [6]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

This is what I expect to see, as the first and last two positions are unique to the the genes YFL065C and YFL064C respectively, while the two positions in between overlap both genes genes.

But I would like to know if it is possible to separate these multiple mapping into separate row, so that I would be able later on to split the GRanges into a list based on the gene_id column. 

I would like it to be like that

> ranges
GRanges object with 6 ranges and 1 metadata column:
      seqnames       ranges strand |         geneids
         <Rle>    <IRanges>  <Rle> | <CharacterList>
  [1]       VI [3320, 3320]      * |         YFL065C
  [2]       VI [3321, 3321]      * |         YFL065C
  [3]       VI [3330, 3330]      * |         YFL065C
  [4]       VI [3330, 3330]      * |         YFL064C
  [5]       VI [3331, 3331]      * |         YFL065C
  [6]       VI [3331, 3331]      * |         YFL064C
  [7]       VI [3341, 3341]      * |         YFL064C
  [8]       VI [3342, 3342]      * |         YFL064C
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

so that the multi-mapped positions are separated into multiple rows, each with a different gene id. 

thanks Assa

 

 

 

 

 

 

genomicranges subsetbyoverlaps granges findoverlaps • 2.3k views
ADD COMMENT
0
Entering edit mode

I managed to do it in a very indirect way by exporting the GRanges object to a data.frame, converting the CharacterList column into character vectors using the unstrsplit() command from IRanges package. I than expanded the list based on the geneids column with my expand.delimited function (s. below for the script).

This was followed by merging the two newly created data frames into one and re-organizing the columns. The data was than coerced back into a GRanges object.

I am sure there is a better and more efficient way of doing it. I would appreciate your help.

 

this was the workflow:

ranges$geneids <- unstrsplit(ranges$geneids, sep=",") # converting the CharacterList into Character
df <- as.data.frame(ranges)

df.2 <- expand.delimited(x=df, col1=2, col2=6) # expanding the data frames

df.merged <- merge(x=df.2, y=df, by.x=1, by.y=2,all.y=TRUE) # merging
df.merged <- df.merged[,c("seqnames", "start", "end", "width", "strand", "geneids.x")] # reorgenizing

as(df.merged, "GRanges") #coercing back to GRanges

expand.delimited <- function(x, col1=1, col2=2, sep=",") {
  rnum <- 1
  expand_row <- function(y) {
    factr <- y[col1]
    strng <- toString(y[col2])
    expand <- strsplit(strng, sep)[[1]]
    num <- length(expand)
    factor <- rep(factr,num)
    return(as.data.frame(cbind(factor,expand),row.names=seq(rnum:(rnum+num)-1)))
    rnum <- (rnum+num)-1
  }
  expanded <- apply(x,1,expand_row)
  df <- do.call("rbind", expanded)
  names(df) <- c(names(x)[col1],names(x)[col2])
  return(df)
}

 

 

ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 9 days ago
Seattle, WA, United States

Hi Assa,

I would just do:

hits <- findOverlaps(gr2, gr1)
setNames(extractList(gr1, as(hits, "List")), names(gr2))
# GRangesList object of length 2:
# $YFL064C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#          <Rle>    <IRanges>  <Rle>
#   [1]       VI [3330, 3330]      *
#   [2]       VI [3331, 3331]      *
#   [3]       VI [3341, 3341]      *
#   [4]       VI [3342, 3342]      *
#
# $YFL065C 
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames       ranges strand
#   [1]       VI [3320, 3320]      *
#   [2]       VI [3321, 3321]      *
#   [3]       VI [3330, 3330]      *
#   [4]       VI [3331, 3331]      *
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

This produces a GRangesList object parallel to gr2 i.e. with one list-element per gene. You can then unlist() it if you want a GRanges object, and add the gene_id metadata column (using the names for that):

gr1b <- unlist(setNames(extractList(gr1, as(hits, "List")), names(gr2)))
mcols(gr1b)$gene_id <- names(gr1b)
gr1b
# GRanges object with 8 ranges and 1 metadata column:
#           seqnames       ranges strand |     gene_id
#              <Rle>    <IRanges>  <Rle> | <character>
#   YFL064C       VI [3330, 3330]      * |     YFL064C
#   YFL064C       VI [3331, 3331]      * |     YFL064C
#   YFL064C       VI [3341, 3341]      * |     YFL064C
#   YFL064C       VI [3342, 3342]      * |     YFL064C
#   YFL065C       VI [3320, 3320]      * |     YFL065C
#   YFL065C       VI [3321, 3321]      * |     YFL065C
#   YFL065C       VI [3330, 3330]      * |     YFL065C
#   YFL065C       VI [3331, 3331]      * |     YFL065C
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Cheers,

H.

ADD COMMENT
0
Entering edit mode

thanks a lot. I thought there might be a better way to do it.

ADD REPLY

Login before adding your answer.

Traffic: 619 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6