I would like to subset a subject GRanges object by intersection with a query GRanges object. When I use subsetByOverlaps, some entries in the resulting GRanges object can extend "outside" the query GRanges.
library(GenomicRanges)
subject <- GRanges(c("chrI:1-99", "chrI:100-199", "chrI:200-299"), score = 1:3)
subject
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <integer>
# [1] chrI 1-99 * | 1
# [2] chrI 100-199 * | 2
# [3] chrI 200-299 * | 3
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
query <- GRanges("chrI:50-150")
query
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chrI 50-150 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
subsetByOverlaps(x = subject, ranges = query)
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <integer>
# [1] chrI 1-99 * | 1
# [2] chrI 100-199 * | 2
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
I would like these to be "truncated" such that they end at the coordinates defined by the query GRanges. I can achieve this by first converting the subject GRanges object to a GPos object.
subject_GPos <- GPos(subject)
subject_GPos$score <- rep(subject$score, width(subject))
subject_GPos
# StitchedGPos object with 299 positions and 1 metadata column:
# seqnames pos strand | score
# <Rle> <integer> <Rle> | <integer>
# [1] chrI 1 * | 1
# [2] chrI 2 * | 1
# [3] chrI 3 * | 1
# [4] chrI 4 * | 1
# [5] chrI 5 * | 1
# ... ... ... ... . ...
# [295] chrI 295 * | 3
# [296] chrI 296 * | 3
# [297] chrI 297 * | 3
# [298] chrI 298 * | 3
# [299] chrI 299 * | 3
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
subsetByOverlaps(x = subject_GPos, ranges = query)
# StitchedGPos object with 101 positions and 1 metadata column:
# seqnames pos strand | score
# <Rle> <integer> <Rle> | <integer>
# [1] chrI 50 * | 1
# [2] chrI 51 * | 1
# [3] chrI 52 * | 1
# [4] chrI 53 * | 1
# [5] chrI 54 * | 1
# ... ... ... ... . ...
# [97] chrI 146 * | 2
# [98] chrI 147 * | 2
# [99] chrI 148 * | 2
# [100] chrI 149 * | 2
# [101] chrI 150 * | 2
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
Is there an easier way to do this without conversion to GPos?
sessionInfo( )
# attached base packages:
# [1] grid stats4 stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] Gviz_1.38.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.0 IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0
That also included me hitting 'Enter' before adding the triple-backticks, so you'll need to go to the support site to see the edited post
That looks great. If you want to turn this into an answer I will be happy to accept it.