How to remove out of bound rows from granges
3
3
Entering edit mode
vinod.acear ▴ 50
@vinodacear-8884
Last seen 4.2 years ago
India

Hi 

I want to remove rows from out of chromosome length  ranges from granges. Please suggest

Granges grangeslist • 8.7k views
ADD COMMENT
0
Entering edit mode

Hi i have ext_grn my badGr which ahve 253 ranges, m trying to remove location 6 and 10 that are out of bound. when i applied the above process on it it is still having 253 ranges.

> ext_grn
GRanges object with 253 ranges and 2 metadata columns:
        seqnames           ranges strand   |            id     score
           <Rle>        <IRanges>  <Rle>   |   <character> <numeric>
    [1]        1 [   250,   1250]      -   |   eaton_acs_1   24.6037
    [2]        1 [ 30518,  31518]      +   |   eaton_acs_2   18.0581
    [3]        1 [ 41560,  42560]      +   |   eaton_acs_3   15.4399
    [4]        1 [ 69917,  70917]      -   |   eaton_acs_4   10.2180
    [5]        1 [124012, 125012]      -   |   eaton_acs_5   19.1554
    ...      ...              ...    ... ...           ...       ...
  [249]       16 [776581, 777581]      -   | eaton_acs_249   18.2426
  [250]       16 [818826, 819826]      -   | eaton_acs_250   20.8951
  [251]       16 [880423, 881423]      +   | eaton_acs_251   21.7175
  [252]       16 [932651, 933651]      -   | eaton_acs_252   15.2781
  [253]       16 [941943, 942943]      +   | eaton_acs_253   21.6356
  -------
  seqinfo: 16 sequences from sacCer2 genome

> a=Seqinfo(genome="sacCer2")
> seqlevels(a,force=TRUE) <- c("chrI" ,   "chrII" ,  "chrIII" , "chrIV"  , "chrV"  ,  "chrVI" ,  "chrVII" , "chrVIII" ,"chrIX"  , "chrX" , "chrXI"  , "chrXII" , "chrXIII" ,"chrXIV" , "chrXV"  , "chrXVI" )

> a=renameSeqlevels(a,as.character(c(1:16)))
> a

Seqinfo object with 16 sequences from sacCer2 genome:
  seqnames seqlengths isCircular  genome
  1            230208      FALSE sacCer2
  2            813178      FALSE sacCer2
  3            316617      FALSE sacCer2
  4           1531919      FALSE sacCer2
  5            576869      FALSE sacCer2
  ...             ...        ...     ...
  12          1078175      FALSE sacCer2
  13           924429      FALSE sacCer2
  14           784333      FALSE sacCer2
  15          1091289      FALSE sacCer2
  16           948062      FALSE sacCer2

> seqinfo(ext_grn) <- a
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 2 out-of-bound ranges located on sequences 6 and 10. Note
  that only ranges located on a non-circular sequence whose length is not NA can be
  considered out-of-bound (use seqlengths() and isCircular() to get the lengths and
  circularity flags of the underlying sequences). You can use trim() to trim these
  ranges. See ?`trim,GenomicRanges-method` for more information.
> ext_grn <- trim(ext_grn)
> ext_grn
GRanges object with 253 ranges and 2 metadata columns:
        seqnames           ranges strand   |            id     score
           <Rle>        <IRanges>  <Rle>   |   <character> <numeric>
    [1]        1 [   250,   1250]      -   |   eaton_acs_1   24.6037
    [2]        1 [ 30518,  31518]      +   |   eaton_acs_2   18.0581
    [3]        1 [ 41560,  42560]      +   |   eaton_acs_3   15.4399
    [4]        1 [ 69917,  70917]      -   |   eaton_acs_4   10.2180
    [5]        1 [124012, 125012]      -   |   eaton_acs_5   19.1554
    ...      ...              ...    ... ...           ...       ...
  [249]       16 [776581, 777581]      -   | eaton_acs_249   18.2426
  [250]       16 [818826, 819826]      -   | eaton_acs_250   20.8951
  [251]       16 [880423, 881423]      +   | eaton_acs_251   21.7175
  [252]       16 [932651, 933651]      -   | eaton_acs_252   15.2781
  [253]       16 [941943, 942943]      +   | eaton_acs_253   21.6356
  -------
  seqinfo: 16 sequences from sacCer2 genome

 

 

ADD REPLY
0
Entering edit mode

I see. I'm not sure there's a super-easy way to drop ranges that fall outside of the bounds. trim() is expected to return an object of the same length as its input, so we might need a new function. Maybe Herve has some ideas?

ADD REPLY
3
Entering edit mode

Hi,

trim() uses internal helper GenomicRanges:::get_out_of_bound_index() to find the out-of-bound ranges. So you could use this on your GRanges object:

idx <- GenomicRanges:::get_out_of_bound_index(ext_grn)
if (length(idx) != 0L)
    ext_grn <- ext_grn[-idx]

We should probably expose and document get_out_of_bound_index(), and rename it in the process.

H.

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

There might be an easy way to make a GRanges object for the chromosomes, but it wasn't readily apparent to me, so I did it the hard way. But the idea is to have a GRanges that encompasses the extent of the chromosomes, and then use subsetByOverlaps().

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## get chrlengths - there must actually be an accessor I don't know abou
> con <- dbConnect(SQLite(), paste0(path.package("TxDb.Hsapiens.UCSC.hg19.knownGene"), "/extdata/TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite"))
> chrs <- dbGetQuery(con, "select chrom, length from chrominfo limit 24;")
> chrGR <- GRanges(chrs[,1], IRanges(rep(1, 24), chrs[,2]))
## so chrGR is a GRanges that has start and end for all 24 chromosomes
> badGR <- GRanges(paste0("chr", c(1,2,3,4,5)), IRanges(c(249250600, 3,4,5,6), c(249250645, 6,7,8,9)))
## a 'bad' GRanges, where the first range extends past the end of chr1
> badGR
GRanges object with 5 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [249250600, 249250645]      *
  [2]     chr2 [        3,         6]      *
  [3]     chr3 [        4,         7]      *
  [4]     chr4 [        5,         8]      *
  [5]     chr5 [        6,         9]      *
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
## now remove all ranges that are not 'within' the chromosome boundaries
> subsetByOverlaps(badGR, chrGR, type="within")
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2    [3, 6]      *
  [2]     chr3    [4, 7]      *
  [3]     chr4    [5, 8]      *
  [4]     chr5    [6, 9]      *
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

ADD COMMENT
1
Entering edit mode

Hi Jim,

FWIW, getting the sequence lengths out of a TxDb object is a one-liner:

seqlengths(TxDb.Hsapiens.UCSC.hg19.knownGene)

If you want the full Seqinfo:

seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)

If you want it as a GRanges object (your chrGR object):

as(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene), "GRanges")

Too easy I know ;-)

H.

ADD REPLY
0
Entering edit mode

Thanks Hervé! I figured it had to be something easy...

ADD REPLY
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

If your genome build is hg19 or some other key known to UCSC, you can do something like:

seqinfo(badGR) <- Seqinfo(genome="hg19")
goodGR <- trim(badGR)

 

ADD COMMENT
0
Entering edit mode
balwierz ▴ 40
@balwierz-8583
Last seen 5.3 years ago
United Kingdom

It is super easy

a <- a[trim(a) == a]

If your GRanges a does not have seqinfo, assign it first (and do it always whenever you create any GRanges):

seqlevels(a) <- seqlevels(Mmusculus)
seqlengths(a) <- seqlengths(Mmusculus)
ADD COMMENT

Login before adding your answer.

Traffic: 633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6