Trim/Filter out-of-bounds GRanges
1
0
Entering edit mode
mat149 ▴ 80
@mat149-11450
Last seen 3 minutes ago
United States

Can anyone provide a better way to remove/trim out-of-bounds genomic ranges?

I have 6349 scaffolds in the reference genome. I want 500 bp of upstream sequence from all start codons; many predicted genes are on scaffolds without 500bp of upstream sequence that need removed/filtered from the GRanges object. Ideally, I want only the genes that have 500bp of upstream sequence. I am not having any issues removing out-of-bound genomic ranges with the forward strand, but I need to remove predicted genes on the negative strand that extend past the scaffold end limits. I can remove them one at a time, but this is not practical for 323 genomic ranges/genes. I want to add 500bp to each 'end' range then trim the genes off that extend past the scaffold seqlength on the negative strand. See code below:


tb3<-transcriptsBy(tx3,by="gene")               
tb3<-unlist(tb3)
tb3

GRanges object with 10189 ranges and 2 metadata columns:
              seqnames      ranges strand |     tx_id                tx_name
                 <Rle>   <IRanges>  <Rle> | <integer>            <character>
  60S_1 WNWQ01000464.1   3717-4775      - |      6338 rna-gnl|WGS:WNWQ|BLS..
  60S_2 WNWQ01000622.1   7540-8266      + |      7310 rna-gnl|WGS:WNWQ|BLS..
  60S_3 WNWQ01000073.1 39459-40199      - |      1864 rna-gnl|WGS:WNWQ|BLS..
   ACC1 WNWQ01000188.1  6555-13575      - |      3682 rna-gnl|WGS:WNWQ|BLS..
   ACR2 WNWQ01000227.1  6921-10580      - |      4167 rna-gnl|WGS:WNWQ|BLS..
    ...            ...         ...    ... .       ...                    ...
   YME2 WNWQ01000051.1 31504-34100      + |      1422 rna-gnl|WGS:WNWQ|BLS..
   YOP1 WNWQ01000685.1 16975-17921      + |      7660 rna-gnl|WGS:WNWQ|BLS..
   YPT1 WNWQ01000164.1   3943-4620      + |      3344 rna-gnl|WGS:WNWQ|BLS..
   YPT2 WNWQ01000058.1 64153-64901      + |      1582 rna-gnl|WGS:WNWQ|BLS..
   YPT5 WNWQ01000426.1  9277-10155      - |      6058 rna-gnl|WGS:WNWQ|BLS..
  -------
  seqinfo: 6349 sequences from ASM976944v1 genome

sub3<-subset(tb3,tb3@ranges@start > 502 & strand == "+")                             # 0 out-of-bound ranges
sub3

GRanges object with 4983 ranges and 2 metadata columns:
              seqnames      ranges strand |     tx_id                tx_name
                 <Rle>   <IRanges>  <Rle> | <integer>            <character>
  60S_2 WNWQ01000622.1   7540-8266      + |      7310 rna-gnl|WGS:WNWQ|BLS..
   ACU8 WNWQ01000193.1 33583-35574      + |      3731 rna-gnl|WGS:WNWQ|BLS..
   ADK1 WNWQ01000028.1 27957-28941      + |       929 rna-gnl|WGS:WNWQ|BLS..
   AKR1 WNWQ01001397.1   4341-5927      + |      9792 rna-gnl|WGS:WNWQ|BLS..
   ALC1 WNWQ01000768.1 12581-15782      + |      8067 rna-gnl|WGS:WNWQ|BLS..
    ...            ...         ...    ... .       ...                    ...
   YHM2 WNWQ01000269.1 19271-20502      + |      4643 rna-gnl|WGS:WNWQ|BLS..
   YME2 WNWQ01000051.1 31504-34100      + |      1422 rna-gnl|WGS:WNWQ|BLS..
   YOP1 WNWQ01000685.1 16975-17921      + |      7660 rna-gnl|WGS:WNWQ|BLS..
   YPT1 WNWQ01000164.1   3943-4620      + |      3344 rna-gnl|WGS:WNWQ|BLS..
   YPT2 WNWQ01000058.1 64153-64901      + |      1582 rna-gnl|WGS:WNWQ|BLS..

gb3<-getPromoterSeq(sub3,genome3,upstream=500,downstream=3)     
gb3

DNAStringSet object of length 4983:
       width seq                                            names               
   [1]   503 GACCATCACTCTTCGTGATCTT...TCGCCAAATCTACTAACAATG 60S_2
   [2]   503 AAGACAGGGCTTCTCGAATAAG...TACACCCACCAATTCGCCATG ACU8
   [3]   503 GATTTGGCGAATGTTGCAGTGA...ACACTTCTTTCATACACAATG ADK1
   [4]   503 AGTGACAGTTCATAAACTTAAG...TGCAGCTCGCTCCACACAATG AKR1
   [5]   503 GTAACTCTGCACAAGGATCCTA...CAGGAACAAGAACCAATTATG ALC1
   ...   ... ...
[4979]   503 AGTCCGAAAGGTGGCTTTGAAG...AGCAACCAACTCCAGAGAATG YHM2
[4980]   503 GAGTCTAGGTCGGAGCGATGGA...TATTGCCTTGCCAACATCATG YME2
[4981]   503 TGACGAGTAGGGGAACATGTGC...ACCACAAAATCTCCCGCCATG YOP1
[4982]   503 CACGATAAGGCTTCCTGCTGCC...TGCTTCTTCGATTCGCCGATG YPT1
[4983]   503 CGCTGGAAGGAATGGCGGTTGG...TTCACATACCCTGACACCATG YPT2



tub3<-subset(tb3, strand == "-")                                 
end(tub3)<-end(tub3)+500
tub3

GRanges object with 5045 ranges and 2 metadata columns:
              seqnames      ranges strand |     tx_id                tx_name
                 <Rle>   <IRanges>  <Rle> | <integer>            <character>
  60S_1 WNWQ01000464.1   3717-5275      - |      6338 rna-gnl|WGS:WNWQ|BLS..
  60S_3 WNWQ01000073.1 39459-40699      - |      1864 rna-gnl|WGS:WNWQ|BLS..
   ACC1 WNWQ01000188.1  6555-14075      - |      3682 rna-gnl|WGS:WNWQ|BLS..
   ACR2 WNWQ01000227.1  6921-11080      - |      4167 rna-gnl|WGS:WNWQ|BLS..
   ADE1 WNWQ01000248.1 21892-23690      - |      4423 rna-gnl|WGS:WNWQ|BLS..
    ...            ...         ...    ... .       ...                    ...
   VPS1 WNWQ01000097.1 11196-14061      - |      2288 rna-gnl|WGS:WNWQ|BLS..
  VPS26 WNWQ01000360.1 14036-15722      - |      5538 rna-gnl|WGS:WNWQ|BLS..
   VTC1 WNWQ01000130.1 20691-21836      - |      2847 rna-gnl|WGS:WNWQ|BLS..
   YAH1 WNWQ01000503.1 20116-21329      - |      6634 rna-gnl|WGS:WNWQ|BLS..
   YPT5 WNWQ01000426.1  9277-10655      - |      6058 rna-gnl|WGS:WNWQ|BLS..
  -------
  seqinfo: 6349 sequences from ASM976944v1 genome


gb33<-getPromoterSeq(tub3,genome3,upstream=500,downstream=3)

Error in loadFUN(x, seqname, ranges) : 
  trying to load regions beyond the boundaries of non-circular sequence
  "BLS_5"
In addition: Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 323 out-of-bound ranges located on sequences
  WNWQ01000088.1, WNWQ01001000.1, WNWQ01001003.1, WNWQ01001004.1,
  WNWQ01001005.1, WNWQ01001014.1, WNWQ01001023.1, WNWQ01001027.1,
  WNWQ01001040.1, WNWQ01001047.1, WNWQ01000105.1, WNWQ01001059.1,
  WNWQ01001061.1, WNWQ01001063.1, WNWQ01001069.1, WNWQ01001076.1,
  WNWQ01001087.1, WNWQ01000109.1, WNWQ01001109.1, WNWQ01001110.1,
  WNWQ01001115.1, WNWQ01001116.1, WNWQ01001120.1, WNWQ01001125.1,
  WNWQ01001127.1, WNWQ01001132.1, WNWQ01001133.1, WNWQ01001136.1,
  WNWQ01001144.1, WNWQ01001152.1, WNWQ01001153.1, WNWQ01001156.1,
  WNWQ01001158.1, WNWQ01000116.1, WNWQ01001161.1, WNWQ01001163.1,
  WNWQ01000119.1, WNWQ01000120.1, WNWQ01001205.1, WNWQ01001206.1,
  WNWQ01001213.1, WNWQ01001217.1, WNWQ01001223.1, WNWQ01001230.1,
  WNWQ01001249.1, WNWQ01001252.1, WNWQ01001254.1, WNWQ01001261.1,
  WNWQ01001265.1, WNWQ01000127.1, WNWQ01001271.1, WNWQ01001272.1,
  WNWQ01001281.1, WNWQ01001286.1, WNWQ01001291.1, WNWQ01000013.1,
  WNWQ0 [... truncated]



## manual workaround ##
## check each scaffold for out-of-bound ranges, then batch remove
seqlengths(genome3)['WNWQ01001115.1']
WNWQ01001115.1 
          **9687**
ss1<-subset(tub3,seqnames == 'WNWQ01001115.1')
ss1

GRanges object with 1 range and 2 metadata columns:
                   seqnames     ranges strand |     tx_id
                      <Rle>  <IRanges>  <Rle> | <integer>
  BLS_000622 WNWQ01001115.1 8561-**10140**      - |      9290
                            tx_name
                        <character>
  BLS_000622 rna-gnl|WGS:WNWQ|BLS..
  -------
  seqinfo: 6349 sequences from ASM976944v1 genome

BLS_000622 extends past the scaffold seqlength, and needs filtered. I want to replicate this approach for the remaining 323 out-of-bound genomic ranges

GenomicRanges • 364 views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 6 days ago
United States

How to remove out of bound rows from granges

ADD COMMENT

Login before adding your answer.

Traffic: 473 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6