Can anyone provide a better way to remove/trim out-of-bounds genomic ranges?
I have 6349 scaffolds in the reference genome. I want 500 bp of upstream sequence from all start codons; many predicted genes are on scaffolds without 500bp of upstream sequence that need removed/filtered from the GRanges object. Ideally, I want only the genes that have 500bp of upstream sequence. I am not having any issues removing out-of-bound genomic ranges with the forward strand, but I need to remove predicted genes on the negative strand that extend past the scaffold end limits. I can remove them one at a time, but this is not practical for 323 genomic ranges/genes. I want to add 500bp to each 'end' range then trim the genes off that extend past the scaffold seqlength on the negative strand. See code below:
tb3<-transcriptsBy(tx3,by="gene")
tb3<-unlist(tb3)
tb3
GRanges object with 10189 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
60S_1 WNWQ01000464.1 3717-4775 - | 6338 rna-gnl|WGS:WNWQ|BLS..
60S_2 WNWQ01000622.1 7540-8266 + | 7310 rna-gnl|WGS:WNWQ|BLS..
60S_3 WNWQ01000073.1 39459-40199 - | 1864 rna-gnl|WGS:WNWQ|BLS..
ACC1 WNWQ01000188.1 6555-13575 - | 3682 rna-gnl|WGS:WNWQ|BLS..
ACR2 WNWQ01000227.1 6921-10580 - | 4167 rna-gnl|WGS:WNWQ|BLS..
... ... ... ... . ... ...
YME2 WNWQ01000051.1 31504-34100 + | 1422 rna-gnl|WGS:WNWQ|BLS..
YOP1 WNWQ01000685.1 16975-17921 + | 7660 rna-gnl|WGS:WNWQ|BLS..
YPT1 WNWQ01000164.1 3943-4620 + | 3344 rna-gnl|WGS:WNWQ|BLS..
YPT2 WNWQ01000058.1 64153-64901 + | 1582 rna-gnl|WGS:WNWQ|BLS..
YPT5 WNWQ01000426.1 9277-10155 - | 6058 rna-gnl|WGS:WNWQ|BLS..
-------
seqinfo: 6349 sequences from ASM976944v1 genome
sub3<-subset(tb3,tb3@ranges@start > 502 & strand == "+") # 0 out-of-bound ranges
sub3
GRanges object with 4983 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
60S_2 WNWQ01000622.1 7540-8266 + | 7310 rna-gnl|WGS:WNWQ|BLS..
ACU8 WNWQ01000193.1 33583-35574 + | 3731 rna-gnl|WGS:WNWQ|BLS..
ADK1 WNWQ01000028.1 27957-28941 + | 929 rna-gnl|WGS:WNWQ|BLS..
AKR1 WNWQ01001397.1 4341-5927 + | 9792 rna-gnl|WGS:WNWQ|BLS..
ALC1 WNWQ01000768.1 12581-15782 + | 8067 rna-gnl|WGS:WNWQ|BLS..
... ... ... ... . ... ...
YHM2 WNWQ01000269.1 19271-20502 + | 4643 rna-gnl|WGS:WNWQ|BLS..
YME2 WNWQ01000051.1 31504-34100 + | 1422 rna-gnl|WGS:WNWQ|BLS..
YOP1 WNWQ01000685.1 16975-17921 + | 7660 rna-gnl|WGS:WNWQ|BLS..
YPT1 WNWQ01000164.1 3943-4620 + | 3344 rna-gnl|WGS:WNWQ|BLS..
YPT2 WNWQ01000058.1 64153-64901 + | 1582 rna-gnl|WGS:WNWQ|BLS..
gb3<-getPromoterSeq(sub3,genome3,upstream=500,downstream=3)
gb3
DNAStringSet object of length 4983:
width seq names
[1] 503 GACCATCACTCTTCGTGATCTT...TCGCCAAATCTACTAACAATG 60S_2
[2] 503 AAGACAGGGCTTCTCGAATAAG...TACACCCACCAATTCGCCATG ACU8
[3] 503 GATTTGGCGAATGTTGCAGTGA...ACACTTCTTTCATACACAATG ADK1
[4] 503 AGTGACAGTTCATAAACTTAAG...TGCAGCTCGCTCCACACAATG AKR1
[5] 503 GTAACTCTGCACAAGGATCCTA...CAGGAACAAGAACCAATTATG ALC1
... ... ...
[4979] 503 AGTCCGAAAGGTGGCTTTGAAG...AGCAACCAACTCCAGAGAATG YHM2
[4980] 503 GAGTCTAGGTCGGAGCGATGGA...TATTGCCTTGCCAACATCATG YME2
[4981] 503 TGACGAGTAGGGGAACATGTGC...ACCACAAAATCTCCCGCCATG YOP1
[4982] 503 CACGATAAGGCTTCCTGCTGCC...TGCTTCTTCGATTCGCCGATG YPT1
[4983] 503 CGCTGGAAGGAATGGCGGTTGG...TTCACATACCCTGACACCATG YPT2
tub3<-subset(tb3, strand == "-")
end(tub3)<-end(tub3)+500
tub3
GRanges object with 5045 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
60S_1 WNWQ01000464.1 3717-5275 - | 6338 rna-gnl|WGS:WNWQ|BLS..
60S_3 WNWQ01000073.1 39459-40699 - | 1864 rna-gnl|WGS:WNWQ|BLS..
ACC1 WNWQ01000188.1 6555-14075 - | 3682 rna-gnl|WGS:WNWQ|BLS..
ACR2 WNWQ01000227.1 6921-11080 - | 4167 rna-gnl|WGS:WNWQ|BLS..
ADE1 WNWQ01000248.1 21892-23690 - | 4423 rna-gnl|WGS:WNWQ|BLS..
... ... ... ... . ... ...
VPS1 WNWQ01000097.1 11196-14061 - | 2288 rna-gnl|WGS:WNWQ|BLS..
VPS26 WNWQ01000360.1 14036-15722 - | 5538 rna-gnl|WGS:WNWQ|BLS..
VTC1 WNWQ01000130.1 20691-21836 - | 2847 rna-gnl|WGS:WNWQ|BLS..
YAH1 WNWQ01000503.1 20116-21329 - | 6634 rna-gnl|WGS:WNWQ|BLS..
YPT5 WNWQ01000426.1 9277-10655 - | 6058 rna-gnl|WGS:WNWQ|BLS..
-------
seqinfo: 6349 sequences from ASM976944v1 genome
gb33<-getPromoterSeq(tub3,genome3,upstream=500,downstream=3)
Error in loadFUN(x, seqname, ranges) :
trying to load regions beyond the boundaries of non-circular sequence
"BLS_5"
In addition: Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 323 out-of-bound ranges located on sequences
WNWQ01000088.1, WNWQ01001000.1, WNWQ01001003.1, WNWQ01001004.1,
WNWQ01001005.1, WNWQ01001014.1, WNWQ01001023.1, WNWQ01001027.1,
WNWQ01001040.1, WNWQ01001047.1, WNWQ01000105.1, WNWQ01001059.1,
WNWQ01001061.1, WNWQ01001063.1, WNWQ01001069.1, WNWQ01001076.1,
WNWQ01001087.1, WNWQ01000109.1, WNWQ01001109.1, WNWQ01001110.1,
WNWQ01001115.1, WNWQ01001116.1, WNWQ01001120.1, WNWQ01001125.1,
WNWQ01001127.1, WNWQ01001132.1, WNWQ01001133.1, WNWQ01001136.1,
WNWQ01001144.1, WNWQ01001152.1, WNWQ01001153.1, WNWQ01001156.1,
WNWQ01001158.1, WNWQ01000116.1, WNWQ01001161.1, WNWQ01001163.1,
WNWQ01000119.1, WNWQ01000120.1, WNWQ01001205.1, WNWQ01001206.1,
WNWQ01001213.1, WNWQ01001217.1, WNWQ01001223.1, WNWQ01001230.1,
WNWQ01001249.1, WNWQ01001252.1, WNWQ01001254.1, WNWQ01001261.1,
WNWQ01001265.1, WNWQ01000127.1, WNWQ01001271.1, WNWQ01001272.1,
WNWQ01001281.1, WNWQ01001286.1, WNWQ01001291.1, WNWQ01000013.1,
WNWQ0 [... truncated]
## manual workaround ##
## check each scaffold for out-of-bound ranges, then batch remove
seqlengths(genome3)['WNWQ01001115.1']
WNWQ01001115.1
**9687**
ss1<-subset(tub3,seqnames == 'WNWQ01001115.1')
ss1
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <integer>
BLS_000622 WNWQ01001115.1 8561-**10140** - | 9290
tx_name
<character>
BLS_000622 rna-gnl|WGS:WNWQ|BLS..
-------
seqinfo: 6349 sequences from ASM976944v1 genome
BLS_000622 extends past the scaffold seqlength, and needs filtered. I want to replicate this approach for the remaining 323 out-of-bound genomic ranges