Question

Extract genomic features from a TxDb-like object

0

Entering edit mode

rbenel ▴ 40

@rbenel-13642

Last seen 2.7 years ago

Israel

Hi,

I am trying to extract promoter sequences from a TxDB like object. When I try the following lines of code, the GRange object that is formed does not have IDs because there is no use.name = T argument, which means when I try to extract the sequences there is no match. I know this because if I limit my promoter annotations to just the first output, promoters_txdb[1] and then extract the sequences I receive a DNA string set list.

library(GenomicFeatures)
library(GenomicRanges)
library(BSgenome.Athaliana.TAIR.TAIR9)
library(TxDb.Athaliana.BioMart.plantsmart28)
library(Biostrings)

# The entire athaliana genome
bs_athaliana <- BSgenome.Athaliana.TAIR.TAIR9 

# transcript DB information
txdb_athaliana <- TxDb.Athaliana.BioMart.plantsmart28

head(seqlevels(txdb_athaliana))

#all genes 
promoters_txdb <- transcriptsBy(txdb_athaliana, by = "gene")

#extract sequences 
promoter_seqs <- getPromoterSeq(promoters_txdb, bs_athaliana, upstream = 1000, downstream = 100)


Error:

Error in extractList(x, at) : some ranges are out of bounds

TxDB genomicfeatures promoter getpromoterseq • 2.4k views

ADD COMMENT • link updated 7.1 years ago by Hervé Pagès 16k • written 7.1 years ago by rbenel ▴ 40

0

Entering edit mode

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_IL.UTF-8       LC_NUMERIC=C               LC_TIME=en_IL.UTF-8        LC_COLLATE=en_IL.UTF-8    
 [5] LC_MONETARY=en_IL.UTF-8    LC_MESSAGES=en_IL.UTF-8    LC_PAPER=en_IL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Athaliana.BioMart.plantsmart28_3.2.2 BSgenome.Athaliana.TAIR.TAIR9_1.3.1000   
 [3] BSgenome_1.46.0                           rtracklayer_1.38.3                       
 [5] Biostrings_2.46.0                         XVector_0.18.0                           
 [7] GenomicFeatures_1.30.3                    AnnotationDbi_1.40.0                     
 [9] Biobase_2.38.0                            GenomicRanges_1.30.3                     
[11] GenomeInfoDb_1.14.0                       IRanges_2.12.0                           
[13] S4Vectors_0.16.0                          BiocGenerics_0.24.0                      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15               pillar_1.2.1               compiler_3.4.3             prettyunits_1.0.2         
 [5] bitops_1.0-6               tools_3.4.3                zlibbioc_1.24.0            progress_1.1.2            
 [9] biomaRt_2.35.11            digest_0.6.15              bit_1.1-12                 lattice_0.20-35           
[13] RSQLite_2.0                memoise_1.1.0              tibble_1.4.2               pkgconfig_2.0.1           
[17] rlang_0.2.0                Matrix_1.2-12              DelayedArray_0.4.1         DBI_0.8                   
[21] yaml_2.1.17                GenomeInfoDbData_1.0.0     stringr_1.3.0              httr_1.3.1                
[25] grid_3.4.3                 bit64_0.9-7                R6_2.2.2                   BiocParallel_1.12.0       
[29] XML_3.98-1.10              RMySQL_0.10.14             blob_1.1.0                 magrittr_1.5              
[33] matrixStats_0.53.1         GenomicAlignments_1.14.1   Rsamtools_1.30.0           SummarizedExperiment_1.8.1
[37] assertthat_0.2.0           stringi_1.1.6              RCurl_1.95-4.10

ADD REPLY • link 7.1 years ago rbenel ▴ 40

score 1 · Answer 1 · 2018-03-07

Hi,

Let's separate the 2 issues you seem to be having: (1) the use of different chromosome naming conventions for Athaliana between plantsmart28 and UCSC, and (2) getPromoterSeq() failing to extract out-of-bounds ranges.

Issue (1) can be addressed by setting the same sequence style on the 2 objects:

library(TxDb.Athaliana.BioMart.plantsmart28)
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library(BSgenome.Athaliana.TAIR.TAIR9)
genome <- BSgenome.Athaliana.TAIR.TAIR9
seqlevels(txdb)
# [1] "1"  "2"  "3"  "4"  "5"  "Mt" "Pt"
seqlevels(genome)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
seqlevelsStyle(txdb)
# [1] "Ensembl" "MSU6"
seqlevelsStyle(genome)
# [1] "TAIR9"
seqlevelsStyle(txdb) <- seqlevelsStyle(genome)
seqlevels(txdb)
# [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

Issue (2) is caused by getPromoterSeq() trying to extract sequences from beyond the bounds of the chromosomes, and failing to do so. getPromoterSeq() could certainly be improved to gracefully handle this situation. In the mean time, you can use the following workaround:

tx_by_gene <- transcriptsBy(txdb, by="gene")
prom_ranges <- promoters(tx_by_gene, upstream=1000, downstream=100)
# Warning messages about out-of-bound ranges...

## Let's take a look at some of those out-of-bound ranges:
prom_ranges[any(start(prom_ranges) < 1)]
# GRangesList object of length 1:
# $ATCG00010 
# GRanges object with 1 range and 2 metadata columns:
#       seqnames      ranges strand |     tx_id     tx_name
#          <Rle>   <IRanges>  <Rle> | <integer> <character>
#   [1]     ChrC [-23, 1076]      - |     41594 ATCG00010.1
# 
# -------
# seqinfo: 7 sequences (1 circular) from an unspecified genome

## getSeq() will fail to extract the corresponding sequences (which is
## actually what getPromoterSeq() tried to do internally):
getSeq(genome, prom_ranges)
# Error in extractList(x, at) : some ranges are out of bounds

## Let's trim the ranges in prom_ranges. Unfortunately there is no "trim"
## method for GRangesList objects, only for GRanges objects, so we use the
## unlist/relist trick to trim GRangesList object prom_ranges:
prom_ranges2 <- relist(trim(unlist(prom_ranges, use.names=FALSE)), prom_ranges)

## Then:
getSeq(genome, prom_ranges2)
# DNAStringSetList of length 33602
# [["AT1G01010"]] ATATTGCTATTTCTGCCAATATTAAAACTTCACTTAGGAAGACTTGAACCTACCACA...
# [["AT1G01020"]] AATATAAAATGCATTTTAATATTCAAAAGAATGCATGCATAGACTGATAGAAAAGAA...
# [["AT1G01030"]] TTGAACAAACAGACACGTATTGATTATAGTATTTCTGCTATTGATAAGTTTTTAAAC...
# [["AT1G01040"]] ATGCGTGTTGGATGCTTACAAAGAAGAATTTGTTTTCGGCGTTAGCCAAAAGTCGAT...
# [["AT1G01046"]] CGCGATTCTTTTTGGCAATGAGCTGGATGCAGAGGTTGTTGTTTCTCTCTCCATGGA...
# [["AT1G01050"]] GGTGTTGGAACATGCAAGAGTCTTAAAACGAGACAGACAGGATTTAAGCCATACAAG...
# [["AT1G01060"]] TGGTTATTTTGGAATAATTTCGGTTATTTCAATTAGATTCGGGTAGTTCAGTTCTTC...
# [["AT1G01070"]] GACATAATCTGCAAAGAATACGCAATATAATATTAGTAAATGAGTGGCAATTTCTCA...
# [["AT1G01073"]] AGTTCTAATTTTGCTTTCTGTTTTTAATTTAAATTTTGGATCAATAAGAAGAGACAT...
# [["AT1G01080"]] AAACAACCTTCTCTTTAATTTGAATTTTTTTTTGCAGCTGAGAAAGCCAAATACGCG...
# ...
# <33592 more elements>

Note that you end up with a DNAStringSetList object "parallel" to prom_ranges2, prom_ranges, and tx_by_gene i.e. with one list element per gene. Each list element contains one sequence per transcript in the gene.

I'm not sure I understood your comment about the use.names argument. This doesn't seem to have anything to do with issues (1) and (2).

Hope this helps,

H.