Question

GRangesList and `seqlevels<-` with force=TRUE

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 4.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi

I wonder if the following might be a bug.

As in my previous post, I start with

> library( GenomicRanges )
> library( TxDb.Hsapiens.UCSC.hg19.knownGene )
> transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )

This object contains all kinds of extra contigs:

> seqlevels(transcripts)
 [1] "chr1"                  "chr2"                  "chr3"                 
 [4] "chr4"                  "chr5"                  "chr6"                 
 [7] "chr7"                  "chr8"                  "chr9"                 
[...]
[22] "chr22"                 "chrX"                  "chrY"                 
[25] "chrM"                  "chr1_gl000191_random"  "chr1_gl000192_random" 
[28] "chr4_ctg9_hap1"        "chr4_gl000193_random"  "chr4_gl000194_random" 
[31] "chr6_apd_hap1"         "chr6_cox_hap2"         "chr6_dbb_hap3"        
[34] "chr6_mann_hap4"        "chr6_mcf_hap5"         "chr6_qbl_hap6"        
[37] "chr6_ssto_hap7"        "chr7_gl000195_random"  "chr8_gl000196_random" 
[...]

I keep only the useful ones

> realchroms <- seqlevels(transcripts)[1:25]
> realchroms
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM"

For a GRanges object, I can now remove all ranges on the extra contigs. For example (using a gene from the HLA):

> hlab <- transcripts[["3106"]]
> hlab
GRanges object with 36 ranges and 2 metadata columns:
             seqnames               ranges strand   |     tx_id     tx_name
                <Rle>            <IRanges>  <Rle>   | <integer> <character>
   [1]           chr6 [31237743, 31324989]      -   |     26240  uc031snl.1
   [2]           chr6 [31238850, 31324989]      -   |     26241  uc003ntf.2
   [3]           chr6 [31321649, 31324450]      -   |     26243  uc003ntg.1
   [4]           chr6 [31321649, 31324989]      -   |     26244  uc003nth.2
   [5]           chr6 [31322256, 31323369]      -   |     26245  uc003nti.1
   ...            ...                  ...    ... ...       ...         ...
  [32] chr6_ssto_hap7   [2655408, 2658208]      -   |     82522  uc011jii.2
  [33] chr6_ssto_hap7   [2655408, 2658748]      -   |     82523  uc011jij.2
  [34] chr6_ssto_hap7   [2656015, 2657128]      -   |     82524  uc011jik.1
  [35] chr6_ssto_hap7   [2656853, 2657905]      -   |     82525  uc011jil.1
  [36] chr6_ssto_hap7   [2657058, 2658492]      -   |     82526  uc011jim.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> seqlevels( hlab, force=TRUE ) <- realchroms
> hlab
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr6 [31237743, 31324989]      - |     26240  uc031snl.1
  [2]     chr6 [31238850, 31324989]      - |     26241  uc003ntf.2
  [3]     chr6 [31321649, 31324450]      - |     26243  uc003ntg.1
  [4]     chr6 [31321649, 31324989]      - |     26244  uc003nth.2
  [5]     chr6 [31322256, 31323369]      - |     26245  uc003nti.1
  [6]     chr6 [31323094, 31324145]      - |     26246  uc010jsn.1
  [7]     chr6 [31323299, 31324734]      - |     26247  uc010jso.2
  -------
  seqinfo: 25 sequences (1 circular) from hg19 genome

If I do the same on the whole GRangesList object, however, the whole gene vanishes, not only the ranges those from the alternative contigs:

> seqlevels( transcripts, force=TRUE ) <- realchroms
> transcripts
GRangesList object of length 23191:
$1
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [58858172, 58864865]      - |     70455  uc002qsd.4
  [2]    chr19 [58859832, 58874214]      - |     70456  uc002qsf.2

$10
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | tx_id    tx_name
  [1]     chr8 [18248755, 18258723]      + | 31944 uc003wyw.1

$100
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | tx_id    tx_name
  [1]    chr20 [43248163, 43280376]      - | 72132 uc002xmj.3

...
<23188 more elements>
-------
seqinfo: 25 sequences (1 circular) from hg19 genome

> transcripts[["3106"]]
NULL

So: Is this intended or a bug? And: How do I now get rid of the extra sequences?

Thanks

Simon

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.22.8                 
[3] AnnotationDbi_1.32.3                   
[4] Biobase_2.30.0                         
[5] GenomicRanges_1.22.3                   
[6] GenomeInfoDb_1.6.1                     
[7] IRanges_2.4.6                          
[8] S4Vectors_0.8.7                        
[9] BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0           
 [3] GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [5] tools_3.2.2                SummarizedExperiment_1.0.2
 [7] DBI_0.3.1                  lambda.r_1.1.7            
 [9] futile.logger_1.4.1        rtracklayer_1.30.1        
[11] futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.7             biomaRt_2.26.1            
[15] RSQLite_1.0.0              Biostrings_2.38.3         
[17] Rsamtools_1.22.0           XML_3.98-1.3

GRangesList GenomicRanges • 2.3k views

ADD COMMENT • link updated 9.2 years ago by Hervé Pagès 16k • written 9.2 years ago by Simon Anders ★ 3.8k

2

Entering edit mode

Looks like a bug to me, but just wanted to point out that keepStandardChromosomes(transcripts) might be more convenient.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks. This is more convenient.

(How would I have found about this, by the way? I mean, if I started reading from the GenomicRanges and the AnnotationDbi vignettes.)

ADD REPLY • link 9.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Probably best to start reading the high-level workflows, like this one, instead of individual package vignettes, which will only discuss features of that package. There are over a dozen packages constituting the infrastructure now.

ADD REPLY • link 9.2 years ago Michael Lawrence ★ 11k

score 2 · Answer 1 · 2016-01-26

Hi Simon,

The vanishing of the whole gene is intended. This is the semantic of the seqlevels() setter on List objects. See B. DROP SEQLEVELS OF A LIST-LIKE OBJECT in the examples section of ?seqlevels for more information. It explains how to get rid of feature components located on given chromosomes without getting rid of the whole feature.

The reason for the current semantic of seqlevels<- on List objects is to not break the structure of the feature. The structure is not important when the feature is a gene and the components are transcripts (your case) but VERY important when the feature is a transcript and the components are exons (e.g. exonsBy(txdb, "tx")).

Cheers,

H.