Hi
I wonder if the following might be a bug.
As in my previous post, I start with
> library( GenomicRanges ) > library( TxDb.Hsapiens.UCSC.hg19.knownGene ) > transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )
This object contains all kinds of extra contigs:
> seqlevels(transcripts) [1] "chr1" "chr2" "chr3" [4] "chr4" "chr5" "chr6" [7] "chr7" "chr8" "chr9" [...] [22] "chr22" "chrX" "chrY" [25] "chrM" "chr1_gl000191_random" "chr1_gl000192_random" [28] "chr4_ctg9_hap1" "chr4_gl000193_random" "chr4_gl000194_random" [31] "chr6_apd_hap1" "chr6_cox_hap2" "chr6_dbb_hap3" [34] "chr6_mann_hap4" "chr6_mcf_hap5" "chr6_qbl_hap6" [37] "chr6_ssto_hap7" "chr7_gl000195_random" "chr8_gl000196_random" [...]
I keep only the useful ones
> realchroms <- seqlevels(transcripts)[1:25] > realchroms [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
For a GRanges object, I can now remove all ranges on the extra contigs. For example (using a gene from the HLA):
> hlab <- transcripts[["3106"]] > hlab GRanges object with 36 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr6 [31237743, 31324989] - | 26240 uc031snl.1 [2] chr6 [31238850, 31324989] - | 26241 uc003ntf.2 [3] chr6 [31321649, 31324450] - | 26243 uc003ntg.1 [4] chr6 [31321649, 31324989] - | 26244 uc003nth.2 [5] chr6 [31322256, 31323369] - | 26245 uc003nti.1 ... ... ... ... ... ... ... [32] chr6_ssto_hap7 [2655408, 2658208] - | 82522 uc011jii.2 [33] chr6_ssto_hap7 [2655408, 2658748] - | 82523 uc011jij.2 [34] chr6_ssto_hap7 [2656015, 2657128] - | 82524 uc011jik.1 [35] chr6_ssto_hap7 [2656853, 2657905] - | 82525 uc011jil.1 [36] chr6_ssto_hap7 [2657058, 2658492] - | 82526 uc011jim.1 ------- seqinfo: 93 sequences (1 circular) from hg19 genome > seqlevels( hlab, force=TRUE ) <- realchroms > hlab GRanges object with 7 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr6 [31237743, 31324989] - | 26240 uc031snl.1 [2] chr6 [31238850, 31324989] - | 26241 uc003ntf.2 [3] chr6 [31321649, 31324450] - | 26243 uc003ntg.1 [4] chr6 [31321649, 31324989] - | 26244 uc003nth.2 [5] chr6 [31322256, 31323369] - | 26245 uc003nti.1 [6] chr6 [31323094, 31324145] - | 26246 uc010jsn.1 [7] chr6 [31323299, 31324734] - | 26247 uc010jso.2 ------- seqinfo: 25 sequences (1 circular) from hg19 genome
If I do the same on the whole GRangesList object, however, the whole gene vanishes, not only the ranges those from the alternative contigs:
> seqlevels( transcripts, force=TRUE ) <- realchroms > transcripts GRangesList object of length 23191: $1 GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr19 [58858172, 58864865] - | 70455 uc002qsd.4 [2] chr19 [58859832, 58874214] - | 70456 uc002qsf.2 $10 GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name [1] chr8 [18248755, 18258723] + | 31944 uc003wyw.1 $100 GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name [1] chr20 [43248163, 43280376] - | 72132 uc002xmj.3 ... <23188 more elements> ------- seqinfo: 25 sequences (1 circular) from hg19 genome > transcripts[["3106"]] NULL
So: Is this intended or a bug? And: How do I now get rid of the extra sequences?
Thanks
Simon
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [2] GenomicFeatures_1.22.8 [3] AnnotationDbi_1.32.3 [4] Biobase_2.30.0 [5] GenomicRanges_1.22.3 [6] GenomeInfoDb_1.6.1 [7] IRanges_2.4.6 [8] S4Vectors_0.8.7 [9] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] XVector_0.10.0 zlibbioc_1.16.0 [3] GenomicAlignments_1.6.3 BiocParallel_1.4.3 [5] tools_3.2.2 SummarizedExperiment_1.0.2 [7] DBI_0.3.1 lambda.r_1.1.7 [9] futile.logger_1.4.1 rtracklayer_1.30.1 [11] futile.options_1.0.0 bitops_1.0-6 [13] RCurl_1.95-4.7 biomaRt_2.26.1 [15] RSQLite_1.0.0 Biostrings_2.38.3 [17] Rsamtools_1.22.0 XML_3.98-1.3
Looks like a bug to me, but just wanted to point out that
keepStandardChromosomes(transcripts)
might be more convenient.Thanks. This is more convenient.
(How would I have found about this, by the way? I mean, if I started reading from the GenomicRanges and the AnnotationDbi vignettes.)
Probably best to start reading the high-level workflows, like this one, instead of individual package vignettes, which will only discuss features of that package. There are over a dozen packages constituting the infrastructure now.