Hi
Given a GRangesList object grl
, running
lapply( grl, length )
is very slow. Why is that? What should I do instead?
Background and example: I want to get, for all human genes, the range that their transcripts cover. So I did
> library( GenomicRanges ) > library( TxDb.Hsapiens.UCSC.hg19.knownGene ) > transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene ) > genes <- reduce( transcripts )
I was happy to notice that reduce seems to do what I want:
> genes GRangesList object of length 23459: $1 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr19 [58858172, 58874214] - $10 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand [1] chr8 [18248755, 18258723] + $100 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand [1] chr20 [43248163, 43280376] - ... <23456 more elements> ------- seqinfo: 93 sequences (1 circular) from hg19 genome
However, I want a GRanges object, not a GRangesList with each list element containing only a single range. Hence, I used unlist:
> unlist( genes ) GRanges object with 25004 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> 1 chr19 [ 58858172, 58874214] - 10 chr8 [ 18248755, 18258723] + 100 chr20 [ 43248163, 43280376] - 1000 chr18 [ 25530930, 25757445] - 10000 chr1 [243651535, 244006886] - ... ... ... ... 9991 chr9 [114979995, 115095944] - 9992 chr21 [ 35736323, 35743440] + 9993 chr22 [ 19023795, 19109967] - 9994 chr6 [ 90539619, 90584155] + 9997 chr22 [ 50961997, 50964905] - ------- seqinfo: 93 sequences (1 circular) from hg19 genome However, the GenomicRange returned by unlist contains more rows than genes had list elements. Some of the list elements must have contained more than one range. I would like to do > lengths <- lapply( genes, length ) > which( lengths > 1 )
However, the lapply taskes ages. Why?
Cheers
Simon
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [2] GenomicFeatures_1.22.8 [3] AnnotationDbi_1.32.3 [4] Biobase_2.30.0 [5] GenomicRanges_1.22.3 [6] GenomeInfoDb_1.6.1 [7] IRanges_2.4.6 [8] S4Vectors_0.8.7 [9] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] XVector_0.10.0 zlibbioc_1.16.0 [3] GenomicAlignments_1.6.3 BiocParallel_1.4.3 [5] tools_3.2.2 SummarizedExperiment_1.0.2 [7] DBI_0.3.1 lambda.r_1.1.7 [9] futile.logger_1.4.1 rtracklayer_1.30.1 [11] futile.options_1.0.0 bitops_1.0-6 [13] RCurl_1.95-4.7 biomaRt_2.26.1 [15] RSQLite_1.0.0 Biostrings_2.38.3 [17] Rsamtools_1.22.0 XML_3.98-1.3
BTW, I know by now why I had more than one range in some of the object returned by reduce: that was simply because the original object contained alignments to alternate MHC versions and the like. For example:
But I still would like to know why lapply is so slow, and how to better inspect the elements in a GRangesList object.
You could try
elementLengths(grl)
for this case.