Hi all,
I try to use summarizeOverlaps function on a GRangesList and get the error below.
library(GenomicFeatures)
library(GenomicRanges)
library(Rsamtools)
library(GenomicAlignments)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb2 <- TxDb.Hsapiens.UCSC.hg38.knownGene
ex=exonsBy(txdb2,"gene")
> head(ex)
GRangesList object of length 6:
$1
GRanges object with 15 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 [58346806, 58347029] - | 264625 <NA>
[2] chr19 [58347353, 58347640] - | 264626 <NA>
[3] chr19 [58348466, 58349128] - | 264627 <NA>
[4] chr19 [58349568, 58350651] - | 264628 <NA>
[5] chr19 [58350370, 58350651] - | 264629 <NA>
... ... ... ... ... ... ...
[11] chr19 [58357585, 58357649] - | 264636 <NA>
[12] chr19 [58357952, 58358286] - | 264637 <NA>
[13] chr19 [58358489, 58358585] - | 264638 <NA>
[14] chr19 [58359197, 58359323] - | 264639 <NA>
[15] chr19 [58362677, 58362848] - | 264640 <NA>
bamdir <-"/home/adiatarca/arun/HG19BAMS/"
fls <- paste(bamdir,setdiff(dir(bamdir,pattern="*.bam"),dir(bamdir,pattern="*.bam.bai")),sep="")
bamlst <- BamFileList(fls,yieldSize=1000000)
register(MulticoreParam(workers = 20))
genehits <- summarizeOverlaps(ex, bamlst, mode="Union",singleEnd=TRUE, ignore.strand=FALSE)
Error in validObject(.Object) :
invalid class "SummarizedExperiment" object: 'rowRanges' length differs from 'assays' nrow
However, if the txdb2 object below would be obtained via
txdb2=makeTxDbFromUCSC(
genome="hg19",
tablename="knownGene")
the function does not fail. Unfortunately makeTxDbFromUCSC for knownGene does not seem to work with hg38.
Thanks for any suggestions,
Adi
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicAlignments_1.4.2
[2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.1.2
[3] Rsamtools_1.20.5
[4] Biostrings_2.36.4
[5] XVector_0.8.0
[6] GenomicFeatures_1.20.6
[7] AnnotationDbi_1.30.1
[8] Biobase_2.28.0
[9] GenomicRanges_1.20.8
[10] GenomeInfoDb_1.4.3
[11] IRanges_2.2.9
[12] S4Vectors_0.6.6
[13] BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 BiocParallel_1.2.22 tools_3.2.2
[4] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1
[7] rtracklayer_1.28.10 futile.options_1.0.0 bitops_1.0-6
[10] RCurl_1.95-4.7 biomaRt_2.24.1 RSQLite_1.0.0
[13] XML_3.98-1.3
Are your BAM files aligned against the hg19 or hg38 reference genome? Judging from the name of the data directory, it seems to be hg19. If this is the case, the two coordinate systems, i.e. reference genome coordinates, would differ, which may be the cause of the cryptic error message.
Thanks Julian, this sounds like a good reason for the error.
Glad to hear that it helped. If your alignments are for hg19, then you would want to compare it to the hg19 gene annotation anyway - which seems to work for you already. You may also ask the maintainers of the summarizeOverlaps function if they want to add a more informative error message, in case the genomes/seqlengths do not match.