Dear all,
I expect AnnotationHub resource for ensembl v71 and a direct download using rtracklayer should give identical results. The code snipped below shows that there are some gene_identifiers missing in the AnnotationHub extracted data. Is there some additional processing done in the AnnotationHub recipe for ensembl v71? The last section of the code shows there is a mismatch in annotation between the two resources. Does AH7666 really reflects Homo_sapiens.GRCh37.71.gtf?
Kind regards,
Maarten
ah <- AnnotationHub() updating metadata: retrieving 1 resource |======================================================================| 100% snapshotDate(): 2016-10-11 > query(ah, c("ensembl", "sapiens", "71")) AnnotationHub with 7 records # snapshotDate(): 2016-10-11 # $dataprovider: Ensembl # $species: Homo sapiens # $rdataclass: FaFile, GRanges # additional mcols(): taxonomyid, genome, description, # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, # sourceurl, sourcetype # retrieve records with, e.g., 'object[["AH7666"]]' title AH7666 | Homo_sapiens.GRCh37.71.gtf AH11241 | Homo_sapiens.GRCh37.71.cdna.all.fa AH11242 | Homo_sapiens.GRCh37.71.dna_rm.toplevel.fa AH11243 | Homo_sapiens.GRCh37.71.dna_sm.toplevel.fa AH11244 | Homo_sapiens.GRCh37.71.dna.toplevel.fa AH11245 | Homo_sapiens.GRCh37.71.ncrna.fa AH11246 | Homo_sapiens.GRCh37.71.pep.all.fa > gtfAH <- ah[["AH7666"]] require(“GenomicRanges”) downloading from ‘https://annotationhub.bioconductor.org/fetch/7666’ retrieving 1 resource |======================================================================| 100% using guess work to populate seqinfo > library(rtracklayer) > gtfRT <- import(metadata(gtfAH)$`Data Source`) trying URL 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz' ftp data connection made, file length 26733273 bytes ================================================== downloaded 25.5 MB > ensRT <- mcols(gtfRT)$gene_id > ensAH <- mcols(gtfAH)$gene_id > str(ensRT) chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ... > str(ensAH) chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ... > length(unique(ensRT)) [1] 61893 > length(unique(ensAH)) [1] 37482 > head(setdiff(ensRT, ensAH)) [1] "ENSG00000236896" "ENSG00000212521" "ENSG00000095380" "ENSG00000106785" [5] "ENSG00000106789" "ENSG00000095383" > which(mcols(gtfAH)$gene_id == "ENSG00000236896") integer(0) > gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"] GRanges object with 2 ranges and 12 metadata columns: seqnames ranges strand | source type score <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> [1] 9 [100749802, 100749938] - | antisense exon <NA> [2] 9 [100748833, 100749097] - | antisense exon <NA> phase gene_id transcript_id exon_number gene_name <integer> <character> <character> <character> <character> [1] <NA> ENSG00000236896 ENST00000411981 1 RP11-535C21.3 [2] <NA> ENSG00000236896 ENST00000411981 2 RP11-535C21.3 gene_biotype transcript_name exon_id protein_id <character> <character> <character> <character> [1] antisense RP11-535C21.3-001 ENSE00001610477 <NA> [2] antisense RP11-535C21.3-001 ENSE00001796798 <NA> ------- seqinfo: 244 sequences from an unspecified genome; no seqlengths > gr <- gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"] > subsetByOverlaps(gr, gtfAH) GRanges object with 2 ranges and 12 metadata columns: seqnames ranges strand | source type score <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> [1] 9 [100749802, 100749938] - | antisense exon <NA> [2] 9 [100748833, 100749097] - | antisense exon <NA> phase gene_id transcript_id exon_number gene_name <integer> <character> <character> <character> <character> [1] <NA> ENSG00000236896 ENST00000411981 1 RP11-535C21.3 [2] <NA> ENSG00000236896 ENST00000411981 2 RP11-535C21.3 gene_biotype transcript_name exon_id protein_id <character> <character> <character> <character> [1] antisense RP11-535C21.3-001 ENSE00001610477 <NA> [2] antisense RP11-535C21.3-001 ENSE00001796798 <NA> ------- seqinfo: 244 sequences from an unspecified genome; no seqlengths > sessionInfo() R Under development (unstable) (2016-08-25 r71150) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] rtracklayer_1.34.1 GenomicRanges_1.26.2 GenomeInfoDb_1.10.2 [4] IRanges_2.8.1 S4Vectors_0.12.1 AnnotationHub_2.6.4 [7] BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 compiler_3.4.0 [3] BiocInstaller_1.24.0 XVector_0.14.0 [5] bitops_1.0-6 tools_3.4.0 [7] zlibbioc_1.20.0 digest_0.6.11 [9] RSQLite_1.1-2 memoise_1.0.0 [11] lattice_0.20-34 Matrix_1.2-7.1 [13] shiny_1.0.0 DBI_0.5-1 [15] curl_2.3 yaml_2.1.14 [17] httr_1.2.1 Biostrings_2.42.1 [19] grid_3.4.0 Biobase_2.34.0 [21] R6_2.2.0 AnnotationDbi_1.36.1 [23] XML_3.98-1.5 BiocParallel_1.8.1 [25] Rsamtools_1.26.1 htmltools_0.3.5 [27] GenomicAlignments_1.10.0 SummarizedExperiment_1.4.0 [29] mime_0.5 interactiveDisplayBase_1.12.0 [31] xtable_1.8-2 httpuv_1.3.3 [33] RCurl_1.95-4.8
I don't know if the
GRanges
for gtf files are rebuilt regularly, I guess not, so the only explanation would be that the version ofrtracklayer
at the time theGRanges
was built did not import all of the data from the GTF. I don't believe that Ensembl changes there files and adds new content to old releases (in fact, the GTF file was created on March 26 2016).