mismatch between AnnotationHub ensembl v71 hsapiens and direct download using rtracklayer
2
1
Entering edit mode
mviterson ▴ 10
@mviterson-6917
Last seen 7.3 years ago
Netherlands

Dear all,

I expect AnnotationHub resource for ensembl v71 and a direct download using rtracklayer should give identical results. The code snipped below shows that there are some gene_identifiers missing in the AnnotationHub extracted data. Is there some additional processing done in the AnnotationHub recipe for ensembl v71? The last section of the code shows there is a mismatch in annotation between the two resources. Does AH7666  really reflects Homo_sapiens.GRCh37.71.gtf?

Kind regards,

Maarten

ah <- AnnotationHub()
updating metadata: retrieving 1 resource
  |======================================================================| 100%
snapshotDate(): 2016-10-11
> query(ah, c("ensembl", "sapiens", "71"))
AnnotationHub with 7 records
# snapshotDate(): 2016-10-11
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: FaFile, GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH7666"]]'

            title                                    
  AH7666  | Homo_sapiens.GRCh37.71.gtf               
  AH11241 | Homo_sapiens.GRCh37.71.cdna.all.fa       
  AH11242 | Homo_sapiens.GRCh37.71.dna_rm.toplevel.fa
  AH11243 | Homo_sapiens.GRCh37.71.dna_sm.toplevel.fa
  AH11244 | Homo_sapiens.GRCh37.71.dna.toplevel.fa   
  AH11245 | Homo_sapiens.GRCh37.71.ncrna.fa          
  AH11246 | Homo_sapiens.GRCh37.71.pep.all.fa        
> gtfAH <- ah[["AH7666"]]
require(“GenomicRanges”)
downloading from ‘https://annotationhub.bioconductor.org/fetch/7666’
retrieving 1 resource
  |======================================================================| 100%
using guess work to populate seqinfo
> library(rtracklayer)
> gtfRT <- import(metadata(gtfAH)$`Data Source`)
trying URL 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz'
ftp data connection made, file length 26733273 bytes
==================================================
downloaded 25.5 MB

> ensRT <- mcols(gtfRT)$gene_id
> ensAH <- mcols(gtfAH)$gene_id
> str(ensRT)
 chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> str(ensAH)
 chr [1:2253155] "ENSG00000261516" "ENSG00000261516" "ENSG00000261516" ...
> length(unique(ensRT))
[1] 61893
> length(unique(ensAH))
[1] 37482
> head(setdiff(ensRT, ensAH))
[1] "ENSG00000236896" "ENSG00000212521" "ENSG00000095380" "ENSG00000106785"
[5] "ENSG00000106789" "ENSG00000095383"
> which(mcols(gtfAH)$gene_id == "ENSG00000236896")
integer(0)

> gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
GRanges object with 2 ranges and 12 metadata columns:
      seqnames                 ranges strand |    source     type     score
         <Rle>              <IRanges>  <Rle> |  <factor> <factor> <numeric>
  [1]        9 [100749802, 100749938]      - | antisense     exon      <NA>
  [2]        9 [100748833, 100749097]      - | antisense     exon      <NA>
          phase         gene_id   transcript_id exon_number     gene_name
      <integer>     <character>     <character> <character>   <character>
  [1]      <NA> ENSG00000236896 ENST00000411981           1 RP11-535C21.3
  [2]      <NA> ENSG00000236896 ENST00000411981           2 RP11-535C21.3
      gene_biotype   transcript_name         exon_id  protein_id
       <character>       <character>     <character> <character>
  [1]    antisense RP11-535C21.3-001 ENSE00001610477        <NA>
  [2]    antisense RP11-535C21.3-001 ENSE00001796798        <NA>
  -------
  seqinfo: 244 sequences from an unspecified genome; no seqlengths
> gr <- gtfRT[mcols(gtfRT)$gene_id == "ENSG00000236896"]
> subsetByOverlaps(gr, gtfAH)
GRanges object with 2 ranges and 12 metadata columns:
      seqnames                 ranges strand |    source     type     score
         <Rle>              <IRanges>  <Rle> |  <factor> <factor> <numeric>
  [1]        9 [100749802, 100749938]      - | antisense     exon      <NA>
  [2]        9 [100748833, 100749097]      - | antisense     exon      <NA>
          phase         gene_id   transcript_id exon_number     gene_name
      <integer>     <character>     <character> <character>   <character>
  [1]      <NA> ENSG00000236896 ENST00000411981           1 RP11-535C21.3
  [2]      <NA> ENSG00000236896 ENST00000411981           2 RP11-535C21.3
      gene_biotype   transcript_name         exon_id  protein_id
       <character>       <character>     <character> <character>
  [1]    antisense RP11-535C21.3-001 ENSE00001610477        <NA>
  [2]    antisense RP11-535C21.3-001 ENSE00001796798        <NA>
  -------
  seqinfo: 244 sequences from an unspecified genome; no seqlengths

> sessionInfo()
R Under development (unstable) (2016-08-25 r71150)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] rtracklayer_1.34.1   GenomicRanges_1.26.2 GenomeInfoDb_1.10.2
[4] IRanges_2.8.1        S4Vectors_0.12.1     AnnotationHub_2.6.4
[7] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                   compiler_3.4.0               
 [3] BiocInstaller_1.24.0          XVector_0.14.0               
 [5] bitops_1.0-6                  tools_3.4.0                  
 [7] zlibbioc_1.20.0               digest_0.6.11                
 [9] RSQLite_1.1-2                 memoise_1.0.0                
[11] lattice_0.20-34               Matrix_1.2-7.1               
[13] shiny_1.0.0                   DBI_0.5-1                    
[15] curl_2.3                      yaml_2.1.14                  
[17] httr_1.2.1                    Biostrings_2.42.1            
[19] grid_3.4.0                    Biobase_2.34.0               
[21] R6_2.2.0                      AnnotationDbi_1.36.1         
[23] XML_3.98-1.5                  BiocParallel_1.8.1           
[25] Rsamtools_1.26.1              htmltools_0.3.5              
[27] GenomicAlignments_1.10.0      SummarizedExperiment_1.4.0   
[29] mime_0.5                      interactiveDisplayBase_1.12.0
[31] xtable_1.8-2                  httpuv_1.3.3                 
[33] RCurl_1.95-4.8              

 

 

annotationhub • 1.6k views
ADD COMMENT
0
Entering edit mode

I don't know if the GRanges for gtf files are rebuilt regularly, I guess not, so the only explanation would be that the version of rtracklayer at the time the GRanges was built did not import all of the data from the GTF. I don't believe that Ensembl changes there files and adds new content to old releases (in fact, the GTF file was created on March 26 2016).

ADD REPLY
0
Entering edit mode
mviterson ▴ 10
@mviterson-6917
Last seen 7.3 years ago
Netherlands

The mismatch between annotationhub and direct download using rtracklayer does not only involve ensembl gtf v71. For example, I got similar results for

ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz

Furthermore, all possible snapshot dates gave showed the same mismatch? Is there something wrong with the conversion of AnnotationHub from GTF to GRanges? I try to find the code of this but could not find it yet!

 

ADD COMMENT
0
Entering edit mode

have you tried that with something more recent too, e.g. Ensembl 84 or 85?

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

Hi,

Prior to July 2016, the GTF files from Ensembl were pre-built and stored in S3 (versions <=83). Once built they are not updated unless a bug is reported. The date of the GTF 71 files in the S3 bucket is November 2013.

As of July 2016, there is no pre-building of the GTF (versions >=84) The metadata in AnnotationHub points to the Ensembl url, the file is downloaded and converted to GTF on the fly. The processing happens in the .get1,GTFFileResource-method in AnnotationHub.

In this code chunk, 'yy' is the url and AnnotationHub:::.tidyGRanges() adds metadata to the GRanges, cleans up the seqinfo and does some sorting.

setMethod(".get1", "GTFFileResource",
    function(x, ...)
{
    message("Importing File into R ..")
    .require("rtracklayer")
    .require("GenomeInfoDb")
    yy <- getHub(x)
    gtf <- rtracklayer::import(cache(yy), format="gtf", genome=yy$genome, ...)
    .tidyGRanges(x, gtf)
})

If the Ensembl 71 GTF files were built in Nov 2013 you'd need to look at the AnnotationHubData (not AnnotationHub) code in Bioconductor 2.12 to see how the files were processed. As you say, something may have changed in rtrackalyer or it could be how rtracklayer::import() was called from the AnnotationHubData recipe.

As Jo mentioned, it would be interesting to see if you find inconsistencies with more current versions. I'm open to modifying the .get1() method if data that should be included are being dropped.

Valerie

ADD COMMENT
0
Entering edit mode

Thank you Valerie for the explanation. I will check other Ensembl version as suggested by Johannes.

Just a thought; would it be possible for all version of the Ensembl gtf's to be build on the fly?
 

ADD REPLY
0
Entering edit mode

Single-square-bracket subsetting a hub object returns information about the resource, including it's source url. Use that for direct import, rtracklayer::import(ah["AH7666"]$sourceurl)

ADD REPLY
0
Entering edit mode

Thanks this is a really useful to know!

ADD REPLY

Login before adding your answer.

Traffic: 495 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6