Question

TxDb annotations for human chrM

0

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 6 days ago

Barcelona/Universitat Pompeu Fabra

Hi,

it seems that by default the human TxDb.* annotation packages do not have annotations for genes in the mitochondrial chromosome, while the Org.Hs.eg.db package includes those genes. Here is my exploration of this:

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

allMTGenesOrgDb <- select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns="CHR")
allMTGenesOrgDb <- as.character(allMTGenesOrgDb$ENTREZID[allMTGenesOrgDb$CHR %in% "MT"])
allMTGenesOrgDb
 [1] "4508" "4509" "4511" "4512" "4513" "4514" "4519" "4535" "4536" "4537" "4538" "4539" "4540" "4541" "4549"
[16] "4550" "4553" "4555" "4556" "4558" "4563" "4564" "4565" "4566" "4567" "4568" "4569" "4570" "4571" "4572"
[31] "4573" "4574" "4575" "4576" "4577" "4578" "4579"
txdb19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
allEntrezGenesTxDb19 <- keys(txdb19, ketype="GENEID")
any(allMTGenesOrgDb %in% allEntrezGenesTxDb19)
[1] FALSE

Since the raw annotation file at UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz) and the browser view (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chrM%3A1-16569) show annotations for the mitochondrial chromosome, I guess these annotations are being discarded or deactivated somehow from the TxDb.*. This is happening in current release (sessionInfo() shown below). I've googled about this but could not find a direct answer to the question, which is how can I access the annotations for chrM in the TxDb.* packages?

Thanks!

robert.

sessionInfo()

R version 3.1.1 Patched (2014-10-14 r66765)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.18.3                 
 [3] GenomicRanges_1.18.3                    org.Hs.eg.db_3.0.0                     
 [5] RSQLite_1.0.0                           DBI_0.3.1                              
 [7] AnnotationDbi_1.28.1                    GenomeInfoDb_1.2.4                     
 [9] IRanges_2.0.1                           S4Vectors_0.4.0                        
[11] Biobase_2.26.0                          BiocGenerics_0.12.1                    
[13] vimcom_1.0-0                            setwidth_1.0-3                         
[15] colorout_1.0-3                         

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.0     
 [5] biomaRt_2.22.0          Biostrings_2.34.1       bitops_1.0-6            brew_1.0-6             
 [9] checkmate_1.5.1         codetools_0.2-9         digest_0.6.7            fail_1.2               
[13] foreach_1.4.2           GenomicAlignments_1.2.1 iterators_1.0.7         RCurl_1.95-4.5         
[17] Rsamtools_1.18.2        rtracklayer_1.26.2      sendmailR_1.2-1         stringr_0.6.2          
[21] tools_3.1.1             XML_3.98-1.1            XVector_0.6.0           zlibbioc_1.12.0

GenomicFeatures TxDb.Hsapiens.UCSC.hg19.knownGene annotation • 4.7k views

ADD COMMENT • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 6 days ago

Barcelona/Universitat Pompeu Fabra

Hi Hervé,

I see, so actually the weird thing here (IMO) is that NCBI puts HGNC symbols in the 'Name' field of GFF3 files instead of Entrez Identifiers, which are the ones defined by NCBI themselves (AFAIK). I have solved this by using a Perl one-liner that replaces the Name field with the Entrez ID and which I paste here for the record, in case this is useful to anyone interested in this discussion:

cat ref_GRCh38_top_level.chrM.gff3 | perl -ne 'if (/(.+;Name=).+;Dbxref=GeneID:([0-9]+)/) { print "$1$2;gene=$2\n" } else { if (/(.+);Dbxref=GeneID:([0-9]+)/) { print "$1;gene=$2\n" } else { print $_; } }' > ref_GRCh38_top_level.chrM.egID.gff3

As for whether there's something to do about it in the BioC API, I agree that the additional complexity in the arguments of makeTxDbFromGRanges() to fetch a specific identifier may not be worth the trouble. Let me recap my initial question.

I want to use gene annotations for the human mtDNA chromosome. If I load the classical UCSC known gene annotations, then:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb.UCSC <- TxDb.Hsapiens.UCSC.hg38.knownGene

genes.UCSC <- genes(txdb.UCSC)

genes.UCSC[seqnames(genes.UCSC) %in% "chrM"]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

I get no genes because as pointed out at the beginning of this discussion, UCSC does not link any of their annotated transcripts to a gene. If I make now an annotation package from Biomart (via makeTxDbFromBiomart()):

library(TxDb.Hsapiens.BioMart.ensembl.GRCh38)

txdb.ENSEMBL <- TxDb.Hsapiens.BioMart.ensembl.GRCh38

genes.ENSEMBL <- genes(txdb.ENSEMBL)

genes.ENSEMBL[seqnames(genes.ENSEMBL) %in% "MT"]
GRanges object with 37 ranges and 1 metadata column:
                  seqnames         ranges strand   |         gene_id
                     <Rle>      <IRanges>  <Rle>   |     <character>
  ENSG00000198695       MT [14149, 14673]      -   | ENSG00000198695
  ENSG00000198712       MT [ 7586,  8269]      +   | ENSG00000198712
  ENSG00000198727       MT [14747, 15887]      +   | ENSG00000198727
  ENSG00000198763       MT [ 4470,  5511]      +   | ENSG00000198763
  ENSG00000198786       MT [12337, 14148]      +   | ENSG00000198786
              ...      ...            ...    ... ...             ...
  ENSG00000210195       MT [15888, 15953]      +   | ENSG00000210195
  ENSG00000210196       MT [15956, 16023]      -   | ENSG00000210196
  ENSG00000211459       MT [  648,  1601]      +   | ENSG00000211459
  ENSG00000212907       MT [10470, 10766]      +   | ENSG00000212907
  ENSG00000228253       MT [ 8366,  8572]      +   | ENSG00000228253
  -------
  seqinfo: 987 sequences (1 circular) from an unspecified genome

Then, I do get the 37 known genes in the human mtDNA but their gene identifiers are ENSEMBL gene identifiers, so upfront I cannot see the HGNC classical symbols or fetch GO annotations because the default key throughout BioC annotation packages to fetch gene-centric annotations are Entrez identifiers. This is the point where I thought the definite solution would be to use NCBI annotations, which I thought would be anchored at Entrez identifiers, but as we have seen this is not directly the case.

I can parse the NCBI GFF3 file to get the Entrez identifer at the Name tag, or I could use the BioC API to replace the ENSEMBL gene identifiers by Entrez identifiers following some sensible criterion to resolve one2many or many2one assignments (if any). Would you recommend any other solution to obtain an Entrez ID centered TxDb annotation object for the mtDNA chromosome?

thanks again!

robert.

ADD COMMENT • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert,

FWIW one trick that you could use here is to store the Ensembl id to Entrez id relationship in your genes.ENSEMBL object without trying to get rid of its many-to-many nature for the moment (this can always be deferred to a later time):

library(org.Hs.eg.db)

# I'm using select() here but note that 'ens2eg' could also be obtained
# in a much simpler manner with toTable(org.Hs.egENSEMBL)
ens2eg <- select(org.Hs.eg.db,
                 keys=names(genes.UCSC), keytype="ENSEMBL",
                 columns=c("ENSEMBL", "ENTREZID"))

## Add the entrez_id metadata column (as a CharacterList)
mcols(genes.ENSEMBL)$entrez_id <-
    splitAsList(ens2eg$ENTREZID, ens2eg$ENSEMBL)[names(genes.ENSEMBL)]

genes.ENSEMBL.MT <- genes.ENSEMBL[seqnames(genes.ENSEMBL) %in% "MT"]

Then:

GRanges object with 37 ranges and 2 metadata columns:
                  seqnames         ranges strand   |         gene_id
                     <Rle>      <IRanges>  <Rle>   |     <character>
  ENSG00000198695       MT [14149, 14673]      -   | ENSG00000198695
  ENSG00000198712       MT [ 7586,  8269]      +   | ENSG00000198712
  ENSG00000198727       MT [14747, 15887]      +   | ENSG00000198727
  ENSG00000198763       MT [ 4470,  5511]      +   | ENSG00000198763
  ENSG00000198786       MT [12337, 14148]      +   | ENSG00000198786
              ...      ...            ...    ... ...             ...
  ENSG00000210195       MT [15888, 15953]      +   | ENSG00000210195
  ENSG00000210196       MT [15956, 16023]      -   | ENSG00000210196
  ENSG00000211459       MT [  648,  1601]      +   | ENSG00000211459
  ENSG00000212907       MT [10470, 10766]      +   | ENSG00000212907
  ENSG00000228253       MT [ 8366,  8572]      +   | ENSG00000228253
                        entrez_id
                  <CharacterList>
  ENSG00000198695            4541
  ENSG00000198712            4513
  ENSG00000198727            4519
  ENSG00000198763            4536
  ENSG00000198786            4540
              ...             ...
  ENSG00000210195              NA
  ENSG00000210196              NA
  ENSG00000211459              NA
  ENSG00000212907            4539
  ENSG00000228253            4509
  -------
  seqinfo: 987 sequences (1 circular) from an unspecified genome

Now at this point you could try as.character( mcols( genes.ENSEMBL.MT )$entrez_id ) and it will succeed if the relationship is not one-to-many (i.e. if the individual list elements in the CharacterList contain at most 1 element). The benefit of delaying this step is that the chances for as.character() to succeed now are good because the Ensembl id to Entrez id relationship has been restricted to a very small subset. (Of course relying on luck is ok when working interactively but not in production code ;-) )

HTH,

H.

ADD REPLY • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

hi, thanks for the code, i see that in principle there is no third alternative to what i had in mind. Just for the record, I'd like to highlight the caveats of mapping identifiers, even in such a well annotated piece of the human genome. If I build the annotation using the NCBI file i parsed above:

gr <- import("ref_GRCh38_top_level.chrM.egID.gff3", format="gff3")
seqlengths(gr) <- 16569
isCircular(gr) <- TRUE

metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c("NCBI", "GRCh38", "Homo sapiens",
                               "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz", "no"))
txdb <- GenomicFeatures:::makeTxDbFromGRanges(gr, metadata=metadata)

genes.NCBI <- genes(txdb)

while annotations are identical with ENSEMBL in terms of positions:

rng.NCBI <- sort(ranges(genes.NCBI))
names(rng.NCBI) <- NULL
rng.ENSEMBL <- sort(ranges(genes.ENSEMBL.MT))
names(rng.ENSEMBL) <- NULL
identical(rng.NCBI, rng.ENSEMBL)
[1] TRUE

and we could assume they correspond to the same genes, the Entrez identifier assignments are quite different:

as.charactermcolssortgenes.ENSEMBL.MT))$entrez_id) [1] NA NA NA "100616263" NA "4535" NA NA "4536" [10] NA "4512" NA "4513" NA "4509" "4508" "4514" NA [19] "4537" NA "4539" "4538" NA NA NA "4540" "4519" [28] NA NA NA NA NA NA NA "4541" NA [37] NA names(sort(genes.NCBI)) [1] "4558" "4549" "4577" "4550" "4567" "4535" "4565" "4569" "4536" "4578" "4512" "4555" "4513" "4566" "4509" "4508" [17] "4514" "4563" "4537" "4573" "4539" "4538" "4564" "4575" "4568" "4540" "4519" "4576" "4572" "4553" "4570" "4511" [33] "4579" "4574" "4541" "4556" "4571"

i guess the mapping could be somehow improved but a first shot like this can be quite fustrating. IMO this highlights the importance of fecthing annotations from their original source and avoiding the identifier mapping magic whenever possible (N.B. identifier mapping magic is great, necessary and i often use it).

robert.

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k

score 2 · Accepted Answer · 2015-01-21

2

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 9 hours ago

Seattle, WA, United States

Hi Robert,

Nothing is discarded. What happens is that, even if the UCSC Genes track contains a few transcripts on chrM (20 to be precise), none of them is linked to a gene:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## By requesting the gene_id column, we'll be able to see what
## Entrez Gene IDs are linked to the transcripts.
hg19_tx <- transcripts(txdb, columns=c("tx_name", "gene_id"))
hg19_genes <- genes(txdb)

Then:

> hg19_tx_on_chrM <- hg19_tx[seqnames(hg19_tx) == "chrM"]
> hg19_tx_on_chrM
GRanges object with 20 ranges and 2 metadata columns:
       seqnames         ranges strand   |     tx_name         gene_id
          <Rle>      <IRanges>  <Rle>   | <character> <CharacterList>
   [1]     chrM   [ 651,  674]      +   |  uc022bqo.2                
   [2]     chrM   [1604, 1634]      +   |  uc004cor.1                
   [3]     chrM   [1844, 4264]      +   |  uc004cos.5                
   [4]     chrM   [5905, 7439]      +   |  uc031tga.1                
   [5]     chrM   [7587, 9208]      +   |  uc011mfi.2                
   ...      ...            ...    ... ...         ...             ...
  [16]     chrM [ 7587, 15888]      -   |  uc022bqs.1                
  [17]     chrM [ 8367, 14149]      -   |  uc022bqt.1                
  [18]     chrM [10761, 14149]      -   |  uc031tgb.1                
  [19]     chrM [14675, 14698]      -   |  uc022bqv.1                
  [20]     chrM [15960, 16024]      -   |  uc022bqx.1                
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> as.character(mcols(hg19_tx_on_chrM)$gene_id)
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

> hg19_genes[seqnames(hg19_genes) == "chrM"]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

More generally speaking, and AFAICT, the UCSC Genes and RefSeq Genes tracks tend to contain a lot less Entrez Gene IDs than the corresponding Org.*.eg.db package (about half the number for Human). Seems like they go thru some serious curation process at UCSC...

Hope this helps,

H.

ADD COMMENT • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks, I see, I find it a bit odd that in this case is not that there are fewer Entrez Gene IDs, but that there is none of them, while some correspond to well-known genes (e.g., http://en.wikipedia.org/wiki/MT-ATP6). Anyway, I guess this is an UCSC issue. So I decided to fetch the GFF3 file of NCBI annotations and build a TxDb object from them, but I run into a problem. First I fetch the GFF anotations from the unix shell and filter them to have only those from the mitochondrial chromosome renaming the contig ID to the more handy UCSC nomenclature chrM:

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz
$ gzip -cd ref_GRCh38_top_level.gff3.gz | grep NC_012920.1 | sed 's/NC_012920.1/chrM/g' > ref_GRCh38_top_level.chrM.gff3

then I proceed to build the TxDb object from this GFF file and then get an error:

library(GenomicFeatures)
txdbchrMhg38 <- makeTranscriptDbFromGFF(file="ref_GRCh38_top_level.chrM.gff3",
                                        gffGeneIdAttributeName="GeneID",
                                        chrominfo=data.frame(chrom="chrM", length=16569, is_circular=TRUE),
                                        dataSource="ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz",
                                        species="Homo sapiens")
Error in DataFrame(...) : different row counts implied by arguments

traceback()
12: stop("different row counts implied by arguments")
11: DataFrame(...)
10: .Method(..., deparse.level = deparse.level)
9: eval(expr, envir, enclos)
8: eval(.dotsCall, env)
7: eval(.dotsCall, env)
6: standardGeneric("cbind")
5: cbind(data, gene_id)
4: .checkGeneIdAttrib(data, gff, gffGeneIdAttributeName)
3: .prepareGFF3data.frame(gff, exonRankAttributeName, gffGeneIdAttributeName)
2: .prepareGFF3Tables(gff, exonRankAttributeName, gffGeneIdAttributeName,
       useGenesAsTranscripts)
1: makeTranscriptDbFromGFF(file = "ref_GRCh38_top_level.chrM.gff3",
       gffGeneIdAttributeName = "GeneID", chrominfo = data.frame(chrom = "chrM",
           length = 16569, is_circular = TRUE), dataSource = "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz",
       species = "Homo sapiens")

This error seems to be related to how the GFF file is built but I cannot figure out what is actually wrong. The code above should enable anyone reproducing this problem so, any hint about how to solve this will be highly appreciated. I paste below my sessionInfo()

Thanks!

robert.

sessionInfo()

R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C              LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8
[6] LC_MESSAGES=en_US.UTF8    LC_PAPER=en_US.UTF8       LC_NAME=C                 LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicFeatures_1.18.3 AnnotationDbi_1.28.1 Biobase_2.26.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1
[7] S4Vectors_0.4.0 BiocGenerics_0.12.1 vimcom_1.0-0 setwidth_1.0-3 colorout_1.0-3

loaded via a namespace (and not attached):
[1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.0      biomaRt_2.22.0
[6] Biostrings_2.34.1       bitops_1.0-6            brew_1.0-6              checkmate_1.5.1         codetools_0.2-10
[11] DBI_0.3.1               digest_0.6.8            fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.1
[16] iterators_1.0.7         RCurl_1.95-4.5          Rsamtools_1.18.2        RSQLite_1.0.0           rtracklayer_1.26.2
[21] sendmailR_1.2-1         stringr_0.6.2           tools_3.1.1             XML_3.98-1.1            XVector_0.6.0
[26] zlibbioc_1.12.0

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert,

Even though many GFF3 files around are broken, most of them contain valuable data so we definitely want to support them. It can be tricky though, and a lot of effort has been put in makeTranscriptDbFromGFF() to handle these invalid GFF3 files. Unfortunately, the function still chokes on many of them. In BioC devel, we've started to work on makeTxDbFromGRanges(). It's not exported yet (because it's not ready for prime time) but you can already use it here:

## Using R-3.2 (current devel) + BioC 3.1 (current devel).

library(rtracklayer)
gr <- import("ref_GRCh38_top_level.chrM.gff3", format="gff3")
seqlengths(gr) <- 16569
isCircular(gr) <- TRUE

library(GenomicFeatures)  # make sure this is version >= 1.19.15

metadata <- data.frame(
    name=c("Data source", "Genome", "Organism",
           "Resource URL", "Full dataset"),
    value=c("NCBI", "GRCh38", "Homo sapiens",
            "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz", "no"))

txdb <- GenomicFeatures:::makeTxDbFromGRanges(gr, metadata=metadata)

Then:

> genes(txdb)
GRanges object with 37 ranges and 1 metadata column:
        seqnames         ranges strand   |     gene_id
           <Rle>      <IRanges>  <Rle>   | <character>
   ATP6     chrM   [8527, 9207]      +   |        ATP6
   ATP8     chrM   [8366, 8572]      +   |        ATP8
   COX1     chrM   [5904, 7445]      +   |        COX1
   COX2     chrM   [7586, 8269]      +   |        COX2
   COX3     chrM   [9207, 9990]      +   |        COX3
    ...      ...            ...    ... ...         ...
  TRNS2     chrM [12207, 12265]      +   |       TRNS2
   TRNT     chrM [15888, 15953]      +   |        TRNT
   TRNV     chrM [ 1602,  1670]      +   |        TRNV
   TRNW     chrM [ 5512,  5579]      +   |        TRNW
   TRNY     chrM [ 5826,  5891]      -   |        TRNY
  -------
  seqinfo: 1 sequence (1 circular) from GRCh38 genome

makeTxDbFromGRanges() is still a work in progress. Once it's ready, well enough tested, exported, and documented, the plan it to use it as the workhorse behind makeTranscriptDbFromGFF(). We'll announce on Bioc-devel when this happens.

Cheers,

H.

ADD REPLY • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

Hervé, one question related this feature that come now using it. Why makeTxDbFromGFF() sets the HGNC symbol as gene identifer instead of the Entrez ID identifier (and then the HGNC symbol as a SYMBOL column)?

Most gene-centric resources in BioC are anchored at Entrez IDs so I think it would be more helpful that a GFF file from NCBI also gets its TxDb object anchored at Entrez IDs. Would this be possible?

cheers,

robert.

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert,

The difficulty (for both makeTxDbFromGRanges() and makeTxDbFromGFF()) is that it's not clear where to extract the gene id from. According to the official GFF3 specs, it should be extracted from the Name tag, which is a mandatory tag in GFF3. The ID tag, which is also mandatory, is an internal id that has meaning only within the scope of the file. For example, for the ATP6 gene in the ref_GRCh38_top_level.gff3 file, the ID and Name tags are:

  ID=gene46136
  Name=ATP6

This is why makeTxDbFromGRanges() picks up ATP6 as gene id for this gene.

However it turns out that the file also contains the following tag

  Dbxref=GeneID:4508,HGNC:7414,MIM:516060

Note that the Dbxref tag is not an official GFF3 tag and is optional. It's a free format tag that can contain anything. In that case it's actually a multiple-value tag (comma separated) and the 1st value seems to be the Entrez Gene ID for ATP6.

So if we wanted makeTxDbFromGRanges() to extract the Entrez Gene ID to set the gene id, it would need to be able to guess a lot of things, or it would need help from the user. IMHO only the latter is realistic. Now how could the user tell makeTxDbFromGRanges() where to extract the gene id from in the general case? One way would be to allow him/her to specify the name of the tag (Dbxref in our case) and a callback function that extracts the gene id from the tag value. That provides a lot of flexibility but it not very convenient. Maybe a slightly less flexible but more convenient way would be to let him/her specify the name of the tag and the "subtag" (AFAIK there is no notion of subtag in GFF3). So gene_id_tag="Dbxref" and gene_id_subtag="GeneID" in our case.

So it's doable but is it worth the trouble?

H.

ADD REPLY • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

"Note that the Dbxref tag is not an official GFF3 tag and is optional. It's a free format tag that can contain anything."

I take this back sorry. After checking again the official GFF3 spec at http://www.sequenceontology.org/gff3.shtml, Dbxref is an official GFF3 tag (it's optional), its format is described in the spec, and the format used in ref_GRCh38_top_level.gff3 (Dbxref=GeneID:4508,HGNC:7414,MIM:516060) is compliant with the spec.

To me that makes supporting extraction of the gene ids from that tag a lot more attractive.

H.

ADD REPLY • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

hi, I had read the GFF3 spec too but did not notice either about Dbxref being an official GFF3 tag. So, if this is parsed then maybe every subtag could be a metadata column and one could consider whether the GeneID subtag becomes the gene_id column in TxDb objects or leave the user to switch that in the GRanges object before turning it into a TxDb object.

The ftp://ftp.ncbi.nlm.nih.gov/genomes/<organism>/GFF/ref_<organism>_top_level.gff3.gz files are a valuable resource of annotations, specially for nonmainstream organisms and, IMO, fetching Entrez-centric annotations from NCBI would ease the analysis of data from these organism with BioC workflows.

robert.

ADD REPLY • link 10.2 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Hi Robert,

I think we all agree that the ref_<organism>_top_level.gff3.gz files at NCBI are valuable annotation resources and that we want to be able to import them in a TxDb object. We also want to make it as easy as possible for the user to get the Entrez ids for the genes. I'll work on that.

Note that bringing the subtags one level up by turning them into top level tags is an interesting idea but would need to be done upstream of makeTxDbFromGRanges(), that is, in rtracklayer::import(). Such a change to import() would be too disruptive though to be made its default behavior so it would probably need to be controlled via an extra argument. Anyway for the user of makeTxDbFromGFF(), this is just implementation details. The input is the file and how the user specifies where to get the gene id from should reflect what s/he sees in the file, not what metadata columns are present on the intermediate GRanges representation.

Cheers,

H.

ADD REPLY • link 10.2 years ago Hervé Pagès 16k

0

Entering edit mode

Hi dear R Users,

I refer to the above nice post by H. which is exactly what I was looking for but it sadly does not work up to now. It is a command line for receiving a GRangeObject for chrM of H.sapiens.

As you can see in the first two command lines used, I have problems importing the gff file right at the beginning.

My R version is 3.2; GenomicFeatures version 3.20.3.

- Is there some step I have to do before fetching/importing the GFF file, do I have to download and save the complete GFF file before able to import it?

Is there some other suggested way to receive the 37 genes with IDs and ranges of chrM ?

Cheers,

Irina

> version

version.string R version 3.2.0 (2015-04-16)

>library(rtracklayer)

Warning message:

Paket ‘rtracklayer’ wurde unter R Version 3.2.2 erstellt

> gr <- import("ref_GRCh38_top_level.chrM.gff3", format="gff3")

Error in open.connection(con, open) : cannot open the connection

In addition: Warning message:

In open.connection(con, open) :

cannot open file 'ref_GRCh38_top_level.chrM.gff3': No such file or directory

library(GenomicFeatures)  # make sure this is version >= 1.19.15

metadata <- data.frame(.............

ADD REPLY • link 9.6 years ago iidler • 0

0

Entering edit mode

Any time you see an error saying 'No such file or directory' it means that R expects there to be a file that it cannot find, so yes you need to download that file first.

ADD REPLY • link 9.6 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you for your reply. But where do you get the ref_GRCh38_top_level.chrM.gff3' file?

ADD REPLY • link 9.6 years ago iidler • 0

0

Entering edit mode

Have you read the entirety of this post? Robert generated that by subsetting part of the top_level gff file. But the file he used has been replaced. It is now ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/GFF/ref_GRCh38.p2_top_level.gff3.gz

ADD REPLY • link 9.6 years ago James W. MacDonald 68k

0

Entering edit mode

hi,

You can follow Herve's answer to get the a GRanges for transcripts on chrM if you have a TxDb object.

A: TxDb annotations for human chrM

If you don't yet have a TxDb object, you can build one using makeTxDbFromGFF() from the GenomicFeatures package. If you encounter an error, I'd recommending starting a new support post, and posting all your code, the error message and sessionInfo()

ADD REPLY • link 9.6 years ago Michael Love 43k