Hi,
it seems that by default the human TxDb.* annotation packages do not have annotations for genes in the mitochondrial chromosome, while the Org.Hs.eg.db package includes those genes. Here is my exploration of this:
library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) allMTGenesOrgDb <- select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns="CHR") allMTGenesOrgDb <- as.character(allMTGenesOrgDb$ENTREZID[allMTGenesOrgDb$CHR %in% "MT"]) allMTGenesOrgDb [1] "4508" "4509" "4511" "4512" "4513" "4514" "4519" "4535" "4536" "4537" "4538" "4539" "4540" "4541" "4549" [16] "4550" "4553" "4555" "4556" "4558" "4563" "4564" "4565" "4566" "4567" "4568" "4569" "4570" "4571" "4572" [31] "4573" "4574" "4575" "4576" "4577" "4578" "4579" txdb19 <- TxDb.Hsapiens.UCSC.hg19.knownGene allEntrezGenesTxDb19 <- keys(txdb19, ketype="GENEID") any(allMTGenesOrgDb %in% allEntrezGenesTxDb19) [1] FALSE
Since the raw annotation file at UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz) and the browser view (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chrM%3A1-16569) show annotations for the mitochondrial chromosome, I guess these annotations are being discarded or deactivated somehow from the TxDb.*. This is happening in current release (sessionInfo() shown below). I've googled about this but could not find a direct answer to the question, which is how can I access the annotations for chrM in the TxDb.* packages?
Thanks!
robert.
sessionInfo() R version 3.1.1 Patched (2014-10-14 r66765) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.18.3 [3] GenomicRanges_1.18.3 org.Hs.eg.db_3.0.0 [5] RSQLite_1.0.0 DBI_0.3.1 [7] AnnotationDbi_1.28.1 GenomeInfoDb_1.2.4 [9] IRanges_2.0.1 S4Vectors_0.4.0 [11] Biobase_2.26.0 BiocGenerics_0.12.1 [13] vimcom_1.0-0 setwidth_1.0-3 [15] colorout_1.0-3 loaded via a namespace (and not attached): [1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0 [5] biomaRt_2.22.0 Biostrings_2.34.1 bitops_1.0-6 brew_1.0-6 [9] checkmate_1.5.1 codetools_0.2-9 digest_0.6.7 fail_1.2 [13] foreach_1.4.2 GenomicAlignments_1.2.1 iterators_1.0.7 RCurl_1.95-4.5 [17] Rsamtools_1.18.2 rtracklayer_1.26.2 sendmailR_1.2-1 stringr_0.6.2 [21] tools_3.1.1 XML_3.98-1.1 XVector_0.6.0 zlibbioc_1.12.0
Thanks, I see, I find it a bit odd that in this case is not that there are fewer Entrez Gene IDs, but that there is none of them, while some correspond to well-known genes (e.g., http://en.wikipedia.org/wiki/MT-ATP6). Anyway, I guess this is an UCSC issue. So I decided to fetch the GFF3 file of NCBI annotations and build a TxDb object from them, but I run into a problem. First I fetch the GFF anotations from the unix shell and filter them to have only those from the mitochondrial chromosome renaming the contig ID to the more handy UCSC nomenclature chrM:
then I proceed to build the TxDb object from this GFF file and then get an error:
traceback()
12: stop("different row counts implied by arguments")
11: DataFrame(...)
10: .Method(..., deparse.level = deparse.level)
9: eval(expr, envir, enclos)
8: eval(.dotsCall, env)
7: eval(.dotsCall, env)
6: standardGeneric("cbind")
5: cbind(data, gene_id)
4: .checkGeneIdAttrib(data, gff, gffGeneIdAttributeName)
3: .prepareGFF3data.frame(gff, exonRankAttributeName, gffGeneIdAttributeName)
2: .prepareGFF3Tables(gff, exonRankAttributeName, gffGeneIdAttributeName,
useGenesAsTranscripts)
1: makeTranscriptDbFromGFF(file = "ref_GRCh38_top_level.chrM.gff3",
gffGeneIdAttributeName = "GeneID", chrominfo = data.frame(chrom = "chrM",
length = 16569, is_circular = TRUE), dataSource = "ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz",
species = "Homo sapiens")
This error seems to be related to how the GFF file is built but I cannot figure out what is actually wrong. The code above should enable anyone reproducing this problem so, any hint about how to solve this will be highly appreciated. I paste below my sessionInfo()
Thanks!
robert.
sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8
[6] LC_MESSAGES=en_US.UTF8 LC_PAPER=en_US.UTF8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.18.3 AnnotationDbi_1.28.1 Biobase_2.26.0 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1
[7] S4Vectors_0.4.0 BiocGenerics_0.12.1 vimcom_1.0-0 setwidth_1.0-3 colorout_1.0-3
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8 BiocParallel_1.0.0 biomaRt_2.22.0
[6] Biostrings_2.34.1 bitops_1.0-6 brew_1.0-6 checkmate_1.5.1 codetools_0.2-10
[11] DBI_0.3.1 digest_0.6.8 fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.1
[16] iterators_1.0.7 RCurl_1.95-4.5 Rsamtools_1.18.2 RSQLite_1.0.0 rtracklayer_1.26.2
[21] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.1 XML_3.98-1.1 XVector_0.6.0
[26] zlibbioc_1.12.0
Hi Robert,
Even though many GFF3 files around are broken, most of them contain valuable data so we definitely want to support them. It can be tricky though, and a lot of effort has been put in
makeTranscriptDbFromGFF()
to handle these invalid GFF3 files. Unfortunately, the function still chokes on many of them. In BioC devel, we've started to work onmakeTxDbFromGRanges()
. It's not exported yet (because it's not ready for prime time) but you can already use it here:Then:
makeTxDbFromGRanges()
is still a work in progress. Once it's ready, well enough tested, exported, and documented, the plan it to use it as the workhorse behindmakeTranscriptDbFromGFF()
. We'll announce on Bioc-devel when this happens.Cheers,
H.
Hervé, one question related this feature that come now using it. Why makeTxDbFromGFF() sets the HGNC symbol as gene identifer instead of the Entrez ID identifier (and then the HGNC symbol as a SYMBOL column)?
Most gene-centric resources in BioC are anchored at Entrez IDs so I think it would be more helpful that a GFF file from NCBI also gets its TxDb object anchored at Entrez IDs. Would this be possible?
cheers,
robert.
Hi Robert,
The difficulty (for both
makeTxDbFromGRanges()
andmakeTxDbFromGFF()
) is that it's not clear where to extract the gene id from. According to the official GFF3 specs, it should be extracted from the Name tag, which is a mandatory tag in GFF3. The ID tag, which is also mandatory, is an internal id that has meaning only within the scope of the file. For example, for the ATP6 gene in the ref_GRCh38_top_level.gff3 file, the ID and Name tags are:This is why
makeTxDbFromGRanges()
picks up ATP6 as gene id for this gene.However it turns out that the file also contains the following tag
Note that the Dbxref tag is not an official GFF3 tag and is optional. It's a free format tag that can contain anything. In that case it's actually a multiple-value tag (comma separated) and the 1st value seems to be the Entrez Gene ID for ATP6.
So if we wanted
makeTxDbFromGRanges()
to extract the Entrez Gene ID to set the gene id, it would need to be able to guess a lot of things, or it would need help from the user. IMHO only the latter is realistic. Now how could the user tellmakeTxDbFromGRanges()
where to extract the gene id from in the general case? One way would be to allow him/her to specify the name of the tag (Dbxref in our case) and a callback function that extracts the gene id from the tag value. That provides a lot of flexibility but it not very convenient. Maybe a slightly less flexible but more convenient way would be to let him/her specify the name of the tag and the "subtag" (AFAIK there is no notion of subtag in GFF3). Sogene_id_tag="Dbxref"
andgene_id_subtag="GeneID"
in our case.So it's doable but is it worth the trouble?
H.
"Note that the Dbxref tag is not an official GFF3 tag and is optional. It's a free format tag that can contain anything."
I take this back sorry. After checking again the official GFF3 spec at http://www.sequenceontology.org/gff3.shtml, Dbxref is an official GFF3 tag (it's optional), its format is described in the spec, and the format used in ref_GRCh38_top_level.gff3 (
Dbxref=GeneID:4508,HGNC:7414,MIM:516060
) is compliant with the spec.To me that makes supporting extraction of the gene ids from that tag a lot more attractive.
H.
hi, I had read the GFF3 spec too but did not notice either about
Dbxref
being an official GFF3 tag. So, if this is parsed then maybe every subtag could be a metadata column and one could consider whether theGeneID
subtag becomes thegene_id
column inTxDb
objects or leave the user to switch that in theGRanges
object before turning it into aTxDb
object.The
ftp://ftp.ncbi.nlm.nih.gov/genomes/<organism>/GFF/ref_<organism>_top_level.gff3.gz
files are a valuable resource of annotations, specially for nonmainstream organisms and, IMO, fetching Entrez-centric annotations from NCBI would ease the analysis of data from these organism with BioC workflows.robert.
Hi Robert,
I think we all agree that the
ref_<organism>_top_level.gff3.gz
files at NCBI are valuable annotation resources and that we want to be able to import them in a TxDb object. We also want to make it as easy as possible for the user to get the Entrez ids for the genes. I'll work on that.Note that bringing the subtags one level up by turning them into top level tags is an interesting idea but would need to be done upstream of
makeTxDbFromGRanges()
, that is, inrtracklayer::import()
. Such a change toimport()
would be too disruptive though to be made its default behavior so it would probably need to be controlled via an extra argument. Anyway for the user ofmakeTxDbFromGFF()
, this is just implementation details. The input is the file and how the user specifies where to get the gene id from should reflect what s/he sees in the file, not what metadata columns are present on the intermediate GRanges representation.Cheers,
H.
Hi dear R Users,
I refer to the above nice post by H. which is exactly what I was looking for but it sadly does not work up to now. It is a command line for receiving a GRangeObject for chrM of H.sapiens.
As you can see in the first two command lines used, I have problems importing the gff file right at the beginning.
My R version is 3.2; GenomicFeatures version 3.20.3.
- Is there some step I have to do before fetching/importing the GFF file, do I have to download and save the complete GFF file before able to import it?
Is there some other suggested way to receive the 37 genes with IDs and ranges of chrM ?
Cheers,
Irina
>library(rtracklayer)
Warning message:
Paket ‘rtracklayer’ wurde unter R Version 3.2.2 erstellt
> gr <- import("ref_GRCh38_top_level.chrM.gff3", format="gff3")
Error in open.connection(con, open) : cannot open the connection
In addition: Warning message:
In open.connection(con, open) :
cannot open file 'ref_GRCh38_top_level.chrM.gff3': No such file or directory
Any time you see an error saying 'No such file or directory' it means that R expects there to be a file that it cannot find, so yes you need to download that file first.
Thank you for your reply. But where do you get the ref_GRCh38_top_level.chrM.gff3' file?
Have you read the entirety of this post? Robert generated that by subsetting part of the top_level gff file. But the file he used has been replaced. It is now ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/GFF/ref_GRCh38.p2_top_level.gff3.gz
hi,
You can follow Herve's answer to get the a GRanges for transcripts on chrM if you have a TxDb object.
A: TxDb annotations for human chrM
If you don't yet have a TxDb object, you can build one using makeTxDbFromGFF() from the GenomicFeatures package. If you encounter an error, I'd recommending starting a new support post, and posting all your code, the error message and sessionInfo()