Hi Haiyin,
Please show your sessionInfo()
as requested in our posting guide. makeTranscriptDbFromGFF()
is deprecated in the current release of BioC (BioC 3.1) and has been replaced with makeTxDbFromGFF()
. So I suspect you are using an old (and unsupported) version of BioC. Please update your installation to use the current release of BioC.
Generally speaking, there is no reason to expect that the resulting TxDb object contains all the chromosomes/contigs that are in the GFF file. This is because a GFF file can contain many kinds of features however only the genes, transcripts, exons, and CDS are imported in the TxDb object. So only chromosomes/contigs that actually have these features will be reported by seqlevels(txdb)
.
As for specifying which field in the GFF to use for the gene_id there is no easy way at the moment. But note that the current version of makeTxDbFromGFF()
is trying to do a better job at importing the gene_id from the right place:
cyno <- makeTxDbFromGFF("GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff.gz")
# Prepare the 'metadata' data frame ... metadata: OK
# Warning messages:
# 1: In matchCircularity(seqlevels(gr), circ_seqs) :
# None of the strings in your circ_seqs argument match your seqnames.
# 2: In .extract_exons_from_GRanges(exon_IDX, gr, ID, Name, Parent, feature = "exon", :
# 411 orphan exons were dropped
genes(cyno)
# GRanges object with 35836 ranges and 1 metadata column:
# seqnames ranges strand | gene_id
# <Rle> <IRanges> <Rle> | <character>
# A1BG NC_022290.1 [ 58937557, 58951892] - | A1BG
# A1CF NC_022280.1 [ 85376927, 85454067] + | A1CF
# A2M NC_022282.1 [ 9354433, 9401000] - | A2M
# A2ML1 NC_022282.1 [ 9098500, 9153859] + | A2ML1
# A3GALT2 NC_022272.1 [194631589, 194645505] + | A3GALT2
# ... ... ... ... ... ...
# ZXDC NC_022273.1 [145848836, 145886645] + | ZXDC
# ZYG11A NC_022272.1 [174567376, 174637485] - | ZYG11A
# ZYG11B NC_022272.1 [174660062, 174713629] - | ZYG11B
# ZYX NC_022274.1 [176564664, 176574909] + | ZYX
# ZZZ3 NC_022272.1 [149532272, 149659867] + | ZZZ3
# -------
# seqinfo: 670 sequences from an unspecified genome; no seqlengths
Please let us know if this is still not what you need.
Cheers,
H.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] XVector_0.8.0 GenomicFeatures_1.20.1 AnnotationDbi_1.30.1
[4] Biobase_2.28.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[7] IRanges_2.2.5 S4Vectors_0.6.2 BiocGenerics_0.14.0
[10] BiocInstaller_1.18.4
loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.9
[4] tools_3.2.0 DBI_0.3.1 lambda.r_1.1.7
[7] futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0
[10] bitops_1.0-6 RCurl_1.95-4.7 biomaRt_2.24.0
[13] RSQLite_1.0.0 Biostrings_2.36.1 Rsamtools_1.20.4
[16] XML_3.98-1.3
It does look like things work fine in the devel version of Bioc, so we have a way around the bug, but the ID issue is still a pain.
It looks like (in devel) it is using the Name attribute as the gene_id, which is strange, since Name is not guaranteed to be unique. Shouldn't it use Name for the gene_name, and ID for gene_id? Haiyin wants to use the "Dbxref" attribute to as the gene_id. That attribute is formalized, and the "GeneID" prefix, according to ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs, indicates that it is an Entrez gene ID, so it is a valid, unique identifier. It would be cool if GenomicFeatures could recognize that, and somehow integrate information from the Org packages when making an OrganismDb package. But at the very least, letting the user specify a custom gene ID attribute would be easy and worthwhile.
Hi Michael,
Yes, it seems that Dbxref would be a better place than Name to grab the gene_id. At least when it contains Entez gene or Ensembl IDs.
Just to clarify TxDb objects don't have a gene_name field. For transcripts and exons we have both tx_id/tx_name and exon_id/exon_name, but not for genes. The naming of the fields is confusing because tx_id and exon_id are internal ids that are not meaningful beyond the TxDb object. OTOH tx_name and exon_name are meaningful beyond the TxDb object but are not guaranteed to be unique (or even defined). For genes we don't use internal ids. So gene_id is more like the equivalent of tx_name and exon_name but for genes.
By default
makeTxDbFromGFF()
now grabs the gene_id from the Name attribute. Used to be ID but for most GFF3 files that was giving a meaningless id. According to the GFF3 official specs, ID is indeed internal to the file with no guarantees to be meaningful beyond it. After testing this on many different files from different annotation providers it turned out that grabbing the gene_id from Name instead of ID was almost always giving more satisfying results.Dbxref sounds like a better choice than Name though. Anyway it's on our list to let the user specify what attribute to use for the gene_id.
Thanks for your input,
H.
One complication is that Dbxref can have multiple values. So one would need to indicate not just the attribute name, but also the desired Dbxref prefix (like GeneID).