The GenomicFeatures library allows gene models to be stored in an SQLite database.
Here is the full SQLite schema
CREATE TABLE chrominfo (
_chrom_id INTEGER PRIMARY KEY,
chrom TEXT UNIQUE NOT NULL,
length INTEGER NULL,
is_circular INTEGER NULL
);
CREATE TABLE transcript (
_tx_id INTEGER PRIMARY KEY,
tx_name TEXT NULL,
tx_type TEXT NULL,
tx_chrom TEXT NOT NULL,
tx_strand TEXT NOT NULL,
tx_start INTEGER NOT NULL,
tx_end INTEGER NOT NULL,
FOREIGN KEY (tx_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE exon (
_exon_id INTEGER PRIMARY KEY,
exon_name TEXT NULL,
exon_chrom TEXT NOT NULL,
exon_strand TEXT NOT NULL,
exon_start INTEGER NOT NULL,
exon_end INTEGER NOT NULL,
FOREIGN KEY (exon_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE cds (
_cds_id INTEGER PRIMARY KEY,
cds_name TEXT NULL,
cds_chrom TEXT NOT NULL,
cds_strand TEXT NOT NULL,
cds_start INTEGER NOT NULL,
cds_end INTEGER NOT NULL,
FOREIGN KEY (cds_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE splicing (
_tx_id INTEGER NOT NULL,
exon_rank INTEGER NOT NULL,
_exon_id INTEGER NOT NULL,
_cds_id INTEGER NULL,
UNIQUE (_tx_id, exon_rank),
FOREIGN KEY (_tx_id) REFERENCES transcript,
FOREIGN KEY (_exon_id) REFERENCES exon,
FOREIGN KEY (_cds_id) REFERENCES cds
);
CREATE TABLE gene (
gene_id TEXT NOT NULL,
_tx_id INTEGER NOT NULL,
UNIQUE (gene_id, _tx_id),
FOREIGN KEY (_tx_id) REFERENCES transcript
);
CREATE TABLE `metadata` (
`name` TEXT,
`value` TEXT
);
However, one important field seems to be missing: CDS phase. Phase specifies how many nucleotides must be removed from the beginning of a CDS to reach a complete codon (0, 1 or 2). If phase is missing, the CDS cannot be translated without making dangerous assumptions about the data. Specifically, you have to assume that the first CDS has a phase of 0. For poorly assembled genomes, this may not by the case.
Even if the assumption always holds, a CDS without phase data can be hard to work with since you cannot translate an individual CDS without considering the other members of the model.
According to the GFF specification, phase information is required for CDS entries and is stored in the 8th column. Here is an example of an incomplete gene model (from Saccharomyces uvarum):
# example.gff
AACA01001085.1 AUGUSTUS CDS 1 289 . + 1 Parent=g1.t1
AACA01001085.1 AUGUSTUS exon 1 289 1.00 + . Parent=g1.t1
AACA01001085.1 AUGUSTUS gene 1 289 1 + . g1
AACA01001085.1 AUGUSTUS mRNA 1 289 . + . ID=g1.t1;Other=g1
Where AACA01001085.1
is the short scaffold:
>AACA01001085.1
GTGGGGTTTCATGATTAGTTTCTTCACTCCTTTCATCACCGGCGCAATCAACTTTTACTA
TGGCTATGTCTTCATTGGCTGTCTATGTGTTGCCTACTTTTacgtctttttctttgtgcC
AGAAACGAAAGGCCTGACATTAGAAGAGGTCGACACTATGTGGCTGGAAGGCATTCCACC
ATGGAAATCGGCCTCATGGGTGCCACCGGACAGAAGAACCGCAGATTACGATGAAGACGC
TGTGGCCCATGACGAGAGGCCAGCTTACAAAAGATTCTTCTCCAACTGATCCTCCGACTC
ATCGATGAttcaaatgttttttttccctacCTCCGTAATATTTATgatgtttatttttgt
acTCCTTAATAATTTATGAACTTTTCCTTGTATTTCGCAGCAGCTGAACTCTAGTGCTGG
AGTGGACGCTACACAGCCCTCGGGACTAGCAGGTGCAGGAGCGAACGAAGCCGCAGGAAC
TCCACTAACAGACTACGCACAAAGACCCGGTGGACCCTGACTCAAAGCTGACGCAATACT
GCCACTACTTTGCAAACCTTAACTTTAACTTCGGCCCGACGCACCTAAGACGGACATAAG
AATGAGAAATGCAGTGAAGAGATCCTCCATCATATGCCAGCCGTAGGCCTTCGATGTTCT
TTGTGCAATGCTAC
Working with these partial models can result in serious problems:
library(GenomicFeatures)
library(Rsamtools)
library(Biostrings)
# Build SQLite gene model database
gff <- GenomicFeatures::makeTxDbFromGFF('example.gff')
# Build indexed fasta file and open a connection to it.
dna <- Rsamtools::FaFile(Rsamtools::indexFa("Saccharomyces_uvarum.fna"))
# Extract the CDS from the database. This will create a list of GRanges objects
# where each one contains an ordered set of coding sequences.
trans <- GenomicFeatures::cdsBy(gff, by='tx', use.names=TRUE)
trans
#> GRangesList object of length 1:
#> $g1.t1
#> GRanges object with 1 range and 3 metadata columns:
#> seqnames ranges strand | cds_id cds_name exon_rank
#> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
#> [1] AACA01001085.1 [1, 289] + | 1 <NA> 1
#>
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Extract the nucleotide sequences
cds <- GenomicFeatures::extractTranscriptSeqs(x=dna, transcripts=trans)
# Translate the CDS
aa_seq <- Biostrings::translate(cds)
#> Warning message:
#> In .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet], :
#> last base was ignored
aa_seq
#> A AAStringSet instance of length 1
#> width seq names
#> [1] 96 VGFHD*FLHSFHHRRNQLLLWLC...RRLR*RRCGP*REASLQKILLQL g1.t1
The final translated sequence is frameshifted. The translate
function detects that something is wrong, but it doesn't have enough information to make the right decision. It decides to ignore that last base, when it should have ignored the first 2 bases.
One solution is to store the phase in the SQLite database and then write a phase-aware function for extracting CDS (e.g. extractCDSSeqs
).
Here is my session info:
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.7.1
LAPACK: /usr/lib/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Rsamtools_1.28.0 Biostrings_2.44.2 XVector_0.16.0
[4] GenomicFeatures_1.28.4 AnnotationDbi_1.38.2 Biobase_2.36.2
[7] GenomicRanges_1.28.5 GenomeInfoDb_1.12.2 IRanges_2.10.3
[10] S4Vectors_0.14.5 BiocGenerics_0.22.0 colorout_1.1-2
[13] pryr_0.1.2 devtools_1.13.3 magrittr_1.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 compiler_3.4.2
[3] bitops_1.0-6 tools_3.4.2
[5] zlibbioc_1.22.0 biomaRt_2.32.1
[7] digest_0.6.12 bit_1.1-12
[9] lattice_0.20-35 RSQLite_2.0
[11] memoise_1.1.0 tibble_1.3.4
[13] pkgconfig_2.0.1 rlang_0.1.2
[15] Matrix_1.2-11 DelayedArray_0.2.7
[17] DBI_0.7 GenomeInfoDbData_0.99.0
[19] rtracklayer_1.36.4 withr_2.0.0
[21] stringr_1.2.0 grid_3.4.2
[23] bit64_0.9-7 XML_3.98-1.9
[25] BiocParallel_1.10.1 blob_1.1.0
[27] matrixStats_0.52.2 codetools_0.2-15
[29] GenomicAlignments_1.12.2 SummarizedExperiment_1.6.4
[31] stringi_1.1.5 RCurl_1.95-4.8
The phase will be 0 if the models are complete, but this isn't always the case. Note that the first CDS begins on the very first nucleotide of the scaffold. The GFF example was produced by the AUGUSTUS gene model predictor, which will detect models even when they are broken across scaffolds. I have several low quality yeast assemblies and each of them has hundreds partial models.
It is also convenient to include phase with each CDS entry because it allows the CDS to be translated independently of the rest of the model. Also, the phase is of biological interest. So making it accessible would be nice.
Partially degraded transcripts could also have non-zero phase.
To add some historic background here:
TxDb objects were originally designed to support correct gene models from UCSC and Ensembl (via
makeTxDbFromUCSC
andmakeTxDbFromEnsembl
). So at the time it didn't feel necessary to include the phase information. Note that the UCSC tracks don't contain this information because it can easily be computed from the other fields in the track, assuming that the information in these other fields is correct of course. (The phase actually gets computed/added to the GTF files you can get via the Table Browser.) Support for GFF files was only added a couple of years later viamakeTxDbFromGFF
.The phase information could probably be added to TxDb objects. However, I don't think it's a good idea nor realistic that TxDb objects support incorrect gene models. In this case, one should work directly from the GFF file after loading it in a data.frame with
rtracklayer::readGFF()
or in a GRanges object withrtracklayer::import()
. This will import the phase.Note that, assuming the gene model in a TxDb object is correct, the phase can easily be computed with something like:
Then:
I'll modify
cdsBy()
so it returns the above (only when grouping by transcript though i.e. when called withby="tx"
). The good news is that there is no need to touch the db schema.Hope this helps,
H.
Thanks, the modification to `cdsBy` certainly help. I coded around the partial models in my project by loading the GFF into a GRanges object and adding the phase to the start position (which makes in-frame models). Then I encoded the the original GFF phase in the CDS 'name' field before passing the GRanges object into `makeTxDbFromGRanges`. It was ugly, but preserved the phase.
I'm being a little pedantic, perhaps, but I would not call these partial models "incorrect". The part of the model that is described may be perfectly correct, but due to an incomplete genome assembly, the initial part of the sequence is missing.
If phase is not supported, perhaps importing partial models should raise an immediate error or warning? Otherwise, proteins translated based on the models will be frameshifted. This is exactly what was happening in my project.
OK.
So we could add a warning in
makeTxDbFromGFF()
when the phases reported in the file for the CDS in a transcript don't agree with the phases inferred from the CDS cumwidths. Not a totally satisfying solution because only the person who callsmakeTxDbFromGFF()
to create the TxDb object gets to see the warning, but not other people using that TxDb downstream. Would probably be better to have the TxDb object itself "remember" that the CDS information cannot be trusted for translation e.g. by having this reported in the metadata table or, as you suggested, by having the phase imported from the file. ThencdsBy( , by="tx")
could issue a warning. More generally, tools that take a TxDb and DNA sequences as input in order to do translation (e.g.predictCoding()
) would be able to warn the user, or fail graciously, or maybe still manage to produce correct/meaningful output if they have access to the phase that was reported in the GFF file.This will be a significant effort though and is not likely to happen before our next release.
Thanks for your feedback,
H.
FYI I made a first pass on this: in GenomicFeatures 1.29.14,
makeTxDbFromGFF()
andmakeTxDbFromGRanges()
now import the CDS phase:https://github.com/Bioconductor/GenomicFeatures/commit/22b2ea0a81bc064b8554400ce8f7e13e2c833d38
Was actually not that bad.
cdsBy(txdb, by="tx")
still needs to be modified to return the phase info. I'll try to get to this later.Cheers,
H.
Just what I was looking for! Except, upon review, I think the logic is off. I've commented one line as #OLD and replaced it with proposed new line.
Thumbs up?
Any chance of this hitting the mainstream?
So you are saying that the phase of the 2nd CDS on Fly transcript FBtr0300689 is 2 instead of 1?
With my original version of
addCdsPhase()
(after replacingphead()
withheads()
):With your modified version:
I disagree with that.
H.
Yes, I am, Herve. Indeed.
Let's back up a sec, and, for completeness let's make sure we are talking about the same thing.
Here's my best reference - https://useast.ensembl.org/Help/Glossary where we find:
Phase Exon The position of an exon/intron boundary within a codon. A phase of zero means the boundary falls between codons, one means between the first and second base and two means between the second and third base. Exons have a start and end phase, whereas introns have just one phase. A boundary in a non-coding region has a phase of -1.
I like that definition.
I also like Ensemble's implementation, which also assigns 2 as the value of the "start phase" of the second exon.
see: http://useast.ensembl.org/Drosophila_melanogaster/Transcript/Exons?db=core;g=FBgn0031208;r=2L:7529-9484;t=FBtr0300689
FWIW: their implementation appears to be setExonPhases in FlyBaseToolbox.pm;
There seems to be some confusion between frame and phase. For the CDS phase I prefer to stick to the definition provided at:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
which, unfortunately, is in disagreement with Ensembl's definition.
Note that Ensembl's definition is for the exon phase, not the CDS phase. I don't know if that could explain the different definitions.
Anyway the CDS phase as defined in the official GFF3 specs is the phase implemented by my original
addCdsPhase()
. It is consistent with the phase of the 2nd CDS of Fly transcript FBtr0300689 reported in the Drosophila_melanogaster.BDGP6.93.chromosome.2L.gff3.gz file located at ftp://ftp.ensembl.org/pub/release-93/gff3/drosophila_melanogaster/H.