I'd like to annotate a DESeqDataSet with information on gene size, specifically populating mcols(dds)$basepairs, where dds is my DESeqDataSet. The DESeqDataSet was created from HTSeq output, using the DESeqDataSetFromHTSeqCount function.
The reason for this is to pull out FPKMs, and to have genomic intervals for other downstream analysis. The HTSeq counts are by gene symbol.
Genome in this case is hg19, and I have TxDb.Hsapiens.UCSC.hg19.knownGene and Org.Hs.eg.db data, but not sure which slots to match.
Any suggestions would be really appreciated. thanks!
Ashley
I'm not sure if this is more correct, but what I do (in well-annotated genomes) is compute the length of every transcript isoform and then take the length of the longest isoform for each gene. This would be less than the total width of all exons in the gene in cases where some exons are mutually exclusive.
Why not just use the exact "effective length" that was used during the counting step? And by "effective length" I mean to just sum up all of the basepairs that are considered to "hit" the "parent gene" when creating your count matrix.
Because there may not be any transcript that includes all exons. Imagine there are two isoforms of a gene, each including one of two mutually exclusive exons, and both expressed equally. (And assume the the two isoforms are roughly equal in length.) Then the coverage depth for the two mutually exclusive exons will be half of the others, and you would underestimate the overall FPKM for the gene if you divided by the total length of all exons. So dividing by the length of a known isoform seems better to me, as long as the genome is well-annotated.
It's up to the user. The code above is just a fast way to get an estimate of the size of the feature used for counting (also it's not taking into consideration the subtraction of basepairs which overlap another feature, and therefore do not contribute to the uniquely assignable count).
It's good to remember that most genes with multiple transcripts share the large majority of their basepairs but I see your point about the longest isoform.
For more accurate FPKM I would use an estimator like RSEM.
Hi Michael,
How do I arrange my list of basepairs in the same order as the dds? mcols(dds) does not have a column for gene IDs so I don't know where to start.
Thank you!
Josh
EDIT: looks like it's just in alphabetical order?
Try mcols(dds, use.names=TRUE)
EDIT: I reran the DESeq and the error message below disappeared. I don't know how. I am now able to get the fpkm values. Thank you for the help!!
Thanks for the reply!
I ordered my gene length data frame like this:
It ordered the rows of my gene length data frame as the mcols(dds).
and then when I tried to add a basepairs column to dds like this:
It gave an error message:
What caused this error? My gene length data frame and the mcols(dds) have the same numbers of rows.