Question

How to annotate DESeq2 object with gene size to get FPKMs

0

Entering edit mode

ashley.doane ▴ 20

@ashleydoane-8524

Last seen 6.6 years ago

United States

I'd like to annotate a DESeqDataSet with information on gene size, specifically populating mcols(dds)$basepairs, where dds is my DESeqDataSet. The DESeqDataSet was created from HTSeq output, using the DESeqDataSetFromHTSeqCount function.

The reason for this is to pull out FPKMs, and to have genomic intervals for other downstream analysis. The HTSeq counts are by gene symbol.

Genome in this case is hg19, and I have TxDb.Hsapiens.UCSC.hg19.knownGene and Org.Hs.eg.db data, but not sure which slots to match.

Any suggestions would be really appreciated. thanks!

Ashley

deseq2 fpkm • 4.9k views

ADD COMMENT • link updated 9.8 years ago by Michael Love 43k • written 9.8 years ago by ashley.doane ▴ 20

score 0 · Answer 1 · 2015-08-02

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 7 days ago

United States

You can use makeTxDbFromGFF from the GenomicFeatures package to build a TxDb object, and then exonsBy with by="gene" to obtain a GRangesList of the exons for each gene.

sum(width(reduce(x)))

for x the GRangesList will give you the exonic basepairs for each gene. You'll have to just arrange this vector in the same order as the dds.

See an example of makeTxDbFromGFF and exonsBy here:

http://master.bioconductor.org/help/workflows/rnaseqGene/#count

ADD COMMENT • link 9.8 years ago Michael Love 43k

0

Entering edit mode

I'm not sure if this is more correct, but what I do (in well-annotated genomes) is compute the length of every transcript isoform and then take the length of the longest isoform for each gene. This would be less than the total width of all exons in the gene in cases where some exons are mutually exclusive.

ADD REPLY • link 9.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Why not just use the exact "effective length" that was used during the counting step? And by "effective length" I mean to just sum up all of the basepairs that are considered to "hit" the "parent gene" when creating your count matrix.

ADD REPLY • link 9.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Because there may not be any transcript that includes all exons. Imagine there are two isoforms of a gene, each including one of two mutually exclusive exons, and both expressed equally. (And assume the the two isoforms are roughly equal in length.) Then the coverage depth for the two mutually exclusive exons will be half of the others, and you would underestimate the overall FPKM for the gene if you divided by the total length of all exons. So dividing by the length of a known isoform seems better to me, as long as the genome is well-annotated.

ADD REPLY • link 9.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

It's up to the user. The code above is just a fast way to get an estimate of the size of the feature used for counting (also it's not taking into consideration the subtraction of basepairs which overlap another feature, and therefore do not contribute to the uniquely assignable count).

It's good to remember that most genes with multiple transcripts share the large majority of their basepairs but I see your point about the longest isoform.

For more accurate FPKM I would use an estimator like RSEM.

ADD REPLY • link 9.7 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

How do I arrange my list of basepairs in the same order as the dds? mcols(dds) does not have a column for gene IDs so I don't know where to start.

Thank you!

Josh

EDIT: looks like it's just in alphabetical order?

ADD REPLY • link 6.8 years ago Josh Wang ▴ 10

1

Entering edit mode

Try mcols(dds, use.names=TRUE)

ADD REPLY • link 6.8 years ago Michael Love 43k

1

Entering edit mode

EDIT: I reran the DESeq and the error message below disappeared. I don't know how. I am now able to get the fpkm values. Thank you for the help!!

Thanks for the reply!

I ordered my gene length data frame like this:

gene_lengths<-gene_lengths[match(rownames(mcols(dds, use.names=TRUE)), gene_lengths$Geneid),]

It ordered the rows of my gene length data frame as the mcols(dds).

and then when I tried to add a basepairs column to dds like this:

mcols(dds)$basepairs <- gene_lengths[, "Length"]

It gave an error message:

Error in V_recycle(value, x, x_what = "value", skeleton_what = "x") : 
  'NROW(value)' is greater than 'length(x)'

What caused this error? My gene length data frame and the mcols(dds) have the same numbers of rows.

ADD REPLY • link 6.8 years ago Josh Wang ▴ 10