Question

Using Sailfish estimated read-counts for DESeq2 analysis

0

Entering edit mode

jceledon • 0

@jceledon-6919

Last seen 10.2 years ago

Canada

Hi there,

Has anyone tried to use Sailfish estimated RPKM as input for DESeq2 for differential gene expression analysis?

Patro, R., Mount, S.M. and Kingsford, C. (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol., 1–6.

http://www.ncbi.nlm.nih.gov/pubmed/24752080

I noticed that the DESeq2 article explicitly says "DESeq2 performs analysis on counts of reads which can be uniquely assigned to genes, while a number of other algorithms [25, 26] perform differential analysis on a probabilistic assignment of reads to transcripts."

And it is not clear tom me how Sailfish RPKMs are calculated.

Thank you!

Jose

deseq2 sailfish differential gene expression • 3.4k views

ADD COMMENT • link updated 10.2 years ago by Simon Anders ★ 3.8k • written 10.2 years ago by jceledon • 0

score 2 · Answer 1 · 2014-11-20

2

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 23 months ago

United States

You say:

  And it is not clear tom me how Sailfish RPKMs are calculated

This doesn't answer your sailfish question explicitly, but it doesn't really matter how they are calculated for your question: the "official party line" that you'll uncover if you do a bit of searching on here is that RPKMs are not valid inputs to any of the GLM based RNAseq differential expression based approaches, so using DESeq2 on sailfish data is a non-starter.

ADD COMMENT • link 10.2 years ago Steve Lianoglou ★ 13k

2

Entering edit mode

To elaborate on Steve's answer; transforming raw counts into RPKMs discards information about the absolute size of the counts. This information is required to model the mean-variance relationship. If you use RPKMs instead of raw counts as inputs to, e.g., edgeR, you will incorrectly estimate this relationship. In addition, most GLM-based methods use discrete distributions. This means that the continuous nature of RPKMs will not be accurately modelled, especially at low values.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

1

Entering edit mode

hi Jose,

Aaron has laid out the exact point about why count-based methods such as DESeq2, edgeR and others require gene-level raw counts for statistical inference. The library size is accounted for within the model, rather than by transforming the input data. Similarly, sample-specific effects on gene-level counts of covariates such as GC content can be included (if estimated by another package such as cqn or EDASeq) in the model using normalizationFactors in DESeq2 (or an offset in edgeR), and if one has obtained estimates of different average transcript length for each gene across samples (RSEM produces such a column), this can be included in a model for gene-level counts as well using the 'normMatrix' argument of estimateSizeFactors. But the counts slot of the DESeqDataSet needs to contain raw counts for each gene.

(edited to emphasize gene-level counts)

ADD REPLY • link 10.2 years ago Michael Love 43k

score 2 · Answer 2 · 2014-11-21

The real issue is this: DESeq2 is meant for testing for differential expression of genes, not isoforms. Quantification of different isoforms for the same genes will always have correlated errors, and unless these correlations are explictely accounted for, any test for differential expression is bound to produce loads of false positives.

So, just counting reads for each isoform independently and running DESeq (or any other testing method) is a big no. (See this old post if it is unclear why: A: DESeq and transcript-wise analysis)

Methods like RSEM (and similarly Sailfish, I suppose) avoid double-counting reads for several isoforms, so the problem is not exactly as described in the post above. Still, the errors (i.e., deviations of RSEM's estimates from the true values) will be correlated between different isoforms of the same gene. RSEM is actually able to estimate and report these correlation matrices, and a method that used this additional information would be able to perform correct inference. However, I am not aware of any tool that is designed to process RSEM's error estimates for purposes of comparing isoform expression.

So, if you are just interested in comparing on the level of genes, use a method that quantifies genes, not isoforms.

If you are interested in comparing on the level of isoforms, use a method that is able to estimate these correlations and account for them in testing. Tools like BitSeq or cuffdiff integrate both the quantification and testing in a single tool or tool-chain, precisely for the reason that the correlation estimates must be passed from the quantification to the testing stage to avoid false positives, and this is typically not possible if you combine a quantification tool from one author with a testing tool from somebody else.

If, however, you are willing to follow our argument that testing on the exon level instead of the isoform level not only avoids all these difficulties but also provides more interpreatble results, use our DEXSeq package (and read our paper on it).