Hi,
I have data for 26 different tissues and I want to perform exhaustive
pairwise comparisons between them. I want to find normalized expression
levels for the samples and differentially expressed genes between tissues.
For both I intend to use DESeq2.
From my reading and testing, I believe that the samples have two
undesirable characteristics:
o the replicates of a two tissues show high 'within-group variability'.
The manual explains to me that this can bias the dispersion calculations.
o the sample data have a wide range of coverage: when I run DESeq2 on all
of the samples at once and calculate the size factors, I find a range
of .1 to 41.
My questions are
o is it reasonable to use DESeq2 expression level values rather
than using read counts to normalize samples?
o is there an objective way to decide whether to run the differential
expression analysis 'all together or split into pairs of groups?'
(I have not found satisfying answers to these questions in my web searches
and manual reading.)
As always, I am grateful for assistance and consideration.
Brent
Hi,
I had made a PCA plot and saw that replicates of most tissues cluster reasonably well but the replicates that belong to two tissues spread across about half of the vertical extent of the plot. I know that this is bad, it distorts the 'globally' calculated dispersion values. Is there an objective or semi-objective way to decide how much within-goup variation is too much? (Additionally, I made scatter plots comparing log2fc, p-values, and padjust to compare the two results for one of the pair-wise comparisons. These show large differences, especially the p-values and padjust, which show an alarming lack of correlation. More usefully, the 'split into pairs' analysis gives 878 significant genes whereas the 'all together' analysis gives 566 significant genes, which is consistent with exaggerated dispersion values in the 'all together' analysis. More reassuringly, about 85% of the 566 genes are common to the results of both analyses.)
Also, I used your plotDispEsts() function on both the two condition run and the all condition run. The two condition run shows a lot of dispersion shrinkage, perhaps by between 1/4 and 1/3, whereas the the all condition run shows nearly no shrinkage.
I garbled my first question. I want to find gene expression levels by condition. I can use something akin to RPKM or benefit from the DESeq2 size factors by using the gene expression mean value, mu_ij, divided by the sample size factor to find 'q_ij'. Is it reasonable to take the latter approach? I suppose this depends, again, on the nature of the data.
Thank you for taking the time to consider these questions.
Brent
hi Brent,
Depending on how many samples you have per tissue, you might consider to run DESeq2 on pairs, if those two tissues would distort the global dispersion value. The case where I think you still are better off estimating dispersion over all tissues is if you have low sample size per tissue, and here I might say, less than 5 samples per tissue. There is no perfect rule though. There are advantages and disadvantages to different approaches.
Your second comment makes sense: shrinkage is reduced when there are more samples over which to estimate the parameter.
"I want to find gene expression levels by condition. I can use something akin to RPKM or benefit from the DESeq2 size factors by using the gene expression mean value, mu_ij, divided by the sample size factor to find 'q_ij'. Is it reasonable to take the latter approach? I suppose this depends, again, on the nature of the data."
The answer to this depends on how you made the gene count matrix. If you used tximport() upstream, you could just work with the txi$abundance table. Otherwise, you need to divide out gene length to obtain something that is on the scale of gene expression across genes. How did you input the count matrix?
Hi Michael,
Regarding the together versus split question, thank you for pointing out the shrinkage dependence on the number of analyzed samples. I am looking at plotDispEsts() plots for analyses of tissue subsets and I was confusing the effects of within-group variability and the number of tissues. I will read yet again the DESeq2 paper.
Your advice on using the DESeq2 gene expression mean value data is very helpful too. I did not keep in mind the effect of gene length.
I am grateful for your illuminating comments.
Best Wishes,
Brent
Hi Michael,
I have yet another question with which to reveal my naivete.
I asked about "using the gene expression mean value, mu_ij, divided by the sample size factor to find 'q_ij'" in order to calculate values akin to RPKM. I am considering using instead the rlog transformed count data raised to the power of two (and scaled by gene length) in order to generate RPKM-like values. I wonder if this might have an advantage or disadvantage in comparison to a calculation based on the mu_ij.
Again, I thank you for your patience.
Best Wishes,
Brent
I wouldn’t recommend this as a way to estimate abundance. I’d divide counts by gene length or use TPM directly from a quantification software like RSEM, Salmon, kallisto, etc.
Hi Michael,
Again I thank you for your advice.
Best Wishes,
Brent