Differential Expression Analysis with unbalanced batches without replication between conditions
Thili • 0
@d1ed9b14
Finland

Hello, everyone! I am working with pseudo bulk RNA-seq data and facing challenges with designing an appropriate analysis approach due to confounded batch effects and unbalanced conditions. Here is a summary of my data.

[Image: summary of the data (not available)]

Challenges:

  1. The Diagnosis groups (e.g. healthy vs. cancer) never occur in the same batch, so batch effects cannot be adjusted for with the typical design formula: ~ Diagnosis + Batch.
  2. I'm interested in comparing healthy vs. cancer samples while eliminating batch effects.

Questions:

  1. Is there an alternative model or approach in tools like edgeR, limma-voom, DESeq2 or any other that can handle confounded batch effects? (Currently I am using edgeR with Diagnosis as the only factor in the design matrix, but the MDS plot shows separate clusters for dataset1, dataset2 and dataset3.)

  2. Would combining Diagnosis and Batch into a single group factor be advisable here?

  3. Are there any tools that take preprocessed (batch-corrected data, i.e. I have ) data in differential expression analysis? (I guess edgeR only works with raw counts.)

Thank you in advance for the help.

edgeR BatchEffect • 1.5k views
@gordon-smyth
WEHI, Melbourne, Australia

> Is there an alternative model or approach in tools like edgeR, limma-voom, or DESeq2 or any other that can handle confounded batch effects?

No. Complete confounding of batch with the condition of interest is a fundamental scientific flaw, and there is no way for any DE analysis tool to separate batch effects from true biological signal.
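To make the confounding concrete, here is a minimal sketch in plain Python. The sample layout is hypothetical (the actual table was posted as an image), and `design_matrix` and `matrix_rank` are illustrative helpers, not functions from edgeR or limma. It builds treatment-coded columns for ~ Diagnosis + Batch and shows that, when every batch contains only one diagnosis, the matrix has fewer independent columns than coefficients, so the Diagnosis effect cannot be estimated separately from Batch.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(diagnosis, batch):
    """Intercept plus treatment-coded Diagnosis and Batch columns."""
    diag_levels = sorted(set(diagnosis))
    batch_levels = sorted(set(batch))
    rows = []
    for d, b in zip(diagnosis, batch):
        row = [1]
        row += [int(d == lvl) for lvl in diag_levels[1:]]   # first level is the reference
        row += [int(b == lvl) for lvl in batch_levels[1:]]
        rows.append(row)
    return rows


batches = ["dataset1", "dataset1", "dataset2", "dataset2", "dataset3", "dataset3"]

# Hypothetical confounded layout: each dataset holds a single diagnosis.
confounded = design_matrix(
    ["healthy", "healthy", "cancer", "cancer", "cancer", "cancer"], batches
)

# Hypothetical crossed layout for comparison: both diagnoses in every dataset.
crossed = design_matrix(["healthy", "cancer"] * 3, batches)

for name, design in [("confounded", confounded), ("crossed", crossed)]:
    print(name, "columns:", len(design[0]), "rank:", matrix_rank(design))
```

In the confounded layout the rank falls short of the number of columns, which is the situation in which edgeR or limma would report that coefficients are not estimable. In the crossed layout the design has full column rank.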

> Would combining Diagnosis and Batch into a single group factor be advisable here?

I don't see how that would help.

> Are there any tools that take preprocessed (batch-corrected data, i.e. I have ) data in differential expression analysis?

Yes, but you would have to explain exactly what form the batch-corrected data takes. And of course, your results would only be as good as the batch correction, in a situation where batch correction is almost impossible.

The only possible way forward, other than abandoning the data, would be to use a batch correction method based on control genes. Ideally that batch correction method would generate surrogate variables that are then incorporated into the edgeR analysis. Alternatively, either edgeR or limma can accept some types of batch-corrected data, but the DE results and p-values from such an analysis will inevitably be somewhat liberal. That is not the fault of the DE method, but an inevitable consequence of analysing batch-corrected data as if it were raw data.
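As a rough sketch of what incorporating surrogate variables into the analysis means, the log-linear model for gene g in sample i could be written as below. The notation is an assumption made for illustration, not taken from the edgeR documentation: N_i is the library size, x_i holds the Diagnosis covariates, and w_i holds surrogate variables estimated from control genes that are presumed unaffected by Diagnosis.

```latex
\log \mu_{gi} = \log N_i + x_i^{\top} \beta_g + w_i^{\top} \alpha_g
```

The Diagnosis coefficients \beta_g are estimable only if the columns formed by the w_i are not linear combinations of the Diagnosis columns. That is why the surrogate variables have to come from genes assumed not to respond to Diagnosis, rather than from the batch labels themselves.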


Thank you for your reply. I currently have batch-corrected data processed with scvi-tools (specifically, totalVI) and have followed the same edgeR pipeline for the reconstructed counts as I did for the raw data. However, I am observing differences between the two topTags outputs, particularly among the top 100 genes when ordered by p-value.

It also appears that edgeR might be designed primarily for discrete count data, which could pose challenges when using it with reconstructed data that might be more continuous in nature.


Have you read the totalVI documentation? It says:

> Important
>
> We do not recommend using totalVI denoised values in other differential expression tools, as denoised values are a summary of a random quantity. The totalVI DE test takes into account the full uncertainty of the denoised quantities.

If you used totalVI, then you must use it for the DE analysis as well, because it does not pass useable expression values on to other programs. I have no faith however that totalVI can correct for completely confounded batch effects or that it can return realistic p-values or posterior probabilities taking into account biological variation.

Trying to analyse totalVI output in edgeR would be nonsensical. It is not a matter of discrete vs continuous (edgeR is perfectly capable of analysing continuous expected counts from RSEM, Salmon or kallisto), but rather the fact that effects like library size and donor effects have been removed and the resulting data is on the wrong scale.
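One way to see why the scale matters is the following minimal sketch. It uses a simplifying Poisson assumption and made-up library sizes, and it is not a description of what totalVI does internally. Count-based models judge the precision of an observation from the size of the count, so once values have been rescaled to a common library size, a well-measured and a poorly-measured gene look identical.

```python
import math


def poisson_relative_sd(mean_count):
    """Relative standard deviation (sd / mean) of a Poisson count."""
    return 1.0 / math.sqrt(mean_count)


# Hypothetical samples in which the gene makes up 1 in 1000 reads.
library_sizes = {"deep_sample": 1_000_000, "shallow_sample": 10_000}

# Expected raw counts differ a hundredfold between the two libraries.
expected_counts = {name: size / 1000 for name, size in library_sizes.items()}

# Sampling noise relative to the mean is much larger for the small count.
relative_sd = {name: poisson_relative_sd(c) for name, c in expected_counts.items()}

# After rescaling to a common library size of 10,000 the two values coincide,
# and the information about their different precision is gone.
scaled = {
    name: expected_counts[name] / library_sizes[name] * 10_000
    for name in library_sizes
}

for name in library_sizes:
    print(name, expected_counts[name], round(relative_sd[name], 3), scaled[name])
```

The deep sample has a relative sampling error of about 3% and the shallow one about 32%, yet both end up with the same rescaled value. A model that expects raw counts would treat the two as equally precise.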


Thank you for sharing your insights. I truly learned a lot from your explanations!
