Differential Expression Analysis with unbalanced batches without replication between conditions
Thili • 0
@d1ed9b14
Finland

Hello, everyone! I am working with pseudobulk RNA-seq data and am struggling to design an appropriate analysis because the batch effects are confounded with the conditions and the design is unbalanced. Here is a summary of my data.

[Image: summary of the data]

Challenges:

  1. The Diagnosis groups (e.g. healthy vs. cancer) never occur within the same batch, so batch effects cannot be adjusted for with the typical design formula ~ Diagnosis + Batch.
  2. I want to compare healthy vs. cancer samples while removing batch effects.

Questions:

  1. Is there an alternative model or approach in edgeR, limma-voom, DESeq2, or any other tool that can handle confounded batch effects? (I am currently using edgeR with Diagnosis as the only factor in the design matrix, but the MDS plot shows separate clusters for dataset1, dataset2 and dataset3.)

  2. Would combining Diagnosis and Batch into a single group factor be advisable here?

  3. Are there any tools that accept preprocessed (batch-corrected) data for differential expression analysis? (I believe edgeR only works with raw counts.)

Thank you in advance for the help.

@gordon-smyth
WEHI, Melbourne, Australia

Is there an alternative model or approach in tools like edgeR, limma-voom, or DESeq2 or any other that can handle confounded batch effects?

No. Completely confounded batch effects are a fundamental scientific flaw, and no DE analysis tool can separate batch effects from true biological signal.
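To illustrate why no tool can separate the two effects, here is a minimal sketch that builds the design matrix for ~ Diagnosis + Batch and computes its rank. The sample layout is hypothetical (the poster's table is not reproduced in this thread); it only assumes that each of dataset1, dataset2 and dataset3 contains a single Diagnosis. `matrix_rank` and `design_row` are illustrative helpers, not edgeR or limma functions.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical, fully confounded layout: one Diagnosis per dataset.
samples = [
    ("healthy", "dataset1"), ("healthy", "dataset1"),
    ("cancer", "dataset2"), ("cancer", "dataset2"),
    ("cancer", "dataset3"), ("cancer", "dataset3"),
]


def design_row(diagnosis, batch):
    # Columns: intercept, cancer, dataset2, dataset3 (treatment coding).
    return [1, int(diagnosis == "cancer"),
            int(batch == "dataset2"), int(batch == "dataset3")]


design = [design_row(d, b) for d, b in samples]
n_columns = len(design[0])
rank = matrix_rank(design)
print(f"columns: {n_columns}, rank: {rank}")
```

In this layout the cancer column equals the sum of the dataset2 and dataset3 columns, so the matrix has fewer independent columns than coefficients and the Diagnosis effect cannot be estimated separately from the batch effects.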

Would combining Diagnosis and Batch into a single group factor be advisable here?

I don't see how that would help.
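A small sketch of why the combined factor changes nothing, again using a hypothetical layout in which each dataset holds a single Diagnosis (the labels and the layout are assumptions, not the poster's actual data):

```python
# Hypothetical, fully confounded layout: one Diagnosis per dataset.
samples = [
    ("healthy", "dataset1"), ("healthy", "dataset1"),
    ("cancer", "dataset2"), ("cancer", "dataset2"),
    ("cancer", "dataset3"), ("cancer", "dataset3"),
]

# Combine Diagnosis and Batch into a single group label per sample.
group_labels = [f"{diagnosis}.{batch}" for diagnosis, batch in samples]
groups = sorted(set(group_labels))
batches = sorted({batch for _, batch in samples})

# Every healthy-vs-cancer comparison that can be formed between groups.
healthy = [g for g in groups if g.startswith("healthy.")]
cancer = [g for g in groups if g.startswith("cancer.")]
contrasts = [(c, h) for c in cancer for h in healthy]

# Keep only the comparisons where both groups come from the same batch.
within_batch = [
    (c, h) for c, h in contrasts
    if c.split(".", 1)[1] == h.split(".", 1)[1]
]

print(groups)
print(within_batch)
```

With this layout there are exactly as many groups as batches, and no healthy-vs-cancer contrast stays within a batch, so every such contrast is also a comparison between batches.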

Are there any tools that take preprocessed (batch-corrected) data in differential expression analysis?

Yes, but you would have to explain exactly what form the batch-corrected data takes. And of course, your results would only be as good as the batch correction, in a situation where batch correction is almost impossible. The only possible way forward, other than abandoning the data, would be to use a batch correction method based on control genes. Ideally that batch correction method would generate surrogate variables that are then incorporated into the edgeR analysis. Alternatively, either edgeR or limma can accept some types of batch-corrected data, but the DE results and p-values from such an analysis will inevitably be somewhat liberal. That is not the fault of the DE method, but an inevitable consequence of analysing batch-corrected data as if it were raw data.
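As a rough illustration of the control-gene idea, the sketch below derives one per-sample surrogate variable from genes assumed to be unaffected by Diagnosis. It is a deliberately crude stand-in, not an implementation of any published method or package: the counts, the gene names and the choice of control genes are all invented, library-size normalisation is ignored, and established control-gene methods use a factor analysis rather than a simple average.

```python
import math

# Invented counts: rows are genes, columns are six samples
# (two each from dataset1, dataset2, dataset3, in that order).
counts = {
    "ctrlA": [100, 110, 200, 210, 50, 55],
    "ctrlB": [400, 420, 820, 800, 190, 210],
    "ctrlC": [30, 33, 61, 64, 16, 15],
    "geneX": [10, 12, 90, 85, 20, 25],
}

# Assumption: these genes do not respond to Diagnosis, so any systematic
# variation in them across samples reflects unwanted (batch) effects.
control_genes = ["ctrlA", "ctrlB", "ctrlC"]


def centered_log(values):
    """log2(count + 1), centred so each gene has mean zero across samples."""
    logs = [math.log2(v + 1) for v in values]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]


centered = [centered_log(counts[g]) for g in control_genes]
n_samples = len(centered[0])

# One surrogate value per sample: the average centred log-count
# over the control genes.
surrogate = [
    sum(row[j] for row in centered) / len(centered)
    for j in range(n_samples)
]

for j, w in enumerate(surrogate):
    print(f"sample {j + 1}: {w:+.3f}")
```

A variable like `surrogate` would then enter the design matrix as an extra covariate alongside Diagnosis. The whole approach stands or falls on the control genes truly being unaffected by Diagnosis.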

Thank you for your reply. I currently have batch-corrected data produced with scvi-tools (specifically, totalVI), and I analysed the reconstructed counts in edgeR with the same pipeline I used for the raw data. However, the topTags output differs between the two analyses, particularly among the top 100 genes when ordered by p-value.

It also appears that edgeR might be designed primarily for discrete count data, which could pose challenges when using it with reconstructed data that might be more continuous in nature.
