Hello, everyone! I am working with pseudobulk RNA-seq data and am struggling to design an appropriate analysis because the batch effects are confounded with the condition of interest and the conditions are unbalanced. Here is a summary of my data.
Challenges:
- The Diagnosis groups (e.g. healthy vs. cancer) do not overlap within batches: each batch contains samples from only one diagnosis. This makes it impossible to adjust for batch effects with the typical design formula ~ Diagnosis + Batch.
- I'm interested in comparing healthy vs. cancer samples while eliminating batch effects.
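
To make the confounding concrete, here is a minimal Python sketch of why ~ Diagnosis + Batch cannot be fitted on a layout like mine. The sample layout is hypothetical (two samples per dataset, dataset1 all healthy, dataset2 and dataset3 all cancer) and is not my real metadata; `design_matrix` and `matrix_rank` are illustrative helpers, not functions from edgeR, limma, or DESeq2. The sketch builds a treatment-coded design matrix and compares its rank with the number of coefficients.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix, using exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a pivot in this column among the rows not yet used.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples, diag_levels, batch_levels):
    """Treatment-coded design for ~ Diagnosis + Batch.

    The first level of each factor is the reference and gets no column.
    """
    rows = []
    for diagnosis, batch in samples:
        row = [1]  # intercept
        row += [int(diagnosis == d) for d in diag_levels[1:]]
        row += [int(batch == b) for b in batch_levels[1:]]
        rows.append(row)
    return rows


diag_levels = ["healthy", "cancer"]
batch_levels = ["dataset1", "dataset2", "dataset3"]

# Hypothetical confounded layout: every batch holds a single diagnosis.
confounded = (
    [("healthy", "dataset1")] * 2
    + [("cancer", "dataset2")] * 2
    + [("cancer", "dataset3")] * 2
)

# Hypothetical balanced layout for comparison: both diagnoses in every batch.
balanced = [(d, b) for b in batch_levels for d in diag_levels]

x_confounded = design_matrix(confounded, diag_levels, batch_levels)
x_balanced = design_matrix(balanced, diag_levels, batch_levels)

n_coef = len(x_confounded[0])
rank_confounded = matrix_rank(x_confounded)
rank_balanced = matrix_rank(x_balanced)

print("coefficients:", n_coef)
print("rank (confounded layout):", rank_confounded)
print("rank (balanced layout):", rank_balanced)
```

In the confounded layout the cancer column equals the sum of the dataset2 and dataset3 columns, so the rank is lower than the number of coefficients and the Diagnosis effect cannot be separated from the Batch effect. In the balanced layout the same formula is full rank.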
Questions:
Is there an alternative model or approach in edgeR, limma-voom, DESeq2, or any other tool that can handle confounded batch effects? (Currently I'm using edgeR with Diagnosis as the single factor in the design matrix, but the MDS plot shows separate clusters for dataset1, dataset2 and dataset3.)
Would combining Diagnosis and Batch into a single group factor be advisable here?
Are there any tools that accept preprocessed (batch-corrected) data as input for differential expression analysis? (I guess edgeR only works with raw counts.)
Thank you in advance for the help.
Thank you for your reply. I currently have batch-corrected data processed with scvi-tools (specifically, totalvi), and I ran the reconstructed counts through the same edgeR pipeline that I used for the raw data. However, the topTags output differs between the two analyses, particularly among the top 100 genes when ordered by p-value.
It also appears that edgeR might be designed primarily for discrete count data, which could pose challenges when using it with reconstructed data that might be more continuous in nature.
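
As a quick sanity check on that last point, here is a minimal Python sketch that measures how far a matrix departs from integer counts. The two small matrices are made-up values for illustration only, not output from edgeR or totalvi, and `fraction_non_integer` is a hypothetical helper.

```python
def fraction_non_integer(matrix, tol=1e-9):
    """Return the fraction of entries that are not whole numbers."""
    values = [v for row in matrix for v in row]
    if not values:
        return 0.0
    non_int = sum(1 for v in values if abs(v - round(v)) > tol)
    return non_int / len(values)


# Made-up genes x samples matrices, for illustration only.
raw_counts = [
    [0, 12, 7],
    [3, 0, 25],
]
reconstructed = [
    [0.4, 11.8, 7.0],
    [2.6, 0.1, 24.3],
]

frac_raw = fraction_non_integer(raw_counts)
frac_reconstructed = fraction_non_integer(reconstructed)

print("non-integer fraction, raw:", frac_raw)
print("non-integer fraction, reconstructed:", frac_reconstructed)
```

A fraction well above zero would suggest that the reconstructed values no longer behave like the discrete counts that a count-based model expects.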