I have received RNA-seq (paired end) data from my sequencing facility. Details are like this. This is human PBMC samples for 98 participants and two days day0 and day7 (corresponding to pre and post vaccination) - in total 196 samples. 54 out of 98 were high responders and 44 were low responders.
I had the data randomised for them based on both factors so that they can use it in batches of three as we were going to produce same data for metabolomics in three batches. I kept the biological variation same in three batches based on both factors and also the visits for same partcipants in same batch. Now theyt are telling me this:
The "run" actually refers to the sequencing run for the samples and not associated with any other process.
"run 5" and "run 6" refers to the NovaSeq run, where we targeted greater than 40M reads/samples.
Regarding the processing and batching, below are the details:
Extraction (multiple batches),
Followed by library prep (2 batches of 95 samples + 8 samples in last batch),
Followed by sequencing (~50 samples/lane - ~4 lanes total sequenced across 2 sequencing runs (run 5 and run 6))
Question 0- This means I have a run variable where run5 has ~50 samples and run6 has ~150 samples. and library prep had three batches of 95 samples in two batches and 8 samples in last batch.
I am not sure I can use combat seq for this imbalanced design at all? because assumption is similar biological variation across batches which wont be the case in both. Am I thinking right?
Question 1- What is best solution for moving forward on this scenario? I do know that I could incorporate the batch in DE analysis e.g. when I am using edger, deseq2 or limma but what would I do if I need to do other analysis like correlation , WGCNA?
Question 2: Also If I am using sva for finding out the other hidden batch effects ? Should I incorporate in sva?
a. my all biological variables of interest e.g. visit and response status? b. should I also control for know batch effects while trying to find out others e.g.
mod <- model.matrix(~ visit+response_status+batchvariable1+batchvariable2, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(dat, mod = mod, mod0 = mod0, n.sv = 3)
I posted same question on biostars https://www.biostars.org/p/9567857/#9567864 but did not receive any comment and was a little desperate. Thank you
ATpoint Could you give me some opinion please?