Hi all,
I read a lot about testing on datasets with batch effects, but in all cases the effect is on biological replicates and not technical replicates.
Some quick definitions for clarity:
technical replicates: same sample, same RNA isolation and same library prep, just loaded twice on the machine (for read depth)
biological replicates: different samples (e.g. cell lines) of the same group (e.g. genotype) with different RNA isolations, but the same library prep and loading onto the machine
I noticed two different types of variation between technical replicates:
A) systematic/plane shift in PCA --> batch effect due to different sequencing runs (see image A below)
B) dispersed --> small random technical variability within one sequencing run (see image B below)
Normally one would expect the variation between technical replicates to be small and non-systematic in the PCA (B).
In my case there is a batch effect between technical replicates (A). The experiment was designed with 6 biological replicates (6 different samples for each group of interest, colored dots in the PCA) and 2 technical replicates for each biological replicate (samples connected by a line in the PCA). The two technical replicates of each sample were sequenced on two different sequencing runs.
The general approach for (B) is to simply sum the counts of the technical replicates and then test between groups, as discussed in https://support.bioconductor.org/p/85536/.
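To make the summing step concrete, here is a minimal sketch in plain Python, assuming raw counts are stored per sequencing run and that a mapping from run to biological sample is available. The run names, sample names and count values are made up for illustration. In DESeq2 the same idea is what `collapseReplicates` is meant for, as far as I understand it.

```python
from collections import defaultdict


def collapse_technical_replicates(counts, run_to_sample):
    """Sum raw counts of all runs that belong to the same biological sample.

    counts: dict mapping run id -> list of raw counts (one entry per gene)
    run_to_sample: dict mapping run id -> biological sample id
    """
    n_genes = len(next(iter(counts.values())))
    collapsed = defaultdict(lambda: [0] * n_genes)
    for run_id, gene_counts in counts.items():
        sample = run_to_sample[run_id]
        # Element-wise sum of raw (unnormalized) counts per gene
        collapsed[sample] = [a + b for a, b in zip(collapsed[sample], gene_counts)]
    return dict(collapsed)


# Hypothetical toy data: 2 biological samples, 2 runs each, 3 genes
counts = {
    "s1_run1": [10, 0, 5],
    "s1_run2": [12, 1, 4],
    "s2_run1": [7, 3, 9],
    "s2_run2": [8, 2, 11],
}
run_to_sample = {
    "s1_run1": "s1",
    "s1_run2": "s1",
    "s2_run1": "s2",
    "s2_run2": "s2",
}

collapsed = collapse_technical_replicates(counts, run_to_sample)
```

Note that the sum is taken on raw counts, not on normalized values, so the collapsed column behaves like one deeper library per biological sample.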
Now for the case of a batch effect between technical replicates (A), it gets a bit ambiguous for me.
X) Ignore the batch effect and simply sum up and test (I feel bad about this)
Y) Do not merge the technical replicates, but test using a design that includes the batch as a covariate (~genotype+batch)
Z) Test the two runs of technical replicates separately and keep the intersection of the significant genes (probably with a loss of power due to smaller library sizes and fewer samples)
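To illustrate what option Y implies, here is a small sketch of the design matrix behind ~genotype+batch, written in plain Python. It assumes two-level factors only, and the labels "WT", "KO", "run1" and "run2" are hypothetical placeholders, not names from my data. Each biological sample contributes one row per sequencing run, so the model sees twice as many rows as there are biological samples.

```python
def design_matrix(samples, reference_genotype, reference_batch):
    """Build rows [intercept, genotype indicator, batch indicator].

    Assumes genotype and batch each have exactly two levels; the indicator
    is 1 when a sample differs from the reference level, else 0.
    """
    rows = []
    for s in samples:
        rows.append([
            1,
            int(s["genotype"] != reference_genotype),
            int(s["batch"] != reference_batch),
        ])
    return rows


# Hypothetical layout: 2 genotypes x 2 biological replicates x 2 runs
samples = [
    {"genotype": genotype, "replicate": rep, "batch": batch}
    for genotype in ("WT", "KO")
    for rep in (1, 2)
    for batch in ("run1", "run2")
]

design = design_matrix(samples, reference_genotype="WT", reference_batch="run1")

for s, row in zip(samples, design):
    print(s["genotype"], s["replicate"], s["batch"], row)
```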
Findings: Y results in a major increase in the number of significant genes compared to X.
Which way would be the best to handle this situation? Will the fact that I am not summing up the technical replicates (Y) be a problem?
And how much technical variation (without a batch effect, case B) can be "ignored" before one of the approaches X, Y, Z becomes necessary?
Additional note: I also tried SVA, but noticed that the first surrogate variable corresponds exactly to the batch (sequencing run) covariate.
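One simple way to check such a correspondence is to correlate the surrogate variable with a 0/1 batch indicator. The sketch below does this in plain Python. The surrogate variable values are invented toy numbers for illustration only, not output from my SVA run.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: batch indicator (0 = first run, 1 = second run)
batch = [0, 1, 0, 1]
# Hypothetical first surrogate variable, one value per sequenced library
surrogate = [-0.9, 1.1, -1.0, 1.0]

r = pearson(batch, surrogate)
```

A correlation close to 1 or -1 would indicate that the surrogate variable carries essentially the same information as the known batch covariate.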
Thank you very much for your help!
EDIT (correct links):
A-batcheffectPCA: https://ibb.co/VYpxnxh
B-techvariancePCA: https://ibb.co/SQnWnjr
Thank you very much, this is very helpful!