First question: is it bulk data? Below, I will assume it is. I am assuming you have a bulk RNA-seq dataset with two conditions (say healthy and diseased), where all samples of one condition are on one plate and all samples of the other condition are on the other plate, except for one sample, which is on both plates.
If I understand the design correctly (EDIT: I did not, see updated response in comments), you will never be able to disentangle the condition effect and the plate effect. This is often called "perfect confounding" between the variable of interest and a nuisance factor.
Correcting (removing) the plate effect can in principle be done in two ways. If both conditions are present on both plates (e.g., 4 healthy and 4 diseased samples on each plate), it can simply be done by adding a fixed plate effect to your model, like you suggested. If each sample (of a single condition) is present on both plates (e.g., by splitting the tissue in half), one can think of applying batch correction methods, which aim to remove the technical plate effect while retaining the relevant condition effect. Example methods are harmony and Seurat CCA, among others.
However, neither strategy will work for you. The analysis with the plate + condition formula cannot be performed, since it essentially aims to estimate the condition effect (healthy vs diseased) within each plate. For that, you need both conditions to be present on both plates. The second strategy, batch correction, would need to learn the correction from a single sample. This will also not work; I expect the batch correction method to return an error, and even if it doesn't, the results should not be trusted.
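To make the confounding argument concrete, here is a minimal Python sketch that compares the rank of the plate + condition design matrix for two layouts. The sample layouts, the function names, and the 0/1 column coding are assumptions made up for illustration, and the single shared sample is left out for simplicity. In R, the same check would go through model.matrix.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples):
    """Columns: intercept, indicator for plate 2, indicator for diseased."""
    return [[1, int(plate == 2), int(cond == "diseased")] for plate, cond in samples]


# Hypothetical layouts, one (plate, condition) pair per sample.
confounded = [(1, "healthy")] * 4 + [(2, "diseased")] * 4
balanced = (
    [(1, "healthy")] * 2 + [(1, "diseased")] * 2
    + [(2, "healthy")] * 2 + [(2, "diseased")] * 2
)

confounded_rank = matrix_rank(design_matrix(confounded))
balanced_rank = matrix_rank(design_matrix(balanced))
print("confounded layout rank:", confounded_rank)
print("balanced layout rank:", balanced_rank)
```

The model has three coefficients (intercept, plate, condition). A rank below three means the plate and condition columns carry the same information, so their effects cannot be estimated separately.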
Jeroen
Hi Jeroen, thanks for the reply.
For the sake of simplicity I am going with one condition with two levels, diseased and healthy, and it is in fact bulk data. However, the samples are split across the plates: healthy is 50 per cent on plate 1 and 50 per cent on plate 2, and the same for diseased. There is one sample, patient 1, who is healthy and is on both plates in order to see the difference between the plates. So I don't have a confounded plate effect or anything like that. I was just wondering if there was any way to say that a sample should be the same across both plates?
Great, that really changes the design for the better!
I have to say though, having just one patient on both plates is quite uncommon to me; I typically encounter either having no patients on both plates or all patients on both plates. And those two scenarios would come with different designs.
The former scenario should be analyzed with the plate + condition design, which again just estimates the condition effect while correcting for the plate effect. The latter scenario should not be analyzed with that design, because it does not acknowledge that two samples come from the same patient. This would require a different modelling approach, like the one suggested in the edgeR user guide part 3.5 (just change the treatment variable there to a plate variable and imagine just two disease states): https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Long story short, all your samples except patient 1 follow the first design, so plate + condition will work for those samples. However, I am a bit uncomfortable with patient 1 appearing twice, as that is not acknowledged by that model. If you are being pragmatic, you could argue that ignoring this for just 1 patient will not strongly impact your analysis, also depending on how many samples you have in total. If you are being strict, I would consider removing patient 1 from the analysis (altogether or from one of the plates).
I am interested in other people's take on this, but I would really advise using the latter strategy.
Jeroen
I agree with Jeroen's points above, and I share the concerns about including one sample that is repeated on both plates in the statistical analysis. For DESeq2 with fixed effects you'd just drop one of the technical replicates, I think. Or you could use random effects modeling.
There is one avenue of methods that deal with technical replicates, in RUV. But I'm not sure it's worth it, or very efficient, with just a single technical replicate.
Yeah, it is just one patient sample repeated. Maybe I should explain a bit more. Each patient sample is split into 3 so I have triplicates already for each patient. For this one patient who is in both plates, I just have 6 samples. So if I collapse the wells into technical replicates it should be ok?
The biologist thought that by doing the experiment with the same sample on both plates we would be able to see the variation more clearly.
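As a rough illustration of the collapsing question, the Python sketch below assumes that collapsing means summing raw counts over the wells of one patient on one plate, which is the summing behaviour of DESeq2's collapseReplicates for a grouping factor. The well names, the counts, and the helper function are hypothetical.

```python
def collapse_replicates(counts, groups):
    """Sum per-gene raw counts over wells that share a group label.

    counts: {well_id: [count per gene]}
    groups: {well_id: hashable group label}
    """
    collapsed = {}
    for well, values in counts.items():
        key = groups[well]
        if key not in collapsed:
            collapsed[key] = list(values)
        else:
            collapsed[key] = [a + b for a, b in zip(collapsed[key], values)]
    return collapsed


# Hypothetical raw counts for two genes; patient 1 has three wells on each plate.
counts = {
    "p1_plate1_a": [10, 0], "p1_plate1_b": [12, 1], "p1_plate1_c": [11, 0],
    "p1_plate2_a": [20, 2], "p1_plate2_b": [18, 3], "p1_plate2_c": [22, 1],
    "p2_plate1_a": [5, 7], "p2_plate1_b": [6, 8], "p2_plate1_c": [4, 9],
}

# Group label is (patient, plate), taken from the hypothetical well names.
groups = {well: tuple(well.split("_")[:2]) for well in counts}

collapsed = collapse_replicates(counts, groups)
for key in sorted(collapsed):
    print(key, collapsed[key])
```

Under this grouping, patient 1 still yields two collapsed samples, one per plate, so collapsing the wells handles the triplicates but does not by itself remove the repeated-patient issue discussed above.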
I guess you can either go the route of random effects modeling with limma's duplicateCorrelation, or look into RUV methods for estimating technical variance with a single repeated sample, but we don't have support for dealing with this in DESeq2.
Cool, thanks Michael. I think I will just ignore the sample and report back to the lab not to do this again. Thanks for all your help.