Hello, I have the following dataset.
Sample Patient Condition
Sample1 Patient1 Condition1
Sample2 Patient1 Condition1
Sample3 Patient2 Condition1
Sample4 Patient2 Condition1
Sample5 Patient3 Condition2
Sample6 Patient3 Condition2
Sample7 Patient4 Condition2
Sample8 Patient4 Condition2
Samples from the same patient are biological replicates, since they were taken from the same culture on different days and processed separately. Should I add Patient to the design formula, i.e. ~Patient + Condition, or would it be fine to leave it as ~Condition?
Thanks in advance.
Best regards,
S.
https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#model-matrix-not-full-rank
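For reference, a minimal sketch of why the linked vignette section applies here: each patient belongs to exactly one condition, so in ~Patient + Condition the condition column is a sum of patient columns. The Python below is purely illustrative (DESeq2 itself is R); the treatment coding and the helper names design_matrix and rank are assumptions made for this example, not anything DESeq2 exposes.

```python
from fractions import Fraction

# The eight samples from the question as (patient, condition) pairs.
samples = [
    ("Patient1", "Condition1"), ("Patient1", "Condition1"),
    ("Patient2", "Condition1"), ("Patient2", "Condition1"),
    ("Patient3", "Condition2"), ("Patient3", "Condition2"),
    ("Patient4", "Condition2"), ("Patient4", "Condition2"),
]


def design_matrix(rows, use_patient):
    """Treatment-coded matrix: intercept, optional patient dummies, condition dummy."""
    matrix = []
    for patient, condition in rows:
        row = [1]  # intercept
        if use_patient:
            # Patient1 is the reference level (an assumption of this sketch).
            row += [int(patient == p) for p in ("Patient2", "Patient3", "Patient4")]
        row.append(int(condition == "Condition2"))
        matrix.append(row)
    return matrix


def rank(matrix):
    """Matrix rank by exact Gauss-Jordan elimination on fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


full = design_matrix(samples, use_patient=True)      # ~Patient + Condition
reduced = design_matrix(samples, use_patient=False)  # ~Condition

print(len(full[0]), rank(full))        # 5 4 (five columns, rank four)
print(len(reduced[0]), rank(reduced))  # 2 2 (full rank)
```

Here the Condition2 column equals the Patient3 column plus the Patient4 column, so the five-column matrix only has rank 4 and the condition effect cannot be separated from the patient effects.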
I would be pretty worried about leaving the design solely as ~condition. You have clustered data here. Failing to account for clustering can often lead to an inflated type I error rate, e.g. as described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6634702/
Hi Oscar, there is no mention of clustering.
Hi, thanks for the quick response. I'm not a statistician, so forgive me if I am using the wrong term. By clustered data I meant that the data from condition 1 are likely to fall into two distinct clusters (one from patient 1 and one from patient 2). The linked paper describes the issues with this (though admittedly in a non-RNA-seq context).
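To make the concern concrete, here is a toy simulation. It is a sketch under stated assumptions, not a DESeq2 analysis: values are Gaussian (think of log-scale expression for a single gene), there is no true condition effect, and the between-patient and within-patient standard deviations (1.0 and 0.3) are made-up numbers. It compares a t-test that treats all 8 samples as independent with one run on the 4 per-patient means.

```python
import random
import statistics

random.seed(1)

PATIENT_SD = 1.0    # assumed between-patient spread (hypothetical value)
SAMPLE_SD = 0.3     # assumed within-patient spread (hypothetical value)
T_CRIT_DF6 = 2.447  # two-sided 5% critical value of t with 6 df
T_CRIT_DF2 = 4.303  # two-sided 5% critical value of t with 2 df


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def simulate_condition():
    """Two patients with two samples each; samples share their patient's baseline."""
    patients = []
    for _ in range(2):
        baseline = random.gauss(0, PATIENT_SD)
        patients.append([baseline + random.gauss(0, SAMPLE_SD) for _ in range(2)])
    return patients


def false_positive_rates(n_sim=4000):
    """Fraction of null simulations in which each analysis rejects at the 5% level."""
    naive_hits = 0
    mean_hits = 0
    for _ in range(n_sim):
        cond1, cond2 = simulate_condition(), simulate_condition()
        # Naive analysis: all 4 samples per condition treated as independent.
        flat1 = [x for patient in cond1 for x in patient]
        flat2 = [x for patient in cond2 for x in patient]
        if abs(t_statistic(flat1, flat2)) > T_CRIT_DF6:
            naive_hits += 1
        # Patient-level analysis: one mean per patient, 2 vs 2.
        means1 = [statistics.mean(patient) for patient in cond1]
        means2 = [statistics.mean(patient) for patient in cond2]
        if abs(t_statistic(means1, means2)) > T_CRIT_DF2:
            mean_hits += 1
    return naive_hits / n_sim, mean_hits / n_sim


naive_rate, patient_mean_rate = false_positive_rates()
print(f"naive (8 samples as independent): {naive_rate:.3f}")
print(f"per-patient means (2 vs 2):       {patient_mean_rate:.3f}")
```

Under these assumptions the naive analysis rejects far more often than the nominal 5%, while the per-patient analysis stays close to 5%. How large the inflation is in real data depends on how strongly samples from the same patient actually resemble each other.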
Precisely why I said:
Yes, I see that. So, to expand: if strong grouping IS seen, then the ~condition approach is not appropriate, as it will likely lead to poorly controlled type I error. Given that such grouping/clustering is very likely in such a scenario, what would you suggest in this situation?