Question

DESeq2 - groups with biological replicates from the same patient

Hello, I have the following dataset.

Sample  Patient  Condition
Sample1 Patient1 Condition1
Sample2 Patient1 Condition1
Sample3 Patient2 Condition1
Sample4 Patient2 Condition1
Sample5 Patient3 Condition2
Sample6 Patient3 Condition2
Sample7 Patient4 Condition2
Sample8 Patient4 Condition2

Samples from the same patient are biological replicates: they were taken from the same culture on different days and processed separately. Should I include Patient in the design formula (~Patient + Condition), or would it be fine to use just ~Condition?

Thanks in advance.

Best regards,

S.

DESeq2 • 3.3k views

ADD COMMENT • link updated 14 months ago by shepherl 4.1k • written 3.8 years ago by S ▴ 10

https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#model-matrix-not-full-rank
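The linked vignette section covers design matrices that are not full rank, which is what ~Patient + Condition produces for this sample table, because each patient appears in only one condition. As a rough illustration (a minimal Python sketch, not DESeq2 code; the helper names `matrix_rank` and `design` are made up for this example), the treatment-coded design matrix can be built by hand and its rank checked:

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Sample table from the question.
patients = ["Patient1", "Patient1", "Patient2", "Patient2",
            "Patient3", "Patient3", "Patient4", "Patient4"]
conditions = ["Condition1"] * 4 + ["Condition2"] * 4


def design(include_patient):
    """Treatment-coded design matrix: intercept, optional patient dummies, condition."""
    rows = []
    for p, c in zip(patients, conditions):
        row = [1]
        if include_patient:
            # Patient1 is the reference level.
            row += [int(p == lvl) for lvl in ("Patient2", "Patient3", "Patient4")]
        row.append(int(c == "Condition2"))
        rows.append(row)
    return rows


full = design(include_patient=True)      # ~Patient + Condition
reduced = design(include_patient=False)  # ~Condition
rank_full = matrix_rank(full)
rank_reduced = matrix_rank(reduced)

print(f"~Patient + Condition: {len(full[0])} columns, rank {rank_full}")
print(f"~Condition:           {len(reduced[0])} columns, rank {rank_reduced}")
```

The Condition2 column equals the sum of the Patient3 and Patient4 columns, so the five-column matrix has rank 4 and the condition effect cannot be separated from the patient effects.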

ADD REPLY • link 3.8 years ago XTR5 ▴ 10

I would be pretty worried about using only ~condition. You have clustered data here, and failing to account for clustering can often lead to a high type I error rate, e.g. as described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6634702/

ADD REPLY • link 16 months ago Oscar • 0

Hi Oscar, there is no mention of clustering.

ADD REPLY • link 16 months ago Kevin Blighe ★ 4.0k

Hi, thanks for the quick response. I'm not a statistician, so forgive me if I am using the wrong term. By clustered data I meant that the data from Condition1 are likely to fall into two distinct clusters (one from Patient1 and one from Patient2). The linked paper describes the issues with this (though admittedly in a non-RNA-seq context).

ADD REPLY • link 16 months ago Oscar • 0

Precisely why I said:

"please check the PCA bi-plots to assess sample grouping."

ADD REPLY • link 16 months ago Kevin Blighe ★ 4.0k

Yes, I see that. So, to expand: if strong grouping IS seen, then the ~condition approach is not appropriate, as it will likely lead to poorly controlled type I error.

Given that such grouping/clustering is very likely in such a scenario, what would you suggest in this situation?

ADD REPLY • link 16 months ago Oscar • 0

shepherl · Answer 1 · 2021-07-04

Kevin Blighe ★ 4.0k

I see no major issue with using just ~ condition. Would we ever expect a situation in which, e.g., Patient1 had both Condition1 and Condition2? (A rhetorical question.)

If you proceed with just ~ condition, please check the PCA bi-plots to assess sample grouping.

"they were taken from the same culture on different days"

Keep in mind, therefore, that time may have an effect.

Kevin

ADD COMMENT • link 3.8 years ago Kevin Blighe ★ 4.0k

Hello. Thanks very much for your reply. The same patient won't be in both groups (the experiment does not have paired samples). I just have biological replicates from the same patient within the same group. These biological replicates should be almost identical, but they were collected on different days and, as you say, time might have an effect. Thanks

ADD REPLY • link 3.8 years ago S ▴ 10

I find this response rather surprising. Sample1 and Sample2 (for example) are not independent because they are derived from the same patient. By solely using ~condition are you not telling DESeq2 that Sample1 and Sample2 are independent replicates from Condition1? This will surely greatly increase the chance of false positive detection of differentially expressed genes?

I did some simulations to test this under conditions with 0 differentially expressed genes. As predicted, in my tests taking two samples from the same 'patient' results in huge numbers of false positive differentially expressed genes, whereas taking only a single sample from each (or combining counts) leads to almost zero false positives.
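A toy version of this kind of null simulation can be sketched in plain Python. This is not the simulation described above and not DESeq2: it assumes Gaussian values instead of negative binomial counts, uses a plain equal-variance t-test per gene, and the standard deviations, gene count, and function names are all illustrative choices. It compares treating all eight samples as independent with averaging the two replicates of each patient first.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t distribution.
T_CRIT_DF6 = 2.447  # 4 vs 4 samples
T_CRIT_DF2 = 4.303  # 2 vs 2 patient means


def t_statistic(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def simulate_condition(rng, patient_sd, replicate_sd):
    """Two patients with two replicates each; no condition effect is added."""
    patients = []
    for _ in range(2):
        patient_mean = rng.gauss(0.0, patient_sd)
        patients.append([patient_mean + rng.gauss(0.0, replicate_sd) for _ in range(2)])
    return patients


def false_positive_rates(n_genes=5000, patient_sd=1.0, replicate_sd=0.2, seed=1):
    """Per-gene false positive rate at nominal 5% for both analyses."""
    rng = random.Random(seed)
    naive_hits = collapsed_hits = 0
    for _ in range(n_genes):
        c1 = simulate_condition(rng, patient_sd, replicate_sd)
        c2 = simulate_condition(rng, patient_sd, replicate_sd)
        # Naive: all 8 samples treated as independent replicates.
        flat1 = [x for reps in c1 for x in reps]
        flat2 = [x for reps in c2 for x in reps]
        if abs(t_statistic(flat1, flat2)) > T_CRIT_DF6:
            naive_hits += 1
        # Collapsed: one averaged value per patient.
        means1 = [statistics.mean(reps) for reps in c1]
        means2 = [statistics.mean(reps) for reps in c2]
        if abs(t_statistic(means1, means2)) > T_CRIT_DF2:
            collapsed_hits += 1
    return naive_hits / n_genes, collapsed_hits / n_genes


naive_fpr, collapsed_fpr = false_positive_rates()
print(f"naive (8 independent samples): {naive_fpr:.3f}")
print(f"collapsed (one value/patient): {collapsed_fpr:.3f}")
```

Under these assumptions the naive rate should come out well above the nominal 5%, while the collapsed rate should stay close to it, at the cost of having only two units per condition.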

ADD REPLY • link 16 months ago Oscar • 0

Hi Oscar, one cannot have the formula ~Patient + Condition here. It makes no sense. Note my comment: "please check the PCA bi-plots to assess sample grouping"

ADD REPLY • link 16 months ago Kevin Blighe ★ 4.0k

Understood, but I would be very worried that claiming we have 4 independent replicates per condition will lead to a large type I error rate here.

ADD REPLY • link 16 months ago Oscar • 0

I see, so regarding PCA plots, would your advice be to only proceed with using solely condition if there is no clear grouping/clustering of the patients, and instead the variance is dominated by condition?

What would your advice instead be if strong clustering/grouping is observed? Perhaps collapsing the replicates?

ADD REPLY • link 16 months ago Oscar • 0

My advice would be to not use ChatGPT.

ADD REPLY • link updated 14 months ago by shepherl 4.1k • written 16 months ago by Kevin Blighe ★ 4.0k

I'm not using ChatGPT. I've been discussing this issue of having multiple samples derived from the same patients at length with many colleagues, to try to understand the best way to process such data. I found this forum post through a Google search and was surprised by your answer, so I was seeking clarification. I think we agree that, where there is strong grouping/clustering, using only ~condition could lead to a very high type I error rate?

ADD REPLY • link 16 months ago Oscar • 0

If by clustering you mean there might be a within-subject correlation, then yes it's a possibility. But you cannot control for that using a fixed effect. You would need to use limma-voom, blocking on subject, to estimate the within-subject correlation and then fit a generalized least squares model.
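To give a feel for what a within-subject correlation is, here is a minimal Python sketch that estimates it for a single gene with a balanced nested ANOVA (patients within condition). In limma this estimate is normally obtained with duplicateCorrelation(); the code below is not that function, the estimator is a simpler stand-in, and the function name and expression values are invented for illustration.

```python
from statistics import mean


def within_subject_correlation(data):
    """ANOVA estimate of the intraclass correlation.

    data maps condition -> patient -> list of replicate values and must be
    balanced (the same number of replicates, k >= 2, for every patient).
    """
    ss_between = ss_within = 0.0
    n_patients = 0
    k = None
    for patients in data.values():
        condition_mean = mean(x for reps in patients.values() for x in reps)
        for reps in patients.values():
            k = len(reps)
            patient_mean = mean(reps)
            ss_between += k * (patient_mean - condition_mean) ** 2
            ss_within += sum((x - patient_mean) ** 2 for x in reps)
            n_patients += 1
    # Patients are nested in conditions, so one df per condition is used up.
    ms_between = ss_between / (n_patients - len(data))
    ms_within = ss_within / (n_patients * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical log-expression values for one gene (illustrative only).
gene = {
    "Condition1": {"Patient1": [5.1, 5.3], "Patient2": [7.0, 6.8]},
    "Condition2": {"Patient3": [6.0, 6.2], "Patient4": [8.1, 7.9]},
}

rho = within_subject_correlation(gene)
# Design effect for equal cluster sizes: 8 samples, 2 replicates per patient.
effective_n = 8 / (1 + (2 - 1) * rho)
print(f"within-subject correlation: {rho:.3f}")
print(f"effective number of independent samples: {effective_n:.2f}")
```

When replicates from the same patient are nearly identical, the correlation approaches 1 and the eight samples carry roughly the information of four independent ones, which is why treating them as eight independent replicates is too optimistic.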

ADD REPLY • link 16 months ago James W. MacDonald 68k

Thanks, James, for your helpful response. This particular experimental setup is very common in iPSC studies, where multiple differentiations are performed per cell line, leading to high potential for within-subject (within-cell-line) correlation. The 'convention' appears to be to simply ignore the fact that there are multiple replicates from the same subject/cell line and treat them all as independent replicates, which will surely lead to increased type I error in many cases.

I guess we should switch to limma-voom for RNA-seq with this kind of experimental design, then? Are you aware of any specific guides for performing the blocking and then fitting a generalised least squares model? Thanks!

ADD REPLY • link 16 months ago Oscar • 0