Dear All,
Since we have a rather complicated experimental design, we would like to get some advice.
We have 70 samples from 40 patients. Each sample was collected at one of two time intervals (ID or RE), which are defined by the disease stage. For 10 patients there is only 1 sample available. The patients are divided into two groups, subtype1 and subtype2, and we would like to find the genes that are differentially expressed between them.
For example, this is what our design looks like:
Sample | condition | Patient |
A1_ID | Subtype1 | A1 |
A1_RE | Subtype1 | A1 |
B1_ID | Subtype2 | B1 |
B1_RE | Subtype2 | B1 |
C1_ID | Subtype1 | C1 |
Our concern is whether we are inflating the evidence for certain genes by having two samples from the same patient. How could we account for that using a multifactorial design? Should we model an additive effect (~ condition + patient) or also an interaction (~ condition * patient)?
Thanks for all opinions and suggestions.
Both of those models pool across the time interval, which seems bold given that you say it corresponds to disease stage. For the patients with only one sample, do the samples correspond to one specific timepoint, are they censored (i.e. the disease didn't progress, so you only have the initial timepoint, or it progressed so rapidly that you don't have the first timepoint), or are they missing entirely at random? If they're missing at random, then I'd expect any biases to cancel out; if not, we'll need more details.
I'd recommend thoroughly justifying your decision to pool over timepoint/stage, or considering a model that includes a timepoint factor. It looks like you'll have difficulty fitting both patient and condition factors, as they look confounded in the snippet you give (each patient belongs to exactly one subtype); you can get some of the way using the advice given in section 3.12 of the DESeq2 vignette.
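To make the confounding concrete, here is a minimal sketch in plain Python (an illustration only, not DESeq2 code). It builds a treatment-coded design matrix for ~ condition + patient from the five-sample snippet in the question and computes its rank. The helper names `dummy_columns` and `matrix_rank` are made up for this example, and the choice of reference levels (Subtype1, A1) is an assumption.

```python
from fractions import Fraction

# The five example rows from the question: (sample, condition, patient).
samples = [
    ("A1_ID", "Subtype1", "A1"),
    ("A1_RE", "Subtype1", "A1"),
    ("B1_ID", "Subtype2", "B1"),
    ("B1_RE", "Subtype2", "B1"),
    ("C1_ID", "Subtype1", "C1"),
]

def dummy_columns(values):
    """Indicator columns for each level, dropping the first (reference) level."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels[1:]}

def matrix_rank(rows):
    """Rank via exact Gauss-Jordan elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

conditions = [c for _, c, _ in samples]
patients = [p for _, _, p in samples]

# Columns: intercept, condition indicators, patient indicators.
columns = [[1] * len(samples)]
columns += list(dummy_columns(conditions).values())
columns += list(dummy_columns(patients).values())

design = [list(row) for row in zip(*columns)]  # one row per sample
n_columns = len(design[0])
rank = matrix_rank(design)

print(f"columns: {n_columns}, rank: {rank}")
if rank < n_columns:
    print("Design is not full rank: condition cannot be separated from patient.")
```

On this snippet the Subtype2 column is identical to the patient B1 column, so the matrix has 4 columns but rank 3. With the full data the same thing happens for every patient, because each patient sits in exactly one subtype.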
Roughly, your options for taking timepoint/stage into account are: pool (as you've done); normalise out the stage effect; stratify, so you have a separate analysis for each timepoint; or look for interactions (i.e. which genes have a different time profile in different subtypes).
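As a rough illustration of the stratify option, the sketch below splits the example sample table by timepoint, in plain Python. It assumes the timepoint can be read from the sample name suffix (`_ID` or `_RE`), which holds for the snippet in the question but may not hold for the full data. The function name `stratify_by_timepoint` is made up for this example.

```python
from collections import Counter, defaultdict

# The five example rows from the question: (sample, condition, patient).
samples = [
    ("A1_ID", "Subtype1", "A1"),
    ("A1_RE", "Subtype1", "A1"),
    ("B1_ID", "Subtype2", "B1"),
    ("B1_RE", "Subtype2", "B1"),
    ("C1_ID", "Subtype1", "C1"),
]

def stratify_by_timepoint(rows):
    """Group rows by the suffix after the last underscore in the sample name."""
    strata = defaultdict(list)
    for sample, condition, patient in rows:
        timepoint = sample.rsplit("_", 1)[1]  # assumed naming: <patient>_<timepoint>
        strata[timepoint].append((sample, condition, patient))
    return dict(strata)

strata = stratify_by_timepoint(samples)

# Largest number of samples any single patient contributes within a stratum.
max_per_patient = {
    tp: max(Counter(p for _, _, p in rows).values())
    for tp, rows in strata.items()
}

for tp, rows in strata.items():
    names = [s for s, _, _ in rows]
    print(tp, names, "max samples per patient:", max_per_patient[tp])
```

If every patient contributes at most one sample per stratum, as in this snippet, the repeated-measures concern does not arise within a stratum, and each one could be analysed with a plain ~ condition design. The cost is fewer samples per analysis.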
I agree with Gavin's comments here.
It will help if you can say more about exactly what kind of DE across subtype you are interested in, particularly given the two disease stages.
@Gavin, @Mike, thanks for your opinions; they are much valued. The dropouts are purely technical and not disease-related.
Also, in another analysis, we could see that the difference between the stages is minimal and that the subtype is the driving force in the overall expression pattern of the samples. This was our rationale for pooling them. Having said that, what we are interested in is characterizing the upregulated genes in our subtype in question.
Regarding the vignette: I do not fully understand which example in section 3.12 of the DESeq2 vignette you are referring to. What scenario are you suggesting?