Hi there,
I want to ask a question about the number of interaction terms in a multiple linear regression model.
I have a dataset with various predictors, including sample_group (case/control), some confounding factors (age, gender, ethnicity, ...), and most importantly, the cell contents of 5 cell types (cell types 1, 2, 3, ...) in each sample. My aim is to find the differential genes between case and control within each cell type (i.e., differential genes between case cell type1 and control cell type1, differential genes between case cell type2 and control cell type2, ...).
I wrote the design matrix in limma like this:
design <- model.matrix(~ Cell1:Sample_Group + Cell2:Sample_Group + Cell3:Sample_Group + Cell4:Sample_Group + Cell5:Sample_Group + Sample_Group + Gender + Ethnicity + Age + ..., data = pd)
Hence, the regression model included 5 interaction terms. Then I made a contrast matrix to detect the significant genes between case cell1 and control cell1 (Sample_GroupCase_Cell1 - Sample_GroupControl_Cell1), case cell2 and control cell2 (Sample_GroupCase_Cell2 - Sample_GroupControl_Cell2), and so on. However, the problem was that most cell types had no significant differential genes detected. I thought the reason was that a sample size of 200 was not enough to support 5 interaction terms in one model at the same time, so the confidence intervals of the interaction terms would be very wide.
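For readers following along, here is a minimal sketch of what a formula of this shape produces. It runs on simulated data, not the real dataset: the values in pd, the single Age covariate, and the sixth "remainder" cell type are all assumptions made so the example runs. It uses base R only. Because no Cell1, Cell2, ... main effects are in the formula, model.matrix should give each cell type one slope column per group, and the case-vs-control contrast for a cell type is then the difference of those two columns.

```r
set.seed(42)
n <- 200

# Hypothetical phenotype data: column names mirror the post, values are simulated
pd <- data.frame(
  Sample_Group = factor(rep(c("Control", "Case"), each = n / 2),
                        levels = c("Control", "Case")),
  Age = round(runif(n, 20, 70))
)

# Simulated proportions: 5 modelled cell types plus an unmodelled remainder
# (assumption), so the 5 columns used below do not sum to exactly 1
g <- matrix(rgamma(n * 6, shape = 2), nrow = n)
cells <- (g / rowSums(g))[, 1:5]
colnames(cells) <- paste0("Cell", 1:5)
pd <- cbind(pd, cells)

design <- model.matrix(
  ~ Cell1:Sample_Group + Cell2:Sample_Group + Cell3:Sample_Group +
    Cell4:Sample_Group + Cell5:Sample_Group + Sample_Group + Age,
  data = pd
)

# Colons are not syntactically valid in R names, so tidy the column names
colnames(design) <- make.names(colnames(design))
print(colnames(design))

# Contrast for Cell1: slope in cases minus slope in controls
case_col <- grep("Cell1.*Case", colnames(design))
ctrl_col <- grep("Cell1.*Control", colnames(design))
contrast_cell1 <- setNames(numeric(ncol(design)), colnames(design))
contrast_cell1[case_col] <- 1
contrast_cell1[ctrl_col] <- -1
print(contrast_cell1)

# Number of columns versus numerical rank of the design
print(c(columns = ncol(design), rank = qr(design)$rank))
```

In limma, a vector like contrast_cell1 could be passed to contrasts.fit after lmFit; makeContrasts expects syntactically valid column names, which is why the names are tidied first. If the five proportions in the real data sum to 100% in every sample, the ten slope columns would be linearly dependent on the intercept and group columns, so it is worth checking the rank on the real design.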
Thus, I am constructing 5 different regression models, each of which includes only 1 of the 5 interaction terms, and then finding the differential genes in the 5 cell types separately. However, my question is this: in such a model with only 1 interaction term, even if I can finally get the differential genes between case cell1 and control cell1, is this result reliable? In other words, will the differential genes detected in cell1 include too many false positives because the other 4 interaction terms were omitted from its model?
Thank you so much!
Thank you so much Dr Smyth,
Sorry, I didn't explain this clearly in the original question. The values of the 5 cell type variables are their contents in the sample (for example, in sample1, the content of cell1 is 10%, of cell2 is 20%, of cell3 is 30%, ...), so they are continuous variables. I have 200 samples, and each sample has 5 such continuous values for the 5 cell types, so it is a matrix with 200 rows (samples) and 5 columns (cell contents). This is why I didn't combine the 5 cell types into one CellType factor.
The Sample_Group variable, in contrast, has 2 levels [case (1) and control (0)]. So for the interaction term between, for example, cell type 1 (continuous cell content) and Sample_Group (discrete 0/1 levels), I interpreted its regression coefficient as the gene expression difference between case and control, specific to cell type 1. This is what I want: the differential genes between case and control, but specific to a single cell type.
However, as I said in the original question, if I include all 5 interaction terms in the same model, few such differences can be detected, but if I include only 1 interaction term (Cell1:Sample_Group), I am not sure whether omitting the other 4 cell types from the model will introduce many false positives.
Thank you for your help again!
OK, I see. I didn't understand from the original question that Cell1, Cell2 hold percentages.
Your model formula still doesn't make a lot of sense to me. The formula does not seem sufficient to deconvolve expression changes by cell type, which is presumably what you want to do. In any case, I don't believe it would be correct to include one cell type in the model at a time.
I only have time to answer software issues rather than advising on how to analyse complex experiments, so I will have to leave it here.
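One way to probe the one-cell-type-at-a-time concern is a small simulation. This is a rough sketch under stated assumptions, not an analysis of the real data. The proportions are simulated, with cell types 1 and 2 negatively correlated. Only cell type 2 truly differs between case and control (delta is a made-up effect size), and the group:proportion coefficient stands in for the "case slope minus control slope" contrast. Plain lm is used for a single gene instead of limma.

```r
set.seed(1)
n <- 200
grp <- rep(c(0, 1), each = n / 2)   # 0 = control, 1 = case

# Hypothetical proportions for 6 cell types (5 modelled + 1 unmodelled),
# drawn so each row sums to 1; types 1 and 2 dominate and are
# therefore negatively correlated with each other
alpha <- c(8, 8, 1, 1, 1, 1)
g <- matrix(rgamma(n * 6, shape = rep(alpha, each = n)), nrow = n)
prop <- g / rowSums(g)
colnames(prop) <- paste0("p", 1:6)

d <- data.frame(grp = grp, prop[, 1:5])

# One simulated gene: the same baseline in every cell type, and a
# case-vs-control difference (delta, an assumed value) in cell type 2 only
delta <- 2
d$y <- 5 + delta * d$grp * d$p2 + rnorm(n, sd = 0.05)

# Model with only the cell type 1 interaction
fit_single <- lm(y ~ grp * p1, data = d)

# Model with all five interactions
fit_full <- lm(y ~ grp * (p1 + p2 + p3 + p4 + p5), data = d)

# Rows: estimate, standard error, t value and p value of each interaction
single_p1 <- coef(summary(fit_single))["grp:p1", ]
full_p1   <- coef(summary(fit_full))["grp:p1", ]
full_p2   <- coef(summary(fit_full))["grp:p2", ]
print(rbind(single_p1, full_p1, full_p2))
```

Under these assumptions, the single-term model would be expected to attribute part of the cell type 2 effect to cell type 1, because the two proportions are correlated, while the full model should place the effect on cell type 2. How strong this is in real data depends on how correlated the cell proportions actually are.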
Okay. Thank you all the same.