Question

Number of interaction terms in linear regression model


yuabrahamliu • 0

@yuabrahamliu-17670


Hi there,

I want to ask a question about the number of interaction terms in a multiple linear regression model.

I have a dataset with various predictors, including sample_group (case/control), some confounding factors (age, gender, ethnicity, ...), and most importantly, cell contents of 5 cell types (cell types 1, 2, 3, ...) of the sample. My purpose is to find the differential genes between the case/control cell types (i.e., differential genes between case cell type1 and control cell type1, differential genes between case cell type2 and control cell type2, ...).

I wrote the design matrix in limma like this,

design <- model.matrix(~ Cell1:Sample_Group + Cell2:Sample_Group + Cell3:Sample_Group + Cell4:Sample_Group + Cell5:Sample_Group + Sample_Group + Gender + Ethnicity + Age + ..., data = pd)

Hence, the regression model included 5 interaction terms. Then, I made a contrast matrix to detect the significant genes between case cell1 and control cell1 (Sample_GroupCase_Cell1 - Sample_GroupControl_Cell1), case cell2 and control cell2 (Sample_GroupCase_Cell2 - Sample_GroupControl_Cell2), .... However, the problem was that most cell types had no significant differential genes detected. I thought the reason was that a sample size of 200 was not enough to include 5 interaction terms in one model at the same time, so the confidence intervals of the interaction terms would be very wide.
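To make the coefficients behind those contrasts concrete, here is a minimal base R sketch of the design matrix. It uses simulated stand-in data (the real `pd` is not shown in the post), omits the confounders, and writes `Sample_Group` first in each interaction so the column names come out in one consistent pattern; these are illustration choices, not the poster's exact code.

```r
# Simulated stand-in for the poster's phenotype table `pd`; every value is made up.
set.seed(1)
n <- 200
pd <- data.frame(
  Sample_Group = factor(rep(c("Control", "Case"), each = n / 2),
                        levels = c("Control", "Case")),
  Cell1 = runif(n), Cell2 = runif(n), Cell3 = runif(n),
  Cell4 = runif(n), Cell5 = runif(n)
)

# Same kind of terms as the formula above, with Sample_Group written first so
# that every interaction column is named "Sample_Group<level>:CellK".
design <- model.matrix(
  ~ Sample_Group + Sample_Group:Cell1 + Sample_Group:Cell2 +
    Sample_Group:Cell3 + Sample_Group:Cell4 + Sample_Group:Cell5,
  data = pd
)
print(colnames(design))

# Contrast: slope of expression on Cell1 in cases minus the slope in controls.
cell1_contrast <- setNames(numeric(ncol(design)), colnames(design))
cell1_contrast["Sample_GroupCase:Cell1"] <- 1
cell1_contrast["Sample_GroupControl:Cell1"] <- -1
print(cell1_contrast)
```

Because no `Cell1` main effect is in the formula, R gives each group its own slope column for each cell type, so the "case minus control" comparison for one cell type is a difference of two slope coefficients. A vector like `cell1_contrast` could then serve as one column of a contrast matrix.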

Thus, I am constructing 5 different regression models, each including only 1 of the 5 interaction terms, and then finding the differential genes in the 5 cell types separately. However, my question is: in such a model with only 1 interaction term, even if I can finally get the differential genes between case cell1 and control cell1, is this result reliable? In other words, will the differential genes detected in cell1 include too many false positives because the other 4 interaction terms are omitted from its model?

Thank you so much!

limma linear regression model interaction terms • 1.1k views

ADD COMMENT • link updated 5.1 years ago by Gordon Smyth 52k • written 5.1 years ago by yuabrahamliu • 0

Gordon Smyth · Answer 1 · 2020-03-11


Gordon Smyth 52k

@gordon-smyth


WEHI, Melbourne, Australia

Your model formula doesn't make a lot of sense to me. If you have five cell types, then there should be only one CellType factor (taking 5 levels) instead of 5 different factors.

You could analyse your experiment either as described in Section 9.5.2 of the limma User's Guide or as in Section 9.5.3. I am guessing you are perhaps intending to use the approach of Section 9.5.3, but you have far too many factors. If you follow Section 9.5.3, then Strain corresponds to a factor CellType taking 5 levels in your experiment and Treatment corresponds to Sample_Group.

A sample size of 200 is easily sufficient to fit any model you want, so sample size should not be a problem, but the model has to be specified correctly. You do have too many interaction terms simply because you have too many factors.
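As a rough illustration of the single-factor setup described in this answer, the sketch below builds a nested interaction design in base R. It assumes each sample belongs to exactly one cell type, which is the reading this answer takes; the data frame, the 20 replicates per combination, and the level names are made up for the example.

```r
# Hypothetical layout: 5 cell types x 2 groups x 20 replicates = 200 samples.
pd <- data.frame(
  CellType = factor(rep(paste0("Cell", 1:5), times = 40)),
  Sample_Group = factor(rep(c("Control", "Case"), each = 5, times = 20),
                        levels = c("Control", "Case"))
)

# Nested interaction formula: CellType plays the role of Strain and
# Sample_Group the role of Treatment.
design <- model.matrix(~ CellType + CellType:Sample_Group, data = pd)
print(colnames(design))
```

With this coding, each `CellTypeCellK:Sample_GroupCase` column is the case versus control difference within one cell type, so the design has 10 columns in total and the five comparisons of interest are read directly from five coefficients.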

ADD COMMENT • link 5.1 years ago Gordon Smyth 52k


Thank you so much Dr Smyth,

Sorry, I didn't explain this well in the original question. Actually, the values of the 5 cell type variables are their contents in the sample (for example, in sample1, its content of cell1 is 10%, of cell2 is 20%, of cell3 is 30%, ...), so they are continuous variables. I have 200 samples, and each sample has 5 such continuous values for the 5 cell types, so it is a matrix with 200 rows (samples) and 5 columns (cell contents). This is the reason why I didn't combine the 5 cell types into one CellType factor.

The Sample_Group variable has 2 levels [case (1) and control (0)]. So for the interaction term between, for example, cell type1 (continuous cell contents) and Sample_Group (discrete 0/1 levels), I interpreted its regression coefficient as the gene expression difference between case and control, specific to cell type1. This is what I want: the differential genes between case and control, specific to a single cell type.

However, as I said in the original question, if I include all 5 interaction terms in the same model, few such differences can be detected, but if I only include 1 interaction term (cell type1:Sample_Group), I am not sure whether omitting the other 4 cell types from the model will bring many false positives.

Thank you for your help again!

ADD REPLY • link updated 5.1 years ago by Gordon Smyth 52k • written 5.1 years ago by yuabrahamliu • 0


OK, I see. I didn't understand from the original question that Cell1, Cell2 hold percentages.

Your model formula still doesn't make a lot of sense to me. The formula does not seem sufficient to deconvolve expression changes into cell type, which is presumably what you want to do. In any case, I don't believe it would be correct to include one cell type in the model at a time.

I only have time to answer software issues rather than advising on how to analyse complex experiments, so I will have to leave it here.
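The concern about fitting one cell type at a time can be illustrated with a toy simulation for a single made-up gene. Everything here is an assumption chosen for the example: two cell-type proportions that are negatively correlated, a case effect that exists only in Cell2, and arbitrary noise levels. It is a sketch of omitted-term bias in ordinary least squares, not an analysis of the poster's data.

```r
set.seed(42)
n <- 200
Sample_Group <- factor(rep(c("Control", "Case"), each = n / 2),
                       levels = c("Control", "Case"))

# Assumed proportions: samples with more Cell1 tend to have less Cell2.
Cell1 <- runif(n, 0.1, 0.5)
Cell2 <- 0.6 - Cell1 + rnorm(n, sd = 0.05)

# Simulated expression: the case effect acts through Cell2 only.
is_case <- as.numeric(Sample_Group == "Case")
y <- 5 * Cell2 * is_case + rnorm(n, sd = 0.1)

# Case-minus-control difference in the Cell1 slope for a fitted model.
slope_diff <- function(fit) {
  b <- coef(fit)
  unname(b["Sample_GroupCase:Cell1"] - b["Sample_GroupControl:Cell1"])
}

fit_one  <- lm(y ~ Sample_Group + Sample_Group:Cell1)
fit_both <- lm(y ~ Sample_Group + Sample_Group:Cell1 + Sample_Group:Cell2)

print(slope_diff(fit_one))   # Cell1-only model
print(slope_diff(fit_both))  # model that also includes the Cell2 terms
```

In this setup the Cell1-only model attributes the Cell2 effect to Cell1, because Cell1 acts as a proxy for the omitted, correlated Cell2 term, while the model containing both terms estimates a Cell1 difference near zero. How strongly this happens in real data depends on how correlated the cell-type proportions actually are.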