Hello, I have a question about model.matrix() that came up during an analysis. I have two objects, counts and coldata. The counts matrix has 15 sample columns, 7 "Ctrl" and 8 "Treat"; its rows are genes with their expression levels. The coldata data frame has the same 15 samples as rows, with a column called "disease" and another called "new_variance". The "disease" column holds 7 "Ctrl" and 8 "Treat" labels, and the "new_variance" column holds 15 random values (e.g., 1.434, 1.5, 0.989, ...).
I'm having trouble understanding the meaning of the following three cases:
# 1. disease
mm1 = model.matrix(~disease, coldata)
ddsMat <- DESeqDataSetFromMatrix(counts, coldata, design = ~ 1)
ddsMat2 = DESeq(ddsMat, full = mm1, betaPrior = FALSE)
# 2. new_variance
mm2 = model.matrix(~new_variance, coldata)
ddsMat3 <- DESeqDataSetFromMatrix(counts, coldata, design = ~ 1)
ddsMat4 = DESeq(ddsMat3, full = mm2, betaPrior = FALSE)
# 3. interaction
mm3 = model.matrix(~new_variance*disease, coldata)
ddsMat5 <- DESeqDataSetFromMatrix(counts, coldata, design = ~ 1)
ddsMat6 = DESeq(ddsMat5, full = mm3, betaPrior = FALSE)
In the above examples, I can easily understand case 1 (disease): it is the typical RNA-seq design, where the fold change is estimated as Treat/Ctrl because the "disease" column in coldata clearly distinguishes "Ctrl" from "Treat."
However, in case 2 (new_variance), can we still distinguish "Ctrl" from "Treat"? I'm also not clear on how to interpret the interaction in case 3. I expect the meaning of the analysis to change somewhat, since using the "new_variance" column is not the typical design. So, when a new continuous metric is used in an RNA-seq analysis, what does it mean, and how should I interpret the statistical results?
I've looked at "Analyzing RNA-seq data with DESeq2," but I didn't quite understand it. If you could help me understand what I'm missing, I would be really grateful. In the cases of 2 and 3, what is the meaning of the Fold Change that appears in the statistical analysis results? If there's a correct answer, and if I've missed something, providing a reference link would be great. Thank you.
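For reference, here is a minimal base-R sketch of the design matrices the three calls build. The coldata below is invented for illustration (the new_variance values and the seed are assumptions, not the poster's data), and no DESeq2 call is made. As far as I understand it, DESeq2 reports one log2 fold change per design-matrix column, and for a continuous column that log2 fold change is per unit of change of the variable.

```r
# Toy coldata: 7 Ctrl and 8 Treat samples; new_variance values are made up.
set.seed(1)
coldata <- data.frame(
  disease = factor(rep(c("Ctrl", "Treat"), times = c(7, 8))),
  new_variance = round(runif(15, 0.5, 2), 3)
)

mm1 <- model.matrix(~ disease, coldata)
mm2 <- model.matrix(~ new_variance, coldata)
mm3 <- model.matrix(~ new_variance * disease, coldata)

# Case 1: intercept (Ctrl baseline) and a 0/1 indicator for Treat.
# The diseaseTreat coefficient is the Treat vs Ctrl log2 fold change.
colnames(mm1)

# Case 2: intercept and the numeric covariate itself.
# The new_variance coefficient is a slope: log2 fold change per one-unit
# increase in new_variance. The disease labels are not used at all.
colnames(mm2)

# Case 3: intercept, slope in Ctrl, Treat vs Ctrl difference at
# new_variance = 0, and the difference in slopes between Treat and Ctrl.
colnames(mm3)

# The interaction column is the product of the two main-effect columns.
head(mm3)
```

So in case 2 the "fold change" is not Treat/Ctrl but the change in expression associated with a one-unit increase in new_variance, and in case 3 the interaction coefficient asks whether that slope differs between the two groups.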
Thanks for the reply. Honestly, it's not an entirely random variable: it does differ between control and treatment, just not with a perfect separation like 0 and 1. For example, assume the 7 control samples have a new_variance mean of 0.5 with a standard deviation of 0.2, while the 8 treatment samples have a mean of 2.5 with a standard deviation of around 0.6. So I'd like a rough idea of what calculations are performed internally when I run mm2 or mm3. If new_variance "almost" separates control from treatment, does this have any significance?
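The near-separation scenario described above can be explored with a small simulation. This is only a sketch under stated assumptions: the new_variance values are drawn from normal distributions with the means and standard deviations given in the post, the gene's expression is a made-up log-scale outcome that depends on disease alone, and an ordinary lm() stands in for DESeq2's negative binomial GLM as an analogy.

```r
set.seed(42)
disease <- factor(rep(c("Ctrl", "Treat"), times = c(7, 8)))

# Hypothetical covariate: Ctrl ~ N(0.5, 0.2), Treat ~ N(2.5, 0.6)
new_variance <- c(rnorm(7, mean = 0.5, sd = 0.2),
                  rnorm(8, mean = 2.5, sd = 0.6))
coldata <- data.frame(disease, new_variance)

# How strongly does the covariate track the 0/1 Treat indicator?
r <- cor(coldata$new_variance, as.numeric(coldata$disease == "Treat"))
r

# Toy log-scale expression for one gene, driven by disease only
# (one unit higher in Treat, plus a little noise).
y <- 5 + 1 * (disease == "Treat") + rnorm(15, sd = 0.1)

# Analogue of mm2: the slope picks up the group difference because
# new_variance acts as a proxy for disease.
fit2 <- lm(y ~ new_variance, data = coldata)
coef(fit2)

# Analogue of mm3: with disease in the model, the new_variance slope and
# the interaction only describe variation within each group.
fit3 <- lm(y ~ new_variance * disease, data = coldata)
coef(fit3)
```

When the covariate nearly separates the groups, the two columns are highly correlated, so a significant new_variance coefficient in the case 2 design cannot by itself be told apart from a plain Treat vs Ctrl effect.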
It's not entirely a random variable? There's no gray area here. You said it had 15 random values.
Anyway, as I already mentioned, this site isn't meant to be a place for people to get a primer on linear regression; it's meant to help people with technical questions about the software. I already gave you a reference link to Julian Faraway's linear regression book. You might also Google things like 'ANOVA vs regression' and 'interaction term linear regression' if Faraway's book is TL;DR for you.