Hi Bioconductor forum community,
I have a couple of questions for developers and experienced users of the MAST package. I am analyzing a data set that consists of single-cell data from a number of case and control samples.
I would like to test differential expression between cases and controls using MAST within specific cell clusters.
I have patient ID, age, gender and sample freshness (some samples were freshly processed and others were frozen first and processed later) as potential confounders in addition to the well-known cellular detection rate that is commonly modeled as a confounder in MAST.
I am trying to control for the effect of all of these confounders in my DGE analysis, so I define my model as follows:
zlm(~ condition + gender + age + freshness + cngeneson + sampleID, sca)
Here condition is case/control and sca is my SingleCellAssay object. Sometimes I include the sample ID as a random effect instead, i.e. (1 | sampleID).
Then I perform a likelihood ratio test comparing the full model to a reduced model without the condition term, as follows: summary(zlmCond, doLRT = "conditioncase")
Afterwards I extract the logFC coefficients from the results object.
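The workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the object name `sca` and the colData column names (`condition`, `gender`, `age`, `freshness`, `sampleID`) are placeholders, the random-intercept variant is shown, and the "average logFC" is assumed to be the difference in mean log-normalized expression between cases and controls.

```r
library(MAST)                  # zlm() and the summary method for zlm fits
library(SummarizedExperiment)  # colData(), assay()
library(data.table)            # summary(...)$datatable is a data.table

# ASSUMPTION: `sca` is an existing SingleCellAssay for one cell cluster,
# holding log-normalized expression, with the colData columns used below.

# Cellular detection rate (number of genes detected per cell), centred and scaled
colData(sca)$cngeneson <- as.numeric(scale(colSums(assay(sca) > 0)))

# Make "control" the reference level so the coefficient is named "conditioncase"
colData(sca)$condition <- relevel(factor(colData(sca)$condition), ref = "control")

# Hurdle model with sample ID as a random intercept.
# Random effects require method = "glmer", and ebayes must be switched off.
zlmCond <- zlm(
  ~ condition + gender + age + freshness + cngeneson + (1 | sampleID),
  sca,
  method = "glmer",
  ebayes = FALSE
)

# Likelihood ratio test for the condition term
summaryDt <- summary(zlmCond, doLRT = "conditioncase")$datatable

# Model-based logFC and hurdle p-value per gene
logFC <- summaryDt[contrast == "conditioncase" & component == "logFC",
                   .(primerid, coef, ci.lo, ci.hi)]
pvals <- summaryDt[contrast == "conditioncase" & component == "H",
                   .(primerid, `Pr(>Chisq)`)]
res <- merge(logFC, pvals, by = "primerid")

# ASSUMPTION: "average logFC" = difference of mean log-normalized expression
expr    <- assay(sca)
is_case <- colData(sca)$condition == "case"
avg_logFC <- rowMeans(expr[, is_case, drop = FALSE]) -
  rowMeans(expr[, !is_case, drop = FALSE])

# Flag genes where the two estimates disagree in sign
res[, avg_logFC := avg_logFC[as.character(primerid)]]
res[, sign_differs := sign(coef) != sign(avg_logFC)]
```

The `sign_differs` column marks the genes that question 1 below is about.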
Here are my questions:
1. I have trouble interpreting those coefficients.
When I compare the MAST coefficients to the average logFC values calculated from the normalized counts, I find that for a considerable number of genes the sign of the MAST coefficient is opposite to that of the average logFC. Which of the two is the more accurate estimate of effect size, and which should I rely on to decide the real direction of change in gene expression: the coefficient or the average logFC?
2. If I want to include a continuous variable like age in the model, should I scale it the way one scales the cellular detection rate? What is the effect of scaling continuous variables on the model, and does it really matter?
3. Are there other ways to control for the effect of patient or sample IDs? Would you recommend aggregating counts per sample within each cell cluster (pseudobulk) and then using DESeq2 or edgeR, for example?
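To make question 3 concrete, the aggregation I have in mind would look roughly like the sketch below. The toy count matrix, the metadata and all names in it are made up for illustration; only base R is used, and the downstream DESeq2 or edgeR step is left out.

```r
set.seed(42)

# Toy raw counts: 5 genes x 12 cells (hypothetical data)
n_genes <- 5
n_cells <- 12
counts <- matrix(
  rpois(n_genes * n_cells, lambda = 3),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)

# Toy cell metadata: 4 samples, each with cells from clusters A and B
meta <- data.frame(
  sampleID  = rep(c("S1", "S2", "S3", "S4"), each = 3),
  cluster   = rep(c("A", "B", "A"), times = 4),
  row.names = colnames(counts)
)

# One pseudobulk column per sample x cluster combination
group <- paste(meta$sampleID, meta$cluster, sep = "_")

# rowsum() sums rows by group, so transpose to put cells in rows,
# then transpose back to genes x groups
pseudobulk <- t(rowsum(t(counts), group = group))

print(pseudobulk)
```

The columns of `pseudobulk` for a single cluster, together with one row of sample-level metadata (condition, gender, age, freshness) per column, would then be the input to a bulk method such as DESeq2 or edgeR.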
Thank you very much!