Dear Michael Love,
I am conducting a comparative analysis of tissue samples from cases compared to controls. Simultaneously, I am examining four different regions within each individual. For this comparison, I employed DESeq2, utilizing the following model: design = ~ sex + PMI + batchlib + region_condition Here, "region_condition" represents the combined effect of region and disease status, denoted as region(1 to 4)_condition(1 & 2).
To explore the impact of disease on each region, I performed the following:
mod_mat <- model.matrix(design(dds), colData(dds))
# the below code was run for each region_condition separately
region1_condition2 <- colMeans(mod_mat[dds$region_condition == "region1_condition2", ])
region1_condition1 <- colMeans(mod_mat[dds$region_condition == "region1_condition1", ])
res <- results(dds, contrast = region1_condition2 - region1_condition1)
res <- lfcShrink(dds, contrast = contrast, res =res, type="ashr")
As a result, I will have four sets of Differentially Expressed Genes (DEGs). Subsequently, I aim to compare each set of DEGs individually. While calculating the log2foldchange is straightforward by subtracting the log2foldchange of region1 from region2, determining p-values poses a challenge. I have attempted a meta-analysis using Fisher's test for p-value calculation, but I am uncertain about its appropriateness.
It's noteworthy that I employed an interaction term in my model, although the results from each approach (with and without interaction term) yielded disparate DEG counts. Despite the complexity, I prefer not to rely on the interaction term and seek a method to obtain p-values for the comparison.
Any insights or suggestions would be greatly appreciated.
Best regards, Paria
The null hypothesis is that the impact of the disease is equal across different regions.
Then you want an LRT with full design including an interaction of disease and region, and the reduced taking that interaction term away.
Thank you so much for your response! Does this mean that my approach is not correct or using LRT is more accurate? Thanks, Paria
You just said you would combine p-values. I'm not sure what that would give you, but it doesn't seem like a direct way to test the null you stated.
The LRT is the standard way to test if the disease effect is equal across regions.
I see. I run DEG analysis using LRT test as below:
I have two questions: First, I am not sure why I get same results for each cluster when I put different coefs with "lfcShrink" ? should I just ignore whatever in resultsNames(dds) and consider that the effect is the one which is not in reduced model?
Second, does LRT apply also when I want to measure disease effect controlling for region, as well as disease effect in each region separately? (again with a different ref level of region and same coef, group_id_als_vs_con, I expected to get disease effect in different regions, but I did not). I appreciate it if you confirm if using below model and tests are correct for the mentioned comparison.
HOWEVER, the results of approach 2.1 and 2.2 are not the same. I was wondering which one you recommend more?
Thank you so much for your help in advance! Paria
If you mean p-value from the LRT, the answer is in the FAQ:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#i-ran-a-likelihood-ratio-test-but-results-only-gives-me-one-comparison.
The LRT tests if the disease effect is equal across regions.
The LRT I suggested tests for the disease effect across all regions. You would need different tests to test each region's disease effect separately. The easiest way is to remove the main variable from design, which will produce an interaction term for each region (including the reference level) that you can extract with simply
results(dds, name="...", test="wald")
.I got my answer for the first part, thank you! Regarding my second question, I am getting an error when I do what you explained
So, I did it as below and removing both region and interaction term:
if you meant this analysis, it worked like LRT test and different coefficient output same results!!?
Additionally, my coefficient of interest is "group_id_als_vs_con" , and defining each region as ref level gives me the effect of disease in that region, could you please confirm if you mean it? however, with LRT , then Wald test it's not working.
I really appreciate your time and feedback!
Thank you!
Paria
Sorry I was not clear. In my last message I meant you should use the provided as the (full) design.
So like this:
...this is to obtain region specific differences, not the LRT approach.
Ah, I see. I think it's clear now. The only thing that I am not sure is that why I need to remove region "variable". I mean the coefficients when including region and excluding it is as follow respectively:
what would be the difference?
Moreover, if for instance, I extract "group_id_als_vs_con" I know it is the effect of disease in ref region which is ocu here. But I would like to know if it also only consider this comparison in ref batch, ref sex, etc? If it does the comparison in ref level of these variables I think I had the misunderstanding as I put these variables to regress them out. Thanks for your time and patient! Paria
The second one is the one that you can use to obtain region specific differences. The first cannot be interpreted that way. For more of a "why" answer, I recommend to meet with a statistician or someone familiar with linear modeling in R who can explain these terms and the different formulations.
Perfect, thanks for bearing with me. I'll summarized what we discussed here for anyone who will read this post later.
In my pseudo-bulk DEG analysis, I have samples from four tissue regions of both cases and controls. The variables I am including in my model are as follows: design = ~ sex + PMI + batchlib + group_id + region + group_id:region
I need to test different hypothesis:
First, I need to measure the effect of disease in each region separately: null hypothesis: Disease has no effect on different regions. The below analysis are done after aggregation of data to the sample level.
Second, I need to obtain region specific differences which means how much the effect of disease is different in regions compared to ref region (which is ocu here). null hypothesis: There is no region specific difference in regions (med, sc, sl) compared to ocu
Third, I would like to know how the impact of the disease is different in all the regions together compared to the ref region. The difference with the second approach is that here we look for a patterns, like genes that show different pattern in all regions and not looking at regions two by two. I used LRT which tests multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. null hypothesis: there is no genes whose expression is different in all regions.
Fourth, to measure the overall effect of the disease I removed interaction term; this way, status effect represents the overall effect controlling for differences due to region.
If there is anything wrong please notify me dear Michael Love
Thanks,
Paria