How to calculate p value when comparing two sets of DEGs and not applying interaction
1
0
Entering edit mode
@fa00ef88
Last seen 6 months ago
Canada

Dear Michael Love,

I am conducting a comparative analysis of tissue samples from cases compared to controls. Simultaneously, I am examining four different regions within each individual. For this comparison, I employed DESeq2, utilizing the following model: design = ~ sex + PMI + batchlib + region_condition Here, "region_condition" represents the combined effect of region and disease status, denoted as region(1 to 4)_condition(1 & 2).

To explore the impact of disease on each region, I performed the following:

mod_mat <- model.matrix(design(dds), colData(dds))
# the below code was run for each region_condition separately
region1_condition2 <- colMeans(mod_mat[dds$region_condition == "region1_condition2", ])
region1_condition1 <- colMeans(mod_mat[dds$region_condition == "region1_condition1", ])
res <- results(dds, contrast = region1_condition2 - region1_condition1)
res <- lfcShrink(dds, contrast =  contrast, res =res, type="ashr")

As a result, I will have four sets of Differentially Expressed Genes (DEGs). Subsequently, I aim to compare each set of DEGs individually. While calculating the log2foldchange is straightforward by subtracting the log2foldchange of region1 from region2, determining p-values poses a challenge. I have attempted a meta-analysis using Fisher's test for p-value calculation, but I am uncertain about its appropriateness.

It's noteworthy that I employed an interaction term in my model, although the results from each approach (with and without interaction term) yielded disparate DEG counts. Despite the complexity, I prefer not to rely on the interaction term and seek a method to obtain p-values for the comparison.

Any insights or suggestions would be greatly appreciated.

Best regards, Paria

DEGcomparison DESeq2 InteractionSet pvalue • 1.8k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

What is the null hypothesis for the p-value you want to construct?

ADD COMMENT
0
Entering edit mode

The null hypothesis is that the impact of the disease is equal across different regions.

ADD REPLY
0
Entering edit mode

Then you want an LRT with full design including an interaction of disease and region, and the reduced taking that interaction term away.

ADD REPLY
0
Entering edit mode

Thank you so much for your response! Does this mean that my approach is not correct or using LRT is more accurate? Thanks, Paria

ADD REPLY
1
Entering edit mode

You just said you would combine p-values. I'm not sure what that would give you, but it doesn't seem like a direct way to test the null you stated.

The LRT is the standard way to test if the disease effect is equal across regions.

ADD REPLY
0
Entering edit mode

I see. I run DEG analysis using LRT test as below:

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(cluster_counts, 
                              colData = cluster_metadata, 
                              design = ~ sex + PMI + batchlib + group_id + region + group_id:region)

# pre-filtering
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")

# Running DESeq2 with reduced model
dds <- DESeq(dds, test="LRT", reduced = ~ sex + PMI + batchlib + group_id + region)

resultsNames(dds)
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "region_med_vs_sl"      "region_ocu_vs_sl"     
[13] "region_sc_vs_sl"       "group_idals.regionmed" "group_idals.regionocu" "group_idals.regionsc"

res <- lfcShrink(dds, coef=14, type="ashr")

res_tbl <- res %>%
        data.frame() %>%
        rownames_to_column(var="gene") %>%
        as_tibble()

padj_cutoff <- 0.05
# Subset the significant results
res_tbl <- res %>%
        data.frame() %>%
        rownames_to_column(var="gene") %>%
        as_tibble()

if (!is.data.frame(res_tbl)) {
    res_tbl <- as.data.frame(res_tbl)
  }

sig_res <- dplyr::filter(res_tbl, padj < padj_cutoff) %>%
        dplyr::arrange(padj)

I have two questions: First, I am not sure why I get same results for each cluster when I put different coefs with "lfcShrink" ? should I just ignore whatever in resultsNames(dds) and consider that the effect is the one which is not in reduced model?

Second, does LRT apply also when I want to measure disease effect controlling for region, as well as disease effect in each region separately? (again with a different ref level of region and same coef, group_id_als_vs_con, I expected to get disease effect in different regions, but I did not). I appreciate it if you confirm if using below model and tests are correct for the mentioned comparison.

  1. disease effect in general using design model as ~ sex + PMI + batchlib + group_id + region, then, extracting the coefficient, case_vs_control.
  2. disease effect in each region separately: 2.1. using design model as ~ sex + PMI + batchlib + group_id + region + group_id:region. Then, defining each region as ref level to measure the effect of disease in that region. OR 2.2. using design model as ~ sex + PMI + batchlib + group_id_region (combination of two variable). Then, comparing case vs control in each region by defining contrasts: mod_mat <- model.matrix(design(dds), colData(dds)) #disease effect in region 1 by subtracting first from second in contrast ocu_als <- colMeans(mod_mat[dds$region_condition == "ocu_als", ]) ocu_con <- colMeans(mod_mat[dds$region_condition == "ocu_con", ]) #disease effect in region 2 by subtracting first from second in contrast med_con <- colMeans(mod_mat[dds$region_condition == "med_con", ]) med_als <- colMeans(mod_mat[dds$region_condition == "med_als", ]) # same for region 3 and 4

HOWEVER, the results of approach 2.1 and 2.2 are not the same. I was wondering which one you recommend more?

Thank you so much for your help in advance! Paria

ADD REPLY
0
Entering edit mode

same results for each cluster

If you mean p-value from the LRT, the answer is in the FAQ:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#i-ran-a-likelihood-ratio-test-but-results-only-gives-me-one-comparison.

The LRT tests if the disease effect is equal across regions.

does LRT apply also when I want to measure disease effect controlling for region, as well as disease effect in each region separately?

The LRT I suggested tests for the disease effect across all regions. You would need different tests to test each region's disease effect separately. The easiest way is to remove the main variable from design, which will produce an interaction term for each region (including the reference level) that you can extract with simply results(dds, name="...", test="wald").

~ sex + PMI + batchlib + group_id + group_id:region.
ADD REPLY
0
Entering edit mode

I got my answer for the first part, thank you! Regarding my second question, I am getting an error when I do what you explained

dds <- DESeqDataSetFromMatrix(cluster_counts, 
                               colData = cluster_metadata, 
                               design = ~ sex + PMI + batchlib + group_id + region + group_id:region)

keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")

dds <- DESeq(dds, test="LRT", reduced = ~ sex + PMI + batchlib + group_id + group_id:region)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Error in nbinomLRT(object, full = full, reduced = reduced, quiet = quiet,  : 
  less than one degree of freedom, perhaps full and reduced models are not in the correct order

So, I did it as below and removing both region and interaction term:

dds <- DESeqDataSetFromMatrix(cluster_counts, 
                               colData = cluster_metadata, 
                               design = ~ sex + PMI + batchlib + group_id + region + group_id:region)

keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")

dds <- DESeq(dds, test="LRT", reduced = ~ sex + PMI + batchlib + group_id)

resultsNames(dds)
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "region_med_vs_ocu"     "region_sc_vs_ocu"     
[13] "region_sl_vs_ocu"      "group_idals.regionmed" "group_idals.regionsc"  "group_idals.regionsl" 

res <- results(dds, name="region_med_vs_ocu", test="Wald")

if you meant this analysis, it worked like LRT test and different coefficient output same results!!?

Additionally, my coefficient of interest is "group_id_als_vs_con" , and defining each region as ref level gives me the effect of disease in that region, could you please confirm if you mean it? however, with LRT , then Wald test it's not working.

I really appreciate your time and feedback!

Thank you!

Paria

ADD REPLY
0
Entering edit mode

Sorry I was not clear. In my last message I meant you should use the provided as the (full) design.

So like this:

design(dds) <- ~ sex + PMI + batchlib + group_id + group_id:region
dds <- DESeq(dds)
results(dds, name="...")

...this is to obtain region specific differences, not the LRT approach.

ADD REPLY
0
Entering edit mode

Ah, I see. I think it's clear now. The only thing that I am not sure is that why I need to remove region "variable". I mean the coefficients when including region and excluding it is as follow respectively:

#coefficient including region
resultsNames(dds) 
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "region_med_vs_ocu"     "region_sc_vs_ocu"     
[13] "region_sl_vs_ocu"      "group_idals.regionmed" "group_idals.regionsc"  "group_idals.regionsl" 

#coefficient excluding region
resultsNames(dds)
 [1] "Intercept"             "sex_2_vs_1"            "PMI"                   "batchlib_2_vs_1"      
 [5] "batchlib_3_vs_1"       "batchlib_4_vs_1"       "batchlib_5_vs_1"       "batchlib_6_vs_1"      
 [9] "batchlib_7_vs_1"       "group_id_als_vs_con"   "group_idcon.regionmed" "group_idals.regionmed"
[13] "group_idcon.regionsc"  "group_idals.regionsc"  "group_idcon.regionsl"  "group_idals.regionsl"

what would be the difference?

Moreover, if for instance, I extract "group_id_als_vs_con" I know it is the effect of disease in ref region which is ocu here. But I would like to know if it also only consider this comparison in ref batch, ref sex, etc? If it does the comparison in ref level of these variables I think I had the misunderstanding as I put these variables to regress them out. Thanks for your time and patient! Paria

ADD REPLY
0
Entering edit mode

The second one is the one that you can use to obtain region specific differences. The first cannot be interpreted that way. For more of a "why" answer, I recommend to meet with a statistician or someone familiar with linear modeling in R who can explain these terms and the different formulations.

ADD REPLY
0
Entering edit mode

Perfect, thanks for bearing with me. I'll summarized what we discussed here for anyone who will read this post later.

In my pseudo-bulk DEG analysis, I have samples from four tissue regions of both cases and controls. The variables I am including in my model are as follows: design = ~ sex + PMI + batchlib + group_id + region + group_id:region

I need to test different hypothesis:

First, I need to measure the effect of disease in each region separately: null hypothesis: Disease has no effect on different regions. The below analysis are done after aggregation of data to the sample level.

#Create DESeq2 object
dds <- DESeqDataSetFromMatrix(cluster_counts, 
                              colData = cluster_metadata, 
                              design = ~ sex + PMI + batchlib + group_id + region + group_id:region)

#pre-filtering (37 is sample size)
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

#determining ref level
#to measure the effect of disease in s=each region change the ref level to that region.
dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")

#Running DESeq2
dds <- DESeq(dds, fitType='local') 
#then, I extract the coefficient "group_id_als_vs_con".

Second, I need to obtain region specific differences which means how much the effect of disease is different in regions compared to ref region (which is ocu here). null hypothesis: There is no region specific difference in regions (med, sc, sl) compared to ocu

#Create DESeq2 object excluding region
dds <- DESeqDataSetFromMatrix(cluster_counts, 
                              colData = cluster_metadata, 
                              design = ~ sex + PMI + batchlib + group_id + group_id:region)

#pre-filtering (37 is sample size)
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

#determining ref level
#to measure the effect of disease in s=each region change the ref level to that region.
dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")

#Running DESeq2
dds <- DESeq(dds, fitType='local') 
#then extracting group_idals.regionsl, or group_idals.regionsc, or group_idals.regionmed

Third, I would like to know how the impact of the disease is different in all the regions together compared to the ref region. The difference with the second approach is that here we look for a patterns, like genes that show different pattern in all regions and not looking at regions two by two. I used LRT which tests multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. null hypothesis: there is no genes whose expression is different in all regions.

#Create DESeq2 object with full model
dds <- DESeqDataSetFromMatrix(cluster_counts, 
                              colData = cluster_metadata, 
                              design = ~ sex + PMI + batchlib + group_id + region + group_id:region)

# pre-filtering
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

dds$group_id <- relevel(dds$group_id, ref="con")
dds$region <- relevel(dds$region, ref = "ocu")


#Running DESeq2 with reduced model
dds <- DESeq(dds, test="LRT", reduced = ~ sex + PMI + batchlib + group_id + region)

#then extracting any coefficient will output the effect of the variable which was reduced

Fourth, to measure the overall effect of the disease I removed interaction term; this way, status effect represents the overall effect controlling for differences due to region.

#Create DESeq2 object with full model
dds <- DESeqDataSetFromMatrix(cluster_counts, 
                              colData = cluster_metadata, 
                               design = ~ sex + PMI + batchlib + group_id + region)
# pre-filtering
keep <- rowSums(counts(dds)>=10)>= 0.9*37   
dds <- dds[keep,]

dds$group_id <- relevel(dds$group_id, ref="con")

dds <- DESeq(dds, fitType='local')
#then, extracting the "group_id_als_vs_con" coefficient.

If there is anything wrong please notify me dear Michael Love

Thanks,

Paria

ADD REPLY

Login before adding your answer.

Traffic: 527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6