Question

Comparison of results obtained from a subset of the data versus the complete dataset for a few selected genes.

0

Entering edit mode

Bioinformatician_R ▴ 20

@hrishi27n-11821

Last seen 33 minutes ago

United States

Hi,

I am conducted a Differential expression analysis using DESeq2 to investigate the impact of treatment across sexes. The dataset includes two treatment groups, A and B. To assess sex-specific treatment effects, I am using these two approaches: (1) using the full dataset with a model that included sex and treatment as factors, followed by extraction of sex-specific contrasts; and (2) subsetting the dataset by sex and performing DESeq2 analysis separately for each group before extracting treatment-specific results. My PCA analysis does not show any separation by sex.

When comparing the log2 fold changes obtained from both approaches, the results were largely consistent, with most genes showing similar values. However, a few genes exhibited notable differences. Interestingly, one of these genes was the top hit in the DE analysis using the full dataset, yet it was not even statistically significant when using the subset-based approach. This discrepancy is not entirely unexpected, as p-values can vary between approaches due to differences in sample size and model structure. I also see for a few genes the baseMean is drastically different between both approaches.

I would like to understand what approach should be taken in such cases.

Whole Data

pheno <- pheno %>%
  dplyr::mutate(sex_specific = paste(cohort,updated_sex, sep='_')) %>%
  dplyr::mutate(sex_specific = as.factor(sex_specific))

dds_sex <- DESeqDataSetFromMatrix(countData =  round(filtered_data),
                                    colData = pheno,
                                    design = ~0 + sex_specific)

dds_sex <- DESeq(dds_sex, parallel = TRUE)

male_compare <- results(dds_sex, contrast = c('sex_specific','treated_A_M','treated_B_M')) %>%
      data.frame() %>% 
      dplyr::filter(!is.na(padj)) %>%
      dplyr::arrange(padj) %>%
      rownames_to_column(var='ensembl')

Subset Data

dds <- DESeqDataSetFromMatrix(countData =  round(filtered_data),
                                    colData = pheno,
                                    design = ~0 + updated_sex + cohort)

dds <- DESeq(dds, parallel = TRUE)

dds_men_sub <- subset(dds, select=colData(dds)$updated_sex == 'M')
design(dds_men_sub) <- formula(~0 + cohort)

dds_men_sub <- DESeq(dds_men_sub, parallel = TRUE)

subset_method <- results(dds_men_sub, contrast = c('cohort','treated_A','treated_B')) %>%
  data.frame() %>%
  dplyr::filter(!is.na(padj)) %>%
  dplyr::arrange(padj) 

subset_method <- subset_method[male_compare$ensembl,]

plot(subset_method$log2FoldChange, male_compare$log2FoldChange)

enter image description here

DESeq2 • 361 views

ADD COMMENT • link updated 1 day ago by swbarnes2 ★ 1.4k • written 2 days ago by Bioinformatician_R ▴ 20

score 0 · Answer 1 · 2025-04-11

0

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 1 hour ago

San Diego

In general, using the entire dataset and then just comparing the subgroups you want within the whole dataset is preferable.

ADD COMMENT • link 1 day ago swbarnes2 ★ 1.4k