Question

Constant High Padj Values

0

Entering edit mode

le.alex.chang • 0

@lealexchang-14531

Last seen 7.4 years ago

Hi everyone,

Our lab is attempting to use the DESeq2 package to perform DGE analysis with a multifactorial experimental design. Our question of interest is whether a treatment effect differs for two conditions (tissues) C and T at each time point, day 3, 7, 14 and 56; we would like to obtain DEGs for each day. For each tissue group, 5 replicates were treated, and 3 served as controls, for a total of (5 treated + 3 controls)* 2 conditions = 16 replicates for each of the 4 days.

library(DESeq2)
library(RUVSeq)
# a separate dds object was created for each day by repeating these lines
condition <- factor(c(rep("C", 8), rep("T",8)))
treatment <- factor(c(rep("treated", 5), rep("control", 3),
                      rep("treated", 5), rep("control", 3)))

# RUVg with negative control house-keeping genes
uq3 <- betweenLaneNormalization(cts3, which="upper")
ruv3 <- RUVg(uq3, cIdx=normgenes, k=1, round=TRUE)

coldata3 <- data.frame(row.names=colnames(cts3), treatment, condition, ruv3$W)
coldata3$condtreat <- factor(paste0(coldata3$condition, coldata3$treatment))

dds3 <- DESeqDataSetFromMatrix(countData=cts3,
                               colData=coldata3,
                               design=~W_1 + condtreat)

boxplot(log2(counts(dds3)), range=0, las=2) # medians line up


dds3$treatment <- relevel( dds3$treatment, "control" )
dds3$condition <- relevel( dds3$condition, "C" )

dds3 <- DESeq(dds3)


res3 <- results(dds3, contrast=list(c("condtreat_Ttreated_vs_Ccontrol"), 
                                    c("condtreat_Tcontrol_vs_Ccontrol", 
                                      "condtreat_Ctreated_vs_Ccontrol")), alpha=0.05)  
# is this contrast correct to find differences in treatment effect between both conditions? 

table(res3$padj<0.05)

hist(res3$pvalue) # looks like the ones others have for significant data

After running the code above for day 3, a significant number of DEG's (with padj < 0.05) were outputted (~1200). However, by running the exact same calls for the other days, under 15 are found. By looking at the dataframe outputted by results(), The large majority of padj values are identical and greater than 0.99. After plotting the pvalue histogram, we noticed that days 7 and up had a "hill-shape" which could explain the inadequacy for the BH padj method. Why are the pvalue histograms so different from day 3?

Another data analyst has performed DGE on our data using EdgeR and managed to pull > 1000 genes per day. 500 of ours from day 3 overlap with theirs, which leads us to believe that our data is actually statistically significant; our DESeq2 analysis just seems to be missing something. Does anyone know what we could be overlooking?

Thank you,

Alex

deseq2 differential gene expression multiple factor design padj • 2.9k views

ADD COMMENT • link updated 7.4 years ago by Michael Love 43k • written 7.4 years ago by le.alex.chang • 0

score 1 · Answer 1 · 2017-12-05

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 19 hours ago

United States

Let's jump back to the biological question and appropriate design:

"Our question of interest is whether a treatment effect differs for two conditions "

You can use an interaction term between condition and treatment, and test on this term. This would look like a design:

~condition + treatment + condition:treatment

Then you can easily perform a likelihood ratio test:

dds <- DESeq(dds, test="LRT", reduced=~condition + treatment)
res <- results(dds)

ADD COMMENT • link 7.4 years ago Michael Love 43k

0

Entering edit mode

Hi Michael,

Thanks for this response, I'm working with Alex up there on this. We've done the LRT and had no issues with our dataset in that regard. In addition to the time-series question (which is answered with the LRT analyses), we wanted to know the DEGs at each individual time point taking into account each condition's respective control.

e.g. C(treated/control) vs. T(treated/control) at 3 days

We've gotten over 1200 genes for day 3, but 7, 14, 56 we've had lots of issues with the p-adjusted value being very close to 1. We aren't sure why this is happening, because we have biological replicates (others have reported this when there are no biological replicates).

Thanks for all your help,
James

ADD REPLY • link 7.4 years ago jmsyl.hong • 0

0

Entering edit mode

You can look at examples by eye to get a better sense (see plotCounts), but from what you've told me, I would infer that the changes are more pronounced at day 3 and less so at the other days, such that testing on the individual interaction terms for those other days doesn't give you small enough p-values to survive multiple test correction.

ADD REPLY • link 7.4 years ago Michael Love 43k

0

Entering edit mode

Thanks Michael, it's strange as I've done the analysis on EdgeR and cuffdiff and have gotten hundreds of DEGs for day 7, 14, 56. I also visually inspected some of the genes that were found to be significant in one pipeline vs. deseq2, and the counts cluster far apart between the treated of the two conditions, but not for the control of the two conditions--is the interaction term going to tax us for that? I'm thinking something is wrong with the way we are writing the script. Did you see anything wrong with the scripts we wrote above for a single time-point comparison?

Thanks again, James

ADD REPLY • link 7.4 years ago jmsyl.hong • 0

0

Entering edit mode

"Did you see anything wrong with the scripts we wrote above"

It seems correct, but try using the LRT code I recommended above for each day, testing on the interaction term by removing it from the design. (I assume you are still using the contrast code above when you say that you didn't find significant genes with DESeq2.)

No, the interaction won't tax you for that pattern, that is exactly what the interaction model is designed to pick up on.

ADD REPLY • link 7.4 years ago Michael Love 43k