Hi all,
We are using DESeq2 for differential gene expression analysis of RNA-seq data. We have three cell lines, in each we have performed shRNA-mediated knock-down of a transcription factor so we have control and knock-down for each cell line in triplicate. We would like to find the gene expression changes due to knock down that are common across the three cell lines.
> dds <- DESeqDataSetFromMatrix(countData = counts, colData = sampleData, design = ~ cell.line + shRNA) > dds <- DESeq(dds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing > colData(dds) DataFrame with 18 rows and 3 columns cell.line shRNA sizeFactor <factor> <factor> <numeric> 1 A scr 0.8885007 2 A scr 2.0095038 3 A scr 1.9165557 4 A kd 0.9561916 5 A kd 1.4908854 ... ... ... ... 14 C scr 0.9472558 15 C scr 0.9923416 16 C kd 0.4343480 17 C kd 0.9292461 18 C kd 0.3899622 > design(dds) ~cell.line + shRNA > resultsNames(dds) [1] "Intercept" "cell.line_B_vs_A" "cell.line_C_vs_A" "shRNA_kd_vs_scr" > results(dds) log2 fold change (MLE): shRNA kd vs scr Wald test p-value: shRNA kd vs scr DataFrame with 27557 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000000003 1.217422 0.00484854 1.1959788 0.004054035 0.99676536 NA > res <- results(dds) > res <- subset(res, padj < 0.05) > res <- res[order(res$log2FoldChange),] > res.dn <- head(rownames(res)) # plotting normalized counts for the genes from res.dn we get:
My understanding was that the design (~ cell.line + shRNA) would give me the overall effect of the shRNA taking into account differences between the cell lines. But when the counts are plotted it is clear that for many of the genes with padj < 0.05 the difference in expression is dominated by one cell line with low expression and/or little/no difference in expression for the others. How can we find the genes that respond in all three of the cell lines?
We can of course run a separate comparison for each cell line and then look for the common changes manually but I'd much prefer to incorporate it into the design from the beginning. Any advice on how to change the design or call results greatly appreciated.
Thanks so much for the speedy response! I'll work through the method you suggest and see where it gets me. I'm not a statistician - is there a logical way to determine the padj cut-off to apply when using this approach? Can one use a less stringent cut off given that it's applied for three separate comparisons in this case?
I wouldn't use a less stringent cutoff just because it's a three-way intersection. And I should say there isn't an unified FDR interpretation for the final set. It's just a post-hoc set, based of off three FDR bounded sets.