Hi all,
I'm having trouble interpreting the output of topTable() for a gene level copy number analysis among four different groups of cell lines. The problem is as follows:
After creating my design matrix and contrast matrix using what I think is group means parameterization, I use eBayes to create the model.
##Differential CNA Analysis (Pairwise Comparison) design_CNA <- model.matrix(~0+factor(cell_line_cluster_4)) colnames(design_CNA) <- c("A","B","C","D") fit_CNA <- lmFit(copy_number_data,design_CNA) contrast_matrix_CNA <- makeContrasts("B-A","C-A","C-B","D-A", "D-B", "D-C", levels = design_CNA) fit_CNA <- contrasts.fit(fit_CNA, contrast_matrix_CNA) fit_CNA <- eBayes(fit_CNA)
However, when I use topTable() to determine which cell line group comparisons have significantly different gene copy numbers, I get no significant calls with the following line of code (not specifying a coefficient):
DECNA <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH")
Yet, when I use topTable() with each comparison called specifically by its coefficient, some of these comparisons call significant genes.
DECNA_1 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 1) DECNA_2 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 2) DECNA_3 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 3) DECNA_4 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 4) DECNA_5 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 5) DECNA_6 <- topTable(fit_CNA, number = Inf, p.value = adjPvalueCutoff, adjust = "BH", coef = 6)
I think the discrepancy may be related to the significance testing adjustment, in that calling topTable() on the entire model without specifying a comparison has to correct for many more hypotheses than for a specific comparison (i.e. B vs A). However, I'm not 100% confident that's correct and I'm concerned there's a programming and/or conceptual error here. But even if I'm right, which method is "better" for determining genes with significantly different copy numbers in each group?
Thanks in advance for any assistance or insight!
Best Regards,
Kevin
To elaborate further on Steve's answer, if you don't specify
coef
, then all the (non-intercept) coefficients will be dropped simultaneously by default intopTable
, thus performing an ANOVA-like test. The null hypothesis for such a test is that all the individual nulls are true, i.e., A=B=C=D in your cell-means design. This is inherently different to the specific pairwise comparisons - which, by definition, only test between two groups - which is why you get different results.In your case, the ANOVA-like test needs to account for the extra variability introduced by considering many groups, which is why it has less power to detect DE between two specific groups (compared to a direct pairwise comparison). However, that's not to say that pairwise comparisons are more powerful. If there were small differences across all groups, the ANOVA-like test would be able to combine information from many groups to reject the null, whereas a pairwise comparison between any two groups would not.
As a general rule, I tend to observe more rejections in my DE analyses when I use an ANOVA-like test, because it can effectively combine evidence for DE across multiple groups. But I wouldn't be surprised or concerned at seeing fewer rejections, for the reasons stated above; especially when you have low numbers of DE in your experiment.