Hi, I am following an example presented in: https://bioinformatics-core-shared-training.github.io/RNAseq-R/rna-seq-de.nb.html I will present first two models that are based on the same model.matrix, and then step into my question
# The analysis uses the following sampleinfo chart:
> sampleinfo
FileName SampleName CellType Status
1 MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1 MCL1.DG basal virgin
2 MCL1.DH_BC2CTUACXX_CAGATC_L002_R1 MCL1.DH basal virgin
3 MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1 MCL1.DI basal pregnant
4 MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1 MCL1.DJ basal pregnant
5 MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1 MCL1.DK basal lactate
6 MCL1.DL_BC2CTUACXX_ATCACG_L002_R1 MCL1.DL basal lactate
7 MCL1.LA_BC2CTUACXX_GATCAG_L001_R1 MCL1.LA luminal virgin
8 MCL1.LB_BC2CTUACXX_TGACCA_L001_R1 MCL1.LB luminal virgin
9 MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1 MCL1.LC luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1 MCL1.LD luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1 MCL1.LE luminal lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1 MCL1.LF luminal lactate
# And groups were defined with:
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group
> group
[1] "basal.virgin" "basal.virgin" "basal.pregnant" "basal.pregnant" "basal.lactate" "basal.lactate" "luminal.virgin"
[8] "luminal.virgin" "luminal.pregnant" "luminal.pregnant" "luminal.lactate" "luminal.lactate"
Accordingly, there are two variables: 'status' (contains: 'basal', 'luminal') and 'cell type' (contains: 'virgin', 'pregnant', 'lactate').
In order to determine differences between the two cell types we create a model with only main effects, that is no interaction. The main assumption here is that the effect of the status is the same in both type of cells.
# We create the two variables
group <- as.character(group)
type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
class(type)
class(status)
# And specify a design matrix with an intercept term
design <- model.matrix(~ type + status)
design
> design
(Intercept) typeluminal statuspregnant statusvirgin
1 1 0 0 1
2 1 0 0 1
3 1 0 1 0
4 1 0 1 0
5 1 0 0 0
6 1 0 0 0
7 1 1 0 1
8 1 1 0 1
9 1 1 1 0
10 1 1 1 0
11 1 1 0 0
12 1 1 0 0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"
attr(,"contrasts")$status
[1] "contr.treatment"
# We then estimating the dispersion
dgeObj <- estimateCommonDisp(dgeObj)
# Then, we estimate gene-wise dispersion estimates, allowing a possible trend with average count size:
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)
# Then, we test for differential expression. We fit genewise glms and create a linear model:
fit <- glmFit(dgeObj, design)
> colnames(fit)
[1] "(Intercept)" "typeluminal" "statuspregnant" "statusvirgin"
We note that the as per our model.matrix - the generated model presents only the (k-1) variables whose levels vary fastest; Thus variable 'lactate' is not being explicitly represented.
# To conduct likelihood ratio test for 'luminal' vs 'basal' ("BvsL") and show the top genes we do:
lrt.BvsL <- glmLRT(fit, coef=2)
topTags(lrt.BvsL)
Coefficient: typeluminal
logFC logCPM LR PValue FDR
110308 -8.940579 10.264297 24.89789 6.044844e-07 0.0004377961
50916 -8.636503 5.749781 24.80037 6.358512e-07 0.0004377961
12293 -8.362247 6.794788 24.68526 6.749827e-07 0.0004377961
56069 -8.419433 6.124377 24.41532 7.764861e-07 0.0004377961
24117 -9.290691 6.757163 24.32506 8.137331e-07 0.0004377961
12818 -8.216790 8.172247 24.24233 8.494462e-07 0.0004377961
22061 -8.034712 7.255370 24.16987 8.820135e-07 0.0004377961
12797 -9.001419 9.910795 24.12854 9.011487e-07 0.0004377961
50706 -7.697022 10.809629 24.06926 9.293193e-07 0.0004377961
237979 -8.167451 5.215921 24.03528 9.458678e-07 0.0004377961
Now, in case we want to find differentially expressed genes between 'pregnant' and 'virgin' (i.e. "PvsV"), we don’t have a parameter that explicitly will allow us to test that hypothesis so we need to build a contrast:
PvsV <- makeContrasts(statuspregnant-statusvirgin, levels=design)
lrt.pPsV <- glmLRT(fit, contrast=PvsV)
topTags(lrt.PvsV)
> topTags(lrt.PvsV)
Coefficient: 1*statuspregnant -1*statusvirgin
logFC logCPM LR PValue FDR
232016 -4.593938 6.389781 12.40835 0.0004274195 1
236643 -6.608868 -1.481424 12.34832 0.0004413842 1
19283 -4.502691 4.318145 10.72832 0.0010550833 1
12993 7.359664 15.507688 10.65284 0.0010990179 1
15903 -3.746794 5.065876 10.53470 0.0011715416 1
544963 -2.988110 4.975414 10.37943 0.0012742723 1
17755 -2.649741 6.118883 10.36888 0.0012815732 1
16687 -5.696048 4.958824 10.12211 0.0014650148 1
381217 3.213414 4.748535 10.08264 0.0014967284 1
14663 7.581478 14.480767 10.01154 0.0015556252 1
My question is: Given the analysis design presented above and its rational to include an 'Intercept' term - what should be the proper way to conduct likelihood ratio tests for 'virgin' vs 'lactate' and show the top genes ? Can we still use the current model.matrix and refer to variable 'lactate' although it is not explicitly defined by this model.matix?
Thanks a lot for both of you guys!