Question

Conduct likelihood ratio tests for a variable that isn't explicitly defined by matrix.model

0

Entering edit mode

blumroy • 0

@blumroy-19867

Last seen 6.2 years ago

Hi, I am following an example presented in: https://bioinformatics-core-shared-training.github.io/RNAseq-R/rna-seq-de.nb.html I will present first two models that are based on the same model.matrix, and then step into my question

# The analysis uses the following sampleinfo chart:
> sampleinfo
                            FileName SampleName CellType   Status
1  MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1    MCL1.DG    basal   virgin
2  MCL1.DH_BC2CTUACXX_CAGATC_L002_R1    MCL1.DH    basal   virgin
3  MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1    MCL1.DI    basal pregnant
4  MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1    MCL1.DJ    basal pregnant
5  MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1    MCL1.DK    basal  lactate
6  MCL1.DL_BC2CTUACXX_ATCACG_L002_R1    MCL1.DL    basal  lactate
7  MCL1.LA_BC2CTUACXX_GATCAG_L001_R1    MCL1.LA  luminal   virgin
8  MCL1.LB_BC2CTUACXX_TGACCA_L001_R1    MCL1.LB  luminal   virgin
9  MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1    MCL1.LC  luminal pregnant
10 MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1    MCL1.LD  luminal pregnant
11 MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1    MCL1.LE  luminal  lactate
12 MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1    MCL1.LF  luminal  lactate

# And groups were defined with:
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group

> group
 [1] "basal.virgin"     "basal.virgin"     "basal.pregnant"   "basal.pregnant"   "basal.lactate"    "basal.lactate"    "luminal.virgin" 
 [8] "luminal.virgin"   "luminal.pregnant" "luminal.pregnant" "luminal.lactate"  "luminal.lactate"

Accordingly, there are two variables: 'status' (contains: 'basal', 'luminal') and 'cell type' (contains: 'virgin', 'pregnant', 'lactate').

In order to determine differences between the two cell types we create a model with only main effects, that is no interaction. The main assumption here is that the effect of the status is the same in both type of cells.

# We create the two variables
group <- as.character(group)
type <- sapply(strsplit(group, ".", fixed=T), function(x) x[1])
status <- sapply(strsplit(group, ".", fixed=T), function(x) x[2])
class(type)
class(status)

# And specify a design matrix with an intercept term
design <- model.matrix(~ type + status)
design

> design
   (Intercept) typeluminal statuspregnant statusvirgin
1            1           0              0            1
2            1           0              0            1
3            1           0              1            0
4            1           0              1            0
5            1           0              0            0
6            1           0              0            0
7            1           1              0            1
8            1           1              0            1
9            1           1              1            0
10           1           1              1            0
11           1           1              0            0
12           1           1              0            0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"

attr(,"contrasts")$status
[1] "contr.treatment"

# We then estimating the dispersion
dgeObj <- estimateCommonDisp(dgeObj)

# Then, we estimate gene-wise dispersion estimates, allowing a possible trend with average count size:
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)

# Then, we test for differential expression. We fit genewise glms and create a linear model:
fit <- glmFit(dgeObj, design)

> colnames(fit)
[1] "(Intercept)"    "typeluminal"    "statuspregnant" "statusvirgin"

We note that the as per our model.matrix - the generated model presents only the (k-1) variables whose levels vary fastest; Thus variable 'lactate' is not being explicitly represented.

# To conduct likelihood ratio test for 'luminal' vs 'basal' ("BvsL") and show the top genes we do:

lrt.BvsL <- glmLRT(fit, coef=2)
topTags(lrt.BvsL)

Coefficient:  typeluminal
           logFC    logCPM       LR       PValue          FDR
110308 -8.940579 10.264297 24.89789 6.044844e-07 0.0004377961
50916  -8.636503  5.749781 24.80037 6.358512e-07 0.0004377961
12293  -8.362247  6.794788 24.68526 6.749827e-07 0.0004377961
56069  -8.419433  6.124377 24.41532 7.764861e-07 0.0004377961
24117  -9.290691  6.757163 24.32506 8.137331e-07 0.0004377961
12818  -8.216790  8.172247 24.24233 8.494462e-07 0.0004377961
22061  -8.034712  7.255370 24.16987 8.820135e-07 0.0004377961
12797  -9.001419  9.910795 24.12854 9.011487e-07 0.0004377961
50706  -7.697022 10.809629 24.06926 9.293193e-07 0.0004377961
237979 -8.167451  5.215921 24.03528 9.458678e-07 0.0004377961

Now, in case we want to find differentially expressed genes between 'pregnant' and 'virgin' (i.e. "PvsV"), we don’t have a parameter that explicitly will allow us to test that hypothesis so we need to build a contrast:

PvsV <- makeContrasts(statuspregnant-statusvirgin, levels=design)

lrt.pPsV <- glmLRT(fit, contrast=PvsV)
topTags(lrt.PvsV)

> topTags(lrt.PvsV)
Coefficient:  1*statuspregnant -1*statusvirgin
           logFC    logCPM       LR       PValue FDR
232016 -4.593938  6.389781 12.40835 0.0004274195   1
236643 -6.608868 -1.481424 12.34832 0.0004413842   1
19283  -4.502691  4.318145 10.72832 0.0010550833   1
12993   7.359664 15.507688 10.65284 0.0010990179   1
15903  -3.746794  5.065876 10.53470 0.0011715416   1
544963 -2.988110  4.975414 10.37943 0.0012742723   1
17755  -2.649741  6.118883 10.36888 0.0012815732   1
16687  -5.696048  4.958824 10.12211 0.0014650148   1
381217  3.213414  4.748535 10.08264 0.0014967284   1
14663   7.581478 14.480767 10.01154 0.0015556252   1

My question is: Given the analysis design presented above and its rational to include an 'Intercept' term - what should be the proper way to conduct likelihood ratio tests for 'virgin' vs 'lactate' and show the top genes ? Can we still use the current model.matrix and refer to variable 'lactate' although it is not explicitly defined by this model.matix?

edger • 1.0k views

ADD COMMENT • link updated 6.2 years ago by Gordon Smyth 52k • written 6.2 years ago by blumroy • 0

score 3 · Accepted Answer · 2019-02-13

It's much easier if you just compute the means for each group and then do the comparisons directly. For whatever reason the example you link to (and you) both do that sort of thing, but then inexplicably don't use it.

> group <- paste(rep(c("basal","luminal"), each = 6), rep(c("virgin","pregnant","lactate"), each = 2, times = 3), sep = ".")

> design <- model.matrix(~0 + group)
> design
   groupbasal.lactate groupbasal.pregnant groupbasal.virgin
1                   0                   0                 1
2                   0                   0                 1
3                   0                   1                 0
4                   0                   1                 0
5                   1                   0                 0
6                   1                   0                 0
7                   0                   0                 0
8                   0                   0                 0
9                   0                   0                 0
10                  0                   0                 0
11                  0                   0                 0
12                  0                   0                 0
13                  0                   0                 1
14                  0                   0                 1
15                  0                   1                 0
16                  0                   1                 0
17                  1                   0                 0
18                  1                   0                 0
   groupluminal.lactate groupluminal.pregnant groupluminal.virgin
1                     0                     0                   0
2                     0                     0                   0
3                     0                     0                   0
4                     0                     0                   0
5                     0                     0                   0
6                     0                     0                   0
7                     0                     0                   1
8                     0                     0                   1
9                     0                     1                   0
10                    0                     1                   0
11                    1                     0                   0
12                    1                     0                   0
13                    0                     0                   0
14                    0                     0                   0
15                    0                     0                   0
16                    0                     0                   0
17                    0                     0                   0
18                    0                     0                   0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

>

Now each coefficient is the mean of each group, and you can make any comparison you might like. For example, virgin vs lactate would be

fit <- glmQLFit(dgeObj, design)
tt <- glmQLFTest(fit, contrast = c(-1,0,1,-1,0,1)/2)
topTags(tt)

Where we simply ask for evidence that the mean of the virgin groups is different from the mean of the lactate groups.

score 2 · Accepted Answer · 2019-02-13

The short answer is that the 'virgin' vs 'lactate" comparison is already in your model. The coefficient called statusvirgin represents 'virgin' vs 'lactate'. To test that comparison you simply use:

glmLRT(fit, coef=4)

or

glmLRT(fit, coef="statusvirgin")

However the model formula you are using is too restrictive (i.e., wrong) in that it assumes that differential expression between pregnancy stages are the same for both luminal and basal cells. You can follow James' suggestion or else follow the correct and complete analysis of the same data provided here:

https://bioconductor.org/packages/RnaSeqGeneEdgeRQL

Just click on the "html" link to see the workflow.