DESeq2 design multi factor
2
0
Entering edit mode
Laia ▴ 10
@239caad3
Last seen 14 months ago
Belgium

Hi Michael,

I am having some trouble fitting my model in DEseq2. In my design, I am missing one condition (A-5); see below:

stiffness   shear
A   1
A   1
A   1
A   2
A   2
A   2
A   2
A   3
A   3
A   4
A   4
B   1
B   1
B   1
B   1
B   2
B   2
B   2
B   3
B   3
B   3
B   4
B   4
B   4
B   5
B   5
B   5
C   1
C   1
C   1
C   2
C   2
C   2
C   2
C   3
C   3
C   3
C   4
C   4
C   4
C   5
C   5
C   5

I tried to solve this by adding all-zero columns to my data_matrix and sample files (adding "fake-samples" to both). Only then I was able to fit the model ~design shear + stiffness + shear:stiffness:

dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = colData,
                              design = ~ shear + stiffness + shear:stiffness)

However, then I need to remove these all-0 columns to be able to use the DESeq function. I followed the script in one previous reply [DESeq2 paired and multi factor comparison]

But I can't make it work, and I get an error in the very first step:

#First I create this matrix and I relavel to my reference
mm1 <- model.matrix(~ shear + stiffness + shear:stiffness, colData)
dds$shear <- relevel(dds$shear, ref='1')

#Then I create this object of zeros as you do in your reply script (I think it takes all 0 columns and create a new matrix with them to then substract it?):

idx <- which(colSums(mm1 == 0) == nrow(mm1))
mm1 <- mm1[,-idx]              # removing the columns with all 0's
mm0 <- model.matrix(~ shear + stiffness + shear:stiffness, colData)               # this matrix is necessary to run nbinomLRT
design(dds) <- ~ shear + stiffness + shear:stiffness
dds <- estimateSizeFactors(dds)

Here I already get: Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means

And I see that after applying

mm1 <- mm1[,-idx]      

My mm1 that initially was

 num [1:46 , 1:15]

it becomes

 num [1:46 , 0]

Can you help me understand what am I doing wrong? Thank you.

Laia

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Catalan_Spain.utf8  LC_CTYPE=Catalan_Spain.utf8   
[3] LC_MONETARY=Catalan_Spain.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Catalan_Spain.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BiocParallel_1.30.3         vsn_3.64.0                 
 [3] gplots_3.1.3                RColorBrewer_1.1-3         
 [5] pheatmap_1.0.12             apeglm_1.18.0              
 [7] ggrepel_0.9.1               writexl_1.4.0              
 [9] airway_1.16.0               forcats_0.5.2              
[11] stringr_1.4.1               dplyr_1.0.10               
[13] purrr_0.3.4                 readr_2.1.2                
[15] tidyr_1.2.1                 tibble_3.1.8               
[17] ggplot2_3.3.6               tidyverse_1.3.2            
[19] DESeq2_1.36.0               SummarizedExperiment_1.26.1
[21] Biobase_2.56.0              MatrixGenerics_1.8.1       
[23] matrixStats_0.62.0          GenomicRanges_1.48.0       
[25] GenomeInfoDb_1.32.4         IRanges_2.30.1             
[27] S4Vectors_0.34.0            BiocGenerics_0.42.0        

loaded via a namespace (and not attached):
  [1] readxl_1.4.1           backports_1.4.1        plyr_1.8.7            
  [4] splines_4.2.1          usethis_2.1.6          digest_0.6.29         
  [7] htmltools_0.5.3        fansi_1.0.3            magrittr_2.0.3        
 [10] memoise_2.0.1          googlesheets4_1.0.1    limma_3.52.4          
 [13] tzdb_0.3.0             remotes_2.4.2          Biostrings_2.64.1     
 [16] annotate_1.74.0        modelr_0.1.9           bdsmatrix_1.3-6       
 [19] prettyunits_1.1.1      colorspace_2.0-3       blob_1.2.3            
 [22] rvest_1.0.3            haven_2.5.1            callr_3.7.2           
 [25] crayon_1.5.2           RCurl_1.98-1.8         jsonlite_1.8.0        
 [28] genefilter_1.78.0      survival_3.3-1         glue_1.6.2            
 [31] gtable_0.3.1           gargle_1.2.1           zlibbioc_1.42.0       
 [34] XVector_0.36.0         DelayedArray_0.22.0    pkgbuild_1.3.1        
 [37] scales_1.2.1           mvtnorm_1.1-3          DBI_1.1.3             
 [40] miniUI_0.1.1.1         Rcpp_1.0.9             xtable_1.8-4          
 [43] emdbook_1.3.12         bit_4.0.4              preprocessCore_1.58.0 
 [46] profvis_0.3.7          htmlwidgets_1.5.4      httr_1.4.4            
 [49] ellipsis_0.3.2         urlchecker_1.0.1       pkgconfig_2.0.3       
 [52] XML_3.99-0.10          dbplyr_2.2.1           locfit_1.5-9.6        
 [55] utf8_1.2.2             tidyselect_1.2.0       rlang_1.0.6           
 [58] later_1.3.0            AnnotationDbi_1.58.0   munsell_0.5.0         
 [61] cellranger_1.1.0       tools_4.2.1            cachem_1.0.6          
 [64] cli_3.4.1              generics_0.1.3         RSQLite_2.2.17        
 [67] devtools_2.4.5         broom_1.0.1            fastmap_1.1.0         
 [70] processx_3.7.0         bit64_4.0.5            fs_1.5.2              
 [73] caTools_1.18.2         KEGGREST_1.36.3        mime_0.12             
 [76] xml2_1.3.3             compiler_4.2.1         rstudioapi_0.14       
 [79] png_0.1-7              affyio_1.66.0          reprex_2.0.2          
 [82] geneplotter_1.74.0     stringi_1.7.8          ps_1.7.1              
 [85] lattice_0.20-45        Matrix_1.4-1           vctrs_0.4.1           
 [88] pillar_1.8.1           lifecycle_1.0.3        BiocManager_1.30.18   
 [91] bitops_1.0-7           httpuv_1.6.6           R6_2.5.1              
 [94] affy_1.74.0            promises_1.2.0.1       KernSmooth_2.23-20    
 [97] sessioninfo_1.2.2      codetools_0.2-18       MASS_7.3-57           
[100] gtools_3.9.3           assertthat_0.2.1       pkgload_1.3.0         
[103] withr_2.5.0            GenomeInfoDbData_1.2.8 parallel_4.2.1        
[106] hms_1.1.2              grid_4.2.1             coda_0.19-4           
[109] googledrive_2.0.0      bbmle_1.0.25           numDeriv_2016.8-1.1   
[112] shiny_1.7.2            lubridate_1.8.0 
DESeq2 design 0substract nofullrank • 2.8k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 5 days ago
United States

I tried to solve this by adding all-zero columns to my data_matrix and sample files (adding "fake-samples" to both).

Don't do this. Go back to the original data. Adding the zeros is what causes the error "every gene contains at least one zero...". I can't think of any Bioconductor packages that want you to add pseudo-data or pseudo-samples.

Secondly, mm0 is not defined correctly. It is the same as mm1 essentially (but without taking care of the 0 columns). You may need to chat with a statistician about the role of mm1 and mm0. mm0 should have the columns of mm1 removed that you want to test. And don't forget to provide mm1 and mm0 to DESeq():

dds <- DESeq(dds, test="LRT", full=mm1, reduced=mm0)
ADD COMMENT
0
Entering edit mode

Thank you Michael,

I was trying the strategy with "fake samples" because I thought they would be ignored aftermath, but I stopped that. Yesterday I could manage to give the full matrix (including the incomplete condition) to DEseq2 by just defining it manually, and now LRT is finally working on all my samples. I am using LRT as follows:

LRT_int <- DESeq(DEmat, test="LRT", full = ~ Bch + Fl + St + Fl:St, reduced = ~ Bch + Fl + St)

Here, I included the batch (Bch) effect in the model (because due to technical issues the RNAseq samples ended up being split into different batches..), and this brought me to a new question:

When I want to see the interaction effect, I should only remove from the equation the interaction (Fl:St) term, is that right? So the batch has to stay in the reduced model, otherwise would be also tested together with the interaction term.

Also, when I want to see the effect of St alone, I should remove only St, because if I would remove both St and Fl:St, then I would be testing for both terms, right?

Thank you once again.

ADD REPLY
0
Entering edit mode

Oh, I just realised of this warning message after running DEseq(LRT) with my manually-entered-factors matrix:

the design formula contains one or more numeric variables with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function the design formula contains one or more numeric variables that have mean or standard deviation larger than 5 (an arbitrary threshold to trigger this message). Including numeric variables with large mean can induce collinearity with the intercept. Users should center and scale numeric variables in the design to improve GLM convergence.

This is because I made all my factor levels numerical.. If I keep them non-numerical, then I get the full rank error:

the model matrix is not full rank, so the model cannot be fit as specified. Levels or combinations of levels without any samples have resulted in column(s) of zeros in the model matrix.

So I guess that the results from LRT_int may not be correct?

If using the manually entered dataset is not working properly, then I think that another option would be to remove the incomplete condition from my data, run the LRT tests on the new data, and afterwards do Wald contrasts on the excluded condition...

Is there a better idea?

ADD REPLY
0
Entering edit mode

Can we go back to my suggestion where you form mm0 properly? In the first post, mm0 was just mm1 but without taking care of the 0 columns.

mm0 should be a matrix that represents the null (removing variables of interest in testing and leaving only variables that you want to control for).

ADD REPLY
0
Entering edit mode

Hi Michael,

Thank you for your message. I tried now to go back to the mm1/mm0 as you advised, and I think it worked :)

I started with the dds without the interaction because I cannot run the function with the missing condition, otherwise:

dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = colData, design = ~ shear + stiffness)

mm1 <- model.matrix(~ shear + stiffness + shear:stiffness, colData)         # here I create the mm1 including the interaction
dds$shear <- factor(dds$shear, levels = c("STATIC", "LOW", "MOD", "HIGH1", "HIGH2"))
dds$stiffness <- factor(dds$stiffness, levels = c("A", "B", "C"))

idx <- which(colSums(mm1 == 0) == nrow(mm1))
mm1 <- mm1[,-idx]              
mm0 <- model.matrix(~ shear + stiffness, colData)           # here I excluded the interaction (to be tested)

design(dds) <- ~ shear + stiffness + shear:stiffness         # I give a new design (complete this time) to dds

Here I am following your script, but I don't get why are we doing these steps manually. This is what DEseq would do automatically right?

dds <- estimateSizeFactors(dds)                    
dds <- estimateDispersionsGeneEst(dds, modelMatrix=mm1)
dds <- estimateDispersionsFit(dds)
dds <- estimateDispersionsMAP(dds, modelMatrix=mm1)

And here is where LRT is called:

dds <- nbinomLRT(dds, full=mm1, reduced=mm0)

keep <- rowSums(counts(dds)) >= 10     # I'd do this here to remove low counts
dds <- dds[keep,]
dds

vsd <- varianceStabilizingTransformation(dds)
plotPCA(vsd, intgroup= c("shear", "stiffness"))     # I do this to visualize my data and I can check that is giving what I expected 

res <- results(dds)  
summary(res)
df <- res %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble() %>% arrange(padj)

I think this worked this time :) Thank you!

Laia

ADD REPLY
0
Entering edit mode

You don’t need to call those functions manually, just DESeq(…, test=“LRT”, …) should be enough.

I would filter at the top, not bottom of the script.

ADD REPLY
0
Entering edit mode

Thank you Michael!

Laia

ADD REPLY
0
Entering edit mode
Laia ▴ 10
@239caad3
Last seen 14 months ago
Belgium

Hi Michael,

I am coming back to this post because I just realized that my reference condition might not be correct in mm1, and then the resulting UP and DOWN-reg genes are not telling what I am looking for.

I thought I was controlling this in this step:

mm1 <- model.matrix(~ shear + stiffness + shear:stiffness, colData)         # here I create the mm1 including the interaction
dds$shear <- factor(dds$shear, levels = c("STATIC", "LOW", "MOD", "HIGH1", "HIGH2"))
dds$stiffness <- factor(dds$stiffness, levels = c("A", "B", "C")) 

By putting the levels in order so that STATIC and A become the reference condition for each factor (shear and stiffness, respectively).

But when I check the mm1 intercept, this is still taken by alphabetic order (no matter in what order I do the dds$shear or stiffness, with respect to mm1 creation). I get that these functions will affect dds, and not mm1. But how can I do it to change the intercept to be STATIC and A?

Thank you.

Laia

ADD COMMENT
0
Entering edit mode

Hi Michael,

I have tried something that looks like it worked, but I am not sure if it is correct, could you please double-check for me?

First I create the dds object:

dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = colData, design = ~ shear + stiffness)

Then I relevel: I use "dds$shear" instead of "shear" (same for stiffness) as the object name, because otherwise it does nothing on the mm1 afterwards..

dds$shear <- relevel(dds$shear,"STATIC")
dds$stiffness <- relevel(dds$stiffness,"A")

I create the mm1 by using the "dds$shear" and "dds$stiffness", instead of just "shear" and "stiffness" (that was not changing the levels used as the intercept):

 mm1 <- model.matrix(~ dds$shear + dds$stiffness + dds$shear:dds$stiffness, colData)

Here I eliminate the 0 columns as in previous posts:

idx <- which(colSums(mm1 == 0) == nrow(mm1))
mm1 <- mm1[,-idx]  

But here, I am not using the dds$ anymore but it works anyways, or at least I am able to get results afterwards that include the interaction, and they differ from the ones obtained when the reference was not correct:

design(dds) <- ~ shear + stiffness + shear:stiffness

I think I don't need this two lines below anymore, unless I want to see the levels in this specific order in plots:

#dds$shear <- factor(dds$shear, levels = c("STATIC", "LOW", "MOD", "HIGH1", "HIGH2"))
#dds$stiffness <- factor(dds$stiffness, levels = c("A", "B", "C"))

To generate the reduced designs, I use "dds$shear" and "dds$stiffness" because I think it makes sense, since I will be comparing this with mm1:

mm0 <- model.matrix(~ dds$shear + dds$stiffness, colData)  # INTERACTION to be tested
mm02 <- model.matrix(~ dds$stiffness, colData)             # SHEAR to be tested
mm03 <- model.matrix(~ dds$shear, colData)                 # STIFFNESS to be tested

And I filter counts:

keep <- rowSums(counts(dds) >= 10) >= 5
dds <- dds[keep,]
dds

Finally, I apply DEseq2 using LRT test per each reduced matrix

dds1 <- DESeq(dds, test="LRT", full = mm1, reduced = mm0) # Testing INTERACTION
dds2 <- DESeq(dds, test="LRT", full = mm1, reduced = mm02) # Testing SHEAR
dds3 <- DESeq(dds, test="LRT", full = mm1, reduced = mm03) # Testing STIFFNESS

Is the above code correct? Should I be using "dds$s..." also in the re-design step?

Thank you.

Laia

ADD REPLY
0
Entering edit mode

Yes, I agree for these.

ADD REPLY
0
Entering edit mode

Let me answer this first post: you need to make the changes to the variables _before_ you make the matrix. The matrix won't update after it is made.

ADD REPLY
0
Entering edit mode

Thank you for your replies, Michael.

Why is it that at this stage:

design(dds) <- ~ shear + stiffness + shear:stiffness

I can't use the "dds$" objects anymore? Is it because I am modifying dds and then I can't call it while it is being modified?

Can I be sure that then

dds$shear <- relevel(dds$shear,"STATIC")
dds$stiffness <- relevel(dds$stiffness,"A")

is still being taken into account?

I ask it because I am trying to validate my results with some gene for which I expect an up-regulation at shear conditions to static, but then the log2FC I get is negative. When I plot the counts of the gene along the conditions, they look like this:

I have also tried to get the results excluding the LOW conditions, but the log2FC is still negative.

Why do I get a negative log2FC if counts in shear coditions (other than in LOW) are comparable or higher than in the reference (Static red)?

Thank you very much.

Laia

ADD REPLY
0
Entering edit mode

I think our lines got crossed, I was referring to your first post where you did model.matrix, then relevel. model.matrix is the one that isn't mutable.

Re: the sign of the LFC, take a look at the columns of the model matrix. It will be named according to the numerator of the LFC.

ADD REPLY

Login before adding your answer.

Traffic: 669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6