Question

How to set model formulae in DEXSeq when 2 shRNAs were used to target a same gene

0

Entering edit mode

zliu ▴ 10

@zliu-23226

Last seen 4.1 years ago

I am learning to use DEXSeq to analyze RNA-seq data after knocking down a splicing factor. I have 3 samples from the shCTRL, and 6 samples from 2 different shRNAs targeting the same splicing factor: shRNA1, shRNA4 (3 replicates for each shRNA). Two shRNAs targeting the same splicing factor were used because I wanted to eliminate the off-target effect of shRNAs. I am not sure if I should put all the 9 samples together to do the analysis or I should compare shRNA1 with shCTRL and shRNA4 with shCTRL separately. The separate comparisons (shRNA1 vs. shCTRL, shRNA4 vs. shCTRL) ran well. The sample tables are like the following.

sampleAnnotation(dxd1)

DataFrame with 6 rows and 4 columns
sample  condition    libType        sizeFactor
<factor>   <factor>   <factor>         <numeric>
1    CTRL_1    control paired-end  1.14881507491493
2    CTRL_2    control paired-end  1.12577220439625
3    CTRL_3    control paired-end  1.11761713391514
4 Treatment1_1 knockdown1 paired-end 0.839327417001361
5 Treatment1_2 knockdown1 paired-end 0.992697092731002
6 Treatment1_3 knockdown1 paired-end 0.907206458911359

sampleAnnotation(dxd4)

DataFrame with 6 rows and 4 columns
 sample  condition    libType        sizeFactor
 <factor>   <factor>   <factor>         <numeric>
1    CTRL_1    control paired-end  1.04077845798234
2    CTRL_2    control paired-end  1.01594922129081
3    CTRL_3    control paired-end  1.00979043835092
4 Treatment4_1 knockdown4 paired-end 0.930994837364004
5 Treatment4_2 knockdown4 paired-end  1.03814641210704
6 Treatment4_3 knockdown4 paired-end  1.04888449050392

I noticed that the size factors are different for CTRL1, CTRL2, and CTRL_3 in the 2 analyses. Is this right?

Then I put all the 9 samples together and set the sample table like this:

sampleTable2

         condition  shRNA
CTRL_1      control shCTRL
CTRL_2      control shCTRL
CTRL_3      control shCTRL
Treatment1_1 knockdown shRNA1
Treatment1_2 knockdown shRNA1
Treatment1_3 knockdown shRNA1
Treatment4_1 knockdown shRNA4
Treatment4_2 knockdown shRNA4
Treatment4_3 knockdown shRNA4

I ran the following commands

dxd = DEXSeqDataSetFromHTSeq(countfiles = countFiles, sampleData = sampleTable2, design = ~ sample + exon + condition:exon, flattenedfile = flattenedFile)

formulaFullModel = ~ sample + exon + shRNA:exon + condition:exon

formulaReducedModel = ~ sample + exon + shRNA:exon

dxd = estimateSizeFactors(dxd)

dxd = estimateDispersions(dxd, formula = formulaFullModel)

dxd = testForDEU(dxd, reducedModel = formulaReducedModel, fullModel = formulaFullModel)

I had the following error: Error in nbinomLRT(x, reduced = reducedModelMatrix, full = fullModelMatrix) : less than one degree of freedom, perhaps full and reduced models are not in the correct order

Can anyone tell what is wrong with the setting of formulaFullModel and formulaReducedModel. And statistically, which way is better for the analysis: analyze separately and overlap the splice events or put all the nine samples together and analyze them?

Thanks!

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] DEXSeq_1.32.0               RColorBrewer_1.1-2         
[3] AnnotationDbi_1.48.0        DESeq2_1.26.0              
[5] SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
[7] matrixStats_0.56.0          GenomicRanges_1.38.0       
[9] GenomeInfoDb_1.22.0         IRanges_2.20.2             
[11] S4Vectors_0.24.3            Biobase_2.46.0             
[13] BiocGenerics_0.32.0         BiocParallel_1.20.1        

loaded via a namespace (and not attached):
[1] bitops_1.0-6           bit64_0.9-7            progress_1.2.2        
[4] httr_1.4.1             tools_3.6.3            backports_1.1.5       
[7] R6_2.4.1               rpart_4.1-15           Hmisc_4.3-1              
[10] DBI_1.1.0              colorspace_1.4-1       nnet_7.3-13           
[13] tidyselect_1.0.0       gridExtra_2.3          prettyunits_1.1.1     
[16] curl_4.3               bit_1.1-15.2           compiler_3.6.3        
[19] htmlTable_1.13.3       scales_1.1.0           checkmate_2.0.0       
[22] genefilter_1.68.0      askpass_1.1            rappdirs_0.3.1        
[25] Rsamtools_2.2.3        stringr_1.4.0          digest_0.6.25         
[28] foreign_0.8-76         rmarkdown_2.1          XVector_0.26.0        
[31] base64enc_0.1-3        jpeg_0.1-8.1           pkgconfig_2.0.3       
[34] htmltools_0.4.0        dbplyr_1.4.2           htmlwidgets_1.5.1     
[37] rlang_0.4.5            rstudioapi_0.11        RSQLite_2.2.0         
[40] hwriter_1.3.2          acepack_1.4.1          dplyr_0.8.5           
[43] RCurl_1.98-1.1         magrittr_1.5           GenomeInfoDbData_1.2.2
[46] Formula_1.2-3          Matrix_1.2-18          Rcpp_1.0.4            
[49] munsell_0.5.0          lifecycle_0.2.0        stringi_1.4.6         
[52] yaml_2.2.1             zlibbioc_1.32.0        BiocFileCache_1.10.2  
[55] grid_3.6.3             blob_1.2.1             crayon_1.3.4           
[58] lattice_0.20-40        Biostrings_2.54.0      splines_3.6.3         
[61] annotate_1.64.0        hms_0.5.3              locfit_1.5-9.1        
[64] knitr_1.28             pillar_1.4.3           geneplotter_1.64.0    
[67] biomaRt_2.42.0         XML_3.99-0.3           glue_1.3.2            
[70] evaluate_0.14          latticeExtra_0.6-29    data.table_1.12.8      
[73] png_0.1-7              vctrs_0.2.4            gtable_0.3.0          
[76] openssl_1.4.1          purrr_0.3.3            assertthat_0.2.1      
[79] ggplot2_3.3.0          xfun_0.12              xtable_1.8-4          
[82] survival_3.1-11        tibble_2.1.3           memoise_1.1.0         
[85] cluster_2.1.0          statmod_1.4.34

DEXSeq • 775 views

ADD COMMENT • link updated 4.7 years ago by Alejandro Reyes ★ 1.9k • written 4.7 years ago by zliu ▴ 10

score 1 · Answer 1 · 2020-04-02

Hi @zliu,

Thank you for your detailed report. You are getting an error because you are getting dependent terms in the model matrix (the matrix is not full rank). For more information, I'd recommend you to check this section of the DESeq2 vignette, which explains this problem in detail: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#model-matrix-not-full-rank

However, if you run DEXSeq with the following formulae to get those exons that where the different siRNAs have a consistent effect in their inclusion:

formulaFullModel = ~ sample + exon + condition:exon
formulaReducedModel = ~ sample + exon

Both ways of analyzing the data are statisticall valid. The approach of analysing them separately and comparing the results would enable you to evaluate both consistent exon usage differences (by taking the overlap) as well as which exon usage differences are only siRNA-specific (and very likely off target effects). The second approach would only look at the former.

Alejandro