Hi dear community,
I am new to bioinformatics and RNAseq analysis and trying to solve an issue with an RNA dataset. This dataset consists of 3 groups, each representing a disease (D1, D2 and D3) and a control group CTRL. For 2 disease groups (D1 and D2) data for two treatments are available (MTX and Ada) while the D3 group only contains data for the MTX treatment due to lack of samples. The samples were taken before therapy and after therapy. Samples for the control group were only taken once at the start of the study though.
What should be tested:
a) The healthy patients against each D-group (disease) before treatment for each treatment (Healthy vs DX-pre-MTX & Healthy vs DX-pre-Ada)
b) The healthy patients against each D-group after treatment for each treatment (Healthy vs DX-post-MTX & Healthy vs DX-post-Ada)
c) Each D-group before treatment vs the same D-group after treatment (DX-pre-MTX vs DX-post-MTX & DX-pre-Ada vs DX-post-Ada)
Patient TimepointProcessed Sex Disease Treatment TreatmentGroup
1 Healthy 1 l w CTRL None Healthy
2 Healthy 2 l w CTRL None Healthy
3 Healthy 3 s m CTRL None Healthy
4 Healthy 4 l w CTRL None Healthy
5 D1-MTX 1 s w D1 preMTX D1preMTX
6 D1-MTX 2 l m D1 preMTX D1preMTX
7 D1-MTX 3 s w D1 preMTX D1preMTX
8 D1-MTX 1 s w D1 postMTX D1postMTX
9 D1-MTX 2 l m D1 postMTX D1postMTX
10 D1-MTX 3 s w D1 postMTX D1postMTX
11 D1-Ada 1 l w D1 preAda D1preAda
12 D1-Ada 2 l m D1 preAda D1preAda
13 D1-Ada 3 s w D1 preAda D1preAda
14 D1-Ada 4 s w D1 preAda D1preAda
15 D1-Ada 5 s w D1 preAda D1preAda
16 D1-Ada 6 s w D1 preAda D1preAda
17 D1-Ada 1 l w D1 postAda D1postAda
18 D1-Ada 2 s m D1 postAda D1postAda
19 D1-Ada 5 l w D1 postAda D1postAda
20 D1-Ada 6 s w D1 postAda D1postAda
21 D2-MTX 1 s w D2 preMTX D2preMTX
22 D2-MTX 2 s w D2 preMTX D2preMTX
23 D2-MTX 3 s w D2 preMTX D2preMTX
24 D2-MTX 1 l w D2 postMTX D2postMTX
25 D2-MTX 2 s w D2 postMTX D2postMTX
26 D2-MTX 3 l w D2 postMTX D2postMTX
27 D2-Ada 1 s m D2 preAda D2preAda
28 D2-Ada 3 s w D2 preAda D2preAda
29 D2-Ada 4 s m D2 preAda D2preAda
30 D2-Ada 1 s m D2 postAda D2postAda
31 D2-Ada 2 s w D2 postAda D2postAda
32 D2-Ada 3 s w D2 postAda D2postAda
33 D2-Ada 4 s m D2 postAda D2postAda
34 D3-MTX 1 s w D3 preMTX D3preMTX
35 D3-MTX 2 s m D3 preMTX D3preMTX
36 D3-MTX 4 s w D3 preMTX D3preMTX
37 D3-MTX 1 l w D3 postMTX D3postMTX
38 D3-MTX 2 l m D3 postMTX D3postMTX
39 D3-MTX 5 s m D3 postMTX D3postMTX
It was shown that the sex and timepoint at which the samples were processed (s=shortly after collection, l=the next day), show batch effects on the data and need to be taken into account. I wanted to start with the comparison of a) and b). After reading through the vingrette and some examples posted before, I concatenated the Disease and Treatment columns to one to use TreatmentGroup as one factor in the design. Apparently I am missing something because I get the warnings for 'In fitNbinomGLMs ...' and the ' 3 rows did not converge in beta, labelled in mcols(object)$betaConv' notice.
# Create DESeq object
CD4_dds <- DESeqDataSetFromMatrix(countData = CD4_counts, colData = CD4_coldata, design = ~ TimepointProcessed + Sex + TreatmentGroup)
# Assign control group as reference level
CD4_dds$TreatmentGroup <- relevel(CD4_dds$TreatmentGroup, "Healthy")
nrow(CD4_dds)
smallestGroupSize <- 3
keep <- rowSums(counts(CD4_dds) >= 10) >= smallestGroupSize
CD4_dds <- CD4_dds[keep,]
nrow(CD4_dds)
CD4_DESeq <- DESeq(CD4_dds, test='Wald', fitType = 'local')
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
3 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
Warning messages:
1: In fitNbinomGLMs(objectNZ[fitidx, , drop = FALSE], alpha_hat = alpha_hat[fitidx], :
4337rows had non-positive estimates of variance for coefficients
2: In fitNbinomGLMs(objectNZ, betaTol = betaTol, maxit = maxit, useOptim = useOptim, :
4528rows had non-positive estimates of variance for coefficients
If you are wondering why I set the fitType argument to 'local': while running DESeq() I first got the note below along with the warnings, so I adjusted it.
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
I then tried to augment the maxit argument but am still stuck with the same issue.
> CD4_dds <- estimateSizeFactors(CD4_dds)
> CD4_estDisp <- estimateDispersions(CD4_dds, fitType='local')
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Warning message:
In fitNbinomGLMs(objectNZ[fitidx, , drop = FALSE], alpha_hat = alpha_hat[fitidx], :
4337rows had non-positive estimates of variance for coefficients
> CD4_DESeq <- nbinomWaldTest(CD4_estDisp, maxit=1000)
3 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
Warning message:
In fitNbinomGLMs(objectNZ, betaTol = betaTol, maxit = maxit, useOptim = useOptim, :
4515rows had non-positive estimates of variance for coefficients
So my question is what did go wrong and what would be the best approach to analyze the data? Thank you in advance for any reply, I really appreciate it.
sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_3.60.3 EnhancedVolcano_1.22.0 pcaExplorer_2.30.0 sva_3.52.0 BiocParallel_1.38.0
[6] genefilter_1.86.0 mgcv_1.9-1 nlme_3.1-165 PCAtools_2.16.0 pheatmap_1.0.12
[11] vsn_3.72.0 ggrepel_0.9.5 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[16] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[21] ggplot2_3.5.1 tidyverse_2.0.0 DESeq2_1.44.0 SummarizedExperiment_1.34.0 Biobase_2.64.0
[26] MatrixGenerics_1.16.0 matrixStats_1.3.0 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1 IRanges_2.38.1
[31] S4Vectors_0.42.1 BiocGenerics_0.50.0
loaded via a namespace (and not attached):
[1] splines_4.4.1 later_1.3.2 bitops_1.0-7 filelock_1.0.3 preprocessCore_1.66.0 graph_1.82.0
[7] XML_3.99-0.17 lifecycle_1.0.4 httr2_1.0.1 edgeR_4.2.0 topGO_2.56.0 doParallel_1.0.17
[13] lattice_0.22-5 crosstalk_1.2.1 dendextend_1.17.1 magrittr_2.0.3 plotly_4.10.4 rmarkdown_2.27
[19] shinyBS_0.61.1 httpuv_1.6.15 NMF_0.27 cowplot_1.1.3 DBI_1.2.3 RColorBrewer_1.1-3
[25] pkgload_1.4.0 abind_1.4-5 zlibbioc_1.50.0 RCurl_1.98-1.14 rappdirs_0.3.3 seriation_1.5.5
[31] GenomeInfoDbData_1.2.12 irlba_2.3.5.1 AnnotationForge_1.46.0 annotate_1.82.0 dqrng_0.4.1 DelayedMatrixStats_1.26.0
[37] codetools_0.2-19 DelayedArray_0.30.1 DT_0.33 xml2_1.3.6 tidyselect_1.2.1 GOstats_2.70.0
[43] UCSC.utils_1.0.0 ScaledMatrix_1.12.0 viridis_0.6.5 TSP_1.2-4 BiocFileCache_2.12.0 base64enc_0.1-3
[49] webshot_0.5.5 jsonlite_1.8.8 survival_3.7-0 iterators_1.0.14 foreach_1.5.2 tools_4.4.1
[55] progress_1.2.3 Rcpp_1.0.12 glue_1.7.0 gridExtra_2.3 SparseArray_1.4.8 xfun_0.45
[61] ca_0.71.1 shinydashboard_0.7.2 withr_3.0.0 Category_2.70.0 BiocManager_1.30.23 fastmap_1.2.0
[67] fansi_1.0.6 SparseM_1.84 digest_0.6.36 rsvd_1.0.5 timechange_0.3.0 R6_2.5.1
[73] mime_0.12 colorspace_2.1-0 GO.db_3.19.1 biomaRt_2.60.1 RSQLite_2.3.7 threejs_0.3.3
[79] utf8_1.2.4 generics_0.1.3 data.table_1.15.4 prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4
[85] S4Arrays_1.4.1 pkgconfig_2.0.3 gtable_0.3.5 blob_1.2.4 registry_0.5-1 XVector_0.44.0
[91] htmltools_0.5.8.1 RBGL_1.80.0 GSEABase_1.66.0 scales_1.3.0 png_0.1-8 knitr_1.48
[97] rstudioapi_0.16.0 tzdb_0.4.0 reshape2_1.4.4 curl_5.2.1 shinyAce_0.4.2 cachem_1.1.0
[103] parallel_4.4.1 AnnotationDbi_1.66.0 pillar_1.9.0 grid_4.4.1 vctrs_0.6.5 promises_1.3.0
[109] BiocSingular_1.20.0 dbplyr_2.5.0 beachmat_2.20.0 xtable_1.8-4 cluster_2.1.6 Rgraphviz_2.48.0
[115] evaluate_0.24.0 cli_3.6.3 locfit_1.5-9.10 compiler_4.4.1 rlang_1.1.4 crayon_1.5.3
[121] rngtools_1.5.2 heatmaply_1.5.0 affy_1.82.0 plyr_1.8.9 stringi_1.8.4 viridisLite_0.4.2
[127] gridBase_0.4-7 assertthat_0.2.1 munsell_0.5.1 Biostrings_2.72.1 lazyeval_0.2.2 Matrix_1.6-5
[133] hms_1.1.3 sparseMatrixStats_1.16.0 bit64_4.0.5 KEGGREST_1.44.1 statmod_1.5.0 shiny_1.8.1.1
[139] igraph_2.0.3 memoise_2.0.1 affyio_1.74.0 bit_4.0.5
Thank you for your reply Michael,
increasing the sample number just lowered the row number that give a warning but limma-voom worked fine with this dataset.