Hi all,
I'm trying to analyze RNA-seq data with DESeq2, and I have some batches in my data that I need to account for. Some of them have biological relevance, and some of them are purely technical.
I have 3 patients, 6 time points per patient, and each time point has been sequenced with 2 different techniques (Type 1 and Type 2). Because of technical limitations due to library prep time and index diversity, all Type 1 sequencing experiments were prepped and sequenced in one run, whereas Type 2 experiments were divided in 2 unequal batches. It is not possible to prepare and sequence type 1 and type 2 together.
Unfortunately, one of the batches (batch B) coincides with Patient 1, and it's not yet very clear to me whether this can be properly accounted for. There are, however, other problems.
To make this easier, here's the coldata:
patient seq_type time seq_batch Type1_0_1 1 Type1 0 A Type1_0_2 2 Type1 0 A Type1_0_3 3 Type1 0 A Type1_1_1 1 Type1 1 A Type1_1_2 2 Type1 1 A Type1_1_3 3 Type1 1 A Type1_2_1 1 Type1 2 A Type1_2_2 2 Type1 2 A Type1_2_3 3 Type1 2 A Type1_6_1 1 Type1 6 A Type1_6_2 2 Type1 6 A Type1_6_3 3 Type1 6 A Type1_24_1 1 Type1 24 A Type1_24_2 2 Type1 24 A Type1_24_3 3 Type1 24 A Type1_48_1 1 Type1 48 A Type1_48_2 2 Type1 48 A Type1_48_3 3 Type1 48 A Type2_0_1 1 Type2 0 B Type2_0_2 2 Type2 0 C Type2_0_3 3 Type2 0 C Type2_1_1 1 Type2 1 B Type2_1_2 2 Type2 1 C Type2_1_3 3 Type2 1 C Type2_2_1 1 Type2 2 B Type2_2_2 2 Type2 2 C Type2_2_3 3 Type2 2 C Type2_6_1 1 Type2 6 B Type2_6_2 2 Type2 6 C Type2_6_3 3 Type2 6 C Type2_24_1 1 Type2 24 B Type2_24_2 2 Type2 24 C Type2_24_3 3 Type2 24 C Type2_48_1 1 Type2 48 B Type2_48_2 2 Type2 48 C Type2_48_3 3 Type2 48 C
I am interested in understanding what are the differences between any time point and time 0, in the 2 different sequencing types. This should be relatively straightforward (I only use Wald statistics for each time point independently): I take the counts and coldata for each sequencing type separately, and create a DESeq Dataset using ddsMat <- DESeqDataSetFromMatrix(countData = type1, colData = coldata_type1, design = ~ patient + time)
releveling for time = 0.
Nevertheless, as I have no sequencing batches in Type 1, I can do this without any problem; it becomes instead impossible to do for Type 2, as batch B and Patient 1 coincide and for a design = ~ patient + time + seqbatch
the matrix is no longer full rank since Patient 1 is contained within Batch B.
A first question would be: is the effect of Batch B anyway negligible once we take into account the Patient variable?
Going deeper, if I want to know which of these differences remain taking into account the differences in sequencing type, I would use the full count matrix (type 1 + type 2) and the coldata shown above; this time I use an interaction model:
ddsMat <- DESeqDataSetFromMatrix(countData = type1, colData = coldata_type1, design = ~ patient + time + seq_type + time:seq_type)
Once again, if I want to add the sequencing batch as a variable, since batch A coincides with type 1 and batch B coincides with patient 1 I will get another 'matrix not full rank' warning.
This is particularly problematic since, by doing a PCA on the rlog-normalized counts for type 1, I can see that PC1 follows the time, as expected; however, in type 2 the PC1 separates perfectly batches B and C, accounting for a whopping 79% of the variance.
It seems that I'm left with a huge batch effect that I don't know how to correct, especially when trying to use the interaction model. Moreover, in the future I won't be able to make many changes to the experimental protocol: even though we avoided confounding batches with patients, we would never be able to have all type 2 datasets within the same prep/run unless we reduced the number of timepoints.
My second question is: does anyone have any suggestions as to how to correct for these effects?
Here is the sessionInfo() for my current settings:
> sessionInfo()
R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin16.7.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib locale: [1] C/UTF-8/C/C/C/C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] sva_3.24.4 BiocParallel_1.10.1 [3] genefilter_1.58.1 mgcv_1.8-22 [5] nlme_3.1-131 edgeR_3.18.1 [7] limma_3.32.10 ggplot2_2.2.1 [9] biomaRt_2.32.1 PoiClaClu_1.0.2 [11] pheatmap_1.0.8 RColorBrewer_1.1-2 [13] DESeq2_1.16.1 SummarizedExperiment_1.6.5 [15] DelayedArray_0.2.7 matrixStats_0.52.2 [17] Biobase_2.36.2 GenomicRanges_1.28.6 [19] GenomeInfoDb_1.12.3 IRanges_2.10.5 [21] S4Vectors_0.14.7 BiocGenerics_0.22.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.13 locfit_1.5-9.1 lattice_0.20-35 [4] digest_0.6.12 plyr_1.8.4 backports_1.1.1 [7] acepack_1.4.1 RSQLite_2.0 zlibbioc_1.22.0 [10] rlang_0.1.4 lazyeval_0.2.1 data.table_1.10.4-3 [13] annotate_1.54.0 blob_1.1.0 rpart_4.1-11 [16] Matrix_1.2-11 checkmate_1.8.5 splines_3.4.1 [19] geneplotter_1.54.0 stringr_1.2.0 foreign_0.8-69 [22] htmlwidgets_0.9 RCurl_1.95-4.8 bit_1.1-12 [25] munsell_0.4.3 compiler_3.4.1 base64enc_0.1-3 [28] htmltools_0.3.6 nnet_7.3-12 tibble_1.3.4 [31] gridExtra_2.3 htmlTable_1.9 GenomeInfoDbData_0.99.0 [34] Hmisc_4.0-3 XML_3.98-1.9 bitops_1.0-6 [37] grid_3.4.1 xtable_1.8-2 gtable_0.2.0 [40] DBI_0.7 magrittr_1.5 scales_0.5.0 [43] stringi_1.1.5 XVector_0.16.0 latticeExtra_0.6-28 [46] Formula_1.2-2 tools_3.4.1 bit64_0.9-7 [49] survival_2.41-3 AnnotationDbi_1.38.2 colorspace_1.3-2 [52] cluster_2.0.6 memoise_1.1.0 knitr_1.17
Thanks Michael, sorry for the late reply. It works, although the variance is really hindered by the "rogue" batch. Nevertheless I can now find some very interesting differences. I'm impressed at the flexibility of the linear model implementation.
Thanks a lot!