Question

DESeq2 batch effect with and without interaction model

1

Entering edit mode

gdagstn ▴ 10

@gdagstn-14406

Last seen 5.3 years ago

Hi all,

I'm trying to analyze RNA-seq data with DESeq2, and I have some batches in my data that I need to account for. Some of them have biological relevance, and some of them are purely technical.

I have 3 patients, 6 time points per patient, and each time point has been sequenced with 2 different techniques (Type 1 and Type 2). Because of technical limitations due to library prep time and index diversity, all Type 1 sequencing experiments were prepped and sequenced in one run, whereas Type 2 experiments were divided in 2 unequal batches. It is not possible to prepare and sequence type 1 and type 2 together.
Unfortunately, one of the batches (batch B) coincides with Patient 1, and it's not yet very clear to me whether this can be properly accounted for. There are, however, other problems.

To make this easier, here's the coldata:

           patient seq_type time seq_batch
Type1_0_1      1   Type1    0         A
Type1_0_2      2   Type1    0         A
Type1_0_3      3   Type1    0         A
Type1_1_1      1   Type1    1         A
Type1_1_2      2   Type1    1         A
Type1_1_3      3   Type1    1         A
Type1_2_1      1   Type1    2         A
Type1_2_2      2   Type1    2         A
Type1_2_3      3   Type1    2         A
Type1_6_1      1   Type1    6         A
Type1_6_2      2   Type1    6         A
Type1_6_3      3   Type1    6         A
Type1_24_1     1   Type1   24         A
Type1_24_2     2   Type1   24         A
Type1_24_3     3   Type1   24         A
Type1_48_1     1   Type1   48         A
Type1_48_2     2   Type1   48         A
Type1_48_3     3   Type1   48         A
Type2_0_1      1   Type2    0         B
Type2_0_2      2   Type2    0         C
Type2_0_3      3   Type2    0         C
Type2_1_1      1   Type2    1         B
Type2_1_2      2   Type2    1         C
Type2_1_3      3   Type2    1         C
Type2_2_1      1   Type2    2         B
Type2_2_2      2   Type2    2         C
Type2_2_3      3   Type2    2         C
Type2_6_1      1   Type2    6         B
Type2_6_2      2   Type2    6         C
Type2_6_3      3   Type2    6         C
Type2_24_1     1   Type2   24         B
Type2_24_2     2   Type2   24         C
Type2_24_3     3   Type2   24         C
Type2_48_1     1   Type2   48         B
Type2_48_2     2   Type2   48         C
Type2_48_3     3   Type2   48         C

I am interested in understanding what are the differences between any time point and time 0, in the 2 different sequencing types. This should be relatively straightforward (I only use Wald statistics for each time point independently): I take the counts and coldata for each sequencing type separately, and create a DESeq Dataset using ddsMat <- DESeqDataSetFromMatrix(countData = type1, colData = coldata_type1, design = ~ patient + time) releveling for time = 0.

Nevertheless, as I have no sequencing batches in Type 1, I can do this without any problem; it becomes instead impossible to do for Type 2, as batch B and Patient 1 coincide and for a design = ~ patient + time + seqbatch the matrix is no longer full rank since Patient 1 is contained within Batch B.

A first question would be: is the effect of Batch B anyway negligible once we take into account the Patient variable?

Going deeper, if I want to know which of these differences remain taking into account the differences in sequencing type, I would use the full count matrix (type 1 + type 2) and the coldata shown above; this time I use an interaction model:

ddsMat <- DESeqDataSetFromMatrix(countData = type1, colData = coldata_type1, design = ~ patient + time + seq_type + time:seq_type)

Once again, if I want to add the sequencing batch as a variable, since batch A coincides with type 1 and batch B coincides with patient 1 I will get another 'matrix not full rank' warning.

This is particularly problematic since, by doing a PCA on the rlog-normalized counts for type 1, I can see that PC1 follows the time, as expected; however, in type 2 the PC1 separates perfectly batches B and C, accounting for a whopping 79% of the variance.

It seems that I'm left with a huge batch effect that I don't know how to correct, especially when trying to use the interaction model. Moreover, in the future I won't be able to make many changes to the experimental protocol: even though we avoided confounding batches with patients, we would never be able to have all type 2 datasets within the same prep/run unless we reduced the number of timepoints.

My second question is: does anyone have any suggestions as to how to correct for these effects?

Here is the sessionInfo() for my current settings:

> sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] sva_3.24.4                 BiocParallel_1.10.1
 [3] genefilter_1.58.1          mgcv_1.8-22
 [5] nlme_3.1-131               edgeR_3.18.1
 [7] limma_3.32.10              ggplot2_2.2.1
 [9] biomaRt_2.32.1             PoiClaClu_1.0.2
[11] pheatmap_1.0.8             RColorBrewer_1.1-2
[13] DESeq2_1.16.1              SummarizedExperiment_1.6.5
[15] DelayedArray_0.2.7         matrixStats_0.52.2
[17] Biobase_2.36.2             GenomicRanges_1.28.6
[19] GenomeInfoDb_1.12.3        IRanges_2.10.5
[21] S4Vectors_0.14.7           BiocGenerics_0.22.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13            locfit_1.5-9.1          lattice_0.20-35
 [4] digest_0.6.12           plyr_1.8.4              backports_1.1.1
 [7] acepack_1.4.1           RSQLite_2.0             zlibbioc_1.22.0
[10] rlang_0.1.4             lazyeval_0.2.1          data.table_1.10.4-3
[13] annotate_1.54.0         blob_1.1.0              rpart_4.1-11
[16] Matrix_1.2-11           checkmate_1.8.5         splines_3.4.1
[19] geneplotter_1.54.0      stringr_1.2.0           foreign_0.8-69
[22] htmlwidgets_0.9         RCurl_1.95-4.8          bit_1.1-12
[25] munsell_0.4.3           compiler_3.4.1          base64enc_0.1-3
[28] htmltools_0.3.6         nnet_7.3-12             tibble_1.3.4
[31] gridExtra_2.3           htmlTable_1.9           GenomeInfoDbData_0.99.0
[34] Hmisc_4.0-3             XML_3.98-1.9            bitops_1.0-6
[37] grid_3.4.1              xtable_1.8-2            gtable_0.2.0
[40] DBI_0.7                 magrittr_1.5            scales_0.5.0
[43] stringi_1.1.5           XVector_0.16.0          latticeExtra_0.6-28
[46] Formula_1.2-2           tools_3.4.1             bit64_0.9-7
[49] survival_2.41-3         AnnotationDbi_1.38.2    colorspace_1.3-2
[52] cluster_2.0.6           memoise_1.1.0           knitr_1.17

deseq2 rna-seq R linear model design matrix • 2.4k views

ADD COMMENT • link updated 7.5 years ago by Michael Love 43k • written 7.5 years ago by gdagstn ▴ 10

score 4 · Accepted Answer · 2017-11-17

As you say, it's unfortunate that batch is confounded with patient in seq type 2. Because you already want to control for patient overall, I'd just use a model where you control for patient effects for each sequencing type separately (this choice wouldn't be necessary but due to the confounding problem). You can use a design of

~ seq_type + time + seq_type:patient + seq_type:time

which will give you interaction terms for each time point for which you can extract Wald tests using results() and the 'name' argument. These interaction terms will be the differences between type 2 and type 1 for that time point relative to time 0. And controlling for patient within each type includes controlling for batch (again, only necessary to control for patient within type because of the confounding problem).