Question

RNAseq : identical sample-sample distances after using limma::removeBatchEffect

0

Entering edit mode

athiebaut • 0

@112b3e6e

Last seen 3.9 years ago

Switzerland

Hello, I've been analysing RNAseq samples to identify differentially expressed genes (DEG) using DESeq2. I have 6 samples, 3 treated/3 control and a batch effect caused by the day the experiments were performed. Briefly, this is the design of the experiment :

Sample   Treatment    BatchEffect
  S1      Control          A
  S2      Control          B
  S3      Control          C
  S4      Treated          A
  S5      Treated          B
  S6      Treated          C

I've accounted for the batch effect in DESeq2 using design = ~ BatchEffect + Treatment and I found several DEG, so no problem here. On top of that, I also wanted to assess the quality of the data and the impact of the batch effect removal using sample to sample distances, amongst other things (as adviced in DESeq2 vignette). The batch effect removal (I used limma::removeBatchEffect) is pretty efficient because it clusters the samples by condition and reduce the distances between them. However, after removing the batch effect, I noticed that the sample to sample distances are the same between conditions. Here's an example of distance heatmaps before and after batch correction :

Distance

For example, the distance between Treated:A and Treated:B is the same than between Control:A and Control:B. This experiment was performed in 10 species and the "identical distances" happened everytime. Is it an intended feature of batch effect removal ? If not, what could explain this behaviour ?

Here's the code I used (after running DESeqDataSetFromMatrix() and DESeq() commands) :

# Applies variance stabilization
blind_vst <- varianceStabilizingTransformation(dds, blind = T, fitType = "local")
non_blind_vst <- varianceStabilizingTransformation(dds, blind = F, fitType = "local")

# Removes batch effect
mat <- assay(non_blind_vst)
model_mat <- model.matrix(formula(paste("~ ",Treatment)), colData(non_blind_vst))
mat <- limma::removeBatchEffect(x = mat, batch = non_blind_vst[[Treatment]], design = model_mat)
assay(non_blind_vst) <- mat

# Sample to sample distance heatmap without batch effect correction
## Extracts distance data for the plot
sample_dist <- dist(t(assay(blind_vst)))
sample_dist_mat <- as.matrix(sample_dist)
## Makes the plot
rownames(sample_dist_mat) <- paste(blind_vst[[condition]],blind_vst[[batch]], sep = ':')
colnames(sample_dist_mat) <- paste(blind_vst[[condition]],blind_vst[[batch]], sep = ':')
pheatmap(mat = sample_dist_mat, color = colorRampPalette(rev(brewer.pal(9,"Blues")))(255),
          clustering_distance_rows = sample_dist, clustering_distance_cols = sample_dist,
          border_color = "grey50", treeheight_row = 75, treeheight_col = 75,
          legend = T, show_rownames = T, show_colnames = T,
          main = bquote(atop("Distance between samples in"~italic(.(assembly)),
                             "Before batch correction")),
          fontsize = 13, fontsize_row = 10, fontsize_col = 10, angle_col = 315,
          display_numbers = T, number_format = "%.1f", number_color = "red", fontsize_number = 10,
          na_col = "grey", silent = F)

# Sample to sample distance heatmap with batch effect correction
## Extracts distance data for the plot
sample_dist_cor <- dist(t(assay(non_blind_vst)))
sample_dist_mat_cor <- as.matrix(sample_dist_cor)
## Makes the plot
rownames(sample_dist_mat_cor) <- paste(non_blind_vst[[condition]],non_blind_vst[[batch]], sep = ':')
colnames(sample_dist_mat_cor) <- paste(non_blind_vst[[condition]],non_blind_vst[[batch]], sep = ':')
pheatmap(mat = sample_dist_mat_cor, color = colorRampPalette(rev(brewer.pal(9,"Blues")))(255),
         clustering_distance_rows = sample_dist_cor, clustering_distance_cols = sample_dist_cor,
         border_color = "grey50", treeheight_row = 75, treeheight_col = 75,
         legend = T, show_rownames = T, show_colnames = T,
         main = bquote(atop("Distance between samples in"~italic(.(assembly)),
                            "After batch correction")),
         fontsize = 13, fontsize_row = 10, fontsize_col = 10, angle_col = 315,
         display_numbers = T, number_format = "%.1f", number_color = "red", fontsize_number = 10,
         na_col = "grey", silent = F)

And here's the result of sessionInfo() :

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.1.1                RColorBrewer_1.1-2          pheatmap_1.0.12             limma_3.44.3               
 [5] ggrepel_0.9.1               ggplot2_3.3.3               IHW_1.16.0                  DESeq2_1.28.1              
 [9] SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.59.0          Biobase_2.48.0             
[13] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1           
[17] BiocGenerics_0.34.0         data.table_1.14.0           apeglm_1.10.0              

loaded via a namespace (and not attached):
 [1] bit64_4.0.5            splines_4.0.3          blob_1.2.1             GenomeInfoDbData_1.2.3 slam_0.1-48           
 [6] numDeriv_2016.8-1.1    pillar_1.6.1           RSQLite_2.2.7          lattice_0.20-44        glue_1.4.2            
[11] bbmle_1.0.23.1         XVector_0.28.0         colorspace_2.0-1       Matrix_1.3-4           plyr_1.8.6            
[16] XML_3.99-0.6           pkgconfig_2.0.3        emdbook_1.3.12         genefilter_1.70.0      zlibbioc_1.34.0       
[21] purrr_0.3.4            xtable_1.8-4           mvtnorm_1.1-1          fdrtool_1.2.16         BiocParallel_1.22.0   
[26] tibble_3.1.2           annotate_1.66.0        generics_0.1.0         ellipsis_0.3.2         cachem_1.0.5          
[31] withr_2.4.2            survival_3.2-11        magrittr_2.0.1         crayon_1.4.1           memoise_2.0.0         
[36] fansi_0.5.0            MASS_7.3-54            tools_4.0.3            lifecycle_1.0.0        munsell_0.5.0         
[41] locfit_1.5-9.4         AnnotationDbi_1.50.3   compiler_4.0.3         rlang_0.4.11           grid_4.0.3            
[46] RCurl_1.98-1.3         bitops_1.0-7           gtable_0.3.0           DBI_1.1.1              R6_2.5.0              
[51] dplyr_1.0.6            bdsmatrix_1.3-4        fastmap_1.1.0          bit_4.0.4              utf8_1.2.1            
[56] lpsymphony_1.16.0      Rcpp_1.0.6             vctrs_0.3.8            geneplotter_1.66.0     tidyselect_1.1.1      
[61] coda_0.19-4

Thanks in advance for your help !

DESeq2 RNASeq BatchEffect limma • 1.6k views

ADD COMMENT • link updated 20 months ago by Michael Love 43k • written 3.9 years ago by athiebaut • 0

score 2 · Accepted Answer · 2021-06-09

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 days ago

United States

This is a consequence of the method of batch removal using the design.

ADD COMMENT • link 3.9 years ago Michael Love 43k

0

Entering edit mode

Alright, so nothing to worry about, then, I guess ?

Thank you very much for your answer !

ADD REPLY • link 3.9 years ago athiebaut • 0

0

Entering edit mode

Hi Mike, I have a related issue where I'm seeing (post-batch correction) identical gene counts across different donors (that pertain to a single condition), despite having specified to adjust for the donor variable as a batch covariate using limma::removeBatchEffect.

Given my Donor / Group split below (I have multiple samples belonging to different treatments etc. for Group B, but I'm only interested in correcting counts across donors for visualization) - I'm basically seeing all the Group A Donor's getting identical counts after using the function (for a given gene), and the rest adjusted

  Donor    Group A Group B  
   D1        0       18
   D2        1        0
   D3        0       20
   D4        1        0
   D5        0       19
   D6        1        0
   D7        1        0
   D8        0       16