Question

TMM normalization on sparse datasets

1

Entering edit mode

jc235601 ▴ 20

@jc235601-21161

Last seen 4.2 years ago

Dalhousie University

Hello,

Based on a recent publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02104-1) I have been trying to apply TMM normalization (using edger::calcNormFactors) to a number of different microbiome datasets that I have been working with. While working with them I noticed that the normalization factors for some datasets were not reproducible when I shuffled the order of the samples (columns) in the raw read count matrix. I tested this behavior with nine different datasets and found that this usually occurs when the sparsity of the matrix > 0.79. I'm trying to figure out why this is the case and why this impacts the reproducible of the normalization factors. I have included a link to a dropbox containing the datasets of interest as well as an rmarkdown file with all of the code. Any guidance would be highly appreciated.

https://www.dropbox.com/sh/6fo3fipyc0p36f3/AACnNhDmOk5U285g_u9g0fTba?dl=0

Here are a few code lines to reproduce the issue.

library(edgeR)

#download data matrix TSV file
download.file("https://www.dropbox.com/s/w6l11rfyh8z19wl/BISCUIT_ASVs_table.tsv?dl=1", "count_matrix.tsv")

BISCUIT_count <- as.matrix(read.table("count_matrix.tsv",
                                      sep="\t", row.names = 1, comment.char = "", skip=1, header=T, check.names = F, quote=""))


BISCUIT_count_norm1 <- calcNormFactors(BISCUIT_count, method="TMM")

#shuffle column order 
shuffled_df <- BISCUIT_count[,sample(c(1:38), 38, replace=F)]

shuffled_norm <- calcNormFactors(shuffled_df, method="TMM")

sorted_norm1 <- BISCUIT_count_norm1[sort(names(BISCUIT_count_norm1))]
sorted_shuffle_norm <- shuffled_norm[sort(names(shuffled_norm))]

identical(sorted_norm1, sorted_shuffle_norm)

cor.test(sorted_norm1, sorted_shuffle_norm)

Thanks, Jacob Nearing

edger limma • 1.4k views

ADD COMMENT • link 4.2 years ago jc235601 ▴ 20

score 7 · Accepted Answer · 2020-09-11

calcNormFactors should return the same normalization factors regardless of the order of the samples.

I appreciate that you have provided code and data, which is good, but I don't have time to sift through the complex analysis session you have provided. You should be able to demonstrate the problem with a single count matrix and a couple of lines of code. Can you do that please? The code lines can be copied into your question -- we don't need html or markdown.

20 minutes later

OK, I see the problem:

> x <- read.delim("ArcticFireSoils_ASVs_table.tsv",skip=2,header=FALSE,row.names=1)
> dim(x)
[1] 37244   148
> set.seed(20200912)
> j <- sample(148)
> f1 <- calcNormFactors(x)
> f2 <- calcNormFactors(x[,j])
> summary(f1[j] - f2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.55507 -0.13682 -0.01441  0.01381  0.11001  1.83245

I'll have a look at it.

Another 20 minutes later

The problem is caused by the upper-quartile normalization method. The TMM method requires one of the samples to be chosen as the reference sample to which all the others are compared. By default, the upper-quartile method is used to choose the reference sample. The calcNormFactors help page says:

If refColumn is unspecified, then the library whose upper quartile is closest to the mean upper quartile is used.

However for your data, more than 75% of the data values are zero, so the upper-quartile is zero, so the upper-quartile normalization factors are undefined:

> summary(calcNormFactors(x,method="upperquartile"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA     148

As a result, TMM falls back on choosing the first column as the reference sample. If you shuffle the columns, then the first column will be different, leading to a different reference sample and to different normalization factors.

You can fix the problem by choosing the reference sample in a consistent and robust fashion. I also suggest switching to TMMwsp instead of ordinary TMM because it is somewhat more robust to sparse data:

> Ref1 <- which.max(colSums(sqrt(x)))
> f1 <- calcNormFactors(x, method="TMMwsp", refCol=Ref1)
> Ref2 <- which.max(colSums(sqrt(x[,j])))
> f2 <- calcNormFactors(x[,j], method="TMMwsp", refCol=Ref2)
> summary(f1[j] - f2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0