TMM normalization on sparse datasets
jc235601 (Dalhousie University)

Hello,

Based on a recent publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02104-1), I have been trying to apply TMM normalization (using edgeR::calcNormFactors) to a number of microbiome datasets I am working with. I noticed that, for some datasets, the normalization factors are not reproducible when I shuffle the order of the samples (columns) in the raw read-count matrix. I tested this behavior with nine different datasets and found that it usually occurs when the sparsity of the matrix exceeds 0.79, i.e. when more than 79% of the entries are zero. I am trying to figure out why this happens and why it affects the reproducibility of the normalization factors. I have included a Dropbox link to the datasets of interest as well as an R Markdown file with all of the code. Any guidance would be highly appreciated.

https://www.dropbox.com/sh/6fo3fipyc0p36f3/AACnNhDmOk5U285g_u9g0fTba?dl=0
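For reference, by "sparsity" I simply mean the fraction of zero entries in the raw count matrix. A minimal sketch of the check I used (counts stands for any of the count matrices, e.g. the BISCUIT matrix loaded below):

# fraction of zero entries in a raw count matrix
sparsity <- function(counts) mean(counts == 0)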

Here are a few lines of code that reproduce the issue.

library(edgeR)

# download the raw count matrix (TSV) from Dropbox
download.file("https://www.dropbox.com/s/w6l11rfyh8z19wl/BISCUIT_ASVs_table.tsv?dl=1",
              "count_matrix.tsv")

# read the ASV count matrix (features in rows, samples in columns)
BISCUIT_count <- as.matrix(read.table("count_matrix.tsv", sep = "\t",
                                      row.names = 1, comment.char = "", skip = 1,
                                      header = TRUE, check.names = FALSE, quote = ""))

# TMM normalization factors on the original column order
BISCUIT_count_norm1 <- calcNormFactors(BISCUIT_count, method = "TMM")

# shuffle the column (sample) order and recompute
shuffled_df <- BISCUIT_count[, sample(ncol(BISCUIT_count))]
shuffled_norm <- calcNormFactors(shuffled_df, method = "TMM")

# put both factor vectors back in the same sample order and compare
sorted_norm1 <- BISCUIT_count_norm1[sort(names(BISCUIT_count_norm1))]
sorted_shuffle_norm <- shuffled_norm[sort(names(shuffled_norm))]

identical(sorted_norm1, sorted_shuffle_norm)  # expected TRUE, but is FALSE here

cor.test(sorted_norm1, sorted_shuffle_norm)

Thanks,
Jacob Nearing

@gordon-smyth (WEHI, Melbourne, Australia)

calcNormFactors should return the same normalization factors regardless of the order of the samples.

I appreciate that you have provided code and data, which is good, but I don't have time to sift through the complex analysis session. You should be able to demonstrate the problem with a single count matrix and a couple of lines of code. Can you do that please? The code lines can be copied directly into your question -- we don't need HTML or markdown.

20 minutes later

OK, I see the problem:

> x <- read.delim("ArcticFireSoils_ASVs_table.tsv",skip=2,header=FALSE,row.names=1)
> dim(x)
[1] 37244   148
> set.seed(20200912)
> j <- sample(148)
> f1 <- calcNormFactors(x)
> f2 <- calcNormFactors(x[,j])
> summary(f1[j] - f2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.55507 -0.13682 -0.01441  0.01381  0.11001  1.83245

I'll have a look at it.

Another 20 minutes later

The problem is caused by the upper-quartile normalization method. The TMM method requires one of the samples to be chosen as the reference sample to which all the others are compared. By default, the upper-quartile method is used to choose the reference sample. The calcNormFactors help page says:

If refColumn is unspecified, then the library whose upper quartile is closest to the mean upper quartile is used.

However, for your data more than 75% of the count values are zero, so the upper quartile is zero and the upper-quartile normalization factors are undefined:

> summary(calcNormFactors(x,method="upperquartile"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA     148 
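To see why, consider a toy column in which more than 75% of the counts are zero; its 75th percentile is then itself zero, so there is no usable upper-quartile scaling factor (a small illustration, not taken from the dataset above):

# a column that is 80% zeros
y <- c(0, 0, 0, 0, 5)
quantile(y, probs = 0.75)  # returns 0, so the upper-quartile factor is undefined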

As a result, TMM falls back on choosing the first column as the reference sample. If you shuffle the columns, then the first column will be different, leading to a different reference sample and to different normalization factors.
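You can verify this diagnosis by pinning the reference to the same physical sample in both calls; with a fixed reference, even plain TMM gives order-independent factors. A quick check, reusing the x and j objects from above:

# pin the reference to the same physical sample before and after shuffling
ref_name <- colnames(x)[1]
g1 <- calcNormFactors(x, refColumn = which(colnames(x) == ref_name))
g2 <- calcNormFactors(x[, j], refColumn = which(colnames(x[, j]) == ref_name))
summary(g1[j] - g2)  # should now be all zeros (up to floating point)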

You can fix the problem by choosing the reference sample in a consistent, order-independent fashion. I also suggest switching from ordinary TMM to TMMwsp, which is somewhat more robust for sparse data:

> Ref1 <- which.max(colSums(sqrt(x)))
> f1 <- calcNormFactors(x, method="TMMwsp", refColumn=Ref1)
> Ref2 <- which.max(colSums(sqrt(x[,j])))
> f2 <- calcNormFactors(x[,j], method="TMMwsp", refColumn=Ref2)
> summary(f1[j] - f2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 
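Since you are applying this to several datasets, the fix can be wrapped in a small helper. This is only a sketch under the same assumptions as above; the name tmm_sparse_factors is illustrative, not part of edgeR:

library(edgeR)

# choose the reference sample by a shuffle-invariant rule (largest
# column sum of square-root counts), then run the sparsity-robust
# TMMwsp method against that reference
tmm_sparse_factors <- function(counts) {
  counts <- as.matrix(counts)
  ref <- which.max(colSums(sqrt(counts)))
  calcNormFactors(counts, method = "TMMwsp", refColumn = ref)
}

# usage: tmm_sparse_factors(x) and tmm_sparse_factors(x[, j]) now agree
# up to column order, however the samples are shuffled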
This explanation clears up the behavior perfectly. Thanks for taking the time to resolve the issue.
