Hello,
Based on a recent publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02104-1), I have been trying to apply TMM normalization (using edgeR::calcNormFactors) to several microbiome datasets. While working with them, I noticed that the normalization factors for some datasets were not reproducible when I shuffled the order of the samples (columns) in the raw read count matrix. I tested this with nine different datasets and found that it usually occurs when the sparsity of the matrix exceeds 0.79. I'm trying to figure out why this is the case and why column order affects the reproducibility of the normalization factors. I have included a link to a Dropbox folder containing the datasets of interest as well as an R Markdown file with all of the code. Any guidance would be highly appreciated.
https://www.dropbox.com/sh/6fo3fipyc0p36f3/AACnNhDmOk5U285g_u9g0fTba?dl=0
Here are a few lines of code that reproduce the issue.
library(edgeR)
# Download the count matrix (TSV file)
download.file("https://www.dropbox.com/s/w6l11rfyh8z19wl/BISCUIT_ASVs_table.tsv?dl=1", "count_matrix.tsv")
# Read the table, skipping the first line of the file
BISCUIT_count <- as.matrix(read.table("count_matrix.tsv",
    sep = "\t", row.names = 1, comment.char = "", skip = 1,
    header = TRUE, check.names = FALSE, quote = ""))
# TMM normalization factors for the original column order
BISCUIT_count_norm1 <- calcNormFactors(BISCUIT_count, method = "TMM")
# Shuffle the column (sample) order; the matrix has 38 samples
shuffled_df <- BISCUIT_count[, sample(ncol(BISCUIT_count))]
shuffled_norm <- calcNormFactors(shuffled_df, method = "TMM")
# Sort both results by sample name so they can be compared directly
sorted_norm1 <- BISCUIT_count_norm1[sort(names(BISCUIT_count_norm1))]
sorted_shuffle_norm <- shuffled_norm[sort(names(shuffled_norm))]
# Check whether the factors match exactly, then how well they correlate
identical(sorted_norm1, sorted_shuffle_norm)
cor.test(sorted_norm1, sorted_shuffle_norm)
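For reference, the sparsity figure mentioned above can be computed as the fraction of zero entries in the count matrix. This is a minimal sketch that assumes that definition; the helper name `sparsity` and the toy matrix are made up for illustration and are not part of edgeR or the datasets.

```r
# Hypothetical helper: sparsity taken to be the proportion of zero counts
sparsity <- function(count_matrix) {
  mean(count_matrix == 0)
}

# Toy 2 x 4 count matrix (illustrative values only)
toy_counts <- matrix(c(0, 0, 3, 0, 5, 0, 0, 1), nrow = 2)

toy_sparsity <- sparsity(toy_counts)
print(toy_sparsity)
```

On the real data this would be called as sparsity(BISCUIT_count) after the matrix has been read in.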
Thanks, Jacob Nearing
This explanation clears up the behavior perfectly. Thanks for taking the time to resolve the issue.