Hi
I'm comparing the outputs of the betweenLaneNormalization(which="upper") function and the calcNormFactors(method="upperquartile") function. Technically both functions are based on the same method from Bullard. But I get completely opposite library sizes after I perform the normalization. Which one is right? Am I doing something wrong?
library(RUVSeq)
library(edgeR)
library(zebrafishRNASeq)
data(zfGenes)
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^ENS", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]
x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)),)
set <- betweenLaneNormalization(set, which="upper",offset = T)
![enter image description here][1]
libsizes1 <- colSums(set@assayData$normalizedCounts)
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
libsizes2 <- y$samples$lib.size*y$samples$norm.factors
plot(libsizes1,libsizes2)
sessionInfo( )
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8
[2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] zebrafishRNASeq_1.18.0 RUVSeq_1.32.0
[3] EDASeq_2.32.0 ShortRead_1.56.1
[5] GenomicAlignments_1.34.1 SummarizedExperiment_1.28.0
[7] MatrixGenerics_1.10.0 matrixStats_1.2.0
[9] Rsamtools_2.14.0 GenomicRanges_1.50.2
[11] Biostrings_2.66.0 GenomeInfoDb_1.34.9
[13] XVector_0.38.0 IRanges_2.32.0
[15] S4Vectors_0.36.2 BiocParallel_1.32.6
[17] Biobase_2.58.0 BiocGenerics_0.44.0
[19] edgeR_3.40.2 limma_3.54.2
loaded via a namespace (and not attached):
[1] bitops_1.0-7 bit64_4.0.5
[3] filelock_1.0.3 RColorBrewer_1.1-3
[5] progress_1.2.3 httr_1.4.7
[7] tools_4.2.0 utf8_1.2.4
[9] R6_2.5.1 DBI_1.2.2
[11] tidyselect_1.2.0 prettyunits_1.2.0
[13] bit_4.0.5 curl_5.2.1
[15] compiler_4.2.0 cli_3.6.2
[17] xml2_1.3.6 DelayedArray_0.24.0
[19] rtracklayer_1.58.0 rappdirs_0.3.3
[21] stringr_1.5.1 digest_0.6.34
[23] rmarkdown_2.25 R.utils_2.12.3
[25] jpeg_0.1-10 pkgconfig_2.0.3
[27] htmltools_0.5.7 dbplyr_2.4.0
[29] fastmap_1.1.1 rlang_1.1.3
[31] rstudioapi_0.15.0 RSQLite_2.3.5
[33] BiocIO_1.8.0 generics_0.1.3
[35] hwriter_1.3.2.1 dplyr_1.1.4
[37] R.oo_1.26.0 RCurl_1.98-1.14
[39] magrittr_2.0.3 GenomeInfoDbData_1.2.9
[41] interp_1.1-6 Matrix_1.6-5
[43] Rcpp_1.0.12 fansi_1.0.6
[45] lifecycle_1.0.4 R.methodsS3_1.8.2
[47] stringi_1.8.3 yaml_2.3.8
[49] MASS_7.3-56 zlibbioc_1.44.0
[51] BiocFileCache_2.6.1 grid_4.2.0
[53] blob_1.2.4 parallel_4.2.0
[55] crayon_1.5.2 deldir_2.0-4
[57] lattice_0.20-45 GenomicFeatures_1.50.4
[59] hms_1.1.3 KEGGREST_1.38.0
[61] locfit_1.5-9.9 knitr_1.45
[63] pillar_1.9.0 rjson_0.2.21
[65] codetools_0.2-18 biomaRt_2.54.1
[67] XML_3.99-0.16.1 glue_1.7.0
[69] evaluate_0.23 latticeExtra_0.6-30
[71] png_0.1-8 vctrs_0.6.5
[73] cachem_1.0.8 xfun_0.42
[75] aroma.light_3.28.0 restfulr_0.0.15
[77] tibble_3.2.1 AnnotationDbi_1.60.2
[79] memoise_2.0.1
Gordon is correct. The column sums of the
normalizedCounts
matrix are meaningless and not intended to be used as size factors.You can use
offst(set)
to extract a matrix of offsets to be added to downstream negative binomial models (e.g., inedgeR
'sglmFit
) as illustrated in the EDASeq vignette. Note that the offsets extracted from EDASeq need to be multiplied by -1 before using them in edgeR (see section 7.3 of the EDASeq vignette for details).