Hi,
I am using tximport to combine the gene-level quantification and further normalize it using edgeR's TMM normalization. The code I used is from https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html#edgeR
cts <- txi$counts
normMat <- txi$length
# Obtaining per-observation scaling factors for length, adjusted to avoid
# changing the magnitude of the counts.
normMat <- normMat/exp(rowMeans(log(normMat)))
normCts <- cts/normMat
# RSEM produce 0 effective length
# Computing effective library sizes from scaled counts, to account for
# composition biases between samples.
eff.lib <- calcNormFactors(normCts) * colSums(normCts)
# Combining effective library sizes with the length factors, and calculating
# offsets for a log-link GLM.
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)
# Creating a DGEList object for use in edgeR.
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
y <- y[keep_rows, ]
cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)
However, as RSEM output zero effective gene length if it < 1 and result in error of calcNormFactors, shall I re-set those values to 1 in the gene length matrix before calcNormFactors
?
cts <- txi$counts
normMat <- txi$length
normMat[normMat==0] <- 1
Thanks a lot for your reply. Could you explain more for removing those genes at the beginning? My understanding is that those genes are shorter than fragments and thereby only 1 possible sequencing starting point. So, I reset the effective length to 1 to estimate library size. I will then remove those genes in downstream analysis.
Are you sure that those genes with size < 1 have non-zero counts?