Question

EDASeq normalization: genes with 0 counts

0

Entering edit mode

gprezza • 0

@611515a7

Last seen 7 months ago

Germany

I am analyzing an RNA-seq experiment in which I compare a control library (high coverage) with a treatment one (low coverage). There is some sample-specific length bias across samples and I am trying to correct this with the EDASeq package.

data <- newSeqExpressionSet(counts=as.matrix(reads[keep,]),
                            featureData=featureData,
                            phenoData=data.frame(
                                    conditions=conds,
                                    row.names=colnames(reads)))

dataOffset <- withinLaneNormalization(data,"length", which="full",offset=TRUE)
dataOffset <- betweenLaneNormalization(dataOffset, which="full",offset=TRUE)

After normalization, I do DGE analysis with edgeR:

a <- DGEList(counts=counts(dataOffset), group=conds, genes=as.character(rownames(reads[keep,])))
a$offset <- -offst(dataOffset)
a <- estimateDisp(a, design)
fit <- glmQLFit(a,design,robust=TRUE)

If we look at the volcano plot from the DGE analysis, there's a weird cluster of genes that are clearly separated from the "volcano":

These genes are the ones that have 0 (or very low) read counts in the treatment library (which has a much lower coverage than the control one). If we look at the raw and normalized counts of these genes, I see that often they have originally 0 reads, but after normalization they get assigned with a discrete number of reads. This number is usually relatively different between replicates, and I guess that's why edgeR (correctly I guess) assigns a high FDR to these genes.

I am however a bit confused by how can a gene with 0 reads be assigned any number of reads after normalization. Is this something that is expected? I don't recall ever seeing a volcano plot with a separated cluster of genes as I am seeing now.

edgeR EDASeq • 1.5k views

ADD COMMENT • link 3.8 years ago • updated 3.5 years ago gprezza • 0

score 2 · Accepted Answer · 2021-02-15

Hi,

you are using full-quantile normalization to normalize the data. This method matches the empirical distribution of all samples to that of a reference sample (usually taken as the median across all samples). If, as in your case, samples have very different coverages, it may happen that the zeroes are transformed in non-zero values.

As a (somewhat extreme) toy example, consider if you only had two samples and 6 genes that looked like this:

> x = matrix(c(0, 0, 1, 2, 4, 5, 10, 12, 12, 13, 14, 15), ncol=2)
> x
     [,1] [,2]
[1,]    0   10
[2,]    0   12
[3,]    1   12
[4,]    2   13
[5,]    4   14
[6,]    5   15

Because the median of 0 and 10 is 5, the first 0 will be transformed to 5 (and the second to 6).

> betweenLaneNormalization(x, which="full")
     [,1] [,2]
[1,]    5    5
[2,]    6    6
[3,]    6    6
[4,]    8    8
[5,]    9    9
[6,]   10   10

If you want to preserve the zeros in the normalized data, you have to use one of the many scaling normalizations available, for instance with upper-quartile scaling, the same toy example will look like this

> betweenLaneNormalization(x, which="upper")
     [,1] [,2]
[1,]    0    6
[2,]    0    8
[3,]    2    8
[4,]    5    8
[5,]   10    9
[6,]   12    9
attr(,"scaled:scale")
[1] 0.4057971 1.5942029

I hope this helps.