From the DESeq2 vignette, the following snippet provides an example of how cqn offsets can be used in the analysis:
    cqnOffset <- cqnObject$glm.offset
    cqnNormFactors <- exp(cqnOffset)
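For context, this is how such a factor matrix is then supplied to DESeq2. A minimal sketch, assuming an existing DESeqDataSet called dds; the centering step reflects the vignette's advice that normalization factors should be centered around 1:

    library(DESeq2)

    # Divide out the row-wise geometric means so the factors are centered
    # around 1, then supply them to the dataset:
    cqnNormFactors <- cqnNormFactors / exp(rowMeans(log(cqnNormFactors)))
    normalizationFactors(dds) <- cqnNormFactors
    dds <- DESeq(dds)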
Meanwhile, the EDASeq norm factors should be computed by flipping the sign of the offsets (note the -1):

    EDASeqNormFactors <- exp(-1 * EDASeqOffset)
The same usage is echoed in the documentation of these tools.
However, from my understanding, both cqn and EDASeq return offsets of the following form:

    offset = log(normed_counts) - log(actual_counts)
See page 4 of the cqn vignette (the passage starting "the normalized expression values are"; rearrange the equation to move y, which is equivalent to log2(actual_counts), to the left-hand side to get the expression above), and see section 7.3 of the EDASeq vignette, which uses this expression as is.
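To make that form concrete, a toy numeric illustration (hypothetical numbers, not taken from either package):

    # One gene in one sample (hypothetical values):
    raw    <- 100                      # actual count
    normed <- 50                       # normalized count
    offset <- log(normed) - log(raw)   # the form above: log(50) - log(100)

    # A DESeq2-style normalization factor divides the raw count,
    # so recovering the normalized count requires exp(-1 * offset):
    raw / exp(-1 * offset)             # = 50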
I may be missing something very obvious here but, given this, why multiply the EDASeq offset by -1? Or, equivalently, why not do that for the cqn offsets?
But wouldn't the same be true for cqn offsets then? Why don't we flip those?

From the cqn vignette:

    RPKM.cqn = y + offset

which is equivalent to (as y should be log(y_raw) and RPKM.cqn is equivalent to log(y_norm)):

    log(y_norm) = log(y_raw) + offset

giving

    offset = log(y_norm) - log(y_raw)

which gives back the expression from the question:

    offset = log(normed_counts) - log(actual_counts)
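For reference, a sketch of where y and offset come from; the input names (gccontent, genelengths) follow the vignette but should be treated as assumptions:

    library(cqn)

    # counts: genes x samples matrix; gccontent, genelengths: per gene;
    # sizeFactors: per-sample library sizes (as in the cqn vignette)
    cqn.res <- cqn(counts, x = gccontent, lengths = genelengths,
                   sizeFactors = sizeFactors)

    # The normalized expression values (RPKM.cqn above) are y + offset:
    rpkm.cqn <- cqn.res$y + cqn.res$offset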
I don't use offset in the DESeq2 vignette; I use glm.offset. See the last section of the cqn vignette.
Yes - this is the key, I now realise.
The last section of the vignette does not really explain how exactly glm.offset is different, nor why it should be used instead of offset, so I had a look at the code.
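What happens there is essentially the following (a paraphrased sketch, not a verbatim quote of the cqn source; the toy stand-ins just make it runnable):

    # Toy stand-ins: a 2-gene x 3-sample offset matrix (log2 scale)
    # and per-sample library sizes:
    offset      <- matrix(rnorm(6), nrow = 2)
    sizeFactors <- c(1.5e6, 2.0e6, 2.5e6)

    # sweep() subtracts its STATS argument from each column by default:
    glm.offset <- sweep(-offset, 2, -log2(sizeFactors / 10^6)) * log(2)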
Note that the sweep function subtracts by default, so the minus in the last term cancels out, giving

    glm.offset = log(2) * (-offset + log2(size_factors / 10^6))

So there are three distinct steps happening in the glm.offset calculation:

1. the sign of the offset is flipped (- offset);
2. library size scaling is added (+ log2(size_factors / 10^6)) (this statement was edited, see comments below);
3. the result is rescaled from log2 to the natural log scale (log(2) *).

So this answers the question of why EDASeq and cqn are treated differently in DESeq2: cqn does some extra work in the package to make its offset compatible, while EDASeq doesn't.

Re: (2), I read the code and docs as adding in library size scaling, not removing it.
This is because, after (1), the offset is in the direction of the bias, e.g. higher raw count, higher offset. So by adding the log2 of the size factor, the term is also going in the direction of the sequencing depth bias.
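A toy check of that direction argument (hypothetical numbers):

    # Two samples, the second sequenced at 4x the depth:
    sizeFactors <- c(1e6, 4e6)
    log2(sizeFactors / 10^6)
    # [1] 0 2 -- the deeper library receives the larger added term, i.e.
    # the adjustment moves with the sequencing-depth bias, not against it.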
The man page implies this as well.
Yes - I think you are right once more. Thanks for taking so much time to look into this.
Size factors come into the code of cqn twice: at the very beginning, and in the snippet I pasted before. At the beginning, they are subtracted away from the counts (the $y variable is actually in counts per million). This makes the effective offset (relative to raw counts) equal to offset - log2(size_factors / 10^6). Taking the negative of this effective offset results in -offset + log2(size_factors / 10^6), which is what we see in the equation above.

I'll edit my previous answer to have the correct information, in case somebody from the future is also wondering about this.
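A final sanity check of this algebra (hypothetical single gene/sample values):

    offset     <- 0.4    # cqn's log2-scale offset
    sizeFactor <- 2e6    # library size for that sample

    # Effective offset relative to raw counts (log2 scale), as above:
    effective <- offset - log2(sizeFactor / 10^6)

    # cqn's glm.offset: negate the effective offset, rescale to natural log:
    glm.offset <- log(2) * (-offset + log2(sizeFactor / 10^6))

    # This is exactly the exp(-1 * offset) convention applied to the
    # natural-log-scale effective offset, matching the EDASeq treatment:
    all.equal(exp(glm.offset), exp(-1 * log(2) * effective))   # TRUE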