Dear edgeR-Team,
I have around 200 samples for which I have
1) CNV data, ie. a matrix X (regions x samples) in which a cell is either 2 (normal diploid), 0/1 (loss), or 3/4/5/... (gain),
2) RNA-seq data, ie. a matrix Y (genes x samples) storing the raw RNA-seq read counts.
Now I want to find for each CNV region (rows of X) DE genes between sample groups formed by CN state,.
For each CNV region, the control group is formed by samples that are diploid (2) in this region, against we are comparing one or more sample groups, each representing a deviating state (eg. 3; corresponding to a one copy gain in that region).
That essentally means that the rows of X are different versions of my group vector by which I am testing for differential expression.
Accordingly I could go ahead (reasonable or not), applying edgeR's standard workflow for each row of X
- corresponding to testing all CNVs against all genes:
# already filtered for genes that are not sufficiently expressed
y <- edgeR::DGEList(counts=Y)
y <- edgeR::calcNormFactors(y)
edgeIt <- function(group)
{
nr.states <- length(unique(group))
group <- as.factor(group)
y$samples$group <- group
f <- stats::formula("~group")
design <- stats::model.matrix(f)
y <- edgeR::estimateDisp(y, design, robust=TRUE)
fit <- edgeR::glmQLFit(y, design, robust=TRUE)
qlf <- edgeR::glmQLFTest(fit, coef=2:nr.states)
return(qlf)
}
res <- apply(X, 1, edgeIt)
Now, due to runtime and power considerations, that's not really the optimal approach that you can think off.
First question: Testing various rows of X (many different group vectors), it figures that results of
y <- edgeR::estimateDisp(y, design, robust=TRUE)
hardly change, and it seems to be ok to do this just once after
y <- edgeR::calcNormFactors(y)
i.e. moving the dispersion estimate out of the function? (see code below)
Second question:
As for eQTL studies, you rather want to restrict testing to the genes close-by to a CNV, let's say in a 1 Mbp window.
Let's say I have a pre-computed list cnv2genes
giving me for each CNV the surrounding genes. Would it then be appropriate to just test these genes for each CNV for DE? I.e.
# already filtered for genes that are not sufficiently expressed
y <- edgeR::DGEList(counts=Y)
y <- edgeR::calcNormFactors(y)
## computing dispersion only once, independent of design
y <- edgeR::estimateDisp(y, robust=TRUE)
edgeIt2 <- function(group, genes)
{
nr.states <- length(unique(group))
group <- as.factor(group)
y$samples$group <- group
f <- stats::formula("~group")
design <- stats::model.matrix(f)
# testing only genes in the neighborhood of a CNV
fit <- edgeR::glmQLFit(y[genes,], design, robust=TRUE)
qlf <- edgeR::glmQLFTest(fit, coef=2:nr.states)
return(qlf)
}
res <- lapply(rownames(X), function(cnv) edgeIt2(X[cnv,], cnv2genes[[cnv]]))
However, if I'm doing the second version, it runs through for some CNVs, but for others I encounter errors like:
Error in makeCompressedMatrix(glmfit$var.post, dim(glmfit$counts), byrow = FALSE) :
'dims[1]' should equal length of 'x'
or
Error in squeezeVar(s2, df = df.residual, covariate = AveLogCPM, robust = robust, :
Could not estimate prior df
Many thanks!
Reply to answers using the "Add comment" button, not the "add answer".
Your proposed approach works, provided the CNV-specific matrix used in
estimateDisp
also includes the effects of other CNV regions as I described above. (That is, do a PCA without the CNV region of interest, and use the top X PCs and the CNV profile for the region of interest in the design matrix.) Your design can't just contain the values for a single CNV region, because you're fitting the design matrix to all genes; so it needs to model any variation associated with any of the CNV regions. Otherwise you end up with problems related to inflation of the dispersions as previously described.A faster approach is to assume that the CNV profile for your region of interest can be approximated by a linear combination of the PCs when the PCA is done on the entire CNV matrix. This means that you can run
estimateDisp
once with the PC-derived design matrix, and then set up a contrast specific to each CNV region based on the aforementioned linear combination. (Here, the null hypothesis is whether the linear combination is equal to zero, i.e., no CNV-associated effect.)