Hello,
I have successfully run edgeR on an ATAC-seq dataset to identify differential open chromatin peaks across two cell types. I wanted to normalise for peaks read count, GC content and peak width, so my method was to create a consensus peak set in Diffbind (2 cell types, 3 replicates per cell type), then to use the CQN package for normalisation steps. This creates a GLM offset to add to the DGEList object in edgeR that accounts for the normalisation.
The issue I have is that when I try to filter my DGEList object in edgeR to remove 'peaks' that have very low read counts using the command:
keep <- rowSums(cpm(peak.list) > 2) >= 3
I get the following error:
Error in cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count) : matrix dimensions are not consistent for non-repeating number of columns
I'm not sure what's going on here. I suspect it may have something to do with the additional parameters I used whilst setting up the DGEList object to account for the CQN normalisation, my command was this:
peak.list <- DGEList(counts = count.matrix, lib.size = depth.matrix, group = rep(c("grp1", "grp2"), each = 3), genes = GC.count.matrix)
The offset I mentioned is added later: peak.list$offset <- cqn.ch1$glm.offset
and creates the following object:
An object of class "DGEList" $counts ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2 chr1_10002_10467 58 54 126 153 143 65 chr1_15409_15764 4 4 1 2 6 6 chr1_17404_17589 28 27 34 58 64 38 chr1_21206_21446 1 1 1 1 1 1 chr1_26045_26177 1 1 1 1 1 1 22010 more rows ... $samples group lib.size.ATAC16_1 lib.size.ATAC17_1 lib.size.ATAC18_2 lib.size.ATAC20_4 lib.size.ATAC22_1 lib.size.ATAC22_2 norm.factors ATAC16_1 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC17_1 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC18_2 grp1 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC20_4 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC22_1 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1 ATAC22_2 grp2 5620945 2511455 5162441 9085694 9852238 6172175 1 $genes length gccontent chr1_10002_10467 465 0.520430 chr1_15409_15764 355 0.602817 chr1_17404_17589 185 0.691892 chr1_21206_21446 240 0.566667 chr1_26045_26177 132 0.545455 22010 more rows ... $offset ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2 chr1_10002_10467 1.2106529 0.40383051 1.1225094 1.44570668 1.6393788 1.0711832 chr1_15409_15764 0.1482142 0.05310531 0.6880168 0.63767323 0.9767822 0.4124740 chr1_17404_17589 0.6204587 0.02583219 0.7587918 0.61683366 0.8382115 0.2851622 chr1_21206_21446 -0.3672278 -0.67452604 0.2488282 0.22727645 0.5559019 -0.1707992 chr1_26045_26177 -0.7680765 -0.76807653 -0.0846480 0.07907179 0.3615096 -0.3383576 22010 more rows ...
I could alter the count matrix earlier at the Diffbind stage but as I'd like to try different values for the read count cut off, and run several other optimisation steps using different cutoff values over a number of different datasets, I'd rather not have to run the entire script each time I want to change this. It would be much easier if I could sort this error and alter this in edgeR.
Any suggestions would be greatly appreciated.
Hi Aaron,
Thanks for the response.
Yes, the lib-size was the issue. Thanks for your help.
I tried updating my version of edgeR (3.18.1) but it doesn't go beyond this version. My R version is not the latest, so I'm sure it's down to that.
Is there any chance you could elaborate a little on your last point?
Until now, I have been using
peak.list$offset <- cqn.ch1$glm.offset
. This was recommended in the CQN vignette and it adds the GLM.offset matrix directly to the DGEList:Using the scaleOffset() function as you suggest:
peak.list <- scaleOffset(peak.list, offset = cqn.ch1$glm.offset)
... continued below ...
... the values are different for the offset in both cases due, no doubt, to the log-library size scaling you mentioned. The part I don't understand fully is your last point: 'adding an appropriate prior count to shrink expression values to a constant value (and thus, the log-fold changes to zero)', and by extension, how this would effect the
cpm
call when removing peaks with low read counts. As it stands I'm currently using:... and will replug this into the DGEList using (from page 49 in the edgeR guide):
However, I notice when checking
?cpm
that there arelog
,prior.count
andgene.length
parameters and I'm unsure whether or not to use these in my cpm call. For example, are these new values in the offset matrix log values, meaning I should set log = TRUE, and if so what should I set the prior.count to?Use of
scaleOffset()
doesn't affect thecpm()
call, as the latter does not use the offsets at all. The purpose ofscaleOffset()
is to ensure that the shrinkage of the log-fold changes is done correctly (see?predFC
). We added this function a few releases ago, so it is perhaps not surprising that other packages have yet to make full use of it.The documentation is pretty clear about the purpose of the other arguments, so just read it.
P.S. I would update R and then all Bioconductor packages. Don't expose yourself to known bugs if you can avoid it.
P.P.S. It would be remiss of me to not point out the problems with peak-calling in conjunction with using edgeR. In particular, peak calling and filtering does not constitute an independent filtering scheme, biasing the results of differential analyses with edgeR. See https://dx.doi.org/10.1093/nar/gku351 for more details. This is especially true when you have systematically different library sizes between groups, which results in asymmetry in peak calling power between your groups.
Thanks Aaron.
Both for your answers and advice. Your last point is interesting as we are hoping to make comparisons between groups with different library sizes.
I'll have a think about this, and have a better read of your paper in the next few days.