Question

edgeR error when trying to filter peaks with low read counts

0

Entering edit mode

camerond • 0

@camerond-15316

Last seen 11 months ago

United Kingdom

Hello,

I have successfully run edgeR on an ATAC-seq dataset to identify differential open chromatin peaks across two cell types. I wanted to normalise for peaks read count, GC content and peak width, so my method was to create a consensus peak set in Diffbind (2 cell types, 3 replicates per cell type), then to use the CQN package for normalisation steps. This creates a GLM offset to add to the DGEList object in edgeR that accounts for the normalisation.

The issue I have is that when I try to filter my DGEList object in edgeR to remove 'peaks' that have very low read counts using the command:

keep <- rowSums(cpm(peak.list) > 2) >= 3

I get the following error:

Error in cpm.default(x$counts, lib.size = lib.size, log = log, prior.count = prior.count) : 
  matrix dimensions are not consistent for non-repeating number of columns

I'm not sure what's going on here. I suspect it may have something to do with the additional parameters I used whilst setting up the DGEList object to account for the CQN normalisation, my command was this:

peak.list <- DGEList(counts = count.matrix, lib.size = depth.matrix, group = rep(c("grp1", "grp2"), each = 3), genes = GC.count.matrix)

The offset I mentioned is added later: peak.list$offset <- cqn.ch1$glm.offset and creates the following object:

An object of class "DGEList"
$counts
                 ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467       58       54      126      153      143       65
chr1_15409_15764        4        4        1        2        6        6
chr1_17404_17589       28       27       34       58       64       38
chr1_21206_21446        1        1        1        1        1        1
chr1_26045_26177        1        1        1        1        1        1
22010 more rows ...

$samples
         group lib.size.ATAC16_1 lib.size.ATAC17_1 lib.size.ATAC18_2 lib.size.ATAC20_4 lib.size.ATAC22_1 lib.size.ATAC22_2 norm.factors
ATAC16_1  grp1           5620945           2511455           5162441           9085694           9852238           6172175            1
ATAC17_1  grp1           5620945           2511455           5162441           9085694           9852238           6172175            1
ATAC18_2  grp1           5620945           2511455           5162441           9085694           9852238           6172175            1
ATAC20_4  grp2           5620945           2511455           5162441           9085694           9852238           6172175            1
ATAC22_1  grp2           5620945           2511455           5162441           9085694           9852238           6172175            1
ATAC22_2  grp2           5620945           2511455           5162441           9085694           9852238           6172175            1

$genes
                 length gccontent
chr1_10002_10467    465  0.520430
chr1_15409_15764    355  0.602817
chr1_17404_17589    185  0.691892
chr1_21206_21446    240  0.566667
chr1_26045_26177    132  0.545455
22010 more rows ...

$offset
                   ATAC16_1    ATAC17_1   ATAC18_2   ATAC20_4  ATAC22_1   ATAC22_2
chr1_10002_10467  1.2106529  0.40383051  1.1225094 1.44570668 1.6393788  1.0711832
chr1_15409_15764  0.1482142  0.05310531  0.6880168 0.63767323 0.9767822  0.4124740
chr1_17404_17589  0.6204587  0.02583219  0.7587918 0.61683366 0.8382115  0.2851622
chr1_21206_21446 -0.3672278 -0.67452604  0.2488282 0.22727645 0.5559019 -0.1707992
chr1_26045_26177 -0.7680765 -0.76807653 -0.0846480 0.07907179 0.3615096 -0.3383576
22010 more rows ...

I could alter the count matrix earlier at the Diffbind stage but as I'd like to try different values for the read count cut off, and run several other optimisation steps using different cutoff values over a number of different datasets, I'd rather not have to run the entire script each time I want to change this. It would be much easier if I could sort this error and alter this in edgeR.

Any suggestions would be greatly appreciated.

edger R filtering • 1.8k views

ADD COMMENT • link 6.9 years ago camerond • 0

score 3 · Accepted Answer · 2018-04-18

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 5 hours ago

The city by the bay

lib.size= should be a vector, not a matrix, see ?DGEList.

With your current set-up, there is no $samples$lib.size, resulting in a library size vector of length zero. This eventually results in errors with respect to matrix dimensions. Also, you don't mention what version of edgeR you're using, but the reported error message no longer exists in the current release version (3.20.9).

Finally, if you're setting offsets, you should use the scaleOffset() function. This ensures that the offsets are interpretable on the same scale as log-library sizes, which is important when adding an appropriate prior count to shrink expression values to a constant value (and thus, the log-fold changes to zero).

ADD COMMENT • link 6.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks for the response.

Yes, the lib-size was the issue. Thanks for your help.

I tried updating my version of edgeR (3.18.1) but it doesn't go beyond this version. My R version is not the latest, so I'm sure it's down to that.

Is there any chance you could elaborate a little on your last point?

Until now, I have been using peak.list$offset <- cqn.ch1$glm.offset. This was recommended in the CQN vignette and it adds the GLM.offset matrix directly to the DGEList:

An object of class "DGEList"
$counts
                 ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467       58       54      126      153      143       65
chr1_15409_15764        4        4        1        2        6        6
chr1_17404_17589       28       27       34       58       64       38
chr1_21206_21446        1        1        1        1        1        1
chr1_26045_26177        1        1        1        1        1        1
22010 more rows ...

$samples
         group lib.size norm.factors
ATAC16_1  grp1  5620945            1
ATAC17_1  grp1  2511455            1
ATAC18_2  grp1  5162441            1
ATAC20_4  grp2  9085694            1
ATAC22_1  grp2  9852238            1
ATAC22_2  grp2  6172175            1

$genes
                 length gccontent
chr1_10002_10467    465  0.520430
chr1_15409_15764    355  0.602817
chr1_17404_17589    185  0.691892
chr1_21206_21446    240  0.566667
chr1_26045_26177    132  0.545455
22010 more rows ...

$offset
                   ATAC16_1    ATAC17_1    ATAC18_2   ATAC20_4  ATAC22_1   ATAC22_2
chr1_10002_10467  1.2100869  0.40413134  1.12272869 1.44601091 1.6394837  1.0708546
chr1_15409_15764  0.1480778  0.05321138  0.68757220 0.63722860 0.9763376  0.4125135
chr1_17404_17589  0.6200409  0.02589410  0.75829204 0.61710300 0.8383258  0.2853416
chr1_21206_21446 -0.3676724 -0.67495521  0.24838355 0.22683182 0.5554573 -0.1712439
chr1_26045_26177 -0.7682129 -0.76821293 -0.08509263 0.07862716 0.3610650 -0.3388022
22010 more rows ...

Using the scaleOffset() function as you suggest: peak.list <- scaleOffset(peak.list, offset = cqn.ch1$glm.offset)

An object of class "DGEList"
$counts
                 ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467       58       54      126      153      143       65
chr1_15409_15764        4        4        1        2        6        6
chr1_17404_17589       28       27       34       58       64       38
chr1_21206_21446        1        1        1        1        1        1
chr1_26045_26177        1        1        1        1        1        1
22010 more rows ...

$samples
         group lib.size norm.factors
ATAC16_1  grp1  5620945            1
ATAC17_1  grp1  2511455            1
ATAC18_2  grp1  5162441            1
ATAC20_4  grp2  9085694            1
ATAC22_1  grp2  9852238            1
ATAC22_2  grp2  6172175            1

$genes
                 length gccontent
chr1_10002_10467    465  0.520430
chr1_15409_15764    355  0.602817
chr1_17404_17589    185  0.691892
chr1_21206_21446    240  0.566667
chr1_26045_26177    132  0.545455
22010 more rows ...

$offset
                 ATAC16_1 ATAC17_1 ATAC18_2 ATAC20_4 ATAC22_1 ATAC22_2
chr1_10002_10467 15.64392 14.83796 15.55656 15.87984 16.07332 15.50469
chr1_15409_15764 15.24497 15.15010 15.78446 15.73412 16.07323 15.50940
chr1_17404_17589 15.67859 15.08444 15.81684 15.67565 15.89687 15.34389
chr1_21206_21446 15.24558 14.93829 15.86163 15.84008 16.16870 15.44200
chr1_26045_26177 15.06794 15.06794 15.75106 15.91478 16.19722 15.49735
22010 more rows ...

... continued below ...

ADD REPLY • link 6.9 years ago camerond • 0

0

Entering edit mode

... the values are different for the offset in both cases due, no doubt, to the log-library size scaling you mentioned. The part I don't understand fully is your last point: 'adding an appropriate prior count to shrink expression values to a constant value (and thus, the log-fold changes to zero)', and by extension, how this would effect the cpm call when removing peaks with low read counts. As it stands I'm currently using:

keep <- rowSums(cpm(peak.list, lib.size = depth.vector) > 2, ) >= 3

... and will replug this into the DGEList using (from page 49 in the edgeR guide):

peak.list <- peak.list[keep, , keep.lib.sizes=FALSE]

However, I notice when checking ?cpm that there are log, prior.count and gene.length parameters and I'm unsure whether or not to use these in my cpm call. For example, are these new values in the offset matrix log values, meaning I should set log = TRUE, and if so what should I set the prior.count to?

ADD REPLY • link 6.9 years ago camerond • 0

1

Entering edit mode

Use of scaleOffset() doesn't affect the cpm() call, as the latter does not use the offsets at all. The purpose of scaleOffset() is to ensure that the shrinkage of the log-fold changes is done correctly (see ?predFC). We added this function a few releases ago, so it is perhaps not surprising that other packages have yet to make full use of it.

The documentation is pretty clear about the purpose of the other arguments, so just read it.

P.S. I would update R and then all Bioconductor packages. Don't expose yourself to known bugs if you can avoid it.

P.P.S. It would be remiss of me to not point out the problems with peak-calling in conjunction with using edgeR. In particular, peak calling and filtering does not constitute an independent filtering scheme, biasing the results of differential analyses with edgeR. See https://dx.doi.org/10.1093/nar/gku351 for more details. This is especially true when you have systematically different library sizes between groups, which results in asymmetry in peak calling power between your groups.

ADD REPLY • link 6.9 years ago • updated 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron.

Both for your answers and advice. Your last point is interesting as we are hoping to make comparisons between groups with different library sizes.

I'll have a think about this, and have a better read of your paper in the next few days.

ADD REPLY • link 6.8 years ago camerond • 0