edgeR::cpm() results in negative values from raw counts of 0
3
2
Entering edit mode
Kinnary ▴ 20
@b0cbf5fc
Last seen 18 hours ago
United States

Hello! I wanted to ask how the edgeR::cpm() function results in negative values from raw counts of 0.

I've looked at these posts from several years ago (Differences between limma voom E values and edgeR cpm values? and How does edgeR cpm function calculate log(CPM) values?). I also saw that the functions have been updated less than a year ago, so I wanted to make sure the information was up to date. I looked through the function calls, but the edgeR::cpm() function uses C code, so I couldn't exactly trace how a logcounts value of ~ -0.9 can be computed with a pseudo-bulk count of 0.

How exactly is the prior count scaled by library size?

Also tagging Leonardo Collado Torres

Thanks, Kinnary Shah

library(spatialLIBD)

spe <- spatialLIBD::fetch_data(type = "spe")
spe_pseudo <- registration_pseudobulk(spe, var_registration = "spatialLIBD", var_sample_id = "sample_id" )

# this function uses the below code to create the logcounts assay variable
# logcounts(sce_pseudo) <-
#            edgeR::cpm(edgeR::calcNormFactors(sce_pseudo),
#                log = TRUE,
#                prior.count = 1
#            )

edgeR::calcNormFactors(spe_pseudo)$samples["151671_WM",]
#              group lib.size norm.factors 
#  151671_WM     1   443047     1.013545

counts(spe_pseudo)["ENSG00000186827", "151671_WM"]
# 0

logcounts(spe_pseudo)["ENSG00000186827", "151671_WM"]
# -0.9157698

2^(-0.9157698) ## Undoing the log2()
# 0.530061
2^(-0.9157698) - 1 ## Undoing adding the +1 prior.count
# -0.469939 ## Different from the 0 pseudo-bulked count

genes_with_neg <- rownames(logcounts(spe_pseudo))[rowSums(logcounts(spe_pseudo) < 0) > 0]
head(genes_with_neg) # 7673 genes

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.6.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
    [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
    [1] spatialLIBD_1.19.6          SpatialExperiment_1.15.1    SingleCellExperiment_1.27.2
[4] SummarizedExperiment_1.35.3 Biobase_2.65.1              GenomicRanges_1.57.1       
[7] GenomeInfoDb_1.41.2         IRanges_2.39.2              S4Vectors_0.43.2           
[10] BiocGenerics_0.51.3         MatrixGenerics_1.17.0       matrixStats_1.4.1          

loaded via a namespace (and not attached):
    [1] RColorBrewer_1.1-3       rstudioapi_0.16.0        jsonlite_1.8.9           shape_1.4.6.1           
[5] magrittr_2.0.3           ggbeeswarm_0.7.2         magick_2.8.5             GlobalOptions_0.1.2     
[9] BiocIO_1.16.0            zlibbioc_1.51.1          vctrs_0.6.5              memoise_2.0.1           
[13] config_0.3.2             Rsamtools_2.22.0         paletteer_1.6.0          RCurl_1.98-1.16         
[17] benchmarkme_1.0.8        htmltools_0.5.8.1        S4Arrays_1.5.10          AnnotationHub_3.13.3    
[21] curl_5.2.3               BiocNeighbors_1.99.2     SparseArray_1.5.44       sass_0.4.9              
[25] bslib_0.8.0              htmlwidgets_1.6.4        plotly_4.10.4            cachem_1.1.0            
[29] GenomicAlignments_1.42.0 mime_0.12                lifecycle_1.0.4          iterators_1.0.14        
[33] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-0             R6_2.5.1                
[37] fastmap_1.2.0            GenomeInfoDbData_1.2.13  shiny_1.9.1              clue_0.3-65             
[41] digest_0.6.37            colorspace_2.1-1         rematch2_2.1.2           AnnotationDbi_1.67.0    
[45] scater_1.34.0            irlba_2.3.5.1            ExperimentHub_2.13.1     RSQLite_2.3.7           
[49] beachmat_2.21.6          filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
[53] abind_1.4-8              compiler_4.4.1           withr_3.0.1              bit64_4.5.2             
[57] doParallel_1.0.17        attempt_0.3.1            BiocParallel_1.39.0      viridis_0.6.5           
[61] DBI_1.2.3                sessioninfo_1.2.2        rappdirs_0.3.3           DelayedArray_0.31.14    
[65] rjson_0.2.23             tools_4.4.1              vipor_0.4.7              beeswarm_0.4.0          
[69] httpuv_1.6.15            glue_1.8.0               restfulr_0.0.15          promises_1.3.0          
[73] grid_4.4.1               cluster_2.1.6            generics_0.1.3           gtable_0.3.5            
[77] tidyr_1.3.1              data.table_1.16.0        ScaledMatrix_1.13.0      BiocSingular_1.21.4     
[81] utf8_1.2.4               XVector_0.45.0           stringr_1.5.1            ggrepel_0.9.6           
[85] BiocVersion_3.20.0       foreach_1.5.2            pillar_1.9.0             limma_3.61.12           
[89] later_1.3.2              circlize_0.4.16          benchmarkmeData_1.0.4    dplyr_1.1.4             
[93] BiocFileCache_2.13.0     lattice_0.22-6           rtracklayer_1.66.0       bit_4.5.0               
[97] tidyselect_1.2.1         ComplexHeatmap_2.21.1    locfit_1.5-9.10          Biostrings_2.73.2       
[101] scuttle_1.15.4           gridExtra_2.3            edgeR_4.3.16             statmod_1.5.0           
[105] DT_0.33                  stringi_1.8.4            UCSC.utils_1.1.0         lazyeval_0.2.2          
[109] yaml_2.3.10              shinyWidgets_0.8.7       codetools_0.2-20         tibble_3.2.1            
[113] BiocManager_1.30.25      cli_3.6.3                xtable_1.8-4             jquerylib_0.1.4         
[117] munsell_0.5.1            golem_0.5.1              Rcpp_1.0.13              dbplyr_2.5.0            
[121] png_0.1-8                XML_3.99-0.17            parallel_4.4.1           ggplot2_3.5.1           
[125] blob_1.2.4               bitops_1.0-9             viridisLite_0.4.2        scales_1.3.0            
[129] purrr_1.0.2              crayon_1.5.3             GetoptLong_1.0.5         rlang_1.1.4             
[133] cowplot_1.1.3            KEGGREST_1.45.1
edgeR • 794 views
ADD COMMENT
0
Entering edit mode

Thanks for posting this Kinnary. See also https://github.com/LieberInstitute/spatialLIBD/issues/106 for more details on the background.

ADD REPLY
2
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

It's calculated exactly how you might expect (it's called counts/million counts for a reason).

>  y <- matrix(rnbinom(20,size=1,mu=10),5,4)
> y
     [,1] [,2] [,3] [,4]
[1,]    7    5   20    6
[2,]    5    2    0    6
[3,]   39    8    9   11
[4,]   15   24    3   19
[5,]    0    9    5   14
> cpm(y)
>           [,1]      [,2]      [,3]
[1,] 106060.61 104166.67 540540.54
[2,]  75757.58  41666.67      0.00
[3,] 590909.09 166666.67 243243.24
[4,] 227272.73 500000.00  81081.08
[5,]      0.00 187500.00 135135.14
         [,4]
[1,] 107142.9
[2,] 107142.9
[3,] 196428.6
[4,] 339285.7
[5,] 250000.0

## Now by hand

> cs <- colSums(y)/1e6
> y/cs
          [,1]      [,2]     [,3]
[1,]  106060.6 104166.67 540540.5
[2,]  104166.7  54054.05      0.0
[3,] 1054054.1 142857.14 136363.6
[4,]  267857.1 363636.36  62500.0
[5,]       0.0 187500.00 135135.1
          [,4]
[1,] 107142.86
[2,]  90909.09
[3,] 229166.67
[4,] 513513.51
[5,] 250000.00

When taking logs, a prior count is added to account for those zeros (by default 2 is added)

> cpm(y, log = TRUE)
         [,1]     [,2]     [,3]
[1,] 17.03537 17.01637 19.03626
[2,] 16.69638 16.18595 15.13067
[3,] 19.15656 17.54006 17.99736
[4,] 17.91322 18.93157 16.76199
[5,] 15.13067 17.67949 17.29951
         [,4]
[1,] 17.04612
[2,] 17.04612
[3,] 17.73535
[4,] 18.42036
[5,] 18.03154

We can't recapitulate the log values exactly, which is explained by Gordon Smyth in the first link you reference.

But anyway, note that the zeros in this case end up being 15.3. That's because the column sums are small, and dividing by 1e6 means we divide by a small number when computing CPM. But normally you would have millions of counts per sample. As an example, let's say you have 20M reads for a sample, and there's a gene with a count of 1. When we compute CPM, it will be 0.05 (1/20), and if you take logs of 0.05, you get -4.32, and in fact any count < 20 will result in a negative logCPM because any value < 1 when logged is a negative number.

0
Entering edit mode

Thanks for the answer James! I had forgotten about the "PM" part of CPM when I met with Kinnary yesterday!

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

logCPM values decrease as the library size increases

edgeR::cpm computes logCPM values according to the formula log2( (y + y0) / (L + 2*y0) * 1e6), where y is the count, L is the normalized library size and y0 is the sample-specific prior count.

Consider a vector of counts, from 0 to 4:

> y <- matrix(0:4,5,1)
> row.names(y) <- 0:4

If the library size is small, then the logCPM values will be large:

> library(edgeR)
> L <- 100
> cpm(y, lib.size=L, log=TRUE, prior.count=1)
      [,1]
0 13.25914
1 14.25914
2 14.84411
3 15.25914
4 15.58107

The logCPM corresponding to zero can be confirmed as:

> log2( (0 + 1) / (L + 2) * 1e6)
[1] 13.25914

If the library size is larger, then the logCPM values will naturally decrease:

> L <- 1e5
> cpm(y, lib.size=L, log=TRUE, prior.count=1)
      [,1]
0 3.321899
1 4.321899
2 4.906862
3 5.321899
4 5.643827

If the library size is just under two million, then the logCPM value corresponding to a zero count will be much as you observed:

> L <- 1.9e6
> cpm(y, lib.size=L, log=TRUE, prior.count=1)
         [,1]
0 -0.92600094
1  0.07399906
2  0.65896156
3  1.07399906
4  1.39592716
> log2( (0 + 1) / (L + 2) * 1e6)
[1] -0.9260009

We want the logCPM corresponding to any fixed count to decrease as the library size increases. We don't apply any lower limit, so that the logCPMs will eventually become negative if the library size is large enough. This is all deliberate and I don't see any problem with negative values.

Math details

Our logCPM formula is designed to minimize the mean-square-error estimation error for the true logCPM values under a particular Bayesian model for the CPMs. The theory was indicated by Belinda Phipson in her PhD Thesis, although we have not made it public.

You might ask why the formula divides by L + 2*y0 instead of just L or L + y0. Partly, this is to ensure that the logCPM values are always less than log2(1e6). More exactly, it arises from some mathematical theory in which we apply add the prior count y0 to both the count for one gene and the total count for all other genes, and the effective library size is the sum of the two.

Finally, the sample-specific prior.counts are computed by y0 <- prior.count * L / median(L), where L is the vector of library sizes and prior.count is the average prior-count specified as an argument of the cpm function. For example:

> Y <- cbind(y,y,y,y)
> L <- c(100,200,300,400)
> cpm(Y, lib.size=L, log=TRUE, prior.count=1)
      [,1]     [,2]     [,3]     [,4]
0 11.95429 11.95429 11.95429 11.95429
1 13.76164 13.12421 12.82876 12.65473
2 14.53925 13.76164 13.36933 13.12421
3 15.04175 14.20222 13.76164 13.47785
4 15.41372 14.53925 14.06977 13.76164
> prior.count <- 1
> y0 <- prior.count * L / median(L)
> y0 <- matrix(y0,5,4,byrow=TRUE)
> L <- matrix(L,5,4,byrow=TRUE)
> log2( (Y+y0) / (L + 2*y0) * 1e6)
      [,1]     [,2]     [,3]     [,4]
0 11.95429 11.95429 11.95429 11.95429
1 13.76164 13.12421 12.82876 12.65473
2 14.53925 13.76164 13.36933 13.12421
3 15.04175 14.20222 13.76164 13.47785
4 15.41372 14.53925 14.06977 13.76164

Relationship to limma-voom

The calculation of logCPM by limma-voom uses exactly the same formula, except that the limma-voom prior-counts are not sample-specific. limma-vooma sets y0 <- 0.5, but otherwise the logCPM calculation is exactly as for edgeR. The use of sample-specific prior counts in edgeR ensures that, if a gene has the same CPM value in two different samples, then it will have identical logCPM values in the two samples as well.

Reference

Phipson, B. (2013). Empirical Bayes modelling of expression profiles and their associations. Unpublished PhD Thesis. University of Melbourne, Australia. http://hdl.handle.net/11343/38162

ADD COMMENT
0
Entering edit mode

Thank you Gordon for the detailed answer, examples, and link to the unpublished thesis!

ADD REPLY
0
Entering edit mode

I dove a bit deeper to understand more the differences between scuttle::logNormCounts() and edgeR::cpm(log = TRUE) at https://github.com/LieberInstitute/spatialLIBD/issues/106#issuecomment-2821662900. Thanks for the formulas! Now I understand more the differences.

Best, Leo

ADD REPLY
1
Entering edit mode
ATpoint ★ 4.8k
@atpoint-13662
Last seen 28 minutes ago
Germany

The function, when log = TRUE adds a small prior count to the counts before taking the log, avoiding taking the log of 0. If that prior is smaller than 1 then, as for any log transformation, values turn negative. This is normal and expected.

That having said, for DE analysis such as limma-trend, or downstream analysis such as clustering or PCA it's perfectly fine. It gets annoying when plotting such data as the negative values are unintuitive. For plotting I always use log2(cpm+1) where cpm is edgeR::cpm(x, log = FALSE).

ADD COMMENT
2
Entering edit mode

Thanks answering this question. Let me say though that I've never understood why people would find negative log values to be annoying or unintuitive. The log transformation is used throughout mathematics to transform the positive real line to the whole read line. The possibility of getting negative values is not only natural but fundamental to the transformation. The whole point of the transformation is that the resulting values are unconstrained. Artifically demanding that the log values be positive introduces a new constraint that makes the math less natural and, to my mind, less intuitive.

I also prefer to let the cpm offset depend on the library sizes rather than using a preset value like 1 because, to my mind, there should be less shrinkage on the logFC scale when the counts are large and more shrinkage when they are small. To my mind, it is the prior count distribution that should be constant rather than the logFC shrinkage. The way that edgeR computes logCPM may seem more complex, but it has the property of minimizing the mean-square-error of the logFCs between samples under a certain prior distribution on the true logFCs. (This was all part of Belinda Phipson's PhD thesis long ago, see https://hdl.handle.net/11343/38162. We never did publish it, but perhaps we should have.)

ADD REPLY
0
Entering edit mode

One purely practical advantage to keeping zeros zero in particular is that sparse matrices stay sparse (using the log1p function). I think this is why this is done with single-cell count data. They don't use CPM as the units so the pseudo-count isn't forced to be 1 CPM.

Another advantage of keeping zeros zero I've seen, again with single-cell data, is that it allows Non-negative Matrix Factorization to be applied. But there may be better ways to go about NMF than this.

ADD REPLY
0
Entering edit mode

Mmm, I am not at all attracted to those approaches. Perhaps the use of log(cpm+1) might facilitate the use of sparse methods in other packages (I'm not familiar with that), but it isn't statistically optimal, for the reasons I have given.

Ayway, we are discussing edgeR here, and edgeR doesn't support sparse objects, so there is no need to make statistical compromises.

ADD REPLY

Login before adding your answer.

Traffic: 1020 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6