Question

edgeR::cpm() results in negative values from raw counts of 0

1

Entering edit mode

Kinnary ▴ 10

@b0cbf5fc

Last seen 3 hours ago

United States

Hello! I wanted to ask how the edgeR::cpm() function results in negative values from raw counts of 0.

I've looked at these posts from several years ago (Differences between limma voom E values and edgeR cpm values? and How does edgeR cpm function calculate log(CPM) values?). I also saw that the functions have been updated less than a year ago, so I wanted to make sure the information was up to date. I looked through the function calls, but the edgeR::cpm() function uses C code, so I couldn't exactly trace how a logcounts value of ~ -0.9 can be computed with a pseudo-bulk count of 0.

How exactly is the prior count scaled by library size?

Also tagging Leonardo Collado Torres

Thanks, Kinnary Shah

library(spatialLIBD)

spe <- spatialLIBD::fetch_data(type = "spe")
spe_pseudo <- registration_pseudobulk(spe, var_registration = "spatialLIBD", var_sample_id = "sample_id" )

# this function uses the below code to create the logcounts assay variable
# logcounts(sce_pseudo) <-
#            edgeR::cpm(edgeR::calcNormFactors(sce_pseudo),
#                log = TRUE,
#                prior.count = 1
#            )

edgeR::calcNormFactors(spe_pseudo)$samples["151671_WM",]
#              group lib.size norm.factors 
#  151671_WM     1   443047     1.013545

counts(spe_pseudo)["ENSG00000186827", "151671_WM"]
# 0

logcounts(spe_pseudo)["ENSG00000186827", "151671_WM"]
# -0.9157698

2^(-0.9157698) ## Undoing the log2()
# 0.530061
2^(-0.9157698) - 1 ## Undoing adding the +1 prior.count
# -0.469939 ## Different from the 0 pseudo-bulked count

genes_with_neg <- rownames(logcounts(spe_pseudo))[rowSums(logcounts(spe_pseudo) < 0) > 0]
head(genes_with_neg) # 7673 genes

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.6.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
    [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
    [1] spatialLIBD_1.19.6          SpatialExperiment_1.15.1    SingleCellExperiment_1.27.2
[4] SummarizedExperiment_1.35.3 Biobase_2.65.1              GenomicRanges_1.57.1       
[7] GenomeInfoDb_1.41.2         IRanges_2.39.2              S4Vectors_0.43.2           
[10] BiocGenerics_0.51.3         MatrixGenerics_1.17.0       matrixStats_1.4.1          

loaded via a namespace (and not attached):
    [1] RColorBrewer_1.1-3       rstudioapi_0.16.0        jsonlite_1.8.9           shape_1.4.6.1           
[5] magrittr_2.0.3           ggbeeswarm_0.7.2         magick_2.8.5             GlobalOptions_0.1.2     
[9] BiocIO_1.16.0            zlibbioc_1.51.1          vctrs_0.6.5              memoise_2.0.1           
[13] config_0.3.2             Rsamtools_2.22.0         paletteer_1.6.0          RCurl_1.98-1.16         
[17] benchmarkme_1.0.8        htmltools_0.5.8.1        S4Arrays_1.5.10          AnnotationHub_3.13.3    
[21] curl_5.2.3               BiocNeighbors_1.99.2     SparseArray_1.5.44       sass_0.4.9              
[25] bslib_0.8.0              htmlwidgets_1.6.4        plotly_4.10.4            cachem_1.1.0            
[29] GenomicAlignments_1.42.0 mime_0.12                lifecycle_1.0.4          iterators_1.0.14        
[33] pkgconfig_2.0.3          rsvd_1.0.5               Matrix_1.7-0             R6_2.5.1                
[37] fastmap_1.2.0            GenomeInfoDbData_1.2.13  shiny_1.9.1              clue_0.3-65             
[41] digest_0.6.37            colorspace_2.1-1         rematch2_2.1.2           AnnotationDbi_1.67.0    
[45] scater_1.34.0            irlba_2.3.5.1            ExperimentHub_2.13.1     RSQLite_2.3.7           
[49] beachmat_2.21.6          filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
[53] abind_1.4-8              compiler_4.4.1           withr_3.0.1              bit64_4.5.2             
[57] doParallel_1.0.17        attempt_0.3.1            BiocParallel_1.39.0      viridis_0.6.5           
[61] DBI_1.2.3                sessioninfo_1.2.2        rappdirs_0.3.3           DelayedArray_0.31.14    
[65] rjson_0.2.23             tools_4.4.1              vipor_0.4.7              beeswarm_0.4.0          
[69] httpuv_1.6.15            glue_1.8.0               restfulr_0.0.15          promises_1.3.0          
[73] grid_4.4.1               cluster_2.1.6            generics_0.1.3           gtable_0.3.5            
[77] tidyr_1.3.1              data.table_1.16.0        ScaledMatrix_1.13.0      BiocSingular_1.21.4     
[81] utf8_1.2.4               XVector_0.45.0           stringr_1.5.1            ggrepel_0.9.6           
[85] BiocVersion_3.20.0       foreach_1.5.2            pillar_1.9.0             limma_3.61.12           
[89] later_1.3.2              circlize_0.4.16          benchmarkmeData_1.0.4    dplyr_1.1.4             
[93] BiocFileCache_2.13.0     lattice_0.22-6           rtracklayer_1.66.0       bit_4.5.0               
[97] tidyselect_1.2.1         ComplexHeatmap_2.21.1    locfit_1.5-9.10          Biostrings_2.73.2       
[101] scuttle_1.15.4           gridExtra_2.3            edgeR_4.3.16             statmod_1.5.0           
[105] DT_0.33                  stringi_1.8.4            UCSC.utils_1.1.0         lazyeval_0.2.2          
[109] yaml_2.3.10              shinyWidgets_0.8.7       codetools_0.2-20         tibble_3.2.1            
[113] BiocManager_1.30.25      cli_3.6.3                xtable_1.8-4             jquerylib_0.1.4         
[117] munsell_0.5.1            golem_0.5.1              Rcpp_1.0.13              dbplyr_2.5.0            
[121] png_0.1-8                XML_3.99-0.17            parallel_4.4.1           ggplot2_3.5.1           
[125] blob_1.2.4               bitops_1.0-9             viridisLite_0.4.2        scales_1.3.0            
[129] purrr_1.0.2              crayon_1.5.3             GetoptLong_1.0.5         rlang_1.1.4             
[133] cowplot_1.1.3            KEGGREST_1.45.1

edgeR • 181 views

ADD COMMENT • link updated 6 hours ago by ATpoint ★ 4.8k • written 7 hours ago by Kinnary ▴ 10

0

Entering edit mode

Thanks for posting this Kinnary. See also https://github.com/LieberInstitute/spatialLIBD/issues/106 for more details on the background.

ADD REPLY • link 6 hours ago Leonardo Collado Torres ★ 1.1k

score 1 · Answer 1 · 2025-04-21

It's calculated exactly how you might expect (it's called counts/million counts for a reason).

>  y <- matrix(rnbinom(20,size=1,mu=10),5,4)
> y
     [,1] [,2] [,3] [,4]
[1,]    7    5   20    6
[2,]    5    2    0    6
[3,]   39    8    9   11
[4,]   15   24    3   19
[5,]    0    9    5   14
> cpm(y)
>           [,1]      [,2]      [,3]
[1,] 106060.61 104166.67 540540.54
[2,]  75757.58  41666.67      0.00
[3,] 590909.09 166666.67 243243.24
[4,] 227272.73 500000.00  81081.08
[5,]      0.00 187500.00 135135.14
         [,4]
[1,] 107142.9
[2,] 107142.9
[3,] 196428.6
[4,] 339285.7
[5,] 250000.0

## Now by hand

> cs <- colSums(y)/1e6
> y/cs
          [,1]      [,2]     [,3]
[1,]  106060.6 104166.67 540540.5
[2,]  104166.7  54054.05      0.0
[3,] 1054054.1 142857.14 136363.6
[4,]  267857.1 363636.36  62500.0
[5,]       0.0 187500.00 135135.1
          [,4]
[1,] 107142.86
[2,]  90909.09
[3,] 229166.67
[4,] 513513.51
[5,] 250000.00

When taking logs, a prior count is added to account for those zeros (by default 2 is added)

> cpm(y, log = TRUE)
         [,1]     [,2]     [,3]
[1,] 17.03537 17.01637 19.03626
[2,] 16.69638 16.18595 15.13067
[3,] 19.15656 17.54006 17.99736
[4,] 17.91322 18.93157 16.76199
[5,] 15.13067 17.67949 17.29951
         [,4]
[1,] 17.04612
[2,] 17.04612
[3,] 17.73535
[4,] 18.42036
[5,] 18.03154

We can't recapitulate the log values exactly, which is explained by Gordon Smyth in the first link you reference.

But anyway, note that the zeros in this case end up being 15.3. That's because the column sums are small, and dividing by 1e6 means we divide by a small number when computing CPM. But normally you would have millions of counts per sample. As an example, let's say you have 20M reads for a sample, and there's a gene with a count of 1. When we compute CPM, it will be 0.05 (1/20), and if you take logs of 0.05, you get -4.32, and in fact any count < 20 will result in a negative logCPM because any value < 1 when logged is a negative number.

score 1 · Answer 2 · 2025-04-21

The function, when log = TRUE adds a small prior count to the counts before taking the log, avoiding taking the log of 0. If that prior is smaller than 1 then, as for any log transformation, values turn negative. This is normal and expected.

That having said, for DE analysis such as limma-trend, or downstream analysis such as clustering or PCA it's perfectly fine. It gets annoying when plotting such data as the negative values are unintuitive. For plotting I always use log2(cpm+1) where cpm is edgeR::cpm(x, log = FALSE).