scRNA-seq pseudo-bulk differential expression analysis with pseudobulkDGE()
2
1
Entering edit mode
enricoferrero ▴ 660
@enricoferrero-6037
Last seen 3.0 years ago
Switzerland

Hello,

I'm trying to perform a simple differential expression analysis between two conditions across cell clusters, using pseudo-bulking of scRNA-seq data, here is a toy example:

# load libraries
library(tidyverse)
library(scater)
library(scuttle)
library(scran)

# create sce
set.seed(16)
sce <- mockSCE(ncells = 1000)
sce <- logNormCounts(sce)

# annotate sce
sce$sample <- rep(str_c("sample", 1:10), each = 100)
sce$celltype <- str_c("celltype", unname(kmeans(t(logcounts(sce)), centers=3)$cluster))
sce$condition <- rep(c("healthy", "disease"), each = 500)

# create pseudobulk se
se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])

Assuming I'm using aggregateAcrossCells() correctly, I would now like to compute, for each cell type, which genes are differentially expressed between disease and healthy pseudosamples. For my own clarity, I would like to be able to explicitly set the design to ~ 0 + condition and then get the results for the contrast conditiondisease - conditionhealthy.

How do I do this?

I tried a few things:

# differential expression analysis
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy")
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")

but in both cases I get a List of length 0 as a result, so I must be doing something wrong.

Thank you!


My sessionInfo() output:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/prog/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_haswellp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scran_1.17.14               scuttle_0.99.11             scater_1.17.4              
 [4] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [7] matrixStats_0.56.0          Biobase_2.48.0              GenomicRanges_1.40.0       
[10] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1           
[13] BiocGenerics_0.34.0         forcats_0.5.0               stringr_1.4.0              
[16] dplyr_1.0.1                 purrr_0.3.4                 readr_1.3.1                
[19] tidyr_1.1.1                 tibble_3.0.3                ggplot2_3.3.2              
[22] tidyverse_1.3.0            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6              fs_1.5.0                  lubridate_1.7.9          
 [4] httr_1.4.2                tools_4.0.2               backports_1.1.8          
 [7] R6_2.4.1                  irlba_2.3.3               vipor_0.4.5              
[10] DBI_1.1.0                 colorspace_1.4-1          withr_2.2.0              
[13] tidyselect_1.1.0          gridExtra_2.3             compiler_4.0.2           
[16] cli_2.0.2                 rvest_0.3.6               BiocNeighbors_1.6.0      
[19] xml2_1.3.2                scales_1.1.1              XVector_0.28.0           
[22] pkgconfig_2.0.3           dbplyr_1.4.4              limma_3.44.3             
[25] rlang_0.4.7               readxl_1.3.1              rstudioapi_0.11          
[28] DelayedMatrixStats_1.10.1 generics_0.0.2            jsonlite_1.7.0           
[31] BiocParallel_1.22.0       RCurl_1.98-1.2            magrittr_1.5             
[34] BiocSingular_1.4.0        GenomeInfoDbData_1.2.3    Matrix_1.2-18            
[37] Rcpp_1.0.4.6              ggbeeswarm_0.6.0          munsell_0.5.0            
[40] fansi_0.4.1               viridis_0.5.1             lifecycle_0.2.0          
[43] stringi_1.4.6             edgeR_3.30.3              zlibbioc_1.34.0          
[46] grid_4.0.2                blob_1.2.1                dqrng_0.2.1              
[49] crayon_1.3.4              lattice_0.20-41           haven_2.3.1              
[52] hms_0.5.3                 locfit_1.5-9.4            pillar_1.4.6             
[55] igraph_1.2.5              reprex_0.3.0              glue_1.4.1               
[58] packrat_0.5.0             modelr_0.1.8              vctrs_0.3.2              
[61] cellranger_1.1.0          gtable_0.3.0              assertthat_0.2.1         
[64] rsvd_1.0.3                broom_0.7.0               viridisLite_0.3.0        
[67] beeswarm_0.2.3            statmod_1.4.34            bluster_0.99.1           
[70] ellipsis_0.3.1
scran scater scuttle • 7.0k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 10 hours ago
The city by the bay
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy")

For starters, se$cluster is NULL. I assume you meant se$celltype, which yields the expected List. Also, your specified comparison doesn't make any sense; in a no-intercept model, coef="conditionhealthy" represents the null hypothesis that the healthy condition has an average log-expression of zero, rather than the log-fold change being zero.

dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")

The contrast is better this time, but the function isn't smart enough to run contrast through makeContrasts(). You need to give it some hints, so the final form of your function call should look like:

dea <- pseudoBulkDGE(se, label = se$celltype, condition = se$condition, design = ~ 0 + condition, coef=NULL, contrast = c(1, -1))

I guess I should add an option to pass in a string as well, in which case makeContrasts() is automatically called. I should also fix it so that you don't have to set coef=NULL here when you're specifying the contrast already. If someone can add an issue on the GitHub page to remind me, that would help.

I think your pseudoBulkDGE() step is going one step too far.

Tell me about it.

I was tired of manually setting up a for loop to repeat the process for multiple cell types, hence this function. Note that the intermediate DGEList and DGEGLM objects are packed into the metadata() of each DataFrame, so you can always pull them out and do stuff quickly on those objects. Of course, for anything more complex that doesn't follow the standard edgeR/voom pipeline, you'll have to write everything yourself.

Note that the muscat package has a pbDS function that does something very similar, but it expects a slightly different input. Specifically, it expects a SummarizedExperiment where each column corresponds to a sample-of-origin and each assay corresponds to a cell type, whereas pseudoBulkDGE expects a SummarizedExperiment with a single count matrix containing columns for all combinations of cell types and samples. Sometimes you happen to have the former, sometimes you get the latter, so you can choose how to proceed based on what you've got.

ADD COMMENT
0
Entering edit mode

Thank you Aaron!

I have opened an issue here on the coef and contrast arguments of pseudobulkDGE().

I have two follow-up questions:

  1. One reason I dislike numeric vector contrasts is that I'm never sure what is contrasted with what (and therefore the directionality of the resulting fold changes), especially in cases like this where I can't see the design matrix. How did you know that you had to set it up as c(1, -1) and not c(-1, 1) to model conditiondisease over conditionhealthy? I'm guessing that depends on the ordering of the se$condition factor? Assuming that's in alphabetical order (which may not always be the case), then disease is indeed the first level of the factor. Am I missing something?

  2. In my real-world example, all columns of all resulting DataFrames contain NAs. Do you have any ideas on what might be causing this?

i.e.:

> dea[[1]]
DataFrame with 33538 rows and 5 columns
                logFC    logCPM         F    PValue       FDR
            <numeric> <numeric> <numeric> <numeric> <numeric>
MIR1302-2HG        NA        NA        NA        NA        NA
FAM138A            NA        NA        NA        NA        NA
OR4F5              NA        NA        NA        NA        NA
AL627309.1         NA        NA        NA        NA        NA
AL627309.3         NA        NA        NA        NA        NA
...               ...       ...       ...       ...       ...
AC233755.2         NA        NA        NA        NA        NA
AC233755.1         NA        NA        NA        NA        NA
AC240274.1         NA        NA        NA        NA        NA
AC213203.1         NA        NA        NA        NA        NA
FAM231C            NA        NA        NA        NA        NA
ADD REPLY
1
Entering edit mode

1. In usual analyses, you would be able to see the design matrix so it is usually not too hard to manually construct an appropriate contrast matrix. In fact, if the columns of your design matrix do not have syntactically valid names, manual construction is really your only option because makeContrasts() won't be able to parse your expression properly.

In this particular case, because the design matrix is constructed inside the pseudoBulkDGE() function, you have to guess as to what the columns will be. Supplying a string to contrast= takes the guesswork out of the order of the columns. However, you will still be susceptible to the absence of levels (e.g., because a cell type isn't present in a particular condition), which can cause coefficients to drop out of the design matrix and for your coef= or contrast= to not make sense anymore.

2. This happens when your comparison cannot be performed, e.g., because the design matrix is not of full rank or when your comparison doesn't make sense for a cell type that only occurs in one condition. The BioC-devel version of the function will not bother to report DataFrames for such failures, instead registering the failed runs in the metadata() of the output List.

ADD REPLY
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States

I think your pseudoBulkDGE() step is going one step too far.

Take a look at the examples in the OSCA book

Assuming this step worked:

se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])

You would then create a DGEList from that, and go wild, ie. something like:

y.all <- DGEList(counts(se), samples = colData(se))
# ... then do the standard edgeR DGE moves from here on out ...
ADD COMMENT

Login before adding your answer.

Traffic: 927 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6