Hello.
My PI is interested in comparing groups of cells on a cell-to-cell basis using pseudobulking, and our groups have differing numbers of cells. For that reason, I suggested factoring cell numbers into the normalization process so that we also get DE results at the tissue level. To illustrate, I made the following table to simulate some of our data. It has two groups of cells, A (3 cells) and B (6 cells), and the values are raw counts per cell. Groups A and B are the same cell type (let's say hepatocytes), but come from different samples corresponding to different experimental conditions.
Gene    Group A (3 cells)    Group B (6 cells)
Glul    9  9  9              0  0  0  1  0  0
Airn    6  9  10             1  1  2  1  3  1
Lgr5    7  7  8              4  5  5  5  4  3
Gapdh   5  5  5              4  5  5  5  4  4
When these cells are pseudobulked by sample, the following table is generated.
Gene    Group A    Group B
Glul    27         1
Airn    25         9
Lgr5    22         26
Gapdh   15         27
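For anyone who wants to reproduce the pseudobulk table, here is a minimal Python sketch that sums the simulated per-cell counts within each group. The variable names (`counts`, `n_a`, `pseudobulk`) are just illustrative, and a real analysis would aggregate per sample from the full count matrix rather than from a hard-coded dictionary.

```python
# Simulated raw counts per cell: the first 3 values are group A, the last 6 are group B.
counts = {
    "Glul":  [9, 9, 9, 0, 0, 0, 1, 0, 0],
    "Airn":  [6, 9, 10, 1, 1, 2, 1, 3, 1],
    "Lgr5":  [7, 7, 8, 4, 5, 5, 5, 4, 3],
    "Gapdh": [5, 5, 5, 4, 5, 5, 5, 4, 4],
}
n_a = 3  # number of group A cells (assumption taken from the example above)

# Pseudobulk = sum of raw counts over all cells in a group, per gene.
pseudobulk = {
    gene: {"A": sum(values[:n_a]), "B": sum(values[n_a:])}
    for gene, values in counts.items()
}

for gene, totals in pseudobulk.items():
    print(gene, totals["A"], totals["B"])
```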
Group A has half as many cells as group B, while the total number of cells is approximately the same in both experimental conditions (which is also reflected in our lab data at the tissue level). I therefore proposed dividing the default normalization factors of group B by the following value Z to obtain tissue-level differential expression results.
X = Ratio of group A hepatocytes to total cells in condition 1 = 30/300 = 0.10
Y = Ratio of group B hepatocytes to total cells in condition 2 = 61/300 ≈ 0.20
Z = Y/X ≈ 2.0
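A quick way to recompute X, Y, and Z from the cell counts quoted above (30, 61, and 300), so the values can be checked or updated if the counts change. This is only a sketch; the variable names are made up for illustration.

```python
# Cell counts quoted in the post (hepatocytes and total cells per condition).
hep_a, total_cond1 = 30, 300   # group A hepatocytes, condition 1
hep_b, total_cond2 = 61, 300   # group B hepatocytes, condition 2

x = hep_a / total_cond1        # proportion of hepatocytes in condition 1
y = hep_b / total_cond2        # proportion of hepatocytes in condition 2
z = y / x                      # relative abundance of group B vs. group A

print(f"X = {x:.3f}, Y = {y:.3f}, Z = {z:.3f}")
```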
I believe the default normalization factors allow for a cell-to-cell comparison between the pseudobulked samples. To compare at the tissue level instead, I think dividing group B's default normalization factor by Z should accurately reflect the strength of gene expression differences in the tissue. My reasoning is that halving a group's normalization factor doubles that group's normalized expression, and when the proportion of a cell type among all cells doubles, the total expression contributed by that cell type should scale linearly with it.
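To make the scaling argument concrete, here is a small sketch. It assumes normalized expression is computed as counts per million against an effective library size of (library size x normalization factor), which is a common convention but an assumption here; the `cpm` helper, the normalization factor of 1.0, and Z = 2.0 are all hypothetical. The count and library size are the Lgr5 and column totals for group B from the simulated pseudobulk table.

```python
def cpm(count, lib_size, norm_factor):
    """Counts per million using an effective library size of lib_size * norm_factor."""
    return count / (lib_size * norm_factor) * 1e6

z = 2.0                    # hypothetical abundance ratio
count = 26                 # Lgr5 pseudobulk count in group B
lib_size = 1 + 9 + 26 + 27 # total pseudobulk counts in group B
norm_factor = 1.0          # hypothetical default normalization factor

baseline = cpm(count, lib_size, norm_factor)
adjusted = cpm(count, lib_size, norm_factor / z)

# Dividing the normalization factor by z multiplies the normalized value by z.
print(adjusted / baseline)
```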
In other words, suppose a B cell expresses a particular gene at twice the level of an A cell, and B cells are also twice as abundant in their sample as A cells are in theirs. Then the total fold change for that gene between all A cells and all B cells should be 4 (2 times between individual A and B cells x 2 times as many B cells as A cells).
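The same multiplication can be checked on the simulated data. The sketch below uses Gapdh and the raw cell numbers (3 vs. 6) in place of the tissue proportions, which is a simplification of the proposal; the variable names are illustrative only.

```python
n_a, n_b = 3, 6            # cells per group in the simulated example
total_a, total_b = 15, 27  # Gapdh pseudobulk counts for groups A and B

# Per-cell fold change: mean count per B cell over mean count per A cell.
per_cell_fc = (total_b / n_b) / (total_a / n_a)

# Abundance ratio: how many more B cells there are than A cells.
abundance_ratio = n_b / n_a

# Total fold change between all B cells and all A cells.
total_fc = total_b / total_a

# The total fold change factors into per-cell change times abundance change.
print(per_cell_fc, abundance_ratio, total_fc)
```

With a per-cell fold change of 2 and an abundance ratio of 2, the same product gives the factor of 4 described above.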
Does this make sense? Please let me know your thoughts and suggestions. Best, Skanda
Understood, thank you! We also do have replicates of single-cell samples (3 of each condition).
Ah, that's good. I would take this approach. In my opinion, though, it's hard to disentangle sequencing depth from the number of cells per cell type when doing DE (and sequencing depth can be confounded with cell type). So just doing the cell-type-level DE across actual samples is what makes more sense to me.