Our data come from single-cell sequencing. The goal is to identify disease DEGs for each cell type.
- Expression matrix. Take cell type A and disease A as an example. I summed the raw counts for this cell type within each individual to get a pseudobulk profile, and built the expression matrix from these profiles.
For filtering, I removed samples with fewer than 50 cells. I removed genes whose row sum is smaller than 10. I also kept only genes with expression above 0.5 CPM in at least 30% of the samples. CPM was used only for this filter: I did not normalize the expression matrix and still use raw counts as input.
For the test: I know several methods can be used: Wald, LRT, and LRT with fitType="glmGamPoi". The last one seems to be recommended for single-cell data and seems to relax the filtering criteria, but I am not sure whether that still holds for pseudobulk. So I only use the LRT with the default fitType here.
For covariates in the design, I considered experimental batch cohort, biological sex, PMI, average UMI counts per cell, and average number of genes detected per cell for each sample.
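As a rough illustration of the aggregation described above, here is a minimal base R sketch. The toy count matrix, the `cell_meta` table and its column names (`celltype`, `individual`) are made up for the example and are not from my actual data.

```r
# Toy single-cell count matrix: genes in rows, cells in columns (hypothetical data).
set.seed(1)
n_genes <- 5
n_cells <- 12
counts <- matrix(
  rpois(n_genes * n_cells, lambda = 3),
  nrow = n_genes,
  dimnames = list(paste0("gene", 1:n_genes), paste0("cell", 1:n_cells))
)

# Per-cell annotation: cell type and donor (hypothetical column names).
cell_meta <- data.frame(
  celltype   = rep(c("A", "B"), each = 6),
  individual = rep(c("ind1", "ind2", "ind3"), times = 4)
)

# Pseudobulk for one cell type: sum raw counts over all cells of each individual.
keep <- cell_meta$celltype == "A"
pseudobulk_A <- t(rowsum(t(counts[, keep]), group = cell_meta$individual[keep]))

# Number of cells behind each pseudobulk sample, useful for sample filtering later.
cells_per_sample <- table(cell_meta$individual[keep])
```

`rowsum()` sums rows by group, so the matrix is transposed to put cells in rows and then transposed back to genes by individuals.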
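The three filters can be sketched in base R as follows. The matrix `pb` and the cells-per-sample vector `n_cells` are simulated placeholders, and computing CPM after the row-sum filter is an assumption of this sketch, since the order is not fixed by the description above.

```r
set.seed(2)
# Hypothetical pseudobulk matrix (genes x samples) and cells contributing to each sample.
pb <- matrix(
  rnbinom(200 * 10, mu = 20, size = 1),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("s", 1:10))
)
n_cells <- setNames(c(30, 120, 80, 45, 200, 75, 60, 500, 90, 55), colnames(pb))

# 1) Drop samples with fewer than 50 cells.
pb <- pb[, n_cells >= 50, drop = FALSE]

# 2) Drop genes whose total count across samples is below 10.
pb <- pb[rowSums(pb) >= 10, , drop = FALSE]

# 3) Keep genes with CPM > 0.5 in at least 30% of the remaining samples.
#    CPM is used only to decide which genes to keep.
cpm <- t(t(pb) / colSums(pb)) * 1e6
keep_gene <- rowSums(cpm > 0.5) >= ceiling(0.3 * ncol(pb))

# The filtered matrix still holds raw integer counts, as DESeq2 expects.
pb_filtered <- pb[keep_gene, , drop = FALSE]
```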
library(DESeq2)
library(BiocParallel)
dds <- DESeqDataSetFromMatrix(countData = expression, colData = meta, design = ~ batchCohort + Biological_Sex + PMI + avecellumi + n_gene + Disease)
dds <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(10), test = "LRT", reduced = ~ batchCohort + Biological_Sex + PMI + avecellumi + n_gene)
I have some questions about the methods I use:
- Q1: Should I use glmGamPoi for single-cell pseudobulk?
- Q2: Between avecellumi and log(avecellumi), which is the better covariate?
- Q3: Would correlation between covariates, or between a covariate and the condition, affect the result?
- Q4: Are there any covariates worth considering for single-cell pseudobulk (e.g. number of cells per sample)? Or any covariates that should be removed from the design?
Thanks for your reply! This is so helpful!
Agree with @ATpoint.
Probably better to use the log of a positive, right-skewed covariate. Also good to center all covariates.
Yes, avoid correlated covariates. I've actually used RUV for pseudobulk data as well, which produces orthogonal nuisance variables. Often the RUV factors explain the known technical covariates anyway, and you don't need to include the known ones if you use the RUV factors in the design.
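A minimal sketch of the log transform and centering, using made-up covariate values. The derived column names (`log_avecellumi_c`, `PMI_c`) are just for illustration.

```r
# Hypothetical sample-level metadata with a right-skewed covariate.
meta <- data.frame(
  avecellumi = c(1200, 3500, 800, 15000, 2200, 4100),  # made-up values
  PMI        = c(4, 12, 7, 20, 9, 15)                  # made-up values
)

# Log-transform the positive, right-skewed covariate, then center it.
meta$log_avecellumi   <- log(meta$avecellumi)
meta$log_avecellumi_c <- meta$log_avecellumi - mean(meta$log_avecellumi)

# Center the other numeric covariates as well; scale() with scale = FALSE only subtracts the mean.
meta$PMI_c <- as.numeric(scale(meta$PMI, center = TRUE, scale = FALSE))
```

The centered columns would then replace the raw ones in the design formula.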
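One way to look for correlated covariates before fitting is sketched below in base R. The metadata are simulated, with `n_gene` deliberately constructed to track `avecellumi`, and the column names simply mirror those in the question. This is a diagnostic only, not the RUV approach mentioned above.

```r
set.seed(3)
n <- 12
# Simulated sample-level metadata (hypothetical values).
meta <- data.frame(
  Disease     = factor(rep(c("control", "case"), each = n / 2),
                       levels = c("control", "case")),
  batchCohort = factor(rep(c("b1", "b2", "b3"), times = n / 3)),
  PMI         = rnorm(n, mean = 10, sd = 3)
)
meta$avecellumi <- exp(rnorm(n, mean = 8, sd = 0.5))
meta$n_gene     <- 0.3 * meta$avecellumi + rnorm(n, mean = 0, sd = 50)  # built to be correlated

# Pairwise rank correlation between the numeric covariates.
num_cov <- meta[, c("PMI", "avecellumi", "n_gene")]
cor_mat <- cor(num_cov, method = "spearman")

# Rank correlation of each numeric covariate with the condition (coded 1/2).
cond_assoc <- sapply(num_cov, function(x) {
  cor(x, as.numeric(meta$Disease), method = "spearman")
})

# The design matrix must be full rank for the model to be estimable.
design <- model.matrix(~ batchCohort + PMI + avecellumi + n_gene + Disease, data = meta)
full_rank <- qr(design)$rank == ncol(design)
```

A high value in `cor_mat` suggests keeping only one covariate of the pair, and a strong entry in `cond_assoc` flags a covariate that is confounded with the condition.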
Actually depending on the sample size, glmGamPoi may or may not be faster. It is much faster with large matrices of repeated integer values.
Thanks a lot! I have modified my script based on nice suggestions from you all!
So are you saying pseudobulk data can essentially be treated as bulk data? If so, would a Wald test be appropriate for pseudobulk differential expression analysis? I saw these recommendations for single-cell analysis and assumed they applied to pseudobulk data too, but perhaps they were intended for analyses treating single cells as independent observations.
For pseudobulk you can treat it like normal bulk data, i.e. you don't have to follow those recommendations.
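Treating pseudobulk like bulk would look roughly like the sketch below, which uses the default Wald test. It runs on simulated counts from `makeExampleDESeqDataSet()` as a stand-in for a real pseudobulk matrix, and the `requireNamespace()` guard is only there so the sketch is skipped when DESeq2 is not installed.

```r
res_wald <- NULL
if (requireNamespace("DESeq2", quietly = TRUE)) {
  suppressPackageStartupMessages(library(DESeq2))
  set.seed(4)

  # Simulated stand-in for a pseudobulk matrix: 200 genes, 8 samples,
  # with a two-level factor `condition` (levels A and B) in colData.
  dds <- makeExampleDESeqDataSet(n = 200, m = 8)

  # Standard bulk workflow; test = "Wald" is the DESeq() default.
  dds <- DESeq(dds, quiet = TRUE)
  res_wald <- results(dds, name = "condition_B_vs_A")

  # Alternative, if the glmGamPoi package is installed (to be paired with the LRT):
  # dds <- DESeq(dds, test = "LRT", reduced = ~ 1, fitType = "glmGamPoi")
}
```

With real data, `dds` would come from `DESeqDataSetFromMatrix()` with the covariates in the design and the condition as the last term.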