Question

Mitigating cell cycle effect in scRNA-seq using a blocking factor or design matrix

s1437643 ▴ 20

I have an obvious cell cycle effect in my scRNA-seq data (treatment versus control layout) that I would like to mitigate in my analysis. My plan was to use the cell cycle assignment scores (calculated by cyclone()) as covariates in removeBatchEffect() to create a 'corrected' expression matrix, but to perform feature selection and marker detection on the 'uncorrected' expression matrix using the cell cycle assignments as blocking terms.

However, I'm confused about whether to use a blocking factor (the cell cycle assignments) or a design matrix of covariates (the cell cycle assignment scores) in some of the downstream functions in the scran package. For example, modelGeneVar() can take either a blocking factor or a design matrix. Would the blocking factor simply be a character vector of cell cycle assignments (e.g. G1, G2M, or S), and the design be the same matrix of covariates as supplied to removeBatchEffect()? findMarkers() can also take either a blocking factor or a design matrix. Is there a reason to prefer one over the other?

scran • 1.7k views

updated 5.5 years ago by Aaron Lun ★ 28k • written 5.5 years ago by s1437643 ▴ 20

score 0 · Answer 1 · 2019-09-27

As a general rule, block= is always safer than design=. The former literally processes each block separately and combines the results, which allows us to handle differences in the mean-variance trend between blocks (in modelGeneVar()) or differences in variance between groups (in findMarkers()). Using a design matrix causes these methods to switch to linear models, which make more assumptions about how similar the different blocking levels are. Nonetheless, design= may be necessary in some cases. For example, if all cells in one cluster are in one blocking level and all cells in another cluster are in another level, it's not possible to compare those clusters using findMarkers() with block=. These points are discussed briefly in the documentation.
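To make the two options concrete, here is a minimal sketch of what each argument might look like. It runs on simulated data from scuttle's mockSCE(), and the phase labels, scores and cluster labels are all made up to stand in for real cyclone() output and real clustering. The intercept handling for findMarkers() follows my recollection of the scran examples, so treat it as an assumption and check the documentation for your version.

```r
library(scran)    # modelGeneVar(), findMarkers()
library(scuttle)  # mockSCE(), logNormCounts()

set.seed(100)

# Simulated stand-in for a real dataset (hypothetical; replace with your own SCE).
sce <- mockSCE(ncells = 200, ngenes = 1000)
sce <- logNormCounts(sce)

# Made-up values standing in for cyclone() output:
# 'phase' mimics the $phases vector, 'scores' mimics the G1/G2M columns of $scores.
phase <- factor(sample(c("G1", "S", "G2M"), ncol(sce), replace = TRUE))
scores <- cbind(G1 = runif(ncol(sce)), G2M = runif(ncol(sce)))

# Made-up cluster labels for marker detection.
clusters <- factor(sample(c("A", "B", "C"), ncol(sce), replace = TRUE))

# block=: a factor (or character vector) with one phase label per cell.
dec.block <- modelGeneVar(sce, block = phase)
markers.block <- findMarkers(sce, groups = clusters, block = phase)

# design=: a numeric matrix of covariates with one row per cell.
# modelGeneVar() is given a full model matrix, including the intercept.
design <- model.matrix(~ G1 + G2M, data = as.data.frame(scores))
dec.design <- modelGeneVar(sce, design = design)

# Assumption: findMarkers() adds the group terms itself, so the intercept
# column is dropped here to avoid a redundant column.
markers.design <- findMarkers(sce, groups = clusters,
    design = design[, -1, drop = FALSE])
```

So the blocking factor is just the vector of phase labels, while the design matrix carries the continuous scores. The intercept-free version of that matrix is also the kind of thing you would pass as covariates= to removeBatchEffect().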

Honestly speaking, I have mixed feelings about regressing out the cell cycle effect. It seemed like a good idea at the time, and everyone was doing it, so that's why I talked about it in the workflow. But I've become increasingly concerned that the cell cycle is not entirely orthogonal to biological processes of interest, and attempting to regress it out could cause more trouble than it's worth. For example, if one cell type cycles more actively than another cell type, or when we're talking about T cell activation, trying to regress out the cell cycle effect could cripple your signal (or even worse, introduce spurious signal). I've also had some nagging doubts about the accuracy of cell cycle phase calls from cyclone() or related methods that are based on classifiers learnt from a single reference dataset. It's not hard to find situations where the test dataset involves different cell types that don't behave much like the reference w.r.t. cell cycle-associated genes.

If I had to do it, I would block on the phase assignments, but I'm starting to consider whether just tossing out all genes with annotated associations to the cell cycle would be a safer approach (e.g., based on all terms in GO:0007049, which pretty much covers anything that might be associated with the cell cycle). This won't get rid of unannotated genes that have expression correlated to the cell cycle, but any such effect is indistinguishable from the activity of a separate pathway that happens to be correlated to the cell cycle.
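The gene-removal idea could be sketched roughly as follows. This assumes mouse data with Ensembl IDs as row names and uses the org.Mm.eg.db annotation package; the gene.ids vector is a hypothetical stand-in for rownames(sce), taken from the annotation package only so that the snippet runs on its own.

```r
library(org.Mm.eg.db)   # assumes mouse; swap in org.Hs.eg.db for human
library(AnnotationDbi)

# All Ensembl gene IDs annotated to GO:0007049 ("cell cycle") or any of its
# child terms; the GOALL keytype includes annotations to descendant terms.
cc.genes <- AnnotationDbi::select(org.Mm.eg.db, keys = "GO:0007049",
    keytype = "GOALL", columns = "ENSEMBL")$ENSEMBL
cc.genes <- unique(cc.genes[!is.na(cc.genes)])

# Hypothetical gene IDs standing in for rownames(sce).
gene.ids <- head(keys(org.Mm.eg.db, keytype = "ENSEMBL"), 500)

# TRUE for genes with no annotated association to the cell cycle.
keep <- !gene.ids %in% cc.genes
table(keep)
```

The resulting logical vector could then be used to restrict which genes go into feature selection, PCA and clustering, while the full matrix is still available for marker detection afterwards.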