Question

Differential Expression Analysis between conditions of single cell RNA-Seq

1

hamza_karakurt ▴ 60

@hamza_karakurt-17704

Last seen 2.3 years ago

Turkey

Hello everyone, I am working on scRNA-Seq data analysis and I have a technical question. We can combine different scRNA-Seq experiments with batch correction methods such as MNN or CCA. As far as I know, differential expression analysis should account for the batch effect; for example, the scater/scran packages provide a block parameter to include the batch in the analysis.

But the point is, if our data sets come from different conditions (let's say healthy and disease), the real source of the batch effect is the condition, and we want to compare the transcriptomes of specific cell types between conditions, what should we do? We cannot block on or correct for the batch, since the batch difference is exactly the condition effect we want to see.

I identified the cell types of the clusters in each data set separately. Now I have 2 data sets from 2 conditions, I know which cluster is which cell type, and I want to compare specific cell types between these 2 conditions.

Would it be okay to treat the scRNA-Seq data as bulk RNA-Seq data and use raw counts (after removing non-expressed genes, of course) with methods such as DESeq2 or edgeR?

Thank you in advance.

deseq2 scRNA-Seq RNA-Seq differential expression analysis • 6.9k views

ADD COMMENT • link updated 5.7 years ago by Michael Love 43k • written 5.7 years ago by hamza_karakurt ▴ 60

score 1 · Answer 1 · 2019-03-06

1

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

This is beyond what I've done so far, so I would only be speculating, and I hesitate to do that when others know more about scRNA-seq. I've likewise been asking people I know how comparisons are to be made across experiments and donors, and I think it's a very active area of research, to capture both the mean expression and variability across biological units. I know the Satija lab uses the method of summing counts from populations of cells to compare across donors, so you might ask them about the performance of this method.
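To make the idea of summing counts concrete, here is a minimal sketch in Python (standard library only). It assumes a toy data layout in which each cell is a dict with the hypothetical keys `sample`, `cell_type`, and `counts`; none of these names come from a real package, and this is not the Satija lab's implementation. It only illustrates collapsing cells into one count vector per sample and cell type, which is the kind of input a bulk tool such as DESeq2 or edgeR expects.

```python
from collections import defaultdict


def pseudobulk(cells, genes):
    """Sum raw counts over all cells sharing the same (sample, cell_type)."""
    sums = defaultdict(lambda: [0] * len(genes))
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        for i, gene in enumerate(genes):
            # Genes missing from a cell's dict are treated as zero counts.
            sums[key][i] += cell["counts"].get(gene, 0)
    return dict(sums)


# Toy data, made up for illustration: two samples, one cell type.
genes = ["GeneA", "GeneB"]
cells = [
    {"sample": "healthy_1", "cell_type": "T cell", "counts": {"GeneA": 3, "GeneB": 0}},
    {"sample": "healthy_1", "cell_type": "T cell", "counts": {"GeneA": 1, "GeneB": 2}},
    {"sample": "disease_1", "cell_type": "T cell", "counts": {"GeneA": 0, "GeneB": 5}},
]

bulk = pseudobulk(cells, genes)
for (sample, cell_type), vec in sorted(bulk.items()):
    print(sample, cell_type, dict(zip(genes, vec)))
```

Each resulting vector plays the role of one bulk library, so the number of replicates equals the number of samples, not the number of cells.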

ADD COMMENT • link 5.7 years ago Michael Love 43k

1

This may be of interest, as well as the vignettes here and here. Note that the last link is getting fixed in the next build cycle to use the summed counts directly, so just ignore the weighting part.

ADD REPLY • link 5.7 years ago Aaron Lun ★ 28k

0

Ah yes, I forgot about your Biostatistics paper (1st link) exploring just this question. Thanks.

ADD REPLY • link 5.7 years ago Michael Love 43k

0

Hello again Aaron and Michael, and thank you for your answers. I see that summing counts to create pseudo-bulk RNA-Seq data is an effective method, and it is suitable for specific cell types, but I have another question. What if we don't have replicates, meaning only 1 data set for healthy and 1 data set for disease? Summing counts will create a pseudo-bulk RNA-Seq data set without replicates, and without replicates standard statistical methods are not suitable. Do you think using raw counts without summing and treating each cell as a replicate is a proper way to do it (with DESeq2 for example)?

Thank you for all your answers.

ADD REPLY • link 5.6 years ago hamza_karakurt ▴ 60

2

"Do you think using raw counts without summing and treating each cell as a replicate is a proper way to do it (with DESeq2 for example)?"

Is it "proper"? No. You're treating the cells as replicates, which is inappropriate for various reasons. The most obvious is that, if you have multiple cell types/states in a cluster, your replicates will exhibit hidden correlations that compromise the DE analysis. Even more problematic is the fact that cells are not units of experimental replication, and treating them as such makes little sense.

To understand the latter reason, imagine what would happen if you or someone else tried to repeat the experiment. In the vast majority of cases, you will not have access to the same population of cells. Rather, you will generate a new population of cells from a different sample (e.g., patient, animal, cell culture) to use in your experiment. These samples are the relevant units of replication in an experimental context, not the individual cells themselves. Indeed, all of classical hypothesis testing is about the long-run expected results from repeated experiments (e.g., type I error rates, expected FDRs), so this is what the replication should reflect.
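A small simulation can illustrate the long-run argument. The sketch below is a simplification under stated assumptions, not a real scRNA-seq model: expression is Gaussian on some log-like scale, each condition has a single sample whose mean shifts randomly from sample to sample, and there is no true condition effect. The function name and all parameter values are made up for illustration. Cells are then (wrongly) treated as replicates in a two-sample z-test.

```python
import random
import statistics


def false_positive_rate(n_experiments=500, n_cells=200,
                        sd_sample=1.0, sd_cell=1.0, seed=0):
    """Fraction of null experiments called DE when cells are the replicates."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        groups = []
        for _condition in range(2):
            # ONE sample per condition: its mean varies between samples,
            # but it does not depend on the condition (the null is true).
            sample_mean = rng.gauss(0.0, sd_sample)
            groups.append([rng.gauss(sample_mean, sd_cell) for _ in range(n_cells)])
        a, b = groups
        # Standard error computed from cell-to-cell variance only.
        se = (statistics.variance(a) / n_cells + statistics.variance(b) / n_cells) ** 0.5
        z = (statistics.fmean(a) - statistics.fmean(b)) / se
        if abs(z) > 1.96:  # nominal two-sided 5% threshold
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate()
print(f"Fraction of null experiments called DE: {rate:.2f} (nominal level 0.05)")
```

Under these assumptions the rejection rate ends up far above the nominal 5%, because the standard error ignores sample-to-sample variation entirely.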

If you only have one sample per condition, you are in the same position as if you have only one bulk RNA-seq sample. Cell-to-cell variability doesn't tell you anything about the sample-to-sample variability; you can have highly heterogeneous populations (high cell-to-cell variability) that are very consistent across samples, as well as populations that consist of one cell type/kind (low cell-to-cell variability) but that cell type/kind differs across samples.
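The two situations can be shown with toy numbers. This is a minimal sketch with made-up expression values for one gene (each inner list is one sample, each number one cell); the helper name `variability` is hypothetical. It compares the average variance among cells within a sample against the variance of the per-sample means.

```python
import statistics


def variability(samples):
    """Return (mean within-sample variance, variance of the sample means)."""
    within = statistics.fmean(statistics.pvariance(cells) for cells in samples)
    between = statistics.pvariance([statistics.fmean(cells) for cells in samples])
    return within, between


# Scenario A: heterogeneous cells, but every sample contains the same mix.
heterogeneous_consistent = [[0, 10, 0, 10], [10, 0, 10, 0], [0, 0, 10, 10]]
# Scenario B: identical cells within each sample, but the samples differ.
homogeneous_inconsistent = [[2, 2, 2, 2], [8, 8, 8, 8], [5, 5, 5, 5]]

within_a, between_a = variability(heterogeneous_consistent)
within_b, between_b = variability(homogeneous_inconsistent)
print("A: within =", within_a, "between =", between_a)
print("B: within =", within_b, "between =", between_b)
```

In scenario A the within-sample variance is large while the between-sample variance is zero; in scenario B it is the other way around. With one sample per condition, only the first quantity can be estimated, and it says nothing about the second.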

Having said that, people frequently pretend that cells are replicates (including me, e.g., here). Sounds bad, but the excuse is that the aim of the analysis is to simply rank the genes to identify good markers for particular clusters. Cell-to-cell variability becomes relevant in such cases as the most appealing markers are consistently upregulated in one population compared to another. Importantly: we don't bother interpreting the magnitude of the p-value to define significant DE genes, for reasons related to the replication described above, and also because of the circularity of computing p-values from the same data used to define clusters.
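As a rough illustration of ranking without interpreting p-values, the sketch below orders genes by a standardized mean difference between one cluster and the remaining cells. This is just one possible ranking statistic chosen for the example, not necessarily what the linked workflow computes; the gene names, the expression values, and the offset `s0` are all made up.

```python
import statistics


def marker_score(in_cluster, other, s0=0.1):
    """Mean difference divided by (pooled SD + s0).

    s0 is an arbitrary small offset (an assumption of this sketch) that
    avoids division by zero when a gene has no variance.
    """
    pooled_sd = ((statistics.variance(in_cluster) + statistics.variance(other)) / 2) ** 0.5
    return (statistics.fmean(in_cluster) - statistics.fmean(other)) / (pooled_sd + s0)


# Toy log-expression values per cell: (cells in the cluster, all other cells).
expression = {
    "GeneA": ([5.0, 5.2, 4.8, 5.1], [1.0, 0.9, 1.2, 1.1]),  # consistently up
    "GeneB": ([9.0, 0.0, 8.0, 0.1], [1.0, 0.9, 1.2, 1.1]),  # up on average, erratic
    "GeneC": ([1.0, 1.1, 0.9, 1.0], [1.0, 0.9, 1.2, 1.1]),  # essentially unchanged
}

scores = {gene: marker_score(a, b) for gene, (a, b) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for gene in ranked:
    print(f"{gene}: {scores[gene]:.2f}")
```

GeneA ranks first because it is consistently higher in the cluster, while GeneB has a similar mean difference but is penalized for its cell-to-cell variability. Only the ordering is used; no significance threshold is applied.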

If you care about having valid p-values to define DE genes, then you need replicate samples. This is no different from bulk RNA-seq; you can't talk your way out of it by pretending cells are replicates.

ADD REPLY • link 5.6 years ago Aaron Lun ★ 28k