Is there a way in SingleR to know, for each cell in the test data, which genes (in the test cell) are most correlated with the genes in the predicted cell from the reference dataset? For example, for a cell X in the test data that is predicted to be cell Y in the reference data, what are the highly correlated genes in X with Y?
Given a single cell, it is not possible to determine the "most highly correlated gene". For one cell X, one reference Y and one gene, we only have a pair of observations; there's nothing to compute a correlation on. In fact, SingleR doesn't even use the concept of per-gene correlation when we're talking about populations of cells.
I suspect you are instead asking "which genes are driving the correlation between X and Y?" SingleR doesn't have a formal way of breaking down the statistics in this manner. The book describes some diagnostic plots based on marker gene expression, which may be sufficient. You could also look at which marker genes for Y are most highly expressed in X, which should be a good heuristic for identifying the contributing genes.
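To make the heuristic in the last sentence concrete, here is a minimal sketch in plain Python. It assumes you have already extracted the marker genes for label Y and the expression values of cell X yourself; the function name `top_expressed_markers`, the gene names and the numbers are all made up for illustration and are not part of SingleR.

```python
def top_expressed_markers(test_cell, markers, n=3):
    """Return the n marker genes with the highest expression in the test cell.

    test_cell: dict mapping gene name -> expression value in cell X (hypothetical input)
    markers:   list of marker genes for label Y (hypothetical input)
    """
    # Ignore markers that were not measured in the test cell.
    present = [g for g in markers if g in test_cell]
    # Sort from highest to lowest expression in X.
    return sorted(present, key=lambda g: test_cell[g], reverse=True)[:n]


# Invented toy values, for illustration only.
cell_x = {"geneA": 5.0, "geneB": 0.0, "geneC": 3.2, "geneD": 1.1}
markers_y = ["geneA", "geneB", "geneC", "geneE"]

top = top_expressed_markers(cell_x, markers_y, n=2)
print(top)
```

Markers that are near zero in X (like `geneB` here) are unlikely to be what is pulling X towards Y.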
(For completeness, I would answer the above question by removing one marker gene at a time, repeating the assignment and examining the difference in the scores for the initial label Y. Large drops in the score indicate that the gene is very important for that cell's assignment to Y. However, there would be a lot of genes to go through, which is a pain; the qualitative diagnostics are fast and probably good enough for most purposes.)
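A rough sketch of the leave-one-gene-out idea, in plain Python. This is only an illustration under simplifying assumptions: it scores one test cell against a single reference profile with a plain Spearman correlation over the marker genes, whereas SingleR's actual score aggregates over many reference cells per label and includes a fine-tuning step. The helper names (`ranks`, `spearman`, `leave_one_out_drops`) and the toy values are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks.

    Raises ZeroDivisionError if either vector is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def leave_one_out_drops(test_cell, ref_profile, markers):
    """Score with all markers, then with each marker removed in turn.

    Returns (full_score, {gene: full_score - score_without_gene}).
    A large positive drop suggests the gene supports the match.
    """
    full = spearman([test_cell[g] for g in markers],
                    [ref_profile[g] for g in markers])
    drops = {}
    for g in markers:
        kept = [m for m in markers if m != g]
        score = spearman([test_cell[m] for m in kept],
                         [ref_profile[m] for m in kept])
        drops[g] = full - score
    return full, drops


# Invented toy values, for illustration only.
cell_x = {"geneA": 5.0, "geneB": 4.0, "geneC": 3.0, "geneD": 0.0, "geneE": 2.5}
ref_y = {"geneA": 6.0, "geneB": 5.0, "geneC": 4.0, "geneD": 0.5, "geneE": 0.1}

full, drops = leave_one_out_drops(cell_x, ref_y, list(cell_x))
for gene, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(gene, round(drop, 3))
```

Genes with a negative drop are ones whose removal raises the score, i.e. they disagree with the reference ordering. With thousands of markers this loop repeats the scoring once per gene, which is the cost mentioned above.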
Thanks. My understanding is that SingleR computes the Spearman correlation between each cell in the test dataset and the reference dataset, and iteratively chooses the most correlated cell in the reference.
Yes, I meant the genes in cell X (test set) that are driving the high correlation between X and Y, not a single gene.
So if I have 2 cells in the test data that are labeled “T cells”, the best way would be to look at the marker genes of the T cells from the reference and see how they are expressed in the 2 cells in the test set?