I am therefore wondering what would be the most appropriate approach since I have samples from different donors and hence I expect differences between them.
I would say that cydar is best used for tightly designed experiments where all samples are multiplexed together (e.g., using a palladium barcoding scheme), or are at least multiplexed in non-confounding batches to enable intensity normalization with normalizeBatches().
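To illustrate, here is a minimal sketch of how normalizeBatches() might be called in the non-confounded case. The objects batch1 and batch2 are hypothetical lists of cells-by-markers intensity matrices (one matrix per sample), and the condition labels are made up for illustration.

```r
library(cydar)

# Hypothetical input: one list of intensity matrices per batch.
batch.x <- list(batch1, batch2)

# Composition of each batch; conditions must be shared across batches
# (i.e., not confounded) for the normalization to be sensible.
batch.comp <- list(
    factor(c("control", "treated")),  # conditions of samples in batch 1
    factor(c("control", "treated"))   # conditions of samples in batch 2
)

# Range-based adjustment of intensities across batches.
corrected <- normalizeBatches(batch.x, batch.comp, mode="range")

# 'corrected' is a list of adjusted matrices, ready to be passed to
# prepareCellData() and countCells().
```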
In principle, we could handle arbitrary samples collected at different time points by increasing the hypersphere radius to "ride out" any differences due to batch effects. I would not be particularly happy about that, though, as cydar is designed to be highly sensitive to differences in the distribution of cells across the phenotype space. This means that there is a fairly high risk that uncorrected batch effects will introduce false positive differences.
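For concreteness, the radius is controlled by the tol argument of countCells(). A sketch of what "riding out" batch effects would look like, assuming cell.data is a named list of cells-by-markers matrices (one per sample):

```r
library(cydar)

# Reorganize the per-sample matrices for counting.
cd <- prepareCellData(cell.data)

# Count cells into hyperspheres; the default tol is 0.5, so doubling it
# coarsens the hyperspheres to absorb modest batch-driven shifts -
# at the cost of the fine-grained sensitivity described above.
cd <- countCells(cd, tol=1)
```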
By comparison, clustering methods are less sensitive to batch effects. If you manage to cluster cells of the same type together across multiple batches, then the distribution of cells within the cluster doesn't matter, as you're treating the entire cluster as a single unit for differential abundance analyses. The flip side is that you're dependent on the whims of the clustering algorithm. This may make your analysis less sensitive to real changes in abundance between conditions; for example, if the clustering algorithm puts CD4+ and CD8+ cells together in one cluster, a switch in CD4/CD8 proportions within that cluster would not be detected. And the batch effect protection obviously depends on not having the misfortune of forming batch-specific clusters, either...
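If you went the clustering route, the cluster-level DA test itself is straightforward with edgeR's quasi-likelihood framework. A sketch, assuming clusters assigns each cell to a cluster, sample.id gives each cell's sample of origin, and condition is the per-sample condition (all hypothetical names):

```r
library(edgeR)

# Cluster-by-sample table of cell counts.
counts <- table(clusters, sample.id)

# Treat each cluster as a "feature" and each sample as a "library".
y <- DGEList(counts, lib.size=colSums(counts))

design <- model.matrix(~condition)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)

topTags(res)  # clusters ranked by evidence for differential abundance
```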
One could apply single-cell batch correction methods like fastMNN (see, for example, https://support.bioconductor.org/p/129407/) to remove batch effects prior to downstream analyses with cydar or other methods. These methods involve assumptions of their own, of course, so the correction doesn't come for free.
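A rough sketch of that route, assuming exprs.list is a hypothetical list of markers-by-cells matrices, one per batch (note that fastMNN expects features in rows, so marker intensities need to be transposed relative to the usual cytometry layout):

```r
library(batchelor)

# MNN correction across the two batches.
mnn.out <- fastMNN(exprs.list[[1]], exprs.list[[2]])

# Corrected low-dimensional coordinates, one row per cell.
corrected <- reducedDim(mnn.out, "corrected")

# These coordinates would then be split back into per-sample matrices
# before being passed to prepareCellData()/countCells() in cydar.
```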
Personally, I would roll my sleeves up and perform manual (or at least semi-automated) gating to analyze this type of data. This involves the fewest assumptions - or more specifically, the assumptions that you make are explicit when you're deciding where to put the gate, and that's better than the implicit assumptions that are silently made by these various clustering/batch correction algorithms. For typical applications with binary markers, gating should be fairly straightforward; for more continuous markers, I think everyone would struggle when you throw in a batch effect, but at least you know what parts of your analysis are less trustworthy.
Second, I read in the package's vignette that cydar's power is that it can find DA populations between two conditions.
Yes, that's kind of the whole point.
From the paper and the vignette, I understand that cydar can identify these subpopulations, and I wanted to ask how one would present a quantification of the DA hyperspheres.
I usually just make a t-SNE or UMAP of the hyperspheres and say, "hey, all the DA parts of the population are here". You could also make a heatmap based on representative hyperspheres (see the findFirstSphere() function) and use those as proxies for the underlying subpopulations.
Anecdotally, I have found that most users are using cydar to identify what subpopulations are of interest in a highly sensitive manner, and then going back and gating the dataset to generate biaxial plots of those subpopulations. This is also fine for presentation purposes, provided that you understand that any p-value associated with the gated percentages does not account for the multiple testing that was performed by cydar.