How should I determine the tolerance setting for countCells in cydar? The package tutorial recommends 0.5 for logicle-transformed data. Unfortunately, my data is already arcsinh-transformed, so I do not think I can use the tutorial's recommended value.
When I plot my CyTOF data, I can see that the expression values for most markers range from 0 to 6. How should I go about determining an appropriate value?
An empirical approach would be to use neighborDistances to identify an appropriate tolerance that captures a "sufficient" number of cells in each hypersphere. See the Examples in ?neighborDistances for details, and an example usage here.
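For instance, a minimal sketch of that approach, assuming cell.data is a named list of arcsinh-transformed intensity matrices, one per sample, with cells in rows and markers in columns:

```r
library(cydar)

## Organize the per-sample matrices for neighbour searching.
cd <- prepareCellData(cell.data)

## Distance from each cell to its 1st..50th nearest neighbour, converted to
## the tolerance a hypersphere would need to capture that many cells.
dists <- neighborDistances(cd, neighbors=50, as.tol=TRUE)
boxplot(dists, xlab="Number of neighbors", ylab="Tolerance")

## Pick the smallest tolerance at which hyperspheres would hold enough
## cells (say, 50), then count cells into hyperspheres with it.
counted <- countCells(cd, tol=0.7)  # 0.7 is purely illustrative, not a rule
```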
The default of 0.5 is chosen based on experience with mass/flow cytometry data, where the transformed intensities (logicle or arcsinh) for a homogeneous population are usually plus/minus 0.5 around the population mean. This corresponds to a 10-fold difference between the lowest- and highest-expressing cells in a population, assuming a log-10 scale. Hence we set 0.5 so that a hypersphere centred at the middle of a population would capture many of its cells.
I like to choose the smallest tolerance that will retain sufficient cells per hypersphere. Larger tolerances will obviously give you larger counts, but at the cost of reduced resolution. The former increases power (more evidence to reject the null) while the latter decreases power (by "diluting" the effect size). This is an inevitable trade-off when you're deciding how to summarize counts. If you want to cover all your bases, you can repeat your analyses with different tolerances and combine the p-values for each hypersphere (this is possible as the hyperspheres will be centred on the same set of cells across different choices of tol).
Thank you for the quick response, especially for the link to the repository. However, please bear with me, as I do not think I have fully understood your answer. You said that the default tolerance of 0.5 corresponds to a 10-fold difference in intensities between the opposing edges of the sphere. But what exactly did you mean by "assuming a log10 scale"? Did you mean to log-transform the marker intensities before working through the cydar pipeline?
I like to choose the smallest tolerance that will retain sufficient cells per hypersphere
The arbitrariness of "sufficient" still eludes me. It is really tempting to iteratively change the parameters after peeking at the p-values to get a satisfactory result. Filtering out cells/clusters with low counts is, from what I understand, something most immunologists would avoid, as they are mainly interested in looking for the mysterious and rare subsets that will make all the difference. I know a lot of people who will retain a cluster as long as it contains at least 100 cells aggregated across all samples, even though in many cases that cluster is driven by only a few samples. When I have analysed RNA-seq data, I usually set a threshold on the minimum percentage of samples in which a gene must be expressed, in addition to the count threshold. I am rambling now.
So, what is the least biased way to set the filtering parameter for hyperspheres, or for any other method (in the context of CyTOF data analysis)?
Did you mean to log-transform the marker intensities before working through the cydar pipeline?
The logicle function (like the arcsinh) approaches a log function at large values. In mass/flow cytometry applications, the logicle transformation is usually set up so that one unit of transformed intensity corresponds to a log10-unit, i.e., a decade, or a 10-fold increase in intensity. Under these settings, a tolerance of plus/minus 0.5 in the transformed space corresponds to a span of one log10-unit. Of course, this interpretation only makes sense at higher intensities, as the transformations become linear near zero, but it's a good rule of thumb.
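A quick numeric check in base R makes this concrete. Note that an unscaled arcsinh works in natural-log units, so one transformed unit corresponds to an e-fold rather than a 10-fold change:

```r
## At large x, asinh(x) ~ log(2*x), so one decade of raw intensity spans
## log(10) ~ 2.3026 units on an unscaled arcsinh axis.
x <- 10^(2:5)
asinh(10 * x) - asinh(x)  # each element is ~ log(10) = 2.3026
exp(1)                    # one transformed unit = an e-fold (~2.72x) change
```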
The arbitrariness of "sufficient" still eludes me.
I'm going to assume you're talking about the choice of the tolerance, because your second paragraph is not entirely clear.
The size of the hyperspheres reflects an analytical trade-off between count size and resolution. This is not something that can be optimized unless you know the 'span' of the differentially abundant regions of the intensity space. Trying to determine this from the data would require knowledge of the differentially abundant regions in the first place, which defeats the purpose of doing the DA analysis.
To illustrate, consider a simple one-dimensional case: a single population of cells stained for CD25. Assume that the CD25 intensity distribution is continuous from low to middle to high, but only the CD25-high subset is differentially abundant between conditions. Using the span of the entire population would be suboptimal for detecting this difference; you want a span based on the width of the differential interval. However, this interval is unknown at this point, and the only way to identify it is via differential testing, so we're back where we started. Now extend this lack of knowledge to N-dimensional data, and we're at the current state of affairs.
It is really tempting to iteratively change the parameters after peeking at the p-values to get a satisfactory result.
Well, if you're afraid of this parameter, then just leave it at the default, which is pretty sensible based on experience of how these experiments work.
But if you do test multiple versions, you should correct for multiple testing, which is straightforward in a meta-analysis framework. This was what I meant by combining the p-values per hypersphere.
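A rough sketch of one way to do this, assuming countCells() is run on the same prepared object cd each time (so hyperspheres correspond across tolerances) and using a hypothetical runDA() wrapper around the edgeR-based testing that returns one p-value per hypersphere:

```r
tolerances <- c(0.3, 0.5, 0.7)

## One column of per-hypersphere p-values per tolerance. Hyperspheres are
## centred on the same cells for every tol, so rows correspond directly.
pmat <- sapply(tolerances, function(tol) {
    counted <- countCells(cd, tol=tol)
    runDA(counted)  # hypothetical wrapper for your edgeR-based DA test
})

## Bonferroni-style combination per hypersphere; this remains valid even
## though the analyses at different tolerances are strongly dependent.
combined.p <- pmin(1, apply(pmat, 1, min) * ncol(pmat))
```

A minimum-based combination flags hyperspheres that are significant at any one tolerance; Fisher's or Stouffer's methods would instead reward consistency across tolerances, but their independence assumptions do not hold here. The combined p-values can then be passed to spatialFDR() as usual.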
Filtering out cells/clusters with low counts is, from what I understand, something most immunologists would avoid, as they are mainly interested in looking for the mysterious and rare subsets that will make all the difference.
I think we've gone off-track here. Are you talking about filtering thresholds? That is a separate parameter in a different step of the analysis; the tolerance is only used to define the hypersphere size.
It is always possible to skip the filtering if you are looking for rare subsets. Many of the statistical arguments for filtering in genomics analyses do not apply here, as relatively few hyperspheres have low counts. However, it is up to you whether you want to spend your time following up hyperspheres that contain 1 or 2 cells per sample, assuming that they even have enough counts to reject the null hypothesis.
Anecdotally, I've had one or two collaborators who also talked about looking at rare subsets. But as soon as the results came in, they immediately focused on the big players with large absolute changes in abundance (e.g., CD4/CD8 T cells, B cells, granulocytes). It's hard to get excited about a loss or gain of a few cells in a rare population when your big T cell population increases by 50% upon drug treatment.
So, what is the least biased way to set the filtering parameter for hyperspheres, or for any other method (in the context of CyTOF data analysis)?
Be careful with the use of the word 'bias'. An independent filtering regime will never bias the differential testing results (in terms of skewing the distribution of p-values under the null hypothesis).
In general, the filtering strategy for any high-throughput data analysis is guided by what features you find interesting. If you don't care about low-density regions of the phenotype space, you can filter them out with a high threshold. If you do care, then leave them in. If this is an independent filter, then the test procedure will be valid either way (in that the error rates will be correctly controlled).
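For example, here is a sketch of an independent abundance filter along the lines of the cydar vignette, assuming counted is the output of countCells(), that the per-sample total cell counts are stored in counted$totals, and that a minimum average count of 5 (an arbitrary illustrative choice) is desired:

```r
library(edgeR)

## Filter on the average count across ALL samples. This is independent of
## the between-condition differences being tested, so the null p-values of
## the retained hyperspheres are not distorted.
y <- DGEList(assay(counted), lib.size=counted$totals)
keep <- aveLogCPM(y) >= aveLogCPM(5, lib.size=mean(counted$totals))
counted <- counted[keep, ]
y <- y[keep, ]
```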
If you call this 'biased', then any experimental procedure involving any type of selection would be biased as well. Data rarely speaks for itself; we generally see it through the prism of the hypothesis that we're testing.
Thank you for taking the time to write such an elaborate reply. Indeed, I asked about the choice of the best hypersphere radius and then moved on rather suddenly to the minimum cells-per-hypersphere threshold.
At first, I considered cydar to be a completely unbiased method for analysing cytometry data due to its omission of conventional clustering steps. However, after reading your explanation, I now realise that every method has its trade-offs. Your reply has made me a better scientist.