Hi all, hi Michael,
we want to build receiver operating characteristics for patients from two separate cohorts (using gene expression from sequencing as predictor), and I was thinking about how to compare them in this context. I did vst() (varianceStablilizingTransformation()
) on both of them and tried to use one dataset as training set ("pred_data
") and the other set as the test set ("test_data
") in the generation of the ROC, like so (count is the variance stabilized count):
fit_glm <- glm(event ~ count, pred_data, family=binomial(link="logit"))
glm_link_scores <- predict(fit_glm, test_data, type="link")
glm_response_scores <- predict(fit_glm, test_data, type="response")
I chose the most highly DE gene in both datasets as the first example. However, the vst() output of data set 1 has slighty higher stabilized counts than the second one for this particular gene (median around 20 vs median around 16). Thus, the linear model based on data set 1 seems to be non-applicable to data set 2 and because of that the ROC curve is nonsense.
This is a scatter plot of the stabilized counts of the example gene in both data sets. Each dot represents an individual patient, the color is event/no event. I imagine that for a valid comparison, I would have to get these two distributions to approximately the same mean and sd. Adding pseudocounts to the log-transform does not really impact these highly expressed genes. I also think that a per-gene approach would be more appropriate than a correction of the whole set, but that might be totally wrong.
Now I believe that using TPM values for comparing between samples is not a good approach, correct me if I'm wrong. The question is, can I normalize or scale the vst() counts to get a usable distribution around the same mean or median for these two datasets, and how would I robustly do that? (Ie, "pull up" the stabilized counts of set 2 to the levels of set 1.)
Thanks in advance!
Sebastian
Hi Michael, thanks for the rapid help!
I think my question might not have been clear enough; anyhow, the pseudocounts do next to nothing to the genes I am interested in for the ROC, probably because these are some of the most highly expressed genes in the set.
I updated my question accordingly, could you please have another look at it?
Sorry I don’t have any concrete suggestions, I think it’s maybe beyond just a question about VST or normalization in DESeq2 and more about batch effects across study and how to come up with classifies that generalize well. I think the top scoring pairs may be a relevant place for you to look but I don’t think DESeq2 can help here.