Hi all,
Right now I'm planning to perform some machine learning analyses on Nanostring data, and would like to experiment with coercing the data to linear using DESeq2's VST functions, as VST has been working fine with our RNA-seq data.
However, unlike limma-voom, there is little guidance on using DESeq(2) with Nanostring data. Simon Anders commented on this in "A: Nanostring ncounterdata - DESeq", and I haven't noticed any other discussions.
So, has anyone here used DESeq(2) with Nanostring count data? And how did you process the data?
What if I just don't put in the spike-ins? And, from what you mentioned above, I guess you refer to nCounts that have not been normalized, right?
Yes, the raw counts need to be normalized, and if you have a panel of genes that are specifically chosen because they may change across samples, you need to have a set of housekeeping genes or positive spike-ins for reasonable normalization.
By "normalized" I solely mean Nanostring's own CodeSet normalization. But should I put in the positive controls only, or the negative controls as well? From what I know, Nanostring's own guidelines no longer recommend performing background subtraction.
I took a look at Nanostring's normalization guidelines, and they are essentially recommending DESeq normalization.
What I would recommend, knowing you have 768 genes measured, is to supply DESeq2 with the raw counts and to use the default normalization, so just DESeq() as normal.
My concern with Nanostring counts is that sometimes people are only looking at a small subset of genes (say 100-200) known to be DE across samples, and in that case I'd really prefer if there were known housekeeping genes included on the panel. In my opinion, for a small panel, housekeeping genes are better than positive controls, which are better than nothing. But with 768 genes -- James is correct -- you can probably identify the per-sample size factors using the default DESeq2 steps.
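To make the small-panel concern concrete, here is a minimal sketch of median-of-ratios size factor estimation, the idea behind DESeq2's default normalization. It is plain Python for illustration only, not DESeq2's actual code. The function name `size_factors`, the `control_rows` argument, and the toy counts are all made up for this example. In DESeq2 itself, restricting the estimate to chosen genes is done through the `controlGenes` argument of `estimateSizeFactors()`.

```python
import math
from statistics import median

def size_factors(counts, control_rows=None):
    """Median-of-ratios size factors.

    counts: list of rows (one per probe), each a list of raw counts per sample.
    control_rows: optional row indices (e.g. housekeeping probes) to use
                  instead of all probes. Hypothetical argument for this sketch.
    """
    rows = range(len(counts)) if control_rows is None else control_rows
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for i in rows:
        row = counts[i]
        if any(c <= 0 for c in row):
            continue  # a zero count makes the log geometric mean undefined
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no probe has non-zero counts in every sample")
    # per sample: median ratio of its counts to the per-probe geometric mean
    return [math.exp(median(r)) for r in log_ratios]

# Toy panel: three probes double in sample 2, two housekeeping probes do not.
toy = [[10, 20], [100, 200], [50, 100], [8, 8], [16, 16]]
sf_default = size_factors(toy)                       # uses all probes
sf_housekeeping = size_factors(toy, control_rows=[3, 4])
```

In this toy panel most probes change, so the default estimate reads the change as a depth difference, while the housekeeping-only estimate gives equal size factors. With a large panel where most probes do not change, the two would be expected to agree more closely.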
What I would recommend is, after doing the standard DESeq2 analysis, make an MA plot and draw the positive controls just to see where they fall.
I updated this comment; earlier I didn't have code for selecting the positive controls.
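As a rough illustration of that check, the sketch below computes MA coordinates from already-normalized counts and picks out the positive controls by name. It is plain Python, not the DESeq2 code: the fold change here is a simple ratio of group means with a pseudocount, not DESeq2's model-based estimate. The function names, the `pseudo` argument, and the assumption that positive-control probe names start with `POS_` are all illustrative.

```python
import math

def ma_points(norm_counts, group_a, group_b, pseudo=0.5):
    """Return {probe: (A, M)} from normalized counts.

    A is the mean normalized count over all samples; M is the log2 ratio of
    the group_b mean to the group_a mean. group_a and group_b are lists of
    sample (column) indices. pseudo avoids taking the log of zero.
    """
    points = {}
    for probe, row in norm_counts.items():
        mean_a = sum(row[j] for j in group_a) / len(group_a)
        mean_b = sum(row[j] for j in group_b) / len(group_b)
        a = sum(row) / len(row)
        m = math.log2((mean_b + pseudo) / (mean_a + pseudo))
        points[probe] = (a, m)
    return points

def positive_controls(points, prefix="POS_"):
    """Select probes whose name starts with the assumed control prefix."""
    return {p: xy for p, xy in points.items() if p.startswith(prefix)}

# Toy data: one positive control and one gene, two samples per group.
toy_norm = {"POS_A(128)": [1000, 1000, 1000, 1000], "GENE1": [10, 10, 40, 40]}
pts = ma_points(toy_norm, group_a=[0, 1], group_b=[2, 3])
controls = positive_controls(pts)
```

The selected points can then be drawn over the MA plot in a different color to see where the controls fall relative to M = 0.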
Another thing to consider is the number of probes in your codeset. With the smaller, more directed codesets there is always the possibility that most if not all of them are being affected by the experimental conditions. However, as you get into the larger codesets this may be less true, and you may be able to use the conventional assumptions for RNA-Seq data.
The devil, as always, is in the details, and whatever assumptions you make have to be backed up by various types of exploratory data analysis. In my experience, once you get past maybe 350-400 genes in the codeset, you can start thinking that maybe the usual RNA-Seq normalizations are applicable.
The number of probesets is not an issue; this project involves more than one complete set (of 768 "Endogenous" probes).