how does Bioconductor know that I am working with RPKM counts?
If you read the sales pitch, Bioconductor is a software project. It doesn't know anything about your specific analysis. I presume you mean to say "how does a specific Bioconductor package for the analysis of single-cell data know that I am working with RPKM counts?"
For many of the packages under my control (e.g., scran, some of scater, DropletUtils), the expected type of expression values will be stated in the documentation. If you give a function an RPKM matrix when it is expecting a raw count matrix or a log-expression matrix, then you're operating outside the specifications of the function. At this point, the function has no obligation to do anything sensible and Bad Things may happen.
In fact, if you're using the SingleCellExperiment container, these functions will automatically look for an assay with the expected name, e.g., "counts", "logcounts". If you trick a function into accepting an RPKM matrix - either by storing an RPKM matrix as the "counts" assay, or by explicitly redirecting the function to use an "rpkm" assay - then anything bad is on you.
Also, RPKMs are not counts.
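To make the distinction concrete, here is a minimal sketch of the usual RPKM definition (reads per kilobase of gene length per million mapped reads), written in Python purely as arithmetic. The gene names, lengths, counts and library size are made up, and the `rpkm` helper is hypothetical; it is not a function from any Bioconductor package.

```python
def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)

# Hypothetical toy data: two genes with the same read count in one cell.
library_size = 2_500_000  # total mapped reads in the cell (made up)
genes = {
    "short_gene": {"count": 31, "length_bp": 1_500},
    "long_gene": {"count": 31, "length_bp": 12_000},
}

rpkms = {
    name: rpkm(g["count"], g["length_bp"], library_size)
    for name, g in genes.items()
}

for name, value in rpkms.items():
    print(f"{name}: count={genes[name]['count']}, RPKM={value:.3f}")
```

Both genes have the same count, yet their RPKMs differ by the ratio of their lengths and neither value is an integer. A function written for count data has no reason to behave sensibly on such values.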
Is there a way to specify that my counts are log2(RPKM) counts, rather than raw reads?
Putting aside your incorrect terminology about "counts" and "raw reads", if the function of interest doesn't work on log2(RPKM) values, then there's no way to specify anything. Even if we were to add an option to allow you to specify something, the only sensible course of action for the function would be to simply throw an error. Not very helpful.
Further, does the calcAverage() function in scater treat raw reads differently than log-transformed RPKM values?
Read ?calcAverage and you'll see that it expects a count matrix. Whatever you give calcAverage, it will assume that the values are counts. RPKMs are not counts, so whatever calcAverage computes may or may not be gibberish.
Compare to the wording in ?nexprs, which only requires a general expression matrix. Of course, I would advise against playing too many word games; the only real way to know if a function is appropriate for a data type is to understand what it does.
I am wondering which downstream analyses that would impact.
With RPKMs: mean-variance trend modelling will fail. Any proper differential expression analysis will fail. The log-transformation will do funny things - the contribution of long genes to heterogeneity is implicitly downweighted. Also see my answer here.
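The point about the log-transformation can be illustrated with a small sketch. It assumes a pseudo-count of 1 is added before taking log2 (a common choice, but an assumption here), and the counts, gene lengths and library size are invented for the example. The `rpkm` and `log_rpkm` helpers are hypothetical.

```python
import math

def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)

def log_rpkm(count, gene_length_bp, library_size, pseudo=1.0):
    """log2(RPKM + pseudo); the pseudo-count of 1 is an assumption."""
    return math.log2(rpkm(count, gene_length_bp, library_size) + pseudo)

# Two hypothetical cells with the same library size and a 4-fold
# difference in read count for the gene of interest.
library_size = 1_000_000
low_count, high_count = 10, 40

def log_difference(gene_length_bp):
    """Difference in log2(RPKM + 1) between the two cells."""
    return (log_rpkm(high_count, gene_length_bp, library_size)
            - log_rpkm(low_count, gene_length_bp, library_size))

short_diff = log_difference(1_000)   # 1 kb gene
long_diff = log_difference(10_000)   # 10 kb gene
true_log_fold = math.log2(high_count / low_count)

print(f"true log2 fold change: {true_log_fold:.3f}")
print(f"short gene difference: {short_diff:.3f}")
print(f"long gene difference:  {long_diff:.3f}")
```

The underlying 4-fold change in counts is identical for both genes, but dividing by gene length pushes the long gene's values closer to the pseudo-count, so its log-difference between cells is shrunk more. In this toy setting, that is the sense in which long genes contribute less to the apparent heterogeneity.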
Short answer: throw away your RPKM values. No differential expression software wants them, and recovering the original raw gene counts from them is extremely hard to do with complete accuracy.
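One reason the reverse conversion is fragile, sketched below under made-up numbers: the same RPKM can arise from different counts, so the counts can only be recovered if the exact gene length and library size used in the original calculation are known. The `rpkm` and `counts_from_rpkm` helpers are hypothetical and exist only for this illustration.

```python
import math

def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene length per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)

def counts_from_rpkm(rpkm_value, gene_length_bp, library_size):
    """Invert the RPKM formula; only valid with the original inputs."""
    return rpkm_value * gene_length_bp * library_size / 1e9

gene_length_bp = 2_000  # made-up gene length

# Two hypothetical cells with different counts and library sizes.
cell_a = {"count": 6, "library_size": 1_000_000}
cell_b = {"count": 30, "library_size": 5_000_000}

rpkm_a = rpkm(cell_a["count"], gene_length_bp, cell_a["library_size"])
rpkm_b = rpkm(cell_b["count"], gene_length_bp, cell_b["library_size"])

# With the correct library size, the count for cell A is recovered.
recovered_a = counts_from_rpkm(rpkm_a, gene_length_bp, cell_a["library_size"])

# With a wrong guess for the library size, the "count" is simply wrong.
wrong_guess_a = counts_from_rpkm(rpkm_a, gene_length_bp, cell_b["library_size"])

print(f"RPKM A = {rpkm_a}, RPKM B = {rpkm_b}")
print(f"recovered count (correct library size): {recovered_a}")
print(f"recovered count (wrong library size):   {wrong_guess_a}")
```

The two cells have indistinguishable RPKMs despite a 5-fold difference in counts, so an RPKM matrix on its own does not determine the count matrix.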