I have a set of non coding genes and a set of coding genes expressed across a number of samples. Based on the boxplots it is obvious that the lncRNAs are expressed lower than coding genes (log2 non-normallized counts). But what is the appropriate test for this kind of data, and is such a test implemented in e.g. DESeq2?
In my opinion, there is no rigorous statistical test available. Any obvious test, such a t-test or Wilcoxon test, assumes that the genes are independent whereas they are not. The dependence structure between the genes is unknown and, if it could be incorporated into the test, might have a dramatic effect on the computed p-value.
Since the result is so obvious, do you need a test?
I don't know — whenever someone asks me to say something about the relative expression of different genes within the same samples, I never feel comfortable doing it ... in fact, I'll fight tooth and nail against it for the following reasons.
First, and perhaps most importantly: what does this comparison tell you? In the scenario, if you have fewer amounts (in TPM(!)) of lncRNAs than mRNAs, does that mean the lncRNAs have less impact/activity than mRNAs? Or if you have less mRNA for gene A than gene B, are "you" going to then assert that gene A has less functional influence on the cell than gene B?
If I offered to give you a "pinch" of uranium or a pound of salt, I bet you'd take the pound of salt.
Second: do we really think length normalization is all we need? Perhaps the results are out there (so please point me to those references if so) but are there convincing publications out there that show accuracy of TPMs to something like molar quantities of transcripts for a large number of transcripts?
If you want to know how the relative abundance of genes change across conditions (ie. ratio of gene A:gene B in condition vs condition 2), then fine: we're good ;-)
I don't disagree with you, but I did want to point out modern methods do much more than just length normalization to produce TPMs, e.g. almost all correct for size selection, and many correct for biases from degradation, random hexamer priming, amplification, and sequencing errors.
If you want to know more about the bias-correction methods, I think you might have to read the docs & source code of Kallisto, Salmon, et. al. They have both added lots of bias-correction features fairly recently, so I don't think there's a publication covering all or even most of them.
This includes a fragment sequence bias correction similar to that described in alpine. The Salmon implementation then is preferred over alpine for producing transcript quantifications, as it is substantially faster and probabilistically allocates reads which map to multiple genomics locations.
I'd take TPM values from RSEM or EXPRESS (accurate quantification of transcript abundance) then simply do a t-test or non parametric equivalent on the TPMs between the lncRNAs and mRNAs.
DESEQ does not deal with changes in transcript abundance, but in fold changes between conditions.
Yes, I agree with Chris. I would do a non parametric test (e.g. wilcox.test) of TPMs from RSEM, eXpress, or the faster methods: Salmon, Sailfish, kallisto.
For others reading the thread who don't see why not to use counts here: the reason is that this question concerns testing average expression across *genes*, not expression across samples in different conditions.
Thanks! Could I also do a t-test using normalized counts from Deseq? These should take into account differences in sequencing depth, so should be similar to TPMs? The two gene sets are from the same sequencing libraries by the way.
Thanks for this point Gordon, I hadn't thought of that.