Test whether expression levels of two gene sets are significantly different
4
0
Entering edit mode
Jon Bråte ▴ 260
@jon-brate-6263
Last seen 6 months ago
Norway

Hi,

I have a set of non coding genes and a set of coding genes expressed across a number of samples. Based on the boxplots it is obvious that the lncRNAs are expressed lower than coding genes (log2 non-normallized counts). But what is the appropriate test for this kind of data, and is such a test implemented in e.g. DESeq2?

http://imgur.com/a/Xd9jT


Thanks, Jon

R DESeq2 gene set testing gene expression • 2.7k views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

In my opinion, there is no rigorous statistical test available. Any obvious test, such a t-test or Wilcoxon test, assumes that the genes are independent whereas they are not. The dependence structure between the genes is unknown and, if it could be incorporated into the test, might have a dramatic effect on the computed p-value.

Since the result is so obvious, do you need a test?

ADD COMMENT
0
Entering edit mode

Thanks for this point Gordon, I hadn't thought of that.

ADD REPLY
1
Entering edit mode
@steve-lianoglou-2771
Last seen 22 months ago
United States

I don't know — whenever someone asks me to say something about the relative expression of different genes within the same samples, I never feel comfortable doing it ... in fact, I'll fight tooth and nail against it for the following reasons.

First, and perhaps most importantly: what does this comparison tell you? In the scenario, if you have fewer amounts (in TPM(!)) of lncRNAs than mRNAs, does that mean the lncRNAs have less impact/activity than mRNAs? Or if you have less mRNA for gene A than gene B, are "you" going to then assert that gene A has less functional influence on the cell than gene B?

If I offered to give you a "pinch" of uranium or a pound of salt, I bet you'd take the pound of salt.

Second: do we really think length normalization is all we need? Perhaps the results are out there (so please point me to those references if so) but are there convincing publications out there that show accuracy of TPMs to something like molar quantities of transcripts for a large number of transcripts?

 

If you want to know how the relative abundance of genes change across conditions (ie. ratio of gene A:gene B in condition vs condition 2), then fine: we're good ;-)

 

ADD COMMENT
2
Entering edit mode

I don't disagree with you, but I did want to point out modern methods do much more than just length normalization to produce TPMs, e.g. almost all correct for size selection, and many correct for biases from degradation, random hexamer priming, amplification, and sequencing errors.

ADD REPLY
0
Entering edit mode

Thanks for pointing that out, Mike ... could you point towards a good rabbit hole to start digging into this topic "a new"?

I reckon starting with your alpine paper and digging through the refs from there eventually will get me to see the light?

 

ADD REPLY
0
Entering edit mode

If you want to know more about the bias-correction methods, I think you might have to read the docs & source code of Kallisto, Salmon, et. al. They have both added lots of bias-correction features fairly recently, so I don't think there's a publication covering all or even most of them.

ADD REPLY
0
Entering edit mode

I can speak to the Salmon part. The latest bias correction methods for Salmon are described in the updated bioRxiv print:

http://biorxiv.org/content/early/2016/08/30/021592

This includes a fragment sequence bias correction similar to that described in alpine. The Salmon implementation then is preferred over alpine for producing transcript quantifications, as it is substantially faster and probabilistically allocates reads which map to multiple genomics locations.

ADD REPLY
0
Entering edit mode

The alpine pre-print is now published in NBT, I wrote up a blog post describing the overall point: 

Also see the updated Salmon manuscript linked below as to how this is implemented in the latest version of Salmon.

ADD REPLY
0
Entering edit mode
chris86 ▴ 420
@chris86-8408
Last seen 5.0 years ago
UCL, United Kingdom

I'd take TPM values from RSEM or EXPRESS (accurate quantification of transcript abundance) then simply do a t-test or non parametric equivalent on the TPMs between the lncRNAs and mRNAs.

DESEQ does not deal with changes in transcript abundance, but in fold changes between conditions.

ADD COMMENT
1
Entering edit mode

Yes, I agree with Chris. I would do a non parametric test (e.g. wilcox.test) of TPMs from RSEM, eXpress, or the faster methods: Salmon, Sailfish, kallisto.

For others reading the thread who don't see why not to use counts here: the reason is that this question concerns testing average expression across *genes*, not expression across samples in different conditions.

ADD REPLY
0
Entering edit mode

Thanks! Could I also do a t-test using normalized counts from Deseq? These should take into account differences in sequencing depth, so should be similar to TPMs? The two gene sets are from the same sequencing libraries by the way.

ADD REPLY
0
Entering edit mode

No because they don't take into account length

ADD REPLY
0
Entering edit mode
Jon Bråte ▴ 260
@jon-brate-6263
Last seen 6 months ago
Norway

Hi,

Thank you all for valuable and interesting comments. I learned a lot from this.

Jon
 

ADD COMMENT

Login before adding your answer.

Traffic: 438 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6