Hi there,
I have been using edgeR
for quite some time now, and now I am facing with a question that applies to experimental designs that are not the "traditional" control vs treated design, but to cases where multiple treatments are being tested.
In my experimental design, I have 3 different micro plastic treatments (whose effects were assessed: MIX1, MIX2 and MIX3) and one control (i.e. CTL). For each treatment I have 4 independent biological replicates (n=16 total RNASeq libraries). The goal is to assess each of the micro plastic against the control by doing 3 pairwise comparisons.
My question applies to the initial filtering steps. I use filterByExpr()
on the raw count table and I provide the group
variable (which I input as a factor) to define all the 4 groups, then I set min.count = 5, min.prop = 0.8
.
All works like a charm, I then run normalisation and DEA by running all the pairwise comparisons that I mentioned above. However my doubts arise when looking at the results of some pariwise analysis. In example, in the comparison CTL vs MIX1
I get, among the other differentially expressed genes, this one (I am reporting the statistics):
logFC logCPM LR PValue FDR
transcript_172821 22.80989877 1.655460041 37.96739989 7.19366E-10 2.91137E-06
This gene has a very low FDR and high logFC. However, when looking back at the original count table I see that the counts were as this:
MIX1_r1 MIX1_r2 MIX1_r3 MIX1_r4 CTL_r1 CTL_r2 CTL_r3 CTL_r4 MIX2_r1 MIX2_r2 MIX2_r3 MIX2_r4 MIX3_r1 MIX3_r2 MIX3_r3 MIX3_r4
transcript_172821 0 0 0 0 112.941 0 0 0 0 45.9024 6 56.1972 0 0 0 0
I understand that the gene has passed the filtering criteria because it was expressed in one CTL sample and in 3 other samples belonging to MIX2, however I am wondering on the biological meaning of this result: Should I not consider this gene as differentially expressed in the comparison CTL vs MIX1
since it only had reads in 1 sample of the 8 samples (i.e. 4 CTL and 4 MIX1) being compared? So I am wondering if I should apply the filterByExpr
separately each time on a count table that only includes the two treatments that I am going to include in the pairwise comparison, rather than applying it on the whole count table (i.e. with all 4 treatments together), to avoid situations like the one above?
I'd like to know your feelings about
Looking forward to hearing from you people
Thanks. Luca
Dear Gordon Smyth thanks a lot for your reply. I have got a couple of extra comments to follow up on your reply.
We do work with a transcriptome of a non model marine mollusc whose genome is not available. The transcriptome has been generated using Iso-Seq technology from 3 different tissues. The obtained high quality reads were then clustered to call isoforms. This is the transcriptome file we work with. We generally produce single end Lexogen 3' tag libraries for our RNAseq analyses which are 75bp.
I have checked Baldoni et al and also the associated GitHub and would like to try your
catchKallisto
function. However I was not able to find an example command to run kallisto accordingly to our needs (there is a reference to the--bootstrap-samples
option but not more than that). Could you please provide such example with the "must-to-include" parameters given our setup?As far as the
glmLRT
I was using it after applying theRUVs
normalisation from theRUVSeq
package where they suggest the following (from RUVseq vignette, section 2.3):If you deem that it's ok to use
glmQLFit
even after applying theRUVs
normalisation, then I will edit my code like this:Thanks for the extra information. I see now that you are using Lexogen 3' tag libraries (i.e., QuantSeq). QuantSeq measures genes rather than isoforms, so there is no read assigment ambiguity and hence no need for catchKallisto or catchSalmon etc. The usual gene-level edgeR pipeline should be fine.
Yes, I recommend that you use glmQLFit whether you are using RUVSeq or not. The RUVSeq vignette is a bit out of date in this respect, it probably pre-dates our quasi pipeline.
Thanks for the quick and precise reply. So the
catchkallisto
function is intended for "classic" illumina sequencing where the whole mRNA is sequenced, am I right?Anyway, perfect so I will edit my code to perform the
glmQLFit
function rather than theglmLRT
one. And possibly increase a bit the thresholds ofmin.counts
andmin.total.counts
to try and avoid as much as possible the cases like the one I showed at the top of the post.Thanks again for your precious help,
All the best
Luca
Sorry Gordon, I have got an extra question regarding the
filterByExpr
function. In your edgeR's vignette (section 2.7) is written: "The function accesses the group factor contained in y in order to compute the minimum group size, but the filtering is performed independently of which sample belongs to which group so that no bias is introduced."What kind of bias would that be? I'm curious to understand but cannot figure out what it might be
Thanks Luca
Bias means selectively keeping genes that are DE, leading to failure to control the FDR in downstream DE analyses. For example, if you keep genes expressed in all four samples of at least one group, then the downstream DE analysis will fail to control the FDR.
Yes, catchKallisto() and catchSalmon() are for whole mRNA RNA-seq and, even then, only for transcript-level analyses.
Thanks a lot!
Have a good day Luca