Question

Question about filterByExpr function edgeR

0

Entering edit mode

lucap • 0

@lucap-20484

Last seen 19 months ago

Italy

Hi there, I have been using edgeR for quite some time now, and now I am facing with a question that applies to experimental designs that are not the "traditional" control vs treated design, but to cases where multiple treatments are being tested.
In my experimental design, I have 3 different micro plastic treatments (whose effects were assessed: MIX1, MIX2 and MIX3) and one control (i.e. CTL). For each treatment I have 4 independent biological replicates (n=16 total RNASeq libraries). The goal is to assess each of the micro plastic against the control by doing 3 pairwise comparisons.
My question applies to the initial filtering steps. I use filterByExpr() on the raw count table and I provide the group variable (which I input as a factor) to define all the 4 groups, then I set min.count = 5, min.prop = 0.8.

All works like a charm, I then run normalisation and DEA by running all the pairwise comparisons that I mentioned above. However my doubts arise when looking at the results of some pariwise analysis. In example, in the comparison CTL vs MIX1 I get, among the other differentially expressed genes, this one (I am reporting the statistics):

    logFC   logCPM  LR  PValue  FDR
transcript_172821   22.80989877 1.655460041 37.96739989 7.19366E-10 2.91137E-06

This gene has a very low FDR and high logFC. However, when looking back at the original count table I see that the counts were as this:

    MIX1_r1 MIX1_r2 MIX1_r3 MIX1_r4 CTL_r1  CTL_r2  CTL_r3  CTL_r4 MIX2_r1  MIX2_r2 MIX2_r3 MIX2_r4 MIX3_r1 MIX3_r2 MIX3_r3 MIX3_r4
transcript_172821   0   0   0   0   112.941 0   0   0   0   45.9024 6   56.1972 0   0   0   0

I understand that the gene has passed the filtering criteria because it was expressed in one CTL sample and in 3 other samples belonging to MIX2, however I am wondering on the biological meaning of this result: Should I not consider this gene as differentially expressed in the comparison CTL vs MIX1 since it only had reads in 1 sample of the 8 samples (i.e. 4 CTL and 4 MIX1) being compared? So I am wondering if I should apply the filterByExpr separately each time on a count table that only includes the two treatments that I am going to include in the pairwise comparison, rather than applying it on the whole count table (i.e. with all 4 treatments together), to avoid situations like the one above?

I'd like to know your feelings about
Looking forward to hearing from you people

Thanks. Luca

edgeR • 2.9k views

ADD COMMENT • link 20 months ago lucap • 0

score 0 · Answer 1 · 2023-08-19

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

We can't offer biological interpretations because it depends on your experiment and the scientific context.

You could subset the data for each pairwise comparison, but I don't recommend that and I don't do it myself. I would instead revisit how you are doing the differential expression analysis and why your data is so over-dispersed:

Leave the filterByExpr settings at the default values. This particular transcript would already have been filtered had you left min.count at the default value of 10. And changing min.prop can't have any effect for your data.
You appear to be doing analysis of transcripts (isoforms) rather than of genes. edgeR is not designed to analyse transcript level data unless you account for read assignment ambiguity as described by Baldoni et al 2023.
We recommend a quasi-likelihood analysis with glmQLFit etc but you are using the older glmLRT pipeline.
If you use glmLRT, then it is essential that you use tagwise dispersion values rather than trended or common. You might already be doing that but, given that this gene is so significant, perhaps not. The negative binomial dispersion for this transcript will be very large.

ADD COMMENT • link 20 months ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon Smyth thanks a lot for your reply. I have got a couple of extra comments to follow up on your reply.

We do work with a transcriptome of a non model marine mollusc whose genome is not available. The transcriptome has been generated using Iso-Seq technology from 3 different tissues. The obtained high quality reads were then clustered to call isoforms. This is the transcriptome file we work with. We generally produce single end Lexogen 3' tag libraries for our RNAseq analyses which are 75bp.

I have checked Baldoni et al and also the associated GitHub and would like to try your catchKallisto function. However I was not able to find an example command to run kallisto accordingly to our needs (there is a reference to the --bootstrap-samples option but not more than that). Could you please provide such example with the "must-to-include" parameters given our setup?

As far as the glmLRT I was using it after applying the RUVs normalisation from the RUVSeq package where they suggest the following (from RUVseq vignette, section 2.3):

y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)

If you deem that it's ok to use glmQLFit even after applying the RUVs normalisation, then I will edit my code like this:

y = DGEList(counts=set1, group=x)
y = calcNormFactors(y) 
y <- estimateDisp(y, MatrixDesign,robust=TRUE)
fit <- glmQLFit(y, MatrixDesign,robust=TRUE)

ADD REPLY • link 20 months ago lucap • 0

0

Entering edit mode

Thanks for the extra information. I see now that you are using Lexogen 3' tag libraries (i.e., QuantSeq). QuantSeq measures genes rather than isoforms, so there is no read assigment ambiguity and hence no need for catchKallisto or catchSalmon etc. The usual gene-level edgeR pipeline should be fine.

Yes, I recommend that you use glmQLFit whether you are using RUVSeq or not. The RUVSeq vignette is a bit out of date in this respect, it probably pre-dates our quasi pipeline.

ADD REPLY • link 20 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks for the quick and precise reply. So the catchkallisto function is intended for "classic" illumina sequencing where the whole mRNA is sequenced, am I right?
Anyway, perfect so I will edit my code to perform the glmQLFit function rather than the glmLRT one. And possibly increase a bit the thresholds of min.counts and min.total.counts to try and avoid as much as possible the cases like the one I showed at the top of the post.

Thanks again for your precious help,

All the best
Luca

ADD REPLY • link 20 months ago lucap • 0

0

Entering edit mode

Sorry Gordon, I have got an extra question regarding the filterByExpr function. In your edgeR's vignette (section 2.7) is written: "The function accesses the group factor contained in y in order to compute the minimum group size, but the filtering is performed independently of which sample belongs to which group so that no bias is introduced."

What kind of bias would that be? I'm curious to understand but cannot figure out what it might be

Thanks Luca

ADD REPLY • link 20 months ago lucap • 0

0

Entering edit mode

Bias means selectively keeping genes that are DE, leading to failure to control the FDR in downstream DE analyses. For example, if you keep genes expressed in all four samples of at least one group, then the downstream DE analysis will fail to control the FDR.

ADD REPLY • link 20 months ago Gordon Smyth 52k

0

Entering edit mode

Yes, catchKallisto() and catchSalmon() are for whole mRNA RNA-seq and, even then, only for transcript-level analyses.

ADD REPLY • link 20 months ago Gordon Smyth 52k

0

Entering edit mode

Thanks a lot!

Have a good day Luca

ADD REPLY • link 20 months ago lucap • 0