I have a bulk-RNA experiment, and I have mapped the sequencing to the human genome using STAR and featureCounts, producing a count-matrix of ENSGs (rows) and samples (columns) of size [ExS].
I'm interested in running an edgeR>limma differential gene expression analysis only on protein-coding genes. As multiple ENSGs can be mapped to a single gene (e.g., PINX1 is mapped to ENSG00000254093 and ENSG00000258724), I'm wondering about the right timing of the mapping from ENSGs to gene names, and the right time to filter out non protein-coding genes.
The way I'm currently doing this, is
- Use
filterByExpr()
on the [ExS] count-matrix, making it size of [FxS]. - running
voomLmFit()
andeBayes()
/treat()
on the [FxS] count-matrix, producing aMArrayLM
object with F rows. - Using biomaRt, filtering the
MArrayLM
to include only protein-coding genes, shrinking the from F rows to P rows. - Using biomaRt, converting the ENSG IDs to gene names; The
MArrayLM
object still has P rows, but their names are now different: gene symbol names instead of ENSGs. - Continue on
decideTests()
on theMArrayLM
object with P rows.
This is what I have interpreted from Aaron Lun's comment:
I would only filter out uninteresting genes after eBayes. Then you can have your cake (reliable variance estimates) and eat it too (fewer tests during multiplicity correction). Note that you should have already filtered out low-abundance genes, as these won't provide much information for variance estimation anyway.
So, I'm wondering
- Is my approach the right one?
- Should the steps "use biomaRt to map the ENSGs to gene names" and "filter by protein-coding" be done separately? together? on different stages of the pipeline?
- So far, I haven't encountered any duplicated gene symbols after filtering... what should I do if they'll pop up on future analyses?
Thank you!
If you have versioned Ensembl Gene IDs, you can always use "GENEIDVERSION" for the
select
statement, or just strip off the version numbers.The other reason for doing this is because when you do
topTable(fit)
, you will automatically annotate the output data.Thank you; I was indeed using
DGEList
on the count matrix, but with the multiple edits of the post I forgot to include this step.BTW, shouldn't you be using
keep.lib.sizes = F
and re-do thecalcNormFactors()
after filtering of thedglst
low-expression ENSGs, as Gordon Smyth recommends here?That's an interesting question. Gordon's response is not what I expected though. I mean he recommends using keep.lib.sizes = FALSE, but then says that the results will be nearly the same regardless. However, the default for
[.DGEList
is keep.lib.sizes = TRUE. And in some sense the default argument is the recommended argument, because the developer set it as the default, and why would you use a default argument that you don't recommend?My view on the subject is that it probably doesn't matter and I am happy to keep the default.