Question

Differential expression of protein coding ENSGs (Ensembl IDs) | When to biomaRt?

0

Entering edit mode

Jonathan ▴ 10

@c31cf0e5

Last seen 2 hours ago

United States

I have a bulk-RNA experiment, and I have mapped the sequencing to the human genome using STAR and featureCounts, producing a count-matrix of ENSGs (rows) and samples (columns) of size [ExS].
I'm interested in running an edgeR>limma differential gene expression analysis only on protein-coding genes. As multiple ENSGs can be mapped to a single gene (e.g., PINX1 is mapped to ENSG00000254093 and ENSG00000258724), I'm wondering about the right timing of the mapping from ENSGs to gene names, and the right time to filter out non protein-coding genes.

The way I'm currently doing this, is

Use filterByExpr() on the [ExS] count-matrix, making it size of [FxS].
running voomLmFit() and eBayes()/treat() on the [FxS] count-matrix, producing a MArrayLM object with F rows.
Using biomaRt, filtering the MArrayLM to include only protein-coding genes, shrinking the from F rows to P rows.
Using biomaRt, converting the ENSG IDs to gene names; The MArrayLM object still has P rows, but their names are now different: gene symbol names instead of ENSGs.
Continue on decideTests() on the MArrayLM object with P rows.

This is what I have interpreted from Aaron Lun's comment:

I would only filter out uninteresting genes after eBayes. Then you can have your cake (reliable variance estimates) and eat it too (fewer tests during multiplicity correction). Note that you should have already filtered out low-abundance genes, as these won't provide much information for variance estimation anyway.

So, I'm wondering

Is my approach the right one?
Should the steps "use biomaRt to map the ENSGs to gene names" and "filter by protein-coding" be done separately? together? on different stages of the pipeline?
So far, I haven't encountered any duplicated gene symbols after filtering... what should I do if they'll pop up on future analyses?

Thank you!

limma edgeR biomaRt • 132 views

ADD COMMENT • link updated 45 minutes ago by James W. MacDonald 67k • written 5 hours ago by Jonathan ▴ 10

score 2 · Accepted Answer · 2025-01-16

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 45 minutes ago

United States

You say you have a matrix of counts, but that's not what you should be using. Instead you should be using a DGEList object. I would normally do something like this

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("homo sapiens","ensdb"))
## now identify the EnsDb that matches with the Ensembl build you aligned against
## I'll just assume it's build 113
> ensdb <- hub[["AH119325"]]
## where XXXX is the correct number for your Ensembl build
> annot <- select(ensdb, row.names(<your matrix object's name>, c("GENENAME", "GENEBIOTYPE"), "GENEID")
## I don't have your row.names, so as an example
> annot <- select(ensdb, head(keys(ensdb)), c("GENENAME","GENEBIOTYPE"), "GENEID")
> annot
           GENEID GENENAME
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460    FIRRM
6 ENSG00000000938      FGR
     GENEBIOTYPE
1 protein_coding
2 protein_coding
3 protein_coding
4 protein_coding
5 protein_coding
6 protein_coding
## this should always be TRUE, but just to be sure...
> dglst <- if(isTRUE(all.equal(annot$GENEID, row.names(<your matrix name>)))) DGList(<your matrix name>, genes = annot)
> dglst <- calcNormFactors(dglst)
> keep <- filterByExpr(dglst, design)
> dglst <- dglst[keep,]
> fit <- eBayes(contrasts.fit(voomLmFit(dglst, design), contrast))
## now if you only want protein coding
> fit.filt <- fit[fit$genes$GENEBIOTYPE %in% "protein_coding",]

I don't ever worry about duplicate gene symbols, because I just think of them as being easy names for my biologist friends to remember. There's no guarantee that they are unique anyway. If you do have duplicates, you can always go to ensembl.org to figure out what's going on, but that is a one-off type deal, and not something I would try to do programmatically.

ADD COMMENT • link 4 hours ago James W. MacDonald 67k

0

Entering edit mode

If you have versioned Ensembl Gene IDs, you can always use "GENEIDVERSION" for the select statement, or just strip off the version numbers.

ADD REPLY • link 4 hours ago James W. MacDonald 67k

0

Entering edit mode

The other reason for doing this is because when you do topTable(fit), you will automatically annotate the output data.

ADD REPLY • link 4 hours ago James W. MacDonald 67k

0

Entering edit mode

Thank you; I was indeed using DGEList on the count matrix, but with the multiple edits of the post I forgot to include this step.
BTW, shouldn't you be using keep.lib.sizes = F and re-do the calcNormFactors() after filtering of the dglst low-expression ENSGs, as Gordon Smyth recommends here?

> dglst <- calcNormFactors(dglst)
> keep <- filterByExpr(dglst, design)
> dglst <- dglst[keep,,keep.lib.sizes = F]
> dglst <- calcNormFactors(dglst)
> fit <- eBayes(contrasts.fit(voomLmFit(dglst, design), contrast))

ADD REPLY • link 2 hours ago Jonathan ▴ 10

0

Entering edit mode

BTW, shouldn't you be using keep.lib.sizes = F and re-do the calcNormFactors() after filtering of the dglst low-expression ENSGs, as Gordon Smyth recommends here?

That's an interesting question. Gordon's response is not what I expected though. I mean he recommends using keep.lib.sizes = FALSE, but then says that the results will be nearly the same regardless. However, the default for [.DGEList is keep.lib.sizes = TRUE. And in some sense the default argument is the recommended argument, because the developer set it as the default, and why would you use a default argument that you don't recommend?

My view on the subject is that it probably doesn't matter and I am happy to keep the default.

ADD REPLY • link 45 minutes ago James W. MacDonald 67k