Question

How to identify top most variable genes from log2 normalized data?

0

Entering edit mode

mohammedtoufiq91 ▴ 10

@mohammedtoufiq91-17679

Last seen 4 months ago

United States

Hi,

I have performed RNAseq analysis (filtering, normalization, and interested group comparisons using EdgeR package. Further, I am also interested in performing unsupervised Hierarchical clustering heatmap (maybe using ComplexHeatmap or coolmap). Here, unsupervised refers to employing a list of genes that is not identified through group comparisons (i.e. not informed by grouping labels). Typically, such lists would be filtered based on detection thresholds and variance across samples (e.g. detected in at least 10% of samples, and top 1000 most variable genes) or something along those lines. As a next step a, I want to generate a heatmap.

I am not sure if rowVars on the log2cpm data would get do this. Please provide suggestions if there is a built in method in EdgeR or a efficient way to identify identify top most variable genes.

Thank you,

Mohammed

supervised edgeR variance RNASeq heatmaps • 5.8k views

ADD COMMENT • link updated 2.2 years ago by Gordon Smyth 52k • written 2.3 years ago by mohammedtoufiq91 ▴ 10

score 3 · Answer 1 · 2022-12-27

3

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Yes, rowVars on filtered log2cpm values would be a standard way to select genes for an unsupervised heatmap.

Personally, though, I don't see a lot of value in unsupervised heatmaps in the context of a RNA-seq DE analysis. Genes that are highly variable but not DE don't seem of much interest, except perhaps to trouble-shoot outlier samples.

ADD COMMENT • link 2.3 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much Gordon Smyth Yes, I too agree. I just want to include this as an additional step.

Maybe this should work for the following cases?

Case 1: (Getting the top 1000 highly variable genes) - These are the genes most variable across all samples regardless of which samples they are

topVarGenes <- head(order(rowVars(cpm_with_log2), decreasing = TRUE), 1000)
mat_1  <- cpm_with_log2[ topVarGenes, ]

Case 2: Detected in at least 10% of samples

topVarGenes.v1 <- rowVars( cpm_with_log2) > 18/10 ## (Total samples, n=18)
mat_2 <- cpm_with_log2[topVarGenes.v1, ]

Case 3: Filtered based on detection thresholds and variance across samples (e.g. detected in at least 10% of samples, and top 1000 most variable genes)

Do you know how I could write the code to include 1000 genes detected in at least 10% of samples here?

ADD REPLY • link 2.3 years ago mohammedtoufiq91 ▴ 10

1

Entering edit mode

The filtering based on detection thresholds is already done as part of the edgeR DE analysis, for example by filterByExpr. You simply run cpm on the filtered DGEList object to get the log2CPM values. There is no need to filter again.

ADD REPLY • link 2.3 years ago Gordon Smyth 52k

0

Entering edit mode

Gordon Smyth, yes, I used filterByExpr > cpm. I understand that now I have to just run the below as I want to select high variable genes, am I right?

topVarGenes <- head(order(rowVars(cpm_with_log2), decreasing = TRUE), 1000)
mat_1  <- cpm_with_log2[ topVarGenes, ]

ADD REPLY • link 2.3 years ago mohammedtoufiq91 ▴ 10

0

Entering edit mode

Your code doesn't look right to me. I think you're using order() where sort() would be more appropriate.

To order your cpm matrix by row variances:

o <- order(rowVars(cpm_with_log2), decreasing = TRUE)
cpm_ordered <- cpm_with_log2[o,]

ADD REPLY • link 2.2 years ago Gordon Smyth 52k