Should expression data be filtered for protein-coding genes before running ssGSEA?
@2c913ccf
India

I'm planning to run ssGSEA on TPM expression data from an RNA-seq analysis. My dataset includes around 60,000 genes, encompassing not just protein-coding genes but also other biotypes like miRNA, lncRNA, pseudogenes, etc. Since the hallmark gene sets from MSigDB only consist of protein-coding genes, should I filter my expression data to include only protein-coding genes before running the ssGSEA analysis? Or does the GSVA package automatically handle the exclusion of non-protein-coding genes when calculating enrichment scores? What would be the best practice here?

ssgsea GSVA gsea • 442 views
Axel Klenk ★ 1.0k
@axel-klenk-3224
UPF, Barcelona, Spain

Hi,

The GSVA package has no notion of protein-coding genes. The only genes it removes automatically are those with zero variance across all samples (or across all non-zero values for sparse data). Method ssGSEA is an exception: genes with constant expression trigger a warning but are not removed, so it is up to the user to remove them if they wish.
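As a minimal sketch of the constant-expression filter described above, the logic can be written in a few lines. This is plain Python on a toy dictionary, not GSVA code; the gene names, values, and the helper `drop_constant_genes` are all hypothetical, and in practice the same test would be applied to the rows of the expression matrix before it is passed to ssGSEA.

```python
# Toy TPM matrix: gene -> expression values across samples (made-up numbers).
expr = {
    "GENE_A": [5.0, 7.5, 6.1],
    "GENE_B": [0.0, 0.0, 0.0],  # constant at zero
    "GENE_C": [2.2, 2.2, 2.2],  # constant but non-zero
    "GENE_D": [0.0, 3.4, 1.2],
}


def drop_constant_genes(matrix):
    """Return a copy of `matrix` without genes whose values never change.

    A gene is constant exactly when its maximum equals its minimum,
    which is equivalent to zero variance across samples.
    """
    return {
        gene: values
        for gene, values in matrix.items()
        if max(values) > min(values)
    }


filtered = drop_constant_genes(expr)
print(sorted(filtered))
```

Comparing the maximum with the minimum avoids floating-point issues that a computed variance could introduce for nearly constant rows.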

We usually recommend that, in addition, users filter out genes with very low expression values to reduce noise as well as memory footprint and computation time.
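One common way to express such a low-expression filter is "keep genes that reach a minimum TPM in a minimum number of samples". The sketch below illustrates that rule only; the cutoffs `min_tpm=1.0` and `min_samples=2`, the helper `filter_low_expression`, and the toy data are assumptions chosen for illustration, not values recommended by the GSVA authors.

```python
def filter_low_expression(matrix, min_tpm=1.0, min_samples=2):
    """Keep genes with TPM >= min_tpm in at least min_samples samples.

    Both thresholds are illustrative and should be tuned to the dataset.
    """
    kept = {}
    for gene, values in matrix.items():
        n_expressed = sum(1 for v in values if v >= min_tpm)
        if n_expressed >= min_samples:
            kept[gene] = values
    return kept


# Toy TPM matrix: gene -> values across four samples (made-up numbers).
tpm = {
    "GENE_A": [12.0, 8.5, 15.2, 9.9],  # expressed everywhere
    "GENE_B": [0.0, 0.1, 0.0, 0.3],    # essentially silent
    "GENE_C": [0.5, 2.0, 0.2, 3.1],    # expressed in two samples
    "GENE_D": [0.0, 0.0, 4.0, 0.0],    # expressed in one sample only
}

expressed = filter_low_expression(tpm, min_tpm=1.0, min_samples=2)
print(sorted(expressed))
```

Requiring expression in several samples rather than in any single one keeps genes that are informative for a subgroup while discarding one-off spikes.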

If exploratory analysis shows that certain biotypes have consistently higher (or lower) expression values than the protein-coding genes in the gene sets, I'd be inclined to remove those as well to avoid biasing the results, at least for methods ssGSEA and GSVA.
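A simple way to run that exploratory check is to summarise expression per biotype and compare the summaries. The sketch below pools all values for each biotype and reports the median; the gene-to-biotype mapping `biotype_of`, the helper `median_by_biotype`, and the numbers are hypothetical, and a real analysis would take biotypes from the annotation used for quantification.

```python
from statistics import median


def median_by_biotype(matrix, biotype_of):
    """Median expression per biotype, pooling all genes and samples.

    Genes missing from `biotype_of` are grouped under "unknown".
    """
    pooled = {}
    for gene, values in matrix.items():
        biotype = biotype_of.get(gene, "unknown")
        pooled.setdefault(biotype, []).extend(values)
    return {biotype: median(vals) for biotype, vals in pooled.items()}


# Toy TPM matrix and annotation (made-up values and labels).
tpm = {
    "GENE_A": [10.0, 12.0, 14.0],
    "GENE_B": [20.0, 18.0, 22.0],
    "GENE_C": [0.5, 1.0, 1.5],
    "GENE_D": [0.0, 0.5, 1.0],
}
biotype_of = {
    "GENE_A": "protein_coding",
    "GENE_B": "protein_coding",
    "GENE_C": "lncRNA",
    "GENE_D": "lncRNA",
}

summary = median_by_biotype(tpm, biotype_of)
for biotype, value in sorted(summary.items()):
    print(biotype, value)
```

A large, consistent gap between biotypes in such a summary would be the kind of signal that argues for removing the outlying biotype before computing enrichment scores.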
