Should 0 values for gene counts be removed prior to ssGSEA/GSVA analysis?
João • 0
@9504eb7d
Last seen 2.4 years ago
United Kingdom

Since the ssGSEA/GSVA algorithms work by determining how much more highly expressed the genes in our gene list are compared to all other genes within the sample, should we remove genes with 0 counts from each individual sample before running the algorithm?

Say gene x is present in sample 1 but not sample 2, should we omit it from sample 2's calculations but keep it for sample 1? (i.e. replace all 0 with "NA")

In theory, if two samples have exactly the same expression of our genes of interest, but sample 1 has 1000 genes with non-zero values while sample 2 has 500 genes with non-zero values and 500 genes with zero values, not removing the zeros would give both samples the same score, even though sample 2 clearly behaves differently.

Should we remove these 0 count genes?

GSEA GSVA • 2.8k views
Robert Castelo ★ 3.4k
@rcastelo
Last seen 17 hours ago
Barcelona/Universitat Pompeu Fabra

Hi, we advise removing lowly-expressed genes in the same way you would before a differential expression analysis. This has been previously discussed in this forum, see for instance this post or this other one.


Thank you for your answer, but I am afraid this does not answer my question.

For example, if gene A has 0 counts in sample 1 and around 1000 counts in all other samples, it is not a low-expression gene in general, so I cannot simply remove it from all samples.

However, I will be skewing the results for sample 1 if I include that gene in my calculations, whether or not that gene is in my gene set (for this example, let's assume it's not). But if I remove gene A from the calculation for sample 1 alone, the result will also be skewed, as the final score would then be calculated against a smaller number of total "outside" genes.

I understand that removing a single gene has a negligible effect overall, but if we apply this reasoning to all genes in our samples, it could really skew our results.

Therefore the question is: which of these approaches gives a more meaningful result and is there any other approach that I am not thinking of to solve this?


In my opinion, you definitely should not remove genes from individual samples only: the gene matrix used in the analysis should contain the same genes for all samples. Thus, if you decide to remove a gene, it gets removed from the whole data set, and if you decide to keep a gene, it's kept for all samples. If you have genes that are very lowly expressed (or zero) in some samples but very highly expressed in others, you may want to keep those, as they could be important in the biological context, e.g. as on/off switches.

To filter lowly expressed genes you may use a range of filtering criteria, e.g. a gene only gets removed if it is zero in ALL samples, or if it is zero in more than a given percentage of the samples, and/or it's kept when it has more than x counts (any cutoff you deem appropriate) in at least one sample, or in at least a given percentage of samples. Unfortunately there is no generally accepted gold standard (that I know of) for defining lowly expressed genes; it's something you need to define for yourself based on your specific data set and biological context.
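To make the filtering idea above concrete, here is a minimal sketch in plain Python. It is only an illustration, not part of GSVA or any other package: the function name `filter_low_count_genes`, the thresholds `min_count=10` and `min_fraction=0.25`, and the toy counts are all assumptions picked for the example. The point it shows is that a gene is either kept for every sample or dropped from every sample, never removed per sample.

```python
from typing import Dict, List


def filter_low_count_genes(
    counts: Dict[str, List[int]],
    min_count: int = 10,         # hypothetical cutoff, choose per data set
    min_fraction: float = 0.25,  # hypothetical cutoff, choose per data set
) -> Dict[str, List[int]]:
    """Keep a gene for ALL samples if it reaches `min_count` in at least
    `min_fraction` of the samples; otherwise drop it from ALL samples."""
    kept = {}
    for gene, row in counts.items():
        # Number of samples in which this gene reaches the count cutoff.
        n_expressed = sum(1 for c in row if c >= min_count)
        if n_expressed >= min_fraction * len(row):
            kept[gene] = row  # the whole row is kept, zeros included
    return kept


# Toy data: rows are genes, columns are samples 1..4 (made-up numbers).
counts = {
    "geneA": [0, 1000, 950, 1100],  # zero in sample 1, high elsewhere
    "geneB": [0, 0, 0, 0],          # zero everywhere
    "geneC": [2, 0, 1, 3],          # consistently very low
    "geneD": [50, 60, 0, 40],
}

filtered = filter_low_count_genes(counts)
print(sorted(filtered))
```

With these made-up thresholds, gene A stays in the matrix for all four samples, including its zero in sample 1, while genes that are zero or near zero everywhere are dropped for all samples.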
