I am trying to run a differential expression analysis for proteomics data using limma. The data has been quantified and normalized using the LFQ method - the sample distributions are log normal, but with a high median 10e7, ranging from 0 to 1e10
The problem arises with genes that are present in one condition and absent another. Before putting the values into limma I need to log them, which converts 0 - Inf, and drops the genes from the analysis. I tried replacing Inf with 0 or adding pseudocounts, but both things break the model.
Is there any way around this?
Tnx
This is how the vulcano plot looks like if I either add pseudocounts or replace Inf with 0
That is what you should expect, no? If you have a protein that is found at high concentration in one sample, and is essentially undetectable in another, shouldn't the fold change be huge, and the p-value really small? I am not sure this constitutes a breakage of the model.
It might not be obvious from this plot, but problem arises in the calculated pvalues - a lot of genes that have a considerable fold change, and have been experimentally validated are not found to be statistically significant, i.e., a gene which has a lfc of 2 gets a p.value of 0.76.
This seems to be a separate problem that's unrelated to the pseudo-counts. If your example with a log-fold change of 2 is being affected by a small pseudo-count, then the raw expression value would be in the single digits, which is negligible compared to the values in the millions/billions you've described in your original post. That would suggest that it's not being expressed much or at all; at the very least, there's not a lot of evidence to detect differences, and the variability of low-abundance genes tends to be higher (at least, for RNA-seq/microarray data), which will reduce power.
So, if your example gene is actually expressed at decent levels, the pseudo-count addition shouldn't affect its inferences; instead, I'd suggest you have a look at its variability across replicates. This may explain why a decent log-fold change has such a large p-value. If the gene's expression turns out to be highly variable across replicates, then it makes sense to not call it a DE gene based on this data. Which is disappointing, but that's life.