I was wondering what a good cutoff is for a truly expressed probe in a limma data analysis (single-channel Agilent):
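# Separate expressed probes from background signal using the negative control probes (ControlType == -1)
# neg95: 95% quantile of the negative control intensities, computed per array
neg95 <- apply(G.norm.quant$E[G.norm.quant$genes$ControlType == -1, , drop=FALSE], 2, function(x) quantile(x, probs=0.95))
# cutoff: probes x arrays matrix holding 1.1 * neg95 for each array
cutoff <- matrix(1.1*neg95, nrow(G.norm.quant), ncol(G.norm.quant), byrow=TRUE)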
This was based on the limma user guide: I took 110% of the 95% quantile of the negative control probe intensities on each array. So for a background log2 value of 5, 5.5 was used as the cutoff. This gave me 44000 expressed probes out of the 60000 probes tested (8x60k array), which seems a lot to me. Would it be better to take 150% or 200%?
The choice of 95% quantile for negative controls and the 1.1 multiplier as cutoff were ad hoc choices for this analysis in the limma guide. The exact values do not matter as a wide range of values will give similarly good results.
Having said that, a multiplier of 200% sounds overly conservative to me, as the 95% quantile is already in the top range of possible values for non-expressed probes.
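To get a feel for how much the multiplier changes the number of probes kept, a small sketch along the following lines may help. It uses simulated log2 intensities rather than real data, and the names `E`, `is.neg` and `count.expressed` are made up for illustration. It also assumes a probe counts as expressed if it exceeds the cutoff on at least 4 arrays, which is an assumption not stated in the question.

```r
# Simulated stand-in for G.norm.quant$E: 1000 probes x 8 arrays, log2 scale.
# This is NOT real data; it only illustrates the mechanics.
set.seed(1)
n.probes <- 1000
n.arrays <- 8
E <- matrix(rnorm(n.probes * n.arrays, mean = 8, sd = 2), n.probes, n.arrays)

# Pretend the first 50 rows are negative controls with low intensities.
is.neg <- rep(c(TRUE, FALSE), c(50, n.probes - 50))
E[is.neg, ] <- rnorm(50 * n.arrays, mean = 5, sd = 0.3)

# 95% quantile of the negative controls, one value per array.
neg95 <- apply(E[is.neg, , drop = FALSE], 2, quantile, probs = 0.95)

# Count probes above multiplier * neg95 on at least min.arrays arrays.
# min.arrays = 4 is an assumed rule, not taken from the question.
count.expressed <- function(multiplier, min.arrays = 4) {
  cutoff <- matrix(multiplier * neg95, nrow(E), ncol(E), byrow = TRUE)
  sum(rowSums(E > cutoff) >= min.arrays)
}

multipliers <- c(1.1, 1.5, 2)
n.expressed <- sapply(multipliers, count.expressed)
names(n.expressed) <- multipliers
print(n.expressed)
```

On real data the same function could be applied to `G.norm.quant$E`, with `is.neg` replaced by `G.norm.quant$genes$ControlType == -1`.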
If you want to be more systematic about it, you could use the propexpr() function in limma to estimate the proportion of truly expressed probes on each array. This works well for Illumina arrays, but we find that the negative control probes on Agilent arrays tend to give somewhat biased results. So I suggest taking the negative controls as a rough guide rather than trying to be too formal about it.
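For reference, a call to propexpr() might look roughly like the sketch below. The data are simulated raw (unlogged) intensities, not real Agilent data, and the helper `sim.array` and the sizes chosen are purely illustrative. propexpr() expects raw intensities together with a status vector (or a separate matrix of negative controls) that identifies the negative control probes.

```r
library(limma)

# Simulated raw (unlogged) intensities. NOT real data.
set.seed(2)
n.arrays <- 4
n.neg <- 200    # negative control probes
n.reg <- 2000   # regular probes
n.off <- 800    # regular probes that are truly not expressed

# One array: negatives and non-expressed probes share the background
# distribution; expressed probes carry additional signal on top of it.
sim.array <- function() {
  neg <- rnorm(n.neg, mean = 100, sd = 10)
  off <- rnorm(n.off, mean = 100, sd = 10)
  on  <- rnorm(n.reg - n.off, mean = 100, sd = 10) +
         rexp(n.reg - n.off, rate = 1/500)
  c(neg, off, on)
}
x <- sapply(seq_len(n.arrays), function(i) sim.array())
colnames(x) <- paste0("array", seq_len(n.arrays))

# Status vector marking which rows are negative controls.
status <- rep(c("negative", "regular"), c(n.neg, n.reg))

# Estimated proportion of expressed regular probes, one value per array.
prop <- propexpr(x, status = status, labels = c("negative", "regular"))
print(round(prop, 2))
```

For an Agilent data set, the status vector would presumably be derived from the ControlType column of the raw (not normalized, not logged) data object, labelling -1 as "negative" and 0 as "regular". Given the bias mentioned above, the resulting estimates are best read as a rough indication only.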
Thanks for your answer and time. I understand that the exact choice of multiplier is somewhat arbitrary, but I am just looking for a good way of describing the number of expressed probes/genes in my system, and 44000 expressed probes seemed a lot to me.
Would the following be a good rough guide: look at the multiplier that gave most DE probes? (balance between background signal and multiple hypothesis testing?)
In terms of DE probes, the 1.5 multiplier gave the most (1600 probes), compared with 1450 for the 1.1 multiplier and 800 for the 2 multiplier.
You say that all you want to do is to estimate the number of expressed probes, but I am guessing that your real interest is to undertake a DE analysis and that filtering probes is just a means to that end. If you achieve a good DE analysis, why should you care exactly what percentage of probes are truly expressed? What does it matter if you filter out some expressed probes, if the expression level is too low to be of interest? What does it matter if you keep some non-expressed probes in the analysis if it doesn't hurt the analysis of the other probes? If I'm right, then the procedure you state (choose the filtering that gives most DE genes) is reasonable. If your main aim really is to estimate the number of expressed probes, then you should explore propexpr().
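The procedure of trying several multipliers and comparing the number of DE probes could be sketched as follows. Everything here is simulated: the two-group design, the helper `count.de`, the rule of keeping probes above the cutoff on at least 4 arrays, and the default BH-adjusted 5% threshold in decideTests() are all assumptions made for illustration, not details from the analysis discussed above.

```r
library(limma)

# Simulated log2 expression: 2000 probes x 8 arrays, two groups of 4.
# NOT real data; all sizes and effect sizes are arbitrary.
set.seed(3)
n.probes <- 2000
n.arrays <- 8
group <- factor(rep(c("A", "B"), each = 4))
design <- model.matrix(~ group)

# Rows 1-1000 sit at background level; rows 1-100 act as negative controls.
E <- matrix(rnorm(n.probes * n.arrays, mean = 5, sd = 0.3), n.probes, n.arrays)
is.neg <- seq_len(n.probes) <= 100

# Rows 1001-2000 are expressed at probe-specific levels between 6 and 14.
probe.mean <- runif(1000, min = 6, max = 14)
E[1001:2000, ] <- probe.mean + rnorm(1000 * n.arrays, sd = 0.3)

# Rows 1001-1200 are truly DE: one log2 unit higher in group B.
E[1001:1200, group == "B"] <- E[1001:1200, group == "B"] + 1

# 95% quantile of the negative controls, one value per array.
neg95 <- apply(E[is.neg, , drop = FALSE], 2, quantile, probs = 0.95)

# Filter with a given multiplier, fit the linear model, count DE probes.
count.de <- function(multiplier, min.arrays = 4) {
  cutoff <- matrix(multiplier * neg95, nrow(E), ncol(E), byrow = TRUE)
  keep <- rowSums(E > cutoff) >= min.arrays & !is.neg
  fit <- eBayes(lmFit(E[keep, , drop = FALSE], design))
  sum(decideTests(fit)[, "groupB"] != 0)
}

multipliers <- c(1.1, 1.5, 2)
n.de <- sapply(multipliers, count.de)
names(n.de) <- multipliers
print(n.de)
```

Because the filter uses only overall intensity and ignores the group labels, trying a handful of multipliers in this way does not use the DE test results to decide which individual probes are kept.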