In a typical DESeq2 output file there are 6 numbers given for each gene: baseMean, log2FoldChange, lfcSE, stat, pvalue and padj. Each of them has 15 digits. I think this is because R produces 15 digits when it does a calculation. However, these numbers are derived from measurements, and measurements have significant figures; I would not think that the measurements for gene counts have 15 significant figures.
Why does this matter to me? Because I want a cutoff for calling genes differentially regulated (a cutoff for padj, for log2FoldChange, or for both), and I would like to be consistent and follow best practice. Say I choose a padj cutoff of 0.05. The number of resulting genes will depend on how you round the 15-digit number given for padj. Do you round to two digits after the decimal point, 0.05 (this will include genes with padj of 0.054 or less), or three, 0.050 (this will include genes with padj of 0.0504 or less), or four, 0.0500, or do you use the whole 15-digit number with no rounding, which, I would think, has too many significant figures?
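To make the effect concrete, here is a minimal sketch of how the same cutoff gives different gene lists depending on the rounding. It is written in Python purely for illustration (the analysis itself is done in R), and the gene names and padj values are made up, not taken from any DESeq2 run.

```python
# Hypothetical adjusted p-values; illustrative only.
padj = {
    "geneA": 0.0312,
    "geneB": 0.0499,
    "geneC": 0.0503,
    "geneD": 0.0541,
    "geneE": 0.0562,
}

CUTOFF = 0.05


def passing(values, digits=None):
    """Return genes whose padj, optionally rounded first, is <= CUTOFF."""
    kept = []
    for gene, p in values.items():
        # Round only when a number of decimal places is requested.
        shown = p if digits is None else round(p, digits)
        if shown <= CUTOFF:
            kept.append(gene)
    return kept


two_digits = passing(padj, 2)    # rounded to 2 decimal places
three_digits = passing(padj, 3)  # rounded to 3 decimal places
unrounded = passing(padj)        # full precision, no rounding

print(len(two_digits), len(three_digits), len(unrounded))
```

With these made-up values, each extra digit of rounding drops one more gene from the list, even though the nominal cutoff never changes.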
Also, one more question. If you have a padj cutoff of 0.05, do you include 0.05? Frequently you hear that the cutoff was 0.05, but it is usually not specified whether that means less than 0.05 or less than or equal to 0.05.
Thank you for your reply. It is good to know that a cutoff of 0.05 means less than 0.05.
My concerns are about consistency and reproducibility. We have many people in our lab doing RNA-Seq analysis, and it is preferable that each gets identical gene lists when using the same cutoffs on the same data. For example, one person in our lab used a log2 cutoff of 1, but rounded the log2 values to the nearest integer. I repeated the work on the same data, but I rounded to the nearest hundredth. His list had 100 or so more up-regulated genes than mine. The lists were identical except for the 100 or so extra, which was 20% of the total gene list. Rounding to the nearest integer meant that his list included genes with log2 of 0.5 or greater, while my list only included genes with log2 of 0.995 or greater. That may be a somewhat extreme example, but even a much smaller percentage difference in gene lists could lead to differences in what is considered significant in a downstream analysis. It is not a good situation when the results depend on who performs the analysis.
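The size of the discrepancy follows from simple arithmetic: rounding before comparing lowers the effective cutoff by half of the last retained digit. A small sketch (Python, for illustration only; `effective_cutoff` is a hypothetical helper, and exact ties are ignored):

```python
def effective_cutoff(cutoff, digits):
    """Approximate smallest unrounded value that still rounds to `cutoff`
    when rounding to `digits` decimal places (exact ties ignored)."""
    return cutoff - 0.5 * 10 ** (-digits)


for digits in (0, 2):
    eff = effective_cutoff(1.0, digits)
    # 2 ** eff converts the log2 threshold back to a linear fold change.
    print(f"digits={digits}: log2 cutoff {eff:.3f}, fold change {2 ** eff:.2f}")
```

For a nominal log2 cutoff of 1 (a 2-fold change), rounding to the nearest integer admits genes from roughly a 1.41-fold change upward, while rounding to the nearest hundredth stays at roughly 1.99-fold.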
We could just somewhat arbitrarily decide on one standard, but it would be preferable to have some scientific justification for that standard.
If you want consistency, then use a fixed software environment such as Docker, with fixed code and thresholds that make sense. Rounding to the nearest integer makes little sense, since on the log2 scale every integer step is a 2-fold change, so this vastly inflates the effects. If different people use different thresholds, then of course you get different results.
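One way to pin down "fixed code and thresholds" is a single shared filter that everyone in the lab calls on the unrounded values. The sketch below is in Python for illustration and rests on assumptions: the constant names, the function `is_significant`, the use of the absolute log2 fold change, and the strict inequalities are all choices made here, not something DESeq2 prescribes. Missing values are treated as not significant, since DESeq2 can report padj as NA for some genes.

```python
import math

# Thresholds defined once and shared; the values are examples, not recommendations.
PADJ_CUTOFF = 0.05
LFC_CUTOFF = 1.0


def is_significant(padj, log2fc):
    """Apply both cutoffs to full-precision values, with strict inequalities."""
    # Missing values (None or NaN) never pass the filter.
    if padj is None or math.isnan(padj):
        return False
    if log2fc is None or math.isnan(log2fc):
        return False
    # No rounding anywhere: compare the stored numbers directly.
    return padj < PADJ_CUTOFF and abs(log2fc) > LFC_CUTOFF


print(is_significant(0.0499, 1.2))   # passes both cutoffs
print(is_significant(0.05, 1.2))     # padj equal to the cutoff is excluded
print(is_significant(0.01, 0.998))   # log2 fold change just under 1 is excluded
```

Rounding is then left for display only, after the gene list has been fixed, so two people running the same code on the same data get the same list.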