how many significant figures are there in DESeq2 output
@matthew-mccormack-2021
Last seen 23 hours ago
United States

In a typical DESeq2 output file there are 6 numbers given for each gene: baseMean, log2FoldChange, lfcSE, stat, pvalue and padj. Each of them has 15 digits. I think this is because R produces 15 digits when it does a calculation. However, these numbers are derived from measurements, and measurements have significant figures. I would not think that the measurements for gene counts have 15 significant figures.

Why does this matter to me? Because I want a cutoff for calling genes differentially regulated (a cutoff for padj, for log2FoldChange, or for both), and I would like to be consistent and follow best practice. Say I choose a padj cutoff of 0.05. The number of genes that pass will depend on how you round the 15-digit padj value. Do you round to two digits after the decimal point, 0.05 (this will include genes with padj of 0.054 or less), or to three, 0.050 (this will include genes with padj of 0.0504 or less), or to four, 0.0500, or do you use the whole 15-digit number with no rounding, which, I would think, has too many significant figures?

Also, one more question. If you have a padj cutoff of 0.05, do you include 0.05? Frequently you hear that the cutoff was 0.05, but it is usually not specified whether that means less than 0.05 or less than or equal to 0.05.
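To make the rounding question concrete, here is a minimal sketch in Python (the same comparison works in R). The gene names and padj values are made up for illustration and the helper `passing` is hypothetical, not part of DESeq2. It applies the cutoff as "rounded padj is at most 0.05", as described above, and shows how the list grows as fewer digits are kept.

```python
# Hypothetical padj values; not real DESeq2 output.
padj = {
    "geneA": 0.0412,
    "geneB": 0.0487,
    "geneC": 0.0503,
    "geneD": 0.0538,
    "geneE": 0.0562,
}

def passing(values, cutoff, digits=None):
    """Genes whose (optionally rounded) padj is <= cutoff."""
    kept = []
    for gene, p in values.items():
        shown = p if digits is None else round(p, digits)
        if shown <= cutoff:
            kept.append(gene)
    return sorted(kept)

no_rounding = passing(padj, 0.05)          # full-precision values
three_digits = passing(padj, 0.05, 3)      # 0.0503 becomes 0.050
two_digits = passing(padj, 0.05, 2)        # 0.0538 becomes 0.05

print(len(no_rounding), no_rounding)
print(len(three_digits), three_digits)
print(len(two_digits), two_digits)
```

With these made-up values the list grows from two genes to three to four as digits are dropped, so the rounding step effectively moves the cutoff.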

DESeq2 • 296 views
ATpoint ★ 4.7k
@atpoint-13662
Last seen 20 hours ago
Germany

I strongly recommend not concerning yourself with these sorts of marginal problems, because you end up dealing purely with technical edge cases rather than interpreting your actual analysis results. As far as I can tell, padj < 0.05 is the general convention. I recommend using that and focusing on interpretation of the results rather than on decimal places. If you feel you have too many DEGs, consider testing against a fold change as described in the vignette.


Thank you for your reply. It is good to know that a cutoff of 0.05 means less than 0.05.

My concerns are about consistency and reproducibility. We have many people in our lab doing RNA-Seq analysis, and it is preferable that each of them gets identical gene lists when using the same cutoffs on the same data. For example, one person in our lab used a log2 cutoff of 1 but rounded the log2 values to the nearest integer. I repeated the work on the same data but rounded to the nearest hundredth. His list had 100 or so more up-regulated genes than mine. The lists were identical except for those 100 or so extra genes, which were 20% of the total gene lists. Rounding to the nearest integer meant that his list included genes with log2 of 0.5 or greater, while my list only included genes with log2 of 0.995 or greater. That may be a somewhat extreme example, but even a much smaller percentage difference in gene lists could lead to differences in what is considered significant in a downstream analysis. It is not desirable for the results to depend on who performs the analysis.
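A small sketch of that effect, written in Python with made-up log2 fold changes (the gene names, values and the helper `up_regulated` are hypothetical). It applies the same cutoff of 1 to values rounded to the nearest integer, rounded to the nearest hundredth, and left unrounded.

```python
# Hypothetical log2 fold changes; not real DESeq2 output.
log2fc = {
    "gene1": 0.48,
    "gene2": 0.52,
    "gene3": 0.97,
    "gene4": 0.996,
    "gene5": 1.04,
}

def up_regulated(values, cutoff, digits=None):
    """Genes whose (optionally rounded) log2 fold change is >= cutoff."""
    kept = []
    for gene, lfc in values.items():
        shown = lfc if digits is None else round(lfc, digits)
        if shown >= cutoff:
            kept.append(gene)
    return sorted(kept)

by_integer = up_regulated(log2fc, 1, 0)    # nearest integer
by_hundredth = up_regulated(log2fc, 1, 2)  # nearest hundredth
unrounded = up_regulated(log2fc, 1)        # values as reported

print(by_integer)
print(by_hundredth)
print(unrounded)
```

Here the integer-rounded list has four genes, the hundredth-rounded list has two, and the unrounded list has one, even though the stated cutoff is 1 in every case.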

We could just somewhat arbitrarily decide on one standard, but it would be preferable to have some scientific justification for that standard.


If you want consistency, use a fixed software environment such as Docker, with fixed code and thresholds that make sense. Rounding to the nearest integer makes little sense, since on the log2 scale every integer step is a 2-fold change, so this vastly inflates the effects. If each person uses different thresholds, then of course you get different results.
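One way to read "fixed code and thresholds" is a single shared filter that everyone in the lab calls, with the cutoffs written down once and applied to the unrounded values. The sketch below is in Python with made-up numbers. The names `PADJ_CUTOFF`, `LFC_CUTOFF` and `is_de` are hypothetical, and using a strict comparison on the absolute log2 fold change is an assumption for illustration, not something DESeq2 prescribes. Missing padj values (DESeq2 can report NA for padj) are treated as not significant.

```python
import math

# Assumed lab-wide thresholds, defined in exactly one place.
PADJ_CUTOFF = 0.05  # strict: padj < 0.05
LFC_CUTOFF = 1.0    # strict: |log2FoldChange| > 1 (assumption)

def is_de(padj, log2fc):
    """True if a gene passes both thresholds; values are never rounded."""
    if padj is None or math.isnan(padj):
        return False  # missing padj is treated as not significant
    return padj < PADJ_CUTOFF and abs(log2fc) > LFC_CUTOFF

# Hypothetical (padj, log2FoldChange) pairs; not real DESeq2 output.
results = {
    "geneA": (0.0012, 2.31),
    "geneB": (0.0500, 1.80),   # padj equal to the cutoff is excluded
    "geneC": (0.0499, -1.02),
    "geneD": (float("nan"), 3.10),
    "geneE": (0.0100, 0.97),
}

de_genes = sorted(g for g, (p, lfc) in results.items() if is_de(p, lfc))
print(de_genes)
```

Because the comparison operators and cutoffs live in one function, two people running it on the same table get the same list regardless of how the numbers are displayed or exported.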
