Question

strange FDR distribution for RNA-seq sample with edgeR

0

Entering edit mode

OLIVER, DAVID ▴ 10

@oliver-david-6569

Last seen 10.2 years ago

United States

Hi all,

I am having a problem with edgeR's FDR output for a sample. Specifically, the FDR calculated for the sample makes all but 1 gene greater than 0.999 FDR.

I have 6 samples with 3 biological replicates per sample (except 1 sample which only has 2, not the sample in question). I'm comparing them all to a control sample.

Here is my code snippet

group <- factor(c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6))
y <- DGEList(counts = data.data, group = group)

design <- model.matrix(~0+group)
keep <- which(rowMeans(cpm(y)) > 1)
y <- y[keep,]
y$samples$lib.size <- colSums(y$counts)
y <- calcNormFactors(y)
y.tagwiseD <- estimateGLMTagwiseDisp(y, design)
fit <- flmFit(y.tagwiseD, design)
lrt.pa <- glmLRT(fit, contrast=c(-1,1,0,0,0,0))
lrt.pb <- glmLRT(fit, contrast=c(-1,0,1,0,0,0))
lrt.pc <- glmLRT(fit, contrast=c(-1,0,0,1,0,0))
lrt.pd <- glmLRT(fit, contrast=c(-1,0,0,0,1,0))
lrt.pe <- glmLRT(fit, contrast=c(-1,0,0,0,0,1))

tt.pa <- topTags(lrt.pa, n = Inf, sort.by = "PValue")
tt.pb <- topTags(lrt.pb, n = Inf, sort.by = "PValue")
tt.pc <- topTags(lrt.pc, n = Inf, sort.by = "PValue")
tt.pd <- topTags(lrt.pd, n = Inf, sort.by = "PValue")
tt.pe <- topTags(lrt.pe, n = Inf, sort.by = "PValue")

Here is what the library sizes look like

          group lib.size norm.factors
control.1     1  7748209    1.0210560
control.2     1  7179951    0.9556734
control.3     1 11971447    1.0134466
pa.1          2 10175798    1.0012736
pa.2          2  9470155    1.0016820
pa.3          2 10123124    1.0584135
pb.1          3  7913483    0.9819921
pb.2          3  8295638    0.9916526
pb.3          3  9695925    0.9868686
pc.1          4  8920315    0.9763594
pc.2          4  8218776    0.9714772
pc.3          4  7328867    1.0263048
pd.1          5  7748209    1.0210560
pd.2          5 11961039    0.9783426
pd.3          5 11204776    1.0162369
pe.1          6  9162241    0.9585401
pe.3          6 11191854    1.0464313

Sample pd is the problem sample. Here is what the sample looks like

> head(tt.pd)
Coefficient:  -1*group1 1*group5
                        logFC    logCPM       LR       PValue       FDR
ENSG00000185483.9  -2.0867536 0.8938299 16.18707 5.738457e-05 0.8752869
ENSG00000183508.4  -1.7070970 3.7773429 14.11071 1.723595e-04 0.9996358
ENSG00000162817.6  -1.2346601 2.8174421 12.05522 5.164758e-04 0.9996358
ENSG00000005243.7  -0.6852432 6.4953765 11.99971 5.320874e-04 0.9996358
ENSG00000138944.7  -2.5864985 0.5451848 11.96504 5.420809e-04 0.9996358
ENSG00000158270.10 -2.0259865 3.1257016 11.55704 6.749312e-04 0.9996358

As you can see, the FDR doesn't really match up with the p-value or the logFC. Interestingly, we have a second control sample in the data set which is sample pe and it shows an FDR distribution like one would expect from comparing two controls, specifically, some changed genes but much much less than treated samples. Here is the toptags for sample pe (the other control)

> head(tt.pe)
Coefficient:  -1*group1 1*group6
                       logFC   logCPM       LR       PValue         FDR
ENSG00000248223.1  -1.728649 5.278270 23.85892 1.036605e-06 0.009185926
ENSG00000001617.9  -1.433806 3.550665 23.57005 1.204475e-06 0.009185926
ENSG00000072041.14 -2.454923 2.091251 19.47446 1.019537e-05 0.040531314
ENSG00000128917.6  -1.039870 3.864531 19.39490 1.062907e-05 0.040531314
ENSG00000182752.9  -1.038423 5.551625 18.94674 1.344191e-05 0.041005891
ENSG00000162654.8   1.323367 2.564031 17.59689 2.730339e-05 0.050695072

So to reiterate my question, why is the FDR distribution so strange for sample pd?

Thanks,
David Oliver

edgeR RNA-seq FDR • 2.8k views

ADD COMMENT • link updated 10.2 years ago by Gordon Smyth 52k • written 10.2 years ago by OLIVER, DAVID ▴ 10

0

Entering edit mode

A couple of typos:

fit <- flmFit(y.tagwiseD, design)
tt.pa <- topTagslrt.pa, n = Inf, sort.by = "PValue")
tt.pe <- topTagslrt.pe, n = Inf, sort.by = "PValue")
headtt.pe)

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2014-12-19

For your tt.pd results, there is nothing strange with the computed FDRs. To correct for multiple testing, the FDR should be larger than the p-value from which it was computed; and indeed, this is the case.

The presence of a number of genes on a FDR of 0.9996358 is caused by enforced monotonicity in the Benjamini-Hochberg method. This is necessary as the FDR calculation is sensitive to the rank of the p-value. Monotonicity ensures that, say, gene A with a lower p-value than gene B will not have a higher FDR than gene B. Otherwise, you'd end up with a paradoxical situation where a gene might be more significant at the p-value level, but less significant at the FDR level.

In short, don't concern yourself with the distribution of the FDR values, as the requirement for monotonicity will result in some irregular distributions. Rather, focus on using the FDR to detect differential expression. For this particular comparison, the tt.pd results indicate that there are no DE genes at a FDR threshold of 5%.

score 0 · Answer 2 · 2014-12-19

To add to Aaron's answer, here's how the FDR is computed. I can deduce that you have 15253 genes in this analysis. The FDR for the top gene in your tt.pd table is

5.738457e-05 * 15253 = 0.8752869.

The FDR the second gene is

1.723595e-04 * 15253 / 2

which would be greater than 1, but of course we don't allow FDR greater than 1 or FDR out of order. So all the genes from the second onwards have FDR which is essentially 1, meaning no evidence of DE.

This is the type of distribution of FDR one always sees when there is no evidence of DE.

The FDR for tt.pe are about 100 times smaller because the p-values are about 100 times smaller than for tt.pd.

It's as simple as that.

score 0 · Answer 3 · 2014-12-19

Two other points about matters that are not your main question.

First, did you know that you can use:

y <- y[keep,,keep.lib.sizes=FALSE]

Then you wouldn't need to re-compute the library sizes.

Second, I strongly recommend that you call estimateGLMCommonDisp() and estimateGLMTrendedDisp() before estimateGLMTagwiseDisp(). Your current code is not estimating an abundance trend for the dispersions, and generally RNA-seq data shows such a trend.