Question

Why the log fold change of some genes returned by edgeR is not compatible with their related FPKM value?

0

Entering edit mode

Sara ▴ 20

@sara-9865

Last seen 10 weeks ago

Germany

Hi all experts,

I have used Trinity software for doing de novo transcriptome assembly. Then, I used bowtie and RSEM within Trinity package for read mapping and counting. Finally, I got the raw count and FPKM value. I used raw count for doing differential expression analysis by edgeR software, say, I compared A-B library. I found that among DEG genes reported by edgeR software, some few genes with positive log fold change, meaning that the higher expression in A library in relative to B, have the lower FPKM value in A as compared with its FPKM value in B library. For instance, please consider the below example

logFC logCPM LR PValue FDR FPKM (B) FPKM(A)

c50354_g2_i5 2.825768038 5.659882391 41.70680491 1.06039E-10 6.64373E-07 118.187 18.093

As you could see the FPKM value of c50354_g2_i5 transcript are in 18.093 and 118.187 in the library B and A, respectively. But during the comparison A-B library, the log fold change is positive. Could you please explain to me what happened for such genes?

Thank you in advance

edgeR logFC FPKM • 3.4k views

ADD COMMENT • link updated 8.5 years ago by Gordon Smyth 52k • written 8.5 years ago by Sara ▴ 20

0

Entering edit mode

The first thing that comes to my mind is that you should probably include the code you used to obtain these values. And the output of sessionInfo().

ADD REPLY • link 8.5 years ago Diego Diez ▴ 760

0

Entering edit mode

If you plot edgeR's log fold change esimates vs. your manual FPKM log ratios, are they for the most part concordant with a few outliers, or most of them not concordant, at all?

ADD REPLY • link 8.5 years ago Steve Lianoglou ★ 13k

Gordon Smyth · Answer 1 · 2016-10-10

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

If you use edgeR correctly, and compute FPKM correctly, then the logFC and the FPKM will agree.

If you want more help, please give the code by which you computed the logFC and the FPKM. We can't tell you what happened without that information. Perhaps for example you may have simply computed the logFC for B vs A instead of A vs B, and that would agree perfectly with the FPKM values you give.

Later:

It turns out from your later post that the logFC you give is not a simple logFC between TT6 and TS6 but has been adjusted for baseline control values. Furthermore the controls are different for TT and TS. When you looked at the FPKM, you have forgotten to look at and correct for the control values (CT and CS) for this gene.

ADD COMMENT • link 8.5 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much for the comment. Actually, I have done DE analysis with your great help, Gordon, here. I have 5 libraries (unfortunately without replicate) as below:

Control-Sensitive cell line (CS)

Treatment-Sensitive cell line after 6 hours (TS6)

Control-Tolerant cell line (CT)

Treatment-Tolerant cell line after 6 hours (TT6)

Treatment-Tolerant cell line after 12 hours (TT12)

The following codes were used for DE analysis:

targets <- read.delim("targets3.txt", stringsAsFactors=FALSE)
targets
  Label Treatment  line time
1    CS   control sensitive    0
2   TS6      drug sensitive    6
3    CT   control  tolerant    0
4   TT6      drug  tolerant    6
5  TT12      drug  tolerant   12

Group <- factor(c("CS","TS6","CT","TT6","TT12"))
 design <- model.matrix(~0+Group)
 colnames(design) <- levels(Group)
 Group2 <- factor(c(1,2,1,2,3))
 design2 <- model.matrix(~Group2)

expressioncount <- read.delim("expression_rawcount.txt", row.names=1)
colnames(expressioncount) <- substring(colnames(expressioncount),1,6)
dim(expressioncount)
[1] 144103     6

 head(expressioncount)
             length CS TS6 CT TT6 TT12
c51581_g1_i1    462  0   2  1   1    0
c19380_g1_i1    439  3   0  2   0    2
c4832_g1_i1     665  0   2  0   7   23
c25894_g1_i1    402  0   1  1   1    3
c37752_g4_i1    736  2   4  2   3    2
c557_g1_i1      434  7   0  4   0    0

 y <- DGEList(expressioncount[,2:6], group=Group)
> keep <- rowSums(cpm(y) > 1) >= 1
> y <- calcNormFactors(y)
> y$samples
     group lib.size norm.factors
CS       1 20146232    1.1727230
TS6      1 19308652    0.9188417
CT       1 19792940    1.2271948
TT6      1 19051509    0.9463686
TT12     1 17752012    0.7990795

table(keep)
keep
FALSE  TRUE
68016 76087
y <- estimateGLMCommonDisp(y, design2, verbose=TRUE)
Disp = 0.05004 , BCV = 0.2237              dispersion=bcv^2
 y <- estimateGLMTrendedDisp(y, design2)
fit <- glmFit(y,design)
 cm <- makeContrasts(TS6 = TS6-CS, TT6 = TT6-CT, TT12 = TT12-CT, TT6.TS6 = (TT6-CT) - (TS6-CS), levels=design)

lrt3 <- glmLRT(fit, contrast=cm[,"TT6.TS6"])

is.de <- decideTestsDGE(lrt3)
summary(is.de)
   [,1] 
-1     60
0  143843
1     200

I have computed FPKM value with the below code:

RPKM <- rpkm(y, gene.length=expressioncount$length)

When I checked DE genes (TT6.TS6) the logFC and FPKM value agreed together (positive logFC was equal to higher FPKM value in TT6 than TS6 and vice versa for negative logFC), but some interesting genes with positive logFC had the higher FPKM value in TS6 than TT6!! Please consider the below example:

geneid   logFC   logCPM   LR   PValue   FDR   FPKM_TS6   FPKM_TT6
c41775_g1_i2   2.68363364   5.682588148   25.68892732   4.01122E-07   0.000704913   81.928   33.113
c50354_g2_i5   2.825768038   5.659882391   41.70680491   1.06039E-10   6.64373E-07   118.187   18.093

Please kindly explain to me what happened for such genes. I'm very concerned about the rest results

Thanks in advance

ADD REPLY • link updated 8.5 years ago by Gordon Smyth 52k • written 8.5 years ago by Sara ▴ 20

3

Entering edit mode

The contrast you are calling TT6.TS6 isn't a comparison between those two samples. It's the interaction term between the tolerant and sensitive samples, so you can't figure out if the logFC makes sense just by looking at the TT6 and TS6 samples! Note that the contrasts are just algebra, and mean exactly what you would think. So when you do

TT6.TS6 = (TT6-CT) - (TS6-CS)

It literally means that you want to make that comparison. So if you want to compare the FPKM changes to the logFC from the model, you have to do the same thing with the FPKM values (e.g., compute TT6 - CT - TS6 + CS using the FPKM samples) and just looking at TT6 - TS6 isn't the same thing.

ADD REPLY • link 8.5 years ago James W. MacDonald 68k