Question

how to use edgeR analysis RNA-seq data with a obvious deviation sample

0

Entering edit mode

xiaoaozqd • 0

@xiaoaozqd-9533

Last seen 7.8 years ago

the raw code:

########################################################

> plant = c(rep("MK",5),rep("NR",30),rep("NS",15))
> fungi = c(rep("32",2),rep("26",18),rep("32",30))
> time = c(rep("usp",5),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3))
> batch = c(c("b","c"),rep(c("a","b","c"),16))
> batch = factor(batch)
> grouping <- factor(paste(plant, fungi, time, sep="."))
> table(grouping)
grouping
   MK.26.usp    MK.32.usp NR.26.168hpi  NR.26.18hpi  NR.26.24hpi  NR.26.48hpi  NR.26.96hpi NR.32.168hpi  NR.32.18hpi  NR.32.24hpi  NR.32.48hpi  NR.32.96hpi NS.32.168hpi  NS.32.18hpi  NS.32.24hpi  NS.32.48hpi  NS.32.96hpi
           3            2            3            3            3            3            3            3            3            3            3            3            3            3            3            3            3
> counts <- read.csv("All_Fragments_16_3_30.csv")
> head(counts[,1:5],3)
         GeneID MK.32.usp.b MK.32.usp.c MK.26.usp.a MK.26.usp.b
1 PSTCY32_21349         570         517         820         992
2     Contig410           5           5           1           0
3     Contig413         223         328         220         181
> d <- DGEList(counts = counts[,2:51], genes = counts[,1], group = grouping)
> keep <- rowSums(cpm(d)>1)>=2
> d <- d[keep,]
> d <- DGEList(counts= d$counts, genes = d$genes, group = grouping)
> d  =  calcNormFactors(d)
> design = model.matrix(~ condition, samples)
> design <- model.matrix(~ 0 + grouping + batch)

>  d2 = estimateGLMCommonDisp(d,design)
>  d2 = estimateGLMTrendedDisp(d2, design)
>  d2 = estimateGLMTagwiseDisp(d2, design)
> f = glmFit(d2, design)
> con <- makeContrasts( ( groupingNR.26.18hpi - groupingMK.26.usp ) -
 ( groupingNR.32.18hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.18hpi - groupingMK.26.usp ) -
 ( groupingNS.32.18hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.24hpi - groupingMK.26.usp ) -
 ( groupingNR.32.24hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.24hpi - groupingMK.26.usp ) -
 ( groupingNS.32.24hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.48hpi - groupingMK.26.usp ) -
 ( groupingNR.32.48hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.48hpi - groupingMK.26.usp ) -
 ( groupingNS.32.48hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.96hpi - groupingMK.26.usp ) -
 ( groupingNR.32.96hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.96hpi - groupingMK.26.usp ) -
 ( groupingNS.32.96hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.168hpi - groupingMK.26.usp ) -
 ( groupingNR.32.168hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.168hpi - groupingMK.26.usp ) -
 ( groupingNS.32.168hpi - groupingMK.32.usp ) ,levels=design)
> lrt <- glmLRT(f,contrast=con)
> tt <- topTags(lrt, n=Inf)$table

########################################################

In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.

So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc

the levels of con：groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc

After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.

I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h.

After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.

> design3 (= model.matrix(~ 0 + grouping))

So my questions are:

1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”

edger RNA-seq anova • 1.9k views

ADD COMMENT • link updated 9.0 years ago by Aaron Lun ★ 28k • written 9.0 years ago by xiaoaozqd • 0

0

Entering edit mode

There's no NILS or NILR in the code you've shown above, only MK, NR or NS.

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

score 3 · Accepted Answer · 2016-04-01

What makes you think that the first batch is not being used? There's no batcha term, but that doesn't mean that the samples in batch A are being ignored. Every grouping* coefficient represents the log-expression of each group, using the sample in batch A as the "baseline". The batchb and batchc terms represent the log-fold change in each other batch over batch A, averaged across all groups. So, all samples that you've included in the design matrix are being used here. Similarly, there's no need to redesign the model if you take out a CYR32 replicate, even if this is the only sample of the group in batch A. (Indeed, the interpretation of the corresponding coefficient does not change; if there is no actual batch A sample in that group, then the coefficient will represent the average log-expression of a hypothetical sample in batch A.)

To determine whether you need to include the batch effect, I would try a couple of things. First, have a look at the MDS plot, and see if your samples are clustering by batch. If they are, then you need to model the batch effect. If not (i.e., samples from different batches intermingle), then it's probably okay to just using grouping alone. Also, I would try the analysis with and without the batch factor, and see what happens to the the number of DE genes. If there's a strong batch effect, then modelling it should increase the number of DE genes. (Specifically, blocking on batch will decrease the dispersion estimates but will use up residual d.f., which generally increase and decrease power, respectively - so you want to see whether the former can offset the latter for DE detection.)

Finally, you should be using all estimateGLM*Disp calls together, as it's a three-step pipeline which uses all functions. Or better yet, use estimateDisp, which is an upgraded single-step version.