Please share me your helpful comments with edgeR analysis without replicate
3
0
Entering edit mode
Sara ▴ 20
@sara-9865
Last seen 2.1 years ago
Germany

Hi all experts, 
I’m very new in this field and it’s my first experience, so please correct me when I’m wrong during the conversation.I have 5 libraries (unfortunately without replicate) as below, 

Control-Sensitive cell line (CS)

Treatment-Sensitive cell line after 12 hours (TS12)

Control-Tolerant cell line (CT)

Treatment-Tolerant cell line after 12 hours (TT12)

Treatment-Tolerant cell line after 24 hours (TT24)

I’m aware of the replication importance, however, at this moment, there is no way to have replication in this experiment. As far as I read, I can use just the exact test from edgeR package, which the interaction effects could not be controlled by it. Given this experiment, please kindly clear me about the below questions.

1) As you know better me, in order to use exact test, we set dispersion parameter manually. For dispersion estimation, I ignored the cell line type and time; in fact, I considered the experiment as control with two replicates (CS and CT) and treatment with three replicate (TS12, TT12, and TT24), and it is the related MDS plot. 

 

As this plot shows, the difference between control and treatment is more than between cell lines (CS and CT, ST12 and TT12), it returned to me this consideration (control and treatment) sounds right, do you agree? Please share me your opinion. Then I calculated dispersion using "y <- estimateDisp(y, design, robust=TRUE)", the related plotBCV is here. Please let me know if I can use the calculated dispersion for the exact test? 

Regarding the interaction effect, when we use the exact test for comparing two cell line types say, CS and CT as well as TS12 and TT12, and comparing control and treatment within each cell line, please kindly tell me if we can say those DE genes that are common when comparing was done within each cell line and when two cell line types were compared with each other might be due to the cell line effect not treatment?

As the last question, please tell me how is important the control for interaction effects? Please advise and introduce me if there is any software that can handle the interaction effect without replication?

Sorry for several questions and many thanks for your help and time in advance.

edgeR dispersion value replication • 1.9k views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

You already know that not having replicates isn't ideal, and Aaron and Ryan have already discussed that, so I won't go over it again. This is how I would do the best with what you have.

Following item 3 on page 22 of the edgeR User's Guide, I would create an artificial factor that ignores the difference between sensitive and tolerant cell lines. This is similar to what you have done, but with 3 groups instead of just 2.

Here is the real design matrix:

Group <- factor(c("CS","TS12","CT","TT12","TT24"))
design <- model.matrix(~0+Group)
colnames(design) <- levels(Group)

Here is the reduced design matrix:

Group2 <- factor(c(1,2,1,2,3))
design2 <- model.matrix(~Group2)

Estimate a trended dispersion with the reduced design matrix:

y <- estimateGLMTrendedDisp(y, design2)

Again, this is similar to what you have done but with trended dispersion.

Then fit a glm with the real design matrix:

fit <- glmFit(y,design)

Now you can test any hypothesis you want:

cm <- makeContrasts(
   TS12 = TS12-CS,
   TT12 = TT12-CT,
   TT24 = TT24-CT,
   TT12.TS12 = (TT12-CT) - (TS12-CS),
   levels=design
)

To get 12 hr treatment effect in sensitive line:

lrt <- glmLRT(fit, contrast=cm[,"TS12"])
topTags(lrt)

To get 12 hr treatment effect in tolerant line:

lrt <- glmLRT(fit, contrast=cm[,"TT12"])
topTags(lrt)

To get 24 hr treatment effect in tolerant line:

lrt <- glmLRT(fit, contrast=cm[,"TT24"])
topTags(lrt)

To test for different effects (interaction) in tolerant vs sensitive lines:

lrt <- glmLRT(fit, contrast=cm[,"TT12.TS12"])
topTags(lrt)

This analysis is generally conservative, but it may also be liberal for some genes, so the p-values should be taken with a grain of salt. However it's the best that can be done, and the ranking of genes may be informative.

Your MDS plot suggests a huge treatment effect, but little systematic difference between sensitive and tolerant lines. The huge size of the treatment effect may complicate your conclusions just as much as the lack of replicates.

ADD COMMENT
0
Entering edit mode

 

Hi Dr. Gordon Smyth,

Many thanks for the detailed response and providing the codes. I followed your instruction, but the DE results is a bit strange, for example, the number of DE (at FDR of 0.05) is about 65000 and all up-regulated for TS12 while it was 5 for TT12. These are all the codes that I used, please  kindly tell me if they right or there is something wrong?

> targets <- read.delim("targets.txt", stringsAsFactors=FALSE)
> Group <- factor(c("CS","TS12","CT","TT12","TT24"))

> design <- model.matrix(~Group)
> colnames(design) <- levels(Group)
> Group2 <- factor(c(1,2,1,2,3))
> design2 <- model.matrix(~Group2)
>  expressioncount <- read.delim("expression.txt", row.names=1)
> dim(expressioncount)
>  y <- DGEList(expressioncount[1:5], group=Group)
>  keep <- rowSums(cpm(y) > 1) >= 2
>  y <- y[keep, , keep.lib.sizes=FALSE]
>  y <- calcNormFactors(y)
> y <- estimateGLMTrendedDisp(y, design2)
> cm <- makeContrasts(TS12 = TS12-CS, TT12 = TT12-CT, TT24 = TT24-CT, TT12.TS12 = (TT12-CT) - (TS12-CS), levels=design)
> lrt <- glmLRT(fit, contrast=cm[,"TS12"])
> topTags(lrt)
> is.de <- decideTestsDGE(lrt)
> summary(is.de)

that the output of summary command was:

   [,1]
-1     0
0   1147
1  65943

While this output for TT12 was as below:

     [,1]
-1     3
0  67085
1        2

 

Could you please let me know if there is any wrong and need the modification?

Thank you very much for your help and time in advance

 

ADD REPLY
0
Entering edit mode

My mistake, the design matrix should have been

design <- model.matrix(~0+Group)

instead of

design <- model.matrix(~Group)

I have edited my answer above to be correct now.

I assume these are just copying errors, but there are a couple of obvious mistakes in the code you give, in that expressioncount[1:5] can't work, and there's no glmFit() command in your code.

Also, how can you have 66,000 genes?

ADD REPLY
0
Entering edit mode

Thanks a lot for reviewing my codes. The analysis is related to an RNA-seq project that gave us a large number of genes.Besides your modification in the code, I changed the related code for expression count [1:5] and add glmFit() as you kindly suggested. It sounds that the problem was solved, however, there is an issue regarding the interaction effect. Here are all codes, I would highly appreciate if you please take a look at them and let me if anything is wrong.

 targets <- read.delim("targets.txt", stringsAsFactors=FALSE)
> targets
  Label Treatment  cellline time(hours)
1    CS   control sensitive    12
2   TS12      drug sensitive    12
3    CT   control  tolerant    12
4   TT12      drug  tolerant    12
5  TT24      drug tolerant      24
> Group <- factor(c("CS","TS12","CT","TT12","TT24"))
​> design <- model.matrix(~0+Group)
> colnames(design) <- levels(Group)
> Group2 <- factor(c(1,2,1,2,3))
> design2 <- model.matrix(~Group2)
>  expressioncount <- read.delim("expression.txt", row.names=1)
head(expressioncount)
            CS TS12 CT TT12 TT24
c51581_g1_i1  0   2  1   1    0
c19380_g1_i1  3   0  2   0    2
c4832_g1_i1   0   2  0   7   23
c25894_g1_i1  0   1  1   1    3
c37752_g4_i1  2   4  2   3    2
c557_g1_i1    7   0  4   0    0

> y <- DGEList(expressioncount[,1:5], group=Group)
> keep <- rowSums(cpm(y) > 1) >= 2
>  y <- calcNormFactors(y)
>  y <- estimateDisp(y, design2, robust=TRUE)
> plotBCV(y)
>  y <- estimateGLMCommonDisp(y, design2, verbose=TRUE)
Disp = 0.05004 , BCV = 0.2237
> y <- estimateGLMTrendedDisp(y, design2)
> fit <- glmFit(y,design)
>  cm <- makeContrasts(TS12 = TS12-CS, TT12 = TT12-CT, TT24 = TT24-CT, TT12.TS12 = (TT12-CT) - (TS12-CS), levels=design)
> lrt <- glmLRT(fit, contrast=cm[,"TS12"])
> topTags(lrt)
> is.de <- decideTestsDGE(lrt)
> summary(is.de)
   [,1]  
-1  11895
0  119488
1   12720

Now. It sounds that the number of DE (up and down) is reasonable here. The similar pattern happened for TT12 and TT24. So, the problem was solved and everything is OK, yes? again thanks for your help. But, when I used the following command for the interaction effect, it returned me nothing.

> lrt3 <- glmLRT(fit, contrast=cm[,"TT12.TS12"])
>  is.de <- decideTestsDGE(lrt3)
> summary(is.de)
   [,1]  
-1      0
0  144103
1       0

Could you please kindly tell me if It means that there isn't any interaction effect, how I can make sure about it, or there is something wrong?

As the last question, I defined contrasts as "CT.CS = (CT-TT12) - (CS-TS12)" and used the similar above code for comparing the sensitive and tolerant cell lines at before any treatment (control), but the output was no DE genes.

> lrt4 <- glmLRT(fit, contrast=cm[,"CT.CS"])
>  is.de <- decideTestsDGE(lrt4)
> summary(is.de)
   [,1] 
-1      0
0  144103
1       0

I am not sure about these outputs (no DE for TT12.TS12 and no DE for CT.CS), it may be there is something wrong due to my mistake. Could you please correct me if there is any mistake in the analysis of interaction effect?

 

Sorry for several questions, it's my first experience and I need a professional scientist like you confirm or reject me.

Many thanks for all your help and time in advance.

ADD REPLY
0
Entering edit mode

The only conclusion you can make is that there is no evidence for an interaction effect in any of your genes. It's impossible to tell whether that's because there truly is no interaction effect, or whether it's because your experimental design does not have enough power to detect a weak but genuine interaction effect. With the data you currently have, there's no way to "make sure" about whether an interaction effect is truly absent.

As for your CT.CS contrast, this does exactly the same thing as your TT12.TS12 contrast, except that the signs of the log-fold changes would be flipped around. Thus, if you don't get any DE genes in one of these contrasts, you won't get any DE genes in the other contrast either.

ADD REPLY
0
Entering edit mode

Thank you, Aaron for your comment. Although Gordon helped me much, I'm not sure if all above step and codes are correct as it's my first experience. Could you please review the codes and let me know if there is something wrong? Thank in advance

ADD REPLY
1
Entering edit mode

Code looks mostly fine, but there are a couple of odd things. For starters, your filtering step needs to be adjusted for the fact that you have only one sample per group:

keep <- rowSums(cpm(y) > 1) >= 1

Otherwise you would fail to retain genes that were only expressed in one of the groups. You also do a lot of redundant dispersion estimation. You should either use:

y <- estimateDisp(y, design2, robust=TRUE)
fit <- glmFit(y, design, dispersion=y$trended.dispersion)

... or just take out estimateDisp altogether:

y <- estimateGLMCommonDisp(y, design2, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design2)
fit <- glmFit(y,design)

Finally, if you are uncertain of your code, I would suggest contacting a local bioinformatician to walk you through an analysis. In the absence of some specific error or misbehaviour, asking whether some code is correct or not is a vague question that is not suited to the support site.

ADD REPLY
0
Entering edit mode

Thank you very much, Aaron for reviewing the codes and letting me know my mistake and sorry if it's not suited to this site. Honestly, there is no experienced person around me to ask the question. The only place where I can learn and correct any mistake is here. Now, with your correction, some genes showed interaction effect. Again thank you.

ADD REPLY
0
Entering edit mode

There isn't any genome with 66,000 genes. I suspect that you are actually analyzing transcripts, which is problematic in itself for a count-based analysis.

It is obvious from the MDS plot that there will be little interaction.

 

ADD REPLY
0
Entering edit mode

Thanks a lot, Gordon, for following the post. Your help was great. That's right, I'm busy with transcriptome analysis. Yes, there was some interaction as you mentioned, I could get them when the code was corrected by Aaron. Again thank you for all help.

ADD REPLY
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 17 hours ago
The city by the bay

Please let me know if I can use the calculated dispersion for the exact test? 

Can you? Yes. Should you? That's harder to answer. If your treatment effects are very different from each other, than the dispersion will be inflated if you pretend that the different samples are replicates. This will make your analysis more conservative (possibly a lot more so) than it should be.

The only way to know is to get replicates to estimate the dispersion. Besides, cell lines are about the easiest system to get replicates from; substantially simpler than live animal models, and orders of magnitude easier than patient samples. So there's no practical argument for not having replicates in your experimental design.

Regarding the interaction effect... ?

From what I understand, you're looking for genes that are DE between cell lines and also DE upon treatment. These genes have both a cell line effect and a treatment effect, so I don't know why you want to ignore the latter.

... how is important the control for interaction effects?

If you don't have the control sample for each cell line, you won't be able to calculate the treatment effect for that cell line. That means you won't be able to calculate the interaction effect, as this describes the differential response to treatment between cell lines.

Please advise and introduce me if there is any software that can handle the interaction effect without replication?

Any interaction model will contain one coefficient for each combination of treatment and cell line. If you only have one sample per combination, then you don't have any replication. Without replication, you can't model variability in the counts, which is critical to a sensible analysis as overdispersion is definitely present in RNA-seq data. This is a fundamental statistical issue that cannot be waved away with some magic software. Get some replicates instead. I mean, statistics aside, it's just good science.

ADD COMMENT
0
Entering edit mode
Sara ▴ 20
@sara-9865
Last seen 2.1 years ago
Germany

Thank you, Aaron Lun for the explanation. You're right about replication, however, as I mentioned in the post, we cannot generate replication at the moment and we have to go ahead with the available data. As you correctly understood, I'm looking for DE genes between cell lines and also DE after treatment. Please kindly let me know if edgeR (exact test) and simple pairwise comparisons can correctly answer these questions? the false positive rate could be high due to the lack of replication, though. 

ADD COMMENT
0
Entering edit mode

It's not an issue of false positive rate or choosing the right method. The question of differential expression is "Does the fold change rise above what would be expected from random within-group biological variation and/or measurement error?" This question is comparing two quantities, the fold change and the variability. It's impossible to give a meaningful answer unless you can measure both quantities, and without replicates, you have no measure of the variability. The most you can do is report a fold change value for each gene and accept that there's no way to assign a significance value to it.

(Note: Please use the "Add Comment" button to reply to an answer.)

ADD REPLY
0
Entering edit mode

Thanks, Ryan. OK. With pairwise comparison in the exact test, we can get DE due to either treatment or cell line effect, yes? what about the interaction effect between treatment and cell line type? could you please let me know how we can detect it? 

ADD REPLY
0
Entering edit mode

As I said, it's not a matter of selecting the right method, nor is it a matter of simply accepting a high false positive rate. Whether you use exactTest or glmLRT or glmQLFtest, or even another package like DESeq2 or limma, the situation is the same: no replicates means no statistical power.

You can, as you suggest, consider a partial design in which only one of the two experimental factors is included in the design, and the other is ignored. For example, if you fit a model with only treatment, then any variation due to tolerance, timepoint, or any of the interactions between tolerance , time point, and treatment is counted as random variation. This is the method you have used above to get the dispersions in your attached BCV plot. Counting these effects as random variation causes edgeR to overestimate the dispersions, which reduces your overall statistical power. Furthermore, it is probably not valid to apply this dispersion estimate to any other test besides the control group vs the treated group, since that is the only comparison that was included in the design when estimating the dispersions.

Just because a function gives you a number between 0 and 1 for each gene and calls it a p-value doesn't guarantee that the number has any meaningful interpretation. The various test functions don't necessarily have checks to stop you from doing statistically invalid tests. Rather, it is your responsibility to ensure that what you are doing has a valid statistical interpretation.

ADD REPLY

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6