Question

differential expressed genes, paired samples and multiple factors

0

Entering edit mode

pavlidisp ▴ 10

@pavlidisp-8874

Last seen 9.5 years ago

Greece

Dear Bioconductor community,

I'm analyzing some RNA-seq data, with the following design.

> targets

animal genotype treatment

1 1 0 0

2 1 0 1

3 2 0 0

4 2 0 1

5 3 1 0

6 3 1 1

7 4 1 0

8 4 1 1

I'm interested in finding genes that

1) respond to treatment in genotype0

2) respond to treatment in genotype1

3) how the genotype affects gene expression in treatment0

4) how the genotype affects gene expression in treatment1

5) Do genes respond differently on treatment in genotyp0 and genotype1?

After studying quite a lot the edgeR manual and asking also in bioconductor support site some time ago, I think of the following ideas (please tell me what is correct and if you have other suggestions).

1) it seems that it is a paired-samples design (corresponding to 3.4.1 section of edgeR manual).

However, I cannot exclude just the effect of animal because it is a linear combination with the genotype. Thus, then, I follow the advices of section 3.5 (comparisons within and between samples). This seems to fit nicely.

Thus, I form the model matrix:

design = model.matrix(~genotype + genotype:animal + genotype:treatment)

This results to a model matrix with the following headers

(Intercept)

genotype1

genotype0:animal1

genotype1:animal1

genotype0:treatment1

genotype1:treatment1

question1: what is the meaning of the intercept here? When I am comparing for a certain coefficient do I compare against the intercept?

and to get the effect of treatment in genotype0, I ask for:

lrt <- glmLRT(fit, coef="genotype0:treatment1")

This seems correct to me, assuming that the intercept means genotype0:treatment0 (is that right?).

question2: why with this way I have taken care of the effect of the animal?

In a similar way I find the effect of treatment in genotype1, i.e. lrt <- glmLRT(fit, coef="genotype1:treatment1")

2) If I need to find the genes that respond differently with treatment in genotype1 vs genotype0, then I guess that I need to form contrasts. Will this be something like c(0,0,0,0,-1,1) ?

3) The aforementioned model matrix does not help me with the effects of genotypes on treatments. Thus, I form another model.matrix aka

design2 <- model.matrix(~ treatment + treatment:animal + treatment:genotype )

question3: is it correct to analyze the same data with 2 independent design matrices? or this can lead to some bias (multiple testing?)

4) Then, I thought that perhaps it would be better to put all these together in one model matrix as:

design.together <-model.matrix( ~ genotype + treatment + genotype:treatment + treatment:animal + genotype:animal)

To do the comparisons, I formed contrasts (as suggested here http://seqanswers.com/forums/showpost.php?p=153077&postcount=13) as follows:

cntr1 <- colMeans(m[genotype==1 & treatment == 0,]) - colMeans(m[genotype == 0 & treatment == 0,])

this results to:

cntr1

(Intercept) genotype1 treatment1

0.0 1.0 0.0

genotype1:treatment1 treatment0:animal1 treatment1:animal1

0.0 0.0 0.0

genotype1:animal1

0.5

question4: should the 0.5 term be there in genotype1:animal1? is this approach correct? if yes, is it equivalent to the previous design?

Thanks a lot, and sorry for all these questions

edgeR batch effect differential gene expression paired samples • 2.3k views

ADD COMMENT • link updated 9.5 years ago by Aaron Lun ★ 28k • written 9.5 years ago by pavlidisp ▴ 10

score 2 · Answer 1 · 2015-10-25

Your design is incorrect, as it treats animal as a real-valued covariate. You should be treating it as a factor:

genotype <- factor(targets$genotype)
animal <- factor(targets$animal)
treatment <- factor(targets$treatment)
design <- model.matrix(~0 + animal + genotype:treatment)

# To get to full rank:
design <- design[,-grep("genotype[01]:treatment0", colnames(design))]

Now, if we look at colnames(design), we get:

[1] "animal1"              "animal2"              "animal3"             
[4] "animal4"              "genotype0:treatment1" "genotype1:treatment1"

The first four coefficients represent the baseline expression for treatment 0 in each animal. The fifth coefficient represents the effect of treatment 1 in genotype 0, while the sixth coefficient represents the treatment effect in genotype 1. You can drop either of these with coef=5 or coef=6 in glmLRT or glmQLFTest to determine the effect of treatment in each genotype. If you want to test for a differential effect between genotypes, you can simply set contrast=c(0,0,0,0,1,-1) in the testing function.

However, with this design, there is no way to test for DE genes between genotypes for the same treatment (i.e., your aims 3 and 4). Any direct difference between genotypes is inherently confounded by the animal blocking factor. Instead, if you want to do this comparison, you'll have to treat different animals as replicates. This means taking animal out of the design, possibly using a one-way layout:

grouping <- paste0(genotype, ".", treatment)
design <- model.matrix(~0 + grouping)

You can then compare directly between groups using the contrast argument. Note that the treated and untreated samples from each animal are treated as independent, which is unlikely to be true. In limma, this would be accounted for by blocking on animal in duplicateCorrelation. edgeR doesn't have that capability, which means that the dispersion estimates will be treated as being more reliable than they really are (especially in the QL framework). This may result in some liberalness.

Don't worry about using different design matrices to analyze the same data. You're controlling the FDR within each contrast, which should - more or less - grant you control of the FDR across all contrasts. If you want to be more rigorous, you can pool the p-values from all contrasts and apply the BH method across all of them, e.g., as described in C: EdgeR - RNA-Seq time course analysis - Testing main and interaction effects and . This will guarantee control of the FDR across all genes and contrasts.