Question

edgeR: Usage of estimated dispersion from replicates for pools of the same experiment

0

Entering edit mode

k.roelfs • 0

@4c64e625

Last seen 2.1 years ago

Germany

Hello everybody,

I'm currently analyzing RNAseq data of plants of the same species but different phenotypes (A and B).

In case of phenotype A, we sampled two different tissues (T1a and T2) before altering growth conditions, sampled T1 again (T1b) during this alteration and T1 one last time after returning back to normal (T1c).

In case of phenotype B, we just sampled T1 and T2 without changing growth conditions.

Due to the high degree of heterozygosity of the plants used for the experiment, we thought it best to pool the same tissue under the same conditions from multiple plants (30 - 100) to reduce noise and capture only those genes that show large differences in expression.

We were aware of the challenges that this design would pose for statistical analysis and therefore provided 3 biological replicates for sequencing for each tissue and growth condition (at lower depths than for the pools). We would then use the replicate data to estimate the dispersion and use it to analyze the data from the pools.

My question now is how to do this properly.

We already had a look at the manual from edgeR (Chapter 2.12) and thought we could use the common dispersion estimated from the replicates for the pools. This would mean that the same dispersion value would have to be used for all genes, and we are not sure if this would be the best way.

library(edgeR)

counts <- read.delim(data, header=T, row.names="ID")
group <- factor(c("B_T1", "B_T1", "B_T1", "B_T2", "B_T2", "B_T2", "A_T1a", "A_T1a", "A_T1a", "A_T2", "A_T2", "A_T2", "A_T1b", "A_T1b", "A_T1b", "A_T1c", "A_T1c", "A_T1c"))
y <- DGEList(counts=counts, group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~0+group)
colnames(design) <- levels(y$samples$group)
y <- estimateDisp(y, design, robust=T)
c.disp <- y$common.dispersion

# Continue with analysis of pools

counts <- read.delim(data, header=T, row.names="ID")
group <- factor(c("B_T1", "B_T2", "A_T1a", "A_T2", "A_T1b", "A_T1c")
y <- DGEList(counts=counts, group=group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

design <- model.matrix(~0+group)
colnames(design) <- levels(y$samples$group)
y$common.dispersion <- c.disp

# Continue with glmFit approach
...

An alternative approach would be to use the tagwise dispersion of each gene from the replicates for the genes detected in the pools, but this would result in dropping several genes which could not be detected in the replicates but in the pools because we would have no dispersion value for them.

Is one of this approaches the "best" way or are there other suggestions?

Thank you, Kai

edgeR RNAseq • 1.5k views

ADD COMMENT • link updated 2.1 years ago by Gordon Smyth 52k • written 2.1 years ago by k.roelfs • 0

0

Entering edit mode

Sorry, I don't understand your experiment.

You are reading the edgeR Section on on what to do without replicates, but your experiment appears to have three biological replicates for each tissue and condition.

You say that samples are extracted from 30-100 plants. Are all the samples extracted from this number of multiple plants or only some? If only some, then which samples are pools and which are not?

If some samples are not pools, then what are they? Single plants?

Is each sample extracted from a different set of plants? If not, then which samples are extracted from the same plants?

ADD REPLY • link 2.1 years ago Gordon Smyth 52k

0

Entering edit mode

I'm sorry, I know it is kinda confusing...

We have 6 pools in total: Phenotype A: T1a, T2, T1b, T1c Phenotype B: T1, T2

Each pool consists of 30 - 100 of the respective tissue samples, i.e. for phenotype A and tissue T1a, material from ~100 plants was pooled together in equal amounts and RNA was subsequently extracted from it. T2 of the same phenotype has "only" ~40 tissue samples pooled together and so on.

But at the same time, for each pool, we have three biological replicates from the same population of plants with the respective phenotype and time point.

I hope it's more clear now.

ADD REPLY • link 2.1 years ago k.roelfs • 0

0

Entering edit mode

Am I understanding correctly that you have 24 RNA-seq libraries in total, 6 from the pooled samples and 18 from regular samples with biological replicates?

Also, are the biological replicate samples are from the same plants as used for the pools or from different plants? In other words, are they biologically independent from the pools? I assume that they must be from different plants rather than just being aliquots of the pool RNA.

ADD REPLY • link 2.1 years ago Gordon Smyth 52k

0

Entering edit mode

Yes you are right, we have 24 RNA-seq libraries in total.

The replicates come from the same population of plants, but independent from those which were used for the pools.

ADD REPLY • link 2.1 years ago k.roelfs • 0

0

Entering edit mode

How many plants are used for each of the biological replicate samples? One plant or a few?

ADD REPLY • link 2.1 years ago Gordon Smyth 52k

0

Entering edit mode

For tissue T1 and each time point, each replicate corresponds to a single plant.

For tissue T2, we actually had to sample 5 plants per replicate to obtain enough material because the tissue we are interested in covers only a small area.

ADD REPLY • link 2.1 years ago k.roelfs • 0

score 2 · Accepted Answer · 2023-01-27

Due to the high degree of heterozygosity of the plants used for the experiment, we thought it best to pool the same tissue under the same conditions from multiple plants (30 - 100) to reduce noise

That is an unfortunate decision because it is never a good idea to pool samples. It is always better to divide the plants into smaller groups and perform biological replicates, even if you do no more sequencing total for the biological replicates than you would have done for the pool. For a given number of plants and a given amount of sequencing, the biological replicates cost hardly any more than the pool but are enormously more valuable for statistical analysis. In other words, it is always better to allow edgeR to pool the samples in silico rather than physically pooling the RNA.

We would then use the replicate data to estimate the dispersion and use it to analyze the data from the pools.

No, that is the worst of both worlds, losing the advantages of either the pools or the biological replicates. If the pools really are less variable than the individual replicates, which was the whole motivation for forming the pools in the first place, then the dispersion from the biological replicates will be inappropriately large.

What you actually have now is four biological replicates for each tissue & growth condition but with one replicate based on many more plants than the other replicates. edgeR doesn't have the ability to downweight or upweight individual samples based on quality so, using edgeR, the best you could do in the circumstances is to simply treat the samples as four biological replicates and ignore the fact that some are pools and some are not.

Alternatively you could use a limma-voom pipeline, again treating the samples as 4 biological replicates, but asking limma to weight the pooled samples differently:

fit <- voomLmFit(y, design, var.group = PoolStatus)

where PoolStatus just indicates whether a sample is a pool or not.

Next time, if you do a similar experiment again, it would be better to simply form 4 or more biological replicates with roughly the same number of plants used for each library.