Hello
I am having trouble choosing a model matrix to analyze certain RNA-Seq data set. I will first describe the experiment and afterwards show between which designs I am doubting.
Design of the experiment
There are six distinct animals and each animal received the control (untrt) and the treatment (trt). The control and the treatment were separated by a fixed amount of time. This makes it a paired design, where each animal serves as it own control.
After the treatment, sperm was extracted from each animal. With that sperm sample, five embryos were generated.
This results in 60 samples (6 animals times two treatments times 5 embryos = 60 samples).
This is a part of my metadata.
> head(setup, 10)
animal treatment attribute
sample21 animal1 untrt good
sample22 animal1 untrt good
sample23 animal1 untrt good
sample24 animal1 untrt good
sample25 animal1 untrt good
sample26 animal1 trt good
sample27 animal1 trt good
sample28 animal1 trt good
sample29 animal1 trt good
sample30 animal1 trt good
This shows that the experiment is completely balanced.
> sapply(setup, table)
$animal
animal1 animal2 animal3 animal4 animal5 animal6
10 10 10 10 10 10
$treatment
trt untrt
30 30
$attribute
bad good
30 30
The goal of the experiment is to discover if the gene expression of the embryos differs between the treatments. But I am confused what to do with the different embryos? I suppose the embryos can't be considered biological replicates, because one sperm sample was used to generate five embryos.
First idea
My first idea was to sum the counts for all embryos from a given animal and a given treatment (for example sum all samples from animal1 and untrt). But I am not sure if it OK to do this. Don't you loose information about the variability between the embryos when summing ?
Second idea
Keep the metadata and samples as they are and fit a model with animal as blocking factor (~ animal + treatment
).
In both cases, I would test the last coefficient of model.
Thanks in advance.
P.S. This is my first post on this forum, so I hope I followed all conventions.
From a statistical perspective, fitting a fixed effect for animal makes the assumption that differences between litters are relatively consistent increases or decreases in expression for some genes, within each litter, so you can adjust for those differences by estimating the mean per litter and adding that as a coefficient in the model to account for it.
Probably the best way to see if this is a reasonable assumption is to do an MDS plot and see if you get samples grouping by litter in one or more of the first few dimensions, which would indicate consistent expression levels within litter. Another alternative would be to fit the model with an 'animal' factor and see how many genes are significant. How many are 'enough' is up to you as an analyst to decide.
Fitting a random effect (the
voom
duplicateCorrelation
pipeline that Steve mentions) makes the assumption that the gene expression within litter is correlated to some extent. In other words, if a gene in one littermate is high, you might expect that gene could be high in some of the other littermates. You can runvoom
andduplicateCorrelation
(twice each, supplying the correlation value from the first run ofduplicateCorrelation
tovoom
the second time around), and see if the consensus correlation is 'high enough' to provide evidence that it's a thing (where 'high enough' is something you have to define as the analyst). So something likeAnd deciding if the value of
vcor$consensus.correlation
is large enough to care about.Thank you for your answer.
I made a PCA plot of the 60 samples. I did not obverse a strong clustering by litter. It looked like a random cloud of points.
Why do you run
duplicateCorrelation
twice?If it is allowed, I have an off-topic question. Could you recommend one or more sources to learn about the (dis)avantages of certain designs? I can come up with different ideas, but I sometimes do not see the (dis)advantages.
When you run
voom
the second time, by including block and correlation you use a linear mixed model (rather than the fixed effects model used in the first iteration) to estimate the observation-level weights, which may affect those weights. When you runduplicateCorrelation
, the weights are used as part of estimating the correlations, so you want to provide the updated weights to get a better estimate of the consensus correlation.I can't come up with any sources offhand, other than just regular linear modeling textbooks. During my formal training, like 20 years ago now, they used Applied Linear Statistical Models, by Neter et al, which I always thought was pretty approachable (especially for someone like me, who came to this via a biology undergrad degree). No idea if that book is still around, but I imagine it is? I always though Julian Faraday's linear modeling book was pretty good as well, being R based and all. But most books like that approach the problem from 'if you have repeated measures you must use mixed models', rather than saying what the pros and cons might be.
Another thing you could do is just search this support site for Aaron Lun and duplicateCorrelation or so. He's like way smarter than I will ever be, and does a good job of outlining the pros and cons. But you have to find the relevant posts... I usually do something like <search strings="" go="" here=""> site:support.bioconductor.org on Google, which seems to do a better job than the site search functionality. Although I believe it is supposed to have been improved.
Thanks for chiming in, James!
I'm always grateful when the stats folks chime in with mini lessons like this. I've learned a lot about the proper application of linear models to genomics analysis through posts just like this one through the years.
Thank you for the answer.
I felt that the terms biological and technical replicate were too restrictive in this case. They seem to be neither of them. Given your example, I would also call them biological replicates.
Thanks for the tip about
duplicateCorrelation
.