I have RNA sequencing data for which I would like to look at the differential gene expression effect of a certain treatment or condition while correcting for batch. However, I am not sure how to set up a proper design formula to do this.
My data looks like this (simplified):
batch treatment
1 a control
2 b treated
3 c control
4 c treated
Except, in my actual data I have between 15-19 replicates of each of these 4.
Now, if all of these where processed in a different batch, I would use the following design:
~ batch + treatment
However, in my case, I think that there should be a better way to do this. If I look at the differentially expressed genes between 3 and 4, there should be no batch effect there to be corrected for. If I look at the differential expression between 1 and 2, there is a batch effect on top of the effect of the actual treatment effect that I am looking for. I think I should be able to look at what does and what does not overlap between these two comparisons, and use that to better define my batch effect.
I have been looking through other peoples experiment designs in DEseq and think that I should take the interaction between different variables into account in my design formula, but I am not completely sure how to properly set this up. Anyone have some insight into this?
Thank you for your answer!
I definitely trust DESeq, but I think my design formula could better reflect the difference that I am looking for than the simple formula that I described.
In addition, I think it may help to clarify a bit more what actually my 'batches' are. The difference between the batches is the index primer that we used to label them, and the day the libraries were sequenced. Actually, a more precise representation would be the following:
So when I would control for primer (with the formula below), I assume that I will be removing part of the variation that I am looking for from my data with the batch effects.
But when I control only for the day of sequencing (see next), I maintain part of the batch effect that was due to primer difference on sequencing on day 1.
As the batch effect that I see seems to be linear when judging from the PCA, my thought was that the difference between the comparisons 1 versus 2 and 3 versus 4 would help to better define the batch effect, and setting up the right design formula including this comparison could be an elegant way to do this. I am just not sure how to set up something like that, and whether it is possible. Any suggestions are definitely appreciated!
Did you already look at PCA or MDS plot to verify if the batch effect you expect is really present?
My PCA plot shows that the batch effect is one of the biggest differences present in my dataset. I can add the plot if that helps?
Neither day of sequencing nor indexing primer add any technical error worth modeling. (Day of library prep, or day of RNA extraction can matter)
Ah so when I said day of sequencing, that also includes the day of library prep and day of RNA extraction. However, with the two samples from day 1, the only real difference is the primer.
Maybe the PCA plots with my data will be useful.
This image is a PCA plot colored by the treatment condition: https://imgshare.io/image/GLHJy
This is the same PCA plot, but colored by the batches (i.e. primer number, with 025 and 026 processed on the exact same day): https://imgshare.io/image/GL0Xe
What strikes me is that all the replicates come from the same batch and where all frozen at the same moment. I would therefore not expect any batch effects, but I clearly see it in the PCA plot.
Sequencing index by itself doesn't add any technical variance. But in your case, it might be a surrogate for something else. Your groups A and AC are thoroughly confounded with your primers. All your 026s are group A, all your 025s are group AC. It looks like the ones you did on that day are separating by group A/AC, and the 027 batch from the other day is not.
Exactly!
And with batch 027, everything was pooled together and processed with the same primer. So the only thing I know to be different is the primer, but it is off course possible that something happened to one pooled sample during processing that slightly effects the results - although this is not something that I can check.
Now, what I was wondering was if I can use the information that I have to better define the batch effect in my DEseq design formula, or if that is not possible at all, in which case I would just look at my hits gene by gene if the effect is not batch-related (using the plotCounts function). Would this be the right way to go?
So include batch in your design. But not sequencing primer. Sequencing primer is not a problem. (Or, if it were, since you completely confounded it with treatment, you can't tell.)
How on earth did you only have three sequencing primers for all those samples?
To be clear, the samples have different CelSeq2 primers, but the samples where then pooled together in a library and given a library index (which is where I see a batch effect).