After performing an RNA-seq analysis with 35 samples (10 controls, 25 cases), we've sequenced an additional 6 samples of a separate nature and would like to combine all 41 samples in a larger analysis. The design of the experiment is basically this:
Older data - 4 separate conditions (control, negative, intermediate, positive), 5 batches
Newer data - 1 completely different condition (diffuse), 1 batch
We know there are batch effects in the older data and would like to correct for them, but we are unsure of the best way to do so once all the data are combined, because condition and batch are confounded in the newer data. The approach I've tried takes the old data, applies removeBatchEffect from the limma package, sets any negative values produced by the batch-effect removal to zero, combines the old and new data, and then runs voom with just condition in the model. This seems to yield the desired results. However, the comparisons of each condition to the controls within the older data differ greatly from those in our previous analysis, which simply included batch in the model. Unfortunately, we can't include batch in the model here because the new data's condition and batch are confounded. Would it be possible to model the older data through voom with just the batch factor, combine the old and new data, and then model the complete set with the condition factor? I'm hoping for suggestions on the best approach to remove the batch effects from the older data while still maintaining the power to compare the newer data's condition to the older data's conditions.
Thanks in advance!
Steve
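To make the confounding concrete, here is a minimal sketch in Python (standard library only; the analysis itself was done with limma in R). It builds an additive condition + batch design matrix and checks its rank. The sample layout is an assumption for illustration: one sample per old condition in each of the 5 old batches, plus 6 diffuse samples in a single new batch. The batch labels `b1` to `b6` and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design(samples, conditions, batches):
    """One indicator column per condition, plus one per batch except the first (reference)."""
    return [
        [int(c == cond) for cond in conditions] + [int(b == bt) for bt in batches[1:]]
        for c, b in samples
    ]


OLD_CONDITIONS = ["control", "negative", "intermediate", "positive"]
OLD_BATCHES = ["b1", "b2", "b3", "b4", "b5"]  # hypothetical batch labels

# Assumed layout (not the real sample sheet): old conditions crossed with old
# batches, and the diffuse condition present only in the single new batch b6.
old_samples = list(product(OLD_CONDITIONS, OLD_BATCHES))
all_samples = old_samples + [("diffuse", "b6")] * 6

old_design = design(old_samples, OLD_CONDITIONS, OLD_BATCHES)
full_design = design(all_samples, OLD_CONDITIONS + ["diffuse"], OLD_BATCHES + ["b6"])

old_cols, old_rank = len(old_design[0]), matrix_rank(old_design)
full_cols, full_rank = len(full_design[0]), matrix_rank(full_design)

print(f"old data only: {old_cols} columns, rank {old_rank}")
print(f"old + new:     {full_cols} columns, rank {full_rank}")
```

Under this assumed layout the old-data design is full rank, while the combined design has one more column than its rank: the diffuse column and the new-batch column are identical, so a fixed batch term cannot be separated from the diffuse effect.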
Aaron,
Thanks for the quick reply! As the experiment was originally only meant to encompass the older data, we definitely realize the design is poor; we're just trying to see if there's a solution that leads to a legitimate comparison of the data. The more I looked at removeBatchEffect, the more I realized it's not a good solution for DE analysis and is more useful for visualizing data.
From an initial look, your suggestion of using duplicateCorrelation looks like it will work! The batches in the older data are not confounded with condition. We may have to add a disclaimer about the time effect, but I will check that with the MDS plot and let you know. I'll delve a little deeper in the next few days and come back with some analysis.
Thanks again!
- Steve
Aaron,
I've compared your suggested method, which includes the new data, to the analysis of just the old data with batch in the model. There are some differences in the number of DE genes, but they are not large, and the fold changes of the genes common to both analyses are strongly similar. As for testing for a time effect, the MDS plot of the data shows the diffuse condition separate from the other conditions. However, I don't believe this is a time effect, given the nature of the DE genes, many of which are ones we expected to be different. I think we will move forward with the analysis using this method. Thanks for your help!
Steve
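For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch (standard library only) of one way to compare two DE result sets: count the overlap, then check direction agreement and the correlation of log fold changes over the shared genes. The gene names and logFC values are made up for illustration and are not results from this thread; `compare_de` and `pearson` are hypothetical helper names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; needs at least two points and nonzero variance in both."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def compare_de(first, second):
    """Compare two {gene: logFC} dicts of DE genes from two analyses."""
    common = sorted(set(first) & set(second))
    same_direction = sum((first[g] > 0) == (second[g] > 0) for g in common)
    return {
        "common": len(common),
        "only_first": len(set(first) - set(second)),
        "only_second": len(set(second) - set(first)),
        "same_direction": same_direction,
        "pearson_r": pearson([first[g] for g in common], [second[g] for g in common]),
    }


# Made-up logFC values for illustration only.
batch_in_model = {"geneA": 2.1, "geneB": -1.4, "geneC": 0.9, "geneD": 1.7}
dup_correlation = {"geneA": 1.9, "geneB": -1.6, "geneC": 1.1, "geneE": -0.8}

summary = compare_de(batch_in_model, dup_correlation)
print(summary)
```

A high correlation and consistent directions among the shared genes suggest the two models agree on effect sizes, even where significance calls differ near the threshold.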
Yes, upon thinking about it, you wouldn't be able to use the MDS plot to evaluate whether or not there was a time effect, given that it's confounded with the conditions. So you'll just have to assume there isn't.