I want to perform a meta-analysis using 8 publicly available RNA-seq datasets, each of which has disease and control samples (~330 samples in total). A few of these datasets have samples from different histological locations (e.g. low fibrosis vs high fibrosis). The 8 datasets span two sequencing platforms (6 datasets on Illumina and 2 on Ion Torrent). What is the best way to remove the batch effects (dataset and platform) so that I can perform a differential expression analysis?
I have tried using ComBat_seq() to remove the batch effects, but it only accepts one batch variable at a time, so I'm not sure whether that is a viable approach.
Would it be better to try to add the batch effects to the model in DESeq2?
Alternatively, for this type of meta-analysis, would it be better to run each dataset individually and then perform the meta-analysis using p-value combination or other methods?
I would appreciate any suggestions.
I am currently doing a meta-analysis myself using metapod, in particular parallelStouffer. I should point out that Stouffer's method is based on the idea that p-values map one-to-one onto the standard normal distribution, so you can simply map p-values to z-scores, compute a weighted z-score for each gene, convert back to p-values and voilà!

However, Stouffer's method is based on one-tailed p-values, which differential expression software typically does not report (you usually get two-tailed p-values). You need one-tailed p-values if you want to incorporate the sign of the statistic into your meta-analysis. In other words, consider a situation where gene X has a logFC of -1.2 in study A and a logFC of 1.2 in study B, and a p-value of 0.001 in both. If you naively convert the p-values to z-scores, you get -3.1 for both, and the weighted mean of those z-scores will be something close to -3.1, which then converts back to a p-value of ~0.001.
I would argue that the p-value should be close to 1 (not zero) though, because the gene is strongly up-regulated in one study and strongly down-regulated in the other. In that case you should convert the p-values to one-tailed (using the sign of the logFC to pick the tail), run parallelStouffer, and then convert the result back to two-tailed p-values.

Thank you for your suggestions and explanation of your approach!
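For anyone who wants to see the arithmetic behind the sign-aware Stouffer combination described in the answer, here is a minimal sketch in plain Python. It is not metapod's implementation or API; the function names, the equal default weights, and the convention that the one-tailed test is for "up-regulated" are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def one_tailed_up(p_two_tailed, log_fc):
    """Convert a two-tailed p-value to a one-tailed p-value for the
    alternative 'logFC > 0', using the sign of the observed logFC.
    (Hypothetical helper; assumes 0 < p_two_tailed <= 1.)"""
    half = p_two_tailed / 2.0
    return half if log_fc > 0 else 1.0 - half


def stouffer_signed(p_two_tailed, log_fcs, weights=None):
    """Sign-aware weighted Stouffer combination for one gene.

    Returns (combined z-score, combined two-tailed p-value).
    Weights default to 1 per study (an assumption; sample-size based
    weights are a common alternative)."""
    if weights is None:
        weights = [1.0] * len(p_two_tailed)

    # Positive z means evidence for up-regulation, negative for down.
    z_scores = [
        STD_NORMAL.inv_cdf(1.0 - one_tailed_up(p, fc))
        for p, fc in zip(p_two_tailed, log_fcs)
    ]

    # Weighted Stouffer: sum(w * z) / sqrt(sum(w^2)).
    z_combined = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )

    # Back to a two-tailed p-value (lower tail of -|z|, doubled).
    p_combined = 2.0 * STD_NORMAL.cdf(-abs(z_combined))
    return z_combined, p_combined


# Gene X from the example: opposite directions, p = 0.001 in both studies.
z_opp, p_opp = stouffer_signed([0.001, 0.001], [-1.2, 1.2])

# Same p-values, but both studies agree on the direction.
z_same, p_same = stouffer_signed([0.001, 0.001], [1.2, 1.2])

print(f"discordant: z = {z_opp:.3f}, p = {p_opp:.3g}")
print(f"concordant: z = {z_same:.3f}, p = {p_same:.3g}")
```

With discordant signs the two z-scores cancel and the combined p-value comes out essentially 1, whereas with concordant signs the evidence accumulates and the combined p-value is much smaller than either input. A p-value of exactly 0 would need special handling, since the inverse normal CDF is undefined there.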