Meta-analysis on RNA-seq data sets

Travis • 0

I want to perform a meta-analysis of 8 publicly available RNA-seq datasets, each of which has disease and control samples (~330 samples in total). A few of these datasets include samples from different histological locations (e.g. low fibrosis vs. high fibrosis). The 8 datasets span two sequencing platforms (6 datasets on Illumina and 2 on Ion Torrent). What is the best way to remove the batch effects (dataset and platform) so that I can perform a differential expression analysis?

I have tried using ComBat_seq() to remove the batch effects, but it only accepts one batch variable at a time, so I'm not sure whether that is a viable approach.

Would it be better to try to add the batch effects to the model in DESeq2?

Alternatively, for this type of meta-analysis, would it be better to run each dataset individually and then perform the meta-analysis using p-value combination or other methods?

I would appreciate any suggestions.

meta-analysis sva Combat_Seq RNAseq

score 0 · Answer 1 · 2024-09-26

James W. MacDonald 68k

This isn't really the place for general analysis questions (a better choice being biostars.org). That said, I have never personally found combining disparate datasets into one analysis to be particularly useful, but ymmv. I prefer a meta-analysis approach. Depending on the studies, you could use GeneMeta to make comparisons using effect sizes (in which case you are likely better off using limma-voom rather than DESeq2, because GeneMeta expects you to have t-statistics), or you could use metapod to combine the p-values.

I am currently doing a meta-analysis myself using metapod, in particular parallelStouffer. I should point out that Stouffer's method is based on the idea that p-values map one-to-one onto quantiles of the standard normal distribution, so you can simply map p-values to z-scores, compute a weighted combined z-score for each gene, convert back to p-values, and voila!
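To make the mapping described here concrete, the following is a minimal Python sketch of Stouffer's weighted z-score method using only the standard library. It is not metapod's implementation; the function name `stouffer`, the equal-weight default, and the example p-values are assumptions chosen for illustration. It expects one-tailed p-values strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def stouffer(p_values, weights=None):
    """Combine one-tailed p-values for a single gene (hypothetical helper)."""
    if weights is None:
        weights = [1.0] * len(p_values)  # assumption: equal weights by default
    # Map each p-value to a z-score; small p-values give large negative z.
    z_scores = [_STD_NORMAL.inv_cdf(p) for p in p_values]
    # Weighted sum of z-scores, rescaled so that the combined statistic is
    # again standard normal under the null hypothesis.
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    combined_z = numerator / sqrt(sum(w * w for w in weights))
    # Convert the combined z-score back to a p-value.
    return _STD_NORMAL.cdf(combined_z)


# Made-up one-tailed p-values for one gene across three studies.
combined = stouffer([0.01, 0.04, 0.20])
```

Weights are often chosen to reflect study size, but that choice is left to the analyst here.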

However! Stouffer's method is based on one-tailed p-values, whereas differential expression software generally reports two-tailed p-values. You need one-tailed p-values if you want to incorporate the sign of the statistic into your meta-analysis. In other words, consider a situation where gene X has a logFC of -1.2 in study A and a logFC of 1.2 in study B, and a p-value of 0.001 in both. If you naively convert the p-values to z-scores, you get -3.1 for both, and the combined z-score will be at least as extreme as -3.1 (about -4.4 with equal weights, because Stouffer's method divides the weighted sum by the square root of the sum of squared weights), which converts back to a p-value of 0.001 or smaller.

I would argue that the combined p-value should be close to 1, though, because the gene is strongly up-regulated in one study and strongly down-regulated in the other. In that case you should convert the p-values to one-tailed, run parallelStouffer, and then convert the result back to a two-tailed p-value.
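The one-tailed round trip can be sketched in Python as follows. This is an illustration under stated assumptions rather than metapod's code: `one_tailed` and `signed_stouffer` are hypothetical names, the one-tailed alternative is taken to be "up-regulated", a logFC of exactly zero is treated as not up-regulated, and two-tailed p-values are assumed to lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def one_tailed(p_two_tailed, log_fc):
    """One-tailed p-value for the alternative 'gene is up-regulated'."""
    half = p_two_tailed / 2.0
    # An up-regulated gene keeps a small p-value; a down-regulated gene
    # gets a p-value close to 1 for this alternative.
    return half if log_fc > 0 else 1.0 - half


def signed_stouffer(p_two_tailed, log_fcs, weights=None):
    """Direction-aware Stouffer combination, returned as a two-tailed p-value."""
    if weights is None:
        weights = [1.0] * len(p_two_tailed)  # assumption: equal weights
    z_scores = [
        _STD_NORMAL.inv_cdf(one_tailed(p, lfc))
        for p, lfc in zip(p_two_tailed, log_fcs)
    ]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    combined_z = numerator / sqrt(sum(w * w for w in weights))
    p_up = _STD_NORMAL.cdf(combined_z)
    # Convert back to two-tailed: double the smaller of the two tails.
    return 2.0 * min(p_up, 1.0 - p_up)


# Gene X from the example: opposite directions, p = 0.001 in both studies.
discordant = signed_stouffer([0.001, 0.001], [-1.2, 1.2])
# Same p-values, but both studies agree on the direction.
concordant = signed_stouffer([0.001, 0.001], [1.2, 1.2])
```

With opposite signs the two z-scores cancel, so `discordant` ends up close to 1, while `concordant` stays very small.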