Hello,
I'm working on a RNA-seq differential expression analysis on human samples, and have been trying to understand the effect of the various packages that allows the weighting of samples or features - e.g. voom's sample quality weights, or edgeR-robust's implementation of observation weights.
I'm doing this because I'm particularly concerned about the effect of extreme outliers in my analysis - i.e. samples with counts for a gene that are 2 orders of magnitude different to the average across the entire sample set. Based on my experience of previous differential analyses on this dataset, I'm worried about a large proportion of the significant genes being called as significant due to the effect of 2-3 of these hugely outlier samples. (The dataset consists of 48 samples, divided into 4 groups.)
I'm also wondering about batch effects in my data, so I've tried out the sva package as well, which - from my understanding - uses a similar sample weighting approach to account for confounding variables.
My question is: given that the weights (of all types) are generally used in the fitting of the GLM to call differential genes, is it reasonable to include the weights when plotting a PCA or MDS, to see the effect of the weights on the sample group separation?
To my mind, if you're doing the MDS/PCA on the unweighted count data, then you weight the data before doing the actual differential analysis, it should still be valid to include the weights at the MDS/PCA stage.
Thanks in advance for any suggestions and insights into the statistics!
Dear James,
Thank you for the references, although your second link doesn't work!
I'm trying your suggestion of plotting the weights also, although there doesn't seem to be any association of up-/down-weighting of samples and where they fall on my MDS (within a subgroup, between two subgroups, etc. etc.)
In which case, I'll rephrase my question: are there any ways to visualise the effect that these sample-level weights will have?
I'd like to be able to share a nice figure, because otherwise the only way I can think of comparing the analysis with/without sample weights is by looking at the final DE gene lists. Unless I'm missing something very obvious!
In case someone is still interested (4.4 years later) in links provided by James, here is the URL for second link https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd432.htm
Hi Jim,
I have a very similar issue about attempting to visual the data with the weight effects removed for a heatmap. If I had a simple batch effect, I would include it as a term in the model but I could also use removeBatchEffects() to get batch-adjusted expression values to use in a heatmap to see the expression patterns without the batch effects in the way. Is there any way to do something similar with sample weights? My guess is not...
Thanks!
Hi Jenny,
Having thought about this a little more, I think you can just multiply your gene expression values by the weights and then do the heatmap.
My rationale for this idea is that in an ANOVA context you are simply computing the means of each group - and it is always a weighted mean, the weights just happen to be all 1s in 'regular' ANOVA. So a heatmap showing the genes that were chosen based on the results of a regular ANOVA are the expression values multiplied by their respective weights. Analogously, a heatmap showing the genes that were selected using a weighted ANOVA should also be the expression values multiplied by their respective weights.
Does that make sense?
Jim
Hi Jenny,
You guessed right! Removing the batch effect in the way you describe is the way to go in this example if the goal is to get better sample clustering for a heatmap. Weights are only intended to improve the summary statistics from a differential expression analysis to give users greater power to detect changes, and won't help improve visualizations of the raw data.
Cheers,
Matt
Thanks, Jim and Matt! Multiplying the expression levels by the weights doesn't work very well because some samples have weights closer to 2 and others closer to 0 so the amount of within-group "variation" appears extremely high. I'll stick with unweighted expression values for heatmaps or just switch over to the estimated group means, which will reflect the sample weights.