Visualising effects of observation/sample weights in differential expression analysis
2
0
Entering edit mode
m.fletcher ▴ 20
@mfletcher-7284
Last seen 8.2 years ago
Germany

Hello,

I'm working on a RNA-seq differential expression analysis on human samples, and have been trying to understand the effect of the various packages that allows the weighting of samples or features - e.g. voom's sample quality weights, or edgeR-robust's implementation of observation weights.

I'm doing this because I'm particularly concerned about the effect of extreme outliers in my analysis - i.e. samples with counts for a gene that are 2 orders of magnitude different to the average across the entire sample set. Based on my experience of previous differential analyses on this dataset, I'm worried about a large proportion of the significant genes being called as significant due to the effect of 2-3 of these hugely outlier samples. (The dataset consists of 48 samples, divided into 4 groups.)

I'm also wondering about batch effects in my data, so I've tried out the sva package as well, which - from my understanding - uses a similar sample weighting approach to account for confounding variables.

My question is: given that the weights (of all types) are generally used in the fitting of the GLM to call differential genes, is it reasonable to include the weights when plotting a PCA or MDS, to see the effect of the weights on the sample group separation?

To my mind, if you're doing the MDS/PCA on the unweighted count data, then you weight the data before doing the actual differential analysis, it should still be valid to include the weights at the MDS/PCA stage.

Thanks in advance for any suggestions and insights into the statistics!

limma voom sva edgeR • 3.8k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 11 hours ago
United States

If you include the weights in the PCA or MDS, you by definition cannot see the effect of the weights because the plot now includes the weights! Anyway, the weights aren't used to modify the data - instead they are used as part of the linear regression to decrease the impact of the outliers on the model fit. There is a pretty good explanation here (http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd143.htm and http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd432.htm).

Instead, I generally make the PCA plot and then plot the weights as well (right above or below each sample), so I can see which samples are being down-weighted, and by how much.

ADD COMMENT
1
Entering edit mode

Dear James,

Thank you for the references, although your second link doesn't work!

I'm trying your suggestion of plotting the weights also, although there doesn't seem to be any association of up-/down-weighting of samples and where they fall on my MDS (within a subgroup, between two subgroups, etc. etc.)

In which case, I'll rephrase my question: are there any ways to visualise the effect that these sample-level weights will have?

I'd like to be able to share a nice figure, because otherwise the only way I can think of comparing the analysis with/without sample weights is by looking at the final DE gene lists. Unless I'm missing something very obvious!

ADD REPLY
0
Entering edit mode

In case someone is still interested (4.4 years later) in links provided by James, here is the URL for second link https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd432.htm

ADD REPLY
1
Entering edit mode

Hi Jim,

I have a very similar issue about attempting to visual the data with the weight effects removed for a heatmap. If I had a simple batch effect, I would include it as a term in the model but I could also use removeBatchEffects() to get batch-adjusted expression values to use in a heatmap to see the expression patterns without the batch effects in the way. Is there any way to do something similar with sample weights? My guess is not...

Thanks!

ADD REPLY
0
Entering edit mode

Hi Jenny,

Having thought about this a little more, I think you can just multiply your gene expression values by the weights and then do the heatmap.

My rationale for this idea is that in an ANOVA context you are simply computing the means of each group - and it is always a weighted mean, the weights just happen to be all 1s in 'regular' ANOVA. So a heatmap showing the genes that were chosen based on the results of a regular ANOVA are the expression values multiplied by their respective weights. Analogously, a heatmap showing the genes that were selected using a weighted ANOVA should also be the expression values multiplied by their respective weights.

Does that make sense?

Jim

ADD REPLY
0
Entering edit mode

Hi Jenny,

You guessed right! Removing the batch effect in the way you describe is the way to go in this example if the goal is to get better sample clustering for a heatmap. Weights are only intended to improve the summary statistics from a differential expression analysis to give users greater power to detect changes, and won't help improve visualizations of the raw data. 

Cheers,

Matt

ADD REPLY
0
Entering edit mode

Thanks, Jim and Matt! Multiplying the expression levels by the weights doesn't work very well because some samples have weights closer to 2 and others closer to 0 so the amount of within-group "variation" appears extremely high. I'll stick with unweighted expression values for heatmaps or just switch over to the estimated group means, which will reflect the sample weights. 

ADD REPLY
0
Entering edit mode
Matthew Ritchie ▴ 1000
@matthew-ritchie-650
Last seen 6 months ago
Australia

Apologies for the delayed response. The voomWithQualityWeights() function combines observational level weights (i.e. a distinct weight for each and every gene expression value) from voom with sample-specific weights by multiplying them together. Running this function with plot=TRUE provides a summary of the abundance related trend in variability estimated by voom in one panel and sample-specific variability in a second panel. Sample weights are often (but not always) related to how well samples cluster in the PCA / MDS plot (i.e. lower weights are assigned to samples that cluster less well).

In practice, trying an analysis with different weighting schemes (i.e. in the case of RNA-seq voom only versus voomWithQualityWeights) and seeing if you get more differentially expressed genes is a good way to empirically determine whether the extra effort of modelling sample-level variability was worthwhile. I hope this helps.

ADD COMMENT

Login before adding your answer.

Traffic: 704 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6