Question

Appropriate explanation of the methodology behind arrayWeights function from limma R package

1

svlachavas ▴ 840

@svlachavas-7225

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

I would like to ask a short, specific question about the basic methodology behind the arrayWeights() function in the limma package. From previous discussions in related posts and from the original article, I understand that empirical array quality weights mainly down-weight samples of low quality, which could otherwise contribute unusually high variance. The weights are estimated for all samples together, based on the heteroscedastic model described in the paper.

Could this methodology also down-weight samples whose unexpected variance is due to other, "supplementary" biological reasons?

For instance, could it account for inter-tumor heterogeneity when the samples come from different patients but from the same anatomic location or tissue, and/or even for tumor heteroscedasticity?

Or do the weights clearly reflect only problems relating to sample quality?

Please excuse this naive question, but I'm currently writing a report and I would not like to include any misconceptions in my description of this methodology.

Best,

Efstathios

limma arrayWeights variance sampleQuality microarray • 2.6k views

updated 9.5 years ago by Ryan C. Thompson ★ 7.9k • written 9.6 years ago by svlachavas ▴ 840

1

Yes. Low quality (or high variability) could arise from a multitude of sources (e.g. the RNA may be degraded for a particular sample, or tumor samples may be contaminated with normal cells to varying degrees, leading to increased variation), and the model has no way of knowing the precise source.

9.5 years ago Matthew Ritchie ▴ 1000

score 5 · Answer 1 · 2015-10-30

I don't fully understand the mathematics behind it, so anyone feel free to correct me, but my conceptual understanding is that arrayWeights is based on the principle that any sample is expected to be an outlier for some genes by random chance, but a sample that is consistently an outlier across many genes is likely a low-quality sample that should be down-weighted. So arrayWeights fits the linear model given by your design, and then computes sample weights based on which samples have consistently large residuals across many genes. Then it just re-fits the model with the new sample weights and re-computes new sample weights, and continues repeating this until the weights converge. So arrayWeights is just a way to identify which samples are consistently more variable across more genes than other samples. It makes no assumptions about whether the source of the variability is technical or biological.
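To make the idea above concrete, here is a minimal sketch on simulated data. The matrix `y`, the two-group layout and the choice of which sample to make noisy are all made up for illustration; only `arrayWeights()` and `model.matrix()` are real functions. One sample is given a larger standard deviation across all genes, so it should end up with the smallest weight.

```r
library(limma)

set.seed(1)
n_genes <- 2000
n_samples <- 6

# Simulated log-expression values: pure noise with unit standard deviation
# (hypothetical data, not from any real experiment)
y <- matrix(rnorm(n_genes * n_samples), nrow = n_genes, ncol = n_samples)
colnames(y) <- paste0("S", seq_len(n_samples))

# Make sample 3 consistently noisier across all genes (sd 3 instead of 1)
y[, 3] <- rnorm(n_genes, sd = 3)

# Two groups of three samples each
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

# Estimate one relative quality weight per sample
aw <- arrayWeights(y, design)
print(round(aw, 2))
```

Note that nothing in the call tells arrayWeights why sample 3 is more variable. The same low weight would result whether the extra variability were technical or biological.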

Also note that arrayWeights can also be used to compute group weights instead of sample weights (or impose any other structure on the weights) by passing an appropriate design for the var.design argument. The classic example for this is a cancer vs normal comparison where the cancer samples are assumed to be more variable than the normals (though the biological truth of this assumption is debatable). In practice, I've used this feature (via voomWithQualityWeights) to analyze RNA-seq data that were generated by two slightly different protocols that produced data of different quality but were otherwise similar enough to be comparable.
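As a rough sketch of the group-weights idea, again on simulated data: the "normal" and "tumor" labels, the matrix `y` and the variable `var_design` are hypothetical, and the single-column indicator used for `var.design` is just one way I would expect to encode a two-group variance model. Recent limma versions also document a `var.group` argument that takes the grouping factor directly.

```r
library(limma)

set.seed(2)
n_genes <- 2000

# Hypothetical design: 4 normal and 4 tumor samples
group <- factor(rep(c("normal", "tumor"), each = 4))
design <- model.matrix(~ group)

# Simulated data in which the tumor samples are more variable (sd 2 vs 1)
y <- matrix(rnorm(n_genes * 8), nrow = n_genes)
y[, group == "tumor"] <- rnorm(n_genes * 4, sd = 2)

# Variance design: a single indicator column for the tumor group, so that
# all samples in the same group are forced to share one weight
var_design <- model.matrix(~ group)[, -1, drop = FALSE]

aw_group <- arrayWeights(y, design, var.design = var_design)
print(round(aw_group, 2))
```

With this variance design the weights can only differ between the two groups, not between samples within a group.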

Lastly, I'll say that in my experience, trying to use arrayWeights on a dataset that is expected based on other quality measures to be of consistent quality generally either makes no difference or ends up giving you even fewer significant genes than without it, while using it on a dataset that is expected to have variations in quality (e.g. samples with varying degrees of RNA degradation, or the aforementioned RNA-seq dataset) tends to give you more significant genes.
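If you want to check this on your own data, one simple approach is to fit the model with and without the weights and compare the number of significant genes. Below is a sketch on simulated data with one deliberately noisy sample. The data, the effect size, the 5% FDR cutoff and the helper `count_sig()` are all assumptions made for illustration, and a simulation like this says nothing about how a real dataset will behave.

```r
library(limma)

set.seed(3)
n_genes <- 2000
n_de <- 200   # number of truly differentially expressed genes (simulated)

group <- factor(rep(c("A", "B"), each = 4))
design <- model.matrix(~ group)

# Simulated log-expression values with a shift of 2 for the first n_de genes
y <- matrix(rnorm(n_genes * 8), nrow = n_genes)
y[1:n_de, group == "B"] <- y[1:n_de, group == "B"] + 2

# Degrade sample 4 by adding extra noise to every gene
y[, 4] <- y[, 4] + rnorm(n_genes, sd = 4)

# Hypothetical helper: number of genes with BH-adjusted p-value below 0.05
count_sig <- function(fit) {
  tt <- topTable(fit, coef = 2, number = Inf, sort.by = "none")
  sum(tt$adj.P.Val < 0.05)
}

# Fit without sample weights
fit_plain <- eBayes(lmFit(y, design))

# Fit with estimated sample weights
aw <- arrayWeights(y, design)
fit_weighted <- eBayes(lmFit(y, design, weights = aw))

n_unweighted <- count_sig(fit_plain)
n_weighted <- count_sig(fit_weighted)
cat("Significant without weights:", n_unweighted, "\n")
cat("Significant with weights:   ", n_weighted, "\n")
```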