Question

voom for spectral counts

4

Entering edit mode

Ben Neely ▴ 40

@ben-neely-7299

Last seen 10.2 years ago

United States

I am re-analyzing a data set from many years ago and since then voom has come along and I would like to address its applicability to a typical proteomics label-free analysis.

The data is a proteomics data set which is spectral count information for ~200 proteins (I am using spectral counts not emPAI, NSAF, etc). This is across 11 samples, 8 disease 3 control. Also to note is that spectral count data in proteomics can be zero, and therefore the data is quite sparse at times. For instance, approximately 20% of the proteins have zero values for at least 6 of the samples. The counts themselves can run from 0 to ~150 (this can range depending on the mass spec and experimental workflow). Lastly, it is accepted in proteomics that zeros can be replaced with a half-minimum observed value prior to statistical analysis.

A conservative approach to this type of data is to log transform and analyze with a equal two sample t-test. Another is to not transform and use a rank sum approach. Other approaches are proteomics specific like PLGEM, or count specific like EdgeR, DESeq, baySeq, etc, which I somewhat described here.

I have used a voom-limma approach on this data. I used the data as is, data with replaced zeros (with 0.5), and data removing proteins with 6 or more zero measurements. Each of these produces slightly different results based on how voom converts the counts. This line in the manual concerns me with how to best prune my data before running voom limma:

The limma-voom method assumes that rows with zero or very low counts have been removed.

I am not sure (1) how many zeros can be allowed, and (2) what "low" means. I have run across some other discussions which imply that voom-limma might be better at dealing with the low counts we see in proteomics as opposed to other count based methods like DESeq.

Any guidance would be appreciated, and if this isn't applicable, that is fine. Proteomics is often driving downstream experiments and this data has already had secondary confirmation by westerns and IHC. In other words, if voom-limma isn't applicable, a t-test works just fine at telling the story we have experimentally confirmed.

Thanks ahead of time for any help.

limma voom proteomics spectral counts massspectrometry • 7.9k views

ADD COMMENT • link updated 10.2 years ago by Gordon Smyth 52k • written 10.2 years ago by Ben Neely ▴ 40

0

Entering edit mode

Here is a related question relating to count data of microbial communities converting extremely sparse count dataframe to continuous distributions for study in WGCNA

ADD REPLY • link 8.7 years ago chrisclarkson100 ▴ 50

6

Entering edit mode

Bernd Fischer ▴ 550

@bernd-fischer-5348

Last seen 8.3 years ago

Germany / Heidelberg / DKFZ

Counts in spectral count data are usually very low, therefore, I expect that the normal approximation from voom will not be the best solution, but it depends on your data. I would prefer one of edgeR, DESeq, DESeq2, or similar.

Filtering proteins with a low number of counts or a low number of non-zero values is a good idea, for both voom / limma or edgeR, DESeq, because it is very unlikely that you will obtain significant hits for proteins with a very low spectral count. See "Independent filtering increases detection power for high-throughput experiments., Bourgon, et al., PNAS, 2010".

ADD COMMENT • link 10.2 years ago Bernd Fischer ▴ 550

1

Entering edit mode

In addition, filtering in voom gets rid of discrete log-transformed values for which it is difficult to fit a mean-variance trend; check out C: Voom on TCGA data shifts count distributions towards negative values . Count-based methods like edgeR tend to be a bit more robust to low counts, as discreteness is explicitly considered by the statistical model.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for the response. I obviously got something mixed up with thinking voom was best for low numbers and the other count based were better at high numbers. Seems the other way around, and I appreciate your clarification and recommendation.

As for filtering out these low values, I am still uncertain as to what low values violate assumptions within the different count based methods. Bourgon demonstrates that using mean (or variance) as a independent filter improved the power of analysis after correction for multiple testing. Similarly I could evaluate my data to see what an appropriate theta would be, but it would be more interesting to know what "low" really means. I responded to Aaron Lun's comment saying that defining low as < 5 or 10 spectral counts could be appropriate since from my experience these proteins have a decent amount of variability due to technical effects.

Thanks for your response and advice.

ADD REPLY • link 10.2 years ago Ben Neely ▴ 40

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

I'll flip your questions around a bit, such that we keep proteins where there are at least N libraries with a count above X. The code might look a bit like this, for a matrix of counts:

keep <- rowSums(counts > X) >= N
counts <- counts[keep,]

Now, our problem involves choosing appropriate values for N and X. For N, the value can be determined from your experimental design. In your case, you've got 8 samples in one group and 3 samples in the other. This suggests that N=3 is most appropriate. Otherwise, you might filter out significant proteins that are present in the smaller group and near-absent in the larger group.

The value of X is more arbitrary. From a biological viewpoint, how low must expression be in order to be considered negligble? That's hard to answer, but from a statistical perspective, we want to get rid of those low, discrete counts that will mess up voom's approximations. A value of 5 - 10 might be appropriate for your data. Alternatively, you can filter on CPM values above 1 - 2, as is done in the case studies in the edgeR user's guide.

Of course, as Bernd has suggested, count-based models might be more appropriate if many of your counts are low.

P.S. voom already adds 0.5 to the counts during its internal log-transformation, so there's no need to do it manually. Also, if you remove proteins with 6 or more zero measurements, then you could be throwing out significant proteins that are totally absent in the larger group but present in the smaller group.

ADD COMMENT • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for the excellent response. I am still not completely clear on what "low" could mean, but from my experience I can say that normally something under 5 spectral counts is extremely variable and typically won't yield significance, not because the protein itself is that variable (which it might be) but due to the technical nature of how data dependent mass spec runs are set up for shotgun proteomics analyses. Bottom line, filtering base on a cutoff of 5 sounds reasonable with N>=3.

On a side note, if voom uses a lowess the way I think I am understanding Law et al., 2014, wouldn't that mean that there is probably some requirement for number of observations? I remember a normalization paper or manual which indicated that lowess works well with 1,000s of points, but not with 100s of points. This could be another reason voom may not be the best approach for this set specifically. Albeit, many proteomics experiments do have 1000s of variables at which point this could be viable.

Thank you again for your help and insight.

ADD REPLY • link 10.2 years ago Ben Neely ▴ 40

1

Entering edit mode

On the side note: having 100s of points is not a particular problem for lowess or voom. We have used voom on datasets with a few hundred probes. If the number of features is very low, you could increase the span setting used by voom to ensure more stability.

ADD REPLY • link 10.2 years ago Gordon Smyth 52k

0

Entering edit mode

The lowess algorithm probably does have a minimum requirement for the number of data points, in order to obtain a stable fit (i.e., avoid overfitting). I'm not sure how large this needs to be, though. We usually have more than enough points in genome-scale data sets to obtain reproducible curves. My only comment would be that you'd need more points if you wanted to reduce the span/bandwidth of the fit.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

score 10 · Accepted Answer · 2015-01-30

We have experimented with a few of different statistical approaches to proteomics data and spectral counts. We are still learning, but spectral counts seem to behave somewhat differently to RNA-seq counts, showing much more technical variability, so I don't think that I would recommend voom or classic edgeR or DESeq.

Rather than voom, a slightly simpler and very robust approach would be to tranform to y = log2(count+1), quantile normalize, then run a limma pipeline with eBayes(trend=TRUE, robust=TRUE). I feel confident that this would be better than using ordinary t-tests.

Alternatively, if there are a lot of zeros in your data, then a quasi-negative-binomial approach may well be preferable. Have a look at Section 4.7 "Profiles of Unrelated Nigerian Individuals" of the edgeR User's Guide for how a quasi-likelihood pipeline might go.

Whatever pipeline you use, you should filter out low count proteins, as Aaron and Bernd have already discussed. How much filtering to do? The mean-variance plots will guide you. For voom, have a look at the plot produced by voom(plot=TRUE). Ideally the trend line should be monotonically decreasing. If it goes sharply up before trending down, then you haven't filtered enough. For limma and log-counts, look at plotSA(fit) where fit is output from lmFit and eBayes. The trend line should be smooth. If it increases sharply at the low count end, then you haven't filtered enough. For edgeR quasi-likelihood, look at plotBCV and plotQLDisp. Again the trends should be smooth.