I have a 3-condition, 4-timepoint, 3-replicates model and data that is whole proteomics mass spectrometry. My data (normalized LFQ values) contains missing values that are both Missing At Random (MAR) and Missing Not At Random (MNAR). I am sure of this because a lot of proteins have missing values only in one condition and it is of biological relevance that they are not being detected in that particular condition because they are very low abundant.
My problem is I am looking for a way that can impute MAR and MNAR values differently. If I impute all of them based on normal distribution fit, it skews the data too much to low intensity values which affects the later differential expression analysis later.
Here is my design matrix:
Cond Time Batch Rep
Fed_2_B_2 Fed 2 B 2
Fed_2_D_1 Fed 2 D 1
Fed_2_D_2 Fed 2 D 2
Fed_4_A_1 Fed 4 A 1
Fed_4_B_2 Fed 4 B 2
Fed_4_C_3 Fed 4 C 3
Fed_6_A_1 Fed 6 A 1
Fed_6_D_1 Fed 6 D 1
Fed_6_D_2 Fed 6 D 2
Fed_24_A_1 Fed 24 A 1
Fed_24_C_3 Fed 24 C 3
Fed_24_D_1 Fed 24 D 1
Fasted_2_A_1 Fasted 2 A 1
Fasted_2_C_3 Fasted 2 C 3
Fasted_2_D_1 Fasted 2 D 1
Fasted_4_B_2 Fasted 4 B 2
Fasted_4_C_3 Fasted 4 C 3
Fasted_4_D_1 Fasted 4 D 1
Fasted_6_A_1 Fasted 6 A 1
Fasted_6_B_2 Fasted 6 B 2
Fasted_6_C_3 Fasted 6 C 3
Fasted_24_B_2 Fasted 24 B 2
Fasted_24_C_3 Fasted 24 C 3
Fasted_24_D_1 Fasted 24 D 1
Refed_2_A_1 Refed 2 A 1
Refed_2_C_3 Refed 2 C 3
Refed_2_D_1 Refed 2 D 1
Refed_4_B_2 Refed 4 B 2
Refed_4_D_1 Refed 4 D 1
Refed_4_D_2 Refed 4 D 2
Refed_6_B_2 Refed 6 B 2
Refed_6_C_3 Refed 6 C 3
Refed_6_D_1 Refed 6 D 1
Refed_24_A_1 Refed 24 A 1
Refed_24_C_3 Refed 24 C 3
Refed_24_D_1 Refed 24 D 1
I was trying to use proDA (https://github.com/const-ae/proDA) to see if it can do a better job to fill out these missing values. However, I am not sure if it can work with more than 2 conditions and if I do any filtering of the data before putting it in to proDA. Before I have been filtering the data to keep only those proteins which have atleast 8 out of 12 non-zero values in atleast 2 out of the 3 conditions. This filtering reduces nrow (proteins) from 4128 initially to 868 proteins.
Any help would be appreciated.
I think I was not able to fully comprehend the paper on proDA so the question: Is it possible to just get a 'imputed' matrix out of the proDA function object? For eg. if I want to test pairwise comparisons using other approaches?
Well, yes you can get the completed matrix using
The documenation for the predict function is here https://const-ae.github.io/proDA/reference/predict-proDAFit-method.html
But, I think you should not do that and then run another statistical test on the completed data. The problem is that you get unrealistically small variance estimates, so the method might consider too many proteins significant.
I would recommend to use the build in function
test_diff()
to identify significantly changed proteins.My only concern was that I know that there is a batch effect in the dataset. That's why I wanted to take the completed matrix, correct for the batch effect and then run a statistical test. Or would you suggest another method of dealing with this that is more compatible with proDA?
Also, I just saw that based on the imputation proDA does, the data is no longer following a normal distribution (I am guessing because of a lot of imputation of "constant" value for proteins which have a lot of missing values). Does Wald test work with non-normally distributed data?
Ah, okay. Thanks for the clarification :)
If the batch effect that you mentioned, is the one in the column of the design matrix, the best way to handle it is to include it in the linear model. That way it is automatically regressed out and the statistical test is independent of the batch effect.
If on the other hand, you have strong reason to believe that there are additional batch effects that were not recorded, the problem is much more difficult.
proDA
can handle some ways to correct for global intensity differences using themedian_normalization()
method and the sample specific dropout curve inference. To address any more complicated batch effects, I think, is still an open research topic, but usually the first two options should already get you quite far.After you call the
predict
function, you will often observe that many of the completed values are identical to each other and the data does not look normal. But,proDA
actually does not work on the completed dataset internally, but directly combines the information available from the observed and the missing values. In that sense, yes the Wald test does work for that kind of data, as long as you use the one based on the probabilistic dropout model implemented in my package :)I would like to extend the question. Is it possible to use the imputed data for visualizations such as PCA or heatmaps with both rows and columns clustering?
Yes I think it can be okay to use the completed dataset for some visualizations. However, for PCA and clustering in a heatmap, I think you should use the dist_approx() function.
It is specifically designed to calculate distances between samples or proteins taking into account the uncertainty from the missing values. That function will give you a more accurate estimate of the distance than working on the completed data. For an example see the Figure 5 of the proDA paper. The code to generate the heatmap is on GitHub.
To run the equivalent of a PCA, check out multidimensional-scaling (cmdscale) which does basically the same as PCA, except that it takes the pairwise distances as input (e.g. from
dist_approx()
). To cluster a heatmap, callhclust(dist_approx(fit)$mean)
.Thank you very much this is very helpful.