Hello, I just wanted to get a detailed answer on how limma handles missing values.
I am working on proteomics data and already filtered out proteins having many missing values. However, some missing values will remain in the data. I then use limma to fit a linear model and wanted to ask how limma is treating these missing values. From the internet, I cannot find a clear answer to this question.
Best regards, Lis
For proteome experiments missing values are rather common. You should read up on imputation methods that fit your type of experiment.
You should re-read what Gordon said. There is nothing in his response that should lead you to believe that a protein with missing data is completely removed.
limma removes NA values. It does not remove non-NA values. A protein with at least one non-NA value in at least two groups will receive a non-NA p-value.
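To make that rule concrete, here is a minimal base R sketch that counts non-missing values per protein within each group. The matrix y, the group labels, and all values are made up for illustration; only the "at least one non-NA value in at least two groups" criterion comes from the reply above.

```r
# Toy data: 4 proteins (rows) x 6 persons (columns); all values are invented.
y <- matrix(
  c(1.2, 1.5, 1.1, 2.0, 2.3, 2.1,
    NA,  NA,  0.9, 1.8, NA,  NA,
    NA,  NA,  NA,  1.4, 1.6, NA,
    NA,  NA,  NA,  NA,  NA,  NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("protein", 1:4), paste0("person", 1:6))
)
group <- factor(c("control", "control", "control", "case", "case", "case"))

# Number of non-missing values per protein within each group
n_by_group <- sapply(levels(group), function(g) {
  rowSums(!is.na(y[, group == g, drop = FALSE]))
})

# A protein meets the criterion if at least two groups have at least one value
groups_with_data <- rowSums(n_by_group > 0)
testable <- groups_with_data >= 2

print(n_by_group)
print(testable)
```

In this toy matrix, protein3 has values in only one group and protein4 has none at all, so neither meets the criterion.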
Thank you for this discussion. I am a Stata user and new to R and limma and struggling to understand, in part because I am adopting syntax that other people developed and made available in a publication, without much experience/understanding of R and limma.
I understand from Gordon's post more than 10 years ago that lmFit should report how many observations(?) have been removed due to missing data, but the syntax I'm using does not seem to show this, and I'm struggling to work out how to alter it to find out.
My data comprise a few hundred unique individuals (rows) and several proteins (columns). In this situation, what is the 'group' when you say "A protein with at least one non-NA value in at least two groups will receive a non-NA p-value"?
I am using the following syntax. Would it be possible to alter it somewhere to obtain the number of observations dropped from the analysis due to missingness? The output is shown after the syntax, and it does show the number of observations (underlined, 0=491, 1=44), but that number includes observations with missing data, so it is not what I would like to see.
Your groups are cases and controls.
In R, to identify NA values in a matrix y, use is.na(y). To count the number of NA observations, use sum(is.na(y)). That is just basic R, not specific to limma.
Thank you, Gordon, for your reply.
I used sum(is.na(lmfit)) after lmFit, but it somehow indicated that there were no NA observations, even when I set all proteins to missing for a number of observations. I tried to use nobs(), but it does not seem to work after lmFit.
As a reminder, what I want to know is how many observations were used in the estimation by lmFit. Either the number of removed observations or the number of observations used in the estimation would be fine. Is there something wrong in what I did?
Also, using the data below with 5 proteins, have I correctly understood that lmFit: 1) removes persons 1 and 2 from the estimation because they had missing data in all five proteins, but 2) uses person 3 in the estimation because s/he had values in some proteins?
Thank you so much for your time for my questions.
limma uses all the observations. There's nothing complicated about it.
limma does not remove any persons from the analysis.
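As a rough illustration, the sketch below builds a toy matrix with 5 proteins in which persons 1 and 2 are missing for every protein and person 3 is missing for some, then counts the non-missing values per protein on the data matrix itself. All names and values are hypothetical. The optional limma step assumes that the residual degrees of freedom stored in the fit reflect only the non-missing values for each protein, so treat that part as a cross-check under that assumption rather than a documented feature.

```r
# Toy data: 5 proteins (rows) x 8 persons (columns); values are simulated.
set.seed(1)
y <- matrix(rnorm(40), nrow = 5,
            dimnames = list(paste0("protein", 1:5), paste0("person", 1:8)))
y[, 1:2] <- NA    # persons 1 and 2: missing for all five proteins
y[1:2, 3] <- NA   # person 3: missing for proteins 1 and 2 only

# Counts taken from the data matrix, not from the fitted model object
n_missing <- rowSums(is.na(y))   # missing values per protein
n_used <- rowSums(!is.na(y))     # values available to the fit per protein

group <- factor(rep(c("control", "case"), each = 4),
                levels = c("control", "case"))
design <- model.matrix(~ group)

# Optional cross-check, run only if limma is installed.
# Assumption: observations used = residual df + number of estimated coefficients.
if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::lmFit(y, design)
  n_from_fit <- fit$df.residual + rowSums(!is.na(fit$coefficients))
  print(cbind(n_used, n_from_fit))
}
```

Persons 1 and 2 contribute nothing because they have no values to contribute, while person 3 still contributes to the proteins for which a value exists.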
I did not advise you to apply is.na() to a fitted model object. I advised you to apply it to your data matrix.
Your R code is problematic. You don't seem to have created the expression matrix correctly in the first place. The code t(dat[43:ncoldat)] that you have given both here and in your previous question is not syntactically correct and could not possibly run in R. I suggest that you check the expression matrix properly before worrying about what limma is doing.
In future, if you have a question about limma, please ask a new question of your own. Adding comments to another person's question from long ago isn't helpful.