Question

Merging Methylation data (M-values) with phenotypic data

hayleyw • 0

@43494c06

Last seen 7 months ago

Canada

Hi Members,

I am relatively new to R and am having an issue with lme4. I want to model my methylation data (M-values) against my phenotypic dataset. However, when I run my code I get an error about differing variable lengths. I believe I need to combine these datasets into one, but I am lost as to how to do this and could really use some help. I have provided additional information below.

Original lme4 code:

binTP.lme= lmer(adj.m ~ pheno$condition + pheno$Sex + pheno$Age.at.collection + pheno$smoke + pheno$EUR + pheno$AFR + (1|pheno$pairing))

Error message: Error in model.frame.default(drop.unused.levels = TRUE, formula = adj.m ~ : variable lengths differ (found for 'pheno$condition')

adj.m dimensions = 586229 46
pheno dimensions = 46 15

Thanks in advance!

MethylationArrayData methylationArrayAnalysis • 1.1k views

ADD COMMENT • link updated 20 months ago by James W. MacDonald 68k • written 20 months ago by hayleyw • 0

score 0 · Answer 1 · 2023-08-22

This isn't the place to be asking questions about the lme4 package, as that's a CRAN package, not Bioconductor. However, it's an invaluable skill to be able to read error messages and diagnose your issue from there. So let's see what it says and try to decipher it. You got an error message that said this:

Error message: Error in model.frame.default(drop.unused.levels = TRUE, formula = adj.m ~ : variable lengths differ (found for 'pheno$condition')

adj.m dimensions = 586229 46
pheno dimensions = 46 15

It says variable lengths differ and then provides the dimensions of the two data sets. One has over 500k rows and 46 columns and the other has 46 rows and 15 columns. You have the same number of columns in adj.m as you do rows in pheno. And the error says your variable lengths differ. What does that imply?

Also, when you are fitting your model, on the right hand side you are using individual columns from your pheno object, yet on the left hand side you are using an entire matrix that has over 500K rows! Does that seem like a thing that could possibly work? Is it more likely that you have to fit each CpG value individually?
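To make the mismatch concrete, here is a minimal sketch using simulated stand-ins for adj.m and pheno (10 CpGs rather than 586229; the values and the contents of the condition column are made up for illustration). It only compares lengths: the whole matrix has far more values than pheno has rows, while a single CpG row has exactly one value per sample.

```r
# Simulated stand-ins for the objects in the question (assumption: CpGs in
# rows, samples in columns, one pheno row per sample, in the same order).
set.seed(1)
adj.m <- matrix(rnorm(10 * 46), nrow = 10, ncol = 46)
pheno <- data.frame(condition = rep(c("control", "case"), 23))

length(adj.m)    # 460: every value in the matrix counts as one "variable"
nrow(pheno)      # 46: one value per sample

# A single CpG (one row) has one value per sample, so the lengths agree.
one_cpg <- adj.m[1, ]
length(one_cpg) == nrow(pheno)    # TRUE
```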

And getting back to Bioconductor, I would imagine you used minfi to preprocess these data. That package also includes facilities to make comparisons (by CpG as well as by genomic region). Is there a particular reason to ignore that and use lme4 instead? You are just fitting a random intercept for each subject (which I infer from the 'pairing' column in your pheno object). If you have complete observations for all subjects you can simply block on subject instead of fitting a random intercept. Or if you don't have complete observations, you could use the limma package to fit the model using generalized least squares.
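As a rough illustration of the two limma options just described, the sketch below uses simulated data with 23 pairs and 50 CpGs. The column names condition and pairing mirror the question; the factor levels, effect sizes, and reduced model (condition only, without the other covariates) are assumptions. The second option relies on duplicateCorrelation, which needs the statmod package to be installed.

```r
library(limma)

# Simulated stand-ins: 23 pairs, two samples per pair, 50 CpGs.
set.seed(1)
n_pairs <- 23
pheno <- data.frame(
  condition = factor(rep(c("control", "case"), times = n_pairs),
                     levels = c("control", "case")),
  pairing   = factor(rep(seq_len(n_pairs), each = 2))
)
n_cpg <- 50
# Each pair gets its own CpG-specific offset, shared by both of its samples.
pair_effect <- matrix(rnorm(n_cpg * n_pairs), nrow = n_cpg)[, as.integer(pheno$pairing)]
adj.m <- pair_effect + matrix(rnorm(n_cpg * nrow(pheno), sd = 0.5), nrow = n_cpg)
rownames(adj.m) <- paste0("cg", seq_len(n_cpg))

# Option 1: complete pairs, so block on pairing as a fixed effect.
design_fixed <- model.matrix(~ pairing + condition, data = pheno)
fit_fixed <- eBayes(lmFit(adj.m, design_fixed))
tt_fixed <- topTable(fit_fixed, coef = "conditioncase", number = Inf)

# Option 2: treat pairing as a random block; lmFit then fits by generalized
# least squares using the estimated within-pair correlation.
design <- model.matrix(~ condition, data = pheno)
corfit <- duplicateCorrelation(adj.m, design, block = pheno$pairing)
fit_gls <- eBayes(lmFit(adj.m, design, block = pheno$pairing,
                        correlation = corfit$consensus.correlation))
tt_gls <- topTable(fit_gls, coef = "conditioncase", number = Inf)
```

Both calls take the full CpG-by-sample matrix directly, with no transposing or looping.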

score 0 · Answer 2 · 2023-08-22

Thanks for your reply. I apologize that I posted this question in the incorrect forum.

I have used limma to perform linear mixed regression analyses, using individual ID as a blocking variable for this data. However, my supervisor and I do not agree with some of the assumptions that limma makes with the data, which is why I am trying to use lme4.

I realize the M-value matrix is very large, but I didn't think it would be a problem since I also used this same matrix as the response variable in my limma analyses. The columns in the M-value matrix are the individual IDs, which is why their number matches the number of rows in the phenotype dataset. The dimensions of the datasets were not provided by the error message; I added them for more context (sorry, I should have been clearer about that). I wasn't sure if they had anything to do with the error message. I have tried transposing the M-value matrix to see if that would eliminate the error message; however, the data seems to disappear when I transpose the dataset.

ADD COMMENT • link 20 months ago hayleyw • 0

When replying to an answer, please use the ADD COMMENT button, not the ADD ANSWER button (you are not adding an answer after all!).

You can provide limma the entire matrix of M-values because it is natively designed to deal with high-throughput data. On the other hand, lme4 is not meant to do that, so you have to make adjustments for that fact. In other words, you have to transpose the data and then feed one column at a time to lmer. Or you could not transpose and feed one row at a time. Your choice. You could also use the variancePartition package, which uses lmer under the hood and is designed around the fact that high-throughput data are normally transposed relative to the layout used in conventional statistics.
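A minimal sketch of the "one row at a time" approach, assuming lme4 is installed and using simulated stand-ins for adj.m and pheno (5 CpGs, 23 pairs). The helper fit_one_cpg is hypothetical, and the model is cut down to condition plus a random intercept for pairing; the other covariates from the question would be added to the formula in the same way, by bare column name with data = supplied rather than with pheno$.

```r
library(lme4)

# Simulated stand-ins: 23 pairs, two samples per pair, 5 CpGs.
set.seed(1)
n_pairs <- 23
pheno <- data.frame(
  condition = factor(rep(c("control", "case"), times = n_pairs),
                     levels = c("control", "case")),
  pairing   = factor(rep(seq_len(n_pairs), each = 2))
)
n_cpg <- 5
pair_effect <- rnorm(n_pairs)[as.integer(pheno$pairing)]
adj.m <- matrix(rnorm(n_cpg * nrow(pheno)), nrow = n_cpg,
                dimnames = list(paste0("cg", seq_len(n_cpg)), NULL))
adj.m <- sweep(adj.m, 2, pair_effect, "+")  # add the pair offset to each sample

# Columns of adj.m must line up with rows of pheno (same samples, same order).
stopifnot(ncol(adj.m) == nrow(pheno))

# Hypothetical helper: fit the mixed model for one CpG and return the
# estimate, standard error and t value for the condition effect.
fit_one_cpg <- function(m_values, pheno) {
  dat <- cbind(pheno, m = m_values)
  fit <- lmer(m ~ condition + (1 | pairing), data = dat)
  coef(summary(fit))["conditioncase", ]
}

# One model per row (CpG); the result has one row per CpG.
results <- t(apply(adj.m, 1, fit_one_cpg, pheno = pheno))
```

With several hundred thousand CpGs this loop will likely take a long time, so it may be worth trying it on a small subset of rows first.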

I don't know what 'the data seems to disappear when I transpose the dataset' means. In my experience that's not a thing.
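For reference, a small sketch of what t() does to a numeric matrix, using a made-up 6 x 4 example with hypothetical CpG and sample names. The dimensions swap and every value is retained. The last two lines show one thing that may be worth checking if the result looks wrong (a guess, not a diagnosis of the poster's problem): calling t() on a data.frame that contains a non-numeric column returns a character matrix.

```r
# Made-up example: 6 CpGs (rows) by 4 samples (columns).
set.seed(1)
adj.m <- matrix(rnorm(6 * 4), nrow = 6,
                dimnames = list(paste0("cg", 1:6), paste0("S", 1:4)))

adj.t <- t(adj.m)
dim(adj.m)    # 6 4
dim(adj.t)    # 4 6

# Values are unchanged, only their position is swapped.
same_value <- identical(adj.t["S2", "cg3"], adj.m["cg3", "S2"])
round_trip <- identical(t(adj.t), adj.m)

# Possible pitfall: a data.frame with a non-numeric column becomes a
# character matrix when transposed.
df <- data.frame(id = c("a", "b"), x = c(1, 2))
became_character <- is.character(t(df))
```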

I am interested in what assumptions for the limma package you and your supervisor disagree with. The limma package is the preeminent package for analyzing high-throughput data and is the top non-infrastructure package in all of Bioconductor, so you have a decidedly non-consensus viewpoint on the subject.

ADD REPLY • link 20 months ago James W. MacDonald 68k