Is there any chance to solve "the model matrix is not full rank" in DESeq2?
2
0
Entering edit mode
Sep • 0
@06de5a1f
Last seen 16 months ago
Germany

Is there any chance to solve "the model matrix is not full rank" in DESeq2?

My design matrix is a count matrix, one row per gene and one column per sample,

The sample information (colData in dds) is a dataframe with two column, a column for condition containing pre and post treatment and column containing subject Id.

I am asking this question with regards to do paired samples

The problem is happened when

dds_1 <- DESeqDataSetFromMatrix(countData = df, colData = sample_info, design = ~ subject + condition)

This is the error I get:

Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. Levels or combinations of levels without any samples have resulted in column(s) of zeros in the model matrix.

I read the vignette section 'Model matrix not full rank': but I could not find any solution for the problem.

Is there any solution for this problem?

DESeq2 • 3.9k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 19 hours ago
United States

You have two pieces of information. The subject and the condition/time. As Basti already noted, condition and time are identical, and you can only use one (e.g., pre and before are the same thing and post and after are the same thing as well). What you are trying to do is determine peptides that are differentially abundant at the two time points, adjusting for subject-specific differences.

As an aside, if you have m/z counts, then using DESeq2 is OK, but if you actually have something like normalized area under an integrated peak, then you should instead use limma, most likely limma-trend. Analyzing proteomics data is non-trivial, and it's pretty easy to do something really suboptimal. Anyway...

You need a Sample info data.frame that looks more like this:

> head(Sample_info, 10)
   Subject Condition
1        1       pre
2        2       pre
3        1      post
4        2      post
5        3       pre
6        4       pre
7        3      post
8        4      post
9        5       pre
10       6       pre

And now this will create a full rank model matrix

> library(limma)
> design <- model.matrix(~Subject + Condition, Sample_info)
> is.fullrank(design)
[1] TRUE

And the final column of that design matrix will test for the difference between pre and post after adjusting for subject.

ADD COMMENT
0
Entering edit mode

Thanks a billion for all of your kind help and time.

ADD REPLY
0
Entering edit mode
Basti ▴ 780
@7d45153c
Last seen 26 days ago
France

You need to update your post by giving an example of your design matrix

ADD COMMENT
0
Entering edit mode

My design matrix is a count matrix, one row per gene and one column per sample,

The sample information (colData in dds) is a dataframe with two column, a column for condition containing pre and post treatment and column containing subject Id.

I am asking this question with regards to do paired samples The problem is happened when

dds_1 <- DESeqDataSetFromMatrix(countData = df, colData = sample_info, design = ~ subject + condition)

ADD REPLY
0
Entering edit mode

Also, there is a whole section devoted to fixing a model matrix that has columns of all zeros, which for sure will fix your problem. You need to show what you have done to try to fix the issue.

ADD REPLY
0
Entering edit mode

Thank you for your prompt reply but even after removing that, the problem was there still.

ADD REPLY
0
Entering edit mode

Both Basti and I have asked you to show your design matrix and show what you have done. You refuse to do that, and instead you say you did stuff and it didn't work. We already know you did stuff and it didn't work, or you wouldn't have posted here in the first place. We need to know exactly what you did, and exactly what your design matrix looks like, and the only way to know that is if you provide code and your design matrix. If you want help, then please provide what we ask for.

ADD REPLY
0
Entering edit mode

My design matrix is with 100 samples ( 50 samples before treatment and the same 50 samples after treatment and with around 3 million peptide

bigdf_t = transpose(tinydf)

colnames(bigdf_t) <- bigdf_t[1, ] rownames(bigdf_t) = colnames(tinydf) bigdf_t <- bigdf_t[-1,] bigdf_t <- bigdf_t[!(row.names(bigdf_t) %in% c("pre_post")),]

pep_all_zeros = colnames(tinydf[ , colSums(tinydf==0) == 100]) remove rows with all zero values bigdf_t <- bigdf_t[!(row.names(bigdf_t) %in% pep_all_zeros),]

colnames(bigdf_t)[1:50] <- paste("pre", colnames(bigdf_t[1:50]), sep = "_") colnames(bigdf_t)[51:100] <- paste("post", colnames(bigdf_t[51:100]), sep = "_")

sample_info = data.frame(matrix(nrow = ncol(bigdf_t), ncol = 2)) rownames(sample_info) = colnames(bigdf_t) names(sample_info)1 <- "condition" names(sample_info)2 <- "time"

sample_info$condition[sample_info$condition[1:50] == ''] <- 'Pre'

sample_info[1:50,1] <- replace(sample_info[1:50,1], is.na(sample_info[1:50,1]), "Pre") sample_info[51:100,1] <- replace(sample_info[51:100,1], is.na(sample_info[51:100,1]), "Post") sample_info[1:50,2] <- replace(sample_info[1:50,2], is.na(sample_info[1:50,2]), "before") sample_info[51:100,2] <- replace(sample_info[51:100,2], is.na(sample_info[51:100,2]), "after")

enter image description here

all(rownames(sample_info) == colnames(bigdf_t))

chars <- sapply(bigdf_t, is.character)

bigdf_t[ , chars] <- as.data.frame(apply(bigdf_t[ , chars], 2, as.numeric)) bigdf_t = round((10000 * bigdf_t)) # DSeq2 only accept integer and the data is read per million (multiply by 1000A0 as R can not handle big integer)

enter image description here

DSeq2 analysis

dds <- DESeqDataSetFromMatrix(countData = bigdf_t, colData = sample_info, design = ~ time + condition) (Here gives me the full rank error)

Hope I could provide you your needed information.

Would you please help me in this regard?

ADD REPLY
1
Entering edit mode

We just need to know the sample_info dataset to infer the design matrix, but from the small example of sample_info you give here, I see no difference between condition and time variable, they are the same the variable with just different names

ADD REPLY
0
Entering edit mode

Thanks a lot for your reply.

But what else can I have there in the column "time"? It is the only information that I have from the dataset. Even if I consider the subject id, still the problem is there

If I want to analyze this data which are coming from the paired sample, what should I do?

I would highly appreciate if you could help me in this regard?

ADD REPLY

Login before adding your answer.

Traffic: 694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6