Problem with batches and "Model matrix not full rank" during differential gene expression analysis with DESeq2
alallo • 0
@alallo-21363
Last seen 4.4 years ago

Hi,

I am trying to create a Shiny app that lets my lab members access and analyse the RNA-seq data from the patient-derived tumour samples we have in our group. One of the options in the app is to perform differential gene expression analysis with DESeq2. The app lets the user define two groups (Group 1 and Group 2) by selecting two or more samples from our RNA-seq data. It then automatically generates the DESeqDataSet and runs DESeq on the two groups to produce a results table.

Because the samples were collected and sequenced at different times over the past 5-6 years, I have included a batch variable in the design formula, as below:

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ batch + group)

For batch I used the sequencing run. Here is an example of the metadata:

   batch sample group
1   CD17    X32     2
2   CD17    X32     2
3   CD17    X32     2
4   CD19    X33     2
5   CD19    X33     2
6   CD19    X33     2
7    CD7    X08     1
8    CD7    X08     1
9    CD7    X08     1
10   CD7    X11     1
11   CD7    X11     1
12   CD7    X11     1

However, when I compare samples that were sequenced in different runs (as above), I get this error:

Error in checkFullRank(modelMatrix) : 
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

Is there any way to account for the batch effect while avoiding this error?

I know that the design of the experiment is not ideal: most samples were sequenced on one specific day and do not appear in earlier or later sequencing runs, and this is probably what causes the error. However, because this is a lot of data (42 patient-derived samples in biological replicates, for a total of 152 sequenced samples), I was wondering if there is any way to fix this issue without having to re-sequence all of them...
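The suspected cause can be checked on the example metadata with base R alone. The following is a minimal sketch, assuming the twelve rows shown above are the full selection; it rebuilds the model matrix for `~ batch + group` and compares its rank with its number of columns. The object names `metadata` and `mm` are illustrative only.

```r
# Example metadata copied from the question: 12 samples, 3 runs, 2 groups.
metadata <- data.frame(
  batch  = factor(rep(c("CD17", "CD19", "CD7"), times = c(3, 3, 6))),
  sample = rep(c("X32", "X33", "X08", "X11"), each = 3),
  group  = factor(rep(c(2, 1), each = 6))
)

# Each run contains samples from only one group, so batch determines group.
print(table(batch = metadata$batch, group = metadata$group))

# Model matrix for the design used in the question.
mm <- model.matrix(~ batch + group, data = metadata)

# Four coefficients are requested, but only three are estimable:
# the group2 column equals (Intercept) minus batchCD7.
n_coef <- ncol(mm)
mm_rank <- qr(mm)$rank
cat("columns:", n_coef, " rank:", mm_rank, "\n")
```

Whenever the rank is smaller than the number of columns, the group effect cannot be separated from the batch effect for that selection of samples, which is the situation the error message describes.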

deseq2 • 672 views
@mikelove
Last seen 7 hours ago
United States

The error message actually points you to the vignette section that covers this topic. Have you read it?


Yes, I did. Maybe I have misunderstood, but from the vignette it seems that there is no way around it...

swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 2 hours ago
San Diego

You want to compare samples that might have been prepped years apart? That doesn't sound wise.

If batch is confounded with sample type, any change that looks interesting might be entirely due to batch. The best thing to do is to drop batch from the design and warn users that anything they see is very suspect and may well be a batch-related artifact.
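In a Shiny app, that advice could be applied per selection of samples. The sketch below is one possible approach, not a DESeq2 feature: `choose_design()` is a hypothetical helper that keeps `~ batch + group` when the model matrix is full rank and otherwise falls back to `~ group` with a warning. It assumes the metadata has `batch` and `group` columns as in the question.

```r
# Hypothetical helper (not part of DESeq2): pick a design for the selected samples.
choose_design <- function(metadata) {
  # Re-create the factors so levels absent from the selection are dropped.
  metadata$batch <- factor(metadata$batch)
  metadata$group <- factor(metadata$group)

  # With a single batch there is nothing to adjust for.
  if (nlevels(metadata$batch) < 2) {
    return(list(design = ~ group, batch_adjusted = FALSE))
  }

  mm <- model.matrix(~ batch + group, data = metadata)
  if (qr(mm)$rank == ncol(mm)) {
    return(list(design = ~ batch + group, batch_adjusted = TRUE))
  }

  # Batch and group are confounded: drop batch and tell the user.
  warning("batch is confounded with group; results may reflect batch effects")
  list(design = ~ group, batch_adjusted = FALSE)
}

# Confounded selection, as in the question: each run holds a single group.
confounded <- data.frame(
  batch = rep(c("CD17", "CD19", "CD7"), times = c(3, 3, 6)),
  group = rep(c(2, 1), each = 6)
)
res_confounded <- suppressWarnings(choose_design(confounded))

# Crossed selection (made-up labels): both groups appear in both runs.
crossed <- data.frame(
  batch = rep(c("A", "B"), each = 4),
  group = rep(c(1, 2), times = 4)
)
res_crossed <- choose_design(crossed)

print(res_confounded$design)
print(res_crossed$design)
```

The returned `design` could then be passed as the `design` argument of `DESeqDataSetFromMatrix`, and `batch_adjusted` could drive a visible caveat in the app whenever the comparison is not adjusted for batch.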


We started generating these patient-derived models in 2014. We began by sequencing the first ones that were generated, but over time we have generated more, and they were sequenced later. I think this is a problem for any lab that generates models: you sequence your first models and publish them, then a few years later you have a larger biobank of models and you sequence the new ones...

I may have to do as you suggested and remove the batch term from the formula. I was just hoping this was not necessary. I wonder how people manage when they compare large datasets of patient samples collected and sequenced by different labs. Do they just assume that there will be a batch effect and accept it?
