Hi all,
I would like to use DESeq2 for the analysis of a pooled shRNA screen to identify mutation specific lethals. The design of the experiment is a number of mutant cell lines and a number of wildtype cell lines. For every cell line there is a t0 measurement and t1 measurement. Interesting hairpins/genes are those that give significant more killing in the mutant cell lines, compare to the wt cell lines. Because each cell line has it's own t0, the data need to be paired wise analysed.
I tried to do it with DESeq2 in the following design:
formula(~ line + time + condition)
A simple example: d<-data.frame(w_l1_t0=c(4,8),w_l2_t0=c(5,9),w_l1_t1=c(4,8),w_l2_t1=c(5,9),m_l3_t0=c(4,8),m_l4_t0=c(5,9),m_l3_t1=c(1,2), m_l4_t1=c(1,2)) row.names(d) <- c("hp1","hp2") countData <- as.matrix(d) colData = data.frame(condition=c("w","w","w","w","m","m","m","m"), time=c("0","0","1","1","0","0","1","1"), line=c("1","2","1","2","3","4","3","4")) dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ line + time + condition)
However I then get the error:
Error in DESeqDataSet(se, design = design, ignoreRank) :
the model matrix is not full rank, so the model cannot be fit as specified.
one or more variables or interaction terms in the design formula
are linear combinations of the others and must be removed
I assume this is because cell line and condition are linked. Is there a way to do the above analysis in DESeq2? I don't know the exact meaning of the parameter ignoreRank, but would setting it to TRUE be an option?
Thank you in advance for your reply!
Best regards,
Cor Lieftink
P.s. I thought it was solved, after using an extended dataset with 15 mt and 15 wt cell lines the error was gone. However I had there numbered the lines within the subset of wildtype and mutant, which should be (I think) continuous over the two sets. The error than re-emerge.
Hello Michael,
Thank you for your quick response!
Best regards,
Cor
Hi Michael,
Sorry to bother you again. Your solution works fine when I have equal number of mutant and wildtype cells. However if not, I get again the error "the model matrix is not full rank". Do you know of any alternative design to deal with that as well?
Best regards,
Cor
hi Cor,
Can you post the as.data.frame(colData(dds)), including the nested.line column, and the design you are using, for an example when you are not full rank?
Hi Michael,
It runs fine with 3 replicates in each status condition, however adding a fourth in only the 0 (=wildtype) condition gives the error not full rank.
Design used:
Best regards,
Cor
I see now what is causing the problem, which is something i've been working to make things a bit easier in the development branch of DESeq2. Basically, R's model.matrix() function is great when it helps the user easily express experiment designs, but annoying when it breaks down. This is why i've added support for users to tweak the model matrix and give this to the DESeq() function.
What's happening is the extra level of rep=4 creates a column of all zeros in the model matrix, because model.matrix() function tries to make the interaction of rep=4 and status=1.
There are two solutions:
1) use DESeq2 version >= 1.7, which is in the development branch, so would require getting the development version of R/Bioconductor. Here, I've allowed the user to supply custom a model matrix to DESeq(). You would form the model.matrix yourself:
And then remove the column which has all zeros, and provide this modified 'mm' to DESeq():
2) or add variables accounting for the cell line effects to the colData(dds) manually. For the above colData, you would need to add some variables like so:
You'd need to do this for status0rep2, status0rep3, status0rep4, status1rep2, status1rep3. And then add all of these to the design: ~status + time + status0rep2 + ... + status:time
You can ignore a message about the design including numeric variables, if that comes up.
Hi Michael,
Thanks again for your quick response! I downloaded the development version, and was now able to process the data also in case of different number of replicates for the two status conditions!
Best regards,
Cor
Hi Michael,
Yesterday I ran an analysis of 15 mutant cell lines against 226 wildtype/unknown cell lines for 1200 hairpins. It took 8 hours on my home computer, but I have the results! :-) Thanks again.
Cor
If possible could you send the dataset to me (maintainer("DESeq2")). That's slower than I've seen before.
For a 2 group comparison with simulated data, 250 samples takes 40 seconds on my laptop:
> dds = makeExampleDESeqDataSet(m=250,n=1200)
> system.time({ dds = DESeq(dds) })
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
user system elapsed
39.901 0.036 39.903
hi Cor,
I'm continuing our conversation on the site, but won't mention any specifics.
I was confused about why it would take so long, but I should have paid attention in your comment that you want to account for 100s of cell lines in the model with terms for each. With 100s of cell lines, the model matrix is growing to have many hundreds of columns. And the GLM solution requires iterations to converge to the solution, so this involves large matrix multiplications.
I would actually recommend using voom and limma if you have 100s of columns in the model matrix. This would speed things up a lot, because the solution in the linear case can be easily calculated, even when the model matrix has 100s of columns . You can also directly use normalized CPM (counts per million) directly with voom and limma, without having to try to convert back to raw counts.
Hi Michael,
Thanks a lot!
Best regards,
Cor