I am using DESeq2 on a pretty large dataset. I have mock and infection conditions for several mouse lines at 3 time points. My goal is to contrast each line at each timepoint with it's corresponding mock at the same timepoint. Previously I split the data into smaller subsets by line, but now I'm trying to use the entire data set at once to get the benefits of more data.
I have the following DESeq object:
class: DESeqDataSet
dim: 50686 116
metadata(1): version
assays(1): counts
rownames(50686): ENSMUSG00000099985 ENSMUSG00000030105 ... ENSMUSG00000053774 ENSMUSG00000030058
rowData names(0):
colnames(116): RIX007 RIX008 ... RIX233 RIX234
colData names(20): ID Tissue ... Replicate sizeFactor
My design is:
~ Infection_Line_Time
My plan was to make a composite field, infection_line_time, and then relevel and rerun DESeq(dds) for each condition. For example, to get the results for mock_day1_lineA, I do the following.
dds$Infection_Line_Time <- relevel(dds$Infection_Line_Time, "mock_day1_16012X15119")
dds <- DESeq(dds)
Also the Infection_Line_Time looks like this:
> dds$Infection_Line_Time
[1] mock_day1_16012X15119 mock_day1_16012X15119 mock_day3_16012X15119 mock_day3_16012X15119 mock_day5_16012X15119 mock_day5_16012X15119
[7] ebola_day1_16012X15119 ebola_day1_16012X15119 ebola_day3_16012X15119 ebola_day3_16012X15119 ebola_day5_16012X15119 ebola_day5_16012X15119
[13] mock_day1_16513X16188 mock_day1_16513X16188 mock_day3_16513X16188 mock_day3_16513X16188 mock_day5_16513X16188 mock_day5_16513X16188
[19] mock_day5_16513X16188 ebola_day1_16513X16188 ebola_day3_16513X16188 ebola_day5_16513X16188 ebola_day3_16513X16188 ebola_day5_16513X16188
[25] ebola_day5_16513X16188 ebola_day5_16513X16188 mock_day1_8034X8048 mock_day1_8034X8048 mock_day1_8034X8048 mock_day3_8034X8048
[31] mock_day3_8034X8048 mock_day3_8034X8048 mock_day5_8034X8048 mock_day5_8034X8048 mock_day5_8034X8048 ebola_day1_8034X8048
[37] ebola_day1_8034X8048 ebola_day1_8034X8048 ebola_day3_8034X8048 ebola_day3_8034X8048 ebola_day3_8034X8048 ebola_day5_8034X8048
[43] ebola_day5_8034X8048 ebola_day5_8034X8048 ebola_day1_16072X3393 ebola_day1_16072X3393 ebola_day1_16072X3393 ebola_day3_16072X3393
[49] ebola_day3_16072X3393 ebola_day3_16072X3393 ebola_day5_16072X3393 ebola_day5_16072X3393 ebola_day5_16072X3393 ebola_day5_16072X3393
[55] mock_day1_16072X3393 mock_day1_16072X3393 mock_day1_16072X3393 mock_day3_16072X3393 mock_day3_16072X3393 mock_day3_16072X3393
[61] mock_day5_16072X3393 mock_day5_16072X3393 mock_day1_CC041 mock_day3_CC041 mock_day3_CC041 mock_day5_CC041
[67] mock_day5_CC041 mock_day5_CC041 mock_day5_CC041 ebola_day1_CC041 ebola_day1_CC041 ebola_day1_CC041
[73] ebola_day3_CC041 ebola_day3_CC041 ebola_day3_CC041 mock_day1_CC041 mock_day1_CC041 mock_day1_CC041
[79] mock_day3_CC041 mock_day3_CC041 mock_day3_CC041 mock_day5_CC041 mock_day5_CC041 mock_day5_CC041
[85] ebola_day1_CC041 ebola_day1_CC041 ebola_day1_CC041 ebola_day1_CC041 ebola_day1_CC041 ebola_day3_CC041
[91] ebola_day3_CC041 ebola_day3_CC041 ebola_day3_CC041 ebola_day3_CC041 ebola_day5_CC041 ebola_day5_CC041
[97] ebola_day5_CC041 ebola_day5_CC041 ebola_day5_CC041 mock_day1_CC011 mock_day3_CC011 mock_day3_CC011
[103] mock_day5_CC011 mock_day5_CC011 ebola_day1_CC011 ebola_day1_CC011 ebola_day3_CC011 ebola_day3_CC011
[109] ebola_day5_CC011 ebola_day5_CC011 ebola_day1_CC011 ebola_day3_CC011 ebola_day5_CC011 ebola_day1_CC011
[115] ebola_day3_CC011 ebola_day5_CC011
36 Levels: mock_day1_16012X15119 ebola_day1_16012X15119 ebola_day1_16072X3393 ebola_day1_16513X16188 ebola_day1_8034X8048 ... mock_day5_CC041 |
When I attempt to run the command,
dds <- DESeq(dds)
it looks normal until the "fitting model and testing" phase, then throws the following error:
error: matrix multiplication: incompatible matrix dimensions: 116x36 and 0x1
Error in fitBeta(ySEXP = ySEXP, xSEXP = xSEXP, nfSEXP = nfSEXP, alpha_hatSEXP = alpha_hatSEXP, :
matrix multiplication: incompatible matrix dimensions: 116x36 and 0x1 |
I've never seen this before and I can't find anyone else who has encountered this problem online. I'm actively working on resolving it, but could use some help if anyone has any input or advice.
My theories are as of now that it's a memory issue, or that when I'm comparing different lines to mock_day1_lineA, that it isn't happy for some reason. Like, for instance, ebola_day3_lineC vs mock_day1_lineA will be one possibility in the resultsNames(dds). It's a comparison that makes no sense and I won't use, but it's still being calculated for every permutation. I'm still not sure how to handle it. Is there a better design or a way to do this without resorting to cutting my data into separate data sets?
Thanks in advance for any insight you can provide!
Yes, using version 1.16.
I can't reproduce this error so I'm not sure what to say, e.g. 36 levels, 3 replicates each works fine with simulated data, this takes 30 seconds to fit per 500 genes:
Can you print your sessionInfo() so I can see if maybe there are some package incompatibilities? You can also run biocValid() from the BiocInstaller package to see if you are up to date on all packages.
By the way, you shouldn't be releveling and rerunning DESeq() for comparing groups, you only need to use results() with contrast specifying the two groups to compare.
e.g. for the above example, you only run DESeq() once, then:
Thanks for your reply. The output you requested is too many characters to post here directly, so here is a link to that information.
https://justpaste.it/1c00t
Thanks very much for your help, the point about the releveling is also very useful. I'm going to try to update some packages that are older (though they seem unrelated at first glance). If it doesn't work after that I'll try running it on a high memory machine. Currently the rld(dds) step is taking several hours because of the data size. Thanks again!
if that's rlog(), I highly recommend using vst() instead. It's much faster for large datasets, and this is what I tend to use now for speed.
Update:
It seems like it's getting past this phase using VST and updated R packages. At "estimating dispersions" now, so hopefully everything will go through. I did receive the following warning, but maybe it's okay.
This is ok, it's just that the increases to the likelihood are accumulating slowly. Typically the beta isn't changing much but still producing small changes to likelihood.
Ok great, that's good to know. So assuming the "estimating dispersions" goes through, for future reference for anyone who encounters this problem, try VST instead of rlog, and update your R packages.
My guess is that the rlog might have failed somehow in the larger data set, if I had to guess what the fix was I would say varianceStabilizingTransformation(dds).
Thanks for your help Michael!
Hi, Michael I would like to request how to use vst() functions before estimateDispersions step. When the vst() function or varianceStabilizingTransformation() was used, it return a DESeqTransform. I mention it that DESeqTransform can not been used to difference expression. Thanks very much!
I don't see how this is related to the thread you are adding to.
A few points:
vst() requires dispersion estimation. Either it has already been run and will be used by the function (if blind=FALSE), or it will run dispersion estimation internally. You can run vst() on any DESeqDataSet. Please see the examples in the documentation/vignette/workflow.
I also meet the same error:
error: matrix multiplication: incompatible matrix dimensions: 1176x590 and 0x1 Error in fitBeta(ySEXP = ySEXP, xSEXP = xSEXP, nfSEXP = nfSEXP, alphahatSEXP = alphahatSEXP, : matrix multiplication: incompatible matrix dimensions: 1176x590 and 0x1
I mention it that vst() can only be used for visualization and not used to DE, So I would like to know how to use vst() for DE
You shouldn't use vst() for DE. Why don't you try following our workflow?
https://www.bioconductor.org/packages/release/workflows/html/rnaseqGene.html
And if you encounter problems, please make a new post with your entire code and the error that occurred if you encounter an error.