Question

Subset-ing Large DESeqDataSet for data extraction with contrast function

0

Entering edit mode

l.s.wijaya • 0

@lswijaya-21856

Last seen 5.0 years ago

Hi all,

I am now running Deseq2 analysis. However, I am stuck at extracting the results with results(dds, contrast = ()) after deseq() from the large DESeqDataSet. The problem was, it took me extremely long to only extract one comparison (3 hours/comparison meanwhile, I have 265 comparison to extract). I am planning to make a subset of the dds large matrix (after running the deseq()) since I don't want to lose genes before normalization. However, I always got an error when running results(dds, contrast = ()) using the new subset matrix :

Error in counts(object) %*% whichSamples : non-conformable arguments

Then, I tried to relevel the variable containing the numerator and denumerator for contrast function and it gave me another error :

Error in cleanContrast(object, contrast, expanded = isExpanded, listValues = listValues

It seems like the contrast = could not recognize the numerator and denumerator from the new subset matrix. I also read that actually releveling the factor after running the deseq() isn't allowed as it makes the software confused. https://support.bioconductor.org/p/102317/

If there is a way to do solve this problem, please share with me. Thanks in advance.

deseq2 • 4.1k views

ADD COMMENT • link updated 5.6 years ago by Michael Love 43k • written 5.6 years ago by l.s.wijaya • 0

score 0 · Answer 1 · 2019-09-10

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

How many samples do you have? Also, how many samples per group?

ADD COMMENT • link 5.6 years ago Michael Love 43k

0

Entering edit mode

in total I have 845 samples, each sample has 21111 genes. The samples per group vary depending on the group. Some, I have 6 only, but some I have 116. In total, I have 8 groups so that I plan to subset the matrix into 8 parts.

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

Two things: I personally use limma-voom for datasets with 100s of individuals in my lab, as I've said before on the support site.

Also, the development branch of DESeq2 (released in one month) is 10x faster for datasets with >100 samples.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

That would be great to try the second option. Thanks a lot. However, is there anything worth trying to solve this issue? I mean, like any way to subset the dds large matrix. I can also consider to use the first option.

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

You can subset the dds and run DESeq() and results() for pairs of groups.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

And if you wanted to share the dispersion estimate across all groups, you could do DESeq() on the whole dataset, then subset and run only nbinomWaldTest() and results() on pairs.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

Currently, I am trying to subset the outcome matrix of DESeq(). However, I encountered errors written above, What do you mean running results() on pairs? Perhaps, if I manage to subset the matrix, I can run nbinomWaldTest() and calculate the log 2 FC with results().

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

Yes the second is what I’m recommending.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

By the way, is there any way I can use a function to spread the cores for results(). For instance, like deseq(), I can use BIOPARAM to spread the worker. This would be super helpful if I can do this stuff for results(). I tried to use mclapply for instance, bus still the results() was only performed by one worker, The results() is the one taking longest time.

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

Your first step should be looking up help for functions.

?results

See BPPARAM

Also if you want to keep going with DESeq2, go ahead and use the development branch which is 10x faster for this size of dataset. That’s more than you’ll get with parallelization.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

Since the development branch of DESeq2 will be in 1 month, I will try to use the parallelization. I will keep an eye on the update. Thanks a lot for the helps!!

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

Hi, I have a question. So finally I manage to get my log2FC with your suggestions. However, when I assess the DEGs, I don't get many since my adj-pval is so high (even the p-val) even though the log2FC is around 5 for instance. When I checked the count per replicate, I didn't see much variation. This makes me think that the calculation of p-value with dseq() accounts the variation from the whole data set as the ifcSE value is very high and my data set has non-responding -- high responding compounds. My question is : 1. Is it correct if I subset my data set per compound and analyse it compound by compound? I normalized it with CPM then calculate the log2FC with dseq() function. 2. If I do so, is it fair to compare the expression of the same gene from the different compounds? My concern is because each compound normalized differently and the log2FC is calculated differently, it's not mathematically correct to compare them. Maybe this question is out of topic, but I appreciate your answers. Thanks a lot.

ADD REPLY • link 5.6 years ago l.s.wijaya • 0

0

Entering edit mode

If you want to use a different method, e.g. LFC of CPM data, you should use limma and not DESeq2. You can't insert CPM normalized data into DESeq2.

ADD REPLY • link 5.6 years ago Michael Love 43k

0

Entering edit mode

CPM <- apply(count_Filtered, 2, function(x) (x/sum(x))*1000000)

dds <- DESeqDataSetFromMatrix(countData = count_Filtered,
                              colData = meta_Filtered,
                              design = ~ meanID)

if (NormMethod == "DESeq2"){ 
  dds <- estimateSizeFactors(dds) 
} else if (NormMethod == "CPM"){
  sizeFactors(dds) <- colSums(count_Filtered)/1000000
} else { 
  warning("Incorrect input for 'NormMethod', please double check above!")
}

dds <- DESeq(dds)

So I use this chunk of the script. Before I process the DESeq(dds), I define the sizeFactor(dds) first with CPM normalization. I don't know it this s an appropriate manner to do log2FC calculation.

ADD REPLY • link 5.5 years ago l.s.wijaya • 0

0

Entering edit mode

I do not recommend those size factors that you are using under NormMethod: CPM. Note, that is not what cpm gives you in edgeR/limma either.

ADD REPLY • link 5.5 years ago Michael Love 43k

0

Entering edit mode

Ah, ok. I will have a look further for this. So that my question is again : do you recommend to normalize the dataset per compound + control separately using DESeq? And Is it still thereafter appropriate to compare the log2FC after DESeq() processing? As far as I am aware that DESeq is a bit tricky to be applied for the comparisson from separated dataset.

ADD REPLY • link 5.5 years ago l.s.wijaya • 0

0

Entering edit mode

It’s up to you. In this thread I’ve given some valid options for analysis but the analysis is up to you.