Question

[DESeq2] Automatic outlier detection and replacement with continuous variables

cajawe • 0

@cajawe-12587

Hello,

I am running DESeq2_1.10.1 on a data set with a continuous predictor term (either with or without a categorical blocking term).

My understanding is that in such cases, outlier detection and replacement is not applied automatically, and that it is instead necessary to inspect Cook's distances manually. I base this on section 3.6 of the Nov 30 2016 version of the DESeq2 vignette (www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf).

However, when I run the DESeq function, I see the following message:

fitting model and testing
-- replacing outliers and refitting for 361 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)

My question, then, is what is actually happening here? Am I looking at a copy of the vignette that is out of date? Is my analysis carrying out the outlier replacement procedure even though it's not optimal for continuous predictors? Or am I misinterpreting the message entirely?

Thank you in advance! I can provide more details about my DESeqDataSet object if helpful.
Cameron

deseq2 outliers RNAseq differential gene expression • 3.7k views

ADD COMMENT • link 8.1 years ago cajawe • 0

The covariate is a disease phenotype: the proportion of afflicted individuals per inbred strain, with each RNAseq sample corresponding to a strain. Many samples are zero—in light of your comment, I suppose that this may be the source of the problem. I've pasted the covariate below (after a log(x+c) transformation).

I've also carried out the analysis with the data downgraded to binary (zero vs nonzero). Should I stick with the downgraded data and avoid using the continuous data given its unusual distribution?

Thanks!

-4.094345, -1.78271, -0.2889348, -4.094345, -4.094345, -0.7524469, -1.99021, 
-4.094345, -4.094345, -4.094345, -1.954278, -0.9287555, -4.094345, -4.094345, 
-4.094345, -1.487357, -4.094345, -0.2750636, -4.094345, -4.094345, -3.401197, 
-0.2830146, -4.094345, -4.094345, -4.094345, -2.669336, -4.094345, -4.094345, 
-4.094345, -4.094345, -1.287623, -4.094345, -4.094345, -4.094345, -4.094345, 
-2.744418, -2.148434, -2.148434, -4.094345, -3.220816, -4.094345, -4.094345, 
-4.094345, -2.870569

ADD REPLY • link updated 8.1 years ago by Michael Love 43k • written 8.1 years ago by cajawe • 0

There's not necessarily a problem then. The outlier replacement procedure can run on this dataset, because there is repetition in the continuous values.
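One way to see the repetition referred to here is to tabulate the covariate. The sketch below uses only base R and the values posted above. The threshold of 7 is taken from the 'minReplicatesForReplace' line in the DESeq message; treating "number of samples sharing a value" as the relevant count is an assumption that fits a design containing this covariate alone.

```r
# Covariate values as posted above (after the log(x + c) transformation)
x <- c(-4.094345, -1.78271, -0.2889348, -4.094345, -4.094345, -0.7524469, -1.99021,
       -4.094345, -4.094345, -4.094345, -1.954278, -0.9287555, -4.094345, -4.094345,
       -4.094345, -1.487357, -4.094345, -0.2750636, -4.094345, -4.094345, -3.401197,
       -0.2830146, -4.094345, -4.094345, -4.094345, -2.669336, -4.094345, -4.094345,
       -4.094345, -4.094345, -1.287623, -4.094345, -4.094345, -4.094345, -4.094345,
       -2.744418, -2.148434, -2.148434, -4.094345, -3.220816, -4.094345, -4.094345,
       -4.094345, -2.870569)

# Number of samples sharing each distinct covariate value
reps <- table(x)

# Threshold reported in the DESeq message
min_reps <- 7

# Distinct values that occur in at least 'min_reps' samples
reps[reps >= min_reps]

# TRUE if at least one value is replicated often enough
any(reps >= min_reps)
```

For this covariate, the value -4.094345 (the transformed zeros) is shared by 27 of the 44 samples, which is well above the threshold.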

You may choose to turn it off if you feel it's not helpful, by setting minReplicatesForReplace=Inf.
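As a minimal sketch of that setting, assuming the DESeq2 package is installed: the simulated dataset from makeExampleDESeqDataSet() is only a stand-in and is not the data discussed in this thread.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in data (assumption): 1000 genes, 14 samples in two groups of 7
dds <- makeExampleDESeqDataSet(n = 1000, m = 14)

# Disable outlier replacement: no group of samples can reach an
# infinite replicate count, so no counts are replaced and refit
dds <- DESeq(dds, minReplicatesForReplace = Inf)

# Extract results as usual
res <- results(dds)
```

With replacement switched off, the "replacing outliers and refitting" lines should no longer appear in the DESeq output.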

I wouldn't make modeling choices (continuous vs binary) based on this outlier procedure. It is usually just picking up a number of genes that have all 0's except for one or two samples with technical artifacts.

ADD REPLY • link 8.1 years ago Michael Love 43k

Okay, great—thanks kindly for your help!

ADD REPLY • link 8.1 years ago cajawe • 0

score 0 · Answer 1 · 2017-03-14

Michael Love 43k

@mikelove

Can you show what your continuous covariate looks like? While it's not described in that section, DESeq2 actually looks to see if it can still do outlier replacement if the continuous covariate has replication similar to a categorical covariate.

ADD COMMENT • link 8.1 years ago Michael Love 43k

Dear Michael,

How can someone check whether a continuous variable has replication similar to a categorical variable?

My design is the following: design = ~ Gender + InsulinResistance + cutAge + cutBMI

This is the message I get after I run DESeq:

estimating size factors
estimating dispersions
gene-wise dispersion estimates: 6 workers
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 6 workers
-- replacing outliers and refitting for 6443 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
172 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

and here is the information I get from my results:

summary(resIRvsIS)
out of 37936 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 1351, 3.6%
LFC < 0 (down)   : 886, 2.3%
outliers [1]     : 4535, 12%
low counts [2]   : 331, 0.87%

I have discretized my continuous variables, as you suggested in a different post, and I pre-filter as well. I have checked for sample outliers and cannot see a distinct one. There are 3 samples that "stand out", but since these are human data, variability is expected and I would be very hesitant to remove them from my analysis. The same holds for the boxplot of Cook's distances, which I used to see whether one sample is consistently higher than the others: these 3 samples are a tiny bit higher than the rest, but not enough to convince me that they are outliers. After reading the other suggested approaches, it seems to me that the approach I need to follow is minReplicatesForReplace=Inf and cooksCutoff=FALSE.

I was intrigued by this post, though, and I wanted to understand whether the replication that you mention might happen in my design and, if so, how I can tackle it. Is there something else going on here that I am failing to see?
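For readers following along, the two checks mentioned in this paragraph can be sketched as below, assuming DESeq2 is installed. The simulated dataset is a stand-in for the real one, and the boxplot call follows the pattern shown in the DESeq2 vignette.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in data (assumption): 1000 genes, 12 samples
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)
dds <- DESeq(dds)

# Cook's distances are stored as a genes x samples matrix
cooks <- assays(dds)[["cooks"]]

# One box per sample; a sample whose box sits consistently higher
# than the others is a candidate sample-level outlier
boxplot(log10(cooks), range = 0, las = 2, ylab = "log10 Cook's distance")

# Turn off the Cook's-distance filtering of p-values in the results table
res <- results(dds, cooksCutoff = FALSE)
summary(res)
```

With cooksCutoff=FALSE, the "outliers" line of summary() should report 0, because no p-values are set to NA on the basis of Cook's distance.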

ADD REPLY • link 5.1 years ago MiKappa ▴ 30

I agree with your approach.

Here, I describe that if the continuous variable has repeated discrete values, similar to a factor with replicate samples, then the same outlier replacement approach is taken.
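A rough way to check this for a full design is to count how many samples share an identical row of the model matrix. The sketch below uses base R only. The sample table is hypothetical, and counting identical design rows is an assumption about what "replication" means here, not a description of DESeq2's internal code.

```r
# Hypothetical sample table with discretized covariates
coldata <- data.frame(
  Gender = factor(c("F", "F", "F", "M", "M", "M", "F", "M")),
  cutAge = factor(c("young", "young", "young", "old", "old", "old", "old", "young"))
)

# Design matrix for the formula of interest
mm <- model.matrix(~ Gender + cutAge, data = coldata)

# Collapse each row to a key; samples with identical design rows share a key
key <- apply(mm, 1, paste, collapse = "_")

# For every sample, the number of samples in the same design "cell"
cell_sizes <- table(key)
n_equal <- as.vector(cell_sizes[key])

# Compare against the threshold from the DESeq message
min_reps <- 7
any(n_equal >= min_reps)
```

Under this reading, adding more factors to a design can only split samples into smaller cells, so a design with many discretized covariates may have no cell that reaches the threshold.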

ADD REPLY • link 5.1 years ago Michael Love 43k

Thank you for your response.

In another design where I correct for more covariates, DESeq runs without problems. Does that make sense? My designs:

The one with the "outliers" message: design1 = ~ Gender + InsulinResistance + cutAge + cutBMI

The one that runs without any apparent problems, where I additionally correct for differences in cell type composition (again, I have discretized all numerical values): design2 = ~ Gender + InsulinResistance + cutAge + cutBMI + cutBcells + cutNKcells + cutCD4Tcells + cutCD8Tcells + cutMono + cutNeutro + cutEosino

ADD REPLY • link 5.1 years ago MiKappa ▴ 30

The design choice is really up to you. I have to limit my time on the support site to software issues and questions.

ADD REPLY • link 5.1 years ago Michael Love 43k