DEXSeq dispersion outlier
1
0
Entering edit mode
fiona.dick91 ▴ 60
@fionadick91-16521
Last seen 23 months ago
Norway

Hi,

I have been using DEXSeq with transcript counts from salmon (as suggested in e.g. : https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0862-3 ) to investigate DTU between conditions case vs control. I followed a workflow described in: https://f1000research.com/articles/7-952/v3 . When I plotted a volcano plot with the p-values from the DEXSeq result object and its estimated log2fold change (from DEXSeq::estimateExonFoldChange), i saw some extreme p-value outliers. When I plotted the counts (fitted and observed) I saw that there was no visible change at all between conditions. When I looked at the pvalues of DRIMSeq for those transcripts, they were far from significant (~0.5). I saw that those transcripts (among others) were labelled by DEXSeq as dispOutliers (which I assume means dispersion outliers?!). I attached the volcano plots and a rds object which holds the 6 outlier transcripts and the mcols from DEXSeq. What could be the reason for these extreme p-values when there clearly seems to be no significance. I aimed at remodelling everything by hand for those outliers to see if I get similar lrt statistics but I am not completely sure which values to use from mcols(DEXSeqobj) to retrieve correct variable coefficients for the samples.

volcano Plot: https://drive.google.com/file/d/12ohJQfxL15W7R11K8ew-qR_J-Y4Tf4vJ/view?usp=sharing

rds obj: https://drive.google.com/file/d/1M9mm5AqsWG2tuhK1Vbb5cBjHpDbLhex9/view?usp=sharing

SessionInfo: https://drive.google.com/file/d/1h-WgD8BLep1DuKfZPrbLF-iY9R4ybAAD/view?usp=sharing

countplot for most extreme pvalue: https://drive.google.com/file/d/1h0Ch_5v-OMLkypU9KdotPDt-Ei0NEQCl/view?usp=sharing

( I extracted the fitted counts as follows:

fitted.common.scale = as.data.frame( t(t(assays(DEXSeqObj)[["mu"]][,1:49])/DEXSeqObj$sizeFactor))              
colnames(fitted.common.scale) <- sampleAnnotation(DEXSeqObj)$sample_id

Im not sure if this is the right way to do it, I referred to this post : https://support.bioconductor.org/p/60567/ (which was about DESeq2))

DEXSeq workflow:

covariates<-lapply(covariates,function(c){paste0("+ ",as.character(c),":exon")})


    design<-formula(paste("~sample + exon ",stringi::stri_flatten(covariates)," + condition:exon"))
    print("DESGIN:")
    print(design)
    reduced_design<-formula(paste("~sample + exon ",stringi::stri_flatten(covariates)))
    print("DESGIN reduced:")
    print(reduced_design)

    sample.data <- DRIMSeq::samples(d)
    count.data <- round(as.matrix(DRIMSeq::counts(d)[,-c(1:2)]))
    dx <- DEXSeqDataSet(countData=count.data,
                        sampleData=sample.data,
                        design=design,
                        featureID=counts(d)$feature_id,groupID=counts(d)$gene_id)
                        #~sample + exon + rin:exon + condition:exon,

    system.time({
      dx <- estimateSizeFactors(dx)
      dx <- estimateDispersions(dx,quiet=FALSE,maxit=500)  
      dx <- testForDEU(dx,reduced=reduced_design)
      dx <- estimateExonFoldChanges(dx, fitExpToVar="condition")

[1] "DESGIN:" ~sample + exon + rin:exon + ageyears:exon + sex:exon + cohort:exon + oligodendrocytes:exon + condition:exon [1] "DESGIN red:" ~sample + exon + rin:exon + ageyears:exon + sex:exon + cohort:exon + oligodendrocytes:exon converting counts to integer mode the design formula contains a numeric variable with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function

DEXSeq dispersion pvalue • 1.9k views
ADD COMMENT
0
Entering edit mode

Can you try without e.g. rin:exon? I don't typically put RIN in as a covariate.

ADD REPLY
0
Entering edit mode

yes, I did try. I also changed the volcano plot to use the condition coefficient as l2fc instead of the result of DEXSeq::estimeExonFoldChange. There still remain some outliers but its not the same transcripts. Here you can find both volcano plots. The one resulting from analysis with and one without rin as covariate. We found that our main axis of variation is significantly associated with RIN and adding it as a covariate has been suggested by other studies at least for DGE analyisis (https://www.biostars.org/p/312942/https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42 , https://www.pnas.org/content/pnas/114/27/7130.full.pdf )

ADD REPLY
0
Entering edit mode
Alejandro Reyes ★ 1.9k
@alejandro-reyes-5124
Last seen 4 months ago
Novartis Institutes for BioMedical Reseā€¦

Hi Fiona,

Thanks for your detail report! It is interesting what you are describing. For these points that with extreme values, have you tried plotting the (log) counts per transcript for those genes in each sample?

A possible explanation for this is that of the estimation of log2fold changes from DEXSeq does not take into account all the covariates from your model. It only takes into account "condition" If the most of the variability is driven by the other covariates of your model, this might be hiding the "condition" effect.

ADD COMMENT
0
Entering edit mode

Hi,

thanks for replying.

  • I have attached the plot you have described for the six outliers that i have reported above (in the .rds file): https://drive.google.com/file/d/1VKSjamZMHjLYWsYL6YH3t_zrABNRSaQ-/view?usp=sharing ( counts from DEXSeq::featureCounts, added pseudocount and the took the log)
  • Apart from the volcano plot i am not using the l2fc from DEXSeq::estimateExonFoldChange anywhere. The extreme reported p-value resulted from the model where all covariates were taken into account. When I plot the counts (or relative abundances) of the fitted values...and also the raw observed ones...there is nothing. And more importantly, DRIMSeq shows a p-value of 0.5 for this transcript (vs DEXSeq a pvalue of smaller than 1*10^-50). Furthermore, out of thousands of transcripts, there is only 6 such outliers...and they even have the similar lg2fc...all of this seems to me really suspicious. Can I trust the rest of my results and do I filter those "artifacts" or is there a general mistake somewhere?
  • what is the consequence for a transcript to be labelled as a dispOutlier (=TRUE)?
  • you say DEXSeq::estimateExonFoldChange only models with the variable given to the function and doesnt take my covariates into account, would you suggest using the modeled coefficient of the condition variable instead (for generating the volcano plot)? (i.e. mcols(DEXSeq_obj)$exonthis.condition1 )
ADD REPLY
0
Entering edit mode

About points 1 and 2, since you have many covariates (rin, ageyears, sex, cohort, oligodendrocytes), it will be hard to distinguish how each of these contribute to the individual counts. It is also hard to say that there is nothing. What happens if you remove all those covariates and just test the condition variable? Are those points still significant? Even though, the distributions are not identical. For example, the single transcript counts that you shared for ENSG00000254681, there might be a contribution of the condition. In general, I don't think there is a mistake and taking those things that are significant both in DRIMseq and DEXSeq is a good (conservative) way to go. Generally speaking, there should be a good agreement between DRIMseq and DEXseq.

About point 3 (dispersion outlier flag), good question. DEXSeq internally uses DESeq2. I ran the DESeq2 examples with the default and I see the same behavior.

About your last point, definitively. Actually, your comment made me wonder why is the current implementation fitting another round of GLMs to get exon fold changes. I need to revisit why this is the case, but it is likely that there is no need to fit something separately to get exon fold changes!

ADD REPLY
0
Entering edit mode

thanks again for your detailed answer.

  • the counts from DEXSeq::featureCounts are the raw ones and not the fitted ones, right?
  • Sorry for asking again, I dont understand your point with the covariates. We neither see sth when looking at the fitted nor the raw counts, right? And even if there d be sth, the dispersion is so big. How come we have such an EXTREME p-value..that I would interpret as a transcript I can be really confident of. Whereas I have other transcripts that are also significant with much higher p-value (see volcano plot), that have much lower dispersion and that show a clear change.
  • I read on a forum somewhere (cant find the link anymore) that DESeq2 doesnt shrink transcripts that have been labelled as dispOutliers but Im not sure I remember correctly
ADD REPLY

Login before adding your answer.

Traffic: 645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6