DEXSeq estimateExonFoldChanges breaks samples' names and counts relation
1
0
Entering edit mode
gmerino ▴ 20
@gmerino-8558
Last seen 6.8 years ago
Argentina

Dear list,

From version 1.16.7 the DEXSeq function: estimateExonFoldChanges does not work properly, i.e., it breaks samples and counts relation as depicted below:

>count_matrix["ENSG00000180155:008",]

                                              s1   s3   s4  s5  s6  s8  s9  s10
ENSG00000180155:008      102 32  131 12 161 34 14 595

>sample_data

         condition 
s1  Non-Metastatic   
s3  Non-Metastatic   
s4  Non-Metastatic   
s5  Non-Metastatic   
s6      Metastatic   
s8      Metastatic   
s9      Metastatic   
s10     Metastatic   

>DEXSeqObject<-DEXSeqDataSet(count_matrix, sample_data, design, featureID=aux[,"featureID"], groupID=aux[,"groupID"])
>DEXSeqObject<-estimateSizeFactors(DEXSeqObject)

>DEXSeqObject<-estimateDispersions(DEXSeqObject, maxit=500)

>fullModel<- ~ sample + exon +condition:exon
>reducedModel<- ~ sample + exon 

>DEXSeqObject<-testForDEU(DEXSeqObject, fullModel=fullModel, reducedModel=reducedModel)

>DEXSeqObject<-estimateExonFoldChanges(DEXSeqObject, fitExpToVar="condition", denominator="Non-Metastatic")
>myresults<- DEXSeqResults( DEXSeqObject )

For ezample, for the "ENSG00000180155" gene, exon 008 we found:

>myresults["ENSG00000180155:008",]

LRT p-value: full vs reduced

                                                groupID          featureID        exonBaseMean     dispersion
                                              <character>      <character>         <numeric>        <numeric>
ENSG00000180155:008   ENSG00000180155       008                100.024         0.1793136
                                                  stat                  pvalue                 padj          Non.Metastatic
                                              <numeric>         <numeric>          <numeric>      <numeric>
ENSG00000180155:008        54.08763        1.917445e-13    2.158161e-08       7.254149
                                                Metastatic        log2fold_Metastatic_Non.Metastatic   genomicData
                                               <numeric>                          <numeric>                        < GRangesList>
ENSG00000180155:008            11.02504                0.6039062                                  ########
                                                 countData
                                                    <matrix>
ENSG00000180155:008            102 32 131 ...

The log2fold_Metastatic_Non.Metastatic  is 0.6039062. But examining into the estimateExonFoldChanges function (run step by step): 

> .....

frm <- as.formula(paste("count ~", fitExpToVar, "* exon"))
>....

>mf <- object@modelFrameBM
>geteffects <- function(geneID){
        rt <- groups %in% geneID & notNAs
        if( sum(rt) < 2 ){ return(NULL) }
        countsThis <- countsAll[rt,]
        rownames(countsThis) <- gsub("\\S+:", "", rownames(countsThis))
        dispsThis <- disps[rt]
        names(dispsThis) <- features[rt]
        numexons <- sum(rt)
        newMf <- mf[as.vector( sapply( split( seq_len(nrow(mf)), mf$sample ), "[", seq_len( numexons ) ) ),]
        newMf$exon <- factor( rep( features[rt], numsamples ) )
        newMf$dispersion <- dispsThis[match(newMf$exon, names(dispsThis))]
        newMf$count <- as.vector( countsThis )
        newMf <- droplevels(newMf)
        coefficients <- fitAndArrangeCoefs( frm, balanceExons = TRUE, mf=newMf, maxRowsMF=maxRowsMF)
        if (is.null(coefficients)) {
            return(coefficients)
        }
        ret <- t( getEffectsForPlotting(coefficients, averageOutExpression = TRUE, 
        groupingVar = fitExpToVar))
        rownames(ret) <- paste(geneID, rownames(ret), sep = ":")
        return(ret)
    }
alleffects <- bplapply( testablegenes, geteffects, BPPARAM=BPPARAM )

If for example the geteffects function is run only using geneID="ENSG00000180155"  we can found:

>newMf

sample exon      condition run sizeFactor dispersion count
s1    008 Non-Metastatic   1  1.2725022  0.1793136   102
s10  008     Metastatic   0  1.7272164  0.1793136         32
s3    008 Non-Metastatic   1  0.5398603  0.1793136   131
s4    008 Non-Metastatic   1  0.9582068  0.1793136     12
s5    008 Non-Metastatic   0  0.7174887  0.1793136   161
s6    008     Metastatic   1  1.4029030  0.1793136         34
s8    008     Metastatic   0  0.8922834  0.1793136         14
s9    008     Metastatic   1  1.4042222  0.1793136       595

Note that only the sample "s1" have its count properly assigned (remember count_matrix values showed at the beginning).  The others samples have their count values mixed. Thus, counts are mixed between conditions too, i.e. condition "Non-Metastatic" should have counts 102 ("s1"), 32 ("s3"), 131 ("s4") and 12 ("s5") and condition "Metastatic" should have 161 ("s6"), 34 ("s8"), 14 ("s9") and 595 ("s10"). Thus the log2FC is erroneously computed.

We found that this confusion is produced when mf is cut in order to build the newMf object because the new object have its sample column (factor) reordered in an alphabetic way follow the factor levels. Thus sample column is :

>newMf$sample

[1] s1 s10 s3  s4  s5 s6  s8 s9 
Levels: s1 s10 s3 s4 s5 s6 s8 s9

instead of s1 s3 s4 s5 s6 s8 s9 s10 like sample_data row names.

I suggest add a single code line into the estimateExonFoldChanges in order to avoid this problem. In this method, after mf creation I suggest set the levels of sample columns using the row names of the sample_data design matrix as:

>mf <- object@modelFrameBM

##add this in order to conserve sample-counts relation

>mf$sample<-factor(mf$sample, levels=unique(mf$sample))

or into the geteffects function, control the counts assignation as

>newMf$counts<-as.vector(countsThis[,as.character(unique(newMf$sample))])

Now, the newMf results as:

    sample exon      condition run sizeFactor dispersion count       

s1   008 Non-Metastatic   1  1.2725022  0.1793136   102
s3   008 Non-Metastatic   1  0.5398603  0.1793136    32
s4   008 Non-Metastatic   1  0.9582068  0.1793136   131
s5   008 Non-Metastatic   0  0.7174887  0.1793136    12
s6   008     Metastatic       1  1.4029030  0.1793136   161
s8   008     Metastatic       0  0.8922834  0.1793136    34
s9   008     Metastatic       1  1.4042222  0.1793136    14
s10  008     Metastatic      0  1.7272164  0.1793136   595

And the corresponding DEXSeq results after call the corrected extimateExonFoldChanges function for gene is:

                                              groupID                featureID      exonBaseMean dispersion

                                             <character>           <character>    <numeric>      <numeric>
ENSG00000180155:008   ENSG00000180155      008             100.024         0.1793136
                                                stat              pvalue               padj                Non.Metastatic
                                          <numeric>    <numeric>           <numeric>            <numeric>
ENSG00000180155:008  54.08763      1.917445e-13     2.158161e-08       5.240263
                                         Metastatic        log2fold_Metastatic_Non.Metastatic       genomicData
                                         <numeric>                          <numeric>                         <GRangesList>
ENSG00000180155:008   11.63103                           1.150268                              ########
                                         countData
                                            <matrix>
ENSG00000180155:008  102 32 131 ...
 

As can be observed the log2FC value now is 1.150268 correctly computed between Metastatic vs Non-Metastatic samples. Obviously, the condition means were also corrected. 

I hope that this code error can be  shortly solved because it cannot be easily detected given that DEXSeq doesn't return an error message and a lot of people uses this software everyday. 

Best, 

Gabriela

 

 

 

 

 

 

 

 

 

 

 

software error dexseq estimateexonfoldchanges • 1.2k views
ADD COMMENT
0
Entering edit mode
Alejandro Reyes ★ 1.9k
@alejandro-reyes-5124
Last seen 5 months ago
Novartis Institutes for BioMedical Rese…

 

Thanks a lot for your detailed report. Sorry about this bug! Coincidentally, someone else just reported the same issue and it has been already fixed in the latest release version 1.16.8, the changed were also added to the development branch. 

Thanks again and best regards,
Alejandro

ADD COMMENT

Login before adding your answer.

Traffic: 1009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6