Question

JunctionSeq producing strangely large effect sizes for certain exon counting bins

0

Entering edit mode

stuart ▴ 10

@stuart-9561

Last seen 3.7 years ago

Australia

I am using JunctionSeq (which appears to use a fairly similar algorithm to DEXSeq) to test for differential exon usage in an RNA-seq experiment (treatment vs control, n=6 per condition). For certain genes (often genes with few exon count bins), there appears to be a strangely large effect size produced despite there being little actual difference in the raw counts. My question is, why are these exons assigned such large relative effect sizes (and low FDRs) when it looks like the expression is actually pretty much unchanged in the raw count data?

For example, looking at the table produced by fData(jscs), the 2-exon gene WBGene00015769 has these results:

             geneID featureType countbinID testable status allZero baseMean  baseVar
3977 WBGene00015769 exonic_part       E002     TRUE     OK   FALSE 184.6766 358.5396
3978 WBGene00015769 splice_site       J003     TRUE     OK   FALSE 168.6154 399.1275
3979 WBGene00015769 exonic_part       E001     TRUE     OK   FALSE 246.5519 722.2288

     dispBeforeSharing  dispFitted  dispersion       pvalue      padjust chr   start
3977        0.00000001 0.003229147 0.001270983 7.055541e-01 9.784938e-01   V 5665970
3978        0.00000001 0.002496928 0.001433164 3.788741e-08 1.287139e-05   V 5665578
3979        0.01155711 0.003000040 0.007924036 1.005878e-40 7.517930e-37   V 5665238

         end strand transcripts padjust_noFilter log2FC(t/c) log2FCvst(t/c)
3977 5666463      -    C14C11.7     1.000000e+00   -0.04289528      -0.02141665
3978 5665970      -    C14C11.7     1.887138e-05    0.64414792       0.32171324
3979 5665578      -    C14C11.7     1.102241e-36    5.50411701       2.62285726

     expr_c   expr_t geneWisePadj
3977 190.3833 179.3888            0
3978 159.9550 178.0070            0
3979 239.1954 254.4281            0

Note the Log2FC (5.5) for exon E001 and the VST-transformed effect size (2.6), and FDR (7.5e-37) all points to very strong differential usage for E001. However the expression values are actually very similar between control and treated. Raw counts are:

            gene countBin  c1  c2  c3  c4  c5  c6  t1  t2  t3  t4  t5  t6
1 WBGene00015769     A000 342 239 257 269 363 291 235 246 239 283 264 167
2 WBGene00015769     E001 304 238 231 269 332 290 214 246 202 282 233 166
3 WBGene00015769     E002 229 165 195 194 252 205 167 171 173 201 188 114
4 WBGene00015769     J003 188 164 168 194 219 202 146 171 135 199 156 112

which crudely normalised to the sums of the rows (excluding the A000 row; c and t separate) and then the column sums is:

            gene countBin    c_norm    t_norm
2 WBGene00015769     E001 0.4119832 0.4099512
3 WBGene00015769     E002 0.3070067 0.3095238
4 WBGene00015769     J003 0.2810102 0.2805250

So there is really not much difference in exon usage between Control and Treatment for this gene! (Also, of the plots produced, the 'relExpr' plot shows a large difference for E001 but the 'expr' plot shows practically none). The entire library sizes are not greatly different either. I'm struggling to understand why this gene and a few others like it are assigned such large effect sizes and low FDRs.

Sorry not to have a reproducible example, the files are quite large. However the jscs object was basically produced from the count matrix as so:

jscs <- runJunctionSeqAnalyses( use.novel.junctions = TRUE,
                                sample.files = decoder$countFiles,
                                sample.names = decoder$sample.ID,
                                condition = decoder$group.ID,
                                flat.gff.file = "./JunctionSeq.flat.gff.gz",
                                nCores = 8,
                                verbose=TRUE,
                                debug.mode = TRUE
)

where decoder is just a dataframe containing the paths to the count files, sample names and the experimental conditions; and JunctionSeq.flat.gff.gz was produced by running qorts makeFlatGff --stranded on the standard .gtf for c.elegans. Package versions:

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

other attached packages:
 [1] DESeq2_1.24.0               JunctionSeq_1.14.0          RcppArmadillo_0.9.600.4.0  
 [4] Rcpp_1.0.2                  SummarizedExperiment_1.14.1 DelayedArray_0.10.0        
 [7] BiocParallel_1.18.1         matrixStats_0.54.0          GenomicRanges_1.36.0       
[10] GenomeInfoDb_1.20.0         IRanges_2.18.1              S4Vectors_0.22.0           
[13] dplyr_0.8.3                 VennDiagram_1.6.20          futile.logger_1.4.3        
[16] Biobase_2.44.0              BiocGenerics_0.30.0         gridExtra_2.3              
[19] ggplot2_3.2.0               stringr_1.4.0

junctionseq splicing RNAseq DEXSeq • 1.1k views

ADD COMMENT • link 5.3 years ago stuart ▴ 10

0

Entering edit mode

Actually DEXSeq and JunctionSeq appear to disagree with each other for quite a few genes with few exons, comparing the Log2FC of DEXseq with the Log2(vst)FC of my application of JunctionSeq to the same count-matrix gives: plot . (non VST-transformed LFC vals from JunctionSeq give similar results, just more spread out on the x-axis).

Not sure if I've kept everything the same between the two analyses but I tried to keep it as close as possible to the standard scripts from the vignettes.

ADD REPLY • link 5.3 years ago stuart ▴ 10