negative values for trended dispersion with edgeR
1
0
Entering edit mode
@emmanouela-repapi-6515
Last seen 2.6 years ago
United Kingdom

Hello, 

I have a dataset with very high coverage on each sample for which I want to run edgeR on to get the differentially expressed transcripts between 2 types of cells with 3 replicates each. 
The analysis runs smoothly and I have many DE transcripts, but when I plot the BCV I get a plot (which is slightly weird with the trended dispersion line being bumpy - see attached) and the warning:

Warning message:
In sqrt(y$trended.dispersion) : NaNs produced

When I look into it, it turns out I have many negative values for the trended dispersion (for 160 transcripts) but the actual counts for these transcripts don't look unusual and I can't tell what is wrong. 

My code looks like this:

counts_data <- DGEList(counts=round(all_samples[,2:7]),group=rep(c("PC","GS"), each=3), genes=my_genes) 
counts_cpm <- cpm(counts_data)
filter_cpm <- rowSums(counts_cpm> 1)>= min(table(group))
counts_data <- counts_data[filter_cpm,]
counts_data$samples$lib.size <- colSums(counts_data$counts)
counts_cpm <- cpm(counts_data) 
counts_data_norm <- calcNormFactors(counts_data)
set_d <- estimateCommonDisp(counts_data_norm, verbose=T)
set_d <- estimateTrendedDisp(set_d)
set_d <- estimateTagwiseDisp(set_d)

Some of the transcripts with the negative trended dispersion look like this:

set_d[set_d$trended.dispersion<0]
An object of class "DGEList"
$counts
                PC1_EstimatedNumReads PC2_EstimatedNumReads PC3_EstimatedNumReads GS1_EstimatedNumReads GS2_EstimatedNumReads GS3_EstimatedNumReads
ENST00000196551                178336                170990                162887                100756                 77829                 81842
ENST00000202773                164480                154960                155601                102093                 84593                 89253
ENST00000216181                247587                287651                278037                285772                258325                262454
ENST00000216281                829577                717654                834776                504985                388358                412351
ENST00000216962                 66190                 73460                 60030                165816                152397                175372
155 more rows ...

$samples
                      group lib.size norm.factors
PC1_EstimatedNumReads    PC 2.17e+08        0.982
PC2_EstimatedNumReads    PC 2.21e+08        0.996
PC3_EstimatedNumReads    PC 2.18e+08        0.986
GS1_EstimatedNumReads    GS 2.18e+08        1.031
GS2_EstimatedNumReads    GS 1.89e+08        1.013
GS3_EstimatedNumReads    GS 1.97e+08        0.993

$genes
         Transcript Length   Geneid Transcript_name
359 ENST00000196551    740     RPS5        RPS5-201
401 ENST00000202773    947     RPL6        RPL6-001
585 ENST00000216181   7501     MYH9        MYH9-001
607 ENST00000216281   3379 HSP90AA1    HSP90AA1-002
679 ENST00000216962   4134     PYGB        PYGB-001
155 more rows ...

$common.dispersion
[1] 0.0511

$pseudo.counts
                PC1_EstimatedNumReads PC2_EstimatedNumReads PC3_EstimatedNumReads GS1_EstimatedNumReads GS2_EstimatedNumReads GS3_EstimatedNumReads
ENST00000196551                175182                162868                159302                 94074                 85077                 87631
ENST00000202773                161571                147600                152177                 95323                 92471                 95566
ENST00000216181                243208                273988                271918                266821                282381                281018
ENST00000216281                814904                683566                816405                471496                424523                441517
ENST00000216962                 65019                 69971                 58709                154820                166589                187776
155 more rows ...

$pseudo.lib.size
[1] 2.1e+08

$AveLogCPM
[1]  9.25  9.21 10.33 11.50  9.12
155 more elements ...

$trended.dispersion
[1] -0.00180 -0.00159 -0.00782 -0.01435 -0.00112
155 more elements ...

$prior.n
[1] 2.5

$tagwise.dispersion
[1] 0.00601 0.00559 0.00604 0.00720 0.00785
155 more elements ...

How could it be that I am getting negative values for the trended dispersion (especially when the respective tagwise dispersions are small but positive). Is this behaviour alarming or is it OK to be using the results from the analysis? Its probably worth mentioning that using a cpm>5 cutoff for the filtering reduces the problem ( I only get 5 negative trended dispersion values) but does not get rid of it. Increasing it more also doesn't help.

Any help would be very much appreciated!

Best wishes,

Emmanouela

Session Info:

R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.ISO-8859-1       LC_NUMERIC=C                    LC_TIME=en_GB.ISO-8859-1        LC_COLLATE=en_GB.ISO-8859-1    
 [5] LC_MONETARY=en_GB.ISO-8859-1    LC_MESSAGES=en_GB.ISO-8859-1    LC_PAPER=en_GB.ISO-8859-1       LC_NAME=C                      
 [9] LC_ADDRESS=C                    LC_TELEPHONE=C                  LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C            

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.8.5  limma_3.22.1

loaded via a namespace (and not attached):
[1] tools_3.1.1

 

 

 
rnaseq estimatedispersions edgeR • 1.6k views
ADD COMMENT
0
Entering edit mode

You're a version out of date. See if the problem shows up in the latest version of edgeR (3.10.2) and Bioconductor (3.2). You'll also have to update R to version 3.2.0 or higher.

ADD REPLY
2
Entering edit mode
Yunshun Chen ▴ 870
@yunshun-chen-5451
Last seen 28 days ago
Australia

The way estimateTrendedDisp() works is as follows:

It first splits the data into a number of bins by their expression level. It then estimates the common dispersion for each bin and fits a spline curve through all those bin-wise common dispersions to get a dispersion trend.

The tagwise dispersion values are quite large for the lowly expressed genes in your data. One would suspect the unusual behaviour of the dispersion trend might be driven by the extremely high common dispersion estimate in one (or some) of the bins at the lower end. You also mentioned that increasing the filtering threshold reduces the problem.

However, the problem still remains after the threshold is increased. That indicates the lack of filtering might not be the reason why the trended dispersion goes below zero. I can't tell what the reason is without seeing the data. Besides, the library sizes are all around 200 million. Personally, I would decrease the filtering threshold instead.

We've noticed that estimating the trended dispersion by the spline method is not always stable. That's why we recommend the weighted likelihood approach in which the common, trended and tagwise dispersions can be estimated in one run. It is implemented in estimateDisp(). Try

set_d <- estimateDisp(counts_data_norm)

and see whether it solves the problem.

 

 

ADD COMMENT
0
Entering edit mode

That worked! Thank you Yunshun!

ADD REPLY

Login before adding your answer.

Traffic: 849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6