Hi everybody,
I am working with RNASeq data.
I am using DESeq2 and I would like to correct for the GC content and length bias using cqn.
When I tested the normalized counts with NOISeq to test if the biases were corrected, I noticed that when I divided out the normalization factors by the geometric mean (as suggested in DESeq2 manual), the cqn correction becomes ineffective.
Instead when I used the normalization factors without dividing out the geometric mean (different scale) the biases are significantly reduced
Could you help me with this? I cannot understand why this is not working.
Here is the code I used
#> head(uCovar_filt) # length GC_content #ENSBTAG00000000005 2310 43.65 #ENSBTAG00000000009 1543 65.02 #ENSBTAG00000000010 1348 48.38 #ENSBTAG00000000012 1227 38.70 #ENSBTAG00000000013 4478 37.35 #ENSBTAG00000000014 1044 50.95 #> head(countData) # 6 8 10 11 16 20 27 28 29 33 42 14 17 #ENSBTAG00000000005 222 160 252 230 448 380 306 259 333 420 240 218 514 #ENSBTAG00000000009 103 139 138 97 57 198 161 57 138 149 60 89 303 #ENSBTAG00000000010 290 593 458 423 653 718 608 402 527 686 459 260 641 #ENSBTAG00000000012 390 369 318 245 425 342 372 274 392 391 376 217 291 #ENSBTAG00000000013 1667 2637 1862 1537 2001 2292 1730 1445 2182 2125 1686 940 2225 #ENSBTAG00000000014 614 1814 1276 751 959 1867 1733 865 1325 1777 882 674 1177 # 19 21 25 32 34 38 #ENSBTAG00000000005 316 336 236 315 269 334 #ENSBTAG00000000009 171 140 121 97 78 117 #ENSBTAG00000000010 601 498 416 377 335 408 #NSBTAG00000000012 285 213 286 295 264 291 #ENSBTAG00000000013 1858 1538 1606 1538 1456 1813 #ENSBTAG00000000014 1628 1465 1273 969 670 759 library(DESeq2) # input data design=~covariate1+covariate2+covariate3 dds <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design) #estimate size factors to be used in cqn dds=estimateSizeFactors(dds) sizefactors=sizeFactors(dds) #run cqn to generate the normalization factors matrix to account for GC and length bias library(cqn) cqnObject=cqn(counts = countData,x=uCovar_filt$GC_content, lengths = uCovar_filt$length, sizeFactors =sizefactors) #extract normalization factors as suggested in DESeq2 manual cqnOffset <- cqnObject$glm.offset cqnNormFactors <- exp(cqnOffset) #generate the normalization factor matrices with and without dividing out the geometric mean normFactors=cqnNormFactors normFactors_sameScale <- cqnNormFactors / exp(rowMeans(log(cqnNormFactors))) #create other two DESeq objects using the normalization facors with (sameScale)and without dividing out the geometric mean. # create dds1 using normFactors_sameScale dds1 <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design) normalizationFactors(dds1)=normFactors_sameScale # create dds2 using normFactors dds2 <- DESeqDataSetFromMatrix(countData = countData,colData = pheno,design =design) normalizationFactors(dds2)=normFactors #extract the normalized counts #dds contains the size factors generated by DESEq2 #dds1 contains the normalization factor matrix generated with cqn, dividing out the geometric means #dds2 contains the normalization factor matrix generated wit cqn, WITHOUT dividing out the geometric means counts_normalized=counts(dds,normalized=TRUE) #(Fig A) counts_nobias_sameScale=counts(dds1,normalized=TRUE) #(Fig B) counts_nobias=counts(dds2,normalized=TRUE) #(Fig C) #I tested these three count matrices with NOISeq. I attached the plots generated (Fig A, Fig B, Fig C)
Fig A (normalization with DESeq2)
Fig B (bias correction with cqn dividing out the geometric mean)
Fig C (bias correction with cqn WITHOUT dividing out the geometric mean)
Thank you.
Best regards,
Gianluca
Dear Mike,
Noiseq generates a plot for each sample for the GC content (plot showed in the post) and a plot for each sample for the length bias (plot not showed in my post)
The plots look similar across the sample (both for GC content and length bias).
I attached here the plots for other samples (only for GC content)
FIG A (6 samples, normalization with DESeq2)
Fig B (6 samples, bias correction with cqn dividing out the geometric mean)
Fig C (6 sample bias correction with cqn WITHOUT diving out the geometric mean)
Thank you. Best Regards,
Gianluca
Ok, thanks, I think I see what is going on. Using cqn in combination with DESeq2 via normalization factors, corrects for the differential GC dependence per gene. So if the counts are low because the gene has high GC content, this is fine for DE testing except if the degree to which the counts are low differs across samples for that gene. So the key plot would be if the log2 fold changes depended on the gene's GC content before and after using cqn. In short, my answer is that, it's ok if there is general overall dependence of counts on GC content -- this is not a problem for statistical analysis at all -- as long as there is not differential dependence per gene across samples, which should be correct by using cqn in combination with DESeq2.
When you keep in the geometric mean of the normalization factors, cqn is additionally correcting the row-wise average count so when you look across genes it has removed dependence of this average on the gene's GC content. But in doing so, it hurts DESeq2's methods, because it shifts the row-wise average count from its observed value to a new range. This hurts DESeq2's variance mean modeling, and gives no advantage for statistical testing (the shift is absorbed by the intercept either way).
Great! Thank you so much for your help.
If I understood correctly, you suggest to perform the DE analysis using cqn to correct for between samples (GC content and length) bias.
I know that the gene length and GC content bias within sample does not compromise DE results but that they affect the functional enrichment. It is expected that genes with more counts will be most likely to be DE than shorter gene (within sample bias). Am I right?
If possible, I would like to ask you a small suggestion.
If I want to perform functional enrichment of the DE genes, do you suggest to account for these biases after the DE (using for example GOseq) and not before?
Thank you for all your help!
Best Regards,
Gianluca
Yes, you could use goseq to account for these biases.
If you read the goseq manual, there is a section "6.8 Correcting for other biases", where they discuss using the mean or total count across a row instead of gene length as a way to deal with other biases than strictly gene length bias. Binning by mean count makes a lot of sense to me, as the sensitivity for gene DE depends on length and GC content only to the extent that these increase (or decrease) the mean count. So by modeling on the mean count, you account for length and GC bias.