Accounting for 5'/3' Bias in DESeq2
Jakub ▴ 50
@jakub-9073
Last seen 9 months ago
United Kingdom

I couldn't find an answer to this by searching the forums. I have RNA-seq samples with 5'/3' biases that are unevenly distributed across my conditions: some conditions have more biased samples, some fewer. The cause is almost certainly differing levels of RNA fragmentation or other differences in sample prep (PS: I realise that this is not a good start). As a result, DESeq2 calls genes as differentially expressed because of the differences in bias between groups.

What is the best method for accounting for this variation in an objective way: the RUV package, adding 5'/3' calculated bias ratios to the GLM (e.g. from Picard), using residuals? Any opinions would be greatly appreciated.

Many thanks, J

 

deseq2 • 2.7k views
@ryan-c-thompson-5618
Last seen 6 weeks ago
Icahn School of Medicine at Mount Sinai…

I've recently seen a paper that presents a metric meant to account for the integrity of each transcript/gene in each sample: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0922-z

They describe an adjustment that removes the dependency between this "transcript integrity number" and the logCPM of each gene by fitting a loess curve and then subtracting that curve out (see figure 6), and they demonstrate that their adjustment reduces the number of (presumed) false positives in a differential expression test. However, in the paper they implement their adjustment by modifying the counts directly. Instead, I would recommend you use the adjustment to compute an offset matrix, since it's important for edgeR and DESeq2 to have access to the raw counts so they can accurately account for the counting uncertainty.


Thanks!

I've done exactly this and computed TINs for each gene, and performed the loess regression. I now have the raw logcounts and corrected logcounts. I guess I am not clear in my head which value is best to use in a normFactor offset matrix, before normalising each row to a geometric mean of 1 as described in the vignette.

  • difference in absolute values, i.e. 10^corrected - 10^raw
  • difference in log values, i.e. corrected - raw
  • ratio of absolute values, i.e. 10^corrected / 10^raw

PS: I used the ratio, because absolute differences can be negative and the matrix has to be positive, and the package authors explicitly warn against using log differences.


Looking at the documentation, I see DESeq2 uses a matrix of "normalization factors" on the scale of the raw counts rather than a GLM offset matrix. The raw counts are divided by the normalization factors to get the normalized counts. So if normcounts = rawcounts / normfactors, then normfactors = rawcounts / normcounts. So compute that, then normalize the geometric mean of each row to 1 as described in the DESeq2 manual, and store these norm factors in the DESeqDataSet object. Finally, you'll need to run estimateSizeFactors, since the TIN normalization only normalizes within samples and you still need to normalize between samples. After that, you should be able to run through your standard DESeq2 pipeline and have it use your TIN-derived normalization factors.
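As a rough illustration of the arithmetic above (not DESeq2 code), here is a minimal pure-Python sketch. It assumes the raw and TIN-corrected values are base-10 log counts, as the 10^ notation earlier in the thread suggests, and that the matrices are lists of rows (genes) by columns (samples). The function name `tin_norm_factors` and the toy numbers are made up for the example.

```python
import math

def tin_norm_factors(raw_log10, corrected_log10):
    """Factors = rawcounts / normcounts, with each row scaled to geometric mean 1.

    Inputs are assumed to be log10 values (genes x samples); this is an
    illustrative helper, not part of DESeq2.
    """
    factors = []
    for raw_row, corr_row in zip(raw_log10, corrected_log10):
        # raw / corrected on the count scale is 10 ** (raw_log - corrected_log)
        log_ratio = [r - c for r, c in zip(raw_row, corr_row)]
        # Subtracting the mean log ratio divides the row by its geometric mean
        mean_log = sum(log_ratio) / len(log_ratio)
        factors.append([10 ** (x - mean_log) for x in log_ratio])
    return factors

# Toy data: 2 genes x 3 samples
raw = [[2.0, 2.3, 1.9], [3.1, 3.0, 3.4]]
corrected = [[2.1, 2.2, 2.0], [3.0, 3.1, 3.2]]
nf = tin_norm_factors(raw, corrected)
```

Note the direction of the ratio: the factors are raw over corrected, so that dividing the raw counts by them gives back the corrected counts. Working in log space also guarantees the factors are strictly positive.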

(Mike, please correct me if I got anything wrong here.)


Sounds right.

If you want to correct for library size on top of normalization factors, pass the normFactors matrix (with row-wise geometric means around 1) to the normMatrix argument of estimateSizeFactors:

normMatrix: optional, a matrix of normalization factors which do not
          control for library size.... Providing ‘normMatrix’ will estimate
          size factors on the count matrix divided by ‘normMatrix’ and
          store the product of the size factors and ‘normMatrix’ as
          ‘normalizationFactors’.
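
To make the quoted description concrete, here is a small pure-Python sketch of the two steps it names: estimating size factors on counts divided by the norm matrix, then taking the product. It assumes the usual median-of-ratios estimator and skips genes containing a zero; the function name is hypothetical, and DESeq2's actual implementation may differ in details such as zero handling or any rescaling applied afterwards.

```python
import math
from statistics import median

def size_factors_with_norm_matrix(counts, norm_matrix):
    """Sketch: median-of-ratios size factors on counts / norm_matrix,
    then normalization factors = norm_matrix * size factors (per column)."""
    n_samples = len(counts[0])
    adjusted = [[c / f for c, f in zip(crow, frow)]
                for crow, frow in zip(counts, norm_matrix)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in adjusted:
        if all(v > 0 for v in row):  # assumption: genes with a zero are skipped
            log_geo_mean = sum(math.log(v) for v in row) / n_samples
            for j, v in enumerate(row):
                log_ratios[j].append(math.log(v) - log_geo_mean)
    size_factors = [math.exp(median(lr)) for lr in log_ratios]
    normalization_factors = [[f * s for f, s in zip(frow, size_factors)]
                             for frow in norm_matrix]
    return size_factors, normalization_factors

# Toy data: 3 genes x 2 samples; sample 2 is sequenced about twice as deeply
counts = [[10, 20], [30, 60], [5, 10]]
norm_matrix = [[2.0, 0.5], [1.0, 1.0], [1.0, 1.0]]  # row geometric means are 1
size_factors, normalization_factors = size_factors_with_norm_matrix(counts, norm_matrix)
```

In this toy case the gene-specific factors for the first gene do not disturb the library-size estimate, because the median across genes is taken per sample.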
@mikelove
Last seen 14 hours ago
United States

It's hard to predict how the 5'/3' bias will affect the counts, although it's reasonable to expect that it will. 

I'd recommend either svaseq (from the sva package) or RUVSeq, either of which will be able to pick up on systematic differences (including this bias) that affect the counts across many rows.

The only situation where these packages can't help you -- and I'm not sure any computational method can -- is if the 5'/3' bias is perfectly confounded with the condition.
