Optimizing CPM Calculation for Varied RNA-seq Data
Ahdee ▴ 50
@ahdee-8938
United States

I'm using edgeR to calculate CPM values for bulk RNA-seq samples. I have >100 samples from a diverse cohort, and each sample was collected, library-prepped, and sequenced at a different time point throughout the year. For example, samples 1-10 could have been prepped in January, 3 more in February, 2 more in mid-February, and so on. Moreover, I do not have information on when these samples were processed.

Given this situation, should I merge all the data together and use TMM normalization to calculate the normalization factors before producing CPM counts? Or should I calculate CPM individually for each sample without using TMM normalization?

Thanks in advance.

edgeR • 442 views

Without TMM (or similar) normalization you are likely to get compositional artifacts, so I would always use it. The main issue in your case is the batch effects between samples that come from the different preparation dates. As for whether to calculate normalization factors over all samples or over a subset, the answer is simple: run it across all samples that you want to compare. Typically that is the entire dataset.
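For illustration, a minimal sketch of that workflow in edgeR could look like the following. The count matrix here is simulated and its gene and sample names are hypothetical, so substitute your own genes-by-samples matrix of raw counts. The point is that all samples go into one `DGEList` before the TMM factors are calculated, and `cpm()` then uses the effective library sizes (library size times normalization factor).

```r
library(edgeR)

# Hypothetical raw count matrix: genes in rows, samples in columns.
set.seed(1)
counts <- matrix(
  rnbinom(1000 * 6, mu = 50, size = 5),
  nrow = 1000,
  dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:6))
)

# Put every sample you want to compare into a single DGEList.
y <- DGEList(counts = counts)

# TMM factors are calculated jointly across all samples.
y <- calcNormFactors(y, method = "TMM")
norm_factors <- y$samples$norm.factors

# CPM using effective library sizes (lib.size * norm.factors).
cpm_tmm <- cpm(y)

# For comparison: plain CPM from raw library sizes only, no TMM.
cpm_raw <- cpm(counts)

# log2-CPM, often more convenient for plotting and clustering.
logcpm <- cpm(y, log = TRUE, prior.count = 2)
```

On real data, comparing `cpm_tmm` with `cpm_raw` for a few samples gives a quick feel for how much the normalization factors matter in your cohort.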

@gordon-smyth
WEHI, Melbourne, Australia

If you wish to perform DE analyses using the samples then you should perform TMM normalization. Whether or not the samples are diverse or sequenced at different times is irrelevant, unless that prevents you from undertaking a DE analysis with all the samples.

@james-w-macdonald-5106
United States

In addition to Gordon's point, in my experience batch effects from different batches at the same sequencing center are usually quite minor. We have analyzed large data sets that were run at different times over the years due to the logistics of processing the samples. While it is important to ensure that you randomize to the batches appropriately so your batch effects are orthogonal to the coefficient of interest, running samples in several batches is not likely to be problematic.
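
If batch labels are available (the original poster says they are not, so the `targets` table below is entirely hypothetical), a rough way to check that batch is not confounded with the factor of interest is to cross-tabulate the two and confirm that an additive design matrix has full column rank. This is only a sketch in base R, not a complete analysis.

```r
# Hypothetical sample annotation; neither column comes from the original post.
targets <- data.frame(
  group = factor(rep(c("control", "treated"), times = 6)),
  batch = factor(rep(c("b1", "b2", "b3"), each = 4))
)

# Each batch should contain samples from every group.
tab <- table(targets$batch, targets$group)
print(tab)

# Additive design: batch is adjusted for, the last coefficient is the group effect.
design <- model.matrix(~ batch + group, data = targets)

# Full column rank means the group effect can be estimated separately from batch.
is_full_rank <- qr(design)$rank == ncol(design)
print(is_full_rank)
```

If a batch contained only one group, the cross-tabulation would show a zero cell, and in the extreme case where batch and group coincide the design would lose rank.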
