Question

Optimizing CPM Calculation for Varied RNA-seq Data

Ahdee ▴ 60

@ahdee-8938

United States

I'm using edgeR to calculate CPM values for bulk RNA-seq samples. I have >100 samples from a diverse cohort, with each sample collected, library prepped, and sequenced at different time points throughout the year. For example, 1-10 samples could have been prepped in Jan, 3 more in Feb, 2 in mid-Feb, etc. Moreover, I do not have any record of when each sample was processed.

Given this situation, should I merge all the data together and use TMM normalization to calculate the normalization factors before producing CPM counts? Or should I calculate CPM individually for each sample without using TMM normalization?

Thanks in advance.

edgeR • 776 views

updated 8 months ago by James W. MacDonald 68k • written 8 months ago by Ahdee ▴ 60

Without TMM (or similar) normalization you are likely to get compositional artifacts, so I would always use it. Your real problem is the batch effects between samples caused by the different preparation dates; that is the main issue. The question of whether to calculate normalization over all samples or over a subset is simple: run it across all samples that you want to compare. Typically that is the entire dataset.
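To make the compositional artifact concrete, here is a minimal sketch in Python with NumPy. It is not edgeR's TMM implementation: `trimmed_log_ratio_factors` is a hypothetical, simplified helper (an unweighted trimmed mean of log-ratios against a reference sample), and the toy counts are invented for illustration. The only part that mirrors edgeR is the idea that CPM is computed against an effective library size, i.e. library size times normalization factor.

```python
import numpy as np


def cpm(counts, norm_factors=None):
    """Counts per million. Rows are genes, columns are samples."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=0)
    if norm_factors is None:
        norm_factors = np.ones(counts.shape[1])
    # Effective library size = raw library size * normalization factor.
    effective = lib_size * np.asarray(norm_factors, dtype=float)
    return counts / effective * 1e6


def trimmed_log_ratio_factors(counts, ref=0, trim=0.3):
    """Simplified, unweighted stand-in for TMM (assumption, not edgeR's code)."""
    counts = np.asarray(counts, dtype=float)
    prop = counts / counts.sum(axis=0)
    keep = (counts > 0).all(axis=1)  # avoid log of zero
    log_ratio = np.log2(prop[keep] / prop[keep][:, [ref]])
    k = int(trim * log_ratio.shape[0])
    # Drop the k lowest and k highest log-ratios per sample, then average.
    trimmed = np.sort(log_ratio, axis=0)[k:log_ratio.shape[0] - k]
    factors = 2.0 ** trimmed.mean(axis=0)
    # Rescale so the factors multiply to one.
    return factors / np.exp(np.log(factors).mean())


# Toy data: 10 genes, 2 samples. Only gene 0 differs (strongly up in sample B).
counts = np.full((10, 2), 100.0)
counts[0, 1] = 1100.0

naive = cpm(counts)
factors = trimmed_log_ratio_factors(counts)
normalized = cpm(counts, factors)

# Ratio B/A for the nine unchanged genes, before and after normalization.
print(naive[1:, 1] / naive[1:, 0])
print(normalized[1:, 1] / normalized[1:, 0])
```

With plain library-size scaling, the nine unchanged genes appear halved in sample B because one highly expressed gene takes up more of the library. With the normalization factors applied, their CPM values agree across the two samples. Because the factors are computed relative to the other samples, they only make sense when estimated over the full set of samples being compared.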

8 months ago ATpoint ★ 4.8k

score 1 · Answer 1 · 2024-08-06

Gordon Smyth 52k

@gordon-smyth

WEHI, Melbourne, Australia

If you wish to perform DE analyses using the samples then you should perform TMM normalization. Whether or not the samples are diverse or sequenced at different times is irrelevant, unless that prevents you from undertaking a DE analysis with all the samples.

8 months ago Gordon Smyth 52k

score 1 · Answer 2 · 2024-08-07

In addition to Gordon's point, in my experience batch effects from different batches at the same sequencing center are usually quite minor. We have analyzed large data sets that were run at different times over the years due to the logistics of processing the samples. While it is important to ensure that you randomize to the batches appropriately so your batch effects are orthogonal to the coefficient of interest, running samples in several batches is not likely to be problematic.
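When batch labels are available, one quick way to check that batches are not confounded with the factor of interest is to cross-tabulate the two. The sketch below uses only the Python standard library; the batch and group labels are invented, and `confounded_batches` is a hypothetical helper name. It only flags the extreme case where a batch contains a single group, which is weaker than a full orthogonality check.

```python
from collections import Counter


def crosstab(batches, groups):
    """Number of samples for each (batch, group) pair."""
    return Counter(zip(batches, groups))


def confounded_batches(batches, groups):
    """Batches whose samples all belong to one group."""
    seen = {}
    for batch, group in zip(batches, groups):
        seen.setdefault(batch, set()).add(group)
    return sorted(b for b, gs in seen.items() if len(gs) == 1)


# Hypothetical sample annotation.
batches = ["jan", "jan", "feb", "feb", "mar", "mar"]
groups = ["ctl", "trt", "ctl", "trt", "trt", "trt"]

table = crosstab(batches, groups)
flagged = confounded_batches(batches, groups)

for (batch, group), n in sorted(table.items()):
    print(batch, group, n)
print("single-group batches:", flagged)
```

Here the "mar" batch holds only treated samples, so any batch effect in it cannot be separated from the treatment effect for those samples.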