I'm using edgeR to calculate CPM values for bulk RNA-seq samples. I have >100 samples from a diverse cohort, with each sample collected, library prepped, and sequenced at different time points throughout the year. For example, 1-10 samples could have been prepped in January, 3 more in February, 2 more in mid-February, and so on. Moreover, I do not have any record of when each sample was processed.
Given this situation, should I merge all the data together and use TMM normalization to calculate the normalization factors before producing CPM counts? Or should I calculate CPM individually for each sample without using TMM normalization?
Thanks in advance.
Without TMM (or similar) normalization you are likely to get compositional artifacts: a handful of very highly expressed genes can take up a large share of the reads in one sample and make every other gene look lower in plain CPM. I would always use it. That said, your main issue is the batch effects between samples caused by the different preparation dates. The question of whether to calculate normalization factors over all samples or a subset is simple: run it across all the samples that you want to compare. Typically that is the entire dataset.
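To make the compositional artifact concrete, here is a minimal toy sketch in plain Python. It is not edgeR's implementation: `simple_tmm_factor` is a hypothetical helper that takes an unweighted trimmed mean of log2 ratios against a reference sample, whereas edgeR's TMM also trims on average abundance, applies precision weights, and rescales the factors across samples. The counts are made up for illustration.

```python
import math

def cpm(counts, norm_factor=1.0):
    # Effective library size = raw library size * normalization factor.
    effective_size = sum(counts) * norm_factor
    return [c / effective_size * 1e6 for c in counts]

def simple_tmm_factor(sample, reference, trim=0.3):
    # Hypothetical, simplified helper (NOT edgeR's exact TMM):
    # unweighted trimmed mean of per-gene log2 ratios of read proportions.
    n_s, n_r = sum(sample), sum(reference)
    log_ratios = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with zero counts have no defined log ratio
    )
    k = int(len(log_ratios) * trim)            # genes trimmed from each tail
    kept = log_ratios[k:len(log_ratios) - k]   # drop the extreme ratios
    return 2 ** (sum(kept) / len(kept))

# Made-up counts: ten genes, identical in both samples except the last one,
# which is strongly up in sample B and inflates its library size.
sample_a = [100] * 10
sample_b = [100] * 9 + [1100]

plain_a = cpm(sample_a)                    # plain CPM, no normalization factor
plain_b = cpm(sample_b)                    # unchanged genes look halved here
factor_b = simple_tmm_factor(sample_b, sample_a)
scaled_b = cpm(sample_b, factor_b)         # CPM on the effective library size

print(plain_a[0], plain_b[0], scaled_b[0])
```

In this toy case the nine unchanged genes have the same raw count in both samples, yet plain CPM reports them at half the value in sample B because one gene dominates its library. Scaling the library size by the trimmed-mean factor brings them back in line. Note that this only addresses composition; it does nothing about batch effects from the preparation dates.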