I have an RNA-seq dataset with two treatments, A and B (+/+, +/-, -/+, -/- design), and a batch effect. Both treatments are equally represented in all batches. For differential expression analysis, I have added this batch effect as a term in the design matrix in edgeR and it works perfectly.
I also want to make a clustered heatmap from the genes that were found to be differentially expressed. When I do this, the batch effect is visible and samples from the same batch group together (though there is a stronger grouping from treatment A). I wanted to remove this batch effect and re-do the clustering and heatmap generation. For this purpose, I tried the limma function removeBatchEffect() and a few other similar tools.
Here is what I have done:
(1) Defined three experimental factors: A, B and batch.
(2) Made a design matrix that only includes A and B, but not batch.
(3) Put raw, non-normalized counts (non-integer expected counts output from RSEM) in a DGEList.
(4) Used TMM-normalization.
(5) Converted to log2 counts per million.
(6) Ran removeBatchEffect() on the result of (5), specifying the batch factor and the design matrix that excludes batch.
Here is the code I ran:
# Loading required R libraries.
library(gplots)
library(limma)
library(edgeR)

# Define experimental factors and design matrix.
batch <- factor(substring(colnames(data), 12, 12))
factorA <- factor(substring(colnames(data), 1, 6))
factorB <- factor(substring(colnames(data), 7, 10))
design <- model.matrix(~factorA*factorB)

# Put counts into DGEList.
y <- DGEList(counts=data)

# TMM-normalization.
y <- calcNormFactors(y)

# Convert to CPM and log2 transformation.
logCPM <- cpm(y, log=TRUE, prior.count=0.1)

# Remove batch effect.
logCPM_no_batch <- removeBatchEffect(logCPM, batch=batch, design=design)

# Make heat map.
heatmap(logCPM_no_batch)
The results look absolutely fantastic. Samples from the same batch no longer cluster together at all, yet the basic appearance of the heatmap is the same in terms of treatment effects.
Previous attempts at this some months ago produced horrible results (no heatmap patterning at all, not even in small regions), presumably because I made mistakes somewhere. But when things go from horrible to absolutely fantastic after an afternoon's work, I instantly become suspicious that it is too good to be true and that I have made some other error that makes it all invalid somehow. Perhaps that is a bit paranoid, but better safe than sorry in these relatively complicated scientific matters.
So my two questions are:
(1) Is the usage above, both the direct application of removeBatchEffect() and the prior treatment of the data, appropriate for the stated goal of making a "batch-free" clustered heatmap? Are there any red flags? I have browsed through several previous posts and looked at the manual page for this function before asking.
(2) Is it crucial to include a length normalization before making a clustered heatmap for this kind of RNA-seq-based differential expression figure? Obviously, the gene length is exactly the same for all samples for a given gene, so it will not bias within-gene, between-sample comparisons. What would I lose if I skipped it? Would it affect any technical aspect (such as the hierarchical clustering of genes) or the kind of conclusions that can be drawn from the heatmap? If I wanted to include it, is it enough to switch from cpm() to rpkm() and add a vector of gene lengths?
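What I have in mind is something like this (just a sketch; gene_lengths is a hypothetical vector of gene lengths in bp, in the same order as the rows of y, not something defined in my code above):

# Same workflow as above, but length-normalized via rpkm() instead of cpm().
logRPKM <- rpkm(y, gene.length=gene_lengths, log=TRUE, prior.count=0.1)
logRPKM_no_batch <- removeBatchEffect(logRPKM, batch=batch, design=design)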
In the custom code I actually plan to use for making the heatmap (I used heatmap() above only for simplicity), I have the following line:
data <- t(scale(t(logCPM_no_batch), scale=F))
Does this accomplish mean-centering in the way you propose?
For the purposes of this heatmap, it will just be an illustration in a paper. No downstream analysis will be done on the heatmap as such; that will instead be based on the edgeR differential expression analysis. Am I understanding you correctly that it will then not really matter if I skip length normalization when generating the heatmap?
I will try rpkm() just so I can see what the results look like.
Yes (though I find it easier just to subtract it directly) and yes (it'll cancel out when you centre the log-expression).
There is no need to mean-center because heatmap() does this automatically by default. See the 'scale' argument.
The only reason for scaling yourself is if you want to mean-correct but not scale the sd. In that case the fastest and easiest code is:
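# (Reconstructed sketch - the exact one-liner isn't preserved in this thread.
# It subtracts each gene's row mean from the logCPM matrix defined in the question.)
logCPM_no_batch <- logCPM_no_batch - rowMeans(logCPM_no_batch)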
Of course this will be irrelevant unless you turn off the scaling that heatmap() does. And it is more usual to allow heatmap() to scale both mean and sd.
Please use a larger value for prior.count. We use values in the range 3--5. It can make a difference.
As Aaron says, length normalization makes no difference. cpm() and rpkm() will give identical heatmaps.
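For example, the earlier cpm() call would simply become (a sketch of the suggested change):

# A larger prior count damps log-fold-changes for genes with very low counts
logCPM <- cpm(y, log=TRUE, prior.count=3)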
Thank you for your valuable reply, particularly with regard to prior.count and rpkm().
If I use scale=T instead of scale=F in the scale() call above, and set scale="none" in heatmap(), would this produce a heat map that has both mean and sd scaled, thus being more in line with what is usually done? This is how I interpret the scale() documentation, but I just want to make sure I am not going off the rails here.
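Concretely, the two alternatives I have in mind would be something like this (only a sketch, reusing the logCPM_no_batch matrix from above):

# Alternative 1: standardise each gene (row) myself, then turn off heatmap()'s own scaling
z <- t(scale(t(logCPM_no_batch)))       # scale=TRUE centres each row and divides by its sd
heatmap(z, scale="none")

# Alternative 2: leave the values alone and let heatmap() scale the rows for display
heatmap(logCPM_no_batch, scale="row")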
The reason I ask is that if I comment out the data <- t(scale(t(logCPM_no_batch), scale=T)) line and use the scale default in heatmap(), heatmap() complains that "'x' must be a numeric matrix".
It isn't possible for the code given in your post to give the error message you mention from heatmap() because removeBatchEffect() always produces a numeric matrix. You already told us in your original post that the code worked.
It would seem that you may have introduced a problem somewhere with other code that you haven't mentioned here. I suggest you solve that problem rather than using an ad hoc fix with scale().
I'm conflicted on whether or not to post this as a new question, but since it is a directly related follow-up I'll ask here.

Recently I wanted to show how accounting for batch in an RNA-seq differential expression analysis impacts the downstream results. Aside from just providing two topTable results for comparison (one each from a design with and without a batch covariate), I also wanted to show before/after heatmaps of the raw/corrected data.
I initially provided these results using the suggested workflow here, which is to provide the results from

cpm(dgelist, log=TRUE, prior.count=5)

to removeBatchEffect to get your "batch corrected data". Skimming through the removeBatchEffect documentation, however, I realized that it explicitly calls out that the function will utilize weights, if available, when estimating the batch effects. This made me think of voom, but passing in a voomed EList dictates that the prior.count is 0.5.
At the risk of beating a dead horse here, my question is: what's the official party line here? What would the source of the weights be that removeBatchEffect is referencing, if not from voom? I realize that the documentation in removeBatchEffect that points to weights may have been there since before voom, so perhaps it's another function that ships weights along with the input object x, but I'm not sure where that might come from.

There are also other weights, like the array weights from, well, the arrayWeights function. Obviously, removeBatchEffect is not restricted to RNA-seq data, so the use of weights is more general than just those from voom. My view is that I only ever use removeBatchEffect for visualization, so I don't bother worrying about weights to improve accuracy - the human eye can't distinguish between "very blue" and "very very blue" anyway. However, I do worry when I get a solid red/blue block because of extreme log-fold changes from low counts. Hence the use of a large prior count in cpm().
I had initially thought that it could be referencing weights from arrayWeights, but arrayWeights doesn't "load" those results into the EList (presumably what you pass into removeBatchEffect as x), but rather returns a vector of weights, so I wasn't sure if that was it. But I now realize that you can just use the dots (...) to pass in a weights argument, which is then passed down to lmFit, which then does the job, i.e. something along the lines sketched below. Sorry I didn't put 2 & 2 together like that before.
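A minimal sketch of what I mean (assuming the logCPM matrix, batch factor and design from earlier in the thread; arrayWeights is just one possible source of the weights):

# Estimate per-sample weights, then hand them to removeBatchEffect() via ...,
# which forwards them to lmFit() when the batch coefficients are estimated.
aw <- arrayWeights(logCPM, design=design)
logCPM_no_batch <- removeBatchEffect(logCPM, batch=batch, design=design, weights=aw)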
To address your point about extreme log fold changes blowing out the signal in your heatmap, I've found the
circlize::colorRamp2
(which is whatComplexHeatmap
uses by default to generate its colors) awesome here. Values outside of the specified breaks are simply set to the break value/color for plotting purposes, which is nice.
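Something along these lines (just an illustration; the break values and colours are arbitrary, and the centred matrix is the one built earlier in the thread):

library(circlize)
library(ComplexHeatmap)

# Cap the colour scale at +/- 2 on the centred logCPM scale, so a handful of extreme
# log-fold-changes from low counts don't wash out the rest of the heatmap.
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
Heatmap(t(scale(t(logCPM_no_batch), scale=FALSE)), col=col_fun)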