Dear Bioconductor,
I would like to ask about the normalization method I should follow regarding the issue of technical variation that a treatment could introduce on top of the extraction of RNA from different components of the cell.
Experimental design:
> samples group batch treatment Sample1 component1 1 treated Sample2 component1 1 untreated Sample3 component1 2 treated Sample4 component1 2 untreated Sample5 component1 3 treated Sample6 component1 3 untreated Sample7 component2 1 treated Sample8 component2 1 untreated Sample9 component2 2 treated Sample10 component2 2 untreated Sample11 component2 3 treated Sample12 component2 3 untreated Sample13 total 4 treated Sample14 total 4 untreated Sample15 total 5 treated Sample16 total 5 untreated Sample17 total 6 treated Sample18 total 6 untreated
Additionally, various kinds of spike-ins (specific modified and non-modified to the treatment) were used to evaluate the treatment. The specific modified / non modified were inserted before treatment and miRNA spike-ins after.
Following "RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR" and A: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-Inin order to assess if using spike-ins for normalization would be informative I performed normalization with edgeR and limma-voom (quantile or spike-ins), boxplots:
code used:
cpm <- cpm(x)
keep.exprs <- rowSums(cpm>1)>=3
x1 <- x[keep.exprs,,keep.lib.sizes=FALSE]
x1_TMM <- calcNormFactors(x1, method = "TMM")
x1_RLE <- calcNormFactors(x1, method = "RLE")
x1_uQ <- calcNormFactors(x1, method = "upperquartile")
x1_TMMwzp <- calcNormFactors(x1, method = "TMMwzp")
# unormalized
lcpm <- cpm(x1,log=TRUE,prior.count=5)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Unnormalized data",ylab="Log-cpm")
#normalized
#TMM
lcpm_TMM <- cpm(x1_TMM,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMM, las=2, col=col, main="")
title(main="B. TMM Normalized data",ylab="Log-cpm")
#TMMwzp
lcpm_TMMwzp <- cpm(x1_TMMwzp,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMMwzp, las=2, col=col, main="")
title(main="C. TMMwzp Normalized data",ylab="Log-cpm")
#RLE
lcpm_RLE <- cpm(x1_RLE,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_RLE, las=2, col=col, main="")
title(main="D. RLE Normalized data",ylab="Log-cpm")
#upper Quartile
lcpm_uQ <- cpm(x1_uQ,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_uQ, las=2, col=col, main="")
title(main="E. Upper Quartile Normalized data",ylab="Log-cpm")
#limma voom quantile
design <- model.matrix(~x1$samples$group+x1$samples$treatment)`
x1_voom <- voom(x1,design = design,normalize.method = "quantile")
N <- colSums(x1$counts)
#all spike-ins
nf <- calcNormFactors(spikeinsdf,lib.size = N)
spike_in_voom_all_spike_ins <- voom(x1,design,lib.size = N*nf)
boxplot(spike_in_voom_all_spike_ins$E, las=2, col=col, main="")
title(main="G1. voom-all-spike-ins Normalised data",ylab="Log-cpm")
.
.
.
`
We want to do differential expression between treated, non-treated samples for each cell component but also between components:
total treated vs total non-treated, component1 treated vs component1 non-treated, component1 treated vs component2 treated...
Concerning the differential expression between component1-component2, there is a technical issue in which the proportion of RNA amount isolated from 10 million cells differs. Even if we start from the same amount of RNA, final amount of the isolated RNA is higher in the component1 than in the component2. For library preparation 1μg of starting RNA was used but in terms of concentration, because of the small dimension of the component2 it would have been better to use only 0.1μg RNA, but this was not possible because of the library preperation protocol. In terms of "percentages", it is like having RNA from 1 cellular component1 and ~5 from cellular component2.
- Should I take in consideration this issue and provide an adjustment variable in the normalisation step or after?
- Regarding the boxplots, it seems that limma voom transformation with quantile normalization relatively performs better than the others. So, in your opinion should I proceed with this normalisation for downstream DE analysis?
-
In the afformentioned code, is this the correct way to create the model matrix? Between treatment and cell component do I have to use duplicate Correlation or blocking? Another way of to create the model matrix:
design <- model.matrix(~0+batch+group:treatment)
limma voom transformation with TMM, RLE, upperQuartile normalization
Dear James,
thank you for the answer and your suggestions on this matter. Sorry for the delayed reply.
My question is not directly linked to the Scopus of this group, but I have to "gain the necessary expertise by myself". As I couldn't find any publication regarding the issue of normalization between components of the cell (there are, but they don't describe anything about the normalization part (between components) I don't know if they didn't take it into account or performed another protocol for sequencing) I made this question to the experts of Bioinformatics/Biostatistics community I trust more.
If I have to restate my question then :
1) I created the design matrix following limma user guide: 9.7 Multi-level Experiments and the answer of this Q Limma Design for Paired Samples Question
I consider of using also
as Gordon Smyth kindly suggests in this Q LIMMA : group specific effect of a treatment while controlling for individual effect
my updated design matrix using TMM normalization and voom (this part remains an issue with the normalization between components and treated-nontreated) :
Is this correct? Do I have to consider another factor regarding samples from the total that are not paired?
2) I included the values of spike-ins in the final expression matrix as "controls" for DE. (I have read the posts/pubs about the problems regarding spike-ins, A: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-In, https://support.bioconductor.org/p/63339/#63340)
I found all of them DE with 3<logFC<5 and adj.p<0.0005 in the component1 treated vs non-treated,
same for component2 treated vs non-treated.
In the total treated vs non-treated only 4 were found DE with adj.p<0.05 and 1.9<LogFC<2.
In treated component1 vs 2 and non-treated comp1 vs comp2, I didn't find any. (In this one is the issue with normalisation between components)
From the results mentioned above, I consider of not doing any DE analysis between treated/non-treated samples (for several reasons).
But If I want to use a normalization factor on top of TMM normalisation before using voom I have to apply it in the offsets? (and more correctly but it's beyond the scope of this support page, Should I? is it correct to apply a correction on top of already normalized data? I have read in different posts that it is not recommended to use a second normalization, for example TMM normalized and then with voom, quantile normalization)
3) If I set a cutoff for each comparison regarding the ranges of logFC of DE spikes-ins, every feature found DE with |logFC(feature)|>|logFC(spike-ins)| could be more "trusted" result of true positive hit or my naiveness leads me to false assumptions?
Best regards
Konstantinos
1) No (don't subset to
$E
, repeatvoom
andduplicateCorrelation
), and no.2) No, scaling normalization is not additive. It is not valid to apply one set of normalization factors on top of another.
3) No, it doesn't make much sense to compare log-fold changes of genes to spike-ins.
You should think about what the spike-ins are, and what you want them to do.
If you went for the first option; if you have major differences in total RNA content between compartments; and if you used TMM normalization prior to a DE analysis; your spike-ins will always show up as "DE", simply due to composition effects. If you don't care about total RNA content, the spike-ins are more or less useless. Well, for DE analyses at least - you can still use them for modelling technical variability, as we often do for scRNA-seq data analyses.
Dear Aaron,
thank you for your answer!! Unfortunately, I just discussed with the experimentalist; the spike-ins haven't been added to each sample proportional to the number of cells. At least now I could suggest that for future experiments.
But with this data, is it correct to proceed with the workflow mentioned above?