Hi, I am analyzing ATAC-Seq data to test for differentially open chromatin among two genotypes of mice (n = 5 per group). Following Lun & Smyth 2014, I combined reads from all 10 samples into a single BAM, called peaks (Genrich), and counted reads per peak region for each sample (featureCounts). I then ran limma following common guidance (example code below), and the resulting MA plot shows many more significantly negative logFCs (~4300) than positive logFCs (~900), especially at high mean logCPM.
To evaluate whether this pattern was a result of something specific in limma, I also tried edgeR and DESeq2 and get the same pattern (not shown). Is this something to be concerned about, and if so, do you have any suggestions to remedy it? Perhaps quantile normalization?
A related question is whether we can conclude that these patterns are due to inadequate data processing or if they reflect a true biological signal. We do not have any a priori reason to believe that openness would tend to decline in this comparison, nor do we have any reason to believe that there would be few or many significantly differentially open regions (we observe ~5k significant of 80k total peak regions in these tests).
Thank you for your help!
cnt = DGEList(counts.all)
cnt$samples$genotype <- genotype
design <- model.matrix(~0+genotype)
keep.exprs <- filterByExpr(cnt, design)
cnt <- cnt[keep.exprs, , keep.lib.sizes=FALSE]
#Alternative filtering that also leads to same pattern ultimately
#keep <- aveLogCPM(cnt) > 1
#cnt <- cnt[keep, , keep.lib.sizes=FALSE]
cnt <- calcNormFactors(cnt, method = "TMM")
v <- voom(cnt, design, plot=TRUE)
contr.matrix <- makeContrasts( AvB = genotypeA-genotypeB, levels = colnames(design))
vfit = lmFit(v, v$design)
vfit = contrasts.fit(vfit, contr.matrix)
efit = eBayes(vfit, robust=TRUE)
eres <- decideTests(efit, p.value = 0.05)
plotMD(efit, column=1, status=abs(eres[,1]), main=colnames(efit)[1], ylab="log2(fold-change)", xlab="Mean log(CPM)", legend=FALSE, cex=0.5)
abline(h=0, col="darkgrey")
If you're not going to show DESeq2 code, don't tag it with
DESeq2
. You can hardly expect the DESeq2 authors to provide any debugging advice on their package by looking at your limma code.Ok, sorry, my misunderstanding of the posting guidelines. However, I believe my question is not so much about debugging and more about whether there is any reason to be concerned or to change my analysis based on the bias toward significant negative log fold changes at high logCPM in the MA plots - a pattern observed in all three programs when following common guidance. Any thoughts are greatly appreciated.
The pattern seems surprising. It would helpful to have the plot of the voom command to view the mean-variance trend . I think you should also look at the pair plots of each genotype (2 matrices of 5x5 plots), then pair plots of 3 samples of each genotype together (leading to a matrix of 6 x 6 plots).
Thanks for your response. Here are two voom plots:
voom plot, before
voom plot, after
And, here are pair plots, in which the first 5 are one genotype and the other five are another genotype:
pair plots
voom plots looks fine. The pair plot would lead me to remove the last sample (last column) as you probably guessed. Hope others will contribute here.
Thanks, I agree that the voom plots seem fine. I can also add that an MDS plot shows the samples clustering by genotype nicely. And, I tried removing the last sample as you suggested and the results are the same (MA plot bias, MDS plot, voom plots, etc.).
It seems to me limma is behaving as it should, and perhaps the pattern reflects a real biological response. However, I may be missing something more fundamental. For example, if I run voom with quantile normalization (which may or may not be appropriate), the bias disappears. I am curious to hear what others think.
MA plot after quantile normalization