My data has two variables: treatment (untreated vs treated) and exon number (control, 1 and 9), each with 3 replicates. So 18 samples total but 6 unique groups (untreated-ctrl, treated-ctrl, untreated-1, treated-1, untreated-9, treated-9).
I started by coding everything into one design matrix, then making contrasts for all the comparisons I care about
# No intercept, all data
design <- model.matrix( ~0 + group)
contr.matrix <- makeContrasts(
Ex1Unt = untreated_1 - untreated-ctrl,
Ex9Unt = untreated_9 - untreated-ctrl,
Ex1Treat = treated_1 - treated_ctrl,
Ex9Treat = treated_9 - treated_ctrl,
Ex1 = treated_1 - untreated_1,
Ex9 = treated_9 - untreated_9,
levels = colnames(design))
v <- voom(dge, design, plot=TRUE)
fit <- lmFit(v, design)
fit <- contrasts.fit(fit.voom, contrasts=contr.matrix)
fit<- eBayes(fit.voom, trend=TRUE)
summary(decideTests(fit))
Ex1Unt Ex9Unt ...
Down 2099 968 ...
NotSig 12434 14619 ...
Up 1396 342 ...
My understanding is that since my design matrix is orthogonal, the intercept vs no-intercept approach should give me the same output. So I tried the analysis with an intercept but it gave me very different numbers.
This time I subsetted the data to just the untreated samples (untreated_ctrl, untreated_1, untreated_9)
# With intercept, just Untreated data
design_meanref <- model.matrix(~group)
v <- voom(dge, design_meanref, plot=TRUE)
fit <- lmFit(v,design_meanref)
fit <- eBayes(fit)
summary(decideTests(fit))
(Intercept) groupingEx1 groupingEx9
Down 493 2473 1311
NotSig 1695 11336 12958
Up 13041 1420 960
My hunch was that this is different because this was just a subset, though I'm not sure why that would be the case. But I tried the first approach again, this time with the same subset.
# No intercept, just untreated data
Ex1 Ex9
Down 2487 1333
NotSig 11319 12941
Up 1423 955
So two questions here:
1) Why were the numbers so different in my first and second analyses? Why did including all the data change the expression analysis so much even though I was looking at distinct groups in my comparisons? Which is more accurate?
2) Why did my third approach give me very close, but not the exact same result as the second?
Thank you!
Hi Gordon, thanks for the clarification regarding subsetting the data. I re-ran it with the whole data set with and without the intercept. I did rename some of the variables for simplicity/clarity in my original post and I've updated my annotation.
No intercept, all data
With intercept, all data
The results comparable here are the first two columns of the first output, and the last two columns of the second output (comparing untreated ex1 and ex9 to untreated_ctrl as the reference). The difference is not that much, but it's not exactly the same. I read from an old thread that this could be the case with non-orthogonal data, which mine is not.
You should write functions or loops/lapply constructs to ensure that the exact same chunk of code runs with different params (here that would be the design). By eyeballing I see:
...as a difference, that is already the case in the toplevel question.
Thank you for the extra set of eyes! I was trying a lot of different options, including trend, and I forgot to remove it when copy+pasting since I've been staring at it for so long. I'm getting the same results now.