We are doing a loss-of-function shRNA-Seq experiment with 185 shRNA for 37 different genes. We have two conditions (WT and MUT) with each two groups (before and after selection). For each we have five replicate (one only with four0.
we have the following sample data:
sample condition group A1 WT Input A2 WT Input A3 WT Input A4 WT Input A5 WT Input A6 WT after sel. A7 WT after sel. A8 WT after sel. A9 WT after sel. A10 WT after sel. B1 MUT Input B2 MUT Input B3 MUT Input B4 MUT Input B5 MUT Input B6 MUT after sel. B8 MUT after sel. B9 MUT after sel. B10 MUT after sel.
after reading the paper/tutorial I was thinking on how to apply it to my data and would like to ask for advice.
We are interested in two things
1. first we would like to know genes/shRNAs which are changed within each of the conditions
2. but mainly we would like to know which genes are showing a dropout in the MUT compared to the WT, as to be able to amrk them as gene targets.
I have created the DGE object
$ x.dge An object of class "DGEList" $counts A1_R1 A2_R1 A3_R1 A4_R1 A5_R1 A6_R1 A7_R1 A8_R1 A9_R1 A10_R1 B1_R1 B2_R1 B3_R1 Abcf1_TRCN0000113400 1538 2064 1817 3197 2600 876 1057 1645 808 1165 1348 1741 1968 Abcf1_TRCN0000113401 1742 1982 2450 3221 4532 1023 1187 1411 1152 1361 2206 3111 3368 B4_R1 B5_R1 B6_R1 B8_R1 B9_R1 B10_R1 Abcf1_TRCN0000113400 1257 2743 1140 1202 1619 367 Abcf1_TRCN0000113401 3579 1573 1325 1781 2620 1204 167 more rows ... $samples sample condition group lib.size norm.factors 1 A1 WT Input 226541 1 2 A2 WT Input 246536 1 3 A3 WT Input 246198 1 4 A4 WT Input 356525 1 5 A5 WT Input 317193 1 14 more rows ... $genes gene clone [1,] "Abcf1" "TRCN0000113400" [2,] "Abcf1" "TRCN0000113401" [3,] "Abcf1" "TRCN0000113402" [4,] "Abcf1" "TRCN0000113403" [5,] "Abcf1" "TRCN0000113404" 174 more rows ...
For my first question I was wondering if it is possible to do it from within this experimental design, such as like
des = model.matrix(~0 + condition + group)
Would this gives me the pair-wise comparison for WT and MUT each separately? or should they better be separated?
For my second (main) question, I am not sure which way to go.
I was thinking about using the following design
des <- model.matrix(~condition+group)
or even
des <- model.matrix(~0 + condition + condition:group)
I would appreciate it, if you can suggest me the (more) correct experimental design for the two questions i have here.
thanks in advance.
Assa
Thanks for the reply and sorry for the late response.
when using the makeContrast function for the genotype-specific columns I took
Does this mean that shRNAs with positive values are those which will be expressed stronger in the mutant. Is that true?
To get the
topTags
I than need to (following your tutorial)estimateDisp
, fit the negative GLM and calculate the LRT. Is that correct?I am not quite sure what you did with setting the design matrix to full rank by deleting the two columns of
genotypeMut.selectionInput
andgenotypeWT.selectionInput.
Can you please explain me why it was necessary?
Firstly, your
contmat
contrast won't work because the names aren't syntactically valid. You should replace:
with.
, which was the whole point of usingmake.names
in my original answer.The interpretation of the log-fold changes from your
contmat
contrast is somewhat complex. (I'll call these interaction log-fold changes to distinguish them from the selection log-fold changes within each genotype.) Positive interaction log-fold changes mean that selection log-fold change in mutants is greater than the selection log-fold change in the wild-type, in a literal mathematical sense. The biological interpretation depends on the signs of the selection log-fold changes:All of these scenarios yield a positive interaction log-fold change, but can have different interpretations regarding the nature of the differences in the selection effect between genotypes. (Flip around the logic for negative interaction log-fold changes.) That's why it's important to also report the individual selection log-fold changes in your results for this comparison.
Regarding your code: I don't know why you're switching between the QL and non-QL functions. Just use the QL methods throughout, they are more accurate than the non-QL methods for most (well-behaved) data sets.
Regarding the rank: if you don't discard those two columns, you'll notice that the vector obtained by taking the row sums of the
pair
columns is the same as the vector obtained by taking the row sums of the remaining columns. This means that the columns of the design matrix are linearly dependent such that there is no unique solution for the coefficients, i.e., it is impossible to have a single best fit of the GLM. For example, let's say that I found a "best" fit where allpair
coefficients were equal to 1 and all other coefficients were equal to zero. But in this case, I could obtain the exact same fit by setting allpair
coefficients to zero and all other coefficients to 1; thus, there is no unique solution.By removing some columns, I break the linear dependencies and allow the GLM to be fitted. I removed the
Input
columns so that the remainingSelected
columns represent the selection effect (over input). If I removed theSelected
columns, the remainingInput
columns would represent the log-fold change of input over selection, which is somewhat more difficult to interpret. One could also conceivably remove some of thepairs
columns instead, though again, this is difficult to interpret.The contrasts matrix I have seen the problem already and changed it beforehand. the QL-function I haven't so thanks for that remark.
But i am not really happy with the results of the differential analysis.
I don't get really good results neither for the MUT comparison before against after
nor for the comparison between the two cell lines
For the comparison of the WT before against after, I get the best results (Which I didn't expect).
How much weight should I put on the FDR values? Or should I in this case concentrate more on the logFC values, as i have only ~180 shRNA molecules for approx. 30 genes.?
Only the FDR (or perhaps more precisely, the adjusted p-value relative to a FDR threshold) can be trusted for decisions relating to the hypothesis of interest, i.e., is there actually an effect or not? The log-fold change is purely a descriptive statistic, as it doesn't capture the variability of the underlying observations.
Anyway, I don't see why you're unhappy about your results. Your interaction contrast (middle) indicates that no gene exhibits a significant difference in selection effects between genotypes at a FDR of 5%. As things go, this is a pretty clear-cut result. The most parsimonious explanation would be that the genotype just doesn't have a big effect on selection. Maybe you don't like that, but maybe that's the truth. You should only talk about the "goodness" of your results when you know the ground truth, and you haven't mentioned any positive controls in this experiment.
Note that a slight increase in power for the wild-type selection effect is expected as you have more samples for the wild-type (5 selection/input pairs) compared to the mutant (4 selection/input pairs + 1 input sample that is effectively ignored upon blocking). However, if you're not seeing any significant interaction effects, then it's hard to be interested in the selection effects by themselves, as they could just represent the general effect of time/passages/etc.
Of course, I could imagine all sorts of things going wrong when you have so few features, or when your replicates are highly variable. You'll have to look at diagnostic plots (e.g., MA plots for normalization, MDS plots, mean-dispersion plots) to judge that and modify the appropriate edgeR parameters in response to obvious issues, e.g., setting
robust=TRUE
inglmQLFit
when you see outliers on the QL mean-dispersion plot.yes I was thinking the same thing. I have already done the QC plots and they do not look like I have expected it to look like:
plotBCV:
plotMDS
and the Smear plots
plotSmear(lrtWT, de.tags = de.genes, , main="WT Smear plot")
plotSmear(qlf, de.tags = de.genes, main="comparison between cell lines")
It really does not look like I was expected. The shRNa molecules which shows a large CPM have very small logFC.
The shape of the MA plots are mostly expected. Large log-fold changes at low abundance for each plot are caused by the high variance relative to the mean at low counts. This is why you should rely on the p-values for identifying genes of interest rather than looking at the reported log-fold changes.
I am a bit surprised at the presence of a clear distinction between low- and high-abundance genes. I don't know why some shRNAs would be so readily sequenced and others not as much, even before any selection has taken place. I would guess your input shRNA composition might not be properly balanced. Note, low abundances are unlikely to be caused by a decrease in abundance upon selection, as we would expect to see all of them to have strong negative log-fold changes in that case.
The other major issue with having so few shRNAs is that normalization via
calcNormFactors
assumes that most shRNAs have no effect. However, the strength of this assumption requires negative control shRNAs to assess, so unless you have them, it's hard to say anything either way.Just for the sake of testing I have tried to analyse only the first group of samples from
WT
(the first 10 samples).I have created a new
DEGList
object using only this subsetand run the DE analysis
Now I get (a bit) better results.
At least I have 19 shRNAs which show significant behavior, compared to 12 before that. I know it is not a lot, but, would it means, that it is better to run the analysis separately?
It seems that your definition of "better" is "more DE genes". This is incorrect unless you have positive controls. Why is it so hard to accept that there are not many (or any) DE genes?
And no, it is not "better" to run the analysis separately. (1) You don't get to share information about dispersion estimation across samples. (2) You can't compare between genotypes.
I hope you are not choosing your analysis parameters on the basis of what gives you the most DE genes. That would be p-hacking, which is Bad.