Dear group,
i would like to adress some specific design implementations in limma and ask your opinion about "valid" or appropriate they are for my comparison of interest. It refers to a cancer microarray dataset, and the phenoData looks like:
> head(pData(eset.4))
Meta_factor Disease Location Study
St_1_WL57A.CEL 0 Normal left_sided hgu133plus2
St_2_WL57A.CEL 0 Cancer left_sided hgu133plus2
St_N_EC59A.CEL 0 Normal left_sided hgu133plus2
St_T_EC59A.CEL 0 Cancer left_sided hgu133plus2
St_N_EJ31A.CEL 0 Normal right_sided hgu133plus2
St_T_EJ31A.CEL 0 Cancer right_sided hgu133plus2
As i have adressed in some previous posts, my initial goal was to identity DE genes regarding different anatomic tumor location, versus their adjucent controls. Thus, i used an interaction model, as my samples are paired-belong to each patient:
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.filtered)
design1 <- design[,-grep("Normal", colnames(design))] # get full rank
arrayw <- arrayWeights(eset.filtered, design=design1)
fit <- lmFit(eset.filtered, design1, weights=arrayw)
fit2 <- eBayes(fit, trend=TRUE)..
But now, my collaborators asked me if i can through limma to test directly for genes differentially expressed between right_sided and left_sided tumors, and not just comparing the two DE resulted lists for common genuine or diffenent DE genes (which is also of the same importance):
So, from the above design matrix, i have two coefficients DiseaseCancer:Locationleft_sided & DiseaseCancer:Locationright_sided, which represent the average log2-fold change between the left sided tumors vs their adjucent controls, and also the same for the right-sided tumors respectively. I understand, that trying to compare directly the two above coefficients, would probably give no results, as any differences would mostly absorved in the pairs factor. Thus, for my above question, after the above step of fit, by implementing:
cm <- makeContrasts(RvsL= DiseaseCancer:Locationright_sided-DiseaseCancer:Locationleft_sided, levels=design)
fit2 <- contrasts.fit(fit, cm)
fit3 <- eBayes(fit2, trend=TRUE)...
and with this contrast i can ask in which genes the log2-fold change of the right-sided tumors vs their controls, is significantly different in the left_sided tumors ? And this could at "least" partially give some insights in my question ?
Or my approach is incorrect ?
Dear Aaron, thank you for your quick confirmation !! So, i will need to rename the two specific column names in the design matrix, in order to not get any error produced by makeContrasts().
Regarding your extra suggestion: you meant by using duplicateCorrelation and avoid including the pairs factor in the design matrix, i can implement some direct comparisons, like directly comparing right-sided vs left-sided cancer samples right ? but this obviously will change the way variance is estimated, as this approach will model/estimate variability in absolute expression values between my patients, correct ? But is obviously will depend on the number of observations, that is my number of patients, which in this case in 30 paired observations(60 samples in total)..
The second approach is simply a description of what you could do, if you wanted to compare left/right-sided tumours directly; it's not a suggestion of what you should do. If the first approach is sufficient, then stick with it.
Dear Aaron, just a small update and your opinion on this matter: i runned -as you also suggested changed the 2 specific column names-- and procedeed:
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.filtered)
design1 <- design[,-grep("Normal", colnames(design))]
colnames(design1)[31:32] <- c("Disease_left_sided", "Disease_right_sided")
arrayw <- arrayWeights(eset.filtered, design=design1)
fit <- lmFit(eset.filtered, design1, weights=arrayw)
cm <- makeContrasts(RvsL=Disease_right_sided-Disease_left_sided , levels=design1)
fit2 <- contrasts.fit(fit, cm)
fit3 <- eBayes(fit2, trend=TRUE)
top <- limma::topTable(fit3, coef=1, number=nrow(fit2), adjust.method="fdr", sort.by="none")
and then when i used
top2 <- top[top$adj.P.Val<0.05,]
:top2
PROBEID SYMBOL ENTREZID logFC AveExpr t P.Value
1559_at 1559_at CYP2C9 1559 -0.6965133 6.725423 -5.483557 5.211610e-06
2245_at 2245_at FGD1 2245 -0.5707114 6.908230 -4.967785 2.302236e-05
259230_at 259230_at SGMS1 259230 0.5646353 6.845278 4.712484 4.792467e-05
2705_at 2705_at GJB1 2705 -0.5520401 8.499240 -4.751274 4.288131e-05
51195_at 51195_at RAPGEFL1 51195 -0.6789049 7.532551 -5.038683 1.877319e-05
5364_at 5364_at PLXNB1 5364 -0.5256315 7.393941 -4.725663 4.614833e-05
55500_at 55500_at ETNK1 55500 -1.4986307 8.578999 -5.162079 1.315836e-05
7485_at 7485_at WRB 7485 0.6328237 8.620731 4.844169 3.284587e-05
7991_at 7991_at TUSC3 7991 0.5945766 6.283816 5.294618 8.981564e-06
adj.P.Val B
1559_at 0.03948384 3.708941
2245_at 0.04144946 2.463368
259230_at 0.04793532 1.845013
2705_at 0.04793532 1.938921
51195_at 0.04144946 2.635072
5364_at 0.04793532 1.876916
55500_at 0.03948384 2.933685
7485_at 0.04793532 2.163908
7991_at 0.03948384 3.253892
As you can see, even with the implementation of arrayWeights, i get only 9 DE genes, with adjusted p-values near to 5% of FDR---thus, although this is obviously a sign of week evidence of significance for my contrast above--But could still search these genes individually for any relevance to my studied cancer and location related ? Or again as my results are very "weak", any approach to this direction would be trivial ?
Alternatively, based on your second approach mentioned with duplicateCorrelation() for the direct comparison:
How i can formulate my design matrix, so that i will have in columns cancer samples right_sided and left_sided, in order to implement them in makeContrasts ??
i naively tried:
pairs <- factor(rep(1:30, each=2)) # to use afterwards in duplicateCorrelation
design1 <- model.matrix(~ 0 + Disease +Location:Disease, data=eset.filtered)
colnames(design1)
[1] "DiseaseCancer" "DiseaseNormal"
[3] "DiseaseCancer:Locationright_sided" "DiseaseNormal:Locationright_sided"
and design2 <- model.matrix(~ 0 + Disease +Location, data=eset.filtered)
colnames(design2)
[1] "DiseaseCancer" "DiseaseNormal" "Locationright_sided"
But obviously this isn't what i want to compare, which is right_sided tumors vs left_sided tumors..
Well, it is what it is. The only thing I can say is that I'd expect about 5% of those genes to be false discoveries. Whether or not this acceptable for your follow-up studies is up to you.
If you want to do a direct comparison, you should create groups:
Then you can just compare right- and left-sided tumours directly.
Thank you again for your help and comment---probably i should give more emphasis to the already acquired DE separate lists between each location cancer vs their adjucent controls, and perhaps the direct approach could yield some different insights on this matter----regarding your code above, i should not drop any of the Normal levels, before using duplicateCorrelation, right ?
You may indeed get more power with
duplicateCorrelation
, as the information isn't being wasted on trying to estimate the 30 batch effects. And no, you shouldn't drop any of the normal levels in the one-way layout I've described above.Dear Aaron, sorry to return to this post, but i would like to ask you two very important "updated" questions and how i could interpret them !!
1) In detail, after i also used the duplicateCorrelation about the direct comparison of Right_sided vs Left_sided tumors(15 vs 15), in which (with also the implementation of arrayWeights & duplicateCorrelation), i resulted in a final list after topTable, with 27 DE genes with an adjusted p-val < 0.05. However, only 18 of these had an absolute value of logFC > 0.5, while the others had an absolute logFC from ~3.5 to 4.9. On the other hand, when i used these 28 genes to create a heatmap--also using only my cancer samples--there was a great separation of left and right cancer samples (with 1-2 mislabeled samples). Thus, what is your opinion on this matter ? That despite the fact that in this specific case there is not a great amount of differential expression (and possibly notable changes-but this could be attributed to also various reasons) between "right" and "left" cancer samples, could these 28 genes still considered "valid candidates" for follow up-studies or further research ? As they also show a relatively good separation in my heatmap plot ?
2) Finally, i would like to ask you about the two lists i have also acquired from limma, regarding the comparison of each tumor anatomic location vs their adjucent controls:
So, my crusial question is: in order to identify DE genes that are unique to each Location (vs their paired controls), i should:
I) From topTable already acquired for specific Location -i.e. Right-sided tumors vs their adjucent controls:
get the DE genes. So for this specific purpose, i should consider only the adjusted p-value cutoff ? i.e. adjusted p.val < 0.05 ? Or i could also consider a further cutoff ? like Treat ? with something like lfc=0.5 ? But with this i could loose genes for the intersection below?
II) And then, in the other list (Left sided tumors vs controls), use in the topTable (after using confint=0.95) 3 simultaneous cutoffs:
And finally, with the intersection of these two above lists(probesetIDs, as they are in the sameExpressionSet), i will have identify the non-DE genes in the second list ? That is in my example, the Left_sided list?
Best,
Efstathios
It's not possible to give a general answer to your first question. Whether those genes are biologically meaningful or interesting is best answered by biological collaborators, if you have any willing to do follow-up work. All I'd be willing to say is that they are significantly differentially expressed.
For the first part of your second question, if you don't have many DE genes, I'd be disinclined to use TREAT. This is because it's more conservative and that's not what you want when you're trying to squeeze more results out.
For the second part of the second question, your code will identify non-DE genes in whatever contrast you used to generate
top
. Whether it's sensible to intersect this with another list is up to you.Dear Aaron,
thank you for your thoughts on this matter !!! Actually (also with the implementation of arrayWeights), i get a lot of DEs for both locations--so perhaps, a use of TREAT with a same cutoff of logFC could reduce my results to the "most interesting", especially for the log fold change interpretation (or in a subsequent pathway analysis). Regarding the first question about the 27 genes, i will definately discuss it with my collaborators--i mainly mentioned it, because, it "impressed me a bit": as these genes do not show a "great alteration of expression"-but separate relatively well my cancer samples, based on location, which is the most interesting thing.
Dear Aaron, one last comment regarding my yesterday questions---based also on a useful comment from a previous post (C: Pathway analysis of differentially methylated CpGs) regarding the use of argument robust---i tried it also in my direct comparison between tumors, to see if a get a bigger number of DE genes(from the 27 genes i reported). And then, i saw when i used robust=TRUE, the number of DE genes was reduced slightly from 27 to 24. I acknowledge that this change is possibly negligible, but essentially robust=TRUE should not increase the power for "more DE" ? or my notion is incorrect ? The code used is below:
Best,
Efstathios
Robustness just ensures that outliers don't distort the EB statistics. This is not guaranteed to increase or decrease the number of detections - for example, if all your outliers were below the other points, you could imagine that they were pulling down the mean-variance trend, so removing them would result in larger variance estimates and reduced detections. Now, in general, robustifying results in more DE genes because the prior d.f. usually increases when outliers are ignored. In your case, a difference of 24 to 27 is negligible so I don't think it's worth worrying about.