Dear Bioconductor Community,
i have returned with a previous problem i have faced from a previous post (Possible implementation of multifactorial contrasts in limma regarding a microarray dataset). To provide a small summary from the link, with the great assistance and crusial suggestions of Aaron Lun i implemented in my merged colorectal affymetrix dataset, comprised of 30 different patients with two samples each(cancer and adjucent control), a multifactorial design in limma, incoprorating except from the "paired comparison" the specific tumor location of each patient:
head(pData(eset.2))
Disease Location Meta_factor Study
St_1_WL57.CEL Normal sigmoid_colon 0 hgu133plus2
St_2_WL57.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EC59.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_EC59.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EJ58.CEL Normal cecum 0 hgu133plus2
St_T_EJ58.CEL Cancer cecum 0 hgu133plus2
with the following code:
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2)
design <- design[,-grep("Normal", colnames(design))] # get full rank
colnames(design)
fit <- lmFit(eset.2, design)
fit2 <- eBayes(fit, trend=TRUE)
Moreover, as Aaron mentioned, except from the first 30 coefficients which represent to the patient-specific blocking factors, the coefficients 31:39 correspond to the location-specific log-fold change of cancer vs control samples.
Thus, for istanse top2 <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none") represents the log-fold change of the 8 ascending colon tumor samples versus their adjucent controls
So, to adress and highlight my problem, although there are 9 coefficients which represent 9 different anatomic locations, only in the above(ascending colon) i get from topTable DE genes with an FDR < 0.05.
table(pData(eset.2)$Location)
ascending_colon cecum descending_colon left_colic_flexure
10 12 6 2
rectosigmoid_colon rectum right_colic_flexure sigmoid_colon
4 8 6 10
transverse_colon
2
As you can see from the above, although some comparisons include a very low number of samples, in the other comparisons it is very weird (like cecum and sigmoid colon) not to return any DEG genes. I acknoledge that each study has its specific details and experimental design regarding the biology of each system, as also other various reasons related with the analysis itself, but as it is a remaining issue i also discussed with my lab members and remains under discussion, i would like to adress some more questions(as im mainly a molecular biologist and perform data analysis about a year) to get also a statistical view or suggestions on the matter:
1. As Aaron also suggested various possible explanations, i would like first to highlight one of them(and please correct me for any misunderstanding or misconception): how it is possible(from a statistical view) ,a high variability in a specific location(although patients are different) to mask the DE genes expression in the other locations ? Even if generally cancer vs control samples generally yield a lot of different gene expression patterns ? And furthermore, as this model Aaron suggested models the variability in the cancer/normal logFC between the patients ?
2. One other suggestion was to construct an MDS plot but based on the log-fold changes of each of the above nine coefficients. Thus, i have to use something like this:
q <- fit2$coefficients[,31:39] and then run: plotMDS(x=q, top=500, gene.selection="pairwise") ?
And if the above implementation is correct, i should inspect the plot for samples that "behave" as outliers or gather far from the others ?
3. To continue from the question 3, then i should remove the specific patient samples from my expression set, and then re-run the analysis ?
I know also the function removeBatchEffect() but i have never use it and i dont know if an how fits for my current problem ?
4. Finally, there are any other diagnostic plots or ways to adress the problem, or alternative solutions ?
5. To mention, before performing limma, from my EDA plots like clustering the samples and PCA, one specific control sample seems to be a common outlier. Also, i have perfomed paired analysis with limma without the location information
(*Please excuse me for any naive questions and my long message, but as im generally new in R and i dont have many experience in these specific issues and statistical models, i would like to cover any important part of my analysis and also to gain as much as possible feedback !!)
Dear James,
thank you for your answer and general advice further off the specific topic above. With all respect, i never have(or had) a purpose of disturbing any of the specialists on the field and also the moderators on repeated questions here, and especially on every matter or every project i have(and please excuse me if i have crossed that line without an intention of doing so or didnt realize that i had). Regarding your nice proposals of MOOCs, I have also attended from the last year many courses on edx on biological data analysis, and currently in coursera regarding the genomic data science specialization and downloading important books. However, despite the great knowledge, wonderful material and many help from the moderators on various topics(including the above one), the hard truth is that when it comes to a real data analysis with specific issues, it is really essential for people like me (in their start of the phd) to have a feedback from people like you, especially when im from a molecular biology field. So, it is really important for me although many times i have developed ideas and methodologies regarding my projects, to get a further validation in this wonderful group(even if sometimes" i became like one of the 3 stooges"), because i have a strong belief that in a challenging field like this, i prefer to get a further confirmation about my analysis rather than implementing ideas or "simple" tutorials, and then producing erroneous and biased results (without acknowledging the true nature and puprose of this group). Thank you again for your help and guidance !!