Dear Community,
I'm currently trying to analyze one microarray dataset comprising 3 different cell lines/batches, each with the same experimental design: deletion (CRISPR/Cas9) of a specific gene, with WT (wild type) samples versus knock-out samples. After importing, normalizing and filtering all samples together (oligo R package, rma), the experimental design looks like this:
eset.2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 36451 features, 17 samples
element names: exprs
protocolData
rowNames: 01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL ...
6-H9_FasKO-36_(HuGene-2_0-st).CEL (17 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL ...
6-H9_FasKO-36_(HuGene-2_0-st).CEL (17 total)
varLabels: index Condition_detailed Cell_line Condition_Fas
varMetadata: labelDescription channel
featureData
featureNames: 16657436 16657440 ... 17118478 (36451 total)
fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.2.0.st
head(pData(eset.2))
index Condition_detailed
01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL 1 WT_reconstituted
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL 2 WT_reconstituted
03-4_(HuGene-2_0-st)_CEM-FasKO_FasWT_4.CEL 3 WT_reconstituted
08-10_(HuGene-2_0-st)_CEM-FasKO_pcDNA_10.CEL 4 KO_clone
09-11_(HuGene-2_0-st)_CEM-FasKO_pcDNA_11.CEL 5 KO_clone
1-CEM_WT_(HuGene-2_0-st).CEL 6 WT_parental_tech1
Cell_line Condition_Fas
01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL CEM WT
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL CEM WT
03-4_(HuGene-2_0-st)_CEM-FasKO_FasWT_4.CEL CEM WT
08-10_(HuGene-2_0-st)_CEM-FasKO_pcDNA_10.CEL CEM KO_clone
09-11_(HuGene-2_0-st)_CEM-FasKO_pcDNA_11.CEL CEM KO_clone
1-CEM_WT_(HuGene-2_0-st).CEL CEM WT
table(pData(eset.2)$Cell_line)
       CEM         H9 MDA_MB_231
        11          3          3
table(pData(eset.2)$Condition_Fas,pData(eset.2)$Cell_line)
           CEM H9 MDA_MB_231
  KO_clone   6  2          2
  WT         5  1          1
However, the major issue, a batch effect, is clear on the corresponding MDS plots (links below): the 3 cell lines cluster in clearly distinct regions, whereas the individual biological conditions are not clearly separated. I acknowledge the potential bottlenecks in this experimental design; how should I proceed to take this issue into account?
https://www.dropbox.com/s/cng4b0q7djtshbm/MDSplot.FasExp.NormFiltered.CellLine.tiff?dl=0 https://www.dropbox.com/s/5vsztfun72auzh7/MDSplot.FasExperiment.Filtered.Norm.Condition.tiff?dl=0
In your opinion, should I use the general condition WT vs KO and block on the Cell_line variable? Or would this be biased, since each cell line might show a "different biological behaviour" in response to the targeted genome editing, so that a general DEG list would not reflect the differences between the cell line phenotypes?
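For reference, "blocking on Cell_line" would usually mean an additive design in limma. The following is only a minimal sketch under stated assumptions: `pheno` is a hypothetical stand-in for pData(eset.2), rebuilt from the counts in the table above, and `expr` is simulated noise used purely so that the calls run; neither is the real data.

```r
# Hypothetical sample sheet rebuilt from the WT/KO counts per cell line.
pheno <- data.frame(
  Cell_line = factor(rep(c("CEM", "H9", "MDA_MB_231"), times = c(11, 3, 3))),
  Condition_Fas = factor(
    c(rep(c("KO_clone", "WT"), times = c(6, 5)),   # CEM
      rep(c("KO_clone", "WT"), times = c(2, 1)),   # H9
      rep(c("KO_clone", "WT"), times = c(2, 1))),  # MDA_MB_231
    levels = c("WT", "KO_clone")                   # WT is the reference level
  )
)

# Additive model: cell line as blocking factor, one common KO-vs-WT effect.
design <- model.matrix(~ Cell_line + Condition_Fas, data = pheno)
print(colnames(design))

# Optional: fit with limma if it is installed (simulated values, assumption).
if (requireNamespace("limma", quietly = TRUE)) {
  set.seed(1)
  expr <- matrix(rnorm(100 * nrow(pheno)), nrow = 100,
                 dimnames = list(paste0("probe", 1:100), NULL))
  fit <- limma::eBayes(limma::lmFit(expr, design))
  print(limma::topTable(fit, coef = "Condition_FasKO_clone", number = 5))
}
```

With the real data, `expr` and `pheno` would be replaced by eset.2 and pData(eset.2). This model assumes the KO effect is the same in all three cell lines, which is exactly the assumption questioned above.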
Alternatively, should I perform pairwise WT vs KO comparisons within each cell line? My additional concern here is that in two of the 3 cell lines (H9 & MDAMB231), there is only one biological replicate/sample of the wild type.
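The within-cell-line comparisons could be sketched as a single group-means model with one contrast per cell line, rather than three separate fits. Again this is only an illustration under assumptions: `pheno` is a hypothetical reconstruction of the sample sheet from the counts above, and `expr` is simulated, not the real expression matrix.

```r
# Hypothetical sample sheet rebuilt from the WT/KO counts per cell line.
pheno <- data.frame(
  Cell_line = rep(c("CEM", "H9", "MDA_MB_231"), times = c(11, 3, 3)),
  Condition_Fas = c(rep(c("KO_clone", "WT"), times = c(6, 5)),
                    rep(c("KO_clone", "WT"), times = c(2, 1)),
                    rep(c("KO_clone", "WT"), times = c(2, 1)))
)

# One group per cell line / condition combination (6 groups in total).
group <- factor(paste(pheno$Cell_line, pheno$Condition_Fas, sep = "."))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# KO minus WT within each cell line; rows follow the column order of design.
contrast.matrix <- cbind(
  CEM        = c(1, -1, 0,  0, 0,  0),
  H9         = c(0,  0, 1, -1, 0,  0),
  MDA_MB_231 = c(0,  0, 0,  0, 1, -1)
)
rownames(contrast.matrix) <- colnames(design)
print(contrast.matrix)

# Optional: fit with limma if it is installed (simulated values, assumption).
if (requireNamespace("limma", quietly = TRUE)) {
  set.seed(1)
  expr <- matrix(rnorm(100 * nrow(pheno)), nrow = 100,
                 dimnames = list(paste0("probe", 1:100), NULL))
  fit <- limma::lmFit(expr, design)
  fit2 <- limma::eBayes(limma::contrasts.fit(fit, contrast.matrix))
  print(limma::topTable(fit2, coef = "H9", number = 5))
}
```

Because all 17 samples are fitted together, the residual variance is estimated with 11 degrees of freedom, so the H9 and MDA_MB_231 contrasts remain estimable even with a single WT sample. They would still rest heavily on that one array, so the concern about replication stands.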
Overall, my goal is to identify any DEGs related to immunity, based on the effect of the deleted gene, that is, the comparison of WT vs KO samples.
Any suggestions or ideas for this challenging scenario would be greatly appreciated!
Kind Regards,
Efstathios
Dear James,
Thank you for your comment. In fact, I'm fully aware of the tradeoffs and possible solutions. However, even though I have analyzed similar data in the past, I had never encountered a scenario with such a strong batch effect. That is why I posted here, so that more experienced people and specialists in the field like you could provide an extra opinion on this matter.
Overall, if I could reformulate my question: judging from the MDS plots, do you think this batch effect could be addressed by blocking on the confounding variable in the design matrix? Or would a more "stringent" solution, like batch effect correction followed perhaps by something like a weighted least squares test, be the alternative?
Thank you in advance,
Efstathios
My point is that you aren't asking about how to use the software, and you apparently understand that there are tradeoffs. So the only question left is to decide what tradeoffs are acceptable and what you should do. This is an analysis decision, and as the analyst it is up to you to make the decision. Peter and Ryan can tell you what they think are reasonable things to do or try, but you know what? They haven't seen your data, so they are just making guesses. If you want to base your analysis on the guesses of people who haven't actually dealt firsthand with your data, then that is one approach, but I am not sure that's the optimal way to proceed. YMMV.
Dear James, thanks again for bringing up this point. Indeed, my main intention was to gather some different ideas/suggestions on this matter, and then of course carry out my analysis.