Question

Appropriate implementation of ComBat function for known batch effect correction and alternative methodologies of merging microarray datasets

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 9 days ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

i have acquired two new colorectal cancer affymetrix datasets, with different platforms--hgu133a and hgu133plus2---comprising of different patients, but with identical phenotype variables of study. I will try briefly to describe my pipeline analysis: as i wanted to use customCDFs, with ultimate goal of merging both datasets(in order to perform some statistical comparisons which are limited due low sample size in each dataset and also some other procedures, like machine learning), the brainArray team kindely provided with a python script program, with which i splitted each hgu133aplu2 CEL file, to A platform CEL files(in order to get only the A platform signals) and B CEL files. Thus, ---by keeping only the A CEL files--, i have the same identical probesets in each dataset, and also to use the same annotation, that is hgu133a.

So, i pre-processed separately each dataset using the same customcdf annotation[hgu133a] and normalization with fRMA. Then, i merged the two datasets using the function combineTwoExpressionSet() from the R package a4Base (without performing any correction, just merging the two datasets based on the common probesets, which in this case are the total hgu133a probesets). But after the creation of some diagnostic plots, there seems a strong batch effect, by which the samples separate mostly based on the "platform"(although i used the same customcdf annotation for each dataset). Thus, my main questions are:

A sample of the phenodata of my merged eset looks like:

> head(phenodat)
Metastasis Condition Anatomic_location Dataset
Patient_1_A 0 Control left_sided hgu133plus2
Patient_1_B 0 Cancer left_sided hgu133plus2
Patient_2_A 0 Control left_sided hgu133plus2
Patient_2_B 0 Cancer left_sided hgu133plus2
Patient_3_A 0 Control right_sided hgu133plus2
Patient_3_B 0 Cancer right_sided hgu133plus2

So, if i would like to implement the ComBat() function to directly adjust for my known batch variable above, which is "Dataset"(indicates the different datasets each CEL file belogs):

batch <- phenodat$Dataset
modcombat <- model.matrix(~1, data=phenodat)
combat_data <- ComBat(dat=exprs(eset_merged), batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)

The above code should be fine ? In detail, as the other above variables will be used after ComBat for various statistical comparisons with limma, so above regarding the "mod" argument of ComBat, the model.matrix with the intercept term should be correct ?

2. Secondly, regarding my brief description of my experimental design/pre-processing: as essentially with this python script i only finally using the A.CEL files from the hgu133plus2 dataset--and thus these have the signals from the A-chip, an alternative solution would be to normalize both datasets together ? and then perform batch effect correction ? Or because there are different patients and essentially different chips from the start, my above procedure is more suitable ?

I understand that there is no gold standard or perfect implementations, but any suggestion or feedback on this matter would be essential !!

Thank you,

Efstathios

sva ComBat customcdf affymetrix microarrays limma • 2.1k views

ADD COMMENT • link 9.2 years ago svlachavas ▴ 840