Merging microarray datasets

Kathy Duncan ▴ 130

@kathy-duncan-2722

Last seen 10.6 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080423/b700048d/attachment.pl

17.0 years ago • Kathy Duncan ▴ 130

rgentleman ★ 5.5k

@rgentleman-7725

Last seen 10.0 years ago

United States

Hi Kathy,

This question has been asked many times, and the advice remains the same: it doesn't make any sense to normalize different data sets together. You should normalize them separately and use appropriate statistical models to combine the data into a single analysis.

Best wishes,
Robert

Kathy Duncan wrote:
> Hi,
>
> I have a simple and basic question:
>
> Is it alright to think of merging two datasets (either from cDNA or Affymetrix platform) - Final goal is to have ONE normalized dataset, where the datasets are scaled in order to compensate the different types of variations, if present!
>
> Comments on strategy and packages available in Bioconductor would be of great help.
>
> Thanks.
>
> Kathy
>
> PS: Thanks James for your earlier reply (SORRY for the delay). Here, I have re-framed my question - I'm not sure *MergeMaid* exactly does what I'm looking for.
>
> On Tue, Apr 15, 2008 at 1:19 AM, James W. MacDonald <jmacdon at med.umich.edu> wrote:
>
>> Hi Kathy,
>>
>> Assuming the chips are all the same species, I would probably use something like MergeMaid. There may be others -- you can check here:
>>
>> http://bioconductor.org/packages/2.2/DifferentialExpression.html
>>
>> Best,
>>
>> Jim

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org

17.0 years ago • rgentleman ★ 5.5k

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080423/eb99a88e/attachment.pl

17.0 years ago • Kathy Duncan ▴ 130

This is an interesting question and one that I would like to explore further.

The papers I have seen on combining microarray datasets so far select one algorithm for Affymetrix and one algorithm for cDNA.

Has anyone investigated which combination of preprocessing algorithm(s) makes data from these two platforms comparable? Indeed, how does one check whether they are comparable? Any references and suggestions would be very welcome.

Thank you.

Regards,
Adai

Kathy Duncan wrote:
> Thanks Robert & Balasubramanian!
>
> Let's consider that the raw datasets (of a platform) are individually normalized. Now, what approach is advisable to have a single set out of them while they are scaled too to get rid of the possible "between-array" variation (I hope this sounds ok!).
>
> Kathy
>
> Balasubramanian Ganesan <balag at cc.usu.edu> wrote:
>
>> Yes, but you have to normalize all raw data together. All data should also be of one platform only. Then you can simply normalize all CEL files or all ----- files together and be done.
>> For Affy data, you can use the Affy package for normalization. Depends on how you want to normalize anyway.

17.0 years ago • Adaikalavan Ramasamy ▴ 100

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> Subject: Re: [BioC] Merging microarray datasets
>
> This is an interesting question and one that I like to explore further.
>
> The papers I have seen on combining microarray datasets so far select one algorithm for Affymetrix and one algorithm for cDNA.
>
> Has anyone investigated which combination of preprocessing algorithm(s) make data from these two platforms comparable? Indeed, how does one check if they are comparable? Any references and suggestions would be very welcome.

Platforms aside, cDNA arrays are usually two color and ratios, and Affy are one color and not ratios.

So one approach is to turn everything into ratios after preprocessing and normalization (using, for example, a suitable set of reference samples, e.g. normal kidney for kidney tumors). Obviously, thoughtful selection of reference samples is required, and the reference chosen should be analogous to whatever was used as the reference in the two color arrays.

Then, one can try to bring things further into line by converting the log transformed ratios into z scores.

As far as verification, a place to start would be box and whisker plots to expose obvious abnormalities. You can also perform unsupervised clustering to see if the samples cluster mainly according to platform/lab or mainly according to known phenotype.

Then, as Robert Gentleman stated, you can use appropriate models to correct systematic biases. But I will leave the details of that to the statisticians (and, indeed, the archives of this list).

Obviously, the whole exercise is fraught with difficulties, but it is done. See for example the oncomine project. Whether it is fruitfully done is open to argument. One thing to consider is to use downstream analysis methods that care more about the relative position of genes and less about the magnitude of values (e.g. GSEA or PGSEA).

> Thank you.
>
> Regards, Adai

17.0 years ago • Kort, Eric ▴ 220
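
The ratio-then-z-score idea in the reply above can be illustrated with a small sketch. This is not code from the thread: it is a language-neutral illustration written in Python with the standard library only, the function names `log2_ratios` and `z_scores` are hypothetical, and the intensities are made-up numbers. It assumes the single-channel data have already been preprocessed and that a suitable reference sample has been chosen.

```python
import math
import statistics


def log2_ratios(sample, reference):
    """Per-gene log2(sample / reference) for single-channel intensities.

    Only genes present in both the sample and the reference are kept.
    """
    return {g: math.log2(sample[g] / reference[g]) for g in sample if g in reference}


def z_scores(values):
    """Centre and scale one array's log ratios to mean 0 and SD 1."""
    data = list(values.values())
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return {g: (v - mu) / sd for g, v in values.items()}


# Hypothetical intensities, for illustration only.
tumor = {"geneA": 800.0, "geneB": 100.0, "geneC": 400.0}
normal_reference = {"geneA": 200.0, "geneB": 100.0, "geneC": 800.0}

ratios = log2_ratios(tumor, normal_reference)  # log2 fold change vs. the reference
scaled = z_scores(ratios)                      # same values on a unit-free scale
```

After this step a one-color array is expressed on the same kind of scale as a two-color log ratio, which is what makes the box plot and clustering checks meaningful. Whether the chosen reference is truly analogous to the two-color reference remains a judgment call that no code can settle.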

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080424/8ebc13ad/attachment.pl

17.0 years ago • Kathy Duncan ▴ 130

> Thanks to Adai and Eric,
>
> Well, I'm trying to bring back the discussion to the previous direction as it apparently went to a different area: cross-platform integration. :) I was wondering about integration within the same platform - an issue when there are multiple chips (in case of Affymetrix) OR multiple print layouts (cDNA .gal files).
>
> "... have to normalize all raw data together. All data should also be of one platform only. Then you can simply normalize all CEL files or all ----- files together and be done." [courtesy: Balasubramanian]
>
> So, if I have suppose MA1, MA2 ... as respective normalised datasets (same platform. After doing normalization based on chip-types in case of Affymetrix, OR, print layouts in case of cDNA), can I just normalize them again for the final dataset, or I need to take care of some other issues too (how to tackle!)? Also, wonder if there's any smart package in this regard!

I am not certain that there is a magic package that can do the best data transformation for all situations. You may well have to include a dataset effect in your analysis (as Robert and others are recommending), and there are many (smart) packages available in R to help you build models and estimate effects.

> Also Eric, I didn't get you what project you were talking about: "See for example the oncomine project."

The project aims at bundling heterogeneous expression data together.
http://www.ncbi.nlm.nih.gov/pubmed/15068665?dopt=AbstractPlus

> Thanks.
>
> Kathy

17.0 years ago • lgautier@altern.org ▴ 950
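
To make "include a dataset effect" concrete, here is a minimal sketch under simple assumptions. It is not the method of any particular package: it assumes a purely additive dataset offset, and it handles that offset by estimating the treated-vs-control difference inside each dataset and then combining the per-dataset differences. The function name `stratified_effect`, the dataset labels, and the numbers are all hypothetical. It is written in Python with the standard library only; in R one would normally fit a model with a dataset term instead.

```python
import statistics


def stratified_effect(records):
    """Treated-minus-control difference, estimated within each dataset.

    Each difference is computed inside a single dataset, so an additive
    dataset offset cancels. The per-dataset differences are then averaged,
    weighted by the number of samples in each dataset.
    """
    by_dataset = {}
    for dataset, group, value in records:
        by_dataset.setdefault(dataset, {}).setdefault(group, []).append(value)

    total, weight = 0.0, 0
    for groups in by_dataset.values():
        diff = statistics.mean(groups["treated"]) - statistics.mean(groups["control"])
        n = len(groups["treated"]) + len(groups["control"])
        total += diff * n
        weight += n
    return total / weight


# Made-up log2 expression values for one gene. MA2 sits about 3 units above
# MA1, and the two datasets have different treated/control proportions.
records = [
    ("MA1", "control", 5.0), ("MA1", "control", 5.2), ("MA1", "control", 5.1),
    ("MA1", "treated", 6.1),
    ("MA2", "control", 8.1),
    ("MA2", "treated", 9.0), ("MA2", "treated", 9.2), ("MA2", "treated", 9.1),
]

effect = stratified_effect(records)

# Pooling everything and ignoring the dataset label mixes the offset into the estimate.
naive = (statistics.mean(v for _, g, v in records if g == "treated")
         - statistics.mean(v for _, g, v in records if g == "control"))
```

In this toy example the within-dataset estimate is about 1.0, while the pooled estimate is about 2.5, because most treated samples happen to come from the dataset with the higher baseline. That inflation is the kind of artifact a dataset term in the model is meant to remove.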

Kathy, this depends on two things.

1) How similar are the chip types? For example, hgu95a and hgu95av2 are two Affymetrix chips that differ by one probeset. I do not know why they differ by one probeset, but I suppose one could just omit the extra probeset and preprocess them together.

2) The type of preprocessing algorithm used (and the sample size).

If you are using preprocessing algorithms that work on an array-by-array basis (e.g. median scaling), then you can normalize the different chip types separately and follow this with a merge(), creating missing values for the probes that are present on one chip type but not on the others. Next, you can either try to adjust the expression values for possible biases (e.g. see Benito, PMID:14693816) or include a chip type indicator in your analysis, as Gentleman and others have pointed out.

If you are using algorithms that take information across chips (e.g. RMA) AND you only have a small number of arrays for each chip type, then you need to give it more thought. It would be worth exploring whether one can merge the chips at probe level (e.g. with the matchprobes package) to benefit from better parameter estimation.

Regards,
Adai

Kathy, sorry for bringing up the issue of preprocessing. It is relevant and I will raise it in a separate thread.

17.0 years ago • Adaikalavan Ramasamy ★ 1.8k
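
The array-by-array route described above (scale each array on its own, merge, keep a chip type indicator) can be sketched as follows. This is an illustration only, in Python with the standard library: `median_scale` and `outer_merge` are hypothetical helper names, the target median of 100 is an arbitrary assumption, and the probe IDs and intensities are made up. In R the merging step corresponds to the merge() call mentioned in the reply.

```python
import statistics


def median_scale(array, target=100.0):
    """Rescale one array so that its median intensity equals `target`.

    Only this array's own values are used, so arrays from different chip
    types can be scaled independently of one another.
    """
    factor = target / statistics.median(array.values())
    return {probe: value * factor for probe, value in array.items()}


def outer_merge(arrays):
    """Combine arrays on the union of probe IDs.

    A probe that is absent from an array gets None (a missing value).
    """
    probes = sorted(set().union(*(a.keys() for a in arrays.values())))
    return {p: {name: a.get(p) for name, a in arrays.items()} for p in probes}


# Hypothetical arrays from two chip types that share most, but not all, probes.
chip_a_array = {"p1": 50.0, "p2": 100.0, "p3": 200.0}
chip_b_array = {"p1": 20.0, "p2": 40.0, "p4": 80.0}  # lacks p3, has an extra p4

scaled = {
    "arrayA": median_scale(chip_a_array),
    "arrayB": median_scale(chip_b_array),
}
merged = outer_merge(scaled)

# Chip type indicator to carry into the downstream model.
chip_type = {"arrayA": "chipA", "arrayB": "chipB"}
```

The merged table keeps every probe and marks the gaps explicitly, so the decision about how to treat probes missing from one chip type is deferred to the analysis. This sketch does not apply to methods such as RMA that borrow information across arrays, which is the harder case raised in the reply.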

Kathy Duncan ▴ 130

@kathy-duncan-2722

Last seen 10.6 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080426/c9b83568/attachment.pl

17.0 years ago • Kathy Duncan ▴ 130
