RMA question
@james-anderson-1641
Last seen 10.2 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20061217/ b680070d/attachment.pl
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…
Hi James,

this is a general problem of normalization methods that work by adapting arrays in a set to themselves, and not to an independent reference.

Option 1 is indeed discredited when you want to get a fair estimate of classification rates, since it does not faithfully simulate the real application where you want to classify a new sample.

Option 2 does not work since f contains, for each array, a number of array-specific, idiosyncratic parameters that reflect hybridization conditions, labeling efficiency, RNA extraction etc. You cannot "learn" them in advance.

The option I'd take is to look for a normalization method that normalizes each new array individually (or in sets appropriate to your intended application) to an existing database of reference arrays. I know that various people on this list have been/are working on such methods. But I am probably not up-to-date myself - maybe someone can recommend?

Best wishes
Wolfgang

------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber

> Hi, I have a question about RMA normalization. Since RMA is an across-sample normalization, suppose I have 50 training samples (cel files) and 50 test samples (cel files). There are two ways to perform normalization:
> 1. Combine all 100 samples and use RMA to normalize them together. Then train on the 50 training samples to classify the 50 test samples.
> 2. Run RMA on the 50 training samples only, so that each cel file is converted to a gene expression vector. Suppose the mapping from cel file to expression vector is Expression = f(cel). The form of f is determined by the 50 training cel files. Then apply the same mapping to the test cel files.
>
> I would think method 2 is more reasonable and truly blind. However, it is not clear how to determine the function f from the 50 training cel files. Method 1 is easy to implement, but it is not truly blind, since the normalization of the cel files from the training samples actually uses information from the test cel files.
>
> Could anybody tell me how to determine the function f from the 50 training cel files?
>
> Many thanks, James
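To make the concern about option 1 concrete, here is a toy Python sketch (not from the thread). It covers only the quantile-normalization step of RMA, and the function name `quantile_normalize` and all numbers are invented for illustration. It shows that the normalized values of the training arrays change once test arrays are normalized together with them.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays against each other.

    The reference distribution is the mean, rank by rank, of the sorted
    arrays; each value is then replaced by the reference value at its rank.
    Ties are broken by position (a simplification).
    """
    n = len(arrays)
    reference = [sum(vals) / n for vals in zip(*(sorted(a) for a in arrays))]
    result = []
    for a in arrays:
        order = sorted(range(len(a)), key=lambda i: a[i])
        out = [0.0] * len(a)
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result


# Made-up intensities: two training arrays and one much brighter test array.
train = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
test = [[10.0, 20.0, 30.0]]

train_alone = quantile_normalize(train)
train_with_test = quantile_normalize(train + test)[:len(train)]

# The training arrays end up with different values in the two runs,
# i.e. the test data has influenced the training data.
print(train_alone[0])
print(train_with_test[0])
```

In this toy case the first training array is mapped to `[1.5, 2.5, 3.5]` when normalized with the training set alone, but to larger values when the bright test array is included.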
In reply to Wolfgang Huber, 02:53 PM 12/17/2006:

I would say that it depends on how you plan to use the classification function.

If, in future, you will collect more samples and use the classification function to classify them, then you need to normalize the test set the same way you will normalize the new arrays. How you plan to do this may also affect how you normalize the training set.

--Naomi

Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111
hi james,

briefly, to make new chips comparable to a training data set normalized with RMA you can do the following:

normalize your training arrays, keeping track of:
(1) the means over the ranks used in quantile normalization
(2) the probe effects estimated by the median polish procedure

as the background correction is performed chip-by-chip, you can transform each test (future) array to be compatible with the training arrays (and the classifier) using the above information. f() then works roughly like this:

* substitute the (ranked) test-expression values by the means over the ranks from (1) (you're normalized now)
* calculate a chip effect (for each probe set) by subtracting the probe effects from (2) from the values in each probe set (you're done now)

i can send you the code for the above, in case you are interested.

all the best,
dennis
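The procedure dennis outlines can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the code he offers to send: the function names, the toy numbers, and the probe-set layout are all hypothetical; the values are assumed to be already background-corrected and on the log2 scale (a simplification); and taking the median of the probe-effect-corrected values as the chip effect is an assumption, since the post does not spell out the summary step.

```python
from statistics import median


def reference_quantiles(training):
    """(1) Mean of the sorted intensities at each rank over the training arrays."""
    n = len(training)
    return [sum(vals) / n for vals in zip(*(sorted(a) for a in training))]


def normalize_to_reference(array, reference):
    """Replace each value by the stored reference value at its rank.

    Ties are broken by position (a simplification).
    """
    order = sorted(range(len(array)), key=lambda i: array[i])
    out = [0.0] * len(array)
    for rank, idx in enumerate(order):
        out[idx] = reference[rank]
    return out


def chip_effects(normalized, probe_sets, probe_effects):
    """Per probe set: median of (normalized value - stored probe effect).

    Using the median here is an assumption made for this sketch.
    """
    return {
        name: median(normalized[i] - probe_effects[i] for i in idxs)
        for name, idxs in probe_sets.items()
    }


# Toy training data: two arrays with four probes each (made-up numbers).
training = [[2.0, 4.0, 6.0, 8.0], [3.0, 5.0, 7.0, 9.0]]
reference = reference_quantiles(training)

# (2) Probe effects would come from median polish on the training set;
# here they are simply invented.
probe_sets = {"geneA": [0, 1], "geneB": [2, 3]}
probe_effects = [0.5, -0.5, 1.0, -1.0]

# A new (test) array is processed using only the stored training information.
new_array = [7.0, 1.0, 10.0, 3.0]
normalized = normalize_to_reference(new_array, reference)
summary = chip_effects(normalized, probe_sets, probe_effects)
print(summary)
```

The point of the design is that `reference` and `probe_effects` are computed once from the training arrays and then frozen, so a new array never changes the values the classifier was trained on.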
Lana Schaffer ★ 1.3k
@lana-schaffer-1056
Last seen 10.2 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20061218/ c1e1474d/attachment.pl
@harbron-chris-1976
Last seen 10.2 years ago
Hi James,

Can I point you in the direction of the RefPlus package, available in Bioconductor release 2.0, which will do what I think you are looking for, i.e. it allows additional cel files to be added to a data set without affecting the gene expression values or the normalisation parameters calculated from the previously processed cel files.

You might also want to check out the paper from Darlene Goldstein in Bioinformatics (2006, p2364-2372), which discusses similar algorithms.

All the best
Chris

Chris Harbron
Technical Lead Statistician, AstraZeneca