set.seed() in gcrma
1
0
Entering edit mode
@william-t-barry-2655
Last seen 10.2 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080213/ 81dafb82/attachment.pl
• 1.9k views
ADD COMMENT
0
Entering edit mode
@henrik-bengtsson-4333
Last seen 6 months ago
United States
Hi, I think you have spotted something very important, and I agree that the random seed should be reset by any function that set it in order to get reproducible samples. We've been working with a port/wrapper for gcrma() but this never struck us - thanks for bringing it up. FYI, I've just sent a question to R-devel, as it is more appropriate there, on how to best push the random generator back to the state it was before calling set.seed(). Best, Henrik On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: > Hello, > > While recently using the gcrma package, I noticed some strange behavior > that I want to report. I was applying the pre-processing algorithm (using > default settings) inside a resampling scheme and noticed that subsequent > calls of sample() were returning a constant vector of values. Example > output shown below: > > > temp <- gcrma(celdat[,1:2]) > Adjusting for optical effect..Done. > Computing affinities.Done. > Adjusting for non-specific binding..Done. > Normalizing > Calculating Expression > > > sample(1:10000,3) > [1] 7030 5473 8109 > > > temp <- gcrma(celdat[,1:2]) > Adjusting for optical effect..Done. > Computing affinities.Done. > Adjusting for non-specific binding..Done. > Normalizing > Calculating Expression > > > sample(1:10000,3) > [1] 7030 5473 8109 > > > sessionInfo() > R version 2.6.1 (2007-11-26) > x86_64-redhat-linux-gnu > > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_U S.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF -8;LC_IDENTIFICATION=C > > attached base packages: > [1] splines tools stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 > [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 > [7] affyio_1.6.1 Biobase_1.16.2 > > loaded via a namespace (and not attached): > [1] rcompgen_0.1-17 > > In looking at the gcrma functions, it appears that gcrma.engine2() is > setting the seed without saving and restoring .Random.seed . This was > easily corrected in the external loop of the resampling scheme, but I was > unsure whether this would be considered a 'bug' in gcrma. > > Kind regards, > Bill > > > ------------------------------- > William T. Barry, Ph.D. > Dept. of Biostatistics & Bioinformatics, DUMC > Duke Institute for Genome Science & Policy > Hock Plaza, Room 8030 > DUMC Box 2717 > Durham, NC 27710 > > Phone: 919-681-5047 > Fax: 919-668-9335 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Bill Thank you for that suggestion. We will edit gcrma to restore the seed in the next release. Zhijin Henrik Bengtsson wrote: > Hi, > > I think you have spotted something very important, and I agree that > the random seed should be reset by any function that set it in order > to get reproducible samples. We've been working with a port/wrapper > for gcrma() but this never struck us - thanks for bringing it up. > > FYI, I've just sent a question to R-devel, as it is more appropriate > there, on how to best push the random generator back to the state it > was before calling set.seed(). > > Best, > > Henrik > > On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: >> Hello, >> >> While recently using the gcrma package, I noticed some strange behavior >> that I want to report. I was applying the pre-processing algorithm (using >> default settings) inside a resampling scheme and noticed that subsequent >> calls of sample() were returning a constant vector of values. Example >> output shown below: >> >>> temp <- gcrma(celdat[,1:2]) >> Adjusting for optical effect..Done. >> Computing affinities.Done. >> Adjusting for non-specific binding..Done. >> Normalizing >> Calculating Expression >> >>> sample(1:10000,3) >> [1] 7030 5473 8109 >> >>> temp <- gcrma(celdat[,1:2]) >> Adjusting for optical effect..Done. >> Computing affinities.Done. >> Adjusting for non-specific binding..Done. >> Normalizing >> Calculating Expression >> >>> sample(1:10000,3) >> [1] 7030 5473 8109 >> >>> sessionInfo() >> R version 2.6.1 (2007-11-26) >> x86_64-redhat-linux-gnu >> >> locale: >> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en _US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_ US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UT F-8;LC_IDENTIFICATION=C >> >> attached base packages: >> [1] splines tools stats graphics grDevices utils datasets >> [8] methods base >> >> other attached packages: >> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 >> [7] affyio_1.6.1 Biobase_1.16.2 >> >> loaded via a namespace (and not attached): >> [1] rcompgen_0.1-17 >> >> In looking at the gcrma functions, it appears that gcrma.engine2() is >> setting the seed without saving and restoring .Random.seed . This was >> easily corrected in the external loop of the resampling scheme, but I was >> unsure whether this would be considered a 'bug' in gcrma. >> >> Kind regards, >> Bill >> >> >> ------------------------------- >> William T. Barry, Ph.D. >> Dept. of Biostatistics & Bioinformatics, DUMC >> Duke Institute for Genome Science & Policy >> Hock Plaza, Room 8030 >> DUMC Box 2717 >> Durham, NC 27710 >> >> Phone: 919-681-5047 >> Fax: 919-668-9335 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- ------------------------------------------- Zhijin (Jean) Wu Assistant Professor of Biostatistics Brown University, Box G-S121 Providence, RI 02912 Tel: 401 863 1230 Fax: 401 863 9182 http://stat.brown.edu/~zwu
ADD REPLY
0
Entering edit mode
On Feb 13, 2008 9:49 AM, Zhijin Wu <zwu at="" stat.brown.edu=""> wrote: > Bill > Thank you for that suggestion. We will edit gcrma to restore the seed in > the next release. FYI, see the answer to my recent R-devel question on how to best reset the seed; https://stat.ethz.ch/pipermail/r-devel/2008-February/048367.html /Henrik > Zhijin > > Henrik Bengtsson wrote: > > Hi, > > > > I think you have spotted something very important, and I agree that > > the random seed should be reset by any function that set it in order > > to get reproducible samples. We've been working with a port/wrapper > > for gcrma() but this never struck us - thanks for bringing it up. > > > > FYI, I've just sent a question to R-devel, as it is more appropriate > > there, on how to best push the random generator back to the state it > > was before calling set.seed(). > > > > Best, > > > > Henrik > > > > On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: > >> Hello, > >> > >> While recently using the gcrma package, I noticed some strange behavior > >> that I want to report. I was applying the pre-processing algorithm (using > >> default settings) inside a resampling scheme and noticed that subsequent > >> calls of sample() were returning a constant vector of values. Example > >> output shown below: > >> > >>> temp <- gcrma(celdat[,1:2]) > >> Adjusting for optical effect..Done. > >> Computing affinities.Done. > >> Adjusting for non-specific binding..Done. > >> Normalizing > >> Calculating Expression > >> > >>> sample(1:10000,3) > >> [1] 7030 5473 8109 > >> > >>> temp <- gcrma(celdat[,1:2]) > >> Adjusting for optical effect..Done. > >> Computing affinities.Done. > >> Adjusting for non-specific binding..Done. > >> Normalizing > >> Calculating Expression > >> > >>> sample(1:10000,3) > >> [1] 7030 5473 8109 > >> > >>> sessionInfo() > >> R version 2.6.1 (2007-11-26) > >> x86_64-redhat-linux-gnu > >> > >> locale: > >> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE= en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=e n_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US. UTF-8;LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] splines tools stats graphics grDevices utils datasets > >> [8] methods base > >> > >> other attached packages: > >> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 > >> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 > >> [7] affyio_1.6.1 Biobase_1.16.2 > >> > >> loaded via a namespace (and not attached): > >> [1] rcompgen_0.1-17 > >> > >> In looking at the gcrma functions, it appears that gcrma.engine2() is > >> setting the seed without saving and restoring .Random.seed . This was > >> easily corrected in the external loop of the resampling scheme, but I was > >> unsure whether this would be considered a 'bug' in gcrma. > >> > >> Kind regards, > >> Bill > >> > >> > >> ------------------------------- > >> William T. Barry, Ph.D. > >> Dept. of Biostatistics & Bioinformatics, DUMC > >> Duke Institute for Genome Science & Policy > >> Hock Plaza, Room 8030 > >> DUMC Box 2717 > >> Durham, NC 27710 > >> > >> Phone: 919-681-5047 > >> Fax: 919-668-9335 > >> > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > ------------------------------------------- > Zhijin (Jean) Wu > Assistant Professor of Biostatistics > Brown University, Box G-S121 > Providence, RI 02912 > > Tel: 401 863 1230 > Fax: 401 863 9182 > http://stat.brown.edu/~zwu > >
ADD REPLY
0
Entering edit mode
In my opinion gcrma should _not_ play with the seed. That is something for the user to decide. Kasper On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: > Bill > Thank you for that suggestion. We will edit gcrma to restore the > seed in > the next release. > Zhijin > Henrik Bengtsson wrote: >> Hi, >> >> I think you have spotted something very important, and I agree that >> the random seed should be reset by any function that set it in order >> to get reproducible samples. We've been working with a port/wrapper >> for gcrma() but this never struck us - thanks for bringing it up. >> >> FYI, I've just sent a question to R-devel, as it is more appropriate >> there, on how to best push the random generator back to the state it >> was before calling set.seed(). >> >> Best, >> >> Henrik >> >> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: >>> Hello, >>> >>> While recently using the gcrma package, I noticed some strange >>> behavior >>> that I want to report. I was applying the pre-processing >>> algorithm (using >>> default settings) inside a resampling scheme and noticed that >>> subsequent >>> calls of sample() were returning a constant vector of values. >>> Example >>> output shown below: >>> >>>> temp <- gcrma(celdat[,1:2]) >>> Adjusting for optical effect..Done. >>> Computing affinities.Done. >>> Adjusting for non-specific binding..Done. >>> Normalizing >>> Calculating Expression >>> >>>> sample(1:10000,3) >>> [1] 7030 5473 8109 >>> >>>> temp <- gcrma(celdat[,1:2]) >>> Adjusting for optical effect..Done. >>> Computing affinities.Done. >>> Adjusting for non-specific binding..Done. >>> Normalizing >>> Calculating Expression >>> >>>> sample(1:10000,3) >>> [1] 7030 5473 8109 >>> >>>> sessionInfo() >>> R version 2.6.1 (2007-11-26) >>> x86_64-redhat-linux-gnu >>> >>> locale: >>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ >>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en >>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US >>> .UTF-8;LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] splines tools stats graphics grDevices utils >>> datasets >>> [8] methods base >>> >>> other attached packages: >>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >>> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 >>> [7] affyio_1.6.1 Biobase_1.16.2 >>> >>> loaded via a namespace (and not attached): >>> [1] rcompgen_0.1-17 >>> >>> In looking at the gcrma functions, it appears that gcrma.engine2 >>> () is >>> setting the seed without saving and restoring .Random.seed . >>> This was >>> easily corrected in the external loop of the resampling scheme, >>> but I was >>> unsure whether this would be considered a 'bug' in gcrma. >>> >>> Kind regards, >>> Bill >>> >>> >>> ------------------------------- >>> William T. Barry, Ph.D. >>> Dept. of Biostatistics & Bioinformatics, DUMC >>> Duke Institute for Genome Science & Policy >>> Hock Plaza, Room 8030 >>> DUMC Box 2717 >>> Durham, NC 27710 >>> >>> Phone: 919-681-5047 >>> Fax: 919-668-9335 >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/ >>> gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor > > > -- > ------------------------------------------- > Zhijin (Jean) Wu > Assistant Professor of Biostatistics > Brown University, Box G-S121 > Providence, RI 02912 > > Tel: 401 863 1230 > Fax: 401 863 9182 > http://stat.brown.edu/~zwu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/ > gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen <khansen at="" stat.berkeley.edu=""> wrote: > In my opinion gcrma should _not_ play with the seed. That is > something for the user to decide. It's a trade-off between providing reproducible examples and adding to the overall confusion. I think it is important to be able to generate identical results for identical calls, and since it is not clear to most users that the gcRMA parameters are estimated based on a sample one I can see how the current solution avoids the problem of having ask people to set the seed explicitly before calling gcrma() [also I have a hard time to keep track on what is going on in all methods]. One solution would be to move the value of seed to be an argument with a default value and if NA/NULL the seed is *not* set. That would provide a backward-compatible upgrade to gcrma(). Comments? Henrik > > Kasper > > > On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: > > > Bill > > Thank you for that suggestion. We will edit gcrma to restore the > > seed in > > the next release. > > Zhijin > > Henrik Bengtsson wrote: > >> Hi, > >> > >> I think you have spotted something very important, and I agree that > >> the random seed should be reset by any function that set it in order > >> to get reproducible samples. We've been working with a port/wrapper > >> for gcrma() but this never struck us - thanks for bringing it up. > >> > >> FYI, I've just sent a question to R-devel, as it is more appropriate > >> there, on how to best push the random generator back to the state it > >> was before calling set.seed(). > >> > >> Best, > >> > >> Henrik > >> > >> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: > >>> Hello, > >>> > >>> While recently using the gcrma package, I noticed some strange > >>> behavior > >>> that I want to report. I was applying the pre-processing > >>> algorithm (using > >>> default settings) inside a resampling scheme and noticed that > >>> subsequent > >>> calls of sample() were returning a constant vector of values. > >>> Example > >>> output shown below: > >>> > >>>> temp <- gcrma(celdat[,1:2]) > >>> Adjusting for optical effect..Done. > >>> Computing affinities.Done. > >>> Adjusting for non-specific binding..Done. > >>> Normalizing > >>> Calculating Expression > >>> > >>>> sample(1:10000,3) > >>> [1] 7030 5473 8109 > >>> > >>>> temp <- gcrma(celdat[,1:2]) > >>> Adjusting for optical effect..Done. > >>> Computing affinities.Done. > >>> Adjusting for non-specific binding..Done. > >>> Normalizing > >>> Calculating Expression > >>> > >>>> sample(1:10000,3) > >>> [1] 7030 5473 8109 > >>> > >>>> sessionInfo() > >>> R version 2.6.1 (2007-11-26) > >>> x86_64-redhat-linux-gnu > >>> > >>> locale: > >>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ > >>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en > >>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US > >>> .UTF-8;LC_IDENTIFICATION=C > >>> > >>> attached base packages: > >>> [1] splines tools stats graphics grDevices utils > >>> datasets > >>> [8] methods base > >>> > >>> other attached packages: > >>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 > >>> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 > >>> [7] affyio_1.6.1 Biobase_1.16.2 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] rcompgen_0.1-17 > >>> > >>> In looking at the gcrma functions, it appears that gcrma.engine2 > >>> () is > >>> setting the seed without saving and restoring .Random.seed . > >>> This was > >>> easily corrected in the external loop of the resampling scheme, > >>> but I was > >>> unsure whether this would be considered a 'bug' in gcrma. > >>> > >>> Kind regards, > >>> Bill > >>> > >>> > >>> ------------------------------- > >>> William T. Barry, Ph.D. > >>> Dept. of Biostatistics & Bioinformatics, DUMC > >>> Duke Institute for Genome Science & Policy > >>> Hock Plaza, Room 8030 > >>> DUMC Box 2717 > >>> Durham, NC 27710 > >>> > >>> Phone: 919-681-5047 > >>> Fax: 919-668-9335 > >>> > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/ > >>> gmane.science.biology.informatics.conductor > >>> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/ > >> gmane.science.biology.informatics.conductor > > > > > > -- > > ------------------------------------------- > > Zhijin (Jean) Wu > > Assistant Professor of Biostatistics > > Brown University, Box G-S121 > > Providence, RI 02912 > > > > Tel: 401 863 1230 > > Fax: 401 863 9182 > > http://stat.brown.edu/~zwu > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/ > > gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: > On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen > <khansen at="" stat.berkeley.edu=""> wrote: >> In my opinion gcrma should _not_ play with the seed. That is >> something for the user to decide. > > It's a trade-off between providing reproducible examples and adding to > the overall confusion. I think it is important to be able to generate > identical results for identical calls, and since it is not clear to > most users that the gcRMA parameters are estimated based on a sample > one I can see how the current solution avoids the problem of having > ask people to set the seed explicitly before calling gcrma() [also I > have a hard time to keep track on what is going on in all methods]. When a function sets the seed within itself and then does a sample() immediately afterwards what it really does is using a fixed index vector for arrays of a certain size. To call this "a random sample" is simply wrong. It would be easier and more transparent to just choose a regular grid like seg(from = 1, to = maxIndex, length.out = sampleSize) Now some people would argue that it seems wrong to use a regular grid like this and that the probes should be picked at random. To which I repeat what i just wrote: when you do set.seed you are not selecting anything at random - but you are obscuring this fact. If it really is important to not always use the same probes, the seed should not be set and you should live with the fact that you do not get reproducible effects. I agree that for gcRMA it would make sense to have the results be highly reproducible (otherwise I foresee many posts about this), but that is achieved even better by using a grid as above. Or perhaps do something like having an argument called (say) randomSample = FALSE as default and then do something like if(randomSample) gridIdx = sample(maxIndex, size = sampleSize) else gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) (appropriately modified if you want to exclude certain indices corresponding to QC probes). Setting the seed inside a function should in my opinion always be discouraged. Kasper > One solution would be to move the value of seed to be an argument with > a default value and if NA/NULL the seed is *not* set. That would > provide a backward-compatible upgrade to gcrma(). This is the solution taken by most people who want a similar functionality. I still > Comments? > > Henrik > >> >> Kasper >> >> >> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: >> >>> Bill >>> Thank you for that suggestion. We will edit gcrma to restore the >>> seed in >>> the next release. >>> Zhijin >>> Henrik Bengtsson wrote: >>>> Hi, >>>> >>>> I think you have spotted something very important, and I agree that >>>> the random seed should be reset by any function that set it in >>>> order >>>> to get reproducible samples. We've been working with a port/ >>>> wrapper >>>> for gcrma() but this never struck us - thanks for bringing it up. >>>> >>>> FYI, I've just sent a question to R-devel, as it is more >>>> appropriate >>>> there, on how to best push the random generator back to the >>>> state it >>>> was before calling set.seed(). >>>> >>>> Best, >>>> >>>> Henrik >>>> >>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> >>>> wrote: >>>>> Hello, >>>>> >>>>> While recently using the gcrma package, I noticed some strange >>>>> behavior >>>>> that I want to report. I was applying the pre-processing >>>>> algorithm (using >>>>> default settings) inside a resampling scheme and noticed that >>>>> subsequent >>>>> calls of sample() were returning a constant vector of values. >>>>> Example >>>>> output shown below: >>>>> >>>>>> temp <- gcrma(celdat[,1:2]) >>>>> Adjusting for optical effect..Done. >>>>> Computing affinities.Done. >>>>> Adjusting for non-specific binding..Done. >>>>> Normalizing >>>>> Calculating Expression >>>>> >>>>>> sample(1:10000,3) >>>>> [1] 7030 5473 8109 >>>>> >>>>>> temp <- gcrma(celdat[,1:2]) >>>>> Adjusting for optical effect..Done. >>>>> Computing affinities.Done. >>>>> Adjusting for non-specific binding..Done. >>>>> Normalizing >>>>> Calculating Expression >>>>> >>>>>> sample(1:10000,3) >>>>> [1] 7030 5473 8109 >>>>> >>>>>> sessionInfo() >>>>> R version 2.6.1 (2007-11-26) >>>>> x86_64-redhat-linux-gnu >>>>> >>>>> locale: >>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e >>>>> n_ >>>>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= >>>>> en >>>>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ >>>>> US >>>>> .UTF-8;LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] splines tools stats graphics grDevices utils >>>>> datasets >>>>> [8] methods base >>>>> >>>>> other attached packages: >>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >>>>> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 >>>>> [7] affyio_1.6.1 Biobase_1.16.2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] rcompgen_0.1-17 >>>>> >>>>> In looking at the gcrma functions, it appears that gcrma.engine2 >>>>> () is >>>>> setting the seed without saving and restoring .Random.seed . >>>>> This was >>>>> easily corrected in the external loop of the resampling scheme, >>>>> but I was >>>>> unsure whether this would be considered a 'bug' in gcrma. >>>>> >>>>> Kind regards, >>>>> Bill >>>>> >>>>> >>>>> ------------------------------- >>>>> William T. Barry, Ph.D. >>>>> Dept. of Biostatistics & Bioinformatics, DUMC >>>>> Duke Institute for Genome Science & Policy >>>>> Hock Plaza, Room 8030 >>>>> DUMC Box 2717 >>>>> Durham, NC 27710 >>>>> >>>>> Phone: 919-681-5047 >>>>> Fax: 919-668-9335 >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/ >>>>> gmane.science.biology.informatics.conductor >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/ >>>> gmane.science.biology.informatics.conductor >>> >>> >>> -- >>> ------------------------------------------- >>> Zhijin (Jean) Wu >>> Assistant Professor of Biostatistics >>> Brown University, Box G-S121 >>> Providence, RI 02912 >>> >>> Tel: 401 863 1230 >>> Fax: 401 863 9182 >>> http://stat.brown.edu/~zwu >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/ >>> gmane.science.biology.informatics.conductor >> >>
ADD REPLY
0
Entering edit mode
On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen <khansen at="" stat.berkeley.edu=""> wrote: > On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: > > > On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen > > <khansen at="" stat.berkeley.edu=""> wrote: > >> In my opinion gcrma should _not_ play with the seed. That is > >> something for the user to decide. > > > > It's a trade-off between providing reproducible examples and adding to > > the overall confusion. I think it is important to be able to generate > > identical results for identical calls, and since it is not clear to > > most users that the gcRMA parameters are estimated based on a sample > > one I can see how the current solution avoids the problem of having > > ask people to set the seed explicitly before calling gcrma() [also I > > have a hard time to keep track on what is going on in all methods]. > > When a function sets the seed within itself and then does a sample() > immediately afterwards what it really does is using a fixed index > vector for arrays of a certain size. To call this "a random sample" > is simply wrong. It would be easier and more transparent to just > choose a regular grid like > seg(from = 1, to = maxIndex, length.out = sampleSize) > Now some people would argue that it seems wrong to use a regular grid > like this and that the probes should be picked at random. To which I > repeat what i just wrote: when you do set.seed you are not selecting > anything at random - but you are obscuring this fact. If it really is > important to not always use the same probes, the seed should not be > set and you should live with the fact that you do not get > reproducible effects. > > I agree that for gcRMA it would make sense to have the results be > highly reproducible (otherwise I foresee many posts about this), but > that is achieved even better by using a grid as above. Or perhaps do > something like having an argument called (say) randomSample = FALSE > as default and then do something like > if(randomSample) > gridIdx = sample(maxIndex, size = sampleSize) > else > gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) > (appropriately modified if you want to exclude certain indices > corresponding to QC probes). Setting the seed inside a function > should in my opinion always be discouraged. I don't disagree with your arguments. However, there is a risk of selecting probes/units with common properties if seq() is used. This comes down to how Affymetrix has outlined the probes on the particular chip types being analyzed, and since we don't know what chip types will show up in the future, I would like to argue that a per-chip-type pre-determined sample() is safer than a uniform seq(). I've checked the latest BioC devel version of 'gcrma' and I noticed that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is called, but I cannot see any code that actually sample:s anything. Are those set.seed() left overs or am I missing something? /Henrik > > Kasper > > > > > One solution would be to move the value of seed to be an argument with > > a default value and if NA/NULL the seed is *not* set. That would > > provide a backward-compatible upgrade to gcrma(). > > This is the solution taken by most people who want a similar > functionality. I still > > > > > Comments? > > > > Henrik > > > >> > >> Kasper > >> > >> > >> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: > >> > >>> Bill > >>> Thank you for that suggestion. We will edit gcrma to restore the > >>> seed in > >>> the next release. > >>> Zhijin > >>> Henrik Bengtsson wrote: > >>>> Hi, > >>>> > >>>> I think you have spotted something very important, and I agree that > >>>> the random seed should be reset by any function that set it in > >>>> order > >>>> to get reproducible samples. We've been working with a port/ > >>>> wrapper > >>>> for gcrma() but this never struck us - thanks for bringing it up. > >>>> > >>>> FYI, I've just sent a question to R-devel, as it is more > >>>> appropriate > >>>> there, on how to best push the random generator back to the > >>>> state it > >>>> was before calling set.seed(). > >>>> > >>>> Best, > >>>> > >>>> Henrik > >>>> > >>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> > >>>> wrote: > >>>>> Hello, > >>>>> > >>>>> While recently using the gcrma package, I noticed some strange > >>>>> behavior > >>>>> that I want to report. I was applying the pre-processing > >>>>> algorithm (using > >>>>> default settings) inside a resampling scheme and noticed that > >>>>> subsequent > >>>>> calls of sample() were returning a constant vector of values. > >>>>> Example > >>>>> output shown below: > >>>>> > >>>>>> temp <- gcrma(celdat[,1:2]) > >>>>> Adjusting for optical effect..Done. > >>>>> Computing affinities.Done. > >>>>> Adjusting for non-specific binding..Done. > >>>>> Normalizing > >>>>> Calculating Expression > >>>>> > >>>>>> sample(1:10000,3) > >>>>> [1] 7030 5473 8109 > >>>>> > >>>>>> temp <- gcrma(celdat[,1:2]) > >>>>> Adjusting for optical effect..Done. > >>>>> Computing affinities.Done. > >>>>> Adjusting for non-specific binding..Done. > >>>>> Normalizing > >>>>> Calculating Expression > >>>>> > >>>>>> sample(1:10000,3) > >>>>> [1] 7030 5473 8109 > >>>>> > >>>>>> sessionInfo() > >>>>> R version 2.6.1 (2007-11-26) > >>>>> x86_64-redhat-linux-gnu > >>>>> > >>>>> locale: > >>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e > >>>>> n_ > >>>>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= > >>>>> en > >>>>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ > >>>>> US > >>>>> .UTF-8;LC_IDENTIFICATION=C > >>>>> > >>>>> attached base packages: > >>>>> [1] splines tools stats graphics grDevices utils > >>>>> datasets > >>>>> [8] methods base > >>>>> > >>>>> other attached packages: > >>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 > >>>>> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 > >>>>> [7] affyio_1.6.1 Biobase_1.16.2 > >>>>> > >>>>> loaded via a namespace (and not attached): > >>>>> [1] rcompgen_0.1-17 > >>>>> > >>>>> In looking at the gcrma functions, it appears that gcrma.engine2 > >>>>> () is > >>>>> setting the seed without saving and restoring .Random.seed . > >>>>> This was > >>>>> easily corrected in the external loop of the resampling scheme, > >>>>> but I was > >>>>> unsure whether this would be considered a 'bug' in gcrma. > >>>>> > >>>>> Kind regards, > >>>>> Bill > >>>>> > >>>>> > >>>>> ------------------------------- > >>>>> William T. Barry, Ph.D. > >>>>> Dept. of Biostatistics & Bioinformatics, DUMC > >>>>> Duke Institute for Genome Science & Policy > >>>>> Hock Plaza, Room 8030 > >>>>> DUMC Box 2717 > >>>>> Durham, NC 27710 > >>>>> > >>>>> Phone: 919-681-5047 > >>>>> Fax: 919-668-9335 > >>>>> > >>>>> > >>>>> [[alternative HTML version deleted]] > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at stat.math.ethz.ch > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/ > >>>>> gmane.science.biology.informatics.conductor > >>>>> > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: http://news.gmane.org/ > >>>> gmane.science.biology.informatics.conductor > >>> > >>> > >>> -- > >>> ------------------------------------------- > >>> Zhijin (Jean) Wu > >>> Assistant Professor of Biostatistics > >>> Brown University, Box G-S121 > >>> Providence, RI 02912 > >>> > >>> Tel: 401 863 1230 > >>> Fax: 401 863 9182 > >>> http://stat.brown.edu/~zwu > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/ > >>> gmane.science.biology.informatics.conductor > >> > >> > >
ADD REPLY
0
Entering edit mode
Hi All, The sample step occurs in the gcrma.engine/gcrma.engine2 sub function. A random sample of the entire matrix of PM values is selected to calculate an linear regression coefficient that is used in the full_model and affinities model. I believe the set.seed allows one to obtain the same results for any reanalysis of exactly the same CEL files. If one adds or subtracts even one CEL file the sampling of that entire matrix change and a different result is obtained. S. Keith Anderson, MS Statistician, Cancer Center Statistics Mayo Clinic email: anderson.s at mayo.edu -----Original Message----- From: bioconductor-bounces@stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik Bengtsson Sent: Wednesday, February 13, 2008 2:41 PM To: Kasper Daniel Hansen Cc: William T Barry; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] set.seed() in gcrma On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen <khansen at="" stat.berkeley.edu=""> wrote: > On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: > > > On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen > > <khansen at="" stat.berkeley.edu=""> wrote: > >> In my opinion gcrma should _not_ play with the seed. That is > >> something for the user to decide. > > > > It's a trade-off between providing reproducible examples and adding to > > the overall confusion. I think it is important to be able to generate > > identical results for identical calls, and since it is not clear to > > most users that the gcRMA parameters are estimated based on a sample > > one I can see how the current solution avoids the problem of having > > ask people to set the seed explicitly before calling gcrma() [also I > > have a hard time to keep track on what is going on in all methods]. > > When a function sets the seed within itself and then does a sample() > immediately afterwards what it really does is using a fixed index > vector for arrays of a certain size. To call this "a random sample" > is simply wrong. It would be easier and more transparent to just > choose a regular grid like > seg(from = 1, to = maxIndex, length.out = sampleSize) > Now some people would argue that it seems wrong to use a regular grid > like this and that the probes should be picked at random. To which I > repeat what i just wrote: when you do set.seed you are not selecting > anything at random - but you are obscuring this fact. If it really is > important to not always use the same probes, the seed should not be > set and you should live with the fact that you do not get > reproducible effects. > > I agree that for gcRMA it would make sense to have the results be > highly reproducible (otherwise I foresee many posts about this), but > that is achieved even better by using a grid as above. Or perhaps do > something like having an argument called (say) randomSample = FALSE > as default and then do something like > if(randomSample) > gridIdx = sample(maxIndex, size = sampleSize) > else > gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) > (appropriately modified if you want to exclude certain indices > corresponding to QC probes). Setting the seed inside a function > should in my opinion always be discouraged. I don't disagree with your arguments. However, there is a risk of selecting probes/units with common properties if seq() is used. This comes down to how Affymetrix has outlined the probes on the particular chip types being analyzed, and since we don't know what chip types will show up in the future, I would like to argue that a per-chip-type pre-determined sample() is safer than a uniform seq(). I've checked the latest BioC devel version of 'gcrma' and I noticed that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is called, but I cannot see any code that actually sample:s anything. Are those set.seed() left overs or am I missing something? /Henrik > > Kasper > > > > > One solution would be to move the value of seed to be an argument with > > a default value and if NA/NULL the seed is *not* set. That would > > provide a backward-compatible upgrade to gcrma(). > > This is the solution taken by most people who want a similar > functionality. I still > > > > > Comments? > > > > Henrik > > > >> > >> Kasper > >> > >> > >> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: > >> > >>> Bill > >>> Thank you for that suggestion. We will edit gcrma to restore the > >>> seed in > >>> the next release. > >>> Zhijin > >>> Henrik Bengtsson wrote: > >>>> Hi, > >>>> > >>>> I think you have spotted something very important, and I agree that > >>>> the random seed should be reset by any function that set it in > >>>> order > >>>> to get reproducible samples. We've been working with a port/ > >>>> wrapper > >>>> for gcrma() but this never struck us - thanks for bringing it up. > >>>> > >>>> FYI, I've just sent a question to R-devel, as it is more > >>>> appropriate > >>>> there, on how to best push the random generator back to the > >>>> state it > >>>> was before calling set.seed(). > >>>> > >>>> Best, > >>>> > >>>> Henrik > >>>> > >>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> > >>>> wrote: > >>>>> Hello, > >>>>> > >>>>> While recently using the gcrma package, I noticed some strange > >>>>> behavior > >>>>> that I want to report. I was applying the pre-processing > >>>>> algorithm (using > >>>>> default settings) inside a resampling scheme and noticed that > >>>>> subsequent > >>>>> calls of sample() were returning a constant vector of values. > >>>>> Example > >>>>> output shown below: > >>>>> > >>>>>> temp <- gcrma(celdat[,1:2]) > >>>>> Adjusting for optical effect..Done. > >>>>> Computing affinities.Done. > >>>>> Adjusting for non-specific binding..Done. > >>>>> Normalizing > >>>>> Calculating Expression > >>>>> > >>>>>> sample(1:10000,3) > >>>>> [1] 7030 5473 8109 > >>>>> > >>>>>> temp <- gcrma(celdat[,1:2]) > >>>>> Adjusting for optical effect..Done. > >>>>> Computing affinities.Done. > >>>>> Adjusting for non-specific binding..Done. > >>>>> Normalizing > >>>>> Calculating Expression > >>>>> > >>>>>> sample(1:10000,3) > >>>>> [1] 7030 5473 8109 > >>>>> > >>>>>> sessionInfo() > >>>>> R version 2.6.1 (2007-11-26) > >>>>> x86_64-redhat-linux-gnu > >>>>> > >>>>> locale: > >>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e > >>>>> n_ > >>>>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= > >>>>> en > >>>>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ > >>>>> US > >>>>> .UTF-8;LC_IDENTIFICATION=C > >>>>> > >>>>> attached base packages: > >>>>> [1] splines tools stats graphics grDevices utils > >>>>> datasets > >>>>> [8] methods base > >>>>> > >>>>> other attached packages: > >>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 > >>>>> [4] matchprobes_1.10.0 affy_1.16.0 preprocessCore_1.0.0 > >>>>> [7] affyio_1.6.1 Biobase_1.16.2 > >>>>> > >>>>> loaded via a namespace (and not attached): > >>>>> [1] rcompgen_0.1-17 > >>>>> > >>>>> In looking at the gcrma functions, it appears that gcrma.engine2 > >>>>> () is > >>>>> setting the seed without saving and restoring .Random.seed . > >>>>> This was > >>>>> easily corrected in the external loop of the resampling scheme, > >>>>> but I was > >>>>> unsure whether this would be considered a 'bug' in gcrma. > >>>>> > >>>>> Kind regards, > >>>>> Bill > >>>>> > >>>>> > >>>>> ------------------------------- > >>>>> William T. Barry, Ph.D. > >>>>> Dept. of Biostatistics & Bioinformatics, DUMC > >>>>> Duke Institute for Genome Science & Policy > >>>>> Hock Plaza, Room 8030 > >>>>> DUMC Box 2717 > >>>>> Durham, NC 27710 > >>>>> > >>>>> Phone: 919-681-5047 > >>>>> Fax: 919-668-9335 > >>>>> > >>>>> > >>>>> [[alternative HTML version deleted]] > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at stat.math.ethz.ch > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/ > >>>>> gmane.science.biology.informatics.conductor > >>>>> > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: http://news.gmane.org/ > >>>> gmane.science.biology.informatics.conductor > >>> > >>> > >>> -- > >>> ------------------------------------------- > >>> Zhijin (Jean) Wu > >>> Assistant Professor of Biostatistics > >>> Brown University, Box G-S121 > >>> Providence, RI 02912 > >>> > >>> Tel: 401 863 1230 > >>> Fax: 401 863 9182 > >>> http://stat.brown.edu/~zwu > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/ > >>> gmane.science.biology.informatics.conductor > >> > >> > > _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi, Does all this imply that one may (and will, in fact) obtain slightly different results when reanalysing exactly the same CEL files if their order is simply shuffled (e.g. if one of the CEL files is renamed) ?? Pierre. Anderson, S. Keith a ?crit : > Hi All, > > The sample step occurs in the gcrma.engine/gcrma.engine2 sub function. > A random sample of the entire matrix of PM values is selected to > calculate an linear regression coefficient that is used in the > full_model and affinities model. > I believe the set.seed allows one to obtain the same results for any > reanalysis of exactly the same CEL files. > If one adds or subtracts even one CEL file the sampling of that entire > matrix change and a different result is obtained. > > S. Keith Anderson, MS > Statistician, Cancer Center Statistics > Mayo Clinic > email: anderson.s at mayo.edu > > > -----Original Message----- > From: bioconductor-bounces at stat.math.ethz.ch > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik > Bengtsson > Sent: Wednesday, February 13, 2008 2:41 PM > To: Kasper Daniel Hansen > Cc: William T Barry; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] set.seed() in gcrma > > On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen > <khansen at="" stat.berkeley.edu=""> wrote: >> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: >> >>> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen >>> <khansen at="" stat.berkeley.edu=""> wrote: >>>> In my opinion gcrma should _not_ play with the seed. That is >>>> something for the user to decide. >>> It's a trade-off between providing reproducible examples and adding > to >>> the overall confusion. I think it is important to be able to > generate >>> identical results for identical calls, and since it is not clear to >>> most users that the gcRMA parameters are estimated based on a sample >>> one I can see how the current solution avoids the problem of having >>> ask people to set the seed explicitly before calling gcrma() [also I >>> have a hard time to keep track on what is going on in all methods]. >> When a function sets the seed within itself and then does a sample() >> immediately afterwards what it really does is using a fixed index >> vector for arrays of a certain size. To call this "a random sample" >> is simply wrong. It would be easier and more transparent to just >> choose a regular grid like >> seg(from = 1, to = maxIndex, length.out = sampleSize) >> Now some people would argue that it seems wrong to use a regular grid >> like this and that the probes should be picked at random. To which I >> repeat what i just wrote: when you do set.seed you are not selecting >> anything at random - but you are obscuring this fact. If it really is >> important to not always use the same probes, the seed should not be >> set and you should live with the fact that you do not get >> reproducible effects. >> >> I agree that for gcRMA it would make sense to have the results be >> highly reproducible (otherwise I foresee many posts about this), but >> that is achieved even better by using a grid as above. Or perhaps do >> something like having an argument called (say) randomSample = FALSE >> as default and then do something like >> if(randomSample) >> gridIdx = sample(maxIndex, size = sampleSize) >> else >> gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) >> (appropriately modified if you want to exclude certain indices >> corresponding to QC probes). Setting the seed inside a function >> should in my opinion always be discouraged. > > I don't disagree with your arguments. However, there is a risk of > selecting probes/units with common properties if seq() is used. This > comes down to how Affymetrix has outlined the probes on the particular > chip types being analyzed, and since we don't know what chip types > will show up in the future, I would like to argue that a per-chip- type > pre-determined sample() is safer than a uniform seq(). > > I've checked the latest BioC devel version of 'gcrma' and I noticed > that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is > called, but I cannot see any code that actually sample:s anything. > Are those set.seed() left overs or am I missing something? > > /Henrik > >> Kasper >> >> >> >>> One solution would be to move the value of seed to be an argument > with >>> a default value and if NA/NULL the seed is *not* set. That would >>> provide a backward-compatible upgrade to gcrma(). >> This is the solution taken by most people who want a similar >> functionality. I still >> >> >> >>> Comments? >>> >>> Henrik >>> >>>> Kasper >>>> >>>> >>>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: >>>> >>>>> Bill >>>>> Thank you for that suggestion. We will edit gcrma to restore the >>>>> seed in >>>>> the next release. >>>>> Zhijin >>>>> Henrik Bengtsson wrote: >>>>>> Hi, >>>>>> >>>>>> I think you have spotted something very important, and I agree > that >>>>>> the random seed should be reset by any function that set it in >>>>>> order >>>>>> to get reproducible samples. We've been working with a port/ >>>>>> wrapper >>>>>> for gcrma() but this never struck us - thanks for bringing it up. >>>>>> >>>>>> FYI, I've just sent a question to R-devel, as it is more >>>>>> appropriate >>>>>> there, on how to best push the random generator back to the >>>>>> state it >>>>>> was before calling set.seed(). >>>>>> >>>>>> Best, >>>>>> >>>>>> Henrik >>>>>> >>>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> >>>>>> wrote: >>>>>>> Hello, >>>>>>> >>>>>>> While recently using the gcrma package, I noticed some strange >>>>>>> behavior >>>>>>> that I want to report. I was applying the pre-processing >>>>>>> algorithm (using >>>>>>> default settings) inside a resampling scheme and noticed that >>>>>>> subsequent >>>>>>> calls of sample() were returning a constant vector of values. >>>>>>> Example >>>>>>> output shown below: >>>>>>> >>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>> Adjusting for optical effect..Done. >>>>>>> Computing affinities.Done. >>>>>>> Adjusting for non-specific binding..Done. >>>>>>> Normalizing >>>>>>> Calculating Expression >>>>>>> >>>>>>>> sample(1:10000,3) >>>>>>> [1] 7030 5473 8109 >>>>>>> >>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>> Adjusting for optical effect..Done. >>>>>>> Computing affinities.Done. >>>>>>> Adjusting for non-specific binding..Done. >>>>>>> Normalizing >>>>>>> Calculating Expression >>>>>>> >>>>>>>> sample(1:10000,3) >>>>>>> [1] 7030 5473 8109 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.6.1 (2007-11-26) >>>>>>> x86_64-redhat-linux-gnu >>>>>>> >>>>>>> locale: >>>>>>> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e >>>>>>> n_ >>>>>>> > US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= >>>>>>> en >>>>>>> > _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ >>>>>>> US >>>>>>> .UTF-8;LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] splines tools stats graphics grDevices utils >>>>>>> datasets >>>>>>> [8] methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >>>>>>> [4] matchprobes_1.10.0 affy_1.16.0 > preprocessCore_1.0.0 >>>>>>> [7] affyio_1.6.1 Biobase_1.16.2 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] rcompgen_0.1-17 >>>>>>> >>>>>>> In looking at the gcrma functions, it appears that gcrma.engine2 >>>>>>> () is >>>>>>> setting the seed without saving and restoring .Random.seed . >>>>>>> This was >>>>>>> easily corrected in the external loop of the resampling scheme, >>>>>>> but I was >>>>>>> unsure whether this would be considered a 'bug' in gcrma. >>>>>>> >>>>>>> Kind regards, >>>>>>> Bill >>>>>>> >>>>>>> >>>>>>> ------------------------------- >>>>>>> William T. Barry, Ph.D. >>>>>>> Dept. of Biostatistics & Bioinformatics, DUMC >>>>>>> Duke Institute for Genome Science & Policy >>>>>>> Hock Plaza, Room 8030 >>>>>>> DUMC Box 2717 >>>>>>> Durham, NC 27710 >>>>>>> >>>>>>> Phone: 919-681-5047 >>>>>>> Fax: 919-668-9335 >>>>>>> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/ >>>>>>> gmane.science.biology.informatics.conductor >>>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/ >>>>>> gmane.science.biology.informatics.conductor >>>>> >>>>> -- >>>>> ------------------------------------------- >>>>> Zhijin (Jean) Wu >>>>> Assistant Professor of Biostatistics >>>>> Brown University, Box G-S121 >>>>> Providence, RI 02912 >>>>> >>>>> Tel: 401 863 1230 >>>>> Fax: 401 863 9182 >>>>> http://stat.brown.edu/~zwu >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/ >>>>> gmane.science.biology.informatics.conductor >>>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > . >
ADD REPLY
0
Entering edit mode
Pierre, I have not tried that. And yes, I would expect to obtain slightly different results with just shuffling the order of the CEL files. S. Keith Anderson -----Original Message----- From: Pierre Neuvial [mailto:pierre.neuvial@curie.fr] Sent: Friday, February 15, 2008 3:22 AM To: Anderson, S. Keith Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] set.seed() in gcrma Hi, Does all this imply that one may (and will, in fact) obtain slightly different results when reanalysing exactly the same CEL files if their order is simply shuffled (e.g. if one of the CEL files is renamed) ?? Pierre. Anderson, S. Keith a ?crit : > Hi All, > > The sample step occurs in the gcrma.engine/gcrma.engine2 sub function. > A random sample of the entire matrix of PM values is selected to > calculate an linear regression coefficient that is used in the > full_model and affinities model. > I believe the set.seed allows one to obtain the same results for any > reanalysis of exactly the same CEL files. > If one adds or subtracts even one CEL file the sampling of that entire > matrix change and a different result is obtained. > > S. Keith Anderson, MS > Statistician, Cancer Center Statistics > Mayo Clinic > email: anderson.s at mayo.edu > > > -----Original Message----- > From: bioconductor-bounces at stat.math.ethz.ch > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik > Bengtsson > Sent: Wednesday, February 13, 2008 2:41 PM > To: Kasper Daniel Hansen > Cc: William T Barry; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] set.seed() in gcrma > > On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen > <khansen at="" stat.berkeley.edu=""> wrote: >> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: >> >>> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen >>> <khansen at="" stat.berkeley.edu=""> wrote: >>>> In my opinion gcrma should _not_ play with the seed. That is >>>> something for the user to decide. >>> It's a trade-off between providing reproducible examples and adding > to >>> the overall confusion. I think it is important to be able to > generate >>> identical results for identical calls, and since it is not clear to >>> most users that the gcRMA parameters are estimated based on a sample >>> one I can see how the current solution avoids the problem of having >>> ask people to set the seed explicitly before calling gcrma() [also I >>> have a hard time to keep track on what is going on in all methods]. >> When a function sets the seed within itself and then does a sample() >> immediately afterwards what it really does is using a fixed index >> vector for arrays of a certain size. To call this "a random sample" >> is simply wrong. It would be easier and more transparent to just >> choose a regular grid like >> seg(from = 1, to = maxIndex, length.out = sampleSize) >> Now some people would argue that it seems wrong to use a regular grid >> like this and that the probes should be picked at random. To which I >> repeat what i just wrote: when you do set.seed you are not selecting >> anything at random - but you are obscuring this fact. If it really is >> important to not always use the same probes, the seed should not be >> set and you should live with the fact that you do not get >> reproducible effects. >> >> I agree that for gcRMA it would make sense to have the results be >> highly reproducible (otherwise I foresee many posts about this), but >> that is achieved even better by using a grid as above. Or perhaps do >> something like having an argument called (say) randomSample = FALSE >> as default and then do something like >> if(randomSample) >> gridIdx = sample(maxIndex, size = sampleSize) >> else >> gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) >> (appropriately modified if you want to exclude certain indices >> corresponding to QC probes). Setting the seed inside a function >> should in my opinion always be discouraged. > > I don't disagree with your arguments. However, there is a risk of > selecting probes/units with common properties if seq() is used. This > comes down to how Affymetrix has outlined the probes on the particular > chip types being analyzed, and since we don't know what chip types > will show up in the future, I would like to argue that a per-chip- type > pre-determined sample() is safer than a uniform seq(). > > I've checked the latest BioC devel version of 'gcrma' and I noticed > that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is > called, but I cannot see any code that actually sample:s anything. > Are those set.seed() left overs or am I missing something? > > /Henrik > >> Kasper >> >> >> >>> One solution would be to move the value of seed to be an argument > with >>> a default value and if NA/NULL the seed is *not* set. That would >>> provide a backward-compatible upgrade to gcrma(). >> This is the solution taken by most people who want a similar >> functionality. I still >> >> >> >>> Comments? >>> >>> Henrik >>> >>>> Kasper >>>> >>>> >>>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: >>>> >>>>> Bill >>>>> Thank you for that suggestion. We will edit gcrma to restore the >>>>> seed in >>>>> the next release. >>>>> Zhijin >>>>> Henrik Bengtsson wrote: >>>>>> Hi, >>>>>> >>>>>> I think you have spotted something very important, and I agree > that >>>>>> the random seed should be reset by any function that set it in >>>>>> order >>>>>> to get reproducible samples. We've been working with a port/ >>>>>> wrapper >>>>>> for gcrma() but this never struck us - thanks for bringing it up. >>>>>> >>>>>> FYI, I've just sent a question to R-devel, as it is more >>>>>> appropriate >>>>>> there, on how to best push the random generator back to the >>>>>> state it >>>>>> was before calling set.seed(). >>>>>> >>>>>> Best, >>>>>> >>>>>> Henrik >>>>>> >>>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> >>>>>> wrote: >>>>>>> Hello, >>>>>>> >>>>>>> While recently using the gcrma package, I noticed some strange >>>>>>> behavior >>>>>>> that I want to report. I was applying the pre-processing >>>>>>> algorithm (using >>>>>>> default settings) inside a resampling scheme and noticed that >>>>>>> subsequent >>>>>>> calls of sample() were returning a constant vector of values. >>>>>>> Example >>>>>>> output shown below: >>>>>>> >>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>> Adjusting for optical effect..Done. >>>>>>> Computing affinities.Done. >>>>>>> Adjusting for non-specific binding..Done. >>>>>>> Normalizing >>>>>>> Calculating Expression >>>>>>> >>>>>>>> sample(1:10000,3) >>>>>>> [1] 7030 5473 8109 >>>>>>> >>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>> Adjusting for optical effect..Done. >>>>>>> Computing affinities.Done. >>>>>>> Adjusting for non-specific binding..Done. >>>>>>> Normalizing >>>>>>> Calculating Expression >>>>>>> >>>>>>>> sample(1:10000,3) >>>>>>> [1] 7030 5473 8109 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.6.1 (2007-11-26) >>>>>>> x86_64-redhat-linux-gnu >>>>>>> >>>>>>> locale: >>>>>>> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e >>>>>>> n_ >>>>>>> > US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= >>>>>>> en >>>>>>> > _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ >>>>>>> US >>>>>>> .UTF-8;LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] splines tools stats graphics grDevices utils >>>>>>> datasets >>>>>>> [8] methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >>>>>>> [4] matchprobes_1.10.0 affy_1.16.0 > preprocessCore_1.0.0 >>>>>>> [7] affyio_1.6.1 Biobase_1.16.2 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] rcompgen_0.1-17 >>>>>>> >>>>>>> In looking at the gcrma functions, it appears that gcrma.engine2 >>>>>>> () is >>>>>>> setting the seed without saving and restoring .Random.seed . >>>>>>> This was >>>>>>> easily corrected in the external loop of the resampling scheme, >>>>>>> but I was >>>>>>> unsure whether this would be considered a 'bug' in gcrma. >>>>>>> >>>>>>> Kind regards, >>>>>>> Bill >>>>>>> >>>>>>> >>>>>>> ------------------------------- >>>>>>> William T. Barry, Ph.D. >>>>>>> Dept. of Biostatistics & Bioinformatics, DUMC >>>>>>> Duke Institute for Genome Science & Policy >>>>>>> Hock Plaza, Room 8030 >>>>>>> DUMC Box 2717 >>>>>>> Durham, NC 27710 >>>>>>> >>>>>>> Phone: 919-681-5047 >>>>>>> Fax: 919-668-9335 >>>>>>> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/ >>>>>>> gmane.science.biology.informatics.conductor >>>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/ >>>>>> gmane.science.biology.informatics.conductor >>>>> >>>>> -- >>>>> ------------------------------------------- >>>>> Zhijin (Jean) Wu >>>>> Assistant Professor of Biostatistics >>>>> Brown University, Box G-S121 >>>>> Providence, RI 02912 >>>>> >>>>> Tel: 401 863 1230 >>>>> Fax: 401 863 9182 >>>>> http://stat.brown.edu/~zwu >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/ >>>>> gmane.science.biology.informatics.conductor >>>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > . >
ADD REPLY
0
Entering edit mode
Anderson, S. Keith scripsit: > Pierre, > > I have not tried that. > And yes, I would expect to obtain slightly different results with just shuffling the order of the CEL files. > > S. Keith Anderson > > -----Original Message----- > From: Pierre Neuvial [mailto:pierre.neuvial at curie.fr] > Sent: Friday, February 15, 2008 3:22 AM > To: Anderson, S. Keith > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] set.seed() in gcrma > > Hi, > > Does all this imply that one may (and will, in fact) obtain slightly different results when reanalysing exactly the same CEL files if their order is simply shuffled (e.g. if one of the CEL files is renamed) ?? > > Pierre. > Hi, Yes, this has been my experience. Also, if you reorder the order of the probes within the probesets in the CDF environment. And there was a discussion some time ago on differences in gcrma results between linux and windows: https://stat.ethz.ch/pipermail/bioconductor/2007-February/015924.html AfaIk the set.seed in the gcrma function was introduced in response to that, followed later by improvements in the sampling schemes as mentioned by Jean in this thread. Given the wide usage and usefulness of the method it would be good to have a better understanding and documentation of its precision under what people deem "small" variations of the input data. Best wishes Wolfgang > > Anderson, S. Keith a ?crit : >> Hi All, >> >> The sample step occurs in the gcrma.engine/gcrma.engine2 sub function. >> A random sample of the entire matrix of PM values is selected to >> calculate an linear regression coefficient that is used in the >> full_model and affinities model. >> I believe the set.seed allows one to obtain the same results for any >> reanalysis of exactly the same CEL files. >> If one adds or subtracts even one CEL file the sampling of that entire >> matrix change and a different result is obtained. >> >> S. Keith Anderson, MS >> Statistician, Cancer Center Statistics >> Mayo Clinic >> email: anderson.s at mayo.edu >> >> >> -----Original Message----- >> From: bioconductor-bounces at stat.math.ethz.ch >> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik >> Bengtsson >> Sent: Wednesday, February 13, 2008 2:41 PM >> To: Kasper Daniel Hansen >> Cc: William T Barry; bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] set.seed() in gcrma >> >> On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen >> <khansen at="" stat.berkeley.edu=""> wrote: >>> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: >>> >>>> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen >>>> <khansen at="" stat.berkeley.edu=""> wrote: >>>>> In my opinion gcrma should _not_ play with the seed. That is >>>>> something for the user to decide. >>>> It's a trade-off between providing reproducible examples and adding >> to >>>> the overall confusion. I think it is important to be able to >> generate >>>> identical results for identical calls, and since it is not clear to >>>> most users that the gcRMA parameters are estimated based on a sample >>>> one I can see how the current solution avoids the problem of having >>>> ask people to set the seed explicitly before calling gcrma() [also I >>>> have a hard time to keep track on what is going on in all methods]. >>> When a function sets the seed within itself and then does a sample() >>> immediately afterwards what it really does is using a fixed index >>> vector for arrays of a certain size. To call this "a random sample" >>> is simply wrong. It would be easier and more transparent to just >>> choose a regular grid like >>> seg(from = 1, to = maxIndex, length.out = sampleSize) >>> Now some people would argue that it seems wrong to use a regular grid >>> like this and that the probes should be picked at random. To which I >>> repeat what i just wrote: when you do set.seed you are not selecting >>> anything at random - but you are obscuring this fact. If it really is >>> important to not always use the same probes, the seed should not be >>> set and you should live with the fact that you do not get >>> reproducible effects. >>> >>> I agree that for gcRMA it would make sense to have the results be >>> highly reproducible (otherwise I foresee many posts about this), but >>> that is achieved even better by using a grid as above. Or perhaps do >>> something like having an argument called (say) randomSample = FALSE >>> as default and then do something like >>> if(randomSample) >>> gridIdx = sample(maxIndex, size = sampleSize) >>> else >>> gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) >>> (appropriately modified if you want to exclude certain indices >>> corresponding to QC probes). Setting the seed inside a function >>> should in my opinion always be discouraged. >> I don't disagree with your arguments. However, there is a risk of >> selecting probes/units with common properties if seq() is used. This >> comes down to how Affymetrix has outlined the probes on the particular >> chip types being analyzed, and since we don't know what chip types >> will show up in the future, I would like to argue that a per-chip- type >> pre-determined sample() is safer than a uniform seq(). >> >> I've checked the latest BioC devel version of 'gcrma' and I noticed >> that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is >> called, but I cannot see any code that actually sample:s anything. >> Are those set.seed() left overs or am I missing something? >> >> /Henrik >> >>> Kasper >>> >>> >>> >>>> One solution would be to move the value of seed to be an argument >> with >>>> a default value and if NA/NULL the seed is *not* set. That would >>>> provide a backward-compatible upgrade to gcrma(). >>> This is the solution taken by most people who want a similar >>> functionality. I still >>> >>> >>> >>>> Comments? >>>> >>>> Henrik >>>> >>>>> Kasper >>>>> >>>>> >>>>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: >>>>> >>>>>> Bill >>>>>> Thank you for that suggestion. We will edit gcrma to restore the >>>>>> seed in >>>>>> the next release. >>>>>> Zhijin >>>>>> Henrik Bengtsson wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I think you have spotted something very important, and I agree >> that >>>>>>> the random seed should be reset by any function that set it in >>>>>>> order >>>>>>> to get reproducible samples. We've been working with a port/ >>>>>>> wrapper >>>>>>> for gcrma() but this never struck us - thanks for bringing it up. >>>>>>> >>>>>>> FYI, I've just sent a question to R-devel, as it is more >>>>>>> appropriate >>>>>>> there, on how to best push the random generator back to the >>>>>>> state it >>>>>>> was before calling set.seed(). >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Henrik >>>>>>> >>>>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> >>>>>>> wrote: >>>>>>>> Hello, >>>>>>>> >>>>>>>> While recently using the gcrma package, I noticed some strange >>>>>>>> behavior >>>>>>>> that I want to report. I was applying the pre-processing >>>>>>>> algorithm (using >>>>>>>> default settings) inside a resampling scheme and noticed that >>>>>>>> subsequent >>>>>>>> calls of sample() were returning a constant vector of values. >>>>>>>> Example >>>>>>>> output shown below: >>>>>>>> >>>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>>> Adjusting for optical effect..Done. >>>>>>>> Computing affinities.Done. >>>>>>>> Adjusting for non-specific binding..Done. >>>>>>>> Normalizing >>>>>>>> Calculating Expression >>>>>>>> >>>>>>>>> sample(1:10000,3) >>>>>>>> [1] 7030 5473 8109 >>>>>>>> >>>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>>> Adjusting for optical effect..Done. >>>>>>>> Computing affinities.Done. >>>>>>>> Adjusting for non-specific binding..Done. >>>>>>>> Normalizing >>>>>>>> Calculating Expression >>>>>>>> >>>>>>>>> sample(1:10000,3) >>>>>>>> [1] 7030 5473 8109 >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 2.6.1 (2007-11-26) >>>>>>>> x86_64-redhat-linux-gnu >>>>>>>> >>>>>>>> locale: >>>>>>>> >> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e >>>>>>>> n_ >>>>>>>> >> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER= >>>>>>>> en >>>>>>>> >> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ >>>>>>>> US >>>>>>>> .UTF-8;LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] splines tools stats graphics grDevices utils >>>>>>>> datasets >>>>>>>> [8] methods base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 >>>>>>>> [4] matchprobes_1.10.0 affy_1.16.0 >> preprocessCore_1.0.0 >>>>>>>> [7] affyio_1.6.1 Biobase_1.16.2 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] rcompgen_0.1-17 >>>>>>>> >>>>>>>> In looking at the gcrma functions, it appears that gcrma.engine2 >>>>>>>> () is >>>>>>>> setting the seed without saving and restoring .Random.seed . >>>>>>>> This was >>>>>>>> easily corrected in the external loop of the resampling scheme, >>>>>>>> but I was >>>>>>>> unsure whether this would be considered a 'bug' in gcrma. >>>>>>>> >>>>>>>> Kind regards, >>>>>>>> Bill >>>>>>>> >>>>>>>> >>>>>>>> ------------------------------- >>>>>>>> William T. Barry, Ph.D. >>>>>>>> Dept. of Biostatistics & Bioinformatics, DUMC >>>>>>>> Duke Institute for Genome Science & Policy >>>>>>>> Hock Plaza, Room 8030 >>>>>>>> DUMC Box 2717 >>>>>>>> Durham, NC 27710 >>>>>>>> >>>>>>>> Phone: 919-681-5047 >>>>>>>> Fax: 919-668-9335 >>>>>>>> >>>>>>>> >>>>>>>> [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: http://news.gmane.org/ >>>>>>>> gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/ >>>>>>> gmane.science.biology.informatics.conductor >>>>>> -- >>>>>> ------------------------------------------- >>>>>> Zhijin (Jean) Wu >>>>>> Assistant Professor of Biostatistics >>>>>> Brown University, Box G-S121 >>>>>> Providence, RI 02912 >>>>>> >>>>>> Tel: 401 863 1230 >>>>>> Fax: 401 863 9182 >>>>>> http://stat.brown.edu/~zwu >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/ >>>>>> gmane.science.biology.informatics.conductor >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> . >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
ADD REPLY
0
Entering edit mode
> Anderson, S. Keith scripsit: >> Pierre, >> I have not tried that. >> And yes, I would expect to obtain slightly different results with just shuffling the order of the CEL files. >> S. Keith Anderson >> -----Original Message----- >> From: Pierre Neuvial [mailto:pierre.neuvial at curie.fr] >> Sent: Friday, February 15, 2008 3:22 AM >> To: Anderson, S. Keith >> Cc: bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] set.seed() in gcrma >> Hi, >> Does all this imply that one may (and will, in fact) obtain slightly different results when reanalysing exactly the same CEL files if their order is simply shuffled (e.g. if one of the CEL files is renamed) ?? Pierre. > > > Hi, > > Yes, this has been my experience. Also, if you reorder the order of the probes within the probesets in the CDF environment. And there was a discussion some time ago on differences in gcrma results between linux and windows: > https://stat.ethz.ch/pipermail/bioconductor/2007-February/015924.html AfaIk the set.seed in the gcrma function was introduced in response to that, followed later by improvements in the sampling schemes as > mentioned by Jean in this thread. > > Given the wide usage and usefulness of the method it would be good to have a better understanding and documentation of its precision under what people deem "small" variations of the input data. > > Best wishes > Wolfgang > I think that the order in which CEL files are submitted, or the order in which probes in probesets are defined should not qualify as a particular of source of variations (in the later case, that order is often the one of probes along the matching reference sequence). If there is a random component, it could probably be explicit and users educated about it and understand that the resulting expression values, as well as the outcome of an analysis is likely to change. I do agree that a certain amount of understanding and documentation of the uncertainty of the results is needed. However, analysis performed with gcRMA might just require to be done "a few times" (just like k-means clustering on random starting seeds is probably often performed several times). With the characterization of the variability of results obtained with gcRMA, one could also look into having procedures for common microarray data analysis tasks that would account for the intrinsic variability of gcRMA. There is actually a chance that this improves results obtained with gcRMA (by cutting down the number of false-positives). Laurent > > >> Anderson, S. Keith a ?crit : >>> Hi All, >>> The sample step occurs in the gcrma.engine/gcrma.engine2 sub function. A random sample of the entire matrix of PM values is selected to calculate an linear regression coefficient that is used in the full_model and affinities model. >>> I believe the set.seed allows one to obtain the same results for any reanalysis of exactly the same CEL files. >>> If one adds or subtracts even one CEL file the sampling of that entire matrix change and a different result is obtained. >>> S. Keith Anderson, MS >>> Statistician, Cancer Center Statistics >>> Mayo Clinic >>> email: anderson.s at mayo.edu >>> -----Original Message----- >>> From: bioconductor-bounces at stat.math.ethz.ch >>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik Bengtsson >>> Sent: Wednesday, February 13, 2008 2:41 PM >>> To: Kasper Daniel Hansen >>> Cc: William T Barry; bioconductor at stat.math.ethz.ch >>> Subject: Re: [BioC] set.seed() in gcrma >>> On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen >>> <khansen at="" stat.berkeley.edu=""> wrote: >>>> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote: >>>>> On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen >>>>> <khansen at="" stat.berkeley.edu=""> wrote: >>>>>> In my opinion gcrma should _not_ play with the seed. That is something for the user to decide. >>>>> It's a trade-off between providing reproducible examples and adding >>> to >>>>> the overall confusion. I think it is important to be able to >>> generate >>>>> identical results for identical calls, and since it is not clear to most users that the gcRMA parameters are estimated based on a sample one I can see how the current solution avoids the problem of having ask people to set the seed explicitly before calling gcrma() [also I have a hard time to keep track on what is going on in all methods]. >>>> When a function sets the seed within itself and then does a sample() immediately afterwards what it really does is using a fixed index vector for arrays of a certain size. To call this "a random sample" is simply wrong. It would be easier and more transparent to just choose a regular grid like >>>> seg(from = 1, to = maxIndex, length.out = sampleSize) >>>> Now some people would argue that it seems wrong to use a regular grid like this and that the probes should be picked at random. To which I repeat what i just wrote: when you do set.seed you are not selecting anything at random - but you are obscuring this fact. If it really is important to not always use the same probes, the seed should not be set and you should live with the fact that you do not get >>>> reproducible effects. >>>> I agree that for gcRMA it would make sense to have the results be highly reproducible (otherwise I foresee many posts about this), but that is achieved even better by using a grid as above. Or perhaps do something like having an argument called (say) randomSample = FALSE as default and then do something like >>>> if(randomSample) >>>> gridIdx = sample(maxIndex, size = sampleSize) >>>> else >>>> gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize) >>>> (appropriately modified if you want to exclude certain indices corresponding to QC probes). Setting the seed inside a function should in my opinion always be discouraged. >>> I don't disagree with your arguments. However, there is a risk of selecting probes/units with common properties if seq() is used. This comes down to how Affymetrix has outlined the probes on the particular chip types being analyzed, and since we don't know what chip types will show up in the future, I would like to argue that a per-chip-type pre-determined sample() is safer than a uniform seq(). >>> I've checked the latest BioC devel version of 'gcrma' and I noticed that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is called, but I cannot see any code that actually sample:s anything. Are those set.seed() left overs or am I missing something? >>> /Henrik >>>> Kasper >>>>> One solution would be to move the value of seed to be an argument >>> with >>>>> a default value and if NA/NULL the seed is *not* set. That would provide a backward-compatible upgrade to gcrma(). >>>> This is the solution taken by most people who want a similar >>>> functionality. I still >>>>> Comments? >>>>> Henrik >>>>>> Kasper >>>>>> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote: >>>>>>> Bill >>>>>>> Thank you for that suggestion. We will edit gcrma to restore the seed in >>>>>>> the next release. >>>>>>> Zhijin >>>>>>> Henrik Bengtsson wrote: >>>>>>>> Hi, >>>>>>>> I think you have spotted something very important, and I agree >>> that >>>>>>>> the random seed should be reset by any function that set it in order >>>>>>>> to get reproducible samples. We've been working with a port/ wrapper >>>>>>>> for gcrma() but this never struck us - thanks for bringing it up. FYI, I've just sent a question to R-devel, as it is more >>>>>>>> appropriate >>>>>>>> there, on how to best push the random generator back to the state it >>>>>>>> was before calling set.seed(). >>>>>>>> Best, >>>>>>>> Henrik >>>>>>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at="" duke.edu=""> wrote: >>>>>>>>> Hello, >>>>>>>>> While recently using the gcrma package, I noticed some strange behavior >>>>>>>>> that I want to report. I was applying the pre-processing algorithm (using >>>>>>>>> default settings) inside a resampling scheme and noticed that subsequent >>>>>>>>> calls of sample() were returning a constant vector of values. Example >>>>>>>>> output shown below: >>>>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>>>> Adjusting for optical effect..Done. >>>>>>>>> Computing affinities.Done. >>>>>>>>> Adjusting for non-specific binding..Done. >>>>>>>>> Normalizing >>>>>>>>> Calculating Expression >>>>>>>>>> sample(1:10000,3) >>>>>>>>> [1] 7030 5473 8109 >>>>>>>>>> temp <- gcrma(celdat[,1:2]) >>>>>>>>> Adjusting for optical effect..Done. >>>>>>>>> Computing affinities.Done. >>>>>>>>> Adjusting for non-specific binding..Done. >>>>>>>>> Normalizing >>>>>>>>> Calculating Expression >>>>>>>>>> sample(1:10000,3) >>>>>>>>> [1] 7030 5473 8109 >>>>>>>>>> sessionInfo() >>>>>>>>> R version 2.6.1 (2007-11-26) >>>>>>>>> x86_64-redhat-linux-gnu >>>>>>>>> locale: >>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e >>>>>>>>> n_ >>> US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER> >>>>>>> en >>> _US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_ >>>>>>>>> US >>>>>>>>> .UTF-8;LC_IDENTIFICATION=C >>>>>>>>> attached base packages: >>>>>>>>> [1] splines tools stats graphics grDevices utils datasets >>>>>>>>> [8] methods base >>>>>>>>> other attached packages: >>>>>>>>> [1] hu6800probe_2.0.0 hu6800cdf_2.0.0 gcrma_2.10.0 [4] matchprobes_1.10.0 affy_1.16.0 >>> preprocessCore_1.0.0 >>>>>>>>> [7] affyio_1.6.1 Biobase_1.16.2 >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] rcompgen_0.1-17 >>>>>>>>> In looking at the gcrma functions, it appears that gcrma.engine2 () is >>>>>>>>> setting the seed without saving and restoring .Random.seed . This was >>>>>>>>> easily corrected in the external loop of the resampling scheme, but I was >>>>>>>>> unsure whether this would be considered a 'bug' in gcrma. Kind regards, >>>>>>>>> Bill >>>>>>>>> ------------------------------- >>>>>>>>> William T. Barry, Ph.D. >>>>>>>>> Dept. of Biostatistics & Bioinformatics, DUMC >>>>>>>>> Duke Institute for Genome Science & Policy >>>>>>>>> Hock Plaza, Room 8030 >>>>>>>>> DUMC Box 2717 >>>>>>>>> Durham, NC 27710 >>>>>>>>> Phone: 919-681-5047 >>>>>>>>> Fax: 919-668-9335 >>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: http://news.gmane.org/ >>>>>>>>> gmane.science.biology.informatics.conductor >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: http://news.gmane.org/ >>>>>>>> gmane.science.biology.informatics.conductor >>>>>>> -- >>>>>>> ------------------------------------------- >>>>>>> Zhijin (Jean) Wu >>>>>>> Assistant Professor of Biostatistics >>>>>>> Brown University, Box G-S121 >>>>>>> Providence, RI 02912 >>>>>>> Tel: 401 863 1230 >>>>>>> Fax: 401 863 9182 >>>>>>> http://stat.brown.edu/~zwu >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/ >>>>>>> gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor . >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > Best wishes > Wolfgang > > ------------------------------------------------------------------ Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY

Login before adding your answer.

Traffic: 579 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6