expresso: performing RMA on NON-Affy data?
1
0
Entering edit mode
@jdelasherasedacuk-1189
Last seen 9.4 years ago
United Kingdom
Dear all, Is there a reasonable manner to coax expresso to normalise data that is not Affymetrix? In other words, would it be possible[1] to create an AffyBatch manually, filling the required slots with my own non-Affy data so that Expresso can work on them? [1] possible it must be, but the effort required has to be conmensurate with the results I'm after ;-) Any other alternative? Thank you. Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
• 2.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States
Hi Jose, Do you want to do RMA, or just normalize? The problem with trying to wedge things into an AffyBatch is that the affy package will then try to find the cdfenv that contains the probe to probeset mappings, so by trying to leverage the AffyBatch infrastructure you will have to also come up with a fake cdf. If you don't have probes that make up a probeset, then I'm not sure the affy package will be of use here. Can you give more details? Best, Jim J.delasHeras at ed.ac.uk wrote: > > Dear all, > > Is there a reasonable manner to coax expresso to normalise data that is > not Affymetrix? > > In other words, would it be possible[1] to create an AffyBatch manually, > filling the required slots with my own non-Affy data so that Expresso > can work on them? > > [1] possible it must be, but the effort required has to be conmensurate > with the results I'm after ;-) > > Any other alternative? > > Thank you. > > Jose > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
ADD COMMENT
0
Entering edit mode
Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > Hi Jose, > > Do you want to do RMA, or just normalize? The problem with trying to > wedge things into an AffyBatch is that the affy package will then try > to find the cdfenv that contains the probe to probeset mappings, so by > trying to leverage the AffyBatch infrastructure you will have to also > come up with a fake cdf. > > If you don't have probes that make up a probeset, then I'm not sure the > affy package will be of use here. > > Can you give more details? > > Best, > > Jim Hi Jim, normalisation is not an issue, it's more to do with the summarisation of probesets and something like 'Expresso' seems like a good way to do what I need (and some other things I don't need). I am dealing with Nimblegen arrays. Both two colour (whole genome promoter arrays, with anything up to 20 probes per probeset), and one colour "a la Affymetrix" (expression arrays, with anything between 3 to 8 probes per probeset). I've been dealing with teh two colour stuff just like I used to deal with my spotted cDNA arrays, using Limma. To summarise the data... I've used several approaches. Mostly I am not interested in the whole 2.7kb that each "promoter region" comprises, so I've taken subsets blah blah... Anyway, I'm happy with the results there. But for the expression data, I have one channel data, just like Affy data. Numblegen provides already normalised and summarised data along with the raw data, and they state they use the RMA procedure which I've come across with when readingabout Affy chips, although I've never analysed Affy data myself. I'm reasonably happy with the data given to me. It looks reasonable. So I want to be able to do that myself rather than depending on their data (thus allowing me to do things a bit differently if I want to), and since the RMA-processed data looks good, I am interested in finding a way to do RMA myself. You're right, the problem with my trying to make an AffyBatch from non Affy data is that I'm going to have to create a cdf-like file... and probably will encounter other obstacles... that's why I thought I'd ask here, as there's people who are very familiar with that structure... In my naivety, it seems it should be a simple enough task... and as I'm using 4 types of arrays mostly... I'd only have to do some work to make these work and then just enjoy the ride as new experiments roll in... Am I naive? ;-) I hope I clarified enough what I'm after. Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
ADD REPLY
0
Entering edit mode
Hi Jose, J.delasHeras at ed.ac.uk wrote: > Hi Jim, > > normalisation is not an issue, it's more to do with the summarisation of > probesets and something like 'Expresso' seems like a good way to do what > I need (and some other things I don't need). > > I am dealing with Nimblegen arrays. Both two colour (whole genome > promoter arrays, with anything up to 20 probes per probeset), and one > colour "a la Affymetrix" (expression arrays, with anything between 3 to > 8 probes per probeset). > > I've been dealing with teh two colour stuff just like I used to deal > with my spotted cDNA arrays, using Limma. To summarise the data... I've > used several approaches. Mostly I am not interested in the whole 2.7kb > that each "promoter region" comprises, so I've taken subsets blah > blah... Anyway, I'm happy with the results there. > But for the expression data, I have one channel data, just like Affy > data. Numblegen provides already normalised and summarised data along > with the raw data, and they state they use the RMA procedure which I've > come across with when readingabout Affy chips, although I've never > analysed Affy data myself. > > I'm reasonably happy with the data given to me. It looks reasonable. So > I want to be able to do that myself rather than depending on their data > (thus allowing me to do things a bit differently if I want to), and > since the RMA-processed data looks good, I am interested in finding a > way to do RMA myself. > > You're right, the problem with my trying to make an AffyBatch from non > Affy data is that I'm going to have to create a cdf-like file... and > probably will encounter other obstacles... that's why I thought I'd ask > here, as there's people who are very familiar with that structure... > > In my naivety, it seems it should be a simple enough task... and as I'm > using 4 types of arrays mostly... I'd only have to do some work to make > these work and then just enjoy the ride as new experiments roll in... > Am I naive? ;-) Nope. Just looking in the wrong place. If you want to analyze Nimblegen single color data, then you need to use the oligo package rather than affy. What package works for which type of oligonucleotide array can be found here: http://www.bioconductor.org/docs/workflows/oligoarrays/ Best, Jim > > I hope I clarified enough what I'm after. > > Jose > > > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
ADD REPLY
0
Entering edit mode
Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > Hi Jose, > > J.delasHeras at ed.ac.uk wrote: >> Hi Jim, >> >> normalisation is not an issue, it's more to do with the >> summarisation of probesets and something like 'Expresso' seems like >> a good way to do what I need (and some other things I don't need). >> >> I am dealing with Nimblegen arrays. Both two colour (whole genome >> promoter arrays, with anything up to 20 probes per probeset), and >> one colour "a la Affymetrix" (expression arrays, with anything >> between 3 to 8 probes per probeset). >> >> I've been dealing with teh two colour stuff just like I used to >> deal with my spotted cDNA arrays, using Limma. To summarise the >> data... I've used several approaches. Mostly I am not interested in >> the whole 2.7kb that each "promoter region" comprises, so I've >> taken subsets blah blah... Anyway, I'm happy with the results there. >> But for the expression data, I have one channel data, just like >> Affy data. Numblegen provides already normalised and summarised >> data along with the raw data, and they state they use the RMA >> procedure which I've come across with when readingabout Affy chips, >> although I've never analysed Affy data myself. >> >> I'm reasonably happy with the data given to me. It looks >> reasonable. So I want to be able to do that myself rather than >> depending on their data (thus allowing me to do things a bit >> differently if I want to), and since the RMA-processed data looks >> good, I am interested in finding a way to do RMA myself. >> >> You're right, the problem with my trying to make an AffyBatch from >> non Affy data is that I'm going to have to create a cdf-like >> file... and probably will encounter other obstacles... that's why I >> thought I'd ask here, as there's people who are very familiar with >> that structure... >> >> In my naivety, it seems it should be a simple enough task... and as >> I'm using 4 types of arrays mostly... I'd only have to do some >> work to make these work and then just enjoy the ride as new >> experiments roll in... >> Am I naive? ;-) > > Nope. Just looking in the wrong place. If you want to analyze Nimblegen > single color data, then you need to use the oligo package rather than > affy. > > What package works for which type of oligonucleotide array can be found here: > > http://www.bioconductor.org/docs/workflows/oligoarrays/ > > Best, > > Jim Hi Jim, thanks for that link. I now vaguely remember the Oligo package. I tried to use it long ago whilst it was still in development but didn't get very far as it required some extra files that Nimblegen didn't routinely provide (although when scanning the images yourself, with NimbleScan, they can be generated easily) and lost interest. I'll recheck, as it's been a while and things have probably improved a lot. regards, Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
ADD REPLY
0
Entering edit mode
On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >> Hi Jose, >> >> Do you want to do RMA, or just normalize? The problem with trying to >> wedge things into an AffyBatch is that the affy package will then try >> to find the cdfenv that contains the probe to probeset mappings, so >> by >> trying to leverage the AffyBatch infrastructure you will have to also >> come up with a fake cdf. >> >> If you don't have probes that make up a probeset, then I'm not sure >> the >> affy package will be of use here. >> >> Can you give more details? >> >> Best, >> >> Jim > > Hi Jim, > > normalisation is not an issue, it's more to do with the > summarisation of probesets and something like 'Expresso' seems like > a good way to do what I need (and some other things I don't need). > > I am dealing with Nimblegen arrays. Both two colour (whole genome > promoter arrays, with anything up to 20 probes per probeset), and > one colour "a la Affymetrix" (expression arrays, with anything > between 3 to 8 probes per probeset). > > I've been dealing with teh two colour stuff just like I used to deal > with my spotted cDNA arrays, using Limma. To summarise the data... > I've used several approaches. Mostly I am not interested in the > whole 2.7kb that each "promoter region" comprises, so I've taken > subsets blah blah... Anyway, I'm happy with the results there. > But for the expression data, I have one channel data, just like Affy > data. Numblegen provides already normalised and summarised data > along with the raw data, and they state they use the RMA procedure > which I've come across with when readingabout Affy chips, although > I've never analysed Affy data myself. > > I'm reasonably happy with the data given to me. It looks reasonable. > So I want to be able to do that myself rather than depending on > their data (thus allowing me to do things a bit differently if I > want to), and since the RMA-processed data looks good, I am > interested in finding a way to do RMA myself. > > You're right, the problem with my trying to make an AffyBatch from > non Affy data is that I'm going to have to create a cdf-like file... > and probably will encounter other obstacles... that's why I thought > I'd ask here, as there's people who are very familiar with that > structure... > > In my naivety, it seems it should be a simple enough task... and as > I'm using 4 types of arrays mostly... I'd only have to do some work > to make these work and then just enjoy the ride as new experiments > roll in... > Am I naive? ;-) It is pretty simple to do what you want, but "simple" is of course in the eye of the beholder - it depends on how familiar you are with the affy structures from a development point of view. I am not familiar with Nimblegen, but that might be much easier, as in working out of the box. Kasper > I hope I clarified enough what I'm after. > > Jose > > > > -- > Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 > 6513374 > Institute for Cell & Molecular Biology Fax: +44 (0)131 > 6507360 > Swann Building, Mayfield Road > University of Edinburgh > Edinburgh EH9 3JR > UK > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Jose, Can I add a (possibly naive) suggestion? You say normalization is not the issue, its the summarization of 3-8 probes per probeset for your 1-colour Nimblegen data. So maybe you just want to fit the log-additive RMA linear model to your data. This is akin to estimating a model with probe effects and chip effects for each probeset ... of which you are really interested in the chip effects. Say you were able to collect your data from a single probeset into a matrix Y (concocted example below): ce <- rnorm(10,mean=6) # 10 samples pe <- rnorm(5) # 5 probes in probeset Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add noise One way to do the fits in a quick and robust way is median polish: f <- medpolish(Y) --------------- > f Median Polish Results (Dataset: "Y") Overall: 6.949745 Row Effects: [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 Column Effects: [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 -4.283675e-01 Residuals: [,1] [,2] [,3] [,4] [,5] [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > dim(Y) [1] 10 5 > f$overall + f$row # estimated chip effects [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 6.514582 [9] 5.691860 7.430009 > ce # true chip effects [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 6.101933 [9] 5.117349 6.905161 --------------- quick sketch ... it would be (fairly) easy to split up your table of log-transformed normalized probe intensities into a matrix for each probeset (e.g. using split() ...), robustly fit the model, extract the chip effects and whammo, there is your table of summarized expression values ... this would only be a few lines of R code and probably not too inefficient (?) ... this is effectively what goes on under the hood of affy::rma() and the like (although it uses C code and in a very general way that uses CDF environments). I suspect you could use the 'oligo' package to do much the same thing, after using pdInfoBuilder() to correctly assign probes to probesets ... ... anyways, just some thoughts. Cheers, Mark On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > > On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > >> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: >> >>> Hi Jose, >>> >>> Do you want to do RMA, or just normalize? The problem with trying to >>> wedge things into an AffyBatch is that the affy package will then >>> try >>> to find the cdfenv that contains the probe to probeset mappings, >>> so by >>> trying to leverage the AffyBatch infrastructure you will have to >>> also >>> come up with a fake cdf. >>> >>> If you don't have probes that make up a probeset, then I'm not >>> sure the >>> affy package will be of use here. >>> >>> Can you give more details? >>> >>> Best, >>> >>> Jim >> >> Hi Jim, >> >> normalisation is not an issue, it's more to do with the >> summarisation of probesets and something like 'Expresso' seems like >> a good way to do what I need (and some other things I don't need). >> >> I am dealing with Nimblegen arrays. Both two colour (whole genome >> promoter arrays, with anything up to 20 probes per probeset), and >> one colour "a la Affymetrix" (expression arrays, with anything >> between 3 to 8 probes per probeset). >> >> I've been dealing with teh two colour stuff just like I used to >> deal with my spotted cDNA arrays, using Limma. To summarise the >> data... I've used several approaches. Mostly I am not interested in >> the whole 2.7kb that each "promoter region" comprises, so I've >> taken subsets blah blah... Anyway, I'm happy with the results there. >> But for the expression data, I have one channel data, just like >> Affy data. Numblegen provides already normalised and summarised >> data along with the raw data, and they state they use the RMA >> procedure which I've come across with when readingabout Affy chips, >> although I've never analysed Affy data myself. >> >> I'm reasonably happy with the data given to me. It looks >> reasonable. So I want to be able to do that myself rather than >> depending on their data (thus allowing me to do things a bit >> differently if I want to), and since the RMA-processed data looks >> good, I am interested in finding a way to do RMA myself. >> >> You're right, the problem with my trying to make an AffyBatch from >> non Affy data is that I'm going to have to create a cdf-like >> file... and probably will encounter other obstacles... that's why I >> thought I'd ask here, as there's people who are very familiar with >> that structure... >> >> In my naivety, it seems it should be a simple enough task... and as >> I'm using 4 types of arrays mostly... I'd only have to do some work >> to make these work and then just enjoy the ride as new experiments >> roll in... >> Am I naive? ;-) > > It is pretty simple to do what you want, but "simple" is of course > in the eye of the beholder - it depends on how familiar you are with > the affy structures from a development point of view. > > I am not familiar with Nimblegen, but that might be much easier, as > in working out of the box. > > Kasper > > >> I hope I clarified enough what I'm after. >> >> Jose >> >> >> >> -- >> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 >> 6513374 >> Institute for Cell & Molecular Biology Fax: +44 (0)131 >> 6507360 >> Swann Building, Mayfield Road >> University of Edinburgh >> Edinburgh EH9 3JR >> UK >> >> -- >> The University of Edinburgh is a charitable body, registered in >> Scotland, with registration number SC005336. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852
ADD REPLY
0
Entering edit mode
dear Mark, i've followed with interest your example and i found something a bit counter intuitive to me, let me run your code again with a seed so that we can all reproduce the numbers: nsamples <- 10 nprobes <- 5 set.seed(123) chipEffect <- rnorm(nsamples, mean=6) names(chipEffect) <- paste("e",1:nsamples,sep="") probeEffect <- rnorm(nprobes) Y <- outer(chipEffect, probeEffect, "+") # probe effect Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise dimnames(Y) <- list(paste("e",1:nsamples,sep=""), paste("p",1:nprobes,sep="")) Y p1 p2 p3 p4 p5 e1 6.842297 5.630669 5.909160 5.437896 5.035330 e2 7.043689 6.213415 6.225986 5.840217 5.059106 e3 8.586128 7.933859 7.953289 7.622725 7.061329 e4 7.364726 6.316509 6.440684 6.259188 5.527053 e5 7.306090 6.614483 6.492012 6.231634 5.595041 e6 8.832364 8.117525 8.046366 7.851080 7.197188 e7 7.663201 6.791223 6.840896 6.568744 5.854843 e8 5.856420 5.184265 5.009171 4.841334 4.145777 e9 6.464340 5.760774 5.930814 5.560690 4.655448 e10 6.715916 5.996310 6.075906 5.642444 4.891318 f <- medpolish(Y) summaryvalues <- f$overall + f$row summaryvalues e1 e2 e3 e4 e5 e6 e7 e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 e9 e10 5.760774 5.911843 chipEffect e1 e2 e3 e4 e5 e6 e7 e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 e9 e10 5.313147 5.554338 but if we actually calculate the mean squared error (MSE) of the previous estimates with respect to the true chip effect and compare it with the MSE of the estimate obtained by simply averaging the observations across probes: sum((summaryvalues-chipEffect)^2) [1] 1.528436 sum((rowMeans(Y)-chipEffect)^2) [1] 0.9438652 averaging seems to me to work better than median polish, am i missing something here? thanks! robert. On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: > Jose, > > Can I add a (possibly naive) suggestion? > > You say normalization is not the issue, its the summarization of 3-8 > probes per probeset for your 1-colour Nimblegen data. So maybe you > just want to fit the log-additive RMA linear model to your data. This > is akin to estimating a model with probe effects and chip effects for > each probeset ... of which you are really interested in the chip > effects. > > Say you were able to collect your data from a single probeset into a > matrix Y (concocted example below): > > ce <- rnorm(10,mean=6) # 10 samples > pe <- rnorm(5) # 5 probes in probeset > Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add > noise > > One way to do the fits in a quick and robust way is median polish: > > f <- medpolish(Y) > > --------------- > > f > > Median Polish Results (Dataset: "Y") > > Overall: 6.949745 > > Row Effects: > [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 > [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > > Column Effects: > [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 > -4.283675e-01 > > Residuals: > [,1] [,2] [,3] [,4] [,5] > [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > > > dim(Y) > [1] 10 5 > > f$overall + f$row # estimated chip effects > [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 > 6.514582 > [9] 5.691860 7.430009 > > ce # true chip effects > [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 > 6.101933 > [9] 5.117349 6.905161 > --------------- > > quick sketch ... it would be (fairly) easy to split up your table of > log-transformed normalized probe intensities into a matrix for each > probeset (e.g. using split() ...), robustly fit the model, extract the > chip effects and whammo, there is your table of summarized expression > values ... this would only be a few lines of R code and probably not > too inefficient (?) ... this is effectively what goes on under the > hood of affy::rma() and the like (although it uses C code and in a > very general way that uses CDF environments). > > I suspect you could use the 'oligo' package to do much the same thing, > after using pdInfoBuilder() to correctly assign probes to probesets ... > > ... anyways, just some thoughts. > > Cheers, > Mark > > > > > > > > On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > > > > > On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > > > >> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >> > >>> Hi Jose, > >>> > >>> Do you want to do RMA, or just normalize? The problem with trying to > >>> wedge things into an AffyBatch is that the affy package will then > >>> try > >>> to find the cdfenv that contains the probe to probeset mappings, > >>> so by > >>> trying to leverage the AffyBatch infrastructure you will have to > >>> also > >>> come up with a fake cdf. > >>> > >>> If you don't have probes that make up a probeset, then I'm not > >>> sure the > >>> affy package will be of use here. > >>> > >>> Can you give more details? > >>> > >>> Best, > >>> > >>> Jim > >> > >> Hi Jim, > >> > >> normalisation is not an issue, it's more to do with the > >> summarisation of probesets and something like 'Expresso' seems like > >> a good way to do what I need (and some other things I don't need). > >> > >> I am dealing with Nimblegen arrays. Both two colour (whole genome > >> promoter arrays, with anything up to 20 probes per probeset), and > >> one colour "a la Affymetrix" (expression arrays, with anything > >> between 3 to 8 probes per probeset). > >> > >> I've been dealing with teh two colour stuff just like I used to > >> deal with my spotted cDNA arrays, using Limma. To summarise the > >> data... I've used several approaches. Mostly I am not interested in > >> the whole 2.7kb that each "promoter region" comprises, so I've > >> taken subsets blah blah... Anyway, I'm happy with the results there. > >> But for the expression data, I have one channel data, just like > >> Affy data. Numblegen provides already normalised and summarised > >> data along with the raw data, and they state they use the RMA > >> procedure which I've come across with when readingabout Affy chips, > >> although I've never analysed Affy data myself. > >> > >> I'm reasonably happy with the data given to me. It looks > >> reasonable. So I want to be able to do that myself rather than > >> depending on their data (thus allowing me to do things a bit > >> differently if I want to), and since the RMA-processed data looks > >> good, I am interested in finding a way to do RMA myself. > >> > >> You're right, the problem with my trying to make an AffyBatch from > >> non Affy data is that I'm going to have to create a cdf-like > >> file... and probably will encounter other obstacles... that's why I > >> thought I'd ask here, as there's people who are very familiar with > >> that structure... > >> > >> In my naivety, it seems it should be a simple enough task... and as > >> I'm using 4 types of arrays mostly... I'd only have to do some work > >> to make these work and then just enjoy the ride as new experiments > >> roll in... > >> Am I naive? ;-) > > > > It is pretty simple to do what you want, but "simple" is of course > > in the eye of the beholder - it depends on how familiar you are with > > the affy structures from a development point of view. > > > > I am not familiar with Nimblegen, but that might be much easier, as > > in working out of the box. > > > > Kasper > > > > > >> I hope I clarified enough what I'm after. > >> > >> Jose > >> > >> > >> > >> -- > >> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > >> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 > >> 6513374 > >> Institute for Cell & Molecular Biology Fax: +44 (0)131 > >> 6507360 > >> Swann Building, Mayfield Road > >> University of Edinburgh > >> Edinburgh EH9 3JR > >> UK > >> > >> -- > >> The University of Edinburgh is a charitable body, registered in > >> Scotland, with registration number SC005336. > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ------------------------------ > Mark Robinson > Epigenetics Laboratory, Garvan > Bioinformatics Division, WEHI > e: m.robinson at garvan.org.au > e: mrobinson at wehi.edu.au > p: +61 (0)3 9345 2628 > f: +61 (0)3 9347 0852 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Hi Robert, Robert Castelo wrote: > dear Mark, > > i've followed with interest your example and i found something a bit > counter intuitive to me, let me run your code again with a seed so that > we can all reproduce the numbers: > > nsamples <- 10 > nprobes <- 5 > set.seed(123) > chipEffect <- rnorm(nsamples, mean=6) > names(chipEffect) <- paste("e",1:nsamples,sep="") > probeEffect <- rnorm(nprobes) > Y <- outer(chipEffect, probeEffect, "+") # probe effect > Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise > dimnames(Y) <- list(paste("e",1:nsamples,sep=""), > paste("p",1:nprobes,sep="")) > > Y > p1 p2 p3 p4 p5 > e1 6.842297 5.630669 5.909160 5.437896 5.035330 > e2 7.043689 6.213415 6.225986 5.840217 5.059106 > e3 8.586128 7.933859 7.953289 7.622725 7.061329 > e4 7.364726 6.316509 6.440684 6.259188 5.527053 > e5 7.306090 6.614483 6.492012 6.231634 5.595041 > e6 8.832364 8.117525 8.046366 7.851080 7.197188 > e7 7.663201 6.791223 6.840896 6.568744 5.854843 > e8 5.856420 5.184265 5.009171 4.841334 4.145777 > e9 6.464340 5.760774 5.930814 5.560690 4.655448 > e10 6.715916 5.996310 6.075906 5.642444 4.891318 > > f <- medpolish(Y) > > summaryvalues <- f$overall + f$row > summaryvalues > e1 e2 e3 e4 e5 e6 e7 e8 > 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 > e9 e10 > 5.760774 5.911843 > > chipEffect > e1 e2 e3 e4 e5 e6 e7 e8 > 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 > e9 e10 > 5.313147 5.554338 > > but if we actually calculate the mean squared error (MSE) of the > previous estimates with respect to the true chip effect and compare it > with the MSE of the estimate obtained by simply averaging the > observations across probes: > > sum((summaryvalues-chipEffect)^2) > [1] 1.528436 > sum((rowMeans(Y)-chipEffect)^2) > [1] 0.9438652 > > > averaging seems to me to work better than median polish, am i missing > something here? Yes. You are missing the fact that the data from Affy probes usually are not normally distributed. In fact, it is not uncommon for a given probeset to have widely divergent intensity levels for its component probes. Because of the fact that the mean is not robust to outliers, people long ago abandoned methods based on a normal distribution. So yeah, in your case with noisy normally distributed data, a mean decomposition is more powerful than a median decomposition (it is UMP for normally distributed data after all), but in practice it will be pretty ugly. Best, Jim > > thanks! > robert. > > > > On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: >> Jose, >> >> Can I add a (possibly naive) suggestion? >> >> You say normalization is not the issue, its the summarization of 3-8 >> probes per probeset for your 1-colour Nimblegen data. So maybe you >> just want to fit the log-additive RMA linear model to your data. This >> is akin to estimating a model with probe effects and chip effects for >> each probeset ... of which you are really interested in the chip >> effects. >> >> Say you were able to collect your data from a single probeset into a >> matrix Y (concocted example below): >> >> ce <- rnorm(10,mean=6) # 10 samples >> pe <- rnorm(5) # 5 probes in probeset >> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add >> noise >> >> One way to do the fits in a quick and robust way is median polish: >> >> f <- medpolish(Y) >> >> --------------- >> > f >> >> Median Polish Results (Dataset: "Y") >> >> Overall: 6.949745 >> >> Row Effects: >> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 >> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 >> >> Column Effects: >> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 >> -4.283675e-01 >> >> Residuals: >> [,1] [,2] [,3] [,4] [,5] >> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 >> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 >> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 >> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 >> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 >> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 >> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 >> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 >> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 >> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 >> >> > dim(Y) >> [1] 10 5 >> > f$overall + f$row # estimated chip effects >> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 >> 6.514582 >> [9] 5.691860 7.430009 >> > ce # true chip effects >> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 >> 6.101933 >> [9] 5.117349 6.905161 >> --------------- >> >> quick sketch ... it would be (fairly) easy to split up your table of >> log-transformed normalized probe intensities into a matrix for each >> probeset (e.g. using split() ...), robustly fit the model, extract the >> chip effects and whammo, there is your table of summarized expression >> values ... this would only be a few lines of R code and probably not >> too inefficient (?) ... this is effectively what goes on under the >> hood of affy::rma() and the like (although it uses C code and in a >> very general way that uses CDF environments). >> >> I suspect you could use the 'oligo' package to do much the same thing, >> after using pdInfoBuilder() to correctly assign probes to probesets ... >> >> ... anyways, just some thoughts. >> >> Cheers, >> Mark >> >> >> >> >> >> >> >> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: >> >>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: >>> >>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: >>>> >>>>> Hi Jose, >>>>> >>>>> Do you want to do RMA, or just normalize? The problem with trying to >>>>> wedge things into an AffyBatch is that the affy package will then >>>>> try >>>>> to find the cdfenv that contains the probe to probeset mappings, >>>>> so by >>>>> trying to leverage the AffyBatch infrastructure you will have to >>>>> also >>>>> come up with a fake cdf. >>>>> >>>>> If you don't have probes that make up a probeset, then I'm not >>>>> sure the >>>>> affy package will be of use here. >>>>> >>>>> Can you give more details? >>>>> >>>>> Best, >>>>> >>>>> Jim >>>> Hi Jim, >>>> >>>> normalisation is not an issue, it's more to do with the >>>> summarisation of probesets and something like 'Expresso' seems like >>>> a good way to do what I need (and some other things I don't need). >>>> >>>> I am dealing with Nimblegen arrays. Both two colour (whole genome >>>> promoter arrays, with anything up to 20 probes per probeset), and >>>> one colour "a la Affymetrix" (expression arrays, with anything >>>> between 3 to 8 probes per probeset). >>>> >>>> I've been dealing with teh two colour stuff just like I used to >>>> deal with my spotted cDNA arrays, using Limma. To summarise the >>>> data... I've used several approaches. Mostly I am not interested in >>>> the whole 2.7kb that each "promoter region" comprises, so I've >>>> taken subsets blah blah... Anyway, I'm happy with the results there. >>>> But for the expression data, I have one channel data, just like >>>> Affy data. Numblegen provides already normalised and summarised >>>> data along with the raw data, and they state they use the RMA >>>> procedure which I've come across with when readingabout Affy chips, >>>> although I've never analysed Affy data myself. >>>> >>>> I'm reasonably happy with the data given to me. It looks >>>> reasonable. So I want to be able to do that myself rather than >>>> depending on their data (thus allowing me to do things a bit >>>> differently if I want to), and since the RMA-processed data looks >>>> good, I am interested in finding a way to do RMA myself. >>>> >>>> You're right, the problem with my trying to make an AffyBatch from >>>> non Affy data is that I'm going to have to create a cdf-like >>>> file... and probably will encounter other obstacles... that's why I >>>> thought I'd ask here, as there's people who are very familiar with >>>> that structure... >>>> >>>> In my naivety, it seems it should be a simple enough task... and as >>>> I'm using 4 types of arrays mostly... I'd only have to do some work >>>> to make these work and then just enjoy the ride as new experiments >>>> roll in... >>>> Am I naive? ;-) >>> It is pretty simple to do what you want, but "simple" is of course >>> in the eye of the beholder - it depends on how familiar you are with >>> the affy structures from a development point of view. >>> >>> I am not familiar with Nimblegen, but that might be much easier, as >>> in working out of the box. >>> >>> Kasper >>> >>> >>>> I hope I clarified enough what I'm after. >>>> >>>> Jose >>>> >>>> >>>> >>>> -- >>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 >>>> 6513374 >>>> Institute for Cell & Molecular Biology Fax: +44 (0)131 >>>> 6507360 >>>> Swann Building, Mayfield Road >>>> University of Edinburgh >>>> Edinburgh EH9 3JR >>>> UK >>>> >>>> -- >>>> The University of Edinburgh is a charitable body, registered in >>>> Scotland, with registration number SC005336. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> ------------------------------ >> Mark Robinson >> Epigenetics Laboratory, Garvan >> Bioinformatics Division, WEHI >> e: m.robinson at garvan.org.au >> e: mrobinson at wehi.edu.au >> p: +61 (0)3 9345 2628 >> f: +61 (0)3 9347 0852 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
ADD REPLY
0
Entering edit mode
hi Jim, thanks for clearing this up for me. do you know then what would be the way of simulating the more realistic situation you describe? (such that we would see how median polish outperforms averaging in the summarization step) cheers, robert. On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: > Hi Robert, > > Robert Castelo wrote: > > dear Mark, > > > > i've followed with interest your example and i found something a bit > > counter intuitive to me, let me run your code again with a seed so that > > we can all reproduce the numbers: > > > > nsamples <- 10 > > nprobes <- 5 > > set.seed(123) > > chipEffect <- rnorm(nsamples, mean=6) > > names(chipEffect) <- paste("e",1:nsamples,sep="") > > probeEffect <- rnorm(nprobes) > > Y <- outer(chipEffect, probeEffect, "+") # probe effect > > Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise > > dimnames(Y) <- list(paste("e",1:nsamples,sep=""), > > paste("p",1:nprobes,sep="")) > > > > Y > > p1 p2 p3 p4 p5 > > e1 6.842297 5.630669 5.909160 5.437896 5.035330 > > e2 7.043689 6.213415 6.225986 5.840217 5.059106 > > e3 8.586128 7.933859 7.953289 7.622725 7.061329 > > e4 7.364726 6.316509 6.440684 6.259188 5.527053 > > e5 7.306090 6.614483 6.492012 6.231634 5.595041 > > e6 8.832364 8.117525 8.046366 7.851080 7.197188 > > e7 7.663201 6.791223 6.840896 6.568744 5.854843 > > e8 5.856420 5.184265 5.009171 4.841334 4.145777 > > e9 6.464340 5.760774 5.930814 5.560690 4.655448 > > e10 6.715916 5.996310 6.075906 5.642444 4.891318 > > > > f <- medpolish(Y) > > > > summaryvalues <- f$overall + f$row > > summaryvalues > > e1 e2 e3 e4 e5 e6 e7 e8 > > 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 > > e9 e10 > > 5.760774 5.911843 > > > > chipEffect > > e1 e2 e3 e4 e5 e6 e7 e8 > > 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 > > e9 e10 > > 5.313147 5.554338 > > > > but if we actually calculate the mean squared error (MSE) of the > > previous estimates with respect to the true chip effect and compare it > > with the MSE of the estimate obtained by simply averaging the > > observations across probes: > > > > sum((summaryvalues-chipEffect)^2) > > [1] 1.528436 > > sum((rowMeans(Y)-chipEffect)^2) > > [1] 0.9438652 > > > > > > averaging seems to me to work better than median polish, am i missing > > something here? > > Yes. You are missing the fact that the data from Affy probes usually are > not normally distributed. In fact, it is not uncommon for a given > probeset to have widely divergent intensity levels for its component > probes. Because of the fact that the mean is not robust to outliers, > people long ago abandoned methods based on a normal distribution. > > So yeah, in your case with noisy normally distributed data, a mean > decomposition is more powerful than a median decomposition (it is UMP > for normally distributed data after all), but in practice it will be > pretty ugly. > > Best, > > Jim > > > > > > thanks! > > robert. > > > > > > > > On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: > >> Jose, > >> > >> Can I add a (possibly naive) suggestion? > >> > >> You say normalization is not the issue, its the summarization of 3-8 > >> probes per probeset for your 1-colour Nimblegen data. So maybe you > >> just want to fit the log-additive RMA linear model to your data. This > >> is akin to estimating a model with probe effects and chip effects for > >> each probeset ... of which you are really interested in the chip > >> effects. > >> > >> Say you were able to collect your data from a single probeset into a > >> matrix Y (concocted example below): > >> > >> ce <- rnorm(10,mean=6) # 10 samples > >> pe <- rnorm(5) # 5 probes in probeset > >> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add > >> noise > >> > >> One way to do the fits in a quick and robust way is median polish: > >> > >> f <- medpolish(Y) > >> > >> --------------- > >> > f > >> > >> Median Polish Results (Dataset: "Y") > >> > >> Overall: 6.949745 > >> > >> Row Effects: > >> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 > >> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > >> > >> Column Effects: > >> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 > >> -4.283675e-01 > >> > >> Residuals: > >> [,1] [,2] [,3] [,4] [,5] > >> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > >> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > >> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > >> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > >> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > >> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > >> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > >> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > >> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > >> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > >> > >> > dim(Y) > >> [1] 10 5 > >> > f$overall + f$row # estimated chip effects > >> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 > >> 6.514582 > >> [9] 5.691860 7.430009 > >> > ce # true chip effects > >> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 > >> 6.101933 > >> [9] 5.117349 6.905161 > >> --------------- > >> > >> quick sketch ... it would be (fairly) easy to split up your table of > >> log-transformed normalized probe intensities into a matrix for each > >> probeset (e.g. using split() ...), robustly fit the model, extract the > >> chip effects and whammo, there is your table of summarized expression > >> values ... this would only be a few lines of R code and probably not > >> too inefficient (?) ... this is effectively what goes on under the > >> hood of affy::rma() and the like (although it uses C code and in a > >> very general way that uses CDF environments). > >> > >> I suspect you could use the 'oligo' package to do much the same thing, > >> after using pdInfoBuilder() to correctly assign probes to probesets ... > >> > >> ... anyways, just some thoughts. > >> > >> Cheers, > >> Mark > >> > >> > >> > >> > >> > >> > >> > >> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > >> > >>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > >>> > >>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >>>> > >>>>> Hi Jose, > >>>>> > >>>>> Do you want to do RMA, or just normalize? The problem with trying to > >>>>> wedge things into an AffyBatch is that the affy package will then > >>>>> try > >>>>> to find the cdfenv that contains the probe to probeset mappings, > >>>>> so by > >>>>> trying to leverage the AffyBatch infrastructure you will have to > >>>>> also > >>>>> come up with a fake cdf. > >>>>> > >>>>> If you don't have probes that make up a probeset, then I'm not > >>>>> sure the > >>>>> affy package will be of use here. > >>>>> > >>>>> Can you give more details? > >>>>> > >>>>> Best, > >>>>> > >>>>> Jim > >>>> Hi Jim, > >>>> > >>>> normalisation is not an issue, it's more to do with the > >>>> summarisation of probesets and something like 'Expresso' seems like > >>>> a good way to do what I need (and some other things I don't need). > >>>> > >>>> I am dealing with Nimblegen arrays. Both two colour (whole genome > >>>> promoter arrays, with anything up to 20 probes per probeset), and > >>>> one colour "a la Affymetrix" (expression arrays, with anything > >>>> between 3 to 8 probes per probeset). > >>>> > >>>> I've been dealing with teh two colour stuff just like I used to > >>>> deal with my spotted cDNA arrays, using Limma. To summarise the > >>>> data... I've used several approaches. Mostly I am not interested in > >>>> the whole 2.7kb that each "promoter region" comprises, so I've > >>>> taken subsets blah blah... Anyway, I'm happy with the results there. > >>>> But for the expression data, I have one channel data, just like > >>>> Affy data. Numblegen provides already normalised and summarised > >>>> data along with the raw data, and they state they use the RMA > >>>> procedure which I've come across with when readingabout Affy chips, > >>>> although I've never analysed Affy data myself. > >>>> > >>>> I'm reasonably happy with the data given to me. It looks > >>>> reasonable. So I want to be able to do that myself rather than > >>>> depending on their data (thus allowing me to do things a bit > >>>> differently if I want to), and since the RMA-processed data looks > >>>> good, I am interested in finding a way to do RMA myself. > >>>> > >>>> You're right, the problem with my trying to make an AffyBatch from > >>>> non Affy data is that I'm going to have to create a cdf-like > >>>> file... and probably will encounter other obstacles... that's why I > >>>> thought I'd ask here, as there's people who are very familiar with > >>>> that structure... > >>>> > >>>> In my naivety, it seems it should be a simple enough task... and as > >>>> I'm using 4 types of arrays mostly... I'd only have to do some work > >>>> to make these work and then just enjoy the ride as new experiments > >>>> roll in... > >>>> Am I naive? ;-) > >>> It is pretty simple to do what you want, but "simple" is of course > >>> in the eye of the beholder - it depends on how familiar you are with > >>> the affy structures from a development point of view. > >>> > >>> I am not familiar with Nimblegen, but that might be much easier, as > >>> in working out of the box. > >>> > >>> Kasper > >>> > >>> > >>>> I hope I clarified enough what I'm after. > >>>> > >>>> Jose > >>>> > >>>> > >>>> > >>>> -- > >>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > >>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 > >>>> 6513374 > >>>> Institute for Cell & Molecular Biology Fax: +44 (0)131 > >>>> 6507360 > >>>> Swann Building, Mayfield Road > >>>> University of Edinburgh > >>>> Edinburgh EH9 3JR > >>>> UK > >>>> > >>>> -- > >>>> The University of Edinburgh is a charitable body, registered in > >>>> Scotland, with registration number SC005336. > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> ------------------------------ > >> Mark Robinson > >> Epigenetics Laboratory, Garvan > >> Bioinformatics Division, WEHI > >> e: m.robinson at garvan.org.au > >> e: mrobinson at wehi.edu.au > >> p: +61 (0)3 9345 2628 > >> f: +61 (0)3 9347 0852 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >
ADD REPLY
0
Entering edit mode
Hi Robert, Why would you want to simulate? There have to be hundreds of datasets that have celfiles on GEO alone - if you are interested in this subject why not just get some real data and see which method is better? Best, Jim Robert Castelo wrote: > hi Jim, > > thanks for clearing this up for me. do you know then what would be the > way of simulating the more realistic situation you describe? (such that > we would see how median polish outperforms averaging in the > summarization step) > > cheers, > robert. > > > On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: >> Hi Robert, >> >> Robert Castelo wrote: >>> dear Mark, >>> >>> i've followed with interest your example and i found something a bit >>> counter intuitive to me, let me run your code again with a seed so that >>> we can all reproduce the numbers: >>> >>> nsamples <- 10 >>> nprobes <- 5 >>> set.seed(123) >>> chipEffect <- rnorm(nsamples, mean=6) >>> names(chipEffect) <- paste("e",1:nsamples,sep="") >>> probeEffect <- rnorm(nprobes) >>> Y <- outer(chipEffect, probeEffect, "+") # probe effect >>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise >>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), >>> paste("p",1:nprobes,sep="")) >>> >>> Y >>> p1 p2 p3 p4 p5 >>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 >>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 >>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 >>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 >>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 >>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 >>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 >>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 >>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 >>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 >>> >>> f <- medpolish(Y) >>> >>> summaryvalues <- f$overall + f$row >>> summaryvalues >>> e1 e2 e3 e4 e5 e6 e7 e8 >>> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 >>> e9 e10 >>> 5.760774 5.911843 >>> >>> chipEffect >>> e1 e2 e3 e4 e5 e6 e7 e8 >>> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 >>> e9 e10 >>> 5.313147 5.554338 >>> >>> but if we actually calculate the mean squared error (MSE) of the >>> previous estimates with respect to the true chip effect and compare it >>> with the MSE of the estimate obtained by simply averaging the >>> observations across probes: >>> >>> sum((summaryvalues-chipEffect)^2) >>> [1] 1.528436 >>> sum((rowMeans(Y)-chipEffect)^2) >>> [1] 0.9438652 >>> >>> >>> averaging seems to me to work better than median polish, am i missing >>> something here? >> Yes. You are missing the fact that the data from Affy probes usually are >> not normally distributed. In fact, it is not uncommon for a given >> probeset to have widely divergent intensity levels for its component >> probes. Because of the fact that the mean is not robust to outliers, >> people long ago abandoned methods based on a normal distribution. >> >> So yeah, in your case with noisy normally distributed data, a mean >> decomposition is more powerful than a median decomposition (it is UMP >> for normally distributed data after all), but in practice it will be >> pretty ugly. >> >> Best, >> >> Jim >> >> >>> thanks! >>> robert. >>> >>> >>> >>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: >>>> Jose, >>>> >>>> Can I add a (possibly naive) suggestion? >>>> >>>> You say normalization is not the issue, its the summarization of 3-8 >>>> probes per probeset for your 1-colour Nimblegen data. So maybe you >>>> just want to fit the log-additive RMA linear model to your data. This >>>> is akin to estimating a model with probe effects and chip effects for >>>> each probeset ... of which you are really interested in the chip >>>> effects. >>>> >>>> Say you were able to collect your data from a single probeset into a >>>> matrix Y (concocted example below): >>>> >>>> ce <- rnorm(10,mean=6) # 10 samples >>>> pe <- rnorm(5) # 5 probes in probeset >>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add >>>> noise >>>> >>>> One way to do the fits in a quick and robust way is median polish: >>>> >>>> f <- medpolish(Y) >>>> >>>> --------------- >>>> > f >>>> >>>> Median Polish Results (Dataset: "Y") >>>> >>>> Overall: 6.949745 >>>> >>>> Row Effects: >>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 >>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 >>>> >>>> Column Effects: >>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 >>>> -4.283675e-01 >>>> >>>> Residuals: >>>> [,1] [,2] [,3] [,4] [,5] >>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 >>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 >>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 >>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 >>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 >>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 >>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 >>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 >>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 >>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 >>>> >>>> > dim(Y) >>>> [1] 10 5 >>>> > f$overall + f$row # estimated chip effects >>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 >>>> 6.514582 >>>> [9] 5.691860 7.430009 >>>> > ce # true chip effects >>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 >>>> 6.101933 >>>> [9] 5.117349 6.905161 >>>> --------------- >>>> >>>> quick sketch ... it would be (fairly) easy to split up your table of >>>> log-transformed normalized probe intensities into a matrix for each >>>> probeset (e.g. using split() ...), robustly fit the model, extract the >>>> chip effects and whammo, there is your table of summarized expression >>>> values ... this would only be a few lines of R code and probably not >>>> too inefficient (?) ... this is effectively what goes on under the >>>> hood of affy::rma() and the like (although it uses C code and in a >>>> very general way that uses CDF environments). >>>> >>>> I suspect you could use the 'oligo' package to do much the same thing, >>>> after using pdInfoBuilder() to correctly assign probes to probesets ... >>>> >>>> ... anyways, just some thoughts. >>>> >>>> Cheers, >>>> Mark >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: >>>> >>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: >>>>> >>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: >>>>>> >>>>>>> Hi Jose, >>>>>>> >>>>>>> Do you want to do RMA, or just normalize? The problem with trying to >>>>>>> wedge things into an AffyBatch is that the affy package will then >>>>>>> try >>>>>>> to find the cdfenv that contains the probe to probeset mappings, >>>>>>> so by >>>>>>> trying to leverage the AffyBatch infrastructure you will have to >>>>>>> also >>>>>>> come up with a fake cdf. >>>>>>> >>>>>>> If you don't have probes that make up a probeset, then I'm not >>>>>>> sure the >>>>>>> affy package will be of use here. >>>>>>> >>>>>>> Can you give more details? >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>> Hi Jim, >>>>>> >>>>>> normalisation is not an issue, it's more to do with the >>>>>> summarisation of probesets and something like 'Expresso' seems like >>>>>> a good way to do what I need (and some other things I don't need). >>>>>> >>>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome >>>>>> promoter arrays, with anything up to 20 probes per probeset), and >>>>>> one colour "a la Affymetrix" (expression arrays, with anything >>>>>> between 3 to 8 probes per probeset). >>>>>> >>>>>> I've been dealing with teh two colour stuff just like I used to >>>>>> deal with my spotted cDNA arrays, using Limma. To summarise the >>>>>> data... I've used several approaches. Mostly I am not interested in >>>>>> the whole 2.7kb that each "promoter region" comprises, so I've >>>>>> taken subsets blah blah... Anyway, I'm happy with the results there. >>>>>> But for the expression data, I have one channel data, just like >>>>>> Affy data. Numblegen provides already normalised and summarised >>>>>> data along with the raw data, and they state they use the RMA >>>>>> procedure which I've come across with when readingabout Affy chips, >>>>>> although I've never analysed Affy data myself. >>>>>> >>>>>> I'm reasonably happy with the data given to me. It looks >>>>>> reasonable. So I want to be able to do that myself rather than >>>>>> depending on their data (thus allowing me to do things a bit >>>>>> differently if I want to), and since the RMA-processed data looks >>>>>> good, I am interested in finding a way to do RMA myself. >>>>>> >>>>>> You're right, the problem with my trying to make an AffyBatch from >>>>>> non Affy data is that I'm going to have to create a cdf-like >>>>>> file... and probably will encounter other obstacles... that's why I >>>>>> thought I'd ask here, as there's people who are very familiar with >>>>>> that structure... >>>>>> >>>>>> In my naivety, it seems it should be a simple enough task... and as >>>>>> I'm using 4 types of arrays mostly... I'd only have to do some work >>>>>> to make these work and then just enjoy the ride as new experiments >>>>>> roll in... >>>>>> Am I naive? ;-) >>>>> It is pretty simple to do what you want, but "simple" is of course >>>>> in the eye of the beholder - it depends on how familiar you are with >>>>> the affy structures from a development point of view. >>>>> >>>>> I am not familiar with Nimblegen, but that might be much easier, as >>>>> in working out of the box. >>>>> >>>>> Kasper >>>>> >>>>> >>>>>> I hope I clarified enough what I'm after. >>>>>> >>>>>> Jose >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 >>>>>> 6513374 >>>>>> Institute for Cell & Molecular Biology Fax: +44 (0)131 >>>>>> 6507360 >>>>>> Swann Building, Mayfield Road >>>>>> University of Edinburgh >>>>>> Edinburgh EH9 3JR >>>>>> UK >>>>>> >>>>>> -- >>>>>> The University of Edinburgh is a charitable body, registered in >>>>>> Scotland, with registration number SC005336. >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> ------------------------------ >>>> Mark Robinson >>>> Epigenetics Laboratory, Garvan >>>> Bioinformatics Division, WEHI >>>> e: m.robinson at garvan.org.au >>>> e: mrobinson at wehi.edu.au >>>> p: +61 (0)3 9345 2628 >>>> f: +61 (0)3 9347 0852 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
ADD REPLY
0
Entering edit mode
Let me turn this around. I have never seen a convincing simulation of microarray data. You can find examples of simulated data, but I have never seen simulated data behaving like real data. Kasper On Apr 24, 2009, at 11:35 , James W. MacDonald wrote: > Hi Robert, > > Why would you want to simulate? There have to be hundreds of > datasets that have celfiles on GEO alone - if you are interested in > this subject why not just get some real data and see which method is > better? > > Best, > > Jim > > > > Robert Castelo wrote: >> hi Jim, >> thanks for clearing this up for me. do you know then what would be >> the >> way of simulating the more realistic situation you describe? (such >> that >> we would see how median polish outperforms averaging in the >> summarization step) >> cheers, >> robert. >> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: >>> Hi Robert, >>> >>> Robert Castelo wrote: >>>> dear Mark, >>>> >>>> i've followed with interest your example and i found something a >>>> bit >>>> counter intuitive to me, let me run your code again with a seed >>>> so that >>>> we can all reproduce the numbers: >>>> >>>> nsamples <- 10 >>>> nprobes <- 5 >>>> set.seed(123) >>>> chipEffect <- rnorm(nsamples, mean=6) >>>> names(chipEffect) <- paste("e",1:nsamples,sep="") >>>> probeEffect <- rnorm(nprobes) >>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect >>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise >>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), >>>> paste("p",1:nprobes,sep="")) >>>> >>>> Y >>>> p1 p2 p3 p4 p5 >>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 >>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 >>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 >>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 >>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 >>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 >>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 >>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 >>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 >>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 >>>> >>>> f <- medpolish(Y) >>>> >>>> summaryvalues <- f$overall + f$row >>>> summaryvalues >>>> e1 e2 e3 e4 e5 e6 >>>> e7 e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 >>>> 6.826611 5.052652 e9 e10 5.760774 5.911843 >>>> chipEffect >>>> e1 e2 e3 e4 e5 e6 >>>> e7 e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 >>>> 6.460916 4.734939 e9 e10 5.313147 5.554338 >>>> but if we actually calculate the mean squared error (MSE) of the >>>> previous estimates with respect to the true chip effect and >>>> compare it >>>> with the MSE of the estimate obtained by simply averaging the >>>> observations across probes: >>>> >>>> sum((summaryvalues-chipEffect)^2) >>>> [1] 1.528436 >>>> sum((rowMeans(Y)-chipEffect)^2) >>>> [1] 0.9438652 >>>> >>>> >>>> averaging seems to me to work better than median polish, am i >>>> missing >>>> something here? >>> Yes. You are missing the fact that the data from Affy probes >>> usually are not normally distributed. In fact, it is not uncommon >>> for a given probeset to have widely divergent intensity levels for >>> its component probes. Because of the fact that the mean is not >>> robust to outliers, people long ago abandoned methods based on a >>> normal distribution. >>> >>> So yeah, in your case with noisy normally distributed data, a mean >>> decomposition is more powerful than a median decomposition (it is >>> UMP for normally distributed data after all), but in practice it >>> will be pretty ugly. >>> >>> Best, >>> >>> Jim >>> >>> >>>> thanks! >>>> robert. >>>> >>>> >>>> >>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: >>>>> Jose, >>>>> >>>>> Can I add a (possibly naive) suggestion? >>>>> >>>>> You say normalization is not the issue, its the summarization of >>>>> 3-8 probes per probeset for your 1-colour Nimblegen data. So >>>>> maybe you just want to fit the log-additive RMA linear model to >>>>> your data. This is akin to estimating a model with probe >>>>> effects and chip effects for each probeset ... of which you are >>>>> really interested in the chip effects. >>>>> >>>>> Say you were able to collect your data from a single probeset >>>>> into a matrix Y (concocted example below): >>>>> >>>>> ce <- rnorm(10,mean=6) # 10 samples >>>>> pe <- rnorm(5) # 5 probes in probeset >>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # >>>>> add noise >>>>> >>>>> One way to do the fits in a quick and robust way is median polish: >>>>> >>>>> f <- medpolish(Y) >>>>> >>>>> --------------- >>>>> > f >>>>> >>>>> Median Polish Results (Dataset: "Y") >>>>> >>>>> Overall: 6.949745 >>>>> >>>>> Row Effects: >>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 >>>>> -0.3210895 >>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 >>>>> >>>>> Column Effects: >>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 >>>>> -4.283675e-01 >>>>> >>>>> Residuals: >>>>> [,1] [,2] [,3] [,4] [,5] >>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 >>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 >>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 >>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 >>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 >>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 >>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 >>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 >>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 >>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 >>>>> >>>>> > dim(Y) >>>>> [1] 10 5 >>>>> > f$overall + f$row # estimated chip effects >>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 >>>>> 7.467772 6.514582 >>>>> [9] 5.691860 7.430009 >>>>> > ce # true chip effects >>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 >>>>> 6.963525 6.101933 >>>>> [9] 5.117349 6.905161 >>>>> --------------- >>>>> >>>>> quick sketch ... it would be (fairly) easy to split up your >>>>> table of log-transformed normalized probe intensities into a >>>>> matrix for each probeset (e.g. using split() ...), robustly fit >>>>> the model, extract the chip effects and whammo, there is your >>>>> table of summarized expression values ... this would only be a >>>>> few lines of R code and probably not too inefficient (?) ... >>>>> this is effectively what goes on under the hood of affy::rma() >>>>> and the like (although it uses C code and in a very general way >>>>> that uses CDF environments). >>>>> >>>>> I suspect you could use the 'oligo' package to do much the same >>>>> thing, after using pdInfoBuilder() to correctly assign probes >>>>> to probesets ... >>>>> >>>>> ... anyways, just some thoughts. >>>>> >>>>> Cheers, >>>>> Mark >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: >>>>> >>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: >>>>>> >>>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: >>>>>>> >>>>>>>> Hi Jose, >>>>>>>> >>>>>>>> Do you want to do RMA, or just normalize? The problem with >>>>>>>> trying to >>>>>>>> wedge things into an AffyBatch is that the affy package will >>>>>>>> then try >>>>>>>> to find the cdfenv that contains the probe to probeset >>>>>>>> mappings, so by >>>>>>>> trying to leverage the AffyBatch infrastructure you will have >>>>>>>> to also >>>>>>>> come up with a fake cdf. >>>>>>>> >>>>>>>> If you don't have probes that make up a probeset, then I'm >>>>>>>> not sure the >>>>>>>> affy package will be of use here. >>>>>>>> >>>>>>>> Can you give more details? >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>> Hi Jim, >>>>>>> >>>>>>> normalisation is not an issue, it's more to do with the >>>>>>> summarisation of probesets and something like 'Expresso' seems >>>>>>> like a good way to do what I need (and some other things I >>>>>>> don't need). >>>>>>> >>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole >>>>>>> genome promoter arrays, with anything up to 20 probes per >>>>>>> probeset), and one colour "a la Affymetrix" (expression >>>>>>> arrays, with anything between 3 to 8 probes per probeset). >>>>>>> >>>>>>> I've been dealing with teh two colour stuff just like I used >>>>>>> to deal with my spotted cDNA arrays, using Limma. To >>>>>>> summarise the data... I've used several approaches. Mostly I >>>>>>> am not interested in the whole 2.7kb that each "promoter >>>>>>> region" comprises, so I've taken subsets blah blah... Anyway, >>>>>>> I'm happy with the results there. >>>>>>> But for the expression data, I have one channel data, just >>>>>>> like Affy data. Numblegen provides already normalised and >>>>>>> summarised data along with the raw data, and they state they >>>>>>> use the RMA procedure which I've come across with when >>>>>>> readingabout Affy chips, although I've never analysed Affy >>>>>>> data myself. >>>>>>> >>>>>>> I'm reasonably happy with the data given to me. It looks >>>>>>> reasonable. So I want to be able to do that myself rather >>>>>>> than depending on their data (thus allowing me to do things a >>>>>>> bit differently if I want to), and since the RMA-processed >>>>>>> data looks good, I am interested in finding a way to do RMA >>>>>>> myself. >>>>>>> >>>>>>> You're right, the problem with my trying to make an AffyBatch >>>>>>> from non Affy data is that I'm going to have to create a cdf- >>>>>>> like file... and probably will encounter other obstacles... >>>>>>> that's why I thought I'd ask here, as there's people who are >>>>>>> very familiar with that structure... >>>>>>> >>>>>>> In my naivety, it seems it should be a simple enough task... >>>>>>> and as I'm using 4 types of arrays mostly... I'd only have to >>>>>>> do some work to make these work and then just enjoy the ride >>>>>>> as new experiments roll in... >>>>>>> Am I naive? ;-) >>>>>> It is pretty simple to do what you want, but "simple" is of >>>>>> course in the eye of the beholder - it depends on how familiar >>>>>> you are with the affy structures from a development point of >>>>>> view. >>>>>> >>>>>> I am not familiar with Nimblegen, but that might be much >>>>>> easier, as in working out of the box. >>>>>> >>>>>> Kasper >>>>>> >>>>>> >>>>>>> I hope I clarified enough what I'm after. >>>>>>> >>>>>>> Jose >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 >>>>>>> (0)131 6513374 >>>>>>> Institute for Cell & Molecular Biology Fax: +44 >>>>>>> (0)131 6507360 >>>>>>> Swann Building, Mayfield Road >>>>>>> University of Edinburgh >>>>>>> Edinburgh EH9 3JR >>>>>>> UK >>>>>>> >>>>>>> -- >>>>>>> The University of Edinburgh is a charitable body, registered in >>>>>>> Scotland, with registration number SC005336. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> ------------------------------ >>>>> Mark Robinson >>>>> Epigenetics Laboratory, Garvan >>>>> Bioinformatics Division, WEHI >>>>> e: m.robinson at garvan.org.au >>>>> e: mrobinson at wehi.edu.au >>>>> p: +61 (0)3 9345 2628 >>>>> f: +61 (0)3 9347 0852 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> > > -- > James W. MacDonald, M.S. > Biostatistician > Douglas Lab > University of Michigan > Department of Human Genetics > 5912 Buhl > 1241 E. Catherine St. > Ann Arbor MI 48109-5618 > 734-615-7826
ADD REPLY
0
Entering edit mode
What is meant by "convincing simulation / behaving like real data" here ? That it would pass a sort of Turing test for microarray data ? Simulated data can be extremely handy for a "what if" scenario. L. On Sat, 2009-04-25 at 09:37 -0700, Kasper Daniel Hansen wrote: > Let me turn this around. > > I have never seen a convincing simulation of microarray data. You can > find examples of simulated data, but I have never seen simulated data > behaving like real data. > > Kasper > > On Apr 24, 2009, at 11:35 , James W. MacDonald wrote: > > > Hi Robert, > > > > Why would you want to simulate? There have to be hundreds of > > datasets that have celfiles on GEO alone - if you are interested in > > this subject why not just get some real data and see which method is > > better? > > > > Best, > > > > Jim > > > > > > > > Robert Castelo wrote: > >> hi Jim, > >> thanks for clearing this up for me. do you know then what would be > >> the > >> way of simulating the more realistic situation you describe? (such > >> that > >> we would see how median polish outperforms averaging in the > >> summarization step) > >> cheers, > >> robert. > >> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: > >>> Hi Robert, > >>> > >>> Robert Castelo wrote: > >>>> dear Mark, > >>>> > >>>> i've followed with interest your example and i found something a > >>>> bit > >>>> counter intuitive to me, let me run your code again with a seed > >>>> so that > >>>> we can all reproduce the numbers: > >>>> > >>>> nsamples <- 10 > >>>> nprobes <- 5 > >>>> set.seed(123) > >>>> chipEffect <- rnorm(nsamples, mean=6) > >>>> names(chipEffect) <- paste("e",1:nsamples,sep="") > >>>> probeEffect <- rnorm(nprobes) > >>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect > >>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise > >>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), > >>>> paste("p",1:nprobes,sep="")) > >>>> > >>>> Y > >>>> p1 p2 p3 p4 p5 > >>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 > >>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 > >>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 > >>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 > >>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 > >>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 > >>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 > >>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 > >>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 > >>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 > >>>> > >>>> f <- medpolish(Y) > >>>> > >>>> summaryvalues <- f$overall + f$row > >>>> summaryvalues > >>>> e1 e2 e3 e4 e5 e6 > >>>> e7 e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 > >>>> 6.826611 5.052652 e9 e10 5.760774 5.911843 > >>>> chipEffect > >>>> e1 e2 e3 e4 e5 e6 > >>>> e7 e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 > >>>> 6.460916 4.734939 e9 e10 5.313147 5.554338 > >>>> but if we actually calculate the mean squared error (MSE) of the > >>>> previous estimates with respect to the true chip effect and > >>>> compare it > >>>> with the MSE of the estimate obtained by simply averaging the > >>>> observations across probes: > >>>> > >>>> sum((summaryvalues-chipEffect)^2) > >>>> [1] 1.528436 > >>>> sum((rowMeans(Y)-chipEffect)^2) > >>>> [1] 0.9438652 > >>>> > >>>> > >>>> averaging seems to me to work better than median polish, am i > >>>> missing > >>>> something here? > >>> Yes. You are missing the fact that the data from Affy probes > >>> usually are not normally distributed. In fact, it is not uncommon > >>> for a given probeset to have widely divergent intensity levels for > >>> its component probes. Because of the fact that the mean is not > >>> robust to outliers, people long ago abandoned methods based on a > >>> normal distribution. > >>> > >>> So yeah, in your case with noisy normally distributed data, a mean > >>> decomposition is more powerful than a median decomposition (it is > >>> UMP for normally distributed data after all), but in practice it > >>> will be pretty ugly. > >>> > >>> Best, > >>> > >>> Jim > >>> > >>> > >>>> thanks! > >>>> robert. > >>>> > >>>> > >>>> > >>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: > >>>>> Jose, > >>>>> > >>>>> Can I add a (possibly naive) suggestion? > >>>>> > >>>>> You say normalization is not the issue, its the summarization of > >>>>> 3-8 probes per probeset for your 1-colour Nimblegen data. So > >>>>> maybe you just want to fit the log-additive RMA linear model to > >>>>> your data. This is akin to estimating a model with probe > >>>>> effects and chip effects for each probeset ... of which you are > >>>>> really interested in the chip effects. > >>>>> > >>>>> Say you were able to collect your data from a single probeset > >>>>> into a matrix Y (concocted example below): > >>>>> > >>>>> ce <- rnorm(10,mean=6) # 10 samples > >>>>> pe <- rnorm(5) # 5 probes in probeset > >>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # > >>>>> add noise > >>>>> > >>>>> One way to do the fits in a quick and robust way is median polish: > >>>>> > >>>>> f <- medpolish(Y) > >>>>> > >>>>> --------------- > >>>>> > f > >>>>> > >>>>> Median Polish Results (Dataset: "Y") > >>>>> > >>>>> Overall: 6.949745 > >>>>> > >>>>> Row Effects: > >>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 > >>>>> -0.3210895 > >>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > >>>>> > >>>>> Column Effects: > >>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 > >>>>> -4.283675e-01 > >>>>> > >>>>> Residuals: > >>>>> [,1] [,2] [,3] [,4] [,5] > >>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > >>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > >>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > >>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > >>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > >>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > >>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > >>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > >>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > >>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > >>>>> > >>>>> > dim(Y) > >>>>> [1] 10 5 > >>>>> > f$overall + f$row # estimated chip effects > >>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 > >>>>> 7.467772 6.514582 > >>>>> [9] 5.691860 7.430009 > >>>>> > ce # true chip effects > >>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 > >>>>> 6.963525 6.101933 > >>>>> [9] 5.117349 6.905161 > >>>>> --------------- > >>>>> > >>>>> quick sketch ... it would be (fairly) easy to split up your > >>>>> table of log-transformed normalized probe intensities into a > >>>>> matrix for each probeset (e.g. using split() ...), robustly fit > >>>>> the model, extract the chip effects and whammo, there is your > >>>>> table of summarized expression values ... this would only be a > >>>>> few lines of R code and probably not too inefficient (?) ... > >>>>> this is effectively what goes on under the hood of affy::rma() > >>>>> and the like (although it uses C code and in a very general way > >>>>> that uses CDF environments). > >>>>> > >>>>> I suspect you could use the 'oligo' package to do much the same > >>>>> thing, after using pdInfoBuilder() to correctly assign probes > >>>>> to probesets ... > >>>>> > >>>>> ... anyways, just some thoughts. > >>>>> > >>>>> Cheers, > >>>>> Mark > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > >>>>> > >>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > >>>>>> > >>>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >>>>>>> > >>>>>>>> Hi Jose, > >>>>>>>> > >>>>>>>> Do you want to do RMA, or just normalize? The problem with > >>>>>>>> trying to > >>>>>>>> wedge things into an AffyBatch is that the affy package will > >>>>>>>> then try > >>>>>>>> to find the cdfenv that contains the probe to probeset > >>>>>>>> mappings, so by > >>>>>>>> trying to leverage the AffyBatch infrastructure you will have > >>>>>>>> to also > >>>>>>>> come up with a fake cdf. > >>>>>>>> > >>>>>>>> If you don't have probes that make up a probeset, then I'm > >>>>>>>> not sure the > >>>>>>>> affy package will be of use here. > >>>>>>>> > >>>>>>>> Can you give more details? > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> > >>>>>>>> Jim > >>>>>>> Hi Jim, > >>>>>>> > >>>>>>> normalisation is not an issue, it's more to do with the > >>>>>>> summarisation of probesets and something like 'Expresso' seems > >>>>>>> like a good way to do what I need (and some other things I > >>>>>>> don't need). > >>>>>>> > >>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole > >>>>>>> genome promoter arrays, with anything up to 20 probes per > >>>>>>> probeset), and one colour "a la Affymetrix" (expression > >>>>>>> arrays, with anything between 3 to 8 probes per probeset). > >>>>>>> > >>>>>>> I've been dealing with teh two colour stuff just like I used > >>>>>>> to deal with my spotted cDNA arrays, using Limma. To > >>>>>>> summarise the data... I've used several approaches. Mostly I > >>>>>>> am not interested in the whole 2.7kb that each "promoter > >>>>>>> region" comprises, so I've taken subsets blah blah... Anyway, > >>>>>>> I'm happy with the results there. > >>>>>>> But for the expression data, I have one channel data, just > >>>>>>> like Affy data. Numblegen provides already normalised and > >>>>>>> summarised data along with the raw data, and they state they > >>>>>>> use the RMA procedure which I've come across with when > >>>>>>> readingabout Affy chips, although I've never analysed Affy > >>>>>>> data myself. > >>>>>>> > >>>>>>> I'm reasonably happy with the data given to me. It looks > >>>>>>> reasonable. So I want to be able to do that myself rather > >>>>>>> than depending on their data (thus allowing me to do things a > >>>>>>> bit differently if I want to), and since the RMA-processed > >>>>>>> data looks good, I am interested in finding a way to do RMA > >>>>>>> myself. > >>>>>>> > >>>>>>> You're right, the problem with my trying to make an AffyBatch > >>>>>>> from non Affy data is that I'm going to have to create a cdf- > >>>>>>> like file... and probably will encounter other obstacles... > >>>>>>> that's why I thought I'd ask here, as there's people who are > >>>>>>> very familiar with that structure... > >>>>>>> > >>>>>>> In my naivety, it seems it should be a simple enough task... > >>>>>>> and as I'm using 4 types of arrays mostly... I'd only have to > >>>>>>> do some work to make these work and then just enjoy the ride > >>>>>>> as new experiments roll in... > >>>>>>> Am I naive? ;-) > >>>>>> It is pretty simple to do what you want, but "simple" is of > >>>>>> course in the eye of the beholder - it depends on how familiar > >>>>>> you are with the affy structures from a development point of > >>>>>> view. > >>>>>> > >>>>>> I am not familiar with Nimblegen, but that might be much > >>>>>> easier, as in working out of the box. > >>>>>> > >>>>>> Kasper > >>>>>> > >>>>>> > >>>>>>> I hope I clarified enough what I'm after. > >>>>>>> > >>>>>>> Jose > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > >>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 > >>>>>>> (0)131 6513374 > >>>>>>> Institute for Cell & Molecular Biology Fax: +44 > >>>>>>> (0)131 6507360 > >>>>>>> Swann Building, Mayfield Road > >>>>>>> University of Edinburgh > >>>>>>> Edinburgh EH9 3JR > >>>>>>> UK > >>>>>>> > >>>>>>> -- > >>>>>>> The University of Edinburgh is a charitable body, registered in > >>>>>>> Scotland, with registration number SC005336. > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Bioconductor mailing list > >>>>>>> Bioconductor at stat.math.ethz.ch > >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>> _______________________________________________ > >>>>>> Bioconductor mailing list > >>>>>> Bioconductor at stat.math.ethz.ch > >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> ------------------------------ > >>>>> Mark Robinson > >>>>> Epigenetics Laboratory, Garvan > >>>>> Bioinformatics Division, WEHI > >>>>> e: m.robinson at garvan.org.au > >>>>> e: mrobinson at wehi.edu.au > >>>>> p: +61 (0)3 9345 2628 > >>>>> f: +61 (0)3 9347 0852 > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at stat.math.ethz.ch > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > Douglas Lab > > University of Michigan > > Department of Human Genetics > > 5912 Buhl > > 1241 E. Catherine St. > > Ann Arbor MI 48109-5618 > > 734-615-7826 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
What is meant by "convincing simulation / behaving like real data" here ? That it would pass a sort of Turing test for microarray data ? Simulated data can be extremely handy for a "what if" scenario. L. On Sat, 2009-04-25 at 09:37 -0700, Kasper Daniel Hansen wrote: > Let me turn this around. > > I have never seen a convincing simulation of microarray data. You can > find examples of simulated data, but I have never seen simulated data > behaving like real data. > > Kasper > > On Apr 24, 2009, at 11:35 , James W. MacDonald wrote: > > > Hi Robert, > > > > Why would you want to simulate? There have to be hundreds of > > datasets that have celfiles on GEO alone - if you are interested in > > this subject why not just get some real data and see which method is > > better? > > > > Best, > > > > Jim > > > > > > > > Robert Castelo wrote: > >> hi Jim, > >> thanks for clearing this up for me. do you know then what would be > >> the > >> way of simulating the more realistic situation you describe? (such > >> that > >> we would see how median polish outperforms averaging in the > >> summarization step) > >> cheers, > >> robert. > >> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: > >>> Hi Robert, > >>> > >>> Robert Castelo wrote: > >>>> dear Mark, > >>>> > >>>> i've followed with interest your example and i found something a > >>>> bit > >>>> counter intuitive to me, let me run your code again with a seed > >>>> so that > >>>> we can all reproduce the numbers: > >>>> > >>>> nsamples <- 10 > >>>> nprobes <- 5 > >>>> set.seed(123) > >>>> chipEffect <- rnorm(nsamples, mean=6) > >>>> names(chipEffect) <- paste("e",1:nsamples,sep="") > >>>> probeEffect <- rnorm(nprobes) > >>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect > >>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise > >>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), > >>>> paste("p",1:nprobes,sep="")) > >>>> > >>>> Y > >>>> p1 p2 p3 p4 p5 > >>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 > >>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 > >>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 > >>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 > >>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 > >>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 > >>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 > >>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 > >>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 > >>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 > >>>> > >>>> f <- medpolish(Y) > >>>> > >>>> summaryvalues <- f$overall + f$row > >>>> summaryvalues > >>>> e1 e2 e3 e4 e5 e6 > >>>> e7 e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 > >>>> 6.826611 5.052652 e9 e10 5.760774 5.911843 > >>>> chipEffect > >>>> e1 e2 e3 e4 e5 e6 > >>>> e7 e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 > >>>> 6.460916 4.734939 e9 e10 5.313147 5.554338 > >>>> but if we actually calculate the mean squared error (MSE) of the > >>>> previous estimates with respect to the true chip effect and > >>>> compare it > >>>> with the MSE of the estimate obtained by simply averaging the > >>>> observations across probes: > >>>> > >>>> sum((summaryvalues-chipEffect)^2) > >>>> [1] 1.528436 > >>>> sum((rowMeans(Y)-chipEffect)^2) > >>>> [1] 0.9438652 > >>>> > >>>> > >>>> averaging seems to me to work better than median polish, am i > >>>> missing > >>>> something here? > >>> Yes. You are missing the fact that the data from Affy probes > >>> usually are not normally distributed. In fact, it is not uncommon > >>> for a given probeset to have widely divergent intensity levels for > >>> its component probes. Because of the fact that the mean is not > >>> robust to outliers, people long ago abandoned methods based on a > >>> normal distribution. > >>> > >>> So yeah, in your case with noisy normally distributed data, a mean > >>> decomposition is more powerful than a median decomposition (it is > >>> UMP for normally distributed data after all), but in practice it > >>> will be pretty ugly. > >>> > >>> Best, > >>> > >>> Jim > >>> > >>> > >>>> thanks! > >>>> robert. > >>>> > >>>> > >>>> > >>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: > >>>>> Jose, > >>>>> > >>>>> Can I add a (possibly naive) suggestion? > >>>>> > >>>>> You say normalization is not the issue, its the summarization of > >>>>> 3-8 probes per probeset for your 1-colour Nimblegen data. So > >>>>> maybe you just want to fit the log-additive RMA linear model to > >>>>> your data. This is akin to estimating a model with probe > >>>>> effects and chip effects for each probeset ... of which you are > >>>>> really interested in the chip effects. > >>>>> > >>>>> Say you were able to collect your data from a single probeset > >>>>> into a matrix Y (concocted example below): > >>>>> > >>>>> ce <- rnorm(10,mean=6) # 10 samples > >>>>> pe <- rnorm(5) # 5 probes in probeset > >>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # > >>>>> add noise > >>>>> > >>>>> One way to do the fits in a quick and robust way is median polish: > >>>>> > >>>>> f <- medpolish(Y) > >>>>> > >>>>> --------------- > >>>>> > f > >>>>> > >>>>> Median Polish Results (Dataset: "Y") > >>>>> > >>>>> Overall: 6.949745 > >>>>> > >>>>> Row Effects: > >>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 > >>>>> -0.3210895 > >>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > >>>>> > >>>>> Column Effects: > >>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 > >>>>> -4.283675e-01 > >>>>> > >>>>> Residuals: > >>>>> [,1] [,2] [,3] [,4] [,5] > >>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > >>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > >>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > >>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > >>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > >>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > >>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > >>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > >>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > >>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > >>>>> > >>>>> > dim(Y) > >>>>> [1] 10 5 > >>>>> > f$overall + f$row # estimated chip effects > >>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 > >>>>> 7.467772 6.514582 > >>>>> [9] 5.691860 7.430009 > >>>>> > ce # true chip effects > >>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 > >>>>> 6.963525 6.101933 > >>>>> [9] 5.117349 6.905161 > >>>>> --------------- > >>>>> > >>>>> quick sketch ... it would be (fairly) easy to split up your > >>>>> table of log-transformed normalized probe intensities into a > >>>>> matrix for each probeset (e.g. using split() ...), robustly fit > >>>>> the model, extract the chip effects and whammo, there is your > >>>>> table of summarized expression values ... this would only be a > >>>>> few lines of R code and probably not too inefficient (?) ... > >>>>> this is effectively what goes on under the hood of affy::rma() > >>>>> and the like (although it uses C code and in a very general way > >>>>> that uses CDF environments). > >>>>> > >>>>> I suspect you could use the 'oligo' package to do much the same > >>>>> thing, after using pdInfoBuilder() to correctly assign probes > >>>>> to probesets ... > >>>>> > >>>>> ... anyways, just some thoughts. > >>>>> > >>>>> Cheers, > >>>>> Mark > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > >>>>> > >>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > >>>>>> > >>>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >>>>>>> > >>>>>>>> Hi Jose, > >>>>>>>> > >>>>>>>> Do you want to do RMA, or just normalize? The problem with > >>>>>>>> trying to > >>>>>>>> wedge things into an AffyBatch is that the affy package will > >>>>>>>> then try > >>>>>>>> to find the cdfenv that contains the probe to probeset > >>>>>>>> mappings, so by > >>>>>>>> trying to leverage the AffyBatch infrastructure you will have > >>>>>>>> to also > >>>>>>>> come up with a fake cdf. > >>>>>>>> > >>>>>>>> If you don't have probes that make up a probeset, then I'm > >>>>>>>> not sure the > >>>>>>>> affy package will be of use here. > >>>>>>>> > >>>>>>>> Can you give more details? > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> > >>>>>>>> Jim > >>>>>>> Hi Jim, > >>>>>>> > >>>>>>> normalisation is not an issue, it's more to do with the > >>>>>>> summarisation of probesets and something like 'Expresso' seems > >>>>>>> like a good way to do what I need (and some other things I > >>>>>>> don't need). > >>>>>>> > >>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole > >>>>>>> genome promoter arrays, with anything up to 20 probes per > >>>>>>> probeset), and one colour "a la Affymetrix" (expression > >>>>>>> arrays, with anything between 3 to 8 probes per probeset). > >>>>>>> > >>>>>>> I've been dealing with teh two colour stuff just like I used > >>>>>>> to deal with my spotted cDNA arrays, using Limma. To > >>>>>>> summarise the data... I've used several approaches. Mostly I > >>>>>>> am not interested in the whole 2.7kb that each "promoter > >>>>>>> region" comprises, so I've taken subsets blah blah... Anyway, > >>>>>>> I'm happy with the results there. > >>>>>>> But for the expression data, I have one channel data, just > >>>>>>> like Affy data. Numblegen provides already normalised and > >>>>>>> summarised data along with the raw data, and they state they > >>>>>>> use the RMA procedure which I've come across with when > >>>>>>> readingabout Affy chips, although I've never analysed Affy > >>>>>>> data myself. > >>>>>>> > >>>>>>> I'm reasonably happy with the data given to me. It looks > >>>>>>> reasonable. So I want to be able to do that myself rather > >>>>>>> than depending on their data (thus allowing me to do things a > >>>>>>> bit differently if I want to), and since the RMA-processed > >>>>>>> data looks good, I am interested in finding a way to do RMA > >>>>>>> myself. > >>>>>>> > >>>>>>> You're right, the problem with my trying to make an AffyBatch > >>>>>>> from non Affy data is that I'm going to have to create a cdf- > >>>>>>> like file... and probably will encounter other obstacles... > >>>>>>> that's why I thought I'd ask here, as there's people who are > >>>>>>> very familiar with that structure... > >>>>>>> > >>>>>>> In my naivety, it seems it should be a simple enough task... > >>>>>>> and as I'm using 4 types of arrays mostly... I'd only have to > >>>>>>> do some work to make these work and then just enjoy the ride > >>>>>>> as new experiments roll in... > >>>>>>> Am I naive? ;-) > >>>>>> It is pretty simple to do what you want, but "simple" is of > >>>>>> course in the eye of the beholder - it depends on how familiar > >>>>>> you are with the affy structures from a development point of > >>>>>> view. > >>>>>> > >>>>>> I am not familiar with Nimblegen, but that might be much > >>>>>> easier, as in working out of the box. > >>>>>> > >>>>>> Kasper > >>>>>> > >>>>>> > >>>>>>> I hope I clarified enough what I'm after. > >>>>>>> > >>>>>>> Jose > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > >>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 > >>>>>>> (0)131 6513374 > >>>>>>> Institute for Cell & Molecular Biology Fax: +44 > >>>>>>> (0)131 6507360 > >>>>>>> Swann Building, Mayfield Road > >>>>>>> University of Edinburgh > >>>>>>> Edinburgh EH9 3JR > >>>>>>> UK > >>>>>>> > >>>>>>> -- > >>>>>>> The University of Edinburgh is a charitable body, registered in > >>>>>>> Scotland, with registration number SC005336. > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Bioconductor mailing list > >>>>>>> Bioconductor at stat.math.ethz.ch > >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>> _______________________________________________ > >>>>>> Bioconductor mailing list > >>>>>> Bioconductor at stat.math.ethz.ch > >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> ------------------------------ > >>>>> Mark Robinson > >>>>> Epigenetics Laboratory, Garvan > >>>>> Bioinformatics Division, WEHI > >>>>> e: m.robinson at garvan.org.au > >>>>> e: mrobinson at wehi.edu.au > >>>>> p: +61 (0)3 9345 2628 > >>>>> f: +61 (0)3 9347 0852 > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at stat.math.ethz.ch > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > Douglas Lab > > University of Michigan > > Department of Human Genetics > > 5912 Buhl > > 1241 E. Catherine St. > > Ann Arbor MI 48109-5618 > > 734-615-7826 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
hi Jim, the reason is that i'm teaching a course on microarray data analysis to students who are not familiar with statistics beyond the basic descriptive ones. in front of such audience it has been helpful for me to simulate some data and apply to it the corresponding analysis technique when illustrating how the technique works (after that, then we use it on real data). by simulating data, people sees explicitly the assumptions made behind the mechanism generating these data so that a fraction of them (which makes me already happy) gets to understand why a particular method works better than other one. in the particular case below i'd like to make the point on why the median polish summarization method works better than the taking the arithmetic mean, illustrating somehow what you wisely said about the mean being not robust to outliers but being uniformly more powerful for Gaussian data, etc etc. i know i can download lots of real data, but i don't know how could i demonstrate that a summarization method is better than other one with real data. using some QC technique (MA plots..) ?? i'll appreciate any idea about this too !! :) thanks! robert. On Fri, 2009-04-24 at 14:35 -0400, James W. MacDonald wrote: > Hi Robert, > > Why would you want to simulate? There have to be hundreds of datasets > that have celfiles on GEO alone - if you are interested in this subject > why not just get some real data and see which method is better? > > Best, > > Jim > > > > Robert Castelo wrote: > > hi Jim, > > > > thanks for clearing this up for me. do you know then what would be the > > way of simulating the more realistic situation you describe? (such that > > we would see how median polish outperforms averaging in the > > summarization step) > > > > cheers, > > robert. > > > > > > On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: > >> Hi Robert, > >> > >> Robert Castelo wrote: > >>> dear Mark, > >>> > >>> i've followed with interest your example and i found something a bit > >>> counter intuitive to me, let me run your code again with a seed so that > >>> we can all reproduce the numbers: > >>> > >>> nsamples <- 10 > >>> nprobes <- 5 > >>> set.seed(123) > >>> chipEffect <- rnorm(nsamples, mean=6) > >>> names(chipEffect) <- paste("e",1:nsamples,sep="") > >>> probeEffect <- rnorm(nprobes) > >>> Y <- outer(chipEffect, probeEffect, "+") # probe effect > >>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise > >>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), > >>> paste("p",1:nprobes,sep="")) > >>> > >>> Y > >>> p1 p2 p3 p4 p5 > >>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 > >>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 > >>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 > >>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 > >>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 > >>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 > >>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 > >>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 > >>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 > >>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 > >>> > >>> f <- medpolish(Y) > >>> > >>> summaryvalues <- f$overall + f$row > >>> summaryvalues > >>> e1 e2 e3 e4 e5 e6 e7 e8 > >>> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 > >>> e9 e10 > >>> 5.760774 5.911843 > >>> > >>> chipEffect > >>> e1 e2 e3 e4 e5 e6 e7 e8 > >>> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 > >>> e9 e10 > >>> 5.313147 5.554338 > >>> > >>> but if we actually calculate the mean squared error (MSE) of the > >>> previous estimates with respect to the true chip effect and compare it > >>> with the MSE of the estimate obtained by simply averaging the > >>> observations across probes: > >>> > >>> sum((summaryvalues-chipEffect)^2) > >>> [1] 1.528436 > >>> sum((rowMeans(Y)-chipEffect)^2) > >>> [1] 0.9438652 > >>> > >>> > >>> averaging seems to me to work better than median polish, am i missing > >>> something here? > >> Yes. You are missing the fact that the data from Affy probes usually are > >> not normally distributed. In fact, it is not uncommon for a given > >> probeset to have widely divergent intensity levels for its component > >> probes. Because of the fact that the mean is not robust to outliers, > >> people long ago abandoned methods based on a normal distribution. > >> > >> So yeah, in your case with noisy normally distributed data, a mean > >> decomposition is more powerful than a median decomposition (it is UMP > >> for normally distributed data after all), but in practice it will be > >> pretty ugly. > >> > >> Best, > >> > >> Jim > >> > >> > >>> thanks! > >>> robert. > >>> > >>> > >>> > >>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: > >>>> Jose, > >>>> > >>>> Can I add a (possibly naive) suggestion? > >>>> > >>>> You say normalization is not the issue, its the summarization of 3-8 > >>>> probes per probeset for your 1-colour Nimblegen data. So maybe you > >>>> just want to fit the log-additive RMA linear model to your data. This > >>>> is akin to estimating a model with probe effects and chip effects for > >>>> each probeset ... of which you are really interested in the chip > >>>> effects. > >>>> > >>>> Say you were able to collect your data from a single probeset into a > >>>> matrix Y (concocted example below): > >>>> > >>>> ce <- rnorm(10,mean=6) # 10 samples > >>>> pe <- rnorm(5) # 5 probes in probeset > >>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add > >>>> noise > >>>> > >>>> One way to do the fits in a quick and robust way is median polish: > >>>> > >>>> f <- medpolish(Y) > >>>> > >>>> --------------- > >>>> > f > >>>> > >>>> Median Polish Results (Dataset: "Y") > >>>> > >>>> Overall: 6.949745 > >>>> > >>>> Row Effects: > >>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 > >>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > >>>> > >>>> Column Effects: > >>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 > >>>> -4.283675e-01 > >>>> > >>>> Residuals: > >>>> [,1] [,2] [,3] [,4] [,5] > >>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > >>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > >>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > >>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > >>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > >>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > >>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > >>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > >>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > >>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > >>>> > >>>> > dim(Y) > >>>> [1] 10 5 > >>>> > f$overall + f$row # estimated chip effects > >>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 > >>>> 6.514582 > >>>> [9] 5.691860 7.430009 > >>>> > ce # true chip effects > >>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 > >>>> 6.101933 > >>>> [9] 5.117349 6.905161 > >>>> --------------- > >>>> > >>>> quick sketch ... it would be (fairly) easy to split up your table of > >>>> log-transformed normalized probe intensities into a matrix for each > >>>> probeset (e.g. using split() ...), robustly fit the model, extract the > >>>> chip effects and whammo, there is your table of summarized expression > >>>> values ... this would only be a few lines of R code and probably not > >>>> too inefficient (?) ... this is effectively what goes on under the > >>>> hood of affy::rma() and the like (although it uses C code and in a > >>>> very general way that uses CDF environments). > >>>> > >>>> I suspect you could use the 'oligo' package to do much the same thing, > >>>> after using pdInfoBuilder() to correctly assign probes to probesets ... > >>>> > >>>> ... anyways, just some thoughts. > >>>> > >>>> Cheers, > >>>> Mark > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: > >>>> > >>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: > >>>>> > >>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: > >>>>>> > >>>>>>> Hi Jose, > >>>>>>> > >>>>>>> Do you want to do RMA, or just normalize? The problem with trying to > >>>>>>> wedge things into an AffyBatch is that the affy package will then > >>>>>>> try > >>>>>>> to find the cdfenv that contains the probe to probeset mappings, > >>>>>>> so by > >>>>>>> trying to leverage the AffyBatch infrastructure you will have to > >>>>>>> also > >>>>>>> come up with a fake cdf. > >>>>>>> > >>>>>>> If you don't have probes that make up a probeset, then I'm not > >>>>>>> sure the > >>>>>>> affy package will be of use here. > >>>>>>> > >>>>>>> Can you give more details? > >>>>>>> > >>>>>>> Best, > >>>>>>> > >>>>>>> Jim > >>>>>> Hi Jim, > >>>>>> > >>>>>> normalisation is not an issue, it's more to do with the > >>>>>> summarisation of probesets and something like 'Expresso' seems like > >>>>>> a good way to do what I need (and some other things I don't need). > >>>>>> > >>>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome > >>>>>> promoter arrays, with anything up to 20 probes per probeset), and > >>>>>> one colour "a la Affymetrix" (expression arrays, with anything > >>>>>> between 3 to 8 probes per probeset). > >>>>>> > >>>>>> I've been dealing with teh two colour stuff just like I used to > >>>>>> deal with my spotted cDNA arrays, using Limma. To summarise the > >>>>>> data... I've used several approaches. Mostly I am not interested in > >>>>>> the whole 2.7kb that each "promoter region" comprises, so I've > >>>>>> taken subsets blah blah... Anyway, I'm happy with the results there. > >>>>>> But for the expression data, I have one channel data, just like > >>>>>> Affy data. Numblegen provides already normalised and summarised > >>>>>> data along with the raw data, and they state they use the RMA > >>>>>> procedure which I've come across with when readingabout Affy chips, > >>>>>> although I've never analysed Affy data myself. > >>>>>> > >>>>>> I'm reasonably happy with the data given to me. It looks > >>>>>> reasonable. So I want to be able to do that myself rather than > >>>>>> depending on their data (thus allowing me to do things a bit > >>>>>> differently if I want to), and since the RMA-processed data looks > >>>>>> good, I am interested in finding a way to do RMA myself. > >>>>>> > >>>>>> You're right, the problem with my trying to make an AffyBatch from > >>>>>> non Affy data is that I'm going to have to create a cdf-like > >>>>>> file... and probably will encounter other obstacles... that's why I > >>>>>> thought I'd ask here, as there's people who are very familiar with > >>>>>> that structure... > >>>>>> > >>>>>> In my naivety, it seems it should be a simple enough task... and as > >>>>>> I'm using 4 types of arrays mostly... I'd only have to do some work > >>>>>> to make these work and then just enjoy the ride as new experiments > >>>>>> roll in... > >>>>>> Am I naive? ;-) > >>>>> It is pretty simple to do what you want, but "simple" is of course > >>>>> in the eye of the beholder - it depends on how familiar you are with > >>>>> the affy structures from a development point of view. > >>>>> > >>>>> I am not familiar with Nimblegen, but that might be much easier, as > >>>>> in working out of the box. > >>>>> > >>>>> Kasper > >>>>> > >>>>> > >>>>>> I hope I clarified enough what I'm after. > >>>>>> > >>>>>> Jose > >>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > >>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 > >>>>>> 6513374 > >>>>>> Institute for Cell & Molecular Biology Fax: +44 (0)131 > >>>>>> 6507360 > >>>>>> Swann Building, Mayfield Road > >>>>>> University of Edinburgh > >>>>>> Edinburgh EH9 3JR > >>>>>> UK > >>>>>> > >>>>>> -- > >>>>>> The University of Edinburgh is a charitable body, registered in > >>>>>> Scotland, with registration number SC005336. > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Bioconductor mailing list > >>>>>> Bioconductor at stat.math.ethz.ch > >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at stat.math.ethz.ch > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> ------------------------------ > >>>> Mark Robinson > >>>> Epigenetics Laboratory, Garvan > >>>> Bioinformatics Division, WEHI > >>>> e: m.robinson at garvan.org.au > >>>> e: mrobinson at wehi.edu.au > >>>> p: +61 (0)3 9345 2628 > >>>> f: +61 (0)3 9347 0852 > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > > >
ADD REPLY
0
Entering edit mode
Hi Robert, Robert Castelo wrote: > hi Jim, > > the reason is that i'm teaching a course on microarray data analysis to > students who are not familiar with statistics beyond the basic > descriptive ones. in front of such audience it has been helpful for me > to simulate some data and apply to it the corresponding analysis > technique when illustrating how the technique works (after that, then we > use it on real data). by simulating data, people sees explicitly the > assumptions made behind the mechanism generating these data so that a > fraction of them (which makes me already happy) gets to understand why a > particular method works better than other one. That seems a bit backwards to me - there are no assumptions behind the mechanism generating these data. They just are what they are. The only assumptions being made would be that the data are of a certain distribution (or convolution of one or more distributions) when you were simulating. > > in the particular case below i'd like to make the point on why the > median polish summarization method works better than the taking the > arithmetic mean, illustrating somehow what you wisely said about the > mean being not robust to outliers but being uniformly more powerful for > Gaussian data, etc etc. Why not just show some examples of what real data look like? The Dilution series contains some of the cleanest data around (as it was a spike-in data set run as carefully as possible), and you can easily see what I am talking about just by randomly picking a probeset: library(affy) library(affydata) library(lattice) data(Dilution) a <- pm(Dilution, "1007_s_at") boxplot(a) points(1:4, colMeans(a), pch = 20, col="red", cex=1.2) nam <- factor(rep(colnames(a), each = dim(a)[1])) probes <- rep(1:dim(a)[1], 4) dim(a) <- NULL b <- data.frame(Values = a, Chip = nam, Probes = factor(probes)) barchart(Values~Probes |Chip , data=b) > > i know i can download lots of real data, but i don't know how could i > demonstrate that a summarization method is better than other one with > real data. using some QC technique (MA plots..) ?? i'll appreciate any > idea about this too !! :) > > thanks! > robert. > -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826
ADD REPLY
0
Entering edit mode
hi Jim, On Mon, 2009-04-27 at 09:36 -0400, James W. MacDonald wrote: > Hi Robert, > > Robert Castelo wrote: > > hi Jim, > > > > the reason is that i'm teaching a course on microarray data analysis to > > students who are not familiar with statistics beyond the basic > > descriptive ones. in front of such audience it has been helpful for me > > to simulate some data and apply to it the corresponding analysis > > technique when illustrating how the technique works (after that, then we > > use it on real data). by simulating data, people sees explicitly the > > assumptions made behind the mechanism generating these data so that a > > fraction of them (which makes me already happy) gets to understand why a > > particular method works better than other one. > > That seems a bit backwards to me - there are no assumptions behind the > mechanism generating these data. They just are what they are. The only > assumptions being made would be that the data are of a certain > distribution (or convolution of one or more distributions) when you were > simulating. ups..sorry, i meant "people see explicitly the assumptions made behind the mechanism generating these *simulated* data". these assumptions, as you point out, will be that the data are of a certain distribution (or convolution...) and also will be those related to the (in)dependencies among the random variables that are employed to sample the data. then of course microarray data are just what they are, but the point i try to make to my students is that if, for instance, they learn about assessing differential expression with a Students t-test (i know they should use a modified one, etc.) then simulating the data that meet the Students t-test assumptions would be sampling from independent normal densities, one for each gene. this kind of exercise, i believe, helps them in understanding the assumptions behind the Student's t-test and helps me in illustrating them the possible pitfalls and disadvantages of that particular technique. obviously all this is not necessary for people with a good knowledge of statistics. > > in the particular case below i'd like to make the point on why the > > median polish summarization method works better than the taking the > > arithmetic mean, illustrating somehow what you wisely said about the > > mean being not robust to outliers but being uniformly more powerful for > > Gaussian data, etc etc. > > Why not just show some examples of what real data look like? The > Dilution series contains some of the cleanest data around (as it was a > spike-in data set run as carefully as possible), and you can easily see > what I am talking about just by randomly picking a probeset: > > library(affy) > library(affydata) > library(lattice) > data(Dilution) > a <- pm(Dilution, "1007_s_at") > boxplot(a) > points(1:4, colMeans(a), pch = 20, col="red", cex=1.2) > nam <- factor(rep(colnames(a), each = dim(a)[1])) > probes <- rep(1:dim(a)[1], 4) > dim(a) <- NULL > b <- data.frame(Values = a, Chip = nam, Probes = factor(probes)) > barchart(Values~Probes |Chip , data=b) thanks for the example, we already work with spike-in data sets when looking at why we need to do background correction. i did not have any such strategy in my mind for illustrating the different approaches to summarization and that's why i got interested when i saw the email form Mark with some code illustrating the simulation of probe data for its summarization. cheers, robert. > > > > i know i can download lots of real data, but i don't know how could i > > demonstrate that a summarization method is better than other one with > > real data. using some QC technique (MA plots..) ?? i'll appreciate any > > idea about this too !! :) > > > > thanks! > > robert. > > >
ADD REPLY
0
Entering edit mode
Robert Castelo wrote: > hi Jim, > > the reason is that i'm teaching a course on microarray data analysis to > students who are not familiar with statistics beyond the basic > descriptive ones. in front of such audience it has been helpful for me > to simulate some data and apply to it the corresponding analysis > technique when illustrating how the technique works (after that, then we > use it on real data). by simulating data, people sees explicitly the > assumptions made behind the mechanism generating these data so that a > fraction of them (which makes me already happy) gets to understand why a > particular method works better than other one. The work presented in "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset - Genome Biology 2005 - Choe et al." could be an interesting starting point (and dataset). Otherwise, did you check the package biocDatasets ? It is still very much work-in-progress, but it should already provide a framework for such simulations. > in the particular case below i'd like to make the point on why the > median polish summarization method works better than the taking the > arithmetic mean, illustrating somehow what you wisely said about the > mean being not robust to outliers but being uniformly more powerful for > Gaussian data, etc etc. > > i know i can download lots of real data, but i don't know how could i > demonstrate that a summarization method is better than other one with > real data. Spike-ins datasets and graphical representation have mostly been used to demonstrate the efficiency of processing methods. An early reference is "A benchmark for Affymetrix GeneChip expression measures - Bioinformatics 2003 - Leslie Cope et al.", but there are several more recent ones. Correlation between "expected" values and the actual values after processing is often used. > using some QC technique (MA plots..) ?? i'll appreciate any > idea about this too !! :) L. > thanks! > robert. > > On Fri, 2009-04-24 at 14:35 -0400, James W. MacDonald wrote: >> Hi Robert, >> >> Why would you want to simulate? There have to be hundreds of datasets >> that have celfiles on GEO alone - if you are interested in this subject >> why not just get some real data and see which method is better? >> >> Best, >> >> Jim >> >> >> >> Robert Castelo wrote: >>> hi Jim, >>> >>> thanks for clearing this up for me. do you know then what would be the >>> way of simulating the more realistic situation you describe? (such that >>> we would see how median polish outperforms averaging in the >>> summarization step) >>> >>> cheers, >>> robert. >>> >>> >>> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote: >>>> Hi Robert, >>>> >>>> Robert Castelo wrote: >>>>> dear Mark, >>>>> >>>>> i've followed with interest your example and i found something a bit >>>>> counter intuitive to me, let me run your code again with a seed so that >>>>> we can all reproduce the numbers: >>>>> >>>>> nsamples <- 10 >>>>> nprobes <- 5 >>>>> set.seed(123) >>>>> chipEffect <- rnorm(nsamples, mean=6) >>>>> names(chipEffect) <- paste("e",1:nsamples,sep="") >>>>> probeEffect <- rnorm(nprobes) >>>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect >>>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise >>>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""), >>>>> paste("p",1:nprobes,sep="")) >>>>> >>>>> Y >>>>> p1 p2 p3 p4 p5 >>>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330 >>>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106 >>>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329 >>>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053 >>>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041 >>>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188 >>>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843 >>>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777 >>>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448 >>>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318 >>>>> >>>>> f <- medpolish(Y) >>>>> >>>>> summaryvalues <- f$overall + f$row >>>>> summaryvalues >>>>> e1 e2 e3 e4 e5 e6 e7 e8 >>>>> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 >>>>> e9 e10 >>>>> 5.760774 5.911843 >>>>> >>>>> chipEffect >>>>> e1 e2 e3 e4 e5 e6 e7 e8 >>>>> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 >>>>> e9 e10 >>>>> 5.313147 5.554338 >>>>> >>>>> but if we actually calculate the mean squared error (MSE) of the >>>>> previous estimates with respect to the true chip effect and compare it >>>>> with the MSE of the estimate obtained by simply averaging the >>>>> observations across probes: >>>>> >>>>> sum((summaryvalues-chipEffect)^2) >>>>> [1] 1.528436 >>>>> sum((rowMeans(Y)-chipEffect)^2) >>>>> [1] 0.9438652 >>>>> >>>>> >>>>> averaging seems to me to work better than median polish, am i missing >>>>> something here? >>>> Yes. You are missing the fact that the data from Affy probes usually are >>>> not normally distributed. In fact, it is not uncommon for a given >>>> probeset to have widely divergent intensity levels for its component >>>> probes. Because of the fact that the mean is not robust to outliers, >>>> people long ago abandoned methods based on a normal distribution. >>>> >>>> So yeah, in your case with noisy normally distributed data, a mean >>>> decomposition is more powerful than a median decomposition (it is UMP >>>> for normally distributed data after all), but in practice it will be >>>> pretty ugly. >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>>> thanks! >>>>> robert. >>>>> >>>>> >>>>> >>>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote: >>>>>> Jose, >>>>>> >>>>>> Can I add a (possibly naive) suggestion? >>>>>> >>>>>> You say normalization is not the issue, its the summarization of 3-8 >>>>>> probes per probeset for your 1-colour Nimblegen data. So maybe you >>>>>> just want to fit the log-additive RMA linear model to your data. This >>>>>> is akin to estimating a model with probe effects and chip effects for >>>>>> each probeset ... of which you are really interested in the chip >>>>>> effects. >>>>>> >>>>>> Say you were able to collect your data from a single probeset into a >>>>>> matrix Y (concocted example below): >>>>>> >>>>>> ce <- rnorm(10,mean=6) # 10 samples >>>>>> pe <- rnorm(5) # 5 probes in probeset >>>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add >>>>>> noise >>>>>> >>>>>> One way to do the fits in a quick and robust way is median polish: >>>>>> >>>>>> f <- medpolish(Y) >>>>>> >>>>>> --------------- >>>>>> > f >>>>>> >>>>>> Median Polish Results (Dataset: "Y") >>>>>> >>>>>> Overall: 6.949745 >>>>>> >>>>>> Row Effects: >>>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 >>>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 >>>>>> >>>>>> Column Effects: >>>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 >>>>>> -4.283675e-01 >>>>>> >>>>>> Residuals: >>>>>> [,1] [,2] [,3] [,4] [,5] >>>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 >>>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 >>>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 >>>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 >>>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 >>>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 >>>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 >>>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 >>>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 >>>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 >>>>>> >>>>>> > dim(Y) >>>>>> [1] 10 5 >>>>>> > f$overall + f$row # estimated chip effects >>>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 >>>>>> 6.514582 >>>>>> [9] 5.691860 7.430009 >>>>>> > ce # true chip effects >>>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 >>>>>> 6.101933 >>>>>> [9] 5.117349 6.905161 >>>>>> --------------- >>>>>> >>>>>> quick sketch ... it would be (fairly) easy to split up your table of >>>>>> log-transformed normalized probe intensities into a matrix for each >>>>>> probeset (e.g. using split() ...), robustly fit the model, extract the >>>>>> chip effects and whammo, there is your table of summarized expression >>>>>> values ... this would only be a few lines of R code and probably not >>>>>> too inefficient (?) ... this is effectively what goes on under the >>>>>> hood of affy::rma() and the like (although it uses C code and in a >>>>>> very general way that uses CDF environments). >>>>>> >>>>>> I suspect you could use the 'oligo' package to do much the same thing, >>>>>> after using pdInfoBuilder() to correctly assign probes to probesets ... >>>>>> >>>>>> ... anyways, just some thoughts. >>>>>> >>>>>> Cheers, >>>>>> Mark >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote: >>>>>> >>>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote: >>>>>>> >>>>>>>> Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">: >>>>>>>> >>>>>>>>> Hi Jose, >>>>>>>>> >>>>>>>>> Do you want to do RMA, or just normalize? The problem with trying to >>>>>>>>> wedge things into an AffyBatch is that the affy package will then >>>>>>>>> try >>>>>>>>> to find the cdfenv that contains the probe to probeset mappings, >>>>>>>>> so by >>>>>>>>> trying to leverage the AffyBatch infrastructure you will have to >>>>>>>>> also >>>>>>>>> come up with a fake cdf. >>>>>>>>> >>>>>>>>> If you don't have probes that make up a probeset, then I'm not >>>>>>>>> sure the >>>>>>>>> affy package will be of use here. >>>>>>>>> >>>>>>>>> Can you give more details? >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>> Hi Jim, >>>>>>>> >>>>>>>> normalisation is not an issue, it's more to do with the >>>>>>>> summarisation of probesets and something like 'Expresso' seems like >>>>>>>> a good way to do what I need (and some other things I don't need). >>>>>>>> >>>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome >>>>>>>> promoter arrays, with anything up to 20 probes per probeset), and >>>>>>>> one colour "a la Affymetrix" (expression arrays, with anything >>>>>>>> between 3 to 8 probes per probeset). >>>>>>>> >>>>>>>> I've been dealing with teh two colour stuff just like I used to >>>>>>>> deal with my spotted cDNA arrays, using Limma. To summarise the >>>>>>>> data... I've used several approaches. Mostly I am not interested in >>>>>>>> the whole 2.7kb that each "promoter region" comprises, so I've >>>>>>>> taken subsets blah blah... Anyway, I'm happy with the results there. >>>>>>>> But for the expression data, I have one channel data, just like >>>>>>>> Affy data. Numblegen provides already normalised and summarised >>>>>>>> data along with the raw data, and they state they use the RMA >>>>>>>> procedure which I've come across with when readingabout Affy chips, >>>>>>>> although I've never analysed Affy data myself. >>>>>>>> >>>>>>>> I'm reasonably happy with the data given to me. It looks >>>>>>>> reasonable. So I want to be able to do that myself rather than >>>>>>>> depending on their data (thus allowing me to do things a bit >>>>>>>> differently if I want to), and since the RMA-processed data looks >>>>>>>> good, I am interested in finding a way to do RMA myself. >>>>>>>> >>>>>>>> You're right, the problem with my trying to make an AffyBatch from >>>>>>>> non Affy data is that I'm going to have to create a cdf-like >>>>>>>> file... and probably will encounter other obstacles... that's why I >>>>>>>> thought I'd ask here, as there's people who are very familiar with >>>>>>>> that structure... >>>>>>>> >>>>>>>> In my naivety, it seems it should be a simple enough task... and as >>>>>>>> I'm using 4 types of arrays mostly... I'd only have to do some work >>>>>>>> to make these work and then just enjoy the ride as new experiments >>>>>>>> roll in... >>>>>>>> Am I naive? ;-) >>>>>>> It is pretty simple to do what you want, but "simple" is of course >>>>>>> in the eye of the beholder - it depends on how familiar you are with >>>>>>> the affy structures from a development point of view. >>>>>>> >>>>>>> I am not familiar with Nimblegen, but that might be much easier, as >>>>>>> in working out of the box. >>>>>>> >>>>>>> Kasper >>>>>>> >>>>>>> >>>>>>>> I hope I clarified enough what I'm after. >>>>>>>> >>>>>>>> Jose >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >>>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 >>>>>>>> 6513374 >>>>>>>> Institute for Cell & Molecular Biology Fax: +44 (0)131 >>>>>>>> 6507360 >>>>>>>> Swann Building, Mayfield Road >>>>>>>> University of Edinburgh >>>>>>>> Edinburgh EH9 3JR >>>>>>>> UK >>>>>>>> >>>>>>>> -- >>>>>>>> The University of Edinburgh is a charitable body, registered in >>>>>>>> Scotland, with registration number SC005336. >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at stat.math.ethz.ch >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> ------------------------------ >>>>>> Mark Robinson >>>>>> Epigenetics Laboratory, Garvan >>>>>> Bioinformatics Division, WEHI >>>>>> e: m.robinson at garvan.org.au >>>>>> e: mrobinson at wehi.edu.au >>>>>> p: +61 (0)3 9345 2628 >>>>>> f: +61 (0)3 9347 0852 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Quoting Mark Robinson <mrobinson at="" wehi.edu.au="">: > Jose, > > Can I add a (possibly naive) suggestion? > > You say normalization is not the issue, its the summarization of 3-8 > probes per probeset for your 1-colour Nimblegen data. So maybe you > just want to fit the log-additive RMA linear model to your data. This > is akin to estimating a model with probe effects and chip effects for > each probeset ... of which you are really interested in the chip > effects. > > Say you were able to collect your data from a single probeset into a > matrix Y (concocted example below): > > ce <- rnorm(10,mean=6) # 10 samples > pe <- rnorm(5) # 5 probes in probeset > Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add noise > > One way to do the fits in a quick and robust way is median polish: > > f <- medpolish(Y) > > --------------- >> f > > Median Polish Results (Dataset: "Y") > > Overall: 6.949745 > > Row Effects: > [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 > [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 > > Column Effects: > [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 -4.283675e-01 > > Residuals: > [,1] [,2] [,3] [,4] [,5] > [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 > [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 > [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 > [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 > [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 > [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 > [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 > [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 > [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 > [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 > >> dim(Y) > [1] 10 5 >> f$overall + f$row # estimated chip effects > [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 6.514582 > [9] 5.691860 7.430009 >> ce # true chip effects > [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 6.101933 > [9] 5.117349 6.905161 > --------------- > > quick sketch ... it would be (fairly) easy to split up your table of > log-transformed normalized probe intensities into a matrix for each > probeset (e.g. using split() ...), robustly fit the model, extract the > chip effects and whammo, there is your table of summarized expression > values ... this would only be a few lines of R code and probably not > too inefficient (?) ... this is effectively what goes on under the hood > of affy::rma() and the like (although it uses C code and in a very > general way that uses CDF environments). > > I suspect you could use the 'oligo' package to do much the same thing, > after using pdInfoBuilder() to correctly assign probes to probesets ... > > ... anyways, just some thoughts. > > Cheers, > Mark Hi Mark, Thanks for your email! Your *naive* suggestion actually solves my most important issue. I wanted to apply the method and then learn what happened under the hood, but I should have probably gone the other way ;-) I tried the median polish and it reproduces almost exactly the summarised data I had, so I'm not sure this is the method they employed. I just altered your example code in one place. The example had a matrix where columns corresponded to different chips (10 replicates) and rows corresponded to probes in the probeset (5). One matrix per probeset. In your code you suggested I took: f$overall + f$row (where f=medpolish(matrix)) to obtain the estimated chip effects. This summarises the 10 chips into a single figure... per probe. What I wanted was to summarise each probeset into a single number, so I take instead: f$overall + f$col This gives me the summarised probeset *per chip*, reducing my table of 72000ish rows (probes) x 12 chips to one of 24000ish rows (probesets) x 12 chips. That is great. The obvious question is... would it be reasonable to repeat the process to summarise replicate chips? Essentially it'll use the whole matrix, and the result could be obtained as a single column (24000ish values, one per probeset) like this: f$overall + f$row (as you suggested initially). probably a slow process, but does this make sense at least? Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
ADD REPLY
0
Entering edit mode
Jose. Comments below. On 30/04/2009, at 1:56 AM, J.delasHeras at ed.ac.uk wrote: > Quoting Mark Robinson <mrobinson at="" wehi.edu.au="">: > >> Jose, >> >> Can I add a (possibly naive) suggestion? >> >> You say normalization is not the issue, its the summarization of 3-8 >> probes per probeset for your 1-colour Nimblegen data. So maybe you >> just want to fit the log-additive RMA linear model to your data. >> This >> is akin to estimating a model with probe effects and chip effects for >> each probeset ... of which you are really interested in the chip >> effects. >> >> Say you were able to collect your data from a single probeset into a >> matrix Y (concocted example below): >> >> ce <- rnorm(10,mean=6) # 10 samples >> pe <- rnorm(5) # 5 probes in probeset >> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # >> add noise >> >> One way to do the fits in a quick and robust way is median polish: >> >> f <- medpolish(Y) >> >> --------------- >>> f >> >> Median Polish Results (Dataset: "Y") >> >> Overall: 6.949745 >> >> Row Effects: >> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895 >> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634 >> >> Column Effects: >> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01 >> -4.283675e-01 >> >> Residuals: >> [,1] [,2] [,3] [,4] [,5] >> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587 >> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587 >> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658 >> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717 >> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018 >> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629 >> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533 >> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635 >> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479 >> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157 >> >>> dim(Y) >> [1] 10 5 >>> f$overall + f$row # estimated chip effects >> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 >> 6.514582 >> [9] 5.691860 7.430009 >>> ce # true chip effects >> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 >> 6.101933 >> [9] 5.117349 6.905161 >> --------------- >> >> quick sketch ... it would be (fairly) easy to split up your table of >> log-transformed normalized probe intensities into a matrix for each >> probeset (e.g. using split() ...), robustly fit the model, extract >> the >> chip effects and whammo, there is your table of summarized expression >> values ... this would only be a few lines of R code and probably not >> too inefficient (?) ... this is effectively what goes on under the >> hood >> of affy::rma() and the like (although it uses C code and in a very >> general way that uses CDF environments). >> >> I suspect you could use the 'oligo' package to do much the same >> thing, >> after using pdInfoBuilder() to correctly assign probes to >> probesets ... >> >> ... anyways, just some thoughts. >> >> Cheers, >> Mark > > Hi Mark, > > Thanks for your email! Your *naive* suggestion actually solves my > most important issue. > I wanted to apply the method and then learn what happened under the > hood, but I should have probably gone the other way ;-) > > I tried the median polish and it reproduces almost exactly the > summarised data I had, so I'm not sure this is the method they > employed. Good, glad it was useful. > I just altered your example code in one place. > > The example had a matrix where columns corresponded to different > chips (10 replicates) and rows corresponded to probes in the > probeset (5). One matrix per probeset. > In your code you suggested I took: > > f$overall + f$row (where f=medpolish(matrix)) > > to obtain the estimated chip effects. This summarises the 10 chips > into a single figure... per probe. > What I wanted was to summarise each probeset into a single number, > so I take instead: > > f$overall + f$col Right, my concocted example was simply the transpose of what you have. > This gives me the summarised probeset *per chip*, reducing my table > of 72000ish rows (probes) x 12 chips to one of 24000ish rows > (probesets) x 12 chips. > > That is great. > > The obvious question is... would it be reasonable to repeat the > process to summarise replicate chips? > Essentially it'll use the whole matrix, and the result could be > obtained as a single column (24000ish values, one per probeset) like > this: > > f$overall + f$row (as you suggested initially). > > probably a slow process, but does this make sense at least? I'm not sure yet if this makes sense, but let me explain back what I understand of this. You are proposing to take the whole (72000-by-12, probes by samples) matrix and in the end, get a 24000-by-1 (probesets by 1) vector. I don't think medpolish() will work for you here, not as a single call at least. The median polish model for a single probeset can be written as: Y_{ij} = mu + row_i + col_j + error_{ij} [1] with constraints on row_i and col_j so that the parameter are identifiable ... a reparameterization of [1] gives ... Y_{ij} = probe_i + chip_j + error_{ij} [2] But, maybe what you want to share the chip parameter over all samples (for replicate samples) and just have: Y_{ij} = probe_i + chip + error_{ij} [3] If you wanted to fit everything in one model (keep in mind that this makes a constant variance assumption), you could run a single rlm fit something like: f <- rlm(Ybig ~ probeset + probe) where Ybig is your 72000-by-12 matrix, probeset takes on 24000 different values and probe takes on 72000 different values (and will have the constraint). I don't know if this will run in a reasonable amount of time. Alternatively, you could just take a rowMeans() of your 24000-by-12 matrix from before and it should be close. Or, yet another alternative, fit model [3] separately for each probeset (rlm?) and collect the "chip" estimates in a vector. Lots of options. Hope that helps. Mark > Jose > > -- > Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk > The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 > 6513374 > Institute for Cell & Molecular Biology Fax: +44 (0)131 > 6507360 > Swann Building, Mayfield Road > University of Edinburgh > Edinburgh EH9 3JR > UK > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > ------------------------------ Mark Robinson Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852
ADD REPLY

Login before adding your answer.

Traffic: 731 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6