affyPLM and exon array question

0

Entering edit mode

Henrik Bengtsson ★ 2.4k

@henrik-bengtsson-4333

Last seen 10 months ago

United States

Hi, I would like to follow up with a question to Ben on this (old) thread. You saying below that the time complexity of the fitting algorithm is *not* symmetric in probes and samples - having many samples is not as bad as having many probes. I can guess a few reasons for why this is, but that is of less interest, so my main question is: Because the log-additive model is symmetric in probes and samples, can you just swap samples and probes, i.e. transposing the matrix, and fit the model, and then afterwards swap chip-effect with probe-effect estimates? ...or are there asymmetries in how the algorithm is fitted that breaks this idea, e.g. re-iterative robustifications etc.? Thanks in advance Henrik On 5/1/07, Ben Bolstad <bmb at="" bmbolstad.com=""> wrote: > The slowdown you are observing is due to just a few probesets on the > array. These probesets contain many 1000's of probes. In the current > implementation when you use the command that you specified (fitting the > default model) fitPLM uses a procedure optimized for probesets with > relatively few probes across many arrays and so is pretty quick most of > the time (my experience is that is is not completely unacceptable even > up to about 1000 probes across a large number of arrays, at least on my > machine). > > eg both of the following contain same number of datapoints > > Case I: 11 probes and 1000 arrays > Case II: 1000 probes and 11 probes > > but case I will be a lot quicker than case II in the current > implementation. > > Demonstration code > > > library(affyPLM) > > ### note to any developers out there, the following is UNSUPPORTED > ### and subject to change. DO NOT USE. > > rlm.default.rma.model <- function(y,PsiCode=0,PsiK=1.345){ > + .Call("R_rlm_rma_default_model",y,PsiCode,PsiK,PACKAGE="affyPLM") > + } > > #Case I > > y <- matrix(rnorm(11*1000),11,1000) > > system.time(test <- rlm.default.rma.model(y)) > [1] 0.735 0.032 0.788 0.000 0.000 > > #Case II > > y <- matrix(rnorm(11*1000),1000,11) > > system.time(test <- rlm.default.rma.model(y)) > [1] 19.776 0.508 21.730 0.000 0.000 > > As for workarounds, I am pretty sure that these extremely large > probesets are control probesets of some kind that could be safely > ignored and it is possible to pass a vector of probeset names specifying > a subset to use for fitPLM. > > Best, > > Ben > > On Tue, 2007-05-01 at 12:36 -0700, Allen Day wrote: > > I suspect so, although I haven't tried running rma() directly. > > Just.rma() works fine, and fitPLM is able to RMA normalize internally. > > > > I was able to move this a little further along by patching the mm() > > function to return empty list in the case of a dimensionless pset > > variable. Apparently it is usually a two-column matrix with pm in > > psets[,1] and mm in psets[,2]. Heres the patch. > > http://paste.turbogears.org/paste/1253/plain > > > > This allows me to successfully background correct and normalize with > > RMA through wrapper function fitPLM from the affyPLM library. It's > > taking forever though, even running with minimal options. Here's my > > call: > > > > fitPLM(ab, output.param=list(residuals=FALSE,weights=FALSE,resid.S E=FALSE),verbosity.level=10); > > > > Any advice? > > > > -Allen > > > > On 5/1/07, Crispin Miller <cmiller at="" picr.man.ac.uk=""> wrote: > > > Hi Allen, > > > Does rma() work with your cdf? > > > > > > We've also produced one that works OK with rma() (see the 'exonmap' > > > package vignette for more details, including how to get it). Don't know > > > if that helps? > > > > > > Crispin > > > > > > > > > > > > > > > > -----Original Message----- > > > > From: bioconductor-bounces at stat.math.ethz.ch > > > > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Allen Day > > > > Sent: 01 May 2007 01:32 > > > > To: bioconductor at stat.math.ethz.ch > > > > Subject: [BioC] affyPLM and exon array question > > > > > > > > Hi, > > > > > > > > I've been trying to get NUSE, RLE, and RMA values for > > > > HuEx-1_0-st-v2 (Human "all exon") Affymetrix arrays. > > > > > > > > So far I have successfully read the arrays into an affybatch object. > > > > This required creating the CDF environment, which I have > > > > already done with makecdfenv. I'll be submitting that for > > > > inclusion shortly, but that's another topic. > > > > > > > > After creating the AffyBatch, I try to use affyPLM to do an > > > > RMA model fit. R = 2.4.1, affyPLM = 1.12.0, affy = 1.12.2. > > > > That's where there's trouble, and it appears to be caused by > > > > the lack of mismatch probes on the array. Here's code > > > > illustrating the problem: > > > > > > > > > library( 'affy' ); > > > > > library( 'affyPLM' ); > > > > > ab = read.affybatch( > > > > filenames='/home/allenday/cel/0001.CEL' ); ab; # > > > > > works, output omitted pm( ab ); # works, output omitted mm( ab ); # > > > > > fails! > > > > Error in FUN(X[[1411190]], ...) : subscript out of bounds > > > > > plm = fitPLM( ab ); #same failure in fitPLM, caused by a > > > > call to mm() > > > > > on variable ab; > > > > Error in FUN(X[[1411190]], ...) : subscript out of bounds > > > > > > > > I'm only proficient enough in R and C to track this down -- > > > > I'm don't know R or Bioconductor well enough to know how to > > > > fix it. If I can get this going I will submit a new package > > > > that provides just.nuse() and just.rle() functions. Can > > > > someone give me a pointer for how to make this work? > > > > > > > > Thanks. > > > > > > > > -Allen > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor at stat.math.ethz.ch > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > -------------------------------------------------------- > > > > > > > > > This email is confidential and intended solely for the use of the person(s) ('the intended recipient') to whom it was addressed. Any views or opinions presented are solely those of the author and do not necessarily represent those of the Paterson Institute for Cancer Research or the University of Manchester. It may contain information that is privileged & confidential within the meaning of applicable law. Accordingly any dissemination, distribution, copying, or other use of this message, or any of its contents, by any person other than the intended recipient may constitute a breach of civil or criminal law and is strictly prohibited. If you are NOT the intended recipient please contact the sender and dispose of this e-mail as soon as possible. > > > > > > > -- > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

cdf affy makecdfenv affyPLM cdf affy makecdfenv affyPLM • 1.2k views

ADD COMMENT • link updated 17.5 years ago by Ben Bolstad ★ 1.2k • written 17.5 years ago by Henrik Bengtsson ★ 2.4k

0

Entering edit mode

Ben Bolstad ★ 1.2k

@ben-bolstad-1494

Last seen 7.5 years ago

Henrik, Absolutely correct, you could in principle do what you propose. Where things start to break down a little is in the computation of the StdError terms. Referring here to preprocessCore, soon to appear in BioC 2.1, since it gives basically the easiest entry point into the fitting code. Note I don't really like talking about pre-release packages on the main list, rather than the devel list, so this is a warning that following example my cease to work (in the same manner, or any manner at all) by the time of release. Bestm Ben library(preprocessCore) ### ### Creating a simulation model ### col.effects <- c(10,11,10.5,12,9.5) row.effects <- c(seq(-0.5,-0.1,by=0.1),seq(0.1,0.5,by=0.1)) y <- outer(row.effects, col.effects,"+") ### ### Checking that the estimates agree ### ### Note that this function fits ### y = col + row + error ### with sum(row) = 0 ### rcModelPLM(y)$Estimates rcModelPLM(t(y))$Estimates transposedEstimates <- rcModelPLM(t(y))$Estimates transposedEstimates[1:nrow(y)] - mean(transposedEstimates[1:nrow(y)]) transposedEstimates[(nrow(y) + 1):(nrow(y) +ncol(y))] + mean(transposedEstimates[1:nrow(y)]) ### ### Checking that residuals and Weights agree ### w <- runif(50) y <- y + rnorm(50) rcModelPLM(y)$Estimates rcModelPLM(t(y))$Estimates transposedEstimates <- rcModelPLM(t(y))$Estimates transposedEstimates[1:nrow(y)] - mean(transposedEstimates[1:nrow(y)]) transposedEstimates[(nrow(y) + 1):(nrow(y) +ncol(y))] + mean(transposedEstimates[1:nrow(y)]) ### ### These should be at the level of machine error ### max(abs(rcModelPLM(t(y))$Weights - t(rcModelPLM(y)$Weights))) max(abs(rcModelPLM(t(y))$Residuals - t(rcModelPLM(y)$Residuals))) ### ### Unfortunately this is where things start breaking down ### rcModelPLM(y)$StdErrors rcModelPLM(t(y))$StdErrors ### ### Super-secret testing functions. ### Forget you ever saw them ### these are non-public api ### ### Although this is not really my favorite ### way of computing the SE estimates, but ### it is the current default (se.method=4 in fitPLM) ### residSE <- sqrt(sum(rcModelPLM(y)$Residuals^2*rcModelPLM(y)$Weights)/(50 - 14)) weights <- rcModelPLM(y)$Weights XTWX <- .C("XTWX_R",as.integer(10),as.integer(5),as.double(weights),double( 14^2),PACKAGE="preprocessCore")[[4]] XTWXinv <- matrix(.C("XTWX_R_inv",as.integer(10),as.integer(5),as.double(XTWX))[[ 3]],14,14) weights2 <- rcModelPLM(t(y))$Weights XTWX2 <- .C("XTWX_R",as.integer(5),as.integer(10),as.double(weights2),double (14^2),PACKAGE="preprocessCore")[[4]] XTWXinv2 <- matrix(.C("XTWX_R_inv",as.integer(5),as.integer(10),as.double(XTWX2))[ [3]],14,14) residSE*sqrt(diag(XTWXinv)) residSE*sqrt(diag(XTWXinv2)) ### Compare with rcModelPLM(y)$StdError rcModelPLM(t(y))$StdError On Fri, 2007-09-21 at 17:23 -0700, Henrik Bengtsson wrote: > Hi, > > I would like to follow up with a question to Ben on this (old) thread. > You saying below that the time complexity of the fitting algorithm is > *not* symmetric in probes and samples - having many samples is not as > bad as having many probes. I can guess a few reasons for why this is, > but that is of less interest, so my main question is: Because the > log-additive model is symmetric in probes and samples, can you just > swap samples and probes, i.e. transposing the matrix, and fit the > model, and then afterwards swap chip-effect with probe-effect > estimates? ...or are there asymmetries in how the algorithm is fitted > that breaks this idea, e.g. re-iterative robustifications etc.? > > Thanks in advance > > Henrik > > On 5/1/07, Ben Bolstad <bmb at="" bmbolstad.com=""> wrote: > > The slowdown you are observing is due to just a few probesets on the > > array. These probesets contain many 1000's of probes. In the current > > implementation when you use the command that you specified (fitting the > > default model) fitPLM uses a procedure optimized for probesets with > > relatively few probes across many arrays and so is pretty quick most of > > the time (my experience is that is is not completely unacceptable even > > up to about 1000 probes across a large number of arrays, at least on my > > machine). > > > > eg both of the following contain same number of datapoints > > > > Case I: 11 probes and 1000 arrays > > Case II: 1000 probes and 11 probes > > > > but case I will be a lot quicker than case II in the current > > implementation. > > > > Demonstration code > > > > > library(affyPLM) > > > > ### note to any developers out there, the following is UNSUPPORTED > > ### and subject to change. DO NOT USE. > > > rlm.default.rma.model <- function(y,PsiCode=0,PsiK=1.345){ > > + .Call("R_rlm_rma_default_model",y,PsiCode,PsiK,PACKAGE="affyPLM") > > + } > > > > #Case I > > > y <- matrix(rnorm(11*1000),11,1000) > > > system.time(test <- rlm.default.rma.model(y)) > > [1] 0.735 0.032 0.788 0.000 0.000 > > > > #Case II > > > y <- matrix(rnorm(11*1000),1000,11) > > > system.time(test <- rlm.default.rma.model(y)) > > [1] 19.776 0.508 21.730 0.000 0.000 > > > > As for workarounds, I am pretty sure that these extremely large > > probesets are control probesets of some kind that could be safely > > ignored and it is possible to pass a vector of probeset names specifying > > a subset to use for fitPLM. > > > > Best, > > > > Ben > > > > On Tue, 2007-05-01 at 12:36 -0700, Allen Day wrote: > > > I suspect so, although I haven't tried running rma() directly. > > > Just.rma() works fine, and fitPLM is able to RMA normalize internally. > > > > > > I was able to move this a little further along by patching the mm() > > > function to return empty list in the case of a dimensionless pset > > > variable. Apparently it is usually a two-column matrix with pm in > > > psets[,1] and mm in psets[,2]. Heres the patch. > > > http://paste.turbogears.org/paste/1253/plain > > > > > > This allows me to successfully background correct and normalize with > > > RMA through wrapper function fitPLM from the affyPLM library. It's > > > taking forever though, even running with minimal options. Here's my > > > call: > > > > > > fitPLM(ab, output.param=list(residuals=FALSE,weights=FALSE,resid .SE=FALSE),verbosity.level=10); > > > > > > Any advice? > > > > > > -Allen > > > > > > On 5/1/07, Crispin Miller <cmiller at="" picr.man.ac.uk=""> wrote: > > > > Hi Allen, > > > > Does rma() work with your cdf? > > > > > > > > We've also produced one that works OK with rma() (see the 'exonmap' > > > > package vignette for more details, including how to get it). Don't know > > > > if that helps? > > > > > > > > Crispin > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: bioconductor-bounces at stat.math.ethz.ch > > > > > [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Allen Day > > > > > Sent: 01 May 2007 01:32 > > > > > To: bioconductor at stat.math.ethz.ch > > > > > Subject: [BioC] affyPLM and exon array question > > > > > > > > > > Hi, > > > > > > > > > > I've been trying to get NUSE, RLE, and RMA values for > > > > > HuEx-1_0-st-v2 (Human "all exon") Affymetrix arrays. > > > > > > > > > > So far I have successfully read the arrays into an affybatch object. > > > > > This required creating the CDF environment, which I have > > > > > already done with makecdfenv. I'll be submitting that for > > > > > inclusion shortly, but that's another topic. > > > > > > > > > > After creating the AffyBatch, I try to use affyPLM to do an > > > > > RMA model fit. R = 2.4.1, affyPLM = 1.12.0, affy = 1.12.2. > > > > > That's where there's trouble, and it appears to be caused by > > > > > the lack of mismatch probes on the array. Here's code > > > > > illustrating the problem: > > > > > > > > > > > library( 'affy' ); > > > > > > library( 'affyPLM' ); > > > > > > ab = read.affybatch( > > > > > filenames='/home/allenday/cel/0001.CEL' ); ab; # > > > > > > works, output omitted pm( ab ); # works, output omitted mm( ab ); # > > > > > > fails! > > > > > Error in FUN(X[[1411190]], ...) : subscript out of bounds > > > > > > plm = fitPLM( ab ); #same failure in fitPLM, caused by a > > > > > call to mm() > > > > > > on variable ab; > > > > > Error in FUN(X[[1411190]], ...) : subscript out of bounds > > > > > > > > > > I'm only proficient enough in R and C to track this down -- > > > > > I'm don't know R or Bioconductor well enough to know how to > > > > > fix it. If I can get this going I will submit a new package > > > > > that provides just.nuse() and just.rle() functions. Can > > > > > someone give me a pointer for how to make this work? > > > > > > > > > > Thanks. > > > > > > > > > > -Allen > > > > > > > > > > _______________________________________________ > > > > > Bioconductor mailing list > > > > > Bioconductor at stat.math.ethz.ch > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > > Search the archives: > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > -------------------------------------------------------- > > > > > > > > > > > > This email is confidential and intended solely for the use of the person(s) ('the intended recipient') to whom it was addressed. Any views or opinions presented are solely those of the author and do not necessarily represent those of the Paterson Institute for Cancer Research or the University of Manchester. It may contain information that is privileged & confidential within the meaning of applicable law. Accordingly any dissemination, distribution, copying, or other use of this message, or any of its contents, by any person other than the intended recipient may constitute a breach of civil or criminal law and is strictly prohibited. If you are NOT the intended recipient please contact the sender and dispose of this e-mail as soon as possible. > > > > > > > > > > -- > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 17.5 years ago Ben Bolstad ★ 1.2k

Login before adding your answer.