Integrating Codelink data with bioconductor (using affy and limmafunctions)
1
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia
Dear Diego, We are interested in introducing Codelink support into limma at some stage, so we'd be interested to see your class definition and parser at some time. The main reason we don't support Codelink yet is some uncertainty about how best to handle the multiple fields (sub-arrays) which are part of the Codelink platform. I think that you would almost certainly be better off avoiding missing values in the first place rather than trying to accommodate them during the normalization process. See for example: https://stat.ethz.ch/pipermail/bioconductor/2005-April/008348.html Note that while normalizeQuantiles() does allow for missing values, the missing values are assumed to have occured at random. If the missing values are actually associated with low intensities, this is not missing at random and the normalization process will be biased. Even if you use a method other than quantile normalization, the whole process of between-array normalization is problematic with missing values. Better to avoid them. Gordon >Date: Fri, 22 Apr 2005 14:45:53 +0200 >From: Diego D?ez Ruiz <ddiez@iib.uam.es> >Subject: [BioC] Integrating Codelink data with bioconductor (using > affy and limmafunctions) >To: bioconductor@stat.math.ethz.ch > >Dear BioC users and developers: > >I'm working with Codelink plattform microarray data and i want to apply >some of the knowledge currently available for other platforms as >Affymetrix o cDNA spotted microarrays to deal with this data. I have >been using Affymetrix data for a while and i used suscessfully the >great affy package. I also worked with spoted cDNA data and use the >also great package limma mainly. Currently this is what I have done: > >1. Write a parser for exported text data from the codelink software. > >2. I found convenient to create a new class for storing data from this >platform because there are no unique identifiers as in Affymetrix and so >an exprSet object was not appropiate. I think an RGlist/MAlist-like >definition could do the job fine so I almost cut and paste the classes >definition in the limma package an modified it a little (any concerns >about using other's code is wellcome. I'm willing to release this code >under LGPL licence) to make a currently named "Codelink" class. The >class currently stores signal values, flag values, type of spot and >codelink identifiers. Also I make some annotation packages for the >human10k, humanwholegenome and ratwholegenome bioarrays but i have to >test it deeper. A remember and older thread in the BioC list about >modifiying the exprSet definition to deal with this problem. Is there >any plan to change it so I can use a more standard approach better that >create a new class? > >3. Created/adapted some functions for plotting: Density plots, MAplot, >Scatter plot that let see the position of diferent type of spots in >codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc. > > >All that worked more or less. Data can be imported as RAW data (default) >or normalized (Codelink median normalization) and as log2 values >(default) or not. When I calculate log2 values, i initially decided to >set negative values (obtained from subtracting background to signal >sometimes) to somethign like 0.01. This works fine with >normalize.quantiles and normalize.loess from affy package but create >artifact point in a MAplot and a low intensity peak in a density plot. >Then, I saw a post from Gordon Smyth (sorry dont have link) telling that >limma make negative values NA prior to log2. I wanted to do that and >modified the functions involved. Finaly this is when I found a mayor >problem: > >1) normalize.loess: I got to modify this function to allow NA values. >The function compares each array whith the others so if you have 10 >arrays then each array is compared and modified (after loess >estimation) 10 times. If in each iteration I select the non NA values >created when you calculate A or M then in each iteration you have a >different subset of genes used in the normalization process. As a little >example, if you have 4 arrays with 5 genes: > > chip1 chip2 chip3 chip4 >gen1 1 10 NA 5 >gen2 3 NA 40 6 >gen3 4 NA NA 7 >gen4 3 12 45 6 >gen5 2 1 4 3 > >normalize.loess take a matrix as argument and do all pairwise comparisons: > >chip1-chip2: used gen1,gen4,gen5 to estimate loess curve. >chip1-chip3: used gen2,gen4,gen5 to estimate loess curve. >chip1-chip4: used all genes to estimate loess curve. >chip2-chip3: used gen4,gen5 to estimate loess curve. >chip2-chip4: used gen1,gen4,gen5 to estimate loess curve. >chip3-chip4: used gen2,gen4,gen5 to estimate loess curve. > >The code modified/added to allow for NA is between #: >(from normalize.loess in affy 1.5.8) > >--------------------------------------------------------------------- --- > >normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]), >min(c(5000, nrow(mat)))), > epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE, > span = 2/3, family.loess = "symmetric") >{ > J <- dim(mat)[2] > II <- dim(mat)[1] > newData <- mat > if log.it) { > mat <- log2(mat) > newData <- log2(newData) > } > change <- epsilon + 1 > fs <- matrix(0, II, J) > iter <- 0 > w <- c(0, rep(1, length(subset)), 0) > while (iter < maxit) { > iter <- iter + 1 > means <- matrix(0, II, J) > for (j in 1:(J - 1)) { > for (k in (j + 1):J) { > y <- newData[, j] - newData[, k] > x <- (newData[, j] + newData[, k])/2 > ##################################### > # Select genes that are not set to NA > sel <- which(!is.na(as.character(y))) > y <- y[sel] > x <- x[sel] > ##################################### > index <- c(order(x)[1], subset, order(-x)[1]) > xx <- x[index] > yy <- y[index] > aux <- loess(yy ~ xx, span = span, degree = 1, > weights = w, family = family.loess) > aux <- predict(aux, data.frame(xx = x))/J > ##################################### > # Apply correction to genes not NA. > means[sel, j] <- means[sel, j] + aux > means[sel, k] <- means[sel, k] - aux > ##################################### > if (verbose) > cat("Done with", j, "vs", k, " in iteration ", > iter, "\n") > } > } > fs <- fs + means > newData <- mat - fs > change <- max(colMeans((means[subset, ])^2)) > if (verbose) > cat(iter, change, "\n") > oldfs <- fs > } > if ((change > epsilon) & (maxit > 1)) > warning(paste("No convergence after", maxit, "iterations.\n")) > if log.it) { > return(2^newData) > } > else return(newData) >} > >--------------------------------------------------------------------- --- > >I'm not an statitician so not sure if this could be a correct approach: >Is this correct or I cannot do that? > >2) With normalize.quantiles all seems to work fine... the function don't >say nothing about NA values and the density plots seems correct (almost >same density plot for all chips) but the MAplot have changed greatly >with a great number of genes with greater values of M (in absolute >value) that is, a wider cloud of points. I don't know why is this >happening. I also tried to see at the C code for normalize.quantiles (at >qnorm.c) but not found an easy answer. Hopefully I finally found that I >can use normalizeQuantiles from limma package that allows for NA values. > >By the way, is this really necesary or can I assign a low value to >negative ones?. I think it is better to use NA but... >Something I'm going to do sonner is to import all the values obtained >from the Codelink data (mean foreground and background, median >foreground and background, etc) so it could be possible to calculate a >signal value within R and then use a method that avoids negative values. >But that could be the same in some cases to assign a small value (Or I'm >wrong) > >I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled from >sources and BioC 1.5. > >Any comment on all that would be very appreciated. > >Thanks in advanced! > > >D.
Microarray Annotation Normalization affy limma PROcess codelink ASSIGN Microarray affy • 1.3k views
ADD COMMENT
0
Entering edit mode
Diego Diez ▴ 760
@diego-diez-4520
Last seen 4.2 years ago
Japan
Dear Gordon, Thanks for your response. I will use the data as early but, What do you think it could affect more to normalization process: Some points assigned as NA values or some point with lowers A values as one of the intensitues was assigned a value of say 0.01? I'd let you see my class definition and parser of course. This is really the first time a make use of classes and store all things as an R package so I thought that the best way to make something usable and quick without having to read completly "writting R extensions" was using others packages to learn (that is one of the greatness of opensource :). Of course I will have to read it one day. Briefly: 1. The parser read exported txt files from codelink software. It works fine with 3 different chips so I think it should work fine with all types. A problem is that exported text data have custom fields (and you can chose within all fields including Raw_intensity, Median_foreground, etc) So it could be possible to found files with missing fields not exported. I know that it is possible to export as XML but a didn't try that yet. 2. The class definition is very simple. I based it in RGlist and used almost all redefinitions of dim() as.matrix() etc... that you use in limma. I also based a subsetting system in the one used in AffyBatch objects in affy. A Codelink object stores as a list 3 matrices. One of intensities, one of Flags and one last with probe name and probe type. I actually named it "val" "flags" and "info" slots but i don't thing they are appropiate so this week I want to import all possible fields and name it as they are called in the exported files. I probably too make comprobation about the fields present and warn or error if a *must have* field is missing. When I have a more clear and clean code I will not have any problems in let you see it. D Gordon Smyth escribi?: > Dear Diego, > > We are interested in introducing Codelink support into limma at some > stage, so we'd be interested to see your class definition and parser at > some time. The main reason we don't support Codelink yet is some > uncertainty about how best to handle the multiple fields (sub- arrays) > which are part of the Codelink platform. > > I think that you would almost certainly be better off avoiding missing > values in the first place rather than trying to accommodate them during > the normalization process. See for example: > > https://stat.ethz.ch/pipermail/bioconductor/2005-April/008348.html > > Note that while normalizeQuantiles() does allow for missing values, the > missing values are assumed to have occured at random. If the missing > values are actually associated with low intensities, this is not missing > at random and the normalization process will be biased. Even if you use > a method other than quantile normalization, the whole process of > between-array normalization is problematic with missing values. Better > to avoid them. > > Gordon > >> Date: Fri, 22 Apr 2005 14:45:53 +0200 >> From: Diego D?ez Ruiz <ddiez@iib.uam.es> >> Subject: [BioC] Integrating Codelink data with bioconductor (using >> affy and limmafunctions) >> To: bioconductor@stat.math.ethz.ch >> >> Dear BioC users and developers: >> >> I'm working with Codelink plattform microarray data and i want to apply >> some of the knowledge currently available for other platforms as >> Affymetrix o cDNA spotted microarrays to deal with this data. I have >> been using Affymetrix data for a while and i used suscessfully the >> great affy package. I also worked with spoted cDNA data and use the >> also great package limma mainly. Currently this is what I have done: >> >> 1. Write a parser for exported text data from the codelink software. >> >> 2. I found convenient to create a new class for storing data from this >> platform because there are no unique identifiers as in Affymetrix and so >> an exprSet object was not appropiate. I think an RGlist/MAlist-like >> definition could do the job fine so I almost cut and paste the classes >> definition in the limma package an modified it a little (any concerns >> about using other's code is wellcome. I'm willing to release this code >> under LGPL licence) to make a currently named "Codelink" class. The >> class currently stores signal values, flag values, type of spot and >> codelink identifiers. Also I make some annotation packages for the >> human10k, humanwholegenome and ratwholegenome bioarrays but i have to >> test it deeper. A remember and older thread in the BioC list about >> modifiying the exprSet definition to deal with this problem. Is there >> any plan to change it so I can use a more standard approach better that >> create a new class? >> >> 3. Created/adapted some functions for plotting: Density plots, MAplot, >> Scatter plot that let see the position of diferent type of spots in >> codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc. >> >> >> All that worked more or less. Data can be imported as RAW data (default) >> or normalized (Codelink median normalization) and as log2 values >> (default) or not. When I calculate log2 values, i initially decided to >> set negative values (obtained from subtracting background to signal >> sometimes) to somethign like 0.01. This works fine with >> normalize.quantiles and normalize.loess from affy package but create >> artifact point in a MAplot and a low intensity peak in a density plot. >> Then, I saw a post from Gordon Smyth (sorry dont have link) telling that >> limma make negative values NA prior to log2. I wanted to do that and >> modified the functions involved. Finaly this is when I found a mayor >> problem: >> >> 1) normalize.loess: I got to modify this function to allow NA values. >> The function compares each array whith the others so if you have 10 >> arrays then each array is compared and modified (after loess >> estimation) 10 times. If in each iteration I select the non NA values >> created when you calculate A or M then in each iteration you have a >> different subset of genes used in the normalization process. As a little >> example, if you have 4 arrays with 5 genes: >> >> chip1 chip2 chip3 chip4 >> gen1 1 10 NA 5 >> gen2 3 NA 40 6 >> gen3 4 NA NA 7 >> gen4 3 12 45 6 >> gen5 2 1 4 3 >> >> normalize.loess take a matrix as argument and do all pairwise >> comparisons: >> >> chip1-chip2: used gen1,gen4,gen5 to estimate loess curve. >> chip1-chip3: used gen2,gen4,gen5 to estimate loess curve. >> chip1-chip4: used all genes to estimate loess curve. >> chip2-chip3: used gen4,gen5 to estimate loess curve. >> chip2-chip4: used gen1,gen4,gen5 to estimate loess curve. >> chip3-chip4: used gen2,gen4,gen5 to estimate loess curve. >> >> The code modified/added to allow for NA is between #: >> (from normalize.loess in affy 1.5.8) >> >> ------------------------------------------------------------------- ----- >> >> normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]), >> min(c(5000, nrow(mat)))), >> epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE, >> span = 2/3, family.loess = "symmetric") >> { >> J <- dim(mat)[2] >> II <- dim(mat)[1] >> newData <- mat >> if log.it) { >> mat <- log2(mat) >> newData <- log2(newData) >> } >> change <- epsilon + 1 >> fs <- matrix(0, II, J) >> iter <- 0 >> w <- c(0, rep(1, length(subset)), 0) >> while (iter < maxit) { >> iter <- iter + 1 >> means <- matrix(0, II, J) >> for (j in 1:(J - 1)) { >> for (k in (j + 1):J) { >> y <- newData[, j] - newData[, k] >> x <- (newData[, j] + newData[, k])/2 >> ##################################### >> # Select genes that are not set to NA >> sel <- which(!is.na(as.character(y))) >> y <- y[sel] >> x <- x[sel] >> ##################################### >> index <- c(order(x)[1], subset, order(-x)[1]) >> xx <- x[index] >> yy <- y[index] >> aux <- loess(yy ~ xx, span = span, degree = 1, >> weights = w, family = family.loess) >> aux <- predict(aux, data.frame(xx = x))/J >> ##################################### >> # Apply correction to genes not NA. >> means[sel, j] <- means[sel, j] + aux >> means[sel, k] <- means[sel, k] - aux >> ##################################### >> if (verbose) >> cat("Done with", j, "vs", k, " in iteration ", >> iter, "\n") >> } >> } >> fs <- fs + means >> newData <- mat - fs >> change <- max(colMeans((means[subset, ])^2)) >> if (verbose) >> cat(iter, change, "\n") >> oldfs <- fs >> } >> if ((change > epsilon) & (maxit > 1)) >> warning(paste("No convergence after", maxit, "iterations.\n")) >> if log.it) { >> return(2^newData) >> } >> else return(newData) >> } >> >> ------------------------------------------------------------------- ----- >> >> I'm not an statitician so not sure if this could be a correct approach: >> Is this correct or I cannot do that? >> >> 2) With normalize.quantiles all seems to work fine... the function don't >> say nothing about NA values and the density plots seems correct (almost >> same density plot for all chips) but the MAplot have changed greatly >> with a great number of genes with greater values of M (in absolute >> value) that is, a wider cloud of points. I don't know why is this >> happening. I also tried to see at the C code for normalize.quantiles (at >> qnorm.c) but not found an easy answer. Hopefully I finally found that I >> can use normalizeQuantiles from limma package that allows for NA values. >> >> By the way, is this really necesary or can I assign a low value to >> negative ones?. I think it is better to use NA but... >> Something I'm going to do sonner is to import all the values obtained >> from the Codelink data (mean foreground and background, median >> foreground and background, etc) so it could be possible to calculate a >> signal value within R and then use a method that avoids negative values. >> But that could be the same in some cases to assign a small value (Or I'm >> wrong) >> >> I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled from >> sources and BioC 1.5. >> >> Any comment on all that would be very appreciated. >> >> Thanks in advanced! >> >> >> D. > >
ADD COMMENT
0
Entering edit mode
At 09:35 PM 25/04/2005, Diego D?ez Ruiz wrote: >Dear Gordon, > >Thanks for your response. I will use the data as early but, What do you >think it could affect more to normalization process: Some points assigned >as NA values or some point with lowers A values as one of the intensitues >was assigned a value of say 0.01? Unless you're doing much more than I think you are, you must avoid NAs at all costs. If you have to live with low intensities, then so be it. >I'd let you see my class definition and parser of course. This is really >the first time a make use of classes and store all things as an R package >so I thought that the best way to make something usable and quick without >having to read completly "writting R extensions" was using others packages >to learn (that is one of the greatness of opensource :). Of course I will >have to read it one day. >Briefly: >1. The parser read exported txt files from codelink software. I've never seen Codelink output, but my understanding is that it is essentially just ImaGene output. Is that not correct? Gordon > It works fine with 3 different chips so I think it should work fine with > all types. A problem is that exported text data have custom fields (and > you can chose within all fields including Raw_intensity, > Median_foreground, etc) So it could be possible to found files with > missing fields not exported. I know that it is possible to export as XML > but a didn't try that yet. > >2. The class definition is very simple. I based it in RGlist and used >almost all redefinitions of dim() as.matrix() etc... that you use in >limma. I also based a subsetting system in the one used in AffyBatch >objects in affy. A Codelink object stores as a list 3 matrices. One of >intensities, one of Flags and one last with probe name and probe type. I >actually named it "val" "flags" and "info" slots but i don't thing they >are appropiate so this week I want to import all possible fields and name >it as they are called in the exported files. I probably too make >comprobation about the fields present and warn or error if a *must have* >field is missing. > >When I have a more clear and clean code I will not have any problems in >let you see it. > >D
ADD REPLY
0
Entering edit mode
Gordon Smyth escribi?: > At 09:35 PM 25/04/2005, Diego D?ez Ruiz wrote: > >> Dear Gordon, >> >> Thanks for your response. I will use the data as early but, What do >> you think it could affect more to normalization process: Some points >> assigned as NA values or some point with lowers A values as one of the >> intensitues was assigned a value of say 0.01? > > > Unless you're doing much more than I think you are, you must avoid NAs > at all costs. If you have to live with low intensities, then so be it. then so be it. > >> I'd let you see my class definition and parser of course. This is >> really the first time a make use of classes and store all things as an >> R package so I thought that the best way to make something usable and >> quick without having to read completly "writting R extensions" was >> using others packages to learn (that is one of the greatness of >> opensource :). Of course I will have to read it one day. >> Briefly: >> 1. The parser read exported txt files from codelink software. > > > I've never seen Codelink output, but my understanding is that it is > essentially just ImaGene output. Is that not correct? I've never seen Imagene output. This is header and column names from codelink output: CodeLink Expression Analysis 4.1.0.29054 CNIC Report for Slide (T00241792) LAYOUT EXP294X192-912.22.ID PROJECT EXPERIMENT PRODUCT Human Whole Genome Sample Name Array 1 Sample001 Median Array 1 86,6547470092773 Report( 1 ): 310105-Person ---------------------------------------------------------------------- ---------- Idx Array Sample_name Probe_name Annotation_PIN Annotation_NCBI_Acc Annotation_NCBI_NID Annotation_LocusLink Annotation_OGS Annotation_UniGene Annotation_ENSEMBL Probe_type Feature_id Raw_intensity Normalized_intensity Quality_flag Signal_strength Logical_row Logical_col Center_X Center_Y Spot_mean Spot_median Spot_stdev Spot_area Spot_diameter Spot_noise_level Bkgd_mean Bkgd_median Bkgd_stdev Bkgd_area Annotation_Molecular_Function Annotation_Biological_Process Annotation_Cellular_Component Annotation_Cytoband Annotation_HS_Homology Annotation_MM_Homology Annotation_RN_Homology Annotation_Analogous_CodeLink Annotation_Legacy_Probe_Name Description Header could be less than 10 rows (custom) and columns could be customized (for example in my own data I avoid Annotation_* and Description columns). I'm not sure if in this example there are all the possible fields. D > > Gordon > >> It works fine with 3 different chips so I think it should work fine >> with all types. A problem is that exported text data have custom >> fields (and you can chose within all fields including Raw_intensity, >> Median_foreground, etc) So it could be possible to found files with >> missing fields not exported. I know that it is possible to export as >> XML but a didn't try that yet. >> >> 2. The class definition is very simple. I based it in RGlist and used >> almost all redefinitions of dim() as.matrix() etc... that you use in >> limma. I also based a subsetting system in the one used in AffyBatch >> objects in affy. A Codelink object stores as a list 3 matrices. One of >> intensities, one of Flags and one last with probe name and probe type. >> I actually named it "val" "flags" and "info" slots but i don't thing >> they are appropiate so this week I want to import all possible fields >> and name it as they are called in the exported files. I probably too >> make comprobation about the fields present and warn or error if a >> *must have* field is missing. >> >> When I have a more clear and clean code I will not have any problems >> in let you see it. >> >> D > >
ADD REPLY

Login before adding your answer.

Traffic: 694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6