Using ReadAffy with custom CDFs on tiling array data

Arkady ▴ 60

A couple of questions herein.

Background: I'm trying to load the CEL files for the Affy whole-genome tiling arrays. I have lots and lots of bzip2-compressed CEL files (3452 of them). They seem to ask for Wgc_Universal_fe1 as the CDF, and this package does not appear to be available through Bioconductor, according to getCDF(cleancdfname("Wgc_Universal_fe1")).

According to some papers I've found, newer custom CDFs are better. So I tried using some from UMich, but again, they don't appear to be available in the repository (at least for human tiling 1.0R and 2.0R).

Finally, I downloaded all of the probe and CDF data from UMich and installed it manually, both the probe and cdf packages. That appeared to work, and I can load a single CEL file.

Unfortunately, this has left me with several questions.

1. The CEL files contain the names of the original CDFs. How do I translate those to the names of the custom CDFs? Is there some way to establish a mapping?

2. How do I deal with multiple CDFs for a single experiment? Do I load each of my 3452 files separately, specifying the CDF each time?

3. What about the probe packages? Is there a unified package that contains both pieces (CDF and probes) of information?

4. Why aren't the CDFs for the human tiling arrays made available through Bioconductor?

Thanks again.

Cheers,
John Woods

cdf probe affy • 1.8k views

James W. MacDonald 67k

Arkady wrote:

> 1. The CEL files contain the names of the original CDFs. How do I
> translate those to the names of the custom CDFs? Is there some way to
> establish a mapping?

See ?ReadAffy, specifically the cdfname argument (you _did_ already read the help, no?).

> 2. How do I deal with multiple CDFs for a single experiment? Do I load
> each of my 3452 files separately, specifying the CDF each time?

Again, see ReadAffy(). And good luck reading in 3452 celfiles unless you have more RAM than the NSA (or Google - not that there is much difference ;-D)

> 3. What about the probe packages? Is there a unified package that
> contains both pieces (CDF and probes) of information?

See the oligo and pdInfoBuilder packages, which is what you should be using for tiling arrays anyway (or the affyTiling package).

> 4. Why aren't the CDFs for the human tiling arrays made available
> through Bioconductor?

Mainly because the demand is so low, and there isn't a really easy way to analyze them at this time. The barrier to entry is so high that most people who analyze these things are on the bleeding edge anyway, so they typically don't need our help. Plus, it somehow never occurred to me to build them ;-D

Best,
Jim

--
James W. MacDonald, MS
Biostatistician, UMCCC cDNA and Affymetrix Core
University of Michigan

On Tue, Jul 22, 2008 at 6:18 PM, James MacDonald <jmacdon@med.umich.edu> wrote:

>> 1. The CEL files contain the names of the original CDFs. How do I
>> translate those to the names of the custom CDFs? Is there some way to
>> establish a mapping?
>
> See ?ReadAffy, specifically the cdfname argument (you _did_ already read
> the help, no?).

(Yes, of course. And the vignettes.)

I'll try to clarify. Is there a way to tell from the CDF requested in the CEL which *custom* CDF it should load instead? For example, how do I know which (of the many CDFs for the Affy 1.0R, 2.0R, 21/22 arrays) custom CDF is the replacement for the default CDF Wgc_Universal_fe1?

>> 2. How do I deal with multiple CDFs for a single experiment? Do I load
>> each of my 3452 files separately, specifying the CDF each time?
>
> Again, see ReadAffy(). And good luck reading in 3452 celfiles unless you
> have more RAM than the NSA (or Google - not that there is much difference
> ;-D)

Yes, Google = NSA, and NASA, too. Soon they'll put NAAAAAAAAAASA at the bottom of search results instead of Goooooooooogle.

So you're saying that instead of calling ReadAffy() with no or few args on the current working directory, I should call ReadAffy for each CEL separately, specifying the custom CDF manually? (I guess I was wondering if there was an option like usecustom=TRUE.)

Trying to increase my understanding: If I've got three replicates of an array that all come from the same biological replicate, should those be read in a single call to ReadAffy?

>> 3. What about the probe packages? Is there a unified package that
>> contains both pieces (CDF and probes) of information?
>
> See the oligo and pdInfoBuilder packages, which is what you should be using
> for tiling arrays anyway (or the affyTiling package).

I'm getting: package 'affyTiling' is not available.

I do have package tilingArray, but the vignettes there don't really answer my question. Is there somewhere else I should look? Specifically, I'd like help on how best to load this data into R using existing (Yoda) methods in Bioconductor--if it's possible.

The pdInfoBuilder vignette is not terribly helpful either. Is there some document that explains how all of this stuff gets tied together, and who should be using which package for what application? I'm really having a little trouble keeping track.

>> 4. Why aren't the CDFs for the human tiling arrays made available
>> through Bioconductor?
>
> Mainly because the demand is so low, and there isn't a really easy way to
> analyze them at this time. The barrier to entry is so high that most people
> who analyze these things are on the bleeding edge anyway, so they typically
> don't need our help. Plus, it somehow never occurred to me to build them ;-D

Yay bleeding edge.

Cheers,
John

Arkady wrote:

> I'll try to clarify. Is there a way to tell from the CDF requested in the
> CEL which *custom* CDF it should load instead? For example, how do I know
> which (of the many CDFs for the Affy 1.0R, 2.0R, 21/22 arrays) custom CDF is
> the replacement for the default CDF Wgc_Universal_fe1?

No. You just have to know enough about the array you are using to know which custom one is applicable.

> So you're saying that instead of calling ReadAffy() with no or few args on
> the current working directory, I should call ReadAffy for each CEL
> separately, specifying the custom CDF manually? (I guess I was wondering if
> there was an option like usecustom=TRUE.)

Calling ReadAffy() on individual celfiles isn't likely to be helpful if you are intending to process them together. You will want to background correct and normalize things in batches.

> Trying to increase my understanding:
> If I've got three replicates of an array that all come from the same
> biological replicate, should those be read in a single call to ReadAffy?

Yes.

> I'm getting:
> package 'affyTiling' is not available.

Well, that shows my ignorance of the package name - it _is_ tilingArray.

> I do have package tilingArray, but the vignettes there don't really answer
> my question. Is there somewhere else I should look? Specifically, I'd like
> help on how best to load this data into R using existing (Yoda) methods in
> Bioconductor--if it's possible.
>
> The pdInfoBuilder vignette is not terribly helpful either. Is there some
> document that explains how all of this stuff gets tied together, and who
> should be using which package for what application? I'm really having a
> little trouble keeping track.

Well, that's the trouble with being on the bleeding edge of things. I have worked in an Affy core for over 6 years now, and have *never* seen a tiling array, unless you count ChIP-chip stuff with the promoter arrays. I think they might be nice for some things, but they are pretty hard to sell to the Standard Issue Biologist (SIB).

In an Open Source project like Bioconductor, the things that get the attention are the things that people use most/ask questions about. Since tiling arrays seem to be a bit rare, I don't think they have got as much attention as the standard expression arrays.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician, Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
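A minimal sketch of what "know which custom CDF applies, then read in batches" could look like in practice. The lookup table is maintained by hand, as Jim describes; both custom CDF names and the second chip type are placeholders, not real packages. The batching helper is plain R, and the ReadAffy() loop is left commented out because it needs the affy package and real CEL files.

```r
# Hand-maintained lookup: chip type stored in the CEL header -> custom CDF
# package name. Every name on the right-hand side is a placeholder
# (assumption); replace it with the package you actually installed.
custom_cdf <- c(
  "Wgc_Universal_fe1" = "myCustomTilingCdfA",  # hypothetical
  "Some_Other_Chip"   = "myCustomTilingCdfB"   # hypothetical
)

# Group CEL files by chip type, then cut each group into batches small
# enough to fit in memory.
plan_batches <- function(files, chip_types, batch_size = 50) {
  stopifnot(length(files) == length(chip_types))
  by_chip <- split(files, chip_types)
  lapply(by_chip, function(f) {
    unname(split(f, ceiling(seq_along(f) / batch_size)))
  })
}

# Toy run on made-up file names, so no CEL files are needed here.
files <- sprintf("sample%02d.CEL", 1:7)
chips <- rep(c("Wgc_Universal_fe1", "Some_Other_Chip"), times = c(5, 2))
batches <- plan_batches(files, chips, batch_size = 3)

# With real data (not run; requires the affy and affyio packages):
# library(affy)
# chips <- vapply(files, function(f)
#   affyio::read.celfile.header(f)$cdfName, character(1))
# for (chip in names(batches)) {
#   for (b in batches[[chip]]) {
#     abatch <- ReadAffy(filenames = b, cdfname = custom_cdf[[chip]])
#     # background-correct and normalize this batch, then save the result
#   }
# }
```

Files that share a chip type go through the same `cdfname`, and replicates that belong together should land in the same batch, so the batch size and ordering are worth choosing with the experimental design in mind.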

Arkady ▴ 60

Benilton,

Thank you so much for writing back. My apologies if I've come across very negatively. I do realize documentation and development are ongoing processes, and I have been incredibly impressed by the extensive docs available for most Bioconductor packages.

The code sample is exactly the kind of thing I'm looking for. I'll spend some time tomorrow playing around with it. Aside from feedback, is there anything else I can do to help as I go through this, such as writing a vignette? I've been experimenting with Sweave a bit.

Cheers,
John

On Wed, Jul 23, 2008 at 5:50 PM, Benilton Carvalho <bcarvalh@jhsph.edu> wrote:

> Dear John,
>
> my apologies for not being very helpful, but the combo oligo+pdInfoBuilder
> is meant to help you with expression, SNP, tiling, exon and gene arrays.
>
> Because I've been working more closely with SNP arrays, the interface for
> SNPs is a bit more developed. And, slowly, I'm adding functionalities for
> other platforms.
>
> Soon I'll be able to give more attention to both packages, but you can get
> started by creating the pdInfo package for the array you have.
>
> The code I use to create such package is:
>
>     library(pdInfoBuilder)
>     bpmapFile <- "Hs_PromPR_v02-3_NCBIv36.bpmap"
>     cifFile <- "Hs_PromPR_v02.cif"
>     obj <- new("AffyTilingPDInfoPkgSeed",
>                version="0.0.1",
>                author="Benilton Carvalho", email="bcarvalh@jhsph.edu",
>                biocViews="AnnotationData",
>                genomebuild="NCBI Build 36",
>                bpmapFile=bpmapFile,
>                cifFile=cifFile)
>     makePdInfoPackage(obj, destDir=".")
>
> And, after installing the resulting package, I'm able to read in the CEL
> files. As I said, there isn't much implemented specific for Tiling Arrays,
> and it would be actually very helpful to have feedback from "tiling arrays'
> users".
>
> Cheers,
>
> b

Naira Naouar ▴ 140

Dear John,

At the moment there are no CDFs available for tiling arrays except the ones provided by Manhong Dai:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download_v10.asp

You will see that there is one CDF per tiling array and per database used as the reference for genome annotation. Depending on which database you trust most for gene annotation, you choose the "unique" CDF that you need for your analysis. (With those CDFs you will be able to perform RMA, ...)

Personally, I have been working on the Arabidopsis thaliana 1.0R tiling array and I have produced my own CDF for this array. The way I did it is explained here:
http://wiki.fhcrc.org/bioc/DetailedScheduleTentative?action=AttachFile&do=get&target=Lightning-Naouar.pdf

Basically, I started from all probes, aligned them to the genome, and eliminated the probes that were not of interest to me (keeping only the unique exonic probes for each annotated gene). My CDF contains more genes than the one proposed by Manhong Dai (I am not 100% sure how he selected the correct probes for each gene).

My last comment for the moment is that it will be very difficult to analyse all your arrays together. You will find that storing them takes a lot of memory.

If I can be of any other help, please let me know,

Naira

--
Naira Naouar
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
nanao at psb.ugent.be
http://www.psb.ugent.be
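The probe selection described above ("keeping only the unique exonic probes for each gene") might be sketched roughly as follows. This is only one reading of that rule, not Naira's actual procedure: the table layout, column names and values are all made up for illustration, and a real pipeline would start from genuine alignment results.

```r
# Toy alignment table: one row per (probe, genomic hit). All values are
# invented for illustration; the column names are assumptions.
hits <- data.frame(
  probe   = c("p1", "p2", "p2", "p3", "p4", "p5"),
  gene    = c("AT1G01010", "AT1G01010", "AT1G01020", "AT1G01020", NA, "AT1G01020"),
  feature = c("exon", "exon", "exon", "exon", "intergenic", "intron"),
  stringsAsFactors = FALSE
)

# A probe is kept only if it has exactly one hit and that hit is exonic.
n_hits <- table(hits$probe)
unique_probes <- names(n_hits)[n_hits == 1]
keep <- hits$probe %in% unique_probes & hits$feature == "exon"

# Probe sets: gene -> probes that survived the filter.
probesets <- split(hits$probe[keep], hits$gene[keep])
probesets
```

Here p2 is dropped because it hits two genes, and p4 and p5 are dropped because they are not exonic, leaving one probe per gene in this toy table.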

Hi Naira,

On Jul 28, 2008, at 9:57 AM, Naira Naouar wrote:

> Personally, I have been working on Arabidopsis Thaliana 1.0R tiling
> array and I have produced my own CDF for this array. The way I did
> it is explained here:
> http://wiki.fhcrc.org/bioc/DetailedScheduleTentative?action=AttachFile&do=get&target=Lightning-Naouar.pdf

Thanks for the slides! I've actually done something very similar for the Drosophila 1.0R tiling array in terms of realigning probes and annotating as exon/ambiguous-exon (i.e., an exon in only 1 isoform of a transcript)/intron/intergenic/etc. The only question I have is what did you then use to get it back into the expected CDF format for seamless use in Bioconductor?

Sorry if I'm missing something, but I haven't come across a way to create a custom cdf w/o affy's bpmap and cif files ... did you create your own bpmap from your annotations (or something)?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Cornell Medical College of Cornell University
http://cbio.mskcc.org/~lianos

Hi Steve,

Steve Lianoglou wrote:

> The only question I have is what did you then use to get it back into
> the expected CDF format for seamless use in Bioconductor?
>
> Sorry if I'm missing something, but I haven't come across a way to
> create a custom cdf w/o affy's bpmap and cif files ... did you create
> your own bpmap from your annotations (or something)?

Actually, I have created a CDF package for the Arabidopsis 1.0R tiling array that is available here:
ftp://ftp.psb.ugent.be/pub/nanao/athtiling1.0rcdf.tar.gz

I did the following:

1. For each probe that I selected for a specific gene (AGI code), I kept track of the PM and MM 'xy' position on the array (via the bpmap file provided by Affymetrix for the tiling array).

2. I created 2 functions to convert the 'xy' positions on the array to the 'i' positions that will be in your CDF. The basic functions are the following:

    xy2i = function(x, y) { y * DIM + x + 1 }
    i2xy = function(i) {
      r = cbind((i - 1) %% DIM, (i - 1) %/% DIM)
      colnames(r) = c('x', 'y')
      return(r)
    }

   where DIM corresponds to the dimension of the tiling array.

3. Then, I created an environment containing the PM and MM positions for each gene (like in a normal CDF). You should have an environment (e.g. athtiling1.0rcdf) where the labels are the names of the genes and the content is a matrix with the 'i' positions of the PM and MM indexes for each (the i positions are simply calculated with the previous xy2i function). Something like:

    ## Create environment
    myverynicecdf = new.env()
    ## For each gene, create a matrix of 2 columns (pm, mm) and x rows
    ## (corresponding to the i positions on the array) and assign it
    matrix_PM_MM = blabla   # placeholder: the pm/mm index matrix for this gene
    assign(GENE_NAME, matrix_PM_MM, envir = myverynicecdf)

   In the end, you should end up with the following:

    ## Example
    > get("AT1G01020", athtiling1.0rcdf)
               pm      mm
     [1,] 3984824 3987384
     [2,] 1022692 1025252
     ...
    [10,]  511312  513872
    [11,] 5051196 5053756

4. I saved this environment as a package. The code should resemble the following:

    package.skeleton(name = "myverynicecdf",
                     list = c("myverynicecdf", "xy2i", "i2xy"),
                     path = "my/path/")

There are two or three changes to make in the package that is created (folder and files), which are basically the normal things for an R package. (If you have any questions about that I can also help.) Then you build your package, and basically you have your Drosophila CDF library that can be used in R :)

I hope I am clear, but you can always ask me questions about this.

Naira

--
Naira Naouar
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
nanao at psb.ugent.be
http://www.psb.ugent.be
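For anyone who wants to check the index arithmetic before building a package, here is a small self-contained sketch of steps 2 and 3. It assumes DIM = 2560, which is not stated in the post but is inferred from the example output, where every mm index is exactly 2560 larger than its pm index; it also assumes each MM probe sits one row below its PM probe, which is consistent with that output. The environment name toycdf is made up.

```r
# Array side length. Assumption: 2560, inferred from the example output
# above, where each mm index is exactly 2560 larger than its pm index.
DIM <- 2560

# Convert 0-based (x, y) array coordinates to the 1-based 'i' index and back.
xy2i <- function(x, y) y * DIM + x + 1
i2xy <- function(i) {
  r <- cbind((i - 1) %% DIM, (i - 1) %/% DIM)
  colnames(r) <- c("x", "y")
  r
}

# Toy probe set for one gene: PM coordinates recovered from the first two
# pm indexes shown above; each MM is assumed to sit one row below its PM.
pm_xy <- i2xy(c(3984824, 1022692))
pm_i  <- xy2i(pm_xy[, "x"], pm_xy[, "y"])
mm_i  <- xy2i(pm_xy[, "x"], pm_xy[, "y"] + 1)

# CDF-style environment: gene name -> two-column matrix of pm/mm indexes.
toycdf <- new.env()
assign("AT1G01020", cbind(pm = pm_i, mm = mm_i), envir = toycdf)
get("AT1G01020", envir = toycdf)
```

Under these assumptions the two rows reproduce the first two pm/mm pairs of the example output, which is a quick way to confirm that xy2i and i2xy are inverses for a given DIM.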
