Hi Greg,
I use some nice functions written by Ariel Chernomoretz (thanks!!) and
posted to the Bioconductor mailing list on April 21, 2005:
http://article.gmane.org/gmane.science.biology.informatics.conductor/4
258/match=,
although I do have a caveat/question about the effects of removing
probe
sets on gcrma values (see below).
You can use the main function 'RemoveProbes' to remove individual
probes
and/or entire probe sets, and I routinely use it to remove the non-
soybean
probe sets from Affy's soybean array. I made one change to the code of
'RemoveProbes', which was to add a call at the beginning to
automatically
load the cdf and probe packages if they weren't loaded already:
require(cdfpackagename,character.only=T)
require(probepackagename,character.only=T)
If you contact me off-list I can send you my revised code and a few
pointers if you're having trouble figuring out how to get
'RemoveProbes' to
work. Basically, all you need to input is a list of the probe set
names
that you want to remove.
CAVEAT: Ariel pointed out in a post on April 20, 2005
(
http://article.gmane.org/gmane.science.biology.informatics.conductor/
4242/match=)
that removing even one probe will slightly affect the resulting gcrma
values, because of the sampling of affinity values, likely in the
internal
function 'bg.parameters.ns'. I have found this effect to be greatly
increased when I remove the non-soybean probe sets (~40% of the array)
before running 'gcrma': 88% of the remaining probe sets have smaller
gcrma
values than before, and 25% have been decreased 1.4 fold or more. The
changes occur in the background adjustment step, not in the
normalization
or summarization steps. If you remove a random 40% of the probe sets,
there
isn't the same magnitude (90% < 1.05 fold) or bias in direction (45%
smaller) of changes . That's when I noticed that the distribution of
expression values in my samples for the soybean genes was very
different
from the distribution of non-soybean genes. Nearly 80% of the soybean
genes
are called present in my samples but only 5% of the non-soybean genes
are
called present.
My question/concern is why a change in the distribution of probe
values can
have a large effect on the gc-based background correction? The
background
corrections used by RMA and MAS5 aren't affected by removing the
non-soybean genes, and neither is the background
correction+normalization
of VSN. It seems to be that there might be some inherent assumptions
in the
gc-based method that aren't in the other methods about the
distribution of
probe values. Which gcrma values are "better"? Oops - I just compared
the
limma results using both sets of gcrma values, and it appears that
whatever
changes are occurring don't really affect the significant gene lists
nor
the magnitude of direction of changes between two treatment groups, so
perhaps this caveat of mine is unnecessary! It might be important for
users
of soybean and other large arrays, if they run 'gcrma' on random
subsets of
arrays instead of all together in order to overcome memory limitations
-
make sure you are consistent in cutting out the probe sets before
running
'gcrma', else your values will not be comparable between subsets!!
Cheers,
Jenny
At 05:30 AM 6/26/2006, Henrik Bengtsson wrote:
>See the affxparser package, e.g. readCelUnits(filenames,
>units=c(1,600:612,45)). At the moment,you have to take it from there
>yourself.
>
>Henrik Bengtsson
>
>On 6/24/06, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote:
> > Hi Greg,
> >
> > Alvord, Greg (DMS) [Contr] wrote:
> > >
> > >
> > > Dear List -
> > >
> > >
> > >
> > > I am new to BioConductor and R, working under Windows with a gig
of RAM,
> > > version R-2.2.1 of R. I have successfully read in six .CEL
files and
> > > created the following AffyBatch object.
> > >
> > >
> > >
> > >
> > >>soy.ab
> > >
> > >
> > > AffyBatch object
> > >
> > > size of arrays=1164x1164 features (63516 kb)
> > >
> > > cdf=Soybean (61170 affyids)
> > >
> > > number of samples=6
> > >
> > > number of genes=61170
> > >
> > > annotation=soybean
> > >
> > >
> > >
> > > The investigator for whom I'm working is interested in an
analysis of
> > > differential gene expression on a subset of affyids in this
AffyBatch
> > > object, specifically in 37,744 of the 61,170 affyids (indicated
above)
> > > that relate specifically to the soybean genome. I have learned
that the
> > > relevant species of interest is labeled 'Glycine max'. I
obtained this
> > > information from another source and have not (due to my
ignorance) been
> > > able to identify any slot in soy.ab AffyBatch object that
identifies
> > > this species. Here is a table of the species on the soy.ab
AffyBatch
> > > object (which I obtained from another source).
> > >
> > >
> > >
> > >
> > >>cbind(table(Species))
> > >
> > >
> > > [,1]
> > >
> > > Alfalfa mosaic virus 3
> > >
> > > Bean pod mottle virus strain G-7 2
> > >
> > > Bean pod mottle virus strain K-Hancock1 1
> > >
> > > Clover yellow vein virus 1
> > >
> > > Glycine max 37744
> > >
> > > Heterodera glycines 7539
> > >
> > > Phytophthora sojae 15864
> > >
> > > S. saman 4
> > >
> > > Southern bean mosaic virus strain SBMV-S 1
> > >
> > > Soybean mosaic virus 1
> > >
> > > Soybean mosaic virus strain G5 3
> > >
> > > Soybean mosaic virus strain G7 1
> > >
> > > Soybean mosaic virus strain N 1
> > >
> > > Tobacco ringspot virus 2
> > >
> > > Tobacco streak virus 3
> > >
> > >
> > >
> > >
> > >
> > > I want to select from the soy.ab AffyBatch object the relevant
> > > information for the species 'Glycine max' only. I have created
a data
> > > frame containing those Affy.ID's for species 'Glycine max',
e.g.,
> > >
> > >
> > >
> > >
> > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),]
> > >
> > >
> > > Species Affy.ID
> > >
> > > 8 Glycine max AFFX-BioB-3_at
> > >
> > > 9 Glycine max AFFX-BioB-5_at
> > >
> > > 10 Glycine max AFFX-BioB-M_at
> > >
> > > 37749 Glycine max soybean_rRNA_838_RC_at
> > >
> > > 37750 Glycine max soybean_rRNA_918_at
> > >
> > > 37751 Glycine max soybean_rRNA_918_RC_at
> > >
> > >
> > >
> > >
> > >>dim(Glycine.max.Species.AffyID.df)
> > >
> > >
> > > [1] 37744 2
> > >
> > >
> > >
> > > How do I extract/create an AffyBatch object containing only the
> > > appropriate Affy.ID's related to the 'Glycine max' species?
> >
> > An AffyBatch object isn't the best for subsetting this way. Better
would
> > be to compute expression values using rma() or your favorite
method, and
> > then subset.
> >
> > eset <- rma(soy.ab)
> > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],]
> >
> > HTH,
> >
> > Jim
> >
> > --
> > James W. MacDonald
> > University of Michigan
> > Affymetrix and cDNA Microarray Core
> > 1500 E Medical Center Drive
> > Ann Arbor MI 48109
> > 734-647-5623
> >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and
should
> not be used for urgent or sensitive issues.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> >
https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu