On 3/13/06 3:38 PM, "Glazko, Galina" <galina_glazko at urmc.rochester.edu>
wrote:
> Dear list,
>
> Is there a way to automatically select one probeset for one gene in
> Affy arrays?
>
> Say, if we have several probesets for a given gene, we select the one
> with the highest level of expression, or based on any other reasonable
> criteria...?
>
> I am sorry if this question was answered before; it seems to be a very
> basic question and I hope there is a solution...
Galina,

You can contrive a solution, I suppose. However, I'm not sure this is a
good idea. Whatever "reasonable criteria" you use are likely to lead to
bias. Filtering out unmeasured probesets, or filtering on other quality
measures applied equally to all probesets, is probably reasonable, but
applying such criteria on a per-gene basis is not. There have been
related discussions in the past, often centering around "averaging"
expression values.

The more accepted way of dealing with multiple probesets is to do your
analysis based on the probeset; only after that is done do you then
connect your gene labels back to the probesets.

Sean
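A minimal sketch of the probeset-first workflow described above, written in Python purely for illustration (the thread itself concerns R/Bioconductor). The probeset IDs, gene labels and p-values are made up, and `annotate` is a hypothetical helper, not a Bioconductor function:

```python
# Hypothetical probeset-level results: probeset ID -> p-value from some test
# that was run on every probeset, with no reference to gene labels.
probeset_pvalues = {
    "1001_at": 0.001,
    "1002_s_at": 0.20,
    "1003_at": 0.03,
}

# Hypothetical annotation: probeset ID -> gene label.
# Several probesets may share one gene.
probeset_to_gene = {
    "1001_at": "GENE_A",
    "1002_s_at": "GENE_A",
    "1003_at": "GENE_B",
}


def annotate(results, annotation):
    """Attach a gene label to each probeset result, sorted by p-value."""
    rows = [(ps, annotation.get(ps, "unknown"), p) for ps, p in results.items()]
    return sorted(rows, key=lambda row: row[2])


annotated = annotate(probeset_pvalues, probeset_to_gene)
for probeset, gene, p in annotated:
    print(probeset, gene, p)
```

The gene labels play no part in the analysis itself; they are only joined on at the end, so a gene with several probesets simply appears on several rows.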
Hi,

Unfortunately that approach does not always work, and something needs
to be done a bit earlier in the process if a user wants to make use of
data such as GO, chromosomal location, etc., where the mapping is based
on Entrez Gene ID (for example; other identifiers have very similar
issues). Not removing the duplicates often leads to quite different
results (in essence there is over-counting if all probes are accurate).
As users of GOstats know, you have to choose one candidate for each
Entrez Gene ID (and what I have been doing there is probably not ideal;
the suggestion below, due to Seth Falcon, is, I think, better). But I
would be interested to hear other points of view.
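To illustrate the over-counting point, here is a small sketch in Python (used only for illustration; the thread is about R/Bioconductor). The probeset IDs, Entrez Gene IDs and category membership are all invented, and this is not how GOstats is implemented:

```python
# Hypothetical list of probesets selected as "interesting".
selected_probesets = ["p1", "p2", "p3", "p4"]

# Hypothetical mapping: three probesets target the same Entrez Gene ID.
probeset_to_entrez = {"p1": "100", "p2": "100", "p3": "100", "p4": "200"}

# Hypothetical category (for example one GO term), keyed by Entrez Gene ID.
category_genes = {"100", "300"}

# Counting per probeset: gene 100 contributes three hits.
hits_by_probeset = sum(
    probeset_to_entrez[p] in category_genes for p in selected_probesets
)

# Counting per gene: collapse to unique Entrez Gene IDs first.
selected_genes = {probeset_to_entrez[p] for p in selected_probesets}
hits_by_gene = len(selected_genes & category_genes)

print(hits_by_probeset, "of", len(selected_probesets), "probesets in category")
print(hits_by_gene, "of", len(selected_genes), "genes in category")
```

Any test that compares these counts against a background will see a much stronger signal in the per-probeset version, even though only one gene is involved.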

I also do not like averaging, for several reasons. First, I now have
two kinds of measurements (averages and ordinary old probes), and that
is problematic for some uses. Second, if not all of the probes work
(which might be why there are several variants), then I am averaging
the good with the bad, which also seems like a less than ideal way to
go.

One suggestion is to do non-specific filtering (say on variation, or
for expressed versus not, or something of that ilk) and then to select,
for each gene, the probeset that has the highest value of the filtering
statistic. Thus, you are selecting the probeset with the most
information (but do be careful not to use any phenotypic information,
as this could cause problems). Your (Galina's) suggestion was to use
level of expression, but that is generally a bad idea because it would
involve a between-probe, within-array comparison, and these are not
ideal; just because one spot is brighter does not mean it works better,
or that there is more mRNA than for a less bright spot.

HTH
Robert
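The suggestion above could be sketched roughly as follows. This is Python for illustration only, with invented data; variance is just one possible choice of "variation" statistic, the 0.01 cutoff is arbitrary, and `pick_one_per_gene` is a hypothetical helper rather than anything from GOstats or genefilter:

```python
from statistics import pvariance

# Hypothetical expression values (log scale) across four arrays.
expression = {
    "p1": [7.0, 7.1, 6.9, 7.0],
    "p2": [5.0, 8.0, 5.5, 8.5],
    "p3": [9.0, 9.2, 8.8, 9.1],
    "p4": [4.0, 4.0, 4.1, 3.9],
}

# Hypothetical mapping: p1 and p2 target the same Entrez Gene ID.
probeset_to_entrez = {"p1": "100", "p2": "100", "p3": "200", "p4": "300"}


def pick_one_per_gene(expression, probeset_to_entrez, min_variance=0.0):
    """Keep, per gene, the probeset with the largest variance across arrays.

    The statistic uses expression values only (no phenotype labels).
    Probesets below min_variance are dropped before the per-gene choice.
    """
    best = {}  # gene -> (probeset, variance)
    for probeset, values in expression.items():
        gene = probeset_to_entrez.get(probeset)
        if gene is None:
            continue  # unannotated probeset
        variance = pvariance(values)
        if variance < min_variance:
            continue  # fails the non-specific filter
        if gene not in best or variance > best[gene][1]:
            best[gene] = (probeset, variance)
    return {gene: probeset for gene, (probeset, _) in best.items()}


chosen = pick_one_per_gene(expression, probeset_to_entrez, min_variance=0.01)
print(chosen)
```

With these made-up numbers, gene 100 is represented by p2 (the more variable of its two probesets), while gene 300 drops out entirely because its only probeset fails the filter.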
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
>
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
One inherent problem with using the Affy probesets is that there are
known issues with many of the probes; some measure related transcripts
and others measure unrelated transcripts, so what you are measuring is
not always clear. The re-mapped MBNI cdfs may help with at least two of
these problems. First, all probes that no longer BLAST to the
transcript of interest are removed from consideration. Second, all
probes that do BLAST to the transcript of interest are piled together
into one probeset (I guess you could argue this is bad since the
expression measures are now based on variable numbers of probes, but
that is already true anyway...). Note that these cdfs are planned to be
part of the new release of BioC, but currently are only available from
the MBNI website:

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

Since you now have only one probeset per gene (based on Entrez Gene,
UniGene, RefSeq, or Ensembl), you no longer have to decide which one to
use. The biggest downside to using these cdfs is the lack of
infrastructure in BioC that is tailored to their use, which requires a
higher level of understanding of R than one would need to use a 'stock'
cdf (which reminds me - I should be doing something about that ;-D).

HTH,

Jim
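The two re-mapping steps described above (drop probes that no longer match, then pool the rest by target) can be pictured with a toy example. This is an illustrative Python sketch with invented probe IDs and mappings; it is not the MBNI pipeline, and `remap` is a hypothetical helper:

```python
from collections import defaultdict

# Hypothetical probe-level table:
# (probe ID, original probeset, gene the probe sequence actually matches).
# None means the probe no longer matches its intended target.
probes = [
    ("probe1", "1001_at", "100"),
    ("probe2", "1001_at", "100"),
    ("probe3", "1001_at", None),
    ("probe4", "1002_s_at", "100"),
    ("probe5", "1003_at", "200"),
]


def remap(probes):
    """Build one probeset per gene from the probes that still match it."""
    new_sets = defaultdict(list)
    for probe_id, _old_probeset, gene in probes:
        if gene is None:
            continue  # step 1: discard probes that no longer match
        new_sets[gene].append(probe_id)  # step 2: pool by target gene
    return dict(new_sets)


remapped = remap(probes)
print(remapped)
```

Note that the resulting probesets contain different numbers of probes, which is the caveat raised in the message.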
--
James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core
1500 E Medical Center Drive
Ann Arbor MI 48109
734-647-5623
**********************************************************
Electronic Mail is not secure, may not be read every day, and should
not be used for urgent or sensitive issues.
Hi,

These are good points, but I think the two approaches are complementary;
one is not a strict replacement for the other. First, I might just have
expression data, not CEL files, so this approach would not be an option.
Second, I might decide to map to UniGene or RefSeq, and then I would
still have the same problem, since these do not necessarily have a 1-1
correspondence with Entrez Gene. And finally, I might be working with
cDNA arrays, where there is no clear way to take this same approach.
That is not to say that this is not a viable approach; it certainly
does solve some problems.

best wishes
Robert
Dear Sean,

Thank you for the answer.
This sounds good, but what if I do multiple testing? Then my adjusted
p-values are based on the entire array, and I will not be able to see
differentially expressed genes because I am testing, say, 40,000
hypotheses, while there are actually only as many hypotheses as there
are genes.

I would appreciate it if you could give me some references to papers
where this question was discussed.

Best regards
Galina
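The concern about the number of hypotheses can be made concrete with a small worked example. This Python sketch uses the Bonferroni correction only because it is the simplest adjustment to show; the raw p-value and the gene count of 20,000 are assumptions for illustration, while 40,000 is the figure from the message:

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 2e-6          # hypothetical raw p-value for one probeset
n_probesets = 40_000  # number of hypotheses when testing every probeset
n_genes = 20_000      # assumed number of distinct genes (illustrative)

adjusted_by_probeset = bonferroni(raw_p, n_probesets)
adjusted_by_gene = bonferroni(raw_p, n_genes)

print("adjusted over probesets:", adjusted_by_probeset)
print("adjusted over genes:    ", adjusted_by_gene)
```

Under these assumptions the same raw p-value falls above a 0.05 cutoff when adjusted over all probesets and below it when adjusted over genes, which is the effect being asked about.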
-----Original Message-----
From: Sean Davis [mailto:sdavis2@mail.nih.gov]
Sent: Monday, March 13, 2006 4:05 PM
To: Glazko, Galina; Bioconductor
Subject: Re: [BioC] select one Affy probeset for one gene
On 3/13/06 4:15 PM, "Glazko, Galina" <galina_glazko at urmc.rochester.edu>
wrote:
> Dear Sean,
>
> Thank you for the answer.
> This sounds good but what if I do multiple testing?
> Then my adjusted p-values are based on the entire array, and I will not
> be able to see differentially expressed genes because I am testing say
> 40,000 hypotheses, while there are actually as many hypotheses as there
> are genes.
Galina,

I agree here, but this general concept is slightly different from
trying to choose the "best" probeset for a given gene.

To reduce the data dimensionality, you want to choose probesets that:

1) Are measuring something.
2) Are showing some variation between samples.

To determine 1, you can use multiple lines of evidence, such as level
of expression or Affy calls. To determine 2, you can calculate a CV
(coefficient of variation) or something like that. Notice that this
doesn't involve determining which probesets represent which gene, but
only determining the "quality" of the data.

You can look at the genefilter package for some hints about how to do
this.

Hope this clarifies a bit.

Sean
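The two criteria above might be sketched like this. It is Python for illustration only and does not reproduce the genefilter API; the intensities are invented and assumed to be on the raw (unlogged) scale, and the thresholds (100, two samples, CV of 0.1) are arbitrary choices:

```python
from statistics import mean, pstdev

# Hypothetical unlogged intensities across four arrays.
expression = {
    "p1": [50, 55, 48, 52],          # low everywhere
    "p2": [200, 800, 250, 900],      # expressed and variable
    "p3": [1000, 1010, 990, 1000],   # expressed but nearly constant
}


def is_measured(values, min_level=100, min_samples=2):
    """Criterion 1: at least min_samples arrays reach min_level."""
    return sum(v >= min_level for v in values) >= min_samples


def coefficient_of_variation(values):
    """Criterion 2 statistic: standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else 0.0


def nonspecific_filter(expression, min_cv=0.1):
    """Return probesets passing both criteria; gene labels are never used."""
    return [
        probeset
        for probeset, values in expression.items()
        if is_measured(values) and coefficient_of_variation(values) >= min_cv
    ]


kept = nonspecific_filter(expression)
print(kept)
```

Because every probeset is judged by the same rule, the filter shrinks the number of hypotheses without deciding which probeset "represents" a gene.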