select one Affy probeset for one gene


Glazko, Galina ▴ 350

@glazko-galina-1653

Last seen 10.6 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060313/135334f1/attachment.pl


ADD COMMENT • link 19.1 years ago Glazko, Galina ▴ 350


Sean Davis 21k

@sean-davis-490

Last seen 9 weeks ago

United States

On 3/13/06 3:38 PM, "Glazko, Galina" <galina_glazko at urmc.rochester.edu> wrote:

> Dear list,
>
> Is there a way to automatically select one probeset for one gene in Affy
> arrays?
>
> Say, if we have several probesets for a given gene, we select the one
> with the highest level of expression, or based on any other reasonable
> criteria...?
>
> I am sorry if this question was answered before, it seems to be very
> basic question and I hope there is the solution...

Galina,

You can contrive a solution, I suppose. However, I'm not sure this is a good idea. Whatever "reasonable criteria" you use are likely to lead to bias. Filtering out unmeasured probesets, or filtering on other quality measures applied equally to all probesets, is probably reasonable, but applying such criteria on a per-gene basis is not. There have been related discussions in the past, often centering around "averaging" expression values.

The more accepted way of dealing with multiple probesets is to do your analysis based on the probeset; only after that is done do you then connect your gene labels back to the probesets.

Sean

ADD COMMENT • link 19.1 years ago Sean Davis 21k


Hi,

Sean Davis wrote:

> The more accepted way of dealing with multiple probesets is to do your
> analysis based on the probeset; only after that is done do you then connect
> your gene labels back to the probesets.

Unfortunately that approach does not always work, and something needs to be done a bit earlier in the process if a user wants to make use of data such as GO, chromosomal location, etc., where the mapping is based on Entrez Gene ID (for example, but other identifiers have very similar issues). Not removing the duplicates often leads to quite different results (in essence there is over-counting if all probes are accurate). As users of GOstats know, you have to choose one candidate for each Entrez Gene ID (and probably what I have been doing there is not ideal - the suggestion below, due to Seth Falcon, is, I think, better). But I would be interested to hear other points of view.

I also do not like averaging, for several reasons. First, I now have two kinds of measurements (averages and ordinary old probes), and that is problematic for some uses. Second, if not all of the probes work (which might be why there are several variants) then I am averaging the good with the bad, which also seems like a less than ideal way to go.

One suggestion is to do non-specific filtering (say on variation, or for expressed versus not, or something of that ilk) and to then select, for each gene, the probe set that has the highest value of that statistic. Thus, you are selecting the probe set with the most information (but do be careful not to use any phenotypic information, as this could cause problems). Your (Galina's) suggestion was to use level of expression, but that is generally a bad idea because it would involve a between-probe, within-array comparison, and these are not ideal; just because one spot is brighter does not mean it works better, or that there is more mRNA than for a less bright spot.

HTH
Robert

> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
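The suggestion above (non-specific filtering, then one probe set per Entrez Gene ID) could be sketched roughly as follows. This is only an illustration in plain Python rather than R, and it rests on assumptions that are not in the thread: the interquartile range is used as the variability statistic, the cutoff `min_iqr` is arbitrary, and all probe set and gene identifiers are made up.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range of one probe set across samples."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def pick_one_probeset_per_gene(expr, probe_to_gene, min_iqr=0.5):
    """Keep, for each gene, the probe set with the largest IQR.

    expr          -- {probeset_id: [expression value per sample]}
    probe_to_gene -- {probeset_id: gene_id}
    min_iqr       -- hypothetical non-specific filter cutoff
    """
    best = {}  # gene_id -> (iqr, probeset_id)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unannotated probe sets cannot be assigned to a gene
        spread = iqr(values)
        if spread < min_iqr:
            continue  # non-specific filter: no phenotype information is used
        if gene not in best or spread > best[gene][0]:
            best[gene] = (spread, probe)
    return {gene: probe for gene, (_, probe) in best.items()}


# Hypothetical log-scale expression values for three probe sets, four samples.
expr = {
    "probeset_A1": [7.1, 7.3, 9.8, 10.2],
    "probeset_A2": [6.0, 6.1, 6.2, 6.0],
    "probeset_B1": [5.0, 8.0, 5.5, 8.5],
}
probe_to_gene = {
    "probeset_A1": "gene_A",
    "probeset_A2": "gene_A",
    "probeset_B1": "gene_B",
}
selected = pick_one_probeset_per_gene(expr, probe_to_gene)
print(selected)
```

Note that the selection uses only the spread of each probe set across samples, never the sample labels, which is the point of the warning about phenotypic information.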

ADD REPLY • link 19.1 years ago rgentleman ★ 5.5k


Robert Gentleman wrote:

> Second, if not all of the probes work (which might be why there are
> several variants) then I am averaging the good with the bad, which also
> seems like a less than ideal way to go.

One inherent problem with using the Affy probesets is that there are known issues with many of the probes; some measure related transcripts and others measure unrelated transcripts, so what you are measuring is not always clear. The MBNI CDFs, which have been re-mapped, may help with at least two of these problems. First, all probes that no longer blast to the transcript of interest are removed from consideration. Second, all probes that do blast to the transcript of interest are piled together into one probeset (I guess you could argue this is bad since the expression measures are now based on variable numbers of probes, but that is already true anyway...). Note that these CDFs are planned to be part of the new release of BioC, but currently are only available from the MBNI website:

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

Since you now have only one probeset per gene (based on Entrez Gene, UniGene, RefSeq, or Ensembl) you no longer have to decide which one to use. The biggest downside to using these CDFs is the lack of infrastructure in BioC that is tailored to their use, which requires a higher level of understanding of R than one would need to use a 'stock' CDF (which reminds me - I should be doing something about that ;-D).

HTH,

Jim

--
James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core
1500 E Medical Center Drive
Ann Arbor MI 48109
734-647-5623
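The idea behind the re-mapped CDFs (drop probes that no longer match their target, then pool the remaining probes per gene) can be illustrated with a small sketch. This is not the MBNI procedure itself: the input format, the rule of discarding probes that match more than one gene, and every identifier below are assumptions made for the example.

```python
from collections import defaultdict


def rebuild_probesets(probe_hits):
    """Group individual probes into one probe set per gene.

    probe_hits -- {probe_id: set of gene IDs the probe sequence matches}
    Probes with no match, or with matches to several genes, are dropped
    (the second rule is an assumption of this sketch).
    """
    probesets = defaultdict(list)
    for probe, genes in sorted(probe_hits.items()):
        if len(genes) != 1:
            continue  # no hit or ambiguous hit: remove from consideration
        (gene,) = genes
        probesets[gene].append(probe)
    return dict(probesets)


# Hypothetical alignment results for five probes.
probe_hits = {
    "probe_01": {"gene_1"},
    "probe_02": {"gene_1"},
    "probe_03": set(),               # no longer matches any transcript
    "probe_04": {"gene_1", "gene_2"},  # matches two genes
    "probe_05": {"gene_2"},
}
remapped = rebuild_probesets(probe_hits)
print(remapped)
```

As noted in the post, the resulting probe sets contain variable numbers of probes, which the example reproduces: two probes for one gene and one for the other.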

ADD REPLY • link 19.1 years ago James W. MacDonald 68k


James W. MacDonald wrote:

> Since you now have only one probeset per gene (based on Entrez Gene,
> UniGene, RefSeq, or Ensembl) you no longer have to decide which one to
> use. The biggest downside to using these cdfs is the lack of
> infrastructure in BioC that is tailored to their use, which requires a
> higher level of understanding of R than one would need to use a 'stock'
> cdf (which reminds me - I should be doing something about that ;-D).

Hi,

These are good points, but I think that they are complementary rather than a strict replacement. First, I might just have expression data, not CEL files, so this approach would not be an option. Second, I might decide to map to UniGene or RefSeq, and then would still have the same problem, since these do not necessarily have a 1-1 correspondence with Entrez Gene. And finally, I might be working with cDNA arrays, where there is no clear way to take this same approach.

That is not to say that this is not a viable approach, and it certainly does solve some problems.

best wishes
Robert

ADD REPLY • link 19.1 years ago rgentleman ★ 5.5k


Glazko, Galina ▴ 350


Dear Sean,

Thank you for the answer.
This sounds good but what if I do multiple testing?
Then my adjusted p-values are based on the entire array, and I will not be able to see differentially expressed genes because I am testing say 40,000 hypotheses, while there are actually as many hypotheses as there are genes.

I would appreciate it if you could give me some references to the papers where this question was discussed.

Best regards
Galina

-----Original Message-----
From: Sean Davis [mailto:sdavis2@mail.nih.gov]
Sent: Monday, March 13, 2006 4:05 PM
To: Glazko, Galina; Bioconductor
Subject: Re: [BioC] select one Affy probeset for one gene

> The more accepted way of dealing with multiple probesets is to do your
> analysis based on the probeset; only after that is done do you then connect
> your gene labels back to the probesets.
>
> Sean
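To make the multiple-testing concern concrete, here is a minimal sketch showing how the same raw p-value is adjusted differently depending on how many hypotheses are tested. The thread does not name an adjustment method, so Bonferroni and Benjamini-Hochberg are used here purely as familiar examples; the figure of 15,000 genes and all p-values are hypothetical.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value for a single test among n_tests."""
    return min(1.0, p * n_tests)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_raw = 1e-6
per_probeset = bonferroni(p_raw, 40_000)  # one test per probe set
per_gene = bonferroni(p_raw, 15_000)      # hypothetical number of genes
print(per_probeset, per_gene)

bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(bh)
```

The adjusted value shrinks as the number of tests shrinks, which is why reducing the number of probesets tested (by filtering that does not use the phenotype) can help.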

ADD COMMENT • link 19.1 years ago Glazko, Galina ▴ 350


On 3/13/06 4:15 PM, "Glazko, Galina" <galina_glazko at urmc.rochester.edu> wrote:

> Dear Sean,
>
> Thank you for the answer.
> This sounds good but what if I do multiple testing?
> Then my adjusted p-values are based on the entire array, and I will not
> be able to see differentially expressed genes because I am testing say
> 40,000 hypotheses, while there are actually as many hypotheses as there
> are genes.

Galina,

I agree here, but this general concept is slightly different from trying to choose the "best" probeset for a given gene. To reduce the data dimensionality, you want to choose probesets that:

1) Are measuring something.
2) Are showing some variation between samples.

To determine 1, you can use multiple lines of evidence, such as level of expression or affy calls. To determine 2, you can calculate a CV (coefficient of variation) or something like that. Notice that this doesn't involve determining which probesets represent which gene, but only determining the "quality" of the data. You can look at the genefilter package for some hints about how to do this.

Hope this clarifies a bit.

Sean
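The two criteria above could be sketched as a simple filter. This is a plain-Python illustration, not the genefilter API: the function name, both thresholds, and the example data are assumptions, and mean expression stands in for "measuring something" (affy calls would be another option).

```python
from statistics import mean, stdev


def nonspecific_filter(expr, min_mean=6.0, min_cv=0.05):
    """Return probe sets that are expressed and vary between samples.

    expr     -- {probeset_id: [expression value per sample]}
    min_mean -- hypothetical cutoff for criterion 1 (measuring something)
    min_cv   -- hypothetical cutoff for criterion 2 (variation)
    """
    kept = []
    for probe, values in expr.items():
        mu = mean(values)
        if mu < min_mean:
            continue  # criterion 1 failed: signal too low
        cv = stdev(values) / mu  # coefficient of variation
        if cv < min_cv:
            continue  # criterion 2 failed: essentially flat across samples
        kept.append(probe)
    return kept


# Hypothetical expression values for three probe sets, four samples.
expr = {
    "probeset_A": [8.0, 8.1, 10.0, 10.2],  # expressed and variable
    "probeset_B": [3.0, 3.2, 2.9, 3.1],    # low signal
    "probeset_C": [9.0, 9.0, 9.1, 9.0],    # expressed but flat
}
kept = nonspecific_filter(expr)
print(kept)
```

The filter never looks at gene annotation or sample groups, so it reduces the number of hypotheses without choosing among probesets of the same gene.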

ADD REPLY • link 19.1 years ago Sean Davis 21k
