Question about inconsistent gene expression

0

Entering edit mode

Chunyan Liu ▴ 40

@chunyan-liu-2324

Last seen 10.3 years ago

Dear James MacDonald and all from the list, I posted my question on the list last Wednesday and James MacDonald gave me profound insight on the issue. The following is a quick review and then another follow-up question. >Liu: I'm doing gene expression comparisons between two groups of subjects using affymetrix single-channel hgu133plus2 microarray chips. After using limma, I get a list of up- and down-regulated probeset (1,000 up and 2,000 down regulated probesets). When I translate these into unique gene symbols, I found 200 gene symbols that appear in both lists. Is this plausible? Interpretable? >MacDonald: Ah, now that is the problem, isn't it? Another problem is the case where 10 probesets are supposed to interrogate a particular gene and one is significant, but the other nine are not. In that case is the gene differentialy expressed or not? What you have to understand is that Affy designed the probesets for this chip based on the UniGene build 133, which was the best information at the time, but which is really outdated now (we are on build 203 currently). Even when they designed the chip, there were three levels of probesets. Those with an _at suffix, which indicated that the probes all blast exclusively to the transcript in question, those with an _s_at (or _a_at, I forget what they used for the 133), that indicates that some of the probes bind to related transcripts (whatever 'related' means), and _x_at, which indicates that some probes bind to completely unrelated transcripts. So even when the chip was designed, some of the probesets were not nearly as reliable as others. If you take the probe sequences and blast them today, you can find _at probesets with probes that bind to unrelated sequences, so time has not always been kind to the probe mappings. What can you do about this problem? There are a couple of things you can do, but any 'fix' has its own problems. First, you can use the remapped cdfs that are made available by the MBNI at the University of Michigan (via BioC). These remapped cdfs discard the original probesets and only use those probes that are known to map to unique sequences in the genome (based on the current UniGene build), and then map to transcripts or genes based on Entrez Gene, GenBank, UniGene, Ensembl, etc. The upside to these cdfs is that you will have only one probeset per transcript/gene, so it will be impossible to have a gene symbol appearing in both the up and down regulated groups. In addition, the assumptions of say RMA or GCRMA (or any probe-level models in affyPLM) will again hold true; in other words, the intensity of a given probe will be due only to the level of the transcript it is supposed to measure plus the probe-specific binding. The downside of these cdfs is that the number of probes per probeset will vary from something like 3 - 150, so the standard error of your estimate will also vary widely. If you simply take the expression values for these probesets and analyze using limma, you will be ignoring this extra level of error (which you can safely ignore using the 'stock' affy cdfs, since most of those probesets have 11 probes per). Second, you can just use the 'stock' affy cdfs, and do some ad hoc method to decide which of the probesets to believe. You can simply choose to believe only the _at probesets. Or you can decide to blast (or blat, which is much faster and AFAICT nearly as accurate) each of the disagreeing probesets to see which one appears to actually measure the gene transcript in question. The upside here is you don't have the extra level of variability introduced by the MBNI cdfs, but the downside is the amount of extra work it will entail. Follow-up question: Now, my thought is, given the noisy nature of microarray gene expression, I do believe for any mild to moderate gene effect, maybe even with strong gene effect like in cancer, this is a common but unrecognized problem. Does anyone notice this in their microarray analysis? If you do, please share your experience and thoughts with us because I really think this is something that deserves attention. I did some analyses based on James response and here is what I found: Among the 3000 probesets list from limma that were significantly expressed, some genes had multiple probesets (but only about 500), the following table summarizes how the disagreement varies by the number of signigicant probesets within a gene. For example, there were 400 genes that were represented by 2 significant probesets and of the 80 genes were inconsistent! That is, one probeset was up-regulated and the other was down-regulated. Of course, the more probesets, the greater the chance of one disagreement. probesets #of genes #of disagree % with at least one disagreement 2 400 80 20% 3 100 40 40% 4 30 15 50% 5 5 4 80% 7 1 1 100% Then I picked out the probesets only with the suffix "_at" and did the same as above. Here are what I found. There were a total of about 2000 "_at" probesets. probesets #of genes #of disagree % 2 120 40 33% 3 40 20 50% 4 6 3 50% 5 2 2 100% An easy fix to not deal with the problem is to select the probesets with the largest IQR (or any other statistic) to get one probeset per gene as in GeneSet Enrichment Analysis by Robert Gentlemen. Any thoughts on doing this? So, has anyone else observed this result? is this common? So please share your thoughts with me. Thank you, Chunyan Liu Cincinnati Children's Hospital Medical Center

Microarray Cancer hgu133plus2 probe affy limma gcrma Microarray Cancer hgu133plus2 probe • 1.2k views

ADD COMMENT • link updated 17.4 years ago by Michal Okoniewski ▴ 120 • written 17.4 years ago by Chunyan Liu ▴ 40

0

Entering edit mode

Michal Okoniewski ▴ 120

@michal-okoniewski-1752

Last seen 10.3 years ago

Hi Chunyan Liu, There is yet another set of CDF files that correct for slightly different problems that the Brainarray ones, done by Lu et al and described in: http://www.biomedcentral.com/1471-2105/8/108 As for your question: it is not that surprising. 1) Transcript-gene relationships form very complex graphs. 2) The probesets interrogate various parts of transcripts and we know that there are various alternatively spliced isoforms. There is no such thing as uniform gene-level expression. Perhaps in some cases just a specific isoform was significantly expressed. Cheers, Michal -----Original Message----- From: bioconductor-bounces@stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Chunyan Liu Sent: 24 August 2007 14:33 To: bioconductor at stat.math.ethz.ch Subject: [BioC] Question about inconsistent gene expression Dear James MacDonald and all from the list, I posted my question on the list last Wednesday and James MacDonald gave me profound insight on the issue. The following is a quick review and then another follow-up question. >Liu: I'm doing gene expression comparisons between two groups of subjects using affymetrix single-channel hgu133plus2 microarray chips. After using limma, I get a list of up- and down-regulated probeset (1,000 up and 2,000 down regulated probesets). When I translate these into unique gene symbols, I found 200 gene symbols that appear in both lists. Is this plausible? Interpretable? >MacDonald: Ah, now that is the problem, isn't it? Another problem is the case where 10 probesets are supposed to interrogate a particular gene and one is significant, but the other nine are not. In that case is the gene differentialy expressed or not? What you have to understand is that Affy designed the probesets for this chip based on the UniGene build 133, which was the best information at the time, but which is really outdated now (we are on build 203 currently). Even when they designed the chip, there were three levels of probesets. Those with an _at suffix, which indicated that the probes all blast exclusively to the transcript in question, those with an _s_at (or _a_at, I forget what they used for the 133), that indicates that some of the probes bind to related transcripts (whatever 'related' means), and _x_at, which indicates that some probes bind to completely unrelated transcripts. So even when the chip was designed, some of the probesets were not nearly as reliable as others. If you take the probe sequences and blast them today, you can find _at probesets with probes that bind to unrelated sequences, so time has not always been kind to the probe mappings. What can you do about this problem? There are a couple of things you can do, but any 'fix' has its own problems. First, you can use the remapped cdfs that are made available by the MBNI at the University of Michigan (via BioC). These remapped cdfs discard the original probesets and only use those probes that are known to map to unique sequences in the genome (based on the current UniGene build), and then map to transcripts or genes based on Entrez Gene, GenBank, UniGene, Ensembl, etc. The upside to these cdfs is that you will have only one probeset per transcript/gene, so it will be impossible to have a gene symbol appearing in both the up and down regulated groups. In addition, the assumptions of say RMA or GCRMA (or any probe-level models in affyPLM) will again hold true; in other words, the intensity of a given probe will be due only to the level of the transcript it is supposed to measure plus the probe-specific binding. The downside of these cdfs is that the number of probes per probeset will vary from something like 3 - 150, so the standard error of your estimate will also vary widely. If you simply take the expression values for these probesets and analyze using limma, you will be ignoring this extra level of error (which you can safely ignore using the 'stock' affy cdfs, since most of those probesets have 11 probes per). Second, you can just use the 'stock' affy cdfs, and do some ad hoc method to decide which of the probesets to believe. You can simply choose to believe only the _at probesets. Or you can decide to blast (or blat, which is much faster and AFAICT nearly as accurate) each of the disagreeing probesets to see which one appears to actually measure the gene transcript in question. The upside here is you don't have the extra level of variability introduced by the MBNI cdfs, but the downside is the amount of extra work it will entail. Follow-up question: Now, my thought is, given the noisy nature of microarray gene expression, I do believe for any mild to moderate gene effect, maybe even with strong gene effect like in cancer, this is a common but unrecognized problem. Does anyone notice this in their microarray analysis? If you do, please share your experience and thoughts with us because I really think this is something that deserves attention. I did some analyses based on James response and here is what I found: Among the 3000 probesets list from limma that were significantly expressed, some genes had multiple probesets (but only about 500), the following table summarizes how the disagreement varies by the number of signigicant probesets within a gene. For example, there were 400 genes that were represented by 2 significant probesets and of the 80 genes were inconsistent! That is, one probeset was up-regulated and the other was down-regulated. Of course, the more probesets, the greater the chance of one disagreement. probesets #of genes #of disagree % with at least one disagreement 2 400 80 20% 3 100 40 40% 4 30 15 50% 5 5 4 80% 7 1 1 100% Then I picked out the probesets only with the suffix "_at" and did the same as above. Here are what I found. There were a total of about 2000 "_at" probesets. probesets #of genes #of disagree % 2 120 40 33% 3 40 20 50% 4 6 3 50% 5 2 2 100% An easy fix to not deal with the problem is to select the probesets with the largest IQR (or any other statistic) to get one probeset per gene as in GeneSet Enrichment Analysis by Robert Gentlemen. Any thoughts on doing this? So, has anyone else observed this result? is this common? So please share your thoughts with me. Thank you, Chunyan Liu Cincinnati Children's Hospital Medical Center _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -------------------------------------------------------- This email is confidential and intended solely for the use o...{{dropped}}

ADD COMMENT • link 17.4 years ago Michal Okoniewski ▴ 120

Login before adding your answer.