HGU133Plus2 CDF vs hgu133plus2hsentrezgcdf CDF (30% difference in results)


Guest User ★ 13k

@guest-user-4897

Last seen 10.7 years ago

Hello,

My name is Mahes Muniandy and I am a doctoral student. I have been analysing Affymetrix HGU133Plus2 CEL files to determine differential expression in twin pairs (within-pair differences). I have used affy, gcrma, nsFilter and limma for my analysis. I ran the analysis using the HGU133Plus2 CDF available in Bioconductor and then repeated the whole analysis using the HGU133Plus2 CDF from Brainarray. The limma results differ substantially (2351 differentially expressed genes for the former and 2700 genes for the latter). 630 of the 2351 genes (about 30%) do not appear in the list of 2700 genes.

I have read "Evolving Gene/Transcript Definitions Significantly Alter the Interpretation of GeneChip Data" (M. Dai et al.) and see some convincing arguments there, but I am unsure which limma results to go with. Could you advise me on the guiding principles I should follow in deciding which CDF to use? I do realise that the onus is on me to decide, but sadly I am quite lost in this matter. I would appreciate any help available.

Many thanks,
Mahes Muniandy, MSc, MBA, MCPM, PMP
Uni. Helsinki

-- output of sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] genefilter_1.44.0  limma_3.18.13      gcrma_2.34.0       affy_1.40.0
[5] Biobase_2.22.0     BiocGenerics_0.8.0

loaded via a namespace (and not attached):
 [1] affyio_1.30.0         annotate_1.40.1       AnnotationDbi_1.24.0
 [4] BiocInstaller_1.12.1  Biostrings_2.30.1     DBI_0.2-7
 [7] IRanges_1.20.7        preprocessCore_1.24.0 RSQLite_0.11.4
[10] splines_3.0.2         stats4_3.0.2          survival_2.37-7
[13] XML_3.98-1.1          xtable_1.7-3          XVector_0.2.0
[16] zlibbioc_1.8.0

--
Sent via the guest posting facility at bioconductor.org.

GO hgu133plus2 cdf affy limma • 2.3k views

updated 10.6 years ago by Steve Lianoglou ★ 13k • written 10.6 years ago by Guest User ★ 13k


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 25 days ago

United States

Hi,

I'd start by investigating whether the genes that appear in one analysis but not the other seem reasonable for your experiment (i.e. run a GO analysis on the differences and see whether they are relevant to the data/treatment you are studying).

Another thing to check is a plot of the t-statistics from the two analyses against each other. Is the discrepancy simply a result of genes dancing around the significance threshold? If you define significance by a certain FDR *and* a minimum absolute log-fold-change, you might get better concordance. If the concordance still isn't perfect, I'd go back to the differing genes and try to interpret the differences to see which result makes more sense.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech
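The two checks suggested above (comparing t-statistics and tightening the significance definition) could be sketched roughly as follows. This is only an illustration in base R on simulated data: the tables `affy_tab` and `brain_tab` are hypothetical stand-ins for the limma results from the two CDFs, the `gene_id` column assumes both results have already been mapped to a common gene identifier, and the cutoffs (FDR 0.05, |logFC| 0.5) are arbitrary.

```r
# Simulated stand-ins for the two limma result tables (all values invented).
# Column names logFC, t and adj.P.Val mirror those reported by limma;
# gene_id is a hypothetical shared identifier (e.g. Entrez Gene ID).
set.seed(1)
n <- 200
true_effect <- rnorm(n)

make_table <- function(effect, noise_sd) {
  t_stat <- 4 * effect + rnorm(length(effect), sd = noise_sd)
  p <- 2 * pnorm(-abs(t_stat))
  data.frame(
    gene_id   = sprintf("G%03d", seq_along(effect)),
    logFC     = effect / 2 + rnorm(length(effect), sd = 0.1),
    t         = t_stat,
    adj.P.Val = p.adjust(p, method = "BH"),
    stringsAsFactors = FALSE
  )
}

affy_tab  <- make_table(true_effect, noise_sd = 1)
brain_tab <- make_table(true_effect, noise_sd = 1)

# Keep only genes present in both analyses and line the rows up by gene.
both <- merge(affy_tab, brain_tab, by = "gene_id",
              suffixes = c(".affy", ".brain"))

# Check 1: t-statistics against each other. Points near the diagonal
# are genes on which the two CDFs agree.
t_cor <- cor(both$t.affy, both$t.brain)
if (interactive()) {
  plot(both$t.affy, both$t.brain,
       xlab = "t (Affymetrix CDF)", ylab = "t (Brainarray CDF)")
  abline(0, 1, lty = 2)
}

# Check 2: of the genes called significant in either analysis,
# what fraction is called significant in both?
concordance <- function(tab, fdr = 0.05, min_lfc = 0) {
  sig_affy  <- tab$adj.P.Val.affy  < fdr & abs(tab$logFC.affy)  >= min_lfc
  sig_brain <- tab$adj.P.Val.brain < fdr & abs(tab$logFC.brain) >= min_lfc
  sum(sig_affy & sig_brain) / sum(sig_affy | sig_brain)
}

conc_fdr_only <- concordance(both)                 # FDR cutoff alone
conc_fdr_lfc  <- concordance(both, min_lfc = 0.5)  # FDR plus |logFC| cutoff
```

With real data, genes found in only one table (dropped here by `merge`) are worth listing separately with `setdiff()` on the two ID columns, since those are the ones the CDF definitions disagree on outright.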

written 10.6 years ago by Steve Lianoglou ★ 13k


James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Another thing to consider is that the probesets for that array are based on UniGene build 133, which was current somewhere around 10 years ago (if not longer). That is a long time ago, considering the speed with which the human genome has been updated, so there may be many probesets on that array that no longer measure anything recognizable.

If you care to find out how bad (or good) the conventional Affymetrix probeset definitions are, you could (A) re-align the probe sequences against the current genome and see how many are still measuring the intended target. Or you could (B) assume that the updated alignments from MBNI (Brainarray) are better, and just go with that (certainly easier, but you know what they say about assumptions...).

Personally, I would go with option A, which would have two benefits. One, you would get to have some fun learning how to do something different. And really, who doesn't like that? Two, it would give you a rock-solid rationale for your choice of CDF, which should be impressive to your advisor because you a) thought about the problem and then b) did something to actively quantify the differences, so you can make an informed choice.

Best,
Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
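To give a feel for what the re-alignment option would quantify, here is a deliberately tiny sketch in base R. Everything in it is invented for illustration: the target sequences, the probe sequences, and the probeset names are hypothetical, and exact forward-strand string matching stands in for a real aligner, which would also need to handle the reverse strand, mismatches, and hits to multiple loci.

```r
# Toy target sequences (invented; a real check would use the current
# genome or transcriptome).
targets <- c(
  geneA = "ATGGCGTACGTTAGCCTAGGATCCGATTACGCTAGCTAGGCTA",
  geneB = "TTGACCGGTAACGTTAGGCCATATCGGCTAAGCTTGACCTGAA"
)

# Toy probe table: which probeset each probe belongs to and which
# target the original probeset definition says it measures.
probes <- data.frame(
  probeset = c("ps1", "ps1", "ps1", "ps2", "ps2"),
  intended = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  sequence = c("GCGTACGTTAGC", "GATCCGATTACG", "CCCCCCCCCCCC",
               "CGGTAACGTTAG", "ATCGGCTAAGCT"),
  stringsAsFactors = FALSE
)

# Does each probe still match its intended target exactly?
probes$hits_intended <- mapply(
  function(probe_seq, gene) grepl(probe_seq, targets[[gene]], fixed = TRUE),
  probes$sequence, probes$intended,
  USE.NAMES = FALSE
)

# Fraction of probes per probeset that are still on target.
on_target <- vapply(split(probes$hits_intended, probes$probeset),
                    mean, numeric(1))
print(on_target)
```

A probeset with a low on-target fraction would be a candidate for the "no longer measures anything recognizable" category; tabulating that fraction across the whole array is one way to put a number on how much the original definitions have drifted.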

written 10.6 years ago by James W. MacDonald 68k
