Large # of significant genes with SAM


Vincent Detours ▴ 80

@vincent-detours-1021

Last seen 10.6 years ago

Dear all,

Your expert opinions are most welcome on the following.

Using siggenes' SAM at q<0.05 (26 samples on cDNA chips), I find that 37% of all genes are regulated with respect to patient-matched "normal" tissues in some tumors not particularly known for huge aneuploidy. Looking at another data set from the same cancer, collected by another group on independent samples on Affy, I got 34%. The number seems to hold.

How should this be interpreted? Are 30% of the genes really disturbed, even to a small extent, in these tumors? Could SAM be doing something wrong? If so, how can I verify it?

Any advice, shared experience, references, etc. are welcome.

Cheers

Vincent

------------------------------------------
Vincent Detours, Ph.D.
IRIBHM
Bldg C, room C.4.116
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4220
Fax: +32-2-555 4655

E-mail: vdetours at ulb.ac.be

URL: http://homepages.ulb.ac.be/~vdetours/

Cancer affy • 2.0k views

ADD COMMENT • link updated 20.0 years ago by Charles Berry ▴ 290 • written 20.0 years ago by Vincent Detours ▴ 80


Joern Toedling ▴ 730

@joern-toedling-1244

Last seen 10.6 years ago

Hi Vincent,

I imagine such large numbers of differentially expressed genes could arise for various reasons.

One issue could be large technical or experimental differences between your tumour and control samples, due to scanner settings, hybridisation protocols, etc. I would check whether such differences between the groups are still obvious after normalisation, using boxplots, scatter plots, etc. (many examples of such control procedures can be found on the Bioconductor website, especially on the pages containing material for courses and workshops). If so, you might think about other normalisation methods, or about combining the two groups' data in another way if they happen to be too different.

Another reason could be that there really are huge biological differences between the two groups. For instance, when comparing T- versus B-lymphocytes, one usually observes large percentages (> 20%) of differentially expressed genes, since in that case one is comparing very different cell types. However, I would not expect such striking differences between a tumour and the related physiological tissue. To check whether there really are large biological differences between the two groups, you could also check whether the lists of significantly up- or down-regulated genes point to a coherent biological picture, for example by using Bioconductor's "GOstats" package and looking for relationships between the most significant GO nodes.

Since SAM computes a regularised t-statistic, I think you should also check that the normal-distribution assumption holds at least approximately.

Double-checking the results might be a good idea and, since finding differentially expressed genes is a standard task, you have a large number of methods and packages available for that. Again, you should check the documents in the courses/compendiums/materials section of the Bioconductor website. You might also consider using other packages, such as "twilight", to obtain an estimate of the percentage of differentially expressed genes in your data.

Best regards,
Joern
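The idea of estimating the overall percentage of differentially expressed genes can be illustrated without any add-on package. The following is only a minimal sketch on simulated paired log-ratios, using a simple tail-based estimate of the null proportion (the share of p-values above a cutoff `lambda`); it is not the method implemented in "twilight", and the data, the effect size and all variable names are assumptions made for illustration.

```r
set.seed(1)
n_genes <- 5000
n_pairs <- 26

# Simulated paired log-ratios (tumour minus normal), one row per gene.
# Assumption for illustration: 30% of genes carry a shift of 0.8 SD.
is_de <- rep(c(TRUE, FALSE), times = c(1500, 3500))
shift <- ifelse(is_de, 0.8, 0)
d <- matrix(rnorm(n_genes * n_pairs, mean = shift), nrow = n_genes)

# Ordinary one-sample t-test per gene on the paired differences
t_stat <- rowMeans(d) / (apply(d, 1, sd) / sqrt(n_pairs))
p <- 2 * pt(-abs(t_stat), df = n_pairs - 1)

# Null p-values are uniform, so the density of p-values above lambda
# approximates the proportion of unchanged genes (pi0).
lambda <- 0.5
pi0_hat <- min(1, mean(p > lambda) / (1 - lambda))
prop_de_hat <- 1 - pi0_hat

print(round(c(pi0 = pi0_hat, prop_de = prop_de_hat), 3))
```

On real data the same estimate only needs the vector of raw p-values. A value far from what the gene list at a given q-value suggests would be a reason to look more closely at normalisation and at the test itself.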

ADD COMMENT • link 20.0 years ago Joern Toedling ▴ 730


On May 10, 2005, at 11:35 AM, Joern Toedling wrote:

> Another reason for large differences could be that there might really
> be huge biological differences between the two groups. For instance,
> when analyzing T- versus B-lymphocytes, one usually observes large
> percentages > 20% of differentially expressed genes, since in that
> case we were comparing very different cell types with each other.
> However, I would not expect such striking differences between a tumour
> and the related physiological tissue.

Vincent,

Actually, having a large proportion of differentially-expressed genes between tumor and normal is certainly possible. You got the same results with two different data sets, if I read your original post correctly, so go back to check quality of data, statistical biases, etc., but it seems quite possible that your results are correct. You will, of course, have to think about validation strategies, but....

Sean

ADD REPLY • link 20.0 years ago Sean Davis 21k


Just a few clarifications:

1- our cDNA data correlates at 0.72 (comparing gene averages over patients) with data from another completely independent group using Affy U133 chips. This excludes gross programming errors, and more.

2- running SAM @ q<30% on their data, I find that more than 30% of the genes are called significant

3- 7/7 genes with average fold-change >2.0 were confirmed by RT-PCR

4- RT-PCR gave mixed results for two genes ranking high with SAM but with fold-change 1.6. By mixed results I mean that the RT-PCR data are clearly correlated with the microarray data but give a lower fold-change

5- I searched for spatial biases with box-plots, and did find some, but not of a magnitude that could explain my results.

6- we are talking about paired sample SAM comparisons.

ADD REPLY • link 20.0 years ago Vincent Detours ▴ 80


Ignore the previous message, my finger slipped on the wrong key!! Sorry! The correct reply is coming soon.

Vincent

ADD REPLY • link 20.0 years ago Vincent Detours ▴ 80


Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.0 years ago

United States

It always pays to look at the actual numbers, and also to plot the data. Perhaps I am unusually careless, but most of the time when I get unexpected results, I have made a mistake, e.g. read in the flags instead of the expression values, or that type of thing.

--Naomi

Naomi S. Altman
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics
Penn State University
University Park, PA 16802-2111
814-865-3791 (voice)
814-863-7114 (fax)
814-865-1348 (Statistics)
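A quick look at the numbers can be as short as the sketch below. It uses a made-up matrix in which one column deliberately holds 0/1 flags instead of expression values, the kind of slip described above; the matrix `x`, the array names and the threshold of 10 distinct values are all assumptions chosen for illustration.

```r
set.seed(2)

# Hypothetical stand-in for a genes x arrays matrix of normalised log-ratios
x <- matrix(rnorm(200 * 6), nrow = 200,
            dimnames = list(NULL, paste0("array", 1:6)))

# Simulate the mistake: one column contains spot flags, not expression values
x[, 3] <- rbinom(200, 1, 0.1)

# Per-array ranges and quartiles; a flag column stands out immediately
print(summary(x))

# Real-valued measurements have many distinct values, flags only a few
n_distinct <- apply(x, 2, function(col) length(unique(col)))
suspicious <- names(n_distinct)[n_distinct < 10]
print(suspicious)

# Side-by-side boxplots show the same thing graphically
if (interactive()) {
  boxplot(x, las = 2, main = "Per-array distribution of values")
}
```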

ADD COMMENT • link 20.0 years ago Naomi Altman ★ 6.0k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 9 weeks ago

EMBL European Molecular Biology Laborat…

Hi Vincent,

after all the good answers, here are some more comments:

In one of our papers that compared 37 matched normals and tumors, we also found large numbers. Have a look at Fig. 3A of PubMed-ID 11691851, which shows that in this experiment the number of "significantly differentially expressed genes" grew linearly (!) with the number of samples, for up to 37. At the time, we were similarly surprised.

Basically, the reason is that the t-test (on which SAM is based) looks for differences in the mean between tumor and normal, however small, as long as they are significant. It is important to distinguish "effect size" from "significance".

There is an excellent paper on this subject:

Pepe MS, Longton G, Anderson GL, Schummer M. Selecting differentially expressed genes from microarray experiments. Biometrics. 2003 Mar;59(1):133-42. PMID: 12762450

Their pAUC statistic is implemented in the ROC package (but slow...)

Also have a look at the exercise "Testing for Differential Expression" (Wed morning) of our 2004 bioC short course: http://www.bioconductor.org/workshops/Bressanone

Best wishes
Wolfgang

-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber
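The distinction between effect size and significance can be made concrete with a small simulation. This is only a sketch under assumed numbers (2000 genes, 40% of them shifted by 0.5 SD, ordinary paired t-tests with Benjamini-Hochberg adjustment rather than SAM), not a reanalysis of the cited experiment; the point is that the effect sizes are held fixed while the number of significant genes climbs with the number of pairs.

```r
set.seed(3)
n_genes <- 2000

# Assumption for illustration: 40% of genes have a modest, fixed shift
effect <- c(rep(0.5, 800), rep(0, 1200))

# Number of genes called significant at FDR 0.05 for a given number of pairs
count_sig <- function(n_pairs) {
  d <- matrix(rnorm(n_genes * n_pairs, mean = effect), nrow = n_genes)
  t_stat <- rowMeans(d) / (apply(d, 1, sd) / sqrt(n_pairs))
  p <- 2 * pt(-abs(t_stat), df = n_pairs - 1)
  sum(p.adjust(p, method = "BH") < 0.05)
}

sizes <- c(5, 10, 20, 37)
n_sig <- sapply(sizes, count_sig)
names(n_sig) <- sizes
print(n_sig)
```

Filtering the significant genes on an effect-size measure (mean log-ratio, or a rank-based statistic such as the pAUC mentioned above) separates "detectably different" from "different enough to matter".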

ADD COMMENT • link 20.0 years ago Wolfgang Huber ★ 13k


Vincent Detours ▴ 80

@vincent-detours-1021

Last seen 10.6 years ago

Dear all,

A few clarifications regarding my previous posting:

1- our cDNA data correlates at 0.72 (comparing >6000 genes averaged over patients) with data from another completely independent group using Affy U133 chips and different lab technicians, pathologists and samples.

2- running SAM @ q<0.05, more than 30% of the genes are called significant in *both* data sets. Affy data were normalised with MAS5; I don't have access to the CEL files.

3- 7/7 genes with average fold-change >2.0, and high SAM rank, were confirmed by RT-PCR in our data set

4- RT-PCR gave mixed results for two genes ranking high with SAM but with fold-change 1.6. By mixed results I mean that the RT-PCR data are clearly correlated with the microarray data but give a lower fold-change

5- I searched for spatial biases with box-plots, and did find some, but much below the magnitude that could explain the 30% result.

6- we are talking about paired sample SAM comparisons. I call SAM with

    cl <- rep(1, N) #paired samples
    sam1 <- sam(exprs, cl, B=1000, rand=123, q.version=1)

To summarize, the data seem correct. The questions are whether SAM is appropriate on these data sets and others, whether q-values mean what they are supposed to, what the relevance is of calling a gene regulated on a purely statistical basis, etc.

>Since SAM computes a regularised t-statistic, I think, you should
>also check that the normal-distribution assumption does at least
>approximately hold.

I thought SAM uses a permutation-based null distribution of the moderated t-statistics precisely in order to avoid assumptions about this distribution. Am I missing something here?

Thank you all for your input!

Vincent
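For paired differences, a permutation-style null distribution can be obtained by randomly flipping the sign of each pair, which relies on symmetry about zero under the null rather than on normality. The following is a simplified sketch of that idea on simulated data, not the siggenes implementation: the fudge factor `s0` is simply set to zero here (SAM chooses it from the data), the null statistics are pooled over all genes, and every name and number is an assumption for illustration.

```r
set.seed(4)
n_genes <- 500
n_pairs <- 26
B <- 100   # number of random sign flips

# Simulated paired log-ratios; the first 50 genes are shifted by 1 SD
d <- matrix(rnorm(n_genes * n_pairs), nrow = n_genes)
d[1:50, ] <- d[1:50, ] + 1

# One-class moderated t-statistic; s0 = 0 reduces it to an ordinary t
s0 <- 0
mod_t <- function(m) {
  rowMeans(m) / (apply(m, 1, sd) / sqrt(ncol(m)) + s0)
}
obs <- mod_t(d)

# Null statistics: flip the sign of each pair's differences at random
null_stats <- replicate(B, {
  signs <- sample(c(-1, 1), n_pairs, replace = TRUE)
  mod_t(sweep(d, 2, signs, `*`))
})

# Two-sided p-value per gene against the pooled null statistics
null_abs <- abs(null_stats)
p_perm <- sapply(abs(obs), function(a) mean(null_abs >= a))

print(c(shifted = mean(p_perm[1:50] < 0.05),
        unshifted = mean(p_perm[-(1:50)] < 0.05)))
```

The sign-flip null still assumes that the differences of unchanged genes are symmetric around zero, so a systematic bias shared by many genes (for instance from normalisation) would not be absorbed by it.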

ADD COMMENT • link 20.0 years ago Vincent Detours ▴ 80


Charles Berry ▴ 290

@charles-berry-5754

Last seen 6.1 years ago

United States

Vincent,

As others have pointed out, this could be the actual state of nature or could be due to artifacts that you can straighten out by a closer look at the data.

I have found this approach to be helpful:

Bradley Efron. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. JASA, 99(465):96-104, Mar 2004

As is pointed out there, artifacts in the data may tend to inflate test statistics 'across the board', leading to very large numbers of supposedly significant (or truly discovered) genes. The suggested approach (recalibrating the null variance and shifting the location) compensates for this even when you cannot specifically identify the artifacts.

It is a fairly simple exercise in R to implement this. I can send you some hints, if you wish.

Chuck

Charles C. Berry
Dept of Family/Preventive Medicine
UC San Diego
La Jolla, San Diego 92093-0717
(858) 534-2098
E mailto:cberry@tajo.ucsd.edu
http://biostat.ucsd.edu/~cberry/
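The recalibration idea can be sketched in a few lines. This is a crude stand-in rather than the fitting procedure of the cited paper: the null location and scale are estimated from the bulk of the z-values with the median and an IQR-based scale, on the assumption that most genes are unchanged. The simulated z-values, the inflation factor and the names `delta0` and `sigma0` are assumptions for illustration.

```r
set.seed(5)
n_genes <- 5000

# Simulated z-values whose null is inflated and shifted by an artifact:
# N(0.2, 1.5^2) instead of the theoretical N(0, 1)
z <- rnorm(n_genes, mean = 0.2, sd = 1.5)
z[1:250] <- z[1:250] + 5   # 5% of genes with a real effect

# Robust estimates of the empirical null from the bulk of the data
delta0 <- median(z)
sigma0 <- IQR(z) / (qnorm(0.75) - qnorm(0.25))

# p-values under the theoretical and the recalibrated null
p_theory <- 2 * pnorm(-abs(z))
p_emp <- 2 * pnorm(-abs((z - delta0) / sigma0))

n_theory <- sum(p.adjust(p_theory, method = "BH") < 0.05)
n_emp <- sum(p.adjust(p_emp, method = "BH") < 0.05)

print(c(delta0 = delta0, sigma0 = sigma0))
print(c(theoretical_null = n_theory, empirical_null = n_emp))
```

The recalibrated null is only trustworthy when unchanged genes dominate. If a third of the genes really differ, the bulk itself is contaminated and the estimated scale will be too wide, which makes the method conservative.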

ADD COMMENT • link 20.0 years ago Charles Berry ▴ 290
