Dear all,
Your expert opinions are most welcome on the following.
Using siggenes' SAM @ q<0.05 (26 samples on cDNA chips), I find that
37% of all genes are regulated with respect to patient-matched
"normal" tissues in some tumors not particularly known for huge
aneuploidy. Looking at another data set from the same cancer, but
collected by another group on independent samples on Affy, I got 34%.
The number seems to hold.
How should this be interpreted? Are 30% of the genes really disturbed,
even to a small extent, in these tumors? Could SAM be doing something
wrong? If so, how can I verify it?
Any advice, shared experience, references, etc. are welcome.
Cheers
Vincent
------------------------------------------
Vincent Detours, Ph.D.
IRIBHM
Bldg C, room C.4.116
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium
Phone: +32-2-555 4220
Fax: +32-2-555 4655
E-mail: vdetours at ulb.ac.be
URL: http://homepages.ulb.ac.be/~vdetours/
Hi Vincent,
I imagine such large numbers of differentially expressed genes could
arise for various reasons.
One issue could be large technical or experimental differences between
your tumour and control samples, due to scanner settings,
hybridisation protocols, etc. I would check whether such large
differences between the groups are still obvious after normalisation,
using boxplots, scatter plots, etc. (many examples of such control
procedures can be found on the Bioconductor website, especially on the
pages containing material for courses and workshops). If so, you might
think about other normalisation methods, or about combining the two
groups' data in another way if they happen to be too different.
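The kind of post-normalisation check described above can be sketched in base R. This is only an illustration on simulated data: the matrix `x`, the group labels and the 0.5 log2 shift are all made up, and real data would come from your own normalised expression matrix.

```r
# Hypothetical data: log2 intensities for 1000 genes on 6 arrays.
set.seed(1)
n_genes <- 1000
x <- matrix(rnorm(n_genes * 6, mean = 8, sd = 1.5), nrow = n_genes)
colnames(x) <- c(paste0("normal", 1:3), paste0("tumour", 1:3))
group <- factor(rep(c("normal", "tumour"), each = 3))

# Mimic a technical artefact: all tumour arrays shifted by 0.5 (log2 scale).
x[, group == "tumour"] <- x[, group == "tumour"] + 0.5

# Per-array medians; a systematic offset between groups is a warning sign.
array_medians <- apply(x, 2, median)
group_offset <- mean(array_medians[group == "tumour"]) -
  mean(array_medians[group == "normal"])

# One box per array, coloured by group.
boxplot(x, col = c("grey80", "tomato")[as.integer(group)], las = 2,
        ylab = "log2 intensity")

# MA-style scatter plot of the group means: M (difference) against A (average).
m <- rowMeans(x[, group == "tumour"]) - rowMeans(x[, group == "normal"])
a <- rowMeans(x)
plot(a, m, pch = ".", xlab = "A (mean log2 intensity)",
     ylab = "M (tumour - normal)")
abline(h = 0, lty = 2)
```

If the boxes of one group sit visibly above the other, or the cloud in the second plot is not centred on the dashed line, a global offset rather than gene-specific regulation may be driving part of the result.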
Another reason for large differences could be that there really are
huge biological differences between the two groups. For instance, when
analysing T- versus B-lymphocytes, one usually observes large
percentages (> 20%) of differentially expressed genes, since in that
case one is comparing very different cell types with each other.
However, I would not expect such striking differences between a tumour
and the related physiological tissue. To check whether there really
are large biological differences between the two groups, you could
also check whether the lists of significantly up- or down-regulated
genes hint at a precise biological picture, for example by using
Bioconductor's "GOstats" package and looking for relationships between
the most significant GO nodes.
Since SAM computes a regularised t-statistic, I think you should also
check that the normal-distribution assumption holds at least
approximately. Double-checking the results might be a good idea and,
since finding differentially expressed genes is a standard task, a
large number of methods/packages are available for that. Again, you
should check the documents in the courses/compendiums/materials
section of the Bioconductor website.
You might also consider using other packages, such as "twilight", to
obtain an estimate of the percentage of differentially expressed genes
in your data.
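To give an idea of what such an estimate looks like, here is a minimal base-R sketch. It does not use "twilight" and is not its algorithm; it swaps in the simple threshold estimator of the null proportion (the fraction of p-values above a cut-off `lambda`, rescaled), applied to simulated paired log-ratios. The data, the 30% proportion and `lambda = 0.5` are assumptions chosen for illustration.

```r
# Hypothetical paired log-ratios: 5000 genes, 13 tumour/normal pairs,
# with the first 30% of genes truly shifted.
set.seed(2)
n_genes <- 5000
n_pairs <- 13
true_de <- seq_len(n_genes) <= 1500
d <- matrix(rnorm(n_genes * n_pairs), nrow = n_genes)
d[true_de, ] <- d[true_de, ] + 1

# Ordinary one-sample t-statistics and two-sided p-values, gene by gene.
t_stat <- rowMeans(d) / (apply(d, 1, sd) / sqrt(n_pairs))
p <- 2 * pt(-abs(t_stat), df = n_pairs - 1)

# Null p-values are uniform, so roughly pi0 * (1 - lambda) of all p-values
# should exceed lambda; solve for pi0 (the proportion of unchanged genes).
lambda <- 0.5
pi0_hat <- min(1, mean(p > lambda) / (1 - lambda))
prop_de_hat <- 1 - pi0_hat
print(prop_de_hat)
```

An estimate of this kind is independent of any significance cut-off, so it is a useful cross-check on the percentage of genes called at a given q-value.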
Best regards,
Joern
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
On May 10, 2005, at 11:35 AM, Joern Toedling wrote:
> However, I would not expect such striking differences between a
> tumour and the related physiological tissue.
Vincent,
Actually, having a large proportion of differentially-expressed genes
between tumor and normal is certainly possible. You got the same
results with two different data sets if I read your original post
correctly, so go back to check quality of data, statistical biases,
etc., but it seems quite possible that your results are correct. You
will, of course, have to think about validation strategies, but....
Sean
Just a few clarifications
1- our cDNA data correlate at 0.72 (comparing gene averages over
patients) with data from another completely independent group using
Affy U133 chips. This excludes gross programming errors, and more.
2- running SAM @ q<30% on their data I find that more than 30% of the
genes are called significant
3- 7/7 genes with average fold-change >2.0 were confirmed by RT-PCR
4- RT-PCR gave mixed results for two genes ranking high with SAM but
with fold-change 1.6. By mixed results I mean that the RT-PCR data are
clearly correlated with the microarray data but give a lower
fold-change
5- I searched for spatial biases with box-plots, and did find some,
but not of the magnitude that could explain my results.
6- we are talking about paired-sample SAM comparisons.
On Tue, 10 May 2005, Sean Davis wrote:
> Date: Tue, 10 May 2005 11:58:37 -0400
> From: Sean Davis <sdavis2@mail.nih.gov>
> To: Joern Toedling <toedling@ebi.ac.uk>
> Cc: Vincent Detours <vdetours@ulb.ac.be>,
> Bioconductor mailing list <bioconductor@stat.math.ethz.ch>
> Subject: Re: [BioC] Large # of significant genes with SAM
Ignore the previous message, my finger slipped on the wrong key!!
Sorry! The correct reply is coming soon.
Vincent
It always pays to look at the actual numbers, and also to plot the
data. Perhaps I am unusually careless, but most of the time when I get
unexpected results, I have made a mistake - e.g. read in the flags
instead of the expression values, or that type of thing.
--Naomi
At 06:56 AM 5/9/2005, Vincent Detours wrote:
>How to interpret this? Are really 30% of the genes disturbed, even to
>a small extent, in these tumors? Could SAM do something wrong? If yes,
>how to verify it?
Naomi S. Altman                     814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics                 814-863-7114 (fax)
Penn State University               814-865-1348 (Statistics)
University Park, PA 16802-2111
Hi Vincent,
after all the good answers, here are some more comments:
In one of our papers, which compared 37 matched normals and tumors, we
also found large numbers. Have a look at Fig. 3A of PubMed-ID
11691851, which shows that in this experiment the number of
"significantly differentially expressed genes" grew linearly (!) with
the number of samples, for up to 37. At the time, we were similarly
surprised.
Basically, the reason is that the t-test (on which SAM is based) looks
for differences in the mean between tumor and normal - however small,
as long as they are significant.
It is important to distinguish "effect size" from "significance".
There is an excellent paper on this subject: Pepe MS, Longton G,
Anderson GL, Schummer M. Selecting differentially expressed genes from
microarray experiments. Biometrics. 2003 Mar;59(1):133-42.
PMID: 12762450
Their pAUC statistic is implemented in the ROC package (but slow...)
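The difference between effect size and significance can be illustrated with a small base-R simulation. This is a sketch under made-up assumptions, not a reanalysis of the paper's data: 40% of 2000 hypothetical genes carry the same small shift (0.3 standard deviations) in their paired log-ratios, and only the number of pairs changes. The helper name `count_significant` is hypothetical.

```r
set.seed(3)
n_genes <- 2000
# True mean log-ratio per gene: a small, fixed shift for 800 genes, 0 elsewhere.
effect <- c(rep(0.3, 800), rep(0, 1200))

# Number of genes called at a 5% Benjamini-Hochberg FDR for a given sample size.
count_significant <- function(n_pairs) {
  # rnorm recycles 'effect' down each column, so row i has mean effect[i].
  d <- matrix(rnorm(n_genes * n_pairs, mean = effect), nrow = n_genes)
  t_stat <- rowMeans(d) / (apply(d, 1, sd) / sqrt(n_pairs))
  p <- 2 * pt(-abs(t_stat), df = n_pairs - 1)
  sum(p.adjust(p, method = "BH") < 0.05)
}

sizes <- c(5, 10, 20, 40, 80)
n_sig <- sapply(sizes, count_significant)
print(data.frame(n_pairs = sizes, n_significant = n_sig))
```

The underlying effect is identical in every run; only the power to detect it changes, so the count of "significant" genes rises with the number of samples even though no gene becomes more strongly regulated.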
Also have a look at the exercise "Testing for Differential Expression"
(Wed morning) of our 2004 bioC short course:
http://www.bioconductor.org/workshops/Bressanone
Best wishes
Wolfgang
--
Best regards
Wolfgang
-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber
Dear all,
A few clarifications regarding my previous posting:
1- our cDNA data correlate at 0.72 (comparing >6000 genes averaged
over patients) with data from another, completely independent group
using Affy U133 chips and different lab technicians, pathologists and
samples.
2- running SAM @ q<0.05, more than 30% of the genes are called
significant in *both* data sets. Affy data were normalised with MAS5;
I don't have access to the CEL files.
3- 7/7 genes with average fold-change >2.0, and high SAM rank, were
confirmed by RT-PCR in our data set.
4- RT-PCR gave mixed results for two genes ranking high with SAM but
with fold-change 1.6. By mixed results I mean that the RT-PCR data are
clearly correlated with the microarray data but give a lower
fold-change.
5- I searched for spatial biases with box-plots, and did find some,
but much below the magnitude that could explain the 30% result.
6- we are talking about paired-sample SAM comparisons. I call SAM with
    library(siggenes)
    cl <- rep(1, N)  # paired samples
    sam1 <- sam(exprs, cl, B=1000, rand=123, q.version=1)
To summarize, the data seem correct. The questions are whether SAM is
appropriate for these data sets and others, whether q-values mean what
they are supposed to, what the relevance is of calling a gene
regulated on a purely statistical basis, etc.
>Since SAM computes a regularised t-statistic, I think, you should
>also check that the normal-distribution assumption does at least
>approximately hold.
I thought SAM used a computed permutation-based null distribution of
the moderated t-statistics in order to avoid assumptions about this
distribution? Am I missing something here?
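For readers unfamiliar with the idea, a permutation null for paired data can be sketched in base R as follows. This is not the siggenes implementation, only an illustration of the general principle on simulated log-ratios: under the null hypothesis the sign of each tumour/normal difference is arbitrary, so the signs are flipped at random per pair. The fudge factor `s0` is fixed here to the median standard error as a simplifying assumption, the helper `mod_t` is hypothetical, and the permuted statistics are pooled over all genes.

```r
set.seed(4)
n_genes <- 1000
n_pairs <- 10
# Hypothetical paired log-ratios; the first 100 genes are truly shifted.
d <- matrix(rnorm(n_genes * n_pairs), nrow = n_genes)
d[1:100, ] <- d[1:100, ] + 1.5

# Moderated one-sample statistic: mean / (standard error + fudge factor).
mod_t <- function(m, s0) {
  se <- apply(m, 1, sd) / sqrt(ncol(m))
  rowMeans(m) / (se + s0)
}
s0 <- median(apply(d, 1, sd) / sqrt(n_pairs))  # assumption, not SAM's choice
obs <- mod_t(d, s0)

# Null distribution: flip the sign of whole pairs (columns) at random.
B <- 100
null_stats <- replicate(B, {
  signs <- sample(c(-1, 1), n_pairs, replace = TRUE)
  mod_t(sweep(d, 2, signs, FUN = "*"), s0)
})

# Two-sided permutation p-value per gene against the pooled null statistics.
null_abs <- abs(null_stats)
p_perm <- sapply(obs, function(o) mean(null_abs >= abs(o)))
```

No normal or t distribution is used anywhere to obtain `p_perm`; the only assumption is that the differences are symmetric around zero under the null hypothesis.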
Thank you all for your input!
Vincent
Vincent,
As others have pointed out, this could be the actual state of nature
or could be artifacts that you can straighten out by a closer look at
the data.
I have found this approach to be helpful:
	Bradley Efron. Large-Scale Simultaneous Hypothesis Testing: The
	Choice of a Null Hypothesis. JASA, 99(465):96-104, Mar 2004
As is pointed out there, artifacts in the data may tend to inflate
test statistics 'across the board', leading to very large numbers of
supposedly significant (or truly discovered) genes.
The suggested approach (recalibrating the null variance and shifting
the location) compensates for this even when you cannot specifically
identify the artifacts.
It is a fairly simple exercise in R to implement this. I can send you
some hints, if you wish.
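A rough idea of the recalibration can be given in a few lines of base R. This is a crude stand-in, not the fitting procedure of the paper: the null location and scale are simply taken from the median and the interquartile range of the z-values, on the assumption that the bulk of the genes are unchanged. The simulated z-values (nulls inflated to sd 1.5 and shifted by 0.2, plus 5% real effects) are made up for illustration.

```r
set.seed(5)
n_genes <- 5000
# Hypothetical z-values: artefacts widen and shift the null genes.
z <- rnorm(n_genes, mean = 0.2, sd = 1.5)
z[1:250] <- z[1:250] + 6          # 5% of genes with a real effect

# Theoretical null N(0, 1).
p_theo <- 2 * pnorm(-abs(z))

# Empirical null: location and scale estimated from the bulk of the z-values.
delta0 <- median(z)
sigma0 <- IQR(z) / (qnorm(0.75) - qnorm(0.25))
p_emp <- 2 * pnorm(-abs((z - delta0) / sigma0))

# Genes called at a 5% Benjamini-Hochberg FDR under each null.
n_theo <- sum(p.adjust(p_theo, method = "BH") < 0.05)
n_emp <- sum(p.adjust(p_emp, method = "BH") < 0.05)
print(c(theoretical = n_theo, empirical = n_emp))
```

With real data the z-values could be obtained from the gene-wise p-values or t-statistics; if the estimated scale is well above 1, many of the calls made under the theoretical null deserve a second look.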
Chuck
Charles C. Berry                        (858) 534-2098
                                         Dept of Family/Preventive Medicine
E mailto:cberry@tajo.ucsd.edu           UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717