Dear All:
Here is a general question, and I apologize if it is a little off
topic (but I believe Bioconductor must have some solution for it).
Is there a guideline or a good tool for getting a "gene" expression
profile from a "probe" expression profile? I am concerned that such a
tool or guideline should address two issues: "multiple probes to one
gene" and "one probe to multiple genes".
I believe it is a non-trivial process and automation of this process
might not be easy:
For example, for the former issue, how do you get an "average"
expression from multiple probes for one gene? For the latter, which
gene do you believe is the "right" one for the probe?
Any recommendation is appreciated!
--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
Weiwei Shi wrote:
> Dear All:
>
> Here is a general question, and I apologize if it is a little off
> topic (but I believe Bioconductor must have some solution for it).
>
> Is there a guideline or a good tool for getting a "gene" expression
> profile from a "probe" expression profile? I am concerned that such a
> tool or guideline should address two issues: "multiple probes to one
> gene" and "one probe to multiple genes".
>
Don't deal with the first case. Do all of your analyses at the probe
level. There probably is not a safe, totally general way to aggregate
probes in a gene expression context. Instead, do your differential
expression testing and then map probes to genes for downstream
processing (looking up in PubMed, etc.).
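As a rough illustration of the "test at the probe level first, annotate
afterwards" order described above, here is a minimal Python sketch. The
probe IDs, gene symbols, and expression values are all made up, and a
plain Welch t statistic stands in for whatever differential expression
method you actually use.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical log2 expression values: probe ID -> (control, treated) samples.
expression = {
    "probe_A1": ([7.1, 7.3, 6.9], [9.0, 9.2, 8.8]),
    "probe_A2": ([5.0, 5.2, 5.1], [5.1, 5.0, 5.2]),
    "probe_B1": ([8.0, 8.1, 7.9], [6.0, 6.2, 5.9]),
}

# Hypothetical annotation: probe ID -> gene symbols (one, several, or none).
probe_to_genes = {
    "probe_A1": ["GENE_A"],
    "probe_A2": ["GENE_A"],
    "probe_B1": ["GENE_B", "GENE_C"],
}


def welch_t(control, treated):
    """Welch's t statistic (treated minus control) for two independent samples."""
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Step 1: test every probe on its own; nothing is aggregated to genes here.
probe_stats = {p: welch_t(ctrl, trt) for p, (ctrl, trt) in expression.items()}

# Step 2: rank probes by evidence, then attach the annotation for lookup.
ranked = sorted(probe_stats, key=lambda p: abs(probe_stats[p]), reverse=True)
report = [(p, round(probe_stats[p], 2), probe_to_genes.get(p, [])) for p in ranked]

for row in report:
    print(row)
```

The mapping is only consulted after the statistics are computed, so both
probes for GENE_A stay visible in the report, as does the ambiguous
mapping of probe_B1.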
The second case can't be dealt with appropriately without knowing why
one probe maps to multiple genes. Unless you do your own annotation
(using BLAST, for example), it will be difficult to make a call in the
general case. However, in some cases, the answer is "obvious". In the
case you emailed about earlier today (one probe hitting 3 genes), it
was fairly obvious what the answer was, since one of the genes was a
"RefSeq" gene while the other two were simply computationally
predicted genes. The most important point is to know what sources of
annotation are being used, what their limitations are, and how they
relate to other sources of annotation. This knowledge is often not
easy to come by, but it is invaluable.
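The "obvious" case above (one curated gene among computationally
predicted ones) can be written down as a simple rule, with everything
else left for manual review. This is only a sketch: the source labels
"refseq" and "predicted", the probe IDs, and the gene names are
hypothetical, and real annotation tables will need their own notion of
which source to trust.

```python
# Hypothetical candidate genes per probe, each tagged with its annotation source.
candidates = {
    "probe_X": [("GENE_1", "refseq"), ("LOC_PRED_1", "predicted"), ("LOC_PRED_2", "predicted")],
    "probe_Y": [("GENE_2", "refseq"), ("GENE_3", "refseq")],
    "probe_Z": [("LOC_PRED_3", "predicted")],
}


def resolve(hits, trusted="refseq"):
    """Return (gene, status); only decide when the evidence is unambiguous."""
    if len(hits) == 1:
        return hits[0][0], "unique"
    trusted_hits = [gene for gene, source in hits if source == trusted]
    if len(trusted_hits) == 1:
        # Exactly one hit from the trusted source: the "obvious" case.
        return trusted_hits[0], "resolved_by_source"
    # Zero or several trusted hits: no safe automatic call.
    return None, "needs_manual_review"


resolved = {probe: resolve(hits) for probe, hits in candidates.items()}

for probe, (gene, status) in resolved.items():
    print(probe, gene, status)
```

Keeping the status next to each call records why a probe was assigned to
a gene, so the calls can be revisited when the annotation changes.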
> I believe it is a non-trivial process and automation of this process
> might not be easy:
>
Full automation really isn't possible, since no general solution
covers every case. My rule of thumb is to maintain as much information
as possible throughout the process of data analysis and then do some
curation based on biological knowledge when the gene lists are in.
Unfortunately, there really isn't a fantastic substitute for this last
step.
Just my two cents' worth.
Sean
To add to Sean's comments: in general, probe sets should be considered
independent entities (not necessarily multiple/replicate measurements
of the same entity, i.e. the underlying gene). So the question of which
probeset-to-gene map should be used is rather ill-posed.
The answer will generally depend on the objective of the study. For
example, if the objective is to develop a predictive (classification)
model, probe sets are the independent predictors and the question of
gene-average expression is not really relevant. As another example, if
the objective is to compare the reproducibility of gene expression
between two or more platforms, then it is imperative to match data at
the probe set level to allow for a meaningful evaluation. Different
probe sets map to different parts of the gene and thus tend to behave
independently, in many cases driven by allelic effects in the study
population.
Finally, if the objective is to understand the biology behind
differentially expressed genes, then it is important to first
double-check the validity of the "official" probe-to-gene mappings.
Then spend some time trying to understand the implications of the
relative position of the probe set on the gene sequence.
The following two articles are informative in this respect:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16284200&query_hl=15&itool=pubmed_docsum
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=17224057&query_hl=13&itool=pubmed_docsum
So I would argue that this is more of a biology problem than a
bioinformatics problem and thus not amenable to an automated solution.
-Christos
Christos Hatzis, Ph.D.
Nuvera Biosciences, Inc.
400 West Cummings Park
Suite 5350
Woburn, MA 01801
Tel: 781-938-3830
www.nuverabio.com
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Sean Davis
> Sent: Monday, April 02, 2007 2:24 PM
> To: Weiwei Shi
> Cc: bioconductor
> Subject: Re: [BioC] probe expression profile to gene expression profile
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
Hi, there:
I think my first email was asking more about guidelines, or how people
generally deal with the probe2gene issue, than about full automation
(I did mention "not easy"). But the discussion has somehow turned to
the stage at which we should do the probe2gene mapping, or whether we
should do it at all for some study objectives.
I agree in theory that analysis at the probe level preserves
information and avoids aggregating it too early at the gene level.
However, at some point you still need to perform further analysis at
the gene or pathway level to find the underlying biological
significance, if that is the objective of your study.
The question then is whether an analysis such as differential testing
at the probe level is safe (because some probes will have been removed
at this step, for example). It is like a "maximum pick" instead of an
"average pick".
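To make the "maximum pick" versus "average pick" distinction concrete,
here is a small Python sketch that collapses per-probe log fold changes
to one value per gene in both ways. The probe IDs, gene names, and
numbers are invented, and the sketch assumes each probe maps to exactly
one gene; it is not a recommendation of either rule.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-probe log2 fold changes and a one-to-one probe -> gene map.
log_fc = {"probe_A1": 1.9, "probe_A2": 0.1, "probe_B1": -2.0}
probe_to_gene = {"probe_A1": "GENE_A", "probe_A2": "GENE_A", "probe_B1": "GENE_B"}


def collapse(values, mapping, pick):
    """Group probe values by gene, then reduce each group with `pick`."""
    by_gene = defaultdict(list)
    for probe, value in values.items():
        by_gene[mapping[probe]].append(value)
    return {gene: pick(vals) for gene, vals in by_gene.items()}


# "Average pick": every probe for the gene contributes equally.
average_pick = collapse(log_fc, probe_to_gene, mean)

# "Maximum pick": keep the probe with the largest absolute change.
maximum_pick = collapse(log_fc, probe_to_gene, lambda vals: max(vals, key=abs))

print(average_pick)
print(maximum_pick)
```

With these made-up numbers the two rules disagree noticeably for GENE_A,
whose two probes behave differently, and agree for GENE_B, which has a
single probe.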
Moreover, probes mapped to one gene are supposed to be highly
correlated, and highly correlated predictors are not desirable in a
supervised learning process, IMO.
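Whether probes mapped to the same gene really are highly correlated can
be checked directly on the data before deciding how to treat them as
predictors. The sketch below computes a Pearson correlation by hand in
Python; the three probes and their values are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical expression of three probes annotated to one gene, five samples.
probe_1 = [7.0, 7.5, 8.1, 8.4, 9.0]
probe_2 = [6.8, 7.4, 8.0, 8.5, 8.9]  # tracks probe_1 closely
probe_3 = [5.1, 5.0, 5.2, 5.0, 5.1]  # nearly flat, e.g. a different region

r_12 = pearson(probe_1, probe_2)
r_13 = pearson(probe_1, probe_3)

print(round(r_12, 3), round(r_13, 3))
```

In this toy data one pair is strongly correlated and the other is not,
which is the situation that has to be inspected per gene rather than
assumed.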
Again, in theory I agree that checking manually rather than
automatically is the way to make sure each mapping is biologically
valid, and that the problem is more a biological one than a
bioinformatics one. However, again :), in practice that might not be
feasible for high-throughput technology, which IMHO tolerates a fairly
high level of noise or error but gives people more statistical
significance.
Just my 2 cents,
Weiwei