Dear Bioconductors,
I have some proteomics data for several tissues:
Heart x 3 replicates
Lung x 3 replicates
Each data set has a gene symbol and the number of peptides for that gene (a
rough measure of protein expression).
I want to make a data structure like:
       heart1  heart2  heart3  lung1  lung2  lung3
Gene1       2       4       3      7      9     20
Gene2      50      45      33      0      1      0
Gene3  ...... etc
Gene4
Each number in the data frame corresponds to the number of peptides for that
gene.
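A minimal sketch of how such a structure could be assembled in R, assuming each replicate arrives as its own table with a gene symbol column and a peptide count column. The names `samples`, `gene` and `peptides` are hypothetical, and recording an undetected gene as 0 (rather than NA) is an assumption. Only the Gene1 and Gene2 values come from the table above.

```r
# Hypothetical per-replicate tables: gene symbol + peptide count.
# Column names ("gene", "peptides") are assumptions for illustration.
samples <- list(
  heart1 = data.frame(gene = c("Gene1", "Gene2"), peptides = c(2, 50)),
  heart2 = data.frame(gene = c("Gene1", "Gene2"), peptides = c(4, 45)),
  heart3 = data.frame(gene = c("Gene1", "Gene2"), peptides = c(3, 33)),
  lung1  = data.frame(gene = "Gene1",             peptides = 7),
  lung2  = data.frame(gene = c("Gene1", "Gene2"), peptides = c(9, 1)),
  lung3  = data.frame(gene = "Gene1",             peptides = 20)
)

# Union of all gene symbols seen in any replicate.
genes <- sort(unique(unlist(lapply(samples, function(s) as.character(s$gene)))))

# One column per replicate; genes absent from a replicate become 0
# (assumption: "not detected" is stored as zero peptides).
counts <- vapply(samples, function(s) {
  x <- s$peptides[match(genes, s$gene)]
  x[is.na(x)] <- 0
  x
}, numeric(length(genes)))
rownames(counts) <- genes

print(counts)
```

A numeric matrix (genes in rows, samples in columns) is used here rather than a data frame because most of the base R functions for correlation, PCA and plotting accept it directly.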
My questions are:
Is a Principal Component Analysis useful for this data set?
What would a PCA tell me?
What function would I use to make a nice graphical representation of the
data?
Or should I use a concordance function, something like this?
con <- function(y1, y2) {
  d  <- mean(y1) - mean(y2)     # difference between the two sample means
  v1 <- var(y1)
  v2 <- var(y2)
  cv <- cov(y1, y2)             # named cv to avoid reusing the name of cov()
  (2 * cv) / (v1 + v2 + d^2)    # concordance coefficient
}
This will tell me whether two samples are concordant, but I don't know how to
involve all samples. Basically, I want to summarise the data.
Any suggestions will be appreciated.
J.
Hi,
On Wed, Mar 3, 2010 at 11:41 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> Dear Bioconductors,
> I have some proteomics data for several tissues:
>
> Heart x 3 replicates
> Lung x 3 replicates
>
> Each data set has a gene symbol and the number of peptides for that gene (a
> rough measure of protein expression).
>
> I want to make a data structure like:
>
>        heart1  heart2  heart3  lung1  lung2  lung3
> Gene1       2       4       3      7      9     20
> Gene2      50      45      33      0      1      0
> Gene3  ...... etc
> Gene4
>
> Each number in the data frame corresponds to number of peptides for that
> gene.
I've never worked with proteomics data, but just a quick point since
you're saying you want to "show something" based on the number of
peptides found per protein -- I guess you'll have to somehow normalize
for the (expected) length (# of peptides) of the protein itself?
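A rough sketch of what such a normalisation might look like in R: each gene's counts are divided by the number of peptides that protein could be expected to yield. The `expected_peptides` values below are invented purely for illustration (the thread gives none), and how to derive them for real proteins is left open.

```r
# Peptide counts for the two genes given in the original post.
counts <- matrix(
  c( 2,  4,  3, 7, 9, 20,
    50, 45, 33, 0, 1,  0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2"),
                  c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")))

# HYPOTHETICAL expected number of observable peptides per protein.
expected_peptides <- c(Gene1 = 10, Gene2 = 100)

# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so every row (gene) is divided by its own expected peptide number.
norm_counts <- counts / expected_peptides[rownames(counts)]

print(round(norm_counts, 3))
```

Indexing `expected_peptides` by `rownames(counts)` keeps the divisor in the same gene order as the matrix, which matters once the two come from different files.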
> Is a Principal Component Analysis useful for this data set?
What are you trying to show?
> What would a PCA tell me?
There are lots and lots of tutorials and things about PCA on the
intertubes. Here's a quote from the Wikipedia article that, I think,
gives a decent "intuition" on what it tries to do:
"""PCA is the simplest of the true eigenvector-based multivariate
analyses. Often, its operation can be thought of as revealing the
internal structure of the data in a way which best explains the
variance in the data. If a multivariate dataset is visualised as a set
of coordinates in a high-dimensional data space (1 axis per variable),
PCA supplies the user with a lower-dimensional picture, a "shadow" of
this object when viewed from its (in some sense) most informative
viewpoint."""
I guess the last sentence, in particular, is useful.
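To make the "shadow" idea concrete, here is a minimal sketch using base R's `prcomp()` on a small toy matrix. Only the Gene1 and Gene2 rows come from the original post; the ToyA, ToyB and ToyC rows are invented so that there are enough genes to plot, and the `log2(x + 1)` transform is an assumption rather than a recommendation from the thread.

```r
counts <- matrix(
  c( 2,  4,  3,  7,  9, 20,   # Gene1 (from the post)
    50, 45, 33,  0,  1,  0,   # Gene2 (from the post)
    10, 12, 11,  9, 10, 12,   # ToyA  (invented)
     0,  1,  0, 30, 25, 28,   # ToyB  (invented)
     5,  6,  4,  5,  7,  6),  # ToyC  (invented)
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "ToyA", "ToyB", "ToyC"),
                  c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")))

log_counts <- log2(counts + 1)   # +1 avoids log2(0)

# prcomp() treats rows as observations, so transpose to samples x genes.
pca <- prcomp(t(log_counts), center = TRUE, scale. = FALSE)

# Fraction of the total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Plot the samples on the first two components, coloured by tissue.
tissue <- factor(sub("[0-9]+$", "", colnames(counts)))
plot(pca$x[, "PC1"], pca$x[, "PC2"], col = as.integer(tissue), pch = 19,
     xlab = sprintf("PC1 (%.0f%% of variance)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%% of variance)", 100 * var_explained[2]))
text(pca$x[, "PC1"], pca$x[, "PC2"], labels = rownames(pca$x), pos = 3)
```

If replicates of the same tissue land close together and the two tissues separate along PC1, the largest source of variation in the data is the tissue difference; if they do not, something else (a batch or loading effect, say) may be dominating.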
> What function would I use to make a nice graphical representation of
> the data?
What are you trying to show?
> Or should I use a concordance function, something like?
>
> con<-function(y1,y2){
>   d<-(mean(y1) - mean(y2))
>   v1<-var(y1)
>   v2<-var(y2)
>   cov<-cov(y1,y2)
>   con<-(2*cov)/(v1+v2+d^2)
>   return(con)};
>
> This will tell me if two samples have concordance but I don't know how to
> involve all samples. Basically, I want to summarise the data.
Summarize it like how?
Summing up the number of peptides found in each sample is one type of
summary, but might not inform you of what you'd like to be informed
about (you haven't been clear on what that is). It could be
informative in other ways, though, that I guess aren't immediately
obvious: e.g. it can tell you whether you have roughly the same amount
of "input" in each of your replicates.
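As a small illustration of that per-sample summary, the sketch below sums the peptide counts in each column with `colSums()` and, as one possible follow-up, rescales every sample to the same total. It uses only the Gene1 and Gene2 values from the original post; whether forcing equal totals is appropriate for samples from different tissues is an open question, so the rescaling step is an assumption, not advice from the thread.

```r
counts <- matrix(
  c( 2,  4,  3, 7, 9, 20,
    50, 45, 33, 0, 1,  0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2"),
                  c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")))

# Total peptides per sample: a crude check on the amount of "input".
totals <- colSums(counts)
print(totals)
barplot(totals, las = 2, ylab = "Total peptides per sample")

# Optional: divide each column by its total relative to the mean total,
# so that every sample ends up with the same overall count.
size_factor <- totals / mean(totals)
scaled <- sweep(counts, 2, size_factor, "/")
print(round(scaled, 2))
```

Large differences between the bars for replicates of the same tissue would suggest unequal loading rather than biology.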
So ... what are you trying to show?
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
On Wed, Mar 3, 2010 at 11:41 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> Dear Bioconductors,
> I have some proteomics data for several tissues:
>
> Heart x 3 replicates
> Lung x 3 replicates
>
> Each data set has a gene symbol and the number of peptides for that gene (a
> rough measure of protein expression).
>
> I want to make a data structure like:
>
>        heart1  heart2  heart3  lung1  lung2  lung3
> Gene1       2       4       3      7      9     20
> Gene2      50      45      33      0      1      0
> Gene3  ...... etc
> Gene4
>
> Each number in the data frame corresponds to number of peptides for that
> gene.
>
> My questions are:
>
> Is a Principle Component Analysis useful for this data set?
> What would a PCA tell me?
> What function would I use make a nice graphical representation of
> the data?
>
> Or should I used a concordance function, something like?
>
> con<-function(y1,y2){
>   d<-(mean(y1) - mean(y2))
>   v1<-var(y1)
>   v2<-var(y2)
>   cov<-cov(y1,y2)
>   con<-(2*cov)/(v1+v2+d^2)
>   return(con)};
>
> This will tell me if two samples have concordance but I don't know how to
> involve all samples. Basically, I want to summarise the data.
Start simple. There are likely biases (that depend on the
experimental design and assays used) in the data. Try to determine
what those are using simple plots of the data. What are the
distributions of the data when cut various ways (per gene, per
sample)? What do scatter plots of one sample versus another look
like? Do the data need transformation (log, for example)? Do the
data need normalization (likely)?
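The exploratory plots suggested above could be produced with base R graphics along these lines. This is only a sketch: the toy matrix reuses Gene1 and Gene2 from the original post and adds three invented rows (ToyA, ToyB, ToyC), and `log2(x + 1)` is just one candidate transformation.

```r
counts <- matrix(
  c( 2,  4,  3,  7,  9, 20,   # Gene1 (from the post)
    50, 45, 33,  0,  1,  0,   # Gene2 (from the post)
    10, 12, 11,  9, 10, 12,   # ToyA  (invented)
     0,  1,  0, 30, 25, 28,   # ToyB  (invented)
     5,  6,  4,  5,  7,  6),  # ToyC  (invented)
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "ToyA", "ToyB", "ToyC"),
                  c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")))

log_counts <- log2(counts + 1)

# Distribution per sample: do the replicates have similar spread and centre?
print(summary(counts))
boxplot(log_counts, las = 2, ylab = "log2(peptides + 1)")

# Distribution per gene: how are average expression levels spread out?
hist(rowMeans(log_counts), main = "Per-gene mean", xlab = "mean log2(peptides + 1)")

# Scatter plots of every sample against every other sample.
pairs(log_counts, pch = 19)
```

In the `pairs()` plot, replicates of the same tissue should fall near the diagonal of their panel; systematic curvature or offsets are a hint that normalisation or a different transformation is needed.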
In short, some data exploration might be necessary before you can move
on to ask more biologically relevant questions. You may already have
the information that you need to determine the best way forward, but
that isn't clear from your post.
Hope that helps,
Sean
Dear Steve and Sean.
First, thanks for replying to my query.
Normalisation:
Yes, I will normalise to the number of peptides per gene model (due to
isoforms). If I wanted to be ultra careful, I would see if any peptides were
specific to any one isoform, but that is complicated and probably inaccurate.
I will also normalise on the amount of protein loaded onto the mass spec, as
there are slight differences.
Thanks for the PCA information.
What am I trying to show?
Well, I would like to show graphically:
1) How well the replicates of each tissue concord/agree/correlate. So how
good were the replicates, and how closely did they agree? Are the majority of
the same genes being expressed at similar levels between the replicates?
2) Overall, globally/graphically, how the heart data differed from the lung
data. Which genes are expressed in one tissue but not the other, which are
expressed in both, and how large are the differences in expression level?
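One way to extend a two-sample concordance measure to all samples at once is to compute it for every pair and look at the resulting matrix. The sketch below does that with the concordance function from the original post (renamed `ccc` here), alongside a plain Spearman correlation matrix from `cor()`. The ToyA, ToyB and ToyC rows are invented for illustration, and working on `log2(x + 1)` values is an assumption.

```r
counts <- matrix(
  c( 2,  4,  3,  7,  9, 20,   # Gene1 (from the post)
    50, 45, 33,  0,  1,  0,   # Gene2 (from the post)
    10, 12, 11,  9, 10, 12,   # ToyA  (invented)
     0,  1,  0, 30, 25, 28,   # ToyB  (invented)
     5,  6,  4,  5,  7,  6),  # ToyC  (invented)
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2", "ToyA", "ToyB", "ToyC"),
                  c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")))
log_counts <- log2(counts + 1)

# Concordance between two samples, as defined in the original post.
ccc <- function(y1, y2) {
  d <- mean(y1) - mean(y2)
  (2 * cov(y1, y2)) / (var(y1) + var(y2) + d^2)
}

# Apply it to every pair of samples to get a samples x samples matrix.
samples <- colnames(log_counts)
ccc_mat <- sapply(samples, function(a)
  sapply(samples, function(b) ccc(log_counts[, a], log_counts[, b])))
print(round(ccc_mat, 2))

# Rank-based correlation between all pairs of samples, for comparison.
cor_mat <- cor(log_counts, method = "spearman")
print(round(cor_mat, 2))

# Heatmap with clustering: replicates should group together by tissue.
heatmap(ccc_mat, symm = TRUE, scale = "none")
```

High values within the heart block and within the lung block, with lower values between the blocks, would answer question 1 (replicate agreement) and give a first global view for question 2.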
Summing the number of peptides sounds good (easy :-)).
> There are likely biases (that depend on the experimental design and assays
> used) in the data. Try to determine what those are using simple plots of
> the data.
Once I have normalised the peptide counts, the major difference is the
tissues. Please can you give me an example of a plot function to use? Mmmm,
maybe you allude to that below.
> What are the distributions of the data when cut various ways (per gene, per
> sample)?
Thanks, I will try that.
> Do the data need transformation (log, for example)?
You mean like microarray data, so that a plot has a normal distribution
curve around zero? How will I know that?
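One common way to judge whether a log transformation helps is to plot the per-gene standard deviation across replicates against the per-gene mean, before and after transforming. The sketch below uses SIMULATED counts, not real proteomics data, and assumes the replicate noise is multiplicative (proportional to the expression level); real peptide counts, especially low ones, may behave differently.

```r
set.seed(1)

# SIMULATED data: 200 genes, 3 replicates, noise proportional to the mean.
n_genes <- 200
mu  <- exp(runif(n_genes, log(20), log(2000)))   # true expression levels
sim <- sapply(1:3, function(i) round(mu * exp(rnorm(n_genes, sd = 0.3))))

# Per-gene mean and standard deviation on the raw scale.
raw_mean <- rowMeans(sim)
raw_sd   <- apply(sim, 1, sd)

# The same on the log2 scale.
log_sim  <- log2(sim + 1)
log_mean <- rowMeans(log_sim)
log_sd   <- apply(log_sim, 1, sd)

# Rank correlation between mean and spread, before and after the transform.
rho_raw <- cor(raw_mean, raw_sd, method = "spearman")
rho_log <- cor(log_mean, log_sd, method = "spearman")
cat("raw scale:", round(rho_raw, 2), " log scale:", round(rho_log, 2), "\n")

# Side-by-side plots of spread against mean.
par(mfrow = c(1, 2))
plot(raw_mean, raw_sd, pch = 19, cex = 0.5, main = "Raw counts",
     xlab = "mean", ylab = "sd across replicates")
plot(log_mean, log_sd, pch = 19, cex = 0.5, main = "log2(count + 1)",
     xlab = "mean", ylab = "sd across replicates")
```

If the spread grows with the mean on the raw scale but looks roughly flat after the transformation, the log scale is the more suitable one for correlations, PCA and clustering. If it over-corrects, a milder transform such as the square root may be worth trying.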
Thanks again, it is very helpful.
John.