Question

PCA or concordance


Johnny H ▴ 80

@johnny-h-3952


United Kingdom

Dear Bioconductors,

I have some proteomics data for several tissues:

Heart x 3 replicates
Lung x 3 replicates

Each data set has a gene symbol and the number of peptides for that gene (a rough measure of protein expression).

I want to make a data structure like:

           heart1  heart2  heart3  lung1  lung2  lung3
    Gene1       2       4       3      7      9     20
    Gene2      50      45      33      0      1      0
    Gene3  ...... etc
    Gene4

Each number in the data frame corresponds to the number of peptides for that gene.

My questions are:

Is a Principal Component Analysis useful for this data set?
What would a PCA tell me?
What function would I use to make a nice graphical representation of the data?

Or should I use a concordance function, something like this?

    con <- function(y1, y2) {
      d   <- (mean(y1) - mean(y2))
      v1  <- var(y1)
      v2  <- var(y2)
      cov <- cov(y1, y2)
      con <- (2 * cov) / (v1 + v2 + d^2)
      return(con)
    }

This will tell me whether two samples are concordant, but I don't know how to involve all samples. Basically, I want to summarise the data.

Any suggestions will be appreciated.

J.
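One possible way to "involve all samples" with a pairwise measure is to apply it to every pair of columns and collect the results in a samples-by-samples matrix. The sketch below does that with the `con` function from the post. Gene1 and Gene2 are the values shown above; Gene3 to Gene5 and the name `con_mat` are made up for illustration only.

```r
# Toy peptide-count matrix. Gene1 and Gene2 come from the post;
# Gene3 to Gene5 are made-up filler rows so there is something to compute on.
counts <- rbind(
  Gene1 = c(2, 4, 3, 7, 9, 20),
  Gene2 = c(50, 45, 33, 0, 1, 0),
  Gene3 = c(10, 12, 9, 11, 10, 13),
  Gene4 = c(0, 1, 0, 15, 18, 12),
  Gene5 = c(25, 30, 28, 5, 4, 6)
)
colnames(counts) <- c("heart1", "heart2", "heart3", "lung1", "lung2", "lung3")

# The concordance function from the post, written without reusing the
# name cov for a local variable.
con <- function(y1, y2) {
  d <- mean(y1) - mean(y2)
  2 * cov(y1, y2) / (var(y1) + var(y2) + d^2)
}

# Apply it to every pair of samples to fill a samples-by-samples matrix.
n <- ncol(counts)
con_mat <- matrix(NA_real_, n, n,
                  dimnames = list(colnames(counts), colnames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    con_mat[i, j] <- con(counts[, i], counts[, j])
  }
}

# Replicates of one tissue should score higher with each other
# than with samples of the other tissue.
print(round(con_mat, 2))
```

The diagonal is 1 by construction, so the interesting entries are the within-tissue and between-tissue blocks.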

Proteomics • 1.3k views

ADD COMMENT • link updated 14.7 years ago by Sean Davis 21k • written 14.7 years ago by Johnny H ▴ 80

score 0 · Answer 1 · 2010-03-03

Hi,

On Wed, Mar 3, 2010 at 11:41 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> Dear Bioconductors,
> I have some proteomics data for several tissues:
>
> Heart x 3 replicates
> Lung x 3 replicates
>
> Each data set has a gene symbol and the number of peptides for that gene (a
> rough measure of protein expression).
>
> I want to make a data structure like:
>
>        heart1  heart2  heart3  lung1  lung2  lung3
> Gene1       2       4       3      7      9     20
> Gene2      50      45      33      0      1      0
> Gene3  ...... etc
> Gene4
>
> Each number in the data frame corresponds to number of peptides for that
> gene.

I've never worked with proteomics data, but just a quick point since you're saying you want to "show something" based on the number of peptides found per protein -- I guess you'll have to somehow normalize for the (expected) length (# of peptides) of the protein itself?

> Is a Principal Component Analysis useful for this data set?

What are you trying to show?

> What would a PCA tell me?

There are lots and lots of tutorials and things about PCA on the intertubes. Here's a quote from the wikipedia article that, I think, gives a decent "intuition" on what it tries to do:

"""PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint."""

I guess the last sentence, in particular, is useful.

> What function would I use to make a nice graphical representation of the data?

What are you trying to show?

> Or should I use a concordance function, something like?
>
> con<-function(y1,y2){
>   d<-(mean(y1) - mean(y2))
>   v1<-var(y1)
>   v2<-var(y2)
>   cov<-cov(y1,y2)
>   con<-(2*cov)/(v1+v2+d^2)
>   return(con)};
>
> This will tell me if two samples have concordance but I don't know how to
> involve all samples. Basically, I want to summarise the data.

Summarize it like how?

Summing up the number of peptides found in each sample is one type of summary, but it might not tell you what you'd like to know (you haven't been clear on what that is).

It could be informative in other ways, though, that I guess aren't immediately obvious: e.g. it can tell you whether you have roughly the same amount of "input" in each of your replicates.

So ... what are you trying to show?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
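To make the two ideas in this reply concrete, here is a minimal sketch that prints the per-sample peptide totals and then draws the PCA "shadow" of the six samples using base R's `prcomp`. The counts are simulated, not real data, and the simulation settings (200 genes, Poisson noise, the `log2(x + 1)` transform) are assumptions chosen only so the example runs.

```r
set.seed(1)

# Simulated counts (NOT real data): 200 genes, 3 heart and 3 lung samples.
n_genes    <- 200
heart_mean <- rexp(n_genes, rate = 1 / 20)
lung_mean  <- heart_mean * exp(rnorm(n_genes, sd = 1))
counts <- cbind(
  sapply(1:3, function(i) rpois(n_genes, heart_mean)),
  sapply(1:3, function(i) rpois(n_genes, lung_mean))
)
colnames(counts) <- c(paste0("heart", 1:3), paste0("lung", 1:3))
rownames(counts) <- paste0("Gene", seq_len(n_genes))

# Total peptides per sample: a rough check that each replicate had
# about the same amount of input.
print(colSums(counts))

# PCA with samples as observations. prcomp expects rows = observations,
# hence the transpose. The +1 pseudocount is an arbitrary choice.
pca <- prcomp(t(log2(counts + 1)), center = TRUE, scale. = FALSE)

# Proportion of variance carried by the first two components.
print(summary(pca)$importance[2, 1:2])

# Plot the samples on PC1 and PC2, coloured by tissue.
tissue <- factor(sub("[0-9]+$", "", colnames(counts)))
plot(pca$x[, 1], pca$x[, 2], col = as.integer(tissue), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Samples in PC space")
text(pca$x[, 1], pca$x[, 2], labels = colnames(counts), pos = 3)
legend("topright", legend = levels(tissue),
       col = seq_along(levels(tissue)), pch = 19)
```

If replicates of a tissue sit close together and the two tissues sit apart along PC1, the tissue difference is the dominant source of variance. With only six samples this is a visual sanity check rather than a test.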

score 0 · Answer 2 · 2010-03-03


Sean Davis 21k

@sean-davis-490


United States

On Wed, Mar 3, 2010 at 11:41 AM, Johnny H <ukfriend22 at googlemail.com> wrote:
> [...]
> This will tell me if two samples have concordance but I don't know how to
> involve all samples. Basically, I want to summarise the data.

Start simple. There are likely biases (that depend on the experimental design and assays used) in the data. Try to determine what those are using simple plots of the data. What are the distributions of the data when cut various ways (per gene, per sample)? What do scatter plots of one sample versus another look like? Do the data need transformation (log, for example)? Do the data need normalization (likely)?

In short, some data exploration might be necessary before you can move on to ask more biologically relevant questions. You may already have the information that you need to determine the best way forward, but that isn't clear from your post.

Hope that helps,
Sean
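The exploration steps listed above could look something like the following in base R. This is only a sketch on simulated counts (not real data). The `depth` vector stands in for hypothetical loading differences between runs, and total-count scaling is just one simple normalization, not necessarily the right one for a given experiment.

```r
set.seed(2)

# Simulated counts (NOT real data): 300 genes, 6 samples that share the same
# per-gene means but differ in a hypothetical loading factor `depth`.
n_genes   <- 300
gene_mean <- rexp(n_genes, rate = 1 / 15)
depth     <- c(1.0, 1.2, 0.8, 1.0, 1.1, 0.9)
counts <- sapply(depth, function(d) rpois(n_genes, gene_mean * d))
colnames(counts) <- c(paste0("heart", 1:3), paste0("lung", 1:3))
rownames(counts) <- paste0("Gene", seq_len(n_genes))

# Per-sample distributions, on a log scale so a few large counts
# do not dominate the picture.
log_counts <- log2(counts + 1)
boxplot(log_counts, las = 2, ylab = "log2(peptides + 1)",
        main = "Per-sample distributions")

# One sample against another, raw and log-transformed, side by side.
op <- par(mfrow = c(1, 2))
plot(counts[, "heart1"], counts[, "heart2"],
     xlab = "heart1", ylab = "heart2", main = "raw counts")
plot(log_counts[, "heart1"], log_counts[, "heart2"],
     xlab = "heart1", ylab = "heart2", main = "log2(count + 1)")
par(op)

# All pairwise scatter plots at once.
pairs(log_counts, pch = ".")

# A simple total-count normalization: scale every sample to the mean total.
scale_factor <- colSums(counts) / mean(colSums(counts))
norm_counts  <- sweep(counts, 2, scale_factor, "/")
print(round(colSums(norm_counts)))
```

If the raw scatter plot is a cloud squeezed into one corner and the log plot spreads the points along the diagonal, that is a hint that a log-type transformation helps.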

ADD COMMENT • link 14.7 years ago Sean Davis 21k


Dear Steve and Sean,

First, thanks for replying to my query.

Normalisation: Yes, I will normalise to the number of peptides per gene model (due to isoforms). If I wanted to be ultra careful, I would see if any peptides were specific to any one isoform, but that is complicated and probably inaccurate. I will also normalise on the amount of protein loaded onto the mass spec, as there are slight differences.

Thanks for the PCA information.

What am I trying to show? Well, I would like to show graphically:

1) How well the replicates of each tissue concord/agree/correlate. So how good were the replicates, how closely did they agree? Are the majority of the same genes being expressed at similar levels between the replicates?

2) Overall, globally/graphically, how the heart data differed from the lung data. Which genes are not shared between the tissues, which are, and how large are the differences in expression level?

Summing the number of peptides sounds good (easy :-)).

> There are likely biases (that depend on the experimental design and
> assays used) in the data. Try to determine what those are using
> simple plots of the data.

Once I have normalised the peptide counts, the major difference is the tissues. Please can you give me an example of a plot function to use? Mmmm, maybe you allude to that below.

> What are the distributions of the data when cut various ways (per
> gene, per sample)?

Thanks, I will try that.

> Do the data need transformation (log, for example)?

You mean like microarray data, for a plot to have a normal distribution curve around zero? How will I know that?

Thanks again, it is very helpful.

John.
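For the two goals stated in this reply, one rough starting point is a sample-to-sample correlation matrix for replicate agreement, and a per-gene log ratio of tissue means for the heart versus lung comparison. The sketch below uses simulated counts (not real data). The choice of Spearman correlation, average-linkage clustering and the `log2(x + 1)` transform are assumptions, not recommendations from the thread.

```r
set.seed(3)

# Simulated counts (NOT real data): 200 genes, 3 heart and 3 lung samples.
n_genes    <- 200
heart_mean <- rexp(n_genes, rate = 1 / 20)
lung_mean  <- heart_mean * exp(rnorm(n_genes, sd = 1))
counts <- cbind(
  sapply(1:3, function(i) rpois(n_genes, heart_mean)),
  sapply(1:3, function(i) rpois(n_genes, lung_mean))
)
colnames(counts) <- c(paste0("heart", 1:3), paste0("lung", 1:3))
rownames(counts) <- paste0("Gene", seq_len(n_genes))
log_counts <- log2(counts + 1)
is_heart   <- grepl("^heart", colnames(counts))

# 1) Replicate agreement: correlation between every pair of samples,
#    shown as a heatmap and as a clustering tree.
cor_mat <- cor(log_counts, method = "spearman")
print(round(cor_mat, 2))
heatmap(cor_mat, symm = TRUE, margins = c(6, 6))
hc <- hclust(as.dist(1 - cor_mat), method = "average")
plot(hc, main = "Sample clustering (1 - Spearman correlation)")

# 2) Heart versus lung: per-gene average level against log ratio.
heart_avg <- rowMeans(log_counts[, is_heart])
lung_avg  <- rowMeans(log_counts[, !is_heart])
log_ratio <- lung_avg - heart_avg
plot((heart_avg + lung_avg) / 2, log_ratio,
     xlab = "mean log2(count + 1)", ylab = "log2 ratio (lung / heart)",
     main = "Heart vs lung")
abline(h = 0, lty = 2)

# Genes with peptides in one tissue only.
heart_only <- rownames(counts)[rowSums(counts[, is_heart]) > 0 &
                               rowSums(counts[, !is_heart]) == 0]
lung_only  <- rownames(counts)[rowSums(counts[, !is_heart]) > 0 &
                               rowSums(counts[, is_heart]) == 0]
print(length(heart_only))
print(length(lung_only))
```

Good replicates should give high within-tissue correlations and cluster together in the tree. Genes far from the dashed line in the second plot are candidates for tissue differences, although with three replicates per tissue this is exploratory only.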

ADD REPLY • link 14.7 years ago Johnny H ▴ 80